This blog is a guide to running the MLPerf inference v1.1 benchmark. The information needed to run the benchmark is scattered across several online locations; this blog brings all the steps together in one place.

MLPerf is a benchmarking suite that measures the performance of Machine Learning (ML) workloads. It focuses on the most important aspects of the ML life cycle: training and inference. For more information, see Introduction to MLPerf™ Inference v1.1 Performance with Dell EMC Servers.

This blog focuses on inference setup and describes the steps to run closed data center MLPerf inference v1.1 tests on Dell Technologies servers with NVIDIA GPUs. It enables you to run the tests and reproduce the results that we observed in our HPC & AI Innovation Lab. For details about the hardware and the software stack for different systems in the benchmark, see this list of systems.

The MLPerf inference v1.1 suite contains the following benchmarks:

  • Resnet50
  • SSD-Resnet34
  • BERT
  • DLRM
  • RNN-T
  • 3D U-Net

Note : The BERT, DLRM, and 3D U-Net models have 99 percent (default accuracy) and 99.9 percent (high accuracy) targets.

This blog describes the steps to run all these benchmarks.

1 Getting started

A system under test consists of a defined set of hardware and software resources that will be measured for performance. The hardware resources may include processors, accelerators, memories, disks, and interconnect. The software resources may include an operating system, compilers, libraries, and drivers that significantly influence the running time of a benchmark. In this case, the system on which you clone the MLPerf repository and run the benchmark is known as the system under test (SUT).

For storage, SSD RAID or local NVMe drives are acceptable for running all the subtests without any penalty. Inference does not have strict requirements for fast-parallel storage. However, the BeeGFS or Lustre file system, the PixStor storage solution, and so on help make multiple copies of large datasets.

2 Prerequisites

Prerequisites for running the MLPerf inference v1.1 tests include:

  • An x86_64 Dell EMC system
  • Docker installed with the NVIDIA runtime hook
  • Ampere-based NVIDIA GPUs (Turing GPUs have legacy support but are no longer maintained for optimizations)
  • NVIDIA driver version 470.xx or later
  • ECC turned on (required as of inference v1.0)
    To set ECC to on, run the following command:
    sudo nvidia-smi --ecc-config=1
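    To confirm the ECC state before you run the tests, the following nvidia-smi query is a quick check (a sketch; query field names can vary slightly across driver versions):
    nvidia-smi --query-gpu=index,name,ecc.mode.current,ecc.mode.pending --format=csv
    A pending value that differs from the current value takes effect only after the next GPU reset or reboot.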

3 Preparing to run the MLPerf inference v1.1 benchmarks

Before you can run the MLPerf inference v1.1 tests, perform the following tasks to prepare your environment.

3.1 Clone the MLPerf repository

  1. Clone the repository to your home directory or another acceptable path:
    cd ~
    git clone https://github.com/mlperf/inference_results_v1.1
  2. Go to the closed/DellEMC directory:
    cd inference_results_v1.1/closed/DellEMC
  3. Create a “scratch” directory with at least 3 TB of space in which to store the models, datasets, preprocessed data, and so on:
    mkdir scratch
  4. Export the absolute path to the scratch directory as $MLPERF_SCRATCH_PATH:
    export MLPERF_SCRATCH_PATH=/home/user/inference_results_v1.1/closed/DellEMC/scratch
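Optionally, confirm that the scratch directory has enough free space and make the variable persist across shell sessions. The following commands are a minimal sketch; adjust the path to your environment:

df -h "$MLPERF_SCRATCH_PATH"                                          # verify that at least 3 TB is available
echo "export MLPERF_SCRATCH_PATH=$MLPERF_SCRATCH_PATH" >> ~/.bashrc   # optional: persist the variable for new shells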

3.2 Set up the configuration file

The closed/DellEMC/configs directory includes an __init__.py file that lists configurations for different Dell EMC servers that were systems in the MLPerf Inference v1.1 benchmark. If necessary, modify the configs/<benchmark>/<Scenario>/__init__.py file to include the system that will run the benchmark.

Note : If your system is already present in the configuration file, there is no need to add another configuration.

In the configs/<benchmark>/<Scenario>/__init__.py file, select a similar configuration and modify it based on the current system, matching the number and type of GPUs in your system.

For this blog, we used a Dell EMC PowerEdge R7525 server with a single A100 GPU as the example. We chose R7525_A100_PCIE_40GBx1 as the name for this new system. Because the R7525_A100_PCIE_40GBx1 system is not already in the list of systems, we added the R7525_A100_PCIE_40GBx1 configuration.

Because the R7525_A100_PCIE_40GBx3 reference system is the most similar, we modified that configuration and picked Resnet50 Server as the example benchmark.

The following example shows the reference configuration for three GPUs for the Resnet50 Server benchmark in the configs/resnet50/Server/__init__.py file:

@ConfigRegistry.register(HarnessType.LWIS, AccuracyTarget.k_99, PowerSetting.MaxP)
class R7525_A100_PCIE_40GBx3(BenchmarkConfiguration):
     system = System("R7525_A100-PCIE-40GB", Architecture.Ampere, 3)
     active_sms = 100
     input_dtype = "int8"
     input_format = "linear"
     map_path = "data_maps/<dataset_name>/val_map.txt"
     precision = "int8"
     tensor_path = "${PREPROCESSED_DATA_DIR}/<dataset_name>/ResNet50/int8_linear"
     use_deque_limit = True
     deque_timeout_usec = 5742
     gpu_batch_size = 205
     gpu_copy_streams = 11
     gpu_inference_streams = 9
     server_target_qps = 91250
     use_cuda_thread_per_device = True
     use_graphs = True
     scenario = Scenario.Server
     benchmark = Benchmark.ResNet50
     start_from_device=True

This example shows the modified configuration for one GPU:

@ConfigRegistry.register(HarnessType.LWIS, AccuracyTarget.k_99, PowerSetting.MaxP)
class R7525_A100_PCIE_40GBx1(BenchmarkConfiguration):
     system = System("R7525_A100-PCIE-40GB", Architecture.Ampere, 1)
     active_sms = 100
     input_dtype = "int8"
     input_format = "linear"
     map_path = "data_maps/<dataset_name>/val_map.txt"
     precision = "int8"
     tensor_path = "${PREPROCESSED_DATA_DIR}/<dataset_name>/ResNet50/int8_linear"
     use_deque_limit = True
     deque_timeout_usec = 5742
     gpu_batch_size = 205
     gpu_copy_streams = 11
     gpu_inference_streams = 9
     server_target_qps = 30400
     use_cuda_thread_per_device = True
     use_graphs = True
     scenario = Scenario.Server
     benchmark = Benchmark.ResNet50
     start_from_device=True

We modified the queries per second (QPS) parameter (server_target_qps) to match the number of GPUs. The server_target_qps parameter scales linearly, so QPS = number of GPUs x QPS per GPU.

The only parameter we modified is server_target_qps, which we set to 30400 in line with the expected performance of a single GPU.
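As a quick sanity check of that linear scaling, the following shell arithmetic is a sketch that derives a per-GPU starting value from the three-GPU reference configuration (the variable names are only illustrative):

REF_QPS=91250                          # server_target_qps from the R7525_A100_PCIE_40GBx3 reference configuration
REF_GPUS=3
NEW_GPUS=1
PER_GPU_QPS=$(( REF_QPS / REF_GPUS ))  # integer division gives roughly 30416 QPS per GPU
echo $(( PER_GPU_QPS * NEW_GPUS ))     # starting point for the x1 system; we rounded ours to 30400

Treat the result only as a starting point; tune server_target_qps up or down until the run stays VALID, as described in the Limitations and Best Practices section.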

3.3 Add the new system

After you add the new system to the __init__.py file as shown in the preceding example, add the new system to the list of available systems. The list of available systems is in the code/common/system_list.py file. This entry informs the benchmark that a new system exists and ensures that the benchmark selects the correct configuration.

Note : If your system is already added, there is no need to add it to the code/common/system_list.py file.

Add the new system to the list of available systems in the code/common/system_list.py file.

At the end of the file, there is a class called KnownSystems . This class defines a list of SystemClass objects that describe supported systems as shown in the following example:

SystemClass(<system ID>, [<list of names reported by nvidia-smi>], [<known PCI IDs of this system>], <architecture>, [<list of known supported GPU counts>])

Where:

  • For <system ID> , enter the system ID with which you want to identify this system.
  • For <list of names reported by nvidia-smi> , run the nvidia-smi -L command and use the name that is returned.
  • For <known PCI IDs of this system> , run the following command:
$ CUDA_VISIBLE_ORDER=PCI_BUS_ID nvidia-smi --query-gpu=gpu_name,pci.device_id --format=csv
name, pci.device_id
A100-PCIE-40GB, 0x20F110DE
---

The pci.device_id field is in the 0x<PCI ID>10DE format, where 10DE is the NVIDIA PCI vendor ID. Use the four hexadecimal digits between 0x and 10DE as the PCI ID for the system; in this case, it is 20F1 (see the example command after this list).

  • For <architecture>, use the architecture Enum, which is at the top of the file. In this case, the A100 GPU uses the Ampere architecture.
  • For the <list of known GPU counts>, enter the number of GPUs of the systems you want to support (that is, [1,2,4] if you want to support 1x, 2x, and 4x GPU variants of this system). Because we already have a 3x variant in the system_list.py file, we simply need to include the number 1 as an additional entry.
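The following one-liner is a sketch of how to pull just that four-digit PCI ID from the nvidia-smi output; it assumes the vendor ID suffix is 10DE, as described above:

nvidia-smi --query-gpu=pci.device_id --format=csv,noheader | sed -e 's/^0x//' -e 's/10DE$//' | sort -u
# For an A100-PCIE-40GB, this prints 20F1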

Note : Because a configuration is already present for the PowerEdge R7525 server, we added the number 1 for our configuration, as shown in the following example. If your system does not exist in the system_list.py file, add the entire configuration and not just the number.

class KnownSystems:
    """Global list of supported systems"""
    # Before the addition of 1, this entry only supported the R7525_A100-PCIE-40GB x3 variant:
    # R7525_A100_PCIE_40GB = SystemClass("R7525_A100-PCIE-40GB", ["A100-PCIE-40GB"], ["20F1"], Architecture.Ampere, [3])
    # After the addition, it supports both the x1 and x3 variants:
    R7525_A100_PCIE_40GB = SystemClass("R7525_A100-PCIE-40GB", ["A100-PCIE-40GB"], ["20F1"], Architecture.Ampere, [1, 3])
    DSS8440_A100_PCIE_80GB = SystemClass("DSS8440_A100-PCIE-80GB", ["A100-PCIE-80GB"], ["20B5"], Architecture.Ampere, [10])
    DSS8440_A30 = SystemClass("DSS8440_A30", ["A30"], ["20B7"], Architecture.Ampere, [8], valid_mig_slices=[MIGSlice(1, 6), MIGSlice(2, 12), MIGSlice(4, 24)])
    R750xa_A100_PCIE_40GB = SystemClass("R750xa_A100-PCIE-40GB", ["A100-PCIE-40GB"], ["20F1"], Architecture.Ampere, [4])
    R750xa_A100_PCIE_80GB = SystemClass("R750xa_A100-PCIE-80GB", ["A100-PCIE-80GB"], ["20B5"], Architecture.Ampere, [4], valid_mig_slices=[MIGSlice(1, 10), MIGSlice(2, 20), MIGSlice(3, 40)])
    # ... (remaining systems omitted)

Note : You must provide different configurations in the configs/resnet50/Server/__init__.py file for the x1 variant and the x3 variant. In the preceding example, the R7525_A100_PCIE_40GBx3 configuration is different from the R7525_A100_PCIE_40GBx1 configuration.

3.4 Build the Docker image and required libraries

Build the Docker image and then launch an interactive container. Then, in the interactive container, build the required libraries for inferencing.

  1. To build the Docker image, run the make prebuild command inside the closed/DellEMC folder:
    Command :
    make prebuild

    The following example shows sample output:

    Launching Docker session
    nvidia-docker run --rm -it -w /work \
    -v /home/user/article_inference_v1.1/closed/DellEMC:/work -v /home/user:/mnt//home/user \
    --cap-add SYS_ADMIN \
       -e NVIDIA_VISIBLE_DEVICES=0 \
       --shm-size=32gb \
       -v /etc/timezone:/etc/timezone:ro -v /etc/localtime:/etc/localtime:ro \
       --security-opt apparmor=unconfined --security-opt seccomp=unconfined \
       --name mlperf-inference-user -h mlperf-inference-user --add-host mlperf-inference-user:127.0.0.1 \
       --user 1002:1002 --net host --device /dev/fuse \
       -v /home/user/inference_results_v1.1/closed/DellEMC/scratch:/home/user/inference_results_v1.1/closed/DellEMC/scratch \
       -e MLPERF_SCRATCH_PATH=/home/user/inference_results_v1.1/closed/DellEMC/scratch \
       -e HOST_HOSTNAME=node009 \
    mlperf-inference:user

    The Docker container is launched with all the necessary packages installed.

  2. Access the interactive terminal on the container.
  3. To build the required libraries for inferencing, run the make build command inside the interactive container:
    Command
    make build

The following example shows sample output:

(mlperf) user@mlperf-inference-user:/work$ make build
[ 26%] Linking CXX executable /work/build/bin/harness_default
make[4]: Leaving directory '/work/build/harness'
make[4]: Leaving directory '/work/build/harness'
make[4]: Leaving directory '/work/build/harness'
[ 36%] Built target harness_bert
[ 50%] Built target harness_default
[ 55%] Built target harness_dlrm
make[4]: Leaving directory '/work/build/harness'
[ 63%] Built target harness_rnnt
make[4]: Leaving directory '/work/build/harness'
[ 81%] Built target harness_triton
make[4]: Leaving directory '/work/build/harness'
[100%] Built target harness_triton_mig
make[3]: Leaving directory '/work/build/harness'
make[2]: Leaving directory '/work/build/harness'
Finished building harness.
make[1]: Leaving directory '/work' 
(mlperf) user@mlperf-inference-user:/work

The container in which you can run the benchmarks is built.

3.5 Download and preprocess validation data and models

To run the MLPerf inference v1.1 benchmarks, download the datasets and models, and then preprocess them. MLPerf provides scripts that download the trained models. The scripts also download the datasets for benchmarks other than Resnet50, DLRM, and 3D U-Net.

For Resnet50, DLRM, and 3D U-Net, register for an account and then download those datasets manually.

Run the following commands to download all the models, download the remaining datasets, and then preprocess them:

$ make download_model # Downloads models and saves them to $MLPERF_SCRATCH_PATH/models
$ make download_data # Downloads datasets and saves them to $MLPERF_SCRATCH_PATH/data
$ make preprocess_data # Preprocesses data and saves it to $MLPERF_SCRATCH_PATH/preprocessed_data

Note : These commands download all the datasets, which might not be required if the objective is to run one specific benchmark. To run a specific benchmark rather than all the benchmarks, see the following sections for information about the specific benchmark.

(mlperf) user@mlperf-inference-user:/work$ tree -d -L 1
├── build
├── code
├── compliance
├── configs
├── data_maps
├── docker
├── measurements
├── power
├── results
├── scripts
└── systems
# different folders are as follows
├── build—Logs, preprocessed data, engines, models, plugins, and so on 
├── code—Source code for all the benchmarks
├── compliance—Passed compliance checks 
├── configs—Configurations that run different benchmarks for different system setups
├── data_maps—Data maps for different benchmarks
├── docker—Docker files to support building the container
├── measurements—Measurement values for different benchmarks
├── power—Files specific to power submissions (needed only for power submissions)
├── results—Final result logs 
├── scratch—Storage for models, preprocessed data, and the dataset that is symlinked to the preceding build directory
├── scripts—Support scripts 
└── systems—Hardware and software details of systems in the benchmark

4 Running the benchmarks

After you have performed the preceding tasks to prepare your environment, run any of the benchmarks that are required for your tests.

The Resnet50, SSD-Resnet34, and RNN-T benchmarks have 99 percent (default accuracy) targets.

The BERT, DLRM, and 3D U-Net benchmarks have 99 percent (default accuracy) and 99.9 percent (high accuracy) targets. For information about running these benchmarks, see the Running high accuracy target benchmarks section below.

If you downloaded and preprocessed all the datasets (as shown in the previous section), there is no need to do so again. Skip the download and preprocessing steps in the procedures for the following benchmarks.

NVIDIA TensorRT is the inference engine for the backend. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning applications.

4.1 Run the Resnet50 benchmark

To set up the Resnet50 dataset and model to run the inference:

  1. If you already downloaded and preprocessed the datasets, go to step 5.
  2. Download the required validation dataset ( https://github.com/mlcommons/training/tree/master/image_classification ).
  3. Extract the images to $MLPERF_SCRATCH_PATH/data/<dataset_name>/
  4. Run the following commands:
    make download_model BENCHMARKS=resnet50
    make preprocess_data BENCHMARKS=resnet50
  5. Generate the TensorRT engines:
    # generates the TRT engines with the specified config. In this case, it generates engines for both the Offline and Server scenarios
     make generate_engines RUN_ARGS="--benchmarks=resnet50 --scenarios=Offline,Server --config_ver=default"
  6. Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly" 
make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
# run the accuracy benchmark 
make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly" 
make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"

The following example shows the output for a PerformanceOnly mode and displays a “VALID” result:

======================= Perf harness results: =======================
R7525_A100-PCIe-40GBx1_TRT-default-Server:
      resnet50: Scheduled samples per second : 30400.32 and Result is : VALID
======================= Accuracy results: =======================
R7525_A100-PCIe-40GBx1_TRT-default-Server:
     resnet50: No accuracy results in PerformanceOnly mode.

4.2 Run the SSD-Resnet34 benchmark

To set up the SSD-Resnet34 dataset and model to run the inference:

  1. If necessary, download and preprocess the dataset:
    make download_model BENCHMARKS=ssd-resnet34
    make download_data BENCHMARKS=ssd-resnet34 
    make preprocess_data BENCHMARKS=ssd-resnet34
  2. Generate the TensorRT engines:
    # generates the TRT engines with the specified config. In this case, it generates engines for both the Offline and Server scenarios
    make generate_engines RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Offline,Server --config_ver=default"
  3. Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly" 
make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"

4.3 Run the RNN-T benchmark

To set up the RNN-T dataset and model to run the inference:

  1. If necessary, download and preprocess the dataset:
    make download_model BENCHMARKS=rnnt
    make download_data BENCHMARKS=rnnt 
    make preprocess_data BENCHMARKS=rnnt
  2. Generate the TensorRT engines :
    # generates the TRT engines with the specified config. In this case, it generates engines for both the Offline and Server scenarios
    make generate_engines RUN_ARGS="--benchmarks=rnnt --scenarios=Offline,Server --config_ver=default" 
  3. Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Server --config_ver=default --test_mode=PerformanceOnly" 
# run the accuracy benchmark 
make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly" 
make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"

5 Running high accuracy target benchmarks

The BERT, DLRM, and 3D U-Net benchmarks have high accuracy targets.

5.1 Run the BERT benchmark

To set up the BERT dataset and model to run the inference:

  1. If necessary, download and preprocess the dataset:
    make download_model BENCHMARKS=bert 
    make download_data BENCHMARKS=bert 
    make preprocess_data BENCHMARKS=bert
  2. Generate the TensorRT engines:
    # generates the TRT engines with the specified config. In this case, it generates engines for both the Offline and Server scenarios and for the default and high accuracy targets.
    make generate_engines RUN_ARGS="--benchmarks=bert --scenarios=Offline,Server --config_ver=default,high_accuracy"
  3. Run the benchmark :
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=high_accuracy --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=high_accuracy --test_mode=PerformanceOnly" 
# run the accuracy benchmark  
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly" 
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=default --test_mode=AccuracyOnly" 
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=high_accuracy --test_mode=AccuracyOnly" 
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=high_accuracy --test_mode=AccuracyOnly"

5.2 Run the DLRM benchmark

To set up the DLRM dataset and model to run the inference:

  1. If you already downloaded and preprocessed the datasets, go to step 5.
  2. Download the Criteo Terabyte dataset.
  3. Extract the data to the $MLPERF_SCRATCH_PATH/data/criteo/ directory.
  4. Run the following commands:
    make download_model BENCHMARKS=dlrm
    make preprocess_data BENCHMARKS=dlrm
  5. Generate the TensorRT engines:
    # generates the TRT engines with the specified config. In this case, it generates engines for both the Offline and Server scenarios and for the default and high accuracy targets.
    make generate_engines RUN_ARGS="--benchmarks=dlrm --scenarios=Offline,Server --config_ver=default,high_accuracy"
  6. Run the benchmark :
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=high_accuracy --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=high_accuracy --test_mode=PerformanceOnly"
# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=high_accuracy --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=high_accuracy --test_mode=AccuracyOnly"

5.3 Run the 3D U-Net benchmark

Note : This benchmark only has the Offline scenario.

To set up the 3D U-Net dataset and model to run the inference:

  1. If you already downloaded and preprocessed the datasets, go to step 5.
  2. Download the BraTS challenge data.
  3. Extract the images to the $MLPERF_SCRATCH_PATH/data/BraTS/MICCAI_BraTS_2019_Data_Training directory.
  4. Run the following commands:
    make download_model BENCHMARKS=3d-unet
    make preprocess_data BENCHMARKS=3d-unet
  5. Generate the TensorRT engines:
    # generates the TRT engines with the specified config. In this case, it generates engines for the Offline scenario and for the default and high accuracy targets.
    make generate_engines RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=default,high_accuracy"
  6. Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=high_accuracy --test_mode=PerformanceOnly"
# run the accuracy benchmark 
make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly" 
make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=high_accuracy --test_mode=AccuracyOnly"

6 Limitations and Best Practices for Running MLPerf

Note the following limitations and best practices:

  • To build the engine and run the benchmark by using a single command, use the make run RUN_ARGS… shortcut. The shortcut is a valid alternative to running the make generate_engines … && make run_harness … commands separately (see the example after this list).
  • Include the --fast flag in RUN_ARGS to test runs quickly; it reduces the run time to one minute. For example:
 make run_harness RUN_ARGS="--fast --benchmarks=<bmname> --scenarios=<scenario> --config_ver=<cver> --test_mode=PerformanceOnly"

The benchmark runs for one minute instead of the default 10 minutes.

  • If the server results are “INVALID”, reduce the server_target_qps for a Server scenario run. If the latency constraints are not met during the run, “INVALID” results are expected.
  • If the results are "INVALID" for an Offline scenario run, increase the offline_expected_qps. "INVALID" runs for the Offline scenario occur when the system can deliver a significantly higher QPS than what is provided through the offline_expected_qps configuration.
  • If the batch size changes, rebuild the engine.
  • Only the BERT, DLRM, and 3D U-Net benchmarks support high accuracy targets.
  • 3D U-Net supports only the Offline scenario.
  • To run with the Triton Inference Server, pass triton (for the default target) or high_accuracy_triton (for the high_accuracy target) in the config_ver argument.
  • When running a command with RUN_ARGS , be aware of the quotation marks. Errors can occur if you omit the quotation marks.
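As an example of the single-command shortcut mentioned above, the following sketch builds the ResNet50 engine and runs the Offline performance test in one step; substitute the benchmark, scenario, and config_ver values for your own test:

# build the engine and run the harness in one step
make run RUN_ARGS="--benchmarks=resnet50 --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"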

Conclusion

This blog provided the step-by-step procedures to run and reproduce closed data center MLPerf inference v1.1 results on Dell EMC servers with NVIDIA GPUs.


Abstract

Dell Technologies recently submitted results to the MLPerf Inference v3.1 benchmark suite. This blog examines the results on the Dell PowerEdge XR4520c, PowerEdge XR7620, and PowerEdge XR5610 servers with the NVIDIA L4 GPU.

MLPerf Inference background

The MLPerf Inference benchmarking suite is a comprehensive framework designed to fairly evaluate the performance of a wide range of machine learning inference tasks on various hardware and software configurations. The MLCommons™ community aims to provide a standardized set of deep learning workloads to work with, as well as fair measuring and auditing methodologies. The MLPerf Inference submission results serve as valuable information for researchers, customers, and partners to make informed decisions about inference capabilities on various edge and data center systems.

The MLPerf Inference edge suite includes three scenarios:

  • Single-stream —This scenario's performance metric is the 90th percentile latency. A common use case is the Siri voice assistant on iOS products, in which Siri's engine waits until the query has been asked and then returns results.
  • Multi-stream —This scenario uses a stricter performance metric, the 99th percentile latency. An example use case is self-driving cars. Self-driving cars use multiple camera and lidar inputs to make real-time driving decisions that have a direct impact on what happens on the road.
  • Offline —This scenario is measured by throughput. An example of Offline processing on the edge is a phone sharing an album suggestion that is based on a recent set of photos and videos from a particular event.

Edge computing

In traditional cloud computing, data from phones, tablets, sensors, and machines is sent to physically distant data centers to be processed. The location where the data is gathered and the location where it is processed are separate. Edge computing shifts this methodology by processing data on the device itself or on local compute resources that are available nearby. These nearby compute resources are known as the "devices on the edge." Edge computing is prevalent in several industries such as self-driving cars, retail analytics, truck fleet management, smart grid energy distribution, healthcare, and manufacturing.

Edge computing complements traditional cloud computing by reducing processing time (lowering latency), improving efficiency, enhancing security, and enabling higher reliability. By processing data on the edge, the load on central data centers is eased, as is the time to receive a response for any type of inference query. With the offloading of computation from data centers, network congestion becomes less of a concern for cloud users. Also, because sensitive data is processed at the edge and is not exposed to threats across a wider network, the risk of sensitive data being compromised is lower. Furthermore, if connectivity to the cloud is disrupted or intermittent, edge computing can enable systems to continue functioning. With several devices on the edge acting as computational mini data centers, the problem of a single point of failure is mitigated and additional scalability becomes easily achievable.

Dell PowerEdge system and GPU overview

Dell PowerEdge XR4520c server

For projects that need a robust and adaptable server to handle demanding AI workloads on the edge, the PowerEdge XR4520c server is an excellent option. Dell Technologies designed the PowerEdge XR4520c server with the reliability to withstand challenging edge environments. With Intel Xeon Scalable processors, the PowerEdge XR4520c server delivers the power and compute required for real-time analytics on the edge. The edge-optimized design decisions include a rugged exterior and an extended temperature range for operation in remote locations and industrial environments. Also, the compact form factor and space-efficient design enable deployment on the edge. Like all Dell PowerEdge products, this server comes with world-class Dell support and the Integrated Dell Remote Access Controller (iDRAC) for remote management. For additional information about the technical specifications of the PowerEdge XR4520c server, see the specification sheet.

Figure 1: Front view of the Dell PowerEdge XR4520c server

Figure 2: Top view of the Dell PowerEdge XR4520c server

Dell PowerEdge XR7620 server

The PowerEdge XR7620 server is top-of-the-line for deep learning at the edge. Powered by the latest Intel Xeon Scalable processors, the PowerEdge XR7620 server delivers remarkable reductions in training time and increases in the number of inferences. Dell Technologies has designed this as a half-width server for rugged environments, with a dust and particle filter and an extended temperature range from -5°C to 55°C (23°F to 131°F). Furthermore, Dell's comprehensive security and data protection features include data encryption and zero-trust logic for the protection of sensitive data. For additional information about the technical specifications of the PowerEdge XR7620 server, see the specification sheet.

Figure 3: Front view of the Dell PowerEdge XR7620 server

Figure 4: Rear view of the Dell PowerEdge XR7620 server

Dell PowerEdge XR5610 server

The Dell PowerEdge XR5610 server is an excellent option for AI workloads on the edge. This all-purpose, rugged single-socket server is a versatile edge server that has been built for telecom, defense, retail, and other demanding edge environments. As shown in the following figures, the short chassis can fit in space-constrained environments and is also a formidable option when considering power efficiency. This server is driven by Intel Xeon Scalable processors and is boosted with NVIDIA GPUs as well as high-speed NVIDIA NVLink interconnects. For additional information about the technical specifications of the PowerEdge XR5610 server, see the specification sheet.

Figure 5: Front view of the Dell PowerEdge XR5610 server

Figure 6: Top view of the Dell PowerEdge XR5610 server

NVIDIA L4 GPU

The NVIDIA L4 GPU is an excellent strategic option for the edge because it consumes less energy and space but delivers exceptional performance. The NVIDIA L4 GPU is based on the Ada Lovelace architecture and delivers extraordinary performance for video, AI, graphics, and virtualization. The NVIDIA L4 GPU comes with NVIDIA's cutting-edge AI software stack, including CUDA, cuDNN, and support for several deep learning frameworks such as TensorFlow and PyTorch.

Systems Under Test

The following table lists the Systems Under Test (SUT) that are described in this blog.

Table 1: MLPerf Inference v3.1 system configuration of the Dell PowerEdge XR7620 and PowerEdge XR4520c servers

  • Platform: Dell PowerEdge XR7620 (1x L4, TensorRT) / Dell PowerEdge XR4520c (1x L4, TensorRT)
  • MLPerf system ID: XR7620_L4x1_TRT / XR4520c_L4x1_TRT
  • Operating system: CentOS 8 / Ubuntu 22.04
  • CPU: Dual Intel Xeon Gold 6448Y @ 2.10 GHz / Single Intel Xeon D-2776NT @ 2.10 GHz
  • Memory: 256 GB / 128 GB
  • GPU: NVIDIA L4 (both servers)
  • GPU count: 1 (both servers)
  • Software stack (XR7620): TensorRT 9.0.0, CUDA 12.2, cuDNN 8.8.0, Driver 535.54.03, DALI 1.28.0
  • Software stack (XR4520c): TensorRT 9.0.0, CUDA 12.2, cuDNN 8.9.2, Driver 525.105.17, DALI 1.28.0

Performance from Inference v3.1

The following figure compares the Dell PowerEdge XR4520c and PowerEdge XR7620 servers across the ResNet50, RetinaNet, RNNT, and BERT 99 Single-stream, Multi-stream, and Offline benchmarks. Across all the benchmarks in this comparison, both servers equipped with the NVIDIA L4 GPU deliver exceptional performance in the image classification, object detection, speech-to-text, and language processing workloads.

Figure 7: Dell PowerEdge XR4520c and PowerEdge XR7620 servers across the ResNet50, RetinaNet, RNNT, and BERT 99 Single and Multi-stream benchmarks

Figure 8: Dell PowerEdge XR4520c and PowerEdge XR7620 servers across the ResNet50, RetinaNet, RNNT, and BERT 99 Offline benchmarks

Like ResNet50 and RetinaNet, the 3D-Unet benchmark falls under the vision area but focuses on the medical image segmentation task. The following figures show identical performance of the two servers in both the default and high accuracy modes in the Single-stream and Offline scenarios.

Figure 9: Dell PowerEdge XR4520c and PowerEdge XR7620 servers across 3D-Unet Single-stream

Figure 10: Dell PowerEdge XR4520c and PowerEdge XR7620 server across 3D-Unet Offline

Dell PowerEdge XR5610 power submission

In the MLPerf Inference v3.0 round of submissions, Dell Technologies made a power submission under the preview category for the Dell PowerEdge XR5610 server with the NVIDIA L4 GPU. For the v3.1 round of submissions, Dell Technologies made another power submission for the same server in the closed edge category. As the following table shows, the hardware remained consistent across the two rounds of submission, but the software stack was updated. In terms of system performance per watt, the PowerEdge XR5610 server claims the top spot in image classification, object detection, speech-to-text, language processing, and medical image segmentation workloads.

Table 2: MLPerf Inference v3.0 and v3.1 system configuration of the Dell PowerEdge XR5610 server

  • Platform: Dell PowerEdge XR5610 (1x L4, MaxQ, TensorRT) v3.0 / Dell PowerEdge XR5610 (1x L4, MaxQ, TensorRT) v3.1
  • MLPerf system ID: XR5610_L4x1_TRT_MaxQ (both rounds)
  • Operating system: CentOS 8.2 (both rounds)
  • CPU: Intel Xeon Gold 5423N @ 2.10 GHz (both rounds)
  • Memory: 256 GB (both rounds)
  • GPU: NVIDIA L4 (both rounds)
  • GPU count: 1 (both rounds)
  • Software stack (v3.0): TensorRT 8.6.0, CUDA 12.0, cuDNN 8.8.0, Driver 515.65.01, DALI 1.17.0
  • Software stack (v3.1): TensorRT 9.0.0, CUDA 12.2, cuDNN 8.9.2, Driver 525.105.17, DALI 1.28.0

Each power submission pairs every submitted benchmark result with a power metric. For the Single-stream and Multi-stream scenarios, the performance metric is latency in milliseconds and the corresponding energy consumption is reported in millijoules (mJ). Offline performance is recorded in samples per second (samples/s), and the corresponding power readings are reported in watts. The following table shows how queries per millijoule and samples/s per watt are calculated.

Table 3: Breakdown of reading a power submission

  • Single-stream —Performance metric: Latency (ms); Power metric: millijoules (mJ); Performance per unit of energy: 1 query/mJ -> queries/mJ
  • Multi-stream —Performance metric: Latency (ms); Power metric: millijoules (mJ); Performance per unit of energy: 8 queries/mJ -> queries/mJ
  • Offline —Performance metric: Samples/s; Power metric: Watts; Performance per unit of energy: Samples/s / Watts -> performance per Watt
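To make the table concrete, the following is a small worked sketch of the Offline calculation with purely illustrative numbers (these are not measured results):

SAMPLES_PER_SEC=12000   # illustrative Offline throughput, not a measured result
AVG_WATTS=70            # illustrative average system power during the run, not a measured result
echo "scale=2; $SAMPLES_PER_SEC / $AVG_WATTS" | bc   # prints 171.42 samples/s per watt
# For Single-stream, the efficiency is simply one query divided by the measured energy per query in millijoules.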

The following figure shows the improvement in performance per unit of energy on the Dell PowerEdge XR5610 server between the v3.0 and v3.1 rounds of submission. Across all the benchmarks, the server delivered roughly double the performance per unit of energy. For the RNNT Single-stream benchmark, the server showed a performance jump of close to five times. The improvements came from hardware and software optimizations; BIOS firmware upgrades also contributed significantly.

Figure 11: Dell PowerEdge XR5610 with NVIDIA L4 GPU power submission for v3.1 compared to v3.0

The following figure shows the Single-stream and Multi-stream latency results from the Dell PowerEdge XR5610 server:

Figure 12: Dell PowerEdge XR5610 server with the NVIDIA L4 GPU, v3.1

Conclusion

Both the Dell PowerEdge XR4520c and Dell PowerEdge XR7620 servers continue to showcase excellent performance in the edge suite for MLPerf Inference. The Dell PowerEdge XR5610 server showed a consistent doubling in performance per unit of energy across all benchmarks, confirming itself as a power-efficient server option. Built for the edge, the Dell PowerEdge XR portfolio proves to be an outstanding option with consistent performance in the MLPerf Inference v3.1 submission. As the need for edge computing continues to grow, the MLPerf Inference edge suite shows that Dell PowerEdge servers continue to be an excellent option for any artificial intelligence workload.

MLCommons results

https://mlcommons.org/en/inference-edge-31/

MLPerf Inference v3.1 system IDs:

  • 3.1-0072 - Dell PowerEdge XR4520c (1x L4, TensorRT)
  • 3.1-0073 - Dell PowerEdge XR5610 (1x L4, MaxQ, TensorRT)
  • 3.1-0074 - Dell PowerEdge XR7620 (1x L4, TensorRT)

The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.

Today, MLCommons released the latest version (v3.1) of MLPerf Inference results. Dell Technologies has made submissions to the inference benchmark since its version 0.5 launch in 2019. We continue to demonstrate outstanding results across the different models in the benchmark, such as image classification, object detection, natural language processing, speech recognition, recommender systems, medical image segmentation, and LLM summarization. See our MLPerf™ Inference v2.1 with NVIDIA GPU-Based Benchmarks on Dell PowerEdge Servers white paper, which introduces the MLCommons Inference benchmark. Generative AI (GenAI) has taken deep learning computing needs by storm, and there is an ever-increasing need for high-performance, innovative inferencing approaches. This blog provides an overview of the performance summaries showing how Dell PowerEdge servers enable end users to deliver on their AI inference transformation.

What is new with Inference 3.1?

Inference 3.1 and Dell’s submission include the following:

  • The inference benchmark has added two exciting new benchmarks:
    1. LLM-based models, such as GPT-J
    2. DLRM-V2 with multi-hot encodings using the DLRM-DCNv2 architecture
  • Dell’s submission has been expanded to include the new PowerEdge XE8640 and PowerEdge XE9640 servers accelerated by NVIDIA GPUs.
  • Dell’s submission includes results of PowerEdge servers with Qualcomm accelerators.
  • Besides accelerator-based results, Dell’s submission includes Intel-based CPU-only results.

Overview of results

Dell Technologies submitted 230 results across 20 different configurations. The most impressive results were generated by the PowerEdge XE9680, XE9640, XE8640, and R760xa servers with the new NVIDIA H100 PCIe and SXM Tensor Core GPUs, the PowerEdge XR7620 and XR5610 servers with the NVIDIA L4 Tensor Core GPU, and the PowerEdge R760xa server with the NVIDIA L40 GPU.

Overall, NVIDIA-based results include the following accelerators:

  • (New) Four-way NVIDIA H100 Tensor Core GPU (SXM)
  • (New) Four-way NVIDIA L40 GPU
  • Eight-way NVIDIA H100 Tensor Core GPU (SXM)
  • Four-way NVIDIA A100 Tensor Core GPU (PCIe)
  • NVIDIA L4 Tensor Core GPU

These accelerators were benchmarked on different servers such as PowerEdge XE9680, XE8640, XE9640, R760xa, XR7620, XR5610, and R750xa servers across data center and edge suites.

The large number of result choices offers end users an opportunity to make system purchase decisions and set performance and design expectations.

Interesting Dell Datapoints

The most interesting datapoints include:

  • The performance numbers on newly released Dell PowerEdge servers are outstanding.
  • Among 21 submitters, Dell Technologies was one of the few companies that covered all benchmarks in all closed divisions for data center, edge, and edge power suites.
  • The PowerEdge XE9680 system with eight NVIDIA H100 SXM GPUs procures the highest performance titles with the ResNet Server, RetinaNet Server, RNNT Server and Offline, BERT 99 Server, BERT 99.9 Offline, DLRM-DCNv2 99, and DLRM-DCNv2 99.9 Offline benchmarks.
  • The PowerEdge XE8640 system with four NVIDIA H100 SXM GPUs procures the highest performance titles with all the data center suite benchmarks.
  • The PowerEdge XE9640 system with four NVIDIA H100 SXM GPUs procures the highest performance titles for all systems among other liquid cooled systems for all data center suite benchmarks.
  • The PowerEdge XR5610 system with an NVIDIA L4 Tensor Core GPU offers approximately two to three times higher performance/watt compared to the last round and procures the highest power efficiency titles with the ResNet, RetinaNet, 3D U-Net 99, 3D U-Net 99.9, and BERT 99 benchmarks.

Highlights

The following figure shows the different system performance for offline and server scenarios in the data center. These results provide an overview; future blogs will provide more details about the results.

The figure shows that these servers delivered excellent performance for all models in the benchmark such as ResNet, RetinaNet, 3D-U-Net, RNN-T, BERT, DLRM-v2, and GPT-J. It is important to recognize that different benchmarks operate on varied scales. They have all been showcased in the following figures to offer a comprehensive overview.

Fig 1: System throughput for submitted systems for the data center suite

The following figure shows Single-stream and Multi-stream scenario results for the edge for the ResNet, RetinaNet, 3D-Unet, RNN-T, BERT 99, and GPT-J benchmarks. The lower the latency, the better the results.

Fig 2: System throughput for submitted systems for the edge

Conclusion

We have provided MLCommons-compliant submissions to the Inference 3.1 benchmark across various benchmarks and suites for all tasks in the benchmark, such as image classification, object detection, natural language processing, speech recognition, recommender systems, medical image segmentation, and LLM summarization. These results indicate that with the newer generation of Dell PowerEdge servers, such as the PowerEdge XE9680, XE8640, XE9640, and R760xa servers, and newer GPUs from NVIDIA, end users can benefit from higher performance in their data center and edge inference deployments. We have also secured numerous Number 1 titles that make Dell PowerEdge servers an excellent choice for data center and edge inference deployments. End users can refer to the results across various servers to make performance and sizing decisions. With these results, Dell Technologies can help fuel enterprises' AI transformation, including effective Generative AI adoption and expansion.

Future Steps

More blogs that provide an in-depth comparison of the performance of specific models with different accelerators are on their way soon. For any questions or requests, contact your local Dell representative.

MLCommons Results

https://mlcommons.org/en/inference-datacenter-31/

https://mlcommons.org/en/inference-edge-31/

The preceding graphs show MLCommons results for MLPerf IDs 3.1-0058 through 3.1-0069 in the closed data center category, 3.1-0058 through 3.1-0075 in the closed edge category, and 3.1-0073 in the closed edge power category.

The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.