dave spink toolset
DEEP LEARNING BENCHMARK SUITE (DLBS): |
OVERVIEW | JSON FILE | NVIDIA | TENSORFLOW | CAFFE2 | INFERENCE
THE PROJECT

Verizon (VMG) asked us for help implementing a GPU benchmarking suite for testing different frameworks, e.g. TensorFlow, PyTorch, Caffe2, etc. Hence, we implemented the HPE DLBS, created by Sergey and Natalia.

OVERVIEW

The DLBS is designed for running consistent and reproducible AI/ML benchmark experiments. It's licensed under Apache 2.0. It supports:

- multiple frameworks: TensorFlow, Caffe (BVLC, NVIDIA and Intel variants), Caffe2, MXNet, PyTorch and TensorRT
- Docker-based execution using reference DLBS and NGC containers
- both training and inference phases
- synthetic and real data for benchmarking
Installation Overview:
Frameworks:
Architecture Overview:
DLBS is the experimenter (the benchmarking suite), while the benchmarks themselves run in containers. You start with a framework (on the right of the architecture diagram) and use a framework-specific benchmark backend to train neural networks. Training benchmark:
Synthetic data vs. Real Data for benchmarking:
Inference
JSON FILE

Configuration Run JSON File:

$ cd ./tutorials/recipes/multi_gpu_compute_scaling
$ cat config.json
Results:

{
  "exp.device_type": "gpu",
  "exp.framework_title": "TensorFlow-nvcnn",
  "exp.framework_ver": "1.7.0",
  "exp.model_title": "ResNet50",
  "exp.phase": "training",
  "exp.replica_batch": 128,
  "results.throughput": 2349.2708413786054,
  "results.time": 217.93996289484733
}

results.time - the average time, in milliseconds, to process one batch of data.
results.throughput - the number of instances per second; in this case, images per second.
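The two fields are consistent with each other through the effective batch size. A quick sanity check in Python (a minimal sketch; the effective batch of 512 assumes this run used 4 GPUs x a replica batch of 128, which is not stated in the output above):

# Sanity check: throughput = effective_batch / (results.time / 1000)
time_ms = 217.93996289484733   # results.time: milliseconds per batch
effective_batch = 4 * 128      # assumption: 4 GPUs x replica batch of 128

throughput = effective_batch / (time_ms / 1000.0)
print(round(throughput, 2))    # 2349.27, matching results.throughput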
NVIDIA

Check NVIDIA drivers:

# nvidia-smi    !!check the driver version is 396.26 or later (here, 410.48)

Check GPU-to-GPU communication (p2pBandwidthLatencyTest):

# Check the CUDA installation. You should see a cuda-install-samples-9.0.sh file or similar, depending on the CUDA version.
ls /usr/local/cuda/bin
# Get the sample projects into the home folder.
/usr/local/cuda/bin/cuda-install-samples-* $HOME
# Build the project.
cd $HOME/NVIDIA_CUDA-*/1_Utilities/p2pBandwidthLatencyTest
make
# Run it. If the bandwidth/latency matrices print for all GPU pairs, everything's fine.
./p2pBandwidthLatencyTest

Install Docker: See Docker for installation instructions.

NVIDIA Docker Package:

$ wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker-1.0.1-1.x86_64.rpm
$ sudo rpm -i /tmp/nvidia-docker*.rpm && rm /tmp/nvidia-docker*.rpm
$ sudo systemctl start nvidia-docker

TENSORFLOW

Clone DLBS:

$ cd    !!home directory
$ git clone https://github.com/HewlettPackard/dlcookbook-dlbs dlbs
Cloning into 'dlbs'...

Setup environment paths:

$ cd dlbs
$ export BENCH_ROOT=`pwd`
$ ls -l scripts/environment.sh    **check it is executable
$ . ./scripts/environment.sh
$ echo $PYTHONPATH $reporter

Pull images; login to https://ngc.nvidia.com:

$ docker login nvcr.io
Username: $oauthtoken
Password:

Verify you can run docker and/or nvidia-docker:

$ docker run -ti nvcr.io/nvidia/tensorflow:18.07-py3 /bin/bash
$ nvidia-docker run -ti nvcr.io/nvidia/tensorflow:18.07-py3 /bin/bash
$ nvidia-docker run -ti nvcr.io/nvidia/tensorflow:18.07-py3 nvidia-smi    !!check you see the GPUs inside the docker container

See a list of frameworks:

$ python ./python/dlbs/experimenter.py help --frameworks --no-colors

TensorFlow
  nvtfcnn      Highly optimized benchmark backend that provides the best results on multi-GPU machines. This backend must be used for best performance. NGC container: nvcr.io/nvidia/tensorflow:18.07-py3
  nvcnn        Previous version of the 'nvtfcnn' benchmark backend. NGC container: nvcr.io/nvidia/tensorflow:18.04-py3
  tensorflow   Google's TF_CNN_BENCHMARKS project. Will not provide very good performance with 8 GPUs. Reference DLBS container: hpe/tensorflow:cuda9-cudnn7
Caffe
  bvlc_caffe   Benchmarks based on the original BVLC Caffe implementation. Reference DLBS container: hpe/bvlc_caffe:cuda9-cudnn7
  nvidia_caffe Benchmarks based on NVIDIA's version of BVLC Caffe. NGC container: nvcr.io/nvidia/caffe:18.05-py2
  intel_caffe  Intel's version of BVLC Caffe, suitable for CPUs. Reference DLBS container: hpe/intel_caffe:cuda9-cudnn7
Caffe2
  caffe2       Default benchmark backend for the Caffe2 framework. NGC container: nvcr.io/nvidia/caffe2:18.05-py2
MXNet
  mxnet        Default benchmark backend for the MXNet framework. NGC container: nvcr.io/nvidia/mxnet:18.05-py2
PyTorch
  pytorch      Default benchmark backend for the PyTorch framework. NGC container: nvcr.io/nvidia/pytorch:18.06-py3
TensorRT
  tensorrt     Default benchmark backend for the TensorRT inference engine. DLBS container: dlbs/tensorrt:18.10

Create JSON file:

$ vi test.json
{
  "parameters": {
    "exp.framework": "nvcnn",
    "exp.docker_image": "nvcr.io/nvidia/tensorflow:18.07-py3",
    "exp.num_warmup_batches": 10,
    "exp.num_batches": 40,
    "exp.log_file": "./logs/${exp.num_gpus}_${exp.model}_${exp.effective_batch}.log",
    "exp.dtype": "float16",
    "exp.model": "resnet50",
    "exp.replica_batch": 128
  },
  "variables": {
    "exp.gpus": ["0", "0,1", "0,1,2,3", "0,1,2,3,4,5,6,7"]
  }
}

Run experiment:

$ cd dlbs
$ mkdir logs
$ python ./python/dlbs/experimenter.py run --config test.json

Create a folder for benchmark results and parse the log files:

$ mkdir $BENCH_ROOT/reports
$ params="exp.effective_batch,exp.replica_batch,results.time,results.throughput,exp.model_title,exp.phase,exp.gpus,exp.num_gpus,exp.model"
$ python $logparser ./logs/ --recursive --output_params ${params} --output_file ./reports/results.json
$ cat ./reports/results.json    !!example output shown below
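Note that this single config yields four benchmark runs: the experimenter expands the "variables" section into the cross-product of all listed values, overlaying each combination on top of "parameters". A minimal sketch of that expansion (illustrative only, not DLBS's actual implementation):

from itertools import product

# Illustrative expansion of "variables" into individual experiments
# (not DLBS's actual implementation).
parameters = {"exp.model": "resnet50", "exp.replica_batch": 128}
variables = {"exp.gpus": ["0", "0,1", "0,1,2,3", "0,1,2,3,4,5,6,7"]}

experiments = []
for values in product(*variables.values()):
    exp = dict(parameters)                       # start from the fixed parameters
    exp.update(zip(variables.keys(), values))    # overlay one combination of variables
    exp["exp.num_gpus"] = len(exp["exp.gpus"].split(","))
    exp["exp.effective_batch"] = exp["exp.num_gpus"] * exp["exp.replica_batch"]
    experiments.append(exp)

print(len(experiments))   # 4 runs: one per GPU set, effective batch 128/256/512/1024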
Example results.json for reference:

"data": [
  {
    "exp.model_title": "ResNet50",
    "exp.effective_batch": 128,
    "exp.gpus": "0",
    "results.time": 175.11470166412064,
    "exp.phase": "training",
    "exp.model": "resnet50",
    "exp.replica_batch": 128,
    "results.throughput": 730.949479304775,
    "exp.num_gpus": 1
  },
  {
    "exp.model_title": "ResNet50",
    "exp.effective_batch": 256,
    "exp.gpus": "0,1",
    "results.time": 195.7543116174103,
    "exp.phase": "training",
    "exp.model": "resnet50",
    "exp.replica_batch": 128,
    "results.throughput": 1307.7617442232188,
    "exp.num_gpus": 2
  },
  {
    "exp.model_title": "ResNet50",
    "exp.effective_batch": 512,
    "exp.gpus": "0,1,2,3",
    "results.time": 206.3526550066783,
    "exp.phase": "training",
    "exp.model": "resnet50",
    "exp.replica_batch": 128,
    "results.throughput": 2481.1893017971097,
    "exp.num_gpus": 4
  },
  {
    "exp.model_title": "ResNet50",
    "exp.effective_batch": 1024,
    "exp.gpus": "0,1,2,3,4,5,6,7",
    "results.time": 229.79593794386074,
    "exp.phase": "training",
    "exp.model": "resnet50",
    "exp.replica_batch": 128,
    "results.throughput": 4456.127506701897,
    "exp.num_gpus": 8
  }
]

Create textual report:

$ python $reporter --summary_file ./reports/results.json --type 'weak-scaling' --target_variable 'results.time' > ./reports/results.txt
$ cat ./reports/results.txt

Batch time (milliseconds)
Network    Batch    1         2         4         8
ResNet50   128      175.11    195.75    206.35    229.80

Inferences Per Second (IPS, throughput)
Network    Batch    1         2         4         8
ResNet50   128      730       1307      2481      4456

Speedup (instances per second)
Network    Batch    1         2         4         8
ResNet50   128      1         1.79      3.40      6.10

Efficiency = 100% * t1 / tN
Network    Batch    1         2         4         8
ResNet50   128      100.00    89.45     84.86     76.20

Other Training JSON files - 01. fast:

{
  "parameters": {
    "pytorch.docker_image": "nvcr.io/nvidia/pytorch:18.06-py3",
    "nvtfcnn.docker_image": "nvcr.io/nvidia/tensorflow:18.07-py3",
    "exp.docker": true,
    "exp.git_hashtag": "92ef23344c4bfd0e222677c4674fe4f24d154658",
    "exp.num_warmup_batches": 100,
    "exp.num_batches": 400,
    "exp.log_file": "${BENCH_ROOT}/logs/${exp.framework}/${exp.data}/${exp.dtype}/$(\"${exp.gpus}\".replace(\",\",\".\"))$_${exp.model}_${exp.effective_batch}.log",
    "exp.phase": "training",
    "exp.sys_info": "cpuinfo,meminfo,lscpu,nvidiasmi,dmi"
  },
  "variables": {
    "exp.framework": ["pytorch", "nvtfcnn"],
    "exp.dtype": ["float16"],
    "exp.gpus": ["0", "0,1", "0,1,2,3", "0,1,2,3,4,5,6,7"],
    "exp.model": ["resnet50"],
    "exp.replica_batch": [256]
  }
}

Other Training JSON files - 02. medium. Here "exp.status": "disabled" turns off every generated combination by default; each entry in "extensions" re-enables (by clearing exp.status) just the model/batch combinations we want to run:

{
  "parameters": {
    "pytorch.docker_image": "nvcr.io/nvidia/pytorch:18.06-py3",
    "nvtfcnn.docker_image": "nvcr.io/nvidia/tensorflow:18.07-py3",
    "exp.docker": true,
    "exp.git_hashtag": "92ef23344c4bfd0e222677c4674fe4f24d154658",
    "exp.num_warmup_batches": 100,
    "exp.num_batches": 400,
    "exp.log_file": "${BENCH_ROOT}/logs/${exp.framework}/${exp.data}/${exp.dtype}/$(\"${exp.gpus}\".replace(\",\",\".\"))$_${exp.model}_${exp.effective_batch}.log",
    "exp.phase": "training",
    "exp.status": "disabled",
    "exp.sys_info": "cpuinfo,meminfo,lscpu,nvidiasmi,dmi"
  },
  "variables": {
    "exp.framework": ["pytorch", "nvtfcnn"],
    "exp.dtype": ["float16"],
    "exp.gpus": ["0", "0,1", "0,1,2,3", "0,1,2,3,4,5,6,7"],
    "exp.model": ["resnet50", "alexnet_owt"],
    "exp.replica_batch": [256, 1024]
  },
  "extensions": [
    {"condition": {"exp.model": ["alexnet_owt"], "exp.replica_batch": [1024]}, "parameters": {"exp.status": ""}},
    {"condition": {"exp.model": ["resnet50"], "exp.replica_batch": [256]}, "parameters": {"exp.status": ""}}
  ]
}
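The Speedup and Efficiency rows in the textual report above are derived from the first two tables: speedup is throughput on N GPUs divided by throughput on 1 GPU, and efficiency is 100 * t1 / tN. A small Python check of that arithmetic:

# Reproduce the Speedup and Efficiency rows of the report above.
t   = {1: 175.11, 2: 195.75, 4: 206.35, 8: 229.80}       # batch time, ms
ips = {1: 730.95, 2: 1307.76, 4: 2481.19, 8: 4456.13}    # images/sec

for n in (1, 2, 4, 8):
    speedup = ips[n] / ips[1]            # e.g. 2 GPUs: 1307.76 / 730.95 = 1.79
    efficiency = 100.0 * t[1] / t[n]     # e.g. 2 GPUs: 100 * 175.11 / 195.75 = 89.45
    print(n, round(speedup, 2), round(efficiency, 2))
# Matches the report's Speedup and Efficiency rows to within rounding.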
Other Training JSON files - 03. complete:

{
  "parameters": {
    "pytorch.docker_image": "nvcr.io/nvidia/pytorch:18.06-py3",
    "nvtfcnn.docker_image": "nvcr.io/nvidia/tensorflow:18.07-py3",
    "exp.docker": true,
    "exp.git_hashtag": "92ef23344c4bfd0e222677c4674fe4f24d154658",
    "exp.num_warmup_batches": 100,
    "exp.num_batches": 400,
    "exp.log_file": "${BENCH_ROOT}/logs/${exp.framework}/${exp.data}/${exp.dtype}/$(\"${exp.gpus}\".replace(\",\",\".\"))$_${exp.model}_${exp.effective_batch}.log",
    "exp.phase": "training",
    "exp.status": "disabled",
    "exp.sys_info": "cpuinfo,meminfo,lscpu,nvidiasmi,dmi"
  },
  "variables": {
    "exp.framework": ["pytorch", "nvtfcnn"],
    "exp.dtype": ["float16", "float32"],
    "exp.gpus": ["0", "0,1", "0,1,2,3", "0,1,2,3,4,5,6,7"],
    "exp.model": ["resnet50", "resnet152", "googlenet", "vgg16", "alexnet_owt"],
    "exp.replica_batch": [64, 128, 256, 1024]
  },
  "extensions": [
    {"condition": {"exp.model": "resnet50", "exp.dtype": "float16", "exp.replica_batch": [256]}, "parameters": {"exp.status": ""}},
    {"condition": {"exp.model": "resnet50", "exp.dtype": "float32", "exp.replica_batch": [128]}, "parameters": {"exp.status": ""}},
    {"condition": {"exp.model": "vgg16", "exp.dtype": "float16", "exp.replica_batch": [128]}, "parameters": {"exp.status": ""}},
    {"condition": {"exp.model": "vgg16", "exp.dtype": "float32", "exp.replica_batch": [64]}, "parameters": {"exp.status": ""}},
    {"condition": {"exp.model": ["resnet152", "googlenet"], "exp.replica_batch": [128]}, "parameters": {"exp.status": ""}},
    {"condition": {"exp.model": ["alexnet_owt"], "exp.replica_batch": [1024]}, "parameters": {"exp.status": ""}}
  ]
}

CAFFE2

Pull the optimized image from NVIDIA:

$ docker pull nvcr.io/nvidia/caffe2:18.05-py2

Modify the JSON configuration file; note the caffe2-specific docker image and framework settings:

$ vi config.json
{
  "parameters": {
    "caffe2.docker_image": "nvcr.io/nvidia/caffe2:18.05-py2",
    "exp.docker": true,
    "exp.git_hashtag": "92ef23344c4bfd0e222677c4674fe4f24d154658",
    "exp.num_warmup_batches": 100,
    "exp.num_batches": 400,
    "exp.log_file": "${BENCH_ROOT}/logs/${exp.framework}/${exp.data}/${exp.dtype}/$(\"${exp.gpus}\".replace(\",\",\".\"))$_${exp.model}_${exp.effective_batch}.log",
    "exp.phase": "training",
    "exp.sys_info": "cpuinfo,meminfo,lscpu,nvidiasmi,dmi"
  },
  "variables": {
    "exp.framework": ["caffe2"],
    "exp.dtype": ["float16"],
    "exp.gpus": ["0", "0,1", "0,1,2,3", "0,1,2,3,4,5,6,7"],
    "exp.model": ["resnet50"],
    "exp.replica_batch": [256]
  }
}

INFERENCE

Login to https://developer.nvidia.com/tensorrt and download nv-tensorrt-repo-ubuntu1604-cuda9.0-trt5.0.0.10-rc-20180906_1-1_amd64.deb
Build the framework:

$ mv dlbs dlbs-bck1
$ git clone https://github.com/HewlettPackard/dlcookbook-dlbs dlbs
$ mv nv-tensorrt-repo-ubuntu1604-cuda9.0-trt5.0.0.10-rc-20180906_1-1_amd64.deb ./dlbs/docker/tensorrt/18.11
$ cd dlbs/docker
$ ./build.sh tensorrt/18.11
$ docker images
$ cd
$ cd ./yahoo/inference
$ vi config.json    **replace 18.10 with 18.11

Plan B (if the above build fails): get the docker binary image "dlbs_tensorrt:18.11" and load it:

$ docker load --input dlbs_tensorrt:18.11

Inference JSON File:

$ cat config.json
{
  "parameters": {
    "exp.num_warmup_batches": 100,
    "exp.num_batches": 400,
    "monitor.frequency": 0,
    "exp.status": "disabled",
    "exp.log_file": "${BENCH_ROOT}/${logdir}/$(\"${exp.gpus}\".replace(\",\",\".\"))$_${exp.model}_${exp.effective_batch}.log",
    "exp.docker": true,
    "exp.docker_image": "dlbs/tensorrt:18.11",
    "exp.framework": "tensorrt",
    "exp.phase": "inference",
    "exp.dtype": "float16"
  },
  "variables": {
    "exp.gpus": ["0", "0,4", "0,2,4,6", "0,1,2,3,4,5,6,7"],
    "exp.model": ["alexnet_owt", "resnet152", "resnet50"],
    "exp.replica_batch": [128, 256, 1024]
  },
  "extensions": [
    {"condition": {"exp.model": "alexnet_owt", "exp.replica_batch": [1024]}, "parameters": {"exp.status": "", "exp.num_batches": 500}},
    {"condition": {"exp.model": "resnet152", "exp.replica_batch": [128]}, "parameters": {"exp.status": "", "exp.num_batches": 250}},
    {"condition": {"exp.model": "resnet50", "exp.replica_batch": [256]}, "parameters": {"exp.status": "", "exp.num_batches": 300}}
  ]
}
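As in the training configs, everything here starts disabled and the "extensions" selectively re-enable and tune the combinations whose values match a "condition". A minimal Python sketch of that matching semantics (illustrative only, not DLBS's actual implementation):

# Illustrative sketch of the "extensions" matching semantics
# (not DLBS's actual implementation).
def apply_extensions(exp, extensions):
    """Apply every extension whose condition matches this experiment."""
    for ext in extensions:
        # A condition matches when, for every key, the experiment's value equals
        # the condition value or appears in the condition's list of values.
        matches = all(
            exp.get(key) == value or (isinstance(value, list) and exp.get(key) in value)
            for key, value in ext["condition"].items()
        )
        if matches:
            exp.update(ext["parameters"])   # e.g. clear exp.status, override exp.num_batches
    return exp

# One combination generated from the inference config above:
exp = {"exp.model": "resnet50", "exp.replica_batch": 256, "exp.status": "disabled"}
extensions = [{"condition": {"exp.model": "resnet50", "exp.replica_batch": [256]},
               "parameters": {"exp.status": "", "exp.num_batches": 300}}]
print(apply_extensions(exp, extensions))
# -> exp.status is cleared and exp.num_batches set to 300, so this run is enabled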