NVIDIA DCGM - FindHao

Introduction

NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA datacenter GPUs in cluster environments. It also provides APIs to let developers integrate it into their own GPU profiling/monitoring tools.

Installation

If you have not set up the official CUDA repository yet, add it as follows.

$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
$ sudo dpkg -i cuda-keyring_1.0-1_all.deb
$ sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"

Then install DCGM via apt.

$ sudo apt-get update \
    && sudo apt-get install -y datacenter-gpu-manager

Start the DCGM service and enable it at boot.

$ sudo systemctl --now enable nvidia-dcgm

NVIDIA's supported platforms page lists all supported platforms and operating systems. Only data center (Tesla series) GPUs such as the A100 support every feature, including the profiling module.

DCGM Mode

DCGM can be used in one of two modes.

Embedded Mode

In this mode, you have to load and manage the DCGM library yourself. I have added this functionality to pytorch/benchmark: passing --flops dcgm reports the FLOPS of the chosen model. Triton Model Analyzer is a good example of this mode, and I borrowed heavily from its source code.

Standalone Mode

In this mode, nv-hostengine, which is provided by NVIDIA, acts as a monitoring server that communicates with the GPU driver. It should run in the background. dcgmi is the client that talks to nv-hostengine. You can usually run dcgmi discovery -l to list all available GPUs and check that DCGM is working.
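The check described above can be wrapped in a small helper. This is only a sketch: the function name dcgm_check and its return codes are my own, and it assumes dcgmi exits with a non-zero status when it cannot reach nv-hostengine.

```shell
# Hypothetical helper: confirm the DCGM client is installed and can list GPUs.
dcgm_check() {
    # Fail early with a readable message if the client is not on PATH.
    if ! command -v dcgmi >/dev/null 2>&1; then
        echo "dcgmi not found; install datacenter-gpu-manager first" >&2
        return 1
    fi
    # Listing GPUs needs a reachable nv-hostengine (assumed to fail otherwise).
    if ! dcgmi discovery -l; then
        echo "dcgmi could not list GPUs; is the nvidia-dcgm service running?" >&2
        return 2
    fi
}
```

Calling dcgm_check after starting the service should print the GPU list when everything is working, or one of the two messages when it is not.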

Tips

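# list the DCGM modules and their status
dcgmi modules -l
# list all supported hardware counters (field IDs) belonging to the profiling domain
dcgmi profile -l
# list all GPUs visible to DCGM
dcgmi discovery -l
# keep sampling the given fields until interrupted: 1001 = gr_engine_active, 1004 = tensor_active
dcgmi dmon -e 1001,1004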
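Since dmon only takes numeric field IDs, a lookup helper can make scripts easier to read. The sketch below is not part of DCGM: dcgm_field_name is a hypothetical function, and the names are the profiling field names as I remember them from dcgmi profile -l, so check them against the output on your own machine.

```shell
# Hypothetical helper: map a DCGM profiling field ID to its short name.
# The table is a subset and should be verified with `dcgmi profile -l`.
dcgm_field_name() {
    case "$1" in
        1001) echo gr_engine_active ;;
        1002) echo sm_active ;;
        1003) echo sm_occupancy ;;
        1004) echo tensor_active ;;
        1005) echo dram_active ;;
        1006) echo fp64_active ;;
        1007) echo fp32_active ;;
        1008) echo fp16_active ;;
        *) echo unknown; return 1 ;;
    esac
}
```

For a bounded run instead of an endless one, dmon also accepts a delay in milliseconds and a sample count, for example dcgmi dmon -e 1001,1004 -d 1000 -c 10 to take ten samples one second apart.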

Disable NVIDIA's automatic conversion of FP32 computation to TF32

export NVIDIA_TF32_OVERRIDE=0
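If you would rather not change the whole shell session, the variable can be set for a single command only. A minimal sketch, where run_without_tf32 is a hypothetical wrapper name and the command you pass to it is up to you:

```shell
# Hypothetical wrapper: run one command with TF32 disabled,
# leaving the environment of the calling shell untouched.
run_without_tf32() {
    NVIDIA_TF32_OVERRIDE=0 "$@"
}
```

For example, run_without_tf32 python my_benchmark.py (my_benchmark.py is a placeholder) would start only that process with the override in place.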

Grant regular (non-root) users permission to use the profiler; otherwise you will get the error components.model_analyzer.dcgm.dcgm_structs.DCGMError_ProfilingNotSupported: Profiling is not supported for this group of GPUs or GPU.
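The note above does not say how to grant that permission. One way I know of is the NVIDIA driver option NVreg_RestrictProfilingToAdminUsers, set through a modprobe configuration file. The sketch below assumes this is the restriction behind the error; the function name and the default file name are hypothetical.

```shell
# Hypothetical helper: write the modprobe option that lets non-admin users
# read GPU performance counters. Run it as root for the default path.
write_profiling_conf() {
    # $1: destination file; the default name is an assumption, any
    # *.conf file under /etc/modprobe.d should be picked up.
    dest="${1:-/etc/modprobe.d/nvidia-profiling.conf}"
    printf '%s\n' 'options nvidia NVreg_RestrictProfilingToAdminUsers=0' > "$dest"
}
```

After writing the file as root, the driver has to be reloaded for the option to take effect; on Ubuntu this usually means sudo update-initramfs -u followed by a reboot.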

Reference

NVIDIA DCGM Source Code