Introduction
NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA data center GPUs in cluster environments. It also provides APIs that let developers integrate it into their own GPU profiling/monitoring tools.
Installation
If you have not set up the official CUDA repository yet, add it as follows.
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
$ sudo dpkg -i cuda-keyring_1.0-1_all.deb
$ sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
Then install DCGM via apt.
$ sudo apt-get update \
&& sudo apt-get install -y datacenter-gpu-manager
Start the DCGM service
$ sudo systemctl --now enable nvidia-dcgm
NVIDIA's supported platforms page lists all supported platforms and operating systems. Note that only data center GPUs such as the A100 support the full feature set, including the profiling module.
DCGM Mode
DCGM can be used in one of two modes.
Embedded Mode
In this mode, you load and manage the DCGM library yourself inside your own process. I have added such functionality to pytorch/benchmark: by adding --flops dcgm, you can get the FLOPS for the chosen model. The Triton model analyzer is a good example of embedded use, and I borrowed a lot from its source code.
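To make the embedded flow concrete, here is a minimal sketch based on the Python bindings shipped with DCGM (usually under /usr/local/dcgm/bindings), the same bindings the Triton model analyzer uses. Treat the exact module names, signatures, and paths (`pydcgm`, `dcgm_structs`, `DcgmHandle`, `DcgmGroup`, `DcgmFieldGroup`, `WatchFields`, `GetLatest`) as assumptions to verify against your installed DCGM version.

```python
# Sketch: DCGM embedded mode via the Python bindings (an assumption of the
# binding API, not taken from the original post; verify against your install).

DCGM_FI_PROF_PIPE_TENSOR_ACTIVE = 1004  # tensor-core activity counter


def tensor_active_field_id():
    """Field ID for the tensor_active metric (the 1004 shown by `dcgmi dmon -e 1004`)."""
    return DCGM_FI_PROF_PIPE_TENSOR_ACTIVE


def main():
    import time
    import pydcgm
    import dcgm_structs  # both available once DCGM is installed

    # ipAddress=None starts an embedded host engine inside this process,
    # instead of connecting to a standalone nv-hostengine.
    handle = pydcgm.DcgmHandle(ipAddress=None,
                               opMode=dcgm_structs.DCGM_OPERATION_MODE_AUTO)
    group = pydcgm.DcgmGroup(handle, groupName="flops-demo",
                             groupType=dcgm_structs.DCGM_GROUP_DEFAULT)
    field_group = pydcgm.DcgmFieldGroup(handle, name="prof",
                                        fieldIds=[tensor_active_field_id()])
    # Sample every 1s (updateFreq is in microseconds), keep 1h of history.
    group.samples.WatchFields(field_group, 1000000, 3600.0, 0)
    time.sleep(2)
    print(group.samples.GetLatest(field_group).values)


if __name__ == "__main__":
    main()
```

The key point is that no separate daemon is needed: the host engine lives inside your process, which is why pytorch/benchmark can collect the metrics itself.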
Standalone mode
In this mode, nv-hostengine, a daemon provided by NVIDIA, serves as a monitoring server that communicates with the GPU driver. It should run in the background. dcgmi is the client that talks to nv-hostengine. You can usually run dcgmi discovery -l to list all available GPUs and check that DCGM is working.
Tips
# list all DCGM modules and their status
dcgmi modules -l
# list all supported hardware counters belonging to the profiling domain
dcgmi profile -l
# list all discovered GPUs
dcgmi discovery -l
# keep running to collect metric records (1001: gr_engine_active, 1004: tensor_active)
dcgmi dmon -e 1001,1004
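If you want to post-process the dmon stream instead of reading it in a terminal, a small parser is enough. This is a sketch: the sample text below is illustrative, and the exact column layout of dcgmi dmon varies by DCGM version, so adjust the header handling to your output.

```python
# Parse `dcgmi dmon -e 1001,1004`-style output into (gpu_id, values) records.
# SAMPLE is an assumed layout for illustration; real dmon columns may differ.

SAMPLE = """\
#Entity   GRACT  TENSO
ID
GPU 0     0.921  0.455
GPU 1     0.000  0.000
"""


def parse_dmon(text):
    """Return a list of (gpu_id, [metric values]) from dmon-style output."""
    records = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows start with the entity kind, e.g. "GPU 0 0.921 0.455".
        if len(parts) >= 3 and parts[0] == "GPU":
            gpu_id = int(parts[1])
            values = [float(v) for v in parts[2:]]
            records.append((gpu_id, values))
    return records


print(parse_dmon(SAMPLE))
# -> [(0, [0.921, 0.455]), (1, [0.0, 0.0])]
```

In practice you would feed this the stdout of a `dcgmi dmon` subprocess rather than a hard-coded string.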
Disable NVIDIA's automatic conversion of FP32 computation to TF32:
export NVIDIA_TF32_OVERRIDE=0
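The reason this matters: TF32 keeps FP32's 8-bit exponent but only 10 mantissa bits (FP32 has 23), so converted math loses precision. The sketch below simulates that loss by zeroing the low mantissa bits of a float32 value; real tensor cores round rather than truncate, so this is only an approximation of the effect.

```python
import struct


def truncate_to_tf32(x: float) -> float:
    """Approximate TF32 by zeroing the low 13 of FP32's 23 mantissa bits.

    Real hardware rounds instead of truncating; this just illustrates
    that values finer than 10 mantissa bits cannot be represented.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~((1 << 13) - 1)  # keep sign, exponent, and top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(truncate_to_tf32(1.0))           # 1.0 is exactly representable
print(truncate_to_tf32(1.0 + 2**-12))  # below TF32 resolution -> 1.0
```

This is why benchmarking FLOPS with TF32 enabled can give numbers that are not comparable to true FP32 runs.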
Grant non-root users permission to use the profiler; otherwise you will hit the error components.model_analyzer.dcgm.dcgm_structs.DCGMError_ProfilingNotSupported: Profiling is not supported for this group of GPUs or GPU.
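One commonly documented way to grant that permission (an assumption here, not stated in the original: this is the driver-level setting NVIDIA documents for profiling-permission errors) is to lift the NVIDIA kernel module's restriction of performance counters to admin users, then reload the driver or reboot:

```
# /etc/modprobe.d/nvidia-profiling.conf (assumed file name; any .conf works)
options nvidia NVreg_RestrictProfilingToAdminUsers=0
```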