NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA datacenter GPUs in cluster environments. It also provides APIs to let developers integrate it into their own GPU profiling/monitoring tools.
If you have not set up the official CUDA repository yet, add it first:
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
$ sudo dpkg -i cuda-keyring_1.0-1_all.deb
$ sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
Then install DCGM via apt:
$ sudo apt-get update \
  && sudo apt-get install -y datacenter-gpu-manager
Start the DCGM service
$ sudo systemctl --now enable nvidia-dcgm
NVIDIA's supported-platforms page lists all supported platforms and operating systems. Note that only data center GPUs such as the A100 support the full feature set, including the profiling module.
DCGM can be used in two modes: embedded mode, where your own process loads and drives the DCGM library, and standalone mode, where a separate nv-hostengine daemon does the monitoring.
In embedded mode, you load and manage the DCGM library yourself. I have added such functionality to pytorch/benchmark: by passing --flops dcgm, you can get the FLOPS for the chosen model. Triton model analyzer is a good example of this mode, and I borrowed a lot from its source code.
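As a rough illustration of how a tensor-core activity reading maps to a FLOPS estimate (this is my own sketch, not the actual pytorch/benchmark implementation; the peak figure is the A100's published dense FP16 tensor peak, and the function name is made up):

```python
# Crude sketch: estimate achieved tensor-core throughput from DCGM's
# tensor_active ratio (field 1004), i.e. the fraction of cycles the tensor
# pipe was active. Substitute the spec-sheet peak for your own GPU.
A100_FP16_TENSOR_PEAK_TFLOPS = 312.0  # assumption: A100 dense FP16 tensor peak

def estimated_tflops(tensor_active, peak_tflops=A100_FP16_TENSOR_PEAK_TFLOPS):
    """tensor_active: activity ratio in [0, 1] as reported by dcgmi dmon."""
    return tensor_active * peak_tflops

print(estimated_tflops(0.5))  # tensor cores busy half the time -> 156.0
```

This is an upper-bound style estimate; the real achieved FLOPS also depends on the instruction mix and precision, which is why tools like model analyzer combine several profiling fields.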
In standalone mode, nv-hostengine, a daemon provided by NVIDIA, acts as a monitoring server that communicates with the GPU driver; it runs in the background. dcgmi is the client that talks to nv-hostengine. You can run dcgmi discovery -l to list all available GPUs and check that DCGM is working.
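If you prefer not to use the systemd unit above, a minimal manual session might look like this (5555 is DCGM's documented default port; the exact flags your nv-hostengine build accepts may differ):

```
$ sudo nv-hostengine        # start the daemon; listens on 127.0.0.1:5555 by default
$ dcgmi discovery -l        # clients such as dcgmi connect to the local nv-hostengine
$ sudo nv-hostengine -t     # terminate the daemon when you are done
```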
$ dcgmi modules -l
# list all supported hardware counters belonging to the profiling domain
$ dcgmi profile -l
$ dcgmi discovery -l
# keeps running to get metric records; 1004 is tensor_active
$ dcgmi dmon -e 1001,1004
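To consume the dmon stream programmatically, you can parse its text output. A minimal sketch (the column layout below is an assumption, so check the header your dcgmi version actually prints; parse_dmon is a name I made up):

```python
# Parse `dcgmi dmon -e 1001,1004`-style output into (gpu_id, gract, tenso)
# tuples. Assumed layout: comment/header lines start with '#', data lines
# look like "GPU 0  0.945  0.512".
def parse_dmon(text):
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("GPU"):
            continue  # skip headers and blank lines
        parts = line.split()
        gpu_id = int(parts[1])
        gract, tenso = float(parts[2]), float(parts[3])
        records.append((gpu_id, gract, tenso))
    return records

sample = """# Entity  GRACT  TENSO
GPU 0      0.945  0.512
GPU 1      0.120  0.000
"""
print(parse_dmon(sample))  # -> [(0, 0.945, 0.512), (1, 0.12, 0.0)]
```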
If your GPU does not support the profiling module, you will get an error like:

components.model_analyzer.dcgm.dcgm_structs.DCGMError_ProfilingNotSupported: Profiling is not supported for this group of GPUs or GPU.