Exception during experiment start


#1

One experiment (out of dozens) failed to start, with the following traceback:

Traceback (most recent call last):
  File "/net/people/plgamn/env/lib/python3.6/site-packages/neptune/internal/cli/commands/executing/abstract_experiment_executor.py", line 38, in execute
    return self._execute(experiment, args)
  File "/net/people/plgamn/env/lib/python3.6/site-packages/neptune/internal/cli/commands/executing/experiment_executor.py", line 103, in _execute
    self._spawn_job_related_threads(experiment.id)
  File "/net/people/plgamn/env/lib/python3.6/site-packages/neptune/internal/cli/commands/executing/experiment_executor.py", line 280, in _spawn_job_related_threads
    self._spawn_hardware_metric_reporting_thread(experiment_id)
  File "/net/people/plgamn/env/lib/python3.6/site-packages/neptune/internal/cli/commands/executing/experiment_executor.py", line 308, in _spawn_hardware_metric_reporting_thread
    gauge_mode=self._hardware_metrics_gauge_mode, experiment_id=experiment_id, reference_timestamp=time.time())
  File "/net/people/plgamn/env/lib/python3.6/site-packages/neptune/internal/cli/hardware/metrics/service/metric_service_factory.py", line 35, in create
    ).create(gauge_mode=gauge_mode)
  File "/net/people/plgamn/env/lib/python3.6/site-packages/neptune/internal/cli/hardware/resources/system_resource_info_factory.py", line 34, in create
    return self.__create_whole_system_resource_info()
  File "/net/people/plgamn/env/lib/python3.6/site-packages/neptune/internal/cli/hardware/resources/system_resource_info_factory.py", line 45, in __create_whole_system_resource_info
    gpu_memory_amount_bytes=self.__gpu_monitor.get_top_card_memory_in_bytes()
  File "/net/people/plgamn/env/lib/python3.6/site-packages/neptune/internal/cli/hardware/gpu/gpu_monitor.py", line 35, in get_top_card_memory_in_bytes
    ]), default=0)
  File "/net/people/plgamn/env/lib/python3.6/site-packages/neptune/internal/cli/hardware/gpu/gpu_monitor.py", line 40, in __nvml_get_or_else
    return getter()
  File "/net/people/plgamn/env/lib/python3.6/site-packages/neptune/internal/cli/hardware/gpu/gpu_monitor.py", line 34, in
    for card_index in range(nvmlDeviceGetCount())
ValueError: max() arg is an empty sequence

I do not have a reproducible example, as all other runs succeeded.


#2

Hi,

This seems to be an issue with the nvidia-ml-py3 package: for some reason it detects NVIDIA correctly, but reports that no GPUs are available.

Does this happen consistently on the same machine?

The way we handle this is an obvious bug on our side; we’ll release a fix first thing on Monday.
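
For illustration, here is a minimal sketch of the failure mode seen in the traceback, assuming the monitor takes max() over a per-card list that ends up empty when the device count is 0. The function names and the memory_of callback are hypothetical stand-ins, not the actual CLI code, and the real fix may look different.

```python
from typing import Callable


def top_card_memory_unsafe(device_count: int, memory_of: Callable[[int], int]) -> int:
    # With device_count == 0 the list is empty and max() raises ValueError.
    return max([memory_of(index) for index in range(device_count)])


def top_card_memory_safe(device_count: int, memory_of: Callable[[int], int]) -> int:
    # Passing default=0 to max() itself covers the "no GPUs reported" case.
    return max((memory_of(index) for index in range(device_count)), default=0)


if __name__ == "__main__":
    # Hypothetical per-card memory sizes in bytes.
    sizes = [8 * 1024 ** 3, 16 * 1024 ** 3]
    print(top_card_memory_safe(len(sizes), lambda index: sizes[index]))
    print(top_card_memory_safe(0, lambda index: sizes[index]))
    try:
        top_card_memory_unsafe(0, lambda index: sizes[index])
    except ValueError as error:
        print("unsafe variant failed:", error)
```

The point of the sketch is that a fallback given to an outer wrapper does not help if max() raises on an empty sequence first; the default has to be given to max() directly, or the ValueError has to be caught.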

Best regards,
Hubert Jaworski


#3

Hi,

We just pushed a new version of our CLI (2.8.24) to PyPI.

It should fix this issue.

Best regards,
Hubert Jaworski


#4

Thank you for the fix. The bug was very rare and was probably caused by some unusual hardware configuration. I can’t reproduce it, though, as the node the computation ran on was randomly assigned by Slurm.

Best regards,
Andrzej