Experiment crashes with "Process exited with return code -9"


#1

Hi,

My experiment keeps failing for a reason I can’t understand - https://app.neptune.ml/-/dashboard/experiment/c66c0c4f-914d-4706-b208-4f3f85d2ac57/details.

In the logs we have:

2018-12-24 09:00:31,504 neptune.internal.cli.threads.hardware_metric_reporting_thread ERROR    hardware_metric_reporting_thread.py:41 - run() Server timed out.

but I’m not sure it’s related. The stderr logs show nothing.

Please advise.

Thanks


#2

Hi,

It seems your process is receiving a kill signal: a return code of -9 means the process was terminated by signal 9 (SIGKILL).
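
For context (this is plain Python, not our internal code): on Linux, a negative return code from a child process means it was terminated by that signal number, so -9 corresponds to SIGKILL. "train.py" below is just a placeholder for your experiment’s entry point:

    import signal
    import subprocess

    # Run the experiment script as a child process.
    result = subprocess.run(["python", "train.py"])
    if result.returncode < 0:
        # Negative return code: the child was killed by a signal.
        sig = signal.Signals(-result.returncode)
        print(f"killed by {sig.name} (signal {sig.value})")  # for -9: SIGKILL (9)
    else:
        print(f"exited with code {result.returncode}")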

We’ll look into why this happens and get back to you ASAP.

Best regards,
Hubert Jaworski


#3

Hi,

We analyzed the logs and found that your process is being killed because it runs out of RAM.

Since I’m not on our DevOps team, I can’t give you a full analysis, but I’m happy to PM you the internal logs relevant to your experiment.

I noticed you are using an m-8k80 worker type (with GPUs) with a base-cpu-py3 environment (without GPU support). Is this intentional?

Best regards,
Hubert Jaworski


#4

Ok, that makes sense. Any reason why the details tab doesn’t mark the job as killed due to running out of memory? Some jobs do show this, and since this one didn’t, it was misleading.

I was trying to increase the available memory, but to be honest I’m a bit confused by the pricing page. Also, scikit-learn does not use the GPU as far as I know. I just need a CPU-only worker with about 50 GB RAM. What’s the correct combo for this?

Thanks


#5

I’m not an expert and might be wrong, but here’s what I think happened:

We use Kubernetes for scheduling experiments in the cloud, but the logs suggest that it was the Linux kernel of the underlying virtual machine (its out-of-memory killer) that killed your process.
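
For the curious, this is roughly how such a kill can be confirmed on the node itself (a sketch, not our exact tooling, and it requires access to the VM, which experiment containers don’t have). The kernel logs an out-of-memory message that shows up in dmesg:

    import subprocess

    # Read the kernel ring buffer with human-readable timestamps.
    dmesg = subprocess.run(["dmesg", "-T"], stdout=subprocess.PIPE,
                           universal_newlines=True).stdout
    for line in dmesg.splitlines():
        # OOM-killer activity is logged with messages like "Out of memory: ...".
        if "out of memory" in line.lower() or "oom-killer" in line.lower():
            print(line)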

Our tools detected it as a “failed experiment” because, from our point of view, the process simply ended with a non-zero exit code. Reporting only a generic failure instead of an out-of-memory kill is obviously a bug and needs fixing on our side.

Kubernetes usually reports this differently, informing us that an experiment used too much RAM and was killed.
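
To illustrate the difference (a generic Kubernetes sketch, not our internal code): when Kubernetes itself enforces a memory limit, the container’s terminated state carries the explicit reason “OOMKilled” and exit code 137 (128 + 9), which is easy to detect, whereas a kill coming straight from the VM’s kernel doesn’t surface that way. The pod and namespace names below are placeholders:

    from kubernetes import client, config  # official Kubernetes Python client

    config.load_kube_config()
    v1 = client.CoreV1Api()

    pod = v1.read_namespaced_pod(name="experiment-pod", namespace="default")
    for status in pod.status.container_statuses or []:
        terminated = status.state.terminated
        if terminated and terminated.reason == "OOMKilled":
            # Memory limit enforced by Kubernetes: explicit reason, exit code 137.
            print(status.name, terminated.reason, terminated.exit_code)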

As for the monitoring tab, you should have a graph there showing memory usage, but as you found in the logs, the hardware_metric_reporting_thread failed to send the final memory usage statistics.
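
If it helps to keep an eye on memory yourself, the idea behind that thread is simple: periodically sample memory usage in the background and report it. A rough standalone sketch using psutil (not our actual implementation):

    import threading
    import time

    import psutil

    def report_memory(interval_seconds=10):
        # Sample process and system memory usage at a fixed interval.
        process = psutil.Process()
        while True:
            rss_gb = process.memory_info().rss / 1024 ** 3
            used_percent = psutil.virtual_memory().percent
            print(f"process RSS: {rss_gb:.2f} GB, system memory used: {used_percent}%")
            time.sleep(interval_seconds)

    # Run next to your training code as a background (daemon) thread.
    threading.Thread(target=report_memory, daemon=True).start()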

Regarding worker pricing:

In your case, you would get the same results with an m worker as with an m-8k80 worker, since you don’t use the GPUs (and m is cheaper).

For example, every xs worker has 13 GB RAM, and every m worker has 52 GB RAM.

If you want to increase RAM, I suggest using an l (lowercase L) worker with 104 GB RAM, or even an xl worker with 208 GB RAM.
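
To make the choice concrete, here is a tiny sketch (not an official tool) that picks the smallest worker for a given RAM requirement, using only the sizes listed above:

    # Worker RAM sizes as listed above, in GB.
    WORKER_RAM_GB = {"xs": 13, "m": 52, "l": 104, "xl": 208}

    def pick_worker(required_ram_gb):
        # Return the smallest worker type with at least the requested RAM.
        for name, ram in sorted(WORKER_RAM_GB.items(), key=lambda item: item[1]):
            if ram >= required_ram_gb:
                return name
        raise ValueError("No worker type has enough RAM")

    print(pick_worker(50))  # -> 'm' (52 GB), a CPU-only fit for your ~50 GB scikit-learn job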

I’ll let our team know that the pricing page needs improving for clarity.

Best regards,
Hubert


#6

Thanks for clearing it up!

Cheers


#7

Glad I could help! 🙂