"Websocket connection lost. Retrying" For 3+ Hours


#1

Hi, I’m running an experiment for the Kaggle Salt Identification Competition. I found multiple “Websocket connection lost. Retrying…” messages in the stdout file, and the most recent one has lasted for over three hours. Is this situation normal? Do I need to abort this experiment (it has been running for 16+ hours :frowning: )?

Thanks,
Sonia


#2

Hi,

This situation is definitely not normal; it is a bug on our side.

However, if you are not using actions, this should not affect your experiment at all (apart from the ugly logs).

We’ll look into it.

Best regards,
Hubert


#3

Hi @SoniaZhang,

We looked deeper into your issue.

First of all, the “Websocket connection lost.” message is not related to the problem. There were just temporary network problems a few times during the experiment’s lifetime, but Neptune handled them just fine. We should probably add a message saying that the connection was restored, so those warnings won’t be confusing.

Secondly, if you look at the monitoring tab in your experiment’s dashboard, you can see that around the 7th hour of the experiment’s lifetime the GPU utilization dropped to 0%, and from that moment on the CPU utilization stayed almost constant at about 12.5%.
12.5% of 8 cores is exactly 100% of one core. So it looks like your code ran into an infinite loop, stopped doing any calculations on the GPU, and therefore stopped producing any output.
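
To spell out the arithmetic, here is a minimal Python sketch. The function names `busy_core_equivalent` and `looks_stalled` and the `tolerance` parameter are made up for illustration and are not part of Neptune; the check is only a rough heuristic, assuming the monitoring tab reports CPU utilization averaged over all cores.

```python
def busy_core_equivalent(total_cpu_percent: float, n_cores: int) -> float:
    """Convert machine-wide CPU utilization (%) into a number of fully busy cores."""
    return total_cpu_percent / 100.0 * n_cores


def looks_stalled(gpu_percent: float, cpu_percent: float, n_cores: int,
                  tolerance: float = 0.1) -> bool:
    """Hypothetical heuristic: GPU idle while roughly one CPU core is pinned.

    This pattern is consistent with a single-threaded busy loop,
    but it does not prove that one exists.
    """
    cores = busy_core_equivalent(cpu_percent, n_cores)
    return gpu_percent == 0 and abs(cores - 1.0) <= tolerance


# Values from the monitoring tab described above: 12.5% CPU on 8 cores, GPU at 0%.
print(busy_core_equivalent(12.5, 8))  # 1.0, i.e. one core at 100%
print(looks_stalled(0.0, 12.5, 8))    # True
```

If the metrics look like this for a long time and nothing new appears in stdout, the process is most likely spinning in a single thread rather than training.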


#4

I see. Thank you very much!