Experiment Crashed


#1

I had a cluster running an experiment and right before it finished, it crashed with the following error:

Additional info: Found failed worker for experiment 53e438f3-1165-45ed-9d28-94a09c312fdf in state running.

Any idea what causes this? Definitely kind of bummed I paid for 12 hours of server time, only to have it crash and not get any results back, so I don’t want to risk having it happen again.

Thanks!


#2

Hi,

Does your experiment produce a lot of files/logs/output?

Our logs indicate that your worker ran out of disk space and was killed because of it.
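
If output size turns out to be the cause, one rough way to catch it earlier is to check free disk space from inside the experiment script and stop cleanly before the worker gets killed. Below is a minimal sketch in Python, assuming the experiment can call a helper between steps. The function names (`free_bytes`, `dir_size_bytes`, `check_disk`) and the 1 GiB threshold are hypothetical, chosen for illustration only, and are not part of any platform API.

```python
import os
import shutil


def free_bytes(path="."):
    """Return the free bytes on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free


def dir_size_bytes(root):
    """Sum the sizes of all regular files under `root` (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue
            try:
                total += os.path.getsize(full)
            except OSError:
                # The file may have been removed while we were walking.
                pass
    return total


def check_disk(path=".", min_free_bytes=1024**3):
    """Raise if free space is below the threshold (assumed default: 1 GiB)."""
    free = free_bytes(path)
    if free < min_free_bytes:
        raise RuntimeError(
            f"Only {free} bytes free on the disk holding {path!r}; "
            f"need at least {min_free_bytes}"
        )
    return free
```

Calling `check_disk()` every few steps and logging `dir_size_bytes()` for the output directory should show whether logs or checkpoints are growing faster than expected.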

If you need more detailed info about this crash, I can PM you our Kubernetes logs.

Best regards,
Hubert