Experiment crashes on cloud after several epochs


#1

I was using s2k80 with pytorch-0.3.1-gpu-py3 and saw the following in the log:

01:13 Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/6HE7P24FYMW6H7KQOOI2RBIA74:/var/lib/docker/overlay2/l/IEI225RJYVK4HW7CNCEOWMWGCL:/var/lib/docker/overlay2/l/EAZ5HN6WAFWIVXTCUDDY5K627T:/var/lib/docker/overlay2/l/BP5H2QMZ2VZKCMA5YGRUJCA7W7:/var/lib/docker/overlay2/l/AKP3YCDOXAPZLHCIRZNAU3YUZ2:/var/lib/docker/overlay2/l/2R4GN4D6AQ7CMZJNKUEPSHKSYY:/var/lib/docker/overlay2/l/WUVLVF3G7PHL2JQ7NPDHFTKJVJ:/var/lib/docker/overlay2/l/FLSK66F4BFAEODEA2NVZSCOLRN:/var/lib/docker/overlay2/l/KHSWEOYMLOSMH’
01:13 Unexpected end of /proc/mounts line `APRJ564DPPCZO:/var/lib/docker/overlay2/l/3KI4PT6ABGOR3XJW656JCKLT6V:/var/lib/docker/overlay2/l/Y5BIVCOICZUHQCCKKUTVTPV5UX:/var/lib/docker/overlay2/l/YFA4QAVYZXX2AEWOVEJDFANHO4:/var/lib/docker/overlay2/l/VWIE3DENY6LQ5BPWL62SJBCVMY:/var/lib/docker/overlay2/l/ANBAGCLY4IY7SPW7T36Y7JY4ZK:/var/lib/docker/overlay2/l/TXWHAMHQ2VNBVQMWYTXSPZX22Z:/var/lib/docker/overlay2/l/IC6ODS662TSZ6HZ5G2A5MRYF3D:/var/lib/docker/overlay2/l/LFGXJFL5GDPZRSUASLIHKFRCTX:/var/lib/docker/overlay2/l/6Y7D5B367CHV3FEDISNNMPZ3HV:/var/lib/do’
01:13 Unexpected end of /proc/mounts line `cker/overlay2/l/2GWAEQSQNQJBAFVPNLPVVL4NS3:/var/lib/docker/overlay2/l/LEWZYGERNFUC4PGY23MRI4MIYV:/var/lib/docker/overlay2/l/YG6VXVQR6NRRREUAAKZZPBFVZQ:/var/lib/docker/overlay2/l/BHWJ73URGPZHY3B6TV6Y74GPR6:/var/lib/docker/overlay2/l/HU34IBVH4GKAOG33BYUY55VHL3:/var/lib/docker/overlay2/l/HTYHDJFFARZXCVXSEDCIN2F3SO:/var/lib/docker/overlay2/l/KKQTFQWSPOUT22DKLUCWCWP5XF:/var/lib/docker/overlay2/l/BLVI5M6PP634EPJVJXJPYW7UXP,upperdir=/var/lib/docker/overlay2/bef3e1b41fdfa23796959832d2f271041cdcc8948d1c948c75ec6cbe’

A Google search leads to this thread: https://devtalk.nvidia.com/default/topic/1027077/container-pytorch/-quot-unexpected-end-of-proc-mounts-line-overlay-quot-on-p3-8xlarge/

The crash looks very similar to the one described at the link above. The string length is exactly 512. Is this a bug in Neptune’s platform?
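
For anyone who wants to check the 512-character observation inside their own container, here is a minimal sketch. It assumes, without confirmation from the linked thread, that whatever parses /proc/mounts reads it in fixed 512-character chunks. The names `CHUNK`, `long_mount_lines`, and `split_like_fixed_buffer` are hypothetical helpers and not part of any library.

```python
from pathlib import Path

# Assumed chunk size, inferred only from the observed string length in the log.
CHUNK = 512


def long_mount_lines(path="/proc/mounts", limit=CHUNK):
    """Return (line_length, mount_point) for each line at least `limit` bytes long."""
    results = []
    for raw in Path(path).read_bytes().splitlines():
        if len(raw) >= limit:
            fields = raw.split(b" ")
            # The second field of a mounts entry is the mount point.
            mount_point = fields[1].decode(errors="replace") if len(fields) > 1 else "?"
            results.append((len(raw), mount_point))
    return results


def split_like_fixed_buffer(line, limit=CHUNK):
    """Split one long line into the pieces a fixed-size read would produce."""
    return [line[i:i + limit] for i in range(0, len(line), limit)]


if __name__ == "__main__":
    mounts = Path("/proc/mounts")
    if mounts.exists():
        for length, mount_point in long_mount_lines(str(mounts)):
            print(f"{mount_point}: {length} bytes")
```

If the overlay entry for `/` shows up here with a length well above 512, that would be consistent with one long mount line being reported in several pieces, as in the log above.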


#2

Hi @lih,

Thank you for reporting this issue.

The logs you pasted are quite likely unrelated to the crash. They appeared at the beginning of your experiment and are related to the NVIDIA library bug you linked to.

We’re still investigating the actual cause of the crash, but in the meantime, we’ve added bonus credits to your account to reimburse you for the crashed experiments.

Best,
Piotr