Neptune - Fastai integration

#1

Hi!

I am trying to set up some analysis using the fastai library (PyTorch based), following this post. Before the training, I am trying to run learn.lr_find() and learn.recorder.plot(), but I am getting an error that does not appear in my Jupyter Notebook:

Traceback (most recent call last):
  File "/users/genomics/jgibert/Scripts/Lymphoma_Fastai_Neptune.py", line 25, in <module>
    learn.lr_find()
  File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/fastai/train.py", line 32, in lr_find
    learn.fit(epochs, start_lr, callbacks=[cb], wd=wd)
  File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/fastai/basic_train.py", line 196, in fit
    fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
  File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/fastai/basic_train.py", line 107, in fit
    if cb_handler.on_epoch_end(val_loss): break
  File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/fastai/callback.py", line 316, in on_epoch_end
    self('epoch_end', call_mets = val_loss is not None)
  File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/fastai/callback.py", line 250, in __call__
    for cb in self.callbacks: self._call_and_update(cb, cb_name, **kwargs)
  File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/fastai/callback.py", line 240, in _call_and_update
    new = ifnone(getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs), dict())
  File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/neptunecontrib/monitoring/fastai.py", line 91, in on_epoch_end
    self._exp.send_metric(self._prefix + metric_name, float(metric_value))
TypeError: float() argument must be a string or a number, not 'NoneType'

Any idea why this is happening? Also, how could I export the plot to Neptune? Should I use this? And the .pth file?

Thanks!
J

#2

@Joan_Gibert it seems that there is a bug there.
When I implemented it, I thought of lr_find() as a pre-experiment phase, so support for .lr_find() is not there at the moment.
I would suggest finding the optimal lr (without the Neptune callback) and then calling create_experiment with the Neptune callback.
I know it is not ideal. I will work on a fix ASAP and let you know when it is done.

When it comes to sending plots and .pth files, it is actually really simple.
Just go:

neptune.send_image('chart_channel', 'image.jpeg')
neptune.send_artifact('file.pth')

Remember that you can send many images to one channel if you want, as was done in this experiment for example.
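
For instance, here is a minimal sketch that sends several files to the same channel (the channel and file names are just placeholders):

for epoch in range(3):
    neptune.send_image('predictions', f'predictions_epoch_{epoch}.png')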

I hope this helps!

#3

I think the best way to approach it, which I have just tested, is:

Create a learner, run learn.lr_find(), and plot the result with learn.recorder.plot():

learn = create_cnn(data, models.resnet18, metrics=accuracy)
learn.lr_find()
learn.recorder.plot()

Figure out the best learning rate from the plot, say lr=1e-2.
Create a Neptune experiment with an lr parameter and add the Neptune callback to the learner:

with neptune.create_experiment(params={'lr': 1e-2}):
    neptune_monitor = NeptuneMonitor()
    learn.callbacks.append(neptune_monitor)
    learn.fit_one_cycle(20, 1e-2)

Now you can even record the output from your lr_find() analysis, and everything stays simple and clean.
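
For example, here is a minimal sketch of sending the lr_find plot to Neptune (it assumes the plot lands on matplotlib's current figure, which the fastai star import exposes as plt, and that the send call runs inside the create_experiment block):

learn.recorder.plot()                              # after learn.lr_find()
plt.savefig('lr_find.png')
neptune.send_image('lr_find_plot', 'lr_find.png')  # inside the experiment context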

Do you like this answer @Joan_Gibert?

#4

Hi @jakub_czakon,

Thanks for your fast reply. The thing is that I am running the code on a Slurm cluster, so I cannot easily work with the learn.recorder.plot() function.

In any case I will give it a try :smile:
Cheers!

#5

Hmm,

In that case, you could probably just find the lr automatically as explained in this post:

from fastai.vision import *   # the star import provides Learner and np

def find_appropriate_lr(model:Learner, lr_diff:int = 15, loss_threshold:float = .05, adjust_value:float = 1) -> float:
    # Run the Learning Rate Finder
    model.lr_find()

    # Get loss values and their corresponding gradients, and get lr values
    losses = np.array(model.recorder.losses)
    assert(lr_diff < len(losses))
    loss_grad = np.gradient(losses)
    lrs = model.recorder.lrs

    # Search for index in gradients where loss is lowest before the loss spike
    # Initialize right and left idx using the lr_diff as a spacing unit
    # Set the local min lr as -1 to signify if threshold is too low
    local_min_lr = -1
    r_idx = -1
    l_idx = r_idx - lr_diff
    while (l_idx >= -len(losses)) and (abs(loss_grad[r_idx] - loss_grad[l_idx]) > loss_threshold):
        local_min_lr = lrs[l_idx]
        r_idx -= 1
        l_idx -= 1

    lr_to_use = local_min_lr * adjust_value

    return lr_to_use

and then pass it as an experiment parameter. Basically, the full code would read along the lines of:

from fastai.vision import *
import neptune
from neptunecontrib.monitoring.fastai import NeptuneMonitor

def find_appropriate_lr(model:Learner, lr_diff:int = 15, loss_threshold:float = .05, adjust_value:float = 1) -> float:
    # Run the Learning Rate Finder
    model.lr_find()

    # Get loss values and their corresponding gradients, and get lr values
    losses = np.array(model.recorder.losses)
    assert(lr_diff < len(losses))
    loss_grad = np.gradient(losses)
    lrs = model.recorder.lrs

    # Search for index in gradients where loss is lowest before the loss spike
    # Initialize right and left idx using the lr_diff as a spacing unit
    # Set the local min lr as -1 to signify if threshold is too low
    local_min_lr = -1
    r_idx = -1
    l_idx = r_idx - lr_diff
    while (l_idx >= -len(losses)) and (abs(loss_grad[r_idx] - loss_grad[l_idx]) > loss_threshold):
        local_min_lr = lrs[l_idx]
        r_idx -= 1
        l_idx -= 1

    lr_to_use = local_min_lr * adjust_value

    return lr_to_use

neptune.init(project_qualified_name='USER_NAME/PROJECT_NAME')

mnist = untar_data(URLs.MNIST_TINY)
tfms = get_transforms(do_flip=False)

data = (ImageItemList.from_folder(mnist)
        .split_by_folder()
        .label_from_folder()
        .transform(tfms, size=32)
        .databunch()
        .normalize(imagenet_stats))

learn = create_cnn(data, models.resnet18, metrics=accuracy)
selected_lr = find_appropriate_lr(learn)

with neptune.create_experiment(params={'lr': selected_lr}):
    learn.callbacks.append(NeptuneMonitor())
    learn.fit_one_cycle(20, selected_lr)

I have just tested it, and it seems to work pretty well.

#6

Has this worked for you @Joan_Gibert?

#7

Hi @jakub_czakon,

That worked perfectly. Thank you! :slight_smile:

#8

Hi @jakub_czakon,

Following up on this issue a bit, I am wondering if you can send two different training runs to Neptune in the same job.

I am thinking of something like this (roughly following how JH teaches training with fastai):
1) Use a pretrained model with a given lr (found with find_appropriate_lr) - 10 epochs
2) Unfreeze the first layers with learn.unfreeze()
3) Retrain with another lr - 10 more epochs

Is it feasible to record this in Neptune, or can I only record the unfreeze part?

Thanks!

#9

Sure @Joan_Gibert, you can.

I cannot test it right now, but I am pretty sure something like this should work:


with neptune.create_experiment() as exp1:
    neptune_monitor_exp1 = NeptuneMonitor(experiment=exp1)
    learner.callbacks.append(neptune_monitor_exp1)
    learner.fit()

with neptune.create_experiment() as exp2:
    learner.unfreeze()
    neptune_monitor_exp2 = NeptuneMonitor(experiment=exp2)
    learner.callbacks[-1] = neptune_monitor_exp2  # or something else that swaps the callback to a new one
    learner.fit()
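
If indexing into learner.callbacks feels fragile, an explicit swap should do the same thing (just a sketch; in fastai v1, learner.callbacks is a plain Python list):

learner.callbacks.remove(neptune_monitor_exp1)
learner.callbacks.append(neptune_monitor_exp2)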

I hope it helps.

#10

Hi @jakub_czakon,

Following your recommendations, I can run both cycles in the same script. However, when I tried to upload the stage .pth file (both before and after the unfreezing) I got a connection error:

Traceback (most recent call last):
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 304, in _send_until_done
        return self.connection.send(data)
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1540, in send
        self._raise_ssl_error(self._ssl, result)
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1448, in _raise_ssl_error
        raise SysCallError(errno, errorcode.get(errno))
    OpenSSL.SSL.SysCallError: (32, 'EPIPE')

During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
        chunked=chunked)
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/urllib3/connectionpool.py", line 357, in _make_request
        conn.request(method, url, **httplib_request_kw)
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/http/client.py", line 1239, in request
        self._send_request(method, url, body, headers, encode_chunked)
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/http/client.py", line 1285, in _send_request
        self.endheaders(body, encode_chunked=encode_chunked)
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/http/client.py", line 1234, in endheaders
        self._send_output(message_body, encode_chunked=encode_chunked)
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/http/client.py", line 1065, in _send_output
        self.send(chunk)
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/http/client.py", line 986, in send
        self.sock.sendall(data)
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 316, in sendall
        sent = self._send_until_done(data[total_sent:total_sent + SSL_WRITE_BLOCKSIZE])
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 311, in _send_until_done
        raise SocketError(str(e))
    OSError: (32, 'EPIPE')

During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
        timeout=timeout
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/urllib3/connectionpool.py", line 639, in urlopen
        _stacktrace=sys.exc_info()[2])
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/urllib3/util/retry.py", line 357, in increment
        raise six.reraise(type(error), error, _stacktrace)
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/urllib3/packages/six.py", line 685, in reraise
        raise value.with_traceback(tb)
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
        chunked=chunked)
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/urllib3/connectionpool.py", line 357, in _make_request
        conn.request(method, url, **httplib_request_kw)
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/http/client.py", line 1239, in request
        self._send_request(method, url, body, headers, encode_chunked)
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/http/client.py", line 1285, in _send_request
        self.endheaders(body, encode_chunked=encode_chunked)
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/http/client.py", line 1234, in endheaders
        self._send_output(message_body, encode_chunked=encode_chunked)
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/http/client.py", line 1065, in _send_output
        self.send(chunk)
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/http/client.py", line 986, in send
        self.sock.sendall(data)
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 316, in sendall
        sent = self._send_until_done(data[total_sent:total_sent + SSL_WRITE_BLOCKSIZE])
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 311, in _send_until_done
        raise SocketError(str(e))
    urllib3.exceptions.ProtocolError: ('Connection aborted.', OSError("(32, 'EPIPE')",))

During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/neptune/client.py", line 48, in wrapper
        return func(*args, **kwargs)
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/neptune/client.py", line 555, in upload_experiment_output
        experiment=experiment)
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/neptune/client.py", line 649, in _upload_loop
        ret = self._upload_loop_chunk(fun, part, data, **kwargs)
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/neptune/client.py", line 668, in _upload_loop_chunk
        **kwargs)
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/neptune/client.py", line 685, in _upload_raw_data
        return session.send(session.prepare_request(request))
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/requests/sessions.py", line 646, in send
        r = adapter.send(request, **kwargs)
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/requests/adapters.py", line 498, in send
        raise ConnectionError(err, request=request)
    requests.exceptions.ConnectionError: ('Connection aborted.', OSError("(32, 'EPIPE')",))

During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/users/genomics/jgibert/Scripts/Lymphoma_Fastai_Neptune_3.py", line 73, in <module>
        exp1.send_artifact('/users/genomics/jgibert/Lymphoma_DNN/models/stage-1.pth')
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/neptune/experiments.py", line 288, in send_artifact
        experiment=self)
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/neptune/internal/storage/storage_utils.py", line 127, in upload_to_storage
        upload_api_fun(**dict(kwargs, data=file_chunk_stream))
      File "/soft/EB_repo/devel/programs/goolf/1.7.20/Python/3.6.2/lib/python3.6/site-packages/neptune/client.py", line 53, in wrapper
        raise ConnectionLost()
    neptune.api_exceptions.ConnectionLost: Connection lost. Please try again.

A snippet of the code for the first part (pre-unfreezing):

learn = create_cnn(data, models.resnet50, metrics=accuracy, pretrained=True)

with neptune.create_experiment(name='20190506-Part1', params={'bs': 64, 'lr': 0.0190546, 'nr_epochs': 20, 'pretrain': True}) as exp1:
    neptune_monitor_exp1 = NeptuneMonitor(experiment=exp1)
    learn.callbacks.append(neptune_monitor_exp1)
    learn.fit_one_cycle(20,0.0190546)
    learn.save('stage-1')
    exp1.send_artifact('/users/genomics/jgibert/Lymphoma_DNN/models/stage-1.pth')

Thanks!

#11

I have just tried something like this:

from fastai.vision import *
import neptune
from neptunecontrib.monitoring.fastai import NeptuneMonitor

neptune.init(project_qualified_name='jakub-czakon/examples')

mnist = untar_data(URLs.MNIST_TINY)
tfms = get_transforms(do_flip=False)

data = (ImageItemList.from_folder(mnist)
        .split_by_folder()
        .label_from_folder()
        .transform(tfms, size=32)
        .databunch()
        .normalize(imagenet_stats))

with neptune.create_experiment(name='frozen') as exp1:
    neptune_monitor_1 = NeptuneMonitor(experiment=exp1)
    learn = create_cnn(data, models.resnet18, metrics=accuracy)
    learn.callbacks.append(neptune_monitor_1)
    learn.fit_one_cycle(10, 1e-2)
    learn.save('/mnt/ml-team/homes/jakub.czakon/code/blogs/fastai/stage-1')
    exp1.send_artifact('/mnt/ml-team/homes/jakub.czakon/code/blogs/fastai/stage-1.pth')

with neptune.create_experiment(name='unfrozen') as exp2:
    neptune_monitor_2 = NeptuneMonitor(experiment=exp2)
    learn.unfreeze()
    learn.callbacks[-1] = neptune_monitor_2
    learn.fit_one_cycle(10, 1e-3)
    learn.save('/mnt/ml-team/homes/jakub.czakon/code/blogs/fastai/stage-2')
    exp2.send_artifact('/mnt/ml-team/homes/jakub.czakon/code/blogs/fastai/stage-2.pth')

…and it worked.

I think it must have been some random connection error on our part.

Here are the experiments:

#12

Hi @jakub_czakon,

I ran the very same code above, changing only my paths, and it didn't work:

from fastai.vision import *
import neptune
from neptunecontrib.monitoring.fastai import NeptuneMonitor
import os
neptune.init(api_token=os.environ["TOKEN"], project_qualified_name='tato2014/Lymphoma')

mnist = untar_data(URLs.MNIST_TINY)
tfms = get_transforms(do_flip=False)

data = (ImageList.from_folder(mnist)
        .split_by_folder()
        .label_from_folder()
        .transform(tfms, size=32)
        .databunch()
        .normalize(imagenet_stats))

with neptune.create_experiment(name='frozen') as exp1:
    neptune_monitor_1 = NeptuneMonitor(experiment=exp1)
    learn = create_cnn(data, models.resnet18, metrics=accuracy)
    learn.callbacks.append(neptune_monitor_1)
    learn.fit_one_cycle(10, 1e-2)
    learn.save('/users/genomics/jgibert/stage-1')
    exp1.send_artifact('/users/genomics/jgibert/stage-1.pth')

with neptune.create_experiment(name='unfrozen') as exp2:
    neptune_monitor_2 = NeptuneMonitor(experiment=exp2)
    learn.unfreeze()
    learn.callbacks[-1] = neptune_monitor_2
    learn.fit_one_cycle(10, 1e-3)
    learn.save('/users/genomics/jgibert/stage-2')
    exp2.send_artifact('/users/genomics/jgibert/stage-2.pth')

Any idea?
Thanks

#13

It may be a typo here:

data = (ImageList.from_folder(mnist)

vs what worked for me

data = (ImageItemList.from_folder(mnist)

On a separate note, I prefer to put my token in the NEPTUNE_API_TOKEN environment variable, because then I can do this:

neptune.init(project_qualified_name='jakub-czakon/examples')

instead of this

import os
neptune.init(api_token=os.environ["TOKEN"], 
             project_qualified_name='jakub-czakon/examples')

Which versions of:

  • neptune-client
  • neptune-contrib
  • fastai

do you have?
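
For reference, one quick way to check is something like this (a sketch, assuming the packages were installed with pip):

import pkg_resources

for pkg in ('neptune-client', 'neptune-contrib', 'fastai'):
    print(pkg, pkg_resources.get_distribution(pkg).version)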

#14

I ran again with ImageItemList and it worked:

My versions are:

  • neptune-client==0.2.10 (worked on 0.2.8 as well)
  • neptune-contrib==0.5.1 (worked on 0.4.1 as well)
  • fastai==1.0.45

#15

OK, maybe it is something on my side. I will check and update.

Versions:
neptune-client==0.2.10
neptune-contrib==0.4.1
fastai==1.0.51

ImageItemList was changed in the latest version of fastai.

Thanks!

#16

Just tried it out with the newest fastai==1.0.52 version (changed to ImageList) and it worked just fine.

Experiments: