I was trying to train the OpenNMT example on a Mac with CPU, using the following steps:

Env: Python 3.5, PyTorch 0.1.10.1

  • Preprocess the data, shrinking src and tgt to only the first 100 sentences by inserting the following lines after line 133 in preprocess.py:
        # keep only the first 100 sentence pairs for a quick CPU test
        shrink = True
        if shrink:
            src = src[0:100]
            tgt = tgt[0:100]
    

    Then I ran:

    python preprocess.py -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo
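
    An equivalent way to shrink the data, without modifying preprocess.py, is to write 100-sentence copies of the raw training files and point -train_src/-train_tgt at those. A minimal sketch (the .small.txt names are just my choice):

        # make 100-sentence copies of the training files for a quick CPU test
        for name in ("src-train.txt", "tgt-train.txt"):
            path = "data/" + name
            with open(path) as f:
                head = f.readlines()[:100]
            with open(path.replace(".txt", ".small.txt"), "w") as f:
                f.writelines(head)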

  • Then I trained using python train.py -data data/demo.train.pt -save_model demo_model

    It ran OK for a while before an error appeared:

    (dlnd-tf-lab)  ->python train.py -data data/demo.train.pt -save_model demo_model
    Namespace(batch_size=64, brnn=False, brnn_merge='concat', curriculum=False, data='data/demo.train.pt', dropout=0.3, epochs=13, extra_shuffle=False, gpus=[], input_feed=1, layers=2, learning_rate=1.0, learning_rate_decay=0.5, log_interval=50, max_generator_batches=32, max_grad_norm=5, optim='sgd', param_init=0.1, pre_word_vecs_dec=None, pre_word_vecs_enc=None, rnn_size=500, save_model='demo_model', start_decay_at=8, start_epoch=1, train_from='', train_from_state_dict='', word_vec_size=500)
    Loading data from 'data/demo.train.pt'
     * vocabulary size. source = 24999; target = 35820
     * number of training sentences. 100
     * maximum batch size. 64
    Building model...
    * number of parameters: 58121320
    NMTModel (
      (encoder): Encoder (
        (word_lut): Embedding(24999, 500, padding_idx=0)
        (rnn): LSTM(500, 500, num_layers=2, dropout=0.3)
      )
      (decoder): Decoder (
        (word_lut): Embedding(35820, 500, padding_idx=0)
        (rnn): StackedLSTM (
          (dropout): Dropout (p = 0.3)
          (layers): ModuleList (
            (0): LSTMCell(1000, 500)
            (1): LSTMCell(500, 500)
          )
        )
        (attn): GlobalAttention (
          (linear_in): Linear (500 -> 500)
          (sm): Softmax ()
          (linear_out): Linear (1000 -> 500)
          (tanh): Tanh ()
        )
        (dropout): Dropout (p = 0.3)
      )
      (generator): Sequential (
        (0): Linear (500 -> 35820)
        (1): LogSoftmax ()
      )
    )
    Train perplexity: 29508.9
    Train accuracy: 0.0216306
    Validation perplexity: 4.50917e+08
    Validation accuracy: 3.57853
    Train perplexity: 1.07012e+07
    Train accuracy: 0.06198
    Validation perplexity: 103639
    Validation accuracy: 0.944334
    Train perplexity: 458795
    Train accuracy: 0.031198
    Validation perplexity: 43578.2
    Validation accuracy: 3.42942
    Train perplexity: 144931
    Train accuracy: 0.0432612
    Validation perplexity: 78366.8
    Validation accuracy: 2.33598
    Decaying learning rate to 0.5
    Train perplexity: 58696.8
    Train accuracy: 0.0278702
    Validation perplexity: 14045.8
    Validation accuracy: 3.67793
    Decaying learning rate to 0.25
    Train perplexity: 10045.1
    Train accuracy: 0.0457571
    Validation perplexity: 26435.6
    Validation accuracy: 4.87078
    Decaying learning rate to 0.125
    Train perplexity: 10301.5
    Train accuracy: 0.0490849
    Validation perplexity: 24243.5
    Validation accuracy: 3.62823
    Decaying learning rate to 0.0625
    Train perplexity: 7927.77
    Train accuracy: 0.062812
    Validation perplexity: 7180.49
    Validation accuracy: 5.31809
    Decaying learning rate to 0.03125
    Train perplexity: 4573.5
    Train accuracy: 0.047421
    Validation perplexity: 6545.51
    Validation accuracy: 5.6163
    Decaying learning rate to 0.015625
    Train perplexity: 3995.7
    Train accuracy: 0.0549085
    Validation perplexity: 6316.25
    Validation accuracy: 5.4175
    Decaying learning rate to 0.0078125
    Train perplexity: 3715.81
    Train accuracy: 0.0540765
    Validation perplexity: 6197.91
    Validation accuracy: 5.86481
    Decaying learning rate to 0.00390625
    Train perplexity: 3672.46
    Train accuracy: 0.0540765
    Validation perplexity: 6144.18
    Validation accuracy: 6.01392
    Decaying learning rate to 0.00195312
    Train perplexity: 3689.7
    Train accuracy: 0.0528286
    Validation perplexity: 6113.55
    Validation accuracy: 6.31213
    Decaying learning rate to 0.000976562
    Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0x118b19b70>
    Traceback (most recent call last):
      File "/Users/Natsume/miniconda2/envs/dlnd-tf-lab/lib/python3.5/weakref.py", line 117, in remove
    TypeError: 'NoneType' object is not callable
    

    Could you tell me how to fix it? Thanks!

    I think you might be seeing a bug in Python 3.5's weakref (Issue 29519: weakref spewing exceptions during finalization when combined with multiprocessing - Python tracker) that occurs during shutdown.

    On my machine I was able to resolve it by applying this patch (though I seem to recall that there was some fuzz in the line numbers):
    https://github.com/python/cpython/commit/9cd7e17640a49635d1c1f8c2989578a8fc2c1de6.patch
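
    For context, the core of that fix in Lib/weakref.py (sketched from memory here, so treat the patch itself as authoritative) is to bind the helper _remove_dead_weakref as a default argument of the remove callback defined in WeakValueDictionary.__init__:

        def remove(wr, selfref=ref(self), _atomic_removal=_remove_dead_weakref):
            # Binding the helper as a default argument captures it at
            # definition time, so it is still reachable during interpreter
            # shutdown, after the weakref module's globals are set to None.
            self = selfref()
            if self is not None:
                if self._iterating:
                    self._pending_removals.append(wr.key)
                else:
                    # Atomic removal is necessary since this function can be
                    # called asynchronously by the GC; d is the underlying
                    # dict, bound in the enclosing __init__.
                    _atomic_removal(d, wr.key)

    The unpatched code looked the helper up as a module global at call time; during finalization that global has already been cleared to None, which is exactly the TypeError: 'NoneType' object is not callable in your traceback.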

    Best regards

    Thomas

    Thanks a lot, Tom!

    Following your suggestion, to get the code running I switched to Python 2.7, and it trains without error! It works for Python 3.6 too. I also tested Python 3.5 after conda update python, and now they all train without error as above.

    Thanks again!
