```text
KEY [
FEAT1.1 FEAT1.2 FEAT1.3 ... FEAT1.n
FEATm.1 FEATm.2 FEATm.3 ... FEATm.n ]
```
Vocabularies
The main goal of the preprocessing is to build the word and feature vocabularies and to assign each word an index within these dictionaries.

By default, word vocabularies are limited to the 50,000 most frequent words. You can change this value with the `-src_vocab_size` and `-tgt_vocab_size` options. Alternatively, you can prune the vocabularies by setting a minimum word frequency with the `-src_words_min_frequency` and `-tgt_words_min_frequency` options.

When limiting vocabularies to 50,000 words, the preprocessing will actually report a vocabulary size of 50,004 because of the 4 special tokens that are automatically added.
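For example, here is a sketch of a preprocessing command that limits both vocabularies to 30,000 words. The data paths and the `data/demo` prefix are placeholders, and the `-train_src`/`-train_tgt`/`-valid_src`/`-valid_tgt`/`-save_data` arguments are the usual ones from the quickstart, not options introduced in this section:

```bash
th preprocess.lua -train_src data/src-train.txt -train_tgt data/tgt-train.txt \
                  -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt \
                  -save_data data/demo \
                  -src_vocab_size 30000 -tgt_vocab_size 30000
# Alternative pruning strategy: keep only words seen at least 3 times, e.g.
#   -src_words_min_frequency 3 -tgt_words_min_frequency 3
```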
The preprocessing script will generate `*.dict` files containing the vocabularies: source and target token vocabularies are named `PREFIX.src.dict` and `PREFIX.tgt.dict`, while feature vocabulary files are named `PREFIX.{source,target}_feature_N.dict`.
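For instance, with `-save_data data/demo` and one additional word feature on the source side, a run would leave vocabulary files along these lines (an illustrative listing derived from the naming convention above, not actual program output):

```text
data/demo.src.dict               # source token vocabulary
data/demo.tgt.dict               # target token vocabulary
data/demo.source_feature_1.dict  # vocabulary of the first source-side feature
```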
These files are optional for the rest of the workflow. However, it is common to reuse vocabularies across datasets with the `-src_vocab` and `-tgt_vocab` options. This is particularly important when retraining a model on new data: the vocabulary has to be the same. Vocabularies can also be generated beforehand with the `tools/build_vocab.lua` script.
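For example, a sketch of a run that reuses the dictionaries produced by an earlier preprocessing; the new data paths are placeholders, and the `.dict` paths are assumed to follow the naming convention described above:

```bash
th preprocess.lua -train_src data/src-train-new.txt -train_tgt data/tgt-train-new.txt \
                  -valid_src data/src-val-new.txt -valid_tgt data/tgt-val-new.txt \
                  -save_data data/demo-new \
                  -src_vocab data/demo.src.dict -tgt_vocab data/demo.tgt.dict
```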
Each line of a dictionary file consists of space-separated fields:

* `token`: the vocabulary entry.
* `ID`: its index, used internally to map tokens to integers as entries of lookup tables.
* frequency (optional): the frequency of the entry in the corpus it was extracted from. This field is generated by the preprocessing.
* any other fields are ignored.

If you provide your own vocabulary, be sure to include the 4 special tokens: `<blank>`, `<unk>`, `<s>` and `</s>`. A good practice is to keep them at the beginning of the file with the respective indices 1, 2, 3 and 4.
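Putting this together, a hand-written source vocabulary could look like the following made-up excerpt (the word entries are illustrative; a generated `.dict` would additionally carry the frequency column):

```text
<blank> 1
<unk> 2
<s> 3
</s> 4
the 5
of 6
and 7
```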
Shuffling and sorting
By default, OpenNMT both shuffles and sorts the data before training. This process addresses two constraints of batch training:

* shuffling: sentences within a batch should come from different parts of the corpus;
* sorting: sentences within a batch should have the same source length (i.e., no padding is needed), which maximizes efficiency.

During training, batches are also randomly selected unless the `-curriculum` option is used.
Sentence length
During preprocessing, sentences that are too long (source longer than `-src_seq_length` or target longer than `-tgt_seq_length`) are discarded from the corpus. You can get an idea of the sentence length distribution in your training corpus by looking at the preprocessing log, where a table gives the percentage of sentences with lengths 1-10, 11-20, 21-30, ..., 90+:
```text
[04/14/17 00:40:10 INFO] * Source Sentence Length (range of 10): [ 7% ; 35% ; 32% ; 16% ; 7% ; 0% ; 0% ; 0% ; 0% ; 0% ]
[04/14/17 00:40:10 INFO] * Target Sentence Length (range of 10): [ 9% ; 38% ; 30% ; 15% ; 5% ; 0% ; 0% ; 0% ; 0% ; 0% ]
```
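For example, a sketch of a preprocessing run that keeps only sentence pairs with at most 60 source tokens and 80 target tokens; as before, the data paths and output prefix are placeholders for your own setup:

```bash
th preprocess.lua -train_src data/src-train.txt -train_tgt data/tgt-train.txt \
                  -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt \
                  -save_data data/demo \
                  -src_seq_length 60 -tgt_seq_length 80
```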