Step 1: Install fastText
git clone https://github.com/facebookresearch/fastText.git
cd fastText
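Then build the binary with the standard Makefile build described in the fastText README:
make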
Step 2: Train fastText
Create a directory called result inside the fastText directory.
Prepare a large plain-text file (corpus) for training.
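If you do not have a corpus yet, one option is the fil9 Wikipedia dump used in the fastText tutorial (the same data/fil9 that appears in the option examples below); a sketch, assuming wget, unzip, and the wikifil.pl preprocessing script shipped in the repository:
mkdir -p data
wget -c http://mattmahoney.net/dc/enwik9.zip -P data
unzip data/enwik9.zip -d data
perl wikifil.pl data/enwik9 > data/fil9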
Training
…/fasttext skipgram -input ./emb_data.txt -output result/model
emb_data.txt: the large text file used for training
output: the directory path and the name prefix of the model files to be created
As a result, *.vec and *.bin files are created.
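To sanity-check the result, peek at the .vec file: its first line holds the vocabulary size and vector dimension, followed by one vector per line (assuming the output name result/model used above):
head -n 2 result/model.vec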
Learn about key options
./fasttext skipgram -input data/fil9 -output result/fil9 -minn 2 -maxn 5 -dim 300
./fasttext skipgram -input data/fil9 -output result/fil9 -epoch 1 -lr 0.5
./fasttext skipgram -input data/fil9 -output result/fil9 -thread 4
The most important option is -dim: the dimension of the word vectors.
Skipgram or cbow: your choice!
Usage
…/fasttext skipgram -input ko_emb_data.txt -output result_dimension_512/ko-vec-pcj -minn 2 -maxn 5 -dim 512
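If you prefer the cbow model mentioned above, only the subcommand changes; for example, with the tutorial corpus:
./fasttext cbow -input data/fil9 -output result/fil9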
Step 3 (optional): Try the secondary functionality
Visually check word vectors
echo "asparagus pidgey yellow" | ./fasttext print-word-vectors result/fil9.bin
echo "YOUR WORD" | …/…/fasttext print-word-vectors YOUR_MODEL_BIN_FILE
Nearest neighbor queries
./fasttext nn result/fil9.bin
…/…/fasttext nn YOUR_MODEL_BIN_FILE
Word analogies
./fasttext analogies result/fil9.bin
…/…/fasttext analogies YOUR_MODEL_BIN_FILE
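Both nn and analogies open an interactive prompt that reads queries from standard input, so you can also pipe a query in; the query words below are only illustrations:
echo "paris" | ./fasttext nn result/fil9.bin
echo "berlin germany france" | ./fasttext analogies result/fil9.bin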
Now let's look at how to use the pretrained embeddings in OpenNMT-py.
Step 1: Preprocess the data
Preprocess
python3 …/…/…/preprocess.py -train_src ./src-train.txt -train_tgt ./tgt-train.txt -valid_src ./src-val.txt -valid_tgt ./tgt-val.txt -save_data ./data -src_vocab_size 32000 -tgt_vocab_size 32000
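If preprocessing succeeds, -save_data ./data produces serialized training/validation shards plus the vocabulary file used in the next step; a rough check (exact shard names may differ across OpenNMT-py versions):
ls ./data*.pt
# e.g. data.train.0.pt  data.valid.0.pt  data.vocab.pt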
Step 2: Prepare the embeddings
The command looks like this:
python3 …/…/…/tools/embeddings_to_torch.py -emb_file_both "…/…/ko_embedding/ko.vec" -dict_file ./data.vocab.pt -output_file "./embeddings"
python3 …/…/…/tools/embeddings_to_torch.py -emb_file_both "YOUR_VEC_FILE" -dict_file YOUR_VOCAB_PT_FILE -output_file "./embeddings"
As a result, embeddings.enc.pt and embeddings.dec.pt are created.
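To sanity-check the converted embeddings, load them with torch and inspect their shape; a minimal sketch, assuming each .pt file holds a plain tensor whose second dimension should equal the fastText -dim (512 here):
python3 -c "import torch; e = torch.load('./embeddings.enc.pt'); print(e.shape)"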
Step 3: Train the Transformer with pretrained embeddings
Command
python3 …/…/train.py -data ./data -save_model ./model/model -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding -train_steps 500000 -max_generator_batches 2 -dropout 0.1 -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 -pre_word_vecs_enc "./embeddings.enc.pt" -pre_word_vecs_dec "./embeddings.dec.pt" -world_size 8 -gpu_ranks 0 1 2 3 4 5 6 7 -log_file ./log &
Key parameters
-pre_word_vecs_enc "./embeddings.enc.pt" -pre_word_vecs_dec "./embeddings.dec.pt"
Note that -word_vec_size (and -rnn_size here) must match the fastText -dim used in Step 2 (512 in this example); otherwise the pretrained vectors will not fit the model's embedding layer.
So far, this was a simple pretrained-embedding tutorial using fastText.
Should training of the fastText model (Step 2) be done separately for the source and target languages, or should the large text file you are talking about contain both languages in the same file, so that eventually embeddings.enc.pt is generated for the source language and embeddings.dec.pt for the target language?
If the fastText models should be trained separately for the source and target languages, how should the embeddings be prepared separately for the encoder and decoder? As stated in Step 2 (Prepare the embeddings), the conversion requires only the single vector file generated when the fastText model is trained.
When we preprocess the training and validation sets, I guess (please correct me if I am wrong) that OpenNMT uses a SentencePiece model for tokenization by default. In that case, what would the fastText embeddings for subwords look like at training time?
How should named entities be handled in OpenNMT? Is there any tag we have to provide for named entities in the training dataset?
Currently, I am thinking of identifying named entities with a pretrained transformer NER model and replacing them with their tags (for example ORG or PER) in the training and validation datasets. Please share your suggestions!
Thanks!