Step 1: Install fastText
git clone https://github.com/facebookresearch/fastText.git
cd fastText
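Then build the binary with the standard Makefile build described in the fastText README:
make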
Step 2: Train fastText
Create a directory called result inside the fastText directory.
Prepare a large plain-text file (corpus) for training.
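If you do not have a corpus yet, one option is the fil9 Wikipedia dump used in the fastText tutorial (the same data/fil9 that appears in the option examples below); a sketch, assuming wget, unzip, and the wikifil.pl preprocessing script shipped in the repository:
mkdir -p data
wget -c http://mattmahoney.net/dc/enwik9.zip -P data
unzip data/enwik9.zip -d data
perl wikifil.pl data/enwik9 > data/fil9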
Training
…/fasttext skipgram -input ./emb_data.txt -output result/model
emb_data.txt: the large text file used for training
output: the directory path and the name prefix of the model files to be created
As a result, *.vec and *.bin files are created.
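To sanity-check the result, peek at the .vec file: its first line holds the vocabulary size and vector dimension, followed by one vector per line (assuming the output name result/model used above):
head -n 2 result/model.vec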
Learn about key options
./fasttext skipgram -input data/fil9 -output result/fil9 -minn 2 -maxn 5 -dim 300
./fasttext skipgram -input data/fil9 -output result/fil9 -epoch 1 -lr 0.5
./fasttext skipgram -input data/fil9 -output result/fil9 -thread 4
The most important option is -dim: the dimension of the word vectors.
Skipgram or cbow: your choice!
Usage
…/fasttext skipgram -input ko_emb_data.txt -output result_dimension_512/ko-vec-pcj -minn 2 -maxn 5 -dim 512
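If you prefer the cbow model mentioned above, only the subcommand changes; for example, with the tutorial corpus:
./fasttext cbow -input data/fil9 -output result/fil9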
Step 3 (optional): Try the secondary functionality
Visually check word vectors
echo "asparagus pidgey yellow" | ./fasttext print-word-vectors result/fil9.bin
echo "YOUR WORD" | …/…/fasttext print-word-vectors YOUR_MODEL_BIN_FILE
Nearest neighbor queries
./fasttext nn result/fil9.bin
…/…/fasttext nn YOUR_MODEL_BIN_FILE
Word analogies
./fasttext analogies result/fil9.bin
…/…/fasttext analogies YOUR_MODEL_BIN_FILE
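Both nn and analogies open an interactive prompt that reads queries from standard input, so you can also pipe a query in; the query words below are only illustrations:
echo "paris" | ./fasttext nn result/fil9.bin
echo "berlin germany france" | ./fasttext analogies result/fil9.bin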
Now let's look at how to use the pretrained embeddings in OpenNMT-py.
Step 1: Preprocess the data
Preprocess
python3 …/…/…/preprocess.py -train_src ./src-train.txt -train_tgt ./tgt-train.txt -valid_src ./src-val.txt -valid_tgt ./tgt-val.txt -save_data ./data -src_vocab_size 32000 -tgt_vocab_size 32000
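If preprocessing succeeds, -save_data ./data produces serialized training/validation shards plus the vocabulary file used in the next step; a rough check (exact shard names may differ across OpenNMT-py versions):
ls ./data*.pt
# e.g. data.train.0.pt  data.valid.0.pt  data.vocab.pt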
Step 2: Prepare the embeddings
The command looks like this:
python3 …/…/…/tools/embeddings_to_torch.py -emb_file_both "…/…/ko_embedding/ko.vec" -dict_file ./data.vocab.pt -output_file "./embeddings"
python3 …/…/…/tools/embeddings_to_torch.py -emb_file_both "YOUR_VEC_FILE" -dict_file YOUR_VOCAB_PT_FILE -output_file "./embeddings"
As a result, embeddings.enc.pt and embeddings.dec.pt are created.
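To sanity-check the converted embeddings, load them with torch and inspect their shape; a minimal sketch, assuming each .pt file holds a plain tensor whose second dimension should equal the fastText -dim (512 here):
python3 -c "import torch; e = torch.load('./embeddings.enc.pt'); print(e.shape)"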
Step 3: Train the Transformer with pretrained embeddings
Command
python3 …/…/train.py -data ./data -save_model ./model/model -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding -train_steps 500000 -max_generator_batches 2 -dropout 0.1 -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 -pre_word_vecs_enc "./embeddings.enc.pt" -pre_word_vecs_dec "./embeddings.dec.pt" -world_size 8 -gpu_ranks 0 1 2 3 4 5 6 7 -log_file ./log &
Key parameters
-pre_word_vecs_enc "./embeddings.enc.pt" -pre_word_vecs_dec "./embeddings.dec.pt"
Note that -word_vec_size (and -rnn_size here) must match the fastText -dim used in Step 2 (512 in this example); otherwise the pretrained vectors will not fit the model's embedding layer.
So far, this was a simple pretrained-embedding tutorial using fastText.
Should training of the fastText model (Step 2) be done separately for the source and target languages, or should the large text file you are talking about contain both languages in the same file, so that eventually embeddings.enc.pt is generated for the source language and embeddings.dec.pt for the target language?
If the fastText models should be trained separately for the source and target languages, how should the embeddings be prepared separately for the encoder and decoder? As stated in Step 2 (Prepare the embeddings), the conversion requires only the single vector file generated when the fastText model is trained.
When we preprocess the training and validation sets, I guess (please correct me if I am wrong) that OpenNMT uses a SentencePiece model for tokenization by default. In that case, what would the fastText embeddings for subwords look like at training time?
How should named entities be handled in OpenNMT? Is there any tag we have to provide for named entities in the training dataset?
Currently, I am thinking of identifying named entities with a pretrained transformer NER model and replacing them with their tags (for example ORG or PER) in the training and validation datasets. Please share your suggestions!
Thanks!