Pretraining

In the field of NLP, pretrained language models have become a fundamental technology. This chapter introduces the pretraining tools in EduNLP:

  • How to train with a corpus to get a pretrained model

  • How to load the pretrained model

  • Public pretrained models

Import modules

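from EduNLP.I2V import get_pretrained_i2v, D2V
from EduNLP.Vector import get_pretrained_t2v
# Needed by the training example in this chapter
from EduNLP.Pretrain import train_vector, GensimWordTokenizer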

Train a model

The training interfaces are defined in EduNLP.Pretrain and cover tokenization, data processing, model definition, and model training.

Basic Steps

1. Determine the type of model and select the appropriate tokenizer (GensimWordTokenizer or GensimSegTokenizer) to tokenize the corpus.

2. Call the train_vector function to obtain the required pretrained model.

Examples:

>>> tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
>>> token_item = tokenizer("有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']

# Train a 10-dimensional model with the d2v method.
# sif_items is assumed to hold the tokenized corpus, one token list per item.
train_vector(sif_items, "../../../data/w2v/gensim_luna_stem_tf_", 10, method="d2v")
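
The tokenization output above shows figure-style markers being replaced by placeholder tokens such as [FORMULA] and [FIGURE]. The following is a minimal sketch of that idea in plain Python, not EduNLP's actual implementation: the regular expressions and the replace_markers helper are assumptions made for illustration, and only the placeholder strings are taken from the example output.

```python
import re

# Hypothetical marker-to-placeholder table. The placeholder strings come from
# the example output above; the regular expressions are assumptions.
PLACEHOLDERS = [
    (re.compile(r"\$\\FormFigure(?:ID|Base64)\{[^}]*\}\$"), "[FORMULA]"),
    (re.compile(r"\$\\FigureID\{[^}]*\}\$"), "[FIGURE]"),
]


def replace_markers(item: str) -> str:
    """Replace figure-style markers with placeholder tokens (illustration only)."""
    for pattern, placeholder in PLACEHOLDERS:
        # Pad with spaces so the placeholder becomes its own whitespace token.
        item = pattern.sub(f" {placeholder} ", item)
    return item


text = r"formula $\FormFigureID{wrong1?}$, see figure $\FigureID{088f15ea-xxx}$"
tokens = replace_markers(text).split()
print(tokens)
# ['formula', '[FORMULA]', ',', 'see', 'figure', '[FIGURE]']
```

A real tokenizer would also segment the surrounding text and filter stop words, which this sketch leaves out.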

Load models

Pass the path of the trained model to the I2V module to load it.

Examples:

>>> model_path = "../test_model/d2v/test_gensim_luna_stem_tf_d2v_256.bin"
>>> i2v = D2V("text", "d2v", filepath=model_path, pretrained_t2v=False)
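
The file name in the loading example suggests that a saved model path is made of the prefix given at training time, followed by the method, the dimension, and an extension. The sketch below only illustrates that assumed naming pattern. The expected_model_path helper is hypothetical and not part of EduNLP, so check the files actually written by train_vector before relying on it.

```python
def expected_model_path(prefix: str, method: str, dim: int, suffix: str = ".bin") -> str:
    """Compose prefix + method + '_' + dim + suffix (assumed naming pattern)."""
    return f"{prefix}{method}_{dim}{suffix}"


# Prefix, method, and dimension are read off the path used in the loading example.
path = expected_model_path("../test_model/d2v/test_gensim_luna_stem_tf_", "d2v", 256)
print(path)
# ../test_model/d2v/test_gensim_luna_stem_tf_d2v_256.bin
```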

Examples of Model Training

Get the dataset

Examples of d2v in gensim model

Examples of w2v in gensim model

Examples of seg_token

Examples of advanced models