EduNLP.Pretrain
- class EduNLP.Pretrain.GensimSegTokenizer(symbol='gms', depth=None, flatten=False, **kwargs)
- Parameters
symbol (str) –
- select which segment types to symbolize:
"t": text, "f": formula, "g": figure, "m": question mark, "a": tag, "s": sep,
e.g. gms, fgm
depth (int or None) – 0: separate only at SIFSep; 1: separate only at SIFTag; 2: separate at both SIFTag and SIFSep; any other value: separate all segments.
- Returns
tokenizer
- Return type
Tokenizer
Examples
>>> from EduNLP.Pretrain import GensimSegTokenizer
>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer(r"有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,"
... r"若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{wrong1?}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> tokenizer = GensimSegTokenizer(symbol="fgm", depth=None)
>>> token_item = tokenizer(r"有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,"
... r"若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], ['[FORMULA]'], ['如图'], ['[FIGURE]'], ['[FORMULA]'],...['[FORMULA]'], ['最大值'], ['[MARK]']]
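The depth rules listed under Parameters can be pictured with a small standalone sketch. This is not the EduNLP implementation: the (kind, value) tuples, the function name group_segments, and the choice to drop the boundary marker itself are all assumptions made for illustration.

```python
def group_segments(segments, depth=None):
    """Toy illustration of the documented depth rules.

    segments is a list of (kind, value) tuples; the kinds "sep" and "tag"
    stand in for SIFSep and SIFTag (hypothetical representation).
    """
    if depth == 0:
        boundaries = {"sep"}          # separate only at SIFSep
    elif depth == 1:
        boundaries = {"tag"}          # separate only at SIFTag
    elif depth == 2:
        boundaries = {"sep", "tag"}   # separate at both
    else:
        # any other value: every segment becomes its own group
        return [[seg] for seg in segments]

    groups, current = [], []
    for kind, value in segments:
        if kind in boundaries:
            # assumption of this sketch: the boundary marker is dropped
            if current:
                groups.append(current)
            current = []
        else:
            current.append((kind, value))
    if current:
        groups.append(current)
    return groups


demo = [("text", "a"), ("tag", "stem"), ("text", "b"), ("sep", "|"), ("text", "c")]
by_sep = group_segments(demo, depth=0)
by_tag = group_segments(demo, depth=1)
by_both = group_segments(demo, depth=2)
all_split = group_segments(demo, depth=None)
```

With depth=0 the tag stays inside its group because only the separator acts as a boundary; with depth=None the grouping ignores segment kinds entirely.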
- class EduNLP.Pretrain.GensimWordTokenizer(symbol='gm', general=False)
- Parameters
symbol (str) –
- select which segment types to symbolize:
"t": text, "f": formula, "g": figure, "m": question mark, "a": tag, "s": sep,
e.g. gm, fgm, gmas, fgmas
general (bool) –
True: use when the item is not in standard format; formulas (except formulas in figures) are tokenized linearly.
False: use the 'ast' method instead of 'linear' to tokenize formulas.
- Returns
tokenizer
- Return type
Tokenizer
Examples
>>> from EduNLP.Pretrain import GensimWordTokenizer
>>> tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
>>> token_item = tokenizer(r"有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,"
... r"若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
>>> tokenizer = GensimWordTokenizer(symbol="fgmas", general=False)
>>> token_item = tokenizer(r"有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,"
... r"若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', '[FORMULA]', '约束条件', '公式', '[FORMULA]', '[SEP]', '[FORMULA]']
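The effect of the symbol string can be sketched without the library: each letter switches on placeholder replacement for one segment type. The snippet below is a simplified model, not EduNLP code. The names PLACEHOLDERS, KIND_TO_LETTER and symbolize are hypothetical, and the "[TEXT]" and "[TAG]" placeholders are assumed by analogy with the "[FORMULA]", "[FIGURE]", "[MARK]" and "[SEP]" tokens visible in the outputs above.

```python
# Letter -> placeholder. "[TEXT]" and "[TAG]" are assumptions of this sketch;
# the other four placeholders appear in the documented example outputs.
PLACEHOLDERS = {
    "t": "[TEXT]",
    "f": "[FORMULA]",
    "g": "[FIGURE]",
    "m": "[MARK]",
    "a": "[TAG]",
    "s": "[SEP]",
}

# Hypothetical segment kinds mapped to the letters used in `symbol`.
KIND_TO_LETTER = {
    "text": "t",
    "formula": "f",
    "figure": "g",
    "mark": "m",
    "tag": "a",
    "sep": "s",
}


def symbolize(segments, symbol="gm"):
    """Replace a segment by its placeholder when its letter is in `symbol`."""
    result = []
    for kind, value in segments:
        letter = KIND_TO_LETTER[kind]
        result.append(PLACEHOLDERS[letter] if letter in symbol else value)
    return result


item = [("text", "given"), ("formula", "x+y"), ("figure", "fig-1"), ("mark", "")]
only_gm = symbolize(item, symbol="gm")
with_formula = symbolize(item, symbol="fgm")
```

In this model, adding "f" to the symbol string is what turns the raw formula into "[FORMULA]", which mirrors the difference between the "gmas" and "fgmas" examples above. The real tokenizer additionally tokenizes the content of segments that are not symbolized, which the sketch leaves out.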
- EduNLP.Pretrain.train_vector(items, w2v_prefix, embedding_dim=None, method='sg', binary=None, train_params=None)
- Parameters
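items (str) – the text of the questions
w2v_prefix (str) – path prefix of the saved model file; the method name, dimension and file extension are appended to it (see the example output)
embedding_dim (int) – size of the embedding vectors (vector_size)
method (str) – the training method, e.g.: sg, cbow, fasttext, d2v, bow, tfidf
binary (bool) – the saved model format: True: bin; False: kv
train_params (dict) – the training parameters passed to the model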
- Returns
path of the saved model file
- Return type
str
Examples
>>> from EduNLP.Pretrain import GensimSegTokenizer, train_vector
>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer(r"有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,"
... r"若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{wrong1?}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> train_vector(token_item[:10], "examples/test_model/data/gensim_luna_stem_t_", 100)
'examples/test_model/data/gensim_luna_stem_t_sg_100.kv'
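The returned path in the example suggests how the output filename is assembled from the arguments. The helper below reproduces that one observed case; it is an inference from the example output, not the library's code. The name vector_path is hypothetical, and the naming for methods other than sg (for instance d2v, bow or tfidf) may well differ.

```python
def vector_path(w2v_prefix, method, embedding_dim, binary=None):
    """Guess the saved-model path from the documented sg example.

    Assumption: prefix + method + "_" + dimension + extension, where the
    extension is "bin" when binary is truthy and "kv" otherwise.
    """
    extension = "bin" if binary else "kv"
    return f"{w2v_prefix}{method}_{embedding_dim}.{extension}"


path = vector_path("examples/test_model/data/gensim_luna_stem_t_", "sg", 100)
```

Because the prefix is concatenated directly, a trailing underscore or path separator in w2v_prefix has to be supplied by the caller, as in the example.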