EduNLP.Pretrain
EduNLP.Pretrain.gensim_vec
- class EduNLP.Pretrain.gensim_vec.GensimWordTokenizer(symbol='gm', general=False)
- Parameters
symbol (str) –
- selects which element types are replaced by placeholder symbols:
"t": text, "f": formula, "g": figure, "m": question mark, "a": tag, "s": sep,
e.g.: gm, fgm, gmas, fgmas
general (bool) –
True: tokenize formulas linearly (except formulas in figures); use this when the item is not in the standard format.
False: tokenize formulas with the "ast" method instead of the "linear" method.
- Returns
tokenizer
- Return type
GensimWordTokenizer
Examples
>>> tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
>>> tokenizer = GensimWordTokenizer(symbol="fgmas", general=False)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', '[FORMULA]', '约束条件', '公式', '[FORMULA]', '[SEP]', '[FORMULA]']
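The letters accepted by `symbol` can be read as a small lookup table. The sketch below is illustrative only: `SYMBOL_TYPES` and `describe_symbol` are hypothetical names that are not part of EduNLP, and the code does nothing more than decode a `symbol` string into the element types listed in the parameter description.

```python
# Letter-to-type table copied from the parameter description of `symbol`.
SYMBOL_TYPES = {
    "t": "text",
    "f": "formula",
    "g": "figure",
    "m": "question mark",
    "a": "tag",
    "s": "sep",
}


def describe_symbol(symbol):
    """Return the element types selected by a symbol string such as "gmas".

    Hypothetical helper for illustration; EduNLP does not expose it.
    """
    unknown = [ch for ch in symbol if ch not in SYMBOL_TYPES]
    if unknown:
        raise ValueError(f"unknown symbol letters: {unknown}")
    return [SYMBOL_TYPES[ch] for ch in symbol]


print(describe_symbol("gmas"))
```

Reading `"fgmas"` this way makes it clear that it selects everything `"gmas"` does plus formulas.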
- EduNLP.Pretrain.gensim_vec.train_vector(items, w2v_prefix, embedding_dim=None, method='sg', binary=None, train_params=None)
- Parameters
items (list) – the tokenized text of the questions
w2v_prefix (str) – prefix of the output model path; in the example, the method name and the embedding dimension are appended to it
embedding_dim (int) – vector_size
method (str) – the method of training, e.g.: sg, cbow, fasttext, d2v, bow, tfidf
binary (bool) – the model format; True: bin; False: kv
train_params (dict) – the training parameters passed to model
- Returns
path of the saved model file
- Return type
str
Examples
>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{1}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> train_vector(token_item[:10], "examples/test_model/w2v/gensim_luna_stem_t_", 100)
'examples/test_model/w2v/gensim_luna_stem_t_sg_100.kv'
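The returned path in the doctest suggests a naming pattern of prefix, method, dimension and extension. The sketch below reproduces that pattern under stated assumptions: `expected_model_path` is a hypothetical helper, the `.bin` extension is inferred from the `binary` description, and methods other than `sg` (for example `bow` or `tfidf`) may well be named differently by the library.

```python
def expected_model_path(w2v_prefix, method="sg", embedding_dim=100, binary=False):
    """Guess the output path from the pattern seen in the doctest.

    Assumption: "<prefix><method>_<dim>.<kv|bin>". Only the sg/kv case is
    confirmed by the documented example.
    """
    extension = "bin" if binary else "kv"
    return f"{w2v_prefix}{method}_{embedding_dim}.{extension}"


print(expected_model_path("examples/test_model/w2v/gensim_luna_stem_t_", "sg", 100))
```

Treat the value returned by `train_vector` itself as authoritative; a helper like this is only useful for planning directory layouts ahead of training.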
- class EduNLP.Pretrain.gensim_vec.GensimSegTokenizer(symbol='gms', depth=None, flatten=False, **kwargs)
- Parameters
symbol (str) –
- selects which element types are replaced by placeholder symbols:
"t": text, "f": formula, "g": figure, "m": question mark, "a": tag, "s": sep,
e.g. gms, fgm
depth (int or None) – 0: separate only at SIFSep; 1: separate only at SIFTag; 2: separate at both SIFTag and SIFSep; otherwise, separate all segments.
- Returns
tokenizer
- Return type
GensimSegTokenizer
Examples
>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{1}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> tokenizer = GensimSegTokenizer(symbol="fgm", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], ['[FORMULA]'], ['如图'], ['[FIGURE]'], ['[FORMULA]'],...['[FORMULA]'], ['最大值'], ['[MARK]']]
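One way to read the `depth` rule is as a choice of which markers act as group boundaries. The following sketch is not EduNLP's implementation: `split_segments` and the `(kind, text)` representation are hypothetical, and dropping the boundary marker from the output is an assumption made for simplicity.

```python
def split_segments(segments, depth=None):
    """Group (kind, text) segments according to the documented depth rule.

    0 -> split only at "sep"; 1 -> split only at "tag"; 2 -> split at both;
    anything else -> every segment becomes its own group.
    Assumption: boundary markers themselves are dropped from the groups.
    """
    boundaries = {0: {"sep"}, 1: {"tag"}, 2: {"tag", "sep"}}.get(depth)
    if boundaries is None:
        return [[text] for _, text in segments]

    groups, current = [], []
    for kind, text in segments:
        if kind in boundaries:
            if current:
                groups.append(current)
            current = []
        else:
            current.append(text)
    if current:
        groups.append(current)
    return groups


segments = [("text", "a"), ("sep", "[SEP]"), ("text", "b"), ("tag", "[TAG]"), ("text", "c")]
for depth in (0, 1, 2, None):
    print(depth, split_segments(segments, depth))
```

With `depth=None` each segment is isolated, which matches the nested one-element lists printed in the doctest.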
EduNLP.Pretrain.elmo_vec
- class EduNLP.Pretrain.elmo_vec.ElmoTokenizer(path: Optional[str] = None)
Examples
>>> t = ElmoTokenizer()
>>> items = ["有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"]
>>> t.tokenize(items[0])
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']
>>> len(t)
18
- class EduNLP.Pretrain.elmo_vec.ElmoDataset(texts: list, tokenizer: ElmoTokenizer, max_length=128)
- EduNLP.Pretrain.elmo_vec.train_elmo(texts: list, output_dir: str, pretrained_dir: Optional[str] = None, emb_dim=512, hid_dim=512, batch_size=2, epochs=3, lr: float = 0.0005, device=None)
- Parameters
texts (list, required) – The training corpus of shape (text_num, token_num), a text must be tokenized into tokens
output_dir (str, required) – The directory to save trained model files
pretrained_dir (str, optional) – The pretrained model files’ directory
emb_dim (int, optional, default=512) – The embedding dim
hid_dim (int, optional, default=512) – The hidden dim
batch_size (int, optional, default=2) – The training batch size
epochs (int, optional, default=3) – The training epochs
lr (float, optional, default=5e-4) – The learning rate
device (str, optional) – Default is ‘cuda’ if available, otherwise ‘cpu’
- Returns
output_dir – The directory that trained model files are saved
- Return type
str
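Because `texts` is described as having shape (text_num, token_num), each text has to be a list of tokens rather than a raw string. The check below is a minimal sketch of that requirement; `corpus_shape` is a hypothetical helper and is not provided by EduNLP.

```python
def corpus_shape(texts):
    """Validate a pre-tokenized corpus and return (text_num, longest token_num).

    Hypothetical helper: raises if any entry is a raw string instead of a
    list of str tokens.
    """
    if not isinstance(texts, list) or not texts:
        raise ValueError("texts must be a non-empty list")
    for i, tokens in enumerate(texts):
        if not isinstance(tokens, list) or not all(isinstance(t, str) for t in tokens):
            raise TypeError(f"texts[{i}] must be a list of str tokens, not raw text")
    return len(texts), max(len(tokens) for tokens in texts)


corpus = [["公式", "如图", "[FIGURE]"], ["x", ",", "y", "约束条件"]]
print(corpus_shape(corpus))  # (2, 4)
```

Running a check like this before training catches the common mistake of passing untokenized question strings.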
EduNLP.Pretrain.bert_vec
- class EduNLP.Pretrain.bert_vec.BertTokenizer(pretrain_model='bert-base-chinese', add_special_tokens=False, text_tokenizer=None)
- Parameters
pretrain_model – used pretrained model
add_special_tokens – Whether to add tokens like [FIGURE], [TAG], etc.
text_tokenizer – Which text tokenizer to use. Must be consistent with TOKENIZER dictionary.
Examples
>>> tokenizer = BertTokenizer(add_special_tokens=True)
>>> item = "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"
>>> token_item = tokenizer(item)
>>> print(token_item.input_ids[:10])
[101, 1062, 2466, 1963, 1745, 21129, 166, 117, 167, 5276]
>>> print(tokenizer.tokenize(item)[:10])
['公', '式', '如', '图', '[FIGURE]', 'x', ',', 'y', '约', '束']
>>> items = [item, item]
>>> token_items = tokenizer(items, return_tensors='pt')
>>> print(token_items.input_ids.shape)
torch.Size([2, 27])
>>> print(len(tokenizer.tokenize(items)))
2
>>> tokenizer.save_pretrained('test_dir')
>>> tokenizer = BertTokenizer.from_pretrained('test_dir')
- EduNLP.Pretrain.bert_vec.finetune_bert(items, output_dir, pretrain_model='bert-base-chinese', train_params=None)
- Parameters
items (dict) – the tokenization results of questions
output_dir (str) – the path to save the model
pretrain_model (str) – the name or path of pre-trained model
train_params (dict) – the training parameters passed to Trainer
Examples
>>> tokenizer = BertTokenizer()
>>> stems = ["有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$",
... "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$"]
>>> token_item = [tokenizer(i) for i in stems]
>>> print(token_item[0].keys())
dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
>>> finetune_bert(token_item, "examples/test_model/data/data/bert")
{'train_runtime': ..., ..., 'epoch': 1.0}
EduNLP.Pretrain.disenqnet_vec
- class EduNLP.Pretrain.disenqnet_vec.DisenQTokenizer(vocab_path=None, max_length=250, tokenize_method='space', num_token='<num>', unk_token='<unk>', pad_token='<pad>', *args, **argv)
Examples
>>> tokenizer = DisenQTokenizer()
>>> test_items = [{
... "content": "甲 数 除以 乙 数 的 商 是 1.5 , 如果 甲 数 增加 20 , 则 甲 数 是 乙 的 4 倍 . 原来 甲 数 = .",
... "knowledge": ["*", "-", "/"], "difficulty": 0.2, "length": 7}]
>>> tokenizer.set_vocab(test_items,
... trim_min_count=1, key=lambda x: x["content"], silent=True)
>>> token_items = [tokenizer(i, key=lambda x: x["content"]) for i in test_items]
>>> print(token_items[0].keys())
dict_keys(['content_idx', 'content_len'])
- tokenize(items: (<class 'list'>, <class 'str'>, <class 'dict'>), key=<function DisenQTokenizer.<lambda>>, **kwargs)
- Parameters
items (list or str or dict) – the question items
key (function) – determine how to get the text of each item
- Returns
tokens – the token of items
- Return type
list
- set_vocab(items: list, key=<function DisenQTokenizer.<lambda>>, trim_min_count=1, silent=True)
- Parameters
items (list) – can be the list of str, or list of dict
key (function) – determine how to get the text of each item
- classmethod from_pretrained(tokenizer_config_dir)
- Parameters
tokenizer_config_dir (str) – directory that must contain tokenizer_config.json and vocab.list
- save_pretrained(tokenizer_config_dir)
- Parameters
tokenizer_config_dir (str) – directory in which the tokenizer parameters are saved to tokenizer_config.json and the words to vocab.list
- property vocab_size
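To make the constructor arguments of `DisenQTokenizer` more concrete, here is a rough sketch of what a space-based tokenizer with `<num>`, `<unk>` and `<pad>` tokens could look like. It is not the library's code: the helper names are hypothetical, and the choices to collapse numeric literals to `<num>`, to truncate at `max_length`, and to return `content_idx` and `content_len` are assumptions based on the signature and the doctest output keys.

```python
import re

NUM_TOKEN, UNK_TOKEN, PAD_TOKEN = "<num>", "<unk>", "<pad>"


def space_tokenize(text):
    """Split on whitespace; numeric literals collapse to <num> (assumed)."""
    return [NUM_TOKEN if re.fullmatch(r"-?\d+(\.\d+)?", tok) else tok
            for tok in text.split()]


def build_vocab(texts, trim_min_count=1):
    """Map tokens seen at least trim_min_count times to integer indices."""
    counts = {}
    for text in texts:
        for tok in space_tokenize(text):
            counts[tok] = counts.get(tok, 0) + 1
    vocab = [PAD_TOKEN, UNK_TOKEN, NUM_TOKEN]
    vocab += sorted(t for t, c in counts.items()
                    if c >= trim_min_count and t not in vocab)
    return {tok: i for i, tok in enumerate(vocab)}


def encode(text, word2idx, max_length=250):
    """Return index and length fields shaped like the doctest output keys."""
    tokens = space_tokenize(text)[:max_length]
    idx = [word2idx.get(t, word2idx[UNK_TOKEN]) for t in tokens]
    return {"content_idx": idx, "content_len": len(idx)}


vocab = build_vocab(["甲 数 是 1.5"])
print(encode("甲 数 是 20 倍", vocab))
```

In this sketch the unseen token 倍 maps to the `<unk>` index and 20 maps to the `<num>` index, which is the behaviour the special-token arguments suggest.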
- class EduNLP.Pretrain.disenqnet_vec.QuestionDataset(items, disen_tokenizer, predata_dir, dataset_type, silent=False, embed_dim=128, trim_min_count=50, data_formation=None, w2v_workers=1)
Question dataset including text, length, and concept tensors
- EduNLP.Pretrain.disenqnet_vec.train_disenqnet(train_items, disen_tokenizer, output_dir, predata_dir, train_params=None, test_items=None, silent=False, data_formation=None)
- Parameters
train_items (list) – the raw train question list
disen_tokenizer (DisenQTokenizer) – the initial DisenQTokenizer used for training
output_dir (str) – the path to save the model
predata_dir (str) – the dirname to load or save predata (including wv.th, vocab.list and concept.list)
train_params (dict, defaults to None) – the training parameters for data, model and trainer:
- "trim_min": int
data param, the trim_min_count for vocab and word2vec, by default 2
- "w2v_workers": int
data param, the number of workers for word2vec, by default 1
- "hidden": int
model param, by default 128
- "dropout": float
model param, dropout rate, by default 0.2
- "pos_weight": int
model param, positive sample weight in unbalanced multi-label concept classifier, by default 1
- "cp": float
model param, weight of concept loss, by default 1.5
- "mi": float
model param, weight of mutual information loss, by default 1.0
- "dis": float
model param, weight of disentangling loss, by default 2.0
- "epoch": int
train param, number of epochs, by default 1
- "batch": int
train param, batch size, by default 64
- "lr": float
train param, learning rate, by default 1e-3
- "step": int
train param, step_size for StepLR, by default 20
- "gamma": float
train param, gamma for StepLR, by default 0.5
- "warm_up": int
train param, number of epochs for warming up, by default 1
- "adv": int
train param, ratio of disc/enc training for adversarial process, by default 10
- "device": str
train param, 'cpu' or 'cuda', by default "cpu"
test_items (list, defaults to None) – the raw test question list
silent (bool, defaults to False) – whether to suppress printing of processing information
data_formation (dict, defaults to None) – maps "content" and "knowledge" to the corresponding keys of each item. For example, {"content": "ques_content", "knowledge": "know_name"}
Examples
>>> train_data = load_items("static/test_data/disenq_train.json")[:100]
>>> test_data = load_items("static/test_data/disenq_test.json")[:100]
>>> tokenizer = DisenQTokenizer(max_length=250, tokenize_method="space")
>>> train_disenqnet(train_data, tokenizer,
... "examples/test_model/disenq","examples/test_model/disenq", silent=True)
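The `data_formation` mapping is easiest to understand as a key rename applied to each raw item. The sketch below illustrates that reading; `apply_data_formation` is a hypothetical helper, and the default key names `"content"` and `"knowledge"` are an assumption based on the items used in the DisenQTokenizer example.

```python
def apply_data_formation(item, data_formation=None):
    """Expose an item's text and concepts under "content" and "knowledge".

    Hypothetical helper; assumes the default keys are "content" and
    "knowledge" when no mapping is given.
    """
    mapping = {"content": "content", "knowledge": "knowledge"}
    mapping.update(data_formation or {})
    return {standard: item[source] for standard, source in mapping.items()}


raw = {"ques_content": "甲 数 除以 乙 数", "know_name": ["*", "/"]}
formation = {"content": "ques_content", "knowledge": "know_name"}
print(apply_data_formation(raw, formation))
```

Items that already use the standard keys pass through unchanged when `data_formation` is left as `None`.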
EduNLP.Pretrain.quesnet_vec
Pre-processes input text (tokenizing and building vocabularies) and pre-trains word-level vectors.
- class EduNLP.Pretrain.quesnet_vec.Question(id, content, answer, false_options, labels)
- property answer
Alias for field number 2
- property content
Alias for field number 1
- property false_options
Alias for field number 3
- property id
Alias for field number 0
- property labels
Alias for field number 4
- class EduNLP.Pretrain.quesnet_vec.QuesNetTokenizer(img_dir=None, vocab_path=None, max_length=250, meta=None, img_token='<img>', unk_token='<unk>', pad_token='<pad>', *args, **argv)
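The "Alias for field number N" descriptions are the docstrings that `collections.namedtuple` generates, so `Question` behaves like a named tuple with the five fields in the order shown in its signature. The stand-in below redefines an equivalent type locally for illustration; the field values are made up.

```python
from collections import namedtuple

# Local stand-in with the same field order as the documented signature.
Question = namedtuple("Question", ["id", "content", "answer", "false_options", "labels"])

q = Question(
    id="q1",
    content=["x", "+", "y"],
    answer=["2"],
    false_options=[["1"], ["3"]],
    labels={"difficulty": 0.2},  # illustrative value only
)

# Each property is an alias for a positional field.
print(q.content is q[1], q.labels is q[4])
print(q._fields)
```

Because instances are tuples, they are immutable and can be unpacked positionally, for example `qid, content, *_ = q`.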
Examples
>>> tokenizer = QuesNetTokenizer(meta=['knowledge'])
>>> test_items = [{"ques_content": "$\triangle A B C$ 的内角为 $A, \quad B, $\FigureID{test_id}$",
... "knowledge": "['*', '-', '/']"}, {"ques_content": "$\triangle A B C$ 的内角为 $A, \quad B",
... "knowledge": "['*', '-', '/']"}]
>>> tokenizer.set_vocab(test_items,
... trim_min_count=1, key=lambda x: x["ques_content"], silent=True)
>>> token_items = [tokenizer(i, key=lambda x: x["ques_content"]) for i in test_items]
>>> print(token_items[0].keys())
dict_keys(['content_idx', 'meta_idx'])
>>> token_items = tokenizer(test_items, key=lambda x: x["ques_content"])
>>> print(len(token_items["content_idx"]))
2
- tokenize(item: ~typing.Union[str, dict, list], key=<function QuesNetTokenizer.<lambda>>, *args, **kwargs)
- load_vocab(path)
- Parameters
path (str) – path of the vocabulary files; it must be a directory containing word.txt (meta.txt is optional)
- set_vocab(items: list, key=<function QuesNetTokenizer.<lambda>>, trim_min_count=50, silent=True)
- Parameters
items (list) – can be the list of str, or list of dict
key (function) – determine how to get the text of each item
trim_min_count –
silent –
- save_vocab(save_vocab_path)
- Parameters
save_vocab_path (str) – path to save word vocabulary and meta vocabulary
- classmethod from_pretrained(tokenizer_config_dir, img_dir=None)
- Parameters
tokenizer_config_dir (str) – directory that must contain tokenizer_config.json, vocab/word.txt and vocab/meta_{meta_name}.txt
img_dir (str) – the path of the image directory, by default None
- save_pretrained(tokenizer_config_dir)
- Parameters
tokenizer_config_dir (str) – directory in which the tokenizer parameters are saved to tokenizer_config.json and the words to vocab.list
- property vocab_size
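Since `from_pretrained` expects a specific directory layout, a quick pre-flight check can save a failed load. The sketch below only encodes the file names given in the `from_pretrained` description; `missing_tokenizer_files` is a hypothetical helper that is not part of EduNLP, and the assumption is that one `meta_{meta_name}.txt` file exists per meta field.

```python
import os
import tempfile


def missing_tokenizer_files(tokenizer_config_dir, meta=()):
    """List expected files that are absent from a QuesNet tokenizer directory.

    Assumes one vocab/meta_<name>.txt per entry in `meta`.
    """
    expected = ["tokenizer_config.json", os.path.join("vocab", "word.txt")]
    expected += [os.path.join("vocab", f"meta_{name}.txt") for name in meta]
    return [rel for rel in expected
            if not os.path.isfile(os.path.join(tokenizer_config_dir, rel))]


# Demonstration on a throwaway directory that lacks the meta vocabulary.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "vocab"))
    for rel in ("tokenizer_config.json", os.path.join("vocab", "word.txt")):
        open(os.path.join(tmp, rel), "w").close()
    print(missing_tokenizer_files(tmp, meta=["know_name"]))
```

An empty list means the layout matches the documented requirements; it says nothing about whether the file contents are valid.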
- class EduNLP.Pretrain.quesnet_vec.QuestionLoader(ques_file, tokenizer: ~EduNLP.Pretrain.quesnet_vec.QuesNetTokenizer, pipeline=None, range=None, meta: ~typing.Optional[list] = None, content_key=<function QuestionLoader.<lambda>>, meta_key=<function QuestionLoader.<lambda>>, answer_key=<function QuestionLoader.<lambda>>, option_key=<function QuestionLoader.<lambda>>, skip=0)
- class EduNLP.Pretrain.quesnet_vec.PrefetchIter(data, *label, length=None, batch_size=1, shuffle=True)
Iterator on data and labels, with states for save and restore.
- EduNLP.Pretrain.quesnet_vec.pretrain_embedding_layer(dataset: EmbeddingDataset, ae: AE, lr: float = 0.001, log_step: int = 1, epochs: int = 3, batch_size: int = 4, device=device(type='cpu'))
- EduNLP.Pretrain.quesnet_vec.pretrain_quesnet(path, output_dir, tokenizer, save_embs=False, train_params=None)
Pretrain QuesNet.
- Parameters
path (str) – path of question file
output_dir (str) – output path
tokenizer (QuesNetTokenizer) – quesnet tokenizer
save_embs (bool, optional) – whether to save pretrained word/image/meta embeddings separately
train_params (dict, optional) – the training parameters and model parameters, by default None:
- "n_epochs": int, default = 1
train param, number of epochs
- "batch_size": int, default = 6
train param, batch size
- "lr": float, default = 1e-3
train param, learning rate
- "save_every": int, default = 0
train param, save steps interval
- "log_steps": int, default = 10
train param, log steps interval
- "device": str, default = 'cpu'
train param, 'cpu' or 'cuda'
- "max_steps": int, default = 0
train param, stop training when max steps is reached
- "emb_size": int, default = 256
model param, the embedding size of word, figure, meta info
- "feat_size": int, default = 256
model param, the size of question infer vector
Examples
>>> tokenizer = QuesNetTokenizer(meta=['know_name'])
>>> items = [{"ques_content": "若复数$z=1+2 i+i^{3}$,则$|z|=$,$\FigureID{000004d6-0479-11ec-829b-797d5eb43535}$",
... "ques_id": "726cdbec-33a9-11ec-909c-98fa9b625adb",
... "know_name": "['代数', '集合', '集合的相等']"
... }]
>>> tokenizer.set_vocab(items, key=lambda x: x['ques_content'], trim_min_count=1, silent=True)
>>> pretrain_quesnet('./data/quesnet_data.json', './testQuesNet', tokenizer)
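A `train_params` dict normally only needs the keys that differ from the defaults. The sketch below shows one way such a merge could work, using the default values listed for `pretrain_quesnet`; `resolve_train_params` is a hypothetical helper, and rejecting unknown keys is a design choice of this sketch rather than documented library behaviour.

```python
# Defaults copied from the train_params description of pretrain_quesnet.
DEFAULT_TRAIN_PARAMS = {
    "n_epochs": 1,
    "batch_size": 6,
    "lr": 1e-3,
    "save_every": 0,
    "log_steps": 10,
    "device": "cpu",
    "max_steps": 0,
    "emb_size": 256,
    "feat_size": 256,
}


def resolve_train_params(train_params=None):
    """Overlay user overrides on the documented defaults.

    Unknown keys raise KeyError so that typos such as "batchsize" surface
    early (an assumption of this sketch, not of EduNLP).
    """
    overrides = dict(train_params or {})
    unknown = sorted(set(overrides) - set(DEFAULT_TRAIN_PARAMS))
    if unknown:
        raise KeyError(f"unknown train_params keys: {unknown}")
    return {**DEFAULT_TRAIN_PARAMS, **overrides}


print(resolve_train_params({"n_epochs": 3, "device": "cuda"}))
```

The resolved dict could then be passed as `train_params`, keeping the overrides for an experiment explicit and short.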