EduNLP.Pretrain

EduNLP.Pretrain.gensim_vec

class EduNLP.Pretrain.gensim_vec.GensimWordTokenizer(symbol='gm', general=False)[source]
Parameters
  • symbol (str) –

    select the methods to symbolize:

    "t": text, "f": formula, "g": figure, "m": question mark, "a": tag, "s": sep

    e.g.: gm, fgm, gmas, fgmas

  • general (bool) –

    True: tokenize formulas linearly (except formulas in figures); use this when the item isn’t in standard format.

    False: tokenize formulas with the ‘ast’ method instead of ‘linear’.

Returns

tokenizer

Return type

Tokenizer

Examples

>>> tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
>>> tokenizer = GensimWordTokenizer(symbol="fgmas", general=False)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', '[FORMULA]', '约束条件', '公式', '[FORMULA]', '[SEP]', '[FORMULA]']
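
The symbol flags can be read as a per-segment switch: a segment whose type letter appears in symbol is replaced by a placeholder token, and any other segment keeps its content. The sketch below illustrates that reading in plain Python; it is not EduNLP's implementation. The PLACEHOLDERS table and the symbolize helper are hypothetical, and the "[TEXT]" placeholder is an assumption (the other placeholder names appear in the outputs and parameter descriptions on this page).

```python
# Hypothetical mapping from symbol letters to placeholder tokens.
# "[TEXT]" is assumed; the other names appear elsewhere on this page.
PLACEHOLDERS = {
    "t": "[TEXT]",
    "f": "[FORMULA]",
    "g": "[FIGURE]",
    "m": "[MARK]",
    "a": "[TAG]",
    "s": "[SEP]",
}


def symbolize(segments, symbol):
    """Replace each (kind, content) segment whose kind letter is in `symbol`."""
    return [
        PLACEHOLDERS[kind] if kind in symbol else content
        for kind, content in segments
    ]


# Pre-split segments, given as (kind letter, raw content).
segments = [
    ("t", "see figure"),
    ("g", r"\FigureID{1}"),
    ("f", "x,y"),
    ("m", r"\SIFBlank"),
]

print(symbolize(segments, "gm"))   # formulas keep their content
print(symbolize(segments, "fgm"))  # formulas are replaced as well
```

With symbol="gm" only the figure and the question mark are replaced, which is why the first example above still shows 'x', ',', 'y' as tokens while the second one shows '[FORMULA]'.
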
batch_process(*items)[source]
EduNLP.Pretrain.gensim_vec.train_vector(items, w2v_prefix, embedding_dim=None, method='sg', binary=None, train_params=None)[source]
Parameters
  • items (list) – the tokenized text of the questions

  • w2v_prefix (str) – the path prefix of the saved model file

  • embedding_dim (int) – vector_size

  • method (str) – the method of training, e.g.: sg, cbow, fasttext, d2v, bow, tfidf

  • binary (bool) – the model format; True: bin, False: kv

  • train_params (dict) – the training parameters passed to model

Returns

the path of the saved model file

Return type

str

Examples

>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{1}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> train_vector(token_item[:10], "examples/test_model/w2v/gensim_luna_stem_t_", 100) 
'examples/test_model/w2v/gensim_luna_stem_t_sg_100.kv'
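
The return value in the example suggests how the output path is composed: the prefix, then the method name, an underscore, the embedding dimension, and an extension chosen by binary. The sketch below reproduces that pattern. build_model_path is a hypothetical helper, and the pattern is inferred from the single example output and the description of binary, not taken from EduNLP's source.

```python
def build_model_path(w2v_prefix, method="sg", embedding_dim=100, binary=False):
    """Compose an output path as prefix + method + "_" + dim + extension.

    Hypothetical helper: the naming pattern is inferred from the example
    output, and treating a missing `binary` as False is an assumption.
    """
    extension = "bin" if binary else "kv"
    return f"{w2v_prefix}{method}_{embedding_dim}.{extension}"


path = build_model_path("examples/test_model/w2v/gensim_luna_stem_t_", "sg", 100)
print(path)
```
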
class EduNLP.Pretrain.gensim_vec.GensimSegTokenizer(symbol='gms', depth=None, flatten=False, **kwargs)[source]
Parameters
  • symbol (str) –

    select the methods to symbolize:

    "t": text, "f": formula, "g": figure, "m": question mark, "a": tag, "s": sep

    e.g. gms, fgm

  • depth (int or None) – 0: only separate at SIFSep; 1: only separate at SIFTag; 2: separate at SIFTag and SIFSep; otherwise, separate all segments.

Returns

tokenizer

Return type

Tokenizer

Examples

>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{1}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> tokenizer = GensimSegTokenizer(symbol="fgm", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], ['[FORMULA]'], ['如图'], ['[FIGURE]'], ['[FORMULA]'],...['[FORMULA]'], ['最大值'], ['[MARK]']]
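
The depth rules can be pictured as choosing which segment kinds act as group boundaries. The following is a minimal sketch under stated assumptions, not the library's code: split_by_depth and the (kind, content) segment format are hypothetical, and dropping the boundary marker itself is a choice made only for this illustration.

```python
def split_by_depth(segments, depth=None):
    """Group (kind, content) segments according to the `depth` rules."""
    boundaries = {0: {"sep"}, 1: {"tag"}, 2: {"tag", "sep"}}.get(depth)
    if boundaries is None:
        # Any other depth: every segment becomes its own group.
        return [[content] for _, content in segments]

    groups, current = [], []
    for kind, content in segments:
        if kind in boundaries:
            # Assumption: the boundary marker closes the group and is dropped.
            if current:
                groups.append(current)
            current = []
        else:
            current.append(content)
    if current:
        groups.append(current)
    return groups


segments = [
    ("tag", "[TAG]"), ("text", "stem"), ("formula", "x+y"),
    ("sep", "[SEP]"), ("text", "option"), ("mark", "[MARK]"),
]

for depth in (0, 1, 2, None):
    print(depth, split_by_depth(segments, depth))
```

With depth=None every segment is wrapped in its own list, which matches the nested shape of the outputs printed in the examples above.
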

EduNLP.Pretrain.elmo_vec

class EduNLP.Pretrain.elmo_vec.ElmoTokenizer(path: Optional[str] = None)[source]

Examples

>>> t=ElmoTokenizer()
>>> items = ["有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"]
>>> t.tokenize(items[0])
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']
>>> len(t)
18
tokenize(item: (<class 'str'>, <class 'list'>), freeze_vocab=False, return_length=False)[source]
to_index(item: list, max_length=128, pad_to_max_length=False)[source]
append(item)[source]
save_vocab(path)[source]
load_vocab(path)[source]
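
The max_length and pad_to_max_length arguments of to_index suggest the usual truncate-then-pad behaviour. The sketch below shows one way this could work on a plain dictionary vocabulary. It is an illustration only: the standalone to_index function, the toy vocab, and the "<unk>" / "<pad>" token names are assumptions and are not taken from ElmoTokenizer.

```python
def to_index(tokens, vocab, max_length=128, pad_to_max_length=False,
             unk_token="<unk>", pad_token="<pad>"):
    """Map tokens to ids, truncating to max_length and optionally padding."""
    ids = [vocab.get(tok, vocab[unk_token]) for tok in tokens[:max_length]]
    if pad_to_max_length:
        ids += [vocab[pad_token]] * (max_length - len(ids))
    return ids


# Toy vocabulary (hypothetical).
vocab = {"<pad>": 0, "<unk>": 1, "x": 2, "+": 3, "y": 4}

padded = to_index(["x", "+", "z"], vocab, max_length=5, pad_to_max_length=True)
truncated = to_index(["x", "+", "y", "x"], vocab, max_length=2)
print(padded, truncated)
```
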
class EduNLP.Pretrain.elmo_vec.ElmoDataset(texts: list, tokenizer: ElmoTokenizer, max_length=128)[source]
EduNLP.Pretrain.elmo_vec.elmo_collate_fn(batch_data)[source]
EduNLP.Pretrain.elmo_vec.train_elmo(texts: list, output_dir: str, pretrained_dir: Optional[str] = None, emb_dim=512, hid_dim=512, batch_size=2, epochs=3, lr: float = 0.0005, device=None)[source]
Parameters
  • texts (list, required) – The training corpus of shape (text_num, token_num); each text must already be tokenized into tokens

  • output_dir (str, required) – The directory to save trained model files

  • pretrained_dir (str, optional) – The pretrained model files’ directory

  • emb_dim (int, optional, default=512) – The embedding dim

  • hid_dim (int, optional, default=512) – The hidden dim

  • batch_size (int, optional, default=2) – The training batch size

  • epochs (int, optional, default=3) – The training epochs

  • lr (float, optional, default=5e-4) – The learning rate

  • device (str, optional) – Default is ‘cuda’ if available, otherwise ‘cpu’

Returns

output_dir – The directory where the trained model files are saved

Return type

str

EduNLP.Pretrain.bert_vec

class EduNLP.Pretrain.bert_vec.BertTokenizer(pretrain_model='bert-base-chinese', add_special_tokens=False, text_tokenizer=None)[source]
Parameters
  • pretrain_model – the name or path of the pretrained model to use

  • add_special_tokens – Whether to add tokens like [FIGURE], [TAG], etc.

  • text_tokenizer – Which text tokenizer to use. Must be consistent with TOKENIZER dictionary.

Examples

>>> tokenizer = BertTokenizer(add_special_tokens=True)
>>> item = "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"
>>> token_item = tokenizer(item)
>>> print(token_item.input_ids[:10])
[101, 1062, 2466, 1963, 1745, 21129, 166, 117, 167, 5276]
>>> print(tokenizer.tokenize(item)[:10])
['公', '式', '如', '图', '[FIGURE]', 'x', ',', 'y', '约', '束']
>>> items = [item, item]
>>> token_items = tokenizer(items, return_tensors='pt')
>>> print(token_items.input_ids.shape)
torch.Size([2, 27])
>>> print(len(tokenizer.tokenize(items)))
2
>>> tokenizer.save_pretrained('test_dir') 
>>> tokenizer = BertTokenizer.from_pretrained('test_dir') 
tokenize(item: Union[list, str], *args, **kwargs)[source]
save_pretrained(tokenizer_config_dir)[source]
classmethod from_pretrained(tokenizer_config_dir)[source]
EduNLP.Pretrain.bert_vec.finetune_bert(items, output_dir, pretrain_model='bert-base-chinese', train_params=None)[source]
Parameters
  • items (list of dict) – the tokenization results of the questions

  • output_dir (str) – the path to save the model

  • pretrain_model (str) – the name or path of pre-trained model

  • train_params (dict) – the training parameters passed to Trainer

Examples

>>> tokenizer = BertTokenizer()
>>> stems = ["有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$",
... "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$"]
>>> token_item = [tokenizer(i) for i in stems]
>>> print(token_item[0].keys())
dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
>>> finetune_bert(token_item, "examples/test_model/data/data/bert") 
{'train_runtime': ..., ..., 'epoch': 1.0}

EduNLP.Pretrain.disenqnet_vec

EduNLP.Pretrain.disenqnet_vec.check_num(s)[source]
EduNLP.Pretrain.disenqnet_vec.list_to_onehot(item_list, item2index)[source]
EduNLP.Pretrain.disenqnet_vec.load_list(path)[source]
EduNLP.Pretrain.disenqnet_vec.save_list(item2index, path)[source]
class EduNLP.Pretrain.disenqnet_vec.DisenQTokenizer(vocab_path=None, max_length=250, tokenize_method='space', num_token='<num>', unk_token='<unk>', pad_token='<pad>', *args, **argv)[source]

Examples

>>> tokenizer = DisenQTokenizer()
>>> test_items = [{
...     "content": "甲 数 除以 乙 数 的 商 是 1.5 , 如果 甲 数 增加 20 , 则 甲 数 是 乙 的 4 倍 . 原来 甲 数 = .",
...     "knowledge": ["*", "-", "/"], "difficulty": 0.2, "length": 7}]
>>> tokenizer.set_vocab(test_items,
...     trim_min_count=1, key=lambda x: x["content"], silent=True)
>>> token_items = [tokenizer(i, key=lambda x: x["content"]) for i in test_items]
>>> print(token_items[0].keys())
dict_keys(['content_idx', 'content_len'])
set_text_tokenizer(tokenize_method)[source]
tokenize(items: (<class 'list'>, <class 'str'>, <class 'dict'>), key=<function DisenQTokenizer.<lambda>>, **kwargs)[source]
Parameters
  • items (list or str or dict) – the question items

  • key (function) – determines how to get the text of each item

Returns

tokens – the tokens of the items

Return type

list

load_vocab(path)[source]
set_vocab(items: list, key=<function DisenQTokenizer.<lambda>>, trim_min_count=1, silent=True)[source]
Parameters
  • items (list) – a list of str or a list of dict

  • key (function) – determines how to get the text of each item

save_vocab(save_vocab_path)[source]
classmethod from_pretrained(tokenizer_config_dir)[source]
tokenizer_config_dir: str

must contain tokenizer_config.json and vocab.list

save_pretrained(tokenizer_config_dir)[source]
tokenizer_config_dir: str

save tokenizer params in tokenizer_config.json and save words in vocab.list

property vocab_size
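
The constructor arguments (tokenize_method='space', num_token, unk_token, pad_token, max_length) together with set_vocab's trim_min_count suggest a pipeline of whitespace splitting, number replacement, frequency trimming, and index lookup. The sketch below walks through that pipeline in plain Python. It is a simplified illustration, not DisenQTokenizer's implementation: is_number, build_vocab and encode are hypothetical names, and padding content_idx up to max_length is an assumption.

```python
from collections import Counter


def is_number(s):
    """Return True if the string parses as a float."""
    try:
        float(s)
        return True
    except ValueError:
        return False


def build_vocab(texts, trim_min_count=1, num_token="<num>",
                unk_token="<unk>", pad_token="<pad>"):
    """Count space-separated tokens and keep those seen often enough."""
    counts = Counter(
        num_token if is_number(tok) else tok
        for text in texts for tok in text.split()
    )
    words = [w for w, c in counts.items()
             if c >= trim_min_count and w != num_token]
    tokens = [pad_token, unk_token, num_token] + sorted(words)
    return {tok: i for i, tok in enumerate(tokens)}


def encode(text, vocab, max_length=250, num_token="<num>",
           unk_token="<unk>", pad_token="<pad>"):
    """Turn one text into padded indices plus its unpadded length."""
    toks = [num_token if is_number(t) else t for t in text.split()][:max_length]
    idx = [vocab.get(t, vocab[unk_token]) for t in toks]
    length = len(idx)
    idx += [vocab[pad_token]] * (max_length - length)  # assumed padding step
    return {"content_idx": idx, "content_len": length}


texts = ["a b 1.5 a", "a c 20"]
vocab = build_vocab(texts, trim_min_count=2)
encoded = encode("a b 7", vocab, max_length=5)
print(vocab)
print(encoded)
```

The returned keys mirror the dict_keys(['content_idx', 'content_len']) shown in the example above; everything else in the sketch is illustrative.
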
class EduNLP.Pretrain.disenqnet_vec.QuestionDataset(items, disen_tokenizer, predata_dir, dataset_type, silent=False, embed_dim=128, trim_min_count=50, data_formation=None, w2v_workers=1)[source]

Question dataset including text, length, concept Tensors

process_dataset(items, trim_min_count, embed_dim, init=False)[source]
collate_data(batch_data)[source]
EduNLP.Pretrain.disenqnet_vec.train_disenqnet(train_items, disen_tokenizer, output_dir, predata_dir, train_params=None, test_items=None, silent=False, data_formation=None)[source]
Parameters
  • train_items (list) – the raw train question list

  • disen_tokenizer (DisenQTokenizer) – the initial DisenQTokenizer used for training.

  • output_dir (str) – the path to save the model

  • predata_dir (str) – the dirname to load or save predata (including wv.th, vocab.list and concept.list)

  • train_params (dict, defaults to None) –

    the training parameters for data, model and trainer.

    • ”trim_min”: int

      data param, the trim_min_count for vocab and word2vec, by default 2

    • ”w2v_workers”: int

      data param, the number of workers for word2vec, by default 1

    • ”hidden”: int

      model param, by default 128

    • ”dropout”: float

      model param, dropout rate, by default 0.2

    • ”pos_weight”: int

      model param, positive sample weight in unbalanced multi-label concept classifier, by default 1

    • ”cp”: float

      model param, weight of concept loss, by default 1.5

    • ”mi”: float

      model param, weight of mutual information loss, by default 1.0

    • ”dis”: float

      model param, weight of disentangling loss, by default 2.0

    • ”epoch”: int

      train param, number of epoch, by default 1

    • ”batch”: int

      train param, batch size, by default 64

    • ”lr”: float

      train param, learning rate, by default 1e-3

    • ”step”: int

      train param, step_size for StepLR, by default 20

    • ”gamma”: float

      train param, gamma for StepLR, by default 0.5

    • ”warm_up”: int

      train param, number of epoch for warming up, by default 1

    • ”adv”: int

      train param, ratio of disc/enc training for adversarial process, by default 10

    • ”device”: str

      train param, ‘cpu’ or ‘cuda’, by default “cpu”

  • test_items (list, defaults to None) – the raw test question list, default is None

  • silent (bool, defaults to False) – whether to print processing information

  • data_formation (dict, defaults to None) – Mapping “content” and “knowledge” for the item formation. For example, {“content”: “ques_content”, “knowledge”: “know_name”}
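
The data_formation mapping can be understood as a key translation from the names used in a dataset to the "content" and "knowledge" names. A minimal sketch of that idea follows; apply_data_formation is a hypothetical helper written for illustration and is not part of EduNLP.

```python
def apply_data_formation(item, data_formation=None):
    """Return a dict exposing "content" and "knowledge" read from `item`."""
    # With no mapping, assume the item already uses the standard key names.
    data_formation = data_formation or {"content": "content", "knowledge": "knowledge"}
    return {target: item[source] for target, source in data_formation.items()}


raw = {"ques_content": "1 + 1 = ?", "know_name": ["addition"], "ques_id": "q1"}
formation = {"content": "ques_content", "knowledge": "know_name"}

normalized = apply_data_formation(raw, formation)
print(normalized)
```
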

Examples

>>> train_data = load_items("static/test_data/disenq_train.json")[:100]
>>> test_data = load_items("static/test_data/disenq_test.json")[:100]
>>> tokenizer = DisenQTokenizer(max_length=250, tokenize_method="space")
>>> train_disenqnet(train_data, tokenizer,
... "examples/test_model/disenq","examples/test_model/disenq", silent=True)  
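
Because train_params only needs to list the values that differ from the defaults, it can help to see the documented defaults as one dictionary that user values are overlaid on. The sketch below does that and also shows how "step" and "gamma" act in a StepLR-style schedule (the learning rate is multiplied by gamma every step epochs). resolve_train_params and step_lr are hypothetical helpers; the default values are copied from the parameter list above, and rejecting unknown keys is a choice of this sketch rather than documented behaviour.

```python
# Defaults as documented in the train_params description.
DEFAULT_TRAIN_PARAMS = {
    # data params
    "trim_min": 2, "w2v_workers": 1,
    # model params
    "hidden": 128, "dropout": 0.2, "pos_weight": 1,
    "cp": 1.5, "mi": 1.0, "dis": 2.0,
    # train params
    "epoch": 1, "batch": 64, "lr": 1e-3, "step": 20, "gamma": 0.5,
    "warm_up": 1, "adv": 10, "device": "cpu",
}


def resolve_train_params(train_params=None):
    """Overlay user-supplied values on the documented defaults."""
    overrides = train_params or {}
    unknown = set(overrides) - set(DEFAULT_TRAIN_PARAMS)
    if unknown:
        # Assumption: failing on unknown keys is specific to this sketch.
        raise KeyError(f"unknown train_params: {sorted(unknown)}")
    return {**DEFAULT_TRAIN_PARAMS, **overrides}


def step_lr(base_lr, epoch, step, gamma):
    """Learning rate at `epoch` when decaying by `gamma` every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


params = resolve_train_params({"epoch": 40, "lr": 0.01})
schedule = [step_lr(params["lr"], e, params["step"], params["gamma"])
            for e in (0, 19, 20, 39)]
print(params["batch"], schedule)
```
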

EduNLP.Pretrain.quesnet_vec

Pre-process input text, tokenizing, building vocabs, and pre-train word level vectors.

class EduNLP.Pretrain.quesnet_vec.Question(id, content, answer, false_options, labels)
property answer

Alias for field number 2

property content

Alias for field number 1

property false_options

Alias for field number 3

property id

Alias for field number 0

property labels

Alias for field number 4
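
The "Alias for field number N" descriptions are the docstrings that collections.namedtuple generates for its fields, which suggests that Question is a named tuple. The sketch below builds an equivalent type for illustration, with the field order taken from the class signature above; the sample values are made up.

```python
from collections import namedtuple

# Field order follows the class signature documented above.
Question = namedtuple(
    "Question", ["id", "content", "answer", "false_options", "labels"]
)

# Sample values are invented for illustration.
q = Question(id="q1", content=["token"], answer=["a"],
             false_options=[["b"]], labels={"know": 1})

print(q.content is q[1])          # fields are also reachable by position
print(Question.content.__doc__)   # the generated field docstring
```
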

EduNLP.Pretrain.quesnet_vec.save_list(item2index, path)[source]
class EduNLP.Pretrain.quesnet_vec.QuesNetTokenizer(img_dir=None, vocab_path=None, max_length=250, meta=None, img_token='<img>', unk_token='<unk>', pad_token='<pad>', *args, **argv)[source]

Examples

>>> tokenizer = QuesNetTokenizer(meta=['knowledge'])
>>> test_items = [{"ques_content": "$\triangle A B C$ 的内角为 $A, \quad B, $\FigureID{test_id}$",
... "knowledge": "['*', '-', '/']"}, {"ques_content": "$\triangle A B C$ 的内角为 $A, \quad B",
... "knowledge": "['*', '-', '/']"}]
>>> tokenizer.set_vocab(test_items,
... trim_min_count=1, key=lambda x: x["ques_content"], silent=True)
>>> token_items = [tokenizer(i, key=lambda x: x["ques_content"]) for i in test_items]
>>> print(token_items[0].keys())
dict_keys(['content_idx', 'meta_idx'])
>>> token_items = tokenizer(test_items, key=lambda x: x["ques_content"])
>>> print(len(token_items["content_idx"]))
2
tokenize(item: ~typing.Union[str, dict, list], key=<function QuesNetTokenizer.<lambda>>, *args, **kwargs)[source]
load_vocab(path)[source]
Parameters

path (str) – path of the vocabulary files; it must be a directory containing word.txt (meta.txt is optional)

set_vocab(items: list, key=<function QuesNetTokenizer.<lambda>>, trim_min_count=50, silent=True)[source]
Parameters
  • items (list) – a list of str or a list of dict

  • key (function) – determines how to get the text of each item

  • trim_min_count

  • silent

save_vocab(save_vocab_path)[source]
Parameters

save_vocab_path (str) – path to save word vocabulary and meta vocabulary

classmethod from_pretrained(tokenizer_config_dir, img_dir=None)[source]
tokenizer_config_dir: str

must contain tokenizer_config.json, vocab/word.txt and vocab/meta_{meta_name}.txt

img_dir: str

the path of the image directory, default None

save_pretrained(tokenizer_config_dir)[source]
tokenizer_config_dir: str

save tokenizer params in tokenizer_config.json and save words in vocab.list

padding(idx, max_length, type='word')[source]
property vocab_size
set_img_dir(path)[source]
EduNLP.Pretrain.quesnet_vec.clip(v, low, high)[source]
class EduNLP.Pretrain.quesnet_vec.Lines(filename, skip=0, preserve_newline=False)[source]
class EduNLP.Pretrain.quesnet_vec.QuestionLoader(ques_file, tokenizer: ~EduNLP.Pretrain.quesnet_vec.QuesNetTokenizer, pipeline=None, range=None, meta: ~typing.Optional[list] = None, content_key=<function QuestionLoader.<lambda>>, meta_key=<function QuestionLoader.<lambda>>, answer_key=<function QuestionLoader.<lambda>>, option_key=<function QuestionLoader.<lambda>>, skip=0)[source]
split_(split_ratio)[source]
EduNLP.Pretrain.quesnet_vec.optimizer(*models, **kwargs)[source]
class EduNLP.Pretrain.quesnet_vec.PrefetchIter(data, *label, length=None, batch_size=1, shuffle=True)[source]

Iterator on data and labels, with states for save and restore.

produce()[source]
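
An iterator "with states for save and restore" needs to remember both the order in which items are visited and how far it has advanced. The sketch below shows that idea with a small batch iterator. It is not PrefetchIter: the ResumableBatches class, its state / restore method names, and the seed argument are all hypothetical.

```python
import random


class ResumableBatches:
    """Batch iterator whose position and shuffle order can be saved and restored."""

    def __init__(self, data, batch_size=1, shuffle=True, seed=0):
        self.data = list(data)
        self.batch_size = batch_size
        self.order = list(range(len(self.data)))
        if shuffle:
            random.Random(seed).shuffle(self.order)
        self.pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.pos >= len(self.order):
            raise StopIteration
        picked = self.order[self.pos:self.pos + self.batch_size]
        self.pos += self.batch_size
        return [self.data[i] for i in picked]

    def state(self):
        """Snapshot of the visiting order and the current position."""
        return {"order": list(self.order), "pos": self.pos}

    def restore(self, state):
        """Resume from a snapshot produced by state()."""
        self.order, self.pos = list(state["order"]), state["pos"]


batches = ResumableBatches(range(10), batch_size=4)
first = next(batches)
snapshot = batches.state()      # save after the first batch
remaining = list(batches)

resumed = ResumableBatches(range(10), batch_size=4, seed=99)
resumed.restore(snapshot)       # continue where the snapshot was taken
print(list(resumed) == remaining)
```
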
class EduNLP.Pretrain.quesnet_vec.EmbeddingDataset(data, data_type='image')[source]
EduNLP.Pretrain.quesnet_vec.pretrain_iter(ques, batch_size)[source]
EduNLP.Pretrain.quesnet_vec.critical(f)[source]
EduNLP.Pretrain.quesnet_vec.pretrain_embedding_layer(dataset: EmbeddingDataset, ae: AE, lr: float = 0.001, log_step: int = 1, epochs: int = 3, batch_size: int = 4, device=device(type='cpu'))[source]
EduNLP.Pretrain.quesnet_vec.pretrain_quesnet(path, output_dir, tokenizer, save_embs=False, train_params=None)[source]

Pretrain QuesNet.

Parameters
  • path (str) – path of question file

  • output_dir (str) – output path

  • tokenizer (QuesNetTokenizer) – quesnet tokenizer

  • save_embs (bool, optional) – whether to save pretrained word/image/meta embeddings separately

  • train_params (dict, optional) –

    the training parameters and model parameters, by default None

    • ”n_epochs”: int, default = 1

      train param, number of epochs

    • ”batch_size”: int, default = 6

      train param, batch size

    • ”lr”: float, default = 1e-3

      train param, learning rate

    • ”save_every”: int, default = 0

      train param, save steps interval

    • ”log_steps”: int, default = 10

      train param, log steps interval

    • ”device”: str, default = ‘cpu’

      train param, ‘cpu’ or ‘cuda’

    • ”max_steps”: int, default = 0

      train param, stop training when reach max steps

    • ”emb_size”: int, default = 256

      model param, the embedding size of word, figure, meta info

    • ”feat_size”: int, default = 256

      model param, the size of question infer vector

Examples

>>> tokenizer = QuesNetTokenizer(meta=['know_name'])
>>> items = [{"ques_content": "若复数$z=1+2 i+i^{3}$,则$|z|=$,$\FigureID{000004d6-0479-11ec-829b-797d5eb43535}$",
... "ques_id": "726cdbec-33a9-11ec-909c-98fa9b625adb",
... "know_name": "['代数', '集合', '集合的相等']"
... }]
>>> tokenizer.set_vocab(items, key=lambda x: x['ques_content'], trim_min_count=1, silent=True)
>>> pretrain_quesnet('./data/quesnet_data.json', './testQuesNet', tokenizer)
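
The step-related train params ("log_steps", "save_every", "max_steps") are easiest to read as checks inside a step loop. The dry-run sketch below only counts steps and records when logging and saving would happen; it performs no training and is not the pretrain_quesnet loop. run_steps is a hypothetical helper, and treating 0 as "disabled" for save_every and max_steps is an assumption based on their defaults.

```python
def run_steps(n_batches, n_epochs=1, max_steps=0, log_steps=10, save_every=0):
    """Return (steps_run, logged_steps, saved_steps) for a dry-run loop."""
    step, logged, saved = 0, [], []
    for _ in range(n_epochs):
        for _ in range(n_batches):
            step += 1  # one optimisation step would run here
            if log_steps and step % log_steps == 0:
                logged.append(step)
            if save_every and step % save_every == 0:
                saved.append(step)
            # Assumption: max_steps == 0 means "no limit".
            if max_steps and step >= max_steps:
                return step, logged, saved
    return step, logged, saved


full_run = run_steps(25, n_epochs=2)
capped_run = run_steps(25, n_epochs=2, max_steps=12, save_every=5)
print(full_run)
print(capped_run)
```
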