EduNLP.Pretrain¶

EduNLP.Pretrain.pretrian_utils¶

class EduNLP.Pretrain.pretrian_utils.EduVocab(vocab_path: Optional[str] = None, corpus_items: Optional[List[str]] = None, bos_token: str = '[BOS]', eos_token: str = '[EOS]', pad_token: str = '[PAD]', unk_token: str = '[UNK]', specials: Optional[List[str]] = None, lower: bool = False, trim_min_count: int = 1, **kwargs)[source]¶

The vocabulary container for a corpus.

Parameters

vocab_path (str, optional) – vocabulary path to initialize this container, by default None
corpus_items (List[str], optional) – corpus items to update this vocabulary, by default None
bos_token (str, optional) – token representing the start of a sentence, by default “[BOS]”
eos_token (str, optional) – token representing the end of a sentence, by default “[EOS]”
pad_token (str, optional) – token used for padding, by default “[PAD]”
unk_token (str, optional) – token representing an unknown word, by default “[UNK]”
specials (List[str], optional) – special tokens in the vocabulary, by default None
lower (bool, optional) – whether to lowercase the corpus items, by default False
trim_min_count (int, optional) – the lower bound number for adding a word into vocabulary, by default 1

property vocab_size¶

property special_tokens¶

property tokens¶

to_idx(token)[source]¶: convert token to index

to_token(idx)[source]¶: convert index to token

convert_sequence_to_idx(tokens, bos=False, eos=False)[source]¶: convert a sequence of tokens to a sequence of indices

convert_sequence_to_token(idxs, **kwargs)[source]¶: convert a sequence of indices to a sequence of tokens

set_vocab(corpus_items: List[str], lower: bool = False, trim_min_count: int = 1, silent=True)[source]¶

Update the vocabulary with the tokens in corpus items

Parameters

corpus_items (List[str]) – corpus items to update this vocabulary
lower (bool, optional) – whether to lowercase the corpus items, by default False
trim_min_count (int, optional) – the lower bound number for adding a word into vocabulary, by default 1
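
The effect of lower and trim_min_count can be sketched with a simple frequency filter. This is a minimal illustration, not EduNLP's implementation: the helper name build_token_list is hypothetical, the corpus items are assumed to be already split into token lists, and a token is assumed to be kept when it occurs at least trim_min_count times.

```python
from collections import Counter
from typing import List


def build_token_list(corpus_items: List[List[str]], lower: bool = False,
                     trim_min_count: int = 1) -> List[str]:
    """Hypothetical helper: collect tokens that occur at least trim_min_count times."""
    counter = Counter()
    for tokens in corpus_items:
        if lower:
            tokens = [t.lower() for t in tokens]
        counter.update(tokens)
    # Counter keeps first-seen order, so the result is deterministic.
    return [tok for tok, cnt in counter.items() if cnt >= trim_min_count]


corpus = [["x", "+", "Y"], ["x", "=", "y"]]
print(build_token_list(corpus, lower=True, trim_min_count=2))  # ['x', 'y']
```

With the defaults (lower=False, trim_min_count=1) every distinct token is kept, and "Y" and "y" count as different tokens.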

load_vocab(vocab_path: str)[source]¶

Load the vocabulary from vocab_file

Parameters: vocab_path (str) – path of the vocabulary file to load

save_vocab(vocab_path: str)[source]¶

Save the vocabulary into vocab_file

Parameters: vocab_path (str) – path to save vocabulary file

add_specials(tokens: List[str])[source]¶: Add special tokens into vocabulary

add_tokens(tokens: List[str])[source]¶: Add tokens into vocabulary
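
The conversion methods above can be illustrated with a small stand-in class. This is a sketch under stated assumptions rather than the EduVocab source: MiniVocab is a hypothetical name, the special tokens are assumed to occupy the lowest indices in the order shown, and unknown tokens are assumed to map to the index of “[UNK]”.

```python
from typing import List, Optional


class MiniVocab:
    """Hypothetical stand-in for a vocabulary container; not the EduNLP class."""

    def __init__(self, tokens: Optional[List[str]] = None):
        # Assumption: special tokens come first and get the lowest indices.
        self.idx_to_token = ["[PAD]", "[UNK]", "[BOS]", "[EOS]"]
        self.token_to_idx = {t: i for i, t in enumerate(self.idx_to_token)}
        self.add_tokens(tokens or [])

    def add_tokens(self, tokens: List[str]) -> None:
        for tok in tokens:
            if tok not in self.token_to_idx:
                self.token_to_idx[tok] = len(self.idx_to_token)
                self.idx_to_token.append(tok)

    def to_idx(self, token: str) -> int:
        # Unknown tokens fall back to the index of [UNK].
        return self.token_to_idx.get(token, self.token_to_idx["[UNK]"])

    def to_token(self, idx: int) -> str:
        return self.idx_to_token[idx]

    def convert_sequence_to_idx(self, tokens, bos=False, eos=False):
        ids = [self.to_idx(t) for t in tokens]
        if bos:
            ids = [self.token_to_idx["[BOS]"]] + ids
        if eos:
            ids = ids + [self.token_to_idx["[EOS]"]]
        return ids

    def convert_sequence_to_token(self, idxs):
        return [self.to_token(i) for i in idxs]


vocab = MiniVocab(["x", "y"])
ids = vocab.convert_sequence_to_idx(["x", "z", "y"], bos=True, eos=True)
print(ids)                                   # [2, 4, 1, 5, 3]
print(vocab.convert_sequence_to_token(ids))  # ['[BOS]', 'x', '[UNK]', 'y', '[EOS]']
```

The round trip is lossy for out-of-vocabulary tokens: "z" comes back as “[UNK]”.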

class EduNLP.Pretrain.pretrian_utils.EduDataset(tokenizer, ds_disk_path: Optional[Dataset] = None, items: Optional[Union[List[dict], List[str]]] = None, stem_key: str = 'text', label_key: Optional[str] = None, feature_keys: Optional[List[str]] = None, num_processor: Optional[int] = None, **kwargs)[source]¶

A base Dataset class that wraps datasets.Dataset and adds conveniences such as parallel preprocessing and offline loading.

Parameters

tokenizer – PretrainedEduTokenizer or model-specific Pretrained Tokenizer
ds_disk_path (HFDataset, optional) – the path on disk where the dataset used by datasets.Dataset is saved, by default None
items (Union[List[dict], List[str]], optional) – input items to process, by default None
stem_key (str, optional) – the key of the item content to process, by default “text”
label_key (Optional[str], optional) – the key of the item labels to process, by default None
feature_keys (Optional[List[str]], optional) – the keys of additional item features to keep, by default None
num_processor (int, optional) – the number of CPUs to use for parallel speedup, by default None

ds¶

Note: map will break down for very large data (greater than 4GB)

to_disk(ds_disk_path)[source]¶: Save the processed dataset into local files

collect_fn()[source]¶
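
How stem_key, label_key and feature_keys select fields from the input items can be sketched as follows. This is an illustration only, assuming the three parameters are dictionary keys and that a bare string item is treated as the stem itself; select_fields is a hypothetical helper, not part of EduDataset.

```python
from typing import Dict, List, Optional, Union


def select_fields(items: Union[List[dict], List[str]], stem_key: str = "text",
                  label_key: Optional[str] = None,
                  feature_keys: Optional[List[str]] = None) -> List[Dict]:
    """Hypothetical helper: keep only the fields a dataset would process."""
    records = []
    for item in items:
        if isinstance(item, str):
            # Assumption: a bare string is treated as the stem itself.
            item = {stem_key: item}
        record = {stem_key: item[stem_key]}
        if label_key is not None:
            record[label_key] = item[label_key]
        for key in feature_keys or []:
            record[key] = item[key]
        records.append(record)
    return records


items = [{"text": "1 + 1 = ?", "difficulty": 0.2, "length": 7, "source": "demo"}]
print(select_fields(items, label_key="difficulty", feature_keys=["length"]))
```

Fields that are not named by any of the three parameters ("source" here) are dropped in this sketch.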

class EduNLP.Pretrain.pretrian_utils.PretrainedEduTokenizer(vocab_path: Optional[str] = None, max_length: int = 250, tokenize_method: str = 'pure_text', add_specials: Tuple[list, bool] = False, **kwargs)[source]¶

This base class is in charge of preparing the inputs for a model

Parameters

vocab_path (str, optional) – path of the vocabulary file used to initialize the tokenizer, by default None
max_length (int, optional) – sentences longer than max_length are clipped, by default 250
tokenize_method (str, optional) – by default “pure_text”. Use “space” when the text is already separated by spaces; when the text is a raw string, use a Tokenizer defined in get_tokenizer(), such as “pure_text” or “text”
add_specials (Tuple[list, bool], optional) – by default False. If a bool, whether to add EDU_SPYMBOLS to the vocabulary; if a list, the special tokens to add besides EDU_SPYMBOLS

tokenize(items: ~typing.Tuple[list, str, dict], key=<function PretrainedEduTokenizer.<lambda>>, **kwargs)[source]¶

Parameters

items (list or str or dict) – the question items
key (function) – determine how to get the text of each item

Returns

tokens – the tokens of the items

Return type

list

encode(items: ~typing.Tuple[str, dict, ~typing.List[str], ~typing.List[dict]], key=<function PretrainedEduTokenizer.<lambda>>, **kwargs)[source]¶

decode(token_ids: list, key=<function PretrainedEduTokenizer.<lambda>>, **kwargs)[source]¶

classmethod from_pretrained(tokenizer_config_dir: str, **kwargs)[source]¶

Load tokenizer from local files

Parameters:¶

tokenizer_config_dir: str: The dir path containing tokenizer_config.json and vocab.list

save_pretrained(tokenizer_config_dir: str)[source]¶

Save tokenizer into local files

Parameters:¶

tokenizer_config_dir: str: save tokenizer params in /tokenizer_config.json and save words in /vocab.list
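
The on-disk layout described by from_pretrained and save_pretrained can be sketched as a round trip through a temporary directory. This is a minimal sketch, not the library code: the two helper functions are hypothetical, the config keys are taken from the constructor signature, and vocab.list is assumed to store one token per line.

```python
import json
import os
import tempfile


def save_tokenizer_files(config: dict, tokens: list, tokenizer_config_dir: str) -> None:
    """Hypothetical helper: write the two files named in the docstrings above."""
    os.makedirs(tokenizer_config_dir, exist_ok=True)
    config_path = os.path.join(tokenizer_config_dir, "tokenizer_config.json")
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, ensure_ascii=False, indent=2)
    # Assumption: vocab.list stores one token per line.
    vocab_path = os.path.join(tokenizer_config_dir, "vocab.list")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(tokens))


def load_tokenizer_files(tokenizer_config_dir: str):
    """Hypothetical helper: read back what save_tokenizer_files wrote."""
    config_path = os.path.join(tokenizer_config_dir, "tokenizer_config.json")
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    vocab_path = os.path.join(tokenizer_config_dir, "vocab.list")
    with open(vocab_path, encoding="utf-8") as f:
        tokens = f.read().split("\n")
    return config, tokens


with tempfile.TemporaryDirectory() as tmp_dir:
    save_tokenizer_files({"max_length": 250, "tokenize_method": "pure_text"},
                         ["[PAD]", "[UNK]", "x"], tmp_dir)
    config, tokens = load_tokenizer_files(tmp_dir)

print(config["max_length"], tokens)  # 250 ['[PAD]', '[UNK]', 'x']
```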

property vocab_size¶

set_vocab(items: list, key=<function PretrainedEduTokenizer.<lambda>>, lower: bool = False, trim_min_count: int = 1, do_tokenize: bool = True)[source]¶

Update the vocabulary with the tokens in corpus items

Parameters

items (list) – can be the list of str, or list of dict
key (function, optional) – determine how to get the text of each item
lower (bool, optional) – whether to lowercase the corpus items, by default False
trim_min_count (int, optional) – the lower bound number for adding a word into vocabulary, by default 1
do_tokenize (bool, optional) – whether to tokenize items before updating the vocabulary, by default True

Returns

token_items

Return type

list

add_specials(tokens)[source]¶: Add special tokens into vocabulary

add_tokens(tokens)[source]¶: Add tokens into vocabulary

EduNLP.Pretrain.hugginface_utils¶

class EduNLP.Pretrain.hugginface_utils.TokenizerForHuggingface(pretrained_model='bert-base-chinese', max_length=512, tokenize_method: str = 'pure_text', add_specials: Union[List[str], bool] = False, **kwargs)[source]¶

Parameters¶

pretrained_model: the pretrained model to use
add_specials: whether to add tokens like [FIGURE], [TAG], etc.
tokenize_method: which text tokenizer to use; must be consistent with the TOKENIZER dictionary

Examples

>>> tokenizer = TokenizerForHuggingface(add_special_tokens=True)
>>> item = "有公式$\FormFigureID{wrong1?}$，如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$，则$z=x+7 y$的最大值为$\SIFBlank$"
>>> token_item = tokenizer(item)
>>> print(token_item.input_ids[:10])
tensor([[ 101, 1062, 2466, 1963, 1745,  138,  100,  140,  166,  117,  167, 5276,
         3338, 3340,  816, 1062, 2466,  102,  168,  134,  166,  116,  128,  167,
         3297, 1920,  966,  138,  100,  140,  102]])
>>> print(tokenizer.tokenize(item)[:10])
['公', '式', '如', '图', '[', '[UNK]', ']', 'x', ',', 'y']
>>> items = [item, item]
>>> token_items = tokenizer(items, return_tensors='pt')
>>> print(token_items.input_ids.shape)
torch.Size([2, 31])
>>> print(len(tokenizer.tokenize(items)))
2
>>> tokenizer.save_pretrained('test_dir') 
>>> tokenizer = TokenizerForHuggingface.from_pretrained('test_dir') 

tokenize(items: ~typing.Union[list, str, dict], key=<function TokenizerForHuggingface.<lambda>>, **kwargs)[source]¶

encode(items: ~typing.Tuple[str, dict, ~typing.List[str], ~typing.List[dict]], key=<function TokenizerForHuggingface.<lambda>>, **kwargs)[source]¶

decode(token_ids: list, key=<function TokenizerForHuggingface.<lambda>>, **kwargs)[source]¶

classmethod from_pretrained(tokenizer_config_dir, **kwargs)[source]¶

save_pretrained(tokenizer_config_dir)[source]¶

property vocab_size¶

set_vocab(items: ~typing.Tuple[~typing.List[str], ~typing.List[dict]], key=<function TokenizerForHuggingface.<lambda>>, lower=False, trim_min_count: int = 1, do_tokenize: bool = True)[source]¶

Parameters

items (list) – can be the list of str, or list of dict
key (function) – determine how to get the text of each item
trim_min_count (int, optional) – the lower bound number for adding a word into vocabulary, by default 1
do_tokenize (bool, optional) – whether to tokenize items before updating the vocabulary, by default True

add_specials(added_spectials: List[str])[source]¶

add_tokens(added_tokens: List[str])[source]¶

EduNLP.Pretrain.gensim_vec¶

class EduNLP.Pretrain.gensim_vec.GensimWordTokenizer(symbol='gm', general=False)[source]¶

Parameters

symbol (str) – select the methods to symbolize: “t”: text, “f”: formula, “g”: figure, “m”: question mark, “a”: tag, “s”: sep; e.g. gm, fgm, gmas, fgmas
general (bool) – True: when the item isn’t in standard format and formulas (except formulas in figures) should be tokenized linearly. False: when the ‘ast’ method is used to tokenize formulas instead of ‘linear’.

Returns

tokenizer

Return type

Tokenizer

Examples

>>> tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$，如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$，则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
>>> tokenizer = GensimWordTokenizer(symbol="fgmas", general=False)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$，如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$，则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', '[FORMULA]', '约束条件', '公式', '[FORMULA]', '[SEP]', '[FORMULA]']
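
The symbol string reads as one letter per segment type. The sketch below decodes it into the placeholder tokens that appear in the examples in this reference. It is an assumption-laden illustration: parse_symbol and the mapping table are hypothetical, “[TEXT]” is a guess for “t” because it never appears in the examples, and the real tokenizer may handle these letters differently.

```python
# Hypothetical mapping from symbol letters to placeholder tokens.
SYMBOL_TO_PLACEHOLDER = {
    "t": "[TEXT]",     # assumption: not shown in the examples above
    "f": "[FORMULA]",
    "g": "[FIGURE]",
    "m": "[MARK]",
    "a": "[TAG]",
    "s": "[SEP]",
}


def parse_symbol(symbol: str) -> list:
    """Hypothetical helper: list the placeholders enabled by a symbol string."""
    unknown = sorted(set(symbol) - set(SYMBOL_TO_PLACEHOLDER))
    if unknown:
        raise ValueError(f"unknown symbol letters: {unknown}")
    return [SYMBOL_TO_PLACEHOLDER[ch] for ch in symbol]


print(parse_symbol("gmas"))  # ['[FIGURE]', '[MARK]', '[TAG]', '[SEP]']
print(parse_symbol("fgm"))   # ['[FORMULA]', '[FIGURE]', '[MARK]']
```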

batch_process(*items)[source]¶

EduNLP.Pretrain.gensim_vec.train_vector(items, w2v_prefix, embedding_dim=None, method='sg', binary=None, train_params=None)[source]¶

Parameters

items (str) – the text of the question
w2v_prefix (str) – prefix of the path where the trained model is saved
embedding_dim (int) – vector_size
method (str) – the method of training, e.g.: sg, cbow, fasttext, d2v, bow, tfidf
binary (bool) – the model format; True: bin, False: kv
train_params (dict) – the training parameters passed to model

Returns

the path of the saved model file

Return type

str

Examples

>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$，如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$，则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{1}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> train_vector(token_item[:10], "examples/test_model/w2v/gensim_luna_stem_t_", 100) 
'examples/test_model/w2v/gensim_luna_stem_t_sg_100.kv'
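
The returned path in the example suggests a naming pattern of prefix, method, embedding dimension and format suffix. The helper below reproduces that pattern. It is inferred from this single example only, so model_path is a hypothetical name and other methods (such as fasttext, d2v, bow or tfidf) may be named differently by the library.

```python
def model_path(w2v_prefix: str, method: str = "sg", embedding_dim: int = 100,
               binary: bool = False) -> str:
    """Hypothetical helper mirroring the naming seen in the example output."""
    # binary=True -> "bin", binary=False -> "kv", following the parameter description.
    suffix = "bin" if binary else "kv"
    return f"{w2v_prefix}{method}_{embedding_dim}.{suffix}"


print(model_path("examples/test_model/w2v/gensim_luna_stem_t_", "sg", 100))
# examples/test_model/w2v/gensim_luna_stem_t_sg_100.kv
```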

class EduNLP.Pretrain.gensim_vec.GensimSegTokenizer(symbol='gms', depth=None, flatten=False, **kwargs)[source]¶

Parameters

symbol (str) – select the methods to symbolize: “t”: text, “f”: formula, “g”: figure, “m”: question mark, “a”: tag, “s”: sep; e.g. gms, fgm
depth (int or None) – 0: only separate at SIFSep; 1: only separate at SIFTag; 2: separate at SIFTag and SIFSep; otherwise, separate all segments

Returns

tokenizer

Return type

Tokenizer

Examples

>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$，如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$，则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{1}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> tokenizer = GensimSegTokenizer(symbol="fgm", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$，如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$，则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], ['[FORMULA]'], ['如图'], ['[FIGURE]'], ['[FORMULA]'],...['[FORMULA]'], ['最大值'], ['[MARK]']]

EduNLP.Pretrain.elmo_vec¶

class EduNLP.Pretrain.elmo_vec.ElmoTokenizer(vocab_path=None, max_length=250, tokenize_method='pure_text', add_specials=True, **kwargs)[source]¶

Examples

>>> t=ElmoTokenizer()
>>> items = ["有公式$\FormFigureID{wrong1?}$，如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$，则$z=x+7 y$的最大值为$\SIFBlank$"]
>>> len(t)
14
>>> t.tokenize(items[0])
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']
>>> t(items[0])
{'seq_idx': tensor([1, 1, 6, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1, 1, 7]), 'seq_len': tensor(17)}
>>> t.set_vocab(items[0])
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']
>>> len(t)
45
>>> t(items[0])
{'seq_idx': tensor([ 1,  1,  6, 26, 27, 28,  1,  1,  9, 35, 36, 26, 37, 38, 28,  1,  7]), 'seq_len': tensor(17)}

class EduNLP.Pretrain.elmo_vec.ElmoDataset(tokenizer: ElmoTokenizer, **kwargs)[source]¶

collate_fn(batch_data)[source]¶

EduNLP.Pretrain.elmo_vec.train_elmo(items: Union[List[dict], List[str]], output_dir: str, pretrained_dir: Optional[str] = None, tokenizer_params=None, data_params=None, model_params=None, train_params=None)[source]¶

Parameters

items (list, required) – The training corpus, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_dir (str, optional) – The pretrained directory for model and tokenizer
tokenizer_params (dict, optional, default=None) – The parameters passed to ElmoTokenizer
data_params (dict, optional, default=None) – The parameters passed to ElmoDataset and ElmoTokenizer:
- stem_key
- label_key
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –

EduNLP.Pretrain.elmo_vec.train_elmo_for_property_prediction(train_items: list, output_dir: str, pretrained_dir=None, eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]¶

Parameters

train_items (list, required) – The training items, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_dir (str, optional) – The pretrained directory for model and tokenizer
eval_items (list, optional) – The evaluating items, each item could be str or dict, by default None
tokenizer_params (dict, optional, default=None) – The parameters passed to ElmoTokenizer
data_params (dict, optional, default=None) – The parameters passed to ElmoDataset and ElmoTokenizer
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –

EduNLP.Pretrain.elmo_vec.train_elmo_for_knowledge_prediction(train_items: list, output_dir: str, pretrained_dir=None, eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]¶

Parameters

train_items (list, required) – The training items, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_dir (str, optional) – The pretrained directory for model and tokenizer
eval_items (list, optional) – The evaluating items, each item could be str or dict, by default None
tokenizer_params (dict, optional, default=None) – The parameters passed to ElmoTokenizer
data_params (dict, optional, default=None) – The parameters passed to ElmoDataset and ElmoTokenizer
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –

EduNLP.Pretrain.bert_vec¶

class EduNLP.Pretrain.bert_vec.BertTokenizer(pretrained_model='bert-base-chinese', max_length=512, tokenize_method: str = 'pure_text', add_specials: Union[List[str], bool] = False, **kwargs)[source]¶

Examples

>>> tokenizer = BertTokenizer(add_special_tokens=True)
>>> item = "有公式$\FormFigureID{wrong1?}$，如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$，则$z=x+7 y$的最大值为$\SIFBlank$"
>>> token_item = tokenizer(item)
>>> print(token_item.input_ids)
tensor([[ 101, 1062, 2466, 1963, 1745,  138,  100,  140,  166,  117,  167, 5276,
         3338, 3340,  816, 1062, 2466,  102,  168,  134,  166,  116,  128,  167,
         3297, 1920,  966,  138,  100,  140,  102]])
>>> print(tokenizer.tokenize(item)[:10])
['公', '式', '如', '图', '[', '[UNK]', ']', 'x', ',', 'y']
>>> items = [item, item]
>>> token_items = tokenizer(items, return_tensors='pt')
>>> print(token_items.input_ids.shape)
torch.Size([2, 31])
>>> print(len(tokenizer.tokenize(items)))
2
>>> tokenizer.save_pretrained('test_dir') 
>>> tokenizer = BertTokenizer.from_pretrained('test_dir') 

class EduNLP.Pretrain.bert_vec.BertDataset(tokenizer, ds_disk_path: Optional[Dataset] = None, items: Optional[Union[List[dict], List[str]]] = None, stem_key: str = 'text', label_key: Optional[str] = None, feature_keys: Optional[List[str]] = None, num_processor: Optional[int] = None, **kwargs)[source]¶

EduNLP.Pretrain.bert_vec.finetune_bert(items: Union[List[dict], List[str]], output_dir: str, pretrained_model='bert-base-chinese', tokenizer_params=None, data_params=None, model_params=None, train_params=None)[source]¶

Parameters

items (list, required) – The training corpus, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_model (str, optional) – The pretrained model name or path for model and tokenizer
tokenizer_params (dict, optional, default=None) – The parameters passed to BertTokenizer
data_params (dict, optional, default=None) – The parameters passed to BertDataset and BertTokenizer
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –

Examples

>>> stems = ["有公式$\FormFigureID{wrong1?}$，如图$\FigureID{088f15ea-xxx}$",
... "有公式$\FormFigureID{wrong1?}$，如图$\FigureID{088f15ea-xxx}$"]
>>> finetune_bert(stems, "examples/test_model/data/data/bert") 
{'train_runtime': ..., ..., 'epoch': 1.0}

EduNLP.Pretrain.bert_vec.finetune_bert_for_property_prediction(train_items, output_dir, pretrained_model='bert-base-chinese', eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]¶

Parameters

train_items (list, required) – The training corpus, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_model (str, optional) – The pretrained model name or path for model and tokenizer
eval_items (list, optional) – The evaluating items, each item could be str or dict, by default None
tokenizer_params (dict, optional, default=None) – The parameters passed to BertTokenizer
data_params (dict, optional, default=None) – The parameters passed to BertDataset and BertTokenizer
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –

EduNLP.Pretrain.bert_vec.finetune_bert_for_knowledge_prediction(train_items, output_dir, pretrained_model='bert-base-chinese', eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]¶

Parameters

train_items (list, required) – The training corpus, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_model (str, optional) – The pretrained model name or path for model and tokenizer
eval_items (list, optional) – The evaluating items, each item could be str or dict, by default None
tokenizer_params (dict, optional, default=None) – The parameters passed to BertTokenizer
data_params (dict, optional, default=None) – The parameters passed to BertDataset and BertTokenizer
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –

EduNLP.Pretrain.disenqnet_vec¶

EduNLP.Pretrain.disenqnet_vec.check_num(s)[source]¶

EduNLP.Pretrain.disenqnet_vec.load_list_to_dict(path)[source]¶

EduNLP.Pretrain.disenqnet_vec.save_dict_to_list(item2index, path)[source]¶

class EduNLP.Pretrain.disenqnet_vec.DisenQTokenizer(vocab_path=None, max_length=250, tokenize_method='pure_text', add_specials: Optional[list] = None, num_token='[NUM]', **kwargs)[source]¶

Examples

>>> tokenizer = DisenQTokenizer()
>>> test_items = [{
...     "content": "甲 数 除以 乙 数 的 商 是 1.5 ， 如果 甲 数 增加 20 ， 则 甲 数 是 乙 的 4 倍 ． 原来 甲 数 = ．",
...     "knowledge": ["*", "-", "/"], "difficulty": 0.2, "length": 7}]
>>> tokenizer.set_vocab(test_items,
...     key=lambda x: x["content"], trim_min_count=1)
[['甲', '数', '除以', '乙', '数', '商', '[NUM]', '甲', '数', '增加', '[NUM]', '甲', '数', '乙', '倍', '甲', '数']]
>>> token_items = [tokenizer(i, key=lambda x: x["content"]) for i in test_items]
>>> print(token_items[0].keys())
dict_keys(['seq_idx', 'seq_len'])
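
In the output above, the numeric tokens 1.5 and 20 are replaced by the num_token “[NUM]”. A minimal sketch of such a replacement step follows. The helper names are hypothetical and the numeric test (anything float() can parse) is an assumption. The library's check_num may use a different rule, and the example also drops some tokens, which this sketch does not model.

```python
def is_number(s: str) -> bool:
    """Hypothetical check: True when the token parses as a number."""
    try:
        float(s)
        return True
    except ValueError:
        return False


def replace_numbers(tokens, num_token="[NUM]"):
    """Replace every numeric token with num_token and keep the rest unchanged."""
    return [num_token if is_number(tok) else tok for tok in tokens]


print(replace_numbers(["商", "是", "1.5", "增加", "20", "倍"]))
# ['商', '是', '[NUM]', '增加', '[NUM]', '倍']
```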

EduNLP.Pretrain.disenqnet_vec.preprocess_dataset(pretrained_dir, disen_tokenizer, items, data_formation, trim_min_count=None, embed_dim=None, w2v_params=None, silent=False)[source]¶

class EduNLP.Pretrain.disenqnet_vec.DisenQDataset(items: List[Dict], tokenizer: DisenQTokenizer, data_formation: Dict, mode='train', concept_to_idx=None, **kwargs)[source]¶

collate_fn(batch_data)[source]¶

class EduNLP.Pretrain.disenqnet_vec.DisenQTrainer(model: Optional[Union[PreTrainedModel, Module]] = None, args: Optional[TrainingArguments] = None, data_collator: Optional[DataCollator] = None, train_dataset: Optional[Dataset] = None, eval_dataset: Optional[Dataset] = None, tokenizer: Optional[PreTrainedTokenizerBase] = None, model_init: Optional[Callable[[], PreTrainedModel]] = None, compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None, callbacks: Optional[List[TrainerCallback]] = None, optimizers: Tuple[Optimizer, LambdaLR] = (None, None), preprocess_logits_for_metrics: Optional[Callable[[Tensor, Tensor], Tensor]] = None)[source]¶

create_optimizer_and_scheduler(num_training_steps: int)[source]¶

Setup the optimizer and the learning rate scheduler.

We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the Trainer’s init through optimizers, or subclass and override this method (or create_optimizer and/or create_scheduler) in a subclass.

training_step(model: Module, inputs: Dict[str, Union[Tensor, Any]]) → Tensor[source]¶

Perform a training step on a batch of inputs.

Subclass and override to inject custom behavior.

Parameters

model (nn.Module) – The model to train.
inputs (Dict[str, Union[torch.Tensor, Any]]) –
The inputs and targets of the model.

The dictionary will be unpacked before being fed to the model. Most models expect the targets under the argument labels. Check your model’s documentation for all accepted arguments.

Returns

The tensor with training loss on this batch.

Return type

torch.Tensor

class EduNLP.Pretrain.disenqnet_vec.DisenQTrainingArguments(output_dir: str, overwrite_output_dir: bool = False, do_train: bool = False, do_eval: bool = False, do_predict: bool = False, evaluation_strategy: Union[transformers.trainer_utils.IntervalStrategy, str] = 'no', prediction_loss_only: bool = False, per_device_train_batch_size: int = 8, per_device_eval_batch_size: int = 8, per_gpu_train_batch_size: Union[int, NoneType] = None, per_gpu_eval_batch_size: Union[int, NoneType] = None, gradient_accumulation_steps: int = 1, eval_accumulation_steps: Union[int, NoneType] = None, eval_delay: Union[float, NoneType] = 0, learning_rate: float = 5e-05, weight_decay: float = 0.0, adam_beta1: float = 0.9, adam_beta2: float = 0.999, adam_epsilon: float = 1e-08, max_grad_norm: float = 1.0, num_train_epochs: float = 3.0, max_steps: int = -1, lr_scheduler_type: Union[transformers.trainer_utils.SchedulerType, str] = 'linear', warmup_ratio: float = 0.0, warmup_steps: int = 0, log_level: Union[str, NoneType] = 'passive', log_level_replica: Union[str, NoneType] = 'passive', log_on_each_node: bool = True, logging_dir: Union[str, NoneType] = None, logging_strategy: Union[transformers.trainer_utils.IntervalStrategy, str] = 'steps', logging_first_step: bool = False, logging_steps: int = 500, logging_nan_inf_filter: bool = True, save_strategy: Union[transformers.trainer_utils.IntervalStrategy, str] = 'steps', save_steps: int = 500, save_total_limit: Union[int, NoneType] = None, save_on_each_node: bool = False, no_cuda: bool = False, use_mps_device: bool = False, seed: int = 42, data_seed: Union[int, NoneType] = None, jit_mode_eval: bool = False, use_ipex: bool = False, bf16: bool = False, fp16: bool = False, fp16_opt_level: str = 'O1', half_precision_backend: str = 'auto', bf16_full_eval: bool = False, fp16_full_eval: bool = False, tf32: Union[bool, NoneType] = None, local_rank: int = -1, xpu_backend: Union[str, NoneType] = None, tpu_num_cores: Union[int, NoneType] = None, 
tpu_metrics_debug: bool = False, debug: str = '', dataloader_drop_last: bool = False, eval_steps: Union[int, NoneType] = None, dataloader_num_workers: int = 0, past_index: int = -1, run_name: Union[str, NoneType] = None, disable_tqdm: Union[bool, NoneType] = None, remove_unused_columns: Union[bool, NoneType] = True, label_names: Union[List[str], NoneType] = None, load_best_model_at_end: Union[bool, NoneType] = False, metric_for_best_model: Union[str, NoneType] = None, greater_is_better: Union[bool, NoneType] = None, ignore_data_skip: bool = False, sharded_ddp: str = '', fsdp: str = '', fsdp_min_num_params: int = 0, fsdp_transformer_layer_cls_to_wrap: Union[str, NoneType] = None, deepspeed: Union[str, NoneType] = None, label_smoothing_factor: float = 0.0, optim: Union[transformers.training_args.OptimizerNames, str] = 'adamw_hf', optim_args: Union[str, NoneType] = None, adafactor: bool = False, group_by_length: bool = False, length_column_name: Union[str, NoneType] = 'length', report_to: Union[List[str], NoneType] = None, ddp_find_unused_parameters: Union[bool, NoneType] = None, ddp_bucket_cap_mb: Union[int, NoneType] = None, dataloader_pin_memory: bool = True, skip_memory_metrics: bool = True, use_legacy_prediction_loop: bool = False, push_to_hub: bool = False, resume_from_checkpoint: Union[str, NoneType] = None, hub_model_id: Union[str, NoneType] = None, hub_strategy: Union[transformers.trainer_utils.HubStrategy, str] = 'every_save', hub_token: Union[str, NoneType] = None, hub_private_repo: bool = False, gradient_checkpointing: bool = False, include_inputs_for_metrics: bool = False, fp16_backend: str = 'auto', push_to_hub_model_id: Union[str, NoneType] = None, push_to_hub_organization: Union[str, NoneType] = None, push_to_hub_token: Union[str, NoneType] = None, mp_parameters: str = '', auto_find_batch_size: bool = False, full_determinism: bool = False, torchdynamo: Union[str, NoneType] = None, ray_scope: Union[str, NoneType] = 'last', ddp_timeout: Union[int, 
NoneType] = 1800, step_size: int = False, trim_min: int = False, hidden_size: int = False, gamma: float = False)[source]¶

step_size: int = False¶

trim_min: int = False¶

hidden_size: int = False¶

gamma: float = False¶

EduNLP.Pretrain.disenqnet_vec.train_disenqnet(train_items: List[dict], output_dir: str, pretrained_dir: Optional[str] = None, eval_items=None, tokenizer_params=None, data_params=None, model_params=None, train_params=None, w2v_params=None)[source]¶

Parameters

train_items (List[dict]) – the training items
output_dir (str) – the directory to save trained model files
pretrained_dir (str, optional) – the pretrained directory for model and tokenizer, by default None
tokenizer_params (dict, optional) – the tokenizer parameters, by default None
data_params (dict, optional) – the dataset parameters, by default None
model_params (dict, optional) – the model parameters, by default None
train_params (dict, optional) – the training parameters, by default None

EduNLP.Pretrain.quesnet_vec¶

Pre-process input text, tokenizing, building vocabs, and pre-train word level vectors.

class EduNLP.Pretrain.quesnet_vec.Question(id, content, answer, false_options, labels)¶

property answer¶: Alias for field number 2

property content¶: Alias for field number 1

property false_options¶: Alias for field number 3

property id¶: Alias for field number 0

property labels¶: Alias for field number 4
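
The “Alias for field number N” descriptions are the docstrings Python generates for namedtuple fields, which suggests Question is a named tuple whose fields can be read by name or by position. A minimal sketch with made-up field values:

```python
from collections import namedtuple

# Same field order as the signature above; "field number" is the tuple position.
Question = namedtuple("Question", ["id", "content", "answer", "false_options", "labels"])

# The values below are invented for illustration only.
q = Question(id="q1", content=["x", "+", "1"], answer=["2"],
             false_options=[["3"], ["4"]], labels={"difficulty": 0.2})

print(q.content == q[1])  # True: content is field number 1
print(q._fields)          # ('id', 'content', 'answer', 'false_options', 'labels')
```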

EduNLP.Pretrain.quesnet_vec.save_list(item2index, path)[source]¶

class EduNLP.Pretrain.quesnet_vec.QuesNetTokenizer(vocab_path=None, meta_vocab_dir=None, img_dir: Optional[str] = None, max_length=250, tokenize_method='custom', symbol='mas', add_specials: Optional[list] = None, meta: Optional[List[str]] = None, img_token='<img>', unk_token='<unk>', pad_token='<pad>', **kwargs)[source]¶

Examples

>>> tokenizer = QuesNetTokenizer(meta=['knowledge'])
>>> test_items = [{"ques_content": "$\triangle A B C$ 的内角为 $A, \quad B, $\FigureID{test_id}$",
... "knowledge": "['*', '-', '/']"}, {"ques_content": "$\triangle A B C$ 的内角为 $A, \quad B",
... "knowledge": "['*', '-', '/']"}]
>>> tokenizer.set_vocab(test_items,
... trim_min_count=1, key=lambda x: x["ques_content"], silent=True)
>>> tokenizer.set_meta_vocab(test_items, silent=True)
>>> token_items = [tokenizer(i, key=lambda x: x["ques_content"]) for i in test_items]
>>> print(token_items[0].keys())
dict_keys(['seq_idx', 'meta_idx'])
>>> token_items = tokenizer(test_items, key=lambda x: x["ques_content"])
>>> print(len(token_items["seq_idx"]))
2

load_meta_vocab(meta_vocab_dir)[source]¶

set_meta_vocab(items: list, meta: Optional[List[str]] = None, silent=True)[source]¶

set_vocab(items: list, key=<function QuesNetTokenizer.<lambda>>, lower: bool = False, trim_min_count: int = 1, do_tokenize: bool = True, silent=True)[source]¶

Parameters

items (list) – can be the list of str, or list of dict
key (function) – determine how to get the text of each item
trim_min_count (int, optional) – the lower bound number for adding a word into vocabulary, by default 1
silent –

classmethod from_pretrained(tokenizer_config_dir, img_dir=None, **kwargs)[source]¶

Parameters:¶

tokenizer_config_dir: str: must contain tokenizer_config.json and vocab.txt and meta_{meta_name}.txt
img_dir: str: the path of the image directory, by default None

save_pretrained(tokenizer_config_dir)[source]¶

Save tokenizer into local files

Parameters:¶

tokenizer_config_dir: str: save tokenizer params in /tokenizer_config.json and save words in vocab.txt and save metas in meta_{meta_name}.txt

padding(idx, max_length, type='word')[source]¶

set_img_dir(path)[source]¶

EduNLP.Pretrain.quesnet_vec.clip(v, low, high)[source]¶

class EduNLP.Pretrain.quesnet_vec.Lines(filename, skip=0, preserve_newline=False)[source]¶

class EduNLP.Pretrain.quesnet_vec.QuestionLoader(ques: ~EduNLP.Pretrain.quesnet_vec.Lines, tokenizer: ~EduNLP.Pretrain.quesnet_vec.QuesNetTokenizer, pipeline=None, range=None, meta: ~typing.Optional[list] = None, content_key=<function QuestionLoader.<lambda>>, meta_key=<function QuestionLoader.<lambda>>, answer_key=<function QuestionLoader.<lambda>>, option_key=<function QuestionLoader.<lambda>>, skip=0)[source]¶

split_(split_ratio)[source]¶

EduNLP.Pretrain.quesnet_vec.optimizer(*models, **kwargs)[source]¶

class EduNLP.Pretrain.quesnet_vec.PrefetchIter(data, *label, length=None, batch_size=1, shuffle=True)[source]¶

Iterator on data and labels, with states for save and restore.

produce()[source]¶

class EduNLP.Pretrain.quesnet_vec.EmbeddingDataset(data, data_type='image')[source]¶

EduNLP.Pretrain.quesnet_vec.pretrain_iter(ques, batch_size)[source]¶

EduNLP.Pretrain.quesnet_vec.critical(f)[source]¶

EduNLP.Pretrain.quesnet_vec.pretrain_embedding_layer(dataset: EmbeddingDataset, ae: AE, lr: float = 0.001, log_step: int = 1, epochs: int = 3, batch_size: int = 4, device=device(type='cpu'))[source]¶

EduNLP.Pretrain.quesnet_vec.pretrain_quesnet(path, output_dir, img_dir=None, save_embs=False, train_params=None)[source]¶

Pretrain QuesNet.

Parameters

path (str) – path of question file
output_dir (str) – output path
tokenizer (QuesNetTokenizer) – quesnet tokenizer
save_embs (bool, optional) – whether to save pretrained word/image/meta embeddings separately
train_params (dict, optional) –
the training parameters and model parameters, by default None
- “n_epochs”: int, default = 1
  train param, number of epochs
- “batch_size”: int, default = 6
  train param, batch size
- “lr”: float, default = 1e-3
  train param, learning rate
- “save_every”: int, default = 0
  train param, save steps interval
- “log_steps”: int, default = 10
  train param, log steps interval
- “device”: str, default = ‘cpu’
  train param, ‘cpu’ or ‘cuda’
- “max_steps”: int, default = 0
  train param, stop training when reach max steps
- “emb_size”: int, default = 256
  model param, the embedding size of word, figure, meta info
- “feat_size”: int, default = 256
  model param, the size of question infer vector
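
The defaults listed for train_params can be collected into a dictionary and merged with user overrides. This is a sketch of one way to prepare the argument, not the function's internal logic: resolve_train_params is a hypothetical helper, and rejecting unknown keys is a design choice of this sketch rather than documented behaviour.

```python
# Defaults copied from the parameter description above.
DEFAULT_TRAIN_PARAMS = {
    "n_epochs": 1,
    "batch_size": 6,
    "lr": 1e-3,
    "save_every": 0,
    "log_steps": 10,
    "device": "cpu",
    "max_steps": 0,
    "emb_size": 256,
    "feat_size": 256,
}


def resolve_train_params(train_params=None) -> dict:
    """Hypothetical helper: overlay user-supplied values on the documented defaults."""
    overrides = train_params or {}
    unknown = sorted(set(overrides) - set(DEFAULT_TRAIN_PARAMS))
    if unknown:
        # Assumption of this sketch: misspelled keys should fail loudly.
        raise KeyError(f"unknown train_params keys: {unknown}")
    params = dict(DEFAULT_TRAIN_PARAMS)
    params.update(overrides)
    return params


params = resolve_train_params({"n_epochs": 3, "device": "cuda"})
print(params["n_epochs"], params["batch_size"], params["device"])  # 3 6 cuda
```

The resulting dictionary could then be passed as the train_params argument.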

Examples

>>> tokenizer = QuesNetTokenizer(meta=['know_name'])
>>> items = [{"ques_content": "若复数$z=1+2 i+i^{3}$，则$|z|=$，$\FigureID{000004d6-0479-11ec-829b-797d5eb43535}$",
... "ques_id": "726cdbec-33a9-11ec-909c-98fa9b625adb",
... "know_name": "['代数', '集合', '集合的相等']"
... }]
>>> tokenizer.set_vocab(items, key=lambda x: x['ques_content'], trim_min_count=1, silent=True)
>>> pretrain_quesnet('./data/standard_luna_data.json', './testQuesNet', tokenizer)