EduNLP.Pretrain

EduNLP.Pretrain.pretrian_utils

class EduNLP.Pretrain.pretrian_utils.EduVocab(vocab_path: Optional[str] = None, corpus_items: Optional[List[str]] = None, bos_token: str = '[BOS]', eos_token: str = '[EOS]', pad_token: str = '[PAD]', unk_token: str = '[UNK]', specials: Optional[List[str]] = None, lower: bool = False, trim_min_count: int = 1, **kwargs)[source]

The vocabulary container for a corpus.

Parameters
  • vocab_path (str, optional) – vocabulary path to initialize this container, by default None

  • corpus_items (List[str], optional) – corpus items to update this vocabulary, by default None

  • bos_token (str, optional) – token representing the start of a sentence, by default “[BOS]”

  • eos_token (str, optional) – token representing the end of a sentence, by default “[EOS]”

  • pad_token (str, optional) – token used for padding, by default “[PAD]”

  • unk_token (str, optional) – token representing an unknown word, by default “[UNK]”

  • specials (List[str], optional) – special tokens in the vocabulary, by default None

  • lower (bool, optional) – whether to lowercase the corpus items, by default False

  • trim_min_count (int, optional) – the minimum count a word needs to be added to the vocabulary, by default 1

property vocab_size
property special_tokens
property tokens
to_idx(token)[source]

convert token to index

to_token(idx)[source]

convert index to token

convert_sequence_to_idx(tokens, bos=False, eos=False)[source]

convert a sentence of tokens to a sentence of indices

convert_sequence_to_token(idxs, **kwargs)[source]

convert a sentence of indices to a sentence of tokens

set_vocab(corpus_items: List[str], lower: bool = False, trim_min_count: int = 1, silent=True)[source]

Update the vocabulary with the tokens in corpus items

Parameters
  • corpus_items (List[str]) – corpus items to update this vocabulary

  • lower (bool, optional) – whether to lowercase the corpus items, by default False

  • trim_min_count (int, optional) – the minimum count a word needs to be added to the vocabulary, by default 1

load_vocab(vocab_path: str)[source]

Load the vocabulary from vocab_file

Parameters

vocab_path (str) – path of the vocabulary file to load

save_vocab(vocab_path: str)[source]

Save the vocabulary into vocab_file

Parameters

vocab_path (str) – path to save vocabulary file

add_specials(tokens: List[str])[source]

Add special tokens into vocabulary

add_tokens(tokens: List[str])[source]

Add tokens into vocabulary
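
The behaviour documented for this container (special tokens, a minimum-count threshold, a fallback for unknown words, optional BOS/EOS wrapping) can be illustrated with a small pure-Python sketch. ToyVocab below is a hypothetical stand-in and not EduVocab's implementation. It assumes that special tokens take the first indices, that each corpus item is already a list of tokens, and that trim_min_count means "keep tokens seen at least this many times".

```python
from collections import Counter
from typing import List, Optional


class ToyVocab:
    """Hypothetical stand-in for a vocabulary container (not EduVocab itself)."""

    def __init__(self, corpus_items: Optional[List[List[str]]] = None,
                 pad_token: str = "[PAD]", unk_token: str = "[UNK]",
                 bos_token: str = "[BOS]", eos_token: str = "[EOS]",
                 lower: bool = False, trim_min_count: int = 1):
        self.unk_token = unk_token
        self.bos_token, self.eos_token = bos_token, eos_token
        self.idx_to_token: List[str] = []
        self.token_to_idx = {}
        # Assumption: special tokens occupy the first indices.
        for tok in (pad_token, unk_token, bos_token, eos_token):
            self._add(tok)
        if corpus_items is not None:
            self.set_vocab(corpus_items, lower=lower, trim_min_count=trim_min_count)

    def _add(self, token: str) -> None:
        if token not in self.token_to_idx:
            self.token_to_idx[token] = len(self.idx_to_token)
            self.idx_to_token.append(token)

    def set_vocab(self, corpus_items, lower=False, trim_min_count=1):
        counts = Counter(
            (tok.lower() if lower else tok) for item in corpus_items for tok in item
        )
        # Assumption: keep tokens that occur at least `trim_min_count` times.
        for tok, n in counts.items():
            if n >= trim_min_count:
                self._add(tok)

    def to_idx(self, token: str) -> int:
        # Unknown tokens fall back to the index of the unknown-word token.
        return self.token_to_idx.get(token, self.token_to_idx[self.unk_token])

    def to_token(self, idx: int) -> str:
        return self.idx_to_token[idx]

    def convert_sequence_to_idx(self, tokens, bos=False, eos=False):
        ids = [self.to_idx(t) for t in tokens]
        if bos:
            ids = [self.to_idx(self.bos_token)] + ids
        if eos:
            ids = ids + [self.to_idx(self.eos_token)]
        return ids


vocab = ToyVocab([["x", "+", "y"], ["x", "=", "1"]], trim_min_count=2)
ids = vocab.convert_sequence_to_idx(["x", "y"], bos=True, eos=True)
tokens = [vocab.to_token(i) for i in ids]
```

In this toy corpus only "x" occurs twice, so with trim_min_count=2 the token "y" is mapped to the unknown-word index.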

class EduNLP.Pretrain.pretrian_utils.EduDataset(tokenizer, ds_disk_path: Optional[Dataset] = None, items: Optional[Union[List[dict], List[str]]] = None, stem_key: str = 'text', label_key: Optional[str] = None, feature_keys: Optional[List[str]] = None, num_processor: Optional[int] = None, **kwargs)[source]

A base Dataset class that wraps datasets.Dataset and adds conveniences such as parallel preprocessing and offline loading.

Parameters
  • tokenizer – PretrainedEduTokenizer or model-specific Pretrained Tokenizer

  • ds_disk_path (HFDataset, optional) – the path on disk where the dataset used by datasets.Dataset is saved, by default None

  • items (Union[List[dict], List[str]], optional) – input items to process, by default None

  • stem_key (str, optional) – the content of items to process, by default “text”

  • label_key (Optional[str], optional) – the labels of items to process, by default None

  • feature_keys (Optional[List[str]], optional) – the additional features of items to remain, by default None

  • num_processor (int, optional) – specifies the number of CPUs used for parallel speedup, by default None

ds

Note

map will break down for very large data (greater than 4GB)

to_disk(ds_disk_path)[source]

Save the processed dataset into local files

collect_fn()[source]
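
The roles of stem_key, label_key and feature_keys described for EduDataset can be pictured with a small sketch. The helper to_records is hypothetical and is not part of EduNLP. It assumes that a bare string item is treated as the stem text and that all keys other than the selected ones are dropped.

```python
from typing import Dict, List, Optional, Union


def to_records(items: List[Union[str, dict]], stem_key: str = "text",
               label_key: Optional[str] = None,
               feature_keys: Optional[List[str]] = None) -> List[Dict]:
    """Keep only the stem, the label and the requested extra features."""
    keep = [stem_key] + ([label_key] if label_key else []) + list(feature_keys or [])
    records = []
    for item in items:
        if isinstance(item, str):
            # Assumption: a bare string is treated as the stem text.
            item = {stem_key: item}
        records.append({k: item[k] for k in keep if k in item})
    return records


records = to_records(
    [{"text": "x+1=2", "difficulty": 0.3, "source": "demo", "id": 7}],
    label_key="difficulty",
    feature_keys=["source"],
)
plain = to_records(["x+1=2"])
```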
class EduNLP.Pretrain.pretrian_utils.PretrainedEduTokenizer(vocab_path: Optional[str] = None, max_length: int = 250, tokenize_method: str = 'pure_text', add_specials: Tuple[list, bool] = False, **kwargs)[source]

This base class is in charge of preparing the inputs for a model

Parameters
  • vocab_path (str, optional) – path of the vocabulary file to load, by default None

  • max_length (int, optional) – maximum sentence length; longer sentences are clipped, by default 250

  • tokenize_method (str, optional) – how to tokenize the text, by default “pure_text”

    • when the text is already separated by spaces, use “space”

    • when the text is a raw string, use a Tokenizer defined in get_tokenizer(), such as “pure_text” and “text”

  • add_specials (Tuple[list, bool], optional) – by default False

    • For bool, it means whether to add EDU_SPYMBOLS to the vocabulary

    • For list, it means the special tokens to add besides EDU_SPYMBOLS

tokenize(items: ~typing.Tuple[list, str, dict], key=<function PretrainedEduTokenizer.<lambda>>, **kwargs)[source]
Parameters
  • items (list or str or dict) – the question items

  • key (function) – determine how to get the text of each item

Returns

tokens – the tokens of the items

Return type

list
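
The key parameter decides how the text is obtained from each item. A minimal sketch of that idea follows. The helper extract_texts is hypothetical, and the identity function used as the default key is an assumption, since this page only shows the default as a lambda.

```python
from typing import Callable, List, Union

Item = Union[str, dict]


def extract_texts(items: Union[Item, List[Item]],
                  key: Callable[[Item], str] = lambda x: x) -> List[str]:
    """Apply `key` to a single item or to every item in a list."""
    if isinstance(items, (str, dict)):
        items = [items]  # wrap a single item so both cases share one code path
    return [key(item) for item in items]


raw = extract_texts("x + y = 1")
from_dicts = extract_texts(
    [{"content": "a b"}, {"content": "c d"}],
    key=lambda item: item["content"],
)
```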

encode(items: ~typing.Tuple[str, dict, ~typing.List[str], ~typing.List[dict]], key=<function PretrainedEduTokenizer.<lambda>>, **kwargs)[source]
decode(token_ids: list, key=<function PretrainedEduTokenizer.<lambda>>, **kwargs)[source]
classmethod from_pretrained(tokenizer_config_dir: str, **kwargs)[source]

Load tokenizer from local files

Parameters:

tokenizer_config_dir: str

The dir path containing tokenizer_config.json and vocab.list

save_pretrained(tokenizer_config_dir: str)[source]

Save tokenizer into local files

Parameters:

tokenizer_config_dir: str

The directory in which the tokenizer params are saved to tokenizer_config.json and the words to vocab.list
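
The save and load round trip can be sketched with the standard library. Only the two file names, tokenizer_config.json and vocab.list, come from this page. The helper functions and the file contents (a JSON object of parameters and one token per line) are assumptions made for illustration.

```python
import json
import os
import tempfile


def save_toy_tokenizer(config_dir, params, tokens):
    """Write params and vocabulary using the file names mentioned above."""
    os.makedirs(config_dir, exist_ok=True)
    config_path = os.path.join(config_dir, "tokenizer_config.json")
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(params, f, ensure_ascii=False, indent=2)
    # Assumption: the vocabulary file holds one token per line.
    with open(os.path.join(config_dir, "vocab.list"), "w", encoding="utf-8") as f:
        f.write("\n".join(tokens))


def load_toy_tokenizer(config_dir):
    """Read back what save_toy_tokenizer wrote."""
    config_path = os.path.join(config_dir, "tokenizer_config.json")
    with open(config_path, encoding="utf-8") as f:
        params = json.load(f)
    with open(os.path.join(config_dir, "vocab.list"), encoding="utf-8") as f:
        tokens = f.read().split("\n")
    return params, tokens


with tempfile.TemporaryDirectory() as tmp:
    save_toy_tokenizer(
        tmp,
        {"max_length": 250, "tokenize_method": "pure_text"},
        ["[PAD]", "[UNK]", "公式"],
    )
    params, tokens = load_toy_tokenizer(tmp)
```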

property vocab_size
set_vocab(items: list, key=<function PretrainedEduTokenizer.<lambda>>, lower: bool = False, trim_min_count: int = 1, do_tokenize: bool = True)[source]

Update the vocabulary with the tokens in corpus items

Parameters
  • items (list) – can be the list of str, or list of dict

  • key (function, optional) – determine how to get the text of each item

  • lower (bool, optional) – whether to lowercase the corpus items, by default False

  • trim_min_count (int, optional) – the minimum count a word needs to be added to the vocabulary, by default 1

  • do_tokenize (bool, optional) – whether to tokenize items before updating the vocabulary, by default True

Returns

token_items

Return type

list

add_specials(tokens)[source]

Add special tokens into vocabulary

add_tokens(tokens)[source]

Add tokens into vocabulary

EduNLP.Pretrain.hugginface_utils

class EduNLP.Pretrain.hugginface_utils.TokenizerForHuggingface(pretrained_model='bert-base-chinese', max_length=512, tokenize_method: str = 'pure_text', add_specials: Union[List[str], bool] = False, **kwargs)[source]

Parameters

pretrained_model:

used pretrained model

add_specials:

Whether to add tokens like [FIGURE], [TAG], etc.

tokenize_method:

Which text tokenizer to use. Must be consistent with TOKENIZER dictionary.

Examples

>>> tokenizer = TokenizerForHuggingface(add_special_tokens=True)
>>> item = "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"
>>> token_item = tokenizer(item)
>>> print(token_item.input_ids[:10])
tensor([[ 101, 1062, 2466, 1963, 1745,  138,  100,  140,  166,  117,  167, 5276,
         3338, 3340,  816, 1062, 2466,  102,  168,  134,  166,  116,  128,  167,
         3297, 1920,  966,  138,  100,  140,  102]])
>>> print(tokenizer.tokenize(item)[:10])
['公', '式', '如', '图', '[', '[UNK]', ']', 'x', ',', 'y']
>>> items = [item, item]
>>> token_items = tokenizer(items, return_tensors='pt')
>>> print(token_items.input_ids.shape)
torch.Size([2, 31])
>>> print(len(tokenizer.tokenize(items)))
2
>>> tokenizer.save_pretrained('test_dir') 
>>> tokenizer = TokenizerForHuggingface.from_pretrained('test_dir') 
tokenize(items: ~typing.Union[list, str, dict], key=<function TokenizerForHuggingface.<lambda>>, **kwargs)[source]
encode(items: ~typing.Tuple[str, dict, ~typing.List[str], ~typing.List[dict]], key=<function TokenizerForHuggingface.<lambda>>, **kwargs)[source]
decode(token_ids: list, key=<function TokenizerForHuggingface.<lambda>>, **kwargs)[source]
classmethod from_pretrained(tokenizer_config_dir, **kwargs)[source]
save_pretrained(tokenizer_config_dir)[source]
property vocab_size
set_vocab(items: ~typing.Tuple[~typing.List[str], ~typing.List[dict]], key=<function TokenizerForHuggingface.<lambda>>, lower=False, trim_min_count: int = 1, do_tokenize: bool = True)[source]
Parameters
  • items (list) – can be the list of str, or list of dict

  • key (function) – determine how to get the text of each item

  • trim_min_count (int, optional) – the minimum count a word needs to be added to the vocabulary, by default 1

  • do_tokenize (bool, optional) – whether to tokenize items before updating the vocabulary, by default True

add_specials(added_spectials: List[str])[source]
add_tokens(added_tokens: List[str])[source]

EduNLP.Pretrain.gensim_vec

class EduNLP.Pretrain.gensim_vec.GensimWordTokenizer(symbol='gm', general=False)[source]
Parameters
  • symbol (str) –

    select the methods to symbolize:

    “t”: text, “f”: formula, “g”: figure, “m”: question mark, “a”: tag, “s”: sep,

    e.g.: gm, fgm, gmas, fgmas

  • general (bool) –

    True: use when the item isn’t in standard format and formulas (except formulas in figures) should be tokenized linearly.

    False: use the ‘ast’ method to tokenize formulas instead of ‘linear’.

Returns

tokenizer

Return type

Tokenizer

Examples

>>> tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
>>> tokenizer = GensimWordTokenizer(symbol="fgmas", general=False)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', '[FORMULA]', '约束条件', '公式', '[FORMULA]', '[SEP]', '[FORMULA]']
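
The symbol string is read letter by letter, each letter enabling one kind of placeholder. The sketch below shows that reading. The function placeholders_for and the mapping table are hypothetical. The placeholder names [FORMULA], [FIGURE], [MARK], [SEP] and [TAG] appear on this page, while [TEXT] is an assumption.

```python
SYMBOL_PLACEHOLDERS = {
    "t": "[TEXT]",     # assumption: this placeholder name is not shown on this page
    "f": "[FORMULA]",
    "g": "[FIGURE]",
    "m": "[MARK]",
    "a": "[TAG]",
    "s": "[SEP]",
}


def placeholders_for(symbol: str):
    """Return the placeholders enabled by a symbol string such as "gmas"."""
    unknown = sorted(set(symbol) - set(SYMBOL_PLACEHOLDERS))
    if unknown:
        raise ValueError(f"unknown symbol letters: {unknown}")
    return [SYMBOL_PLACEHOLDERS[ch] for ch in symbol]


enabled = placeholders_for("fgmas")
without_formula = placeholders_for("gm")
```

With “gm”, formulas are not symbolized, which matches the first example above where x, y are kept as tokens.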
batch_process(*items)[source]
EduNLP.Pretrain.gensim_vec.train_vector(items, w2v_prefix, embedding_dim=None, method='sg', binary=None, train_params=None)[source]
Parameters
  • items (str) – the text of the questions

  • w2v_prefix (str) – path prefix of the output model file

  • embedding_dim (int) – vector_size

  • method (str) – the training method, e.g.: sg, cbow, fasttext, d2v, bow, tfidf

  • binary (bool) – model format; True: bin, False: kv

  • train_params (dict) – the training parameters passed to model

Returns

path of the saved model file

Return type

str

Examples

>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{1}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> train_vector(token_item[:10], "examples/test_model/w2v/gensim_luna_stem_t_", 100) 
'examples/test_model/w2v/gensim_luna_stem_t_sg_100.kv'
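
The returned path in the example above appears to be built from the prefix, the method name, the embedding dimension and a format extension. The sketch below reproduces that pattern. The helper guess_w2v_path is hypothetical and is inferred only from this single example and the bin/kv note on the binary parameter, so other methods or settings may be named differently.

```python
def guess_w2v_path(w2v_prefix: str, method: str, embedding_dim: int,
                   binary: bool = False) -> str:
    """Guess the output path; inferred from one documented example only."""
    ext = "bin" if binary else "kv"  # True: bin; False: kv
    return f"{w2v_prefix}{method}_{embedding_dim}.{ext}"


path = guess_w2v_path("examples/test_model/w2v/gensim_luna_stem_t_", "sg", 100)
```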
class EduNLP.Pretrain.gensim_vec.GensimSegTokenizer(symbol='gms', depth=None, flatten=False, **kwargs)[source]
Parameters
  • symbol (str) –

    select the methods to symbolize:

    “t”: text, “f”: formula, “g”: figure, “m”: question mark, “a”: tag, “s”: sep,

    e.g. gms, fgm

  • depth (int or None) – 0: only separate at SIFSep ; 1: only separate at SIFTag ; 2: separate at SIFTag and SIFSep ; otherwise, separate all segments ;

Returns

tokenizer

Return type

Tokenizer

Examples

>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{1}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> tokenizer = GensimSegTokenizer(symbol="fgm", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], ['[FORMULA]'], ['如图'], ['[FIGURE]'], ['[FORMULA]'],...['[FORMULA]'], ['最大值'], ['[MARK]']]

EduNLP.Pretrain.elmo_vec

class EduNLP.Pretrain.elmo_vec.ElmoTokenizer(vocab_path=None, max_length=250, tokenize_method='pure_text', add_specials=True, **kwargs)[source]

Examples

>>> t=ElmoTokenizer()
>>> items = ["有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"]
>>> len(t)
14
>>> t.tokenize(items[0])
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']
>>> t(items[0])
{'seq_idx': tensor([1, 1, 6, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1, 1, 7]), 'seq_len': tensor(17)}
>>> t.set_vocab(items[0])
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']
>>> len(t)
45
>>> t(items[0])
{'seq_idx': tensor([ 1,  1,  6, 26, 27, 28,  1,  1,  9, 35, 36, 26, 37, 38, 28,  1,  7]), 'seq_len': tensor(17)}
class EduNLP.Pretrain.elmo_vec.ElmoDataset(tokenizer: ElmoTokenizer, **kwargs)[source]
collate_fn(batch_data)[source]
EduNLP.Pretrain.elmo_vec.train_elmo(items: Union[List[dict], List[str]], output_dir: str, pretrained_dir: Optional[str] = None, tokenizer_params=None, data_params=None, model_params=None, train_params=None)[source]
Parameters
  • items (list, required) – The training corpus, each item could be str or dict

  • output_dir (str, required) – The directory to save trained model files

  • pretrained_dir (str, optional) – The pretrained directory for model and tokenizer

  • tokenizer_params (dict, optional, default=None) – The parameters passed to ElmoTokenizer

  • data_params (dict, optional, default=None) –

    The parameters passed to ElmoDataset and ElmoTokenizer

    • stem_key

    • label_key

  • model_params (dict, optional, default=None) – The parameters passed to Trainer

  • train_params (dict, optional, default=None) –

EduNLP.Pretrain.elmo_vec.train_elmo_for_property_prediction(train_items: list, output_dir: str, pretrained_dir=None, eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]
Parameters
  • train_items (list, required) – The training items, each item could be str or dict

  • output_dir (str, required) – The directory to save trained model files

  • pretrained_dir (str, optional) – The pretrained directory for model and tokenizer

  • eval_items (list, optional) – The evaluating items, each item could be str or dict

  • tokenizer_params (dict, optional, default=None) – The parameters passed to ElmoTokenizer

  • data_params (dict, optional, default=None) – The parameters passed to ElmoDataset and ElmoTokenizer

  • model_params (dict, optional, default=None) – The parameters passed to Trainer

  • train_params (dict, optional, default=None) –

EduNLP.Pretrain.elmo_vec.train_elmo_for_knowledge_prediction(train_items: list, output_dir: str, pretrained_dir=None, eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]
Parameters
  • train_items (list, required) – The training items, each item could be str or dict

  • output_dir (str, required) – The directory to save trained model files

  • pretrained_dir (str, optional) – The pretrained directory for model and tokenizer

  • eval_items (list, optional) – The evaluating items, each item could be str or dict

  • tokenizer_params (dict, optional, default=None) – The parameters passed to ElmoTokenizer

  • data_params (dict, optional, default=None) – The parameters passed to ElmoDataset and ElmoTokenizer

  • model_params (dict, optional, default=None) – The parameters passed to Trainer

  • train_params (dict, optional, default=None) –

EduNLP.Pretrain.bert_vec

class EduNLP.Pretrain.bert_vec.BertTokenizer(pretrained_model='bert-base-chinese', max_length=512, tokenize_method: str = 'pure_text', add_specials: Union[List[str], bool] = False, **kwargs)[source]

Examples

>>> tokenizer = BertTokenizer(add_special_tokens=True)
>>> item = "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"
>>> token_item = tokenizer(item)
>>> print(token_item.input_ids)
tensor([[ 101, 1062, 2466, 1963, 1745,  138,  100,  140,  166,  117,  167, 5276,
         3338, 3340,  816, 1062, 2466,  102,  168,  134,  166,  116,  128,  167,
         3297, 1920,  966,  138,  100,  140,  102]])
>>> print(tokenizer.tokenize(item)[:10])
['公', '式', '如', '图', '[', '[UNK]', ']', 'x', ',', 'y']
>>> items = [item, item]
>>> token_items = tokenizer(items, return_tensors='pt')
>>> print(token_items.input_ids.shape)
torch.Size([2, 31])
>>> print(len(tokenizer.tokenize(items)))
2
>>> tokenizer.save_pretrained('test_dir') 
>>> tokenizer = BertTokenizer.from_pretrained('test_dir') 
class EduNLP.Pretrain.bert_vec.BertDataset(tokenizer, ds_disk_path: Optional[Dataset] = None, items: Optional[Union[List[dict], List[str]]] = None, stem_key: str = 'text', label_key: Optional[str] = None, feature_keys: Optional[List[str]] = None, num_processor: Optional[int] = None, **kwargs)[source]
EduNLP.Pretrain.bert_vec.finetune_bert(items: Union[List[dict], List[str]], output_dir: str, pretrained_model='bert-base-chinese', tokenizer_params=None, data_params=None, model_params=None, train_params=None)[source]
Parameters
  • items (list, required) – The training corpus, each item could be str or dict

  • output_dir (str, required) – The directory to save trained model files

  • pretrained_model (str, optional) – The pretrained model name or path for model and tokenizer

  • eval_items (list, required) – The evaluating items, each item could be str or dict

  • tokenizer_params (dict, optional, default=None) – The parameters passed to BertTokenizer

  • data_params (dict, optional, default=None) – The parameters passed to BertDataset and BertTokenizer

  • model_params (dict, optional, default=None) – The parameters passed to Trainer

  • train_params (dict, optional, default=None) –

Examples

>>> stems = ["有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$",
... "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$"]
>>> finetune_bert(stems, "examples/test_model/data/data/bert") 
{'train_runtime': ..., ..., 'epoch': 1.0}
EduNLP.Pretrain.bert_vec.finetune_bert_for_property_prediction(train_items, output_dir, pretrained_model='bert-base-chinese', eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]
Parameters
  • train_items (list, required) – The training corpus, each item could be str or dict

  • output_dir (str, required) – The directory to save trained model files

  • pretrained_model (str, optional) – The pretrained model name or path for model and tokenizer

  • eval_items (list, optional) – The evaluating items, each item could be str or dict

  • tokenizer_params (dict, optional, default=None) – The parameters passed to BertTokenizer

  • data_params (dict, optional, default=None) – The parameters passed to BertDataset and BertTokenizer

  • model_params (dict, optional, default=None) – The parameters passed to Trainer

  • train_params (dict, optional, default=None) –

EduNLP.Pretrain.bert_vec.finetune_bert_for_knowledge_prediction(train_items, output_dir, pretrained_model='bert-base-chinese', eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]
Parameters
  • train_items (list, required) – The training corpus, each item could be str or dict

  • output_dir (str, required) – The directory to save trained model files

  • pretrained_model (str, optional) – The pretrained model name or path for model and tokenizer

  • eval_items (list, optional) – The evaluating items, each item could be str or dict

  • tokenizer_params (dict, optional, default=None) – The parameters passed to BertTokenizer

  • data_params (dict, optional, default=None) – The parameters passed to BertDataset and BertTokenizer

  • model_params (dict, optional, default=None) – The parameters passed to Trainer

  • train_params (dict, optional, default=None) –

EduNLP.Pretrain.disenqnet_vec

EduNLP.Pretrain.disenqnet_vec.check_num(s)[source]
EduNLP.Pretrain.disenqnet_vec.load_list_to_dict(path)[source]
EduNLP.Pretrain.disenqnet_vec.save_dict_to_list(item2index, path)[source]
class EduNLP.Pretrain.disenqnet_vec.DisenQTokenizer(vocab_path=None, max_length=250, tokenize_method='pure_text', add_specials: Optional[list] = None, num_token='[NUM]', **kwargs)[source]

Examples

>>> tokenizer = DisenQTokenizer()
>>> test_items = [{
...     "content": "甲 数 除以 乙 数 的 商 是 1.5 , 如果 甲 数 增加 20 , 则 甲 数 是 乙 的 4 倍 . 原来 甲 数 = .",
...     "knowledge": ["*", "-", "/"], "difficulty": 0.2, "length": 7}]
>>> tokenizer.set_vocab(test_items,
...     key=lambda x: x["content"], trim_min_count=1)
[['甲', '数', '除以', '乙', '数', '商', '[NUM]', '甲', '数', '增加', '[NUM]', '甲', '数', '乙', '倍', '甲', '数']]
>>> token_items = [tokenizer(i, key=lambda x: x["content"]) for i in test_items]
>>> print(token_items[0].keys())
dict_keys(['seq_idx', 'seq_len'])
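
In the example above, numeric tokens such as 1.5, 20 and 4 come back as the num_token “[NUM]”. That single step can be sketched as follows. The helpers is_number and replace_numbers are hypothetical and are not the library's check_num. The sketch relies on Python's float() to detect numbers, and it does not reproduce the removal of function words that the example output also shows.

```python
def is_number(s: str) -> bool:
    """Return True if `s` parses as a float (hypothetical helper)."""
    try:
        float(s)
        return True
    except ValueError:
        return False


def replace_numbers(tokens, num_token="[NUM]"):
    """Replace every numeric token with `num_token`, keeping all other tokens."""
    return [num_token if is_number(t) else t for t in tokens]


replaced = replace_numbers(["商", "是", "1.5", ",", "增加", "20"])
```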
EduNLP.Pretrain.disenqnet_vec.preprocess_dataset(pretrained_dir, disen_tokenizer, items, data_formation, trim_min_count=None, embed_dim=None, w2v_params=None, silent=False)[source]
class EduNLP.Pretrain.disenqnet_vec.DisenQDataset(items: List[Dict], tokenizer: DisenQTokenizer, data_formation: Dict, mode='train', concept_to_idx=None, **kwargs)[source]
collate_fn(batch_data)[source]
class EduNLP.Pretrain.disenqnet_vec.DisenQTrainer(model: Optional[Union[PreTrainedModel, Module]] = None, args: Optional[TrainingArguments] = None, data_collator: Optional[DataCollator] = None, train_dataset: Optional[Dataset] = None, eval_dataset: Optional[Dataset] = None, tokenizer: Optional[PreTrainedTokenizerBase] = None, model_init: Optional[Callable[[], PreTrainedModel]] = None, compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None, callbacks: Optional[List[TrainerCallback]] = None, optimizers: Tuple[Optimizer, LambdaLR] = (None, None), preprocess_logits_for_metrics: Optional[Callable[[Tensor, Tensor], Tensor]] = None)[source]
create_optimizer_and_scheduler(num_training_steps: int)[source]

Setup the optimizer and the learning rate scheduler.

We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the Trainer’s init through optimizers, or subclass and override this method (or create_optimizer and/or create_scheduler) in a subclass.

training_step(model: Module, inputs: Dict[str, Union[Tensor, Any]]) Tensor[source]

Perform a training step on a batch of inputs.

Subclass and override to inject custom behavior.

Parameters
  • model (nn.Module) – The model to train.

  • inputs (Dict[str, Union[torch.Tensor, Any]]) –

    The inputs and targets of the model.

    The dictionary will be unpacked before being fed to the model. Most models expect the targets under the argument labels. Check your model’s documentation for all accepted arguments.

Returns

The tensor with training loss on this batch.

Return type

torch.Tensor

class EduNLP.Pretrain.disenqnet_vec.DisenQTrainingArguments(output_dir: str, overwrite_output_dir: bool = False, do_train: bool = False, do_eval: bool = False, do_predict: bool = False, evaluation_strategy: Union[transformers.trainer_utils.IntervalStrategy, str] = 'no', prediction_loss_only: bool = False, per_device_train_batch_size: int = 8, per_device_eval_batch_size: int = 8, per_gpu_train_batch_size: Union[int, NoneType] = None, per_gpu_eval_batch_size: Union[int, NoneType] = None, gradient_accumulation_steps: int = 1, eval_accumulation_steps: Union[int, NoneType] = None, eval_delay: Union[float, NoneType] = 0, learning_rate: float = 5e-05, weight_decay: float = 0.0, adam_beta1: float = 0.9, adam_beta2: float = 0.999, adam_epsilon: float = 1e-08, max_grad_norm: float = 1.0, num_train_epochs: float = 3.0, max_steps: int = -1, lr_scheduler_type: Union[transformers.trainer_utils.SchedulerType, str] = 'linear', warmup_ratio: float = 0.0, warmup_steps: int = 0, log_level: Union[str, NoneType] = 'passive', log_level_replica: Union[str, NoneType] = 'passive', log_on_each_node: bool = True, logging_dir: Union[str, NoneType] = None, logging_strategy: Union[transformers.trainer_utils.IntervalStrategy, str] = 'steps', logging_first_step: bool = False, logging_steps: int = 500, logging_nan_inf_filter: bool = True, save_strategy: Union[transformers.trainer_utils.IntervalStrategy, str] = 'steps', save_steps: int = 500, save_total_limit: Union[int, NoneType] = None, save_on_each_node: bool = False, no_cuda: bool = False, use_mps_device: bool = False, seed: int = 42, data_seed: Union[int, NoneType] = None, jit_mode_eval: bool = False, use_ipex: bool = False, bf16: bool = False, fp16: bool = False, fp16_opt_level: str = 'O1', half_precision_backend: str = 'auto', bf16_full_eval: bool = False, fp16_full_eval: bool = False, tf32: Union[bool, NoneType] = None, local_rank: int = -1, xpu_backend: Union[str, NoneType] = None, tpu_num_cores: Union[int, NoneType] = None, 
tpu_metrics_debug: bool = False, debug: str = '', dataloader_drop_last: bool = False, eval_steps: Union[int, NoneType] = None, dataloader_num_workers: int = 0, past_index: int = -1, run_name: Union[str, NoneType] = None, disable_tqdm: Union[bool, NoneType] = None, remove_unused_columns: Union[bool, NoneType] = True, label_names: Union[List[str], NoneType] = None, load_best_model_at_end: Union[bool, NoneType] = False, metric_for_best_model: Union[str, NoneType] = None, greater_is_better: Union[bool, NoneType] = None, ignore_data_skip: bool = False, sharded_ddp: str = '', fsdp: str = '', fsdp_min_num_params: int = 0, fsdp_transformer_layer_cls_to_wrap: Union[str, NoneType] = None, deepspeed: Union[str, NoneType] = None, label_smoothing_factor: float = 0.0, optim: Union[transformers.training_args.OptimizerNames, str] = 'adamw_hf', adafactor: bool = False, group_by_length: bool = False, length_column_name: Union[str, NoneType] = 'length', report_to: Union[List[str], NoneType] = None, ddp_find_unused_parameters: Union[bool, NoneType] = None, ddp_bucket_cap_mb: Union[int, NoneType] = None, dataloader_pin_memory: bool = True, skip_memory_metrics: bool = True, use_legacy_prediction_loop: bool = False, push_to_hub: bool = False, resume_from_checkpoint: Union[str, NoneType] = None, hub_model_id: Union[str, NoneType] = None, hub_strategy: Union[transformers.trainer_utils.HubStrategy, str] = 'every_save', hub_token: Union[str, NoneType] = None, hub_private_repo: bool = False, gradient_checkpointing: bool = False, include_inputs_for_metrics: bool = False, fp16_backend: str = 'auto', push_to_hub_model_id: Union[str, NoneType] = None, push_to_hub_organization: Union[str, NoneType] = None, push_to_hub_token: Union[str, NoneType] = None, mp_parameters: str = '', auto_find_batch_size: bool = False, full_determinism: bool = False, torchdynamo: Union[str, NoneType] = None, ray_scope: Union[str, NoneType] = 'last', ddp_timeout: Union[int, NoneType] = 1800, step_size: int = False, 
trim_min: int = False, hidden_size: int = False, gamma: float = False)[source]
step_size: int = False
trim_min: int = False
hidden_size: int = False
gamma: float = False
EduNLP.Pretrain.disenqnet_vec.train_disenqnet(train_items: List[dict], output_dir: str, pretrained_dir: Optional[str] = None, eval_items=None, tokenizer_params=None, data_params=None, model_params=None, train_params=None, w2v_params=None)[source]
Parameters
  • train_items (List[dict]) – The training items

  • output_dir (str) – The directory to save trained model files

  • pretrained_dir (str, optional) – The pretrained directory for model and tokenizer, by default None

  • tokenizer_params (dict, optional) – The parameters passed to DisenQTokenizer, by default None

  • data_params (dict, optional) – The parameters passed to DisenQDataset, by default None

  • model_params (dict, optional) – The model parameters, by default None

  • train_params (dict, optional) – The training parameters, by default None

EduNLP.Pretrain.quesnet_vec

Pre-process input text, tokenizing, building vocabs, and pre-train word level vectors.

class EduNLP.Pretrain.quesnet_vec.Question(id, content, answer, false_options, labels)
property answer

Alias for field number 2

property content

Alias for field number 1

property false_options

Alias for field number 3

property id

Alias for field number 0

property labels

Alias for field number 4
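
The "Alias for field number" entries are the standard wording for named-tuple fields, so each property can also be read by position. The sketch below rebuilds a tuple with the same field order using collections.namedtuple. The field values are made up for illustration and say nothing about the types QuesNet expects.

```python
from collections import namedtuple

# Same field order as documented: id=0, content=1, answer=2, false_options=3, labels=4.
Question = namedtuple(
    "Question", ["id", "content", "answer", "false_options", "labels"]
)

q = Question(
    id="q-001",                   # hypothetical values throughout
    content=["x", "+", "1"],
    answer=["2"],
    false_options=[["3"], ["4"]],
    labels={"difficulty": 0.2},
)

same_field = q[1] is q.content    # positional and named access return the same object
```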

EduNLP.Pretrain.quesnet_vec.save_list(item2index, path)[source]
class EduNLP.Pretrain.quesnet_vec.QuesNetTokenizer(vocab_path=None, meta_vocab_dir=None, img_dir: Optional[str] = None, max_length=250, tokenize_method='custom', symbol='mas', add_specials: Optional[list] = None, meta: Optional[List[str]] = None, img_token='<img>', unk_token='<unk>', pad_token='<pad>', **kwargs)[source]

Examples

>>> tokenizer = QuesNetTokenizer(meta=['knowledge'])
>>> test_items = [{"ques_content": "$\triangle A B C$ 的内角为 $A, \quad B, $\FigureID{test_id}$",
... "knowledge": "['*', '-', '/']"}, {"ques_content": "$\triangle A B C$ 的内角为 $A, \quad B",
... "knowledge": "['*', '-', '/']"}]
>>> tokenizer.set_vocab(test_items,
... trim_min_count=1, key=lambda x: x["ques_content"], silent=True)
>>> tokenizer.set_meta_vocab(test_items, silent=True)
>>> token_items = [tokenizer(i, key=lambda x: x["ques_content"]) for i in test_items]
>>> print(token_items[0].keys())
dict_keys(['seq_idx', 'meta_idx'])
>>> token_items = tokenizer(test_items, key=lambda x: x["ques_content"])
>>> print(len(token_items["seq_idx"]))
2
load_meta_vocab(meta_vocab_dir)[source]
set_meta_vocab(items: list, meta: Optional[List[str]] = None, silent=True)[source]
set_vocab(items: list, key=<function QuesNetTokenizer.<lambda>>, lower: bool = False, trim_min_count: int = 1, do_tokenize: bool = True, silent=True)[source]
Parameters
  • items (list) – can be the list of str, or list of dict

  • key (function) – determine how to get the text of each item

  • trim_min_count (int, optional) – the lower bound number for adding a word into vocabulary, by default 1

  • silent

classmethod from_pretrained(tokenizer_config_dir, img_dir=None, **kwargs)[source]

Parameters:

tokenizer_config_dir: str

must contain tokenizer_config.json and vocab.txt and meta_{meta_name}.txt

img_dir: str

The path of the image directory, by default None

save_pretrained(tokenizer_config_dir)[source]

Save tokenizer into local files

Parameters:

tokenizer_config_dir: str

The directory in which the tokenizer params are saved to tokenizer_config.json, the words to vocab.txt and the metas to meta_{meta_name}.txt

padding(idx, max_length, type='word')[source]
set_img_dir(path)[source]
EduNLP.Pretrain.quesnet_vec.clip(v, low, high)[source]
class EduNLP.Pretrain.quesnet_vec.Lines(filename, skip=0, preserve_newline=False)[source]
class EduNLP.Pretrain.quesnet_vec.QuestionLoader(ques: ~EduNLP.Pretrain.quesnet_vec.Lines, tokenizer: ~EduNLP.Pretrain.quesnet_vec.QuesNetTokenizer, pipeline=None, range=None, meta: ~typing.Optional[list] = None, content_key=<function QuestionLoader.<lambda>>, meta_key=<function QuestionLoader.<lambda>>, answer_key=<function QuestionLoader.<lambda>>, option_key=<function QuestionLoader.<lambda>>, skip=0)[source]
split_(split_ratio)[source]
EduNLP.Pretrain.quesnet_vec.optimizer(*models, **kwargs)[source]
class EduNLP.Pretrain.quesnet_vec.PrefetchIter(data, *label, length=None, batch_size=1, shuffle=True)[source]

Iterator on data and labels, with states for save and restore.

produce()[source]
class EduNLP.Pretrain.quesnet_vec.EmbeddingDataset(data, data_type='image')[source]
EduNLP.Pretrain.quesnet_vec.pretrain_iter(ques, batch_size)[source]
EduNLP.Pretrain.quesnet_vec.critical(f)[source]
EduNLP.Pretrain.quesnet_vec.pretrain_embedding_layer(dataset: EmbeddingDataset, ae: AE, lr: float = 0.001, log_step: int = 1, epochs: int = 3, batch_size: int = 4, device=device(type='cpu'))[source]
EduNLP.Pretrain.quesnet_vec.pretrain_quesnet(path, output_dir, img_dir=None, save_embs=False, train_params=None)[source]

Pretrain QuesNet

Parameters
  • path (str) – path of question file

  • output_dir (str) – output path

  • tokenizer (QuesNetTokenizer) – quesnet tokenizer

  • save_embs (bool, optional) – whether to save pretrained word/image/meta embeddings separately

  • train_params (dict, optional) –

    the training parameters and model parameters, by default None

    • “n_epochs”: int, default = 1

      train param, number of epochs

    • ”batch_size”: int, default = 6

      train param, batch size

    • ”lr”: float, default = 1e-3

      train param, learning rate

    • ”save_every”: int, default = 0

      train param, save steps interval

    • ”log_steps”: int, default = 10

      train param, log steps interval

    • ”device”: str, default = ‘cpu’

      train param, ‘cpu’ or ‘cuda’

    • ”max_steps”: int, default = 0

      train param, stop training when reach max steps

    • ”emb_size”: int, default = 256

      model param, the embedding size of word, figure, meta info

    • ”feat_size”: int, default = 256

      model param, the size of question infer vector
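
The defaults listed above can be combined with user overrides as in the sketch below. The default values are copied from this list. The helper resolve_train_params is hypothetical, and rejecting unknown keys is a design choice of this sketch, not documented behaviour of pretrain_quesnet.

```python
DEFAULT_TRAIN_PARAMS = {
    "n_epochs": 1,       # train param, number of epochs
    "batch_size": 6,     # train param, batch size
    "lr": 1e-3,          # train param, learning rate
    "save_every": 0,     # train param, save steps interval
    "log_steps": 10,     # train param, log steps interval
    "device": "cpu",     # train param, 'cpu' or 'cuda'
    "max_steps": 0,      # train param, stop training when max steps is reached
    "emb_size": 256,     # model param, embedding size of word, figure, meta info
    "feat_size": 256,    # model param, size of the question infer vector
}


def resolve_train_params(train_params=None):
    """Overlay user-supplied values on the documented defaults."""
    overrides = dict(train_params or {})
    unknown = sorted(set(overrides) - set(DEFAULT_TRAIN_PARAMS))
    if unknown:
        # Assumption of this sketch: unknown keys are treated as mistakes.
        raise KeyError(f"unknown train_params keys: {unknown}")
    return {**DEFAULT_TRAIN_PARAMS, **overrides}


params = resolve_train_params({"n_epochs": 3, "device": "cuda"})
```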

Examples

>>> tokenizer = QuesNetTokenizer(meta=['know_name'])
>>> items = [{"ques_content": "若复数$z=1+2 i+i^{3}$,则$|z|=$,$\FigureID{000004d6-0479-11ec-829b-797d5eb43535}$",
... "ques_id": "726cdbec-33a9-11ec-909c-98fa9b625adb",
... "know_name": "['代数', '集合', '集合的相等']"
... }]
>>> tokenizer.set_vocab(items, key=lambda x: x['ques_content'], trim_min_count=1, silent=True)
>>> pretrain_quesnet('./data/standard_luna_data.json', './testQuesNet', tokenizer)