EduNLP.Pretrain¶
EduNLP.Pretrain.pretrian_utils¶
- class EduNLP.Pretrain.pretrian_utils.EduVocab(vocab_path: Optional[str] = None, corpus_items: Optional[List[str]] = None, bos_token: str = '[BOS]', eos_token: str = '[EOS]', pad_token: str = '[PAD]', unk_token: str = '[UNK]', specials: Optional[List[str]] = None, lower: bool = False, trim_min_count: int = 1, **kwargs)[source]¶
The vocabulary container for a corpus.
- Parameters
vocab_path (str, optional) – vocabulary path to initialize this container, by default None
corpus_items (List[str], optional) – corpus items to update this vocabulary, by default None
bos_token (str, optional) – token representing the start of a sentence, by default “[BOS]”
eos_token (str, optional) – token representing the end of a sentence, by default “[EOS]”
pad_token (str, optional) – token used for padding, by default “[PAD]”
unk_token (str, optional) – token representing an unknown word, by default “[UNK]”
specials (List[str], optional) – special tokens in the vocabulary, by default None
lower (bool, optional) – whether to lowercase the corpus items, by default False
trim_min_count (int, optional) – the lower bound number for adding a word into vocabulary, by default 1
- property vocab_size¶
- property special_tokens¶
- property tokens¶
- convert_sequence_to_idx(tokens, bos=False, eos=False)[source]¶
Convert a sequence of tokens to a sequence of indices.
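As a rough illustration of what such a conversion might look like, the sketch below maps tokens to indices with a plain dict, falls back to the unknown token for unseen words, and optionally wraps the result in the start and end tokens. The `token2idx` table and the function body are assumptions made for illustration, not EduVocab's actual implementation.

```python
from typing import Dict, List

# Hypothetical lookup table; real indices depend on the loaded vocabulary.
token2idx: Dict[str, int] = {"[PAD]": 0, "[UNK]": 1, "[BOS]": 2, "[EOS]": 3, "x": 4, "y": 5}


def convert_sequence_to_idx(tokens: List[str], bos: bool = False, eos: bool = False) -> List[int]:
    """Map tokens to indices, using the [UNK] index for out-of-vocabulary tokens."""
    unk = token2idx["[UNK]"]
    ids = [token2idx.get(token, unk) for token in tokens]
    if bos:
        ids = [token2idx["[BOS]"]] + ids  # prepend start-of-sentence index
    if eos:
        ids = ids + [token2idx["[EOS]"]]  # append end-of-sentence index
    return ids


print(convert_sequence_to_idx(["x", "z", "y"], bos=True, eos=True))  # [2, 4, 1, 5, 3]
```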
- set_vocab(corpus_items: List[str], lower: bool = False, trim_min_count: int = 1, silent=True)[source]¶
Update the vocabulary with the tokens in corpus items
- Parameters
corpus_items (List[str]) – corpus items to update this vocabulary
lower (bool, optional) – whether to lowercase the corpus items, by default False
trim_min_count (int, optional) – the lower bound number for adding a word into vocabulary, by default 1
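The interaction of `lower`, `trim_min_count`, and the special tokens can be sketched with a small stand-alone function. This is only an approximation under stated assumptions: each corpus item is taken to be an already tokenized list of tokens, special tokens are placed first, and `build_vocab` is a hypothetical helper rather than part of EduNLP.

```python
from collections import Counter
from typing import List, Optional


def build_vocab(corpus_items: List[List[str]], specials: Optional[List[str]] = None,
                lower: bool = False, trim_min_count: int = 1) -> List[str]:
    """Return special tokens followed by every word seen at least trim_min_count times."""
    base = ["[PAD]", "[UNK]", "[BOS]", "[EOS]"] + list(specials or [])
    counts = Counter()
    for tokens in corpus_items:
        counts.update(t.lower() if lower else t for t in tokens)
    # Counter keeps first-seen order, so the vocabulary order is deterministic.
    kept = [word for word, count in counts.items()
            if count >= trim_min_count and word not in base]
    return base + kept


vocab = build_vocab([["x", "+", "y"], ["x", "=", "1"]],
                    specials=["[FIGURE]"], trim_min_count=2)
print(vocab)  # ['[PAD]', '[UNK]', '[BOS]', '[EOS]', '[FIGURE]', 'x']
```

Only `x` survives here because it is the only word that appears twice; with the default `trim_min_count=1` every word would be kept.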
- load_vocab(vocab_path: str)[source]¶
Load the vocabulary from the file at vocab_path
- Parameters
vocab_path (str) – path of the vocabulary file to load
- class EduNLP.Pretrain.pretrian_utils.EduDataset(tokenizer, ds_disk_path: Optional[Dataset] = None, items: Optional[Union[List[dict], List[str]]] = None, stem_key: str = 'text', label_key: Optional[str] = None, feature_keys: Optional[List[str]] = None, num_processor: Optional[int] = None, **kwargs)[source]¶
The base dataset class. It wraps datasets.Dataset and adds conveniences such as parallel preprocessing and offline loading.
- Parameters
tokenizer – PretrainedEduTokenizer or model-specific Pretrained Tokenizer
ds_disk_path (HFDataset, optional) – the dataset_path to save dataset used by datasets.Dataset, by default None
items (Union[List[dict], List[str]], optional) – input items to process, by default None
stem_key (str, optional) – the key of the content to process in each item, by default “text”
label_key (Optional[str], optional) – the key of the labels in each item, by default None
feature_keys (Optional[List[str]], optional) – the keys of additional features to keep, by default None
num_processor (int, optional) – the number of CPUs used for parallel speedup, by default None
- ds¶
Note: map will break down for very large data (greater than 4 GB).
- class EduNLP.Pretrain.pretrian_utils.PretrainedEduTokenizer(vocab_path: Optional[str] = None, max_length: int = 250, tokenize_method: str = 'pure_text', add_specials: Tuple[list, bool] = False, **kwargs)[source]¶
This base class is in charge of preparing the inputs for a model
- Parameters
vocab_path (str, optional) – _description_, by default None
max_length (int, optional) – used to clip sentences longer than max_length, by default 250
tokenize_method (str, optional) – by default “pure_text”
- when text is already separated by spaces, use “space”
- when text is a raw string, use a tokenizer defined in get_tokenizer(), such as “pure_text” and “text”
add_specials (Tuple[list, bool], optional) – by default False
- for a bool, whether to add EDU_SPYMBOLS to the vocabulary
- for a list, the special tokens to add besides EDU_SPYMBOLS
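The two accepted forms of `add_specials` can be illustrated with a short sketch. The contents of `EDU_SYMBOLS_PLACEHOLDER` are an assumption standing in for the library's EDU_SPYMBOLS constant, and `resolve_specials` is a hypothetical helper, not an EduNLP function.

```python
from typing import List, Union

# Placeholder for the library's EDU_SPYMBOLS constant; the real contents may differ.
EDU_SYMBOLS_PLACEHOLDER = ["[FIGURE]", "[TAG]"]


def resolve_specials(add_specials: Union[List[str], bool]) -> List[str]:
    """Return the special tokens implied by a bool or list value of add_specials."""
    if isinstance(add_specials, bool):
        # True adds only the built-in symbols, False adds nothing.
        return list(EDU_SYMBOLS_PLACEHOLDER) if add_specials else []
    # A list extends the built-in symbols, skipping duplicates.
    extra = [token for token in add_specials if token not in EDU_SYMBOLS_PLACEHOLDER]
    return EDU_SYMBOLS_PLACEHOLDER + extra


print(resolve_specials(True))       # ['[FIGURE]', '[TAG]']
print(resolve_specials(["[NUM]"]))  # ['[FIGURE]', '[TAG]', '[NUM]']
```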
- tokenize(items: ~typing.Tuple[list, str, dict], key=<function PretrainedEduTokenizer.<lambda>>, **kwargs)[source]¶
- Parameters
items (list or str or dict) – the question items
key (function) – determine how to get the text of each item
- Returns
tokens – the token of items
- Return type
list
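The role of `key` is easiest to see with mixed input types. The sketch below is an assumption-laden stand-in: it uses whitespace splitting instead of a real EduNLP tokenizer, and the function is a simplified imitation of the documented signature, not the library code.

```python
from typing import Callable, List, Union

Item = Union[str, dict]


def tokenize(items: Union[Item, List[Item]], key: Callable = lambda x: x):
    """Tokenize one item or a list of items; key extracts the text from each item."""
    if isinstance(items, list):
        return [tokenize(item, key=key) for item in items]
    return key(items).split()  # whitespace split as a stand-in tokenizer


# A plain string needs no key; dict items need a key that selects the text field.
print(tokenize("x + y"))                                        # ['x', '+', 'y']
print(tokenize([{"content": "a b"}], key=lambda x: x["content"]))  # [['a', 'b']]
```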
- encode(items: ~typing.Tuple[str, dict, ~typing.List[str], ~typing.List[dict]], key=<function PretrainedEduTokenizer.<lambda>>, **kwargs)[source]¶
- classmethod from_pretrained(tokenizer_config_dir: str, **kwargs)[source]¶
Load tokenizer from local files
Parameters:¶
- tokenizer_config_dir: str
The dir path containing tokenizer_config.json and vocab.list
- save_pretrained(tokenizer_config_dir: str)[source]¶
Save tokenizer into local files
Parameters:¶
- tokenizer_config_dir: str
The directory in which tokenizer params are saved as tokenizer_config.json and words as vocab.list
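A save and load round trip over the two files named above might look like the following sketch. The file names come from this page; the one-token-per-line layout of vocab.list and the helper names `save_tokenizer` and `load_tokenizer` are assumptions for illustration.

```python
import json
import os
import tempfile


def save_tokenizer(config_dir, params, tokens):
    """Write params to tokenizer_config.json and tokens to vocab.list."""
    os.makedirs(config_dir, exist_ok=True)
    with open(os.path.join(config_dir, "tokenizer_config.json"), "w", encoding="utf-8") as f:
        json.dump(params, f, ensure_ascii=False, indent=2)
    with open(os.path.join(config_dir, "vocab.list"), "w", encoding="utf-8") as f:
        f.write("\n".join(tokens))  # assumed layout: one token per line


def load_tokenizer(config_dir):
    """Read back the params dict and the token list written by save_tokenizer."""
    with open(os.path.join(config_dir, "tokenizer_config.json"), encoding="utf-8") as f:
        params = json.load(f)
    with open(os.path.join(config_dir, "vocab.list"), encoding="utf-8") as f:
        tokens = f.read().splitlines()
    return params, tokens


with tempfile.TemporaryDirectory() as config_dir:
    save_tokenizer(config_dir, {"max_length": 250, "tokenize_method": "pure_text"},
                   ["[PAD]", "[UNK]", "x"])
    params, tokens = load_tokenizer(config_dir)
    print(params["max_length"], tokens)  # 250 ['[PAD]', '[UNK]', 'x']
```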
- property vocab_size¶
- set_vocab(items: list, key=<function PretrainedEduTokenizer.<lambda>>, lower: bool = False, trim_min_count: int = 1, do_tokenize: bool = True)[source]¶
Update the vocabulary with the tokens in corpus items
- Parameters
items (list) – can be the list of str, or list of dict
key (function, optional) – determine how to get the text of each item
lower (bool, optional) – whether to lowercase the corpus items, by default False
trim_min_count (int, optional) – the lower bound number for adding a word into vocabulary, by default 1
do_tokenize (bool, optional) – whether to tokenize items before updating vocab, by default True
- Returns
token_items
- Return type
list
EduNLP.Pretrain.hugginface_utils¶
- class EduNLP.Pretrain.hugginface_utils.TokenizerForHuggingface(pretrained_model='bert-base-chinese', max_length=512, tokenize_method: str = 'pure_text', add_specials: Union[List[str], bool] = False, **kwargs)[source]¶
Parameters¶
- pretrained_model:
used pretrained model
- add_specials:
Whether to add tokens like [FIGURE], [TAG], etc.
- tokenize_method:
Which text tokenizer to use. Must be consistent with TOKENIZER dictionary.
Examples
>>> tokenizer = TokenizerForHuggingface(add_special_tokens=True)
>>> item = "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"
>>> token_item = tokenizer(item)
>>> print(token_item.input_ids[:10])
tensor([[ 101, 1062, 2466, 1963, 1745, 138, 100, 140, 166, 117, 167, 5276, 3338, 3340, 816, 1062, 2466, 102, 168, 134, 166, 116, 128, 167, 3297, 1920, 966, 138, 100, 140, 102]])
>>> print(tokenizer.tokenize(item)[:10])
['公', '式', '如', '图', '[', '[UNK]', ']', 'x', ',', 'y']
>>> items = [item, item]
>>> token_items = tokenizer(items, return_tensors='pt')
>>> print(token_items.input_ids.shape)
torch.Size([2, 31])
>>> print(len(tokenizer.tokenize(items)))
2
>>> tokenizer.save_pretrained('test_dir')
>>> tokenizer = TokenizerForHuggingface.from_pretrained('test_dir')
- tokenize(items: ~typing.Union[list, str, dict], key=<function TokenizerForHuggingface.<lambda>>, **kwargs)[source]¶
- encode(items: ~typing.Tuple[str, dict, ~typing.List[str], ~typing.List[dict]], key=<function TokenizerForHuggingface.<lambda>>, **kwargs)[source]¶
- property vocab_size¶
- set_vocab(items: ~typing.Tuple[~typing.List[str], ~typing.List[dict]], key=<function TokenizerForHuggingface.<lambda>>, lower=False, trim_min_count: int = 1, do_tokenize: bool = True)[source]¶
- Parameters
items (list) – can be the list of str, or list of dict
key (function) – determine how to get the text of each item
trim_min_count (int, optional) – the lower bound number for adding a word into vocabulary, by default 1
do_tokenize (bool, optional) – whether to tokenize items before updating vocab, by default True
EduNLP.Pretrain.gensim_vec¶
- class EduNLP.Pretrain.gensim_vec.GensimWordTokenizer(symbol='gm', general=False)[source]¶
- Parameters
symbol (str) –
- select the methods to symbolize:
“t”: text, “f”: formula, “g”: figure, “m”: question mark, “a”: tag, “s”: sep,
e.g.: gm, fgm, gmas, fgmas
general (bool) –
True: use when the item isn’t in standard format and formulas (except formulas in figures) should be tokenized linearly.
False: use the ‘ast’ method to tokenize formulas instead of ‘linear’.
- Returns
tokenizer
- Return type
Examples
>>> tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
>>> tokenizer = GensimWordTokenizer(symbol="fgmas", general=False)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', '[FORMULA]', '约束条件', '公式', '[FORMULA]', '[SEP]', '[FORMULA]']
- EduNLP.Pretrain.gensim_vec.train_vector(items, w2v_prefix, embedding_dim=None, method='sg', binary=None, train_params=None)[source]¶
- Parameters
items:str – the text of question
w2v_prefix (str) – the path prefix of the saved model file
embedding_dim (int) – vector_size
method (str) – the method of training, e.g.: sg, cbow, fasttext, d2v, bow, tfidf
binary (bool) – the model format, True: bin; False: kv
train_params (dict) – the training parameters passed to model
- Returns
the path of the saved model file
- Return type
Examples
>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{1}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> train_vector(token_item[:10], "examples/test_model/w2v/gensim_luna_stem_t_", 100)
'examples/test_model/w2v/gensim_luna_stem_t_sg_100.kv'
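The returned path in the example above appears to be composed from the prefix, the training method, the embedding dimension, and a format suffix. The sketch below reproduces that pattern with a hypothetical helper, `build_model_path`; it is inferred only from the single example and the “True: bin; False: kv” note, so naming for other methods (such as d2v, bow, or tfidf) may differ.

```python
def build_model_path(w2v_prefix: str, method: str, embedding_dim: int, binary: bool = False) -> str:
    """Compose an output path as <prefix><method>_<dim>.<bin|kv> (assumed pattern)."""
    suffix = "bin" if binary else "kv"
    return f"{w2v_prefix}{method}_{embedding_dim}.{suffix}"


print(build_model_path("examples/test_model/w2v/gensim_luna_stem_t_", "sg", 100))
# examples/test_model/w2v/gensim_luna_stem_t_sg_100.kv
```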
- class EduNLP.Pretrain.gensim_vec.GensimSegTokenizer(symbol='gms', depth=None, flatten=False, **kwargs)[source]¶
- Parameters
symbol (str) –
- select the methods to symbolize:
“t”: text, “f”: formula, “g”: figure, “m”: question mark, “a”: tag, “s”: sep,
e.g. gms, fgm
depth (int or None) – 0: only separate at SIFSep ; 1: only separate at SIFTag ; 2: separate at SIFTag and SIFSep ; otherwise, separate all segments ;
- Returns
tokenizer
- Return type
Examples
>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{1}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> tokenizer = GensimSegTokenizer(symbol="fgm", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], ['[FORMULA]'], ['如图'], ['[FIGURE]'], ['[FORMULA]'],...['[FORMULA]'], ['最大值'], ['[MARK]']]
EduNLP.Pretrain.elmo_vec¶
- class EduNLP.Pretrain.elmo_vec.ElmoTokenizer(vocab_path=None, max_length=250, tokenize_method='pure_text', add_specials=True, **kwargs)[source]¶
Examples
>>> t=ElmoTokenizer()
>>> items = ["有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"]
>>> len(t)
14
>>> t.tokenize(items[0])
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']
>>> t(items[0])
{'seq_idx': tensor([1, 1, 6, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1, 1, 7]), 'seq_len': tensor(17)}
>>> t.set_vocab(items[0])
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']
>>> len(t)
45
>>> t(items[0])
{'seq_idx': tensor([ 1, 1, 6, 26, 27, 28, 1, 1, 9, 35, 36, 26, 37, 38, 28, 1, 7]), 'seq_len': tensor(17)}
- class EduNLP.Pretrain.elmo_vec.ElmoDataset(tokenizer: ElmoTokenizer, **kwargs)[source]¶
- EduNLP.Pretrain.elmo_vec.train_elmo(items: Union[List[dict], List[str]], output_dir: str, pretrained_dir: Optional[str] = None, tokenizer_params=None, data_params=None, model_params=None, train_params=None)[source]¶
- Parameters
items (list, required) – The training corpus, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_dir (str, optional) – The pretrained directory for model and tokenizer
tokenizer_params (dict, optional, default=None) – The parameters passed to ElmoTokenizer
data_params (dict, optional, default=None) – The parameters passed to ElmoDataset and ElmoTokenizer, e.g. stem_key and label_key
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –
- EduNLP.Pretrain.elmo_vec.train_elmo_for_property_prediction(train_items: list, output_dir: str, pretrained_dir=None, eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]¶
- Parameters
train_items (list, required) – The training items, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_dir (str, optional) – The pretrained directory for model and tokenizer
eval_items (list, optional) – The evaluating items, each item could be str or dict
tokenizer_params (dict, optional, default=None) – The parameters passed to ElmoTokenizer
data_params (dict, optional, default=None) – The parameters passed to ElmoDataset and ElmoTokenizer
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –
- EduNLP.Pretrain.elmo_vec.train_elmo_for_knowledge_prediction(train_items: list, output_dir: str, pretrained_dir=None, eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]¶
- Parameters
train_items (list, required) – The training items, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_dir (str, optional) – The pretrained directory for model and tokenizer
eval_items (list, optional) – The evaluating items, each item could be str or dict
tokenizer_params (dict, optional, default=None) – The parameters passed to ElmoTokenizer
data_params (dict, optional, default=None) – The parameters passed to ElmoDataset and ElmoTokenizer
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –
EduNLP.Pretrain.bert_vec¶
- class EduNLP.Pretrain.bert_vec.BertTokenizer(pretrained_model='bert-base-chinese', max_length=512, tokenize_method: str = 'pure_text', add_specials: Union[List[str], bool] = False, **kwargs)[source]¶
Examples
>>> tokenizer = BertTokenizer(add_special_tokens=True)
>>> item = "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"
>>> token_item = tokenizer(item)
>>> print(token_item.input_ids)
tensor([[ 101, 1062, 2466, 1963, 1745, 138, 100, 140, 166, 117, 167, 5276, 3338, 3340, 816, 1062, 2466, 102, 168, 134, 166, 116, 128, 167, 3297, 1920, 966, 138, 100, 140, 102]])
>>> print(tokenizer.tokenize(item)[:10])
['公', '式', '如', '图', '[', '[UNK]', ']', 'x', ',', 'y']
>>> items = [item, item]
>>> token_items = tokenizer(items, return_tensors='pt')
>>> print(token_items.input_ids.shape)
torch.Size([2, 31])
>>> print(len(tokenizer.tokenize(items)))
2
>>> tokenizer.save_pretrained('test_dir')
>>> tokenizer = BertTokenizer.from_pretrained('test_dir')
- class EduNLP.Pretrain.bert_vec.BertDataset(tokenizer, ds_disk_path: Optional[Dataset] = None, items: Optional[Union[List[dict], List[str]]] = None, stem_key: str = 'text', label_key: Optional[str] = None, feature_keys: Optional[List[str]] = None, num_processor: Optional[int] = None, **kwargs)[source]¶
- EduNLP.Pretrain.bert_vec.finetune_bert(items: Union[List[dict], List[str]], output_dir: str, pretrained_model='bert-base-chinese', tokenizer_params=None, data_params=None, model_params=None, train_params=None)[source]¶
- Parameters
items (list, required) – The training corpus, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_model (str, optional) – The pretrained model name or path for model and tokenizer
tokenizer_params (dict, optional, default=None) – The parameters passed to BertTokenizer
data_params (dict, optional, default=None) – The parameters passed to BertDataset and BertTokenizer
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –
Examples
>>> stems = ["有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$",
... "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$"]
>>> finetune_bert(stems, "examples/test_model/data/data/bert")
{'train_runtime': ..., ..., 'epoch': 1.0}
- EduNLP.Pretrain.bert_vec.finetune_bert_for_property_prediction(train_items, output_dir, pretrained_model='bert-base-chinese', eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]¶
- Parameters
train_items (list, required) – The training corpus, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_model (str, optional) – The pretrained model name or path for model and tokenizer
eval_items (list, optional) – The evaluating items, each item could be str or dict
tokenizer_params (dict, optional, default=None) – The parameters passed to BertTokenizer
data_params (dict, optional, default=None) – The parameters passed to BertDataset and BertTokenizer
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –
- EduNLP.Pretrain.bert_vec.finetune_bert_for_knowledge_prediction(train_items, output_dir, pretrained_model='bert-base-chinese', eval_items=None, tokenizer_params=None, data_params=None, train_params=None, model_params=None)[source]¶
- Parameters
train_items (list, required) – The training corpus, each item could be str or dict
output_dir (str, required) – The directory to save trained model files
pretrained_model (str, optional) – The pretrained model name or path for model and tokenizer
eval_items (list, optional) – The evaluating items, each item could be str or dict
tokenizer_params (dict, optional, default=None) – The parameters passed to BertTokenizer
data_params (dict, optional, default=None) – The parameters passed to BertDataset and BertTokenizer
model_params (dict, optional, default=None) – The parameters passed to Trainer
train_params (dict, optional, default=None) –
EduNLP.Pretrain.disenqnet_vec¶
- class EduNLP.Pretrain.disenqnet_vec.DisenQTokenizer(vocab_path=None, max_length=250, tokenize_method='pure_text', add_specials: Optional[list] = None, num_token='[NUM]', **kwargs)[source]¶
Examples
>>> tokenizer = DisenQTokenizer()
>>> test_items = [{
... "content": "甲 数 除以 乙 数 的 商 是 1.5 , 如果 甲 数 增加 20 , 则 甲 数 是 乙 的 4 倍 . 原来 甲 数 = .",
... "knowledge": ["*", "-", "/"], "difficulty": 0.2, "length": 7}]
>>> tokenizer.set_vocab(test_items,
... key=lambda x: x["content"], trim_min_count=1)
[['甲', '数', '除以', '乙', '数', '商', '[NUM]', '甲', '数', '增加', '[NUM]', '甲', '数', '乙', '倍', '甲', '数']]
>>> token_items = [tokenizer(i, key=lambda x: x["content"]) for i in test_items]
>>> print(token_items[0].keys())
dict_keys(['seq_idx', 'seq_len'])
- EduNLP.Pretrain.disenqnet_vec.preprocess_dataset(pretrained_dir, disen_tokenizer, items, data_formation, trim_min_count=None, embed_dim=None, w2v_params=None, silent=False)[source]¶
- class EduNLP.Pretrain.disenqnet_vec.DisenQDataset(items: List[Dict], tokenizer: DisenQTokenizer, data_formation: Dict, mode='train', concept_to_idx=None, **kwargs)[source]¶
- class EduNLP.Pretrain.disenqnet_vec.DisenQTrainer(model: Optional[Union[PreTrainedModel, Module]] = None, args: Optional[TrainingArguments] = None, data_collator: Optional[DataCollator] = None, train_dataset: Optional[Dataset] = None, eval_dataset: Optional[Dataset] = None, tokenizer: Optional[PreTrainedTokenizerBase] = None, model_init: Optional[Callable[[], PreTrainedModel]] = None, compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None, callbacks: Optional[List[TrainerCallback]] = None, optimizers: Tuple[Optimizer, LambdaLR] = (None, None), preprocess_logits_for_metrics: Optional[Callable[[Tensor, Tensor], Tensor]] = None)[source]¶
- create_optimizer_and_scheduler(num_training_steps: int)[source]¶
Setup the optimizer and the learning rate scheduler.
We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the Trainer’s init through optimizers, or subclass and override this method (or create_optimizer and/or create_scheduler) in a subclass.
- training_step(model: Module, inputs: Dict[str, Union[Tensor, Any]]) Tensor[source]¶
Perform a training step on a batch of inputs.
Subclass and override to inject custom behavior.
- Parameters
model (nn.Module) – The model to train.
inputs (Dict[str, Union[torch.Tensor, Any]]) –
The inputs and targets of the model.
The dictionary will be unpacked before being fed to the model. Most models expect the targets under the argument labels. Check your model’s documentation for all accepted arguments.
- Returns
The tensor with training loss on this batch.
- Return type
torch.Tensor
- class EduNLP.Pretrain.disenqnet_vec.DisenQTrainingArguments(output_dir: str, overwrite_output_dir: bool = False, do_train: bool = False, do_eval: bool = False, do_predict: bool = False, evaluation_strategy: Union[transformers.trainer_utils.IntervalStrategy, str] = 'no', prediction_loss_only: bool = False, per_device_train_batch_size: int = 8, per_device_eval_batch_size: int = 8, per_gpu_train_batch_size: Union[int, NoneType] = None, per_gpu_eval_batch_size: Union[int, NoneType] = None, gradient_accumulation_steps: int = 1, eval_accumulation_steps: Union[int, NoneType] = None, eval_delay: Union[float, NoneType] = 0, learning_rate: float = 5e-05, weight_decay: float = 0.0, adam_beta1: float = 0.9, adam_beta2: float = 0.999, adam_epsilon: float = 1e-08, max_grad_norm: float = 1.0, num_train_epochs: float = 3.0, max_steps: int = -1, lr_scheduler_type: Union[transformers.trainer_utils.SchedulerType, str] = 'linear', warmup_ratio: float = 0.0, warmup_steps: int = 0, log_level: Union[str, NoneType] = 'passive', log_level_replica: Union[str, NoneType] = 'passive', log_on_each_node: bool = True, logging_dir: Union[str, NoneType] = None, logging_strategy: Union[transformers.trainer_utils.IntervalStrategy, str] = 'steps', logging_first_step: bool = False, logging_steps: int = 500, logging_nan_inf_filter: bool = True, save_strategy: Union[transformers.trainer_utils.IntervalStrategy, str] = 'steps', save_steps: int = 500, save_total_limit: Union[int, NoneType] = None, save_on_each_node: bool = False, no_cuda: bool = False, use_mps_device: bool = False, seed: int = 42, data_seed: Union[int, NoneType] = None, jit_mode_eval: bool = False, use_ipex: bool = False, bf16: bool = False, fp16: bool = False, fp16_opt_level: str = 'O1', half_precision_backend: str = 'auto', bf16_full_eval: bool = False, fp16_full_eval: bool = False, tf32: Union[bool, NoneType] = None, local_rank: int = -1, xpu_backend: Union[str, NoneType] = None, tpu_num_cores: Union[int, NoneType] = None, 
tpu_metrics_debug: bool = False, debug: str = '', dataloader_drop_last: bool = False, eval_steps: Union[int, NoneType] = None, dataloader_num_workers: int = 0, past_index: int = -1, run_name: Union[str, NoneType] = None, disable_tqdm: Union[bool, NoneType] = None, remove_unused_columns: Union[bool, NoneType] = True, label_names: Union[List[str], NoneType] = None, load_best_model_at_end: Union[bool, NoneType] = False, metric_for_best_model: Union[str, NoneType] = None, greater_is_better: Union[bool, NoneType] = None, ignore_data_skip: bool = False, sharded_ddp: str = '', fsdp: str = '', fsdp_min_num_params: int = 0, fsdp_transformer_layer_cls_to_wrap: Union[str, NoneType] = None, deepspeed: Union[str, NoneType] = None, label_smoothing_factor: float = 0.0, optim: Union[transformers.training_args.OptimizerNames, str] = 'adamw_hf', optim_args: Union[str, NoneType] = None, adafactor: bool = False, group_by_length: bool = False, length_column_name: Union[str, NoneType] = 'length', report_to: Union[List[str], NoneType] = None, ddp_find_unused_parameters: Union[bool, NoneType] = None, ddp_bucket_cap_mb: Union[int, NoneType] = None, dataloader_pin_memory: bool = True, skip_memory_metrics: bool = True, use_legacy_prediction_loop: bool = False, push_to_hub: bool = False, resume_from_checkpoint: Union[str, NoneType] = None, hub_model_id: Union[str, NoneType] = None, hub_strategy: Union[transformers.trainer_utils.HubStrategy, str] = 'every_save', hub_token: Union[str, NoneType] = None, hub_private_repo: bool = False, gradient_checkpointing: bool = False, include_inputs_for_metrics: bool = False, fp16_backend: str = 'auto', push_to_hub_model_id: Union[str, NoneType] = None, push_to_hub_organization: Union[str, NoneType] = None, push_to_hub_token: Union[str, NoneType] = None, mp_parameters: str = '', auto_find_batch_size: bool = False, full_determinism: bool = False, torchdynamo: Union[str, NoneType] = None, ray_scope: Union[str, NoneType] = 'last', ddp_timeout: Union[int, 
NoneType] = 1800, step_size: int = False, trim_min: int = False, hidden_size: int = False, gamma: float = False)[source]¶
- step_size: int = False¶
- trim_min: int = False¶
- gamma: float = False¶
- EduNLP.Pretrain.disenqnet_vec.train_disenqnet(train_items: List[dict], output_dir: str, pretrained_dir: Optional[str] = None, eval_items=None, tokenizer_params=None, data_params=None, model_params=None, train_params=None, w2v_params=None)[source]¶
- Parameters
train_items (List[dict]) – the training items, each item is a dict
output_dir (str) – the directory to save trained model files
pretrained_dir (str, optional) – the pretrained directory for model and tokenizer, by default None
tokenizer_params (dict, optional) – the parameters passed to DisenQTokenizer, by default None
data_params (dict, optional) – the parameters passed to DisenQDataset, by default None
model_params (dict, optional) – the model parameters, by default None
train_params (dict, optional) – the training parameters, by default None
EduNLP.Pretrain.quesnet_vec¶
Pre-process input text: tokenize, build vocabularies, and pre-train word-level vectors.
- class EduNLP.Pretrain.quesnet_vec.Question(id, content, answer, false_options, labels)¶
- property answer¶
Alias for field number 2
- property content¶
Alias for field number 1
- property false_options¶
Alias for field number 3
- property id¶
Alias for field number 0
- property labels¶
Alias for field number 4
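The “Alias for field number N” descriptions are the standard wording for named tuple fields, so the same record can be read by name or by position. The sketch below rebuilds an equivalent named tuple with the standard library; the field values are made up for illustration.

```python
from collections import namedtuple

# Same field order as documented above: id=0, content=1, answer=2, false_options=3, labels=4.
Question = namedtuple("Question", ["id", "content", "answer", "false_options", "labels"])

# Hypothetical values, only to show positional and named access side by side.
q = Question(id="q1", content=["x", "+", "y"], answer=["2"],
             false_options=[["1"], ["3"]], labels={"difficulty": 0.2})

print(q.content is q[1])  # True
print(Question._fields)   # ('id', 'content', 'answer', 'false_options', 'labels')
```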
- class EduNLP.Pretrain.quesnet_vec.QuesNetTokenizer(vocab_path=None, meta_vocab_dir=None, img_dir: Optional[str] = None, max_length=250, tokenize_method='custom', symbol='mas', add_specials: Optional[list] = None, meta: Optional[List[str]] = None, img_token='<img>', unk_token='<unk>', pad_token='<pad>', **kwargs)[source]¶
Examples
>>> tokenizer = QuesNetTokenizer(meta=['knowledge'])
>>> test_items = [{"ques_content": "$\triangle A B C$ 的内角为 $A, \quad B, $\FigureID{test_id}$",
... "knowledge": "['*', '-', '/']"}, {"ques_content": "$\triangle A B C$ 的内角为 $A, \quad B",
... "knowledge": "['*', '-', '/']"}]
>>> tokenizer.set_vocab(test_items,
... trim_min_count=1, key=lambda x: x["ques_content"], silent=True)
>>> tokenizer.set_meta_vocab(test_items, silent=True)
>>> token_items = [tokenizer(i, key=lambda x: x["ques_content"]) for i in test_items]
>>> print(token_items[0].keys())
dict_keys(['seq_idx', 'meta_idx'])
>>> token_items = tokenizer(test_items, key=lambda x: x["ques_content"])
>>> print(len(token_items["seq_idx"]))
2
- set_vocab(items: list, key=<function QuesNetTokenizer.<lambda>>, lower: bool = False, trim_min_count: int = 1, do_tokenize: bool = True, silent=True)[source]¶
- Parameters
items (list) – can be the list of str, or list of dict
key (function) – determine how to get the text of each item
trim_min_count (int, optional) – the lower bound number for adding a word into vocabulary, by default 1
silent –
- classmethod from_pretrained(tokenizer_config_dir, img_dir=None, **kwargs)[source]¶
Parameters:¶
- tokenizer_config_dir: str
must contain tokenizer_config.json and vocab.txt and meta_{meta_name}.txt
- img_dir: str
the path of the image directory, by default None
- class EduNLP.Pretrain.quesnet_vec.QuestionLoader(ques: ~EduNLP.Pretrain.quesnet_vec.Lines, tokenizer: ~EduNLP.Pretrain.quesnet_vec.QuesNetTokenizer, pipeline=None, range=None, meta: ~typing.Optional[list] = None, content_key=<function QuestionLoader.<lambda>>, meta_key=<function QuestionLoader.<lambda>>, answer_key=<function QuestionLoader.<lambda>>, option_key=<function QuestionLoader.<lambda>>, skip=0)[source]¶
- class EduNLP.Pretrain.quesnet_vec.PrefetchIter(data, *label, length=None, batch_size=1, shuffle=True)[source]¶
Iterator on data and labels, with states for save and restore.
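The idea of an iterator whose position can be saved and restored can be sketched in a few lines. The class below is a hypothetical, simplified stand-in: it handles only data (no labels, no prefetching), and the names `ResumableBatchIter`, `state_dict`, and `load_state_dict` are assumptions, not PrefetchIter's actual interface.

```python
import random


class ResumableBatchIter:
    """Batch iterator whose shuffle order and position can be saved and restored."""

    def __init__(self, data, batch_size=1, shuffle=True, seed=0):
        self.data = list(data)
        self.batch_size = batch_size
        self.order = list(range(len(self.data)))
        if shuffle:
            random.Random(seed).shuffle(self.order)
        self.pos = 0  # index into self.order of the next unread element

    def __iter__(self):
        return self

    def __next__(self):
        if self.pos >= len(self.order):
            raise StopIteration
        idx = self.order[self.pos:self.pos + self.batch_size]
        self.pos += len(idx)
        return [self.data[i] for i in idx]

    def state_dict(self):
        return {"order": list(self.order), "pos": self.pos}

    def load_state_dict(self, state):
        self.order = list(state["order"])
        self.pos = state["pos"]


it = ResumableBatchIter(range(10), batch_size=3)
next(it)                   # consume one batch
state = it.state_dict()    # checkpoint the position
resumed = ResumableBatchIter(range(10), batch_size=3)
resumed.load_state_dict(state)
print(list(resumed) == list(it))  # True: both continue from the same point
```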
- EduNLP.Pretrain.quesnet_vec.pretrain_embedding_layer(dataset: EmbeddingDataset, ae: AE, lr: float = 0.001, log_step: int = 1, epochs: int = 3, batch_size: int = 4, device=device(type='cpu'))[source]¶
- EduNLP.Pretrain.quesnet_vec.pretrain_quesnet(path, output_dir, img_dir=None, save_embs=False, train_params=None)[source]¶
Pretrain QuesNet.
- Parameters
path (str) – path of question file
output_dir (str) – output path
tokenizer (QuesNetTokenizer) – quesnet tokenizer
save_embs (bool, optional) – whether to save pretrained word/image/meta embeddings separately
train_params (dict, optional) –
the training parameters and model parameters, by default None
- “n_epochs”: int, default = 1
train param, number of epochs
- “batch_size”: int, default = 6
train param, batch size
- “lr”: float, default = 1e-3
train param, learning rate
- “save_every”: int, default = 0
train param, save steps interval
- “log_steps”: int, default = 10
train param, log steps interval
- “device”: str, default = ‘cpu’
train param, ‘cpu’ or ‘cuda’
- “max_steps”: int, default = 0
train param, stop training when max steps is reached
- “emb_size”: int, default = 256
model param, the embedding size of word, figure, meta info
- “feat_size”: int, default = 256
model param, the size of question infer vector
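One way to picture how a partial `train_params` dict combines with the defaults listed above is a simple overlay. The default values are taken from this page; the merge logic, the rejection of unknown keys, and the helper name `resolve_train_params` are assumptions for illustration and may not match how pretrain_quesnet handles its arguments.

```python
DEFAULT_TRAIN_PARAMS = {
    "n_epochs": 1,       # train param
    "batch_size": 6,     # train param
    "lr": 1e-3,          # train param
    "save_every": 0,     # train param
    "log_steps": 10,     # train param
    "device": "cpu",     # train param
    "max_steps": 0,      # train param
    "emb_size": 256,     # model param
    "feat_size": 256,    # model param
}


def resolve_train_params(train_params=None):
    """Overlay user-supplied values on the documented defaults."""
    overrides = dict(train_params or {})
    unknown = set(overrides) - set(DEFAULT_TRAIN_PARAMS)
    if unknown:
        # Assumed behavior: fail loudly on misspelled keys.
        raise KeyError(f"unknown train_params: {sorted(unknown)}")
    params = dict(DEFAULT_TRAIN_PARAMS)
    params.update(overrides)
    return params


params = resolve_train_params({"n_epochs": 3, "device": "cuda"})
print(params["n_epochs"], params["batch_size"], params["device"])  # 3 6 cuda
```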
Examples
>>> tokenizer = QuesNetTokenizer(meta=['know_name'])
>>> items = [{"ques_content": "若复数$z=1+2 i+i^{3}$,则$|z|=$,$\FigureID{000004d6-0479-11ec-829b-797d5eb43535}$",
... "ques_id": "726cdbec-33a9-11ec-909c-98fa9b625adb",
... "know_name": "['代数', '集合', '集合的相等']"
... }]
>>> tokenizer.set_vocab(items, key=lambda x: x['ques_content'], trim_min_count=1, silent=True)
>>> pretrain_quesnet('./data/standard_luna_data.json', './testQuesNet', tokenizer)