EduNLP.Vector¶
EduNLP.Vector.t2v¶
- class EduNLP.Vector.t2v.T2V(model: str, *args, **kwargs)[source]¶
Converts token lists to vectors. If you already have a trained model file, use T2V directly. Otherwise, get_pretrained_t2v is the more convenient route: it provides a pretrained model, so you do not need one of your own.
- Parameters
model (str) – the model type, e.g. d2v, rnn, lstm, gru, elmo.
Examples
>>> item = [{'ques_content':'有公式$\FormFigureID{wrong1?}$和公式$\FormFigureBase64{wrong2?}$,
... 如图$\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$'}]
>>> path = "examples/test_model/d2v/d2v_test_256/d2v_test_256.bin"
>>> t2v = T2V('d2v',filepath=path)
>>> print(t2v(item))
[array([...dtype=float32)]
- property vector_size: int¶
- EduNLP.Vector.t2v.get_pretrained_t2v(name, model_dir='/home/docs/.EduNLP/model')[source]¶
A convenient way to convert token lists to vectors easily, using a pretrained model.
- Parameters
name (str) – the pretrained model to load, e.g. d2v_math_300, w2v_math_300, elmo_math_2048, bert_math_768, bert_taledu_768, disenq_math_256, quesnet_math_512
model_dir (str) – the directory containing the model; default: MODEL_DIR = '~/.EduNLP/model'
- Returns
t2v model
- Return type
T2V
Examples
>>> item = [{'ques_content':'有公式$\FormFigureID{wrong1?}$和公式$\FormFigureBase64{wrong2?}$,
... 如图$\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$'}]
>>> i2v = get_pretrained_t2v("d2v_test_256", "examples/test_model/d2v")
>>> print(i2v(item))
[array([...dtype=float32)]
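The rendered signature shows an absolute default for model_dir, while the parameter description gives '~/.EduNLP/model'. One plausible reading is that the signature was rendered after '~' had been expanded on the machine that built this page. The sketch below only illustrates that expansion with the standard library; resolve_model_dir is a hypothetical helper, not part of EduNLP.

```python
import os

# Default taken from the parameter description above.
DEFAULT_MODEL_DIR = "~/.EduNLP/model"


def resolve_model_dir(model_dir=DEFAULT_MODEL_DIR):
    """Hypothetical helper: expand a leading '~' to the current user's home directory."""
    return os.path.expanduser(model_dir)


# The default resolves to a path under the current user's home directory.
print(resolve_model_dir())

# A path without '~' is returned unchanged.
print(resolve_model_dir("examples/test_model/d2v"))
```

On your own machine the default would therefore resolve under your home directory rather than under /home/docs.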
EduNLP.Vector.disenqnet¶
- class EduNLP.Vector.disenqnet.disenqnet.DisenQModel(pretrained_dir, device='cpu')[source]¶
- infer_vector(items: dict, vector_type=None, **kwargs) -> Tensor[source]¶
- Parameters
vector_type (str) – selects which tensor to return. The default, None, returns both (k_hidden, i_hidden); vector_type="k" returns k_hidden; vector_type="i" returns i_hidden.
- property vector_size¶
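The vector_type rule described for infer_vector can be sketched as a small selection function. This is an illustration of the documented behaviour only: select_hidden is a hypothetical name, plain lists stand in for tensors, and raising ValueError for any other value is an assumption rather than something the documentation states.

```python
def select_hidden(k_hidden, i_hidden, vector_type=None):
    """Return the hidden state(s) chosen by vector_type (hypothetical helper)."""
    if vector_type is None:
        # Default: return both hidden states as a pair.
        return k_hidden, i_hidden
    if vector_type == "k":
        return k_hidden
    if vector_type == "i":
        return i_hidden
    # Assumption: other values are rejected; the real method may behave differently.
    raise ValueError(f"unsupported vector_type: {vector_type!r}")


k, i = [0.1, 0.2], [0.3, 0.4]   # placeholder values standing in for tensors
print(select_hidden(k, i))       # ([0.1, 0.2], [0.3, 0.4])
print(select_hidden(k, i, "k"))  # [0.1, 0.2]
print(select_hidden(k, i, "i"))  # [0.3, 0.4]
```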
EduNLP.Vector.quesnet¶
- class EduNLP.Vector.quesnet.quesnet.QuesNetModel(pretrained_dir, tokenizer=None, device='cpu')[source]¶
- infer_vector(items: Union[Question, list]) -> Tensor[source]¶
Get question embeddings with QuesNet.
- Parameters
items (Question or list) – a Question namedtuple with fields ['id', 'content', 'answer', 'false_options', 'labels'], or a list of Questions
- infer_tokens(items: Union[Question, list]) -> Tensor[source]¶
Get token embeddings with QuesNet.
- Parameters
items (Question or list) – a Question namedtuple with fields ['id', 'content', 'answer', 'false_options', 'labels'], or a list of Questions
- Returns
meta_emb + word_embs
- Return type
torch.Tensor
- property vector_size¶
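Both QuesNet methods accept either a single Question or a list of Questions. The sketch below shows that input shape with a stand-in namedtuple built from the documented field names. The stand-in Question, the as_question_list helper, and all field values are hypothetical placeholders; in real use the Question type should come from EduNLP itself.

```python
from collections import namedtuple

# Stand-in with the documented field names (not the library's own class).
Question = namedtuple(
    "Question", ["id", "content", "answer", "false_options", "labels"]
)


def as_question_list(items):
    """Accept one Question or a list of Questions and always return a list."""
    if isinstance(items, Question):
        return [items]
    return list(items)


# Placeholder values; the real field contents depend on the QuesNet tokenizer.
q = Question(
    id="q1",
    content=["x", "+", "1"],
    answer=["2"],
    false_options=[["3"], ["4"]],
    labels={},
)

print(len(as_question_list(q)))       # 1
print(len(as_question_list([q, q])))  # 2
```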
EduNLP.Vector.elmo_vec¶
EduNLP.Vector.gensim_vec¶
- class EduNLP.Vector.gensim_vec.W2V(filepath, method=None, binary=None)[source]¶
Uses the gensim library, which provides FastText, Word2Vec and KeyedVectors, to convert words to vectors.
- Parameters
filepath – path to the pretrained model file
method (str) – "fasttext", or any other value for Word2Vec
binary (bool) –
- property vectors¶
- property vector_size¶
- class EduNLP.Vector.gensim_vec.BowLoader(filepath)[source]¶
Uses gensim's doc2bow method to build bag-of-words vectors.
Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.
If allow_update is set, then also update dictionary in the process: create ids for new words. At the same time, update document frequencies – for each word appearing in this document, increase its document frequency (self.dfs) by one.
If allow_update is not set, this function is const, aka read-only.
- property vector_size¶
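The doc2bow behaviour described for BowLoader, including the effect of allow_update, can be sketched in pure Python. TinyDictionary is a hypothetical stand-in, not gensim's Dictionary; in particular, the order in which new ids are assigned here (first seen, first numbered) is an assumption and may differ from gensim.

```python
from collections import Counter


class TinyDictionary:
    """Minimal stand-in for a token-to-id dictionary (not gensim's class)."""

    def __init__(self):
        self.token2id = {}  # token -> integer id
        self.dfs = {}       # token id -> number of documents containing the token

    def doc2bow(self, document, allow_update=False):
        """Convert a list of tokens into a sorted list of (token_id, count) pairs."""
        counts = Counter(document)
        if allow_update:
            for token in counts:
                # Create ids for new words, then bump document frequency once per document.
                if token not in self.token2id:
                    self.token2id[token] = len(self.token2id)
                token_id = self.token2id[token]
                self.dfs[token_id] = self.dfs.get(token_id, 0) + 1
        # Tokens without an id are skipped, so the call is read-only without allow_update.
        bow = [(self.token2id[t], c) for t, c in counts.items() if t in self.token2id]
        return sorted(bow)


d = TinyDictionary()
print(d.doc2bow(["x", "+", "x", "="], allow_update=True))  # [(0, 2), (1, 1), (2, 1)]
print(d.doc2bow(["x", "y"]))                               # [(0, 1)]
```

The second call leaves token2id and dfs untouched, which is the read-only case described above.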
- class EduNLP.Vector.gensim_vec.TfidfLoader(filepath)[source]¶
This module implements functionality related to the Term Frequency - Inverse Document Frequency (https://en.wikipedia.org/wiki/Tf%E2%80%93idf) vector space bag-of-words models.
- property vector_size¶
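As a worked illustration of TF-IDF weighting, the sketch below uses one common variant: raw term count multiplied by log2(N / df), with no normalization. The function name tfidf and this choice of formula are assumptions made for illustration; the weighting and normalization actually used by TfidfLoader may differ.

```python
import math
from collections import Counter


def tfidf(corpus):
    """Return one {token: weight} dict per document, using count * log2(N / df)."""
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        # Each document contributes at most once to a token's document frequency.
        df.update(set(doc))
    weighted = []
    for doc in corpus:
        tf = Counter(doc)
        weighted.append({t: c * math.log2(n_docs / df[t]) for t, c in tf.items()})
    return weighted


corpus = [["x", "+", "y"], ["x", "=", "1"]]
weights = tfidf(corpus)
print(weights[0])  # {'x': 0.0, '+': 1.0, 'y': 1.0}
```

The token x appears in every document, so its weight is zero, while tokens unique to one document receive the highest weight.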
EduNLP.Vector.embedding¶
- class EduNLP.Vector.embedding.Embedding(w2v: (<class 'EduNLP.Vector.gensim_vec.W2V'>, <class 'tuple'>, <class 'list'>, <class 'dict'>, None), freeze=True, device=None, **kwargs)[source]¶
- indexing(items: List[List[str]], padding=False, indexing=True) -> tuple[source]¶
- Parameters
items (list of list of str(word/token)) –
padding (bool) – whether to pad the returned lists with the default pad_val so that all items have the same length
indexing (bool) –
- Returns
token_idx (list of list of int) – the tokens of each item
token_len (list of int) – the number of tokens in each item
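The indexing-and-padding step can be sketched without the library. Everything here is an assumption made for illustration: the function name index_tokens, the explicit vocab mapping, the pad_val and unk_val defaults, and the choice to report lengths as they were before padding.

```python
def index_tokens(items, vocab, padding=False, pad_val=0, unk_val=1):
    """Map tokens to integer ids and optionally pad every item to the same length."""
    # Unknown tokens fall back to unk_val (an assumption for this sketch).
    token_idx = [[vocab.get(tok, unk_val) for tok in item] for item in items]
    # Lengths are recorded before padding.
    token_len = [len(item) for item in items]
    if padding:
        max_len = max(token_len, default=0)
        token_idx = [row + [pad_val] * (max_len - len(row)) for row in token_idx]
    return token_idx, token_len


vocab = {"x": 2, "+": 3, "y": 4}    # hypothetical vocabulary
items = [["x", "+", "y"], ["x", "z"]]

token_idx, token_len = index_tokens(items, vocab, padding=True)
print(token_idx)  # [[2, 3, 4], [2, 1, 0]]
print(token_len)  # [3, 2]
```

Returning the unpadded lengths alongside the padded ids lets downstream code mask out the padding positions.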