EduNLP
SIF
- EduNLP.SIF.sif.is_sif(item, check_formula=True, return_parser=False)[source]¶
Checks whether the input is in SIF format.
- Parameters
item (str) – a raw item which respects stem
check_formula (bool) –
whether to check the formulas when parsing item.
True: check the validity of formulas in item. False: do not check the validity of formulas in item, which is faster.
return_parser (bool) –
whether to include the parsed item in the return value.
When True, the return format is (bool, Parser); when False, the return format is bool.
- Returns
Raises ValueError when item cannot be parsed correctly; returns True (and the Parser of item) when item is already in standardized format; returns False (and the Parser of item) when item is not in standardized format.
- Return type
bool
Examples
>>> text = '若$x,y$满足约束条件' \
... '$\\left\\{\\begin{array}{c}2 x+y-2 \\leq 0 \\\\ x-y-1 \\geq 0 \\\\ y+1 \\geq 0\\end{array}\\right.$,' \
... '则$z=x+7 y$的最大值$\\SIFUnderline$'
>>> is_sif(text)
True
>>> text = '某校一个课外学习小组为研究某作物的发芽率y和温度x(单位...'
>>> ret = is_sif(text, return_parser=True)
>>> ret
(False, <EduNLP.SIF.parser.parser.Parser object...>)
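Because the return shape of is_sif depends on return_parser, calling code may want to normalize it before branching. The following is a minimal sketch that assumes only the two documented shapes (a bare bool, or a (bool, Parser) pair); unpack_is_sif_result is a hypothetical helper and not part of EduNLP.

```python
def unpack_is_sif_result(result):
    """Normalize an is_sif return value to (is_standard, parser_or_None).

    Hypothetical helper: it only relies on the two documented return shapes.
    """
    if isinstance(result, tuple):
        # return_parser=True: the documented shape is (bool, Parser)
        is_standard, parser = result
        return bool(is_standard), parser
    # return_parser=False: the documented shape is a bare bool
    return bool(result), None
```

The parser obtained this way can then be handed to to_sif(item, parser=...), as the to_sif entry describes.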
- EduNLP.SIF.sif.to_sif(item, check_formula=True, parser: Optional[Parser] = None)[source]¶
Converts an item to SIF format.
- Parameters
item (str) – a raw item which respects stem
check_formula (bool) – whether to check the formulas when parsing item (only takes effect when parser=None).
parser (Parser) – the parser of item returned from is_sif.
- Returns
item – the item which accords with sif format
- Return type
str
Examples
>>> text = '某校一个课外学习小组为研究某作物的发芽率y和温度x(单位...'
>>> siftext = to_sif(text)
>>> siftext
'某校一个课外学习小组为研究某作物的发芽率$y$和温度$x$(单位...'
>>> ret = is_sif(text, return_parser=True)
>>> ret
(False, <EduNLP.SIF.parser.parser.Parser object...>)
>>> to_sif(text, parser=ret[1])
'某校一个课外学习小组为研究某作物的发芽率$y$和温度$x$(单位...'
- EduNLP.SIF.sif.sif4sci(item: str, figures: (<class 'dict'>, <class 'bool'>) = None, mode: int = 2, symbol: str = None, tokenization=True, tokenization_params=None, errors='raise')[source]¶
Default to use linear Tokenizer, change the tokenizer by specifying tokenization_params
- Parameters
item (str) – a raw item which respects stem
figures (dict) – when it is a dict, it means the id-to-instance for figures in ‘FormFigureID{…}’ format, when it is a bool, it means whether to instantiate figures in ‘FormFigureBase64{…}’ format
mode (int) – when mode = 2, use is_sif and check formula in item; when mode = 1, use is_sif but don’t check formula in item; when mode = 0, don’t use is_sif and don’t check anything in item
symbol (str) –
- select the methods to symbolize:
"t": text, "f": formula, "g": figure, "m": question mark, "a": tag, "s": sep
tokenization (bool) – whether to tokenize item after segmentation
tokenization_params –
the dict of text_params, formula_params and figure_params in tokenization
- For formula_params:
method: which tokenizer to be used, "linear" or "ast"
- The parameters only useful for "linear":
skip_figure_formula: whether to skip the formula in figure format
symbolize_figure_formula: whether to symbolize the formula in figure format
- The parameters only useful for "ast":
ord2token: whether to transfer the variables (mathord) and constants (textord) to special tokens
var_numbering: whether to use number suffix to denote different variables
return_type: 'list' or 'ast'
More parameters can be found in the definition in SIF.tokenization.formula
- For figure_params:
figure_instance: whether to return instance of figures in tokens
- For text_params:
See definition in SIF.tokenization.text
errors – warn, raise, coerce, strict, ignore
- Returns
When tokenization is False, return SegmentList; When tokenization is True, return TokenList
- Return type
list
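The three documented values of mode can be read as a small lookup table. This is a sketch of that reading only: checks_for_mode is a hypothetical name, and raising ValueError for other values is an assumption rather than documented behavior.

```python
def checks_for_mode(mode):
    """Return (use_is_sif, check_formula) for a documented sif4sci mode value."""
    table = {
        2: (True, True),    # use is_sif and check formulas
        1: (True, False),   # use is_sif, skip formula checking
        0: (False, False),  # skip is_sif and all checks
    }
    if mode not in table:
        # Assumption: the reference does not say what other values do.
        raise ValueError(f"undocumented mode: {mode!r}")
    return table[mode]
```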
Examples
>>> test_item = r"如图所示,则$\bigtriangleup ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
>>> tl = sif4sci(test_item)
>>> tl
['如图所示', '\\bigtriangleup', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tl.describe()
{'t': 2, 'f': 2, 'g': 1, 'm': 1}
>>> with tl.filter('fgm'):
...     tl
['如图所示', '面积']
>>> with tl.filter(keep='t'):
...     tl
['如图所示', '面积']
>>> with tl.filter():
...     tl
['如图所示', '\\bigtriangleup', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tl.text_tokens
['如图所示', '面积']
>>> tl.formula_tokens
['\\bigtriangleup', 'ABC']
>>> tl.figure_tokens
[\FigureID{1}]
>>> tl.ques_mark_tokens
['\\SIFBlank']
>>> sif4sci(test_item, symbol="gm", tokenization_params={"formula_params": {"method": "ast"}})
['如图所示', <Formula: \bigtriangleup ABC>, '面积', '[MARK]', '[FIGURE]']
>>> sif4sci(test_item, symbol="tfgm")
['[TEXT]', '[FORMULA]', '[TEXT]', '[MARK]', '[TEXT]', '[FIGURE]']
>>> sif4sci(test_item, symbol="gm",
...     tokenization_params={"formula_params": {"method": "ast", "return_type": "list"}})
['如图所示', '\\bigtriangleup', 'A', 'B', 'C', '面积', '[MARK]', '[FIGURE]']
>>> test_item_1 = {
...     "stem": r"若$x=2$, $y=\sqrt{x}$,则下列说法正确的是$\SIFChoice$",
...     "options": [r"$x < y$", r"$y = x$", r"$y < x$"]
... }
>>> tls = [
...     sif4sci(e, symbol="gm",
...     tokenization_params={
...         "formula_params": {
...             "method": "ast", "return_type": "list", "ord2token": True, "var_numbering": True,
...             "link_variable": False}
...     })
...     for e in ([test_item_1["stem"]] + test_item_1["options"])
... ]
>>> tls[1:]
[['mathord_0', '<', 'mathord_1'], ['mathord_0', '=', 'mathord_1'], ['mathord_0', '<', 'mathord_1']]
>>> link_formulas(*tls)
>>> tls[1:]
[['mathord_0', '<', 'mathord_1'], ['mathord_1', '=', 'mathord_0'], ['mathord_1', '<', 'mathord_0']]
>>> from EduNLP.utils import dict2str4sif
>>> test_item_1_str = dict2str4sif(test_item_1, tag_mode="head", add_list_no_tag=False)
>>> test_item_1_str
'$\\SIFTag{stem}$...则下列说法正确的是$\\SIFChoice$$\\SIFTag{options}$$x < y$$\\SIFSep$$y = x$$\\SIFSep$$y < x$'
>>> tl1 = sif4sci(test_item_1_str, symbol="gm",
...     tokenization_params={"formula_params": {"method": "ast", "return_type": "list", "ord2token": True}})
>>> tl1.get_segments()[0]
['\\SIFTag{stem}']
>>> tl1.get_segments()[1:3]
[['[TEXT_BEGIN]', '[TEXT_END]'], ['[FORMULA_BEGIN]', 'mathord', '=', 'textord', '[FORMULA_END]']]
>>> tl1.get_segments(add_seg_type=False)[0:3]
[['\\SIFTag{stem}'], ['mathord', '=', 'textord'], ['mathord', '=', 'mathord', '{ }', '\\sqrt']]
>>> test_item_2 = {"options": [r"$x < y$", r"$y = x$", r"$y < x$"]}
>>> test_item_2
{'options': ['$x < y$', '$y = x$', '$y < x$']}
>>> test_item_2_str = dict2str4sif(test_item_2, tag_mode="head", add_list_no_tag=False)
>>> test_item_2_str
'$\\SIFTag{options}$$x < y$$\\SIFSep$$y = x$$\\SIFSep$$y < x$'
>>> tl2 = sif4sci(test_item_2_str, symbol="gms",
...     tokenization_params={"formula_params": {"method": "ast", "return_type": "list"}})
>>> tl2
['\\SIFTag{options}', 'x', '<', 'y', '[SEP]', 'y', '=', 'x', '[SEP]', 'y', '<', 'x']
>>> tl2.get_segments(add_seg_type=False)
[['\\SIFTag{options}'], ['x', '<', 'y'], ['[SEP]'], ['y', '=', 'x'], ['[SEP]'], ['y', '<', 'x']]
>>> tl2.get_segments(add_seg_type=False, drop="s")
[['\\SIFTag{options}'], ['x', '<', 'y'], ['y', '=', 'x'], ['y', '<', 'x']]
>>> tl3 = sif4sci(test_item_1["stem"], symbol="gs")
>>> tl3.text_segments
[['说法', '正确']]
>>> tl3.formula_segments
[['x', '=', '2'], ['y', '=', '\\sqrt', '{', 'x', '}']]
>>> tl3.figure_segments
[]
>>> tl3.ques_mark_segments
[['\\SIFChoice']]
>>> test_item_3 = r"已知$y=x$,则以下说法中$\textf{正确,b}$的是"
>>> tl4 = sif4sci(test_item_3)
Warning: there is some chinese characters in formula!
>>> tl4.text_segments
[['已知'], ['说法', '中', '正确']]
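The effect of the symbol argument in the examples above can be pictured as replacing whole segments by placeholder tokens. The sketch below is a toy model of that behavior, not EduNLP's implementation: the (type, content) segment representation and the symbolize function are assumptions, while the placeholder strings are the ones that appear in the documented outputs.

```python
# Placeholder tokens as they appear in the documented example outputs.
PLACEHOLDERS = {
    "t": "[TEXT]",
    "f": "[FORMULA]",
    "g": "[FIGURE]",
    "m": "[MARK]",
    "a": "[TAG]",
    "s": "[SEP]",
}


def symbolize(segments, symbol=None):
    """Replace every segment whose type letter occurs in `symbol`.

    `segments` is a hypothetical list of (type_letter, content) pairs.
    """
    selected = symbol or ""
    return [
        PLACEHOLDERS[seg_type] if seg_type in selected else content
        for seg_type, content in segments
    ]


# Hand-segmented version of the first example item (segmentation is assumed).
segments = [
    ("t", "如图所示,则"),
    ("f", "\\bigtriangleup ABC"),
    ("t", "的面积是"),
    ("m", "\\SIFBlank"),
    ("t", "。"),
    ("g", "\\FigureID{1}"),
]
symbolized = symbolize(segments, symbol="tfgm")
```

With symbol="tfgm" every segment is replaced, which reproduces the six-placeholder output shown in the examples.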
EduNLP.Formula
- EduNLP.Formula.ast.get_edges(forest)[source]¶
Construct the edge set.
- Parameters
forest (List[Dict]) – the forest
- Returns
edges – the edge set
- Return type
list of tuple(src,dst,type)
- EduNLP.Formula.ast.ast(formula: (<class 'str'>, typing.List[typing.Dict]), index=0, forest_begin=0, father_tree=None, is_str=False)[source]¶
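The original code author is https://github.com/hxwujinze
- Parameters
formula (str or List[Dict]) – a formula string, or the structure obtained by parsing with katex
index (int) – the position of this subtree in the tree
forest_begin (int) – the starting position of this tree in the forest
father_tree (List[Dict]) – the parent tree
is_str (bool) –
- Returns
tree (List[Dict]) – the feature tree formed by re-parsing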
todo (finish all types)
Notes
Some functions are not supported in katex, e.g.,
- tag
\begin{equation} \tag{tagName} F=ma \end{equation}
\begin{align} \tag{1} y=x+z \end{align}
\tag*{hi} x+y^{2x}
- dddot
\frac{ \dddot y }{ x }
For more information, refer to katex support table
EduNLP.I2V
- class EduNLP.I2V.i2v.I2V(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]¶
This class is only an API base, so you shouldn’t use it directly. To get vectors from items, use a concrete model such as D2V or W2V.
- Parameters
tokenizer (str) – the tokenizer name
t2v (str) – the name of token2vector model
args – the parameters passed to t2v
tokenizer_kwargs (dict) – the parameters passed to tokenizer
pretrained_t2v (bool) –
True: use pretrained t2v model
False: use your own t2v model
kwargs – the parameters passed to t2v
Examples
>>> item = {"如图来自古希腊数学家希波克拉底所研究的几何图形.此图由三个半圆构成,三个半圆的直径分别为直角三角形$ABC$的斜边$BC$,
... 直角边$AB$, $AC$.$\bigtriangleup ABC$的三边所围成的区域记为$I$,黑色部分记为$II$, 其余部分记为$III$.在整个图形中随机取一点,
... 此点取自$I,II,III$的概率分别记为$p_1,p_2,p_3$,则$\SIFChoice$$\FigureID{1}$"}
>>> model_path = "examples/test_model/d2v/test_gensim_luna_stem_tf_d2v_256.bin"
>>> i2v = D2V("text","d2v",filepath=model_path, pretrained_t2v = False)
>>> i2v(item)
([array([ ...dtype=float32)], None)
- Returns
i2v model
- Return type
- tokenize(items, *args, indexing=True, padding=False, key=<function I2V.<lambda>>, **kwargs) list[source]¶
- infer_vector(items, tokenize=True, indexing=False, padding=False, key=<function I2V.<lambda>>, **kwargs) tuple[source]¶
- property vector_size¶
- class EduNLP.I2V.i2v.D2V(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]¶
The model aims to transfer item to vector directly.
Bases: I2V
- Parameters
tokenizer (str) – the tokenizer name
t2v (str) – the name of token2vector model
args – the parameters passed to t2v
tokenizer_kwargs (dict) – the parameters passed to tokenizer
pretrained_t2v (bool) – True: use pretrained t2v model False: use your own t2v model
kwargs – the parameters passed to t2v
Examples
>>> item = {"如图来自古希腊数学家希波克拉底所研究的几何图形.此图由三个半圆构成,三个半圆的直径分别为直角三角形$ABC$的斜边$BC$,
... 直角边$AB$, $AC$.$\bigtriangleup ABC$的三边所围成的区域记为$I$,黑色部分记为$II$, 其余部分记为$III$.在整个图形中随机取一点,
... 此点取自$I,II,III$的概率分别记为$p_1,p_2,p_3$,则$\SIFChoice$$\FigureID{1}$"}
>>> model_path = "examples/test_model/d2v/d2v_test_256/d2v_test_256.bin"
>>> i2v = D2V("text","d2v",filepath=model_path, pretrained_t2v = False)
>>> i2v(item)
([array([ ...dtype=float32)], None)
- Returns
i2v model
- Return type
- infer_vector(items, tokenize=True, indexing=False, padding=False, key=<function D2V.<lambda>>, *args, **kwargs) tuple[source]¶
Converts items to vectors. The model must be loaded before this function is called.
- Parameters
items (str) – the text of question
tokenize (bool) – True: tokenize the item
indexing (bool) –
padding (bool) –
key (lambda function) – the parameter passed to tokenizer, select the text to be processed
args – the parameters passed to t2v
kwargs – the parameters passed to t2v
- Returns
vector
- Return type
list
- class EduNLP.I2V.i2v.W2V(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]¶
The model aims to transfer tokens to vector.
Bases: I2V
- Parameters
tokenizer (str) – the tokenizer name
t2v (str) – the name of token2vector model
args – the parameters passed to t2v
tokenizer_kwargs (dict) – the parameters passed to tokenizer
pretrained_t2v (bool) – True: use pretrained t2v model False: use your own t2v model
kwargs – the parameters passed to t2v
Examples
>>> (); i2v = get_pretrained_i2v("w2v_test_256", "examples/test_model/w2v"); ()
(...)
>>> item_vector, token_vector = i2v(["有学者认为:‘学习’,必须适应实际"])
>>> item_vector
[array([...], dtype=float32)]
- Returns
i2v model
- Return type
- infer_vector(items, tokenize=True, indexing=False, padding=False, key=<function W2V.<lambda>>, *args, **kwargs) tuple[source]¶
Converts items to vectors. The model must be loaded before this function is called.
- Parameters
items (str) – the text of question
tokenize (bool) – True: tokenize the item
indexing (bool) –
padding (bool) –
key (lambda function) – the parameter passed to tokenizer, select the text to be processed
args – the parameters passed to t2v
kwargs – the parameters passed to t2v
- Returns
vector
- Return type
list
- class EduNLP.I2V.i2v.Elmo(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]¶
The model aims to transfer item and tokens to vector with Elmo.
Bases: I2V
- Parameters
tokenizer (str) – the tokenizer name
t2v (str) – the name of token2vector model
args – the parameters passed to t2v
tokenizer_kwargs (dict) – the parameters passed to tokenizer
pretrained_t2v (bool) – True: use pretrained t2v model False: use your own t2v model
kwargs – the parameters passed to t2v
- Returns
i2v model
- Return type
- infer_vector(items, tokenize=True, return_tensors='pt', *args, **kwargs) tuple[source]¶
Converts items to vectors. The model must be loaded before this function is called.
- Parameters
items (str or list) – the text of question
tokenize (bool) – True: tokenize the item
return_tensors (str) – tensor type used in tokenizer
args – the parameters passed to t2v
kwargs – the parameters passed to t2v
- Returns
vector
- Return type
list
- class EduNLP.I2V.i2v.Bert(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]¶
The model aims to transfer item and tokens to vector with Bert.
Bases: I2V
- Parameters
tokenizer (str) – the tokenizer name
t2v (str) – the name of token2vector model
args – the parameters passed to t2v
tokenizer_kwargs (dict) – the parameters passed to tokenizer
pretrained_t2v (bool) – True: use pretrained t2v model False: use your own t2v model
kwargs – the parameters passed to t2v
- Returns
i2v model
- Return type
- infer_vector(items, tokenize=True, return_tensors='pt', *args, **kwargs) tuple[source]¶
Converts items to vectors. The model must be loaded before this function is called.
- Parameters
items (str or list) – the text of question
tokenize (bool) – True: tokenize the item
return_tensors (str) – tensor type used in tokenizer
args – the parameters passed to t2v
kwargs – the parameters passed to t2v
- Returns
vector
- Return type
list
- class EduNLP.I2V.i2v.DisenQ(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]¶
The model aims to transfer item and tokens to vector with DisenQ.
Bases: I2V
- Parameters
tokenizer (str) – the tokenizer name
t2v (str) – the name of token2vector model
args – the parameters passed to t2v
tokenizer_kwargs (dict) – the parameters passed to tokenizer
pretrained_t2v (bool) – True: use pretrained t2v model False: use your own t2v model
kwargs – the parameters passed to t2v
- Returns
i2v model
- Return type
- infer_vector(items: (<class 'dict'>, <class 'list'>), tokenize=True, key=<function DisenQ.<lambda>>, vector_type=None, **kwargs) tuple[source]¶
Converts items to vectors. The model must be loaded before this function is called.
- Parameters
items (dict or list) – the item of question
tokenize (bool) – True: tokenize the item
key (lambda function) – the parameter passed to tokenizer, select the text to be processed
args – the parameters passed to t2v
kwargs – the parameters passed to t2v
- Returns
vector
- Return type
list
- class EduNLP.I2V.i2v.QuesNet(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]¶
The model aims to transfer item and tokens to vector with quesnet.
Bases: I2V
- infer_vector(item, tokenize=True, key=<function QuesNet.<lambda>>, meta=['know_name'], *args, **kwargs)[source]¶
Converts items to vectors. The model must be loaded before this function is called.
- Parameters
item (str or dict or list) – the item of question, or question list
tokenize (bool, optional) – True: tokenize the item
key (optional) – by default lambda x: x
meta (list, optional) – meta information, by default [‘know_name’]
args – the parameters passed to t2v
kwargs – the parameters passed to t2v
- Returns
token embeddings
question embedding
- EduNLP.I2V.i2v.get_pretrained_i2v(name, model_dir='/home/docs/.EduNLP/model')[source]¶
A convenient entry point if you want to convert items to vectors easily.
- Parameters
name (str) – the name of item2vector model, e.g.: d2v_math_300, w2v_math_300, elmo_math_2048, bert_math_768, bert_taledu_768, disenq_math_256, quesnet_math_512
model_dir (str) – the path of model, default: MODEL_DIR = ‘~/.EduNLP/model’
- Returns
i2v model
- Return type
Examples
>>> item = {"如图来自古希腊数学家希波克拉底所研究的几何图形.此图由三个半圆构成,三个半圆的直径分别为直角三角形$ABC$的斜边$BC$,
... 直角边$AB$, $AC$.$\bigtriangleup ABC$的三边所围成的区域记为$I$,黑色部分记为$II$, 其余部分记为$III$.在整个图形中随机取一点,
... 此点取自$I,II,III$的概率分别记为$p_1,p_2,p_3$,则$\SIFChoice$$\FigureID{1}$"}
>>> (); i2v = get_pretrained_i2v("d2v_test_256", "examples/test_model/d2v"); ()
(...)
>>> print(i2v(item))
([array([ ...dtype=float32)], None)
EduNLP.Pretrain
- EduNLP.Pretrain.train_vector(items, w2v_prefix, embedding_dim=None, method='sg', binary=None, train_params=None)[source]¶
- Parameters
items (str) – the text of question
w2v_prefix –
embedding_dim (int) – vector_size
method (str) – the method of training, e.g.: sg, cbow, fasttext, d2v, bow, tfidf
binary (model format) – True:bin; False:kv
train_params (dict) – the training parameters passed to model
- Returns
tokenizer
- Return type
Examples
>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{1}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> train_vector(token_item[:10], "examples/test_model/w2v/gensim_luna_stem_t_", 100)
'examples/test_model/w2v/gensim_luna_stem_t_sg_100.kv'
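The path returned in the example suggests a naming pattern of prefix, method, embedding dimension and a format suffix. The sketch below only encodes that observation; expected_vector_path is a hypothetical helper, the pattern is inferred from this single example, and how binary=None is resolved per method is not covered.

```python
def expected_vector_path(w2v_prefix, method, embedding_dim, binary):
    """Guess the output path of train_vector from its arguments.

    Assumption: the pattern is inferred from one documented example only.
    """
    # Documented mapping for `binary`: True -> bin, False -> kv
    suffix = "bin" if binary else "kv"
    return f"{w2v_prefix}{method}_{embedding_dim}.{suffix}"


path = expected_vector_path(
    "examples/test_model/w2v/gensim_luna_stem_t_", "sg", 100, binary=False
)
```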
- class EduNLP.Pretrain.GensimWordTokenizer(symbol='gm', general=False)[source]¶
- Parameters
symbol (str) –
- select the methods to symbolize:
"t": text, "f": formula, "g": figure, "m": question mark, "a": tag, "s": sep,
e.g.: gm, fgm, gmas, fgmas
general (bool) –
True: use when the item isn’t in standard format and formulas (except formulas in figure) should be tokenized linearly.
False: use the ‘ast’ method to tokenize formulas instead of ‘linear’.
- Returns
tokenizer
- Return type
Examples
>>> tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
>>> tokenizer = GensimWordTokenizer(symbol="fgmas", general=False)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', '[FORMULA]', '约束条件', '公式', '[FORMULA]', '[SEP]', '[FORMULA]']
- class EduNLP.Pretrain.GensimSegTokenizer(symbol='gms', depth=None, flatten=False, **kwargs)[source]¶
- Parameters
symbol (str) –
- select the methods to symbolize:
"t": text, "f": formula, "g": figure, "m": question mark, "a": tag, "s": sep,
e.g. gms, fgm
depth (int or None) – 0: only separate at SIFSep ; 1: only separate at SIFTag ; 2: separate at SIFTag and SIFSep ; otherwise, separate all segments ;
- Returns
tokenizer
- Return type
Examples
>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{1}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> tokenizer = GensimSegTokenizer(symbol="fgm", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], ['[FORMULA]'], ['如图'], ['[FIGURE]'], ['[FORMULA]'],...['[FORMULA]'], ['最大值'], ['[MARK]']]
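One way to read the depth parameter is as a choice of which marker segments act as group boundaries. The sketch below is a toy model under that reading, not the library's code: the (type, tokens) representation, the group_segments name, and the decision to drop boundary markers from the output are all assumptions.

```python
def group_segments(segments, depth=None):
    """Group (type_letter, tokens) pairs according to a depth value.

    Type letters follow the symbol table: "a" = tag (SIFTag), "s" = sep (SIFSep).
    Assumption: boundary markers themselves are dropped from the result.
    """
    if depth == 0:
        boundaries = {"s"}          # only separate at SIFSep
    elif depth == 1:
        boundaries = {"a"}          # only separate at SIFTag
    elif depth == 2:
        boundaries = {"a", "s"}     # separate at SIFTag and SIFSep
    else:
        # otherwise: every segment stays on its own
        return [list(tokens) for _, tokens in segments]

    groups, current = [], []
    for seg_type, tokens in segments:
        if seg_type in boundaries:
            if current:
                groups.append(current)
            current = []
        else:
            current.extend(tokens)
    if current:
        groups.append(current)
    return groups


demo = [
    ("a", ["\\SIFTag{options}"]),
    ("f", ["x", "<", "y"]),
    ("s", ["[SEP]"]),
    ("f", ["y", "=", "x"]),
]
```

For the demo list, depth=2 yields the two formulas as separate groups, while depth=None keeps all four segments apart.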
- class EduNLP.Pretrain.ElmoTokenizer(path: Optional[str] = None)[source]¶
Examples
>>> t=ElmoTokenizer()
>>> items = ["有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"]
>>> t.tokenize(items[0])
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']
>>> len(t)
18
- class EduNLP.Pretrain.ElmoDataset(texts: list, tokenizer: ElmoTokenizer, max_length=128)[source]¶
- EduNLP.Pretrain.train_elmo(texts: list, output_dir: str, pretrained_dir: Optional[str] = None, emb_dim=512, hid_dim=512, batch_size=2, epochs=3, lr: float = 0.0005, device=None)[source]¶
- Parameters
texts (list, required) – The training corpus of shape (text_num, token_num), a text must be tokenized into tokens
output_dir (str, required) – The directory to save trained model files
pretrained_dir (str, optional) – The pretrained model files’ directory
emb_dim (int, optional, default=512) – The embedding dim
hid_dim (int, optional, default=512) – The hidden dim
batch_size (int, optional, default=2) – The training batch size
epochs (int, optional, default=3) – The training epochs
lr (float, optional, default=5e-4) – The learning rate
device (str, optional) – Default is ‘cuda’ if available, otherwise ‘cpu’
- Returns
output_dir – The directory that trained model files are saved
- Return type
str
- class EduNLP.Pretrain.BertTokenizer(pretrain_model='bert-base-chinese', add_special_tokens=False, text_tokenizer=None)[source]¶
- Parameters
pretrain_model – used pretrained model
add_special_tokens – Whether to add tokens like [FIGURE], [TAG], etc.
text_tokenizer – Which text tokenizer to use. Must be consistent with TOKENIZER dictionary.
Examples
>>> tokenizer = BertTokenizer(add_special_tokens=True)
>>> item = "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"
>>> token_item = tokenizer(item)
>>> print(token_item.input_ids[:10])
[101, 1062, 2466, 1963, 1745, 21129, 166, 117, 167, 5276]
>>> print(tokenizer.tokenize(item)[:10])
['公', '式', '如', '图', '[FIGURE]', 'x', ',', 'y', '约', '束']
>>> items = [item, item]
>>> token_items = tokenizer(items, return_tensors='pt')
>>> print(token_items.input_ids.shape)
torch.Size([2, 27])
>>> print(len(tokenizer.tokenize(items)))
2
>>> tokenizer.save_pretrained('test_dir')
>>> tokenizer = BertTokenizer.from_pretrained('test_dir')
- EduNLP.Pretrain.finetune_bert(items, output_dir, pretrain_model='bert-base-chinese', train_params=None)[source]¶
- Parameters
items (dict) – the tokenization results of questions
output_dir (str) – the path to save the model
pretrain_model (str) – the name or path of pre-trained model
train_params (dict) – the training parameters passed to Trainer
Examples
>>> tokenizer = BertTokenizer()
>>> stems = ["有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$",
... "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$"]
>>> token_item = [tokenizer(i) for i in stems]
>>> print(token_item[0].keys())
dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
>>> finetune_bert(token_item, "examples/test_model/data/data/bert")
{'train_runtime': ..., ..., 'epoch': 1.0}
- class EduNLP.Pretrain.QuesNetTokenizer(img_dir=None, vocab_path=None, max_length=250, meta=None, img_token='<img>', unk_token='<unk>', pad_token='<pad>', *args, **argv)[source]¶
Examples
>>> tokenizer = QuesNetTokenizer(meta=['knowledge'])
>>> test_items = [{"ques_content": "$\triangle A B C$ 的内角为 $A, \quad B, $\FigureID{test_id}$",
... "knowledge": "['*', '-', '/']"}, {"ques_content": "$\triangle A B C$ 的内角为 $A, \quad B",
... "knowledge": "['*', '-', '/']"}]
>>> tokenizer.set_vocab(test_items,
...     trim_min_count=1, key=lambda x: x["ques_content"], silent=True)
>>> token_items = [tokenizer(i, key=lambda x: x["ques_content"]) for i in test_items]
>>> print(token_items[0].keys())
dict_keys(['content_idx', 'meta_idx'])
>>> token_items = tokenizer(test_items, key=lambda x: x["ques_content"])
>>> print(len(token_items["content_idx"]))
2
- tokenize(item: ~typing.Union[str, dict, list], key=<function QuesNetTokenizer.<lambda>>, *args, **kwargs)[source]¶
- load_vocab(path)[source]¶
- Parameters
path (str) – path of the vocabulary files; it must be a directory containing word.txt (meta.txt is optional)
- set_vocab(items: list, key=<function QuesNetTokenizer.<lambda>>, trim_min_count=50, silent=True)[source]¶
- Parameters
items (list) – can be the list of str, or list of dict
key (function) – determine how to get the text of each item
trim_min_count –
silent –
- save_vocab(save_vocab_path)[source]¶
- Parameters
save_vocab_path (str) – path to save word vocabulary and meta vocabulary
- classmethod from_pretrained(tokenizer_config_dir, img_dir=None)[source]¶
- tokenizer_config_dir: str
must contain tokenizer_config.json and vocab/word.txt vocab/meta_{meta_name}.txt
- img_dir: str
default None the path of image directory
- save_pretrained(tokenizer_config_dir)[source]¶
- tokenizer_config_dir: str
save tokenizer params in tokenizer_config.json and save words in vocab.list
- property vocab_size¶
- EduNLP.Pretrain.pretrain_quesnet(path, output_dir, tokenizer, save_embs=False, train_params=None)[source]¶
pretrain quesnet
- Parameters
path (str) – path of question file
output_dir (str) – output path
tokenizer (QuesNetTokenizer) – quesnet tokenizer
save_embs (bool, optional) – whether to save pretrained word/image/meta embeddings separately
train_params (dict, optional) –
the training parameters and model parameters, by default None
- "n_epochs": int, default = 1
train param, number of epochs
- "batch_size": int, default = 6
train param, batch size
- "lr": float, default = 1e-3
train param, learning rate
- "save_every": int, default = 0
train param, save steps interval
- "log_steps": int, default = 10
train param, log steps interval
- "device": str, default = ‘cpu’
train param, ‘cpu’ or ‘cuda’
- "max_steps": int, default = 0
train param, stop training when reach max steps
- "emb_size": int, default = 256
model param, the embedding size of word, figure, meta info
- "feat_size": int, default = 256
model param, the size of question infer vector
Examples
>>> tokenizer = QuesNetTokenizer(meta=['know_name'])
>>> items = [{"ques_content": "若复数$z=1+2 i+i^{3}$,则$|z|=$,$\FigureID{000004d6-0479-11ec-829b-797d5eb43535}$",
... "ques_id": "726cdbec-33a9-11ec-909c-98fa9b625adb",
... "know_name": "['代数', '集合', '集合的相等']"
... }]
>>> tokenizer.set_vocab(items, key=lambda x: x['ques_content'], trim_min_count=1, silent=True)
>>> pretrain_quesnet('./data/quesnet_data.json', './testQuesNet', tokenizer)
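Since train_params is optional and every key has a documented default, a caller can preview the effective configuration by overlaying overrides on those defaults. This is a sketch under the assumption that the library merges the dict in a comparable way; QUESNET_DEFAULTS and resolve_train_params are hypothetical names, and only the default values come from the parameter list above.

```python
# Defaults copied from the documented train_params list of pretrain_quesnet.
QUESNET_DEFAULTS = {
    "n_epochs": 1,
    "batch_size": 6,
    "lr": 1e-3,
    "save_every": 0,
    "log_steps": 10,
    "device": "cpu",
    "max_steps": 0,
    "emb_size": 256,
    "feat_size": 256,
}


def resolve_train_params(train_params=None):
    """Overlay user-supplied values on the documented defaults."""
    return {**QUESNET_DEFAULTS, **(train_params or {})}


effective = resolve_train_params({"n_epochs": 3, "device": "cuda"})
```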
- class EduNLP.Pretrain.Question(id, content, answer, false_options, labels)¶
- property answer¶
Alias for field number 2
- property content¶
Alias for field number 1
- property false_options¶
Alias for field number 3
- property id¶
Alias for field number 0
- property labels¶
Alias for field number 4
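The "Alias for field number N" descriptions are what Python generates for namedtuple fields, so Question presumably behaves like one. The sketch below builds a local stand-in with the same field order to show positional and named access; it does not import EduNLP, and the field values are made up for illustration.

```python
from collections import namedtuple

# Local stand-in with the documented field order (not the EduNLP class itself).
Question = namedtuple(
    "Question", ["id", "content", "answer", "false_options", "labels"]
)

q = Question(
    id="q1",
    content="1 + 1 = ?",
    answer="2",
    false_options=["1", "3"],
    labels=["arithmetic"],
)

# Field numbers in the reference correspond to tuple positions.
same_content = q[1] == q.content
```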
- class EduNLP.Pretrain.DisenQTokenizer(vocab_path=None, max_length=250, tokenize_method='space', num_token='<num>', unk_token='<unk>', pad_token='<pad>', *args, **argv)[source]¶
Examples
>>> tokenizer = DisenQTokenizer()
>>> test_items = [{
...     "content": "甲 数 除以 乙 数 的 商 是 1.5 , 如果 甲 数 增加 20 , 则 甲 数 是 乙 的 4 倍 . 原来 甲 数 = .",
...     "knowledge": ["*", "-", "/"], "difficulty": 0.2, "length": 7}]
>>> tokenizer.set_vocab(test_items,
...     trim_min_count=1, key=lambda x: x["content"], silent=True)
>>> token_items = [tokenizer(i, key=lambda x: x["content"]) for i in test_items]
>>> print(token_items[0].keys())
dict_keys(['content_idx', 'content_len'])
- tokenize(items: (<class 'list'>, <class 'str'>, <class 'dict'>), key=<function DisenQTokenizer.<lambda>>, **kwargs)[source]¶
- Parameters
items (list or str or dict) – the question items
key (function) – determine how to get the text of each item
- Returns
tokens – the token of items
- Return type
list
- set_vocab(items: list, key=<function DisenQTokenizer.<lambda>>, trim_min_count=1, silent=True)[source]¶
- Parameters
items (list) – can be the list of str, or list of dict
key (function) – determine how to get the text of each item
- classmethod from_pretrained(tokenizer_config_dir)[source]¶
- tokenizer_config_dir: str
must contain tokenizer_config.json and vocab.list
- save_pretrained(tokenizer_config_dir)[source]¶
- tokenizer_config_dir: str
save tokenizer params in tokenizer_config.json and save words in vocab.list
- property vocab_size¶
- EduNLP.Pretrain.train_disenqnet(train_items, disen_tokenizer, output_dir, predata_dir, train_params=None, test_items=None, silent=False, data_formation=None)[source]¶
- Parameters
train_items (list) – the raw train question list
disen_tokenizer (DisenQTokenizer) – the initial DisenQTokenizer use for training.
output_dir (str) – the path to save the model
predata_dir (str) – the dirname to load or save predata (including wv.th, vocab.list and concept.list)
train_params (dict, defaults to None) –
the training parameters for data, model and trainer.
- "trim_min": int
data param, the trim_min_count for vocab and word2vec, by default 2
- "w2v_workers": int
data param, the number of workers for word2vec, by default 1
- "hidden": int
model param, by default 128
- "dropout": float
model param, dropout rate, by default 0.2
- "pos_weight": int
model param, positive sample weight in unbalanced multi-label concept classifier, by default 1
- "cp": float
model param, weight of concept loss, by default 1.5
- "mi": float
model param, weight of mutual information loss, by default 1.0
- "dis": float
model param, weight of disentangling loss, by default 2.0
- "epoch": int
train param, number of epoch, by default 1
- "batch": int
train param, batch size, by default 64
- "lr": float
train param, learning rate, by default 1e-3
- "step": int
train param, step_size for StepLR, by default 20
- "gamma": float
train param, gamma for StepLR, by default 0.5
- "warm_up": int
train param, number of epoch for warming up, by default 1
- "adv": int
train param, ratio of disc/enc training for adversarial process, by default 10
- "device": str
train param, ‘cpu’ or ‘cuda’, by default “cpu”
test_items (list, defaults to None) – the raw test question list, default is None
silent (bool, defaults to False) – whether to print processing information
data_formation (dict, defaults to None) – Mapping “content” and “knowledge” for the item formation. For example, {“content”: “ques_content”, “knowledge”: “know_name”}
Examples
>>> train_data = load_items("static/test_data/disenq_train.json")[:100]
>>> test_data = load_items("static/test_data/disenq_test.json")[:100]
>>> tokenizer = DisenQTokenizer(max_length=250, tokenize_method="space")
>>> train_disenqnet(train_data, tokenizer,
...     "examples/test_model/disenq","examples/test_model/disenq", silent=True)
EduNLP.Tokenizer
- class EduNLP.Tokenizer.PureTextTokenizer(*args, **kwargs)[source]¶
Deals with text and plain-text formulas, and filters out special formulas such as $\FormFigureID{…}$ and $\FormFigureBase64{…}$.
- Parameters
items (str) –
key –
args –
kwargs –
- Return type
token
Examples
>>> tokenizer = PureTextTokenizer()
>>> items = ["有公式$\\FormFigureID{1}$,如图$\\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\\FormFigureBase64{2}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$"]
>>> tokens = tokenizer(items)
>>> next(tokens)[:10]
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z']
>>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
>>> tokens = tokenizer(items)
>>> next(tokens)
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<', '0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',', '\\quad', 'A', '\\cap', 'B', '=']
>>> items = [{
...     "stem": "已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$",
...     "options": ["1", "2"]
... }]
>>> tokens = tokenizer(items, key=lambda x: x["stem"])
>>> next(tokens)
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<', '0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',', '\\quad', 'A', '\\cap', 'B', '=']
- class EduNLP.Tokenizer.TextTokenizer(*args, **kwargs)[source]¶
Deals with text and formulas, including special formulas.
- Parameters
items (str) –
key –
args –
kwargs –
- Return type
token
Examples
>>> tokenizer = TextTokenizer()
>>> items = ["有公式$\\FormFigureID{1}$,如图$\\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\\FormFigureBase64{2}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$"]
>>> tokens = tokenizer(items)
>>> next(tokens)[:10]
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
>>> items = ["$\\SIFTag{stem_begin}$若复数$z=1+2 i+i^{3}$,则$|z|=$$\\SIFTag{stem_end}$\
... $\\SIFTag{options_begin}$$\\SIFTag{list_0}$0$\\SIFTag{list_1}$1$\\SIFTag{list_2}$$\\sqrt{2}$\
... $\\SIFTag{list_3}$2$\\SIFTag{options_end}$"]
>>> tokens = tokenizer(items)
>>> next(tokens)[:10]
['[TAG]', '复数', 'z', '=', '1', '+', '2', 'i', '+', 'i']
- EduNLP.Tokenizer.get_tokenizer(name, *args, **kwargs)[source]¶
A unified interface for obtaining the different tokenizers.
- Parameters
name (str) – the name of the tokenizer, e.g. text, pure_text.
args – the positional parameters passed to the tokenizer
kwargs – the keyword parameters passed to the tokenizer
- Returns
tokenizer
- Return type
Examples
>>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
>>> tokenizer = get_tokenizer("text")
>>> tokens = tokenizer(items)
>>> next(tokens)
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<', '0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',', '\\quad', 'A', '\\cap', 'B', '=']
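The tokenizer examples above share two usage patterns: a tokenizer is looked up by name, and calling it returns a lazy iterator whose optional key function extracts the text from each item. The sketch below imitates both patterns with toy classes. All names in it (WhitespaceTokenizer, LowercaseTokenizer, TOY_TOKENIZERS, get_toy_tokenizer) are hypothetical, and it makes no claim about how EduNLP implements get_tokenizer.

```python
class WhitespaceTokenizer:
    """Hypothetical toy tokenizer: splits each item on whitespace."""

    def __call__(self, items, key=lambda x: x):
        # Yield one token list per item, lazily, so callers can use next().
        for item in items:
            yield key(item).split()


class LowercaseTokenizer(WhitespaceTokenizer):
    """Hypothetical variant that also lowercases every token."""

    def __call__(self, items, key=lambda x: x):
        for tokens in super().__call__(items, key=key):
            yield [token.lower() for token in tokens]


# Name-to-class registry; the names are made up for this sketch.
TOY_TOKENIZERS = {"whitespace": WhitespaceTokenizer, "lower": LowercaseTokenizer}


def get_toy_tokenizer(name, *args, **kwargs):
    """Look up a tokenizer class by name and instantiate it."""
    if name not in TOY_TOKENIZERS:
        raise KeyError("unknown tokenizer %r, choose from %s" % (name, sorted(TOY_TOKENIZERS)))
    return TOY_TOKENIZERS[name](*args, **kwargs)


items = [{"stem": "Find X when 2 X = 4", "options": ["1", "2"]}]
tokens = get_toy_tokenizer("lower")(items, key=lambda x: x["stem"])
first = next(tokens)
print(first)
```

Returning a generator keeps memory use flat on large item collections, which is why the examples above call next(tokens) rather than indexing into a list.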
Vector¶
- class EduNLP.Vector.W2V(filepath, method=None, binary=None)[source]¶
Uses the gensim library, which provides the FastText, Word2Vec and KeyedVectors methods, to convert words to vectors.
- Parameters
filepath – path to the pretrained model file
method (str) – fasttext, or other (Word2Vec)
binary (bool) –
- property vectors¶
- property vector_size¶
- class EduNLP.Vector.D2V(filepath, method='d2v')[source]¶
A collection of methods that includes d2v, bow and tfidf.
- Parameters
filepath –
method (str) – one of d2v, bow, tfidf
item –
- Returns
d2v model
- Return type
- property vector_size¶
- class EduNLP.Vector.BowLoader(filepath)[source]¶
Uses the doc2bow model to turn documents into bag-of-words vectors.
Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.
If allow_update is set, then also update dictionary in the process: create ids for new words. At the same time, update document frequencies – for each word appearing in this document, increase its document frequency (self.dfs) by one.
If allow_update is not set, this function is const, aka read-only.
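The doc2bow behaviour described above can be sketched in a few lines of plain Python. This is an illustration only, not gensim's implementation: the ToyDictionary class is hypothetical, and it assigns ids in order of first appearance, which may differ from the ids a real gensim dictionary assigns.

```python
from collections import Counter


class ToyDictionary:
    """Hypothetical stand-in for a token dictionary with a doc2bow method."""

    def __init__(self):
        self.token2id = {}  # token -> integer id
        self.dfs = {}       # token id -> number of documents containing it

    def doc2bow(self, document, allow_update=False):
        """Convert a list of tokens into sorted (token_id, token_count) 2-tuples."""
        counts = Counter(document)
        if allow_update:
            for token in counts:
                # Create ids for new words.
                if token not in self.token2id:
                    self.token2id[token] = len(self.token2id)
                # Each distinct word in this document raises its document frequency by one.
                token_id = self.token2id[token]
                self.dfs[token_id] = self.dfs.get(token_id, 0) + 1
        # Words missing from the dictionary are dropped from the result.
        return sorted(
            (self.token2id[token], count)
            for token, count in counts.items()
            if token in self.token2id
        )


dictionary = ToyDictionary()
bow_first = dictionary.doc2bow(["x", "+", "x", "=", "2"], allow_update=True)
bow_second = dictionary.doc2bow(["x", "y"])  # read-only: "y" is unknown and ignored
print(bow_first, bow_second)
```

In the read-only call the dictionary is left untouched, which matches the statement that the function is const when allow_update is not set.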
- property vector_size¶
- class EduNLP.Vector.TfidfLoader(filepath)[source]¶
This module implements functionality related to the Term Frequency - Inverse Document Frequency <https://en.wikipedia.org/wiki/Tf%E2%80%93idf> vector space bag-of-words models.
- property vector_size¶
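As a rough illustration of the TF-IDF idea behind TfidfLoader, the sketch below weights each token by its raw count times log2(N / df), where N is the number of documents and df the number of documents containing the token. The weighting scheme and the function name tfidf_vectors are assumptions made for this example. The weighting and normalization that gensim actually applies may differ.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {token: weight} dict per document (hypothetical helper)."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each token occur?
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)  # raw term counts inside this document
        vectors.append(
            {token: count * math.log2(n_docs / df[token]) for token, count in tf.items()}
        )
    return vectors


docs = [["x", "+", "y"], ["x", "-", "1"]]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

A token that appears in every document, such as "x" here, gets weight zero, while tokens unique to one document keep a positive weight.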
- class EduNLP.Vector.RNNModel(rnn_type, w2v: (<class 'EduNLP.Vector.gensim_vec.W2V'>, <class 'tuple'>, <class 'list'>, <class 'dict'>, None), hidden_size, freeze_pretrained=True, model_params=None, device=None, **kwargs)[source]¶
Examples
>>> model = RNNModel("BiLSTM", None, 2, vocab_size=4, embedding_dim=3)
>>> seq_idx = [[1, 2, 3], [1, 2, 0], [3, 0, 0]]
>>> output, hn = model(seq_idx, indexing=False, padding=False)
>>> seq_idx = [[1, 2, 3], [1, 2], [3]]
>>> output, hn = model(seq_idx, indexing=False, padding=True)
>>> output.shape
torch.Size([3, 3, 4])
>>> hn.shape
torch.Size([2, 3, 2])
>>> tokens = model.infer_tokens(seq_idx, indexing=False)
>>> tokens.shape
torch.Size([3, 3, 4])
>>> tokens = model.infer_tokens(seq_idx, agg="mean", indexing=False)
>>> tokens.shape
torch.Size([3, 4])
>>> item = model.infer_vector(seq_idx, indexing=False)
>>> item.shape
torch.Size([3, 4])
>>> item = model.infer_vector(seq_idx, agg="mean", indexing=False)
>>> item.shape
torch.Size([3, 2])
>>> item = model.infer_vector(seq_idx, agg=None, indexing=False)
>>> item.shape
torch.Size([2, 3, 2])
- infer_vector(items, agg: (<class 'int'>, <class 'str'>, None) = -1, indexing=True, padding=True, *args, **kwargs) Tensor[source]¶
- property vector_size: int¶
- property is_frozen¶
- class EduNLP.Vector.T2V(model: str, *args, **kwargs)[source]¶
Converts token lists to vectors. If you already have a model of your own, you can use T2V directly. Otherwise, calling the get_pretrained_t2v function is the easier way to obtain vectors, because it does not require a model of your own.
- Parameters
model (str) – select the model type e.g.: d2v, rnn, lstm, gru, elmo, etc.
Examples
>>> item = [{'ques_content':'有公式$\\FormFigureID{wrong1?}$和公式$\\FormFigureBase64{wrong2?}$,\
... 如图$\\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$'}]
>>> path = "examples/test_model/d2v/d2v_test_256/d2v_test_256.bin"
>>> t2v = T2V('d2v',filepath=path)
>>> print(t2v(item))
[array([...dtype=float32)]
- property vector_size: int¶
- EduNLP.Vector.get_pretrained_t2v(name, model_dir='/home/docs/.EduNLP/model')[source]¶
A convenient way to convert token lists to vectors with a pretrained model.
- Parameters
name (str) – the pretrained model to select, e.g. d2v_math_300, w2v_math_300, elmo_math_2048, bert_math_768, bert_taledu_768, disenq_math_256, quesnet_math_512
model_dir (str) – the path of model, default: MODEL_DIR = ‘~/.EduNLP/model’
- Returns
t2v model
- Return type
Examples
>>> item = [{'ques_content':'有公式$\\FormFigureID{wrong1?}$和公式$\\FormFigureBase64{wrong2?}$,\
... 如图$\\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$'}]
>>> i2v = get_pretrained_t2v("d2v_test_256", "examples/test_model/d2v")
>>> print(i2v(item))
[array([...dtype=float32)]
- class EduNLP.Vector.Embedding(w2v: (<class 'EduNLP.Vector.gensim_vec.W2V'>, <class 'tuple'>, <class 'list'>, <class 'dict'>, None), freeze=True, device=None, **kwargs)[source]¶
- indexing(items: List[List[str]], padding=False, indexing=True) tuple[source]¶
- Parameters
items (list of list of str(word/token)) –
padding (bool) – whether to pad the returned lists with the default pad_val so that every item in items has the same length
indexing (bool) –
- Returns
token_idx (list of list of int) – the list of the tokens of each item
token_len (list of int) – the list of the length of tokens of each item
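The indexing step described above (tokens to integer ids, optional padding, and a list of the original lengths) can be sketched as follows. This is a hedged illustration rather than the library's code: the function index_tokens, the toy vocabulary, and the choice of 0 as the pad value and 1 for unknown tokens are all assumptions.

```python
def index_tokens(items, token2idx, padding=False, pad_val=0, unk_val=1):
    """Map token lists to id lists and report the unpadded length of each item.

    Hypothetical helper; pad_val and unk_val are assumed defaults.
    """
    # Unknown tokens fall back to the assumed unk_val id.
    token_idx = [[token2idx.get(token, unk_val) for token in item] for item in items]
    token_len = [len(seq) for seq in token_idx]
    if padding:
        # Right-pad every sequence to the length of the longest one.
        max_len = max(token_len, default=0)
        token_idx = [seq + [pad_val] * (max_len - len(seq)) for seq in token_idx]
    return token_idx, token_len


vocab = {"[PAD]": 0, "[UNK]": 1, "x": 2, "=": 3, "1": 4}
items = [["x", "=", "1"], ["x"], ["y", "="]]
token_idx, token_len = index_tokens(items, vocab, padding=True)
print(token_idx, token_len)
```

Returning the original lengths alongside the padded ids lets a downstream model tell real tokens from padding.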
- class EduNLP.Vector.BertModel(pretrained_model)[source]¶
Examples
>>> from EduNLP.Pretrain import BertTokenizer
>>> tokenizer = BertTokenizer("bert-base-chinese", add_special_tokens=False)
>>> model = BertModel("bert-base-chinese")
>>> item = ["有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,若$x,y$满足约束",
...     "有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,若$x,y$满足约束"]
>>> inputs = tokenizer(item, return_tensors='pt')
>>> output = model(inputs)
>>> output.shape
torch.Size([2, 14, 768])
>>> tokens = model.infer_tokens(inputs)
>>> tokens.shape
torch.Size([2, 12, 768])
>>> tokens = model.infer_tokens(inputs, return_special_tokens=True)
>>> tokens.shape
torch.Size([2, 14, 768])
>>> item = model.infer_vector(inputs)
>>> item.shape
torch.Size([2, 768])
- property vector_size¶
- class EduNLP.Vector.QuesNetModel(pretrained_dir, tokenizer=None, device='cpu')[source]¶
- infer_vector(items: Union[Question, list]) Tensor[source]¶
Get question embeddings with QuesNet.
- Parameters
items ((Question, list)) – namedtuple, [‘id’, ‘content’, ‘answer’, ‘false_options’, ‘labels’] or a list of Questions
- infer_tokens(items: Union[Question, list]) Tensor[source]¶
Get token embeddings with QuesNet.
- Parameters
items (Question) – namedtuple, [‘id’, ‘content’, ‘answer’, ‘false_options’, ‘labels’] or a list of Questions
- Returns
meta_emb + word_embs
- Return type
torch.Tensor
- property vector_size¶
- class EduNLP.Vector.DisenQModel(pretrained_dir, device='cpu')[source]¶
- infer_vector(items: dict, vector_type=None, **kwargs) Tensor[source]¶
- Parameters
vector_type (str) – the type of item tensor to return. The default, None, returns both (k_hidden, i_hidden); vector_type="k" returns k_hidden; vector_type="i" returns i_hidden.
- property vector_size¶
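The vector_type switch of DisenQModel.infer_vector can be summarised by the small dispatch below. It is a sketch under assumptions, not the library's code: the helper select_hidden is hypothetical, plain lists stand in for tensors, and raising ValueError for any other value is a choice made for this example because the documentation does not say what happens in that case.

```python
def select_hidden(k_hidden, i_hidden, vector_type=None):
    """Pick which hidden representation(s) to return (hypothetical helper)."""
    if vector_type is None:
        # Default: return both representations as a pair.
        return k_hidden, i_hidden
    if vector_type == "k":
        return k_hidden
    if vector_type == "i":
        return i_hidden
    # Assumption: unsupported values are rejected explicitly.
    raise ValueError("vector_type must be None, 'k' or 'i', got %r" % (vector_type,))


k_hidden = [0.1, 0.2]  # stand-in for the k_hidden tensor
i_hidden = [0.3, 0.4]  # stand-in for the i_hidden tensor
both = select_hidden(k_hidden, i_hidden)
only_k = select_hidden(k_hidden, i_hidden, vector_type="k")
only_i = select_hidden(k_hidden, i_hidden, vector_type="i")
print(both, only_k, only_i)
```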