EduNLP¶

SIF¶

EduNLP.SIF.sif.is_sif(item, check_formula=True, return_parser=False)[source]¶

Check whether the input item is in SIF format.

Parameters

item (str) – a raw item representing the question stem
check_formula (bool) –
whether to check the formulas when parsing the item.

True: check the validity of the formulas in the item. False: skip the formula check, which is faster.
return_parser (bool) –
whether to include the parsed item in the return value.

True: the return value is (bool, Parser). False: the return value is a bool.

Returns

True (and the Parser of the item, when return_parser is True) if the item is already in standard SIF format; False (and the Parser of the item, when return_parser is True) if it is not. Raises ValueError if the item cannot be parsed correctly.

Return type

bool

Examples

>>> text = '若$x,y$满足约束条件' \
...        '$\\left\\{\\begin{array}{c}2 x+y-2 \\leq 0 \\\\ x-y-1 \\geq 0 \\\\ y+1 \\geq 0\\end{array}\\right.$，' \
...        '则$z=x+7 y$的最大值$\\SIFUnderline$'
>>> is_sif(text)
True
>>> text = '某校一个课外学习小组为研究某作物的发芽率y和温度x（单位...'
>>> ret = is_sif(text, return_parser=True)
>>> ret 
(False, <EduNLP.SIF.parser.parser.Parser object...>)

EduNLP.SIF.sif.to_sif(item, check_formula=True, parser: Optional[Parser] = None)[source]¶

Convert an item to SIF format.

Parameters

item (str) – a raw item representing the question stem
check_formula (bool) – whether to check the formulas when parsing the item (only takes effect when parser=None).
parser (Parser) – the parser of the item, as returned by is_sif.

Returns

item – the item converted to SIF format

Return type

str

Examples

>>> text = '某校一个课外学习小组为研究某作物的发芽率y和温度x（单位...'
>>> siftext = to_sif(text)
>>> siftext
'某校一个课外学习小组为研究某作物的发芽率$y$和温度$x$（单位...'
>>> ret = is_sif(text, return_parser=True)
>>> ret 
(False, <EduNLP.SIF.parser.parser.Parser object...>)
>>> to_sif(text, parser=ret[1])
'某校一个课外学习小组为研究某作物的发芽率$y$和温度$x$（单位...'
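The check-then-convert pattern in the examples above can be wrapped in a small helper. The following is a minimal sketch, not part of EduNLP: `ensure_sif`, `fake_is_sif` and `fake_to_sif` are hypothetical names, and the two fakes only mimic the documented call signatures so that the sketch runs without the library installed.

```python
from typing import Any, Callable, Tuple


def ensure_sif(
    text: str,
    is_sif: Callable[..., Tuple[bool, Any]],
    to_sif: Callable[..., str],
) -> str:
    """Return `text` unchanged if it is already SIF, otherwise convert it.

    `is_sif` and `to_sif` are injected so the sketch runs without EduNLP;
    in real use they would be the two functions documented above.
    """
    ok, parser = is_sif(text, return_parser=True)
    if ok:
        return text
    # Hand the parser from the check back in, as the to_sif example does.
    return to_sif(text, parser=parser)


# Hypothetical stand-ins that only imitate the documented signatures.
def fake_is_sif(item, check_formula=True, return_parser=False):
    ok = "$" in item
    return (ok, object()) if return_parser else ok


def fake_to_sif(item, check_formula=True, parser=None):
    return item.replace("y", "$y$").replace("x", "$x$")


converted = ensure_sif("x and y", fake_is_sif, fake_to_sif)
```

With the real functions, the call would presumably look like `ensure_sif(text, is_sif, to_sif)`.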

EduNLP.SIF.sif.sif4sci(item: str, figures: (<class 'dict'>, <class 'bool'>) = None, mode: int = 2, symbol: str = None, tokenization=True, tokenization_params=None, errors='raise')[source]¶

Uses the linear tokenizer by default; change the tokenizer by specifying tokenization_params.

Parameters

item (str) – a raw item which respects stem
figures (dict or bool) – when it is a dict, it is the id-to-instance mapping for figures in ‘FormFigureID{…}’ format; when it is a bool, it indicates whether to instantiate figures in ‘FormFigureBase64{…}’ format
mode (int) –
- 2: use is_sif and check the formulas in item
- 1: use is_sif but don’t check the formulas in item
- 0: don’t use is_sif and don’t check anything in item
symbol (str) –

select the segment types to symbolize:
“t”: text, “f”: formula, “g”: figure, “m”: question mark, “a”: tag, “s”: sep
tokenization (bool) – whether to tokenize item after segmentation
tokenization_params –
the dict of text_params, formula_params and figure_params used in tokenization.

For formula_params:
method: which tokenizer to use, “linear” or “ast”

Parameters only used by “linear”:
- skip_figure_formula: whether to skip formulas in figure format
- symbolize_figure_formula: whether to symbolize formulas in figure format

Parameters only used by “ast”:
- ord2token: whether to convert variables (mathord) and constants (textord) to special tokens
- var_numbering: whether to use a numeric suffix to distinguish different variables
- return_type: ‘list’ or ‘ast’

More parameters can be found in the definitions in SIF.tokenization.formula

For figure_params:
figure_instance: whether to return figure instances in tokens

For text_params:
See the definitions in SIF.tokenization.text
errors – one of: warn, raise, coerce, strict, ignore
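Because some formula options only apply to one tokenizer, it can help to build formula_params through a small helper. This is a minimal sketch under stated assumptions: `make_formula_params` is a hypothetical helper, not an EduNLP function, and rejecting mismatched options is a choice made here rather than documented library behaviour. Options not listed above are passed through untouched.

```python
# Option names taken from the parameter description above.
LINEAR_ONLY = {"skip_figure_formula", "symbolize_figure_formula"}
AST_ONLY = {"ord2token", "var_numbering", "return_type"}


def make_formula_params(method="linear", **options):
    """Build a formula_params dict, rejecting options meant for the other method."""
    if method not in ("linear", "ast"):
        raise ValueError(f"unknown method: {method!r}")
    wrong = set(options) & (AST_ONLY if method == "linear" else LINEAR_ONLY)
    if wrong:
        raise ValueError(f"{sorted(wrong)} not used by method {method!r}")
    return {"method": method, **options}


tokenization_params = {
    "formula_params": make_formula_params("ast", return_type="list", ord2token=True)
}
```

The resulting dict has the same shape as the tokenization_params argument used in the sif4sci examples.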

Returns

When tokenization is False, return SegmentList; When tokenization is True, return TokenList

Return type

list

Examples

>>> test_item = r"如图所示，则$\bigtriangleup ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
>>> tl = sif4sci(test_item)
>>> tl
['如图所示', '\\bigtriangleup', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tl.describe()
{'t': 2, 'f': 2, 'g': 1, 'm': 1}
>>> with tl.filter('fgm'):
...     tl
['如图所示', '面积']
>>> with tl.filter(keep='t'):
...     tl
['如图所示', '面积']
>>> with tl.filter():
...     tl
['如图所示', '\\bigtriangleup', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tl.text_tokens
['如图所示', '面积']
>>> tl.formula_tokens
['\\bigtriangleup', 'ABC']
>>> tl.figure_tokens
[\FigureID{1}]
>>> tl.ques_mark_tokens
['\\SIFBlank']
>>> sif4sci(test_item, symbol="gm", tokenization_params={"formula_params": {"method": "ast"}})
['如图所示', <Formula: \bigtriangleup ABC>, '面积', '[MARK]', '[FIGURE]']
>>> sif4sci(test_item, symbol="tfgm")
['[TEXT]', '[FORMULA]', '[TEXT]', '[MARK]', '[TEXT]', '[FIGURE]']
>>> sif4sci(test_item, symbol="gm",
... tokenization_params={"formula_params": {"method": "ast", "return_type": "list"}})
['如图所示', '\\bigtriangleup', 'A', 'B', 'C', '面积', '[MARK]', '[FIGURE]']
>>> test_item_1 = {
...     "stem": r"若$x=2$, $y=\sqrt{x}$，则下列说法正确的是$\SIFChoice$",
...     "options": [r"$x < y$", r"$y = x$", r"$y < x$"]
... }
>>> tls = [
...     sif4sci(e, symbol="gm",
...     tokenization_params={
...         "formula_params": {
...             "method": "ast", "return_type": "list", "ord2token": True, "var_numbering": True,
...             "link_variable": False}
...     })
...     for e in ([test_item_1["stem"]] + test_item_1["options"])
... ]
>>> tls[1:]
[['mathord_0', '<', 'mathord_1'], ['mathord_0', '=', 'mathord_1'], ['mathord_0', '<', 'mathord_1']]
>>> link_formulas(*tls)
>>> tls[1:]
[['mathord_0', '<', 'mathord_1'], ['mathord_1', '=', 'mathord_0'], ['mathord_1', '<', 'mathord_0']]
>>> from EduNLP.utils import dict2str4sif
>>> test_item_1_str = dict2str4sif(test_item_1, tag_mode="head", add_list_no_tag=False)
>>> test_item_1_str  
'$\\SIFTag{stem}$...则下列说法正确的是$\\SIFChoice$$\\SIFTag{options}$$x < y$$\\SIFSep$$y = x$$\\SIFSep$$y < x$'
>>> tl1 = sif4sci(test_item_1_str, symbol="gm",
... tokenization_params={"formula_params": {"method": "ast", "return_type": "list", "ord2token": True}})
>>> tl1.get_segments()[0]
['\\SIFTag{stem}']
>>> tl1.get_segments()[1:3]
[['[TEXT_BEGIN]', '[TEXT_END]'], ['[FORMULA_BEGIN]', 'mathord', '=', 'textord', '[FORMULA_END]']]
>>> tl1.get_segments(add_seg_type=False)[0:3]
[['\\SIFTag{stem}'], ['mathord', '=', 'textord'], ['mathord', '=', 'mathord', '{ }', '\\sqrt']]
>>> test_item_2 = {"options": [r"$x < y$", r"$y = x$", r"$y < x$"]}
>>> test_item_2
{'options': ['$x < y$', '$y = x$', '$y < x$']}
>>> test_item_2_str = dict2str4sif(test_item_2, tag_mode="head", add_list_no_tag=False)
>>> test_item_2_str
'$\\SIFTag{options}$$x < y$$\\SIFSep$$y = x$$\\SIFSep$$y < x$'
>>> tl2 = sif4sci(test_item_2_str, symbol="gms",
... tokenization_params={"formula_params": {"method": "ast", "return_type": "list"}})
>>> tl2
['\\SIFTag{options}', 'x', '<', 'y', '[SEP]', 'y', '=', 'x', '[SEP]', 'y', '<', 'x']
>>> tl2.get_segments(add_seg_type=False)
[['\\SIFTag{options}'], ['x', '<', 'y'], ['[SEP]'], ['y', '=', 'x'], ['[SEP]'], ['y', '<', 'x']]
>>> tl2.get_segments(add_seg_type=False, drop="s")
[['\\SIFTag{options}'], ['x', '<', 'y'], ['y', '=', 'x'], ['y', '<', 'x']]
>>> tl3 = sif4sci(test_item_1["stem"], symbol="gs")
>>> tl3.text_segments
[['说法', '正确']]
>>> tl3.formula_segments
[['x', '=', '2'], ['y', '=', '\\sqrt', '{', 'x', '}']]
>>> tl3.figure_segments
[]
>>> tl3.ques_mark_segments
[['\\SIFChoice']]
>>> test_item_3 = r"已知$y=x$，则以下说法中$\textf{正确,b}$的是"
>>> tl4 = sif4sci(test_item_3)
Warning: there is some chinese characters in formula!
>>> tl4.text_segments
[['已知'], ['说法', '中', '正确']]
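The difference between the two `tls[1:]` outputs in the examples above (before and after `link_formulas`) comes down to whether variable numbers are assigned per formula or shared across formulas. The sketch below illustrates that idea in plain Python; `number_variables` is a hypothetical function, it works on already-split token lists, and it is not how EduNLP implements var_numbering or link_formulas.

```python
def number_variables(tokens, variables, mapping=None):
    """Replace each variable with 'mathord_<n>', numbered by first appearance.

    `mapping` carries numbers across formulas; pass the same dict to several
    calls to keep one variable on one number.
    """
    mapping = {} if mapping is None else mapping
    out = []
    for tok in tokens:
        if tok in variables:
            mapping.setdefault(tok, len(mapping))
            out.append(f"mathord_{mapping[tok]}")
        else:
            out.append(tok)
    return out


options = [["x", "<", "y"], ["y", "=", "x"], ["y", "<", "x"]]
variables = {"x", "y"}

# Numbered independently: every formula starts again from mathord_0.
separate = [number_variables(f, variables) for f in options]

# Numbered with a shared mapping seeded by the stem, where x appears before y.
shared = {}
number_variables(["x", "=", "2", "y", "=", "\\sqrt", "x"], variables, shared)
linked = [number_variables(f, variables, shared) for f in options]
```

Here `separate` reproduces the first `tls[1:]` listing and `linked` the second.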

EduNLP.Formula¶

EduNLP.Formula.ast.str2ast(formula: str, *args, **kwargs)[source]¶: Interface for string input

EduNLP.Formula.ast.get_edges(forest)[source]¶

Build the edge set.

Parameters: forest (List[Dict]) – the forest
Returns: edges – the edge set
Return type: list of tuple(src,dst,type)
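To make the (src, dst, type) return shape concrete, here is a minimal sketch that derives such edges from a flat list of nodes. Everything about the node layout is an assumption: the "father" key and the edge type labels "child" and "father" are hypothetical and may not match the structures that ast produces or the types that get_edges emits.

```python
from typing import Dict, List, Optional, Tuple

Edge = Tuple[int, int, str]


def build_edges(nodes: List[Dict[str, Optional[int]]]) -> List[Edge]:
    """Collect (src, dst, type) edges from a flat list of nodes.

    Each node is a dict with a hypothetical "father" key holding the index
    of its parent, or None for a root.
    """
    edges: List[Edge] = []
    for idx, node in enumerate(nodes):
        father = node["father"]
        if father is not None:
            edges.append((father, idx, "child"))   # parent -> child
            edges.append((idx, father, "father"))  # child -> parent
    return edges


# x + y modelled as a root "+" (index 0) with two children.
nodes = [{"father": None}, {"father": 0}, {"father": 0}]
edges = build_edges(nodes)
```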

EduNLP.Formula.ast.ast(formula: (<class 'str'>, typing.List[typing.Dict]), index=0, forest_begin=0, father_tree=None, is_str=False)[source]¶

The original code was written by https://github.com/hxwujinze

Parameters

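formula (str or List[Dict]) – a formula string, or the structure obtained by parsing it with KaTeX
index (int) – the position of this subtree in the tree
forest_begin (int) – the starting position of this tree in the forest
father_tree (List[Dict]) – the parent tree
is_str (bool) –

Returns

tree (List[Dict]) – the feature tree built by re-parsing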
todo (finish all types)

Notes

Some functions are not supported in KaTeX, e.g.:

tag
- \begin{equation} \tag{tagName} F=ma \end{equation}
- \begin{align} \tag{1} y=x+z \end{align}
- \tag*{hi} x+y^{2x}
dddot
- \frac{ \dddot y }{ x }

For more information, refer to katex support table
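A formula can be screened for the commands listed in the notes above before it is handed to the parser. The following is a minimal sketch: `find_unsupported_commands` is a hypothetical helper, and the pattern covers only `\tag` (with or without `*`) and `\dddot`, the two cases named here.

```python
import re

# Commands named in the notes above; extend the alternation as needed.
_UNSUPPORTED = re.compile(r"\\(tag\*?|dddot)(?![A-Za-z])")


def find_unsupported_commands(formula: str) -> list:
    """Return the sorted, de-duplicated unsupported commands found in `formula`."""
    return sorted({name.rstrip("*") for name in _UNSUPPORTED.findall(formula)})
```

The negative lookahead keeps longer command names that merely start with `tag` or `dddot` from matching.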

EduNLP.Formula.ast.link_variable(forest)[source]¶

Build the forest.

Parameters: forest (List[Dict]) –
Returns: trees
Return type: List[Dict]

EduNLP.Formula.ast.katex_parse(formula)[source]¶: Pass the formula to KaTeX for syntax parsing

EduNLP.I2V¶

class EduNLP.I2V.i2v.I2V(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]¶

This is only a base API, so you shouldn’t use it directly. To get a vector from an item, use a concrete model such as D2V or W2V.

Parameters

tokenizer (str) – the tokenizer name
t2v (str) – the name of token2vector model
args – the parameters passed to t2v
tokenizer_kwargs (dict) – the parameters passed to tokenizer
pretrained_t2v (bool) –
- True: use pretrained t2v model
- False: use your own t2v model
kwargs – the parameters passed to t2v

Examples

>>> item = {"如图来自古希腊数学家希波克拉底所研究的几何图形．此图由三个半圆构成，三个半圆的直径分别为直角三角形$ABC$的斜边$BC$,\
... 直角边$AB$, $AC$.$\bigtriangleup ABC$的三边所围成的区域记为$I$,黑色部分记为$II$, 其余部分记为$III$.在整个图形中随机取一点，\
... 此点取自$I,II,III$的概率分别记为$p_1,p_2,p_3$,则$\SIFChoice$$\FigureID{1}$"}
>>> model_path = "examples/test_model/d2v/test_gensim_luna_stem_tf_d2v_256.bin"
>>> i2v = D2V("text","d2v",filepath=model_path, pretrained_t2v = False)
>>> i2v(item)
([array([ ...dtype=float32)], None)

Returns: i2v model
Return type: I2V
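The class combines a tokenizer with a token-to-vector model and returns an (item vectors, token vectors) pair. The sketch below shows that pipeline shape in plain Python. `ToyI2V` is a hypothetical stand-in, not the EduNLP implementation; its tokenizer and t2v arguments are plain callables rather than the names the real constructor expects.

```python
class ToyI2V:
    """Minimal tokenizer -> t2v pipeline; not the EduNLP implementation."""

    def __init__(self, tokenizer, t2v):
        self.tokenizer = tokenizer  # callable: item -> list of tokens
        self.t2v = t2v              # callable: list of tokens -> vector

    def tokenize(self, items):
        return [self.tokenizer(item) for item in items]

    def infer_vector(self, items, tokenize=True):
        tokens = self.tokenize(items) if tokenize else items
        item_vectors = [self.t2v(toks) for toks in tokens]
        # Like the D2V output above, no per-token vectors are produced here.
        return item_vectors, None

    def __call__(self, items, **kwargs):
        return self.infer_vector(items, **kwargs)


# Hypothetical stand-ins: whitespace tokenizer and a 2-dimensional "embedding".
i2v = ToyI2V(str.split, lambda toks: [len(toks), sum(map(len, toks))])
item_vectors, token_vectors = i2v(["a bb ccc", "dd"])
```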

tokenize(items, *args, indexing=True, padding=False, key=<function I2V.<lambda>>, **kwargs) → list[source]¶

infer_vector(items, tokenize=True, indexing=False, padding=False, key=<function I2V.<lambda>>, **kwargs) → tuple[source]¶

infer_item_vector(tokens, *args, **kwargs) → ...[source]¶

infer_token_vector(tokens, *args, **kwargs) → ...[source]¶

save(config_path)[source]¶

classmethod load(config_path, *args, **kwargs)[source]¶

classmethod from_pretrained(name, model_dir='/home/docs/.EduNLP/model', *args, **kwargs)[source]¶

property vector_size¶

class EduNLP.I2V.i2v.D2V(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]¶

Converts an item to a vector directly.

Bases

I2V

Parameters

tokenizer (str) – the tokenizer name
t2v (str) – the name of token2vector model
args – the parameters passed to t2v
tokenizer_kwargs (dict) – the parameters passed to tokenizer
pretrained_t2v (bool) –
- True: use pretrained t2v model
- False: use your own t2v model
kwargs – the parameters passed to t2v

Examples

>>> item = {"如图来自古希腊数学家希波克拉底所研究的几何图形．此图由三个半圆构成，三个半圆的直径分别为直角三角形$ABC$的斜边$BC$,\
... 直角边$AB$, $AC$.$\bigtriangleup ABC$的三边所围成的区域记为$I$,黑色部分记为$II$, 其余部分记为$III$.在整个图形中随机取一点，\
... 此点取自$I,II,III$的概率分别记为$p_1,p_2,p_3$,则$\SIFChoice$$\FigureID{1}$"}
>>> model_path = "examples/test_model/d2v/d2v_test_256/d2v_test_256.bin"
>>> i2v = D2V("text","d2v",filepath=model_path, pretrained_t2v = False)
>>> i2v(item)
([array([ ...dtype=float32)], None)

Returns: i2v model
Return type: I2V

infer_vector(items, tokenize=True, indexing=False, padding=False, key=<function D2V.<lambda>>, *args, **kwargs) → tuple[source]¶

Convert items to vectors. A model must be loaded before calling this function.

Parameters

items (str) – the text of question
tokenize (bool) – True: tokenize the item
indexing (bool) –
padding (bool) –
key (lambda function) – the parameter passed to tokenizer, select the text to be processed
args – the parameters passed to t2v
kwargs – the parameters passed to t2v

Returns

vector

Return type

list

classmethod from_pretrained(name, model_dir='/home/docs/.EduNLP/model', *args, **kwargs)[source]¶

class EduNLP.I2V.i2v.W2V(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]¶

Converts tokens to vectors.

Bases

I2V

Parameters

tokenizer (str) – the tokenizer name
t2v (str) – the name of token2vector model
args – the parameters passed to t2v
tokenizer_kwargs (dict) – the parameters passed to tokenizer
pretrained_t2v (bool) –
- True: use pretrained t2v model
- False: use your own t2v model
kwargs – the parameters passed to t2v

Examples

>>> (); i2v = get_pretrained_i2v("w2v_test_256", "examples/test_model/w2v"); () 
(...)
>>> item_vector, token_vector = i2v(["有学者认为：‘学习’，必须适应实际"])
>>> item_vector 
[array([...], dtype=float32)]

Returns: i2v model
Return type: W2V

infer_vector(items, tokenize=True, indexing=False, padding=False, key=<function W2V.<lambda>>, *args, **kwargs) → tuple[source]¶

Convert items to vectors. A model must be loaded before calling this function.

Parameters

items (str) – the text of question
tokenize (bool) – True: tokenize the item
indexing (bool) –
padding (bool) –
key (lambda function) – the parameter passed to tokenizer, select the text to be processed
args – the parameters passed to t2v
kwargs – the parameters passed to t2v

Returns

vector

Return type

list

classmethod from_pretrained(name, model_dir='/home/docs/.EduNLP/model', *args, **kwargs)[source]¶

class EduNLP.I2V.i2v.Elmo(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]¶

Converts items and tokens to vectors with Elmo.

Bases

I2V

Parameters

tokenizer (str) – the tokenizer name
t2v (str) – the name of token2vector model
args – the parameters passed to t2v
tokenizer_kwargs (dict) – the parameters passed to tokenizer
pretrained_t2v (bool) –
- True: use pretrained t2v model
- False: use your own t2v model
kwargs – the parameters passed to t2v

Returns

i2v model

Return type

Elmo

infer_vector(items, tokenize=True, return_tensors='pt', *args, **kwargs) → tuple[source]¶

Convert items to vectors. A model must be loaded before calling this function.

Parameters

items (str or list) – the text of question
tokenize (bool) – True: tokenize the item
return_tensors (str) – tensor type used in tokenizer
args – the parameters passed to t2v
kwargs – the parameters passed to t2v

Returns

vector

Return type

list

classmethod from_pretrained(name, model_dir='/home/docs/.EduNLP/model', *args, **kwargs)[source]¶

class EduNLP.I2V.i2v.Bert(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]¶

Converts items and tokens to vectors with Bert.

Bases

I2V

Parameters

tokenizer (str) – the tokenizer name
t2v (str) – the name of token2vector model
args – the parameters passed to t2v
tokenizer_kwargs (dict) – the parameters passed to tokenizer
pretrained_t2v (bool) –
- True: use pretrained t2v model
- False: use your own t2v model
kwargs – the parameters passed to t2v

Returns

i2v model

Return type

Bert

infer_vector(items, tokenize=True, return_tensors='pt', *args, **kwargs) → tuple[source]¶

Convert items to vectors. A model must be loaded before calling this function.

Parameters

items (str or list) – the text of question
tokenize (bool) – True: tokenize the item
return_tensors (str) – tensor type used in tokenizer
args – the parameters passed to t2v
kwargs – the parameters passed to t2v

Returns

vector

Return type

list

classmethod from_pretrained(name, model_dir='/home/docs/.EduNLP/model', *args, **kwargs)[source]¶

class EduNLP.I2V.i2v.DisenQ(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]¶

Converts items and tokens to vectors with DisenQ.

Bases

I2V

Parameters

tokenizer (str) – the tokenizer name
t2v (str) – the name of token2vector model
args – the parameters passed to t2v
tokenizer_kwargs (dict) – the parameters passed to tokenizer
pretrained_t2v (bool) –
- True: use pretrained t2v model
- False: use your own t2v model
kwargs – the parameters passed to t2v

Returns: i2v model
Return type: DisenQ

infer_vector(items: (<class 'dict'>, <class 'list'>), tokenize=True, key=<function DisenQ.<lambda>>, vector_type=None, **kwargs) → tuple[source]¶

Convert items to vectors. A model must be loaded before calling this function.

Parameters

items (dict or list) – the question items
tokenize (bool) – True: tokenize the item
key (lambda function) – the parameter passed to tokenizer, select the text to be processed
args – the parameters passed to t2v
kwargs – the parameters passed to t2v

Returns: vector
Return type: list

classmethod from_pretrained(name, model_dir='/home/docs/.EduNLP/model', **kwargs)[source]¶

class EduNLP.I2V.i2v.QuesNet(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]¶

Converts items and tokens to vectors with QuesNet.

Bases

I2V

infer_vector(item, tokenize=True, key=<function QuesNet.<lambda>>, meta=['know_name'], *args, **kwargs)[source]¶

Convert items to vectors. A model must be loaded before calling this function.

Parameters

item (str or dict or list) – the item of question, or question list
tokenize (bool, optional) – True: tokenize the item
key (function, optional) – by default lambda x: x
meta (list, optional) – meta information, by default [‘know_name’]
args – the parameters passed to t2v
kwargs – the parameters passed to t2v

Returns

token embeddings
question embedding

classmethod from_pretrained(name, model_dir='/home/docs/.EduNLP/model', *args, **kwargs)[source]¶

EduNLP.I2V.i2v.get_pretrained_i2v(name, model_dir='/home/docs/.EduNLP/model')[source]¶

A convenient way to convert items to vectors with a pretrained model.

Parameters

name (str) – the name of the item2vector model, e.g.: d2v_math_300, w2v_math_300, elmo_math_2048, bert_math_768, bert_taledu_768, disenq_math_256, quesnet_math_512
model_dir (str) – the path of model, default: MODEL_DIR = ‘~/.EduNLP/model’

Returns

i2v model

Return type

I2V

Examples

>>> item = {"如图来自古希腊数学家希波克拉底所研究的几何图形．此图由三个半圆构成，三个半圆的直径分别为直角三角形$ABC$的斜边$BC$,\
... 直角边$AB$, $AC$.$\bigtriangleup ABC$的三边所围成的区域记为$I$,黑色部分记为$II$, 其余部分记为$III$.在整个图形中随机取一点，\
... 此点取自$I,II,III$的概率分别记为$p_1,p_2,p_3$,则$\SIFChoice$$\FigureID{1}$"}
>>> (); i2v = get_pretrained_i2v("d2v_test_256", "examples/test_model/d2v"); () 
(...)
>>> print(i2v(item))
([array([ ...dtype=float32)], None)
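The model names listed above all appear to follow a model_corpus_dimension pattern. Assuming that pattern holds, a name can be split into its parts as sketched below. `parse_model_name` and the field names are hypothetical, and the convention is inferred only from the listed examples, so names with a different number of parts would raise an error here.

```python
def parse_model_name(name: str):
    """Split a name such as 'bert_taledu_768' into (model, corpus, dimension)."""
    model, corpus, dim = name.split("_")
    return model, corpus, int(dim)


parsed = parse_model_name("bert_taledu_768")
```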

EduNLP.Pretrain¶

EduNLP.Pretrain.train_vector(items, w2v_prefix, embedding_dim=None, method='sg', binary=None, train_params=None)[source]¶

Parameters

items (str) – the text of question
w2v_prefix –
embedding_dim (int) – vector_size
method (str) – the method of training, e.g.: sg, cbow, fasttext, d2v, bow, tfidf
binary (bool) – the model format: True for bin, False for kv
train_params (dict) – the training parameters passed to model

Returns

tokenizer

Return type

Tokenizer

Examples

>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$，如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$，则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{1}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> train_vector(token_item[:10], "examples/test_model/w2v/gensim_luna_stem_t_", 100) 
'examples/test_model/w2v/gensim_luna_stem_t_sg_100.kv'

class EduNLP.Pretrain.GensimWordTokenizer(symbol='gm', general=False)[source]¶

Parameters

symbol (str) –

select the segment types to symbolize:
“t”: text, “f”: formula, “g”: figure, “m”: question mark, “a”: tag, “s”: sep,

e.g.: gm, fgm, gmas, fgmas
general (bool) –
True: when the item isn’t in standard format and you want to tokenize formulas (except formulas in figures) linearly.

False: when you want to use the ‘ast’ method to tokenize formulas instead of ‘linear’.
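The symbol string is a set of one-letter flags. A small lookup makes a value such as "gmas" readable; this is an illustrative sketch only, and `expand_symbol` is a hypothetical helper rather than part of EduNLP.

```python
# Letter-to-type mapping taken from the parameter description above.
SYMBOL_TYPES = {
    "t": "text",
    "f": "formula",
    "g": "figure",
    "m": "question mark",
    "a": "tag",
    "s": "sep",
}


def expand_symbol(symbol: str) -> list:
    """Translate a symbol string such as 'gmas' into the segment types it selects."""
    unknown = sorted(set(symbol) - set(SYMBOL_TYPES))
    if unknown:
        raise ValueError(f"unknown symbol letters: {unknown}")
    return [SYMBOL_TYPES[ch] for ch in symbol]


selected = expand_symbol("gmas")
```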

Returns

tokenizer

Return type

Tokenizer

Examples

>>> tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$，如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$，则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
>>> tokenizer = GensimWordTokenizer(symbol="fgmas", general=False)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$，如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$，则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', '[FORMULA]', '约束条件', '公式', '[FORMULA]', '[SEP]', '[FORMULA]']

batch_process(*items)[source]¶

class EduNLP.Pretrain.GensimSegTokenizer(symbol='gms', depth=None, flatten=False, **kwargs)[source]¶

Parameters

symbol (str) –

select the segment types to symbolize:
“t”: text, “f”: formula, “g”: figure, “m”: question mark, “a”: tag, “s”: sep,

e.g. gms, fgm
depth (int or None) –
- 0: only separate at SIFSep
- 1: only separate at SIFTag
- 2: separate at SIFTag and SIFSep
- otherwise: separate all segments
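One way to read the depth values is as a choice of which marker segments act as boundaries when grouping tokens. The sketch below illustrates that reading in plain Python. It is not the GensimSegTokenizer implementation: `split_segments` and the "sep"/"tag" kind labels are hypothetical, and dropping the boundary segments from the output is a choice made here.

```python
def split_segments(segments, depth=None):
    """Group (kind, tokens) segments according to `depth`.

    kind is a hypothetical label: "sep" for SIFSep, "tag" for SIFTag,
    anything else for ordinary content.
    """
    boundaries = {0: {"sep"}, 1: {"tag"}, 2: {"sep", "tag"}}.get(depth)
    if boundaries is None:
        # Any other depth: every segment stays on its own.
        return [list(tokens) for _, tokens in segments]
    groups, current = [], []
    for kind, tokens in segments:
        if kind in boundaries:
            # Close the running group at a boundary; the marker is dropped.
            if current:
                groups.append(current)
            current = []
        else:
            current.extend(tokens)
    if current:
        groups.append(current)
    return groups


segments = [
    ("tag", ["\\SIFTag{options}"]),
    ("formula", ["x", "<", "y"]),
    ("sep", ["[SEP]"]),
    ("formula", ["y", "=", "x"]),
]
by_sep_and_tag = split_segments(segments, depth=2)
```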

Returns

tokenizer

Return type

Tokenizer

Examples

>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$，如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$，则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{1}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> tokenizer = GensimSegTokenizer(symbol="fgm", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$，如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$，则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], ['[FORMULA]'], ['如图'], ['[FIGURE]'], ['[FORMULA]'],...['[FORMULA]'], ['最大值'], ['[MARK]']]

class EduNLP.Pretrain.ElmoTokenizer(path: Optional[str] = None)[source]¶

Examples

>>> t=ElmoTokenizer()
>>> items = ["有公式$\FormFigureID{wrong1?}$，如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$，则$z=x+7 y$的最大值为$\SIFBlank$"]
>>> t.tokenize(items[0])
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']
>>> len(t)
18

tokenize(item: (<class 'str'>, <class 'list'>), freeze_vocab=False, return_length=False)[source]¶

to_index(item: list, max_length=128, pad_to_max_length=False)[source]¶

append(item)[source]¶

save_vocab(path)[source]¶

load_vocab(path)[source]¶

class EduNLP.Pretrain.ElmoDataset(texts: list, tokenizer: ElmoTokenizer, max_length=128)[source]¶

EduNLP.Pretrain.train_elmo(texts: list, output_dir: str, pretrained_dir: Optional[str] = None, emb_dim=512, hid_dim=512, batch_size=2, epochs=3, lr: float = 0.0005, device=None)[source]¶

Parameters

texts (list, required) – The training corpus of shape (text_num, token_num), a text must be tokenized into tokens
output_dir (str, required) – The directory to save trained model files
pretrained_dir (str, optional) – The pretrained model files’ directory
emb_dim (int, optional, default=512) – The embedding dim
hid_dim (int, optional, default=512) – The hidden dim
batch_size (int, optional, default=2) – The training batch size
epochs (int, optional, default=3) – The training epochs
lr (float, optional, default=5e-4) – The learning rate
device (str, optional) – Default is ‘cuda’ if available, otherwise ‘cpu’

Returns

output_dir – The directory that trained model files are saved

Return type

str
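Since texts must already be tokenized, with shape (text_num, token_num), a quick structural check before training can catch raw strings passed by mistake. This is a minimal sketch: `is_tokenized_corpus` is a hypothetical helper, and treating tokens as str is an assumption not stated above.

```python
def is_tokenized_corpus(texts) -> bool:
    """True if `texts` is a list of token lists, each token being a str."""
    return isinstance(texts, list) and all(
        isinstance(text, list) and all(isinstance(tok, str) for tok in text)
        for text in texts
    )


ok = is_tokenized_corpus([["x", "+", "y"], ["z", "=", "1"]])
not_ok = is_tokenized_corpus(["x + y", "z = 1"])  # raw strings, not token lists
```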

class EduNLP.Pretrain.BertTokenizer(pretrain_model='bert-base-chinese', add_special_tokens=False, text_tokenizer=None)[source]¶

Parameters

pretrain_model – the pretrained model to use
add_special_tokens – Whether to add tokens like [FIGURE], [TAG], etc.
text_tokenizer – Which text tokenizer to use. Must be consistent with TOKENIZER dictionary.

Examples

>>> tokenizer = BertTokenizer(add_special_tokens=True)
>>> item = "有公式$\FormFigureID{wrong1?}$，如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$，则$z=x+7 y$的最大值为$\SIFBlank$"
>>> token_item = tokenizer(item)
>>> print(token_item.input_ids[:10])
[101, 1062, 2466, 1963, 1745, 21129, 166, 117, 167, 5276]
>>> print(tokenizer.tokenize(item)[:10])
['公', '式', '如', '图', '[FIGURE]', 'x', ',', 'y', '约', '束']
>>> items = [item, item]
>>> token_items = tokenizer(items, return_tensors='pt')
>>> print(token_items.input_ids.shape)
torch.Size([2, 27])
>>> print(len(tokenizer.tokenize(items)))
2
>>> tokenizer.save_pretrained('test_dir') 
>>> tokenizer = BertTokenizer.from_pretrained('test_dir') 

tokenize(item: Union[list, str], *args, **kwargs)[source]¶

save_pretrained(tokenizer_config_dir)[source]¶

classmethod from_pretrained(tokenizer_config_dir)[source]¶

EduNLP.Pretrain.finetune_bert(items, output_dir, pretrain_model='bert-base-chinese', train_params=None)[source]¶

Parameters

items (dict) – the tokenization results of questions
output_dir (str) – the path to save the model
pretrain_model (str) – the name or path of pre-trained model
train_params (dict) – the training parameters passed to Trainer

Examples

>>> tokenizer = BertTokenizer()
>>> stems = ["有公式$\FormFigureID{wrong1?}$，如图$\FigureID{088f15ea-xxx}$",
... "有公式$\FormFigureID{wrong1?}$，如图$\FigureID{088f15ea-xxx}$"]
>>> token_item = [tokenizer(i) for i in stems]
>>> print(token_item[0].keys())
dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
>>> finetune_bert(token_item, "examples/test_model/data/data/bert") 
{'train_runtime': ..., ..., 'epoch': 1.0}

class EduNLP.Pretrain.QuesNetTokenizer(img_dir=None, vocab_path=None, max_length=250, meta=None, img_token='<img>', unk_token='<unk>', pad_token='<pad>', *args, **argv)[source]¶

Examples

>>> tokenizer = QuesNetTokenizer(meta=['knowledge'])
>>> test_items = [{"ques_content": "$\triangle A B C$ 的内角为 $A, \quad B, $\FigureID{test_id}$",
... "knowledge": "['*', '-', '/']"}, {"ques_content": "$\triangle A B C$ 的内角为 $A, \quad B",
... "knowledge": "['*', '-', '/']"}]
>>> tokenizer.set_vocab(test_items,
... trim_min_count=1, key=lambda x: x["ques_content"], silent=True)
>>> token_items = [tokenizer(i, key=lambda x: x["ques_content"]) for i in test_items]
>>> print(token_items[0].keys())
dict_keys(['content_idx', 'meta_idx'])
>>> token_items = tokenizer(test_items, key=lambda x: x["ques_content"])
>>> print(len(token_items["content_idx"]))
2

tokenize(item: ~typing.Union[str, dict, list], key=<function QuesNetTokenizer.<lambda>>, *args, **kwargs)[source]¶

load_vocab(path)[source]¶

Parameters: path (str) – path to the vocabulary files; it must be a directory containing word.txt (meta.txt is optional)

set_vocab(items: list, key=<function QuesNetTokenizer.<lambda>>, trim_min_count=50, silent=True)[source]¶

Parameters

items (list) – can be the list of str, or list of dict
key (function) – determine how to get the text of each item
trim_min_count –
silent –

save_vocab(save_vocab_path)[source]¶

Parameters: save_vocab_path (str) – path to save word vocabulary and meta vocabulary

classmethod from_pretrained(tokenizer_config_dir, img_dir=None)[source]¶

tokenizer_config_dir (str) – must contain tokenizer_config.json, vocab/word.txt and vocab/meta_{meta_name}.txt
img_dir (str) – the path of the image directory, default None

save_pretrained(tokenizer_config_dir)[source]¶

tokenizer_config_dir (str) – save tokenizer params in tokenizer_config.json and save words in vocab.list

padding(idx, max_length, type='word')[source]¶

property vocab_size¶

set_img_dir(path)[source]¶

EduNLP.Pretrain.pretrain_quesnet(path, output_dir, tokenizer, save_embs=False, train_params=None)[source]¶

Pretrain QuesNet.

Parameters

path (str) – path of question file
output_dir (str) – output path
tokenizer (QuesNetTokenizer) – quesnet tokenizer
save_embs (bool, optional) – whether to save pretrained word/image/meta embeddings separately
train_params (dict, optional) –
the training parameters and model parameters, by default None
- “n_epochs”: int, default = 1
  train param, number of epochs
- “batch_size”: int, default = 6
  train param, batch size
- “lr”: float, default = 1e-3
  train param, learning rate
- “save_every”: int, default = 0
  train param, save steps interval
- “log_steps”: int, default = 10
  train param, log steps interval
- “device”: str, default = ‘cpu’
  train param, ‘cpu’ or ‘cuda’
- “max_steps”: int, default = 0
  train param, stop training when max steps is reached
- “emb_size”: int, default = 256
  model param, the embedding size of word, figure, meta info
- “feat_size”: int, default = 256
  model param, the size of question infer vector
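The defaults listed above can be merged with user overrides before they are passed as train_params. The following is a minimal sketch: `QUESNET_DEFAULTS` and `resolve_train_params` are hypothetical names, the default values are copied from the list above, and rejecting unknown keys is a choice of this sketch, not documented behaviour of pretrain_quesnet.

```python
# Default values copied from the train_params description above.
QUESNET_DEFAULTS = {
    "n_epochs": 1,
    "batch_size": 6,
    "lr": 1e-3,
    "save_every": 0,
    "log_steps": 10,
    "device": "cpu",
    "max_steps": 0,
    "emb_size": 256,
    "feat_size": 256,
}


def resolve_train_params(train_params=None, defaults=QUESNET_DEFAULTS):
    """Merge user overrides into the documented defaults, rejecting unknown keys."""
    overrides = train_params or {}
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        raise KeyError(f"unknown train_params keys: {unknown}")
    return {**defaults, **overrides}


params = resolve_train_params({"n_epochs": 3, "device": "cuda"})
```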

Examples

>>> tokenizer = QuesNetTokenizer(meta=['know_name'])
>>> items = [{"ques_content": "若复数$z=1+2 i+i^{3}$，则$|z|=$，$\FigureID{000004d6-0479-11ec-829b-797d5eb43535}$",
... "ques_id": "726cdbec-33a9-11ec-909c-98fa9b625adb",
... "know_name": "['代数', '集合', '集合的相等']"
... }]
>>> tokenizer.set_vocab(items, key=lambda x: x['ques_content'], trim_min_count=1, silent=True)
>>> pretrain_quesnet('./data/quesnet_data.json', './testQuesNet', tokenizer) 

class EduNLP.Pretrain.Question(id, content, answer, false_options, labels)¶

property answer¶: Alias for field number 2

property content¶: Alias for field number 1

property false_options¶: Alias for field number 3

property id¶: Alias for field number 0

property labels¶: Alias for field number 4
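The "Alias for field number" descriptions are the default docstrings that `collections.namedtuple` generates, which suggests Question is a namedtuple with the five fields in this order. A minimal sketch of an equivalent definition follows; the field values are made-up illustration data.

```python
from collections import namedtuple

# Same field names and order as documented above.
Question = namedtuple("Question", ["id", "content", "answer", "false_options", "labels"])

q = Question(
    id="q1",
    content="1 + 1 = ?",
    answer="2",
    false_options=["1", "3"],
    labels=["arithmetic"],
)
```

Fields can then be read by name (`q.answer`) or by position (`q[2]`).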

class EduNLP.Pretrain.DisenQTokenizer(vocab_path=None, max_length=250, tokenize_method='space', num_token='<num>', unk_token='<unk>', pad_token='<pad>', *args, **argv)[source]¶

Examples

>>> tokenizer = DisenQTokenizer()
>>> test_items = [{
...     "content": "甲 数 除以 乙 数 的 商 是 1.5 ， 如果 甲 数 增加 20 ， 则 甲 数 是 乙 的 4 倍 ． 原来 甲 数 = ．",
...     "knowledge": ["*", "-", "/"], "difficulty": 0.2, "length": 7}]
>>> tokenizer.set_vocab(test_items,
...     trim_min_count=1, key=lambda x: x["content"], silent=True)
>>> token_items = [tokenizer(i, key=lambda x: x["content"]) for i in test_items]
>>> print(token_items[0].keys())
dict_keys(['content_idx', 'content_len'])

set_text_tokenizer(tokenize_method)[source]¶

tokenize(items: (<class 'list'>, <class 'str'>, <class 'dict'>), key=<function DisenQTokenizer.<lambda>>, **kwargs)[source]¶

Parameters

items (list or str or dict) – the question items
key (function) – determine how to get the text of each item

Returns

tokens – the token of items

Return type

list

load_vocab(path)[source]¶

set_vocab(items: list, key=<function DisenQTokenizer.<lambda>>, trim_min_count=1, silent=True)[source]¶

Parameters

items (list) – can be the list of str, or list of dict
key (function) – determine how to get the text of each item

save_vocab(save_vocab_path)[source]¶

classmethod from_pretrained(tokenizer_config_dir)[source]¶

tokenizer_config_dir (str) – must contain tokenizer_config.json and vocab.list

save_pretrained(tokenizer_config_dir)[source]¶

tokenizer_config_dir (str) – save tokenizer params in tokenizer_config.json and save words in vocab.list

property vocab_size¶

EduNLP.Pretrain.train_disenqnet(train_items, disen_tokenizer, output_dir, predata_dir, train_params=None, test_items=None, silent=False, data_formation=None)[source]¶

Parameters

train_items (list) – the raw train question list
disen_tokenizer (DisenQTokenizer) – the initial DisenQTokenizer use for training.
output_dir (str) – the path to save the model
predata_dir (str) – the dirname to load or save predata (including wv.th, vocab.list and concept.list)
train_params (dict, defaults to None) –
the training parameters for data, model and trainer.
- “trim_min”: int
  data param, the trim_min_count for vocab and word2vec, by default 2
- “w2v_workers”: int
  data param, the number of workers for word2vec, by default 1
- “hidden”: int
  model param, by default 128
- “dropout”: float
  model param, dropout rate, by default 0.2
- “pos_weight”: int
  model param, positive sample weight in unbalanced multi-label concept classifier, by default 1
- “cp”: float
  model param, weight of concept loss, by default 1.5
- “mi”: float
  model param, weight of mutual information loss, by default 1.0
- “dis”: float
  model param, weight of disentangling loss, by default 2.0
- “epoch”: int
  train param, number of epochs, by default 1
- “batch”: int
  train param, batch size, by default 64
- “lr”: float
  train param, learning rate, by default 1e-3
- “step”: int
  train param, step_size for StepLR, by default 20
- “gamma”: float
  train param, gamma for StepLR, by default 0.5
- “warm_up”: int
  train param, number of epochs for warming up, by default 1
- “adv”: int
  train param, ratio of disc/enc training for adversarial process, by default 10
- “device”: str
  train param, ‘cpu’ or ‘cuda’, by default “cpu”
test_items (list, defaults to None) – the raw test question list, default is None
silent (bool, defaults to False) – whether to print processing information
data_formation (dict, defaults to None) – maps “content” and “knowledge” to the corresponding field names of the items. For example, {“content”: “ques_content”, “knowledge”: “know_name”}

Examples

>>> train_data = load_items("static/test_data/disenq_train.json")[:100]
>>> test_data = load_items("static/test_data/disenq_test.json")[:100]
>>> tokenizer = DisenQTokenizer(max_length=250, tokenize_method="space")
>>> train_disenqnet(train_data, tokenizer,
... "examples/test_model/disenq","examples/test_model/disenq", silent=True)  
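
To make the “lr”, “step” and “gamma” entries of train_params more concrete, here is a sketch of how a StepLR-style schedule behaves with the documented defaults. merge_params and step_lr are hypothetical helpers; how train_disenqnet actually merges user values with its defaults, and how “warm_up” interacts with the schedule, is not shown in this reference and is not modelled here.

```python
# Defaults copied from the train_params description above.
DEFAULT_TRAIN_PARAMS = {"lr": 1e-3, "step": 20, "gamma": 0.5, "epoch": 1, "batch": 64}


def merge_params(train_params=None):
    """Overlay user-supplied values on the documented defaults (assumed behaviour)."""
    params = dict(DEFAULT_TRAIN_PARAMS)
    params.update(train_params or {})
    return params


def step_lr(params, epoch):
    """Learning rate at `epoch` under a StepLR-style schedule.

    The rate is multiplied by gamma once every `step` epochs.
    """
    return params["lr"] * params["gamma"] ** (epoch // params["step"])


params = merge_params({"epoch": 60})
# Sample the schedule just before and after each decay boundary.
schedule = [step_lr(params, e) for e in (0, 19, 20, 40)]
```

With the defaults, the rate stays at 1e-3 for the first 20 epochs and is halved at epochs 20 and 40, so a run with the default “epoch” of 1 never reaches a decay step.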

EduNLP.Tokenizer¶

class EduNLP.Tokenizer.PureTextTokenizer(*args, **kwargs)[source]¶

Deals with text and plain-text formulas, and filters out special formulas such as $\FormFigureID{…}$ and $\FormFigureBase64{…}$.

Parameters

items (str) –
key –
args –
kwargs –

Return type

token

Examples

>>> tokenizer = PureTextTokenizer()
>>> items = ["有公式$\\FormFigureID{1}$，如图$\\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\\FormFigureBase64{2}$,$\\SIFSep$，则$z=x+7 y$的最大值为$\\SIFBlank$"]
>>> tokens = tokenizer(items)
>>> next(tokens)[:10]
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z']
>>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
>>> tokens = tokenizer(items)
>>> next(tokens)  
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
'0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
'\\quad', 'A', '\\cap', 'B', '=']
>>> items = [{
... "stem": "已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$",
... "options": ["1", "2"]
... }]
>>> tokens = tokenizer(items, key=lambda x: x["stem"])
>>> next(tokens)  
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
'0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
'\\quad', 'A', '\\cap', 'B', '=']

class EduNLP.Tokenizer.TextTokenizer(*args, **kwargs)[source]¶

Deals with text and formulas, including special formulas.

Parameters

items (str) –
key –
args –
kwargs –

Return type

token

Examples

>>> tokenizer = TextTokenizer()
>>> items = ["有公式$\\FormFigureID{1}$，如图$\\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\\FormFigureBase64{2}$,$\\SIFSep$，则$z=x+7 y$的最大值为$\\SIFBlank$"]
>>> tokens = tokenizer(items)
>>> next(tokens)[:10]
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
>>> items = ["$\\SIFTag{stem_begin}$若复数$z=1+2 i+i^{3}$，则$|z|=$$\\SIFTag{stem_end}$\
... $\\SIFTag{options_begin}$$\\SIFTag{list_0}$0$\\SIFTag{list_1}$1$\\SIFTag{list_2}$$\\sqrt{2}$\
... $\\SIFTag{list_3}$2$\\SIFTag{options_end}$"]
>>> tokens = tokenizer(items)
>>> next(tokens)[:10]
['[TAG]', '复数', 'z', '=', '1', '+', '2', 'i', '+', 'i']
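
The difference between the two tokenizers can be pictured as a placeholder-substitution step. The sketch below is an assumption about the idea, not EduNLP's implementation: the mapping is inferred only from the doctest outputs above, replace_special is a hypothetical name, and other SIF symbols are deliberately left out.

```python
import re

# Mapping inferred from the doctest outputs above; other SIF symbols are omitted.
SPECIAL_PATTERNS = [
    (re.compile(r"\$\\FormFigure(?:ID|Base64)\{.*?\}\$"), "[FORMULA]"),
    (re.compile(r"\$\\FigureID\{.*?\}\$"), "[FIGURE]"),
    (re.compile(r"\$\\SIFSep\$"), "[SEP]"),
    (re.compile(r"\$\\SIFTag\{.*?\}\$"), "[TAG]"),
]


def replace_special(text, keep_formula_figures=True):
    """Replace special SIF formulas with placeholder symbols.

    keep_formula_figures=False drops formula figures instead of emitting
    [FORMULA], mimicking the filtering described for PureTextTokenizer.
    """
    for pattern, symbol in SPECIAL_PATTERNS:
        if symbol == "[FORMULA]" and not keep_formula_figures:
            replacement = " "
        else:
            replacement = " " + symbol + " "
        text = pattern.sub(replacement, text)
    # Collapse the whitespace introduced by the substitutions.
    return " ".join(text.split())


sample = "有公式$\\FormFigureID{1}$，如图$\\FigureID{088f15ea-xxx}$,$\\SIFSep$"
with_formula = replace_special(sample)
without_formula = replace_special(sample, keep_formula_figures=False)
```

Here with_formula keeps a [FORMULA] placeholder, as in the TextTokenizer output, while without_formula omits it, as in the PureTextTokenizer output; word segmentation of the remaining text is a separate step that this sketch does not attempt.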

class EduNLP.Tokenizer.Tokenizer[source]¶

EduNLP.Tokenizer.get_tokenizer(name, *args, **kwargs)[source]¶

A unified interface for obtaining the different tokenizers.

Parameters

name (str) – the name of the tokenizer, e.g. text, pure_text
args – the parameters passed to the tokenizer
kwargs – the parameters passed to the tokenizer

Returns

tokenizer

Return type

Tokenizer

Examples

>>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
>>> tokenizer = get_tokenizer("text")
>>> tokens = tokenizer(items)
>>> next(tokens)  
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
'0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
'\\quad', 'A', '\\cap', 'B', '=']
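
A name-based factory such as get_tokenizer is commonly built on a small registry. The following is a generic sketch of that pattern under the assumption that names map to classes; the two toy tokenizers and the registry names are hypothetical and unrelated to EduNLP's own text and pure_text tokenizers.

```python
class WhitespaceTokenizer:
    """Hypothetical tokenizer that splits each item on whitespace."""

    def __call__(self, items):
        for item in items:
            yield item.split()


class CharTokenizer:
    """Hypothetical tokenizer that splits each item into non-space characters."""

    def __call__(self, items):
        for item in items:
            yield [ch for ch in item if not ch.isspace()]


# Name-to-class registry; the names here are illustrative, not EduNLP's.
TOKENIZERS = {"whitespace": WhitespaceTokenizer, "char": CharTokenizer}


def get_tokenizer_sketch(name, *args, **kwargs):
    """Look up a tokenizer class by name and instantiate it with the given arguments."""
    if name not in TOKENIZERS:
        raise KeyError("unknown tokenizer %r; choose from %s" % (name, sorted(TOKENIZERS)))
    return TOKENIZERS[name](*args, **kwargs)


# As in the doctest above, calling the tokenizer returns a generator.
tokens = get_tokenizer_sketch("whitespace")(["a b c"])
first = next(tokens)
```

The caller only needs the name, so swapping tokenizers becomes a one-word change in configuration rather than a change of imports.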

Vector¶

class EduNLP.Vector.W2V(filepath, method=None, binary=None)[source]¶

This class uses the gensim library, which provides the FastText, Word2Vec and KeyedVectors methods, to convert words to vectors.

Parameters

filepath – path to the pretrained model file
method (str) – fasttext, or other (Word2Vec)
binary (bool) –

key_to_index(word)[source]¶

property vectors¶

property vector_size¶

infer_vector(items, agg='mean', *args, **kwargs) → list[source]¶

infer_tokens(items, *args, **kwargs) → list[source]¶

class EduNLP.Vector.D2V(filepath, method='d2v')[source]¶

A collection of methods that includes d2v, bow and tfidf.

Parameters

filepath –
method (str) – d2v, bow or tfidf
item –

Returns

d2v model

Return type

D2V

property vector_size¶

infer_vector(items, *args, **kwargs) → list[source]¶

infer_tokens(item, *args, **kwargs) → ...[source]¶

class EduNLP.Vector.BowLoader(filepath)[source]¶

Uses the doc2bow model to convert documents into bag-of-words vectors.

Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.

If allow_update is set, then also update dictionary in the process: create ids for new words. At the same time, update document frequencies – for each word appearing in this document, increase its document frequency (self.dfs) by one.

If allow_update is not set, this function is const, aka read-only.
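
The doc2bow behaviour described above can be sketched with the standard library alone. TinyDictionary is a hypothetical stand-in written for illustration; it is not the class that BowLoader wraps, and it only reproduces the documented semantics of allow_update and document frequencies.

```python
from collections import Counter


class TinyDictionary:
    """Stdlib-only stand-in for a token-to-id dictionary with doc2bow semantics."""

    def __init__(self):
        self.token2id = {}
        self.dfs = {}  # token_id -> number of documents containing the token

    def doc2bow(self, document, allow_update=False):
        counts = Counter(document)
        if allow_update:
            for token in counts:
                # Create ids for unseen words, then bump the document
                # frequency once per document, not once per occurrence.
                token_id = self.token2id.setdefault(token, len(self.token2id))
                self.dfs[token_id] = self.dfs.get(token_id, 0) + 1
        # Words without an id are skipped, so the read-only path never mutates state.
        return sorted((self.token2id[t], c) for t, c in counts.items() if t in self.token2id)


dictionary = TinyDictionary()
bow_train = dictionary.doc2bow(["x", "y", "x"], allow_update=True)
bow_query = dictionary.doc2bow(["y", "z", "y"])
```

The first call assigns ids to x and y and returns their counts; the second call is read-only, so the unseen word z is ignored and the dictionary is left unchanged.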

infer_vector(item, return_vec=False)[source]¶

property vector_size¶

class EduNLP.Vector.TfidfLoader(filepath)[source]¶

This module implements functionality related to the Term Frequency - Inverse Document Frequency <https://en.wikipedia.org/wiki/Tf%E2%80%93idf> vector space bag-of-words models.
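
For readers unfamiliar with the weighting, here is a minimal TF-IDF sketch over tokenized documents. It assumes one common variant, tf * log2(N / df) followed by L2 normalisation; the exact scheme used by the model that TfidfLoader loads may differ, and tfidf_vectors is a hypothetical helper.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Weight each term by tf * log2(N / df), then L2-normalise each document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        weights = {t: tf[t] * math.log2(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Terms that occur in every document get weight 0 and are dropped.
        vectors.append({t: w / norm for t, w in weights.items() if w > 0} if norm else {})
    return vectors


docs = [["x", "y", "x"], ["y", "z"], ["y"]]
vectors = tfidf_vectors(docs)
```

Because y appears in every document its weight is zero everywhere, which leaves the third document with an empty vector; this is the sense in which TF-IDF suppresses uninformative terms.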

infer_vector(item, return_vec=False)[source]¶

property vector_size¶

class EduNLP.Vector.RNNModel(rnn_type, w2v: (<class 'EduNLP.Vector.gensim_vec.W2V'>, <class 'tuple'>, <class 'list'>, <class 'dict'>, None), hidden_size, freeze_pretrained=True, model_params=None, device=None, **kwargs)[source]¶

Examples

>>> model = RNNModel("BiLSTM", None, 2, vocab_size=4, embedding_dim=3)
>>> seq_idx = [[1, 2, 3], [1, 2, 0], [3, 0, 0]]
>>> output, hn = model(seq_idx, indexing=False, padding=False)
>>> seq_idx = [[1, 2, 3], [1, 2], [3]]
>>> output, hn = model(seq_idx, indexing=False, padding=True)
>>> output.shape
torch.Size([3, 3, 4])
>>> hn.shape
torch.Size([2, 3, 2])
>>> tokens = model.infer_tokens(seq_idx, indexing=False)
>>> tokens.shape
torch.Size([3, 3, 4])
>>> tokens = model.infer_tokens(seq_idx, agg="mean", indexing=False)
>>> tokens.shape
torch.Size([3, 4])
>>> item = model.infer_vector(seq_idx, indexing=False)
>>> item.shape
torch.Size([3, 4])
>>> item = model.infer_vector(seq_idx, agg="mean", indexing=False)
>>> item.shape
torch.Size([3, 2])
>>> item = model.infer_vector(seq_idx, agg=None, indexing=False)
>>> item.shape
torch.Size([2, 3, 2])
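
The shapes in the example above follow from the usual conventions for a batch-first recurrent layer. The sketch below derives them without torch; it assumes a single-layer, batch-first, bidirectional network and a pad value of 0, and both helper names are hypothetical.

```python
def rnn_output_shapes(batch_size, seq_len, hidden_size, bidirectional=True, num_layers=1):
    """Shapes of (output, hn) for a batch-first recurrent layer."""
    directions = 2 if bidirectional else 1
    # Per-token output concatenates the forward and backward hidden states.
    output_shape = (batch_size, seq_len, directions * hidden_size)
    # The final hidden state keeps one slice per layer and direction.
    hn_shape = (num_layers * directions, batch_size, hidden_size)
    return output_shape, hn_shape


def pad_batch(seq_idx, pad_val=0):
    """Right-pad ragged index lists to the length of the longest one."""
    max_len = max(len(seq) for seq in seq_idx)
    return [seq + [pad_val] * (max_len - len(seq)) for seq in seq_idx]


padded = pad_batch([[1, 2, 3], [1, 2], [3]])
output_shape, hn_shape = rnn_output_shapes(batch_size=3, seq_len=3, hidden_size=2)
```

With hidden_size=2 and a BiLSTM, each token vector has 2 * 2 = 4 dimensions, matching torch.Size([3, 3, 4]) for the output and torch.Size([2, 3, 2]) for hn in the doctest.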

infer_vector(items, agg: (<class 'int'>, <class 'str'>, None) = -1, indexing=True, padding=True, *args, **kwargs) → Tensor[source]¶

infer_tokens(items, agg=None, *args, **kwargs) → Tensor[source]¶

property vector_size: int¶

set_device(device)[source]¶

save(filepath, save_embedding=False)[source]¶

freeze(*args, **kwargs)[source]¶

property is_frozen¶

eval()[source]¶

train(mode=True)[source]¶

class EduNLP.Vector.T2V(model: str, *args, **kwargs)[source]¶

This class converts token lists to vectors. If you already have a model, you can use T2V directly. Otherwise, calling get_pretrained_t2v is a better way to obtain vectors, since it does not require a model of your own.

Parameters: model (str) – select the model type e.g.: d2v, rnn, lstm, gru, elmo, etc.

Examples

>>> item = [{'ques_content':'有公式$\\FormFigureID{wrong1?}$和公式$\\FormFigureBase64{wrong2?}$，\
... 如图$\\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\\SIFSep$，则$z=x+7 y$的最大值为$\\SIFBlank$'}]
>>> path = "examples/test_model/d2v/d2v_test_256/d2v_test_256.bin"
>>> t2v = T2V('d2v',filepath=path)
>>> print(t2v(item)) 
[array([...dtype=float32)]

infer_vector(items, *args, **kwargs)[source]¶

infer_tokens(items, *args, **kwargs)[source]¶

property vector_size: int¶

EduNLP.Vector.get_pretrained_t2v(name, model_dir='/home/docs/.EduNLP/model')[source]¶

A convenient way to convert token lists to vectors easily.

Parameters

name (str) – select the pretrained model, e.g.: d2v_math_300, w2v_math_300, elmo_math_2048, bert_math_768, bert_taledu_768, disenq_math_256, quesnet_math_512
model_dir (str) – the path of model, default: MODEL_DIR = ‘~/.EduNLP/model’

Returns

t2v model

Return type

T2V

Examples

>>> item = [{'ques_content':'有公式$\\FormFigureID{wrong1?}$和公式$\\FormFigureBase64{wrong2?}$，\
... 如图$\\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\\SIFSep$，则$z=x+7 y$的最大值为$\\SIFBlank$'}]
>>> i2v = get_pretrained_t2v("d2v_test_256", "examples/test_model/d2v") 
>>> print(i2v(item)) 
[array([...dtype=float32)]

EduNLP.Vector.get_pretrained_model_info(name)[source]¶

EduNLP.Vector.get_all_pretrained_models()[source]¶

class EduNLP.Vector.Embedding(w2v: (<class 'EduNLP.Vector.gensim_vec.W2V'>, <class 'tuple'>, <class 'list'>, <class 'dict'>, None), freeze=True, device=None, **kwargs)[source]¶

infer_token_vector(items: List[List[str]], indexing=True) → tuple[source]¶

indexing(items: List[List[str]], padding=False, indexing=True) → tuple[source]¶

Parameters

items (list of list of str(word/token)) –
padding (bool) – whether to pad the returned lists with the default pad_val so that all items have the same length
indexing (bool) –

Returns

token_idx (list of list of int) – the list of the tokens of each item
token_len (list of int) – the list of the length of tokens of each item
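
A minimal sketch of what indexing with padding amounts to, assuming a plain dict vocabulary. The function index_tokens, the sample vocabulary, and the pad and unknown ids (0 and 1) are all assumptions made for illustration, not the values Embedding uses.

```python
def index_tokens(items, vocab, padding=False, pad_val=0, unk_val=1):
    """Map tokens to ids; optionally pad every item to the longest length.

    Returns (token_idx, token_len), mirroring the documented return values.
    """
    token_idx = [[vocab.get(tok, unk_val) for tok in item] for item in items]
    token_len = [len(ids) for ids in token_idx]  # lengths before padding
    if padding:
        max_len = max(token_len, default=0)
        token_idx = [ids + [pad_val] * (max_len - len(ids)) for ids in token_idx]
    return token_idx, token_len


# Hypothetical vocabulary with assumed ids for padding and unknown tokens.
vocab = {"[PAD]": 0, "[UNK]": 1, "x": 2, "y": 3, "=": 4}
token_idx, token_len = index_tokens([["x", "=", "y"], ["y", "?"]], vocab, padding=True)
```

Returning the original lengths alongside the padded ids lets downstream code, such as a recurrent model, ignore the padded positions.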

set_device(device)[source]¶

class EduNLP.Vector.BertModel(pretrained_model)[source]¶

Examples

>>> from EduNLP.Pretrain import BertTokenizer
>>> tokenizer = BertTokenizer("bert-base-chinese", add_special_tokens=False)
>>> model = BertModel("bert-base-chinese")
>>> item = ["有公式$\\FormFigureID{wrong1?}$，如图$\\FigureID{088f15ea-xxx}$，若$x,y$满足约束",
... "有公式$\\FormFigureID{wrong1?}$，如图$\\FigureID{088f15ea-xxx}$，若$x,y$满足约束"]
>>> inputs = tokenizer(item, return_tensors='pt')
>>> output = model(inputs)
>>> output.shape
torch.Size([2, 14, 768])
>>> tokens = model.infer_tokens(inputs)
>>> tokens.shape
torch.Size([2, 12, 768])
>>> tokens = model.infer_tokens(inputs, return_special_tokens=True)
>>> tokens.shape
torch.Size([2, 14, 768])
>>> item = model.infer_vector(inputs)
>>> item.shape
torch.Size([2, 768])

infer_vector(items: dict, pooling_strategy='CLS') → Tensor[source]¶

infer_tokens(items: dict, return_special_tokens=False) → Tensor[source]¶
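
The doctest shapes above (14 positions with special tokens, 12 without, and one 768-dimensional vector per item) are consistent with dropping [CLS] and [SEP] and with taking the [CLS] position as the item vector. The sketch below shows that idea on nested lists; it is an interpretation rather than the library's code, and it assumes unpadded sequences of equal length so that [SEP] is the last position.

```python
def strip_special_tokens(token_vectors):
    """Drop the first ([CLS]) and last ([SEP]) position of every sequence."""
    return [sequence[1:-1] for sequence in token_vectors]


def cls_pooling(token_vectors):
    """Use the vector at position 0 ([CLS]) as the sentence vector."""
    return [sequence[0] for sequence in token_vectors]


# Toy batch: 2 sequences x 4 positions x 3 dimensions (real models use e.g. 768).
batch = [[[float(10 * s + p)] * 3 for p in range(4)] for s in range(2)]

inner = strip_special_tokens(batch)  # 2 sequences x 2 positions x 3 dimensions
pooled = cls_pooling(batch)          # 2 sequences x 3 dimensions
```

In the toy batch, stripping reduces 4 positions to 2, just as 14 becomes 12 in the doctest, and pooling collapses the position axis entirely.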

property vector_size¶

class EduNLP.Vector.QuesNetModel(pretrained_dir, tokenizer=None, device='cpu')[source]¶

infer_vector(items: Union[Question, list]) → Tensor[source]¶

Get the question embedding with QuesNet.

Parameters: items ((Question, list)) – namedtuple, [‘id’, ‘content’, ‘answer’, ‘false_options’, ‘labels’] or a list of Questions

infer_tokens(items: Union[Question, list]) → Tensor[source]¶

Get the token embeddings with QuesNet.

Parameters: items (Question) – namedtuple, [‘id’, ‘content’, ‘answer’, ‘false_options’, ‘labels’] or a list of Questions
Returns: meta_emb + word_embs
Return type: torch.Tensor

property vector_size¶

class EduNLP.Vector.DisenQModel(pretrained_dir, device='cpu')[source]¶

infer_vector(items: dict, vector_type=None, **kwargs) → Tensor[source]¶

Parameters: vector_type (str) – choose which item tensor to return. The default, None, returns both (k_hidden, i_hidden); vector_type=“k” returns k_hidden; vector_type=“i” returns i_hidden.
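
The vector_type switch can be summarised as a small dispatch. select_vector is a hypothetical helper that operates on plain lists; raising KeyError for any other value is an assumption, since this reference does not say how unsupported values are handled.

```python
def select_vector(k_hidden, i_hidden, vector_type=None):
    """Return both vectors, or only the one named by vector_type."""
    if vector_type is None:
        return k_hidden, i_hidden
    if vector_type == "k":
        return k_hidden
    if vector_type == "i":
        return i_hidden
    # Assumed behaviour for values the documentation does not list.
    raise KeyError('vector_type must be None, "k" or "i", got %r' % (vector_type,))


k_hidden, i_hidden = [0.1, 0.2], [0.3, 0.4]
both = select_vector(k_hidden, i_hidden)
only_k = select_vector(k_hidden, i_hidden, vector_type="k")
only_i = select_vector(k_hidden, i_hidden, vector_type="i")
```

Callers that need a single tensor should therefore pass vector_type explicitly, because the default returns a pair.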

infer_tokens(items: dict, **kwargs) → Tensor[source]¶

property vector_size¶

class EduNLP.Vector.ElmoModel(pretrained_model_path: str)[source]¶

infer_vector(items, *args, **kwargs) → Tensor[source]¶

infer_tokens(items, *args, **kwargs) → Tensor[source]¶

property vector_size¶