EduNLP

SIF

EduNLP.SIF.sif.is_sif(item, check_formula=True, return_parser=False)[source]

Check whether the input item is in SIF format.

Parameters
  • item (str) – a raw item which represents the stem

  • check_formula (bool) –

    whether to check the formulas when parsing item.

    True: check the validity of the formulas in item. False: skip the validity check, which is faster.

  • return_parser (bool) –

    whether to put the parsed item in return.

    When True, the return value is (bool, Parser); when False, the return value is bool.

Returns

When item cannot be parsed correctly, raise ValueError; when item is already in standard format, return True (and the Parser of item); when item is not in standard format, return False (and the Parser of item).

Return type

bool

Examples

>>> text = '若$x,y$满足约束条件' \
...        '$\\left\\{\\begin{array}{c}2 x+y-2 \\leq 0 \\\\ x-y-1 \\geq 0 \\\\ y+1 \\geq 0\\end{array}\\right.$,' \
...        '则$z=x+7 y$的最大值$\\SIFUnderline$'
>>> is_sif(text)
True
>>> text = '某校一个课外学习小组为研究某作物的发芽率y和温度x(单位...'
>>> ret = is_sif(text, return_parser=True)
>>> ret 
(False, <EduNLP.SIF.parser.parser.Parser object...>)
EduNLP.SIF.sif.to_sif(item, check_formula=True, parser: Optional[Parser] = None)[source]

Convert an item to SIF format.

Parameters
  • item (str) – a raw item which represents the stem

  • check_formula (bool) – whether to check the formulas when parsing item (only takes effect when parser is None).

  • parser (Parser) – the parser of item returned from is_sif.

Returns

item – the item which accords with sif format

Return type

str

Examples

>>> text = '某校一个课外学习小组为研究某作物的发芽率y和温度x(单位...'
>>> siftext = to_sif(text)
>>> siftext
'某校一个课外学习小组为研究某作物的发芽率$y$和温度$x$(单位...'
>>> ret = is_sif(text, return_parser=True)
>>> ret 
(False, <EduNLP.SIF.parser.parser.Parser object...>)
>>> to_sif(text, parser=ret[1])
'某校一个课外学习小组为研究某作物的发芽率$y$和温度$x$(单位...'
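The last example shows that the Parser returned by is_sif can be handed to to_sif, so the item is parsed only once. A minimal sketch of that pattern follows. `ensure_sif` is a hypothetical helper name, and the two functions are passed in as arguments so the snippet runs without EduNLP installed; the `fake_*` stand-ins only mimic the documented call shapes and are not EduNLP code.

```python
def ensure_sif(item, is_sif, to_sif, check_formula=True):
    """Return `item` unchanged if it is already SIF, else its converted form."""
    # With return_parser=True the documented return value is (bool, Parser).
    ok, parser = is_sif(item, check_formula=check_formula, return_parser=True)
    if ok:
        return item
    # Reuse the parser so the item is not parsed a second time.
    return to_sif(item, parser=parser)


# Stand-ins (assumptions for this demo): an item counts as "SIF" here
# if it already contains a '$' delimiter.
def fake_is_sif(item, check_formula=True, return_parser=False):
    ok = "$" in item
    return (ok, object()) if return_parser else ok


def fake_to_sif(item, check_formula=True, parser=None):
    return item.replace("x", "$x$")


print(ensure_sif("温度x", fake_is_sif, fake_to_sif))    # 温度$x$
print(ensure_sif("温度$x$", fake_is_sif, fake_to_sif))  # 温度$x$
```

With the library installed, the real is_sif and to_sif from EduNLP.SIF.sif would take the place of the stand-ins.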
EduNLP.SIF.sif.sif4sci(item: str, figures: (<class 'dict'>, <class 'bool'>) = None, mode: int = 2, symbol: str = None, tokenization=True, tokenization_params=None, errors='raise')[source]

Uses the linear tokenizer by default; specify tokenization_params to change the tokenizer.

Parameters
  • item (str) – a raw item which represents the stem

  • figures (dict or bool) – when it is a dict, it maps figure ids to figure instances for figures in ‘FormFigureID{…}’ format; when it is a bool, it indicates whether to instantiate figures in ‘FormFigureBase64{…}’ format

  • mode (int) – when mode = 2, use is_sif and check the formulas in item; when mode = 1, use is_sif but don’t check the formulas in item; when mode = 0, don’t use is_sif and don’t check anything in item

  • symbol (str) –

    select the methods to symbolize:

    “t”: text, “f”: formula, “g”: figure, “m”: question mark, “a”: tag, “s”: sep

  • tokenization (bool) – whether to tokenize item after segmentation

  • tokenization_params

    the dict of text_params, formula_params and figure_params used in tokenization.

    For formula_params:

    method: which tokenizer to use, “linear” or “ast”.

    Parameters only used by “linear”:

    skip_figure_formula: whether to skip the formula in figure format.

    symbolize_figure_formula: whether to symbolize the formula in figure format.

    Parameters only used by “ast”:

    ord2token: whether to transfer the variables (mathord) and constants (textord) to special tokens.

    var_numbering: whether to use a number suffix to denote different variables.

    return_type: ‘list’ or ‘ast’.

    More parameters can be found in the definition in SIF.tokenization.formula.

    For figure_params:

    figure_instance: whether to return instances of figures in tokens.

    For text_params:

    See the definition in SIF.tokenization.text.

  • errors – warn, raise, coerce, strict, ignore

Returns

When tokenization is False, return SegmentList; When tokenization is True, return TokenList

Return type

list
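The nesting of tokenization_params is easier to see as a literal. The sketch below assembles such a dict using only the option names listed in the parameter description. `build_tokenization_params` is a hypothetical helper, and rejecting options that belong to the other tokenizer is one reading of "only used by", not documented library behaviour.

```python
LINEAR_ONLY = {"skip_figure_formula", "symbolize_figure_formula"}
AST_ONLY = {"ord2token", "var_numbering", "return_type"}


def build_tokenization_params(method="linear", **formula_options):
    """Build a tokenization_params dict, rejecting options of the other method."""
    if method not in ("linear", "ast"):
        raise ValueError("method must be 'linear' or 'ast'")
    # Options that would be silently useless for the chosen method.
    ignored = (AST_ONLY if method == "linear" else LINEAR_ONLY) & set(formula_options)
    if ignored:
        raise ValueError(f"options {sorted(ignored)} have no effect with method={method!r}")
    return {"formula_params": {"method": method, **formula_options}}


params = build_tokenization_params(
    "ast", ord2token=True, var_numbering=True, return_type="list"
)
print(params)
```

The resulting dict has the same shape as the tokenization_params arguments used in the sif4sci examples.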

Examples

>>> test_item = r"如图所示,则$\bigtriangleup ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
>>> tl = sif4sci(test_item)
>>> tl
['如图所示', '\\bigtriangleup', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tl.describe()
{'t': 2, 'f': 2, 'g': 1, 'm': 1}
>>> with tl.filter('fgm'):
...     tl
['如图所示', '面积']
>>> with tl.filter(keep='t'):
...     tl
['如图所示', '面积']
>>> with tl.filter():
...     tl
['如图所示', '\\bigtriangleup', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tl.text_tokens
['如图所示', '面积']
>>> tl.formula_tokens
['\\bigtriangleup', 'ABC']
>>> tl.figure_tokens
[\FigureID{1}]
>>> tl.ques_mark_tokens
['\\SIFBlank']
>>> sif4sci(test_item, symbol="gm", tokenization_params={"formula_params": {"method": "ast"}})
['如图所示', <Formula: \bigtriangleup ABC>, '面积', '[MARK]', '[FIGURE]']
>>> sif4sci(test_item, symbol="tfgm")
['[TEXT]', '[FORMULA]', '[TEXT]', '[MARK]', '[TEXT]', '[FIGURE]']
>>> sif4sci(test_item, symbol="gm",
... tokenization_params={"formula_params": {"method": "ast", "return_type": "list"}})
['如图所示', '\\bigtriangleup', 'A', 'B', 'C', '面积', '[MARK]', '[FIGURE]']
>>> test_item_1 = {
...     "stem": r"若$x=2$, $y=\sqrt{x}$,则下列说法正确的是$\SIFChoice$",
...     "options": [r"$x < y$", r"$y = x$", r"$y < x$"]
... }
>>> tls = [
...     sif4sci(e, symbol="gm",
...     tokenization_params={
...         "formula_params": {
...             "method": "ast", "return_type": "list", "ord2token": True, "var_numbering": True,
...             "link_variable": False}
...     })
...     for e in ([test_item_1["stem"]] + test_item_1["options"])
... ]
>>> tls[1:]
[['mathord_0', '<', 'mathord_1'], ['mathord_0', '=', 'mathord_1'], ['mathord_0', '<', 'mathord_1']]
>>> link_formulas(*tls)
>>> tls[1:]
[['mathord_0', '<', 'mathord_1'], ['mathord_1', '=', 'mathord_0'], ['mathord_1', '<', 'mathord_0']]
>>> from EduNLP.utils import dict2str4sif
>>> test_item_1_str = dict2str4sif(test_item_1, tag_mode="head", add_list_no_tag=False)
>>> test_item_1_str  
'$\\SIFTag{stem}$...则下列说法正确的是$\\SIFChoice$$\\SIFTag{options}$$x < y$$\\SIFSep$$y = x$$\\SIFSep$$y < x$'
>>> tl1 = sif4sci(test_item_1_str, symbol="gm",
... tokenization_params={"formula_params": {"method": "ast", "return_type": "list", "ord2token": True}})
>>> tl1.get_segments()[0]
['\\SIFTag{stem}']
>>> tl1.get_segments()[1:3]
[['[TEXT_BEGIN]', '[TEXT_END]'], ['[FORMULA_BEGIN]', 'mathord', '=', 'textord', '[FORMULA_END]']]
>>> tl1.get_segments(add_seg_type=False)[0:3]
[['\\SIFTag{stem}'], ['mathord', '=', 'textord'], ['mathord', '=', 'mathord', '{ }', '\\sqrt']]
>>> test_item_2 = {"options": [r"$x < y$", r"$y = x$", r"$y < x$"]}
>>> test_item_2
{'options': ['$x < y$', '$y = x$', '$y < x$']}
>>> test_item_2_str = dict2str4sif(test_item_2, tag_mode="head", add_list_no_tag=False)
>>> test_item_2_str
'$\\SIFTag{options}$$x < y$$\\SIFSep$$y = x$$\\SIFSep$$y < x$'
>>> tl2 = sif4sci(test_item_2_str, symbol="gms",
... tokenization_params={"formula_params": {"method": "ast", "return_type": "list"}})
>>> tl2
['\\SIFTag{options}', 'x', '<', 'y', '[SEP]', 'y', '=', 'x', '[SEP]', 'y', '<', 'x']
>>> tl2.get_segments(add_seg_type=False)
[['\\SIFTag{options}'], ['x', '<', 'y'], ['[SEP]'], ['y', '=', 'x'], ['[SEP]'], ['y', '<', 'x']]
>>> tl2.get_segments(add_seg_type=False, drop="s")
[['\\SIFTag{options}'], ['x', '<', 'y'], ['y', '=', 'x'], ['y', '<', 'x']]
>>> tl3 = sif4sci(test_item_1["stem"], symbol="gs")
>>> tl3.text_segments
[['说法', '正确']]
>>> tl3.formula_segments
[['x', '=', '2'], ['y', '=', '\\sqrt', '{', 'x', '}']]
>>> tl3.figure_segments
[]
>>> tl3.ques_mark_segments
[['\\SIFChoice']]
>>> test_item_3 = r"已知$y=x$,则以下说法中$\textf{正确,b}$的是"
>>> tl4 = sif4sci(test_item_3)
Warning: there is some chinese characters in formula!
>>> tl4.text_segments
[['已知'], ['说法', '中', '正确']]
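The link_formulas step in the examples above changes how variables are numbered: before linking, each formula numbers its variables independently; afterwards the same variable carries the same suffix everywhere. The following sketch reproduces that effect on plain token lists. It is not the library's implementation, `number_variables` is a hypothetical name, and treating every alphabetic token as a variable is an assumption.

```python
def number_variables(formulas, linked=False):
    """Replace variable tokens with mathord_<n>, numbered by first appearance."""
    shared = {}
    result = []
    for tokens in formulas:
        # One table for all formulas when linked, a fresh one per formula otherwise.
        table = shared if linked else {}
        out = []
        for tok in tokens:
            if tok.isalpha():  # assumption: alphabetic tokens are variables
                table.setdefault(tok, len(table))
                out.append(f"mathord_{table[tok]}")
            else:
                out.append(tok)
        result.append(out)
    return result


options = [["x", "<", "y"], ["y", "=", "x"], ["y", "<", "x"]]
print(number_variables(options))
print(number_variables(options, linked=True))
```

The first call gives every option the pattern mathord_0, operator, mathord_1; the second keeps x as mathord_0 and y as mathord_1 throughout, matching the two tls[1:] outputs shown in the examples.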

EduNLP.Formula

EduNLP.Formula.ast.str2ast(formula: str, *args, **kwargs)[source]

Interface for string input.

EduNLP.Formula.ast.get_edges(forest)[source]

Construct the edge set.

Parameters

forest (List[Dict]) – the forest

Returns

edges – the edge set

Return type

list of tuple(src,dst,type)

EduNLP.Formula.ast.ast(formula: (<class 'str'>, typing.List[typing.Dict]), index=0, forest_begin=0, father_tree=None, is_str=False)[source]

The original code was written by https://github.com/hxwujinze

Parameters
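  • formula (str or List[Dict]) – the formula string, or the structure obtained by parsing it with KaTeX

  • index (int) – the position of this subtree in the tree

  • forest_begin (int) – the starting position of this tree in the forest

  • father_tree (List[Dict]) – the parent tree

  • is_str (bool) –

Returns

  • tree (List[Dict]) – the feature tree formed by re-parsing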

  • todo (finish all types)

Notes

Some functions are not supported in KaTeX, e.g.,

  1. tag
    • \begin{equation} \tag{tagName} F=ma \end{equation}

    • \begin{align} \tag{1} y=x+z \end{align}

    • \tag*{hi} x+y^{2x}

  2. dddot
    • \frac{ \dddot y }{ x }

For more information, refer to katex support table
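Since the notes list \tag and \dddot as unsupported, it can help to screen formulas before parsing them. The sketch below is one way to do that with a regular expression. `find_unsupported` is a hypothetical helper, and the pattern covers only the two commands named in the notes, not the full KaTeX support table.

```python
import re

# Commands the notes above list as unsupported by KaTeX.
_UNSUPPORTED = re.compile(r"\\(?:tag\*?|dddot)(?![A-Za-z])")


def find_unsupported(formula):
    """Return the unsupported commands found in `formula`, in order."""
    return [m.group(0) for m in _UNSUPPORTED.finditer(formula)]


print(find_unsupported(r"\frac{ \dddot y }{ x }"))  # ['\\dddot']
print(find_unsupported(r"\tag*{hi} x+y^{2x}"))      # ['\\tag*']
print(find_unsupported(r"x+y"))                     # []
```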

Build the forest.

Parameters

forest (List[Dict]) –

Returns

trees

Return type

List[Dict]

EduNLP.Formula.ast.katex_parse(formula)[source]

Pass the formula to KaTeX for syntactic parsing.

EduNLP.I2V

class EduNLP.I2V.i2v.I2V(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]

This class is only a base interface and should not be used directly. To get vectors from items, use a concrete model such as D2V or W2V.

Parameters
  • tokenizer (str) – the tokenizer name

  • t2v (str) – the name of token2vector model

  • args – the parameters passed to t2v

  • tokenizer_kwargs (dict) – the parameters passed to tokenizer

  • pretrained_t2v (bool) –

    • True: use pretrained t2v model

    • False: use your own t2v model

  • kwargs – the parameters passed to t2v

Examples

>>> item = {"如图来自古希腊数学家希波克拉底所研究的几何图形.此图由三个半圆构成,三个半圆的直径分别为直角三角形$ABC$的斜边$BC$, \
... 直角边$AB$, $AC$.$\bigtriangleup ABC$的三边所围成的区域记为$I$,黑色部分记为$II$, 其余部分记为$III$.在整个图形中随机取一点,\
... 此点取自$I,II,III$的概率分别记为$p_1,p_2,p_3$,则$\SIFChoice$$\FigureID{1}$"}
>>> model_path = "examples/test_model/d2v/test_gensim_luna_stem_tf_d2v_256.bin"
>>> i2v = D2V("text","d2v",filepath=model_path, pretrained_t2v = False)
>>> i2v(item)
([array([ ...dtype=float32)], None)
Returns

i2v model

Return type

I2V

tokenize(items, *args, indexing=True, padding=False, key=<function I2V.<lambda>>, **kwargs) list[source]
infer_vector(items, tokenize=True, indexing=False, padding=False, key=<function I2V.<lambda>>, **kwargs) tuple[source]
infer_item_vector(tokens, *args, **kwargs) ...[source]
infer_token_vector(tokens, *args, **kwargs) ...[source]
save(config_path)[source]
classmethod load(config_path, *args, **kwargs)[source]
classmethod from_pretrained(name, model_dir='/home/docs/.EduNLP/model', *args, **kwargs)[source]
property vector_size
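The examples show that calling an I2V object returns a pair of item vectors and token vectors (with None in the second slot for D2V). The toy class below sketches how a tokenizer and a token-to-vector model could be composed to produce such a pair. `ToyI2V` is hypothetical, it does not reproduce the library's internals, and averaging token vectors into an item vector is an assumption made only for illustration.

```python
class ToyI2V:
    """Tokenize items, then map tokens to vectors (illustrative only)."""

    def __init__(self, tokenizer, t2v):
        self.tokenizer = tokenizer  # callable: item -> list of tokens
        self.t2v = t2v              # callable: token -> list of floats

    def tokenize(self, items):
        return [self.tokenizer(item) for item in items]

    def infer_vector(self, items, tokenize=True):
        tokens = self.tokenize(items) if tokenize else items
        token_vectors = [[self.t2v(tok) for tok in toks] for toks in tokens]
        # Item vector here is the element-wise mean of its token vectors.
        item_vectors = [
            [sum(dim) / len(vecs) for dim in zip(*vecs)] for vecs in token_vectors
        ]
        return item_vectors, token_vectors

    def __call__(self, items, **kwargs):
        return self.infer_vector(items, **kwargs)


i2v = ToyI2V(tokenizer=str.split, t2v=lambda tok: [float(len(tok)), 1.0])
item_vectors, token_vectors = i2v(["a bc", "def"])
print(item_vectors)   # [[1.5, 1.0], [3.0, 1.0]]
print(token_vectors)  # [[[1.0, 1.0], [2.0, 1.0]], [[3.0, 1.0]]]
```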
class EduNLP.I2V.i2v.D2V(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]

Converts an item to a vector directly.

Bases

I2V

Parameters
  • tokenizer (str) – the tokenizer name

  • t2v (str) – the name of token2vector model

  • args – the parameters passed to t2v

  • tokenizer_kwargs (dict) – the parameters passed to tokenizer

  • pretrained_t2v (bool) – True: use pretrained t2v model; False: use your own t2v model

  • kwargs – the parameters passed to t2v

Examples

>>> item = {"如图来自古希腊数学家希波克拉底所研究的几何图形.此图由三个半圆构成,三个半圆的直径分别为直角三角形$ABC$的斜边$BC$, \
... 直角边$AB$, $AC$.$\bigtriangleup ABC$的三边所围成的区域记为$I$,黑色部分记为$II$, 其余部分记为$III$.在整个图形中随机取一点,\
... 此点取自$I,II,III$的概率分别记为$p_1,p_2,p_3$,则$\SIFChoice$$\FigureID{1}$"}
>>> model_path = "examples/test_model/d2v/d2v_test_256/d2v_test_256.bin"
>>> i2v = D2V("text","d2v",filepath=model_path, pretrained_t2v = False)
>>> i2v(item)
([array([ ...dtype=float32)], None)
Returns

i2v model

Return type

I2V

infer_vector(items, tokenize=True, indexing=False, padding=False, key=<function D2V.<lambda>>, *args, **kwargs) tuple[source]

Convert items to vectors. The model must be loaded before this function is called.

Parameters
  • items (str) – the text of question

  • tokenize (bool) – True: tokenize the item

  • indexing (bool) –

  • padding (bool) –

  • key (lambda function) – the parameter passed to tokenizer, select the text to be processed

  • args – the parameters passed to t2v

  • kwargs – the parameters passed to t2v

Returns

vector

Return type

list

classmethod from_pretrained(name, model_dir='/home/docs/.EduNLP/model', *args, **kwargs)[source]
class EduNLP.I2V.i2v.W2V(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]

Converts tokens to vectors.

Bases

I2V

Parameters
  • tokenizer (str) – the tokenizer name

  • t2v (str) – the name of token2vector model

  • args – the parameters passed to t2v

  • tokenizer_kwargs (dict) – the parameters passed to tokenizer

  • pretrained_t2v (bool) – True: use pretrained t2v model; False: use your own t2v model

  • kwargs – the parameters passed to t2v

Examples

>>> (); i2v = get_pretrained_i2v("w2v_test_256", "examples/test_model/w2v"); () 
(...)
>>> item_vector, token_vector = i2v(["有学者认为:‘学习’,必须适应实际"])
>>> item_vector 
[array([...], dtype=float32)]
Returns

i2v model

Return type

W2V

infer_vector(items, tokenize=True, indexing=False, padding=False, key=<function W2V.<lambda>>, *args, **kwargs) tuple[source]

Convert items to vectors. The model must be loaded before this function is called.

Parameters
  • items (str) – the text of question

  • tokenize (bool) – True: tokenize the item

  • indexing (bool) –

  • padding (bool) –

  • key (lambda function) – the parameter passed to tokenizer, select the text to be processed

  • args – the parameters passed to t2v

  • kwargs – the parameters passed to t2v

Returns

vector

Return type

list

classmethod from_pretrained(name, model_dir='/home/docs/.EduNLP/model', *args, **kwargs)[source]
class EduNLP.I2V.i2v.Elmo(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]

Converts items and tokens to vectors with Elmo.

Bases

I2V

Parameters
  • tokenizer (str) – the tokenizer name

  • t2v (str) – the name of token2vector model

  • args – the parameters passed to t2v

  • tokenizer_kwargs (dict) – the parameters passed to tokenizer

  • pretrained_t2v (bool) – True: use pretrained t2v model; False: use your own t2v model

  • kwargs – the parameters passed to t2v

Returns

i2v model

Return type

Elmo

infer_vector(items, tokenize=True, return_tensors='pt', *args, **kwargs) tuple[source]

Convert items to vectors. The model must be loaded before this function is called.

Parameters
  • items (str or list) – the text of question

  • tokenize (bool) – True: tokenize the item

  • return_tensors (str) – tensor type used in tokenizer

  • args – the parameters passed to t2v

  • kwargs – the parameters passed to t2v

Returns

vector

Return type

list

classmethod from_pretrained(name, model_dir='/home/docs/.EduNLP/model', *args, **kwargs)[source]
class EduNLP.I2V.i2v.Bert(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]

Converts items and tokens to vectors with Bert.

Bases

I2V

Parameters
  • tokenizer (str) – the tokenizer name

  • t2v (str) – the name of token2vector model

  • args – the parameters passed to t2v

  • tokenizer_kwargs (dict) – the parameters passed to tokenizer

  • pretrained_t2v (bool) – True: use pretrained t2v model; False: use your own t2v model

  • kwargs – the parameters passed to t2v

Returns

i2v model

Return type

Bert

infer_vector(items, tokenize=True, return_tensors='pt', *args, **kwargs) tuple[source]

Convert items to vectors. The model must be loaded before this function is called.

Parameters
  • items (str or list) – the text of question

  • tokenize (bool) – True: tokenize the item

  • return_tensors (str) – tensor type used in tokenizer

  • args – the parameters passed to t2v

  • kwargs – the parameters passed to t2v

Returns

vector

Return type

list

classmethod from_pretrained(name, model_dir='/home/docs/.EduNLP/model', *args, **kwargs)[source]
class EduNLP.I2V.i2v.DisenQ(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]

Converts items and tokens to vectors with DisenQ.

Bases

I2V

Parameters
  • tokenizer (str) – the tokenizer name

  • t2v (str) – the name of token2vector model

  • args – the parameters passed to t2v

  • tokenizer_kwargs (dict) – the parameters passed to tokenizer

  • pretrained_t2v (bool) – True: use pretrained t2v model; False: use your own t2v model

  • kwargs – the parameters passed to t2v

Returns

i2v model

Return type

DisenQ

infer_vector(items: (<class 'dict'>, <class 'list'>), tokenize=True, key=<function DisenQ.<lambda>>, vector_type=None, **kwargs) tuple[source]

Convert items to vectors. The model must be loaded before this function is called.

Parameters
  • items (dict or list) – the item of question

  • tokenize (bool) – True: tokenize the item

  • key (lambda function) – the parameter passed to tokenizer, select the text to be processed

  • args – the parameters passed to t2v

  • kwargs – the parameters passed to t2v

Returns

vector

Return type

list

classmethod from_pretrained(name, model_dir='/home/docs/.EduNLP/model', **kwargs)[source]
class EduNLP.I2V.i2v.QuesNet(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]

Converts items and tokens to vectors with QuesNet.

Bases

I2V

infer_vector(item, tokenize=True, key=<function QuesNet.<lambda>>, meta=['know_name'], *args, **kwargs)[source]

Convert items to vectors. The model must be loaded before this function is called.

Parameters
  • item (str or dict or list) – the item of question, or question list

  • tokenize (bool, optional) – True: tokenize the item

  • key (optional) – by default lambda x: x

  • meta (list, optional) – meta information, by default [‘know_name’]

  • args – the parameters passed to t2v

  • kwargs – the parameters passed to t2v

Returns

  • token embeddings

  • question embedding

classmethod from_pretrained(name, model_dir='/home/docs/.EduNLP/model', *args, **kwargs)[source]
EduNLP.I2V.i2v.get_pretrained_i2v(name, model_dir='/home/docs/.EduNLP/model')[source]

A convenient entry point if you want to convert items to vectors easily.

Parameters
  • name (str) – the name of item2vector model, e.g. d2v_math_300, w2v_math_300, elmo_math_2048, bert_math_768, bert_taledu_768, disenq_math_256, quesnet_math_512

  • model_dir (str) – the path of model, default: MODEL_DIR = ‘~/.EduNLP/model’

Returns

i2v model

Return type

I2V

Examples

>>> item = {"如图来自古希腊数学家希波克拉底所研究的几何图形.此图由三个半圆构成,三个半圆的直径分别为直角三角形$ABC$的斜边$BC$, \
... 直角边$AB$, $AC$.$\bigtriangleup ABC$的三边所围成的区域记为$I$,黑色部分记为$II$, 其余部分记为$III$.在整个图形中随机取一点,\
... 此点取自$I,II,III$的概率分别记为$p_1,p_2,p_3$,则$\SIFChoice$$\FigureID{1}$"}
>>> (); i2v = get_pretrained_i2v("d2v_test_256", "examples/test_model/d2v"); () 
(...)
>>> print(i2v(item))
([array([ ...dtype=float32)], None)

EduNLP.Pretrain

EduNLP.Pretrain.train_vector(items, w2v_prefix, embedding_dim=None, method='sg', binary=None, train_params=None)[source]
Parameters
  • items (str) – the text of question

  • w2v_prefix

  • embedding_dim (int) – vector_size

  • method (str) – the method of training, e.g.: sg, cbow, fasttext, d2v, bow, tfidf

  • binary (model format) – True:bin; False:kv

  • train_params (dict) – the training parameters passed to model

Returns

the path of the saved model file

Return type

str

Examples

>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{1}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> train_vector(token_item[:10], "examples/test_model/w2v/gensim_luna_stem_t_", 100) 
'examples/test_model/w2v/gensim_luna_stem_t_sg_100.kv'
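The example output suggests that the saved file name is built from w2v_prefix, the method, and the embedding dimension, with the extension chosen by binary. The helper below spells out that pattern. It is inferred from this single example only; `expected_model_path` is a hypothetical name, and other methods or the binary=None default may behave differently in the library.

```python
def expected_model_path(w2v_prefix, method="sg", embedding_dim=100, binary=False):
    """Compose '<prefix><method>_<dim>.<ext>' as seen in the example above."""
    ext = "bin" if binary else "kv"  # documented: True -> bin, False -> kv
    return f"{w2v_prefix}{method}_{embedding_dim}.{ext}"


print(expected_model_path("examples/test_model/w2v/gensim_luna_stem_t_", "sg", 100))
# examples/test_model/w2v/gensim_luna_stem_t_sg_100.kv
```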
class EduNLP.Pretrain.GensimWordTokenizer(symbol='gm', general=False)[source]
Parameters
  • symbol (str) –

    select the methods to symbolize:

    “t”: text, “f”: formula, “g”: figure, “m”: question mark, “a”: tag, “s”: sep,

    e.g.: gm, fgm, gmas, fgmas

  • general (bool) –

    True: use when the item isn’t in standard format and formulas (except formulas in figures) should be tokenized linearly.

    False: use the ‘ast’ method to tokenize formulas instead of ‘linear’.

Returns

tokenizer

Return type

Tokenizer
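The symbol string selects which segment types are replaced by a placeholder token. A small sketch of that idea, using typed segments in place of a real parsed item, is given here. `symbolize` and `PLACEHOLDERS` are hypothetical names; the placeholders [TEXT], [FORMULA], [FIGURE], [MARK] and [SEP] appear in the documented outputs, while using [TAG] for “a” is an assumption.

```python
PLACEHOLDERS = {
    "t": "[TEXT]", "f": "[FORMULA]", "g": "[FIGURE]",
    "m": "[MARK]", "a": "[TAG]", "s": "[SEP]",
}


def symbolize(segments, symbol):
    """Replace each segment whose type letter is in `symbol` by its placeholder."""
    return [
        PLACEHOLDERS[kind] if kind in symbol else value
        for kind, value in segments
    ]


segments = [
    ("t", "如图所示"), ("f", r"\bigtriangleup ABC"), ("t", "面积"),
    ("m", r"\SIFBlank"), ("g", r"\FigureID{1}"),
]
print(symbolize(segments, "gm"))
print(symbolize(segments, "tfgm"))
```

With symbol "gm" only the question mark and the figure are replaced; with "tfgm" every segment becomes a placeholder.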

Examples

>>> tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
>>> tokenizer = GensimWordTokenizer(symbol="fgmas", general=False)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', '[FORMULA]', '约束条件', '公式', '[FORMULA]', '[SEP]', '[FORMULA]']
batch_process(*items)[source]
class EduNLP.Pretrain.GensimSegTokenizer(symbol='gms', depth=None, flatten=False, **kwargs)[source]
Parameters
  • symbol (str) –

    select the methods to symbolize:

    “t”: text, “f”: formula, “g”: figure, “m”: question mark, “a”: tag, “s”: sep,

    e.g. gms, fgm

  • depth (int or None) – 0: only separate at SIFSep ; 1: only separate at SIFTag ; 2: separate at SIFTag and SIFSep ; otherwise, separate all segments ;

Returns

tokenizer

Return type

Tokenizer
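The depth values can be read as a choice of which markers act as boundaries between groups of tokens. The sketch below applies that reading to a list of typed segments. It is not the library's code: `group_segments` is a hypothetical name, and dropping the boundary marker itself from the output is an assumption.

```python
def group_segments(segments, depth=None):
    """Group (kind, tokens) segments according to `depth`.

    `kind` is a symbol letter; 's' marks SIFSep and 'a' marks SIFTag.
    """
    boundaries = {0: {"s"}, 1: {"a"}, 2: {"a", "s"}}.get(depth)
    if boundaries is None:
        # Any other depth: every segment stays separate.
        return [list(tokens) for _, tokens in segments]
    groups, current = [], []
    for kind, tokens in segments:
        if kind in boundaries:
            # A boundary closes the current group; the marker is dropped.
            if current:
                groups.append(current)
            current = []
        else:
            current.extend(tokens)
    if current:
        groups.append(current)
    return groups


segments = [
    ("a", ["\\SIFTag{options}"]), ("f", ["x", "<", "y"]),
    ("s", ["[SEP]"]), ("f", ["y", "=", "x"]),
]
print(group_segments(segments, depth=0))
print(group_segments(segments, depth=2))
print(group_segments(segments, depth=None))
```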

Examples

>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{1}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> tokenizer = GensimSegTokenizer(symbol="fgm", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{1}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{2}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], ['[FORMULA]'], ['如图'], ['[FIGURE]'], ['[FORMULA]'],...['[FORMULA]'], ['最大值'], ['[MARK]']]
class EduNLP.Pretrain.ElmoTokenizer(path: Optional[str] = None)[source]

Examples

>>> t=ElmoTokenizer()
>>> items = ["有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"]
>>> t.tokenize(items[0])
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']
>>> len(t)
18
tokenize(item: (<class 'str'>, <class 'list'>), freeze_vocab=False, return_length=False)[source]
to_index(item: list, max_length=128, pad_to_max_length=False)[source]
append(item)[source]
save_vocab(path)[source]
load_vocab(path)[source]
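The to_index signature (max_length, pad_to_max_length) points to the usual lookup, truncate and pad routine. A self-contained sketch of that routine is shown below. `ToyVocab` is hypothetical and is not the ElmoTokenizer implementation; the special tokens `<pad>` and `<unk>` and their ids are assumptions.

```python
class ToyVocab:
    """Grow-on-demand vocabulary with <pad>=0 and <unk>=1 (illustrative)."""

    def __init__(self):
        self.t2id = {"<pad>": 0, "<unk>": 1}

    def append(self, token):
        # New tokens get the next free id; known tokens are left alone.
        self.t2id.setdefault(token, len(self.t2id))

    def to_index(self, tokens, max_length=128, pad_to_max_length=False):
        # Truncate first, then map unknown tokens to <unk>.
        ids = [self.t2id.get(tok, self.t2id["<unk>"]) for tok in tokens[:max_length]]
        if pad_to_max_length:
            ids += [self.t2id["<pad>"]] * (max_length - len(ids))
        return ids

    def __len__(self):
        return len(self.t2id)


vocab = ToyVocab()
for tok in ["公式", "如图", "x"]:
    vocab.append(tok)

print(vocab.to_index(["公式", "y", "x"], max_length=5, pad_to_max_length=True))
# [2, 1, 4, 0, 0]
print(len(vocab))  # 5
```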
class EduNLP.Pretrain.ElmoDataset(texts: list, tokenizer: ElmoTokenizer, max_length=128)[source]
EduNLP.Pretrain.train_elmo(texts: list, output_dir: str, pretrained_dir: Optional[str] = None, emb_dim=512, hid_dim=512, batch_size=2, epochs=3, lr: float = 0.0005, device=None)[source]
Parameters
  • texts (list, required) – The training corpus of shape (text_num, token_num), a text must be tokenized into tokens

  • output_dir (str, required) – The directory to save trained model files

  • pretrained_dir (str, optional) – The pretrained model files’ directory

  • emb_dim (int, optional, default=512) – The embedding dim

  • hid_dim (int, optional, default=512) – The hidden dim

  • batch_size (int, optional, default=2) – The training batch size

  • epochs (int, optional, default=3) – The training epochs

  • lr (float, optional, default=5e-4) – The learning rate

  • device (str, optional) – Default is ‘cuda’ if available, otherwise ‘cpu’

Returns

output_dir – The directory that trained model files are saved

Return type

str

class EduNLP.Pretrain.BertTokenizer(pretrain_model='bert-base-chinese', add_special_tokens=False, text_tokenizer=None)[source]
Parameters
  • pretrain_model – used pretrained model

  • add_special_tokens – Whether to add tokens like [FIGURE], [TAG], etc.

  • text_tokenizer – Which text tokenizer to use. Must be consistent with TOKENIZER dictionary.

Examples

>>> tokenizer = BertTokenizer(add_special_tokens=True)
>>> item = "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$"
>>> token_item = tokenizer(item)
>>> print(token_item.input_ids[:10])
[101, 1062, 2466, 1963, 1745, 21129, 166, 117, 167, 5276]
>>> print(tokenizer.tokenize(item)[:10])
['公', '式', '如', '图', '[FIGURE]', 'x', ',', 'y', '约', '束']
>>> items = [item, item]
>>> token_items = tokenizer(items, return_tensors='pt')
>>> print(token_items.input_ids.shape)
torch.Size([2, 27])
>>> print(len(tokenizer.tokenize(items)))
2
>>> tokenizer.save_pretrained('test_dir') 
>>> tokenizer = BertTokenizer.from_pretrained('test_dir') 
tokenize(item: Union[list, str], *args, **kwargs)[source]
save_pretrained(tokenizer_config_dir)[source]
classmethod from_pretrained(tokenizer_config_dir)[source]
EduNLP.Pretrain.finetune_bert(items, output_dir, pretrain_model='bert-base-chinese', train_params=None)[source]
Parameters
  • items (dict) – the tokenization results of questions

  • output_dir (str) – the path to save the model

  • pretrain_model (str) – the name or path of pre-trained model

  • train_params (dict) – the training parameters passed to Trainer

Examples

>>> tokenizer = BertTokenizer()
>>> stems = ["有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$",
... "有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$"]
>>> token_item = [tokenizer(i) for i in stems]
>>> print(token_item[0].keys())
dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
>>> finetune_bert(token_item, "examples/test_model/data/data/bert") 
{'train_runtime': ..., ..., 'epoch': 1.0}
class EduNLP.Pretrain.QuesNetTokenizer(img_dir=None, vocab_path=None, max_length=250, meta=None, img_token='<img>', unk_token='<unk>', pad_token='<pad>', *args, **argv)[source]

Examples

>>> tokenizer = QuesNetTokenizer(meta=['knowledge'])
>>> test_items = [{"ques_content": "$\triangle A B C$ 的内角为 $A, \quad B, $\FigureID{test_id}$",
... "knowledge": "['*', '-', '/']"}, {"ques_content": "$\triangle A B C$ 的内角为 $A, \quad B",
... "knowledge": "['*', '-', '/']"}]
>>> tokenizer.set_vocab(test_items,
... trim_min_count=1, key=lambda x: x["ques_content"], silent=True)
>>> token_items = [tokenizer(i, key=lambda x: x["ques_content"]) for i in test_items]
>>> print(token_items[0].keys())
dict_keys(['content_idx', 'meta_idx'])
>>> token_items = tokenizer(test_items, key=lambda x: x["ques_content"])
>>> print(len(token_items["content_idx"]))
2
tokenize(item: ~typing.Union[str, dict, list], key=<function QuesNetTokenizer.<lambda>>, *args, **kwargs)[source]
load_vocab(path)[source]
Parameters

path (str) – path of the vocabulary files; it must be a directory containing word.txt (meta.txt is optional)

set_vocab(items: list, key=<function QuesNetTokenizer.<lambda>>, trim_min_count=50, silent=True)[source]
Parameters
  • items (list) – can be the list of str, or list of dict

  • key (function) – determine how to get the text of each item

  • trim_min_count

  • silent
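The key and trim_min_count parameters suggest a frequency-filtered vocabulary built from the text that key extracts from each item. The sketch below shows one plausible form of such a step. `build_vocab` is a hypothetical function, not the tokenizer's method; whitespace tokenization and keeping tokens whose count is at least trim_min_count are both assumptions.

```python
from collections import Counter


def build_vocab(items, key=lambda x: x, trim_min_count=50,
                specials=("<pad>", "<unk>")):
    """Map tokens seen at least `trim_min_count` times to integer ids."""
    counts = Counter()
    for item in items:
        counts.update(key(item).split())  # assumption: whitespace tokenization
    kept = sorted(tok for tok, n in counts.items() if n >= trim_min_count)
    # Special tokens come first so their ids stay stable.
    return {tok: i for i, tok in enumerate(list(specials) + kept)}


items = [{"ques_content": "a b a"}, {"ques_content": "a c"}]
vocab = build_vocab(items, key=lambda x: x["ques_content"], trim_min_count=2)
print(vocab)  # {'<pad>': 0, '<unk>': 1, 'a': 2}
```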

save_vocab(save_vocab_path)[source]
Parameters

save_vocab_path (str) – path to save word vocabulary and meta vocabulary

classmethod from_pretrained(tokenizer_config_dir, img_dir=None)[source]
tokenizer_config_dir: str

must contain tokenizer_config.json, vocab/word.txt and vocab/meta_{meta_name}.txt

img_dir: str

the path of the image directory, default None

save_pretrained(tokenizer_config_dir)[source]
tokenizer_config_dir: str

save tokenizer params in tokenizer_config.json and save words in vocab.list

padding(idx, max_length, type='word')[source]
property vocab_size
set_img_dir(path)[source]
EduNLP.Pretrain.pretrain_quesnet(path, output_dir, tokenizer, save_embs=False, train_params=None)[source]

pretrain quesnet

Parameters
  • path (str) – path of question file

  • output_dir (str) – output path

  • tokenizer (QuesNetTokenizer) – quesnet tokenizer

  • save_embs (bool, optional) – whether to save pretrained word/image/meta embeddings separately

  • train_params (dict, optional) –

    the training parameters and model parameters, by default None

    • “n_epochs”: int, default = 1

      train param, number of epochs

    • “batch_size”: int, default = 6

      train param, batch size

    • “lr”: float, default = 1e-3

      train param, learning rate

    • “save_every”: int, default = 0

      train param, save steps interval

    • “log_steps”: int, default = 10

      train param, log steps interval

    • “device”: str, default = ‘cpu’

      train param, ‘cpu’ or ‘cuda’

    • “max_steps”: int, default = 0

      train param, stop training when reach max steps

    • “emb_size”: int, default = 256

      model param, the embedding size of word, figure, meta info

    • “feat_size”: int, default = 256

      model param, the size of question infer vector
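Because train_params is optional, a caller usually supplies only the keys that differ from the defaults. The sketch below collects the documented defaults in one dict and overlays user values on top. `QUESNET_DEFAULTS` and `resolve_train_params` are hypothetical names, and rejecting unknown keys is a design choice of this sketch rather than documented behaviour.

```python
QUESNET_DEFAULTS = {
    "n_epochs": 1, "batch_size": 6, "lr": 1e-3, "save_every": 0,
    "log_steps": 10, "device": "cpu", "max_steps": 0,
    "emb_size": 256, "feat_size": 256,
}


def resolve_train_params(train_params=None):
    """Overlay user-supplied values on the documented defaults."""
    unknown = set(train_params or {}) - set(QUESNET_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown train_params keys: {sorted(unknown)}")
    return {**QUESNET_DEFAULTS, **(train_params or {})}


print(resolve_train_params({"n_epochs": 3, "device": "cuda"}))
```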

Examples

>>> tokenizer = QuesNetTokenizer(meta=['know_name'])
>>> items = [{"ques_content": "若复数$z=1+2 i+i^{3}$,则$|z|=$,$\FigureID{000004d6-0479-11ec-829b-797d5eb43535}$",
... "ques_id": "726cdbec-33a9-11ec-909c-98fa9b625adb",
... "know_name": "['代数', '集合', '集合的相等']"
... }]
>>> tokenizer.set_vocab(items, key=lambda x: x['ques_content'], trim_min_count=1, silent=True)
>>> pretrain_quesnet('./data/quesnet_data.json', './testQuesNet', tokenizer) 
class EduNLP.Pretrain.Question(id, content, answer, false_options, labels)
property answer

Alias for field number 2

property content

Alias for field number 1

property false_options

Alias for field number 3

property id

Alias for field number 0

property labels

Alias for field number 4
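The "Alias for field number" wording is what Python generates for namedtuple fields, so Question appears to be a namedtuple with the five fields in the order listed. The snippet below builds a local mirror of it to show positional and named access. The mirror is defined here and not imported from EduNLP, and all field values are made up for illustration.

```python
from collections import namedtuple

# Local mirror of the documented field order; not imported from EduNLP.
Question = namedtuple(
    "Question", ["id", "content", "answer", "false_options", "labels"]
)

q = Question(
    id="q1", content="1 + 1 = ?", answer="2",
    false_options=["1", "3"], labels={"difficulty": 0.2},
)
print(q[1] == q.content)                 # True: field number 1 is content
print(q._replace(answer="two").answer)   # two
```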

class EduNLP.Pretrain.DisenQTokenizer(vocab_path=None, max_length=250, tokenize_method='space', num_token='<num>', unk_token='<unk>', pad_token='<pad>', *args, **argv)[source]

Examples

>>> tokenizer = DisenQTokenizer()
>>> test_items = [{
...     "content": "甲 数 除以 乙 数 的 商 是 1.5 , 如果 甲 数 增加 20 , 则 甲 数 是 乙 的 4 倍 . 原来 甲 数 = .",
...     "knowledge": ["*", "-", "/"], "difficulty": 0.2, "length": 7}]
>>> tokenizer.set_vocab(test_items,
...     trim_min_count=1, key=lambda x: x["content"], silent=True)
>>> token_items = [tokenizer(i, key=lambda x: x["content"]) for i in test_items]
>>> print(token_items[0].keys())
dict_keys(['content_idx', 'content_len'])
set_text_tokenizer(tokenize_method)[source]
tokenize(items: (<class 'list'>, <class 'str'>, <class 'dict'>), key=<function DisenQTokenizer.<lambda>>, **kwargs)[source]
Parameters
  • items (list or str or dict) – the question items

  • key (function) – determine how to get the text of each item

Returns

tokens – the token of items

Return type

list

load_vocab(path)[source]
set_vocab(items: list, key=<function DisenQTokenizer.<lambda>>, trim_min_count=1, silent=True)[source]
Parameters
  • items (list) – can be the list of str, or list of dict

  • key (function) – determine how to get the text of each item

save_vocab(save_vocab_path)[source]
classmethod from_pretrained(tokenizer_config_dir)[source]
tokenizer_config_dir: str

must contain tokenizer_config.json and vocab.list

save_pretrained(tokenizer_config_dir)[source]
tokenizer_config_dir: str

save tokenizer params in tokenizer_config.json and save words in vocab.list

property vocab_size
EduNLP.Pretrain.train_disenqnet(train_items, disen_tokenizer, output_dir, predata_dir, train_params=None, test_items=None, silent=False, data_formation=None)[source]
Parameters
  • train_items (list) – the raw train question list

  • disen_tokenizer (DisenQTokenizer) – the initial DisenQTokenizer used for training.

  • output_dir (str) – the path to save the model

  • predata_dir (str) – the dirname to load or save predata (including wv.th, vocab.list and concept.list)

  • train_params (dict, defaults to None) –

    the training parameters for data, model and trainer.

    • “trim_min”: int

      data param, the trim_min_count for vocab and word2vec, by default 2

    • ”w2v_workers”: int

      data param, the number of workers for word2vec, by default 1

    • ”hidden”: int

      model param, by default 128

    • ”dropout”: float

      model param, dropout rate, by default 0.2

    • ”pos_weight”: int

      model param, positive sample weight in unbalanced multi-label concept classifier, by default 1

    • ”cp”: float

      model param, weight of concept loss, by default 1.5

    • ”mi”: float

      model param, weight of mutual information loss, by default 1.0

    • ”dis”: float

      model param, weight of disentangling loss, by default 2.0

    • ”epoch”: int

      train param, number of epoch, by default 1

    • ”batch”: int

      train param, batch size, by default 64

    • ”lr”: float

      train param, learning rate, by default 1e-3

    • ”step”: int

      train param, step_size for StepLR, by default 20

    • ”gamma”: float

      train param, gamma for StepLR, by default 0.5

    • ”warm_up”: int

      train param, number of epoch for warming up, by default 1

    • ”adv”: int

      train param, ratio of disc/enc training for adversarial process, by default 10

    • ”device”: str

      train param, ‘cpu’ or ‘cuda’, by default “cpu”

  • test_items (list, defaults to None) – the raw test question list, default is None

  • silent (bool, defaults to False) – whether to print processing information

  • data_formation (dict, defaults to None) – Mapping “content” and “knowledge” for the item formation. For example, {“content”: “ques_content”, “knowledge”: “know_name”}
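
The defaults listed for train_params can be collected in one dictionary and overridden key by key. This is a minimal sketch: the default values are taken from the list above, while resolve_train_params and its rejection of unknown keys are illustrative assumptions, not the behaviour of train_disenqnet.

```python
DEFAULT_TRAIN_PARAMS = {
    # data params
    "trim_min": 2, "w2v_workers": 1,
    # model params
    "hidden": 128, "dropout": 0.2, "pos_weight": 1, "cp": 1.5, "mi": 1.0, "dis": 2.0,
    # train params
    "epoch": 1, "batch": 64, "lr": 1e-3, "step": 20, "gamma": 0.5,
    "warm_up": 1, "adv": 10, "device": "cpu",
}


def resolve_train_params(train_params=None):
    """Hypothetical helper: overlay user values on the documented defaults."""
    overrides = dict(train_params or {})
    unknown = set(overrides) - set(DEFAULT_TRAIN_PARAMS)
    if unknown:
        # Rejecting unknown keys is a design choice of this sketch.
        raise KeyError(f"unknown train_params keys: {sorted(unknown)}")
    return {**DEFAULT_TRAIN_PARAMS, **overrides}


params = resolve_train_params({"epoch": 5, "device": "cuda"})
```

Only the keys that differ from the defaults need to be passed; everything else keeps its documented value.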

Examples

>>> train_data = load_items("static/test_data/disenq_train.json")[:100]
>>> test_data = load_items("static/test_data/disenq_test.json")[:100]
>>> tokenizer = DisenQTokenizer(max_length=250, tokenize_method="space")
>>> train_disenqnet(train_data, tokenizer,
... "examples/test_model/disenq","examples/test_model/disenq", silent=True)  

EduNLP.Tokenizer

class EduNLP.Tokenizer.PureTextTokenizer(*args, **kwargs)[source]

Deals with text and plain-text formulas, and filters out special formulas such as $\FormFigureID{…}$ and $\FormFigureBase64{…}$.

Parameters
  • items (str) –

  • key

  • args

  • kwargs

Return type

token

Examples

>>> tokenizer = PureTextTokenizer()
>>> items = ["有公式$\\FormFigureID{1}$,如图$\\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\\FormFigureBase64{2}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$"]
>>> tokens = tokenizer(items)
>>> next(tokens)[:10]
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z']
>>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
>>> tokens = tokenizer(items)
>>> next(tokens)  
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
'0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
'\\quad', 'A', '\\cap', 'B', '=']
>>> items = [{
... "stem": "已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$",
... "options": ["1", "2"]
... }]
>>> tokens = tokenizer(items, key=lambda x: x["stem"])
>>> next(tokens)  
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
'0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
'\\quad', 'A', '\\cap', 'B', '=']
class EduNLP.Tokenizer.TextTokenizer(*args, **kwargs)[source]

Deals with text and formulas, including special formulas.

Parameters
  • items (str) –

  • key

  • args

  • kwargs

Return type

token

Examples

>>> tokenizer = TextTokenizer()
>>> items = ["有公式$\\FormFigureID{1}$,如图$\\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\\FormFigureBase64{2}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$"]
>>> tokens = tokenizer(items)
>>> next(tokens)[:10]
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
>>> items = ["$\\SIFTag{stem_begin}$若复数$z=1+2 i+i^{3}$,则$|z|=$$\\SIFTag{stem_end}$\
... $\\SIFTag{options_begin}$$\\SIFTag{list_0}$0$\\SIFTag{list_1}$1$\\SIFTag{list_2}$$\\sqrt{2}$\
... $\\SIFTag{list_3}$2$\\SIFTag{options_end}$"]
>>> tokens = tokenizer(items)
>>> next(tokens)[:10]
['[TAG]', '复数', 'z', '=', '1', '+', '2', 'i', '+', 'i']
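
The difference between the two tokenizers on special formulas can be approximated with regular expressions. This sketch only shows marker substitution; the patterns and mark_special are hypothetical, and the real tokenizers additionally segment words and formulas.

```python
import re

# Hypothetical patterns for the special SIF segments seen in the examples above.
FORM_FIGURE = re.compile(r"\$\\FormFigure(?:ID|Base64)\{[^}]*\}\$")
FIGURE = re.compile(r"\$\\FigureID\{[^}]*\}\$")
SEP = re.compile(r"\$\\SIFSep\$")


def mark_special(text, keep_formula_figures):
    """Replace special segments with markers; optionally drop formula figures."""
    text = FORM_FIGURE.sub("[FORMULA]" if keep_formula_figures else "", text)
    text = FIGURE.sub("[FIGURE]", text)
    return SEP.sub("[SEP]", text)


sample = "有公式$\\FormFigureID{1}$,如图$\\FigureID{088f15ea-xxx}$,$\\SIFSep$"
pure = mark_special(sample, keep_formula_figures=False)  # PureTextTokenizer-like
full = mark_special(sample, keep_formula_figures=True)   # TextTokenizer-like
```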
class EduNLP.Tokenizer.Tokenizer[source]
EduNLP.Tokenizer.get_tokenizer(name, *args, **kwargs)[source]

A unified interface for using different tokenizers.

Parameters
  • name (str) – the name of the tokenizer, e.g. text, pure_text

  • args – the parameters passed to the tokenizer

  • kwargs – the parameters passed to the tokenizer

Returns

tokenizer

Return type

Tokenizer

Examples

>>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
>>> tokenizer = get_tokenizer("text")
>>> tokens = tokenizer(items)
>>> next(tokens)  
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
'0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
'\\quad', 'A', '\\cap', 'B', '=']
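
Selecting a tokenizer by name is a factory pattern, and the examples consume the result with next(), which suggests a generator. A minimal sketch, assuming a plain dictionary registry; the stub classes, REGISTRY and get_stub_tokenizer are hypothetical and only the names "text" and "pure_text" come from the description above.

```python
class StubTextTokenizer:
    """Hypothetical tokenizer: yields one whitespace-split token list per item."""

    def __init__(self, *args, **kwargs):
        self.kwargs = kwargs

    def __call__(self, items, key=lambda x: x):
        for item in items:
            yield key(item).split()


class StubPureTextTokenizer(StubTextTokenizer):
    """Hypothetical second tokenizer, registered under a different name."""


REGISTRY = {"text": StubTextTokenizer, "pure_text": StubPureTextTokenizer}


def get_stub_tokenizer(name, *args, **kwargs):
    """Look up the tokenizer class by name and forward the remaining arguments."""
    if name not in REGISTRY:
        raise KeyError(f"unknown tokenizer {name!r}, expected one of {sorted(REGISTRY)}")
    return REGISTRY[name](*args, **kwargs)


tokenizer = get_stub_tokenizer("pure_text")
tokens = tokenizer(["A = B"])
first = next(tokens)  # items are tokenized lazily, one per next() call
```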

Vector

class EduNLP.Vector.W2V(filepath, method=None, binary=None)[source]

Uses the gensim library, which provides the FastText, Word2Vec and KeyedVectors methods, to convert words to vectors.

Parameters
  • filepath – path to the pretrained model file

  • method (str) – fasttext for FastText; other values for Word2Vec

  • binary (bool) –

key_to_index(word)[source]
property vectors
property vector_size
infer_vector(items, agg='mean', *args, **kwargs) list[source]
infer_tokens(items, *args, **kwargs) list[source]
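
The default agg='mean' of infer_vector suggests that token vectors are averaged into one item vector. A minimal pure-Python sketch of that aggregation, with toy vectors and a hypothetical mean_pool helper; it does not call gensim.

```python
def mean_pool(token_vectors):
    """Average a non-empty list of equal-length vectors component by component."""
    if not token_vectors:
        raise ValueError("cannot aggregate an empty token list")
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / len(token_vectors) for d in range(dim)]


# Toy word vectors standing in for a trained model (assumption).
toy_vectors = {"x": [1.0, 0.0], "y": [0.0, 1.0], "z": [2.0, 2.0]}
token_level = [toy_vectors[token] for token in ["x", "y", "z"]]  # like infer_tokens
item_level = mean_pool(token_level)                              # like infer_vector
```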
class EduNLP.Vector.D2V(filepath, method='d2v')[source]

A collection of methods that includes d2v, bow and tfidf.

Parameters
  • filepath

  • method (str) – one of d2v, bow, tfidf

  • item

Returns

d2v model

Return type

D2V

property vector_size
infer_vector(items, *args, **kwargs) list[source]
infer_tokens(item, *args, **kwargs) ...[source]
class EduNLP.Vector.BowLoader(filepath)[source]

Uses the doc2bow model to convert documents into bag-of-words vectors.

Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.

If allow_update is set, then also update dictionary in the process: create ids for new words. At the same time, update document frequencies – for each word appearing in this document, increase its document frequency (self.dfs) by one.

If allow_update is not set, this function is const, aka read-only.
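
The conversion described above can be sketched in a few lines. This is a simplified re-implementation for illustration, not gensim's code: it uses a plain dict as the dictionary and leaves out the document-frequency update.

```python
def doc2bow(document, token2id, allow_update=False):
    """Convert a list of words into sorted (token_id, token_count) 2-tuples."""
    counts = {}
    for word in document:
        if word not in token2id:
            if not allow_update:
                continue  # read-only mode: unknown words are ignored
            token2id[word] = len(token2id)  # create an id for the new word
        token_id = token2id[word]
        counts[token_id] = counts.get(token_id, 0) + 1
    return sorted(counts.items())


token2id = {"x": 0, "y": 1}
bow = doc2bow(["x", "y", "x", "z"], token2id)              # "z" is unknown
bow_updated = doc2bow(["x", "z"], token2id, allow_update=True)
```

After the second call, token2id also contains "z", which is the side effect that allow_update enables.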

infer_vector(item, return_vec=False)[source]
property vector_size
class EduNLP.Vector.TfidfLoader(filepath)[source]

This module implements functionality related to the Term Frequency - Inverse Document Frequency <https://en.wikipedia.org/wiki/Tf%E2%80%93idf> vector space bag-of-words models.
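
As a worked example of the weighting, the sketch below applies one common TF-IDF variant (raw term frequency times log2(N/df), followed by L2 normalisation) to a bag-of-words vector. The choice of variant and the tfidf helper are assumptions for illustration; the loaded model may be configured differently.

```python
import math


def tfidf(bow, doc_freqs, num_docs):
    """Weight (token_id, count) pairs by tf * log2(N / df), then L2-normalise."""
    weights = [(tid, tf * math.log2(num_docs / doc_freqs[tid])) for tid, tf in bow]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:
        return []
    # Tokens that occur in every document get weight 0 and are dropped.
    return [(tid, w / norm) for tid, w in weights if w > 0]


bow = [(0, 2), (1, 1)]     # token 0 occurs twice, token 1 once
doc_freqs = {0: 4, 1: 1}   # token 0 appears in all 4 documents, token 1 in one
weighted = tfidf(bow, doc_freqs, num_docs=4)
```

Token 0 is frequent in the item but appears in every document, so it carries no weight; token 1 is rare and dominates the vector.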

infer_vector(item, return_vec=False)[source]
property vector_size
class EduNLP.Vector.RNNModel(rnn_type, w2v: (<class 'EduNLP.Vector.gensim_vec.W2V'>, <class 'tuple'>, <class 'list'>, <class 'dict'>, None), hidden_size, freeze_pretrained=True, model_params=None, device=None, **kwargs)[source]

Examples

>>> model = RNNModel("BiLSTM", None, 2, vocab_size=4, embedding_dim=3)
>>> seq_idx = [[1, 2, 3], [1, 2, 0], [3, 0, 0]]
>>> output, hn = model(seq_idx, indexing=False, padding=False)
>>> seq_idx = [[1, 2, 3], [1, 2], [3]]
>>> output, hn = model(seq_idx, indexing=False, padding=True)
>>> output.shape
torch.Size([3, 3, 4])
>>> hn.shape
torch.Size([2, 3, 2])
>>> tokens = model.infer_tokens(seq_idx, indexing=False)
>>> tokens.shape
torch.Size([3, 3, 4])
>>> tokens = model.infer_tokens(seq_idx, agg="mean", indexing=False)
>>> tokens.shape
torch.Size([3, 4])
>>> item = model.infer_vector(seq_idx, indexing=False)
>>> item.shape
torch.Size([3, 4])
>>> item = model.infer_vector(seq_idx, agg="mean", indexing=False)
>>> item.shape
torch.Size([3, 2])
>>> item = model.infer_vector(seq_idx, agg=None, indexing=False)
>>> item.shape
torch.Size([2, 3, 2])
infer_vector(items, agg: (<class 'int'>, <class 'str'>, None) = -1, indexing=True, padding=True, *args, **kwargs) Tensor[source]
infer_tokens(items, agg=None, *args, **kwargs) Tensor[source]
property vector_size: int
set_device(device)[source]
save(filepath, save_embedding=False)[source]
freeze(*args, **kwargs)[source]
property is_frozen
eval()[source]
train(mode=True)[source]
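
The shapes in the RNNModel example follow from standard recurrent-layer arithmetic. A small sketch, assuming a single-layer, batch-first, bidirectional LSTM; rnn_shapes is a hypothetical helper that only computes shapes.

```python
def rnn_shapes(batch, seq_len, hidden_size, num_layers=1, bidirectional=True):
    """Return the (output, hn) shapes of a batch-first recurrent layer."""
    directions = 2 if bidirectional else 1
    # Each time step concatenates the forward and backward hidden states.
    output = (batch, seq_len, directions * hidden_size)
    # The final hidden state is kept separately per layer and direction.
    hn = (num_layers * directions, batch, hidden_size)
    return output, hn


# Three sequences padded to length 3, hidden_size=2, as in the example above.
output_shape, hn_shape = rnn_shapes(batch=3, seq_len=3, hidden_size=2)
```

This is why a hidden_size of 2 yields token vectors of size 4 in the BiLSTM example.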
class EduNLP.Vector.T2V(model: str, *args, **kwargs)[source]

Converts token lists to vectors. If you already have a model, you can use T2V directly. Otherwise, call get_pretrained_t2v, which provides a pretrained model so that you do not need your own.

Parameters

model (str) – select the model type e.g.: d2v, rnn, lstm, gru, elmo, etc.

Examples

>>> item = [{'ques_content':'有公式$\\FormFigureID{wrong1?}$和公式$\\FormFigureBase64{wrong2?}$,\
... 如图$\\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$'}]
>>> path = "examples/test_model/d2v/d2v_test_256/d2v_test_256.bin"
>>> t2v = T2V('d2v',filepath=path)
>>> print(t2v(item)) 
[array([...dtype=float32)]
infer_vector(items, *args, **kwargs)[source]
infer_tokens(items, *args, **kwargs)[source]
property vector_size: int
EduNLP.Vector.get_pretrained_t2v(name, model_dir='/home/docs/.EduNLP/model')[source]

A convenient way to convert token lists to vectors with a pretrained model.

Parameters
  • name (str) – select the pretrained model, e.g. d2v_math_300, w2v_math_300, elmo_math_2048, bert_math_768, bert_taledu_768, disenq_math_256, quesnet_math_512

  • model_dir (str) – the path of model, default: MODEL_DIR = ‘~/.EduNLP/model’

Returns

t2v model

Return type

T2V

Examples

>>> item = [{'ques_content':'有公式$\\FormFigureID{wrong1?}$和公式$\\FormFigureBase64{wrong2?}$,\
... 如图$\\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$'}]
>>> i2v = get_pretrained_t2v("d2v_test_256", "examples/test_model/d2v") 
>>> print(i2v(item)) 
[array([...dtype=float32)]
EduNLP.Vector.get_pretrained_model_info(name)[source]
EduNLP.Vector.get_all_pretrained_models()[source]
class EduNLP.Vector.Embedding(w2v: (<class 'EduNLP.Vector.gensim_vec.W2V'>, <class 'tuple'>, <class 'list'>, <class 'dict'>, None), freeze=True, device=None, **kwargs)[source]
infer_token_vector(items: List[List[str]], indexing=True) tuple[source]
indexing(items: List[List[str]], padding=False, indexing=True) tuple[source]
Parameters
  • items (list of list of str(word/token)) –

  • padding (bool) – whether to pad the returned lists with the default pad_val so that all items have the same length

  • indexing (bool) –

Returns

  • token_idx (list of list of int) – the list of the tokens of each item

  • token_len (list of int) – the list of the length of tokens of each item
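
The relation between token_idx, token_len and padding can be sketched as follows. index_tokens, the toy vocabulary and the pad and unknown values are assumptions for illustration, not the Embedding implementation.

```python
def index_tokens(items, vocab, padding=False, pad_val=0, unk_val=1):
    """Map tokens to ids and report the unpadded length of each item."""
    token_idx = [[vocab.get(token, unk_val) for token in item] for item in items]
    token_len = [len(idx) for idx in token_idx]
    if padding:
        max_len = max(token_len, default=0)
        # Right-pad every item to the length of the longest one.
        token_idx = [idx + [pad_val] * (max_len - len(idx)) for idx in token_idx]
    return token_idx, token_len


vocab = {"x": 2, "y": 3}
token_idx, token_len = index_tokens([["x", "y", "x"], ["y"]], vocab, padding=True)
```

token_len keeps the original lengths, so a downstream model can ignore the padded positions.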

set_device(device)[source]
class EduNLP.Vector.BertModel(pretrained_model)[source]

Examples

>>> from EduNLP.Pretrain import BertTokenizer
>>> tokenizer = BertTokenizer("bert-base-chinese", add_special_tokens=False)
>>> model = BertModel("bert-base-chinese")
>>> item = ["有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,若$x,y$满足约束",
... "有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,若$x,y$满足约束"]
>>> inputs = tokenizer(item, return_tensors='pt')
>>> output = model(inputs)
>>> output.shape
torch.Size([2, 14, 768])
>>> tokens = model.infer_tokens(inputs)
>>> tokens.shape
torch.Size([2, 12, 768])
>>> tokens = model.infer_tokens(inputs, return_special_tokens=True)
>>> tokens.shape
torch.Size([2, 14, 768])
>>> item = model.infer_vector(inputs)
>>> item.shape
torch.Size([2, 768])
infer_vector(items: dict, pooling_strategy='CLS') Tensor[source]
infer_tokens(items: dict, return_special_tokens=False) Tensor[source]
property vector_size
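
In the BertModel example, the sequence length drops from 14 to 12 when special tokens are not returned. A minimal sketch of that slicing and of 'CLS' pooling, assuming the two extra positions are a leading [CLS] and a trailing [SEP] with no padding; labels stand in for the 768-dimensional vectors, and both helpers are hypothetical.

```python
def strip_special_tokens(token_vectors):
    """Drop the first and last positions, assumed to be [CLS] and [SEP]."""
    return token_vectors[1:-1]


def cls_pool(token_vectors):
    """'CLS' pooling: use the first position as the representation of the item."""
    return token_vectors[0]


# Labels stand in for per-token vectors to keep the sketch readable.
sequence = ["[CLS]"] + [f"tok{i}" for i in range(12)] + ["[SEP]"]
content_tokens = strip_special_tokens(sequence)
item_vector = cls_pool(sequence)
```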
class EduNLP.Vector.QuesNetModel(pretrained_dir, tokenizer=None, device='cpu')[source]
infer_vector(items: Union[Question, list]) Tensor[source]

get question embedding with quesnet

Parameters

items ((Question, list)) – namedtuple, [‘id’, ‘content’, ‘answer’, ‘false_options’, ‘labels’] or a list of Questions

infer_tokens(items: Union[Question, list]) Tensor[source]

get token embeddings with quesnet

Parameters

items (Question) – namedtuple, [‘id’, ‘content’, ‘answer’, ‘false_options’, ‘labels’] or a list of Questions

Returns

meta_emb + word_embs

Return type

torch.Tensor

property vector_size
class EduNLP.Vector.DisenQModel(pretrained_dir, device='cpu')[source]
infer_vector(items: dict, vector_type=None, **kwargs) Tensor[source]
Parameters

vector_type (str) – choose the type of item tensor to return. Default is None, which returns both (k_hidden, i_hidden); when vector_type=”k”, return k_hidden; when vector_type=”i”, return i_hidden.
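
The three cases of vector_type amount to a simple dispatch. A sketch with a hypothetical select_hidden helper and placeholder values; raising on any other value is an assumption of this example.

```python
def select_hidden(k_hidden, i_hidden, vector_type=None):
    """Return both hidden states, or only the one named by `vector_type`."""
    if vector_type is None:
        return k_hidden, i_hidden
    if vector_type == "k":
        return k_hidden
    if vector_type == "i":
        return i_hidden
    raise KeyError(f"vector_type must be None, 'k' or 'i', got {vector_type!r}")


# Placeholder values stand in for the two tensors.
both = select_hidden("k_hidden", "i_hidden")
only_k = select_hidden("k_hidden", "i_hidden", vector_type="k")
only_i = select_hidden("k_hidden", "i_hidden", vector_type="i")
```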

infer_tokens(items: dict, **kwargs) Tensor[source]
property vector_size
class EduNLP.Vector.ElmoModel(pretrained_model_path: str)[source]
infer_vector(items, *args, **kwargs) Tensor[source]
infer_tokens(items, *args, **kwargs) Tensor[source]
property vector_size