EduNLP

SIF

EduNLP.SIF.sif.is_sif(item)[source]

the part aims to check whether the input is sif format

Parameters

item (str) – a raw item which respects stem

Returns

when item can not be parsed correctly, raise Error; when item doesn’t need to be modified, return Ture; when item needs to be modified, return False;

Return type

bool

Examples

>>> text = '若$x,y$满足约束条件' \
...        '$\\left\\{\\begin{array}{c}2 x+y-2 \\leq 0 \\\\ x-y-1 \\geq 0 \\\\ y+1 \\geq 0\\end{array}\\right.$,' \
...        '则$z=x+7 y$的最大值$\\SIFUnderline$'
>>> is_sif(text)
True
>>> text = '某校一个课外学习小组为研究某作物的发芽率y和温度x(单位...'
>>> is_sif(text)
False
EduNLP.SIF.sif.sif4sci(item: str, figures: (<class 'dict'>, <class 'bool'>) = None, safe=True, symbol: str = None, tokenization=True, tokenization_params=None, errors='raise')[source]

Default to use linear Tokenizer, change the tokenizer by specifying tokenization_params

Parameters
  • item (str) – a raw item which respects stem

  • figures (dict) – {“FigureID”: Base64 encoding of the figure}

  • safe (bool) – Check whether the text conforms to the sif format

  • symbol (str) –

    select the methods to symbolize:

    ”t”: text “f”: formula “g”: figure “m”: question mark “a”: tag “s”: sep

  • tokenization (bool) – True: tokenize the item

  • tokenization_params

    method: which tokenizer to be used, “linear” or “ast”

    The parameters only useful for “linear”: None

    The parameters only useful for “ast”:

    ord2token: whether to transfer the variables (mathord) and constants (textord) to special tokens. var_numbering: whether to use number suffix to denote different variables

  • errors – warn, raise, coerce, strict, ignore

Returns

When tokenization is False, return SegmentList; When tokenization is True, return TokenList

Return type

list

Examples

>>> test_item = r"如图所示,则$\bigtriangleup ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
>>> tl = sif4sci(test_item)
>>> tl
['如图所示', '\\bigtriangleup', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tl.describe()
{'t': 2, 'f': 2, 'g': 1, 'm': 1}
>>> with tl.filter('fgm'):
...     tl
['如图所示', '面积']
>>> with tl.filter(keep='t'):
...     tl
['如图所示', '面积']
>>> with tl.filter():
...     tl
['如图所示', '\\bigtriangleup', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tl.text_tokens
['如图所示', '面积']
>>> tl.formula_tokens
['\\bigtriangleup', 'ABC']
>>> tl.figure_tokens
[\FigureID{1}]
>>> tl.ques_mark_tokens
['\\SIFBlank']
>>> sif4sci(test_item, symbol="gm", tokenization_params={"formula_params": {"method": "ast"}})
['如图所示', <Formula: \bigtriangleup ABC>, '面积', '[MARK]', '[FIGURE]']
>>> sif4sci(test_item, symbol="tfgm")
['[TEXT]', '[FORMULA]', '[TEXT]', '[MARK]', '[TEXT]', '[FIGURE]']
>>> sif4sci(test_item, symbol="gm",
... tokenization_params={"formula_params": {"method": "ast", "return_type": "list"}})
['如图所示', '\\bigtriangleup', 'A', 'B', 'C', '面积', '[MARK]', '[FIGURE]']
>>> test_item_1 = {
...     "stem": r"若$x=2$, $y=\sqrt{x}$,则下列说法正确的是$\SIFChoice$",
...     "options": [r"$x < y$", r"$y = x$", r"$y < x$"]
... }
>>> tls = [
...     sif4sci(e, symbol="gm",
...     tokenization_params={
...         "formula_params": {
...             "method": "ast", "return_type": "list", "ord2token": True, "var_numbering": True,
...             "link_variable": False}
...     })
...     for e in ([test_item_1["stem"]] + test_item_1["options"])
... ]
>>> tls[1:]
[['mathord_0', '<', 'mathord_1'], ['mathord_0', '=', 'mathord_1'], ['mathord_0', '<', 'mathord_1']]
>>> link_formulas(*tls)
>>> tls[1:]
[['mathord_0', '<', 'mathord_1'], ['mathord_1', '=', 'mathord_0'], ['mathord_1', '<', 'mathord_0']]
>>> from EduNLP.utils import dict2str4sif
>>> test_item_1_str = dict2str4sif(test_item_1, tag_mode="head", add_list_no_tag=False)
>>> test_item_1_str  
'$\\SIFTag{stem}$...则下列说法正确的是$\\SIFChoice$$\\SIFTag{options}$$x < y$$\\SIFSep$$y = x$$\\SIFSep$$y < x$'
>>> tl1 = sif4sci(test_item_1_str, symbol="gm",
... tokenization_params={"formula_params": {"method": "ast", "return_type": "list", "ord2token": True}})
>>> tl1.get_segments()[0]
['\\SIFTag{stem}']
>>> tl1.get_segments()[1:3]
[['[TEXT_BEGIN]', '[TEXT_END]'], ['[FORMULA_BEGIN]', 'mathord', '=', 'textord', '[FORMULA_END]']]
>>> tl1.get_segments(add_seg_type=False)[0:3]
[['\\SIFTag{stem}'], ['mathord', '=', 'textord'], ['mathord', '=', 'mathord', '{ }', '\\sqrt']]
>>> test_item_2 = {"options": [r"$x < y$", r"$y = x$", r"$y < x$"]}
>>> test_item_2
{'options': ['$x < y$', '$y = x$', '$y < x$']}
>>> test_item_2_str = dict2str4sif(test_item_2, tag_mode="head", add_list_no_tag=False)
>>> test_item_2_str
'$\\SIFTag{options}$$x < y$$\\SIFSep$$y = x$$\\SIFSep$$y < x$'
>>> tl2 = sif4sci(test_item_2_str, symbol="gms",
... tokenization_params={"formula_params": {"method": "ast", "return_type": "list"}})
>>> tl2
['\\SIFTag{options}', 'x', '<', 'y', '[SEP]', 'y', '=', 'x', '[SEP]', 'y', '<', 'x']
>>> tl2.get_segments(add_seg_type=False)
[['\\SIFTag{options}'], ['x', '<', 'y'], ['[SEP]'], ['y', '=', 'x'], ['[SEP]'], ['y', '<', 'x']]
>>> tl2.get_segments(add_seg_type=False, drop="s")
[['\\SIFTag{options}'], ['x', '<', 'y'], ['y', '=', 'x'], ['y', '<', 'x']]
>>> tl3 = sif4sci(test_item_1["stem"], symbol="gs")
>>> tl3.text_segments
[['说法', '正确']]
>>> tl3.formula_segments
[['x', '=', '2'], ['y', '=', '\\sqrt', '{', 'x', '}']]
>>> tl3.figure_segments
[]
>>> tl3.ques_mark_segments
[['\\SIFChoice']]
>>> test_item_3 = r"已知$y=x$,则以下说法中$\textf{正确,b}$的是"
>>> tl4 = sif4sci(test_item_3)
Warning: there is some chinese characters in formula!
>>> tl4.text_segments
[['已知'], ['说法', '中', '正确']]
EduNLP.SIF.sif.to_sif(item)[source]

the part aims to switch item to sif formate

Parameters

items (str) – a raw item which respects stem

Returns

item – the item which accords with sif format

Return type

str

Examples

>>> text = '某校一个课外学习小组为研究某作物的发芽率y和温度x(单位...'
>>> siftext = to_sif(text)
>>> siftext
'某校一个课外学习小组为研究某作物的发芽率$y$和温度$x$(单位...'

EduNLP.Formula

EduNLP.Formula.ast.ast(formula: (<class 'str'>, typing.List[typing.Dict]), index=0, forest_begin=0, father_tree=None, is_str=False)[source]

The origin code author is https://github.com/hxwujinze

Parameters
  • formula (str or List[Dict]) – 公式字符串或通过katex解析得到的结构体

  • index (int) – 本子树在树上的位置

  • forest_begin (int) – 本树在森林中的起始位置

  • father_tree (List[Dict]) – 父亲树

  • is_str (bool) –

Returns

  • tree (List[Dict]) – 重新解析形成的特征树

  • todo (finish all types)

Notes

Some functions are not supportd in katex e.g.,

  1. tag
    • \begin{equation} \tag{tagName} F=ma \end{equation}

    • \begin{align} \tag{1} y=x+z \end{align}

    • \tag*{hi} x+y^{2x}

  2. dddot
    • \frac{ \dddot y }{ x }

For more information, refer to katex support table

EduNLP.Formula.ast.get_edges(forest)[source]

构造边集合

Parameters

forest (List[Dict]) – 森林

Returns

edges – 边集合

Return type

list of tuple(src,dst,type)

EduNLP.Formula.ast.katex_parse(formula)[source]

将公式传入katex进行语法解析

建森林

Parameters

forest (List[Dict]) –

Returns

trees

Return type

List[Dict]

EduNLP.Formula.ast.str2ast(formula: str, *args, **kwargs)[source]

给字符串的接口

EduNLP.I2V

class EduNLP.I2V.i2v.D2V(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]

The model aims to transfer item to vector directly.

I2V

Parameters
  • tokenizer (str) – the tokenizer name

  • t2v (str) – the name of token2vector model

  • args – the parameters passed to t2v

  • tokenizer_kwargs (dict) – the parameters passed to tokenizer

  • pretrained_t2v (bool) – True: use pretrained t2v model False: use your own t2v model

  • kwargs – the parameters passed to t2v

Examples

>>> item = {"如图来自古希腊数学家希波克拉底所研究的几何图形.此图由三个半圆构成,三个半圆的直径分别为直角三角形$ABC$的斜边$BC$,     ... 直角边$AB$, $AC$.$\bigtriangleup ABC$的三边所围成的区域记为$I$,黑色部分记为$II$, 其余部分记为$III$.在整个图形中随机取一点,    ... 此点取自$I,II,III$的概率分别记为$p_1,p_2,p_3$,则$\SIFChoice$$\FigureID{1}$"}
>>> model_path = "examples/test_model/test_gensim_luna_stem_tf_d2v_256.bin"
>>> i2v = D2V("text","d2v",filepath=model_path, pretrained_t2v = False)
>>> i2v(item)
([array([ ...dtype=float32)], None)
Returns

i2v model

Return type

I2V

infer_vector(items, tokenize=True, indexing=False, padding=False, key=<function D2V.<lambda>>, *args, **kwargs) tuple[source]

It is a function to switch item to vector. And before using the function, it is nesseary to load model.

Parameters
  • items (str) – the text of question

  • tokenize (bool) – True: tokenize the item

  • indexing (bool) –

  • padding (bool) –

  • key (lambda function) – the parameter passed to tokenizer, select the text to be processed

  • args – the parameters passed to t2v

  • kwargs – the parameters passed to t2v

Returns

vector

Return type

list

class EduNLP.I2V.i2v.I2V(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]

It just a api, so you shouldn’t use it directly. If you want to get vector from item, you can use other model like D2V and W2V.

Parameters
  • tokenizer (str) – the tokenizer name

  • t2v (str) – the name of token2vector model

  • args – the parameters passed to t2v

  • tokenizer_kwargs (dict) – the parameters passed to tokenizer

  • pretrained_t2v (bool) –

    True: use pretrained t2v model

    False: use your own t2v model

  • kwargs – the parameters passed to t2v

Examples

>>> item = {"如图来自古希腊数学家希波克拉底所研究的几何图形.此图由三个半圆构成,三个半圆的直径分别为直角三角形$ABC$的斜边$BC$,     ... 直角边$AB$, $AC$.$\bigtriangleup ABC$的三边所围成的区域记为$I$,黑色部分记为$II$, 其余部分记为$III$.在整个图形中随机取一点,    ... 此点取自$I,II,III$的概率分别记为$p_1,p_2,p_3$,则$\SIFChoice$$\FigureID{1}$"}
>>> model_path = "examples/test_model/test_gensim_luna_stem_tf_d2v_256.bin" 
>>> i2v = D2V("text","d2v",filepath=model_path, pretrained_t2v = False) 
>>> i2v(item) 
([array([...dtype=float32)], None)
Returns

i2v model

Return type

I2V

class EduNLP.I2V.i2v.W2V(tokenizer, t2v, *args, tokenizer_kwargs: Optional[dict] = None, pretrained_t2v=False, **kwargs)[source]

The model aims to transfer tokens to vector.

I2V

Parameters
  • tokenizer (str) – the tokenizer name

  • t2v (str) – the name of token2vector model

  • args – the parameters passed to t2v

  • tokenizer_kwargs (dict) – the parameters passed to tokenizer

  • pretrained_t2v (bool) – True: use pretrained t2v model False: use your own t2v model

  • kwargs – the parameters passed to t2v

Examples

>>> i2v = get_pretrained_i2v("test_w2v", "examples/test_model/data/w2v")
>>> item_vector, token_vector = i2v(["有学者认为:‘学习’,必须适应实际"])
>>> item_vector 
[array([...], dtype=float32)]
Returns

i2v model

Return type

W2V

infer_vector(items, tokenize=True, indexing=False, padding=False, key=<function W2V.<lambda>>, *args, **kwargs) tuple[source]

It is a function to switch item to vector. And before using the function, it is nesseary to load model.

Parameters
  • items (str) – the text of question

  • tokenize (bool) – True: tokenize the item

  • indexing (bool) –

  • padding (bool) –

  • key (lambda function) – the parameter passed to tokenizer, select the text to be processed

  • args – the parameters passed to t2v

  • kwargs – the parameters passed to t2v

Returns

vector

Return type

list

EduNLP.I2V.i2v.get_pretrained_i2v(name, model_dir='/home/docs/.EduNLP/model')[source]

It is a good idea if you want to switch item to vector earily.

Parameters
  • name (str) – the name of item2vector model e.g.: d2v_all_256 d2v_sci_256 d2v_eng_256 d2v_lit_256 w2v_sci_300 w2v_lit_300

  • model_dir (str) – the path of model, default: MODEL_DIR = ‘~/.EduNLP/model’

Returns

i2v model

Return type

I2V

Examples

>>> item = {"如图来自古希腊数学家希波克拉底所研究的几何图形.此图由三个半圆构成,三个半圆的直径分别为直角三角形$ABC$的斜边$BC$,     ... 直角边$AB$, $AC$.$\bigtriangleup ABC$的三边所围成的区域记为$I$,黑色部分记为$II$, 其余部分记为$III$.在整个图形中随机取一点,    ... 此点取自$I,II,III$的概率分别记为$p_1,p_2,p_3$,则$\SIFChoice$$\FigureID{1}$"}
>>> i2v = get_pretrained_i2v("test_d2v", "examples/test_model/data/d2v")
>>> print(i2v(item))
([array([ ...dtype=float32)], None)

EduNLP.Pretrain

class EduNLP.Pretrain.GensimSegTokenizer(symbol='gms', depth=None, flatten=False, **kwargs)[source]
Parameters
  • symbol (str) –

    select the methods to symbolize:

    ”t”: text, “f”: formula, “g”: figure, “m”: question mark, “a”: tag, “s”: sep,

    e.g. gms, fgm

  • depth (int or None) – 0: only separate at SIFSep ; 1: only separate at SIFTag ; 2: separate at SIFTag and SIFSep ; otherwise, separate all segments ;

Returns

tokenizer

Return type

Tokenizer

Examples

>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,    ... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{wrong1?}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> tokenizer = GensimSegTokenizer(symbol="fgm", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,    ... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], ['[FORMULA]'], ['如图'], ['[FIGURE]'], ['[FORMULA]'],...['[FORMULA]'], ['最大值'], ['[MARK]']]
class EduNLP.Pretrain.GensimWordTokenizer(symbol='gm', general=False)[source]
Parameters
  • symbol (str) –

    select the methods to symbolize:

    ”t”: text, “f”: formula, “g”: figure, “m”: question mark, “a”: tag, “s”: sep,

    e.g.: gm, fgm, gmas, fgmas

  • general (bool) –

    True: when item isn’t in standard format, and want to tokenize formulas(except formulas in figure) linearly.

    False: when use ‘ast’ mothed to tokenize formulas instead of ‘linear’.

Returns

tokenizer

Return type

Tokenizer

Examples

>>> tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
>>> token_item = tokenizer("有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,    ... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
>>> tokenizer = GensimWordTokenizer(symbol="fgmas", general=False)
>>> token_item = tokenizer("有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,    ... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', '[FORMULA]', '约束条件', '公式', '[FORMULA]', '[SEP]', '[FORMULA]']
EduNLP.Pretrain.train_vector(items, w2v_prefix, embedding_dim=None, method='sg', binary=None, train_params=None)[source]
Parameters
  • items:str – the text of question

  • w2v_prefix

  • embedding_dim (int) – vector_size

  • method (str) – the method of training, e.g.: sg, cbow, fasttext, d2v, bow, tfidf

  • binary (model format) – True:bin; False:kv

  • train_params (dict) – the training parameters passed to model

Returns

tokenizer

Return type

Tokenizer

Examples

>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,    ... 若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{wrong1?}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> train_vector(token_item[:10], "examples/test_model/data/gensim_luna_stem_t_", 100) 
'examples/test_model/data/gensim_luna_stem_t_sg_100.kv'

EduNLP.Tokenizer

class EduNLP.Tokenizer.PureTextTokenizer(*args, **kwargs)[source]

Duel with text and plain text formula. And filting special formula like $\FormFigureID{…}$ and $\FormFigureBase64{…}.

Parameters
  • items (str) –

  • key

  • args

  • kwargs

Return type

token

Examples

>>> tokenizer = PureTextTokenizer()
>>> items = ["有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$"]
>>> tokens = tokenizer(items)
>>> next(tokens)[:10]
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z']
>>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
>>> tokens = tokenizer(items)
>>> next(tokens)  
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
'0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
'\\quad', 'A', '\\cap', 'B', '=']
>>> items = [{
... "stem": "已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$",
... "options": ["1", "2"]
... }]
>>> tokens = tokenizer(items, key=lambda x: x["stem"])
>>> next(tokens)  
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
'0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
'\\quad', 'A', '\\cap', 'B', '=']
class EduNLP.Tokenizer.TextTokenizer(*args, **kwargs)[source]

Duel with text and formula including special formula.

Parameters
  • items (str) –

  • key

  • args

  • kwargs

Return type

token

Examples

>>> tokenizer = TextTokenizer()
>>> items = ["有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$"]
>>> tokens = tokenizer(items)
>>> next(tokens)[:10]
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
>>> items = ["$\\SIFTag{stem_begin}$若复数$z=1+2 i+i^{3}$,则$|z|=$$\\SIFTag{stem_end}$\
... $\\SIFTag{options_begin}$$\\SIFTag{list_0}$0$\\SIFTag{list_1}$1$\\SIFTag{list_2}$$\\sqrt{2}$\
... $\\SIFTag{list_3}$2$\\SIFTag{options_end}$"]
>>> tokens = tokenizer(items)
>>> next(tokens)[:10]
['[TAG]', '复数', 'z', '=', '1', '+', '2', 'i', '+', 'i']
EduNLP.Tokenizer.get_tokenizer(name, *args, **kwargs)[source]

It is a total interface to use difference tokenizer.

Parameters
  • name (str) – the name of tokenizer, e.g. text, pure_text.

  • args – the parameters passed to tokenizer

  • kwargs – the parameters passed to tokenizer

Returns

tokenizer

Return type

Tokenizer

Examples

>>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
>>> tokenizer = get_tokenizer("text")
>>> tokens = tokenizer(items)
>>> next(tokens)  
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
'0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
'\\quad', 'A', '\\cap', 'B', '=']

Vector

class EduNLP.Vector.BowLoader(filepath)[source]

Using doc2bow model, which has a lot of effects.

Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.

If allow_update is set, then also update dictionary in the process: create ids for new words. At the same time, update document frequencies – for each word appearing in this document, increase its document frequency (self.dfs) by one.

If allow_update is not set, this function is const, aka read-only.

class EduNLP.Vector.D2V(filepath, method='d2v')[source]

It is a collection which include d2v, bow, tfidf method.

Parameters
  • filepath

  • method (str) – d2v bow tfidf

  • item

Returns

d2v model

Return type

D2V

class EduNLP.Vector.RNNModel(rnn_type, w2v: (<class 'EduNLP.Vector.gensim_vec.W2V'>, <class 'tuple'>, <class 'list'>, <class 'dict'>, None), hidden_size, freeze_pretrained=True, model_params=None, device=None, **kwargs)[source]

Examples

>>> model = RNNModel("ELMO", None, 2, vocab_size=4, embedding_dim=3)
>>> seq_idx = [[1, 2, 3], [1, 2, 0], [3, 0, 0]]
>>> output, hn = model(seq_idx, indexing=False, padding=False)
>>> seq_idx = [[1, 2, 3], [1, 2], [3]]
>>> output, hn = model(seq_idx, indexing=False, padding=True)
>>> output.shape
torch.Size([3, 3, 4])
>>> hn.shape
torch.Size([2, 3, 2])
>>> tokens = model.infer_tokens(seq_idx, indexing=False)
>>> tokens.shape
torch.Size([3, 3, 4])
>>> tokens = model.infer_tokens(seq_idx, agg="mean", indexing=False)
>>> tokens.shape
torch.Size([3, 4])
>>> item = model.infer_vector(seq_idx, indexing=False)
>>> item.shape
torch.Size([3, 4])
>>> item = model.infer_vector(seq_idx, agg="mean", indexing=False)
>>> item.shape
torch.Size([3, 2])
>>> item = model.infer_vector(seq_idx, agg=None, indexing=False)
>>> item.shape
torch.Size([2, 3, 2])
class EduNLP.Vector.T2V(model: str, *args, **kwargs)[source]

The function aims to transfer token list to vector. If you have a certain model, you can use T2V directly. Otherwise, calling get_pretrained_t2v function is a better way to get vector which can switch it without your model.

Parameters

model (str) – select the model type e.g.: d2v, rnn, lstm, gru, elmo, etc.

Examples

>>> item = [{'ques_content':'有公式$\FormFigureID{wrong1?}$和公式$\FormFigureBase64{wrong2?}$,    ... 如图$\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$'}]
>>> path = "examples/test_model/test_gensim_luna_stem_tf_d2v_256.bin"
>>> t2v = T2V('d2v',filepath=path)
>>> print(t2v(item)) 
[array([...dtype=float32)]
class EduNLP.Vector.TfidfLoader(filepath)[source]

This module implements functionality related to the Term Frequency - Inverse Document Frequency <https://en.wikipedia.org/wiki/Tf%E2%80%93idf> vector space bag-of-words models.

class EduNLP.Vector.W2V(filepath, method=None, binary=None)[source]

The part uses gensim library providing FastText, Word2Vec and KeyedVectors method to transfer word to vector.

Parameters
  • filepath – path to the pretrained model file

  • method (str) – fasttext other(Word2Vec)

  • binary

EduNLP.Vector.get_pretrained_t2v(name, model_dir='/home/docs/.EduNLP/model')[source]

It is a good idea if you want to switch token list to vector earily.

Parameters
  • name (str) – select the pretrained model e.g.: d2v_all_256, d2v_sci_256, d2v_eng_256, d2v_lit_256, w2v_eng_300, w2v_lit_300.

  • model_dir (str) – the path of model, default: MODEL_DIR = ‘~/.EduNLP/model’

Returns

t2v model

Return type

T2V

Examples

>>> item = [{'ques_content':'有公式$\FormFigureID{wrong1?}$和公式$\FormFigureBase64{wrong2?}$,    ... 如图$\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$'}]
>>> i2v = get_pretrained_t2v("test_d2v", "examples/test_model/data/d2v") 
>>> print(i2v(item)) 
[array([...dtype=float32)]