EduNLP.SIF¶

SIF¶

EduNLP.SIF.sif.is_sif(item, check_formula=True, return_parser=False)[source]¶

Check whether the input item is in SIF format.

Parameters

item (str) – the raw item text (the question stem)
check_formula (bool) – whether to check the formulas when parsing the item. If True, the validity of the formulas in the item is checked; if False, it is not checked, which is faster.
return_parser (bool) – whether to include the parsed item in the return value. If True, the return value is (bool, Parser); if False, the return value is a bool.

Returns

Raises ValueError when the item cannot be parsed correctly; returns True (and the Parser of the item) when the item is already in standardized format; returns False (and the Parser of the item) when it is not.

Return type

bool, or (bool, Parser) when return_parser is True

Examples

>>> text = '若$x,y$满足约束条件' \
...        '$\\left\\{\\begin{array}{c}2 x+y-2 \\leq 0 \\\\ x-y-1 \\geq 0 \\\\ y+1 \\geq 0\\end{array}\\right.$，' \
...        '则$z=x+7 y$的最大值$\\SIFUnderline$'
>>> is_sif(text)
True
>>> text = '某校一个课外学习小组为研究某作物的发芽率y和温度x（单位...'
>>> ret = is_sif(text, return_parser=True)
>>> ret 
(False, <EduNLP.SIF.parser.parser.Parser object...>)

EduNLP.SIF.sif.to_sif(item, check_formula=True, parser: Optional[Parser] = None)[source]¶

Convert an item to SIF format.

Parameters

item (str) – the raw item text (the question stem)
check_formula (bool) – whether to check the formulas when parsing the item (only takes effect when parser is None).
parser (Parser) – the parser of item returned from is_sif.

Returns

item – the item which accords with sif format

Return type

str

Examples

>>> text = '某校一个课外学习小组为研究某作物的发芽率y和温度x（单位...'
>>> siftext = to_sif(text)
>>> siftext
'某校一个课外学习小组为研究某作物的发芽率$y$和温度$x$（单位...'
>>> ret = is_sif(text, return_parser=True)
>>> ret 
(False, <EduNLP.SIF.parser.parser.Parser object...>)
>>> to_sif(text, parser=ret[1])
'某校一个课外学习小组为研究某作物的发芽率$y$和温度$x$（单位...'
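
The conversion in the examples above can be approximated in a few lines. This is a minimal sketch rather than EduNLP's implementation: the helper names `wrap_bare_tokens` and `looks_like_sif` are hypothetical, and the sketch only wraps bare ASCII letters and numbers found outside existing `$...$` spans. The real Parser also handles blanks, choices, punctuation and formula validation.

```python
import re

# Hypothetical helpers, for illustration only; not part of EduNLP.
_FORMULA_SPAN = re.compile(r"(\$[^$]*\$)")
_BARE_TOKEN = re.compile(r"[A-Za-z]+|\d+(?:\.\d+)?")


def wrap_bare_tokens(text: str) -> str:
    """Wrap letters and numbers found outside $...$ spans in dollar signs."""
    parts = _FORMULA_SPAN.split(text)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            # Odd indices hold the captured $...$ spans: keep them untouched.
            out.append(part)
        else:
            out.append(_BARE_TOKEN.sub(lambda m: "$" + m.group(0) + "$", part))
    return "".join(out)


def looks_like_sif(text: str) -> bool:
    """Treat a text as already standardized if wrapping changes nothing."""
    return wrap_bare_tokens(text) == text


raw = "某校一个课外学习小组为研究某作物的发芽率y和温度x（单位...）"
converted = wrap_bare_tokens(raw)
print(converted)
print(looks_like_sif(raw), looks_like_sif(converted))
```

The check-by-conversion idea mirrors the documented behaviour that an item is reported as non-standard when converting it would change it, but the rule set here is far smaller than the library's.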

EduNLP.SIF.sif.sif4sci(item: str, figures: (<class 'dict'>, <class 'bool'>) = None, mode: int = 2, symbol: str = None, tokenization=True, tokenization_params=None, errors='raise')[source]¶

Uses the linear tokenizer by default; change the tokenizer by specifying tokenization_params.

Parameters

item (str) – a raw item which respects stem
figures (dict or bool) – when it is a dict, it maps figure ids to figure instances for figures in ‘FormFigureID{…}’ format; when it is a bool, it specifies whether to instantiate figures in ‘FormFigureBase64{…}’ format
mode (int) – when mode = 2, use is_sif and check the formulas in item; when mode = 1, use is_sif but don’t check the formulas in item; when mode = 0, don’t use is_sif and don’t check anything in item
symbol (str) – selects which element types to symbolize: “t”: text; “f”: formula; “g”: figure; “m”: question mark; “a”: tag; “s”: sep
tokenization (bool) – whether to tokenize item after segmentation
tokenization_params – a dict with the keys text_params, formula_params and figure_params, used in tokenization.

For formula_params:
method: which tokenizer to use, “linear” or “ast”

Parameters only used by “linear”:
skip_figure_formula: whether to skip formulas in figure format
symbolize_figure_formula: whether to symbolize formulas in figure format

Parameters only used by “ast”:
ord2token: whether to convert variables (mathord) and constants (textord) to special tokens
var_numbering: whether to use a numeric suffix to distinguish different variables
return_type: ‘list’ or ‘ast’

More parameters can be found in the definitions in SIF.tokenization.formula.

For figure_params:
figure_instance: whether to return figure instances in tokens

For text_params:
See the definitions in SIF.tokenization.text.
errors (str) – the error-handling mode: warn, raise, coerce, strict or ignore (default: raise)
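
The mode values and the nested layout of tokenization_params described above can be written down as plain Python. The sketch below is only an illustration: `checks_for_mode` and `ast_tokenization_params` are hypothetical helper names, and rejecting other mode values with ValueError is an assumption rather than documented behaviour.

```python
from typing import Any, Dict, Tuple


def checks_for_mode(mode: int) -> Tuple[bool, bool]:
    """Return (use_is_sif, check_formula) for the documented mode values."""
    if mode == 2:
        return True, True
    if mode == 1:
        return True, False
    if mode == 0:
        return False, False
    # Assumption: values other than 0, 1 and 2 are treated as errors here.
    raise ValueError("mode must be 0, 1 or 2")


def ast_tokenization_params(ord2token: bool = False,
                            var_numbering: bool = False,
                            return_type: str = "list",
                            figure_instance: bool = False) -> Dict[str, Any]:
    """Build a tokenization_params dict that selects the "ast" tokenizer."""
    return {
        "formula_params": {
            "method": "ast",
            "ord2token": ord2token,
            "var_numbering": var_numbering,
            "return_type": return_type,
        },
        "figure_params": {"figure_instance": figure_instance},
    }


params = ast_tokenization_params(ord2token=True, var_numbering=True)
print(checks_for_mode(1))
print(params["formula_params"])
```

A dict built this way would be passed as `sif4sci(item, tokenization_params=params)`, following the signature above.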

Returns

When tokenization is False, return SegmentList; When tokenization is True, return TokenList

Return type

list

Examples

>>> test_item = r"如图所示，则$\bigtriangleup ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
>>> tl = sif4sci(test_item)
>>> tl
['如图所示', '\\bigtriangleup', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tl.describe()
{'t': 2, 'f': 2, 'g': 1, 'm': 1}
>>> with tl.filter('fgm'):
...     tl
['如图所示', '面积']
>>> with tl.filter(keep='t'):
...     tl
['如图所示', '面积']
>>> with tl.filter():
...     tl
['如图所示', '\\bigtriangleup', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tl.text_tokens
['如图所示', '面积']
>>> tl.formula_tokens
['\\bigtriangleup', 'ABC']
>>> tl.figure_tokens
[\FigureID{1}]
>>> tl.ques_mark_tokens
['\\SIFBlank']
>>> sif4sci(test_item, symbol="gm", tokenization_params={"formula_params": {"method": "ast"}})
['如图所示', <Formula: \bigtriangleup ABC>, '面积', '[MARK]', '[FIGURE]']
>>> sif4sci(test_item, symbol="tfgm")
['[TEXT]', '[FORMULA]', '[TEXT]', '[MARK]', '[TEXT]', '[FIGURE]']
>>> sif4sci(test_item, symbol="gm",
... tokenization_params={"formula_params": {"method": "ast", "return_type": "list"}})
['如图所示', '\\bigtriangleup', 'A', 'B', 'C', '面积', '[MARK]', '[FIGURE]']
>>> test_item_1 = {
...     "stem": r"若$x=2$, $y=\sqrt{x}$，则下列说法正确的是$\SIFChoice$",
...     "options": [r"$x < y$", r"$y = x$", r"$y < x$"]
... }
>>> tls = [
...     sif4sci(e, symbol="gm",
...     tokenization_params={
...         "formula_params": {
...             "method": "ast", "return_type": "list", "ord2token": True, "var_numbering": True,
...             "link_variable": False}
...     })
...     for e in ([test_item_1["stem"]] + test_item_1["options"])
... ]
>>> tls[1:]
[['mathord_0', '<', 'mathord_1'], ['mathord_0', '=', 'mathord_1'], ['mathord_0', '<', 'mathord_1']]
>>> link_formulas(*tls)
>>> tls[1:]
[['mathord_0', '<', 'mathord_1'], ['mathord_1', '=', 'mathord_0'], ['mathord_1', '<', 'mathord_0']]
>>> from EduNLP.utils import dict2str4sif
>>> test_item_1_str = dict2str4sif(test_item_1, tag_mode="head", add_list_no_tag=False)
>>> test_item_1_str  
'$\\SIFTag{stem}$...则下列说法正确的是$\\SIFChoice$$\\SIFTag{options}$$x < y$$\\SIFSep$$y = x$$\\SIFSep$$y < x$'
>>> tl1 = sif4sci(test_item_1_str, symbol="gm",
... tokenization_params={"formula_params": {"method": "ast", "return_type": "list", "ord2token": True}})
>>> tl1.get_segments()[0]
['\\SIFTag{stem}']
>>> tl1.get_segments()[1:3]
[['[TEXT_BEGIN]', '[TEXT_END]'], ['[FORMULA_BEGIN]', 'mathord', '=', 'textord', '[FORMULA_END]']]
>>> tl1.get_segments(add_seg_type=False)[0:3]
[['\\SIFTag{stem}'], ['mathord', '=', 'textord'], ['mathord', '=', 'mathord', '{ }', '\\sqrt']]
>>> test_item_2 = {"options": [r"$x < y$", r"$y = x$", r"$y < x$"]}
>>> test_item_2
{'options': ['$x < y$', '$y = x$', '$y < x$']}
>>> test_item_2_str = dict2str4sif(test_item_2, tag_mode="head", add_list_no_tag=False)
>>> test_item_2_str
'$\\SIFTag{options}$$x < y$$\\SIFSep$$y = x$$\\SIFSep$$y < x$'
>>> tl2 = sif4sci(test_item_2_str, symbol="gms",
... tokenization_params={"formula_params": {"method": "ast", "return_type": "list"}})
>>> tl2
['\\SIFTag{options}', 'x', '<', 'y', '[SEP]', 'y', '=', 'x', '[SEP]', 'y', '<', 'x']
>>> tl2.get_segments(add_seg_type=False)
[['\\SIFTag{options}'], ['x', '<', 'y'], ['[SEP]'], ['y', '=', 'x'], ['[SEP]'], ['y', '<', 'x']]
>>> tl2.get_segments(add_seg_type=False, drop="s")
[['\\SIFTag{options}'], ['x', '<', 'y'], ['y', '=', 'x'], ['y', '<', 'x']]
>>> tl3 = sif4sci(test_item_1["stem"], symbol="gs")
>>> tl3.text_segments
[['说法', '正确']]
>>> tl3.formula_segments
[['x', '=', '2'], ['y', '=', '\\sqrt', '{', 'x', '}']]
>>> tl3.figure_segments
[]
>>> tl3.ques_mark_segments
[['\\SIFChoice']]
>>> test_item_3 = r"已知$y=x$，则以下说法中$\textf{正确,b}$的是"
>>> tl4 = sif4sci(test_item_3)
Warning: there is some chinese characters in formula!
>>> tl4.text_segments
[['已知'], ['说法', '中', '正确']]

Parser¶

class EduNLP.SIF.parser.Parser(data, check_formula=True)[source]¶

Parse the item to standard format.

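is_number(uchar)[source]¶: Check whether a Unicode character is a digit.

is_alphabet(uchar)[source]¶: Check whether a Unicode character is an English letter.

is_chinese(uchar)[source]¶: Check whether a Unicode character is a Chinese character.

call_error()[source]¶: Syntax parsing function.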

get_token()[source]¶

Get different elements in the item.

Returns: elements
Return type: chinese, alphabet, number, ch_pun_list, en_pun_list, latex formula

next_token()[source]¶

match(terminal)[source]¶

txt()[source]¶

txt_list()[source]¶

description()[source]¶

description_list()[source]¶

Use the Parser to process and describe the text.

Examples

>>> text = '生产某种零件的A工厂25名工人的日加工零件数_   _'
>>> text_parser = Parser(text)
>>> text_parser.description_list()
>>> text_parser.text
'生产某种零件的$A$工厂$25$名工人的日加工零件数$\\SIFBlank$'
>>> text = 'X的分布列为(   )'
>>> text_parser = Parser(text)
>>> text_parser.description_list()
>>> text_parser.text
'$X$的分布列为$\\SIFChoice$'
>>> text = '① AB是⊙O的直径，AC是⊙O的切线，BC交⊙O于点E．AC的中点为D'
>>> text_parser = Parser(text)
>>> text_parser.description_list()
>>> text_parser.error_flag
1
>>> text = '支持公式如$\\frac{y}{x}$，$\\SIFBlank$，$\\FigureID{1}$，不支持公式如$\\frac{ \\dddot y}{x}$'
>>> text_parser = Parser(text)
>>> text_parser.description_list()
>>> text_parser.fomula_illegal_flag
1
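
Two of the rewrites visible in the examples above, fill-in blanks becoming `$\SIFBlank$` and empty brackets becoming `$\SIFChoice$`, can be imitated with regular expressions. This is a rough sketch under stated assumptions: the patterns and the name `mark_question_slots` are hypothetical, the Parser is grammar-based and need not use regular expressions at all, and this toy version does not skip underscores inside existing formulas.

```python
import re

# Hypothetical patterns, for illustration only.
_BLANK = re.compile(r"_(?:[\s_]*_)?")    # "_", "___", "_   _", ...
_CHOICE = re.compile(r"[(（]\s+[)）]")    # "(   )" or full-width "（  ）"


def mark_question_slots(text: str) -> str:
    """Replace fill-in blanks and empty brackets with SIF question marks."""
    # Callables are used as replacements so the backslashes stay literal.
    text = _BLANK.sub(lambda m: r"$\SIFBlank$", text)
    text = _CHOICE.sub(lambda m: r"$\SIFChoice$", text)
    return text


print(mark_question_slots("日加工零件数_   _"))
print(mark_question_slots("分布列为(   )"))
```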

Segment¶

EduNLP.SIF.segment.segment.contextmanager(func)[source]¶

@contextmanager decorator.

Typical usage:

    @contextmanager
    def some_generator(<arguments>):
        <setup>
        try:
            yield <value>
        finally:
            <cleanup>

This makes this:

    with some_generator(<arguments>) as <variable>:
        <body>

equivalent to this:

    <setup>
    try:
        <variable> = <value>
        <body>
    finally:
        <cleanup>
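
A concrete instance of this template may make the ordering easier to see. The names `recorded` and `events` below are made up for the illustration; only `contextlib.contextmanager` comes from the standard library.

```python
from contextlib import contextmanager

events = []


@contextmanager
def recorded(name):
    events.append("setup " + name)         # <setup>
    try:
        yield name.upper()                 # <value>, bound by "as"
    finally:
        events.append("cleanup " + name)   # <cleanup>, runs even on error


with recorded("demo") as value:
    events.append("body " + value)         # <body>

print(events)
```

The recorded order is setup, body, cleanup, which is the equivalence the docstring describes.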

class EduNLP.SIF.segment.segment.Symbol[source]¶

class EduNLP.SIF.segment.segment.TextSegment[source]¶

class EduNLP.SIF.segment.segment.LatexFormulaSegment[source]¶

class EduNLP.SIF.segment.segment.Figure(is_base64=False)[source]¶

Decode a figure that has been encoded in base64.

classmethod base64_to_numpy(figure: str)[source]¶: Create an array from a designated buffer

class EduNLP.SIF.segment.segment.FigureFormulaSegment(src, is_base64=False, figure_instance: (<class 'dict'>, <class 'bool'>) = None)[source]¶: Deal with figure formulas, especially those encoded in base64

class EduNLP.SIF.segment.segment.FigureSegment(src, is_base64=False, figure_instance: (<class 'dict'>, <class 'bool'>) = None)[source]¶: Deal with figures, especially those encoded in base64

class EduNLP.SIF.segment.segment.QuesMarkSegment[source]¶

class EduNLP.SIF.segment.segment.TagSegment[source]¶

class EduNLP.SIF.segment.segment.SepSegment[source]¶

class EduNLP.SIF.segment.segment.SegmentList(item, figures: Optional[dict] = None)[source]¶

Parameters

item (str) –
figures (dict) –

Examples

>>> test_item = r"如图所示，则三角形$ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
>>> SegmentList(test_item)
['如图所示，则三角形', 'ABC', '的面积是', '\\SIFBlank', '。', \FigureID{1}]

append(segment) → None[source]¶: add segment to corresponding segments

property segments¶: return segments

property text_segments¶: return text segments

property formula_segments¶: return formula segments

property figure_segments¶: return figure segments

property ques_mark_segments¶: return question mark segments

property tag_segments¶: return tag segments

to_symbol(idx, symbol)[source]¶: switch element to its symbol

symbolize(to_symbolize='fgm')[source]¶

Switch designated elements to symbol. It is a good way to protect or preserve the elements which we don’t want to tokenize.

Parameters: to_symbolize – “t”: text; “f”: formula; “g”: figure; “m”: question mark; “a”: tag; “s”: sep
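
Symbolization amounts to replacing every segment of a selected type with a fixed placeholder. The sketch below illustrates that idea on a plain list of (type, value) pairs, which is a hypothetical representation and not how SegmentList stores its segments. The placeholder strings are the ones that appear in the seg and sif4sci examples on this page.

```python
# Placeholders as they appear in the seg() and sif4sci() examples.
SYMBOLS = {
    "t": "[TEXT]", "f": "[FORMULA]", "g": "[FIGURE]",
    "m": "[MARK]", "a": "[TAG]", "s": "[SEP]",
}


def symbolize(segments, to_symbolize="fgm"):
    """Replace each (type, value) pair whose type is selected by its symbol."""
    return [SYMBOLS[kind] if kind in to_symbolize else value
            for kind, value in segments]


segments = [
    ("t", "如图所示，则"), ("f", r"\bigtriangleup ABC"), ("t", "的面积是"),
    ("m", r"\SIFBlank"), ("t", "。"), ("g", r"\FigureID{1}"),
]
print(symbolize(segments))
print(symbolize(segments, "tfgm"))
```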

filter(drop: (<class 'set'>, <class 'str'>) = '', keep: (<class 'set'>, <class 'str'>) = '*')[source]¶

Selectively output the specified elements. drop hides the selected segment types; keep shows only the selected segment types.

Parameters

drop (set or str) – each letter must be one of “tfgmas”; the selected segment types are dropped from the return value.
keep (set or str) – each letter must be one of “tfgmas”; only the selected segment types are kept in the return value.
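
Because filter is used in a with statement, the restricted view is temporary. One way such a method could be built is sketched below. `TypedList` is a hypothetical stand-in for SegmentList, and restoring the full view on exit is an assumption based on how the examples on this page behave.

```python
from contextlib import contextmanager


class TypedList:
    """Hypothetical stand-in for SegmentList: a list of (type, value) pairs."""

    ALL = "tfgmas"

    def __init__(self, pairs):
        self._pairs = list(pairs)
        self._visible = set(self.ALL)

    @property
    def values(self):
        return [v for k, v in self._pairs if k in self._visible]

    @contextmanager
    def filter(self, drop="", keep="*"):
        kept = set(self.ALL) if keep == "*" else set(keep)
        previous = self._visible
        self._visible = kept - set(drop)
        try:
            yield self
        finally:
            self._visible = previous  # restore the previous view on exit


s = TypedList([("t", "如图所示，则"), ("f", "ABC"), ("m", r"\SIFBlank")])
with s.filter("f"):
    print(s.values)
with s.filter(keep="t"):
    print(s.values)
print(s.values)
```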

describe()[source]¶: show the number of segments of each type

EduNLP.SIF.segment.segment.seg(item, figures=None, symbol=None)[source]¶

An interface for SegmentList that presents the segments in an appropriate way.

Parameters

item (str) –
figures (dict, optional) –
symbol (str, optional) –

Returns

segmented item

Return type

list

Examples

>>> test_item = r"如图所示，则$\bigtriangleup ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
>>> s = seg(test_item)
>>> s
['如图所示，则', '\\bigtriangleup ABC', '的面积是', '\\SIFBlank', '。', \FigureID{1}]
>>> s.describe()
{'t': 3, 'f': 1, 'g': 1, 'm': 1}
>>> with s.filter("f"):
...     s
['如图所示，则', '的面积是', '\\SIFBlank', '。', \FigureID{1}]
>>> with s.filter(keep="t"):
...     s
['如图所示，则', '的面积是', '。']
>>> with s.filter():
...     s
['如图所示，则', '\\bigtriangleup ABC', '的面积是', '\\SIFBlank', '。', \FigureID{1}]
>>> seg(test_item, symbol="fgm")
['如图所示，则', '[FORMULA]', '的面积是', '[MARK]', '。', '[FIGURE]']
>>> seg(test_item, symbol="tfgm")
['[TEXT]', '[FORMULA]', '[TEXT]', '[MARK]', '[TEXT]', '[FIGURE]']
>>> seg(r"如图所示，则$\FormFigureID{0}$的面积是$\SIFBlank$。$\FigureID{1}$")
['如图所示，则', \FormFigureID{0}, '的面积是', '\\SIFBlank', '。', \FigureID{1}]
>>> seg(r"如图所示，则$\FormFigureID{0}$的面积是$\SIFBlank$。$\FigureID{1}$", symbol="fgm")
['如图所示，则', '[FORMULA]', '的面积是', '[MARK]', '。', '[FIGURE]']
>>> s.text_segments
['如图所示，则', '的面积是', '。']
>>> s.formula_segments
['\\bigtriangleup ABC']
>>> s.figure_segments
[\FigureID{1}]
>>> s.ques_mark_segments
['\\SIFBlank']
>>> test_item_1 = {
...     "stem": r"若复数$z=1+2 i+i^{3}$，则$|z|=$",
...     "options": ['0', '1', r'$\sqrt{2}$', '2']
... }
>>> from EduNLP.utils import dict2str4sif
>>> test_item_1_str = dict2str4sif(test_item_1)
>>> test_item_1_str
'$\\SIFTag{stem_begin}$...$\\SIFTag{stem_end}$$\\SIFTag{options_begin}$$\\SIFTag{list_0}$0...$\\SIFTag{options_end}$'
>>> s1 = seg(test_item_1_str, symbol="tfgm")
>>> s1
['\\SIFTag{stem_begin}'...'\\SIFTag{stem_end}', '\\SIFTag{options_begin}', '\\SIFTag{list_0}', ...]
>>> with s1.filter(keep="a"):
...     s1
[...'\\SIFTag{list_0}', '\\SIFTag{list_1}', '\\SIFTag{list_2}', '\\SIFTag{list_3}', '\\SIFTag{options_end}']
>>> s1.tag_segments
['\\SIFTag{stem_begin}', '\\SIFTag{stem_end}', '\\SIFTag{options_begin}', ... '\\SIFTag{options_end}']
>>> test_item_1_str_2 = dict2str4sif(test_item_1, tag_mode="head", add_list_no_tag=False)
>>> seg(test_item_1_str_2, symbol="tfgmas")
['[TAG]', ... '[TAG]', '[TEXT]', '[SEP]', '[TEXT]', '[SEP]', '[FORMULA]', '[SEP]', '[TEXT]']
>>> s2 = seg(test_item_1_str_2, symbol="fgm")
>>> s2.tag_segments
['\\SIFTag{stem}', '\\SIFTag{options}']
>>> test_item_2 = r"已知$y=x$，则以下说法中$\textf{正确,b}$的是"
>>> s2 = seg(test_item_2)
>>> s2.text_segments
['已知', '，则以下说法中正确的是']
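
The basic split performed in these examples, plain text outside `$...$` and typed content inside, can be imitated with a short scanner. This is a simplified sketch and not the library's algorithm: `toy_seg` is a hypothetical name, it returns (type, content) pairs instead of a SegmentList, and it knows nothing about tags, separators, FormFigureID or base64 figures.

```python
import re

_DOLLAR_SPAN = re.compile(r"\$([^$]*)\$")


def toy_seg(item: str):
    """Split an item into ("t" | "f" | "g" | "m", content) pairs."""
    segments, pos = [], 0
    for match in _DOLLAR_SPAN.finditer(item):
        if match.start() > pos:
            # Plain text between the previous span and this one.
            segments.append(("t", item[pos:match.start()]))
        body = match.group(1)
        if body.startswith(r"\FigureID"):
            kind = "g"
        elif body in (r"\SIFBlank", r"\SIFChoice"):
            kind = "m"
        else:
            kind = "f"
        segments.append((kind, body))
        pos = match.end()
    if pos < len(item):
        segments.append(("t", item[pos:]))
    return segments


item = r"如图所示，则$\bigtriangleup ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
for kind, content in toy_seg(item):
    print(kind, content)
```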

Tokenization¶

tokenize¶

class EduNLP.SIF.tokenization.tokenization.TokenList(segment_list: SegmentList, text_params=None, formula_params=None, figure_params=None)[source]¶

Parameters

segment_list (list) – segmented item
text_params (dict) –
formula_params (dict) –
figure_params (dict) –

add_seg_type(seg_type, tar: list, add_seg_type=True, mode='delimiter')[source]¶

Add segment-type tags at different positions.

Parameters

seg_type (str) – t: text; f: formula
tar (list) –
add_seg_type – if False, the function is not executed.
mode (str) – delimiter: both at the head and at the tail; head: only at the head; tail: only at the tail
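
The three modes can be illustrated with a small function. The tag strings for the delimiter mode follow the `[TEXT_BEGIN]` / `[FORMULA_END]` tokens shown in the sif4sci examples. Which tags the head and tail modes emit is an assumption here, and `tag_segment` and `enabled` are hypothetical names chosen so the sketch does not shadow the real method.

```python
_NAMES = {"t": "TEXT", "f": "FORMULA"}


def tag_segment(seg_type, tokens, enabled=True, mode="delimiter"):
    """Return tokens wrapped with begin/end tags according to mode."""
    if not enabled:
        return list(tokens)
    if mode not in ("delimiter", "head", "tail"):
        raise ValueError("unknown mode: " + mode)
    name = _NAMES[seg_type]
    # Assumption: "head" emits only the begin tag, "tail" only the end tag.
    head = ["[%s_BEGIN]" % name] if mode in ("delimiter", "head") else []
    tail = ["[%s_END]" % name] if mode in ("delimiter", "tail") else []
    return head + list(tokens) + tail


formula = ["mathord", "=", "textord"]
print(tag_segment("f", formula))
print(tag_segment("f", formula, mode="head"))
print(tag_segment("f", formula, enabled=False))
```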

get_segments(add_seg_type=True, add_seg_mode='delimiter', keep='*', drop='', depth=None)[source]¶

Call the segment function and return the tokens grouped by segment.

Parameters

add_seg_type –
add_seg_mode – delimiter: both at the head and at the tail; head: only at the head; tail: only at the tail
keep –
drop –
depth (int or None) – 0: only separate at SIFSep; 1: only separate at SIFTag; 2: separate at SIFTag and SIFSep; otherwise, separate all segments

Returns

segmented item

Return type

list

property text_segments¶: get text segment

property formula_segments¶: get formula segment

property figure_segments¶: get figure segment

property ques_mark_segments¶: get question mark segment

property tokens¶: add token to a list

append_text(segment, symbol=False)[source]¶: append text

append_formula(segment, symbol=False, init=True)[source]¶: append formula by different methods

append_figure(segment, **kwargs)[source]¶: append figure

append_ques_mark(segment, **kwargs)[source]¶: append question mark

append_tag(segment, **kwargs)[source]¶: append tag

append_sep(segment, **kwargs)[source]¶: append sep

append(segment, lazy=False)[source]¶

The general API for appending elements.

Parameters

segment –
lazy – True: doesn’t distinguish parameters; False: makes identical parameters share the same number.

extend(segments)[source]¶: append every segment in turn

property text_tokens¶: return text tokens

property formula_tokens¶: return formula tokens

property figure_tokens¶: return figure tokens

property ques_mark_tokens¶: return question mark tokens

property inner_formula_tokens¶: return inner formula tokens

filter(drop: (<class 'set'>, <class 'str'>) = '', keep: (<class 'set'>, <class 'str'>) = '*')[source]¶

Selectively output the specified elements. drop hides the selected segment types; keep shows only the selected segment types.

Parameters

drop (set or str) – each letter must be one of “tfgmas”; the selected segment types are dropped from the return value.
keep (set or str) – each letter must be one of “tfgmas”; only the selected segment types are kept in the return value.

Returns

filtered list

Return type

list

describe()[source]¶: show the total number of elements of each type

EduNLP.SIF.tokenization.tokenization.tokenize(segment_list: SegmentList, text_params=None, formula_params=None, figure_params=None)[source]¶

The actual API for tokenizing an item.

Parameters

segment_list (list) – segmented item
text_params (dict) – the method used to deal with text
formula_params (dict) – the method used to deal with formulas
figure_params (dict) – the method used to deal with figures

Returns

tokenized item

Return type

list

Examples

>>> items = r"如图所示，则三角形$ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
>>> tokenize(SegmentList(items))
['如图所示', '三角形', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tokenize(SegmentList(items),formula_params={"method": "ast"})
['如图所示', '三角形', <Formula: ABC>, '面积', '\\SIFBlank', \FigureID{1}]

EduNLP.SIF.tokenization.tokenization.link_formulas(*token_list: TokenList, link_vars=True)[source]¶: call the formula-linking function on the formulas of the given token lists

text¶

EduNLP.SIF.tokenization.text.tokenize(text, granularity='word', stopwords='default')[source]¶

Use the jieba library to tokenize text by word or by character.

Parameters

text –
granularity (str) – “word” or “char”
stopwords (str, None or set) –

Examples

>>> tokenize("三角函数是基本初等函数之一")
['三角函数', '初等', '函数']
>>> tokenize("三角函数是基本初等函数之一", granularity="char")
['三', '角', '函', '数', '基', '初', '函', '数']
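
Character-granularity tokenization with stopword filtering can be sketched without jieba. The snippet below is an illustration only: `char_tokenize` and `TOY_STOPWORDS` are made-up names, and because the stopword set is a tiny invented one, its output differs from the library result shown above, which relies on EduNLP's own default list.

```python
# A tiny, made-up stopword set; EduNLP ships its own default list.
TOY_STOPWORDS = {"是", "之", "一"}


def char_tokenize(text, stopwords=None):
    """Split text into characters, skipping whitespace and stopwords."""
    stopwords = set() if stopwords is None else set(stopwords)
    return [ch for ch in text if not ch.isspace() and ch not in stopwords]


print(char_tokenize("三角函数是基本初等函数之一", TOY_STOPWORDS))
print(char_tokenize("三角 函数"))
```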

formula¶

EduNLP.SIF.tokenization.formula.formula.linear_tokenize(formula, preserve_braces=True, number_as_tag=False, *args, **kwargs)[source]¶

Linearly tokenize a formula. The process consists of three steps: cut, reduce and connect_char.

Parameters

formula –
preserve_braces –
number_as_tag –
args –
kwargs –

Examples

>>> linear_tokenize(r"{x + y}^\frac{1}{2} + 1 = 0")
['{', 'x', '+', 'y', '}', '^', '\\frac', '{', '1', '}', '{', '2', '}', '+', '1', '=', '0']
>>> linear_tokenize(r"ABC,AB,AC")
['ABC', ',', 'AB', ',', 'AC']
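
For simple inputs, the token sequences above can be reproduced with a single regular expression. This is a rough approximation and not the cut / reduce / connect_char pipeline: `toy_linear_tokenize` is a hypothetical name, and the pattern only distinguishes LaTeX commands, letter runs, numbers and single symbols.

```python
import re

# Order matters: commands first, then letter runs, numbers, any other symbol.
_TOKEN = re.compile(r"\\[A-Za-z]+|[A-Za-z]+|\d+(?:\.\d+)?|\S")


def toy_linear_tokenize(formula, preserve_braces=True):
    """Tokenize a LaTeX formula from left to right."""
    tokens = _TOKEN.findall(formula)
    if not preserve_braces:
        tokens = [t for t in tokens if t not in ("{", "}")]
    return tokens


print(toy_linear_tokenize(r"{x + y}^\frac{1}{2} + 1 = 0"))
print(toy_linear_tokenize(r"ABC,AB,AC"))
print(toy_linear_tokenize(r"\frac{1}{2}", preserve_braces=False))
```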

EduNLP.SIF.tokenization.formula.formula.ast_tokenize(formula, ord2token=False, var_numbering=False, return_type='formula', *args, **kwargs)[source]¶

Tokenize the formula in different ways according to return_type.

Parameters

formula –
ord2token –
var_numbering –
return_type –
args –
kwargs –

Examples

>>> ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x", return_type="list")
['x', '+', 'y', '{ }', '\\pi', '{ }', '2', '{ }', '\\frac', '\\supsub', '+', '1', '=', 'x']
>>> ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x", return_type="list", ord2token=True)
['mathord', '+', 'mathord', '{ }', 'mathord', '{ }', 'textord', '{ }', '\\frac', '\\supsub', '+', 'textord', '=', 'mathord']
>>> ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x", return_type="list", ord2token=True, var_numbering=True)
['mathord_0', '+', 'mathord_1', '{ }', 'mathord_con', '{ }', 'textord', '{ }', '\\frac', '\\supsub', '+', 'textord', '=', 'mathord_0']
>>> len(ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x", return_type="ast").nodes)
14
>>> ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x")
<Formula: {x + y}^\frac{\pi}{2} + 1 = x>
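
The effect of ord2token together with var_numbering can be imitated on an already tokenized list. The sketch below is an assumption-laden illustration: `number_tokens` is a hypothetical name, variables are taken to be single letters, and the set of constants contains only `\pi` because that is the only constant visible in the examples. The library decides these things from the parsed AST instead.

```python
def number_tokens(tokens, constants=("\\pi",)):
    """Map variables to mathord_<n>, constants to mathord_con, digits to textord."""
    ids = {}
    out = []
    for tok in tokens:
        if tok in constants:
            out.append("mathord_con")
        elif len(tok) == 1 and tok.isalpha():
            # The same letter always receives the same number.
            ids.setdefault(tok, len(ids))
            out.append("mathord_%d" % ids[tok])
        elif tok.isdigit():
            out.append("textord")
        else:
            out.append(tok)
    return out


plain = ["x", "+", "y", "{ }", "\\pi", "{ }", "2", "{ }",
         "\\frac", "\\supsub", "+", "1", "=", "x"]
print(number_tokens(plain))
```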

EduNLP.SIF.tokenization.formula.formula.tokenize(formula, method='linear', errors='raise', **kwargs)[source]¶

The general function for tokenizing a formula with the linear or ast method.

Parameters

formula –
method –
errors (str) – how to handle an exception that occurs in ast tokenization: “coerce”: fall back to linear_tokenize; “raise”: raise the exception
kwargs –

Examples

>>> tokenize(r"\frac{\pi}{x + y} + 1 = x")
['\\frac', '{', '\\pi', '}', '{', 'x', '+', 'y', '}', '+', '1', '=', 'x']
>>> tokenize(r"\frac{\pi}{x + y} + 1 = x", method="ast", ord2token=True)
<Formula: \frac{\pi}{x + y} + 1 = x>
>>> tokenize(r"\frac{\pi}{x + y} + 1 = x", method="ast", ord2token=True, return_type="list")
['mathord', '{ }', 'mathord', '+', 'mathord', '{ }', '\\frac', '+', 'textord', '=', 'mathord']
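
The difference between errors="raise" and errors="coerce" is a try-then-fall-back pattern. The sketch below shows that pattern with stand-in tokenizers. `tokenize_with_fallback`, `strict_tokenizer` and `lenient_tokenizer` are hypothetical names, and neither stand-in resembles the real ast or linear tokenizer.

```python
def tokenize_with_fallback(formula, primary, fallback, errors="raise"):
    """Try primary(formula); on failure either re-raise or use fallback."""
    if errors not in ("raise", "coerce"):
        raise ValueError("errors must be 'raise' or 'coerce'")
    try:
        return primary(formula)
    except Exception:
        if errors == "coerce":
            return fallback(formula)
        raise


def strict_tokenizer(formula):
    # Stand-in for an AST tokenizer that rejects unbalanced braces.
    if formula.count("{") != formula.count("}"):
        raise ValueError("unbalanced braces")
    return ["ast", formula]


def lenient_tokenizer(formula):
    # Stand-in for a linear tokenizer that never fails.
    return formula.split()


print(tokenize_with_fallback("x + {y", strict_tokenizer, lenient_tokenizer,
                             errors="coerce"))
```

With errors="raise" the same call would propagate the ValueError from the strict stand-in.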

class EduNLP.SIF.tokenization.formula.ast_token.Formula(formula: (<class 'str'>, typing.List[typing.Dict]), variable_standardization=False, const_mathord=None, init=True, *args, **kwargs)[source]¶

Transforms a formula into its parsed abstract syntax tree.

Parameters

formula (str or List[Dict]) – latex formula string or the parsed abstract syntax tree
variable_standardization –
const_mathord –
init –
args –
kwargs –

Examples

>>> f = Formula("x")
>>> f
<Formula: x>
>>> f.ast
[{'val': {'id': 0, 'type': 'mathord', 'text': 'x', 'role': None}, 'structure': {'bro': [None, None], 'child': None, 'father': None, 'forest': None}}]
>>> f.elements
[{'id': 0, 'type': 'mathord', 'text': 'x', 'role': None}]
>>> f.variable_standardization(inplace=True)
<Formula: x>
>>> f.elements
[{'id': 0, 'type': 'mathord', 'text': 'x', 'role': None, 'var': 0}]

variable_standardization(inplace=False, const_mathord=None, variable_connect_dict=None)[source]¶

Makes identical variables share the same number.

Parameters

inplace –
const_mathord –
variable_connect_dict –
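
Sharing one variable table across several formulas keeps the numbering consistent between them, which is the effect link_formulas has in the sif4sci examples. The sketch below shows the idea on token lists. It is one plausible reading of what variable_connect_dict is for, not a description of the implementation, and `standardize` is a hypothetical name.

```python
def standardize(tokens, var_ids):
    """Number single-letter variables, recording them in the shared var_ids."""
    out = []
    for tok in tokens:
        if len(tok) == 1 and tok.isalpha():
            var_ids.setdefault(tok, len(var_ids))
            out.append("mathord_%d" % var_ids[tok])
        else:
            out.append(tok)
    return out


shared = {}
stem = standardize(["x", "=", "2"], shared)
option = standardize(["y", "<", "x"], shared)      # reuses x's number
independent = standardize(["y", "<", "x"], {})     # numbered on its own
print(stem, option, independent)
```

With the shared table, x keeps number 0 in the option because it was first seen in the stem; numbered independently, y would take number 0 instead.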

property ast¶

property elements¶

property ast_graph: (<class 'networkx.classes.graph.Graph'>, <class 'networkx.classes.digraph.DiGraph'>)¶

to_str()[source]¶

reset_ast(formula_ensure_str: bool = True, variable_standardization=False, const_mathord=None, *args, **kwargs)[source]¶

property resetable¶

EduNLP.SIF.tokenization.formula.ast_token.traversal_formula(ast, ord2token=False, var_numbering=False, strategy='post', *args, **kwargs)[source]¶: Runs only when the return type is list. It provides two strategies, post and linear, and appends each node to the token list according to its type.

EduNLP.SIF.tokenization.formula.ast_token.ast_tokenize(formula, ord2token=False, var_numbering=False, return_type='formula', *args, **kwargs)[source]¶

Tokenize the formula in different ways according to return_type.

Parameters

formula –
ord2token –
var_numbering –
return_type –
args –
kwargs –

Examples

>>> ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x", return_type="list")
['x', '+', 'y', '{ }', '\\pi', '{ }', '2', '{ }', '\\frac', '\\supsub', '+', '1', '=', 'x']
>>> ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x", return_type="list", ord2token=True)
['mathord', '+', 'mathord', '{ }', 'mathord', '{ }', 'textord', '{ }', '\\frac', '\\supsub', '+', 'textord', '=', 'mathord']
>>> ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x", return_type="list", ord2token=True, var_numbering=True)
['mathord_0', '+', 'mathord_1', '{ }', 'mathord_con', '{ }', 'textord', '{ }', '\\frac', '\\supsub', '+', 'textord', '=', 'mathord_0']
>>> len(ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x", return_type="ast").nodes)
14
>>> ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x")
<Formula: {x + y}^\frac{\pi}{2} + 1 = x>

class EduNLP.SIF.tokenization.formula.linear_token.IntFlag(value)[source]¶: Support for integer-based Flags

EduNLP.SIF.tokenization.formula.linear_token.cut(formula, preserve_braces=True, with_dollar=False, preserve_dollar=False, number_as_tag=False, preserve_src=True)[source]¶

cut formula thoroughly

Parameters

formula (str) –
preserve_braces – when it is False, “{” and “}” will be filtered out
with_dollar – whether the formula is wrapped in “$”
preserve_dollar – whether to keep “$”
number_as_tag – whether to replace numbers with tags; only numbers with more than one digit are identified.
preserve_src –

Returns

a preliminary list of fully cut pieces

Return type

list

Examples

>>> cut(r"${x + y}^\frac{1}{2} + 12.1 = 0$")
['{x + y}', '^', '\\f', 'r', 'a', 'c', '{1}', '{2}', '+', '12.1', '=', '0']
>>> cut(r"${x + y}^\frac{1}{2} + 12.1 = 0$",preserve_dollar=False)
['{x + y}', '^', '\\f', 'r', 'a', 'c', '{1}', '{2}', '+', '12.1', '=', '0']
>>> cut(r"${x + y}^\frac{1}{2} + 12.1 = 0$",number_as_tag=True)
['{x + y}', '^', '\\f', 'r', 'a', 'c', '{1}', '{2}', '+', '{decimal}', '=', '0']

EduNLP.SIF.tokenization.formula.linear_token.reduce(fea)[source]¶: restore some formula

EduNLP.SIF.tokenization.formula.linear_token.connect_char(words)[source]¶: connect and switch to list type

EduNLP.SIF.tokenization.formula.linear_token.latex_parse(formula, preserve_braces=True, with_dollar=True, preserve_dollar=False, number_as_tag=False, preserve_src=True)[source]¶

EduNLP.SIF.tokenization.formula.linear_token.linear_tokenize(formula, preserve_braces=True, number_as_tag=False, *args, **kwargs)[source]¶

Linearly tokenize a formula. The process consists of three steps: cut, reduce and connect_char.

Parameters

formula –
preserve_braces –
number_as_tag –
args –
kwargs –

Examples

>>> linear_tokenize(r"{x + y}^\frac{1}{2} + 1 = 0")
['{', 'x', '+', 'y', '}', '^', '\\frac', '{', '1', '}', '{', '2', '}', '+', '1', '=', '0']
>>> linear_tokenize(r"ABC,AB,AC")
['ABC', ',', 'AB', ',', 'AC']