EduNLP.SIF

SIF

EduNLP.SIF.sif.is_sif(item, check_formula=True, return_parser=False)[source]

Check whether the input item is in SIF format.

Parameters
  • item (str) – a raw item that represents the stem

  • check_formula (bool) –

    whether to check the formulas when parsing the item.

    True: check the validity of the formulas in the item. False: skip the validity check, which is faster.

  • return_parser (bool) –

    whether to include the parsed item in the return value.

    True: the return value is (bool, Parser). False: the return value is bool.

Returns

True (and the Parser of the item, if return_parser=True) when the item is already in standard format; False (and the Parser of the item, if return_parser=True) when it is not. Raises ValueError when the item cannot be parsed correctly.

Return type

bool, or (bool, Parser) when return_parser=True

Examples

>>> text = '若$x,y$满足约束条件' \
...        '$\\left\\{\\begin{array}{c}2 x+y-2 \\leq 0 \\\\ x-y-1 \\geq 0 \\\\ y+1 \\geq 0\\end{array}\\right.$,' \
...        '则$z=x+7 y$的最大值$\\SIFUnderline$'
>>> is_sif(text)
True
>>> text = '某校一个课外学习小组为研究某作物的发芽率y和温度x(单位...'
>>> ret = is_sif(text, return_parser=True)
>>> ret 
(False, <EduNLP.SIF.parser.parser.Parser object...>)
EduNLP.SIF.sif.to_sif(item, check_formula=True, parser: Optional[Parser] = None)[source]

Convert an item to SIF format.

Parameters
  • item (str) – a raw item that represents the stem

  • check_formula (bool) – whether to check the formulas when parsing the item (only takes effect when parser=None).

  • parser (Parser) – the parser of the item, as returned by is_sif.

Returns

item – the item converted to SIF format

Return type

str

Examples

>>> text = '某校一个课外学习小组为研究某作物的发芽率y和温度x(单位...'
>>> siftext = to_sif(text)
>>> siftext
'某校一个课外学习小组为研究某作物的发芽率$y$和温度$x$(单位...'
>>> ret = is_sif(text, return_parser=True)
>>> ret 
(False, <EduNLP.SIF.parser.parser.Parser object...>)
>>> to_sif(text, parser=ret[1])
'某校一个课外学习小组为研究某作物的发芽率$y$和温度$x$(单位...'
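
The second half of the example above reuses the Parser returned by is_sif, so the item is not parsed twice. A minimal sketch of that check-then-convert pattern follows. The helper name `ensure_sif` is hypothetical and not part of EduNLP; the two functions are passed in as arguments, and the stubs only mimic the documented return shapes so the snippet runs without the library.

```python
def ensure_sif(text, is_sif, to_sif, check_formula=True):
    """Return `text` unchanged if it is already SIF, else the converted text.

    `is_sif` and `to_sif` are expected to follow the signatures documented
    above; they are injected so this helper does not import anything.
    """
    ok, parser = is_sif(text, check_formula=check_formula, return_parser=True)
    if ok:
        return text
    # Reuse the parser produced by is_sif instead of parsing the item again.
    return to_sif(text, parser=parser)


# Stand-in stubs (assumptions for illustration only, NOT EduNLP code).
def fake_is_sif(item, check_formula=True, return_parser=False):
    ok = "$" in item
    return (ok, object()) if return_parser else ok


def fake_to_sif(item, check_formula=True, parser=None):
    return item.replace("x", "$x$")


converted = ensure_sif("温度x", fake_is_sif, fake_to_sif)
unchanged = ensure_sif("温度$x$", fake_is_sif, fake_to_sif)
```

With the real library, the same helper would presumably be called as `ensure_sif(text, is_sif, to_sif)`.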
EduNLP.SIF.sif.sif4sci(item: str, figures: (<class 'dict'>, <class 'bool'>) = None, mode: int = 2, symbol: str = None, tokenization=True, tokenization_params=None, errors='raise')[source]

Uses the linear tokenizer by default; change the tokenizer by specifying tokenization_params.

Parameters
  • item (str) – a raw item that represents the stem

  • figures (dict or bool) – when it is a dict, it is the id-to-instance mapping for figures in ‘FormFigureID{…}’ format; when it is a bool, it indicates whether to instantiate figures in ‘FormFigureBase64{…}’ format

  • mode (int) –

    2: use is_sif and check the formulas in the item
    1: use is_sif but don’t check the formulas in the item
    0: don’t use is_sif and don’t check anything in the item

  • symbol (str) –

    select which segment types to symbolize:

    “t”: text, “f”: formula, “g”: figure, “m”: question mark, “a”: tag, “s”: sep

  • tokenization (bool) – whether to tokenize item after segmentation

  • tokenization_params (dict) –

    a dict with the keys text_params, formula_params and figure_params, used in tokenization.

    For formula_params:

    method: which tokenizer to use, “linear” or “ast”

    Parameters only used by “linear”:

    skip_figure_formula: whether to skip formulas in figure format
    symbolize_figure_formula: whether to symbolize formulas in figure format

    Parameters only used by “ast”:

    ord2token: whether to convert variables (mathord) and constants (textord) to special tokens
    var_numbering: whether to use a numeric suffix to distinguish different variables
    return_type: ‘list’ or ‘ast’

    More parameters can be found in the definitions in SIF.tokenization.formula.

    For figure_params:

    figure_instance: whether to return figure instances in the tokens

    For text_params:

    See the definition in SIF.tokenization.text.

  • errors – warn, raise, coerce, strict, ignore
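
As a rough illustration of the nested layout of tokenization_params described above, the sketch below builds one such dict and sanity-checks it. Every key comes from the parameter description; the helper `validate_tokenization_params` is hypothetical and not part of EduNLP, and the assumption that method falls back to "linear" when omitted is taken only from the note that the linear tokenizer is the default.

```python
ALLOWED_KEYS = {"text_params", "formula_params", "figure_params"}
FORMULA_METHODS = {"linear", "ast"}


def validate_tokenization_params(params):
    """Reject unknown top-level keys and unknown formula tokenizers."""
    unknown = set(params) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unexpected keys: {sorted(unknown)}")
    # Assumed default: the linear tokenizer is used when no method is given.
    method = params.get("formula_params", {}).get("method", "linear")
    if method not in FORMULA_METHODS:
        raise ValueError(f"unknown formula tokenizer: {method!r}")
    return params


tokenization_params = validate_tokenization_params({
    "formula_params": {
        "method": "ast",          # parameters below only apply to "ast"
        "return_type": "list",
        "ord2token": True,
        "var_numbering": True,
    },
    "figure_params": {"figure_instance": True},
})
```

A dict of this shape is what the examples in this entry pass as `sif4sci(item, tokenization_params=...)`.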

Returns

SegmentList when tokenization is False; TokenList when tokenization is True

Return type

list

Examples

>>> test_item = r"如图所示,则$\bigtriangleup ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
>>> tl = sif4sci(test_item)
>>> tl
['如图所示', '\\bigtriangleup', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tl.describe()
{'t': 2, 'f': 2, 'g': 1, 'm': 1}
>>> with tl.filter('fgm'):
...     tl
['如图所示', '面积']
>>> with tl.filter(keep='t'):
...     tl
['如图所示', '面积']
>>> with tl.filter():
...     tl
['如图所示', '\\bigtriangleup', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tl.text_tokens
['如图所示', '面积']
>>> tl.formula_tokens
['\\bigtriangleup', 'ABC']
>>> tl.figure_tokens
[\FigureID{1}]
>>> tl.ques_mark_tokens
['\\SIFBlank']
>>> sif4sci(test_item, symbol="gm", tokenization_params={"formula_params": {"method": "ast"}})
['如图所示', <Formula: \bigtriangleup ABC>, '面积', '[MARK]', '[FIGURE]']
>>> sif4sci(test_item, symbol="tfgm")
['[TEXT]', '[FORMULA]', '[TEXT]', '[MARK]', '[TEXT]', '[FIGURE]']
>>> sif4sci(test_item, symbol="gm",
... tokenization_params={"formula_params": {"method": "ast", "return_type": "list"}})
['如图所示', '\\bigtriangleup', 'A', 'B', 'C', '面积', '[MARK]', '[FIGURE]']
>>> test_item_1 = {
...     "stem": r"若$x=2$, $y=\sqrt{x}$,则下列说法正确的是$\SIFChoice$",
...     "options": [r"$x < y$", r"$y = x$", r"$y < x$"]
... }
>>> tls = [
...     sif4sci(e, symbol="gm",
...     tokenization_params={
...         "formula_params": {
...             "method": "ast", "return_type": "list", "ord2token": True, "var_numbering": True,
...             "link_variable": False}
...     })
...     for e in ([test_item_1["stem"]] + test_item_1["options"])
... ]
>>> tls[1:]
[['mathord_0', '<', 'mathord_1'], ['mathord_0', '=', 'mathord_1'], ['mathord_0', '<', 'mathord_1']]
>>> link_formulas(*tls)
>>> tls[1:]
[['mathord_0', '<', 'mathord_1'], ['mathord_1', '=', 'mathord_0'], ['mathord_1', '<', 'mathord_0']]
>>> from EduNLP.utils import dict2str4sif
>>> test_item_1_str = dict2str4sif(test_item_1, tag_mode="head", add_list_no_tag=False)
>>> test_item_1_str  
'$\\SIFTag{stem}$...则下列说法正确的是$\\SIFChoice$$\\SIFTag{options}$$x < y$$\\SIFSep$$y = x$$\\SIFSep$$y < x$'
>>> tl1 = sif4sci(test_item_1_str, symbol="gm",
... tokenization_params={"formula_params": {"method": "ast", "return_type": "list", "ord2token": True}})
>>> tl1.get_segments()[0]
['\\SIFTag{stem}']
>>> tl1.get_segments()[1:3]
[['[TEXT_BEGIN]', '[TEXT_END]'], ['[FORMULA_BEGIN]', 'mathord', '=', 'textord', '[FORMULA_END]']]
>>> tl1.get_segments(add_seg_type=False)[0:3]
[['\\SIFTag{stem}'], ['mathord', '=', 'textord'], ['mathord', '=', 'mathord', '{ }', '\\sqrt']]
>>> test_item_2 = {"options": [r"$x < y$", r"$y = x$", r"$y < x$"]}
>>> test_item_2
{'options': ['$x < y$', '$y = x$', '$y < x$']}
>>> test_item_2_str = dict2str4sif(test_item_2, tag_mode="head", add_list_no_tag=False)
>>> test_item_2_str
'$\\SIFTag{options}$$x < y$$\\SIFSep$$y = x$$\\SIFSep$$y < x$'
>>> tl2 = sif4sci(test_item_2_str, symbol="gms",
... tokenization_params={"formula_params": {"method": "ast", "return_type": "list"}})
>>> tl2
['\\SIFTag{options}', 'x', '<', 'y', '[SEP]', 'y', '=', 'x', '[SEP]', 'y', '<', 'x']
>>> tl2.get_segments(add_seg_type=False)
[['\\SIFTag{options}'], ['x', '<', 'y'], ['[SEP]'], ['y', '=', 'x'], ['[SEP]'], ['y', '<', 'x']]
>>> tl2.get_segments(add_seg_type=False, drop="s")
[['\\SIFTag{options}'], ['x', '<', 'y'], ['y', '=', 'x'], ['y', '<', 'x']]
>>> tl3 = sif4sci(test_item_1["stem"], symbol="gs")
>>> tl3.text_segments
[['说法', '正确']]
>>> tl3.formula_segments
[['x', '=', '2'], ['y', '=', '\\sqrt', '{', 'x', '}']]
>>> tl3.figure_segments
[]
>>> tl3.ques_mark_segments
[['\\SIFChoice']]
>>> test_item_3 = r"已知$y=x$,则以下说法中$\textf{正确,b}$的是"
>>> tl4 = sif4sci(test_item_3)
Warning: there is some chinese characters in formula!
>>> tl4.text_segments
[['已知'], ['说法', '中', '正确']]

Parser

class EduNLP.SIF.parser.Parser(data, check_formula=True)[source]

Parse the item to standard format.

is_number(uchar)[source]

Check whether a Unicode character is a digit.

is_alphabet(uchar)[source]

Check whether a Unicode character is an English letter.

is_chinese(uchar)[source]

Check whether a Unicode character is a Chinese character.
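
The three character checks above can be approximated with plain code-point range comparisons. The sketch below is an assumption about how such checks are commonly written, not a copy of the Parser methods; in particular, the range U+4E00 to U+9FFF (CJK Unified Ideographs) is only one common choice for "Chinese character". Each function expects a single character.

```python
def is_number(uchar):
    """True for the ASCII digits 0-9."""
    return "0" <= uchar <= "9"


def is_alphabet(uchar):
    """True for the ASCII letters a-z and A-Z."""
    return "a" <= uchar <= "z" or "A" <= uchar <= "Z"


def is_chinese(uchar):
    """True for characters in the CJK Unified Ideographs block (assumed range)."""
    return "\u4e00" <= uchar <= "\u9fff"


# Classify each character of a short mixed string.
kinds = [
    "number" if is_number(c) else
    "alphabet" if is_alphabet(c) else
    "chinese" if is_chinese(c) else
    "other"
    for c in "A工厂25名"
]
```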

call_error()[source]

Syntax parsing function.

get_token()[source]

Get different elements in the item.

Returns

elements

Return type

chinese, alphabet, number, ch_pun_list, en_pun_list, latex formula

next_token()[source]
match(terminal)[source]
txt()[source]
txt_list()[source]
description()[source]
description_list()[source]

Use the Parser to process and describe the text.

Examples

>>> text = '生产某种零件的A工厂25名工人的日加工零件数_   _'
>>> text_parser = Parser(text)
>>> text_parser.description_list()
>>> text_parser.text
'生产某种零件的$A$工厂$25$名工人的日加工零件数$\\SIFBlank$'
>>> text = 'X的分布列为(   )'
>>> text_parser = Parser(text)
>>> text_parser.description_list()
>>> text_parser.text
'$X$的分布列为$\\SIFChoice$'
>>> text = '① AB是⊙O的直径,AC是⊙O的切线,BC交⊙O于点E.AC的中点为D'
>>> text_parser = Parser(text)
>>> text_parser.description_list()
>>> text_parser.error_flag
1
>>> text = '支持公式如$\\frac{y}{x}$,$\\SIFBlank$,$\\FigureID{1}$,不支持公式如$\\frac{ \\dddot y}{x}$'
>>> text_parser = Parser(text)
>>> text_parser.description_list()
>>> text_parser.fomula_illegal_flag
1
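
The first two examples above show the kind of rewriting involved: bare letters and numbers are wrapped in dollar signs, a blank such as `_   _` becomes `$\SIFBlank$`, and an empty bracket pair becomes `$\SIFChoice$`. The regex sketch below reproduces only those two documented outputs. It is a deliberately naive stand-in with a hypothetical name (`naive_to_sif`), not the Parser's grammar, and it does no error or formula checking.

```python
import re

# One pass over plain text: blanks, empty choice brackets, then letters/digits.
_PATTERN = re.compile(
    r"(?P<blank>_[_\s]*_)"           # e.g. "_   _"
    r"|(?P<choice>[((]\s+[))])"     # e.g. "(   )", ASCII or full-width
    r"|(?P<ord>[A-Za-z0-9]+)"        # e.g. "A", "25"
)


def _wrap(match):
    if match.group("blank"):
        return r"$\SIFBlank$"
    if match.group("choice"):
        return r"$\SIFChoice$"
    return "$" + match.group("ord") + "$"


def naive_to_sif(text):
    # Split on existing $...$ spans; odd-indexed chunks are formulas and
    # are copied through untouched.
    chunks = re.split(r"(\$[^$]*\$)", text)
    return "".join(
        chunk if i % 2 else _PATTERN.sub(_wrap, chunk)
        for i, chunk in enumerate(chunks)
    )


blank_item = naive_to_sif("生产某种零件的A工厂25名工人的日加工零件数_   _")
choice_item = naive_to_sif("X的分布列为(   )")
```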

Segment

EduNLP.SIF.segment.segment.contextmanager(func)[source]

@contextmanager decorator.

Typical usage:

    @contextmanager
    def some_generator(<arguments>):
        <setup>
        try:
            yield <value>
        finally:
            <cleanup>

This makes this:

    with some_generator(<arguments>) as <variable>:
        <body>

equivalent to this:

    <setup>
    try:
        <variable> = <value>
        <body>
    finally:
        <cleanup>
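
The `with s.filter(...)` and `with tl.filter(...)` examples in this reference are used in exactly this way: a view is changed on entry and restored on exit. A small runnable sketch with the standard-library decorator follows; `VisibleList` and `only` are hypothetical names for illustration, not EduNLP classes.

```python
from contextlib import contextmanager


class VisibleList:
    """Tiny stand-in for a typed segment list (hypothetical, not EduNLP)."""

    def __init__(self, items):
        self.items = items      # list of (type_letter, value) pairs
        self.visible = None     # None means "show every type"

    def values(self):
        return [value for kind, value in self.items
                if self.visible is None or kind in self.visible]

    @contextmanager
    def only(self, keep):
        previous = self.visible          # <setup>
        self.visible = set(keep)
        try:
            yield self                   # <value>, bound by "as"
        finally:
            self.visible = previous      # <cleanup>, runs even on error


segments = VisibleList([("t", "如图所示"), ("f", "ABC"), ("m", r"\SIFBlank")])
with segments.only("t") as view:
    inside = view.values()
outside = segments.values()
```

Restoring the previous state in `finally` is what lets the list show every segment again after the block, even if the body raises.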

class EduNLP.SIF.segment.segment.Symbol[source]
class EduNLP.SIF.segment.segment.TextSegment[source]
class EduNLP.SIF.segment.segment.LatexFormulaSegment[source]
class EduNLP.SIF.segment.segment.Figure(is_base64=False)[source]

Decode a figure that has been encoded in base64.

classmethod base64_to_numpy(figure: str)[source]

Create an array in a designated buffer.

class EduNLP.SIF.segment.segment.FigureFormulaSegment(src, is_base64=False, figure_instance: (<class 'dict'>, <class 'bool'>) = None)[source]

Deal with figure formulas, especially those encoded in base64.

class EduNLP.SIF.segment.segment.FigureSegment(src, is_base64=False, figure_instance: (<class 'dict'>, <class 'bool'>) = None)[source]

Deal with figures, especially those encoded in base64.

class EduNLP.SIF.segment.segment.QuesMarkSegment[source]
class EduNLP.SIF.segment.segment.TagSegment[source]
class EduNLP.SIF.segment.segment.SepSegment[source]
class EduNLP.SIF.segment.segment.SegmentList(item, figures: Optional[dict] = None)[source]
Parameters
  • item (str) –

  • figures (dict) –

Examples

>>> test_item = r"如图所示,则三角形$ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
>>> SegmentList(test_item)
['如图所示,则三角形', 'ABC', '的面积是', '\\SIFBlank', '。', \FigureID{1}]
append(segment) → None[source]

Add the segment to the corresponding segments.

property segments

return segments

property text_segments

return text segments

property formula_segments

return formula segments

property figure_segments

return figure segments

property ques_mark_segments

return question mark segments

property tag_segments

return tag segments

to_symbol(idx, symbol)[source]

switch element to its symbol

symbolize(to_symbolize='fgm')[source]

Switch designated elements to symbol. It is a good way to protect or preserve the elements which we don’t want to tokenize.

Parameters

to_symbolize – “t”: text, “f”: formula, “g”: figure, “m”: question mark, “a”: tag, “s”: sep
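
A minimal sketch of what symbolizing amounts to, assuming segments are available as (type letter, value) pairs. The placeholder strings are the ones that appear in the examples of this reference; the function below is a hypothetical stand-alone version, not the SegmentList method.

```python
SYMBOLS = {
    "t": "[TEXT]", "f": "[FORMULA]", "g": "[FIGURE]",
    "m": "[MARK]", "a": "[TAG]", "s": "[SEP]",
}


def symbolize(typed_segments, to_symbolize="fgm"):
    """Replace every segment whose type letter is selected by its placeholder."""
    return [SYMBOLS[kind] if kind in to_symbolize else value
            for kind, value in typed_segments]


typed = [
    ("t", "如图所示,则"), ("f", r"\bigtriangleup ABC"), ("t", "的面积是"),
    ("m", r"\SIFBlank"), ("t", "。"), ("g", r"\FigureID{1}"),
]
partly = symbolize(typed)                       # default "fgm"
fully = symbolize(typed, to_symbolize="tfgm")
```

Segments that are turned into placeholders are not tokenized further, which is why symbolizing protects formulas or figures from being split.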

filter(drop: (<class 'set'>, <class 'str'>) = '', keep: (<class 'set'>, <class 'str'>) = '*')[source]

Selectively output the element list. drop lists the segment types to hide; keep lists the segment types to show.

Parameters
  • drop (set or str) – segment types to leave out of the returned value; each letter must be one of “tfgmas”.

  • keep (set or str) – segment types to keep in the returned value, all others are left out; each letter must be one of “tfgmas”.

describe()[source]

Show the number of segments of each type.

EduNLP.SIF.segment.segment.seg(item, figures=None, symbol=None)[source]

An interface for SegmentList that shows the result in an appropriate way.

Parameters
  • item (str) –

  • figures (dict, optional) –

  • symbol (str, optional) –

Returns

segmented item

Return type

list

Examples

>>> test_item = r"如图所示,则$\bigtriangleup ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
>>> s = seg(test_item)
>>> s
['如图所示,则', '\\bigtriangleup ABC', '的面积是', '\\SIFBlank', '。', \FigureID{1}]
>>> s.describe()
{'t': 3, 'f': 1, 'g': 1, 'm': 1}
>>> with s.filter("f"):
...     s
['如图所示,则', '的面积是', '\\SIFBlank', '。', \FigureID{1}]
>>> with s.filter(keep="t"):
...     s
['如图所示,则', '的面积是', '。']
>>> with s.filter():
...     s
['如图所示,则', '\\bigtriangleup ABC', '的面积是', '\\SIFBlank', '。', \FigureID{1}]
>>> seg(test_item, symbol="fgm")
['如图所示,则', '[FORMULA]', '的面积是', '[MARK]', '。', '[FIGURE]']
>>> seg(test_item, symbol="tfgm")
['[TEXT]', '[FORMULA]', '[TEXT]', '[MARK]', '[TEXT]', '[FIGURE]']
>>> seg(r"如图所示,则$\FormFigureID{0}$的面积是$\SIFBlank$。$\FigureID{1}$")
['如图所示,则', \FormFigureID{0}, '的面积是', '\\SIFBlank', '。', \FigureID{1}]
>>> seg(r"如图所示,则$\FormFigureID{0}$的面积是$\SIFBlank$。$\FigureID{1}$", symbol="fgm")
['如图所示,则', '[FORMULA]', '的面积是', '[MARK]', '。', '[FIGURE]']
>>> s.text_segments
['如图所示,则', '的面积是', '。']
>>> s.formula_segments
['\\bigtriangleup ABC']
>>> s.figure_segments
[\FigureID{1}]
>>> s.ques_mark_segments
['\\SIFBlank']
>>> test_item_1 = {
...     "stem": r"若复数$z=1+2 i+i^{3}$,则$|z|=$",
...     "options": ['0', '1', r'$\sqrt{2}$', '2']
... }
>>> from EduNLP.utils import dict2str4sif
>>> test_item_1_str = dict2str4sif(test_item_1)
>>> test_item_1_str
'$\\SIFTag{stem_begin}$...$\\SIFTag{stem_end}$$\\SIFTag{options_begin}$$\\SIFTag{list_0}$0...$\\SIFTag{options_end}$'
>>> s1 = seg(test_item_1_str, symbol="tfgm")
>>> s1
['\\SIFTag{stem_begin}'...'\\SIFTag{stem_end}', '\\SIFTag{options_begin}', '\\SIFTag{list_0}', ...]
>>> with s1.filter(keep="a"):
...     s1
[...'\\SIFTag{list_0}', '\\SIFTag{list_1}', '\\SIFTag{list_2}', '\\SIFTag{list_3}', '\\SIFTag{options_end}']
>>> s1.tag_segments
['\\SIFTag{stem_begin}', '\\SIFTag{stem_end}', '\\SIFTag{options_begin}', ... '\\SIFTag{options_end}']
>>> test_item_1_str_2 = dict2str4sif(test_item_1, tag_mode="head", add_list_no_tag=False)
>>> seg(test_item_1_str_2, symbol="tfgmas")
['[TAG]', ... '[TAG]', '[TEXT]', '[SEP]', '[TEXT]', '[SEP]', '[FORMULA]', '[SEP]', '[TEXT]']
>>> s2 = seg(test_item_1_str_2, symbol="fgm")
>>> s2.tag_segments
['\\SIFTag{stem}', '\\SIFTag{options}']
>>> test_item_2 = r"已知$y=x$,则以下说法中$\textf{正确,b}$的是"
>>> s2 = seg(test_item_2)
>>> s2.text_segments
['已知', ',则以下说法中正确的是']
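
The core of the segmentation shown above is splitting on dollar-delimited spans and labelling each piece. The sketch below is a simplified stand-in under stated assumptions: it only looks at the prefix of each `$...$` span, treats `\FormFigureID` as a formula (as the symbolized example above does), and ignores special cases such as `\textf`. `naive_seg` and `describe` are hypothetical names, not the EduNLP implementation.

```python
import re


def naive_seg(item):
    """Split an item on $...$ and label each piece with a type letter."""
    typed = []
    for i, chunk in enumerate(re.split(r"\$([^$]*)\$", item)):
        if not chunk:
            continue                       # empty text between adjacent spans
        if i % 2 == 0:
            typed.append(("t", chunk))     # text outside the dollars
        elif chunk.startswith(r"\FigureID"):
            typed.append(("g", chunk))
        elif chunk.startswith((r"\SIFBlank", r"\SIFChoice")):
            typed.append(("m", chunk))
        elif chunk.startswith(r"\SIFTag"):
            typed.append(("a", chunk))
        elif chunk.startswith(r"\SIFSep"):
            typed.append(("s", chunk))
        else:
            typed.append(("f", chunk))
    return typed


def describe(typed):
    """Count the segments of each type."""
    counts = {}
    for kind, _ in typed:
        counts[kind] = counts.get(kind, 0) + 1
    return counts


typed = naive_seg(r"如图所示,则$\bigtriangleup ABC$的面积是$\SIFBlank$。$\FigureID{1}$")
counts = describe(typed)
```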

Tokenization

tokenize

class EduNLP.SIF.tokenization.tokenization.TokenList(segment_list: SegmentList, text_params=None, formula_params=None, figure_params=None)[source]
Parameters
  • segment_list (list) – segmented item

  • text_params (dict) –

  • formula_params (dict) –

  • figure_params (dict) –

add_seg_type(seg_type, tar: list, add_seg_type=True, mode='delimiter')[source]

Add segment-type tags at different positions.

Parameters
  • seg_type (str) – t: text, f: formula

  • tar (list) –

  • add_seg_type – if False, the function does nothing.

  • mode (str) – delimiter: both at the head and at the tail; head: only at the head; tail: only at the tail
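
A rough sketch of the three modes, assuming the `[TEXT_BEGIN]` / `[TEXT_END]` and `[FORMULA_BEGIN]` / `[FORMULA_END]` markers seen in the get_segments example of sif4sci. Which marker is used in head and tail mode is an assumption here, as is returning a new list rather than modifying `tar`; `wrap_with_seg_type` is a hypothetical name.

```python
SEG_NAMES = {"t": "TEXT", "f": "FORMULA"}


def wrap_with_seg_type(seg_type, tokens, add_seg_type=True, mode="delimiter"):
    """Return a copy of `tokens` with type markers added according to `mode`."""
    if not add_seg_type:
        return list(tokens)
    name = SEG_NAMES[seg_type]
    # Assumed marker choice: BEGIN at the head, END at the tail.
    head = [f"[{name}_BEGIN]"] if mode in ("delimiter", "head") else []
    tail = [f"[{name}_END]"] if mode in ("delimiter", "tail") else []
    return head + list(tokens) + tail


formula = ["mathord", "=", "textord"]
both = wrap_with_seg_type("f", formula)
head_only = wrap_with_seg_type("f", formula, mode="head")
untouched = wrap_with_seg_type("f", formula, add_seg_type=False)
```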

get_segments(add_seg_type=True, add_seg_mode='delimiter', keep='*', drop='', depth=None)[source]

Call the segment function.

Parameters
  • add_seg_type

  • add_seg_mode – delimiter: both at the head and at the tail; head: only at the head; tail: only at the tail

  • keep

  • drop

  • depth (int or None) –

    0: only separate at SIFSep
    1: only separate at SIFTag
    2: separate at SIFTag and SIFSep
    otherwise: separate all segments

Returns

segmented item

Return type

list

property text_segments

get text segment

property formula_segments

get formula segment

property figure_segments

get figure segment

property ques_mark_segments

get question mark segment

property tokens

add token to a list

append_text(segment, symbol=False)[source]

append text

append_formula(segment, symbol=False, init=True)[source]

append formula by different methods

append_figure(segment, **kwargs)[source]

append figure

append_ques_mark(segment, **kwargs)[source]

append question mark

append_tag(segment, **kwargs)[source]

append tag

append_sep(segment, **kwargs)[source]

append sep

append(segment, lazy=False)[source]

the total api for appending elements

Parameters
  • segment

  • lazy – True: do not distinguish parameters. False: make the same parameters share the same number.

extend(segments)[source]

append every segment in turn

property text_tokens

return text tokens

property formula_tokens

return formula tokens

property figure_tokens

return figure tokens

property ques_mark_tokens

return question mark tokens

property inner_formula_tokens

return inner formula tokens

filter(drop: (<class 'set'>, <class 'str'>) = '', keep: (<class 'set'>, <class 'str'>) = '*')[source]

Selectively output the element list. drop lists the segment types to hide; keep lists the segment types to show.

Parameters
  • drop (set or str) – segment types to leave out of the returned value; each letter must be one of “tfgmas”.

  • keep (set or str) – segment types to keep in the returned value, all others are left out; each letter must be one of “tfgmas”.

Returns

filtered list

Return type

list
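
How drop and keep combine can be pictured as simple set arithmetic over the six type letters. The sketch below assumes that keep="*" means every type and that drop takes precedence when a letter appears in both; that precedence is an assumption, since this reference only shows the two arguments used separately. `resolve_visible` is a hypothetical helper.

```python
ALL_TYPES = set("tfgmas")


def resolve_visible(drop="", keep="*"):
    """Return the set of segment types that remain visible."""
    keep_set = set(ALL_TYPES) if keep == "*" else set(keep)
    drop_set = set(drop)
    unknown = (keep_set | drop_set) - ALL_TYPES
    if unknown:
        raise ValueError(f"unknown segment types: {sorted(unknown)}")
    # Assumed precedence: anything listed in drop is hidden.
    return keep_set - drop_set


after_drop = resolve_visible("fgm")     # like filter('fgm')
after_keep = resolve_visible(keep="t")  # like filter(keep='t')
everything = resolve_visible()          # like filter()
```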

describe()[source]

Show the total number of elements of each type.

EduNLP.SIF.tokenization.tokenization.tokenize(segment_list: SegmentList, text_params=None, formula_params=None, figure_params=None)[source]

The actual API for tokenizing an item.

Parameters
  • segment_list (list) – segmented item

  • text_params (dict) – the method to deal with text

  • formula_params (dict) – the method to deal with formulas

  • figure_params (dict) – the method to deal with figures

Returns

tokenized item

Return type

list

Examples

>>> items = r"如图所示,则三角形$ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
>>> tokenize(SegmentList(items))
['如图所示', '三角形', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tokenize(SegmentList(items),formula_params={"method": "ast"})
['如图所示', '三角形', <Formula: ABC>, '面积', '\\SIFBlank', \FigureID{1}]

call formula function

text

EduNLP.SIF.tokenization.text.tokenize(text, granularity='word', stopwords='default')[source]

Use the jieba library to tokenize an item by word or by character.

Parameters
  • text

  • granularity

  • stopwords (str, None or set) –

Examples

>>> tokenize("三角函数是基本初等函数之一")
['三角函数', '初等', '函数']
>>> tokenize("三角函数是基本初等函数之一", granularity="char")
['三', '角', '函', '数', '基', '初', '函', '数']
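
The character-level example above drops some characters, which is consistent with stopword filtering after splitting. A stdlib-only sketch of that step follows. The stopword set is hypothetical, chosen only so that the output matches the example; the library's default stopword list is not reproduced here, and word granularity (which needs jieba) is not covered.

```python
def char_tokenize(text, stopwords=frozenset()):
    """Split text into characters, skipping whitespace and stopwords."""
    return [ch for ch in text if not ch.isspace() and ch not in stopwords]


# Hypothetical stopword set for illustration only.
demo_stopwords = {"是", "本", "等", "之", "一"}

chars = char_tokenize("三角函数是基本初等函数之一", demo_stopwords)
unfiltered = char_tokenize("三角函数")
```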

formula

EduNLP.SIF.tokenization.formula.formula.linear_tokenize(formula, preserve_braces=True, number_as_tag=False, *args, **kwargs)[source]

Tokenize a formula linearly. The process has three steps: cut, reduce and connect_char.

Parameters
  • formula

  • preserve_braces

  • number_as_tag

  • args

  • kwargs

Examples

>>> linear_tokenize(r"{x + y}^\frac{1}{2} + 1 = 0")
['{', 'x', '+', 'y', '}', '^', '\\frac', '{', '1', '}', '{', '2', '}', '+', '1', '=', '0']
>>> linear_tokenize(r"ABC,AB,AC")
['ABC', ',', 'AB', ',', 'AC']
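
For intuition, the two documented outputs above can be reproduced with a single regular expression. Note that this is a different technique from the cut, reduce and connect_char pipeline the function is described as using; it is a sketch with a hypothetical name (`simple_linear_tokenize`) and has only been matched against these two inputs.

```python
import re

_TOKEN = re.compile(
    r"\\[A-Za-z]+"        # LaTeX command such as \frac
    r"|\\."               # escaped single character such as \{
    r"|[A-Za-z]+"         # run of letters, e.g. ABC
    r"|\d+(?:\.\d+)?"     # integer or decimal number
    r"|\S"                # any other non-space symbol
)


def simple_linear_tokenize(formula, preserve_braces=True):
    """Scan the formula left to right and emit flat tokens."""
    tokens = _TOKEN.findall(formula)
    if not preserve_braces:
        tokens = [t for t in tokens if t not in ("{", "}")]
    return tokens


with_braces = simple_linear_tokenize(r"{x + y}^\frac{1}{2} + 1 = 0")
without_braces = simple_linear_tokenize(r"\frac{1}{2}", preserve_braces=False)
points = simple_linear_tokenize("ABC,AB,AC")
```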
EduNLP.SIF.tokenization.formula.formula.ast_tokenize(formula, ord2token=False, var_numbering=False, return_type='formula', *args, **kwargs)[source]

Tokenize a formula with a method chosen according to return_type.

Parameters
  • formula

  • ord2token

  • var_numbering

  • return_type

  • args

  • kwargs

Examples

>>> ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x", return_type="list")
['x', '+', 'y', '{ }', '\\pi', '{ }', '2', '{ }', '\\frac', '\\supsub', '+', '1', '=', 'x']
>>> ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x", return_type="list", ord2token=True)
['mathord', '+', 'mathord', '{ }', 'mathord', '{ }', 'textord', '{ }', '\\frac', '\\supsub', '+', 'textord', '=', 'mathord']
>>> ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x", return_type="list", ord2token=True, var_numbering=True)
['mathord_0', '+', 'mathord_1', '{ }', 'mathord_con', '{ }', 'textord', '{ }', '\\frac', '\\supsub', '+', 'textord', '=', 'mathord_0']
>>> len(ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x", return_type="ast").nodes)
14
>>> ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x")
<Formula: {x + y}^\frac{\pi}{2} + 1 = x>
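
The effect of ord2token=True together with var_numbering=True in the third example can be pictured as a relabelling pass over the plain token list from the first example. The sketch below works on that list rather than on a real syntax tree, and its classification rules are assumptions: single Latin letters are variables, digit strings are textord, and `\pi` is the only constant. `number_tokens` is a hypothetical helper.

```python
CONSTANTS = {r"\pi"}  # assumed constant set for this sketch


def number_tokens(tokens):
    """Relabel variables, constants and numbers; keep operators as they are."""
    numbering = {}  # variable name -> index, in order of first appearance
    out = []
    for tok in tokens:
        if tok in CONSTANTS:
            out.append("mathord_con")
        elif len(tok) == 1 and tok.isascii() and tok.isalpha():
            idx = numbering.setdefault(tok, len(numbering))
            out.append(f"mathord_{idx}")
        elif tok.isdigit():
            out.append("textord")
        else:
            out.append(tok)
    return out


plain = ["x", "+", "y", "{ }", r"\pi", "{ }", "2", "{ }",
         r"\frac", r"\supsub", "+", "1", "=", "x"]
numbered = number_tokens(plain)
```

Because the index is assigned on first appearance, the trailing `x` receives the same `mathord_0` label as the leading one.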
EduNLP.SIF.tokenization.formula.formula.tokenize(formula, method='linear', errors='raise', **kwargs)[source]

The general function for tokenizing a formula with either the linear or the ast method.

Parameters
  • formula

  • method

  • errors (str) – how to handle an exception raised during ast tokenization: “coerce”: fall back to linear_tokenize; “raise”: raise the exception

  • kwargs

Examples

>>> tokenize(r"\frac{\pi}{x + y} + 1 = x")
['\\frac', '{', '\\pi', '}', '{', 'x', '+', 'y', '}', '+', '1', '=', 'x']
>>> tokenize(r"\frac{\pi}{x + y} + 1 = x", method="ast", ord2token=True)
<Formula: \frac{\pi}{x + y} + 1 = x>
>>> tokenize(r"\frac{\pi}{x + y} + 1 = x", method="ast", ord2token=True, return_type="list")
['mathord', '{ }', 'mathord', '+', 'mathord', '{ }', '\\frac', '+', 'textord', '=', 'mathord']
class EduNLP.SIF.tokenization.formula.ast_token.Formula(formula: (<class 'str'>, typing.List[typing.Dict]), variable_standardization=False, const_mathord=None, init=True, *args, **kwargs)[source]

Transforms a formula into its parsed abstract syntax tree.

Parameters
  • formula (str or List[Dict]) – LaTeX formula string or the parsed abstract syntax tree

  • variable_standardization

  • const_mathord

  • init

  • args

  • kwargs

Examples

>>> f = Formula("x")
>>> f
<Formula: x>
>>> f.ast
[{'val': {'id': 0, 'type': 'mathord', 'text': 'x', 'role': None}, 'structure': {'bro': [None, None], 'child': None, 'father': None, 'forest': None}}]
>>> f.elements
[{'id': 0, 'type': 'mathord', 'text': 'x', 'role': None}]
>>> f.variable_standardization(inplace=True)
<Formula: x>
>>> f.elements
[{'id': 0, 'type': 'mathord', 'text': 'x', 'role': None, 'var': 0}]
variable_standardization(inplace=False, const_mathord=None, variable_connect_dict=None)[source]

Makes the same parameters (variables) share the same number.

Parameters
  • inplace

  • const_mathord

  • variable_connect_dict

property ast
property elements
property ast_graph: (<class 'networkx.classes.graph.Graph'>, <class 'networkx.classes.digraph.DiGraph'>)
to_str()[source]
reset_ast(formula_ensure_str: bool = True, variable_standardization=False, const_mathord=None, *args, **kwargs)[source]
property resetable
EduNLP.SIF.tokenization.formula.ast_token.traversal_formula(ast, ord2token=False, var_numbering=False, strategy='post', *args, **kwargs)[source]

This part runs only when the return type is list. It provides two strategies: post and linear. Each node is appended to the token list according to its type.
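
The "post" strategy corresponds to a post-order walk: children are emitted before their parent. The sketch below uses a hand-built tree for `{x + y}^\frac{\pi}{2} + 1 = x`; the tree shape and the (label, children) representation are assumptions for illustration, not the library's AST format.

```python
def post_order(node):
    """Yield labels children-first, then the node itself."""
    label, children = node
    for child in children:
        yield from post_order(child)
    yield label


def leaf(label):
    return (label, [])


# Assumed tree: a \supsub node over a braced group and a fraction,
# followed by four top-level siblings.
forest = [
    (r"\supsub", [
        ("{ }", [leaf("x"), leaf("+"), leaf("y")]),
        (r"\frac", [("{ }", [leaf(r"\pi")]), ("{ }", [leaf("2")])]),
    ]),
    leaf("+"), leaf("1"), leaf("="), leaf("x"),
]

tokens = [tok for root in forest for tok in post_order(root)]
```

Under these assumptions the walk produces the same 14 tokens, in the same order, as the return_type="list" example for this formula.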

EduNLP.SIF.tokenization.formula.ast_token.ast_tokenize(formula, ord2token=False, var_numbering=False, return_type='formula', *args, **kwargs)[source]

Tokenize a formula with a method chosen according to return_type.

Parameters
  • formula

  • ord2token

  • var_numbering

  • return_type

  • args

  • kwargs

Examples

>>> ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x", return_type="list")
['x', '+', 'y', '{ }', '\\pi', '{ }', '2', '{ }', '\\frac', '\\supsub', '+', '1', '=', 'x']
>>> ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x", return_type="list", ord2token=True)
['mathord', '+', 'mathord', '{ }', 'mathord', '{ }', 'textord', '{ }', '\\frac', '\\supsub', '+', 'textord', '=', 'mathord']
>>> ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x", return_type="list", ord2token=True, var_numbering=True)
['mathord_0', '+', 'mathord_1', '{ }', 'mathord_con', '{ }', 'textord', '{ }', '\\frac', '\\supsub', '+', 'textord', '=', 'mathord_0']
>>> len(ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x", return_type="ast").nodes)
14
>>> ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x")
<Formula: {x + y}^\frac{\pi}{2} + 1 = x>
class EduNLP.SIF.tokenization.formula.linear_token.IntFlag(value)[source]

Support for integer-based Flags

EduNLP.SIF.tokenization.formula.linear_token.cut(formula, preserve_braces=True, with_dollar=False, preserve_dollar=False, number_as_tag=False, preserve_src=True)[source]

Cut the formula thoroughly.

Parameters
  • formula (str) –

  • preserve_braces – when False, “{” and “}” are filtered out

  • with_dollar – whether the formula has dollar signs

  • preserve_dollar – whether to keep “$”

  • number_as_tag – whether to replace numbers with a tag; it can only identify numbers with more than one digit.

  • preserve_src

Returns

a preliminary, fully cut list

Return type

list

Examples

>>> cut(r"${x + y}^\frac{1}{2} + 12.1 = 0$")
['{x + y}', '^', '\\f', 'r', 'a', 'c', '{1}', '{2}', '+', '12.1', '=', '0']
>>> cut(r"${x + y}^\frac{1}{2} + 12.1 = 0$",preserve_dollar=False)
['{x + y}', '^', '\\f', 'r', 'a', 'c', '{1}', '{2}', '+', '12.1', '=', '0']
>>> cut(r"${x + y}^\frac{1}{2} + 12.1 = 0$",number_as_tag=True)
['{x + y}', '^', '\\f', 'r', 'a', 'c', '{1}', '{2}', '+', '{decimal}', '=', '0']
EduNLP.SIF.tokenization.formula.linear_token.reduce(fea)[source]

restore some formula

EduNLP.SIF.tokenization.formula.linear_token.connect_char(words)[source]

connect and switch to list type

EduNLP.SIF.tokenization.formula.linear_token.latex_parse(formula, preserve_braces=True, with_dollar=True, preserve_dollar=False, number_as_tag=False, preserve_src=True)[source]
EduNLP.SIF.tokenization.formula.linear_token.linear_tokenize(formula, preserve_braces=True, number_as_tag=False, *args, **kwargs)[source]

Tokenize a formula linearly. The process has three steps: cut, reduce and connect_char.

Parameters
  • formula

  • preserve_braces

  • number_as_tag

  • args

  • kwargs

Examples

>>> linear_tokenize(r"{x + y}^\frac{1}{2} + 1 = 0")
['{', 'x', '+', 'y', '}', '^', '\\frac', '{', '1', '}', '{', '2', '}', '+', '1', '=', '0']
>>> linear_tokenize(r"ABC,AB,AC")
['ABC', ',', 'AB', ',', 'AC']