EduNLP.SIF
SIF
- EduNLP.SIF.sif.is_sif(item, check_formula=True, return_parser=False)
Check whether the input item is in SIF format.
- Parameters
item (str) – a raw item, representing the question stem
check_formula (bool) –
whether to check the validity of the formulas when parsing the item.
True: check the formulas in the item. False: skip the check, which is faster.
return_parser (bool) –
whether to include the parsed item in the return value.
True: the return value is (bool, Parser). False: the return value is bool.
- Returns
True (and the Parser of the item) when the item is already in standard format; False (and the Parser of the item) when it is not. Raises ValueError when the item cannot be parsed correctly.
- Return type
bool, or (bool, Parser) when return_parser is True
Examples
>>> text = '若$x,y$满足约束条件' \
... '$\\left\\{\\begin{array}{c}2 x+y-2 \\leq 0 \\\\ x-y-1 \\geq 0 \\\\ y+1 \\geq 0\\end{array}\\right.$,' \
... '则$z=x+7 y$的最大值$\\SIFUnderline$'
>>> is_sif(text)
True
>>> text = '某校一个课外学习小组为研究某作物的发芽率y和温度x(单位...'
>>> ret = is_sif(text, return_parser=True)
>>> ret
(False, <EduNLP.SIF.parser.parser.Parser object...>)
- EduNLP.SIF.sif.to_sif(item, check_formula=True, parser: Optional[Parser] = None)
Convert the item to SIF format.
- Parameters
item (str) – a raw item, representing the question stem
check_formula (bool) – whether to check the formulas when parsing the item (only takes effect when parser is None).
parser (Parser) – the parser of the item, as returned by is_sif.
- Returns
item – the item converted to SIF format
- Return type
str
Examples
>>> text = '某校一个课外学习小组为研究某作物的发芽率y和温度x(单位...'
>>> siftext = to_sif(text)
>>> siftext
'某校一个课外学习小组为研究某作物的发芽率$y$和温度$x$(单位...'
>>> ret = is_sif(text, return_parser=True)
>>> ret
(False, <EduNLP.SIF.parser.parser.Parser object...>)
>>> to_sif(text, parser=ret[1])
'某校一个课外学习小组为研究某作物的发芽率$y$和温度$x$(单位...'
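The relationship between is_sif and to_sif can be illustrated with a small pure-Python sketch. It is not EduNLP's implementation: toy_to_sif and toy_is_sif are hypothetical names, and the only rule modeled is the one visible in the examples above, namely that bare Latin letters and digits outside `$...$` get wrapped in `$`.

```python
import re

# A formula span is anything between a pair of dollar signs.
_FORMULA = re.compile(r"(\$.*?\$)")
# Assumption: only runs of ASCII letters/digits need wrapping.
_BARE = re.compile(r"[A-Za-z0-9]+")


def toy_to_sif(text):
    """Wrap bare letters and digits found outside $...$ in dollar signs."""
    out = []
    for part in _FORMULA.split(text):
        if len(part) >= 2 and part.startswith("$") and part.endswith("$"):
            out.append(part)  # already a formula span, keep untouched
        else:
            out.append(_BARE.sub(lambda m: "$" + m.group(0) + "$", part))
    return "".join(out)


def toy_is_sif(text):
    """A text counts as standard here if conversion would not change it."""
    return toy_to_sif(text) == text
```

Under these assumptions, `toy_is_sif` is simply "would `toy_to_sif` be a no-op?", which mirrors why the documented `is_sif` can hand its Parser to `to_sif` instead of parsing twice.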
- EduNLP.SIF.sif.sif4sci(item: str, figures: (<class 'dict'>, <class 'bool'>) = None, mode: int = 2, symbol: str = None, tokenization=True, tokenization_params=None, errors='raise')
Uses the linear tokenizer by default; change the tokenizer by specifying tokenization_params.
- Parameters
item (str) – a raw item, representing the question stem
figures (dict or bool) – when it is a dict, it maps figure ids to figure instances for figures in ‘FormFigureID{…}’ format; when it is a bool, it indicates whether to instantiate figures in ‘FormFigureBase64{…}’ format
mode (int) –
when mode = 2, use is_sif and check the formulas in item
when mode = 1, use is_sif but don’t check the formulas in item
when mode = 0, don’t use is_sif and don’t check anything in item
symbol (str) –
selects the segment types to symbolize:
“t”: text
“f”: formula
“g”: figure
“m”: question mark
“a”: tag
“s”: sep
tokenization (bool) – whether to tokenize item after segmentation
tokenization_params –
a dict of text_params, formula_params and figure_params used in tokenization
- For formula_params:
method: which tokenizer to use, “linear” or “ast”
- Parameters only used by “linear”:
skip_figure_formula: whether to skip formulas in figure format
symbolize_figure_formula: whether to symbolize formulas in figure format
- Parameters only used by “ast”:
ord2token: whether to convert variables (mathord) and constants (textord) to special tokens
var_numbering: whether to use a numeric suffix to distinguish different variables
return_type: ‘list’ or ‘ast’
More parameters can be found in the definitions in SIF.tokenization.formula
- For figure_params:
figure_instance: whether to return figure instances in tokens
- For text_params:
See the definition in SIF.tokenization.text
errors – error handling strategy: warn, raise, coerce, strict or ignore
- Returns
When tokenization is False, return SegmentList; When tokenization is True, return TokenList
- Return type
list
Examples
>>> test_item = r"如图所示,则$\bigtriangleup ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
>>> tl = sif4sci(test_item)
>>> tl
['如图所示', '\\bigtriangleup', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tl.describe()
{'t': 2, 'f': 2, 'g': 1, 'm': 1}
>>> with tl.filter('fgm'):
...     tl
['如图所示', '面积']
>>> with tl.filter(keep='t'):
...     tl
['如图所示', '面积']
>>> with tl.filter():
...     tl
['如图所示', '\\bigtriangleup', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tl.text_tokens
['如图所示', '面积']
>>> tl.formula_tokens
['\\bigtriangleup', 'ABC']
>>> tl.figure_tokens
[\FigureID{1}]
>>> tl.ques_mark_tokens
['\\SIFBlank']
>>> sif4sci(test_item, symbol="gm", tokenization_params={"formula_params": {"method": "ast"}})
['如图所示', <Formula: \bigtriangleup ABC>, '面积', '[MARK]', '[FIGURE]']
>>> sif4sci(test_item, symbol="tfgm")
['[TEXT]', '[FORMULA]', '[TEXT]', '[MARK]', '[TEXT]', '[FIGURE]']
>>> sif4sci(test_item, symbol="gm",
...     tokenization_params={"formula_params": {"method": "ast", "return_type": "list"}})
['如图所示', '\\bigtriangleup', 'A', 'B', 'C', '面积', '[MARK]', '[FIGURE]']
>>> test_item_1 = {
...     "stem": r"若$x=2$, $y=\sqrt{x}$,则下列说法正确的是$\SIFChoice$",
...     "options": [r"$x < y$", r"$y = x$", r"$y < x$"]
... }
>>> tls = [
...     sif4sci(e, symbol="gm",
...         tokenization_params={
...             "formula_params": {
...                 "method": "ast", "return_type": "list", "ord2token": True, "var_numbering": True,
...                 "link_variable": False}
...         })
...     for e in ([test_item_1["stem"]] + test_item_1["options"])
... ]
>>> tls[1:]
[['mathord_0', '<', 'mathord_1'], ['mathord_0', '=', 'mathord_1'], ['mathord_0', '<', 'mathord_1']]
>>> link_formulas(*tls)
>>> tls[1:]
[['mathord_0', '<', 'mathord_1'], ['mathord_1', '=', 'mathord_0'], ['mathord_1', '<', 'mathord_0']]
>>> from EduNLP.utils import dict2str4sif
>>> test_item_1_str = dict2str4sif(test_item_1, tag_mode="head", add_list_no_tag=False)
>>> test_item_1_str
'$\\SIFTag{stem}$...则下列说法正确的是$\\SIFChoice$$\\SIFTag{options}$$x < y$$\\SIFSep$$y = x$$\\SIFSep$$y < x$'
>>> tl1 = sif4sci(test_item_1_str, symbol="gm",
...     tokenization_params={"formula_params": {"method": "ast", "return_type": "list", "ord2token": True}})
>>> tl1.get_segments()[0]
['\\SIFTag{stem}']
>>> tl1.get_segments()[1:3]
[['[TEXT_BEGIN]', '[TEXT_END]'], ['[FORMULA_BEGIN]', 'mathord', '=', 'textord', '[FORMULA_END]']]
>>> tl1.get_segments(add_seg_type=False)[0:3]
[['\\SIFTag{stem}'], ['mathord', '=', 'textord'], ['mathord', '=', 'mathord', '{ }', '\\sqrt']]
>>> test_item_2 = {"options": [r"$x < y$", r"$y = x$", r"$y < x$"]}
>>> test_item_2
{'options': ['$x < y$', '$y = x$', '$y < x$']}
>>> test_item_2_str = dict2str4sif(test_item_2, tag_mode="head", add_list_no_tag=False)
>>> test_item_2_str
'$\\SIFTag{options}$$x < y$$\\SIFSep$$y = x$$\\SIFSep$$y < x$'
>>> tl2 = sif4sci(test_item_2_str, symbol="gms",
...     tokenization_params={"formula_params": {"method": "ast", "return_type": "list"}})
>>> tl2
['\\SIFTag{options}', 'x', '<', 'y', '[SEP]', 'y', '=', 'x', '[SEP]', 'y', '<', 'x']
>>> tl2.get_segments(add_seg_type=False)
[['\\SIFTag{options}'], ['x', '<', 'y'], ['[SEP]'], ['y', '=', 'x'], ['[SEP]'], ['y', '<', 'x']]
>>> tl2.get_segments(add_seg_type=False, drop="s")
[['\\SIFTag{options}'], ['x', '<', 'y'], ['y', '=', 'x'], ['y', '<', 'x']]
>>> tl3 = sif4sci(test_item_1["stem"], symbol="gs")
>>> tl3.text_segments
[['说法', '正确']]
>>> tl3.formula_segments
[['x', '=', '2'], ['y', '=', '\\sqrt', '{', 'x', '}']]
>>> tl3.figure_segments
[]
>>> tl3.ques_mark_segments
[['\\SIFChoice']]
>>> test_item_3 = r"已知$y=x$,则以下说法中$\textf{正确,b}$的是"
>>> tl4 = sif4sci(test_item_3)
Warning: there is some chinese characters in formula!
>>> tl4.text_segments
[['已知'], ['说法', '中', '正确']]
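Since the nesting of tokenization_params is easy to get wrong, the layout described in the parameter list can be written out as a plain dict. This is a sketch only: the key names come from this page, the text_params keys assume that the dict is forwarded to SIF.tokenization.text.tokenize, and MODE_BEHAVIOUR and check_params are hypothetical helpers, not part of EduNLP.

```python
# mode -> (call is_sif?, check formulas?), transcribed from the mode description
MODE_BEHAVIOUR = {2: (True, True), 1: (True, False), 0: (False, False)}

tokenization_params = {
    # Assumption: forwarded to the text tokenizer (granularity, stopwords).
    "text_params": {"granularity": "word", "stopwords": "default"},
    # "ast"-only options are grouped with the method that uses them.
    "formula_params": {
        "method": "ast",
        "return_type": "list",
        "ord2token": True,
        "var_numbering": True,
    },
    "figure_params": {"figure_instance": True},
}

ALLOWED_GROUPS = {"text_params", "formula_params", "figure_params"}


def check_params(params):
    """Hypothetical sanity check for the dict layout; returns True if it looks right."""
    unknown = set(params) - ALLOWED_GROUPS
    if unknown:
        raise ValueError(f"unknown parameter groups: {sorted(unknown)}")
    method = params.get("formula_params", {}).get("method", "linear")
    if method not in ("linear", "ast"):
        raise ValueError(f"unknown formula tokenizer: {method!r}")
    return True
```

A dict shaped like this would be passed as `sif4sci(item, tokenization_params=tokenization_params)`, as in the examples above.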
Parser
- class EduNLP.SIF.parser.Parser(data, check_formula=True)[source]¶
Parse the item to standard format.
- get_token()[source]¶
Get different elements in the item.
- Returns
elements
- Return type
chinese, alphabet, number, ch_pun_list, en_pun_list, latex formula
- description_list()
Use the Parser to process and describe the text.
Examples
>>> text = '生产某种零件的A工厂25名工人的日加工零件数_ _'
>>> text_parser = Parser(text)
>>> text_parser.description_list()
>>> text_parser.text
'生产某种零件的$A$工厂$25$名工人的日加工零件数$\\SIFBlank$'
>>> text = 'X的分布列为( )'
>>> text_parser = Parser(text)
>>> text_parser.description_list()
>>> text_parser.text
'$X$的分布列为$\\SIFChoice$'
>>> text = '① AB是⊙O的直径,AC是⊙O的切线,BC交⊙O于点E.AC的中点为D'
>>> text_parser = Parser(text)
>>> text_parser.description_list()
>>> text_parser.error_flag
1
>>> text = '支持公式如$\\frac{y}{x}$,$\\SIFBlank$,$\\FigureID{1}$,不支持公式如$\\frac{ \\dddot y}{x}$'
>>> text_parser = Parser(text)
>>> text_parser.description_list()
>>> text_parser.fomula_illegal_flag
1
Segment
- EduNLP.SIF.segment.segment.contextmanager(func)[source]¶
@contextmanager decorator.
Typical usage:
    @contextmanager
    def some_generator(<arguments>):
        <setup>
        try:
            yield <value>
        finally:
            <cleanup>
This makes this:
    with some_generator(<arguments>) as <variable>:
        <body>
equivalent to this:
    <setup>
    try:
        <variable> = <value>
        <body>
    finally:
        <cleanup>
- class EduNLP.SIF.segment.segment.Figure(is_base64=False)
Decode a figure that has been encoded in base64.
- class EduNLP.SIF.segment.segment.FigureFormulaSegment(src, is_base64=False, figure_instance: (<class 'dict'>, <class 'bool'>) = None)
Deal with a figure formula, especially one encoded in base64.
- class EduNLP.SIF.segment.segment.FigureSegment(src, is_base64=False, figure_instance: (<class 'dict'>, <class 'bool'>) = None)
Deal with a figure, especially one encoded in base64.
- class EduNLP.SIF.segment.segment.SegmentList(item, figures: Optional[dict] = None)[source]¶
- Parameters
item (str) –
figures (dict) –
Examples
>>> test_item = "如图所示,则三角形$ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
>>> SegmentList(test_item)
['如图所示,则三角形', 'ABC', '的面积是', '\\SIFBlank', '。', \FigureID{1}]
- property segments¶
return segments
- property text_segments¶
return text segments
- property formula_segments¶
return formula segments
- property figure_segments¶
return figure segments
- property ques_mark_segments¶
return question mark segments
- property tag_segments¶
return tag segments
- symbolize(to_symbolize='fgm')
Replace the designated elements with symbols. This is a good way to protect or preserve elements that should not be tokenized.
- Parameters
to_symbolize – the segment types to symbolize: “t”: text; “f”: formula; “g”: figure; “m”: question mark; “a”: tag; “s”: sep
- filter(drop: (<class 'set'>, <class 'str'>) = '', keep: (<class 'set'>, <class 'str'>) = '*')
Selectively output a list of elements. drop lists the segment types to hide; keep lists the segment types to show.
- Parameters
drop (set or str) – letters taken from “tfgmas”; the selected segment types are dropped from the return value.
keep (set or str) – letters taken from “tfgmas”; only the selected segment types are kept in the return value.
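The drop/keep semantics, and the reason filter is used in a with block, can be sketched with a small stand-in class. ToySegments is hypothetical and much simpler than SegmentList; it only assumes that each segment carries one of the type letters from “tfgmas” and that the filter is undone when the block exits.

```python
from contextlib import contextmanager

ALL_TYPES = set("tfgmas")


class ToySegments:
    """Holds (type_letter, value) pairs and shows only the visible types."""

    def __init__(self, typed_segments):
        self._items = list(typed_segments)
        self._visible = set(ALL_TYPES)

    @property
    def segments(self):
        return [value for code, value in self._items if code in self._visible]

    @contextmanager
    def filter(self, drop="", keep="*"):
        # "*" keeps every type; drop is then subtracted from the kept set.
        kept = set(ALL_TYPES) if keep == "*" else set(keep)
        previous = self._visible
        self._visible = kept - set(drop)
        try:
            yield self
        finally:
            self._visible = previous  # restore the view on exit


sample = ToySegments([
    ("t", "如图所示,则"), ("f", "\\bigtriangleup ABC"), ("t", "的面积是"),
    ("m", "\\SIFBlank"), ("t", "。"), ("g", "\\FigureID{1}"),
])
```

Inside `with sample.filter("f"):` the formula segment is hidden, inside `with sample.filter(keep="t"):` only text remains, and after either block `sample.segments` shows all six entries again.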
- EduNLP.SIF.segment.segment.seg(item, figures=None, symbol=None)
An interface to SegmentList that displays the result in an appropriate way.
- Parameters
item (str) –
figures (dict, optional) –
symbol (str, optional) –
- Returns
segmented item
- Return type
list
Examples
>>> test_item = r"如图所示,则$\bigtriangleup ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
>>> s = seg(test_item)
>>> s
['如图所示,则', '\\bigtriangleup ABC', '的面积是', '\\SIFBlank', '。', \FigureID{1}]
>>> s.describe()
{'t': 3, 'f': 1, 'g': 1, 'm': 1}
>>> with s.filter("f"):
...     s
['如图所示,则', '的面积是', '\\SIFBlank', '。', \FigureID{1}]
>>> with s.filter(keep="t"):
...     s
['如图所示,则', '的面积是', '。']
>>> with s.filter():
...     s
['如图所示,则', '\\bigtriangleup ABC', '的面积是', '\\SIFBlank', '。', \FigureID{1}]
>>> seg(test_item, symbol="fgm")
['如图所示,则', '[FORMULA]', '的面积是', '[MARK]', '。', '[FIGURE]']
>>> seg(test_item, symbol="tfgm")
['[TEXT]', '[FORMULA]', '[TEXT]', '[MARK]', '[TEXT]', '[FIGURE]']
>>> seg(r"如图所示,则$\FormFigureID{0}$的面积是$\SIFBlank$。$\FigureID{1}$")
['如图所示,则', \FormFigureID{0}, '的面积是', '\\SIFBlank', '。', \FigureID{1}]
>>> seg(r"如图所示,则$\FormFigureID{0}$的面积是$\SIFBlank$。$\FigureID{1}$", symbol="fgm")
['如图所示,则', '[FORMULA]', '的面积是', '[MARK]', '。', '[FIGURE]']
>>> s.text_segments
['如图所示,则', '的面积是', '。']
>>> s.formula_segments
['\\bigtriangleup ABC']
>>> s.figure_segments
[\FigureID{1}]
>>> s.ques_mark_segments
['\\SIFBlank']
>>> test_item_1 = {
...     "stem": r"若复数$z=1+2 i+i^{3}$,则$|z|=$",
...     "options": ['0', '1', r'$\sqrt{2}$', '2']
... }
>>> from EduNLP.utils import dict2str4sif
>>> test_item_1_str = dict2str4sif(test_item_1)
>>> test_item_1_str
'$\\SIFTag{stem_begin}$...$\\SIFTag{stem_end}$$\\SIFTag{options_begin}$$\\SIFTag{list_0}$0...$\\SIFTag{options_end}$'
>>> s1 = seg(test_item_1_str, symbol="tfgm")
>>> s1
['\\SIFTag{stem_begin}'...'\\SIFTag{stem_end}', '\\SIFTag{options_begin}', '\\SIFTag{list_0}', ...]
>>> with s1.filter(keep="a"):
...     s1
[...'\\SIFTag{list_0}', '\\SIFTag{list_1}', '\\SIFTag{list_2}', '\\SIFTag{list_3}', '\\SIFTag{options_end}']
>>> s1.tag_segments
['\\SIFTag{stem_begin}', '\\SIFTag{stem_end}', '\\SIFTag{options_begin}', ...
'\\SIFTag{options_end}']
>>> test_item_1_str_2 = dict2str4sif(test_item_1, tag_mode="head", add_list_no_tag=False)
>>> seg(test_item_1_str_2, symbol="tfgmas")
['[TAG]', ... '[TAG]', '[TEXT]', '[SEP]', '[TEXT]', '[SEP]', '[FORMULA]', '[SEP]', '[TEXT]']
>>> s2 = seg(test_item_1_str_2, symbol="fgm")
>>> s2.tag_segments
['\\SIFTag{stem}', '\\SIFTag{options}']
>>> test_item_2 = r"已知$y=x$,则以下说法中$\textf{正确,b}$的是"
>>> s2 = seg(test_item_2)
>>> s2.text_segments
['已知', ',则以下说法中正确的是']
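As a rough mental model of what seg does with a SIF string, the following sketch splits an item at `$...$` and labels each piece with one of the type letters. It is an approximation written for this page, not the library code: toy_seg, classify and describe are hypothetical names, and the classification rules are inferred only from the examples above.

```python
import re
from collections import Counter

# Capture the body of each $...$ span; text pieces fall between the matches.
_DOLLAR = re.compile(r"\$(.+?)\$")


def classify(body):
    """Guess the type letter of a $...$ body from its leading command."""
    if body.startswith("\\FigureID{"):
        return "g"
    if body in ("\\SIFBlank", "\\SIFChoice"):
        return "m"
    if body.startswith("\\SIFTag{"):
        return "a"
    if body == "\\SIFSep":
        return "s"
    return "f"  # anything else, including \FormFigureID{...}, is a formula


def toy_seg(item):
    """Return a list of (type_letter, content) pairs in document order."""
    typed = []
    for index, piece in enumerate(_DOLLAR.split(item)):
        if index % 2 == 1:          # odd positions are captured $...$ bodies
            typed.append((classify(piece), piece))
        elif piece:                 # skip empty text between adjacent spans
            typed.append(("t", piece))
    return typed


def describe(typed):
    """Count segments per type letter."""
    return dict(Counter(code for code, _ in typed))
```

Applied to the first test_item above, this sketch yields three text pieces, one formula, one question mark and one figure, the same counts that `s.describe()` reports.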
Tokenization
tokenize
- class EduNLP.SIF.tokenization.tokenization.TokenList(segment_list: SegmentList, text_params=None, formula_params=None, figure_params=None)[source]¶
- Parameters
segment_list (list) – segmented item
text_params (dict) –
formula_params (dict) –
figure_params (dict) –
- add_seg_type(seg_type, tar: list, add_seg_type=True, mode='delimiter')[source]¶
Add segment-type tags at different positions.
- Parameters
seg_type (str) – t: text; f: formula
tar (list) –
add_seg_type – if False, the function does nothing.
mode (str) – delimiter: at both the head and the tail; head: only at the head; tail: only at the tail
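The three modes can be pictured with a small helper. This is a hedged sketch, not the method itself: toy_add_seg_type is a hypothetical function that returns a new list, the `[..._BEGIN]` and `[..._END]` marker names are taken from the get_segments output shown in the sif4sci examples, and reusing those same markers for the head and tail modes is an assumption.

```python
NAMES = {"t": "TEXT", "f": "FORMULA"}


def toy_add_seg_type(seg_type, tokens, add_seg_type=True, mode="delimiter"):
    """Return tokens wrapped with begin/end markers according to mode."""
    if not add_seg_type:
        return list(tokens)  # switched off: leave the tokens as they are
    if mode not in ("delimiter", "head", "tail"):
        raise ValueError(f"unknown mode: {mode!r}")
    name = NAMES[seg_type]
    out = list(tokens)
    if mode in ("delimiter", "head"):
        out.insert(0, f"[{name}_BEGIN]")
    if mode in ("delimiter", "tail"):
        out.append(f"[{name}_END]")
    return out
```

With the default mode, a formula segment `['mathord', '=', 'textord']` comes back with a marker on each side, and an empty text segment becomes just the two markers.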
- get_segments(add_seg_type=True, add_seg_mode='delimiter', keep='*', drop='', depth=None)[source]¶
Call the segment function.
- Parameters
add_seg_type –
add_seg_mode – delimiter: at both the head and the tail; head: only at the head; tail: only at the tail
keep –
drop –
depth (int or None) – 0: only separate at SIFSep; 1: only separate at SIFTag; 2: separate at SIFTag and SIFSep; otherwise, separate all segments
- Returns
segmented item
- Return type
list
- property text_segments¶
get text segment
- property formula_segments¶
get formula segment
- property figure_segments¶
get figure segment
- property ques_mark_segments¶
get question mark segment
- property tokens¶
add token to a list
- append(segment, lazy=False)[source]¶
The general API for appending elements.
- Parameters
segment –
lazy – True: do not distinguish parameters. False: give the same parameters the same number.
- property text_tokens¶
return text tokens
- property formula_tokens¶
return formula tokens
- property figure_tokens¶
return figure tokens
- property ques_mark_tokens¶
return question mark tokens
- property inner_formula_tokens¶
return inner formula tokens
- filter(drop: (<class 'set'>, <class 'str'>) = '', keep: (<class 'set'>, <class 'str'>) = '*')[source]¶
Selectively output a list of elements. drop lists the segment types to hide; keep lists the segment types to show.
- Parameters
drop (set or str) – letters taken from “tfgmas”; the selected segment types are dropped from the return value.
keep (set or str) – letters taken from “tfgmas”; only the selected segment types are kept in the return value.
- Returns
filtered list
- Return type
list
- EduNLP.SIF.tokenization.tokenization.tokenize(segment_list: SegmentList, text_params=None, formula_params=None, figure_params=None)[source]¶
The actual API for tokenizing an item.
- Parameters
segment_list (list) – segmented item
text_params (dict) – how to deal with text
formula_params (dict) – how to deal with formulas
figure_params (dict) – how to deal with figures
- Returns
tokenized item
- Return type
list
Examples
>>> items = "如图所示,则三角形$ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
>>> tokenize(SegmentList(items))
['如图所示', '三角形', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tokenize(SegmentList(items), formula_params={"method": "ast"})
['如图所示', '三角形', <Formula: ABC>, '面积', '\\SIFBlank', \FigureID{1}]
text
- EduNLP.SIF.tokenization.text.tokenize(text, granularity='word', stopwords='default')[source]¶
Tokenize the text by word or by character, using the jieba library.
- Parameters
text –
granularity –
stopwords (str, None or set) –
Examples
>>> tokenize("三角函数是基本初等函数之一")
['三角函数', '初等', '函数']
>>> tokenize("三角函数是基本初等函数之一", granularity="char")
['三', '角', '函', '数', '基', '初', '函', '数']
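The char granularity combined with stopword removal can be mimicked without jieba. The sketch below is a stand-in, not the library function: toy_char_tokenize is a hypothetical name, and the stopword set used in the usage note is chosen only so that the result lines up with the documented output, since the default stopword list is not shown on this page.

```python
def toy_char_tokenize(text, stopwords=None):
    """Split text into single characters, dropping whitespace and stopwords."""
    stopwords = set() if stopwords is None else set(stopwords)
    return [ch for ch in text if not ch.isspace() and ch not in stopwords]
```

For instance, `toy_char_tokenize("三角函数是基本初等函数之一", stopwords={"是", "本", "等", "之", "一"})` drops five characters and keeps the remaining eight in order.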
formula
- EduNLP.SIF.tokenization.formula.formula.linear_tokenize(formula, preserve_braces=True, number_as_tag=False, *args, **kwargs)[source]¶
Tokenize the formula linearly. The process has three steps: cut, reduce and connect_char.
- Parameters
formula –
preserve_braces –
number_as_tag –
args –
kwargs –
Examples
>>> linear_tokenize(r"{x + y}^\frac{1}{2} + 1 = 0")
['{', 'x', '+', 'y', '}', '^', '\\frac', '{', '1', '}', '{', '2', '}', '+', '1', '=', '0']
>>> linear_tokenize(r"ABC,AB,AC")
['ABC', ',', 'AB', ',', 'AC']
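For intuition about what a linear token stream looks like, here is a regex-based stand-in. It deliberately does not follow the cut, reduce and connect_char pipeline named above; toy_linear_tokenize is a hypothetical function, and its token classes (LaTeX commands, letter runs, numbers, single symbols) are an assumption inferred from the two documented examples.

```python
import re

# Order matters: commands first, then letter runs, numbers, any other symbol.
_TOKEN = re.compile(r"\\[A-Za-z]+|[A-Za-z]+|\d+(?:\.\d+)?|\S")


def toy_linear_tokenize(formula, preserve_braces=True):
    """Scan the formula left to right and emit one token per match."""
    tokens = _TOKEN.findall(formula)
    if not preserve_braces:
        tokens = [tok for tok in tokens if tok not in ("{", "}")]
    return tokens
```

On the two formulas in the examples above this sketch gives the same token lists; it may well diverge from linear_tokenize on other inputs.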
- EduNLP.SIF.tokenization.formula.formula.ast_tokenize(formula, ord2token=False, var_numbering=False, return_type='formula', *args, **kwargs)[source]¶
Tokenize the formula with a method chosen according to return_type.
- Parameters
formula –
ord2token –
var_numbering –
return_type –
args –
kwargs –
Examples
>>> ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x", return_type="list")
['x', '+', 'y', '{ }', '\\pi', '{ }', '2', '{ }', '\\frac', '\\supsub', '+', '1', '=', 'x']
>>> ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x", return_type="list", ord2token=True)
['mathord', '+', 'mathord', '{ }', 'mathord', '{ }', 'textord', '{ }', '\\frac', '\\supsub', '+', 'textord', '=', 'mathord']
>>> ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x", return_type="list", ord2token=True, var_numbering=True)
['mathord_0', '+', 'mathord_1', '{ }', 'mathord_con', '{ }', 'textord', '{ }', '\\frac', '\\supsub', '+', 'textord', '=', 'mathord_0']
>>> len(ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x", return_type="ast").nodes)
14
>>> ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x")
<Formula: {x + y}^\frac{\pi}{2} + 1 = x>
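The token order in the list outputs above (children first, then the enclosing `{ }` or `\frac`) is what a post-order walk of the syntax tree produces. The sketch below shows that walk on a hand-built tree for `\frac{\pi}{x + y} + 1 = x`. It is illustrative only: the node layout (dicts with type, text, children and a const flag) and the function post_order are assumptions made for this example, not the structures used by Formula.

```python
def post_order(nodes, ord2token=False, var_numbering=False, _seen=None):
    """Emit children before their parent; optionally abstract ordinary symbols."""
    if _seen is None:
        _seen = {}  # variable text -> number, shared across the whole walk
    tokens = []
    for node in nodes:
        tokens.extend(post_order(node.get("children", []), ord2token, var_numbering, _seen))
        kind, text = node["type"], node["text"]
        if ord2token and kind in ("mathord", "textord"):
            if kind == "mathord" and var_numbering:
                if node.get("const"):
                    tokens.append("mathord_con")       # constants such as \pi
                else:
                    index = _seen.setdefault(text, len(_seen))
                    tokens.append(f"mathord_{index}")  # same variable, same number
            else:
                tokens.append(kind)
        else:
            tokens.append(text)
    return tokens


def leaf(kind, text, **extra):
    return {"type": kind, "text": text, **extra}


# Hand-built tree for \frac{\pi}{x + y} + 1 = x
forest = [
    {"type": "genfrac", "text": "\\frac", "children": [
        {"type": "ordgroup", "text": "{ }", "children": [leaf("mathord", "\\pi", const=True)]},
        {"type": "ordgroup", "text": "{ }", "children": [
            leaf("mathord", "x"), leaf("bin", "+"), leaf("mathord", "y")]},
    ]},
    leaf("bin", "+"), leaf("textord", "1"), leaf("rel", "="), leaf("mathord", "x"),
]
```

Calling `post_order(forest, ord2token=True)` replaces every variable and number by its type name, and adding `var_numbering=True` gives both occurrences of `x` the same suffix while `\pi` is marked as a constant.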
- EduNLP.SIF.tokenization.formula.formula.tokenize(formula, method='linear', errors='raise', **kwargs)[source]¶
The general function for tokenizing a formula with the linear or the ast method.
- Parameters
formula –
method –
errors – how to handle an exception raised during ast tokenization. “coerce”: fall back to linear_tokenize. “raise”: raise the exception.
kwargs –
Examples
>>> tokenize(r"\frac{\pi}{x + y} + 1 = x")
['\\frac', '{', '\\pi', '}', '{', 'x', '+', 'y', '}', '+', '1', '=', 'x']
>>> tokenize(r"\frac{\pi}{x + y} + 1 = x", method="ast", ord2token=True)
<Formula: \frac{\pi}{x + y} + 1 = x>
>>> tokenize(r"\frac{\pi}{x + y} + 1 = x", method="ast", ord2token=True, return_type="list")
['mathord', '{ }', 'mathord', '+', 'mathord', '{ }', '\\frac', '+', 'textord', '=', 'mathord']
- class EduNLP.SIF.tokenization.formula.ast_token.Formula(formula: (<class 'str'>, typing.List[typing.Dict]), variable_standardization=False, const_mathord=None, init=True, *args, **kwargs)[source]¶
Transforms a formula into its parsed abstract syntax tree.
- Parameters
formula (str or List[Dict]) – LaTeX formula string or the parsed abstract syntax tree
variable_standardization –
const_mathord –
init –
args –
kwargs –
Examples
>>> f = Formula("x")
>>> f
<Formula: x>
>>> f.ast
[{'val': {'id': 0, 'type': 'mathord', 'text': 'x', 'role': None}, 'structure': {'bro': [None, None], 'child': None, 'father': None, 'forest': None}}]
>>> f.elements
[{'id': 0, 'type': 'mathord', 'text': 'x', 'role': None}]
>>> f.variable_standardization(inplace=True)
<Formula: x>
>>> f.elements
[{'id': 0, 'type': 'mathord', 'text': 'x', 'role': None, 'var': 0}]
- variable_standardization(inplace=False, const_mathord=None, variable_connect_dict=None)[source]¶
Gives the same parameters the same number.
- Parameters
inplace –
const_mathord –
variable_connect_dict –
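What "the same number" means can be shown on a flat element list shaped like f.elements in the example above. The sketch is an assumption-laden illustration: toy_standardize is a hypothetical function, it numbers mathord elements by first appearance, and it treats const_mathord as a collection of symbols to leave unnumbered, which is a guess at that parameter's role.

```python
def toy_standardize(elements, const_mathord=None):
    """Return copies of the elements with a 'var' index on each variable."""
    constants = set(const_mathord or ())
    numbering = {}  # variable text -> index, in order of first appearance
    result = []
    for element in elements:
        element = dict(element)  # copy, so the input list is not modified
        if element["type"] == "mathord" and element["text"] not in constants:
            element["var"] = numbering.setdefault(element["text"], len(numbering))
        result.append(element)
    return result
```

For the single-element case this reproduces the `'var': 0` entry shown in the example, and in `x + y = x` both occurrences of `x` would receive index 0 while `y` receives 1.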
- property ast¶
- property elements¶
- property ast_graph: (<class 'networkx.classes.graph.Graph'>, <class 'networkx.classes.digraph.DiGraph'>)¶
- reset_ast(formula_ensure_str: bool = True, variable_standardization=False, const_mathord=None, *args, **kwargs)[source]¶
- property resetable¶
- EduNLP.SIF.tokenization.formula.ast_token.traversal_formula(ast, ord2token=False, var_numbering=False, strategy='post', *args, **kwargs)[source]¶
Runs only when the return type is list. It provides two traversal strategies, post and linear, and appends each node to the token list according to the node’s type.
- EduNLP.SIF.tokenization.formula.ast_token.ast_tokenize(formula, ord2token=False, var_numbering=False, return_type='formula', *args, **kwargs)[source]¶
Tokenize the formula with a method chosen according to return_type.
- Parameters
formula –
ord2token –
var_numbering –
return_type –
args –
kwargs –
Examples
>>> ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x", return_type="list")
['x', '+', 'y', '{ }', '\\pi', '{ }', '2', '{ }', '\\frac', '\\supsub', '+', '1', '=', 'x']
>>> ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x", return_type="list", ord2token=True)
['mathord', '+', 'mathord', '{ }', 'mathord', '{ }', 'textord', '{ }', '\\frac', '\\supsub', '+', 'textord', '=', 'mathord']
>>> ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x", return_type="list", ord2token=True, var_numbering=True)
['mathord_0', '+', 'mathord_1', '{ }', 'mathord_con', '{ }', 'textord', '{ }', '\\frac', '\\supsub', '+', 'textord', '=', 'mathord_0']
>>> len(ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x", return_type="ast").nodes)
14
>>> ast_tokenize(r"{x + y}^\frac{\pi}{2} + 1 = x")
<Formula: {x + y}^\frac{\pi}{2} + 1 = x>
- class EduNLP.SIF.tokenization.formula.linear_token.IntFlag(value)[source]¶
Support for integer-based Flags
- EduNLP.SIF.tokenization.formula.linear_token.cut(formula, preserve_braces=True, with_dollar=False, preserve_dollar=False, number_as_tag=False, preserve_src=True)[source]¶
Cut the formula thoroughly.
- Parameters
formula (str) –
preserve_braces – when False, “{” and “}” are filtered out
with_dollar – whether the formula has “$” delimiters
preserve_dollar – whether to keep “$”
number_as_tag – whether to replace numbers with tags; only numbers with more than one digit are identified
preserve_src –
- Returns
a preliminary list of the fully cut pieces
- Return type
list
Examples
>>> cut(r"${x + y}^\frac{1}{2} + 12.1 = 0$")
['{x + y}', '^', '\\f', 'r', 'a', 'c', '{1}', '{2}', '+', '12.1', '=', '0']
>>> cut(r"${x + y}^\frac{1}{2} + 12.1 = 0$", preserve_dollar=False)
['{x + y}', '^', '\\f', 'r', 'a', 'c', '{1}', '{2}', '+', '12.1', '=', '0']
>>> cut(r"${x + y}^\frac{1}{2} + 12.1 = 0$", number_as_tag=True)
['{x + y}', '^', '\\f', 'r', 'a', 'c', '{1}', '{2}', '+', '{decimal}', '=', '0']
- EduNLP.SIF.tokenization.formula.linear_token.connect_char(words)[source]¶
Connect the characters and convert the result to a list.
- EduNLP.SIF.tokenization.formula.linear_token.latex_parse(formula, preserve_braces=True, with_dollar=True, preserve_dollar=False, number_as_tag=False, preserve_src=True)[source]¶
- EduNLP.SIF.tokenization.formula.linear_token.linear_tokenize(formula, preserve_braces=True, number_as_tag=False, *args, **kwargs)[source]¶
Tokenize the formula linearly. The process has three steps: cut, reduce and connect_char.
- Parameters
formula –
preserve_braces –
number_as_tag –
args –
kwargs –
Examples
>>> linear_tokenize(r"{x + y}^\frac{1}{2} + 1 = 0")
['{', 'x', '+', 'y', '}', '^', '\\frac', '{', '1', '}', '{', '2', '}', '+', '1', '=', '0']
>>> linear_tokenize(r"ABC,AB,AC")
['ABC', ',', 'AB', ',', 'AC']