EduNLP.Tokenizer
- class EduNLP.Tokenizer.PureTextTokenizer(*args, **kwargs)
Handles text and plain-text formulas, and filters out special formulas such as $\FormFigureID{…}$ and $\FormFigureBase64{…}$.
- Parameters
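items (str) – the items to tokenize; the examples pass a list of strings or a list of dicts
key – function applied to each item to obtain the text to tokenize, e.g. lambda x: x["stem"]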
args –
kwargs –
- Return type
token
Examples
>>> tokenizer = PureTextTokenizer()
>>> items = ["有公式$\\FormFigureID{1}$,如图$\\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\\FormFigureBase64{2}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$"]
>>> tokens = tokenizer(items)
>>> next(tokens)[:10]
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z']
>>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
>>> tokens = tokenizer(items)
>>> next(tokens)
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<', '0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',', '\\quad', 'A', '\\cap', 'B', '=']
>>> items = [{
...     "stem": "已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$",
...     "options": ["1", "2"]
... }]
>>> tokens = tokenizer(items, key=lambda x: x["stem"])
>>> next(tokens)
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<', '0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',', '\\quad', 'A', '\\cap', 'B', '=']
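The examples above call the tokenizer on a list, pull results with next(), and use key to pick a field out of each dict. The sketch below imitates only that calling pattern. It is not EduNLP's implementation: toy_tokenize is a hypothetical stand-in, and a plain whitespace split replaces the real segmentation and formula handling.

```python
from typing import Any, Callable, Iterable, Iterator, List


def toy_tokenize(
    items: Iterable[Any],
    key: Callable[[Any], str] = lambda x: x,
) -> Iterator[List[str]]:
    """Hypothetical stand-in for a tokenizer call.

    Yields one token list per item, lazily, in input order.
    """
    for item in items:
        text = key(item)    # extract the string to tokenize from the item
        yield text.split()  # whitespace split stands in for real tokenization


# Dict items: `key` selects the field that holds the text.
items = [{"stem": "A cap B =", "options": ["1", "2"]}]
tokens = toy_tokenize(items, key=lambda x: x["stem"])
first = next(tokens)

# Plain string items: the default `key` returns each item unchanged.
plain = next(toy_tokenize(["x y"]))
```

Because the result is a generator, nothing is tokenized until next() (or a loop) asks for it, and each item's tokens can be consumed once.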
- class EduNLP.Tokenizer.TextTokenizer(*args, **kwargs)
Handles text and formulas, including special formulas.
- Parameters
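items (str) – the items to tokenize; the examples pass a list of strings
key – function applied to each item to obtain the text to tokenize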
args –
kwargs –
- Return type
token
Examples
>>> tokenizer = TextTokenizer()
>>> items = ["有公式$\\FormFigureID{1}$,如图$\\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\\FormFigureBase64{2}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$"]
>>> tokens = tokenizer(items)
>>> next(tokens)[:10]
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
>>> items = ["$\\SIFTag{stem_begin}$若复数$z=1+2 i+i^{3}$,则$|z|=$$\\SIFTag{stem_end}$\
... $\\SIFTag{options_begin}$$\\SIFTag{list_0}$0$\\SIFTag{list_1}$1$\\SIFTag{list_2}$$\\sqrt{2}$\
... $\\SIFTag{list_3}$2$\\SIFTag{options_end}$"]
>>> tokens = tokenizer(items)
>>> next(tokens)[:10]
['[TAG]', '复数', 'z', '=', '1', '+', '2', 'i', '+', 'i']
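Comparing the first example of each class, the visible difference is that TextTokenizer emits a [FORMULA] token where PureTextTokenizer emits nothing for $\FormFigureID{…}$ and $\FormFigureBase64{…}$. A minimal regex sketch of that difference follows. The function handle_special_formulas and its keep flag are hypothetical names introduced for illustration, and the library's actual mechanism may differ.

```python
import re

# Matches $\FormFigureID{...}$ and $\FormFigureBase64{...}$ (dollar signs included).
SPECIAL_FORMULA = re.compile(r"\$\\FormFigure(?:ID|Base64)\{[^}]*\}\$")


def handle_special_formulas(text: str, keep: bool) -> str:
    """Hypothetical helper: replace special formulas with a placeholder or drop them."""
    replacement = " [FORMULA] " if keep else " "
    return SPECIAL_FORMULA.sub(replacement, text)


sample = "formula $\\FormFigureID{1}$ and $\\FormFigureBase64{2}$ with $x,y$"

# keep=True mimics the [FORMULA] placeholders in the TextTokenizer output.
kept = handle_special_formulas(sample, keep=True).split()

# keep=False mimics the filtering shown in the PureTextTokenizer output.
dropped = handle_special_formulas(sample, keep=False).split()
```

Ordinary formulas such as $x,y$ are left untouched by this pattern; only the two special forms are affected.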
- EduNLP.Tokenizer.get_tokenizer(name, *args, **kwargs)
A unified interface for obtaining the different tokenizers by name.
- Parameters
name (str) – the name of the tokenizer, e.g. text, pure_text
args – the positional parameters passed to the tokenizer
kwargs – the keyword parameters passed to the tokenizer
- Returns
tokenizer
- Return type
Examples
>>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
>>> tokenizer = get_tokenizer("text")
>>> tokens = tokenizer(items)
>>> next(tokens)
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<', '0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',', '\\quad', 'A', '\\cap', 'B', '=']
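A name-based factory like this is commonly built on a dictionary that maps names to classes. The sketch below shows that pattern under stated assumptions: TOKENIZERS, get_toy_tokenizer, and both toy classes are hypothetical, and nothing here is taken from EduNLP's source.

```python
from typing import Dict, Iterable, Iterator, List, Type


class WhitespaceTokenizer:
    """Toy tokenizer: splits each item on whitespace."""

    def __call__(self, items: Iterable[str]) -> Iterator[List[str]]:
        for item in items:
            yield item.split()


class CharTokenizer:
    """Toy tokenizer: one token per non-space character."""

    def __call__(self, items: Iterable[str]) -> Iterator[List[str]]:
        for item in items:
            yield [ch for ch in item if not ch.isspace()]


# Hypothetical registry mapping a name to a tokenizer class.
TOKENIZERS: Dict[str, Type] = {
    "whitespace": WhitespaceTokenizer,
    "char": CharTokenizer,
}


def get_toy_tokenizer(name: str, *args, **kwargs):
    """Look up a tokenizer class by name and instantiate it with the given arguments."""
    if name not in TOKENIZERS:
        raise KeyError(f"unknown tokenizer {name!r}; choose from {sorted(TOKENIZERS)}")
    return TOKENIZERS[name](*args, **kwargs)


words = next(get_toy_tokenizer("whitespace")(["a b"]))
chars = next(get_toy_tokenizer("char")(["a b"]))
```

Forwarding *args and **kwargs to the constructor keeps the factory signature stable when an individual tokenizer gains new options.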