Example: Training Models with gensim

Overview

You can train and use your own models with your own data and model parameters.

Import modules

[13]:

import json
from tqdm import tqdm
from EduNLP.Pretrain import GensimWordTokenizer, train_vector
from EduNLP.Vector import D2V, W2V
from EduNLP.SIF.segment import seg
from EduNLP.SIF.tokenization import tokenize
import time

Prepare the training data

[12]:

test_items = [{'ques_content':'有公式$\\FormFigureID{wrong1?}$和公式$\\FormFigureBase64{wrong2?}$，如图$\\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\\SIFSep$，则$z=x+7 y$的最大值为$\\SIFBlank$'},
             {"ques_content":"Human machine interface for lab abc computer applications"},
             {"ques_content": "A survey of user opinion of computer system response time"},
             {"ques_content": "The EPS user interface management system"},
             {"ques_content": "System and human system engineering testing of EPS"},
             {"ques_content": "Relation of user perceived response time to error measurement"},
             {"ques_content": "The generation of random binary unordered trees"},
             {"ques_content": "The intersection graph of paths in trees"},
             {"ques_content": "Graph minors IV Widths of trees and well quasi ordering"},
             {"ques_content": "Graph minors A survey"}
             ]

def load_items():
    for line in test_items:
        yield line


def data2Token():
    # Build the tokenizer once instead of once per item.
    # general=True selects linear tokenization for formulas; with symbol="gmas",
    # everything except text and formulas is converted to a special token.
    tokenizer = GensimWordTokenizer(symbol="gmas", general=True)

    token_items = []
    for item in tqdm(load_items(), "sifing"):
        token_item = tokenizer(item["ques_content"])
        if token_item:
            token_items.append(token_item.tokens)
    print("[data2Token] finish ========================> num = ", len(token_items))
    return token_items

token_items = data2Token()
print(token_items[0])

sifing: 10it [00:00, 114.91it/s]

[data2Token] finish ========================> num =  10
['公式', '[FORMULA]', '公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']
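The printed token list mixes ordinary tokens with bracketed placeholders such as `[FORMULA]`, `[FIGURE]`, `[SEP]` and `[MARK]`. As a minimal sketch in plain Python (not part of EduNLP; the helper name `split_special` is hypothetical, and it assumes every placeholder has the form `[UPPERCASE]`), the two kinds of token can be separated like this:

```python
# Token list copied from the output above.
tokens = ['公式', '[FORMULA]', '公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y',
          '约束条件', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']


def split_special(tokens):
    """Split tokens into bracketed placeholders and everything else.

    Hypothetical helper: any token shaped like "[UPPERCASE]" counts as a placeholder.
    """
    special, ordinary = [], []
    for tok in tokens:
        is_placeholder = (
            tok.startswith("[") and tok.endswith("]") and tok[1:-1].isupper()
        )
        (special if is_placeholder else ordinary).append(tok)
    return special, ordinary


special, ordinary = split_special(tokens)
print(special)
print(len(ordinary))
```

A check like this can help confirm that figures, separators and blanks were symbolized as intended before training.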

[3]:

len(token_items[0])

[3]:

You can also load data from a file

For example:

[45]:

from EduData import get_data

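# Download the dataset provided by the project into "../../data/"
get_data("open-luna", "../../data/")


def load_items():
    # Read from the same directory that get_data saved to (see the log below).
    with open("../../data/OpenLUNA.json", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)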

downloader, INFO http://base.ustc.edu.cn/data/OpenLUNA/OpenLUNA.json is saved as ../../data/OpenLUNA.json
downloader, INFO file existed, skipped
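`load_items` reads the file one line at a time and parses each line as a separate JSON object. The sketch below shows that reading pattern on its own. The temporary file and its two records are made up for illustration, the helper name `load_json_lines` is hypothetical, and skipping blank lines is an added safeguard that the original function does not have.

```python
import json
import os
import tempfile


def load_json_lines(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines instead of failing in json.loads
                yield json.loads(line)


# Build a tiny throwaway file so the sketch runs without downloading anything.
records = [
    {"stem": "Human machine interface for lab abc computer applications"},
    {"stem": "The EPS user interface management system"},
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    for rec in records:
        tmp.write(json.dumps(rec, ensure_ascii=False) + "\n")
    tmp.write("\n")  # a trailing blank line, which the reader skips

loaded = list(load_json_lines(tmp.name))
os.remove(tmp.name)
print(len(loaded), loaded[1]["stem"])
```

Because the function is a generator, items are parsed lazily, so a large file never has to be held in memory all at once.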

[46]:

tokenizer = GensimWordTokenizer(symbol="gm")
sif_items = []
for item in tqdm(load_items(), "sifing"):
    sif_item = tokenizer(
        item["stem"]
    )
    if sif_item:
        sif_items.append(sif_item.tokens)

sif_items[0]

Training methods for the EduNLP.Vector.D2V module

1. Train a bag-of-words (bow) model

[6]:

train_vector(token_items, "../../../data/d2v/gensim_luna_stem_tf_", method="bow")

EduNLP, INFO model is saved to ../../../data/d2v/gensim_luna_stem_tf_bow.bin

[6]:

'../../../data/d2v/gensim_luna_stem_tf_bow.bin'

Test the model

[9]:

d2v = D2V("../../../data/d2v/gensim_luna_stem_tf_bow.bin", method = "bow")
print(d2v(token_items[1]))

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
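The vector printed above is mostly zeros, with non-zero entries only in a few slots. That is the general shape of a bag-of-words vector: one slot per vocabulary entry, holding the count of that token in the item. The following plain-Python sketch illustrates the idea. It is not EduNLP's or gensim's implementation, `build_vocab` and `bow_vector` are hypothetical helper names, the whitespace split is a stand-in for the real tokenizer, and the id order need not match the trained model's.

```python
def build_vocab(docs):
    """Map each distinct token to an integer id in first-seen order."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab


def bow_vector(doc, vocab):
    """Return a dense count vector with one slot per vocabulary entry."""
    vec = [0] * len(vocab)
    for tok in doc:
        if tok in vocab:  # tokens missing from the vocabulary are ignored
            vec[vocab[tok]] += 1
    return vec


# Three of the sample sentences, split on whitespace for illustration only.
docs = [
    "Human machine interface for lab abc computer applications".split(),
    "A survey of user opinion of computer system response time".split(),
    "The EPS user interface management system".split(),
]
vocab = build_vocab(docs)
vec = bow_vector(docs[2], vocab)
print(len(vocab), vec)
```

The vector length equals the vocabulary size, so it grows with the training corpus rather than being chosen in advance.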

2. Train a tfidf model

[7]:

train_vector(token_items, "../../../data/d2v/gensim_luna_stem_tf_", method="tfidf")

EduNLP, INFO model is saved to ../../../data/d2v/gensim_luna_stem_tf_bow.bin
EduNLP, INFO model is saved to ../../../data/d2v/gensim_luna_stem_tf_tfidf.bin

[7]:

'../../../data/d2v/gensim_luna_stem_tf_tfidf.bin'

Test the model

[11]:

d2v = D2V("../../../data/d2v/gensim_luna_stem_tf_tfidf.bin", method = "tfidf")
vec_size = d2v.vector_size
print("vec_size = ", vec_size)
print(d2v(token_items[1]))

vec_size =  63
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.37858374396389033, 0.37858374396389033, 0.37858374396389033, 0.2646186811599866, 0.37858374396389033, 0.2646186811599866, 0.37858374396389033, 0.37858374396389033, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
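The two distinct values in the output above (about 0.3786 and 0.2646) can be reproduced by hand under one assumption: each weight is tf · log2(N / df), and the resulting vector is then L2-normalized. This weighting is an assumption about the defaults of the underlying gensim model, not something the notebook states. The helper name `tfidf_vector` is hypothetical, and the whitespace split is only a stand-in for the real tokenizer.

```python
import math
from collections import Counter

# Stand-in corpus: the token list printed earlier plus the nine English
# sentences split on whitespace (an assumption, not the real tokenizer output).
docs = [
    ['公式', '[FORMULA]', '公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y',
     '约束条件', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]'],
] + [s.split() for s in [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]]


def tfidf_vector(doc, docs):
    """Sparse TF-IDF weights: tf * log2(N / df), then L2-normalized."""
    n_docs = len(docs)
    df = Counter(tok for d in docs for tok in set(d))  # document frequency
    tf = Counter(doc)                                  # raw term frequency
    weights = {
        tok: count * math.log2(n_docs / df[tok])
        for tok, count in tf.items()
        if df[tok]  # skip tokens that never occur in the corpus
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()} if norm else weights


weights = tfidf_vector(docs[1], docs)
for tok, w in weights.items():
    print(f"{tok:>12s} {w:.4f}")
```

Under these assumptions, tokens that occur in a single item get the larger weight, while `interface` and `computer`, which each occur in two items, get the smaller one.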

3. Train a Doc2Vec model

[18]:

# Train 10-dimensional vectors with the doc2vec method
train_vector(token_items, "../../../data/w2v/gensim_luna_stem_tf_", 10, method="d2v")

EduNLP, INFO Epoch #0: loss-0.0000
EduNLP, INFO Epoch #1: loss-0.0000
EduNLP, INFO Epoch #2: loss-0.0000
EduNLP, INFO Epoch #3: loss-0.0000
EduNLP, INFO Epoch #4: loss-0.0000
EduNLP, INFO Epoch #5: loss-0.0000
EduNLP, INFO Epoch #6: loss-0.0000
EduNLP, INFO Epoch #7: loss-0.0000
EduNLP, INFO Epoch #8: loss-0.0000
EduNLP, INFO Epoch #9: loss-0.0000
EduNLP, INFO model is saved to ../../../data/w2v/gensim_luna_stem_tf_d2v_10.bin

[18]:

'../../../data/w2v/gensim_luna_stem_tf_d2v_10.bin'

[22]:

d2v = D2V("../../../data/w2v/gensim_luna_stem_tf_d2v_10.bin", method="d2v")
vec_size = d2v.vector_size
print("vec_size = ", vec_size)
print(d2v(token_items[1]))

vec_size =  10
[-0.00211227  0.00167636  0.02313529 -0.04260717 -0.01389424 -0.03898989
  0.01181044  0.01069339 -0.03934718  0.00038158]
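Dense vectors like the one above are typically compared with cosine similarity. The sketch below is a generic, plain-Python illustration and not an EduNLP API. `cosine_similarity` is a hypothetical helper, and `vec_b` and `vec_c` are constructed purely for demonstration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# The 10-dimensional vector printed above.
vec_a = [-0.00211227, 0.00167636, 0.02313529, -0.04260717, -0.01389424,
         -0.03898989, 0.01181044, 0.01069339, -0.03934718, 0.00038158]
vec_b = [2 * x for x in vec_a]   # same direction, different length (made up)
vec_c = [-x for x in vec_a]      # opposite direction (made up)

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Cosine similarity ignores vector length, which is why `vec_a` and its scaled copy are treated as identical in direction.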

Training methods supported by the EduNLP.Vector.W2V module

1. Train a FastText model

[25]:

# Train 10-dimensional vectors with the fasttext method
train_vector(token_items, "../../../data/w2v/gensim_luna_stem_t_",
             10, method="fasttext")

EduNLP, INFO Epoch #0: loss-0.0000
EduNLP, INFO Epoch #1: loss-0.0000
EduNLP, INFO Epoch #2: loss-0.0000
EduNLP, INFO Epoch #3: loss-0.0000
EduNLP, INFO Epoch #4: loss-0.0000
EduNLP, INFO model is saved to ../../../data/w2v/gensim_luna_stem_t_fasttext_10.bin

[25]:

'../../../data/w2v/gensim_luna_stem_t_fasttext_10.bin'

Test the model

[41]:

w2v = W2V("../../../data/w2v/gensim_luna_stem_t_fasttext_10.bin", method="fasttext")
w2v["[FORMULA]"]

[41]:

array([-0.00434524, -0.00836839, -0.02108332,  0.00493213,  0.00461454,
        0.01070305, -0.01737931,  0.0210843 , -0.00525515,  0.00918209],
      dtype=float32)

2. Train a cbow model

[42]:

train_vector(token_items, "../../../data/w2v/gensim_luna_stem_t_", 10, method="cbow")

EduNLP, INFO Epoch #0: loss-0.0000
EduNLP, INFO Epoch #1: loss-0.0000
EduNLP, INFO Epoch #2: loss-0.0000
EduNLP, INFO Epoch #3: loss-0.0000
EduNLP, INFO Epoch #4: loss-0.0000
EduNLP, INFO model is saved to ../../../data/w2v/gensim_luna_stem_t_cbow_10.kv

[42]:

'../../../data/w2v/gensim_luna_stem_t_cbow_10.kv'

Test the model

[43]:

w2v = W2V("../../../data/w2v/gensim_luna_stem_t_cbow_10.kv",
          method="cbow")
w2v["[FORMULA]"]

[43]:

array([-0.0156765 ,  0.00329737, -0.04140369, -0.07689971, -0.01493463,
        0.02475806, -0.00877463,  0.05539609, -0.02750023,  0.0224804 ],
      dtype=float32)
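Each `train_vector` call in this notebook returns the path it saved to, and the five returned paths follow a visible pattern: the prefix, the method name, the dimension for the dense models, and a `.bin` or `.kv` extension. The helper below only mirrors those five observed file names. It is a hypothetical convenience for locating saved models, not part of EduNLP, and it deliberately refuses methods that do not appear in this notebook.

```python
# Extensions observed in the logs of this notebook for the dense models.
OBSERVED_EXTENSIONS = {"d2v": ".bin", "fasttext": ".bin", "cbow": ".kv"}


def expected_model_path(prefix, method, dim=None):
    """Guess the saved path from the naming pattern seen in the logs above.

    Hypothetical helper: it covers only the five methods used in this notebook.
    """
    if method in ("bow", "tfidf"):
        # Sparse models: no dimension in the file name.
        return f"{prefix}{method}.bin"
    if method not in OBSERVED_EXTENSIONS:
        raise ValueError(f"no naming pattern observed for method {method!r}")
    if dim is None:
        raise ValueError("dense models need a vector dimension")
    return f"{prefix}{method}_{dim}{OBSERVED_EXTENSIONS[method]}"


print(expected_model_path("../../../data/d2v/gensim_luna_stem_tf_", "tfidf"))
print(expected_model_path("../../../data/w2v/gensim_luna_stem_t_", "cbow", 10))
```

Relying on the return value of `train_vector` is the safer option; a helper like this is only a fallback when the path has to be rebuilt later.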