Example: Training Models with gensim
Overview
You can train and use your own models with your own data and model parameters.
Import modules
[13]:
import json
from tqdm import tqdm
from EduNLP.Pretrain import GensimWordTokenizer, train_vector
from EduNLP.Vector import D2V, W2V
from EduNLP.SIF.segment import seg
from EduNLP.SIF.tokenization import tokenize
import time
Prepare the training data
[12]:
test_items = [{'ques_content':'有公式$\\FormFigureID{wrong1?}$和公式$\\FormFigureBase64{wrong2?}$,如图$\\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$'},
{"ques_content":"Human machine interface for lab abc computer applications"},
{"ques_content": "A survey of user opinion of computer system response time"},
{"ques_content": "The EPS user interface management system"},
{"ques_content": "System and human system engineering testing of EPS"},
{"ques_content": "Relation of user perceived response time to error measurement"},
{"ques_content": "The generation of random binary unordered trees"},
{"ques_content": "The intersection graph of paths in trees"},
{"ques_content": "Graph minors IV Widths of trees and well quasi ordering"},
{"ques_content": "Graph minors A survey"}
]
def load_items():
    for line in test_items:
        yield line


def data2Token():
    # Linear tokenization: everything except text and formulas is
    # converted to special symbols (symbol="gmas"), and general=True
    # tokenizes formulas linearly.
    tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
    token_items = []
    for item in tqdm(load_items(), "sifing"):
        token_item = tokenizer(item["ques_content"])
        if token_item:
            token_items.append(token_item.tokens)
    print("[data2Token] finish ========================> num = ", len(token_items))
    return token_items


token_items = data2Token()
print(token_items[0])
sifing: 10it [00:00, 114.91it/s]
[data2Token] finish ========================> num = 10
['公式', '[FORMULA]', '公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]']
[3]:
len(token_items[0])
[3]:
19
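As a quick sanity check on the tokenizer output, the sketch below counts the special symbols in the first token list. It copies the tokens printed above into a literal list so that it runs without EduNLP; the variable names are illustrative only, and treating every token of the form `[NAME]` as a special symbol is an assumption of this sketch.

```python
from collections import Counter

# Tokens copied from the printed output of token_items[0] above.
first_item_tokens = [
    '公式', '[FORMULA]', '公式', '[FORMULA]', '如图', '[FIGURE]',
    'x', ',', 'y', '约束条件', '[SEP]', 'z', '=', 'x', '+', '7', 'y',
    '最大值', '[MARK]',
]

# Assumption: a special symbol is any token written as "[NAME]".
special_counts = Counter(
    tok for tok in first_item_tokens
    if tok.startswith("[") and tok.endswith("]")
)

print(len(first_item_tokens))
print(dict(special_counts))
```

The formula and figure placeholders in the raw item are collapsed into a handful of shared symbols, which keeps the vocabulary small for the models trained below.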
You can also load data from a file
For example:
[45]:
from EduData import get_data

# Download the dataset provided by the project into "../../data/"
get_data("open-luna", "../../data/")


def load_items():
    # Read from the same directory that get_data saved the file to.
    with open("../../data/OpenLUNA.json", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
downloader, INFO http://base.ustc.edu.cn/data/OpenLUNA/OpenLUNA.json is saved as ../../data/OpenLUNA.json
downloader, INFO file existed, skipped
[46]:
tokenizer = GensimWordTokenizer(symbol="gm")
sif_items = []
for item in tqdm(load_items(), "sifing"):
    sif_item = tokenizer(item["stem"])
    if sif_item:
        sif_items.append(sif_item.tokens)
sif_items[0]
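The loader above parses one JSON object per line. The following sketch shows the same pattern on an in-memory file, so it runs without downloading anything. The two records are made up for illustration; only the `stem` field name comes from the code above, and `load_items_from` is a hypothetical helper.

```python
import io
import json

# Two made-up records standing in for lines of OpenLUNA.json.
fake_file = io.StringIO(
    '{"stem": "If $x > 0$, find the minimum of $x + 1/x$."}\n'
    '{"stem": "Solve $2x + 3 = 7$."}\n'
)


def load_items_from(f):
    """Yield one parsed record per non-empty line."""
    for line in f:
        line = line.strip()
        if line:
            yield json.loads(line)


stems = [item["stem"] for item in load_items_from(fake_file)]
print(stems)
```

Skipping blank lines is a small defensive choice: a trailing newline at the end of the file would otherwise make `json.loads` raise an error.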
Training methods for the EduNLP.Vector.D2V module
1. Training a bow model
[6]:
train_vector(token_items, "../../../data/d2v/gensim_luna_stem_tf_", method="bow")
EduNLP, INFO model is saved to ../../../data/d2v/gensim_luna_stem_tf_bow.bin
[6]:
'../../../data/d2v/gensim_luna_stem_tf_bow.bin'
Testing the model
[9]:
d2v = D2V("../../../data/d2v/gensim_luna_stem_tf_bow.bin", method = "bow")
print(d2v(token_items[1]))
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
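To make the output above easier to read: a bag-of-words vector has one slot per vocabulary entry and stores how often each token occurs in the item. The sketch below is a plain-Python illustration of that idea on three of the sample sentences, not EduNLP's or gensim's implementation; whitespace splitting and first-seen id assignment are assumptions made for brevity.

```python
sentences = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
]
docs = [s.split() for s in sentences]

# Build the vocabulary: each new token gets the next free integer id.
vocab = {}
for doc in docs:
    for tok in doc:
        vocab.setdefault(tok, len(vocab))


def bow_vector(tokens):
    """Return a count vector with one entry per vocabulary token."""
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:  # tokens outside the vocabulary are ignored
            vec[vocab[tok]] += 1
    return vec


vec = bow_vector(docs[1])
print(len(vocab))
print(vec)
```

The vector length equals the vocabulary size, and a token that occurs twice in an item (here `of`) gets a count of 2 rather than 1.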
2. Training a tfidf model
[7]:
train_vector(token_items, "../../../data/d2v/gensim_luna_stem_tf_", method="tfidf")
EduNLP, INFO model is saved to ../../../data/d2v/gensim_luna_stem_tf_bow.bin
EduNLP, INFO model is saved to ../../../data/d2v/gensim_luna_stem_tf_tfidf.bin
[7]:
'../../../data/d2v/gensim_luna_stem_tf_tfidf.bin'
Testing the model
[11]:
d2v = D2V("../../../data/d2v/gensim_luna_stem_tf_tfidf.bin", method = "tfidf")
vec_size = d2v.vector_size
print("vec_size = ", vec_size)
print(d2v(token_items[1]))
vec_size = 63
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.37858374396389033, 0.37858374396389033, 0.37858374396389033, 0.2646186811599866, 0.37858374396389033, 0.2646186811599866, 0.37858374396389033, 0.37858374396389033, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
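The non-zero weights above appear consistent with gensim's default TF-IDF weighting: term frequency times log2(N/df), followed by L2 normalization. The sketch below recomputes them in plain Python under two assumptions: the English sentences are split on whitespace with case preserved, and the first item contributes the token list printed earlier. It is an illustration only, not the code path that `train_vector` runs.

```python
import math
from collections import Counter

docs = [
    # Tokens of the first item, copied from the tokenizer output printed earlier.
    ['公式', '[FORMULA]', '公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y',
     '约束条件', '[SEP]', 'z', '=', 'x', '+', '7', 'y', '最大值', '[MARK]'],
] + [s.split() for s in [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]]

n_docs = len(docs)
# Document frequency: in how many items does each token occur?
doc_freq = Counter(tok for doc in docs for tok in set(doc))


def tfidf(tokens):
    """Weight each token by tf * log2(N / df), then scale to unit L2 norm."""
    tf = Counter(tokens)
    raw = {tok: n * math.log2(n_docs / doc_freq[tok]) for tok, n in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
    return {tok: w / norm for tok, w in raw.items()}


weights = tfidf(docs[1])
for tok, w in weights.items():
    print(f"{tok:>12s} {w:.4f}")
```

Tokens that occur in only one item receive the larger weight, while `interface` and `computer`, which each occur in two items, receive the smaller one.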
3. Training a Doc2Vec model
[18]:
# 10-dimensional vectors with the doc2vec method
train_vector(token_items, "../../../data/w2v/gensim_luna_stem_tf_", 10, method="d2v")
EduNLP, INFO Epoch #0: loss-0.0000
EduNLP, INFO Epoch #1: loss-0.0000
EduNLP, INFO Epoch #2: loss-0.0000
EduNLP, INFO Epoch #3: loss-0.0000
EduNLP, INFO Epoch #4: loss-0.0000
EduNLP, INFO Epoch #5: loss-0.0000
EduNLP, INFO Epoch #6: loss-0.0000
EduNLP, INFO Epoch #7: loss-0.0000
EduNLP, INFO Epoch #8: loss-0.0000
EduNLP, INFO Epoch #9: loss-0.0000
EduNLP, INFO model is saved to ../../../data/w2v/gensim_luna_stem_tf_d2v_10.bin
[18]:
'../../../data/w2v/gensim_luna_stem_tf_d2v_10.bin'
[22]:
d2v = D2V("../../../data/w2v/gensim_luna_stem_tf_d2v_10.bin", method="d2v")
vec_size = d2v.vector_size
print("vec_size = ", vec_size)
print(d2v(token_items[1]))
vec_size = 10
[-0.00211227 0.00167636 0.02313529 -0.04260717 -0.01389424 -0.03898989
0.01181044 0.01069339 -0.03934718 0.00038158]
Training methods supported by the EduNLP.Vector.W2V module
1. Training a FastText model
[25]:
# 10-dimensional vectors with the fasttext method
train_vector(token_items, "../../../data/w2v/gensim_luna_stem_t_",
10, method="fasttext")
EduNLP, INFO Epoch #0: loss-0.0000
EduNLP, INFO Epoch #1: loss-0.0000
EduNLP, INFO Epoch #2: loss-0.0000
EduNLP, INFO Epoch #3: loss-0.0000
EduNLP, INFO Epoch #4: loss-0.0000
EduNLP, INFO model is saved to ../../../data/w2v/gensim_luna_stem_t_fasttext_10.bin
[25]:
'../../../data/w2v/gensim_luna_stem_t_fasttext_10.bin'
Testing the model
[41]:
w2v = W2V("../../../data/w2v/gensim_luna_stem_t_fasttext_10.bin", method="fasttext")
w2v["[FORMULA]"]
[41]:
array([-0.00434524, -0.00836839, -0.02108332, 0.00493213, 0.00461454,
0.01070305, -0.01737931, 0.0210843 , -0.00525515, 0.00918209],
dtype=float32)
2. Training a cbow model
[42]:
train_vector(token_items, "../../../data/w2v/gensim_luna_stem_t_", 10, method="cbow")
EduNLP, INFO Epoch #0: loss-0.0000
EduNLP, INFO Epoch #1: loss-0.0000
EduNLP, INFO Epoch #2: loss-0.0000
EduNLP, INFO Epoch #3: loss-0.0000
EduNLP, INFO Epoch #4: loss-0.0000
EduNLP, INFO model is saved to ../../../data/w2v/gensim_luna_stem_t_cbow_10.kv
[42]:
'../../../data/w2v/gensim_luna_stem_t_cbow_10.kv'
Testing the model
[43]:
w2v = W2V("../../../data/w2v/gensim_luna_stem_t_cbow_10.kv",
method="cbow")
w2v["[FORMULA]"]
[43]:
array([-0.0156765 , 0.00329737, -0.04140369, -0.07689971, -0.01493463,
0.02475806, -0.00877463, 0.05539609, -0.02750023, 0.0224804 ],
dtype=float32)
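The log lines in this notebook suggest a naming pattern for saved models: the path prefix, then the method name, then (for the embedding methods) the dimension, with a `.kv` suffix for `cbow` and `.bin` otherwise. The helper below only restates that observed pattern so you can locate a model after training. `expected_model_path` is a hypothetical name, not part of EduNLP, and it covers only the five methods shown here.

```python
def expected_model_path(prefix, method, dim=None):
    """Rebuild the output path seen in the training logs of this notebook."""
    if method in ("bow", "tfidf"):
        return f"{prefix}{method}.bin"
    if method in ("d2v", "fasttext"):
        return f"{prefix}{method}_{dim}.bin"
    if method == "cbow":
        return f"{prefix}{method}_{dim}.kv"
    # Other methods are not shown in this notebook, so no pattern is assumed.
    raise ValueError(f"no observed naming pattern for method {method!r}")


print(expected_model_path("../../../data/w2v/gensim_luna_stem_t_", "cbow", 10))
print(expected_model_path("../../../data/d2v/gensim_luna_stem_tf_", "tfidf"))
```

In practice it is simpler to keep the string returned by `train_vector`, as the cells above do, and pass it straight to `D2V` or `W2V`.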