Using the Bert vectorization container

[1]:
from EduNLP.I2V import Bert, get_pretrained_i2v


# Set your data path and output path
# BASE_DIR = "/your/own/base/path"
BASE_DIR = "../../"

data_dir = f"{BASE_DIR}/static/test_data/OpenLUNA"
output_dir = f"{BASE_DIR}/examples/test_model/data/data/bert"
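
The paths in the cell above are built with f-strings from a `BASE_DIR` that already ends in a slash, which is why the later log lines show a doubled separator (`../..//examples/...`). This is harmless on most systems. As a minimal sketch of an alternative, assuming only the standard library, `pathlib` can join the same segments without the doubled slash:

```python
from pathlib import Path

# Same base directory as in the cell above; adjust to your own checkout.
BASE_DIR = Path("../../")

# The / operator joins path segments and normalizes the trailing slash.
data_dir = BASE_DIR / "static" / "test_data" / "OpenLUNA"
output_dir = BASE_DIR / "examples" / "test_model" / "data" / "data" / "bert"

# as_posix() renders forward slashes on every platform.
print(data_dir.as_posix())
print(output_dir.as_posix())
```

Whether `Bert` accepts `Path` objects directly is not confirmed here, so passing `str(output_dir)` is the conservative choice.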

Loading a local model with I2V

[2]:
tokenizer_kwargs = {"tokenizer_config_dir": output_dir}
i2v = Bert('bert', 'bert', output_dir, tokenizer_kwargs=tokenizer_kwargs)
Some weights of the model checkpoint at ../..//examples/test_model/data/data/bert were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at ../..//examples/test_model/data/data/bert and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[3]:
item = [
        {'stem': '如图$\\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$, \
        若$x,y$满足约束条件$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$'},
        {'stem': '已知圆$x^{2}+y^{2}-6 x=0$,过点(1,2)的直线被该圆所截得的弦的长度的最小值为'}
]
# A single item can be vectorized
i_vec, t_vec = i2v(item[0]['stem'])
print(i_vec.shape) # == torch.Size([x, x])
print(t_vec.shape) # == torch.Size([x, x, x])
print()

# A list of items can be vectorized as well
i_vec, t_vec = i2v([ item[0]['stem'], item[1]['stem'] ])
print(i_vec.shape) # == torch.Size([x, x])
print(t_vec.shape) # == torch.Size([x, x, x])
torch.Size([1, 768])
torch.Size([1, 21, 768])

torch.Size([2, 768])
torch.Size([2, 32, 768])
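
In the output above, `i_vec` has shape `[batch, 768]` (one vector per item) and `t_vec` has shape `[batch, tokens, 768]` (one vector per token). The token dimension is 21 for the single item but 32 for the batch, which is consistent with the batch being padded to the length of its longest item. If you pool token vectors yourself, padded positions should usually be left out. The following is a minimal sketch in plain Python, with nested lists standing in for tensors; `masked_mean` and the toy mask are hypothetical helpers for illustration and are not part of EduNLP.

```python
from typing import List


def masked_mean(token_vecs: List[List[float]], mask: List[int]) -> List[float]:
    """Average token vectors, skipping positions where mask is 0 (padding)."""
    if len(token_vecs) != len(mask):
        raise ValueError("token_vecs and mask must have the same length")
    kept = [vec for vec, keep in zip(token_vecs, mask) if keep]
    if not kept:
        raise ValueError("mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy example: 4 token positions with 2-dimensional vectors;
# the last two positions are padding and are ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]
mask = [1, 1, 0, 0]
pooled = masked_mean(tokens, mask)
print(pooled)
```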

Loading a public model with get_pretrained_i2v

[4]:
# Fetch the publicly released pretrained model
pretrained_dir = f"{BASE_DIR}/examples/test_model/data/data/bert"
i2v = get_pretrained_i2v("luna_bert", model_dir=pretrained_dir)
EduNLP, INFO model_path: ..\..\examples\test_model/data\data\bert\LUNABert
EduNLP, INFO Use pretrained t2v model luna_bert
downloader, INFO http://base.ustc.edu.cn/data/model_zoo/EduNLP/LUNABert.zip is saved as ..\..\examples\test_model/data\data\bert\LUNABert.zip
downloader, INFO file existed, skipped
Some weights of the model checkpoint at ..\..\examples\test_model/data\data\bert\LUNABert were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at ..\..\examples\test_model/data\data\bert\LUNABert and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[5]:
items = [
    "有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,\
    若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$",
    "已知圆$x^{2}+y^{2}-6 x=0$,过点(1,2)的直线被该圆所截得的弦的长度的最小值为"
]

i_vec, t_vec = i2v(items)
print(i_vec.shape)
print(t_vec.shape)
print()

# The item vectors and the per-token vectors can also be obtained separately
i_vec = i2v.infer_item_vector(items)
print(i_vec.shape)
t_vec = i2v.infer_token_vector(items)
print(t_vec.shape)
print()

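# Likewise, a single item can be vectorized.
# Note: `item[0]` is the dict defined in cell [3], not a string from `items`;
# pass items[0] to embed the raw text of the first item.
i_vec, t_vec = i2v(item[0])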
print(i_vec.shape)
print(t_vec.shape)
torch.Size([2, 768])
torch.Size([2, 32, 768])

torch.Size([2, 768])
torch.Size([2, 32, 768])

torch.Size([1, 768])
torch.Size([1, 2, 768])
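
A common use of the item vectors is comparing two items. The following is a minimal sketch of cosine similarity in plain Python; `cosine_similarity` is a hypothetical helper, not an EduNLP function, and the short toy vectors stand in for the 768-dimensional rows of `i_vec`.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors used in place of real item embeddings.
vec_a = [1.0, 0.0, 1.0]
vec_b = [1.0, 1.0, 0.0]
similarity = cosine_similarity(vec_a, vec_b)
print(similarity)
```

Assuming `i_vec` is a `torch.Tensor`, as the printed `torch.Size` values suggest, the two rows could be passed as `cosine_similarity(i_vec[0].tolist(), i_vec[1].tolist())`.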