{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 基于FastText的意图分类\n",
"\n",
"在这里我们训练一个FastText意图识别模型,并把训练好的模型存放在模型文件里。 意图识别实际上是文本分类任务,需要标注的数据:每一个句子需要对应的标签如闲聊型的,任务型的。但在这个项目中,我们并没有任何标注的数据,而且并不需要搭建闲聊机器人。所以这里搭建的FastText模型只是一个dummy模型,没有什么任何的作用。这个模块只是为了项目的完整性,也让大家明白FastText如何去使用,仅此而已。\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"fastText是一个快速文本分类算法,与基于神经网络的分类算法相比有两大优点:<br>\n",
"1、fastText在保持高精度的情况下加快了训练速度和测试速度<br>\n",
"2、fastText不需要预训练好的词向量,fastText会自己训练词向量<br>\n",
"3、fastText两个重要的优化:Hierarchical Softmax、N-gram<br>\n",
"————————————————<br>\n",
"fastText是一个用于文本分类和词向量表示的库,它能够把文本转化成连续的向量然后用于后续具体的语言任务,目前教程较少!\n",
"fastText中文文档\n",
"http://fasttext.apachecn.org/#/"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"import fasttext\n",
"#facebook推出的一款以n-GRAM模型座的文自然语言处理模块,精度与深度学习差不多,但运行效率要高的多,常用于意图识别和情感分析\n",
"import pandas as pd\n",
"from tqdm import tqdm\n",
"import numpy as np\n",
"import pickle"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>question</th>\n",
" <th>answer</th>\n",
" <th>question_after_preprocessing</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>买二份有没有少点呀</td>\n",
" <td>亲亲真的不好意思我们已经是优惠价了呢小本生意请亲谅解</td>\n",
" <td>[买]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>那就等你们处理喽</td>\n",
" <td>好的亲退了</td>\n",
" <td>[]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>那我不喜欢</td>\n",
" <td>颜色的话一般茶刀茶针和二合一的话都是红木檀和黑木檀哦</td>\n",
" <td>[]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>不是免运费</td>\n",
" <td>本店茶具订单满99包邮除宁夏青海内蒙古海南新疆西藏满39包邮</td>\n",
" <td>[运费]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>好吃吗</td>\n",
" <td>好吃的</td>\n",
" <td>[好吃]</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" question answer question_after_preprocessing\n",
"0 买二份有没有少点呀 亲亲真的不好意思我们已经是优惠价了呢小本生意请亲谅解 [买]\n",
"1 那就等你们处理喽 好的亲退了 []\n",
"2 那我不喜欢 颜色的话一般茶刀茶针和二合一的话都是红木檀和黑木檀哦 []\n",
"3 不是免运费 本店茶具订单满99包邮除宁夏青海内蒙古海南新疆西藏满39包邮 [运费]\n",
"4 好吃吗 好吃的 [好吃]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 读取数据:导入在preprocessor.ipynb中生成的data/question_answer_pares.pkl文件,并将其保存在变量QApares中\n",
"#打开之前文本预处理的pkl数据文件\n",
"with open('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/question_answer_pares.pkl','rb') as f:\n",
" QApares = pickle.load(f)\n",
"QApares.head()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(QApares)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# 这一文档的目的是为了用fasttext来进行意图识别,将问题分为任务型和闲聊型两类\n",
"# 训练数据集是任务型还是闲聊型本应由人工打上标签,但这并不是我们的重点。我们的重点是教会大家如何用fasttext来进行意图识别\n",
"# 所以这里我们为数据集随机打上0或1的标签,而不去管实际情况如何\n",
"\n",
"#fasttext的输入格式为:单词1 单词2 单词3 ... 单词n __label__标签号\n",
"#我们将问题数据集整理为fasttext需要的输入格式并为其随机打上标签并将结果保存在data/fasttext/fasttext_train.txt和data/fasttext/fasttext_test.txt中\n",
"with open('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/fasttext/fasttext_train.txt','w') as f:\n",
" for content in QApares[:int(0.7*len(QApares))].dropna().question_after_preprocessing: # 选择文本前70%作为训练集,.dropna()默认删除dataframe中所有带空值的行\n",
" f.write('%s __label__%d\\n' % (' '.join(content), np.random.randint(0,2)))#按指定格式写入txt文件,np.random.randint(0,2)生成0,1二分类随机整数,不包含2\n",
"with open('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/fasttext/fasttext_test.txt','w') as f:\n",
" for content in QApares[int(0.7*len(QApares)):].dropna().question_after_preprocessing:\n",
" f.write('%s __label__%d\\n' % (' '.join(content), np.random.randint(0,2)))"
]
},
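{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity-check sketch (not part of the original pipeline): print the first few lines of the generated\n",
"# training file to confirm they follow the fasttext format \"word1 word2 ... wordN __label__<id>\".\n",
"with open('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/fasttext/fasttext_train.txt') as f:\n",
"    for _ in range(3):\n",
"        print(f.readline().strip())"
]
},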
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"#使用fasttext进行意图识别,并将模型保存在classifier中\n",
"#https://fasttext.cc/docs/en/python-module.html\n",
"\n",
"classifier = fasttext.train_supervised('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/fasttext/fasttext_train.txt', #训练数据文件路径\n",
" label=\"__label__\", #类别前缀\n",
" dim=100, #向量维度\n",
" epoch=5, #训练轮次\n",
" lr=0.1, #学习率\n",
" wordNgrams=2, #n-gram个数\n",
" loss='softmax', #损失函数类型\n",
" thread=5, #线程个数, 每个线程处理输入数据的一段, 0号线程负责loss输出\n",
" verbose=True)"
]
},
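{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: as noted above, fastText learns its own word vectors during supervised training,\n",
"# so no pre-trained embeddings are needed. We can inspect the label set and a learned vector;\n",
"# '运费' is just an example token from the corpus, and the label order may vary.\n",
"print(classifier.labels)                         # the two dummy labels, e.g. ['__label__1', '__label__0']\n",
"print(classifier.get_word_vector('运费').shape)  # (100,) -- matches dim=100 set above"
]
},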
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Help on module fasttext.FastText in fasttext:\n",
"\n",
"NAME\n",
" fasttext.FastText\n",
"\n",
"DESCRIPTION\n",
" # Copyright (c) 2017-present, Facebook, Inc.\n",
" # All rights reserved.\n",
" #\n",
" # This source code is licensed under the MIT license found in the\n",
" # LICENSE file in the root directory of this source tree.\n",
"\n",
"FUNCTIONS\n",
" cbow(*kargs, **kwargs)\n",
" \n",
" eprint(*args, **kwargs)\n",
" \n",
" load_model(path)\n",
" Load a model given a filepath and return a model object.\n",
" \n",
" read_args(arg_list, arg_dict, arg_names, default_values)\n",
" \n",
" skipgram(*kargs, **kwargs)\n",
" \n",
" supervised(*kargs, **kwargs)\n",
" \n",
" tokenize(text)\n",
" Given a string of text, tokenize it and return a list of tokens\n",
" \n",
" train_supervised(*kargs, **kwargs)\n",
" Train a supervised model and return a model object.\n",
" \n",
" input must be a filepath. The input text does not need to be tokenized\n",
" as per the tokenize function, but it must be preprocessed and encoded\n",
" as UTF-8. You might want to consult standard preprocessing scripts such\n",
" as tokenizer.perl mentioned here: http://www.statmt.org/wmt07/baseline.html\n",
" \n",
" The input file must must contain at least one label per line. For an\n",
" example consult the example datasets which are part of the fastText\n",
" repository such as the dataset pulled by classification-example.sh.\n",
" \n",
" train_unsupervised(*kargs, **kwargs)\n",
" Train an unsupervised model and return a model object.\n",
" \n",
" input must be a filepath. The input text does not need to be tokenized\n",
" as per the tokenize function, but it must be preprocessed and encoded\n",
" as UTF-8. You might want to consult standard preprocessing scripts such\n",
" as tokenizer.perl mentioned here: http://www.statmt.org/wmt07/baseline.html\n",
" \n",
" The input field must not contain any labels or use the specified label prefix\n",
" unless it is ok for those words to be ignored. For an example consult the\n",
" dataset pulled by the example script word-vector-example.sh, which is\n",
" part of the fastText repository.\n",
"\n",
"DATA\n",
" BOW = '<'\n",
" EOS = '</s>'\n",
" EOW = '>'\n",
" absolute_import = _Feature((2, 5, 0, 'alpha', 1), (3, 0, 0, 'alpha', 0...\n",
" displayed_errors = {}\n",
" division = _Feature((2, 2, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0), 8192...\n",
" print_function = _Feature((2, 6, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0)...\n",
" unicode_literals = _Feature((2, 6, 0, 'alpha', 2), (3, 0, 0, 'alpha', ...\n",
" unsupervised_default = {'autotuneDuration': 300, 'autotuneMetric': 'f1...\n",
"\n",
"FILE\n",
" c:\\users\\cuishufeng-ghq\\appdata\\local\\programs\\python\\python37\\lib\\site-packages\\fasttext\\fasttext.py\n",
"\n",
"\n"
]
}
],
"source": [
"help(fasttext.FastText)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(('__label__1',), array([0.50106966]))"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#使用训练好的fasttext模型进行预测\n",
"classifier.predict('今天 月亮 真 圆 啊')"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(('__label__1',), array([0.50073278]))"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"classifier.predict('NLP 是 人工智能 皇冠 上 的 明珠')"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(300, 0.52, 0.52)"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#使用训练好的fasttext模型对测试集文件进行评估\n",
"classifier.test('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/fasttext/fasttext_test.txt')"
]
},
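{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: test() returns (number of samples, precision@1, recall@1).\n",
"# With random labels the scores hover around 0.5, as expected for a dummy model.\n",
"n_samples, precision, recall = classifier.test('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/fasttext/fasttext_test.txt')\n",
"print('samples: %d, P@1: %.2f, R@1: %.2f' % (n_samples, precision, recall))"
]
},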
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"#保存模型\n",
"classifier.save_model('model/fasttext.ftz')"
]
}
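,
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: downstream modules can reload the saved model and query intent.\n",
"# The path 'model/fasttext.ftz' is the one used in save_model above; the input must be space-tokenized.\n",
"loaded = fasttext.load_model('model/fasttext.ftz')\n",
"loaded.predict('运费 多少')"
]
}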
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}