project2/fasttext.ipynb, commit b422937e (parent 0a731e4b), authored Jun 23, 2021 by 20210509028: 1 changed file, +367 additions, -0 deletions.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### FastText-based intent classification\n",
"\n",
"Here we train a FastText intent-recognition model and save the trained model to a model file. Intent recognition is really a text-classification task and needs labeled data: every sentence requires a label such as chit-chat or task-oriented. In this project we have no labeled data at all, and we do not need to build a chit-chat bot, so the FastText model built here is only a dummy model with no practical use. This module exists for the completeness of the project and to show how FastText is used, nothing more.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"fastText is a fast text-classification algorithm. Compared with neural-network-based classifiers it has several advantages:<br>\n",
"1. fastText keeps accuracy high while greatly speeding up training and testing<br>\n",
"2. fastText does not require pretrained word vectors; it learns its own<br>\n",
"3. fastText relies on two key optimizations: hierarchical softmax and n-grams<br>\n",
"\n",
"fastText is a library for text classification and word-vector representation: it turns text into continuous vectors that downstream language tasks can consume. Tutorials for it are still relatively scarce.\n",
"Chinese fastText documentation:\n",
"http://fasttext.apachecn.org/#/"
]
},
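{
"cell_type": "markdown",
"metadata": {},
"source": [
"The n-gram idea above can be illustrated without fastText itself. The helper below is only an illustrative sketch, not part of the fastText API: it extracts the character n-grams that fastText hashes into its subword bucket table, where `<` and `>` are the word-boundary markers fastText adds internally."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def char_ngrams(word, n=3):\n",
"    # Wrap the word in fastText's boundary markers, then slide a window of size n\n",
"    token = '<' + word + '>'\n",
"    return [token[i:i+n] for i in range(len(token) - n + 1)]\n",
"\n",
"char_ngrams('where')  # ['<wh', 'whe', 'her', 'ere', 're>']"
]
},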
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"import fasttext\n",
"# fastText: an NLP library from Facebook built on the n-gram model; its accuracy is comparable to deep-learning models but it runs much faster, so it is commonly used for intent recognition and sentiment analysis\n",
"import pandas as pd\n",
"from tqdm import tqdm\n",
"import numpy as np\n",
"import pickle"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>question</th>\n",
" <th>answer</th>\n",
" <th>question_after_preprocessing</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>买二份有没有少点呀</td>\n",
" <td>亲亲真的不好意思我们已经是优惠价了呢小本生意请亲谅解</td>\n",
" <td>[买]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>那就等你们处理喽</td>\n",
" <td>好的亲退了</td>\n",
" <td>[]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>那我不喜欢</td>\n",
" <td>颜色的话一般茶刀茶针和二合一的话都是红木檀和黑木檀哦</td>\n",
" <td>[]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>不是免运费</td>\n",
" <td>本店茶具订单满99包邮除宁夏青海内蒙古海南新疆西藏满39包邮</td>\n",
" <td>[运费]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>好吃吗</td>\n",
" <td>好吃的</td>\n",
" <td>[好吃]</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" question answer question_after_preprocessing\n",
"0 买二份有没有少点呀 亲亲真的不好意思我们已经是优惠价了呢小本生意请亲谅解 [买]\n",
"1 那就等你们处理喽 好的亲退了 []\n",
"2 那我不喜欢 颜色的话一般茶刀茶针和二合一的话都是红木檀和黑木檀哦 []\n",
"3 不是免运费 本店茶具订单满99包邮除宁夏青海内蒙古海南新疆西藏满39包邮 [运费]\n",
"4 好吃吗 好吃的 [好吃]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Load the data: read the data/question_answer_pares.pkl file generated in preprocessor.ipynb into the variable QApares\n",
"# (this is the pkl file produced by the earlier text-preprocessing step)\n",
"with open('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/question_answer_pares.pkl','rb') as f:\n",
" QApares = pickle.load(f)\n",
"QApares.head()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(QApares)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# The goal of this notebook is to use fastText for intent recognition: classifying questions as task-oriented or chit-chat\n",
"# Whether a training example is task-oriented or chit-chat should really be labeled by hand, but that is not the focus here; the focus is showing how to use fastText for intent recognition\n",
"# So we assign random 0/1 labels to the dataset regardless of the actual intent\n",
"\n",
"# fastText's input format is: word1 word2 word3 ... wordn __label__<label>\n",
"# Convert the question dataset into that format, assign random labels, and save the results to data/fasttext/fasttext_train.txt and data/fasttext/fasttext_test.txt\n",
"with open('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/fasttext/fasttext_train.txt','w') as f:\n",
"    for content in QApares[:int(0.7*len(QApares))].dropna().question_after_preprocessing:  # first 70% of the data as the training set; .dropna() drops every row containing a missing value\n",
"        f.write('%s __label__%d\\n' % (' '.join(content), np.random.randint(0,2)))  # write in fastText's format; np.random.randint(0,2) draws a random label from {0, 1} (2 is excluded)\n",
"with open('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/fasttext/fasttext_test.txt','w') as f:\n",
"    for content in QApares[int(0.7*len(QApares)):].dropna().question_after_preprocessing:  # remaining 30% as the test set\n",
"        f.write('%s __label__%d\\n' % (' '.join(content), np.random.randint(0,2)))"
]
},
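{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each line of the generated files follows fastText's supervised input format, e.g. (a made-up sample line, not taken from the actual output):\n",
"\n",
"```\n",
"运费 __label__0\n",
"```"
]
},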
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"# Train the intent classifier with fastText and keep the trained model in `classifier`\n",
"# https://fasttext.cc/docs/en/python-module.html\n",
"\n",
"classifier = fasttext.train_supervised('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/fasttext/fasttext_train.txt',  # path to the training file\n",
"                                       label=\"__label__\",  # label prefix\n",
"                                       dim=100,            # embedding dimension\n",
"                                       epoch=5,            # number of training epochs\n",
"                                       lr=0.1,             # learning rate\n",
"                                       wordNgrams=2,       # max length of word n-grams\n",
"                                       loss='softmax',     # loss function\n",
"                                       thread=5,           # number of threads; each thread processes a slice of the input, and thread 0 reports the loss\n",
"                                       verbose=True)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Help on module fasttext.FastText in fasttext:\n",
"\n",
"NAME\n",
" fasttext.FastText\n",
"\n",
"DESCRIPTION\n",
" # Copyright (c) 2017-present, Facebook, Inc.\n",
" # All rights reserved.\n",
" #\n",
" # This source code is licensed under the MIT license found in the\n",
" # LICENSE file in the root directory of this source tree.\n",
"\n",
"FUNCTIONS\n",
" cbow(*kargs, **kwargs)\n",
" \n",
" eprint(*args, **kwargs)\n",
" \n",
" load_model(path)\n",
" Load a model given a filepath and return a model object.\n",
" \n",
" read_args(arg_list, arg_dict, arg_names, default_values)\n",
" \n",
" skipgram(*kargs, **kwargs)\n",
" \n",
" supervised(*kargs, **kwargs)\n",
" \n",
" tokenize(text)\n",
" Given a string of text, tokenize it and return a list of tokens\n",
" \n",
" train_supervised(*kargs, **kwargs)\n",
" Train a supervised model and return a model object.\n",
" \n",
" input must be a filepath. The input text does not need to be tokenized\n",
" as per the tokenize function, but it must be preprocessed and encoded\n",
" as UTF-8. You might want to consult standard preprocessing scripts such\n",
" as tokenizer.perl mentioned here: http://www.statmt.org/wmt07/baseline.html\n",
" \n",
" The input file must must contain at least one label per line. For an\n",
" example consult the example datasets which are part of the fastText\n",
" repository such as the dataset pulled by classification-example.sh.\n",
" \n",
" train_unsupervised(*kargs, **kwargs)\n",
" Train an unsupervised model and return a model object.\n",
" \n",
" input must be a filepath. The input text does not need to be tokenized\n",
" as per the tokenize function, but it must be preprocessed and encoded\n",
" as UTF-8. You might want to consult standard preprocessing scripts such\n",
" as tokenizer.perl mentioned here: http://www.statmt.org/wmt07/baseline.html\n",
" \n",
" The input field must not contain any labels or use the specified label prefix\n",
" unless it is ok for those words to be ignored. For an example consult the\n",
" dataset pulled by the example script word-vector-example.sh, which is\n",
" part of the fastText repository.\n",
"\n",
"DATA\n",
" BOW = '<'\n",
" EOS = '</s>'\n",
" EOW = '>'\n",
" absolute_import = _Feature((2, 5, 0, 'alpha', 1), (3, 0, 0, 'alpha', 0...\n",
" displayed_errors = {}\n",
" division = _Feature((2, 2, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0), 8192...\n",
" print_function = _Feature((2, 6, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0)...\n",
" unicode_literals = _Feature((2, 6, 0, 'alpha', 2), (3, 0, 0, 'alpha', ...\n",
" unsupervised_default = {'autotuneDuration': 300, 'autotuneMetric': 'f1...\n",
"\n",
"FILE\n",
" c:\\users\\cuishufeng-ghq\\appdata\\local\\programs\\python\\python37\\lib\\site-packages\\fasttext\\fasttext.py\n",
"\n",
"\n"
]
}
],
"source": [
"help(fasttext.FastText)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(('__label__1',), array([0.50106966]))"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Predict with the trained fastText model\n",
"classifier.predict('今天 月亮 真 圆 啊')"
]
},
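{
"cell_type": "markdown",
"metadata": {},
"source": [
"`predict` also accepts a `k` parameter to return the top-k labels with their probabilities; since our labels were assigned at random, both classes score close to 0.5."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Ask for both labels and their probabilities (k defaults to 1)\n",
"classifier.predict('今天 月亮 真 圆 啊', k=2)"
]
},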
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(('__label__1',), array([0.50073278]))"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"classifier.predict('NLP 是 人工智能 皇冠 上 的 明珠')"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(300, 0.52, 0.52)"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Evaluate the trained fastText model on the test-set file\n",
"# test() returns (number of examples, precision@1, recall@1)\n",
"classifier.test('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/fasttext/fasttext_test.txt')"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"# Save the trained model to disk\n",
"classifier.save_model('model/fasttext.ftz')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}