{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 基于FastText的意图分类\n",
"\n",
"在这里我们训练一个FastText意图识别模型,并把训练好的模型存放在模型文件里。 意图识别实际上是文本分类任务,需要标注的数据:每一个句子需要对应的标签如闲聊型的,任务型的。但在这个项目中,我们并没有任何标注的数据,而且并不需要搭建闲聊机器人。所以这里搭建的FastText模型只是一个dummy模型,没有什么任何的作用。这个模块只是为了项目的完整性,也让大家明白FastText如何去使用,仅此而已。\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"fastText是一个快速文本分类算法,与基于神经网络的分类算法相比有两大优点:<br>\n",
"1、fastText在保持高精度的情况下加快了训练速度和测试速度<br>\n",
"2、fastText不需要预训练好的词向量,fastText会自己训练词向量<br>\n",
"3、fastText两个重要的优化:Hierarchical Softmax、N-gram<br>\n",
"————————————————<br>\n",
"fastText是一个用于文本分类和词向量表示的库,它能够把文本转化成连续的向量然后用于后续具体的语言任务,目前教程较少!\n",
"fastText中文文档\n",
"http://fasttext.apachecn.org/#/"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"import fasttext\n",
"#facebook推出的一款以n-GRAM模型座的文自然语言处理模块,精度与深度学习差不多,但运行效率要高的多,常用于意图识别和情感分析\n",
"import pandas as pd\n",
"from tqdm import tqdm\n",
"import numpy as np\n",
"import pickle"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>question</th>\n",
" <th>answer</th>\n",
" <th>question_after_preprocessing</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>买二份有没有少点呀</td>\n",
" <td>亲亲真的不好意思我们已经是优惠价了呢小本生意请亲谅解</td>\n",
" <td>[买]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>那就等你们处理喽</td>\n",
" <td>好的亲退了</td>\n",
" <td>[]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>那我不喜欢</td>\n",
" <td>颜色的话一般茶刀茶针和二合一的话都是红木檀和黑木檀哦</td>\n",
" <td>[]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>不是免运费</td>\n",
" <td>本店茶具订单满99包邮除宁夏青海内蒙古海南新疆西藏满39包邮</td>\n",
" <td>[运费]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>好吃吗</td>\n",
" <td>好吃的</td>\n",
" <td>[好吃]</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" question answer question_after_preprocessing\n",
"0 买二份有没有少点呀 亲亲真的不好意思我们已经是优惠价了呢小本生意请亲谅解 [买]\n",
"1 那就等你们处理喽 好的亲退了 []\n",
"2 那我不喜欢 颜色的话一般茶刀茶针和二合一的话都是红木檀和黑木檀哦 []\n",
"3 不是免运费 本店茶具订单满99包邮除宁夏青海内蒙古海南新疆西藏满39包邮 [运费]\n",
"4 好吃吗 好吃的 [好吃]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 读取数据:导入在preprocessor.ipynb中生成的data/question_answer_pares.pkl文件,并将其保存在变量QApares中\n",
"#打开之前文本预处理的pkl数据文件\n",
"with open('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/question_answer_pares.pkl','rb') as f:\n",
" QApares = pickle.load(f)\n",
"QApares.head()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(QApares)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# 这一文档的目的是为了用fasttext来进行意图识别,将问题分为任务型和闲聊型两类\n",
"# 训练数据集是任务型还是闲聊型本应由人工打上标签,但这并不是我们的重点。我们的重点是教会大家如何用fasttext来进行意图识别\n",
"# 所以这里我们为数据集随机打上0或1的标签,而不去管实际情况如何\n",
"\n",
"#fasttext的输入格式为:单词1 单词2 单词3 ... 单词n __label__标签号\n",
"#我们将问题数据集整理为fasttext需要的输入格式并为其随机打上标签并将结果保存在data/fasttext/fasttext_train.txt和data/fasttext/fasttext_test.txt中\n",
"with open('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/fasttext/fasttext_train.txt','w') as f:\n",
" for content in QApares[:int(0.7*len(QApares))].dropna().question_after_preprocessing: # 选择文本前70%作为训练集,.dropna()默认删除dataframe中所有带空值的行\n",
" f.write('%s __label__%d\\n' % (' '.join(content), np.random.randint(0,2)))#按指定格式写入txt文件,np.random.randint(0,2)生成0,1二分类随机整数,不包含2\n",
"with open('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/fasttext/fasttext_test.txt','w') as f:\n",
" for content in QApares[int(0.7*len(QApares)):].dropna().question_after_preprocessing:\n",
" f.write('%s __label__%d\\n' % (' '.join(content), np.random.randint(0,2)))"
]
},
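{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity-check sketch (not part of the original pipeline): print the first few lines of the generated\n",
"# training file to confirm they follow the fasttext format \"word1 word2 ... wordN __label__<id>\".\n",
"with open('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/fasttext/fasttext_train.txt') as f:\n",
"    for _ in range(3):\n",
"        print(f.readline().strip())"
]
},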
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"#使用fasttext进行意图识别,并将模型保存在classifier中\n",
"#https://fasttext.cc/docs/en/python-module.html\n",
"\n",
"classifier = fasttext.train_supervised('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/fasttext/fasttext_train.txt', #训练数据文件路径\n",
" label=\"__label__\", #类别前缀\n",
" dim=100, #向量维度\n",
" epoch=5, #训练轮次\n",
" lr=0.1, #学习率\n",
" wordNgrams=2, #n-gram个数\n",
" loss='softmax', #损失函数类型\n",
" thread=5, #线程个数, 每个线程处理输入数据的一段, 0号线程负责loss输出\n",
" verbose=True)"
]
},
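{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: as noted above, fastText learns its own word vectors during supervised training,\n",
"# so no pre-trained embeddings are needed. We can inspect the label set and a learned vector;\n",
"# '运费' is just an example token from the corpus, and the label order may vary.\n",
"print(classifier.labels)                         # the two dummy labels, e.g. ['__label__1', '__label__0']\n",
"print(classifier.get_word_vector('运费').shape)  # (100,) -- matches dim=100 set above"
]
},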
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Help on module fasttext.FastText in fasttext:\n",
"\n",
"NAME\n",
" fasttext.FastText\n",
"\n",
"DESCRIPTION\n",
" # Copyright (c) 2017-present, Facebook, Inc.\n",
" # All rights reserved.\n",
" #\n",
" # This source code is licensed under the MIT license found in the\n",
" # LICENSE file in the root directory of this source tree.\n",
"\n",
"FUNCTIONS\n",
" cbow(*kargs, **kwargs)\n",
" \n",
" eprint(*args, **kwargs)\n",
" \n",
" load_model(path)\n",
" Load a model given a filepath and return a model object.\n",
" \n",
" read_args(arg_list, arg_dict, arg_names, default_values)\n",
" \n",
" skipgram(*kargs, **kwargs)\n",
" \n",
" supervised(*kargs, **kwargs)\n",
" \n",
" tokenize(text)\n",
" Given a string of text, tokenize it and return a list of tokens\n",
" \n",
" train_supervised(*kargs, **kwargs)\n",
" Train a supervised model and return a model object.\n",
" \n",
" input must be a filepath. The input text does not need to be tokenized\n",
" as per the tokenize function, but it must be preprocessed and encoded\n",
" as UTF-8. You might want to consult standard preprocessing scripts such\n",
" as tokenizer.perl mentioned here: http://www.statmt.org/wmt07/baseline.html\n",
" \n",
" The input file must must contain at least one label per line. For an\n",
" example consult the example datasets which are part of the fastText\n",
" repository such as the dataset pulled by classification-example.sh.\n",
" \n",
" train_unsupervised(*kargs, **kwargs)\n",
" Train an unsupervised model and return a model object.\n",
" \n",
" input must be a filepath. The input text does not need to be tokenized\n",
" as per the tokenize function, but it must be preprocessed and encoded\n",
" as UTF-8. You might want to consult standard preprocessing scripts such\n",
" as tokenizer.perl mentioned here: http://www.statmt.org/wmt07/baseline.html\n",
" \n",
" The input field must not contain any labels or use the specified label prefix\n",
" unless it is ok for those words to be ignored. For an example consult the\n",
" dataset pulled by the example script word-vector-example.sh, which is\n",
" part of the fastText repository.\n",
"\n",
"DATA\n",
" BOW = '<'\n",
" EOS = '</s>'\n",
" EOW = '>'\n",
" absolute_import = _Feature((2, 5, 0, 'alpha', 1), (3, 0, 0, 'alpha', 0...\n",
" displayed_errors = {}\n",
" division = _Feature((2, 2, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0), 8192...\n",
" print_function = _Feature((2, 6, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0)...\n",
" unicode_literals = _Feature((2, 6, 0, 'alpha', 2), (3, 0, 0, 'alpha', ...\n",
" unsupervised_default = {'autotuneDuration': 300, 'autotuneMetric': 'f1...\n",
"\n",
"FILE\n",
" c:\\users\\cuishufeng-ghq\\appdata\\local\\programs\\python\\python37\\lib\\site-packages\\fasttext\\fasttext.py\n",
"\n",
"\n"
]
}
],
"source": [
"help(fasttext.FastText)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(('__label__1',), array([0.50106966]))"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#使用训练好的fasttext模型进行预测\n",
"classifier.predict('今天 月亮 真 圆 啊')"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(('__label__1',), array([0.50073278]))"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"classifier.predict('NLP 是 人工智能 皇冠 上 的 明珠')"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(300, 0.52, 0.52)"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#使用训练好的fasttext模型对测试集文件进行评估\n",
"classifier.test('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/fasttext/fasttext_test.txt')"
]
},
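{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: test() returns (number of samples, precision@1, recall@1).\n",
"# With random labels the scores hover around 0.5, as expected for a dummy model.\n",
"n_samples, precision, recall = classifier.test('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/fasttext/fasttext_test.txt')\n",
"print('samples: %d, P@1: %.2f, R@1: %.2f' % (n_samples, precision, recall))"
]
},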
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"#保存模型\n",
"classifier.save_model('model/fasttext.ftz')"
]
}
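,
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: downstream modules can reload the saved model and query intent.\n",
"# The path 'model/fasttext.ftz' is the one used in save_model above; the input must be space-tokenized.\n",
"loaded = fasttext.load_model('model/fasttext.ftz')\n",
"loaded.predict('运费 多少')"
]
}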
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}