fasttext.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 基于FastText的意图分类\n",
    "\n",
    "在这里我们训练一个FastText意图识别模型，并把训练好的模型存放在模型文件里。 意图识别实际上是文本分类任务，需要标注的数据：每一个句子需要对应的标签如闲聊型的，任务型的。但在这个项目中，我们并没有任何标注的数据，而且并不需要搭建闲聊机器人。所以这里搭建的FastText模型只是一个dummy模型，没有什么任何的作用。这个模块只是为了项目的完整性，也让大家明白FastText如何去使用，仅此而已。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "fastText是一个快速文本分类算法，与基于神经网络的分类算法相比有两大优点：<br>\n",
    "1、fastText在保持高精度的情况下加快了训练速度和测试速度<br>\n",
    "2、fastText不需要预训练好的词向量，fastText会自己训练词向量<br>\n",
    "3、fastText两个重要的优化：Hierarchical Softmax、N-gram<br>\n",
    "————————————————<br>\n",
    "fastText是一个用于文本分类和词向量表示的库，它能够把文本转化成连续的向量然后用于后续具体的语言任务，目前教程较少！\n",
    "fastText中文文档\n",
    "http://fasttext.apachecn.org/#/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "import fasttext\n",
    "#facebook推出的一款以n-GRAM模型座的文自然语言处理模块,精度与深度学习差不多,但运行效率要高的多,常用于意图识别和情感分析\n",
    "import pandas as pd\n",
    "from tqdm import tqdm\n",
    "import numpy as np\n",
    "import pickle"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>question</th>\n",
       "      <th>answer</th>\n",
       "      <th>question_after_preprocessing</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>买二份有没有少点呀</td>\n",
       "      <td>亲亲真的不好意思我们已经是优惠价了呢小本生意请亲谅解</td>\n",
       "      <td>[买]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>那就等你们处理喽</td>\n",
       "      <td>好的亲退了</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>那我不喜欢</td>\n",
       "      <td>颜色的话一般茶刀茶针和二合一的话都是红木檀和黑木檀哦</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>不是免运费</td>\n",
       "      <td>本店茶具订单满99包邮除宁夏青海内蒙古海南新疆西藏满39包邮</td>\n",
       "      <td>[运费]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>好吃吗</td>\n",
       "      <td>好吃的</td>\n",
       "      <td>[好吃]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    question                          answer question_after_preprocessing\n",
       "0  买二份有没有少点呀      亲亲真的不好意思我们已经是优惠价了呢小本生意请亲谅解                          [买]\n",
       "1   那就等你们处理喽                           好的亲退了                           []\n",
       "2      那我不喜欢      颜色的话一般茶刀茶针和二合一的话都是红木檀和黑木檀哦                           []\n",
       "3      不是免运费  本店茶具订单满99包邮除宁夏青海内蒙古海南新疆西藏满39包邮                         [运费]\n",
       "4        好吃吗                             好吃的                         [好吃]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 读取数据：导入在preprocessor.ipynb中生成的data/question_answer_pares.pkl文件，并将其保存在变量QApares中\n",
    "#打开之前文本预处理的pkl数据文件\n",
    "with open('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/question_answer_pares.pkl','rb') as f:\n",
    "    QApares = pickle.load(f)\n",
    "QApares.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.frame.DataFrame"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(QApares)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 这一文档的目的是为了用fasttext来进行意图识别，将问题分为任务型和闲聊型两类\n",
    "# 训练数据集是任务型还是闲聊型本应由人工打上标签，但这并不是我们的重点。我们的重点是教会大家如何用fasttext来进行意图识别\n",
    "# 所以这里我们为数据集随机打上0或1的标签，而不去管实际情况如何\n",
    "\n",
    "#fasttext的输入格式为：单词1 单词2 单词3 ... 单词n __label__标签号\n",
    "#我们将问题数据集整理为fasttext需要的输入格式并为其随机打上标签并将结果保存在data/fasttext/fasttext_train.txt和data/fasttext/fasttext_test.txt中\n",
    "with open('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/fasttext/fasttext_train.txt','w') as f:\n",
    "    for content in QApares[:int(0.7*len(QApares))].dropna().question_after_preprocessing:  # 选择文本前70%作为训练集,.dropna()默认删除dataframe中所有带空值的行\n",
    "        f.write('%s __label__%d\\n' % (' '.join(content), np.random.randint(0,2)))#按指定格式写入txt文件,np.random.randint(0,2)生成0,1二分类随机整数,不包含2\n",
    "with open('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/fasttext/fasttext_test.txt','w') as f:\n",
    "    for content in QApares[int(0.7*len(QApares)):].dropna().question_after_preprocessing:\n",
    "        f.write('%s __label__%d\\n' % (' '.join(content), np.random.randint(0,2)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "#使用fasttext进行意图识别，并将模型保存在classifier中\n",
    "#https://fasttext.cc/docs/en/python-module.html\n",
    "\n",
    "classifier = fasttext.train_supervised('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/fasttext/fasttext_train.txt',     #训练数据文件路径\n",
    "                                       label=\"__label__\",      #类别前缀\n",
    "                                       dim=100,       #向量维度\n",
    "                                       epoch=5,       #训练轮次\n",
    "                                       lr=0.1,        #学习率\n",
    "                                       wordNgrams=2,      #n-gram个数\n",
    "                                       loss='softmax',    #损失函数类型\n",
    "                                       thread=5,          #线程个数, 每个线程处理输入数据的一段, 0号线程负责loss输出\n",
    "                                       verbose=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Help on module fasttext.FastText in fasttext:\n",
      "\n",
      "NAME\n",
      "    fasttext.FastText\n",
      "\n",
      "DESCRIPTION\n",
      "    # Copyright (c) 2017-present, Facebook, Inc.\n",
      "    # All rights reserved.\n",
      "    #\n",
      "    # This source code is licensed under the MIT license found in the\n",
      "    # LICENSE file in the root directory of this source tree.\n",
      "\n",
      "FUNCTIONS\n",
      "    cbow(*kargs, **kwargs)\n",
      "    \n",
      "    eprint(*args, **kwargs)\n",
      "    \n",
      "    load_model(path)\n",
      "        Load a model given a filepath and return a model object.\n",
      "    \n",
      "    read_args(arg_list, arg_dict, arg_names, default_values)\n",
      "    \n",
      "    skipgram(*kargs, **kwargs)\n",
      "    \n",
      "    supervised(*kargs, **kwargs)\n",
      "    \n",
      "    tokenize(text)\n",
      "        Given a string of text, tokenize it and return a list of tokens\n",
      "    \n",
      "    train_supervised(*kargs, **kwargs)\n",
      "        Train a supervised model and return a model object.\n",
      "        \n",
      "        input must be a filepath. The input text does not need to be tokenized\n",
      "        as per the tokenize function, but it must be preprocessed and encoded\n",
      "        as UTF-8. You might want to consult standard preprocessing scripts such\n",
      "        as tokenizer.perl mentioned here: http://www.statmt.org/wmt07/baseline.html\n",
      "        \n",
      "        The input file must must contain at least one label per line. For an\n",
      "        example consult the example datasets which are part of the fastText\n",
      "        repository such as the dataset pulled by classification-example.sh.\n",
      "    \n",
      "    train_unsupervised(*kargs, **kwargs)\n",
      "        Train an unsupervised model and return a model object.\n",
      "        \n",
      "        input must be a filepath. The input text does not need to be tokenized\n",
      "        as per the tokenize function, but it must be preprocessed and encoded\n",
      "        as UTF-8. You might want to consult standard preprocessing scripts such\n",
      "        as tokenizer.perl mentioned here: http://www.statmt.org/wmt07/baseline.html\n",
      "        \n",
      "        The input field must not contain any labels or use the specified label prefix\n",
      "        unless it is ok for those words to be ignored. For an example consult the\n",
      "        dataset pulled by the example script word-vector-example.sh, which is\n",
      "        part of the fastText repository.\n",
      "\n",
      "DATA\n",
      "    BOW = '<'\n",
      "    EOS = '</s>'\n",
      "    EOW = '>'\n",
      "    absolute_import = _Feature((2, 5, 0, 'alpha', 1), (3, 0, 0, 'alpha', 0...\n",
      "    displayed_errors = {}\n",
      "    division = _Feature((2, 2, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0), 8192...\n",
      "    print_function = _Feature((2, 6, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0)...\n",
      "    unicode_literals = _Feature((2, 6, 0, 'alpha', 2), (3, 0, 0, 'alpha', ...\n",
      "    unsupervised_default = {'autotuneDuration': 300, 'autotuneMetric': 'f1...\n",
      "\n",
      "FILE\n",
      "    c:\\users\\cuishufeng-ghq\\appdata\\local\\programs\\python\\python37\\lib\\site-packages\\fasttext\\fasttext.py\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "help(fasttext.FastText)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(('__label__1',), array([0.50106966]))"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#使用训练好的fasttext模型进行预测\n",
    "classifier.predict('今天 月亮 真 圆 啊')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(('__label__1',), array([0.50073278]))"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "classifier.predict('NLP 是 人工智能 皇冠 上 的 明珠')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(300, 0.52, 0.52)"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#使用训练好的fasttext模型对测试集文件进行评估\n",
    "classifier.test('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/fasttext/fasttext_test.txt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "#保存模型\n",
    "classifier.save_model('model/fasttext.ftz')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}