{
 "cells": [
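  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Intent Classification with FastText\n",
    "\n",
    "Here we train a FastText intent-recognition model and save the trained model to a model file. Intent recognition is in fact a text classification task and needs labeled data: every sentence must carry a label such as chit-chat or task-oriented. In this project, however, we have no labeled data at all, and we do not need to build a chit-chat bot. The FastText model built here is therefore only a dummy model with no practical use. This module is included only to keep the project complete and to show how FastText is used, nothing more.\n"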
   ]
  },
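  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "fastText is a fast text classification algorithm. Compared with neural-network-based classifiers it has two main advantages (items 1 and 2), which rest on the optimizations in item 3:<br>\n",
    "1. fastText speeds up both training and testing while maintaining high accuracy<br>\n",
    "2. fastText does not need pretrained word vectors; it trains its own word vectors<br>\n",
    "3. fastText relies on two important optimizations: Hierarchical Softmax and N-gram features<br>\n",
    "<hr>\n",
    "fastText is a library for text classification and word-vector representation. It turns text into continuous vectors that can then be used for downstream language tasks. Tutorials for it are still relatively scarce.<br>\n",
    "fastText Chinese documentation:\n",
    "http://fasttext.apachecn.org/#/"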
   ]
  },
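  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "import fasttext\n",
    "# fasttext: an NLP library released by Facebook and built on n-gram features; its accuracy is close to that of deep learning models while it runs much faster, and it is commonly used for intent recognition and sentiment analysis\n",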
    "import pandas as pd\n",
    "from tqdm import tqdm\n",
    "import numpy as np\n",
    "import pickle"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>question</th>\n",
       "      <th>answer</th>\n",
       "      <th>question_after_preprocessing</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>买二份有没有少点呀</td>\n",
       "      <td>亲亲真的不好意思我们已经是优惠价了呢小本生意请亲谅解</td>\n",
       "      <td>[买]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>那就等你们处理喽</td>\n",
       "      <td>好的亲退了</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>那我不喜欢</td>\n",
       "      <td>颜色的话一般茶刀茶针和二合一的话都是红木檀和黑木檀哦</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>不是免运费</td>\n",
       "      <td>本店茶具订单满99包邮除宁夏青海内蒙古海南新疆西藏满39包邮</td>\n",
       "      <td>[运费]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>好吃吗</td>\n",
       "      <td>好吃的</td>\n",
       "      <td>[好吃]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    question                          answer question_after_preprocessing\n",
       "0  买二份有没有少点呀      亲亲真的不好意思我们已经是优惠价了呢小本生意请亲谅解                          [买]\n",
       "1   那就等你们处理喽                           好的亲退了                           []\n",
       "2      那我不喜欢      颜色的话一般茶刀茶针和二合一的话都是红木檀和黑木檀哦                           []\n",
       "3      不是免运费  本店茶具订单满99包邮除宁夏青海内蒙古海南新疆西藏满39包邮                         [运费]\n",
       "4        好吃吗                             好吃的                         [好吃]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
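   ],
   "source": [
    "# Load the data: read the data/question_answer_pares.pkl file generated in preprocessor.ipynb and store it in the variable QApares\n",
    "# Open the pickle file produced by the earlier text preprocessing step\n",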
    "with open('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/question_answer_pares.pkl','rb') as f:\n",
    "    QApares = pickle.load(f)\n",
    "QApares.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.frame.DataFrame"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(QApares)"
   ]
  },
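  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The goal of this notebook is to use fasttext for intent recognition, splitting questions into two classes: task-oriented and chit-chat\n",
    "# Whether a training example is task-oriented or chit-chat should be labeled by hand, but that is not the focus here; the focus is showing how to use fasttext for intent recognition\n",
    "# So we give every example a random label of 0 or 1, regardless of its actual intent\n",
    "\n",
    "# fasttext input format: word1 word2 word3 ... wordn __label__N, where N is the label id\n",
    "# Convert the questions into that format, attach random labels, and save the results to data/fasttext/fasttext_train.txt and data/fasttext/fasttext_test.txt\n",
    "# encoding='utf-8' is set explicitly because fasttext requires UTF-8 input and the platform default encoding may differ\n",
    "with open('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/fasttext/fasttext_train.txt','w', encoding='utf-8') as f:\n",
    "    for content in QApares[:int(0.7*len(QApares))].dropna().question_after_preprocessing:  # the first 70% of rows form the training set; .dropna() drops every row that contains a missing value\n",
    "        f.write('%s __label__%d\\n' % (' '.join(content), np.random.randint(0,2)))  # write one line in the fasttext format; np.random.randint(0,2) draws a random integer 0 or 1 (2 is excluded)\n",
    "with open('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/fasttext/fasttext_test.txt','w', encoding='utf-8') as f:\n",
    "    for content in QApares[int(0.7*len(QApares)):].dropna().question_after_preprocessing:  # the remaining 30% of rows form the test set\n",
    "        f.write('%s __label__%d\\n' % (' '.join(content), np.random.randint(0,2)))"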
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Train the fasttext intent classifier and keep the resulting model in classifier\n",
    "# https://fasttext.cc/docs/en/python-module.html\n",
    "\n",
    "classifier = fasttext.train_supervised('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/fasttext/fasttext_train.txt',     # path to the training data file\n",
    "                                       label=\"__label__\",      # label prefix\n",
    "                                       dim=100,       # dimension of the word vectors\n",
    "                                       epoch=5,       # number of training epochs\n",
    "                                       lr=0.1,        # learning rate\n",
    "                                       wordNgrams=2,      # maximum length of word n-grams\n",
    "                                       loss='softmax',    # loss function\n",
    "                                       thread=5,          # number of threads; each thread processes one segment of the input, and thread 0 reports the loss\n",
    "                                       verbose=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Help on module fasttext.FastText in fasttext:\n",
      "\n",
      "NAME\n",
      "    fasttext.FastText\n",
      "\n",
      "DESCRIPTION\n",
      "    # Copyright (c) 2017-present, Facebook, Inc.\n",
      "    # All rights reserved.\n",
      "    #\n",
      "    # This source code is licensed under the MIT license found in the\n",
      "    # LICENSE file in the root directory of this source tree.\n",
      "\n",
      "FUNCTIONS\n",
      "    cbow(*kargs, **kwargs)\n",
      "    \n",
      "    eprint(*args, **kwargs)\n",
      "    \n",
      "    load_model(path)\n",
      "        Load a model given a filepath and return a model object.\n",
      "    \n",
      "    read_args(arg_list, arg_dict, arg_names, default_values)\n",
      "    \n",
      "    skipgram(*kargs, **kwargs)\n",
      "    \n",
      "    supervised(*kargs, **kwargs)\n",
      "    \n",
      "    tokenize(text)\n",
      "        Given a string of text, tokenize it and return a list of tokens\n",
      "    \n",
      "    train_supervised(*kargs, **kwargs)\n",
      "        Train a supervised model and return a model object.\n",
      "        \n",
      "        input must be a filepath. The input text does not need to be tokenized\n",
      "        as per the tokenize function, but it must be preprocessed and encoded\n",
      "        as UTF-8. You might want to consult standard preprocessing scripts such\n",
      "        as tokenizer.perl mentioned here: http://www.statmt.org/wmt07/baseline.html\n",
      "        \n",
      "        The input file must must contain at least one label per line. For an\n",
      "        example consult the example datasets which are part of the fastText\n",
      "        repository such as the dataset pulled by classification-example.sh.\n",
      "    \n",
      "    train_unsupervised(*kargs, **kwargs)\n",
      "        Train an unsupervised model and return a model object.\n",
      "        \n",
      "        input must be a filepath. The input text does not need to be tokenized\n",
      "        as per the tokenize function, but it must be preprocessed and encoded\n",
      "        as UTF-8. You might want to consult standard preprocessing scripts such\n",
      "        as tokenizer.perl mentioned here: http://www.statmt.org/wmt07/baseline.html\n",
      "        \n",
      "        The input field must not contain any labels or use the specified label prefix\n",
      "        unless it is ok for those words to be ignored. For an example consult the\n",
      "        dataset pulled by the example script word-vector-example.sh, which is\n",
      "        part of the fastText repository.\n",
      "\n",
      "DATA\n",
      "    BOW = '<'\n",
      "    EOS = '</s>'\n",
      "    EOW = '>'\n",
      "    absolute_import = _Feature((2, 5, 0, 'alpha', 1), (3, 0, 0, 'alpha', 0...\n",
      "    displayed_errors = {}\n",
      "    division = _Feature((2, 2, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0), 8192...\n",
      "    print_function = _Feature((2, 6, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0)...\n",
      "    unicode_literals = _Feature((2, 6, 0, 'alpha', 2), (3, 0, 0, 'alpha', ...\n",
      "    unsupervised_default = {'autotuneDuration': 300, 'autotuneMetric': 'f1...\n",
      "\n",
      "FILE\n",
      "    c:\\users\\cuishufeng-ghq\\appdata\\local\\programs\\python\\python37\\lib\\site-packages\\fasttext\\fasttext.py\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "help(fasttext.FastText)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(('__label__1',), array([0.50106966]))"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Predict with the trained fasttext model; the input must be space-separated tokens, as in the training file\n",
    "classifier.predict('今天 月亮 真 圆 啊')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(('__label__1',), array([0.50073278]))"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "classifier.predict('NLP 是 人工智能 皇冠 上 的 明珠')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(300, 0.52, 0.52)"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Evaluate the trained fasttext model on the test file; test() returns (number of samples, precision@1, recall@1)\n",
    "classifier.test('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/fasttext/fasttext_test.txt')"
   ]
  },
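  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save the model (by convention .ftz denotes a quantized fasttext model and .bin an unquantized one; this model is not quantized)\n",
    "classifier.save_model('model/fasttext.ftz')"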
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}