{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 搭建一个简单的问答系统 (Building a Simple QA System)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "本次项目的目标是搭建一个基于检索式的简易的问答系统,这是一个最经典的方法也是最有效的方法。  \n",
    "\n",
    "```不要单独创建一个文件,所有的都在这里面编写,不要试图改已经有的函数名字 (但可以根据需求自己定义新的函数)```\n",
    "\n",
    "```预估完成时间```: 5-10小时"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 检索式的问答系统\n",
    "问答系统所需要的数据已经提供,对于每一个问题都可以找得到相应的答案,所以可以理解为每一个样本数据是 ``<问题、答案>``。 那系统的核心是当用户输入一个问题的时候,首先要找到跟这个问题最相近的已经存储在库里的问题,然后直接返回相应的答案即可(但实际上也可以抽取其中的实体或者关键词)。 举一个简单的例子:\n",
    "\n",
    "假设我们的库里面已有存在以下几个<问题,答案>:\n",
    "- <\"贪心学院主要做什么方面的业务?”, “他们主要做人工智能方面的教育”>\n",
    "- <“国内有哪些做人工智能教育的公司?”, “贪心学院”>\n",
    "- <\"人工智能和机器学习的关系什么?\", \"其实机器学习是人工智能的一个范畴,很多人工智能的应用要基于机器学习的技术\">\n",
    "- <\"人工智能最核心的语言是什么?\", ”Python“>\n",
    "- .....\n",
    "\n",
    "假设一个用户往系统中输入了问题 “贪心学院是做什么的?”, 那这时候系统先去匹配最相近的“已经存在库里的”问题。 那在这里很显然是 “贪心学院是做什么的”和“贪心学院主要做什么方面的业务?”是最相近的。 所以当我们定位到这个问题之后,直接返回它的答案 “他们主要做人工智能方面的教育”就可以了。 所以这里的核心问题可以归结为计算两个问句(query)之间的相似度。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 项目中涉及到的任务描述\n",
    "问答系统看似简单,但其中涉及到的内容比较多。 在这里先做一个简单的解释,总体来讲,我们即将要搭建的模块包括:\n",
    "\n",
    "- 文本的读取: 需要从相应的文件里读取```(问题,答案)```\n",
    "- 文本预处理: 清洗文本很重要,需要涉及到```停用词过滤```等工作\n",
    "- 文本的表示: 如果表示一个句子是非常核心的问题,这里会涉及到```tf-idf```, ```Glove```以及```BERT Embedding```\n",
    "- 文本相似度匹配: 在基于检索式系统中一个核心的部分是计算文本之间的```相似度```,从而选择相似度最高的问题然后返回这些问题的答案\n",
    "- 倒排表: 为了加速搜索速度,我们需要设计```倒排表```来存储每一个词与出现的文本\n",
    "- 词义匹配:直接使用倒排表会忽略到一些意思上相近但不完全一样的单词,我们需要做这部分的处理。我们需要提前构建好```相似的单词```然后搜索阶段使用\n",
    "- 拼写纠错:我们不能保证用户输入的准确,所以第一步需要做用户输入检查,如果发现用户拼错了,我们需要及时在后台改正,然后按照修改后的在库里面搜索\n",
    "- 文档的排序: 最后返回结果的排序根据文档之间```余弦相似度```有关,同时也跟倒排表中匹配的单词有关\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 项目中需要的数据:\n",
    "1. ```dev-v2.0.json```: 这个数据包含了问题和答案的pair, 但是以JSON格式存在,需要编写parser来提取出里面的问题和答案。 \n",
    "2. ```glove.6B```: 这个文件需要从网上下载,下载地址为:https://nlp.stanford.edu/projects/glove/, 请使用d=200的词向量\n",
    "3. ```spell-errors.txt``` 这个文件主要用来编写拼写纠错模块。 文件中第一列为正确的单词,之后列出来的单词都是常见的错误写法。 但这里需要注意的一点是我们没有给出他们之间的概率,也就是p(错误|正确),所以我们可以认为每一种类型的错误都是```同等概率```\n",
    "4. ```vocab.txt``` 这里列了几万个英文常见的单词,可以用这个词库来验证是否有些单词被拼错\n",
    "5. ```testdata.txt``` 这里搜集了一些测试数据,可以用来测试自己的spell corrector。这个文件只是用来测试自己的程序。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在本次项目中,你将会用到以下几个工具:\n",
    "- ```sklearn```。具体安装请见:http://scikit-learn.org/stable/install.html  sklearn包含了各类机器学习算法和数据处理工具,包括本项目需要使用的词袋模型,均可以在sklearn工具包中找得到。 \n",
    "- ```jieba```,用来做分词。具体使用方法请见 https://github.com/fxsjy/jieba\n",
    "- ```bert embedding```: https://github.com/imgarylai/bert-embedding\n",
    "- ```nltk```:https://www.nltk.org/index.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 第一部分:对于训练数据的处理:读取文件和预处理"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- ```文本的读取```: 需要从文本中读取数据,此处需要读取的文件是```dev-v2.0.json```,并把读取的文件存入一个列表里(list)\n",
    "- ```文本预处理```: 对于问题本身需要做一些停用词过滤等文本方面的处理\n",
    "- ```可视化分析```: 对于给定的样本数据,做一些可视化分析来更好地理解数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.1节: 文本的读取\n",
    "把给定的文本数据读入到```qlist```和```alist```当中,这两个分别是列表,其中```qlist```是问题的列表,```alist```是对应的答案列表"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "def read_corpus():\n",
    "    \"\"\"\n",
    "    读取给定的语料库,并把问题列表和答案列表分别写入到 qlist, alist 里面。 在此过程中,不用对字符换做任何的处理(这部分需要在 Part 2.3里处理)\n",
    "    qlist = [\"问题1\", “问题2”, “问题3” ....]\n",
    "    alist = [\"答案1\", \"答案2\", \"答案3\" ....]\n",
    "    务必要让每一个问题和答案对应起来(下标位置一致)\n",
    "    \"\"\"\n",
    "    # TODO 需要完成的代码部分 ...\n",
    "    filename = 'train-v2.0.json'\n",
    "    qlist = []\n",
    "    alist = []\n",
    "    with open(filename, 'r') as f:\n",
    "        content = json.loads(f.read())\n",
    "        data = content['data']\n",
    "        for item in data:\n",
    "            paragraphs = item['paragraphs']\n",
    "            for p in paragraphs:\n",
    "                qas = p['qas']\n",
    "                for q in qas:\n",
    "                    if q['answers']:\n",
    "                        qlist.append(q['question'])\n",
    "                        alist.append(q['answers'][0]['text'])\n",
    "                    else:\n",
    "                        continue\n",
    "    \n",
    "    assert len(qlist) == len(alist)  # 确保长度一样\n",
    "    return qlist, alist\n",
    "\n",
    "qlist, alist = read_corpus()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.2 理解数据(可视化分析/统计信息)\n",
    "对数据的理解是任何AI工作的第一步, 需要对数据有个比较直观的认识。在这里,简单地统计一下:\n",
    "\n",
    "- 在```qlist```出现的总单词个数\n",
    "- 按照词频画一个```histogram``` plot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "63780\n"
     ]
    }
   ],
   "source": [
    "# TODO: 统计一下在qlist中总共出现了多少个单词? 总共出现了多少个不同的单词(unique word)?\n",
    "#       这里需要做简单的分词,对于英文我们根据空格来分词即可,其他过滤暂不考虑(只需分词)\n",
    "word_total_list = []\n",
    "[word_total_list.extend(q.split(' ')) for q in qlist]\n",
    "word_total = len(set(word_total_list))\n",
    "print (word_total)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAA3kAAAJBCAYAAAD7rnyhAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nOzde5Bc130f+O/p7ulBNwhgSBEkQRIUFYl6ULIM2rAe1rpWsWKJUiqhsmU5cmotxtFayVquJFuubOSkHDlWlLXXGyurKltZaa1Y8saWtYqz0mZp0bKixLFlPSiJevAhE3qSFClSfAAgZjCYx9k/5s5gAAIESHT3nRl8PlVdffvc292nBwOgv/f8zrml1hoAAAC2hk7bHQAAAGB0hDwAAIAtRMgDAADYQoQ8AACALUTIAwAA2EKEPAAAgC3kjCGvlLKtlPKZUsoXSym3lVL+edP+26WUb5RSbm1u+5r2Ukp5ZynlQCnlS6WUH1j3WjeWUu5qbjeua//BUsqXm+e8s5RSxvFhAQAAtrreWRwzn+RHa62PlVKmkvxpKeUPm33/qNb6oZOOf3WSa5rbi5O8K8mLSykXJXlrkv1JapLPlVI+Umt9pDnmZ5J8OslNSa5P8ocBAADgSTljyKsrV0t/rHk41dye6ArqNyR5f/O8T5VSZkope5K8PMnHaq0PJ0kp5WNJri+l/OckO2utn2ra35/ktTlDyLv44ovr1VdffabuAwAAbEmf+9znvldr3X1y+9mM5KWU0k3yuSTPSvIbtdZPl1L+xyRvL6X8syQfT/KWWut8kiuS3L3u6fc0bU/Ufs8p2p/Q1VdfnVtuueVsug8AALDllFK+dar2s1p4pda6VGvdl+TKJC8qpbwgyS8keW6SH0pyUZJ/PKK+nlYp5U2llFtKKbc8+OCD4347AACATedJra5Za300ySeSXF9rva+umE/yb5O8qDns3iR71z3tyqbtidqvPEX7qd7/3bXW/bXW/bt3P25UEgAA4Lx3Nqtr7i6lzDTbgyQ/luTOZp5dmpUwX5vkK81TPpLkDc0qmy9JcrDWel+Sm5O8spRyYSnlwiSvTHJzs+9QKeUlzWu9IcmHR/sxAQAAzg9nMydvT5L3NfPyOkk+WGv9j6WU/1RK2Z2kJLk1yd9rjr8pyWuSHEgym+Snk6TW+nAp5W1JPtsc98uri7Ak+dkkv51kkJUFV6ysCQAA8BSUlUUwN5/9+/dXC68AAADnq1LK52qt+09uf1Jz8gAAANjYhDwAAIAtRMgDAADYQoQ8AACALUTIAwAA2EKEPAAAgC1EyAMAANhChDwAAIAtRMgDAADYQoQ8AACALUTIAwAA2EKEPAAAgC1EyAMAANhChDwAAIAtRMgDAADYQoQ8AACALUTIAwAA2EKEvBFZXFrOX/n1/5Lf+fNvtt0VAADgPCbkjUiv28k3v3ck9x862nZXAACA85iQN0KDfjezx5ba7gYAAHAeE/JGaNjvZnZeyAMAANoj5I3QsN/L7IKQBwAAtEfIG6Fhv5u5Y4ttdwMAADiPCXkjNDQnDwAAaJmQN0KDfk/IAwAAWiXkjdBwqptZ5ZoAAECLhLwRUq4JAAC0TcgboeF0N3NCHgAA0CIhb4SG5uQBAAAtE/JGaDDVzdzCUpaXa9tdAQAAzlNC3ggN+90kyZwLogMAAC0R8kZoNeQp2QQAANoi5I3QsN9LEouvAAAArRHyRmhtJG/BtfIAAIB2CHkjNGhC3pF5I3kAAEA7hLwRUq4JAAC0TcgboeMLryjXBAAA2iHkjZBLKAAAAG0T8kZotVzTJRQAAIC2CHkjdHzhFeWaAABAO4S8EVor1zSSBwAAtETIG6GpbidT3ZJZc/IAAICWCHkjNuz3jOQBAACtEfJGbNjvuoQCAADQGiFvxAb9bo4YyQMAAFoi5I3YsN9VrgkAALRGyBux4VRPuSYAANAaIW/EhtNG8gAAgPYIeSO2svCKkAcAALRDyBuxwVRPyAMAAFoj5I2YSygAAABtEvJGTLkmAADQJiFvxIb9XuYXl7O0XNvuCgAAcB4S8kZs2O8mSeYWjOYBAACTJ+SN2KAJebPz5uUBAACTJ+SN2OpInnl5AABAG4S8ERPyAACANgl5Izbs95IkcwvKNQEAgMkT8kbMSB4AANAmIW/EVhdeOTIv5AEAAJMn5I2Yck0AAKBNQt6IKdcEAADaJOSN2NrF0IU8AACgBULeiK2WaxrJAwAA2iDkjVi3U9LvdXLkmDl5AADA5J0x5JVStpVSPlNK+WIp5bZSyj9v2p9RSvl0KeVAKeX3Syn9pn26eXyg2X/1utf6hab9q6WUV61rv75pO1BKecvoP+ZkDftd5ZoAAEArzmYkbz7Jj9Zavz/JviTXl1JekuRXk7yj1vqsJI8keWNz/BuTPNK0v6M5LqWUa5O8Psnzk1yf5DdLKd1SSjfJbyR5dZJrk/xkc+ymNZzqKtcEAABaccaQV1c81jycam41yY8m+VDT/r4kr222b2gep9n/ilJKado/UGudr7V+I8mBJC9qbgdqrV+vtR5L8oHm2E1rON0zkgcAALTirObkNSNutyZ5IMnHknwtyaO11tWJZ/ckuaLZviLJ3UnS7D+Y5Gnr2096zunaN61hv5tZc/IAAIAWnFXIq7Uu1Vr3JbkyKyNvzx1rr06jlPKmUsotpZRbHnzwwTa6cFYGU90cMZIHAAC04EmtrllrfTTJJ5K8NMlMKaXX7Loyyb3N9r1J9iZJs39XkofWt5/0nNO1n+r9311r3V9r3b979+4n0/WJsvAKAADQlrNZXXN3KWWm2R4k+bEkd2Ql7P14c9iNST7cbH+keZxm/3+qtdam/fXN6pvPSHJNks8k+WySa5rVOvtZWZzlI6P4cG0Z9nvKNQEAgFb0znxI9iR5X7MKZifJB2ut/7GUcnuSD5RS/kWSLyT5reb430ryO6WUA0kezkpoS631tlLKB5PcnmQxyZtrrUtJUkr5uSQ3J+kmeW+t9baRfcIWGMkDAADacsaQV2v9UpLrTtH+9azMzzu5/WiS153mtd6e5O2naL8pyU1n0d9NYdjvZnZByAMAACbvSc3J4+wM+r3Mzgt5AADA5Al5YzDsd3NsaTmLS8ttdwUAADjPCHljMOx3k0TJJgAAMHFC3hgM+ytTHS2+AgAATJqQNwarI3lH5l1GAQAAmCwhbwwGq+WaRvIAAIAJE/LGYHUkb86cPAAAYMKEvDEYGskDAABaIuSNwfGFV8zJAwAAJkvIG4PjC68YyQMAACZLyBuDgevkAQAALRHyxkC5JgAA0BYhbwwGUxZeAQAA2iHkjUG3U7JtqpM5IQ8AAJgwIW9Mhv1ejijXBAAAJkzIG5PBVFe5JgAAMHFC3pgM+13lmgAAwMQJeWMynO4ZyQMAACZOyBuT4ZSRPAAAYPKEvDEZ9rsWXgEAACZOyBuTgTl5AABAC4S8MRn2ra4JAABMnpA3JsN+L7PKNQEAgAkT8sZk2O9mbsFIHgAAMFlC3pgM+90
sLNUcW1xuuysAAMB5RMgbk0G/lyQWXwEAACZKyBuTYb+bJJldMC8PAACYHCFvTNZCnpE8AABggoS8MRkq1wQAAFog5I3J6kjekXnlmgAAwOQIeWMyWJuTZyQPAACYHCFvTFZH8pRrAgAAkyTkjcn2Zk6ehVcAAIBJEvLGZLA2kmdOHgAAMDlC3pisLbxiJA8AAJggIW9MtvVcJw8AAJg8IW9MOp2SwVRXuSYAADBRQt4YbZ/uGskDAAAmSsgbo0G/6xIKAADARAl5YzSc6uWIck0AAGCChLwxGvSVawIAAJMl5I3RULkmAAAwYULeGA37PSN5AADARAl5YzTsdzO3IOQBAACTI+SN0bDfzZF5C68AAACTI+SNkUsoAAAAkybkjdGw383swlJqrW13BQAAOE8IeWM07PeytFxzbGm57a4AAADnCSFvjIb9bpJkdl7JJgAAMBlC3hithTwrbAIAABMi5I3RoN9Lkswds8ImAAAwGULeGA2nmpE8K2wCAAATIuSN0XBayAMAACZLyBujYVOuOatcEwAAmBAhb4zWFl4xkgcAAEyIkDdGA3PyAACACRPyxmh1JG9OyAMAACZEyBuj7dOrc/KEPAAAYDKEvDGa7nVSioVXAACAyRHyxqiUkuFU10geAAAwMULemA36PSEPAACYGCFvzIb9buaUawIAABMi5I3ZsK9cEwAAmBwhb8yEPAAAYJKEvDEb9ntW1wQAACbmjCGvlLK3lPKJUsrtpZTbSin/oGn/pVLKvaWUW5vba9Y95xdKKQdKKV8tpbxqXfv1TduBUspb1rU/o5Ty6ab990sp/VF/0LYMjOQBAAATdDYjeYtJfr7Wem2SlyR5cynl2mbfO2qt+5rbTUnS7Ht9kucnuT7Jb5ZSuqWUbpLfSPLqJNcm+cl1r/OrzWs9K8kjSd44os/XumG/m7kFIQ8AAJiMM4a8Wut9tdbPN9uHk9yR5IoneMoNST5Qa52vtX4jyYEkL2puB2qtX6+1HkvygSQ3lFJKkh9N8qHm+e9L8tqn+oE2mqFLKAAAABP0pObklVKuTnJdkk83TT9XSvlSKeW9pZQLm7Yrkty97mn3NG2na39akkdrrYsntW8Jw343s/Pm5AEAAJNx1iGvlHJBkn+f5B/WWg8leVeSZybZl+S+JP9qLD08sQ9vKqXcUkq55cEHHxz3243EsN/N7MJSaq1tdwUAADgPnFXIK6VMZSXg/bta6x8kSa31u7XWpVrrcpL3ZKUcM0nuTbJ33dOvbNpO1/5QkplSSu+k9septb671rq/1rp/9+7dZ9P11g363dSazC8ut90VAADgPHA2q2uWJL+V5I5a66+va9+z7rC/keQrzfZHkry+lDJdSnlGkmuSfCbJZ5Nc06yk2c/K4iwfqStDXJ9I8uPN829M8uFz+1gbx3CqmyTm5QEAABPRO/MheVmSn0ry5VLKrU3bP8nK6pj7ktQk30zyd5Ok1npbKeWDSW7Pysqcb661LiVJKeXnktycpJvkvbXW25rX+8dJPlBK+RdJvpCVULklDKdXfsSzxxZz0fYtc2UIAABggzpjyKu1/mmScopdNz3Bc96e5O2naL/pVM+rtX49x8s9t5Rh30geAAAwOU9qdU2ePCEPAACYJCFvzAZTx8s1AQAAxk3IG7PVkbw5I3kAAMAECHljtn1auSYAADA5Qt6YDfrKNQEAgMkR8sbMdfIAAIBJEvLGbGB1TQAAYIKEvDGb7nXSKRZeAQAAJkPIG7NSSrb3e0byAACAiRDyJmDQ71p4BQAAmAghbwKG/a6RPAAAYCKEvAkYKNcEAAAmRMibgGG/m7kF5ZoAAMD4CXkToFwTAACYFCFvAob9bmbnhTwAAGD8hLwJGPZ7mVWuCQAATICQNwGDftfF0AEAgIkQ8iZgOGVOHgAAMBlC3gQMp3uZW1jK8nJtuysAAMAWJ+RNwLDfTa3J0UWjeQAAwHgJeRMw7HeTRMkmAAAwdkLeBAymVkKexVcAAIBxE/ImYNjvJTGSBwAAjJ+QNwHD6ZWRvCPHXCsPAAAYLyFvAobKNQEAgAkR8iZAuSYAADApQt4EDNZW11SuCQAAjJeQNwHbp5VrAgAAkyHkTcBwaqVc84iQBwAAjJmQNwGr5ZpzyjUBAIAxE/ImoN/rpNcpFl4BAADGTsibkEG/K+QBAABjJ+RNyPZ+z8IrAADA2Al5EzLsd3PEnDwAAGDMhLwJGfS7RvIAAICxE/ImZGhOHgAAMAFC3oQM+r3MLgh5AADAeAl5E7K933WdPAAAYOyEvAkZ9Ls5Mm8kDwAAGC8hb0KG/W7mlGsCAABjJuRNyLDfy6xyTQAAYMyEvAkZTHVzdGE5y8u17a4AAABbmJA3Idunu0miZBMAABgrIW9CBv1ekuSIkk0AAGCMhLwJGU41I3kuiA4AAIyRkDchw/5KyJsV8gAAgDES8iZkIOQBAAATIORNyPbplTl5yjUBAIBxEvImZNDMybPwCgAAME5C3oSszskzkgcAAIyTkDchw+YSCubkAQAA4yTkTcjxhVeUawIAAOMj5E2Ick0AAGAShLwJmep20u92ckTIAwAAxkjIm6BBv5s55ZoAAMAYCXkTNOx3LbwCAACMlZA3QYN+N7MLQh4AADA+Qt4Ebe/3LLwCAACMlZA3QYN+N0fmzckDAADGR8iboGG/mznlmgAAwBgJeRNk4RUAAGDchLwJGkyZkwcAAIyXkDdB26e7mXWdPAAAYIyEvAka9Ls5YiQPAAAYIyFvgoZTvRxbXM7Scm27KwAAwBZ1xpBXStlbSvlEKeX2UsptpZR/0LRfVEr5WCnlrub+wqa9lFLeWUo5UEr5UinlB9a91o3N8XeVUm5c1/6DpZQvN895ZymljOPDtm3Y7yaJkk0AAGBszmYkbzHJz9dar03ykiRvLqVcm+QtST5ea70mycebx0ny6iTXNLc3JXlXshIKk7w1yYuTvCjJW1eDYXPMz6x73vXn/tE2nkET8iy+AgAAjMsZQ16t9b5a6+eb7cNJ7khyRZIbkryvOex9SV7bbN+Q5P11xaeSzJRS9iR5VZKP1VofrrU+kuRjSa5v9u2stX6q1lqTvH/da20p26dXQp55eQAAwLg8qTl5pZSrk1yX5NNJLq213tfsuj/Jpc32FUnuXve0e5q2J2q/5xTtW85gqpdEuSYAADA+Zx3ySikXJPn3Sf5hrfXQ+n3NCNzYVxMppbyplHJLKeWWBx98cNxvN3JD5ZoAAMCYnVXIK6VMZSXg/bta6x80zd9tSi3T3D/QtN+bZO+6p1/ZtD1R+5WnaH+cWuu7a637a637d+/efTZd31COL7wi5AEAAONxNqtrliS/leSOWuuvr9v1kSSrK2TemOTD69rf0Kyy+ZIkB5uyzpuTvLKUcmGz4Mork9zc7DtUSnlJ815vWPdaW8pAyAMAAMasdxbHvCzJTyX5cinl1qbtnyT5lSQfLKW8Mcm3kvxEs++mJK9JciDJbJKfTpJa68OllLcl+Wxz3C/XWh9utn82yW8nGST5w+a25Wzvm5MHAACM1xlDXq31T5
Oc7rp1rzjF8TXJm0/zWu9N8t5TtN+S5AVn6stmp1wTAAAYtye1uibnxnXyAACAcRPyJmi4Vq4p5AEAAOMh5E1Qt1PS73Uyu2BOHgAAMB5C3oRt73czO28kDwAAGA8hb8KG/Z5yTQAAYGyEvAkb9LuZU64JAACMiZA3YcN+10geAAAwNkLehA2mhDwAAGB8hLwJ2z7dy+wx5ZoAAMB4CHkTNlCuCQAAjJGQN2HDqW7mhDwAAGBMhLwJs/AKAAAwTkLehA36PSN5AADA2Ah5E7a9382xpeUsLC233RUAAGALEvImbNDvJomSTQAAYCyEvAkb9ntJomQTAAAYCyFvwoZrI3mulQcAAIyekDdhyjUBAIBxEvImbHtTrinkAQAA4yDkTdhAuSYAADBGQt6Erc7Js/AKAAAwDkLehA3NyQMAAMZIyJuwtXLNBSEPAAAYPSFvwtYWXpk3Jw8AABg9IW/CBlPKNQEAgPER8ias0ynZNtXJnHJNAABgDIS8Fgz7PZdQAAAAxkLIa8FgqqtcEwAAGAshrwXbp7uZnRfyAACA0RPyWjDo91xCAQAAGAshrwXDqW7mzMkDAADGQMhrwbBvTh4AADAeQl4LBv1u5oQ8AABgDIS8Fmzv93JEuSYAADAGQl4LBso1AQCAMRHyWjBUrgkAAIyJkNeCYb+bxeWaY4vLbXcFAADYYoS8Fgz6vSQxmgcAAIyckNeC7f1uklh8BQAAGDkhrwWDJuRZfAUAABg1Ia8FQ+WaAADAmAh5LRiujeQp1wQAAEZLyGvBULkmAAAwJkJeC1bLNYU8AABg1IS8FijXBAAAxkXIa8Hq6ppzC0byAACA0RLyWmBOHgAAMC5CXgu29bopJZmdV64JAACMlpDXgk6nZDDVNZIHAACMnJDXkmG/m1lz8gAAgBET8loy6HczZyQPAAAYMSGvJcOpnksoAAAAIyfktWQ4bU4eAAAwekJeS4Z9IQ8AABg9Ia8lg6mekAcAAIyckNeSYb+bOXPyAACAERPyWqJcEwAAGAchryXDvnJNAABg9IS8lqyM5C2m1tp2VwAAgC1EyGvJoN/Nck3mF5fb7goAALCFCHktGfa7SZI5JZsAAMAICXktWQ15swtCHgAAMDpCXkuG/V6SZHbeZRQAAIDREfJasjaSp1wTAAAYISGvJQMhDwAAGIMzhrxSyntLKQ+UUr6yru2XSin3llJubW6vWbfvF0opB0opXy2lvGpd+/VN24FSylvWtT+jlPLppv33Syn9UX7AjWq1XHNuQbkmAAAwOmczkvfbSa4/Rfs7aq37mttNSVJKuTbJ65M8v3nOb5ZSuqWUbpLfSPLqJNcm+cnm2CT51ea1npXkkSRvPJcPtFko1wQAAMbhjCGv1vonSR4+y9e7IckHaq3ztdZvJDmQ5EXN7UCt9eu11mNJPpDkhlJKSfKjST7UPP99SV77JD/DprQW8uaFPAAAYHTOZU7ez5VSvtSUc17YtF2R5O51x9zTtJ2u/WlJHq21Lp7UvuWtra55TLkmAAAwOk815L0ryTOT7EtyX5J/NbIePYFSyptKKbeUUm558MEHJ/GWY+M6eQAAwDg8pZBXa/1urXWp1rqc5D1ZKcdMknuT7F136JVN2+naH0oyU0rpndR+uvd9d611f611/+7du59K1zeM6V4npSRz5uQBAAAj9JRCXillz7qHfyPJ6sqbH0ny+lLKdCnlGUmuSfKZJJ9Nck2zkmY/K4uzfKTWWpN8IsmPN8+/McmHn0qfNptSSoZTXQuvAAAAI9U70wGllN9L8vIkF5dS7kny1iQvL6XsS1KTfDPJ302SWuttpZQPJrk9yWKSN9dal5rX+bkkNyfpJnlvrfW25i3+cZIPlFL+RZIvJPmtkX26DW443TMnDwAAGKkzhrxa60+eovm0QazW+vYkbz9F+01JbjpF+9dzvNzzvDLsG8kDAABG61xW1+QcDZRrAgAAIybktWjY71p4BQAAGCkhr0XDvjl5AADAaAl5LTInDwAAGDUhr0VCHgAAMGpCXosG/Z6QBwAAjJSQ16KVhVfMyQMAAEZHyGvRsN/N7MJSaq1tdwUAANgihLwWDfu91JocXVhuuysAAMAWIeS1aNjvJonLKAAAACMj5LVosBbyLL4CAACMhpDXotWRvLkFIQ8AABgNIa9FqyHvyLxyTQAAYDSEvBYN+70kyZxyTQAAYESEvBYNzckDAABGTMhr0VrIMycPAAAYESGvRYO1ck1z8gAAgNEQ8lo0nFpdeMVIHgAAMBpCXouG0y6hAAAAjJaQ16J+t5Nup2RWuSYAADAiQl6LSikZTnWtrgkAAIyMkNeyQb/rOnkAAMDICHktG/a7OSLkAQAAIyLktWzY77mEAgAAMDJCXsuGfXPyAACA0RHyWjYQ8gAAgBES8lo2tPAKAAAwQkJey4b9Xo6YkwcAAIyIkNcyI3kAAMAoCXkts/AKAAAwSkJeywb9XuYWlrK8XNvuCgAAsAUIeS0b9rtJkqOLRvMAAIBzJ+S1bDXkHZkX8gAAgHMn5LVs2O8licVXAACAkRDyWrY6kje74DIKAADAuRPyWjZYDXlG8gAAgBEQ8lo2nFoJeco1AQCAURDyWrY6J+/IvHJNAADg3Al5LRtONyN5C0byAACAcyfktWxoTh4AADBCQl7LhlMr5ZpCHgAAMApCXstWV9ecO2ZOHgAAcO6EvJb1e530OiVHjOQBAAAjIORtAMN+1yUUAACAkRDyNoBhv5dZ5ZoAAMAICHkbwLDftfAKAAAwEkLeBjBQrgkAAIyIkLcBDPvdHFGuCQAAjICQtwEM+z0jeQAAwEgIeRuAOXkAAMCoCHkbwEDIAwAARkTI2wCG/W7mFoQ8AADg3Al5G8Cw38uReQuvAAAA507I2wCG/W7mF5eztFzb7goAALDJCXkbwLDfTRIlmwAAwDkT8jaAQb+XJJl1rTwAAOAcCXkbwHCqGcmzwiYAAHCOhLwNYPv0Ssg7Mi/kAQAA50bI2wBWyzXnFpRrAgAA50bI2wBWF15xQXQAAOBcCXkbwGBKyAMAAEZDyNsAjo/kKdcEAADOjZC3AWyfXr2EgpE8AADg3Ah5G8Cg7xIKAADAaAh5G8DQnDwAAGBEzhjySinvLaU8UEr5yrq2i0opHyul3NXcX9i0l1LKO0spB0opXyql/MC659zYHH9XKeXGde0/WEr5cvOcd5ZSyqg/5EbX63bS73aEPAAA4JydzUjebye5/qS2tyT5eK31miQfbx4nyauTXNPc3pTkXclKKEzy1iQvTvKiJG9dDYbNMT+z7nknv9d5YdDvWngFAAA4Z2cMebXWP0ny8EnNNyR5X7P9viSvXdf+/rriU0lmSil7krwqycdqrQ/XWh9J8rEk1zf7dtZaP1VrrUnev+61zivb+10jeQAAwDl7qnPyLq213tds35/k0mb7iiR3rzvunqbtidrvOUX7eWfQ71p4BQAAOGfnvPBKMwJXR9CXMyqlvKmUcksp5ZYHH3xwEm85McN+T7kmAABwzp5qyPtuU2qZ5v6Bpv3eJHvXHXdl0/ZE7
Veeov2Uaq3vrrXur7Xu371791Ps+sY0UK4JAACMwFMNeR9JsrpC5o1JPryu/Q3NKpsvSXKwKeu8OckrSykXNguuvDLJzc2+Q6WUlzSrar5h3WudV4ZCHgAAMAK9Mx1QSvm9JC9PcnEp5Z6srJL5K0k+WEp5Y5JvJfmJ5vCbkrwmyYEks0l+OklqrQ+XUt6W5LPNcb9ca11dzOVns7KC5yDJHza38872fi93H5ttuxsAAMAmd8aQV2v9ydPsesUpjq1J3nya13lvkveeov2WJC84Uz+2OguvAAAAo3DOC68wGsN+N7MLQh4AAHBuhLwNwsIrAADAKAh5G8T2fi/HFpdz1GgeAABwDoS8DeKaSy5Iktx5/+GWewIAAGxmQt4Gse+qmSTJrd9+pOWeAAAAm5mQt0Hs2TXIpTunc+vdj7bdFQAAYBMT8jaQfXtnhDwAAOCcCHkbyL69F+abD83m4SPH2jMX50MAAByCSURBVO4KAACwSQl5G8h1zby8LxrNAwAAniIhbwP5vit2pVOSLwh5AADAUyTkbSDbp3t59qU7zMsDAACeMiFvg7nuqpl88e5Hs7xc2+4KAACwCQl5G8y+vTM5OLeQbzx0pO2uAAAAm5CQt8Fcd9WFSZJbv61kEwAAePKEvA3mmbsvyAXTPfPyAACAp0TI22C6nZIXXrlLyAMAAJ4SIW8D2rd3JnfcdyhHF5ba7goAALDJCHkb0HVXXZjF5Zqv3Huw7a4AAACbjJC3Ae3bO5MkSjYBAIAnTcjbgHbvmM4VM4N8QcgDAACeJCFvg9p31YzLKAAAAE+akLdBXbd3Jvc+OpcHDh9tuysAAMAmIuRtUNdd1czLM5oHAAA8CULeBvX8y3el1ykWXwEAAJ4UIW+D2jbVzfP27BTyAACAJ0XI28D27Z3JF+9+NEvLte2uAAAAm4SQt4Ht2zuTI8eWcuCBx9ruCgAAsEkIeRvY2uIrdz/Sck8AAIDNQsjbwJ5x8fbsGkyZlwcAAJw1IW8DK6Xk+/fO5AsuowAAAJwlIW+D27d3Jn/x3cM5Mr/YdlcAAIBNQMjb4K7bO5PlmnzpnoNtdwUAANgEhLwNbt/e1cVXlGwCAABnJuRtcBdu7+fqpw2tsAkAAJwVIW8T2NcsvlKri6IDAABPTMjbBPbtnckDh+dz38GjbXcFAADY4IS8TWDfVRcmMS8PAAA4MyFvE7h2z870ex0hDwAAOCMhbxPo9zp5/uU7c6uLogMAAGcg5G0S+/bO5Ev3PpqFpeW2uwIAAGxgQt4msW/vTI4uLOer9x9uuysAAMAGJuRtEj9g8RUAAOAsCHmbxJUXDvK07X0hDwAAeEJC3iZRSsm+vTNCHgAA8ISEvE1k396ZHHjgsRycW2i7KwAAwAYl5G0i+66aSZJ86R6jeQAAwKkJeZvI9++dSSlxvTwAAOC0hLxNZOe2qTxz9wXm5QEAAKcl5G0y+/bO5At3P5paa9tdAQAANiAhb5PZt3cmDx85lrsfnmu7KwAAwAYk5G0y+/auLL7yhbsfabknAADARiTkbTLPvWxHtk11zMsDAABOScjbZHrdTl54hYuiAwAApybkbUL7rprJbfceyvziUttdAQAANhghbxPat3cmx5aWc8d9h9vuCgAAsMEIeZvQ6uIrt37b4isAAMCJhLxNaM+ubbl057R5eQAAwOMIeZtQKWXtougAAADrCXmb1L69F+ZbD83m4SPH2u4KAACwgQh5m9TqvLwvGs0DAADWEfI2qRdeuSudEiWbAADACYS8TWr7dC/PvnSHxVcAAIATCHmb2HVXzeTWbz+S5eXadlcAAIANQsjbxPbtncmho4v5xkNH2u4KAACwQQh5m9i+vRcmSW79tpJNAABgxTmFvFLKN0spXy6l3FpKuaVpu6iU8rFSyl3N/YVNeymlvLOUcqCU8qVSyg+se50bm+PvKqXceG4f6fzxrEsuyAXTPfPyAACANaMYyfvLtdZ9tdb9zeO3JPl4rfWaJB9vHifJq5Nc09zelORdyUooTPLWJC9O8qIkb10NhjyxbqfkhVfuyhfufqTtrgAAABvEOMo1b0jyvmb7fUleu679/XXFp5LMlFL2JHlVko/VWh+utT6S5GNJrh9Dv7akfXtncud9h3N0YantrgAAABvAuYa8muSPSimfK6W8qWm7tNZ6X7N9f5JLm+0rkty97rn3NG2na+cs7Ns7k8Xlmq/ce7DtrgAAABtA7xyf/9/UWu8tpVyS5GOllDvX76y11lLKyNb3b4Lkm5LkqquuGtXLbmr7rppJktx696PZf/VFLfcGAABo2zmN5NVa723uH0jyH7Iyp+67TRlmmvsHmsPvTbJ33dOvbNpO136q93t3rXV/rXX/7t27z6XrW8YlO7bliplBvmDxFQAAIOcQ8kop20spO1a3k7wyyVeSfCTJ6gqZNyb5cLP9kSRvaFbZfEmSg01Z581JXllKubBZcOWVTRtnad9VMy6jAAAAJDm3cs1Lk/yHUsrq6/xurfWjpZTPJvlgKeWNSb6V5Cea429K8pokB5LMJvnpJKm1PlxKeVuSzzbH/XKt9eFz6Nd557q9M/n/vnRfHjh8NJfs2NZ2dwAAgBY95ZBXa/16ku8/RftDSV5xivaa5M2nea33JnnvU+3L+W7f3mZe3rcfzSuff1nLvQEAANo0jksoMGEvuGJXep2SP7nrwba7AgAAtEzI2wK2TXXzuv1X5vc+c3e+ev/htrsDAAC0SMjbIv7Rq56bHdt6+cUPfyUrlbEAAMD5SMjbIi7a3s///Krn5jPfeDgfvvU7bXcHAABoiZC3hfzNH9qb779yV95+0x05dHSh7e4AAAAtEPK2kG6n5G2vfUG+99h83vGxv2i7OwAAQAuEvC3mhVfO5G+96Kq875PfzO3fOdR2dwAAgAkT8ragf/Sq52TXYCr/zCIsAABw3hHytqCZYT9vefVzc8u3Hsm///y9bXcHAACYICFvi3rdD+7NdVfN5H+56Y4cnLMICwAAnC+EvC2q0yl52w0vyCOzx/Lrf/TVtrsDAABMiJC3hb3gil3571/y9PzOp76Vr9x7sO3uAAAAEyDkbXE//8rn5MJhP7/44a9kedkiLAAAsNUJeVvcrsFUfuE1z8sXvv1oPvS5e9ruDgAAMGZC3nngv7vuiux/+oX5lY/emUdnj7XdHQAAYIyEvPNAp1Pytte+IAfnFvJrN1uEBQAAtjIh7zzxvD0784aXPj2/+5lv50v3PNp2dwAAgDER8s4j/9OPPTsXXzCdX/x/LMICAABblZB3Htm5bSr/9DXPyxfvOZjfv+XutrsDAACMgZB3nrlh3+V50TMuyq9+9M48fMQiLAAAsNUIeeeZUkredsMLcvjoYn7t5jvb7g4AADBiQt556DmX7cjfednV+cBn784Xvv1I290BAABGSMg7T/2Dv/LsXLJjOr/44a9kySIsAACwZQh556kLpnv5p3/12nzl3kP53c98u+3uAAAAIyLkncf+2gv35Ief+bT82kfvzEOPzbfdHQAAYASEvPNYKSW/fMPzM3tsKb/yhxZh
AQCArUDIO88965Id+R9+5C/l//7cPXnXf/5a290BAADOUa/tDtC+n3/ls3Pfwbn86kfvzKOzx/KWVz83pZS2uwUAADwFQh6Z6nbyjp/Yl5nBVP6PP/l6Hpk9ln/5N74vva6BXgAA2GyEPJIknU7JL/315+fC7f386z++K4/OLuSdP3ldtk112+4aAADwJBiqYU0pJf/wrzw7v/TXrs0f3f7d/PS//WwOH11ou1sAAMCTIOTxOH/7Zc/Iv/6b+/LZbz6cv/WeT7u8AgAAbCJCHqf02uuuyHvesD93PXA4r/s3f557H51ru0sAAMBZEPI4rb/83EvyO298cR58bD4//q5P5sADh9vuEgAAcAZCHk/oh66+KB/8uy/N4nLN6/7Nn+fWux9tu0sAAMATEPI4o+ft2ZkP/b2XZse2qfyt93wqf3rX99ruEgAAcBpCHmfl6U/bng/9vZfmqouG+Tu//dnc9OX72u4SAABwCkIeZ+2Sndvy+296ab7vyl158+9+Pr/3mW+33SUAAOAkQh5Pyq7hVP6vN744/+2zd+cX/uDL+c3/fCC11ra7BQAANIQ8nrRBv5v3vGF/bth3ef7Xj341//KmOwQ9AADYIHptd4DNaarbyTt+Yl9mBlN5z3/9Ru68/3De8urn5vmX72q7awAAcF4zksdT1umU/NJff37e+teuzRfvfjR/9Z1/mp/73c/n6w8+1nbXAADgvFU2a5nd/v376y233NJ2N2gcnFvIe/7k6/mtP/1Gji0t53U/eGX+/iuuyeUzg7a7BgAAW1Ip5XO11v2PaxfyGKUHD8/nNz5xIL/76W8nJfmplzw9P/vyZ+ZpF0y33TUAANhShDwm6p5HZvOv//iu/MHn78lgqps3/shfys/8yDOyY9tU210DAIAtQcijFQceOJx/9Ud/kT/8yv25cDiVn335s/JTL316tk112+4aAABsakIerfrSPY/m127+av7rXd/LZTu35e+/4pq8bv+Vmepa+wcAAJ6K04U837CZiBdeOZPfeeOL83s/85JcPrMt/+Q/fDk/9uv/JR++9d4sL2/OEw0AALARGclj4mqt+fgdD+R/+6Ov5s77D+fZl16QH7v20rzsmRfnB55+oVJOAAA4C8o12XCWl2v+3y99J+/75DfzxXsOZmm5ZrrXyf6rL8wPP/Pi/PAzn5bvu2JXeko6AQDgcYQ8NrTDRxfymW88nE9+7aH82YHv5c77DydJdkz38uK/9LS87FlPyw8/8+I8+9ILUkppubcAANC+04W8XhudgZPt2DaVVzzv0rzieZcmSb732Hz+/GsP5ZNf+14++bWH8sd3fDdJcvEF0/nhZx4PfXsvGrbZbQAA2HCM5LEp3PPIbD554KH8WRP6Hjw8nyTZe9EgP/T0i3Lt5Tvz/Mt35drLd2bXwLX4AADY+pRrsmXUWnPggcfyZwe+lz/72kP54t2P5oEm9CUrwe/5e3bl+ZfvzPOvWAl/l+yYVuYJAMCWolyTLaOUkmsu3ZFrLt2Rv/2yZyRJHjw8n9u+czC3fedQbv/Oodx+36F89Lb7155z8QX9XHt5E/yaUb+nXzRMpyP4AQCwtQh5bAm7d0zn5c+5JC9/ziVrbY/NL+aO+w7lK/euhL/bvnMo7/mTr2exuS7fBdO9PG/Pjjxvz84897Kdee6eHXnuZTsy7PtrAQDA5uXbLFvWBdO9/NDVF+WHrr5orW1+cSl3ffex3P6dQ2sjf3/w+Xvz2Py3kiSlJE+/aJjnXrZzJfzt2ZHnXbYzV144MOoHAMCmIORxXpnudfOCK3blBVfsSrI3ycocv3semcsd9x3Knfcfzp33H8od9x3Ozbffn9UpqxdM9/Kcy1ZG+p63Z2eet2dHnnPZzlww7a8QAAAbi4VX4DRmjy3mL7772Er4u+9Q7rj/cO6471AOH11cO+bSndPZs2uQy2e25bKdK/d7dg2yZ2Zb9uzalkt2bEvXCCAAAGNg4RV4kob9Xvbtncm+vTNrbbXWfOfg0ZXQd9+hfOuh2dx38GjuvP9wPnHng5lbWDrhNbqdkkt3TGfPzCCX7dqWy3dtWwuFl+zclu39XqZ7nfR7nUz3Opme6ma610mvU6wGCgDAUyLkwZNQSskVM4NcMTNYu3D7qlprDs0t5jsH53Lfwbncd/Bo7nv06MrjR4/m9u8cyh/f/t3MLy6f8X06JU3w6zbhb2W7313d7mR7v5ddg6nsHExl12AqM8OV+/Xbq/ume91x/UgAANhghDwYkVJKdg2nsms4left2XnKY2qteWR2Id95dC4PHD6aowvLmV9cyvzCco4tLWd+9fHi8sptYWld+/F9RxeW1kYQD80t5PD84infb9VgqrsWAHcNpzIzmMrFO6Zz6Y5tuWTndC7ZMZ1Ld27LJTum87QLppWYAgBsYkIeTFApJRdt7+ei7f0ku0b2uotLyzl0dDEH5xZycG4hj84ey8G5hRxae7xwfN/cQr710Gxu+dYjefjIsce9VqckF18wnUt2Hg+Bu3dsy6U7p3PJjpUguGswlVKSkpLVqtJOp6Qka+2dkmTdMSVJp5R0Sslwupupbmdknx8AgOOEPNgCet3OuvB49o4tLufBx+bzwKGj+e6h+Tx4+GgeODyf7x5aub/v4NF88Z6DeejIfEa9RtNgqpudg152bpvKjm297BxMnbC9Y9vKvvXbuwa9DPu99Lol/W4nve7K/MV+t+MSFwAADSEPzmP9XmdtjuETWVhazkOPHcsDh1fC4OGjC6k1qVkpQV27X2tLamqWmwerbct1pe3I/GIOH13IobnFHDq6kMNHF/PwkWP51kOzOTS3kENHF7Kw9ORSZaeshN2V8FfS63TS75aVILgWCku6nZVg2O2UdEtp2la2u52Vx51SmmM66XZywnNKWRmRXL/dKcdHKdcerz+2rGxPdTuZWteflcclU71OpjrHt1f7OtV8ntXjLMwDAJyNDRPySinXJ/nfk3ST/J+11l9puUtAY6rbyWW7tuWyXdsm8n611swvLjeB73gQPDS3kNlji1lYqllYWs7iUs3C8nIWFmsWl1fmNS4u1SwuLedYc7+wtJyF5ZqFxeUsLtcsLtcsL68cv7i8nPnFmqXlmqVas7h0fHtp+fjt+HNqlpswuxJYa5aX121P6Io0nbJyzcfjq7KuW6SnWbDn5BVb1wLlaoDsHA+cx9s7meqsO2Y1lHZKapoQ/7hwnySPD/irj1f7uxqiV2+9TiedTtI7KUh3ThG8H//78fifSc2pf/glJd3uymutvV853rban07JaYNzrSt/9scWl1duSyv38yc9Xtk+Pse2lJJtzZ/Btl4n26a62db8eaxsr9wbiQZg1DZEyCuldJP8RpIfS3JPks+WUj5Sa7293Z4BbSilrH0hvuTUa9hsWMvLxwPf+vC3GhgXmkCwsLS8FhwWlpbXguvptldDxNkszDO/uJxH5xZW9q+GkaXlJvTWtfdemlQq3STWB8vVkdrVEDfuS8r2e521IDg91cm2XncldHdXRpXXjwL3mlHf1f1TneOhvNc53t5tnrcSqsu6++M
j06uvt7avCb9JThqhPx7yl08K+Mv1xHC/vO4kyXI98STJ2smTuu7Eyeq+pu1clLJ+JH79SYUTtzudU4/Wr47Srz9m9cTDCftO1dacLFgZuT9+8qCz+rrlxEqAJzqxcConVkvUx59UOenfnHrSv0PrqymWl088WZWs9Gu1f8nxedar7evnW5emYqGkpHROVdGw/nlOYGwFtR7/u9vrdJ707y+TtyFCXpIXJTlQa/16kpRSPpDkhiRCHrCpdDolnWyO//iWl5uR0LVRz+MjpMeWVkY6F5u2Uk5cWGf1//b1j0/YbvYlZW0kbP0X/KXllZHT5bV9y1laztr94vLy2vGn+iJxqp/wqb5vrH6hXTxNsFheHcGtj29brnXtOpb9bnN/0uPj+1fC2frjaq1rK+geXVhZFXd1ddzVx0eb1XXX7tcds7C0MuK8GvZnjy1mcbmu/XmtniRY/+e02Pxcjy2d+VItG8lqSFoNQk/V6uj66u/YRre+pLvm1CFusytN8F0fItcC4boR7Np82LWPXE+4W9t/wjGne89T9uPsfrHqKX7op3u/9SdSppqTJKsnXbqddSdbVk+kdI8fv/peK/8mrvw7tf4EyIn3606eNIH9cZ+1nPh49fMef7y6f+V37YR/B9f923eq6pWVEzaP//y9k04UrX7e3trP4MTPvnqC5PEnKnLC1I5TVY6s12kqMzrNCZXOut+r9Y9LOfGY0vxcjv//1PxEyok/p1P9H5ckr3/R3vzINbtP89uw8WyUkHdFkrvXPb4nyYtb6gvAeaHTKZnudDO9Uf4nYKTWj56thubVxwtLJz5eLVVePS7J2ijM+i9Gq1+cVvevjdacEO4fP5K2Nkp6Upls96Qv+qO0fuTh5C+uJ5djr//sx0Pi8RMOy6v3tT7uZMT6trWRs+XjI/gnj+jXk7aXmuevPzmy/ueZUk75BXTt2OYPZG3k8LRffB//5bfTOf4Ftmbls9esfIasG7VdHQlc/WK+vLzafnyEcbkeDyCrn//4Zz/+eHX/aog55UmcdYHkxMePP+bxf+6naDvNcad7jbM5iVRrTvjdWVg6/ndrcak20wNWt1dO1swtHG9L1o3qrhsF7jTBcdvaiY/VP9v1x678ZE4OwI8PyPWEn8n6n023CV0nlK2vH6k+TVunU9b+3iyu+3dkcd1JptUTdwvL/3979xNqaV3Hcfz98epQmOC/ScSZSaWBmEWNMIiRCxsophRtEWIUuAhmY2BQhLmRAhdurBZtJIdc9E80axAhBxuolfk31DTSUGowb6FDhWCNfl08v+s99+bGPOc89/n5fsHlPL/fuVy+HL4zv/N5nt9zTvH6zGuxfuJuc5CaPYm4OYht7Peh19auVG/qqw3/1t7YcEX7rUdqw+uxOUSuhczZ59de4+Ov/vftG2aLmtTSnuQgcBBg165dI1cjSdLWddJJYdtbAWpl1FrGkLalclJvdCRpTrbKF1UdA3bOjHe0uQ2q6raq2ldV+7Zvn87lUkmSJElalq0S8h4Cdie5IMk24Brg8Mg1SZIkSdLkbIldDFV1IslXgF8x7Ck5VFVPjVyWJEmSJE3Olgh5AFV1H3Df2HVIkiRJ0pRtle2akiRJkqQ5MORJkiRJUkcMeZIkSZLUEUOeJEmSJHXEkCdJkiRJHTHkSZIkSVJHDHmSJEmS1BFDniRJkiR1xJAnSZIkSR0x5EmSJElSRwx5kiRJktQRQ54kSZIkdcSQJ0mSJEkdMeRJkiRJUkcMeZIkSZLUEUOeJEmSJHXEkCdJkiRJHTHkSZIkSVJHDHmSJEmS1JFU1dg1/F+S/B14YQF/+mzgHwv4u9Lbsd+0LPaalsVe07LYa1qmrdpvH6qq7ZsnJxvyFiXJw1W1b+w69N5gv2lZ7DUti72mZbHXtExT6ze3a0qSJElSRwx5kiRJktQRQ97/um3sAvSeYr9pWew1LYu9pmWx17RMk+o378mTJEmSpI54JU+SJEmSOmLIm5HkQJI/Jnk2yQ1j16N+JDmUZDXJkzNzZyY5kuRP7fGMMWtUH5LsTHI0yR+SPJXk+jZvv2nukrwvye+S/L7127fa/AVJHmzr6c+SbBu7VvUhyUqSx5Lc28b2muYuyfNJnkjyeJKH29yk1lFDXpNkBfg+8BlgD/CFJHvGrUod+SFwYNPcDcADVbUbeKCNpXfrBPC1qtoDXAJc1/4vs9+0CK8B+6vqY8Be4ECSS4BbgO9U1YeBV4Avj1ij+nI98PTM2F7TonyyqvbOfG3CpNZRQ966i4Fnq+rPVfUf4KfAVSPXpE5U1W+AlzdNXwXc0Y7vAD631KLUpap6saoebcf/YngzdB72mxagBv9uw1PaTwH7gbvavP2muUiyA7gc+EEbB3tNyzOpddSQt+484C8z47+2OWlRzqmqF9vx34BzxixG/UlyPnAR8CD2mxakbZ97HFgFjgDPAcer6kT7FddTzct3gW8Ab7TxWdhrWowC7k/ySJKDbW5S6+jJYxcgaTgbnsSPutXcJPkAcDfw1ar653DCe2C/aZ6q6nVgb5LTgXuAj4xckjqU5ApgtaoeSXLZ2PWoe5dW1bEkHwSOJHlm9skprKNeyVt3DNg5M97R5qRFeSnJuQDtcXXketSJJKcwBLwfVdXP27T9poWqquPAUeDjwOlJ1k4ku55qHj4BXJnkeYZbavYD38Ne0wJU1bH2uMpw8upiJraOGvLWPQTsbp/StA24Bjg8ck3q22Hg2nZ8LfDLEWtRJ9o9KrcDT1fVrTNP2W+auyTb2xU8krwf+BTDfaBHgc+3X7Pf9K5V1TerakdVnc/wHu3XVfVF7DXNWZJTk5y2dgx8GniSia2jfhn6jCSfZdjvvQIcqqqbRy5JnUjyE+Ay4GzgJeAm4BfAncAu4AXg6qra/OEs0juS5FLgt8ATrN+3ciPDfXn2m+YqyUcZPoBgheHE8Z1V9e0kFzJcbTkTeAz4UlW9Nl6l6knbrvn1qrrCXtO8tZ66pw1PBn5cVTcnOYsJraOGPEmSJEnqiNs1JUmSJKkjhjxJkiRJ6oghT5IkSZI6YsiTJEmSpI4Y8iRJkiSpI4Y8SZIkSeqIIU+SJEmSOmLIkyRJkqSOvAlrrocnCr0k/AAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 1080x720 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# TODO: 统计一下qlist中出现1次,2次,3次... 出现的单词个数, 然后画一个plot. 这里的x轴是单词出现的次数(1,2,3,..), y轴是单词个数。\n",
    "#       从左到右分别是 出现1次的单词数,出现2次的单词数,出现3次的单词数... \n",
    "from collections import Counter\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "\n",
    "c = Counter(word_total_list)\n",
    "count_list = c.most_common(len(c))\n",
    "sorted_count_list = sorted(count_list, key=lambda x:x[1])\n",
    "keys = [item[1] for item in sorted_count_list]\n",
    "d = {}\n",
    "for k in set(keys):\n",
    "    d[k] = keys.count(k)\n",
    "sorted_d = sorted(d.items(), key=lambda x:x[0])\n",
    "x = [item[0] for item in sorted_d]\n",
    "y = [item[1] for item in sorted_d]\n",
    "plt.figure(figsize=(15, 10))\n",
    "plt.plot(x[:50], y[:50])\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: 从上面的图中能观察到什么样的现象? 这样的一个图的形状跟一个非常著名的函数形状很类似,能所出此定理吗? \n",
    "#       hint: [XXX]'s law\n",
    "# \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "齐夫定律(英语:Zipf's law)是由哈佛大学的语言学家乔治·金斯利·齐夫(George Kingsley Zipf)于1949年发表的实验定律。它可以表述为:在自然语言的语料库里,一个单词出现的频率与它在频率表里的排名成反比。所以,频率最高的单词出现的频率大约是出现频率第二位的单词的2倍,而出现频率第二位的单词则是出现频率第四位的单词的2倍。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.3 文本预处理\n",
    "此部分需要做文本方面的处理。 以下是可以用到的一些方法:\n",
    "\n",
    "- 1. 停用词过滤 (去网上搜一下 \"english stop words list\",会出现很多包含停用词库的网页,或者直接使用NLTK自带的)   \n",
    "- 2. 转换成lower_case: 这是一个基本的操作   \n",
    "- 3. 去掉一些无用的符号: 比如连续的感叹号!!!, 或者一些奇怪的单词。\n",
    "- 4. 去掉出现频率很低的词:比如出现次数少于10,20.... (想一下如何选择阈值)\n",
    "- 5. 对于数字的处理: 分词完只有有些单词可能就是数字比如44,415,把所有这些数字都看成是一个单词,这个新的单词我们可以定义为 \"#number\"\n",
    "- 6. lemmazation: 在这里不要使用stemming, 因为stemming的结果有可能不是valid word。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: 需要做文本方面的处理。 从上述几个常用的方法中选择合适的方法给qlist做预处理(不一定要按照上面的顺序,不一定要全部使用)\n",
    "from nltk.corpus import stopwords\n",
    "from nltk.tokenize import word_tokenize\n",
    "from nltk.stem import WordNetLemmatizer \n",
    "\n",
    "\n",
    "clean_qlist = []\n",
    "stop_words = set(stopwords.words('english'))\n",
    "wnl = WordNetLemmatizer() \n",
    "\n",
    "for q in qlist:\n",
    "    lower_q = q.lower()   # 转换成小写\n",
    "    words = word_tokenize(lower_q)  # 分词(会把标点等符号与单词分隔开)\n",
    "    words = [w for w in words if not w in stop_words]  # 去停用词\n",
    "    words = [w for w in words if not w in ['?',',','.','`','《》','!','``',\"''\",\"'t\",\"'s\",'&','...','---','___','#']]   # 去除常用标点符号\n",
    "    words = [w for w in words if not len(w) <= 1]   # 去掉单个字符和空字符\n",
    "    words = [w for w in words if not \"'\" in w]   # 去掉带'的词\n",
    "    words = [wnl.lemmatize(w) for w in words]  # lemmazation\n",
    "    words = ['#number' if w.isdigit() else w for w in words]   # 替换数字为新单词\n",
    "    \n",
    "    clean_qlist.append(' '.join(words))\n",
    "\n",
    "qlist = clean_qlist"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 第二部分: 文本的表示\n",
    "当我们做完必要的文本处理之后就需要想办法表示文本了,这里有几种方式\n",
    "\n",
    "- 1. 使用```tf-idf vector```\n",
    "- 2. 使用embedding技术如```word2vec```, ```bert embedding```等\n",
    "\n",
    "下面我们分别提取这三个特征来做对比。 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.1 使用tf-idf表示向量\n",
    "把```qlist```中的每一个问题的字符串转换成```tf-idf```向量, 转换之后的结果存储在```X```矩阵里。 ``X``的大小是: ``N* D``的矩阵。 这里``N``是问题的个数(样本个数),\n",
    "``D``是词典库的大小"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO \n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "vectorizer = TfidfVectorizer().fit(qlist)    # 定义一个tf-idf的vectorizer\n",
    "\n",
    "X_tfidf = vectorizer.transform(qlist)  # 结果存放在X矩阵里"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(86821, 31993)"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_tfidf.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.2 使用wordvec + average pooling\n",
    "词向量方面需要下载: https://nlp.stanford.edu/projects/glove/ (请下载``glove.6B.zip``),并使用``d=200``的词向量(200维)。国外网址如果很慢,可以在百度上搜索国内服务器上的。 每个词向量获取完之后,即可以得到一个句子的向量。 我们通过``average pooling``来实现句子的向量。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "the -0.071549 0.093459 0.023738 -0.090339 0.056123 0.32547 -0.39796 -0.092139 0.061181 -0.1895 0.13061 0.14349 0.011479 0.38158 0.5403 -0.14088 0.24315 0.23036 -0.55339 0.048154 0.45662 3.2338 0.020199 0.049019 -0.014132 0.076017 -0.11527 0.2006 -0.077657 0.24328 0.16368 -0.34118 -0.06607 0.10152 0.038232 -0.17668 -0.88153 -0.33895 -0.035481 -0.55095 -0.016899 -0.43982 0.039004 0.40447 -0.2588 0.64594 0.26641 0.28009 -0.024625 0.63302 -0.317 0.10271 0.30886 0.097792 -0.38227 0.086552 0.047075 0.23511 -0.32127 -0.28538 0.1667 -0.0049707 -0.62714 -0.24904 0.29713 0.14379 -0.12325 -0.058178 -0.001029 -0.082126 0.36935 -0.00058442 0.34286 0.28426 -0.068599 0.65747 -0.029087 0.16184 0.073672 -0.30343 0.095733 -0.5286 -0.22898 0.064079 0.015218 0.34921 -0.4396 -0.43983 0.77515 -0.87767 -0.087504 0.39598 0.62362 -0.26211 -0.30539 -0.022964 0.30567 0.06766 0.15383 -0.11211 -0.09154 0.082562 0.16897 -0.032952 -0.28775 -0.2232 -0.090426 1.2407 -0.18244 -0.0075219 -0.041388 -0.011083 0.078186 0.38511 0.23334 0.14414 -0.0009107 -0.26388 -0.20481 0.10099 0.14076 0.28834 -0.045429 0.37247 0.13645 -0.67457 0.22786 0.12599 0.029091 0.030428 -0.13028 0.19408 0.49014 -0.39121 -0.075952 0.074731 0.18902 -0.16922 -0.26019 -0.039771 -0.24153 0.10875 0.30434 0.036009 1.4264 0.12759 -0.073811 -0.20418 0.0080016 0.15381 0.20223 0.28274 0.096206 -0.33634 0.50983 0.32625 -0.26535 0.374 -0.30388 -0.40033 -0.04291 -0.067897 -0.29332 0.10978 -0.045365 0.23222 -0.31134 -0.28983 -0.66687 0.53097 0.19461 0.3667 0.26185 -0.65187 0.10266 0.11363 -0.12953 -0.68246 -0.18751 0.1476 1.0765 -0.22908 -0.0093435 -0.20651 -0.35225 -0.2672 -0.0034307 0.25906 0.21759 0.66158 0.1218 0.19957 -0.20303 0.34474 -0.24328 0.13139 -0.0088767 0.33617 0.030591 0.25577\n",
      ", 0.17651 0.29208 -0.0020768 -0.37523 0.0049139 0.23979 -0.28893 -0.014643 -0.10993 0.15592 0.20627 0.47675 0.099907 -0.14058 0.21114 0.12126 -0.31831 -0.089433 -0.090553 -0.31962 0.21319 2.4844 -0.077521 -0.084279 0.20186 0.26084 -0.40411 -0.19127 0.24715 0.22394 -0.063437 0.20379 -0.18463 -0.088413 0.024169 -0.28769 -0.61246 -0.12683 -0.088273 0.18331 -0.53161 -0.1997 -0.26703 0.15312 -0.015239 -0.082844 0.47856 -0.29612 0.11168 -0.02579 -0.011697 0.19923 -0.14267 0.6625 -0.051739 -0.16938 -0.15635 0.092806 0.32548 0.11724 0.28788 -0.060651 -0.14153 0.16668 0.26861 -0.031001 -0.39665 0.35304 0.2385 0.12388 0.45698 -0.12559 -0.12804 0.37449 0.2446 0.23073 0.20808 0.051258 -0.21816 -0.036409 -0.0388 -0.042487 -0.30779 -0.025449 0.22532 0.045538 -0.48934 -0.13988 0.17394 -0.46137 -0.26555 0.15473 0.063816 -0.17022 -0.15762 0.075765 0.12151 -0.4934 -0.10909 0.034487 0.29947 0.01869 -0.16534 0.016679 0.16341 -0.27418 0.077797 1.4023 0.025275 0.094725 -0.040735 -0.10642 0.023364 0.079143 -0.16615 -0.23013 -0.14071 0.40159 -0.34951 0.018545 0.22434 0.76922 0.24722 0.14936 0.42368 -0.72059 -0.038541 0.15522 0.33596 -0.43077 -0.026925 -0.37733 0.24271 -0.46495 0.45783 0.23693 0.079361 -0.32244 -0.42434 -0.11138 0.55426 0.085153 -0.020581 -0.046386 1.2467 0.13177 0.067092 -0.5778 0.013586 -0.071274 0.017311 0.089781 0.19857 -0.032205 0.64843 -0.23797 -0.19676 0.20203 0.21074 -0.50347 0.026823 -0.045444 -0.22642 -0.19977 -0.12138 0.16941 0.061998 0.42631 -0.088383 0.45756 0.077774 0.061342 0.4571 -0.17787 -0.14597 0.32654 0.002443 -0.11886 0.10081 -0.020011 1.0366 -0.39814 -0.6818 0.23685 -0.20396 -0.17668 -0.31385 0.14834 -0.052187 0.0613 -0.32582 0.19153 -0.15469 -0.14679 0.046971 0.032325 -0.22006 -0.20774 -0.23189 -0.10814\n"
     ]
    }
   ],
   "source": [
    "!head -2 ./data/glove.6B/glove.6B.200d.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "词向量矩阵维度: (34181, 200)\n",
      "句向量矩阵维度: (86821, 200)\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "# TODO 基于Glove向量获取句子向量\n",
    "glove_file_name = './data/glove.6B/glove.6B.200d.txt'\n",
    "\n",
    "# 将词转换为id\n",
    "word_to_id = {}\n",
    "unique_words = []\n",
    "[unique_words.extend(qua.split(' ')) for qua in qlist]\n",
    "unique_words = set(unique_words)\n",
    "\n",
    "for k in unique_words:\n",
    "    word_to_id[k] = len(word_to_id) + 1\n",
    "\n",
    "model = {}\n",
    "# 读取glove文件\n",
    "with open(glove_file_name, 'r') as f:\n",
    "    for line in f:\n",
    "        split_lines = line.split(' ')\n",
    "        word = split_lines[0]\n",
    "        model[word] = np.array(split_lines[1:], dtype='float')\n",
    "        \n",
    "# 将每个句子的每个词转换成id\n",
    "qlist_id = []\n",
    "for qua in qlist:\n",
    "    words_list = qua.split(' ')\n",
    "    qlist_id.append([word_to_id[w] for w in words_list])\n",
    "    \n",
    "# 求已有词向量均值和方差\n",
    "tmp = []\n",
    "for word, index in word_to_id.items():\n",
    "    try:\n",
    "        tmp.append(model[word])\n",
    "    except:\n",
    "        pass\n",
    "mean = np.mean(np.array(tmp))\n",
    "std = np.std(np.array(tmp))\n",
    "\n",
    "vocab_size = len(word_to_id)\n",
    "embed_size = 200\n",
    "\n",
    "# 生成词向量矩阵\n",
    "# 这是 D*H的矩阵,这里的D是词典库的大小, H是词向量的大小。 这里面我们给定的每个单词的词向量,\n",
    "# 这需要从文本中读取\n",
    "emb = np.random.normal(mean,std,[vocab_size, embed_size])  # 根据求得的已有词向量的均值和方差来随机生成\n",
    "for word, index in word_to_id.items():\n",
    "    try:\n",
    "        emb[index-1, :] = model[word]\n",
    "    except:\n",
    "        pass\n",
    "    \n",
    "print('词向量矩阵维度:', emb.shape)    \n",
    "\n",
    "# 构建句子向量\n",
    "X_w2v = []\n",
    "for q in qlist_id:\n",
    "    q_emb_list = np.array([emb[w-1,:] for w in q])\n",
    "    q_emb = np.mean(q_emb_list, axis=0)   # 求平均得到句向量\n",
    "    X_w2v.append(q_emb)\n",
    "    \n",
    "X_w2v = np.array(X_w2v) # 初始化完emb之后就可以对每一个句子来构建句子向量了,这个过程使用average pooling来实现\n",
    "\n",
    "print('句向量矩阵维度:', X_w2v.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 保存word_to_id和emb\n",
    "with open('./data/word2id.txt', 'w') as f:\n",
    "    f.write(str(word_to_id))\n",
    "    \n",
    "np.savetxt(\"./data/embedding.txt\", emb, fmt='%f',delimiter=',')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.3 使用BERT + average pooling\n",
    "最近流行的BERT也可以用来学出上下文相关的词向量(contex-aware embedding), 在很多问题上得到了比较好的结果。在这里,我们不做任何的训练,而是直接使用已经训练好的BERT embedding。 具体如何训练BERT将在之后章节里体会到。 为了获取BERT-embedding,可以直接下载已经训练好的模型从而获得每一个单词的向量。可以从这里获取: https://github.com/imgarylai/bert-embedding , 请使用```bert_12_768_12```\t当然,你也可以从其他source获取也没问题,只要是合理的词向量。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO 基于BERT的句子向量计算\n",
    "from bert_embedding import BertEmbedding\n",
    "\n",
    "bert_embedding = BertEmbedding(model='bert_12_768_12', dataset_name='book_corpus_wiki_en_cased')\n",
    "\n",
    "X_bert = bert_embedding(qlist[:100])   # 太慢了,跑100个意思意思  \n",
    "X_bert = np.array([np.mean(item[1], axis=0) for item in X_bert])  # 每一个句子的向量结果存放在X_bert矩阵里。行数为句子的总个数,列数为一个句子embedding大小。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(100, 768)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_bert.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 第三部分: 相似度匹配以及搜索\n",
    "在这部分里,我们需要把用户每一个输入跟知识库里的每一个问题做一个相似度计算,从而得出最相似的问题。但对于这个问题,时间复杂度其实很高,所以我们需要结合倒排表来获取相似度最高的问题,从而获得答案。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.1 tf-idf + 余弦相似度\n",
    "我们可以直接基于计算出来的``tf-idf``向量,计算用户最新问题与库中存储的问题之间的相似度,从而选择相似度最高的问题的答案。这个方法的复杂度为``O(N)``, ``N``是库中问题的个数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [],
   "source": [
    "import heapq\n",
    "\n",
    "# 利用priority queue性质求解Top-K问题\n",
    "def topk(inputs, k):\n",
    "    pq = []\n",
    "    pq_index = []\n",
    "    for index, value in enumerate(inputs):\n",
    "        if len(pq) < k:\n",
    "            heapq.heappush(pq, value)\n",
    "            heapq.heappush(pq_index, index)\n",
    "        elif value > pq[0]:\n",
    "            heapq.heapreplace(pq, value)\n",
    "            heapq.heapreplace(pq_index, index)\n",
    "    ret = list()\n",
    "    while pq_index:\n",
    "        ret.append(heapq.heappop(pq_index))\n",
    "    return ret[::-1]\n",
    "\n",
    "\n",
    "def get_top_results_tfidf_noindex(query):\n",
    "    # TODO 需要编写\n",
    "    \"\"\"\n",
    "    给定用户输入的问题 query, 返回最有可能的TOP 5问题。这里面需要做到以下几点:\n",
    "    1. 对于用户的输入 query 首先做一系列的预处理(上面提到的方法),然后再转换成tf-idf向量(利用上面的vectorizer)\n",
    "    2. 计算跟每个库里的问题之间的相似度\n",
    "    3. 找出相似度最高的top5问题的答案\n",
    "    \"\"\"\n",
    "    lower_query = query.lower()    # 小写\n",
    "    query_tokens = word_tokenize(lower_query)    # 分词\n",
    "    tokens = [w for w in query_tokens if not w in stop_words]  # 去停用词\n",
    "    tokens = [w for w in tokens if not w in ['?',',','.','`','《》','!','``',\"''\",\"'t\",\"'s\",'&','...','---','___','#']]   # 去除常用标点符号\n",
    "    tokens = [w for w in tokens if not len(w) <= 1]   # 去掉单个字符和空字符\n",
    "    tokens = [w for w in tokens if not \"'\" in w]  # 去掉带'的词\n",
    "    tokens = [wnl.lemmatize(w) for w in tokens]  # lemmazation\n",
    "    tokens = ['#number' if w.isdigit() else w for w in tokens]   # 替换数字为新单词\n",
    "    \n",
    "    word_tfidf = vectorizer.transform([' '.join(tokens)])  # 转换tf-idf向量\n",
    "    \n",
    "    # 计算余弦相似度\n",
    "    vector1 = word_tfidf.toarray()\n",
    "    vector2 = X_tfidf.toarray().T\n",
    "    cos = np.dot(vector1, vector2) / (np.linalg.norm(vector1)*(np.linalg.norm(vector2)))\n",
    "    inputs = list(cos)[0]\n",
    "    k = 5\n",
    "    top_idxs = topk(inputs, k)  # top_idxs存放相似度最高的(存在qlist里的)问题的下标 \n",
    "                                # hint: 请使用 priority queue来找出top results. 思考为什么可以这么做? \n",
    "    print('top_idxs', top_idxs)\n",
    "    alist_arr = np.array(alist)\n",
    "    return alist_arr[top_idxs]  # 返回相似度最高的问题对应的答案,作为TOP5答案    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "top_idxs [634, 628, 625, 311, 310]\n",
      "['20' 'six' '20' 'three' 'six']\n",
      "top_idxs [1776, 1757, 1756, 1755, 1754]\n",
      "['5 GB' '128 GB' '2 GB' 'iPod Touch' 'iPod Shuffle']\n",
      "top_idxs [86391, 86383, 86382, 86373, 86351]\n",
      "['19th' '19th century' '1996–2006' 'any particular class or culture'\n",
      " 'by state law']\n",
      "top_idxs [3886, 3345, 3091, 3090, 3089]\n",
      "['Manhattan' 'five' '8,491,079' '1898'\n",
      " 'Brooklyn, Queens, Manhattan, the Bronx, and Staten Island']\n"
     ]
    }
   ],
   "source": [
    "# TODO: 编写几个测试用例,并输出结果\n",
    "print (get_top_results_tfidf_noindex(\"How many did Beyonce win to set the record for Grammys?\"))\n",
    "print (get_top_results_tfidf_noindex(\"how does iPod product have the largest data capacity\"))\n",
    "print (get_top_results_tfidf_noindex(\"When do hunting regulations date from in the US?\"))\n",
    "print (get_top_results_tfidf_noindex(\"How many boroughs comprise New York City?\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "你会发现上述的程序很慢,没错! 是因为循环了所有库里的问题。为了优化这个过程,我们需要使用一种数据结构叫做```倒排表```。 使用倒排表我们可以把单词和出现这个单词的文档做关键。 之后假如要搜索包含某一个单词的文档,即可以非常快速的找出这些文档。 在这个QA系统上,我们首先使用倒排表来快速查找包含至少一个单词的文档,然后再进行余弦相似度的计算,即可以大大减少```时间复杂度```。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.2 倒排表的创建\n",
    "倒排表的创建其实很简单,最简单的方法就是循环所有的单词一遍,然后记录每一个单词所出现的文档,然后把这些文档的ID保存成list即可。我们可以定义一个类似于```hash_map```, 比如 ``inverted_index = {}``, 然后存放包含每一个关键词的文档出现在了什么位置,也就是,通过关键词的搜索首先来判断包含这些关键词的文档(比如出现至少一个),然后对于candidates问题做相似度比较。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO 请创建倒排表\n",
    "inverted_idx = {}  # 定一个一个简单的倒排表,是一个map结构。 循环所有qlist一遍就可以\n",
    "for k in unique_words:\n",
    "    inverted_idx[k] = []\n",
    "    for index,q in enumerate(qlist):\n",
    "        if k in q:\n",
    "            inverted_idx[k].append(index)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.3 语义相似度\n",
    "这里有一个问题还需要解决,就是语义的相似度。可以这么理解: 两个单词比如car, auto这两个单词长得不一样,但从语义上还是类似的。如果只是使用倒排表我们不能考虑到这些单词之间的相似度,这就导致如果我们搜索句子里包含了``car``, 则我们没法获取到包含auto的所有的文档。所以我们希望把这些信息也存下来。那这个问题如何解决呢? 其实也不难,可以提前构建好相似度的关系,比如对于``car``这个单词,一开始就找好跟它意思上比较类似的单词比如top 10,这些都标记为``related words``。所以最后我们就可以创建一个保存``related words``的一个``map``. 比如调用``related_words['car']``即可以调取出跟``car``意思上相近的TOP 10的单词。 \n",
    "\n",
    "那这个``related_words``又如何构建呢? 在这里我们仍然使用``Glove``向量,然后计算一下俩俩的相似度(余弦相似度)。之后对于每一个词,存储跟它最相近的top 10单词,最终结果保存在``related_words``里面。 这个计算需要发生在离线,因为计算量很大,复杂度为``O(V*V)``, V是单词的总数。 \n",
    "\n",
    "这个计算过程的代码请放在``related.py``的文件里,然后结果保存在``related_words.txt``里。 我们在使用的时候直接从文件里读取就可以了,不用再重复计算。所以在此notebook里我们就直接读取已经计算好的结果。 作业提交时需要提交``related.py``和``related_words.txt``文件,这样在使用的时候就不再需要做这方面的计算了。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "%run related.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO 读取语义相关的单词\n",
    "def get_related_words(file):\n",
    "    \n",
    "    with open('related_words.txt', 'r') as f:\n",
    "        related_words = eval(f.read())\n",
    "    \n",
    "    return related_words\n",
    "\n",
    "related_words = get_related_words('related_words.txt') # 直接放在文件夹的根目录下,不要修改此路径。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['landline',\n",
       " 'cellphone',\n",
       " 'mail',\n",
       " 'cable',\n",
       " 'wireless',\n",
       " 'internet',\n",
       " 'cellular',\n",
       " 'satellite',\n",
       " 'modem',\n",
       " 'computer']"
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "related_words['phone']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.4 利用倒排表搜索\n",
    "在这里,我们使用倒排表先获得一批候选问题,然后再通过余弦相似度做精准匹配,这样一来可以节省大量的时间。搜索过程分成两步:\n",
    "\n",
    "- 使用倒排表把候选问题全部提取出来。首先,对输入的新问题做分词等必要的预处理工作,然后对于句子里的每一个单词,从``related_words``里提取出跟它意思相近的top 10单词, 然后根据这些top词从倒排表里提取相关的文档,把所有的文档返回。 这部分可以放在下面的函数当中,也可以放在外部。\n",
    "- 然后针对于这些文档做余弦相似度的计算,最后排序并选出最好的答案。\n",
    "\n",
    "可以适当定义自定义函数,使得减少重复性代码"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 143,
   "metadata": {},
   "outputs": [],
   "source": [
    "def query_preprocessing(query):\n",
    "    \n",
    "    lower_query = query.lower()    # 小写\n",
    "    query_tokens = word_tokenize(lower_query)    # 分词\n",
    "    tokens = [w for w in query_tokens if not w in stop_words]  # 去停用词\n",
    "    tokens = [w for w in tokens if not w in ['?',',','.','`','《》','!','``',\"''\",\"'t\",\"'s\",'&','...','---','___','#']]   # 去除常用标点符号\n",
    "    tokens = [w for w in tokens if not len(w) <= 1]   # 去掉单个字符和空字符\n",
    "    tokens = [w for w in tokens if not \"'\" in w]  # 去掉带'的词\n",
    "    tokens = [wnl.lemmatize(w) for w in tokens]  # lemmazation\n",
    "    tokens = ['#number' if w.isdigit() else w for w in tokens]   # 替换数字为新单词\n",
    "    \n",
    "    return tokens\n",
    "\n",
    "def get_related_questions(tokens):\n",
    "    # 找到query中词相似的词,并包括query中的词\n",
    "    query_related_words = []\n",
    "    for token in tokens:\n",
    "        if not token in query_related_words:\n",
    "            query_related_words.append(token)\n",
    "        query_related_words.extend(related_words[token])\n",
    "        \n",
    "    query_related_words = set(query_related_words)\n",
    "    \n",
    "    # 找到所有相似词对应的候选问题\n",
    "    related_q = []\n",
    "    for k in query_related_words:\n",
    "        related_q.extend(inverted_idx[k])\n",
    "        \n",
    "    related_q = list(set(related_q))\n",
    "    \n",
    "    return related_q\n",
    "    \n",
    "\n",
    "def get_top_results(vector1, vector2, related_q):\n",
    "           \n",
    "    # 将query向量,与候选问题向量计算余弦相似度取topk\n",
    "    cos = np.dot(vector1, vector2) / (np.linalg.norm(vector1)*(np.linalg.norm(vector2)))\n",
    "    inputs = list(cos)\n",
    "    k = 5\n",
    "    top_idxs = topk(inputs, k)   # 此时对应related_q中的索引\n",
    "    \n",
    "    top_idxs = list(np.array(related_q)[top_idxs])  # top_idxs存放相似度最高的(存在qlist里的)问题的下表 \n",
    "    \n",
    "    return top_idxs\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 144,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_top_results_tfidf(query):\n",
    "    \"\"\"\n",
    "    给定用户输入的问题 query, 返回最有可能的TOP 5问题。这里面需要做到以下几点:\n",
    "    1. 利用倒排表来筛选 candidate (需要使用related_words). \n",
    "    2. 对于候选文档,计算跟输入问题之间的相似度\n",
    "    3. 找出相似度最高的top5问题的答案\n",
    "    \"\"\"\n",
    "    # 对query进行预处理\n",
    "    tokens = query_preprocessing(query)\n",
    "    # 找到相似的question候选集\n",
    "    related_q = get_related_questions(tokens)\n",
    "    \n",
    "    # 生成tf-idf词向量\n",
    "    vector1 = vectorizer.transform([' '.join(tokens)]).toarray()[0]\n",
    "    vector2 =  X_tfidf.toarray()[related_q].T\n",
    "    \n",
    "    # 计算tf-idf词向量余弦相似度\n",
    "    top_idxs = get_top_results(vector1, vector2, related_q)  # top_idxs存放相似度最高的(存在qlist里的)问题的下表 \n",
    "                                                             # hint: 利用priority queue来找出top results. 思考为什么可以这么做?\n",
    "    \n",
    "    return np.array(alist)[top_idxs]  # 返回相似度最高的问题对应的答案,作为TOP5答案"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 145,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_top_results_w2v(query):\n",
    "    \"\"\"\n",
    "    给定用户输入的问题 query, 返回最有可能的TOP 5问题。这里面需要做到以下几点:\n",
    "    1. 利用倒排表来筛选 candidate (需要使用related_words). \n",
    "    2. 对于候选文档,计算跟输入问题之间的相似度\n",
    "    3. 找出相似度最高的top5问题的答案\n",
    "    \"\"\"\n",
    "    # 对query进行预处理\n",
    "    tokens = query_preprocessing(query)\n",
    "    # 找到相似的question候选集\n",
    "    related_q = get_related_questions(tokens)\n",
    "    \n",
    "    # 生成word2vec词向量\n",
    "    query_emb = np.random.normal(mean, std, [len(tokens), 200])   # 根据之前计算得到的词向量均值和方差来生成query的词向量矩阵\n",
    "    for index,word in enumerate(tokens):\n",
    "        try:\n",
    "            query_emb[index, :] = model[word]\n",
    "        except:\n",
    "            pass\n",
    "        \n",
    "    vector1 = np.mean(query_emb, axis=0)\n",
    "    vector2 = X_w2v[related_q].T\n",
    "    \n",
    "    # 计算word2vec词向量余弦相似度\n",
    "    top_idxs = get_top_results(vector1, vector2, related_q)  # top_idxs存放相似度最高的(存在qlist里的)问题的下表 \n",
    "                                                             # hint: 利用priority queue来找出top results. 思考为什么可以这么做? \n",
    "    \n",
    "    return np.array(alist)[top_idxs]  # 返回相似度最高的问题对应的答案,作为TOP5答案"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_top_results_bert(query):\n",
    "    \"\"\"\n",
    "    给定用户输入的问题 query, 返回最有可能的TOP 5问题。这里面需要做到以下几点:\n",
    "    1. 利用倒排表来筛选 candidate (需要使用related_words). \n",
    "    2. 对于候选文档,计算跟输入问题之间的相似度\n",
    "    3. 找出相似度最高的top5问题的答案\n",
    "    \"\"\"\n",
    "    \n",
    "    top_idxs = []  # top_idxs存放相似度最高的(存在qlist里的)问题的下表 \n",
    "                   # hint: 利用priority queue来找出top results. 思考为什么可以这么做? \n",
    "    \n",
    "    return alist[top_idxs]  # 返回相似度最高的问题对应的答案,作为TOP5答案"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 158,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['an external source of energy' 'qualitative descriptions'\n",
      " 'permanent gonads' 'cell division' 'health services']\n",
      "['1876 to 1910' 'predators' 'Work' 'cell division'\n",
      " 'Plant-derived pesticides, or \"botanicals']\n",
      "['5 GB' '2 GB' '128 GB' '2 GB' 'iPod Touch']\n",
      "['5 GB' '2 GB' 'iPod Touch' 'iPod Shuffle' '128 GB']\n",
      "['Panchanada' 'Hangzhou Bay' 'Jiankang' 'during the Three Kingdoms'\n",
      " 'East China Sea']\n",
      "['Land of Fish and Rice' 'Fujian' 'Jiangxi' 'Anhui' 'Chekiang']\n",
      "['1971'\n",
      " 'the Admiralty, the War Office, the Air Ministry, the Ministry of Aviation, and an earlier form of the Ministry of Defence'\n",
      " '1946 to 1964' 'a seat in the Cabinet' '1946']\n",
      "['British' 'Murodali Alimardon and Ruqiya Qurbanova'\n",
      " 'Colonel General Seyran Ohanyan' 'Rosh HaMemshalah'\n",
      " 'Chairman of the government']\n"
     ]
    }
   ],
   "source": [
    "# TODO: 编写几个测试用例,并输出结果\n",
    "\n",
    "test_query1 = \"What does the growth, development, and reproduction of organisms rely on?\"\n",
    "test_query2 = \"what's the largest data capacity of iPod product?\"\n",
    "test_query3 = \"Zhejiang formerly romanized as\"\n",
    "test_query4 = \"which government's Ministry of Defence is mentioned here\"\n",
    "\n",
    "print (get_top_results_tfidf(test_query1))\n",
    "print (get_top_results_w2v(test_query1))\n",
    "# print (get_top_results_bert(test_query1))\n",
    "\n",
    "print (get_top_results_tfidf(test_query2))\n",
    "print (get_top_results_w2v(test_query2))\n",
    "# print (get_top_results_bert(test_query2))\n",
    "\n",
    "print(get_top_results_tfidf(test_query3))\n",
    "print (get_top_results_w2v(test_query3))\n",
    "\n",
    "print(get_top_results_tfidf(test_query4))\n",
    "print (get_top_results_w2v(test_query4))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. 拼写纠错\n",
    "其实用户在输入问题的时候,不能期待他一定会输入正确,有可能输入的单词的拼写错误的。这个时候我们需要后台及时捕获拼写错误,并进行纠正,然后再通过修正之后的结果再跟库里的问题做匹配。这里我们需要实现一个简单的拼写纠错的代码,然后自动去修复错误的单词。\n",
    "\n",
    "这里使用的拼写纠错方法是课程里讲过的方法,就是使用noisy channel model。 我们回想一下它的表示:\n",
    "\n",
    "$c^* = \\text{argmax}_{c\\in candidates} ~~p(c|s) = \\text{argmax}_{c\\in candidates} ~~p(s|c)p(c)$\n",
    "\n",
    "这里的```candidates```指的是针对于错误的单词的候选集,这部分我们可以假定是通过edit_distance来获取的(比如生成跟当前的词距离为1/2的所有的valid 单词。 valid单词可以定义为存在词典里的单词。 ```c```代表的是正确的单词, ```s```代表的是用户错误拼写的单词。 所以我们的目的是要寻找出在``candidates``里让上述概率最大的正确写法``c``。 \n",
    "\n",
    "$p(s|c)$,这个概率我们可以通过历史数据来获得,也就是对于一个正确的单词$c$, 有百分之多少人把它写成了错误的形式1,形式2...  这部分的数据可以从``spell_errors.txt``里面找得到。但在这个文件里,我们并没有标记这个概率,所以可以使用uniform probability来表示。这个也叫做channel probability。\n",
    "\n",
    "$p(c)$,这一项代表的是语言模型,也就是假如我们把错误的$s$,改造成了$c$, 把它加入到当前的语句之后有多通顺?在本次项目里我们使用bigram来评估这个概率。 举个例子: 假如有两个候选 $c_1, c_2$, 然后我们希望分别计算出这个语言模型的概率。 由于我们使用的是``bigram``, 我们需要计算出两个概率,分别是当前词前面和后面词的``bigram``概率。 用一个例子来表示:\n",
    "\n",
    "给定: ``We are go to school tomorrow``, 对于这句话我们希望把中间的``go``替换成正确的形式,假如候选集里有个,分别是``going``, ``went``, 这时候我们分别对这俩计算如下的概率:\n",
    "$p(going|are)p(to|going)$和 $p(went|are)p(to|went)$, 然后把这个概率当做是$p(c)$的概率。 然后再跟``channel probability``结合给出最终的概率大小。\n",
    "\n",
    "那这里的$p(are|going)$这些bigram概率又如何计算呢?答案是训练一个语言模型! 但训练一个语言模型需要一些文本数据,这个数据怎么找? 在这次项目作业里我们会用到``nltk``自带的``reuters``的文本类数据来训练一个语言模型。当然,如果你有资源你也可以尝试其他更大的数据。最终目的就是计算出``bigram``概率。 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4.1 训练一个语言模型\n",
    "在这里,我们使用``nltk``自带的``reuters``数据来训练一个语言模型。 使用``add-one smoothing``"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 179,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nltk.corpus import reuters\n",
    "import collections\n",
    "\n",
    "# 读取语料库的数据\n",
    "categories = reuters.categories()\n",
    "corpus = reuters.sents(categories=categories)\n",
    "\n",
    "# 循环所有的语料库并构建bigram probability. bigram[word1][word2]: 在word1出现的情况下下一个是word2的概率。 \n",
    "counts = {}\n",
    "bigram = {}\n",
    "for corpu in corpus:\n",
    "    for index, word in enumerate(corpu):\n",
    "        counts[word] = counts.get(word, 0) + 1\n",
    "        if index >= 1:\n",
    "            pre_word = corpu[index-1]\n",
    "            if pre_word not in bigram:\n",
    "                bigram[pre_word] = {}\n",
    "            bigram[pre_word][word] = bigram[pre_word].get(word, 0) + 1\n",
    "        \n",
    "v = len(counts.keys())\n",
    "\n",
    "for k1 in bigram.keys():\n",
    "    for k2 in bigram[k1].keys():\n",
    "        bigram[k1][k2] = (bigram[k1][k2] + 1) / (counts[k1] + v)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4.2 构建Channel Probs\n",
    "基于``spell_errors.txt``文件构建``channel probability``, 其中$channel[c][s]$表示正确的单词$c$被写错成$s$的概率。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 182,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "raining: rainning, raning\n",
      "writings: writtings\n",
      "disparagingly: disparingly\n",
      "yellow: yello\n",
      "four: forer, fours, fuore, fore*5, for*4\n"
     ]
    }
   ],
   "source": [
    "!head -5 spell-errors.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 188,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'rainning': 0.5, 'raning': 0.5}\n"
     ]
    }
   ],
   "source": [
    "# TODO 构建channel probability  \n",
    "channel = {}\n",
    "\n",
    "spell_error_dict = {}\n",
    "for line in open('spell-errors.txt'):\n",
    "    # TODO\n",
    "    item = line.split(\":\")\n",
    "    correct_word = item[0].strip()\n",
    "    spell_error_list = [word.strip() for word in item[1].strip().split(\",\")]\n",
    "    spell_error_dict[correct_word] = spell_error_list\n",
    "\n",
    "# TODO\n",
    "for k in spell_error_dict:\n",
    "    if k not in channel:\n",
    "        channel[k] = {}\n",
    "        for value in spell_error_dict[k]:\n",
    "            channel[k][value] = 1 / len(spell_error_dict[k])\n",
    "\n",
    "print(channel['raining'])   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4.3 根据错别字生成所有候选集合\n",
    "给定一个错误的单词,首先生成跟这个单词距离为1或者2的所有的候选集合。 这部分的代码我们在课程上也讲过,可以参考一下。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 259,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('vocab.txt', 'r') as f:\n",
    "    vocab = f.readlines()\n",
    "vocab = [item.strip('\\n') for item in vocab]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 205,
   "metadata": {},
   "outputs": [],
   "source": [
    "alphabet = \"abcdefghijklmnopqrstuvwxyz\"\n",
    "\n",
    "# 生成编辑距离为1的候选集合\n",
    "def editDist1(word):\n",
    "    n = len(word)\n",
    "    # 删除\n",
    "    s1 = [word[0:i] + word[i+1:] for i in range(n)]\n",
    "    # 调换相连的两个字母\n",
    "    s2 = [word[0:i] + word[i+1] + word[i] + word[i+2:] for i in range(n-1)]\n",
    "    # replace\n",
    "    s3 = [word[0:i] + c + word[i + 1:] for i in range(n) for c in alphabet]\n",
    "    # 插入\n",
    "    s4 = [word[0:i] + c + word[i:] for i in range(n + 1) for c in alphabet]\n",
    "    \n",
    "    edit1_words = set(s1 + s2 + s3 + s4)\n",
    "\n",
    "    if word in edit1_words:\n",
    "        edit1_words.remove(word)\n",
    "\n",
    "    \n",
    "    edit1_words = list(set(w for w in edit1_words if w in vocab))\n",
    "    \n",
    "    return edit1_words\n",
    "\n",
    "# 生成编辑距离为2的候选集合\n",
    "def editDist2(word, edit1_words):\n",
    "    edit2_words = set(e2 for e1 in edit1_words for e2 in editDist1(e1))\n",
    "    if word in edit2_words:\n",
    "        edit2_words.remove(word)\n",
    "    edit2_words = list(set(w for w in edit2_words if w in vocab))\n",
    "    return edit2_words\n",
    "    \n",
    "\n",
    "def generate_candidates(word):\n",
    "    # 基于拼写错误的单词,生成跟它的编辑距离为1或者2的单词,并通过词典库的过滤。\n",
    "    # 只留写法上正确的单词。 \n",
    "    edit1_words = editDist1(word)   # 编辑距离为1的候选项\n",
    "    edit2_words = editDist2(word, edit1_words)   # 编辑距离为2的候选项\n",
    "    \n",
    "    candidates = []\n",
    "    candidates.extend(edit1_words)\n",
    "    candidates.extend(edit2_words)\n",
    "    \n",
    "    return candidates\n"
   ]
  },
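  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick sanity check of the candidate generator (the misspelled words below are just examples; the exact output depends on the contents of ``vocab.txt``):\n",
    "\n",
    "```python\n",
    "# Candidates within edit distance 1 or 2 of a misspelled word.\n",
    "print(generate_candidates('whhat')[:10])\n",
    "print(generate_candidates('terrtory')[:10])\n",
    "```"
   ]
  },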
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4.4 给定一个输入,如果有错误需要纠正\n",
    "\n",
    "给定一个输入``query``, 如果这里有些单词是拼错的,就需要把它纠正过来。这部分的实现可以简单一点: 对于``query``分词,然后把分词后的每一个单词在词库里面搜一下,假设搜不到的话可以认为是拼写错误的! 人如果拼写错误了再通过``channel``和``bigram``来计算最适合的候选。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 294,
   "metadata": {},
   "outputs": [],
   "source": [
    "qlist_word_counts = {}\n",
    "clean_word_total_list = []  # 存储去除标点后的word\n",
    "\n",
    "for word in word_total_list:\n",
    "     clean_word_total_list.append(''.join([e for e in word if e not in ['?',',','.','`','《》','!','``',\"''\",\"'t\",\"'s\",'&','...','---','___','#']]))\n",
    "    \n",
    "for word in clean_word_total_list:\n",
    "    qlist_word_counts[word] = qlist_word_counts.get(word, 0) + 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 296,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "def getCorrectestWord(input_word, token_list, cur_index):\n",
    "    candidates = generate_candidates(input_word.lower())\n",
    "\n",
    "    if len(candidates) == 0:\n",
    "        return input_word\n",
    "\n",
    "    candidates_spell_error = []\n",
    "    for candidate in candidates:\n",
    "        if candidate in channel and input_word in channel[candidate]:\n",
    "            candidates_spell_error.append(candidate)\n",
    "\n",
    "    if len(candidates_spell_error) == 0:\n",
    "        for w in candidates:\n",
    "            qlist_word_counts[w] = qlist_word_counts.get(w, 0) + 1\n",
    "        return max(candidates, key=lambda w: qlist_word_counts[w])\n",
    "\n",
    "    candidates = candidates_spell_error\n",
    "\n",
    "    if(len(token_list) == 1):\n",
    "        return max(candidates, key = lambda w: channel[w][input_word])\n",
    "\n",
    "    bein_pos_state = cur_index == 0\n",
    "    middle_pos_state = cur_index > 0 and  cur_index < len(token_list) - 1\n",
    "    end_pos_state = cur_index == len(token_list) - 1\n",
    "\n",
    "    prev_word = token_list[cur_index - 1] if cur_index > 0 else None\n",
    "    next_word = token_list[cur_index + 1] if cur_index < len(token_list) - 1 else None\n",
    "\n",
    "    candidates_bigram = []\n",
    "\n",
    "    candidates_bigram_value_dic = {candidate:0 for candidate in candidates}\n",
    "\n",
    "    for candidate in candidates:\n",
    "        is_bigram_right = False\n",
    "        is_bigram_left = False\n",
    "\n",
    "        if bein_pos_state or middle_pos_state:\n",
    "            if bein_pos_state:\n",
    "                is_bigram_left = True\n",
    "            if candidate in word_bigram_dict and next_word != None and next_word in word_bigram_dict[candidate]:\n",
    "                is_bigram_right = True\n",
    "\n",
    "        if end_pos_state or middle_pos_state:\n",
    "            if end_pos_state:\n",
    "                is_bigram_right = True\n",
    "            if prev_word in word_bigram_dict and prev_word != None and candidate in word_bigram_dict[prev_word]:\n",
    "                is_bigram_left = True\n",
    "        is_bigram = is_bigram_left and is_bigram_right\n",
    "        if is_bigram:\n",
    "            candidates_bigram.append(candidate)\n",
    "            bigram_left_prob = 1 if bein_pos_state else word_bigram_dict[prev_word][candidate]\n",
    "            bigram_right_prob = 1 if end_pos_state else word_bigram_dict[candidate][next_word]\n",
    "            candidates_bigram_value_dic[candidate] = bigram_left_prob * bigram_right_prob\n",
    "\n",
    "    if len(candidates_bigram) == 0:\n",
    "        return max(candidates, key=lambda w: channel[w][input_word])\n",
    "\n",
    "    candidates = candidates_bigram\n",
    "\n",
    "    return max(candidates, key=lambda w: channel[w][input_word] * candidates_bigram_value_dic[candidate])\n",
    "\n",
    "def word_preprocessing(word):\n",
    "    # 去除标点\n",
    "    word = ''.join([e for e in word if e not in ['?',',','.','`','《》','!','``',\"''\",\"'t\",\"'s\",'&','...','---','___','#']])\n",
    "    return word\n",
    "\n",
    "def spell_corrector(line):\n",
    "    # 1. 首先做分词,然后把``line``表示成``tokens``\n",
    "    # 2. 循环每一token, 然后判断是否存在词库里。如果不存在就意味着是拼写错误的,需要修正。 \n",
    "    #    修正的过程就使用上述提到的``noisy channel model``, 然后从而找出最好的修正之后的结果。  \n",
    "    \n",
    "    tokens = [word.strip() for word in line.split()]\n",
    "    newline = \"\"\n",
    "    for index, token in enumerate(tokens):\n",
    "        token = word_preprocessing(token)  # 对单词进行预处理,这里只去除标点\n",
    "        if token == None or token.strip() == \"\":\n",
    "            continue\n",
    "        if token not in vocab: #默认单词拼错了\n",
    "            token = getCorrectestWord(token, tokens, index)\n",
    "        newline += token + \" \"\n",
    "   \n",
    "    return newline   # 修正之后的结果,假如用户输入没有问题,那这时候``newline = line``\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4.5 基于拼写纠错算法,实现用户输入自动矫正\n",
    "首先有了用户的输入``query``, 然后做必要的处理把句子转换成tokens的形状,然后对于每一个token比较是否是valid, 如果不是的话就进行下面的修正过程。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 303,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "In what territory is New Delhi located \n",
      "How many Indian cities have been selected to be developed as a smart city \n",
      "['subtropical climate' 'the Indo-Gangetic Plain' 'a union territory'\n",
      " 'The Constitution (Sixty-ninth Amendment) Act' '1956']\n",
      "['Delhi' 'Southern Connecticut State University' 'New Haven County,'\n",
      " 'New Haven County' 'on the south bank']\n",
      "['to promote nutrition literacy' '439,896' 'Melbourne' 'New Delhi'\n",
      " 'New Delhi']\n",
      "['nine'\n",
      " 'The most prominent of these is Utrecht University (est. 1636), the largest university of the Netherlands with 30,449 students'\n",
      " '439,896' 'Pyongyang' 'Private FM stations']\n"
     ]
    }
   ],
   "source": [
    "test_query1 = \"In whhat terrtory is New Delhi located?\"  # 拼写错误的\n",
    "test_query2 = \"How many Indian cities have been selected to be develped as a smart city?\"  # 拼写错误的\n",
    "\n",
    "test_query1 = spell_corrector(test_query1)\n",
    "test_query2 = spell_corrector(test_query2)\n",
    "\n",
    "print(test_query1)\n",
    "print(test_query2)\n",
    "\n",
    "print (get_top_results_tfidf(test_query1))\n",
    "print (get_top_results_w2v(test_query1))\n",
    "# print (get_top_results_bert(test_query1))\n",
    "\n",
    "print (get_top_results_tfidf(test_query2))\n",
    "print (get_top_results_w2v(test_query2))\n",
    "# print (get_top_results_bert(test_query2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 总结\n",
    "- 从我实现结果来看,词向量语义相似度计算结果并不是很理想(仿佛能够得到近似答案,但是几乎很难精准获取答案,而且很多时候获取到了精准答案不是排在top1的位置),word2vec和tf-idf词向量表示的效果差异不明显,不清楚问题出在哪里,有几个不确定的点:首先是在词向量表示过程中,出现不存在的词向量如何表示,其次是不确定利用优先队列思想实现的topk函数编写逻辑是否正确;\n",
    "- 基于bert的句向量表示过程很慢,因此我没有耐心跑完,不清楚结果会不会更好;\n",
    "- 单词拼写纠错如何提升效果,希望老师给些建议。"
   ]
  },
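  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Regarding the top-k question above: ``heapq.nlargest`` gives a standard priority-queue-based top-k that a hand-written version can be checked against. The names below (``scores``, ``topk_indices``) are illustrative only:\n",
    "\n",
    "```python\n",
    "import heapq\n",
    "\n",
    "# Reference top-k: indices of the k largest similarity scores.\n",
    "def topk_indices(scores, k=5):\n",
    "    return heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])\n",
    "\n",
    "# Example: topk_indices([0.1, 0.9, 0.3, 0.7], k=2)  ->  [1, 3]\n",
    "```"
   ]
  },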
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 附录 \n",
    "在本次项目中我们实现了一个简易的问答系统。基于这个项目,我们其实可以有很多方面的延伸。\n",
    "- 在这里,我们使用文本向量之间的余弦相似度作为了一个标准。但实际上,我们也可以基于基于包含关键词的情况来给一定的权重。比如一个单词跟related word有多相似,越相似就意味着相似度更高,权重也会更大。 \n",
    "- 另外 ,除了根据词向量去寻找``related words``也可以提前定义好同义词库,但这个需要大量的人力成本。 \n",
    "- 在这里,我们直接返回了问题的答案。 但在理想情况下,我们还是希望通过问题的种类来返回最合适的答案。 比如一个用户问:“明天北京的天气是多少?”, 那这个问题的答案其实是一个具体的温度(其实也叫做实体),所以需要在答案的基础上做进一步的抽取。这项技术其实是跟信息抽取相关的。 \n",
    "- 对于词向量,我们只是使用了``average pooling``, 除了average pooling,我们也还有其他的经典的方法直接去学出一个句子的向量。\n",
    "- 短文的相似度分析一直是业界和学术界一个具有挑战性的问题。在这里我们使用尽可能多的同义词来提升系统的性能。但除了这种简单的方法,可以尝试其他的方法比如WMD,或者适当结合parsing相关的知识点。 "
   ]
  },
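  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A pointer for the WMD idea mentioned above: gensim's ``KeyedVectors.wmdistance`` provides a ready-made implementation (it needs an additional optimal-transport package such as ``POT``/``pyemd`` installed). The sketch below assumes the GloVe vectors have been converted to word2vec format and loaded into a gensim ``KeyedVectors`` object; the file name is a placeholder, and a smaller distance means the two questions are more similar.\n",
    "\n",
    "```python\n",
    "# Hypothetical sketch of Word Mover's Distance with gensim.\n",
    "# from gensim.models import KeyedVectors\n",
    "# kv = KeyedVectors.load_word2vec_format('glove.6B.200d.w2v.txt')  # converted GloVe file\n",
    "\n",
    "q1 = 'in what territory is new delhi located'.split()\n",
    "q2 = 'which territory contains the city of new delhi'.split()\n",
    "# distance = kv.wmdistance(q1, q2)   # smaller distance = more similar questions\n",
    "```"
   ]
  },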
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "好了,祝你好运! "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}