{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## 搭建一个简单的问答系统 (Building a Simple QA System)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"本次项目的目标是搭建一个基于检索式的简易的问答系统,这是一个最经典的方法也是最有效的方法。 \n",
"\n",
"```不要单独创建一个文件,所有的都在这里面编写,不要试图改已经有的函数名字 (但可以根据需求自己定义新的函数)```\n",
"\n",
"```预估完成时间```: 5-10小时"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 检索式的问答系统\n",
"问答系统所需要的数据已经提供,对于每一个问题都可以找得到相应的答案,所以可以理解为每一个样本数据是 ``<问题、答案>``。 那系统的核心是当用户输入一个问题的时候,首先要找到跟这个问题最相近的已经存储在库里的问题,然后直接返回相应的答案即可(但实际上也可以抽取其中的实体或者关键词)。 举一个简单的例子:\n",
"\n",
"假设我们的库里面已有存在以下几个<问题,答案>:\n",
"- <\"贪心学院主要做什么方面的业务?”, “他们主要做人工智能方面的教育”>\n",
"- <“国内有哪些做人工智能教育的公司?”, “贪心学院”>\n",
"- <\"人工智能和机器学习的关系什么?\", \"其实机器学习是人工智能的一个范畴,很多人工智能的应用要基于机器学习的技术\">\n",
"- <\"人工智能最核心的语言是什么?\", ”Python“>\n",
"- .....\n",
"\n",
"假设一个用户往系统中输入了问题 “贪心学院是做什么的?”, 那这时候系统先去匹配最相近的“已经存在库里的”问题。 那在这里很显然是 “贪心学院是做什么的”和“贪心学院主要做什么方面的业务?”是最相近的。 所以当我们定位到这个问题之后,直接返回它的答案 “他们主要做人工智能方面的教育”就可以了。 所以这里的核心问题可以归结为计算两个问句(query)之间的相似度。"
]
},
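{
"cell_type": "markdown",
"metadata": {},
"source": [
"下面给出一个极简的示意(并不是本项目要求的实现方式):假设已经安装好``jieba``和``sklearn``,用 tf-idf 向量加余弦相似度来比较用户问题和库中问题,问句沿用上面的例子。完整的流程会在后面的章节里一步步实现。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 极简示意:jieba 分词 + tf-idf + 余弦相似度,比较用户问题与库中问题\n",
"# (问句和参数均为示例,仅用于说明如何计算两个 query 的相似度)\n",
"import jieba\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.metrics.pairwise import cosine_similarity\n",
"\n",
"q_library = ['贪心学院主要做什么方面的业务?', '人工智能最核心的语言是什么?']\n",
"q_user = '贪心学院是做什么的?'\n",
"\n",
"# 中文需要先分词,再用空格拼接成 sklearn 可以直接处理的形式\n",
"corpus = [' '.join(jieba.cut(q)) for q in q_library + [q_user]]\n",
"\n",
"vectorizer = TfidfVectorizer().fit(corpus)\n",
"lib_vec = vectorizer.transform(corpus[:-1])   # 库中问题的向量\n",
"user_vec = vectorizer.transform(corpus[-1:])  # 用户问题的向量\n",
"\n",
"# 用户问题与库中每一个问题的余弦相似度;相似度最高的问题,其答案即可返回\n",
"sims = cosine_similarity(user_vec, lib_vec)[0]\n",
"print(sims)"
]
},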
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 项目中涉及到的任务描述\n",
"问答系统看似简单,但其中涉及到的内容比较多。 在这里先做一个简单的解释,总体来讲,我们即将要搭建的模块包括:\n",
"\n",
"- 文本的读取: 需要从相应的文件里读取```(问题,答案)```\n",
"- 文本预处理: 清洗文本很重要,需要涉及到```停用词过滤```等工作\n",
"- 文本的表示: 如果表示一个句子是非常核心的问题,这里会涉及到```tf-idf```, ```Glove```以及```BERT Embedding```\n",
"- 文本相似度匹配: 在基于检索式系统中一个核心的部分是计算文本之间的```相似度```,从而选择相似度最高的问题然后返回这些问题的答案\n",
"- 倒排表: 为了加速搜索速度,我们需要设计```倒排表```来存储每一个词与出现的文本\n",
"- 词义匹配:直接使用倒排表会忽略到一些意思上相近但不完全一样的单词,我们需要做这部分的处理。我们需要提前构建好```相似的单词```然后搜索阶段使用\n",
"- 拼写纠错:我们不能保证用户输入的准确,所以第一步需要做用户输入检查,如果发现用户拼错了,我们需要及时在后台改正,然后按照修改后的在库里面搜索\n",
"- 文档的排序: 最后返回结果的排序根据文档之间```余弦相似度```有关,同时也跟倒排表中匹配的单词有关\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 项目中需要的数据:\n",
"1. ```dev-v2.0.json```: 这个数据包含了问题和答案的pair, 但是以JSON格式存在,需要编写parser来提取出里面的问题和答案。 \n",
"2. ```glove.6B```: 这个文件需要从网上下载,下载地址为:https://nlp.stanford.edu/projects/glove/, 请使用d=200的词向量\n",
"3. ```spell-errors.txt``` 这个文件主要用来编写拼写纠错模块。 文件中第一列为正确的单词,之后列出来的单词都是常见的错误写法。 但这里需要注意的一点是我们没有给出他们之间的概率,也就是p(错误|正确),所以我们可以认为每一种类型的错误都是```同等概率```\n",
"4. ```vocab.txt``` 这里列了几万个英文常见的单词,可以用这个词库来验证是否有些单词被拼错\n",
"5. ```testdata.txt``` 这里搜集了一些测试数据,可以用来测试自己的spell corrector。这个文件只是用来测试自己的程序。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在本次项目中,你将会用到以下几个工具:\n",
"- ```sklearn```。具体安装请见:http://scikit-learn.org/stable/install.html sklearn包含了各类机器学习算法和数据处理工具,包括本项目需要使用的词袋模型,均可以在sklearn工具包中找得到。 \n",
"- ```jieba```,用来做分词。具体使用方法请见 https://github.com/fxsjy/jieba\n",
"- ```bert embedding```: https://github.com/imgarylai/bert-embedding\n",
"- ```nltk```:https://www.nltk.org/index.html"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 第一部分:对于训练数据的处理:读取文件和预处理"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- ```文本的读取```: 需要从文本中读取数据,此处需要读取的文件是```dev-v2.0.json```,并把读取的文件存入一个列表里(list)\n",
"- ```文本预处理```: 对于问题本身需要做一些停用词过滤等文本方面的处理\n",
"- ```可视化分析```: 对于给定的样本数据,做一些可视化分析来更好地理解数据"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1.1节: 文本的读取\n",
"把给定的文本数据读入到```qlist```和```alist```当中,这两个分别是列表,其中```qlist```是问题的列表,```alist```是对应的答案列表"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"def read_corpus():\n",
" \"\"\"\n",
" 读取给定的语料库,并把问题列表和答案列表分别写入到 qlist, alist 里面。 在此过程中,不用对字符换做任何的处理(这部分需要在 Part 2.3里处理)\n",
" qlist = [\"问题1\", “问题2”, “问题3” ....]\n",
" alist = [\"答案1\", \"答案2\", \"答案3\" ....]\n",
" 务必要让每一个问题和答案对应起来(下标位置一致)\n",
" \"\"\"\n",
" # TODO 需要完成的代码部分 ...\n",
" import json\n",
" import jsonpath\n",
" \n",
" with open('train-v2.0.json', 'r', encoding='utf-8') as f:\n",
" dic = json.load(f)\n",
" \n",
" qlist = jsonpath.jsonpath(dic,'$..question') \n",
" alist = jsonpath.jsonpath(dic,'$..text') \n",
" f.close()\n",
" assert len(qlist) == len(alist) # 确保长度一样\n",
" \n",
" return qlist, alist"
]
},
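{
"cell_type": "markdown",
"metadata": {},
"source": [
"上面用``jsonpath``的写法很简洁,但``'$..text'``会把 JSON 里所有叫``text``的字段都取出来(包括``plausible_answers``里的),问题和答案能否一一对应依赖于具体文件的结构。下面给出一个不依赖``jsonpath``、按 SQuAD 2.0 的层级结构(``data → paragraphs → qas``)显式遍历的参考写法(仅作示意,``dev-v2.0.json``即项目给定的数据文件):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 参考写法(示意):显式遍历 SQuAD 2.0 的层级结构,保证每个问题恰好对应一个答案\n",
"import json\n",
"\n",
"def read_corpus_explicit(path='dev-v2.0.json'):\n",
"    qlist, alist = [], []\n",
"    with open(path, 'r', encoding='utf-8') as f:\n",
"        data = json.load(f)['data']\n",
"    for article in data:\n",
"        for para in article['paragraphs']:\n",
"            for qa in para['qas']:\n",
"                # 无答案的问题(is_impossible)退而取 plausible_answers,两者都没有就跳过\n",
"                answers = qa['answers'] if qa['answers'] else qa.get('plausible_answers', [])\n",
"                if not answers:\n",
"                    continue\n",
"                qlist.append(qa['question'])\n",
"                alist.append(answers[0]['text'])\n",
"    assert len(qlist) == len(alist)\n",
"    return qlist, alist"
]
},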
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1.2 理解数据(可视化分析/统计信息)\n",
"对数据的理解是任何AI工作的第一步, 需要对数据有个比较直观的认识。在这里,简单地统计一下:\n",
"\n",
"- 在```qlist```出现的总单词个数\n",
"- 按照词频画一个```histogram``` plot"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"53157\n"
]
}
],
"source": [
"# TODO: 统计一下在qlist中总共出现了多少个单词? 总共出现了多少个不同的单词(unique word)?\n",
"# 这里需要做简单的分词,对于英文我们根据空格来分词即可,其他过滤暂不考虑(只需分词)\n",
"from nltk import word_tokenize\n",
"\n",
"qlist, alist = read_corpus()\n",
"\n",
"qwords = []\n",
"for sentence in qlist:\n",
" qwords.append(word_tokenize(sentence))\n",
"\n",
"mydist = {}\n",
"\n",
"for sentence in qwords:\n",
" for word in sentence:\n",
" if word not in mydist:\n",
" mydist[word] =1\n",
" else:\n",
" mydist[word] +=1\n",
"\n",
"word_total = len(mydist)\n",
"print (word_total)\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAZcAAAEbCAYAAAAWFMmuAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nO3deXxU1fnH8c+TALJD2FdZFFFAQQkKKopKXVrr0lZb26pVW7tatZta21qttra1tWp/tda6Vdu6tFrBahFBcAVJZN9kX2SHAIEQQpLn98c9gSEkYSaZyWT5vl+vec3MuffccwjJPHPWa+6OiIhIMmWkuwIiItLwKLiIiEjSKbiIiEjSKbiIiEjSKbiIiEjSKbiIiEjSpTS4mFl7M/uXmS0ys4VmNsrMOpjZRDNbEp6zwrlmZg+a2VIzm2NmJ8Vc5+pw/hIzuzomfbiZzQ15HjQzC+kVliEiIrUj1S2XB4D/ufuxwFBgIXArMMndBwCTwnuAC4AB4XE98DBEgQK4AzgFOBm4IyZYPBzOLct3fkivrAwREakFlqpFlGbWFpgN9PeYQsxsMTDG3debWXdgirsPNLNHwut/xp5X9nD3r4f0R4Ap4fFmCFyY2RVl51VWRlX17dSpk/ft27da/9Y9e/bQokWLauVVfuVXfuWvz/lzc3O3uHvnQw64e0oewDDgA+BJYCbwV6AVsL3ceXnh+RXg9Jj0SUA28APgJzHpPw1p2cAbMemjgVfC6wrLqOoxfPhwr66cnJxq51V+5Vd+5a/P+YEcr+AzNZUtl2xgGnCau083sweAncAN7t4+5rw8d88ys/8Cv3L3d0L6JOBHwNnAEe5+d0j/KVAAvBXOHxvSRwM/cvdPm9n2isqooI7XE3Wr0b179+Hjx4+v1r+1oKCAli1bViuv8iu/8it/fc6fnZ2d6+7ZhxyoKOIk4wF0A1bGvB8N/BdYDHQPad2BxeH1I8AVMecvDsevAB6JSX8kpHUHFsWk7z+vsjKqeqjlovzKr/zKnzgqabmkbEDf3TcAa8ysbKzjHGABMA4om/F1NfByeD0OuCrMGhsJ7HD39cAE4FwzywoD+ecCE8KxfDMbGWaJXVXuWhWVISIitaBJiq9/A/B3M2sGLAeuIZqh9ryZXQesBi4L574KfBJYStTtdQ2Au28zs18AM8J5d7n7tvD6m0RjOi2A18ID4N5KyhARkVqQ0uDi7rOIBt7LO6eCcx34diXXeRx4vIL0HGBIBelbKypDRERqh1boi4hI0qW6W0xEROoYd2dt3h5mrdnO67N30rpnPgO7tUlqGQouIiIN3LbdRcxeu53Za8Jj7Q627S7af/ykgVsUXEREpHJ7ikqYv24Hs0IQmb1mO6u3FRxyXsdWzRjauz2dMws4uV+HpNdDwUVEpJ4qKXVW7djH0hmrmbUmCiSLN+ZTUnrw4vgWTTM5vmc7hvZux9De7Rnaqz29slpgZuTm5jK4R7uk103BRUSkHnB31u8oZPaa7cwKj7kf76CgqATYuv+8zAzjuO5tGda7HUN7tWdo7/YM6NKaJpm1O39LwUVEpA7KL9zHnLU79geS2Wu2syl/7yHndWmZyclHd2FY7yiQDO7RlpbN0v/Rnv4aiIg0cvtKSlm8IX9/IJm1ZjvLNu+i/NaP7Vo0ZWjv9gzr1Y5hR0bdWysXz2P48JMqvnAaKbiIiNSyrbv2krMqj9xVeUydv5WVL01gb3HpQec0y8zguB5t9weSYb2z6NuxJeGeiPutrMV6J0LBRUQkhdyd5Vt2k7syj5xV28hZmcfyLbsPOa9fp1YM7dWOYb3bM+zILI7r3oYjmmSmocbJoeAiIpJEe4tLmPfxTnJXbWPGyjw+XJXH1pg1JQDNm2YwrHd7RvTtQNuirVx2djbtWzZLU41TQ8FFRKQGdhWVMnnRRmaszCN3ZR6z1m6nqFwXV+c2R5DdJ4vhfbIY0bcDg3q0pWmYvZWbm9vgAgsouIiIJGTHnn18sGIb7y3bwvvLtrJoQz6w6aBzBnRpTXbfLLL7dCC7bxZHdjh0rKShU3AREanCrr3FzFi5jfeXbeX9ZVuZv24HsWsUm2bAiUd2YHjfrP2tk4bYEkmUgouISIw9RSXkrsqLWibLtzJn7Y6DVrw3zTSye2cx8qiOjOrfEbauYNTJFd1ZpHFTcBGRRm1fiTNt+VbeW7aVacu2MnNNHvtKDgSTzAxjWO/2nHpUR0Yd1ZHhfbIOWqSYu31lGmpd9ym4iEij4u4s3bSLqR9tZupHm5m+bAtFpRv3HzeDIT3bMqp/R049qhPZfbNo07xpGmtcPym4iEiDt6NgH+8u28LUxZt5a8lm1u8oPOj4sd3aMLJ/1DIZ2a8j7VoqmNSUgouINDglpc7cj3fsDyYzV+cdNAjfqXUzRg/ozBnHdKJtwTrOOe3k9FW2gVJwEZEGYePOQt4KXV3vLN3C9oJ9+481yTCy+2Zx5jGdOfOYzgzq3paMjGhqcG7uxsouKTWg4CIi9VJRcSk5q7bx3Jx8bn/7rbDe5IDeHVpwxoAomIw6qqPGTWqZgouI1Bvrtu9hyuLNTFm8iXeXbmF3Ucn+Yy2aZjLqqI6cMaATZw7sUuEmj1J7FFxEpM4qa52UBZSPNu466PgxXVtzbLtSPn/G8WT3zarXGz02NAouIlKnVNU6adUsk9OO7sSYgV04c2BnerZvQW5uLsOP7pTGGktFFFxEJK32lTrvhWnCUxZvZvHGg8dOjunamjEDuzBmYGey+3SgWZPavV2vVI+Ci4jUuu0FRUxetImJCzYyZdEm9hQfmLFVUetE6h8FFxGpFau3FvD6gg1MXLCRnFV5B+3XNbBrG8YM7MyZap00GAouIpISpWEh48QFG5m4YONB3V1NMozTj+7EJwZ1pWvJJs4frUWMDY2Ci4gkzd7iEt5btpWJCzYyaeFGNu7cu/9YmyOaMObYLow9rgtjBnahXYto3Ulu7tZ0VVdSSMFFRGokv6iUl2auZeKCjUxdvPmg2V3d2zXnE4O68olBXTmlX0d1dzUiCi4ikrB12/fw+vwNvL5gI9OWb6XUD9yJ8bjubfnEoK6cO6grg3u01ULGRkrBRUQOq2yb+gkhoMxZu2P/sUyD047uyNjjujL2uK707tAyjTWVukLBRUQqVFrqzFq7ndfnb+T1+RtYvmX3/mMtmmZy5jGdOW9IVzoUrufMUSPSWFOpixRcRGS/ouJSpq/YyoT50ZTh2AH59i2bMva4rpw3uBujB3SiedNoqxXtKiwVUXARaeQKi0t5be56JszfwKRFm8gvLN5/rEe75pw7uBvnDe7GiL5ZNMnUgLzEJ6XBxcxWAvlACVDs7tlm1gF4DugLrAQud/c8i0b9HgA+CRQAX3H3D8N1rgZ+Ei57t7s/FdKHA08CLYBXgRvd3SsrI5X/VpH6pKComMmLNjF+9jreXLiJotIDA/LHdG3NeYO7ce6gbgzpqQF5qZ7aaLmc5e5bYt7fCkxy93vN7Nbw/hbgAmBAeJwCPAycEgLFHUA24
ECumY0LweJh4HpgGlFwOR94rYoyRBqtvcUlTF28mfFz1vPGgo3s2XdgyvBJR7bf30Lp16lVGmspDUU6usUuBsaE108BU4g++C8G/ubuDkwzs/Zm1j2cO9HdtwGY2UTgfDObArR19/dD+t+AS4iCS2VliDQq+0pKeXfpFsbPXs/r8zeQv/dAl9eJR7bn0yf0oBebOfd0rZCX5LLoszxFFzdbAeQRtTgecfe/mNl2d28fc06eu2eZ2SvAve7+TkifRBQQxgDN3f3ukP5TYA9RwLjX3ceG9NHALe5+YWVlVFC/64laPnTv3n34+PHjq/XvLCgooGXL6k+/VH7lT2b+EncWbC7i3TWFTFtbSH7Rgb/xfu2bcHrv5pzauzldWjWpk/VX/vqVPzs7O9fds8unp7rlcpq7rzOzLsBEM1tUxbkVdex6NdLj5u5/Af4CkJ2d7cOHD08k+365ublUN6/yK38y8p944knMXJPH+Nnr+e/c9WzOPzDL6+gurbloaA8uPKE7/Tu3Tkn5yt9481cmpcHF3deF501m9hJwMrDRzLq7+/rQ7VU2krgW6B2TvRewLqSPKZc+JaT3quB8qihDpEH5aGM+T83eyQ2vT2bdjsL96X06tuTCE7rz6aE9GNi1jQblpdalLLiYWSsgw93zw+tzgbuAccDVwL3h+eWQZRzwHTN7lmhAf0cIDhOAX5pZWbfWucBt7r7NzPLNbCQwHbgKeCjmWhWVIVLvlZY6Uz7axOPvrOSdpQfmynRv13x/QDm+ZzsFFEmrVLZcugIvhV/wJsA/3P1/ZjYDeN7MrgNWA5eF818lmoa8lGgq8jUAIYj8ApgRzrurbHAf+CYHpiK/Fh4QBZWKyhCpt3bvLeZfuWt58r2VrAir5Vs0zWR072Zcf+4wTjoyi4wMBRSpG1IWXNx9OTC0gvStwDkVpDvw7Uqu9TjweAXpOcCQeMsQqY/WbCvgqfdW8lzOmv0LHHu2b8HVp/bh89lHsnThHIb37ZDmWoocTCv0Reogd+eDFdt4/N0VTFywkbKbNo7om8W1p/XjE4O6arW81GkKLiJ1yN7iEsbPXs/j76xgwfqdADTNNC45oQfXnNaP43u1S3MNReKj4CJSB2zO38sz01bx9+mr2LKrCICOrZrxpZF9+PIpR9KlbfM011AkMQouImm0aMNOHvpgO++9OJmiklIgutnWtaf15dNDe+zfeVikvlFwEUmDJRvz+cOkJfx3znoAzODcQV259vR+nNKvg6YRS72n4CJSi5Zt3sWDk5YwbvY63KFZkwzO6duc2y49hSM76g6O0nAouIjUgpVbdvPg5CX8Z+bHlHo0SP+FEUfyrbOOYt3SBQos0uAouIik0JptBTw0eQn//vBjSkqdJhnG50f05jtnH03P9i2AA3sWiTQkCi4iKfDx9j38cfJSXshZQ3Gpk5lhXJ7dixvOHkDvDmqlSMOn4CKSRBt2FPKnKUt59oM1FJWUkmHwmZN68t2zB9BXN+GSRkTBRSQJNuUX8vCUZfx9+mqKiksxg4uG9uC75wzg6C6HbnMv0tAdNriEHY33uHupmR0DHAu85u77Ul47kTpu6669PDV7J6//500K90XrVD51fHduHDuAY7q2SXPtRNInnpbLW8DosOX9JCAH+DzwpVRWTKQuc3deyFnLPa8uZMee6HvWeYO7ctPYYziue9s0104k/eIJLubuBWH7+ofc/TdmNjPVFROpq1Zs2c1tL85h2vLozg8ndGnGLz9/MkN6at8vkTJxBRczG0XUUrkugXwiDUpRcSmPvr2cByYtoai4lA6tmvGzCwfRq2S9AotIOfEEiRuB24CX3H2+mfUH3kxttUTqlg9X53Hbv+eyeGM+AJ8b3ovbP3kcWa2akZu7Ic21E6l74gkuXd39orI37r7czN5OYZ1E6oz8wn3cN2Exf5u2Cvfo3vS/vPR4Tju6U7qrJlKnxRNcbgNeiCNNpEGZuGAjP/3PPDbsLCQzw7j+zP7ceM4A7VQsEodKg4uZXUB0T/ueZvZgzKG2QHGqKyaSLpt2FnLHuPm8Ni/q7hraqx33fvYEzQITSUBVLZd1RNOOLwJyY9LzgZtTWSmRdCgtdf45YzX3vraI/MJiWjbL5IfnDeSqUX3JzNAW+CKJqDS4uPtsYLaZ/UMLJqWhW7opn9tenMuMlXkAnHNsF+66ZMj+zSVFJDHxjLmcbGY/B/qE8w1wd++fyoqJ1Ia9xSU8Nz+fl158m30lTqfWR3DnRYP55PHddMMukRqIJ7g8RtQNlguUpLY6IrVn1prt/PCF2SzZtBuAK07uza3nH0e7lk3TXDOR+i+e4LLD3V9LeU1EaknhvhIemLSER6Yuo9ShR+tM7v/iCE7p3zHdVRNpMOIJLm+a2W+BF4G9ZYnu/mHKaiWSIgdaK7vIMLj+jP6M6bRbgUUkyeIJLqeE5+yYNAfOTn51RFKjcF8Jf3hjCX95K2qt9O/cit9+bijD+2SRm5t7+AuISEIOG1zc/azaqIhIqsxcnccP/zWHpaG18vUz+nPzJ47RYkiRFIrnfi4/qyjd3e9KfnVEkqdwXwn3v/ERj761fH9r5b7LhnLSkVnprppIgxdPt9jumNfNgQuBhampjkhyHNJaObM/N49Va0WktsTTLfa72Pdmdh8wLmU1EqmB8q2Vo0Jr5US1VkRqVXXuy9IS0AJKqXNmrs7jBy/MZtnm3WqtiKRZPGMuc4lmhwFkAp0BjbdInVFU4vzq1YU8+nbUWjm6S2t++7kT1FoRSaN4Wi4XxrwuBja6u3ZFljph5uo8fjBxCx/nbyTD4BtnHsVNY7Utvki6xTPmssrMhgKjQ9JbwJyU1krkMPYWl/DAG0v4c1hlf3SX1tx32VCG9W6f7qqJCJBxuBPM7Ebg70CX8Pi7md0QbwFmlmlmM83slfC+n5lNN7MlZvacmTUL6UeE90vD8b4x17gtpC82s/Ni0s8PaUvN7NaY9ArLkIZh3sc7uPiP7/KnKcsAuGRgK1654XQFFpE65LDBBbgOOMXdf+buPwNGAl9LoIwbOXjq8q+B+919AJAXrl9WTp67Hw3cH87DzAYBXwAGA+cDfwoBKxP4P+ACYBBwRTi3qjKkHttXUsoDbyzhkv97l0Ub8unXqRUvfONUrjyhjbrBROqYeIKLcfBuyCUh7fAZzXoBnwL+Gt4b0bYx/wqnPAVcEl5fHN4Tjp8Tzr8YeNbd97r7CmApcHJ4LHX35e5eBDwLXHyYMqSeWrIxn8/86T3uf+Mjikudr5zal1e/O5rhfTRoL1IXxTOg/wQw3cxeCu8vIdqGPx5/AH4EtAnvOwLbYyYErAV6htc9gTUA7l5sZjvC+T2BaTHXjM2zplz6KYcpQ+qZklLnr28v53cTP6KouJSe7Vvw28tO4NSjOqW7aiJSBXP3w59kdhJwOlGL5S13nxlHnguBT7r7t8xsDPAD4Brg/dD1hZn1Bl519+PNbD5wnruvDceWEbVO7gp5ngnpjwGvErW6znP3r4b0K8udf0gZFdTxeuB6gO7duw8fP378YX8WFSkoKKBly5bV
yqv8ledfv6uYhz7YweKt0Y1Qx/ZrwdVD29CyaUZc+WtavvIrv/IfXnZ2dq67Zx9ywN0rfAAjgAsqSL8IGF5ZvpjzfkXUalgJbAAKiCYGbAGahHNGARPC6wnAqPC6STjPgNuA22KuOyHk2583pN8WHlZZGVU9hg8f7tWVk5NT7bzKf2j+kpJSf+q9FX7sT17zPre84iPunuiTF22stfKVX/mVP35AjlfwmVrVmMtvqXgPsQXhWJXc/TZ37+XufYkG5Ce7+5eAN4HPhdOuBl4Or8eF94Tjk0PFxwFfCLPJ+gEDgA+AGcCAMDOsWShjXMhTWRlSx63NK+DLj03nZy/PZ8++Ei4Z1oPXbz6DswZ2SXfVRCQBVY25dHT3leUT3X2pmdXkzkq3AM+a2d3ATA6M3zwGPG1mS4FtRMECd59vZs8TBbVi4NvuXgJgZt8haslkAo+7+/zDlCF1lLvzQs5a7nplAbv2FtOhVTPuuWQIFxzfPd1VE5FqqCq4tKjiWKtECnH3KcCU8Ho50dhI+XMKgcsqyX8PcE8F6a8Sjb+UT6+wDKmbtu0p4donZ/Dm4s0AnD+4G3dfOoROrY9Ic81EpLqqCi5vmNk9wE9CVxMAZnYnMDnlNZNGYfzsddw2YQu79jltmzfhrouHcPGwHkQzykWkvqoquHyfaH3KUjObFdKGAjnAV1NdMWnY9hSVcOf4+Tw7I5pNPmZgZ+79zAl0a9c8zTUTkWSoNLi4+26iVe/9iVbHA8wPXU4i1fbRxny+848P+WjjLpo1yeDqE1rz48tGqLUi0oDEs3HlckABRWrM3Xk+Zw13jJtP4b5Sjurcij9+8SQK1i1RYBFpYKpzszCRhOUX7uP2l+YxbvY6AD57Ui/uungwrY5oQu66NFdORJJOwUVSbu7aHXznnx+yamsBLZtlcvclQ/jMSb3SXS0RSaG4gouZnQ4McPcnzKwz0NqjTSRFKuXuPPneSn756kL2lTjHdW/LH794Ikd1bp3uqolIisVzm+M7gGxgINEmlk2BZ4DTUls1qc+2FxTxw3/NYeKCjQBcObIPt3/qOG2NL9JIxNNyuRQ4EfgQwN3XmVmbqrNIY5azchvf/edM1u0opE3zJvzmsydopb1IIxNPcClydzczBzCzhFbnS+NRWuo8PHUZv5/4ESWlzrDe7XnoihPp3aH6O66KSP0UT3B53sweAdqb2deAa4FHU1stqW825+/le8/P4u0lWwD4+pn9+cG5A2maGc/96ESkoYlnnct9ZvYJYCfRuMvP3H1iymsm9cbsjXv5+mtvs2XXXjq0asbvLh+qXYxFGrl4BvRvBl5QQJHy3J0/vLGEB9/Kw4GR/TvwwBdOpGtbbeEi0tjF0y3WFphgZtuI7lP/L3ffmNpqSV1XVFzKrf+ew4szPyYDuGnsMXzn7KPJzNBKexGhypuFAeDud7r7YODbQA9gqpm9kfKaSZ21s3Af1zz5AS/O/JiWzTK59fQsbhw7QIFFRPZLZIX+JqLbFW8F1KHeSK3fsYdrnpjBog35dGp9BE98ZQRFG5emu1oiUscctuViZt80synAJKAT8DV3PyHVFZO6Z9GGnXzmT++xaEM+/Tu34qVvncrxvdqlu1oiUgfF03LpA9zk7rMOe6Y0WO8t3cLXn84lf28x2X2yePSqbLJaNUt3tUSkjqo0uJhZW3ffCfwmvO8Qe9zdt6W4blJHvDzrY37wwmz2lTgXDOnG/Z8fpm1cRKRKVbVc/gFcCOQCDsSO1jrQP4X1kjrAPVpx/5v/LQbg2tP68ZNPHUeGBu5F5DCquhPlheG5X+1VR+qKklLnjnHzeGbaaszg9k8ex1dH6/uEiMQnngH9SfGkScOxp6iErz+dyzPTVtOsSQZ/vOIkBRYRSUhVYy7NgZZAJzPL4kC3WFui9S7SAG3ZtZfrnsph9prttGvRlEevyubkfh0On1FEJEZVYy5fB24iCiS5HAguO4H/S3G9JA1WbtnN1U98wKqtBfRs34Knrh3B0V10dwURSVxVYy4PAA+Y2Q3u/lAt1knS4MPVeXz1qRy27S5iSM+2PH71CLpojzARqaZ4dkV+yMyGAIOA5jHpf0tlxaT2fPBxIQ/8ZxqF+0o585jO/N+XTqL1EYls3iAicrB4b3M8hii4vApcALwDKLg0AP/8YDW/fW87pcDl2b2459LjdQ8WEamxeD5FPgecA2xw92uAocARKa2V1Irpy7dy+0tzKQVuGjuAX3/2BAUWEUmKeD5J9rh7KVBsZm2JNrDUvNR6bsuuvXz32ZmUOlx6bCtuGnsMZlocKSLJEU/Heo6ZtSe6tXEusAv4IKW1kpQqLXVufm4WG3fuZUTfLK4YrD3CRCS54hnQ/1Z4+Wcz+x/Q1t3npLZakkoPT13G20u20KFVMx684kTWLV2Q7iqJSANT1SLKk6o65u4fpqZKkkrTl2/ld69He4X9/vKhdG/XgnVprpOINDxVtVx+V8UxB85Ocl0kxWLHWb415ijGDNQ930QkNapaRHlWbVZEUqv8OMv3PnFMuqskIg1YPOtcrqooXYso65fy4yxNNOVYRFIonk+YETGP0cDPgYsOl8nMmpvZB2Y228zmm9mdIb2fmU03syVm9pyZNQvpR4T3S8PxvjHXui2kLzaz82LSzw9pS83s1pj0CstorCoaZxERSaXDBhd3vyHm8TXgRCCeD+u9wNnuPhQYBpxvZiOBXwP3u/sAIA+4Lpx/HZDn7kcD94fzMLNBwBeAwcD5wJ/MLNPMMok20LyAaPeAK8K5VFFGo6NxFhFJh+r0jRQAAw53kkd2hbdNw6NsIsC/QvpTwCXh9cXhPeH4ORat6rsYeNbd97r7CmApcHJ4LHX35e5eBDwLXBzyVFZGo6JxFhFJl3jGXMYTBQWIgtEg4Pl4Lh5aF7nA0UStjGXAdncvDqesBXqG1z2BNQDuXmxmO4COIX1azGVj86wpl35KyFNZGY2KxllEJF3M3as+wezMmLfFwCp3X5tQIdEK/5eAnwFPhK4vzKw38Kq7H29m84Hzyq5tZsuIWid3Ae+7+zMh/TGiDTQzwvlfDelXljv/kDIqqNf1wPUA3bt3Hz5+/PhE/ln7FRQU0LJly2rlTVX++ZuL+PmUbZQCPxmdxYndKt8Ori7WX/mVX/nrR/7s7Oxcd88+5IC7x/UgugNlh7JHvPli8t8B/BDYAjQJaaOACeH1BGBUeN0knGfAbcBtMdeZEPLtzxvSbwsPq6yMqh7Dhw/36srJyal23lTk35xf6CffM9H73PKK3/vawlovX/mVX/kbT34gxyv4TD1sP4mZXW9mG4E5QA5RN1dOHPk6hxYLZtYCGAssBN4k2mkZ4Grg5fB6XHhPOD45VHwc8IUwm6wf0XjPB8AMYECYGdaMaNB/XMhTWRkNXuw4S3afLL6vcRYRSYN4Nq78ITDY3bckeO3uwFNh3CUDeN7dXzGzBcCzZnY3MBN4LJz/GPC0mS0FthEFC9x9vpk9Dywg6pb7truXAJjZd4haMpnA4+4+P1zrlkrKaPDKxlmyWjbloS9qnEVE0iOe4LK
MaIZYQjza3PLECtKXE42NlE8vBC6r5Fr3APdUkP4q0fhLXGU0dB+s2HZgPcvnh2k9i4ikTTzB5TbgPTObTrR2BQB3/27KaiUJ27prLzf880NKHb455ijO0noWEUmjeILLI8BkYC5QmtrqSHWUunPz87M1ziIidUY8waXY3b+X8ppItf1n0W7e+miXxllEpM6I51PozTBjrLuZdSh7pLxmEpcPVmzjn/OijRA0ziIidUU8LZcvhufbYtIc6J/86kgidu0t5qZnZ1KKxllEpG6J5zbH/WqjIpK4+yd+xLodhRyV1UTjLCJSp+h+LvXUvI938MS7K8gw+MbwdhpnEZE6JZ5usRExr5sD5wAfAgouaVJS6tz+0lxKHa45rS/9swrTXSURkYPE0y12Q+x7M2sHPJ2yGslh/WP6Kmav3UG3ts35/rkDWTxvdrqrJCJykJTdz0VSY9POQn7zv2gV/j8PNQgAABh1SURBVB2fHkTrI+JpfIqI1K6U3s9Fku8X/11I/t5izj62C+cP6Zbu6oiIVCier733xbyu1v1cJDne+mgz42evo3nTDO68aDDRTTdFROqeSoOLmR0NdHX3qeXSR5vZEe6+LOW1k/0K95Xw05fnAXDjOcfQu0P1b+4jIpJqVY25/AHIryB9Tzgmtej/3lzKqq0FDOzahq+O1tIjEanbqgoufcO2+Qdx9xygb8pqJIdYuimfP0+NGor3XDqEplrTIiJ1XFWfUs2rOKYNrGqJu3P7S/PYV+J8YURvsvtqWzcRqfuqCi4zzOxr5RPN7DqiWx1LLfj3hx8zfcU2OrZqxq0XHJvu6oiIxKWq2WI3AS+Z2Zc4EEyygWbApamumEDe7iJ++epCAG7/1HG0b9kszTUSEYlPpcHF3TcCp5rZWcCQkPxfd59cKzUT7n1tEdt2FzGqf0cuPbFnuqsjIhK3eLZ/eRN4sxbqIjE+WLGN53LW0Cwzg7svHaI1LSJSr2jaUR1UVFzK7S/NBeAbY47iqM6t01wjEZHEKLjUQX99ZzlLNu2ib8eWfGvMUemujohIwhRc6pg12wp4cNISAO6+5HiaN81Mc41ERBKn4FKHuDs/fXkehftKuXhYD04f0CndVRIRqRYFlzrktXkbmLJ4M22aN+H2Tx2X7uqIiFSbgksdkV+4jzvHzwfglvOPpUubqjZIEBGp2xRc6ojfvf4RG3fuZVjv9nzx5CPTXR0RkRrRbQzrgGV5+/jb+xvIzDB+eenxZGRoTYuI1G9quaRZSanzSO4OSh2uPa0vg3q0TXeVRERqTMElzZ5+fyXL8orp0a45N409Jt3VERFJCgWXNNpTVML9b0RrWn5+0WBaHaFeShFpGBRc0mjc7I/ZsWcfx3RoyrmDu6W7OiIiSaPgkibuzt/eXwXA+Ue3THNtRESSS8ElTWau2c78dTvp0KoZo3ppTYuINCwKLmnydGi1fH5Eb5plauqxiDQsKQsuZtbbzN40s4VmNt/MbgzpHcxsopktCc9ZId3M7EEzW2pmc8zspJhrXR3OX2JmV8ekDzezuSHPgxZuelJZGXXF1l17+e+c9ZihBZMi0iClsuVSDHzf3Y8DRgLfNrNBwK3AJHcfAEwK7wEuAAaEx/XAwxAFCuAO4BTgZOCOmGDxcDi3LN/5Ib2yMuqE53PWUlRSytkDu9C7g8ZbRKThSVlwcff17v5heJ0PLAR6AhcDT4XTngIuCa8vBv7mkWlAezPrDpwHTHT3be6eB0wEzg/H2rr7++7uwN/KXauiMtKupNR5ZlrUJXblqD5pro2ISGpY9Lmc4kLM+gJvAUOA1e7ePuZYnrtnmdkrwL3u/k5InwTcAowBmrv73SH9p8AeYEo4f2xIHw3c4u4Xmtn2isqooF7XE7V86N69+/Dx48dX699XUFBAy5bxtUBy1hXyq3e3061VJg9d0IkMs4Ty17R85Vd+5Vf+ZObPzs7OdffsQw64e0ofQGsgF/hMeL+93PG88Pxf4PSY9EnAcOCHwE9i0n8KfB8YAbwRkz4aGF9VGVU9hg8f7tWVk5MT97lXPTbd+9zyiv9l6rJq5a9p+cqv/Mqv/MnMD+R4BZ+pKZ0tZmZNgX8Df3f3F0PyxtClRXjeFNLXAr1jsvcC1h0mvVcF6VWVkVartu5m6kebOaJJBp8b3uvwGURE6qlUzhYz4DFgobv/PubQOKBsxtfVwMsx6VeFWWMjgR3uvh6YAJxrZllhIP9cYEI4lm9mI0NZV5W7VkVlpNXfp68G4NNDe5DVqlmaayMikjqp3MzqNOBKYK6ZzQppPwbuBZ43s+uA1cBl4dirwCeBpUABcA2Au28zs18AM8J5d7n7tvD6m8CTQAvgtfCgijLSpnBfCc/nrAHgKg3ki0gDl7Lg4tHAfGWrA8+p4HwHvl3JtR4HHq8gPYdokkD59K0VlZFO42evY3vBPob2ascJvdofPoOISD2mFfq15Okw/fjLI9VqEZGGT8GlFsxes505a3fQvmVTPj20R7qrIyKScgoutaCs1XJ5dm+aN81Mc21ERFJPwSXF8nYXMX72OszgS6doHzERaRwUXFLshdw17C0u5cxjOtOnY6t0V0dEpFYouKRQaanzzLRobcuVGsgXkUZEwSWFpi7ZzOptBfTKasGYgV3SXR0RkVqj4JJCz4Qbgn3plD5kZuiGYCLSeCi4pMiabQVMXryJZpkZXJ6tfcREpHFRcEmRv09fjTtceEJ3OrY+It3VERGpVQouKVC4r4TnZkQD+V/WPmIi0ggpuKTAq3PXk1ewjyE923Jib+0jJiKNj4JLCpStyL9yZB+iuwGIiDQuCi5JNu/jHcxcvZ22zZtw0dCe6a6OiEhaKLgk2dNh+vFl2b1p0Uz7iIlI46TgkkQ7Cvbx8uyPAe0jJiKNm4JLEr2Qu4bCfaWMHtCJ/p1bp7s6IiJpo+CSJKWlzt+nax8xERFQcEmad5ZuYcWW3fRo15yzj9U+YiLSuCm4JEnZ9OMvjexDk0z9WEWkcWuS7go0BJsLSpi0cDNNM43Ls3unuzoiImmnr9hJ8PqyAkodLhjSnc5ttI+YiIiCSw3tLS5h0oo9AFylfcRERAAFlxr737wN7NhbyrHd2jC8T1a6qyMiUicouNRQ2Yr8K0dpHzERkTIKLjWwcWch89btoGUT45Jh2kdMRKSMgksNdG3bnOm3jeVHp7Wn1RGaeCciUkbBpYbatWzK8V00Q0xEJJaCi4iIJJ2Ci4iIJJ2Ci4iIJJ2Ci4iIJJ2Ci4iIJJ2Ci4iIJJ2Ci4iIJJ25e7rrUCeY2WZgVTWzdwK21KB45Vd+5Vf++pq/j7t3PiTV3fWo4QPIUX7lV37lb4z5K3uoW0xERJJOwUVERJJOwSU5/qL8yq/8yt9I81dIA/oiIpJ0armIiEjSKbiIiEjS6Q5XIvWUmWUBA4DmZWnu/lb6atR4mVl3YJu77013XeoKtVwSZGYdzOzHZvY9M2ubpjoccneyitIqyft0eL4x2fVKhJl1NbMLw6NLGso/zcxahddfNrPfm1mf2iw/nrQq8n8VeAuYANwZnn+erPrFUX7/Gu
av9s/fzDLN7JmalJ8CTwOLzOy+dFckHmZ2n5kNTmUZCi6J+zfQGugFvF+dP7LwwfqYmb0W3g8ys+sSuMT7caZVZHj4I77WzLJCsNz/SKAOmNmpZvZFM7uq7BFnvsuBD4DLgMuB6Wb2uQTL/o2ZtTWzpmY2ycy2mNmXE7jEw0CBmQ0FfkS0O8Pf4ih3vJmNq+yRQPkPxZlWmRuBEcAqdz8LOBHYHG/m8HP7rpn9KzxuMLOmCZT/pJktM7NnzexbZnZ8Anmhmj9/AHcvATqbWbMEy8TM8s1sZ2WPRK8XU6exQH/giTjrUaPPgBCcJ5rZR2a23MxWmNnyBKq8CPiLmU03s2+YWbsE8sZF3WKJ6+juPwYws/OAqWa2Hfg+8FV3vzyOazxJ9Et4e3j/EfAc8FhVmcysG9ATaGFmJwIWDrUFWsZZ/z8D/yP6Q8iNvTzgIf2wQgvoKGAWUBKSnfg+IG4HRrj7pnCtzsAbwL/iKTs4191/ZGaXAmuJAtWbQLzfaIvd3c3sYuABd3/MzK6OI1/ZN9PPAN1iyrsCWHm4zGY2CjiV6MPxezGH2gKZcdYdoNDdC80MMzvC3ReZ2cAE8j8MNAX+FN5fGdK+Gk9mdz8jfLiPAMYA/zWz1u4e7xeU6v78y6wE3g0BfXdMvX5/mHq3ATCzu4ANRC0OA74EtEmg/Iqu7cD8OE9/kmp8BsR4DLiZ6G+45DDnHsLd/wr8NfzOXAPMMbN3gUfd/c1Er1cRBZfE5ZtZX3df6e4TzOxIoAeQB8yN8xqd3P15M7sNwN2LzSyeX5DzgK8QtZpi/4jygR/HU7C7Pwg8aGYPEwWaM8Kht9x9dpz1B8gGBnn15rJnlAWWYCuJt6LLvmV/Evinu28zs6rOLy8//Py/DJxhZpkx16yUu08FMLNfuPsZMYfGm1k84x3NiFq+TTj4w2wnkEjrba2ZtQf+A0w0szxgXQL5R7j70Jj3k80s7v9/MzsdGB0e7YFXgLcTKL9aP/8Y68Ijg+oFhfPc/ZSY9w+b2XTgN9W4VnVU9zOgzA53f60mFQg/82PDYwswG/iemX3d3b9Qk2uDgkt1XEv0AQHs/7bycXhbEOc1dptZR6Jv+pjZSGDH4TK5+1PAU2b2WXf/d0K1PtQiom/dLxJ9c3vazB5193i7ZuYRfXNfX42yXzOzCcA/w/vPA68meI3xZrYI2AN8K7R+ChPI/3ngi8B17r4hfEn4bQL5O5tZf3dfDmBm/YBDN+8rJwSnqWb2pLtXd6NU3P3S8PLnZvYm0I6oRRqvEjM7yt2Xwf4xlEQ+3KYCOcCvgFfdvSiBvFDDn7+73wlgZm2it74rwfJLzOxLwLNEf4dXUI0WQA1U6zMgxptm9luiv9/9kwjc/cN4MpvZ74GLgEnAL939g3Do12a2OIF6VF6GFlHWPjM7iah/fQjRh3Rn4HPuPieBa3wKGMzBM4XuSiD/HGCUu+8O71sB77v7CXHmfxMYRjR2EvvLfVEceX8NTAdOJwpsbwEj3f2WeOsfrpMF7HT3EjNrCbR19w2JXKO6zOx8opXNZf3cfYGvu/uEOPN3JhprKP9/eHZya1pp+ecQdcssJ/o/6ANcE2+XSGg1nUbU8h0BlBL9/vw0NTU+pPwhRF1aZd1wW4Cr3D2ubikz6ws8QPRvcOBd4CZ3X5nsulZSfo0+A8LfX3ke7++PmV0LPOvuh3whNrN27p5IoKu4DAWX9DCzJsBAoj/sxe6+L4G8fyYaYzkL+CtRd8oH7p7IgOBcoq6RwvC+OTDD3eMamDWzMytKL+s2OkzeD939pHJpc+INbDF5hgCDOPjDucoxHzN7x91PN7N8wrfGskNRdo97BqBFM/SODW8XJTIN1cxeJ+pj/wHwDeBqYHOiAbYmQv3LfgcTqn/IfxxwJlHX2KnAanev8PciJs8Kop/75nLdUgkxs/eA28uCoZmNIfoGfmp1r1lbzCwDGEn0xaxanwFJqkdPoi8V+3uwkjmVXcElTczsVKJvu7H/sXHNlin7II55bg286O7nJlD+94g+0F4KSZcAT7r7H+K9RqLM7JvAt4gmDSyLOdQGeNfd457tZWZ3EA0kDyLqUrsAeMfdE5p1ligzO9vdJ5vZZyo67u4vxnmdXHcfHhtUzWzq4T6ck6mGv4PLgMXAO0RjLdOr0TVWbWY2u9yYUYVpVeTvDHyNQ//91yaznlWU/767j6pB/q7AL4Ee7n6BmQ0i6omIa0KAmd0LfAFYQMyEnHh6HuKlMZc0qOFMK4jGGSCaytmDaEC8XyJ1cPffm9kUDnRNXePuMw+Xr4bf/P8BvEbUT39rTHq+u29LpP5ErbWhwEx3vyb8sf01wWtUxxnAZODTVPDvJ+oDj0fZt9T1oYtzHdFEjVqRhN/BAe5emoq6xWm5mf2UqGsMookBKxLI/zJRUHyD2h1rKfO6mX2W6Ethdb7hP0nNZptdCgxMtLWaCAWX9KjJTCuAV0Kf92+BD4k+FBL+YA2Df3ENAMbkOT08JzxDJ/Tj7iAaPK2pQncvNbNiixazbiLOadQ1lB9affOIfu5lU9QS/b+826K1Bd8n6ntvSzS1tLbU9Hewh5k9xIExi3eAG919bbIqWBEze9rdryQKDH05MCFlKtGU2ni1rM0uyAp8D2gFFJtZIYl3y9Z0ttlyotl5Ci4NTE1mWuHuvwgv/21mrwDNkzEAV8/MCAH2UaK5/ruI+rBTrXV4Hkg0kP0y0QfDp4kmJsTF3V8JL3cQjZ3Vthr9DhJ9a/4H0foiiFoOTwCfqHnVqlS2CPhqop9bWYsRDgT6eLxiZp9090RnKSaFu7exaNHyQdv3JKCms80KgFlmNomDJ+R8txp1qZDGXGqRmY0n+mVoQzVnWsVcq9r95Q1B6NZ5i+gbbCHRTLG4Z9slofzXgc+6e3543wZ4wd3PjzN/Wvr8k/U7aGaz3H3Y4dKSzcy+C3yTqJX6cewhom/+8S4CzidqOewl6qJMeEJHTVi0fc+NRF2hs4gG+N9z93PizF/T2WYVLlj1aLlDUqjlUrvuI/ol/jXRAHqZsrS4JKG/vCF4gmi86CGiD5pZZvaWuz9QS+UfCcQOYBcRBYp4pavPfzLR3/1MDoz7VEfZdjtla5WuIBr7SymPWQTs7t+swXVq2nKoqbLte6a5+1lmdizRHnFxcfcPw4zNas02S2YQqYxaLmlQ06m4ZraQmvWXNwgWrTAeQdQ98g1gj7sfW3WupJV9O9G+aC8RBfZLgefc/Vdx5k/5t/xKyr2PaNrwcUQrst8jWuPxfiKTKsKixz8Co4j+/e8RjblUe2FobappyyEJ5c9w9xFmNgs4xd33Jvo7UcPZfgOIJtaUn8qftHFLtVxqUexU3LCIsUwboj/weNW0v7zeC33FrYg27HybmL3KaoO732PRpoOjQ1Jcs+1ipKXP391/AGDRvmDZRIHmWuBRM9vu7oPivM5qohXe9VWNWg5JUKPte5LQe/EEcAdwP9GXs2tIbMzq8HVs5F9+a1WYHZRFNafiJnPMpr4zs/uB4UT//neJxl/ed/c9VWZMs3JTuFsT1b84v
K/NPv92RK2O08Jze2Cuu1c54yrMEKv0QyOZA8KplIyWQxLrciZh+5541wrVtPciZp3VXA8Lp83sbXcffbi88VLLpRYlYSpusvrL6z13vxkgLCC9huibWDcgrvvapIsf2JX3aaIW19vuvrC2yjezvxBtOZNPtAXPe8Dv3T0vzkvkxLy+k+jbb31U040/k8bj2NWiAjXtvSgMOwUsMbPvEE2OSOp9ldRyqUeS1V/eEIQ/iNFErZdVhJlj7j45rRWLk5mdTTQhYTTRhISZRPVP6YQEM/sf0Inow+k9om7FedX5BmxmM939xCRXsdZVp+WQLkmc7TcCWEjUYv0F0Tqr37j79KTVVcGl/inXXz4qPOLuL28IzOyHRAEl192LD3d+XZSuCQlmZkStl1PDYwiwjehLStwtkYompkhqmdnNVNF7EW8ryMyyiVb39+HArQ483klF8VC3WP3UguibRrvwWEf895JpENw9ke3x65x0TkgIrZR5Ft3krqyr9kLgZOpvN1dj0ZPoC8GPqVnvxd+BHxJ9bqRkGx+1XOqRCvrLpxHNdom3v1zqiHRNSAiLEE8lGsjfF8p+PzzPPdx+YeUmJLTkwD2ManURYmNX096Lsj0CU1hFtVzqmSOJBqyXEA3ArQW2p7VGUi1pnJDQl+h20je7e8KDwdXZU05Soqa9F3eY2V+JbhYWO2YT78arh6WWSz2TrP5ySa/6PiFB0iNZvRdm9gzRvYjmc6BbzJO5/ZBaLvWM+ssbjBbA76nHExIkLZLVezHU47wxYHWp5VKP1LS/XETqv2T0XpjZo8D97r4gZfVUcKk/zOz3hNkh1ekvF5GGw8x6EX3RPJWo96Kju7ePM+9Cou1jVhCNuZRNyEjaVGQFFxGReiJZvRfhnjiHSObGowouIiL1RH3qvVBwERGRpMtIdwVERKThUXAREZGkU3ARSQEzu93M5pvZHDObZWanpLCsKWEjQpE6Q4soRZLMzEYRTQ09KdyEqhPQLM3VEqlVarmIJF93YIu77wVw9y3uvs7MfmZmM8xsnpn9JSyGK2t53G9mb5nZQjMbYWYvmtkSM7s7nNPXzBaZ2VOhNfQvM2tZvmAzO9fM3jezD83shbB3GWZ2r5ktCHnvq8WfhTRSCi4iyfc60NvMPjKzP4WbUQH80d1HuPsQou1fLozJU+TuZwB/Bl4Gvk208vorZtYxnDMQ+EtY6LYT+FZsoaGF9BNgbLjPSg7wPTPrAFwKDA55707Bv1nkIAouIknm7ruINqS8HtgMPGdmXwHOMrPpZjYXOJtoC48y48LzXGC+u68PLZ/lQO9wbI27vxteP0N0J8tYI4FBwLvh3vBXE90MaidQCPzVzD7DgW3yRVJGYy4iKeDuJcAUYEoIJl8HTgCy3X2Nmf0caB6TpWzb89KY12Xvy/5Oyy9KK//egInufkX5+pjZycA5wBeA7xAFN5GUUctFJMnMbKCZDYhJGgYsDq+3hHGQz1Xj0keGyQIAVwDvlDs+DTjNzI4O9WhpZseE8tq5+6vATaE+IimllotI8rUGHjKz9kAxsJSoi2w7UbfXSmBGNa67ELjazB4h2nL94diD7r45dL/908zKbjr2E6J7f7xsZs2JWjc3V6NskYRo+xeResDM+gKvhMkAInWeusVERCTp1HIREZGkU8tFRESSTsFFRESSTsFFRESSTsFFRESSTsFFRESSTsFFRESS7v8BKleqeNMFN6MAAAAASUVORK5CYII=\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x7fc2cd79c9d0>"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from nltk import word_tokenize\n",
"from nltk import FreqDist\n",
"%matplotlib inline\n",
"\n",
"fdist = FreqDist(mydist)\n",
"fdist.plot(20, cumulative=True)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# TODO: 从上面的图中能观察到什么样的现象? 这样的一个图的形状跟一个非常著名的函数形状很类似,能所出此定理吗? \n",
"# hint: [XXX]'s law\n",
"# \n",
"# zip's law"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"#### 1.3 文本预处理\n",
"此部分需要做文本方面的处理。 以下是可以用到的一些方法:\n",
"\n",
"- 1. 停用词过滤 (去网上搜一下 \"english stop words list\",会出现很多包含停用词库的网页,或者直接使用NLTK自带的) \n",
"- 2. 转换成lower_case: 这是一个基本的操作 \n",
"- 3. 去掉一些无用的符号: 比如连续的感叹号!!!, 或者一些奇怪的单词。\n",
"- 4. 去掉出现频率很低的词:比如出现次数少于10,20.... (想一下如何选择阈值)\n",
"- 5. 对于数字的处理: 分词完只有有些单词可能就是数字比如44,415,把所有这些数字都看成是一个单词,这个新的单词我们可以定义为 \"#number\"\n",
"- 6. lemmazation: 在这里不要使用stemming, 因为stemming的结果有可能不是valid word。\n"
]
},
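{
"cell_type": "markdown",
"metadata": {},
"source": [
"下面先给出第 4、5 条(低频词过滤、数字归一化)的一个参考写法(示意):``min_count``的取值只是示例,实际应结合 1.2 节的词频分布来选择;输入假设是分词后的句子列表。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 第4、5条的一个参考实现(示意):低频词过滤 + 把数字统一替换成 '#number'\n",
"from collections import Counter\n",
"\n",
"def filter_low_freq_and_numbers(tokenized_sentences, min_count=2):\n",
"    counts = Counter(w for s in tokenized_sentences for w in s)\n",
"    result = []\n",
"    for s in tokenized_sentences:\n",
"        new_s = []\n",
"        for w in s:\n",
"            if w.replace(',', '').isdigit(): # 像 44 或 44,415 这样的数字统一看成同一个词\n",
"                new_s.append('#number')\n",
"            elif counts[w] >= min_count: # 出现次数低于阈值的词直接丢掉\n",
"                new_s.append(w)\n",
"        result.append(new_s)\n",
"    return result"
]
},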
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# TODO: 需要做文本方面的处理。 从上述几个常用的方法中选择合适的方法给qlist做预处理(不一定要按照上面的顺序,不一定要全部使用)\n",
"from nltk.corpus import stopwords\n",
"from nltk.stem import WordNetLemmatizer\n",
"\n",
"stop_words = set(stopwords.words('english'))\n",
"wnl = WordNetLemmatizer()\n",
"f_list = []\n",
"english_punctuations = ['(', ')', '[', ']', '&', '!!', '*', '@', '#', '$', '%', '?']\n",
"\n",
"def sentencesList_preprocess(sentencesList):\n",
"\n",
" for sentence in sentencesList:\n",
" # filter stop words\n",
" filtered = [word for word in sentence if word not in stop_words]\n",
" \n",
" # to lower case\n",
" lower_case = [word.lower() for word in filtered]\n",
" \n",
" # lemmatization\n",
" lemma = [wnl.lemmatize(token) for token in lower_case]\n",
" \n",
" # remove unused symbols\n",
" sym = [word for word in lemma if word not in english_punctuations]\n",
" \n",
" f_list.append(sym)\n",
" \n",
" return f_list\n",
"\n",
"qlist = sentencesList_preprocess(qwords) # 更新后的问题列表"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"def sentence_preprocess(sentence):\n",
" # filter stop words\n",
" filtered = [word for word in sentence if word not in stop_words]\n",
" # to lower case\n",
" lower_case = [word.lower() for word in filtered]\n",
" # lemmatization\n",
" lemma = [wnl.lemmatize(token) for token in lower_case]\n",
" # remove unused symbols\n",
" token = [word for word in lemma if word not in english_punctuations]\n",
"\n",
" return token"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 第二部分: 文本的表示\n",
"当我们做完必要的文本处理之后就需要想办法表示文本了,这里有几种方式\n",
"\n",
"- 1. 使用```tf-idf vector```\n",
"- 2. 使用embedding技术如```word2vec```, ```bert embedding```等\n",
"\n",
"下面我们分别提取这三个特征来做对比。 "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2.1 使用tf-idf表示向量\n",
"把```qlist```中的每一个问题的字符串转换成```tf-idf```向量, 转换之后的结果存储在```X```矩阵里。 ``X``的大小是: ``N* D``的矩阵。 这里``N``是问题的个数(样本个数),\n",
"``D``是词典库的大小"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# TODO \n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"vectorizer = TfidfVectorizer() # 定义一个tf-idf的vectorizer\n",
"\n",
"qlistnew = [' '.join(sentence) for sentence in qlist]\n",
"X_tfidf = vectorizer.fit_transform(qlistnew) # 结果存放在X矩阵里"
]
},
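{
"cell_type": "markdown",
"metadata": {},
"source": [
"可以做一个简单的检查(非必需步骤):``X_tfidf``的形状应该是``N*D``,``N``为问题个数,``D``为词典大小。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 简单检查:X_tfidf 是 N*D 的稀疏矩阵\n",
"print(X_tfidf.shape)\n",
"print(len(vectorizer.vocabulary_)) # 词典大小 D"
]
},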
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2.2 使用wordvec + average pooling\n",
"词向量方面需要下载: https://nlp.stanford.edu/projects/glove/ (请下载``glove.6B.zip``),并使用``d=200``的词向量(200维)。国外网址如果很慢,可以在百度上搜索国内服务器上的。 每个词向量获取完之后,即可以得到一个句子的向量。 我们通过``average pooling``来实现句子的向量。 "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Finish load Glove\n"
]
}
],
"source": [
"# load Glove\n",
"def loadGlove(path):\n",
" vocab = {}\n",
" embedding = []\n",
" vocab[\"UNK\"] = 0\n",
" embedding.append([0]*200)\n",
" file = open(path, 'r', encoding='utf8')\n",
" i = 1\n",
" for line in file:\n",
" row = line.strip().split()\n",
" vocab[row[0]] = i\n",
" embedding.append(row[1:])\n",
" i += 1\n",
" print(\"Finish load Glove\")\n",
" file.close()\n",
" return vocab, embedding\n",
"\n",
"path = '../glove.6b/glove.6b.200d.txt'\n",
"voc, emb = loadGlove(path)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def sentence_to_vec(embedding, sentence):\n",
" vec = np.zeros((200,), dtype=np.float64)\n",
" for word in sentence:\n",
" if word in voc:\n",
" idx = voc[word]\n",
" vec += embedding[idx].astype('float64')\n",
" else:\n",
" vec += embedding[0].astype('float64')\n",
" vec = vec/len(sentence)\n",
" return vec"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def sentences_to_vec(embedding, sentences):\n",
" vec = np.zeros((len(sentences), 200))\n",
" for i, sentence in enumerate(sentences):\n",
" sentence = sentence.strip().split(' ')\n",
" vec[i] = sentence_to_vec(embedding, sentence)\n",
" \n",
" return vec"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"emc = np.asarray(emb)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# TODO 基于Glove向量获取句子向量\n",
"# emb = # 这是 D*H的矩阵,这里的D是词典库的大小, H是词向量的大小。 这里面我们给定的每个单词的词向量,\n",
" # 这需要从文本中读取\n",
" \n",
"X_w2v = sentences_to_vec(emc, qlistnew) # 初始化完emb之后就可以对每一个句子来构建句子向量了,这个过程使用average pooling来实现\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# for i in voc:\n",
"# v = emb[v]\n",
"# res = list(cosine_similarity(emb, v)[0])\n",
"# result = sorted(get_top_numbers(res, 10), reverse = True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2.3 使用BERT + average pooling\n",
"最近流行的BERT也可以用来学出上下文相关的词向量(contex-aware embedding), 在很多问题上得到了比较好的结果。在这里,我们不做任何的训练,而是直接使用已经训练好的BERT embedding。 具体如何训练BERT将在之后章节里体会到。 为了获取BERT-embedding,可以直接下载已经训练好的模型从而获得每一个单词的向量。可以从这里获取: https://github.com/imgarylai/bert-embedding , 请使用```bert_12_768_12```\t当然,你也可以从其他source获取也没问题,只要是合理的词向量。 "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from bert_embedding.bert import BertEmbedding\n",
"bert_embedding = BertEmbedding(model='bert_12_768_12', dataset_name='book_corpus_wiki_en_uncased')"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"def sentences_to_vec_bert(sentences):\n",
" X_bert = np.zeros((len(sentences), 768))\n",
" for i in range(len(sentences)):\n",
" emb = bert_embedding(sentence, 'avg')\n",
" vec_list = emb[0][1]\n",
"# v_bert = np.zeros((768,), dtype=np.float32)\n",
" \n",
"# for vec in vec_list:\n",
"# v_bert += vec.astype('float32')\n",
"# v_bert = v_bert/len(vec_list)\n",
" v_bert = np.average(vec_list, axis=0)\n",
" X_bert[i] = v_bert\n",
" return X_bert"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"def sentence_to_vec_bert(sentence):\n",
" emb = bert_embedding([sentence], 'avg')\n",
" vec_list = emb[0][1]\n",
" v_bert = np.sum(vec_list, axis=0)\n",
" \n",
" return v_bert"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# TODO 基于BERT的句子向量计算\n",
"\n",
"X_bert = sentences_to_vec_bert(qlistnew) # 每一个句子的向量结果存放在X_bert矩阵里。行数为句子的总个数,列数为一个句子embedding大小。 "
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"import heapq\n",
"from sklearn.metrics.pairwise import cosine_similarity\n",
"\n",
"# return the top k numbers from the list using prority queue\n",
"def get_top_numbers(tlist, k):\n",
" max_heap = []\n",
" l = len(tlist)\n",
" if l<=0 or k<=0 or k>l: return None\n",
" \n",
" for i in tlist:\n",
" if k > len(max_heap): heapq.heappush(max_heap, i)\n",
" else: heapq.heappushpop(max_heap, i)\n",
" \n",
" return max_heap"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 3.1 tf-idf + 余弦相似度\n",
"我们可以直接基于计算出来的``tf-idf``向量,计算用户最新问题与库中存储的问题之间的相似度,从而选择相似度最高的问题的答案。这个方法的复杂度为``O(N)``, ``N``是库中问题的个数。"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"def get_top_results_tfidf_noindex(query):\n",
" # TODO 需要编写\n",
" \"\"\"\n",
" 给定用户输入的问题 query, 返回最有可能的TOP 5问题。这里面需要做到以下几点:\n",
" 1. 对于用户的输入 query 首先做一系列的预处理(上面提到的方法),然后再转换成tf-idf向量(利用上面的vectorizer)\n",
" 2. 计算跟每个库里的问题之间的相似度\n",
" 3. 找出相似度最高的top5问题的答案\n",
" \"\"\"\n",
" query = word_tokenize(query)\n",
" query= sentence_preprocess(query)\n",
" query = ' '.join(query)\n",
" \n",
" q_tfidf = vectorizer.transform([query])\n",
" res = list(cosine_similarity(q_tfidf, X_tfidf)[0])\n",
" result = sorted(get_top_numbers(res, 5), reverse = True)\n",
" \n",
" top_idxs = [] # top_idxs存放相似度最高的(存在qlist里的)问题的下标 \n",
" # hint: 请使用 priority queue来找出top results. 思考为什么可以这么做? \n",
" dict_visited = {}\n",
"\n",
" for r in result:\n",
" for i, n in enumerate(res):\n",
" if n == r and i not in dict_visited: \n",
" top_idxs.append(i)\n",
" dict_visited[i] = True\n",
"\n",
" ans = [alist[i] for i in top_idxs]\n",
" \n",
" return ans # 返回相似度最高的问题对应的答案,作为TOP5答案 "
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['in the late 1990s', 'Particularly since the 1950s, pro wrestling events have frequently been responsible for sellout crowds at large arenas', 'single mandolins', 'single mandolins', 'molecular methods']\n",
"['Greek', '1877', 'living together', '1570s', 'glesum']\n"
]
}
],
"source": [
"# TODO: 编写几个测试用例,并输出结果\n",
"print (get_top_results_tfidf_noindex(\"when did Beyonce start becoming popular\"))\n",
"print (get_top_results_tfidf_noindex(\"what languge does the word of 'symbiosis' come from\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"你会发现上述的程序很慢,没错! 是因为循环了所有库里的问题。为了优化这个过程,我们需要使用一种数据结构叫做```倒排表```。 使用倒排表我们可以把单词和出现这个单词的文档做关键。 之后假如要搜索包含某一个单词的文档,即可以非常快速的找出这些文档。 在这个QA系统上,我们首先使用倒排表来快速查找包含至少一个单词的文档,然后再进行余弦相似度的计算,即可以大大减少```时间复杂度```。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 3.2 倒排表的创建\n",
"倒排表的创建其实很简单,最简单的方法就是循环所有的单词一遍,然后记录每一个单词所出现的文档,然后把这些文档的ID保存成list即可。我们可以定义一个类似于```hash_map```, 比如 ``inverted_index = {}``, 然后存放包含每一个关键词的文档出现在了什么位置,也就是,通过关键词的搜索首先来判断包含这些关键词的文档(比如出现至少一个),然后对于candidates问题做相似度比较。"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"# TODO 请创建倒排表\n",
"inverted_idx = {} # 定一个一个简单的倒排表,是一个map结构。 循环所有qlist一遍就可以\n",
"for index, sentence in enumerate(qlist):\n",
" for word in sentence:\n",
" if word not in inverted_idx: inverted_idx[word] = [index]\n",
" else: inverted_idx[word].append(index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 3.3 语义相似度\n",
"这里有一个问题还需要解决,就是语义的相似度。可以这么理解: 两个单词比如car, auto这两个单词长得不一样,但从语义上还是类似的。如果只是使用倒排表我们不能考虑到这些单词之间的相似度,这就导致如果我们搜索句子里包含了``car``, 则我们没法获取到包含auto的所有的文档。所以我们希望把这些信息也存下来。那这个问题如何解决呢? 其实也不难,可以提前构建好相似度的关系,比如对于``car``这个单词,一开始就找好跟它意思上比较类似的单词比如top 10,这些都标记为``related words``。所以最后我们就可以创建一个保存``related words``的一个``map``. 比如调用``related_words['car']``即可以调取出跟``car``意思上相近的TOP 10的单词。 \n",
"\n",
"那这个``related_words``又如何构建呢? 在这里我们仍然使用``Glove``向量,然后计算一下俩俩的相似度(余弦相似度)。之后对于每一个词,存储跟它最相近的top 10单词,最终结果保存在``related_words``里面。 这个计算需要发生在离线,因为计算量很大,复杂度为``O(V*V)``, V是单词的总数。 \n",
"\n",
"这个计算过程的代码请放在``related.py``的文件里,然后结果保存在``related_words.txt``里。 我们在使用的时候直接从文件里读取就可以了,不用再重复计算。所以在此notebook里我们就直接读取已经计算好的结果。 作业提交时需要提交``related.py``和``related_words.txt``文件,这样在使用的时候就不再需要做这方面的计算了。"
]
},
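{
"cell_type": "markdown",
"metadata": {},
"source": [
"``related.py``的一个参考写法(示意)如下:基于 2.2 节加载的``Glove``向量(``voc``, ``emc``),为词表中的每个词离线计算最相近的 top 10 单词并写入``related_words.txt``,输出格式``word,related1 related2 ...``与下面读取代码的格式一致。``max_words``、``top_n``等参数均为示例,完整词表是``O(V*V)``的计算量,需要放在离线完成。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# related.py 的参考实现(示意,实际应放在单独的 related.py 里离线运行)\n",
"import numpy as np\n",
"from sklearn.metrics.pairwise import cosine_similarity\n",
"\n",
"def build_related_words(vocab, emb_matrix, out_file='related_words.txt', top_n=10, max_words=10000):\n",
"    # 示例:只取词表的前 max_words 个词,完整词表的计算量为 O(V*V)\n",
"    words = list(vocab.keys())[:max_words]\n",
"    mat = emb_matrix[[vocab[w] for w in words]].astype('float64')\n",
"    with open(out_file, 'w', encoding='utf-8') as f:\n",
"        for i, w in enumerate(words):\n",
"            # 当前词与所有候选词的余弦相似度;按相似度降序取 top_n(下标 0 是它自己,跳过)\n",
"            sims = cosine_similarity(mat[i:i+1], mat)[0]\n",
"            top = np.argsort(-sims)[1:top_n + 1]\n",
"            f.write(w + ',' + ' '.join(words[j] for j in top) + '\\n')\n",
"\n",
"# 用法示例:build_related_words(voc, emc)"
]
},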
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"# TODO 读取语义相关的单词\n",
"def get_related_words(file):\n",
" related_words = {}\n",
" for line in open(file, mode='r', encoding='utf-8'):\n",
" item = line.split(\",\")\n",
" word, s_list = item[0], [value for value in item[1].strip().split()]\n",
" related_words[word] = s_list\n",
" return related_words\n",
"\n",
"related_words = get_related_words('related_words.txt') # 直接放在文件夹的根目录下,不要修改此路径。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 3.4 利用倒排表搜索\n",
"在这里,我们使用倒排表先获得一批候选问题,然后再通过余弦相似度做精准匹配,这样一来可以节省大量的时间。搜索过程分成两步:\n",
"\n",
"- 使用倒排表把候选问题全部提取出来。首先,对输入的新问题做分词等必要的预处理工作,然后对于句子里的每一个单词,从``related_words``里提取出跟它意思相近的top 10单词, 然后根据这些top词从倒排表里提取相关的文档,把所有的文档返回。 这部分可以放在下面的函数当中,也可以放在外部。\n",
"- 然后针对于这些文档做余弦相似度的计算,最后排序并选出最好的答案。\n",
"\n",
"可以适当定义自定义函数,使得减少重复性代码"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"def get_inverted_index_sentence(query):\n",
" query = word_tokenize(query)\n",
" query= sentence_preprocess(query)\n",
" \n",
" r_list = []\n",
" for q in query:\n",
" if q in related_words: \n",
" for word in related_words[q]:\n",
" r_list.append(word)\n",
" \n",
" total_list = query\n",
" for word in r_list:\n",
" total_list.append(word)\n",
" \n",
" idx_list = [] \n",
" for word in total_list:\n",
" if word in inverted_idx:\n",
" indx = inverted_idx[word]\n",
" idx_list.extend(indx)\n",
" return query, idx_list"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"def get_top_index(result):\n",
" top_idxs = []\n",
" dict_visited = {}\n",
" top_result = sorted(get_top_numbers(result, 5), reverse = True)\n",
" \n",
" for r in top_result:\n",
" for i, n in enumerate(result):\n",
" if n == r and i not in dict_visited: \n",
" top_idxs.append(i)\n",
" dict_visited[i] = True\n",
" return top_idxs"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def get_top_results_tfidf(query):\n",
" \"\"\"\n",
" 给定用户输入的问题 query, 返回最有可能的TOP 5问题。这里面需要做到以下几点:\n",
" 1. 利用倒排表来筛选 candidate (需要使用related_words). \n",
" 2. 对于候选文档,计算跟输入问题之间的相似度\n",
" 3. 找出相似度最高的top5问题的答案\n",
" \"\"\"\n",
" query, sentence_idxs = get_inverted_index_sentence(query)\n",
" query = ' '.join(query)\n",
" q_tfidf = vectorizer.transform([query])\n",
" \n",
" \n",
" X_tfidf_idx = []\n",
" for indx in sentence_idxs:\n",
" X_tfidf_idx.append(X_tfidf[indx].toarray()[0])\n",
" \n",
" res = list(cosine_similarity(q_tfidf, X_tfidf_idx)[0])\n",
" \n",
" top_idxs = [] \n",
" top_idxs = get_top_index(res)\n",
" \n",
" ans = [alist[i] for i in top_idxs]\n",
" \n",
" return ans # 返回相似度最高的问题对应的答案,作为TOP5答案"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['in the late 1990s', 'Beck', 'Polish Great Emigration', 'six', 'the Karmapa', 'November 12, 2015']\n",
"['Darlette Johnson', 'flute and violin', 'Chopin Family Parlour', '2001: A Space Odyssey', 'Houston', 'Saxon Palace.']\n"
]
}
],
"source": [
"test_query1 = \"when did Beyonce start becoming popular\"\n",
"test_query2 = \"what languge does the word of symbiosis come from\"\n",
"\n",
"print (get_top_results_tfidf(test_query1))\n",
"print (get_top_results_tfidf(test_query2))"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def get_top_results_w2v(query):\n",
" \"\"\"\n",
" 给定用户输入的问题 query, 返回最有可能的TOP 5问题。这里面需要做到以下几点:\n",
" 1. 利用倒排表来筛选 candidate (需要使用related_words). \n",
" 2. 对于候选文档,计算跟输入问题之间的相似度\n",
" 3. 找出相似度最高的top5问题的答案\n",
" \"\"\"\n",
" \n",
" query, sentence_idxs = get_inverted_index_sentence(query)\n",
" query = ' '.join(query)\n",
" q_w2v = sentence_to_vec(emc,query)\n",
" \n",
" \n",
" q_w2v_idx = []\n",
" for indx in sentence_idxs:\n",
" q_w2v_idx.append(X_w2v[indx])\n",
" \n",
" res = list(cosine_similarity(np.array([q_w2v]), np.array([q_w2v_idx])[0])[0])\n",
" \n",
" top_idxs = [] \n",
" top_idxs = get_top_index(res)\n",
" \n",
" ans = [alist[i] for i in top_idxs]\n",
" \n",
" return ans # 返回相似度最高的问题对应的答案,作为TOP5答案"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Top 20 Hot 100 Songwriters', 'Diana Ross.', '1664', \"a canon at one beat's distance\", 'Gautama Buddha']\n",
"['50', 'Missy Elliott and Alicia Keys', 'GamePro and EGM', 'The Women Behind The Music', 'Oujda, Tangier and Erfoud']\n"
]
}
],
"source": [
"test_query1 = \"when did Beyonce start becoming popular\"\n",
"test_query2 = \"what languge does the word of symbiosis come from\"\n",
"\n",
"print (get_top_results_w2v(test_query1))\n",
"print (get_top_results_w2v(test_query2))"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [],
"source": [
"def get_top_results_bert(query):\n",
" \"\"\"\n",
" 给定用户输入的问题 query, 返回最有可能的TOP 5问题。这里面需要做到以下几点:\n",
" 1. 利用倒排表来筛选 candidate (需要使用related_words). \n",
" 2. 对于候选文档,计算跟输入问题之间的相似度\n",
" 3. 找出相似度最高的top5问题的答案\n",
" \"\"\"\n",
" \n",
" query, sentence_idxs = get_inverted_index_sentence(query)\n",
" query = ' '.join(query)\n",
" q_bert = sentence_to_vec_bert(query)\n",
" \n",
" \n",
" \n",
" q_bert_idx = []\n",
" for indx in sentence_idxs:\n",
" q_bert_idx.append(X_bert[indx])\n",
" \n",
" res = list(cosine_similarity(np.array([q_bert]), np.array([q_bert_idx])[0])[0])\n",
" \n",
" top_idxs = [] \n",
" top_idxs = get_top_index(res)\n",
" \n",
" ans = [alist[i] for i in top_idxs]\n",
" \n",
" return ans # 返回相似度最高的问题对应的答案,作为TOP5答案"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_query1 = \"when did Beyonce start becoming popular\"\n",
"test_query2 = \"what language does the word of symbiosis come from\"\n",
"\n",
"print (get_top_results_bert(test_query1))\n",
"print (get_top_results_bert(test_query2))"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['in the late 1990s', 'Beck', 'Polish Great Emigration', 'six', 'the Karmapa', 'November 12, 2015']\n",
"['Top 20 Hot 100 Songwriters', 'Diana Ross.', '1664', \"a canon at one beat's distance\", 'Gautama Buddha']\n",
"['Darlette Johnson', 'flute and violin', 'Chopin Family Parlour', '2001: A Space Odyssey', 'Houston', 'Saxon Palace.']\n",
"['50', 'Missy Elliott and Alicia Keys', 'GamePro and EGM', 'The Women Behind The Music', 'Oujda, Tangier and Erfoud']\n"
]
}
],
"source": [
"# TODO: 编写几个测试用例,并输出结果\n",
"\n",
"test_query1 = \"when did Beyonce start becoming popular\"\n",
"test_query2 = \"what languge does the word of symbiosis come from\"\n",
"\n",
"print (get_top_results_tfidf(test_query1))\n",
"print (get_top_results_w2v(test_query1))\n",
"# print (get_top_results_bert(test_query1))\n",
"\n",
"print (get_top_results_tfidf(test_query2))\n",
"print (get_top_results_w2v(test_query2))\n",
"# print (get_top_results_bert(test_query2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4. 拼写纠错\n",
"其实用户在输入问题的时候,不能期待他一定会输入正确,有可能输入的单词的拼写错误的。这个时候我们需要后台及时捕获拼写错误,并进行纠正,然后再通过修正之后的结果再跟库里的问题做匹配。这里我们需要实现一个简单的拼写纠错的代码,然后自动去修复错误的单词。\n",
"\n",
"这里使用的拼写纠错方法是课程里讲过的方法,就是使用noisy channel model。 我们回想一下它的表示:\n",
"\n",
"$c^* = \\text{argmax}_{c\\in candidates} ~~p(c|s) = \\text{argmax}_{c\\in candidates} ~~p(s|c)p(c)$\n",
"\n",
"这里的```candidates```指的是针对于错误的单词的候选集,这部分我们可以假定是通过edit_distance来获取的(比如生成跟当前的词距离为1/2的所有的valid 单词。 valid单词可以定义为存在词典里的单词。 ```c```代表的是正确的单词, ```s```代表的是用户错误拼写的单词。 所以我们的目的是要寻找出在``candidates``里让上述概率最大的正确写法``c``。 \n",
"\n",
"$p(s|c)$,这个概率我们可以通过历史数据来获得,也就是对于一个正确的单词$c$, 有百分之多少人把它写成了错误的形式1,形式2... 这部分的数据可以从``spell_errors.txt``里面找得到。但在这个文件里,我们并没有标记这个概率,所以可以使用uniform probability来表示。这个也叫做channel probability。\n",
"\n",
"$p(c)$,这一项代表的是语言模型,也就是假如我们把错误的$s$,改造成了$c$, 把它加入到当前的语句之后有多通顺?在本次项目里我们使用bigram来评估这个概率。 举个例子: 假如有两个候选 $c_1, c_2$, 然后我们希望分别计算出这个语言模型的概率。 由于我们使用的是``bigram``, 我们需要计算出两个概率,分别是当前词前面和后面词的``bigram``概率。 用一个例子来表示:\n",
"\n",
"给定: ``We are go to school tomorrow``, 对于这句话我们希望把中间的``go``替换成正确的形式,假如候选集里有个,分别是``going``, ``went``, 这时候我们分别对这俩计算如下的概率:\n",
"$p(going|are)p(to|going)$和 $p(went|are)p(to|went)$, 然后把这个概率当做是$p(c)$的概率。 然后再跟``channel probability``结合给出最终的概率大小。\n",
"\n",
"那这里的$p(are|going)$这些bigram概率又如何计算呢?答案是训练一个语言模型! 但训练一个语言模型需要一些文本数据,这个数据怎么找? 在这次项目作业里我们会用到``nltk``自带的``reuters``的文本类数据来训练一个语言模型。当然,如果你有资源你也可以尝试其他更大的数据。最终目的就是计算出``bigram``概率。 "
]
},
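{
"cell_type": "markdown",
"metadata": {},
"source": [
"补充说明:在实际实现里(比如下面 4.4 的``word_corrector``),为了避免很小的概率连乘造成数值下溢,通常把上面的乘积放到对数空间下做加法,也就是等价地最大化\n",
"\n",
"$\\log p(s|c) + \\log p(c|w_{prev}) + \\log p(w_{next}|c)$\n",
"\n",
"其中$w_{prev}, w_{next}$分别是当前词前面和后面的词。"
]
},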
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.1 训练一个语言模型\n",
"在这里,我们使用``nltk``自带的``reuters``数据来训练一个语言模型。 使用``add-one smoothing``"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from nltk.corpus import reuters\n",
"import numpy as np\n",
"import codecs\n",
"# 读取语料库的数据\n",
"categories = reuters.categories()\n",
"corpus = reuters.sents(categories=categories)\n",
"\n",
"word2index = {}\n",
"index2word = {}\n",
"corpus2 = []\n",
"\n",
"for sentence in corpus:\n",
" corpus2.append(['<s> '] + sentence + [' </s>'])\n",
"\n",
"for sentence in corpus2:\n",
" for word in sentence:\n",
" word = word.lower()\n",
" if word in word2index: continue\n",
" index2word[len(word2index)] = word\n",
" word2index[word] = len(word2index)\n",
"\n",
"word_count = len(word2index)\n",
"uni_count = np.zeros(word_count)\n",
"bi_count = np.zeros((word_count, word_count))\n",
"\n",
"for sentence in corpus2:\n",
" for i, word in enumerate(sentence):\n",
" word = word.lower()\n",
" uni_count[word2index[word]] += 1\n",
" if i <len(sentence) -1:\n",
" pre = word2index[word]\n",
" curr = word2index[sentence[i+1].lower()]\n",
" bi_count[pre, curr] +=1\n",
"\n",
"bigram = np.zeros((word_count, word_count))\n",
"\n",
"for i in range(word_count):\n",
" for j in range(word_count):\n",
" if bi_count[i,j]==0:\n",
" bigram[i,j] = 1.0 / (uni_count[i] + word_count)\n",
" else:\n",
" bigram[i,j] = (1.0 + bi_count[i,j]) / (word_count + uni_count[i])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def checkCount(pre,word):\n",
" if pre.lower() in word2index and word.lower() in word2index:\n",
" return bigram[word2index[pre.lower()],word2index[word.lower()]]\n",
" else:\n",
" return 0.0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.2 构建Channel Probs\n",
"基于``spell_errors.txt``文件构建``channel probability``, 其中$channel[c][s]$表示正确的单词$c$被写错成$s$的概率。 "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO 构建channel probability \n",
"channel = {}\n",
"\n",
"spell_error_dict = {}\n",
"for line in open('spell-errors.txt'):\n",
" word = line.split(\":\")\n",
" c_word = word[0] # correct word is the key\n",
" spell_error_dict[c_word] = [e_word.strip( )for e_word in word[1].strip().split(\",\")]\n",
"\n",
"# TODO\n",
"for c_word in spell_error_dict:\n",
" if c_word not in channel:\n",
" channel[c_word] = {}\n",
" for e_word in spell_error_dict[c_word]:\n",
" channel[c_word][e_word] = 1/len(spell_error_dict[c_word])\n",
" \n",
"# print(channel) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.3 根据错别字生成所有候选集合\n",
"给定一个错误的单词,首先生成跟这个单词距离为1或者2的所有的候选集合。 这部分的代码我们在课程上也讲过,可以参考一下。 "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"alphabet = \"abcdefghijklmnopqrstuvwxyz\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def known(words):\n",
" return set(w for w in words if w in word2index)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def edits1(word):\n",
" n = len(word)\n",
" #删除\n",
" s1 = [word[0:i] + word[i+1:] for i in range(n)]\n",
" #调换相连的两个字母\n",
" s2 = [word[0:i] + word[i+1] + word[i] + word[i+2:] for i in range(n-1)]\n",
" #replace\n",
" s3 = [word[0:i] + c + word[i + 1:] for i in range(n) for c in alphabet]\n",
" #插入\n",
" s4 = [word[0:i] + c + word[i:] for i in range(n + 1) for c in alphabet]\n",
" edit1_words = set(s1 + s2 + s3 + s4)\n",
"\n",
" if word in edit1_words:\n",
" edit1_words.remove(word)\n",
"\n",
" edit1_words = known(edit1_words)\n",
" return edit1_words"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def edits2(word, edit1_words):\n",
" edit2_words = set(e2 for e1 in edit1_words for e2 in edits1(e1))\n",
" \n",
" if word in edit2_words:\n",
" edit2_words.remove(word)\n",
" \n",
" edit2_words = known(edit2_words)\n",
" return edit2_words"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def generate_candidates(word):\n",
" # 基于拼写错误的单词,生成跟它的编辑距离为1或者2的单词,并通过词典库的过滤。\n",
" # 只留写法上正确的单词。 \n",
" \n",
" word_edit1 = edits1(word)\n",
" word_edit2 = edits2(word, word_edit1)\n",
" \n",
" words = word_edit1 | word_edit2\n",
" \n",
" return words\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.4 给定一个输入,如果有错误需要纠正\n",
"\n",
"给定一个输入``query``, 如果这里有些单词是拼错的,就需要把它纠正过来。这部分的实现可以简单一点: 对于``query``分词,然后把分词后的每一个单词在词库里面搜一下,假设搜不到的话可以认为是拼写错误的! 人如果拼写错误了再通过``channel``和``bigram``来计算最适合的候选。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from queue import PriorityQueue\n",
"def spell_corrector(line):\n",
" # 1. 首先做分词,然后把``line``表示成``tokens``\n",
" # 2. 循环每一token, 然后判断是否存在词库里。如果不存在就意味着是拼写错误的,需要修正。 \n",
" # 修正的过程就使用上述提到的``noisy channel model``, 然后从而找出最好的修正之后的结果。 \n",
" \n",
" corrected_words = []\n",
" tokens = []\n",
" tokens = ['<s>']+word_tokenize(line)+['</s>']\n",
" for i, token in enumerate(tokens):\n",
" if i == len(tokens)-1: break\n",
" if token.lower() not in word2index:\n",
" pre, nxt = tokens[i-1].lower(), tokens[i+1].lower()\n",
" token = word_corrector(token, pre, nxt)\n",
" corrected_words.append(token)\n",
" else: corrected_words.append(token)\n",
" newline = ' '.join(corrected_words)\n",
" \n",
" return newline # 修正之后的结果,假如用户输入没有问题,那这时候``newline = line``\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def word_corrector(word, pre_word, next_word):\n",
" candidates = generate_candidates(word)\n",
" correctors = PriorityQueue()\n",
" \n",
" if len(candidates) == 0: return word\n",
" \n",
" for candidate in candidates:\n",
" if candidate in channel and word in channel[candidate] and candidate in word2index:\n",
" bi_pre = checkCount(pre_word, candidate)\n",
" bi_nxt = checkCount(candidate, next_word)\n",
" p = np.log(channel[candidate][word] + 0.001) + bi_pre + bi_nxt\n",
" correctors.put((-1*p, candidate))\n",
" \n",
" if correctors.empty(): return word\n",
" \n",
" return correctors.get()[1]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"sentence = spell_corrector(\"when did Beyonce start beeome popular?\")\n",
"print(sentence)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_query1 = \"What counted for more of the poplation change\" # 拼写错误的\n",
"test_query2 = \"What counted for more of the population chenge\" # 拼写错误的\n",
"\n",
"test_query1 = spell_corrector(test_query1)\n",
"test_query2 = spell_corrector(test_query2)\n",
"\n",
"print(test_query1)\n",
"print(test_query2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.5 基于拼写纠错算法,实现用户输入自动矫正\n",
"首先有了用户的输入``query``, 然后做必要的处理把句子转换成tokens的形状,然后对于每一个token比较是否是valid, 如果不是的话就进行下面的修正过程。 "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"test_query1 = \"when did Beyonce starte becoming popular?\" # 拼写错误的\n",
"test_query2 = \"What counted for more of the population chenge\" # 拼写错误的\n",
"\n",
"test_query1 = spell_corector(test_query1)\n",
"test_query2 = spell_corector(test_query2)\n",
"\n",
"print (get_top_results_tfidf(test_query1))\n",
"print (get_top_results_w2v(test_query1))\n",
"print (get_top_results_bert(test_query1))\n",
"\n",
"print (get_top_results_tfidf(test_query2))\n",
"print (get_top_results_w2v(test_query2))\n",
"print (get_top_results_bert(test_query2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 附录 \n",
"在本次项目中我们实现了一个简易的问答系统。基于这个项目,我们其实可以有很多方面的延伸。\n",
"- 在这里,我们使用文本向量之间的余弦相似度作为了一个标准。但实际上,我们也可以基于基于包含关键词的情况来给一定的权重。比如一个单词跟related word有多相似,越相似就意味着相似度更高,权重也会更大。 \n",
"- 另外 ,除了根据词向量去寻找``related words``也可以提前定义好同义词库,但这个需要大量的人力成本。 \n",
"- 在这里,我们直接返回了问题的答案。 但在理想情况下,我们还是希望通过问题的种类来返回最合适的答案。 比如一个用户问:“明天北京的天气是多少?”, 那这个问题的答案其实是一个具体的温度(其实也叫做实体),所以需要在答案的基础上做进一步的抽取。这项技术其实是跟信息抽取相关的。 \n",
"- 对于词向量,我们只是使用了``average pooling``, 除了average pooling,我们也还有其他的经典的方法直接去学出一个句子的向量。\n",
"- 短文的相似度分析一直是业界和学术界一个具有挑战性的问题。在这里我们使用尽可能多的同义词来提升系统的性能。但除了这种简单的方法,可以尝试其他的方法比如WMD,或者适当结合parsing相关的知识点。 "
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"好了,祝你好运! "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}