{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Building the inverted index\n", "An inverted index speeds up search and is a standard technique in search engines. Complete the code in this section using the method covered in the course." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from tqdm import tqdm\n", "import numpy as np\n", "import pickle\n", "from gensim.models import KeyedVectors # word vectors, used to compare pairwise word similarity" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Load the data: read data/question_answer_pares.pkl (generated in preprocessor.ipynb) into the variable QApares\n", "with open('data/question_answer_pares.pkl','rb') as f:\n", "    QApares = pickle.load(f)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>question</th>\n", " <th>answer</th>\n", " <th>question_after_preprocessing</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>买二份有没有少点呀</td>\n", " <td>亲亲真的不好意思我们已经是优惠价了呢小本生意请亲谅解</td>\n", " <td>[买, 二份, 有没有, 少点]</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>那就等你们处理喽</td>\n", " <td>好的亲退了</td>\n", " <td>[处理]</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>那我不喜欢</td>\n", " <td>颜色的话一般茶刀茶针和二合一的话都是红木檀和黑木檀哦</td>\n", " <td>[喜欢]</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>不是免运费</td>\n", " <td>本店茶具订单满99包邮除宁夏青海内蒙古海南新疆西藏满39包邮</td>\n", " <td>[免, 运费]</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>好吃吗</td>\n", " <td>好吃的</td>\n", " <td>[好吃]</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " question answer question_after_preprocessing\n", "0 买二份有没有少点呀 
亲亲真的不好意思我们已经是优惠价了呢小本生意请亲谅解 [买, 二份, 有没有, 少点]\n", "1 那就等你们处理喽 好的亲退了 [处理]\n", "2 那我不喜欢 颜色的话一般茶刀茶针和二合一的话都是红木檀和黑木檀哦 [喜欢]\n", "3 不是免运费 本店茶具订单满99包邮除宁夏青海内蒙古海南新疆西藏满39包邮 [免, 运费]\n", "4 好吃吗 好吃的 [好吃]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "QApares.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```TODO1``` Build an inverted index; word similarity does not need to be considered" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Build the inverted index; see the lab manual for details.\n", "# For fast lookup the index should be stored in a hash table. Python dicts are hash tables internally, so the index is kept directly in a dict.\n", "# Note: similarity between words is not considered here.\n", "inverted_list = {}\n", "for index, sentence in enumerate(QApares.question_after_preprocessing):\n", "    ### Your code starts here\n", "    for word in sentence:\n", "        # setdefault creates an empty posting set the first time a word is seen\n", "        inverted_list.setdefault(word, set()).add(index)\n", "    ### Your code ends here" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{5,\n", " 65541,\n", " 32776,\n", " 17,\n", " 18,\n", " 65554,\n", " 29,\n", " 65566,\n", " 32800,\n", " 32803,\n", " 98339,\n", " 32810,\n", " 98346,\n", " 32818,\n", " 55,\n", " 98366,\n", " 64,\n", " 65604,\n", " 65611,\n", " 32850,\n", " 98387,\n", " 98398,\n", " 65631,\n", " 102,\n", " 65639,\n", " 65640,\n", " 65646,\n", " 98415,\n", " 98416,\n", " 118,\n", " 122,\n", " 65659,\n", " 125,\n", " 32894,\n", " 133,\n", " 65669,\n", " 65670,\n", " 65671,\n", " 142,\n", " 65679,\n", " 32912,\n", " 98451,\n", " 150,\n", " 151,\n", " 32929,\n", " 65708,\n", " 98484,\n", " 98489,\n", " 187,\n", " 32957,\n", " 200,\n", " 32973,\n", " 65742,\n", " 98518,\n", " 65755,\n", " 220,\n", " 223,\n", " 65764,\n", " 65783,\n", " 33017,\n", " 65786,\n", " 254,\n", " 65790,\n", " 261,\n", " 65798,\n", " 65810,\n", " 275,\n", " 98586,\n", " 65833,\n", " 33068,\n", " 65838,\n", " 33073,\n", " 65843,\n", " 65844,\n", " 310,\n", " 65852,\n", " 318,\n", " 
65862,\n", " 65863,\n", " 344,\n", " 65883,\n", " 65885,\n", " 350,\n", " 33120,\n", " 364,\n", " 98668,\n", " 65903,\n", " 33140,\n", " 98678,\n", " 65913,\n", " 33149,\n", " 33158,\n", " 33163,\n", " 33168,\n", " 401,\n", " 98709,\n", " 98715,\n", " 33189,\n", " 33191,\n", " 65960,\n", " 33193,\n", " 98740,\n", " 33210,\n", " 98751,\n", " 98754,\n", " 33220,\n", " 453,\n", " 98763,\n", " 461,\n", " 469,\n", " 33244,\n", " 98794,\n", " 495,\n", " 98803,\n", " 33273,\n", " 33276,\n", " 66046,\n", " 33290,\n", " 530,\n", " 33300,\n", " 66069,\n", " 66077,\n", " 542,\n", " 543,\n", " 66081,\n", " 66091,\n", " 556,\n", " 66093,\n", " 66094,\n", " 98860,\n", " 66097,\n", " 98877,\n", " 66111,\n", " 33346,\n", " 579,\n", " 33347,\n", " 588,\n", " 66152,\n", " 66153,\n", " 98932,\n", " 66166,\n", " 639,\n", " 640,\n", " 642,\n", " 98948,\n", " 66185,\n", " 651,\n", " 33422,\n", " 98962,\n", " 33427,\n", " 98970,\n", " 66207,\n", " 673,\n", " 33441,\n", " 33446,\n", " 66221,\n", " 687,\n", " 66224,\n", " 33459,\n", " 694,\n", " 33464,\n", " 33466,\n", " 33468,\n", " 99005,\n", " 66238,\n", " 99007,\n", " 710,\n", " 99017,\n", " 718,\n", " 99044,\n", " 743,\n", " 33513,\n", " 749,\n", " 33523,\n", " 759,\n", " 66299,\n", " 99068,\n", " 33535,\n", " 769,\n", " 33539,\n", " 66307,\n", " 66309,\n", " 99084,\n", " 66321,\n", " 99090,\n", " 66323,\n", " 99098,\n", " 66334,\n", " 800,\n", " 66337,\n", " 33576,\n", " 99120,\n", " 99122,\n", " 33592,\n", " 33593,\n", " 33600,\n", " 33604,\n", " 99140,\n", " 841,\n", " 845,\n", " 99149,\n", " 66389,\n", " 854,\n", " 33624,\n", " 33625,\n", " 858,\n", " 66395,\n", " 99163,\n", " 33629,\n", " 99173,\n", " 33638,\n", " 874,\n", " 66411,\n", " 878,\n", " 33647,\n", " 66415,\n", " 66417,\n", " 33650,\n", " 66418,\n", " 99182,\n", " 891,\n", " 33662,\n", " 66436,\n", " 901,\n", " 99208,\n", " 66441,\n", " 33674,\n", " 33675,\n", " 33682,\n", " 916,\n", " 66452,\n", " 33691,\n", " 33692,\n", " 66461,\n", " 66463,\n", " 33697,\n", " 
99234,\n", " 931,\n", " 938,\n", " 944,\n", " 66480,\n", " 33716,\n", " 99254,\n", " 959,\n", " 33728,\n", " 99267,\n", " 33736,\n", " 33741,\n", " 978,\n", " 33750,\n", " 992,\n", " 33765,\n", " 99301,\n", " 1000,\n", " 1005,\n", " 99309,\n", " 33780,\n", " 99318,\n", " 1017,\n", " 33787,\n", " 66557,\n", " 1024,\n", " 1027,\n", " 1034,\n", " 1036,\n", " 66577,\n", " 66580,\n", " 99351,\n", " 99360,\n", " 1057,\n", " 33825,\n", " 99363,\n", " 33831,\n", " 66600,\n", " 33837,\n", " 66607,\n", " 99376,\n", " 33841,\n", " 99380,\n", " 66613,\n", " 33849,\n", " 66622,\n", " 1087,\n", " 1088,\n", " 99391,\n", " 1095,\n", " 1098,\n", " 66637,\n", " 1102,\n", " 99405,\n", " 99410,\n", " 99413,\n", " 66646,\n", " 99419,\n", " 66652,\n", " 1118,\n", " 99431,\n", " 1129,\n", " 66670,\n", " 33912,\n", " 33914,\n", " 66685,\n", " 33927,\n", " 99467,\n", " 99470,\n", " 99472,\n", " 66705,\n", " 66708,\n", " 66709,\n", " 99484,\n", " 99485,\n", " 99487,\n", " 66721,\n", " 33954,\n", " 1187,\n", " 99495,\n", " 66728,\n", " 66732,\n", " 66734,\n", " 99510,\n", " 99517,\n", " 66751,\n", " 1223,\n", " 33992,\n", " 99528,\n", " 99537,\n", " 66770,\n", " 99539,\n", " 1237,\n", " 1242,\n", " 66779,\n", " 66780,\n", " 1247,\n", " 99555,\n", " 34024,\n", " 1265,\n", " 34034,\n", " 1270,\n", " 34042,\n", " 1277,\n", " 66814,\n", " 99583,\n", " 34064,\n", " 1297,\n", " 66834,\n", " 66844,\n", " 1312,\n", " 66849,\n", " 1326,\n", " 66866,\n", " 34099,\n", " 99637,\n", " 1334,\n", " 66871,\n", " 66877,\n", " 1342,\n", " 1345,\n", " 99651,\n", " 66887,\n", " 1352,\n", " 1354,\n", " 66895,\n", " 99671,\n", " 66909,\n", " 99681,\n", " 1392,\n", " 66929,\n", " 34162,\n", " 1401,\n", " 1402,\n", " 34169,\n", " 99709,\n", " 1407,\n", " 66944,\n", " 66945,\n", " 1412,\n", " 1415,\n", " 34184,\n", " 1420,\n", " 99729,\n", " 1426,\n", " 34196,\n", " 66971,\n", " 1436,\n", " 66974,\n", " 1443,\n", " 34217,\n", " 66986,\n", " 1452,\n", " 66993,\n", " 1459,\n", " 34227,\n", " 66996,\n", " 34258,\n", " 
99797,\n", " 99803,\n", " 1508,\n", " 99814,\n", " 1512,\n", " 1515,\n", " 67053,\n", " 99821,\n", " 99824,\n", " 99825,\n", " 34303,\n", " 67071,\n", " 34307,\n", " 67077,\n", " 67079,\n", " 99854,\n", " 67092,\n", " 67094,\n", " 99862,\n", " 1565,\n", " 34333,\n", " 1567,\n", " 34338,\n", " 1579,\n", " 1585,\n", " 67127,\n", " 99897,\n", " 1603,\n", " 1604,\n", " 67147,\n", " 67152,\n", " 34388,\n", " 67156,\n", " 99924,\n", " 1623,\n", " 34395,\n", " 99931,\n", " 1635,\n", " 99946,\n", " 67180,\n", " 99949,\n", " 99954,\n", " 34419,\n", " 99957,\n", " 67197,\n", " 34434,\n", " 99970,\n", " 34440,\n", " 1673,\n", " 1675,\n", " 1676,\n", " 34443,\n", " 67212,\n", " 67216,\n", " 34457,\n", " 1691,\n", " 1692,\n", " 1697,\n", " 34473,\n", " 67241,\n", " 34480,\n", " 67250,\n", " 67269,\n", " 1743,\n", " 67284,\n", " 34528,\n", " 1775,\n", " 34553,\n", " 67323,\n", " 1798,\n", " 34572,\n", " 1805,\n", " 34576,\n", " 67344,\n", " 67350,\n", " 1820,\n", " 1821,\n", " 34590,\n", " 1825,\n", " 34606,\n", " 1839,\n", " 67375,\n", " 67379,\n", " 1852,\n", " 1861,\n", " 1866,\n", " 67402,\n", " 1874,\n", " 67416,\n", " 67431,\n", " 1904,\n", " 67441,\n", " 34687,\n", " 1920,\n", " 34689,\n", " 34701,\n", " 34708,\n", " 34710,\n", " 67492,\n", " 67493,\n", " 1970,\n", " 34744,\n", " 34753,\n", " 67531,\n", " 2004,\n", " 2008,\n", " 2009,\n", " 2023,\n", " 67560,\n", " 34793,\n", " 67562,\n", " 34795,\n", " 2037,\n", " 34807,\n", " 2046,\n", " 34819,\n", " 2052,\n", " 34827,\n", " 2068,\n", " 67613,\n", " 34848,\n", " 67626,\n", " 67628,\n", " 34864,\n", " 2099,\n", " 67636,\n", " 34869,\n", " 67641,\n", " 2110,\n", " 67649,\n", " 2117,\n", " 67655,\n", " 67666,\n", " 67669,\n", " 67680,\n", " 34914,\n", " 67682,\n", " 2151,\n", " 34926,\n", " 2160,\n", " 2161,\n", " 34929,\n", " 67699,\n", " 2170,\n", " 2178,\n", " 34949,\n", " 67718,\n", " 2187,\n", " 2193,\n", " 67730,\n", " 67735,\n", " 2216,\n", " 2218,\n", " 2230,\n", " 34998,\n", " 2235,\n", " 67771,\n", " 2244,\n", " 
2247,\n", " 67791,\n", " 2263,\n", " 67800,\n", " 2266,\n", " 35037,\n", " 67810,\n", " 67816,\n", " 35049,\n", " 35055,\n", " 67824,\n", " 67825,\n", " 35059,\n", " 35063,\n", " 2296,\n", " 35067,\n", " 2323,\n", " 2326,\n", " 2330,\n", " 67867,\n", " 2335,\n", " 67886,\n", " 2365,\n", " 2370,\n", " 2372,\n", " 2377,\n", " 67915,\n", " 2380,\n", " 2392,\n", " 67928,\n", " 67934,\n", " 35174,\n", " 35180,\n", " 35188,\n", " 67958,\n", " 35195,\n", " 67968,\n", " 35202,\n", " 2441,\n", " 35211,\n", " 67981,\n", " 35215,\n", " 67984,\n", " 35219,\n", " 35220,\n", " 35228,\n", " 2465,\n", " 2483,\n", " 2484,\n", " 35256,\n", " 68025,\n", " 2510,\n", " 35280,\n", " 35281,\n", " 68066,\n", " 2532,\n", " 35310,\n", " 68084,\n", " 2553,\n", " 68089,\n", " 68097,\n", " 2566,\n", " 35351,\n", " 35358,\n", " 2594,\n", " 2607,\n", " 35378,\n", " 2612,\n", " 68151,\n", " 35385,\n", " 2620,\n", " 35394,\n", " 35395,\n", " 68163,\n", " 68164,\n", " 68169,\n", " 35415,\n", " 2653,\n", " 68199,\n", " 68200,\n", " 68209,\n", " 2677,\n", " 68215,\n", " 35458,\n", " 35459,\n", " 2692,\n", " 35466,\n", " 35472,\n", " 2705,\n", " 35473,\n", " 35474,\n", " 68240,\n", " 2710,\n", " 2712,\n", " 35481,\n", " 2715,\n", " 2721,\n", " 68257,\n", " 68264,\n", " 68265,\n", " 35503,\n", " 2744,\n", " 68290,\n", " 2761,\n", " 68306,\n", " 35539,\n", " 35547,\n", " 35549,\n", " 2786,\n", " 35557,\n", " 35559,\n", " 68329,\n", " 68332,\n", " 68334,\n", " 68337,\n", " 35570,\n", " 2804,\n", " 68343,\n", " 35587,\n", " 2824,\n", " 35603,\n", " 2838,\n", " 68375,\n", " 35613,\n", " 2853,\n", " 35622,\n", " 35634,\n", " 2868,\n", " 35636,\n", " 68408,\n", " 68419,\n", " 35654,\n", " 2887,\n", " 68440,\n", " 68445,\n", " 2916,\n", " 35689,\n", " 68457,\n", " 35697,\n", " 35729,\n", " 2968,\n", " 68506,\n", " 35740,\n", " 68512,\n", " 2981,\n", " 35758,\n", " 2991,\n", " 35759,\n", " 2996,\n", " 3000,\n", " 68537,\n", " 3009,\n", " 68549,\n", " 68556,\n", " 35792,\n", " 35793,\n", " 68562,\n", " 
68568,\n", " 35804,\n", " 35809,\n", " 35810,\n", " 68578,\n", " 3050,\n", " 3053,\n", " 68590,\n", " 3061,\n", " 3067,\n", " 35835,\n", " 35843,\n", " 3076,\n", " 68613,\n", " 35846,\n", " 3081,\n", " 35855,\n", " 68629,\n", " 3094,\n", " 35863,\n", " 35867,\n", " 35888,\n", " 35899,\n", " 35915,\n", " 68683,\n", " 68685,\n", " 3151,\n", " 68687,\n", " 68690,\n", " 68697,\n", " 35932,\n", " 68705,\n", " 3171,\n", " 68708,\n", " 68711,\n", " 68721,\n", " 3187,\n", " 35974,\n", " 68744,\n", " 3210,\n", " 3211,\n", " 35983,\n", " 68761,\n", " 36002,\n", " 3236,\n", " 68774,\n", " 3240,\n", " 68790,\n", " 3256,\n", " 68792,\n", " 3265,\n", " 3269,\n", " 3271,\n", " 68809,\n", " 3275,\n", " 36052,\n", " 3285,\n", " 3286,\n", " 68823,\n", " 3288,\n", " 3291,\n", " 3293,\n", " 3303,\n", " 68855,\n", " 68865,\n", " 3333,\n", " 36107,\n", " 3356,\n", " 3364,\n", " 3375,\n", " 68914,\n", " 68930,\n", " 68948,\n", " 3414,\n", " 36192,\n", " 36197,\n", " 36201,\n", " 3434,\n", " 36211,\n", " 36214,\n", " 3455,\n", " 36226,\n", " 69014,\n", " 69015,\n", " 3482,\n", " 3483,\n", " 69022,\n", " 3496,\n", " 3500,\n", " 36273,\n", " 36277,\n", " 3526,\n", " 3527,\n", " 36294,\n", " 36300,\n", " 3536,\n", " 3539,\n", " 36308,\n", " 3542,\n", " 3543,\n", " 69082,\n", " 36320,\n", " 36321,\n", " 36326,\n", " 3560,\n", " 3572,\n", " 36340,\n", " 36341,\n", " 36346,\n", " 69124,\n", " 3590,\n", " 36370,\n", " 3610,\n", " 3613,\n", " 69158,\n", " 3627,\n", " 3636,\n", " 36404,\n", " 36416,\n", " 36419,\n", " 3652,\n", " 3655,\n", " 3657,\n", " 3669,\n", " 3673,\n", " 36444,\n", " 36446,\n", " 3680,\n", " 69220,\n", " 69221,\n", " 69231,\n", " 36469,\n", " 69244,\n", " 3712,\n", " 3713,\n", " 36481,\n", " 69251,\n", " 36488,\n", " 69264,\n", " 36501,\n", " 36509,\n", " 36510,\n", " 3744,\n", " 69282,\n", " 69287,\n", " 69294,\n", " 36527,\n", " 36528,\n", " 3764,\n", " 3785,\n", " 3791,\n", " 69338,\n", " 36572,\n", " 3807,\n", " 69344,\n", " 3815,\n", " 69365,\n", " 36598,\n", " 
3832,\n", " 3837,\n", " 36619,\n", " 69405,\n", " 3882,\n", " 3883,\n", " 36650,\n", " 69426,\n", " 36687,\n", " 3928,\n", " 3931,\n", " 69474,\n", " 36707,\n", " 69477,\n", " 36710,\n", " 3945,\n", " 36723,\n", " 3960,\n", " 69512,\n", " 3980,\n", " 69516,\n", " 69518,\n", " 69519,\n", " 69523,\n", " 36756,\n", " 36762,\n", " 36765,\n", " 36771,\n", " 36778,\n", " 4026,\n", " 69563,\n", " 69569,\n", " 69578,\n", " 69596,\n", " 36829,\n", " 4069,\n", " 36838,\n", " 69607,\n", " 36841,\n", " 4074,\n", " 36845,\n", " 69617,\n", " 4093,\n", " 36871,\n", " 36874,\n", " 36878,\n", " 36880,\n", " 36888,\n", " 36890,\n", " 4124,\n", " 69662,\n", " 4127,\n", " 36898,\n", " 4133,\n", " 36916,\n", " 36917,\n", " 36924,\n", " 69692,\n", " 69695,\n", " 4161,\n", " 4162,\n", " 4165,\n", " 69707,\n", " 4175,\n", " 36948,\n", " 4182,\n", " 36958,\n", " 69732,\n", " 36974,\n", " 36978,\n", " 69748,\n", " 4220,\n", " 36995,\n", " 69766,\n", " 36999,\n", " 69767,\n", " 69770,\n", " 4238,\n", " 37007,\n", " 69783,\n", " 4250,\n", " 69800,\n", " 69802,\n", " 4267,\n", " 37036,\n", " 37037,\n", " 37039,\n", " 4278,\n", " 37049,\n", " 4282,\n", " 4287,\n", " 4290,\n", " 4291,\n", " 37064,\n", " 4301,\n", " 37072,\n", " 37073,\n", " 69847,\n", " 4312,\n", " 4321,\n", " 4323,\n", " 4324,\n", " 4332,\n", " 4336,\n", " 69874,\n", " 69880,\n", " 37114,\n", " 37120,\n", " 4354,\n", " 4360,\n", " 69898,\n", " 37131,\n", " 69903,\n", " 37150,\n", " 4387,\n", " 69923,\n", " 69928,\n", " 37166,\n", " 69934,\n", " 69944,\n", " 37177,\n", " 4425,\n", " 37193,\n", " 4429,\n", " 37216,\n", " 4463,\n", " 37234,\n", " 37235,\n", " 70002,\n", " 4492,\n", " 4493,\n", " 4500,\n", " 70040,\n", " ...}" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inverted_list[\"发货\"]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3832" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } 
], "source": [ "len(inverted_list)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# data/retrieve/sgns.zhihu.word is a pretrained Chinese word-vector file downloaded from https://github.com/Embedding/Chinese-Word-Vectors\n", "# Load the pretrained word vectors with KeyedVectors.load_word2vec_format()\n", "model = KeyedVectors.load_word2vec_format('data/retrieve/sgns.zhihu.word')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def get_similar_by_word(word, topk):\n", "    '''\n", "    Return the topk words most similar to `word`.\n", "    Returns:\n", "        word_list: the topk most similar words, e.g. [word1, word2, word3, word4, word5] for topk=5\n", "    '''\n", "    similar_words = model.similar_by_word(word, topk)\n", "    # similar_by_word returns (word, similarity) pairs; keep only the words\n", "    word_list = [similar_word for similar_word, _ in similar_words]\n", "    return word_list" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['昨天', '现在', '今天下午', '明天', '今日']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_similar_by_word(\"今天\",5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```TODO2``` Build a new inverted index that takes semantic similarity between words into account" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 3832/3832 [00:44<00:00, 85.74it/s] " ] }, { "name": "stdout", "output_type": "stream", "text": [ "OOV_count: 832\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "# TODO:\n", "# Build a new inverted index and store it in the dict inverted_list_new\n", "# Key: word. Value: the union of the old index entries for word and for its 4 most similar words (word1 to word4)\n", "# Each entry therefore holds the indices of the questions that contain word or one of its 4 nearest words\n", "inverted_list_new = {}\n", "OOV_count = 0\n", "for word in tqdm(inverted_list):\n", "    ### Your code starts here\n", "    # Seed the entry with the word's own postings first, so that words missing from the embedding vocabulary still keep them\n", "    inverted_list_new[word] = set(inverted_list[word])\n", "    try:\n", "        top_4_words = get_similar_by_word(word, 4)\n", "        for t_word in top_4_words:\n", "            if 
t_word in inverted_list:\n", "                inverted_list_new[word] |= inverted_list[t_word]\n", "    except KeyError:\n", "        # similar_by_word raises KeyError when word is not in the embedding vocabulary (OOV)\n", "        OOV_count += 1\n", "print(\"OOV_count:\", OOV_count)\n", "### Your code ends here\n", " " ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{81920,\n", " 16386,\n", " 5,\n", " 65541,\n", " 81927,\n", " 32776,\n", " 81930,\n", " 81935,\n", " 17,\n", " 18,\n", " 65554,\n", " 16401,\n", " 81947,\n", " 98331,\n", " 29,\n", " 65566,\n", " 32800,\n", " 81953,\n", " 32803,\n", " 98339,\n", " 81959,\n", " 32810,\n", " 98346,\n", " 49194,\n", " 32818,\n", " 16435,\n", " 55,\n", " 49209,\n", " 98366,\n", " 64,\n", " 49219,\n", " 65604,\n", " 81988,\n", " 16458,\n", " 65611,\n", " 81995,\n", " 81998,\n", " 16463,\n", " 16464,\n", " 49233,\n", " 32850,\n", " 98387,\n", " 49234,\n", " 82004,\n", " 86,\n", " 81999,\n", " 98386,\n", " 16475,\n", " 32859,\n", " 49245,\n", " 98398,\n", " 65631,\n", " 82015,\n", " 65630,\n", " 102,\n", " 65639,\n", " 65640,\n", " 49259,\n", " 65646,\n", " 98415,\n", " 98416,\n", " 49263,\n", " 16495,\n", " 49267,\n", " 16500,\n", " 82035,\n", " 118,\n", " 16503,\n", " 65650,\n", " 65656,\n", " 122,\n", " 65659,\n", " 49275,\n", " 125,\n", " 32894,\n", " 65660,\n", " 133,\n", " 65669,\n", " 65670,\n", " 65671,\n", " 32902,\n", " 139,\n", " 49293,\n", " 142,\n", " 65679,\n", " 32912,\n", " 98451,\n", " 150,\n", " 151,\n", " 32929,\n", " 49318,\n", " 49320,\n", " 65708,\n", " 82092,\n", " 82093,\n", " 65711,\n", " 98484,\n", " 98489,\n", " 49337,\n", " 187,\n", " 49340,\n", " 32957,\n", " 200,\n", " 16588,\n", " 32973,\n", " 65742,\n", " 16589,\n", " 98518,\n", " 16598,\n", " 49366,\n", " 82137,\n", " 65755,\n", " 220,\n", " 223,\n", " 65764,\n", " 82149,\n", " 82155,\n", " 16621,\n", " 49396,\n", " 65783,\n", " 33017,\n", " 65786,\n", " 254,\n", " 65790,\n", " 82176,\n", " 261,\n", " 65798,\n", " 65806,\n", " 65810,\n", " 275,\n", " 279,\n", " 98586,\n", " 
82208,\n", " 49442,\n", " 65833,\n", " 33068,\n", " 82220,\n", " 65838,\n", " 82221,\n", " 82224,\n", " 33073,\n", " 65843,\n", " 65844,\n", " 16691,\n", " 310,\n", " 16696,\n", " 82234,\n", " 65852,\n", " 49468,\n", " 318,\n", " 49471,\n", " 49472,\n", " 16706,\n", " 49475,\n", " 65862,\n", " 65863,\n", " 16712,\n", " 82248,\n", " 65866,\n", " 49483,\n", " 49493,\n", " 16726,\n", " 344,\n", " 16730,\n", " 65883,\n", " 82268,\n", " 65885,\n", " 350,\n", " 33120,\n", " 16745,\n", " 364,\n", " 98668,\n", " 65903,\n", " 33140,\n", " 98678,\n", " 65913,\n", " 33149,\n", " 49535,\n", " 33158,\n", " 49544,\n", " 82313,\n", " 33163,\n", " 16783,\n", " 33168,\n", " 401,\n", " 82322,\n", " 98709,\n", " 49558,\n", " 98715,\n", " 16796,\n", " 49565,\n", " 82333,\n", " 82334,\n", " 33189,\n", " 33191,\n", " 65960,\n", " 33193,\n", " 49579,\n", " 16812,\n", " 98740,\n", " 65973,\n", " 16822,\n", " 82359,\n", " 16825,\n", " 33210,\n", " 16827,\n", " 82365,\n", " 98751,\n", " 16832,\n", " 98754,\n", " 33220,\n", " 453,\n", " 49604,\n", " 98763,\n", " 461,\n", " 469,\n", " 49623,\n", " 16856,\n", " 33244,\n", " 49629,\n", " 16867,\n", " 16869,\n", " 98794,\n", " 82410,\n", " 82412,\n", " 495,\n", " 16882,\n", " 98803,\n", " 49651,\n", " 49656,\n", " 33273,\n", " 16889,\n", " 49658,\n", " 33276,\n", " 82426,\n", " 66046,\n", " 507,\n", " 16895,\n", " 49669,\n", " 82437,\n", " 33290,\n", " 49674,\n", " 16911,\n", " 530,\n", " 33300,\n", " 66069,\n", " 16918,\n", " 16922,\n", " 66077,\n", " 542,\n", " 543,\n", " 82463,\n", " 66081,\n", " 16932,\n", " 66091,\n", " 556,\n", " 66093,\n", " 66094,\n", " 98860,\n", " 82475,\n", " 66097,\n", " 49709,\n", " 16942,\n", " 49715,\n", " 49716,\n", " 66106,\n", " 98877,\n", " 66111,\n", " 49729,\n", " 33346,\n", " 579,\n", " 33347,\n", " 82499,\n", " 49735,\n", " 16967,\n", " 49739,\n", " 588,\n", " 16983,\n", " 82522,\n", " 16991,\n", " 82528,\n", " 49761,\n", " 49762,\n", " 66151,\n", " 66152,\n", " 66153,\n", " 17003,\n", " 49777,\n", " 
98932,\n", " 17012,\n", " 66166,\n", " 17020,\n", " 17021,\n", " 639,\n", " 640,\n", " 49793,\n", " 642,\n", " 17026,\n", " 98948,\n", " 17027,\n", " 49796,\n", " 82563,\n", " 17030,\n", " 66185,\n", " 82566,\n", " 651,\n", " 33422,\n", " 82576,\n", " 98962,\n", " 33427,\n", " 17043,\n", " 49812,\n", " 82584,\n", " 98970,\n", " 17050,\n", " 17052,\n", " 66207,\n", " 82592,\n", " 673,\n", " 33441,\n", " 17057,\n", " 17061,\n", " 33446,\n", " 49831,\n", " 66221,\n", " 82605,\n", " 687,\n", " 66224,\n", " 49841,\n", " 33453,\n", " 33459,\n", " 17076,\n", " 66228,\n", " 694,\n", " 33464,\n", " 33466,\n", " 33468,\n", " 99005,\n", " 66238,\n", " 99007,\n", " 99006,\n", " 82628,\n", " 710,\n", " 82630,\n", " 82632,\n", " 99017,\n", " 711,\n", " 718,\n", " 82639,\n", " 49872,\n", " 82645,\n", " 82652,\n", " 49889,\n", " 82658,\n", " 99044,\n", " 743,\n", " 17127,\n", " 33513,\n", " 49897,\n", " 17132,\n", " 749,\n", " 49901,\n", " 49903,\n", " 82674,\n", " 33523,\n", " 17142,\n", " 759,\n", " 17144,\n", " 33527,\n", " 66299,\n", " 99068,\n", " 33535,\n", " 17152,\n", " 769,\n", " 33539,\n", " 66307,\n", " 66309,\n", " 82694,\n", " 17162,\n", " 99084,\n", " 49932,\n", " 17167,\n", " 49935,\n", " 66321,\n", " 99090,\n", " 66323,\n", " 783,\n", " 82713,\n", " 99098,\n", " 66333,\n", " 66334,\n", " 82719,\n", " 800,\n", " 66337,\n", " 17185,\n", " 49954,\n", " 17190,\n", " 33576,\n", " 49962,\n", " 17196,\n", " 99120,\n", " 99122,\n", " 33587,\n", " 49972,\n", " 49974,\n", " 17207,\n", " 33592,\n", " 33593,\n", " 33600,\n", " 33604,\n", " 99140,\n", " 82757,\n", " 841,\n", " 49994,\n", " 845,\n", " 99149,\n", " 17235,\n", " 82771,\n", " 66389,\n", " 854,\n", " 82775,\n", " 33624,\n", " 33625,\n", " 858,\n", " 66395,\n", " 99163,\n", " 33629,\n", " 17245,\n", " 82782,\n", " 17249,\n", " 17251,\n", " 82787,\n", " 99173,\n", " 33638,\n", " 17255,\n", " 874,\n", " 66411,\n", " 82794,\n", " 17259,\n", " 878,\n", " 33647,\n", " 66415,\n", " 66417,\n", " 33650,\n", " 66418,\n", " 
99182,\n", " 82798,\n", " 33652,\n", " 891,\n", " 17275,\n", " 82812,\n", " 33662,\n", " 82813,\n", " 17283,\n", " 66436,\n", " 901,\n", " 99208,\n", " 66441,\n", " 33674,\n", " 33675,\n", " 17291,\n", " 82829,\n", " 33682,\n", " 916,\n", " 66452,\n", " 17304,\n", " 33688,\n", " 33691,\n", " 33692,\n", " 66461,\n", " 66463,\n", " 82847,\n", " 33697,\n", " 99234,\n", " 931,\n", " 82848,\n", " 17317,\n", " 938,\n", " 17324,\n", " 82863,\n", " 944,\n", " 66480,\n", " 33716,\n", " 99254,\n", " 82878,\n", " 959,\n", " 33728,\n", " 99267,\n", " 33736,\n", " 33741,\n", " 17359,\n", " 978,\n", " 17364,\n", " 33750,\n", " 82905,\n", " 17371,\n", " 992,\n", " 17378,\n", " 33765,\n", " 99301,\n", " 1000,\n", " 50152,\n", " 50155,\n", " 1005,\n", " 99309,\n", " 50158,\n", " 82928,\n", " 66542,\n", " 33780,\n", " 99318,\n", " 1017,\n", " 33787,\n", " 66557,\n", " 1024,\n", " 17408,\n", " 17409,\n", " 1027,\n", " 50179,\n", " 82949,\n", " 82950,\n", " 1034,\n", " 50187,\n", " 1036,\n", " 50191,\n", " 66577,\n", " 66580,\n", " 99351,\n", " 82969,\n", " 82970,\n", " 17437,\n", " 99360,\n", " 1057,\n", " 33825,\n", " 99363,\n", " 82977,\n", " 82979,\n", " 33831,\n", " 66600,\n", " 17447,\n", " 17448,\n", " 33837,\n", " 50221,\n", " 66607,\n", " 99376,\n", " 33841,\n", " 82993,\n", " 99380,\n", " 66613,\n", " 82996,\n", " 82998,\n", " 33849,\n", " 83003,\n", " 66622,\n", " 1087,\n", " 1088,\n", " 99391,\n", " 17473,\n", " 83011,\n", " 50245,\n", " 1095,\n", " 1098,\n", " 66637,\n", " 1102,\n", " 99405,\n", " 50256,\n", " 99410,\n", " 66644,\n", " 99413,\n", " 66646,\n", " 99419,\n", " 66652,\n", " 83037,\n", " 1118,\n", " 83040,\n", " 99431,\n", " 66663,\n", " 1129,\n", " 17516,\n", " 83052,\n", " 66670,\n", " 33900,\n", " 50290,\n", " 17524,\n", " 33912,\n", " 83065,\n", " 33914,\n", " 66685,\n", " 83071,\n", " 17537,\n", " 50309,\n", " 33927,\n", " 50314,\n", " 99467,\n", " 99470,\n", " 50318,\n", " 99472,\n", " 66705,\n", " 83088,\n", " 66708,\n", " 66709,\n", " 50327,\n", " 
99484,\n", " 99485,\n", " 83101,\n", " 99487,\n", " 66721,\n", " 33954,\n", " 1187,\n", " 83105,\n", " 50337,\n", " 50341,\n", " 99495,\n", " 66728,\n", " 50345,\n", " 50346,\n", " 17578,\n", " 66732,\n", " 83116,\n", " 66734,\n", " 50351,\n", " 50356,\n", " 99510,\n", " 17591,\n", " 50363,\n", " 99517,\n", " 66751,\n", " 17599,\n", " 1223,\n", " 33992,\n", " 99528,\n", " 17607,\n", " 50376,\n", " 50379,\n", " 17616,\n", " 99537,\n", " 66770,\n", " 99539,\n", " 1237,\n", " 1242,\n", " 66779,\n", " 66780,\n", " 17628,\n", " 1247,\n", " 17631,\n", " 83168,\n", " 50402,\n", " 99555,\n", " 50405,\n", " 34024,\n", " 50409,\n", " 50411,\n", " 1265,\n", " 34034,\n", " 83186,\n", " 17651,\n", " 50419,\n", " 1270,\n", " 34042,\n", " 99580,\n", " 1277,\n", " 66814,\n", " 99583,\n", " 83200,\n", " 17671,\n", " 50441,\n", " 17677,\n", " 34064,\n", " 1297,\n", " 66834,\n", " 50448,\n", " 50449,\n", " 50456,\n", " 66844,\n", " 83229,\n", " 83231,\n", " 1312,\n", " 66849,\n", " 50464,\n", " 17698,\n", " 17699,\n", " 83236,\n", " 17702,\n", " 1326,\n", " 99632,\n", " 17713,\n", " 66866,\n", " 34099,\n", " 83252,\n", " 99637,\n", " 1334,\n", " 66871,\n", " 83258,\n", " 66877,\n", " 1342,\n", " 50493,\n", " 17728,\n", " 1345,\n", " 99651,\n", " 66887,\n", " 1352,\n", " 50503,\n", " 1354,\n", " 17737,\n", " 50508,\n", " 66895,\n", " 99671,\n", " 66909,\n", " 34143,\n", " 99681,\n", " 83300,\n", " 17770,\n", " 1392,\n", " 66929,\n", " 34162,\n", " 17777,\n", " 1401,\n", " 1402,\n", " 34169,\n", " 50554,\n", " 99709,\n", " 50557,\n", " 1407,\n", " 66944,\n", " 66945,\n", " 17788,\n", " 1412,\n", " 1415,\n", " 34184,\n", " 1420,\n", " 99729,\n", " 1426,\n", " 34196,\n", " 83351,\n", " 66971,\n", " 1436,\n", " 83357,\n", " 66974,\n", " 17821,\n", " 1443,\n", " 83363,\n", " 50598,\n", " 34217,\n", " 66986,\n", " 83369,\n", " 1452,\n", " 66993,\n", " 50609,\n", " 1459,\n", " 34227,\n", " 66996,\n", " 83385,\n", " 50618,\n", " 50626,\n", " 50627,\n", " 83406,\n", " 83407,\n", " 34258,\n", " 
50643,\n", " 83412,\n", " 99797,\n", " 50647,\n", " 99803,\n", " 50652,\n", " 50654,\n", " 1508,\n", " 50661,\n", " 99814,\n", " 1512,\n", " 1515,\n", " 50667,\n", " 67053,\n", " 99821,\n", " 17901,\n", " 99824,\n", " 99825,\n", " 83438,\n", " 50672,\n", " 83441,\n", " 50677,\n", " 17910,\n", " 83446,\n", " 83447,\n", " 83448,\n", " 34292,\n", " 17916,\n", " 17918,\n", " 34303,\n", " 67071,\n", " 34307,\n", " 83460,\n", " 67077,\n", " 1540,\n", " 67079,\n", " 34310,\n", " 99854,\n", " 83473,\n", " 67092,\n", " 67094,\n", " 99862,\n", " 67100,\n", " 1565,\n", " 34333,\n", " 1567,\n", " 34338,\n", " 99877,\n", " 1579,\n", " 83502,\n", " 83504,\n", " 1585,\n", " 67127,\n", " 99897,\n", " 50753,\n", " 83521,\n", " 1603,\n", " 1604,\n", " 17991,\n", " 17992,\n", " 50760,\n", " 83527,\n", " 67147,\n", " 17996,\n", " 67152,\n", " 34388,\n", " 67156,\n", " 99924,\n", " 1623,\n", " 34395,\n", " 99931,\n", " 18016,\n", " 1635,\n", " 99946,\n", " 67180,\n", " 99949,\n", " 99954,\n", " 34419,\n", " 83571,\n", " 99957,\n", " 83573,\n", " 83572,\n", " 18043,\n", " 50811,\n", " 67197,\n", " 34434,\n", " 99970,\n", " 18053,\n", " 83590,\n", " 34440,\n", " 1673,\n", " 50826,\n", " 1675,\n", " 1676,\n", " 34443,\n", " 67212,\n", " 67216,\n", " 18065,\n", " 18066,\n", " 18068,\n", " 18069,\n", " 50838,\n", " 50839,\n", " 34457,\n", " 50841,\n", " 1691,\n", " 1692,\n", " 83615,\n", " 1697,\n", " 50855,\n", " 34473,\n", " 67241,\n", " 50861,\n", " 34480,\n", " 67250,\n", " 18104,\n", " 18108,\n", " 83645,\n", " 18112,\n", " 18116,\n", " 67269,\n", " 1743,\n", " 83667,\n", " 67284,\n", " 50904,\n", " 83674,\n", " 50910,\n", " 34528,\n", " 34529,\n", " 18146,\n", " 50917,\n", " 83688,\n", " 50923,\n", " 1775,\n", " 18160,\n", " 18168,\n", " 34553,\n", " 67322,\n", " 67323,\n", " 18171,\n", " 50939,\n", " 67327,\n", " 50944,\n", " 50947,\n", " 1798,\n", " 83718,\n", " 34572,\n", " 1805,\n", " 83724,\n", " 34576,\n", " 67344,\n", " 83730,\n", " 67350,\n", " 1820,\n", " 1821,\n", " 
34590,\n", " 50972,\n", " 1825,\n", " 50986,\n", " 50988,\n", " 34606,\n", " 1839,\n", " 67375,\n", " 50990,\n", " 50993,\n", " 67379,\n", " 83767,\n", " 51000,\n", " 1852,\n", " 34620,\n", " 83774,\n", " 18239,\n", " 1861,\n", " 18245,\n", " 1866,\n", " 67402,\n", " 1874,\n", " 83795,\n", " 83799,\n", " 67416,\n", " 18270,\n", " 51039,\n", " 83807,\n", " 83810,\n", " 51043,\n", " 51046,\n", " 67431,\n", " 51047,\n", " 1904,\n", " 67441,\n", " 18289,\n", " 34687,\n", " 1920,\n", " 34689,\n", " 51071,\n", " 51074,\n", " 1919,\n", " 83846,\n", " 18314,\n", " 51084,\n", " 34701,\n", " 18319,\n", " 34708,\n", " 34710,\n", " 18326,\n", " 51095,\n", " 67492,\n", " 67493,\n", " 51108,\n", " 1959,\n", " 34725,\n", " 18348,\n", " 67500,\n", " 83887,\n", " 1970,\n", " 83895,\n", " 34744,\n", " 51128,\n", " 83898,\n", " 83902,\n", " 34753,\n", " 83906,\n", " 51140,\n", " 18373,\n", " 67531,\n", " 18384,\n", " 2004,\n", " 51159,\n", " 2008,\n", " 2009,\n", " 18397,\n", " 18401,\n", " 83940,\n", " 2023,\n", " 67560,\n", " 34793,\n", " 67562,\n", " 34795,\n", " 51175,\n", " 83951,\n", " 51184,\n", " 83953,\n", " 51185,\n", " ...}" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inverted_list_new[\"发货\"]" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "# Save the new inverted index to data/retrieve/invertedList.pkl\n", "with open('data/retrieve/invertedList.pkl','wb') as f:\n", "    pickle.dump(inverted_list_new,f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cells below are tests. After completing the steps above, run them to check that retrieval works correctly." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "# This cell is copied from preprocessor.ipynb and contains the key preprocessing functions\n", "import emoji\n", "import re\n", "import jieba\n", "def clean(content):\n", "    # Convert emoji to their text names, then remove anything between '<' and '>' (greedy match)\n", "    content = emoji.demojize(content)\n", "    content = re.sub('<.*>','',content)\n", "    return content\n", 
"# Tokenizes a sentence. preprocessor.ipynb skipped this step because its data was already tokenized, but it is required for a new query\n", "def question_cut(content):\n", "    return list(jieba.cut(content))\n", "def strip(wordList):\n", "    # Trim surrounding whitespace and drop tokens that end up empty\n", "    return [word.strip() for word in wordList if word.strip()!='']\n", "with open(\"data/stopWord.json\",\"r\") as f:\n", "    # The file holds one stop word per line; a set gives constant-time membership tests\n", "    stopWords = set(f.read().split(\"\\n\"))\n", "def rm_stop_word(wordList):\n", "    return [word for word in wordList if word not in stopWords]" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "# Load the inverted index from data/retrieve/invertedList.pkl into the variable invertedList\n", "with open('data/retrieve/invertedList.pkl','rb') as f:\n", "    invertedList = pickle.load(f)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "def get_retrieve_result(sentence):\n", "    '''\n", "    Given a sentence, use the inverted index to quickly retrieve the indices of candidate questions similar to it.\n", "    The candidates are the questions that contain any word of the sentence, or any word similar in meaning to one of them.\n", "    '''\n", "    sentence = clean(sentence)\n", "    sentence = question_cut(sentence)\n", "    sentence = strip(sentence)\n", "    sentence = rm_stop_word(sentence)\n", "    candidate = set()\n", "    for word in sentence:\n", "        if word in invertedList:\n", "            # Union the postings of every query word found in the index\n", "            candidate |= invertedList[word]\n", "    return candidate" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{81920,\n", " 16386,\n", " 65541,\n", " 5,\n", " 81927,\n", " 32776,\n", " 81930,\n", " 81935,\n", " 17,\n", " 18,\n", " 65554,\n", " 16401,\n", " 98331,\n", " 81947,\n", " 29,\n", " 65566,\n", " 32800,\n", " 81953,\n", " 32803,\n", " 98339,\n", " 81959,\n", " 32810,\n", " 98346,\n", " 49194,\n", " 32818,\n", " 16435,\n", " 55,\n", " 49209,\n", " 98366,\n", " 64,\n", " 49219,\n", " 65604,\n", " 81988,\n", " 16458,\n", " 65611,\n", " 81995,\n", " 81998,\n", " 16463,\n", " 16464,\n", " 49233,\n", " 32850,\n", " 98387,\n", " 98386,\n", " 49234,\n", " 86,\n", " 81999,\n", " 82004,\n", " 32859,\n", " 16475,\n", " 49245,\n", " 98398,\n", 
" 65631,\n", " 65630,\n", " 82015,\n", " 102,\n", " 65639,\n", " 65640,\n", " 49259,\n", " 65646,\n", " 98415,\n", " 98416,\n", " 49263,\n", " 65650,\n", " 16495,\n", " 49267,\n", " 16500,\n", " 118,\n", " 82035,\n", " 65656,\n", " 16503,\n", " 122,\n", " 65659,\n", " 65660,\n", " 125,\n", " 32894,\n", " 49275,\n", " 133,\n", " 65669,\n", " 65670,\n", " 65671,\n", " 32902,\n", " 139,\n", " 49293,\n", " 142,\n", " 65679,\n", " 32912,\n", " 98451,\n", " 150,\n", " 151,\n", " 32929,\n", " 49318,\n", " 49320,\n", " 65708,\n", " 82092,\n", " 82093,\n", " 65711,\n", " 98484,\n", " 98489,\n", " 49337,\n", " 187,\n", " 49340,\n", " 32957,\n", " 200,\n", " 16588,\n", " 32973,\n", " 65742,\n", " 16589,\n", " 98518,\n", " 16598,\n", " 49366,\n", " 82137,\n", " 65755,\n", " 220,\n", " 223,\n", " 65764,\n", " 82149,\n", " 82155,\n", " 16621,\n", " 49396,\n", " 65783,\n", " 33017,\n", " 65786,\n", " 254,\n", " 65790,\n", " 82176,\n", " 261,\n", " 65798,\n", " 65806,\n", " 65810,\n", " 275,\n", " 279,\n", " 98586,\n", " 82208,\n", " 49442,\n", " 65833,\n", " 33068,\n", " 82220,\n", " 65838,\n", " 82221,\n", " 82224,\n", " 33073,\n", " 65843,\n", " 65844,\n", " 16691,\n", " 310,\n", " 16696,\n", " 82234,\n", " 65852,\n", " 49468,\n", " 318,\n", " 49471,\n", " 49472,\n", " 16706,\n", " 49475,\n", " 65862,\n", " 65863,\n", " 16712,\n", " 82248,\n", " 65866,\n", " 49483,\n", " 49493,\n", " 16726,\n", " 344,\n", " 16730,\n", " 65883,\n", " 82268,\n", " 65885,\n", " 350,\n", " 33120,\n", " 16745,\n", " 364,\n", " 98668,\n", " 65903,\n", " 33140,\n", " 98678,\n", " 65913,\n", " 33149,\n", " 49535,\n", " 33158,\n", " 49544,\n", " 82313,\n", " 33163,\n", " 16783,\n", " 33168,\n", " 401,\n", " 82322,\n", " 98709,\n", " 49558,\n", " 98715,\n", " 16796,\n", " 49565,\n", " 82333,\n", " 82334,\n", " 33189,\n", " 33191,\n", " 65960,\n", " 33193,\n", " 49579,\n", " 16812,\n", " 98740,\n", " 65973,\n", " 16822,\n", " 82359,\n", " 16825,\n", " 33210,\n", " 16827,\n", " 82365,\n", " 98751,\n", " 
16832,\n", " 98754,\n", " 33220,\n", " 453,\n", " 49604,\n", " 98763,\n", " 461,\n", " 469,\n", " 49623,\n", " 16856,\n", " 33244,\n", " 49629,\n", " 16867,\n", " 16869,\n", " 98794,\n", " 82410,\n", " 82412,\n", " 495,\n", " 16882,\n", " 98803,\n", " 49651,\n", " 49656,\n", " 33273,\n", " 16889,\n", " 507,\n", " 33276,\n", " 82426,\n", " 66046,\n", " 49658,\n", " 16895,\n", " 49669,\n", " 82437,\n", " 33290,\n", " 49674,\n", " 16911,\n", " 530,\n", " 33300,\n", " 66069,\n", " 16918,\n", " 16922,\n", " 66077,\n", " 542,\n", " 543,\n", " 82463,\n", " 66081,\n", " 16932,\n", " 66091,\n", " 556,\n", " 66093,\n", " 66094,\n", " 98860,\n", " 33324,\n", " 66097,\n", " 82475,\n", " 49709,\n", " 16942,\n", " 49715,\n", " 49716,\n", " 66106,\n", " 98877,\n", " 66111,\n", " 49729,\n", " 33346,\n", " 579,\n", " 33347,\n", " 82499,\n", " 49735,\n", " 16967,\n", " 49739,\n", " 588,\n", " 16983,\n", " 82522,\n", " 16991,\n", " 82528,\n", " 49761,\n", " 49762,\n", " 66151,\n", " 66152,\n", " 66153,\n", " 17003,\n", " 49777,\n", " 98932,\n", " 17012,\n", " 66166,\n", " 17020,\n", " 17021,\n", " 639,\n", " 640,\n", " 49793,\n", " 642,\n", " 17026,\n", " 98948,\n", " 17027,\n", " 82563,\n", " 49796,\n", " 17030,\n", " 66185,\n", " 82566,\n", " 651,\n", " 33422,\n", " 82576,\n", " 98962,\n", " 33427,\n", " 17043,\n", " 49812,\n", " 82584,\n", " 98970,\n", " 17050,\n", " 17052,\n", " 66207,\n", " 82592,\n", " 673,\n", " 33441,\n", " 17057,\n", " 17061,\n", " 33446,\n", " 49831,\n", " 66221,\n", " 33453,\n", " 687,\n", " 66224,\n", " 82605,\n", " 49841,\n", " 33459,\n", " 66228,\n", " 17076,\n", " 694,\n", " 33464,\n", " 33466,\n", " 33468,\n", " 99005,\n", " 66238,\n", " 99007,\n", " 99006,\n", " 82628,\n", " 710,\n", " 711,\n", " 82630,\n", " 99017,\n", " 82632,\n", " 718,\n", " 82639,\n", " 49872,\n", " 82645,\n", " 82652,\n", " 49889,\n", " 82658,\n", " 99044,\n", " 743,\n", " 17127,\n", " 33513,\n", " 49897,\n", " 17132,\n", " 749,\n", " 49901,\n", " 49903,\n", " 82674,\n", " 
33523,\n", " 17142,\n", " 759,\n", " 33527,\n", " 17144,\n", " 66299,\n", " 99068,\n", " 33535,\n", " 17152,\n", " 769,\n", " 33539,\n", " 66307,\n", " 66309,\n", " 82694,\n", " 17162,\n", " 99084,\n", " 49932,\n", " 783,\n", " 17167,\n", " 66321,\n", " 99090,\n", " 66323,\n", " 49935,\n", " 82713,\n", " 99098,\n", " 66333,\n", " 66334,\n", " 82719,\n", " 800,\n", " 66337,\n", " 17185,\n", " 49954,\n", " 17190,\n", " 33576,\n", " 49962,\n", " 17196,\n", " 99120,\n", " 99122,\n", " 33587,\n", " 49972,\n", " 49974,\n", " 17207,\n", " 33592,\n", " 33593,\n", " 33600,\n", " 33604,\n", " 99140,\n", " 82757,\n", " 841,\n", " 49994,\n", " 845,\n", " 99149,\n", " 17235,\n", " 82771,\n", " 66389,\n", " 854,\n", " 82775,\n", " 33624,\n", " 33625,\n", " 858,\n", " 66395,\n", " 99163,\n", " 33629,\n", " 17245,\n", " 82782,\n", " 17249,\n", " 17251,\n", " 82787,\n", " 99173,\n", " 33638,\n", " 17255,\n", " 874,\n", " 66411,\n", " 82794,\n", " 17259,\n", " 878,\n", " 33647,\n", " 66415,\n", " 66417,\n", " 33650,\n", " 66418,\n", " 99182,\n", " 33652,\n", " 82798,\n", " 891,\n", " 17275,\n", " 82812,\n", " 33662,\n", " 82813,\n", " 17283,\n", " 66436,\n", " 901,\n", " 99208,\n", " 66441,\n", " 33674,\n", " 33675,\n", " 17291,\n", " 82829,\n", " 33682,\n", " 916,\n", " 66452,\n", " 33688,\n", " 17304,\n", " 33691,\n", " 33692,\n", " 66461,\n", " 66463,\n", " 82847,\n", " 33697,\n", " 99234,\n", " 931,\n", " 82848,\n", " 17317,\n", " 938,\n", " 17324,\n", " 82863,\n", " 944,\n", " 66480,\n", " 33716,\n", " 99254,\n", " 82878,\n", " 959,\n", " 33728,\n", " 99267,\n", " 33736,\n", " 33741,\n", " 17359,\n", " 978,\n", " 17364,\n", " 33750,\n", " 82905,\n", " 17371,\n", " 992,\n", " 17378,\n", " 33765,\n", " 99301,\n", " 1000,\n", " 50152,\n", " 50155,\n", " 1005,\n", " 99309,\n", " 66542,\n", " 50158,\n", " 82928,\n", " 33780,\n", " 99318,\n", " 1017,\n", " 33787,\n", " 66557,\n", " 1024,\n", " 17408,\n", " 17409,\n", " 1027,\n", " 50179,\n", " 82949,\n", " 82950,\n", " 1034,\n", " 
50187,\n", " 1036,\n", " 50191,\n", " 66577,\n", " 66580,\n", " 99351,\n", " 82969,\n", " 82970,\n", " 17437,\n", " 99360,\n", " 1057,\n", " 33825,\n", " 99363,\n", " 82977,\n", " 82979,\n", " 33831,\n", " 66600,\n", " 17447,\n", " 17448,\n", " 33837,\n", " 50221,\n", " 66607,\n", " 99376,\n", " 33841,\n", " 82993,\n", " 99380,\n", " 66613,\n", " 82996,\n", " 82998,\n", " 33849,\n", " 83003,\n", " 66622,\n", " 1087,\n", " 1088,\n", " 99391,\n", " 17473,\n", " 83011,\n", " 50245,\n", " 1095,\n", " 1098,\n", " 66637,\n", " 1102,\n", " 99405,\n", " 50256,\n", " 99410,\n", " 66644,\n", " 99413,\n", " 66646,\n", " 99419,\n", " 66652,\n", " 83037,\n", " 1118,\n", " 83040,\n", " 99431,\n", " 66663,\n", " 1129,\n", " 33900,\n", " 17516,\n", " 66670,\n", " 83052,\n", " 50290,\n", " 17524,\n", " 33912,\n", " 83065,\n", " 33914,\n", " 66685,\n", " 83071,\n", " 17537,\n", " 50309,\n", " 33927,\n", " 50314,\n", " 99467,\n", " 99470,\n", " 50318,\n", " 99472,\n", " 66705,\n", " 83088,\n", " 66708,\n", " 66709,\n", " 50327,\n", " 99484,\n", " 99485,\n", " 83101,\n", " 99487,\n", " 66721,\n", " 33954,\n", " 1187,\n", " 83105,\n", " 50337,\n", " 50341,\n", " 99495,\n", " 66728,\n", " 50345,\n", " 50346,\n", " 17578,\n", " 66732,\n", " 83116,\n", " 66734,\n", " 50351,\n", " 50356,\n", " 99510,\n", " 17591,\n", " 50363,\n", " 99517,\n", " 66751,\n", " 17599,\n", " 1223,\n", " 33992,\n", " 99528,\n", " 17607,\n", " 50376,\n", " 50379,\n", " 17616,\n", " 99537,\n", " 66770,\n", " 99539,\n", " 1237,\n", " 1242,\n", " 66779,\n", " 66780,\n", " 17628,\n", " 1247,\n", " 17631,\n", " 83168,\n", " 50402,\n", " 99555,\n", " 50405,\n", " 34024,\n", " 50409,\n", " 50411,\n", " 1265,\n", " 34034,\n", " 83186,\n", " 17651,\n", " 50419,\n", " 1270,\n", " 34042,\n", " 99580,\n", " 1277,\n", " 66814,\n", " 99583,\n", " 83200,\n", " 17671,\n", " 50441,\n", " 17677,\n", " 34064,\n", " 1297,\n", " 66834,\n", " 50448,\n", " 50449,\n", " 50456,\n", " 66844,\n", " 83229,\n", " 83231,\n", " 1312,\n", " 
66849,\n", " 50464,\n", " 17698,\n", " 17699,\n", " 83236,\n", " 17702,\n", " 1326,\n", " 99632,\n", " 17713,\n", " 66866,\n", " 34099,\n", " 83252,\n", " 99637,\n", " 1334,\n", " 66871,\n", " 83258,\n", " 66877,\n", " 1342,\n", " 50493,\n", " 17728,\n", " 1345,\n", " 99651,\n", " 66887,\n", " 1352,\n", " 50503,\n", " 1354,\n", " 17737,\n", " 50508,\n", " 66895,\n", " 99671,\n", " 66909,\n", " 34143,\n", " 99681,\n", " 83300,\n", " 17770,\n", " 1392,\n", " 66929,\n", " 34162,\n", " 17777,\n", " 1401,\n", " 1402,\n", " 34169,\n", " 50554,\n", " 99709,\n", " 17788,\n", " 1407,\n", " 66944,\n", " 66945,\n", " 50557,\n", " 1412,\n", " 1415,\n", " 34184,\n", " 1420,\n", " 99729,\n", " 1426,\n", " 34196,\n", " 83351,\n", " 66971,\n", " 1436,\n", " 83357,\n", " 66974,\n", " 17821,\n", " 1443,\n", " 83363,\n", " 50598,\n", " 34217,\n", " 66986,\n", " 83369,\n", " 1452,\n", " 66993,\n", " 50609,\n", " 1459,\n", " 34227,\n", " 66996,\n", " 83385,\n", " 50618,\n", " 50626,\n", " 50627,\n", " 83406,\n", " 83407,\n", " 34258,\n", " 50643,\n", " 83412,\n", " 99797,\n", " 50647,\n", " 99803,\n", " 50652,\n", " 50654,\n", " 1508,\n", " 50661,\n", " 99814,\n", " 1512,\n", " 1515,\n", " 50667,\n", " 67053,\n", " 99821,\n", " 17901,\n", " 99824,\n", " 99825,\n", " 83438,\n", " 50672,\n", " 34292,\n", " 83441,\n", " 50677,\n", " 17910,\n", " 83446,\n", " 83447,\n", " 83448,\n", " 17916,\n", " 17918,\n", " 34303,\n", " 67071,\n", " 34307,\n", " 1540,\n", " 67077,\n", " 34310,\n", " 67079,\n", " 83460,\n", " 99854,\n", " 83473,\n", " 67092,\n", " 67094,\n", " 99862,\n", " 67100,\n", " 1565,\n", " 34333,\n", " 1567,\n", " 34338,\n", " 99877,\n", " 1579,\n", " 83502,\n", " 83504,\n", " 1585,\n", " 67127,\n", " 99897,\n", " 50753,\n", " 83521,\n", " 1603,\n", " 1604,\n", " 17991,\n", " 17992,\n", " 50760,\n", " 83527,\n", " 67147,\n", " 17996,\n", " 67152,\n", " 34388,\n", " 67156,\n", " 99924,\n", " 1623,\n", " 34395,\n", " 99931,\n", " 18016,\n", " 1635,\n", " 99946,\n", " 67180,\n", " 
99949,\n", " 99954,\n", " 34419,\n", " 83571,\n", " 99957,\n", " 83572,\n", " 83573,\n", " 18043,\n", " 50811,\n", " 67197,\n", " 34434,\n", " 99970,\n", " 18053,\n", " 83590,\n", " 34440,\n", " 1673,\n", " 50826,\n", " 1675,\n", " 1676,\n", " 34443,\n", " 67212,\n", " 67216,\n", " 18065,\n", " 18066,\n", " 18068,\n", " 18069,\n", " 50838,\n", " 50839,\n", " 34457,\n", " 50841,\n", " 1691,\n", " 1692,\n", " 83615,\n", " 1697,\n", " 50855,\n", " 34473,\n", " 67241,\n", " 50861,\n", " 34480,\n", " 67250,\n", " 18104,\n", " 18108,\n", " 83645,\n", " 18112,\n", " 18116,\n", " 67269,\n", " 1743,\n", " 83667,\n", " 67284,\n", " 50904,\n", " 83674,\n", " 50910,\n", " 34528,\n", " 34529,\n", " 18146,\n", " 50917,\n", " 83688,\n", " 50923,\n", " 1775,\n", " 18160,\n", " 18168,\n", " 34553,\n", " 67322,\n", " 67323,\n", " 18171,\n", " 50939,\n", " 67327,\n", " 50944,\n", " 50947,\n", " 1798,\n", " 83718,\n", " 34572,\n", " 1805,\n", " 83724,\n", " 34576,\n", " 67344,\n", " 83730,\n", " 67350,\n", " 1820,\n", " 1821,\n", " 34590,\n", " 50972,\n", " 1825,\n", " 50986,\n", " 50988,\n", " 34606,\n", " 1839,\n", " 67375,\n", " 50990,\n", " 50993,\n", " 67379,\n", " 83767,\n", " 51000,\n", " 1852,\n", " 34620,\n", " 83774,\n", " 18239,\n", " 1861,\n", " 18245,\n", " 1866,\n", " 67402,\n", " 1874,\n", " 83795,\n", " 83799,\n", " 67416,\n", " 18270,\n", " 51039,\n", " 83807,\n", " 83810,\n", " 51043,\n", " 51046,\n", " 67431,\n", " 51047,\n", " 1904,\n", " 67441,\n", " 18289,\n", " 34687,\n", " 1920,\n", " 34689,\n", " 1919,\n", " 51071,\n", " 51074,\n", " 83846,\n", " 18314,\n", " 51084,\n", " 34701,\n", " 18319,\n", " 34708,\n", " 34710,\n", " 18326,\n", " 51095,\n", " 67492,\n", " 67493,\n", " 34725,\n", " 1959,\n", " 51108,\n", " 67500,\n", " 18348,\n", " 83887,\n", " 1970,\n", " 83895,\n", " 34744,\n", " 51128,\n", " 83898,\n", " 83902,\n", " 34753,\n", " 83906,\n", " 51140,\n", " 18373,\n", " 67531,\n", " 18384,\n", " 2004,\n", " 51159,\n", " 2008,\n", " 2009,\n", " 18397,\n", 
" 18401,\n", " 83940,\n", " 2023,\n", " 67560,\n", " 34793,\n", " 67562,\n", " 34795,\n", " 51175,\n", " 83951,\n", " 51184,\n", " 83953,\n", " ...}" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_retrieve_result('什么时候发货') # 通过倒排表返回文档IDs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:greedyaiqa] *", "language": "python", "name": "conda-env-greedyaiqa-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 2 }