Commit 8d71034d by 20210516036
parent d5a677c5
{
"cells": [
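{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Building the inverted index\n",
"An inverted index speeds up search and is a standard technique in search engines. Following the method covered in the course, you need to complete the code in this part. "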
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from tqdm import tqdm\n",
"import numpy as np\n",
"import pickle\n",
"from gensim.models import KeyedVectors  # word vectors, used to compare pairwise word similarity"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Load data: read data/question_answer_pares.pkl (generated in preprocessor.ipynb) into the variable QApares\n",
"with open('data/question_answer_pares.pkl','rb') as f:\n",
"    QApares = pickle.load(f)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>question</th>\n",
" <th>answer</th>\n",
" <th>question_after_preprocessing</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>买二份有没有少点呀</td>\n",
" <td>亲亲真的不好意思我们已经是优惠价了呢小本生意请亲谅解</td>\n",
" <td>[买, 二份, 有没有, 少点]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>那就等你们处理喽</td>\n",
" <td>好的亲退了</td>\n",
" <td>[处理]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>那我不喜欢</td>\n",
" <td>颜色的话一般茶刀茶针和二合一的话都是红木檀和黑木檀哦</td>\n",
" <td>[喜欢]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>不是免运费</td>\n",
" <td>本店茶具订单满99包邮除宁夏青海内蒙古海南新疆西藏满39包邮</td>\n",
" <td>[免, 运费]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>好吃吗</td>\n",
" <td>好吃的</td>\n",
" <td>[好吃]</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" question answer question_after_preprocessing\n",
"0 买二份有没有少点呀 亲亲真的不好意思我们已经是优惠价了呢小本生意请亲谅解 [买, 二份, 有没有, 少点]\n",
"1 那就等你们处理喽 好的亲退了 [处理]\n",
"2 那我不喜欢 颜色的话一般茶刀茶针和二合一的话都是红木檀和黑木檀哦 [喜欢]\n",
"3 不是免运费 本店茶具订单满99包邮除宁夏青海内蒙古海南新疆西藏满39包邮 [免, 运费]\n",
"4 好吃吗 好吃的 [好吃]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"QApares.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```TODO1``` Build an inverted index; similarity between words does not need to be considered here"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Build an inverted index; see the lab manual for details on inverted indexes.\n",
"# For fast lookup the index should be stored in a hash table. Python dicts are implemented as hash tables,\n",
"# so the index is stored directly in a dict that maps each word to the set of question indices containing it.\n",
"# Note: similarity between words is not considered here.\n",
"inverted_list = {}\n",
"for index, sentence in enumerate(QApares.question_after_preprocessing):\n",
"    ### Your code starts here\n",
"    for word in sentence:\n",
"        # Create the posting set the first time a word is seen, then record this question's index\n",
"        inverted_list.setdefault(word, set()).add(index)\n",
"    ### Your code ends here"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{5,\n",
" 65541,\n",
" 32776,\n",
" 17,\n",
" 18,\n",
" 65554,\n",
" 29,\n",
" 65566,\n",
" 32800,\n",
" 32803,\n",
" 98339,\n",
" 32810,\n",
" 98346,\n",
" 32818,\n",
" 55,\n",
" 98366,\n",
" 64,\n",
" 65604,\n",
" 65611,\n",
" 32850,\n",
" 98387,\n",
" 98398,\n",
" 65631,\n",
" 102,\n",
" 65639,\n",
" 65640,\n",
" 65646,\n",
" 98415,\n",
" 98416,\n",
" 118,\n",
" 122,\n",
" 65659,\n",
" 125,\n",
" 32894,\n",
" 133,\n",
" 65669,\n",
" 65670,\n",
" 65671,\n",
" 142,\n",
" 65679,\n",
" 32912,\n",
" 98451,\n",
" 150,\n",
" 151,\n",
" 32929,\n",
" 65708,\n",
" 98484,\n",
" 98489,\n",
" 187,\n",
" 32957,\n",
" 200,\n",
" 32973,\n",
" 65742,\n",
" 98518,\n",
" 65755,\n",
" 220,\n",
" 223,\n",
" 65764,\n",
" 65783,\n",
" 33017,\n",
" 65786,\n",
" 254,\n",
" 65790,\n",
" 261,\n",
" 65798,\n",
" 65810,\n",
" 275,\n",
" 98586,\n",
" 65833,\n",
" 33068,\n",
" 65838,\n",
" 33073,\n",
" 65843,\n",
" 65844,\n",
" 310,\n",
" 65852,\n",
" 318,\n",
" 65862,\n",
" 65863,\n",
" 344,\n",
" 65883,\n",
" 65885,\n",
" 350,\n",
" 33120,\n",
" 364,\n",
" 98668,\n",
" 65903,\n",
" 33140,\n",
" 98678,\n",
" 65913,\n",
" 33149,\n",
" 33158,\n",
" 33163,\n",
" 33168,\n",
" 401,\n",
" 98709,\n",
" 98715,\n",
" 33189,\n",
" 33191,\n",
" 65960,\n",
" 33193,\n",
" 98740,\n",
" 33210,\n",
" 98751,\n",
" 98754,\n",
" 33220,\n",
" 453,\n",
" 98763,\n",
" 461,\n",
" 469,\n",
" 33244,\n",
" 98794,\n",
" 495,\n",
" 98803,\n",
" 33273,\n",
" 33276,\n",
" 66046,\n",
" 33290,\n",
" 530,\n",
" 33300,\n",
" 66069,\n",
" 66077,\n",
" 542,\n",
" 543,\n",
" 66081,\n",
" 66091,\n",
" 556,\n",
" 66093,\n",
" 66094,\n",
" 98860,\n",
" 66097,\n",
" 98877,\n",
" 66111,\n",
" 33346,\n",
" 579,\n",
" 33347,\n",
" 588,\n",
" 66152,\n",
" 66153,\n",
" 98932,\n",
" 66166,\n",
" 639,\n",
" 640,\n",
" 642,\n",
" 98948,\n",
" 66185,\n",
" 651,\n",
" 33422,\n",
" 98962,\n",
" 33427,\n",
" 98970,\n",
" 66207,\n",
" 673,\n",
" 33441,\n",
" 33446,\n",
" 66221,\n",
" 687,\n",
" 66224,\n",
" 33459,\n",
" 694,\n",
" 33464,\n",
" 33466,\n",
" 33468,\n",
" 99005,\n",
" 66238,\n",
" 99007,\n",
" 710,\n",
" 99017,\n",
" 718,\n",
" 99044,\n",
" 743,\n",
" 33513,\n",
" 749,\n",
" 33523,\n",
" 759,\n",
" 66299,\n",
" 99068,\n",
" 33535,\n",
" 769,\n",
" 33539,\n",
" 66307,\n",
" 66309,\n",
" 99084,\n",
" 66321,\n",
" 99090,\n",
" 66323,\n",
" 99098,\n",
" 66334,\n",
" 800,\n",
" 66337,\n",
" 33576,\n",
" 99120,\n",
" 99122,\n",
" 33592,\n",
" 33593,\n",
" 33600,\n",
" 33604,\n",
" 99140,\n",
" 841,\n",
" 845,\n",
" 99149,\n",
" 66389,\n",
" 854,\n",
" 33624,\n",
" 33625,\n",
" 858,\n",
" 66395,\n",
" 99163,\n",
" 33629,\n",
" 99173,\n",
" 33638,\n",
" 874,\n",
" 66411,\n",
" 878,\n",
" 33647,\n",
" 66415,\n",
" 66417,\n",
" 33650,\n",
" 66418,\n",
" 99182,\n",
" 891,\n",
" 33662,\n",
" 66436,\n",
" 901,\n",
" 99208,\n",
" 66441,\n",
" 33674,\n",
" 33675,\n",
" 33682,\n",
" 916,\n",
" 66452,\n",
" 33691,\n",
" 33692,\n",
" 66461,\n",
" 66463,\n",
" 33697,\n",
" 99234,\n",
" 931,\n",
" 938,\n",
" 944,\n",
" 66480,\n",
" 33716,\n",
" 99254,\n",
" 959,\n",
" 33728,\n",
" 99267,\n",
" 33736,\n",
" 33741,\n",
" 978,\n",
" 33750,\n",
" 992,\n",
" 33765,\n",
" 99301,\n",
" 1000,\n",
" 1005,\n",
" 99309,\n",
" 33780,\n",
" 99318,\n",
" 1017,\n",
" 33787,\n",
" 66557,\n",
" 1024,\n",
" 1027,\n",
" 1034,\n",
" 1036,\n",
" 66577,\n",
" 66580,\n",
" 99351,\n",
" 99360,\n",
" 1057,\n",
" 33825,\n",
" 99363,\n",
" 33831,\n",
" 66600,\n",
" 33837,\n",
" 66607,\n",
" 99376,\n",
" 33841,\n",
" 99380,\n",
" 66613,\n",
" 33849,\n",
" 66622,\n",
" 1087,\n",
" 1088,\n",
" 99391,\n",
" 1095,\n",
" 1098,\n",
" 66637,\n",
" 1102,\n",
" 99405,\n",
" 99410,\n",
" 99413,\n",
" 66646,\n",
" 99419,\n",
" 66652,\n",
" 1118,\n",
" 99431,\n",
" 1129,\n",
" 66670,\n",
" 33912,\n",
" 33914,\n",
" 66685,\n",
" 33927,\n",
" 99467,\n",
" 99470,\n",
" 99472,\n",
" 66705,\n",
" 66708,\n",
" 66709,\n",
" 99484,\n",
" 99485,\n",
" 99487,\n",
" 66721,\n",
" 33954,\n",
" 1187,\n",
" 99495,\n",
" 66728,\n",
" 66732,\n",
" 66734,\n",
" 99510,\n",
" 99517,\n",
" 66751,\n",
" 1223,\n",
" 33992,\n",
" 99528,\n",
" 99537,\n",
" 66770,\n",
" 99539,\n",
" 1237,\n",
" 1242,\n",
" 66779,\n",
" 66780,\n",
" 1247,\n",
" 99555,\n",
" 34024,\n",
" 1265,\n",
" 34034,\n",
" 1270,\n",
" 34042,\n",
" 1277,\n",
" 66814,\n",
" 99583,\n",
" 34064,\n",
" 1297,\n",
" 66834,\n",
" 66844,\n",
" 1312,\n",
" 66849,\n",
" 1326,\n",
" 66866,\n",
" 34099,\n",
" 99637,\n",
" 1334,\n",
" 66871,\n",
" 66877,\n",
" 1342,\n",
" 1345,\n",
" 99651,\n",
" 66887,\n",
" 1352,\n",
" 1354,\n",
" 66895,\n",
" 99671,\n",
" 66909,\n",
" 99681,\n",
" 1392,\n",
" 66929,\n",
" 34162,\n",
" 1401,\n",
" 1402,\n",
" 34169,\n",
" 99709,\n",
" 1407,\n",
" 66944,\n",
" 66945,\n",
" 1412,\n",
" 1415,\n",
" 34184,\n",
" 1420,\n",
" 99729,\n",
" 1426,\n",
" 34196,\n",
" 66971,\n",
" 1436,\n",
" 66974,\n",
" 1443,\n",
" 34217,\n",
" 66986,\n",
" 1452,\n",
" 66993,\n",
" 1459,\n",
" 34227,\n",
" 66996,\n",
" 34258,\n",
" 99797,\n",
" 99803,\n",
" 1508,\n",
" 99814,\n",
" 1512,\n",
" 1515,\n",
" 67053,\n",
" 99821,\n",
" 99824,\n",
" 99825,\n",
" 34303,\n",
" 67071,\n",
" 34307,\n",
" 67077,\n",
" 67079,\n",
" 99854,\n",
" 67092,\n",
" 67094,\n",
" 99862,\n",
" 1565,\n",
" 34333,\n",
" 1567,\n",
" 34338,\n",
" 1579,\n",
" 1585,\n",
" 67127,\n",
" 99897,\n",
" 1603,\n",
" 1604,\n",
" 67147,\n",
" 67152,\n",
" 34388,\n",
" 67156,\n",
" 99924,\n",
" 1623,\n",
" 34395,\n",
" 99931,\n",
" 1635,\n",
" 99946,\n",
" 67180,\n",
" 99949,\n",
" 99954,\n",
" 34419,\n",
" 99957,\n",
" 67197,\n",
" 34434,\n",
" 99970,\n",
" 34440,\n",
" 1673,\n",
" 1675,\n",
" 1676,\n",
" 34443,\n",
" 67212,\n",
" 67216,\n",
" 34457,\n",
" 1691,\n",
" 1692,\n",
" 1697,\n",
" 34473,\n",
" 67241,\n",
" 34480,\n",
" 67250,\n",
" 67269,\n",
" 1743,\n",
" 67284,\n",
" 34528,\n",
" 1775,\n",
" 34553,\n",
" 67323,\n",
" 1798,\n",
" 34572,\n",
" 1805,\n",
" 34576,\n",
" 67344,\n",
" 67350,\n",
" 1820,\n",
" 1821,\n",
" 34590,\n",
" 1825,\n",
" 34606,\n",
" 1839,\n",
" 67375,\n",
" 67379,\n",
" 1852,\n",
" 1861,\n",
" 1866,\n",
" 67402,\n",
" 1874,\n",
" 67416,\n",
" 67431,\n",
" 1904,\n",
" 67441,\n",
" 34687,\n",
" 1920,\n",
" 34689,\n",
" 34701,\n",
" 34708,\n",
" 34710,\n",
" 67492,\n",
" 67493,\n",
" 1970,\n",
" 34744,\n",
" 34753,\n",
" 67531,\n",
" 2004,\n",
" 2008,\n",
" 2009,\n",
" 2023,\n",
" 67560,\n",
" 34793,\n",
" 67562,\n",
" 34795,\n",
" 2037,\n",
" 34807,\n",
" 2046,\n",
" 34819,\n",
" 2052,\n",
" 34827,\n",
" 2068,\n",
" 67613,\n",
" 34848,\n",
" 67626,\n",
" 67628,\n",
" 34864,\n",
" 2099,\n",
" 67636,\n",
" 34869,\n",
" 67641,\n",
" 2110,\n",
" 67649,\n",
" 2117,\n",
" 67655,\n",
" 67666,\n",
" 67669,\n",
" 67680,\n",
" 34914,\n",
" 67682,\n",
" 2151,\n",
" 34926,\n",
" 2160,\n",
" 2161,\n",
" 34929,\n",
" 67699,\n",
" 2170,\n",
" 2178,\n",
" 34949,\n",
" 67718,\n",
" 2187,\n",
" 2193,\n",
" 67730,\n",
" 67735,\n",
" 2216,\n",
" 2218,\n",
" 2230,\n",
" 34998,\n",
" 2235,\n",
" 67771,\n",
" 2244,\n",
" 2247,\n",
" 67791,\n",
" 2263,\n",
" 67800,\n",
" 2266,\n",
" 35037,\n",
" 67810,\n",
" 67816,\n",
" 35049,\n",
" 35055,\n",
" 67824,\n",
" 67825,\n",
" 35059,\n",
" 35063,\n",
" 2296,\n",
" 35067,\n",
" 2323,\n",
" 2326,\n",
" 2330,\n",
" 67867,\n",
" 2335,\n",
" 67886,\n",
" 2365,\n",
" 2370,\n",
" 2372,\n",
" 2377,\n",
" 67915,\n",
" 2380,\n",
" 2392,\n",
" 67928,\n",
" 67934,\n",
" 35174,\n",
" 35180,\n",
" 35188,\n",
" 67958,\n",
" 35195,\n",
" 67968,\n",
" 35202,\n",
" 2441,\n",
" 35211,\n",
" 67981,\n",
" 35215,\n",
" 67984,\n",
" 35219,\n",
" 35220,\n",
" 35228,\n",
" 2465,\n",
" 2483,\n",
" 2484,\n",
" 35256,\n",
" 68025,\n",
" 2510,\n",
" 35280,\n",
" 35281,\n",
" 68066,\n",
" 2532,\n",
" 35310,\n",
" 68084,\n",
" 2553,\n",
" 68089,\n",
" 68097,\n",
" 2566,\n",
" 35351,\n",
" 35358,\n",
" 2594,\n",
" 2607,\n",
" 35378,\n",
" 2612,\n",
" 68151,\n",
" 35385,\n",
" 2620,\n",
" 35394,\n",
" 35395,\n",
" 68163,\n",
" 68164,\n",
" 68169,\n",
" 35415,\n",
" 2653,\n",
" 68199,\n",
" 68200,\n",
" 68209,\n",
" 2677,\n",
" 68215,\n",
" 35458,\n",
" 35459,\n",
" 2692,\n",
" 35466,\n",
" 35472,\n",
" 2705,\n",
" 35473,\n",
" 35474,\n",
" 68240,\n",
" 2710,\n",
" 2712,\n",
" 35481,\n",
" 2715,\n",
" 2721,\n",
" 68257,\n",
" 68264,\n",
" 68265,\n",
" 35503,\n",
" 2744,\n",
" 68290,\n",
" 2761,\n",
" 68306,\n",
" 35539,\n",
" 35547,\n",
" 35549,\n",
" 2786,\n",
" 35557,\n",
" 35559,\n",
" 68329,\n",
" 68332,\n",
" 68334,\n",
" 68337,\n",
" 35570,\n",
" 2804,\n",
" 68343,\n",
" 35587,\n",
" 2824,\n",
" 35603,\n",
" 2838,\n",
" 68375,\n",
" 35613,\n",
" 2853,\n",
" 35622,\n",
" 35634,\n",
" 2868,\n",
" 35636,\n",
" 68408,\n",
" 68419,\n",
" 35654,\n",
" 2887,\n",
" 68440,\n",
" 68445,\n",
" 2916,\n",
" 35689,\n",
" 68457,\n",
" 35697,\n",
" 35729,\n",
" 2968,\n",
" 68506,\n",
" 35740,\n",
" 68512,\n",
" 2981,\n",
" 35758,\n",
" 2991,\n",
" 35759,\n",
" 2996,\n",
" 3000,\n",
" 68537,\n",
" 3009,\n",
" 68549,\n",
" 68556,\n",
" 35792,\n",
" 35793,\n",
" 68562,\n",
" 68568,\n",
" 35804,\n",
" 35809,\n",
" 35810,\n",
" 68578,\n",
" 3050,\n",
" 3053,\n",
" 68590,\n",
" 3061,\n",
" 3067,\n",
" 35835,\n",
" 35843,\n",
" 3076,\n",
" 68613,\n",
" 35846,\n",
" 3081,\n",
" 35855,\n",
" 68629,\n",
" 3094,\n",
" 35863,\n",
" 35867,\n",
" 35888,\n",
" 35899,\n",
" 35915,\n",
" 68683,\n",
" 68685,\n",
" 3151,\n",
" 68687,\n",
" 68690,\n",
" 68697,\n",
" 35932,\n",
" 68705,\n",
" 3171,\n",
" 68708,\n",
" 68711,\n",
" 68721,\n",
" 3187,\n",
" 35974,\n",
" 68744,\n",
" 3210,\n",
" 3211,\n",
" 35983,\n",
" 68761,\n",
" 36002,\n",
" 3236,\n",
" 68774,\n",
" 3240,\n",
" 68790,\n",
" 3256,\n",
" 68792,\n",
" 3265,\n",
" 3269,\n",
" 3271,\n",
" 68809,\n",
" 3275,\n",
" 36052,\n",
" 3285,\n",
" 3286,\n",
" 68823,\n",
" 3288,\n",
" 3291,\n",
" 3293,\n",
" 3303,\n",
" 68855,\n",
" 68865,\n",
" 3333,\n",
" 36107,\n",
" 3356,\n",
" 3364,\n",
" 3375,\n",
" 68914,\n",
" 68930,\n",
" 68948,\n",
" 3414,\n",
" 36192,\n",
" 36197,\n",
" 36201,\n",
" 3434,\n",
" 36211,\n",
" 36214,\n",
" 3455,\n",
" 36226,\n",
" 69014,\n",
" 69015,\n",
" 3482,\n",
" 3483,\n",
" 69022,\n",
" 3496,\n",
" 3500,\n",
" 36273,\n",
" 36277,\n",
" 3526,\n",
" 3527,\n",
" 36294,\n",
" 36300,\n",
" 3536,\n",
" 3539,\n",
" 36308,\n",
" 3542,\n",
" 3543,\n",
" 69082,\n",
" 36320,\n",
" 36321,\n",
" 36326,\n",
" 3560,\n",
" 3572,\n",
" 36340,\n",
" 36341,\n",
" 36346,\n",
" 69124,\n",
" 3590,\n",
" 36370,\n",
" 3610,\n",
" 3613,\n",
" 69158,\n",
" 3627,\n",
" 3636,\n",
" 36404,\n",
" 36416,\n",
" 36419,\n",
" 3652,\n",
" 3655,\n",
" 3657,\n",
" 3669,\n",
" 3673,\n",
" 36444,\n",
" 36446,\n",
" 3680,\n",
" 69220,\n",
" 69221,\n",
" 69231,\n",
" 36469,\n",
" 69244,\n",
" 3712,\n",
" 3713,\n",
" 36481,\n",
" 69251,\n",
" 36488,\n",
" 69264,\n",
" 36501,\n",
" 36509,\n",
" 36510,\n",
" 3744,\n",
" 69282,\n",
" 69287,\n",
" 69294,\n",
" 36527,\n",
" 36528,\n",
" 3764,\n",
" 3785,\n",
" 3791,\n",
" 69338,\n",
" 36572,\n",
" 3807,\n",
" 69344,\n",
" 3815,\n",
" 69365,\n",
" 36598,\n",
" 3832,\n",
" 3837,\n",
" 36619,\n",
" 69405,\n",
" 3882,\n",
" 3883,\n",
" 36650,\n",
" 69426,\n",
" 36687,\n",
" 3928,\n",
" 3931,\n",
" 69474,\n",
" 36707,\n",
" 69477,\n",
" 36710,\n",
" 3945,\n",
" 36723,\n",
" 3960,\n",
" 69512,\n",
" 3980,\n",
" 69516,\n",
" 69518,\n",
" 69519,\n",
" 69523,\n",
" 36756,\n",
" 36762,\n",
" 36765,\n",
" 36771,\n",
" 36778,\n",
" 4026,\n",
" 69563,\n",
" 69569,\n",
" 69578,\n",
" 69596,\n",
" 36829,\n",
" 4069,\n",
" 36838,\n",
" 69607,\n",
" 36841,\n",
" 4074,\n",
" 36845,\n",
" 69617,\n",
" 4093,\n",
" 36871,\n",
" 36874,\n",
" 36878,\n",
" 36880,\n",
" 36888,\n",
" 36890,\n",
" 4124,\n",
" 69662,\n",
" 4127,\n",
" 36898,\n",
" 4133,\n",
" 36916,\n",
" 36917,\n",
" 36924,\n",
" 69692,\n",
" 69695,\n",
" 4161,\n",
" 4162,\n",
" 4165,\n",
" 69707,\n",
" 4175,\n",
" 36948,\n",
" 4182,\n",
" 36958,\n",
" 69732,\n",
" 36974,\n",
" 36978,\n",
" 69748,\n",
" 4220,\n",
" 36995,\n",
" 69766,\n",
" 36999,\n",
" 69767,\n",
" 69770,\n",
" 4238,\n",
" 37007,\n",
" 69783,\n",
" 4250,\n",
" 69800,\n",
" 69802,\n",
" 4267,\n",
" 37036,\n",
" 37037,\n",
" 37039,\n",
" 4278,\n",
" 37049,\n",
" 4282,\n",
" 4287,\n",
" 4290,\n",
" 4291,\n",
" 37064,\n",
" 4301,\n",
" 37072,\n",
" 37073,\n",
" 69847,\n",
" 4312,\n",
" 4321,\n",
" 4323,\n",
" 4324,\n",
" 4332,\n",
" 4336,\n",
" 69874,\n",
" 69880,\n",
" 37114,\n",
" 37120,\n",
" 4354,\n",
" 4360,\n",
" 69898,\n",
" 37131,\n",
" 69903,\n",
" 37150,\n",
" 4387,\n",
" 69923,\n",
" 69928,\n",
" 37166,\n",
" 69934,\n",
" 69944,\n",
" 37177,\n",
" 4425,\n",
" 37193,\n",
" 4429,\n",
" 37216,\n",
" 4463,\n",
" 37234,\n",
" 37235,\n",
" 70002,\n",
" 4492,\n",
" 4493,\n",
" 4500,\n",
" 70040,\n",
" ...}"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"inverted_list[\"发货\"]"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3832"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(inverted_list)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# data/retrieve/sgns.zhihu.word is a pretrained Chinese word-vector file downloaded from https://github.com/Embedding/Chinese-Word-Vectors\n",
"# Load the pretrained word vectors with KeyedVectors.load_word2vec_format()\n",
"model = KeyedVectors.load_word2vec_format('data/retrieve/sgns.zhihu.word')"
]
},
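{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"def get_similar_by_word(word, topk):\n",
"    '''\n",
"    Return the topk words most similar to word, as a list.\n",
"    Returns:\n",
"        word_list: the topk words most similar to word, e.g. [word1, word2, word3, word4, word5] for topk=5\n",
"    Raises:\n",
"        KeyError: if word is not in the vocabulary of the pretrained vectors\n",
"    '''\n",
"    similar_words = model.similar_by_word(word, topn=topk)  # list of (word, similarity) pairs\n",
"    word_list = [similar_word for similar_word, _ in similar_words]\n",
"    return word_list"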
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['昨天', '现在', '今天下午', '明天', '今日']"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_similar_by_word(\"今天\",5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```TODO2``` Build a new inverted index that takes the semantic similarity between words into account"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 3832/3832 [00:44<00:00, 85.74it/s] "
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"OOV_count: 832\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"# TODO:\n",
"# Build a new inverted index and store it in the dict inverted_list_new.\n",
"# Each key is a word; its value is the union of old_index[word], old_index[word1], old_index[word2], old_index[word3] and old_index[word4],\n",
"# where word1..word4 are the 4 words most similar to word.\n",
"# In other words, the new index stores the indices of questions that contain word or any of its 4 most similar words.\n",
"inverted_list_new = {}\n",
"OOV_count = 0\n",
"for word in tqdm(inverted_list):\n",
"    ### Your code starts here\n",
"    # Start from a copy of the word's own postings so that out-of-vocabulary (OOV) words are still kept\n",
"    inverted_list_new[word] = set(inverted_list[word])\n",
"    try:\n",
"        top_4_words = get_similar_by_word(word, 4)\n",
"    except KeyError:\n",
"        # word is missing from the pretrained vectors, so there are no neighbours to expand with\n",
"        OOV_count += 1\n",
"        continue\n",
"    for t_word in top_4_words:\n",
"        if t_word in inverted_list:\n",
"            inverted_list_new[word] |= inverted_list[t_word]\n",
"    ### Your code ends here\n",
"print(\"OOV_count:\", OOV_count)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{81920,\n",
" 16386,\n",
" 5,\n",
" 65541,\n",
" 81927,\n",
" 32776,\n",
" 81930,\n",
" 81935,\n",
" 17,\n",
" 18,\n",
" 65554,\n",
" 16401,\n",
" 81947,\n",
" 98331,\n",
" 29,\n",
" 65566,\n",
" 32800,\n",
" 81953,\n",
" 32803,\n",
" 98339,\n",
" 81959,\n",
" 32810,\n",
" 98346,\n",
" 49194,\n",
" 32818,\n",
" 16435,\n",
" 55,\n",
" 49209,\n",
" 98366,\n",
" 64,\n",
" 49219,\n",
" 65604,\n",
" 81988,\n",
" 16458,\n",
" 65611,\n",
" 81995,\n",
" 81998,\n",
" 16463,\n",
" 16464,\n",
" 49233,\n",
" 32850,\n",
" 98387,\n",
" 49234,\n",
" 82004,\n",
" 86,\n",
" 81999,\n",
" 98386,\n",
" 16475,\n",
" 32859,\n",
" 49245,\n",
" 98398,\n",
" 65631,\n",
" 82015,\n",
" 65630,\n",
" 102,\n",
" 65639,\n",
" 65640,\n",
" 49259,\n",
" 65646,\n",
" 98415,\n",
" 98416,\n",
" 49263,\n",
" 16495,\n",
" 49267,\n",
" 16500,\n",
" 82035,\n",
" 118,\n",
" 16503,\n",
" 65650,\n",
" 65656,\n",
" 122,\n",
" 65659,\n",
" 49275,\n",
" 125,\n",
" 32894,\n",
" 65660,\n",
" 133,\n",
" 65669,\n",
" 65670,\n",
" 65671,\n",
" 32902,\n",
" 139,\n",
" 49293,\n",
" 142,\n",
" 65679,\n",
" 32912,\n",
" 98451,\n",
" 150,\n",
" 151,\n",
" 32929,\n",
" 49318,\n",
" 49320,\n",
" 65708,\n",
" 82092,\n",
" 82093,\n",
" 65711,\n",
" 98484,\n",
" 98489,\n",
" 49337,\n",
" 187,\n",
" 49340,\n",
" 32957,\n",
" 200,\n",
" 16588,\n",
" 32973,\n",
" 65742,\n",
" 16589,\n",
" 98518,\n",
" 16598,\n",
" 49366,\n",
" 82137,\n",
" 65755,\n",
" 220,\n",
" 223,\n",
" 65764,\n",
" 82149,\n",
" 82155,\n",
" 16621,\n",
" 49396,\n",
" 65783,\n",
" 33017,\n",
" 65786,\n",
" 254,\n",
" 65790,\n",
" 82176,\n",
" 261,\n",
" 65798,\n",
" 65806,\n",
" 65810,\n",
" 275,\n",
" 279,\n",
" 98586,\n",
" 82208,\n",
" 49442,\n",
" 65833,\n",
" 33068,\n",
" 82220,\n",
" 65838,\n",
" 82221,\n",
" 82224,\n",
" 33073,\n",
" 65843,\n",
" 65844,\n",
" 16691,\n",
" 310,\n",
" 16696,\n",
" 82234,\n",
" 65852,\n",
" 49468,\n",
" 318,\n",
" 49471,\n",
" 49472,\n",
" 16706,\n",
" 49475,\n",
" 65862,\n",
" 65863,\n",
" 16712,\n",
" 82248,\n",
" 65866,\n",
" 49483,\n",
" 49493,\n",
" 16726,\n",
" 344,\n",
" 16730,\n",
" 65883,\n",
" 82268,\n",
" 65885,\n",
" 350,\n",
" 33120,\n",
" 16745,\n",
" 364,\n",
" 98668,\n",
" 65903,\n",
" 33140,\n",
" 98678,\n",
" 65913,\n",
" 33149,\n",
" 49535,\n",
" 33158,\n",
" 49544,\n",
" 82313,\n",
" 33163,\n",
" 16783,\n",
" 33168,\n",
" 401,\n",
" 82322,\n",
" 98709,\n",
" 49558,\n",
" 98715,\n",
" 16796,\n",
" 49565,\n",
" 82333,\n",
" 82334,\n",
" 33189,\n",
" 33191,\n",
" 65960,\n",
" 33193,\n",
" 49579,\n",
" 16812,\n",
" 98740,\n",
" 65973,\n",
" 16822,\n",
" 82359,\n",
" 16825,\n",
" 33210,\n",
" 16827,\n",
" 82365,\n",
" 98751,\n",
" 16832,\n",
" 98754,\n",
" 33220,\n",
" 453,\n",
" 49604,\n",
" 98763,\n",
" 461,\n",
" 469,\n",
" 49623,\n",
" 16856,\n",
" 33244,\n",
" 49629,\n",
" 16867,\n",
" 16869,\n",
" 98794,\n",
" 82410,\n",
" 82412,\n",
" 495,\n",
" 16882,\n",
" 98803,\n",
" 49651,\n",
" 49656,\n",
" 33273,\n",
" 16889,\n",
" 49658,\n",
" 33276,\n",
" 82426,\n",
" 66046,\n",
" 507,\n",
" 16895,\n",
" 49669,\n",
" 82437,\n",
" 33290,\n",
" 49674,\n",
" 16911,\n",
" 530,\n",
" 33300,\n",
" 66069,\n",
" 16918,\n",
" 16922,\n",
" 66077,\n",
" 542,\n",
" 543,\n",
" 82463,\n",
" 66081,\n",
" 16932,\n",
" 66091,\n",
" 556,\n",
" 66093,\n",
" 66094,\n",
" 98860,\n",
" 82475,\n",
" 66097,\n",
" 49709,\n",
" 16942,\n",
" 49715,\n",
" 49716,\n",
" 66106,\n",
" 98877,\n",
" 66111,\n",
" 49729,\n",
" 33346,\n",
" 579,\n",
" 33347,\n",
" 82499,\n",
" 49735,\n",
" 16967,\n",
" 49739,\n",
" 588,\n",
" 16983,\n",
" 82522,\n",
" 16991,\n",
" 82528,\n",
" 49761,\n",
" 49762,\n",
" 66151,\n",
" 66152,\n",
" 66153,\n",
" 17003,\n",
" 49777,\n",
" 98932,\n",
" 17012,\n",
" 66166,\n",
" 17020,\n",
" 17021,\n",
" 639,\n",
" 640,\n",
" 49793,\n",
" 642,\n",
" 17026,\n",
" 98948,\n",
" 17027,\n",
" 49796,\n",
" 82563,\n",
" 17030,\n",
" 66185,\n",
" 82566,\n",
" 651,\n",
" 33422,\n",
" 82576,\n",
" 98962,\n",
" 33427,\n",
" 17043,\n",
" 49812,\n",
" 82584,\n",
" 98970,\n",
" 17050,\n",
" 17052,\n",
" 66207,\n",
" 82592,\n",
" 673,\n",
" 33441,\n",
" 17057,\n",
" 17061,\n",
" 33446,\n",
" 49831,\n",
" 66221,\n",
" 82605,\n",
" 687,\n",
" 66224,\n",
" 49841,\n",
" 33453,\n",
" 33459,\n",
" 17076,\n",
" 66228,\n",
" 694,\n",
" 33464,\n",
" 33466,\n",
" 33468,\n",
" 99005,\n",
" 66238,\n",
" 99007,\n",
" 99006,\n",
" 82628,\n",
" 710,\n",
" 82630,\n",
" 82632,\n",
" 99017,\n",
" 711,\n",
" 718,\n",
" 82639,\n",
" 49872,\n",
" 82645,\n",
" 82652,\n",
" 49889,\n",
" 82658,\n",
" 99044,\n",
" 743,\n",
" 17127,\n",
" 33513,\n",
" 49897,\n",
" 17132,\n",
" 749,\n",
" 49901,\n",
" 49903,\n",
" 82674,\n",
" 33523,\n",
" 17142,\n",
" 759,\n",
" 17144,\n",
" 33527,\n",
" 66299,\n",
" 99068,\n",
" 33535,\n",
" 17152,\n",
" 769,\n",
" 33539,\n",
" 66307,\n",
" 66309,\n",
" 82694,\n",
" 17162,\n",
" 99084,\n",
" 49932,\n",
" 17167,\n",
" 49935,\n",
" 66321,\n",
" 99090,\n",
" 66323,\n",
" 783,\n",
" 82713,\n",
" 99098,\n",
" 66333,\n",
" 66334,\n",
" 82719,\n",
" 800,\n",
" 66337,\n",
" 17185,\n",
" 49954,\n",
" 17190,\n",
" 33576,\n",
" 49962,\n",
" 17196,\n",
" 99120,\n",
" 99122,\n",
" 33587,\n",
" 49972,\n",
" 49974,\n",
" 17207,\n",
" 33592,\n",
" 33593,\n",
" 33600,\n",
" 33604,\n",
" 99140,\n",
" 82757,\n",
" 841,\n",
" 49994,\n",
" 845,\n",
" 99149,\n",
" 17235,\n",
" 82771,\n",
" 66389,\n",
" 854,\n",
" 82775,\n",
" 33624,\n",
" 33625,\n",
" 858,\n",
" 66395,\n",
" 99163,\n",
" 33629,\n",
" 17245,\n",
" 82782,\n",
" 17249,\n",
" 17251,\n",
" 82787,\n",
" 99173,\n",
" 33638,\n",
" 17255,\n",
" 874,\n",
" 66411,\n",
" 82794,\n",
" 17259,\n",
" 878,\n",
" 33647,\n",
" 66415,\n",
" 66417,\n",
" 33650,\n",
" 66418,\n",
" 99182,\n",
" 82798,\n",
" 33652,\n",
" 891,\n",
" 17275,\n",
" 82812,\n",
" 33662,\n",
" 82813,\n",
" 17283,\n",
" 66436,\n",
" 901,\n",
" 99208,\n",
" 66441,\n",
" 33674,\n",
" 33675,\n",
" 17291,\n",
" 82829,\n",
" 33682,\n",
" 916,\n",
" 66452,\n",
" 17304,\n",
" 33688,\n",
" 33691,\n",
" 33692,\n",
" 66461,\n",
" 66463,\n",
" 82847,\n",
" 33697,\n",
" 99234,\n",
" 931,\n",
" 82848,\n",
" 17317,\n",
" 938,\n",
" 17324,\n",
" 82863,\n",
" 944,\n",
" 66480,\n",
" 33716,\n",
" 99254,\n",
" 82878,\n",
" 959,\n",
" 33728,\n",
" 99267,\n",
" 33736,\n",
" 33741,\n",
" 17359,\n",
" 978,\n",
" 17364,\n",
" 33750,\n",
" 82905,\n",
" 17371,\n",
" 992,\n",
" 17378,\n",
" 33765,\n",
" 99301,\n",
" 1000,\n",
" 50152,\n",
" 50155,\n",
" 1005,\n",
" 99309,\n",
" 50158,\n",
" 82928,\n",
" 66542,\n",
" 33780,\n",
" 99318,\n",
" 1017,\n",
" 33787,\n",
" 66557,\n",
" 1024,\n",
" 17408,\n",
" 17409,\n",
" 1027,\n",
" 50179,\n",
" 82949,\n",
" 82950,\n",
" 1034,\n",
" 50187,\n",
" 1036,\n",
" 50191,\n",
" 66577,\n",
" 66580,\n",
" 99351,\n",
" 82969,\n",
" 82970,\n",
" 17437,\n",
" 99360,\n",
" 1057,\n",
" 33825,\n",
" 99363,\n",
" 82977,\n",
" 82979,\n",
" 33831,\n",
" 66600,\n",
" 17447,\n",
" 17448,\n",
" 33837,\n",
" 50221,\n",
" 66607,\n",
" 99376,\n",
" 33841,\n",
" 82993,\n",
" 99380,\n",
" 66613,\n",
" 82996,\n",
" 82998,\n",
" 33849,\n",
" 83003,\n",
" 66622,\n",
" 1087,\n",
" 1088,\n",
" 99391,\n",
" 17473,\n",
" 83011,\n",
" 50245,\n",
" 1095,\n",
" 1098,\n",
" 66637,\n",
" 1102,\n",
" 99405,\n",
" 50256,\n",
" 99410,\n",
" 66644,\n",
" 99413,\n",
" 66646,\n",
" 99419,\n",
" 66652,\n",
" 83037,\n",
" 1118,\n",
" 83040,\n",
" 99431,\n",
" 66663,\n",
" 1129,\n",
" 17516,\n",
" 83052,\n",
" 66670,\n",
" 33900,\n",
" 50290,\n",
" 17524,\n",
" 33912,\n",
" 83065,\n",
" 33914,\n",
" 66685,\n",
" 83071,\n",
" 17537,\n",
" 50309,\n",
" 33927,\n",
" 50314,\n",
" 99467,\n",
" 99470,\n",
" 50318,\n",
" 99472,\n",
" 66705,\n",
" 83088,\n",
" 66708,\n",
" 66709,\n",
" 50327,\n",
" 99484,\n",
" 99485,\n",
" 83101,\n",
" 99487,\n",
" 66721,\n",
" 33954,\n",
" 1187,\n",
" 83105,\n",
" 50337,\n",
" 50341,\n",
" 99495,\n",
" 66728,\n",
" 50345,\n",
" 50346,\n",
" 17578,\n",
" 66732,\n",
" 83116,\n",
" 66734,\n",
" 50351,\n",
" 50356,\n",
" 99510,\n",
" 17591,\n",
" 50363,\n",
" 99517,\n",
" 66751,\n",
" 17599,\n",
" 1223,\n",
" 33992,\n",
" 99528,\n",
" 17607,\n",
" 50376,\n",
" 50379,\n",
" 17616,\n",
" 99537,\n",
" 66770,\n",
" 99539,\n",
" 1237,\n",
" 1242,\n",
" 66779,\n",
" 66780,\n",
" 17628,\n",
" 1247,\n",
" 17631,\n",
" 83168,\n",
" 50402,\n",
" 99555,\n",
" 50405,\n",
" 34024,\n",
" 50409,\n",
" 50411,\n",
" 1265,\n",
" 34034,\n",
" 83186,\n",
" 17651,\n",
" 50419,\n",
" 1270,\n",
" 34042,\n",
" 99580,\n",
" 1277,\n",
" 66814,\n",
" 99583,\n",
" 83200,\n",
" 17671,\n",
" 50441,\n",
" 17677,\n",
" 34064,\n",
" 1297,\n",
" 66834,\n",
" 50448,\n",
" 50449,\n",
" 50456,\n",
" 66844,\n",
" 83229,\n",
" 83231,\n",
" 1312,\n",
" 66849,\n",
" 50464,\n",
" 17698,\n",
" 17699,\n",
" 83236,\n",
" 17702,\n",
" 1326,\n",
" 99632,\n",
" 17713,\n",
" 66866,\n",
" 34099,\n",
" 83252,\n",
" 99637,\n",
" 1334,\n",
" 66871,\n",
" 83258,\n",
" 66877,\n",
" 1342,\n",
" 50493,\n",
" 17728,\n",
" 1345,\n",
" 99651,\n",
" 66887,\n",
" 1352,\n",
" 50503,\n",
" 1354,\n",
" 17737,\n",
" 50508,\n",
" 66895,\n",
" 99671,\n",
" 66909,\n",
" 34143,\n",
" 99681,\n",
" 83300,\n",
" 17770,\n",
" 1392,\n",
" 66929,\n",
" 34162,\n",
" 17777,\n",
" 1401,\n",
" 1402,\n",
" 34169,\n",
" 50554,\n",
" 99709,\n",
" 50557,\n",
" 1407,\n",
" 66944,\n",
" 66945,\n",
" 17788,\n",
" 1412,\n",
" 1415,\n",
" 34184,\n",
" 1420,\n",
" 99729,\n",
" 1426,\n",
" 34196,\n",
" 83351,\n",
" 66971,\n",
" 1436,\n",
" 83357,\n",
" 66974,\n",
" 17821,\n",
" 1443,\n",
" 83363,\n",
" 50598,\n",
" 34217,\n",
" 66986,\n",
" 83369,\n",
" 1452,\n",
" 66993,\n",
" 50609,\n",
" 1459,\n",
" 34227,\n",
" 66996,\n",
" 83385,\n",
" 50618,\n",
" 50626,\n",
" 50627,\n",
" 83406,\n",
" 83407,\n",
" 34258,\n",
" 50643,\n",
" 83412,\n",
" 99797,\n",
" 50647,\n",
" 99803,\n",
" 50652,\n",
" 50654,\n",
" 1508,\n",
" 50661,\n",
" 99814,\n",
" 1512,\n",
" 1515,\n",
" 50667,\n",
" 67053,\n",
" 99821,\n",
" 17901,\n",
" 99824,\n",
" 99825,\n",
" 83438,\n",
" 50672,\n",
" 83441,\n",
" 50677,\n",
" 17910,\n",
" 83446,\n",
" 83447,\n",
" 83448,\n",
" 34292,\n",
" 17916,\n",
" 17918,\n",
" 34303,\n",
" 67071,\n",
" 34307,\n",
" 83460,\n",
" 67077,\n",
" 1540,\n",
" 67079,\n",
" 34310,\n",
" 99854,\n",
" 83473,\n",
" 67092,\n",
" 67094,\n",
" 99862,\n",
" 67100,\n",
" 1565,\n",
" 34333,\n",
" 1567,\n",
" 34338,\n",
" 99877,\n",
" 1579,\n",
" 83502,\n",
" 83504,\n",
" 1585,\n",
" 67127,\n",
" 99897,\n",
" 50753,\n",
" 83521,\n",
" 1603,\n",
" 1604,\n",
" 17991,\n",
" 17992,\n",
" 50760,\n",
" 83527,\n",
" 67147,\n",
" 17996,\n",
" 67152,\n",
" 34388,\n",
" 67156,\n",
" 99924,\n",
" 1623,\n",
" 34395,\n",
" 99931,\n",
" 18016,\n",
" 1635,\n",
" 99946,\n",
" 67180,\n",
" 99949,\n",
" 99954,\n",
" 34419,\n",
" 83571,\n",
" 99957,\n",
" 83573,\n",
" 83572,\n",
" 18043,\n",
" 50811,\n",
" 67197,\n",
" 34434,\n",
" 99970,\n",
" 18053,\n",
" 83590,\n",
" 34440,\n",
" 1673,\n",
" 50826,\n",
" 1675,\n",
" 1676,\n",
" 34443,\n",
" 67212,\n",
" 67216,\n",
" 18065,\n",
" 18066,\n",
" 18068,\n",
" 18069,\n",
" 50838,\n",
" 50839,\n",
" 34457,\n",
" 50841,\n",
" 1691,\n",
" 1692,\n",
" 83615,\n",
" 1697,\n",
" 50855,\n",
" 34473,\n",
" 67241,\n",
" 50861,\n",
" 34480,\n",
" 67250,\n",
" 18104,\n",
" 18108,\n",
" 83645,\n",
" 18112,\n",
" 18116,\n",
" 67269,\n",
" 1743,\n",
" 83667,\n",
" 67284,\n",
" 50904,\n",
" 83674,\n",
" 50910,\n",
" 34528,\n",
" 34529,\n",
" 18146,\n",
" 50917,\n",
" 83688,\n",
" 50923,\n",
" 1775,\n",
" 18160,\n",
" 18168,\n",
" 34553,\n",
" 67322,\n",
" 67323,\n",
" 18171,\n",
" 50939,\n",
" 67327,\n",
" 50944,\n",
" 50947,\n",
" 1798,\n",
" 83718,\n",
" 34572,\n",
" 1805,\n",
" 83724,\n",
" 34576,\n",
" 67344,\n",
" 83730,\n",
" 67350,\n",
" 1820,\n",
" 1821,\n",
" 34590,\n",
" 50972,\n",
" 1825,\n",
" 50986,\n",
" 50988,\n",
" 34606,\n",
" 1839,\n",
" 67375,\n",
" 50990,\n",
" 50993,\n",
" 67379,\n",
" 83767,\n",
" 51000,\n",
" 1852,\n",
" 34620,\n",
" 83774,\n",
" 18239,\n",
" 1861,\n",
" 18245,\n",
" 1866,\n",
" 67402,\n",
" 1874,\n",
" 83795,\n",
" 83799,\n",
" 67416,\n",
" 18270,\n",
" 51039,\n",
" 83807,\n",
" 83810,\n",
" 51043,\n",
" 51046,\n",
" 67431,\n",
" 51047,\n",
" 1904,\n",
" 67441,\n",
" 18289,\n",
" 34687,\n",
" 1920,\n",
" 34689,\n",
" 51071,\n",
" 51074,\n",
" 1919,\n",
" 83846,\n",
" 18314,\n",
" 51084,\n",
" 34701,\n",
" 18319,\n",
" 34708,\n",
" 34710,\n",
" 18326,\n",
" 51095,\n",
" 67492,\n",
" 67493,\n",
" 51108,\n",
" 1959,\n",
" 34725,\n",
" 18348,\n",
" 67500,\n",
" 83887,\n",
" 1970,\n",
" 83895,\n",
" 34744,\n",
" 51128,\n",
" 83898,\n",
" 83902,\n",
" 34753,\n",
" 83906,\n",
" 51140,\n",
" 18373,\n",
" 67531,\n",
" 18384,\n",
" 2004,\n",
" 51159,\n",
" 2008,\n",
" 2009,\n",
" 18397,\n",
" 18401,\n",
" 83940,\n",
" 2023,\n",
" 67560,\n",
" 34793,\n",
" 67562,\n",
" 34795,\n",
" 51175,\n",
" 83951,\n",
" 51184,\n",
" 83953,\n",
" 51185,\n",
" ...}"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"inverted_list_new[\"发货\"]"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"# Save the new inverted index to data/retrieve/invertedList.pkl\n",
"with open('data/retrieve/invertedList.pkl','wb') as f:\n",
"    pickle.dump(inverted_list_new,f)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cells below are a test. After completing the steps above, run them to check that the index behaves correctly."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"# This cell is pasted from preprocessor.ipynb and contains the key preprocessing functions\n",
"import emoji\n",
"import re\n",
"import jieba\n",
"def clean(content):\n",
"    content = emoji.demojize(content)\n",
"    content = re.sub('<.*>','',content)\n",
"    return content\n",
"# This function segments a sentence into words. The data in preprocessor.ipynb was already segmented, so the step was skipped there, but it is required for a new question\n",
"def question_cut(content):\n",
"    return list(jieba.cut(content))\n",
"def strip(wordList):\n",
"    return [word.strip() for word in wordList if word.strip()!='']\n",
"with open(\"data/stopWord.json\",\"r\") as f:\n",
"    stopWords = f.read().split(\"\\n\")\n",
"def rm_stop_word(wordList):\n",
"    return [word for word in wordList if word not in stopWords]"
]
},
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the inverted index from data/retrieve/invertedList.pkl into the variable invertedList\n",
    "with open('data/retrieve/invertedList.pkl','rb') as f:\n",
    "    invertedList = pickle.load(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_retrieve_result(sentence):\n",
    "    '''\n",
    "    Given a sentence, use the inverted index to quickly retrieve the indices of candidate questions that are similar to it.\n",
    "    The candidates are the indices of questions that contain any word of the sentence, or any word close in meaning to one of its words.\n",
    "    '''\n",
    "    sentence = clean(sentence)\n",
    "    sentence = question_cut(sentence)\n",
    "    sentence = strip(sentence)\n",
    "    sentence = rm_stop_word(sentence)\n",
    "    candidate = set()\n",
    "    for word in sentence:\n",
    "        if word in invertedList:\n",
    "            candidate = candidate | invertedList[word]\n",
    "    return candidate"
   ]
  },
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{81920,\n",
" 16386,\n",
" 65541,\n",
" 5,\n",
" 81927,\n",
" 32776,\n",
" 81930,\n",
" 81935,\n",
" 17,\n",
" 18,\n",
" 65554,\n",
" 16401,\n",
" 98331,\n",
" 81947,\n",
" 29,\n",
" 65566,\n",
" 32800,\n",
" 81953,\n",
" 32803,\n",
" 98339,\n",
" 81959,\n",
" 32810,\n",
" 98346,\n",
" 49194,\n",
" 32818,\n",
" 16435,\n",
" 55,\n",
" 49209,\n",
" 98366,\n",
" 64,\n",
" 49219,\n",
" 65604,\n",
" 81988,\n",
" 16458,\n",
" 65611,\n",
" 81995,\n",
" 81998,\n",
" 16463,\n",
" 16464,\n",
" 49233,\n",
" 32850,\n",
" 98387,\n",
" 98386,\n",
" 49234,\n",
" 86,\n",
" 81999,\n",
" 82004,\n",
" 32859,\n",
" 16475,\n",
" 49245,\n",
" 98398,\n",
" 65631,\n",
" 65630,\n",
" 82015,\n",
" 102,\n",
" 65639,\n",
" 65640,\n",
" 49259,\n",
" 65646,\n",
" 98415,\n",
" 98416,\n",
" 49263,\n",
" 65650,\n",
" 16495,\n",
" 49267,\n",
" 16500,\n",
" 118,\n",
" 82035,\n",
" 65656,\n",
" 16503,\n",
" 122,\n",
" 65659,\n",
" 65660,\n",
" 125,\n",
" 32894,\n",
" 49275,\n",
" 133,\n",
" 65669,\n",
" 65670,\n",
" 65671,\n",
" 32902,\n",
" 139,\n",
" 49293,\n",
" 142,\n",
" 65679,\n",
" 32912,\n",
" 98451,\n",
" 150,\n",
" 151,\n",
" 32929,\n",
" 49318,\n",
" 49320,\n",
" 65708,\n",
" 82092,\n",
" 82093,\n",
" 65711,\n",
" 98484,\n",
" 98489,\n",
" 49337,\n",
" 187,\n",
" 49340,\n",
" 32957,\n",
" 200,\n",
" 16588,\n",
" 32973,\n",
" 65742,\n",
" 16589,\n",
" 98518,\n",
" 16598,\n",
" 49366,\n",
" 82137,\n",
" 65755,\n",
" 220,\n",
" 223,\n",
" 65764,\n",
" 82149,\n",
" 82155,\n",
" 16621,\n",
" 49396,\n",
" 65783,\n",
" 33017,\n",
" 65786,\n",
" 254,\n",
" 65790,\n",
" 82176,\n",
" 261,\n",
" 65798,\n",
" 65806,\n",
" 65810,\n",
" 275,\n",
" 279,\n",
" 98586,\n",
" 82208,\n",
" 49442,\n",
" 65833,\n",
" 33068,\n",
" 82220,\n",
" 65838,\n",
" 82221,\n",
" 82224,\n",
" 33073,\n",
" 65843,\n",
" 65844,\n",
" 16691,\n",
" 310,\n",
" 16696,\n",
" 82234,\n",
" 65852,\n",
" 49468,\n",
" 318,\n",
" 49471,\n",
" 49472,\n",
" 16706,\n",
" 49475,\n",
" 65862,\n",
" 65863,\n",
" 16712,\n",
" 82248,\n",
" 65866,\n",
" 49483,\n",
" 49493,\n",
" 16726,\n",
" 344,\n",
" 16730,\n",
" 65883,\n",
" 82268,\n",
" 65885,\n",
" 350,\n",
" 33120,\n",
" 16745,\n",
" 364,\n",
" 98668,\n",
" 65903,\n",
" 33140,\n",
" 98678,\n",
" 65913,\n",
" 33149,\n",
" 49535,\n",
" 33158,\n",
" 49544,\n",
" 82313,\n",
" 33163,\n",
" 16783,\n",
" 33168,\n",
" 401,\n",
" 82322,\n",
" 98709,\n",
" 49558,\n",
" 98715,\n",
" 16796,\n",
" 49565,\n",
" 82333,\n",
" 82334,\n",
" 33189,\n",
" 33191,\n",
" 65960,\n",
" 33193,\n",
" 49579,\n",
" 16812,\n",
" 98740,\n",
" 65973,\n",
" 16822,\n",
" 82359,\n",
" 16825,\n",
" 33210,\n",
" 16827,\n",
" 82365,\n",
" 98751,\n",
" 16832,\n",
" 98754,\n",
" 33220,\n",
" 453,\n",
" 49604,\n",
" 98763,\n",
" 461,\n",
" 469,\n",
" 49623,\n",
" 16856,\n",
" 33244,\n",
" 49629,\n",
" 16867,\n",
" 16869,\n",
" 98794,\n",
" 82410,\n",
" 82412,\n",
" 495,\n",
" 16882,\n",
" 98803,\n",
" 49651,\n",
" 49656,\n",
" 33273,\n",
" 16889,\n",
" 507,\n",
" 33276,\n",
" 82426,\n",
" 66046,\n",
" 49658,\n",
" 16895,\n",
" 49669,\n",
" 82437,\n",
" 33290,\n",
" 49674,\n",
" 16911,\n",
" 530,\n",
" 33300,\n",
" 66069,\n",
" 16918,\n",
" 16922,\n",
" 66077,\n",
" 542,\n",
" 543,\n",
" 82463,\n",
" 66081,\n",
" 16932,\n",
" 66091,\n",
" 556,\n",
" 66093,\n",
" 66094,\n",
" 98860,\n",
" 33324,\n",
" 66097,\n",
" 82475,\n",
" 49709,\n",
" 16942,\n",
" 49715,\n",
" 49716,\n",
" 66106,\n",
" 98877,\n",
" 66111,\n",
" 49729,\n",
" 33346,\n",
" 579,\n",
" 33347,\n",
" 82499,\n",
" 49735,\n",
" 16967,\n",
" 49739,\n",
" 588,\n",
" 16983,\n",
" 82522,\n",
" 16991,\n",
" 82528,\n",
" 49761,\n",
" 49762,\n",
" 66151,\n",
" 66152,\n",
" 66153,\n",
" 17003,\n",
" 49777,\n",
" 98932,\n",
" 17012,\n",
" 66166,\n",
" 17020,\n",
" 17021,\n",
" 639,\n",
" 640,\n",
" 49793,\n",
" 642,\n",
" 17026,\n",
" 98948,\n",
" 17027,\n",
" 82563,\n",
" 49796,\n",
" 17030,\n",
" 66185,\n",
" 82566,\n",
" 651,\n",
" 33422,\n",
" 82576,\n",
" 98962,\n",
" 33427,\n",
" 17043,\n",
" 49812,\n",
" 82584,\n",
" 98970,\n",
" 17050,\n",
" 17052,\n",
" 66207,\n",
" 82592,\n",
" 673,\n",
" 33441,\n",
" 17057,\n",
" 17061,\n",
" 33446,\n",
" 49831,\n",
" 66221,\n",
" 33453,\n",
" 687,\n",
" 66224,\n",
" 82605,\n",
" 49841,\n",
" 33459,\n",
" 66228,\n",
" 17076,\n",
" 694,\n",
" 33464,\n",
" 33466,\n",
" 33468,\n",
" 99005,\n",
" 66238,\n",
" 99007,\n",
" 99006,\n",
" 82628,\n",
" 710,\n",
" 711,\n",
" 82630,\n",
" 99017,\n",
" 82632,\n",
" 718,\n",
" 82639,\n",
" 49872,\n",
" 82645,\n",
" 82652,\n",
" 49889,\n",
" 82658,\n",
" 99044,\n",
" 743,\n",
" 17127,\n",
" 33513,\n",
" 49897,\n",
" 17132,\n",
" 749,\n",
" 49901,\n",
" 49903,\n",
" 82674,\n",
" 33523,\n",
" 17142,\n",
" 759,\n",
" 33527,\n",
" 17144,\n",
" 66299,\n",
" 99068,\n",
" 33535,\n",
" 17152,\n",
" 769,\n",
" 33539,\n",
" 66307,\n",
" 66309,\n",
" 82694,\n",
" 17162,\n",
" 99084,\n",
" 49932,\n",
" 783,\n",
" 17167,\n",
" 66321,\n",
" 99090,\n",
" 66323,\n",
" 49935,\n",
" 82713,\n",
" 99098,\n",
" 66333,\n",
" 66334,\n",
" 82719,\n",
" 800,\n",
" 66337,\n",
" 17185,\n",
" 49954,\n",
" 17190,\n",
" 33576,\n",
" 49962,\n",
" 17196,\n",
" 99120,\n",
" 99122,\n",
" 33587,\n",
" 49972,\n",
" 49974,\n",
" 17207,\n",
" 33592,\n",
" 33593,\n",
" 33600,\n",
" 33604,\n",
" 99140,\n",
" 82757,\n",
" 841,\n",
" 49994,\n",
" 845,\n",
" 99149,\n",
" 17235,\n",
" 82771,\n",
" 66389,\n",
" 854,\n",
" 82775,\n",
" 33624,\n",
" 33625,\n",
" 858,\n",
" 66395,\n",
" 99163,\n",
" 33629,\n",
" 17245,\n",
" 82782,\n",
" 17249,\n",
" 17251,\n",
" 82787,\n",
" 99173,\n",
" 33638,\n",
" 17255,\n",
" 874,\n",
" 66411,\n",
" 82794,\n",
" 17259,\n",
" 878,\n",
" 33647,\n",
" 66415,\n",
" 66417,\n",
" 33650,\n",
" 66418,\n",
" 99182,\n",
" 33652,\n",
" 82798,\n",
" 891,\n",
" 17275,\n",
" 82812,\n",
" 33662,\n",
" 82813,\n",
" 17283,\n",
" 66436,\n",
" 901,\n",
" 99208,\n",
" 66441,\n",
" 33674,\n",
" 33675,\n",
" 17291,\n",
" 82829,\n",
" 33682,\n",
" 916,\n",
" 66452,\n",
" 33688,\n",
" 17304,\n",
" 33691,\n",
" 33692,\n",
" 66461,\n",
" 66463,\n",
" 82847,\n",
" 33697,\n",
" 99234,\n",
" 931,\n",
" 82848,\n",
" 17317,\n",
" 938,\n",
" 17324,\n",
" 82863,\n",
" 944,\n",
" 66480,\n",
" 33716,\n",
" 99254,\n",
" 82878,\n",
" 959,\n",
" 33728,\n",
" 99267,\n",
" 33736,\n",
" 33741,\n",
" 17359,\n",
" 978,\n",
" 17364,\n",
" 33750,\n",
" 82905,\n",
" 17371,\n",
" 992,\n",
" 17378,\n",
" 33765,\n",
" 99301,\n",
" 1000,\n",
" 50152,\n",
" 50155,\n",
" 1005,\n",
" 99309,\n",
" 66542,\n",
" 50158,\n",
" 82928,\n",
" 33780,\n",
" 99318,\n",
" 1017,\n",
" 33787,\n",
" 66557,\n",
" 1024,\n",
" 17408,\n",
" 17409,\n",
" 1027,\n",
" 50179,\n",
" 82949,\n",
" 82950,\n",
" 1034,\n",
" 50187,\n",
" 1036,\n",
" 50191,\n",
" 66577,\n",
" 66580,\n",
" 99351,\n",
" 82969,\n",
" 82970,\n",
" 17437,\n",
" 99360,\n",
" 1057,\n",
" 33825,\n",
" 99363,\n",
" 82977,\n",
" 82979,\n",
" 33831,\n",
" 66600,\n",
" 17447,\n",
" 17448,\n",
" 33837,\n",
" 50221,\n",
" 66607,\n",
" 99376,\n",
" 33841,\n",
" 82993,\n",
" 99380,\n",
" 66613,\n",
" 82996,\n",
" 82998,\n",
" 33849,\n",
" 83003,\n",
" 66622,\n",
" 1087,\n",
" 1088,\n",
" 99391,\n",
" 17473,\n",
" 83011,\n",
" 50245,\n",
" 1095,\n",
" 1098,\n",
" 66637,\n",
" 1102,\n",
" 99405,\n",
" 50256,\n",
" 99410,\n",
" 66644,\n",
" 99413,\n",
" 66646,\n",
" 99419,\n",
" 66652,\n",
" 83037,\n",
" 1118,\n",
" 83040,\n",
" 99431,\n",
" 66663,\n",
" 1129,\n",
" 33900,\n",
" 17516,\n",
" 66670,\n",
" 83052,\n",
" 50290,\n",
" 17524,\n",
" 33912,\n",
" 83065,\n",
" 33914,\n",
" 66685,\n",
" 83071,\n",
" 17537,\n",
" 50309,\n",
" 33927,\n",
" 50314,\n",
" 99467,\n",
" 99470,\n",
" 50318,\n",
" 99472,\n",
" 66705,\n",
" 83088,\n",
" 66708,\n",
" 66709,\n",
" 50327,\n",
" 99484,\n",
" 99485,\n",
" 83101,\n",
" 99487,\n",
" 66721,\n",
" 33954,\n",
" 1187,\n",
" 83105,\n",
" 50337,\n",
" 50341,\n",
" 99495,\n",
" 66728,\n",
" 50345,\n",
" 50346,\n",
" 17578,\n",
" 66732,\n",
" 83116,\n",
" 66734,\n",
" 50351,\n",
" 50356,\n",
" 99510,\n",
" 17591,\n",
" 50363,\n",
" 99517,\n",
" 66751,\n",
" 17599,\n",
" 1223,\n",
" 33992,\n",
" 99528,\n",
" 17607,\n",
" 50376,\n",
" 50379,\n",
" 17616,\n",
" 99537,\n",
" 66770,\n",
" 99539,\n",
" 1237,\n",
" 1242,\n",
" 66779,\n",
" 66780,\n",
" 17628,\n",
" 1247,\n",
" 17631,\n",
" 83168,\n",
" 50402,\n",
" 99555,\n",
" 50405,\n",
" 34024,\n",
" 50409,\n",
" 50411,\n",
" 1265,\n",
" 34034,\n",
" 83186,\n",
" 17651,\n",
" 50419,\n",
" 1270,\n",
" 34042,\n",
" 99580,\n",
" 1277,\n",
" 66814,\n",
" 99583,\n",
" 83200,\n",
" 17671,\n",
" 50441,\n",
" 17677,\n",
" 34064,\n",
" 1297,\n",
" 66834,\n",
" 50448,\n",
" 50449,\n",
" 50456,\n",
" 66844,\n",
" 83229,\n",
" 83231,\n",
" 1312,\n",
" 66849,\n",
" 50464,\n",
" 17698,\n",
" 17699,\n",
" 83236,\n",
" 17702,\n",
" 1326,\n",
" 99632,\n",
" 17713,\n",
" 66866,\n",
" 34099,\n",
" 83252,\n",
" 99637,\n",
" 1334,\n",
" 66871,\n",
" 83258,\n",
" 66877,\n",
" 1342,\n",
" 50493,\n",
" 17728,\n",
" 1345,\n",
" 99651,\n",
" 66887,\n",
" 1352,\n",
" 50503,\n",
" 1354,\n",
" 17737,\n",
" 50508,\n",
" 66895,\n",
" 99671,\n",
" 66909,\n",
" 34143,\n",
" 99681,\n",
" 83300,\n",
" 17770,\n",
" 1392,\n",
" 66929,\n",
" 34162,\n",
" 17777,\n",
" 1401,\n",
" 1402,\n",
" 34169,\n",
" 50554,\n",
" 99709,\n",
" 17788,\n",
" 1407,\n",
" 66944,\n",
" 66945,\n",
" 50557,\n",
" 1412,\n",
" 1415,\n",
" 34184,\n",
" 1420,\n",
" 99729,\n",
" 1426,\n",
" 34196,\n",
" 83351,\n",
" 66971,\n",
" 1436,\n",
" 83357,\n",
" 66974,\n",
" 17821,\n",
" 1443,\n",
" 83363,\n",
" 50598,\n",
" 34217,\n",
" 66986,\n",
" 83369,\n",
" 1452,\n",
" 66993,\n",
" 50609,\n",
" 1459,\n",
" 34227,\n",
" 66996,\n",
" 83385,\n",
" 50618,\n",
" 50626,\n",
" 50627,\n",
" 83406,\n",
" 83407,\n",
" 34258,\n",
" 50643,\n",
" 83412,\n",
" 99797,\n",
" 50647,\n",
" 99803,\n",
" 50652,\n",
" 50654,\n",
" 1508,\n",
" 50661,\n",
" 99814,\n",
" 1512,\n",
" 1515,\n",
" 50667,\n",
" 67053,\n",
" 99821,\n",
" 17901,\n",
" 99824,\n",
" 99825,\n",
" 83438,\n",
" 50672,\n",
" 34292,\n",
" 83441,\n",
" 50677,\n",
" 17910,\n",
" 83446,\n",
" 83447,\n",
" 83448,\n",
" 17916,\n",
" 17918,\n",
" 34303,\n",
" 67071,\n",
" 34307,\n",
" 1540,\n",
" 67077,\n",
" 34310,\n",
" 67079,\n",
" 83460,\n",
" 99854,\n",
" 83473,\n",
" 67092,\n",
" 67094,\n",
" 99862,\n",
" 67100,\n",
" 1565,\n",
" 34333,\n",
" 1567,\n",
" 34338,\n",
" 99877,\n",
" 1579,\n",
" 83502,\n",
" 83504,\n",
" 1585,\n",
" 67127,\n",
" 99897,\n",
" 50753,\n",
" 83521,\n",
" 1603,\n",
" 1604,\n",
" 17991,\n",
" 17992,\n",
" 50760,\n",
" 83527,\n",
" 67147,\n",
" 17996,\n",
" 67152,\n",
" 34388,\n",
" 67156,\n",
" 99924,\n",
" 1623,\n",
" 34395,\n",
" 99931,\n",
" 18016,\n",
" 1635,\n",
" 99946,\n",
" 67180,\n",
" 99949,\n",
" 99954,\n",
" 34419,\n",
" 83571,\n",
" 99957,\n",
" 83572,\n",
" 83573,\n",
" 18043,\n",
" 50811,\n",
" 67197,\n",
" 34434,\n",
" 99970,\n",
" 18053,\n",
" 83590,\n",
" 34440,\n",
" 1673,\n",
" 50826,\n",
" 1675,\n",
" 1676,\n",
" 34443,\n",
" 67212,\n",
" 67216,\n",
" 18065,\n",
" 18066,\n",
" 18068,\n",
" 18069,\n",
" 50838,\n",
" 50839,\n",
" 34457,\n",
" 50841,\n",
" 1691,\n",
" 1692,\n",
" 83615,\n",
" 1697,\n",
" 50855,\n",
" 34473,\n",
" 67241,\n",
" 50861,\n",
" 34480,\n",
" 67250,\n",
" 18104,\n",
" 18108,\n",
" 83645,\n",
" 18112,\n",
" 18116,\n",
" 67269,\n",
" 1743,\n",
" 83667,\n",
" 67284,\n",
" 50904,\n",
" 83674,\n",
" 50910,\n",
" 34528,\n",
" 34529,\n",
" 18146,\n",
" 50917,\n",
" 83688,\n",
" 50923,\n",
" 1775,\n",
" 18160,\n",
" 18168,\n",
" 34553,\n",
" 67322,\n",
" 67323,\n",
" 18171,\n",
" 50939,\n",
" 67327,\n",
" 50944,\n",
" 50947,\n",
" 1798,\n",
" 83718,\n",
" 34572,\n",
" 1805,\n",
" 83724,\n",
" 34576,\n",
" 67344,\n",
" 83730,\n",
" 67350,\n",
" 1820,\n",
" 1821,\n",
" 34590,\n",
" 50972,\n",
" 1825,\n",
" 50986,\n",
" 50988,\n",
" 34606,\n",
" 1839,\n",
" 67375,\n",
" 50990,\n",
" 50993,\n",
" 67379,\n",
" 83767,\n",
" 51000,\n",
" 1852,\n",
" 34620,\n",
" 83774,\n",
" 18239,\n",
" 1861,\n",
" 18245,\n",
" 1866,\n",
" 67402,\n",
" 1874,\n",
" 83795,\n",
" 83799,\n",
" 67416,\n",
" 18270,\n",
" 51039,\n",
" 83807,\n",
" 83810,\n",
" 51043,\n",
" 51046,\n",
" 67431,\n",
" 51047,\n",
" 1904,\n",
" 67441,\n",
" 18289,\n",
" 34687,\n",
" 1920,\n",
" 34689,\n",
" 1919,\n",
" 51071,\n",
" 51074,\n",
" 83846,\n",
" 18314,\n",
" 51084,\n",
" 34701,\n",
" 18319,\n",
" 34708,\n",
" 34710,\n",
" 18326,\n",
" 51095,\n",
" 67492,\n",
" 67493,\n",
" 34725,\n",
" 1959,\n",
" 51108,\n",
" 67500,\n",
" 18348,\n",
" 83887,\n",
" 1970,\n",
" 83895,\n",
" 34744,\n",
" 51128,\n",
" 83898,\n",
" 83902,\n",
" 34753,\n",
" 83906,\n",
" 51140,\n",
" 18373,\n",
" 67531,\n",
" 18384,\n",
" 2004,\n",
" 51159,\n",
" 2008,\n",
" 2009,\n",
" 18397,\n",
" 18401,\n",
" 83940,\n",
" 2023,\n",
" 67560,\n",
" 34793,\n",
" 67562,\n",
" 34795,\n",
" 51175,\n",
" 83951,\n",
" 51184,\n",
" 83953,\n",
" ...}"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
   ],
   "source": [
    "get_retrieve_result('什么时候发货') # Return the document IDs found through the inverted index"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:greedyaiqa] *",
"language": "python",
"name": "conda-env-greedyaiqa-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
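
The retrieval step in `get_retrieve_result` can be tried without the pickled data. The following is a minimal sketch, assuming a tiny hand-made corpus that is already tokenized and stop-word filtered; `build_inverted_index`, `retrieve`, and the toy `questions` list are hypothetical names introduced only for illustration and are not part of the notebook.

```python
from collections import defaultdict


def build_inverted_index(tokenized_questions):
    """Map each word to the set of question indices that contain it."""
    index = defaultdict(set)
    for doc_id, words in enumerate(tokenized_questions):
        for word in words:
            index[word].add(doc_id)
    return dict(index)


def retrieve(query_words, index):
    """Return the union of the posting sets of every query word found in the index."""
    candidates = set()
    for word in query_words:
        # Words missing from the index contribute no candidates.
        candidates |= index.get(word, set())
    return candidates


# Toy corpus (an assumption for this sketch), one token list per question.
questions = [
    ["买", "二份", "有没有", "少点"],
    ["什么时候", "发货"],
    ["发货", "快递"],
    ["免", "运费"],
]
index = build_inverted_index(questions)
print(retrieve(["什么时候", "发货"], index))
```

Here the candidates are questions 1 and 2, the only ones that share a word with the query. Taking the union rather than the intersection keeps recall high, which suits a first-pass filter whose output is ranked by a later similarity step.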