Commit 8283d608, authored Jun 23, 2021 by 20210509028 (project: homework)
Parent: c281b257
Showing 1 changed file with 604 additions and 0 deletions
project2/Retrieve.ipynb (new file, mode 0 → 100644)
{
"cells": [
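{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Building the inverted index\n",
"An inverted index makes search faster and is a technique commonly used in search engines. Following the method taught in the course, you need to complete the code in this section."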
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from tqdm import tqdm\n",
"import numpy as np\n",
"import pickle\n",
"from gensim.models import KeyedVectors  # word vectors, used to compare pairwise word similarity"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Load the data: read data/question_answer_pares.pkl (generated in preprocessor.ipynb) into the variable QApares\n",
"with open('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/question_answer_pares.pkl','rb') as f:\n",
"    QApares = pickle.load(f)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```TODO1``` Build an inverted index; word similarity does not need to be considered here"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'东西'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"QApares.question_after_preprocessing.values[18][1]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'买': [0, 18, 37, 45, 47, 62, 66, 84, 93, 95, 114, 158, 158, 163, 170, 182, 183, 189, 201, 243, 266, 276, 294, 298, 330, 357, 376, 392, 392, 398, 423, 463, 465, 496, 545, 549, 577, 624, 652, 652, 652, 662, 683, 700, 702, 713, 714, 720, 740, 742, 799, 799, 832, 839, 841, 843, 849, 851, 879, 884, 910, 910, 931, 946, 950, 982, 989, 994], '运费': [3, 30, 110, 177, 244, 322, 527, 560, 747, 860], '好吃': [4, 25, 25, 180, 315, 315, 395, 647, 671, 872, 879], '发货': [5, 17, 18, 29, 55, 64, 102, 118, 122, 125, 133, 142, 150, 151, 187, 200, 220, 223, 223, 254, 261, 275, 310, 318, 344, 350, 364, 401, 401, 453, 461, 469, 495, 530, 542, 543, 556, 579, 588, 639, 640, 642, 651, 673, 687, 694, 710, 718, 743, 749, 759, 769, 800, 841, 845, 854, 858, 874, 878, 891, 901, 916, 931, 938, 944, 959, 978, 992], '谢谢': [6, 59, 62, 90, 140, 168, 172, 192, 197, 206, 225, 244, 246, 256, 363, 373, 404, 408, 420, 440, 499, 502, 509, 529, 567, 583, 588, 596, 598, 611, 655, 736, 789, 827, 831, 836, 884, 914, 930, 958, 972, 977], '拍': [8, 30, 100, 120, 171, 171, 187, 209, 217, 217, 254, 257, 261, 266, 266, 345, 349, 446, 448, 470, 498, 509, 532, 560, 561, 584, 591, 610, 621, 640, 649, 657, 697, 697, 707, 712, 738, 760, 769, 769, 852, 858, 873, 891, 918, 945, 983, 997], '没有': [13, 40, 40, 82, 86, 92, 123, 141, 298, 300, 315, 346, 391, 428, 453, 476, 519, 576, 579, 617, 626, 652, 731, 740, 745, 746, 752, 797, 875, 879, 918, 931, 968], '吃': [15, 27, 27, 413, 537, 842, 877, 896, 993], '几天': [16, 23, 528, 585, 612, 769, 822, 880, 978], '东西': [18, 111, 145, 158, 360, 406, 457, 582, 697, 843, 849, 931, 946, 950], '没': [18, 57, 65, 142, 150, 152, 290, 335, 356, 360, 360, 360, 365, 401, 413, 457, 506, 519, 560, 569, 573, 593, 621, 635, 665, 728, 765, 800, 844, 849, 853, 860, 921, 960, 978], '一个': [20, 36, 80, 96, 213, 248, 283, 391, 442, 463, 528, 752, 779, 818, 824, 876, 915, 915, 933, 985, 985], '邮政': [20, 57, 164, 320, 501, 585, 795], '下次': [22, 47, 276, 827, 879, 910], '你家': [22, 417, 587, 607, 648, 677, 843, 
872, 879, 985, 998], '大概': [23, 137, 238, 323, 372, 373, 927, 996], '知道': [23, 235, 267, 278, 279, 398, 404, 440, 466, 466, 501, 573, 692, 730, 865, 996], '干': [24, 49, 80, 106, 284, 472, 537, 618], '核桃': [27, 51, 162, 213, 260, 287, 338, 506, 521, 633, 692, 742, 747, 879, 993], '一起': [27, 62, 80, 96, 640, 982], '购买': [27, 44, 110, 386, 457, 527, 677], '会': [28, 78, 187, 187, 194, 266, 386, 405, 480, 581, 647, 688, 805, 834, 849, 948, 957], '不会': [28, 78, 224, 355, 383, 405, 615, 630, 739, 763, 786, 834], '今天': [29, 64, 102, 120, 187, 199, 223, 223, 261, 275, 344, 364, 484, 495, 542, 631, 639, 640, 650, 651, 673, 694, 710, 756, 783, 824, 852, 854, 858, 909, 942, 956, 969, 992, 999], '重新': [30, 171, 209, 507, 509, 697, 983], '多久': [32, 120, 137, 373, 505, 862], '100': [32, 208, 425, 500, 500, 527, 979], '货': [33, 72, 148, 519, 607, 646, 647, 684, 751, 797, 822, 912, 930, 942], '不要': [34, 36, 74, 228, 279, 802, 883, 928], '说': [35, 65, 82, 123, 144, 205, 239, 278, 312, 335, 355, 519, 543, 602, 683, 691, 692, 726, 780, 849, 857, 879], '发': [36, 80, 80, 82, 113, 173, 180, 223, 230, 286, 299, 299, 303, 312, 320, 327, 409, 485, 487, 603, 611, 639, 672, 681, 774, 796, 844, 951, 956, 965, 968, 969, 970, 973, 974, 982, 999], '包邮': [37, 96, 96, 110, 591, 595, 624, 633], '现在': [38, 91, 93, 200, 201, 254, 261, 298, 351, 448, 449, 458, 458, 459, 469, 544, 626, 760, 841, 858, 873, 878, 893, 918, 963], '钱': [38, 179, 182, 239, 264, 355, 387, 644, 832, 932, 947, 989, 993], '问': [39, 305, 498, 538, 576, 577, 640, 677, 768, 948], '活动': [39, 126, 152, 161, 365, 379, 500, 632, 677, 677, 677, 806, 943, 955, 962, 962, 991], '优惠': [40, 45, 98, 131, 157, 157, 298, 306, 311, 357, 415, 417, 458, 466, 644, 771, 786, 917, 957], '10': [40, 45, 436, 663, 684, 794, 938], '点': [40, 45, 98, 147, 157, 163, 271, 311, 423, 517, 635, 635, 685, 702, 771, 786, 888], '收到': [46, 105, 153, 335, 375, 506, 612, 822, 929, 930, 942], '送': [47, 51, 51, 136, 138, 158, 171, 183, 207, 249, 266, 281, 297, 348, 362, 
398, 406, 425, 435, 436, 436, 465, 477, 636, 728, 738, 747, 779, 791, 839, 849, 849, 851, 857, 871, 940, 948, 995, 997], '一点': [47, 66, 83, 145, 227, 833, 917], '一下': [47, 100, 118, 244, 270, 322, 361, 394, 483, 509, 536, 538, 561, 582, 604, 647, 677, 683, 805, 832, 948, 953, 983], '订单': [48, 75, 101, 223, 279, 488, 747], '看到': [48, 152, 302, 360, 496, 500, 576, 635], '榴莲': [49, 80, 472, 537, 769, 877], '快递': [49, 56, 63, 86, 108, 113, 124, 148, 164, 173, 180, 195, 210, 223, 232, 232, 239, 265, 278, 286, 291, 299, 312, 335, 335, 367, 400, 409, 429, 482, 487, 507, 528, 540, 563, 603, 608, 616, 626, 639, 667, 672, 681, 753, 757, 759, 766, 768, 781, 783, 785, 796, 908, 909, 931, 938, 964, 965, 974, 985], '明天': [55, 82, 125, 543, 543, 610, 631, 632, 647, 929, 948, 991], '想': [57, 66, 114, 466, 677, 879, 948, 977], '行': [60, 170, 263, 311, 705, 711], '付款': [64, 102, 140, 211, 283, 345, 526, 683], '请': [64, 140, 148, 256, 461, 588], '算了': [82, 361, 391, 698, 713, 988], '已经': [82, 139, 211, 239, 612, 753, 783, 844], '不能': [82, 91, 136, 153, 223, 223, 223, 227, 351, 415, 528, 540, 711, 719, 826, 854], '催': [82, 155, 367, 367, 718, 766], '纯棉': [84, 101, 237, 262, 652, 652, 813, 888], '之前': [86, 227, 419, 463, 702, 718, 846], '你好': [94, 110, 166, 208, 213, 354, 389, 600, 638, 664, 674, 676, 742, 812, 977, 998], '麻烦': [97, 164, 177, 184, 253, 289, 412, 588, 604, 829, 977], '给我发': [101, 507, 524, 698, 918, 948], '一定': [104, 162, 386, 388, 467, 543, 543, 569, 802], '两个': [109, 298, 342, 462, 695, 985], '元': [110, 171, 348, 396, 446, 452, 463, 474, 551, 663, 684, 821, 859, 860, 860, 910, 975, 993], '这款': [114, 284, 334, 393, 417, 816], '号': [117, 783, 805, 870, 918, 938, 938, 950], '亲': [118, 142, 271, 289, 309, 365, 418, 440, 445, 463, 470, 569, 587, 677, 715, 761, 858, 898, 930, 955, 958, 964], '备注': [118, 290, 290, 291, 490, 679, 873], '尽快': [118, 461, 588, 611, 845, 916], '帮': [123, 244, 278, 291, 410, 509, 560, 569, 579, 604, 747, 747, 757, 802, 850, 977], '生产日期': [132, 
313, 471, 506, 706, 790], '一包': [141, 141, 260, 307, 446, 531, 573, 573, 573, 932], '包装': [144, 227, 271, 360, 360, 442, 442, 706, 987], '退款': [153, 194, 243, 351, 392, 401, 612, 642, 697, 718], '是不是': [159, 342, 573, 576, 579, 589, 646, 692, 788, 941, 983], '退': [163, 171, 171, 239, 270, 473, 786, 818, 838, 860, 931, 932], '应该': [175, 290, 607, 635, 645, 996], '便宜': [179, 278, 423, 452, 528, 702, 702, 744, 833], '改': [192, 202, 244, 277, 322, 336, 337, 341, 399, 455, 543, 569, 760, 764, 764, 983], '价格': [192, 202, 295, 311, 330, 336, 430, 522, 719, 760, 837], '达': [195, 429, 614, 622, 669, 757, 814], '里面': [198, 360, 448, 742, 937, 952, 987], '前': [208, 362, 362, 425, 477, 500, 500], '太': [210, 370, 528, 770, 828, 866], '袋': [212, 548, 577, 702, 713, 769, 821], '两份': [217, 266, 266, 738, 940, 995], '请问': [242, 272, 286, 354, 379, 396, 456, 603, 620, 796, 952], '划算': [242, 460, 480, 500, 589, 663, 716], '区别': [242, 407, 486, 548, 558, 620, 695, 734, 815, 816, 902, 998], '看看': [248, 536, 545, 571, 785, 802, 872, 909], '湿巾': [266, 362, 500, 606, 897, 994], '地址': [270, 366, 385, 392, 399, 445, 515, 543, 735, 757, 774, 823, 954], '两包': [281, 327, 531, 641, 663, 738, 995], '款': [392, 476, 480, 678, 740, 994], '先': [460, 483, 536, 697, 711, 785, 890], '一直': [560, 582, 677, 754, 856, 860, 950, 994], '包': [677, 685, 693, 716, 839, 839, 847, 940]}\n"
]
}
],
"source": [
"# Build an inverted index; see the lab manual for details on inverted indexes\n",
"# For fast lookup the inverted index should be stored in a hash table. Python dicts are implemented as hash tables, so we store the index directly in a dict\n",
"# Note: similarity between words does not need to be considered here.\n",
"inverted_list = {}\n",
"for index,sentence in enumerate(QApares.question_after_preprocessing):\n",
"    ### your code starts here\n",
"    # Record, for every word in the sentence, the index of the sentence it appears in\n",
"    for word in sentence:\n",
"        if word not in inverted_list:\n",
"            inverted_list[word] = [index]\n",
"        else:\n",
"            inverted_list[word].append(index)\n",
"    ### your code ends here\n",
"print(inverted_list)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# data/retrieve/sgns.zhihu.word is a pretrained Chinese word-vector file downloaded from https://github.com/Embedding/Chinese-Word-Vectors\n",
"# Load the pretrained word-vector file with KeyedVectors.load_word2vec_format()\n",
"model = KeyedVectors.load_word2vec_format('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/retrieve/sgns.zhihu.word')"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0.242318, 0.280021, -0.20314 , -0.145087, 0.36263 , -0.1816 ,\n",
" 0.284412, -0.019258, 0.177943, -0.099748, -0.400317, -0.271987,\n",
" 0.004162, 0.087728, -0.191669, 0.085236, -0.414037, -0.217561,\n",
" -0.026183, -0.073249, -0.353846, 0.531875, 0.034659, -0.518132,\n",
" -0.068612, 0.268555, -0.205442, -0.127533, -0.162505, 0.048173,\n",
" 0.208574, -0.114745, 0.488997, -0.105653, 0.412296, 0.290467,\n",
" 0.297453, 0.045128, -0.179182, -0.050785, -0.347173, 0.024136,\n",
" -0.073133, -0.159962, 0.336675, -0.194373, -0.035422, -0.417701,\n",
" -0.152953, -0.118537, 0.126737, -0.008121, -0.181715, 0.087763,\n",
" 0.133048, 0.066853, 0.120959, 0.407376, 0.421318, 0.010574,\n",
" -0.264059, -0.033783, -0.032204, -0.336478, 0.231362, -0.313407,\n",
" 0.010844, 0.322092, -0.064023, -0.072975, 0.482646, 0.247257,\n",
" -0.152881, 0.231577, -0.577699, -0.121009, 0.370879, 0.106395,\n",
" 0.140295, -0.173298, -0.19279 , 0.398323, -0.183889, 0.006687,\n",
" 0.233605, -0.156288, 0.227104, 0.296975, -0.271261, 0.187074,\n",
" 0.08922 , 0.28858 , -0.152105, 0.056492, 0.125034, 0.280701,\n",
" -0.162378, 0.05236 , 0.021651, 0.171454, -0.287902, -0.06771 ,\n",
" -0.092863, -0.376919, -0.100612, -0.300021, 0.135061, -0.178398,\n",
" 0.156385, -0.068852, 0.530265, 0.415922, 0.075915, -0.261174,\n",
" -0.073183, -0.49721 , -0.43326 , 0.179234, -0.245175, -0.281062,\n",
" 0.043945, 0.095778, 0.684457, -0.193636, 0.22517 , -0.209223,\n",
" 0.523899, 0.531363, -0.097986, 0.03949 , -0.05308 , 0.051976,\n",
" 0.282402, -0.969551, 0.093102, -0.450612, 0.013247, -0.334614,\n",
" 0.24655 , -0.092421, -0.49236 , -0.075924, 0.205937, 0.19194 ,\n",
" 0.38601 , -0.039358, -0.607225, 0.046907, -0.13057 , -0.135274,\n",
" 0.109761, -0.13222 , 0.095713, 0.199767, -0.376427, -0.246138,\n",
" -0.078834, -0.213597, 0.003407, 0.227816, 0.287446, -0.437951,\n",
" -0.323966, 0.365273, 0.036325, 0.184054, -0.023083, -0.119412,\n",
" -0.045965, 0.053003, 0.082344, -0.441248, -0.104839, 0.148266,\n",
" 0.130892, -0.154648, 0.264133, 0.237587, -0.296352, 0.027918,\n",
" -0.067289, 0.799129, -0.002858, 0.047829, -0.194186, 0.251592,\n",
" 0.438095, -0.076399, -0.197788, 0.052582, 0.014761, -0.148981,\n",
" -0.092148, -0.297858, 0.181634, -0.127532, 0.060884, 0.075109,\n",
" -0.102974, 0.053721, -0.041425, 0.193051, -0.059818, 0.072369,\n",
" 0.361515, 0.284493, -0.077284, -0.03257 , 0.100904, -0.64975 ,\n",
" 0.089815, 0.259889, 0.052167, -0.009631, -0.2213 , -0.894523,\n",
" -0.072184, 0.268654, 0.069187, 0.019037, -0.096706, 0.222579,\n",
" 0.002876, -0.059017, 0.11002 , -0.467918, 0.735202, 0.510903,\n",
" 0.144756, 0.055391, -0.039559, 0.30669 , -0.332257, -0.164166,\n",
" 0.108381, -0.116477, -0.066806, -0.203976, 0.346737, 0.103472,\n",
" 0.74588 , -0.029833, -0.226489, 0.131728, -0.150197, -0.044112,\n",
" 0.229867, 0.130267, -0.338631, -0.199297, 0.144599, 0.362195,\n",
" -0.028561, 0.158175, 0.027781, -0.019202, -0.120493, -0.56064 ,\n",
" 0.272535, 0.007604, -0.296518, -0.512479, 0.04645 , 0.256112,\n",
" 0.178299, -0.027253, 0.259456, 0.01392 , -0.34781 , -0.105363,\n",
" 0.039269, 0.193894, -0.226721, 0.072046, 0.529211, 0.119466,\n",
" -0.245952, -0.040297, -0.080179, -0.298022, 0.368636, -0.126542,\n",
" -0.077372, 0.19964 , 0.177998, -0.082294, 0.053495, 0.012034,\n",
" -0.178534, 0.212586, 0.128121, -0.279217, 0.429773, -0.193822,\n",
" -0.336884, 0.311373, -0.026917, -0.085629, 0.080693, -0.0693 ],\n",
" dtype=float32)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model['感冒药']"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.8139644067932893"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Compute the cosine similarity of any two words from their word vectors\n",
"def cos_sim(vector_a, vector_b):\n",
"    vector_a = np.mat(vector_a)\n",
"    vector_b = np.mat(vector_b)\n",
"    num = float(vector_a * vector_b.T)  # dot product of the two vectors\n",
"    denom = np.linalg.norm(vector_a) * np.linalg.norm(vector_b)\n",
"    cos = num / denom\n",
"    sim = 0.5 + 0.5 * cos  # rescale from [-1, 1] to [0, 1]\n",
"    return sim\n",
"cos_sim(model[\"阴天\"],model[\"晴天\"])"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'刮风'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def get_similar_by_word(word,topk):\n",
"    '''\n",
"    Return the list of the topk words most similar to word\n",
"    Returns:\n",
"        word_list: the topk words most similar to word, in the form [word1, word2, word3, word4, word5]\n",
"    '''\n",
"    similar_words = model.similar_by_word(word,topk)\n",
"    word_list = [similar[0] for similar in similar_words]\n",
"    return word_list\n",
"get_similar_by_word(\"阴天\",5)[1]"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([-3.17429e-01, 3.50419e-01, 1.83339e-01, -2.94480e-02,\n",
" 4.33936e-01, 3.87034e-01, 1.76635e-01, 6.69800e-03,\n",
" -3.58115e-01, -6.93410e-02, -5.01974e-01, 2.41237e-01,\n",
" -2.21686e-01, -3.25771e-01, -2.07050e-01, 3.54428e-01,\n",
" -2.53675e-01, -2.40180e-02, 4.58380e-02, -5.33140e-02,\n",
" -2.76523e-01, 1.51700e-02, -1.59316e-01, -1.90596e-01,\n",
" 2.76345e-01, 1.22659e-01, -3.52800e-03, -2.86334e-01,\n",
" -2.72330e-01, 4.70316e-01, -4.06179e-01, -2.46439e-01,\n",
" -3.78340e-02, 5.71498e-01, 3.02080e-01, 1.37265e-01,\n",
" -6.59030e-02, -6.49700e-02, 1.83716e-01, 5.15830e-02,\n",
" -5.78495e-01, 2.23400e-03, 1.21261e-01, 5.18558e-01,\n",
" 3.13571e-01, 4.40266e-01, 2.79258e-01, -2.51387e-01,\n",
" 1.63034e-01, -1.85571e-01, 2.40120e-02, -6.06328e-01,\n",
" -1.93281e-01, 4.73970e-02, 1.87440e-01, -3.69999e-01,\n",
" 1.10618e-01, -1.73538e-01, 8.47500e-02, 6.67570e-02,\n",
" -1.86387e-01, 1.69726e-01, 8.62980e-02, -1.77526e-01,\n",
" 6.18764e-01, -3.56652e-01, 3.48822e-01, 1.93979e-01,\n",
" 3.89840e-01, -1.79693e-01, 2.41427e-01, 9.54200e-03,\n",
" 2.17311e-01, 1.26466e-01, 1.00387e-01, 2.59241e-01,\n",
" -2.36720e-02, 2.97217e-01, -2.22128e-01, -8.48480e-02,\n",
" -1.15659e-01, 1.89916e-01, 2.16194e-01, 2.40728e-01,\n",
" 6.07440e-02, 8.00180e-02, -1.80182e-01, 7.22319e-01,\n",
" -1.15640e-02, 1.27173e-01, -1.96589e-01, -1.32580e-02,\n",
" 1.02395e-01, 2.95336e-01, 2.12951e-01, -1.57784e-01,\n",
" 4.61191e-01, 1.11355e-01, -3.09651e-01, -1.34621e-01,\n",
" -2.82010e-01, 5.57414e-01, 1.20933e-01, -4.92023e-01,\n",
" 1.04745e-01, -1.53617e-01, -3.35699e-01, -1.65150e-01,\n",
" 1.89110e-01, -1.83353e-01, 4.61495e-01, 4.55130e-02,\n",
" -3.73119e-01, -3.57460e-01, -5.14130e-02, -1.95919e-01,\n",
" -1.86161e-01, -5.76750e-02, -1.44046e-01, -2.03275e-01,\n",
" 1.50491e-01, -4.03152e-01, 3.37345e-01, 3.18200e-03,\n",
" 4.74520e-02, -2.54625e-01, 1.87776e-01, 1.62288e-01,\n",
" -4.72474e-01, 7.14720e-02, 1.63166e-01, -6.60160e-02,\n",
" 3.79150e-02, -3.81038e-01, 4.68620e-02, -1.59246e-01,\n",
" 3.38916e-01, -6.10040e-01, 2.34290e-02, 2.21173e-01,\n",
" -2.51696e-01, -1.65990e-02, -1.94380e-01, -2.34510e-02,\n",
" 2.33676e-01, -2.52557e-01, 3.04723e-01, 2.27229e-01,\n",
" -4.17142e-01, -1.22740e-02, 2.53935e-01, -8.14810e-02,\n",
" -3.79260e-02, -1.36534e-01, 5.78090e-02, -1.01162e-01,\n",
" 3.68702e-01, -4.59950e-02, -2.43824e-01, 8.24470e-02,\n",
" 1.91396e-01, -2.49273e-01, -4.05858e-01, 2.83549e-01,\n",
" 9.27080e-02, 2.21971e-01, -3.86260e-01, 1.35761e-01,\n",
" -3.20225e-01, 1.26944e-01, 4.11720e-02, -1.41876e-01,\n",
" -1.81637e-01, 7.48700e-02, 7.55780e-02, -9.23700e-03,\n",
" -2.15860e-02, 8.18210e-02, -2.28575e-01, 3.97945e-01,\n",
" 3.33800e-02, 2.34973e-01, -5.59178e-01, -5.18610e-02,\n",
" 1.08180e-01, 4.02300e-02, -2.63620e-02, 1.70661e-01,\n",
" 7.68950e-02, -4.39535e-01, -2.92898e-01, 6.22060e-02,\n",
" -2.62560e-01, -1.98636e-01, 2.38575e-01, -2.10745e-01,\n",
" 9.03290e-02, -1.51103e-01, 2.31656e-01, -2.72834e-01,\n",
" 1.54018e-01, -6.98653e-01, -4.53500e-02, 3.12950e-02,\n",
" 1.53099e-01, 2.87303e-01, -1.15267e-01, 3.58764e-01,\n",
" 1.03935e-01, -3.46535e-01, 2.20197e-01, -4.36054e-01,\n",
" -8.98140e-02, 7.39560e-02, -3.70363e-01, -2.90207e-01,\n",
" 3.95486e-01, 3.51225e-01, -7.57010e-02, -8.28440e-02,\n",
" -8.01980e-02, 1.67353e-01, -1.90671e-01, -2.21765e-01,\n",
" 1.14005e-01, -2.43832e-01, 1.65086e-01, 3.71588e-01,\n",
" 7.16840e-02, 2.02490e-02, -3.35774e-01, 3.60321e-01,\n",
" -2.18428e-01, -5.48593e-01, 2.23213e-01, 8.38120e-02,\n",
" 1.32154e-01, -1.92319e-01, 2.99350e-02, 3.67071e-01,\n",
" -3.52000e-03, -1.62794e-01, -1.77430e-01, 4.28930e-02,\n",
" -2.50198e-01, 3.86700e-03, 2.31180e-02, -3.19086e-01,\n",
" 1.93166e-01, -4.17477e-01, 2.73502e-01, 3.62869e-01,\n",
" -3.67103e-01, 4.85054e-01, 4.78180e-02, 4.97660e-02,\n",
" -2.25295e-01, -4.45640e-02, -6.67400e-02, 1.54758e-01,\n",
" -3.82405e-01, -5.50330e-02, -2.16761e-01, -1.98404e-01,\n",
" 4.69792e-01, -1.43327e-01, 3.46644e-01, 1.79300e-03,\n",
" -1.48577e-01, 1.30571e-01, 2.47140e-01, 2.01360e-02,\n",
" 4.44790e-02, -1.18602e-01, 3.72478e-01, -1.56236e-01,\n",
" 5.31280e-02, -4.50350e-02, -3.98054e-01, -2.95267e-01,\n",
" 4.29560e-01, 7.47020e-02, -3.32763e-01, -4.54000e-04,\n",
" 2.01490e-02, 2.98967e-01, 1.10296e-01, -2.59987e-01,\n",
" -9.22760e-02, 5.07842e-01, -2.35260e-02, 6.58140e-02,\n",
" 2.73014e-01, -3.34090e-01, -1.05600e-03, 3.09531e-01,\n",
" -1.59764e-01, 1.41541e-01, 3.15749e-01, -2.73243e-01],\n",
" dtype=float32)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model[get_similar_by_word(\"阴天\",5)[1]]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['昨晚', '前天', '前两天', '上次', '昨天晚上']"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_similar_by_word(\"昨天\",5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```TODO2``` Build a new inverted index that takes the semantic similarity between words into account"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████████████████████████████████████████████████████████████████████████████| 97/97 [00:14<00:00, 6.89it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'买': [0, 18, 37, 45, 47, 62, 66, 84, 93, 95, 114, 158, 158, 163, 170, 182, 183, 189, 201, 243, 266, 276, 294, 298, 330, 357, 376, 392, 392, 398, 423, 463, 465, 496, 545, 549, 577, 624, 652, 652, 652, 662, 683, 700, 702, 713, 714, 720, 740, 742, 799, 799, 832, 839, 841, 843, 849, 851, 879, 884, 910, 910, 931, 946, 950, 982, 989, 994], '运费': [3, 30, 110, 177, 244, 322, 527, 560, 747, 860], '好吃': [4, 25, 25, 180, 315, 315, 395, 647, 671, 872, 879], '发货': [5, 17, 18, 29, 55, 64, 102, 118, 122, 125, 133, 142, 150, 151, 187, 200, 220, 223, 223, 254, 261, 275, 310, 318, 344, 350, 364, 401, 401, 453, 461, 469, 495, 530, 542, 543, 556, 579, 588, 639, 640, 642, 651, 673, 687, 694, 710, 718, 743, 749, 759, 769, 800, 841, 845, 854, 858, 874, 878, 891, 901, 916, 931, 938, 944, 959, 978, 992], '谢谢': [6, 59, 62, 90, 140, 168, 172, 192, 197, 206, 225, 244, 246, 256, 363, 373, 404, 408, 420, 440, 499, 502, 509, 529, 567, 583, 588, 596, 598, 611, 655, 736, 789, 827, 831, 836, 884, 914, 930, 958, 972, 977], '拍': [8, 30, 100, 120, 171, 171, 187, 209, 217, 217, 254, 257, 261, 266, 266, 345, 349, 446, 448, 470, 498, 509, 532, 560, 561, 584, 591, 610, 621, 640, 649, 657, 697, 697, 707, 712, 738, 760, 769, 769, 852, 858, 873, 891, 918, 945, 983, 997], '没有': [13, 40, 40, 82, 86, 92, 123, 141, 298, 300, 315, 346, 391, 428, 453, 476, 519, 576, 579, 617, 626, 652, 731, 740, 745, 746, 752, 797, 875, 879, 918, 931, 968], '吃': [15, 27, 27, 413, 537, 842, 877, 896, 993], '几天': [16, 23, 528, 585, 612, 769, 822, 880, 978], '东西': [18, 111, 145, 158, 360, 406, 457, 582, 697, 843, 849, 931, 946, 950], '没': [18, 57, 65, 142, 150, 152, 290, 335, 356, 360, 360, 360, 365, 401, 413, 457, 506, 519, 560, 569, 573, 593, 621, 635, 665, 728, 765, 800, 844, 849, 853, 860, 921, 960, 978], '一个': [20, 36, 80, 96, 213, 248, 283, 391, 442, 463, 528, 752, 779, 818, 824, 876, 915, 915, 933, 985, 985], '邮政': [20, 57, 164, 320, 501, 585, 795], '下次': [22, 47, 276, 827, 879, 910, 55, 82, 125, 543, 543, 610, 631, 632, 
647, 929, 948, 991], '你家': [22, 417, 587, 607, 648, 677, 843, 872, 879, 985, 998], '大概': [23, 137, 238, 323, 372, 373, 927, 996], '知道': [23, 235, 267, 278, 279, 398, 404, 440, 466, 466, 501, 573, 692, 730, 865, 996], '干': [24, 49, 80, 106, 284, 472, 537, 618], '核桃': [27, 51, 162, 213, 260, 287, 338, 506, 521, 633, 692, 742, 747, 879, 993], '一起': [27, 62, 80, 96, 640, 982], '购买': [27, 44, 110, 386, 457, 527, 677], '会': [28, 78, 187, 187, 194, 266, 386, 405, 480, 581, 647, 688, 805, 834, 849, 948, 957], '不会': [28, 78, 224, 355, 383, 405, 615, 630, 739, 763, 786, 834], '今天': [29, 64, 102, 120, 187, 199, 223, 223, 261, 275, 344, 364, 484, 495, 542, 631, 639, 640, 650, 651, 673, 694, 710, 756, 783, 824, 852, 854, 858, 909, 942, 956, 969, 992, 999], '重新': [30, 171, 209, 507, 509, 697, 983], '多久': [32, 120, 137, 373, 505, 862], '100': [32, 208, 425, 500, 500, 527, 979], '货': [33, 72, 148, 519, 607, 646, 647, 684, 751, 797, 822, 912, 930, 942], '不要': [34, 36, 74, 228, 279, 802, 883, 928], '说': [35, 65, 82, 123, 144, 205, 239, 278, 312, 335, 355, 519, 543, 602, 683, 691, 692, 726, 780, 849, 857, 879], '发': [36, 80, 80, 82, 113, 173, 180, 223, 230, 286, 299, 299, 303, 312, 320, 327, 409, 485, 487, 603, 611, 639, 672, 681, 774, 796, 844, 951, 956, 965, 968, 969, 970, 973, 974, 982, 999], '包邮': [37, 96, 96, 110, 591, 595, 624, 633], '现在': [38, 91, 93, 200, 201, 254, 261, 298, 351, 448, 449, 458, 458, 459, 469, 544, 626, 760, 841, 858, 873, 878, 893, 918, 963], '钱': [38, 179, 182, 239, 264, 355, 387, 644, 832, 932, 947, 989, 993], '问': [39, 305, 498, 538, 576, 577, 640, 677, 768, 948], '活动': [39, 126, 152, 161, 365, 379, 500, 632, 677, 677, 677, 806, 943, 955, 962, 962, 991], '优惠': [40, 45, 98, 131, 157, 157, 298, 306, 311, 357, 415, 417, 458, 466, 644, 771, 786, 917, 957], '10': [40, 45, 436, 663, 684, 794, 938], '点': [40, 45, 98, 147, 157, 163, 271, 311, 423, 517, 635, 635, 685, 702, 771, 786, 888], '收到': [46, 105, 153, 335, 375, 506, 612, 822, 929, 930, 942], '送': [47, 51, 
51, 136, 138, 158, 171, 183, 207, 249, 266, 281, 297, 348, 362, 398, 406, 425, 435, 436, 436, 465, 477, 636, 728, 738, 747, 779, 791, 839, 849, 849, 851, 857, 871, 940, 948, 995, 997], '一点': [47, 66, 83, 145, 227, 833, 917], '一下': [47, 100, 118, 244, 270, 322, 361, 394, 483, 509, 536, 538, 561, 582, 604, 647, 677, 683, 805, 832, 948, 953, 983], '订单': [48, 75, 101, 223, 279, 488, 747], '看到': [48, 152, 302, 360, 496, 500, 576, 635], '榴莲': [49, 80, 472, 537, 769, 877], '快递': [49, 56, 63, 86, 108, 113, 124, 148, 164, 173, 180, 195, 210, 223, 232, 232, 239, 265, 278, 286, 291, 299, 312, 335, 335, 367, 400, 409, 429, 482, 487, 507, 528, 540, 563, 603, 608, 616, 626, 639, 667, 672, 681, 753, 757, 759, 766, 768, 781, 783, 785, 796, 908, 909, 931, 938, 964, 965, 974, 985], '明天': [55, 82, 125, 543, 543, 610, 631, 632, 647, 929, 948, 991], '想': [57, 66, 114, 466, 677, 879, 948, 977], '行': [60, 170, 263, 311, 705, 711], '付款': [64, 102, 140, 211, 283, 345, 526, 683], '请': [64, 140, 148, 256, 461, 588], '算了': [82, 361, 391, 698, 713, 988], '已经': [82, 139, 211, 239, 612, 753, 783, 844], '不能': [82, 91, 136, 153, 223, 223, 223, 227, 351, 415, 528, 540, 711, 719, 826, 854], '催': [82, 155, 367, 367, 718, 766], '纯棉': [84, 101, 237, 262, 652, 652, 813, 888], '之前': [86, 227, 419, 463, 702, 718, 846], '你好': [94, 110, 166, 208, 213, 354, 389, 600, 638, 664, 674, 676, 742, 812, 977, 998], '麻烦': [97, 164, 177, 184, 253, 289, 412, 588, 604, 829, 977], '给我发': [101, 507, 524, 698, 918, 948, 36, 80, 80, 82, 113, 173, 180, 223, 230, 286, 299, 299, 303, 312, 320, 327, 409, 485, 487, 603, 611, 639, 672, 681, 774, 796, 844, 951, 956, 965, 968, 969, 970, 973, 974, 982, 999], '一定': [104, 162, 386, 388, 467, 543, 543, 569, 802], '两个': [109, 298, 342, 462, 695, 985], '元': [110, 171, 348, 396, 446, 452, 463, 474, 551, 663, 684, 821, 859, 860, 860, 910, 975, 993], '这款': [114, 284, 334, 393, 417, 816], '号': [117, 783, 805, 870, 918, 938, 938, 950], '亲': [118, 142, 271, 289, 309, 365, 418, 440, 445, 463, 
470, 569, 587, 677, 715, 761, 858, 898, 930, 955, 958, 964], '备注': [118, 290, 290, 291, 490, 679, 873], '尽快': [118, 461, 588, 611, 845, 916], '帮': [123, 244, 278, 291, 410, 509, 560, 569, 579, 604, 747, 747, 757, 802, 850, 977], '生产日期': [132, 313, 471, 506, 706, 790], '一包': [141, 141, 260, 307, 446, 531, 573, 573, 573, 932], '包装': [144, 227, 271, 360, 360, 442, 442, 706, 987], '退款': [153, 194, 243, 351, 392, 401, 612, 642, 697, 718], '是不是': [159, 342, 573, 576, 579, 589, 646, 692, 788, 941, 983], '退': [163, 171, 171, 239, 270, 473, 786, 818, 838, 860, 931, 932], '应该': [175, 290, 607, 635, 645, 996], '便宜': [179, 278, 423, 452, 528, 702, 702, 744, 833], '改': [192, 202, 244, 277, 322, 336, 337, 341, 399, 455, 543, 569, 760, 764, 764, 983], '价格': [192, 202, 295, 311, 330, 336, 430, 522, 719, 760, 837], '达': [195, 429, 614, 622, 669, 757, 814], '里面': [198, 360, 448, 742, 937, 952, 987], '前': [208, 362, 362, 425, 477, 500, 500], '太': [210, 370, 528, 770, 828, 866], '袋': [212, 548, 577, 702, 713, 769, 821], '两份': [217, 266, 266, 738, 940, 995], '请问': [242, 272, 286, 354, 379, 396, 456, 603, 620, 796, 952], '划算': [242, 460, 480, 500, 589, 663, 716], '区别': [242, 407, 486, 548, 558, 620, 695, 734, 815, 816, 902, 998], '看看': [248, 536, 545, 571, 785, 802, 872, 909], '湿巾': [266, 362, 500, 606, 897, 994], '地址': [270, 366, 385, 392, 399, 445, 515, 543, 735, 757, 774, 823, 954], '两包': [281, 327, 531, 641, 663, 738, 995], '款': [392, 476, 480, 678, 740, 994], '先': [460, 483, 536, 697, 711, 785, 890], '一直': [560, 582, 677, 754, 856, 860, 950, 994], '包': [677, 685, 693, 716, 839, 839, 847, 940]}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"# TODO:\n",
"# Build a new inverted index and store it in the dict inverted_list_new\n",
"# Each key is a word; its value is the union of the old index entry for word and the old index entries of its most similar words\n",
"# That is, the new index stores the indices of questions that contain word or any of the 5 words most similar to word\n",
"inverted_list_new = {}\n",
"for word in tqdm(inverted_list.keys()):\n",
"    ### your code starts here\n",
"    merged = list(inverted_list[word])  # keep the indices of documents containing the word itself\n",
"    try:\n",
"        similar_words = get_similar_by_word(word,5)  # the 5 words most similar to the current word\n",
"    except KeyError:  # the word is not in the gensim vocabulary\n",
"        similar_words = []\n",
"    for similar in similar_words:\n",
"        if similar in inverted_list:\n",
"            merged = merged + inverted_list[similar]\n",
"    inverted_list_new[word] = merged\n",
"    ### your code ends here\n",
"print(inverted_list_new)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# Save the new inverted index to the file data/retrieve/invertedList.pkl\n",
"with open('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/retrieve/invertedList.pkl','wb') as f:\n",
"    pickle.dump(inverted_list_new,f)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following is a test. After completing the steps above, you can run the code below to check correctness."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# This cell is pasted from preprocessor.ipynb and contains the key data-preprocessing functions\n",
"import emoji\n",
"import re\n",
"import jieba\n",
"def clean(content):\n",
"    content = emoji.demojize(content)  # convert emoji to text\n",
"    content = re.sub('<.*>','',content)  # remove HTML tags\n",
"    return content\n",
"# This function tokenizes a sentence. In preprocessor.ipynb the data was already tokenized, so that step was skipped there, but for a new query it is required\n",
"def question_cut(content):\n",
"    return list(jieba.cut(content))\n",
"def strip(wordList):\n",
"    return [word.strip() for word in wordList if word.strip()!='']\n",
"with open(\"C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/stopWord.json\",\"r\",encoding=\"utf-8\") as f:\n",
"    stopWords = f.read().split(\"\\n\")\n",
"def rm_stop_word(wordList):\n",
"    return [word for word in wordList if word not in stopWords]"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"# Load the inverted index from data/retrieve/invertedList.pkl into the variable invertedList\n",
"with open('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/retrieve/invertedList.pkl','rb') as f:\n",
"    invertedList = pickle.load(f)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"def get_retrieve_result(sentence):\n",
"    '''\n",
"    Given a sentence, use the inverted index for fast retrieval and return the indices of candidate questions that are close to it\n",
"    The candidates are the questions that contain any word of the sentence, or a word similar in meaning to any word of the sentence\n",
"    '''\n",
"    sentence = clean(sentence)\n",
"    sentence = question_cut(sentence)\n",
"    sentence = strip(sentence)\n",
"    sentence = rm_stop_word(sentence)\n",
"    candidate = set()\n",
"    for word in sentence:\n",
"        if word in invertedList:\n",
"            candidate = candidate | set(invertedList[word])\n",
"    return candidate"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Building prefix dict from the default dictionary ...\n",
"Loading model from cache C:\\Users\\CUISHU~1\\AppData\\Local\\Temp\\jieba.cache\n",
"Loading model cost 1.175 seconds.\n",
"Prefix dict has been built successfully.\n"
]
},
{
"data": {
"text/plain": [
"{5,\n",
" 17,\n",
" 18,\n",
" 29,\n",
" 55,\n",
" 64,\n",
" 102,\n",
" 118,\n",
" 122,\n",
" 125,\n",
" 133,\n",
" 142,\n",
" 150,\n",
" 151,\n",
" 187,\n",
" 200,\n",
" 220,\n",
" 223,\n",
" 254,\n",
" 261,\n",
" 275,\n",
" 310,\n",
" 318,\n",
" 344,\n",
" 350,\n",
" 364,\n",
" 401,\n",
" 453,\n",
" 461,\n",
" 469,\n",
" 495,\n",
" 530,\n",
" 542,\n",
" 543,\n",
" 556,\n",
" 579,\n",
" 588,\n",
" 639,\n",
" 640,\n",
" 642,\n",
" 651,\n",
" 673,\n",
" 687,\n",
" 694,\n",
" 710,\n",
" 718,\n",
" 743,\n",
" 749,\n",
" 759,\n",
" 769,\n",
" 800,\n",
" 841,\n",
" 845,\n",
" 854,\n",
" 858,\n",
" 874,\n",
" 878,\n",
" 891,\n",
" 901,\n",
" 916,\n",
" 931,\n",
" 938,\n",
" 944,\n",
" 959,\n",
" 978,\n",
" 992}"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_retrieve_result('什么时候发货')  # query: \"when will it ship\"; returns document IDs via the inverted index"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
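
The notebook's two steps (a plain inverted index, then expansion with similar words) can be tried on toy data without the pickle file or the word-vector model. The following is a minimal sketch under stated assumptions, not the notebook's own code: `build_inverted_index`, `expand_index`, `retrieve`, the toy sentences, and the hand-written `similar_words` mapping are all hypothetical. The mapping stands in for `get_similar_by_word`, and postings are stored as sets (rather than the notebook's lists) so that a word repeated within one sentence does not produce duplicate indices.

```python
from collections import defaultdict


def build_inverted_index(sentences):
    """Map each word to the set of sentence indices that contain it."""
    index = defaultdict(set)
    for i, sentence in enumerate(sentences):
        for word in sentence:
            index[word].add(i)
    return dict(index)


def expand_index(index, similar_words):
    """Union each word's postings with the postings of its similar words.

    similar_words is a hypothetical dict (word -> list of similar words)
    that stands in for a word-vector lookup.
    """
    expanded = {}
    for word, postings in index.items():
        merged = set(postings)  # always keep the word's own postings
        for other in similar_words.get(word, []):
            merged |= index.get(other, set())  # skip words absent from the index
        expanded[word] = merged
    return expanded


def retrieve(query_words, index):
    """Return indices of sentences that share at least one indexed word with the query."""
    candidates = set()
    for word in query_words:
        candidates |= index.get(word, set())
    return candidates


# Toy, already-tokenized questions (made up for illustration).
sentences = [
    ["when", "ship"],
    ["ship", "today"],
    ["refund", "tomorrow"],
    ["ship", "ship", "fast"],
]
similar_words = {"today": ["tomorrow"], "tomorrow": ["today"]}

index = build_inverted_index(sentences)
expanded = expand_index(index, similar_words)
print(sorted(retrieve(["ship"], expanded)))   # [0, 1, 3]
print(sorted(retrieve(["today"], expanded)))  # [1, 2]
```

Using sets keeps the union step simple and makes the final candidate set independent of how often a word repeats; the trade-off is that term-frequency information is discarded, which this retrieval step does not use anyway.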