Commit 2270968c by 20200519088

Upload New File

parent a8c31cbf
{
"cells": [
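{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Building a Knowledge Base with Information Extraction\n",
"\n",
"In this notebook, some template code is already provided, but you still need to implement more functionality to complete the project. Unless explicitly asked to, you do not need to modify any of the provided code. Headings that begin with **'Exercise'** indicate that the code that follows contains functionality you need to implement. These parts come with detailed guidance, and the parts to implement are also marked with 'TODO' in the comments. Please read all the hints carefully.\n",
"\n",
">**Hint:** Code and Markdown cells can be run with the **Shift + Enter** shortcut. Markdown cells can also be put into edit mode by double-clicking.\n",
"\n",
"---\n",
"\n",
"### Let's get started\n",
"\n",
"The goal of this project is to build a small knowledge graph from data scraped from open web corpora by combining named entity recognition, dependency parsing, entity disambiguation, and entity unification.\n",
"\n",
"In the real world, you need to piece together a series of models to handle different tasks; for example, an algorithm that predicts dog breeds differs from one that predicts humans. While working on the project you may run into quite a few failed predictions, because no algorithm or model is perfect. The imperfect solution you finally submit will still give you an interesting learning experience!\n",
"\n",
"\n",
"---\n"
]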
},
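{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Entity Unification"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Entity unification handles the case where one entity has several names: the different names are mapped to a single entity and recorded in that entity's attributes (for example, an “alias” attribute can be created for the entity).\n",
"\n",
"For example, “河北银行股份有限公司”, “河北银行公司”, and “河北银行” can all be regarded as one entity. By extracting the main part of the first two names we obtain the key entity name “河北银行”.\n",
"\n",
"Company names have their own characteristics: for example, the suffix can be omitted, and the place name of a listed company can be omitted. Several dictionaries are provided in the data/dict directory for use in entity unification.\n",
"- company_suffix.txt is a dictionary of common company suffixes\n",
"- company_business_scope.txt is a dictionary of common business-scope terms\n",
"- co_Province_Dim.txt is a dictionary of provinces\n",
"- co_City_Dim.txt is a dictionary of cities\n",
"- stopwords.txt is a reference list of stop words\n",
"\n",
"### Exercise 1:\n",
"Write the main_extract function, which extracts the “main name” from an entity name."
]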
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import jieba\n",
"import jieba.posseg as pseg\n",
"import re\n",
"import datetime\n",
"\n",
"\n",
"# Extract the main part of an input company name\n",
"def main_extract(input_str, stop_word, d_4_delete, d_city_province):\n",
"    # Segment the name, drop stop words and suffixes, then move place names to the front\n",
"    seg = pseg.cut(input_str)\n",
"    segment_lst = remove_word(seg, stop_word, d_4_delete)\n",
"    seg_lst = city_prov_ahead(segment_lst, d_city_province)\n",
"    return seg_lst\n",
"\n",
"\n",
"# TODO: move the place names (city/province) in a company name to the front\n",
"def city_prov_ahead(words, d_city_province):\n",
"    city_prov_lst = [word for word in words if word in d_city_province]\n",
"    seg_list = [word for word in words if word not in d_city_province]\n",
"    return city_prov_lst + seg_list\n",
"\n",
"\n",
"# TODO: remove stop words and company-suffix words\n",
"def remove_word(seg, stop_word, d_4_delete):\n",
"    # pseg.cut yields (word, flag) pairs; only the word is needed here\n",
"    words = [word for word, flag in seg]\n",
"    return [word for word in words if word not in stop_word and word not in d_4_delete]\n",
"\n",
"\n",
"# Read a UTF-8 dictionary file and strip the line endings.\n",
"# The with-statement closes the file automatically, so no explicit close() is needed.\n",
"def read_lines(path):\n",
"    with open(path, encoding='utf-8') as fr:\n",
"        return [re.sub(r'(\\r|\\n)*', '', line) for line in fr]\n",
"\n",
"\n",
"# Initialization: load the dictionaries\n",
"def my_initial():\n",
"    # City names followed by province names\n",
"    d_city_province = read_lines(r'../data/dict/co_City_Dim.txt')\n",
"    d_city_province.extend(read_lines(r'../data/dict/co_Province_Dim.txt'))\n",
"    # Company suffixes to delete (company_business_scope.txt is also provided but is not applied here)\n",
"    d_4_delete = read_lines(r'../data/dict/company_suffix.txt')\n",
"    # Stop words; every line is stripped the same way, including the last one\n",
"    stop_word = read_lines(r'../data/dict/stopwords.txt')\n",
"    return d_4_delete, stop_word, d_city_province"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Building prefix dict from the default dictionary ...\n",
"Loading model from cache C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\jieba.cache\n",
"Loading model cost 0.948 seconds.\n",
"Prefix dict has been built successfully.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"河北银行\n",
"['河北', '银行']\n"
]
}
],
"source": [
"# TODO: test case for entity unification\n",
"d_4_delete, stop_word, d_city_province = my_initial()\n",
"company_name = \"河北银行股份有限公司\"\n",
"lst = main_extract(company_name, stop_word, d_4_delete, d_city_province)\n",
"company_name = ''.join(lst)  # Extract the main part of the name so that companies sharing it are unified into one entity\n",
"print(company_name)\n",
"print(lst)"
]
},
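{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Entity Recognition\n",
"Many open-source tools can help us recognize entities. Common ones include LTP, StanfordNLP, and FoolNLTK.\n",
"\n",
"This project uses FoolNLTK for entity recognition. fool is an open-source deep-learning NLP toolkit built on the BiLSTM+CRF algorithm. It provides word segmentation, entity recognition, and other functions, and it gives a good sense of the strengths and weaknesses of deep learning on this task.\n",
"\n",
"‘data/info_extract/train_data.csv’ and ‘data/info_extract/test_data.csv’ contain listed-company announcements crawled from the web. Sample rows are shown below:"
]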
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>sentence</th>\n",
" <th>tag</th>\n",
" <th>member1</th>\n",
" <th>member2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>6461</td>\n",
" <td>与本公司关系:受同一公司控制 2,杭州富生电器有限公司企业类型: 有限公司注册地址: 富阳市...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2111</td>\n",
" <td>三、关联交易标的基本情况 1、交易标的基本情况 公司名称:红豆集团财务有限公司 公司地址:无...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>9603</td>\n",
" <td>2016年协鑫集成科技股份有限公司向瑞峰(张家港)光伏科技有限公司支付设备款人民币4,515...</td>\n",
" <td>1</td>\n",
" <td>协鑫集成科技股份有限公司</td>\n",
" <td>瑞峰(张家港)光伏科技有限公司</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3456</td>\n",
" <td>证券代码:600777 证券简称:新潮实业 公告编号:2015-091 烟台新潮实业股份有限...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>8844</td>\n",
" <td>本集团及广发证券股份有限公司持有辽宁成大股份有限公司股票的本期变动系买卖一揽子沪深300指数...</td>\n",
" <td>1</td>\n",
" <td>广发证券股份有限公司</td>\n",
" <td>辽宁成大股份有限公司</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id sentence tag member1 \\\n",
"0 6461 与本公司关系:受同一公司控制 2,杭州富生电器有限公司企业类型: 有限公司注册地址: 富阳市... 0 0 \n",
"1 2111 三、关联交易标的基本情况 1、交易标的基本情况 公司名称:红豆集团财务有限公司 公司地址:无... 0 0 \n",
"2 9603 2016年协鑫集成科技股份有限公司向瑞峰(张家港)光伏科技有限公司支付设备款人民币4,515... 1 协鑫集成科技股份有限公司 \n",
"3 3456 证券代码:600777 证券简称:新潮实业 公告编号:2015-091 烟台新潮实业股份有限... 0 0 \n",
"4 8844 本集团及广发证券股份有限公司持有辽宁成大股份有限公司股票的本期变动系买卖一揽子沪深300指数... 1 广发证券股份有限公司 \n",
"\n",
" member2 \n",
"0 0 \n",
"1 0 \n",
"2 瑞峰(张家港)光伏科技有限公司 \n",
"3 0 \n",
"4 辽宁成大股份有限公司 "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"train_data = pd.read_csv('../data/info_extract/train_data.csv', encoding = 'gb2312', header=0)\n",
"train_data.head()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>sentence</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>9259</td>\n",
" <td>2015年1月26日,多氟多化工股份有限公司与李云峰先生签署了《附条件生效的股份认购合同》</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>9136</td>\n",
" <td>2、2016年2月5日,深圳市新纶科技股份有限公司与侯毅先</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>220</td>\n",
" <td>2015年10月26日,山东华鹏玻璃股份有限公司与张德华先生签署了附条件生效条件的《股份认购合同》</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>9041</td>\n",
" <td>2、2015年12月31日,印纪娱乐传媒股份有限公司与肖文革签订了《印纪娱乐传媒股份有限公司...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>10041</td>\n",
" <td>一、金发科技拟与熊海涛女士签订《股份转让协议》,协议约定:以每股1.0509元的收购价格,收...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id sentence\n",
"0 9259 2015年1月26日,多氟多化工股份有限公司与李云峰先生签署了《附条件生效的股份认购合同》\n",
"1 9136 2、2016年2月5日,深圳市新纶科技股份有限公司与侯毅先\n",
"2 220 2015年10月26日,山东华鹏玻璃股份有限公司与张德华先生签署了附条件生效条件的《股份认购合同》\n",
"3 9041 2、2015年12月31日,印纪娱乐传媒股份有限公司与肖文革签订了《印纪娱乐传媒股份有限公司...\n",
"4 10041 一、金发科技拟与熊海涛女士签订《股份转让协议》,协议约定:以每股1.0509元的收购价格,收..."
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_data = pd.read_csv('../data/info_extract/test_data.csv', encoding = 'gb2312', header=0)\n",
"test_data.head()"
]
},
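{
"cell_type": "markdown",
"metadata": {},
"source": [
"We selected part of the samples for annotation; this is train_data, which has 5 columns. The id column is the original sample index. The sentence column is an excerpt of key information. The tag column is 1 if the excerpt contains an equity-transaction relation between two entities and 0 otherwise. If tag is 1, the member1 and member2 columns record the names of the two entities as they appear in sentence.\n",
"\n",
"The remaining samples are unlabeled; this is test_data, which has only the id and sentence columns. You are expected to train a model that recognizes the entities in test_data and decides whether each entity pair has an equity-transaction relation.\n",
"\n",
"### Exercise 2:\n",
"Recognize the entities in each sentence, store them in the entity dictionary, and replace them in the sentence with special symbols.\n"
]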
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" ner\n",
"0 2015年1月26日, ner_1002_ 司与 ner_1001_ 峰先生签署了《附条件生...\n",
"1 2、2016年2月5日, ner_1004_ 司与 ner_1003_ 毅先\n",
"2 2015年10月26日, ner_1006_ 司与 ner_1005_ 华先生签署了附条件生...\n",
"3 2、2015年12月31日, ner_1008_ 司与 ner_1007_ 革签订了《印纪娱...\n",
"4 一、 ner_1010_ 技拟与 ner_1009_ 涛女士签订《股份转让协议》,协议约定:...\n",
"5 同日, ner_1012_ 司与 ner_1011_ 琳先生在上海签署了《股权转让协议》和《...\n",
"6 截至本公告日, ner_1014_ 林持有 ner_1013_ 司股份354,418,300...\n",
"7 本次交易完成后,上市公司将由传统的医药制造业转变成为“企业互联网服务业务为主导,制造业务为支...\n",
"8 2015年1月20日, ner_1018_ 司与 ner_1017_ 飞先生签署《万鸿集团股...\n",
"9 一、(一) ner_1020_ 司(以下简称“公司”)拟以非公开发行方式向 ner_1019...\n",
"10 2、 ner_1022_ 司于2014年12月12日与 ner_1021_ 睿先生签订了《成...\n",
"11 于2015年4月24日, ner_1024_ 司与鲜言先生签订了股权转让协议,鲜言先生按注册...\n",
"12 ner_1026_ 司于2015年11月26日与 ner_1025_ 华先\n",
"13 2015年6月17日, ner_1028_ 峰先生、 ner_1027_ 份高管资管计划分别...\n",
"14 一、 ner_1031_ 司拟向包括 ner_1030_ 峰先生在内的不超过10名特定对象非...\n",
"15 截止2016年12月31日, ner_1033_ 司已向 ner_1032_ 飞归还上述欠款\n",
"16 若公司A股股票在定价基准日至发行日期间发生派发股利、送红股、转增股本、增发新股或配股等除权、...\n",
"17 2015年6月28日, ner_1037_ 司在福建省厦门市与 ner_1036_ 斌先生签...\n",
"18 ner_1039_ 司控股股东、实际控制人 ner_1038_ 林与非关联方分别以自有资金...\n",
"19 2015年4月2日, ner_1041_ 司与 ner_1040_ 峰签署了附条件生效的《关...\n",
"20 一、(一) ner_1044_ 司(以下简称“公司”)拟与公司控股股东和实际控制人 ner_...\n",
"21 在本次非公开发行事项中, ner_1046_ 司总计向5名特定对象非公开发行股票,发行对象包...\n",
"22 一、本次交易方案概述本次交易中, ner_1049_ 份拟以发行股份及支付现金的方式购买 n...\n",
"23 2、 ner_1056_ 司(以下简称“公司”或“ ner_1055_ 技”)计划受让 ne...\n",
"24 ner_1064_ 司本次总计向五名特定对象非公开发行股票,发行对象为 ner_1063_...\n",
"25 2015年6月15日, ner_1066_ 流先生和 ner_1065_ 资分别与公司签署附...\n",
"26 ner_1068_ 司控股股东及实际控制人 ner_1067_ 华先生拟认购不低于本次非公...\n",
"27 一、(一)本次关联交易基本情况 ner_1070_ 良先生为 ner_1069_ 司控股股东...\n",
"28 一、1、 ner_1072_ 司控股股东及实际控制人 ner_1071_ 海以不低于人民币3...\n",
"29 一、(一)关联交易基本情况 ner_1074_ 司(以下简称“公司”)拟向 ner_1073...\n",
".. ...\n",
"389 股票代码:600400 股票简称: ner_1794_ 份 编号:临 2015-054 n...\n",
"390 经 ner_1796_ 司评估,并经 ner_1795_ 司备案,该煤灰池的账面价值为886...\n",
"391 10、2015 年 8 月 24 日,公司使用 10,000 万元人民币购买 ner_179...\n",
"392 3、<附生效条件的股份认购合同>的主要内容2015 年 11 月 18 日,公司与本次非公开...\n",
"393 1因 ner_1802_ 董事系北京缘竟尽⑸虾缘竟尽⒈本缘揽萍脊 ner_1801_ 司的法...\n",
"394 二、 ner_1804_ 厂项目2014年12月底, ner_1803_ 厂转为公用电厂项目...\n",
"395 公司独立董事发表了独立意见,认为: ner_1806_ 司与青海省电力设计院共同组建项目公司...\n",
"396 证券代码:600533 证券简称:栖霞建设 编号:临 2015-059 ner_1808_...\n",
"397 3、本公司与 ner_1810_ 化、 ner_1809_ 君于 2015 年 8 月 25...\n",
"398 重要内容提示●被担保人:控股子公司---- ner_1812_ 司●本次担保金额及已实际为其...\n",
"399 12 / 14 七、历史关联交易(日常关联交易除外)情况1、本次交易前 12 个月内,公司与...\n",
"400 ner_1816_ 司(“公司”)全资拥有的 ner_1815_ 司2号350兆瓦燃煤发电...\n",
"401 二、审议通过了<关于控股子公司 ner_1818_ 司与 ner_1817_ 司签订供热合作...\n",
"402 转让方: ner_1820_ 司 企业类型:有限责任公司(自然人投资或控股) 企业住所:郑州...\n",
"403 证券代码:600108 证券简称: ner_1822_ 团 公告编号:2015-068 n...\n",
"404 会议审议并通过了以下议案,形成如下决议:审议并通过了< ner_1824_ 司关于向 ner...\n",
"405 2014年,公司收购 ner_1826_ 司100%股权, ner_1825_ 司于2014...\n",
"406 申请材料显示,报告期内 ner_1828_ 流、 ner_1827_ 务董事、高级管理人员存...\n",
"407 公司使用暂时闲置募集资金 6,000 万元购买了 ner_1830_ 司 ner_1829_...\n",
"408 二、交易各方当事人情况介绍(一)北京工投基本情况公司名称: ner_1832_ 司住所:北京...\n",
"409 (依法需经批准的项目经相关部门批准后方可开展经营活动)经 ner_1834_ 所审计,201...\n",
"410 ner_1836_ 司吸收合并 ner_1835_ 司及发行股份购买资产并募集配套资金 暨...\n",
"411 回复:(一) 核查情况:1、请你公司结合行业情况、未来经营战略、盈利模式等补充披露客户集中 ...\n",
"412 三、 受让标的基本情况企业名称: ner_1840_ 司 企业类型:有限责任公司(法人独资)...\n",
"413 本次股权转让前,公司持有 ner_1842_ 险 18%的股权;本次股权转让完成后,公 司将...\n",
"414 近日,该子公司已完成工商注册登记手续,并领取了南京市工商行政管理局颁发的<企业法人营业执照>...\n",
"415 (二)本次交易构成关联交易正元投资拟认购金额不低于 13 亿元且不低于本次配套融资总额的 2...\n",
"416 证券代码:600225 证券简称: ner_1848_ 公告编号:临 2015-118 ...\n",
"417 2015年3月31日, ner_1850_ 司与 ner_1849_ 司签署了附条件生效的《...\n",
"418 三、本公司不会利用对 ner_1852_ 份控制关系损害 ner_1851_ 份及其他股东 ...\n",
"\n",
"[419 rows x 1 columns]\n"
]
}
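],
"source": [
"# Process the test data: recognize entities with an open-source tool and store them via the entity unification function\n",
"\n",
"import fool\n",
"import pandas as pd\n",
"from copy import copy\n",
"\n",
"\n",
"test_data = pd.read_csv('../data/info_extract/test_data.csv', encoding = 'gb2312', header=0)\n",
"test_data['ner'] = None\n",
"ner_id = 1001\n",
"ner_dict_new = {}  # maps unified entity name -> entity id\n",
"ner_dict_reverse_new = {}  # maps entity id -> unified entity name\n",
"\n",
"for i in range(len(test_data)):\n",
"    sentence = copy(test_data.iloc[i, 1])  # the sentence in row i (second column)\n",
"    # TODO: call fool for entity recognition to get words and ners\n",
"    words, ners = fool.analysis(sentence)\n",
"\n",
"    # Replace from the end of the sentence backwards so that earlier offsets stay valid\n",
"    ners[0].sort(key=lambda x: x[0], reverse=True)\n",
"    for start, end, ner_type, ner_name in ners[0]:\n",
"        if ner_type == 'company' or ner_type == 'person':\n",
"            # TODO: call the entity unification function and store the unified entity\n",
"            if ner_type == 'company':\n",
"                company_main_name = ''.join(main_extract(ner_name, stop_word, d_4_delete, d_city_province)) or ner_name\n",
"            else:\n",
"                company_main_name = ner_name\n",
"            # Assign a new id (and increment ner_id) only for names not seen before\n",
"            if company_main_name not in ner_dict_new:\n",
"                ner_dict_new[company_main_name] = ner_id\n",
"                ner_dict_reverse_new[ner_id] = company_main_name\n",
"                ner_id += 1\n",
"            # Replace the entity name with its code. Exactly len(ner_name) characters are cut,\n",
"            # because slicing at end-1 left the last character of each name behind in the sentence\n",
"            sentence = sentence[:start] + ' ner_' + str(ner_dict_new[company_main_name]) + '_ ' + sentence[start + len(ner_name):]\n",
"    test_data.iloc[i, -1] = sentence\n",
"\n",
"X_test = test_data[['ner']]\n",
"print(X_test)\n"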
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[('我', 'r'), ('想', 'v'), ('去', 'v'), ('北京', 'ns'), ('腾讯', 'nz'), ('公司', 'n'), ('学习', 'v'), ('自然', 'n'), ('语言', 'n'), ('处理', 'v'), ('杭州', 'ns'), ('阿里巴巴', 'nz'), ('!', 'wt')]]\n",
"[[(3, 9, 'company', '北京腾讯公司'), (17, 23, 'company', '杭州阿里巴巴')]]\n"
]
}
],
"source": [
"import fool\n",
"sentence = \"我想去北京腾讯公司学习自然语言处理杭州阿里巴巴!\"\n",
"words, ners = fool.analysis(sentence)\n",
"print(words)\n",
"print(ners)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" ner\n",
"0 与本公司关系:受同一公司控制 2, ner_1854_ 司企业类型: 有限公司注册地址: 富...\n",
"1 三、关联交易标的基本情况 1、交易标的基本情况 公司名称: ner_1856_ 司地址:无锡...\n",
"2 2016年 ner_1857_ 向 ner_1858_ 支付设备款人民币4,515,770.00元\n",
"3 证券代码:600777 证券简称: ner_1860_ 公告编号:2015-091 ne...\n",
"4 本集团及 ner_1861_ 持有 ner_1862_ 股票的本期变动系买卖一揽子沪深300...\n",
"5 二、被担保人基本情况被担保人: ner_1864_ 司 住所:上海市杨浦区中山北二路 112...\n",
"6 2014年12月12日,绵阳市国资委出具了<关于同意转让 ner_1866_ 司股权有关事项...\n",
"7 ner_1868_ 司 2015 年第十二次临时董事会会议 证券代码:600089 证券简...\n",
"8 变更 完成后,本公司持有 ner_1870_ 100%的股权, ner_1869_ 漫已成...\n",
"9 根据上市公司与 ner_1872_ 行达成的并购贷款意向函, ner_1871_ 行愿意提供...\n",
"10 特此公告 ner_1874_ 司董事会 2015 年 11 月 8 日备查文件< ner_1...\n",
"11 证券代码:600348 股票简称: ner_1876_ 业 编号:临 2015-056 ne...\n",
"12 附件: ner_1878_ 司董事会 2015 年 11 月 17 日1、 ner_1877...\n",
"13 根据以上评估结果,经交易各方协商确定本次 ner_1880_ 险 26.96%股权交易价 格...\n",
"14 资产出售和资产购买方案简要介绍如 下:(一)资产出售公司拟向控股股东 ner_1882_ 业...\n",
"15 根据 ner_1884_ 团实际投资额,在2012年6月11日和2013年8月12日,内蒙古...\n",
"16 五、本公司保证将依照 ner_1886_ 份章程参加股东大会,平等地行使相 应权利,承担相应...\n",
"17 2015 年 6 月 25 日,公司以人民币 8,000 万元闲置募集资金购买了 ner_1...\n",
"18 会议应参与表决董事9人,实际 表决9人,独立董事 ner_1890_ 渠先生因公出差委托独立...\n",
"19 二、关联方介绍(一)关联方关系介绍根据<上海证券交易所股票上市规则>和 ner_1892_ ...\n",
"20 第五届董事会第十九次会议决议审议通过了《关于退出 ner_1893_ 暨关联交易的议案》, ...\n",
"21 公司将与 ner_1896_ 司和 ner_1895_ 行及时分析和跟踪理财产品的进展情况,...\n",
"22 经评估,截至 2015 年 5 月 31 日, ner_1898_ 险32经审计的归属母公司...\n",
"23 本年度上海 ner_1899_ 收购天津 ner_1899_ 焊接材料有限公司将其所持天津 ...\n",
"24 变更后, ner_1902_ 技持有 ner_1901_ 带 100%股权。\n",
"25 2、2014年11月7日公司与 ner_1904_ 司签订 ner_1903_ 行对公结构 ...\n",
"26 13(本页无正文,为<延边 ner_1906_ 司重大资产出售和重大资产 购买暨关联交易实施...\n",
"27 经合肥市工商行政管理局核准,公司全资子公司“ ner_1908_ 司”更名为“ ner_19...\n",
"28 四、对外投资合作合同的主要内容甲方: ner_1910_ 心(有限合伙)乙方: ner_19...\n",
"29 股票代码:601877 股票简称: ner_1912_ 器 编号:临 2015-068 债券...\n",
".. ...\n",
"389 股票代码:600400 股票简称: ner_1794_ 份 编号:临 2015-054 n...\n",
"390 经 ner_1796_ 司评估,并经 ner_1795_ 司备案,该煤灰池的账面价值为886...\n",
"391 10、2015 年 8 月 24 日,公司使用 10,000 万元人民币购买 ner_179...\n",
"392 3、<附生效条件的股份认购合同>的主要内容2015 年 11 月 18 日,公司与本次非公开...\n",
"393 1因 ner_1802_ 董事系北京缘竟尽⑸虾缘竟尽⒈本缘揽萍脊 ner_1801_ 司的法...\n",
"394 二、 ner_1804_ 厂项目2014年12月底, ner_1803_ 厂转为公用电厂项目...\n",
"395 公司独立董事发表了独立意见,认为: ner_1806_ 司与青海省电力设计院共同组建项目公司...\n",
"396 证券代码:600533 证券简称:栖霞建设 编号:临 2015-059 ner_1808_...\n",
"397 3、本公司与 ner_1810_ 化、 ner_1809_ 君于 2015 年 8 月 25...\n",
"398 重要内容提示●被担保人:控股子公司---- ner_1812_ 司●本次担保金额及已实际为其...\n",
"399 12 / 14 七、历史关联交易(日常关联交易除外)情况1、本次交易前 12 个月内,公司与...\n",
"400 ner_1816_ 司(“公司”)全资拥有的 ner_1815_ 司2号350兆瓦燃煤发电...\n",
"401 二、审议通过了<关于控股子公司 ner_1818_ 司与 ner_1817_ 司签订供热合作...\n",
"402 转让方: ner_1820_ 司 企业类型:有限责任公司(自然人投资或控股) 企业住所:郑州...\n",
"403 证券代码:600108 证券简称: ner_1822_ 团 公告编号:2015-068 n...\n",
"404 会议审议并通过了以下议案,形成如下决议:审议并通过了< ner_1824_ 司关于向 ner...\n",
"405 2014年,公司收购 ner_1826_ 司100%股权, ner_1825_ 司于2014...\n",
"406 申请材料显示,报告期内 ner_1828_ 流、 ner_1827_ 务董事、高级管理人员存...\n",
"407 公司使用暂时闲置募集资金 6,000 万元购买了 ner_1830_ 司 ner_1829_...\n",
"408 二、交易各方当事人情况介绍(一)北京工投基本情况公司名称: ner_1832_ 司住所:北京...\n",
"409 (依法需经批准的项目经相关部门批准后方可开展经营活动)经 ner_1834_ 所审计,201...\n",
"410 ner_1836_ 司吸收合并 ner_1835_ 司及发行股份购买资产并募集配套资金 暨...\n",
"411 回复:(一) 核查情况:1、请你公司结合行业情况、未来经营战略、盈利模式等补充披露客户集中 ...\n",
"412 三、 受让标的基本情况企业名称: ner_1840_ 司 企业类型:有限责任公司(法人独资)...\n",
"413 本次股权转让前,公司持有 ner_1842_ 险 18%的股权;本次股权转让完成后,公 司将...\n",
"414 近日,该子公司已完成工商注册登记手续,并领取了南京市工商行政管理局颁发的<企业法人营业执照>...\n",
"415 (二)本次交易构成关联交易正元投资拟认购金额不低于 13 亿元且不低于本次配套融资总额的 2...\n",
"416 证券代码:600225 证券简称: ner_1848_ 公告编号:临 2015-118 ...\n",
"417 2015年3月31日, ner_1850_ 司与 ner_1849_ 司签署了附条件生效的《...\n",
"418 三、本公司不会利用对 ner_1852_ 份控制关系损害 ner_1851_ 份及其他股东 ...\n",
"\n",
"[1269 rows x 1 columns]\n"
]
}
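],
"source": [
"# Process the train data: recognize entities with an open-source tool and store them via the entity unification function\n",
"train_data = pd.read_csv('../data/info_extract/train_data.csv', encoding = 'gb2312', header=0)\n",
"train_data['ner'] = None\n",
"\n",
"for i in range(len(train_data)):\n",
"    # Distinguish negative samples (no labeled entity pair) from positive ones\n",
"    if train_data.iloc[i,:]['member1']=='0' and train_data.iloc[i,:]['member2']=='0':\n",
"        sentence = copy(train_data.iloc[i, 1])\n",
"        # TODO: call fool for entity recognition to get words and ners\n",
"        words, ners = fool.analysis(sentence)\n",
"\n",
"        # Replace from the end of the sentence backwards so that earlier offsets stay valid\n",
"        ners[0].sort(key=lambda x: x[0], reverse=True)\n",
"        for start, end, ner_type, ner_name in ners[0]:\n",
"            if ner_type == 'company' or ner_type == 'person':\n",
"                # TODO: call the entity unification function and store the unified entity\n",
"                if ner_type == 'company':\n",
"                    company_main_name = ''.join(main_extract(ner_name, stop_word, d_4_delete, d_city_province)) or ner_name\n",
"                else:\n",
"                    company_main_name = ner_name\n",
"                # Assign a new id (and increment ner_id) only for names not seen before\n",
"                if company_main_name not in ner_dict_new:\n",
"                    ner_dict_new[company_main_name] = ner_id\n",
"                    ner_dict_reverse_new[ner_id] = company_main_name\n",
"                    ner_id += 1\n",
"\n",
"                # Replace the entity name with its code, cutting exactly len(ner_name) characters\n",
"                sentence = sentence[:start] + ' ner_' + str(ner_dict_new[company_main_name]) + '_ ' + sentence[start + len(ner_name):]\n",
"        train_data.iloc[i, -1] = sentence\n",
"    else:\n",
"        # Positive sample: also replace the entities already labeled in the training set with codes\n",
"        sentence = copy(train_data.iloc[i,:]['sentence'])\n",
"        for member in [train_data.iloc[i,:]['member1'], train_data.iloc[i,:]['member2']]:\n",
"            # TODO: call the entity unification function and store the unified entity\n",
"            company_main_name = ''.join(main_extract(member, stop_word, d_4_delete, d_city_province)) or member\n",
"            if company_main_name not in ner_dict_new:\n",
"                ner_dict_new[company_main_name] = ner_id\n",
"                ner_dict_reverse_new[ner_id] = company_main_name\n",
"                ner_id += 1\n",
"\n",
"            # Plain string replacement, so characters in the name are never interpreted as regex syntax\n",
"            sentence = sentence.replace(member, ' ner_%s_ ' % ner_dict_new[company_main_name])\n",
"        train_data.iloc[i, -1] = sentence\n",
"\n",
"y = train_data.loc[:,['tag']]\n",
"train_num = len(train_data)\n",
"X_train = train_data[['ner']]\n",
"\n",
"# Put train and test together for feature extraction\n",
"X = pd.concat([X_train, X_test])\n",
"print(X)"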
]
},
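{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Relation Extraction\n",
"\n",
"\n",
"Goal: using a syntactic parsing tool, the entity recognition results, and text features, extract relations based on the training data and store them in a graph database.\n",
"\n",
"This project requires extracting equity-transaction relations. The relation is an undirected edge: you do not need to decide which party is the investor and which is the investee, only whether a transaction relation exists between the two.\n",
"\n",
"Templates can be built from “regular expressions”, “distance between entities”, “entity context”, “dependency syntax”, and so on.\n",
"\n",
"Submit your answers in the submit directory, named info_extract_submit.csv and info_extract_entity.csv.\n",
"- info_extract_entity.csv format: the first column is the entity ID and the second column is the entity name (multiple names unified into one entity are separated by “|”)\n",
"- info_extract_submit.csv format: the first column is the ID of entity 1 in the relation and the second column is the ID of entity 2.\n",
"\n",
"Example:\n",
"- info_extract_entity.csv\n",
"\n",
"| Entity ID | Entity name |\n",
"| ------ | ------ |\n",
"| 1001 | 小王 |\n",
"| 1002 | A化工厂 |\n",
"\n",
"- info_extract_submit.csv\n",
"\n",
"| Entity 1 | Entity 2 |\n",
"| ------ | ------ |\n",
"| 1001 | 1003 |\n",
"| 1002 | 1001 |"
]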
},
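{
"cell_type": "markdown",
"metadata": {},
"source": [
"The template ideas listed above (regular expressions, distance between entities) can be combined into a simple rule-based baseline. The following is a minimal sketch, not the reference solution: the trigger-word list, the function name `rule_based_relation`, and the `max_gap` and `window` parameters are assumptions made for illustration, and the input is assumed to be a sentence whose entities were already replaced by codes such as `ner_1001_`.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"# Hypothetical trigger words, chosen only for illustration\n",
"TRIGGERS = ['收购', '转让', '认购', '购买']\n",
"NER_CODE = re.compile(r'ner_(\\d+)_')\n",
"\n",
"\n",
"def rule_based_relation(sentence, max_gap=30, window=15):\n",
"    '''Return (id1, id2) for the first pair of adjacent, different entity codes\n",
"    that are at most max_gap characters apart and have a trigger word between\n",
"    them or within window characters after the second one; otherwise None.'''\n",
"    codes = list(NER_CODE.finditer(sentence))\n",
"    for left, right in zip(codes, codes[1:]):\n",
"        gap = right.start() - left.end()\n",
"        region = sentence[left.end():right.end() + window]\n",
"        if (left.group(1) != right.group(1) and gap <= max_gap\n",
"                and any(t in region for t in TRIGGERS)):\n",
"            return int(left.group(1)), int(right.group(1))\n",
"    return None\n",
"\n",
"\n",
"print(rule_based_relation('2015年 ner_1001_ 拟收购 ner_1002_ 的全部股权'))\n",
"print(rule_based_relation('ner_1003_ 与 ner_1004_ 均为上市公司'))\n",
"```\n",
"\n",
"Rules like this are easy to inspect, and their result can also be used as one more feature for a classifier, but they miss any phrasing that the trigger list does not cover."
]
},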
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 3: Extract tf-idf text features\n",
"\n",
"Remove stop words and convert the text into tf-idf vectors."
]
},
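{
"cell_type": "markdown",
"metadata": {},
"source": [
"One detail worth knowing when feeding pre-segmented Chinese text to scikit-learn: the default `token_pattern` of `TfidfVectorizer` only keeps tokens with at least two word characters, so single-character words are silently dropped. The sketch below, on a made-up two-sentence corpus, shows one way to keep every whitespace-separated token; whether single-character words help this task is an open question, not something the exercise prescribes.\n",
"\n",
"```python\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"# Made-up, already segmented sentences (tokens separated by spaces)\n",
"corpus = ['公司 与 银行 签署 协议', '公司 购买 理财 产品']\n",
"\n",
"# Treat every run of non-space characters as one token\n",
"vectorizer = TfidfVectorizer(token_pattern=r'(?u)\\S+')\n",
"tfidf = vectorizer.fit_transform(corpus)\n",
"\n",
"print(sorted(vectorizer.vocabulary_))  # the single-character word 与 is kept\n",
"print(tfidf.shape)\n",
"```\n",
"\n",
"With the default pattern the same corpus would lose the token 与, which changes both the vocabulary and the resulting vectors."
]
},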
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# code\n",
"import re\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from pyltp import Segmentor\n",
"\n",
"\n",
"# Add the entity placeholder words to the segmentation dictionary\n",
"with open('../data/user_dict.txt', 'w', encoding='utf-8') as fw:\n",
"    for v in ['一', '二', '三', '四', '五', '六', '七', '八', '九', '十']:\n",
"        fw.write(v + '号企业 ni\\n')\n",
"\n",
"# Initialize the segmenter\n",
"segmentor = Segmentor()\n",
"# Load the model together with the custom dictionary (raw strings keep the Windows backslashes literal)\n",
"segmentor.load_with_lexicon(r'E:\\chengxu\\pyltp\\ltp_data_v3.4.0\\cws.model', r'..\\data\\user_dict.txt')\n",
"\n",
"# Load the stop words; the with-statement closes the file afterwards\n",
"with open(r'../data/dict/stopwords.txt', encoding='utf-8') as fr:\n",
"    stop_word = [re.sub(r'(\\r|\\n)*', '', line) for line in fr]\n",
"\n",
"\n",
"# Segment a sentence, dropping stop words and entity codes\n",
"def tokenize(x):\n",
"    return ' '.join(word for word in segmentor.segment(x)\n",
"                    if word not in stop_word and not re.findall(r'ner_\\d{4}_', word))\n",
"\n",
"corpus = X['ner'].map(tokenize).tolist()\n",
"\n",
"\n",
"# TODO: extract the tf-idf features\n",
"cv = TfidfVectorizer()\n",
"vec = cv.fit_transform(corpus)  # takes the list of space-joined sentences\n",
"arr = vec.toarray()\n",
"print(arr)\n"
]
},
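{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 4: Extract syntactic features\n",
"Besides word-level sentence-vector features, we can also start from syntax and extract features from syntactic analysis.\n",
"\n",
"Reference features:\n",
"\n",
"1. Distance between company entities\n",
"\n",
"2. Syntactic distance between company entities\n",
"\n",
"3. Distance from each company entity to the key trigger words\n",
"\n",
"4. Dependency relation type of each entity"
]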
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# -*- coding: utf-8 -*-\n",
"from pyltp import Parser\n",
"from pyltp import Segmentor\n",
"from pyltp import Postagger\n",
"import networkx as nx\n",
"import pandas as pd\n",
"import re\n",
"\n",
"postagger = Postagger()  # initialize the POS tagger\n",
"postagger.load_with_lexicon(r'E:\\chengxu\\pyltp\\ltp_data_v3.4.0\\pos.model', '../data/user_dict.txt')  # load the model\n",
"segmentor = Segmentor()  # initialize the segmenter\n",
"segmentor.load_with_lexicon(r'E:\\chengxu\\pyltp\\ltp_data_v3.4.0\\cws.model', '../data/user_dict.txt')  # load the model\n",
"parser = Parser()  # load the parser once instead of on every call to parse()\n",
"parser.load(r'E:\\chengxu\\pyltp\\ltp_data_v3.4.0\\parser.model')\n",
"\n",
"num_lst = ['一', '二', '三', '四', '五', '六', '七', '八', '九', '十']\n",
"\n",
"# Keywords that signal an investment relation\n",
"key_words = [\"收购\",\"竞拍\",\"转让\",\"扩张\",\"并购\",\"注资\",\"整合\",\"并入\",\"竞购\",\"竞买\",\"支付\",\"收购价\",\"收购价格\",\"承购\",\"购得\",\"购进\",\n",
"             \"购入\",\"买进\",\"买入\",\"赎买\",\"购销\",\"议购\",\"函购\",\"函售\",\"抛售\",\"售卖\",\"销售\",\"转售\"]\n",
"\n",
"\n",
"def parse(s):\n",
"    \"\"\"\n",
"    Run dependency parsing on a sentence and return the result as a DataFrame\n",
"    with columns word, tag, head, relation, indexed from 1 (head 0 is the virtual root).\n",
"    \"\"\"\n",
"    # Replace entity codes with placeholder names so that segmentation and POS tagging stay correct\n",
"    for i, ner in enumerate(sorted(set(re.findall(r'ner_\\d{4}_', s)))):\n",
"        if i >= len(num_lst):\n",
"            # TODO: output for the error case\n",
"            print('more than %d entities in the sentence; the rest keep their codes' % len(num_lst))\n",
"            break\n",
"        s = s.replace(ner, num_lst[i] + '号企业')\n",
"    words = list(segmentor.segment(s))\n",
"    tags = list(postagger.postag(words))\n",
"    arcs = parser.parse(words, tags)  # dependency parsing\n",
"    return pd.DataFrame({'word': words, 'tag': tags,\n",
"                         'head': [arc.head for arc in arcs],\n",
"                         'relation': [arc.relation for arc in arcs]},\n",
"                        index=range(1, len(words) + 1))\n",
"\n",
"\n",
"def shortest_path(arcs_ret, source, target):\n",
"    \"\"\"\n",
"    Length of the shortest dependency path between two words; -1 if there is no path.\n",
"    arcs_ret: parse result\n",
"    source: index of entity 1\n",
"    target: index of entity 2\n",
"    \"\"\"\n",
"    G = nx.Graph()  # undirected, so a path may go up and down the dependency tree\n",
"    # Add the nodes\n",
"    for i in list(arcs_ret.index):\n",
"        G.add_node(i)\n",
"    # TODO: add the edges (head 0 is the virtual root and is skipped)\n",
"    for i, head in zip(arcs_ret.index, arcs_ret['head']):\n",
"        if head != 0:\n",
"            G.add_edge(i, head)\n",
"    try:\n",
"        # TODO: shortest distance via nx.shortest_path_length\n",
"        return nx.shortest_path_length(G, source=source, target=target)\n",
"    except (nx.NetworkXException, KeyError):\n",
"        return -1\n",
"\n",
"\n",
"def get_feature(s):\n",
"    \"\"\"\n",
"    One possible set of syntactic features for sentence s, based on its first two entities:\n",
"    [token distance, dependency-path distance, distance from entity 1 to the nearest keyword,\n",
"     distance from entity 2 to the nearest keyword]; -1 marks a value that cannot be computed.\n",
"    The rows can then be stacked column-wise with the TF-IDF matrix.\n",
"    \"\"\"\n",
"    arcs_ret = parse(s)\n",
"    # Position of the first occurrence of each entity placeholder\n",
"    first_pos = {}\n",
"    for i, w in zip(arcs_ret.index, arcs_ret['word']):\n",
"        if w.endswith('号企业'):\n",
"            first_pos.setdefault(w, i)\n",
"    if len(first_pos) < 2:\n",
"        return [-1, -1, -1, -1]\n",
"    e1, e2 = sorted(first_pos.values())[:2]\n",
"    # The dependency relation type of an entity is available as arcs_ret.loc[e1, 'relation']\n",
"    key_idx = [i for i, w in zip(arcs_ret.index, arcs_ret['word']) if w in key_words]\n",
"    key_dist = [min((abs(e - k) for k in key_idx), default=-1) for e in (e1, e2)]\n",
"    features = [e2 - e1, shortest_path(arcs_ret, e1, e2)] + key_dist\n",
"    return features\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 5: Build a classifier\n",
"\n",
"Use the extracted tf-idf features and parse features to build a classifier for the classification task."
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": true
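},
"outputs": [],
"source": [
"# Build a classifier for the classification task\n",
"import re\n",
"import pandas as pd\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"# Split the TF-IDF matrix back into train and test rows (the train rows come first in X)\n",
"train_x = arr[:train_num]\n",
"train_y = y['tag'].values\n",
"test_x_tfidf = arr[train_num:]\n",
"\n",
"# TODO: define the parameters to search over\n",
"param_grid = {\n",
"    'n_estimators': [10, 50, 100],\n",
"    'max_depth': [None, 10, 20],\n",
"    'min_samples_split': [2, 5],\n",
"}\n",
"\n",
"# TODO: choose a model and search for the best parameters with GridSearchCV\n",
"grid = GridSearchCV(RandomForestClassifier(bootstrap=True, random_state=0), param_grid, cv=5, scoring='f1')\n",
"grid.fit(train_x, train_y)\n",
"print(grid.best_params_, grid.best_score_)\n",
"model = grid.best_estimator_\n",
"\n",
"# TODO: classify test_data\n",
"test_predict = model.predict(test_x_tfidf)\n",
"print(test_predict)\n",
"\n",
"\n",
"# TODO: save the classification results for test_data\n",
"# Answers go in the submit directory, named info_extract_submit.csv and info_extract_entity.csv.\n",
"# info_extract_entity.csv: first column is the entity ID, second column is the entity name (unified names separated by “|”)\n",
"# info_extract_submit.csv: first column is the ID of entity 1 in the relation, second column is the ID of entity 2.\n",
"# As a simplification, the two smallest distinct entity codes of each positive sentence form the relation\n",
"relation_ids = []\n",
"for sent, label in zip(X_test['ner'], test_predict):\n",
"    ids = sorted(set(re.findall(r'ner_(\\d+)_', sent)))\n",
"    if label == 1 and len(ids) >= 2:\n",
"        relation_ids.append((int(ids[0]), int(ids[1])))\n",
"relation_list = [(ner_dict_reverse_new[a], ner_dict_reverse_new[b]) for a, b in relation_ids]\n",
"pd.DataFrame(relation_ids).to_csv('../submit/info_extract_submit.csv', index=False, header=False)\n",
"pd.DataFrame(sorted(ner_dict_reverse_new.items())).to_csv('../submit/info_extract_entity.csv', index=False, header=False)\n"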
]
},
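{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 6: Working with a graph database\n",
"Relations are best described with a graph, so we need a graph database. The most widely used graph database today is Neo4j, and Cypher statements can be used to create, read, update, and delete its data. See “https://cuiqingcai.com/4778.html” for reference.\n",
"\n",
"This assignment uses Neo4j as the graph database. Neo4j requires a Java environment, so please set that up first.\n",
"\n",
"Insert the entity relations we extracted into the graph database and query the 3-level investment relation of a node, that is, a path made up of three nodes (if one exists). If no 3-level investment relation can be found, query the investment path of any node you choose."
]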
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
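},
"outputs": [],
"source": [
"\n",
"from py2neo import Node, Relationship, Graph\n",
"\n",
"graph = Graph(\n",
"    \"http://localhost:7474\", \n",
"    username=\"neo4j\", \n",
"    password=\"person\"\n",
")\n",
"\n",
"# relation_list is expected to hold (entity1_name, entity2_name) pairs for the extracted relations\n",
"nodes = {}  # one Node object per entity name, so repeated entities are not duplicated\n",
"for v in relation_list:\n",
"    a = nodes.setdefault(v[0], Node('Company', name=v[0]))\n",
"    b = nodes.setdefault(v[1], Node('Company', name=v[1]))\n",
"\n",
"    # Investor and investee are not distinguished here, so the undirected relation is stored as two directed edges\n",
"    graph.create(a | b | Relationship(a, 'INVEST', b))\n",
"    graph.create(Relationship(b, 'INVEST', a))"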
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
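},
"outputs": [],
"source": [
"# TODO: query the 3-level investment relation of a node (a path of three nodes), here starting from the first entity in relation_list\n",
"print(graph.run('MATCH (a:Company {name: $name})-[:INVEST]->(b)-[:INVEST]->(c) WHERE a <> c RETURN a.name, b.name, c.name', name=relation_list[0][0]).data())\n"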
]
},
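{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4: Entity Disambiguation\n",
"With entity recognition and relation extraction done, we have come a long way, but which entity in the knowledge base does each extracted entity actually correspond to? In the figure below, “苹果” (apple) alone corresponds to 13 entities with the same name.\n",
"<img src=\"../image/baike2.png\" width=340 height=480>\n",
"\n",
"Entity disambiguation addresses the name ambiguity that is widespread in text by matching the entities recognized in a sentence to the entities in a knowledge base.\n",
"\n",
"\n",
"### Exercise 7:\n",
"Match the person entities in the first 25 samples of test_data.csv to their Baidu Baike URLs (every person name in this part of the samples can be linked in Baidu Baike).\n",
"\n",
"Use Python packages such as scrapy, beautifulsoup, and requests to crawl Baidu Baike, determine whether a name is polysemous, and if so choose the best entity to match.\n",
"\n",
"The URL ‘https://baike.baidu.com/item/’ + person name opens the Baidu Baike entry for that name. From the crawled page you need to detect whether the entry corresponds to multiple entities, as in the figure below:\n",
"<img src=\"../image/baike1.png\" width=440 height=480>\n",
"If the entry corresponds to multiple entities, return the URL of the correctly matched entity, for example ‘https://baike.baidu.com/item/陆永/20793929’ for the example page.\n",
"\n",
"- Submission file: entity_disambiguation_submit.csv\n",
"- Submission format: the first column is the entity id (consistent with the ids in info_extract_submit.csv) and the second column is the corresponding URL.\n",
"- Example:\n",
"\n",
"| Entity ID | URL |\n",
"| ------ | ------ |\n",
"| 1001 | https://baike.baidu.com/item/陆永/20793929 |\n",
"| 1002 | https://baike.baidu.com/item/王芳/567232 |\n"
]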
},
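{
"cell_type": "markdown",
"metadata": {},
"source": [
"Choosing the best entry for a polysemous name comes down to comparing the context of the name in the sample with the text of each candidate entry. The sketch below shows one simple way to do that with character-level Jaccard similarity. It is only an illustration under stated assumptions: the function names `char_jaccard` and `best_entry` are hypothetical, the candidate texts are made up, and in practice the texts would be the summaries crawled from each candidate page.\n",
"\n",
"```python\n",
"def char_jaccard(a, b):\n",
"    '''Jaccard similarity between the sets of characters of two strings.'''\n",
"    set_a, set_b = set(a), set(b)\n",
"    union = set_a | set_b\n",
"    return len(set_a & set_b) / len(union) if union else 0.0\n",
"\n",
"\n",
"def best_entry(context, candidates):\n",
"    '''candidates maps an entry key (for example its URL) to the entry text;\n",
"    return the key whose text is most similar to the context.'''\n",
"    return max(candidates, key=lambda key: char_jaccard(context, candidates[key]))\n",
"\n",
"\n",
"# Made-up entry texts for illustration; real ones would come from the crawled pages\n",
"candidates = {\n",
"    'entry_a': '公司董事长,持有公司股份',\n",
"    'entry_b': '足球运动员,司职前锋',\n",
"}\n",
"print(best_entry('与公司签署了股份认购合同', candidates))\n",
"```\n",
"\n",
"Character overlap is a crude measure; segmenting both texts and comparing tf-idf vectors is a natural refinement when the candidate entries are long."
]
},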
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import fool\n",
"import pandas as pd\n",
"\n",
"# Find every person name in the first 25 samples of test_data.csv, together with its context in the document\n",
"test_data = pd.read_csv('../data/info_extract/test_data.csv', encoding = 'gb2312', header=0)\n",
"\n",
"# Person names and contexts (key: person ID, value: (person name, context text))\n",
"person_name = {}\n",
"\n",
"# Context window size: number of characters kept on each side of the name\n",
"window = 10\n",
"\n",
"# Iterate over the first 25 samples\n",
"person_id = 1\n",
"for i in range(25):\n",
"    sentence = test_data.iloc[i, 1]\n",
"    words, ners = fool.analysis(sentence)\n",
"    for start, end, ner_type, ner_name in ners[0]:\n",
"        if ner_type == 'person':\n",
"            # TODO: extract the context of the entity\n",
"            name_end = start + len(ner_name)\n",
"            context = sentence[max(0, start - window):start] + sentence[name_end:name_end + window]\n",
"            person_name[person_id] = (ner_name, context)\n",
"            person_id += 1\n",
"\n",
"\n",
"# Use a crawler to get the URL for each person name\n",
"# TODO: fetch the entry content for each person entity.\n",
"\n",
"# TODO: compare each person's context with the crawled entries and choose the closest entry.\n",
"\n",
"# Rows of (entity id, URL), to be filled in by the TODO steps above\n",
"result_data = []\n",
"\n",
"# Write the results\n",
"pd.DataFrame(result_data, columns=['id', 'url']).to_csv('../submit/entity_disambiguation_submit.csv', index=False)\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}