{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"对此project的想法:\n",
"\n",
"观察了此porject布置的时间点,是在“实体识别”、“句法分析”之后。应该适当温习算法原理的知识点。\n",
"\n",
"1、实体识别、依存句法等工具自己动手开发实现复杂度较高,挑选“句法结构分析”工具让学员开发,不是特别复杂,又温习了句法分析中的知识点,此处选择CYK算法让学员实现,同时又考察了动态规划的编写。(具体算法因为不知道课程内容,可以根据内容再做调整)\n",
"\n",
"2、经过挑选,选择“企业投资关系图谱”作为学员的任务,原因1:企业投资关系易于理解,模版比较好总结,适合没有训练样本的情景;原因2:企业的称谓较多而复杂,适合考察实体统一和消歧的知识点。\n",
"\n",
"3、对于实体识别和句法分析,我考虑使用stanfordnlp,但是python接口好像可配置性不强,主要就是让学员会调用,会利用调用结果。\n",
"对于实体消歧,我考虑使用上下文相似度进行衡量。\n",
"对于实体统一,我考虑考察一下学员在“企业多名称”派生上面的发散性思维。\n",
"由于不知道此project前几节课中涉及的相关知识点,所以先拍脑袋决定,老师如果对知识点有相关准备和资料可以给我看一下,或者听听老师的规划。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Project 1: 利用信息抽取技术搭建知识库"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"本项目的目的是结合命名实体识别、依存语法分析、实体消歧、实体统一对网站开放语料抓取的数据建立小型知识图谱。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part1:开发句法结构分析工具"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1 开发工具\n",
"使用CYK算法,根据所提供的:非终结符集合、终结符集合、规则集,对10句句子计算句法结构。\n",
"\n",
"非终结符集合:N={S, NP, VP, PP, DT, VI, VT, NN, IN}\n",
"\n",
"终结符集合:{sleeps, saw, boy, girl, dog, telescope, the, with, in}\n",
"\n",
"规则集: R={\n",
"- (1) S-->NP VP 1.0\n",
"- (2) VP-->VI 0.3\n",
"- ......\n",
"\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# code"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2 计算算法复杂度\n",
"计算上一节开发的算法所对应的时间复杂度和空间复杂度。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# code"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part2:在百度百科辅助下,建立“投资关系”知识图谱"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.1 实体识别\n",
"data目录中“baike.txt”文件记录了15个实体对应百度百科的url。\n",
"\n",
"借助开源实体识别工具并根据百度百科建立的已知实体对进行识别,对句中实体进行识别。"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# code\n",
"# 首先尝试利用开源工具分出实体\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# code\n",
"# 在此基础上,将百度百科作为已知实体加入词典,对实体进行识别"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 实体消歧\n",
"将句中识别的实体与知识库中实体进行匹配,解决实体歧义问题。\n",
"可利用上下文本相似度进行识别。"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# code\n",
"# 将识别出的实体与知识库中实体进行匹配,解决识别出一个实体对应知识库中多个实体的问题。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3 实体统一\n",
"对同一实体具有多个名称的情况进行统一,将多种称谓统一到一个实体上,并体现在实体的属性中(可以给实体建立“别称”属性)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# code"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.4 关系抽取\n",
"借助句法分析工具,和实体识别的结果,以及正则表达式,设定模版抽取关系。从data目录中news.txt文件中的url对应的新闻提取关系并存储进图数据库。\n",
"\n",
"本次要求抽取投资关系,关系为有向边,由投资方指向被投资方。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# code\n",
"\n",
"# 最后提交文件为识别出的整个投资图谱,以及图谱中结点列表与属性。"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Project : 利用信息抽取技术搭建知识库"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"本项目的目的是结合命名实体识别、依存语法分析、实体消歧、实体统一对网站开放语料抓取的数据建立小型知识图谱。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part1:开发句法结构分析工具"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1 开发工具(15分)\n",
"使用CYK算法,根据所提供的:非终结符集合、终结符集合、规则集,对以下句子计算句法结构。\n",
"\n",
"“the boy saw the dog with a telescope\"\n",
"\n",
"\n",
"\n",
"非终结符集合:N={S, NP, VP, PP, DT, Vi, Vt, NN, IN}\n",
"\n",
"终结符集合:{sleeps, saw, boy, girl, dog, telescope, the, with, in}\n",
"\n",
"规则集: R={\n",
"- (1) S-->NP VP 1.0\n",
"- (2) VP-->VI 0.3\n",
"- (3) VP-->Vt NP 0.4\n",
"- (4) VP-->VP PP 0.3\n",
"- (5) NP-->DT NN 0.8\n",
"- (6) NP-->NP PP 0.2\n",
"- (7) PP-->IN NP 1.0\n",
"- (8) Vi-->sleeps 1.0\n",
"- (9) Vt-->saw 1.0\n",
"- (10) NN-->boy 0.1\n",
"- (11) NN-->girl 0.1\n",
"- (12) NN-->telescope 0.3\n",
"- (13) NN-->dog 0.5\n",
"- (14) DT-->the 0.5\n",
"- (15) DT-->a 0.5\n",
"- (16) IN-->with 0.6\n",
"- (17) IN-->in 0.4\n",
"\n",
"\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# 分数(15)\n",
"class my_CYK(object):\n",
" # TODO: 初始化函数\n",
" def __init__(self, non_ternimal, terminal, rules_prob, start_prob):\n",
" \n",
"\n",
" # TODO: 返回句子的句法结构,并以树状结构打印出来\n",
" def parse_sentence(self, sentence):\n",
"\n",
"\n",
"\n",
"def main(sentence):\n",
" \"\"\"\n",
" 主函数,句法结构分析需要的材料如下\n",
" \"\"\"\n",
" non_terminal = {'S', 'NP', 'VP', 'PP', 'DT', 'Vi', 'Vt', 'NN', 'IN'}\n",
" start_symbol = 'S'\n",
" terminal = {'sleeps', 'saw', 'boy', 'girl', 'dog', 'telescope', 'the', 'with', 'in'}\n",
" rules_prob = {'S': {('NP', 'VP'): 1.0},\n",
" 'VP': {('Vt', 'NP'): 0.8, ('VP', 'PP'): 0.2},\n",
" 'NP': {('DT', 'NN'): 0.8, ('NP', 'PP'): 0.2},\n",
" 'PP': {('IN', 'NP'): 1.0},\n",
" 'Vi': {('sleeps',): 1.0},\n",
" 'Vt': {('saw',): 1.0},\n",
" 'NN': {('boy',): 0.1, ('girl',): 0.1,('telescope',): 0.3,('dog',): 0.5},\n",
" 'DT': {('the',): 1.0},\n",
" 'IN': {('with',): 0.6, ('in',): 0.4},\n",
" }\n",
" cyk = my_CYK(non_terminal, terminal, rules_prob, start_symbol)\n",
" cyk.parse_sentence(sentence)\n",
"\n",
"\n",
"# TODO: 对该测试用例进行测试\n",
"if __name__ == \"__main__\":\n",
" sentence = \"the boy saw the dog with the telescope\"\n",
" main(sentence)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2 计算算法复杂度(3分)\n",
"计算上一节开发的算法所对应的时间复杂度和空间复杂度。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 分数(3)\n",
"# 上面所写的算法的时间复杂度和空间复杂度分别是多少?\n",
"# TODO\n",
"时间复杂度=O(), 空间复杂度=O()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part2 : 抽取企业股权交易关系,并建立知识库"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.1 练习实体消歧(15分)\n",
"将句中识别的实体与知识库中实体进行匹配,解决实体歧义问题。\n",
"可利用上下文本相似度进行识别。\n",
"\n",
"在data/entity_disambiguation目录中,entity_list.csv是50个实体,valid_data.csv是需要消歧的语句。\n",
"\n",
"答案提交在submit目录中,命名为entity_disambiguation_submit.csv。格式为:第一列是需要消歧的语句序号,第二列为多个“实体起始位坐标-实体结束位坐标:实体序号”以“|”分隔的字符串。\n",
"\n",
"*成绩以实体识别准确率以及召回率综合的F1-score\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"import jieba\n",
"import pandas as pd\n",
"\n",
"# TODO:将entity_list.csv中已知实体的名称导入分词词典\n",
"entity_data = pd.read_csv('../data/entity_disambiguation/entity_list.csv', encoding = 'utf-8')\n",
"\n",
"\n",
"# TODO:对每句句子识别并匹配实体 \n",
"valid_data = pd.read_csv('../data/entity_disambiguation/valid_data.csv', encoding = 'gb18030')\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO:将计算结果存入文件\n",
"\"\"\"\n",
"格式为第一列是需要消歧的语句序号,第二列为多个“实体起始位坐标-实体结束位坐标:实体序号”以“|”分隔的字符串。\n",
"样例如下:\n",
"[\n",
" [0, '3-6:1008|109-112:1008|187-190:1008'],\n",
" ...\n",
"]\n",
"\"\"\"\n",
"pd.DataFrame(result_data).to_csv('../submit/entity_disambiguation_submit.csv', index=False)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 实体识别(10分)\n",
"\n",
"借助开源工具,对实体进行识别。\n",
"\n",
"将每句句子中实体识别出,存入实体词典,并用特殊符号替换语句。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# code\n",
"# 首先尝试利用开源工具分出实体\n",
"\n",
"import fool # foolnltk是基于深度学习的开源分词工具,参考https://github.com/rockyzhengwu/FoolNLTK,也可以使用LTP等开源工具\n",
"import pandas as pd\n",
"from copy import copy\n",
"\n",
"\n",
"sample_data = pd.read_csv('../data/info_extract/train_data.csv', encoding = 'utf-8', header=0)\n",
"y = sample_data.loc[:,['tag']]\n",
"train_num = len(sample_data)\n",
"test_data = pd.read_csv('../data/info_extract/test_data.csv', encoding = 'utf-8', header=0)\n",
"sample_data = pd.concat([sample_data.loc[:, ['id', 'sentence']], test_data])\n",
"sample_data['ner'] = None\n",
"# TODO: 将提取的实体以合适的方式在‘ner’列中,便于后续使用\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3 实体统一(15分)\n",
"对同一实体具有多个名称的情况进行统一,将多种称谓统一到一个实体上,并体现在实体的属性中(可以给实体建立“别称”属性)\n",
"\n",
"比如:“河北银行股份有限公司”和“河北银行”可以统一成一个实体。\n",
"\n",
"公司名称有其特点,例如后缀可以省略、上市公司的地名可以省略等等。在data/dict目录中提供了几个词典,可供实体统一使用。\n",
"- company_suffix.txt是公司的通用后缀词典\n",
"- company_business_scope.txt是公司经营范围常用词典\n",
"- co_Province_Dim.txt是省份词典\n",
"- co_City_Dim.txt是城市词典\n",
"- stopwords.txt是可供参考的停用词"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"import jieba\n",
"import jieba.posseg as pseg\n",
"import re\n",
"import datetime\n",
"\n",
"\n",
"\n",
"#提示:可以分析公司全称的组成方式,将“地名”、“公司主体部分”、“公司后缀”区分开,并制定一些启发式的规则\n",
"# TODO:建立main_extract,当输入公司名,返回会被统一的简称\n",
"def main_extract(company_name,stop_word,d_4_delete,d_city_province):\n",
" \"\"\"\n",
" company_name  输入的公司名\n",
" stop_word 停用词\n",
" d_city_province 地区\n",
" \"\"\"\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"河北银行\n"
]
}
],
"source": [
"# 简单测试实体统一代码\n",
"stop_word,d_city_province = my_initial()\n",
"company_name = \"河北银行股份有限公司\"\n",
"# 对公司名提取主体部分,将包含相同主体部分的公司统一为一个实体\n",
"company_name = main_extract(company_name,stop_word,d_city_province)\n",
"print(company_name)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# TODO:在实体识别中运用统一实体的功能\n",
"\n",
"import fool\n",
"import pandas as pd\n",
"from copy import copy\n",
"\n",
"\n",
"sample_data = pd.read_csv('../data/info_extract/train_data.csv', encoding = 'utf-8', header=0)\n",
"y = sample_data.loc[:,['tag']]\n",
"train_num = len(sample_data)\n",
"test_data = pd.read_csv('../data/info_extract/test_data.csv', encoding = 'utf-8', header=0)\n",
"sample_data = pd.concat([sample_data.loc[:, ['id', 'sentence']], test_data])\n",
"sample_data['ner'] = None\n",
"# TODO: 将提取的实体以合适的方式在‘ner’列中并统一编号,便于后续使用\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.4 关系抽取(37分)\n",
"目标:借助句法分析工具,和实体识别的结果,以及文本特征,基于训练数据抽取关系。\n",
"\n",
"本次要求抽取股权交易关系,关系为有向边,由投资方指向被投资方。\n",
"\n",
"模板建立可以使用“正则表达式”、“实体间距离”、“实体上下文”、“依存句法”等。\n",
"\n",
"答案提交在submit目录中,命名为info_extract_submit.csv和info_extract_entity.csv。\n",
"- info_extract_entity.csv格式为:第一列是实体编号,第二列是实体名(多个实体名用“|”分隔)\n",
"- info_extract_submit.csv格式为:第一列是一方实体编号,第二列为另一方实体编号。\n",
"\n",
"*成绩以抽取的关系准确率以及召回率综合的F1-score"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# 提取文本tf-idf特征\n",
"# 去除停用词,并转换成tfidf向量。\n",
"# 可以使用LTP分词工具,参考:https://ltp.readthedocs.io/zh_CN/latest/\n",
"from sklearn.feature_extraction.text import TfidfTransformer \n",
"from sklearn.feature_extraction.text import CountVectorizer \n",
"from pyltp import Segmentor\n",
"\n",
"def get_tfidf_feature():\n",
" segmentor = Segmentor() # 初始化实例\n",
" segmentor.load_with_lexicon('/ltp_data_v3.4.0/cws.model', '../data/user_dict.txt') # 加载模型\n",
"\n",
" \n",
" return tfidf_feature\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 提取句法特征\n",
"\n",
"参考特征:\n",
"\n",
"1、企业实体间距离\n",
"\n",
"2、企业实体间句法距离\n",
"\n",
"3、企业实体分别和关键触发词的距离\n",
"\n",
"4、实体的依存关系类别"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"# -*- coding: utf-8 -*-\n",
"from pyltp import Parser\n",
"from pyltp import Segmentor\n",
"from pyltp import Postagger\n",
"import networkx as nx\n",
"import pylab\n",
"import re\n",
"\n",
"# 投资关系关键词\n",
"# 提示:可以结合投资关系的触发词建立有效特征\n",
"key_words = [\"收购\",\"竞拍\",\"转让\",\"扩张\",\"并购\",\"注资\",\"整合\",\"并入\",\"竞购\",\"竞买\",\"支付\",\"收购价\",\"收购价格\",\"承购\",\"购得\",\"购进\",\n",
" \"购入\",\"买进\",\"买入\",\"赎买\",\"购销\",\"议购\",\"函购\",\"函售\",\"抛售\",\"售卖\",\"销售\",\"转售\"]\n",
"\n",
"postagger = Postagger() # 初始化实例\n",
"postagger.load_with_lexicon('/ltp_data_v3.4.0/pos.model', '../data/user_dict.txt') # 加载模型\n",
"segmentor = Segmentor() # 初始化实例\n",
"segmentor.load_with_lexicon('/ltp_data_v3.4.0/cws.model', '../data/user_dict.txt') # 加载模型\n",
"parser = Parser() # 初始化实例\n",
"parser.load('/ltp_data_v3.4.0/parser.model') # 加载模型\n",
"\n",
"\n",
"def get_parse_feature(s):\n",
" \"\"\"\n",
" 对语句进行句法分析,并返回句法结果\n",
" parse_result:依存句法解析结果\n",
" source:企业实体的词序号\n",
" target:另一个企业实体的词序号\n",
" keyword_pos:关键词词序号列表\n",
" source_dep:企业实体依存句法类型\n",
" target_dep:另一个企业实体依存句法类型\n",
" ...\n",
" (可自己建立新特征,提高分类准确率)\n",
" \"\"\"\n",
"\n",
"\n",
" return features\n",
"\n",
"\n",
"\n",
"# LTP中的依存句法类型如下:['SBV', 'VOB', 'IOB', 'FOB', 'DBL', 'ATT', 'ADV', 'CMP', 'COO', 'POB', 'LAD', 'RAD', 'IS', 'HED']"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"# 汇总词频特征和句法特征\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"\n",
"whole_feature = pd.concat([tfidf_feature, parse_feature])\n",
"# TODO: 将字符型变量转换为onehot形式\n",
"\n",
"train_x = whole_feature.iloc[:, :train_num]\n",
"test_x = whole_feature.iloc[:, train_num:]\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 建立分类器进行分类,使用sklearn中的分类器,不限于LR、SVM、决策树等\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn import preprocessing\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"\n",
"class RF:\n",
" def __init__(self):\n",
" \n",
" def train(self, train_x, train_y):\n",
" \n",
" return model\n",
" \n",
" def predict(self, model, test_x)\n",
"\n",
" return predict, predict_prob\n",
" \n",
" \n",
"rf = RF()\n",
"model = rf.train(train_x, y)\n",
"predict, predict_prob = rf.predict(model, test_x)\n",
"\n",
"\n",
"# TODO: 存储提取的投资关系实体对,本次关系抽取不要求确定投资方和被投资方,仅确定实体对具有投资关系即可\n",
"\"\"\"\n",
"以如下形式存储,转为dataframe后写入csv文件:\n",
"[\n",
" [九州通,江中药业股份有限公司],\n",
" ...\n",
"]\n",
"\"\"\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.5 存储进图数据库(5分)\n",
"\n",
"本次作业我们使用neo4j作为图数据库,neo4j需要java环境,请先配置好环境。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"from py2neo import Node, Relationship, Graph\n",
"\n",
"graph = Graph(\n",
" \"http://localhost:7474\", \n",
" username=\"neo4j\", \n",
" password=\"person\"\n",
")\n",
"\n",
"for v in relation_list:\n",
" a = Node('Company', name=v[0])\n",
" b = Node('Company', name=v[1])\n",
" \n",
" # 本次不区分投资方和被投资方\n",
" r = Relationship(a, 'INVEST', b)\n",
" s = a | b | r\n",
" graph.create(s)\n",
" r = Relationship(b, 'INVEST', a)\n",
" s = a | b | r\n",
" graph.create(s)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO:查询某节点的3层投资关系"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Project 1: 利用信息抽取技术搭建知识库"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"本项目的目的是结合命名实体识别、依存语法分析、实体消歧、实体统一对网站开放语料抓取的数据建立小型知识图谱。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part1:开发句法结构分析工具"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1 开发工具\n",
"使用CYK算法,根据所提供的:非终结符集合、终结符集合、规则集,对以下句子计算句法结构。\n",
"\n",
"“the boy saw the dog with a telescope\"\n",
"\n",
"\n",
"\n",
"非终结符集合:N={S, NP, VP, PP, DT, Vi, Vt, NN, IN}\n",
"\n",
"终结符集合:{sleeps, saw, boy, girl, dog, telescope, the, with, in}\n",
"\n",
"规则集: R={\n",
"- (1) S-->NP VP 1.0\n",
"- (2) VP-->VI 0.3\n",
"- (3) VP-->Vt NP 0.4\n",
"- (4) VP-->VP PP 0.3\n",
"- (5) NP-->DT NN 0.8\n",
"- (6) NP-->NP PP 0.2\n",
"- (7) PP-->IN NP 1.0\n",
"- (8) Vi-->sleeps 1.0\n",
"- (9) Vt-->saw 1.0\n",
"- (10) NN-->boy 0.1\n",
"- (11) NN-->girl 0.1\n",
"- (12) NN-->telescope 0.3\n",
"- (13) NN-->dog 0.5\n",
"- (14) DT-->the 0.5\n",
"- (15) DT-->a 0.5\n",
"- (16) IN-->with 0.6\n",
"- (17) IN-->in 0.4\n",
"\n",
"\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# 分数(15)\n",
"class my_CYK(object):\n",
" def __init__(self, non_ternimal, terminal, rules_prob, start_prob):\n",
" self.non_terminal = non_ternimal\n",
" self.terminal = terminal\n",
" self.rules_prob = rules_prob\n",
" self.start_symbol = start_prob\n",
"\n",
"\n",
" def parse_sentence(self, sentence):\n",
" sents = sentence.split()\n",
" best_path = [[{} for _ in range(len(sents))] for _ in range(len(sents))]\n",
"\n",
" # initialization\n",
" for i in range(len(sents)):\n",
" for x in self.non_terminal:\n",
" best_path[i][i][x] = {}\n",
" if (sents[i],) in self.rules_prob[x].keys():\n",
" best_path[i][i][x]['prob'] = self.rules_prob[x][(sents[i],)]\n",
" best_path[i][i][x]['path'] = {'split':None, 'rule': sents[i]}\n",
" else:\n",
" best_path[i][i][x]['prob'] = 0\n",
" best_path[i][i][x]['path'] = {'split':None, 'rule': None}\n",
"\n",
" # CKY recursive\n",
" for l in range(1, len(sents)):\n",
" for i in range(len(sents)-l):\n",
" j = i + l\n",
" for x in self.non_terminal:\n",
" tmp_best_x = {'prob':0, 'path':None}\n",
" for key, value in self.rules_prob[x].items():\n",
" if key[0] not in self.non_terminal: \n",
" break\n",
" for s in range(i, j):\n",
" tmp_prob = value * best_path[i][s][key[0]]['prob'] * best_path[s+1][j][key[1]]['prob']\n",
" if tmp_prob > tmp_best_x['prob']:\n",
" tmp_best_x['prob'] = tmp_prob\n",
" tmp_best_x['path'] = {'split': s, 'rule': key}\n",
" best_path[i][j][x] = tmp_best_x\n",
" self.best_path = best_path\n",
"\n",
" # parse result\n",
" self._parse_result(0, len(sents)-1, self.start_symbol)\n",
" print(\"prob = \", self.best_path[0][len(sents)-1][self.start_symbol]['prob'])\n",
"\n",
"\n",
" def _parse_result(self, left_idx, right_idx, root, ind=0):\n",
" node = self.best_path[left_idx][right_idx][root]\n",
" if node['path']['split'] is not None:\n",
" print('\\t'*ind, (root, self.rules_prob[root].get(node['path']['rule'])))\n",
" self._parse_result(left_idx, node['path']['split'], node['path']['rule'][0], ind+1)\n",
" self._parse_result(node['path']['split']+1, right_idx, node['path']['rule'][1], ind+1)\n",
" else:\n",
" print('\\t'*ind, (root, self.rules_prob[root].get((node['path']['rule'],))) )\n",
" print('--->', node['path']['rule'])\n",
"\n",
"\n",
"\n",
"def main(sentence):\n",
" non_terminal = {'S', 'NP', 'VP', 'PP', 'DT', 'Vi', 'Vt', 'NN', 'IN'}\n",
" start_symbol = 'S'\n",
" terminal = {'sleeps', 'saw', 'boy', 'girl', 'dog', 'telescope', 'the', 'with', 'in'}\n",
" rules_prob = {'S': {('NP', 'VP'): 1.0},\n",
" 'VP': {('Vt', 'NP'): 0.8, ('VP', 'PP'): 0.2},\n",
" 'NP': {('DT', 'NN'): 0.8, ('NP', 'PP'): 0.2},\n",
" 'PP': {('IN', 'NP'): 1.0},\n",
" 'Vi': {('sleeps',): 1.0},\n",
" 'Vt': {('saw',): 1.0},\n",
" 'NN': {('boy',): 0.1, ('girl',): 0.1,('telescope',): 0.3,('dog',): 0.5},\n",
" 'DT': {('the',): 1.0},\n",
" 'IN': {('with',): 0.6, ('in',): 0.4},\n",
" }\n",
" cyk = my_CYK(non_terminal, terminal, rules_prob, start_symbol)\n",
" cyk.parse_sentence(sentence)\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" ('S', 1.0)\n",
"\t ('NP', 0.8)\n",
"\t\t ('DT', 1.0)\n",
"---> the\n",
"\t\t ('NN', 0.1)\n",
"---> boy\n",
"\t ('VP', 0.2)\n",
"\t\t ('VP', 0.8)\n",
"\t\t\t ('Vt', 1.0)\n",
"---> saw\n",
"\t\t\t ('NP', 0.8)\n",
"\t\t\t\t ('DT', 1.0)\n",
"---> the\n",
"\t\t\t\t ('NN', 0.5)\n",
"---> dog\n",
"\t\t ('PP', 1.0)\n",
"\t\t\t ('IN', 0.6)\n",
"---> with\n",
"\t\t\t ('NP', 0.8)\n",
"\t\t\t\t ('DT', 1.0)\n",
"---> the\n",
"\t\t\t\t ('NN', 0.3)\n",
"---> telescope\n",
"prob = 0.0007372800000000003\n"
]
}
],
"source": [
"# TODO: 对该测试用例进行测试\n",
"# \"the boy saw the dog with the telescope\"\n",
"\n",
"if __name__ == \"__main__\":\n",
" sentence = \"the boy saw the dog with the telescope\"\n",
" main(sentence)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2 计算算法复杂度\n",
"计算上一节开发的算法所对应的时间复杂度和空间复杂度。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 分数(3)\n",
"# 上面所写的算法的时间复杂度和空间复杂度分别是多少?\n",
"# TODO\n",
"时间复杂度=O(), 空间复杂度=O()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part2 基于Bootstrapping,抽取企业股权交易关系,并建立知识库"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.1 练习实体消歧\n",
"将句中识别的实体与知识库中实体进行匹配,解决实体歧义问题。\n",
"可利用上下文本相似度进行识别。\n",
"\n",
"在data/entity_disambiguation目录中,entity_list.csv是50个实体,valid_data.csv是需要消歧的语句(待添加)。\n",
"\n",
"答案提交在submit目录中,命名为entity_disambiguation_submit.csv。格式为:第一列是需要消歧的语句序号,第二列为多个“实体起始位坐标-实体结束位坐标:实体序号”以“|”分隔的字符串。\n",
"\n",
"*成绩以实体识别准确率以及召回率综合的F值评分\n"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [],
"source": [
"# code\n",
"# 将识别出的实体与知识库中实体进行匹配,解决识别出一个实体对应知识库中多个实体的问题。\n",
"\n",
"# 将entity_list.csv中已知实体的名称导入分词词典\n",
"\n",
"import jieba\n",
"import pandas as pd\n",
"\n",
"entity_data = pd.read_csv('../data/entity_disambiguation/entity_list.csv', encoding = 'gb18030')\n",
"entity_dict = {}\n",
"\n",
"for i in range(len(entity_data)):\n",
" line = entity_data.iloc[i, :]\n",
" for word in line.entity_name.split('|'):\n",
" jieba.add_word(word)\n",
" if word in entity_dict:\n",
" entity_dict[word].append(line.entity_id)\n",
" else:\n",
" entity_dict[word] = [line.entity_id]\n",
"\n",
"# 对每句句子识别并匹配实体 \n",
"\n",
"valid_data = pd.read_csv('../data/entity_disambiguation/valid_data.csv', encoding = 'gb18030')\n",
"\n",
"result_data = []\n",
"for i in range(len(valid_data)):\n",
" line = valid_data.iloc[i, :]\n",
" ret =[] # 存储实体的坐标和序号\n",
" loc = 0\n",
" window = 10 # 观察上下文的窗口大小\n",
" sentence = jieba.lcut(line.sentence)\n",
" ret = []\n",
" for idx, word in enumerate(sentence):\n",
" if word in entity_dict:\n",
" max_similar = 0\n",
" max_entity_id = 0\n",
" context = sentence[max(0, idx-window):min(len(sentence)-1, idx+window)]\n",
" for ids in entity_dict[word]:\n",
" similar = len(set(context)&set(jieba.lcut(entity_data[entity_data.entity_id.isin([ids])].reset_index().desc[0])))\n",
" if max_similar>similar:\n",
" max_similar = similar\n",
" max_entity_id = ids\n",
" ret.append(str(loc)+'-'+str(loc+len(word))+':'+str(ids))\n",
" loc+=len(word)\n",
" result_data.append([i, '|'.join(ret)])\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[[0, '3-6:1008|109-112:1008|187-190:1008'],\n",
" [1, '18-21:1008'],\n",
" [2, '23-26:1008|40-43:1008'],\n",
" [3, '7-10:1008'],\n",
" [4, '2-5:1008|14-17:1008'],\n",
" [5, '28-30:1003|34-36:1003|41-43:1003'],\n",
" [6, '4-8:1001|25-27:1003|34-36:1003|100-102:1003'],\n",
" [7, '0-2:1003|6-10:1001|19-21:1003|34-36:1003|45-47:1003'],\n",
" [8, '8-10:1003|22-24:1003|34-36:1003|37-39:1003|46-48:1003'],\n",
" [9, '14-16:1003'],\n",
" [10, '0-2:1005|39-44:1005'],\n",
" [11, '7-11:1005|20-22:1005'],\n",
" [12, '4-6:1005|29-31:1005|62-64:1005'],\n",
" [13, '26-28:1005'],\n",
" [14, '0-2:1005|24-26:1005'],\n",
" [15, '10-12:1005|28-30:1005'],\n",
" [16, '6-8:1005|20-22:1005'],\n",
" [17, '8-12:1011|26-30:1011'],\n",
" [18, '9-13:1011|28-30:1013'],\n",
" [19, '0-2:1013|18-20:1013'],\n",
" [20, '6-8:1013'],\n",
" [21, '0-2:1013|26-28:1013|41-43:1013'],\n",
" [22, '0-2:1013|20-22:1013'],\n",
" [23, '0-2:1013'],\n",
" [24, '0-2:1013|32-34:1013'],\n",
" [25, '0-3:1016'],\n",
" [26, '2-5:1016|11-14:1016|18-21:1016'],\n",
" [27, '20-23:1016'],\n",
" [28, '11-14:1016']]"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(result_data).to_csv('../submit/entity_disambiguation_submit.csv', index=False)\n",
"result_data\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 实体识别\n",
"\n",
"借助开源工具,对实体进行识别。\n",
"\n",
"将每句句子中实体识别出,存入实体词典,并用特殊符号替换语句。"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[(28, 33, 'company', '国泰君安'), (0, 13, 'company', '深圳能源集团股份有限公司')]]\n",
"[[(36, 49, 'company', '远大产业控股股份有限公司'), (0, 13, 'company', '远大产业控股股份有限公司')]]\n",
"[[(104, 109, 'company', '河北银行'), (88, 99, 'company', '河北银行股份有限公司'), (61, 74, 'company', '南京栖霞建设集团有限公司'), (34, 47, 'company', '南京栖霞建设股份有限公司')]]\n",
"[[(189, 196, 'time', '2015年度'), (185, 190, 'company', '歌礼制药'), (160, 165, 'company', '康桥资本'), (136, 150, 'company', '天士力(香港)药业有限公司'), (88, 114, 'company', 'CBCInvestmentSevenLimited'), (81, 86, 'company', '康桥资本'), (44, 58, 'company', '天士力(香港)药业有限公司'), (19, 33, 'company', '天士力医药集团股份有限公司'), (2, 16, 'company', '天士力制药集团股份有限公司')]]\n",
"[[(44, 59, 'company', '江苏康缘美域生物医药有限公司'), (21, 37, 'company', '连云港康缘美域保健食品有限公司'), (6, 19, 'company', '江苏康缘药业股份有限公司'), (0, 6, 'time', '2016年')]]\n",
"[[(74, 85, 'company', '康缘国际实业有限公司'), (60, 73, 'company', '江苏康缘药业股份有限公司'), (39, 52, 'company', '江苏康缘集团有限责任公司'), (21, 32, 'company', '康缘国际实业有限公司'), (6, 19, 'company', '江苏康缘药业股份有限公司'), (0, 6, 'time', '2015年')]]\n",
"[[(27, 33, 'company', '天津大西洋'), (10, 24, 'company', '天津大西洋焊接材料有限公司'), (3, 9, 'location', '上海大西洋'), (0, 4, 'time', '本年度')]]\n"
]
}
],
"source": [
"# code\n",
"# 首先尝试利用开源工具分出实体\n",
"\n",
"import fool\n",
"import pandas as pd\n",
"from copy import copy\n",
"\n",
"\n",
"sample_data = pd.read_csv('../data/info_extract/samples_test.csv', encoding = 'utf-8', header=0)\n",
"sample_data['ner'] = None\n",
"ner_id = 1001\n",
"ner_dict = {} # 存储所有实体\n",
"ner_dict_reverse = {} # 存储所有实体\n",
"for i in range(len(sample_data)):\n",
" sentence = copy(sample_data.iloc[i, 1])\n",
" words, ners = fool.analysis(sentence)\n",
" ners[0].sort(key=lambda x:x[0], reverse=True)\n",
" print(ners)\n",
" for start, end, ner_type, ner_name in ners[0]:\n",
" if ner_name not in ner_dict:\n",
" ner_dict[ner_name] = ner_id\n",
" ner_dict_reverse[ner_id] = ner_name\n",
" ner_id+=1\n",
" sentence = sentence[:start] + ' ner_' + str(ner_dict[ner_name]) + '_ ' + sentence[end-1:]\n",
" sample_data.iloc[i, 2] = sentence\n"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>sentence</th>\n",
" <th>ner</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>深圳能源集团股份有限公司拟按现有2.03%的持股比例参与国泰君安本次可转换公司债的配售,参与...</td>\n",
" <td>ner_1002_ 拟按现有2.03%的持股比例参与 ner_1001_ 本次可转换公司债...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>远大产业控股股份有限公司于报告期实施的发行股份购买资产的交易对方中金波为远大产业控股股份有限...</td>\n",
" <td>ner_1003_ 于报告期实施的发行股份购买资产的交易对方中金波为 ner_1003_ ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>一、根据公司第六届董事会第七次会议审议并通过的公司重大资产重组方案,南京栖霞建设股份有限公司...</td>\n",
" <td>一、根据公司第六届董事会第七次会议审议并通过的公司重大资产重组方案, ner_1007_ 拟...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>一、天士力制药集团股份有限公司(简称“天士力医药集团股份有限公司”、“公司”)拟向子公司天士...</td>\n",
" <td>一、 ner_1014_ (简称“ ner_1013_ ”、“公司”)拟向子公司 ner_1...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>2016年,江苏康缘药业股份有限公司将持有连云港康缘美域保健食品有限公司的股权全部转让给江苏...</td>\n",
" <td>ner_1018_ , ner_1017_ 将持有 ner_1016_ 的股权全部转让给 ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>2015年,江苏康缘药业股份有限公司将持有康缘国际实业有限公司的股权全部转让给江苏康缘集团有...</td>\n",
" <td>ner_1021_ , ner_1017_ 将持有 ner_1019_ 的股权全部转让给 ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>本年度上海大西洋收购天津大西洋焊接材料有限公司将其所持天津大西洋销售44%股权,支付对价8,...</td>\n",
" <td>ner_1025_ ner_1024_ 收购 ner_1023_ 将其所持 ner_10...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id sentence \\\n",
"0 1 深圳能源集团股份有限公司拟按现有2.03%的持股比例参与国泰君安本次可转换公司债的配售,参与... \n",
"1 2 远大产业控股股份有限公司于报告期实施的发行股份购买资产的交易对方中金波为远大产业控股股份有限... \n",
"2 3 一、根据公司第六届董事会第七次会议审议并通过的公司重大资产重组方案,南京栖霞建设股份有限公司... \n",
"3 4 一、天士力制药集团股份有限公司(简称“天士力医药集团股份有限公司”、“公司”)拟向子公司天士... \n",
"4 5 2016年,江苏康缘药业股份有限公司将持有连云港康缘美域保健食品有限公司的股权全部转让给江苏... \n",
"5 6 2015年,江苏康缘药业股份有限公司将持有康缘国际实业有限公司的股权全部转让给江苏康缘集团有... \n",
"6 7 本年度上海大西洋收购天津大西洋焊接材料有限公司将其所持天津大西洋销售44%股权,支付对价8,... \n",
"\n",
" ner \n",
"0 ner_1002_ 拟按现有2.03%的持股比例参与 ner_1001_ 本次可转换公司债... \n",
"1 ner_1003_ 于报告期实施的发行股份购买资产的交易对方中金波为 ner_1003_ ... \n",
"2 一、根据公司第六届董事会第七次会议审议并通过的公司重大资产重组方案, ner_1007_ 拟... \n",
"3 一、 ner_1014_ (简称“ ner_1013_ ”、“公司”)拟向子公司 ner_1... \n",
"4 ner_1018_ , ner_1017_ 将持有 ner_1016_ 的股权全部转让给 ... \n",
"5 ner_1021_ , ner_1017_ 将持有 ner_1019_ 的股权全部转让给 ... \n",
"6 ner_1025_ ner_1024_ 收购 ner_1023_ 将其所持 ner_10... "
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sample_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3 实体统一\n",
"对同一实体具有多个名称的情况进行统一,将多种称谓统一到一个实体上,并体现在实体的属性中(可以给实体建立“别称”属性)\n",
"\n",
"公司名称有其特点,例如后缀可以省略、上市公司的地名可以省略等等。在data/dict目录中提供了几个词典,可供实体统一使用。\n",
"- company_suffix.txt是公司的通用后缀词典\n",
"- company_business_scope.txt是公司经营范围常用词典\n",
"- co_Province_Dim.txt是省份词典\n",
"- co_City_Dim.txt是城市词典\n",
"- stopwords.txt是可供参考的停用词"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# code\n",
"\n",
"import jieba\n",
"import jieba.posseg as pseg\n",
"import re\n",
"import datetime\n",
"\n",
"\n",
"\n",
"#功能:从输入的“公司名”中提取主体(列表形式)\n",
"def main_extract(input_str,stop_word,d_4_delete,d_city_province):\n",
" input_str = replace(input_str,d_city_province)\n",
" #开始分词\n",
" seg = pseg.cut(input_str)\n",
" seg_lst = []\n",
" for w in seg:\n",
" elmt = w.word\n",
" if elmt not in d_4_delete:\n",
" seg_lst.append(elmt)\n",
" seg_lst = remove_word(seg_lst,stop_word)\n",
" seg_lst = city_prov_ahead(seg_lst,d_city_province)\n",
" return seg_lst\n",
"\n",
" \n",
"\n",
"#功能:将list中地名提前\n",
"def city_prov_ahead(seg_lst,d_city_province):\n",
" city_prov_lst = []\n",
" for seg in seg_lst:\n",
" if seg in d_city_province:\n",
" city_prov_lst.append(seg)\n",
" seg_lst.remove(seg)\n",
" city_prov_lst.sort()\n",
" return city_prov_lst+seg_lst\n",
" \n",
" \n",
"\n",
" \n",
"#功能:去除停用词\n",
"def remove_word(seg,sw):\n",
" ret = []\n",
" for i in range(len(seg)):\n",
" if seg[i] not in sw:\n",
" ret.append(seg[i])\n",
" return ret\n",
"\n",
"\n",
"#功能:替换com,dep的内容\n",
"def replace(com,d_city_province):\n",
" #————————公司、部门\n",
" #替换\n",
" #'*'\n",
" com = re.sub(r'(\\*)*(\\#)*(\\-)*(\\—)*(\\~)*(\\.)*(\\/)*(\\?)*(\\!)*(\\?)*(\\\")*','',com)\n",
" #'、'\n",
" com = re.sub(r'(\\、)*','',com)\n",
" #'+'\n",
" com = re.sub(r'(\\+)*','',com)\n",
" #','\n",
" com = re.sub(r'(\\,)+',' ',com)\n",
" #','\n",
" com = re.sub(r'(\\,)+',' ',com)\n",
" #':'\n",
" com = re.sub(r'(\\:)*','',com)\n",
" #[]【】都删除\n",
" com = re.sub(r'\\【.*?\\】','',com)\n",
" com = re.sub(r'\\[.*?\\]','',com)\n",
" #数字在结尾替换为‘’\n",
" com = re.sub(r'\\d*$',\"\",com)\n",
" #'&nbsp;'或‘&lt;’替换为‘’\n",
" com = re.sub(r'(&gt;)*(&nbsp;)*(&lt;)*',\"\",com)\n",
" #地名\n",
" com = re.sub(r'\\(',\"(\",com)\n",
" com = re.sub(r'\\)',\")\",com)\n",
" pat = re.search(r'\\(.+?\\)',com)\n",
" while pat:\n",
" v = pat.group()[3:-3]\n",
" start = pat.span()[0]\n",
" end = pat.span()[1]\n",
" if v not in d_city_province:\n",
" com = com[:start]+com[end:]\n",
" else:\n",
" com = com[:start]+com[start+3:end-3]+com[end:]\n",
" pat = re.search(r'\\(.+?\\)',com)\n",
" #()()\n",
" com = re.sub(r'(\\()*(\\))*(\\()*(\\))*','',com)\n",
" #全数字\n",
" com = re.sub(r'^(\\d)+$',\"\",com)\n",
" return com\n",
"\n",
"\n",
"\n",
"#初始加载步骤\n",
"#输出“用来删除的字典”和“stop word”\n",
"def my_initial():\n",
" fr1 = open(r\"../data/dict/co_City_Dim.txt\", encoding='utf-8')\n",
" fr2 = open(r\"../data/dict/co_Province_Dim.txt\", encoding='utf-8')\n",
" fr3 = open(r\"../data/dict/company_business_scope.txt\", encoding='utf-8')\n",
" fr4 = open(r\"../data/dict/company_suffix.txt\", encoding='utf-8')\n",
" #城市名\n",
" lines1 = fr1.readlines()\n",
" d_4_delete = []\n",
" d_city_province = [re.sub(r'(\\r|\\n)*','',line) for line in lines1]\n",
" #省份名\n",
" lines2 = fr2.readlines()\n",
" l2_tmp = [re.sub(r'(\\r|\\n)*','',line) for line in lines2]\n",
" d_city_province.extend(l2_tmp)\n",
" #公司后缀\n",
" lines3 = fr3.readlines()\n",
" l3_tmp = [re.sub(r'(\\r|\\n)*','',line) for line in lines3]\n",
" lines4 = fr4.readlines()\n",
" l4_tmp = [re.sub(r'(\\r|\\n)*','',line) for line in lines4]\n",
" d_4_delete.extend(l4_tmp)\n",
" #get stop_word\n",
" fr = open(r'../data/dict/stopwords.txt', encoding='utf-8') \n",
" stop_word = fr.readlines()\n",
" stop_word_after = [re.sub(r'(\\r|\\n)*','',stop_word[i]) for i in range(len(stop_word))]\n",
" stop_word_after[-1] = stop_word[-1]\n",
" stop_word = stop_word_after\n",
" return d_4_delete,stop_word,d_city_province\n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"河北银行\n"
]
}
],
"source": [
"d_4_delete,stop_word,d_city_province = my_initial()\n",
"company_name = \"河北银行股份有限公司\"\n",
"lst = main_extract(company_name,stop_word,d_4_delete,d_city_province)\n",
"company_name = ''.join(lst) # 对公司名提取主体部分,将包含相同主体部分的公司统一为一个实体\n",
"print(company_name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 在语句中统一实体\n",
"\n",
"import fool\n",
"import pandas as pd\n",
"from copy import copy\n",
"\n",
"\n",
"sample_data = pd.read_csv('../data/info_extract/samples_test.csv', encoding = 'utf-8', header=0)\n",
"sample_data['ner'] = None\n",
"ner_id = 1001\n",
"ner_dict_new = {} # 存储所有实体\n",
"ner_dict_reverse_new = {} # 存储所有实体\n",
"for i in range(len(sample_data)):\n",
" sentence = copy(sample_data.iloc[i, 1])\n",
" words, ners = fool.analysis(sentence)\n",
" ners[0].sort(key=lambda x:x[0], reverse=True)\n",
" print(ners)\n",
" for start, end, ner_type, ner_name in ners[0]:\n",
" company_main_name = ''.join(main_extract(ner_name,stop_word,d_4_delete,d_city_province)) # 提取公司主体名称\n",
" if company_main_name not in ner_dict:\n",
" ner_dict[company_main_name] = ner_id\n",
" ner_id+=1\n",
" sentence = sentence[:start] + ' ner_' + str(ner_dict[company_main_name]) + '_ ' + sentence[end-1:]\n",
" sample_data.iloc[i, 2] = sentence\n",
"\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.4 关系抽取\n",
"借助句法分析工具,和实体识别的结果,以及正则表达式,设定模版抽取关系,并存储进图数据库。\n",
"\n",
"本次要求抽取股权交易关系,关系为有向边,由投资方指向被投资方。\n",
"\n",
"模板建立可以使用“正则表达式”、“实体间距离”、“实体上下文”、“依存句法”等。\n",
"\n",
"答案提交在submit目录中,命名为info_extract_submit.csv和info_extract_entity.csv。\n",
"- info_extract_entity.csv格式为:第一列是实体编号,第二列是实体名(多个实体名用“|”分隔)\n",
"- info_extract_submit.csv格式为:第一列是关系发起方实体编号,第二列为关系接收方实体编号。\n",
"\n",
"*成绩以抽取的关系准确率以及召回率综合的F值评分"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 建立种子模板"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [],
"source": [
"# code\n",
"\n",
"# 最后提交文件为识别出的整个投资图谱,以及图谱中结点列表与属性。\n",
"\n",
"# 建立模板\n",
"import re\n",
"\n",
"rep1 = re.compile(r'(ner_\\d\\d\\d\\d)_\\s+收购\\s+(ner_\\d\\d\\d\\d)_') # 例子模板\n",
"\n",
"relation_list = [] # 存储已经提取的关系\n",
"for i in range(len(sample_data)):\n",
" sentence = copy(sample_data.iloc[i, 2])\n",
" for v in rep1.findall(sentence+sentence):\n",
" relation_list.append(v)"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('ner_1024', 'ner_1023'), ('ner_1024', 'ner_1023')]"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"relation_list"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 利用bootstrapping搜索"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# code\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.5 存储进图数据库\n",
"\n",
"本次作业我们使用neo4j作为图数据库,neo4j需要java环境,请先配置好环境。"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [],
"source": [
"\n",
"from py2neo import Node, Relationship, Graph\n",
"\n",
"graph = Graph(\n",
" \"http://localhost:7474\", \n",
" username=\"neo4j\", \n",
" password=\"person\"\n",
")\n",
"\n",
"for v in relation_list:\n",
" a = Node('Company', name=v[0])\n",
" b = Node('Company', name=v[1])\n",
" r = Relationship(a, 'INVEST', b)\n",
" s = a | b | r\n",
" graph.create(s)"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(ner_1024)-[:INVEST {}]->(ner_1023)"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"r"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 利用信息抽取技术搭建知识库\n",
"\n",
"在这个notebook文件中,有些模板代码已经提供给你,但你还需要实现更多的功能来完成这个项目。除非有明确要求,你无须修改任何已给出的代码。以**'【练习】'**开始的标题表示接下来的代码部分中有你需要实现的功能。这些部分都配有详细的指导,需要实现的部分也会在注释中以'TODO'标出。请仔细阅读所有的提示。\n",
"\n",
">**提示:**Code 和 Markdown 区域可通过 **Shift + Enter** 快捷键运行。此外,Markdown可以通过双击进入编辑模式。\n",
"\n",
"---\n",
"\n",
"### 让我们开始吧\n",
"\n",
"本项目的目的是结合命名实体识别、依存语法分析、实体消歧、实体统一对网站开放语料抓取的数据建立小型知识图谱。\n",
"\n",
"在现实世界中,你需要拼凑一系列的模型来完成不同的任务;举个例子,用来预测狗种类的算法会与预测人类的算法不同。在做项目的过程中,你可能会遇到不少失败的预测,因为并不存在完美的算法和模型。你最终提交的不完美的解决方案也一定会给你带来一个有趣的学习经验!\n",
"\n",
"\n",
"---\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='step0'></a>\n",
"## 步骤 0:开发句法结构分析工具\n",
"\n",
"### 1.1 开发句法分析工具\n",
"\n",
"在课堂中,我们提到了在计算机科学领域,CYK算法(也称为Cocke–Younger–Kasami算法)是一种用来对上下文无关文法(CFG,Context Free Grammar)进行语法分析(parsing)的算法。\n",
"\n",
"### 【练习1】\n",
"使用CYK算法,根据所提供的:非终结符集合、终结符集合、规则集,对以下句子计算句法结构。并输出最优句法结构的概率。\n",
"\n",
"“the boy saw the dog with the telescope\"\n",
"\n",
"\n",
"\n",
"非终结符集合:N={S, NP, VP, PP, DT, Vi, Vt, NN, IN}\n",
"\n",
"终结符集合:{sleeps, saw, boy, girl, dog, telescope, the, with, in}\n",
"\n",
"规则集: R={\n",
"- (1) S-->NP VP 1.0\n",
"- (2) VP-->Vt NP 0.8\n",
"- (3) VP-->VP PP 0.2\n",
"- (4) NP-->DT NN 0.8\n",
"- (5) NP-->NP PP 0.2\n",
"- (6) PP-->IN NP 1.0\n",
"- (7) Vi-->sleeps 1.0\n",
"- (8) Vt-->saw 1.0\n",
"- (9) NN-->boy 0.1\n",
"- (10) NN-->girl 0.1\n",
"- (11) NN-->telescope 0.3\n",
"- (12) NN-->dog 0.5\n",
"- (13) DT-->the 1.0\n",
"- (14) IN-->with 0.6\n",
"- (15) IN-->in 0.4\n",
"\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# 分数(15)\n",
"class my_CYK(object):\n",
" def __init__(self, non_ternimal, terminal, rules_prob, start_prob):\n",
" self.non_terminal = non_ternimal\n",
" self.terminal = terminal\n",
" self.rules_prob = rules_prob\n",
" self.start_symbol = start_prob\n",
"\n",
"\n",
" def parse_sentence(self, sentence):\n",
" # TODO:parse the sentence with CYK algoritm\n",
" \n",
" sents = sentence.split()\n",
" best_path = [[{} for _ in range(len(sents))] for _ in range(len(sents))]\n",
"\n",
" # initialization\n",
" for i in range(len(sents)):\n",
" for x in self.non_terminal:\n",
" best_path[i][i][x] = {}\n",
" if (sents[i],) in self.rules_prob[x].keys():\n",
" best_path[i][i][x]['prob'] = self.rules_prob[x][(sents[i],)]\n",
" best_path[i][i][x]['path'] = {'split':None, 'rule': sents[i]}\n",
" else:\n",
" best_path[i][i][x]['prob'] = 0\n",
" best_path[i][i][x]['path'] = {'split':None, 'rule': None}\n",
"\n",
" # CKY recursive\n",
" for l in range(1, len(sents)):\n",
" for i in range(len(sents)-l):\n",
" j = i + l\n",
" for x in self.non_terminal:\n",
" tmp_best_x = {'prob':0, 'path':None}\n",
" for key, value in self.rules_prob[x].items():\n",
" if key[0] not in self.non_terminal: \n",
" break\n",
" for s in range(i, j):\n",
" tmp_prob = value * best_path[i][s][key[0]]['prob'] * best_path[s+1][j][key[1]]['prob']\n",
" if tmp_prob > tmp_best_x['prob']:\n",
" tmp_best_x['prob'] = tmp_prob\n",
" tmp_best_x['path'] = {'split': s, 'rule': key}\n",
" best_path[i][j][x] = tmp_best_x\n",
" self.best_path = best_path\n",
"\n",
" # TODO:print the result with tree structure\n",
" self._parse_result(0, len(sents)-1, self.start_symbol)\n",
" \n",
" # TODO:print the probability for most probable parsing\n",
" print(\"prob = \", self.best_path[0][len(sents)-1][self.start_symbol]['prob'])\n",
"\n",
"\n",
" def _parse_result(self, left_idx, right_idx, root, ind=0):\n",
" \"\"\"\n",
" print the result with tree- structure\n",
" \"\"\"\n",
" node = self.best_path[left_idx][right_idx][root]\n",
" if node['path']['split'] is not None:\n",
" print('\\t'*ind, (root, self.rules_prob[root].get(node['path']['rule'])))\n",
" self._parse_result(left_idx, node['path']['split'], node['path']['rule'][0], ind+1)\n",
" self._parse_result(node['path']['split']+1, right_idx, node['path']['rule'][1], ind+1)\n",
" else:\n",
" print('\\t'*ind, (root, self.rules_prob[root].get((node['path']['rule'],))) )\n",
" print('--->', node['path']['rule'])\n",
"\n",
"\n",
"\n",
"def main(sentence):\n",
" non_terminal = {'S', 'NP', 'VP', 'PP', 'DT', 'Vi', 'Vt', 'NN', 'IN'}\n",
" start_symbol = 'S'\n",
" terminal = {'sleeps', 'saw', 'boy', 'girl', 'dog', 'telescope', 'the', 'with', 'in'}\n",
" rules_prob = {'S': {('NP', 'VP'): 1.0},\n",
" 'VP': {('Vt', 'NP'): 0.8, ('VP', 'PP'): 0.2},\n",
" 'NP': {('DT', 'NN'): 0.8, ('NP', 'PP'): 0.2},\n",
" 'PP': {('IN', 'NP'): 1.0},\n",
" 'Vi': {('sleeps',): 1.0},\n",
" 'Vt': {('saw',): 1.0},\n",
" 'NN': {('boy',): 0.1, ('girl',): 0.1,('telescope',): 0.3,('dog',): 0.5},\n",
" 'DT': {('the',): 1.0},\n",
" 'IN': {('with',): 0.6, ('in',): 0.4},\n",
" }\n",
" cyk = my_CYK(non_terminal, terminal, rules_prob, start_symbol)\n",
" cyk.parse_sentence(sentence)\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" ('S', 1.0)\n",
"\t ('NP', 0.8)\n",
"\t\t ('DT', 1.0)\n",
"---> the\n",
"\t\t ('NN', 0.1)\n",
"---> boy\n",
"\t ('VP', 0.2)\n",
"\t\t ('VP', 0.8)\n",
"\t\t\t ('Vt', 1.0)\n",
"---> saw\n",
"\t\t\t ('NP', 0.8)\n",
"\t\t\t\t ('DT', 1.0)\n",
"---> the\n",
"\t\t\t\t ('NN', 0.5)\n",
"---> dog\n",
"\t\t ('PP', 1.0)\n",
"\t\t\t ('IN', 0.6)\n",
"---> with\n",
"\t\t\t ('NP', 0.8)\n",
"\t\t\t\t ('DT', 1.0)\n",
"---> the\n",
"\t\t\t\t ('NN', 0.3)\n",
"---> telescope\n",
"prob = 0.0007372800000000003\n"
]
}
],
"source": [
"# TODO: 对该测试用例进行测试\n",
"if __name__ == \"__main__\":\n",
" sentence = \"the boy saw the dog with the telescope\"\n",
" main(sentence)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 问题1:计算算法复杂度\n",
"计算上一节开发的算法所对应的时间复杂度和空间复杂度。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO: 上面所写的算法的时间复杂度和空间复杂度分别是多少?\n",
"时间复杂度=O(), 空间复杂度=O()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 步骤 2:实体统一"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"实体统一做的是对同一实体具有多个名称的情况进行统一,将多种称谓统一到一个实体上,并体现在实体的属性中(可以给实体建立“别称”属性)\n",
"\n",
"例如:对“河北银行股份有限公司”、“河北银行公司”和“河北银行”我们都可以认为是一个实体,我们就可以将通过提取前两个称谓的主要内容,得到“河北银行”这个实体关键信息。\n",
"\n",
"公司名称有其特点,例如后缀可以省略、上市公司的地名可以省略等等。在data/dict目录中提供了几个词典,可供实体统一使用。\n",
"- company_suffix.txt是公司的通用后缀词典\n",
"- company_business_scope.txt是公司经营范围常用词典\n",
"- co_Province_Dim.txt是省份词典\n",
"- co_City_Dim.txt是城市词典\n",
"- stopwords.txt是可供参考的停用词\n",
"\n",
"### 练习2:\n",
"编写main_extract函数,实现对实体的名称提取“主体名称”的功能。"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"import jieba\n",
"import jieba.posseg as pseg\n",
"import re\n",
"import datetime\n",
"\n",
"\n",
"# TODO: 从输入的“公司名”中提取主体\n",
"def main_extract(input_str,stop_word,d_4_delete,d_city_province):\n",
" input_str = replace(input_str,d_city_province)\n",
" #开始分词\n",
" seg = pseg.cut(input_str)\n",
" seg_lst = []\n",
" for w in seg:\n",
" elmt = w.word\n",
" if elmt not in d_4_delete:\n",
" seg_lst.append(elmt)\n",
" seg_lst = remove_word(seg_lst,stop_word)\n",
" seg_lst = city_prov_ahead(seg_lst,d_city_province)\n",
" return seg_lst\n",
"\n",
" \n",
"#TODO:实现公司名称中地名提前\n",
"def city_prov_ahead(seg_lst,d_city_province):\n",
" city_prov_lst = []\n",
" for seg in seg_lst:\n",
" if seg in d_city_province:\n",
" city_prov_lst.append(seg)\n",
" seg_lst.remove(seg)\n",
" city_prov_lst.sort()\n",
" return city_prov_lst+seg_lst\n",
"\n",
"\n",
"#TODO:替换特殊符号\n",
"def replace(com,d_city_province):\n",
" #————————公司\n",
" #地名\n",
" com = re.sub(r'\\(',\"(\",com)\n",
" com = re.sub(r'\\)',\")\",com)\n",
" pat = re.search(r'\\(.+?\\)',com)\n",
" while pat:\n",
" v = pat.group()[3:-3]\n",
" start = pat.span()[0]\n",
" end = pat.span()[1]\n",
" if v not in d_city_province:\n",
" com = com[:start]+com[end:]\n",
" else:\n",
" com = com[:start]+com[start+3:end-3]+com[end:]\n",
" pat = re.search(r'\\(.+?\\)',com)\n",
" #()()\n",
" com = re.sub(r'(\\()*(\\))*(\\()*(\\))*','',com)\n",
" #全数字\n",
" com = re.sub(r'^(\\d)+$',\"\",com)\n",
" return com\n",
"\n",
"\n",
"#TODO:初始化,加载词典\n",
"#输出“用来删除的字典”和“stop word”\n",
"def my_initial():\n",
" fr1 = open(r\"../data/dict/co_City_Dim.txt\", encoding='utf-8')\n",
" fr2 = open(r\"../data/dict/co_Province_Dim.txt\", encoding='utf-8')\n",
" fr3 = open(r\"../data/dict/company_business_scope.txt\", encoding='utf-8')\n",
" fr4 = open(r\"../data/dict/company_suffix.txt\", encoding='utf-8')\n",
" #城市名\n",
" lines1 = fr1.readlines()\n",
" d_4_delete = []\n",
" d_city_province = [re.sub(r'(\\r|\\n)*','',line) for line in lines1]\n",
" #省份名\n",
" lines2 = fr2.readlines()\n",
" l2_tmp = [re.sub(r'(\\r|\\n)*','',line) for line in lines2]\n",
" d_city_province.extend(l2_tmp)\n",
" #公司后缀\n",
" lines3 = fr3.readlines()\n",
" l3_tmp = [re.sub(r'(\\r|\\n)*','',line) for line in lines3]\n",
" lines4 = fr4.readlines()\n",
" l4_tmp = [re.sub(r'(\\r|\\n)*','',line) for line in lines4]\n",
" d_4_delete.extend(l4_tmp)\n",
" #get stop_word\n",
" fr = open(r'../data/dict/stopwords.txt', encoding='utf-8') \n",
" stop_word = fr.readlines()\n",
" stop_word_after = [re.sub(r'(\\r|\\n)*','',stop_word[i]) for i in range(len(stop_word))]\n",
" stop_word_after[-1] = stop_word[-1]\n",
" stop_word = stop_word_after\n",
" return d_4_delete,stop_word,d_city_province\n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Building prefix dict from the default dictionary ...\n",
"Loading model from cache C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\jieba.cache\n",
"Loading model cost 0.732 seconds.\n",
"Prefix dict has been built succesfully.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"河北银行\n"
]
}
],
"source": [
"# TODO:测试实体统一用例\n",
"d_4_delete,stop_word,d_city_province = my_initial()\n",
"company_name = \"河北银行股份有限公司\"\n",
"lst = main_extract(company_name,stop_word,d_4_delete,d_city_province)\n",
"company_name = ''.join(lst) # 对公司名提取主体部分,将包含相同主体部分的公司统一为一个实体\n",
"print(company_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 步骤 3:实体识别\n",
"有很多开源工具可以帮助我们对实体进行识别。常见的有LTP、StanfordNLP、FoolNLTK等等。\n",
"\n",
"本次采用FoolNLTK实现实体识别,fool是一个基于bi-lstm+CRF算法开发的深度学习开源NLP工具,包括了分词、实体识别等功能,大家可以通过fool很好地体会深度学习在该任务上的优缺点。\n",
"\n",
"在‘data/train_data.csv’和‘data/test_data.csv’中是从网络上爬虫得到的上市公司公告,数据样例如下:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>sentence</th>\n",
" <th>tag</th>\n",
" <th>member1</th>\n",
" <th>member2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>6461</td>\n",
" <td>与本公司关系:受同一公司控制 2,杭州富生电器有限公司企业类型: 有限公司注册地址: 富阳市...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2111</td>\n",
" <td>三、关联交易标的基本情况 1、交易标的基本情况 公司名称:红豆集团财务有限公司 公司地址:无...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>9603</td>\n",
" <td>2016年协鑫集成科技股份有限公司向瑞峰(张家港)光伏科技有限公司支付设备款人民币4,515...</td>\n",
" <td>1</td>\n",
" <td>协鑫集成科技股份有限公司</td>\n",
" <td>瑞峰(张家港)光伏科技有限公司</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3456</td>\n",
" <td>证券代码:600777 证券简称:新潮实业 公告编号:2015-091 烟台新潮实业股份有限...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>8844</td>\n",
" <td>本集团及广发证券股份有限公司持有辽宁成大股份有限公司股票的本期变动系买卖一揽子沪深300指数...</td>\n",
" <td>1</td>\n",
" <td>广发证券股份有限公司</td>\n",
" <td>辽宁成大股份有限公司</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id sentence tag member1 \\\n",
"0 6461 与本公司关系:受同一公司控制 2,杭州富生电器有限公司企业类型: 有限公司注册地址: 富阳市... 0 0 \n",
"1 2111 三、关联交易标的基本情况 1、交易标的基本情况 公司名称:红豆集团财务有限公司 公司地址:无... 0 0 \n",
"2 9603 2016年协鑫集成科技股份有限公司向瑞峰(张家港)光伏科技有限公司支付设备款人民币4,515... 1 协鑫集成科技股份有限公司 \n",
"3 3456 证券代码:600777 证券简称:新潮实业 公告编号:2015-091 烟台新潮实业股份有限... 0 0 \n",
"4 8844 本集团及广发证券股份有限公司持有辽宁成大股份有限公司股票的本期变动系买卖一揽子沪深300指数... 1 广发证券股份有限公司 \n",
"\n",
" member2 \n",
"0 0 \n",
"1 0 \n",
"2 瑞峰(张家港)光伏科技有限公司 \n",
"3 0 \n",
"4 辽宁成大股份有限公司 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_data = pd.read_csv('../data/info_extract/train_data.csv', encoding = 'gb2312', header=0)\n",
"train_data.head()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>sentence</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>9259</td>\n",
" <td>2015年1月26日,多氟多化工股份有限公司与李云峰先生签署了《附条件生效的股份认购合同》</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>9136</td>\n",
" <td>2、2016年2月5日,深圳市新纶科技股份有限公司与侯毅先</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>220</td>\n",
" <td>2015年10月26日,山东华鹏玻璃股份有限公司与张德华先生签署了附条件生效条件的《股份认购合同》</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>9041</td>\n",
" <td>2、2015年12月31日,印纪娱乐传媒股份有限公司与肖文革签订了《印纪娱乐传媒股份有限公司...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>10041</td>\n",
" <td>一、金发科技拟与熊海涛女士签订《股份转让协议》,协议约定:以每股1.0509元的收购价格,收...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id sentence\n",
"0 9259 2015年1月26日,多氟多化工股份有限公司与李云峰先生签署了《附条件生效的股份认购合同》\n",
"1 9136 2、2016年2月5日,深圳市新纶科技股份有限公司与侯毅先\n",
"2 220 2015年10月26日,山东华鹏玻璃股份有限公司与张德华先生签署了附条件生效条件的《股份认购合同》\n",
"3 9041 2、2015年12月31日,印纪娱乐传媒股份有限公司与肖文革签订了《印纪娱乐传媒股份有限公司...\n",
"4 10041 一、金发科技拟与熊海涛女士签订《股份转让协议》,协议约定:以每股1.0509元的收购价格,收..."
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_data = pd.read_csv('../data/info_extract/test_data.csv', encoding = 'gb2312', header=0)\n",
"test_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"我们选取一部分样本进行标注,即train_data,该数据由5列组成。id列表示原始样本序号;sentence列为我们截取的一段关键信息;如果关键信息中存在两个实体之间有股权交易关系则tag列为1,否则为0;如果tag为1,则在member1和member2列会记录两个实体出现在sentence中的名称。\n",
"\n",
"剩下的样本没有标注,即test_data,该数据只有id和sentence两列,希望你能训练模型对test_data中的实体进行识别,并判断实体对之间有没有股权交易关系。\n",
"\n",
"### 练习3:\n",
"将每句句子中实体识别出,存入实体词典,并用特殊符号替换语句。\n"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"# 处理test数据,利用开源工具进行实体识别和并使用实体统一函数存储实体\n",
"\n",
"import fool\n",
"import pandas as pd\n",
"from copy import copy\n",
"\n",
"\n",
"test_data = pd.read_csv('../data/info_extract/test_data.csv', encoding = 'gb2312', header=0)\n",
"test_data['ner'] = None\n",
"ner_id = 1001\n",
"ner_dict_new = {} # 存储所有实体\n",
"ner_dict_reverse_new = {} # 存储所有实体\n",
"\n",
"for i in range(len(test_data)):\n",
" sentence = copy(test_data.iloc[i, 1])\n",
" words, ners = fool.analysis(sentence)\n",
" ners[0].sort(key=lambda x:x[0], reverse=True)\n",
" for start, end, ner_type, ner_name in ners[0]:\n",
" if ner_type=='company' or ner_type=='person':\n",
" company_main_name = ''.join(main_extract(ner_name,stop_word,d_4_delete,d_city_province)) # 提取公司主体名称\n",
" if company_main_name not in ner_dict_new:\n",
" ner_dict_new[company_main_name] = ner_id\n",
" ner_id+=1\n",
" sentence = sentence[:start] + ' ner_' + str(ner_dict_new[company_main_name]) + '_ ' + sentence[end-1:]\n",
" test_data.iloc[i, -1] = sentence\n",
"\n",
"X_test = test_data[['ner']]\n"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"# 处理train数据,利用开源工具进行实体识别和并使用实体统一函数存储实体\n",
"train_data = pd.read_csv('../data/info_extract/train_data.csv', encoding = 'gb2312', header=0)\n",
"train_data['ner'] = None\n",
"\n",
"for i in range(len(train_data)):\n",
" if train_data.iloc[i,:]['member1']=='0' and train_data.iloc[i,:]['member2']=='0':\n",
" sentence = copy(train_data.iloc[i, 1])\n",
" words, ners = fool.analysis(sentence)\n",
" ners[0].sort(key=lambda x:x[0], reverse=True)\n",
" for start, end, ner_type, ner_name in ners[0]:\n",
" if ner_type=='company' or ner_type=='person':\n",
" company_main_name = ''.join(main_extract(ner_name,stop_word,d_4_delete,d_city_province)) # 提取公司主体名称\n",
" if company_main_name not in ner_dict_new:\n",
" ner_dict_new[company_main_name] = ner_id\n",
" ner_id+=1\n",
" sentence = sentence[:start] + ' ner_' + str(ner_dict_new[company_main_name]) + '_ ' + sentence[end-1:]\n",
" train_data.iloc[i, -1] = sentence\n",
" else:\n",
" train_data.iloc[i, -1] = re.sub(r'%s|%s'%(train_data.iloc[i,:]['member1'],train_data.iloc[i,:]['member2']), ' ner_1 ', train_data.iloc[i,:]['sentence'])\n",
"\n",
"y = train_data.loc[:,['tag']]\n",
"train_num = len(train_data)\n",
"X_train = train_data[['ner']]\n",
"\n",
"# 将train和test放在一起提取特征\n",
"X = pd.concat([X_train, X_test])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 步骤 4:关系抽取\n",
"\n",
"\n",
"目标:借助句法分析工具,和实体识别的结果,以及文本特征,基于训练数据抽取关系,并存储进图数据库。\n",
"\n",
"本次要求抽取股权交易关系,关系为有向边,由投资方指向被投资方。\n",
"\n",
"模板建立可以使用“正则表达式”、“实体间距离”、“实体上下文”、“依存句法”等。\n",
"\n",
"答案提交在submit目录中,命名为info_extract_submit.csv和info_extract_entity.csv。\n",
"- info_extract_entity.csv格式为:第一列是实体编号,第二列是实体名(实体统一的多个实体名用“|”分隔)\n",
"- info_extract_submit.csv格式为:第一列是关系中实体1的编号,第二列为关系中实体2的编号。\n",
"\n",
"示例:\n",
"- info_extract_entity.csv\n",
"\n",
"| 实体编号 | 实体名 |\n",
"| ------ | ------ |\n",
"| 1001 | 小王 |\n",
"| 1002 | A化工厂 |\n",
"\n",
"- info_extract_submit.csv\n",
"\n",
"| 实体1 | 实体2 |\n",
"| ------ | ------ |\n",
"| 1001 | 1003 |\n",
"| 1002 | 1001 |"
]
},
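{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch (an assumption, not the original solution) of writing the two\n",
"# submission files in the format described above; it reuses ner_dict_reverse_new from\n",
"# step 3 and assumes a predicted list of (id1, id2) pairs named relation_pairs.\n",
"import pandas as pd\n",
"\n",
"entity_rows = [[eid, name] for eid, name in ner_dict_reverse_new.items()]\n",
"pd.DataFrame(entity_rows).to_csv('../submit/info_extract_entity.csv', index=False, header=['entity_id', 'entity_name'])\n",
"pd.DataFrame(list(relation_pairs)).to_csv('../submit/info_extract_submit.csv', index=False, header=['entity_1', 'entity_2'])"
]
},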
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 练习4:提取文本tf-idf特征\n",
"\n",
"去除停用词,并转换成tfidf向量。"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# code\n",
"from sklearn.feature_extraction.text import TfidfTransformer \n",
"from sklearn.feature_extraction.text import CountVectorizer \n",
"from pyltp import Segmentor\n",
"\n",
"# 初始化实例\n",
"segmentor = Segmentor() \n",
"# 加载模型,加载自定义词典\n",
"segmentor.load_with_lexicon('/Users/Badrain/Downloads/ltp_data_v3.4.0/cws.model', '../data/user_dict.txt') \n",
"\n",
"# 加载停用词\n",
"fr = open(r'../data/dict/stopwords.txt', encoding='utf-8') \n",
"stop_word = fr.readlines()\n",
"stop_word = [re.sub(r'(\\r|\\n)*','',stop_word[i]) for i in range(len(stop_word))]\n",
"\n",
"\n",
"f = lambda x: ' '.join([for word in segmentor.segment(x) if word not in stop_word])\n",
"corpus=sample_data['sentence'].map(f).tolist()\n",
"\n",
"\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"tfidf = TfidfVectorizer()\n",
"re = tfidf.fit_transform(corpus)\n",
"\n",
"tfidf_feature = pd.DataFrame(tfidf_feature.toarray())\n",
"# print(re)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 练习5:提取句法特征\n",
"除了词语层面的句向量特征,我们还可以从句法入手,提取一些句法分析的特征。\n",
"\n",
"参考特征:\n",
"\n",
"1、企业实体间距离\n",
"\n",
"2、企业实体间句法距离\n",
"\n",
"3、企业实体分别和关键触发词的距离\n",
"\n",
"4、实体的依存关系类别"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"# -*- coding: utf-8 -*-\n",
"from pyltp import Parser\n",
"from pyltp import Segmentor\n",
"from pyltp import Postagger\n",
"import networkx as nx\n",
"import pylab\n",
"import re\n",
"\n",
"postagger = Postagger() # 初始化实例\n",
"postagger.load_with_lexicon('/Users/Badrain/Downloads/ltp_data_v3.4.0/pos.model', '../data/user_dict.txt') # 加载模型\n",
"segmentor = Segmentor() # 初始化实例\n",
"segmentor.load_with_lexicon('/Users/Badrain/Downloads/ltp_data_v3.4.0/cws.model', '../data/user_dict.txt') # 加载模型\n",
"\n",
"\n",
"# 实体符号加入分词词典\n",
"with open('../data/user_dict.txt', 'w') as fw:\n",
" for v in ['一', '二', '三', '四', '五', '六', '七', '八', '九', '十']:\n",
" fw.write( v + '号企业 ni\\n')\n",
"\n",
"\n",
"def parse(s):\n",
" \"\"\"\n",
" 对语句进行句法分析,并返回句法结果\n",
" parse_result:依存句法解析结果\n",
" source:企业实体的词序号\n",
" target:另一个企业实体的词序号\n",
" keyword_pos:关键词词序号列表\n",
" source_dep:企业实体依存句法类型\n",
" target_dep:另一个企业实体依存句法类型\n",
" \"\"\"\n",
" tmp_ner_dict = {}\n",
" num_lst = ['一', '二', '三', '四', '五', '六', '七', '八', '九', '十']\n",
"\n",
" # 将公司代码替换为特殊称谓,保证分词词性正确\n",
" for i, ner in enumerate(list(set(re.findall(r'(ner\\_\\d\\d\\d\\d\\_)', s)))):\n",
" try:\n",
" tmp_ner_dict[num_lst[i]+'号企业'] = ner\n",
" except IndexError:\n",
" return None, None, None, None, None, None\n",
" s = s.replace(ner, num_lst[i]+'号企业')\n",
" words = segmentor.segment(s)\n",
" tags = postagger.postag(words)\n",
" parser = Parser() # 初始化实例\n",
" parser.load('/Users/Badrain/Downloads/ltp_data_v3.4.0/parser.model') # 加载模型\n",
" arcs = parser.parse(words, tags) # 句法分析\n",
" arcs_lst = list(map(list, zip(*[[arc.head, arc.relation] for arc in arcs])))\n",
" \n",
" # 句法分析结果输出\n",
" parse_result = pd.DataFrame([[a,b,c,d] for a,b,c,d in zip(list(words),list(tags), arcs_lst[0], arcs_lst[1])], index = range(1,len(words)+1))\n",
" parser.release() # 释放模型\n",
" \n",
" # 能找到两个企业以上才返回结果,目前简化,只考虑前两家企业的关系\n",
" try:\n",
" source = list(words).index('一号企业')+1\n",
" target = list(words).index('二号企业')+1\n",
" source_dep = arcs_lst[1][source-1]\n",
" target_dep = arcs_lst[1][target-1]\n",
" except:\n",
" return None, None, None, None, None, None\n",
" \n",
" # 投资关系关键词\n",
" key_words = [\"收购\",\"竞拍\",\"转让\",\"扩张\",\"并购\",\"注资\",\"整合\",\"并入\",\"竞购\",\"竞买\",\"支付\",\"收购价\",\"收购价格\",\"承购\",\"购得\",\"购进\",\n",
" \"购入\",\"买进\",\"买入\",\"赎买\",\"购销\",\"议购\",\"函购\",\"函售\",\"抛售\",\"售卖\",\"销售\",\"转售\"]\n",
" keyword_pos = [list(words).index(w)+1 if w in list(words) else -1 for w in key_words]\n",
" \n",
" return parse_result, source, target, keyword_pos, source_dep, target_dep\n",
"\n",
"\n",
"def shortest_path(arcs_ret, source, target):\n",
" \"\"\"\n",
" 求出两个词最短依存句法路径,不存在路径返回-1\n",
" \"\"\"\n",
" G=nx.DiGraph()\n",
" # 为这个网络添加节点...\n",
" for i in list(arcs_ret.index):\n",
" G.add_node(i)\n",
" # 在网络中添加带权中的边...\n",
" for i in list(arcs_ret.index):\n",
" G.add_edges_from([( arcs_ret.iloc[i-1, 2], i )])\n",
" G.add_edges_from([( i, arcs_ret.iloc[i-1, 2] )])\n",
"\n",
" try:\n",
" distance=nx.shortest_path_length(G,source,target)\n",
" # print('源节点为0,终点为7,最短距离:', distance)\n",
" return distance\n",
" except:\n",
" return -1\n",
"\n",
"\n",
"def get_parse_feature(s):\n",
" \"\"\"\n",
" 综合上述函数汇总句法分析特征\n",
" \"\"\"\n",
" parse_result, source, target, keyword_pos, source_dep, target_dep = parse(s)\n",
" if parse_result is None:\n",
" return [-1]*59\n",
" features = []\n",
" features.append(shortest_path(parse_result, source, target))\n",
" keyword_feature = []\n",
" for p in keyword_pos:\n",
" if p==-1:\n",
" keyword_feature.append(-1)\n",
" keyword_feature.append(-1)\n",
" else:\n",
" keyword_feature.append(shortest_path(parse_result, source, p))\n",
" keyword_feature.append(shortest_path(parse_result, target, p))\n",
" features.extend(keyword_feature)\n",
" features.extend([source_dep, target_dep])\n",
" return features\n"
]
},
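{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check for `parse()` (an added sketch, not in the original notebook): the sentence below is made up, and the `ner_1001_`/`ner_1002_` codes are placeholders in the format produced by Step 2. It assumes the LTP models are available at the paths configured above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"demo = ' ner_1001_ 拟收购 ner_1002_ 的部分股权'  # hypothetical coded sentence\n",
"res = parse(demo)\n",
"if res[0] is not None:\n",
"    parse_result, source, target, keyword_pos, source_dep, target_dep = res\n",
"    # dependency-path length between the two entities, and their relation labels\n",
"    print(shortest_path(parse_result, source, target), source_dep, target_dep)"
]
},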
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"# 对所有样本统一派生特征\n",
"\n",
"f = lambda x: get_parse_feature(x)\n",
"parse_feature = X.map(f)\n",
"\n",
"whole_feature = []\n",
"for i in range(len(parse_feature)):\n",
" whole_feature.append(list(tfidf_feature.iloc[i,:]) + parse_feature[i])\n",
"\n",
"whole_feature = pd.DataFrame(whole_feature)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 练习6:建立分类器\n",
"\n",
"利用已经提取好的tfidf特征以及parse特征,建立分类器进行分类任务。"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"# 建立分类器进行分类\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn import preprocessing\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"# TODO:定义需要遍历的参数\n",
"parameters = { 'C':np.logspace(-3,3,7)}\n",
"\n",
"# TODO:选择模型\n",
"lr = LogisticRegression()\n",
"\n",
"# TODO:利用GridSearchCV搜索最佳参数\n",
"clf = GridSearchCV(lr, parameters, cv=5)\n",
"clf.fit(train_x, y)\n",
"\n",
"# TODO:对Test_data进行分类\n",
"predict = clf.predict(test_x)\n",
"predict_prob = clf.predict_proba(test_x)\n",
"\n",
"\n",
"# TODO:保存结果\n",
"# 答案提交在submit目录中,命名为info_extract_submit.csv和info_extract_entity.csv。\n",
"# info_extract_entity.csv格式为:第一列是实体编号,第二列是实体名(实体统一的多个实体名用“|”分隔)\n",
"# info_extract_submit.csv格式为:第一列是关系中实体1的编号,第二列为关系中实体2的编号。\n",
"\n",
" "
]
},
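{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optional check (an added sketch, not in the original notebook): inspect what the grid search selected and its cross-validated score before trusting the test predictions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(clf.best_params_)  # the C value chosen by GridSearchCV\n",
"print(clf.best_score_)   # mean 5-fold cross-validation accuracy"
]
},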
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 练习7:操作图数据库\n",
"对关系最好的描述就是用图,那这里就需要使用图数据库,目前最常用的图数据库是noe4j,通过cypher语句就可以操作图数据库的增删改查。可以参考“https://cuiqingcai.com/4778.html”。\n",
"\n",
"本次作业我们使用neo4j作为图数据库,neo4j需要java环境,请先配置好环境。\n",
"\n",
"将我们提出的实体关系插入图数据库,并查询某节点的3层投资关系。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"from py2neo import Node, Relationship, Graph\n",
"\n",
"graph = Graph(\n",
" \"http://localhost:7474\", \n",
" username=\"neo4j\", \n",
" password=\"person\"\n",
")\n",
"\n",
"for v in relation_list:\n",
" a = Node('Company', name=v[0])\n",
" b = Node('Company', name=v[1])\n",
" \n",
" # 本次不区分投资方和被投资方,无向图\n",
" r = Relationship(a, 'INVEST', b)\n",
" s = a | b | r\n",
" graph.create(s)\n",
" r = Relationship(b, 'INVEST', a)\n",
" s = a | b | r\n",
" graph.create(s)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO:查询某节点的3层投资关系\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 步骤5:实体消歧\n",
"解决了实体识别和关系的提取,我们已经完成了一大截,但是我们提取的实体究竟对应知识库中哪个实体呢?下图中,光是“苹果”就对应了13个同名实体。\n",
"<img src=\"../image/baike2.png\", width=340, heigth=480>\n",
"\n",
"在这个问题上,实体消歧旨在解决文本中广泛存在的名称歧义问题,将句中识别的实体与知识库中实体进行匹配,解决实体歧义问题。\n",
"\n",
"\n",
"### 练习8:\n",
"匹配test_data.csv中前25条样本中的人物实体对应的百度百科URL(此部分样本中所有人名均可在百度百科中链接到)。\n",
"\n",
"利用scrapy、beautifulsoup、request等python包对百度百科进行爬虫,判断是否具有一词多义的情况,如果有的话,选择最佳实体进行匹配。\n",
"\n",
"使用URL为‘https://baike.baidu.com/item/’+人名 可以访问百度百科该人名的词条,此处需要根据爬取到的网页识别该词条是否对应多个实体,如下图:\n",
"<img src=\"../image/baike1.png\", width=440, heigth=480>\n",
"如果该词条有对应多个实体,请返回正确匹配的实体URL,例如该示例网页中的‘https://baike.baidu.com/item/陆永/20793929’。\n",
"\n",
"- 提交文件:entity_disambiguation_submit.csv\n",
"- 提交格式:第一列为实体id(与info_extract_submit.csv中id保持一致),第二列为对应URL。\n",
"- 示例:\n",
"\n",
"| 实体编号 | URL |\n",
"| ------ | ------ |\n",
"| 1001 | https://baike.baidu.com/item/陆永/20793929 |\n",
"| 1002 | https://baike.baidu.com/item/王芳/567232 |\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import jieba\n",
"import pandas as pd\n",
"\n",
"# 找出test_data.csv中前25条样本所有的人物名称,以及人物所在文档的上下文内容\n",
"test_data = pd.read_csv('../data/info_extract/test_data.csv', encoding = 'gb2312', header=0)\n",
"\n",
"# 存储人物以及上下文信息(key为人物ID,value为人物名称、人物上下文内容)\n",
"person_name = {}\n",
"\n",
"# 观察上下文的窗口大小\n",
"window = 10 \n",
"\n",
"# 遍历前25条样本\n",
"for i in range(25):\n",
" sentence = copy(test_data.iloc[i, 1])\n",
" words, ners = fool.analysis(sentence)\n",
" ners[0].sort(key=lambda x:x[0], reverse=True)\n",
" for start, end, ner_type, ner_name in ners[0]:\n",
" if ner_type=='person':\n",
" # TODO:提取上下文\n",
" person_name[ner_name] = person_name.get(ner_name)+sentence[max(0, idx-window):min(len(sentence)-1, idx+window)]\n",
"\n",
"\n",
"\n",
"# 利用爬虫得到每个人物名称对应的URL\n",
"# TODO:找到每个人物实体的词条内容。\n",
"\n",
"# TODO:将样本中人物上下文与爬取词条结果进行对比,选择最接近的词条。\n",
"\n",
"\n",
"\n",
"# 输出结果\n",
"pd.DataFrame(result_data).to_csv('../submit/entity_disambiguation_submit.csv', index=False)\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 利用信息抽取技术搭建知识库\n",
"\n",
"在这个notebook文件中,有些模板代码已经提供给你,但你还需要实现更多的功能来完成这个项目。除非有明确要求,你无须修改任何已给出的代码。以**'【练习】'**开始的标题表示接下来的代码部分中有你需要实现的功能。这些部分都配有详细的指导,需要实现的部分也会在注释中以'TODO'标出。请仔细阅读所有的提示。\n",
"\n",
">**提示:**Code 和 Markdown 区域可通过 **Shift + Enter** 快捷键运行。此外,Markdown可以通过双击进入编辑模式。\n",
"\n",
"---\n",
"\n",
"### 让我们开始吧\n",
"\n",
"本项目的目的是结合命名实体识别、依存语法分析、实体消歧、实体统一对网站开放语料抓取的数据建立小型知识图谱。\n",
"\n",
"在现实世界中,你需要拼凑一系列的模型来完成不同的任务;举个例子,用来预测狗种类的算法会与预测人类的算法不同。在做项目的过程中,你可能会遇到不少失败的预测,因为并不存在完美的算法和模型。你最终提交的不完美的解决方案也一定会给你带来一个有趣的学习经验!\n",
"\n",
"\n",
"---\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 步骤 1:实体统一"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"实体统一做的是对同一实体具有多个名称的情况进行统一,将多种称谓统一到一个实体上,并体现在实体的属性中(可以给实体建立“别称”属性)\n",
"\n",
"例如:对“河北银行股份有限公司”、“河北银行公司”和“河北银行”我们都可以认为是一个实体,我们就可以将通过提取前两个称谓的主要内容,得到“河北银行”这个实体关键信息。\n",
"\n",
"公司名称有其特点,例如后缀可以省略、上市公司的地名可以省略等等。在data/dict目录中提供了几个词典,可供实体统一使用。\n",
"- company_suffix.txt是公司的通用后缀词典\n",
"- company_business_scope.txt是公司经营范围常用词典\n",
"- co_Province_Dim.txt是省份词典\n",
"- co_City_Dim.txt是城市词典\n",
"- stopwords.txt是可供参考的停用词\n",
"\n",
"### 练习1:\n",
"编写main_extract函数,实现对实体的名称提取“主体名称”的功能。"
]
},
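{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before writing `main_extract`, it may help to see what `jieba.posseg` actually returns: a generator of pair objects with `.word` and `.flag` (POS tag) attributes. The cell below is an added illustrative sketch, not part of the original template."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import jieba.posseg as pseg\n",
"\n",
"# each item is a pair: pair.word is the token, pair.flag its POS tag\n",
"print([(p.word, p.flag) for p in pseg.cut('河北银行股份有限公司')])"
]
},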
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import jieba\n",
"import jieba.posseg as pseg\n",
"import re\n",
"import datetime\n",
"\n",
"\n",
"# 从输入的“公司名”中提取主体\n",
"def main_extract(input_str,stop_word,d_4_delete,d_city_province):\n",
" # 开始分词并处理\n",
" seg = pseg.cut(input_str)\n",
" seg_lst = remove_word(seg,stop_word,d_4_delete)\n",
" seg_lst = city_prov_ahead(seg,d_city_province)\n",
" return seg_lst\n",
"\n",
" \n",
"#TODO:实现公司名称中地名提前\n",
"def city_prov_ahead(seg,d_city_province):\n",
" city_prov_lst = []\n",
" # TODO ...\n",
" \n",
" return city_prov_lst+seg_lst\n",
"\n",
"\n",
"\n",
"\n",
"#TODO:替换特殊符号\n",
"def remove_word(seg,stop_word,d_4_delete):\n",
" # TODO ...\n",
" \n",
" return seg_lst\n",
"\n",
"\n",
"# 初始化,加载词典\n",
"def my_initial():\n",
" fr1 = open(r\"../data/dict/co_City_Dim.txt\", encoding='utf-8')\n",
" fr2 = open(r\"../data/dict/co_Province_Dim.txt\", encoding='utf-8')\n",
" fr3 = open(r\"../data/dict/company_business_scope.txt\", encoding='utf-8')\n",
" fr4 = open(r\"../data/dict/company_suffix.txt\", encoding='utf-8')\n",
" #城市名\n",
" lines1 = fr1.readlines()\n",
" d_4_delete = []\n",
" d_city_province = [re.sub(r'(\\r|\\n)*','',line) for line in lines1]\n",
" #省份名\n",
" lines2 = fr2.readlines()\n",
" l2_tmp = [re.sub(r'(\\r|\\n)*','',line) for line in lines2]\n",
" d_city_province.extend(l2_tmp)\n",
" #公司后缀\n",
" lines3 = fr3.readlines()\n",
" l3_tmp = [re.sub(r'(\\r|\\n)*','',line) for line in lines3]\n",
" lines4 = fr4.readlines()\n",
" l4_tmp = [re.sub(r'(\\r|\\n)*','',line) for line in lines4]\n",
" d_4_delete.extend(l4_tmp)\n",
" #get stop_word\n",
" fr = open(r'../data/dict/stopwords.txt', encoding='utf-8') \n",
" stop_word = fr.readlines()\n",
" stop_word_after = [re.sub(r'(\\r|\\n)*','',stop_word[i]) for i in range(len(stop_word))]\n",
" stop_word_after[-1] = stop_word[-1]\n",
" stop_word = stop_word_after\n",
" return d_4_delete,stop_word,d_city_province\n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Building prefix dict from the default dictionary ...\n",
"Loading model from cache C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\jieba.cache\n",
"Loading model cost 0.732 seconds.\n",
"Prefix dict has been built succesfully.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"河北银行\n"
]
}
],
"source": [
"# TODO:测试实体统一用例\n",
"d_4_delete,stop_word,d_city_province = my_initial()\n",
"company_name = \"河北银行股份有限公司\"\n",
"lst = main_extract(company_name,stop_word,d_4_delete,d_city_province)\n",
"company_name = ''.join(lst) # 对公司名提取主体部分,将包含相同主体部分的公司统一为一个实体\n",
"print(company_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 步骤 2:实体识别\n",
"有很多开源工具可以帮助我们对实体进行识别。常见的有LTP、StanfordNLP、FoolNLTK等等。\n",
"\n",
"本次采用FoolNLTK实现实体识别,fool是一个基于bi-lstm+CRF算法开发的深度学习开源NLP工具,包括了分词、实体识别等功能,大家可以通过fool很好地体会深度学习在该任务上的优缺点。\n",
"\n",
"在‘data/train_data.csv’和‘data/test_data.csv’中是从网络上爬虫得到的上市公司公告,数据样例如下:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>sentence</th>\n",
" <th>tag</th>\n",
" <th>member1</th>\n",
" <th>member2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>6461</td>\n",
" <td>与本公司关系:受同一公司控制 2,杭州富生电器有限公司企业类型: 有限公司注册地址: 富阳市...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2111</td>\n",
" <td>三、关联交易标的基本情况 1、交易标的基本情况 公司名称:红豆集团财务有限公司 公司地址:无...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>9603</td>\n",
" <td>2016年协鑫集成科技股份有限公司向瑞峰(张家港)光伏科技有限公司支付设备款人民币4,515...</td>\n",
" <td>1</td>\n",
" <td>协鑫集成科技股份有限公司</td>\n",
" <td>瑞峰(张家港)光伏科技有限公司</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3456</td>\n",
" <td>证券代码:600777 证券简称:新潮实业 公告编号:2015-091 烟台新潮实业股份有限...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>8844</td>\n",
" <td>本集团及广发证券股份有限公司持有辽宁成大股份有限公司股票的本期变动系买卖一揽子沪深300指数...</td>\n",
" <td>1</td>\n",
" <td>广发证券股份有限公司</td>\n",
" <td>辽宁成大股份有限公司</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id sentence tag member1 \\\n",
"0 6461 与本公司关系:受同一公司控制 2,杭州富生电器有限公司企业类型: 有限公司注册地址: 富阳市... 0 0 \n",
"1 2111 三、关联交易标的基本情况 1、交易标的基本情况 公司名称:红豆集团财务有限公司 公司地址:无... 0 0 \n",
"2 9603 2016年协鑫集成科技股份有限公司向瑞峰(张家港)光伏科技有限公司支付设备款人民币4,515... 1 协鑫集成科技股份有限公司 \n",
"3 3456 证券代码:600777 证券简称:新潮实业 公告编号:2015-091 烟台新潮实业股份有限... 0 0 \n",
"4 8844 本集团及广发证券股份有限公司持有辽宁成大股份有限公司股票的本期变动系买卖一揽子沪深300指数... 1 广发证券股份有限公司 \n",
"\n",
" member2 \n",
"0 0 \n",
"1 0 \n",
"2 瑞峰(张家港)光伏科技有限公司 \n",
"3 0 \n",
"4 辽宁成大股份有限公司 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_data = pd.read_csv('../data/info_extract/train_data.csv', encoding = 'gb2312', header=0)\n",
"train_data.head()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>sentence</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>9259</td>\n",
" <td>2015年1月26日,多氟多化工股份有限公司与李云峰先生签署了《附条件生效的股份认购合同》</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>9136</td>\n",
" <td>2、2016年2月5日,深圳市新纶科技股份有限公司与侯毅先</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>220</td>\n",
" <td>2015年10月26日,山东华鹏玻璃股份有限公司与张德华先生签署了附条件生效条件的《股份认购合同》</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>9041</td>\n",
" <td>2、2015年12月31日,印纪娱乐传媒股份有限公司与肖文革签订了《印纪娱乐传媒股份有限公司...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>10041</td>\n",
" <td>一、金发科技拟与熊海涛女士签订《股份转让协议》,协议约定:以每股1.0509元的收购价格,收...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id sentence\n",
"0 9259 2015年1月26日,多氟多化工股份有限公司与李云峰先生签署了《附条件生效的股份认购合同》\n",
"1 9136 2、2016年2月5日,深圳市新纶科技股份有限公司与侯毅先\n",
"2 220 2015年10月26日,山东华鹏玻璃股份有限公司与张德华先生签署了附条件生效条件的《股份认购合同》\n",
"3 9041 2、2015年12月31日,印纪娱乐传媒股份有限公司与肖文革签订了《印纪娱乐传媒股份有限公司...\n",
"4 10041 一、金发科技拟与熊海涛女士签订《股份转让协议》,协议约定:以每股1.0509元的收购价格,收..."
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_data = pd.read_csv('../data/info_extract/test_data.csv', encoding = 'gb2312', header=0)\n",
"test_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"我们选取一部分样本进行标注,即train_data,该数据由5列组成。id列表示原始样本序号;sentence列为我们截取的一段关键信息;如果关键信息中存在两个实体之间有股权交易关系则tag列为1,否则为0;如果tag为1,则在member1和member2列会记录两个实体出现在sentence中的名称。\n",
"\n",
"剩下的样本没有标注,即test_data,该数据只有id和sentence两列,希望你能训练模型对test_data中的实体进行识别,并判断实体对之间有没有股权交易关系。\n",
"\n",
"### 练习2:\n",
"将每句句子中实体识别出,存入实体词典,并用特殊符号替换语句。\n"
]
},
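{
"cell_type": "markdown",
"metadata": {},
"source": [
"For orientation (an added sketch, not part of the original template): `fool.analysis` returns a `(words, ners)` pair, where `ners[0]` is a list of `(start, end, type, name)` tuples; that is why the template cells below sort and unpack it this way."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import fool\n",
"\n",
"words, ners = fool.analysis('广发证券股份有限公司持有辽宁成大股份有限公司股票')\n",
"print(ners[0])  # e.g. [(start, end, 'company', name), ...]"
]
},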
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 处理test数据,利用开源工具进行实体识别和并使用实体统一函数存储实体\n",
"\n",
"import fool\n",
"import pandas as pd\n",
"from copy import copy\n",
"\n",
"\n",
"test_data = pd.read_csv('../data/info_extract/test_data.csv', encoding = 'gb2312', header=0)\n",
"test_data['ner'] = None\n",
"ner_id = 1001\n",
"ner_dict_new = {} # 存储所有实体\n",
"ner_dict_reverse_new = {} # 存储所有实体\n",
"\n",
"for i in range(len(test_data)):\n",
" sentence = copy(test_data.iloc[i, 1])\n",
" # TODO:调用fool进行实体识别,得到words和ners结果\n",
" # TODO ...\n",
" \n",
" \n",
" ners[0].sort(key=lambda x:x[0], reverse=True)\n",
" for start, end, ner_type, ner_name in ners[0]:\n",
" if ner_type=='company' or ner_type=='person':\n",
" # TODO:调用实体统一函数,存储统一后的实体\n",
" # 并自增ner_id\n",
" # TODO ...\n",
" \n",
" \n",
" # 在句子中用编号替换实体名\n",
" sentence = sentence[:start] + ' ner_' + str(ner_dict_new[company_main_name]) + '_ ' + sentence[end-1:]\n",
" test_data.iloc[i, -1] = sentence\n",
"\n",
"X_test = test_data[['ner']]\n"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 处理train数据,利用开源工具进行实体识别和并使用实体统一函数存储实体\n",
"train_data = pd.read_csv('../data/info_extract/train_data.csv', encoding = 'gb2312', header=0)\n",
"train_data['ner'] = None\n",
"\n",
"for i in range(len(train_data)):\n",
" # 判断正负样本\n",
" if train_data.iloc[i,:]['member1']=='0' and train_data.iloc[i,:]['member2']=='0':\n",
" sentence = copy(train_data.iloc[i, 1])\n",
" # TODO:调用fool进行实体识别,得到words和ners结果\n",
" # TODO ...\n",
" \n",
" \n",
" ners[0].sort(key=lambda x:x[0], reverse=True)\n",
" for start, end, ner_type, ner_name in ners[0]:\n",
" if ner_type=='company' or ner_type=='person':\n",
" # TODO:调用实体统一函数,存储统一后的实体\n",
" # 并自增ner_id\n",
" # TODO ...\n",
"\n",
"\n",
" # 在句子中用编号替换实体名\n",
" sentence = sentence[:start] + ' ner_' + str(ner_dict_new[company_main_name]) + '_ ' + sentence[end-1:]\n",
" train_data.iloc[i, -1] = sentence\n",
" else:\n",
" # 将训练集中正样本已经标注的实体也使用编码替换\n",
" sentence = copy(train_data.iloc[i,:]['sentence'])\n",
" for company_main_name in [train_data.iloc[i,:]['member1'],train_data.iloc[i,:]['member2']]:\n",
" # TODO:调用实体统一函数,存储统一后的实体\n",
" # 并自增ner_id\n",
" # TODO ...\n",
"\n",
"\n",
" # 在句子中用编号替换实体名\n",
" sentence = re.sub(company_main_name, ' ner_%s_ '%(str(ner_dict_new[company_main_name])), sentence)\n",
" train_data.iloc[i, -1] = sentence\n",
" \n",
"y = train_data.loc[:,['tag']]\n",
"train_num = len(train_data)\n",
"X_train = train_data[['ner']]\n",
"\n",
"# 将train和test放在一起提取特征\n",
"X = pd.concat([X_train, X_test])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 步骤 3:关系抽取\n",
"\n",
"\n",
"目标:借助句法分析工具,和实体识别的结果,以及文本特征,基于训练数据抽取关系,并存储进图数据库。\n",
"\n",
"本次要求抽取股权交易关系,关系为无向边,不要求判断投资方和被投资方,只要求得到双方是否存在交易关系。\n",
"\n",
"模板建立可以使用“正则表达式”、“实体间距离”、“实体上下文”、“依存句法”等。\n",
"\n",
"答案提交在submit目录中,命名为info_extract_submit.csv和info_extract_entity.csv。\n",
"- info_extract_entity.csv格式为:第一列是实体编号,第二列是实体名(实体统一的多个实体名用“|”分隔)\n",
"- info_extract_submit.csv格式为:第一列是关系中实体1的编号,第二列为关系中实体2的编号。\n",
"\n",
"示例:\n",
"- info_extract_entity.csv\n",
"\n",
"| 实体编号 | 实体名 |\n",
"| ------ | ------ |\n",
"| 1001 | 小王 |\n",
"| 1002 | A化工厂 |\n",
"\n",
"- info_extract_submit.csv\n",
"\n",
"| 实体1 | 实体2 |\n",
"| ------ | ------ |\n",
"| 1001 | 1003 |\n",
"| 1002 | 1001 |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 练习3:提取文本tf-idf特征\n",
"\n",
"去除停用词,并转换成tfidf向量。"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"# code\n",
"from sklearn.feature_extraction.text import TfidfTransformer \n",
"from sklearn.feature_extraction.text import CountVectorizer \n",
"from pyltp import Segmentor\n",
"\n",
"\n",
"# 实体符号加入分词词典\n",
"with open('../data/user_dict.txt', 'w') as fw:\n",
" for v in ['一', '二', '三', '四', '五', '六', '七', '八', '九', '十']:\n",
" fw.write( v + '号企业 ni\\n')\n",
"\n",
"# 初始化实例\n",
"segmentor = Segmentor() \n",
"# 加载模型,加载自定义词典\n",
"segmentor.load_with_lexicon('/Users/Badrain/Downloads/ltp_data_v3.4.0/cws.model', '../data/user_dict.txt') \n",
"\n",
"# 加载停用词\n",
"fr = open(r'../data/dict/stopwords.txt', encoding='utf-8') \n",
"stop_word = fr.readlines()\n",
"stop_word = [re.sub(r'(\\r|\\n)*','',stop_word[i]) for i in range(len(stop_word))]\n",
"\n",
"# 分词\n",
"f = lambda x: ' '.join([for word in segmentor.segment(x) if word not in stop_word and not re.findall(r'ner\\_\\d\\d\\d\\d\\_', word)])\n",
"corpus=X['ner'].map(f).tolist()\n",
"\n",
"\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"# TODO:提取tfidf特征\n",
"# TODO ...\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 练习4:提取句法特征\n",
"除了词语层面的句向量特征,我们还可以从句法入手,提取一些句法分析的特征。\n",
"\n",
"参考特征:\n",
"\n",
"1、企业实体间距离\n",
"\n",
"2、企业实体间句法距离\n",
"\n",
"3、企业实体分别和关键触发词的距离\n",
"\n",
"4、实体的依存关系类别"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# -*- coding: utf-8 -*-\n",
"from pyltp import Parser\n",
"from pyltp import Segmentor\n",
"from pyltp import Postagger\n",
"import networkx as nx\n",
"import pylab\n",
"import re\n",
"\n",
"postagger = Postagger() # 初始化实例\n",
"postagger.load_with_lexicon('/Users/Badrain/Downloads/ltp_data_v3.4.0/pos.model', '../data/user_dict.txt') # 加载模型\n",
"segmentor = Segmentor() # 初始化实例\n",
"segmentor.load_with_lexicon('/Users/Badrain/Downloads/ltp_data_v3.4.0/cws.model', '../data/user_dict.txt') # 加载模型\n",
"\n",
"\n",
"\n",
"def parse(s):\n",
" \"\"\"\n",
" 对语句进行句法分析,并返回句法结果\n",
" \"\"\"\n",
" tmp_ner_dict = {}\n",
" num_lst = ['一', '二', '三', '四', '五', '六', '七', '八', '九', '十']\n",
"\n",
" # 将公司代码替换为特殊称谓,保证分词词性正确\n",
" for i, ner in enumerate(list(set(re.findall(r'(ner\\_\\d\\d\\d\\d\\_)', s)))):\n",
" try:\n",
" tmp_ner_dict[num_lst[i]+'号企业'] = ner\n",
" except IndexError:\n",
" # TODO:定义错误情况的输出\n",
" # TODO ...\n",
" \n",
" \n",
" s = s.replace(ner, num_lst[i]+'号企业')\n",
" words = segmentor.segment(s)\n",
" tags = postagger.postag(words)\n",
" parser = Parser() # 初始化实例\n",
" parser.load('/Users/Badrain/Downloads/ltp_data_v3.4.0/parser.model') # 加载模型\n",
" arcs = parser.parse(words, tags) # 句法分析\n",
" arcs_lst = list(map(list, zip(*[[arc.head, arc.relation] for arc in arcs])))\n",
" \n",
" # 句法分析结果输出\n",
" parse_result = pd.DataFrame([[a,b,c,d] for a,b,c,d in zip(list(words),list(tags), arcs_lst[0], arcs_lst[1])], index = range(1,len(words)+1))\n",
" parser.release() # 释放模型\n",
" # TODO:提取企业实体依存句法类型\n",
" # TODO ...\n",
" \n",
" \n",
"\n",
" # 投资关系关键词\n",
" key_words = [\"收购\",\"竞拍\",\"转让\",\"扩张\",\"并购\",\"注资\",\"整合\",\"并入\",\"竞购\",\"竞买\",\"支付\",\"收购价\",\"收购价格\",\"承购\",\"购得\",\"购进\",\n",
" \"购入\",\"买进\",\"买入\",\"赎买\",\"购销\",\"议购\",\"函购\",\"函售\",\"抛售\",\"售卖\",\"销售\",\"转售\"]\n",
" # TODO:*根据关键词和对应句法关系提取特征(如没有思路可以不完成)\n",
" # TODO ...\n",
" \n",
" \n",
" parser.release() # 释放模型\n",
" return your_result\n",
"\n",
"\n",
"def shortest_path(arcs_ret, source, target):\n",
" \"\"\"\n",
" 求出两个词最短依存句法路径,不存在路径返回-1\n",
" arcs_ret:句法分析结果\n",
" source:实体1\n",
" target:实体2\n",
" \"\"\"\n",
" G=nx.DiGraph()\n",
" # 为这个网络添加节点...\n",
" for i in list(arcs_ret.index):\n",
" G.add_node(i)\n",
" # TODO:在网络中添加带权中的边...(注意,我们需要的是无向边)\n",
" # TODO ...\n",
" \n",
"\n",
" try:\n",
" # TODO:利用nx包中shortest_path_length方法实现最短距离提取\n",
" # TODO ...\n",
" \n",
" \n",
" return distance\n",
" except:\n",
" return -1\n",
"\n",
"\n",
"def get_feature(s):\n",
" \"\"\"\n",
" 汇总上述函数汇总句法分析特征与TFIDF特征\n",
" \"\"\"\n",
" # TODO:汇总上述函数汇总句法分析特征与TFIDF特征\n",
" # TODO ...\n",
" \n",
" \n",
" return features\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 练习5:建立分类器\n",
"\n",
"利用已经提取好的tfidf特征以及parse特征,建立分类器进行分类任务。"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 建立分类器进行分类\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn import preprocessing\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"# TODO:定义需要遍历的参数\n",
"\n",
"\n",
"# TODO:选择模型\n",
"\n",
"\n",
"# TODO:利用GridSearchCV搜索最佳参数\n",
"\n",
"\n",
"# TODO:对Test_data进行分类\n",
"\n",
"\n",
"\n",
"# TODO:保存Test_data分类结果\n",
"# 答案提交在submit目录中,命名为info_extract_submit.csv和info_extract_entity.csv。\n",
"# info_extract_entity.csv格式为:第一列是实体编号,第二列是实体名(实体统一的多个实体名用“|”分隔)\n",
"# info_extract_submit.csv格式为:第一列是关系中实体1的编号,第二列为关系中实体2的编号。\n",
"\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 练习6:操作图数据库\n",
"对关系最好的描述就是用图,那这里就需要使用图数据库,目前最常用的图数据库是noe4j,通过cypher语句就可以操作图数据库的增删改查。可以参考“https://cuiqingcai.com/4778.html”。\n",
"\n",
"本次作业我们使用neo4j作为图数据库,neo4j需要java环境,请先配置好环境。\n",
"\n",
"将我们提出的实体关系插入图数据库,并查询某节点的3层投资关系,即三个节点组成的路径(如果有的话)。如果无法找到3层投资关系,请查询出任意指定节点的投资路径。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"\n",
"from py2neo import Node, Relationship, Graph\n",
"\n",
"graph = Graph(\n",
" \"http://localhost:7474\", \n",
" username=\"neo4j\", \n",
" password=\"person\"\n",
")\n",
"\n",
"for v in relation_list:\n",
" a = Node('Company', name=v[0])\n",
" b = Node('Company', name=v[1])\n",
" \n",
" # 本次不区分投资方和被投资方,无向图\n",
" r = Relationship(a, 'INVEST', b)\n",
" s = a | b | r\n",
" graph.create(s)\n",
" r = Relationship(b, 'INVEST', a)\n",
" s = a | b | r\n",
" graph.create(s)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# TODO:查询某节点的3层投资关系\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 步骤4:实体消歧\n",
"解决了实体识别和关系的提取,我们已经完成了一大截,但是我们提取的实体究竟对应知识库中哪个实体呢?下图中,光是“苹果”就对应了13个同名实体。\n",
"<img src=\"../image/baike2.png\", width=340, heigth=480>\n",
"\n",
"在这个问题上,实体消歧旨在解决文本中广泛存在的名称歧义问题,将句中识别的实体与知识库中实体进行匹配,解决实体歧义问题。\n",
"\n",
"\n",
"### 练习7:\n",
"匹配test_data.csv中前25条样本中的人物实体对应的百度百科URL(此部分样本中所有人名均可在百度百科中链接到)。\n",
"\n",
"利用scrapy、beautifulsoup、request等python包对百度百科进行爬虫,判断是否具有一词多义的情况,如果有的话,选择最佳实体进行匹配。\n",
"\n",
"使用URL为‘https://baike.baidu.com/item/’+人名 可以访问百度百科该人名的词条,此处需要根据爬取到的网页识别该词条是否对应多个实体,如下图:\n",
"<img src=\"../image/baike1.png\", width=440, heigth=480>\n",
"如果该词条有对应多个实体,请返回正确匹配的实体URL,例如该示例网页中的‘https://baike.baidu.com/item/陆永/20793929’。\n",
"\n",
"- 提交文件:entity_disambiguation_submit.csv\n",
"- 提交格式:第一列为实体id(与info_extract_submit.csv中id保持一致),第二列为对应URL。\n",
"- 示例:\n",
"\n",
"| 实体编号 | URL |\n",
"| ------ | ------ |\n",
"| 1001 | https://baike.baidu.com/item/陆永/20793929 |\n",
"| 1002 | https://baike.baidu.com/item/王芳/567232 |\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import jieba\n",
"import pandas as pd\n",
"\n",
"# 找出test_data.csv中前25条样本所有的人物名称,以及人物所在文档的上下文内容\n",
"test_data = pd.read_csv('../data/info_extract/test_data.csv', encoding = 'gb2312', header=0)\n",
"\n",
"# 存储人物以及上下文信息(key为人物ID,value为人物名称、人物上下文内容)\n",
"person_name = {}\n",
"\n",
"# 观察上下文的窗口大小\n",
"window = 10 \n",
"\n",
"# 遍历前25条样本\n",
"for i in range(25):\n",
" sentence = copy(test_data.iloc[i, 1])\n",
" words, ners = fool.analysis(sentence)\n",
" ners[0].sort(key=lambda x:x[0], reverse=True)\n",
" for start, end, ner_type, ner_name in ners[0]:\n",
" if ner_type=='person':\n",
" # TODO:提取实体的上下文\n",
" \n",
"\n",
"\n",
"\n",
"# 利用爬虫得到每个人物名称对应的URL\n",
"# TODO:找到每个人物实体的词条内容。\n",
"\n",
"# TODO:将样本中人物上下文与爬取词条结果进行对比,选择最接近的词条。\n",
"\n",
"\n",
"\n",
"# 输出结果\n",
"pd.DataFrame(result_data).to_csv('../submit/entity_disambiguation_submit.csv', index=False)\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 利用信息抽取技术搭建知识库\n",
"\n",
"在这个notebook文件中,有些模板代码已经提供给你,但你还需要实现更多的功能来完成这个项目。除非有明确要求,你无须修改任何已给出的代码。以**'【练习】'**开始的标题表示接下来的代码部分中有你需要实现的功能。这些部分都配有详细的指导,需要实现的部分也会在注释中以'TODO'标出。请仔细阅读所有的提示。\n",
"\n",
">**提示:**Code 和 Markdown 区域可通过 **Shift + Enter** 快捷键运行。此外,Markdown可以通过双击进入编辑模式。\n",
"\n",
"---\n",
"\n",
"### 让我们开始吧\n",
"\n",
"本项目的目的是结合命名实体识别、依存语法分析、实体消歧、实体统一对网站开放语料抓取的数据建立小型知识图谱。\n",
"\n",
"在现实世界中,你需要拼凑一系列的模型来完成不同的任务;举个例子,用来预测狗种类的算法会与预测人类的算法不同。在做项目的过程中,你可能会遇到不少失败的预测,因为并不存在完美的算法和模型。你最终提交的不完美的解决方案也一定会给你带来一个有趣的学习经验!\n",
"\n",
"\n",
"---\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 步骤 1:实体统一"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"实体统一做的是对同一实体具有多个名称的情况进行统一,将多种称谓统一到一个实体上,并体现在实体的属性中(可以给实体建立“别称”属性)\n",
"\n",
"例如:对“河北银行股份有限公司”、“河北银行公司”和“河北银行”我们都可以认为是一个实体,我们就可以将通过提取前两个称谓的主要内容,得到“河北银行”这个实体关键信息。\n",
"\n",
"公司名称有其特点,例如后缀可以省略、上市公司的地名可以省略等等。在data/dict目录中提供了几个词典,可供实体统一使用。\n",
"- company_suffix.txt是公司的通用后缀词典\n",
"- company_business_scope.txt是公司经营范围常用词典\n",
"- co_Province_Dim.txt是省份词典\n",
"- co_City_Dim.txt是城市词典\n",
"- stopwords.txt是可供参考的停用词\n",
"\n",
"### 练习1:\n",
"编写main_extract函数,实现对实体的名称提取“主体名称”的功能。"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import jieba\n",
"import jieba.posseg as pseg\n",
"import re\n",
"import datetime\n",
"\n",
"\n",
"# 从输入的“公司名”中提取主体\n",
"def main_extract(input_str,stop_word,d_4_delete,d_city_province):\n",
" # 开始分词并处理\n",
" seg = pseg.cut(input_str)\n",
" seg_lst = remove_word(seg,stop_word,d_4_delete)\n",
" seg_lst = city_prov_ahead(seg,d_city_province)\n",
" return seg_lst\n",
"\n",
" \n",
"#TODO:实现公司名称中地名提前\n",
"def city_prov_ahead(seg,d_city_province):\n",
" city_prov_lst = []\n",
" # TODO ...\n",
" \n",
" return city_prov_lst+seg_lst\n",
"\n",
"\n",
"\n",
"\n",
"#TODO:替换特殊符号\n",
"def remove_word(seg,stop_word,d_4_delete):\n",
" # TODO ...\n",
" \n",
" return seg_lst\n",
"\n",
"\n",
"# 初始化,加载词典\n",
"def my_initial():\n",
" fr1 = open(r\"../data/dict/co_City_Dim.txt\", encoding='utf-8')\n",
" fr2 = open(r\"../data/dict/co_Province_Dim.txt\", encoding='utf-8')\n",
" fr3 = open(r\"../data/dict/company_business_scope.txt\", encoding='utf-8')\n",
" fr4 = open(r\"../data/dict/company_suffix.txt\", encoding='utf-8')\n",
" #城市名\n",
" lines1 = fr1.readlines()\n",
" d_4_delete = []\n",
" d_city_province = [re.sub(r'(\\r|\\n)*','',line) for line in lines1]\n",
" #省份名\n",
" lines2 = fr2.readlines()\n",
" l2_tmp = [re.sub(r'(\\r|\\n)*','',line) for line in lines2]\n",
" d_city_province.extend(l2_tmp)\n",
" #公司后缀\n",
" lines3 = fr3.readlines()\n",
" l3_tmp = [re.sub(r'(\\r|\\n)*','',line) for line in lines3]\n",
" lines4 = fr4.readlines()\n",
" l4_tmp = [re.sub(r'(\\r|\\n)*','',line) for line in lines4]\n",
" d_4_delete.extend(l4_tmp)\n",
" #get stop_word\n",
" fr = open(r'../data/dict/stopwords.txt', encoding='utf-8') \n",
" stop_word = fr.readlines()\n",
" stop_word_after = [re.sub(r'(\\r|\\n)*','',stop_word[i]) for i in range(len(stop_word))]\n",
" stop_word_after[-1] = stop_word[-1]\n",
" stop_word = stop_word_after\n",
" return d_4_delete,stop_word,d_city_province\n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Building prefix dict from the default dictionary ...\n",
"Loading model from cache C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\jieba.cache\n",
"Loading model cost 0.732 seconds.\n",
"Prefix dict has been built succesfully.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"河北银行\n"
]
}
],
"source": [
"# TODO:测试实体统一用例\n",
"d_4_delete,stop_word,d_city_province = my_initial()\n",
"company_name = \"河北银行股份有限公司\"\n",
"lst = main_extract(company_name,stop_word,d_4_delete,d_city_province)\n",
"company_name = ''.join(lst) # 对公司名提取主体部分,将包含相同主体部分的公司统一为一个实体\n",
"print(company_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 步骤 2:实体识别\n",
"有很多开源工具可以帮助我们对实体进行识别。常见的有LTP、StanfordNLP、FoolNLTK等等。\n",
"\n",
"本次采用FoolNLTK实现实体识别,fool是一个基于bi-lstm+CRF算法开发的深度学习开源NLP工具,包括了分词、实体识别等功能,大家可以通过fool很好地体会深度学习在该任务上的优缺点。\n",
"\n",
"在‘data/train_data.csv’和‘data/test_data.csv’中是从网络上爬虫得到的上市公司公告,数据样例如下:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>sentence</th>\n",
" <th>tag</th>\n",
" <th>member1</th>\n",
" <th>member2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>6461</td>\n",
" <td>与本公司关系:受同一公司控制 2,杭州富生电器有限公司企业类型: 有限公司注册地址: 富阳市...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2111</td>\n",
" <td>三、关联交易标的基本情况 1、交易标的基本情况 公司名称:红豆集团财务有限公司 公司地址:无...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>9603</td>\n",
" <td>2016年协鑫集成科技股份有限公司向瑞峰(张家港)光伏科技有限公司支付设备款人民币4,515...</td>\n",
" <td>1</td>\n",
" <td>协鑫集成科技股份有限公司</td>\n",
" <td>瑞峰(张家港)光伏科技有限公司</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3456</td>\n",
" <td>证券代码:600777 证券简称:新潮实业 公告编号:2015-091 烟台新潮实业股份有限...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>8844</td>\n",
" <td>本集团及广发证券股份有限公司持有辽宁成大股份有限公司股票的本期变动系买卖一揽子沪深300指数...</td>\n",
" <td>1</td>\n",
" <td>广发证券股份有限公司</td>\n",
" <td>辽宁成大股份有限公司</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id sentence tag member1 \\\n",
"0 6461 与本公司关系:受同一公司控制 2,杭州富生电器有限公司企业类型: 有限公司注册地址: 富阳市... 0 0 \n",
"1 2111 三、关联交易标的基本情况 1、交易标的基本情况 公司名称:红豆集团财务有限公司 公司地址:无... 0 0 \n",
"2 9603 2016年协鑫集成科技股份有限公司向瑞峰(张家港)光伏科技有限公司支付设备款人民币4,515... 1 协鑫集成科技股份有限公司 \n",
"3 3456 证券代码:600777 证券简称:新潮实业 公告编号:2015-091 烟台新潮实业股份有限... 0 0 \n",
"4 8844 本集团及广发证券股份有限公司持有辽宁成大股份有限公司股票的本期变动系买卖一揽子沪深300指数... 1 广发证券股份有限公司 \n",
"\n",
" member2 \n",
"0 0 \n",
"1 0 \n",
"2 瑞峰(张家港)光伏科技有限公司 \n",
"3 0 \n",
"4 辽宁成大股份有限公司 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_data = pd.read_csv('../data/info_extract/train_data.csv', encoding = 'gb2312', header=0)\n",
"train_data.head()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>sentence</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>9259</td>\n",
" <td>2015年1月26日,多氟多化工股份有限公司与李云峰先生签署了《附条件生效的股份认购合同》</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>9136</td>\n",
" <td>2、2016年2月5日,深圳市新纶科技股份有限公司与侯毅先</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>220</td>\n",
" <td>2015年10月26日,山东华鹏玻璃股份有限公司与张德华先生签署了附条件生效条件的《股份认购合同》</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>9041</td>\n",
" <td>2、2015年12月31日,印纪娱乐传媒股份有限公司与肖文革签订了《印纪娱乐传媒股份有限公司...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>10041</td>\n",
" <td>一、金发科技拟与熊海涛女士签订《股份转让协议》,协议约定:以每股1.0509元的收购价格,收...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id sentence\n",
"0 9259 2015年1月26日,多氟多化工股份有限公司与李云峰先生签署了《附条件生效的股份认购合同》\n",
"1 9136 2、2016年2月5日,深圳市新纶科技股份有限公司与侯毅先\n",
"2 220 2015年10月26日,山东华鹏玻璃股份有限公司与张德华先生签署了附条件生效条件的《股份认购合同》\n",
"3 9041 2、2015年12月31日,印纪娱乐传媒股份有限公司与肖文革签订了《印纪娱乐传媒股份有限公司...\n",
"4 10041 一、金发科技拟与熊海涛女士签订《股份转让协议》,协议约定:以每股1.0509元的收购价格,收..."
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_data = pd.read_csv('../data/info_extract/test_data.csv', encoding = 'gb2312', header=0)\n",
"test_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"我们选取一部分样本进行标注,即train_data,该数据由5列组成。id列表示原始样本序号;sentence列为我们截取的一段关键信息;如果关键信息中存在两个实体之间有股权交易关系则tag列为1,否则为0;如果tag为1,则在member1和member2列会记录两个实体出现在sentence中的名称。\n",
"\n",
"剩下的样本没有标注,即test_data,该数据只有id和sentence两列,希望你能训练模型对test_data中的实体进行识别,并判断实体对之间有没有股权交易关系。\n",
"\n",
"### 练习2:\n",
"将每句句子中实体识别出,存入实体词典,并用特殊符号替换语句。\n"
]
},
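{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before filling in the TODOs below, it helps to see what fool returns. The following cell is a minimal sketch (assuming the fool package and its models are installed): fool.analysis returns a (words, ners) pair, and each recognized entity is a (start, end, type, name) tuple, which is exactly what the loops below unpack."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch of the fool API used below (not part of the required solution)\n",
"import fool\n",
"\n",
"words, ners = fool.analysis('2015年1月26日,多氟多化工股份有限公司与李云峰先生签署了《附条件生效的股份认购合同》')\n",
"print(ners[0])  # a list of (start, end, type, name) tuples; types include 'company' and 'person'\n"
]
},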
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 处理test数据,利用开源工具进行实体识别和并使用实体统一函数存储实体\n",
"\n",
"import fool\n",
"import pandas as pd\n",
"from copy import copy\n",
"\n",
"\n",
"test_data = pd.read_csv('../data/info_extract/test_data.csv', encoding = 'gb2312', header=0)\n",
"test_data['ner'] = None\n",
"ner_id = 1001\n",
"ner_dict_new = {} # 存储所有实体\n",
"ner_dict_reverse_new = {} # 存储所有实体\n",
"\n",
"for i in range(len(test_data)):\n",
" sentence = copy(test_data.iloc[i, 1])\n",
" # TODO:调用fool进行实体识别,得到words和ners结果\n",
" # TODO ...\n",
" \n",
" \n",
" ners[0].sort(key=lambda x:x[0], reverse=True)\n",
" for start, end, ner_type, ner_name in ners[0]:\n",
" if ner_type=='company' or ner_type=='person':\n",
" # TODO:调用实体统一函数,存储统一后的实体\n",
" # 并自增ner_id\n",
" # TODO ...\n",
" \n",
" \n",
" # 在句子中用编号替换实体名\n",
" sentence = sentence[:start] + ' ner_' + str(ner_dict_new[company_main_name]) + '_ ' + sentence[end-1:]\n",
" test_data.iloc[i, -1] = sentence\n",
"\n",
"X_test = test_data[['ner']]\n"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 处理train数据,利用开源工具进行实体识别和并使用实体统一函数存储实体\n",
"train_data = pd.read_csv('../data/info_extract/train_data.csv', encoding = 'gb2312', header=0)\n",
"train_data['ner'] = None\n",
"\n",
"for i in range(len(train_data)):\n",
" # 判断正负样本\n",
" if train_data.iloc[i,:]['member1']=='0' and train_data.iloc[i,:]['member2']=='0':\n",
" sentence = copy(train_data.iloc[i, 1])\n",
" # TODO:调用fool进行实体识别,得到words和ners结果\n",
" # TODO ...\n",
" \n",
" \n",
" ners[0].sort(key=lambda x:x[0], reverse=True)\n",
" for start, end, ner_type, ner_name in ners[0]:\n",
" if ner_type=='company' or ner_type=='person':\n",
" # TODO:调用实体统一函数,存储统一后的实体\n",
" # 并自增ner_id\n",
" # TODO ...\n",
"\n",
"\n",
" # 在句子中用编号替换实体名\n",
" sentence = sentence[:start] + ' ner_' + str(ner_dict_new[company_main_name]) + '_ ' + sentence[end-1:]\n",
" train_data.iloc[i, -1] = sentence\n",
" else:\n",
" # 将训练集中正样本已经标注的实体也使用编码替换\n",
" sentence = copy(train_data.iloc[i,:]['sentence'])\n",
" for company_main_name in [train_data.iloc[i,:]['member1'],train_data.iloc[i,:]['member2']]:\n",
" # TODO:调用实体统一函数,存储统一后的实体\n",
" # 并自增ner_id\n",
" # TODO ...\n",
"\n",
"\n",
" # 在句子中用编号替换实体名\n",
" sentence = re.sub(company_main_name, ' ner_%s_ '%(str(ner_dict_new[company_main_name])), sentence)\n",
" train_data.iloc[i, -1] = sentence\n",
" \n",
"y = train_data.loc[:,['tag']]\n",
"train_num = len(train_data)\n",
"X_train = train_data[['ner']]\n",
"\n",
"# 将train和test放在一起提取特征\n",
"X = pd.concat([X_train, X_test])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 步骤 3:关系抽取\n",
"\n",
"\n",
"目标:借助句法分析工具,和实体识别的结果,以及文本特征,基于训练数据抽取关系,并存储进图数据库。\n",
"\n",
"本次要求抽取股权交易关系,关系为无向边,不要求判断投资方和被投资方,只要求得到双方是否存在交易关系。\n",
"\n",
"模板建立可以使用“正则表达式”、“实体间距离”、“实体上下文”、“依存句法”等。\n",
"\n",
"答案提交在submit目录中,命名为info_extract_submit.csv和info_extract_entity.csv。\n",
"- info_extract_entity.csv格式为:第一列是实体编号,第二列是实体名(实体统一的多个实体名用“|”分隔)\n",
"- info_extract_submit.csv格式为:第一列是关系中实体1的编号,第二列为关系中实体2的编号。\n",
"\n",
"示例:\n",
"- info_extract_entity.csv\n",
"\n",
"| 实体编号 | 实体名 |\n",
"| ------ | ------ |\n",
"| 1001 | 小王 |\n",
"| 1002 | A化工厂 |\n",
"\n",
"- info_extract_submit.csv\n",
"\n",
"| 实体1 | 实体2 |\n",
"| ------ | ------ |\n",
"| 1001 | 1003 |\n",
"| 1002 | 1001 |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 练习3:提取文本tf-idf特征\n",
"\n",
"去除停用词,并转换成tfidf向量。"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"# code\n",
"from sklearn.feature_extraction.text import TfidfTransformer \n",
"from sklearn.feature_extraction.text import CountVectorizer \n",
"from pyltp import Segmentor\n",
"\n",
"\n",
"# 实体符号加入分词词典\n",
"with open('../data/user_dict.txt', 'w') as fw:\n",
" for v in ['一', '二', '三', '四', '五', '六', '七', '八', '九', '十']:\n",
" fw.write( v + '号企业 ni\\n')\n",
"\n",
"# 初始化实例\n",
"segmentor = Segmentor() \n",
"# 加载模型,加载自定义词典\n",
"segmentor.load_with_lexicon('/Users/Badrain/Downloads/ltp_data_v3.4.0/cws.model', '../data/user_dict.txt') \n",
"\n",
"# 加载停用词\n",
"fr = open(r'../data/dict/stopwords.txt', encoding='utf-8') \n",
"stop_word = fr.readlines()\n",
"stop_word = [re.sub(r'(\\r|\\n)*','',stop_word[i]) for i in range(len(stop_word))]\n",
"\n",
"# 分词\n",
"f = lambda x: ' '.join([for word in segmentor.segment(x) if word not in stop_word and not re.findall(r'ner\\_\\d\\d\\d\\d\\_', word)])\n",
"corpus=X['ner'].map(f).tolist()\n",
"\n",
"\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"# TODO:提取tfidf特征\n",
"# TODO ...\n",
"\n",
"\n"
]
},
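{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the tf-idf TODO above: since corpus is already a list of space-separated token strings, TfidfVectorizer can be fit on it directly. The name X_tfidf is ours, not prescribed by the assignment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: turn the segmented corpus into tf-idf vectors\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"vectorizer = TfidfVectorizer()\n",
"X_tfidf = vectorizer.fit_transform(corpus)  # sparse matrix with one row per sentence\n",
"print(X_tfidf.shape)\n"
]
},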
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 练习4:提取句法特征\n",
"除了词语层面的句向量特征,我们还可以从句法入手,提取一些句法分析的特征。\n",
"\n",
"参考特征:\n",
"\n",
"1、企业实体间距离\n",
"\n",
"2、企业实体间句法距离\n",
"\n",
"3、企业实体分别和关键触发词的距离\n",
"\n",
"4、实体的依存关系类别"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# -*- coding: utf-8 -*-\n",
"from pyltp import Parser\n",
"from pyltp import Segmentor\n",
"from pyltp import Postagger\n",
"import networkx as nx\n",
"import pylab\n",
"import re\n",
"\n",
"postagger = Postagger() # 初始化实例\n",
"postagger.load_with_lexicon('/Users/Badrain/Downloads/ltp_data_v3.4.0/pos.model', '../data/user_dict.txt') # 加载模型\n",
"segmentor = Segmentor() # 初始化实例\n",
"segmentor.load_with_lexicon('/Users/Badrain/Downloads/ltp_data_v3.4.0/cws.model', '../data/user_dict.txt') # 加载模型\n",
"\n",
"\n",
"\n",
"def parse(s):\n",
" \"\"\"\n",
" 对语句进行句法分析,并返回句法结果\n",
" \"\"\"\n",
" tmp_ner_dict = {}\n",
" num_lst = ['一', '二', '三', '四', '五', '六', '七', '八', '九', '十']\n",
"\n",
" # 将公司代码替换为特殊称谓,保证分词词性正确\n",
" for i, ner in enumerate(list(set(re.findall(r'(ner\\_\\d\\d\\d\\d\\_)', s)))):\n",
" try:\n",
" tmp_ner_dict[num_lst[i]+'号企业'] = ner\n",
" except IndexError:\n",
" # TODO:定义错误情况的输出\n",
" # TODO ...\n",
" \n",
" \n",
" s = s.replace(ner, num_lst[i]+'号企业')\n",
" words = segmentor.segment(s)\n",
" tags = postagger.postag(words)\n",
" parser = Parser() # 初始化实例\n",
" parser.load('/Users/Badrain/Downloads/ltp_data_v3.4.0/parser.model') # 加载模型\n",
" arcs = parser.parse(words, tags) # 句法分析\n",
" arcs_lst = list(map(list, zip(*[[arc.head, arc.relation] for arc in arcs])))\n",
" \n",
" # 句法分析结果输出\n",
" parse_result = pd.DataFrame([[a,b,c,d] for a,b,c,d in zip(list(words),list(tags), arcs_lst[0], arcs_lst[1])], index = range(1,len(words)+1))\n",
" parser.release() # 释放模型\n",
" # TODO:提取企业实体依存句法类型\n",
" # TODO ...\n",
" \n",
" \n",
"\n",
" # 投资关系关键词\n",
" key_words = [\"收购\",\"竞拍\",\"转让\",\"扩张\",\"并购\",\"注资\",\"整合\",\"并入\",\"竞购\",\"竞买\",\"支付\",\"收购价\",\"收购价格\",\"承购\",\"购得\",\"购进\",\n",
" \"购入\",\"买进\",\"买入\",\"赎买\",\"购销\",\"议购\",\"函购\",\"函售\",\"抛售\",\"售卖\",\"销售\",\"转售\"]\n",
" # TODO:*根据关键词和对应句法关系提取特征(如没有思路可以不完成)\n",
" # TODO ...\n",
" \n",
" \n",
" parser.release() # 释放模型\n",
" return your_result\n",
"\n",
"\n",
"def shortest_path(arcs_ret, source, target):\n",
" \"\"\"\n",
" 求出两个词最短依存句法路径,不存在路径返回-1\n",
" arcs_ret:句法分析结果\n",
" source:实体1\n",
" target:实体2\n",
" \"\"\"\n",
" G=nx.DiGraph()\n",
" # 为这个网络添加节点...\n",
" for i in list(arcs_ret.index):\n",
" G.add_node(i)\n",
" # TODO:在网络中添加带权中的边...(注意,我们需要的是无向边)\n",
" # TODO ...\n",
" \n",
"\n",
" try:\n",
" # TODO:利用nx包中shortest_path_length方法实现最短距离提取\n",
" # TODO ...\n",
" \n",
" \n",
" return distance\n",
" except:\n",
" return -1\n",
"\n",
"\n",
"def get_feature(s):\n",
" \"\"\"\n",
" 汇总上述函数汇总句法分析特征与TFIDF特征\n",
" \"\"\"\n",
" # TODO:汇总上述函数汇总句法分析特征与TFIDF特征\n",
" # TODO ...\n",
" \n",
" \n",
" return features\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 练习5:建立分类器\n",
"\n",
"利用已经提取好的tfidf特征以及parse特征,建立分类器进行分类任务。"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 建立分类器进行分类\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn import preprocessing\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"# TODO:定义需要遍历的参数\n",
"\n",
"\n",
"# TODO:选择模型\n",
"\n",
"\n",
"# TODO:利用GridSearchCV搜索最佳参数\n",
"\n",
"\n",
"# TODO:对Test_data进行分类\n",
"\n",
"\n",
"\n",
"# TODO:保存Test_data分类结果\n",
"# 答案提交在submit目录中,命名为info_extract_submit.csv和info_extract_entity.csv。\n",
"# info_extract_entity.csv格式为:第一列是实体编号,第二列是实体名(实体统一的多个实体名用“|”分隔)\n",
"# info_extract_submit.csv格式为:第一列是关系中实体1的编号,第二列为关系中实体2的编号。\n",
"\n",
" "
]
},
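{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hedged sketch of the classifier TODOs above, using logistic regression with a small grid search. X_features is a hypothetical name for the combined tf-idf + parse feature matrix over the concatenated train/test data; train_num and y come from the cells above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: grid-searched logistic regression over the extracted features\n",
"param_grid = {'C': [0.1, 1, 10]}  # assumed grid; tune as needed\n",
"clf = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=5)\n",
"clf.fit(X_features[:train_num], y.values.ravel())\n",
"y_pred = clf.predict(X_features[train_num:])  # predictions for test_data\n"
]
},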
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 练习6:操作图数据库\n",
"对关系最好的描述就是用图,那这里就需要使用图数据库,目前最常用的图数据库是noe4j,通过cypher语句就可以操作图数据库的增删改查。可以参考“https://cuiqingcai.com/4778.html”。\n",
"\n",
"本次作业我们使用neo4j作为图数据库,neo4j需要java环境,请先配置好环境。\n",
"\n",
"将我们提出的实体关系插入图数据库,并查询某节点的3层投资关系,即三个节点组成的路径(如果有的话)。如果无法找到3层投资关系,请查询出任意指定节点的投资路径。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"\n",
"from py2neo import Node, Relationship, Graph\n",
"\n",
"graph = Graph(\n",
" \"http://localhost:7474\", \n",
" username=\"neo4j\", \n",
" password=\"person\"\n",
")\n",
"\n",
"for v in relation_list:\n",
" a = Node('Company', name=v[0])\n",
" b = Node('Company', name=v[1])\n",
" \n",
" # 本次不区分投资方和被投资方,无向图\n",
" r = Relationship(a, 'INVEST', b)\n",
" s = a | b | r\n",
" graph.create(s)\n",
" r = Relationship(b, 'INVEST', a)\n",
" s = a | b | r\n",
" graph.create(s)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# TODO:查询某节点的3层投资关系\n"
]
},
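{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hedged sketch of the query above via py2neo's graph.run. The Company label and INVEST relation match the insertion cell; the node name and the hop range are illustrative, so adjust them to whichever node and path definition you use."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: investment paths of up to 3 hops starting from one node\n",
"result = graph.run(\n",
"    \"MATCH p=(a:Company {name: $name})-[:INVEST*1..3]->(b:Company) RETURN p LIMIT 10\",\n",
"    name='广发证券股份有限公司'  # illustrative node name taken from the training data\n",
").data()\n",
"for record in result:\n",
"    print(record['p'])\n"
]
},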
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 步骤4:实体消歧\n",
"解决了实体识别和关系的提取,我们已经完成了一大截,但是我们提取的实体究竟对应知识库中哪个实体呢?下图中,光是“苹果”就对应了13个同名实体。\n",
"<img src=\"../image/baike2.png\", width=340, heigth=480>\n",
"\n",
"在这个问题上,实体消歧旨在解决文本中广泛存在的名称歧义问题,将句中识别的实体与知识库中实体进行匹配,解决实体歧义问题。\n",
"\n",
"\n",
"### 练习7:\n",
"匹配test_data.csv中前25条样本中的人物实体对应的百度百科URL(此部分样本中所有人名均可在百度百科中链接到)。\n",
"\n",
"利用scrapy、beautifulsoup、request等python包对百度百科进行爬虫,判断是否具有一词多义的情况,如果有的话,选择最佳实体进行匹配。\n",
"\n",
"使用URL为‘https://baike.baidu.com/item/’+人名 可以访问百度百科该人名的词条,此处需要根据爬取到的网页识别该词条是否对应多个实体,如下图:\n",
"<img src=\"../image/baike1.png\", width=440, heigth=480>\n",
"如果该词条有对应多个实体,请返回正确匹配的实体URL,例如该示例网页中的‘https://baike.baidu.com/item/陆永/20793929’。\n",
"\n",
"- 提交文件:entity_disambiguation_submit.csv\n",
"- 提交格式:第一列为实体id(与info_extract_submit.csv中id保持一致),第二列为对应URL。\n",
"- 示例:\n",
"\n",
"| 实体编号 | URL |\n",
"| ------ | ------ |\n",
"| 1001 | https://baike.baidu.com/item/陆永/20793929 |\n",
"| 1002 | https://baike.baidu.com/item/王芳/567232 |\n"
]
},
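{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before the exercise skeleton below, here is a hedged sketch of the matching step: segment the entity's sentence context and each candidate entry's text with jieba, then pick the candidate with the highest bag-of-words cosine similarity. context and candidate_texts are hypothetical names for what the context-extraction and crawling TODOs below produce."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: pick the candidate Baike entry whose text is closest to the mention's context\n",
"import jieba\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.metrics.pairwise import cosine_similarity\n",
"\n",
"def best_candidate(context, candidate_texts):\n",
"    docs = [' '.join(jieba.cut(t)) for t in [context] + candidate_texts]\n",
"    vec = CountVectorizer().fit_transform(docs)\n",
"    sims = cosine_similarity(vec[0], vec[1:])[0]\n",
"    return int(sims.argmax())  # index of the best-matching candidate\n"
]
},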
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import jieba\n",
"import pandas as pd\n",
"\n",
"# 找出test_data.csv中前25条样本所有的人物名称,以及人物所在文档的上下文内容\n",
"test_data = pd.read_csv('../data/info_extract/test_data.csv', encoding = 'gb2312', header=0)\n",
"\n",
"# 存储人物以及上下文信息(key为人物ID,value为人物名称、人物上下文内容)\n",
"person_name = {}\n",
"\n",
"# 观察上下文的窗口大小\n",
"window = 10 \n",
"\n",
"# 遍历前25条样本\n",
"for i in range(25):\n",
" sentence = copy(test_data.iloc[i, 1])\n",
" words, ners = fool.analysis(sentence)\n",
" ners[0].sort(key=lambda x:x[0], reverse=True)\n",
" for start, end, ner_type, ner_name in ners[0]:\n",
" if ner_type=='person':\n",
" # TODO:提取实体的上下文\n",
" \n",
"\n",
"\n",
"\n",
"# 利用爬虫得到每个人物名称对应的URL\n",
"# TODO:找到每个人物实体的词条内容。\n",
"\n",
"# TODO:将样本中人物上下文与爬取词条结果进行对比,选择最接近的词条。\n",
"\n",
"\n",
"\n",
"# 输出结果\n",
"pd.DataFrame(result_data).to_csv('../submit/entity_disambiguation_submit.csv', index=False)\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}