{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Project: Building a Knowledge Base with Information Extraction Techniques" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of this project is to combine named entity recognition, dependency parsing, entity disambiguation, and entity unification to build a small knowledge graph from data crawled from open web corpora." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: Developing a Syntactic Parser" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Developing the tool (15 points)\n", "Using the CYK algorithm and the non-terminal set, terminal set, and rule set given below, compute the syntactic structure of the following sentence:\n", "\n", "\"the boy saw the dog with a telescope\"\n", "\n", "\n", "\n", "Non-terminal set: N = {S, NP, VP, PP, DT, Vi, Vt, NN, IN}\n", "\n", "Terminal set: {sleeps, saw, boy, girl, dog, telescope, the, a, with, in}\n", "\n", "Rule set: R = {\n", "- (1) S-->NP VP 1.0\n", "- (2) VP-->Vi 0.3\n", "- (3) VP-->Vt NP 0.4\n", "- (4) VP-->VP PP 0.3\n", "- (5) NP-->DT NN 0.8\n", "- (6) NP-->NP PP 0.2\n", "- (7) PP-->IN NP 1.0\n", "- (8) Vi-->sleeps 1.0\n", "- (9) Vt-->saw 1.0\n", "- (10) NN-->boy 0.1\n", "- (11) NN-->girl 0.1\n", "- (12) NN-->telescope 0.3\n", "- (13) NN-->dog 0.5\n", "- (14) DT-->the 0.5\n", "- (15) DT-->a 0.5\n", "- (16) IN-->with 0.6\n", "- (17) IN-->in 0.4\n", "\n", "\n", "}" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Points: 15\n", "class my_CYK(object):\n", "    # TODO: initializer\n", "    def __init__(self, non_terminal, terminal, rules_prob, start_symbol):\n", "        pass\n", "\n", "    # TODO: compute the sentence's syntactic structure and print it as a tree\n", "    def parse_sentence(self, sentence):\n", "        pass\n", "\n", "\n", "def main(sentence):\n", "    \"\"\"\n", "    Main function; the materials needed for the syntactic analysis are listed below.\n", "    \"\"\"\n", "    non_terminal = {'S', 'NP', 'VP', 'PP', 'DT', 'Vi', 'Vt', 'NN', 'IN'}\n", "    start_symbol = 'S'\n", "    terminal = {'sleeps', 'saw', 'boy', 'girl', 'dog', 'telescope', 'the', 'a', 'with', 'in'}\n", "    rules_prob = {'S': {('NP', 'VP'): 1.0},\n", "                  'VP': {('Vi',): 0.3, ('Vt', 'NP'): 0.4, ('VP', 'PP'): 0.3},\n", "                  'NP': {('DT', 'NN'): 0.8, ('NP', 'PP'): 0.2},\n", "                  'PP': {('IN', 'NP'): 1.0},\n", "                  'Vi': {('sleeps',): 1.0},\n", "                  'Vt': {('saw',): 1.0},\n", "                  'NN': {('boy',): 0.1, ('girl',): 0.1, ('telescope',): 
0.3, ('dog',): 0.5},\n", "                  'DT': {('the',): 0.5, ('a',): 0.5},\n", "                  'IN': {('with',): 0.6, ('in',): 0.4},\n", "                  }\n", "    cyk = my_CYK(non_terminal, terminal, rules_prob, start_symbol)\n", "    cyk.parse_sentence(sentence)\n", "\n", "\n", "# TODO: test on the example sentence\n", "if __name__ == \"__main__\":\n", "    sentence = \"the boy saw the dog with a telescope\"\n", "    main(sentence)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Algorithm complexity (3 points)\n", "Work out the time and space complexity of the algorithm developed in the previous section." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Points: 3\n", "# What are the time and space complexity of the algorithm above?\n", "# TODO\n", "Time complexity = O(), Space complexity = O()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Extracting Corporate Equity-Transaction Relations and Building a Knowledge Base" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Entity disambiguation practice (15 points)\n", "Match the entities recognized in each sentence against the entities in the knowledge base to resolve entity ambiguity.\n", "Context-based text similarity can be used for the matching.\n", "\n", "In the data/entity_disambiguation directory, entity_list.csv contains 50 entities and valid_data.csv contains the sentences to be disambiguated.\n", "\n", "Submit the answer in the submit directory as entity_disambiguation_submit.csv. Format: the first column is the index of the sentence being disambiguated; the second column is a string of \"entity-start-offset-entity-end-offset:entity-id\" items separated by \"|\".\n", "\n", "*Grading is by the F1-score combining precision and recall of the entity recognition.\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "import jieba\n", "import pandas as pd\n", "\n", "# TODO: load the known entity names from entity_list.csv into the segmentation dictionary\n", "entity_data = pd.read_csv('../data/entity_disambiguation/entity_list.csv', encoding='utf-8')\n", "\n", "\n", "# TODO: recognize and match the entities in each sentence\n", "valid_data = pd.read_csv('../data/entity_disambiguation/valid_data.csv', encoding='gb18030')\n", "    " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: write the results to file\n", "\"\"\"\n", "Format: the first column is the index of the sentence being disambiguated; the second column is a string of \"entity-start-offset-entity-end-offset:entity-id\" items separated by \"|\".\n", "Example:\n", "[\n", "    [0, '3-6:1008|109-112:1008|187-190:1008'],\n", "    ...\n", "]\n", "\"\"\"\n", "pd.DataFrame(result_data).to_csv('../submit/entity_disambiguation_submit.csv', index=False)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
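"Before the submission step is implemented, the context-similarity idea from 2.1 can be illustrated with a minimal, self-contained sketch: an ambiguous mention is assigned to the candidate entity whose description shares the most context words with the sentence. The function name and the toy candidate dictionary below are assumptions for illustration only, not the project's data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch (not the graded solution): word-overlap disambiguation.\n", "def disambiguate(sentence_words, mention, candidates):\n", "    \"\"\"candidates: {entity_id: description word list}; returns the best-matching id.\"\"\"\n", "    context = set(sentence_words) - {mention}\n", "    return max(candidates, key=lambda eid: len(context & set(candidates[eid])))\n", "\n", "# Toy data: entity 1001 is Apple the company, 1002 is apple the fruit.\n", "candidates = {1001: ['苹果', '公司', '手机'], 1002: ['苹果', '水果', '食品']}\n", "print(disambiguate(['他', '买', '了', '一部', '苹果', '手机'], '苹果', candidates))  # -> 1001" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 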
"### 2.2 Entity Recognition (10 points)\n", "\n", "Use open-source tools to recognize the entities.\n", "\n", "Recognize the entities in each sentence, store them in an entity dictionary, and replace them in the sentence with special placeholder symbols." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# code\n", "# First, try to extract the entities with an open-source tool\n", "\n", "import fool  # foolnltk is a deep-learning-based open-source toolkit, see https://github.com/rockyzhengwu/FoolNLTK; open-source tools such as LTP can be used as well\n", "import pandas as pd\n", "from copy import copy\n", "\n", "\n", "sample_data = pd.read_csv('../data/info_extract/train_data.csv', encoding='utf-8', header=0)\n", "y = sample_data.loc[:, ['tag']]\n", "train_num = len(sample_data)\n", "test_data = pd.read_csv('../data/info_extract/test_data.csv', encoding='utf-8', header=0)\n", "sample_data = pd.concat([sample_data.loc[:, ['id', 'sentence']], test_data])\n", "sample_data['ner'] = None\n", "# TODO: store the extracted entities in the 'ner' column in a suitable format for later use\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 Entity Unification (15 points)\n", "Unify the multiple names of a single entity onto one entity, and record the other names in the entity's attributes (for example, by giving the entity an \"alias\" attribute).\n", "\n", "For example, \"河北银行股份有限公司\" and \"河北银行\" can be unified into one entity.\n", "\n", "Company names have their own structure: the suffix can be omitted, the place name of a listed company can be omitted, and so on. Several dictionaries are provided in the data/dict directory for entity unification:\n", "- company_suffix.txt: common company suffixes\n", "- company_business_scope.txt: common business-scope words\n", "- co_Province_Dim.txt: province dictionary\n", "- co_City_Dim.txt: city dictionary\n", "- stopwords.txt: a stopword list for reference" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "import jieba\n", "import jieba.posseg as pseg\n", "import re\n", "import datetime\n", "\n", "\n", "\n", "# Hint: analyze how a full company name is composed, separate the place name, the company's main part, and the suffix, and build some heuristic rules\n", "# TODO: implement main_extract: given a company name, return the unified short name\n", "def main_extract(company_name, stop_word, d_city_province):\n", "    \"\"\"\n", "    company_name     the input company name\n", "    stop_word        stopwords\n", "    d_city_province  place names\n", "    \"\"\"\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "河北银行\n" ] } ], "source": [ "# Quick test of the entity-unification code\n", "stop_word, d_city_province = my_initial()\n", "company_name = \"河北银行股份有限公司\"\n", "# 
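Illustrative sketch (an assumption, not the graded solution): a minimal main_extract that only strips common suffixes; my_initial is assumed to be implemented above\n", "def main_extract(company_name, stop_word, d_city_province):\n", "    # Toy heuristic: strip a few hard-coded suffixes; the real version should use the data/dict dictionaries and its arguments\n", "    for suffix in ['股份有限公司', '有限公司', '集团', '公司']:\n", "        if company_name.endswith(suffix):\n", "            company_name = company_name[:-len(suffix)]\n", "            break\n", "    return company_name\n", "# 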
extract the main part of the company name; companies that share the same main part are unified into one entity\n", "company_name = main_extract(company_name, stop_word, d_city_province)\n", "print(company_name)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# TODO: apply the entity-unification step during entity recognition\n", "\n", "import fool\n", "import pandas as pd\n", "from copy import copy\n", "\n", "\n", "sample_data = pd.read_csv('../data/info_extract/train_data.csv', encoding='utf-8', header=0)\n", "y = sample_data.loc[:, ['tag']]\n", "train_num = len(sample_data)\n", "test_data = pd.read_csv('../data/info_extract/test_data.csv', encoding='utf-8', header=0)\n", "sample_data = pd.concat([sample_data.loc[:, ['id', 'sentence']], test_data])\n", "sample_data['ner'] = None\n", "# TODO: store the extracted entities in the 'ner' column with unified ids, for later use\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4 Relation Extraction (37 points)\n", "Goal: extract relations from the training data using a syntactic parser, the entity-recognition results, and text features.\n", "\n", "The task is to extract equity-transaction relations. A relation is a directed edge pointing from the investor to the investee.\n", "\n", "Templates can be built from regular expressions, the distance between entities, entity context, dependency syntax, and so on.\n", "\n", "Submit the answers in the submit directory as info_extract_submit.csv and info_extract_entity.csv.\n", "- info_extract_entity.csv: the first column is the entity id, the second column the entity names (multiple names separated by \"|\")\n", "- info_extract_submit.csv: the first column is one entity's id, the second column the other entity's id.\n", "\n", "*Grading is by the F1-score combining precision and recall of the extracted relations." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Extract tf-idf text features\n", "# Remove stopwords and convert the text to tf-idf vectors.\n", "# The LTP toolkit can be used for segmentation, see https://ltp.readthedocs.io/zh_CN/latest/\n", "from sklearn.feature_extraction.text import TfidfTransformer\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from pyltp import Segmentor\n", "\n", "def get_tfidf_feature():\n", "    segmentor = Segmentor()  # create an instance\n", "    segmentor.load_with_lexicon('/ltp_data_v3.4.0/cws.model', '../data/user_dict.txt')  # load the model with a user lexicon\n", "\n", "    \n", "    return tfidf_feature\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Extracting syntactic features\n", "\n", "Reference features:\n", "\n", "1. Distance between the two company entities\n", "\n", "2. Syntactic (dependency-tree) distance between the two company entities\n", "\n", "3. Distance from each company entity to the key trigger words\n", "\n", 
"4. Dependency relation types of the entities" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "# -*- coding: utf-8 -*-\n", "from pyltp import Parser\n", "from pyltp import Segmentor\n", "from pyltp import Postagger\n", "import networkx as nx\n", "import pylab\n", "import re\n", "\n", "# Trigger words for investment relations\n", "# Hint: features built around these trigger words can be effective\n", "key_words = [\"收购\",\"竞拍\",\"转让\",\"扩张\",\"并购\",\"注资\",\"整合\",\"并入\",\"竞购\",\"竞买\",\"支付\",\"收购价\",\"收购价格\",\"承购\",\"购得\",\"购进\",\n", "             \"购入\",\"买进\",\"买入\",\"赎买\",\"购销\",\"议购\",\"函购\",\"函售\",\"抛售\",\"售卖\",\"销售\",\"转售\"]\n", "\n", "postagger = Postagger()  # create an instance\n", "postagger.load_with_lexicon('/ltp_data_v3.4.0/pos.model', '../data/user_dict.txt')  # load the model with a user lexicon\n", "segmentor = Segmentor()  # create an instance\n", "segmentor.load_with_lexicon('/ltp_data_v3.4.0/cws.model', '../data/user_dict.txt')  # load the model with a user lexicon\n", "parser = Parser()  # create an instance\n", "parser.load('/ltp_data_v3.4.0/parser.model')  # load the model\n", "\n", "\n", "def get_parse_feature(s):\n", "    \"\"\"\n", "    Run dependency parsing on the sentence and return the parse-based features:\n", "    parse_result: dependency parse of the sentence\n", "    source: word index of one company entity\n", "    target: word index of the other company entity\n", "    keyword_pos: list of word indices of the trigger words\n", "    source_dep: dependency relation type of one company entity\n", "    target_dep: dependency relation type of the other company entity\n", "    ...\n", "    (new features can be added to improve classification accuracy)\n", "    \"\"\"\n", "\n", "\n", "    return features\n", "\n", "\n", "\n", "# The dependency relation types in LTP are: ['SBV', 'VOB', 'IOB', 'FOB', 'DBL', 'ATT', 'ADV', 'CMP', 'COO', 'POB', 'LAD', 'RAD', 'IS', 'HED']" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "# Combine the tf-idf features and the parse features\n", "from sklearn.preprocessing import OneHotEncoder\n", "\n", "whole_feature = pd.concat([tfidf_feature, parse_feature])\n", "# TODO: convert the categorical variables to one-hot form\n", "\n", "train_x = whole_feature.iloc[:, :train_num]\n", "test_x = whole_feature.iloc[:, train_num:]\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Build a classifier with sklearn; LR, SVM, decision trees, and others are all acceptable\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn import preprocessing\n", "from 
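sklearn.model_selection import train_test_split\n", "\n", "# Illustrative aside (assumption): the one-hot TODO in the previous cell can be done\n", "# with pd.get_dummies, shown here on a toy dependency-type column.\n", "import pandas as pd\n", "demo = pd.get_dummies(pd.Series(['SBV', 'VOB', 'SBV']))  # one indicator column per type\n", "from 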
sklearn.model_selection import train_test_split\n", "\n", "\n", "class RF:\n", "    def __init__(self):\n", "        # A random forest is used here; any sklearn classifier would do\n", "        self.clf = RandomForestClassifier(n_estimators=100)\n", "\n", "    def train(self, train_x, train_y):\n", "        model = self.clf.fit(train_x, train_y)\n", "        return model\n", "\n", "    def predict(self, model, test_x):\n", "        predict = model.predict(test_x)\n", "        predict_prob = model.predict_proba(test_x)\n", "        return predict, predict_prob\n", "\n", "\n", "rf = RF()\n", "model = rf.train(train_x, y)\n", "predict, predict_prob = rf.predict(model, test_x)\n", "\n", "\n", "# TODO: store the extracted investment-relation entity pairs; this task does not require determining which side is the investor and which the investee, only that the pair is in an investment relation\n", "\"\"\"\n", "Store in the following form, convert to a DataFrame, and write to a csv file:\n", "[\n", "    [九州通, 江中药业股份有限公司],\n", "    ...\n", "]\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.5 Storing into a Graph Database (5 points)\n", "\n", "We use neo4j as the graph database for this assignment; neo4j needs a Java environment, so please set that up first." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "from py2neo import Node, Relationship, Graph\n", "\n", "graph = Graph(\n", "    \"http://localhost:7474\",\n", "    username=\"neo4j\",\n", "    password=\"person\"\n", ")\n", "\n", "for v in relation_list:\n", "    a = Node('Company', name=v[0])\n", "    b = Node('Company', name=v[1])\n", "    \n", "    # investor and investee are not distinguished here, so the edge is added in both directions\n", "    r = Relationship(a, 'INVEST', b)\n", "    s = a | b | r\n", "    graph.create(s)\n", "    r = Relationship(b, 'INVEST', a)\n", "    s = a | b | r\n", "    graph.create(s)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: query a node's investment relations up to 3 hops" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }