Commit 40daf887 by 20200203113

Project2

parents
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"对此project的想法:\n",
"\n",
"观察了此porject布置的时间点,是在“实体识别”、“句法分析”之后。应该适当温习算法原理的知识点。\n",
"\n",
"1、实体识别、依存句法等工具自己动手开发实现复杂度较高,挑选“句法结构分析”工具让学员开发,不是特别复杂,又温习了句法分析中的知识点,此处选择CYK算法让学员实现,同时又考察了动态规划的编写。(具体算法因为不知道课程内容,可以根据内容再做调整)\n",
"\n",
"2、经过挑选,选择“企业投资关系图谱”作为学员的任务,原因1:企业投资关系易于理解,模版比较好总结,适合没有训练样本的情景;原因2:企业的称谓较多而复杂,适合考察实体统一和消歧的知识点。\n",
"\n",
"3、对于实体识别和句法分析,我考虑使用stanfordnlp,但是python接口好像可配置性不强,主要就是让学员会调用,会利用调用结果。\n",
"对于实体消歧,我考虑使用上下文相似度进行衡量。\n",
"对于实体统一,我考虑考察一下学员在“企业多名称”派生上面的发散性思维。\n",
"由于不知道此project前几节课中涉及的相关知识点,所以先拍脑袋决定,老师如果对知识点有相关准备和资料可以给我看一下,或者听听老师的规划。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Project 1: 利用信息抽取技术搭建知识库"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"本项目的目的是结合命名实体识别、依存语法分析、实体消歧、实体统一对网站开放语料抓取的数据建立小型知识图谱。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part1:开发句法结构分析工具"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1 开发工具\n",
"使用CYK算法,根据所提供的:非终结符集合、终结符集合、规则集,对10句句子计算句法结构。\n",
"\n",
"非终结符集合:N={S, NP, VP, PP, DT, VI, VT, NN, IN}\n",
"\n",
"终结符集合:{sleeps, saw, boy, girl, dog, telescope, the, with, in}\n",
"\n",
"规则集: R={\n",
"- (1) S-->NP VP 1.0\n",
"- (2) VP-->VI 0.3\n",
"- ......\n",
"\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# code"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2 计算算法复杂度\n",
"计算上一节开发的算法所对应的时间复杂度和空间复杂度。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# code"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part2:在百度百科辅助下,建立“投资关系”知识图谱"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.1 实体识别\n",
"data目录中“baike.txt”文件记录了15个实体对应百度百科的url。\n",
"\n",
"借助开源实体识别工具并根据百度百科建立的已知实体对进行识别,对句中实体进行识别。"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# code\n",
"# 首先尝试利用开源工具分出实体\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# code\n",
"# 在此基础上,将百度百科作为已知实体加入词典,对实体进行识别"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 实体消歧\n",
"将句中识别的实体与知识库中实体进行匹配,解决实体歧义问题。\n",
"可利用上下文本相似度进行识别。"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# code\n",
"# 将识别出的实体与知识库中实体进行匹配,解决识别出一个实体对应知识库中多个实体的问题。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3 实体统一\n",
"对同一实体具有多个名称的情况进行统一,将多种称谓统一到一个实体上,并体现在实体的属性中(可以给实体建立“别称”属性)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# code"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.4 关系抽取\n",
"借助句法分析工具,和实体识别的结果,以及正则表达式,设定模版抽取关系。从data目录中news.txt文件中的url对应的新闻提取关系并存储进图数据库。\n",
"\n",
"本次要求抽取投资关系,关系为有向边,由投资方指向被投资方。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# code\n",
"\n",
"# 最后提交文件为识别出的整个投资图谱,以及图谱中结点列表与属性。"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment