NLP_homework2 · starter_code.ipynb
Commit 126390ad authored Jul 09, 2020 by 20200519088 (parent 3680e774; 1 changed file, +307 −0)
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Building a Word Segmentation Tool\n",
"\n",
"### Part 1: Building a Chinese word segmenter by enumeration\n",
"\n",
"Data needed for this project:\n",
"1. 综合类中文词库.xlsx: a list of Chinese words, used as the dictionary\n",
"2. Partial unigram probabilities, provided in the variable word_prob\n",
"\n",
"\n",
"Example: given the dictionary [我们 学习 人工 智能 人工智能 未来 是] and the unigram probabilities p(我们)=0.25, p(学习)=0.15, p(人工)=0.05, p(智能)=0.1, p(人工智能)=0.2, p(未来)=0.1, p(是)=0.15\n",
"\n",
"#### Step 1: For the input string \"我们学习人工智能,人工智能是未来\", enumerate all possible segmentations\n",
"- [我们,学习,人工智能,人工智能,是,未来]\n",
"- [我们,学习,人工,智能,人工智能,是,未来]\n",
"- [我们,学习,人工,智能,人工,智能,是,未来]\n",
"- [我们,学习,人工智能,人工,智能,是,未来]\n",
".......\n",
"\n",
"\n",
"#### Step 2: Compute the score (negative log probability) of each segmented sentence\n",
"- score(我们,学习,人工智能,人工智能,是,未来) = -log p(我们)-log p(学习)-log p(人工智能)-log p(人工智能)-log p(是)-log p(未来)\n",
"- score(我们,学习,人工,智能,人工智能,是,未来) = -log p(我们)-log p(学习)-log p(人工)-log p(智能)-log p(人工智能)-log p(是)-log p(未来)\n",
"- score(我们,学习,人工,智能,人工,智能,是,未来) = -log p(我们)-log p(学习)-log p(人工)-log p(智能)-log p(人工)-log p(智能)-log p(是)-log p(未来)\n",
"- score(我们,学习,人工智能,人工,智能,是,未来) = -log p(我们)-log p(学习)-log p(人工智能)-log p(人工)-log p(智能)-log p(是)-log p(未来)\n",
".....\n",
"\n",
"#### Step 3: Return the segmentation with the lowest score, i.e. the highest sentence probability"
]
},
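The three steps above can be sketched on the toy dictionary from the example. This is a minimal standalone sketch; the notebook itself uses the full xlsx dictionary and NumPy instead of `math`:

```python
import math

# Toy dictionary and unigram probabilities from the example above.
word_prob = {"我们": 0.25, "学习": 0.15, "人工": 0.05, "智能": 0.1,
             "人工智能": 0.2, "未来": 0.1, "是": 0.15}

def enumerate_segs(s):
    """Step 1: return every segmentation of s whose words are all in the dictionary."""
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        if s[:i] in word_prob:
            for rest in enumerate_segs(s[i:]):
                results.append([s[:i]] + rest)
    return results

def best_seg(s):
    # Steps 2 and 3: lowest negative log probability == highest sentence probability.
    return min(enumerate_segs(s),
               key=lambda seg: sum(-math.log(word_prob[w]) for w in seg))

print(best_seg("我们学习人工智能"))  # ['我们', '学习', '人工智能']
```

On this input only two segmentations exist ([我们,学习,人工,智能] and [我们,学习,人工智能]), and the longer word 人工智能 wins because one probability of 0.2 costs less than 0.05 × 0.1.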
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3.980060000019523\n"
]
}
],
"source": [
"# TODO: Step 1: read all Chinese words from 综合类中文词库.xlsx.\n",
"# hint: think about which data structure should store this dictionary: consider the cost of each word lookup.\n",
"import xlrd\n",
"xl = xlrd.open_workbook('综合类中文词库.xlsx')  # open the workbook\n",
"table = xl.sheets()[0]  # first worksheet\n",
"dic_words = table.col_values(0)  # all words read from the dictionary file\n",
"\n",
"# Unigram probability of each word. To keep the problem simple, only a small subset of words\n",
"# is listed; every word that is in the dictionary but not listed here gets probability 0.00001,\n",
"# e.g. p(\"学院\") = p(\"概率\") = ... = 0.00001\n",
"\n",
"word_prob = {\"北京\":0.03,\"的\":0.08,\"天\":0.005,\"气\":0.005,\"天气\":0.06,\"真\":0.04,\"好\":0.05,\"真好\":0.04,\"啊\":0.01,\"真好啊\":0.02, \n",
"             \"今\":0.01,\"今天\":0.07,\"课程\":0.06,\"内容\":0.06,\"有\":0.05,\"很\":0.03,\"很有\":0.04,\"意思\":0.06,\"有意思\":0.005,\"课\":0.01,\n",
"             \"程\":0.005,\"经常\":0.08,\"意见\":0.08,\"意\":0.01,\"见\":0.005,\"有意见\":0.02,\"分歧\":0.04,\"分\":0.02, \"歧\":0.005}\n",
"for w in dic_words:\n",
"    if w not in word_prob:  # dict membership is O(1) on average\n",
"        word_prob[w] = 0.00001\n",
"# word_prob now maps the whole dictionary to probabilities\n",
"print(sum(word_prob.values()))"
]
},
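The hint in the cell above asks which data structure to use for the dictionary. A quick comparison of membership tests shows why a set (or dict) beats a plain list for this workload; the word list here is synthetic, a stand-in for the xlsx data:

```python
import timeit

# Synthetic stand-in for the xlsx word list (hypothetical data, 100k entries).
words_list = ["w%d" % i for i in range(100000)]
words_set = set(words_list)

# Looking up a word near the end of the list forces a full O(n) scan,
# while the set lookup is O(1) on average.
t_list = timeit.timeit(lambda: "w99999" in words_list, number=100)
t_set = timeit.timeit(lambda: "w99999" in words_set, number=100)
print(t_set < t_list)  # True
```

Since both the enumeration and the Viterbi method probe the dictionary once per candidate substring, this lookup cost multiplies into everything that follows.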
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['今天', '的', '课程', '内容', '很有', '意思']\n",
"['北京', '的', '天气', '真好啊']\n",
"['今天', '的', '课程', '内容', '很有', '意思']\n",
"['经常', '有意见', '分歧']\n"
]
}
],
"source": [
"## TODO: implement word_segment_naive to segment the input string\n",
"# TODO: Step 1: enumerate all segmentations in which every word exists in the dictionary;\n",
"#       there can be very many of them.\n",
"import numpy as np\n",
"# segments stores all segmentations; if the string cannot be fully segmented it stays an empty list.\n",
"# Format: segments = [[\"今天\",\"天气\",\"好\"],[\"今天\",\"天\",\"气\",\"好\"],[\"今\",\"天\",\"天气\",\"好\"],...]\n",
"dic_words = word_prob.keys()\n",
"def word_segment_all(input_str):\n",
"    segments = []  # local list: repeated calls must not accumulate earlier results\n",
"    stk = [(input_str, [])]\n",
"    while stk:\n",
"        strbase, seg = stk.pop()\n",
"        if strbase in dic_words:\n",
"            segments.append(seg + [strbase])\n",
"        for index in range(1, len(strbase)):\n",
"            if strbase[:index] in dic_words:\n",
"                stk.append((strbase[index:], seg + [strbase[:index]]))\n",
"    return segments\n",
"def word_segment_naive(input_str):\n",
"    \"\"\"\n",
"    1. Enumerate all valid segmentations of the input string.\n",
"    2. Compute the probability of each segmented sentence.\n",
"    3. Return the most probable one as the final result.\n",
"    \n",
"    input_str: input string, e.g. \"今天天气好\"\n",
"    best_segment: best segmentation, e.g. [\"今天\",\"天气\",\"好\"]\n",
"    \"\"\"\n",
"    # TODO: Step 2: loop over all segmentations and return the one with the highest probability\n",
"    best_segment = []\n",
"    best_score = np.inf\n",
"    segments = word_segment_all(input_str)\n",
"    for seg in segments:\n",
"        score = sum(-np.log(word_prob[w]) for w in seg)\n",
"        if score < best_score:\n",
"            best_score = score\n",
"            best_segment = seg\n",
"    return best_segment\n",
"# tests\n",
"print(word_segment_naive(\"今天的课程内容很有意思\"))\n",
"print(word_segment_naive(\"北京的天气真好啊\"))\n",
"print(word_segment_naive(\"今天的课程内容很有意思\"))  # note: with a shared global segments list this call returned the previous sentence's result; segments is therefore kept local to word_segment_all\n",
"print(word_segment_naive(\"经常有意见分歧\"))\n",
"\n"
]
},
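The code above scores each segmentation with a sum of negative logs rather than a product of probabilities. The reason is numeric: a product of many small unigram probabilities underflows double precision, while the log-domain sum stays in a comfortable range. A small demonstration:

```python
import math

# Multiplying many small probabilities underflows to 0.0 in float arithmetic,
# while summing their negative logs stays well-behaved.
p = 0.00001
prod = 1.0
for _ in range(100):
    prod *= p            # 1e-500 is far below the smallest positive double
neg_log_sum = 100 * -math.log(p)

print(prod)          # 0.0
print(neg_log_sum)   # ~1151.29
```

Once the product collapses to 0.0, every long segmentation looks equally (im)probable; the log-domain score keeps them distinguishable.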
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part 2: Optimizing the pipeline with the Viterbi algorithm\n",
"\n",
"Data needed for this project:\n",
"1. 综合类中文词库.xlsx: a list of Chinese words, used as the dictionary\n",
"2. Partial unigram probabilities, provided in the variable word_prob\n",
"\n",
"\n",
"Example: given the dictionary [我们 学习 人工 智能 人工智能 未来 是] and the unigram probabilities p(我们)=0.25, p(学习)=0.15, p(人工)=0.05, p(智能)=0.1, p(人工智能)=0.2, p(未来)=0.1, p(是)=0.15\n",
"\n",
"#### Step 1: Build a weighted directed graph from the dictionary, the input sentence, and word_prob (see the course material)\n",
"Each edge of the graph carries the probability of one word (any substring that appears in the dictionary is a legal word); these probabilities are given in word_prob.\n",
"Note: think about a suitable way to store this graph; there is more than one reasonable representation.\n",
"\n",
"#### Step 2: Implement the Viterbi algorithm to find the best PATH, i.e. the best segmentation of the sentence\n",
"See the course material for the algorithm details.\n",
"\n",
"#### Step 3: Return the result\n",
"Same requirements as in Part 1."
]
},
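One natural representation for Step 1 is a dict mapping each end position i to the start positions j of every dictionary word ending at i. A sketch on the toy example dictionary (positions are the gaps between characters, 0 through len(s)):

```python
# Toy dictionary from the introduction; s = 我(0)们(1)学(2)习(3)人(4)工(5)智(6)能(7).
word_prob = {"我们": 0.25, "学习": 0.15, "人工": 0.05, "智能": 0.1,
             "人工智能": 0.2, "未来": 0.1, "是": 0.15}
s = "我们学习人工智能"

# graph[i] = all j such that s[j:i] is a word, i.e. all incoming edges of node i.
graph = {i: [j for j in range(i) if s[j:i] in word_prob]
         for i in range(1, len(s) + 1)}
print(graph)
```

For example the final node has two incoming edges, `graph[8] == [4, 6]`: the word 人工智能 spanning 4..8 and the word 智能 spanning 6..8. Nodes with no incoming edge (such as position 1, since 我 alone is not in the toy dictionary) get an empty list.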
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['北京', '的', '天气', '真好啊']\n",
"['今天', '的', '课程', '内容', '很有', '意思']\n",
"['经常', '有意见', '分歧']\n"
]
}
],
"source": [
"## TODO: implement word_segment_viterbi to segment the input string\n",
"import numpy as np\n",
"dic_words = word_prob.keys()\n",
"def word_segment_viterbi(input_str):\n",
"    \"\"\"\n",
"    1. Build a DAG from the input string, the dictionary, and the given unigram probabilities.\n",
"    2. Run the Viterbi algorithm to find the optimal PATH.\n",
"    3. Return the segmentation.\n",
"    \n",
"    input_str: input string, e.g. \"今天天气好\"\n",
"    best_segment: best segmentation, e.g. [\"今天\",\"天气\",\"好\"]\n",
"    \"\"\"\n",
"    \n",
"    # TODO: Step 1: build the weighted directed graph (see the course material).\n",
"    # Each edge carries the probability of one word (any substring in the dictionary is a legal word);\n",
"    # probabilities come from word_prob, and dictionary words missing from word_prob get 0.00001.\n",
"    # Note: think about a suitable way to store this graph; more than one representation works.\n",
"    strlength = len(input_str)\n",
"    # graph[i] lists every start index j such that input_str[j:i] is a dictionary word.\n",
"    # Scan all j rather than stopping at the first miss, so longer words such as \"真好啊\"\n",
"    # are not skipped just because an intermediate substring (\"好啊\") is not a word.\n",
"    graph = {i: [j for j in range(i) if input_str[j:i] in dic_words]\n",
"             for i in range(1, strlength + 1)}\n",
"    \n",
"    # TODO: Step 2: use Viterbi to find the PATH that maximizes P(sentence), i.e. minimizes -log P(sentence).\n",
"    # hint: think about why we use the negative log sum -log p(w1)-log p(w2)-... instead of the product p(w1)p(w2)...\n",
"    mem = [0] * (strlength + 1)         # mem[i]: minimal cost of segmenting input_str[:i]\n",
"    last_index = [0] * (strlength + 1)  # last_index[i]: start of the last word on the best path to i\n",
"    for i in range(1, strlength + 1):\n",
"        min_dis = np.inf\n",
"        for j in graph[i]:\n",
"            # edge weight -log p(word); dictionary words absent from word_prob default to 0.00001\n",
"            dis = mem[j] - np.log(word_prob.get(input_str[j:i], 0.00001))\n",
"            if dis < min_dis:\n",
"                min_dis = dis\n",
"                last_index[i] = j\n",
"        mem[i] = min_dis\n",
"    \n",
"    # TODO: Step 3: recover the best segmentation by walking back along the best PATH\n",
"    best_segment = []\n",
"    j = strlength\n",
"    while j > 0:\n",
"        best_segment.append(input_str[last_index[j]:j])\n",
"        j = last_index[j]\n",
"    best_segment.reverse()\n",
"    return best_segment\n",
"\n",
"# tests\n",
"print(word_segment_viterbi(\"北京的天气真好啊\"))\n",
"print(word_segment_viterbi(\"今天的课程内容很有意思\"))\n",
"print(word_segment_viterbi(\"经常有意见分歧\"))"
]
},
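For reference, the same dynamic program can be written as a compact standalone function. This sketch hard-codes the toy dictionary from the introduction instead of the notebook's full word_prob, and uses `math` so it runs without NumPy:

```python
import math

# Toy dictionary from the introduction (a sketch; the notebook's version
# uses the word_prob built in Part 1).
word_prob = {"我们": 0.25, "学习": 0.15, "人工": 0.05, "智能": 0.1,
             "人工智能": 0.2, "未来": 0.1, "是": 0.15}

def viterbi_seg(s):
    n = len(s)
    best = [0.0] + [math.inf] * n  # best[i]: min cost of segmenting s[:i]
    back = [0] * (n + 1)           # back[i]: start of the last word on the best path
    for i in range(1, n + 1):
        for j in range(i):
            if s[j:i] in word_prob:
                cost = best[j] - math.log(word_prob[s[j:i]])
                if cost < best[i]:
                    best[i], back[i] = cost, j
    seg, i = [], n
    while i > 0:                   # walk the back-pointers from the end
        seg.append(s[back[i]:i])
        i = back[i]
    return seg[::-1]

print(viterbi_seg("我们学习人工智能"))  # ['我们', '学习', '人工智能']
```

The two inner loops visit each (j, i) pair once, so the running time is O(n²) substring lookups instead of the exponential enumeration of Part 1, and the answer agrees with the naive method on this example.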
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# TODO: What are the time and space complexities of the two methods?\n",
"Method 1 (enumeration):\n",
"time complexity = , space complexity =\n",
"\n",
"Method 2 (Viterbi):\n",
"time complexity = , space complexity ="
]
},
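To make the complexity question concrete: even with a dictionary as small as {"a", "aa"}, the number of candidate segmentations of "aa...a" satisfies the Fibonacci recurrence and grows exponentially. That is the search space the enumeration method of Part 1 walks through, while Viterbi does only O(n²) substring checks on the same input:

```python
from functools import lru_cache

# count_segs(n): number of ways to segment "a"*n with the dictionary {"a", "aa"}.
# The last word is either "a" or "aa", hence the Fibonacci recurrence.
@lru_cache(maxsize=None)
def count_segs(n):
    if n <= 1:
        return 1
    return count_segs(n - 1) + count_segs(n - 2)

print([count_segs(n) for n in range(1, 8)])  # [1, 2, 3, 5, 8, 13, 21]
print(count_segs(40))                        # 165580141 candidate segmentations
```

A 40-character string already has over 1.6 × 10⁸ candidate segmentations, which is why the dynamic-programming formulation matters in practice.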
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# TODO: If you kept optimizing this segmentation tool, what approaches could you consider? (list at least 3)\n",
"- 0. (example) the current probabilities are incomplete; estimate each word's probability from a large corpus to make them more realistic\n",
"- 1.\n",
"- 2.\n",
"- 3. "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}