Commit 3c46eefb by 20200519029

0620

parents 8d6b3f54 4ce6bc9a
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Problem 1. Fibonacci Sequence\n",
"在课程里,讨论过如果去找到第N个Fibonacci number。在这里,我们来试着求一下它的Closed-form解。 \n",
"\n",
"Fibonacci数列为 1,1,2,3,5,8,13,21,.... 也就第一个数为1,第二个数为1,以此类推...\n",
"我们用f(n)来数列里的第n个数,比如n=3时 f(3)=2。\n",
"\n",
"下面,来证明一下fibonacci数列的closed-form, 如下:\n",
"\n",
"$f(n)=\\frac{1}{\\sqrt{5}}(\\frac{1+\\sqrt{5}}{2})^n-\\frac{1}{\\sqrt{5}}(\\frac{1-\\sqrt{5}}{2})^n$\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"// your proof is here ....\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"令$F_{n}$表示第N个Fibonacci number,令$F_{0}=0$\n",
"设存在$M=\\left(\\begin{array}{ll}A & C \\\\ B & D\\end{array}\\right)$使得$\\left(\\begin{array}{c}F_{n} \\\\ F_{n+1}\\end{array}\\right)=M\\left(\\begin{array}{c}F_{n-1} \\\\ F_{n}\\end{array}\\right)$有$\\left(\\begin{array}{c}F_{n} \\\\ F_{n+1}\\end{array}\\right)=\\left(\\begin{array}{c}A F_{n-1}+C F_{n} \\\\ B F_{n-1}+D F_{n}\\end{array}\\right)$令$M=\\left(\\begin{array}{ll}0 & 1 \\\\ 1 & 1\\end{array}\\right)$可求出通项公式$\\left(\\begin{array}{c}F_{n} \\\\ F_{n+1}\\end{array}\\right)=M\\left(\\begin{array}{c}F_{n-1} \\\\ F_{n}\\end{array}\\right) \\Rightarrow\\left(\\begin{array}{c}F_{n} \\\\ F_{n+1}\\end{array}\\right)=M^{n}\\left(\\begin{array}{c}F_{0} \\\\ F_{1}\\end{array}\\right) \\Rightarrow\\left(\\begin{array}{c}F_{n} \\\\ F_{n+1}\\end{array}\\right)=P D^{n} P^{-1}\\left(\\begin{array}{c}0 \\\\ 1\\end{array}\\right)$解其特征方程$\\operatorname{det}(M-\\lambda I)=0$得$\\lambda=\\frac{1 \\pm \\sqrt{5}}{2}$特征向量$\\left(\\begin{array}{c}1 \\\\ \\frac{1 \\pm \\sqrt{5}}{2}\\end{array}\\right)$\n",
"$P^{-1}=\\left(\\begin{array}{cc}\\frac{\\sqrt{5}-1}{2 \\sqrt{5}} & \\frac{1}{\\sqrt{5}} \\\\ \\frac{\\sqrt{5}+1}{2 \\sqrt{5}} & -\\frac{1}{\\sqrt{5}}\\end{array}\\right)$\n",
"$F_{n}=(1 \\quad 0)\\left(\\begin{array}{ccc}1 & 1 \\\\ \\frac{1+\\sqrt{5}}{2} & \\frac{1-\\sqrt{5}}{2}\\end{array}\\right)\\left(\\begin{array}{cc}\\left(\\frac{1+\\sqrt{5}}{2}\\right)^{n} & 0 \\\\ 0 & \\left(\\frac{1-\\sqrt{5}}{2}\\right)^{n}\\end{array}\\right)\\left(\\begin{array}{cc}\\frac{\\sqrt{5}-1}{2 \\sqrt{5}} & \\frac{1}{\\sqrt{5}} \\\\ \\frac{\\sqrt{5}+1}{2 \\sqrt{5}} & -\\frac{1}{\\sqrt{5}}\\end{array}\\right)\\left(\\begin{array}{c}0 \\\\ 1\\end{array}\\right)$\n",
"$=(1 \\quad 0)\\left(\\begin{array}{ccc}1 & 1 \\\\ \\frac{1+\\sqrt{5}}{2} & \\frac{1-\\sqrt{5}}{2}\\end{array}\\right)\\left(\\begin{array}{cc}\\left.\\frac{1+\\sqrt{5}}{2}\\right)^{n} & 0 \\\\ 0 & \\left(\\frac{1-\\sqrt{5}}{2}\\right)^{n}\\end{array}\\right)\\left(\\begin{array}{c}\\frac{1}{\\sqrt{5}} \\\\ -\\frac{1}{\\sqrt{5}}\\end{array}\\right)$\n",
"$=\\frac{2^{-n}}{\\sqrt{5}}(1 \\quad 0)\\left(\\begin{array}{cc}1 & 1 \\\\ \\frac{1+\\sqrt{5}}{2} & \\frac{1-\\sqrt{5}}{2}\\end{array}\\right)\\left(\\begin{array}{c}(1+\\sqrt{5})^{n} \\\\ -(1-\\sqrt{5})^{n}\\end{array}\\right)=\\frac{\\left(\\frac{1+\\sqrt{5}}{2}\\right)^{n}-\\left(\\frac{1-\\sqrt{5}}{2}\\right)^{n}}{\\sqrt{5}}$故\n",
"$F_{n}=\\frac{\\left(\\frac{1+\\sqrt{5}}{2}\\right)^{n}-\\left(\\frac{1-\\sqrt{5}}{2}\\right)^{n}}{\\sqrt{5}}$"
]
},
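{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick numerical sanity check of the closed form (a sketch; `round` absorbs floating-point error, which limits the formula to roughly n < 71 in double precision):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"def fib_closed(n):\n",
"    # Binet's formula from the derivation above\n",
"    s5 = math.sqrt(5)\n",
"    return round(((1 + s5) / 2) ** n / s5 - ((1 - s5) / 2) ** n / s5)\n",
"\n",
"def fib_iter(n):\n",
"    a, b = 0, 1\n",
"    for _ in range(n):\n",
"        a, b = b, a + b\n",
"    return a\n",
"\n",
"# the two agree on small n\n",
"assert all(fib_closed(n) == fib_iter(n) for n in range(1, 40))"
]
},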
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Problem2. Algorithmic Complexity\n",
"对于下面的复杂度,从小大排一下顺序:\n",
"\n",
"$O(N), O(N^2), O(2^N), O(N\\log N), O(N!), O(1), O(\\log N), O(3^N), O(N^2\\log N), O(N^{2.1})$\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"// your answer....\n",
"\n",
"$O(1),O(\\log N), O(N), O(N\\log N),O(N^2), O(N^{2.1},O(N^2\\log N), O(2^N),O(3^N), O(N!))$\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Problem 3 Dynamic Programming Problem"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Edit Distance (编辑距离)\n",
"编辑距离用来计算两个字符串之间的最短距离,这里涉及到三个不通过的操作,add, delete和replace. 每一个操作我们假定需要1各单位的cost. \n",
"\n",
"例子: \"apple\", \"appl\" 之间的编辑距离为1 (需要1个删除的操作)\n",
"\"machine\", \"macaide\" dist = 2\n",
"\"mach\", \"aaach\" dist=2"
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The Edit Distance of 'sunday' and 'saturday' is 3.\n"
]
}
],
"source": [
"# s1, s2 are two strings\n",
"def editDistDP(s1, s2):\n",
"\n",
" m, n = len(s1), len(s2)\n",
" dp = [[0 for _ in range(n+1)] for _ in range(m+1)]\n",
"\n",
" for i in range(m + 1):\n",
" for j in range(n + 1):\n",
" # i 为0 最少操作是Insert j字符\n",
" if i == 0:\n",
" dp[i][j] = j\n",
" # j 为0 最少操作是Insert i字符\n",
" elif j == 0:\n",
" dp[i][j] = i\n",
" # 当前字符相等时不需要编辑\n",
" elif s1[i-1] == s2[j-1]:\n",
" dp[i][j] = dp[i-1][j-1]\n",
" #不相等时需要编辑,从三种操作中选取最小的加一\n",
" else:\n",
" dp[i][j] = 1 + min(dp[i][j-1], # Insert\n",
" dp[i-1][j], # Remove\n",
" dp[i-1][j-1]) # Replace \n",
" return dp[m][n]\n",
"\n",
"s1 = \"sunday\"\n",
"s2 = \"saturday\"\n",
"edit_distance = editDistDP(s1, s2)\n",
"print(\"The Edit Distance of '%s' and '%s' is %d.\"%(s1, s2, edit_distance))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Problem 4 非技术问题\n",
"本题目的目的是想再深入了解背景,之后课程的内容也会根据感兴趣的点来做适当会调整。 \n",
"\n",
"\n",
"Q1: 之前或者现在,做过哪些AI项目/NLP项目?可以适当说一下采用的解决方案,如果目前还没有想出合适的解决方案,也可以说明一下大致的想法。 请列举几个点。\n",
"前期跟着咱们训练营做过一些广告点击率预测、chatbot等项目,目前尝试使用LSTM+crf做一些词性标注项目\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"Q2: 未来想往哪个行业发展? 或者想做哪方面的项目? 请列举几个点。\n",
"医疗、金融、推荐系统等方面,想做信息检索、信息抽取、文本生成、机器翻译、问答系统的项目\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"Q3: 参加训练营,最想获得的是什么?可以列举几个点。\n",
"主要想转行,想要获得对AI/NLP领域的深入了解,能实际解决一下NLP任务,对于用到的算法能够有一些自己的思考和理解,达到NLP工程师入门级水平\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
| Date | Topic | Details | Slides | Reading | Other | Homework |
|---------|---------|---------|---------|---------|---------|---------|
| PART 0: Fundamentals review (machine learning and convex optimization) |
| May 24 (Sun) 10:30AM | (Live - Lecture 1) <br />Overview, algorithmic complexity, logistic regression and regularization | Time/space complexity analysis, <br>time and space complexity <br>of recursive programs, <br>logistic regression and regularization | [Lecture1](http://47.94.6.102/NLP7/course-info/blob/master/%E8%AF%BE%E4%BB%B6/0524%E6%A6%82%E8%AE%BA%EF%BC%8C%E7%AE%97%E6%B3%95%E5%A4%8D%E6%9D%82%E5%BA%A6%EF%BC%8C%E5%8A%A8%E6%80%81%E8%A7%84%E5%88%92%EF%BC%8CDTW%EF%BC%8C%E9%80%BB%E8%BE%91%E5%9B%9E%E5%BD%92%E4%B8%8E%E6%AD%A3%E5%88%99.pptx) | [gitlab tutorial](https://www.greedyai.com/course/46)<br/><br />[[Blog] Time complexity in ten minutes (must read)](https://www.jianshu.com/p/f4cca5ce055a)<br/><br />[[Blog] Dynamic Programming – Edit Distance Problem (must read)](https://algorithms.tutorialhorizon.com/dynamic-programming-edit-distance-problem/)<br /><br/>[[Notes] Master's Theorem (recommended)](http://people.csail.mit.edu/thies/6.046-web/master.pdf)<br/><br />[Introduction to Algorithms (MIT Press) (strongly recommended, cover to cover)](http://ressources.unisciel.fr/algoprog/s00aaroot/aa00module1/res/%5BCormen-AL2011%5DIntroduction_To_Algorithms-A3.pdf)<br/><br />[Convergence for Gradient Descent (for a challenge)](https://www.stat.cmu.edu/~ryantibs/convexopt-F13/scribes/lec6.pdf)<br/><br />[Convergence for Adagrad (for a challenge)](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf)<br/><br />[Convergence for Adam (for a challenge)](https://arxiv.org/pdf/1412.6980.pdf)<br/><br />[ElasticNet (for a challenge)](https://web.stanford.edu/~hastie/Papers/B67.2%20(2005)%20301-320%20Zou%20&%20Hastie.pdf)<br/><br />[DP Problems (must read)](https://people.cs.clemson.edu/~bcdean/dp_practice/)<br/><br /> | [How to write a summary](http://47.94.6.102/NLP7/course-info/wikis/%E5%A6%82%E4%BD%95%E5%86%99summary)<br/><br />[How to do the mini assignments?](http://47.94.6.102/NLP7/course-info/wikis/%E5%A6%82%E4%BD%95%E5%86%99%E5%B0%8F%E4%BD%9C%E4%B8%9A%EF%BC%9F) | [Mini assignment 1](http://47.94.6.102/NLP7/MiniAssignments/tree/master/homework1)<br><br>Due: May 31 (Sun)<br>23:59 Beijing time,<br>upload to gitlab |
| May 30 (Sat) 8:00PM | (Live - Paper) <br>Paper 1:<br>XGBoost: A Scalable Tree Boosting System | | [Slides](http://47.94.6.102/NLP7/course-info/blob/master/%E8%AF%BE%E4%BB%B6/0530%E7%AC%AC%E4%B8%80%E7%AF%87%E8%AE%BA%E6%96%87xgboost.pptx) | | | [Paper 1 original](http://47.94.6.102/NLP7/course-info/blob/master/%E8%AF%BE%E4%BB%B6/XGBoost-%20A%20Scalable%20Tree%20Boosting%20System.pdf)<br><br>Summary due:<br>May 31 (Sun)<br>23:59 Beijing time,<br>upload to the shared doc |
| May 31 (Sun) 10:30AM | (Live - Lecture 2) <br>Decision Tree, Random Forest, XGBoost | Tree models and the core XGBoost algorithm | [Lecture2](http://47.94.6.102/NLP7/course-info/blob/master/%E8%AF%BE%E4%BB%B6/0531Decision%20Tree%EF%BC%8Crandom%20forest%EF%BC%8Cxgboost.pptx) | | | |
| TBD | (Live - Discussion) <br>Classic data structures and algorithms | Dynamic programming, greedy algorithms | [Slides](http://47.94.6.102/NLP7/course-info/blob/master/%E8%AF%BE%E4%BB%B6/0531TBD%E5%8A%A8%E6%80%81%E8%A7%84%E5%88%92.pptx) | | | |
| May 31 (Sun) 8:00PM | (Live - Discussion) <br>Classic data structures and algorithms | Hash tables, search trees, heaps (priority heaps) | [Slides](http://47.94.6.102/NLP7/course-info/blob/master/%E8%AF%BE%E4%BB%B6/0531%E5%93%88%E5%B8%8C%E8%A1%A8%EF%BC%8C%E6%90%9C%E7%B4%A2%E6%A0%91%EF%BC%8C%E5%A0%86%EF%BC%88%E4%BC%98%E5%85%88%E5%A0%86%EF%BC%89.pptx) | | | |
| Jun 6 (Sat) 6:00PM | (Live - Discussion) <br>Ensemble models in practice | | [Slides](http://47.94.6.102/NLP7/course-info/blob/master/%E8%AF%BE%E4%BB%B6/0606%20Ensemble%20%E6%A8%A1%E5%9E%8B%E5%AE%9E%E6%88%98%20%5B%E9%98%BF%E5%8B%87%5D.pptx) | [Materials & code](http://47.94.6.102/NLP7/course-info/tree/master/%E8%AF%BE%E4%BB%B6/0606%20%20Ensemble%20%E6%A8%A1%E5%9E%8B%E5%AE%9E%E6%88%98--%E8%B5%84%E6%96%99/%E4%BB%A3%E7%A0%81) | | |
| Jun 6 (Sat) 8:00PM | (Live - Paper) <br>Paper 2:<br>From Word Embeddings To Document Distances | | [Slides](http://47.94.6.102/NLP7/course-info/blob/master/%E8%AF%BE%E4%BB%B6/0606%E7%AC%AC%E4%BA%8C%E7%AF%87%E8%AE%BA%E6%96%87From%20Word%20Embeddings%20To%20Document%20Distances.pptx) | | | [Paper 2 original](http://47.94.6.102/NLP7/course-info/blob/master/%E8%AF%BE%E4%BB%B6/2.From%20Word%20Embeddings%20To%20Document%20Distances.pdf)<br><br>Summary due:<br>Jun 7 (Sun)<br>23:59 Beijing time,<br>upload to the shared doc |
| Jun 7 (Sun) 10:30AM | (Live - Lecture 3) <br>Convex optimization (1) | Convex sets, convex functions, convexity tests, LP, QP, non-convex problems | [Lecture3](http://47.94.6.102/NLP7/course-info/blob/master/%E8%AF%BE%E4%BB%B6/%E5%87%B8%E4%BC%98%E5%8C%96%EF%BC%881%EF%BC%89.pptx) | | | [Mini assignment 2](http://47.94.6.102/NLP7/MiniAssignments/blob/master/homework2.zip)<br><br>Due: Jun 21 (Sun)<br>23:59 Beijing time,<br>upload to gitlab |
| Jun 7 (Sun) 8:00PM | (Live - Discussion) <br>Optimization problems in everyday life | | [Slides](http://47.94.6.102/NLP7/course-info/tree/master/%E8%AF%BE%E4%BB%B6/0607%E7%94%9F%E6%B4%BB%E4%B8%AD%E7%9A%84%E4%BC%98%E5%8C%96%E9%97%AE%E9%A2%98) | | | |
| Jun 13 (Sat) 6:00PM | (Live - Discussion) <br>LP, QP and their duals | | [Slides](http://47.94.6.102/NLP7/course-info/blob/master/%E8%AF%BE%E4%BB%B6/0613%20LP%20QP%E4%BB%A5%E5%8F%8A%E5%AE%83%E4%BB%AC%E7%9A%84Dual%20%5B%E9%98%BF%E5%8B%87%5D.pptx) | | | |
| Jun 13 (Sat) 8:00PM | (Live - Discussion) <br>Simplex Method and LP in practice | | [Slides<br>& code](http://47.94.6.102/NLP7/course-info/tree/master/%E8%AF%BE%E4%BB%B6/0613Simplex%20Method%E4%B8%8ELP%E5%AE%9E%E6%88%98) | | | |
| Jun 14 (Sun) 10:30AM | (Live - Lecture 4) <br>Convex optimization (2) | Duality, KKT conditions, the primal-dual of SVM | [Lecture4](http://47.94.6.102/NLP7/course-info/blob/master/%E8%AF%BE%E4%BB%B6/0614%E5%87%B8%E4%BC%98%E5%8C%96%EF%BC%882%EF%BC%89.pptx) | | | |
| Jun 14 (Sun) 8:00PM | (Live - Discussion) <br>Inventory Optimization with Stochastic Programming | | [Slides](http://47.94.6.102/NLP7/course-info/blob/master/%E8%AF%BE%E4%BB%B6/0614%20Inventory%20Optimization%20with%20Stochastic%20Programming%5B%E9%98%BF%E5%8B%87%5D.pptx) | | | |
| PART 1: NLP fundamentals |
| Jun 21 (Sun, Father's Day) 10:30AM | (Live - Lecture 5) <br>Text representation | Tokenization<br>Spelling correction<br>Stop-word filtering<br>Word normalization<br>Bag-of-words<br>Text similarity<br>Word vectors<br>Sentence vectors<br>Language models | | | | [project1](http://47.94.6.102/NLP7/course-info/blob/master/%E8%AF%BE%E4%BB%B6/project1/Project1%E9%A1%B9%E7%9B%AE.zip)<br><br>Due: Jul 1 (Wed)<br>23:59 Beijing time,<br>upload to gitlab |
| Jun 21 (Sun) 5:30PM | (Live - Discussion) <br>Survey of text-similarity techniques | Short texts<br>Long texts | | | | |
| Jun 21 (Sun) 8:00PM | (Live - Discussion) Introduction to search-engine technology | Vector-space model, inverted index, PageRank, etc. | | | | |
| TBD | (Live - Discussion) Word vectors in practice: how to use Glove and BERT vectors in your own project | | | | | |
| TBD | (Live - Discussion) Building a QA system: full pipeline, similarity matching, ranking, text preprocessing | | | | | |
| Jul 5 (Sun) 10:30AM | (Live - Lecture 6) | SkipGram (main focus), CBOW, Glove, MF, Gaussian Embedding, language models and smoothing techniques | | | | |
| TBD | (Live - Discussion) <br>SkipGram source-code walkthrough | Including optimizations such as the Huffman tree | | | | |
| TBD | (Live - Discussion) <br>Assignment 1 walkthrough | | | | | |
| TBD | (Live - Paper) [Evaluation methods for unsupervised word embeddings](https://www.aclweb.org/anthology/D15-1036.pdf) | | | | | |
| Jul 12 (Sun) 10:30AM | (Live - Lecture 7) <br>EM and HMM | EM algorithm<br>EM convergence<br>EM for Gaussian mixture models<br>Introduction to HMM<br>Probability computation in HMM<br>Learning in HMM (Baum-Welch)<br>Prediction in HMM (Viterbi) | | | | |
| TBD | (Live - Discussion) <br>Code session: building an HMM-based POS tagger | | | | | |
| TBD | (Live - Discussion) <br>Jieba segmentation: usage and internals | | | | | |
| Jul 19 (Sun) 10:30AM | (Live - Lecture 8) <br>CRF models | | | | | |
| TBD | (Live - Discussion) <br>Code session: LSTM-CRF named-entity recognition | | | | | |
| TBD | (Live - Discussion) TBD | | | | | |
| TBD | (Live - Discussion) TBD | | | | | |
| PART 2: Deep learning and pretrained models |
++ "b/\350\257\276\344\273\266/0606 Ensemble \346\250\241\345\236\213\345\256\236\346\210\230--\350\265\204\346\226\231/\344\273\243\347\240\201/.gitkeep"
++ "b/\350\257\276\344\273\266/0607\347\224\237\346\264\273\344\270\255\347\232\204\344\274\230\345\214\226\351\227\256\351\242\230/.gitkeep"
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 图的遍历"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'B', 'C', 'D', 'E', 'A', 'F'}\n"
]
}
],
"source": [
"\n",
"def BFS(graph, s):\n",
" queue = [] # \n",
" queue.append(s)\n",
" seen = set() # \n",
" while len(queue) > 0:\n",
" vertex = queue.pop(0)\n",
" nodes = graph[vertex]\n",
" for w in nodes:\n",
" if w not in seen:\n",
" queue.append(w)\n",
" seen.add(w)\n",
"\n",
" # print(vertex)\n",
"\n",
" print(seen)\n",
"\n",
"\n",
"graph = {\n",
" \"A\": [\"B\", \"C\"],\n",
" \"B\": [\"A\", \"C\", \"D\"],\n",
" \"C\": [\"A\", \"B\", \"E\", \"D\"],\n",
" \"D\": [\"B\", \"C\", \"E\", \"F\"],\n",
" \"F\": [\"D\"],\n",
" \"E\": [\"C\", \"D\"],\n",
"}\n",
"\n",
"BFS(graph, \"F\")\n",
"#\n",
"# def breadth_travel(root):\n",
"# \"\"\"利⽤队列实现树的层次遍历\"\"\"\n",
"# if root == None:\n",
"# return\n",
"# queue = []\n",
"# queue.append(root)\n",
"# while queue:\n",
"# node = queue.pop(0)\n",
"# print(node.elem)\n",
"# if node.lchild is not None:\n",
"# queue.append(node.lchild)\n",
"# if node.rchild != None:\n",
"# queue.append(node.rchild)\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
},
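{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, a minimal depth-first traversal of the same graph (a sketch using an explicit stack instead of a queue):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def DFS(graph, s):\n",
"    # iterative DFS: same structure as BFS, but pop from the end of the list\n",
"    stack = [s]\n",
"    seen = {s}\n",
"    order = []\n",
"    while stack:\n",
"        vertex = stack.pop()\n",
"        order.append(vertex)\n",
"        for w in graph[vertex]:\n",
"            if w not in seen:\n",
"                stack.append(w)\n",
"                seen.add(w)\n",
"    return order\n",
"\n",
"print(DFS(graph, 'A'))"
]
},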
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# dijkstra heap"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import heapq as hp\n",
"import math\n",
"\n",
"graph = {\n",
"\n",
" \"A\": {\"B\": 5, \"C\": 1},\n",
" \"B\": {\"A\": 5, \"C\": 2, \"D\": 1},\n",
" \"C\": {\"A\": 1, \"B\": 2, \"E\": 8, \"D\": 4},\n",
" \"D\": {\"B\": 1, \"C\": 4, \"E\": 3, \"F\": 6},\n",
" \"F\": {\"D\": 6},\n",
" \"E\": {\"C\": 8, \"D\": 3},\n",
"}\n",
"\n",
"\n",
"def init_distance(graph, s):\n",
" distance = {s: 0}\n",
" for key in graph:\n",
" if key != s:\n",
" distance[key] = math.inf\n",
" return distance\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<module 'heapq' from '/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/heapq.py'>\n",
"{'A': 0, 'B': inf, 'C': inf, 'D': inf, 'F': inf, 'E': inf}\n",
"seen: {'A'}\n",
"nodes: dict_keys(['B', 'C'])\n",
"change distance for B: {'A': 0, 'B': 5, 'C': inf, 'D': inf, 'F': inf, 'E': inf}\n",
"change distance for C: {'A': 0, 'B': 5, 'C': 1, 'D': inf, 'F': inf, 'E': inf}\n",
"seen: {'A', 'C'}\n",
"nodes: dict_keys(['A', 'B', 'E', 'D'])\n",
"change distance for B: {'A': 0, 'B': 3, 'C': 1, 'D': inf, 'F': inf, 'E': inf}\n",
"change distance for E: {'A': 0, 'B': 3, 'C': 1, 'D': inf, 'F': inf, 'E': 9}\n",
"change distance for D: {'A': 0, 'B': 3, 'C': 1, 'D': 5, 'F': inf, 'E': 9}\n",
"seen: {'A', 'B', 'C'}\n",
"nodes: dict_keys(['A', 'C', 'D'])\n",
"change distance for D: {'A': 0, 'B': 3, 'C': 1, 'D': 4, 'F': inf, 'E': 9}\n",
"seen: {'D', 'A', 'B', 'C'}\n",
"nodes: dict_keys(['B', 'C', 'E', 'F'])\n",
"change distance for E: {'A': 0, 'B': 3, 'C': 1, 'D': 4, 'F': inf, 'E': 7}\n",
"change distance for F: {'A': 0, 'B': 3, 'C': 1, 'D': 4, 'F': 10, 'E': 7}\n",
"seen: {'D', 'A', 'B', 'C'}\n",
"nodes: dict_keys(['A', 'C', 'D'])\n",
"seen: {'D', 'A', 'B', 'C'}\n",
"nodes: dict_keys(['B', 'C', 'E', 'F'])\n",
"seen: {'B', 'D', 'C', 'E', 'A'}\n",
"nodes: dict_keys(['C', 'D'])\n",
"seen: {'B', 'D', 'C', 'E', 'A'}\n",
"nodes: dict_keys(['C', 'D'])\n",
"seen: {'B', 'D', 'C', 'E', 'A', 'F'}\n",
"nodes: dict_keys(['D'])\n",
"{'A': 0, 'B': 3, 'C': 1, 'D': 4, 'F': 10, 'E': 7}\n"
]
}
],
"source": [
"def dijkstra(graph, s):\n",
" pqueue = []\n",
" hp.heappush(pqueue, (0, s)) #\n",
" print(hp)\n",
"# seen = set()\n",
" distance = init_distance(graph, s)\n",
" print(distance)\n",
" while len(pqueue) > 0:\n",
" pair = hp.heappop(pqueue)\n",
" dist = pair[0] # \n",
" node = pair[1] #\n",
"# seen.add(node)\n",
" print(\"seen: \", seen)\n",
" nodes = graph[node].keys() # \n",
" print(\"nodes: \", nodes)\n",
" #\n",
" for w in nodes:\n",
" if dist + graph[node][w] < distance[w]:\n",
" hp.heappush(pqueue, (dist + graph[node][w], w))\n",
" distance[w] = dist + graph[node][w]\n",
" print(f\"change distance for {w}: \", distance)\n",
" return distance\n",
"\n",
"\n",
"d = dijkstra(graph, \"A\")\n",
"print(d)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# dijkstra 动态规划"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0, 3, 1, 4, 7, 10]\n"
]
},
{
"data": {
"text/plain": [
"10"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Inf = float('inf')\n",
"Adjacent = [[0, 5, 1, Inf, Inf, Inf],\n",
" [5, 0, 2, 1, Inf, Inf],\n",
" [1, 2, 0, 4, 8, Inf],\n",
" [Inf, 1, 4, 0, 3, 6],\n",
" [Inf, Inf, 8, 3, 0, Inf],\n",
" [Inf, Inf, Inf, 6, Inf, 0]]\n",
"Src, Dst, N = 0, 5, 6\n",
"\n",
"\n",
"# 动态规划\n",
"def dijstra(adj, src, dst, n):\n",
" dist = [Inf] * n #\n",
" dist[src] = 0\n",
" book = [0] * n # 记录已经确定的顶点\n",
" # 每次找到起点到该点的最短途径\n",
" u = src\n",
" for _ in range(n - 1): # 找n-1次\n",
" book[u] = 1 # 已经确定\n",
" # 更新距离并记录最小距离的结点\n",
" next_u, minVal = None, float('inf')\n",
" for v in range(n): # w\n",
" w = adj[u][v]\n",
" if w == Inf: # 结点u和v之间没有边\n",
" continue\n",
" if not book[v] and dist[u] + w < dist[v]: # 判断结点是否已经确定了\n",
" dist[v] = dist[u] + w\n",
" if dist[v] < minVal:\n",
" next_u, minVal = v, dist[v]\n",
" # 开始下一轮遍历\n",
" u = next_u\n",
" print(dist)\n",
" return dist[dst]\n",
"\n",
"\n",
"dijstra(Adjacent, Src, Dst, N)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 模拟退火"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYkAAAD4CAYAAAAZ1BptAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAgAElEQVR4nO3deXxV9Z3/8dcn+0oIELaEEJbIIqJARNzaqlVxaaFq3apSa7Xzq+3Urmo7HWfame4dl446danbryMqOpVxqowittaqEERQ9oBAEhISCNnJcnO/88c90BBzQ4Ak596b9/PxuI97z/ece7+f48H7zvme5ZpzDhERke7E+V2AiIhELoWEiIiEpZAQEZGwFBIiIhKWQkJERMJK8LuAvjZixAhXUFDgdxkiIlFl9erVe51zOV3bYy4kCgoKKC4u9rsMEZGoYmY7u2vXcJOIiISlkBARkbAUEiIiEpZCQkREwup1SJhZvJmtMbOXvOkJZvaumZWY2TNmluS1J3vTJd78gk6fcafXvtnMLuzUPt9rKzGzOzq1d9uHiIgMjKPZk/gGsLHT9M+Bu51zk4H9wE1e+03Afq/9bm85zGw6cDVwIjAfeMALnnjgfuAiYDpwjbdsT32IiMgA6FVImFkecAnwiDdtwLnAEm+RJ4CF3usF3jTe/PO85RcAi51zrc65j4ASYK73KHHObXfOtQGLgQVH6ENERAZAb/ck7gG+BwS96eFArXMu4E2XAbne61ygFMCbX+ctf6i9y3vCtffUx2HM7BYzKzaz4urq6l6ukohIbCitaeYXr2yiqqGlzz/7iCFhZpcCVc651X3eex9xzj3knCtyzhXl5HzsgkERkZj2/HtlPPinbbQFgkde+Cj15orrM4HPmtnFQAowBLgXGGpmCd5f+nlAubd8OTAOKDOzBCAL2Nep/aDO7+mufV8PfYiICBAMOpasLuOMScPJy07r888/4p6Ec+5O51yec66A0IHn151zXwBWAFd4iy0CXvReL/Wm8ea/7kI/f7cUuNo7+2kCUAisBFYBhd6ZTEleH0u994TrQ0REgHc/qqFs/wE+P2fckRc+BsdzncTtwLfMrITQ8YNHvfZHgeFe+7eAOwCcc+uBZ4ENwCvArc65Dm8v4WvAMkJnTz3rLdtTHyIiAjy3upTM5AQuPHF0v3y+xdpvXBcVFTnd4E9EBoPG1gCn/strLJw1lp9eNvO4PsvMVjvnirq264prEZEo9T/rdnOgvYMr+mmoCRQSIiJRa8nqMibmpDM7f2i/9aGQEBGJQh/tbWLVjv1cMSeP0LXH/UMhISIShZasLiXO4PLZef3aj0JCRCTKdAQdL7xXzidOyGHUkJR+7UshISISZd4q2UtFXUu/XRvRmUJCRCTKPLe6jKzURM6bNrLf+1JIiIhEkboD7SxbX8mCU8aSkhjf7/0pJEREosh/r91NWyA4IENNoJAQEYkqi1ftYuroTGbkDhmQ/hQSIiJR4sPyOj4sr+eaufn9em1EZwoJEZEo8fTKXSQnxLHwlG5/f61fKCRERKJAc1uAF9/fzSUzx5CVljhg/SokRESiwEtrK2hsDXDN3PwB7VchISISBZ5etYvJIzMoGp89oP0qJEREItymynrW7Krl6lPHDdgB64MUEiIiEW7xylKS4uO4rJ9v5tcdhYSISARrae/ghffKuHDGaIalJw14/woJEZEI9vKHFdS3BLhm7sBcYd2VQkJEJII9/W4pBcPTOH3icF/6V0iIiESokqpGVu6o4apTB+4K664UEiIiEeqZVbtIiDOumDPwB6wPUkiIiESglvYOnltdxgUnjiInM9m3OhQSIiIR6KV1FdQ2t3PdvPG+1qGQEBGJQE+9s5PJIzN8O2B9kEJCRCTCrCurZW1pLdfPG+/bAeuDFBIiIhHmqbd3kpYUz+dmD9wtwcNRSIiIRJDa5jaWrt3Nwlm5DEkZuFuCh6OQEBGJIEtWl9EaCHK9zwesD1JIiIhEiGDQ8dQ7Ozm1IJtpYwbmN6yPRCEhIhIh3izZy859zb6f9tqZQkJEJEI89fZORmQkMX/GaL9LOUQhISISAcr2N/P6pj1cfWo+yQnxfpdziEJCRCQCPL1yFwDXnDawv2F9JAoJERGftbR3sHhlKedNG0Xu0FS/yzmMQkJExGdL1+5mX1MbN55R4HcpH6OQEBHxkXOOx97awZRRmZw+yd/7NHXniCFhZilmttLM1prZejP7Z699gpm9a2YlZvaMmSV57cnedIk3v6DTZ93ptW82sws7tc/32krM7I5O7d32ISISK97ZXsPGinpuPLPA9/s0dac3exKtwLnOuZOBU4D5ZjYP+Dlwt3NuMrAfuMlb/iZgv9d+t7ccZjYduBo4EZgPPGBm8WYWD9wPXARMB67xlqWHPkREYsJjb31EdloiC2f5f5+m7hwxJFxIozeZ6D0ccC6wxGt/AljovV7gTePNP89C8bgAWOyca3XOfQSUAHO9R4lzbrtzrg1YDCzw3hOuDxGRqLdrXzOvbtzDtaflk5IYOae9dtarYxLeX/zvA1XAq8A2oNY5F/AWKQMOxmAuUArgza8Dhndu7/KecO3De+ija323mFmxmRVXV1f3ZpVERHz3xNs7iDfj+nkFfpcSVq9CwjnX4Zw7Bcgj9Jf/1H6t6ig55x5yzhU554pycnL8LkdE5IgaWwM8u6qUi08aw+isFL/LCeuozm5yztUCK4DTgaFmluDNygPKvdflwDgAb34WsK9ze5f3hGvf10MfIiJRbUlxKQ2tAW48s8DvUnrUm7ObcsxsqPc6FTgf2EgoLK7wFlsEvOi9XupN481/3TnnvParvbOfJgCFwEpgFVDoncmUROjg9lLvPeH6EBGJWsGg4/G/7mBW/lBm5Wf7XU6PEo68CGOAJ7yzkOKAZ51zL5nZBmCxmf0LsAZ41Fv+UeApMysBagh96eOcW29mzwIbgABwq3OuA8DMvgYsA+KB3znn1nufdXuYPkREotaKzVXs2NfMty6Y4ncpR2ShP9hjR1FRkSsuLva7DBGRsL7wyDtsq2rizdvPITE+Mq5pNrPVzrmiru2RUZ2IyCDxYXkdb5XsY9EZBRETED2J/ApFRGLIw29uJz0pnmsj7G6v4SgkREQGSNn+Zl5aV8E1c/PJSk30u5xeUUiIiAyQ3/1lBwZ86awJfpfSawoJEZEBUNfczuJVu/jMyWMZG2G/GdEThYSIyAD4/cqdNLd1cPPZE/0u5agoJERE+llroIPH3trB2YUjmD52iN/lHBWFhIhIP3txzW6qG1r5yicm+V3KUVNIiIj0o2DQ8dCb25k+ZghnTo68X547EoWEiEg/emNLFSVVjXzlkxMj8pfnjkQhISLSjx58Yxu5Q1O5+KQxfpdyTBQSIiL9ZOVHNazasZ9bPjExKm7B0Z3orFpEJAr8+4oSRmQkcdWp4468cIRSSIiI9IN1ZbX8eUs1N501MWJ/v7o3FBIiIv3ggRXbGJKSwHXzouNGfuEoJERE+tjWPQ28sr6SL55RQ
GZKdNzILxyFhIhIH3vgjW2kJcVz45nRcyO/cBQSIiJ9aNe+Zpau3c21c/PJTk/yu5zjppAQEelD//HnbcSbcfMnoutGfuEoJERE+khlXQtLisu4oiiPUUNS/C6nTygkRET6yEN/3k6Hc/xdFN7ILxyFhIhIH6iqb+H37+7kc7NyyR+e5nc5fUYhISLSBx54YxuBoOPr5072u5Q+pZAQETlOlXUt/OfKXVw+O5fxw9P9LqdPKSRERI7Tg2+UEAw6vn5uod+l9DmFhIjIcaioO8DTK0u5Yk4e44bFzrGIgxQSIiLH4cE3thF0jlvPia1jEQcpJEREjtHu2gMsXlnK54vGxeReBCgkRESO2QNvlOBwfC3GzmjqTCEhInIMymsP8MyqUq4sGkfu0FS/y+k3CgkRkWNw32tbMYyvxuixiIMUEiIiR6mkqpHnVpdy3bzxMb0XAQoJEZGj9m+vbiY1MZ5bz4mdezSFo5AQETkK68pq+eMHlXz57IkMz0j2u5x+p5AQETkKv1y2mWHpSXz57Oj/1bneUEiIiPTSWyV7eXPrXr76qUlR/9vVvaWQEBHpBeccv1i2mbFZKVw3b7zf5QyYI4aEmY0zsxVmtsHM1pvZN7z2YWb2qplt9Z6zvXYzs/vMrMTM1pnZ7E6ftchbfquZLerUPsfMPvDec5+ZWU99iIgMtGXr97C2tJbbzj+BlMR4v8sZML3ZkwgA33bOTQfmAbea2XTgDmC5c64QWO5NA1wEFHqPW4AHIfSFD9wFnAbMBe7q9KX/IHBzp/fN99rD9SEiMmACHUF+9b+bmZSTzmWzcv0uZ0AdMSSccxXOufe81w3ARiAXWAA84S32BLDQe70AeNKFvAMMNbMxwIXAq865GufcfuBVYL43b4hz7h3nnAOe7PJZ3fUhIjJglqwuo6Sqke9cMIWE+ME1Sn9Ua2tmBcAs4F1glHOuwptVCYzyXucCpZ3eVua19dRe1k07PfTRta5bzKzYzIqrq6uPZpVERHrU1Brg169uYc74bObPGO13OQOu1yFhZhnA88Btzrn6zvO8PQDXx7Udpqc+nHMPOeeKnHNFOTk5/VmGiAwyv/3TNqobWvnBJdPwDpcOKr0KCTNLJBQQv3fOveA17/GGivCeq7z2cmBcp7fneW09ted1095THyIi/a6i7gAPvbmdS2eOYXb+4DxvpjdnNxnwKLDROfdvnWYtBQ6eobQIeLFT+w3eWU7zgDpvyGgZcIGZZXsHrC8Alnnz6s1sntfXDV0+q7s+RET63a+WbSEYhNvnT/W7FN8k9GKZM4HrgQ/M7H2v7fvAz4BnzewmYCdwpTfvj8DFQAnQDNwI4JyrMbMfA6u85X7knKvxXn8VeBxIBV72HvTQh4hIv/qwvI4X1pRxy9kTY/YHhXrDQkP9saOoqMgVFxf7XYaIRDHnHNc+/C6bKut547vnkJUa+1dXm9lq51xR1/bBdS6XiEgvLN9Yxdvb93Hbp08YFAHRE4WEiEgn7R1BfvLyRibmpHPtafl+l+M7hYSISCdP/HUH26ub+P5F00gcZBfOdUf/BUREPNUNrdz72lY+eUIO500b6Xc5EUEhISLi+cUrm2gJdPCPn5k+KC+c645CQkQEWLNrP8+tLuNLZ05gUk6G3+VEDIWEiAx6waDjn5auZ2RmMl8/r9DvciKKQkJEBr3nVpeytqyOOy+eSkZyb64xHjwUEiIyqNUdaOcXr2ymaHw2C08ZXL8V0RuKTBEZ1O55bQs1zW088dm5OljdDe1JiMigtWF3PU++vZNr5+YzIzfL73IikkJCRAaljqDj+//1AdlpiXz3wil+lxOxFBIiMij957s7eb+0ln+4ZDpD05L8LidiKSREZNCpqm/hF69s5qzJI1hwyli/y4loCgkRGXR+9NIGWjuC/HjhDB2sPgKFhIgMKm9sruKldRV87ZzJTBiR7nc5EU8hISKDxoG2Dn744odMzEnnK5+c6Hc5UUHXSYjIoHHv8q2U1hzg6ZvnkZwQ73c5UUF7EiIyKKwrq+XhN7dzZVEep08a7nc5UUMhISIxry0Q5HtL1jEiI4kfXDLd73KiioabRCTm3b+ihE2VDTy6qGjQ/2b10dKehIjEtI0V9dy/ooSFp4zlvGmj/C4n6igkRCRmBTqCfHfJWoamJXLXZ070u5yopOEmEYlZD725nQ/L63ngC7PJTtetN46F9iREJCZt2dPAPa9t5eKTRnPxSWP8LidqKSREJOa0BYLctvh9MpMT+OfPzvC7nKim4SYRiTn3Lt/Chop6Hrp+DjmZyX6XE9W0JyEiMaV4Rw0PvrGNq4rGccGJo/0uJ+opJEQkZjS2BvjWs2vJzU7lh5/RRXN9QcNNIhIz/uWlDZTtb+bZr5xORrK+3vqC9iREJCa8umEPi1eV8nefnERRwTC/y4kZCgkRiXqVdS3c/vw6po8Zwm2fPsHvcmKKQkJEolpH0HHbM2toae/gN9fOIilBX2t9SYN2IhLV7l9Rwjvba/jV509mUk6G3+XEHEWuiEStd7fv457XtvC5WblcPjvX73JikkJCRKLS/qY2vrH4fcYPT+fHC2dgZn6XFJM03CQiUcc5x3eXrKWmqY0XFp2h01370RH3JMzsd2ZWZWYfdmobZmavmtlW7znbazczu8/MSsxsnZnN7vSeRd7yW81sUaf2OWb2gfee+8z7cyBcHyIij7z5Ea9trOLOi6cyIzfL73JiWm+Gmx4H5ndpuwNY7pwrBJZ70wAXAYXe4xbgQQh94QN3AacBc4G7On3pPwjc3Ol984/Qh4gMYm9v28fPXtnERTNG88UzCvwuJ+YdMSScc38Garo0LwCe8F4/ASzs1P6kC3kHGGpmY4ALgVedczXOuf3Aq8B8b94Q59w7zjkHPNnls7rrQ0QGqcq6Fr7+9HsUDE/jl58/WcchBsCxHrge5Zyr8F5XAgd/EzAXKO20XJnX1lN7WTftPfXxMWZ2i5kVm1lxdXX1MayOiES6tkCQW//zPZrbOvjt9XN0HGKAHPfZTd4egOuDWo65D+fcQ865IudcUU5OTn+WIiI++ckfN7J6535+ccVMJo/M9LucQeNYQ2KPN1SE91zltZcD4zotl+e19dSe1017T32IyCDzhzXlPP7XHXz5rAlcOnOs3+UMKscaEkuBg2coLQJe7NR+g3eW0zygzhsyWgZcYGbZ3gHrC4Bl3rx6M5vnndV0Q5fP6q4PERlE1pXVcvvz65hbMIzbL5rqdzmDzhEH9czsaeBTwAgzKyN0ltLPgGfN7CZgJ3Clt/gfgYuBEqAZuBHAOVdjZj8GVnnL/cg5d/Bg+FcJnUGVCrzsPeihDxEZJPbUt3Dzk8WMyEjmgetmkxiv638HmoWG+2NHUVGRKy4u9rsMETlOLe0dXPXbt9la1cjz/+8Mpo0Z4ndJMc3MVjvnirq26/QAEYk4zjm+t2Qd68rr+O11cxQQPtK+m4hEnAfe2MbStbv5zgVT9DvVPlNIiEhEefmDCn65bDMLTxnLVz81ye9yBj2FhIhEjOIdNdz2zPvMzh/Kzy6fqSuqI4BC
QkQiQklVI19+spjcoak8suhUUhLj/S5JUEiISASoamjhi4+tJCHOePzGuQxLT/K7JPHo7CYR8VVja4AvPb6KmqY2Ft8yj/zhaX6XJJ0oJETEN22BILf+/j02VjTwyA1FzMwb6ndJ0oWGm0TEFx1BxzefeZ8/banmJ5+bwTlTR/pdknRDISEiAy4YdNz5wjr+54MKfnDxNK46Nd/vkiQMhYSIDCjnHD/+nw08W1zG359XyM2fmOh3SdIDhYSIDKi7X9vKY2/t4EtnTuCbny70uxw5AoWEiAyY//jTNu5bvpWrisbxw0un6WK5KKCzm0RkQNy/ooRfLtvMZ04ey08uO0kBESUUEiLS736zfCu/fnULC04Zy68/fzLxcQqIaKGQEJF+dc9rW7jnta1cNiuXXyogoo5CQkT6hXOOu1/byn3Lt3LFnDx+fvlMBUQUUkiISJ8LBh3/+seNPPqXj7iyKI+fXTaTOAVEVFJIiEifau8IcvuSdbywppwvnlHAP146XQERxRQSItJnWto7uPX377F8UxXfPv8EvnbuZJ3FFOUUEiLSJ+oOtHPzE8Ws2lnDjxfO4Pp54/0uSfqAQkJEjlt57QFuenwV26ob+c01s7h05li/S5I+opAQkeOytrSWm54opjXQwWNfnMtZhSP8Lkn6kEJCRI7ZKx9WcNsz7zMiI5mnbz6NwlGZfpckfUwhISJHzTnHw29u56cvb+LkvKE8fEMROZnJfpcl/UAhISJH5UBbBz/4rw94YU05l5w0hl9feTIpifF+lyX9RCEhIr1WWtPMV55azcbKem77dCF/f26hroGIcQoJEemVN7dW8/Wn19ARdDy6qIhzp47yuyQZAAoJEelRMOh48E/b+PX/bqZwZCa/vX4OBSPS/S5LBohCQkTCqmpo4VvPrOUvJXu5dOYYfn75TNKT9bUxmGhri0i33thcxbefXUtTW4CfXnYSV586TrfYGIQUEiJymLZAkF8u28TDb37E1NGZLL5mnq5/GMQUEiJyyIfldXznubVsqmzg+nnj+cEl03R66yCnkBAR2gJB/n1FCQ+sKGFYehKPLirivGk6e0kUEiKD3vrddXznuXVsrKjnslm53PWZE8lKS/S7LIkQCgmRQaqxNcA9r27hsb/uYFh6Eg/fUMT507X3IIdTSIgMMs45Xv6wkh/99wYq61u4Zu44bp8/laFpSX6XJhEo4kPCzOYD9wLxwCPOuZ/5XJJI1PpobxP/tHQ9f9pSzbQxQ7j/C7OZMz7b77IkgkV0SJhZPHA/cD5QBqwys6XOuQ3+ViYSXfY3tXHv8q38/3d2kpwQxw8vnc6i08eTEB/nd2kS4SI6JIC5QIlzbjuAmS0GFgB9HhKP/uUjNlfWEx8XR2K8kRAXR0K8kRBnJCfEk54cT0ZyAhkpCaQnJ5CZHHrOTktiWHoSSQn6n00iT0t7B0++vYPfvF5CU2uAq07N55vnFzIyM8Xv0iRKRHpI5AKlnabLgNO6LmRmtwC3AOTn5x9TRxsr6vnL1r0Ego5AMEigo/OzO+L7M1MSGJ6exPCMZO85iZGZKYwdmsKYrNRDz7qlgQyE9o4g//VeOfe9vpWy/Qf41JQc7rxoGlNG66I4OTox8Y3lnHsIeAigqKjoyN/o3fjV508OOy/QEaSprYPG1gBNrQEaWwM0toSea5vb2dfYyr6mttCjsZVdNc28t2s/+5racF2qGZKScCg0xg9PZ/zwNAq857zsNO2RyHE5GA6/WbGV0poDnJSbxU8vO4mzC3P8Lk2iVKSHRDkwrtN0ntc2oBLi48hKjSMr9ejOHW8LBNlT30JFXQsVdQfYXfu35921B1j5UQ1NbR2Hlo8zyM1OPRQaBcPTmTwyg8kjMxiblar79ktYLe0d/GFNOQ+8sY1dNc2clJvFPy06kXOnjtT9luS4RHpIrAIKzWwCoXC4GrjW35J6LykhjnHD0hg3LK3b+c459jW1sXNfEzv2Noee94Wel76/m/qWwKFl05LiQ4GRk8HkURkUjsxk8sgM8oelEa/wGLRqmtp46u2dPPXODvY2tjEjdwiP3FDEedMUDtI3IjoknHMBM/sasIzQKbC/c86t97msPmNmjMhIZkRGMnPGD/vY/JqmNkqqGtla1cDWPY1sq27kr9v28cKav+1MJSXEMXFEOlNGZzJ19BCmjslk6uhMRg9J0ZdEDNtUWc+Tb+/k+dVltAaCnDMlh5vPnsjpk4Zru0ufMtd10DzKFRUVueLiYr/L6Ff1Le1sq2pka1VjKET2NLC5soHddS2HlslKTWTq6EymjRniBUgmU0ZnkpYU0X8XSA+a2wK8tK6Cp1fuYs2uWpIS4vjcKbl8+ewJukurHDczW+2cK+rarm+MKDQkJZFZ+dnMyj/8Iqi65nY272lgU2U9GytCz88Vlx467mEG44elMXV0KDimjQntfeQPS9PxjgjlnOO9Xfv5w5rd/GFNOQ2tASblpPMPl0zj8tl5ZKfrKmnpXwqJGJKVlsjcCcOYO+FvQ1fBoKNs/wE2VtazuTIUHJsqGli2ofLQmVepifGcMCqDKaMzmTJ6yKG9jhEZyT6tyeDmnGNjRQNL1+7mv9fuprz2AMkJcVxy0hiuOS2fovHZGlKSAaPhpkHqQFsHW6sa2FTRcChANlc2sK+p7dAyIzKSmDI6kxNGHRyuGsIJozI0ZNUPAh1B1pTW8trGPby2YQ/bqpuIjzPOLhzBZ08ey/nTR5GZojuzSv/RcJMcJjUpnpl5Q5mZN/Sw9uqG1lBg7Glgsxcei1eWcqD9b0NW+cPSOgVH6LlgeLpu8XCUqhpaeHvbPt7YXM2KzVXUNreTEGecNnEYN545gYtPGsMwDSeJzxQScpiczGRyMpM5q3DEobZg0LGrptkLjoZDw1bLN+7h4MXoSfFxTMxJZ1JOBhNz0kOPEaHX+gs4pLqhlVU7anh72z7+um0v26qbAMhOS+TcKSM5b9oozj5hBEP030siiIab5Ji1tHdQUtXIFi88tuxpYPveJkprmul8J5ORmclecGQwcUQ6E0akM25YGnnZqTE7dFV3oJ315XWsLatjbWkt68pqD519lpYUz6kFwzh90nBOnzicE8cO0V6Y+E7DTdLnUhLjmZGbxYzcrMPaWwMd7NrXzLbqJrbvbWR7dRPbqxv54wcV1Da3H7bs8PQk8rzAyMtOZVx26PXorBRGZqaQnZYYsQdpO4KOPfUtlO0/cOh6ltApyY1U1v/tdOTxw9OYUzCML+VlMSs/m5l5WSQqFCRKKCSkzyUnxFM4KrPbc/drmtr4aG8TZfubKdt/4NDz+vI6/nd9Je0dh+/ZJsYbORnJ5AxJYWRmMiO94bCs1ESyUhMZkpJIVpr3nJpIZkoCKYnxx3QVunOOlvYgzW0Bmts6Qvflamqlpqnt0KOqoZWy/c2U1x6gorblsJs/pibGUzgqgzMmD2fyyAxOHJvFzNwsnaYqUU0hIQNqWHro1urd/dBNMOjY0xD6y7yyroXqhlaqGlqpagi93rWvmeIdNezvsjfSndAt3uNISogjOSGe5MQ4DHCAcxB
07tApwK2BDprbOjjQ3vGxGzJ2/cwRGcnkZqcyOz+b3Jmp5GankpedxqScdN1fS2KSQkIiRlycMSYrlTFZqT0uF+gI0tASoO5AO/Ut7aHnA6HphpZ2WgNBWgMdtLYHaQ0EaQsEaQmEAsAM4swwAAPDSE6MIy0xntSk0CMtMZ60pASGpiUeCrXh6ckMSU2I2KEvkf6ikJCokxAfR3Z6koZxRAaAjp6JiEhYCgkREQlLISEiImEpJEREJCyFhIiIhKWQEBGRsBQSIiISlkJCRETCirm7wJpZNbDzGN8+Atjbh+VEA63z4KB1HhyOZ53HO+dyujbGXEgcDzMr7u5WubFM6zw4aJ0Hh/5YZw03iYhIWAoJEREJSyFxuIf8LsAHWufBQes8OPT5OuuYhIiIhKU9CRERCUshISIiYSkkPGY238w2m1mJmd3hdz19zczGmdkKM9tgZuvN7Bte+zAze/S0dBIAAANtSURBVNXMtnrPH/9d0ShnZvFmtsbMXvKmJ5jZu962fsbMYurXi8xsqJktMbNNZrbRzE6P9e1sZt/0/l1/aGZPm1lKrG1nM/udmVWZ2Yed2rrdrhZyn7fu68xs9rH2q5Ag9CUC3A9cBEwHrjGz6f5W1ecCwLedc9OBecCt3jreASx3zhUCy73pWPMNYGOn6Z8DdzvnJgP7gZt8qar/3Au84pybCpxMaN1jdjubWS7w90CRc24GEA9cText58eB+V3awm3Xi4BC73EL8OCxdqqQCJkLlDjntjvn2oDFwAKfa+pTzrkK59x73usGQl8cuYTW8wlvsSeAhf5U2D/MLA+4BHjEmzbgXGCJt0hMrbOZZQGfAB4FcM61OedqifHtTOinmFPNLAFIAyqIse3snPszUNOlOdx2XQA86ULeAYaa2Zhj6VchEZILlHaaLvPaYpKZFQCzgHeBUc65Cm9WJTDKp7L6yz3A94CgNz0cqHXOBbzpWNvWE4Bq4DFviO0RM0snhrezc64c+BWwi1A41AGrie3tfFC47dpn32kKiUHGzDKA54HbnHP1nee50PnQMXNOtJldClQ551b7XcsASgBmAw8652YBTXQZWorB7ZxN6C/nCcBYIJ2PD8vEvP7argqJkHJgXKfpPK8tpphZIqGA+L1z7gWvec/B3VDvucqv+vrBmcBnzWwHoSHEcwmN1w/1hiUg9rZ1GVDmnHvXm15CKDRieTt/GvjIOVftnGsHXiC07WN5Ox8Ubrv22XeaQiJkFVDonQ2RROig11Kfa+pT3lj8o8BG59y/dZq1FFjkvV4EvDjQtfUX59ydzrk851wBoW36unPuC8AK4ApvsVhb50qg1MymeE3nARuI4e1MaJhpnpmlef/OD65zzG7nTsJt16XADd5ZTvOAuk7DUkdFV1x7zOxiQuPX8cDvnHP/6nNJfcrMzgLeBD7gb+Pz3yd0XOJZIJ/QLdavdM51PTgW9czsU8B3nHOXmtlEQnsWw4A1wHXOuVY/6+tLZnYKoQP1ScB24EZCfxDG7HY2s38GriJ0Ft8a4MuExuBjZjub2dPApwjdDnwPcBfwB7rZrl5Y/juhYbdm4EbnXPEx9auQEBGRcDTcJCIiYSkkREQkLIWEiIiEpZAQEZGwFBIiIhKWQkJERMJSSIiISFj/B5UhF9RDA2I4AAAAAElFTkSuQmCC\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"最小值 400\n",
"最优值 40.0 -32154.0\n"
]
}
],
"source": [
"from __future__ import division\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import math\n",
" \n",
"#define aim function\n",
"def aimFunction(x):\n",
" y=x**3-60*x**2-4*x+6\n",
" return y\n",
"x=[i/10 for i in range(1000)]\n",
"y=[0 for i in range(1000)]\n",
"for i in range(1000):\n",
" y[i]=aimFunction(x[i])\n",
"\n",
"plt.plot(x,y)\n",
"plt.show()\n",
"\n",
"print('最小值',y.index(min(y))) \n",
"print(\"最优值\",x[400], min(y))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"39.69856448101894 -32147.369845045607\n"
]
}
],
"source": [
"T=1000 #initiate temperature\n",
"Tmin=10 #minimum value of terperature\n",
"x=np.random.uniform(low=0,high=100)#initiate x\n",
"k=50 #times of internal circulation \n",
"y=0#initiate result\n",
"t=0#time\n",
"while T>=Tmin:\n",
" for i in range(k):\n",
" #calculate y\n",
" y=aimFunction(x)\n",
" #generate a new x in the neighboorhood of x by transform function\n",
" xNew=x+np.random.uniform(low=-0.055,high=0.055)*T\n",
" if (0<=xNew and xNew<=100):\n",
" yNew=aimFunction(xNew)\n",
" if yNew-y<0:\n",
" x=xNew\n",
" else:\n",
" #metropolis principle\n",
" p=math.exp(-(yNew-y)/T)\n",
" r=np.random.uniform(low=0,high=1)\n",
" if r<p:\n",
" x=xNew\n",
" t+=1\n",
"# print(t)\n",
" T=1000/(1+t)\n",
" \n",
"print (x,aimFunction(x))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
++ "b/\350\257\276\344\273\266/0613Simplex Method\344\270\216LP\345\256\236\346\210\230/.gitkeep"
'''
Original problem:
With a budget of 2000 yuan, you need to buy desks at 50 yuan each and chairs at 20 yuan each.
You want the total number of desks and chairs to be as large as possible, but the number of
chairs must be no less than the number of desks, and no more than 1.5 times the number of
desks. What purchasing plan do you need?

Solution: buy x1 desks and x2 chairs.
max z = x1 + x2
s.t. x1 - x2 <= 0
     x2 <= 1.5 * x1
     50*x1 + 20*x2 <= 2000
     x1, x2 >= 0

In Python, linear programs like this can be solved with scipy's LP solver:
scipy.optimize.linprog(c, A_ub=None, b_ub=None, A_eq=None, b_eq=None,
                       bounds=None, method='simplex', callback=None,
                       options=None) -> OptimizeResult
A_ub: coefficient matrix of the inequality constraints (left-hand side)
b_ub: right-hand side of the inequality constraints
c: coefficients of the objective function
'''
from scipy import optimize as opt
import numpy as np
# parameters
# c: objective-function coefficients
c = np.array([1, 1])
# a: coefficient matrix of the inequality constraints
a = np.array([[1, -1], [-1.5, 1], [50, 20]])
# b: right-hand side of the inequality constraints
b = np.array([0, 0, 2000])
# a1, b1 would hold equality constraints; this example has none, so they are omitted
# a1 = np.array([[1, 1, 1]])
# b1 = np.array([7])
# variable bounds
lim1 = (0, None)  # (0, None) means [0, +infinity)
lim2 = (0, None)
# call the solver; linprog minimizes, so negate c to maximize x1 + x2
ans = opt.linprog(-c, a, b, bounds=(lim1, lim2))
# print the result
print(ans)
# Note: chairs must be whole, so from the relaxed optimum (25, 37.5) we buy 37 chairs
import numpy as np
class Simplex(object):
    def __init__(self, obj, max_mode=False):
        self.max_mode = max_mode  # minimizes by default; for a max problem the objective is negated
        self.mat = np.array([[0] + obj]) * (-1 if max_mode else 1)  # first matrix row holds the objective
    def add_constraint(self, a, b):
        self.mat = np.vstack([self.mat, [b] + a])  # append one constraint row
    def solve(self):
        m, n = self.mat.shape  # one objective row and m - 1 constraints, so add m - 1 slack variables
        temp, B = np.vstack([np.zeros((1, m - 1)), np.eye(m - 1)]), list(range(n - 1, n + m - 1))  # temp: slack columns; B: indices of the current basis
        mat = self.mat = np.hstack([self.mat, temp])  # append the slack columns
        while mat[0, 1:].min() < 0:  # while the objective row still has a negative coefficient
            col = np.where(mat[0, 1:] < 0)[0][0] + 1  # entering variable: first negative coefficient
            row = np.array([mat[i][0] / mat[i][col] if mat[i][col] > 0 else 0x7fffffff for i in
                            range(1, mat.shape[0])]).argmin() + 1  # leaving variable: row with the tightest ratio
            if mat[row][col] <= 0: return None  # no leaving variable: the problem is unbounded
            mat[row] /= mat[row][col]  # normalize the pivot row
            ids = np.arange(mat.shape[0]) != row
            mat[ids] -= mat[row] * mat[ids, col:col + 1]  # eliminate the pivot column from every other row
            B[row] = col  # record the entering variable in the basis
        return mat[0][0] * (1 if self.max_mode else -1), {B[i]: mat[i, 0] for i in range(1, m) if B[i] < n}  # objective value; basic variables equal their b_i, non-basic variables are 0
"""
minimize -x1 - 14x2 - 6x3
st
x1 + x2 + x3 <=4
x1 <= 2
x3 <= 3
3x2 + x3 <= 6
x1 ,x2 ,x3 >= 0
answer :-32
"""
t = Simplex([-1, -14, -6])
t.add_constraint([1, 1, 1], 4)
t.add_constraint([1, 0, 0], 2)
t.add_constraint([0, 0, 1], 3)
t.add_constraint([0, 3, 1], 6)
print(t.solve())
print(t.mat)
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## 搭建一个简单的问答系统 (Building a Simple QA System)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"本次项目的目标是搭建一个基于检索式的简易的问答系统,这是一个最经典的方法也是最有效的方法。 \n",
"\n",
"```不要单独创建一个文件,所有的都在这里面编写,不要试图改已经有的函数名字 (但可以根据需求自己定义新的函数)```\n",
"\n",
"```预估完成时间```: 5-10小时"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 检索式的问答系统\n",
"问答系统所需要的数据已经提供,对于每一个问题都可以找得到相应的答案,所以可以理解为每一个样本数据是 ``<问题、答案>``。 那系统的核心是当用户输入一个问题的时候,首先要找到跟这个问题最相近的已经存储在库里的问题,然后直接返回相应的答案即可(但实际上也可以抽取其中的实体或者关键词)。 举一个简单的例子:\n",
"\n",
"假设我们的库里面已有存在以下几个<问题,答案>:\n",
"- <\"贪心学院主要做什么方面的业务?”, “他们主要做人工智能方面的教育”>\n",
"- <“国内有哪些做人工智能教育的公司?”, “贪心学院”>\n",
"- <\"人工智能和机器学习的关系什么?\", \"其实机器学习是人工智能的一个范畴,很多人工智能的应用要基于机器学习的技术\">\n",
"- <\"人工智能最核心的语言是什么?\", ”Python“>\n",
"- .....\n",
"\n",
"假设一个用户往系统中输入了问题 “贪心学院是做什么的?”, 那这时候系统先去匹配最相近的“已经存在库里的”问题。 那在这里很显然是 “贪心学院是做什么的”和“贪心学院主要做什么方面的业务?”是最相近的。 所以当我们定位到这个问题之后,直接返回它的答案 “他们主要做人工智能方面的教育”就可以了。 所以这里的核心问题可以归结为计算两个问句(query)之间的相似度。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 项目中涉及到的任务描述\n",
"问答系统看似简单,但其中涉及到的内容比较多。 在这里先做一个简单的解释,总体来讲,我们即将要搭建的模块包括:\n",
"\n",
"- 文本的读取: 需要从相应的文件里读取```(问题,答案)```\n",
"- 文本预处理: 清洗文本很重要,需要涉及到```停用词过滤```等工作\n",
"- 文本的表示: 如果表示一个句子是非常核心的问题,这里会涉及到```tf-idf```, ```Glove```以及```BERT Embedding```\n",
"- 文本相似度匹配: 在基于检索式系统中一个核心的部分是计算文本之间的```相似度```,从而选择相似度最高的问题然后返回这些问题的答案\n",
"- 倒排表: 为了加速搜索速度,我们需要设计```倒排表```来存储每一个词与出现的文本\n",
"- 词义匹配:直接使用倒排表会忽略到一些意思上相近但不完全一样的单词,我们需要做这部分的处理。我们需要提前构建好```相似的单词```然后搜索阶段使用\n",
"- 拼写纠错:我们不能保证用户输入的准确,所以第一步需要做用户输入检查,如果发现用户拼错了,我们需要及时在后台改正,然后按照修改后的在库里面搜索\n",
"- 文档的排序: 最后返回结果的排序根据文档之间```余弦相似度```有关,同时也跟倒排表中匹配的单词有关\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 项目中需要的数据:\n",
"1. ```dev-v2.0.json```: 这个数据包含了问题和答案的pair, 但是以JSON格式存在,需要编写parser来提取出里面的问题和答案。 \n",
"2. ```glove.6B```: 这个文件需要从网上下载,下载地址为:https://nlp.stanford.edu/projects/glove/, 请使用d=200的词向量\n",
"3. ```spell-errors.txt``` 这个文件主要用来编写拼写纠错模块。 文件中第一列为正确的单词,之后列出来的单词都是常见的错误写法。 但这里需要注意的一点是我们没有给出他们之间的概率,也就是p(错误|正确),所以我们可以认为每一种类型的错误都是```同等概率```\n",
"4. ```vocab.txt``` 这里列了几万个英文常见的单词,可以用这个词库来验证是否有些单词被拼错\n",
"5. ```testdata.txt``` 这里搜集了一些测试数据,可以用来测试自己的spell corrector。这个文件只是用来测试自己的程序。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在本次项目中,你将会用到以下几个工具:\n",
"- ```sklearn```。具体安装请见:http://scikit-learn.org/stable/install.html sklearn包含了各类机器学习算法和数据处理工具,包括本项目需要使用的词袋模型,均可以在sklearn工具包中找得到。 \n",
"- ```jieba```,用来做分词。具体使用方法请见 https://github.com/fxsjy/jieba\n",
"- ```bert embedding```: https://github.com/imgarylai/bert-embedding\n",
"- ```nltk```:https://www.nltk.org/index.html"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 第一部分:对于训练数据的处理:读取文件和预处理"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- ```文本的读取```: 需要从文本中读取数据,此处需要读取的文件是```dev-v2.0.json```,并把读取的文件存入一个列表里(list)\n",
"- ```文本预处理```: 对于问题本身需要做一些停用词过滤等文本方面的处理\n",
"- ```可视化分析```: 对于给定的样本数据,做一些可视化分析来更好地理解数据"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1.1节: 文本的读取\n",
"把给定的文本数据读入到```qlist```和```alist```当中,这两个分别是列表,其中```qlist```是问题的列表,```alist```是对应的答案列表"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def read_corpus():\n",
" \"\"\"\n",
" 读取给定的语料库,并把问题列表和答案列表分别写入到 qlist, alist 里面。 在此过程中,不用对字符换做任何的处理(这部分需要在 Part 2.3里处理)\n",
" qlist = [\"问题1\", “问题2”, “问题3” ....]\n",
" alist = [\"答案1\", \"答案2\", \"答案3\" ....]\n",
" 务必要让每一个问题和答案对应起来(下标位置一致)\n",
" \"\"\"\n",
" # TODO 需要完成的代码部分 ...\n",
" \n",
" \n",
" \n",
" assert len(qlist) == len(alist) # 确保长度一样\n",
" return qlist, alist"
]
},
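{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible implementation — a sketch only, assuming ``dev-v2.0.json`` follows the standard SQuAD 2.0 layout (``data -> paragraphs -> qas``); unanswerable questions are skipped:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"def read_corpus_sketch(path='dev-v2.0.json'):\n",
"    # sketch: walk the SQuAD 2.0 JSON tree and collect (question, answer) pairs\n",
"    qlist, alist = [], []\n",
"    with open(path, encoding='utf-8') as f:\n",
"        data = json.load(f)['data']\n",
"    for article in data:\n",
"        for para in article['paragraphs']:\n",
"            for qa in para['qas']:\n",
"                if qa.get('answers'):  # skip unanswerable questions\n",
"                    qlist.append(qa['question'])\n",
"                    alist.append(qa['answers'][0]['text'])\n",
"    assert len(qlist) == len(alist)\n",
"    return qlist, alist"
]
},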
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1.2 理解数据(可视化分析/统计信息)\n",
"对数据的理解是任何AI工作的第一步, 需要对数据有个比较直观的认识。在这里,简单地统计一下:\n",
"\n",
"- 在```qlist```出现的总单词个数\n",
"- 按照词频画一个```histogram``` plot"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO: 统计一下在qlist中总共出现了多少个单词? 总共出现了多少个不同的单词(unique word)?\n",
"# 这里需要做简单的分词,对于英文我们根据空格来分词即可,其他过滤暂不考虑(只需分词)\n",
"\n",
"print (word_total)"
]
},
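{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the count, assuming ``qlist`` comes from ``read_corpus()`` and whitespace tokenization suffices:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"word_cnt = Counter(w for q in qlist for w in q.split())\n",
"word_total = sum(word_cnt.values())\n",
"print(word_total, len(word_cnt))  # total tokens, unique words"
]
},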
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO: 统计一下qlist中出现1次,2次,3次... 出现的单词个数, 然后画一个plot. 这里的x轴是单词出现的次数(1,2,3,..), y轴是单词个数。\n",
"# 从左到右分别是 出现1次的单词数,出现2次的单词数,出现3次的单词数... \n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO: 从上面的图中能观察到什么样的现象? 这样的一个图的形状跟一个非常著名的函数形状很类似,能所出此定理吗? \n",
"# hint: [XXX]'s law\n",
"# \n",
"# "
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"#### 1.3 文本预处理\n",
"此部分需要做文本方面的处理。 以下是可以用到的一些方法:\n",
"\n",
"- 1. 停用词过滤 (去网上搜一下 \"english stop words list\",会出现很多包含停用词库的网页,或者直接使用NLTK自带的) \n",
"- 2. 转换成lower_case: 这是一个基本的操作 \n",
"- 3. 去掉一些无用的符号: 比如连续的感叹号!!!, 或者一些奇怪的单词。\n",
"- 4. 去掉出现频率很低的词:比如出现次数少于10,20.... (想一下如何选择阈值)\n",
"- 5. 对于数字的处理: 分词完只有有些单词可能就是数字比如44,415,把所有这些数字都看成是一个单词,这个新的单词我们可以定义为 \"#number\"\n",
"- 6. lemmazation: 在这里不要使用stemming, 因为stemming的结果有可能不是valid word。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO: 需要做文本方面的处理。 从上述几个常用的方法中选择合适的方法给qlist做预处理(不一定要按照上面的顺序,不一定要全部使用)\n",
"\n",
"qlist = # 更新后的问题列表"
]
},
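{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of one reasonable pipeline, using the NLTK stop-word list and lemmatizer; the frequency threshold of 10 is an arbitrary choice:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"from collections import Counter\n",
"from nltk.corpus import stopwords\n",
"from nltk.stem import WordNetLemmatizer\n",
"\n",
"stops = set(stopwords.words('english'))\n",
"lemmatizer = WordNetLemmatizer()\n",
"\n",
"def preprocess(text, keep_vocab=None):\n",
"    tokens = []\n",
"    for w in text.lower().split():\n",
"        w = re.sub(r'[^a-z0-9]', '', w)  # drop punctuation and odd symbols\n",
"        if not w or w in stops:\n",
"            continue\n",
"        w = '#number' if w.isdigit() else lemmatizer.lemmatize(w)\n",
"        if keep_vocab is None or w in keep_vocab:\n",
"            tokens.append(w)\n",
"    return ' '.join(tokens)\n",
"\n",
"# first pass counts words; second pass drops words seen fewer than 10 times\n",
"cnt = Counter(w for q in qlist for w in preprocess(q).split())\n",
"keep = {w for w, c in cnt.items() if c >= 10}\n",
"qlist = [preprocess(q, keep) for q in qlist]"
]
},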
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 第二部分: 文本的表示\n",
"当我们做完必要的文本处理之后就需要想办法表示文本了,这里有几种方式\n",
"\n",
"- 1. 使用```tf-idf vector```\n",
"- 2. 使用embedding技术如```word2vec```, ```bert embedding```等\n",
"\n",
"下面我们分别提取这三个特征来做对比。 "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2.1 使用tf-idf表示向量\n",
"把```qlist```中的每一个问题的字符串转换成```tf-idf```向量, 转换之后的结果存储在```X```矩阵里。 ``X``的大小是: ``N* D``的矩阵。 这里``N``是问题的个数(样本个数),\n",
"``D``是词典库的大小"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO \n",
"vectorizer = # 定义一个tf-idf的vectorizer\n",
"\n",
"X_tfidf = # 结果存放在X矩阵里"
]
},
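{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch with sklearn's ``TfidfVectorizer``:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"vectorizer = TfidfVectorizer()            # fit on the preprocessed questions\n",
"X_tfidf = vectorizer.fit_transform(qlist) # sparse N x D matrix\n",
"print(X_tfidf.shape)"
]
},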
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2.2 使用wordvec + average pooling\n",
"词向量方面需要下载: https://nlp.stanford.edu/projects/glove/ (请下载``glove.6B.zip``),并使用``d=200``的词向量(200维)。国外网址如果很慢,可以在百度上搜索国内服务器上的。 每个词向量获取完之后,即可以得到一个句子的向量。 我们通过``average pooling``来实现句子的向量。 "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO 基于Glove向量获取句子向量\n",
"emb = # 这是 D*H的矩阵,这里的D是词典库的大小, H是词向量的大小。 这里面我们给定的每个单词的词向量,\n",
" # 这需要从文本中读取\n",
" \n",
"X_w2v = # 初始化完emb之后就可以对每一个句子来构建句子向量了,这个过程使用average pooling来实现\n"
]
},
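{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of loading ``glove.6B.200d.txt`` and average pooling; words without a GloVe vector are simply skipped (an assumption, not the only reasonable choice):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"H = 200\n",
"emb = {}\n",
"with open('glove.6B.200d.txt', encoding='utf-8') as f:\n",
"    for line in f:\n",
"        parts = line.rstrip().split(' ')\n",
"        emb[parts[0]] = np.asarray(parts[1:], dtype='float32')\n",
"\n",
"def sentence_vector(sent):\n",
"    # average pooling over the words that have a GloVe vector\n",
"    vecs = [emb[w] for w in sent.split() if w in emb]\n",
"    return np.mean(vecs, axis=0) if vecs else np.zeros(H, dtype='float32')\n",
"\n",
"X_w2v = np.vstack([sentence_vector(q) for q in qlist])"
]
},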
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2.3 使用BERT + average pooling\n",
"最近流行的BERT也可以用来学出上下文相关的词向量(contex-aware embedding), 在很多问题上得到了比较好的结果。在这里,我们不做任何的训练,而是直接使用已经训练好的BERT embedding。 具体如何训练BERT将在之后章节里体会到。 为了获取BERT-embedding,可以直接下载已经训练好的模型从而获得每一个单词的向量。可以从这里获取: https://github.com/imgarylai/bert-embedding , 请使用```bert_12_768_12```\t当然,你也可以从其他source获取也没问题,只要是合理的词向量。 "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO 基于BERT的句子向量计算\n",
"\n",
"X_bert = # 每一个句子的向量结果存放在X_bert矩阵里。行数为句子的总个数,列数为一个句子embedding大小。 "
]
},
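{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch using the ``bert-embedding`` package linked above with the ``bert_12_768_12`` model; the call signature follows that project's README, so treat the exact API as an assumption:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from bert_embedding import BertEmbedding\n",
"\n",
"bert = BertEmbedding(model='bert_12_768_12')\n",
"results = bert(qlist)  # one (tokens, per-token vectors) pair per sentence\n",
"# average pooling over the 768-dimensional token vectors\n",
"X_bert = np.vstack([np.mean(vecs, axis=0) for _, vecs in results])"
]
},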
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"### 第三部分: 相似度匹配以及搜索\n",
"在这部分里,我们需要把用户每一个输入跟知识库里的每一个问题做一个相似度计算,从而得出最相似的问题。但对于这个问题,时间复杂度其实很高,所以我们需要结合倒排表来获取相似度最高的问题,从而获得答案。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 3.1 tf-idf + 余弦相似度\n",
"我们可以直接基于计算出来的``tf-idf``向量,计算用户最新问题与库中存储的问题之间的相似度,从而选择相似度最高的问题的答案。这个方法的复杂度为``O(N)``, ``N``是库中问题的个数。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_top_results_tfidf_noindex(query):\n",
" # TODO 需要编写\n",
" \"\"\"\n",
" 给定用户输入的问题 query, 返回最有可能的TOP 5问题。这里面需要做到以下几点:\n",
" 1. 对于用户的输入 query 首先做一系列的预处理(上面提到的方法),然后再转换成tf-idf向量(利用上面的vectorizer)\n",
" 2. 计算跟每个库里的问题之间的相似度\n",
" 3. 找出相似度最高的top5问题的答案\n",
" \"\"\"\n",
" \n",
" top_idxs = [] # top_idxs存放相似度最高的(存在qlist里的)问题的下标 \n",
" # hint: 请使用 priority queue来找出top results. 思考为什么可以这么做? \n",
" \n",
" return alist[top_idxs] # 返回相似度最高的问题对应的答案,作为TOP5答案 "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO: 编写几个测试用例,并输出结果\n",
"print (get_top_results_tfidf_noindex(\"\"))\n",
"print (get_top_results_tfidf_noindex(\"\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"你会发现上述的程序很慢,没错! 是因为循环了所有库里的问题。为了优化这个过程,我们需要使用一种数据结构叫做```倒排表```。 使用倒排表我们可以把单词和出现这个单词的文档做关键。 之后假如要搜索包含某一个单词的文档,即可以非常快速的找出这些文档。 在这个QA系统上,我们首先使用倒排表来快速查找包含至少一个单词的文档,然后再进行余弦相似度的计算,即可以大大减少```时间复杂度```。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 3.2 倒排表的创建\n",
"倒排表的创建其实很简单,最简单的方法就是循环所有的单词一遍,然后记录每一个单词所出现的文档,然后把这些文档的ID保存成list即可。我们可以定义一个类似于```hash_map```, 比如 ``inverted_index = {}``, 然后存放包含每一个关键词的文档出现在了什么位置,也就是,通过关键词的搜索首先来判断包含这些关键词的文档(比如出现至少一个),然后对于candidates问题做相似度比较。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO 请创建倒排表\n",
"inverted_idx = {} # 定一个一个简单的倒排表,是一个map结构。 循环所有qlist一遍就可以"
]
},
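{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch: map each word to the set of question indices it occurs in."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import defaultdict\n",
"\n",
"inverted_idx = defaultdict(set)\n",
"for i, q in enumerate(qlist):\n",
"    for w in q.split():\n",
"        inverted_idx[w].add(i)  # word -> indices of questions containing it"
]
},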
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 3.3 语义相似度\n",
"这里有一个问题还需要解决,就是语义的相似度。可以这么理解: 两个单词比如car, auto这两个单词长得不一样,但从语义上还是类似的。如果只是使用倒排表我们不能考虑到这些单词之间的相似度,这就导致如果我们搜索句子里包含了``car``, 则我们没法获取到包含auto的所有的文档。所以我们希望把这些信息也存下来。那这个问题如何解决呢? 其实也不难,可以提前构建好相似度的关系,比如对于``car``这个单词,一开始就找好跟它意思上比较类似的单词比如top 10,这些都标记为``related words``。所以最后我们就可以创建一个保存``related words``的一个``map``. 比如调用``related_words['car']``即可以调取出跟``car``意思上相近的TOP 10的单词。 \n",
"\n",
"那这个``related_words``又如何构建呢? 在这里我们仍然使用``Glove``向量,然后计算一下俩俩的相似度(余弦相似度)。之后对于每一个词,存储跟它最相近的top 10单词,最终结果保存在``related_words``里面。 这个计算需要发生在离线,因为计算量很大,复杂度为``O(V*V)``, V是单词的总数。 \n",
"\n",
"这个计算过程的代码请放在``related.py``的文件里,然后结果保存在``related_words.txt``里。 我们在使用的时候直接从文件里读取就可以了,不用再重复计算。所以在此notebook里我们就直接读取已经计算好的结果。 作业提交时需要提交``related.py``和``related_words.txt``文件,这样在使用的时候就不再需要做这方面的计算了。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO 读取语义相关的单词\n",
"def get_related_words(file):\n",
" \n",
" return related_words\n",
"\n",
"related_words = get_related_words('related_words.txt') # 直接放在文件夹的根目录下,不要修改此路径。"
]
},
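{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of what ``related.py`` might do offline: cosine similarity over the GloVe matrix. Since the full O(V*V) similarity matrix is memory-heavy, the sketch restricts it to the corpus vocabulary:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def build_related_words(emb, vocab, topk=10):\n",
"    # emb: word -> vector dict; vocab: the words to consider\n",
"    words = [w for w in vocab if w in emb]\n",
"    M = np.vstack([emb[w] for w in words])\n",
"    M = M / np.linalg.norm(M, axis=1, keepdims=True)  # unit rows, so dot product = cosine\n",
"    sims = M @ M.T\n",
"    related = {}\n",
"    for i, w in enumerate(words):\n",
"        order = np.argsort(-sims[i])[1:topk + 1]  # skip the word itself\n",
"        related[w] = [words[j] for j in order]\n",
"    return related"
]
},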
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 3.4 利用倒排表搜索\n",
"在这里,我们使用倒排表先获得一批候选问题,然后再通过余弦相似度做精准匹配,这样一来可以节省大量的时间。搜索过程分成两步:\n",
"\n",
"- 使用倒排表把候选问题全部提取出来。首先,对输入的新问题做分词等必要的预处理工作,然后对于句子里的每一个单词,从``related_words``里提取出跟它意思相近的top 10单词, 然后根据这些top词从倒排表里提取相关的文档,把所有的文档返回。 这部分可以放在下面的函数当中,也可以放在外部。\n",
"- 然后针对于这些文档做余弦相似度的计算,最后排序并选出最好的答案。\n",
"\n",
"可以适当定义自定义函数,使得减少重复性代码"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_top_results_tfidf(query):\n",
" \"\"\"\n",
" 给定用户输入的问题 query, 返回最有可能的TOP 5问题。这里面需要做到以下几点:\n",
" 1. 利用倒排表来筛选 candidate (需要使用related_words). \n",
" 2. 对于候选文档,计算跟输入问题之间的相似度\n",
" 3. 找出相似度最高的top5问题的答案\n",
" \"\"\"\n",
" \n",
" top_idxs = [] # top_idxs存放相似度最高的(存在qlist里的)问题的下表 \n",
" # hint: 利用priority queue来找出top results. 思考为什么可以这么做? \n",
" \n",
" return alist[top_idxs] # 返回相似度最高的问题对应的答案,作为TOP5答案"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_top_results_w2v(query):\n",
" \"\"\"\n",
" 给定用户输入的问题 query, 返回最有可能的TOP 5问题。这里面需要做到以下几点:\n",
" 1. 利用倒排表来筛选 candidate (需要使用related_words). \n",
" 2. 对于候选文档,计算跟输入问题之间的相似度\n",
" 3. 找出相似度最高的top5问题的答案\n",
" \"\"\"\n",
" \n",
" top_idxs = [] # top_idxs存放相似度最高的(存在qlist里的)问题的下表 \n",
" # hint: 利用priority queue来找出top results. 思考为什么可以这么做? \n",
" \n",
" return alist[top_idxs] # 返回相似度最高的问题对应的答案,作为TOP5答案"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_top_results_bert(query):\n",
" \"\"\"\n",
" 给定用户输入的问题 query, 返回最有可能的TOP 5问题。这里面需要做到以下几点:\n",
" 1. 利用倒排表来筛选 candidate (需要使用related_words). \n",
" 2. 对于候选文档,计算跟输入问题之间的相似度\n",
" 3. 找出相似度最高的top5问题的答案\n",
" \"\"\"\n",
" \n",
" top_idxs = [] # top_idxs存放相似度最高的(存在qlist里的)问题的下表 \n",
" # hint: 利用priority queue来找出top results. 思考为什么可以这么做? \n",
" \n",
" return alist[top_idxs] # 返回相似度最高的问题对应的答案,作为TOP5答案"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO: 编写几个测试用例,并输出结果\n",
"\n",
"test_query1 = \"\"\n",
"test_query2 = \"\"\n",
"\n",
"print (get_top_results_tfidf(test_query1))\n",
"print (get_top_results_w2v(test_query1))\n",
"print (get_top_results_bert(test_query1))\n",
"\n",
"print (get_top_results_tfidf(test_query2))\n",
"print (get_top_results_w2v(test_query2))\n",
"print (get_top_results_bert(test_query2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4. 拼写纠错\n",
"其实用户在输入问题的时候,不能期待他一定会输入正确,有可能输入的单词的拼写错误的。这个时候我们需要后台及时捕获拼写错误,并进行纠正,然后再通过修正之后的结果再跟库里的问题做匹配。这里我们需要实现一个简单的拼写纠错的代码,然后自动去修复错误的单词。\n",
"\n",
"这里使用的拼写纠错方法是课程里讲过的方法,就是使用noisy channel model。 我们回想一下它的表示:\n",
"\n",
"$c^* = \\text{argmax}_{c\\in candidates} ~~p(c|s) = \\text{argmax}_{c\\in candidates} ~~p(s|c)p(c)$\n",
"\n",
"这里的```candidates```指的是针对于错误的单词的候选集,这部分我们可以假定是通过edit_distance来获取的(比如生成跟当前的词距离为1/2的所有的valid 单词。 valid单词可以定义为存在词典里的单词。 ```c```代表的是正确的单词, ```s```代表的是用户错误拼写的单词。 所以我们的目的是要寻找出在``candidates``里让上述概率最大的正确写法``c``。 \n",
"\n",
"$p(s|c)$,这个概率我们可以通过历史数据来获得,也就是对于一个正确的单词$c$, 有百分之多少人把它写成了错误的形式1,形式2... 这部分的数据可以从``spell_errors.txt``里面找得到。但在这个文件里,我们并没有标记这个概率,所以可以使用uniform probability来表示。这个也叫做channel probability。\n",
"\n",
"$p(c)$,这一项代表的是语言模型,也就是假如我们把错误的$s$,改造成了$c$, 把它加入到当前的语句之后有多通顺?在本次项目里我们使用bigram来评估这个概率。 举个例子: 假如有两个候选 $c_1, c_2$, 然后我们希望分别计算出这个语言模型的概率。 由于我们使用的是``bigram``, 我们需要计算出两个概率,分别是当前词前面和后面词的``bigram``概率。 用一个例子来表示:\n",
"\n",
"给定: ``We are go to school tomorrow``, 对于这句话我们希望把中间的``go``替换成正确的形式,假如候选集里有个,分别是``going``, ``went``, 这时候我们分别对这俩计算如下的概率:\n",
"$p(going|are)p(to|going)$和 $p(went|are)p(to|went)$, 然后把这个概率当做是$p(c)$的概率。 然后再跟``channel probability``结合给出最终的概率大小。\n",
"\n",
"那这里的$p(are|going)$这些bigram概率又如何计算呢?答案是训练一个语言模型! 但训练一个语言模型需要一些文本数据,这个数据怎么找? 在这次项目作业里我们会用到``nltk``自带的``reuters``的文本类数据来训练一个语言模型。当然,如果你有资源你也可以尝试其他更大的数据。最终目的就是计算出``bigram``概率。 "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.1 训练一个语言模型\n",
"在这里,我们使用``nltk``自带的``reuters``数据来训练一个语言模型。 使用``add-one smoothing``"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from nltk.corpus import reuters\n",
"\n",
"# 读取语料库的数据\n",
"categories = reuters.categories()\n",
"corpus = reuters.sents(categories=categories)\n",
"\n",
"# 循环所有的语料库并构建bigram probability. bigram[word1][word2]: 在word1出现的情况下下一个是word2的概率。 \n",
"\n",
"\n"
]
},
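{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of an add-one-smoothed bigram probability over the Reuters sentences (counts kept in dictionaries rather than a nested map, which is equivalent):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"unigram, bigram_cnt = Counter(), Counter()\n",
"for sent in corpus:\n",
"    words = [w.lower() for w in sent]\n",
"    unigram.update(words)\n",
"    bigram_cnt.update(zip(words[:-1], words[1:]))\n",
"\n",
"V = len(unigram)\n",
"\n",
"def bigram_prob(w1, w2):\n",
"    # add-one smoothing: (count(w1 w2) + 1) / (count(w1) + V)\n",
"    return (bigram_cnt[(w1, w2)] + 1) / (unigram[w1] + V)"
]
},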
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.2 构建Channel Probs\n",
"基于``spell_errors.txt``文件构建``channel probability``, 其中$channel[c][s]$表示正确的单词$c$被写错成$s$的概率。 "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO 构建channel probability \n",
"channel = {}\n",
"\n",
"for line in open('spell-errors.txt'):\n",
" # TODO\n",
"\n",
"# TODO\n",
"\n",
"print(channel) "
]
},
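{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the parsing, assuming the common ``correct: wrong1, wrong2, ...`` line format of ``spell-errors.txt`` and a uniform probability over each word's misspellings:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"channel = {}\n",
"with open('spell-errors.txt') as f:\n",
"    for line in f:\n",
"        correct, wrongs = line.strip().split(':')\n",
"        wrongs = [w.strip() for w in wrongs.split(',')]\n",
"        # uniform channel probability p(s|c) over the listed misspellings\n",
"        channel[correct.strip()] = {w: 1.0 / len(wrongs) for w in wrongs}"
]
},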
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.3 根据错别字生成所有候选集合\n",
"给定一个错误的单词,首先生成跟这个单词距离为1或者2的所有的候选集合。 这部分的代码我们在课程上也讲过,可以参考一下。 "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def generate_candidates(word):\n",
" # 基于拼写错误的单词,生成跟它的编辑距离为1或者2的单词,并通过词典库的过滤。\n",
" # 只留写法上正确的单词。 \n",
" \n",
" \n"
]
},
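{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch in the style of Norvig's spell corrector: generate all strings at edit distance 1, apply the step twice for distance 2, and keep only dictionary words. ``vocab`` is assumed to be the set of words loaded from ``vocab.txt``:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import string\n",
"\n",
"def edits1(word):\n",
"    letters = string.ascii_lowercase\n",
"    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]\n",
"    deletes = [L + R[1:] for L, R in splits if R]\n",
"    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]\n",
"    inserts = [L + c + R for L, R in splits for c in letters]\n",
"    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]\n",
"    return set(deletes + replaces + inserts + transposes)\n",
"\n",
"def generate_candidates(word):\n",
"    # candidates at edit distance 1 or 2, filtered by the dictionary\n",
"    e1 = edits1(word)\n",
"    e2 = set(e for c in e1 for e in edits1(c))\n",
"    return (e1 | e2) & vocab"
]
},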
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.4 给定一个输入,如果有错误需要纠正\n",
"\n",
"给定一个输入``query``, 如果这里有些单词是拼错的,就需要把它纠正过来。这部分的实现可以简单一点: 对于``query``分词,然后把分词后的每一个单词在词库里面搜一下,假设搜不到的话可以认为是拼写错误的! 人如果拼写错误了再通过``channel``和``bigram``来计算最适合的候选。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def spell_corrector(line):\n",
" # 1. 首先做分词,然后把``line``表示成``tokens``\n",
" # 2. 循环每一token, 然后判断是否存在词库里。如果不存在就意味着是拼写错误的,需要修正。 \n",
" # 修正的过程就使用上述提到的``noisy channel model``, 然后从而找出最好的修正之后的结果。 \n",
" \n",
" return newline # 修正之后的结果,假如用户输入没有问题,那这时候``newline = line``\n"
]
},
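{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the correction step for a single token, scoring each candidate with its log channel probability plus the two surrounding bigram log-probabilities. ``prev``/``nxt`` are the neighboring tokens (or None at sentence edges), and the tiny probability floor for unseen channel pairs is an assumption:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"def correct_token(token, prev, nxt):\n",
"    best, best_score = token, -math.inf\n",
"    for c in generate_candidates(token):\n",
"        # p(s|c): uniform channel probability, with a small floor for unseen pairs\n",
"        p_sc = channel.get(c, {}).get(token, 1e-10)\n",
"        score = math.log(p_sc)\n",
"        if prev is not None:\n",
"            score += math.log(bigram_prob(prev, c))\n",
"        if nxt is not None:\n",
"            score += math.log(bigram_prob(c, nxt))\n",
"        if score > best_score:\n",
"            best, best_score = c, score\n",
"    return best"
]
},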
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.5 基于拼写纠错算法,实现用户输入自动矫正\n",
"首先有了用户的输入``query``, 然后做必要的处理把句子转换成tokens的形状,然后对于每一个token比较是否是valid, 如果不是的话就进行下面的修正过程。 "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_query1 = \"\" # 拼写错误的\n",
"test_query2 = \"\" # 拼写错误的\n",
"\n",
"test_query1 = spell_corector(test_query1)\n",
"test_query2 = spell_corector(test_query2)\n",
"\n",
"print (get_top_results_tfidf(test_query1))\n",
"print (get_top_results_w2v(test_query1))\n",
"print (get_top_results_bert(test_query1))\n",
"\n",
"print (get_top_results_tfidf(test_query2))\n",
"print (get_top_results_w2v(test_query2))\n",
"print (get_top_results_bert(test_query2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 附录 \n",
"在本次项目中我们实现了一个简易的问答系统。基于这个项目,我们其实可以有很多方面的延伸。\n",
"- 在这里,我们使用文本向量之间的余弦相似度作为了一个标准。但实际上,我们也可以基于基于包含关键词的情况来给一定的权重。比如一个单词跟related word有多相似,越相似就意味着相似度更高,权重也会更大。 \n",
"- 另外 ,除了根据词向量去寻找``related words``也可以提前定义好同义词库,但这个需要大量的人力成本。 \n",
"- 在这里,我们直接返回了问题的答案。 但在理想情况下,我们还是希望通过问题的种类来返回最合适的答案。 比如一个用户问:“明天北京的天气是多少?”, 那这个问题的答案其实是一个具体的温度(其实也叫做实体),所以需要在答案的基础上做进一步的抽取。这项技术其实是跟信息抽取相关的。 \n",
"- 对于词向量,我们只是使用了``average pooling``, 除了average pooling,我们也还有其他的经典的方法直接去学出一个句子的向量。\n",
"- 短文的相似度分析一直是业界和学术界一个具有挑战性的问题。在这里我们使用尽可能多的同义词来提升系统的性能。但除了这种简单的方法,可以尝试其他的方法比如WMD,或者适当结合parsing相关的知识点。 "
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"好了,祝你好运! "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}