Commit cbc4c641 by 20200116044

LogisticRegression-ML homework.ipynb

parent bc33f884
{
"cells": [
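{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Logistic Regression\n",
"This assignment practices using logistic regression to classify text data. By completing it you will learn: 1. how to call logistic regression for classification; 2. how to classify text data; 3. how to evaluate a model."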
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```Do not create a separate file; write everything in this notebook (after each TODO). Do not rename the existing functions (you may define new functions as needed).```"
]
},
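{
"cell_type": "markdown",
"metadata": {},
"source": [
"Logistic regression, also known as logistic regression analysis, is a generalized linear regression model commonly used in data mining, automated disease diagnosis, economic forecasting and other fields. For example, it can be used to explore the risk factors that cause a disease and to predict the probability of the disease from those risk factors. Take the analysis of stomach cancer as an example: select two groups of people, one with stomach cancer and one without; the two groups will necessarily differ in physical signs and lifestyle. The dependent variable is therefore whether a person has stomach cancer, with value \"yes\" or \"no\", while the independent variables can be many, such as age, sex, dietary habits and Helicobacter pylori infection. Independent variables can be continuous or categorical. Logistic regression analysis then yields a weight for each independent variable, which gives a rough idea of which factors are risk factors for stomach cancer. The same weights can also be used to predict, from the risk factors, how likely a person is to develop cancer."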
]
},
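{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this project you will use the following tools:\n",
"- ```sklearn```. For installation see http://scikit-learn.org/stable/install.html sklearn bundles many machine learning algorithms and data processing tools, including the bag-of-words model needed in this project. \n",
"- ```pandas```, a data processing library: https://pandas.pydata.org/pandas-docs/stable/\n",
"- ```matplotlib```, a plotting library for drawing charts; in this assignment it is used to visualize the model evaluation metrics: www.matplotlib.org"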
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Reading the file\n",
"Load the text data and explore it."
]
},
{
"cell_type": "code",
"execution_count": 132,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of spam messages: 747\n",
"Number of ham messages: 4825\n"
]
}
],
"source": [
"# Import the libraries needed\n",
"import pandas as pd\n",
"# Read the spam data and count the spam and ham messages\n",
"## TODO: use pandas' read_csv() to read the spam CSV file\n",
"smsDir = './SMSSpamCollection.csv' \n",
"df = pd.read_csv(smsDir)\n",
"\n",
"# Explore the data\n",
"#print(df.head())\n",
"# Summing the boolean mask gives one count per class instead of a per-column Series\n",
"print(\"Number of spam messages: %s\" % (df['label']=='spam').sum())\n",
"print(\"Number of ham messages: %s\" % (df['label']=='ham').sum())\n",
"#print(df['content'])\n",
"#df[df['label']=='ham'] = '1'\n",
"#df[df['label']=='spam'] = '0'"
]
},
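{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Preparing the training data\n",
"Split the data into training data, test data, training labels and test labels, and convert the text into numeric features.\n",
"The data used here is for spam classification. It has two columns: the first is the label (ham for non-spam, spam for spam), and the messages to classify are English text."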
]
},
{
"cell_type": "code",
"execution_count": 133,
"metadata": {},
"outputs": [],
"source": [
"# Import train_test_split (splits data into training and test sets) and cross_val_score (computes cross-validated scores such as accuracy) from sklearn\n",
"from sklearn.model_selection import train_test_split,cross_val_score\n",
"\n",
"# Convert the raw CSV columns to Unicode string arrays\n",
"y = df['label'].values.astype('U')\n",
"x = df['content'].values.astype('U')\n",
"## TODO: use train_test_split() to split the data into training and test sets\n",
"X_train_raw,X_test_raw,y_train,y_test = train_test_split(x, y)"
]
},
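{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the split step on a made-up toy data set (`toy_texts`, `toy_labels` and the `tr_`/`te_` names are hypothetical and not part of the assignment). It assumes you may want two optional `train_test_split` arguments that the cell above does not use: `stratify`, which keeps the ham/spam ratio the same in both parts, and `random_state`, which makes the split repeatable."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Made-up toy data: six ham and two spam messages\n",
"toy_texts = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']\n",
"toy_labels = ['ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'spam']\n",
"\n",
"# Half of the samples go to the test set; each half keeps the 3:1 ham/spam ratio\n",
"tr_x, te_x, tr_y, te_y = train_test_split(\n",
"    toy_texts, toy_labels, test_size=0.5, stratify=toy_labels, random_state=0)\n",
"print(sorted(tr_y))\n",
"print(sorted(te_y))"
]
},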
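{
"cell_type": "markdown",
"metadata": {},
"source": [
"TF-IDF is a statistical method for assessing how important a word is to one document within a document collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to how frequently it appears across the corpus. Various forms of TF-IDF weighting are often used by search engines as a measure or ranking of the relevance between a document and a user query. Besides TF-IDF, Internet search engines also use ranking methods based on link analysis to decide the order in which documents appear in search results.\n",
"For details see Baidu Baike: https://baike.baidu.com/item/tf-idf/8816134?fr=aladdin"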
]
},
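{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the idea above concrete, here is a minimal sketch on a made-up three-sentence corpus (`toy_corpus`, `toy_vectorizer` and `toy_matrix` are hypothetical names, not part of the assignment data). It fits `TfidfVectorizer` and prints the learned IDF weight of each word; words that occur in more documents should receive a smaller IDF."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"# Made-up corpus: 'call' and 'later' each appear in two documents, the other words in one\n",
"toy_corpus = ['free prize call now', 'call me later', 'see you later']\n",
"toy_vectorizer = TfidfVectorizer()\n",
"toy_matrix = toy_vectorizer.fit_transform(toy_corpus)\n",
"\n",
"# vocabulary_ maps each word to its column index; idf_ holds one weight per column\n",
"for word, index in sorted(toy_vectorizer.vocabulary_.items()):\n",
"    print(word, round(toy_vectorizer.idf_[index], 3))\n",
"\n",
"# One row per document, one column per distinct word\n",
"print(toy_matrix.shape)"
]
},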
{
"cell_type": "code",
"execution_count": 134,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['ham' 'ham' 'ham' ... 'ham' 'spam' 'ham']\n"
]
}
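],
"source": [
"# Import the TF-IDF text feature extractor from sklearn\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"# A model cannot compute on raw text directly, so the text has to be turned into numbers\n",
"## TODO: use TfidfVectorizer from sklearn.feature_extraction.text to convert the text into TF-IDF features\n",
"# Fit the vocabulary on the training text only, then reuse it for the test text\n",
"vectorizer = TfidfVectorizer()\n",
"X_train = vectorizer.fit_transform(X_train_raw)\n",
"X_test = vectorizer.transform(X_test_raw)\n",
"\n",
"print(y_train)"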
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Training the model"
]
},
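{
"cell_type": "markdown",
"metadata": {},
"source": [
"Logistic regression is a generalized linear model, so it has much in common with multiple linear regression. Their model forms are basically the same: both contain w'x+b, where w and b are the parameters to be estimated. They differ in the dependent variable: multiple linear regression uses w'x+b directly as the dependent variable, i.e. y = w'x+b, whereas logistic regression maps w'x+b through a function L to a hidden state p, p = L(w'x+b), and then decides the value of the dependent variable by comparing p with 1-p. If L is the logistic function, this is logistic regression; if L is a polynomial function, it is polynomial regression.\n",
"The dependent variable in logistic regression can be binary or multi-class, but the binary case is more common and easier to interpret; the multi-class case can be handled with softmax. In practice, binary logistic regression is the most widely used."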
]
},
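{
"cell_type": "markdown",
"metadata": {},
"source": [
"The mapping p = L(w'x+b) described above can be sketched in a few lines of NumPy. The weights `w`, the bias `b` and the input `x_example` below are hypothetical values chosen only for illustration; a real model learns w and b from the training data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def logistic(z):\n",
"    # Logistic (sigmoid) function: maps any real number into the interval (0, 1)\n",
"    return 1.0 / (1.0 + np.exp(-z))\n",
"\n",
"# Hypothetical parameters and input, for illustration only\n",
"w = np.array([0.8, -1.2])\n",
"b = 0.1\n",
"x_example = np.array([1.0, 0.5])\n",
"\n",
"# p is the hidden state; the predicted class is whichever of p and 1-p is larger\n",
"p = logistic(w @ x_example + b)\n",
"label = 1 if p > 1 - p else 0\n",
"print(p, label)"
]
},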
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Train on X_train and y_train, and test on X_test and y_test."
]
},
{
"cell_type": "code",
"execution_count": 135,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Predicted: ham, message: Hi hope u get this txt~journey hasnt been gdnow about 50 mins late I think.\n",
"Predicted: ham, message: Super msg da:)nalla timing.\n",
"Predicted: ham, message: Eat at old airport road... But now 630 oredi... Got a lot of pple...\n",
"Predicted: ham, message: Some are lasting as much as 2 hours. You might get lucky.\n",
"Predicted: ham, message: K..k.:)congratulation ..\n"
]
}
],
"source": [
"# Import logistic regression from sklearn (the public sklearn.linear_model path; the private .logistic submodule is not available in newer releases)\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"LR = LogisticRegression()\n",
"## TODO: train the model with LR.fit(); the first argument is the training features, the second the training labels\n",
"LR.fit(X_train, y_train)\n",
"## TODO: predict with LR.predict(); the argument is the features to be predicted\n",
"predictions = LR.predict(X_test)\n",
"# Print the first five predictions next to the raw messages\n",
"for i, prediction in enumerate(predictions[:5]):\n",
"    print(\"Predicted: %s, message: %s\" % (prediction, X_test_raw[i]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Evaluating the model\n",
"After training, measure the model's performance with binary classification metrics and the ROC curve."
]
},
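{
"cell_type": "code",
"execution_count": 136,
"metadata": {},
"outputs": [],
"source": [
"# Import matplotlib, which is needed for plotting\n",
"import matplotlib\n",
"# Use the SimHei font so that Chinese text in plots renders, and keep minus signs displaying correctly\n",
"matplotlib.rcParams['font.sans-serif']=[u'simHei']\n",
"matplotlib.rcParams['axes.unicode_minus']=False"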
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.1 Confusion matrix"
]
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAQsAAAD0CAYAAACM5gMqAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAFVRJREFUeJzt3XuQHWWZx/HvL0PCJUEIO1wFjMHgiqsBjJAoYMISAWFRKFhRZEtZCl0tlb25Iljuuiwi660KFjBUQBYtIKKgKEjES23AICaCoKKiLnFBUUJiQlAumXn2j7fHHE5m5rw9OZ0+PfP7VHXNOW/3dL9zMvPkeS/dryICM7NOJtVdATNrBgcLM8viYGFmWRwszCyLg4WZZXGwMLMsDhY9RNLeJY6dI2n/4nW/pAsl7Ve8XyjpK5Kmthw/XdJ1krbdgvrNKK7TN9ZzWHNtU3cF7Dk+JemPwFnAw8ADbfsPBqZHxNPAi4BLJL0duAX4IfDfRTCYCrwuIp5s+d6jgGeK70XSdOCXwC9ajpkK/ANwJ/AYcD9wIHBQRNwP/CXwwogY6OLPbA3hYNFDIuJkSf8IzAbWAR9rO+QS4Jni2OskPQz8EXg+sD8wAFwAHA4skXRJRFxdfO9pwKGSfkn6d/9n4MmImNNeD0nbAb8BjgO+ALxE0hLghcBPJa0ABOwIHBkRD3frM7De5WDRIyQdApwMXBgRayQF8FTbYRERIWknYP+IuEPSZOCjwHJSprBHRPy9pD2BGcW5X0YKJq8BFgELi/N9fITqbCy+vh64JiKWSHoA+BTwXuB5EfGdLf+pJ66jF0yNx9fkJWgr73v6tog4puIqdeRg0TtWAicAt0t6BbATcH7bMf1FM2M/4FpJ3wK+BhwBzAUOAB6W9Obi+G0lfQb4CnAGMAgQEc8CSNpN0r3FsdOA+yPixJbrzQRWStqGlOWcA7yquL6DxRZYvWaA796W10U1ec9f9FdcnSwOFj2i6Ac4T9Iniuzh6IhYCSBpJrAuIh4vDv++pJcDR0fEjcAXJP07EMBHWk57f0SskvRaYB4pAM2UdBFwO/C7iDiwuMZRwOlt1TofuIOU4SwgZSW7pMN1BPDbiHhDtz+LiSEYiMG6K1GKR0N6z67F12sAJE0B3gfMk3Rka9YA3FccczJwLnAzsHexvZNNzY3HgO+ROk0fAb5J6pMYVURsAH5KCkKvIAWcDwOfi4h5DhRjF8AgkbX1CmcWPUTSNODmohnSL+kO4EHgSeBfgEdJ/RMAbwCOl3Q/8CbSaMhJFB2gwL7AEoCIuAe4R9IBwO7APRHx26IZsqKlCkvb6nMsMIeUUfw4IgYkte6fDAx6dGRsBmlWZuFg0Vv+CVgcEU9I+lVEHKb013kxsCgirilGKgBOBK4mDX1eAlwHvHVoZELSc/o7JE0C3grcA3y9aEb8brjRkBY7kIZS5wIfl7QR6Cf1hRwFTCEFsVu78LNPKEEw0LDHQzhY9AhJs4C3AQdI2gXYTdIngd2AtcBvi0OvknQpaXj0lJZ5E5OAL0pqzSw+2HKJj5CGY08BLgMuGqYOk4HJpCyZiPhCsetLwH8Ux5wJvCgi3t+Nn3si66UmRg4Hi94xA/jPiHhS0gLSnIl7gOuBM4HLJT0C7EXKIq4dChSF3YBjIuLhorlxMfADAElvBN4CHBoRg5LOLo4/oGjqDJlEmldxxSj13LbYbAsEMNCwYCE/Kav3FFnC5LZgsCXnE7BnRPy6G+fbGiTtDtwQEYfXXZcqzJ49JW67JW9EdM+9f7OyQ3Nxq/BoSEmSFktaLum8qq4REYPdChTF+aJhgWI6qT9maqdjm2wwc+sVDhYlSDoJ6IuIeaT5CrPqrtM4NQC8EVhfd0WqEgQDmVuvcJ9FOfMphiNJw4yHkYY2rYsiYj1A6zDtuBMw0DtxIIszi3KmkiY1AawhzVkwKy1NympWM8SZRTkbgO2L19NwsL
UxEwM0K3PyL3s5K0lND0i3kT9UX1WsyQIYjLytVzizKOcmYJmkvYBjSTMbrSIRMb/uOlQlgGca9n91s2pbs6LjbT5wF7AgItbVWyNrssFQ1tYrnFmUFBFr2TQiYjYmaQZn7wSCHA4WZjUIxEDDEvtm1bZHSDqr7jqMdxPhM25aM8TBYmzG/S9yDxjXn/FQMyRn6xVuhpjVQgxEs/6vrj1Y9O/SFzP2mVx3NUrZ9/nbMGf2dj00Aj66n923Q91VKG07duB52qUxnzHAE6xdHRG7dj4yZRbP0qy1mmoPFjP2mczdt+1TdzXGtaP3OrDuKkwIt8cNq3KPjXBmYWaZBnuoPyKHg4VZDVIHpzMLM+vIzRAzy5BuUXewMLMOAvFMeDTEzDIMuhliZp24g9PMsgRioIfu+8jhYGFWk6Z1cDartmbjRAQMxKSsLYek3SUtK15PlnSzpDslnVGmbDQOFma1EIOZW8czbb4o07uBlRHxauBkSTuWKBuRg4VZDQJ4JrbJ2jK0L8o0n01Pc/sfYE6JshG5z8KsBkGpB9v0S1rR8n5RRCz607k2X5RpuPVtcstG5GBhVpMSQ6erSy6MPLS+zTrS+jYbSpSNyM0QsxqkdUMmZW1jMNz6NrllI3JmYVaLSh+ZdzVwi6TDgQOA75KaGzllI3JmYVaDKjKLoUWZImIVsBC4EzgqIgZyy0Y7vzMLs5pU+TDeiPg1bevb5JaNxMHCrAYR4tnBZv35Nau2ZuNEep6F7w0xs478pCwzy5A6OJ1ZmFkGP8/CzDoqOd27JzhYmNWkac+zcLAwq0EEPDvoYGFmHaRmiIOFmWWocgZnFRwszGrgoVMzy+RmiJll8nRvM+soPd3bwcLMOgjExkGvdWpmGdwMMbOOPBpiZtk8GmJmnYVvJDOzDH5Slpllc2ZhZh0FsNF3nZpZJ018+E1loU3SYknLJZ1X1TXMmmwQZW29opJgIekkoC8i5gEzJc2q4jpmjRWpzyJn6xVVZRbz2bTK0VI2Lb4KgKSzJK2QtOKxx0ddMc1sXBqalOVgAVNJi64CrAF2b90ZEYsiYk5EzNn1z5o1P96sW5oWLKrq4NwAbF+8noYXYDZ7jkAMNGw0pKrarmRT02M28FBF1zFrrKZ1cFaVWdwELJO0F3AsMLei65g1UkT3JmVJmg58DtgNWBkRb5e0GDgA+GpEnF8ct1lZGZVkFhGxntTJeRewICLWVXEdsyaLUNaW4XTgcxExB9hR0vtoG43sxghlZZOyImItm0ZEzOw5SnVe9kta0fJ+UUQsann/OPAXknYG9gHWsflo5EHDlD1YpsaewWlWk8ysAWB1kTWM5A7gOOA9wAPAFJ47Gnkwm49QHly2vs3qjjUbJ7o8z+JDwDsi4sPAT4A3s/lo5BaPUDpYmNWheGBvzpZhOvAySX3AocCFbD4aucUjlG6GmNUgKNUM6eQjwFXAC4DlwCfZfDQyhikrxcHCrBbdm50ZEXcDL33O2aX5wELgoqHRyOHKynCwMKtJRJXn3nw0cktHKB0szGrSxWbIVuFgYVaDCAcLM8vUS3eU5nCwMKvJ4KCDhZl1EGTf99EzHCzMalLhYEglHCzM6uAOTjPL1rDUwsHCrCbOLMwsS5UzOKvgYGFWgwiIhj2w18HCrCbOLMwsj4OFmXXmSVlmlsuZhZl15ElZZpbNmYWZZXFmYWZZGpZZjDorRNIkSVNH2ffX1VTLbJwLUmaRs/WITpnFDOBkSd8jrU3QSqQ1Fr1EodkYjLdJWRuBAeCDwDJgd+AI4PukdRIb9uOa9ZCG/fWMGCwkbQOcD+wI7Al8FZgFvBi4G7gTeMVWqKPZ+NRDTYwcne5kWQY803ZctH01s7ICNJi39YoRM4uI2ChpKbATsCtwMWlh1T2L7c3A77ZGJc3Gn97qvMzRqc9iX+DeiPhY+w5Jk0hNEzMbi4bl5qP1WWwLfAB4StKRwxwyCXikqoqZjXvjJVhExNPAsZJmAhcALwfOBh
4vDhGwbeU1NBuvxkuwGBIRvwROlXQy8KuI+En11TIb54YmZXWRpEuBWyPiZkmLgQOAr0bE+cX+zcrKyH6uV0TcEBE/kfTqlso5szAbI0XelnUu6XBgjyJQnAT0RcQ8YKakWcOVla1vx2Ah6UFJK1qKLijKTwQ+VPaCZlaIzK0DSZOBK4CHJL0emM+mmdVLgcNGKCsl50ayhyJiYcv7JyX1AecAx5W9YLsHfziNY2e9uvOBNmaTZu9bdxUmhnvLHZ6bNQD9bf9hL4qIRS3v/wb4MXAR8G7gXcDiYt8a4GBgKpsGJIbKSskJFiHppaR7Q35WlL0F+FJEPFb2gmZWyO+zWB0Rc0bZfxApgDwq6bPAq0hzogCmkVoQG4YpK2XEb5A0WdKbSNO9XwKcAlwCvJJ0j8gny17MzAq5TZC87OPnwMzi9RzSDaBDzYzZwEPAymHKShkts+gHFgIbI+IGSS+PiPdKuhXYGXgPcGHZC5pZoXtDp4uBKyWdCkwm9U98WdJewLHA3OJqy9rKShkxs4iI30TEGaRJWYcA20k6HlBEfAA4XtJuZS9oZkm3RkMi4omIOCUijoiIeRGxihQw7gIWRMS6iFjfXla2vjntliD1VXyGdD/I0K0ti4FTy17QzArda4ZsfuqItRGxJCIeHa2sjJxg8QLS3afrgX8ndY4A3EbqyzCzkjSe7jodEhEvbn0v6SJJZ0TElZLeW13VzMa5ht112ukZnPOKfoo/iYivAKdJ2hn4dJWVMxvXKmyGVKFTZjEJ6JP0A+Bp0s1jQWqavA34VrXVMxu/SkzK6gmd+iyGfpw1pGdX/B74BnAfsD/w2eqqZjbOjbPM4q+A/2PzqkdE/F2VFTMb10rcJNYrRpvBOYk0n/yEoaK2/cOuJ2JmmRqWWYw2KWsQuB64bKio5auAyyX1V1s9s/GraUOnuTeTPI80RXRHYAHpqVmfBt5RUb3MrMd06rPoA6a03/Em6ZsRcUfx9CwzG4seamLk6BQs7qStr6JwBUBEnN31GplNBA3s4Bw1WETEwAjl11ZTHbMJZDwFCzOrkIOFmXUixlkzxMwqEr01LJrDwcKsLs4szCyLg4WZ5XCfhZnlcbAws4567CaxHA4WZjXxaIiZZXGfhZnlcbAws47cZ2FmOcTwt3P3MgcLs7o4szCzHO7gNLM8Hjo1s44a+KSs3Af2mlm3dXEpAEm7S7qneL1Y0nJJ57Xs36ysLAcLs5oo8rZMHwO2l3QS0BcR84CZkmYNVzaW+jpYmNUlP7Pol7SiZTur9TSSjgSeBB4F5gNLil1LgcNGKCvNfRZmNSmRNaxuX47jT+eQpgAfBE4EbiKtIvhIsXsNcPAIZaU5WJjVoXszON8PXBoRv5cEsAHYvtg3jdR6GK6sNDdDzGogurZ84VHAuyR9GziQtJj5UDNjNvAQsHKYstKcWZjVpQuZRUQcMfS6CBgnAMsk7UVacnRucaX2stIqyyyKoZxlVZ3frOkUkbXlioj5EbGe1KF5F7AgItYNVzaW+laSWUiaDlxN6lgxs3YV3nUaEWvZNPoxYllZVWUWA8AbgfUVnd+s8bo8z6JylWQWRdpD0Tu7mWKc+CyA7eTkwyaoHgoEOWrp4IyIRcAigJ36+hv2kZl1Ry9lDTk8GmJWBy9faGbZnFlsEhHzqzy/WVN5FXUzy1diDkUvcLAwq4kzCzPrzEsBmFkuj4aYWRYHCzPrLHAHp5nlcQenmeVxsDCzTjwpy8zyRLjPwszyeDTEzLK4GWJmnQUw2Kxo4WBhVpdmxQoHC7O6uBliZnk8GmJmOZxZmFlHCpA7OM0si+dZmFmOMksT9gIHC7M6+ElZZpbH94aYWaamjYZUtTCymXUydOdpp60DSTtJulXSUkk3SpoiabGk5ZLOazlus7IyHCzM6hCggcjaMpwGfCIiXgs8CpwK9EXEPGCmpFmSTmovK1tlN0PM6pLfDOmXtKLl/aJicfF0mohLW/btCrwF+FTxfilwGH
AQsKSt7MEy1XWwMKtJiaHT1RExp+P5pHnAdOAh4JGieA1wMDB1mLJS3Awxq0uX+iwAJO0CXAycAWwAti92TSP9nQ9XVoqDhVkdgjSDM2frQNIU4PPAORGxClhJamYAzCZlGsOVleJmiFkNRHRzBuffkpoV50o6F7gKOF3SXsCxwFxSeFrWVlaKg4VZXboULCLiMuCy1jJJXwYWAhdFxLqibH57WRkOFmZ1CCBvWHRsp49Yy6bRjxHLynCwMKuJbyQzszwOFmbWmW8kM7McXkXdzLL5SVlmlsMdnGbWWQADzUotHCzMauEOztLWDz6+eumGq1fVXY+S+oHVdVci2711V2BMmvUZJy8odbSDRTkRsWvddShL0oqcW4Zt7CbEZ+xgYWYdeRV1M8sTEO7gnAgWdT6kepImAwMR6bdO0jak0fupEfHECN8zE1hb3FSEpO0i4qmW8xERz26N+nfQE59xZRo4GuKH34xB6/MPtyZJh0v6uqSbJT1Ceo7BlyQ9Lukm4CbgVcDtkuZL+rykz0i6XtJBxWnOID2PcchNkl4jaQbwNuBKSTMk7VcEn1rU9RlvVV18UtbW4MyiQSJimaSPAscAV0bEjcDlkm6LiDcMHSfpdaRnMQ4A55Ke/twvaSnwHYq5g5L2A54GtgVOAV5ZvD6Z9LvxX8CwGYp1QQ8FghzOLJrnD8ChEXGjpLmS7gZWSbpc0n2S5gKHRMTPi+MvB3YGngWeaTvXBcADwO3A60gZx58DxwPfG6kpY92QmVX0UEBxZtEgkk4Dzkov9W3ga8AtpIexLgf2Bn4EfFHSULAYANYPc65TSM9i/N+IGJQ0FTi92H0cKTOxqgQw6D4Lq861wHzg98DdwK+L8j0oJjAV2cAJpAeyCpgMbCxet/oRcHbL++2BFxXbblVU3to4s7CqtIx6AJxDekjrTGAf4FdsCgivB/YnBYkdSf0OQ4Fj6Fw/lrRDy+n3BM4sXu8BfL2qn8MKPRQIcjhYNFREDEj6A7AKOILUUblc0iTgPaROygOBk4AXAleQMsnDhj8jq0mjKQCHVFh1A4ggBgbqrkUpboY0jFJaIYCI+BEpc/gGcE3x9UzSiMcTwIeBfwWeAt4B/JTUgTn0WypgkqQ+YB1wR7H9rLhW39b4mSaswcjbeoQziwYpFpP5DnBt8Yd8CSngvxPYAbieFByWkPod/i0iHpZ0AamZsTvwfVJ/B6Rh0n5SJ+ljxfcOeSXp9+O6Sn+oiaxhzRBFwypsm0h6fkQ80vJ+B+DpiGhWfjsB7dTXH/OmnZB17G3rr1rZCzfVObNosNZAUbz/Q111sTFo2H/UDhZmNYmGzbNwsDCrRW/NocjhYGFWhwAaNnTqYGFWgwCih4ZFczhYmNUh/PAbM8vUtMzC8yzMaiDpa6QJcTlWR8QxVdYnh4OFmWXxvSFmlsXBwsyyOFiYWRYHCzPL4mBhZlkcLMwsi4OFmWVxsDCzLA4WZpbl/wGCT5CMagtNrgAAAABJRU5ErkJggg==\n",
"text/plain": [
"<Figure size 288x288 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
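],
"source": [
"# Binary classification metrics\n",
"from sklearn.metrics import confusion_matrix\n",
"import matplotlib.pyplot as plt\n",
"# Compute the confusion matrix of y_test against predictions\n",
"## TODO: compute the confusion matrix with confusion_matrix and display it with matplotlib\n",
"# Store the result under a new name so the imported function is not overwritten when the cell is re-run\n",
"cm = confusion_matrix(y_test, predictions)\n",
"\n",
"# Add the title, colour bar and axis labels\n",
"plt.matshow(cm)\n",
"plt.title(\"Confusion matrix\")\n",
"plt.colorbar()\n",
"plt.ylabel(\"True label\")\n",
"plt.xlabel(\"Predicted label\")\n",
"plt.show()"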
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.2 Precision, recall and F1-score"
]
},
{
"cell_type": "code",
"execution_count": 138,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" ham 0.97 1.00 0.98 1190\n",
" spam 0.99 0.81 0.89 203\n",
"\n",
"avg / total 0.97 0.97 0.97 1393\n",
"\n"
]
}
],
"source": [
"# Compute precision, recall and F1-score automatically\n",
"from sklearn.metrics import classification_report\n",
"print(classification_report(y_test,predictions))"
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Per-fold precision: [1. 0.98780488 0.97297297 1. 1. ]\n",
"Per-fold recall: [0.59633028 0.74311927 0.66055046 0.64220183 0.61111111]\n",
"Per-fold F1: [0.74712644 0.84816754 0.78688525 0.78212291 0.75862069]\n"
]
}
],
"source": [
"## TODO: compute precision, recall and F1-score manually\n",
"from sklearn.preprocessing import LabelEncoder\n",
"# Encode the labels as integers (ham -> 0, spam -> 1) so that spam is the positive class\n",
"label_int = LabelEncoder()\n",
"y_train = label_int.fit_transform(y_train)\n",
"# Reuse the encoder fitted on the training labels instead of refitting it on the test labels\n",
"y_test = label_int.transform(y_test)\n",
"# Precision on each of the 5 cross-validation folds\n",
"precision = cross_val_score(LR, X_train, y_train, cv = 5, scoring='precision') \n",
"print(\"Per-fold precision:\", precision)\n",
"# Recall on each fold\n",
"recall = cross_val_score(LR, X_train, y_train, cv = 5, scoring=\"recall\")\n",
"print(\"Per-fold recall:\", recall) \n",
"# F1 on each fold\n",
"f1 = cross_val_score(LR, X_train, y_train, cv = 5, scoring=\"f1\")\n",
"print(\"Per-fold F1:\", f1)"
]
},
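{
"cell_type": "markdown",
"metadata": {},
"source": [
"The TODO above asks for a manual calculation, while the cell relies on `cross_val_score`. As a minimal sketch of the arithmetic behind the three metrics, the cell below counts true positives, false positives and false negatives directly. The label arrays `toy_true` and `toy_pred` are made up for illustration (1 = spam, 0 = ham); the same counting could be applied to the real test labels and predictions once both use the same encoding."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Made-up labels for illustration: 1 = spam (positive class), 0 = ham\n",
"toy_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])\n",
"toy_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])\n",
"\n",
"tp = np.sum((toy_true == 1) & (toy_pred == 1))  # spam predicted as spam\n",
"fp = np.sum((toy_true == 0) & (toy_pred == 1))  # ham predicted as spam\n",
"fn = np.sum((toy_true == 1) & (toy_pred == 0))  # spam predicted as ham\n",
"\n",
"toy_precision = tp / (tp + fp)  # share of predicted spam that really is spam\n",
"toy_recall = tp / (tp + fn)     # share of real spam that was found\n",
"toy_f1 = 2 * toy_precision * toy_recall / (toy_precision + toy_recall)\n",
"print(toy_precision, toy_recall, toy_f1)"
]
},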
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.3 Plotting the ROC curve"
]
},
{
"cell_type": "code",
"execution_count": 141,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"from sklearn.metrics import roc_curve,auc\n",
"## Plot the ROC curve\n",
"# Get the predicted probabilities from the logistic regression's predict_proba\n",
"predictions_pro = LR.predict_proba(X_test)\n",
"# roc_curve returns the false positive rate, the recall (true positive rate) and the thresholds; column 1 holds the spam probability\n",
"false_positive_rate, recall, thresholds = roc_curve(y_test, predictions_pro[:,1])\n",
"\n",
"roc_auc = auc(false_positive_rate, recall)\n",
"plt.title(\"Receiver operating characteristic (ROC) curve\")\n",
"plt.plot(false_positive_rate, recall, 'b', label='AUC = % 0.2f' % roc_auc)\n",
"plt.legend(loc='lower right')\n",
"# The dashed diagonal marks the performance of random guessing\n",
"plt.plot([0,1],[0,1],'r--')\n",
"plt.xlim([0.0, 1.0])\n",
"plt.ylim([0.0, 1.0])\n",
"plt.xlabel('False positive rate')\n",
"plt.ylabel('Recall')\n",
"plt.show() "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}