Commit c281b257 authored Jun 23, 2021 by 20210509028
parent b422937e
Showing 1 changed file with 661 additions and 0 deletions
project2/preprocessor.ipynb
0 → 100644
{
"cells": [
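{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Text preprocessing\n",
"You do not need to complete any task in this module: all of the code is already written, so you only need to read through it. The input is question_answer.txt, and the processed result is saved to question_answer_pares.pkl."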
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import emoji \n",
"# The emoji package handles emoji symbols; see http://www.fhdq.net/emoji/emojifuhao.html or https://www.webfx.com/tools/emoji-cheat-sheet/\n",
"import re\n",
"import jieba\n",
"import os\n",
"from collections import Counter"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"# Load the raw data (tab-separated, no header row, first 1000 rows only)\n",
"QApares_df = pd.read_csv('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/question_answer.txt',sep='\\t',header=None,nrows=1000)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>买 二份 有没有 少点 呀</td>\n",
" <td>亲亲 真的 不好意思 我们 已经 是 优惠价 了 呢 小本生意 请亲 谅解</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>那 就 等 你们 处理 喽</td>\n",
" <td>好 的 亲退 了</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>那 我 不 喜欢</td>\n",
" <td>颜色 的话 一般 茶刀 茶针 和 二合一 的话 都 是 红木 檀 和 黑木 檀 哦</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>不是 免 运费</td>\n",
" <td>本店 茶具 订单 满 99 包邮除 宁夏 青海 内蒙古 海南 新疆 西藏 满 39 包邮</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>好吃 吗</td>\n",
" <td>好吃 的</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1\n",
"0 买 二份 有没有 少点 呀 亲亲 真的 不好意思 我们 已经 是 优惠价 了 呢 小本生意 请亲 谅解\n",
"1 那 就 等 你们 处理 喽 好 的 亲退 了\n",
"2 那 我 不 喜欢 颜色 的话 一般 茶刀 茶针 和 二合一 的话 都 是 红木 檀 和 黑木 檀 哦\n",
"3 不是 免 运费 本店 茶具 订单 满 99 包邮除 宁夏 青海 内蒙古 海南 新疆 西藏 满 39 包邮\n",
"4 好吃 吗 好吃 的"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Inspect the raw data format by looking at the first few samples\n",
"QApares_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>question_before_preprocessing</th>\n",
" <th>answer</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>买 二份 有没有 少点 呀</td>\n",
" <td>亲亲 真的 不好意思 我们 已经 是 优惠价 了 呢 小本生意 请亲 谅解</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>那 就 等 你们 处理 喽</td>\n",
" <td>好 的 亲退 了</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>那 我 不 喜欢</td>\n",
" <td>颜色 的话 一般 茶刀 茶针 和 二合一 的话 都 是 红木 檀 和 黑木 檀 哦</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>不是 免 运费</td>\n",
" <td>本店 茶具 订单 满 99 包邮除 宁夏 青海 内蒙古 海南 新疆 西藏 满 39 包邮</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>好吃 吗</td>\n",
" <td>好吃 的</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" question_before_preprocessing answer\n",
"0 买 二份 有没有 少点 呀 亲亲 真的 不好意思 我们 已经 是 优惠价 了 呢 小本生意 请亲 谅解\n",
"1 那 就 等 你们 处理 喽 好 的 亲退 了\n",
"2 那 我 不 喜欢 颜色 的话 一般 茶刀 茶针 和 二合一 的话 都 是 红木 檀 和 黑木 檀 哦\n",
"3 不是 免 运费 本店 茶具 订单 满 99 包邮除 宁夏 青海 内蒙古 海南 新疆 西藏 满 39 包邮\n",
"4 好吃 吗 好吃 的"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Rename the columns of QApares_df\n",
"QApares_df.columns = ['question_before_preprocessing','answer']\n",
"QApares_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
"def rm_space(sentence):\n",
" '''Remove the spaces between the words of a pre-segmented sentence to recover the original sentence, e.g. ‘那 就 等 你们 处理 喽’ becomes ‘那就等你们处理喽’.'''\n",
" return ''.join(sentence.split())\n",
"# Series.apply(func) calls func on every value of a column; DataFrame.apply(func, axis) instead works on whole columns (axis=0) or whole rows (axis=1)\n",
"# Reference: https://blog.csdn.net/u010916338/article/details/105493393/\n",
"QApares_df['question'] = QApares_df.question_before_preprocessing.apply(rm_space) \n",
"QApares_df['answer'] = QApares_df.answer.apply(rm_space)"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>question_before_preprocessing</th>\n",
" <th>answer</th>\n",
" <th>question</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>买 二份 有没有 少点 呀</td>\n",
" <td>亲亲真的不好意思我们已经是优惠价了呢小本生意请亲谅解</td>\n",
" <td>买二份有没有少点呀</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>那 就 等 你们 处理 喽</td>\n",
" <td>好的亲退了</td>\n",
" <td>那就等你们处理喽</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>那 我 不 喜欢</td>\n",
" <td>颜色的话一般茶刀茶针和二合一的话都是红木檀和黑木檀哦</td>\n",
" <td>那我不喜欢</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>不是 免 运费</td>\n",
" <td>本店茶具订单满99包邮除宁夏青海内蒙古海南新疆西藏满39包邮</td>\n",
" <td>不是免运费</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>好吃 吗</td>\n",
" <td>好吃的</td>\n",
" <td>好吃吗</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" question_before_preprocessing answer question\n",
"0 买 二份 有没有 少点 呀 亲亲真的不好意思我们已经是优惠价了呢小本生意请亲谅解 买二份有没有少点呀\n",
"1 那 就 等 你们 处理 喽 好的亲退了 那就等你们处理喽\n",
"2 那 我 不 喜欢 颜色的话一般茶刀茶针和二合一的话都是红木檀和黑木檀哦 那我不喜欢\n",
"3 不是 免 运费 本店茶具订单满99包邮除宁夏青海内蒙古海南新疆西藏满39包邮 不是免运费\n",
"4 好吃 吗 好吃的 好吃吗"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"QApares_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Python is fun 💚️\n",
"Python is fun :green_heart:\n",
"我想对你说 :ok_woman:\n"
]
}
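],
"source": [
"## Try out the emoji package\n",
"import emoji\n",
"\n",
"test = emoji.demojize('我想对你说 :ok_woman:', use_aliases=True)  # demojize turns emoji characters into :shortcode: text; this string contains only a shortcode and no emoji character, so it is returned unchanged. use_aliases=True selects the alias shortcode names\n",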
"print(emoji.emojize(\"Python is fun :green_heart:\",variant=\"emoji_type\"))\n",
"print(emoji.demojize(\"Python is fun :green_heart:\"))\n",
"print(test)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, preprocess the data."
]
},
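{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [],
"source": [
"# Clean the data\n",
"def clean(content):\n",
" # demojize converts emoji characters in the text into their text shortcodes\n",
" content = emoji.demojize(content)\n",
" # Strip HTML tags: \".\" matches any character except a newline, \"*?\" repeats it as few times as possible so each tag is matched on its own, \"<\" and \">\" are literal characters, and re.sub() performs the replacement\n",
" content = re.sub('<.*?>','',content)\n",
" return content"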
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"# Apply the clean function defined above to every question\n",
"QApares_df['question_after_preprocessing'] = QApares_df['question_before_preprocessing'].apply(clean)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 买 二份 有没有 少点 呀\n",
"1 那 就 等 你们 处理 喽\n",
"2 那 我 不 喜欢\n",
"3 不是 免 运费\n",
"4 好吃 吗\n",
"Name: question_after_preprocessing, dtype: object"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"QApares_df['question_after_preprocessing'].head()"
]
},
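{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"# Normally the next step would be word segmentation, but our data is already segmented, so no Chinese word segmentation is performed here; the segmented words are simply stored as a list.\n",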
"QApares_df['question_after_preprocessing'] = QApares_df['question_after_preprocessing'].apply(lambda x:x.split())"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 [买, 二份, 有没有, 少点, 呀]\n",
"1 [那, 就, 等, 你们, 处理, 喽]\n",
"2 [那, 我, 不, 喜欢]\n",
"3 [不是, 免, 运费]\n",
"4 [好吃, 吗]\n",
"Name: question_after_preprocessing, dtype: object"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"QApares_df['question_after_preprocessing'].head()"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 [买, 二份, 有没有, 少点, 呀]\n",
"1 [那, 就, 等, 你们, 处理, 喽]\n",
"2 [那, 我, 不, 喜欢]\n",
"3 [不是, 免, 运费]\n",
"4 [好吃, 吗]\n",
"Name: question_after_preprocessing, dtype: object"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
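],
"source": [
"# Strip spaces, newlines and similar whitespace from each word in the question_after_preprocessing column\n",
"def strip(wordList):\n",
" return [word.strip() for word in wordList if word.strip()!='']  # str.strip() removes the given characters (whitespace by default) from both ends of a string\n",
" # Note: strip() only removes leading and trailing characters, never characters in the middle\n",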
"QApares_df['question_after_preprocessing'] = QApares_df['question_after_preprocessing'].apply(strip)\n",
"QApares_df['question_after_preprocessing'].head()"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"# Read the stop-word list stored in data/stopWord.json and keep it in a list\n",
"# Chinese stop-word lists can be downloaded from https://github.com/goto456/stopwords/\n",
"with open(\"C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/stopWord.json\",\"r\",encoding=\"utf-8\") as f:\n",
" stopWords = f.read().split(\"\\n\")"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"747"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(stopWords)"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 [买, 二份, 有没有, 少点]\n",
"1 [处理]\n",
"2 [喜欢]\n",
"3 [免, 运费]\n",
"4 [好吃]\n",
"Name: question_after_preprocessing, dtype: object"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Remove stop words\n",
"def rm_stop_word(wordList):\n",
" return [word for word in wordList if word not in stopWords]\n",
"QApares_df['question_after_preprocessing'] = QApares_df['question_after_preprocessing'].apply(rm_stop_word)\n",
"QApares_df['question_after_preprocessing'].head()\n"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 [买]\n",
"1 []\n",
"2 []\n",
"3 [运费]\n",
"4 [好吃]\n",
"Name: question_after_preprocessing, dtype: object"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
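],
"source": [
"# Remove low-frequency words and keep only the high-frequency ones; afterwards many questions are left without a single word\n",
"allWords = [word for question in QApares_df['question_after_preprocessing'] for word in question]  # flat list of every word\n",
"freWord = Counter(allWords)  # word frequencies: a dict-like Counter mapping each word to its number of occurrences\n",
"highFreWords = [word for word in freWord.keys() if freWord[word]>5]  # words that occur more than 5 times; every other word is dropped\n",
"def rm_low_fre_word(content):\n",
" return [word for word in content if word in highFreWords]\n",
"\n",
"# Remove low-frequency words\n",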
"QApares_df['question_after_preprocessing'] = QApares_df['question_after_preprocessing'].apply(rm_low_fre_word)\n",
"QApares_df['question_after_preprocessing'] .head()"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>question</th>\n",
" <th>answer</th>\n",
" <th>question_after_preprocessing</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>买二份有没有少点呀</td>\n",
" <td>亲亲真的不好意思我们已经是优惠价了呢小本生意请亲谅解</td>\n",
" <td>[买]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>那就等你们处理喽</td>\n",
" <td>好的亲退了</td>\n",
" <td>[]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>那我不喜欢</td>\n",
" <td>颜色的话一般茶刀茶针和二合一的话都是红木檀和黑木檀哦</td>\n",
" <td>[]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>不是免运费</td>\n",
" <td>本店茶具订单满99包邮除宁夏青海内蒙古海南新疆西藏满39包邮</td>\n",
" <td>[运费]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>好吃吗</td>\n",
" <td>好吃的</td>\n",
" <td>[好吃]</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" question answer question_after_preprocessing\n",
"0 买二份有没有少点呀 亲亲真的不好意思我们已经是优惠价了呢小本生意请亲谅解 [买]\n",
"1 那就等你们处理喽 好的亲退了 []\n",
"2 那我不喜欢 颜色的话一般茶刀茶针和二合一的话都是红木檀和黑木檀哦 []\n",
"3 不是免运费 本店茶具订单满99包邮除宁夏青海内蒙古海南新疆西藏满39包邮 [运费]\n",
"4 好吃吗 好吃的 [好吃]"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Keep only the question, answer and question_after_preprocessing columns\n",
"QApares_df = QApares_df[['question','answer','question_after_preprocessing']]\n",
"# View the data after preprocessing\n",
"QApares_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [],
"source": [
"# CSV would store every value as a string, but each entry of the question_after_preprocessing column is a list, which would complicate later processing, so the data is saved in pickle format instead\n",
"# On pickle, see https://blog.csdn.net/chunmi6974/article/details/78392230/\n",
"import pickle\n",
"# Serialize the object (it can be deserialized later with pickle.load)\n",
"with open('C:/Users/cuishufeng-ghq/Documents/tanxin/wenda/data/question_answer_pares.pkl','wb') as f:\n",
" pickle.dump(QApares_df,f)\n",
" # pickle.dump writes the pickled object to an open file"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
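
The cleaning and filtering steps in the notebook above can be tried without pandas, emoji, or the course data. What follows is a minimal sketch, not the notebook's own code: the sample questions, the tiny stop-word set, and the threshold of 1 are made up for illustration (the notebook keeps words that occur more than 5 times), and the helper names `strip_tags` and `tokenize` are hypothetical. It uses a non-greedy tag pattern, sets for membership tests, and an in-memory buffer standing in for the .pkl file.

```python
import io
import pickle
import re
from collections import Counter

# Hypothetical pre-segmented questions, in the style of question_answer.txt.
questions = [
    "运费 多少 <b>亲</b>",
    "运费 免 吗",
    "好吃 吗",
    "运费 怎么 算",
]
stop_words = {"吗", "亲", "多少", "怎么"}  # assumed toy stop-word set
threshold = 1  # assumed: keep words occurring more than once


def strip_tags(text):
    """Remove HTML tags; the non-greedy pattern keeps the text between tags."""
    return re.sub(r"<.*?>", "", text)


def tokenize(text):
    """Split a pre-segmented sentence on whitespace (empty tokens are dropped)."""
    return strip_tags(text).split()


# Stop-word removal: a set gives constant-time membership tests.
tokens = [[w for w in tokenize(q) if w not in stop_words] for q in questions]

# Low-frequency removal: count every remaining word, keep the frequent ones.
counts = Counter(w for words in tokens for w in words)
frequent = {w for w, c in counts.items() if c > threshold}
filtered = [[w for w in words if w in frequent] for words in tokens]

# Pickle round trip: the nested lists come back as lists, not strings.
buffer = io.BytesIO()
pickle.dump(filtered, buffer)
buffer.seek(0)
restored = pickle.load(buffer)

print(filtered)
print(restored == filtered)
```

As in the notebook, some questions end up with an empty word list after filtering, so any later step that consumes these lists should probably handle that case. A greedy pattern such as `<.*>` would also delete the text between the first `<` and the last `>` on a line, which is why the sketch uses `<.*?>`.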