The goal of this project is to build an algorithm that automatically determines whether a user's comment is positive or negative. For example, given the following user reviews:
Comment 1: "I love this appliance. I've had it for 3 months without a single problem!"
Comment 2: "The things I bought from this Taobao store started to break down within a week. I strongly recommend not buying them. It's a real waste of money."
Of these two comments, the first is clearly positive and the second is negative. I want to build an AI algorithm that can automatically tell if a review is positive or negative.
Sentiment analysis is a classic problem in text processing. The whole system generally consists of several modules:
1. Data capture: a web crawler is used to collect relevant text data from the web
2. Data cleaning/preprocessing: For text, it is generally necessary to remove useless information, such as markup (HTML tags), punctuation marks, stop words and so on
3. Convert text information into vectors: This is a form of feature engineering. Raw text cannot be fed to the model directly; a model only accepts numbers (such as vectors) as input. So before entering the model,
any input needs to be transformed into a numeric representation that the model can process (numbers, vectors, matrices, tensors...).
4. Select appropriate models and evaluation methods. Sentiment analysis is a binary classification problem (or a three-class problem: positive, negative, neutral),
so we need to use classification algorithms such as logistic regression, naive Bayes, neural networks, SVM and so on.
In addition, we need to choose an appropriate evaluation metric: for a given application, should we focus on accuracy or recall?
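
To make modules 2 to 4 concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions, not the project's actual implementation: the stop-word list, the four training reviews, and the names preprocess, vectorize, NaiveBayes, precision_recall and classify are all made up for this example. Naive Bayes was picked from the classifiers listed above simply because it fits in a few lines, and the metric helper reports precision alongside recall. Data capture (module 1) is left out.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"i", "it", "this", "the", "a", "and", "is", "of", "for", "to"}


def preprocess(text):
    """Module 2: strip HTML tags and punctuation, lowercase, drop stop words."""
    text = re.sub(r"<[^>]+>", " ", text)           # remove HTML tags
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # remove punctuation and digits
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def vectorize(tokens, vocabulary):
    """Module 3: turn a token list into a bag-of-words count vector."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]  # unknown words are ignored


class NaiveBayes:
    """Module 4: multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, vectors, labels):
        self.classes = sorted(set(labels))
        self.log_prior = {}
        self.log_likelihood = {}
        for c in self.classes:
            rows = [v for v, y in zip(vectors, labels) if y == c]
            self.log_prior[c] = math.log(len(rows) / len(vectors))
            # Per-word counts in class c, plus one for smoothing.
            totals = [sum(column) + 1 for column in zip(*rows)]
            denom = sum(totals)
            self.log_likelihood[c] = [math.log(t / denom) for t in totals]
        return self

    def predict(self, vector):
        def score(c):
            return self.log_prior[c] + sum(
                x * w for x, w in zip(vector, self.log_likelihood[c])
            )
        return max(self.classes, key=score)


def precision_recall(y_true, y_pred, positive="pos"):
    """Module 4: precision and recall for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy training data, invented for this sketch.
TRAIN = [
    ("<p>I love this appliance, it works great!</p>", "pos"),
    ("Great product, love it.", "pos"),
    ("It broke within a week. Waste of money.", "neg"),
    ("Terrible quality, do not buy.", "neg"),
]

train_tokens = [preprocess(text) for text, _ in TRAIN]
vocabulary = sorted({tok for tokens in train_tokens for tok in tokens})
model = NaiveBayes().fit(
    [vectorize(tokens, vocabulary) for tokens in train_tokens],
    [label for _, label in TRAIN],
)


def classify(text):
    """Run a raw review through cleaning, vectorization and the classifier."""
    return model.predict(vectorize(preprocess(text), vocabulary))


if __name__ == "__main__":
    reviews = ["I love it, great!", "Waste of money, broke quickly"]
    predictions = [classify(r) for r in reviews]
    print(predictions)
    print(precision_recall(["pos", "neg"], predictions))
```

With only four training reviews this says nothing about real-world quality. A real system would train on a much larger crawled corpus, hold out a test set, and would probably use an established library for vectorization and classification rather than hand-written code.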