L16: Sentiment Analysis

Bogdan G. Popescu

John Cabot University

Introduction

Intro

Today we discuss sentiment analysis, building on the pre-processing steps we already know:

  • Preprocessing Data (lowercase, remove punctuation, remove numbers)
  • Stopwords
  • Tokenization
  • Lemmatization
  • Stemming

Intro

One of the most important subfields of text analysis is sentiment analysis.

Sentiment analysis allows us to determine the emotional tone of a text.

Sentiment analysis has a variety of applications: brand monitoring, customer feedback analysis, political party manifestos, social media sentiment analysis, etc.

Typically, sentiment analysis entails analyzing the words or phrases that allow us to identify the underlying sentiment: positive, negative, or neutral.

Models of Sentiment Analysis

Some of the most common sentiment analysis models include:

  • Lexicon-based analysis
  • Machine learning
  • Pre-trained transformer-based deep learning

Models of Sentiment Analysis

Lexicon-based analysis

This type of analysis entails using predefined rules to determine the sentiment of text.

For example, the presence of explicitly positive or negative words determines the sentiment of the text.

Lexicon-based analyses are easy to implement but may not be as accurate as other approaches.
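To make the idea concrete, here is a minimal sketch of a lexicon-based classifier, using two tiny hypothetical word lists rather than a real lexicon:

# Hypothetical mini-lexicon (illustrative only)
positive_words = {"great", "love", "excellent"}
negative_words = {"bad", "hate", "poor"}

def lexicon_sentiment(text):
    # Count matches against each word list and compare the totals
    words = text.lower().split()
    score = sum(w in positive_words for w in words) - \
            sum(w in negative_words for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("I love this excellent console"))  # positive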

Models of Sentiment Analysis

Machine Learning

This entails training a model to identify the sentiment of a piece of text based on a set of labeled training data.

Such a model can be built with a variety of ML algorithms: e.g. decision trees, support vector machines (SVMs), and neural networks.

These are typically more accurate than lexicon-based analysis, but they require a larger amount of training data.
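As a minimal sketch (with made-up toy data, purely illustrative), such a classifier could combine bag-of-words features with an SVM using scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labeled training data (1 = positive, 0 = negative)
texts = ["great product", "terrible quality", "love it", "waste of money"]
labels = [1, 0, 1, 0]

# Turn text into word counts, then fit an SVM on those features
model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(texts, labels)

print(model.predict(["love this product"]))  # likely [1], i.e. positive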

Models of Sentiment Analysis

Pre-trained transformer-based deep learning

This is a deep-learning approach, as seen in BERT and GPT-4.

These models are pre-trained on large amounts of text data.

They use complex neural networks to encode text and capture its meaning.

Thus, their accuracy is much higher, but at the cost of greater computational power.
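As a minimal sketch (assuming the Hugging Face transformers library is installed; it is not used elsewhere in this lecture), a pre-trained sentiment classifier can be applied in a few lines:

from transformers import pipeline

# Downloads a default pre-trained sentiment model on first use
classifier = pipeline("sentiment-analysis")
print(classifier("The console arrived quickly and works great!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]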

Bag of Words Model

The bag of words model is a technique used in NLP to represent text numerically.

In such a model, every word in the text is represented by a separate feature (variable) in the resulting vector (dataframe).

The value of each feature is determined by the number of times the corresponding word appears in the text.

The Bag of Words model is helpful because, by representing text as numerical data, we can train machine learning models to classify text and analyze sentiment.
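To illustrate, here is a minimal sketch using scikit-learn's CountVectorizer (an addition for illustration; the lecture itself pre-processes text with NLTK):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the console is great",
        "the controller is not great"]

# Each column is one word (feature); each cell is that word's count
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
# ['console' 'controller' 'great' 'is' 'not' 'the']
print(X.toarray())
# [[1 0 1 1 0 1]
#  [0 1 1 1 1 1]]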

Sentiment Analysis

Let us now perform sentiment analysis using NLTK in Python.

We will do this using Amazon Reviews

Let us first load the data

# Step1: Load libraries
import pandas as pd
import string
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
nltk.download('all')

# Reading the CSV data into a DataFrame
reviews_df = pd.read_csv("./data/ps_5reviews.csv")

Pre-processing Text

We can use the function that we created previously to clean the text:

# Define lemmatization function
lemmatizer = WordNetLemmatizer()
def lemmatize_text(text):
    return [lemmatizer.lemmatize(word) for word in word_tokenize(text)]

# Define stemming function
stemmer = PorterStemmer()
def stem_text(text):
    return [stemmer.stem(word) for word in word_tokenize(text)]

def preprocess_text(df, text_column):
    ### Step1: Ensure the specified column contains only strings
    df[text_column] = df[text_column].astype(str)

    ### Step2: Turn words to lowercase
    df["text2"] = df[text_column].str.lower()

    ### Step3: Remove punctuation
    df['text2'] = df['text2'].str.translate(str.maketrans('', '', string.punctuation))

    ### Step4: Remove numbers
    df['text2'] = df['text2'].str.replace(r'[0-9]+', " ", regex=True)

    ### Step5: Tokenize
    df['text2'] = df['text2'].apply(lambda text: ' '.join(word_tokenize(text)))

    ### Step6: Lemmatization
    df['text2'] = df['text2'].apply(lambda text: ' '.join(lemmatize_text(text)))

    ### Step7: Stemming
    df['text2'] = df['text2'].apply(lambda text: ' '.join(stem_text(text)))

    ### Step8: Listing words line by line (optional)
    # Split the text into tokens, then use explode() to unnest
    # each word into a separate row:
    #df['text2'] = df['text2'].apply(lambda text: text.split())
    #df = df.explode('text2').reset_index(drop=True)

    ### Step9: Remove stop words, filtering token by token
    stop_words = set(stopwords.words('english'))
    df['text2'] = df['text2'].apply(
        lambda text: ' '.join(word for word in text.split() if word not in stop_words))

    ### Step10: Remove rows with NaN values in the 'text2' column
    df = df.dropna(subset=['text2'])

    return df

Pre-processing Text

We now apply the function and observe the changes.

preprocessed_df = preprocess_text(reviews_df, 'review_text')
preprocessed_df.head(10)
          profile_name  ...                                              text2
0         J.F. Carroll  ...  nobodi could get their hand on thi until just ...
1              Brandon  ...  i wa look for a great deal luckili i found thi...
2                Brock  ...  after use the playstat for over month i can sa...
3  Whispering Whiskers  ...  greet fellow gamer gather round a i recount my...
4         Carson Morby  ...  final i got a p ive been wait forev but i made...
5                Chris  ...  the unit arriv in good condit and work a inten...
6                   Lo  ...  im glad i wait and brought the slim on sale i ...
7           Brandon M.  ...  my p run veri well it quiet and doesnt overhea...
8                Hutch  ...  so i usual play on pc i have a ti r x gb ram m...
9                 Kris  ...  i bought thi around black friday and have been...

[10 rows x 6 columns]

Sentiment Analysis

Let us first initialize a SentimentIntensityAnalyzer object from the nltk.sentiment.vader module.

from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

We will first define a function called get_sentiment, which takes a string as input:

  • the function calls the polarity_scores method to get a dictionary of sentiment scores for the text
  • these scores cover positive, negative, and neutral sentiment (plus a compound score)
  • we code positive sentiment as 1 and 0 otherwise
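For instance, this is the shape of the raw scores (the exact numbers depend on the text):

print(analyzer.polarity_scores("This product is great!"))
# a dict of the form {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}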

Sentiment Analysis

This is the function:

def get_sentiment(text):
    # Dictionary with 'neg', 'neu', 'pos', and 'compound' scores
    scores = analyzer.polarity_scores(text)
    # Label the text positive (1) whenever any positive score is present
    sentiment = 1 if scores['pos'] > 0 else 0
    return sentiment

And this is how we apply it to our customer reviews:

preprocessed_df['sentiment'] = preprocessed_df['text2'].apply(get_sentiment)

Sentiment Analysis

Since we have a column called “rating”, we can check how well our sentiment labels match the actual ratings.

We see that our rating column has labels such as “5.0 out of 5 stars”.

We will first extract the first number (i.e. “5.0”), turn it into an integer, and recode it as binary (1 for ratings above 4, 0 otherwise):

preprocessed_df['rating_numeric'] = preprocessed_df['rating'].str[:3].astype(float).astype(int)
preprocessed_df['rating_transformed'] = (preprocessed_df['rating_numeric'] > 4).astype(int)
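As a quick sanity check (a small addition, not part of the original code), we can inspect how the transformed ratings are distributed before comparing them with the predicted sentiments:

print(preprocessed_df['rating_transformed'].value_counts())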

https://www.datacamp.com/tutorial/text-analytics-beginners-nltk

Confusion Matrix

We can now compare the true labels from the ratings (rows) with the predicted sentiments (columns):

from sklearn.metrics import confusion_matrix
print(confusion_matrix(preprocessed_df['rating_transformed'], preprocessed_df['sentiment']))
[[ 1 12]
 [14 73]]

We can also check the classification report:

from sklearn.metrics import classification_report
print(classification_report(preprocessed_df['rating_transformed'], preprocessed_df['sentiment']))
              precision    recall  f1-score   support

           0       0.07      0.08      0.07        13
           1       0.86      0.84      0.85        87

    accuracy                           0.74       100
   macro avg       0.46      0.46      0.46       100
weighted avg       0.76      0.74      0.75       100

As you can see, the overall accuracy of this rule-based sentiment analysis model is 74%, i.e. (1 + 73) out of 100 predictions are correct, driven almost entirely by the positive class.

https://www.datacamp.com/tutorial/text-analytics-beginners-nltk