Sentiment Analysis

Synopsis

Develop algorithms that learn sentiment from labelled datasets and then accurately predict the sentiment of similar data.

Approach

I've implemented the project on three separate datasets:

Twitter [1]
IMDB [1], [2]
Amazon [1], [2]

I used NLTK and regular expressions for text cleaning in each of the five methods I implemented. I also tried feeding unprocessed data to methods 2-5 (in the order listed below), but test accuracy changed only marginally (-0.2%), so I didn’t include those runs in my final solutions.
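The cleaning step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `clean_text`, the specific regular expressions, and the tiny hard-coded `STOP_WORDS` set are all assumptions (in practice NLTK's stop-word corpus would likely supply a fuller list).

```python
import re

# Small illustrative stop-word list (an assumption for this sketch);
# NLTK's stopwords corpus would normally provide a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "to", "and", "of", "i"}


def clean_text(text):
    """Lowercase the text, strip URLs, HTML tags, mentions and
    non-letter characters, then drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tags such as <br />
    text = re.sub(r"@\w+", " ", text)          # Twitter mentions
    text = re.sub(r"[^a-z\s]", " ", text)      # digits, punctuation, emoji
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)


print(clean_text("@bob I LOVED this movie!!! <br /> http://t.co/xyz 10/10"))
# prints "loved movie"
```

Note that this sketch also discards emoji and splits contractions at the apostrophe, which may or may not match what the original cleaning did.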

Methods:

1. Multinomial Naïve Bayes based on TF-IDF using Sklearn
2. LSTM with ReLU and sigmoid activation functions using Keras
3. Bi-grams using PyTorch
4. 2D Bidirectional RNN using PyTorch
5. CNN layers of multi-dimensional filters using PyTorch
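As an illustration of the first method listed above, a TF-IDF plus Multinomial Naïve Bayes classifier can be assembled in scikit-learn along these lines. This is only a sketch under assumptions: the four training sentences and the `pos`/`neg` labels are made up, and the default vectorizer and classifier settings are used rather than whatever the project actually chose.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (made up for this sketch, not taken from the datasets).
train_texts = [
    "great movie loved it",
    "wonderful great acting",
    "terrible movie hated it",
    "awful boring acting",
]
train_labels = ["pos", "pos", "neg", "neg"]

# TF-IDF features feed straight into a Multinomial Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(["great wonderful", "terrible awful"])
print(list(predictions))
```

Wrapping both steps in a pipeline keeps the vocabulary fitted on the training split only, so the same object can be applied to held-out test data without leaking it into the features.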

Assumptions

The data is labelled correctly.
Extremely short tweets/reviews, such as a single word or emoji, can be used for training and subsequent prediction even though they might negatively affect results.

Charts

Logic Flow

Dataset Size

Sentiment Distribution

Outcome

Testing Accuracy

Method	IMDB	Tweets	Amazon
Multinomial Naïve Bayes - Sklearn	74.98%	71.31%	88.25%
LSTM - Keras	74.86%	71.10%	90.43%
Bigrams - PyTorch	74.06%	72.89%	88.86%
Bidirectional RNN - PyTorch	75.04%	73.14%	90.04%
CNN - PyTorch	75.42%	73.52%	90.50%

Comparing the Amazon results with the others suggests that the more data is available for training, the higher the accuracy. The higher Amazon scores could also be due to overfitting, even though precautions were taken to avoid it. Among the five methods, the CNN, which passes the vectors through layers of multi-dimensional filters, achieved the highest accuracy on all three datasets.
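The comparison can be checked directly against the figures in the table above. The short sketch below copies those accuracies into a dictionary (the shortened method names and variable names are my own) and picks the best method and the mean accuracy per dataset.

```python
# Test accuracies (%) copied from the table above; keys are shortened names.
accuracy = {
    "Naive Bayes":       {"IMDB": 74.98, "Tweets": 71.31, "Amazon": 88.25},
    "LSTM":              {"IMDB": 74.86, "Tweets": 71.10, "Amazon": 90.43},
    "Bigrams":           {"IMDB": 74.06, "Tweets": 72.89, "Amazon": 88.86},
    "Bidirectional RNN": {"IMDB": 75.04, "Tweets": 73.14, "Amazon": 90.04},
    "CNN":               {"IMDB": 75.42, "Tweets": 73.52, "Amazon": 90.50},
}
datasets = ["IMDB", "Tweets", "Amazon"]

# Best-scoring method for each dataset.
best = {d: max(accuracy, key=lambda m: accuracy[m][d]) for d in datasets}

# Mean accuracy over the five methods for each dataset.
mean_by_dataset = {
    d: sum(row[d] for row in accuracy.values()) / len(accuracy)
    for d in datasets
}

for d in datasets:
    print(f"{d}: best = {best[d]}, mean = {mean_by_dataset[d]:.2f}%")
```

The margins are small: on Amazon the CNN leads the LSTM by only 0.07 percentage points, so the ranking between those two may not be robust to a different train/test split.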

Exceptions Considered

The Twitter dataset doesn’t allow the algorithms to achieve high accuracy because of slang, multiple short forms of the same word and a large amount of sarcasm. I kept such tweets in the training data, but each one is an exception in its own way. Also, the IMDB and Twitter datasets contain a larger share of negative samples, which can skew the results.
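One way to gauge how much a skewed label distribution matters is to compare a model's accuracy with the accuracy of always predicting the most frequent label. The sketch below is an illustration only: the helper names and the ten made-up labels are assumptions, not figures from the datasets.

```python
from collections import Counter


def label_distribution(labels):
    """Return the fraction of samples carrying each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Made-up labels for illustration: 6 negative, 4 positive.
labels = ["neg"] * 6 + ["pos"] * 4

print(label_distribution(labels))
print(majority_baseline(labels))
```

A test accuracy only slightly above this baseline would indicate that the model is mostly exploiting the imbalance rather than the text.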

Scope for Enhancement

Since a larger amount of training data appears to result in higher accuracy, the algorithms for the IMDB and Tweets datasets could be enhanced further by training them on more data of the same kind. I’ve included alternate datasets as well for this purpose.
