motiurinfo/sentiment_classification
Performance evaluation of sentiment classification on movie reviews
sentiment_classification
Performance evaluation of sentiment classification in movie reviews
1. Introduction
Given the availability of a large volume of online review data (Amazon, IMDB, etc.), sentiment analysis becomes increasingly important. In this project, a sentiment sentiment classification is evaluated using ensemble methods.
2. Getting the Dataset
This can also be downloaded from: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz.
3. Data Preprocessing
The training dataset in aclImdb folder has two sub-directories pos/ for positive texts and neg/ for negative ones. Use only these two directories. The first task is to combine both of them to a single csv file, “imdb_tr.csv”. The csv file has three columns,"row_number" and “text” and “polarity”. The column “text” contains review texts from the aclImdb database and the column “polarity” consists of sentiment labels, 1 for positive and 0 for negative. The file imdb_tr.csv is an output of this preprocessing. In addition, common English stopwords should be removed. An English stopwords reference ('stopwords.en') is given in the code for reference.
4. Data Representations Used
Vectorization methods: Unigram , Bigram
Feature Extraction: TF-IDF
5. Algorithmic Overview
In this project, we will train ensemble methods and evaluate the optimized combination:
http://scikit-learn.org/stable/modules/ensemble.html
6. Functions used in the sentimentalAnalysis file
imdb_data_preprocess : Explores the neg and pos folders from aclImdb/train and creates a imdb_tr.csv file in the required format
remove_stopwords : Takes a sentence and the stopwords as inputs and returns the sentence without any stopwords
unigram_process : Takes the data to be fit as the input and returns a vectorizer of the unigram as output
bigram_process : Takes the data to be fit as the input and returns a vectorizer of the bigram as output
tfidf_process : Takes the data to be fit as the input and returns a vectorizer of the tfidf as output
retrieve_data : Takes a CSV file as the input and returns the corresponding arrays of labels and data as output
random_forest_classifier : Applies Random Forest on the training data and returns the predicted labels
extra_tree_classifier : Applies Extra Tree on the training data and returns the predicted labels
bagging_decision_tree : Applies Bagged Decision Tree on the training data and returns the predicted labels
ada_boost_classifier : Applies ADA Boost on the training data and returns the predicted labels
gradient_boost_classifier : Applies Gradient Boost on the training data and returns the predicted labels
accuracy : Finds the accuracy in percentage given the training and test labels
7. Environment
OS: Linux Mint
Language : Python 3
Libraries : Scikit, Pandas
8. How to Execute?
Run python sentimentalAnalysis.py
9. Screenshots
Check Result in ScreenShot folder
10. Publication
Paper Title:
Supervised Ensemble Machine Learning Aided Performance Evaluation of Sentiment Classification
Authonrs:
Sheikh Shah Mohammad Motiur Rahman,Md. Habibur Rahman,Kaushik Sarker,Md. Samadur Rahman, Nazmul Ahsan,M. Mesbahuddin Sarker
Conference Info:
2nd International Conference on Data Mining, Communications and Information Technology (DMCIT 2018), Shanghai, China