Practical Analysis of Consumer Reviews: Part I

Text Analysis for Consumer Reviews | Beyond Sentiment Analysis

Malaka Gunawardena

Feb 12, 2021

Can sentiment analysis alone solve business problems?

Not really. Why?

Knowing that a review is positive or negative does not, by itself, solve the business problem.

What if you could also identify what is good and what is bad?

Yes, that would help. The business could then focus on the specific areas that matter to the end consumer.

This article focuses on how to go beyond standard sentiment analysis by combining it with custom classifiers built from a limited data set.

Before going beyond sentiment analysis, let's understand what sentiment analysis is and how to do it.

Wikipedia
Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.

Advantages of custom classification

Standard text classification models such as sentiment analysis are not always sufficient to give a holistic view when analyzing unstructured data like the voice of the customer. A custom classifier, on the other hand, lets you sort such data into the more nuanced categories you care about. For example, using a custom classifier, a fast food chain owner can sort reviews into categories like price, ambiance, staff behavior, and food quality to gain better insight into which aspects of the business need to improve.

Action plan

1- Build a sentiment analysis algorithm
2- Build a custom classifier
3- Create an interactive dashboard

Let's build a basic sentiment analysis algorithm using Python (scikit-learn)

Below are the quick steps to build a basic sentiment analysis algorithm using scikit-learn.

You can find the complete code here.

Check the data set

Our data set consists of two columns, a review and its sentiment, across 65,000 rows that will be used to train and test the classification model.

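import pandas as pd
import matplotlib.pyplot as plt
# Load the labelled reviews and plot how many rows each sentiment class has
data = pd.read_excel('sentimanet.xlsx', sheet_name='Sheet1')
plt.figure(figsize=(10, 4))
data.Sentiment.value_counts().plot(kind='bar');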

count of positive reviews vs negative reviews

This data set is balanced across the two classes (positive and negative), so we can go ahead and create our binary classification model. If your data set is imbalanced, there are multiple ways to overcome that problem, using techniques such as SMOTE, oversampling, and undersampling.
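If you do hit an imbalanced data set, the simplest of these options is random oversampling: duplicate minority-class rows until the classes are the same size. Below is a minimal sketch using only the standard library. The function name `random_oversample` and the toy rows are made up for illustration, and this is plain random oversampling, not SMOTE.

```python
import random
from collections import Counter

def random_oversample(rows, label_of, seed=42):
    """Duplicate random minority-class rows until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the majority-class size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Toy, made-up reviews: 4 labelled 1 and only 1 labelled 0.
toy_rows = [
    ("great fit", 1), ("love it", 1), ("nice fabric", 1), ("good price", 1),
    ("too tight", 0),
]
balanced_rows = random_oversample(toy_rows, label_of=lambda row: row[1])
class_counts = Counter(label for _, label in balanced_rows)
print(class_counts)  # both classes now have 4 rows
```

If you use this, oversample only the training split. Otherwise copies of the same review can land in both the training and the test data and inflate the test score.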

Text cleaning is very important for getting the best out of the machine learning model that we are going to build.

import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    text = BeautifulSoup(text, "lxml").text  # HTML decoding
    text = text.lower()  # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace REPLACE_BY_SPACE_RE symbols with a space
    text = BAD_SYMBOLS_RE.sub('', text)  # delete symbols matched by BAD_SYMBOLS_RE
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)  # delete stopwords
    return text

data['review'] = data['Review'].apply(clean_text)
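To see what the two regular expressions actually do to a review, here is a small standalone sketch. It reuses the same two patterns but, to stay dependency-free, skips the HTML decoding step and swaps the NLTK stopword list for a tiny made-up set (`DEMO_STOPWORDS`), so treat it as an illustration rather than the exact cleaning step.

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
# Hypothetical mini stopword list, standing in for NLTK's English list.
DEMO_STOPWORDS = {"the", "is", "but", "a", "and"}

def clean_text_demo(text):
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # punctuation such as , ( ) becomes a space
    text = BAD_SYMBOLS_RE.sub('', text)        # anything outside 0-9 a-z space # + _ is dropped
    return ' '.join(word for word in text.split() if word not in DEMO_STOPWORDS)

print(clean_text_demo("The dress is GREAT, but too tight (size M)!"))
# -> dress great too tight size m
```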

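Train-Test Split

from sklearn.model_selection import train_test_split

# Hold out 30% of the reviews for testing; random_state makes the split reproducible
X = data.review
y = data.Sentiment
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)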

Model Building

Several models were tried, and the best one (a multinomial Naive Bayes pipeline) was picked based on accuracy.

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Bag-of-words counts -> TF-IDF weights -> multinomial Naive Bayes
nb = Pipeline([('vect', CountVectorizer()),
               ('tfidf', TfidfTransformer()),
               ('clf', MultinomialNB()),
              ])
nb.fit(X_train, y_train)
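If you are curious what the Naive Bayes step is doing under the hood, the sketch below is a tiny from-scratch version. It is an illustration only, not scikit-learn's implementation: it works on raw word counts (the pipeline above feeds TF-IDF weights instead), uses add-one smoothing, and the class name `TinyMultinomialNB` and the toy reviews are made up.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Multinomial Naive Bayes over word counts with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_doc_counts = Counter(labels)   # documents per class, for the prior
        self.word_counts = defaultdict(Counter)   # word frequencies per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.split())
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        self.total_docs = len(labels)
        return self

    def _log_score(self, text, label):
        counts = self.word_counts[label]
        total_words = sum(counts.values())
        vocab_size = len(self.vocabulary)
        # log P(class) + sum of log P(word | class)
        score = math.log(self.class_doc_counts[label] / self.total_docs)
        for word in text.split():
            if word not in self.vocabulary:
                continue  # words never seen in training carry no information
            score += math.log((counts[word] + 1) / (total_words + vocab_size))
        return score

    def predict(self, texts):
        classes = list(self.class_doc_counts)
        return [max(classes, key=lambda c: self._log_score(text, c)) for text in texts]

# Toy, made-up training data: 1 = positive, 0 = negative.
train_texts = ["love fit great fabric", "great price love",
               "poor fabric tight", "tight poor price"]
train_labels = [1, 1, 0, 0]
model = TinyMultinomialNB().fit(train_texts, train_labels)
print(model.predict(["love great", "tight poor"]))  # -> [1, 0]
```

Each class gets a score made of its prior plus the log-probability of every word in the review, and the class with the higher score wins. The "naive" part is treating words as independent of each other.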

Model Validation

%%time
from sklearn.metrics import accuracy_score, classification_report

y_pred = nb.predict(X_test)
print('accuracy %s' % accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

accuracy 0.7948247174249814
              precision    recall  f1-score   support

           0       0.77      0.86      0.82      9129
           1       0.83      0.72      0.77      8300

   micro avg       0.79      0.79      0.79     17429
   macro avg       0.80      0.79      0.79     17429
weighted avg       0.80      0.79      0.79     17429

Wall time: 527 ms
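To read the report: precision is the share of predictions for a class that were right, recall is the share of that class's true rows the model found (so a recall of 0.72 for class 1 means 72% of the reviews labelled 1 were recovered), and F1 is their harmonic mean. The sketch below computes the same quantities by hand on a made-up list of labels, just to show where the numbers come from. The function name `binary_report` is hypothetical.

```python
def binary_report(y_true, y_pred, positive=1):
    """Precision, recall, F1 for one class, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Toy labels, made up for illustration (not taken from the article's data set).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
report = binary_report(y_true, y_pred)
print(report)  # every value is 0.75 for this toy example
```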

We have built a fairly accurate sentiment classification model (about 79% accuracy on the test set). Let's pickle it so we can reuse it.

import pickle

Pkl_Filename = "Sentiment.pkl"
# Save the trained pipeline to disk
with open(Pkl_Filename, 'wb') as file:
    pickle.dump(nb, file)
# Load it back as the Sentiment model
with open(Pkl_Filename, 'rb') as file:
    Sentiment = pickle.load(file)
Sentiment

Now we can use this pickled object to add a sentiment score to our data set.

Let's build a custom classifier to classify reviews

We can use a Python library called "fuzzywuzzy", which is simple but very effective.

fuzzywuzzy (pypi.org): Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a…
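Levenshtein distance is simply the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. That is why fuzzy matching tolerates the misspellings you find in real reviews. Here is a minimal pure-Python sketch of the distance. The `similarity` helper is a hypothetical 0-100 normalization added for illustration and is not fuzzywuzzy's exact scoring formula.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Hypothetical 0-100 score: 100 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))

print(levenshtein("cheep", "cheap"))     # -> 1
print(levenshtein("kitten", "sitting"))  # -> 3
print(similarity("cheep", "cheap"))      # -> 80
```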

# import fuzzywuzzy library
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

Define keywords for each category

fit = ['fit', 'fitting', 'small', 'too', 'tight', 'large', 'lose', 'confortable', 'stretchy', 'tailored', 'strappy']
price = ['expensive', 'price', 'cheep', 'bucks', 'buy', 'sell', 'purchase', 'too', 'price', 'cost', 'amount', 'buks']
material = ['strech', 'material', 'fabric', 'composition', 'feel', 'premium']

Let's create a scoring function using both the custom classifier and the sentiment analyzer.

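def aspects(sentence):
    # The pickled pipeline expects a list of texts, so wrap the sentence and take the first label
    sentiment = Sentiment.predict([sentence])[0]
    # process.default_scorer is fuzz.WRatio, which compares two strings,
    # so each keyword list is joined into a single string before scoring
    fit_score = process.default_scorer(sentence, ' '.join(fit))
    mat_score = process.default_scorer(sentence, ' '.join(material))
    price_score = process.default_scorer(sentence, ' '.join(price))
    return sentiment, fit_score, mat_score, price_score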

Let's check the final output data set we get once we apply the scoring function.
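To give a feel for what that output looks like, here is a dependency-free sketch that tags a few made-up reviews with the aspects they mention. It is a stand-in, not the article's scoring function: it uses `difflib.SequenceMatcher` from the standard library instead of fuzzywuzzy, shortened keyword lists, and an assumed threshold of 80. The names `aspect_score` and `tag_aspects` are hypothetical.

```python
from difflib import SequenceMatcher

# Shortened keyword lists, for illustration only.
ASPECT_KEYWORDS = {
    "fit": ["fit", "tight", "small", "large"],
    "price": ["price", "expensive", "cost"],
    "material": ["material", "fabric", "premium"],
}

def aspect_score(sentence, keywords):
    """Best 0-100 similarity between any word in the sentence and any keyword."""
    tokens = sentence.lower().split()
    return max(
        (round(100 * SequenceMatcher(None, token, keyword).ratio())
         for token in tokens for keyword in keywords),
        default=0,
    )

def tag_aspects(sentence, threshold=80):
    """Return the aspects whose best keyword match reaches the (assumed) threshold."""
    scores = {aspect: aspect_score(sentence, words)
              for aspect, words in ASPECT_KEYWORDS.items()}
    return [aspect for aspect, score in scores.items() if score >= threshold]

reviews = ["the fabrik feels cheap", "way too expensive", "runs tight on the shoulders"]
for review in reviews:
    print(review, "->", tag_aspects(review))
# the fabrik feels cheap -> ['material']
# way too expensive -> ['price']
# runs tight on the shoulders -> ['fit']
```

Note how the misspelled "fabrik" still lands in the material category. Pair each tag with the sentiment label for the same review and you know not just whether a customer is unhappy, but what they are unhappy about.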

Now we have an analyzed data set, ready for building an interactive dashboard in the next story.