What other people think about a product or service has a big impact on our everyday decisions. In the past, people relied on the opinions of friends and relatives or on product and service reports, but the Internet era has changed this significantly. Today opinions are collected from people around the world through reviews on e-commerce sites as well as through blogs and social networks. To turn the gathered data into useful information about how a product or service is perceived, sentiment analysis is needed.
Sentiment analysis is the computational study of opinions, sentiments, and emotions expressed in textual data. Companies use it increasingly because the extracted information can help them monetize their products and services.
Words express various kinds of sentiment: they can be positive, negative, or neutral, with no emotional overtone. To analyze the sentiment of a text, we need to know the polarity of its words so that the text can be classified as positive, negative, or neutral. This can be achieved with sentiment lexicons.
Sentiment analysis can be done in three ways: using ML algorithms, using dictionaries and lexicons, and combining these techniques.
The ML-based approach has become very popular because it offers wide opportunities for identifying different expressions of sentiment in text.
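To make the ML-based approach concrete, here is a minimal sketch of a Naive Bayes text classifier written with the standard library only. Naive Bayes is just one possible ML algorithm, chosen here for brevity. The training sentences, the `train` and `predict` helpers, and the labels are all invented for illustration; a real project would use a much larger labelled dataset.

```python
import math
from collections import Counter, defaultdict

# Tiny hand-labelled training set (invented for illustration only).
TRAIN = [
    ("great product works well", "positive"),
    ("love it excellent quality", "positive"),
    ("terrible waste of money", "negative"),
    ("poor quality broke quickly", "negative"),
]

def train(samples):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in samples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocab

def predict(text, model):
    """Return the class with the highest log-probability (Laplace smoothing)."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word in vocab:  # ignore words never seen in training
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAIN)
print(predict("excellent quality", model))  # positive
print(predict("waste of money", model))     # negative
```

Unlike a lexicon, this classifier knows nothing about a word until it has seen it in labelled examples, which is why the approach depends on a training set.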
For the lexicon-based approach, various dictionaries with polarity scores are available. Such dictionaries help establish the connotation of a word. One advantage of this approach is that you don't need a training set, which is why even a small amount of data can be classified successfully. The drawback is that many words are still missing from sentiment lexicons, which somewhat degrades the classification results.
Sentiment analysis that combines ML and lexicon-based techniques is less popular, but it can achieve more promising results than either approach used on its own.
Dictionaries are the central part of lexicon-based sentiment analysis. The most popular are AFINN, Bing, and NRC, which can be found and installed from the Python package repository. All of these dictionaries are based on polarity scores that can be positive, negative, or neutral. For Python developers, two sentiment tools are especially helpful: VADER and TextBlob. VADER is a rule- and lexicon-based sentiment analysis tool tuned to the sentiments found in social media posts. It uses a list of tokens labeled according to their semantic connotation. TextBlob is a general-purpose text processing library that handles tasks such as phrase extraction, sentiment analysis, classification, and so on.
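The idea can be sketched in a few lines. The mini-lexicon below and its scores are made up for illustration and do not come from any published dictionary; `lexicon_sentiment` is a hypothetical helper name.

```python
# Hypothetical mini-lexicon: word -> polarity score (values invented for illustration).
LEXICON = {"good": 2, "great": 3, "happy": 2, "bad": -2, "terrible": -3, "sad": -2}

def lexicon_sentiment(text, lexicon=LEXICON):
    """Sum the polarity of known words; words missing from the lexicon count as 0."""
    words = text.lower().split()
    score = sum(lexicon.get(word, 0) for word in words)
    if score > 0:
        return score, "positive"
    if score < 0:
        return score, "negative"
    return score, "neutral"

print(lexicon_sentiment("The service was great and the staff were happy"))  # (5, 'positive')
print(lexicon_sentiment("A terrible and sad experience"))                   # (-5, 'negative')
print(lexicon_sentiment("The delivery was abysmal"))                        # (0, 'neutral')
```

The last call shows the weakness mentioned above: "abysmal" is clearly negative, but because it is not in the lexicon the sentence is scored as neutral.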
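To illustrate what "rule- and lexicon-based" can mean in practice, here is a toy sketch that adds two simple rules, negation and intensifiers, on top of a word list. This is not VADER's actual algorithm or lexicon. The words, scores, multipliers, and the `rule_based_score` helper are all assumptions made for the example.

```python
# Toy word lists (invented for illustration, not taken from VADER).
LEXICON = {"good": 2.0, "great": 3.0, "bad": -2.0, "awful": -3.0}
NEGATIONS = {"not", "never", "no"}
BOOSTERS = {"very": 1.5, "really": 1.5}

def rule_based_score(text):
    """Sum word polarities, boosting after an intensifier and flipping after a negation."""
    tokens = text.lower().split()
    total = 0.0
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        value = LEXICON[token]
        prev = tokens[i - 1] if i > 0 else ""
        prev2 = tokens[i - 2] if i > 1 else ""
        if prev in BOOSTERS:
            value *= BOOSTERS[prev]      # "very good" is stronger than "good"
        if prev in NEGATIONS or (prev in BOOSTERS and prev2 in NEGATIONS):
            value = -value               # "not good" and "not very good" flip the sign
        total += value
    return total

print(rule_based_score("good"))           # 2.0
print(rule_based_score("very good"))      # 3.0
print(rule_based_score("not good"))       # -2.0
print(rule_based_score("not very good"))  # -3.0
```

A plain word-by-word sum would score "not good" as positive, which is the kind of error such rules are meant to reduce.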
In this tutorial, we will build a lexicon-based sentiment classifier for Donald Trump's tweets with the help of TextBlob. Let's see which sentiments generally prevail in his tweets.
As with any data exploration, a few steps come before the analysis: problem statement and data preparation. Since the topic of our study is already stated, let's concentrate on data preparation.
We will get the tweets directly from Twitter. The data arrives unstructured, so we need to organize it into a dataframe and clean it by removing links and stopwords.
First of all, we import the packages needed for the task (they have to be installed beforehand).
import tweepy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import nltk
import nltk.corpus as corp
from textblob import TextBlob
The next step is to connect our app to Twitter via the Twitter API. Provide the credentials that our function will use to connect and to extract tweets from Donald Trump's account.
CONSUMER_KEY = "Key"
CONSUMER_SECRET = "Secret"
ACCESS_TOKEN = "Token"
ACCESS_SECRET = "Secret"

def twitter_access():
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    api = tweepy.API(auth)
    return api

twitter = twitter_access()
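# A single user_timeline request returns at most 200 tweets,
# so we page through the timeline with Cursor to collect up to 600.
tweets = list(tweepy.Cursor(twitter.user_timeline, screen_name="realDonaldTrump", count=200).items(600))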
This is how our dataset looks:
tweets[0]
Status(_api=<tweepy.api.API object at 0x7f987f0ce240>, _json={'created_at': 'Wed Jun 26 02:34:41 +0000 2019', 'id': 1143709133234954241, 'id_str': '1143709133234954241', 'text': 'Presidential Harassment!', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': []}, 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 25073877, 'id_str': '25073877', 'name': 'Donald J. Trump', 'screen_name': 'realDonaldTrump', 'location': 'Washington, DC', 'description': '45th President of the United States of America🇺🇸', 'url': 'https://t.co/OMxB0x7xC5', 'entities': {'url': {'urls': [{'url': 'https://t.co/OMxB0x7xC5', 'expanded_url': 'http://www.Instagram.com/realDonaldTrump', 'display_url': 'Instagram.com/realDonaldTrump', 'indices': [0, 23]}]}, 'description': {'urls': []}}, 'protected': False, 'followers_count': 61369691, 'friends_count': 47, 'listed_count': 104541, 'created_at': 'Wed Mar 18 13:46:38 +0000 2009', 'favourites_count': 7, 'utc_offset': None, 'time_zone': None, 'geo_enabled': True, 'verified': True, 'statuses_count': 42533, 'lang': None, 'contributors_enabled': False, 'is_translator': False, 'is_translation_enabled': True, 'profile_background_color': '6D5C18', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_tile': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/25073877/1560920145', 'profile_link_color': '1B95E0', 'profile_sidebar_border_color': 
'BDDCAD', 'profile_sidebar_fill_color': 'C5CEC0', 'profile_text_color': '333333', 'profile_use_background_image': True, 'has_extended_profile': False, 'default_profile': False, 'default_profile_image': False, 'following': False, 'follow_request_sent': False, 'notifications': False, 'translator_type': 'regular'}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'is_quote_status': False, 'retweet_count': 10387, 'favorite_count': 48141, 'favorited': False, 'retweeted': False, 'lang': 'in'}, created_at=datetime.datetime(2019, 6, 26, 2, 34, 41), id=1143709133234954241, id_str='1143709133234954241', text='Presidential Harassment!', truncated=False, entities={'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': []}, source='Twitter for iPhone', source_url='http://twitter.com/download/iphone', in_reply_to_status_id=None, in_reply_to_status_id_str=None, in_reply_to_user_id=None, in_reply_to_user_id_str=None, in_reply_to_screen_name=None, author=User(_api=<tweepy.api.API object at 0x7f987f0ce240>, _json={'id': 25073877, 'id_str': '25073877', 'name': 'Donald J. 
Trump', 'screen_name': 'realDonaldTrump', 'location': 'Washington, DC', 'description': '45th President of the United States of America🇺🇸', 'url': 'https://t.co/OMxB0x7xC5', 'entities': {'url': {'urls': [{'url': 'https://t.co/OMxB0x7xC5', 'expanded_url': 'http://www.Instagram.com/realDonaldTrump', 'display_url': 'Instagram.com/realDonaldTrump', 'indices': [0, 23]}]}, 'description': {'urls': []}}, 'protected': False, 'followers_count': 61369691, 'friends_count': 47, 'listed_count': 104541, 'created_at': 'Wed Mar 18 13:46:38 +0000 2009', 'favourites_count': 7, 'utc_offset': None, 'time_zone': None, 'geo_enabled': True, 'verified': True, 'statuses_count': 42533, 'lang': None, 'contributors_enabled': False, 'is_translator': False, 'is_translation_enabled': True, 'profile_background_color': '6D5C18', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_tile': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/25073877/1560920145', 'profile_link_color': '1B95E0', 'profile_sidebar_border_color': 'BDDCAD', 'profile_sidebar_fill_color': 'C5CEC0', 'profile_text_color': '333333', 'profile_use_background_image': True, 'has_extended_profile': False, 'default_profile': False, 'default_profile_image': False, 'following': False, 'follow_request_sent': False, 'notifications': False, 'translator_type': 'regular'}, id=25073877, id_str='25073877', name='Donald J. 
Trump', screen_name='realDonaldTrump', location='Washington, DC', description='45th President of the United States of America🇺🇸', url='https://t.co/OMxB0x7xC5', entities={'url': {'urls': [{'url': 'https://t.co/OMxB0x7xC5', 'expanded_url': 'http://www.Instagram.com/realDonaldTrump', 'display_url': 'Instagram.com/realDonaldTrump', 'indices': [0, 23]}]}, 'description': {'urls': []}}, protected=False, followers_count=61369691, friends_count=47, listed_count=104541, created_at=datetime.datetime(2009, 3, 18, 13, 46, 38), favourites_count=7, utc_offset=None, time_zone=None, geo_enabled=True, verified=True, statuses_count=42533, lang=None, contributors_enabled=False, is_translator=False, is_translation_enabled=True, profile_background_color='6D5C18', profile_background_image_url='http://abs.twimg.com/images/themes/theme1/bg.png', profile_background_image_url_https='https://abs.twimg.com/images/themes/theme1/bg.png', profile_background_tile=True, profile_image_url='http://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg', profile_image_url_https='https://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg', profile_banner_url='https://pbs.twimg.com/profile_banners/25073877/1560920145', profile_link_color='1B95E0', profile_sidebar_border_color='BDDCAD', profile_sidebar_fill_color='C5CEC0', profile_text_color='333333', profile_use_background_image=True, has_extended_profile=False, default_profile=False, default_profile_image=False, following=False, follow_request_sent=False, notifications=False, translator_type='regular'), user=User(_api=<tweepy.api.API object at 0x7f987f0ce240>, _json={'id': 25073877, 'id_str': '25073877', 'name': 'Donald J. 
Trump', 'screen_name': 'realDonaldTrump', 'location': 'Washington, DC', 'description': '45th President of the United States of America🇺🇸', 'url': 'https://t.co/OMxB0x7xC5', 'entities': {'url': {'urls': [{'url': 'https://t.co/OMxB0x7xC5', 'expanded_url': 'http://www.Instagram.com/realDonaldTrump', 'display_url': 'Instagram.com/realDonaldTrump', 'indices': [0, 23]}]}, 'description': {'urls': []}}, 'protected': False, 'followers_count': 61369691, 'friends_count': 47, 'listed_count': 104541, 'created_at': 'Wed Mar 18 13:46:38 +0000 2009', 'favourites_count': 7, 'utc_offset': None, 'time_zone': None, 'geo_enabled': True, 'verified': True, 'statuses_count': 42533, 'lang': None, 'contributors_enabled': False, 'is_translator': False, 'is_translation_enabled': True, 'profile_background_color': '6D5C18', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_tile': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/25073877/1560920145', 'profile_link_color': '1B95E0', 'profile_sidebar_border_color': 'BDDCAD', 'profile_sidebar_fill_color': 'C5CEC0', 'profile_text_color': '333333', 'profile_use_background_image': True, 'has_extended_profile': False, 'default_profile': False, 'default_profile_image': False, 'following': False, 'follow_request_sent': False, 'notifications': False, 'translator_type': 'regular'}, id=25073877, id_str='25073877', name='Donald J. 
Trump', screen_name='realDonaldTrump', location='Washington, DC', description='45th President of the United States of America🇺🇸', url='https://t.co/OMxB0x7xC5', entities={'url': {'urls': [{'url': 'https://t.co/OMxB0x7xC5', 'expanded_url': 'http://www.Instagram.com/realDonaldTrump', 'display_url': 'Instagram.com/realDonaldTrump', 'indices': [0, 23]}]}, 'description': {'urls': []}}, protected=False, followers_count=61369691, friends_count=47, listed_count=104541, created_at=datetime.datetime(2009, 3, 18, 13, 46, 38), favourites_count=7, utc_offset=None, time_zone=None, geo_enabled=True, verified=True, statuses_count=42533, lang=None, contributors_enabled=False, is_translator=False, is_translation_enabled=True, profile_background_color='6D5C18', profile_background_image_url='http://abs.twimg.com/images/themes/theme1/bg.png', profile_background_image_url_https='https://abs.twimg.com/images/themes/theme1/bg.png', profile_background_tile=True, profile_image_url='http://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg', profile_image_url_https='https://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg', profile_banner_url='https://pbs.twimg.com/profile_banners/25073877/1560920145', profile_link_color='1B95E0', profile_sidebar_border_color='BDDCAD', profile_sidebar_fill_color='C5CEC0', profile_text_color='333333', profile_use_background_image=True, has_extended_profile=False, default_profile=False, default_profile_image=False, following=False, follow_request_sent=False, notifications=False, translator_type='regular'), geo=None, coordinates=None, place=None, contributors=None, is_quote_status=False, retweet_count=10387, favorite_count=48141, favorited=False, retweeted=False, lang='in')
Not very informative, huh? Let's make our dataset look more legible.
tweetdata = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=["tweets"])
tweetdata["Created at"] = [tweet.created_at for tweet in tweets]
tweetdata["retweets"] = [tweet.retweet_count for tweet in tweets]
tweetdata["source"] = [tweet.source for tweet in tweets]
tweetdata["favorites"] = [tweet.favorite_count for tweet in tweets]
And this is how it looks now. Much better, isn't it?
tweetdata.head()
| | tweets | Created at | retweets | source | favorites |
|---|---|---|---|---|---|
| 0 | Presidential Harassment! | 2019-06-26 02:34:41 | 10387 | Twitter for iPhone | 48141 |
| 1 | Senator Thom Tillis of North Carolina has real... | 2019-06-25 22:20:42 | 11127 | Twitter for iPhone | 45202 |
| 2 | Staff Sgt. David Bellavia - today, we honor yo... | 2019-06-25 21:38:42 | 11455 | Twitter for iPhone | 48278 |
| 3 | Today, it was my great honor to present the Me... | 2019-06-25 20:27:19 | 10389 | Twitter for iPhone | 44485 |
| 4 | ....Martha is strong on Crime and Borders, the... | 2019-06-25 19:25:20 | 9817 | Twitter for iPhone | 52995 |
The next step is to clean the dataset of words that carry no meaning and to enrich it so that, besides the default tweet data, it contains each tweet's connotation (positive, negative, or neutral), sentiment score, and subjectivity.
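# NLTK's stopword list has to be downloaded once, e.g. with nltk.download('stopwords')
stopword = corp.stopwords.words('english') + ['rt', 'https', 'co', 'u', 'go']

def clean_tweet(tweet):
    # Lowercase, then replace mentions, non-alphanumeric characters, and URLs with spaces
    tweet = tweet.lower()
    tweetList = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split()
    # Keep only the words that are not stopwords
    filteredList = [word for word in tweetList if word not in stopword]
    return ' '.join(filteredList)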
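To see what this cleaning step does to a single tweet, here is a standalone sketch that applies the same regular expression with a small stand-in stopword set, so it runs without NLTK. The `STOPWORDS` set, the `clean` helper, and the sample tweets are assumptions made for the example; with NLTK's full English list more words would be removed.

```python
import re

# Small stand-in stopword set; the tutorial uses NLTK's English list plus a few extras.
STOPWORDS = {"the", "is", "a", "to", "of", "and", "rt", "https", "co"}

# Same pattern as in clean_tweet: mentions, non-alphanumeric characters, URLs.
PATTERN = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)")

def clean(text):
    """Lowercase, blank out mentions/punctuation/URLs, and drop stopwords."""
    tokens = PATTERN.sub(" ", text.lower()).split()
    return " ".join(token for token in tokens if token not in STOPWORDS)

print(clean("RT @FoxNews: The economy is GREAT! https://t.co/abc123"))  # economy great
print(clean("Thank you #MAGA!!"))                                       # thank you maga
```

Note that the hash sign is stripped but the hashtag word itself is kept, while mentions and links disappear entirely.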
scores = []
status = []
sub = []
fullText = []

for tweet in tweetdata['tweets']:
    analysis = TextBlob(clean_tweet(tweet))
    fullText.extend(analysis.words)
    # polarity lies in [-1, 1], subjectivity in [0, 1]
    value = analysis.sentiment.polarity
    subject = analysis.sentiment.subjectivity
    if value > 0:
        sent = 'positive'
    elif value == 0:
        sent = 'neutral'
    else:
        sent = 'negative'
    scores.append(value)
    status.append(sent)
    sub.append(subject)

tweetdata['sentimental_score'] = scores
tweetdata['sentiment_status'] = status
tweetdata['subjectivity'] = sub
# Drop the retweets, source, and favorites columns, which are not needed for the analysis
tweetdata.drop(tweetdata.columns[2:5], axis=1, inplace=True)
tweetdata.head()
| | tweets | Created at | sentimental_score | sentiment_status | subjectivity |
|---|---|---|---|---|---|
| 0 | Presidential Harassment! | 2019-06-26 02:34:41 | 0.000000 | neutral | 0.000000 |
| 1 | Senator Thom Tillis of North Carolina has real... | 2019-06-25 22:20:42 | 0.081481 | positive | 0.588889 |
| 2 | Staff Sgt. David Bellavia - today, we honor yo... | 2019-06-25 21:38:42 | 0.333333 | positive | 1.000000 |
| 3 | Today, it was my great honor to present the Me... | 2019-06-25 20:27:19 | 0.400000 | positive | 0.375000 |
| 4 | ....Martha is strong on Crime and Borders, the... | 2019-06-25 19:25:20 | 0.086667 | positive | 0.396667 |
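The rule that turns a polarity score into a label can be isolated in a small function. The sketch below applies it to the five scores shown in the table. The optional `neutral_band` parameter is not part of the tutorial's code; it is a hypothetical extension that treats very weak scores as neutral.

```python
from collections import Counter

def label(polarity, neutral_band=0.0):
    """Map a polarity score to a label; scores within +/- neutral_band count as neutral."""
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"

# Polarity scores of the five tweets shown in the table above.
scores = [0.0, 0.081481, 0.333333, 0.4, 0.086667]

strict = Counter(label(score) for score in scores)
banded = Counter(label(score, neutral_band=0.1) for score in scores)

print(strict["positive"], strict["neutral"], strict["negative"])  # 4 1 0
print(banded["positive"], banded["neutral"], banded["negative"])  # 2 3 0
```

With the tutorial's strict rule, any score above zero counts as positive, however small. A band around zero is one way to make the neutral class less narrow, at the cost of choosing a threshold.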
For a better understanding of the obtained results, let's do some visualization.
positive = len(tweetdata[tweetdata['sentiment_status'] == 'positive'])
negative = len(tweetdata[tweetdata['sentiment_status'] == 'negative'])
neutral = len(tweetdata[tweetdata['sentiment_status'] == 'neutral'])

fig, ax = plt.subplots(figsize=(10, 5))
index = range(3)
plt.bar(index[2], positive, color='green', edgecolor='black', width=0.8)
plt.bar(index[0], negative, color='orange', edgecolor='black', width=0.8)
plt.bar(index[1], neutral, color='grey', edgecolor='black', width=0.8)
plt.legend(['Positive', 'Negative', 'Neutral'])
plt.xlabel('Sentiment Status ', fontdict={'size': 15})
plt.ylabel('Sentimental Frequency', fontdict={'size': 15})
plt.title("Donald Trump's Twitter sentiment status", fontsize=20)
[Figure: bar chart "Donald Trump's Twitter sentiment status"; x-axis: Sentiment Status (negative, neutral, positive); y-axis: Sentimental Frequency]
Sentiment analysis is a great way to explore the emotions and opinions in society. We created a basic sentiment classifier that can be used to analyze textual data from social networks. The lexicon-based approach also lets you create your own lexicons, so you can fine-tune the sentiment analysis to the task, the textual data, and the goal of the analysis.