Your Career Platform for Big Data

Be part of the digital revolution in the UK and Ireland

 

Latest job opportunities

Stark Horley, UK
12/07/2019
Full time
About Operations
Our busy Operations team works within industry guidelines to ensure that the highest possible quality of utility data is delivered to Suppliers promptly. We work closely as a team and with external agents to achieve this end, and we deal with a wide range of issues that prevent or delay high-quality data provision.

The Data Collector and Analyst role requires you to:
- Validate the quality of utility data
- Investigate and resolve appointment anomalies
- Respond to a wide range of internal and external queries
- Participate in data collection and associated activities, with the enthusiasm and drive necessary to ensure data is delivered in accordance with the team's published service levels
- Fully adhere to industry regulations and standards
Please note: complex data or spreadsheet analysis is not a part of our daily work.

Skills and experience required:
- A proven, high standard of communication skills, both written and verbal
- Initiative, drive and tenacity
- The ability to coordinate and liaise professionally with internal and external parties
- An investigative and enquiring approach
- Strong problem-solving skills with the ability to think laterally
- Good organisational skills
- The ability to focus on complex issues and to enjoy working to achieve complete resolution
- A good working knowledge of Excel, Word, Access and Outlook

Minimum qualifications required:
- GCSE (or equivalent) grade A-B in Maths, English and a Science
- A minimum of 2 'good' passes at A level or equivalent, including Maths and a Science. If only one of these is offered, the other pass must be in a subject that requires a high level of literacy, e.g. English or History
- A degree is desirable
Sussex Police Guildford, UK
12/07/2019
Full time
Division / Department - Corporate Development
Grade - Grade G
Status - Full Time
Contract Type - Permanent
Salary Grade Range - £26,744 - £31,941
Working Hours - 36.0 Hours per Week
The starting salary for this role will usually be at the bottom of the salary range.

The Role
Surrey Police is looking for a Performance Analyst on a permanent contract to support a range of operational and business support portfolios. We are seeking bright and enthusiastic individuals of graduate calibre (a degree in an analytical subject within social sciences/mathematics, or well-proven and evidenced experience of numerical or research analysis in a law enforcement environment) to join our highly skilled performance analysts. As an analyst within the Performance & Consultation Unit you will regularly monitor and identify performance risks and opportunities that have the potential to impact the force's strategy and the public's priorities. You will be highly regarded throughout the force for your well-developed technical and mathematical skills and will take the lead on developing, maintaining and communicating performance frameworks for your respective areas of responsibility.

Key Responsibilities
To be responsible for the ongoing development, integrity, and delivery of management information, preparing it in the form of products and reports that provide quantitative and qualitative analysis in support of the force's strategy as well as other national, regional and local government requirements. Your work will play a key role in assessing the force's performance against its strategy for service delivery. To proactively liaise with customers to provide them with timely, accurate and relevant assessments of performance in relation to the force's strategy.
Skills & Experience
Successful candidates will be expected to have outstanding problem-solving skills and to be highly computer literate, including the advanced functionality of MS Excel (experience in Oracle SQL / Tableau / Power BI is highly desirable). The preferred candidates will also be familiar with analysing data and information, making inferences and explaining statistical concepts, and will have the ability to manage requirements and prioritise tasks across a diverse range of subject areas. Most importantly, you must be an exceptional communicator: you will be expected to maintain good working relationships within the organisation (including key stakeholders and internal customers) and with external agencies such as the Home Office and HMIC. You will also be expected to present the findings of your analysis to senior leaders, who will use your insight and analysis to support their decision making.

Contact Details
For further information please contact Tony Fenton-Jones (tony.fenton-jones@surrey.pnn.police.uk). Previously unsuccessful candidates who have applied for this post in the past 6 months should not reapply. The successful candidate will start as a junior performance analyst (grade G) with the opportunity to develop a portfolio of work that can lead to an established analyst position at grade H, usually within two years of starting.

Additional Information
This post is being advertised in parallel with Force redeployment processes. Any redeployees who are identified will be given preference. This may result in the post being withdrawn at any point during the recruitment process. If you are an internal candidate looking to apply for this role, please ensure that you discuss your application with your line manager before submitting an application. If you are conditionally offered the role, your attendance record and any reasonable adjustments already in place will be discussed with you and your current line manager.
Please note that not all jobs are available to internal candidates across both Forces. The current agreed recruitment principles are:
- vacancies in collaborated units are available to all officers and staff across both Forces
- vacancies in non-collaborated units are only available to officers and staff within the Force with the vacancy, unless the post is advertised externally. If the vacancy is advertised externally and an officer or member of staff from the other Force is successful, it will result in a transfer of employment.
Surrey Police and Sussex Police Special Constables, Volunteers and Agency Staff (excluding self-employed workers) covered under the Agency Worker Regulations (AWR) are eligible to apply for internally advertised posts. Applications will only be accepted by clicking the 'Apply Now' button. CVs are not accepted as part of our recruitment process.

Diversity Statement
Surrey Police and Sussex Police are committed to providing quality in the service we deliver and in the career opportunities we offer our people, and to increasing the diversity of our workforce to reflect the community we serve. We value diversity and inclusion and want to attract the best people for the roles available, regardless of age, ethnicity, sexual orientation, gender identity, sex, disability, social status and religious beliefs. We want to do all we can to encourage fair and inclusive opportunities, helping us to ensure that all individuals are properly supported and represented. We are committed to the elimination of unfair discrimination, and we are determined to ensure that all our employees receive fair and equitable treatment. We are particularly keen to hear from applicants from different ethnic backgrounds and experiences.

DataCareer Blog

The way other people think about a product or service has a big impact on our everyday decision making. People used to rely on the opinions of friends and relatives, but the Internet has changed that significantly: today, opinions are gathered from people around the world via reviews on e-commerce sites, blogs, and social networks. To transform the gathered data into useful information about how a product or service is perceived, sentiment analysis is needed.

What is sentiment analysis and why do we need it
Sentiment analysis is the computational study of opinions, sentiments, and emotions expressed in textual data. Companies use sentiment analysis increasingly because the extracted information can help them monetize their products and services. Words express various kinds of sentiment: they can be positive, negative, or neutral (carrying no emotional overtone). To analyze a text's sentiment, we need to understand the polarity of its words so that sentiments can be classified into positive, negative, or neutral categories. This can be achieved through the use of sentiment lexicons.

Common approaches for classifying sentiment
Sentiment analysis can be done in three ways: using machine-learning (ML) algorithms, using dictionaries and lexicons, or combining the two techniques. The ML-based approach has gained significant popularity, as it offers broad possibilities for identifying different sentiment expressions in text. For the lexicon-based approach, various dictionaries with polarity scores are available; such dictionaries help establish the connotation of a word. One advantage of this approach is that no training set is required, so even a small piece of data can be successfully classified.
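The lexicon-based approach described above can be sketched in a few lines of plain Python. The tiny lexicon and its scores below are invented for illustration; real lexicons such as AFINN contain thousands of scored words.

```python
# Toy polarity lexicon (scores invented for illustration).
LEXICON = {
    "great": 2, "good": 1, "love": 2,
    "bad": -1, "terrible": -2, "hate": -2,
}

def classify(text):
    """Sum the polarity scores of known words; the sign gives the sentiment class."""
    score = sum(LEXICON.get(word, 0) for word in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("What a great phone, I love it"))  # positive
print(classify("Terrible battery, bad screen"))   # negative
```

Because no model is trained, this works on a single sentence — the strength of the lexicon-based approach noted above — but any word missing from the lexicon simply contributes nothing to the score.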
However, many words are still missing from sentiment lexicons, which somewhat diminishes classification results. Sentiment analysis based on a combination of ML and lexicon-based techniques is less common, but it can achieve much more promising results than either approach used independently. Dictionaries are central to lexicon-based sentiment analysis; the most popular are afinn, bing, and nrc, which can be found and installed from the Python package repository. All of these dictionaries assign polarity scores that can be positive, negative, or neutral. For Python developers, two sentiment tools are especially helpful: VADER and TextBlob. VADER is a rule- and lexicon-based sentiment analysis tool tuned to the kind of sentiment found in social media posts; it uses a list of tokens labeled according to their semantic connotation. TextBlob is a useful library for text processing that covers tasks such as phrase extraction, sentiment analysis, and classification.

Things to do before sentiment analysis
In this tutorial, we will build a lexicon-based sentiment classifier for Donald Trump's tweets with the help of TextBlob, and see which sentiments generally prevail across the tweets. As with every data exploration, some steps are needed before the analysis: problem statement and data preparation. Since the theme of our study is already stated, let's concentrate on data preparation. We will get tweets directly from Twitter; the data will arrive in an unordered form, so we need to arrange it into a dataframe and clean it, removing links and stopwords.

Building the sentiment classifier
First of all, we have to install the packages needed for the task.
In [41]:
import tweepy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import nltk
import nltk.corpus as corp
from textblob import TextBlob

The next step is to connect our app to Twitter via the Twitter API. Provide the credentials that will be used in our function for connecting and extracting tweets from Donald Trump's account.

In [4]:
CONSUMER_KEY = "Key"
CONSUMER_SECRET = "Secret"
ACCESS_TOKEN = "Token"
ACCESS_SECRET = "Secret"

In [7]:
def twitter_access():
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    api = tweepy.API(auth)
    return api

twitter = twitter_access()

In [9]:
tweets = twitter.user_timeline("RealDonaldTrump", count=600)

This is how our dataset looks (output truncated — tweepy returns the full Status object, including the complete user profile, for every tweet):

In [81]:
tweets[0]
Out[81]:
Status(_api=<tweepy.api.API object at 0x7f987f0ce240>, _json={'created_at': 'Wed Jun 26 02:34:41 +0000 2019', 'id': 1143709133234954241, 'id_str': '1143709133234954241', 'text': 'Presidential Harassment!', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': []}, 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', ...}, created_at=datetime.datetime(2019, 6, 26, 2, 34, 41), id=1143709133234954241, text='Presidential Harassment!', source='Twitter for iPhone', author=User(..., screen_name='realDonaldTrump', location='Washington, DC', description='45th President of the United States of America🇺🇸', followers_count=61369691, ...), retweet_count=10387, favorite_count=48141, favorited=False, retweeted=False, lang='in')

Not very informative, huh? Let's make our dataset look more legible.
In [101]:
tweetdata = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=["tweets"])

In [102]:
tweetdata["Created at"] = [tweet.created_at for tweet in tweets]
tweetdata["retweets"] = [tweet.retweet_count for tweet in tweets]
tweetdata["source"] = [tweet.source for tweet in tweets]
tweetdata["favorites"] = [tweet.favorite_count for tweet in tweets]

And this is how it looks now. Much better, isn't it?

In [103]:
tweetdata.head()
Out[103]:
   tweets                                             Created at           retweets  source              favorites
0  Presidential Harassment!                           2019-06-26 02:34:41  10387  Twitter for iPhone  48141
1  Senator Thom Tillis of North Carolina has real...  2019-06-25 22:20:42  11127  Twitter for iPhone  45202
2  Staff Sgt. David Bellavia - today, we honor yo...  2019-06-25 21:38:42  11455  Twitter for iPhone  48278
3  Today, it was my great honor to present the Me...  2019-06-25 20:27:19  10389  Twitter for iPhone  44485
4  ....Martha is strong on Crime and Borders, the...  2019-06-25 19:25:20   9817  Twitter for iPhone  52995

The next step is cleaning the dataset of useless words that carry no meaning, and extending it so that, alongside the default tweet data, it contains each tweet's connotation (positive, negative, or neutral), sentiment score, and subjectivity.
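Before the TextBlob step, it is worth seeing what the cleaning stage does on its own. The sketch below mirrors the idea (lowercase, strip mentions, links and punctuation, drop stopwords) using only the standard library; the short STOPWORDS set is just a stand-in for NLTK's full English stopword list.

```python
import re

# A small stand-in for NLTK's English stopword list (illustration only).
STOPWORDS = {"the", "a", "is", "rt", "to", "and", "of"}

def clean_tweet(tweet):
    """Lowercase, replace @mentions, URLs and punctuation with spaces, drop stopwords."""
    tweet = tweet.lower()
    words = re.sub(r"(@[a-z0-9_]+)|(\w+://\S+)|([^0-9a-z \t])", " ", tweet).split()
    return " ".join(w for w in words if w not in STOPWORDS)

print(clean_tweet("RT @user: The economy is https://t.co/abc GREAT!"))  # economy great
```

Only the content-bearing words survive, which is exactly what we want before handing text to a polarity scorer.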
In [104]:
stopword = corp.stopwords.words('english') + ['rt', 'https', 'co', 'u', 'go']

def clean_tweet(tweet):
    tweet = tweet.lower()
    filteredList = []
    global stopword
    tweetList = re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split()
    for i in tweetList:
        if not i in stopword:
            filteredList.append(i)
    return ' '.join(filteredList)

In [105]:
scores = []
status = []
sub = []
fullText = []
for tweet in tweetdata['tweets']:
    analysis = TextBlob(clean_tweet(tweet))
    fullText.extend(analysis.words)
    value = analysis.sentiment.polarity
    subject = analysis.sentiment.subjectivity
    if value > 0:
        sent = 'positive'
    elif value == 0:
        sent = 'neutral'
    else:
        sent = 'negative'
    scores.append(value)
    status.append(sent)
    sub.append(subject)

In [106]:
tweetdata['sentimental_score'] = scores
tweetdata['sentiment_status'] = status
tweetdata['subjectivity'] = sub
tweetdata.drop(tweetdata.columns[2:5], axis=1, inplace=True)

In [107]:
tweetdata.head()
Out[107]:
   tweets                                             Created at           sentimental_score  sentiment_status  subjectivity
0  Presidential Harassment!                           2019-06-26 02:34:41  0.000000  neutral   0.000000
1  Senator Thom Tillis of North Carolina has real...  2019-06-25 22:20:42  0.081481  positive  0.588889
2  Staff Sgt. David Bellavia - today, we honor yo...  2019-06-25 21:38:42  0.333333  positive  1.000000
3  Today, it was my great honor to present the Me...  2019-06-25 20:27:19  0.400000  positive  0.375000
4  ....Martha is strong on Crime and Borders, the...  2019-06-25 19:25:20  0.086667  positive  0.396667

For a better understanding of the obtained results, let's do some visualization.
In [109]:
positive = len(tweetdata[tweetdata['sentiment_status'] == 'positive'])
negative = len(tweetdata[tweetdata['sentiment_status'] == 'negative'])
neutral = len(tweetdata[tweetdata['sentiment_status'] == 'neutral'])

In [110]:
fig, ax = plt.subplots(figsize=(10, 5))
index = range(3)
plt.bar(index[2], positive, color='green', edgecolor='black', width=0.8)
plt.bar(index[0], negative, color='orange', edgecolor='black', width=0.8)
plt.bar(index[1], neutral, color='grey', edgecolor='black', width=0.8)
plt.legend(['Positive', 'Negative', 'Neutral'])
plt.xlabel('Sentiment Status', fontdict={'size': 15})
plt.ylabel('Sentimental Frequency', fontdict={'size': 15})
plt.title("Donald Trump's Twitter sentiment status", fontsize=20)
Out[110]:
Text(0.5, 1.0, "Donald Trump's Twitter sentiment status")

Conclusion
Sentiment analysis is a great way to explore the emotions and opinions circulating in society. We created a basic sentiment classifier that can be used for analyzing textual data from social networks. The lexicon-based approach also lets you build your own lexicon dictionaries, so you can fine-tune the sentiment scoring to your task, your textual data, and the goal of the analysis.
Introduction
Exploratory data analysis (EDA) is an approach to data analysis that summarizes the main characteristics of a dataset. It can be performed using various methods, among which data visualization takes a prominent place. The idea of EDA is to discover what information the data can give us beyond the formal modeling or hypothesis-testing task. In other words, if we initially have no (or not enough) a priori ideas about the patterns and relationships within the data, exploratory data analysis helps us identify the main tendencies, properties, and nature of the information. Based on what is found, the researcher can evaluate the structure and nature of the available data, which eases the search for and identification of the questions and purpose of the data exploration. So, EDA is a crucial step before feature engineering, and it can include part of the data preprocessing. In this tutorial, we will show you how to perform simple EDA using the Google Play Store Apps Data Set. To begin with, let's install and load all the necessary libraries that we will need.

# Remove warnings
options(warn=-1)
# Load libraries
require(ggplot2)
require(highcharter)
require(dplyr)
require(tidyverse)
require(corrplot)
require(RColorBrewer)
require(xts)
require(treemap)
require(lubridate)

Data overview
Play Store app insights can tell developers a lot about the Android market. Each row of the dataset has values for the category, rating, size, and other app characteristics. Here are the columns of our dataset:
App - name of the application.
Category - category of the app.
Rating - application's rating on the Play Store.
Reviews - number of the app's reviews.
Size - size of the app.
Installs - number of installs of the app.
Type - whether the app is free or paid.
Price - price of the app (0 if free).
Content Rating - target audience of the app.
Genres - genre the app belongs to.
Last Updated - date the app was last updated.
Current Ver - current version of the application.
Android Ver - minimum Android version required to run the app.

Now, let's load the data and view the first rows. For that, we use the head() function:

df <- read.csv("googleplaystore.csv", na.strings = c("NaN","NA",""))
head(df)
##                                                   App       Category
## 1      Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN
## 2                                 Coloring book moana ART_AND_DESIGN
## 3  U Launcher Lite – FREE Live Cool Themes, Hide Apps ART_AND_DESIGN
## 4                               Sketch - Draw & Paint ART_AND_DESIGN
## 5               Pixel Draw - Number Art Coloring Book ART_AND_DESIGN
## 6                          Paper flowers instructions ART_AND_DESIGN
##   Rating Reviews Size    Installs Type Price Content.Rating
## 1    4.1     159  19M     10,000+ Free     0       Everyone
## 2    3.9     967  14M    500,000+ Free     0       Everyone
## 3    4.7   87510 8.7M  5,000,000+ Free     0       Everyone
## 4    4.5  215644  25M 50,000,000+ Free     0           Teen
## 5    4.3     967 2.8M    100,000+ Free     0       Everyone
## 6    4.4     167 5.6M     50,000+ Free     0       Everyone
##                      Genres     Last.Updated        Current.Ver
## 1              Art & Design  January 7, 2018              1.0.0
## 2 Art & Design;Pretend Play January 15, 2018              2.0.0
## 3              Art & Design   August 1, 2018              1.2.4
## 4              Art & Design     June 8, 2018 Varies with device
## 5   Art & Design;Creativity    June 20, 2018                1.1
## 6              Art & Design   March 26, 2017                1.0
##    Android.Ver
## 1 4.0.3 and up
## 2 4.0.3 and up
## 3 4.0.3 and up
## 4   4.2 and up
## 5   4.4 and up
## 6   2.3 and up

It's useful to see the data format before performing analysis. We can also review the columns' types using the str() function:

str(df)
## 'data.frame': 10841 obs. of 13 variables:
## $ App           : Factor w/ 9660 levels "- Free Comics - Comic Apps",..: 7229 2563 8998 8113 7294 7125 8171 5589 4948 5826 ...
## $ Category      : Factor w/ 34 levels "1.9","ART_AND_DESIGN",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Rating        : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
## $ Reviews       : Factor w/ 6002 levels "0","1","10","100",..: 1183 5924 5681 1947 5924 1310 1464 3385 816 485 ...
## $ Size          : Factor w/ 462 levels "1,000+","1.0M",..: 55 30 368 102 64 222 55 118 146 120 ...
## $ Installs      : Factor w/ 22 levels "0","0+","1,000,000,000+",..: 8 20 13 16 11 17 17 4 4 8 ...
## $ Type          : Factor w/ 3 levels "0","Free","Paid": 2 2 2 2 2 2 2 2 2 2 ...
## $ Price         : Factor w/ 93 levels "$0.99","$1.00",..: 92 92 92 92 92 92 92 92 92 92 ...
## $ Content.Rating: Factor w/ 6 levels "Adults only 18+",..: 2 2 2 5 2 2 2 2 2 2 ...
## $ Genres        : Factor w/ 120 levels "Action","Action;Action & Adventure",..: 10 13 10 10 12 10 10 10 10 12 ...
## $ Last.Updated  : Factor w/ 1378 levels "1.0.19","April 1, 2016",..: 562 482 117 825 757 901 76 726 1317 670 ...
## $ Current.Ver   : Factor w/ 2832 levels "0.0.0.2","0.0.1",..: 120 1019 465 2825 278 114 278 2392 1456 1430 ...
## $ Android.Ver   : Factor w/ 33 levels "1.0 and up","1.5 and up",..: 16 16 16 19 21 9 16 19 11 16 ...

As you can see, we get similar information as with head(), but here the focus is on data types rather than content. Now we will use a function that produces summaries of the results of various model-fitting functions:

summary(df)
## App
## ROBLOX                                           :     9
## CBS Sports App - Scores, News, Stats & Watch Live:     8
## 8 Ball Pool                                      :     7
## Candy Crush Saga                                 :     7
## Duolingo: Learn Languages Free                   :     7
## ESPN                                             :     7
## (Other)                                          : 10796
##    Category        Rating            Reviews
## FAMILY      :1972  Min.   : 1.000   0      :  596
## GAME        :1144  1st Qu.: 4.000   1      :  272
## TOOLS       : 843  Median : 4.300   2      :  214
## MEDICAL     : 463  Mean   : 4.193   3      :  175
## BUSINESS    : 460  3rd Qu.: 4.500   4      :  137
## PRODUCTIVITY: 424  Max.   :19.000   5      :  108
## (Other)     :5535  NA's   :1474     (Other): 9339
##        Size              Installs      Type         Price
## Varies with device:1695  1,000,000+ :1579  0   :    1  0      :10040
## 11M               : 198  10,000,000+:1252  Free:10039  $0.99  :  148
## 12M               : 196  100,000+   :1169  Paid:  800  $2.99  :  129
## 14M               : 194  10,000+    :1054  NA's:    1  $1.99  :   73
## 13M               : 191  1,000+     : 907              $4.99  :   72
## 15M               : 184  5,000,000+ : 752              $3.99  :   63
## (Other)           :8183  (Other)    :4128              (Other):  316
##    Content.Rating        Genres          Last.Updated
## Adults only 18+:   3  Tools        : 842  August 3, 2018: 326
## Everyone       :8714  Entertainment: 623  August 2, 2018: 304
## Everyone 10+   : 414  Education    : 549  July 31, 2018 : 294
## Mature 17+     : 499  Medical      : 463  August 1, 2018: 285
## Teen           :1208  Business     : 460  July 30, 2018 : 211
## Unrated        :   2  Productivity : 424  July 25, 2018 : 164
## NA's           :   1  (Other)      :7480  (Other)       :9257
##            Current.Ver            Android.Ver
## Varies with device:1459  4.1 and up        :2451
## 1.0               : 809  4.0.3 and up      :1501
## 1.1               : 264  4.0 and up        :1375
## 1.2               : 178  Varies with device:1362
## 2.0               : 151  4.4 and up        : 980
## (Other)           :7972  (Other)           :3169
## NA's              :   8  NA's              :   3

NA analysis
After getting acquainted with the dataset, we should check it for NAs and duplicates. Detecting and removing such records helps build a model with better accuracy. First, let's analyze the missing values. We can review the result as a table:

sapply(df, function(x) sum(is.na(x)))
##            App       Category         Rating        Reviews           Size
##              0              0           1474              0              0
##       Installs           Type          Price Content.Rating         Genres
##              0              1              0              1              0
##   Last.Updated    Current.Ver    Android.Ver
##              0              8              3

Or as a chart:

(Chart: "Columns with NA values" - Rating, Current.Ver and Android.Ver are the only columns with NAs, with Rating far ahead at about 1,474.)

As you can see, three columns contain missing values, and the Rating column has by far the most. Let's remove such records:

df = na.omit(df)

Duplicate records removal
The next step is to check whether there are duplicates. We can check the difference between the total and unique numbers of rows.
distinct <- nrow(df %>% distinct())
nrow(df) - distinct
## [1] 474
After detecting the duplicates, we need to remove them:
df <- df[!duplicated(df), ]
Once the data is cleaned, we can begin further visual analysis.
Analysis using visualization tools
To start off, we will review the Category column and examine which categories are the most and the least popular:
df %>%
  count(Category, Installs) %>%
  group_by(Category) %>%
  summarize(TotalInstalls = sum(as.numeric(Installs))) %>%
  arrange(-TotalInstalls) %>%
  hchart('scatter', hcaes(x = "Category", y = "TotalInstalls", size = "TotalInstalls", color = "Category")) %>%
  hc_add_theme(hc_theme_538()) %>%
  hc_title(text = "Most popular categories (# of installs)")
[Scatter chart: "Most popular categories (# of installs)" — Category vs. TotalInstalls]
Here we can see that Game is the most popular category by installs. Interestingly, Education has almost the lowest popularity, and Comics also sits near the bottom of the ranking. Now we want to see the percentage of apps in each category. The pie chart is not the most widespread type of visual, but when you need to show percentages, it is one of the best options. Let's count the apps in each category and expand our color palette.
freq <- table(df$Category)
fr <- as.data.frame(freq)
fr <- fr %>% arrange(desc(Freq))
coul <- brewer.pal(12, "Paired")
# We can add more tones to this palette:
coul <- colorRampPalette(coul)(15)
op <- par(cex = 0.5)
pielabels <- sprintf("%s = %3.1f%s", fr$Var1, 100 * fr$Freq / sum(fr$Freq), "%")
pie(fr$Freq, labels = NA, clockwise = TRUE, col = coul, border = "black", radius = 0.5, cex = 1)
legend("right", legend = pielabels, bty = "n", fill = coul)
[Pie chart: percentage of apps per category]
We can see that Family becomes the leader among the categories here. Education also has a higher percentage than Comics. Now, let's look closer at the prices of the apps and review how many free apps are available in the Play Store.
tmp <- df %>%
  count(Type) %>%
  mutate(perc = round((n / sum(n)) * 100)) %>%
  arrange(desc(perc))
hciconarray(tmp$Type, tmp$perc, size = 5) %>%
  hc_title(text = "Percentage of paid vs. free apps")
[Icon array: "Percentage of paid vs. free apps" — Free vs. Paid]
As you can see, 93% of the apps are free. Let's see the median price in each category:
df %>%
  filter(Type == "Paid") %>%
  group_by(Category) %>%
  summarize(Price = median(as.numeric(Price))) %>%
  arrange(-Price) %>%
  hchart('treemap', hcaes(x = 'Category', value = 'Price', color = 'Price')) %>%
  hc_add_theme(hc_theme_elementary()) %>%
  hc_title(text = "Median price per category") %>%
  hc_legend(align = "left", verticalAlign = "top", layout = "vertical", x = 0, y = 100)
[Treemap: "Median price per category"]
This chart is a treemap. In general, it is used to display a data hierarchy and to summarize data along two dimensions at once (tile size and color). Here we can see that the Parenting category has the highest median price, while Personalization and Social have the lowest. Now we will build a correlation heatmap, first performing some data preprocessing.
df <- df %>%
  mutate(
    Installs = gsub("\\+", "", as.character(Installs)),
    Installs = as.numeric(gsub(",", "", Installs)),
    Size = gsub("M", "", Size),
    Size = ifelse(grepl("k", Size), 0, as.numeric(Size)),
    Rating = as.numeric(Rating),
    Reviews = as.numeric(Reviews),
    Price = as.numeric(gsub("\\$", "", as.character(Price)))
  ) %>%
  filter(Type %in% c("Free", "Paid"))
extract <- c("Rating", "Reviews", "Size", "Installs", "Price")
df.extract <- df[extract]
df.extract %>% filter(is.nan(df.extract$Reviews)) %>% filter(is.na(df.extract$Size))
## [1] Rating Reviews Size Installs Price
## <0 rows> (or 0-length row.names)
df.extract <- na.omit(df.extract)
cor_matrix <- cor(df.extract)
corrplot(cor_matrix, method = "color", order = "AOE", addCoef.col = "grey")
[Correlation heatmap of Rating, Reviews, Size, Installs, and Price]
Unfortunately, there is no strong correlation between the columns. Next, let's look at the number of installs by content rating.
tmp <- df %>%
  group_by(Content.Rating) %>%
  summarize(Total.Installs = sum(Installs)) %>%
  arrange(-Total.Installs)
highchart() %>%
  hc_chart(type = "funnel") %>%
  hc_add_series_labels_values(labels = tmp$Content.Rating, values = tmp$Total.Installs) %>%
  hc_title(text = "Number of Installs by Content Rating") %>%
  hc_add_theme(hc_theme_elementary())
[Funnel chart: "Number of Installs by Content Rating" — Everyone, Teen, Everyone 10+, Mature 17+, Adults only 18+, Unrated]
As you might have guessed, apps rated for teens account for a large share of Play Store installs. You may notice the hc_add_theme line in the code: it adds a theme to your chart. Highcharter offers an extensive list of themes, and you can choose one via this link.
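For readers who work in Python rather than R, the same installs-by-content-rating aggregation can be sketched with pandas. This is only an illustrative sketch: the tiny DataFrame below is a hypothetical stand-in for the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned Play Store data
df = pd.DataFrame({
    "Content.Rating": ["Everyone", "Teen", "Everyone", "Mature 17+", "Teen"],
    "Installs": [1_000_000, 500_000, 250_000, 100_000, 200_000],
})

# Equivalent of group_by(Content.Rating) %>% summarize(sum(Installs)) %>% arrange(-Total.Installs)
tmp = (
    df.groupby("Content.Rating", as_index=False)["Installs"]
      .sum()
      .sort_values("Installs", ascending=False)
)
print(tmp)
```

The resulting frame can be fed directly into any of the plotting libraries discussed below.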
One of the most popular chart types is the time series, which we will explore last. We will also transform the date column type using the lubridate package.
# Get number of apps by last updated date
tmp <- df %>% count(Last.Updated)
# Transform date column type from text to date
tmp$Last.Updated <- mdy(tmp$Last.Updated)
# Transform data into a time series
time_series <- xts(tmp$n, order.by = tmp$Last.Updated)
highchart(type = "stock") %>%
  hc_title(text = "Last updated date") %>%
  hc_subtitle(text = "Number of applications by date of last update") %>%
  hc_add_series(time_series) %>%
  hc_add_theme(hc_theme_economist())
[Stock chart: "Last updated date" — number of applications by date of last update, May 21, 2010 to Aug 8, 2018]
Such a visualization is very convenient, as it offers zooming, a range slider, date filtering, and point hovering. Using this chart, we can see that the number of updates increases over time.
Conclusion
To sum up, exploratory data analysis is a powerful tool for the comprehensive analysis of a dataset. In general, we can divide EDA into the following stages: data overview, NA analysis, duplicate records analysis, and data exploration. Starting with a review of the data structure, columns, and contents, we move on to estimating and preparing our data for further analysis. Finally, visual data exploration helps us find dependencies, distributions, and more.
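The stages listed above translate directly to other tools as well. As a closing illustration, here is a short pandas sketch of the same pipeline (the miniature dataset is hypothetical, used only to show the step-by-step structure):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataset: one NA, one duplicate row
df = pd.DataFrame({
    "App":    ["A", "B", "B", "C"],
    "Rating": [4.5, 3.5, 3.5, np.nan],
})

print(df.dtypes)           # data overview, akin to str(df)
print(df.isna().sum())     # NA analysis, akin to sapply(df, function(x) sum(is.na(x)))
df = df.dropna()           # akin to na.omit(df)
df = df.drop_duplicates()  # akin to df[!duplicated(df), ]
print(df)                  # cleaned data, ready for visual exploration
```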
In the modern world, the flow of information that confronts a person can be overwhelming, and this has rather abruptly changed the basic principles of how data is perceived. Visualization is therefore becoming the main tool for presenting information. With its help, information is presented to the audience in a more accessible, clear, visual form. A properly chosen visualization method makes it possible to structure large data arrays, de-emphasize elements that are insignificant in content, and make information more comprehensible.
One of the most popular languages for data processing and analysis is Python, largely due to the speed with which its libraries are created and developed, granting practically unlimited possibilities for data processing. The same is true for its data visualization libraries. In this article, we will look at the basic data visualization tools used in the Python ecosystem.
Matplotlib
Matplotlib is perhaps the most widely known Python library for data visualization. While easy to use, it offers ample opportunity to fine-tune the way data is displayed.
[Figure: polar area chart]
The library provides the main visualization types, including scatter plots, line plots, histograms, bar plots, box plots, and more. It is worth noting that the library has fairly extensive documentation, which makes it comfortable to work with even for beginners in data processing and visualization.
[Figure: multicategorical plot]
One of the main advantages of this library is its well-thought-out hierarchical structure. The highest level is a functional interface called matplotlib.pyplot, which allows users to create complex infographics with just a couple of lines of code by choosing ready-made solutions from the functions it offers.
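A minimal sketch of that pyplot interface, plotting a histogram of app ratings (the data values and output file name here are hypothetical):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts and servers
import matplotlib.pyplot as plt

# Hypothetical sample of app ratings, like those explored in the EDA above
ratings = [4.1, 4.5, 4.3, 3.9, 4.7, 4.0, 4.4, 4.2, 4.6, 3.8]

fig, ax = plt.subplots(figsize=(6, 4))
counts, bins, patches = ax.hist(ratings, bins=5, color="steelblue", edgecolor="black")
ax.set_xlabel("Rating")
ax.set_ylabel("Number of apps")
ax.set_title("Distribution of app ratings")
fig.savefig("ratings_hist.png")
```

Note how each element of the figure (axes, labels, title, the histogram itself) is configured through a separate call — this is the fine-grained control the library is known for.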
[Figure: histogram]
The convenience of creating visualizations with Matplotlib comes not only from its many built-in plotting commands but also from its rich arsenal of options for configuring the standard forms. Settings include the ability to set arbitrary colors, shapes, line or marker types, line thickness, transparency level, font size and type, and so on.
Seaborn
Despite the wide popularity of the Matplotlib library, it has one drawback that can become critical for some users: its API is low-level, so creating truly complex infographics may require writing a lot of boilerplate code.
[Figure: hexbin plot]
Fortunately, this problem is addressed by the Seaborn library, a kind of high-level wrapper over Matplotlib. With its help, users can create colorful, specialized visualizations: heat maps, time series, violin plots, and much more.
[Figure: Seaborn heatmap]
Being highly customizable, Seaborn gives users ample opportunity to add unique and fancy looks to their charts quite simply and with little time cost.
ggplot
Users who have experience with R have probably heard of ggplot2, a powerful data visualization tool within the R programming environment. This package is recognized as one of the best tools for the graphical presentation of information. Fortunately, the extensive capabilities of this library are now available in the Python environment thanks to a port of the package, available there under the name ggplot.
[Figure: box plot]
As we mentioned earlier, the process of data visualization has a deep internal structure. In other words, creating a visualization is a clearly structured system, which largely shapes how you think while building infographics. And ggplot teaches the user to think in this structured way, so that in the process of consistently building up commands, the user automatically starts detecting patterns in the data.
[Figure: scatter plot]
Moreover, the library is very flexible: ggplot provides users with ample opportunities to customize how data is displayed and to preprocess datasets before they are rendered.
Bokeh
Despite the rich potential of the ggplot library, some users may miss interactivity. For those who need interactive data visualization, the Bokeh library was created.
[Figure: stacked area chart]
Bokeh is an open-source Python library with a JavaScript client side (BokehJS) that allows users to create flexible, powerful, and beautiful visualizations for web applications. With its help, users can create anything from simple bar charts to complex, highly detailed interactive visualizations without writing a single line of JavaScript. Please have a look at this gallery to get an idea of Bokeh's interactive features.
plotly
For those who need interactive diagrams, we also recommend checking out the plotly library. It is positioned primarily as an online platform on which users can create and publish their own visualizations. However, the library can also be used offline, without uploading the visualization to the plotly server.
[Figure: contour plot]
Because its developers position the library mostly as a standalone product, it is constantly being refined and expanded. As a result, it gives users nearly unlimited possibilities for data visualization, whether interactive graphics or contour plots. You can find examples of plotly and explore the library's features at the link below: https://plot.ly/python/
Conclusion
Over the past few years, the data visualization tools available to Python developers have made a significant leap forward. Many powerful packages have appeared and keep expanding, implementing quite complex ways of graphically representing information. This allows users not only to create various infographics but also to make them truly attractive and understandable to the audience.