A guide to Exploratory Data Analysis in Python

What is Exploratory Data Analysis

Exploratory data analysis (EDA) is a powerful approach to studying the available data comprehensively and answering basic data analysis questions.

Unlike traditional analysis, which tests a priori hypotheses, EDA uses a variety of methods to surface potential systematic relationships in the data. It is not tied to a fixed time frame or a fixed set of methods, which makes it possible to spot unusual data fragments and correlations. As a result, you can examine the information more deeply and accurately and choose a suitable model for further work.

Python offers a wide range of libraries that simplify and streamline the process of exploring a dataset. We will use the Google Play Store Apps dataset and go through the main tasks of exploratory analysis to find out whether there are any trends that can help in framing and solving a business problem.

Data overview

Before we start exploring the data, we must import the dataset and the Python libraries needed for further work. We will use the pandas library, a very powerful tool for comprehensive data analysis.

In [1]:
import pandas as pd
In [2]:
googleplaystore = pd.read_csv("googleplaystore.csv")

Let's explore the structure of our dataframe by viewing the first and the last 10 rows.

In [3]:
googleplaystore.head(10)
  | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver
0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up
1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up
2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up
3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up
4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up
5 | Paper flowers instructions | ART_AND_DESIGN | 4.4 | 167 | 5.6M | 50,000+ | Free | 0 | Everyone | Art & Design | March 26, 2017 | 1.0 | 2.3 and up
6 | Smoke Effect Photo Maker - Smoke Editor | ART_AND_DESIGN | 3.8 | 178 | 19M | 50,000+ | Free | 0 | Everyone | Art & Design | April 26, 2018 | 1.1 | 4.0.3 and up
7 | Infinite Painter | ART_AND_DESIGN | 4.1 | 36815 | 29M | 1,000,000+ | Free | 0 | Everyone | Art & Design | June 14, 2018 |  | 4.2 and up
8 | Garden Coloring Book | ART_AND_DESIGN | 4.4 | 13791 | 33M | 1,000,000+ | Free | 0 | Everyone | Art & Design | September 20, 2017 | 2.9.2 | 3.0 and up
9 | Kids Paint Free - Drawing Fun | ART_AND_DESIGN | 4.7 | 121 | 3.1M | 10,000+ | Free | 0 | Everyone | Art & Design;Creativity | July 3, 2018 | 2.8 | 4.0.3 and up
In [4]:
googleplaystore.tail(10)
      | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver
10831 |  | MAPS_AND_NAVIGATION | NaN | 38 | 9.8M | 5,000+ | Free | 0 | Everyone | Maps & Navigation | June 13, 2018 |  | 4.0 and up
10832 | FR Tides | WEATHER | 3.8 | 1195 | 582k | 100,000+ | Free | 0 | Everyone | Weather | February 16, 2014 | 6.0 | 2.1 and up
10833 | Chemin (fr) | BOOKS_AND_REFERENCE | 4.8 | 44 | 619k | 1,000+ | Free | 0 | Everyone | Books & Reference | March 23, 2014 | 0.8 | 2.2 and up
10834 | FR Calculator | FAMILY | 4.0 | 7 | 2.6M | 500+ | Free | 0 | Everyone | Education | June 18, 2017 | 1.0.0 | 4.1 and up
10835 | FR Forms | BUSINESS | NaN | 0 | 9.6M | 10+ | Free | 0 | Everyone | Business | September 29, 2016 | 1.1.5 | 4.0 and up
10836 | Sya9a Maroc - FR | FAMILY | 4.5 | 38 | 53M | 5,000+ | Free | 0 | Everyone | Education | July 25, 2017 | 1.48 | 4.1 and up
10837 | Fr. Mike Schmitz Audio Teachings | FAMILY | 5.0 | 4 | 3.6M | 100+ | Free | 0 | Everyone | Education | July 6, 2018 | 1.0 | 4.1 and up
10838 | Parkinson Exercices FR | MEDICAL | NaN | 3 | 9.5M | 1,000+ | Free | 0 | Everyone | Medical | January 20, 2017 | 1.0 | 2.2 and up
10839 | The SCP Foundation DB fr nn5n | BOOKS_AND_REFERENCE | 4.5 | 114 | Varies with device | 1,000+ | Free | 0 | Mature 17+ | Books & Reference | January 19, 2015 | Varies with device | Varies with device
10840 | iHoroscope - 2018 Daily Horoscope & Astrology | LIFESTYLE | 4.5 | 398307 | 19M | 10,000,000+ | Free | 0 | Everyone | Lifestyle | July 25, 2018 | Varies with device | Varies with device

We can already see that the googleplaystore dataframe has missing values. For a fuller picture of the data, let's do a few more things. First, we will use the pandas describe() method to get a statistical summary of the numeric columns in our dataset. We can also use the info() method to check the data type and the number of non-null values in each column, and the shape attribute (not a method, so it is used without parentheses) to retrieve the number of rows and columns in the dataframe.

In [5]:
googleplaystore.describe()
count 9367.000000
mean 4.193338
std 0.537431
min 1.000000
25% 4.000000
50% 4.300000
75% 4.500000
max 19.000000
In [6]:
googleplaystore.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
App               10841 non-null object
Category          10841 non-null object
Rating            9367 non-null float64
Reviews           10841 non-null object
Size              10841 non-null object
Installs          10841 non-null object
Type              10840 non-null object
Price             10841 non-null object
Content Rating    10840 non-null object
Genres            10841 non-null object
Last Updated      10841 non-null object
Current Ver       10833 non-null object
Android Ver       10838 non-null object
dtypes: float64(1), object(12)
memory usage: 1.1+ MB
In [7]:
googleplaystore.shape
(10841, 13)
In [8]:
googleplaystore.dtypes
App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

So, what have we learned from these few steps? First, we have a number of apps divided into various categories. Second, although columns such as "Reviews" contain numeric data, they are stored with a non-numeric type (object), which can cause problems during further data processing.
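
A quick way to confirm which text columns will not convert cleanly is to attempt the conversion and count the failures. The sketch below uses a small made-up frame (the sample values and the column list are assumptions, not rows from the dataset) and the pandas to_numeric() function with errors="coerce", which turns unparseable values into NaN.

```python
import pandas as pd

# Made-up sample that mimics columns read in as text.
sample = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Reviews": ["159", "967", "3.0M"],  # one value is not a plain number
    "Rating": [4.1, 3.9, 4.7],
})

# For each text column, count the values that cannot be parsed as numbers.
unparseable = {}
for col in ["App", "Reviews"]:
    converted = pd.to_numeric(sample[col], errors="coerce")
    unparseable[col] = int(converted.isna().sum())

print(unparseable)
```

A column where only a handful of values fail to parse, like "Reviews" here, is usually a numeric column with a few dirty entries rather than a genuinely textual one.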

We are also interested in the total number of apps and the available categories in the dataset. To get the exact number of apps, we will find all the unique values in the corresponding column.

In [9]:
In [10]:
unique_categories = googleplaystore["Category"].unique()
In [11]:
       '1.9'], dtype=object)
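
As a minimal sketch of counting unique values, assuming a toy frame with the same "App" and "Category" column names (the rows are invented), nunique() gives the count directly and value_counts() shows how many rows fall into each category:

```python
import pandas as pd

# Invented rows; only the column names match the dataset.
apps = pd.DataFrame({
    "App": ["A", "B", "B", "C"],
    "Category": ["GAME", "TOOLS", "TOOLS", "GAME"],
})

n_apps = apps["App"].nunique()             # number of distinct app names
n_categories = apps["Category"].nunique()  # number of distinct categories
per_category = apps["Category"].value_counts()  # rows per category

print(n_apps, n_categories)
print(per_category)
```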

Duplicate records removal

Datasets often contain duplicate records, which can reduce the quality and accuracy of the exploration. Such records also clutter the dataset, so we need to get rid of them.

In [14]:
googleplaystore.drop_duplicates(keep='first', inplace = True)
In [15]:
(10358, 13)

To remove duplicate rows from a dataset, pandas has a powerful and customizable method, drop_duplicates(), which takes parameters that control how the dataset is cleaned. "keep='first'" means that, for each set of duplicate rows, the method keeps the first occurrence and drops the rest (by contrast, "keep=False" would drop every duplicated row). "inplace = True" means that the changes are applied directly to the dataframe we are currently using.
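
To make the effect of the keep parameter concrete, here is a small sketch on an invented three-row frame (not taken from the dataset) that compares the three accepted values:

```python
import pandas as pd

# Rows 0 and 1 are exact duplicates of each other.
df = pd.DataFrame({"App": ["A", "A", "B"], "Rating": [4.1, 4.1, 3.9]})

kept_first = df.drop_duplicates(keep="first")  # keeps row 0, drops row 1
kept_last = df.drop_duplicates(keep="last")    # keeps row 1, drops row 0
kept_none = df.drop_duplicates(keep=False)     # drops both duplicated rows

print(list(kept_first.index), list(kept_last.index), list(kept_none.index))
```

Without inplace=True, each call returns a new dataframe and leaves df untouched, which is convenient when comparing options like this.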

As we saw earlier, our initial googleplaystore dataset contained 10841 rows. After removing duplicates, the number of rows decreased to 10358.

NA analysis

Another common problem in almost every dataset is columns with missing values. We will explore only the most common ways to clean missing values from a dataset.

First, let's look at the total number of missing values in every column of the dataset. One of the great things about pandas is that it allows users to chain several operations in a single expression, which keeps the code compact.

In [14]:
Rating            1465
Current Ver          8
Android Ver          3
Content Rating       1
Type                 1
Last Updated         0
Genres               0
Price                0
Installs             0
Size                 0
Reviews              0
Category             0
App                  0
dtype: int64
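
One way to produce a summary like this in a single chained expression is sketched below. The exact cell from the notebook is not shown here, so treat this as an assumption about how such a count can be obtained; the frame is invented.

```python
import pandas as pd

# Invented frame with a few gaps.
df = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.1, None, None],
    "Type": ["Free", None, "Paid"],
})

# isnull() flags missing cells, sum() counts them per column,
# sort_values() puts the most affected columns first.
missing = df.isnull().sum().sort_values(ascending=False)
print(missing)
```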

Now, let's get rid of all the rows with missing values. Although some statistical approaches allow us to impute missing data (for example, with the most common value or the mean), today we will work only with cleaned data.

The pandas dropna() method also lets users set parameters that control how the data is processed. Here we specify that every row containing any NA value must be dropped and that the changes are applied directly to our dataframe.

In [16]:
googleplaystore.dropna(how ='any', inplace = True)
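
For comparison, the sketch below contrasts dropping rows with the imputation alternative mentioned earlier. It runs on an invented frame, and filling "Rating" with the column mean is only one possible choice, not what this tutorial does.

```python
import pandas as pd

df = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.0, None, 5.0],
    "Type": ["Free", "Paid", None],
})

# Drop a row if any of its values is missing (the approach used in this post).
dropped_any = df.dropna(how="any")

# Drop a row only if a specific column is missing.
dropped_rating = df.dropna(subset=["Rating"])

# Alternative: keep all rows and fill missing ratings with the mean rating.
imputed = df.assign(Rating=df["Rating"].fillna(df["Rating"].mean()))

print(len(dropped_any), len(dropped_rating), imputed["Rating"].tolist())
```

Dropping is simpler and avoids inventing values, but it discards rows that may still carry useful information in their other columns.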

Let's now check the shape of the dataframe after all cleaning manipulations were performed.

In [17]:
(8886, 13)

If we look closer at our dataset and at the dtypes output, we can see that columns such as "Reviews", "Size", "Price" and "Installs" should definitely hold numeric values. So, let's see what values these columns contain in order to plan our further manipulations.

In [18]:
googleplaystore['Price'].unique()
array(['0', '$4.99', '$3.99', '$6.99', '$7.99', '$5.99', '$2.99', '$3.49',
       '$1.99', '$9.99', '$7.49', '$0.99', '$9.00', '$5.49', '$10.00',
       '$24.99', '$11.99', '$79.99', '$16.99', '$14.99', '$29.99',
       '$12.99', '$2.49', '$10.99', '$1.50', '$19.99', '$15.99', '$33.99',
       '$39.99', '$3.95', '$4.49', '$1.70', '$8.99', '$1.49', '$3.88',
       '$399.99', '$17.99', '$400.00', '$3.02', '$1.76', '$4.84', '$4.77',
       '$1.61', '$2.50', '$1.59', '$6.49', '$1.29', '$299.99', '$379.99',
       '$37.99', '$18.99', '$389.99', '$8.49', '$1.75', '$14.00', '$2.00',
       '$3.08', '$2.59', '$19.40', '$3.90', '$4.59', '$15.46', '$3.04',
       '$13.99', '$4.29', '$3.28', '$4.60', '$1.00', '$2.95', '$2.90',
       '$1.97', '$2.56', '$1.20'], dtype=object)
In [19]:
googleplaystore['Installs'].unique()
array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',
       '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
       '1,000,000,000+', '1,000+', '500,000,000+', '100+', '500+', '10+',
       '5+', '50+', '1+'], dtype=object)
In [20]:
googleplaystore['Size'].unique()
array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
       '28M', '12M', '20M', '21M', '37M', '5.5M', '17M', '39M', '31M',
       '4.2M', '23M', '6.0M', '6.1M', '4.6M', '9.2M', '5.2M', '11M',
       '24M', 'Varies with device', '9.4M', '15M', '10M', '1.2M', '26M',
       '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k', '3.6M', '5.7M',
       '8.6M', '2.4M', '27M', '2.7M', '2.5M', '7.0M', '16M', '3.4M',
       '8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M',
       '2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M',
       '7.1M', '22M', '6.4M', '3.2M', '8.2M', '4.9M', '9.5M', '5.0M',
       '5.9M', '13M', '73M', '6.8M', '3.5M', '4.0M', '2.3M', '2.1M',
       '42M', '9.1M', '55M', '23k', '7.3M', '6.5M', '1.5M', '7.5M', '51M',
       '41M', '48M', '8.5M', '46M', '8.3M', '4.3M', '4.7M', '3.3M', '40M',
       '7.8M', '8.8M', '6.6M', '5.1M', '61M', '66M', '79k', '8.4M',
       '3.7M', '118k', '44M', '695k', '1.6M', '6.2M', '53M', '1.4M',
       '3.0M', '7.2M', '5.8M', '3.8M', '9.6M', '45M', '63M', '49M', '77M',
       '4.4M', '70M', '9.3M', '8.1M', '36M', '6.9M', '7.4M', '84M', '97M',
       '2.0M', '1.9M', '1.8M', '5.3M', '47M', '556k', '526k', '76M',
       '7.6M', '59M', '9.7M', '78M', '72M', '43M', '7.7M', '6.3M', '334k',
       '93M', '65M', '79M', '100M', '58M', '50M', '68M', '64M', '34M',
       '67M', '60M', '94M', '9.9M', '232k', '99M', '624k', '95M', '8.5k',
       '41k', '292k', '80M', '1.7M', '10.0M', '74M', '62M', '69M', '75M',
       '98M', '85M', '82M', '96M', '87M', '71M', '86M', '91M', '81M',
       '92M', '83M', '88M', '704k', '862k', '899k', '378k', '4.8M',
       '266k', '375k', '1.3M', '975k', '980k', '4.1M', '89M', '696k',
       '544k', '525k', '920k', '779k', '853k', '720k', '713k', '772k',
       '318k', '58k', '241k', '196k', '857k', '51k', '953k', '865k',
       '251k', '930k', '540k', '313k', '746k', '203k', '26k', '314k',
       '239k', '371k', '220k', '730k', '756k', '91k', '293k', '17k',
       '74k', '14k', '317k', '78k', '924k', '818k', '81k', '939k', '169k',
       '45k', '965k', '90M', '545k', '61k', '283k', '655k', '714k', '93k',
       '872k', '121k', '322k', '976k', '206k', '954k', '444k', '717k',
       '210k', '609k', '308k', '306k', '175k', '350k', '383k', '454k',
       '1.0M', '70k', '812k', '442k', '842k', '417k', '412k', '459k',
       '478k', '335k', '782k', '721k', '430k', '429k', '192k', '460k',
       '728k', '496k', '816k', '414k', '506k', '887k', '613k', '778k',
       '683k', '592k', '186k', '840k', '647k', '373k', '437k', '598k',
       '716k', '585k', '982k', '219k', '55k', '323k', '691k', '511k',
       '951k', '963k', '25k', '554k', '351k', '27k', '82k', '208k',
       '551k', '29k', '103k', '116k', '153k', '209k', '499k', '173k',
       '597k', '809k', '122k', '411k', '400k', '801k', '787k', '50k',
       '643k', '986k', '516k', '837k', '780k', '20k', '498k', '600k',
       '656k', '221k', '228k', '176k', '34k', '259k', '164k', '458k',
       '629k', '28k', '288k', '775k', '785k', '636k', '916k', '994k',
       '309k', '485k', '914k', '903k', '608k', '500k', '54k', '562k',
       '847k', '948k', '811k', '270k', '48k', '523k', '784k', '280k',
       '24k', '892k', '154k', '18k', '33k', '860k', '364k', '387k',
       '626k', '161k', '879k', '39k', '170k', '141k', '160k', '144k',
       '143k', '190k', '376k', '193k', '473k', '246k', '73k', '253k',
       '957k', '420k', '72k', '404k', '470k', '226k', '240k', '89k',
       '234k', '257k', '861k', '467k', '676k', '552k', '582k', '619k'],
      dtype=object)

First of all, let's get rid of the dollar sign in the "Price" column and convert the values to a numeric type.

In [21]:
googleplaystore['Price'] = googleplaystore['Price'].apply(lambda x: x.replace('$', '') if '$' in str(x) else x)
googleplaystore['Price'] = googleplaystore['Price'].apply(lambda x: float(x))

Now, we will work with the "Installs" column. We must get rid of the plus sign and the thousands separators and convert the values to a numeric type.

In [22]:
googleplaystore['Installs'] = googleplaystore['Installs'].apply(lambda x: x.replace('+', '') if '+' in str(x) else x)
googleplaystore['Installs'] = googleplaystore['Installs'].apply(lambda x: x.replace(',', '') if ',' in str(x) else x)
googleplaystore['Installs'] = googleplaystore['Installs'].apply(lambda x: int(x))
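
The apply() calls above work element by element. As an alternative sketch (not the approach used in this post), the same "Price" and "Installs" cleanup can be written with the vectorized .str accessor and to_numeric(). The sample strings are copied from the unique values listed earlier; the frame itself is invented.

```python
import pandas as pd

raw = pd.DataFrame({
    "Price": ["0", "$4.99", "$399.99"],
    "Installs": ["10,000+", "5+", "1,000,000,000+"],
})

# Remove the literal "$" and parse as floating-point numbers.
price = pd.to_numeric(raw["Price"].str.replace("$", "", regex=False))

# Remove every "+" and "," with one regular expression, then parse as integers.
installs = pd.to_numeric(raw["Installs"].str.replace("[+,]", "", regex=True))

print(price.tolist(), installs.tolist())
```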

Also, let's convert the "Reviews" column to a numeric type.

In [23]:
googleplaystore['Reviews'] = googleplaystore['Reviews'].apply(lambda x: int(x))

Finally, let's work with the "Size" column, as it needs a more complex approach. This column contains several kinds of values: numeric sizes given in either megabytes (M) or kilobytes (k), as well as the string "Varies with device". We therefore need to handle the non-numeric strings and bring the megabyte and kilobyte values to a common unit.

In [24]:
googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: str(x).replace('Varies with device', 'NaN') if 'Varies with device' in str(x) else x)
googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: str(x).replace('M', '') if 'M' in str(x) else x)
googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: str(x).replace(',', '') if ',' in str(x) else x)
googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: float(str(x).replace('k', '')) / 1000 if 'k' in str(x) else x)
googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: float(x))
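
The same chain of steps can also be collected into one small helper, which makes the unit handling easier to read and test. This is a sketch only: size_to_mb is a hypothetical name, and it uses the same factor of 1000 between kilobytes and megabytes as the cell above.

```python
import math

def size_to_mb(value):
    """Convert strings such as '19M' or '582k' to megabytes.

    Anything that is not a number followed by 'M' or 'k'
    (for example 'Varies with device') becomes NaN.
    """
    text = str(value).strip().replace(",", "")
    try:
        if text.endswith("M"):
            return float(text[:-1])
        if text.endswith("k"):
            return float(text[:-1]) / 1000
    except ValueError:
        pass  # the part before the unit was not a number
    return float("nan")

print(size_to_mb("19M"), size_to_mb("582k"), size_to_mb("Varies with device"))
```

With a helper like this, the column could be converted in one step with googleplaystore['Size'].apply(size_to_mb).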

Let's call the describe() method one more time. As we can see, we now have a statistical summary for all the columns that contain numeric values.

In [25]:
            Rating       Reviews         Size      Installs        Price
count  8886.000000  8.886000e+03  7418.000000  8.886000e+03  8886.000000
mean      4.187959  4.730928e+05    22.760829  1.650061e+07     0.963526
std       0.522428  2.906007e+06    23.439210  8.640413e+07    16.194792
min       1.000000  1.000000e+00     0.008500  1.000000e+00     0.000000
25%       4.000000  1.640000e+02     5.100000  1.000000e+04     0.000000
50%       4.300000  4.723000e+03    14.000000  5.000000e+05     0.000000
75%       4.500000  7.131325e+04    33.000000  5.000000e+06     0.000000
max       5.000000  7.815831e+07   100.000000  1.000000e+09   400.000000

Building visualizations

Visualization is probably one of the most useful approaches in data analysis. Not all correlations and dependencies can be seen in tabular data, so plots and diagrams can help to depict them clearly.

Let's go through the different ways we can explore categories.

Exploring which categories have the largest number of apps

One of the most eye-catching ways to visualize such data is a word cloud. With a few lines of code, we can create an illustration that shows which categories have the largest number of apps.

In [30]:
import matplotlib.pyplot as plt
import wordcloud
from wordcloud import WordCloud
import seaborn as sns
color = sns.color_palette()

%matplotlib inline
In [33]:
from plotly import tools
from plotly.offline import iplot, init_notebook_mode
from IPython.display import Image
import plotly.offline as py
import plotly.graph_objs as go
import plotly.io as pio
import numpy as np
In [34]:
wc = WordCloud(max_font_size=250,collocations=False, 
               height=800,background_color="white").generate(' '.join(googleplaystore['Category']))
plt.figure( figsize=(20,10))
plt.imshow(wc, interpolation="bilinear")

Exploring app ratings across top categories

In [35]:
groups = googleplaystore.groupby('Category').filter(lambda x: len(x) > 286).reset_index()
array = groups['Rating'].hist(by=groups['Category'], sharex=True, figsize=(20,20))

As we can see, the distribution of app ratings differs noticeably across the categories.
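
The groupby(...).filter(...) call in the cell above keeps only the rows that belong to sufficiently large categories. Here is a minimal sketch of that behavior on invented data, with a hypothetical threshold of 1 instead of 286:

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "TOOLS", "TOOLS", "ART"],
    "Rating": [4.0, 4.5, 3.5, 4.2, 4.4, 4.9],
})

min_apps = 1  # hypothetical threshold for this toy example

# filter() evaluates the function once per group and keeps all rows
# of the groups for which it returns True.
large = df.groupby("Category").filter(lambda g: len(g) > min_apps)

print(sorted(large["Category"].unique()))
```

Unlike an aggregation, filter() returns the original rows rather than one row per group, which is why the result can be passed straight to hist().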

Average Rating of all the Apps


And what insight do we get if we look at the ratings of all the apps together?

In [36]:
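avg_rate_data = go.Figure()
# Add a histogram of all ratings, using bins that are 0.1 wide
avg_rate_data.add_trace(go.Histogram(
        x = googleplaystore.Rating,
        xbins = {'start': 1, 'size': 0.1, 'end' :6}
))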

In [38]:
img_bytes = pio.to_image(avg_rate_data, format='png', width=1600, height=800, scale=2)
In [39]:
Image(img_bytes)

As we can see, most of the apps clearly hold a rating above 4.0! In fact, quite a lot of apps seem to have a 5.0 rating. Let's check how many apps have the highest possible rating.

In [40]:
googleplaystore.Rating[googleplaystore['Rating'] == 5 ].count()
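
An equivalent way to get this count, together with the share of top-rated apps, is to sum a boolean mask. The ratings below are invented for illustration:

```python
import pandas as pd

ratings = pd.Series([5.0, 4.5, 5.0, 3.0, None])

is_top = ratings == 5            # True where the rating is exactly 5
n_top = int(is_top.sum())        # True counts as 1, False as 0
share = n_top / ratings.count()  # count() ignores missing ratings

print(n_top, share)
```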

But does any feature in the dataset really affect an app's rating? Let's try to figure out how size, number of installs, number of reviews, and price correlate with each other, and then explore the impact of each feature on the rating.

First of all, let's build a heatmap. For exploring correlations between features, a heatmap is among the best visual tools. The individual values in the correlation matrix are represented by different colors, which helps to see quickly which features are most and least strongly related.

In [41]:
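# Correlate the numeric columns only; recent pandas versions raise an error on text columns
sns.heatmap(googleplaystore.select_dtypes('number').corr(), annot=True, linewidths=0.5)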

A positive correlation of 0.62 exists between the number of reviews and the number of installations. This suggests that apps with more reviews also tend to be downloaded more, and that many of the active users who download an app go on to leave feedback.
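
The number annotated in each heatmap cell is a Pearson correlation coefficient, and it can also be read for a single pair of columns. The sketch below uses invented numbers, so its coefficient is not the 0.62 from the dataset:

```python
import pandas as pd

toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Reviews": [10, 200, 3000, 40000],
    "Installs": [1000, 10000, 500000, 1000000],
})

# Correlation matrix of the numeric columns only ("App" is excluded).
corr_matrix = toy.select_dtypes("number").corr()

# One cell of the matrix, and the same value computed directly.
from_matrix = corr_matrix.loc["Reviews", "Installs"]
direct = toy["Reviews"].corr(toy["Installs"])

print(round(from_matrix, 2), round(direct, 2))
```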

Sizing strategy: How does size of the app impact rating?

Although modern phones and tablets have enough memory to handle various kinds of tasks and store gigabytes of data, the size of an app still matters. Let's explore whether it really affects the app rating.

To answer this question, we will use a scatterplot, one of the most common and informative ways to see how two variables relate.

In [42]:
groups = googleplaystore.groupby('Category').filter(lambda x: len(x) >= 50).reset_index()
In [43]:
ax = sns.jointplot(x=googleplaystore['Size'], y=googleplaystore['Rating'])

As we can see, most of the apps with the highest rating have a size between approximately 20 MB and 40 MB.

Pricing: How does price affect app rating?

In [44]:
paid_apps = googleplaystore[googleplaystore.Price>0]
p = sns.jointplot(x="Price", y="Rating", data=paid_apps)

So, the top-rated apps are not expensive: only a few apps cost more than $20.

Pricing across categories

In [45]:
fig, ax = plt.subplots()
fig.set_size_inches(15, 8)
p = sns.stripplot(x="Price", y="Category", data=googleplaystore, jitter=True, linewidth=1)
title = ax.set_title('App pricing trends across categories')

As we can see, there are apps with a price higher than $200! Let's see which categories these apps belong to.

In [46]:
googleplaystore[['Category', 'App']][googleplaystore.Price > 200].groupby([ "Category"], as_index=False).count()
  Category App

Price vs. installation: are free apps downloaded more than paid?

To answer this question we will use a boxplot, which lets us compare the range and distribution of the number of downloads for paid and free apps. Boxplots also help to answer questions like:

  • what are the key values (median, first and third quartiles, and so on)
  • does our data have outliers and what are their values
  • whether our data is symmetric
  • how tightly the data is grouped
  • is the data shifted and, if so, in which direction, etc.
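
The key values a boxplot draws can also be computed by hand. The sketch below does this with NumPy on invented numbers. The 1.5 × IQR rule for flagging outliers is a common convention and is assumed here; plotting libraries may compute quartiles slightly differently.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

# Quartiles: the box spans q1..q3 and the line inside it is the median.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range, a measure of how tightly data is grouped

# Points beyond 1.5 * IQR from the box are usually drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, outliers)
```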
In [47]:
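# NOTE: the y values were missing from the source cell; log10 of "Installs"
# split by the "Type" column is an assumption based on the axis title below.
trace0 = go.Box(
    y = np.log10(googleplaystore['Installs'][googleplaystore.Type == 'Paid']),
    name = 'Paid',
    marker = dict(
        color = 'rgb(214, 12, 140)',
    )
)

trace1 = go.Box(
    y = np.log10(googleplaystore['Installs'][googleplaystore.Type == 'Free']),
    name = 'Free',
    marker = dict(
        color = 'rgb(0, 128, 128)',
    )
)
layout = go.Layout(
    title = "Paid apps Vs free apps",
    yaxis= {'title': 'Downloads (log-scaled)'}
)
data = [trace0, trace1]
iplot({'data': data, 'layout': layout})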

As we can see, paid apps are downloaded less frequently than free ones.
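
The same comparison can be backed with numbers by grouping on the "Type" column and taking the median number of installs. The rows below are invented, so the figures only illustrate the technique:

```python
import pandas as pd

df = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Paid", "Paid"],
    "Installs": [1000, 100000, 10000000, 100, 10000],
})

# The median is less sensitive than the mean to a few hugely popular apps.
median_installs = df.groupby("Type")["Installs"].median()
print(median_installs)
```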


Exploratory data analysis is an integral part of working with data: it helps you get a general understanding of the dataset, along with the basic patterns and outlines needed for first insights.

In this tutorial we walked through the general approaches to initial data exploration, using the app categories and rating columns as an example. However, there are many other interesting dependencies and correlations left to explore in the other columns.

The dataset we used is available via the following link: