Your Career Platform for Big Data

Be part of the digital revolution in the UK and Ireland


Latest job opportunities

Sartorius Stedim Biotech Royston, UK
Full time
Sartorius develops automated laboratory systems which integrate novel hardware and software solutions with electronics, sensor technology and single-use consumables to provide easy-to-use platforms for development scientists within the modern life sciences industry. Sartorius is looking for a capable Data Scientist to contribute to development projects and shape how we process and analyse multivariate data.

Key job accountabilities:
- Formatting data (and automating the formatting of data) from a broad range of sources for analysis
- Working with multi-discipline engineering teams to analyse experimental data, present conclusions and shape the development of software and hardware systems
- Working with software development teams to encode data formatting and data analysis algorithms into finished products
- Multivariate data analysis, likely with SIMCA
- Evaluation of spectroscopy data
- Assisting with troubleshooting customer issues relating to software data processing

This role will be based in Royston, Cambridge, UK, supporting the development teams located there, but may require occasional travel to customers and other development groups within the worldwide Sartorius organisation.

Qualifications:
- PhD/EngD, MSc, or degree-level qualification plus equivalent experience in a relevant field of study, e.g. Physics, Chemistry, Engineering, Biochemical Engineering, Mathematics

Experience:
- General programming skills and data manipulation (e.g. Python, C#, C++)
- Mathematical/statistical scripting (e.g. R or Matlab)
- Multivariate data analysis (e.g. SIMCA, PLS Toolbox, Unscrambler)
- Mathematical/statistical data evaluation methods and algorithms
- Spectroscopy data evaluation (ideally Raman and NIR)

Key job skills

Essential:
- Rapid software prototyping/scripting
- Multivariate data analysis
- Mathematical/statistical data evaluation

Desirable:
- Spectroscopy data evaluation
- Cell culture experience
- Engineering product development lifecycle
- Software development process
- Experimental design, including Design of Experiments approaches

In order to commence working with us, the successful candidate will need the right to work in the UK.
QuantumBlack London, UK
Full time
QUALIFICATIONS
- MSc or PhD in a relevant field (such as Computer Science, Machine Learning, Applied Statistics, Mathematics)
- Client-facing skills, e.g. working in close-knit teams on topics such as data warehousing and machine learning
- Demonstrated leadership (thought leadership or people leadership, e.g. managed project teams or direct reports)
- Experience in applying data science methods to business problems
- Programming experience in languages such as Python, R, Scala and SQL
- Good presentation and communication skills, with a knack for explaining complex analytical concepts to people from other fields
- Knowledge of distributed computing or NoSQL technologies is a bonus
- Proven application of advanced analytical and statistical methods in the commercial world

WHO YOU'LL WORK WITH
As a Principal Data Scientist at QuantumBlack, you will work with other Data Scientists, Data Engineers, Machine Learning Engineers, Designers and Project Managers on interdisciplinary projects, using maths, statistics and machine learning to derive structure and knowledge from raw data across various industry sectors.

Who you are
You are a highly collaborative individual, capable of laying aside your own agenda, listening to and learning from colleagues, challenging thoughtfully and prioritising impact. You search for ways to improve things and work collaboratively with colleagues. You believe in iterative change, experimenting with new approaches, learning and improving to move forward quickly.

WHAT YOU'LL DO
You will work in multi-disciplinary environments, harnessing data to provide real-world impact for organisations globally. You will influence many of the recommendations our clients need to positively change their businesses and enhance performance.
Role responsibilities
- Work with our clients to model their data landscape, obtain data extracts and define secure data exchange approaches
- Mentor Senior Data Scientists and provide guidance to fellow Data Scientists where required
- Provide insights to the leadership team based on client and team learnings
- Take ownership of separate work streams, for instance R&D or recruiting
- Recruit other Data Scientists together with the recruiting team
- Plan and execute secure, good-practice data integration strategies and approaches
- Acquire, ingest and process data from multiple sources and systems into Big Data platforms
- Create and manage data environments in the cloud
- Collaborate with our data scientists to map data fields to hypotheses, and curate, wrangle and prepare data for use in their advanced analytical models
- Use a strong understanding of information security principles to ensure compliant handling and management of client data
- Support Data Scientists by creating views, queries and data extracts to help their analysis

What you'll learn
- How successful projects on real-world problems across a variety of industries are completed, by referencing past deliveries of end-to-end machine learning pipelines
- How to build products alongside the core engineering team and evolve the engineering process to scale with data, handling complex problems and advanced client situations
- How to focus on modelling by working alongside the Data Engineering team, which handles the wrangling, clean-up and transformation of data
- Best practices in software development, and how to productionise machine learning by working with our Machine Learning Engineering teams, which optimise code for model development and scale it
- How to work with our UX and Visual Design teams to turn your complex models into stunning, user-focused visualisations
- How to use new technologies and problem-solving skills in a multicultural and creative environment

You will work on the frameworks and libraries that our teams of Data Scientists and Data Engineers use to progress from data to impact. You will guide global companies through data science solutions to transform their businesses and enhance performance across industries including healthcare, automotive, energy and elite sport.

Real-World Impact – No project is ever the same; we work across multiple sectors, providing unique learning and development opportunities internationally.
Fusing Tech & Leadership – We work with the latest technologies and methodologies and offer first-class learning programmes at all levels.
Multidisciplinary Teamwork – Our teams include data scientists, engineers, project managers, UX and visual designers who work collaboratively to enhance performance.
Innovative Work Culture – Creativity, insight and passion come from being balanced. We cultivate a modern work environment through an emphasis on wellness, insightful talks and training sessions.
Striving for Diversity – With colleagues from over 40 nationalities, we recognise the benefits of working with people from all walks of life.

Our projects range from helping pharmaceutical companies bring lifesaving drugs to market more quickly to optimising a Formula 1 car's performance.
At QuantumBlack you have the best of both worlds: all the benefits of being part of one of the leading management consultancies globally, and the autonomy to thrive in a fast-growth tech culture.

Healthcare Efficiency – We helped a healthcare provider improve their clinical trial practices by identifying congestion in diagnostic testing as a key indicator of admissions breaches.
Environmental Impact – We designed and built the first data-driven application for a state-of-the-art centre of excellence in urban innovation, collecting real-time data from environmental sensors across London and deploying proprietary analytics to find unexpected patterns in air pollution.
Product Development – We worked with the CEO of an elite automotive organisation to reduce the 18-month car development timeframe by improving processes, designs and team structures.
bgc London, UK
Full time
Main purpose of the role:
This is a senior development role in the team responsible for the firm's Big Data analytics platform, requiring a specialism in Java or Scala. The successful candidate will work in the data team to build out the existing platform to support additional technology and business use-cases involving both on-premise and cloud implementations. The platform delivers data solutions across BGC Group, including use-cases for market data analytics, regulatory reporting, surveillance and revenue analysis.

Key responsibilities:
- Work in partnership with the business/business analysts to identify key requirements for implementation
- Identify any technical requirements for new products
- Work with all the business and technology departments to ensure all business and technical requirements are met
- Identify and manage any integration work
- Analyse, design and build any such projects
- Provide input into the current and ongoing system architecture
- Liaise with other development teams as necessary to implement cross-team projects
- Be alert to Conduct Risk issues, specifically the risk of harm to client interests, market integrity and/or competition in financial markets due to inappropriate practices or behaviours across the firm

To undertake and manage:
- Systems analysis and design
- Systems development
- Systems documentation
- Production support and out-of-hours system maintenance

Skills / experience required:
- Minimum of 3 years' development experience working with Big Data platforms covering the following technologies: Java or Scala on Linux (Docker beneficial); Hortonworks, Hive, Kafka, Kafka Connect, Kafka Streams; Spark, Spark SQL, Yarn, Ansible
- Substantial database experience: relational and NoSQL data modelling
- Experience of software development in a financial services environment advantageous
- Willingness to keep up to date with the latest technology trends and proactively identify appropriate areas in which they can be applied
- Solid computing degree

Additional beneficial skills / experience:
- Cassandra / DataStax, Elasticsearch, Logstash, Kibana
- Experience in addressing efficient data storage and querying against very large stores
- Experience in accommodating key requirements of MiFID II in designs/implementations
- AWS: Data Pipeline, S3, EMR, Lambda, DynamoDB, Redshift
- Experience in the use of high-performance messaging middleware such as Tibco RV or Solace

Personal attributes:
- Team player with excellent interpersonal and confident communication skills, able to effectively disseminate knowledge and experience to less experienced team members
- Must be able to deal with our customers effectively, i.e. development teams working across the globe using Messaging and Data products, alongside external vendors who use the products
- Must be able to deal with and adapt to change extremely effectively
- Must be proactive in generating ideas and effective at developing solutions that are balanced, proportionate and effective; it is critical that this is achieved in collaboration with the global team
- Ability to work and function under pressure, handle multiple tasks and shifting priorities
- Self-starter, always looking to improve the quality of processes and deliverables and keen to take a lead role in that process

DataCareer Blog

What is Exploratory Data Analysis

Exploratory data analysis (EDA) is a powerful tool for comprehensively studying the available information and answering basic data analysis questions. What distinguishes it from traditional analysis based on testing an a priori hypothesis is that EDA makes it possible to detect, using a variety of methods, all potential systematic correlations in the data. Exploratory data analysis is practically unlimited in time and methods, allowing you to identify curious data fragments and correlations. You can therefore examine the information more deeply and accurately, and choose a proper model for further work. The Python ecosystem offers a wide range of libraries that not only ease but also streamline the process of exploring a dataset. We will use the Google Play Store Apps dataset and go through the main tasks of exploratory analysis to find out whether there are any trends that can help in framing and solving a business problem.

Data overview

Before we start exploring our data, we must import the dataset and the Python libraries needed for further work. We will use the pandas library, a very powerful tool for comprehensive data analysis.

In [1]: import pandas as pd
In [2]: googleplaystore = pd.read_csv("googleplaystore.csv")

Let's explore the structure of our dataframe by viewing the first and the last 10 rows.

In [3]: googleplaystore.head(10)
Out[3] (columns: App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver):
0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up
1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up
2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up
3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up
4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up
5 | Paper flowers instructions | ART_AND_DESIGN | 4.4 | 167 | 5.6M | 50,000+ | Free | 0 | Everyone | Art & Design | March 26, 2017 | 1.0 | 2.3 and up
6 | Smoke Effect Photo Maker - Smoke Editor | ART_AND_DESIGN | 3.8 | 178 | 19M | 50,000+ | Free | 0 | Everyone | Art & Design | April 26, 2018 | 1.1 | 4.0.3 and up
7 | Infinite Painter | ART_AND_DESIGN | 4.1 | 36815 | 29M | 1,000,000+ | Free | 0 | Everyone | Art & Design | June 14, 2018 | 4.2 and up
8 | Garden Coloring Book | ART_AND_DESIGN | 4.4 | 13791 | 33M | 1,000,000+ | Free | 0 | Everyone | Art & Design | September 20, 2017 | 2.9.2 | 3.0 and up
9 | Kids Paint Free - Drawing Fun | ART_AND_DESIGN | 4.7 | 121 | 3.1M | 10,000+ | Free | 0 | Everyone | Art & Design;Creativity | July 3, 2018 | 2.8 | 4.0.3 and up

In [4]: googleplaystore.tail(10)
Out[4] (same columns):
10831 | MAPS_AND_NAVIGATION | NaN | 38 | 9.8M | 5,000+ | Free | 0 | Everyone | Maps & Navigation | June 13, 2018 | 4.0 and up
10832 | FR Tides | WEATHER | 3.8 | 1195 | 582k | 100,000+ | Free | 0 | Everyone | Weather | February 16, 2014 | 6.0 | 2.1 and up
10833 | Chemin (fr) | BOOKS_AND_REFERENCE | 4.8 | 44 | 619k | 1,000+ | Free | 0 | Everyone | Books & Reference | March 23, 2014 | 0.8 | 2.2 and up
10834 | FR Calculator | FAMILY | 4.0 | 7 | 2.6M | 500+ | Free | 0 | Everyone | Education | June 18, 2017 | 1.0.0 | 4.1 and up
10835 | FR Forms | BUSINESS | NaN | 0 | 9.6M | 10+ | Free | 0 | Everyone | Business | September 29, 2016 | 1.1.5 | 4.0 and up
10836 | Sya9a Maroc - FR | FAMILY | 4.5 | 38 | 53M | 5,000+ | Free | 0 | Everyone | Education | July 25, 2017 | 1.48 | 4.1 and up
10837 | Fr. Mike Schmitz Audio Teachings | FAMILY | 5.0 | 4 | 3.6M | 100+ | Free | 0 | Everyone | Education | July 6, 2018 | 1.0 | 4.1 and up
10838 | Parkinson Exercices FR | MEDICAL | NaN | 3 | 9.5M | 1,000+ | Free | 0 | Everyone | Medical | January 20, 2017 | 1.0 | 2.2 and up
10839 | The SCP Foundation DB fr nn5n | BOOKS_AND_REFERENCE | 4.5 | 114 | Varies with device | 1,000+ | Free | 0 | Mature 17+ | Books & Reference | January 19, 2015 | Varies with device | Varies with device
10840 | iHoroscope - 2018 Daily Horoscope & Astrology | LIFESTYLE | 4.5 | 398307 | 19M | 10,000,000+ | Free | 0 | Everyone | Lifestyle | July 25, 2018 | Varies with device | Varies with device

We can see that the googleplaystore dataframe has missing values. But before dealing with them, let's get a more complete view of the data. Firstly, we will use the describe() method, which gives a statistical summary of the numerical columns in our dataset. We can also use the info() method to check the data type and number of missing values in each column, and the shape attribute to retrieve the number of rows and columns in the dataframe.
In [5]: googleplaystore.describe()
Out[5]:
            Rating
count  9367.000000
mean      4.193338
std       0.537431
min       1.000000
25%       4.000000
50%       4.300000
75%       4.500000
max      19.000000

In [6]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
App               10841 non-null object
Category          10841 non-null object
Rating            9367 non-null float64
Reviews           10841 non-null object
Size              10841 non-null object
Installs          10841 non-null object
Type              10840 non-null object
Price             10841 non-null object
Content Rating    10840 non-null object
Genres            10841 non-null object
Last Updated      10841 non-null object
Current Ver       10833 non-null object
Android Ver       10838 non-null object
dtypes: float64(1), object(12)
memory usage: 1.1+ MB

In [7]: googleplaystore.shape
Out[7]: (10841, 13)

In [8]: googleplaystore.dtypes
Out[8]:
App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

So, what information do we have after these small steps? Firstly, we have a number of apps divided into various categories. Secondly, although columns such as "Reviews" contain numeric data, they are stored as a non-numeric type (object), which can cause problems during further processing. Note also the suspicious maximum Rating of 19.0: valid ratings top out at 5.0, so at least one row is corrupted. We are also interested in the total number of apps and the available categories in the dataset. To get the exact number of apps, we will find all the unique values in the corresponding column.
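Before running this on the real column, the pattern, and pandas' equivalent nunique() shortcut, can be sketched on a toy column (hypothetical values):

```python
import pandas as pd

# Toy column (hypothetical values): "Maps" appears twice, so there are 3 unique apps
apps = pd.Series(["Maps", "Maps", "Chess", "Notes"])

print(len(apps.unique()))  # 3 -- the approach used below on the real column
print(apps.nunique())      # 3 -- pandas' direct shortcut for the same count
```

Both forms give the same answer; nunique() saves building the intermediate array.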
In [9]: len(googleplaystore["App"].unique())
Out[9]: 9660

In [10]: unique_categories = googleplaystore["Category"].unique()
In [11]: unique_categories
Out[11]: array(['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY', 'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION', 'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE', 'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME', 'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL', 'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL', 'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER', 'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION', '1.9'], dtype=object)

Note the spurious '1.9' value among the categories, further evidence of a mis-parsed row.

Duplicate records removal

Duplicate records often appear in datasets, and they can degrade the quality and accuracy of the exploration. Such data also clogs the dataset, so we need to get rid of it.

In [14]: googleplaystore.drop_duplicates(keep='first', inplace=True)
In [15]: googleplaystore.shape
Out[15]: (10358, 13)

For removing duplicate rows, pandas has the powerful and customizable drop_duplicates() method, which takes parameters controlling how the dataset is cleaned. keep='first' means that the method keeps the first occurrence of each duplicated row and drops the rest (keep=False would instead drop every occurrence). inplace=True means that the changes are applied directly to the dataframe we are working with. As we can see above, our initial googleplaystore dataset contained 10841 rows; after removing duplicates, the number of rows decreased to 10358.

NA analysis

Another common problem in almost every dataset is columns with missing values. We will explore only the most common ways to clean a dataset of missing values. Firstly, let's look at the total number of missing values in each column.
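On a toy frame (hypothetical values), the per-column NA count pattern we are about to apply looks like this:

```python
import pandas as pd

# Toy frame (hypothetical values): two missing ratings, one missing type
toy = pd.DataFrame({
    "Rating": [4.1, None, None],
    "Type": ["Free", "Free", None],
})

# Count NAs per column and sort so the worst offenders come first
na_counts = toy.isnull().sum().sort_values(ascending=False)
print(na_counts.to_dict())  # {'Rating': 2, 'Type': 1}
```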
One of the great things about pandas is that it allows users to chain various operations in a single expression, which brings optimization opportunities and makes the code more compact.

In [14]: googleplaystore.isnull().sum().sort_values(ascending=False)
Out[14]:
Rating            1465
Current Ver          8
Android Ver          3
Content Rating       1
Type                 1
Last Updated         0
Genres               0
Price                0
Installs             0
Size                 0
Reviews              0
Category             0
App                  0
dtype: int64

Now, let's get rid of all the rows with missing values. Although some statistical approaches allow us to impute missing data (for example with the most common value or the mean), today we will work only with complete rows. The pandas dropna() method also lets users set parameters to control how the data is processed. Here we state that every row containing any NA value must be dropped, and that the changes are stored directly in our dataframe.

In [16]: googleplaystore.dropna(how='any', inplace=True)

Let's now check the shape of the dataframe after all the cleaning manipulations were performed.

In [17]: googleplaystore.shape
Out[17]: (8886, 13)

If we look closer at our dataset and the result of dtypes, we can see that columns like "Reviews", "Size", "Price" and "Installs" should definitely hold numeric values. So, let's see what values each column contains in order to plan our further manipulations.
In [18]: googleplaystore.Price.unique() Out[18]: array(['0', '$4.99', '$3.99', '$6.99', '$7.99', '$5.99', '$2.99', '$3.49', '$1.99', '$9.99', '$7.49', '$0.99', '$9.00', '$5.49', '$10.00', '$24.99', '$11.99', '$79.99', '$16.99', '$14.99', '$29.99', '$12.99', '$2.49', '$10.99', '$1.50', '$19.99', '$15.99', '$33.99', '$39.99', '$3.95', '$4.49', '$1.70', '$8.99', '$1.49', '$3.88', '$399.99', '$17.99', '$400.00', '$3.02', '$1.76', '$4.84', '$4.77', '$1.61', '$2.50', '$1.59', '$6.49', '$1.29', '$299.99', '$379.99', '$37.99', '$18.99', '$389.99', '$8.49', '$1.75', '$14.00', '$2.00', '$3.08', '$2.59', '$19.40', '$3.90', '$4.59', '$15.46', '$3.04', '$13.99', '$4.29', '$3.28', '$4.60', '$1.00', '$2.95', '$2.90', '$1.97', '$2.56', '$1.20'], dtype=object) In [19]: googleplaystore.Installs.unique() Out[19]: array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+', '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+', '1,000,000,000+', '1,000+', '500,000,000+', '100+', '500+', '10+', '5+', '50+', '1+'], dtype=object) In [20]: googleplaystore.Size.unique() Out[20]: array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M', '28M', '12M', '20M', '21M', '37M', '5.5M', '17M', '39M', '31M', '4.2M', '23M', '6.0M', '6.1M', '4.6M', '9.2M', '5.2M', '11M', '24M', 'Varies with device', '9.4M', '15M', '10M', '1.2M', '26M', '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k', '3.6M', '5.7M', '8.6M', '2.4M', '27M', '2.7M', '2.5M', '7.0M', '16M', '3.4M', '8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M', '2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M', '7.1M', '22M', '6.4M', '3.2M', '8.2M', '4.9M', '9.5M', '5.0M', '5.9M', '13M', '73M', '6.8M', '3.5M', '4.0M', '2.3M', '2.1M', '42M', '9.1M', '55M', '23k', '7.3M', '6.5M', '1.5M', '7.5M', '51M', '41M', '48M', '8.5M', '46M', '8.3M', '4.3M', '4.7M', '3.3M', '40M', '7.8M', '8.8M', '6.6M', '5.1M', '61M', '66M', '79k', '8.4M', '3.7M', '118k', '44M', '695k', '1.6M', '6.2M', '53M', '1.4M', 
'3.0M', '7.2M', '5.8M', '3.8M', '9.6M', '45M', '63M', '49M', '77M', '4.4M', '70M', '9.3M', '8.1M', '36M', '6.9M', '7.4M', '84M', '97M', '2.0M', '1.9M', '1.8M', '5.3M', '47M', '556k', '526k', '76M', '7.6M', '59M', '9.7M', '78M', '72M', '43M', '7.7M', '6.3M', '334k', '93M', '65M', '79M', '100M', '58M', '50M', '68M', '64M', '34M', '67M', '60M', '94M', '9.9M', '232k', '99M', '624k', '95M', '8.5k', '41k', '292k', '80M', '1.7M', '10.0M', '74M', '62M', '69M', '75M', '98M', '85M', '82M', '96M', '87M', '71M', '86M', '91M', '81M', '92M', '83M', '88M', '704k', '862k', '899k', '378k', '4.8M', '266k', '375k', '1.3M', '975k', '980k', '4.1M', '89M', '696k', '544k', '525k', '920k', '779k', '853k', '720k', '713k', '772k', '318k', '58k', '241k', '196k', '857k', '51k', '953k', '865k', '251k', '930k', '540k', '313k', '746k', '203k', '26k', '314k', '239k', '371k', '220k', '730k', '756k', '91k', '293k', '17k', '74k', '14k', '317k', '78k', '924k', '818k', '81k', '939k', '169k', '45k', '965k', '90M', '545k', '61k', '283k', '655k', '714k', '93k', '872k', '121k', '322k', '976k', '206k', '954k', '444k', '717k', '210k', '609k', '308k', '306k', '175k', '350k', '383k', '454k', '1.0M', '70k', '812k', '442k', '842k', '417k', '412k', '459k', '478k', '335k', '782k', '721k', '430k', '429k', '192k', '460k', '728k', '496k', '816k', '414k', '506k', '887k', '613k', '778k', '683k', '592k', '186k', '840k', '647k', '373k', '437k', '598k', '716k', '585k', '982k', '219k', '55k', '323k', '691k', '511k', '951k', '963k', '25k', '554k', '351k', '27k', '82k', '208k', '551k', '29k', '103k', '116k', '153k', '209k', '499k', '173k', '597k', '809k', '122k', '411k', '400k', '801k', '787k', '50k', '643k', '986k', '516k', '837k', '780k', '20k', '498k', '600k', '656k', '221k', '228k', '176k', '34k', '259k', '164k', '458k', '629k', '28k', '288k', '775k', '785k', '636k', '916k', '994k', '309k', '485k', '914k', '903k', '608k', '500k', '54k', '562k', '847k', '948k', '811k', '270k', '48k', '523k', '784k', '280k', '24k', 
'892k', '154k', '18k', '33k', '860k', '364k', '387k', '626k', '161k', '879k', '39k', '170k', '141k', '160k', '144k', '143k', '190k', '376k', '193k', '473k', '246k', '73k', '253k', '957k', '420k', '72k', '404k', '470k', '226k', '240k', '89k', '234k', '257k', '861k', '467k', '676k', '552k', '582k', '619k'], dtype=object)

First of all, let's get rid of the dollar sign in the "Price" column and convert the values to numeric type.

In [21]: googleplaystore['Price'] = googleplaystore['Price'].apply(lambda x: x.replace('$', '') if '$' in str(x) else x)
         googleplaystore['Price'] = googleplaystore['Price'].apply(lambda x: float(x))

Now we will work with the "Installs" column: we must remove the plus signs and thousands separators and convert the values to numeric.

In [22]: googleplaystore['Installs'] = googleplaystore['Installs'].apply(lambda x: x.replace('+', '') if '+' in str(x) else x)
         googleplaystore['Installs'] = googleplaystore['Installs'].apply(lambda x: x.replace(',', '') if ',' in str(x) else x)
         googleplaystore['Installs'] = googleplaystore['Installs'].apply(lambda x: int(x))

Also, convert the "Reviews" column to numeric type.

In [23]: googleplaystore['Reviews'] = googleplaystore['Reviews'].apply(lambda x: int(x))

Finally, let's deal with the "Size" column, as it needs a more complex approach. This column contains several kinds of values: among the sizes, which can be expressed in either MB or KB, there are null values and strings. Moreover, we need to reconcile the difference between values written in MB and KB.
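The unit handling just described can be sketched first on a toy series (hypothetical values; the full cell on the real column follows the same logic):

```python
import math
import pandas as pd

# Toy series (hypothetical values) mixing MB, KB and the placeholder string
sizes = pd.Series(["19M", "201k", "Varies with device"])

def to_megabytes(x):
    """Normalise a Play Store size string to a float in MB."""
    if x == "Varies with device":
        return float("nan")          # no usable size
    if x.endswith("k"):
        return float(x[:-1]) / 1000  # KB -> MB
    return float(x.rstrip("M"))      # already in MB

converted = sizes.apply(to_megabytes)
print(converted.tolist())  # [19.0, 0.201, nan]
```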
In [24]: googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: str(x).replace('Varies with device', 'NaN') if 'Varies with device' in str(x) else x)
         googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: str(x).replace('M', '') if 'M' in str(x) else x)
         googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: str(x).replace(',', '') if ',' in str(x) else x)
         googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: float(str(x).replace('k', '')) / 1000 if 'k' in str(x) else x)
         googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: float(x))

Let's call the describe() method one more time. As we can see, we now have a statistical summary for all the relevant numeric columns.

In [25]: googleplaystore.describe()
Out[25]:
            Rating       Reviews         Size      Installs        Price
count  8886.000000  8.886000e+03  7418.000000  8.886000e+03  8886.000000
mean      4.187959  4.730928e+05    22.760829  1.650061e+07     0.963526
std       0.522428  2.906007e+06    23.439210  8.640413e+07    16.194792
min       1.000000  1.000000e+00     0.008500  1.000000e+00     0.000000
25%       4.000000  1.640000e+02     5.100000  1.000000e+04     0.000000
50%       4.300000  4.723000e+03    14.000000  5.000000e+05     0.000000
75%       4.500000  7.131325e+04    33.000000  5.000000e+06     0.000000
max       5.000000  7.815831e+07   100.000000  1.000000e+09   400.000000

Building visualizations

Visualization is probably one of the most useful approaches in data analysis. Not all correlations and dependencies can be seen in tabular data, and various plots and diagrams can help to depict them clearly. Let's go through the different ways we can explore the categories.

Exploring which categories have the most apps

One of the fanciest ways to visualize such data is a word cloud. With a few lines of code, we can create an illustration showing which categories contain the largest number of apps.
In [30]: import matplotlib.pyplot as plt
         from wordcloud import WordCloud
         import seaborn as sns
         color = sns.color_palette()
         %matplotlib inline

In [33]: from plotly import tools
         from plotly.offline import iplot, init_notebook_mode
         from IPython.display import Image
         import plotly.offline as py
         import plotly.graph_objs as go
         import plotly.io as pio
         import numpy as np
         py.init_notebook_mode()

In [34]: wc = WordCloud(max_font_size=250, collocations=False, max_words=33, width=1600, height=800, background_color="white").generate(' '.join(googleplaystore['Category']))
         plt.figure(figsize=(20,10))
         plt.imshow(wc, interpolation="bilinear")
         plt.axis("off")
         plt.tight_layout(pad=0)

Exploring app ratings across top categories

In [35]: groups = googleplaystore.groupby('Category').filter(lambda x: len(x) > 286).reset_index()
         array = groups['Rating'].hist(by=groups['Category'], sharex=True, figsize=(20,20))

As we can see, average app ratings differ quite a bit across the categories.

Average rating of all the apps

And what insight will we get if we explore the average rating across all of the apps?

In [36]: avg_rate_data = go.Figure()
         avg_rate_data.add_histogram(x=googleplaystore.Rating, xbins={'start': 1, 'size': 0.1, 'end': 6})
         iplot(avg_rate_data)

In [38]: img_bytes = pio.to_image(avg_rate_data, format='png', width=1600, height=800, scale=2)
In [39]: Image(img_bytes)

As we can see, most of the apps clearly hold a rating above 4.0. In fact, quite a lot of apps seem to have a 5.0 rating. Let's check how many apps have the highest possible rating.

In [40]: googleplaystore.Rating[googleplaystore['Rating'] == 5].count()
Out[40]: 271

But does any feature from the dataset really affect the apps' rating?
Let's try to figure out how size, number of installs, reviews, and price correlate with each other, and then explore the impact of each feature on the rating. First of all, let's build a heatmap. For exploring correlations between features, a heatmap is among the best visual tools: the individual values in the data matrix are represented by different colors, which helps to quickly see which features have the strongest and weakest dependencies.

In [41]: sns.heatmap(googleplaystore.corr(), annot=True, linewidth=0.5)
Out[41]: <matplotlib.axes._subplots.AxesSubplot at 0x11f75fbe0>

A positive correlation of 0.62 exists between the number of reviews and the number of installations, which means that customers tend to download a given app more if it has been reviewed by a larger number of people. It also means that many active users who download an app usually leave feedback.

Sizing strategy: how does the size of the app impact its rating?

Although modern phones and tablets have enough memory to handle all kinds of tasks and store gigabytes of data, the size of an app still matters. Let's explore whether this value really affects app rating or not. To find an answer to this question, we will use a scatterplot, which is certainly the most common and informative way to see how two variables correlate.

In [42]: groups = googleplaystore.groupby('Category').filter(lambda x: len(x) >= 50).reset_index()
In [43]: sns.set_style("whitegrid")
         ax = sns.jointplot(googleplaystore['Size'], googleplaystore['Rating'])

As we can see, most of the apps with the highest ratings have a size between roughly 20MB and 40MB.

Pricing: how does price affect app rating?
In [44]: paid_apps = googleplaystore[googleplaystore.Price > 0]
         p = sns.jointplot("Price", "Rating", paid_apps)

So, the top-rated apps do not carry big prices: only a few apps cost more than $20.

Pricing across categories

In [45]: sns.set_style('whitegrid')
         fig, ax = plt.subplots()
         fig.set_size_inches(15, 8)
         p = sns.stripplot(x="Price", y="Category", data=googleplaystore, jitter=True, linewidth=1)
         title = ax.set_title('App pricing trends across categories')

As we can see, there are apps priced higher than $200! Let's see which categories these apps belong to.

In [46]: googleplaystore[['Category', 'App']][googleplaystore.Price > 200].groupby(["Category"], as_index=False).count()
Out[46]:
             Category   App
         0   FAMILY       4
         1   FINANCE      6
         2   LIFESTYLE    5

Price vs. installation: are free apps downloaded more than paid ones?

To visualize the answer, we will use a boxplot, which lets us compare the range and distribution of the number of downloads for paid and free apps. Boxplots also help to answer questions such as: what are the key values (average, median, first quartile, and so on); does our data have outliers and what are their values; is our data symmetric; how tightly is the data grouped; is the data skewed and, if so, in which direction.

In [47]: trace0 = go.Box(
             y=np.log10(googleplaystore['Installs'][googleplaystore.Type == 'Paid']),
             name='Paid',
             marker=dict(color='rgb(214, 12, 140)')
         )
         trace1 = go.Box(
             y=np.log10(googleplaystore['Installs'][googleplaystore.Type == 'Free']),
             name='Free',
             marker=dict(color='rgb(0, 128, 128)')
         )
         layout = go.Layout(
             title="Paid apps vs. free apps",
             yaxis={'title': 'Downloads (log-scaled)'}
         )
         data = [trace0, trace1]
         iplot({'data': data, 'layout': layout})

As we can see, paid apps are downloaded less frequently than free ones.
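The summary statistics that a boxplot encodes can also be computed directly. Here is a minimal sketch with made-up install counts (the real Installs column would need cleaning first), showing why the paid and free boxes sit at different heights:

```python
import numpy as np

# Made-up install counts for a few paid and free apps
# (illustrative only, not the real googleplaystore data).
paid_installs = np.array([1_000, 5_000, 10_000, 50_000])
free_installs = np.array([100_000, 1_000_000, 5_000_000, 10_000_000])

# Log-scale the counts, as in the plotly boxplot above.
paid_log = np.log10(paid_installs)
free_log = np.log10(free_installs)

# Quartiles and median: the box edges and the line inside the box.
for name, x in (("Paid", paid_log), ("Free", free_log)):
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    print(f"{name}: Q1={q1:.2f}, median={med:.2f}, Q3={q3:.2f}")
```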
Conclusion

Exploratory data analysis is an inherent part of data exploration: it helps you gain general knowledge of the dataset you are working with and form the basic ideas and outlines that lead to first insights. In this tutorial we walked through general approaches to initial data exploration, using the categories and rating columns of an apps dataset as an example. However, there are many other interesting dependencies and correlations left to discover in the other columns. The dataset we used is available via the following link:
In the modern world, the flow of information that a person faces is daunting, and it has led to a rather abrupt change in the basic principles of data perception. Visualization is therefore becoming the main tool for presenting information: with its help, information is presented to the audience in a more accessible, clear, and visual form. A properly chosen visualization method makes it possible to structure large data arrays, schematically depict elements that are insignificant in content, and make information more comprehensible.

One of the most popular languages for data processing and analysis is Python, largely thanks to the speed with which its libraries are created and developed, granting practically unlimited possibilities for various kinds of data processing. The same is true for its data visualization libraries. In this article, we will look at the basic tools for visualizing data that are used in the Python development environment.

Matplotlib

Matplotlib is perhaps the most widely known Python library for data visualization. While easy to use, it offers ample opportunities to fine-tune the way data is displayed.

Polar area chart

The library provides the main visualization algorithms, including scatter plots, line plots, histograms, bar plots, box plots, and more. It is worth noting that the library has fairly extensive documentation, which makes it comfortable to work with even for beginners in data processing and visualization.

Multicategorical plot

One of the main advantages of this library is its well-thought-out hierarchical structure. The highest level is the functional interface called matplotlib.pyplot, which allows users to create complex infographics with just a couple of lines of code by choosing ready-made solutions from the functions the interface offers.
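As an illustration of the pyplot interface described above, here is a minimal hedged sketch (the category names, counts, and output file name are invented for the example):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render to a file, no display needed
import matplotlib.pyplot as plt

# A complete bar chart in a handful of pyplot calls.
fig, ax = plt.subplots()
ax.bar(["News", "Games", "Tools"], [120, 340, 210])
ax.set_title("Apps per category (invented numbers)")
ax.set_ylabel("Count")
fig.savefig("categories.png")
```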
Histogram

The convenience of creating visualizations with Matplotlib comes not only from its many built-in graphic commands but also from its rich arsenal of options for configuring the standard forms. The settings include the ability to set arbitrary colors, shapes, line types or markers, line thickness, transparency levels, font sizes and types, and so on.

Seaborn

Despite the wide popularity of the Matplotlib library, it has one drawback that can become critical for some users: its API is low-level, so in order to create truly complex infographics you may need to write a lot of boilerplate code.

Hexbin plot

Fortunately, this problem is successfully solved by the Seaborn library, a kind of high-level wrapper over Matplotlib. With its help, users can create colorful, specialized visualizations: heat maps, time series, violin charts, and much more.

Seaborn heatmap

Being highly customizable, Seaborn gives users wide latitude to add a unique and fancy look to their charts in quite a simple way and with little effort.

ggplot

Users who have experience with R have probably heard about ggplot2, a powerful data visualization tool within the R programming environment. This package is recognized as one of the best tools for the graphical presentation of information. Fortunately, the extensive capabilities of this library are now available in the Python environment thanks to a port of the package, available there under the name ggplot.

Box plot

As we mentioned earlier, the process of data visualization has a deep internal structure. In other words, creating a visualization is a clearly structured process, and that structure largely shapes how one thinks while building infographics. ggplot teaches the user to think in such a structured way, so that in the process of consistently building commands the user automatically starts detecting patterns in the data.
Scatter plot

Moreover, the library is very flexible: ggplot provides users with ample opportunities for customizing how data will be displayed and for preprocessing datasets before they are rendered.

Bokeh

Despite the rich potential of the ggplot library, some users may miss interactivity. For those who need interactive data visualization, the Bokeh library has been created.

Stacked area chart

Bokeh is an open-source Python library with a JavaScript client side (BokehJS) that allows users to create flexible, powerful, and beautiful visualizations for web applications. With its help, users can create both simple bar charts and complex, highly detailed interactive visualizations without writing a single line of JavaScript. Please have a look at this gallery to get an idea of the interactive features of Bokeh.

plotly

For those who need interactive diagrams, we also recommend checking out the plotly library. It is positioned primarily as an online platform on which users can create and publish their own visualizations. However, the library can also be used offline, without uploading the visualization to the plotly server.

Contour plot

Because this library is positioned by its developers mostly as a standalone product, it is constantly being refined and expanded, and it provides users with truly extensive possibilities for data visualization, whether interactive graphics or contours. You can find some examples of Plotly through the link below and have a look at the features of the library.

Conclusion

Over the past few years, the data visualization tools available to Python developers have made a significant leap forward. Many powerful packages have appeared and continue to expand, implementing quite complex ways of graphically representing information. This allows users not only to create various infographics but also to make them truly attractive and understandable to the audience.
The more carefully you process the data and go into the details, the more valuable information you can extract. Data visualization is an efficient and handy tool for gaining insights from data; moreover, with the help of visualization tools you can make the data far more understandable, colorful, and pleasant. As data changes every second, it is an urgent task to investigate it carefully and extract insights as fast as possible. Data visualization tools cover a full scope of capabilities and additional functions that are meant to facilitate the visualization process for you. Thus, we attempted to make an overview of the most popular and useful libraries for data visualization in R.

Ggplot2

Ggplot2 is a system for creating charts based on the Grammar of Graphics, and it has proved to be one of the best R libraries for visualization. Ggplot2 works with both univariate and multivariate, numerical and categorical data, which makes it very flexible. The plot specification sits at a high level of abstraction, and the package is a complete graphics system: it contains a variety of labels, themes, different coordinate systems, and so on. You therefore get the opportunity to:

control data values with the scales option
filter, sort, and summarize datasets
create complex plots.

However, some things are not available with ggplot2, such as 3D graphics, graph-theory-type graphs, and interactive graphics. Here are several examples of visualization plots made with the help of Ggplot2.

Density plot
Boxplot
Scatterplot

Plotly

Plotly is an online platform for data visualization, available in R and Python. The package creates interactive web-based plots using the plotly.js library. Its advantage is that it can build contour plots, candlestick charts, maps, and 3D charts, which cannot be created with most packages. In addition, it has 30 repositories available. Plotly gives you the opportunity to interact with graphs, change their scale, and point out the necessary records.
The library also supports graph hovering. Moreover, you can easily add Plotly to knitr/R Markdown or Shiny apps. Have a look at several plots and charts created with Plotly.

Contour plot
Candlestick chart
3D scatterplot

Dygraphs

Dygraphs is an R interface to the dygraphs JavaScript charting library, which has proved to be fast and flexible in its application and facilitates working with dense data. Dygraphs is a useful tool for charting time-series data in R. The benefits of this library include support for visualizing xts objects; graph overlays such as shaded regions, event lines, and point annotations; interaction with graphs; showing upper/lower bars; synchronization and the range selector; and more. You can also easily add Dygraphs to knitr/R Markdown or Shiny apps. Moreover, huge datasets with millions of points don't affect its speed, and you can use RColorBrewer with Dygraphs to increase the range of colors. Below you can see a vivid representation of data visualization with the Dygraphs package.

Leaflet

Leaflet is a well-known package based on JavaScript libraries for interactive maps. It is widely used for mapping and for customizing and designing interactive maps; besides, Leaflet provides an opportunity to make these maps mobile-friendly. Its abilities include:

interaction with plots and changing their scale
map design (markers, pop-ups, GeoJSON)
easy integration with knitr/R Markdown or Shiny apps
working with latitude/longitude columns
support of the Shiny logic using map bounds and mouse events
visualization of maps in non-spherical Mercator projections.

Rgl

The Rgl package may be a perfect fit for creating interactive 3D plots in R. It offers a variety of 3D shapes, lighting effects, different types of objects, and even the ability to animate your 3D scene. Rgl contains high-level graphics commands and a low-level structure. The plot types include points, lines, segments from z=0, and spheres.
Moreover, with Rgl you can:

interact with graphs
apply various decorative functions
easily add Rgl to knitr/R Markdown or Shiny apps
create new shapes.

Conclusion

To sum up, data visualization is more than a charming picture of your data: it is a chance to see the data under the hood. R is a powerful visualization tool. Using R, you can build a variety of charts, from a simple pie chart to more sophisticated ones such as 3D graphs, interactive graphs, maps, and so on. Of course, this list is not complete, and many other great visualization tools exist that can bring their specific benefits to your data visualization; nevertheless, we compiled this list from our experience. Summarizing everything mentioned before: Plotly, Dygraphs, and Leaflet support zooming and moving your graphs, and if you are plotting time series, you can filter dates using a range selector. For building 3D models, Rgl is highly suitable. Do your best with handy R visualization tools!