Blog

Introduction: PostgreSQL is probably one of the most powerful open-source relational databases available today. Its capabilities are comparable to Oracle's and well ahead of MySQL's. If you build apps in Python, sooner or later you will need to work with a database. Luckily, Python has a wide range of packages that make it easy to connect to and use databases. In...
What other people think about a product or service has a big impact on our everyday decisions. In the past, people relied on the opinions of friends and relatives, or on reports about products and services, but the Internet era has changed this significantly. Today, opinions are collected from people around the world through reviews on e-commerce sites as well as blogs and social networks. To transform gathered...
In the modern world, the flow of information that reaches a person is daunting. This has led to a rather abrupt change in the basic principles of data perception. As a result, visualization is becoming the main tool for presenting information. With the help of visualization, information is presented to the audience in a more accessible, clear, visual form. A properly chosen visualization method makes it possible to structure large data arrays,...
What is Exploratory Data Analysis? Exploratory data analysis (EDA) is a powerful tool for comprehensively studying the available information and answering basic data analysis questions. Unlike traditional analysis, which tests a priori hypotheses, EDA makes it possible to detect, using various methods, all potential systematic correlations in the...
In financial markets, tradable instruments and securities have unique identifiers. These identifiers are very useful because they let you make sure that you and your counterparty are talking about the same instrument when trading. The difficulty is that there is no single standard covering all the various sorts of instruments and markets. Anyone working in the industry will recognize this issue, especially people working at larger institutions who...
A random forest is a combination of tree predictors in which each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. It can be applied to different machine learning tasks, in particular classification and regression. Random forest is built on an ensemble of decision trees and therefore has all the advantages of decision trees, such as high accuracy,...
In this Jupyter Notebook we will retrieve data from the European Central Bank (ECB). The ECB publishes through the European Open Data Portal, which we discussed in the previous tutorial. Before diving into the code, please take a quick look at the following websites to get a feel for what we will be dealing with. EU portal: https://data.europa.eu/euodp/en/data/publisher/ecb ECB SDMX 2.1 RESTful web service:...
The EU Open Data Portal gives access to open data published by EU institutions, agencies and other bodies. Around 70 EU institutions, bodies or departments use the platform to make over 12,500 datasets available. In this Jupyter Notebook we will retrieve data from the open data portal at http://data.europa.eu/euodp/en/home. The portal is based on the open source project CKAN. CKAN stands for Comprehensive...
Are you looking for real-world data science problems to sharpen your skills? In this post, we introduce you to four platforms hosting data science competitions. Data science competitions can be a great way to gain practical experience with real-world data and to boost your motivation through the competitive environment they provide. Check them out: competitions are a lot of fun! Kaggle: Kaggle is the best-known platform...
GBM is a highly popular prediction model among data scientists, or, as top Kaggler Owen Zhang describes it: "My confession: I (over)use GBM. When in doubt, use GBM." GradientBoostingClassifier from sklearn is a popular and user-friendly implementation of gradient boosting in Python (another nice and even faster tool is xgboost). Apart from setting up the feature space and fitting the model, parameter tuning is a crucial task in...
Currently, Python and R are the dominant data science tools, and Python will probably even take the lead (at least based on the latest KDnuggets survey). When did the two open source players manage to become the leading platforms for analytics, data science, and machine learning, leaving behind established players such as Matlab or SAS? Here are some insights from Google Trends. Looking at the years 2009-2013 in the...