In this we will retrieve data from the Bank of England's Statistical Interactive Database (IADB) using Python 3. Please take a look at the following sites to get a feel for what kind of data is available:
As an , we will take a look at the fixed mortgage interest rates in the UK since the year 2000. The full description of the series is:
Monthly interest rate of UK monetary financial institutions (excl. Central Bank) sterling 2, 3, 5 and 10 year (75% LTV) fixed rate mortgage to households (in percent) not seasonally adjusted.
We will use Python to (re)construct the following example URL and retrieve the data. If you click on this link, it will download the data:
In the above URL, we used CSV as the data format. It is possible to download data in other formats, like XML, HTML Excel. In this tutorial we will work with Pandas, so we prefer to retrieve the data in CSV format. Please note that the API takes the following parameters:
Parameter name | Required format | Example |
---|---|---|
DD/MON/YYYY (mandatory parameter) | 01/Feb/2006 | |
DD/MON/YYYY or "now" (mandatory parameter) | 01/Oct/2007 | |
SeriesCodes | list of full series codes (mandatory parameter) | LPMAUZI |
"Y" (mandatory parameter) | Y | |
CSVF | "TT", "TN", "CT" or "CN" (mandatory parameter) | TN |
VPD | "Y" or "N" (optional parameter) | Y |
NB:
As always, let's start with importing some packages we will use for this exercise.
import pandas as pd # 0.23.0
import requests # 2.19.1
import io
First, we define the URL endpoint:
url_endpoint = 'http://www.bankofengland.co.uk/boeapps/iadb/fromshowcolumns.asp?csv.x=yes'
Secondly, we define the parameters of our request; the header will further avoid potential HTTP 500 errors:
payload = {
'Datefrom' : '01/Jan/2000',
'Dateto' : '01/Oct/2018',
'SeriesCodes': 'IUMBV34,IUMBV37,IUMBV42,IUMBV45',
'CSVF' : 'TN',
'UsingCodes' : 'Y',
'VPD' : 'Y',
'VFD' : 'N'
}
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
'AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/54.0.2840.90 '
'Safari/537.36'
}
We can use the requests package to construct the URL with the dictionary of parameters:
response = requests.get(url_endpoint, params=payload, headers=headers)
# Check if the response was successful, it should return '200'
print(response.status_code)
200
The URL that was constructed by the requests package is:
print(response.url)
http://www.bankofengland.co.uk/boeapps/iadb/fromshowcolumns.asp?csv.x=yes&Datefrom=01%2FJan%2F2000&Dateto=01%2FOct%2F2018&SeriesCodes=IUMBV34%2CIUMBV37%2CIUMBV42%2CIUMBV45&CSVF=TN&UsingCodes=Y&VPD=Y&VFD=N
We can use Pandas' read_csv function to load the data. The function takes a file (or an URL) as an argument. We already downloaded the data (as bytes), so we can pass it with 'io.BytesIO'.
df = pd.read_csv(io.BytesIO(response.content))
# Print the first 5 rows to the screen for inspection
df.head()
DATE | IUMBV34 | IUMBV37 | IUMBV42 | IUMBV45 | |
---|---|---|---|---|---|
0 | 31 Jan 2000 | 6.78 | 6.78 | 7.09 | 7.29 |
1 | 29 Feb 2000 | 6.79 | 6.80 | 7.11 | 0 |
2 | 31 Mar 2000 | 6.45 | 6.70 | 6.94 | 6.54 |
3 | 30 Apr 2000 | 6.31 | 6.68 | 6.92 | 6.54 |
4 | 31 May 2000 | 6.39 | 6.64 | 6.90 | 6.69 |
This already looks pretty good. But, notice the value of '0' at '29 Feb 2000' for the 10-year mortgage rate ('IUMBV45'), which looks suspicious. Please inspect the series df['IUMBV45'] and you'll see there are missing values between index 116 to 174, represented by '..' (two dots). Strangely, at index 157, there is a value. If you are working in Jupyter Notebooks and cannot print the entire series to your screen, you can use this little hack (converting the series to string)
print(df['IUMBV45'].to_string())
Can you see the series? Let's continue the inspection.
# Check the DataFrame's info
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 225 entries, 0 to 224 Data columns (total 5 columns): DATE 225 non-null object IUMBV34 225 non-null float64 IUMBV37 225 non-null float64 IUMBV42 225 non-null float64 IUMBV45 225 non-null object dtypes: float64(3), object(2) memory usage: 8.9+ KB
Checking the DataFrame's information, it seems there are three things that need to be done:
# Step 1: Convert 'DATE' to datetime format
df['DATE'] = pd.to_datetime(df['DATE'])
# Step 2: Set 'DATE' to be the index
df = df.set_index('DATE')
# Check the new index
df.index
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-30', '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31', '2000-09-30', '2000-10-31', ... '2017-12-31', '2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30', '2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31', '2018-09-30'], dtype='datetime64[ns]', name='DATE', length=225, freq=None)
Series 'IUMBV45' is of type "object" (and not "float64") because of the missing values are represented by '..' (two dots). We can convert it to float64 and will empty all the cells with dots by setting the parameter errors='coerce' in Pandas 'to_numeric' function.
# Step 3: Convert the 10 year mortgage rate 'IUMBV45' to numeric
df['IUMBV45'] = pd.to_numeric(df['IUMBV45'], errors='coerce')
# Check the DataFrame's info again
df.info()
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 225 entries, 2000-01-31 to 2018-09-30 Data columns (total 4 columns): IUMBV34 225 non-null float64 IUMBV37 225 non-null float64 IUMBV42 225 non-null float64 IUMBV45 167 non-null float64 dtypes: float64(4) memory usage: 8.8 KB
This looks better, the dates are the index now and there are 167 values left for 'IUMBV45'. Remember the value of '0' at '29 Feb 2000'? Let's see if there are any other dates with a value of '0'.
df.loc[df['IUMBV45'] == 0]
IUMBV34 | IUMBV37 | IUMBV42 | IUMBV45 | |
---|---|---|---|---|
DATE | ||||
2000-02-29 | 6.79 | 6.8 | 7.11 | 0.0 |
It happens only once. For sake of simplicity of this tutorial, let's not worry too much about it and just get rid of this one occurrence. (We did report this anomaly to the Bank of England, so perhaps it is not there anymore when you run the code).
# Remove the value by setting it to None
df.loc['2000-02-29','IUMBV45'] = None
# Finally, check the statistics of the table
df.describe()
IUMBV34 | IUMBV37 | IUMBV42 | IUMBV45 | |
---|---|---|---|---|
count | 225.000000 | 225.000000 | 225.000000 | 166.000000 |
mean | 3.955422 | 4.305067 | 4.590000 | 4.998976 |
std | 1.554679 | 1.492455 | 1.389015 | 1.326059 |
min | 1.350000 | 1.600000 | 1.900000 | 2.660000 |
25% | 2.540000 | 2.960000 | 3.440000 | 3.475000 |
50% | 4.060000 | 4.650000 | 4.980000 | 5.570000 |
75% | 5.160000 | 5.470000 | 5.680000 | 5.890000 |
max | 6.790000 | 6.800000 | 7.110000 | 7.290000 |
Excellent, the data does seem to make sense. Longer mortgages normally have higher rates than shorter ones. The mean interest rates from 2, 3, 5 and 10 mortgages (left to right respectively) are indeed increasing, which we expected. To conclude, let's visualize this data.
# IPython's magic to plot inline
%matplotlib inline
df.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1162eeeb8>
you enjoyed this tutorial!
About the author:
Joris H., Python & open source enthusiast. Entrepreneur @ PythonSherpa - https://www.pythonsherpa.com