I was notified yesterday that Helsinki district heating company Helen published their district heat generation data from 2015-2020. This is very nice indeed, since generation data has been notoriously hard to get as a researcher. I’m leaving here some notes on how to read the data in a Jupyter notebook.

See the announcement and link to dataset here, in Finnish.

In this post, we read in the Helen DH dataset and fix their timestamps. This content is also in Jupyter notebook format in GitHub

First, we request content from the server (link as of 2021-04-09) and inspect the data.

import requests
import pandas as pd

from io import StringIO

HELEN_DATA_URL = 'https://www.helen.fi/globalassets/helen-oy/vastuullisuus/hki_dh_2015_2020_a.csv'

content = requests.get(HELEN_DATA_URL).text
content[0:100]
'date_time;dh_MWh\r\n1.1.2015 1:00;936\r\n1.1.2015 2:00;924,2\r\n1.1.2015 3:00;926,3\r\n1.1.2015 4:00;942,1\r\n'

Fixing timestamps

It seems that there are two columns, first including a naive timestamp and second the generation in MWh per hour. Let’s read the data into a Pandas dataframe and then see to the timestamps

df = pd.read_csv(StringIO(content), sep=';', decimal=',', parse_dates=['date_time'], dayfirst=True)\
    .set_index('date_time')
df
dh_MWh
date_time
2015-01-01 01:00:00 936.000
2015-01-01 02:00:00 924.200
2015-01-01 03:00:00 926.300
2015-01-01 04:00:00 942.100
2015-01-01 05:00:00 957.100
... ...
2020-12-31 19:00:00 1191.663
2020-12-31 20:00:00 1155.601
2020-12-31 21:00:00 1149.378
2020-12-31 22:00:00 1097.287
2020-12-31 23:00:00 1110.583

52607 rows × 1 columns

Now, with naive timestamps there is most likely confusion with the daytime changes. Let’s two locations we know it happening, on 29th March 2015 03:00 and on 25th October 2020 at 4:00.

df.loc[((df.index >='2015-03-29 00:00') & (df.index < '2015-03-29 06:00'))
        | ((df.index >= '2020-10-25 00:00') & (df.index < '2020-10-25 06:00'))]
dh_MWh
date_time
2015-03-29 00:00:00 958.673
2015-03-29 01:00:00 930.965
2015-03-29 02:00:00 919.913
2015-03-29 04:00:00 913.885
2015-03-29 05:00:00 908.093
2020-10-25 00:00:00 871.322
2020-10-25 01:00:00 860.284
2020-10-25 02:00:00 848.808
2020-10-25 03:00:00 851.583
2020-10-25 03:00:00 842.317
2020-10-25 04:00:00 819.973
2020-10-25 05:00:00 820.491

So there it is, time hops two hours between 2015-03-29 02:00:00 and 2015-03-29 04:00:00, and two similar timestamps exist at 2020-10-25 03:00:00. We will fix that next with tz_localize, see documentation. Ambiguous parameter ‘infer’ attempts to sort out autumn daylight saving time transition based on appearance order.

df.index = df.index.tz_localize(tz='Europe/Helsinki', ambiguous='infer')
df.loc[((df.index >='2015-03-29 00:00') & (df.index < '2015-03-29 06:00'))
        | ((df.index >= '2020-10-25 00:00') & (df.index < '2020-10-25 06:00'))]
dh_MWh
date_time
2015-03-29 00:00:00+02:00 958.673
2015-03-29 01:00:00+02:00 930.965
2015-03-29 02:00:00+02:00 919.913
2015-03-29 04:00:00+03:00 913.885
2015-03-29 05:00:00+03:00 908.093
2020-10-25 00:00:00+03:00 871.322
2020-10-25 01:00:00+03:00 860.284
2020-10-25 02:00:00+03:00 848.808
2020-10-25 03:00:00+03:00 851.583
2020-10-25 03:00:00+02:00 842.317
2020-10-25 04:00:00+02:00 819.973
2020-10-25 05:00:00+02:00 820.491

It seems correct now! Let’s check.

len(df.index.unique()) == len(df)
True

Finally, we will check that there is one hour difference between all rows

index_steps = df.index.to_frame().diff()
index_steps.loc[index_steps['date_time'] != pd.Timedelta('1 hours')]
date_time
date_time
2015-01-01 01:00:00+02:00 NaT

Basic plotting

Let’s have a look at the data, plotted fully, and then a subplot

df.plot(figsize=(20,6))
<AxesSubplot:xlabel='date_time'>

png

df.loc[(df.index > '2019-01-01') & (df.index < '2019-03-01')].plot(figsize=(20,6))
<AxesSubplot:xlabel='date_time'>

png

Next steps

The data seems to be in order. Next one would like to plot correlations with the temperature, build models et cetera. This will be left for another time.