Pandas meets probabilities

Showcasing Pandas on-the-fly computation capabilities to calculate probabilities on timeseries.

Posted by dr. Etienne B. Roesch on April 19, 2017

You may find interesting:


New Notebook33

Showcasing Jupyter uNotebook Translator Layout


New Notebook

Showcasing Jupyter uNotebook Translator Layout

tl;dr Pandas is extremely good at manipulating data. It can also compute new information on the fly. In this example, I compute the probability that a variable (called connection_status) be of values {-1, 0, 1, 2} given past values in a timeseries, for the same half hour, every week on the same day.

The problem: Given timeseries data representing the connection status for a device, coded {-1, 0, 1, 2}, what is the probability that the device be in any given state, for a given day and a given half hour?

The solution: I first extract the data and use Pandas to chunk and label the data for any given half hour in the past, and then use this subset of data to compute the probability for all four possible states. Plotting this with stacked histogram gives a nice representation of how, for a given half hour, the device can be in several states.

# Import all libraries needed for the tutorial
import pandas as pd
import pandasql as ps
import numpy as np
import matplotlib.pyplot as plt
import sys # Only needed to determine Python version number
import matplotlib # Only needed to determine Matplotlib version number

from matplotlib.pylab import rcParams

# Enable inline plotting
%matplotlib inline

rcParams['figure.figsize'] = 10,10

print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('Matplotlib version ' + matplotlib.__version__)

pysqldf = lambda q: ps.sqldf(q, globals())
dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

def add_half_hour(x):
    """Returns half hour indices given a timestamp."""
    x['half_hour'] = (x.index.hour * 60 + x.index.minute) / 30
    return x

def calc_probabilities(s):
    """Returns a probability."""
    temp = s.value_counts()
    return temp/temp.sum()

Python version 2.7.12 |Continuum Analytics, Inc.| (default, Jul 2 2016, 17:43:17)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)]
Pandas version 0.19.2
Matplotlib version 2.0.0

%%time
# Read data and create index as timestamps on-the-fly
logging_data = pd.read_csv("./logging_data.csv", parse_dates=['server_timestamp'],
                           index_col='server_timestamp', date_parser=dateparse)

CPU times: user 13.9 s, sys: 204 ms, total: 14.1 s
Wall time: 14.1 s

# Displays some information about the dataset
#logging_data.info()
#print logging_data.head()
#print '\nData Types:'
#print logging_data.dtypes
#logging_data.index
%%time
#.. take one device
device_id = 2    # Note: device list starts at 0, and displayed from 1
grouped = logging_data.groupby('device_id')
group =  list(grouped)[device_id][1]
group.head()  # Shows first few rows

CPU times: user 75.3 ms, sys: 27.7 ms, total: 103 ms
Wall time: 103 ms

%%time
#.. filter just one day
group['day'] = group['cpc_timestamp'].apply(lambda x: dateparse(x).weekday())
myday_group = group[group['day'] == 2]
myday_group.head()

CPU times: user 1.53 s, sys: 8.54 ms, total: 1.54 s
Wall time: 1.54 s

%%time
#.. add half_hour
df2 = myday_group.groupby('device_id').apply(add_half_hour)

CPU times: user 15.5 ms, sys: 3.07 ms, total: 18.6 ms
Wall time: 17.3 ms

#df2.tail()
#ts = group['connection_status']
#ts.head(10)
#ts.hist()
#temp = ts.value_counts()
#full_dist = (temp/temp.sum()).sort_index()
#full_dist
%%time
#.. calculate probabilities
res = df2.groupby('half_hour')['connection_status'].apply(calc_probabilities).sort_index()

CPU times: user 53.2 ms, sys: 3.45 ms, total: 56.6 ms
Wall time: 56.6 ms

res

half_hour
0 0 0.040000
1 0.746000
2 0.214000
1 0 0.040000
1 0.810000
2 0.150000
2 0 0.040000
1 0.840000
2 0.120000
3 0 0.040000
1 0.840000
2 0.120000
4 0 0.040000
1 0.898000
2 0.062000
5 0 0.040000
1 0.900000
2 0.060000
6 0 0.040000
1 0.914000
2 0.046000
7 0 0.040000
1 0.960000
8 0 0.040000
1 0.960000
9 0 0.040000
1 0.960000
10 0 0.040000
1 0.960000
11 0 0.040080
...
38 0 0.139535
1 0.153488
2 0.706977
39 0 0.129310
1 0.219828
2 0.650862
40 0 0.116466
1 0.297189
2 0.586345
41 0 0.104478
1 0.447761
2 0.447761
42 0 0.104478
1 0.503731
2 0.391791
43 0 0.104478
1 0.559701
2 0.335821
44 0 0.104869
1 0.621723
2 0.273408
45 0 0.104478
1 0.671642
2 0.223881
46 0 0.080906
1 0.627832
2 0.291262
47 0 0.079755
1 0.644172
2 0.276074
Name: connection_status, dtype: float64

ax = res.unstack(level=-1).plot(kind='bar', stacked=True, title=('Probability Device #%i' % (device_id+1)))
#handles, labels = ax.get_legend_handles_labels()
ax.legend(['-1', '0', '1', '2'])
ax.set_ylabel("Probability(connection_status)")
ax.set_xlabel("Half hours for Tuesdays")

<matplotlib.text.Text at 0x113eaff50>

<matplotlib.figure.Figure at 0x118256c90>