Kaggle - Predicting Bike Sharing Demand¶
Problem Statement
Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people can rent a bike from one location and return it to the same or a different location as needed. Bikes can be rented through a membership (mostly regular users) or on demand (mostly casual users).
In the Kaggle Bike Sharing Demand competition, participants were asked to forecast bike rental demand for the bike sharing program in Washington, D.C. based on historical usage patterns in relation to weather, time and other data.
Evaluation
Submissions are evaluated on the Root Mean Squared Logarithmic Error (RMSLE). The RMSLE is calculated as
\[ \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log(p_i + 1) - \log(a_i + 1) \right)^2 } \]
where:
\(n\) is the number of hours in the test set
\(p_i\) is your predicted count
\(a_i\) is the actual count
\(\log(x)\) is the natural logarithm
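As a quick sanity check of the metric, here is a minimal NumPy sketch of the same computation (the notebook wraps an equivalent version in get_rmsle later on):
import numpy as np
def rmsle(y_pred, y_actual):
    # np.log1p(x) equals log(x + 1), which keeps zero counts finite
    diff = np.log1p(y_pred) - np.log1p(y_actual)
    return np.sqrt(np.mean(np.square(diff)))
rmsle(np.array([10.0, 0.0, 5.0]), np.array([12.0, 1.0, 5.0]))  # ~0.41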
I have compiled this notebook by collating theory and code from other blogs along with my own implementations. Wherever I have copied theory or code verbatim, I have highlighted the reference (e.g. [1]) to give the authors their due credit. Please refer to the References section at the end of the notebook for the original blogs of the respective authors.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', 30)
#sns.set_style("whitegrid")
#plt.style.use('bmh')
plt.style.use('seaborn-whitegrid')
# this allows plots to appear directly in the notebook
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
1. Hypothesis Generation¶
Before exploring the data to understand the relationships between variables, it is recommended that we focus on hypothesis generation first. This might sound counter-intuitive for solving a data science problem, but before exploring the data we should spend some time thinking about the business problem, gaining domain knowledge and maybe gaining first-hand experience of the problem.
How does it help? This practice usually helps us form better features later on, which are not biased by the data available in the dataset. At this stage, we are expected to apply structured thinking, i.e. a thinking process which takes into consideration all the possible aspects of a particular problem.
Here are some hypotheses about what could influence the demand for bikes:
Hourly trend: There should be high demand during office hours; early mornings and late evenings may show a different pattern (recreational cyclists), and demand should be low between 10:00 pm and 4:00 am.
Daily trend: Registered users demand more bikes on weekdays than on weekends or holidays.
Rain: The demand for bikes will be lower on a rainy day than on a sunny day. Similarly, higher humidity should lower demand and vice versa.
Temperature: In India, temperature has a negative correlation with bike demand. But, after looking at Washington's temperature profile, we presume it may have a positive correlation here.
Pollution: If the pollution level in a city starts soaring, people may start using bikes (influenced by government/company policies or increased awareness).
Time: Total demand should have a higher contribution from registered users than from casual users, because the registered user base grows over time.
Traffic: Traffic can be positively correlated with bike demand; heavier traffic may push people towards bikes rather than other road transport such as cars or taxis.
2. Understanding the Data Set¶
The dataset contains hourly rental data spanning two years (2011 and 2012). The training set covers the first 19 days of each month; the test set runs from the 20th day to the end of the month. We are required to predict the total count of bikes rented during each hour covered by the test set.
In the training set, bike demand is given separately for registered and casual users, and the sum of both is given as count.
The training set has 12 variables (see below) and the test set has 9 (excluding registered, casual and count).
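A quick way to confirm this layout (a small check assuming the same data/train.csv and data/test.csv files loaded later in the notebook) is to look at the day of month present in each file:
import pandas as pd
train_days = pd.to_datetime(pd.read_csv('data/train.csv')['datetime']).dt.day
test_days = pd.to_datetime(pd.read_csv('data/test.csv')['datetime']).dt.day
print(train_days.min(), train_days.max())  # expected: 1 and 19
print(test_days.min(), test_days.max())    # expected: 20 and up to 31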
Independent Variables
datetime: date and hour in "mm/dd/yyyy hh:mm" format
season: four categories -> 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday: whether the day is a holiday or not (1/0)
workingday: whether the day is neither a weekend nor a holiday (1/0)
weather: four categories of weather
1 -> Clear, Few clouds, Partly cloudy
2 -> Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3 -> Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4 -> Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp: hourly temperature in Celsius
atemp: "feels like" temperature in Celsius
humidity: relative humidity
windspeed: wind speed
Dependent Variables
registered: number of registered users
casual: number of non-registered (casual) users
count: total number of rentals (registered + casual)
3. Importing Data Set and Basic Data Exploration¶
train_df = pd.read_csv('data/train.csv')
train_df['data_set'] = 'train'
train_df.head(5)
datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | data_set | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 | train |
1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 | train |
2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 | train |
3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 | train |
4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 | train |
test_df = pd.read_csv('data/test.csv')
test_df['data_set'] = 'test'
test_df.head(5)
datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | data_set | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2011-01-20 00:00:00 | 1 | 0 | 1 | 1 | 10.66 | 11.365 | 56 | 26.0027 | test |
1 | 2011-01-20 01:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 | test |
2 | 2011-01-20 02:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 | test |
3 | 2011-01-20 03:00:00 | 1 | 0 | 1 | 1 | 10.66 | 12.880 | 56 | 11.0014 | test |
4 | 2011-01-20 04:00:00 | 1 | 0 | 1 | 1 | 10.66 | 12.880 | 56 | 11.0014 | test |
3.1. Combine both Train and Test Data sets (to understand the distribution of the independent variables together).¶
# combine train and test data into one df
test_df['registered'] = 0
test_df['casual'] = 0
test_df['count'] = 0
all_df = pd.concat([train_df, test_df])
all_df.head(5)
atemp | casual | count | data_set | datetime | holiday | humidity | registered | season | temp | weather | windspeed | workingday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 14.395 | 3 | 16 | train | 2011-01-01 00:00:00 | 0 | 81 | 13 | 1 | 9.84 | 1 | 0.0 | 0 |
1 | 13.635 | 8 | 40 | train | 2011-01-01 01:00:00 | 0 | 80 | 32 | 1 | 9.02 | 1 | 0.0 | 0 |
2 | 13.635 | 5 | 32 | train | 2011-01-01 02:00:00 | 0 | 80 | 27 | 1 | 9.02 | 1 | 0.0 | 0 |
3 | 14.395 | 3 | 13 | train | 2011-01-01 03:00:00 | 0 | 75 | 10 | 1 | 9.84 | 1 | 0.0 | 0 |
4 | 14.395 | 0 | 1 | train | 2011-01-01 04:00:00 | 0 | 75 | 1 | 1 | 9.84 | 1 | 0.0 | 0 |
all_df.tail(5)
atemp | casual | count | data_set | datetime | holiday | humidity | registered | season | temp | weather | windspeed | workingday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6488 | 12.880 | 0 | 0 | test | 2012-12-31 19:00:00 | 0 | 60 | 0 | 1 | 10.66 | 2 | 11.0014 | 1 |
6489 | 12.880 | 0 | 0 | test | 2012-12-31 20:00:00 | 0 | 60 | 0 | 1 | 10.66 | 2 | 11.0014 | 1 |
6490 | 12.880 | 0 | 0 | test | 2012-12-31 21:00:00 | 0 | 60 | 0 | 1 | 10.66 | 1 | 11.0014 | 1 |
6491 | 13.635 | 0 | 0 | test | 2012-12-31 22:00:00 | 0 | 56 | 0 | 1 | 10.66 | 1 | 8.9981 | 1 |
6492 | 13.635 | 0 | 0 | test | 2012-12-31 23:00:00 | 0 | 65 | 0 | 1 | 10.66 | 1 | 8.9981 | 1 |
# lowercase column names
all_df.columns = map(str.lower, all_df.columns)
all_df.columns
Index(['atemp', 'casual', 'count', 'data_set', 'datetime', 'holiday',
'humidity', 'registered', 'season', 'temp', 'weather', 'windspeed',
'workingday'],
dtype='object')
# parse datetime column & add new time-related columns
dt = pd.DatetimeIndex(all_df['datetime'])
all_df.set_index(dt, inplace=True)
all_df.head(5)
atemp | casual | count | data_set | datetime | holiday | humidity | registered | season | temp | weather | windspeed | workingday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
datetime | |||||||||||||
2011-01-01 00:00:00 | 14.395 | 3 | 16 | train | 2011-01-01 00:00:00 | 0 | 81 | 13 | 1 | 9.84 | 1 | 0.0 | 0 |
2011-01-01 01:00:00 | 13.635 | 8 | 40 | train | 2011-01-01 01:00:00 | 0 | 80 | 32 | 1 | 9.02 | 1 | 0.0 | 0 |
2011-01-01 02:00:00 | 13.635 | 5 | 32 | train | 2011-01-01 02:00:00 | 0 | 80 | 27 | 1 | 9.02 | 1 | 0.0 | 0 |
2011-01-01 03:00:00 | 14.395 | 3 | 13 | train | 2011-01-01 03:00:00 | 0 | 75 | 10 | 1 | 9.84 | 1 | 0.0 | 0 |
2011-01-01 04:00:00 | 14.395 | 0 | 1 | train | 2011-01-01 04:00:00 | 0 | 75 | 1 | 1 | 9.84 | 1 | 0.0 | 0 |
# indexing by datetime lets us select rows by specifying time ranges
all_df['2011-01-01 01:00:00':'2011-01-01 03:00:00']
atemp | casual | count | data_set | datetime | holiday | humidity | registered | season | temp | weather | windspeed | workingday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
datetime | |||||||||||||
2011-01-01 01:00:00 | 13.635 | 8 | 40 | train | 2011-01-01 01:00:00 | 0 | 80 | 32 | 1 | 9.02 | 1 | 0.0 | 0 |
2011-01-01 02:00:00 | 13.635 | 5 | 32 | train | 2011-01-01 02:00:00 | 0 | 80 | 27 | 1 | 9.02 | 1 | 0.0 | 0 |
2011-01-01 03:00:00 | 14.395 | 3 | 13 | train | 2011-01-01 03:00:00 | 0 | 75 | 10 | 1 | 9.84 | 1 | 0.0 | 0 |
# find missing values in dataset if any
all_df.isnull().values.sum()
0
3.2. Understand the distribution of numerical variables and generate a frequency table for numeric variables.¶
plt.figure(figsize=(20,15))
plt.subplot(421)
all_df['season'].plot.hist(bins=10, color='blue', label='Histogram of Season', edgecolor='black')
plt.legend(loc='best')
plt.subplot(422)
all_df['weather'].plot.hist(bins=10, color='green', label='Histogram of Weather', edgecolor='black')
plt.legend(loc='best')
plt.subplot(423)
all_df['humidity'].plot.hist(bins=10, color='orange', label='Histogram of Humidity', edgecolor='black')
plt.legend(loc='best')
plt.subplot(424)
all_df['holiday'].plot.hist(bins=10, color='pink', label='Histogram of Holiday', edgecolor='black')
plt.legend(loc='best')
plt.subplot(425)
all_df['workingday'].plot.hist(bins=10, color='red', label='Histogram of Working Day', edgecolor='black')
plt.legend(loc='best')
plt.subplot(426)
all_df['temp'].plot.hist(bins=10, color='yellow', label='Histogram of Temperature', edgecolor='black')
plt.legend(loc='best')
plt.subplot(427)
all_df['atemp'].plot.hist(bins=10, color='cyan', label='Histogram of Feels Like Temp', edgecolor='black')
plt.legend(loc='best')
plt.subplot(428)
all_df['windspeed'].plot.hist(bins=10, color='purple', label='Histogram of Windspeed', edgecolor='black')
plt.legend(loc='best')
plt.tight_layout();
A few inferences can be drawn by looking at these histograms:
Season has four categories of almost equal distribution.
Weather 1 has the highest contribution, i.e. the weather is mostly clear.
As expected, most days are working days, and the holiday variable tells a consistent story. You can use the code above to look at the distributions in detail. Here you could also derive a weekend flag from holiday and workingday: if both are zero, the day must be a weekend (see the short sketch below).
Variables temp, atemp, humidity and windspeed look roughly normally distributed.
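As a small sketch of that weekend flag (the weekend column is illustrative and not part of the original dataset):
# a day that is neither a holiday nor a working day must be a weekend
all_df['weekend'] = ((all_df['holiday'] == 0) & (all_df['workingday'] == 0)).astype(int)
all_df['weekend'].value_counts()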
# logarithmic transformation of dependent cols
# (adding 1 first so that 0 values don't become -inf)
for col in ['casual', 'registered', 'count']:
all_df['%s_log' % col] = np.log(all_df[col] + 1)
all_df['date'] = dt.date
all_df['day'] = dt.day
all_df['month'] = dt.month
all_df['year'] = dt.year
all_df['hour'] = dt.hour
all_df['dow'] = dt.dayofweek
all_df['woy'] = dt.weekofyear
# How many columns have null values
all_df.isnull().sum()
atemp 0
casual 0
count 0
data_set 0
datetime 0
holiday 0
humidity 0
registered 0
season 0
temp 0
weather 0
windspeed 0
workingday 0
casual_log 0
registered_log 0
count_log 0
date 0
day 0
month 0
year 0
hour 0
dow 0
woy 0
dtype: int64
# interpolate weather, temp, atemp, humidity, windspeed
all_df["weather"] = all_df["weather"].interpolate(method='time').apply(np.round)
all_df["temp"] = all_df["temp"].interpolate(method='time')
all_df["atemp"] = all_df["atemp"].interpolate(method='time')
all_df["humidity"] = all_df["humidity"].interpolate(method='time').apply(np.round)
all_df["windspeed"] = all_df["windspeed"].interpolate(method='time')
# add a count_season column using join
by_season = all_df[all_df['data_set'] == 'train'].copy().groupby(['season'])[['count']].agg(sum)
by_season.columns = ['count_season']
all_df = all_df.join(by_season, on='season')
print(by_season)
count_season
season
1 312498
2 588282
3 640662
4 544034
by_season.plot(kind='barh')
plt.grid(True)
plt.show();
def get_day(day_start):
day_end = day_start + pd.offsets.DateOffset(hours=23)
return pd.date_range(day_start, day_end, freq="H")
# tax day
all_df.loc[get_day(pd.datetime(2011, 4, 15)), "workingday"] = 1
all_df.loc[get_day(pd.datetime(2012, 4, 16)), "workingday"] = 1
# thanksgiving friday
all_df.loc[get_day(pd.datetime(2011, 11, 25)), "workingday"] = 0
all_df.loc[get_day(pd.datetime(2012, 11, 23)), "workingday"] = 0
# tax day
all_df.loc[get_day(pd.datetime(2011, 4, 15)), "holiday"] = 0
all_df.loc[get_day(pd.datetime(2012, 4, 16)), "holiday"] = 0
# thanksgiving friday
all_df.loc[get_day(pd.datetime(2011, 11, 25)), "holiday"] = 1
all_df.loc[get_day(pd.datetime(2012, 11, 23)), "holiday"] = 1
#storms
all_df.loc[get_day(pd.datetime(2012, 5, 21)), "holiday"] = 1
#tornado
all_df.loc[get_day(pd.datetime(2012, 6, 1)), "holiday"] = 1
by_hour = all_df[all_df['data_set'] == 'train'].copy().groupby(['hour', 'workingday'])['count'].agg('sum').unstack()
by_hour.head(10)
workingday | 0 | 1 |
---|---|---|
hour | ||
0 | 13633 | 11455 |
1 | 10384 | 4988 |
2 | 7654 | 2605 |
3 | 3666 | 1425 |
4 | 1230 | 1602 |
5 | 1280 | 7655 |
6 | 2719 | 31979 |
7 | 6318 | 90650 |
8 | 15380 | 149680 |
9 | 25324 | 75586 |
Hourly trend: There must be high demand during office timings. Early morning and late evening can have different trend (cyclist) and low demand during 10:00 pm to 4:00 am.
# rentals by hour, split by working day (or not)
by_hour.plot(kind='bar', figsize=(15,5), width=0.8);
plt.grid(True)
plt.tight_layout();
4. Hypothesis Testing (using multivariate analysis)¶
So far, we have developed a fair understanding of the data set. Now, let's test the hypotheses we generated earlier, along with some additional ones suggested by the dataset. Let's test them one by one:
Hourly trend: We don't have an 'hour' variable in the data provided, but we extracted it from the datetime column. Let's plot the hourly trend of count and check whether our hypothesis holds. We first separate the train and test sets from the combined one.
train_df = all_df[all_df['data_set'] == 'train'].copy()
#train_df.boxplot(column='count', by='hour', figsize=(15,5))
#plt.ylabel('Count of Users')
#plt.title("Boxplot of Count grouped by hour")
#plt.suptitle("") # get rid of the pandas autogenerated title
fig, ax = plt.subplots(figsize=(18, 5))
sns.boxplot(x=train_df['hour'], y=train_df['count'], ax=ax)
ax.set_ylabel('Count of Users')
ax.set_title("Boxplot of Count grouped by hour");
#plt.suptitle("") # get rid of the pandas autogenerated title
Above, we can see the trend of bike demand over the hours of the day. We can quickly segregate the bike demand into three categories:
High: 7-9 and 17-19 hours
Average: 10-16 hours
Low: 0-6 and 20-23 hours
So far we have analyzed the distribution of total bike demand (a short sketch of this bucketing follows).
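If we wanted to encode these buckets as a feature, a minimal sketch could look like this (the hour_category column is purely illustrative and is not used by the models trained later):
def hour_category(h):
    # buckets taken from the boxplot above
    if 7 <= h <= 9 or 17 <= h <= 19:
        return 'high'
    if 10 <= h <= 16:
        return 'average'
    return 'low'
train_df['hour_category'] = train_df['hour'].map(hour_category)
train_df.groupby('hour_category')['count'].mean()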
Let’s look at the distribution of registered and casual users separately.
Good Weather is most frequent in Fall[5]¶
season_map = {1:'Spring', 2:'Summer', 3:'Fall', 4:'Winter'}
good_weather = all_df[all_df['weather'] == 1][['hour', 'season']].copy()
data = pd.DataFrame({'count' : good_weather.groupby(["hour","season"]).size()}).reset_index()
data['season'] = data['season'].map(lambda d : season_map[d])
fig, ax = plt.subplots(figsize=(18, 5))
sns.pointplot(x=data["hour"], y=data["count"], hue=data["season"], ax=ax)
ax.set(xlabel='Hour Of The Day', ylabel='Good Weather Count', title="Good Weather By Hour Of The Day Across Season");
Normal Weather happens most frequently in Spring[5]¶
season_map = {1:'Spring', 2:'Summer', 3:'Fall', 4:'Winter'}
normal_weather = all_df[all_df['weather'] == 2][['hour', 'season']].copy()
data = pd.DataFrame({'count' : normal_weather.groupby(["hour","season"]).size()}).reset_index()
data['season'] = data['season'].map(lambda d : season_map[d])
data.sample(n=5)
hour | season | count | |
---|---|---|---|
50 | 12 | Fall | 12 |
46 | 11 | Fall | 11 |
3 | 0 | Winter | 11 |
63 | 15 | Winter | 19 |
33 | 8 | Summer | 18 |
fig, ax = plt.subplots(figsize=(18, 5))
sns.pointplot(x=data["hour"], y=data["count"], hue=data["season"], ax=ax)
ax.set(xlabel='Hour Of The Day', ylabel='Normal Weather Count', title="Normal Weather By Hour Of The Day Across Season");
data = pd.pivot_table(data, values='count', columns='season', index='hour')
data.sample(5)
season | Fall | Spring | Summer | Winter |
---|---|---|---|---|
hour | ||||
18 | 14 | 18 | 19 | 18 |
4 | 8 | 14 | 19 | 16 |
15 | 15 | 16 | 16 | 19 |
12 | 12 | 12 | 14 | 17 |
11 | 11 | 11 | 8 | 17 |
fig, ax = plt.subplots(figsize=(12, 5))
data.plot.area(stacked=False, ax=ax)
ax.set(xlabel='Hour Of The Day', ylabel='Normal Weather Count', title="Normal Weather By Hour Of The Day Across Season");
Bad Weather happens most frequently in Summer & Winter[5]¶
season_map = {1:'Spring', 2:'Summer', 3:'Fall', 4:'Winter'}
bad_weather = all_df[all_df['weather'] == 3][['hour', 'season']].copy()
data = pd.DataFrame({'count' : bad_weather.groupby(["hour","season"]).size()}).reset_index()
data['season'] = data['season'].map(lambda d : season_map[d])
fig, ax = plt.subplots(figsize=(18, 5))
sns.pointplot(x=data["hour"], y=data["count"], hue=data["season"], ax=ax)
ax.set(xlabel='Hour Of The Day', ylabel='Bad Weather Count', title="Bad Weather By Hour Of The Day Across Season");
Bikes are rented more in Good Weather and much less in Bad Weather¶
weather_map = {1:'Good', 2:'Normal', 3:'Bad', 4:'Worse'}
data = pd.DataFrame(train_df.groupby(["hour","weather"], sort=True)["count"].mean()).reset_index()
data['weather'] = data['weather'].map(lambda d : weather_map[d])
fig, ax = plt.subplots(figsize=(18, 5))
sns.pointplot(x=data["hour"], y=data["count"], hue=data["weather"], ax=ax)
ax.set(xlabel='Hour Of The Day', ylabel='Users Count', title="Average Users Count By Hour Of The Day Across Weather");
Bikes are rented more in Fall and much less in Spring¶
season_map = {1:'Spring', 2:'Summer', 3:'Fall', 4:'Winter'}
data = pd.DataFrame({'mean':train_df.groupby(["hour","season"], sort=True)["count"].mean()}).reset_index()
data['season'] = data['season'].map(lambda d : season_map[d])
data.sample(n=5)
hour | season | mean | |
---|---|---|---|
48 | 12 | Spring | 154.412281 |
92 | 23 | Spring | 45.333333 |
27 | 6 | Winter | 82.254386 |
89 | 22 | Summer | 154.192982 |
1 | 0 | Summer | 58.473684 |
fig, ax = plt.subplots(figsize=(18, 5))
sns.pointplot(x=data["hour"], y=data["mean"], hue=data["season"], ax=ax)
ax.set(xlabel='Hour Of The Day', ylabel='Users Count', title="Average Users Count By Hour Of The Day Across Season");
Bikes are rented mostly for Morning/Evening commutes on Weekdays, and mostly Daytime rides on Weekends¶
day_map = {0:'Monday', 1:'Tuesday', 2:'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'}
hueOrder = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]
data = pd.DataFrame({'mean':train_df.groupby(["hour","dow"], sort=True)["count"].mean()}).reset_index()
data['dow'] = data['dow'].map(lambda d : day_map[d])
data.sample(n=5)
hour | dow | mean | |
---|---|---|---|
103 | 14 | Saturday | 398.409091 |
134 | 19 | Tuesday | 356.123077 |
55 | 7 | Sunday | 34.742424 |
78 | 11 | Tuesday | 145.609375 |
67 | 9 | Friday | 262.406250 |
fig, ax = plt.subplots(figsize=(18, 5))
sns.pointplot(x=data["hour"], y=data["mean"], hue=data["dow"], hue_order=hueOrder, ax=ax)
ax.set(xlabel='Hour Of The Day', ylabel='Users Count', title="Average Users Count By Hour Of The Day Across Weekdays");
Renting patterns of Bikes are significantly different between Registered and Casual Users¶
#fig, axs = plt.subplots(1, 2, figsize=(15,5), sharex=False, sharey=False)
#train_df.boxplot(column='casual', by='hour', ax=axs[0])
#axs[0].set_ylabel('casual users')
#axs[0].set_title('')
#train_df.boxplot(column='registered', by='hour', ax=axs[1])
#axs[1].set_ylabel('registered users')
#axs[1].set_title('')
fig, axs = plt.subplots(1, 2, figsize=(18,5), sharex=False, sharey=False)
sns.boxplot(x='hour', y='casual', data=train_df, ax=axs[0])
axs[0].set_ylabel('casual users')
axs[0].set_title('')
sns.boxplot(x='hour', y='registered', data=train_df, ax=axs[1])
axs[1].set_ylabel('registered users')
axs[1].set_title('');
train_df[["hour","casual","registered"]].head(5)
hour | casual | registered | |
---|---|---|---|
datetime | |||
2011-01-01 00:00:00 | 0 | 3 | 13 |
2011-01-01 01:00:00 | 1 | 8 | 32 |
2011-01-01 02:00:00 | 2 | 5 | 27 |
2011-01-01 03:00:00 | 3 | 3 | 10 |
2011-01-01 04:00:00 | 4 | 0 | 1 |
pd.melt(train_df[["hour","casual","registered"]], id_vars=['hour'], value_vars=['casual', 'registered'], var_name='usertype', value_name='count').head(5)
hour | usertype | count | |
---|---|---|---|
0 | 0 | casual | 3 |
1 | 1 | casual | 8 |
2 | 2 | casual | 5 |
3 | 3 | casual | 3 |
4 | 4 | casual | 0 |
fig, ax = plt.subplots(figsize=(18, 5))
train_df_melt = pd.melt(train_df[["hour","casual","registered"]], id_vars=['hour'], value_vars=['casual', 'registered'], var_name='usertype', value_name='count')
data = pd.DataFrame(train_df_melt.groupby(["hour", "usertype"], sort=True)["count"].mean()).reset_index()
sns.pointplot(x=data["hour"], y=data["count"], hue=data["usertype"], hue_order=["casual","registered"], ax=ax)
ax.set(xlabel='Hour Of The Day', ylabel='Users Count', title='Average Users Count By Hour Of The Day Across User Type');
Above we can see that registered users follow a trend similar to the total count, whereas casual users follow a different trend. Thus, we can say that 'hour' is a significant variable and our hypothesis is 'true'.
We notice a lot of outliers while plotting the counts of registered and casual users. These values are not generated due to error, so we consider them natural outliers. They might be the result of groups of people (who are not registered) taking up cycling. To treat such outliers, we will use a logarithmic transformation. Let's look at a similar plot after the log transformation.
train_df = train_df.assign(log_count = lambda df : np.log(train_df['count']))
fig, ax = plt.subplots(figsize=(18, 5))
sns.boxplot(x='hour', y='log_count', data=train_df, ax=ax)
ax.set(ylabel='log(count) of Users',title='Boxplot of Log of Count grouped by hour')
#plt.suptitle("") # get rid of the pandas autogenerated title
train_df.drop(['log_count'], axis = 1, inplace=True);
On Workdays most Bikes are rented on Warm Mornings and Evenings[2]¶
[4]When graphing a categorical variable vs. a continuous variable, it can be useful to create a scatter plot to visually examine distributions. Together with a box plot, it will allow us to see the distributions of our variables. Unfortunately, if our points occur close together, we will get a very uninformative smear. One way of making the scatter plot work is by adding jitter. With the jitter, a random amount is added or subtracted to each of the variables along the categorical axis. Where before, we may have had a categorical value vector that looked something like [1,2,2,2,1,3], post-jitter, they would look something like [1.05, 1.96, 2.05, 2, .97, 2.95]. Each value has had somewhere between [-0.05,0.05] added to it. This then means that when we plot our variables, we’ll see a cloud of points that represent our distribution, rather than a long smear.
def hour_jitter(h):
#return h + ((np.random.randint(low=0, high=9, size=1)[0] - 4) / 10)
return h + np.random.uniform(-0.4, 0.4)
def hour_format(h):
    # format a 24-hour value as a 12-hour AM/PM label (0 -> 12:00 AM, 12 -> 12:00 PM)
    suffix = "AM" if h < 12 else "PM"
    return "{:02d}:00 {}".format(h % 12 or 12, suffix)
# jitter plot
import matplotlib.colors as mcolors
import matplotlib.cm as cm
# color_map = plt.get_cmap("jet")
color_map = mcolors.ListedColormap(list(["#5e4fa2", "#3288bd", "#66c2a5", "#abdda4", "#e6f598", "#fee08b", "#fdae61", "#f46d43", "#d53e4f", "#9e0142"]))
train_df['hour_jitter'] = train_df['hour'].map(hour_jitter)
train_df[train_df['workingday'] == 1].plot(kind="scatter", x='hour_jitter', y='count',
figsize=(18,6),
c='temp', cmap=color_map, colorbar=True,
sharex=False)
hours = np.unique(train_df['hour'].values)
hour_labels = [hour_format(h) for h in hours]
plt.xticks(hours, hour_labels, rotation='vertical');
train_df.drop('hour_jitter', axis=1, inplace=True);
Daily trend: Like hour, we generate a weekday variable from the datetime column and then plot it.
dayOfWeek={0:'Monday', 1:'Tuesday', 2:'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'}
all_df['weekday'] = all_df['dow'].map(dayOfWeek)
fig, axs = plt.subplots(1, 2, figsize=(15,5), sharex=False, sharey=False)
sns.boxplot(x='weekday', y='registered', data=all_df, ax=axs[0])
axs[0].set_ylabel('registered users')
axs[0].set_title('')
sns.boxplot(x='weekday', y='casual', data=all_df, ax=axs[1])
axs[1].set_ylabel('casual users')
axs[1].set_title('')
all_df.drop('weekday', axis=1, inplace=True);
Rain: We don’t have the ‘rain’ variable with us but have ‘weather’ which is sufficient to test our hypothesis. As per variable description, weather 3 represents light rain and weather 4 represents heavy rain. Take a look at the plot:
fig, axs = plt.subplots(1, 2, figsize=(15,5), sharex=False, sharey=False)
sns.boxplot(x='weather', y='registered', data=all_df, ax=axs[0])
axs[0].set_ylabel('registered users')
axs[0].set_title('')
sns.boxplot(x='weather', y='casual', data=all_df, ax=axs[1])
axs[1].set_ylabel('casual users')
axs[1].set_title('');
This clearly supports our hypothesis.
Temperature, Windspeed and Humidity: These are continuous variables, so we can look at their correlation coefficients to validate the hypotheses.
Correlation Between Count And Features¶
sub_df = train_df[['count', 'registered', 'casual', 'temp', 'atemp', 'humidity', 'windspeed', 'workingday', 'holiday']]
sub_df.corr()
count | registered | casual | temp | atemp | humidity | windspeed | workingday | holiday | |
---|---|---|---|---|---|---|---|---|---|
count | 1.000000 | 0.970948 | 0.690414 | 0.394454 | 0.389784 | -0.317371 | 0.101369 | 0.011965 | -0.008049 |
registered | 0.970948 | 1.000000 | 0.497250 | 0.318571 | 0.314635 | -0.265458 | 0.091052 | 0.120154 | -0.023038 |
casual | 0.690414 | 0.497250 | 1.000000 | 0.467097 | 0.462067 | -0.348187 | 0.092276 | -0.319864 | 0.040464 |
temp | 0.394454 | 0.318571 | 0.467097 | 1.000000 | 0.984948 | -0.064949 | -0.017852 | 0.033174 | 0.002969 |
atemp | 0.389784 | 0.314635 | 0.462067 | 0.984948 | 1.000000 | -0.043536 | -0.057473 | 0.027851 | -0.003455 |
humidity | -0.317371 | -0.265458 | -0.348187 | -0.064949 | -0.043536 | 1.000000 | -0.318607 | -0.011039 | 0.012114 |
windspeed | 0.101369 | 0.091052 | 0.092276 | -0.017852 | -0.057473 | -0.318607 | 1.000000 | 0.018454 | -0.000585 |
workingday | 0.011965 | 0.120154 | -0.319864 | 0.033174 | 0.027851 | -0.011039 | 0.018454 | 1.000000 | -0.213189 |
holiday | -0.008049 | -0.023038 | 0.040464 | 0.002969 | -0.003455 | 0.012114 | -0.000585 | -0.213189 | 1.000000 |
corrMatt = sub_df.corr()
mask = np.zeros_like(corrMatt)
mask[np.triu_indices_from(mask)] = True
fig, ax = plt.subplots(figsize=(15, 6))
sns.heatmap(corrMatt, mask=mask, vmax=.8, square=False, annot=True, ax=ax, linewidths=1);
Here are a few inferences we can draw from the correlation matrix above:
Variable temp is positively correlated with the dependent variables (the correlation is stronger for casual than for registered).
Variable atemp is highly correlated with temp.
Windspeed has a lower correlation with count than temp and humidity do.
Time: Let's extract the year of each observation from the datetime column and see the trend of bike demand over the years.
Distribution of data between Train and Test set based on Season¶
season_map = {1:'Spring', 2:'Summer', 3:'Fall', 4:'Winter'}
data = all_df[['data_set', 'season']].copy()
data['season'] = data['season'].map(lambda d : season_map[d])
sns.countplot(x="data_set", hue="season", data=data);
Distribution of data between Train and Test set based on Weather¶
weather_map = {1:'Good', 2:'Normal', 3:'Bad', 4:'Worse'}
data = all_df[['data_set', 'weather']].copy()
data['weather'] = data['weather'].map(lambda d : weather_map[d])
sns.countplot(x="data_set", hue="weather", data=data);
Distribution of count of users in the Train set based on Year¶
plt.figure(figsize=(8, 5))
sns.boxplot(x='year', y='count', data=train_df)
plt.ylabel('Count of Users')
plt.title("Boxplot of Count grouped by year");
We can see that 2012 has higher bike demand as compared to 2011.
Pollution & Traffic: We don't have variables related to these metrics in our data set, so we cannot test these hypotheses.
5. Feature Engineering¶
In addition to the existing independent variables, we will create new variables to improve the predictive power of the model. Earlier, we already generated new variables like hour, month, day and year.
# feature engineer a new column indicating whether it's a peak hour or not
all_df['peak'] = all_df[['hour', 'workingday']]\
    .apply(lambda df: 1 if ((df['workingday'] == 1 and (df['hour'] == 8 or 17 <= df['hour'] <= 18)) \
                            or (df['workingday'] == 0 and 10 <= df['hour'] <= 19)) else 0, axis = 1)
# hurricane sandy
all_df['holiday'] = all_df[['month', 'day', 'holiday', 'year']]\
    .apply(lambda df: 1 if (df['year'] == 2012 and df['month'] == 10 and df['day'] == 30) else df['holiday'], axis = 1)
# christmas and others
all_df['holiday'] = all_df[['month', 'day', 'holiday']]\
.apply(lambda df: 1 if (df['month'] == 12 and df['day'] in [24, 26, 31]) else df['holiday'], axis = 1)
all_df['workingday'] = all_df[['month', 'day', 'workingday']]\
.apply(lambda df: 0 if df['month'] == 12 and df['day'] in [24, 31] else df['workingday'], axis = 1)
# from histogram
all_df['ideal'] = all_df[['temp', 'windspeed']]\
.apply(lambda df: 1 if (df['temp'] > 27 and df['windspeed'] < 30) else 0, axis = 1)
all_df['sticky'] = all_df[['humidity', 'workingday']]\
.apply(lambda df: 1 if (df['workingday'] == 1 and df['humidity'] >= 60) else 0, axis = 1)
all_df.sample(10)
atemp | casual | count | data_set | datetime | holiday | humidity | registered | season | temp | weather | windspeed | workingday | casual_log | registered_log | count_log | date | day | month | year | hour | dow | woy | count_season | peak | ideal | sticky | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
datetime | |||||||||||||||||||||||||||
2012-08-30 02:00:00 | 29.545 | 0 | 0 | test | 2012-08-30 02:00:00 | 0 | 73 | 0 | 3 | 25.42 | 1 | 0.0000 | 1 | 0.000000 | 0.000000 | 0.000000 | 2012-08-30 | 30 | 8 | 2012 | 2 | 3 | 35 | 640662 | 0 | 0 | 1 |
2012-08-14 20:00:00 | 35.605 | 65 | 436 | train | 2012-08-14 20:00:00 | 0 | 74 | 371 | 3 | 30.34 | 1 | 15.0013 | 1 | 4.189655 | 5.918894 | 6.079933 | 2012-08-14 | 14 | 8 | 2012 | 20 | 1 | 33 | 640662 | 0 | 1 | 1 |
2012-05-01 04:00:00 | 25.000 | 1 | 8 | train | 2012-05-01 04:00:00 | 0 | 72 | 7 | 2 | 21.32 | 2 | 6.0032 | 1 | 0.693147 | 2.079442 | 2.197225 | 2012-05-01 | 1 | 5 | 2012 | 4 | 1 | 18 | 588282 | 0 | 0 | 1 |
2011-07-13 19:00:00 | 33.335 | 79 | 419 | train | 2011-07-13 19:00:00 | 0 | 79 | 340 | 3 | 28.70 | 1 | 7.0015 | 1 | 4.382027 | 5.831882 | 6.040255 | 2011-07-13 | 13 | 7 | 2011 | 19 | 2 | 28 | 640662 | 0 | 1 | 1 |
2012-02-05 09:00:00 | 11.365 | 9 | 92 | train | 2012-02-05 09:00:00 | 0 | 70 | 83 | 1 | 9.84 | 2 | 15.0013 | 0 | 2.302585 | 4.430817 | 4.532599 | 2012-02-05 | 5 | 2 | 2012 | 9 | 6 | 5 | 312498 | 0 | 0 | 0 |
2011-05-22 02:00:00 | 25.000 | 0 | 0 | test | 2011-05-22 02:00:00 | 0 | 94 | 0 | 2 | 21.32 | 1 | 7.0015 | 0 | 0.000000 | 0.000000 | 0.000000 | 2011-05-22 | 22 | 5 | 2011 | 2 | 6 | 20 | 588282 | 0 | 0 | 0 |
2011-11-11 22:00:00 | 15.910 | 13 | 91 | train | 2011-11-11 22:00:00 | 0 | 39 | 78 | 4 | 13.12 | 1 | 12.9980 | 0 | 2.639057 | 4.369448 | 4.521789 | 2011-11-11 | 11 | 11 | 2011 | 22 | 4 | 45 | 544034 | 0 | 0 | 0 |
2011-04-10 23:00:00 | 22.725 | 5 | 32 | train | 2011-04-10 23:00:00 | 0 | 88 | 27 | 2 | 18.86 | 1 | 23.9994 | 0 | 1.791759 | 3.332205 | 3.496508 | 2011-04-10 | 10 | 4 | 2011 | 23 | 6 | 14 | 588282 | 0 | 0 | 0 |
2012-05-07 15:00:00 | 31.060 | 76 | 262 | train | 2012-05-07 15:00:00 | 0 | 53 | 186 | 2 | 25.42 | 2 | 19.9995 | 1 | 4.343805 | 5.231109 | 5.572154 | 2012-05-07 | 7 | 5 | 2012 | 15 | 0 | 19 | 588282 | 0 | 0 | 0 |
2011-06-23 09:00:00 | 34.850 | 0 | 0 | test | 2011-06-23 09:00:00 | 0 | 70 | 0 | 3 | 30.34 | 2 | 8.9981 | 1 | 0.000000 | 0.000000 | 0.000000 | 2011-06-23 | 23 | 6 | 2011 | 9 | 3 | 25 | 640662 | 0 | 1 | 1 |
6. Training Models¶
We build machine learning models that predict demand for casual and registered users separately, and then combine the two predictions to generate the overall prediction for count.
# instead of randomly splitting our training data
# for cross validation, let's construct a framework that's more
# in line with how the data is divvied up for this competition
# (given first 19 days of each month, what is demand for remaining days)
# so, let's split our training data into 2 time contiguous datasets
# for fitting and validating our model (days 1-14 vs. days 15-19).
# also, since submissions are evaluated based on the
# root mean squared logarithmic error (RMSLE), let's replicate
# that computation as we test and tune our model.
train_df = all_df[all_df['data_set'] == 'train']
test_df = all_df[all_df['data_set'] == 'test']
def get_rmsle(y_pred, y_actual):
diff = np.log(y_pred + 1) - np.log(y_actual + 1)
mean_error = np.square(diff).mean()
return np.sqrt(mean_error)
def custom_train_valid_split(data, cutoff_day=15):
train = data[data['day'] <= cutoff_day]
valid = data[data['day'] > cutoff_day]
return train, valid
def prep_train_data(data, input_cols):
X = data[input_cols].values
y_r = data['registered_log'].values
y_c = data['casual_log'].values
return X, y_r, y_c
# predict on validation set & transform output back from log scale
def predict_on_validation_set(model, input_cols):
train, valid = custom_train_valid_split(train_df)
# prepare training & validation set
X_train, y_train_r, y_train_c = prep_train_data(train, input_cols)
X_valid, y_valid_r, y_valid_c = prep_train_data(valid, input_cols)
model_r = model.fit(X_train, y_train_r)
y_pred_r = np.exp(model_r.predict(X_valid)) - 1
model_c = model.fit(X_train, y_train_c)
y_pred_c = np.exp(model_c.predict(X_valid)) - 1
y_pred_comb = np.round(y_pred_r + y_pred_c)
y_pred_comb[y_pred_comb < 0] = 0
y_actual_comb = np.exp(y_valid_r) + np.exp(y_valid_c) - 2
rmsle = get_rmsle(y_pred_comb, y_actual_comb)
return (y_pred_comb, y_actual_comb, rmsle)
# predict on test set & transform output back from log scale
def predict_on_test_set(model, input_cols):
# prepare training set
X_train, y_train_r, y_train_c = prep_train_data(train_df, input_cols)
# prepare testing set
X_test = test_df[input_cols].values
model_c = model.fit(X_train, y_train_c)
y_pred_c = np.exp(model_c.predict(X_test)) - 1
model_r = model.fit(X_train, y_train_r)
y_pred_r = np.exp(model_r.predict(X_test)) - 1
# add casual & registered predictions together
y_pred_comb = np.round(y_pred_r + y_pred_c)
y_pred_comb[y_pred_comb < 0] = 0
return y_pred_comb
params = {'n_estimators': 1000, 'max_depth': 15, 'random_state': 0, 'min_samples_split' : 5, 'n_jobs': -1}
rf_model = RandomForestRegressor(**params)
rf_cols = [
'weather', 'temp', 'atemp', 'windspeed',
'workingday', 'season', 'holiday', 'sticky',
'hour', 'dow', 'woy', 'peak'
]
(rf_pred, rf_actual, rf_rmsle) = predict_on_validation_set(rf_model, rf_cols)
print(rf_rmsle)
0.434625932998
all_df[rf_cols].corr()
weather | temp | atemp | windspeed | workingday | season | holiday | sticky | hour | dow | woy | peak | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
weather | 1.000000 | -0.102640 | -0.105563 | 0.026226 | 0.042061 | -0.014524 | 0.038472 | 0.243523 | -0.020203 | -0.046424 | 0.009692 | 0.013506 |
temp | -0.102640 | 1.000000 | 0.987672 | -0.023125 | 0.069153 | 0.312025 | -0.101406 | -0.007074 | 0.137603 | -0.036220 | 0.198641 | 0.044723 |
atemp | -0.105563 | 0.987672 | 1.000000 | -0.062336 | 0.067594 | 0.319380 | -0.101800 | 0.004717 | 0.133750 | -0.038918 | 0.205561 | 0.042167 |
windspeed | 0.026226 | -0.023125 | -0.062336 | 1.000000 | -0.001937 | -0.149773 | 0.008593 | -0.187671 | 0.137252 | 0.003274 | -0.131613 | 0.054581 |
workingday | 0.042061 | 0.069153 | 0.067594 | -0.001937 | 1.000000 | 0.010879 | -0.091171 | 0.536900 | 0.002185 | -0.698028 | -0.025700 | 0.207653 |
season | -0.014524 | 0.312025 | 0.319380 | -0.149773 | 0.010879 | 1.000000 | -0.109490 | 0.095556 | -0.006117 | -0.007448 | 0.814302 | -0.001289 |
holiday | 0.038472 | -0.101406 | -0.101800 | 0.008593 | -0.091171 | -0.109490 | 1.000000 | -0.029345 | 0.007158 | -0.049770 | 0.104596 | -0.004713 |
sticky | 0.243523 | -0.007074 | 0.004717 | -0.187671 | 0.536900 | 0.095556 | -0.029345 | 1.000000 | -0.186289 | -0.399949 | 0.096672 | 0.050274 |
hour | -0.020203 | 0.137603 | 0.133750 | 0.137252 | 0.002185 | -0.006117 | 0.007158 | -0.186289 | 1.000000 | -0.002893 | -0.005437 | 0.124008 |
dow | -0.046424 | -0.036220 | -0.038918 | 0.003274 | -0.698028 | -0.007448 | -0.049770 | -0.399949 | -0.002893 | 1.000000 | 0.009368 | -0.148325 |
woy | 0.009692 | 0.198641 | 0.205561 | -0.131613 | -0.025700 | 0.814302 | 0.104596 | 0.096672 | -0.005437 | 0.009368 | 1.000000 | -0.007311 |
peak | 0.013506 | 0.044723 | 0.042167 | 0.054581 | 0.207653 | -0.001289 | -0.004713 | 0.050274 | 0.124008 | -0.148325 | -0.007311 | 1.000000 |
params = {'n_estimators': 150, 'max_depth': 5, 'random_state': 0, 'min_samples_leaf' : 10, 'learning_rate': 0.1, 'subsample': 0.7, 'loss': 'ls'}
gbm_model = GradientBoostingRegressor(**params)
gbm_cols = [
'weather', 'temp', 'atemp', 'humidity', 'windspeed',
'holiday', 'workingday', 'season',
'hour', 'dow', 'year', 'ideal', 'count_season',
]
(gbm_pred, gbm_actual, gbm_rmsle) = predict_on_validation_set(gbm_model, gbm_cols)
print(gbm_rmsle)
0.313091436534
all_df[gbm_cols].corr()
weather | temp | atemp | humidity | windspeed | holiday | workingday | season | hour | dow | year | ideal | count_season | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
weather | 1.000000 | -0.102640 | -0.105563 | 0.418130 | 0.026226 | 0.038472 | 0.042061 | -0.014524 | -0.020203 | -0.046424 | -0.019157 | -0.145407 | -0.051863 |
temp | -0.102640 | 1.000000 | 0.987672 | -0.069881 | -0.023125 | -0.101406 | 0.069153 | 0.312025 | 0.137603 | -0.036220 | 0.040913 | 0.727266 | 0.705172 |
atemp | -0.105563 | 0.987672 | 1.000000 | -0.051918 | -0.062336 | -0.101800 | 0.067594 | 0.319380 | 0.133750 | -0.038918 | 0.039222 | 0.701874 | 0.701434 |
humidity | 0.418130 | -0.069881 | -0.051918 | 1.000000 | -0.290105 | 0.014029 | 0.014316 | 0.150625 | -0.276498 | -0.035233 | -0.083546 | -0.141678 | 0.113724 |
windspeed | 0.026226 | -0.023125 | -0.062336 | -0.290105 | 1.000000 | 0.008593 | -0.001937 | -0.149773 | 0.137252 | 0.003274 | -0.008740 | -0.051489 | -0.113048 |
holiday | 0.038472 | -0.101406 | -0.101800 | 0.014029 | 0.008593 | 1.000000 | -0.091171 | -0.109490 | 0.007158 | -0.049770 | 0.006293 | -0.054138 | -0.146902 |
workingday | 0.042061 | 0.069153 | 0.067594 | 0.014316 | -0.001937 | -0.091171 | 1.000000 | 0.010879 | 0.002185 | -0.698028 | -0.007959 | 0.023068 | 0.044535 |
season | -0.014524 | 0.312025 | 0.319380 | 0.150625 | -0.149773 | -0.109490 | 0.010879 | 1.000000 | -0.006117 | -0.007448 | -0.010742 | 0.156455 | 0.663537 |
hour | -0.020203 | 0.137603 | 0.133750 | -0.276498 | 0.137252 | 0.007158 | 0.002185 | -0.006117 | 1.000000 | -0.002893 | -0.003867 | 0.113745 | -0.008248 |
dow | -0.046424 | -0.036220 | -0.038918 | -0.035233 | 0.003274 | -0.049770 | -0.698028 | -0.007448 | -0.002893 | 1.000000 | 0.000977 | -0.009208 | -0.014472 |
year | -0.019157 | 0.040913 | 0.039222 | -0.083546 | -0.008740 | 0.006293 | -0.007959 | -0.010742 | -0.003867 | 0.000977 | 1.000000 | 0.000788 | -0.009706 |
ideal | -0.145407 | 0.727266 | 0.701874 | -0.141678 | -0.051489 | -0.054138 | 0.023068 | 0.156455 | 0.113745 | -0.009208 | 0.000788 | 1.000000 | 0.462633 |
count_season | -0.051863 | 0.705172 | 0.701434 | 0.113724 | -0.113048 | -0.146902 | 0.044535 | 0.663537 | -0.008248 | -0.014472 | -0.009706 | 0.462633 | 1.000000 |
7. Stacking¶
We can combine the predictions of two or more models to create a meta prediction.
It's much like cross validation. Take 5-fold stacking as an example. First we split the training data into 5 folds. Next we do 5 iterations. In each iteration, we train every base model on 4 folds and predict on the held-out fold, and we also keep each base model's predictions on the test data. After 5 iterations, the out-of-fold predictions cover the whole training set, giving a matrix of shape #(samples in training data) x #(base models). This matrix is then fed to the stacker (just another model) at the second level. After the stacker is fitted, we use the base models' predictions on the test data as its input to obtain the final predictions; since each base model predicted on the same test data once per iteration, we average its 5 predictions so the test matrix has the same shape. A short sketch of this out-of-fold scheme follows.
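The code in the next sections uses a simpler blending scheme, with the base models fitted on the full training set. Purely as a hedged sketch of the 5-fold procedure described above (not the code used for the submissions below), an out-of-fold helper built on scikit-learn's KFold might look like this:
import numpy as np
from sklearn.model_selection import KFold
def out_of_fold_predictions(model, X_train, y_train, X_test, n_splits=5):
    # returns out-of-fold predictions on the training data and the average
    # of the per-fold predictions on the test data
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    oof_train = np.zeros(X_train.shape[0])
    test_preds = np.zeros((n_splits, X_test.shape[0]))
    for i, (fit_idx, oof_idx) in enumerate(kf.split(X_train)):
        model.fit(X_train[fit_idx], y_train[fit_idx])
        oof_train[oof_idx] = model.predict(X_train[oof_idx])
        test_preds[i] = model.predict(X_test)
    return oof_train, test_preds.mean(axis=0)
Stacking the oof_train columns of all base models gives the level-1 training matrix, and the averaged test predictions give the matching test matrix for the stacker.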
7.1 Manual Stacking¶
We can stack the predictions of two or more different models using a pre-defined weighting scheme to get the final prediction. Instead of relying on the prediction of a single model, we average the predictions from two or more models.
# the blend gives a better score on the leaderboard, even though it does not on the validation set
y_pred = np.round(.2*rf_pred + .8*gbm_pred)
print(get_rmsle(y_pred, rf_actual))
0.316017221761
rf_pred = predict_on_test_set(rf_model, rf_cols)
gbm_pred = predict_on_test_set(gbm_model, gbm_cols)
y_pred = np.round(.2*rf_pred + .8*gbm_pred)
# output predictions for submission
submit_manual_blend_df = test_df[['datetime', 'count']].copy()
submit_manual_blend_df['count'] = y_pred
submit_manual_blend_df.head(5)
datetime | count | |
---|---|---|
datetime | ||
2011-01-20 00:00:00 | 2011-01-20 00:00:00 | 13.0 |
2011-01-20 01:00:00 | 2011-01-20 01:00:00 | 5.0 |
2011-01-20 02:00:00 | 2011-01-20 02:00:00 | 3.0 |
2011-01-20 03:00:00 | 2011-01-20 03:00:00 | 2.0 |
2011-01-20 04:00:00 | 2011-01-20 04:00:00 | 2.0 |
submit_manual_blend_df.to_csv('output/submit_manual_blend.csv', index=False)
7.2 Stacking with Linear Regression¶
In the previous section we manually assigned weights to the two regressors and blended their predictions. Here we train the same two regressors, each with its own input columns, and then combine their predictions using a LinearRegression model. The LinearRegression acts as a meta-learner which learns the weighted combination of the two level-0 models.
# Level 0 RandomForestRegressor
rf_params = {'n_estimators': 1000, 'max_depth': 15, 'random_state': 0, 'min_samples_split' : 5, 'n_jobs': -1}
rf_model = RandomForestRegressor(**rf_params)
rf_cols = [
'weather', 'temp', 'atemp', 'windspeed',
'workingday', 'season', 'holiday', 'sticky',
'hour', 'dow', 'woy', 'peak'
]
# Level 0 GradientBoostingRegressor
gbm_params = {'n_estimators': 150, 'max_depth': 5, 'random_state': 0, 'min_samples_leaf' : 10, 'learning_rate': 0.1, 'subsample': 0.7, 'loss': 'ls'}
gbm_model = GradientBoostingRegressor(**gbm_params)
gbm_cols = [
'weather', 'temp', 'atemp', 'humidity', 'windspeed',
'holiday', 'workingday', 'season',
'hour', 'dow', 'year', 'ideal', 'count_season',
]
clf_input_cols = [rf_cols, gbm_cols]
clfs = [rf_model, gbm_model]
# Create train and test sets for blending and Pre-allocate the data
blend_train = np.zeros((train_df.shape[0], len(clfs)))
blend_test = np.zeros((test_df.shape[0], len(clfs)))
# For each classifier, we train the classifier with its corresponding input_cols
# and record the predictions on the train and the test set
for clf_index, (input_cols, clf) in enumerate(zip(clf_input_cols, clfs)):
# prepare training and validation set
X_train, y_train_r, y_train_c = prep_train_data(train_df, input_cols)
# prepare testing set
X_test = test_df[input_cols].values
model_r = clf.fit(X_train, y_train_r)
y_pred_train_r = np.exp(model_r.predict(X_train)) - 1
y_pred_test_r = np.exp(model_r.predict(X_test)) - 1
model_c = clf.fit(X_train, y_train_c)
y_pred_train_c = np.exp(model_c.predict(X_train)) - 1
y_pred_test_c = np.exp(model_c.predict(X_test)) - 1
y_pred_train_comb = np.round(y_pred_train_r + y_pred_train_c)
y_pred_train_comb[y_pred_train_comb < 0] = 0
y_pred_test_comb = np.round(y_pred_test_r + y_pred_test_c)
y_pred_test_comb[y_pred_test_comb < 0] = 0
blend_train[:, clf_index] = y_pred_train_comb
blend_test[:, clf_index] = y_pred_test_comb
# Level 1 blending model using LinearRegression
from sklearn.linear_model import LinearRegression
bclf = LinearRegression(fit_intercept=False)
bclf.fit(blend_train, train_df['count'])
LinearRegression(copy_X=True, fit_intercept=False, n_jobs=1, normalize=False)
# What is the weighted combination of the base classifiers?
bclf.coef_
array([ 0.31355465, 0.73048534])
We observe that the meta-learner LinearRegression has assigned weights of roughly 0.31 to the RandomForestRegressor and 0.73 to the GradientBoostingRegressor, similar to the 0.2/0.8 weights we assigned manually in the previous step.
# stacked and blended predictions
y_pred_blend = np.round(bclf.predict(blend_test))
# R^2 score
bclf.score(blend_train, train_df['count'])
0.96562432974162338
# output predictions for submission
submit_stack_blend_df = test_df[['datetime', 'count']].copy()
submit_stack_blend_df['count'] = y_pred_blend
submit_stack_blend_df.head(5)
datetime | count | |
---|---|---|
datetime | ||
2011-01-20 00:00:00 | 2011-01-20 00:00:00 | 13.0 |
2011-01-20 01:00:00 | 2011-01-20 01:00:00 | 6.0 |
2011-01-20 02:00:00 | 2011-01-20 02:00:00 | 3.0 |
2011-01-20 03:00:00 | 2011-01-20 03:00:00 | 2.0 |
2011-01-20 04:00:00 | 2011-01-20 04:00:00 | 2.0 |
submit_stack_blend_df.to_csv('output/submit_stack_blend.csv', index=False)
References:
Blogs without which this notebook would not have been possible
[1] https://www.analyticsvidhya.com/blog/2015/06/solution-kaggle-competition-bike-sharing-demand/
[2] https://www.kaggle.com/benhamner/bike-rentals-by-time-and-temperature
[3] https://github.com/logicalguess/kaggle-bike-sharing-demand
[4] http://dataviztalk.blogspot.com/2016/02/how-to-add-jitter-to-plot-using-pythons.html
[5] https://www.kaggle.com/anuragreddy333/data-vizualization/code
[6] https://dnc1994.com/2016/05/rank-10-percent-in-first-kaggle-competition-en/