Linear regression – machine learning – Python

Final code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split

df = pd.read_csv('Droid control - wind speed.csv')
print(df.head())
print(df.describe())
df.info()

plt.figure(1)
plt.scatter(df['Wind speed'], df['Control metrics'], color = 'red')
plt.title('Control action / wind speed')
plt.xlabel('Wind speed (km/h)')
plt.ylabel('Control metrics')
plt.savefig('Windspeed-ml.jpg')

X = df.iloc[:, :-1].values
y = df.iloc[:, 1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

regressor = linear_model.LinearRegression()
regressor.fit(X_train, y_train)

fig2, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10,5), sharey = 'row')
plt.suptitle('Linear regression prediction plots', fontsize=16)

ax1.scatter(X_train, y_train, color = 'red', label = 'Train set - actual values')
ax1.plot(X_train, regressor.predict(X_train), color = 'blue', label = 'Prediction based on Train set')
ax1.set_title('Train set with prediction')
ax1.set_xlabel('Train set wind speed')
ax1.set_ylabel('Control metrics & control prediction')
ax1.legend(loc = 'upper left')

ax2.scatter(X_test, y_test, color = 'red', label = 'Test set - actual values')
ax2.plot(X_train, regressor.predict(X_train), color = 'blue', label = 'Prediction based on Train set')
ax2.set_title('Test set with train set prediction')
ax2.set_xlabel('Test set wind speed')
ax2.set_ylabel('Control metrics & control prediction')
ax2.legend(loc = 'upper left')

fig2.subplots_adjust(top = 0.88)
plt.savefig('Windspeed-linreg-ml.jpg')

[Figure: Windspeed-linreg-ml, train set and test set plots with the linear regression line]


In the previous blog post I used linear regression to calculate the equation needed to control my amazing battle droid in westerly wind. (Yes, this example is quite silly.) I admit, it was extremely fun to dig up the equations, do the math and ‘translate’ them into Python. The final product was convincing: the linear regression model produced the best-fit line, and the code responded with the correct metrics for any wind speed input.

It was not too… ‘mad scientist’-like though. After all, I did the math, I wrote the program, and then the droid simply followed the instructions my code gave it. It would be much better if I just fed it the data from my test flights and it learned how to fly on its own by using machine learning!

On a more practical note: doing the math and typing up the code takes time, and we humans are prone to making mistakes (typos). Using machine learning simplifies my original code, making it easier to write and read. Linear regression is a fairly simple concept, yet the equations required to calculate the linear regression line looked a bit intimidating at first. More advanced methods may require more math and more equations, and so leave more room for error. With machine learning algorithms I can apply the same methods much more simply and quickly.
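
To make the comparison concrete, here is a minimal sketch of the one-variable least-squares equations written out by hand, next to the same fit done by scikit-learn. The five data points are made up for illustration and are not taken from the droid dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data for illustration only (not the droid dataset).
x = np.array([4.0, 5.0, 6.0, 8.0, 10.0])
y = np.array([545.0, 600.0, 650.0, 760.0, 880.0])

# Closed-form least squares for a single variable:
#   slope     = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   intercept = mean(y) - slope * mean(x)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# The same fit with scikit-learn; it expects a 2D feature array,
# so the 1D x is reshaped into a single column.
model = LinearRegression().fit(x.reshape(-1, 1), y)

print(slope, intercept)
print(model.coef_[0], model.intercept_)
```

Both print lines should show the same slope and intercept, up to floating-point rounding.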

The task is the same: I have a dataset with wind speed recordings and metrics describing what kind of controls I used to compensate for the wind.

The first few lines are identical to the code in the previous post: I import the modules, load the dataset, have a quick peek into it and draw a scatter plot to see if there is a linear correlation between the variables. You can get the dataset from here: https://github.com/zmraz/data-science


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split

df = pd.read_csv('Droid control - wind speed.csv')
print(df.head())
print(df.describe())
df.info()

   Wind speed  Control metrics
0           4              545
1           5              572
2           5              619
3           5              639
4           6              645
       Wind speed  Control metrics
count  100.000000          100.000
mean    30.230000         1797.320
std     14.918536          719.562
min      4.000000          545.000
25%     18.500000         1286.500
50%     31.500000         1864.000
75%     43.000000         2350.500
max     60.000000         3670.000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
Wind speed         100 non-null int64
Control metrics    100 non-null int64
dtypes: int64(2)
memory usage: 1.6 KB

plt.figure(1)
plt.scatter(df['Wind speed'], df['Control metrics'], color = 'red')
plt.title('Control action / wind speed')
plt.xlabel('Wind speed (km/h)')
plt.ylabel('Control metrics')

[Figure: Windspeed-ml, scatter plot of control metrics against wind speed (km/h)]


I assign the independent variable ‘Wind speed’ to X and the dependent variable ‘Control metrics’ to y, then split them into a training set and a test set. With test_size = 0.2, 20% of the rows go to the test set and the rest to the training set.


X = df.iloc[:, :-1].values
y = df.iloc[:, 1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
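
As a quick check of what the split produces, here is a small sketch on stand-in arrays with 100 rows, the same size as the dataset. The values are hypothetical, and random_state is my own addition: it fixes the shuffle so the same rows land in each set on every run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with 100 rows, the same size as the dataset in this post.
X = np.arange(100).reshape(-1, 1)   # one feature column, kept 2D
y = 50 * np.arange(100) + 400       # hypothetical target values

# random_state is an optional addition (not in the code above):
# it makes the shuffle reproducible between runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

With 100 rows and test_size = 0.2, this leaves 80 rows for training and 20 for testing.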

I create the regressor object using the LinearRegression class, and ask it to ‘learn’ from my flights by fitting it with the training set.


regressor = linear_model.LinearRegression()
regressor.fit(X_train, y_train)

Believe it or not, that’s it! Well, kind of… At this point, in true mad scientist tradition, I could load this model into the battle droid’s brain and then it could fly (as long as the wind blows west). I could add the line ‘print(regressor.predict([[55]]))’ and I would receive a number in return which, hopefully, would be the correct reaction to a 55 km/h wind. (The predict method expects a 2D array with one row per sample, hence the double brackets.)
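
Here is a minimal sketch of a single prediction, assuming a recent scikit-learn version where predict takes a 2D array (one row per sample, one column per feature). The training points are made up and lie exactly on the line y = 50 * x + 400, so the result is easy to check by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data lying exactly on y = 50 * x + 400 (illustration only).
X_train = np.array([[10], [20], [30], [40]])
y_train = 50 * X_train.ravel() + 400

regressor = LinearRegression().fit(X_train, y_train)

# predict() takes a 2D array: one row per sample, one column per feature.
prediction = regressor.predict([[55]])

# The same number rebuilt by hand from the learned slope and intercept.
by_hand = regressor.coef_[0] * 55 + regressor.intercept_

print(prediction[0], by_hand)
```

The fitted model is nothing more than a slope (coef_) and an intercept (intercept_), so these two numbers are all the droid would really need to carry around.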

I am not a mad scientist though and I want to test my model’s performance first.

I am going to draw two plots. The first will show a scatter plot of the training set with the linear regression line fitted on that same set. The second will show a scatter plot of the test set with the linear regression line fitted on the training set. In other words: the scatter plots show different data points (training set, test set), but the regression line is based on the training set in both plots.

If the model works, the linear regression line will fit the points in both plots.
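
A visual check can be backed up with numbers. The sketch below is not part of my original code: it uses synthetic data (a linear trend plus noise, generated for illustration) and compares the R² score on the training set and the test set, plus the root mean squared error on the test set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the flight data: a linear trend plus random noise.
rng = np.random.default_rng(0)
X = rng.uniform(4, 60, size=(100, 1))
y = 50 * X.ravel() + 400 + rng.normal(0, 100, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)

# R^2 on each set: similar values suggest the line generalises
# to data it has not seen during fitting.
r2_train = regressor.score(X_train, y_train)
r2_test = regressor.score(X_test, y_test)

# Root mean squared error on the test set, in the same units as y.
rmse_test = np.sqrt(mean_squared_error(y_test, regressor.predict(X_test)))

print(r2_train, r2_test, rmse_test)
```

If the test score were much lower than the training score, that would be a hint that the line fits the training flights but not new ones.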


fig2, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10,5), sharey = 'row')
plt.suptitle('Linear regression prediction plots', fontsize=16)

ax1.scatter(X_train, y_train, color = 'red', label = 'Train set - actual values')
ax1.plot(X_train, regressor.predict(X_train), color = 'blue', label = 'Prediction based on Train set')
ax1.set_title('Train set with prediction')
ax1.set_xlabel('Train set wind speed')
ax1.set_ylabel('Control metrics & control prediction')
ax1.legend(loc = 'upper left')

ax2.scatter(X_test, y_test, color = 'red', label = 'Test set - actual values')
ax2.plot(X_train, regressor.predict(X_train), color = 'blue', label = 'Prediction based on Train set')
ax2.set_title('Test set with train set prediction')
ax2.set_xlabel('Test set wind speed')
ax2.set_ylabel('Control metrics & control prediction')
ax2.legend(loc = 'upper left')

fig2.subplots_adjust(top = 0.88)
plt.savefig('Windspeed-linreg-ml.jpg')

[Figure: Windspeed-linreg-ml, train set and test set plots with the linear regression line]


And here they are! The linear regression model works: the blue line is exactly the same in both plots and fits both the training set and the test set. Now I have a model which can predict a control metric for every wind speed. You can see in the right-hand plot that the blue line does not extend to the extreme left and right, so it gives no visual indication of what the model predicts for those values. This is because those test points lie below the minimum or above the maximum wind speed in the training set, and the blue line uses the training set minimum and maximum as its start and end points.

However, as I wrote above, I can still use the predict method to get a prediction for those values, so ‘print(regressor.predict([[60]]))’ would give me a best-fit prediction.
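
If I wanted the blue line to reach the outermost test points as well, one option would be to draw it over an evenly spaced grid that spans both sets instead of over X_train. This is only a sketch with made-up numbers; x_line is a name I introduce here, and anything outside the training range is an extrapolation that the training data cannot confirm.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Made-up data for illustration; the training points cover only 10 to 40.
X_train = np.array([[10], [20], [30], [40]])
y_train = np.array([900, 1450, 1850, 2400])
X_test = np.array([[5], [25], [55]])
y_test = np.array([700, 1600, 3200])

regressor = LinearRegression().fit(X_train, y_train)

# Build an x grid spanning both sets, so the line reaches the outer test points.
x_min = min(X_train.min(), X_test.min())
x_max = max(X_train.max(), X_test.max())
x_line = np.linspace(x_min, x_max, 50).reshape(-1, 1)

fig, ax = plt.subplots()
ax.scatter(X_test, y_test, color='red', label='Test set - actual values')
ax.plot(x_line, regressor.predict(x_line), color='blue',
        label='Prediction based on Train set')
ax.set_title('Test set with train set prediction')
ax.legend(loc='upper left')
# Call plt.savefig(...) or plt.show() here, as in the plots above.
```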

 
