Final code:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv('Droid control - wind speed.csv')
    print(df.head())
    print(df.describe())
    df.info()  # info() prints its report itself

    plt.figure(1)
    plt.scatter(df['Wind speed'], df['Control metrics'], color = 'red')
    plt.title('Control action / wind speed')
    plt.xlabel('Wind speed (km/h)')
    plt.ylabel('Control metrics')
    plt.savefig('Windspeed.jpg')

    X = df.iloc[:, :-1].values
    y = df.iloc[:, 1].values

    X_squares = X[:,0] ** 2
    X_times_Y = X[:,0] * y
    N = len(X)

    # b = ((∑X^2)(∑Y) – (∑X)(∑XY)) / (N(∑X^2) – (∑X)^2)
    b1 = X_squares.sum() * y.sum()    # (∑X^2)(∑Y)
    b2 = X.sum() * X_times_Y.sum()    # (∑X)(∑XY)
    b3 = N * X_squares.sum()          # N(∑X^2)
    b4 = X.sum() ** 2                 # (∑X)^2
    b = (b1 - b2) / (b3 - b4)         # true division under Python 3

    # m = (N(∑XY) – (∑X)(∑Y)) / (N(∑X^2) – (∑X)^2)
    m1 = N * X_times_Y.sum()          # N(∑XY)
    m2 = X.sum() * y.sum()            # (∑X)(∑Y)
    m3 = N * X_squares.sum()          # N(∑X^2)
    m4 = X.sum() ** 2                 # (∑X)^2
    m = (m1 - m2) / (m3 - m4)

    # Evaluate the fitted line at every recorded wind speed
    manual_linear_regression = []
    for el in X:
        f_of_X = b + m * el
        manual_linear_regression = np.append(manual_linear_regression, f_of_X)

    lin_equation = 'Y = {} + {}X'.format(b, m)

    plt.figure(2)
    plt.scatter(df['Wind speed'], df['Control metrics'], color = 'red', label = 'Original data')
    plt.scatter(df['Wind speed'], manual_linear_regression, color = 'blue', label = lin_equation)
    plt.title('Control action / wind speed')
    plt.xlabel('Wind speed (km/h)')
    plt.ylabel('Control metrics')
    plt.legend(loc = 'upper left')
    plt.savefig('Windspeed-linreg.jpg')

(You can find this tutorial using R here.)

Let’s assume I work for one of the biggest battle droid manufacturers. Our newest model, K1ll 3M4ll, is already conducting test flights and my job is to optimize its performance. Our first goal in the test flights is to fly in a straight line. Interestingly, there is a constant westerly wind over the test area (it’s probably not the best test area).

At this point all I need to worry about is programming the droid to compensate for this wind, which blows, of course, with varying force.

I take out the droid and fly it manually, just as you would with a toy model plane, to learn the controls. When I feel confident enough I fly it in straight lines from my position toward the target. All the while the droid records the wind speed and my manoeuvres. After spending a day in the field I grab the droid’s memory unit, go back to my office and import the flight metrics into my computer.

Let’s just do the same here. First, I import the usual modules, and print out the usual dataframe details. You can get the dataset from here: https://github.com/zmraz/data-science/tree/master/datasets

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv('Drone control - wind speed.csv')
    print(df.head())
    print(df.describe())
    df.info()  # info() prints its report itself

Output (abridged):

       Wind speed  Control metrics
    0           4              545
    1           5              572
    2           5              619
    3           5              639
    4           6              645
    5           6              649
    ...
    [100 rows x 2 columns]

           Wind speed  Control metrics
    count  100.000000          100.000
    mean    30.230000         1797.320
    std     14.918536          719.562
    min      4.000000          545.000
    25%     18.500000         1286.500
    50%     31.500000         1864.000
    75%     43.000000         2350.500
    max     60.000000         3670.000

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 100 entries, 0 to 99
    Data columns (total 2 columns):
    Wind speed         100 non-null int64
    Control metrics    100 non-null int64
    dtypes: int64(2)
    memory usage: 1.6 KB

I have 100 rows of data with two variables: ‘Wind speed’ and ‘Control metrics’. ‘Wind speed’ is the independent variable and ‘Control metrics’ is the dependent variable: it should change depending on my actions to compensate against the wind.

Let’s see if there really is a correlation between the two variables. After all, maybe my actions had no consequences in the flight and my droid flew in straight lines because of something else.

I now plot all 100 rows of data in a scatter plot.

    plt.figure(1)
    plt.scatter(df['Wind speed'], df['Control metrics'], color = 'red')
    plt.title('Control action / wind speed')
    plt.xlabel('Wind speed (km/h)')
    plt.ylabel('Control metrics')
    plt.savefig('Windspeed.jpg')
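The scatter plot gives a visual impression; the strength of the linear relationship can also be put into a single number, the Pearson correlation coefficient. Here is a minimal sketch, assuming data shaped like the small made-up sample below (the values are illustrative only and `sample` is a hypothetical name; on the real data you would pass the two columns of `df` instead):

```python
import numpy as np
import pandas as pd

# Made-up sample for illustration only, not the recorded flight data.
sample = pd.DataFrame({
    'Wind speed': [4, 5, 10, 10, 20, 30, 45, 60],
    'Control metrics': [545, 572, 800, 910, 1300, 1850, 2500, 3670],
})

# np.corrcoef returns the 2x2 correlation matrix;
# the off-diagonal entry is Pearson's r for the two columns.
r = np.corrcoef(sample['Wind speed'], sample['Control metrics'])[0, 1]
print('Pearson r = {:.3f}'.format(r))
```

A value close to 1 means the points lie close to a rising straight line, which is what the plot suggests.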

This is good. It looks just as I expected. I have a record of control metrics which can ‘tell’ me what action should be done at different wind speeds. Now I can feed this data back to the droid and it can select the appropriate action depending on wind speed. Or can I?

I want the droid to fly as efficiently and straight as possible. After all I don’t want it to hop left and right while it fires its mighty laser cannons.

Unfortunately my data is a bit ambiguous. I tried to control the droid as best I could, but I made mistakes and of course couldn’t react the same way every time. I am only human after all.

At a wind speed of 10 km/h I have two different metrics values because my reactions were different, and the same thing happened rather often at other wind speeds. Which control value would the droid choose?
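One way to see how often a single wind speed comes with several different control values is to group the rows by wind speed. This is a small sketch on illustrative numbers (the `sample` frame is hypothetical; on the real data you would group `df` the same way):

```python
import pandas as pd

# Illustrative rows only, not the full recorded dataset.
sample = pd.DataFrame({
    'Wind speed': [5, 5, 5, 6, 6, 10, 10, 12],
    'Control metrics': [572, 619, 639, 645, 649, 800, 910, 1010],
})

# Count the distinct control values recorded for each wind speed.
values_per_speed = sample.groupby('Wind speed')['Control metrics'].nunique()

# Keep only the wind speeds where more than one reaction was recorded.
ambiguous = values_per_speed[values_per_speed > 1]
print(ambiguous)
```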

For a wind speed of 60 km/h I have a metrics value of about 3600, which is clearly off compared with the trend of the scatter plot. And what happens if the wind gets stronger, say 70 km/h? I simply don’t have data for that wind speed.

If my droid merely reproduces my actions based on this data it won’t fly like a high-tech droid, it’ll fly like a cheap toy controlled by a human. I need to find the optimum control values based on my data, values that let the droid select one action for each wind speed and also let it react sensibly to wind speeds not recorded in my data. This is where linear regression comes into play.

It’s easy to see that I can draw a straight line over the scatter plot points and that line can be extended thus giving us a single, best-fit value for each wind speed. How can I draw this line? I can try to connect the first point and the last.

Of course that would not be correct: the very last value at 60 km/h is way off, and even without it I could only guess where the line should go. Intuitively, it is not too complicated to define the position of the line. It simply has to run so that, overall, its points are as close as possible to the corresponding scatter plot points. (In ordinary least squares the distances are measured vertically and squared before summing.) So I could calculate the distance of all 100 points from my line, then draw a different line with a slightly different position and/or steepness and calculate the distances for that one too. If the sum of the new distances is smaller than the sum for the first line, the new line is better positioned. I would then repeat this process with a third line, a fourth line and so on until I call it a day.
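To make the trial-and-error idea concrete, here is a rough sketch of it in code. It assumes the error of a line is the sum of squared vertical distances, and it uses a small made-up sample and an arbitrary grid of candidate lines; the names `total_error`, `b_first` and `m_first` are mine, not part of the droid’s data:

```python
import numpy as np

# Made-up, roughly linear sample (not the flight data).
x = np.array([4, 5, 10, 10, 20, 30, 45, 60], dtype=float)
y = np.array([545, 572, 800, 910, 1300, 1850, 2500, 3670], dtype=float)

def total_error(b, m):
    """Sum of squared vertical distances between the points and the line y = b + m * x."""
    return np.sum((y - (b + m * x)) ** 2)

# First guess: the line through the first and the last point.
m_first = (y[-1] - y[0]) / (x[-1] - x[0])
b_first = y[0] - m_first * x[0]
best = (total_error(b_first, m_first), b_first, m_first)

# Try many slightly different lines and keep whichever has the smallest error.
for b in np.linspace(0, 600, 61):
    for m in np.linspace(40, 70, 61):
        candidate = (total_error(b, m), b, m)
        if candidate[0] < best[0]:
            best = candidate

print('first guess error: {:.0f}'.format(total_error(b_first, m_first)))
print('best error found:  {:.0f} (b = {:.1f}, m = {:.2f})'.format(*best))
```

The search only ever finds the best line on its grid, so the answer depends on how fine the grid is. That is why a closed-form solution is more attractive.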

Or I can use the mathematical equations, plug in all the data to calculate the slope and the Y intercept of the line, and feed the resulting linear function back to the droid’s CPU. From then on it would be able to work out which control metrics to choose for any given wind speed.

It is easy to see that my best-fit line is defined by the linear equation: Y = **b** + **m** X

where

**b** is the Y intercept

**m** is the slope.

I know the Xs and Ys. To get **b** and **m** I need the equations below.

b = ((∑X^2)(∑Y) – (∑X)(∑XY)) / (N(∑X^2) – (∑X)^2)

m = (N(∑XY) – (∑X)(∑Y)) / (N(∑X^2) – (∑X)^2)

where

N is the number of data points (here, the 100 recorded rows)
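For readers wondering where these two equations come from, here is a short sketch of the standard least-squares derivation. It assumes the quantity being minimized is the sum of squared vertical distances S between the points and the line, which is the usual convention but is not spelled out in this post:

```latex
S(b, m) = \sum_{i=1}^{N} \left( Y_i - b - m X_i \right)^2

\frac{\partial S}{\partial b} = 0 \;\Rightarrow\; \sum Y = N b + m \sum X

\frac{\partial S}{\partial m} = 0 \;\Rightarrow\; \sum XY = b \sum X + m \sum X^2
```

Solving these two linear equations for b and m gives the two expressions above.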

The above equations look mighty complicated, but in fact I already have all the information I need, so I just plug in the values to get **b** and **m**. The b1, b2, b3, b4 variables refer to the 4 parts of the equation for **b** and the m1, m2, m3, m4 variables refer to the 4 parts of the equation for **m**.

    X = df.iloc[:, :-1].values
    y = df.iloc[:, 1].values

    X_squares = X[:,0] ** 2
    X_times_Y = X[:,0] * y
    N = len(X)

    # b = ((∑X^2)(∑Y) – (∑X)(∑XY)) / (N(∑X^2) – (∑X)^2)
    b1 = X_squares.sum() * y.sum()    # (∑X^2)(∑Y)
    b2 = X.sum() * X_times_Y.sum()    # (∑X)(∑XY)
    b3 = N * X_squares.sum()          # N(∑X^2)
    b4 = X.sum() ** 2                 # (∑X)^2
    b = (b1 - b2) / (b3 - b4)         # true division under Python 3

    # m = (N(∑XY) – (∑X)(∑Y)) / (N(∑X^2) – (∑X)^2)
    m1 = N * X_times_Y.sum()          # N(∑XY)
    m2 = X.sum() * y.sum()            # (∑X)(∑Y)
    m3 = N * X_squares.sum()          # N(∑X^2)
    m4 = X.sum() ** 2                 # (∑X)^2
    m = (m1 - m2) / (m3 - m4)
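A quick sanity check for hand-written formulas like these is to compare them with a library routine. The sketch below does that with `np.polyfit` on a small made-up sample (the numbers are illustrative, not the flight data), computing **b** and **m** with the same two equations:

```python
import numpy as np

# Made-up sample for illustration only.
x = np.array([4, 5, 10, 10, 20, 30, 45, 60], dtype=float)
y = np.array([545, 572, 800, 910, 1300, 1850, 2500, 3670], dtype=float)
N = len(x)

# Same closed-form equations as in the post; both share one denominator.
denominator = N * np.sum(x ** 2) - np.sum(x) ** 2
b = (np.sum(x ** 2) * np.sum(y) - np.sum(x) * np.sum(x * y)) / denominator
m = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / denominator

# np.polyfit with degree 1 returns [slope, intercept] of the least-squares line.
m_check, b_check = np.polyfit(x, y, 1)

print('manual : b = {:.3f}, m = {:.3f}'.format(b, m))
print('polyfit: b = {:.3f}, m = {:.3f}'.format(b_check, m_check))
```

If the two lines of output agree, the manual implementation matches the library’s least-squares fit.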

I now know **b** and **m** and I can use the general linear equation (Y = **b** + **m** X) to calculate the best-fit points for all 100 records.

    manual_linear_regression = []
    for el in X:
        f_of_X = b + m * el
        manual_linear_regression = np.append(manual_linear_regression, f_of_X)
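The loop above spells out the calculation one record at a time. NumPy can also evaluate the line for the whole column in a single expression. This sketch compares the two on hypothetical coefficients and a handful of made-up wind speeds (none of these numbers are the fitted values):

```python
import numpy as np

# Hypothetical coefficients and wind speeds, for illustration only.
b, m = 300.0, 50.0
X = np.array([[4], [5], [10], [60]])  # same (rows, 1) shape as df.iloc[:, :-1].values

# Loop version, as in the post.
looped = []
for el in X:
    looped = np.append(looped, b + m * el)

# Vectorized version: one arithmetic expression over the whole column.
vectorized = b + m * X[:, 0]

print(np.array_equal(looped, vectorized))
```

Both versions produce the same one-dimensional array. The vectorized form is shorter and avoids rebuilding the array on every iteration.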

Finally I draw both the original scatter plot and the points calculated by the completed linear function equation in the same plot.

    # The complete linear function, used as the label of the linear regression scatter plot
    lin_equation = 'Y = {0} + {1}X'.format(b, m)

    plt.figure(2)
    plt.scatter(df['Wind speed'], df['Control metrics'], color = 'red', label = 'Original data')
    plt.scatter(df['Wind speed'], manual_linear_regression, color = 'blue', label = lin_equation)
    plt.title('Control action / wind speed')
    plt.xlabel('Wind speed (km/h)')
    plt.ylabel('Control metrics')
    plt.legend(loc = 'upper left')
    plt.savefig('Windspeed-linreg.jpg')

All the blue points are nicely arranged along a straight line. It means my droid has one distinct value for each wind speed measure and using the equation it can calculate responses for wind speeds missing from my dataset. I could now put the equation into a function and upload it into the droid’s memory.
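Putting the equation into a function could look something like the sketch below. The names `make_controller` and `control_for` are hypothetical, and the coefficients are placeholders rather than the values fitted from my dataset:

```python
def make_controller(b, m):
    """Return a function mapping wind speed (km/h) to a control value on the line Y = b + m * X."""
    def control_for(wind_speed):
        return b + m * wind_speed
    return control_for

# Placeholder coefficients, NOT the fitted values from the dataset.
control_for = make_controller(b=300.0, m=50.0)

print(control_for(10))  # a wind speed that appears in the data
print(control_for(70))  # a wind speed beyond the recorded range
```

The call for 70 km/h works only because the line is extended past the recorded maximum of 60 km/h. Whether the droid really behaves linearly out there is an assumption the data cannot confirm.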

Or I can apply a machine learning algorithm, fit it against my data and let the droid fly by its predictions. Somehow it sounds a lot cooler and more exciting and this is exactly what I am going to do in the next part.