Linear regression example – R

Final code:

library(ggplot2)
library(reshape2)

df = read.csv('Droid control - wind speed.csv')

summary(df)
head(df)
nrow(df)

ggplot(df, aes(x = Wind.speed, y = Control.metrics, color = 'red'))+
    geom_point()+
    scale_color_discrete(name = 'Legend:',
    breaks='red',
    labels= 'Control values')+
    xlab('Wind speed')+
    ylab('Control metrics')+
    ggtitle('Control metrics and Linear regression')
# ggsave() is not a plot layer, so it is called on its own;
# by default it saves the most recent plot.
ggsave('Droid_flight_data.jpg')

X <- df$Wind.speed
y <- df$Control.metrics
X_squares <- X ^ 2
X_times_y <- X * y
N <- length(X)

# b = ((∑X^2)(∑Y) – (∑X)(∑XY)) / (N(∑X^2) – (∑X)^2)
b1 <- sum(X_squares) * sum(y) # (∑X^2)(∑Y)
b2 <- sum(X) * as.numeric(sum(X_times_y)) # (∑X)(∑XY)
b3 <- N * sum(X_squares) # N(∑X^2)
b4 <- sum(X) ^ 2 # (∑X)^2

b <- (b1 - b2) / (b3 - b4)

# m = (N(∑XY) – (∑X)(∑Y)) / (N(∑X^2) – (∑X)^2)
m1 <- N * sum(X_times_y) # N(∑XY)
m2 <- sum(X) * sum(y) # (∑X)(∑Y)
m3 <- N * sum(X_squares) # N(∑X^2)
m4 <- sum(X) ^ 2 # (∑X)^2

m = (m1 - m2) / (m3 - m4)

lin_equation = paste('Y =' , b, '+', m, '* X')

# R arithmetic is vectorised, so this evaluates the line at every value of X
manual_linear_regression <- b + m * X
lin_equation

df$linreg <- manual_linear_regression
dfm <- melt(df, id = 'Wind.speed')
ggplot(dfm, aes(x = Wind.speed, y = value, color = variable))+
    geom_point()+
    scale_color_discrete(name = 'Legend:',
    breaks=c("Control.metrics", "linreg"),
    labels=c("Control metrics", lin_equation))+
    xlab('Wind speed')+
    ylab('Control metrics')+
    ggtitle('Control metrics and Linear regression')
# ggsave() is not a plot layer, so it is called on its own;
# by default it saves the most recent plot.
ggsave('Linreg-R2.jpg')

[Figure: Linreg-R2, control metrics and the calculated regression points plotted against wind speed]


(You can find this tutorial using Python here.)

Let’s assume I work for one of the biggest battle droid manufacturers. Our newest model, K1ll 3M4ll, is already conducting test flights and my job is to optimize its performance. Our first goal in the test flights is to fly in a straight line. Interestingly, there is a constant westerly wind over the test area (it’s probably not the best test area).

At this point all I need to worry about is programming the droid to compensate for this wind, which blows, of course, with varying force.

I take the droid out and fly it manually, just as you would with a toy model plane, to learn the controls. When I feel confident enough, I fly it in straight lines from my position toward the target. All the while, the droid records the wind speed and my manoeuvres. After spending a day in the field, I grab the droid’s memory unit, go back to my office and import the flight metrics into my computer.

Let’s just do the same here. First, I load the libraries and print out the usual dataframe details. You can get the dataset from here: https://github.com/zmraz/data-science


library(ggplot2)
library(reshape2)

df = read.csv('Droid control - wind speed.csv')

summary(df)
head(df)
nrow(df)

> summary(df)
   Wind.speed    Control.metrics
 Min.   : 4.00   Min.   : 545
 1st Qu.:18.50   1st Qu.:1286
 Median :31.50   Median :1864
 Mean   :30.23   Mean   :1797
 3rd Qu.:43.00   3rd Qu.:2350
 Max.   :60.00   Max.   :3670
> head(df)
  Wind.speed Control.metrics
1          4             545
2          5             572
3          5             619
4          5             639
5          6             645
6          6             649
> nrow(df)
[1] 100


I have 100 rows of data with two variables: ‘Wind speed’ and ‘Control metrics’. ‘Wind speed’ is the independent variable and ‘Control metrics’ is the dependent variable: it should change depending on my actions to compensate against the wind.

Let’s see if there really is a correlation between the two variables. After all, maybe my actions had no effect on the flight and my droid flew in straight lines because of something else.
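
A quick numerical check to go alongside the plot is the Pearson correlation coefficient, which base R computes with cor(). The sketch below is only an illustration: it uses the six rows printed by head(df) above, not the full dataset, so its result says nothing about all 100 records. The name sample_df is made up for this example. With the real data the call would be cor(df$Wind.speed, df$Control.metrics).

```r
# Six rows copied from the head(df) output; NOT the full 100-row dataset.
sample_df <- data.frame(
    Wind.speed      = c(4, 5, 5, 5, 6, 6),
    Control.metrics = c(545, 572, 619, 639, 645, 649)
)

# Pearson correlation coefficient: values close to 1 indicate a strong
# positive linear relationship, values close to 0 indicate none.
r <- cor(sample_df$Wind.speed, sample_df$Control.metrics)
r
```

A value well above zero would support the idea that the control metrics rise with wind speed, although a scatter plot is still the better way to spot outliers.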

I now plot all 100 rows of data in a scatter plot.


ggplot(df, aes(x = Wind.speed, y = Control.metrics, color = 'red'))+
    geom_point()+
    scale_color_discrete(name = 'Legend:',
    breaks='red',
    labels= 'Control values')+
    xlab('Wind speed')+
    ylab('Control metrics')+
    ggtitle('Control metrics and Linear regression')
# ggsave() is not a plot layer, so it is called on its own;
# by default it saves the most recent plot.
ggsave('Droid_flight_data.jpg')

[Figure: Linreg-R, scatter plot of control metrics against wind speed]


This is good. It looks just as I expected. I have a record of control metrics which can ‘tell’ me what action should be done at different wind speeds. Now I can feed this data back to the droid and it can select the appropriate action depending on wind speed. Or can I?

I want the droid to fly as efficiently and straight as possible. After all I don’t want it to hop left and right while it fires its mighty laser cannons.
Unfortunately, my data is a bit ambiguous. I tried to control the droid as best I could, but I made mistakes and of course couldn’t react the same way every time. I am only human after all.

At a wind speed of 10 km/h I have two different metrics values because my reactions were different, and this happened rather often at other wind speeds too. Which control value would the droid choose?

For a wind speed of 60 km/h I have a metrics value of about 3600, which is clearly off compared to the tendency of the scatter plot. And what happens if the wind gets stronger, say 70 km/h? I simply don’t have data for that wind speed.

If my droid merely reproduces my actions based on this data, it won’t fly like a high-tech droid; it’ll fly like a cheap toy controlled by a human. I need to find the optimum control values based on my data, values that let the droid select one action for each wind speed and also let it react optimally to wind speeds not recorded in my data. This is where linear regression comes into play.

It’s easy to see that I can draw a straight line over the scatter plot points and that line can be extended thus giving us a single, best-fit value for each wind speed. How can I draw this line? I can try to connect the first point and the last.

Of course that would not be correct: the very last value at 60 km/h is way off, and even without it I could only guess where the line should go. Intuitively, the position of the line is not too complicated to define. It simply has to run in such a way that its points are, overall, at the minimum distance from the corresponding scatter plot points. (The equations used below measure this as the sum of squared vertical distances, which is the ordinary least squares criterion.) I could calculate the distance of all 100 points from my line, then draw a different line with a slightly different position and/or steepness and calculate the distances for that one. If the sum of the new distances is smaller than the sum for the first line, the new line is better positioned. I would then repeat this process with a third line, a fourth line and so on until I call it a day.
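
The trial-and-error search described above can be sketched in a few lines. The example below scores two candidate lines by their sum of squared vertical distances, the measure that ordinary least squares uses. Everything in it is illustrative: the data are just the six rows from the head(df) output, the helper sum_sq_dist is a name invented here, and the two intercept and slope pairs are arbitrary guesses rather than fitted values.

```r
# Six rows copied from the head(df) output; NOT the full dataset.
x_sample <- c(4, 5, 5, 5, 6, 6)
y_sample <- c(545, 572, 619, 639, 645, 649)

# Sum of squared vertical distances between the points and the line
# y = intercept + slope * x. A smaller total means a better fit.
sum_sq_dist <- function(intercept, slope, x, y) {
    sum((y - (intercept + slope * x))^2)
}

# Two hypothetical candidate lines (guesses, not fitted values).
line_a <- sum_sq_dist(intercept = 400, slope = 40, x_sample, y_sample)
line_b <- sum_sq_dist(intercept = 300, slope = 60, x_sample, y_sample)

# The candidate with the smaller total is the better-positioned line.
if (line_a < line_b) 'line A fits better' else 'line B fits better'
```

Repeating this comparison over many candidate lines would eventually home in on the best fit, which is exactly the tedious search that the closed-form equations avoid.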

Or I can use the mathematical equations, plug in all the data to calculate the slope and Y intercept of the line, and feed the calculated linear function back to the droid’s CPU. From then on it would be able to work out which control metrics to choose for any given wind speed.

It is easy to see that my best-fit line is defined by the linear equation: Y = b + m X
where
b is the Y intercept
m is the slope.

I know the Xs and Ys. To get b and m I need the equations below.

b = ((∑X^2)(∑Y) – (∑X)(∑XY)) / (N(∑X^2) – (∑X)^2)
m = (N(∑XY) – (∑X)(∑Y)) / (N(∑X^2) – (∑X)^2)

where
N is the number of observations (100 in this dataset)

The equations above look mighty complicated, but in fact I already have all the information I need, and now I just plug in the values to get b and m. The b1, b2, b3, b4 variables refer to the 4 parts of the equation for b and the m1, m2, m3, m4 variables refer to the 4 parts of the equation for m.


X <- df$Wind.speed
y <- df$Control.metrics
X_squares <- X ^ 2
X_times_y <- X * y
N <- length(X)

# b = ((∑X^2)(∑Y) – (∑X)(∑XY)) / (N(∑X^2) – (∑X)^2)
b1 <- sum(X_squares) * sum(y) # (∑X^2)(∑Y)
b2 <- sum(X) * as.numeric(sum(X_times_y)) # (∑X)(∑XY)
b3 <- N * sum(X_squares) # N(∑X^2)
b4 <- sum(X) ^ 2 # (∑X)^2

b <- (b1 - b2) / (b3 - b4)

# m = (N(∑XY) – (∑X)(∑Y)) / (N(∑X^2) – (∑X)^2)
m1 <- N * sum(X_times_y) # N(∑XY)
m2 <- sum(X) * sum(y) # (∑X)(∑Y)
m3 <- N * sum(X_squares) # N(∑X^2)
m4 <- sum(X) ^ 2 # (∑X)^2

m = (m1 - m2) / (m3 - m4)
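
One way to sanity-check hand-computed coefficients is to compare them with R’s built-in lm(), which fits the same least-squares line. The sketch below repeats the calculation on the six rows from the head(df) output (not the full dataset) so that it runs on its own. The names x_small, y_small, b_manual and m_manual are invented for this example to avoid overwriting the variables above. On the real data, coef(lm(Control.metrics ~ Wind.speed, data = df)) should agree with b and m up to floating-point rounding.

```r
# Six rows copied from the head(df) output; NOT the full dataset.
x_small <- c(4, 5, 5, 5, 6, 6)
y_small <- c(545, 572, 619, 639, 645, 649)
n_small <- length(x_small)

# Same closed-form equations as above, written compactly.
denominator <- n_small * sum(x_small^2) - sum(x_small)^2
b_manual <- (sum(x_small^2) * sum(y_small) -
             sum(x_small) * sum(x_small * y_small)) / denominator
m_manual <- (n_small * sum(x_small * y_small) -
             sum(x_small) * sum(y_small)) / denominator

# lm() returns the intercept first, then the slope.
fit <- lm(y_small ~ x_small)
coef(fit)

# TRUE when both approaches agree within numerical tolerance.
all.equal(unname(coef(fit)), c(b_manual, m_manual))
```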

I now know b and m and I can use the general linear equation (Y = b + m X) to calculate the best-fit points for all 100 records.


lin_equation = paste('Y =' , b, '+', m, '* X')
# R arithmetic is vectorised, so this evaluates the line at every value of X
manual_linear_regression <- b + m * X
lin_equation

Finally I draw both the original scatter plot and the points calculated by the completed linear function equation in the same plot.


df$linreg <- manual_linear_regression
dfm <- melt(df, id = 'Wind.speed')
ggplot(dfm, aes(x = Wind.speed, y = value, color = variable))+
    geom_point()+
    scale_color_discrete(name = 'Legend:',
    breaks=c("Control.metrics", "linreg"),
    labels=c("Control metrics", lin_equation))+
    xlab('Wind speed')+
    ylab('Control metrics')+
    ggtitle('Control metrics and Linear regression')
# ggsave() is not a plot layer, so it is called on its own;
# by default it saves the most recent plot.
ggsave('Linreg-R2.jpg')

[Figure: Linreg-R2, control metrics and the calculated regression points plotted against wind speed]


All the blue points are nicely arranged along a straight line. It means my droid has one distinct value for each wind speed measure and using the equation it can calculate responses for wind speeds missing from my dataset. I could now put the equation into a function and upload it into the droid’s memory.
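
Putting the equation into a function might look like the sketch below. The names make_controller and control_for are invented for this example, and the intercept and slope are placeholder numbers, not the values fitted from the flight data; in practice the b and m computed earlier would be passed in instead.

```r
# Build a prediction function from an intercept and a slope.
make_controller <- function(intercept, slope) {
    function(wind_speed) intercept + slope * wind_speed
}

# Placeholder coefficients for illustration only; with the real data,
# pass the fitted b and m instead.
control_for <- make_controller(intercept = 500, slope = 50)

# A wind speed outside the recorded 4 to 60 km/h range.
control_for(70)

# Vectorised: one control value per wind speed.
control_for(c(10, 35, 60))
```

Bear in mind that values predicted beyond the recorded range rest on the assumption that the relationship stays linear there, which the data cannot confirm.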

Or I can apply a machine learning algorithm, fit it against my data and let the droid fly by its predictions. Somehow it sounds a lot cooler and more exciting and this is exactly what I am going to do in the next part.

