Almagest – k-Means clustering – R

Final code:

library(ggplot2)

dataset <- read.csv('Almagest.csv')

summary(dataset)
head(dataset)
nrow(dataset)

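# Convert degrees and arc minutes to decimal degrees
dataset$Longs <- (dataset$Longitude.degree * 60 + dataset$Longitude.minute) / 60
dataset$Lats <- (dataset$Latitude.degree * 60 + dataset$Latitude.minute) / 60
head(dataset)

# Polar chart coordinates: angle in radians, radius as distance from the pole
dataset$Longitudes <- dataset$Longs * pi / 180
dataset$Latitudes <- 90 - dataset$Lats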

# Clustering ============================================================

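# Columns 6 and 7 are Longs and Lats (decimal degrees); 48 clusters, one per Almagest constellation
km <- kmeans(dataset[, c('Longs', 'Lats')], 48)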
dataset_clustered <- data.frame(dataset, km$cluster)
dataset_clustered[1:10,]

# Plotting ==============================================================

constellation <- 'Aries'
dataset_highlight <- dataset[dataset$Constellation == constellation,]

cluster_highlight <- 1
dataset_cluster_highlight <- dataset[km$cluster == cluster_highlight,]

longitude_ticks <- seq(0, 2, 0.25) * pi
longitude_labels <- c('0', '45', '90', '135', '180', '225', '270', '315', '360')

chart <- ggplot(dataset, aes(x = Longitudes,
                             y = Latitudes,
                             color = Constellation))+
  geom_point(size = 1.3)+
  geom_point(data = dataset_highlight,
             aes(x = Longitudes, y = Latitudes),
             size = 2, colour = 'red')+
  ggtitle('Ptolemy star chart')+
  theme(plot.title = element_text(hjust = 0.5),
        panel.background = element_rect(fill = 'white', color = 'grey'),
        panel.grid.major = element_line(color = 'lightgrey'),
        panel.grid.minor = element_line(color = 'lightgrey'))+
  scale_x_continuous(breaks = longitude_ticks, labels = longitude_labels)+
  scale_y_continuous(limits = c(0,91))+
  #theme(legend.position = 'none')+
  labs(color = 'Constellations')+
  xlab('Longitudes: 0 - 360 degrees')+
  ylab('Latitudes: 0 - 90 degrees')+
  coord_polar(start = -0.3, direction = -1)
print(chart)
# ggsave() is called on its own: newer ggplot2 versions refuse to add it with '+'
ggsave('starchart.jpg', plot = chart, width = 6, height = 6)

chart_kmeans <- ggplot(dataset, aes(x = Longitudes, y = Latitudes, color = factor(km$cluster)))+
  geom_point(size = 1.3)+
  geom_point(data = dataset_cluster_highlight,
             aes(x = Longitudes, y = Latitudes),
             size = 2.5, colour = 'blue')+
  ggtitle('Ptolemy k-means star chart')+
  theme(plot.title = element_text(hjust = 0.5),
        panel.background = element_rect(fill = 'white', color = 'grey'),
        panel.grid.major = element_line(color = 'lightgrey'),
        panel.grid.minor = element_line(color = 'lightgrey'))+
  scale_x_continuous(breaks = longitude_ticks, labels = longitude_labels)+
  scale_y_continuous(limits = c(0,91))+
  #theme(legend.position = 'none')+
  labs(color = 'Predicted constellations')+
  xlab('Longitudes: 0 - 360 degrees')+
  ylab('Latitudes: 0 - 90 degrees')+
  coord_polar(start = -0.3, direction = -1)
print(chart_kmeans)
# ggsave() is called on its own: newer ggplot2 versions refuse to add it with '+'
ggsave('starchart k-means.jpg', plot = chart_kmeans, width = 6, height = 6)

In the first part of this tutorial I plotted star coordinates recorded two thousand years ago to demonstrate the power of data visualisation. It worked well: I was able to identify constellations, and the chart matched the star locations from an earlier illustration.

The data also gives me the opportunity to train a clustering algorithm on it. In machine learning, clustering means finding distinct groups based on the values. Our predecessors did the same thing: they looked up at the sky and started seeing forms. These forms were not necessarily based on stars being close to each other, but proximity was certainly a factor.

Can I reproduce the same star groups with machine learning?

The first third of the code is similar to the code from the first part: importing libraries and the dataset, and feature engineering.


library(ggplot2)

dataset <- read.csv('Almagest.csv')

summary(dataset)
head(dataset)
nrow(dataset)

dataset$Longs <- (dataset$Longitude.degree * 60 + dataset$Longitude.minute) / 60
dataset$Lats <- (dataset$Latitude.degree * 60 + dataset$Latitude.minute) / 60
head(dataset)

dataset$Longitudes <- dataset$Longs * pi / 180
dataset$Latitudes <- 90 - dataset$Lats
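
As a quick sanity check of the feature engineering, here is the same arithmetic on a single star: the first Ursa Minor entry, at longitude 60 degrees 10 minutes and latitude 66 degrees 0 minutes. The helper name to_decimal_degrees is my own invention for this sketch and is not part of the script above. My reading of the last step is that subtracting the latitude from 90 turns it into a distance from the pole, so the pole sits at the centre of the polar chart.

```r
# Hypothetical helper (not in the original script):
# degrees and arc minutes to decimal degrees
to_decimal_degrees <- function(degree, minute) {
  (degree * 60 + minute) / 60
}

longs <- to_decimal_degrees(60, 10)   # decimal longitude in degrees
lats  <- to_decimal_degrees(66, 0)    # decimal latitude in degrees

longitude_rad  <- longs * pi / 180    # angle for the polar chart, in radians
polar_distance <- 90 - lats           # distance from the pole, in degrees

print(round(c(longs, lats, longitude_rad, polar_distance), 6))
```

The four numbers correspond to the Longs, Lats, Longitudes and Latitudes columns of the first row in the dataset.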

The second part is the actual application of the machine learning algorithm. This time it is k-means, and it is disappointingly simple. One line of code: I call the kmeans function and create the ‘km’ model. The parameters are the two coordinate columns of the dataset (columns 6 and 7, ‘Longs’ and ‘Lats’ in decimal degrees) and the number of clusters I want.

Finding the correct number of clusters in k-means is a whole subject in itself, but this time I had it easy: there were 48 ancient constellations in the Almagest, so I chose 48 too.

There is a nice explanation of how k-means works and how to find the optimum number of clusters behind the link below:

https://www.datascience.com/blog/introduction-to-k-means-clustering-algorithm-learn-data-science-tutorials
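
If I did not already know the number of clusters, one common heuristic is the "elbow" method: run k-means for a range of k values and look for the point where the total within-cluster sum of squares stops dropping sharply. The sketch below uses made-up toy data (three blobs of points), not the Almagest, and the seed and the range of k are arbitrary choices of mine.

```r
set.seed(42)  # arbitrary seed so the toy data is repeatable

# Toy data: three well-separated blobs of 30 points each (not the Almagest)
centres <- rbind(c(0, 0), c(10, 0), c(5, 10))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(x = rnorm(30, centres[i, 1]), y = rnorm(30, centres[i, 2]))
}))

# Total within-cluster sum of squares for k = 1..6
ks <- 1:6
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

# The "elbow" is where the curve stops dropping sharply
plot(ks, wss, type = 'b',
     xlab = 'Number of clusters k',
     ylab = 'Total within-cluster sum of squares')
```

With blobs this well separated the bend should appear at k = 3. Real data is rarely that clear cut, which is why having 48 known constellations was a convenient shortcut.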

With the other two lines I combine the predicted cluster numbers with the original dataset and print the first 10 rows, so I can see whether stars of the same constellation end up in the same cluster.


km <- kmeans(dataset[, c('Longs', 'Lats')], 48)
dataset_clustered <- data.frame(dataset, km$cluster)
dataset_clustered[1:10,]

> dataset_clustered[1:10,]
   Constellation Longitude.degree Longitude.minute Latitude.degree Latitude.minute     Longs     Lats Longitudes Latitudes
1     Ursa Minor               60               10              66               0  60.16667 66.00000   1.050106  24.00000
2     Ursa Minor               62               30              70               0  62.50000 70.00000   1.090831  20.00000
3     Ursa Minor               70               10              74              20  70.16667 74.33333   1.224639  15.66667
4     Ursa Minor               89               40              75              40  89.66667 75.66667   1.564979  14.33333
5     Ursa Minor               93               40              77              40  93.66667 77.66667   1.634792  12.33333
6     Ursa Minor              107               10              72              50 107.16667 72.83333   1.870411  17.16667
7     Ursa Minor              116               10              74              50 116.16667 74.83333   2.027491  15.16667
8     Ursa Minor              103                0              71              10 103.00000 71.16667   1.797689  18.83333
9     Ursa Major               85               20              39              50  85.33333 39.83333   1.489348  50.16667
10    Ursa Major               85               50              43               0  85.83333 43.00000   1.498074  47.00000
   km.cluster
1           8
2           8
3           8
4          46
5          46
6          46
7          46
8          46
9          40
10         40

… And they are not. In this run Ursa Minor is split into two clusters (8, 46). If you check the rest of the clustered dataset you will see that the k-means algorithm failed to identify the same groups which were recorded in the Almagest.
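
Scrolling through the rows works, but a cross-tabulation gives a quicker overview. The sketch below rebuilds only the ten rows printed above as a small stand-in data frame (toy_clustered is a name I made up for it) and counts how many clusters each constellation is spread over. The cluster numbers are the ones from this particular run.

```r
# Stand-in for dataset_clustered: the ten rows printed above
toy_clustered <- data.frame(
  Constellation = c(rep('Ursa Minor', 8), rep('Ursa Major', 2)),
  km.cluster    = c(8, 8, 8, 46, 46, 46, 46, 46, 40, 40)
)

# Rows are constellations, columns are cluster numbers, cells are star counts
crosstab <- table(toy_clustered$Constellation, toy_clustered$km.cluster)
print(crosstab)

# Number of different clusters each constellation is spread over
clusters_per_constellation <- rowSums(crosstab > 0)
print(clusters_per_constellation)
```

On the full data the same two calls should work with dataset_clustered in place of toy_clustered. A perfect match with the Almagest would show exactly one cluster per constellation.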

Let’s see a few examples. Below is the code for plotting the original dataset and the k-means outcome. Constellations are colour coded, and I can specify which constellation or cluster I want highlighted by assigning the name to ‘constellation’ and the cluster number to ‘cluster_highlight’.

If you want to remove the legends, just uncomment the commented-out lines in the code.


constellation <- 'Aries'
dataset_highlight <- dataset[dataset$Constellation == constellation,]

cluster_highlight <- 1
dataset_cluster_highlight <- dataset[km$cluster == cluster_highlight,]

longitude_ticks <- seq(0, 2, 0.25) * pi
longitude_labels <- c('0', '45', '90', '135', '180', '225', '270', '315', '360')

chart <- ggplot(dataset, aes(x = Longitudes,
                             y = Latitudes,
                             color = Constellation))+
  geom_point(size = 1.3)+
  geom_point(data = dataset_highlight,
             aes(x = Longitudes, y = Latitudes),
             size = 2, colour = 'red')+
  ggtitle('Ptolemy star chart')+
  theme(plot.title = element_text(hjust = 0.5),
        panel.background = element_rect(fill = 'white', color = 'grey'),
        panel.grid.major = element_line(color = 'lightgrey'),
        panel.grid.minor = element_line(color = 'lightgrey'))+
  scale_x_continuous(breaks = longitude_ticks, labels = longitude_labels)+
  scale_y_continuous(limits = c(0,91))+
  #theme(legend.position = 'none')+
  labs(color = 'Constellations')+
  xlab('Longitudes: 0 - 360 degrees')+
  ylab('Latitudes: 0 - 90 degrees')+
  coord_polar(start = -0.3, direction = -1)
print(chart)
# ggsave() is called on its own: newer ggplot2 versions refuse to add it with '+'
ggsave('starchart.jpg', plot = chart, width = 6, height = 6)

chart_kmeans <- ggplot(dataset, aes(x = Longitudes, y = Latitudes, color = factor(km$cluster)))+
  geom_point(size = 1.3)+
  geom_point(data = dataset_cluster_highlight,
             aes(x = Longitudes, y = Latitudes),
             size = 2.5, colour = 'blue')+
  ggtitle('Ptolemy k-means star chart')+
  theme(plot.title = element_text(hjust = 0.5),
        panel.background = element_rect(fill = 'white', color = 'grey'),
        panel.grid.major = element_line(color = 'lightgrey'),
        panel.grid.minor = element_line(color = 'lightgrey'))+
  scale_x_continuous(breaks = longitude_ticks, labels = longitude_labels)+
  scale_y_continuous(limits = c(0,91))+
  #theme(legend.position = 'none')+
  labs(color = 'Predicted constellations')+
  xlab('Longitudes: 0 - 360 degrees')+
  ylab('Latitudes: 0 - 90 degrees')+
  coord_polar(start = -0.3, direction = -1)
print(chart_kmeans)
# ggsave() is called on its own: newer ggplot2 versions refuse to add it with '+'
ggsave('starchart k-means.jpg', plot = chart_kmeans, width = 6, height = 6)

First: the final product. On the left side is Ptolemy’s star chart and (more or less) the sky you would see if you looked up in Alexandria on a dark night. On the right side is what the machine learning algorithm found based on the coordinates.


What am I looking at then? Success or failure? The good news is that the algorithm clearly identified groups. The colours are not random: they appear in clusters, as they should, but it is hard to see how accurate those groups are. Let’s highlight some!

I start with Aries. I used this constellation in the first part to set up my chart. To find the closest group you need to check the ‘dataset_clustered’ dataset and pick the cluster number which you think best represents the constellation. As you can see, there is a group where Aries is and the two have stars in common, but it is not an exact match.
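
Picking the cluster number by eye can also be done in code. A minimal sketch, assuming "best represents" simply means the cluster that holds the most stars of the constellation: the helper best_matching_cluster is hypothetical, and the data frame is a small stand-in built from the ten rows printed earlier, not the full dataset.

```r
# Hypothetical helper: the cluster holding most stars of one constellation
best_matching_cluster <- function(data, constellation) {
  clusters <- data$km.cluster[data$Constellation == constellation]
  counts <- table(clusters)
  best <- names(counts)[which.max(counts)]
  list(cluster = as.integer(best),                 # most common cluster number
       share = max(counts) / length(clusters))     # fraction of stars inside it
}

# Stand-in for dataset_clustered: the ten rows printed earlier in the post
toy_clustered <- data.frame(
  Constellation = c(rep('Ursa Minor', 8), rep('Ursa Major', 2)),
  km.cluster    = c(8, 8, 8, 46, 46, 46, 46, 46, 40, 40)
)

result <- best_matching_cluster(toy_clustered, 'Ursa Minor')
print(result$cluster)  # candidate value for cluster_highlight
print(result$share)    # how much of the constellation that cluster covers
```

The share only tells how much of the constellation falls inside the cluster. It says nothing about extra stars the cluster picked up from neighbouring constellations, so the chart is still worth a look.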


Now I try to locate Ursa Major. It is clearly a miss: the closest group I found has only a few stars in common. When I tried other clusters around the same area I had the same result: a few stars match but the k-means clusters are clearly not Ursa Major.


This is Orion, a famous constellation. K-means found a group inside it but missed the outlying stars and grouped them into other clusters.


Sagittarius: k-means again found the core stars, but the stars farther away are grouped into other clusters.


Leo is quite close compared to the previous examples. Most of the body is there (plus a few extra stars) but the head is missing.


Let’s draw the conclusion.

K-means did a pretty good job of finding distinct groups of stars based on their coordinates, but those groups do not match the actual constellations. The algorithm did what it was designed to do, but the outcome is not what we wanted. There can be a few reasons behind this.

Maybe we don’t have sufficient data: we used only the coordinates, but we could have included the magnitudes of the individual stars.

More importantly: it seems we humans used a different method to map the stars. Looking at the examples I can see that the elements of the k-means clusters are close to each other and the algorithm tried to create circular groups. We humans, however, saw contours, and the stars forming those contours are too far from each other for the k-means algorithm to group them together.

K-means still showed its usefulness in identifying distinct groups of data. Even though the clusters it predicted did not match the actual constellations, which often contain stars far away from each other, I can see that the predicted clusters are distinct and well defined.
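
The preference for compact groups is easy to reproduce on made-up data. In the sketch below two hypothetical "contours" are drawn as parallel rows of ten stars each, one unit apart. A human would read them as two lines, but k-means minimises the distance to the cluster centre, so it cuts across both rows instead. The coordinates and the seed are arbitrary and have nothing to do with the Almagest.

```r
set.seed(1)  # arbitrary seed; kmeans starts from random centres

# Two hypothetical contours: parallel rows of ten stars each, one unit apart
stars <- data.frame(
  x = rep(0:9, times = 2),
  y = rep(c(0, 1), each = 10),
  contour = rep(c('A', 'B'), each = 10)
)

km_toy <- kmeans(stars[, c('x', 'y')], centers = 2, nstart = 25)

# Rows are the contours a human would draw, columns are the k-means clusters
tab <- table(stars$contour, km_toy$cluster)
print(tab)
```

Each cluster ends up holding stars from both rows: splitting the rows in the middle gives two compact blocks, which k-means scores far better than two long thin lines.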

You can see it below: Cygnus compared to three separate clustering runs.
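
The runs differ because kmeans picks its starting centres at random. If you want a run you can repeat, a minimal sketch is to fix the random seed before the call. The toy coordinates and seed values below are arbitrary assumptions of mine, and nstart is a standard kmeans argument that keeps the best of several random starts.

```r
# Toy coordinates standing in for the star positions (not the Almagest)
set.seed(7)
toy <- cbind(x = runif(60, 0, 360), y = runif(60, 0, 90))

# Same seed, same starting centres, same clusters
set.seed(123)
run_a <- kmeans(toy, centers = 5)
set.seed(123)
run_b <- kmeans(toy, centers = 5)
print(identical(run_a$cluster, run_b$cluster))  # TRUE

# nstart tries several random starts and keeps the one with the lowest
# total within-cluster sum of squares
run_c <- kmeans(toy, centers = 5, nstart = 25)
print(run_c$tot.withinss)
```

Fixing the seed makes a chart reproducible, but it does not make that particular grouping any more "correct" than the others.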

 

 

