Plotting, plotting and plotting – Python

Final code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('companies.csv')
print df.shape
print df.info()
print df.describe()

df = df[df['Renewal fee'] < 9000]

bins = np.arange(df['Renewal fee'].min(), df['Renewal fee'].max()+600, 500)

plt.figure(1)
plt.subplot(121)
plt.boxplot(df['Renewal fee'])
plt.title('Revenues - boxplot')

plt.subplot(122)
plt.hist(df['Renewal fee'], bins = bins, alpha = 0.7, color = '#112299', edgecolor = 'w')
plt.grid(True, linestyle = '--', alpha = 0.6)
plt.title('Revenues - histogram')
plt.savefig('Revenues.jpg')

df = df.drop('GOC code', 1)
df = df.drop('Company _name', 1)

df.loc[df['CA'] == 'Cliare', 'CA'] = 'Claire'

months = {'1': 'January', '2': 'February', '3': 'March', '4': 'April', '5': 'May',
            '6': 'June', '7': 'July', '8': 'August', '9': 'September', '10': 'October', '11': 'November', '12': 'December'}

df['Renewal date'].replace(months, inplace = True)

month_unique = df['Renewal date'].unique()
revenue = []
clients_per_month = []

months_reverse = {'January': '1', 'February': '2', 'March': '3', 'April': '4', 'May': '5',
            'June': '6', 'July': '7', 'August': '8', 'September': '9', 'October': '10', 'November': '11', 'December': '12'}

#Sorting the month names in order using the 'months_reverse' dictionary
month_unique = sorted(month_unique, key=lambda x: months_reverse [x])

#Adding the monthly totals to the revenue list
for el in month_unique:
   revenue.append(df.loc[df['Renewal date'] == el, 'Renewal fee'].sum())

#Adding the number of clients by month to the list
for el in month_unique:
    clients_per_month.append(len(df[df['Renewal date'] == el]))

plt.figure(2)
plt.subplot(121)
xs = np.arange(len(month_unique))
plt.xticks(np.arange(len(month_unique)), month_unique)
plt.plot(xs, revenue, alpha = 0.7, color = 'g')
plt.grid(True, linestyle = '--', alpha = 1)
plt.title('Revenue per month')

plt.subplot(122)
xs = np.arange(len(month_unique))
plt.xticks(np.arange(len(month_unique)), month_unique)
plt.plot(xs, clients_per_month, alpha = 0.7, color = 'g')
plt.grid(True, linestyle = '--', alpha = 1)
plt.title('Renewals per month')
plt.savefig('Per_month.jpg')

Plotting2
Per_month


In this post I am going through yet another plotting tutorial. The reason being when I started learning pandas and matplotlib I found them extremely confusing. These tools are very versatile and they provide different approaches and solutions. Because of this the beginning of the learning can be intimidating and even later it is a good practice to experiment with different approaches.

What I am going do below is not the best solution. I don’t even know what the best solution would be. I just set out a few goals for myself and use the tools provided in pandas and matplotlib libraries to come up with a solution.

Imagine you are working for an insurance company. Your manager sends you an Excel sheet or csv document with data regarding the renewal of insurance policies and asks you to provide insights of the data. They may ask for something specific, for example find out which client advisor is the most efficient, or which client category provides the greatest revenue.

In short, take a big, complicated data table and translate it into sexy plots and charts.

First I do the imports, load the csv file and print out a few details of it. You can get the dataset from here: https://github.com/zmraz/data-science/tree/master/datasets


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('companies.csv')
print df.shape
print df.info()
print df.describe()

(21, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 8 columns):
GOC code        21 non-null object
Company name    21 non-null object
CA              21 non-null object
Renewal fee     21 non-null int64
Renewal date    21 non-null object
Industry        21 non-null object
Scheme          21 non-null object
Risk            21 non-null object
dtypes: int64(1), object(7)
memory usage: 1.4+ KB
None
       Renewal fee
count    21.000000
mean   2395.238095
std    1685.371232
min    1000.000000
25%    1500.000000
50%    2000.000000
75%    2500.000000
max    9000.000000

It is not a particularly large dataframe.

I start with the two most obvious plots. I want to know how the revenue is distributed so I use a boxplot and histogram.


bins = np.arange(df['Renewal fee'].min(), df['Renewal fee'].max()+600, 500)

plt.figure(1)
plt.subplot(121)
plt.boxplot(df['Renewal fee'])
plt.title('Revenues - boxplot')

plt.subplot(122)
plt.hist(df['Renewal fee'], bins = bins, alpha = 0.7, color = '#112299', edgecolor = 'w')
plt.grid(True, linestyle = '--', alpha = 0.6)
plt.title('Revenues - histogram')
plt.savefig('Revenues.jpg')

Plotting1


Both the boxplot and the histogram show that I have a relatively normal distribution. The median is approximately halfway between the lower and upper quartile and the lowest value is about twice as far from the median as the lower quartile. The histogram  resembles a right-skewed bell curve.

There is however an outlier up around 9000. Let’s take at look at it.


print df[df['Renewal fee'] > 8000]
  GOC code                                       Company name     CA  \
6   MUN101  Jebediah Kerman's Junkyard and Spacecraft Part...  Brian   

   Renewal fee Renewal date   Industry          Scheme     Risk
6         9000     February  Transport  Administration  Extreme  


There is only one sample with such a high renewal fee. Is there any reason? I can see the ‘Risk’ value is ‘Extreme’. I check the dataframe for similar values.


print df['Risk'].unique()
print df[df['Risk'] == 'Extreme']

  GOC code                                       Company name     CA  \
6   MUN101  Jebediah Kerman's Junkyard and Spacecraft Part...  Brian   

   Renewal fee Renewal date   Industry          Scheme     Risk
6         9000     February  Transport  Administration  Extreme  


Again, there is only one sample with ‘Extreme’ risk which happens to be same with the unusually high renewal fee. It is clearly very different from the rest of the data and in this exercise I am trying to find connections among the sample values. I don’t think this particular record would help me in that.

I think it is safe to remove it and redraw the plots.


df = df[df['Renewal fee'] < 9000]

plt.figure(1)
plt.subplot(121)
plt.boxplot(df['Renewal fee'])
plt.title('Revenues - boxplot')

plt.subplot(122)
plt.hist(df['Renewal fee'], bins = bins, alpha = 0.7, color = '#112299', edgecolor = 'w')
plt.grid(True, linestyle = '--', alpha = 0.6)
plt.title('Revenues - histogram')
plt.savefig('Revenues.jpg')

Plotting2


Before I return to plotting it may be a good idea to check the unique values. Who knows there may be other outliers hiding in other columns.


print df['GOC code'].unique()

['YOYO11' 'SCHW46' 'RAD002' ..., 'CYS009' 'TASH76' 'CHO030']

print df['Company name'].unique()

['Yoyodyne Propulsion Systems' 'Rekall, Inc' 'Radikal Kourier Systems' ...,
 'Cyberdyne Systems' 'Tessier-Ashpool' 'CHOAM']

print df['CA'].unique()

['Brian' 'Claire' 'Cliare' 'Eva']

print df['Renewal date'].unique()

['February' 'March' 'April' '4' '3' '5' 'May']

print df['Industry'].unique()

['Transport' 'Computers' 'Mining' 'Robotics' 'General']

print df['Scheme'].unique()

['Administration' 'Consulting']

print df['Risk'].unique()

['Moderate' 'Low' 'High' 'Very high' 'Very High']

All the values in the first two columns are unique. They won’t help me here as I cannot assign them into any groups. Let’s drop them.

In ‘Renewal dates’ some administrators entered numbers while others month names. It would make sense to use one format in the whole column. I convert the numbers into month names as it is easier to read names than numbers.

In the ‘CA’ column there are four names, however ‘Claire’ and ‘Cliare’ are quite similar. Can I contact the department to confirm these names? For the sake of this exercise let’s say yes, I call the Client Administrators and asks if there is ‘Cliare’. As it turns out it is a typo (not a surprise). I correct the typos. In ‘Industries’ I have ‘Very High’ and ‘Very high’. Let’s use ‘Very high’ only.

I fix these issues here.


df = df.drop('GOC code', 1)
df = df.drop('Company _name', 1)

df.loc[df['CA'] == 'Cliare', 'CA'] = 'Claire'

months = {'1': 'January', '2': 'February', '3': 'March', '4': 'April', '5': 'May',
            '6': 'June', '7': 'July', '8': 'August', '9': 'September', '10': 'October', '11': 'November', '12': 'December'}

df['Renewal date'].replace(months, inplace = True)

df.loc[df['Risk'] == 'Very High', 'Risk'] = 'Very high'

I keep working on the renewal fees and plot them by months. It requires some data wrangling. I create 3 lists, the first one is the unique names of the month in the ‘Renewal date’ column. Then another list to hold the total revenue received by those months and a third one to hold total number of clients by months. Finally I plot the revenue and the number of clients next to each other.


month_unique = df['Renewal date'].unique()
revenue = []
clients_per_month = []

months_reverse = {'January': '1', 'February': '2', 'March': '3', 'April': '4', 'May': '5',
            'June': '6', 'July': '7', 'August': '8', 'September': '9', 'October': '10', 'November': '11', 'December': '12'}

#Sorting the month names in order using the 'months_reverse' dictionary
month_unique = sorted(month_unique, key=lambda x: months_reverse [x])

#Adding the monthly totals to the revenue list
for el in month_unique:
   revenue.append(df.loc[df['Renewal date'] == el, 'Renewal fee'].sum())

#Adding the number of clients by month to the list
for el in month_unique:
    clients_per_month.append(len(df[df['Renewal date'] == el]))

plt.figure(2)
plt.subplot(121)
xs = np.arange(len(month_unique))
plt.xticks(np.arange(len(month_unique)), month_unique)
plt.plot(xs, revenue, alpha = 0.7, color = 'g')
plt.grid(True, linestyle = '--', alpha = 1)
plt.title('Revenue per month')

plt.subplot(122)
xs = np.arange(len(month_unique))
plt.xticks(np.arange(len(month_unique)), month_unique)
plt.plot(xs, clients_per_month, alpha = 0.7, color = 'g')
plt.grid(True, linestyle = '--', alpha = 1)
plt.title('Renewals per month')
plt.savefig('Per_month.jpg')

Per_month


Well, it looks a bit scary doesn’t it? Looks like our revenues dropped a lot in the last month. What happened? Is there something we can say about it?

Let’s find out in the second part!


Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s