
In [432]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('fivethirtyeight')
Structured data has two types and four levels.¶
Two Types of Structured Data:¶
- Quantitative data: Numerical in nature.
- Qualitative data: Categorical in nature.
Four Levels of Structured Data:¶
- Nominal level
- Ordinal level
- Interval level
- Ratio level
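For intuition, here is a small illustrative sketch (hypothetical examples, not from the data sets used below) mapping one example column to each of the four levels:
# hypothetical examples of each level, for intuition only
levels = {
    'nominal':  'blood type (A, B, AB, O) -- names only, no order',
    'ordinal':  'survey rating (1-5) -- ordered, but gaps between codes are not meaningful',
    'interval': 'temperature in Celsius -- ordered, differences meaningful, no true zero',
    'ratio':    'salary in dollars -- differences and ratios meaningful, true zero exists',
}
for level, example in levels.items():
    print(f'{level:>8}: {example}')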
In [433]:
# load in the data set; explore salaries of different job titles in San Francisco.
salary_ranges = pd.read_csv('./data/Salary_Ranges_by_Job_Classification.csv')
In [434]:
# view the first few rows and the headers
salary_ranges.head()
Out[434]:
In [435]:
# get...
# 1. how many rows of data there are
# 2. if there are any missing values
# 3. what data type each column has
salary_ranges.info()
In [436]:
# another method to check for missing values
salary_ranges.isnull().sum()
Out[436]:
In [437]:
# Show descriptive stats:
# Three quantitative columns:
# 1. Step
# 2. Union Code
# 3. Extended Step
salary_ranges.describe()
Out[437]:
Work with Quantitative data & Qualitative data¶
- Biweekly High Rate: A quantitative column that refers to the high end of the biweekly salary for a specified job classification.
- Grade: A qualitative column that refers to a type of job and not a quantity of any kind.
Focus on the features that we are more interested in¶
In [438]:
# We are more interested in:
# 1. a single quantitative column, the Biweekly High Rate
# 2. a single qualitative column, Grade (the type of job)
# Pull out 'Biweekly High Rate' & 'Grade'
salary_ranges = salary_ranges[['Biweekly High Rate', 'Grade']]
salary_ranges.head()
Out[438]:
In [439]:
# Show 'Biweekly High Rate' descriptive stats:
salary_ranges['Biweekly High Rate'].describe()
Out[439]:
In [440]:
# Clean the 'Biweekly High Rate' column to remove the dollar sign.
salary_ranges['Biweekly High Rate'] = salary_ranges['Biweekly High Rate'].map(lambda value: value.replace('$', ''))
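As a side note, the same cleanup can be written with pandas' vectorized .str accessor instead of map (a sketch; the behavior here is the same):
# vectorized alternative: strip the literal '$' without a Python-level lambda
salary_ranges['Biweekly High Rate'] = salary_ranges['Biweekly High Rate'].str.replace('$', '', regex=False)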
In [441]:
# Check that the '$' has been removed.
salary_ranges.head()
Out[441]:
In [442]:
# Convert the 'Biweekly High Rate' column to float.
salary_ranges['Biweekly High Rate'] = salary_ranges['Biweekly High Rate'].astype(float)
In [443]:
# Convert the 'Grade' column to string.
salary_ranges['Grade'] = salary_ranges['Grade'].astype(str)
In [444]:
# Check to see if converting the data types worked
salary_ranges.info()
Work with the Nominal level¶
- Nominal level, consists of data that are purely described by name.
- Nominal level, can do basic counts using pandas' value_counts method.
- Nominal level, cannot perform any quantitative mathematical operations, such as addition or division.
- Due to the lack of addition and division, we obviously cannot find an average value at the nominal level; the mode is the natural measure of center instead (see the sketch below).
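A minimal sketch of the mode as the nominal-level measure of center (a hypothetical Series, not the salary data):
# hypothetical nominal data: job categories, names only
jobs = pd.Series(['clerk', 'nurse', 'clerk', 'engineer', 'clerk'])
print(jobs.value_counts())   # counts per category
print(jobs.mode())           # the most frequent category: 'clerk'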
In [445]:
# The 'Grade' column is at the nominal level.
# Basic value counts of the Grade column:
# Grade 0000 appears 61 times,
# Grade 07450 appears 12 times,
# Grade 06870 appears 9 times, and so on.
salary_ranges['Grade'].value_counts().head()
Out[445]:
In [446]:
# Bar Chart of the 'Grade' column
salary_ranges['Grade'].value_counts().sort_values(ascending=False).head(5).plot(kind='bar')
Out[446]:
In [447]:
# the 'Grade' column as a pie chart (top 5 values only)
salary_ranges['Grade'].value_counts().sort_values(ascending=False).head(5).plot(kind='pie')
Out[447]:
Work with the Ordinal level¶
- Ordinal level, naturally ordered.
- Ordinal level, consists of data that are categorical in nature, even if numbers are used to represent the categories.
- Ordinal level, some data values in the column can be considered better than or greater than others.
- Ordinal level, can do basic counts using pandas' value_counts method.
- Ordinal level, can calculate medians and percentiles, because we now have ordering and comparisons (see the sketch below).
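A minimal sketch of medians and percentiles at the ordinal level (hypothetical 1-5 ratings, not the SFO data); the median only needs ordering, whereas a mean would wrongly treat the codes as true quantities:
# hypothetical ordinal ratings coded 1-5
ratings = np.array([1, 2, 2, 3, 3, 3, 4, 5, 5])
print(np.median(ratings))           # 3.0 -- meaningful: only ordering is required
print(np.percentile(ratings, 75))   # the 75th percentile is also well defined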
In [448]:
# load in the data set: a survey of how much people enjoy San Francisco International Airport (SFO).
customer_enjoy = pd.read_csv('./data/2013_SFO_Customer_survey.csv')
In [449]:
# '2013_SFO_Customer_survey.csv' has 3535 rows, 95 columns.
customer_enjoy.shape
Out[449]:
Focus on the features that we are more interested in¶
In [450]:
# Feature 'Q7A_ART': artwork and exhibitions.
# The possible choices for 'Q7A_ART' are 0, 1, 2, 3, 4, 5, 6:
# 1: Unacceptable
# 2: Below Average
# 3: Average
# 4: Good
# 5: Outstanding
# 6: Have Never Used or Visited
# 0: Blank
# Even though the values of Feature 'Q7A_ART' are numbers, those numbers represent a category.
# Feature 'Q7A_ART' belongs to the qualitative side, and more specifically, ordinal.
art_ratings = customer_enjoy['Q7A_ART']
# Show descriptive stats:
art_ratings.describe()
Out[450]:
In [451]:
# Only consider ratings 1-5
art_ratings = art_ratings[(art_ratings >= 1) & (art_ratings <= 5)]
# Cast the values as strings
art_ratings = art_ratings.astype(str)
# Show descriptive stats:
art_ratings.describe()
Out[451]:
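As a side note, the 1-5 filter above can also be written with pandas' between, which reads a bit more directly (a sketch with equivalent behavior; bounds are inclusive by default):
# equivalent filter using Series.between, applied before the cast to string
art_ratings = customer_enjoy['Q7A_ART']
art_ratings = art_ratings[art_ratings.between(1, 5)].astype(str)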
In [452]:
# Ordinal level can use pie charts, just like in nominal level.
art_ratings.value_counts().plot(kind='pie')
Out[452]:
In [453]:
# Ordinal level can use bar charts, just like in nominal level.
art_ratings.value_counts().plot(kind='bar')
Out[453]:
In [454]:
# Boxplots become available at the ordinal level, because medians are now meaningful.
# Note: this box-plots the value counts per rating (the ratings themselves are strings here).
art_ratings.value_counts().plot(kind='box')
Out[454]:
Work with the Interval level¶
- Interval level, quantitative data.
- Interval level, numerical data; not only has ordering like at the ordinal level, but also has meaningful differences between values.
- Interval level, not only may we order and compare values, we may also add and subtract values (see the sketch below).
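A minimal sketch of meaningful differences at the interval level (hypothetical Celsius readings): differences and means make sense, but ratios do not, because 0 degrees Celsius is not a true zero.
# hypothetical temperatures in Celsius (interval level: no true zero)
temps = np.array([10.0, 20.0, 30.0])
print(temps[1] - temps[0])   # 10.0 -- a meaningful difference
print(temps.mean())          # 20.0 -- the arithmetic mean is meaningful too
# temps[1] / temps[0] == 2.0 is NOT meaningful: 20 C is not "twice as hot" as 10 C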
In [455]:
# load in the data set.
# each row records the average temperature of a city for one month,
# with records going back to the 18th century.
climate = pd.read_csv('./data/GlobalLandTemperaturesByCity.csv')
climate.head()
Out[455]:
In [456]:
# see if we have any missing values.
# 'AverageTemperature' has 364130 missing values.
# 'AverageTemperatureUncertainty' also has 364130 missing values.
climate.isnull().sum()
Out[456]:
In [457]:
# remove missing values in order to get a better look.
climate.dropna(axis=0, inplace=True)
In [458]:
climate.head()
Out[458]:
In [459]:
# see if we have any missing values.
# No missing values; all good.
climate.isnull().sum()
Out[459]:
Focus on the features that we are more interested in¶
In [460]:
# 'AverageTemperature' has 111994 unique values.
# that is far too many distinct values for a bar or pie chart.
climate['AverageTemperature'].nunique()
Out[460]:
In [461]:
# the most common graph to use, starting at the interval level, is the histogram.
# a histogram buckets the quantities and shows the frequency of each bucket.
# below: a histogram of 'AverageTemperature' around the world,
# giving a holistic view of the distribution.
climate['AverageTemperature'].hist()
Out[461]:
In [462]:
# Show descriptive stats.
climate['AverageTemperature'].describe()
Out[462]:
Add more features that we are interested in¶
In [463]:
# convert the dt column to datetime.
climate['dt'] = pd.to_datetime(climate['dt'])
# extract the year
climate['year'] = climate['dt'].map(lambda value: value.year)
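Equivalently, pandas' .dt accessor extracts the year without a Python-level lambda (a sketch; same result):
# vectorized alternative to the map/lambda above
climate['year'] = climate['dt'].dt.year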
In [464]:
climate.head()
Out[464]:
In [465]:
# subset the data to only the temperatures recorded in China.
# .copy() avoids pandas' SettingWithCopyWarning when we add columns below.
climate_sub_china = climate.loc[climate['Country'] == 'China'].copy()
In [466]:
# year 1983 becomes century 20; year 1750 becomes century 18.
# use integer division so that 1983 maps to 20 (rounding 1983/100 + 1 = 20.83 would give 21).
climate_sub_china['century'] = climate_sub_china['year'].map(lambda x: x // 100 + 1)
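A quick standalone sanity check of the century arithmetic:
# spot-check the integer-division mapping
assert 1983 // 100 + 1 == 20
assert 1750 // 100 + 1 == 18
assert 2013 // 100 + 1 == 21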
In [468]:
climate_sub_china.head()
Out[468]:
In [469]:
# with the new column century
# let's plot four histograms of temperature, one for each century.
climate_sub_china['AverageTemperature'].hist(by=climate_sub_china['century'],
sharex=True, sharey=True,
figsize=(10, 10),
bins=10)
Out[469]:
In [470]:
climate_sub_china.groupby('century')['AverageTemperature'].mean().plot(kind='line')
Out[470]:
In [471]:
century_changes_china = climate_sub_china.groupby('century')['AverageTemperature'].mean()
century_changes_china
Out[471]:
In [472]:
century_changes_china[21] - century_changes_china[19]
Out[472]:
Better Plotting at the interval level¶
In [473]:
# oof, that's not pretty. There is a lot of noise, and that is to be expected:
# every year has many recorded 'AverageTemperature' values,
# so we see a vertical stack of points at each year.
x = climate_sub_china['year']
y = climate_sub_china['AverageTemperature']
fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(x, y)
plt.show()
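One optional readability tweak (my suggestion, not in the original): lowering the marker alpha turns the overplotting into a density cue.
# same scatter, with transparency so dense regions show up darker
fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(x, y, alpha=0.1)
plt.show()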
In [474]:
# group by the year column and take the mean; this removes much of the noise.
climate_sub_china.groupby('year')['AverageTemperature'].mean().plot()
Out[474]:
In [475]:
# take a rolling mean over the years, with a window of 10 years.
# each plotted point is the mean of the yearly averages over the preceding 10 years.
climate_sub_china.groupby('year')['AverageTemperature'].mean().rolling(10).mean().plot()
Out[475]:
Work with the Ratio level¶
- Ratio level; like the interval level, still working with quantitative data.
- Ratio level; inherits addition and subtraction from the interval level.
- Ratio level; adds the ability to multiply and divide values, which is what makes ratios meaningful (see the sketch below).
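A minimal sketch of why division is meaningful at the ratio level (hypothetical salaries; money has a true zero, so "twice as much" is well defined):
# hypothetical biweekly salaries in dollars (ratio level: true zero exists)
salaries = np.array([2000.0, 4000.0])
print(salaries[1] / salaries[0])   # 2.0 -- one salary really is twice the other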
In [476]:
# What is the average rate across all of the Grades?
# Which 20 Grades have the highest Biweekly High Rate?
fig = plt.figure(figsize=(15,5))
ax = fig.gca()
salary_ranges.groupby('Grade')[['Biweekly High Rate']].mean().sort_values(
'Biweekly High Rate', ascending=False).head(20).plot.bar(stacked=False,
ax=ax, color='darkorange')
ax.set_title('Top 20 Grade by Mean Biweekly High Rate')
Out[476]:
In [477]:
# Which 20 Grades have the lowest Biweekly High Rate?
fig = plt.figure(figsize=(15,5))
ax = fig.gca()
salary_ranges.groupby('Grade')[['Biweekly High Rate']].mean().sort_values(
'Biweekly High Rate', ascending=False).tail(20).plot.bar(stacked=False,
ax=ax, color='darkorange')
ax.set_title('Bottom 20 Grade by Mean Biweekly High Rate')
Out[477]:
In [478]:
# money is at the ratio level.
# we can also find the ratio of the highest-paid employee to the lowest-paid employee.
salary_df = salary_ranges.groupby('Grade')[['Biweekly High Rate']].mean().sort_values(
'Biweekly High Rate', ascending=False)
salary_df.iloc[0, 0] / salary_df.iloc[-1, 0]
Out[478]:
In [ ]: