Feature_Understanding
In [432]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('fivethirtyeight')

Structured Data has "Two Types" and "Four Levels".

Two Types of Structured Data:

  1. Quantitative data: Numerical in nature.
  2. Qualitative data: Categorical in nature.

Four Levels of Structured Data:

  1. Nominal level
  2. Ordinal level
  3. Interval level
  4. Ratio level
In [433]:
# load in the data set; explore salaries of different job titles in San Francisco. 
salary_ranges = pd.read_csv('./data/Salary_Ranges_by_Job_Classification.csv')
In [434]:
# view the first few rows and the headers
salary_ranges.head()
Out[434]:
SetID Job Code Eff Date Sal End Date Salary SetID Sal Plan Grade Step Biweekly High Rate Biweekly Low Rate Union Code Extended Step Pay Type
0 COMMN 0109 07/01/2009 12:00:00 AM 06/30/2010 12:00:00 AM COMMN SFM 00000 1 $0.00 $0.00 330 0 C
1 COMMN 0110 07/01/2009 12:00:00 AM 06/30/2010 12:00:00 AM COMMN SFM 00000 1 $15.00 $15.00 323 0 D
2 COMMN 0111 07/01/2009 12:00:00 AM 06/30/2010 12:00:00 AM COMMN SFM 00000 1 $25.00 $25.00 323 0 D
3 COMMN 0112 07/01/2009 12:00:00 AM 06/30/2010 12:00:00 AM COMMN SFM 00000 1 $50.00 $50.00 323 0 D
4 COMMN 0114 07/01/2009 12:00:00 AM 06/30/2010 12:00:00 AM COMMN SFM 00000 1 $100.00 $100.00 323 0 M
In [435]:
# get...
# 1. how many rows of data there are
# 2. if there are any missing values
# 3. what data type each column has
salary_ranges.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1356 entries, 0 to 1355
Data columns (total 13 columns):
SetID                 1356 non-null object
Job Code              1356 non-null object
Eff Date              1356 non-null object
Sal End Date          1356 non-null object
Salary SetID          1356 non-null object
Sal Plan              1356 non-null object
Grade                 1356 non-null object
Step                  1356 non-null int64
Biweekly High Rate    1356 non-null object
Biweekly Low Rate     1356 non-null object
Union Code            1356 non-null int64
Extended Step         1356 non-null int64
Pay Type              1356 non-null object
dtypes: int64(3), object(10)
memory usage: 137.8+ KB
In [436]:
# another method to check for missing values
salary_ranges.isnull().sum()
Out[436]:
SetID                 0
Job Code              0
Eff Date              0
Sal End Date          0
Salary SetID          0
Sal Plan              0
Grade                 0
Step                  0
Biweekly High Rate    0
Biweekly Low Rate     0
Union Code            0
Extended Step         0
Pay Type              0
dtype: int64

The describe method will default to describe quantitative columns.

The describe method will describe qualitative columns if there are no quantitative columns.

In [437]:
# Show descriptive stats:
# Three quantitative columns: 
# 1. Step 
# 2. Union Code 
# 3. Extended Step
salary_ranges.describe()
Out[437]:
Step Union Code Extended Step
count 1356.000000 1356.000000 1356.000000
mean 1.294985 392.676991 0.150442
std 1.045816 338.100562 1.006734
min 1.000000 1.000000 0.000000
25% 1.000000 21.000000 0.000000
50% 1.000000 351.000000 0.000000
75% 1.000000 790.000000 0.000000
max 5.000000 990.000000 11.000000

Work with Quantitative data & Qualitative data

  1. Biweekly High Rate: A quantitative column that refers to the average high weekly salary for a specified department.
  2. Grade: A qualitative column that refers to a department and not a quantity of any kind.

To do with the features that we are more interested in

In [438]:
# We are more intertested in: 
# 1. a single quantitative column, the Biweekly High Rate 
# 2. a single qualitative column, Grade (the type of job)
# Pull out 'Biweekly High Rate' & 'Grade'
salary_ranges = salary_ranges[['Biweekly High Rate', 'Grade']]
salary_ranges.head()
Out[438]:
Biweekly High Rate Grade
0 $0.00 00000
1 $15.00 00000
2 $25.00 00000
3 $50.00 00000
4 $100.00 00000
In [439]:
# Show 'Biweekly High Rate' descriptive stats:
salary_ranges['Biweekly High Rate'].describe()
Out[439]:
count         1356
unique         593
top       $3460.00
freq            12
Name: Biweekly High Rate, dtype: object
In [440]:
# Clean our Biweekly High columns to remove the dollar sign.
salary_ranges['Biweekly High Rate'] = salary_ranges['Biweekly High Rate'].map(lambda value: value.replace('


#39;, ''))

In [441]:
# Check to see the '


#39; has been removed.
salary_ranges.head()

Out[441]:
Biweekly High Rate Grade
0 0.00 00000
1 15.00 00000
2 25.00 00000
3 50.00 00000
4 100.00 00000
In [442]:
# Convert the 'Biweekly High Rate' columns to float.
salary_ranges['Biweekly High Rate'] = salary_ranges['Biweekly High Rate'].astype(float)
In [443]:
# Convert the 'Grade' columns to string.
salary_ranges['Grade'] = salary_ranges['Grade'].astype(str)
In [444]:
# Check to see if converting the data types worked
salary_ranges.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1356 entries, 0 to 1355
Data columns (total 2 columns):
Biweekly High Rate    1356 non-null float64
Grade                 1356 non-null object
dtypes: float64(1), object(1)
memory usage: 21.3+ KB

Work with the Nominal level

  1. Nominal level, consists of data that are purely described by name.
  2. Norminal level, can do basic counts using pandas' value_counts methods.
  3. Norminal level, cannot perform any quantitative mathematical operations, such as addition or division.
  4. Due to the lack of addition and division, we obviously cannot find an average value at the nominal level.
In [445]:
# 'Grade' column is Nominal level.
# Basic Value Counts of the Grade column.
# Grade 0000 has the number of 61.
# Grade 07450 has the number of 12.
# Grade 06870 has the number of 9, and so on.
salary_ranges['Grade'].value_counts().head()
Out[445]:
00000    61
07450    12
06870     9
07170     9
07420     9
Name: Grade, dtype: int64
In [446]:
# Bar Chart of the 'Grade' column
salary_ranges['Grade'].value_counts().sort_values(ascending=False).head(5).plot(kind='bar')
Out[446]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2136c550>
In [447]:
# the 'Grade' column as a pie chart (top 5 values only)
salary_ranges['Grade'].value_counts().sort_values(ascending=False).head(5).plot(kind='pie')
Out[447]:
<matplotlib.axes._subplots.AxesSubplot at 0x1095f01d0>

Work with the Ordinal level

  1. Ordinal level, naturally ordered.
  2. Ordinal level, consists of data that are categorical in nature, even if numbers are used to represent the categories.
  3. Ordinal level, some data values in the column can be considered better than or greater than others.
  4. Ordinal level, can do basic counts using pandas' value_counts methods.
  5. Ordinal level, can calculate medians and percentiles, because we now have ordering and comparisons.
In [448]:
# load in the data set; how much people enjoy the San Francisco International Airport or SFO. 
customer_enjoy = pd.read_csv('./data/2013_SFO_Customer_survey.csv')
In [449]:
# '2013_SFO_Customer_survey.csv' has 3535 rows, 95 columns.
customer_enjoy.shape
Out[449]:
(3535, 95)

To do with the features that we are more interested in

In [450]:
# Feature 'Q7A_ART': artwork and exhibitions. 
# Feature 'Q7A_ART'; The possible choices are 0, 1, 2, 3, 4, 5, 6.
# 1: Unacceptable
# 2: Below Average
# 3: Average
# 4: Good
# 5: Outstanding
# 6: Have Never Used or Visited
# 0: Blank
# Even though the values of Feature 'Q7A_ART' are numbers, those numbers represent a category.
# Feature 'Q7A_ART' belongs to the qualitative side, and more specifically, ordinal.
art_ratings = customer_enjoy['Q7A_ART'] 

# Show descriptive stats:
art_ratings.describe()
Out[450]:
count    3535.000000
mean        4.300707
std         1.341445
min         0.000000
25%         3.000000
50%         4.000000
75%         5.000000
max         6.000000
Name: Q7A_ART, dtype: float64
In [451]:
# Only consider ratings 1-5
art_ratings = art_ratings[(art_ratings >= 1) & (art_ratings <= 5)]

# Cast the values as strings
art_ratings = art_ratings.astype(str) 

# Show descriptive stats:
art_ratings.describe()
Out[451]:
count     2656
unique       5
top          4
freq      1066
Name: Q7A_ART, dtype: object
In [452]:
# Ordinal level can use pie charts, just like in nominal level.
art_ratings.value_counts().plot(kind='pie')
Out[452]:
<matplotlib.axes._subplots.AxesSubplot at 0x1be75aa160>
In [453]:
#  Ordinal level can use bar charts, just like in nominal level.
art_ratings.value_counts().plot(kind='bar')
Out[453]:
<matplotlib.axes._subplots.AxesSubplot at 0x1bf3d72278>
In [454]:
# Boxplots are available at the ordinal level.
art_ratings.value_counts().plot(kind='box')
Out[454]:
<matplotlib.axes._subplots.AxesSubplot at 0x1be29d0278>

Work with the Interval level

  1. Interval level, quantitative data.
  2. Interval level, numerical data; not only has ordering like at the ordinal level, but also has meaningful differences between values.
  3. Interval level, not only may we order and compare values, we may also add and subtract values.
In [455]:
# load in the data set.
# each row quantifies the average temperature of cities around the world by the month.
# going back to the 18th century. 
climate = pd.read_csv('./data/GlobalLandTemperaturesByCity.csv')
climate.head()
Out[455]:
dt AverageTemperature AverageTemperatureUncertainty City Country Latitude Longitude
0 1743-11-01 6.068 1.737 Århus Denmark 57.05N 10.33E
1 1743-12-01 NaN NaN Århus Denmark 57.05N 10.33E
2 1744-01-01 NaN NaN Århus Denmark 57.05N 10.33E
3 1744-02-01 NaN NaN Århus Denmark 57.05N 10.33E
4 1744-03-01 NaN NaN Århus Denmark 57.05N 10.33E
In [456]:
# see if we have any missing values.
# 'AverageTemperature' has 364130 missing values.
# 'AverageTemperatureUncertainty' also has 364130 missing values.
climate.isnull().sum() 
Out[456]:
dt                                    0
AverageTemperature               364130
AverageTemperatureUncertainty    364130
City                                  0
Country                               0
Latitude                              0
Longitude                             0
dtype: int64
In [457]:
# remove missing values in order to get a better look.
climate.dropna(axis=0, inplace=True)
In [458]:
climate.head()
Out[458]:
dt AverageTemperature AverageTemperatureUncertainty City Country Latitude Longitude
0 1743-11-01 6.068 1.737 Århus Denmark 57.05N 10.33E
5 1744-04-01 5.788 3.624 Århus Denmark 57.05N 10.33E
6 1744-05-01 10.644 1.283 Århus Denmark 57.05N 10.33E
7 1744-06-01 14.051 1.347 Århus Denmark 57.05N 10.33E
8 1744-07-01 16.082 1.396 Århus Denmark 57.05N 10.33E
In [459]:
# see if we have any missing values.
# No missing values; all good.
climate.isnull().sum()
Out[459]:
dt                               0
AverageTemperature               0
AverageTemperatureUncertainty    0
City                             0
Country                          0
Latitude                         0
Longitude                        0
dtype: int64

To do with the features that we are more interested in

In [460]:
# the number of 'AverageTemperature' is 111994.
# 'AverageTemperature' has too many values. 
# we cannot use a bar/pie chart for 'AverageTemperature'.
climate['AverageTemperature'].nunique()
Out[460]:
111994
In [461]:
# the most common graph to utilize starting at interval level is "histogram". 
# "histogram"; visualizes buckets of quantities and shows frequencies of these buckets.
# a histogram for the 'AverageTemperature' around the world.
# see the distribution of 'AverageTemperature' in a very holistic view.
climate['AverageTemperature'].hist()
Out[461]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c4907c048>
In [462]:
# Show descriptive stats.
climate['AverageTemperature'].describe()
Out[462]:
count    8.235082e+06
mean     1.672743e+01
std      1.035344e+01
min     -4.270400e+01
25%      1.029900e+01
50%      1.883100e+01
75%      2.521000e+01
max      3.965100e+01
Name: AverageTemperature, dtype: float64

To add more features that we are interested in

In [463]:
# convert the dt column to datetime.
climate['dt'] = pd.to_datetime(climate['dt'])

# extract the year 
climate['year'] = climate['dt'].map(lambda value: value.year)
In [464]:
climate.head()
Out[464]:
dt AverageTemperature AverageTemperatureUncertainty City Country Latitude Longitude year
0 1743-11-01 6.068 1.737 Århus Denmark 57.05N 10.33E 1743
5 1744-04-01 5.788 3.624 Århus Denmark 57.05N 10.33E 1744
6 1744-05-01 10.644 1.283 Århus Denmark 57.05N 10.33E 1744
7 1744-06-01 14.051 1.347 Århus Denmark 57.05N 10.33E 1744
8 1744-07-01 16.082 1.396 Århus Denmark 57.05N 10.33E 1744
In [465]:
# subset the data to only be the temperatures recorded in the China:
climate_sub_china = climate.loc[climate['Country'] == 'China']
In [466]:
# year 1983 would become 20.
# year 1750 would become 18.
climate_sub_china['century'] = climate_sub_china['year'].map(lambda x: x/100+1)
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
In [467]:
climate_sub_china['century'] = climate_sub_china['century'].round().astype(int) 
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
In [468]:
climate_sub_china.head()
Out[468]:
dt AverageTemperature AverageTemperatureUncertainty City Country Latitude Longitude year century
11925 1820-08-01 20.146 2.286 Ürümqi China 44.20N 87.20E 1820 19
11926 1820-09-01 15.331 1.775 Ürümqi China 44.20N 87.20E 1820 19
11927 1820-10-01 5.256 1.848 Ürümqi China 44.20N 87.20E 1820 19
11928 1820-11-01 -4.956 2.190 Ürümqi China 44.20N 87.20E 1820 19
11929 1820-12-01 -14.118 2.581 Ürümqi China 44.20N 87.20E 1820 19
In [469]:
# with the new column century
# let's plot four histograms of temperature, one for each century.
climate_sub_china['AverageTemperature'].hist(by=climate_sub_china['century'],
    sharex=True, sharey=True,
    figsize=(10, 10),
    bins=10)
Out[469]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x1d6923c1d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1c43b37208>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x1c43b5b8d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1c43b05f98>]],
      dtype=object)
In [470]:
climate_sub_china.groupby('century')['AverageTemperature'].mean().plot(kind='line')
Out[470]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c43a8d7f0>
In [471]:
centry_changes_china = climate_sub_china.groupby('century')['AverageTemperature'].mean()

centry_changes_china
Out[471]:
century
19     9.865149
20    12.547079
21    13.219288
Name: AverageTemperature, dtype: float64
In [472]:
centry_changes_china[21] - centry_changes_china[19]
Out[472]:
3.3541395924040245

Better Plotting at the interval level

In [473]:
# oof, that's not pretty. It seems to be a lot of noise, and that is to be expected.
# every year has lots of 'AverageTemperature'.
# so we see many vertical points (the values of 'AverageTemperature') at each year.
x = climate_sub_china['year']
y = climate_sub_china['AverageTemperature']
fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(x, y)
plt.show()
In [474]:
#  employ a groupby the year column to do "mean" ; remove much of this noise.
climate_sub_china.groupby('year').mean()['AverageTemperature'].plot()
Out[474]:
<matplotlib.axes._subplots.AxesSubplot at 0x1bf3a09d30>
In [475]:
# taking a rolling mean over the years; rolling = 10.
# rolling = 10; time windows is 10 years.
# time windows is 10 years; plot the mean of 'AverageTemperature' in every 10 years interval.
climate_sub_china.groupby('year').mean()['AverageTemperature'].rolling(10).mean().plot()
Out[475]:
<matplotlib.axes._subplots.AxesSubplot at 0x1bf3c46fd0>

Work with the ratio level

  1. ratio level; like the interval level, still working with quantitative data.
  2. ratio level; inherit addition and subtraction from the interval level.
  3. ratio level; the ability to multiply and divide values.
In [476]:
# What is the average rate across all of the Grades
# Which 20 Grades have the highest Biweekly high rate
fig = plt.figure(figsize=(15,5))
ax = fig.gca()

salary_ranges.groupby('Grade')[['Biweekly High Rate']].mean().sort_values(
    'Biweekly High Rate', ascending=False).head(20).plot.bar(stacked=False,
                                                             ax=ax, color='darkorange') 

ax.set_title('Top 20 Grade by Mean Biweekly High Rate')
Out[476]:
Text(0.5, 1.0, 'Top 20 Grade by Mean Biweekly High Rate')
In [477]:
# Which 20 Grades have the lowest Biweekly high rate
fig = plt.figure(figsize=(15,5))
ax = fig.gca()

salary_ranges.groupby('Grade')[['Biweekly High Rate']].mean().sort_values(
    'Biweekly High Rate', ascending=False).tail(20).plot.bar(stacked=False,
                                                             ax=ax, color='darkorange') 

ax.set_title('Bottom 20 Grade by Mean Biweekly High Rate')
Out[477]:
Text(0.5, 1.0, 'Bottom 20 Grade by Mean Biweekly High Rate')
In [478]:
# money is at the ratio level.
# we can also find the ratio of the highest-paid employee to the lowest-paid employee.
salary_df = salary_ranges.groupby('Grade')[['Biweekly High Rate']].mean().sort_values(
    'Biweekly High Rate', ascending=False)

salary_df.iloc[0][0] / salary_df.iloc[-1][0]
Out[478]:
13.931919540229886
In [ ]:
 

发表评论

电子邮件地址不会被公开。 必填项已用*标注

此站点使用Akismet来减少垃圾评论。了解我们如何处理您的评论数据