Data analysis in Python with Pandas¶

Instructor notes¶

Estimated teaching time: 30 min

Estimated challenge time: 30 min

Key questions:

“How can I import data in Python?”
“What is Pandas?”
“Why should I use Pandas to work with data?”

Learning objectives:

“Navigate the workshop directory and download a dataset.”
“Explain what a library is and what libraries are used for.”
“Describe what the Python Data Analysis Library (Pandas) is.”
“Load the Python Data Analysis Library (Pandas).”
“Use read_csv to read tabular data into Python.”
“Describe what a DataFrame is in Python.”
“Access and summarize data stored in a DataFrame.”
“Define indexing as it relates to data structures.”
“Perform basic mathematical operations and summary statistics on data in a Pandas DataFrame.”
“Create simple plots.”

Automating data analysis tasks in Python¶

We can automate the process of performing data manipulations in Python. It’s efficient to spend time building the code to perform these tasks because once it’s built, we can use it over and over on different datasets that use a similar format. This makes our methods easily reproducible. We can also easily share our code with colleagues and they can replicate the same analysis.

The Dataset¶

For this lesson, we will be using the Portal Teaching data, a subset of the data from Ernst et al., Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA.

The dataset can be downloaded here: surveys.csv. Don’t click to download it in your browser, though; we are going to use Python!

import urllib.request
# You can also get this URL value by right-clicking the `surveys.csv` link above and selecting "Copy Link Address"
url = 'https://monashdatafluency.github.io/python-workshop-base/modules/data/surveys.csv'
# url = 'https://goo.gl/9ZxqBg'  # or a shortened version to save typing
urllib.request.urlretrieve(url, 'surveys.csv')

('surveys.csv', <http.client.HTTPMessage at 0x7fbdc67c6460>)

If Jupyter is running locally on your computer, you’ll now have a file surveys.csv in the current working directory. You can check by clicking on the File tab at the top left of the notebook to see if the file exists. If you are running Jupyter on a remote server or cloud service (e.g. Colaboratory or Azure Notebooks), the file will be saved on that remote machine instead.
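If you would rather confirm the download from code than through the file browser, a minimal sketch using the standard library's pathlib module is shown below. It assumes the file was saved under the name surveys.csv in the current working directory, as in the download cell; the variable names are only illustrative.

```python
from pathlib import Path

# Path to the file we expect the download step to have created
# (assumes the name 'surveys.csv' used in the download cell).
csv_path = Path('surveys.csv')

# exists() returns True if the file is present in the current working directory
file_found = csv_path.exists()
print(csv_path.resolve(), 'found' if file_found else 'not found')
```

On a remote or cloud notebook this check runs on the remote machine, so it reports on the copy stored there.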

We are studying the species and weight of animals caught in plots in our study area. The dataset is stored as a .csv file: each row holds information for a single animal, and the columns represent:

Column	Description
record_id	Unique id for the observation
month	month of observation
day	day of observation
year	year of observation
site_id	ID of a particular plot
species_id	2-letter species code
sex	sex of animal (“M”, “F”)
hindfoot_length	length of the hindfoot in mm
weight	weight of the animal in grams

The first few rows of our file look like this:

record_id,month,day,year,site_id,species_id,sex,hindfoot_length,weight
1,7,16,1977,2,NL,M,32,
2,7,16,1977,3,NL,M,33,
3,7,16,1977,2,DM,F,37,
4,7,16,1977,7,DM,M,36,
5,7,16,1977,3,DM,M,35,
6,7,16,1977,1,PF,M,14,
7,7,16,1977,2,PE,F,,
8,7,16,1977,1,DM,M,37,
9,7,16,1977,1,DM,F,34,
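To see how the rows and columns of a .csv file map onto values, here is a small sketch using Python's built-in csv module. It pastes three of the rows shown above into a string so that it runs without the downloaded file; the variable names are only illustrative.

```python
import csv
import io

# Three of the data rows shown above, pasted in as a string so the
# example runs without the downloaded file.
sample = """record_id,month,day,year,site_id,species_id,sex,hindfoot_length,weight
1,7,16,1977,2,NL,M,32,
2,7,16,1977,3,NL,M,33,
7,7,16,1977,2,PE,F,,
"""

# DictReader uses the header line to name the fields in each row
rows = list(csv.DictReader(io.StringIO(sample)))

for row in rows:
    # Missing measurements show up as empty strings
    print(row['record_id'], row['species_id'],
          repr(row['hindfoot_length']), repr(row['weight']))
```

Note that every value comes back as text, and a missing measurement is just an empty string between two commas.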

About Libraries¶

A library in Python contains a set of tools (called functions) that perform tasks on our data. Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench for use in a project. Once a library is set up, it can be used or called to perform many tasks.

You may have noticed the line import urllib.request in the previous code. There we imported the request module from the urllib library and called its urlretrieve function to download our dataset from the web.
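As a further illustration of importing a library and calling one of its functions, the sketch below uses urllib.parse, a sibling module of urllib.request. It is not part of the lesson's own code and was picked only because it runs without a network connection.

```python
import urllib.parse

url = 'https://monashdatafluency.github.io/python-workshop-base/modules/data/surveys.csv'

# urlparse is a function in the urllib.parse module; it splits a URL into parts
parts = urllib.parse.urlparse(url)
print(parts.netloc)   # the host name
print(parts.path)     # the path to the file on that host
```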

Pandas in Python¶

The dataset we have is in table format. One of the best options for working with tabular data in Python is to use the Python Data Analysis Library (a.k.a. Pandas). The Pandas library provides data structures, produces high-quality plots with matplotlib, and integrates nicely with other libraries that use NumPy arrays (NumPy is another Python library).

First, let’s make sure the Pandas and matplotlib packages are installed.

!pip install pandas matplotlib

Requirement already satisfied: pandas in /home/danph/Repos/win_ssd/myprojects/python-workshop-base/.venv/lib/python3.8/site-packages (1.4.0)
Requirement already satisfied: matplotlib in /home/danph/Repos/win_ssd/myprojects/python-workshop-base/.venv/lib/python3.8/site-packages (3.5.1)
Requirement already satisfied: pytz>=2020.1 in /home/danph/Repos/win_ssd/myprojects/python-workshop-base/.venv/lib/python3.8/site-packages (from pandas) (2021.3)
Requirement already satisfied: numpy>=1.18.5 in /home/danph/Repos/win_ssd/myprojects/python-workshop-base/.venv/lib/python3.8/site-packages (from pandas) (1.22.2)
Requirement already satisfied: python-dateutil>=2.8.1 in /home/danph/Repos/win_ssd/myprojects/python-workshop-base/.venv/lib/python3.8/site-packages (from pandas) (2.8.2)

Requirement already satisfied: pyparsing>=2.2.1 in /home/danph/Repos/win_ssd/myprojects/python-workshop-base/.venv/lib/python3.8/site-packages (from matplotlib) (3.0.7)
Requirement already satisfied: fonttools>=4.22.0 in /home/danph/Repos/win_ssd/myprojects/python-workshop-base/.venv/lib/python3.8/site-packages (from matplotlib) (4.29.1)
Requirement already satisfied: cycler>=0.10 in /home/danph/Repos/win_ssd/myprojects/python-workshop-base/.venv/lib/python3.8/site-packages (from matplotlib) (0.11.0)
Requirement already satisfied: packaging>=20.0 in /home/danph/Repos/win_ssd/myprojects/python-workshop-base/.venv/lib/python3.8/site-packages (from matplotlib) (21.3)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/danph/Repos/win_ssd/myprojects/python-workshop-base/.venv/lib/python3.8/site-packages (from matplotlib) (1.3.2)
Requirement already satisfied: pillow>=6.2.0 in /home/danph/Repos/win_ssd/myprojects/python-workshop-base/.venv/lib/python3.8/site-packages (from matplotlib) (9.0.1)
Requirement already satisfied: six>=1.5 in /home/danph/Repos/win_ssd/myprojects/python-workshop-base/.venv/lib/python3.8/site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)

Python doesn’t load all of the libraries available to it by default. We have to add an import statement to our code in order to use library functions. To import a library, we use the syntax import libraryName. If we want to give the library a nickname to shorten the command, we can add as nickNameHere. An example of importing the pandas library using the common nickname pd is below.

import pandas as pd

Each time we call a function that’s in a library, we use the syntax LibraryName.FunctionName. Adding the library name with a . before the function name tells Python where to find the function. In the example above, we have imported Pandas as pd. This means we don’t have to type out pandas each time we call a Pandas function.
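As a small sketch of the LibraryName.FunctionName pattern, the example below calls the read_csv function through the pd nickname. To stay self-contained it reads three sample rows from a string rather than from the downloaded file; the names sample and sample_df are only illustrative.

```python
import io
import pandas as pd

# A few rows in the same format as surveys.csv
sample = """record_id,month,day,year,site_id,species_id,sex,hindfoot_length,weight
1,7,16,1977,2,NL,M,32,
2,7,16,1977,3,NL,M,33,
3,7,16,1977,2,DM,F,37,
"""

# pd.read_csv is the read_csv function from the pandas library (nicknamed pd)
sample_df = pd.read_csv(io.StringIO(sample))
print(sample_df)
```

Calling pd.read_csv('surveys.csv') on the downloaded file instead would load all of the records. Empty fields, such as the missing weights here, are read in as NaN.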

	record_id	month	day	year	site_id	species_id	sex	hindfoot_length	weight
0	1	7	16	1977	2	NL	M	32.0	NaN
1	2	7	16	1977	3	NL	M	33.0	NaN
2	3	7	16	1977	2	DM	F	37.0	NaN
3	4	7	16	1977	7	DM	M	36.0	NaN
4	5	7	16	1977	3	DM	M	35.0	NaN
...	...	...	...	...	...	...	...	...	...
35544	35545	12	31	2002	15	AH	NaN	NaN	NaN
35545	35546	12	31	2002	15	AH	NaN	NaN	NaN
35546	35547	12	31	2002	10	RM	F	15.0	14.0
35547	35548	12	31	2002	7	DO	M	36.0	51.0
35548	35549	12	31	2002	5	NaN	NaN	NaN	NaN

	record_id	month	day	year	site_id	hindfoot_length	weight
sex
F	18036.412046	6.583047	16.007138	1990.644997	11.440854	28.836780	42.170555
M	17754.835601	6.392668	16.184286	1990.480401	11.098282	29.709578	42.995379

	record_id	month	day	year	site_id	species_id	hindfoot_length	weight
sex
F	15690	15690	15690	15690	15690	15690	14894	15303
M	17348	17348	17348	17348	17348	17348	16476	16879

	count	mean	std	min	25%	50%	75%	max
site_id
1	1903.0	51.822911	38.176670	4.0	30.0	44.0	53.0	231.0
2	2074.0	52.251688	46.503602	5.0	24.0	41.0	50.0	278.0
3	1710.0	32.654386	35.641630	4.0	14.0	23.0	36.0	250.0
4	1866.0	47.928189	32.886598	4.0	30.0	43.0	50.0	200.0
5	1092.0	40.947802	34.086616	5.0	21.0	37.0	48.0	248.0
6	1463.0	36.738893	30.648310	5.0	18.0	30.0	45.0	243.0
7	638.0	20.663009	21.315325	4.0	11.0	17.0	23.0	235.0
8	1781.0	47.758001	33.192194	5.0	26.0	44.0	51.0	178.0
9	1811.0	51.432358	33.724726	6.0	36.0	45.0	50.0	275.0
10	279.0	18.541219	20.290806	4.0	10.0	12.0	21.0	237.0
11	1793.0	43.451757	28.975514	5.0	26.0	42.0	48.0	212.0
12	2219.0	49.496169	41.630035	6.0	26.0	42.0	50.0	280.0
13	1371.0	40.445660	34.042767	5.0	20.5	33.0	45.0	241.0
14	1728.0	46.277199	27.570389	5.0	36.0	44.0	49.0	222.0
15	869.0	27.042578	35.178142	4.0	11.0	18.0	26.0	259.0
16	480.0	24.585417	17.682334	4.0	12.0	20.0	34.0	158.0
17	1893.0	47.889593	35.802399	4.0	27.0	42.0	50.0	216.0
18	1351.0	40.005922	38.480856	5.0	17.5	30.0	44.0	256.0
19	1084.0	21.105166	13.269840	4.0	11.0	19.0	27.0	139.0
20	1222.0	48.665303	50.111539	5.0	17.0	31.0	47.0	223.0
21	1029.0	24.627794	21.199819	4.0	10.0	22.0	31.0	190.0
22	1298.0	54.146379	38.743967	5.0	29.0	42.0	54.0	212.0
23	369.0	19.634146	18.382678	4.0	10.0	14.0	23.0	199.0
24	960.0	43.679167	45.936588	4.0	19.0	27.5	45.0	251.0
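Per-group summaries like the tables above can be produced with the groupby method of a DataFrame. The following is a minimal sketch on a hand-made toy table with hypothetical values, not taken from surveys.csv. It is an assumption that the tables above came from calls of this shape, since the page does not show the code that made them.

```python
import pandas as pd

# A tiny hand-made table (hypothetical values, not taken from surveys.csv)
toy_df = pd.DataFrame({
    'site_id': [1, 1, 2, 2, 2],
    'sex': ['F', 'M', 'F', 'M', 'M'],
    'weight': [40.0, 44.0, 50.0, None, 54.0],
})

# Summary statistics of weight for each site; missing values are skipped
by_site = toy_df.groupby('site_id')['weight'].describe()
print(by_site)

# Mean weight for each sex
mean_by_sex = toy_df.groupby('sex')['weight'].mean()
print(mean_by_sex)
```

The describe call returns one row per group with the count, mean, std, min, quartiles and max of the selected column.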

sex	F	M
site_id
1	38253.0	59979.0
2	50144.0	57250.0
3	27251.0	28253.0
4	39796.0	49377.0
5	21143.0	23326.0
6	26210.0	27245.0
7	6522.0	6422.0
8	37274.0	47755.0
9	44128.0	48727.0
10	2359.0	2776.0
11	34638.0	43106.0
12	51825.0	57420.0
13	24720.0	30354.0
14	32770.0	46469.0
15	12455.0	11037.0
16	5446.0	6310.0
17	42106.0	48082.0
18	27353.0	26433.0
19	11297.0	11514.0
20	33206.0	25988.0
21	15481.0	9815.0
22	34656.0	35363.0
23	3352.0	3883.0
24	22951.0	18835.0
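A table with one row per site and one column per sex can be built by grouping on two columns and then unstacking the result. The sketch below uses a hypothetical toy table. It assumes the table above holds summed weights, which the page does not state.

```python
import pandas as pd

# A tiny hand-made table (hypothetical values, not taken from surveys.csv)
toy_df = pd.DataFrame({
    'site_id': [1, 1, 1, 2, 2],
    'sex': ['F', 'M', 'M', 'F', 'M'],
    'weight': [40.0, 44.0, 46.0, 50.0, 54.0],
})

# Total weight for each combination of site and sex
totals = toy_df.groupby(['site_id', 'sex'])['weight'].sum()

# unstack() moves the inner index level (sex) into the columns
site_sex_table = totals.unstack()
print(site_sex_table)
```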