Pandas Practice II
Overview
In today’s tutorial, you will work in small groups to analyze real historical datasets. Rather than just mimicking what I do as your instructor, this hands-on, group-based activity asks you to apply the Pandas concepts you’ve been learning to new datasets and to think more creatively about how to use these skills.
Each group will be assigned one dataset to analyze. Your goal is to explore the data, develop interesting questions, and use Pandas to discover answers and insights. At the end of the session, your group will briefly share your most interesting finding.
Get the Data
- Open GitHub Desktop and select your course repository (lastname-sp25-data-materials)
- Click Fetch origin to check for any changes
- Go to Branch → Merge into current branch → select upstream/main and click Create a merge commit if there are updates
- Click Pull origin if it’s available (if not, you’re up to date!)
- Click Push origin to sync everything up
- Launch JupyterLab and navigate to this week’s folder
- You should see four .csv data files - each group will be assigned one of these datasets
- Create a new Jupyter Notebook in this week’s folder with the filename: yourlastname-pandas-3.ipynb
The Data
I’ve pre-selected several datasets. To learn more about what information they contain and how that information is captured, use the links below:
nobel_winners.csv
jobs_gender.csv
olympics.csv
baby-names.csv. Note: this particular dataset has a bit less documentation - to help you out, the percent column stores the share of babies of that gender who were given that name in that year, written as a proportion. Ex. if John has a percent value of 0.07 in 1905, that means babies named John made up 7% of all male babies born that year.
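To make the note about the percent column concrete, here is a minimal sketch of the conversion. The one-row table is made up for illustration and only repeats the example above; the year and name column names and the new share_pct column are assumptions, so check the real file before relying on them.

```python
import pandas as pd

# Made-up single row mirroring the example in the note above.
# Only the "percent" column name comes from the dataset description;
# "year" and "name" are assumed column names.
names = pd.DataFrame({"year": [1905], "name": ["John"], "percent": [0.07]})

# The stored value is a proportion, so multiply by 100 to read it as a percentage.
names["share_pct"] = names["percent"] * 100

print(names)
```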
Steps for Analysis
Initial Data Exploration
- Import the necessary libraries (pandas, matplotlib)
- Load your chosen dataset
- Refer to the original data sources (above) and try to figure out what kind of information is stored and how it is stored
- Examine the dataset structure using:
  - .head() to view the first few rows
  - .info() to understand columns and data types
  - .describe() to see statistical summaries
  - .shape to see dimensions
- Check for missing values with .isna().sum()
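The exploration steps above could look roughly like this in a notebook cell. This is a sketch, not the required solution: the three-row sample is invented so the cell runs on its own, and every column name other than percent is an assumption. In your own notebook you would load your group's file instead, for example with pd.read_csv("baby-names.csv").

```python
import io

import pandas as pd

# Invented stand-in for a real data file so this cell is self-contained.
# In the notebook, replace this with pd.read_csv("<your group's file>.csv").
sample_csv = io.StringIO(
    "year,name,percent,sex\n"
    "1905,John,0.07,boy\n"
    "1905,Mary,0.05,girl\n"
    "1906,John,,boy\n"  # percent left blank on purpose to show a missing value
)
df = pd.read_csv(sample_csv)

print(df.head())       # first few rows
df.info()              # columns, data types, non-null counts (prints directly)
print(df.describe())   # statistical summaries of the numeric columns
print(df.shape)        # (number of rows, number of columns)

# Count missing values in each column
missing = df.isna().sum()
print(missing)
```

Comparing the non-null counts from .info() with the row count from .shape is a quick way to spot which columns have gaps before you start asking questions of the data.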
Brainstorm Questions
- As a group, brainstorm three questions you could answer with this data
- Focus on questions that might reveal interesting patterns or stories
- Document your brainstormed list of questions in a markdown cell
- Choose one of your questions to start with
Analysis & Visualization
- Use Pandas methods to analyze your data and answer your questions - refer to Walsh’s Pandas Basics 1 and Pandas 2
- Create any new columns that might help answer your questions
- Filter, group, sort, or aggregate data to get closer to your answers
- Create at least one visualization to illustrate your findings
- Use markdown cells to explain your process and interpret results
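As one possible shape for this stage, the sketch below creates a new column, filters, groups and aggregates, sorts, and draws a simple line chart. The data is made up and the column names (year, name, sex, percent, plus the derived decade) are assumptions modeled on the baby names description; swap in your own dataset and question.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration only; load your real dataset instead.
df = pd.DataFrame({
    "year": [1905, 1905, 1906, 1906, 1915, 1915],
    "name": ["John", "Mary", "John", "Mary", "John", "Mary"],
    "sex": ["boy", "girl", "boy", "girl", "boy", "girl"],
    "percent": [0.07, 0.05, 0.06, 0.05, 0.05, 0.04],
})

# Create a new column: the decade each year belongs to
df["decade"] = (df["year"] // 10) * 10

# Filter: keep only the rows for one name
johns = df[df["name"] == "John"]

# Group, aggregate, and sort: average share per name per decade, largest first
by_decade = (
    df.groupby(["decade", "name"])["percent"]
    .mean()
    .reset_index()
    .sort_values("percent", ascending=False)
)
print(by_decade)

# Visualize: how one name's share changes over time
ax = johns.plot(x="year", y="percent", marker="o", title="Share of boys named John")
ax.set_xlabel("Year")
ax.set_ylabel("Share of births")
plt.tight_layout()
```

In a markdown cell under the chart, note what question the plot answers and what the pattern suggests.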
Class Debrief
- What was the most interesting or surprising finding?
- What challenges did you encounter in the analysis?
- What additional data would help you answer your questions better?
Collaboration Tips
- Take turns writing code
- If someone is stuck, work through the problem together instead of taking over
- Document your thought process in markdown cells
- When you encounter errors, read them carefully and try to debug together
- This should be an iterative process! It’s normal to refine your questions as you learn more about the data, and you might need to try several approaches before finding something interesting.