🐼 Pandas Concepts
The Basics: Spreadsheets in Python
Think about pandas like working with a spreadsheet:
- A DataFrame is the entire spreadsheet, with both rows and columns (a table)
- A Series is just a single column from that spreadsheet (a list of values with labels)
Working with Columns
Imagine we have the following table of data stored as a DataFrame in Python:
name | age | city |
---|---|---|
Alice | 25 | New York |
Bob | 19 | Boston |
Charlie | 22 | Chicago |
Claire | 18 | New York |
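(If you'd like to follow along, here is one way you might build this table yourself. This is just a sketch; the rest of this guide assumes the DataFrame is named `students`.)

```python
import pandas as pd

# Build the example table as a DataFrame named 'students'
students = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Claire'],
    'age': [25, 19, 22, 18],
    'city': ['New York', 'Boston', 'Chicago', 'New York'],
})
```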
Getting a Single Column (Series)
“Show me just the ages of all students”:
```python
# This gives you just the 'age' column as a Series
ages = students['age']
```
Returns: A single column of data (a Series)
age |
---|
25 |
19 |
22 |
18 |
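A Series keeps the row labels alongside the values. A quick sketch of what printing it looks like, assuming the `students` DataFrame built above:

```python
ages = students['age']
print(ages)
# 0    25
# 1    19
# 2    22
# 3    18
# Name: age, dtype: int64
```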
Getting Multiple Columns (DataFrame)
“I don’t care about the city column, give me a mini-spreadsheet of just names and ages”:
```python
# This gives you both 'name' and 'age' columns as a DataFrame
name_and_age = students[['name', 'age']]
```
- Notice the double brackets `[[ ]]` - they’re important!
- Returns: A mini-spreadsheet with the selected columns (a DataFrame)
name | age |
---|---|
Alice | 25 |
Bob | 19 |
Charlie | 22 |
Claire | 18 |
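One quick way to see why the bracket count matters is to check the type of each result (a small sketch, again using the `students` DataFrame from above):

```python
# Single brackets -> a Series (one labeled column of values)
print(type(students['name']))    # <class 'pandas.core.series.Series'>

# Double brackets -> a DataFrame (a mini-spreadsheet, even with one column)
print(type(students[['name']]))  # <class 'pandas.core.frame.DataFrame'>
```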
Working With Rows
Filter Rows by Slicing
You’ll often want to “filter” your dataframe, i.e. create a shorter table with only a certain subset of rows. If you know how your rows are arranged sequentially, you can use the same “slice” notation you’ve used with lists. If you only want the first two rows of your dataframe, you would write:
```python
students[0:2]
```
This would return:
name | age | city |
---|---|---|
Alice | 25 | New York |
Bob | 19 | Boston |
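As an aside, pandas also has a built-in shortcut for “give me the first few rows”; both lines below return the same two rows (the variable name is just for illustration):

```python
first_two = students[0:2]     # slice notation, like a list
first_two = students.head(2)  # pandas' built-in head() method
```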
Filtering Rows by a Condition
However, slicing is not as common with dataframes because they aren’t always sequential (i.e. the order of the rows doesn’t matter). The more common way you’ll end up filtering rows in a dataframe is by defining a “conditional”. Let’s say we have our full table of students but want to create a subset containing only students over the age of 20. We’d essentially want Python to ask:
- Is Alice’s age > 20? Yes → Keep this row
- Is Bob’s age > 20? No → Remove this row
- Is Charlie’s age > 20? Yes → Keep this row
- Is Claire’s age > 20? No → Remove this row
Here’s the code to apply this filter:
```python
students[students['age'] > 20]
```
name | age | city |
---|---|---|
Alice | 25 | New York |
Charlie | 22 | Chicago |
Let’s break our code `students[students['age'] > 20]` into two parts to make it easier to understand.
Part 1: Build our filter
`students['age'] > 20`: This part of the code tells Python to check each row in the entire `students` dataframe and ask “is the value in the `age` column greater than 20?” You can visualize this process a bit like this:
name | age | city | … | Did it meet our condition? |
---|---|---|---|---|
Alice | 25 | New York | -> | 👍 |
Bob | 19 | Boston | -> | ❌ |
Charlie | 22 | Chicago | -> | 👍 |
Claire | 18 | New York | -> | ❌ |
Pandas generates a Series (like a list) made up of `True` or `False` values for every row, depending on whether it meets our condition:

row | met our condition? |
---|---|
0 | True |
1 | False |
2 | True |
3 | False |
Part 2: Apply the filter to the dataframe
`students[...our filter...]`: This looks at the entire dataframe, feeds it our Series of `True`/`False` values, and uses it to select only the rows where the answer to our condition is `True`.
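Putting the two parts together, you can store the filter in its own variable to make the steps explicit. A small sketch using the same `students` table (the variable names are just for illustration):

```python
# Part 1: build the filter -- a Series of True/False values, one per row
is_over_20 = students['age'] > 20

# Part 2: apply the filter -- keep only the rows where the value is True
over_20 = students[is_over_20]
print(over_20)
#       name  age      city
# 0    Alice   25  New York
# 2  Charlie   22   Chicago
```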
Grouping and Summarizing Data
Grouping data in Pandas divides up the rows in your dataframe and lumps them into separate buckets based on shared characteristics. You can then “summarize” these groups by performing calculations, making observations, etc.
Let’s take the analogy of doing laundry:
- Your hamper is filled with a jumbled-up pile of different pieces of clothing - e.g. t-shirts, jeans, sweaters, socks, etc. All of your pieces of clothing have characteristics - color, fabric type, etc.
  - This is like your starting dataframe, with each row as an individual piece of clothing and each column a characteristic of that piece of clothing.
- To do your laundry, you might want to group similar items together based on certain shared characteristics. You have a few options:
  - You could sort by color in order to keep your dark-wash jeans from bleeding into your nice white t-shirts. In this case, you’d put your whites in one pile and your darks in another pile.
  - You could sort by fabric type if you need to use different dryer settings. In this case, you’d put your delicates that need low heat into one pile and your more durable clothes into another pile.
- Once your laundry is sorted into groups, you might “summarize” each of your groups by making observations about them:
  - “My darks pile is a lot bigger than my whites pile this week”
  - “I only have two items in my delicates pile, I guess I could wait to wash them until I have more…”
Grouping Data
The `.groupby()` function in Pandas is the primary way to take your data and put it into separate buckets to analyze. Let’s take a starting dataframe called `students`:
row | name | age | city | grade |
---|---|---|---|---|
0 | Alice | 25 | New York | 85 |
1 | Bob | 19 | Boston | 72 |
2 | Charlie | 22 | New York | 90 |
3 | Claire | 18 | Chicago | 78 |
4 | Emma | 24 | New York | 88 |
5 | Frank | 20 | Boston | 79 |
We could group each student according to the city where they are from:
- New York group: Alice, Charlie, Emma
- Boston group: Bob, Frank
- Chicago group: Claire
The code to do this would be:
```python
students.groupby('city')
```
Pandas is doing several important things behind the scenes with this code:
- Identifies all the unique values in the dataframe’s `city` column (`New York`, `Boston`, and `Chicago`)
- Rather than modifying the original dataframe or creating new data, Pandas creates “virtual groups” that point back to the corresponding rows in the `students` dataframe:
  - “New York” group → points back to rows 0, 2, 4 in the original dataframe
  - “Boston” group → points back to rows 1, 5 in the original dataframe
  - “Chicago” group → points back to row 3 in the original dataframe
- Within each group, the original row order is preserved. So in the “New York” group, Alice (row 0) comes before Charlie (row 2), who comes before Emma (row 4), matching their order in the original dataframe.
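If you want to peek inside these “virtual groups” yourself, you can loop over the grouped object; each pass through the loop hands you a group’s name and the matching slice of rows (a quick sketch):

```python
for city, group in students.groupby('city'):
    print(city)   # e.g. 'Boston'
    print(group)  # the rows of 'students' belonging to that city
```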
Summarizing Groups
Once our data is in groups, we usually want to “summarize” or calculate something about each group. This is where aggregation functions come in. They are applied to an entire column and can help us answer questions like:
- How many students are from each city?
- What’s the average age of students from each city?
Counting
Let’s start with the first question. If we apply `.count()` to our grouped data, it might seem like we would write the following:
```python
city_groups = students.groupby('city')
city_groups.count()
```
But this means that you are telling Python to count up all the values (“non-null” values, specifically) for each column in our grouped data, except the grouping column itself (`city`):
city | name | age | grade |
---|---|---|---|
Boston | 2 | 2 | 2 |
Chicago | 1 | 1 | 1 |
New York | 3 | 3 | 3 |
This is why it’s often better to specifically select one column when using `.count()` after a groupby operation:
```python
city_groups = students.groupby('city')
city_groups['name'].count()
```
city | count |
---|---|
Boston | 2 |
Chicago | 1 |
New York | 3 |
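For this particular question (“how many students per city?”) there is also a common shortcut that skips the explicit groupby entirely; the counts match the table above:

```python
# Count how many times each value appears in the 'city' column
print(students['city'].value_counts())
# New York    3
# Boston      2
# Chicago     1
```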
Average
If we wanted the answer to “what’s the average age of students from each city?”, we could use `.mean()` (i.e. average) on the `age` column in our city groups:
```python
city_groups = students.groupby('city')
city_groups['age'].mean()
```
This would give us the average ages of our students from each city:
city | age |
---|---|
Boston | 19.5 |
Chicago | 18.0 |
New York | 23.7 |
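If you want several summaries at once, `.agg()` lets you ask for a list of them in one go. A sketch (the exact decimal formatting in the printed output may vary):

```python
city_groups = students.groupby('city')
summary = city_groups['age'].agg(['count', 'mean'])
print(summary)
#           count       mean
# city
# Boston        2  19.500000
# Chicago       1  18.000000
# New York      3  23.666667
```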
Applying Functions to Dataframes
Oftentimes you want to take each individual value in a column and do something to it - change its contents, use it to calculate a new value, etc. This is where Pandas’ `.apply()` comes in. In this case, you define a function, then use `.apply()` to feed every individual value in a column into that function.
It can be confusing distinguishing between the built-in summary functions above (e.g. `.mean()`, `.count()`, etc.) and `.apply()`. The main difference is that the summary functions typically act on a column as a whole, aggregating or collapsing all the values into a single “summary” (a number, value, etc.), whereas `.apply()` works on the individual cells/values in a column.
How .apply() Works: Breaking It Down
Let’s start with our familiar students dataframe:
name | age | city | grade |
---|---|---|---|
Alice | 25 | New York | 85 |
Bob | 19 | Boston | 72 |
Charlie | 22 | New York | 90 |
Claire | 18 | Chicago | 78 |
Emma | 24 | New York | 88 |
Frank | 20 | Boston | 79 |
Let’s say we want to add a new column to the dataframe that flags students who might be struggling based on their grade.
We’d start by defining a custom function that takes a numeric grade and returns either `Doing Okay` or `Needs Help` based on an 80-point threshold.
```python
def grade_status(grade):
    if grade >= 80:
        return 'Doing Okay'
    else:
        return 'Needs Help'
```
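It’s worth sanity-checking a function like this on a couple of plain numbers before handing it to pandas:

```python
print(grade_status(85))  # Doing Okay
print(grade_status(72))  # Needs Help
```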
The next step is to `.apply()` this function specifically to the `grade` column of the `students` dataframe:
```python
students['grade'].apply(grade_status)
```
And finally, we need to create a new column called `grade_flag` and “fill” that column with the values returned from our function for every `grade` value:
```python
students['grade_flag'] = students['grade'].apply(grade_status)
```
Let’s break down what’s happening:
- Python looks at EVERY single grade in the `grade` column
- It runs each individual grade through our `grade_status()` function
- It creates a new column with the results:
name | age | city | grade | grade_flag |
---|---|---|---|---|
Alice | 25 | New York | 85 | Doing Okay |
Bob | 19 | Boston | 72 | Needs Help |
Charlie | 22 | New York | 90 | Doing Okay |
Claire | 18 | Chicago | 78 | Doing Okay |
Emma | 24 | New York | 88 | Doing Okay |
Frank | 20 | Boston | 79 | Needs Help |
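As a side note, for a rule this short you will often see the same thing written inline with a lambda; both versions produce the identical column:

```python
# Same flag, written inline instead of as a named function
students['grade_flag'] = students['grade'].apply(
    lambda grade: 'Doing Okay' if grade >= 80 else 'Needs Help'
)
```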
Tips
- Your custom function should always:
  - Take one input (the individual item from the column)
  - Return something based on that item
- Your function takes in one input (e.g. a single value in a column), but you call `.apply()` on an entire column or columns.
- You can use `.apply()` with built-in Python functions or with functions that you create yourself. For example, see the sketch below.
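Here you could hand `.apply()` Python’s built-in `len` function to measure each name (the `name_length` column is just a made-up example):

```python
# Apply a built-in function: the length of each student's name
students['name_length'] = students['name'].apply(len)
```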