🐼 Pandas Concepts
The Basics: Spreadsheets in Python
Think about pandas like working with a spreadsheet:
- A DataFrame is the entire spreadsheet, with both rows and columns (a table)
- A Series is just a single column from that spreadsheet (a list of values with labels)
Working with Columns
Imagine we have the following table of data stored as a DataFrame in Python:
name | age | city |
---|---|---|
Alice | 25 | New York |
Bob | 19 | Boston |
Charlie | 22 | Chicago |
Claire | 18 | New York |
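(If you'd like to follow along, here is one way you might build this table yourself. This is just a sketch; the rest of this guide assumes the DataFrame is named `students`.)

```python
import pandas as pd

# Build the example table as a DataFrame named 'students'
students = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Claire'],
    'age': [25, 19, 22, 18],
    'city': ['New York', 'Boston', 'Chicago', 'New York'],
})
```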
Getting a Single Column (Series)
“Show me just the ages of all students”:
```python
# This gives you just the 'age' column as a Series
ages = students['age']
```
Returns: A single column of data (a Series)
age |
---|
25 |
19 |
22 |
18 |
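A Series keeps the row labels alongside the values. A quick sketch of what printing it looks like, assuming the `students` DataFrame built above:

```python
ages = students['age']
print(ages)
# 0    25
# 1    19
# 2    22
# 3    18
# Name: age, dtype: int64
```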
Getting Multiple Columns (DataFrame)
“I don’t care about the city column, give me a mini-spreadsheet of just names and ages”:
```python
# This gives you both 'name' and 'age' columns as a DataFrame
name_and_age = students[['name', 'age']]
```
- Notice the double brackets `[[ ]]` - they’re important!
- Returns: A mini-spreadsheet with the selected columns (a DataFrame)
name | age |
---|---|
Alice | 25 |
Bob | 19 |
Charlie | 22 |
Claire | 18 |
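One quick way to see why the bracket count matters is to check the type of each result (a small sketch, again using the `students` DataFrame from above):

```python
# Single brackets -> a Series (one labeled column of values)
print(type(students['name']))    # <class 'pandas.core.series.Series'>

# Double brackets -> a DataFrame (a mini-spreadsheet, even with one column)
print(type(students[['name']]))  # <class 'pandas.core.frame.DataFrame'>
```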
Working With Rows
Filter Rows by Slicing
You’ll often want to “filter” your dataframe, i.e. create a shorter table with only a certain subset of rows. If you know how your rows are arranged sequentially, you can use the same “slice” notation you’ve used with lists. If you only want the first two rows of your dataframe, you would write:
```python
students[0:2]
```
This would return:
name | age | city |
---|---|---|
Alice | 25 | New York |
Bob | 19 | Boston |
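As an aside, pandas also has a built-in shortcut for “give me the first few rows”; both lines below return the same two rows (the variable name is just for illustration):

```python
first_two = students[0:2]     # slice notation, like a list
first_two = students.head(2)  # pandas' built-in head() method
```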
Filtering Rows by a Condition
However, slicing is not as common with dataframes because they aren’t always sequential (i.e. the order of the rows doesn’t matter). The more common way you’ll end up filtering rows in a dataframe is by defining a “conditional”. Let’s say we have our full table of students but want to create a subset containing only students over the age of 20. We’d essentially want Python to ask:
- Is Alice’s age > 20? Yes → Keep this row
- Is Bob’s age > 20? No → Remove this row
- Is Charlie’s age > 20? Yes → Keep this row
- Is Claire’s age > 20? No → Remove this row
Here’s the code to apply this filter:
```python
students[students['age'] > 20]
```
name | age | city |
---|---|---|
Alice | 25 | New York |
Charlie | 22 | Chicago |
Let’s break our code `students[students['age'] > 20]` into two parts to make it easier to understand.
Part 1: Build our filter
`students['age'] > 20`: This part of the code tells Python to check each row in the entire `students` dataframe and ask “is the value in the `age` column greater than 20?” You can visualize this process a bit like this:
name | age | city | … | Did it meet our condition? |
---|---|---|---|---|
Alice | 25 | New York | -> | 👍 |
Bob | 19 | Boston | -> | ❌ |
Charlie | 22 | Chicago | -> | 👍 |
Claire | 18 | New York | -> | ❌ |
Pandas generates a Series (like a list) made up of `True` or `False` values for every row, depending on whether it meets our condition:

row | met our condition? |
---|---|
0 | True |
1 | False |
2 | True |
3 | False |
Part 2: Apply the filter to the dataframe
`students[...our filter...]`: This looks at the entire dataframe, feeds it our Series of `True`/`False` values, and uses it to select only the rows where the answer to our condition is `True`.
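Putting the two parts together, you can store the filter in its own variable to make the steps explicit. A small sketch using the same `students` table (the variable names are just for illustration):

```python
# Part 1: build the filter -- a Series of True/False values, one per row
is_over_20 = students['age'] > 20

# Part 2: apply the filter -- keep only the rows where the value is True
over_20 = students[is_over_20]
print(over_20)
#       name  age      city
# 0    Alice   25  New York
# 2  Charlie   22   Chicago
```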
Grouping and Summarizing Data
Grouping data in Pandas divides up the rows in your dataframe and lumps them into separate buckets based on shared characteristics. You can then “summarize” these groups by performing calculations, making observations, etc.
Let’s take the analogy of doing laundry:
- Your hamper is filled with a jumbled-up pile of different pieces of clothing - e.g. t-shirts, jeans, sweaters, socks, etc. All of your pieces of clothing have characteristics - color, fabric type, etc.
  - This is like your starting dataframe, with each row as an individual piece of clothing and each column a characteristic of that piece of clothing.
- To do your laundry, you might want to group similar items together based on certain shared characteristics. You have a few options:
  - You could sort by color in order to keep your dark-wash jeans from bleeding into your nice white t-shirts. In this case, you’d put your whites in one pile and your darks in another pile.
  - You could sort by fabric type if you need to use different dryer settings. In this case, you’d put your delicates that need low heat into one pile and your more durable clothes into another pile.
- Once your laundry is sorted into groups, you might “summarize” each of your groups by making observations about them:
  - “My darks pile is a lot bigger than my whites pile this week”
  - “I only have two items in my delicates pile, I guess I could wait to wash them until I have more…”
Grouping Data
The `.groupby()` function in Pandas is the primary way to take your data and put it into separate buckets to analyze. Let’s take a starting dataframe called `students`:
row | name | age | city | grade |
---|---|---|---|---|
0 | Alice | 25 | New York | 85 |
1 | Bob | 19 | Boston | 72 |
2 | Charlie | 22 | New York | 90 |
3 | Claire | 18 | Chicago | 78 |
4 | Emma | 24 | New York | 88 |
5 | Frank | 20 | Boston | 79 |
We could group each student according to the city where they are from:
- New York group: Alice, Charlie, Emma
- Boston group: Bob, Frank
- Chicago group: Claire
The code to do this would be:
```python
students.groupby('city')
```
Pandas is doing several important things behind the scenes with this code:
- Identifies all the unique values in the dataframe’s `city` column (`New York`, `Boston`, and `Chicago`)
- Rather than modifying the original dataframe or creating new data, Pandas creates “virtual groups” that point back to the corresponding rows in the `students` dataframe:
  - “New York” group → points back to rows 0, 2, 4 in the original dataframe
  - “Boston” group → points back to rows 1, 5 in the original dataframe
  - “Chicago” group → points back to row 3 in the original dataframe
- Within each group, the original row order is preserved. So in the “New York” group, Alice (row 0) comes before Charlie (row 2), who comes before Emma (row 4), matching their order in the original dataframe.
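If you want to peek inside these “virtual groups” yourself, you can loop over the grouped object; each pass through the loop hands you a group’s name and the matching slice of rows (a quick sketch):

```python
for city, group in students.groupby('city'):
    print(city)   # e.g. 'Boston'
    print(group)  # the rows of 'students' belonging to that city
```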
Summarizing Groups
Once our data is in groups, we usually want to “summarize” or calculate something about each group. This is where aggregation functions come in. They are applied to an entire column and can help us answer questions like:
- How many students are from each city?
- What’s the average age of students from each city?
Counting
Let’s start with the first question. If we apply `.count()` to our grouped data, it might seem like we would write the following:
```python
city_groups = students.groupby('city')
city_groups.count()
```
But this means that you are telling Python to count up all the values (“non-null” values, specifically) for each column in our grouped data, except the grouping column itself (`city`):
city | name | age | grade |
---|---|---|---|
Boston | 2 | 2 | 2 |
Chicago | 1 | 1 | 1 |
New York | 3 | 3 | 3 |
This is why it’s often better to specifically select one column when using `.count()` after a groupby operation:
```python
city_groups = students.groupby('city')
city_groups['name'].count()
```
city | count |
---|---|
Boston | 2 |
Chicago | 1 |
New York | 3 |
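For this particular question (“how many students per city?”) there is also a common shortcut that skips the explicit groupby entirely; the counts match the table above:

```python
# Count how many times each value appears in the 'city' column
print(students['city'].value_counts())
# New York    3
# Boston      2
# Chicago     1
```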
Average
If we wanted the answer to “what’s the average age of students from each city?”, we could use `.mean()` (i.e. average) on the `age` column in our city groups:
```python
city_groups = students.groupby('city')
city_groups['age'].mean()
```
This would give us the average ages of our students from each city:
city | age |
---|---|
Boston | 19.5 |
Chicago | 18.0 |
New York | 23.7 |
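If you want several summaries at once, `.agg()` lets you ask for a list of them in one go. A sketch (the exact decimal formatting in the printed output may vary):

```python
city_groups = students.groupby('city')
summary = city_groups['age'].agg(['count', 'mean'])
print(summary)
#           count       mean
# city
# Boston        2  19.500000
# Chicago       1  18.000000
# New York      3  23.666667
```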
Applying Functions to Dataframes
Oftentimes you want to take each individual value in a column and do something to it - change its contents, use it to calculate a new value, etc. This is where Pandas’ `.apply()` comes in. In this case, you define a function, then use `.apply()` to feed every individual value in a column into that function.
It can be confusing distinguishing between the built-in summary functions above (e.g. `.mean()`, `.count()`, etc.) and `.apply()`. The main difference is that the summary functions typically act on a column as a whole, aggregating or collapsing all the values into a single “summary” (a number, value, etc.), whereas `.apply()` works on the individual cells/values in a column.
How .apply() Works: Breaking It Down
Let’s start with our familiar students dataframe:
name | age | city | grade |
---|---|---|---|
Alice | 25 | New York | 85 |
Bob | 19 | Boston | 72 |
Charlie | 22 | New York | 90 |
Claire | 18 | Chicago | 78 |
Emma | 24 | New York | 88 |
Frank | 20 | Boston | 79 |
Let’s say we want to add a new column to the dataframe that flags students who might be struggling based on their grade.
We’d start by defining a custom function that takes a numeric grade and returns either `Doing Okay` or `Needs Help` based on an 80-point threshold.
```python
def grade_status(grade):
    if grade >= 80:
        return 'Doing Okay'
    else:
        return 'Needs Help'
```
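It’s worth sanity-checking a function like this on a couple of plain numbers before handing it to pandas:

```python
print(grade_status(85))  # Doing Okay
print(grade_status(72))  # Needs Help
```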
The next step is to `.apply()` this function specifically to the `grade` column of the `students` dataframe:
```python
students['grade'].apply(grade_status)
```
And finally, we need to create a new column called `grade_flag` and “fill” that column with the values returned from our function for every `grade` value:
```python
students['grade_flag'] = students['grade'].apply(grade_status)
```
Let’s break down what’s happening:
- Python looks at EVERY single grade in the `grade` column
- It runs each individual grade through our `grade_status()` function
- It creates a new column with the results:
name | age | city | grade | grade_flag |
---|---|---|---|---|
Alice | 25 | New York | 85 | Doing Okay |
Bob | 19 | Boston | 72 | Needs Help |
Charlie | 22 | New York | 90 | Doing Okay |
Claire | 18 | Chicago | 78 | Doing Okay |
Emma | 24 | New York | 88 | Doing Okay |
Frank | 20 | Boston | 79 | Needs Help |
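As a side note, for a rule this short you will often see the same thing written inline with a lambda; both versions produce the identical column:

```python
# Same flag, written inline instead of as a named function
students['grade_flag'] = students['grade'].apply(
    lambda grade: 'Doing Okay' if grade >= 80 else 'Needs Help'
)
```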
Tips
- Your custom function should always:
  - Take one input (the individual item from the column)
  - Return something based on that item
- Your function takes in one input (e.g. a single value in a column), but you call `.apply()` on an entire column or columns.
- You can use `.apply()` with built-in Python functions or with functions that you create yourself. For example, see the sketch below.
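Here you could hand `.apply()` Python’s built-in `len` function to measure each name (the `name_length` column is just a made-up example):

```python
# Apply a built-in function: the length of each student's name
students['name_length'] = students['name'].apply(len)
```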