First, create a new directory called pandas-3 inside your class folder, data-a-user-manual. Launch JupyterLab and create a new Jupyter Notebook in that folder with the filename: pandas-3.ipynb.
We’re going to continue working with the dataset: Census of Utah Territory, 1880. I’ve modified the dataset to make it more usable in Python. Download this CSV file and make sure it is in your pandas-3 folder.
To replicate best coding practices, you’re going to use alternating Markdown and Code cells in your Jupyter Notebook. Copy and paste each of the following steps into a new Markdown cell, then document in your own words what you’re doing in the code cell that follows it. Then insert a new code cell and write the Python code that completes the task for that step.
- Import the pandas library
- Read in the contents of the CSV of Utah census data and assign it to a variable `utah_df` (i.e., Utah dataframe)
- Show a random sample of 10 rows from the dataframe
- Look at the column `birthstate` and some of the values from your random sample. Change the name of the column to something more accurate.
- Replace the `NaN` values in the `occupation` column with the string: `Unknown`.
- Just isolate people who work on a farm in some capacity. Create a filter and subset the data using `str.contains` to return rows whose `occupation` contains `FARM`. How many people work on a farm?
- The three most common categories for `marrystatus` have been shortened to single letters. Write three lines of code that each uses `str.replace` on the `marrystatus` column to replace:
  - `m` with `married`
  - `s` with `single`
  - `w` with `widowed`
- Define a function called `birthyear_calc` to calculate the year an individual was born based on their `age` column and the year the census was recorded (1880). Note: this won’t be perfect.
- Add a new column called `birthyear` that is populated with the same values as the `age` column. These are placeholder values that we are about to change.
- Use `apply` and your function `birthyear_calc` on the `birthyear` column to calculate new values for each person’s birth year.
- Use the following code to change the `birthyear` column to an Integer data type rather than a Float (decimal): `utah_df['birthyear'] = utah_df['birthyear'].astype('int64')`
- Use `groupby` to count the number of people living in Utah in 1880 who were born each year leading up to 1880. Assign this to a new variable `birthsbyyear`.
- Plot a time series graph of `birthsbyyear` to chart how many people were born each year.
- What explains the weird spikes in the graph?
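If the text-cleaning steps feel abstract, here is a minimal sketch of the same pandas methods on a tiny made-up dataframe. Only the column names come from the steps above. The rows, the name `demo_df`, and the new column name `birthplace` are placeholders for illustration, not values from the census file. The anchored patterns (`^m$` and so on) are one cautious way to use `str.replace` so that only cells consisting of a single letter are changed; they are an assumption of this sketch, not a requirement of the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; these are NOT taken from the Utah census CSV.
demo_df = pd.DataFrame({
    "birthstate": ["UTAH", "ENGLAND", "DENMARK", "OHIO"],
    "occupation": ["FARMER", np.nan, "FARM LABORER", "KEEPING HOUSE"],
    "marrystatus": ["m", "s", "w", "m"],
})

# rename() returns a new dataframe, so assign the result back.
# "birthplace" is only a placeholder name chosen for this sketch.
demo_df = demo_df.rename(columns={"birthstate": "birthplace"})

# Swap missing occupations for a readable label.
demo_df["occupation"] = demo_df["occupation"].fillna("Unknown")

# str.contains builds a True/False filter, one value per row.
farm_filter = demo_df["occupation"].str.contains("FARM")
farm_df = demo_df[farm_filter]
farm_count = len(farm_df)

# Expand the one-letter codes. regex=True with ^ and $ matches only
# cells that are exactly one letter, leaving longer text untouched.
demo_df["marrystatus"] = demo_df["marrystatus"].str.replace(r"^m$", "married", regex=True)
demo_df["marrystatus"] = demo_df["marrystatus"].str.replace(r"^s$", "single", regex=True)
demo_df["marrystatus"] = demo_df["marrystatus"].str.replace(r"^w$", "widowed", regex=True)

print(farm_count)
print(demo_df)
```

On your own notebook you would apply the same calls to `utah_df` instead of `demo_df`. Note that each method returns a new object, so nothing changes until you assign the result back to the column or dataframe.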

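The birth-year steps (define a function, copy a column, `apply`, convert the type, `groupby`) can be sketched on a small invented dataframe as follows. The ages, the names `demo_df` and `births_demo`, and the constant `CENSUS_YEAR` are assumptions made for this illustration. Using `.size()` after `groupby` is one of several ways to count rows per group.

```python
import pandas as pd

CENSUS_YEAR = 1880  # year the census was recorded, from the step above


def birthyear_calc(age):
    """Estimate a birth year from an age recorded in the census year.

    The result can be off by one, because the census does not say
    whether the person had already had a birthday that year.
    """
    return CENSUS_YEAR - age


# Hypothetical ages stored as floats, NOT taken from the census CSV.
demo_df = pd.DataFrame({"age": [30.0, 30.0, 5.0, 42.0, 5.0, 30.0]})

# Placeholder column: start with a copy of the ages.
demo_df["birthyear"] = demo_df["age"]

# apply() calls birthyear_calc once for every value in the column.
demo_df["birthyear"] = demo_df["birthyear"].apply(birthyear_calc)

# Convert from float to integer so years have no decimal part.
demo_df["birthyear"] = demo_df["birthyear"].astype("int64")

# Count how many rows fall in each birth year.
births_demo = demo_df.groupby("birthyear").size()

print(births_demo)
```

The result is a Series indexed by year, so calling `births_demo.plot()` in a notebook with matplotlib installed should draw it as a line chart.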