In this homework you’re going to reinforce more of the Python skills you learned in class in Week 4: use Python to open and read a bunch of text files, process them in some way, and then do some basic analysis. Refer back to the in-class portion of Week 4 to help you. Once again, please alternate Code cells with Markdown cells in which you explain each step.
The Data
For this week we’re going to use entries from the diary of Martha Ballard, a midwife from Maine made famous by Laurel Ulrich’s A Midwife’s Tale. A project at George Mason University digitized her diary and put it online. I’ve done some work on the entries, and am supplying you with two years’ worth of Ballard’s entries (1804 and 1805). Each of her entries for these two years is contained in a separate text file that I’ve already preprocessed, cleaned up, and put into all lowercase.
Set Up
- Create a new folder inside the `homework` folder on your computer called `week-06-homework`. Launch Anaconda Navigator and then create a new Jupyter Notebook inside this folder using the filename convention: `yourlastname-week-06-homework.ipynb`.
- Download Martha Ballard’s diary entries for 1804 and 1805 and put the file in your `week-06-homework` folder. Unzip this and rename the folder `data`. This is your directory of text files.
- In your Jupyter Notebook:
  - Import the `os` and `string` libraries
  - Use `os` to tell Python where to look for the data (text files)
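The notebook setup could look something like the sketch below. It assumes the notebook was launched from inside `week-06-homework`, so that the unzipped `data` folder sits next to it; the variable name `data_dir` is only an illustration, not something the assignment requires.

```python
import os
import string

# Assumption: the notebook's working directory is week-06-homework,
# and the unzipped folder of text files has been renamed "data".
data_dir = os.path.join(os.getcwd(), "data")

# Quick sanity checks: where is Python looking, and does the folder exist?
print(data_dir)
print(os.path.isdir(data_dir))
```

If the second line prints `False`, the folder is probably in a different location or has a different name than the sketch assumes.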
Wrangle the Data
The goal of this section is to take your hundreds of text files worth of diary entries and add them into two lists, one containing all of the diary entries for 1804 and one for 1805. You’re going to do this through the following steps:
- Make two new variables, `year_1804` and `year_1805`, and make them empty lists. This is where you’re going to be adding individual entries for that year as items in your list.
- Use `os.listdir()` to get a list of filenames contained in your data folder and assign it as a new variable
- Write a `for` loop to go through your list of filenames and `open()` each diary entry and `read()` its contents.
- Inside that same `for` loop, then use an `if` statement to figure out which year the diary entry was written. Based on that, `append()` the entry to either your `year_1804` or `year_1805` list.
  - Hint: this is going to require a new function, but one that is related to the `if f.endswith('.txt'):` in your Week 4 exercise. Try Googling to see if you can figure out what function to use.
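The steps above can be sketched roughly as follows. To keep the example runnable on its own, it builds a temporary folder with three placeholder files instead of using the real data. The filenames are hypothetical: the sketch assumes each filename begins with the year (for example `1804-01-01.txt`), which may not match the real files, so check your own `data` folder first. `str.startswith()` is one function that fits the hint, though it is not necessarily the one intended.

```python
import os
import tempfile

# Stand-in for the real data folder: a temporary directory holding
# placeholder text. Filenames are hypothetical, not the real ones.
data_dir = tempfile.mkdtemp()
placeholders = {
    "1804-01-01.txt": "placeholder entry one.",
    "1804-01-02.txt": "placeholder entry two.",
    "1805-01-01.txt": "placeholder entry three.",
}
for name, text in placeholders.items():
    with open(os.path.join(data_dir, name), "w", encoding="utf-8") as f:
        f.write(text)

# Empty lists that will collect the entries for each year.
year_1804 = []
year_1805 = []

filenames = os.listdir(data_dir)
for filename in filenames:
    # Skip anything that is not a text file (e.g. hidden system files).
    if not filename.endswith(".txt"):
        continue
    with open(os.path.join(data_dir, filename), encoding="utf-8") as f:
        entry = f.read()
    # Assumption: the year is the first four characters of the filename.
    if filename.startswith("1804"):
        year_1804.append(entry)
    elif filename.startswith("1805"):
        year_1805.append(entry)

print(len(year_1804), len(year_1805))
```

Note that `os.listdir()` does not promise any particular order, so sort the filenames first if the order of entries matters to you.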
Analyze the Data
Let’s do some basic analysis of Martha Ballard’s diary entries for these two years.
- In which year did she write more entries?
- What is Ballard’s longest entry that she wrote in 1804 or 1805?
- What is the shortest entry that she wrote?
- What is the average length of her entries in 1804 vs. 1805?
- Note: this is a tricky one that requires some thinking outside of the box or importing a library you haven’t used yet - if you can’t figure it out, feel free to skip it
- What was the weather in Maine exactly 215 years ago from today? The goal is to generate a print statement that just prints out the sentence from that particular entry containing the weather.
  - Use a `for` loop to go back through your data files
  - Use an `if` statement to locate the correct text file based on its filename
  - `open()`, `read()`, and assign the contents of just that file to a new variable
  - Use the `split()` function to create a new list, with each individual item being a different sentence from her diary entry. Think about what character you want to “split” on.
  - `print()` just the sentence that talks about the weather. To do this, you’re going to tell Python which item inside your list of sentences you want to print using the brackets `[]` notation. Annoying Python feature: Python starts “counting” at 0 instead of 1. So to access the second item in your list, you’d use `print(somelist[1])` NOT `print(somelist[2])`. Fun right? :)
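One possible way to approach the analysis questions is sketched below. So that it runs without the real data, a dictionary of placeholder text stands in for the files on disk; the filenames, the target date, and the idea of measuring length in characters are all assumptions made for illustration. The `statistics` module is one library that can compute an average, but it is only a guess at what the note about a new library has in mind.

```python
from statistics import mean

# Placeholder text keyed by hypothetical filenames. This dictionary
# stands in for the real files so the sketch runs anywhere.
entries_by_file = {
    "1804-10-15.txt": "first sentence. second sentence. third sentence.",
    "1804-10-16.txt": "short.",
    "1805-10-15.txt": "one sentence. then another.",
}

# Sort entries into years (assumes filenames start with the year).
year_1804 = [t for n, t in entries_by_file.items() if n.startswith("1804")]
year_1805 = [t for n, t in entries_by_file.items() if n.startswith("1805")]

# Which year has more entries?
print(len(year_1804), len(year_1805))

# Longest and shortest entries, measured here in characters.
# Swap len for a word count (len(entry.split())) if you prefer words.
all_entries = year_1804 + year_1805
longest = max(all_entries, key=len)
shortest = min(all_entries, key=len)

# Average entry length for each year.
avg_1804 = mean([len(entry) for entry in year_1804])
avg_1805 = mean([len(entry) for entry in year_1805])

# Locate one entry by filename (the target date is hypothetical).
target = "1804-10-15.txt"
entry = ""
for name in entries_by_file:
    if name == target:
        entry = entries_by_file[name]

# Split on periods, then drop surrounding spaces and empty leftovers.
sentences = [s.strip() for s in entry.split(".") if s.strip()]
print(sentences[0])  # first sentence (index 0)
print(sentences[1])  # second sentence (index 1)
```

Which index holds the weather depends on the entry itself, so print the whole `sentences` list once to see where it falls before picking a number.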