Working With Text in Python
Overview
Today we’ll be working with a single text: a narrative dictated by Venture Smith, a successful businessman and formerly enslaved man in colonial-era New England. The narrative, A Narrative of the Life and Adventures of Venture, a Native of Africa: But Resident above Sixty Years in the United States of America, Related by Himself, is available in several formats online, but for the purposes of today I’ve put it into the primary kind of file you will want to use when working with textual data in Python: a text file (ending in .txt).
Get Updates
⚠️ ⚠️ ⚠️ Before starting any new work in this class, you always want to check for any updates from your instructor’s sp25-data-materials repository.
- Open GitHub Desktop and select your course repository
- Make sure it says: Current Branch: main (i.e., you’re on the main branch)
- Click Fetch origin to check for any changes
- Go to Branch → Merge into current branch → select upstream/main and click Create a merge commit if there are updates
- Click Pull origin if it’s available (if not, you’re up to date!)
- Click Push origin to sync everything up
Get Started in Jupyter Labs
- Open Anaconda Navigator, launch Jupyter Labs, and navigate to the week-03 folder in yourlastname-sp25-data-materials
- Create a new Jupyter Notebook and name it: yourlastname-working-with-text.ipynb
Opening, Reading, and Writing Text Files
- Open up the venture-smith.txt text file in Jupyter Labs by double-clicking on the file in the left pane. This is how you, as a human reader, might read the contents of the file.
- Similarly, Python requires a two-step process to first open a file and then read its contents:
  - Add a code cell to open and read the file (Walsh instructions).
  - Assign the contents of the opened file to the variable smith
- Let’s say we wanted to create a back-up copy of the file just in case. Follow the instructions in the Walsh tutorial to write a new file that copies the contents of your newly created smith variable into a new file named venture-smith-copy.txt
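If the open, read, and write steps feel abstract, here is a minimal sketch of the pattern on a throwaway file. The file names sample.txt and sample-copy.txt, the temporary folder, and the variable sample are all made up for illustration; the Walsh tutorial may phrase the same steps a little differently (this version uses a with block so each file is closed automatically).

```python
from pathlib import Path
import tempfile

# Hypothetical stand-in file so this sketch runs anywhere.
# In your notebook you would work with venture-smith.txt instead.
folder = Path(tempfile.mkdtemp())
source_path = folder / "sample.txt"
source_path.write_text("A NARRATIVE\n\nOF THE LIFE", encoding="utf-8")

# Step 1: open the file. Step 2: read its contents into a string.
with open(source_path, mode="r", encoding="utf-8") as source_file:
    sample = source_file.read()

# To write, open a second file in write mode ("w") and write the string to it.
copy_path = folder / "sample-copy.txt"
with open(copy_path, mode="w", encoding="utf-8") as copy_file:
    copy_file.write(sample)

# Check that the copy holds the same text as the original.
print(copy_path.read_text(encoding="utf-8") == sample)
```

Note that mode "w" overwrites a file that already exists, so double-check the name of the file you are writing to.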
Common String Methods in Python
- In a new code cell, add two lines of code. In the first line, just run your new variable smith to display its contents. In the second line, use print() to display its contents. What is the difference between these two?
- Use indexing (square brackets, as in string[0]) to show the first character (letter) of Smith’s narrative. It should show: 'A'.
- Use slicing (as in string[start:stop]) to show the first 100 characters (letters) of Smith’s narrative. It should show: 'A NARRATIVE OF THE LIFE AND ADVENTURES\n\nOF VENTURE, A NATIVE OF AFRICA,\n\nBut resident above sixty ye'
- Let’s just isolate the title of Smith’s narrative. It comprises the first 158 characters. Make a new variable called smith_title and assign it the first 158 characters of the text file (since Python starts counting at 0 and a slice stops just before its end index, this means we want to use smith[0:158]).
- Use string.title() to reformat the title of Smith’s narrative by making the first letter of each word capitalized.
  - Hint: the string in this example is your new variable smith_title.
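Here is a small sketch of indexing, slicing, and title() on a short stand-in string. The variable names (sample, first_eleven, and so on) are invented for this example, so you will still need to adapt the pattern to smith yourself.

```python
# A short stand-in string (not the full narrative).
sample = "A NARRATIVE OF THE LIFE\n\nOF VENTURE"

# repr() shows the string the way a notebook does when a variable is the
# last line of a cell: quotation marks included, newlines written as \n.
print(repr(sample))

# print() shows the text itself, with each \n turned into a real line break.
print(sample)

# Indexing: square brackets with one position. Counting starts at 0.
first_character = sample[0]

# Slicing: [start:stop] includes start and stops just before stop,
# so [0:11] returns positions 0 through 10 (11 characters in total).
first_eleven = sample[0:11]

# title() capitalizes the first letter of each word and lowercases the rest.
sample_title = first_eleven.title()

print(first_character)
print(first_eleven)
print(sample_title)
```

Because a slice stops just before its end index, the end index is also the number of characters you get back when you start from 0.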
Analyzing Lines
- Notice how your main variable smith contains newline characters (\n). This is a “hidden” character contained in text files that tells a text editor to show the following text as starting on a new line (like hitting Enter or Return in a Word document).
  - Use string.split('delim') to “split” up Smith’s narrative into separate lines.
  - What would you use in place of string and delim to do this?
  - Assign this new collection of separate lines to a new variable called smith_lines (this will create something called a “list” in Python)
- Check and see what’s in your new variable by using the following code (don’t worry if you don’t understand it):
    for line in smith_lines:
        print(line)
- The len() function tells you how long something is. In this case, we’ve created a variable called smith_lines containing a list of all the lines from Smith’s narrative. Use len() and smith_lines to show the length of Smith’s narrative in terms of the number of lines.
Counting Words
As you saw in the Walsh tutorial Anatomy of a Python Script, counting words is one of the most basic forms of text analysis, and it is also central to more sophisticated forms of analysis. Today we’re going to count words in Venture Smith’s narrative.
To do so, we’re going to follow an age-old practice among coders: borrow other people’s code and use it ourselves. Melanie Walsh provides precisely such a chunk of code in her tutorial to count words. See if you can copy and paste the code below (use the copy button) into your Jupyter Notebook and then edit it so that it counts the 20 most frequent words in Venture Smith’s narrative.
Here is Walsh’s description of the code:
Below is a chunk of Python code. These lines, when put together, do something simple yet important. They count and display the most frequent words in a text file. The example below specifically counts and displays the 40 most frequent words in Charlotte Perkins Gilman’s short story “The Yellow Wallpaper” (1892).
import re
from collections import Counter
def split_into_words(any_chunk_of_text):
    lowercase_text = any_chunk_of_text.lower()
    split_words = re.split(r"\W+", lowercase_text)
    return split_words
filepath_of_text = "../texts/literature/The-Yellow-Wallpaper_Charlotte-Perkins-Gilman.txt"
number_of_desired_words = 40
stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
             'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
             'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
             'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
             'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
             'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
             'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
             'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
             'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
             'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
             'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
             'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 've', 'll', 'amp']
full_text = open(filepath_of_text, encoding="utf-8").read()
all_the_words = split_into_words(full_text)
meaningful_words = [word for word in all_the_words if word not in stopwords]
meaningful_words_tally = Counter(meaningful_words)
most_frequent_meaningful_words = meaningful_words_tally.most_common(number_of_desired_words)
most_frequent_meaningful_words
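Before editing Walsh’s code, it may help to watch the same idea work on a single sentence. This is only a sketch: the sentence, the two-word stopword list, and the variable names are invented for illustration, and the extra check that drops empty strings is an addition that is not part of Walsh’s original code.

```python
import re
from collections import Counter

# A made-up sentence and a tiny made-up stopword list.
sample_text = "The ship sailed. The ship returned, and the crew sailed on the ship."
sample_stopwords = ["the", "and"]

# Lowercase the text, then split wherever there is a run of
# non-word characters (spaces, punctuation).
words = re.split(r"\W+", sample_text.lower())

# Keep words that are not stopwords. The "if word" check also drops the
# empty string that re.split leaves behind when text ends in punctuation.
meaningful = [word for word in words if word and word not in sample_stopwords]

# Counter tallies each word; most_common(2) returns the top two as
# (word, count) pairs.
top_two = Counter(meaningful).most_common(2)
print(top_two)  # [('ship', 3), ('sailed', 2)]
```

In Walsh’s code, the two lines you will most likely need to change are the ones that set filepath_of_text and number_of_desired_words.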
Bonus Practice
- Use string.split(), indexing (square brackets), and len() to:
  - Print the 200th word in Smith’s narrative
  - Print the length of Smith’s narrative measured by number of total words.
- Use string.split() to break Smith’s narrative apart into separate chapters and then len() to calculate how long Chapter II is based on the number of characters in that chapter.
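If you get stuck, here is a rough sketch of both ideas on a made-up string. The text, the variable names, and the choice of "CHAPTER" as the delimiter are all assumptions for illustration, so check how chapter headings actually appear in venture-smith.txt before borrowing the pattern.

```python
# A made-up stand-in text with two chapter headings.
sample = "A short preface. CHAPTER I. First part here. CHAPTER II. Second part."

# split() with no argument splits on any whitespace, giving a list of words.
sample_words = sample.split()

# Counting starts at 0, so the third word sits at index 2
# (and the 200th word of a longer text would sit at index 199).
third_word = sample_words[2]
word_count = len(sample_words)

# Splitting on a heading word gives one piece per section:
# the text before the first heading, then one piece for each chapter.
sample_parts = sample.split("CHAPTER")
second_chapter = sample_parts[2]

print(third_word)
print(word_count)
print(len(second_chapter))  # number of characters in the second chapter
```

This only works cleanly if the delimiter never appears in the running text of the narrative, which is worth checking before you trust the result.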