Data Tutorials: Jane Austen Text Analysis

Analyzing Jane Austen's Novels

In honor of Jane Austen's 250th birthday, we will use her novels to begin exploring text analysis. You'll learn how to analyze the text of her novels using Python, covering text collection, sentiment analysis, and some basic visualization techniques.

[Image: Jane Austen with the words "Happy Birthday Jane" superimposed]

Step 1: Collecting Texts from Project Gutenberg

In this section, you will learn how to pull Jane Austen novels from Project Gutenberg for analysis. The full version (available in Colab) includes all necessary code and explanations to download the text directly from Project Gutenberg.

Launch Full Notebook in Colab

In the cell below, we show how you might define a list of Jane Austen's novels by title or by Gutenberg ID number. Such a list can then be used to download several texts from Project Gutenberg at once. For this example, however, we start with a single book, "Pride and Prejudice", and show how to use the requests package to download its text.

jane_austen_book_titles = ["Pride and Prejudice", "Sense and Sensibility", "Emma"] ## Option 1: by title
jane_austen_book_ids = [1342, 161, 158]  ## Option 2: by Gutenberg ID number

## let's start with downloading one book as an example
import requests ## to make HTTP requests

book_id = jane_austen_book_ids[0]  # Pride and Prejudice
book_url = f"https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"

response = requests.get(book_url, timeout=30) ## this makes the call to the website to download the text (gives up after 30 seconds)
if response.status_code == 200: # check if the request was successful
    book_text = response.text
    print(f"Successfully downloaded book ID {book_id}")
else:
    print(f"Failed to download book ID {book_id} (status code {response.status_code})")

Notice how we downloaded the text file by making an HTTP GET request to the Gutenberg website. The text is then stored in the `book_text` variable for further analysis. We can do a quick sanity check to make sure we're downloading the correct file by following the link we created: https://www.gutenberg.org/cache/epub/1342/pg1342.txt.
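
The same request can be repeated for a whole list of IDs. The sketch below is one way this might look, not the notebook's actual code: `gutenberg_url` and `download_books` are hypothetical helper names, and the `fetch` parameter is an assumption added so the loop can be tried without network access.

```python
def gutenberg_url(book_id):
    """Build the plain-text URL for a Gutenberg ID (same pattern as above)."""
    return f"https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"


def download_books(book_ids, fetch=None):
    """Return a dict mapping each book ID to its text; failed downloads are skipped.

    `fetch` is a hypothetical hook: any callable that takes a URL and returns
    an object with `status_code` and `text` attributes.
    """
    if fetch is None:
        import requests  ## imported lazily so the helper can be tested offline

        def fetch(url):
            return requests.get(url, timeout=30)

    texts = {}
    for book_id in book_ids:
        response = fetch(gutenberg_url(book_id))
        if response.status_code == 200:
            texts[book_id] = response.text
        else:
            print(f"Failed to download book ID {book_id} (status code {response.status_code})")
    return texts


## Example call (needs network access):
## austen_texts = download_books([1342, 161, 158])
## print({book_id: len(text) for book_id, text in austen_texts.items()})
```

Keeping the texts in a dictionary keyed by ID makes it easy to look up a specific novel later, and skipping failed downloads lets the remaining books still be analyzed.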

Step 2: Sentiment Analysis

Now that we have the text of a single book, "Pride and Prejudice", we can perform some basic sentiment analysis. What is sentiment analysis? It uses natural language processing (NLP) to estimate the emotions expressed in a text. The three exercises below build up to it: counting words, filtering stopwords, and averaging sentiment scores. Each exercise is fully developed in the Colab notebook linked above, but these code snippets give you a starting point.

Exercise: Count Words in a Sample Text

One thing you might want to understand about your text is how many words it contains and how often each one appears. The first step is to split the text into individual words, which we can then count.

import re

## Here we give a small sample text
sample_text = "Pride and Prejudice is a novel by Jane Austen. It is full of wit and social commentary."

words = re.findall(r'\b\w+\b', sample_text.lower()) ## lowercase the text, then use a regex to find all words

print("Words:", words)
print("Total words:", len(words))
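
To go from a total word count to how often each word appears, one option is `collections.Counter` from the Python standard library. This is a minimal sketch on the same sample text; the variable name `word_counts` is just illustrative.

```python
import re
from collections import Counter

sample_text = "Pride and Prejudice is a novel by Jane Austen. It is full of wit and social commentary."
words = re.findall(r"\b\w+\b", sample_text.lower())

word_counts = Counter(words)  ## maps each distinct word to its number of occurrences

print("Distinct words:", len(word_counts))
print("Occurrences of 'and':", word_counts["and"])
print("Most common:", word_counts.most_common(3))
```

In this sample, "and" and "is" each appear twice and every other word appears once.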

Exercise: Filter Stopwords

You might notice that in the sample text above "is" occurs twice but adds little to the overall meaning. Can you think of other words that appear often but have no significant effect on the meaning of a text? Such stopwords are common words that carry little meaning in text analysis. We can filter them out to focus on more meaningful words.

stop_words = {"and", "is", "a", "it", "of"} ## a small set of common stopwords

filtered_words = [] ## empty list to hold words that are not stopwords
for word in words: ## loop through all words to filter out stopwords
    if word not in stop_words: ## check if word is not a stopword
        filtered_words.append(word) ## add to filtered list

print("Filtered words:", filtered_words)
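
The explicit loop can also be written as a list comprehension, which is the more common idiom once the logic is familiar. The sketch below repeats the setup so it runs on its own, then counts what is left; `content_counts` is an illustrative name.

```python
import re
from collections import Counter

sample_text = "Pride and Prejudice is a novel by Jane Austen. It is full of wit and social commentary."
words = re.findall(r"\b\w+\b", sample_text.lower())
stop_words = {"and", "is", "a", "it", "of"}  ## the same small stopword set

## keep only the words that are not stopwords
filtered_words = [word for word in words if word not in stop_words]

content_counts = Counter(filtered_words)  ## frequencies of the remaining words

print("Words before filtering:", len(words))
print("Words after filtering:", len(filtered_words))
print("Remaining word counts:", dict(content_counts))
```

A real analysis would normally use a much longer stopword list than these five words.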

Exercise: Compute Average Sentiment

In the full Colab notebook, sentiment scores would be calculated with a library like NLTK. For this sample code, though, we use made-up scores.

## made-up sentiment scores, one per sentence
sentiment_scores = [0.1, 0.3, 0.2, -0.1, 0.0, 0.2, 0.1, -0.2, 0.0, 0.1]

avg_sentiment = sum(sentiment_scores) / len(sentiment_scores) ## calculate average sentiment

print(f"Average sentiment: {avg_sentiment:.2f}") ## round to two decimals for display
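
To give a feel for where per-sentence scores could come from, here is a toy lexicon-based scorer. It is only an illustration, not NLTK and not the notebook's method: the lexicon `TOY_LEXICON`, the helper `score_sentence`, and the example sentences are all invented for this sketch.

```python
import re

## Hypothetical toy lexicon, invented for this illustration only
TOY_LEXICON = {
    "happy": 1.0, "love": 1.0, "delighted": 1.0,
    "miserable": -1.0, "hate": -1.0, "angry": -1.0,
}


def score_sentence(sentence):
    """Average the lexicon scores of a sentence's words (unknown words count as 0)."""
    words = re.findall(r"\b\w+\b", sentence.lower())
    if not words:
        return 0.0
    return sum(TOY_LEXICON.get(word, 0.0) for word in words) / len(words)


## made-up example sentences
sentences = ["I am happy.", "I hate rain.", "It is a novel."]

sentiment_scores = [score_sentence(sentence) for sentence in sentences]
avg_sentiment = sum(sentiment_scores) / len(sentiment_scores)

print("Scores:", [round(score, 2) for score in sentiment_scores])
print(f"Average sentiment: {avg_sentiment:.2f}")
```

A real sentiment library handles negation, intensity, and a far larger vocabulary, so treat these numbers only as a demonstration of the scoring and averaging steps.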

Step 3: Visualizing Sentiment Trajectories

Here is a small demo of sentiment trajectories, that is, how sentiment rises and falls across successive segments of a book. It plots sample data; in the full notebook, we would use real sentiment analysis results.

import pandas as pd ## for data table manipulation
import matplotlib.pyplot as plt ## for plotting

data = {
    "Book": ["Pride and Prejudice"]*5 + ["Emma"]*5,
    "Segment": [1,2,3,4,5]*2,
    "Sentiment": [0.1,0.3,0.2,-0.1,0.0, 0.2,0.1,-0.2,0.0,0.1]
} ## sample data representing sentiment scores across segments of two books

df = pd.DataFrame(data) ## create a DataFrame from the sample data
fig, ax = plt.subplots(figsize=(6,4)) ## create a plot
for book, subset in df.groupby("Book"):
    ax.plot(subset["Segment"], subset["Sentiment"], marker='o', label=book) ## plot sentiment trajectories for each book

## Add titles and labels
ax.set_title("Sentiment Trajectories Across Austen Novels") 
ax.set_xlabel("Segment")
ax.set_ylabel("Sentiment Score")
ax.axhline(0, color='gray', linestyle='--')
ax.legend()
plt.show()
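
The "Segment" column above assumes each book has already been divided into consecutive parts. One simple way to do that, sketched here under the assumption that segments are equal-sized runs of words, is shown below; `split_into_segments` is a hypothetical helper, not a function from the notebook.

```python
def split_into_segments(words, n_segments):
    """Split a list of words into n_segments consecutive chunks of near-equal size."""
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")
    ## boundary positions: 0, ..., len(words), spaced as evenly as integers allow
    bounds = [i * len(words) // n_segments for i in range(n_segments + 1)]
    return [words[bounds[i]:bounds[i + 1]] for i in range(n_segments)]


## the opening sentence of "Pride and Prejudice", lowercased and without punctuation
words = ("it is a truth universally acknowledged that a single man in possession "
         "of a good fortune must be in want of a wife").split()

segments = split_into_segments(words, 5)
for number, segment in enumerate(segments, start=1):
    print(number, len(segment), " ".join(segment))
```

Each segment could then be scored and the results collected into the "Segment" and "Sentiment" columns of the DataFrame.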