Scraping SpongeBob Scripts

Kevin Minkus

2019-03-01

Encyclopedia SpongeBobia is a fantastic wiki dedicated to SpongeBob SquarePants. It's filled with great content, such as a list of all the show's deceased characters (like Smitty Werben Jager Man Jensen - he was number 1). It also includes transcripts from every single episode. I'm a casual admirer of the digital humanities, so I thought it'd be fun to apply methods from distant reading to a body of literature with which I'm especially familiar.

To do that, I first need to scrape those transcripts. This post is a quick walkthrough of how I'm doing that. We'll be using the requests Python package for sending HTTP requests, BeautifulSoup for parsing the HTML, and re for a bit of regex matching.

I'll link this great post, too, on how to scrape web pages ethically.

Ultimately the goal is to create something like this force-directed graph of character co-occurrences: visualizations, statistics, and models that better illuminate our understanding of the show.

First let's load in the needed packages.

import requests
from bs4 import BeautifulSoup
import re
import pickle

And we'll set up where to pull the URLs from, and our request headers.

root_url = "https://spongebob.fandom.com"

transcript_list = "https://spongebob.fandom.com/wiki/List_of_transcripts"

headers = {
    'User-Agent': 'My User Agent 1.0',
    'From': '[email protected]'
}

Let's grab the page that contains the locations of every transcript, and parse it.

transcript_page = requests.get(transcript_list, headers=headers)

soup = BeautifulSoup(transcript_page.text, 'html.parser')

I usually do a lot of trial and error / hunting and pecking to find the HTML tags that hold the information I'm actually interested in. Here, the links we want are the anchor tags whose text is "View transcript".

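# all anchor tags that contain a single text string
# (string= is the current name for the older text= argument)
text_entries = soup.find_all('a', string=True)
# keep the href of every "View transcript" link
transcript_href = [x.get('href') for x in text_entries if x.text == "View transcript"]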

This gives us the locations of those transcripts.

transcript_href[0:10]

['/wiki/Help_Wanted/transcript',
 '/wiki/Reef_Blower/transcript',
 '/wiki/Tea_at_the_Treedome/transcript',
 '/wiki/Bubblestand/transcript',
 '/wiki/Ripped_Pants/transcript',
 '/wiki/Jellyfishing/transcript',
 '/wiki/Plankton!/transcript',
 '/wiki/Naughty_Nautical_Neighbors/transcript',
 '/wiki/Boating_School/transcript',
 '/wiki/Pizza_Delivery/transcript']

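Some of these can be None (an anchor tag with no href attribute), so we'll filter those out. Note that this reuses the name transcript_list, which until now held the URL of the list page.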

transcript_list = [x for x in transcript_href if x]
transcript_list[-10:]

['/wiki/Squidward%27s_Back/transcript',
 '/wiki/Ask_Patrick_Anything/transcript',
 '/wiki/The_SpongeBob_SquarePants_Movie/transcript',
 '/wiki/The_SpongeBob_Movie:_Sponge_Out_of_Water/transcript',
 '/wiki/Texas_(voice-over)/transcript',
 '/wiki/Drawing_Characters/transcript',
 '/wiki/Behind_the_Scenes:_The_Voices_of_SpongeBob_%26_Friends/transcript',
 '/wiki/Behind_the_Scenes_with_Pick_Boy_and_SpongeBob/transcript',
 '/wiki/Behind_the_Scenes_of_the_SpongeBob_Opening/transcript',
 '/wiki/How_to_Make_SpongeBob_SquarePants/transcript']
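These hrefs are site-relative paths, so they need the root URL in front of them before they can be requested. Here is a minimal sketch of one way to do that with urljoin from the standard library. The two sample paths are copied from the output above, and full_urls is just a name I made up for the example.

```python
from urllib.parse import urljoin

root_url = "https://spongebob.fandom.com"

# two of the relative paths from the output above
sample_hrefs = [
    "/wiki/Help_Wanted/transcript",
    "/wiki/Squidward%27s_Back/transcript",
]

# resolve each relative path against the site root
full_urls = [urljoin(root_url, href) for href in sample_hrefs]

for url in full_urls:
    print(url)
```

Plain string concatenation works just as well here, since every path starts with a slash. urljoin is only a bit more forgiving if a path ever turns out to be a full URL already.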

The function below takes one of the transcript locations above and returns a dictionary where the keys are the speakers in that episode and the values are each speaker's lines, joined into a single string. The regex in here is taken from Stack Overflow.

def get_episode_quotes(transcript_code):
    # get episode webpage
    episode = requests.get(f"{root_url}{transcript_code}", headers=headers)
    # parse webpage for speakers, which are bold text
    ep_bs = BeautifulSoup(episode.text, 'html.parser')
    # skip the first 14 bold tags (found by trial and error; presumably not speakers)
    speakers = [x.text for x in ep_bs.find_all("b")[14:]]
    # add speakers to dictionary (duplicate names collapse into one key)
    speaker_dict = {speaker: "" for speaker in speakers}
    # grab all lines
    lines = [x.text for x in ep_bs.find_all("li")]
    for speaker in speaker_dict:
        # pull the lines for a given speaker and drop the speaker label
        quotes = [x.replace(speaker, "") for x in lines if speaker in x]
        # parse out the stage notes, which occur in brackets or parentheses
        # (raw string so the backslashes reach the regex engine untouched)
        speaker_dict[speaker] = re.sub(r"[\(\[].*?[\)\]]", "", " ".join(quotes))
    return speaker_dict
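To see what that regex is doing on its own, here is a small sketch that runs the same pattern over a made-up line. The sample text and the STAGE_NOTE name are mine, not from the wiki.

```python
import re

# same pattern as in get_episode_quotes: an opening ( or [,
# then as few characters as possible, then a closing ) or ]
STAGE_NOTE = re.compile(r"[\(\[].*?[\)\]]")

# invented sample line with one bracketed and one parenthesized note
sample = "[walks in] Good morning, Krusty Krew! (waves) One Krabby Patty, please."

cleaned = STAGE_NOTE.sub("", sample)
print(repr(cleaned))
```

The notes disappear but the spaces around them stay, which is why the scraped quotes are full of doubled spaces. The pattern also doesn't care whether the opening and closing brackets match, and it won't handle nested brackets, which seems fine for a first pass.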

And let's grab every set of lines from every episode.

quote_dicts = [get_episode_quotes(code) for code in transcript_list]
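That comprehension fires off one request after another as fast as it can. In the spirit of scraping politely, a pause between requests is one option. This is only a sketch: fetch_all, fake_fetch, and the one second default are all my own assumptions, and fake_fetch stands in for get_episode_quotes so the example runs without touching the network.

```python
import time


def fetch_all(codes, fetch, delay_seconds=1.0):
    """Call fetch(code) for each code, sleeping between consecutive calls."""
    results = []
    for i, code in enumerate(codes):
        if i > 0:
            # wait before every request except the first
            time.sleep(delay_seconds)
        results.append(fetch(code))
    return results


def fake_fetch(code):
    # hypothetical stand-in for get_episode_quotes; makes no HTTP request
    return {"code": code}


demo = fetch_all(
    ["/wiki/A/transcript", "/wiki/B/transcript"],
    fake_fetch,
    delay_seconds=0.0,  # no waiting in the demo
)
print(demo)
```

With the real function this would be called as fetch_all(transcript_list, get_episode_quotes).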

Here are the speakers from the first episode, Help Wanted:

quote_dicts[0].keys()

dict_keys(['French Narrator:', 'SpongeBob:', 'Gary:', 'Patrick:', 'Squidward:', 'Mr. Krabs:', 'Bus driver:', 'Anchovy:', 'Anchovies:', 'Squidward and Mr. Krabs:'])

And Patrick's quotes from that episode:

quote_dicts[0]['Patrick:']

' Go, SpongeBob!  Whoa! \n  Where do you think you\'re going?\n  No you\'re not. You\'re going to the Krusty Krab and get that job.\n  Whose first words were "may I take your order"?\n  Who made a spatula out of toothpicks in wood shop?\n   Who\'s a, uh, who\'s uhh, oh! Who\'s a big yellow cube with holes?\n  Who\'s ready?\n  Who\'s ready?\n  Who\'s ready?\n  Good morning, Krusty Krew!\n  One Krabby Patty, please. \n  \n'
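That string still carries a lot of stray whitespace and newlines. One possible cleanup, sketched here on a shortened copy of the output above, is to split it into one entry per line and collapse the runs of spaces. The names raw and tidy are just for illustration.

```python
# shortened copy of Patrick's quotes from above
raw = ' Go, SpongeBob!  Whoa! \n  Where do you think you\'re going?\n  \n'

# one entry per non-blank line, with runs of whitespace collapsed to one space
tidy = [" ".join(line.split()) for line in raw.split("\n") if line.strip()]

for line in tidy:
    print(line)
```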

Let's save this list of dictionaries for my next post.

with open("transcript_speaker_dicts.pkl", "wb") as buff:
    pickle.dump(quote_dicts, buff)
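Reading the file back is the mirror image of writing it. Here is a sketch of the round trip, using a tiny stand-in list and a temporary directory so it doesn't overwrite the real file. sample_dicts and loaded are made-up names for the example.

```python
import os
import pickle
import tempfile

# tiny stand-in for quote_dicts
sample_dicts = [{"Patrick:": " Go, SpongeBob!  Whoa! "}, {"SpongeBob:": ""}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "transcript_speaker_dicts.pkl")
    # write the list of dictionaries, same as above
    with open(path, "wb") as buff:
        pickle.dump(sample_dicts, buff)
    # read it back in; note "rb" instead of "wb"
    with open(path, "rb") as buff:
        loaded = pickle.load(buff)

print(loaded == sample_dicts)
```

Only unpickle files you created yourself, since loading a pickle can run arbitrary code.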