Sean Trott
The goal of this tutorial is to give a quick overview of some tools, libraries, and resources to help researchers interested in using language as a kind of data.
It'll focus on:
- pandas
- spaCy
- nltk

Caveat (1): This tutorial is by no means authoritative! It simply represents some of what I've had to learn along the way in my own research, so hopefully it will be helpful to others looking to do similar things!
Caveat (2): The resources I mention will all be in English, though some of them (like WordNet or SUBTLEX) do exist in other languages.
These libraries will be useful for pretty much any data science-y project involving importing/wrangling/visualizing data. (Note that later on, we'll also import some other relevant, language-specific libraries.)
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina' # makes figs nicer!
# Progress bar!
from tqdm import tqdm
There are a ton of useful lexical resources out there. These might help you with variables like Age of Acquisition and Frequency.

So Part 1 will involve using pandas to explore some of these lexical resources.
There are a number of really useful, freely available resources, particularly for English lexical data.
These are all resources I've benefitted greatly from.
Let's start with SUBTLEX, which contains word token frequency estimates from English movie subtitles.
Lexical frequency is often an important variable in experiments, e.g., for matching or controlling stimuli.
I've included a .csv file of the SUBTLEX data in this repository, so you should be able to load it if you've cloned the repository to your computer.
First, we call the read_csv function from the pandas library and pass in the filepath. This will create a DataFrame object, which is really well-suited to working with large, tabular datasets.
# Read in the dataframe
df_subtlex = pd.read_csv("data/english_subtlex.csv")
# How big is the dataset?
len(df_subtlex)
We can also see exactly what the dataframe looks like by calling the head function. This will print out a nice table of the first N rows, giving us the ability to see which columns the data includes. The documentation tells us that this dataset should contain the following columns:
- Word: the word!
- Freqcount: the number of times the word appears in the corpus.
- CDcount: the number of films in which the word appears.
- Freqlow: the number of times the word appears starting with a lowercase letter.
- CDlow: the number of films in which the word appears starting with a lowercase letter.
- SUBTLWF: word frequency, per million words.
- Lg10WF: equivalent to log10(Freqcount + 1), with four-digit precision.
- SUBTLCD: the percentage of films in which the word appears.
- Lg10CD: equivalent to log10(CDcount + 1).

df_subtlex.head(5)
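As a quick sanity check (and a bit of practice with column operations), we can verify the documented relationship between the raw counts and the log frequencies ourselves. This is just a sketch; it assumes the columns are named 'FREQcount' and 'Lg10WF' in the version of the file you've downloaded.
import numpy as np
# Recompute log10(FREQcount + 1) and compare it to the precomputed Lg10WF column
# (any difference should only reflect rounding)
recomputed = np.log10(df_subtlex['FREQcount'] + 1)
(recomputed - df_subtlex['Lg10WF']).abs().max()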
We can also create some simple visualizations of the data. Let's start with just the raw counts:
# Use matplotlib to make a histogram of counts
plt.hist(df_subtlex['FREQcount'])
What's going on here?
Because frequency is extremely right-skewed (a small number of words are very frequent, and most words are much less frequent), it'll be hard to visualize the raw counts.
Thus, frequency is often analyzed (and visualized) as its log:
# Histogram of *log* frequencies
plt.hist(df_subtlex['Lg10WF'])
Famously, Zipf's law states that a word's frequency is inversely proportional to its frequency rank. When visualized, Frequency ~ Rank(Frequency) exhibits a classic power-law relationship. We can visualize that here.
First, we need to calculate rank frequency. Fortunately, pandas has a built-in function to calculate this:
# Use .rank() on the column to compute each word's frequency rank (rank 1 = most frequent)
df_subtlex['rank_frequency'] = df_subtlex['Lg10WF'].rank(ascending = False)
# Now visualize relationship for top 1000 words
sns.lineplot(data = df_subtlex[df_subtlex['rank_frequency'] <1000],
x = 'rank_frequency',
y = 'Lg10WF')
plt.xlabel("Frequency Rank")
plt.ylabel("Log(Frequency)")
plt.title("1000 most frequent English words")
There's been a ton of great work trying to model Zipf's law of frequency, which is typically represented by the function:

$F(w) = \frac{a}{r(w)^b}$

where $F(w)$ is the frequency of word $w$, $r(w)$ is its frequency rank, and $a$ and $b$ are free parameters to be estimated.
Below, I have a quick example of fitting these parameters using scipy.optimize.curve_fit.
# First, we define a function representing Zipf's law
def zipf(x, a, b):
    """Zipf's law, where x = rank, and a & b are the parameters to learn."""
    return a / (x**b)
## Let's set up our x and y variables
df_frequencies_top_n = df_subtlex[df_subtlex['rank_frequency'] < 1000].copy() # .copy() avoids a SettingWithCopyWarning when we add a predictions column later
x = df_frequencies_top_n['rank_frequency'].values
y = df_frequencies_top_n['Lg10WF'].values
from scipy.optimize import curve_fit
## Use 'curve_fit' to fit our actual data to this function
z_popt, z_pcov = curve_fit(zipf, x, y)
## Our estimated *parameters*
z_popt
# We can then use these learned parameters to generate predictions
y_pred = zipf(x, *z_popt)
# Assign to new column in dataframe
df_frequencies_top_n['y_pred'] = y_pred
# Plot the real log frequencies against the model's predictions
# (we could also plot the *normalized* log frequency on the y-axis)
sns.lineplot(data = df_frequencies_top_n,
x = 'rank_frequency',
y = 'Lg10WF',
label = "Real values")
sns.lineplot(data = df_frequencies_top_n,
x = 'rank_frequency',
y = 'y_pred',
label = "Predictions")
plt.title("Zipf's law of frequency")
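As a rough check on how well the fitted curve describes the data, we can also compute an R² value from the residuals. This is just a quick sketch that reuses the x, y, and y_pred arrays defined above.
import numpy as np
## Rough goodness-of-fit check: R^2 computed from the residuals
ss_res = np.sum((y - y_pred)**2)       # residual sum of squares
ss_tot = np.sum((y - np.mean(y))**2)   # total sum of squares
r_squared = 1 - ss_res / ss_tot
r_squared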
So that was a brief detour into modeling frequency data.
What else might you use these frequency estimates for?
Among other things, a common use for psycholinguists designing their own experiment is to match stimuli for something like word frequency. This is useful if you want to balance across conditions, or match filler items.
To make this a little simpler, let's just randomly sample words from our frequency dataset, and pretend those are the items we're interested in.
critical_items = df_subtlex['Word'].sample(100, random_state = 10).values
critical_items
Note that we could've just sampled the frequency information along with the words themselves, but I'm trying to make this harder!
Let's start by turning those critical words into a dataframe.
df_critical = pd.DataFrame(critical_items, columns=['Word'])
df_critical.head(5)
# Now we can just call `pd.merge` on our frequency data.
df_merged = pd.merge(df_critical,
df_subtlex[['Word', 'Lg10WF']],
on="Word")
df_merged.head(5)
plt.hist(df_merged['Lg10WF'], label = "critical", alpha = .6, color = "gray")
Now comes the tricky part. We need to identify another set of 100 words in our frequency dataset that matches the frequency distribution above.
How can we do this?
Here's one solution.
For each critical word:
1. Filter the frequency dataset to words whose log frequency is within some THRESHOLD of the target word's log frequency.
2. Randomly sample one word from that filtered set.
Now repeat 1-2 for each target word, making sure we don't sample the same word twice.
sampled = [] ## Keep track of the words we've sampled already
FREQ_THRESHOLD = 1 ## Feel free to play around with the stringency of our threshold
fillers = []
for index, row in df_merged.iterrows(): ## This will iterate through each *row* of our critical dataframe
    freq = row['Lg10WF']
    # Now filter our subtlex dataframe to words with frequency values in the appropriate range
    df_tmp = df_subtlex[(df_subtlex['Lg10WF'] < freq + FREQ_THRESHOLD) & (df_subtlex['Lg10WF'] > freq - FREQ_THRESHOLD)]
    # FOR PRACTICE:
    # How would you also match for POS, Concreteness, AoA, ...?
    # Now sample from that dataframe
    while True:
        filler_row = df_tmp.sample(n = 1)
        if filler_row['Word'].iloc[0] not in sampled:
            fillers.append(filler_row)
            sampled.append(filler_row['Word'].iloc[0])
            break
df_fillers = pd.concat(fillers)
df_fillers.head(5)
The two distributions (plotted below) look very similar. That's good! That means we've done a good job matching.
As noted above, feel free to play around with the FREQ_THRESHOLD variable: the larger you make the threshold (i.e., the more "relaxed"/less stringent it is), the less tightly our distributions will be matched.
plt.hist(df_merged['Lg10WF'], label = "critical", alpha = .6, color = "gray")
plt.hist(df_fillers['Lg10WF'], label = "fillers", alpha = .3, color = "green")
plt.legend()
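Beyond eyeballing the histograms, we can also compare simple summary statistics for the two sets. Here's a quick check using the dataframes created above.
## Compare the mean and standard deviation of log frequency for critical vs. filler items
print("Critical:", df_merged['Lg10WF'].mean(), df_merged['Lg10WF'].std())
print("Fillers:", df_fillers['Lg10WF'].mean(), df_fillers['Lg10WF'].std())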
What if you want to look at how Frequency relates to other variables, like Age of Acquisition or Concreteness?
This is a common data science problem. We have multiple datasets, and we want to join (i.e., merge) them in some way.
Let's walk through how that'd work below.
# First, let's read in an AoA dataset
df_aoa = pd.read_csv("data/english_aoa_norms.csv")
len(df_aoa)
df_aoa.head(5)
# How can we merge this with our frequency dataset?
df_aoa.head(5)
## now we should merge these datasets.
## We'll lose some rows b/c not all the same words are in both.
df_merged = pd.merge(df_aoa, df_subtlex, on = "Word")
len(df_merged)
df_merged.head(5)
## Now we can visualize this relationship
sns.scatterplot(data = df_merged,
x = "Lg10WF",
y = "Rating.Mean",
alpha = .1)
plt.ylabel("Age of Acquisition")
plt.xlabel("Log(Frequency)")
plt.title("AoA ~ Frequency")
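If we want to quantify this relationship rather than just visualize it, we can also compute a simple correlation. Here's a sketch using scipy.stats.pearsonr on the merged dataframe (dropping any rows with missing values first).
from scipy.stats import pearsonr
## Correlation between log frequency and mean AoA rating
df_tmp = df_merged[['Lg10WF', 'Rating.Mean']].dropna()
pearsonr(df_tmp['Lg10WF'], df_tmp['Rating.Mean'])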
Hopefully, Part 1 has been a helpful introduction to pandas and to working with lexical resources like SUBTLEX and the AoA norms.
In Part 2, I'll talk briefly about manipulating strings in Python, using built-in methods that don't require installing other packages.
I'll just focus on a simple example to start: counting the number of times that a word occurs in a passage, using built-in Python methods.
A common task is counting how many times some word occurs in a passage, or across some set of passages.
For example, maybe you think that a passage contains a metaphor, and so you want to know whether there are any words associated with that metaphor.
## target words to search for
battle_words = ['fight', 'win', 'war', 'battle']
## Now let's consider a few example sentences
example_passages = [
'The war on climate change is just beginning!',
'We have been battling climate change for years',
'Win the war on global warming!'
]
How could we count the number of times each battle_word occurs in each sentence?
There are a couple different approaches to this. The first couple I'll describe use variants of string matching: i.e., they check whether a substring in a sentence perfectly matches one of our target words.
What are some possible limitations to exact string matching for counting words?
## Method 1: split each sentence into words, and count number of battle words
from collections import Counter
all_counts = []
for p in example_passages:
    p = p.lower() # Make lowercase!
    words = p.split(" ") # Split based on spaces (not always appropriate!)
    # Count #occurrences of each word in the sentence
    counts = Counter(words)
    # Get counts for each battle word
    battle_counts = [(i, counts[i]) for i in battle_words]
    all_counts.append((p, battle_counts))
## This is what we get. How did we do?
all_counts
We missed "battling" because we only included the string battle.
There are at least two solutions here:
1. Add battling (and other inflected forms) to our list of target words.
2. Lemmatize the words in each passage before matching.
I'll skip these solutions for now, but note that nltk (reviewed later) has methods to lemmatize each word in a string. This would change battling to battle, so we could match it in the counting process.
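If you're curious, here's a minimal sketch of that lemmatization route using nltk's WordNetLemmatizer. Note that the lemmatizer needs to be told the part of speech, and you may need to download the WordNet data first.
## Sketch: lemmatizing with nltk (requires the WordNet data)
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("battling", pos = "v")  # should return 'battle'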
## Method 2: Regex
import re
r = re.compile(r'\bbattle\b|\bfight\b|\bwin\b|\bwar\b')
all_counts = []
for p in example_passages:
    p = p.lower() # Make lowercase!
    # Use regex to find matches of each target word
    groups = r.findall(p)
    battle_counts = Counter(groups)
    all_counts.append((p, battle_counts))
## How did we do?
all_counts
nltk and spaCy

Many Python packages are free to install and use. Here, I'll focus on two widely used libraries:
- nltk (the Natural Language Toolkit for Python)
- spaCy

Both come with their own detailed set of documentation, so the goal here is really just to introduce you to some of their basic affordances.
nltk

The nltk library comes with its own free book, which is really fantastic. Each chapter deals with a different issue, from accessing text corpora to using classifiers.
I'm going to discuss two resources/tools that I use frequently from nltk: sentiment analysis and WordNet.
The basic goal of sentiment analysis is to label/predict the "sentiment" of a piece of text. A simple version of this is labeling text as positive or negative.
nltk has its own page on sentiment analysis, so I'm going to focus here on a particular tool within nltk that's pretty plug-and-play.
The simplest possible approach to sentiment analysis might be something like this: assign each word in a lexicon a positive or negative valence score, then add up (or average) the scores of the words that appear in a text.
This "bag of words" approach sounds simplistic (and it is), but it can also be surprisingly effective! It's the basis behind lots of NLP tools like LIWC, which of course has more dimensions than simply valence.
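To make that concrete, here's a minimal sketch of a bag-of-words scorer. The tiny valence lexicon below is entirely made up for illustration; real tools use large, human-rated lexicons.
## Minimal bag-of-words sentiment sketch (toy lexicon; scores are made up)
toy_valence = {'good': 1, 'delicious': 2, 'bad': -1, 'terrible': -2}

def bow_sentiment(text):
    ## Sum the valence of each word in the text (0 for unknown words)
    return sum(toy_valence.get(w, 0) for w in text.lower().split())

bow_sentiment("The food was delicious")  # positive
bow_sentiment("The food was terrible")   # negative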
The downside to a purely bag of words approach is, of course, that words have different meanings in different contexts!
Example: the word "good" might be positive in "the food was good", but the use of negation ("the food was not good") flips the valence.
On the spectrum of simple to complex, VADER is probably closer to "simple", but there are some clever design features that make it better than the simplest approach.
To give a very simple example: each word in VADER's lexicon has a valence score obtained from human raters, and the score for a piece of text is (roughly) an aggregate of the scores of its words. VADER builds upon this simple principle and adds five rules or heuristics. Here's a rough summary, paraphrased from their paper:
1. Punctuation (e.g., an exclamation point) increases the intensity of the sentiment without changing its orientation.
2. Capitalization (e.g., writing a sentiment-laden word in ALL CAPS) also increases the intensity.
3. Degree modifiers ("extremely good" vs. "kind of good") increase or decrease the intensity.
4. The contrastive conjunction "but" signals a shift in sentiment, with the clause after "but" dominating.
5. Negation in the words preceding a sentiment-laden word (e.g., "not good") flips its polarity.
## First, we need to make sure we have downloaded the "Vader Sentiment lexicon"
import nltk
nltk.download('vader_lexicon')
## Now we import and create the VADER object
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
## Let's try VADER out on a simple sentence
ss = sid.polarity_scores("The food was delicious")
ss
VADER gives us a dictionary, which breaks down the model's confidence that the statement was:
- negative ('neg')
- neutral ('neu')
- positive ('pos')

It also contains a compound score, which combines the valence scores across the words into a single normalized value between -1 (most negative) and 1 (most positive).
## Now let's add punctuation: we see that it increases the overall sentiment!
ss = sid.polarity_scores("The food was delicious!!")
ss['compound']
## Now let's try this on a trickier sentence, involving negation. We see that "not" flips the orientation!
ss = sid.polarity_scores("The food was not delicious!!")
ss['compound']
Of course, VADER isn't perfect.
But VADER is a nice, plug-and-play model to get an estimate of text valence.
## Example:
## VADER thinks this sentence is negative,
## possibly just because of the word "gun".
## But different people might think it's positive or negative...
ss = sid.polarity_scores("Congress passes gun control legislation.")
ss['compound']
nltk also comes with a number of corpora, including WordNet. Again, nltk has a great tutorial on working with WordNet, so here I'll just limit this to a brief demonstration.
## Import wordnet corpus
from nltk.corpus import wordnet as wn
## WordNet maps words onto "synsets"; each synset is a set of synonyms with a common meaning
bank_synsets = wn.synsets('bank')
## With "bank", we can print out the definition of each synset
for ss in bank_synsets:
    print("{name}: {definition}\n".format(name = ss.name(), definition = ss.definition()))
## Some synsets are also annotated with examples
wn.synset('bank.n.01').examples()
## Some synsets are also annotated with examples
wn.synset('bank.n.02').examples()
## Synsets also have hypernyms and hyponyms
wn.synset('bank.n.02').hypernyms()
## Synsets also have hypernyms and hyponyms
wn.synset('bank.n.02').hyponyms()
Importantly, WordNet's synsets are structured in a taxonomy. So we can measure the distance between different synsets in a couple ways.
## "Path similarity" identifies the shortest path between the two synsets in the taxonomy
wn.synset('bank.n.02').path_similarity(wn.synset('bank.n.01'))
## "Path similarity" identifies the shortest path between the two synsets in the taxonomy
wn.synset('bank.n.02').path_similarity(wn.synset('acquirer.n.02'))
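Another measure available in nltk is Wu-Palmer similarity (wup_similarity), which scores two synsets based on their depths in the taxonomy and the depth of their most specific common ancestor.
## "Wu-Palmer similarity" uses the depths of the synsets and of their most specific common ancestor
wn.synset('bank.n.02').wup_similarity(wn.synset('bank.n.01'))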
What if you want to get an estimate of how many meanings each unique word in a lexicon has?
### if you want to look at all words in (English) WordNet, you can call this function:
words = list(wn.words())
## Now we could count the number of senses for each word!
senses = []
for w in tqdm(words):
    for ss in wn.synsets(w):
        senses.append({
            'word': w,
            'sense': ss.name(),
            'definition': ss.definition()
        })
## They've got some random words in here!
df_senses = pd.DataFrame(senses)
df_senses.sample(5, random_state = 42)
## Lots of meanings in WordNet:
len(df_senses)
## But if we collapse across multiple senses of a word, we can count a word's polysemy
## Group by each unique word, and count number of entries for that word
df_collapsed = df_senses.groupby('word').count().reset_index()
len(df_collapsed) ## Should be #words
## About 1.5 meanings per word
df_collapsed['sense'].mean()
## The most polysemous word has 75 meanings!
df_collapsed['sense'].max()
## Let's view the most polysemous words
df_collapsed.sort_values(by = 'sense', ascending = False).head(5)
## We could also rank by polysemy, and visualize that distribution
df_collapsed['polysemy_rank'] = df_collapsed['sense'].rank(ascending = False)
sns.lineplot(data = df_collapsed[df_collapsed['polysemy_rank']<1000],
x = 'polysemy_rank',
y = 'sense')
plt.xlabel("Rank(#senses)")
plt.ylabel("#senses")
spaCy

Like nltk, spaCy has great documentation. Here, I'll just show a couple of the cool things you can do with this library:
import spacy
## Load the English spaCy model; you can also use spaCy for
## German, Dutch, and other languages (https://spacy.io/models)
## Note: newer versions of spaCy use the full model name (e.g., "en_core_web_sm")
## rather than the old "en" shortcut.
nlp = spacy.load("en_core_web_sm")
Producing a dependency parse is strikingly straightforward!
Just pass a string into the nlp object. The displacy function can be used to display the parsed sentence.
from spacy import displacy
doc = nlp("The man went to the store")
displacy.render(doc, style="dep", options = {'compact': True})
This is a dependency parse.
That is, each word token is labeled with a directed link to or from another word (and is also labeled with a part of speech).
If you want to access these labels (rather than just visualize them), you can iterate through the doc object:
for i in doc:
    print("Word: {w}, \n POS: {pos}, \n Head: {head}, \n Dependency label: {dep}".format(
        w = i.text, pos = i.pos_, dep = i.dep_, head = i.head))
We can also use this doc object to return all the noun phrases (noun_chunks) in a given string.
for nc in doc.noun_chunks:
    print(nc.text)
    print(" Root: " + nc.root.text)
    print(" Root POS: " + nc.root.tag_)
    print(" Determiner: " + nc[0].text)
    print(" Determiner POS: " + nc[0].tag_)
To sum up what we've learned:
- pandas makes it easy to import, wrangle, and visualize tabular data, including lexical resources like SUBTLEX.
- Python's built-in string methods and regular expressions can handle simple tasks like counting word occurrences.
- There are freely available libraries, like nltk and spaCy, that can help with text processing.

This is only the tip of the iceberg, but hopefully it helps you get started doing text processing in Python!