5. Events Analysis
5.1. Covered in this Chapter
How to analyze events found in a .tokens file.
How to identify sentences that have events.
How to create a custom .events file.
5.2. Introduction
In this chapter, we will explore the events found by the BookNLP pipeline. Unfortunately, this pipeline does not output a .events file, so if we want to discover event data, we need to figure out how to manipulate the output files to our advantage.
The only output file that details event data is the .tokens file. As a result, this file will be the focus of this chapter. Each section will analyze the .tokens file more deeply to identify and extract event data. At the end of the chapter, we will bring everything together with a single function that can reproduce these results on any BookNLP .tokens output file.
5.3. Exploring the Tokens File
Let’s first open up the .tokens file and take a look at it so we can remember precisely what this .tsv file looks like. If you remember from chapter 3, we can analyze the file more easily if we use pandas, a tabular data analysis library in Python.
import pandas as pd
# load the tab-separated BookNLP tokens file into a DataFrame
df = pd.read_csv("data/harry_potter/harry_potter.tokens", delimiter="\t")
df
| | paragraph_ID | sentence_ID | token_ID_within_sentence | token_ID_within_document | word | lemma | byte_onset | byte_offset | POS_tag | fine_POS_tag | dependency_relation | syntactic_head_ID | event |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | Mr. | Mr. | 0 | 3 | PROPN | NNP | nmod | 3 | O |
| 1 | 0 | 0 | 1 | 1 | and | and | 4 | 7 | CCONJ | CC | cc | 0 | O |
| 2 | 0 | 0 | 2 | 2 | Mrs. | Mrs. | 8 | 12 | PROPN | NNP | compound | 3 | O |
| 3 | 0 | 0 | 3 | 3 | Dursley | Dursley | 13 | 20 | PROPN | NNP | nsubj | 12 | O |
| 4 | 0 | 0 | 4 | 4 | , | , | 20 | 21 | PUNCT | , | punct | 3 | O |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99251 | 2995 | 6172 | 10 | 99251 | Dudley | Dudley | 438929 | 438935 | PROPN | NNP | pobj | 99250 | O |
| 99252 | 2995 | 6172 | 11 | 99252 | this | this | 438936 | 438940 | DET | DT | det | 99253 | O |
| 99253 | 2995 | 6172 | 12 | 99253 | summer | summer | 438941 | 438947 | NOUN | NN | npadvmod | 99245 | O |
| 99254 | 2995 | 6172 | 13 | 99254 | .... | .... | 438947 | 438951 | PUNCT | . | punct | 99243 | O |
| 99255 | 2995 | 6172 | 14 | 99255 | \t | 438951 | 438952 | PUNCT | '' | punct | 99243 | O | NaN |
99256 rows × 13 columns
We have approximately 99,000 rows and 13 columns of data. Throughout this chapter, we will focus on just four of these columns:
sentence_ID
word
lemma
event
As such, let’s go ahead and remove all the extra data for now so that we can just view the columns we care about.
df = df[["sentence_ID", "word", "lemma", "event"]]
df
Excellent! Now we can analyze this event column a bit more easily.
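Before we start filtering, it can also help to see how the event labels are distributed across the entire DataFrame. As a quick optional sketch (not part of the original pipeline), pandas’ value_counts method can tally the labels, with dropna=False so that missing values are counted too:
df["event"].value_counts(dropna=False)
The O label should dominate by a wide margin, with far fewer EVENT rows and a few thousand NaN rows.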
5.4. Grabbing the Events
One of the things we can see above is that some rows contain NaN in the event column. Ideally, we want to ignore these entirely. We can do this in pandas with the isnull() method, negating it with ~ so that we keep only the rows where the event column is not null.
events = df[~df['event'].isnull()]
events
| | sentence_ID | word | lemma | event |
| --- | --- | --- | --- | --- |
| 0 | 0 | Mr. | Mr. | O |
| 1 | 0 | and | and | O |
| 2 | 0 | Mrs. | Mrs. | O |
| 3 | 0 | Dursley | Dursley | O |
| 4 | 0 | , | , | O |
| ... | ... | ... | ... | ... |
| 99250 | 6172 | with | with | O |
| 99251 | 6172 | Dudley | Dudley | O |
| 99252 | 6172 | this | this | O |
| 99253 | 6172 | summer | summer | O |
| 99254 | 6172 | .... | .... | O |
94498 rows × 4 columns
As we can see, this eliminated roughly 5,000 rows. Let’s take a closer look at the event column and see what kind of data we can expect to find there.
event_options = set(events.event.tolist())
print (event_options)
{'EVENT', 'O'}
By converting this column to a list and then to a set (which eliminates the duplicates), we can see that we have two types of data in the event column:
EVENT
O
If a row has “EVENT” in this column, it means the corresponding word was identified by the BookNLP pipeline as an event-triggering word. Now that we know this, let’s look at only the rows that have EVENT in the event column.
real_events = events.loc[events["event"] == "EVENT"]
real_events
| | sentence_ID | word | lemma | event |
| --- | --- | --- | --- | --- |
| 242 | 9 | shuddered | shudder | EVENT |
| 308 | 12 | woke | wake | EVENT |
| 346 | 13 | hummed | hum | EVENT |
| 349 | 13 | picked | pick | EVENT |
| 361 | 13 | gossiped | gossip | EVENT |
| ... | ... | ... | ... | ... |
| 99152 | 6167 | hung | hang | EVENT |
| 99185 | 6169 | said | say | EVENT |
| 99209 | 6170 | said | say | EVENT |
| 99215 | 6170 | surprised | surprised | EVENT |
| 99218 | 6170 | grin | grin | EVENT |
6029 rows × 4 columns
We now have only 6,029 rows to analyze!
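As a brief aside, if we also wanted to know which event lemmas occur most frequently (rather than just which rows are events), value_counts again offers a one-line answer. This is an optional sketch, not part of the original walkthrough:
real_events["lemma"].value_counts().head(10)
This would list the ten most common event lemmas along with how many times each one appears.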
5.5. Analyzing Event Words and Lemmas
Let’s dig a little deeper and analyze the words and lemmas of these rows to see how many unique words and lemmas we have.
event_words = set(real_events.word.tolist())
len(event_words)
1501
event_lemmas = list(set(real_events.lemma.tolist()))
event_lemmas.sort()
len(event_lemmas)
1021
print (event_lemmas[:10])
['BOOM', 'Bludger', 'Pompously', 'Scowling', 'Smelting', 'Whispers', 'aback', 'accept', 'ache', 'act']
While we have 1,501 unique words, we only have 1,021 unique lemmas. If we were interested in seeing the types of event words and lemmas that appear in Harry Potter, we can now do that, but one thing I notice quickly is that some lemmas are capitalized. Let’s eliminate the resulting duplicates by lowercasing all lemmas.
final_lemmas = []
for lemma in event_lemmas:
    # lowercase each lemma and keep only its first occurrence
    lemma = lemma.lower()
    if lemma not in final_lemmas:
        final_lemmas.append(lemma)
print(len(final_lemmas))
print(final_lemmas[:10])
1020
['boom', 'bludger', 'pompously', 'scowling', 'smelting', 'whispers', 'aback', 'accept', 'ache', 'act']
We eliminated only one duplicate.
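As a side note, the same deduplication can be written more compactly with a set comprehension. This is just an equivalent sketch; note that sorted() produces a fully alphabetized list, whereas the loop above preserved the order in which each lowercased lemma first appeared:
# deduplicate and lowercase in one expression, then sort alphabetically
final_lemmas = sorted({lemma.lower() for lemma in event_lemmas})
print(len(final_lemmas))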
5.6. Grabbing Event Sentences
Now that we know how to grab individual event-triggering words, what about the sentences that contain events? To analyze this, we can use the sentence_ID column, which contains a unique number for each sentence.
sentences = real_events.sentence_ID.tolist()
# use the name event_words so we do not overwrite our earlier events DataFrame
event_words = real_events.word.tolist()
print (sentences[:10])
print (event_words[:10])
[9, 12, 13, 13, 13, 13, 13, 14, 15, 15]
['shuddered', 'woke', 'hummed', 'picked', 'gossiped', 'wrestled', 'screaming', 'flutter', 'picked', 'pecked']
We can see that some sentences appear multiple times. This is because they contain multiple event-triggering words.
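As a quick illustration (not part of the original walkthrough), we can tally how many event words each sentence contains with Python’s built-in collections.Counter:
from collections import Counter

# count how often each sentence_ID appears in the event list
sentence_counts = Counter(sentences)
print (sentence_counts.most_common(5))
This would print the five sentence_IDs that contain the most event-triggering words, along with their counts.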
Let’s take a look at our initial DataFrame once again.
df
| | sentence_ID | word | lemma | event |
| --- | --- | --- | --- | --- |
| 0 | 0 | Mr. | Mr. | O |
| 1 | 0 | and | and | O |
| 2 | 0 | Mrs. | Mrs. | O |
| 3 | 0 | Dursley | Dursley | O |
| 4 | 0 | , | , | O |
| ... | ... | ... | ... | ... |
| 99251 | 6172 | Dudley | Dudley | O |
| 99252 | 6172 | this | this | O |
| 99253 | 6172 | summer | summer | O |
| 99254 | 6172 | .... | .... | O |
| 99255 | 6172 | \t | 438951 | NaN |
99256 rows × 4 columns
Let’s say we were interested in grabbing the sentence that contains the first event. We can grab all rows that have a matching sentence_ID.
sentence1 = sentences[0]
result = df[df["sentence_ID"] == int(sentence1)]
result
| | sentence_ID | word | lemma | event |
| --- | --- | --- | --- | --- |
| 240 | 9 | The | the | O |
| 241 | 9 | Dursleys | Dursleys | O |
| 242 | 9 | shuddered | shudder | EVENT |
| 243 | 9 | to | to | O |
| 244 | 9 | think | think | O |
| 245 | 9 | what | what | O |
| 246 | 9 | the | the | O |
| 247 | 9 | neighbors | neighbor | O |
| 248 | 9 | would | would | O |
| 249 | 9 | say | say | O |
| 250 | 9 | if | if | O |
| 251 | 9 | the | the | O |
| 252 | 9 | Potters | Potters | O |
| 253 | 9 | arrived | arrive | O |
| 254 | 9 | in | in | O |
| 255 | 9 | the | the | O |
| 256 | 9 | street | street | O |
| 257 | 9 | . | . | O |
With this data, we can then grab all the words and reconstruct the sentence.
words = result.word.tolist()
resentence = " ".join(words)
print (resentence)
The Dursleys shuddered to think what the neighbors would say if the Potters arrived in the street .
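Note that because we simply join the tokens with spaces, the reconstructed sentence has a space before the final period. For our purposes here, that is close enough to the original text.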
5.7. Bringing Everything Together
Let’s now bring together everything we just learned in this chapter and turn it into a function. This function will receive a file path that corresponds to a .tokens file. It will find the relevant event rows and then reconstruct the sentence that corresponds to each event word. The output will be a list of event-centric dictionaries. Each dictionary will have three keys:
event_word = the event-triggering word
event_lemma = the event_word’s lemma
sentence = the sentence that the event-triggering word is in
def grab_event_sentences(file):
    # load the BookNLP tokens file and keep only the event rows
    df = pd.read_csv(file, delimiter="\t")
    real_events = df.loc[df["event"] == "EVENT"]
    sentences = real_events.sentence_ID.tolist()
    event_words = real_events.word.tolist()
    event_lemmas = real_events.lemma.tolist()
    final_sentences = []
    # reconstruct the sentence that contains each event word
    for x, sentence in enumerate(sentences):
        result = df[df["sentence_ID"] == int(sentence)]
        words = result.word.tolist()
        resentence = " ".join(words)
        final_sentences.append({"event_word": event_words[x],
                                "event_lemma": event_lemmas[x],
                                "sentence": resentence
                                })
    return final_sentences
event_data = grab_event_sentences("data/harry_potter/harry_potter.tokens")
Let’s take a look at the output now.
print (event_data[0])
{'event_word': 'shuddered', 'event_lemma': 'shudder', 'sentence': 'The Dursleys shuddered to think what the neighbors would say if the Potters arrived in the street .'}
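One thing worth noting is that grab_event_sentences re-filters the entire DataFrame once for every event, which is wasteful on large books. A minimal alternative sketch (a variation on the function above, with the hypothetical name grab_event_sentences_fast; it is not part of BookNLP or the original chapter) precomputes each sentence once with a groupby and then looks events up in that mapping:
def grab_event_sentences_fast(file):
    df = pd.read_csv(file, delimiter="\t")
    # build a sentence_ID -> reconstructed sentence mapping in a single pass
    sentence_map = df.groupby("sentence_ID")["word"].apply(
        lambda words: " ".join(words.astype(str)))
    real_events = df.loc[df["event"] == "EVENT"]
    # one dictionary per event row, reusing the precomputed sentences
    return [{"event_word": row.word,
             "event_lemma": row.lemma,
             "sentence": sentence_map[row.sentence_ID]}
            for row in real_events.itertuples()]
Both versions should produce the same list of dictionaries; the groupby variant simply avoids rescanning the whole table for every event.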
5.8. Creating a .events File
This allows us to analyze the events identified by the BookNLP pipeline a bit more easily. Since we don’t have a .events output file, this is currently one way we can simulate the same result by creating a special events-centric output. With this data, we can now create a new DataFrame.
new_df = pd.DataFrame(event_data)
new_df
| | event_word | event_lemma | sentence |
| --- | --- | --- | --- |
| 0 | shuddered | shudder | The Dursleys shuddered to think what the neigh... |
| 1 | woke | wake | When Mr. and Mrs. Dursley woke up on the dull ... |
| 2 | hummed | hum | Mr. Dursley hummed as he picked out his most b... |
| 3 | picked | pick | Mr. Dursley hummed as he picked out his most b... |
| 4 | gossiped | gossip | Mr. Dursley hummed as he picked out his most b... |
| ... | ... | ... | ... |
| 6024 | hung | hang | Harry hung back for a last word with Ron and H... |
| 6025 | said | say | \t \t Hope you have -- er -- a good holiday , ... |
| 6026 | said | say | \t Oh , I will , \t said Harry , and they were... |
| 6027 | surprised | surprised | \t Oh , I will , \t said Harry , and they were... |
| 6028 | grin | grin | \t Oh , I will , \t said Harry , and they were... |
6029 rows × 3 columns
We can also output it to the same subdirectory as the other files.
new_df.to_csv("data/harry_potter/harry_potter.events", index=False)
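If we want to verify the file we just created, we can read it right back in as a quick sanity check; since the sentences contain commas, to_csv quotes them for us and read_csv unquotes them on the way back in:
pd.read_csv("data/harry_potter/harry_potter.events")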
And now you have a .events file!