5. Events Analysis

Dr. W.J.B. Mattingly
Smithsonian Data Science Lab and United States Holocaust Memorial Museum
August 2021

5.1. Covered in this Chapter

  1. How to analyze events found in the .tokens file.

  2. How to identify sentences that have events.

  3. How to create a custom .events file.

5.2. Introduction

In this chapter, we will explore the events found by the BookNLP pipeline. Unfortunately, the pipeline does not output a .events file, so if we want event data, we need to extract it from the other output files.

The only output file that details event data is the .tokens file. As a result, this file will be the focus of this chapter. Each section of this chapter will analyze the .tokens file in a deeper way to identify and extract event data. At the end of the chapter, we will bring everything together with a single function that can recreate these results on any BookNLP output .tokens file.

5.3. Exploring the Tokens File

Let’s first open the .tokens file and take a look at it to remind ourselves what this tab-separated file looks like. As you may remember from chapter 3, these files are easier to analyze with pandas, a tabular data analysis library in Python.

import pandas as pd
df = pd.read_csv("data/harry_potter/harry_potter.tokens", delimiter="\t")
df
paragraph_ID sentence_ID token_ID_within_sentence token_ID_within_document word lemma byte_onset byte_offset POS_tag fine_POS_tag dependency_relation syntactic_head_ID event
0 0 0 0 0 Mr. Mr. 0 3 PROPN NNP nmod 3 O
1 0 0 1 1 and and 4 7 CCONJ CC cc 0 O
2 0 0 2 2 Mrs. Mrs. 8 12 PROPN NNP compound 3 O
3 0 0 3 3 Dursley Dursley 13 20 PROPN NNP nsubj 12 O
4 0 0 4 4 , , 20 21 PUNCT , punct 3 O
... ... ... ... ... ... ... ... ... ... ... ... ... ...
99251 2995 6172 10 99251 Dudley Dudley 438929 438935 PROPN NNP pobj 99250 O
99252 2995 6172 11 99252 this this 438936 438940 DET DT det 99253 O
99253 2995 6172 12 99253 summer summer 438941 438947 NOUN NN npadvmod 99245 O
99254 2995 6172 13 99254 .... .... 438947 438951 PUNCT . punct 99243 O
99255 2995 6172 14 99255 \t 438951 438952 PUNCT '' punct 99243 O NaN

99256 rows × 13 columns
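The last row above looks misaligned: its word is a tab character, its lemma is a number, and its event is NaN. One possible explanation, offered here as an assumption rather than a confirmed diagnosis, is that the token on that row is a bare double-quote character, which pandas treats as the start of a quoted field by default. The sketch below uses a tiny hand-made sample (not real BookNLP output) to show how passing quoting=csv.QUOTE_NONE makes pandas keep quote characters as ordinary text.

```python
import csv
import io

import pandas as pd

# Tiny stand-in for a .tokens file (hypothetical rows, not real BookNLP output).
# Two of the tokens are a bare double-quote character.
sample = (
    "sentence_ID\tword\tlemma\tevent\n"
    '0\t"\t"\tO\n'
    "0\tOh\toh\tO\n"
    "0\t,\t,\tO\n"
    '0\t"\t"\tO\n'
    "0\tsaid\tsay\tEVENT\n"
)

# quoting=csv.QUOTE_NONE tells the parser to treat " as an ordinary character,
# so every line stays one row and no column is shifted.
tokens = pd.read_csv(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
print(tokens)
```

If you read your own .tokens file this way, the row counts will likely differ from the ones shown in this chapter, which were produced with the default settings.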

We have 99,256 rows and 13 columns of data. Throughout this chapter, we will focus on only four columns in particular:

  • sentence_ID

  • word

  • lemma

  • event

As such, let’s go ahead and remove all the extra data for now so that we can just view the columns we care about.

df = df[["sentence_ID", "word", "lemma", "event"]]
df

Excellent! Now we can analyze this event column a bit more easily.

5.4. Grabbing the Events

One of the things we can see above is that some rows contain NaN in the event column. Ideally, we want to ignore these rows entirely. We can do this in pandas by using the isnull() method and negating the result with ~.

events = df[~df['event'].isnull()]
events
sentence_ID word lemma event
0 0 Mr. Mr. O
1 0 and and O
2 0 Mrs. Mrs. O
3 0 Dursley Dursley O
4 0 , , O
... ... ... ... ...
99250 6172 with with O
99251 6172 Dudley Dudley O
99252 6172 this this O
99253 6172 summer summer O
99254 6172 .... .... O

94498 rows × 4 columns

As we can see, this eliminated 4,758 rows, or roughly 5,000. Let’s take a closer look at the event column and see what kind of data we can expect to find there.

event_options = set(events.event.tolist())
print (event_options)
{'EVENT', 'O'}

By converting this column to a list and then to a set (which eliminates the duplicates), we can see that we have two types of data in the event column:

  • EVENT

  • O
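As an alternative to the list-and-set approach, a minimal sketch using the pandas value_counts() method shows each label together with how often it occurs. The small Series below is made-up sample data, not the Harry Potter output; passing dropna=False also counts the NaN rows.

```python
import pandas as pd

# Made-up stand-in for the event column, including one missing value.
event_column = pd.Series(
    ["O", "EVENT", "O", float("nan"), "O", "EVENT"], name="event"
)

# Count every distinct label; dropna=False keeps NaN as its own category.
counts = event_column.value_counts(dropna=False)
print(counts)
```

On the real data, the same call would be made on df["event"].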

If a row has “EVENT” in this column, the corresponding word was identified by the BookNLP pipeline as an event-triggering word. Now that we know this, let’s take a look at only the rows that have EVENT in the event column.

real_events = events.loc[events["event"] == "EVENT"]
real_events
sentence_ID word lemma event
242 9 shuddered shudder EVENT
308 12 woke wake EVENT
346 13 hummed hum EVENT
349 13 picked pick EVENT
361 13 gossiped gossip EVENT
... ... ... ... ...
99152 6167 hung hang EVENT
99185 6169 said say EVENT
99209 6170 said say EVENT
99215 6170 surprised surprised EVENT
99218 6170 grin grin EVENT

6029 rows × 4 columns

We now have only 6,029 rows to analyze!

5.5. Analyzing Event Words and Lemmas

Let’s dig a little deeper and analyze the words and lemmas of these rows to see how many unique words and lemmas we have.

event_words = set(real_events.word.tolist())
len(event_words)
1501
event_lemmas = list(set(real_events.lemma.tolist()))
event_lemmas.sort()
len(event_lemmas)
1021
print (event_lemmas[:10])
['BOOM', 'Bludger', 'Pompously', 'Scowling', 'Smelting', 'Whispers', 'aback', 'accept', 'ache', 'act']

While we have 1,501 unique words, we only have 1,021 unique lemmas. If we were interested in seeing which event words and lemmas appear in Harry Potter, we could now do that, but something I notice quickly is that some lemmas are capitalized. Let’s eliminate any duplicates that differ only by case by lowercasing all lemmas.

final_lemmas = []
for lemma in event_lemmas:
    lemma = lemma.lower()
    if lemma not in final_lemmas:
        final_lemmas.append(lemma)
        
print(len(final_lemmas))
print(final_lemmas[:10])
1020
['boom', 'bludger', 'pompously', 'scowling', 'smelting', 'whispers', 'aback', 'accept', 'ache', 'act']

We eliminated only one duplicate.
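The same deduplication can be sketched more compactly with a set comprehension. The short list below is a made-up sample standing in for event_lemmas. Note one difference from the loop above: sorting happens after lowercasing, so the result is in plain alphabetical order.

```python
# Made-up sample standing in for the event_lemmas list.
event_lemmas = ["BOOM", "Bludger", "boom", "accept", "ache"]

# Lowercase every lemma, let the set drop duplicates, then sort the result.
final_lemmas = sorted({lemma.lower() for lemma in event_lemmas})

print(len(final_lemmas))
print(final_lemmas)
```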

5.6. Grabbing Event Sentences

Now that we know how to grab individual event-triggering words, what about the sentences that contain events? To analyze this, we can use the sentence_ID column which contains a unique number for each sentence.

sentences = real_events.sentence_ID.tolist()
events = real_events.word.tolist()
print (sentences[:10])
print (events[:10])
[9, 12, 13, 13, 13, 13, 13, 14, 15, 15]
['shuddered', 'woke', 'hummed', 'picked', 'gossiped', 'wrestled', 'screaming', 'flutter', 'picked', 'pecked']

We can see that some sentences appear multiple times. This is because they contain multiple words that are event-triggering.
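To see how many event-triggering words each sentence contains, one option is to group the event rows by sentence_ID. The sketch below uses a small hand-built DataFrame in the shape of real_events rather than the full data.

```python
import pandas as pd

# Hand-built sample in the shape of real_events (not the full data).
real_events = pd.DataFrame({
    "sentence_ID": [9, 12, 13, 13, 13],
    "word": ["shuddered", "woke", "hummed", "picked", "gossiped"],
})

# Count the event rows that share each sentence_ID.
events_per_sentence = real_events.groupby("sentence_ID").size()
print(events_per_sentence)
```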

Let’s take a look at our initial DataFrame once again.

df
sentence_ID word lemma event
0 0 Mr. Mr. O
1 0 and and O
2 0 Mrs. Mrs. O
3 0 Dursley Dursley O
4 0 , , O
... ... ... ... ...
99251 6172 Dudley Dudley O
99252 6172 this this O
99253 6172 summer summer O
99254 6172 .... .... O
99255 6172 \t 438951 NaN

99256 rows × 4 columns

Let’s say we were interested in grabbing the sentence that contains the first event. To do this, we can grab all rows that have a matching sentence_ID.

sentence1 = sentences[0]
result = df[df["sentence_ID"] == int(sentence1)]
result
sentence_ID word lemma event
240 9 The the O
241 9 Dursleys Dursleys O
242 9 shuddered shudder EVENT
243 9 to to O
244 9 think think O
245 9 what what O
246 9 the the O
247 9 neighbors neighbor O
248 9 would would O
249 9 say say O
250 9 if if O
251 9 the the O
252 9 Potters Potters O
253 9 arrived arrive O
254 9 in in O
255 9 the the O
256 9 street street O
257 9 . . O

With this data, we can then grab all the words and reconstruct the sentence.

words = result.word.tolist()
resentence = " ".join(words)
print (resentence)
The Dursleys shuddered to think what the neighbors would say if the Potters arrived in the street .

5.7. Bringing Everything Together

Let’s now bring together everything we just learned in this chapter and turn it into a function. This function will receive the path to a .tokens file. It will find the relevant event rows and then reconstruct the sentence that corresponds to each event word. The output will be a list of event-centric dictionaries. Each dictionary will have 3 keys:

  • event_word = the event-triggering word

  • event_lemma = the event_word’s lemma

  • sentence = the sentence that the event-triggering word is in

def grab_event_sentences(file):
    # Load the tab-separated .tokens file.
    df = pd.read_csv(file, delimiter="\t")
    # Keep only the rows flagged as event-triggering words.
    real_events = df.loc[df["event"] == "EVENT"]
    sentences = real_events.sentence_ID.tolist()
    event_words = real_events.word.tolist()
    event_lemmas = real_events.lemma.tolist()
    final_sentences = []
    for sentence, event_word, event_lemma in zip(sentences, event_words, event_lemmas):
        # Rebuild the full sentence that contains this event word.
        result = df[df["sentence_ID"] == int(sentence)]
        words = result.word.tolist()
        resentence = " ".join(words)
        final_sentences.append({"event_word": event_word,
                                "event_lemma": event_lemma,
                                "sentence": resentence})
    return final_sentences
    
    
event_data = grab_event_sentences("data/harry_potter/harry_potter.tokens")

Let’s take a look at the output now.

print (event_data[0])
{'event_word': 'shuddered', 'event_lemma': 'shudder', 'sentence': 'The Dursleys shuddered to think what the neighbors would say if the Potters arrived in the street .'}
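The function above filters the whole DataFrame once per event, which can be slow on a long book. A possible variation, sketched here under the assumption that the DataFrame has already been loaded, joins the words of every sentence once with groupby and then looks each sentence up by its ID. The name grab_event_sentences_grouped and the sample rows are hypothetical.

```python
import pandas as pd


def grab_event_sentences_grouped(df):
    # Join the words of each sentence once, keyed by sentence_ID.
    sentence_text = df.groupby("sentence_ID")["word"].agg(
        lambda words: " ".join(words.astype(str))
    )
    # Keep only the rows flagged as event-triggering words.
    real_events = df.loc[df["event"] == "EVENT"]
    return [
        {"event_word": word, "event_lemma": lemma, "sentence": sentence_text.loc[sid]}
        for sid, word, lemma in zip(
            real_events["sentence_ID"], real_events["word"], real_events["lemma"]
        )
    ]


# Made-up rows in the shape of the trimmed .tokens DataFrame.
sample = pd.DataFrame({
    "sentence_ID": [0, 0, 0, 1, 1, 1, 1],
    "word": ["Harry", "grinned", ".", "Ron", "woke", "up", "."],
    "lemma": ["Harry", "grin", ".", "Ron", "wake", "up", "."],
    "event": ["O", "EVENT", "O", "O", "EVENT", "O", "O"],
})
event_data = grab_event_sentences_grouped(sample)
print(event_data[0])
```

The output has the same three keys per dictionary, so it can be used in the same way as the result of grab_event_sentences.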

5.8. Creating a .events File

This list of dictionaries lets us analyze the events identified by the BookNLP pipeline a bit more easily. Since we don’t have a .events output file, this is one way to simulate one by creating a special event-centric output. With this data, we can now create a new DataFrame.

new_df = pd.DataFrame(event_data)
new_df
event_word event_lemma sentence
0 shuddered shudder The Dursleys shuddered to think what the neigh...
1 woke wake When Mr. and Mrs. Dursley woke up on the dull ...
2 hummed hum Mr. Dursley hummed as he picked out his most b...
3 picked pick Mr. Dursley hummed as he picked out his most b...
4 gossiped gossip Mr. Dursley hummed as he picked out his most b...
... ... ... ...
6024 hung hang Harry hung back for a last word with Ron and H...
6025 said say \t \t Hope you have -- er -- a good holiday , ...
6026 said say \t Oh , I will , \t said Harry , and they were...
6027 surprised surprised \t Oh , I will , \t said Harry , and they were...
6028 grin grin \t Oh , I will , \t said Harry , and they were...

6029 rows × 3 columns

We can also save it to the same subdirectory as the other files. Note that to_csv writes comma-separated values by default.

new_df.to_csv("data/harry_potter/harry_potter.events", index=False)

And now you have a .events file!
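To check that such a file loads back cleanly, here is a minimal round-trip sketch. It writes a one-row, made-up DataFrame to a temporary directory (the sample.events file name is hypothetical) and reads it again with pd.read_csv. Because to_csv wrote comma-separated values, no delimiter argument is needed, and pandas quotes the sentence so its commas do not split the row.

```python
import os
import tempfile

import pandas as pd

# One made-up row in the shape of the event-centric DataFrame.
new_df = pd.DataFrame([{
    "event_word": "said",
    "event_lemma": "say",
    "sentence": "Oh , I will , said Harry .",
}])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.events")  # hypothetical file name
    new_df.to_csv(path, index=False)
    # Read the file back while the temporary directory still exists.
    restored = pd.read_csv(path)

print(restored)
```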