5. Events Analysis
5.1. Covered in this Chapter
How to analyze events found in a .tokens file.
How to identify sentences that have events.
How to create a custom .events file.
5.2. Introduction
In this chapter, we will explore the events found by the BookNLP pipeline. Unfortunately, this pipeline does not output a .events file, so if we want to discover event data, we need to figure out how to manipulate the output files to our advantage.
The only output file that details event data is the .tokens file. As a result, this file will be the focus of this chapter. Each section will analyze the .tokens file more deeply to identify and extract event data. At the end of the chapter, we will bring everything together with a single function that can reproduce these results on any BookNLP .tokens output file.
5.3. Exploring the Tokens File
Let’s first open up the .tokens file and take a look at it so we can remember precisely what this .tsv file looks like. If you remember from chapter 3, we can analyze the file more easily if we use pandas, a tabular data analysis library in Python.
import pandas as pd
# load the tab-separated BookNLP tokens file into a DataFrame
df = pd.read_csv("data/harry_potter/harry_potter.tokens", delimiter="\t")
df
| | paragraph_ID | sentence_ID | token_ID_within_sentence | token_ID_within_document | word | lemma | byte_onset | byte_offset | POS_tag | fine_POS_tag | dependency_relation | syntactic_head_ID | event |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | Mr. | Mr. | 0 | 3 | PROPN | NNP | nmod | 3 | O |
| 1 | 0 | 0 | 1 | 1 | and | and | 4 | 7 | CCONJ | CC | cc | 0 | O |
| 2 | 0 | 0 | 2 | 2 | Mrs. | Mrs. | 8 | 12 | PROPN | NNP | compound | 3 | O |
| 3 | 0 | 0 | 3 | 3 | Dursley | Dursley | 13 | 20 | PROPN | NNP | nsubj | 12 | O |
| 4 | 0 | 0 | 4 | 4 | , | , | 20 | 21 | PUNCT | , | punct | 3 | O |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99251 | 2995 | 6172 | 10 | 99251 | Dudley | Dudley | 438929 | 438935 | PROPN | NNP | pobj | 99250 | O |
| 99252 | 2995 | 6172 | 11 | 99252 | this | this | 438936 | 438940 | DET | DT | det | 99253 | O |
| 99253 | 2995 | 6172 | 12 | 99253 | summer | summer | 438941 | 438947 | NOUN | NN | npadvmod | 99245 | O |
| 99254 | 2995 | 6172 | 13 | 99254 | .... | .... | 438947 | 438951 | PUNCT | . | punct | 99243 | O |
| 99255 | 2995 | 6172 | 14 | 99255 | \t | 438951 | 438952 | PUNCT | '' | punct | 99243 | O | NaN |
99256 rows × 13 columns
We have approximately 99,000 rows and 13 columns of data. Throughout this chapter, we will focus on just four of these columns:
sentence_ID
word
lemma
event
As such, let’s go ahead and remove all the extra data for now so that we can just view the columns we care about.
df = df[["sentence_ID", "word", "lemma", "event"]]
df
Excellent! Now we can analyze this event column a bit more easily.
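Before we start filtering, it can also help to see how the event labels are distributed across the entire DataFrame. As a quick optional sketch (not part of the original pipeline), pandas’ value_counts method can tally the labels, with dropna=False so that missing values are counted too:
df["event"].value_counts(dropna=False)
The O label should dominate by a wide margin, with far fewer EVENT rows and a few thousand NaN rows.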
5.4. Grabbing the Events
One of the things we can see above is that some rows contain NaN in the event column. Ideally, we want to ignore these entirely. We can do this in pandas with the isnull() method, negating it with ~ so that we keep only the rows where the event column is not null.
events = df[~df['event'].isnull()]
events
| | sentence_ID | word | lemma | event |
| --- | --- | --- | --- | --- |
| 0 | 0 | Mr. | Mr. | O |
| 1 | 0 | and | and | O |
| 2 | 0 | Mrs. | Mrs. | O |
| 3 | 0 | Dursley | Dursley | O |
| 4 | 0 | , | , | O |
| ... | ... | ... | ... | ... |
| 99250 | 6172 | with | with | O |
| 99251 | 6172 | Dudley | Dudley | O |
| 99252 | 6172 | this | this | O |
| 99253 | 6172 | summer | summer | O |
| 99254 | 6172 | .... | .... | O |
94498 rows × 4 columns
As we can see, this eliminated roughly 5,000 rows. Let’s take a closer look at the event column and see what kind of data we can expect to find there.
event_options = set(events.event.tolist())
print (event_options)
{'EVENT', 'O'}
By converting this column to a list and then to a set (which eliminates the duplicates), we can see that we have two types of data in the event column:
EVENT
O
If a row has “EVENT” in this column, it means the corresponding word was identified by the BookNLP pipeline as an event-triggering word. Now that we know this, let’s look at only the rows that have EVENT in the event column.
real_events = events.loc[events["event"] == "EVENT"]
real_events
| | sentence_ID | word | lemma | event |
| --- | --- | --- | --- | --- |
| 242 | 9 | shuddered | shudder | EVENT |
| 308 | 12 | woke | wake | EVENT |
| 346 | 13 | hummed | hum | EVENT |
| 349 | 13 | picked | pick | EVENT |
| 361 | 13 | gossiped | gossip | EVENT |
| ... | ... | ... | ... | ... |
| 99152 | 6167 | hung | hang | EVENT |
| 99185 | 6169 | said | say | EVENT |
| 99209 | 6170 | said | say | EVENT |
| 99215 | 6170 | surprised | surprised | EVENT |
| 99218 | 6170 | grin | grin | EVENT |
6029 rows × 4 columns
We now have only 6,029 rows to analyze!
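As a brief aside, if we also wanted to know which event lemmas occur most frequently (rather than just which rows are events), value_counts again offers a one-line answer. This is an optional sketch, not part of the original walkthrough:
real_events["lemma"].value_counts().head(10)
This would list the ten most common event lemmas along with how many times each one appears.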
5.5. Analyzing Event Words and Lemmas
Let’s dig a little deeper and analyze the words and lemmas of these rows to see how many unique words and lemmas we have.
event_words = set(real_events.word.tolist())
len(event_words)
1501
event_lemmas = list(set(real_events.lemma.tolist()))
event_lemmas.sort()
len(event_lemmas)
1021
print (event_lemmas[:10])
['BOOM', 'Bludger', 'Pompously', 'Scowling', 'Smelting', 'Whispers', 'aback', 'accept', 'ache', 'act']
While we have 1,501 unique words, we only have 1,021 unique lemmas. If we were interested in seeing the types of event words and lemmas that appear in Harry Potter, we can now do that, but one thing I notice quickly is that some lemmas are capitalized. Let’s eliminate the resulting duplicates by lowercasing all lemmas.
final_lemmas = []
for lemma in event_lemmas:
    # lowercase each lemma and keep only its first occurrence
    lemma = lemma.lower()
    if lemma not in final_lemmas:
        final_lemmas.append(lemma)
print(len(final_lemmas))
print(final_lemmas[:10])
1020
['boom', 'bludger', 'pompously', 'scowling', 'smelting', 'whispers', 'aback', 'accept', 'ache', 'act']
We eliminated only one duplicate.
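As a side note, the same deduplication can be written more compactly with a set comprehension. This is just an equivalent sketch; note that sorted() produces a fully alphabetized list, whereas the loop above preserved the order in which each lowercased lemma first appeared:
# deduplicate and lowercase in one expression, then sort alphabetically
final_lemmas = sorted({lemma.lower() for lemma in event_lemmas})
print(len(final_lemmas))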
5.6. Grabbing Event Sentences
Now that we know how to grab individual event-triggering words, what about the sentences that contain events? To analyze this, we can use the sentence_ID column, which contains a unique number for each sentence.
sentences = real_events.sentence_ID.tolist()
# use the name event_words so we do not overwrite our earlier events DataFrame
event_words = real_events.word.tolist()
print (sentences[:10])
print (event_words[:10])
[9, 12, 13, 13, 13, 13, 13, 14, 15, 15]
['shuddered', 'woke', 'hummed', 'picked', 'gossiped', 'wrestled', 'screaming', 'flutter', 'picked', 'pecked']
We can see that some sentences appear multiple times. This is because they contain multiple event-triggering words.
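As a quick illustration (not part of the original walkthrough), we can tally how many event words each sentence contains with Python’s built-in collections.Counter:
from collections import Counter

# count how often each sentence_ID appears in the event list
sentence_counts = Counter(sentences)
print (sentence_counts.most_common(5))
This would print the five sentence_IDs that contain the most event-triggering words, along with their counts.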
Let’s take a look at our initial DataFrame once again.
df
| | sentence_ID | word | lemma | event |
| --- | --- | --- | --- | --- |
| 0 | 0 | Mr. | Mr. | O |
| 1 | 0 | and | and | O |
| 2 | 0 | Mrs. | Mrs. | O |
| 3 | 0 | Dursley | Dursley | O |
| 4 | 0 | , | , | O |
| ... | ... | ... | ... | ... |
| 99251 | 6172 | Dudley | Dudley | O |
| 99252 | 6172 | this | this | O |
| 99253 | 6172 | summer | summer | O |
| 99254 | 6172 | .... | .... | O |
| 99255 | 6172 | \t | 438951 | NaN |
99256 rows × 4 columns
Let’s say we were interested in grabbing the sentence that contains the first event. We can grab all rows that have a matching sentence_ID.
sentence1 = sentences[0]
result = df[df["sentence_ID"] == int(sentence1)]
result
| | sentence_ID | word | lemma | event |
| --- | --- | --- | --- | --- |
| 240 | 9 | The | the | O |
| 241 | 9 | Dursleys | Dursleys | O |
| 242 | 9 | shuddered | shudder | EVENT |
| 243 | 9 | to | to | O |
| 244 | 9 | think | think | O |
| 245 | 9 | what | what | O |
| 246 | 9 | the | the | O |
| 247 | 9 | neighbors | neighbor | O |
| 248 | 9 | would | would | O |
| 249 | 9 | say | say | O |
| 250 | 9 | if | if | O |
| 251 | 9 | the | the | O |
| 252 | 9 | Potters | Potters | O |
| 253 | 9 | arrived | arrive | O |
| 254 | 9 | in | in | O |
| 255 | 9 | the | the | O |
| 256 | 9 | street | street | O |
| 257 | 9 | . | . | O |
With this data, we can then grab all the words and reconstruct the sentence.
words = result.word.tolist()
resentence = " ".join(words)
print (resentence)
The Dursleys shuddered to think what the neighbors would say if the Potters arrived in the street .
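Note that because we simply join the tokens with spaces, the reconstructed sentence has a space before the final period. For our purposes here, that is close enough to the original text.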
5.7. Bringing Everything Together
Let’s now bring together everything we just learned in this chapter and turn it into a function. This function will receive a file path that corresponds to a .tokens file. It will find the relevant event rows and then reconstruct the sentence that corresponds to each event word. The output will be a list of event-centric dictionaries. Each dictionary will have three keys:
event_word = the event-triggering word
event_lemma = the event_word’s lemma
sentence = the sentence that the event-triggering word is in
def grab_event_sentences(file):
    # load the BookNLP tokens file and keep only the event rows
    df = pd.read_csv(file, delimiter="\t")
    real_events = df.loc[df["event"] == "EVENT"]
    sentences = real_events.sentence_ID.tolist()
    event_words = real_events.word.tolist()
    event_lemmas = real_events.lemma.tolist()
    final_sentences = []
    # reconstruct the sentence that contains each event word
    for x, sentence in enumerate(sentences):
        result = df[df["sentence_ID"] == int(sentence)]
        words = result.word.tolist()
        resentence = " ".join(words)
        final_sentences.append({"event_word": event_words[x],
                                "event_lemma": event_lemmas[x],
                                "sentence": resentence
                                })
    return final_sentences
event_data = grab_event_sentences("data/harry_potter/harry_potter.tokens")
Let’s take a look at the output now.
print (event_data[0])
{'event_word': 'shuddered', 'event_lemma': 'shudder', 'sentence': 'The Dursleys shuddered to think what the neighbors would say if the Potters arrived in the street .'}
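One thing worth noting is that grab_event_sentences re-filters the entire DataFrame once for every event, which is wasteful on large books. A minimal alternative sketch (a variation on the function above, with the hypothetical name grab_event_sentences_fast; it is not part of BookNLP or the original chapter) precomputes each sentence once with a groupby and then looks events up in that mapping:
def grab_event_sentences_fast(file):
    df = pd.read_csv(file, delimiter="\t")
    # build a sentence_ID -> reconstructed sentence mapping in a single pass
    sentence_map = df.groupby("sentence_ID")["word"].apply(
        lambda words: " ".join(words.astype(str)))
    real_events = df.loc[df["event"] == "EVENT"]
    # one dictionary per event row, reusing the precomputed sentences
    return [{"event_word": row.word,
             "event_lemma": row.lemma,
             "sentence": sentence_map[row.sentence_ID]}
            for row in real_events.itertuples()]
Both versions should produce the same list of dictionaries; the groupby variant simply avoids rescanning the whole table for every event.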
5.8. Creating a .events File
This allows us to analyze the events identified by the BookNLP pipeline a bit more easily. Since we don’t have a .events output file, this is currently one way we can simulate the same result by creating a special events-centric output. With this data, we can now create a new DataFrame.
new_df = pd.DataFrame(event_data)
new_df
| | event_word | event_lemma | sentence |
| --- | --- | --- | --- |
| 0 | shuddered | shudder | The Dursleys shuddered to think what the neigh... |
| 1 | woke | wake | When Mr. and Mrs. Dursley woke up on the dull ... |
| 2 | hummed | hum | Mr. Dursley hummed as he picked out his most b... |
| 3 | picked | pick | Mr. Dursley hummed as he picked out his most b... |
| 4 | gossiped | gossip | Mr. Dursley hummed as he picked out his most b... |
| ... | ... | ... | ... |
| 6024 | hung | hang | Harry hung back for a last word with Ron and H... |
| 6025 | said | say | \t \t Hope you have -- er -- a good holiday , ... |
| 6026 | said | say | \t Oh , I will , \t said Harry , and they were... |
| 6027 | surprised | surprised | \t Oh , I will , \t said Harry , and they were... |
| 6028 | grin | grin | \t Oh , I will , \t said Harry , and they were... |
6029 rows × 3 columns
We can also output it to the same subdirectory as the other files.
new_df.to_csv("data/harry_potter/harry_potter.events", index=False)
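If we want to verify the file we just created, we can read it right back in as a quick sanity check; since the sentences contain commas, to_csv quotes them for us and read_csv unquotes them on the way back in:
pd.read_csv("data/harry_potter/harry_potter.events")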
And now you have a .events file!