4. Character Analysis ¶
4.1. Covered in this Chapter¶
How to analyze the .book file in depth
Create custom functions to generate character data
Analyze the character data at the verb-level
4.2. Introduction¶
This chapter is dedicated to analyzing the characters contained within the .book file. As you may recall from the last chapter, this is a JSON file. A lot of what I cover here can be found in the BookNLP repository, specifically in the Google Colab Jupyter notebook. I am, however, making some modifications to the code there to make it a bit more useful for varying circumstances. I will specifically show you how to use this restructured data to pose narrow questions about characters in a text.
The following imports and functions will be necessary for this chapter. They allow us to load the JSON data from the .book file and count the occurrences of the items found within it.
import json
from collections import Counter

# Load and parse a JSON file, such as the .book file.
def proc(filename):
    with open(filename) as file:
        data = json.load(file)
    return data

# Count how many times each word appears in a dependency list
# (e.g. a character's "agent", "patient", "poss", or "mod" list).
def get_counter_from_dependency_list(dep_list):
    counter = Counter()
    for token in dep_list:
        term = token["w"]
        # the global token index is also available here if you need it
        tokenGlobalIndex = token["i"]
        counter[term] += 1
    return counter
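To get a feel for what get_counter_from_dependency_list does before we use it, here is a minimal sketch with a made-up dependency list. The real lists in the .book file have the same shape: each token is a dictionary with a word ("w") and a global token index ("i").

# A made-up dependency list in the shape the .book file uses.
sample_dep_list = [
    {"w": "said", "i": 10},
    {"w": "said", "i": 42},
    {"w": "looked", "i": 77},
]

print(get_counter_from_dependency_list(sample_dep_list).most_common(2))
# [('said', 2), ('looked', 1)]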
Now that we have successfully created these functions, let's go ahead and load up our JSON data from the .book file. We can do this with the proc function we created above. Essentially, it loads and parses the JSON file for us using the json library that comes standard with Python.
data=proc("data/harry_potter/harry_potter.book")
Now that we have loaded the data, we can start to analyze it!
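If you want a quick look at what we just loaded before going further, a short sketch like this one can help. It relies only on the "characters" list, which is the part of the .book file we use throughout this chapter.

# The .book file parses to a dictionary; the "characters" entry is a
# list with one record per character the pipeline identified.
print(type(data))
print(len(data["characters"]))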
4.3. Analyzing the Characters (From BookNLP Repo)¶
If you have had a chance to look at the Google Colab notebook provided by BookNLP, this function will look similar. I have made some modifications to the code presented there so that we can do a bit more with it. In the notebook, the original code printed off character data. My modifications, and the fact that I have structured it as a function, allow us to do a bit more: we can actually begin analyzing the characters.
I have kept my function’s code as close to the original as possible so that it can be better understood within the documentation.
def create_character_data(data, printTop):
    character_data = {}
    for character in data["characters"]:

        agentList = character["agent"]
        patientList = character["patient"]
        possList = character["poss"]
        modList = character["mod"]

        character_id = character["id"]
        count = character["count"]

        # default to "unknown" in case the character has no gender data
        referential_gender_distribution = referential_gender = "unknown"

        if character["g"] is not None and character["g"] != "unknown":
            referential_gender_distribution = character["g"]["inference"]
            referential_gender = character["g"]["argmax"]

        mentions = character["mentions"]
        proper_mentions = mentions["proper"]
        max_proper_mention = ""

        # Let's create some empty lists that we can append to.
        poss_items = []
        agent_items = []
        patient_items = []
        mod_items = []

        # only keep information about named characters
        if len(mentions["proper"]) > 0:
            max_proper_mention = mentions["proper"][0]["n"]

            for k, v in get_counter_from_dependency_list(possList).most_common(printTop):
                poss_items.append((v, k))

            for k, v in get_counter_from_dependency_list(agentList).most_common(printTop):
                agent_items.append((v, k))

            for k, v in get_counter_from_dependency_list(patientList).most_common(printTop):
                patient_items.append((v, k))

            for k, v in get_counter_from_dependency_list(modList).most_common(printTop):
                mod_items.append((v, k))

            # print(character_id, count, max_proper_mention, referential_gender)

            character_data[character_id] = {"id": character_id,
                                            "count": count,
                                            "max_proper_mention": max_proper_mention,
                                            "referential_gender": referential_gender,
                                            "possList": poss_items,
                                            "agentList": agent_items,
                                            "patientList": patient_items,
                                            "modList": mod_items
                                            }

    return character_data
This function expects two arguments:
the data that we created above, i.e. the original .book JSON data
printTop, which is the number of items you want to return about each character
Let’s go ahead and create some character_data now that will retain the top 10 items connected to each character. If you want to see all possible things connected to a character, simply set printTop to a very high number, e.g. 20000. This is not the cleanest approach, but it allowed me to keep the function as simple as possible.
This function will return a new data file that will be a dictionary where each unique id is a key and the corresponding character data will be populated as its value (also structured as a dictionary). I have kept the keys of this nested dictionary identical to the original Google Colab file.
character_data = create_character_data(data, 10)
Now that we have created this character data, let’s take a look at the main Harry Potter id (which is 98).
print(character_data[98])
{'id': 98, 'count': 2029, 'max_proper_mention': 'Harry', 'referential_gender': 'he/him/his', 'possList': [(19, 'head'), (15, 'eyes'), (12, 'parents'), (12, 'cupboard'), (10, 'life'), (10, 'hand'), (9, 'aunt'), (8, 'mind'), (7, 'heart'), (7, 'uncle')], 'agentList': [(91, 'said'), (46, 'had'), (39, 'know'), (22, 'felt'), (22, 'saw'), (21, 'got'), (21, 'going'), (21, 'thought'), (18, 'heard'), (18, 'looked')], 'patientList': [(10, 'told'), (5, 'take'), (5, 'asked'), (4, 'kill'), (4, 'reminded'), (4, 'stop'), (4, 'got'), (4, 'tell'), (3, 'took'), (3, 'saw')], 'modList': [(8, 'sure'), (5, 'able'), (4, 'famous'), (3, 'glad'), (2, 'name'), (2, 'special'), (2, 'surprised'), (2, 'baby'), (2, 'stupid'), (2, 'afraid')]}
Notice that we can now see the main gender, verbs, possession items, etc. connected to Harry Potter. Having the data structured in this manner allows us to more easily start posing some questions to the original .book file.
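If you do not already know a character's id (here, 98 for Harry), one rough way to find it is to scan the new dictionary for the name you are after. The sketch below matches against max_proper_mention, so it will only surface named characters.

# Print the id and mention count of every character whose most common
# proper-name mention is "Harry".
for character_id, info in character_data.items():
    if info["max_proper_mention"] == "Harry":
        print(character_id, info["count"])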
4.4. Parsing Verb Usage¶
One of those questions can be about verb usage. I have created a brand new function that allows you to explore how certain verbs are used within the text based on the new character data file we just created. It expects one argument: the new character data file. We can also pass an optional keyword argument, analysis, which should be a list containing one or both of the following items:
agent - the doer of the action
patient - the recipient of the action
Again, this function is not something I would put in production. I have designed it to be easier to read so that you can do something similar and grab data you may find relevant for your own project or research.
def find_verb_usage(data, analysis=["agent", "patient"]):
    # Translate the requested roles into the keys used in the character data.
    new_analysis = []
    for item in analysis:
        if item == "agent":
            new_analysis.append("agentList")
        elif item == "patient":
            new_analysis.append("patientList")

    main_agents = {}
    main_patients = {}

    for character in data:
        temp_data = data[character]
        for item in new_analysis:
            for verb in temp_data[item]:
                verb = verb[1].lower()
                if item == "agentList":
                    if verb not in main_agents:
                        main_agents[verb] = [(character, temp_data["max_proper_mention"])]
                    else:
                        main_agents[verb].append((character, temp_data["max_proper_mention"]))
                elif item == "patientList":
                    if verb not in main_patients:
                        main_patients[verb] = [(character, temp_data["max_proper_mention"])]
                    else:
                        main_patients[verb].append((character, temp_data["max_proper_mention"]))

    verb_usage = {"agent": main_agents,
                  "patient": main_patients}
    return verb_usage
Essentially, this function will read in the character data file that we created above and create a new dictionary that has two keys: agent and patient. Within each will be the verbs used in the text. These will be matched to a list of the characters connected to those verbs. Let’s go ahead and create this verb data now.
verb_data = find_verb_usage(character_data)
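Before we start querying it, a quick sanity check can confirm the shape of what came back. This is just a sketch; the exact counts will depend on your text and on the printTop value you used earlier.

# verb_data has two keys, "agent" and "patient"; each maps a verb to a
# list of (character_id, name) tuples.
print(len(verb_data["agent"]))
print(len(verb_data["patient"]))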
By restructuring the data around the verbs, you can analyze the characters in a verb-centric manner. Let’s say I was interested in which characters were the agents of the verb “reared”. I could go into the dictionary at the agent key and look for the key reared within the agent verbs. The output is a list of tuples of (character_id, most frequent name for that character). In this case: Firenze the centaur.
verb_data["agent"]["reared"]
[(352, 'Firenze')]
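Because verb_data is just a nested dictionary, the same kind of lookup works under the patient key, and you can use .get() to avoid a KeyError for verbs that never appear. A small sketch:

# A safe lookup: returns an empty list if no character is ever the
# agent of this verb within their top-n items.
print(verb_data["agent"].get("reared", []))

# The same structure works for recipients of an action.
print(verb_data["patient"].get("told", []))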
It is important to note two things here, however. First, our verbs are not lemmatized. I intentionally left it this way because in some circumstances understanding how a verb is used is important. You may, for example, be interested in how “said” functioned in the text, not in both “said” and “say”. If you wanted to modify the code above, therefore, you could go into the tokens file to find each verb’s lemma.
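As a rough illustration of that last point, the sketch below builds a word-to-lemma lookup from the .tokens file. The file path and the column names ("word" and "lemma") are assumptions here; check the header of your own tokens file before relying on them.

import csv

word_to_lemma = {}
# The tokens file is tab-separated; quoting is turned off because the
# token text itself can contain quote characters.
with open("data/harry_potter/harry_potter.tokens", newline="") as f:
    reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # "word" and "lemma" are assumed column names; confirm them
        # against your file's header row.
        word_to_lemma[row["word"]] = row["lemma"]

print(word_to_lemma.get("said"))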
Another thing to note is that we are only seeing results from the top 10 items in this scenario. If you want to see how verbs are used across all characters, create a new character data file with printTop set to a larger number.
I am going to conclude this chapter temporarily here. I would encourage you all to explore the functions a bit more on your own and come up with your own functions that allow you to extract data to address specific questions. Submit them to me via GitHub and I will add them as separate sections in this chapter. Just let me know how you wish to be cited.
In the next chapter, we will do something similar to this, but for events.