{ "cells": [ { "cell_type": "markdown", "id": "identified-insert", "metadata": {}, "source": [ "#
The Output Files
" ] }, { "cell_type": "markdown", "id": "brave-attack", "metadata": {}, "source": [ "
Dr. W.J.B. Mattingly
\n", "\n", "
Smithsonian Data Science Lab and United States Holocaust Memorial Museum
\n", "\n", "
March 2022
" ] }, { "cell_type": "markdown", "id": "characteristic-mills", "metadata": {}, "source": [ "## Covered in this Chapter" ] }, { "cell_type": "markdown", "id": "favorite-actor", "metadata": {}, "source": [ "1) The .tokens file
\n", "2) The .entities file
\n", "3) The .quotes file
\n", "4) The .supersense file
\n", "5) The .book file
\n", "6) The .book.html file
" ] }, { "cell_type": "markdown", "id": "vocational-tomorrow", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "id": "ordinary-dealing", "metadata": {}, "source": [ "In the last chapter, we looked at how to create a BookNLP pipeline and process a book or longer document. The goal of that process was to generate a collection of files within an output directory. In our case, we stored our files in \"data/harry_potter\". Within this repo, you will be able to examine the output files, but rather than making you switch between this textbook and the repo, I thought I would present the files in this chapter as data.\n", "\n", "The output from the BookNLP pipeline consists of three types of files: TSV files (.tokens, .entities, .quotes, .supersense), a JSON file (.book), and an HTML file (.book.html). A good way to think about a TSV is as a CSV where tabs, rather than commas, are used to separate tabular data. Essentially, this is a dataset that can be viewed and analyzed in Excel. A JSON file is a bit different. It stores data as you would expect to see it in Python, e.g. dictionaries, lists, etc.\n", "\n", "The goal of this chapter is to explain what each of these files contains so that in the next few chapters, we can start extracting important data from them." ] }, { "cell_type": "markdown", "id": "detected-disney", "metadata": {}, "source": [ "## The .tokens File" ] }, { "cell_type": "markdown", "id": "latest-console", "metadata": {}, "source": [ "The very first file that we should analyze is the .tokens file. Essentially, this is a tab-separated value (TSV) file that contains one token per line, along with some important data about each token. A token is a word or punctuation mark within a text. 
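
Because the file is plain tab-separated text, you can inspect it with nothing more than Python's built-in csv module. Below is a minimal sketch that parses a small inline sample standing in for the top of a real .tokens file, so only the column names are taken from the actual output:

```python
import csv
import io

# A tiny inline sample standing in for the top of a real .tokens file.
sample = (
    'paragraph_ID\tsentence_ID\tword\tlemma\tPOS_tag\n'
    '0\t0\tMr.\tMr.\tPROPN\n'
    '0\t0\tand\tand\tCCONJ\n'
)

# DictReader maps each row to a dict keyed by the header line.
reader = csv.DictReader(io.StringIO(sample), delimiter='\t')
rows = list(reader)
print(rows[0]['word'], rows[0]['POS_tag'])
```
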
The very first line of the file will look something like this:\n", "\n", "paragraph_ID\tsentence_ID\ttoken_ID_within_sentence\ttoken_ID_within_document\tword\tlemma\tbyte_onset\tbyte_offset\tPOS_tag\tfine_POS_tag\tdependency_relation\tsyntactic_head_ID\tevent\n", "\n", "As this can be a bit difficult to parse, I am going to load it up as a TSV file through Pandas so we can analyze it a bit better." ] }, { "cell_type": "code", "execution_count": 24, "id": "informed-playback", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
paragraph_IDsentence_IDtoken_ID_within_sentencetoken_ID_within_documentwordlemmabyte_onsetbyte_offsetPOS_tagfine_POS_tagdependency_relationsyntactic_head_IDevent
00000Mr.Mr.03PROPNNNPnsubj12O
10011andand47CCONJCCcc0O
20022Mrs.Mrs.812PROPNNNPcompound3O
30033DursleyDursley1320PROPNNNPconj0O
40044,,2021PUNCT,punct0O
..........................................
99251299568851099251DudleyDudley438929438935PROPNNNPpobj99250O
99252299568851199252thisthis438936438940DETDTdet99253O
99253299568851299253summersummer438941438947NOUNNNnpadvmod99245O
99254299568851399254........438947438951PUNCT.punct99243O
99255299568851499255\\t438951438952PUNCT''punct99243ONaN
\n", "

99256 rows × 13 columns

\n", "
" ], "text/plain": [ " paragraph_ID sentence_ID token_ID_within_sentence \\\n", "0 0 0 0 \n", "1 0 0 1 \n", "2 0 0 2 \n", "3 0 0 3 \n", "4 0 0 4 \n", "... ... ... ... \n", "99251 2995 6885 10 \n", "99252 2995 6885 11 \n", "99253 2995 6885 12 \n", "99254 2995 6885 13 \n", "99255 2995 6885 14 \n", "\n", " token_ID_within_document word lemma byte_onset byte_offset \\\n", "0 0 Mr. Mr. 0 3 \n", "1 1 and and 4 7 \n", "2 2 Mrs. Mrs. 8 12 \n", "3 3 Dursley Dursley 13 20 \n", "4 4 , , 20 21 \n", "... ... ... ... ... ... \n", "99251 99251 Dudley Dudley 438929 438935 \n", "99252 99252 this this 438936 438940 \n", "99253 99253 summer summer 438941 438947 \n", "99254 99254 .... .... 438947 438951 \n", "99255 99255 \\t 438951 438952 PUNCT \n", "\n", " POS_tag fine_POS_tag dependency_relation syntactic_head_ID event \n", "0 PROPN NNP nsubj 12 O \n", "1 CCONJ CC cc 0 O \n", "2 PROPN NNP compound 3 O \n", "3 PROPN NNP conj 0 O \n", "4 PUNCT , punct 0 O \n", "... ... ... ... ... ... \n", "99251 PROPN NNP pobj 99250 O \n", "99252 DET DT det 99253 O \n", "99253 NOUN NN npadvmod 99245 O \n", "99254 PUNCT . punct 99243 O \n", "99255 '' punct 99243 O NaN \n", "\n", "[99256 rows x 13 columns]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv(\"data/harry_potter/harry_potter.tokens\", delimiter=\"\\t\")\n", "df" ] }, { "cell_type": "markdown", "id": "wireless-discharge", "metadata": {}, "source": [ "If you don't know what the block of code above does, please do not be concerned. We will not be dealing with Pandas in this textbook. If you are interested in Pandas, though, I have a free textbook on it entitled Introduction to Pandas.\n", "\n", "As you can see from the output above, we have something that looks like Excel, or tabular data. 
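
Even without pandas, tabular data like this is easy to summarize once it is in plain Python structures. Here is a minimal sketch using a few (word, POS_tag) pairs copied from the rows above:

```python
from collections import Counter

# A few (word, POS_tag) pairs copied from the rows above.
tokens = [
    ('Mr.', 'PROPN'),
    ('and', 'CCONJ'),
    ('Mrs.', 'PROPN'),
    ('Dursley', 'PROPN'),
    (',', 'PUNCT'),
]

# Tally how often each part-of-speech tag appears.
pos_counts = Counter(pos for word, pos in tokens)
print(pos_counts['PROPN'])
```

The same tally over all 99,256 rows would give a part-of-speech profile for the entire book.
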
Let's break this down a bit and explain what each column represents:\n", "\n", "- paragraph_ID - the index of the paragraph, with the first paragraph indexed as 0 and moving up, in our case, to 2995.\n", "- sentence_ID - same as the paragraph_ID, but with sentences\n", "- token_ID_within_sentence - same as the two above, but with a token count by sentence, resetting with each sentence.\n", "- token_ID_within_document - same as above, but where tokens keep going up in value throughout the whole document, starting at 0 and ending, in our case, at 99255.\n", "- word - this is the raw text of the word\n", "- lemma - this is the root of the word\n", "- byte_onset - think of this as the start character index\n", "- byte_offset - think of this as the ending character index\n", "- POS_tag - the Part of Speech (based on spaCy)\n", "- fine_POS_tag - a more granular understanding of the Part of Speech\n", "- dependency_relation - this is equivalent to spaCy's dep tag.\n", "- syntactic_head_ID - this points to the head of the current token so that you can understand how a token relates to other words in the sentence\n", "- event - this tells you whether the token is a trigger for an EVENT or not. You will see O, EVENT, or NaN here." ] }, { "cell_type": "markdown", "id": "offshore-peripheral", "metadata": {}, "source": [ "## The .entities File" ] }, { "cell_type": "markdown", "id": "advisory-possibility", "metadata": {}, "source": [ "Let's do the same thing with the .entities file now!" ] }, { "cell_type": "code", "execution_count": 26, "id": "crucial-alpha", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
COREFstart_tokenend_tokenpropcattext
036400PROPPERMr.
19223PROPPERMrs. Dursley
21910PROPFACPrivet Drive
33651717PRONPERthey
43662323PRONPERyou
.....................
1585823559922799227PRONPERThey
1585923519923199231PRONPERwe
158604419923999239NOMFAChome
15861989924199241PRONPERI
15862959925199251PROPPERDudley
\n", "

15863 rows × 6 columns

\n", "
" ], "text/plain": [ " COREF start_token end_token prop cat text\n", "0 364 0 0 PROP PER Mr.\n", "1 92 2 3 PROP PER Mrs. Dursley\n", "2 1 9 10 PROP FAC Privet Drive\n", "3 365 17 17 PRON PER they\n", "4 366 23 23 PRON PER you\n", "... ... ... ... ... ... ...\n", "15858 2355 99227 99227 PRON PER They\n", "15859 2351 99231 99231 PRON PER we\n", "15860 441 99239 99239 NOM FAC home\n", "15861 98 99241 99241 PRON PER I\n", "15862 95 99251 99251 PROP PER Dudley\n", "\n", "[15863 rows x 6 columns]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_entities = pd.read_csv(\"data/harry_potter/harry_potter.entities\", delimiter=\"\\t\")\n", "df_entities" ] }, { "cell_type": "markdown", "id": "lonely-lottery", "metadata": {}, "source": [ "If you get an error that looks like this:\n", "```{image} ./images/booknlp_error.PNG\n", ":alt: jupyter_org\n", ":class: bg-primary\n", ":width: 500px\n", ":align: center\n", "```\n", "\n", "Fear not! This happens sometimes when the .entities file is corrupted with something like a \" mark. You simply need to go into the file and remove the character that is causing the error. Use the row number as an indicator of where to go in the text file. Remember, add one row because row 1 is the header data.\n", "\n", "Before:\n", "```{image} ./images/booknlp_solution.PNG\n", ":alt: jupyter_org\n", ":class: bg-primary\n", ":width: 500px\n", ":align: center\n", "```\n", "\n", "After:\n", "```{image} ./images/booknlp_solution2.PNG\n", ":alt: jupyter_org\n", ":class: bg-primary\n", ":width: 500px\n", ":align: center\n", "```" ] }, { "cell_type": "markdown", "id": "incoming-defensive", "metadata": {}, "source": [ "Let's return to our data." ] }, { "cell_type": "code", "execution_count": 27, "id": "joined-halloween", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
COREFstart_tokenend_tokenpropcattext
036400PROPPERMr.
19223PROPPERMrs. Dursley
21910PROPFACPrivet Drive
33651717PRONPERthey
43662323PRONPERyou
.....................
1585823559922799227PRONPERThey
1585923519923199231PRONPERwe
158604419923999239NOMFAChome
15861989924199241PRONPERI
15862959925199251PROPPERDudley
\n", "

15863 rows × 6 columns

\n", "
" ], "text/plain": [ " COREF start_token end_token prop cat text\n", "0 364 0 0 PROP PER Mr.\n", "1 92 2 3 PROP PER Mrs. Dursley\n", "2 1 9 10 PROP FAC Privet Drive\n", "3 365 17 17 PRON PER they\n", "4 366 23 23 PRON PER you\n", "... ... ... ... ... ... ...\n", "15858 2355 99227 99227 PRON PER They\n", "15859 2351 99231 99231 PRON PER we\n", "15860 441 99239 99239 NOM FAC home\n", "15861 98 99241 99241 PRON PER I\n", "15862 95 99251 99251 PROP PER Dudley\n", "\n", "[15863 rows x 6 columns]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_entities" ] }, { "cell_type": "markdown", "id": "anonymous-cartoon", "metadata": {}, "source": [ "Here we see all the entities found within the text. In our case, we have 15,863 entities in the entire book. It is important to remember that some of these are, of course, errors. Before we get to that, though, let's break down the columns.\n", "\n", "- COREF - This is a coreference id that serves as a unique identifier for the entity. This number will be used elsewhere to reference that entity, such as in the .quotes file, to link the speaker with the block of text. It should be noted that coreference resolution is one of the more challenging problems in NLP. Expect this to be nowhere near 90% accurate; rather, expect accuracy around the 70% range, particularly when pronouns are used for the entity.\n", "- start_token - this is the start token of the entity name\n", "- end_token - this is the end token of the entity name. Single token entities will have the same start and end, while multi-word tokens (MWTs) will increase by one for each additional token\n", "- prop - this will tell you whether the entity is a PROP (proper noun phrase), PRON (pronoun), or NOM (common noun phrase)\n", "- cat - the entity type (in spaCy terms). BookNLP includes a few other useful categories, notably VEH for vehicle.\n", "- text - this is the raw text that corresponds to the entity." 
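, "
Since every row carries a coreference id, collecting all the mentions of one entity is a simple grouping job. A minimal sketch using a few (COREF, text) pairs adapted from the output above:

```python
from collections import defaultdict

# A few (COREF, text) pairs adapted from the entities output above.
entities = [
    (364, 'Mr.'),
    (92, 'Mrs. Dursley'),
    (1, 'Privet Drive'),
    (95, 'Dudley'),
    (92, 'Mrs. Dursley'),
]

# Group every mention string under its coreference id.
mentions = defaultdict(list)
for coref_id, text in entities:
    mentions[coref_id].append(text)

print(mentions[92])
```

The same loop over the full set of rows gives you every surface form used for each entity in the book.
"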
] }, { "cell_type": "markdown", "id": "martial-trinidad", "metadata": {}, "source": [ "## The .quotes File" ] }, { "cell_type": "markdown", "id": "municipal-puzzle", "metadata": {}, "source": [ "The .quotes file will contain all the quotes in the book. Let's take a look at this data like we did above." ] }, { "cell_type": "code", "execution_count": 31, "id": "british-pledge", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
quote_startquote_endmention_startmention_endmention_phrasechar_idquote
0434438443443he93Little tyke ,
11089110810851085they417The Potters , that 's right , that 's what I ...
21343134613471347he93Sorry ,
31416146014051405he435Do n't be sorry , my dear sir , for nothing c...
41603160616081609Mr. Dursley93Shoo !
........................
232299133991469914799147He119Hurry up , boy , we have n't got all day .
232399163991729916199161Hermione220See you over the summer , then .
232499173991849918699186Hermione220Hope you have -- er -- a good holiday ,
232599202992089921099210Harry98Oh , I will ,
232699226992559921099210Harry98They do n't know we 're not allowed to use ma...
\n", "

2327 rows × 7 columns

\n", "
" ], "text/plain": [ " quote_start quote_end mention_start mention_end mention_phrase \\\n", "0 434 438 443 443 he \n", "1 1089 1108 1085 1085 they \n", "2 1343 1346 1347 1347 he \n", "3 1416 1460 1405 1405 he \n", "4 1603 1606 1608 1609 Mr. Dursley \n", "... ... ... ... ... ... \n", "2322 99133 99146 99147 99147 He \n", "2323 99163 99172 99161 99161 Hermione \n", "2324 99173 99184 99186 99186 Hermione \n", "2325 99202 99208 99210 99210 Harry \n", "2326 99226 99255 99210 99210 Harry \n", "\n", " char_id quote \n", "0 93 Little tyke , \n", "1 417 The Potters , that 's right , that 's what I ... \n", "2 93 Sorry , \n", "3 435 Do n't be sorry , my dear sir , for nothing c... \n", "4 93 Shoo ! \n", "... ... ... \n", "2322 119 Hurry up , boy , we have n't got all day . \n", "2323 220 See you over the summer , then . \n", "2324 220 Hope you have -- er -- a good holiday , \n", "2325 98 Oh , I will , \n", "2326 98 They do n't know we 're not allowed to use ma... \n", "\n", "[2327 rows x 7 columns]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_quotes = pd.read_csv(\"data/harry_potter/harry_potter.quotes\", delimiter=\"\\t\")\n", "df_quotes" ] }, { "cell_type": "markdown", "id": "presidential-courtesy", "metadata": {}, "source": [ "In our case, we have 2,327 quotes in the entire book. Each quote contains some important metadata:\n", "\n", "- quote_start - the start token of the quote\n", "- quote_end - the end token of the quote\n", "- mention_start - this is the start token of the speaker entity\n", "- mention_end - this is the end token of the speaker entity\n", "- mention_phrase - this is the raw text of the speaker mention\n", "- char_id - this will be the unique identifier we saw above in the .entities file so that you can perform COREF and find all dialogues for a single character. Remember, there WILL LIKELY BE ERRORS here. Sometimes you may need to manually align two entity ids as a single character (as we will see)\n", "- quote - this is the raw text of the quote." 
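, "
Because char_id links each quote back to the .entities file, pulling one character's dialogue is just a filter. A minimal sketch using a few (char_id, quote) pairs adapted from the output above:

```python
# A few (char_id, quote) pairs adapted from the quotes output above.
quotes = [
    (93, 'Little tyke ,'),
    (93, 'Sorry ,'),
    (98, 'Oh , I will ,'),
]

# Keep only the quotes attributed to character 93 (Mr. Dursley).
dursley_lines = [q for cid, q in quotes if cid == 93]
print(dursley_lines)
```
"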
] }, { "cell_type": "markdown", "id": "italic-cornwall", "metadata": {}, "source": [ "## The .supersense file" ] }, { "cell_type": "markdown", "id": "involved-swiss", "metadata": {}, "source": [ "The final TSV file that we have is the .supersense file. This is something that I think is quite unique to BookNLP and an absolute delight to have. Here we have all supersense text found. A good way to think about supersense is as a more broadly defined entities file. Here, we not only have entities, like people, places, etc, but also things like \"perception\"." ] }, { "cell_type": "code", "execution_count": 30, "id": "special-register", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
start_tokenend_tokensupersense_categorytext
000noun.personMr.
123noun.personMrs. Dursley
266noun.quantitynumber
377noun.quantityfour
4910noun.locationPrivet Drive
...............
293139923999239noun.locationhome
293149924599245verb.perceptionhave
293159924999249noun.actfun
293169925199251noun.personDudley
293179925399253noun.timesummer
\n", "

29318 rows × 4 columns

\n", "
" ], "text/plain": [ " start_token end_token supersense_category text\n", "0 0 0 noun.person Mr.\n", "1 2 3 noun.person Mrs. Dursley\n", "2 6 6 noun.quantity number\n", "3 7 7 noun.quantity four\n", "4 9 10 noun.location Privet Drive\n", "... ... ... ... ...\n", "29313 99239 99239 noun.location home\n", "29314 99245 99245 verb.perception have\n", "29315 99249 99249 noun.act fun\n", "29316 99251 99251 noun.person Dudley\n", "29317 99253 99253 noun.time summer\n", "\n", "[29318 rows x 4 columns]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_supersense = pd.read_csv(\"data/harry_potter/harry_potter.supersense\", delimiter=\"\\t\")\n", "df_supersense" ] }, { "cell_type": "markdown", "id": "dominant-criticism", "metadata": {}, "source": [ "We can see that we have 29,318 different supersense items with four pieces of data:\n", "\n", "- start_token - this is the start token for the supersense text\n", "- end_token - this is the end token for the supersense text\n", "- supersense_category - this is the part of speech and category to which the supersense belongs\n", "- text - this is the raw text of the supersense" ] }, { "cell_type": "markdown", "id": "horizontal-piano", "metadata": {}, "source": [ "## The .book File" ] }, { "cell_type": "markdown", "id": "orange-tracker", "metadata": {}, "source": [ "Now that we have looked at all the TSV files, let's take a look at the .book file. This is a large JSON file that contains information structured around the characters. In the next few chapters, we will learn a lot more about this file, but for now, let's explore how it is structured." 
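, "
Before we open the real file, it can help to see its shape in miniature. The sketch below hand-builds a dictionary that mirrors the .book layout; the values are made up for illustration:

```python
# A miniature stand-in mirroring the shape of a .book file
# (all values here are invented for illustration).
book_data = {
    'characters': [
        {
            'id': 0,
            'count': 3,
            'g': {'argmax': 'he/him/his'},
            'mentions': {
                'proper': [{'c': 3, 'n': 'Example'}],
                'common': [],
                'pronoun': [],
            },
        },
    ],
}

first = book_data['characters'][0]
print(first['count'], first['mentions']['proper'][0]['n'])
```
"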
] }, { "cell_type": "code", "execution_count": 32, "id": "biological-opera", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['characters'])" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import json\n", "\n", "with open(\"data/harry_potter/harry_potter.book\", \"r\") as f:\n", " book_data = json.load(f)\n", "book_data.keys()" ] }, { "cell_type": "markdown", "id": "brutal-orleans", "metadata": {}, "source": [ "It is a giant dictionary with one key: characters. The value of characters is a list. Let's check out its length." ] }, { "cell_type": "code", "execution_count": 33, "id": "national-numbers", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "723" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(book_data[\"characters\"])" ] }, { "cell_type": "markdown", "id": "understanding-continuity", "metadata": {}, "source": [ "So, we have 723 unique characters throughout the book. Again, expect errors here. For each character, we have a dictionary with 8 keys:" ] }, { "cell_type": "code", "execution_count": 17, "id": "balanced-machinery", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['agent', 'patient', 'mod', 'poss', 'id', 'g', 'count', 'mentions'])" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "book_data[\"characters\"][0].keys()" ] }, { "cell_type": "markdown", "id": "academic-volleyball", "metadata": {}, "source": [ "These keys are as follows:\n", "\n", "- agent - actions that the character does\n", "- patient - actions done to that character\n", "- mod - adjectives that describe them in the text\n", "- poss - things the entity has (very broadly defined), e.g. relatives like aunt, uncle; or parts of the body, e.g. head, back, etc. 
\n", "- id - their unique id (as seen above)\n", "- g - analysis about gender pronouns used\n", "- count - number of times the entity appears\n", "- mentions - how the character is referenced" ] }, { "cell_type": "code", "execution_count": 47, "id": "informational-venice", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['proper', 'common', 'pronoun'])" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "book_data[\"characters\"][0][\"agent\"][:1]\n", "book_data[\"characters\"][0][\"patient\"][:1]\n", "book_data[\"characters\"][0][\"mod\"][:1]\n", "book_data[\"characters\"][0][\"poss\"][:1]\n", "book_data[\"characters\"][0][\"id\"]\n", "book_data[\"characters\"][0][\"g\"]\n", "book_data[\"characters\"][0][\"count\"]\n", "book_data[\"characters\"][0][\"mentions\"].keys()" ] }, { "cell_type": "code", "execution_count": 49, "id": "bound-aaron", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'w': 'name', 'i': 1206},\n", " {'w': 'older', 'i': 4370},\n", " {'w': 'famous', 'i': 4423},\n", " {'w': 'ready', 'i': 4533},\n", " {'w': 'special', 'i': 5645},\n", " {'w': 'famous', 'i': 5651},\n", " {'w': 'asleep', 'i': 5935},\n", " {'w': 'fast', 'i': 6318},\n", " {'w': 'small', 'i': 6338},\n", " {'w': 'skinny', 'i': 6340}]" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "book_data[\"characters\"][0][\"mod\"][:10]" ] }, { "cell_type": "code", "execution_count": 50, "id": "exact-apache", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'w': 'aunt', 'i': 4356},\n", " {'w': 'uncle', 'i': 4358},\n", " {'w': 'name', 'i': 4461},\n", " {'w': 'blankets', 'i': 5622},\n", " {'w': 'cousin', 'i': 5698},\n", " {'w': 'Petunia', 'i': 5947},\n", " {'w': 'aunt', 'i': 5981},\n", " {'w': 'back', 'i': 6020},\n", " {'w': 'aunt', 'i': 6062},\n", " {'w': 'aunt', 'i': 6133}]" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"book_data[\"characters\"][0][\"poss\"][:10]" ] }, { "cell_type": "code", "execution_count": 51, "id": "involved-expansion", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "98" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "book_data[\"characters\"][0][\"id\"]" ] }, { "cell_type": "markdown", "id": "organic-observer", "metadata": {}, "source": [ "For the g category, we see a few different keys:\n", "\n", "- inference - the pronouns for the entity in order of highest frequency to lowest\n", "- argmax - the likely pronoun/gender for the entity\n", "- max - the degree to which that pronoun set is used compared to others, e.g. the percentage\n", "- total (not entirely sure about this)" ] }, { "cell_type": "code", "execution_count": 52, "id": "polished-judges", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'inference': {'he/him/his': 0.811,\n", " 'she/her': 0.112,\n", " 'they/them/their': 0.077,\n", " 'xe/xem/xyr/xir': 0.0,\n", " 'ze/zem/zir/hir': 0.0},\n", " 'argmax': 'he/him/his',\n", " 'max': 0.811,\n", " 'total': 200311.834}" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "book_data[\"characters\"][0][\"g\"]" ] }, { "cell_type": "code", "execution_count": 53, "id": "scenic-sensitivity", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2005" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "book_data[\"characters\"][0][\"count\"]" ] }, { "cell_type": "markdown", "id": "small-sister", "metadata": {}, "source": [ "For mentions, we have three special keys:\n", "\n", "- proper - the way they are referenced as proper nouns\n", "- common - informal names\n", "- pronoun - the pronouns used to refer to them in prose and dialogue" ] }, { "cell_type": "code", "execution_count": 54, "id": "arabic-chrome", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['proper', 'common', 'pronoun'])" ] }, 
"execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "book_data[\"characters\"][0][\"mentions\"].keys()" ] }, { "cell_type": "code", "execution_count": 55, "id": "measured-desert", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'c': 664, 'n': 'Harry'},\n", " {'c': 46, 'n': 'Potter'},\n", " {'c': 23, 'n': 'Harry Potter'},\n", " {'c': 11, 'n': 'Mr. Potter'},\n", " {'c': 2, 'n': 'Mr. Harry Potter'},\n", " {'c': 1, 'n': 'Harry Hunting'},\n", " {'c': 1, 'n': 'Cokeworth Harry'},\n", " {'c': 1, 'n': 'Both Harry'},\n", " {'c': 1, 'n': 'The Harry Potter'},\n", " {'c': 1, 'n': 'HARRY POTTER'},\n", " {'c': 1, 'n': 'Even Harry'},\n", " {'c': 1, 'n': 'POTTER'},\n", " {'c': 1, 'n': 'the famous Harry Potter'}]" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "book_data[\"characters\"][0][\"mentions\"][\"proper\"]" ] }, { "cell_type": "code", "execution_count": 56, "id": "korean-yahoo", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "book_data[\"characters\"][0][\"mentions\"][\"common\"]" ] }, { "cell_type": "code", "execution_count": 57, "id": "framed-elevation", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'c': 303, 'n': 'he'},\n", " {'c': 217, 'n': 'his'},\n", " {'c': 172, 'n': 'you'},\n", " {'c': 144, 'n': 'He'},\n", " {'c': 107, 'n': 'him'},\n", " {'c': 99, 'n': 'I'},\n", " {'c': 34, 'n': 'me'},\n", " {'c': 30, 'n': 'your'},\n", " {'c': 27, 'n': 'yeh'},\n", " {'c': 27, 'n': 'You'},\n", " {'c': 18, 'n': 'yer'},\n", " {'c': 16, 'n': 'himself'},\n", " {'c': 14, 'n': 'my'},\n", " {'c': 12, 'n': 'His'},\n", " {'c': 5, 'n': 'Your'},\n", " {'c': 3, 'n': 'Yeh'},\n", " {'c': 3, 'n': 'Yer'},\n", " {'c': 3, 'n': 'My'},\n", " {'c': 2, 'n': \"yeh've\"},\n", " {'c': 2, 'n': \"yeh'd\"},\n", " {'c': 2, 'n': 'ter'},\n", " {'c': 2, 'n': 'myself'},\n", " {'c': 2, 'n': 
'yourself'},\n", " {'c': 1, 'n': 'YOU'},\n", " {'c': 1, 'n': 'mine'},\n", " {'c': 1, 'n': 'yours'},\n", " {'c': 1, 'n': \"Yeh'd\"},\n", " {'c': 1, 'n': 'yerself'},\n", " {'c': 1, 'n': \"Yeh've\"},\n", " {'c': 1, 'n': \"yeh'll\"}]" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "book_data[\"characters\"][0][\"mentions\"][\"pronoun\"]" ] }, { "cell_type": "markdown", "id": "abandoned-passion", "metadata": {}, "source": [ "## The .book.html File" ] }, { "cell_type": "markdown", "id": "smaller-table", "metadata": {}, "source": [ "The final file output by BookNLP is the .book.html file. This is a nicely organized, easy-to-read HTML file that should open in your browser. For this file, I am going to be covering it exclusively in the attached video, as there is too much to realistically display in this notebook. If you find the video inaccessible, please let me know and I will add some text here with screenshots as a future update." ] }, { "cell_type": "markdown", "id": "rough-afternoon", "metadata": {}, "source": [ "## Conclusion" ] }, { "cell_type": "markdown", "id": "environmental-newport", "metadata": {}, "source": [ "I hope this chapter has helped you understand the large quantity of data and files produced by the BookNLP pipeline. Getting this data and understanding it is only half the battle. In the coming chapters, we will use what we learned here to gain some valuable insight about the new data that we have generated." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.11" } }, "nbformat": 4, "nbformat_minor": 5 }