{ "cells": [ { "cell_type": "markdown", "id": "identified-insert", "metadata": {}, "source": [ "#
The Output Files
" ] }, { "cell_type": "markdown", "id": "brave-attack", "metadata": {}, "source": [ "
Dr. W.J.B. Mattingly
\n", "\n", "
Smithsonian Data Science Lab and United States Holocaust Memorial Museum
\n", "\n", "
March 2022
" ] }, { "cell_type": "markdown", "id": "characteristic-mills", "metadata": {}, "source": [ "## Covered in this Chapter" ] }, { "cell_type": "markdown", "id": "favorite-actor", "metadata": {}, "source": [ "1) The .tokens file
\n", "2) The .entities file
\n", "3) The .quotes file
\n", "4) The .supersense file
\n", "5) The .book file
\n", "6) The .book.html file
" ] }, { "cell_type": "markdown", "id": "vocational-tomorrow", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "id": "ordinary-dealing", "metadata": {}, "source": [ "In the last chapter, we looked at how to create a BookNLP pipeline and process a book or longer document. The goal of that process was to generate a collection of files within an output directory. In our case, we stored our files in \"data/harry_potter\". Within this repo, you will be able to examine the output files, but rather than making you switch between this textbook and the repo, I thought I would present the files in this chapter as data.\n", "\n", "The output from the BookNLP pipeline consists of three types of files: TSV files (.tokens, .entities, .quotes, .supersense), a JSON file (.book), and an HTML file (.book.html). A good way to think about a TSV is as a CSV where tabs, rather than commas, are used to separate tabular data. Essentially, this is a dataset that can be viewed and analyzed in Excel. A JSON file is a bit different. It stores data as you would expect to see it in Python, e.g. dictionaries, lists, etc.\n", "\n", "The goal of this chapter is to explain what each of these files contains so that in the next few chapters, we can start extracting important data from them." ] }, { "cell_type": "markdown", "id": "detected-disney", "metadata": {}, "source": [ "## The .tokens File" ] }, { "cell_type": "markdown", "id": "latest-console", "metadata": {}, "source": [ "The very first file that we should analyze is the .tokens file. Essentially, this is a tab-separated value (TSV) file that contains one token per line, along with some important data about each token. A token is a word or punctuation mark within a text. 
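
Because the file is plain tab-separated text, you can inspect it with nothing more than Python's built-in csv module. Below is a minimal sketch that parses a small inline sample standing in for the top of a real .tokens file, so only the column names are taken from the actual output:

```python
import csv
import io

# A tiny inline sample standing in for the top of a real .tokens file.
sample = (
    'paragraph_ID\tsentence_ID\tword\tlemma\tPOS_tag\n'
    '0\t0\tMr.\tMr.\tPROPN\n'
    '0\t0\tand\tand\tCCONJ\n'
)

# DictReader maps each row to a dict keyed by the header line.
reader = csv.DictReader(io.StringIO(sample), delimiter='\t')
rows = list(reader)
print(rows[0]['word'], rows[0]['POS_tag'])
```
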
The very first line of the file will look something like this:\n", "\n", "paragraph_ID\tsentence_ID\ttoken_ID_within_sentence\ttoken_ID_within_document\tword\tlemma\tbyte_onset\tbyte_offset\tPOS_tag\tfine_POS_tag\tdependency_relation\tsyntactic_head_ID\tevent\n", "\n", "As this can be a bit difficult to parse, I am going to load it up as a TSV file through Pandas so we can analyze it a bit better." ] }, { "cell_type": "code", "execution_count": 24, "id": "informed-playback", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
paragraph_IDsentence_IDtoken_ID_within_sentencetoken_ID_within_documentwordlemmabyte_onsetbyte_offsetPOS_tagfine_POS_tagdependency_relationsyntactic_head_IDevent
00000Mr.Mr.03PROPNNNPnsubj12O
10011andand47CCONJCCcc0O
20022Mrs.Mrs.812PROPNNNPcompound3O
30033DursleyDursley1320PROPNNNPconj0O
40044,,2021PUNCT,punct0O
..........................................
99251299568851099251DudleyDudley438929438935PROPNNNPpobj99250O
99252299568851199252thisthis438936438940DETDTdet99253O
99253299568851299253summersummer438941438947NOUNNNnpadvmod99245O
99254299568851399254........438947438951PUNCT.punct99243O
99255299568851499255\\t438951438952PUNCT''punct99243ONaN
\n", "

99256 rows × 13 columns

\n", "
" ], "text/plain": [ " paragraph_ID sentence_ID token_ID_within_sentence \\\n", "0 0 0 0 \n", "1 0 0 1 \n", "2 0 0 2 \n", "3 0 0 3 \n", "4 0 0 4 \n", "... ... ... ... \n", "99251 2995 6885 10 \n", "99252 2995 6885 11 \n", "99253 2995 6885 12 \n", "99254 2995 6885 13 \n", "99255 2995 6885 14 \n", "\n", " token_ID_within_document word lemma byte_onset byte_offset \\\n", "0 0 Mr. Mr. 0 3 \n", "1 1 and and 4 7 \n", "2 2 Mrs. Mrs. 8 12 \n", "3 3 Dursley Dursley 13 20 \n", "4 4 , , 20 21 \n", "... ... ... ... ... ... \n", "99251 99251 Dudley Dudley 438929 438935 \n", "99252 99252 this this 438936 438940 \n", "99253 99253 summer summer 438941 438947 \n", "99254 99254 .... .... 438947 438951 \n", "99255 99255 \\t 438951 438952 PUNCT \n", "\n", " POS_tag fine_POS_tag dependency_relation syntactic_head_ID event \n", "0 PROPN NNP nsubj 12 O \n", "1 CCONJ CC cc 0 O \n", "2 PROPN NNP compound 3 O \n", "3 PROPN NNP conj 0 O \n", "4 PUNCT , punct 0 O \n", "... ... ... ... ... ... \n", "99251 PROPN NNP pobj 99250 O \n", "99252 DET DT det 99253 O \n", "99253 NOUN NN npadvmod 99245 O \n", "99254 PUNCT . punct 99243 O \n", "99255 '' punct 99243 O NaN \n", "\n", "[99256 rows x 13 columns]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv(\"data/harry_potter/harry_potter.tokens\", delimiter=\"\\t\")\n", "df" ] }, { "cell_type": "markdown", "id": "wireless-discharge", "metadata": {}, "source": [ "If you don't know what the block of code above does, please do not be concerned. We will not be dealing with Pandas in this textbook. If you are interested in Pandas, though, I have a free textbook on it entitled Introduction to Pandas.\n", "\n", "As you can see from the output above, we have something that looks like Excel, or tabular data. 
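
Even without pandas, tabular data like this is easy to summarize once it is in plain Python structures. Here is a minimal sketch using a few (word, POS_tag) pairs copied from the rows above:

```python
from collections import Counter

# A few (word, POS_tag) pairs copied from the rows above.
tokens = [
    ('Mr.', 'PROPN'),
    ('and', 'CCONJ'),
    ('Mrs.', 'PROPN'),
    ('Dursley', 'PROPN'),
    (',', 'PUNCT'),
]

# Tally how often each part-of-speech tag appears.
pos_counts = Counter(pos for word, pos in tokens)
print(pos_counts['PROPN'])
```

The same tally over all 99,256 rows would give a part-of-speech profile for the entire book.
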
Let's break this down a bit and explain what each column represents:\n", "\n", "- paragraph_ID - the index of the paragraph, with the first paragraph indexed as 0 and moving up, in our case, to 2995.\n", "- sentence_ID - same as the paragraph_ID, but with sentences\n", "- token_ID_within_sentence - same as the two above, but with a token count by sentence, resetting with each sentence.\n", "- token_ID_within_document - same as above, but where tokens keep going up in value throughout the whole document, starting at 0 and ending, in our case, at 99255.\n", "- word - this is the raw text of the word\n", "- lemma - this is the root of the word\n", "- byte_onset - think of this as the start character index\n", "- byte_offset - think of this as the ending character index\n", "- POS_tag - the Part of Speech (based on spaCy)\n", "- fine_POS_tag - a more granular understanding of the Part of Speech\n", "- dependency_relation - this is equivalent to spaCy's dep tag.\n", "- syntactic_head_ID - this points to the head of the current token so that you can understand how a token relates to other words in the sentence\n", "- event - this tells you whether the token is a trigger for an EVENT or not. You will see O, EVENT, or NaN here." ] }, { "cell_type": "markdown", "id": "offshore-peripheral", "metadata": {}, "source": [ "## The .entities File" ] }, { "cell_type": "markdown", "id": "advisory-possibility", "metadata": {}, "source": [ "Let's do the same thing with the .entities file now!" ] }, { "cell_type": "code", "execution_count": 26, "id": "crucial-alpha", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
COREFstart_tokenend_tokenpropcattext
036400PROPPERMr.
19223PROPPERMrs. Dursley
21910PROPFACPrivet Drive
33651717PRONPERthey
43662323PRONPERyou
.....................
1585823559922799227PRONPERThey
1585923519923199231PRONPERwe
158604419923999239NOMFAChome
15861989924199241PRONPERI
15862959925199251PROPPERDudley
\n", "

15863 rows × 6 columns

\n", "
" ], "text/plain": [ " COREF start_token end_token prop cat text\n", "0 364 0 0 PROP PER Mr.\n", "1 92 2 3 PROP PER Mrs. Dursley\n", "2 1 9 10 PROP FAC Privet Drive\n", "3 365 17 17 PRON PER they\n", "4 366 23 23 PRON PER you\n", "... ... ... ... ... ... ...\n", "15858 2355 99227 99227 PRON PER They\n", "15859 2351 99231 99231 PRON PER we\n", "15860 441 99239 99239 NOM FAC home\n", "15861 98 99241 99241 PRON PER I\n", "15862 95 99251 99251 PROP PER Dudley\n", "\n", "[15863 rows x 6 columns]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_entities = pd.read_csv(\"data/harry_potter/harry_potter.entities\", delimiter=\"\\t\")\n", "df_entities" ] }, { "cell_type": "markdown", "id": "lonely-lottery", "metadata": {}, "source": [ "If you get an error that looks like this:\n", "```{image} ./images/booknlp_error.PNG\n", ":alt: jupyter_org\n", ":class: bg-primary\n", ":width: 500px\n", ":align: center\n", "```\n", "\n", "Fear not! This happens sometimes when the .entities file is corrupted with something like a \" mark. You simply need to go into the file and remove the character that is causing the error. Use the row number as an indicator of where to go in the text file. Remember, add one row because row 1 is the header data.\n", "\n", "Before:\n", "```{image} ./images/booknlp_solution.PNG\n", ":alt: jupyter_org\n", ":class: bg-primary\n", ":width: 500px\n", ":align: center\n", "```\n", "\n", "After:\n", "```{image} ./images/booknlp_solution2.PNG\n", ":alt: jupyter_org\n", ":class: bg-primary\n", ":width: 500px\n", ":align: center\n", "```" ] }, { "cell_type": "markdown", "id": "incoming-defensive", "metadata": {}, "source": [ "Let's return to our data." ] }, { "cell_type": "code", "execution_count": 27, "id": "joined-halloween", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
COREFstart_tokenend_tokenpropcattext
036400PROPPERMr.
19223PROPPERMrs. Dursley
21910PROPFACPrivet Drive
33651717PRONPERthey
43662323PRONPERyou
.....................
1585823559922799227PRONPERThey
1585923519923199231PRONPERwe
158604419923999239NOMFAChome
15861989924199241PRONPERI
15862959925199251PROPPERDudley
\n", "

15863 rows × 6 columns

\n", "
" ], "text/plain": [ " COREF start_token end_token prop cat text\n", "0 364 0 0 PROP PER Mr.\n", "1 92 2 3 PROP PER Mrs. Dursley\n", "2 1 9 10 PROP FAC Privet Drive\n", "3 365 17 17 PRON PER they\n", "4 366 23 23 PRON PER you\n", "... ... ... ... ... ... ...\n", "15858 2355 99227 99227 PRON PER They\n", "15859 2351 99231 99231 PRON PER we\n", "15860 441 99239 99239 NOM FAC home\n", "15861 98 99241 99241 PRON PER I\n", "15862 95 99251 99251 PROP PER Dudley\n", "\n", "[15863 rows x 6 columns]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_entities" ] }, { "cell_type": "markdown", "id": "anonymous-cartoon", "metadata": {}, "source": [ "Here we see all the entities found within the text. In our case, we have 15,863 entities in the entire book. It is important to remember that some of these are, of course, errors. Before we get to that, though, let's break down the columns.\n", "\n", "- COREF - This is a coreference id that serves as a unique identifier for the entity. This number will be used elsewhere to reference that entity, such as in the .quotes file, to link the speaker with the block of text. It should be noted that coreference resolution is one of the more challenging problems in NLP. Expect this to be nowhere near 90% accurate; rather, expect accuracy around the 70% range, particularly when pronouns are used for the entity.\n", "- start_token - this is the start token of the entity name\n", "- end_token - this is the end token of the entity name. Single token entities will have the same start and end, while multi-word tokens (MWTs) will increase by one for each additional token\n", "- prop - this will tell you whether the entity is a PROP (proper noun phrase), PRON (pronoun), or NOM (common noun phrase)\n", "- cat - the entity type (in spaCy terms). BookNLP includes a few other useful categories, notably VEH for vehicle.\n", "- text - this is the raw text that corresponds to the entity." 
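, "
Since every row carries a coreference id, collecting all the mentions of one entity is a simple grouping job. A minimal sketch using a few (COREF, text) pairs adapted from the output above:

```python
from collections import defaultdict

# A few (COREF, text) pairs adapted from the entities output above.
entities = [
    (364, 'Mr.'),
    (92, 'Mrs. Dursley'),
    (1, 'Privet Drive'),
    (95, 'Dudley'),
    (92, 'Mrs. Dursley'),
]

# Group every mention string under its coreference id.
mentions = defaultdict(list)
for coref_id, text in entities:
    mentions[coref_id].append(text)

print(mentions[92])
```

The same loop over the full set of rows gives you every surface form used for each entity in the book.
"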
] }, { "cell_type": "markdown", "id": "martial-trinidad", "metadata": {}, "source": [ "## The .quotes File" ] }, { "cell_type": "markdown", "id": "municipal-puzzle", "metadata": {}, "source": [ "The .quotes file will contain all the quotes in the book. Let's take a look at this data like we did above." ] }, { "cell_type": "code", "execution_count": 31, "id": "british-pledge", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
quote_startquote_endmention_startmention_endmention_phrasechar_idquote
0434438443443he93Little tyke ,
11089110810851085they417The Potters , that 's right , that 's what I ...
21343134613471347he93Sorry ,
31416146014051405he435Do n't be sorry , my dear sir , for nothing c...
41603160616081609Mr. Dursley93Shoo !
........................
232299133991469914799147He119Hurry up , boy , we have n't got all day .
232399163991729916199161Hermione220See you over the summer , then .
232499173991849918699186Hermione220Hope you have -- er -- a good holiday ,
232599202992089921099210Harry98Oh , I will ,
232699226992559921099210Harry98They do n't know we 're not allowed to use ma...
\n", "

2327 rows × 7 columns

\n", "
" ], "text/plain": [ " quote_start quote_end mention_start mention_end mention_phrase \\\n", "0 434 438 443 443 he \n", "1 1089 1108 1085 1085 they \n", "2 1343 1346 1347 1347 he \n", "3 1416 1460 1405 1405 he \n", "4 1603 1606 1608 1609 Mr. Dursley \n", "... ... ... ... ... ... \n", "2322 99133 99146 99147 99147 He \n", "2323 99163 99172 99161 99161 Hermione \n", "2324 99173 99184 99186 99186 Hermione \n", "2325 99202 99208 99210 99210 Harry \n", "2326 99226 99255 99210 99210 Harry \n", "\n", " char_id quote \n", "0 93 Little tyke , \n", "1 417 The Potters , that 's right , that 's what I ... \n", "2 93 Sorry , \n", "3 435 Do n't be sorry , my dear sir , for nothing c... \n", "4 93 Shoo ! \n", "... ... ... \n", "2322 119 Hurry up , boy , we have n't got all day . \n", "2323 220 See you over the summer , then . \n", "2324 220 Hope you have -- er -- a good holiday , \n", "2325 98 Oh , I will , \n", "2326 98 They do n't know we 're not allowed to use ma... \n", "\n", "[2327 rows x 7 columns]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_quotes = pd.read_csv(\"data/harry_potter/harry_potter.quotes\", delimiter=\"\\t\")\n", "df_quotes" ] }, { "cell_type": "markdown", "id": "presidential-courtesy", "metadata": {}, "source": [ "In our case, we have 2,327 quotes in the entire book. Each quote contains some important metadata:\n", "\n", "- quote_start - the start token of the quote\n", "- quote_end - the end token of the quote\n", "- mention_start - this is the start token of the speaker entity\n", "- mention_end - this is the end token of the speaker entity\n", "- mention_phrase - this is the raw text of the speaker mention\n", "- char_id - this will be the unique identifier we saw above in the .entities file so that you can perform COREF and find all dialogues for a single character. Remember, there WILL LIKELY BE ERRORS here. Sometimes you may need to manually align two entity ids as a single character (as we will see)\n", "- quote - this is the raw text of the quote." 
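, "
Because char_id links each quote back to the .entities file, pulling one character's dialogue is just a filter. A minimal sketch using a few (char_id, quote) pairs adapted from the output above:

```python
# A few (char_id, quote) pairs adapted from the quotes output above.
quotes = [
    (93, 'Little tyke ,'),
    (93, 'Sorry ,'),
    (98, 'Oh , I will ,'),
]

# Keep only the quotes attributed to character 93 (Mr. Dursley).
dursley_lines = [q for cid, q in quotes if cid == 93]
print(dursley_lines)
```
"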
] }, { "cell_type": "markdown", "id": "italic-cornwall", "metadata": {}, "source": [ "## The .supersense file" ] }, { "cell_type": "markdown", "id": "involved-swiss", "metadata": {}, "source": [ "The final TSV file that we have is the .supersense file. This is something that I think is quite unique to BookNLP and an absolute delight to have. Here we have all supersense text found. A good way to think about supersense is as a more broadly defined entities file. Here, we not only have entities, like people, places, etc, but also things like \"perception\"." ] }, { "cell_type": "code", "execution_count": 30, "id": "special-register", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
start_tokenend_tokensupersense_categorytext
000noun.personMr.
123noun.personMrs. Dursley
266noun.quantitynumber
377noun.quantityfour
4910noun.locationPrivet Drive
...............
293139923999239noun.locationhome
293149924599245verb.perceptionhave
293159924999249noun.actfun
293169925199251noun.personDudley
293179925399253noun.timesummer
\n", "

29318 rows × 4 columns

\n", "
" ], "text/plain": [ " start_token end_token supersense_category text\n", "0 0 0 noun.person Mr.\n", "1 2 3 noun.person Mrs. Dursley\n", "2 6 6 noun.quantity number\n", "3 7 7 noun.quantity four\n", "4 9 10 noun.location Privet Drive\n", "... ... ... ... ...\n", "29313 99239 99239 noun.location home\n", "29314 99245 99245 verb.perception have\n", "29315 99249 99249 noun.act fun\n", "29316 99251 99251 noun.person Dudley\n", "29317 99253 99253 noun.time summer\n", "\n", "[29318 rows x 4 columns]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_supersense = pd.read_csv(\"data/harry_potter/harry_potter.supersense\", delimiter=\"\\t\")\n", "df_supersense" ] }, { "cell_type": "markdown", "id": "dominant-criticism", "metadata": {}, "source": [ "We can see that we have 29,318 different supersense items with four pieces of data:\n", "\n", "- start_token - this is the start token for the supersense text\n", "- end_token - this is the end token for the supersense text\n", "- supersense_category - this is the part of speech and category to which the supersense belongs\n", "- text - this is the raw text of the supersense" ] }, { "cell_type": "markdown", "id": "horizontal-piano", "metadata": {}, "source": [ "## The .book File" ] }, { "cell_type": "markdown", "id": "orange-tracker", "metadata": {}, "source": [ "Now that we have looked at all the TSV files, let's take a look at the .book file. This is a large JSON file that contains information structured around the characters. In the next few chapters, we will learn a lot more about this file, but for now, let's explore how it is structured." 
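, "
Before we open the real file, it can help to see its shape in miniature. The sketch below hand-builds a dictionary that mirrors the .book layout; the values are made up for illustration:

```python
# A miniature stand-in mirroring the shape of a .book file
# (all values here are invented for illustration).
book_data = {
    'characters': [
        {
            'id': 0,
            'count': 3,
            'g': {'argmax': 'he/him/his'},
            'mentions': {
                'proper': [{'c': 3, 'n': 'Example'}],
                'common': [],
                'pronoun': [],
            },
        },
    ],
}

first = book_data['characters'][0]
print(first['count'], first['mentions']['proper'][0]['n'])
```
"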
] }, { "cell_type": "code", "execution_count": 32, "id": "biological-opera", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['characters'])" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import json\n", "\n", "with open(\"data/harry_potter/harry_potter.book\", \"r\") as f:\n", " book_data = json.load(f)\n", "book_data.keys()" ] }, { "cell_type": "markdown", "id": "brutal-orleans", "metadata": {}, "source": [ "It is a giant dictionary with one key: characters. The value of characters is a list. Let's check out its length." ] }, { "cell_type": "code", "execution_count": 33, "id": "national-numbers", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "723" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(book_data[\"characters\"])" ] }, { "cell_type": "markdown", "id": "understanding-continuity", "metadata": {}, "source": [ "So, we have 723 unique characters throughout the book. Again, expect errors here. For each character, we have a dictionary with 8 keys:" ] }, { "cell_type": "code", "execution_count": 17, "id": "balanced-machinery", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['agent', 'patient', 'mod', 'poss', 'id', 'g', 'count', 'mentions'])" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "book_data[\"characters\"][0].keys()" ] }, { "cell_type": "markdown", "id": "academic-volleyball", "metadata": {}, "source": [ "These keys are as follows:\n", "\n", "- agent - actions that the character does\n", "- patient - actions done to that character\n", "- mod - adjectives that describe them in the text\n", "- poss - things the entity has (very broadly defined), e.g. relatives like aunt, uncle; or parts of the body, e.g. head, back, etc. 
\n", "- id - their unique id (as seen above)\n", "- g - analysis about gender pronouns used\n", "- count - number of times the entity appears\n", "- mentions - how the character is referenced" ] }, { "cell_type": "code", "execution_count": 47, "id": "informational-venice", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['proper', 'common', 'pronoun'])" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "book_data[\"characters\"][0][\"agent\"][:1]\n", "book_data[\"characters\"][0][\"patient\"][:1]\n", "book_data[\"characters\"][0][\"mod\"][:1]\n", "book_data[\"characters\"][0][\"poss\"][:1]\n", "book_data[\"characters\"][0][\"id\"]\n", "book_data[\"characters\"][0][\"g\"]\n", "book_data[\"characters\"][0][\"count\"]\n", "book_data[\"characters\"][0][\"mentions\"].keys()" ] }, { "cell_type": "code", "execution_count": 49, "id": "bound-aaron", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'w': 'name', 'i': 1206},\n", " {'w': 'older', 'i': 4370},\n", " {'w': 'famous', 'i': 4423},\n", " {'w': 'ready', 'i': 4533},\n", " {'w': 'special', 'i': 5645},\n", " {'w': 'famous', 'i': 5651},\n", " {'w': 'asleep', 'i': 5935},\n", " {'w': 'fast', 'i': 6318},\n", " {'w': 'small', 'i': 6338},\n", " {'w': 'skinny', 'i': 6340}]" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "book_data[\"characters\"][0][\"mod\"][:10]" ] }, { "cell_type": "code", "execution_count": 50, "id": "exact-apache", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'w': 'aunt', 'i': 4356},\n", " {'w': 'uncle', 'i': 4358},\n", " {'w': 'name', 'i': 4461},\n", " {'w': 'blankets', 'i': 5622},\n", " {'w': 'cousin', 'i': 5698},\n", " {'w': 'Petunia', 'i': 5947},\n", " {'w': 'aunt', 'i': 5981},\n", " {'w': 'back', 'i': 6020},\n", " {'w': 'aunt', 'i': 6062},\n", " {'w': 'aunt', 'i': 6133}]" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"book_data[\"characters\"][0][\"poss\"][:10]" ] }, { "cell_type": "code", "execution_count": 51, "id": "involved-expansion", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "98" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "book_data[\"characters\"][0][\"id\"]" ] }, { "cell_type": "markdown", "id": "organic-observer", "metadata": {}, "source": [ "For the g category, we see a few different keys:\n", "\n", "- inference - the pronouns for the entity in order of highest frequency to lowest\n", "- argmax - the likely pronoun/gender for the entity\n", "- max - the degree to which that pronoun set is used compared to others, e.g. the percentage\n", "- total (not entirely sure about this)" ] }, { "cell_type": "code", "execution_count": 52, "id": "polished-judges", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'inference': {'he/him/his': 0.811,\n", " 'she/her': 0.112,\n", " 'they/them/their': 0.077,\n", " 'xe/xem/xyr/xir': 0.0,\n", " 'ze/zem/zir/hir': 0.0},\n", " 'argmax': 'he/him/his',\n", " 'max': 0.811,\n", " 'total': 200311.834}" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "book_data[\"characters\"][0][\"g\"]" ] }, { "cell_type": "code", "execution_count": 53, "id": "scenic-sensitivity", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2005" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "book_data[\"characters\"][0][\"count\"]" ] }, { "cell_type": "markdown", "id": "small-sister", "metadata": {}, "source": [ "For mentions, we have three special keys:\n", "\n", "- proper - the way they are referenced as proper nouns\n", "- common - informal names\n", "- pronoun - the pronouns used to refer to them in prose and dialogue" ] }, { "cell_type": "code", "execution_count": 54, "id": "arabic-chrome", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['proper', 'common', 'pronoun'])" ] }, 
"execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "book_data[\"characters\"][0][\"mentions\"].keys()" ] }, { "cell_type": "code", "execution_count": 55, "id": "measured-desert", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'c': 664, 'n': 'Harry'},\n", " {'c': 46, 'n': 'Potter'},\n", " {'c': 23, 'n': 'Harry Potter'},\n", " {'c': 11, 'n': 'Mr. Potter'},\n", " {'c': 2, 'n': 'Mr. Harry Potter'},\n", " {'c': 1, 'n': 'Harry Hunting'},\n", " {'c': 1, 'n': 'Cokeworth Harry'},\n", " {'c': 1, 'n': 'Both Harry'},\n", " {'c': 1, 'n': 'The Harry Potter'},\n", " {'c': 1, 'n': 'HARRY POTTER'},\n", " {'c': 1, 'n': 'Even Harry'},\n", " {'c': 1, 'n': 'POTTER'},\n", " {'c': 1, 'n': 'the famous Harry Potter'}]" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "book_data[\"characters\"][0][\"mentions\"][\"proper\"]" ] }, { "cell_type": "code", "execution_count": 56, "id": "korean-yahoo", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "book_data[\"characters\"][0][\"mentions\"][\"common\"]" ] }, { "cell_type": "code", "execution_count": 57, "id": "framed-elevation", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'c': 303, 'n': 'he'},\n", " {'c': 217, 'n': 'his'},\n", " {'c': 172, 'n': 'you'},\n", " {'c': 144, 'n': 'He'},\n", " {'c': 107, 'n': 'him'},\n", " {'c': 99, 'n': 'I'},\n", " {'c': 34, 'n': 'me'},\n", " {'c': 30, 'n': 'your'},\n", " {'c': 27, 'n': 'yeh'},\n", " {'c': 27, 'n': 'You'},\n", " {'c': 18, 'n': 'yer'},\n", " {'c': 16, 'n': 'himself'},\n", " {'c': 14, 'n': 'my'},\n", " {'c': 12, 'n': 'His'},\n", " {'c': 5, 'n': 'Your'},\n", " {'c': 3, 'n': 'Yeh'},\n", " {'c': 3, 'n': 'Yer'},\n", " {'c': 3, 'n': 'My'},\n", " {'c': 2, 'n': \"yeh've\"},\n", " {'c': 2, 'n': \"yeh'd\"},\n", " {'c': 2, 'n': 'ter'},\n", " {'c': 2, 'n': 'myself'},\n", " {'c': 2, 'n': 
'yourself'},\n", " {'c': 1, 'n': 'YOU'},\n", " {'c': 1, 'n': 'mine'},\n", " {'c': 1, 'n': 'yours'},\n", " {'c': 1, 'n': \"Yeh'd\"},\n", " {'c': 1, 'n': 'yerself'},\n", " {'c': 1, 'n': \"Yeh've\"},\n", " {'c': 1, 'n': \"yeh'll\"}]" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "book_data[\"characters\"][0][\"mentions\"][\"pronoun\"]" ] }, { "cell_type": "markdown", "id": "abandoned-passion", "metadata": {}, "source": [ "## The .book.html File" ] }, { "cell_type": "markdown", "id": "smaller-table", "metadata": {}, "source": [ "The final file output by BookNLP is the .book.html file. This is a nicely organized, easy-to-read HTML file that should open in your browser. For this file, I am going to be covering it exclusively in the attached video, as there is too much to realistically display in this notebook. If you find the video inaccessible, please let me know and I will add some text here with screenshots as a future update." ] }, { "cell_type": "markdown", "id": "rough-afternoon", "metadata": {}, "source": [ "## Conclusion" ] }, { "cell_type": "markdown", "id": "environmental-newport", "metadata": {}, "source": [ "I hope this chapter has helped you understand the large quantity of data and files produced by the BookNLP pipeline. Getting this data and understanding it is only half the battle. In the coming chapters, we will use what we learned here to gain some valuable insight about the new data that we have generated." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.11" } }, "nbformat": 4, "nbformat_minor": 5 }