Introduction to BookNLP

Dr. W.J.B. Mattingly
Smithsonian Data Science Lab and United States Holocaust Memorial Museum
March 2022

1.1. Key Concepts in this Notebook

  1. What is BookNLP?

  2. How to Install BookNLP

1.2. About the Author

I am Dr. William Mattingly. I hold a PhD in Medieval History from the University of Kentucky where I explored early medieval social networks. A lot of my research was aided by my ability to code, specifically in Python, for data cleaning and analysis. I was even able to use Python to plot and visualize social networks and data. During the fourth year of my PhD, I used Python to create an app that I could use to plot and analyze these social networks. Currently, I am a Postdoctoral Fellow for the Analysis of Historical Documents at the Smithsonian Institution with a joint appointment at the United States Holocaust Memorial Museum. In both institutions, I use Python, machine learning, and natural language processing (NLP) to analyze historical texts in large quantities to generate new insights about the documents held in the archives. In all, I have nearly a decade of experience using Python as a historian.

When I first started to explore Python, there were not many available tutorials geared towards humanists and, for that reason, four years ago I started PythonHumanities.com and Python Tutorials for Digital Humanities on YouTube. I geared these resources to humanists who had no prior knowledge about computing or coding. This new JupyterBook is the third iteration of this textbook that brings a lot of the material that first appeared on PythonHumanities.com years ago into a new, more accessible JupyterBook. It will forever remain free to all as will the video lectures embedded in this book.

1.3. About this Textbook

Because this textbook is not peer-reviewed, typos may remain or errors may exist. I openly and freely admit to this. This textbook is community-inspired and I would like it to be community-supported. If you see a mistake, you can click the GitHub logo in the top right corner of the screen to submit a pull request or make a note for edits. I highly encourage this, and I welcome any criticism that improves this textbook for all.


1.4. What is BookNLP?

BookNLP is a new Python library created by David Bamman. It was originally created as a Java library in 2014 under the same name by David Bamman, Ted Underwood, and Noah Smith (see David Bamman, Ted Underwood, and Noah Smith, “A Bayesian Mixed Effects Model of Literary Character,” ACL 2014). While Java is a powerful coding language, both in speed and ease-of-use, not many digital humanists code primarily in Java. I suspect (and I want to emphasize I could be wrong) that the Python library was created to reach the larger Python-coding community, both in general and specifically within the digital humanities. This textbook will deal strictly with the Python library.

In the documentation, Bamman states:

“BookNLP is a natural language processing pipeline that scales to books and other long documents (in English), including:

  • Part-of-speech tagging

  • Dependency parsing

  • Entity recognition

  • Character name clustering (e.g., “Tom”, “Tom Sawyer”, “Mr. Sawyer”, “Thomas Sawyer” -> TOM_SAWYER) and coreference resolution

  • Quotation speaker identification

  • Supersense tagging (e.g., “animal”, “artifact”, “body”, “cognition”, etc.)

  • Event tagging

  • Referential gender inference (TOM_SAWYER -> he/him/his)”

Unlike its predecessor, the Java library, the Python library leverages the Python NLP library spaCy and the Python Transformers library from HuggingFace, rather than Stanford's NLP tools, to perform many of these tasks. In the last few years, spaCy has proven itself a dominant force within the NLP community, outperforming many of its predecessors in accuracy and in its ability to perform at scale. HuggingFace provides a library that allows one to create and leverage large and powerful transformer language models. It also allows users to host models in the cloud that are too large to store on GitHub or other comparable repositories. I have two textbooks on spaCy: one on the library generally and one on named entity recognition specifically.

BookNLP delivers on everything it sets out to do, though it currently only supports English. Because it leverages transformer models, BookNLP’s results can generalize well to non-standard English. I have seen it perform quite well with the South African dialect of English by handling out-of-vocabulary (OOV) words correctly, specifically by labeling Afrikaans words for minivans as vehicles.

Although only available in English as of March 2022, there are clear plans to expand the library to include Spanish, Japanese, Russian, and German, as per their recent NEH grant, awarded in September 2020.

Throughout this book, we will use BookNLP to do what it was intended to do: analyze large fictional works. We will also, however, push it to analyze larger historical documents.
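To give a sense of what working with the library looks like, here is a minimal sketch based on the usage pattern in the BookNLP README as I understand it. The helper functions (build_model_params and run_booknlp) and the file paths are my own hypothetical names for illustration, not part of the library. The import of BookNLP is placed inside the function so that the rest of the sketch can run even before the library is installed.

```python
from pathlib import Path


def build_model_params(steps, model_size="small"):
    """Build the parameter dictionary that BookNLP expects.

    `steps` is a list of pipeline components; BookNLP takes them as a
    single comma-separated string. This helper is hypothetical.
    """
    return {"pipeline": ",".join(steps), "model": model_size}


def run_booknlp(input_file, output_dir, book_id, model_params):
    """Run BookNLP on one text file (hypothetical wrapper)."""
    # Imported here so the sketch can be loaded without booknlp installed.
    from booknlp.booknlp import BookNLP

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    booknlp = BookNLP("en", model_params)
    # Output files are written to output_dir and named after book_id.
    booknlp.process(input_file, output_dir, book_id)


params = build_model_params(["entity", "quote", "supersense", "event", "coref"])
print(params)

# Hypothetical paths; uncomment once BookNLP is installed and the file exists.
# run_booknlp("data/my_book.txt", "output/my_book", "my_book", params)
```

The first run downloads the language models, so expect it to take some time. Later chapters will walk through the output files in detail.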

1.5. Why Books and Larger Documents?

Both the documentation and this textbook emphasize the word large here. The reason? Most language models do not perform well with larger documents. Old RNN-based language models had a hard time remembering earlier words, and while newer transformer-based models, such as BERT, have a larger memory and can look forwards and backwards, the input they can take in is limited to 512 tokens (words and pieces of words). For larger documents, therefore, different solutions (and libraries) should be considered. This is where BookNLP comes in. It also addresses several problems associated with books and larger documents, such as:

  • Characters (and people) are referenced by different names. BookNLP solves this problem with name clustering and coreference resolution. This is a task in NLP where we try to find all uses of a name and correctly assign them to the same identifier, such as Harry, Harry Potter, and Mr. Harry Potter all being the same person, Harry Potter.

  • An adjacent problem is referential gender inference. In a book or larger document, a person will often be referred to by a pronoun, and, as with coreference resolution, each pronoun needs to be linked to the correct antecedent or postcedent. When done successfully, this also allows you to make decisions about the gender of the character or person based on how they are referenced in the text. Because assigning gender is a delicate task, BookNLP fortunately gives users the data for each pronoun used to reference a character and also includes non-binary pronouns.

  • Another issue is quotation speaker identification. This is when we need to understand who is speaking, so that we can correctly link characters to their dialogues. It is possible to do this with spaCy, but it is extremely difficult to do well. BookNLP does a remarkable job of handling this problem and it does it with a fair degree of accuracy, from what I have seen.

  • Event tagging is another key issue with longer documents and books. There are machine learning models that find events, and you can easily cultivate a list of domain-specific events to improve a pipeline, but for BookNLP an event is defined more broadly. From my experience, it is based more around key actions than named events (as it is in named entity recognition). This has a tangential benefit known as triple extraction. In my opinion, it might be a bit better to view BookNLP events through this lens. Triple extraction is when we try to extract three pieces of information, such as (Actor, Action, Recipient) or (Actor, IS, Something). With these types of tuples, we can construct a knowledge tree about a corpus fairly easily. This is a very challenging problem in NLP because triple extraction can be very domain-specific. BookNLP provides a great starting place for triple extraction with its events.
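To make name clustering and triple extraction a bit more concrete, here is a toy sketch in plain Python. It does not use BookNLP and is not how BookNLP works internally. The alias table, the character Hermione, the triples, and the function names are all hypothetical data and helpers that I made up for illustration. The idea is simply that once every mention maps to one identifier, (Actor, Action, Recipient) tuples can be gathered into a small knowledge structure.

```python
from collections import defaultdict

# Hypothetical alias table: every surface form points to one identifier.
ALIASES = {
    "Harry": "HARRY_POTTER",
    "Harry Potter": "HARRY_POTTER",
    "Mr. Harry Potter": "HARRY_POTTER",
    "Hermione": "HERMIONE_GRANGER",
}


def canonical(name, aliases):
    """Return the shared identifier for a name, or the name itself if unknown."""
    return aliases.get(name, name)


def build_graph(triples, aliases):
    """Group (actor, action, recipient) triples by the actor's identifier."""
    graph = defaultdict(list)
    for actor, action, recipient in triples:
        graph[canonical(actor, aliases)].append(
            (action, canonical(recipient, aliases))
        )
    return dict(graph)


# Hypothetical triples, as they might be pulled from a text.
triples = [
    ("Harry", "greets", "Hermione"),
    ("Mr. Harry Potter", "IS", "a wizard"),
    ("Hermione", "helps", "Harry Potter"),
]

graph = build_graph(triples, ALIASES)
for actor, facts in graph.items():
    print(actor, facts)
```

Notice that "Harry" and "Mr. Harry Potter" end up under the same key. Without the alias step, the same character would be split across several entries, which is exactly the problem name clustering addresses.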

1.6. How to Install BookNLP

If you are using Linux (and, I believe, Mac; for Mac M1, see below), installation will be easy. Use pip install booknlp. You can opt to create a custom environment (recommended but not necessary). If you are using Windows, however, as of March 3, 2022, you will need to take a few additional steps, which I have documented in the video below:

from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/3l5ERF3QX0M" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

1.7. How to Install BookNLP on Mac M1

This section was kindly written by Joel Lee, Fellow at the United States Holocaust Memorial Museum.

When running on M1, you may run into a “zsh: illegal hardware instruction” error when trying to run spacy or booknlp commands after pip installing booknlp, or you may run into dependency errors when trying to pip install booknlp. This is my workaround: we manually install the dependencies stated in the setup.py (spacy, tensorflow, pytorch, and transformers) and then run pip install booknlp without its dependencies.

  • First, if you do not already have Anaconda or Miniconda for your local Python development, start with a conda install. We will follow this guide to use Miniforge.

    • https://www.mrdbourke.com/setup-apple-m1-pro-and-m1-max-for-machine-learning-and-data-science/

      • Instant Download Link for Miniforge3 (Conda installer) for macOS arm64 chips (M1, M1 Pro, M1 Max).

      • chmod +x ~/Downloads/Miniforge3-MacOSX-arm64.sh

      • sh ~/Downloads/Miniforge3-MacOSX-arm64.sh

      • source ~/miniforge3/bin/activate

  • With your existing or newly installed conda ready, continue in your Terminal or Anaconda Prompt session.

    • To ensure the conda environment you will create uses the MacOS native osx-arm64 version of Python packages for your M1 Apple Silicon computer, execute this command in your Terminal or Anaconda Prompt session before following the rest of this recipe.

      • conda env config vars set CONDA_SUBDIR=osx-arm64

    • Now create and activate a new conda virtual environment that you will use for the BookNLP course:

      • conda create -n booknlp python=3.8

      • conda activate booknlp

  • Installing tensorflow.

    • conda install -c apple tensorflow-deps

    • python -m pip install tensorflow-macos

    • python -m pip install tensorflow-metal

  • Installing spacy.

    • conda install -c conda-forge spacy

  • Installing transformers.

    • conda install transformers OR

    • pip install transformers

  • Installing pytorch.

    • conda install -c pytorch pytorch OR, if this gives you installation troubles, try

    • conda install -c conda-forge pytorch

  • Now if we were to run pip install booknlp we would very likely get a conflicting dependency error or ResolutionImpossible error between booknlp and tensorflow. Fortunately, because we have installed the dependencies ourselves, we can run the pip install without including the already installed dependencies.

    • pip install --no-dependencies booknlp (Note that the option starts with two dash characters.)

  • Now we can download the spacy pipeline and should be able to follow the rest of the booknlp repo.

    • python -m spacy download en_core_web_sm

Congratulations! Your M1-based Macintosh should now be ready to use as you take the BookNLP course.
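If you would like to double-check the result, the following is a small sketch you could run inside the new environment. It only uses the Python standard library. The function names are my own, and the module list is an assumption based on the packages installed above (note that pytorch is imported under the name torch). It reports which modules Python cannot find and whether Python itself is running natively on Apple Silicon.

```python
import importlib.util
import platform

# Import names of the packages installed in the steps above (assumed list).
REQUIRED_MODULES = [
    "tensorflow",
    "spacy",
    "transformers",
    "torch",  # the import name for pytorch
    "booknlp",
    "en_core_web_sm",  # the spaCy pipeline downloaded in the last step
]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def is_native_arm64():
    """True when Python runs natively on an Apple Silicon Mac."""
    return platform.system() == "Darwin" and platform.machine() == "arm64"


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Not installed in this environment:", ", ".join(missing))
else:
    print("All required modules were found.")

print("Native arm64 Python:", is_native_arm64())
```

If the last line prints False on an M1 machine, Python is likely an Intel build running under Rosetta, which is one possible cause of the “illegal hardware instruction” error.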