2. Getting Started with BookNLP

Dr. W.J.B. Mattingly, Smithsonian Data Science Lab and United States Holocaust Memorial Museum, March 2022

2.1. Covered in this Chapter

How to Import BookNLP
How to Set Up the BookNLP Pipeline and Model
How to Run the BookNLP Pipeline

2.2. Introduction

Now that you have successfully installed BookNLP and the requisite spaCy small model, we can dive in! I have tested BookNLP on Linux (Ubuntu 20.04) and Windows 10. The code provided in this chapter has worked on both systems once BookNLP was installed correctly. If you are receiving errors, feel free to submit an issue or pull request with the GitHub icon in the top right corner of the page.

In this chapter, I will introduce you to a lot of the code that can be found in the README.md file on the official repository for BookNLP as well as the official Google Colab Notebook. I always believe in following the docs early in these textbooks, so that you can follow along a bit more easily.

The goal of this chapter is to teach you how to run BookNLP over a book (stored as a text file) and generate the requisite output: a series of files (explained in detail in the next chapter).

Unlike most libraries I have introduced to readers and viewers of my YouTube channel in the past, the main goal of BookNLP is to generate a series of files stored in a subdirectory. These files will contain all the requisite data to then begin analyzing your works.

2.3. Importing BookNLP and Creating a Pipeline

As with all libraries, we need to import BookNLP into our Python file (or notebook) in order to work with it. We specifically need the BookNLP class, so let’s go ahead and import it with the command below.

from booknlp.booknlp import BookNLP

INFO:tensorflow:Enabling eager execution
INFO:tensorflow:Enabling v2 tensorshape
INFO:tensorflow:Enabling resource variables
INFO:tensorflow:Enabling tensor equality
INFO:tensorflow:Enabling control flow v2
using device cuda
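If the import above fails, it can help to confirm that Python can actually locate both packages before debugging anything else. The following is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name of my own, and I am assuming the spaCy small model is installed as a package named `en_core_web_sm`.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if Python can locate the named top-level module."""
    # Hypothetical helper for illustration; not part of BookNLP or spaCy.
    return importlib.util.find_spec(module_name) is not None


# "booknlp" is the package imported in this chapter;
# "en_core_web_sm" is assumed to be the installed spaCy small model.
for name in ("booknlp", "en_core_web_sm"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

A "missing" result suggests returning to the installation steps rather than troubleshooting the pipeline code itself.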

Now that we have imported the BookNLP class correctly, it’s time to create the BookNLP pipeline and model in memory. To create a BookNLP pipeline, you first need a dictionary. Stick with the documentation here and call this object “model_params”. It will have two keys: pipeline and model.

pipeline will take a value that is a single string in which commas separate the different components. You can play around with this later, but for now let’s work with the entire pipeline, which consists of:
- entity
- quote
- supersense
- event
- coref

model will take a value that states the size of the model. For now, use big, as we are just trying to follow the docs and create an output that we can analyze in the next chapter.

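# Run the full pipeline with the big model.
model_params = {
    "pipeline": "entity,quote,supersense,event,coref",
    "model": "big",
}

# Create the English-language BookNLP pipeline in memory.
booknlp = BookNLP("en", model_params)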

{'pipeline': 'entity,quote,supersense,event,coref', 'model': 'big'}

c:\users\wma22\appdata\local\programs\python\python39\lib\site-packages\spacy\util.py:833: UserWarning: [W095] Model 'en_core_web_sm' (3.1.0) was trained with spaCy v3.1 and may not be 100% compatible with the current version (3.2.2). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)

--- startup: 7.314 seconds ---
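Because the pipeline value is just a comma-separated string, it can be convenient to build it from a Python list when you start experimenting with subsets of components. The sketch below is illustrative only: `build_model_params` and `KNOWN_COMPONENTS` are hypothetical names of my own, and the list of valid components is simply the five names given in this chapter.

```python
# Component names as listed in this chapter (assumed to be the full set).
KNOWN_COMPONENTS = ("entity", "quote", "supersense", "event", "coref")


def build_model_params(components, model="big"):
    """Build a model_params dictionary from a list of component names."""
    # Hypothetical helper for illustration; not part of BookNLP.
    unknown = [c for c in components if c not in KNOWN_COMPONENTS]
    if unknown:
        raise ValueError(f"Unknown pipeline component(s): {unknown}")
    # Join the names into the comma-separated string the dictionary expects.
    return {"pipeline": ",".join(components), "model": model}


model_params = build_model_params(
    ["entity", "quote", "supersense", "event", "coref"]
)
print(model_params)
# {'pipeline': 'entity,quote,supersense,event,coref', 'model': 'big'}
```

The resulting dictionary is identical to the hand-written one, so it can be passed to BookNLP("en", model_params) in the same way.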

2.4. Setting up the File and Directories

Next, we need to set up two objects: the input_file and the output_directory. The input file will be a string that corresponds to the location of the text file that contains the book or large document you want to analyze. For simplicity’s sake, I have placed our input file in the data folder.

The output directory is the directory into which you want BookNLP to dump all the generated data files. Although there are ways to generate folders programmatically with os, I recommend keeping things simple and making the directories manually for now. In our case, I have already created a subfolder within data entitled “harry_potter”. This is where the files I generate will be stored.

Finally, let’s make a third object that will be a string. This will be our book_id. Think of this as a unique name that will be the basis for how the output files are named.

input_file = "data/harry_potter_cleaned.txt"

output_directory = "data/harry_potter"

book_id = "harry_potter"
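If you would rather create the output folder programmatically, here is one possible sketch using pathlib from the standard library (an alternative to the os module mentioned above). The helper name `prepare_paths` is hypothetical; it checks that the input file exists and creates the output directory, including any missing parent folders.

```python
from pathlib import Path


def prepare_paths(input_file, output_directory):
    """Check that the input file exists and create the output directory."""
    # Hypothetical helper for illustration; not part of BookNLP.
    input_path = Path(input_file)
    if not input_path.is_file():
        raise FileNotFoundError(f"Input file not found: {input_path}")
    # parents=True creates intermediate folders;
    # exist_ok=True makes repeated calls harmless.
    Path(output_directory).mkdir(parents=True, exist_ok=True)
```

Calling prepare_paths(input_file, output_directory) before processing the book fails early with a clear message if the path to the text file is wrong, instead of partway through a long run.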

2.5. Running the Pipeline

Now that we have created the model and the necessary objects, let’s process our text! To do this, we will use booknlp.process(). This will take three arguments, all of which we have already created:

input_file
output_directory
book_id

The code below will take some time to run. Even on a powerful computer, it will take a few minutes for a 100k-word file. Do not be surprised if this takes 10+ minutes. For benchmarks, you can see the repository.

booknlp.process(input_file, output_directory, book_id)

--- spacy: 18.936 seconds ---
--- entities: 88.072 seconds ---
--- quotes: 0.105 seconds ---
--- attribution: 28.766 seconds ---
--- name coref: 0.545 seconds ---
--- coref: 28.508 seconds ---
--- TOTAL (excl. startup): 165.277 seconds ---, 99256 words

If all goes well, you should see an output like the one above that lists each process after it completes with the corresponding time it took to complete the task. You should also see the files generated in the output directory.
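To confirm from Python that the files were written, you could list the contents of the output directory. This is a small sketch under one assumption: that each generated file name starts with the book_id followed by a dot. The helper name `list_output_files` is hypothetical.

```python
from pathlib import Path


def list_output_files(output_directory, book_id):
    """Return sorted names of files that start with '<book_id>.'."""
    # Hypothetical helper; assumes output files are named '<book_id>.<suffix>'.
    folder = Path(output_directory)
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.glob(f"{book_id}.*") if p.is_file())


# Prints one file name per line, or nothing if the folder is missing or empty.
for name in list_output_files("data/harry_potter", "harry_potter"):
    print(name)
```

An empty result after a run that appeared to succeed usually means the output_directory string points somewhere other than where you are looking.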

2.6. Conclusion

At this point, we are finished! We have successfully used BookNLP to create all the requisite files. Now comes the fun part: analyzing those files, which will be the task in the next chapter.
