Monday, December 12, 2016

Preparing a Book Index Using Python

You have probably come across one of those large textbooks and noticed the index at the end. With a hard copy, it is nice to have such an index to navigate to the desired page quickly. I recently published a very short book, and when it came to creating the index, the task seemed daunting even though the book is very short. The book doesn't have an index yet anyway.

If you have been following my articles, you will notice that I mainly write about Python and how it can help us solve different problems in a simple manner. So let's see how we can create a book index using Python.

Without further ado, let's get started.

What Is a Book Index?

I'm pretty sure that most of you know what a book index is, but I just want to quickly clarify this concept.

A book index is simply a collection of words and/or phrases that are considered important to the book, along with their locations in the book. The index does not contain every word/phrase in the book. The reason for that is shown in the next section.

What Makes a Good Book Index?

What if you had an index through which you could find the location of every word or phrase in the book? Wouldn't that be considered the index of choice? Wrong!

The index of choice, or what would be considered a good index, is that which points to the important words and phrases in the book. You might be questioning the reason for that. Let's take an example. Say that we have a book that consists only of the following sentence:

My book is short

What would happen if we try to index each word and phrase in that very short sentence, assuming that the location is the word number in the sentence? This is the index that we would have in this case:
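my: 1
my book: 1
my book is: 1
my book is short: 1
book: 2
book is: 2
book is short: 2
is: 3
is short: 3
short: 4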

From the example above, we can see that such an index would be larger than the book itself! So a good index would be one that contains the words and phrases considered important to the reader.

Setup

Natural Language Toolkit (NLTK)

In this tutorial, we will be using the Natural Language Toolkit (NLTK) library, which is used to work with human language data. As mentioned in the documentation, NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”

I'm currently writing this tutorial from my Ubuntu machine, and the steps for installing NLTK in this section will be relevant to the Ubuntu Operating System. But don't worry, you can find the steps for installing NLTK on other Operating Systems on the NLTK website.

In order to install NLTK, I'm going to use pip. If you don't have pip installed already, you can use the following command in your terminal to install pip:

sudo easy_install3 pip

To make sure you have pip installed, type the following command:

pip --version

You should get something similar to the following:

pip 8.1.2 from /usr/local/lib/python3.5/dist-packages/pip-8.1.2-py3.5.egg (python 3.5)

Now, to install NLTK, simply run the following command in your terminal:

sudo pip install -U nltk

You can test the nltk installation by typing python in your terminal and then importing nltk. If you get ImportError: No module named nltk, this thread might help you out.

Test File

At this point, we need a test file (book) to use for creating a book index. I'll grab this book: The Rate of Change of the Rate of Change by the EFF. You can download the text file of the book from Dropbox. You can of course use any book of your choice; you just need something to experiment with in this tutorial.

Program

Let's start with the interesting part of this tutorial: the program that will help us build the book index. The first thing we want to do is find the word frequencies in the book. I have shown how we can do that in another tutorial, but I want to show you how we can do it using the NLTK library.

This can be done as follows:
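Here is a minimal sketch of what that could look like (the file name book.txt and the variable names are my own placeholders, not taken from the original post):

import nltk
from collections import Counter

# nltk.download('punkt')  # may be needed the first time you use the tokenizer

# Read the whole book into a single string
with open('book.txt', 'r') as f:
    text = f.read()

# Counter object that will hold the word frequencies
frequencies = Counter()

# Split the text into individual tokens (words and punctuation)
words = nltk.word_tokenize(text)

# Loop through the words and count how many times each one occurs
for word in words:
    frequencies[word] += 1

print(frequencies)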

When you run the program, you will notice that we will have a very long list of words and their frequencies.

Before moving further, let's analyze the above code a bit. In the following line:
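frequencies = Counter()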

We are using the Counter() class (from Python's collections module) in order to get the word frequencies in the book (how many times each word occurs in the book).

word_tokenize, on the other hand, splits the text into its constituent tokens (words and punctuation marks). Let's take a simple example to see how word_tokenize actually works:
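A small script along these lines would do (the sentence here is simply the one whose tokenized output is shown below):

from nltk import word_tokenize

sentence = "My name is Abder. I like Python. It's a pretty nice programming language"
print(word_tokenize(sentence))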

The output of the above script is as follows:

['My', 'name', 'is', 'Abder', '.', 'I', 'like', 'Python', '.', 'It', "'s", 'a', 'pretty', 'nice', 'programming', 'language']

We then loop through the words and find the frequency of occurrence of each word.

What about phrases (combinations of words)? Those are called collocations (sequences of words that often occur together). An example of collocations is bigrams, that is, pairs of adjacent words. Similar to that are trigrams (combinations of three words), and so forth (i.e. n-grams).

Let's say we want to extract the bigrams from our book. We can do that as follows:
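One possible way to do that with NLTK's collocations module is sketched below (it reuses the words list from the frequency sketch above):

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Association measures used later to score the bigrams
bigram_measures = BigramAssocMeasures()

# Build a bigram finder from the list of tokens
finder = BigramCollocationFinder.from_words(words)

# Ignore all bigrams that occur less than two times in the book
finder.apply_freq_filter(2)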

The number 2 passed to the apply_freq_filter() function tells the finder to ignore all bigrams that occur fewer than two times in the book.

If we want to find the 30 most occurring bigrams in the book, we can use the following code statement:
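print(finder.nbest(bigram_measures.raw_freq, 30))

This line ranks the bigrams by their raw frequency and keeps the top 30; it relies on the bigram_measures and finder objects from the sketch above.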

Finally, if we would like to find the location, which in our case is where the word or phrase occurs in the book (not the page number), we can do the following:
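One simple approach is sketched below (the word 'rate' and the phrase ('rate', 'of') are just illustrative placeholders):

# Word numbers at which a given word occurs in the book
print([i for i, w in enumerate(words) if w == 'rate'])

# Word numbers at which a given two-word phrase starts in the book
print([i for i in range(len(words) - 1) if (words[i], words[i + 1]) == ('rate', 'of')])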

The above statements return the position (word number) of the word or phrase in the text, similar to what we have seen in our short sentence example at the beginning of the tutorial.

Putting It All Together

Let's put what we have learned into a single Python script. The following script will read our book and return the word frequencies, along with the 30 most occurring bigrams in the book, in addition to the location of a word and a phrase in the book:
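A complete sketch along those lines follows (the file name, the example word, and the example phrase are placeholders you would replace with your own):

import nltk
from collections import Counter
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# nltk.download('punkt')  # may be needed the first time you use the tokenizer

# Read the book into a single string
with open('book.txt', 'r') as f:
    text = f.read()

# Word frequencies
frequencies = Counter()
words = nltk.word_tokenize(text)
for word in words:
    frequencies[word] += 1
print(frequencies)

# 30 most occurring bigrams, ignoring bigrams that occur fewer than two times
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(2)
print(finder.nbest(bigram_measures.raw_freq, 30))

# Location (word number) of an example word and an example phrase
print([i for i, w in enumerate(words) if w == 'rate'])
print([i for i in range(len(words) - 1) if (words[i], words[i + 1]) == ('rate', 'of')])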

Conclusion

As we have seen in this tutorial, building an index can be daunting even for a short text. Also, a fully automated way of building the optimal index for a book may not be feasible.

We were able to address this issue using Python and the NLTK library, picking candidate words and phrases for the book index based on their frequency of occurrence (i.e. importance) in the book.

There is, of course, more you can do with NLTK, as shown in the library's documentation. You can also refer to the book Natural Language Processing with Python if you would like to go deeper into this library.

