Tuesday, May 23, 2017

Working With MeSH Files in Python: Linking Terms and Numbers

This tutorial shows how we can use different aspects of Python (i.e. dictionaries, lists, and regular expressions) together to solve different issues. It also shows how we can use Python to link the relationships in the MeSH file, making it easier to understand its hierarchy and structure.

Before moving ahead with this tutorial, you might be wondering what we mean by MeSH. So let's start by defining this term first, and then go into a bit more detail on its structure.

What Is MeSH?

MeSH is an acronym for Medical Subject Headings. It is considered the U.S. National Library of Medicine's controlled vocabulary (thesaurus), which gives uniformity and consistency to the indexing and cataloging of biomedical literature. MeSH, a distinctive feature of MEDLINE, is arranged in a hierarchical manner called the MeSH Tree Structure, and is updated annually.

MeSH is thus a nomenclature of medical terms, available from the U.S. National Library of Medicine, that aims to create new knowledge by exploiting the relationships among the terms that annotate the biomedical literature.

People searching MEDLINE/PubMed and other databases make use of MeSH to assist with subject searching. The National Library of Medicine (NLM) indexers use MeSH to describe the subject content of journal articles for MEDLINE. Catalogers use MeSH to describe books and audiovisuals in the NLM and other library collections. So MeSH can be used for numerous tasks involving indexing, tagging, searching, retrieving, analyzing, coding, merging, and sharing biomedical text.

MeSH File Structure

MeSH descriptors are organized into 16 categories:

  • A: anatomy
  • B: organisms 
  • C: diseases
  • D: drugs and chemicals
  • E: analytical, diagnostic and therapeutic techniques and equipment
  • F: psychiatry and psychology
  • G: phenomena and processes 
  • H: disciplines and occupations
  • I: anthropology, education, sociology, and social phenomena
  • J: technology, industry, agriculture
  • K: humanities
  • L: information science
  • M: named groups
  • N: health care
  • V: publication characteristics
  • Z: geographicals

You can find more information about the categories from the U.S. National Library of Medicine. Each category is further divided into subcategories. This structure is, however, not considered an authoritative subject classification system, but rather an arrangement of descriptors for the guidance and convenience of people who are assigning subject headings to documents or are searching for literature. It is thus not an exhaustive classification of the subject and contains only the terms that have been selected for inclusion in this thesaurus.

Here's some more information on the MeSH Tree Structures:

Because of the branching structure of the hierarchies, these lists are sometimes referred to as "trees". Each MeSH descriptor appears in at least one place in the trees, and may appear in as many additional places as may be appropriate. Those who index articles or catalog books are instructed to find and use the most specific MeSH descriptor that is available to represent each indexable concept. 

Downloading a MeSH File

For the purpose of this tutorial, we need a MeSH file to work with in Python. You can find MeSH files on the NLM download site.

Let's go ahead and download the latest ASCII MeSH file. We can first go to the MeSH FTP Archive: http://ift.tt/28NhyNm, and then choose the 2017 directory. In the asciimesh/ directory, you will find three .bin files: c2017.bin, d2017.bin, and q2017.bin. Let's download d2017.bin. You can download the file from: http://ift.tt/2rOr33x (27.5 MB).

Linking Terms to Numbers

Let's jump into the core of this article. What we are trying to do is read a MeSH file (i.e. the .bin file you just downloaded), browse through the entries, find all the MeSH numbers for each entry, and list the terms along with their relevant numbers. 

The first thing we would normally do is read the .bin file, as follows:

Notice that we have used the rb mode, meaning that we are reading binary with no line-break translation.
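A minimal sketch of this read step might look like the following. The helper name read_mesh_lines and the tiny stand-in file are assumptions made for illustration (the record in it is made up and uses a nonexistent X category); with the real download you would pass the path to d2017.bin instead.

```python
import os
import tempfile


def read_mesh_lines(path):
    """Return the file's lines as bytes objects, line endings untouched."""
    with open(path, mode="rb") as mesh_file:
        return mesh_file.readlines()


# Demo on a small stand-in file (made-up record, not real MeSH data).
# With the real download: mesh_lines = read_mesh_lines("d2017.bin")
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.bin")
    with open(demo_path, "wb") as demo_file:
        demo_file.write(b"*NEWRECORD\nMH = Sample Term One\nMN = X01.100\n")
    mesh_lines = read_mesh_lines(demo_path)

print(mesh_lines[1])  # a bytes object, still ending in its newline
```

Because the file is opened in rb mode, every element of mesh_lines is a bytes object rather than a string, which matters for the regular expressions used later.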

We also need to define an output file where we would store the results (output):
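One way to set up the output file is sketched below. The file name mesh.txt is an assumption, and the temporary directory is only there so that running the sketch leaves nothing behind; the two printed values are made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # "mesh.txt" is an assumed name for the results file.
    output_path = os.path.join(tmp_dir, "mesh.txt")

    # Text mode ("w"): results are written as ordinary strings.
    with open(output_path, "w", encoding="utf-8") as output_file:
        print("Sample Term One", file=output_file)
        print("X01.100", file=output_file)

    # Read the file back to confirm what was stored.
    with open(output_path, encoding="utf-8") as check_file:
        written = check_file.read()
```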

At this point, we want to check the lines that start with MH = (MeSH term) and MN = (MeSH number). Before doing that, here is a snapshot of the MeSH file to give some idea of its structure and to remove any confusion (MH and MN are each surrounded by a red rectangle).

A snapshot of the MeSH file

To check lines that start with MH = and MN =, we need to use regular expressions. So, if we want to check the lines that start with MH = followed by any characters, we would do as shown in the code below (I'll get to what line is in a moment). Notice that I have used b instead of r for the regular expression, since we are applying the pattern on a byte object and not a string object, so we should use a byte pattern.
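As a hedged sketch of this check, the snippet below applies the MH pattern to a single made-up line (the names line, mesh_term, and term are illustrative). It also shows what happens if a string pattern is used on a bytes object by mistake.

```python
import re

# Made-up line, shaped like one read from the file in rb mode.
line = b"MH = Sample Term One\n"

# Bytes pattern (b"...") because the line is a bytes object.
mesh_term = re.search(b"MH = (.+)$", line)
if mesh_term:
    term = mesh_term.group(1)
    print(term)  # b'Sample Term One'

# A str pattern cannot be applied to bytes: Python raises TypeError.
mixed_types_error = None
try:
    re.search("MH = (.+)$", line)
except TypeError as error:
    mixed_types_error = error
```

Note that re.search looks for the pattern anywhere in the line; prefixing the pattern with ^ would anchor it strictly to the start of the line.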

The same thing would apply for the MeSH number, but this time for lines starting with MN =.
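The MN check can be sketched the same way. The tree number here is made up, and the second line shows that a line without MN = simply yields None.

```python
import re

number_line = b"MN = X01.100\n"   # made-up MeSH number
other_line = b"*NEWRECORD\n"      # a line with no MN field

mesh_number = re.search(b"MN = (.+)$", number_line)
no_match = re.search(b"MN = (.+)$", other_line)

number = mesh_number.group(1)
print(number)    # b'X01.100'
print(no_match)  # None
```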

Coming back to line, this refers to the lines in the MeSH file. So we would be walking through the file line by line, looking for the MeSH terms and numbers. As you can see from the above MeSH file snapshot, the MeSH term comes before the MeSH number. So, in our code, the MeSH number will always be the number corresponding to the previously captured MeSH term. We will thus do the following:
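A minimal sketch of that walk is shown here, run on a few made-up lines instead of the real file. The variable names and the decode("utf-8") step are assumptions for illustration.

```python
import re

# Made-up records standing in for the lines of d2017.bin.
lines = [
    b"*NEWRECORD\n",
    b"MH = Sample Term One\n",
    b"MN = X01.100\n",
    b"MN = X02.200\n",
    b"*NEWRECORD\n",
    b"MH = Sample Term Two\n",
    b"MN = X03.300\n",
]

pairs = []
term = None  # the most recently seen MeSH term
for line in lines:
    mesh_term = re.search(b"MH = (.+)$", line)
    if mesh_term:
        term = mesh_term.group(1).decode("utf-8")

    mesh_number = re.search(b"MN = (.+)$", line)
    if mesh_number and term is not None:
        # Each number belongs to the term captured just before it.
        pairs.append((term, mesh_number.group(1).decode("utf-8")))

print(pairs)
```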

Let's go through the above code step by step. The regular expression MH = (.+)$ tells Python to find the literal MH = followed by at least one character. The dot (.) means any character except a newline, + means one or more of them, and $ means the match must run to the end of the line.

The parentheses around .+, that is (.+), form a capture group so we can retrieve the result. So, for the MeSH term surrounded by a red rectangle in the above snapshot, the retrieved term will be Calcimycin. The reason we are using if-statements is that some lines start with neither MH = nor MN =.
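To make the role of the capture group concrete, this small sketch compares the whole match with the captured part, using a made-up line.

```python
import re

match = re.search(b"MH = (.+)$", b"MH = Sample Term One\n")

whole = match.group(0)     # everything the pattern matched
captured = match.group(1)  # only the part inside the parentheses

print(whole)     # b'MH = Sample Term One'
print(captured)  # b'Sample Term One'
```

The trailing newline is in neither result, because the dot does not match a newline and $ matches just before it.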

For the captured MeSH term and MeSH number, we create a new key-value pair for a dictionary object, as shown in this line of code: numbers[str(number)] = term.
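One detail worth keeping in mind, assuming the captured values are bytes objects (which is what a bytes pattern returns): in Python 3, str() on bytes produces the b'...' representation rather than the plain text. The sketch below, on made-up values, contrasts that with decode("utf-8"), which is the variant used in the remaining sketches.

```python
number = b"X01.100"          # made-up captured MeSH number
term = b"Sample Term One"    # made-up captured MeSH term

as_repr = str(number)             # "b'X01.100'", prefix and quotes included
as_text = number.decode("utf-8")  # "X01.100"

# Number-to-term lookup keyed by the plain text form.
numbers = {}
numbers[as_text] = term.decode("utf-8")

print(as_repr)
print(numbers)
```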

It is important to note that a single MeSH term might have more than one MeSH number. So we concatenate every new MeSH number with the relevant term into a string, as shown in this portion of the code:
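The concatenation step could be sketched as follows, starting from made-up (term, number) pairs. Using a space as the separator is an assumption.

```python
# Made-up (term, number) pairs, in the order they would be read.
pairs = [
    ("Sample Term One", "X01.100"),
    ("Sample Term One", "X02.200"),
    ("Sample Term Two", "X03.300"),
]

terms = {}
for term, number in pairs:
    if term in terms:
        # Term seen before: append the new number to its string.
        terms[term] = terms[term] + " " + number
    else:
        # First number for this term.
        terms[term] = number

print(terms)
```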

Thus, in this case, we will have a dictionary object whose key-value pairs consist of a MeSH term as the key and the concatenated collection of all its corresponding MeSH numbers as the value.

What we want to do now is list the different keys (terms), and have the relevant values (numbers) listed under the relevant term. To list the different terms, we do the following:
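A possible sketch of this listing step is given below, on a made-up dictionary. Sorting the terms and numbers is an assumption made to get a stable order.

```python
# Made-up term-to-numbers dictionary (numbers joined by spaces).
terms = {
    "Sample Term Two": "X03.300",
    "Sample Term One": "X01.100 X02.200",
}

term_list = sorted(terms)  # sorted list of the keys (terms)

number_list = []
for term in term_list:
    # Split each concatenated value back into individual numbers.
    number_list.extend(terms[term].split(" "))
number_list.sort()

print(term_list)
print(number_list)
```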

Finally, we will list the term and its relevant numbers as follows:
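A hedged sketch of the final listing follows. It writes to an in-memory buffer (io.StringIO) rather than a real file, and the four-space indent is an illustrative choice; the data is made up.

```python
import io

terms = {
    "Sample Term Two": "X03.300",
    "Sample Term One": "X01.100 X02.200",
}

report = io.StringIO()  # stands in for the output file
for term in sorted(terms):
    print(term, file=report)
    for number in terms[term].split(" "):
        # Numbers are grouped immediately under their term.
        print("    " + number, file=report)
    print(file=report)  # blank line between terms

output = report.getvalue()
print(output)
```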

Before showing the output of the program, let's put it all together.

Putting It All Together

In this section, I will show you what our full Python program that links the MeSH term to its numbers looks like:
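The sketch below is one way the pieces could fit together; it is a reconstruction under stated assumptions, not necessarily the exact program. The function names, the mesh.txt output name, and the UTF-8 decoding are assumptions, and it keeps a list of numbers per term instead of a concatenated string. The demo at the bottom runs on made-up lines so it works without the downloaded file.

```python
import io
import re

MH_PATTERN = re.compile(b"MH = (.+)$")  # MeSH term lines
MN_PATTERN = re.compile(b"MN = (.+)$")  # MeSH number lines


def link_terms_to_numbers(lines):
    """Map each MeSH term to the list of MeSH numbers that follow it."""
    terms = {}
    term = None
    for line in lines:
        term_match = MH_PATTERN.search(line)
        if term_match:
            term = term_match.group(1).decode("utf-8")
        number_match = MN_PATTERN.search(line)
        if number_match and term is not None:
            number = number_match.group(1).decode("utf-8")
            terms.setdefault(term, []).append(number)
    return terms


def write_report(terms, output_file):
    """Write each term followed by its numbers, indented underneath."""
    for term in sorted(terms):
        print(term, file=output_file)
        for number in sorted(terms[term]):
            print("    " + number, file=output_file)


def main(mesh_path="d2017.bin", output_path="mesh.txt"):
    """Process the downloaded file; call this once d2017.bin is available."""
    with open(mesh_path, mode="rb") as mesh_file:
        terms = link_terms_to_numbers(mesh_file)
    with open(output_path, "w", encoding="utf-8") as output_file:
        write_report(terms, output_file)


# Demo on made-up lines (not real MeSH data).
sample_lines = [
    b"*NEWRECORD\n",
    b"MH = Sample Term Two\n",
    b"MN = X03.300\n",
    b"*NEWRECORD\n",
    b"MH = Sample Term One\n",
    b"MN = X02.200\n",
    b"MN = X01.100\n",
]
linked = link_terms_to_numbers(sample_lines)
report = io.StringIO()
write_report(linked, report)
print(report.getvalue())
```

With d2017.bin in the working directory, calling main() would read the real file and write the grouped listing to mesh.txt.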

Output

You can download the output from Dropbox (1.77 MB). Taking a sample of the output as shown below, we can see how a MeSH term (Pterygopalatine Fossa) is listed with its MeSH numbers that are grouped immediately underneath.

Conclusion

This tutorial showed how we can use different aspects of Python (i.e. dictionaries, lists, and regular expressions) together to solve different issues. It also showed how we can use Python to work with MeSH files, linking parts of this complex file in a way that makes its hierarchy and structure easier to understand, as we did here by linking each MeSH term to its relevant MeSH numbers.

