Extracting and linking ontology terms from text

Apr 10, 2020

I recently needed a quick way to extract ontology terms and their corresponding IDs from free text. This post describes the development of a custom spaCy pipeline that does the required pattern matching.

The terms come from the Disease Ontology (DO), which is part of the Disease Ontology project hosted at the Institute for Genome Sciences at the University of Maryland School of Medicine. It covers the full spectrum of diseases and links to repositories of various biomedical datasets. DO uses identifiers (DOIDs) to uniquely map human diseases to numeric strings. These DOIDs are also used to cross-reference other well-established ontologies, including SNOMED, ICD-10, MeSH, and UMLS.

Working with ontologies in Python

Pronto is a library to view, modify, create and export ontologies in Python. It implements the specifications of the Open Biomedical Ontologies 1.4 in the form of a safe high-level interface. Many ontologies are available in the OBO format on the OBO Foundry website.

Print all direct child terms of the term “disease by infectious agent” from the DOID ontology

Matching component

While simple regular expressions would be sufficient in this case, we use spaCy’s existing components, which offer additional functionality. They enable higher-level matching on Doc and Token objects, not just plain text. The Matcher component lets you create rules that make use of token attributes such as part-of-speech tags, entity types and lemmas, whereas the PhraseMatcher lets you specify the phrases themselves directly. It can be used to match a large list of phrases, which would otherwise be difficult to realise with the token-based Matcher.
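As a rough illustration of phrase matching, the sketch below registers two phrases with a PhraseMatcher and runs it on one sentence. It assumes spaCy v3 (where `matcher.add` takes a list of pattern docs) and uses a blank English pipeline, so no trained model is needed; the phrase list and the key `"DISEASE"` are made up for the example.

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank pipeline only tokenizes, which is all the PhraseMatcher needs.
nlp = spacy.blank("en")

# attr="LOWER" compares lower-cased token texts, so matching is case-insensitive.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["viral infectious disease", "dermatitis"]
matcher.add("DISEASE", [nlp.make_doc(term) for term in terms])

doc = nlp("Dermatitis is not a viral infectious disease.")

# Each match is a (match_id, start_token, end_token) triple.
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```

Using `nlp.make_doc` for the patterns only runs the tokenizer, which keeps pattern creation fast when the phrase list is large.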

spaCy pipeline

Custom components are a good way to add functionality to spaCy, for example to attach additional metadata to tokens or the document, or to add entities. Components are executed in the specified order when the nlp object is called on a text.
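To make this concrete, here is a small sketch of a custom component that attaches metadata to the document. It assumes the spaCy v3 registration API (`@Language.component` plus `nlp.add_pipe` with the component name); the component name `token_counter` and the attribute `n_tokens` are invented for this example.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Register a document-level attribute, available as doc._.n_tokens.
# force=True avoids an error if the snippet is executed more than once.
Doc.set_extension("n_tokens", default=0, force=True)


@Language.component("token_counter")
def token_counter(doc):
    # A component receives a Doc, may modify it, and must return it.
    doc._.n_tokens = len(doc)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("token_counter", last=True)  # position in the pipeline is explicit

doc = nlp("Custom components run in pipeline order.")
print(nlp.pipe_names, doc._.n_tokens)
```

With several components, arguments such as `first`, `last`, `before` and `after` on `add_pipe` control where the new component runs.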

The DOIDExtractorComponent

Using the information above, we can build a DOID extractor component that will be added to the spaCy pipeline. It is important to note that we do not edit the entities but create a new custom attribute at the Doc level called doids, so we do not interfere with the regular NER. The extractor keeps only the best match, which in our case is the longest one: we prefer to match “1,4-phenylenediamine allergic contact dermatitis” over just “dermatitis”.
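The longest-match rule can be sketched independently of spaCy. The helper below is a hypothetical stand-in, not the component's actual code: it takes matches as `(start, end, term)` tuples with invented token offsets and drops every match that overlaps a longer one. spaCy ships a utility with similar behaviour for Span objects, `spacy.util.filter_spans`.

```python
from typing import List, Tuple

# (start token, end token (exclusive), matched term)
Match = Tuple[int, int, str]


def keep_longest(matches: List[Match]) -> List[Match]:
    """Drop matches that overlap a longer match; ties go to the earlier one."""
    # Longest first, then by start position.
    by_length = sorted(matches, key=lambda m: (-(m[1] - m[0]), m[0]))
    kept: List[Match] = []
    taken = set()  # token positions already covered by a kept match
    for start, end, term in by_length:
        if taken.isdisjoint(range(start, end)):
            kept.append((start, end, term))
            taken.update(range(start, end))
    return sorted(kept)  # back into document order


# Token offsets are made up for illustration.
matches = [
    (0, 4, "1,4-phenylenediamine allergic contact dermatitis"),
    (1, 4, "allergic contact dermatitis"),
    (3, 4, "dermatitis"),
    (6, 7, "asthma"),
]
print(keep_longest(matches))
```

Here the two shorter dermatitis matches are discarded because they lie inside the four-token match, while the non-overlapping match is kept.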

Example

The following shows a short example of how to use the component.
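A self-contained sketch of a component along the lines described above, together with its use, might look as follows. It assumes spaCy v3; the factory name `doid_extractor`, the class body and the small `TERMS` table are illustrative choices rather than the original code, and the DOIDs in the table should be verified against the ontology. In practice the table would be built from the terms loaded with Pronto.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span
from spacy.util import filter_spans

# Toy term table (name -> DOID); the IDs are illustrative only.
TERMS = {
    "dermatitis": "DOID:2723",
    "contact dermatitis": "DOID:2773",
    "allergic contact dermatitis": "DOID:3042",
    "asthma": "DOID:2841",
}


@Language.factory("doid_extractor")
class DOIDExtractorComponent:
    def __init__(self, nlp, name):
        # One matcher key per DOID, so a match ID maps straight back to the term ID.
        self.matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
        for term, doid in TERMS.items():
            self.matcher.add(doid, [nlp.make_doc(term)])
        # Results go into a custom Doc attribute; doc.ents is left untouched.
        if not Doc.has_extension("doids"):
            Doc.set_extension("doids", default=None)

    def __call__(self, doc):
        spans = [
            Span(doc, start, end, label=match_id)
            for match_id, start, end in self.matcher(doc)
        ]
        # filter_spans prefers the longest span among overlapping candidates.
        doc._.doids = [(span.text, span.label_) for span in filter_spans(spans)]
        return doc


nlp = spacy.blank("en")
nlp.add_pipe("doid_extractor")

doc = nlp("The patient presented with allergic contact dermatitis and asthma.")
for text, doid in doc._.doids:
    print(doid, text)
```

Each entry in `doc._.doids` pairs the matched text with its identifier, and the overlapping shorter terms such as “dermatitis” are dropped in favour of the longest match.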

As shown, the component successfully extracts the terms from the DOID ontology and the ID of the term can easily be used for linking.