04 Jul 2018

A perfect match: locating plain text in HTML pages

SciLite is a Europe PMC tool that highlights biological terms and relations, such as diseases, chemicals or protein interactions, in abstracts and full-text articles. These terms are identified as annotations by text-mining algorithms developed by a variety of text-mining groups.

The main challenge for the SciLite tool is locating plain-text annotations in HTML pages, and the difficulties derive from the nature of HTML itself. Below is a list of the major challenges we faced and the solutions adopted to mitigate them.

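  1. The pages contain HTML tags, obviously, so a sentence that is contiguous in the plain-text annotation may be interrupted by markup in the page. For example, visit this article, and click on the “Gene Function” checkbox, on the right-hand side of the page, to see the sentence highlighted.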

    Figure 1: Annotation containing HTML tags
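
To make the tag problem concrete, here is a minimal Python sketch of one possible way to find a plain-text annotation inside HTML that contains inline tags. It is an illustration only, not necessarily how SciLite does it. The function names `plain_text_with_offsets` and `locate` are hypothetical, and the sketch assumes simple, well-formed tags with no character entities (such as `&amp;`), comments or scripts.

```python
import re

# Assumption: anything between "<" and ">" is a tag. Real pages also need
# handling for entities, comments and <script>/<style> content.
TAG = re.compile(r"<[^>]*>")


def plain_text_with_offsets(html):
    """Strip tags and remember where each kept character sat in the HTML."""
    chars = []
    offsets = []
    pos = 0
    for match in TAG.finditer(html):
        # Keep the text that precedes this tag, one character at a time.
        for i in range(pos, match.start()):
            chars.append(html[i])
            offsets.append(i)
        pos = match.end()  # skip over the tag itself
    # Keep whatever text follows the last tag.
    for i in range(pos, len(html)):
        chars.append(html[i])
        offsets.append(i)
    return "".join(chars), offsets


def locate(html, annotation):
    """Return (start, end) indices into html covering the annotation, or None."""
    if not annotation:
        return None
    text, offsets = plain_text_with_offsets(html)
    start = text.find(annotation)
    if start == -1:
        return None
    last = start + len(annotation) - 1
    # Map the first and last matched characters back to HTML positions.
    return offsets[start], offsets[last] + 1


html = "The <i>BRCA1</i> gene <b>regulates</b> DNA repair."
span = locate(html, "BRCA1 gene regulates DNA repair")
if span is not None:
    print(html[span[0]:span[1]])
```

The returned span covers the tags that fall inside the match, so a highlighter built on this idea would still have to wrap each text fragment separately to keep the markup well-formed.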