Latest posts:

04 Jul 2018

A perfect match: locating plain text in HTML pages

SciLite is a Europe PMC tool that allows biological terms or relations, such as diseases, chemicals or protein interactions, to be highlighted for readers on abstracts and full text articles. These terms are identified as annotations by text mining algorithms, developed by a variety of text mining groups.

The main challenge for the SciLite tool is locating plain text annotations in HTML pages, and it derives from the nature of HTML itself. Below is a list of the major problems we faced and the solutions adopted to mitigate them.

  1. The pages contain HTML tags. For an example, consider the page and click on the “Gene Function” checkbox on the right-hand side of the page to see the sentence highlighted.

    Figure 1: Annotation containing HTML tags

    The problem is caused by the sub tag that surrounds the character “v” inside the word “Nav1.7”. As a result, searching the HTML page for an exact match of the plain sentence finds nothing. Our solution was to search with a regular expression that allows an optional HTML tag between any two characters of the annotation text. The disadvantage of this approach is that it is much more computationally demanding than an exact match search. We therefore decided to use the regular expression search only for sentence-based annotations, where the chance of encountering HTML tags is much higher than for named entity annotations, which are usually composed of only one or two words.
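
The idea of allowing an optional tag between any two characters can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SciLite implementation: the function name `build_tag_tolerant_pattern`, the simplified definition of a tag (anything between `<` and `>`), and the sample HTML sentence are all made up for the example.

```python
import re

# Assumption: a "tag" is anything between "<" and ">". Real pages may need a stricter pattern.
OPTIONAL_TAG = r"(?:<[^>]+>)?"


def build_tag_tolerant_pattern(annotation: str) -> re.Pattern:
    """Build a regex that allows one optional HTML tag between any two characters."""
    escaped_chars = [re.escape(ch) for ch in annotation]
    return re.compile(OPTIONAL_TAG.join(escaped_chars))


# Made-up sample markup with a sub tag inside the word "Nav1.7".
html = "<p>The channel Na<sub>v</sub>1.7 is expressed in neurons.</p>"
annotation = "Nav1.7 is expressed"

# The exact search fails because of the sub tag; the regex search succeeds.
exact_found = annotation in html
match = build_tag_tolerant_pattern(annotation).search(html)
matched_html = match.group(0) if match else None
print(exact_found, matched_html)
```

The generated pattern grows with every character of the annotation, which gives a feel for why this kind of search costs more than a plain substring lookup.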

04 Jun 2018

Integrating Literature and Data

Data is at the heart of research. Scientific papers describe how data has been obtained, analysed, and what conclusions have been drawn. But it is the data that comprises the essential evidence, which confirms or disproves the original hypothesis. In the life sciences it is essential to look at scientific literature in the context of other publications, the data it builds on and other data linked to the publication. At Europe PMC we have developed a number of features to support data discovery and reuse.

As one of the ELIXIR Core Resources, Europe PMC benefits from excellent links to essential research data hubs located at EMBL-EBI. This helps us interweave publications and data, enriching the graph of research objects and helping researchers discover linked and related data.

The literature-data links come in different shapes and forms. An article might cite a DOI for a dataset in a repository, or describe a protein structure cited as an accession number for the PDBe database. A publication itself might be cited by a database, such as Flybase, or even by a Wikipedia article. Europe PMC obtains such literature-data links in three ways. They might be provided directly by databases at the EBI, submitted by providers participating in the Europe PMC external links program, or picked up directly by our text-mining pipeline, which extracts data mentions, such as accession numbers, from research publications. The three link types largely share the same characteristics, but used to be scattered across the Europe PMC website. They showed up in different locations and were obtained through different web service methods in different formats, even though they all link external content, mostly data, to the publication.

Based on their commonalities and use, it made sense for us to start consolidating our datalinks in the Europe PMC API, as well as in their presentation through the Europe PMC website. To adhere to community standards and allow exchange of data with other providers, we have turned to the Scholix format for scholarly link exchange, which we have helped to shape and have subsequently used to represent datalinks in Europe PMC web services.

Collaborating with Scholix

Scholix, or Scholarly link exchange, is an initiative to establish a multi-hub infrastructure that harmonizes and enables the exchange of data-literature links between several natural hubs in scholarly communities, such as DataCite, CrossRef, or OpenAIRE. The centerpiece of the Scholix landscape is the format used to facilitate link exchange between the hubs and other interested parties. Data links in Scholix format are presented as an “information package”. The package contains information about the two linked objects (e.g. a publication and a dataset), as well as link metadata: date, provider, copyrights, etc.
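
To make the idea of an “information package” more concrete, here is a rough Python sketch of two linked objects plus link metadata. The class names, field names and all values are hypothetical placeholders chosen for illustration; they are simplified and should not be read as the actual Scholix schema or as a real Europe PMC record.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class LinkedObject:
    identifier: str   # e.g. an article ID, DOI or accession number
    id_scheme: str    # which kind of identifier this is
    object_type: str  # "publication" or "dataset"


@dataclass
class LinkPackage:
    source: LinkedObject   # first linked object
    target: LinkedObject   # second linked object
    relationship: str      # how the source relates to the target
    publication_date: str  # link metadata: date
    provider: str          # link metadata: who supplied the link
    license_url: str       # link metadata: copyright/licence information


# All identifiers below are placeholders, not real records.
package = LinkPackage(
    source=LinkedObject("PMC0000000", "pmcid", "publication"),
    target=LinkedObject("0xxx", "pdb", "dataset"),
    relationship="References",
    publication_date="2018-01-01",
    provider="Example Provider",
    license_url="https://example.org/license",
)

# asdict() converts nested dataclasses too, so the package serialises in one call.
package_json = json.dumps(asdict(package), indent=2)
print(package_json)
```

Keeping both objects and the link metadata in one self-describing unit is what lets a package be passed between hubs without extra context.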

Figure 1: Scholix hub architecture

08 Mar 2018

The Importance of Software Testing in DevOps

Software testing is the process of assessing the correctness and quality of a software program. In other words, testing means executing a system or application in order to find software bugs, defects, errors or unexpected behavior.

Software testing is necessary because we all make mistakes. Some of those mistakes are minor, but others can be expensive or dangerous. Especially while practicing continuous integration, continuous delivery, or continuous deployment, we need to test anything and everything we produce, because things can always go wrong.

Testing is mainly classified into two types, Functional Testing and Non-functional Testing.

Types of Functional and Non-functional testing

29 Jan 2018

Behavior-Driven Development in Bioinformatics

What is BDD?

Behavior-Driven Development (BDD) is a set of Software Engineering practices designed to help teams deliver more valuable and higher quality software features.

It combines the general techniques and principles of Test-Driven Development (TDD) with ideas from Domain-Driven Design (DDD). BDD incrementally builds functionality guided by expected behavior.

A simple BDD scenario / requirement is as follows:

  Scenario: Specific Search by Keyword
    Given I am a researcher
    When I open the 'Europe PMC' Website
    And enter the keyword "Glycosyl transferases" in the Query field
    And click on the Search button
    Then I should be able to see the matching results on the Search Result page
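
A scenario like this eventually has to be backed by executable checks. The sketch below shows one way the Given/When/Then steps could map onto a plain Python test. It deliberately avoids any BDD framework, and everything in it is hypothetical: the `search` function and the in-memory `ARTICLES` list are stand-ins for the real website, and the article titles are invented.

```python
from typing import List

# Hypothetical stand-in for the website: a tiny in-memory list of invented titles.
ARTICLES = [
    "Glycosyl transferases in plant cell walls",
    "A survey of text mining tools",
]


def search(keyword: str) -> List[str]:
    """Return the titles that contain the keyword, ignoring case."""
    return [title for title in ARTICLES if keyword.lower() in title.lower()]


def test_specific_search_by_keyword() -> List[str]:
    # Given I am a researcher
    # When I open the website and enter the keyword in the Query field
    keyword = "Glycosyl transferases"
    # And click on the Search button
    results = search(keyword)
    # Then I should be able to see the matching results
    assert results, "expected at least one matching result"
    assert all(keyword.lower() in title.lower() for title in results)
    return results


results = test_specific_search_by_keyword()
print(results)
```

In practice a BDD tool would bind each scenario line to its own step function, so that the plain-language requirement and the test stay in sync.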

27 Nov 2017

Swagger documentation customisation

Swagger is a popular software framework that helps developers build RESTful Web services through their entire lifecycle, from design and documentation, to test and deployment. This post focuses on how to incorporate the API documentation generated through Swagger inside an HTML page hosted from another web application.

One of the main features of Swagger is producing interactive documentation for a RESTful API. Swagger can be used in conjunction with a multitude of different languages and frameworks. It will always produce two different outputs inside the same web application hosting the API:

  1. A default HTML page with the standard Swagger style (Europe PMC Annotations API Swagger standard html page)
  2. A JSON file that contains the description of the generated documentation (Europe PMC Annotations API documentation descriptor)
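
The JSON descriptor is the output another web application can consume to present the documentation in its own pages. As a rough illustration, assuming a Swagger 2.0 style descriptor with a top-level `paths` object, the Python sketch below turns a descriptor into a small HTML fragment. The sample descriptor and the `render_operations` helper are invented for this example and are not the Europe PMC Annotations API.

```python
import json
from html import escape

# Invented sample descriptor following the Swagger 2.0 layout (info + paths).
DESCRIPTOR_JSON = """
{
  "swagger": "2.0",
  "info": {"title": "Example API", "version": "1.0"},
  "paths": {
    "/items": {"get": {"summary": "List items"}},
    "/items/{id}": {"get": {"summary": "Get one item"}}
  }
}
"""

HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch"}


def render_operations(descriptor: dict) -> str:
    """Render every operation in the descriptor as an HTML list item."""
    rows = []
    for path, operations in sorted(descriptor.get("paths", {}).items()):
        for method, details in sorted(operations.items()):
            if method not in HTTP_METHODS:
                continue  # skip non-operation keys such as path-level "parameters"
            summary = escape(details.get("summary", ""))
            rows.append(f"<li><code>{method.upper()} {escape(path)}</code> {summary}</li>")
    return "<ul>\n" + "\n".join(rows) + "\n</ul>"


html_fragment = render_operations(json.loads(DESCRIPTOR_JSON))
print(html_fragment)
```

Because the fragment is generated from the descriptor alone, the hosting page keeps full control over layout and styling.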

03 Nov 2017

Remote debug Ruby on Rails running in a Docker container using RubyMine

RubyMine provides a sophisticated debugger with a graphical UI for Ruby, JS, and CoffeeScript. You can set breakpoints and run your code step by step with all the information at your fingertips, without having to modify your code as you would with Pry.

This article describes how to use RubyMine to remotely debug a Ruby on Rails application, even one that runs inside a Docker container.

Versions of software in our environment

18 Aug 2017

Add Google reCAPTCHA to Your Website

Google reCAPTCHA is a free service that protects your site from spam and abuse.

You can just follow the steps below to add it to your website.

First of all, go to Google reCAPTCHA, and register your application there. Then you can work on the client and server sides respectively as below:

On the client side (see ref)

First, paste the <script...></script> snippet below before the closing </head> tag in your HTML template, for example:

    <script src=''></script>

Then paste the <div...></div> snippet below at the end of the <form> where you want the reCAPTCHA widget to appear, for example: