04 Jul 2018
A perfect match: locating plain text in HTML pages
SciLite is a Europe PMC tool that highlights biological terms or relations, such as diseases, chemicals or protein interactions, for readers in abstracts and full-text articles. These terms are identified as annotations by text-mining algorithms developed by a variety of text-mining groups.
The main challenge for SciLite is locating plain-text annotations in HTML pages, and the difficulties stem from the nature of HTML itself. Below is a list of the major challenges we faced and the solutions we adopted to mitigate them.
- The pages contain HTML tags. For an example, open https://europepmc.org/articles/PMC1215513 and tick the “Gene Function” checkbox on the right-hand side of the page to see the sentence highlighted.
Figure 1: Annotation containing HTML tags
The problem is caused by the sub tag surrounding the character “v” inside the word “Nav1.7”. As a result, an exact-match search for the plain sentence in the HTML page finds nothing. Our solution was to search with a regular expression that allows an optional HTML tag between any two characters of the annotation text. The disadvantage is that this type of search is much more computationally demanding than an exact match. We therefore use the regular expression search only for sentence-based annotations, where HTML tags are much more likely to occur than in named entity annotations, which usually consist of only one or two words.
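The tag-tolerant search described above can be sketched in a few lines of Python. This is an illustration only, not the SciLite implementation: the function names and the sample HTML are hypothetical, and the pattern allows any number of tags between characters rather than exactly one optional tag. It also does not handle HTML entities or whitespace differences.

```python
import re

# Zero or more HTML tags (opening or closing), e.g. <sub> or </sub>.
# Assumption: tags contain no ">" inside attribute values.
OPTIONAL_TAGS = r"(?:<[^>]+>)*"


def build_tag_tolerant_pattern(annotation):
    """Compile a regex matching `annotation` even if tags interrupt it."""
    # Escape every character so that ".", "(", etc. are matched literally,
    # then allow optional tags between each pair of adjacent characters.
    escaped_chars = [re.escape(ch) for ch in annotation]
    return re.compile(OPTIONAL_TAGS.join(escaped_chars))


def find_annotation(html, annotation):
    """Return the (start, end) offsets of the annotation in `html`, or None."""
    match = build_tag_tolerant_pattern(annotation).search(html)
    return match.span() if match else None


# Hypothetical sample markup, modelled on the "Nav1.7" case.
html = "<p>Mutations in Na<sub>v</sub>1.7 cause pain.</p>"
print("Nav1.7" in html)                 # exact match fails
print(find_annotation(html, "Nav1.7"))  # regex search succeeds
```

Because the pattern grows with every character of the annotation, the cost rises with annotation length and page size, which is consistent with restricting this search to sentence-based annotations.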
04 Jun 2018
Integrating Literature and Data
Data is at the heart of research. Scientific papers describe how data has been obtained, analysed, and what conclusions have been drawn. But it is the data that comprises the essential evidence, which confirms or disproves the original hypothesis. In the life sciences it is essential to look at scientific literature in the context of other publications, the data it builds on and other data linked to the publication. At Europe PMC we have developed a number of features to support data discovery and reuse.
As one of the ELIXIR Core Resources, Europe PMC benefits from excellent links to essential research data hubs located at EMBL-EBI. This helps us interweave publications and data, enriching the graph of research objects and helping researchers discover linked and related data.
The literature-data links come in different shapes and forms. An article might cite a DOI for a dataset in a repository, or describe a protein structure cited as an accession number in the PDBe database. A publication itself might be cited by a database, such as Flybase, or even by a Wikipedia article. Europe PMC obtains such literature-data links in three ways. They might be provided directly by databases at the EBI, submitted by providers participating in the Europe PMC external links program, or picked up directly by our text-mining pipeline, which extracts data mentions, such as accession numbers, from research publications. The three link types largely share the same characteristics, but used to be scattered across the Europe PMC website. They would show up in different locations and were obtained through different web service methods in different formats, even though they all link external content, mostly data, to the publication.
Based on their commonalities and use, it made sense for us to start consolidating our datalinks in the Europe PMC API, as well as in their presentation through the Europe PMC website. To adhere to community standards and allow exchange of data with other providers, we have turned to the Scholix format for scholarly link exchange, which we have helped to shape and have subsequently used to represent datalinks in Europe PMC web services.
Collaborating with Scholix
Scholix, or Scholarly Link Exchange, is an initiative to establish a multi-hub infrastructure that harmonizes and enables the exchange of data-literature links between several natural hubs in scholarly communities, such as DataCite, CrossRef, or OpenAIRE. The centerpiece of the Scholix landscape is the format used to facilitate link exchange between the hubs and other interested parties. Data links in Scholix format are presented as an “information package”. The package contains information about the two linked objects (e.g. a publication and a dataset), as well as link metadata: date, provider, copyright, etc.
Figure 1: Scholix hub architecture
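To make the idea of an “information package” concrete, here is a rough Python sketch of the kind of record described above: two linked objects plus link metadata. The class names, field names and identifiers are hypothetical placeholders chosen for illustration. They are not the official Scholix schema, nor the output of the Europe PMC API.

```python
from dataclasses import dataclass, asdict
from datetime import date
from typing import Optional


@dataclass
class LinkedObject:
    """One end of a link, e.g. a publication or a dataset (hypothetical fields)."""
    identifier: str
    identifier_scheme: str  # e.g. "pmcid" or "doi"
    object_type: str        # e.g. "publication" or "dataset"


@dataclass
class LinkPackage:
    """The two linked objects plus metadata about the link itself."""
    source: LinkedObject
    target: LinkedObject
    link_date: date
    provider: str
    copyright: Optional[str] = None


# Placeholder identifiers, not real records.
package = LinkPackage(
    source=LinkedObject("PMC0000000", "pmcid", "publication"),
    target=LinkedObject("10.0000/example", "doi", "dataset"),
    link_date=date(2018, 6, 4),
    provider="Example provider",
)

# asdict() converts the nested dataclasses into plain dictionaries,
# which is a convenient step before serialising to JSON.
print(asdict(package))
```

Keeping both ends of the link in the same shape means a consumer can treat publication-to-dataset and dataset-to-publication links uniformly.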
08 Mar 2018
The Importance of Software Testing in DevOps
Software testing is the process of identifying the correctness and quality of a software program. In other words, testing is executing a system or application in order to find software bugs, defects, errors or unexpected behavior.
Software testing is necessary because we all make mistakes. Some of those mistakes are minor, but others can be expensive or dangerous. Especially while practicing continuous integration, continuous delivery, or continuous deployment, we need to test anything and everything we produce, because things can always go wrong.
Testing is mainly classified into two types, Functional Testing and Non-functional Testing.
29 Jan 2018
Behavior-Driven Development in Bioinformatics
What is BDD?
Behavior-Driven Development (BDD) is a set of Software Engineering practices designed to help teams deliver more valuable and higher quality software features.
It combines the general techniques and principles of Test-Driven Development (TDD) with ideas from Domain-Driven Design (DDD). BDD builds functionality incrementally, guided by expected behavior.
A simple BDD scenario / requirement is as follows:
27 Nov 2017
Swagger documentation customisation
Swagger is a popular software framework that helps developers build RESTful Web services through their entire lifecycle, from design and documentation, to test and deployment. This post focuses on how to incorporate the API documentation generated through Swagger inside an HTML page hosted from another web application.
One of the main features of Swagger is producing interactive documentation for a RESTful API. Swagger can be used in conjunction with a multitude of different languages and frameworks. It will always produce two different outputs inside the same web application hosting the API:
- A default HTML page having a standard Swagger style. (Europe PMC Annotations API Swagger standard html page)
- A JSON file that will contain the description of the generated documentation (Europe PMC Annotations API documentation descriptor)
03 Nov 2017
Remote debug Ruby on Rails running in a Docker container using RubyMine
RubyMine provides a sophisticated debugger with a graphical UI for Ruby, JS, and CoffeeScript. You can set breakpoints and run your code step by step with all the information at your fingertips, without having to modify your code as you would with Pry.
This article explains how to use RubyMine to remotely debug a Ruby on Rails application, even one running inside a Docker container.
Versions of software in our environment
- RubyMine 2017.2.4 (Build #RM-172.4155.44, built on September 26, 2017)
- Ruby inside Docker (ruby 2.4.2p198 (2017-09-14 revision 59899) [x86_64-linux])
- Ruby SDK and Gems used by RubyMine (ruby-2.4.2-p198)
18 Aug 2017
Add Google reCAPTCHA to Your Website
Google reCAPTCHA is a free service that protects your site from spam and abuse.
You can just follow the steps below to add it to your website.
First of all, go to Google reCAPTCHA, and register your application there. Then you can work on the client and server sides respectively as below:
On the client side (see ref)
First, paste the <script...></script> snippet before the closing </head> tag in your HTML template, for example:
Then paste the <div...></div> snippet at the end of the <form> where you want the reCAPTCHA widget to appear, for example: