04 Jun 2018

Integrating Literature and Data

Data is at the heart of research. Scientific papers describe how data has been obtained and analysed, and what conclusions have been drawn. But it is the data that provides the essential evidence that confirms or disproves the original hypothesis. In the life sciences it is essential to look at a publication in the context of other publications, the data it builds on, and other data linked to it. At Europe PMC we have developed a number of features to support data discovery and reuse.

As one of the ELIXIR Core Resources, Europe PMC benefits from excellent links to essential research data hubs located at EMBL-EBI. This helps us interweave publications and data, enriching the graph of research objects and helping researchers discover linked and related data.

The literature-data links come in different shapes and forms. An article might cite a DOI for a dataset in a repository, or describe a protein structure cited by its accession number in the PDBe database. A publication itself might be cited by a database, such as FlyBase, or even by a Wikipedia article. Europe PMC obtains such literature-data links in three ways. They might be provided directly by databases at the EBI, submitted by providers participating in the Europe PMC external links program, or picked up directly by our text-mining pipeline, which extracts data mentions, such as accession numbers, from research publications. The three link types largely share the same characteristics, but used to be scattered across the Europe PMC website. They would show up in different locations and were obtained through different web service methods in different formats, even though they all link external content, mostly data, to the publication.
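To give a feel for the text-mining route, here is a minimal sketch in Python. It is not the Europe PMC pipeline. It assumes a simplified, hypothetical rule: a PDB-style identifier (one non-zero digit followed by three alphanumeric characters) counts as a data mention only when it directly follows the text "PDB" or "PDBe", optionally with a word such as "ID" or "entry" in between. The pattern and the function name find_pdb_mentions are made up for illustration.

```python
import re

# Hypothetical, simplified pattern (an assumption, not the real pipeline):
# "PDB" or "PDBe", an optional cue word, an optional colon, then a
# four-character identifier starting with a non-zero digit.
PDB_MENTION = re.compile(
    r"\bPDBe?\s*(?:ID|code|entry|accession)?\s*:?\s*([1-9][A-Za-z0-9]{3})\b",
    re.IGNORECASE,
)


def find_pdb_mentions(text):
    """Return the unique PDB-style identifiers mentioned in text, upper-cased and sorted."""
    return sorted({match.group(1).upper() for match in PDB_MENTION.finditer(text)})


# Made-up sentence; the identifiers are placeholders, not real citations.
sample = "The structure (PDB ID: 1ABC) was compared with PDB entry 2xyz in 2018."
mentions = find_pdb_mentions(sample)
```

Requiring the "PDB" cue keeps ordinary four-character tokens such as the year "2018" from being reported. A production pipeline would need far more context handling and validation against the database itself.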

Based on their commonalities and use, it made sense for us to start consolidating our datalinks in the Europe PMC API, as well as in their presentation through the Europe PMC website. To adhere to community standards and allow exchange of data with other providers, we have turned to the Scholix format for scholarly link exchange, which we have helped to shape and have subsequently used to represent datalinks in Europe PMC web services.

Collaborating with Scholix

Scholix, or Scholarly link exchange, is an initiative to establish a multi-hub infrastructure that harmonizes and enables the exchange of data-literature links between several natural hubs in scholarly communities, such as DataCite, CrossRef, or OpenAIRE. The centerpiece of the Scholix landscape is the format that is used to facilitate link exchange between the hubs and other interested parties. Data links in Scholix format are presented as an “information package”. The package contains information about the two linked objects (e.g. a publication and a dataset), as well as link metadata: date, provider, copyright, etc.
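The idea of an information package can be sketched as a small data structure. The Python example below is an illustration only: the class and field names (LinkedObject, LinkPackage, id_scheme, license_url and so on) are assumptions chosen for readability and are not the official Scholix schema, and all identifiers and values are placeholders.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class LinkedObject:
    """One end of a link: a publication or a dataset (illustrative fields)."""
    identifier: str   # e.g. a PMID, DOI, or accession number
    id_scheme: str    # e.g. "pmid", "doi", "pdb"
    object_type: str  # "publication" or "dataset"
    title: str = ""


@dataclass
class LinkPackage:
    """Two linked objects plus metadata about the link itself (illustrative fields)."""
    source: LinkedObject
    target: LinkedObject
    relationship: str      # e.g. "References"
    provider: str          # who asserted the link
    publication_date: str  # ISO 8601 date on which the link was published
    license_url: str = ""  # copyright or licence information, if any


def to_json(package):
    """Serialise a package to JSON so it could be exchanged with another party."""
    return json.dumps(asdict(package), indent=2, sort_keys=True)


# Placeholder values, not a real link.
example = LinkPackage(
    source=LinkedObject("00000000", "pmid", "publication", "An example article"),
    target=LinkedObject("1ABC", "pdb", "dataset"),
    relationship="References",
    provider="Example Provider",
    publication_date="2018-06-04",
)
payload = to_json(example)
```

Keeping the two objects and the link metadata in one self-describing record is what lets links from different origins be exchanged and displayed in the same way.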

Figure 1: Scholix hub architecture