What is the DataSHaPER?

GENERAL DESCRIPTION

The DataSHaPER (DataSchema and Harmonization Platform for Epidemiological Research) is both a scientific approach and a suite of practical tools. Its primary aims are to facilitate the prospective harmonization of emerging biobanks, to provide a template for retrospective synthesis, and to support the development of questionnaires and information-collection devices, even when pooling of data with other biobanks is not foreseen.

Its basic structure reflects a four-step approach to harmonization:

  • Identify and document the set of core variables to be shared;
  • Formally assess the potential to share each variable between participating studies;
  • Define appropriate data processing algorithms;
  • Process and synthesize real data.
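
As a minimal sketch of the second step, the snippet below tabulates which studies can generate which core variables. The study names, item names and the `sharing_matrix` helper are hypothetical and are not part of the DataSHaPER tools. A real assessment relies on expert judgement about whether the collected information is adequate, whereas this sketch only checks that a study has declared some source for the variable.

```python
# Step 1 (illustrative): the core variables to be shared.
CORE_VARIABLES = ["current_smoker", "body_mass_index"]

# Hypothetical inventory: for each study, the assessment items it could
# use to build each variable. A missing key means no usable source.
STUDY_SOURCES = {
    "study_a": {
        "current_smoker": ["smokes_now"],
        "body_mass_index": ["height_cm", "weight_kg"],
    },
    "study_b": {
        "current_smoker": ["cigarettes_per_day"],
    },
}


def sharing_matrix(core_variables, study_sources):
    """Step 2 (simplified): map each variable to {study: can it be built?}."""
    return {
        variable: {
            study: variable in sources
            for study, sources in study_sources.items()
        }
        for variable in core_variables
    }


matrix = sharing_matrix(CORE_VARIABLES, STUDY_SOURCES)
for variable, by_study in matrix.items():
    print(variable, by_study)
```

Under these assumptions, "current_smoker" can be shared by both studies, while "body_mass_index" can only be generated by study_a.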

In the context of the DataSHaPER, the term "variables" refers to the primary units of interest in a statistical analysis (e.g. current smoker [yes/no], or body mass index as a quantitative trait). An important distinction is drawn between such variables and the specific "assessment items" that are collected by a particular study (e.g. questions in a questionnaire or physical measures collected by a study). Crucially, it is variables, not assessment items, that are harmonized between studies. This makes harmonization flexible yet robust, because a given variable may be built from different assessment items in different studies.
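
The distinction between variables and assessment items can be illustrated with a short sketch. It assumes two hypothetical studies: one asks a direct yes/no question, the other records the number of cigarettes smoked per day. All study names, item names and functions here are invented for illustration and do not reflect actual DataSHaPER processing algorithms.

```python
from typing import Callable, Dict, Optional

Record = Dict[str, object]


def smoker_from_question(record: Record) -> Optional[bool]:
    """Hypothetical study A: item 'smokes_now' holds 'yes' or 'no'."""
    answer = record.get("smokes_now")
    if answer is None:
        return None  # item not answered, so the variable is missing
    return answer == "yes"


def smoker_from_count(record: Record) -> Optional[bool]:
    """Hypothetical study B: item 'cigarettes_per_day' holds a number."""
    per_day = record.get("cigarettes_per_day")
    if per_day is None:
        return None
    return per_day > 0


# One study-specific algorithm per study, all producing the same variable.
ALGORITHMS: Dict[str, Callable[[Record], Optional[bool]]] = {
    "study_a": smoker_from_question,
    "study_b": smoker_from_count,
}


def derive_current_smoker(study: str, record: Record) -> Optional[bool]:
    """Build the shared 'current smoker' variable from a study's own items."""
    return ALGORITHMS[study](record)


print(derive_current_smoker("study_a", {"smokes_now": "yes"}))
print(derive_current_smoker("study_b", {"cigarettes_per_day": 0}))
```

Both studies end up with the same yes/no variable even though neither collected the same assessment item, which is the property the paragraph above describes.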

Structurally, the DataSHaPER is a dynamically evolving entity with two primary components: the DataSchema Platform and the Harmonization Platform.

The DataSchema Platform

A DataSchema identifies and describes a thematic set of core variables that are of particular value in a specified scientific setting.

The core variables in each DataSchema are grouped under a four-level nested hierarchy:

1. Module:
       Assessment mode or type of element measured or collected. Each module subsumes one or more themes.
2. Theme:
       General area of interest. Each theme subsumes one or more domains.
3. Domain:
       Risk factor or outcome of interest. Each domain subsumes a number of variables.
4. Variable:
       Primary unit of interest for a statistical analysis.
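
One way to picture this hierarchy is as nested records. The sketch below is an assumption about how such a structure could be represented, not the platform's actual data model, and the example entries ("Questionnaire", "Lifestyle", "Tobacco use", "current_smoker") are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Tuple


@dataclass
class Variable:
    name: str  # primary unit of interest for a statistical analysis


@dataclass
class Domain:
    name: str  # risk factor or outcome of interest
    variables: List[Variable] = field(default_factory=list)


@dataclass
class Theme:
    name: str  # general area of interest
    domains: List[Domain] = field(default_factory=list)


@dataclass
class Module:
    name: str  # assessment mode or type of element collected
    themes: List[Theme] = field(default_factory=list)


def iter_variable_paths(
    modules: Iterable[Module],
) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (module, theme, domain, variable) names for every variable."""
    for module in modules:
        for theme in module.themes:
            for domain in theme.domains:
                for variable in domain.variables:
                    yield (module.name, theme.name, domain.name, variable.name)


# Hypothetical example entry, for illustration only.
example = [
    Module(
        "Questionnaire",
        [Theme("Lifestyle", [Domain("Tobacco use", [Variable("current_smoker")])])],
    )
]

for path in iter_variable_paths(example):
    print(" > ".join(path))
```

Walking the structure from module down to variable gives each variable a full path, which is convenient when listing or searching the contents of a schema.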

The DataSchema Platform contains a growing number of such DataSchemas, each with its own scientific purpose. The platform also contains associated support material including variable definitions, links to relevant standard classifications, and access to reference questionnaires and operating procedures that have been selected or developed to reliably generate the variables in each DataSchema.

The Harmonization Platform

Each DataSchema in the DataSchema Platform can be paired with corresponding Harmonization Units that provide a foundation for harmonizing studies relative to that particular schema. Ultimately, the Harmonization Platform will contain a growing number of Harmonization Units.

Access to the Harmonization Platform is limited to collaborative contexts. Please contact us to see how we can work together.

DOCUMENTATION