Techno Data Management across DPPN locations
Plant phenotyping platforms, as provided by the DPPN, produce huge amounts of heterogeneous data. Their scientific potential can only be fully exploited if the datasets are a) available to the community and b) annotated in a comprehensive and comprehensible manner to allow both systematic reanalysis and integration with other data sources. Thus, the DPPN Techno module establishes infrastructures for experiment documentation and data publication as well as methods for linking heterogeneous phenomic and genomic data for knowledge discovery.
The core of sustainable experiment annotation is the description of samples with respect to all experimental factors and source characteristics, together with the protocols and parameters for plant growth and specific assays. The international ISA-Tab standard provides table formats for general-purpose data documentation. We are involved in developing recommendations for minimal information about plant phenotyping experiments (MIAPPE), which are registered at biosharing.org, and in implementing them in a specific ISA-Tab configuration for plant phenotyping data. The corresponding checklists are applied in the local data management systems of the DPPN sites to support sustainable experiment documentation. For that purpose, IPK Gatersleben developed a PhenoLIMS module based on the commercial LIMSOPHY lab information management system, Helmholtz Zentrum München is setting up an openBIS framework, and Forschungszentrum Jülich employs a custom-built integrative information system. All sites use the ISA-Tab standard to export datasets for publication and exchange.
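As an illustration of such a table-based sample description, the following minimal Python sketch serializes a few sample records into a tab-delimited table with ISA-Tab-style column headers and checks each record against a small checklist. The sample records, the internal field names and the checklist in REQUIRED_FIELDS are assumptions made for this example; they do not reproduce the actual MIAPPE checklist or the ISA-Tab configuration used at the DPPN sites.

```python
import csv
import io

# Hypothetical checklist of fields every sample record must provide.
# This is an illustrative stand-in, not the actual MIAPPE checklist.
REQUIRED_FIELDS = ("source", "organism", "protocol", "sample", "treatment")

# Invented sample records, for illustration only.
SAMPLES = [
    {"source": "plant_001", "organism": "Arabidopsis thaliana",
     "protocol": "plant growth", "sample": "plant_001_leaf",
     "treatment": "control"},
    {"source": "plant_002", "organism": "Arabidopsis thaliana",
     "protocol": "plant growth", "sample": "plant_002_leaf",
     "treatment": "drought"},
]


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the checklist fields that are absent or empty in a record."""
    return [field for field in required if not record.get(field)]


def to_isa_style_table(records):
    """Serialize records as a tab-delimited table with ISA-Tab-style headers."""
    header = ["Source Name", "Characteristics[organism]", "Protocol REF",
              "Sample Name", "Factor Value[treatment]"]
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for record in records:
        incomplete = missing_fields(record)
        if incomplete:
            # Refuse to export records that fail the checklist.
            raise ValueError(f"record is missing fields: {incomplete}")
        writer.writerow([record["source"], record["organism"],
                         record["protocol"], record["sample"],
                         record["treatment"]])
    return buffer.getvalue()
```

Rejecting incomplete records at export time is one possible design choice; a data management system could equally enforce the checklist at data entry.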
To facilitate the discovery and reuse of datasets, IPK developed the e!DAL software and, based on it, established the Plant Genomics and Phenomics Research Data Repository (PGP), which assigns citable digital object identifiers to datasets and registers their metadata at DataCite. The high level of automation lowers the barriers to data submission for scientists, and a journal-like review process ensures the quality of released data. PGP sustains storage, access and browsing of datasets and was successfully registered as a research data repository at biosharing.org, re3data.org and OpenAIRE. Furthermore, the Nature Publishing Group accepted PGP as an institutional data repository. PGP already provides access to a large dataset from IPK's LemnaTec high-throughput plant phenotyping platform with a MIAPPE-compliant ISA-Tab metadata representation.
To further improve interoperability of datasets within and beyond DPPN, we suggest controlled vocabularies that integrate existing public ontologies and universal identifiers to extend the current ISA-Tab configuration and describe crucial metadata of plant phenotyping experiments homogeneously. This will enable targeted searches for relevant datasets via keywords in DataCite or a structured metadata registry. Compatible annotations of datasets from different research sites, measurement domains and omics levels open up new opportunities for data-driven hypothesis generation. In studies with Arabidopsis and poplar, we demonstrate how integrating heterogeneous data, from phenotypic observations to gene expression, yielded findings that suggest potential mechanisms of abiotic stress responses, some of which are supported by reviewing multiple annotated previous datasets in context.
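The effect of a controlled vocabulary on keyword searches can be sketched as follows. This minimal Python example maps free-text keywords to shared term identifiers so that datasets annotated with different synonyms are found by the same query. The vocabulary, the placeholder identifiers (EX:0001, EX:0002) and the dataset records are hypothetical and invented for illustration; they do not correspond to terms of any public ontology or to datasets in PGP or DataCite.

```python
# Hypothetical controlled vocabulary: free-text synonyms mapped to
# placeholder term identifiers (not real ontology accessions).
VOCABULARY = {
    "drought": "EX:0001",
    "water deficit": "EX:0001",
    "heat": "EX:0002",
    "high temperature": "EX:0002",
}


def normalize(keyword):
    """Map a free-text keyword to its term identifier, or None if unknown."""
    return VOCABULARY.get(keyword.strip().lower())


def annotate(dataset):
    """Return a copy of the dataset with a set of normalized term identifiers."""
    terms = {normalize(keyword) for keyword in dataset["keywords"]}
    terms.discard(None)  # keywords outside the vocabulary are ignored
    return {**dataset, "terms": terms}


def search(datasets, keyword):
    """Return the ids of datasets annotated with the term behind a keyword."""
    term = normalize(keyword)
    if term is None:
        return []
    return [dataset["id"] for dataset in datasets if term in dataset["terms"]]


# Invented dataset records from three imaginary sites.
DATASETS = [annotate(dataset) for dataset in [
    {"id": "site_a_phenotyping", "keywords": ["Drought", "biomass"]},
    {"id": "site_b_expression", "keywords": ["water deficit"]},
    {"id": "site_c_phenotyping", "keywords": ["high temperature"]},
]]
```

With these records, a query for "drought" returns both the phenotyping dataset from site A and the expression dataset from site B, although the latter was annotated with a different synonym.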