Data management pipeline for plant phenotyping in a multisite project
Kenny Billiau A, Heike Sprenger A, Christian Schudoma A, Dirk Walther A and Karin I. Köhl A,B
A Max Planck Institute of Molecular Plant Physiology, Am Muehlenberg 1, 14476 Potsdam OT Golm, Germany.
B Corresponding author. Email: koehl@mpimp-golm.mpg.de
Functional Plant Biology 39(11) 948-957 https://doi.org/10.1071/FP12009
Submitted: 13 January 2012 Accepted: 22 June 2012 Published: 15 August 2012
Journal Compilation © CSIRO Publishing 2012 Open Access CC BY-NC-ND
Abstract
In plant breeding, plants have to be characterised precisely, consistently and rapidly by different people at several field sites within defined time spans. For meaningful data evaluation and statistical analysis, standardised data storage is required. Data access must be provided on a long-term basis and be independent of organisational barriers without endangering data integrity or intellectual property rights. We discuss the associated technical challenges and demonstrate adequate solutions exemplified in a data management pipeline for a project to identify markers for drought tolerance in potato. This project involves 11 groups from academia and breeding companies, 11 sites and four analytical platforms. Our data warehouse concept combines central data storage in databases and a file server and integrates existing and specialised database solutions for particular data types with new, project-specific databases. The strict use of controlled vocabularies and the application of web-access technologies proved vital to successful data exchange between institutes with diverse data management concepts and infrastructures. By presenting our data management system and making the software available, we aim to support related phenotyping projects.
Additional keywords: controlled vocabulary, data integration, field trials, marker assisted selection, mixed schema design, ontologies.