Construction and annotation of large phylogenetic trees
Michael J. Sanderson, Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721, USA.
Australian Systematic Botany 20(4) 287-301 https://doi.org/10.1071/SB07006
Submitted: 28 February 2007 Accepted: 22 May 2007 Published: 5 September 2007
Abstract
Broad availability of molecular sequence data allows construction of phylogenetic trees with 1000s or even 10 000s of taxa. This paper reviews methodological, technological and empirical issues raised in phylogenetic inference at this scale. Numerous algorithmic and computational challenges have been identified surrounding the core problem of reconstructing large trees accurately from sequence data, but many other obstacles, both upstream and downstream of this step, are less well understood. Before phylogenetic analysis, data must be generated de novo or extracted from existing databases, compiled into blocks of homologous data with controlled properties, aligned, examined for the presence of gene duplications or other kinds of complicating factors, and finally, combined with other evidence via supermatrix or supertree approaches. After phylogenetic analysis, confidence assessments are usually reported, along with other kinds of annotations, such as clade names, or annotations requiring additional inference procedures, such as trait evolution or divergence time estimates. Prospects for partial automation of large-tree construction are also discussed, as well as risks associated with ‘outsourcing’ phylogenetic inference beyond the systematics community.
1 Reflecting on one of many raging arguments over phenetic systematics in the late 1960s, L.A.S. Johnson argued that problems of homology (‘matching’) would not all be whisked away by large oceans of data: ‘…even if we knew the entire nucleotide sequences over a set of organisms we should still have to make many decisions on matching…’ (Johnson 1970: p. 227, based on his presidential address for the Linnean Society of New South Wales in 1968). At the time the prospects for studying such complete genome sequences must have seemed remote. Now the data are here, and the newest genomics technologies (e.g. 454 Life Sciences’s FLX system) promise to deliver 100 million base pairs of sequence in an eight-hour run (50 chloroplast genomes or one entire Arabidopsis genome…). However, the number of ‘decisions’ to be made regarding the analysis of such data has grown along with the quantity of information.