Animal Production Science
Food, fibre and pharmaceuticals from animals
RESEARCH ARTICLE

Predictive ability of Random Forests, Boosting, Support Vector Machines and Genomic Best Linear Unbiased Prediction in different scenarios of genomic evaluation

Farhad Ghafouri-Kesbi A,D, Ghodratollah Rahimi-Mianji A, Mahmood Honarvar B and Ardeshir Nejati-Javaremi C

A Department of Animal Science, Faculty of Animal and Aquatic Sciences, Sari Agricultural Sciences and Natural Resources University, Sari, Iran.

B Department of Animal Science, Shahr-e-Qods Branch, Islamic Azad University, Tehran, Iran.

C Department of Animal Science, University College of Agriculture and Natural Resources, University of Tehran, Karaj, Iran.

D Corresponding author. Email: farhad_ghy@yahoo.com

Animal Production Science 57(2) 229-236 https://doi.org/10.1071/AN15538
Submitted: 29 June 2015  Accepted: 28 October 2015   Published: 23 March 2016

Abstract

Three machine learning algorithms, Random Forests (RF), Boosting and Support Vector Machines (SVM), as well as Genomic Best Linear Unbiased Prediction (GBLUP), were used to predict genomic breeding values (GBV), and their predictive performance was compared across different combinations of heritability (0.1, 0.3 and 0.5), number of quantitative trait loci (QTL) (100 and 1000) and distribution of QTL effects (normal, uniform and gamma). To this end, a genome comprising five chromosomes of one Morgan each was simulated, on which 10 000 bi-allelic single nucleotide polymorphisms were distributed. Pearson’s correlation between the true and predicted GBV and the Mean Squared Error of GBV prediction were used as measures of predictive accuracy and overall fit, respectively. For all methods, prediction accuracy increased as heritability increased and as the number of QTL decreased. GBLUP had better predictive accuracy than the machine learning methods, particularly in scenarios with a higher number of QTL and with normal or uniform distributions of QTL effects, although in most cases the differences were non-significant. In scenarios with a small number of QTL and a gamma distribution of QTL effects, Boosting outperformed the other methods. Regarding the Mean Squared Error of GBV prediction, Boosting outperformed the other methods in most cases, although its estimates were close to those of GBLUP. Among the methods studied, SVM was the most efficient user of memory at 0.6 gigabytes (GB), followed by RF, GBLUP and Boosting with memory requirements of 1.2 GB, 1.3 GB and 2.3 GB, respectively. Regarding computational time, GBLUP, SVM, RF and Boosting ranked first, second, third and last, with 10 min, 15 min, 75 min and 600 min, respectively. It was concluded that although stochastic gradient Boosting can predict GBV with high accuracy, its considerably longer computational time and larger memory requirement can be a serious limitation for this algorithm. Therefore, the use of other variants of Boosting, such as Random Boosting, was recommended for genomic evaluation.
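
The two evaluation criteria named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `pearson_correlation` and `mean_squared_error` and the example values are hypothetical, and the sketch only assumes that true and predicted GBV are available as equal-length lists of numbers.

```python
import math


def pearson_correlation(true_gbv, predicted_gbv):
    """Pearson's correlation between true and predicted GBV (predictive accuracy)."""
    n = len(true_gbv)
    if n != len(predicted_gbv) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_true = sum(true_gbv) / n
    mean_pred = sum(predicted_gbv) / n
    # Sums of cross-products and squares of deviations from the means.
    cross = sum((t - mean_true) * (p - mean_pred)
                for t, p in zip(true_gbv, predicted_gbv))
    ss_true = sum((t - mean_true) ** 2 for t in true_gbv)
    ss_pred = sum((p - mean_pred) ** 2 for p in predicted_gbv)
    if ss_true == 0 or ss_pred == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cross / math.sqrt(ss_true * ss_pred)


def mean_squared_error(true_gbv, predicted_gbv):
    """Mean squared difference between true and predicted GBV (overall fit)."""
    n = len(true_gbv)
    if n != len(predicted_gbv) or n == 0:
        raise ValueError("need two equal-length, non-empty sequences")
    return sum((t - p) ** 2 for t, p in zip(true_gbv, predicted_gbv)) / n


if __name__ == "__main__":
    # Hypothetical values for illustration only; not data from the study.
    true_values = [0.5, -1.2, 0.3, 1.1]
    predicted_values = [0.4, -0.9, 0.1, 0.8]
    print("accuracy:", pearson_correlation(true_values, predicted_values))
    print("MSE:", mean_squared_error(true_values, predicted_values))
```

A higher correlation indicates better ranking of animals by GBV, whereas a lower Mean Squared Error indicates predictions that are closer to the true values on the original scale, so the two criteria can favour different methods.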

Additional keywords: genomic breeding values, machine learning, QTL effects, SNP.
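
The factorial design described in the abstract (three heritabilities, two QTL numbers and three distributions of QTL effects) can be enumerated as in the following sketch. The names `build_scenarios` and `residual_variance` are hypothetical, and the residual variance formula assumes that heritability is defined as the ratio of additive genetic variance to the sum of genetic and residual variance; the abstract does not state how phenotypes were scaled in the study.

```python
from itertools import product

# Factor levels as listed in the abstract.
HERITABILITIES = (0.1, 0.3, 0.5)
QTL_NUMBERS = (100, 1000)
QTL_EFFECT_DISTRIBUTIONS = ("normal", "uniform", "gamma")


def build_scenarios():
    """Return every combination of heritability, QTL number and effect distribution."""
    return [
        {"heritability": h2, "n_qtl": n_qtl, "distribution": dist}
        for h2, n_qtl, dist in product(
            HERITABILITIES, QTL_NUMBERS, QTL_EFFECT_DISTRIBUTIONS
        )
    ]


def residual_variance(genetic_variance, heritability):
    """Residual variance implied by h2 = var_g / (var_g + var_e).

    Assumption for illustration: purely additive genetic variance.
    """
    if not 0 < heritability < 1:
        raise ValueError("heritability must lie strictly between 0 and 1")
    return genetic_variance * (1 - heritability) / heritability


if __name__ == "__main__":
    scenarios = build_scenarios()
    print(len(scenarios), "scenarios")
    for scenario in scenarios:
        # Genetic variance of 1.0 is an arbitrary illustrative value.
        var_e = residual_variance(1.0, scenario["heritability"])
        print(scenario, "residual variance:", round(var_e, 3))
```

Under this assumption, lowering heritability from 0.5 to 0.1 at a fixed genetic variance raises the residual variance ninefold, which is consistent with the lower prediction accuracy reported at low heritability.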

