A holistic approach towards a generalizable machine learning predictor of cell penetrating peptides
Bahaa Ismail A, Sarah Jones A and John Howl A *
A Research Institute in Healthcare Science, University of Wolverhampton, Wulfruna Street, Wolverhampton, WV1 1LY, UK. Email: Bahaa.Ismail@wlv.ac.uk, S.Jones4@wlv.ac.uk
Australian Journal of Chemistry 76(8) 493-506 https://doi.org/10.1071/CH22247
Submitted: 25 November 2022 Accepted: 23 May 2023 Published: 21 June 2023
© 2023 The Author(s) (or their employer(s)). Published by CSIRO Publishing.
Abstract
The development of machine learning (ML) predictors does not necessarily require the employment of expansive classifiers and complex feature encoding schemes to achieve the highest accuracy scores. Rather, it requires data pre-processing, feature optimization, and robust evaluation to ensure consistent results and generalizability. Herein, we describe a multi-stage process to develop a reliable ML predictor of cell penetrating peptides (CPPs). We emphasize the challenges of: (i) the generation of representative datasets with all required pre-processing procedures; (ii) comprehensive and exclusive encoding of peptides using their amino acid composition; (iii) obtaining an optimized feature set using a simple classifier (support vector machine, SVM); (iv) ensuring consistent results; and (v) verifying generalizability at the highest achievable accuracy scores. Two peptide sub-spaces were used to generate the negative examples, which are required, along with the known CPPs, to train the classifier. These were: (i) randomly generated peptides with all amino acid types being equally represented and (ii) peptides extracted from receptor proteins. Results indicated that the randomly generated dataset performed perfectly well within its own peptide sub-space but generalized poorly to the other sub-space. Conversely, the dataset extracted from receptor proteins, while achieving lower accuracies, generalized perfectly to the other peptide sub-space. We combined the qualities of these two datasets by utilizing the average of their predictions within our ultimate framework. This functional ML predictor, WLVCPP, and associated software and datasets can be downloaded from https://github.com/BahaaIsmail/WLVCPP.
Keywords: amino acid composition, cellular uptake, CPP, data pre-processing, drug delivery, feature optimization, machine learning, peptide classification, SVM.
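As a rough illustration of three ideas named in the abstract (amino acid composition encoding, randomly generated negative peptides with all residue types equally likely, and averaging the predictions of two models), the following minimal Python sketch may help. It is not the WLVCPP implementation: the function names, the peptide length range of 5 to 30, the 0.5 decision threshold, and the assumption that each model outputs a score in [0, 1] are all hypothetical choices made for this example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(peptide):
    """Return the fraction of each standard residue in `peptide`.

    Non-standard characters are ignored in the counts, so the fractions
    sum to 1 only for peptides built from the 20 standard residues.
    """
    peptide = peptide.upper()
    if not peptide:
        raise ValueError("peptide must be non-empty")
    return [peptide.count(aa) / len(peptide) for aa in AMINO_ACIDS]


def random_peptide(length, rng):
    """Draw a peptide in which every residue type is equally likely."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def average_prediction(score_a, score_b, threshold=0.5):
    """Average two model scores and threshold the mean.

    `score_a` and `score_b` stand in for the outputs of two classifiers
    (assumed here to lie in [0, 1]); the threshold is an assumption.
    """
    mean_score = (score_a + score_b) / 2
    return mean_score, mean_score >= threshold


# Hypothetical usage: a few random negatives and their feature vectors.
rng = random.Random(0)  # fixed seed so the example is reproducible
negatives = [random_peptide(rng.randint(5, 30), rng) for _ in range(3)]
features = [amino_acid_composition(p) for p in negatives]

# Combine two placeholder scores, e.g. one from each trained model.
mean_score, is_cpp = average_prediction(0.9, 0.4)
```

Each feature vector has 20 entries, one per residue type, and could be passed to any standard classifier such as an SVM. Averaging the two scores is the simplest way to let one model's strength offset the other's weakness; the actual weighting and thresholding used by the authors is not stated in the abstract.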