Dr Rebeca Kawahara (first author): Dr Kawahara is a passionate early career scientist funded by a Cancer Institute NSW Early Career Research Fellowship (2019-22) and a member of the Analytical Glycoimmunology Group, led by Dr Morten Thaysen-Andersen in the Department of Molecular Sciences at Macquarie University, Sydney, Australia. Her core expertise is the development and application of advanced mass spectrometry-based glycoproteomics, proteomics and glycomics to study complex biological systems. Her current research uses integrated multi-omics approaches to advance our molecular understanding of the role of glycoproteins in health and disease, including cancer and immune-related disorders.
Mrs. Anastasia Chernykh (second author): Anastasia Chernykh is a second-year PhD candidate in the Analytical Glycoimmunology Group situated in the Department of Molecular Sciences, Macquarie University, Sydney, Australia. Her research focuses on the structural and functional characterisation of protein glycosylation in inflammatory processes using glycomics and glycoproteomics.
Profiling of intact glycopeptides at scale from complex biological mixtures (glycoproteomics) remains a considerable analytical challenge and a frontier in proteomics. It is recognised that the lack of efficient search engines tailored to the unique challenges of large-scale glycopeptide analysis continues to hinder the rapid expansion and democratisation of glycoproteomics technologies beyond specialist laboratories. Excitingly, the field has seen the development of many innovative bioinformatic tools over the past decade that promise to streamline and semi-automate glycopeptide identification from high-content tandem mass spectrometry data. However, the annotation of glycopeptide MS/MS data is highly error-prone because the glycan composition, the modification site(s) and the peptide carrier must all be assigned correctly; glycopeptides are therefore frequently misidentified or ambiguously annotated. Critically missing is a systematic, comprehensive and unbiased comparison of the informatics solutions available to the community, one that identifies the strengths and weaknesses of existing software and guides the field towards better glycopeptide data analysis.
Conducted through the HUPO Human Proteome Project – Human Glycoproteomics Initiative (HGI), this international community study brought together field-leading developers and expert users of glycoproteomics software to evaluate the performance of informatics solutions for system-wide glycopeptide mass spectrometric analysis (1). In total, 25 teams from 11 countries across five continents signed up for the challenge, of which 22 teams (88%) completed the study (see accompanying figure for location of participants and key contributing authors). All teams were provided with the same two high-resolution LC-MS/MS data files of N- and O-glycopeptides from human serum proteins, generated on a Thermo Orbitrap Fusion Lumos mass spectrometer using different dissociation methods (HCD, ETciD, EThcD, CID) (data kindly provided by Drs Rosa Viner and Sergei Snovida, Thermo Fisher Scientific). A synthetic N-glycopeptide was included as a positive control. The study design, including the sample type, preparation and data collection method, was carefully chosen to mimic conditions typically encountered in glycoproteomics analysis while also aiming to accommodate most available informatic solutions and to appeal to users in the field. All teams were asked to identify intact N- and O-glycopeptides from the two shared data files and to report their identifications and search strategies in a comprehensive and standardised reporting template. Completed reports were thoroughly checked by the independent study organisers for compliance with the study guidelines to enable a fair comparison between teams.
The identified glycopeptides varied dramatically between teams as illustrated by the wide range of N-glycopeptides (49-2,122 N-glycoPSMs) and O-glycopeptides (5-578 O-glycoPSMs) reported by the participants. Discrepant and non-uniform reporting is a recognised challenge in glycoproteomics. The search strategies employed by the teams were found to be highly diverse as exemplified by the variation in the applied glycan search space (23-381 N-glycan compositions and 3-223 O-glycan compositions). Despite the discrepant reporting and varied search strategies, high-confidence lists spanning 163 N- and 23 O-glycopeptides commonly reported by the teams could be generated from the standardised reports. These consensus glycopeptides form an important reference for future studies of the human serum glycoproteome and have therefore been made publicly available (GlyConnect Reference ID 2943).
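The idea of deriving a consensus list from standardised team reports can be sketched in a few lines of Python. This is an illustration only and not the study's actual procedure: the function name, the representation of a glycopeptide as a (peptide, site, glycan composition) tuple, the 50% agreement threshold and the toy identifications are all assumptions made for the example.

```python
from collections import Counter


def consensus_glycopeptides(team_reports, min_fraction=0.5):
    """Return glycopeptides reported by at least min_fraction of the teams.

    team_reports maps a team name to an iterable of glycopeptide keys.
    Each key is assumed to be a (peptide, site, glycan_composition) tuple.
    The 0.5 default is an illustrative threshold, not the study's criterion.
    """
    if not team_reports:
        return []
    counts = Counter()
    for identifications in team_reports.values():
        # Count each glycopeptide once per team, however many PSMs support it.
        counts.update(set(identifications))
    threshold = min_fraction * len(team_reports)
    return sorted(key for key, n in counts.items() if n >= threshold)


# Made-up identifications for three hypothetical teams.
gp_a = ("PEPTIDE_1", 5, "HexNAc4Hex5NeuAc2")
gp_b = ("PEPTIDE_1", 5, "HexNAc4Hex5NeuAc1")
gp_c = ("PEPTIDE_2", 3, "HexNAc1Hex1NeuAc1")

reports = {
    "team_a": [gp_a, gp_b],
    "team_b": [gp_a, gp_a, gp_c],  # duplicate PSMs collapse to one vote
    "team_c": [gp_a, gp_b],
}

consensus = consensus_glycopeptides(reports)
for peptide, site, glycan in consensus:
    print(peptide, site, glycan)
```

Counting one vote per team rather than one per spectrum keeps a single team with many PSMs from dominating the consensus, which matters when reported numbers differ between teams by more than an order of magnitude.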
The relative team performance for N- and O-glycopeptide data analysis was comprehensively established through multiple carefully constructed independent performance tests. The team scoring and ranking were subsequently validated using an orthogonal scoring method. Excitingly, the performance testing revealed that several high-performance glycoproteomics informatics solutions, some well-established (Protein Prospector, Byonic) and others only recently developed (IQ-GPA, GlycoPAT, glyXtoolMS), from both academic and commercial origins are available for N- and O-glycopeptide data analysis.
Backed by robust statistics, deep mining of the performance data also unearthed a set of both software-independent and software-specific performance-associated search variables and identified key parameters important for high-performance glycoproteomics data analysis. Notably, exploration of the impact of the different search strategies on the glycoproteomics data output by the popular Byonic search engine used by 11 teams led to recommendations for improved “high coverage” and “high accuracy” glycoproteomics search strategies that will immediately benefit researchers in the field when studying biological samples of this nature.
Most software solutions currently available for glycoproteomics data analysis were evaluated in this study. However, several newer glycopeptide search engines (e.g. pGlyco, MSFragger-Glyco, O-Pair Search and StrucGP) were not represented, either because of LC-MS/MS data incompatibility or because they were developed after the study period. Follow-up efforts to compare the performance of the latest glycoproteomics software upgrades and of informatics solutions not included in this study are therefore being drafted as part of the next study of the Human Glycoproteomics Initiative.
This community-driven study concludes that diverse software for comprehensive glycopeptide data analysis exists, points to several high-performance search strategies, and specifies key variables that may guide future software development and assist informatics decision-making in glycoproteomics data analysis. While informatics challenges undoubtedly still exist in glycoproteomics, our study highlights that several computational tools, some already demonstrating high performance and others showing considerable potential, are available to the community.
Kawahara, R., Chernykh, A., Alagesan, K., Bern, M., Cao, W., Chalkley, R. J., Cheng, K., Choo, M. S., Edwards, N., Goldman, R., Hoffmann, M., Hu, Y., Huang, Y., Kim, J. Y., Kletter, D., Liquet-Weiland, B., Liu, M., Mechref, Y., Meng, B., Neelamegham, S., Nguyen-Khuong, T., Nilsson, J., Pap, A., Park, G. W., Parker, B. L., Pegg, C. L., Penninger, J. M., Phung, T. K., Pioch, M., Rapp, E., Sakalli, E., Sanda, M., Schulz, B. L., Scott, N. E., Sofronov, G., Stadlmann, J., Vakhrushev, S. Y., Woo, C. M., Wu, H.-Y., Yang, P., Ying, W., Zhang, H., Zhang, Y., Zhao, J., Zaia, J., Haslam, S. M., Palmisano, G., Yoo, J. S., Larson, G., Khoo, K.-H., Medzihradszky, K. F., Kolarich, D., Packer, N. H., and Thaysen-Andersen, M. (2021) Community Evaluation of Glycoproteomics Informatics Solutions Reveals High-Performance Search Strategies of Serum N- and O-Glycopeptide Data. Under final stages of consideration. Available at bioRxiv (https://www.biorxiv.org/content/10.1101/2021.03.14.435332v3).