
Proteomics standards - building trust in proteomics reporting and sharing for almost two decades

28 May 2021 3:38 PM

Written by Eric Deutsch, Institute for Systems Biology, USA; David L. Tabb, Pasteur Institute, Department of Structural Biology and Chemistry, France; Sandra Orchard, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), UK; Juan Antonio Vizcaíno, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), UK; Andrew R Jones, Institute of Systems, Molecular and Integrative Biology, UK

The Proteomics Standards Initiative (PSI) was formed as a HUPO working group back in 2002, with the aim of improving data sharing and standardisation in proteomics. At that time, proteomics was a rapidly developing technique, and the concept of sharing data in support of publications was still in its infancy. Most proteomics articles reported tables of protein identifications or quantitative values in PDF format, inhibiting their import into other software. Visualizing peptide-spectrum matches (PSMs) was rare, preventing critical assessment of PSM quality. The PSI’s primary aims were i) to develop data standards that would enable the submission of data (raw and/or processed) to public databases or for interchange between software, improving transparency and data re-use; ii) to develop “reporting requirements” indicating the types of information that should be recorded, e.g. in a Methods section or database entry, to allow critical interpretation of results; and iii) to help foster a culture of data sharing, including developing software to make sharing easier. The PSI has had a major impact over the past 19 years: most proteomics data sets are now readily available, in formats that support critical assessment of results and allow data re-purposing.

The PSI’s activities have largely been divided into two main branches - molecular interactions (MI) and mass spectrometry (MS)-based workflows, with the main highlights being the development of widely used standard formats (https://www.psidev.info), and international cooperation for data submission and access, coordinated through the ProteomeXchange (http://www.proteomexchange.org/) and IMEx (www.imexconsortium.org) Consortia. These developments have been essential in the context of the HUPO Human Proteome Project.

The current make-up of the PSI involves six working groups - MS, Proteome Informatics, MI, Protein Modifications (MOD), Quality Control (QC) and a recent addition: Intrinsically Disordered Proteins (IDP). We also have a group dedicated to coordination of activity with parallel efforts in metabolomics.

The roster of current PSI leadership is available at https://www.psidev.info/roles. We would like to bring to the attention of the proteomics community that we are planning to refresh the roles of working group Chairs and co-Chairs in 2021, particularly in the Proteome Informatics area. We are also searching for a new PSI Editor. If you would like to nominate yourself or someone else as Chair, co-Chair or for another role within a working group, please email andrew.jones@liverpool.ac.uk to start the conversation. We would expect nominees to have had some prior involvement in PSI activities, but of course we are a fully open organisation, and always welcome new members.

In March 2021, we held our annual PSI Spring Workshop (fully online for the second time). The main focal points of the meeting were:

MI: This track (www.psidev.info/groups/molecular-interactions) has been responsible for the creation and maintenance of several different data exchange formats (PSI-MI XML2.5, PSI-MI XML3.0, MITAB2.5-2.8, MI-JSON) and the accompanying controlled vocabulary (PSI-MI CV). This year it focused on a number of new use cases, mapping these onto the existing standards to ensure that the standards were still fit for purpose. Use case examples included contextual interactions (the interactome of a specific cell or tissue type) and dynamic interactions (the sequential changes in the composition of these interactomes over time or in response to an agonist or inhibitor). We agreed upon a major refactoring of the genetic interaction branch of the PSI-MI CV and discussed methods for linking entities derived from the same gene, e.g. proteins and their originating mRNAs, in a network. Finally, we held the annual IMEx Consortium meeting, with participating databases and data users in attendance. We are open to welcoming new members interested in curating molecular interactions or in further developing data standards.

MS/PI: This track touched on many different topics in an effort to gather input from attendees who do not regularly take part in the ongoing development. We began with a session exploring how the PSI might contribute to the emerging field of non-MS affinity-based workflows, and then moved on to the nearly-ratified Universal Spectrum Identifier (USI) standard. We discussed the in-development ProForma 2.0 standard for proteoform and peptidoform notation, and the PROXI application programming interface for programmatically exchanging proteomics information between ProteomeXchange resources. We explored a proposed extension of the existing mzIdentML and mzTab standards for glycoproteomics, and a proposed format (called SDRF-Proteomics) for annotating the relationship between samples and data files in proteomics datasets, aiming to capture the experimental design. We finished the workshop with discussions on the emerging PSI Spectral Library Format (mzSpecLib), an update of mzTab for proteomics, and a binary version of mzML called mzMLb. The MS/PI working group aims to ratify and release several of these formats in the coming year and welcomes additional participation. Several MS/PI members also participated in efforts in the MOD working group, to produce updates to the PSI modification controlled vocabulary that is used across a range of standards in both the MI and MS spaces.
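To give a flavour of what a Universal Spectrum Identifier looks like in practice, the following is a minimal Python sketch that splits a USI into its parts. It assumes the general layout `mzspec:<collection>:<MS run>:<index type>:<index>[:<interpretation>]`; the function name `parse_usi`, the `USI` tuple and the example identifier are illustrative choices made here and are not taken from any PSI reference implementation. The specification itself remains the authority on edge cases, which this sketch does not attempt to cover.

```python
from typing import NamedTuple, Optional

# Index types assumed for this sketch; consult the USI specification for the full rules.
INDEX_TYPES = {"scan", "index", "nativeId"}


class USI(NamedTuple):
    collection: str                # e.g. a ProteomeXchange dataset accession
    ms_run: str                    # name of the MS run (file name without extension)
    index_type: str                # how the spectrum is addressed within the run
    index_number: str              # the spectrum's index value, kept as text
    interpretation: Optional[str]  # optional peptidoform and charge


def parse_usi(usi: str) -> USI:
    """Split a USI string into its components (hypothetical helper)."""
    parts = usi.split(":")
    if len(parts) < 5 or parts[0] != "mzspec":
        raise ValueError(f"not a USI: {usi!r}")
    # The run name may itself contain colons, so look for the first
    # index-type token after at least one run-name segment.
    for i in range(3, len(parts) - 1):
        if parts[i] in INDEX_TYPES:
            interpretation = ":".join(parts[i + 2:]) or None
            return USI(parts[1], ":".join(parts[2:i]), parts[i],
                       parts[i + 1], interpretation)
    raise ValueError(f"no index type found in {usi!r}")


# Illustrative identifier only.
example = parse_usi(
    "mzspec:PXD000561:Adult_Frontalcortex_bRP_Elite_85_f09:scan:17555:VLHPLEGAVVIIFK/2"
)
print(example)
```

A run name that happens to contain a segment such as `:scan:` would confuse this simple left-to-right search, which is one reason to rely on a maintained parser for production use.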

QC: Across all of biological MS, efforts to increase repeatability and reproducibility of experimentation have become a priority. Communicating quality information in a standardized way has the potential to improve interaction between quality metric-generating software (e.g. reporting the fraction of all MS/MS scans being identified in an LC-MS/MS experiment) and quality decision-making frameworks (e.g. recognizing that an LC-MS/MS experiment is an outlier). The mzQC format specification should shortly be released for public comment. It is distinctive for being one of the first HUPO-PSI formats to employ lightweight JSON notation rather than XML, while continuing to define the terms employed in mzQC in a CV.
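The two roles described above, metric generation and quality decision-making, can be sketched in a few lines of Python. This is a hedged illustration only: the run names and counts are made up, the JSON keys (`run`, `metric`, `value`, `outlier`) are placeholders rather than the mzQC schema, and the outlier rule (a modified z-score based on the median absolute deviation, with a cut-off of 3.5) is one common choice assumed here, not a method prescribed by the PSI.

```python
import json
from statistics import median


def identified_fraction(n_identified: int, n_msms: int) -> float:
    """Metric generation: fraction of MS/MS scans that were identified."""
    if n_msms <= 0:
        raise ValueError("an experiment needs at least one MS/MS scan")
    return n_identified / n_msms


def flag_outliers(values: dict, threshold: float = 3.5) -> dict:
    """Decision-making: flag runs whose modified z-score exceeds the threshold."""
    med = median(values.values())
    mad = median(abs(v - med) for v in values.values())
    if mad == 0:
        # All runs are (nearly) identical; nothing can be called an outlier.
        return {name: False for name in values}
    return {name: abs(0.6745 * (v - med) / mad) > threshold
            for name, v in values.items()}


# Made-up counts: (identified MS/MS scans, total MS/MS scans) per run.
runs = {
    "run_A": (4200, 10000),
    "run_B": (4100, 10000),
    "run_C": (4300, 10000),
    "run_D": (900, 10000),
}

fractions = {name: identified_fraction(*counts) for name, counts in runs.items()}
flags = flag_outliers(fractions)

# Placeholder JSON layout, NOT the mzQC schema.
report = [
    {"run": name, "metric": "fraction of identified MS/MS scans",
     "value": fractions[name], "outlier": flags[name]}
    for name in runs
]
print(json.dumps(report, indent=2))
```

A shared format matters because the two functions would typically live in different tools: one program writes the metric values, and another reads them to decide which runs need attention.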

IDP: The recently formed Intrinsically Disordered Proteins group (https://www.psidev.info/groups/intrinsically-disordered-proteins) is tasked with understanding and sharing data on disordered regions of proteins, particularly developing annotation standards for structural and structure-function attributes. The group is starting to work on reporting guidelines, and on data standards for describing annotations in tab-based and XML formats.

What all of these efforts have in common is that proteomics (and metabolomics) researchers need to communicate their results in a way that is verifiable and trustworthy. We have already seen a sizable impact from these efforts. For example, analytical chemists who use SCIEX instruments can post both mzML-formatted data and WIFF files to a repository, to ensure that laboratories that do not have access to vendor software can still make use of their MS experiments. Another example is that bioinformatics researchers seeking to optimize the classification of PSMs into accepted and rejected collections can work from the mzIdentML PSMs reported for a wide variety of database search algorithms in thousands of experiments.

The PSI continues to make rapid progress in its mission to facilitate data sharing and access in the proteomics community, but needs additional participation from the community at all levels to maintain momentum. Beyond the need to support novel approaches and workflows in the field, a further point of interest is the integration of proteomics with other omics data types and approaches. Will you join us?