Virus Pop—Expanding Viral Databases by Protein Sequence Simulation - Institut Pasteur
Journal article, Viruses, Year: 2023

Virus Pop—Expanding Viral Databases by Protein Sequence Simulation

Abstract

The improvement of our knowledge of the virosphere, which includes unknown viruses, is a key area in virology. Metagenomics tools, which perform taxonomic assignation from high throughput sequencing datasets, are generally evaluated with datasets derived from biological samples or in silico spiked samples containing known viral sequences present in public databases, resulting in the inability to evaluate the capacity of these tools to detect novel or distant viruses. Simulating realistic evolutionary directions is therefore key to benchmark and improve these tools. Additionally, expanding current databases with realistic simulated sequences can improve the capacity of alignment-based searching strategies for finding distant viruses, which could lead to a better characterization of the “dark matter” of metagenomics data. Here, we present Virus Pop, a novel pipeline for simulating realistic protein sequences and adding new branches to a protein phylogenetic tree. The tool generates simulated sequences with substitution rate variations that are dependent on protein domains and inferred from the input dataset, allowing for a realistic representation of protein evolution. The pipeline also infers ancestral sequences corresponding to multiple internal nodes of the input data phylogenetic tree, enabling new sequences to be inserted at various points of interest in the group studied. We demonstrated that Virus Pop produces simulated sequences that closely match the structural and functional characteristics of real protein sequences, taking as an example the spike protein of sarbecoviruses. Virus Pop also succeeded at creating sequences that resemble real sequences not included in the databases, which facilitated the identification of a novel pathogenic human circovirus not included in the input database. In conclusion, Virus Pop is helpful for challenging taxonomic assignation tools and could help improve databases to better detect distant viruses.
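
The idea of substitution rates that vary by protein domain can be illustrated with a toy simulation. The sketch below is not Virus Pop's implementation, and the abstract does not specify which substitution model the pipeline uses. It assumes the simplest equal-rates (Poisson) model over the 20 amino acids, and the function names, the example ancestor sequence, and the per-site rates are all hypothetical. In the actual pipeline the rates are inferred from the input dataset and the ancestor is an inferred internal node of the phylogenetic tree.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def change_probability(rate, branch_length):
    """Probability that a site differs from its ancestor after evolving along
    a branch, under an equal-rates (Poisson) model with 20 states.
    This model choice is an assumption made for illustration only."""
    k = len(AMINO_ACIDS)
    return (k - 1) / k * (1.0 - math.exp(-k / (k - 1) * rate * branch_length))


def simulate_descendant(ancestor, site_rates, branch_length, rng):
    """Evolve `ancestor` along one branch, using a relative rate per site.
    Sites with rate 0.0 never change; higher rates change more often."""
    if len(ancestor) != len(site_rates):
        raise ValueError("ancestor and site_rates must have the same length")
    descendant = []
    for residue, rate in zip(ancestor, site_rates):
        if rng.random() < change_probability(rate, branch_length):
            # Draw a replacement uniformly among the 19 other amino acids.
            choices = [aa for aa in AMINO_ACIDS if aa != residue]
            descendant.append(rng.choice(choices))
        else:
            descendant.append(residue)
    return "".join(descendant)


# Hypothetical example: a conserved domain (rate 0.0) followed by a
# fast-evolving domain (rate 2.0).
ancestor = "MKTAYIAKQR"
site_rates = [0.0] * 5 + [2.0] * 5
descendant = simulate_descendant(ancestor, site_rates, 0.5, random.Random(42))
print(ancestor)
print(descendant)
```

In this example the first five residues always stay identical to the ancestor, while substitutions concentrate in the second half, which is the qualitative behavior that domain-dependent rates are meant to capture.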
Main file
viruses-15-01227.pdf (918.94 KB)
Origin: Publication funded by an institution
License: CC BY (Attribution)

Dates and versions

pasteur-04106241, version 1 (25-05-2023)

Cite

Julia Kende, Massimiliano Bonomi, Sarah Temmam, Béatrice Regnault, Philippe Pérot, et al. Virus Pop—Expanding Viral Databases by Protein Sequence Simulation. Viruses, 2023, 15 (6), pp.1227. ⟨10.3390/v15061227⟩. ⟨pasteur-04106241⟩