Genome Data Exploration Using Correspondence Analysis

Fredj Tekaia

doi:10.4137/BBi.s39614

Journal article, Bioinformatics and Biology Insights, Year: 2016

Role: Corresponding author
PersonId : 977378

Microbiologie structurale - Structural Microbiology

Abstract

Recent developments in sequencing technologies that allow the production of massive amounts of genomic and genotyping data have highlighted the need for synthetic data representation and pattern recognition methods that can mine such large data sets and help discover the biologically meaningful knowledge they contain. Correspondence analysis (CA) is an exploratory descriptive method designed to analyze two-way data tables, including some measure of association between rows and columns. It constructs linear combinations of variables, known as factors. CA has been used for decades to study high-dimensional data, and remarkable inferences from large data tables were obtained by reducing the dimensionality to a few orthogonal factors that correspond to the largest amount of variability in the data. Herein, I review CA and highlight its use by considering examples in handling high-dimensional data that can be constructed from genomic and genetic studies. Examples involving amino acid compositions of large sets of species (viruses, phages, yeast, and fungi), as well as an example related to pairwise shared orthologs in a set of yeast and fungal species, as obtained from their proteome comparisons, are considered. For the first time, results show striking segregations between yeasts and fungi as well as between viruses and phages. Distributions obtained from shared orthologs show clusters of yeast and fungal species corresponding to their phylogenetic relationships. A direct comparison with the principal component analysis method is discussed using a recently published example of genotyping data related to newly discovered traces of an ancient hominid that was compared to modern human populations in the search for ancestral similarities. CA offers more detailed results highlighting links between modern humans and the ancient hominid and their characterizations. Compared to the popular principal component analysis method, CA allows easier and more effective interpretation of results, particularly through its ability to relate individual patterns to their corresponding characteristic variables.
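To make the factor construction described in the abstract concrete, here is a minimal sketch of CA computed through a singular value decomposition of the standardized residuals of a two-way table. It assumes NumPy and is not the author's implementation; the function name `correspondence_analysis` and the toy count table are hypothetical and chosen only for illustration.

```python
import numpy as np

def correspondence_analysis(table):
    """Sketch of CA: returns row coordinates, column coordinates, and
    principal inertias. Assumes every row and column total is positive."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                  # correspondence matrix (relative frequencies)
    r = P.sum(axis=1)                # row masses
    c = P.sum(axis=0)                # column masses
    expected = np.outer(r, c)        # independence model r c^T
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates: rows and columns share the same factor space,
    # which is what permits their joint representation.
    row_coords = (U * sing) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sing) / np.sqrt(c)[:, None]
    inertias = sing ** 2             # variability carried by each factor
    return row_coords, col_coords, inertias

# Hypothetical toy table: 4 observations (rows) by 4 variables (columns).
counts = np.array([[30, 10,  5, 15],
                   [28, 12,  6, 14],
                   [ 8, 25, 20,  7],
                   [10, 22, 24,  9]])

rows, cols, inertias = correspondence_analysis(counts)
explained = inertias / inertias.sum()   # share of total inertia per factor
print(np.round(explained, 3))
```

Plotting the first two columns of `rows` and `cols` on the same axes gives the joint map of observations and variables; the total inertia equals the chi-square statistic of the table divided by its grand total.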

Keywords

high-dimensional data reduction; joint representation of observations and variables; data mining; shared orthologs; genome tree; bioinformatics; principal component analysis; correspondence analysis; amino acid composition

Domains

Bioinformatics, Systems Biology [q-bio.QM]

Main file

Tekaia_BioinforlaticsBiologyInsights_2016.pdf (2.14 MB)

Origin: Publication funded by an institution

Contributor: Fredj Tekaia

https://pasteur.hal.science/pasteur-01329327

Submitted on: Thursday, 9 June 2016, 09:26:53

Last modified on: Wednesday, 17 April 2024, 11:20:03

Dates and versions

pasteur-01329327, version 1 (09-06-2016)

License

Attribution - NonCommercial

Identifiers

HAL Id: pasteur-01329327, version 1
DOI: 10.4137/BBi.s39614

Cite

Fredj Tekaia. Genome Data Exploration Using Correspondence Analysis: Correspondence Analysis and Genome Data. Bioinformatics and Biology Insights, 2016, 10, pp.59-72. ⟨10.4137/BBi.s39614⟩. ⟨pasteur-01329327⟩

Collections

PASTEUR UNIV-PARIS7 CNRS USPC

105 views

245 downloads
