Systematic detection of divergent brain proteins in human evolution and their roles in cognition

Abstract The human brain differs from that of other primates, but the underlying genetic mechanisms remain unclear. Here we measured the evolutionary pressures acting on all human protein-coding genes (N=17,808) based on their divergence from early hominins such as Neanderthal, and non-human primates. We confirm that genes encoding brain-related proteins are among the most conserved of the human proteome. Conversely, several of the most divergent proteins in humans compared to other primates are associated with brain-associated diseases such as micro/macrocephaly, dyslexia, and autism. We identified specific expression profiles of a set of divergent genes in ciliated cells of the cerebellum, that might have contributed to the emergence of fine motor skills and social cognition in humans. This resource is available at http://neanderthal.pasteur.fr and can be used to estimate evolutionary constraints acting on a set of genes and to explore their relative contribution to human traits.

coding genes in the human genome. Combining our evolutionary pressure metrics for the 23 protein-coding genome with recent datasets, we found that this conservation applied to genes 24 functionally associated with the synapse and expressed in brain structures such as the 25 prefrontal cortex and the cerebellum. Conversely, several of the protein-coding genes that 26 diverge most in hominins relative to other primates are associated with brain-associated 27 diseases, such as micro/macrocephaly, dyslexia, and autism. We also showed that cerebellum 28 granule neurons express a set of divergent protein-coding genes that may have contributed to 29 the emergence of fine motor skills and social cognition in humans. This resource is available 30 from http://neanderthal.pasteur.fr and can be used to estimate evolutionary constraints acting 31 on a set of genes and to explore their relative contributions to human traits. 32

Introduction 33
Modern humans (Homo sapiens) can perform complex cognitive tasks well and communicate 34 with their peers [1]. Anatomic differences between the brains of humans and other primates 35 are well documented (e.g. cortex size, prefrontal white matter thickness, lateralization), but 36 the way in which the human brain evolved remains a matter of debate [2]. A recent study of 37 endocranial casts of Homo sapiens fossils indicates that, brain size in early Homo sapiens, 38 300,000 years ago, was already within the range of that in present-day humans [3]. However, 39 brain shape, evolved more gradually within the Homo sapiens lineage, reaching its current 40 form between about 100,000 and 35,000 years ago. It has also been suggested that the 41 enlargement of the prefrontal cortex relative to the motor cortex in humans is mirrored in the 42 cerebellum by an enlargement of the regions of the cerebellum connected to the prefrontal 43 cortex [4]. These anatomic processes of tandem evolution in the brain paralleled the 44 emergence of motor and cognitive abilities, such as bipedalism, planning, language, and 45 social awareness, which are particularly well developed in humans. 46 Genetic differences in primates undoubtedly contributed to these brain and cognitive 47 differences, but the genes or variants involved remain largely unknown. Indeed,48 demonstrating that a genetic variant is adaptive requires strong evidence at both the genetic 49 and functional levels. Only few genes have been shown to be human-specific. They include 50 SRGAP2C [5], ARHGAP11B [6] and NOTCH2NL [7], which emerged through recent gene 51 duplication in the Homo lineage [8]. Remarkably, the expression of these human specific 52 genes in the mouse brain expand cortical neurogenesis [6,7,9,10]. Several genes involved in 53 brain function have been shown to display accelerated coding region evolution in humans. 54 For example, FOXP2 has been associated with verbal apraxia and ASPM with microcephaly 55 [11,12]. Functional studies have also shown that mice carrying a "humanized" version of 56 FOXP2 display qualitative changes in ultrasonic vocalization [13]. However, these reports 57 targeting only specific genes sometimes provide contradictory results [14]. Other studies 58 have reported sequence conservation to be stronger in the protein-coding genes of the brain 59 than in those of other tissues [15][16][17], suggesting that the main substrate of evolution in the 60 brain is regulatory changes in gene expression [18][19][20] and splicing [21]. In addition, several 61 recent studies have recently explored the genes subjected to the highest degrees of constraint 62 during primate evolution or in human populations, to improve estimations of the 63 pathogenicity of variants identified in patients with genetic disorders [22,23] We describe here an exhaustive screening of all protein-coding genes for conservation 67 and divergence from the common primate ancestor, making use of rich datasets of brain 68 single-cell transcriptomics, proteomics and imaging to investigate the relationships between 69 these genes and brain structure, function, and diseases.

Figure 1 Evolution of protein-coding genes across tissues and biological functions. (a)
Results 88

Strong conservation of brain protein-coding genes 89
We first compared the sequences of modern humans, archaic humans, and other primates to 90 revealed a particularly strong conservation of genes encoding proteins involved in nervous 112 system development (OR=1.2, p=2.4e-9) and synaptic transmission (OR=1.35, p=1.7e-8). 113 We investigated the possible enrichment of specific tissues in conserved and 114 divergent proteins by analyzing RNAseq (Illumina Bodymap2 and GTEx), microarray and 115 proteomics datasets (Methods). For expression data, we evaluated the specificity of genes by 116 normalizing their profile across tissues ( Supplementary Fig. 3). The results confirmed a 117 higher degree of conservation for protein-coding genes expressed in the brain (Wilcoxon rank 118 correlation (rc)=-0.1, p=4.1e-12, bootstrap corrected for gene length and GC content) than for 119 those expressed elsewhere in the body, with the greatest divergence observed for genes 120 expressed in the testis (Wilcoxon rc=0.3, p=7.8e-11, bootstrap corrected for gene length and 121 GC content; Fig. 1d Table 3). Indeed, it has been 137 suggested that the prefrontal cortex is one of the most divergent brain structure in human 138 evolution [32], this diversity being associated with high-level cognitive function [33]. Only 139 one brain structure was more divergent than expected: the superior cervical ganglion 140 (Wilcoxon rc=0.22, p=1e-6, bootstrap corrected for gene length and GC content). This 141 structure provides sympathetic innervation to many organs and is associated with the archaic 142 functions of fight-or-flight response. The divergent genes expressed in the superior cervical 143 ganglion include CARF, which was found to be specifically divergent in the genus Homo. 144 This gene encodes a calcium-responsive transcription factor that regulates the neuronal 145 activity-dependent expression of BDNF [34] and a set of singing-induced genes in the song 146 nuclei of the zebra finch, a songbird capable of vocal learning [35]. This gene had a raw 147 dN/dS of 2.44 (7 non-synonymous vs 1 synonymous mutations in Homo sapiens compared to 148 the common primate ancestor) and was found to be one of the most divergent protein-coding 149 genes expressed in the human brain. 150 We then investigated the possible enrichment of conserved and divergent genes in 151 brain-specific gene ontology terms. All pathways displayed high overall levels of 152 conservation, but genes encoding proteins involved in glutamatergic and GABAergic 153 neurotransmission were generally more conserved (Wilcoxon rc=-0.25; p=9.8e-6, Bonferroni 154 corrected) than those encoding proteins involved in dopamine and peptide neurotransmission 155 and intracellular trafficking (  Table  156 3). The recently released ontology of the synapse provided by the SynGO consortium 157 (http://syngoportal.org) was incorporated into this analysis, not only confirming the globally 158 strong conservation of the synapse, but also revealing its close relationship to trans-synaptic 159 signaling processes (Wilcoxon rc=-0.21, p=4.5e-5, Bonferroni corrected) and to postsynaptic  Table 4) and those fixed in the modern Homo sapiens population 179 (neutrality index<1), to ensure that we analyzed the most-divergent protein-coding genes. 180 Only 126 of these 352 highly divergent protein-coding genes were brain-related 181 (impoverishment for brain genes, Fisher's exact test OR=0.66, p=1e-4), listed as synaptic 182 genes [36,37], specifically expressed in the brain (+2SD for specific expression) or related to 183 a brain disease (extracted systematically from Online Mendelian Inheritance in Man -184 OMIM: https://www.omim.org and Human Phenotype Ontology -HPO: 185 https://hpo.jax.org/app/). For comparison, we also extracted the 427 most strongly conserved 186 protein-coding genes, 290 of which were related to the brain categories listed above 187 (enrichment for brain genes, Fisher's exact test OR=1.26, p=0.0032). Given that most protein coding divergence occurs in testes and that the flagella of sperm and 230 cilia of other cells are structurally related, is it possible that the enrichment of ciliated cells 231 among the most divergent genes could be another feature of testis rather than brain 232 divergence. However, only TTLL6 is highly expressed in testes, suggesting a neural relevance 233 for DYNC2LI1, LCA5, and TOP3A. Interestingly, some of these protein coding genes are also 234 involved in human brain-related ciliopathies such as Joubert syndrome [43] and microcephaly 235 (see below). A similar single-cell transcriptomic analysis of the human cerebral cortex [41] 236 revealed no such strong divergent pattern in any cell type (Supplementary Figure 9). 237 Finally, we assessed the potential association with brain functions, by extracting 19,244 brain 238 imaging results from 315 fMRI-BOLD studies (T and Z score maps; see Supplementary 239 Table 7 for the complete list) from NeuroVault [44] and comparing the spatial patterns 240 observed with the patterns of gene expression in the Allen Brain atlas [45,46]. The 241 correlation between brain activity and divergent gene expression was stronger in subcortical 242 structures than in the cortex (Wilcoxon rc=0.14, p=2.5e-248). The brain activity maps that 243 correlate with the expression pattern of the divergent genes (see Supplementary Table 8 Table 4). We also identified 31 highly divergent protein-260 coding genes associated (based on OMIM and HPO data) with several human diseases or 261 conditions, such as micro/macrocephaly, autism or dyslexia. We also identified divergent protein-coding genes associated with communication 282 disorders (Fig. 3c), such as autism (CNTNAP4, AHI1, FAN1, SNTG2 and GRIP1) and 283 dyslexia (KIAA0319). Interestingly, these genes diverged from the common primate ancestor 284 only in the hominin lineage, and were strongly conserved in all other taxa (Fig. 6). They all 285 have roles relating to neuronal connectivity (neuronal migration and synaptogenesis) and, 286 within the human brain, were more specifically expressed in the cerebellum, except for 287 GRIP1, which was expressed almost exclusively in the cortex. We also identified the dyslexia susceptibility gene KIAA0319, encoding a protein 306 involved in axon growth inhibition [51,52], as one of the most divergent brain protein-coding 307 genes in humans relative to the common primate ancestor (raw dN/dS=3.9; 9 non-308 synonymous vs 1 synonymous mutations in Homo sapiens compared to the common primate 309 ancestor). The role of KIAA0319 in dyslexia remains a matter of debate, but its rapid 310 evolution in the hominoid lineage warrants further genetic and functional studies. 311 Finally, several genes display very high levels of divergence in Homo sapiens, but 312 their functions or association with disease remain unknown. For example, the zinc finger 313 protein ZNF491 (raw dN/dS=4.7; 14 non-synonymous vs 1 synonymous mutations in Homo 314 sapiens compared to the common primate ancestor) is specifically expressed in the 315 cerebellum and is structurally similar to a chromatin remodeling factor, but its biological role 316 remains to be determined. Another example is the CCP110 gene, encoding a centrosomal 317 protein resembling ASPM, but not associated with a disease. Its function suggests that this 318 divergent protein-coding gene would be a compelling candidate for involvement in 319 microcephaly in humans. A complete list of the most conserved and divergent protein-coding 320 genes is available in Supplementary Table 4  The emergence of a large cortex was undoubtedly an important step for human cognition, but 378 other parts of the brain, such as the cerebellum, may also have made major contributions to 379 both motricity and cognition. In this study, we showed that the protein-coding genes 380 expressed in the cerebellum were among the most conserved in humans. However, we also 381 identified a set of divergent protein-coding genes with relatively strong expression in the 382 cerebellum and/or for which mutations affected cerebellar function. As discussed above,

Gene sets 469
We used different gene sets, starting at the tissue level and then focusing on the brain and key 470 pathways. For body tissues, we used Illumina Body Map 2.0 RNA-Seq data, corresponding to 471 16 human tissue types: adrenal, adipose, brain, breast, colon, heart, kidney, liver, lung, 472 lymph, ovary, prostate, skeletal muscle, testes, thyroid, and white blood cells (for more values cross all available replicates, to guarantee a high signal-to-noise ratio. We then 506 calculated the values for the associated genes in Homo sapiens according to the paralogous 507 correspondence between humans and mice (Ensembl Biomart accessed on February 23, 508 2019). 509

Gene nomenclature 510
We extracted all the EntrezId of the protein-coding genes for Grc37 from Ensembl Biomart. 511 We used the HGNC database to recover their symbols. For the 46 unmapped genes, we 512 searched the NCBI database manually for the official symbol. 513

McDonald-Kreitman-test (MK) and neutrality index (NI) 514
We assessed the possible fixation of variants in the Homo sapiens population by first 515 calculating the relative ratio of non-synonymous to synonymous polymorphism (pN/pS) from 516 the 1000 Genomes VCF for all SNPs, for SNPs with a minor allele frequency (MAF) <1% 517 and >5%. SNPs were annotated with ANNOVAR across 1000 Genomes Project (ALL+5 518 ethnicity groups), ESP6500 (ALL+2 ethnicity groups), ExAC (ALL+7 ethnicity groups), and 519 CG46 (see http://annovar.openbioinformatics.org/en/latest/user-guide/filter/#popfreqmax-520 and-popfreqall-annotations for more details). We then performed the McDonald-Kreitman 521 test by calculating the neutrality index (NI) as the ratio of raw pN/pS and dN/dS values [74]. 522 We considered the divergent genes to be fixed in the population when NI < 1. 523

Protein-protein interaction network 524
We plotted the protein-protein interaction ( supported by at least two traceable pieces of evidence (publications and/or methods).

NeuroVault analyses 539
We used the NeuroVault website [44] to collect 19,244 brain imaging results from fMRI-540 BOLD studies (T and Z score maps) and their correlation with the gene expression data [46] 541 of the Allen Brain atlas [45]. The gene expression data of the Allen Brain atlas were 542 normalized and projected into the MNI152 stereotactic space used by NeuroVault, using the 543 spatial coordinates provided by the Allen Brain Institute. An inverse relationship between 544 cortical and subcortical expression dominated the pattern of expression for many genes. We 545 therefore calculated the correlations for the cortex and subcortical structures separately. 546

Allen Brain data 547
We downloaded the Allen Brain atlas microarray-based gene data from the Allen Brain 548 website (accessed January 19, 2018 at http://www.brain-map.org). Microarray data were 549 available for six adult brains; the right hemisphere was missing for three donors so we 550 considered only the left hemisphere for our analyses. For each donor, we averaged probes 551 targeting the same gene and falling in the same brain area. We then subjected the data to log 552 normalization and calculated Z-scores: across the 20787 genes for each brain region to obtain 553 expression levels; across the 212 brain areas for each gene to obtain expression specificity. 554 For genes with more than one probe, we averaged the normalized values over all probes 555 available. As a complementary dataset, we also used a mapping of the Allen Brain Atlas onto 556 the 68 brain regions of the . We 568 used all 12,400 genes as the background. We eliminated redundancy, by first filtering out all 569 the statistically significant Gene Ontology (GO) terms associated with fewer than 10 or more 570 than 1000 genes, and then combining the remaining genes with the EnrichmentMap plugin 571 [83]. We used a P-value cutoff of 0.005, an FDR Q-value cutoff of 0.05, and a Jaccard 572 coefficient of 0.5. 573 For the cell type-specific expression Aanalysis (CSEA; 86), we used the CSEA method with 574 the online tool http://genetics.wustl.edu/jdlab/csea-tool-2/. This method associates gene lists 575 with brain expression profiles across cell types, regions, and time periods. 576 Wilcoxon and rank-biserial correlation: We investigated the extent to which each gene set 577 was significantly more conserved or divergent than expected by chance, by performing 578 Wilcoxon tests on the normalized dN/dS values (ω GC ) for the genes in the set against zero 579 (the mean value for the genome). We quantified effect size by matched pairs rank-biserial 580 correlation, as described by Kerby [85]. Following non-parametric Wilcoxon signed-rank 581 tests, the rank-biserial correlation was evaluated as the difference between the proportions of 582 negative and positive ranks over the total sum of ranks: 583 permutations, we randomly selected the same number of genes as for the sample of genes 598 from the total set of genes for which dN/dS was not missing. We corrected for CCDS length 599 and GC content by bootstrap resampling. We estimated significance, to determine whether 600 the null hypothesis could be rejected, by calculating the number of bootstrap draws ‫ܤ(‬ ) 601 falling below and above the observed measurement (݉