s13059-014-0459-2 1465-6906 Research <p>Genome analysis of a major urban malaria vector mosquito, <it>Anopheles stephensi</it> </p> Jiang Xiaofang jiangx@vt.edu Peery Ashley peerya2@vt.edu Hall Brantley A brant@vt.edu Sharma Atashi atashi04@vt.edu Chen Xiao-Guang xgchen2001@hotmail.com Waterhouse M Robert robert.waterhouse@gmail.com Komissarov Aleksey ad3002@gmail.com Riehle M Michelle mriehle@umn.edu Shouche Yogesh yogesh@nccs.res.in Sharakhova V Maria msharakh@vt.edu Lawson Dan lawson@ebi.ac.uk Pakpour Nazzy npakpour@ucdavis.edu Arensburger Peter arensburger@gmail.com Davidson L M Victoria victorea@ksu.edu Eiglmeier Karin kei@pasteur.fr Emrich Scott semrich@nd.edu George Phillip phippl3@vt.edu Kennedy C Ryan Ryan.C.Kennedy@alumni.nd.edu Mane P Shrinivasrao smane@vbi.vt.edu Maslen Gareth gmaslen@ebi.ac.uk Oringanje Chioma chyoma12@yahoo.com Qi Yumin yuqi@vt.edu Settlage Robert rsettlage@vbi.vt.edu Tojo Marta marta.tojo@gmail.com Tubio M C Jose jmctubio@gmail.com Unger F Maria munger1@nd.edu Wang Bo bwangucd@gmail.com Vernick D Kenneth kenneth.vernick@pasteur.fr Ribeiro M C Jose jribeiro@niaid.nih.gov James A Anthony aajames@uci.edu Michel Kristin kmichel@ksu.edu Riehle A Michael mriehle@ag.arizona.edu Luckhart Shirley sluckhart@ucdavis.edu Sharakhov V Igor igor@vt.edu Tu Zhijian jaketu@vt.edu

Program of Genetics, Bioinformatics, and Computational Biology, Virginia Tech, Blacksburg, VA, USA

Department of Biochemistry, Virginia Tech, Blacksburg, VA, USA

Department of Entomology, Virginia Tech, Blacksburg, VA, USA

Department of Pathogen Biology, Southern Medical University, Guangzhou, Guangdong, China

Department of Genetic Medicine and Development, University of Geneva Medical School, rue Michel-Servet 1, Geneva, 1211, Switzerland

Swiss Institute of Bioinformatics, rue Michel-Servet 1, Geneva, 1211, Switzerland

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar Street, Cambridge, MA, USA

The Broad Institute of MIT and Harvard, 7 Cambridge Center, Cambridge, MA, USA

Theodosius Dobzhansky Center for Genome Bioinformatics, St. Petersburg State University, St. Petersburg, Russia

Institute of Cytology Russian Academy of Sciences, St. Petersburg, Russia

Department of Microbiology, University of Minnesota, Minneapolis, MN, USA

National Center for Cell Science, Pune University Campus, Ganeshkhind, Pune, India

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SD, Cambridge, UK

Department of Medical Microbiology and Immunology, University of California, Davis, CA, USA

Biological Sciences Department, California State Polytechnic University, Pomona, CA, USA

Division of Biology, Kansas State University, Manhattan, KS, USA

Department of Parasitology and Mycology, Unit of Insect Vector Genetics and Genomics, Institut Pasteur, Paris, France

CNRS Unit of Hosts, Vectors and Pathogens (URA3012), Paris, France

Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA

Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA, USA

Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA, USA

Department of Entomology, University of Arizona, Tucson, AZ, USA

Department of Physiology, School of Medicine – CIMUS, Instituto de Investigaciones Sanitarias, University of Santiago de Compostela, Santiago de Compostela, Spain

Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, UK

Department of Biological Sciences, University of Notre Dame, Notre Dame, IN, USA

Section of Vector Biology, Laboratory of Malaria and Vector Research, National Institute of Allergy and Infectious Diseases, Rockville, MD, USA

Departments of Microbiology & Molecular Genetics and Molecular Biology & Biochemistry, University of California, Irvine, CA, USA

Genome Biology 1465-6906 2014 15 9 459 http://genomebiology.com/2014/15/9/459 10.1186/s13059-014-0459-2
27 4 2014 3 9 2014 23 9 2014 2014 Jiang et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Abstract

Background

Anopheles stephensi is the key vector of malaria throughout the Indian subcontinent and Middle East and an emerging model for molecular and genetic studies of mosquito-parasite interactions. The type form of the species is responsible for the majority of urban malaria transmission across its range.

Results

Here, we report the genome sequence and annotation of the Indian strain of the type form of An. stephensi. The 221 Mb genome assembly represents more than 92% of the entire genome and was produced using a combination of 454, Illumina, and PacBio sequencing. Physical mapping assigned 62% of the genome onto chromosomes, enabling chromosome-based analysis. Comparisons between An. stephensi and An. gambiae reveal that the rate of gene order reshuffling on the X chromosome was three times higher than that on the autosomes. An. stephensi has more heterochromatin in pericentric regions but less repetitive DNA in chromosome arms than An. gambiae. We also identify a number of Y-chromosome contigs and BACs. Interspersed repeats constitute 7.1% of the assembled genome while LTR retrotransposons alone comprise more than 49% of the Y contigs. RNA-seq analyses provide new insights into mosquito innate immunity, development, and sexual dimorphism.

Conclusions

The genome analysis described in this manuscript provides a resource and platform for fundamental and translational research into a major urban malaria vector. Chromosome-based investigations provide unique perspectives on Anopheles chromosome evolution. RNA-seq analysis and studies of immunity genes offer new insights into mosquito biology and mosquito-parasite interactions.

1 Background

Mosquitoes in the genus Anopheles are the primary vectors of human malaria parasites and the resulting disease is one of the most deadly and costly in history 1 , 2 . Publication and availability of the Anopheles gambiae genome sequence accelerated research that has not only enhanced our basic understanding of vector genetics, behavior, and physiology and roles in transmission, but also contributed to new strategies for combating malaria 3 . Recent application of next-generation sequencing technologies to mosquito genomics offers exciting opportunities to expand our understanding of mosquito biology in many important vector species and harness the power of comparative genomics. Such information will further facilitate the development of new strategies to combat malaria and other mosquito-borne diseases. An. stephensi is among approximately 60 species considered important in malaria transmission and is the key vector of urban malaria on the Indian subcontinent and the Middle East 4 , 5 . The fact that a recent resurgence of human malaria in Africa could have been caused by the sudden appearance of An. stephensi indicates that An. stephensi may pose an even greater risk to human health in the future 6 . Of the three forms, type, mysorensis, and intermediate, the former is responsible for the majority, if not all, of urban malaria transmission across its range and accounts for approximately 12% of all transmission in India 7 . Thus efforts to control it can be expected to contribute significantly to the malaria eradication agenda 8 , 9 . An. stephensi is amenable to genetic manipulations such as transposon-based germline transformation 10 , genome-wide mutagenesis 11 , site-specific integration 12 , genome-editing 13 , and RNAi-based functional genomics analysis 14 . Our understanding of the interactions between An. stephensi and the malaria parasites is rapidly improving 15 - 20 . Thus An. stephensi is emerging as a model species for genetic and molecular studies. We report the draft genome sequence of the Indian strain of the type form of An. stephensi as a resource and platform for fundamental and translational research. We also provide unique perspectives on Anopheles chromosome evolution and offer new insights into mosquito biology and mosquito-parasite interactions.

2 Results and discussion

2.1 Draft genome sequence of An. stephensi: Assembly and verification

The An. stephensi genome was sequenced using 454 GS FLX, Illumina HiSeq, and PacBio RS technologies (Additional file 1: Table S1). The 454 reads comprised 19.4× coverage: 12.2× from single-end reads, 2.2× from 3 kilobase (kb) paired-end reads, 3.4× from 8 kb paired-end reads, and 1.7× from 20 kb paired-end reads. The majority of 454 reads was in the range of 194 to 395 base-pairs (bp) in length. A single lane of Illumina sequencing of male genomic DNA resulted in 86.4× coverage of 101 bp paired-end reads with an average insert size of approximately 200 bp. Ten cells of PacBio RS sequencing of male genomic DNA produced 5.2× coverage with a median length of 1,295 bp. A hybrid assembly combining 454 and Illumina data produced a better overall result than using 454 data alone (Materials and methods). The resulting assembly was further improved by filling gaps with error-corrected PacBio reads and scaffolding with BAC-ends. The current assembly, verified using various methods, contains 23,371 scaffolds spanning 221 Mb. The assembly includes 11.8 Mb (5.3%) of gaps filled with Ns (Table 1), which is slightly lower than the size of gaps in the An. gambiae assembly (20.7 Mb, 7.6%). The N50 scaffold size is 1.59 Mb and the longest scaffold is 5.9 Mb. The number of scaffolds is inflated because we choose to set the minimum scaffold length to 500 bp to include repeat-rich short scaffolds. The assembled size of 221 Mb is consistent with the previous estimate of the An. stephensi genome size of approximately 235 Mb 21 .

<p>Table 1</p>

Statistic

Value

Scaffolds (n)

23,371

Scaffold N50 size

1,591,355

Maximum scaffold length

5,975,090

Minimum scaffold length

486

Total length of scaffolds

221,309,404

Percent Ns

5.35%

Contigs (n)

31,761

Contig N50 size

36,511

Maximum contig length

475,937

Minimum contig length

347

Total length of contigs

209,483,518

GC percent

44.80%

Assembly statistics

2.2 Physical mapping

Mapping of 227 probes was sufficient to assign 86 scaffolds to unique positions on the An. stephensi polytene chromosomes (Figure 1; Table 2; Additional file 2: Physical Map Data). These 86 scaffolds comprise 137.14 Mb or 62% of the assembled genome. Our physical map includes 28 of the 30 largest scaffolds and we were able to determine the orientation of 32 of the 86 scaffolds. We expect that relatively little of the heterochromatin was captured in our chromosomal assembly based on the morphology of the chromosomes in regions to which the scaffolds mapped. For this reason, subsequent comparisons with An. gambiae on molecular features of the genome landscape exclude regions of known heterochromatin from the An. gambiae dataset. An. stephensi and An. gambiae have different chromosome arm associations with 2L of An. gambiae homologous to 3L of An. stephensi 22 . Therefore, all ensuing discussion of synteny between the two species refers to An. stephensi chromosome arms listed in homologous order to those of An. gambiae: X, 2R, 3L, 3R, and 2L. While draft genomes also are available for An. darlingi and An. sinensis 23 , 24 , we focused our comparative analysis on An. stephensi and An. gambiae, the only two species that have chromosome-based assembly.

<p>Figure 1</p>

Physical map.

Physical map. A physical map of the An. stephensi genome was created from FISH on polytene chromosomes comprising 227 probes and 86 scaffolds. These 86 scaffolds comprise 137.14 Mb or 62% of the An. stephensi genome. Orientation was assigned to 32 of the 86 scaffolds. The physical map includes 28 of the 30 largest scaffolds.

<p>Table 2</p>

Arm

Scaffolds per arm (n)

Length (Mb)

Mapped genome (%)

Total genome (%)

Scaffolds mapped to each chromosome, total bp to each chromosome, percent of the predicted genome covered.

X

9

14.95

10.90

6.77

2R

21

39.50

28.80

17.87

2L

15

22.40

16.33

10.14

3R

24

37.83

27.59

17.12

3L

17

22.45

16.37

10.16

Total

86

137.14

100

62.05

Physical map information

2.3 Gene annotation

A total of 11,789 protein-encoding genes were annotated using a combination of homology and de novo prediction. These gene models have been submitted to the NCBI (GCA_000300775.2) and are hosted in VectorBase 25 . The average transcript length was 3,666 bp and the average number of exons per transcript was 4.18. Evolutionary relationships among An. stephensi and other dipteran insects were evaluated by constructing a maximum likelihood molecular species phylogeny using universal single-copy orthologs (Figure 2A). An. stephensi and An. gambiae form a well-supported clade representing the subgenus Cellia within the genus Anopheles. This phylogeny provides the evolutionary context for current and future comparative genomics analysis. A total 10,492 (89.0%) of the 11,789 predicted An. stephensi protein-encoding genes had orthologs in An. gambiae, Aedes aegypti, and Drosophila melanogaster (Figure 2).

<p>Figure 2</p>

Molecular species phylogeny and orthology. (A)

Molecular species phylogeny and orthology. (A) The maximum likelihood molecular species phylogeny estimated from universal single-copy orthologs supports the recognized species relationships with An. stephensi and An. gambiae in subgenus Cellia within the genus Anopheles. (B) Comparative analysis of orthologs from An. stephensi, An. gambiae, Ae. aegypti, and D. melanogaster. Orthologous genes were retrieved from OrthoDB. A total of 7,305 genes were shared among all four species, 1,297 genes were specific to An. stephensi, 653 genes were Anopheles-specific, and 1,863 genes were mosquito-specific.

2.4 Global transcriptome analysis

Eleven RNA-seq samples were prepared from 0 to 1, 2 to 4, 4 to 8, and 8 to 12 h post-egg deposition embryos, larvae, pupae, adult males, adult females, non-blood-fed ovaries, blood-fed ovaries, and 24 h post-blood-fed female carcasses without ovaries 26 . The corresponding genes were clustered into 20 distinct groups in sizes in the range of 8 to 2,106 genes per group on the basis of similar expression patterns (Figure 3). Many of the clusters correspond to either a specific developmental stage or sex (Additional file 2). A search for over-represented gene ontology (GO) terms in the 20 clusters found that many of the co-regulated genes have similar inferred functions or roles. Adult females require a protein-rich blood-meal for oogenesis and thus are the most interesting sex from a health perspective. Genes in clusters 1, 10, and 17 are induced in the female soma after blood-feeding. These clusters are enriched for genes encoding proteins with proteolytic activity, including serine peptidases, and involved in blood-meal digestion. Mosquitoes have undergone lineage-specific amplification of serine peptidases when compared to Drosophila, many of which are found in the three clusters described above. Cluster 9 contains 258 genes that showed peak expression in the pupal stage and it is enriched for genes whose products are involved in exoskeleton development. GO analyses of other clusters are described in the Additional file 1: Text.

<p>Figure 3</p>

Gene clustering according to expression profile.

Gene clustering according to expression profile. Twenty groups of genes were clustered by expression profile. The expression profiles used for grouping were generated using 11 RNA-seq samples spanning developmental time points including: 0 to 1, 2 to 4, 4 to 8, and 8 to 12 h embryos, larva, pupa, adult males, adult females, non-blood-fed ovaries, blood-fed ovaries, and 24 h post-blood-fed female carcass without ovaries. Male stage are colored blue, female stages are colored green, ovary samples are colored yellow, embryo samples are colored red, larva samples are colored pink, and pupa samples are colored purple. Many of these clusters correspond to either a specific developmental stage or specific sex.

We identified 241 and 313 genes with female- or male-biased expression, respectively (Additional file 2: Sex-biased genes list and GO terms). The male-biased genes are enriched for those whose products are involved in spermatogenesis and the auditory perception. Male mosquitoes detect potential mates using their Johnston’s organ, which has twice the number of sensory neurons as that of the females 27 , 28 . The female-biased genes are enriched for those whose products are involved in proteolysis and other metabolic processes likely relevant to blood digestion.

2.5 Immunity genes

Manual annotation was performed on genes involved in innate immunity including those that encode the LRR immune (LRIM) and the Anopheles Plasmodium-responsive leucine-rich repeat 1 (APL1) proteins, and the genes of the Toll, immune deficiency (IMD), insulin/insulin-like growth factor signalling (IIS), mitogen-activated protein kinase (MAPK), and TGF-β signalling pathways. A number of studies have demonstrated the importance of these genes or pathways in mosquito defense against parasites or viruses 16 - 20 , 29 - 31 . Manual analysis showed overall agreement with the automated annotation and improved the gene models in some cases (Additional file 2). A high level of orthology is generally observed between An. stephensi and An. gambiae and we highlight here a few potentially interesting exceptions. An. stephensi may have only one APL1 gene (ASTEI02571) instead of the three APL1 gene cluster found in An. gambiae (Additional file 1: Figure S1). We also observed the apparent lack of TOLL1B and 5B sequences in An. stephensi, which in An. gambiae are recent duplications of TOLL1A and 5A, respectively.

Expression profiles of all immunity genes were analyzed using the 11 RNA-seq samples to provide insights into their biological functions (Additional file 2: RNA-seq expression profile of immunity-related genes). For example, FKBP12, a protein known to regulate both transforming growth factor (TGF)-β and target of rapamycin (TOR) signaling, showed abundant transcript levels across immature stages and adult tissues (Additional file 1: Figure S2). The high expression levels of AsteFKBP12 in all examined stages and tissues were unexpected. Examination of existing publicly-available microarray data confirmed these expression levels and patterns 32 . FKBP12 in mammals forms a complex with rapamycin and FKBP-rapamycin-associated protein (FRAP) to inhibit TOR 33 . Given that TOR signaling is fundamental to many biological functions in mammals 34 and cumulative data support the same for D. melanogaster 35 , a high level of FBKP12 expression may be critical for tight regulation of TOR activity in An. stephensi and perhaps An. gambiae 36 . Expression patterns of the An. gambiae FKBP12 ortholog, AGAP012184, from microarray datasets 37 support the hypothesis that this protein is involved in a broad array of Anopheline physiologies including: development, blood-feeding, molecular form-specific insecticide resistance, circadian rhythms, desiccation resistance, mating status, and possibly also broad regulation of infection based on studies with murine (Plasmodium berghei) and human (Plasmodium falciparum) malaria parasites. Whether these same physiologies and others are regulated by FKBP12 in An. stephensi will require experimental confirmation. Given that signalling pathways regulating embryonic pattern formation in Drosophila (for example, the Toll pathway 38 ) have been co-opted in the adult fly for regulation of various physiologies including metabolism and immune defense, the data presented here support the hypothesis that pathways integral to adult biology in adult Anophelines also have been similarly co-opted from important developmental roles.

2.6 Salivary genes

Saliva of blood-feeding arthropods contains a cocktail of pharmacologically active components that disarm vertebrate host’s blood clotting and platelet aggregation, induce vasodilation, and affect inflammation and immunity. These salivary proteins are under accelerated evolution due most likely to their host’s immune pressure. A previous salivary gland transcriptome study identified 37 corresponding salivary proteins in An. stephensi, most of which are shared with An. gambiae, including mosquito and Anopheles-specific protein families 39 . A more extensive sialotranscriptome based on approximately 3,000 ESTs identified the templates for 71 putative secreted proteins for An. gambiae 40 . The combined data verify the identity of 71 putative salivary secreted proteins for An. stephensi, seven of which have no similarities to An. gambiae proteins (Additional file 2: Automatic annotated salivary genes). The current assembly of the An. stephensi genome shows that many salivary gland genes are present as tandem repeated genes and represent families that arose by gene duplication events. Tandem repeated gene families often are poorly annotated by automated approaches, therefore, manual annotation was necessary to improve the salivary gland gene models (Additional file 2). In particular, An. gambiae has eight genes of the D7 family, which has modified odorant binding domains (OBD) that strongly bind agonists of platelet aggregation and vasoconstriction (histamine, serotonin, epinephrine, and norepinephrine) 41 . Three of these genes have two OBDs while the remaining five have only one domain each. As in An. gambiae, the short forms are oriented in tandem and in the opposite orientation of the long-form genes. However, An. stephensi has apparently collapsed the second long form to create a sixth short form.

2.7 Comparative analysis of additional gene families

Functional annotations of a number of gene families in An. stephensi were obtained based on their InterPro ID 42 (Additional file 2: Gene families counts table). We also compared gene numbers in these gene families across several species. An. stephensi and An. gambiae showed similar gene numbers in most of the gene families 3 and this is consistent with the close phylogenetic relationship between the two species. As observed with manually annotated immunity-related genes (Additional file 1: Figure S3), strong one-to-one relationship was observed between An. stephensi and An. gambiae genes in odorant binding proteins (OBPs) (Additional file 1: Figure S4A) and other gene families studied. There are a few gene families that showed obvious difference in numbers between An. stephensi and An. gambiae. We performed phylogenetic analysis of these gene families. The results (Additional file 1: Figure S4B and Figure S4C) indicate gene expansion in the odorant receptors (OR) and fibrinogen-related proteins in An. gambiae. Interestingly, a plurality of expanded genes is physically clustered in An. gambiae, suggesting that the gene expansions in An. gambiae may have arisen from local duplications. For example, the An. stephensi single-copy OR gene ASTEI08685 has four orthologs in An. gambiae (AGAP004354, AGAP004355, AGAP004356, and AGAP004357). The putative orthologs of these ‘expanded’ genes tend to be single- or low-copy in An. stephensi and other related species in Vectorbase, supporting the interpretation that the lack of duplicated copies in An. stephensi is not due to assembly or annotation error. Further analysis that includes all species in the ongoing 16 Anopheles genomes project 43 will facilitate future comparative analysis of gene family expansions and gene losses.

2.8 Repeat content

Transposable elements (TEs) and other unclassified interspersed repeats constitute 7.1% of the assembled An. stephensi genome (Table 3: Additional file 2: Repeat sequences). TE occupancy of the euchromatic genome in D. melanogaster and An. gambiae is 2% and 16%, respectively 3 . Thus variations in the size of the genomes correlate with different amounts of repetitive DNA in these three species. More than 200 TEs have been annotated. DNA transposons and miniature inverted-repeat TEs (MITEs) comprise 0.44% of the genome. Non-LTR retrotransposons (or LINEs) comprise 2.36% of the genome. Short intersperse nuclear elements (SINEs), although less than 300 bp in length, are highly repetitive and comprise 1.7% of the genome. There is considerable diversity among the LTR-retrotransposons although they occupy only 0.7% of the genome. Approximately 2% of the genome consists of interspersed repeats that remain to be classified.

<p>Table 3</p>

Type

Elements (n)

Length occupied (bp)

Genome (%)

SINEs

30,514

3,739,253

1.69

LINEs

22,022

5,231,240

2.36

LTR elements

4,359

1,499,282

0.68

DNA elements

4,611

966,667

0.44

Unclassified

30,611

4,322,468

1.95

Total

92,117

15,758,910

7.12

Transposable elements and other interspersed repeats

2.9 Genome landscape: a chromosomal arm perspective

The density of genes, TEs, and short tandem repeats (STRs) for each chromosome were determined based on the physical map (Figure 4). The average numbers of genes for each chromosome arm are consistent with those in An. gambiae. The X had the lowest number of genes per 100 kb, and the highest densities of genes per 100 kb were seen on 2R and 3 L (Figure 5; Additional file 1: Tables S2 and S3). Chromosomes 2R and 3 L also contain the greatest numbers of polymorphic inversions 44 . Genes functioning as drivers of adaptation could be expected to occur in greater densities on chromosome arms with higher numbers of polymorphic inversions 45 .

<p>Figure 4</p>

Genome landscape.

Genome landscape. Density of genes (black vertical lines), transposable elements (TEs; green vertical lines), and short tandem repeats (STRs; red vertical lines) in 100 kb windows of mapped scaffolds. Based on the physical map, scaffolds were ordered and oriented respective to their position in the chromosomes and then 100 kb non-overlapping windows were generated for each scaffold (X-axis). The density of genes and TEs (Y-axis) was determined using coverageBed. Satellite sequences were identified using TandemRepeatFinder. The short tandem repeats track is a combination of the number of microsatellites, minisatellites, and satellites per 100 kb window.

<p>Figure 5</p>

Average density/100 kb/ARM.

Average density/100 kb/ARM. A comparison of the average density per 100 kb of genes, TEs, S/MARS, microsatellites, minisatellites, and satellites between chromosome arms.

An. stephensi has a lower density of transposable elements across all chromosome arms than An. gambiae (Figure 5; Additional file 1: Tables S2 and S3; Additional file 2: Genome Landscape). The density of transposable elements on the An. stephensi X is more than twice that of the autosomes. A comparison of the An. stephensi simple repeats with those in An. gambiae euchromatin showed that densities in the latter were approximately 2-2.5× higher (Figure 5; Additional file 1: Tables S2 and S3). The greatest densities of simple repeats were found on the X chromosome and this is consistent with a previous study in An. gambiae 46 . Although An. stephensi shows lower densities of simple repeats across all arms compared to An. gambiae, its X appears to harbor an over-representation of simple repeats compared to its autosomes. Scaffold/Matrix-associated regions (S/MARs) can potentially affect chromosome mobility in the cell nucleus and rearrangements during evolution 47 , 48 and these were found to be enriched in the 2 L and 3R arms (Figure 5; Additional file 1: Tables S2 and S3).

2.10 Molecular organization of pericentric heterochromatin

We observed clear differences in heterochromatin staining patterns when comparing mitotic chromosome squashes prepared from imaginal discs of An. gambiae and An. stephensi. An. stephensi appears to have more pericentric heterochromatin than An. gambiae (Additional file 1: Figure S5). This is particularly evident in the sex chromosomes. Mitotic X chromosomes in An. stephensi possess much more pericentric heterochromatin compared with X chromosomes from several different strains of An. gambiae. Finally, the Y chromosome in An. stephensi has a large block of heterochromatin. We further investigated whether particular tandem repeats are concentrated in heterochromatin. Aste72A and Aste190A, the two repeats with highest coverage in raw genomic data reads, were selected as probes for FISH analysis (Additional file 2: Tandem repeat sequences). Aste72A, which comprises approximately 1% of the raw genomic reads, was mapped to the pericentric heterochromatin of X and Y chromosomes (Figure 6). Aste190A, which comprises approximately 2% of the raw genomic reads, was mapped to centromere of both autosomes (Additional file 1: Figure S6). The Aste72A tandem repeat has a 26.7% mean GC content and contributes significantly to the AT-rich peak in the plot of GC distribution of raw genomic reads (Additional file 1: Figure S7).

<p>Figure 6</p>

FISH with Aste72A, rDNA, and DAPI on mitotic chromosomes.

FISH with Aste72A, rDNA, and DAPI on mitotic chromosomes. The pattern of hybridization for satellite DNA Aste72A on mitotic sex chromosomes of An. stephensi. Aste72A hybridizes to pericentric heterochromatin in both X and Y chromosomes while ribosomal DNA locus maps next to the heterochromatin band in sex chromosomes.

2.11 Y chromosome

Anopheles mosquitoes have heteromorphic sex-chromosomes where males are heterogametic (XY) and females homogametic (XX) 49 . The high repetitive DNA content of Y chromosomes makes them difficult to assemble and they often are ignored in genome projects. An approach called the chromosome quotient 50 was used to identify 57 putative Y sequences spanning 50,375 bp (Additional file 2). All of these sequences are less than 4,000 bp in length and appear to be highly repetitive. Five BACs that appeared to be Y-linked based on the CQs of their end sequences were analyzed by sequencing and their raw PacBio reads were assembled with the HGAP assembler 51 . Eleven contigs spanning 196,498 bp of predicted Y-linked sequences were obtained (Additional file 2). The 57 Y-linked sequences and 11 contigs from the Y-linked BACs represent currently the most abundant set of Y sequences in any Anopheles species. RepeatMasker analysis using the annotated An. stephensi interspersed repeats showed that approximately 65% of the An. stephensi Y sequences are interspersed repeats. LTR retrotansposons alone occupy approximately 49% of the annotated Y (Additional file 2).

2.12 Synteny and gene order evolution

We used the chromosomal location and orientation of 6,448 one-to-one orthologs from An. gambiae and An. stephensi to examine synteny and estimate the number of chromosomal inversions between these two species (Figure 7; Additional file 2: Synteny Blocks). Syntenic blocks were defined as those that had at least two genes and all genes within the block had the same order and orientation with respect to one another in both genomes. The X chromosome has markedly more inversions than the autosomes. The number of chromosomal inversions that might have happened since An. stephensi and An. gambiae last shared a common ancestor was determined with GRIMM 52 . We calculated the density of inversions per chromosome arm ignoring breakpoint reuse and assuming two breakpoints per inversion (Additional file 1: Tables S4 and S5). The length of An. stephensi assembly was used as a proxy for the size of the An. stephensi chromosomes. The density of inversions per megabase on the X chromosome supports the conclusion that it is much more prone to rearrangement than the autosomes. Genomic segments on the X are approximately three-fold more likely to change order than those on the autosomes (Figure 8A and Additional file 1: Table S6). The fast rate of X chromosome rearrangements contrasts with the lack of polymorphic inversions in An. stephensi and An. gambiae (Additional file 1: Table S5). Interestingly, a recent comparative genomic study between An. gambiae and Ae. aegypti revealed that the homomorphic sex-determining chromosome in Ae. aegypti has a higher rate of genome rearrangements than autosomes 53 .

<p>Figure 7</p>

Synteny.

Synteny. Synteny between An. stephensi and An. gambiae based on 6,448 single-copy orthologs. Orthologs with the same orientation in An. stephensi and An. gambiae are connected with red lines and orthologs with the opposite orientation are connected with blue lines. Orthologous genes from An. stephensi and An. gambiae were retrieved from OrthoDB. The physical map was used to identify the relative locations of genes on the An. stephensi chromosomes. The relationship of the position between the An. stephensi and An. gambiae orthologs were plotted with GenoPlotR. 66 syntenic blocks were identified on the X chromosome. A total of 104 and 64 syntenic blocks were identified on 2R and 2L (3L in An. stephensi). A total of 104 and 42 syntenic blocks were identified on 3R and 3L (2L in An. stephensi). Therefore, the X chromosome has undergone the most rearrangements per megabase.

<p>Figure 8</p>

Chromosome evolution in

Chromosome evolution in Anopheles and Drosophila . (A) Higher rates of rearrangement on the X chromosome compared to autosomes between An. stephensi and An. gambiae. Arm designations for the figure are according to An. stephensi. (B) The ratio of the X chromosome evolution rate to the total rate of rearrangement is higher in Anopheles than in Drosophila.

2.13 Rates of chromosome evolution in Drosophila and Anopheles

Recent studies have established that both Anopheles and Drosophila species have high rates of chromosomal evolution as compared with mammalian species 46 , 54 - 61 . We compared the number of breaks per megabase for the X chromosome and all chromosomes to understand the differences in the dynamics of chromosome evolution between Drosophila and Anopheles (Additional file 1: Table S7). These results reveal a higher ratio of the rates of evolution of sex chromosome to all chromosomes in Anopheles than Drosophila, with means of 2.116 and 1.197, respectively (Figure 8B). We correlated densities of different molecular features including simple repeats, TEs, genes, and S/MARs with the rates of rearrangement calculated for each arm (Additional file 1: Tables S8-S13). The strongest correlations were found among the rates of evolution across all chromosome arms and the densities of microsatellites, minisatellites, and satellites in both An. gambiae and An. stephensi. The highly-positive correlations between rates of inversion across all chromosome arms and satellites of different sizes are due most likely to the co-occurring abundance of satellites and inversions on the X chromosome. Rates of inversions and satellite densities are much lower on the autosomes. S/MARs in autosomes were correlated negatively and genes correlated positively with polymorphic inversions.

2.14 Genetic diversity of the genome

The genome sequencing effort reported in the current study is based on an inbred laboratory strain to ensure good assembly. Nonetheless, we performed genome-wide SNP analysis based on the available data. A total of 530,997 SNPs were detected (Additional file 2: SNP analysis raw data). A total of 319,751 SNPs were assigned to chromosomes based on mapping information (Additional file 1: Table S14). The SNP calls were assessed for their effect on the primary sequence of transcripts (Additional file 2: Summary of transcript consequences for An stephensi Indian strain SNP calls). These analyses will help future population genomic studies and facilitate association studies. We found that the X chromosome has a markedly lower frequency of SNPs than the autosomes in agreement with the similar observation in An. gambiae 3 . The observed pattern may be explained by a smaller effective population size of the X chromosome due to male hemizygosity and lower sequence coverage of the X chromosome 62 .

3 Conclusions

The genome assembly of the type-form of the Indian strain of An. stephensi was produced using a combination of 454, Illumina, and PacBio sequencing and verified by analysis of BAC clones and ESTs. Physical mapping was in complete agreement with the genome assembly and resulted in a chromosome-based assembly that includes 62% of the genome. Such an assembly enabled analysis of chromosome arm-specific differences that are seldom feasible in next-gen genome projects.

Comparative analyses between An. stephensi and An. gambiae showed that the Anopheles X has a high rate of chromosomal rearrangement when compared with autosomes, despite the lack of polymorphic inversions in the X chromosomes in both species. Additionally, the difference between the rates of X chromosome and all chromosome evolution is much more striking in Anopheles than in Drosophila. The high rate of evolution on the X correlates well with the density of simple repeats. Our data indicate that overall high rates of chromosomal evolution are not restricted to Drosophila but may be a feature common to Diptera.

The genome landscape of An. stephensi is characterized by relatively low repeat content compared to An. gambiae. An. stephensi appears to have larger amount of repeat-rich heterochromatin in pericentric regions but far less repetitive sequences in chromosomal arms as compared with An. gambiae. Using a newly developed chromosome quotient method, we identified a number of Y-chromosome contigs and BACs, which together represent currently the most abundant set of Y sequences in any Anopheles species.

The current assembly contains 11,789 predicted protein coding genes, 127 miRNA genes, 434 tRNA genes, and 53 fragments of rRNA genes. An. stephensi appears to have fewer gene duplications than An. gambiae according to orthology analysis, which may explain the slightly lower number of gene models.

This genome project is accompanied by the first comprehensive RNA-seq-based transcriptomic analysis of an Anopheles mosquito. Twenty gene clusters were identified according to gene expression profiles, many of which are stage- or sex-specific. GO term analysis of these gene clusters provided biological insights and leads for important research. For example, male-biased genes were enriched for genes involved in spermatogenesis and the auditory perception.

Close attention was paid to genes involved in innate immunity including LRIMs, APL1, and proteins in the Toll, IMD, insulin, and TGF-β signaling pathways. A high level of orthology is generally observed between An. stephensi and An. gambiae. RNA-seq analysis, which was corroborated by other expression analysis methods, provided novel insights. For example, a protein known to interact with both TOR and TGF-β signaling pathways showed abundant mRNA expression in a wide range of tissues, providing new leads for insights into both TOR and TGF-β signaling in mosquitoes.

4 Material and methods

4.1 Strain selection

The Indian strain of An. stephensi, a representative of the type form was sequenced. The lab colony from which we selected mosquitoes for sequencing was originally established from wild mosquitoes collected in India. The lab colony has been maintained continuously for many generations so we did not attempt to inbreed it.

4.2 Sample collection

DNA was isolated from more than 50 adult male and female An. stephensi using the Qiagen (Hilden, Germany) DNeasy Blood and tissue kit following the suggested protocol. The integrity of the DNA was verified by running an aliquot on a 1% agarose gel to visualize any degradation. Total RNA was isolated using the standard protocol of the mirVana RNA isolation kit (Life Technologies, Carlsbad, CA, USA) and quality was verified using Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA).

4.3 Sequencing

The An. stephensi genome was sequenced to 19.4× coverage using 454 FLX Titanium sequencing performed by the Virginia Bioinformatics Institute (VBI) core laboratory. Sequencing was performed on four different libraries: a single-end shotgun library, and 3 kb, 8 kb, and 20 kb mate-pair libraries. A 200 bp insert size library produced from male An. stephensi genomic DNA was prepared and subjected to a single lane of Illumina HiSeq. Genomic DNA from male An. sequence was subjected to 10 SMRT cells of Pacific Biosciences (PacBio) v1 sequencing. Only males were sequenced with PacBio because we are interested in increasing the probability of finding Y chromosome sequences. Sanger sequencing performed by Amplicon Express was used to sequence 7,263 BAC-ends.

4.4 Genome assembly

We used several approaches to combine the Illumina and 454 data to generate a better assembly. Newbler can take raw Illumina data as input, so we tried a Newbler assembly with the 454 and Illumina data. However, this resulted in a worse assembly than 454 alone. We had much more success with the strategy used to assemble the Solenopsis invicta genome 63 . We assembled the Illumina data first, and then cut the assembly into pseudo-454 reads. These reads were then used along with the real 454 data as input to Newbler 64 .

4.5 De novo Illumina assembly with Celera

We assembled the paired-end Illumina reads using the Celera assembler 65 with the parameters: ‘overlapper = ovl; unitigger = bogart; utgBubblePopping = 1; kickOutNonOvlContigs = 1; cgwDemoteRBP = 0; cgwMergeMissingThreshold = 0.5; merSize = 14’. The Celera assembler output comprised 41,213 contigs spanning 212.8 Mb. The N50 contig size of this assembly was 16.8 kb.

4.6 De novo 454 and Illumina pseudo-454 reads assembly with Newbler 2.8

The contigs of the aforementioned Illumina assembly were shredded informatically into 400 bp pieces with overlapping 200 bp to approximate 454 reads. To artificially simulate coverage depth, we started the shredding at offsets with the values of 0, 10, and 20. Shredding the Illumina assembly resulted in 2,452,038 pseudo-454 reads simulating 4.17× coverage.

We generated an assembly of the 454 and pseudo-454 reads with Newbler 2.8 using the ‘-het -scaffold -large -s 500’ parameters. The resulting assembly contained 23,595 scaffolds spanned 221 Mb. The scaffold N50 size was 1.34 Mb. Mitochondrial DNA (1 scaffold), and other contamination (87 scaffolds) were identified by blastn and removed from the assembly.

4.7 Gap-filling with PacBio reads

PacBio data was used to fill gaps in the scaffolds to further improve the genome assembly. We error-corrected raw PacBio reads using the 454 sequencing data with the Celera pacBioToCa pipeline. pacBioToCa produced 0.88 Gb of error-corrected PacBio reads. Using the error-corrected PacBio data as input, Pbjelly 66 was used to fill gaps with parameters: ‘-minMatch 30 -minPctIdentity 98 -bestn 10 -n Candidates 5 -maxScore -500 -nproc 36-noSplitSubreads’. Pbjelly filled 1,310 gaps spanning 5.4 Mb.

4.8 Further scaffolding with BAC-ends

The scaffolds of the assembly were improved subsequently through the integration of 3,527 BAC-end pairs (120 kb ± 70 kb) using the Bambus scaffolder 67 (Additional file 2: BAC-ends dbGSS accession numbers). The BAC-end sequences were mapped to the scaffolds using Nucmer 68 . The output files were used to generate the ‘.contig’ format files required for Bambus. In total, 275 links between scaffolds were detected. Of these, 169 were retained as potential valid links, which are links connected by uniquely mapped BAC-ends. Links confirmed by less than two BAC-ends were rejected. A total of 46 links were retained that together connected 22 scaffolds, increasing the N50 scaffold size from 1,378 kb to 1,572 kb.

4.9 Assembly validation

4.9.1 CEGMA (Core Eukaryotic Genes)

We used CEGMA 69 to search for the number of core eukaryotic genes to test the completeness and correctness of the genome assembly. CEGMA provides additional information as to whether the entire core eukaryotic genes are present (>70%) or only partially present (>20% and <70%). In total, CEGMA found 96.37% of the 248 core eukaryotic genes to be present, and 97.89% of the core eukaryotic genes to be partially present.

4.9.2 BAC-ends

We checked whether BAC-ends align concordantly to the genome to study the structural correctness of the de novo assembly. BAC-ends were aligned to the scaffolds using NUCMER. In order to ensure unambiguous mapping, only sequences that aligned to a unique location with >95% coverage and 99% identity were used. In total, 21.6% of the BAC-end sequence pairs could be aligned to a unique position in the An. stephensi genome with these stringent criteria. Pairs of BAC-end sequence that aligned discordantly to a single scaffold were considered indicative of potential misassembly. Only four of 717 aligned BAC-end pairs aligned discordantly with the assembly confirming overall structural correctness.

4.9.3 ESTs

An. stephensi EST sequences were downloaded from both the NCBI and VectorBase. We screened the EST sequences to remove any residual vector sequence. The screened ESTs were aligned to the assembly with GMAP 70 . In total, 35,367 of 36,064 ESTs aligned to the assembly. Of these, 26,638 aligned over at least 95% of their length with an identity of >98%. The high percentage of aligned ESTs demonstrates the near-completeness of the An. stephensi genome assembly.

Fluorescent in situ hybridization (FISH): Slides were prepared from ovaries of lab reared, half-gravid females of the An. stephensi Indian wild-type strain. Slide preparation and hybridization experiments followed the techniques described in Sharakhova et al. 71 . Fluorescent microscope images were converted to black and white and inverted in Adobe Photoshop. FISH signals were mapped to specific bands or interbands on the physical map for An. stephensi presented by Sharakhova et al. 72 .

4.10 Constructing the physical map

For the chromosomal based genome assembly, all probes mapped by in situ hybridization by Sharakhova 72 and this study were aligned to the final version of the An. stephensi genome using NCBI blast + blastn. Different blastn parameters were used for probes from different sources to determine if the probe was kept in the final assembly. An e-value of 1e-40 and an identity of >95% was required for probes from An. stephensi. An e-value of 1e-5 was required for probes from species other than An. stephensi. Probes that mapped to more than one location in the genome were discarded. The work by Sharakhova et al. 72 hybridized 345 probes however, only approximately 200 probes from that study were maintained in the final chromosomal assembly. An additional 27 PCR products and BAC clones were hybridized to increase the coverage of our chromosomal assembly.

4.11 Annotation

The genome assembly was annotated initially using the MAKER pipeline 73 . This software synthesizes the results from ab initio gene prediction with experimental gene evidence to produce final annotations. Within the MAKER framework, RepeatMasker 74 was used to mask low-complexity genomic sequence based on the repeat library from previous prediction. First, ESTs and proteins were aligned to the genome by MAKER using BLASTn and BLASTx, respectively. MAKER uses the program Exonerate to polish BLAST hits. Next, within the MAKER framework, SNAP 75 and AUGUSTUS 76 were run to produce ab initio gene predictions based on the initial training data. SNAP and AUGUSTUS were run once again inside of MAKER using the initial training obtained from the ESTs and protein alignments to produce the final annotations.

4.12 Orthology and molecular species phylogeny

Orthologs of predicted An. stephensi genes were assigned by OrthoDB 77 . Information about orthologous genes for An. gambiae, Ae. aegypti, and D. melanogaster also were downloaded from OrthoDB. Enrichment analysis was performed for categories of orthologs using the methods provided in the ontology section. The molecular phylogeny of the 10 selected species was determined from the concatenated protein sequence alignments using MUSCLE 78 (default parameters) followed by alignment trimming with trimAl 79 (automated1 parameters) of 3,695 relaxed single-copy orthologs (a maximum of three paralogs allowed in no more than two species, longest protein selected) from OrthoDB 77 . The resulting 2,246,060 amino acid columns with 932,504 distinct alignment patterns was analyzed with RAxML 80 with the PROTGAMMAJTT model to estimate the maximum likelihood species phylogeny with 100 bootstrap samples.

4.13 Transcriptomics

RNA-seq from 11 samples including: 0 to 1, 2 to 4, 4 to 8, and 8 to 12 h embryos, larva, pupa, adult males, adult females, non-blood-fed ovaries, blood-fed ovaries, and female carcasses without ovaries as described 26 were used for transcriptome analysis. These RNA-seq samples are available from the NCBI SRA (SRP013839). Tophat 81 was used to align these RNA-seq reads to the An. stephensi genome and HTSeq-count 82 was used to generate an occurrence table for each gene in each sample. The numbers of alignments to each gene in each sample then were clustered using MBCluster.Seq 83 , an R package designed to cluster genes by expression profile based on Poisson or Negative-Binomial models. MBCluster.Seq generated 20 clusters. To visualize these results we performed regularized log transformation to the original occurrence tables for all 20 clusters using DESeq2 84 . The results were plotted using ggplot2 85 .

4.14 Ontology

Gene ontology (GO) terms were assigned for the 20 clusters of predicted An. stephensi genes. GO terms were assigned using Blast2Go 86 . The predicted proteins are blasted against the NCBI non-redundant protein database and scanned with InterProScan 87 against InterPro’s signatures. After GO terms were assigned, GO-slim results were generated for the available annotation based on the Generic GO slim mapping. The GO terms assigned by Blast2GO were subject to GO term enrichment. Over-represented GO terms were identified using a hypergeometric test using the GOstats package in R 88 .

4.15 Functional annotation of key gene families

We obtained the InterPro ID information for proteins in An. stephensi from the ontology analysis. We functionally annotated gene families based on the assigned InterPro ID. The gene families, including genes involved in immunity, chemosensation, and detoxification were studied. For comparative genome analysis, we retrieved the InterPro ID for seven other species (An. gambiae, An. darlingi, A. aegypti, Culex quinquefasciatus, D. melanogaster, Bombyx mori, and Tribolium castaneum) using Biomart 89 from vectorbase 90 and Ensembl Metazoa 91 . We compared gene numbers in gene families of interest. For gene families with obvious differences in numbers between An. stephensi and An. gambiae, we preformed phylogenetic analysis of these genes. First we aligned these genes from Anopheles species using MUSCLE 78 . Then, we constructed phylogenetic tree using Neighbor-joining method with 1,000 bootstrap replicates by CLC Genomics Workbench 4 92 .

4.16 Non-coding RNA

We used tRNAScan-SE 93 with the default eukaryotic mode to predict 434 tRNAs in the An. stephensi genome (Additional file 1: Table S15; Additional file 2: Non-coding RNA annotation). Other non-coding RNAs were predicted with INFERNAL 94 by searching against Rfam database version 11.0 95 . A total of 53 fragmental ribosomal RNA, 34 snRNA, 7 snoRNA, 127 miRNA, and 148 sequences with homology to the An. gambiae self-cleaving riboswitch were predicted with an e-value cutoff of 1e-5.

4.17 Transposable elements and other interspersed repeats

Transposable element discovery and classification were performed on the An. stephensi scaffold sequences using previously-described pipelines for LTR-retrotransposons, non-LTR-retrotransposons, SINEs, DNA-transposons, and MITEs, followed by manual inspection 96 . The manually-annotated TE libraries then were compared with the RepeatModeler output to remove redundancy and to correct mis-classification by RepeatModler. A repeat library was produced that contains all manually-annotated TEs and non-redundant sequences from RepeatModeler. The repeat library was used to run RepeatMasker at default settings on the An. stephensi assembly to calculate TE copy number and genome occupancy.

4.18 Simple repeats

The number of microsatellites, minisatellites, and satellites present in the mapped scaffolds for each chromosome were derived by dividing the scaffolds into strings of 100,000 bp and then concatenating them into a multi-FASTA file to represent an An. stephensi pseudo chromosome. Scaffolds were oriented when possible, and all unoriented scaffolds were given the default positive orientation for that chromosome. The multiFASTA file for each pseudo-chromosome was analyzed using a local copy of TandemRepeatsFinder v 4.07b 97 . Parameters for the analysis followed those used by Xia et al. 46 : microsatellites were those of period size 2 to 6 with copy number of >8. Minisatellites had period size 7 to 99 while repeats were considered satellites if they had a period size of >100. Both satellites and minisatellites were considered only if they had a copy number of >2. Simple repeats were recorded only if they had at least 80% identity.

4.19 Identification of S/MARs

Scaffold/matrix associated regions were identified using the SMARTest bioinformatic tool provided by Genomatix 98 . Densities of genes and TEs per 100 kb window were calculated using Bedtools coverage based on the genome annotation and TE annotation, respectively.

4.20 Synteny, gene order evolution, and inversions

One-to-one orthologs from An. gambiae and An. stephensi were identified using OrthoDB 77 and their locations on the An. gambiae and An. stephensi scaffolds determined. Comparative positions of the genes on the scaffolds based on ontology relationships were plotted using genoPlotR 99 . Scaffolds that mapped using two or more probes were oriented properly, but those anchored by only one probe were used in their default orientation. The number of synteny blocks for each pair of homologous chromosome arms between An. stephensi and An. gambiae was determined from the images output from genoPlotR. Two criteria were imposed to determine the number of synteny blocks: the orientation of two or more orthologous genes, and whether the genes remained in the same order on the chromosome of An. stephensi as in An. gambiae. Thus, a group of two or more genes is assigned to the same synteny block if it has the same orientation and order in both species. Synteny blocks were numbered 1, 2, 3, 4, and so on along the chromosome by assigning An. gambiae as the default gene order. An. stephensi was considered rearranged compared to An. gambiae when the numbering of synteny blocks was the same in both species but the order was rearranged in An. stephensi. After quantifying the number of synteny blocks and the amount of gene rearrangement between the two species, we estimated the number of chromosomal inversions between them using the programs Genome Rearrangements in Mouse and Man (GRIMM 52 ).

4.21 SNP analysis

We used CLC Genomics Workbench 4 92 to identify SNPs using a combination of the male and female Illumina data (Accession number: SRP013838). The required coverage was 20 and minimum variant frequency was 35. SNP calls made on the assembly were assessed for their effect on transcripts from the gene build using the Ensembl e-hive, variation database, and variation consequence pipeline (available from github 100 and 101 ). The Ensembl variation consequence pipeline uses the Ensembl API in the same manner as the Variant Effect Predictor 102 and produces equivalent output. The variation consequence pipeline directly loaded the analysis results into an Ensembl MySQL variation database which was used to generate summary statistics of transcript consequences classified using Sequence Ontologs 103 .

4.22 Data access

The An. stephensi genome assembly has been deposited in GenBank under the accession number ALPR00000000 and is available at 90 . The raw sequence data used for genome assembly are available in the NCBI SRA: 454 - SRP037783, Illumina - SRP037783, and PacBio - SRP037783. The BAC-ends used for scaffolding are available from the NCBI dbGSS accession numbers: KG772729 - KG777469. RNA-Seq data can be accessed at the NCBI SRA with ID SRP013839.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

Conceived and designed experiments: ZT and IVS; Data generation, analysis, and presentation: XJ, AP, AS, ABH, MK, MVS, AK, BW, CO, DL, KE, KM, JMCT, JMCR, MAR, MRR, MU, NP, PA, PG, RK, RS, RMW, SL, SM, VLMD, YQ, and ZT; Writing of the manuscript: XJ, ABH, AAJ, AP, AS, JMCR, KDV, KM, KP, MK, MAR, MMR, SL, IVS, and ZT; Provided resources and tools and critical reviewed manuscript: XC and YS. All authors read and approved the final manuscript.

Additional files

<p>Additional file 1: Figure S1.</p>

Genome alignments showing possible gene copy number evolution within the APL1 gene family. Figure S2. Expression of Aste4e-BP1 and FKBP12 during development and following a bloodmeal relative to a representative subset of other signaling molecules. Figure S3. Phylogenetic tree for manually annotated immunity-related genes. Figure S4. Phylogenetic tree for Anopheles OBPs, OR, and fibrinogen-related proteins. Figure S5. Comparison of DAPI stained heterochromatin in mitotic chromosomes between An. stephensi and An. gambiae. Figure S6. FISH with Aste190A, rDNA, and DAPI on mitotic chromosomes. Figure S7. The GC content in raw HiSeq reads of An. stephensi (left) and An. gambiae (right). Table S1. Data used for assembly. Table S2. An. stephensi density/100 kb. Table S3. An. gambiae density/100 kb. Table S4. Synteny blocks and inversions between An. stephensi and An. gambiae. Table S5. Inversion breaks/Mb between An. stephensi and An. gambiae. Table S6. The ratio of the X chromosome evolution rate to the autosomal rate of rearrangement between An. stephensi and An. gambiae. Table S7. The ratio of the X chromosome evolution rate to the total rate of rearrangement in Anopheles and Drosophila. Table S8. Correlation of An. stephensi genes and rearrangements. Table S9. Correlation of An. gambiae genes and rearrangements. Table S10. Correlation of An. stephensi S/MARs and rearrangements. Table S11. Correlation of An. gambiae S/MARs and rearrangements. Table S12. Correlation of An. stephensi short tandem repeats and rearrangements. Table S13. Correlation of An. gambiae short tandem repeats and rearrangements. Table S14. SNPs chromosomal distribution Tables. Table S15. tRNA Tables.

Click here for file

<p>Additional file 2:</p>

1. Physical map data. 2. Lists of genes in clusters. 3. Cluster ontology. 4. Sex-biased genes list and GO terms. 5. Revised annotation for immunity-related genes. 6. Sequences of immunity-related genes. 7. RNA-seq expression profile of immunity-related genes. 8. Automatic annotated salivary genes. 9. Revised manual annotation for salivary genes. 10. Manual annotated salivary genes sequences. 11. Gene families counts table. 12. Repeat sequences. 13. Genome landscape. 14. Tandem repeat sequences. 15. Chromosome quotients of putative Y-linked scaffolds. 16. Sequences of putative Y-linked scaffolds. 17. Chromosome quotients of Y-linked BACs. 18. Sequences of Y-linked BACs. 19. Repeat masker output of Y-linked BACs and Y-linked scaffolds. 20. Repeat masker output of Y-linked BACs. 21. Synteny blocks. 22: SNP analysis raw data. 23. Summary of transcript consequences for An. stephensi Indian strain SNP calls. 24. BAC-ends dbGSS accession numbers. 25. Non-coding RNA annotation.

Click here for file

Acknowledgements

This work is supported by the Fralin Life Science Institute and the Virginia Experimental Station, and by NIH grants AI77680 and AI105575 to ZT, AI094289 and AI099528 to IVS, AI29746 to AAJ, AI095842 to KM, AI073745 to MAR, AI080799 and AI078183 to SL, and AI042361 and AI073685 to KDV. AP and IVS are supported in part by the Institute for Critical Technology and Applied Science and the NSF award 0850198. RMW is supported by Marie Curie International Outgoing Fellowship PIOF-GA-2011-303312. XC is supported by GDUPS (2009). This work was also supported in part by NSF grant CNS-0960081 and the HokieSpeed and BlueRidge supercomputers at Virginia Tech. YS thanks the Department of Biotechnology, Government of India for the financial support. Drew Cocrane provided assistance with in situ hybridization.

<p>Shrinking the malaria map: Progress and prospects</p> Feachem RGA Phillips AA Hwang J Cotter C Wielgosz B Greenwood BM Sabot O Rodriguez MH Abeyasinghe RR Ghebreyesus TA Snow RW Lancet 2010 376 1566 1578 10.1016/S0140-6736(10)61270-6 21035842 <p>Costs and cost-effectiveness of malaria control interventions–a systematic review</p> White MT Conteh L Cibulskis R Ghani AC Malar J 2011 10 337 10.1186/1475-2875-10-337 22050911 <p>The genome sequence of the malaria mosquito Anopheles gambiae</p> Holt RA Subramanian GM Halpern A Sutton GG Charlab R Nusskern DR Wincker P Clark AG Ribeiro JMC Wides R Salzberg SL Loftus B Yandell M Majoros WH Rusch DB Lai Z Kraft CL Abril JF Anthouard V Arensburger P Atkinson PW Baden H de Berardinis V Baldwin D Benes V Biedler J Blass C Bolanos R Boscus D Barnstead M Science (New York, NY) 2002 298 129 149 10.1126/science.1076181 <p>Effect of washing on the bioefficacy of insecticide-treated nets (ITNs) and long-lasting insecticidal nets (LLINs) against main malaria vector Anopheles stephensi by three bioassay methods</p> Rafinejad J Vatandoost H Nikpoor F Abai MR Shaeghi M Duchen S Rafi F J Vector Borne Dis 2008 45 143 150 18592843 <p>Current scenario of malaria in India</p> Sharma VP Parassitologia 1999 41 349 353 10697882 <p>First record of the Asian malaria vector Anopheles stephensi and its possible role in the resurgence of malaria in Djibouti, Horn of Africa</p> Faulde MK Rueda LM Khaireh BA Acta Trop 2014 139C 39 43 10.1016/j.actatropica.2014.06.016 <p>Population genetic structure of malaria vector Anopheles stephensi Liston (Diptera: Culicidae)</p> Gakhar SK Sharma R Sharma A Indian J Exp Biol 2013 51 273 279 24195346 <p>Global malaria mortality between 1980 and 2010: A systematic analysis</p> Murray CJL Rosenfeld LC Lim SS Andrews KG Foreman KJ Haring D Fullman N Naghavi M Lozano R Lopez AD Lancet 2012 2012 413 431 10.1016/S0140-6736(12)60034-8 <p>A research agenda to underpin malaria eradication</p> Alonso PL Brown G Arevalo-Herrera M Binka F Chitnis C Collins F Doumbo OK Greenwood B Hall BF Levine MM Mendis K Newman RD Plowe CV Rodríguez MH Sinden R Slutsker L Tanner M PLoS Med 2011 8 e1000406 10.1371/journal.pmed.1000406 21311579 <p>piggyBac-mediated germline transformation of the malaria mosquito Anopheles stephensi using the red fluorescent protein dsRED as a selectable marker</p> Nolan T Bower TM Brown AE Crisanti A Catteruccia F J Biol Chem 2002 277 8759 8762 10.1074/jbc.C100766200 11805082 <p>piggyBac transposon remobilization and enhancer detection in Anopheles mosquitoes</p> O’Brochta DA Alford RT Pilitt KL Aluvihare CU Harrell RA Proc Natl Acad Sci U S A 2011 108 16339 16344 10.1073/pnas.1110628108 21930941 <p>PNAS Plus: Transgenic Anopheles stephensi coexpressing single-chain antibodies resist Plasmodium falciparum development</p> Isaacs AT Jasinskiene N Tretiakov M Thiery I Zettor A Bourgouin C James AA Proc Natl Acad Sci U S A 2012 109 E1922 E1930 10.1073/pnas.1207738109 22689959 <p>Targeted mutagenesis in the malaria mosquito using TALE nucleases</p> Smidler AL Terenzi O Soichot J Levashina EA Marois E PLoS One 2013 8 e74511 10.1371/journal.pone.0074511 23977401 <p>Stable and heritable gene silencing in the malaria vector Anopheles stephensi</p> Brown AE Bugeon L Crisanti A Catteruccia F Nucleic Acids Res 2003 31 e85 10.1093/nar/gng085 12888537 <p>Wolbachia invades Anopheles stephensi populations and induces refractoriness to Plasmodium infection</p> Bian G Joshi D Dong Y Lu P Zhou G Pan X Xu Y Dimopoulos G Xi Z Science (New York, NY) 2013 340 748 751 10.1126/science.1236192 <p>Engineered anopheles immunity to plasmodium infection</p> Dong Y Das S Cirimotich C Souza-Neto JA McLean KJ Dimopoulos G PLoS Pathog 2011 7 e1002458 10.1371/journal.ppat.1002458 22216006 <p>Caspar controls resistance to plasmodium falciparum in diverse anopheline species</p> Garver LS Dong Y Dimopoulos G PLoS Pathog 2009 5 e1000335 10.1371/journal.ppat.1000335 19282971 <p>Sustained activation of Akt elicits mitochondrial dysfunction to block Plasmodium falciparum infection in the mosquito host</p> Luckhart S Giulivi C Drexler AL Antonova-Koch Y Sakaguchi D Napoli E Wong S Price MS Eigenheer R Phinney BS Pakpour N Pietri JE Cheung K Georgis M Riehle M PLoS Pathog 2013 9 e1003180 10.1371/journal.ppat.1003180 23468624 <p>Density-dependent impact of the human malaria parasite Plasmodium falciparum gametocyte sex ratio on mosquito infection rates</p> Mitri C Thiery I Bourgouin C Paul REL Proc Roy Soc Lond B Biol Sci 2009 276 3721 3726 10.1098/rspb.2009.0962 <p>Ingested human insulin inhibits the mosquito NF-κB-dependent immune response to Plasmodium falciparum</p> Pakpour N Corby-Harris V Green GP Smithers HM Cheung KW Riehle MA Luckhart S Infect Immun 2012 80 2141 2149 10.1128/IAI.00024-12 22473605 <p>Mosquito genomes: structure, organization, and evolution</p> Rai KS Black WC IV Adv Genet 1999 41 1 33 10.1016/S0065-2660(08)60149-2 10494615 <p>Arm-specific dynamics of chromosome evolution in malaria mosquitoes</p> Sharakhova MV Xia A Leman SC Sharakhov IV BMC Evol Biol 2011 11 91 10.1186/1471-2148-11-91 21473772 <p>The genome of Anopheles darlingi, the main neotropical malaria vector</p> Marinotti O Cerqueira GC de Almeida LG Ferro MI Loreto EL Zaha A Teixeira SM Wespiser AR Almeida ESA Schlindwein AD Pacheco AC Silva AL Graveley BR Walenz BP Lima Bde A Ribeiro CA Nunes-Silva CG de Carvalho CR Soares CM de Menezes CB Matiolli C Caffrey D Araújo DA de Oliveira DM Golenbock D Grisard EC Fantinatti-Garboggini F de Carvalho FM Barcellos FG Prosdocimi F Nucleic Acids Res 2013 41 7387 7400 10.1093/nar/gkt484 23761445 <p>Genome sequence of Anopheles sinensis provides insight into genetics basis of mosquito competence for malaria parasites</p> Zhou D Zhang D Ding G Shi L Hou Q Ye Y Xu Y Zhou H Xiong C Li S Yu J Hong S Yu X Zou P Chen C Chang X Wang W Lv Y Sun Y Ma L Shen B Zhu C BMC Genomics 2014 15 42 10.1186/1471-2164-15-42 24438588 https://www.vectorbase.org/Anopheles_stephensiI/Info/Index VectorBase, Anopheles stephensi Indian strain. [] <p>A unique Y gene in the Asian malaria mosquito Anopheles stephensi encodes a small lysine-rich protein and is transcribed at the onset of embryonic development</p> Criscione F Qi Y Saunders R Hall B Tu Z Insect Mol Biol 2013 22 433 441 10.1111/imb.12034 23683123 <p>Active auditory mechanics in mosquitoes</p> Göpfert MC Robert D Proc Roy Soc Lond B Biol Sci 2001 268 333 339 10.1098/rspb.2000.1376 <p>Humming in tune: sex and species recognition by mosquitoes on the wing</p> Gibson G Warren B Russell IJ J Assoc Res Otolaryngol 2010 11 527 540 10.1007/s10162-010-0243-2 20976515 <p>The Aedes aegypti toll pathway controls dengue virus infection</p> Xi Z Ramirez JL Dimopoulos G PLoS Pathog 2008 4 e1000098 10.1371/journal.ppat.1000098 18604274 <p>In vivo, in vitro, and in silico studies suggest a conserved immune module that regulates malaria parasite transmission from mammals to mosquitoes</p> Price I Ermentrout B Zamora R Wang B Azhar N Mi Q Constantine G Faeder JR Luckhart S Vodovotz Y J Theor Biol 2013 334 173 186 10.1016/j.jtbi.2013.05.028 23764028 <p>The mitogen-activated protein kinome from Anopheles gambiae: identification, phylogeny and functional characterization of the ERK, JNK and p38 MAP kinases</p> Horton AA Wang B Camp L Price MS Arshi A Nagy M Nadler SA Faeder JR Luckhart S BMC Genomics 2011 12 574 10.1186/1471-2164-12-574 22111877 <p>A comprehensive gene expression atlas of sex- and tissue-specificity in the malaria vector</p> Baker DA Nolan T Fischer B Pinder A Crisanti A Russell S Anopheles gambiae BMC Genomics 2011 12 296 10.1186/1471-2164-12-296 <p>Structure of the FKBP12-rapamycin complex interacting with the binding domain of human FRAP</p> Choi J Chen J Schreiber SL Clardy J Science (New York, NY) 1996 273 239 242 10.1126/science.273.5272.239 <p>mTOR signaling in growth control and disease</p> Laplante M Sabatini DM Cell 2012 149 274 293 10.1016/j.cell.2012.03.017 22500797 <p>Insulin/TOR signaling in growth and homeostasis: A view from the fly world</p> Grewal SS Int J Biochem Cell Biol 2009 41 1006 1010 10.1016/j.biocel.2008.10.010 18992839 <p>Nutrient content of diet affects the signaling activity of the insulin/target of rapamycin/p70 S6 kinase pathway in the African malaria mosquito Anopheles gambiae</p> Arsic D Guerin PM J Insect Physiol 2008 54 1226 1235 10.1016/j.jinsphys.2008.06.003 18634792 http://funcgen.vectorbase.org/expression-browser/gene/AGAP012184 Vectorbase, Gene AGAP012184 Expression Report. [] <p>Establishment of dorsal-ventral polarity in the Drosophila embryo: the induction of polarity by the Toll gene product</p> Anderson KV Bokla L Nüsslein-Volhard C Cell 1985 42 791 798 10.1016/0092-8674(85)90275-2 3931919 <p>Exploring the salivary gland transcriptome and proteome of the Anopheles stephensi mosquito</p> Valenzuela JG Francischetti IMB Pham VM Garfield MK Ribeiro JMC Insect Biochem Mol Biol 2003 33 717 732 10.1016/S0965-1748(03)00067-5 12826099 <p>An updated catalogue of salivary gland transcripts in the adult female mosquito, Anopheles gambiae</p> Arca B Lombardo F Valenzuela JG Francischetti IM Marinotti O Coluzzi M Ribeiro JM J Exp Biol 2005 208 3971 3986 10.1242/jeb.01849 16215223 <p>An insight into the sialome of blood-feeding Nematocera</p> Ribeiro JMC Mans BJ Arcà B Insect Biochem Mol Biol 2010 40 767 784 10.1016/j.ibmb.2010.08.002 20728537 <p>InterPro in 2011: new developments in the family and domain prediction database</p> Hunter S Jones P Mitchell A Apweiler R Attwood TK Bateman A Bernard T Binns D Bork P Burge S de Castro E Coggill P Corbett M Das U Daugherty L Duquenne L Finn RD Fraser M Gough J Haft D Hulo N Kahn D Kelly E Letunic I Lonsdale D Lopez R Madera M Maslen J McAnulla C McDowall J Nucleic Acids Res 2012 40 D306 D312 10.1093/nar/gkr948 22096229 <p>The evolution of the Anopheles 16 genomes project</p> Neafsey DE Christophides GK Collins FH Emrich SJ Fontaine MC Gelbart W Hahn MW Howell PI Kafatos FC Lawson D Muskavitch MA Waterhouse RM Williams LJ Besansky NJ G3 (Bethesda) 2013 3 1191 1194 10.1534/g3.113.006247 <p>Inversion polymorphisms in natural populations of Anopheles stephensi</p> Mahmood F Sakai RK Can J Genet Cytol 1984 26 538 546 6498600 <p>Chromosomal inversion polymorphisms and adaptation</p> Hoffmann AA Sgrò CM Weeks AR Trends Ecol Evol 2004 19 482 488 10.1016/j.tree.2004.06.013 16701311 <p>Genome landscape and evolutionary plasticity of chromosomes in malaria mosquitoes</p> Xia A Sharakhova MV Leman SC Tu Z Bailey JA Smith CD Sharakhov IV PLoS One 2010 5 e10592 10.1371/journal.pone.0010592 20485676 <p>DNA from Drosophila melanogaster β-heterochromatin binds specifically to nuclear lamins in vitro and the nuclear envelope in situ</p> Baricheva EA Berrios M Bogachev SS Borisevich IV Lapik ER Sharakhov IV Stuurman N Fisher PA Gene 1996 171 171 176 10.1016/0378-1119(96)00002-9 8666268 <p>Nuclear lamins: major factors in the structural organization and function of the nucleus and chromatin</p> Dechat T Pfleghaar K Sengupta K Shimi T Shumaker DK Solimando L Goldman RD Genes Dev 2008 22 832 853 10.1101/gad.1652708 18381888 <p>Triploids and male determination in the mosquito, Anopheles culicifacies</p> Baker RH Sakai RK J Hered 1979 70 345 346 43343 <p>Six novel Y chromosome genes in Anopheles mosquitoes discovered by independently sequencing males and females</p> Hall AB Qi Y Timoshevskiy V Sharakhova MV Sharakhov IV Tu Z BMC Genomics 2013 14 273 10.1186/1471-2164-14-273 23617698 <p>Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data</p> Chin CS Alexander DH Marks P Klammer AA Drake J Heiner C Clum A Copeland A Huddleston J Eichler EE Turner SW Korlach J Nat Methods 2013 10 563 569 10.1038/nmeth.2474 23644548 <p>GRIMM: genome rearrangements web server</p> Tesler G Bioinformatics 2002 18 492 493 10.1093/bioinformatics/18.3.492 11934753 <p>Genomic composition and evolution of Aedes aegypti chromosomes revealed by the analysis of physically mapped supercontigs</p> Timoshevskiy VA Kinney NA de Bruyn BS Mao C Tu Z Severson DW Sharakhov IV Sharakhova MV BMC Biol 2014 12 27 10.1186/1741-7007-12-27 24731704 <p>Inversions and gene order shuffling in Anopheles gambiae and A. funestus</p> Sharakhov IV Serazin AC Grushko OG Dana A Lobo N Hillenmeyer ME Westerman R Romero-Severson J Costantini C Sagnon N Collins FH Besansky NJ Science 2002 298 182 185 10.1126/science.1076803 12364797 <p>Polytene chromosomal maps of 11 Drosophila species: the order of genomic scaffolds inferred from genetic and physical maps</p> Schaeffer SW Bhutkar A McAllister BF Matsuda M Matzkin LM O'Grady PM Rohde C Valente VLS Aguadé M Anderson WW Edwards K Garcia AC Goodman J Hartigan J Kataoka E Lapoint RT Lozovsky ER Machado CA Noor MA Papaceit M Reed LK Richards S Rieger TT Russo SM Sato H Segarra C Smith DR Smith TF Strelets V Tobari YN Genetics 2008 179 1601 1655 10.1534/genetics.107.086074 18622037 <p>Principles of genome evolution in the Drosophila melanogaster species group</p> Ranz JM Maurin D Chan YS Von Grotthuss M Hillier LW Roote J Ashburner M Bergman CM PLoS Biol 2007 5 1366 1381 10.1371/journal.pbio.0050152 <p>How malleable is the eukaryotic genome? Extreme rate of chromosomal rearrangement in the genus Drosophila</p> Ranz JM Casals F Ruiz A Genome Res 2001 11 230 239 10.1101/gr.162901 11157786 <p>The fragile breakage versus random breakage models of chromosome evolution</p> Peng Q Pevzner PA Tesler G PLoS Comput Biol 2006 2 e14 10.1371/journal.pcbi.0020014 16501665 <p>Microinversions in mammalian evolution</p> Chaisson MJ Raphael BJ Pevzner PA Proc Natl Acad Sci U S A 2006 103 19824 19829 10.1073/pnas.0603984103 17189424 <p>Genome-scale evolution: reconstructing gene orders in the ancestral species</p> Bourque G Pevzner PA Genome Res 2002 12 26 36 11779828 <p>Chromosomal rearrangement inferred from comparisons of 12 Drosophila genomes</p> Bhutkar A Schaeffer SW Russo SM Xu M Smith TF Gelbart WM Genetics 2008 179 1657 1680 10.1534/genetics.107.086108 18622036 <p>Widespread divergence between incipient Anopheles gambiae species revealed by whole genome sequences</p> Lawniczak MK Emrich SJ Holloway AK Regier AP Olson M White B Redmond S Fulton L Appelbaum E Godfrey J Farmer C Chinwalla A Yang SP Minx P Nelson J Kyung K Walenz BP Garcia-Hernandez E Aguiar M Viswanathan LD Rogers YH Strausberg RL Saski CA Lawson D Collins FH Kafatos FC Christophides GK Clifton SW Kirkness EF Besansky NJ Science 2010 330 512 514 10.1126/science.1195755 20966253 <p>The genome of the fire ant Solenopsis invicta</p> Wurm Y Wang J Riba-Grognuz O Corona M Nygaard S Hunt BG Ingram KK Falquet L Nipitwattanaphon M Gotzek D Dijkstra MB Oettler J Comtesse F Shih CJ Wu WJ Yang CC Thomas J Beaudoing E Pradervand S Flegel V Cook ED Fabbretti R Stockinger H Long L Farmerie WG Oakey J Boomsma JJ Pamilo P Yi SV Heinze J Proc Natl Acad Sci U S A 2011 108 5679 5684 10.1073/pnas.1009690108 21282665 <p>Comparing de novo assemblers for 454 transcriptome data</p> Kumar S Blaxter ML BMC Genomics 2010 11 571 10.1186/1471-2164-11-571 20950480 <p>Consensus generation and variant detection by Celera Assembler</p> Denisov G Walenz B Halpern AL Miller J Axelrod N Levy S Sutton G Bioinformatics 2008 24 1035 1040 10.1093/bioinformatics/btn074 18321888 <p>Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology</p> English AC Richards S Han Y Wang M Vee V Qu J Qin X Muzny DM Reid JG Worley KC Gibbs RA PLoS One 2012 7 e47768 10.1371/journal.pone.0047768 23185243 <p>Hierarchical scaffolding with Bambus</p> Pop M Kosack DS Salzberg SL Genome Res 2004 14 149 159 10.1101/gr.1536204 14707177 <p>Versatile and open software for comparing large genomes</p> Kurtz S Phillippy A Delcher AL Smoot M Shumway M Antonescu C Salzberg SL Genome Biol 2004 5 R12 10.1186/gb-2004-5-2-r12 14759262 <p>CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes</p> Parra G Bradnam K Korf I Bioinformatics 2007 23 1061 1067 10.1093/bioinformatics/btm071 17332020 <p>GMAP: a genomic mapping and alignment program for mRNA and EST sequences</p> Wu TD Watanabe CK Bioinformatics 2005 21 1859 1875 10.1093/bioinformatics/bti310 15728110 <p>Genome mapping and characterization of the Anopheles gambiae heterochromatin</p> Sharakhova MV George P Brusentsova IV Leman SC Bailey JA Smith CD Sharakhov IV BMC Genomics 2010 11 459 10.1186/1471-2164-11-459 20684766 <p>A physical map for an Asian malaria mosquito, Anopheles stephensi</p> Sharakhova MV Xia A Tu Z Shouche YS Unger MF Sharakhov IV Am J Trop Med Hyg 2010 83 1023 1027 10.4269/ajtmh.2010.10-0366 21036831 <p>MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes</p> Cantarel BL Korf I Robb SM Parra G Ross E Moore B Holt C Sanchez Alvarado A Yandell M Genome Res 2008 18 188 196 10.1101/gr.6743907 18025269 <p>Using and understanding RepeatMasker</p> Tempel S Methods Mol Biol 2012 859 29 51 10.1007/978-1-61779-603-6_2 22367864 <p>Gene finding in novel genomes</p> Korf I BMC Bioinformatics 2004 5 59 10.1186/1471-2105-5-59 15144565 <p>AUGUSTUS: a web server for gene finding in eukaryotes</p> Stanke M Steinkamp R Waack S Morgenstern B Nucleic Acids Res 2004 32 W309 W312 10.1093/nar/gkh379 15215400 <p>OrthoDB: a hierarchical catalog of animal, fungal and bacterial orthologs</p> Waterhouse RM Tegenfeldt F Li J Zdobnov EM Kriventseva EV Nucleic Acids Res 2013 41 D358 D365 10.1093/nar/gks1116 23180791 <p>MUSCLE: multiple sequence alignment with high accuracy and high throughput</p> Edgar RC Nucleic Acids Res 2004 32 1792 1797 10.1093/nar/gkh340 15034147 <p>trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses</p> Capella-Gutierrez S Silla-Martinez JM Gabaldon T Bioinformatics 2009 25 1972 1973 10.1093/bioinformatics/btp348 19505945 <p>RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies</p> Stamatakis A Bioinformatics 2014 30 1312 1313 10.1093/bioinformatics/btu033 24451623 <p>TopHat: discovering splice junctions with RNA-Seq</p> Trapnell C Pachter L Salzberg SL Bioinformatics 2009 25 1105 1111 10.1093/bioinformatics/btp120 19289445 <p>HTSeq - A Python framework to work with high-throughput sequencing data</p> Anders S Pyl PT Huber W Bioinformatics 2014 25260700 <p>Model-based clustering for RNA-seq data</p> Si Y Liu P Li P Brutnell TP Bioinformatics 2014 30 197 205 10.1093/bioinformatics/btt632 24191069 Love MI Huber W Anders S Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2 2014 <p>ggplot2</p> Wickham H Wiley Interdiscipl Rev Comput Stat 2011 3 180 185 10.1002/wics.147 <p>Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research</p> Conesa A Götz S García-Gómez JM Terol J Talón M Robles M Bioinformatics 2005 21 3674 3676 10.1093/bioinformatics/bti610 16081474 <p>InterProScan: protein domains identifier</p> Quevillon E Silventoinen V Pillai S Harte N Mulder N Apweiler R Lopez R Nucleic Acids Res 2005 33 W116 W120 10.1093/nar/gki442 15980438 <p>Using GOstats to test gene lists for GO term association</p> Falcon S Gentleman R Bioinformatics 2007 23 257 258 10.1093/bioinformatics/btl567 17098774 <p>BioMart: driving a paradigm change in biological data management</p> Kasprzyk A Database (Oxford) 2011 2011 bar049 10.1093/database/bar049 https://www.vectorbase.org/ Vectorbase. [] http://metazoa.ensembl.org Ensembl Metazoa. [] http://www.clcbio.com CLC bio, a QIAGEN Company. [] <p>tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence</p> Lowe TM Eddy SR Nucleic Acids Res 1997 25 955 964 10.1093/nar/25.5.0955 9023104 <p>Infernal 1.0: inference of RNA alignments</p> Nawrocki EP Kolbe DL Eddy SR Bioinformatics 2009 25 1335 1337 10.1093/bioinformatics/btp157 19307242 <p>Rfam: an RNA family database</p> Griffiths-Jones S Bateman A Marshall M Khanna A Eddy SR Nucleic Acids Res 2003 31 439 441 10.1093/nar/gkg006 12520045 <p>Genome sequence of Aedes aegypti, a major arbovirus vector</p> Nene V Wortman JR Lawson D Haas B Kodira C Tu ZJ Loftus B Xi Z Megy K Grabherr M Ren Q Zdobnov EM Lobo NF Campbell KS Brown SE Bonaldo MF Zhu J Sinkins SP Hogenkamp DG Amedeo P Arensburger P Atkinson PW Bidwell S Biedler J Birney E Bruggner RV Costas J Coy MR Crabtree J Crawford M Science 2007 316 1718 1723 10.1126/science.1138878 17510324 <p>Tandem repeats finder: a program to analyze DNA sequences</p> Benson G Nucleic Acids Res 1999 27 573 580 10.1093/nar/27.2.573 9862982 <p>In silico prediction of scaffold/matrix attachment regions in large genomic sequences</p> Frisch M Frech K Klingenhoff A Cartharius K Liebich I Werner T Genome Res 2002 12 349 354 10.1101/gr.206602. Article published online before print in January 2002 11827955 <p>genoPlotR: comparative gene and genome visualization in R</p> Guy L Kultima JR Andersson SG Bioinformatics 2010 26 2334 2335 10.1093/bioinformatics/btq413 20624783 https://github.com/Ensembl/ensembl-hive EnsEMBL Hive - a system for creating and running pipelines on a distributed compute resource. [] https://github.com/Ensembl/ensembl-variation/ The Ensembl Variation Perl API and SQL schema. [] <p>Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor</p> McLaren W Pritchard B Rios D Chen Y Flicek P Cunningham F Bioinformatics 2010 26 2069 2070 10.1093/bioinformatics/btq330 20562413 <p>The Sequence Ontology: a tool for the unification of genome annotations</p> Eilbeck K Lewis SE Mungall CJ Yandell M Stein L Durbin R Ashburner M Genome Biol 2005 6 R44 10.1186/gb-2005-6-5-r44 15892872