Brassica napus Cultivar Darmor-bzh Genome Assembly v4.1 & Annotation v5
Analysis Name | Brassica napus Cultivar Darmor-bzh Genome Assembly v4.1 & Annotation v5 |
---|---|
Method | SOAP and Newbler |
Source | A combination of GS FLX Titanium, Sanger BES and Illumina HiSeq sequencing reads |
Date Performed | Thursday, August 21, 2014 |
About the Assembly
Sequencing and Assembly Strategy
The methods follow those described in the ‘Darmor-bzh’ genome article. Sequencing the B. napus genome was expected to be difficult because the An and Cn homeologous subgenomes are hard to tell apart. Fortunately, B. rapa and B. oleracea had already been assembled successfully through WGS methods, which made it possible to distinguish the two subgenomes as well as the duplicated segments resulting from genome triplication or mesoploidy of B. napus. The homozygous European winter oilseed cultivar ‘Darmor-bzh’ was selected for sequencing. Several sequencing platforms were used for the assembly, including Sanger BAC-end sequencing (~7.8x), GS FLX Titanium 454 sequencing that included long reads of 700 bases (~21.2x), and Illumina SBS technology (~53.9x). In addition, about 5x of 454 sequencing data from B. rapa (‘Chiifu’) or B. oleracea (‘TO1000’) was used to distinguish the A and C subgenomes. The final assembly comprised about 849.7 Mb with a scaffold N50 of 763.7 kb, accounting for about 75% of the estimated genome size of 1,130 Mb, and a total of 18,288 of 20,702 scaffolds were assigned to either the An (8,294) or the Cn (9,984) subgenome.
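The assembly statistics quoted above (scaffold N50 and the fraction of the estimated genome that was assembled) can be recomputed from a list of scaffold lengths. The following is a minimal sketch, not the pipeline used for ‘Darmor-bzh’; the function names and the toy scaffold lengths are hypothetical, and only the 849.7 Mb and 1,130 Mb figures come from the text.

```python
def n50(lengths):
    """Return the N50: the length L such that scaffolds of length >= L
    together contain at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


def assembled_fraction(assembled_mb, estimated_mb):
    """Fraction of the estimated genome size covered by the assembly."""
    return assembled_mb / estimated_mb


# Hypothetical toy scaffold lengths (arbitrary units), for illustration only.
toy_scaffolds = [900, 700, 400, 300, 200]
toy_n50 = n50(toy_scaffolds)

# Figures taken from the text: 849.7 Mb assembled, 1,130 Mb estimated.
fraction = assembled_fraction(849.7, 1130)
```

With the figures from the text, `fraction` rounds to 0.75, matching the stated 75%.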
To construct a high-quality combined genetic linkage map, three populations, ‘Darmor-bzh’ x ‘Yudal’ (DY), ‘Darmor’ x ‘Bristol’ (DB), and ‘Avisol’ x ‘Aburamasari’ (AA), were used to develop single-nucleotide polymorphism (SNP) markers with the Infinium 20K BeadChip (Illumina). For the DY population, a total of 5,738 genetic bins were developed, covering 2,807 cM. The genetic maps of the DB and AA populations contained 2,350 and 2,692 genetic bins, covering 1,959 and 4,048 cM, respectively. Taking the DY genetic map as the reference, the DB and AA maps were then integrated step by step using the BioMercator V4.2 program. The final consensus map contained 7,287 bins and covered 2,881 cM. Through allele sequence matching, 384 anchored scaffolds were aligned with 19 pseudochromosomes, comprising A01–A10 of the A subgenome and C01–C09 of the C subgenome. As a result, 712.3 Mb (84%) of the genome was successfully anchored, and the remaining unanchored scaffolds were grouped based on marker alignment with the genetic maps and orthologous alignment with the parental genomes. The anchored Cn subgenome (525.8 Mb) is larger than the An subgenome (314.2 Mb), consistent with the sizes of the assembled Co genome of B. oleracea (540 Mb of a total of ~630 Mb) and the Ar genome of B. rapa (312 Mb of a total of ~530 Mb).
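The idea of anchoring scaffolds to pseudochromosomes with genetic markers can be illustrated with a small sketch. It assumes, hypothetically, that each marker hit is a tuple of scaffold ID, linkage group, and map position in cM; a scaffold is assigned to the linkage group holding most of its markers and ordered by the median position of those markers. This is an illustrative simplification and not the procedure used in the genome article.

```python
from collections import Counter, defaultdict
from statistics import median


def anchor_scaffolds(marker_hits):
    """Order scaffolds along linkage groups.

    marker_hits: iterable of (scaffold_id, linkage_group, position_cM)
    tuples (a hypothetical input format).
    Returns {linkage_group: [scaffold_id, ...]} ordered by map position.
    """
    by_scaffold = defaultdict(list)
    for scaffold, group, cm in marker_hits:
        by_scaffold[scaffold].append((group, cm))

    placements = defaultdict(list)
    for scaffold, hits in by_scaffold.items():
        # Majority vote: the linkage group supported by the most markers.
        group, _ = Counter(g for g, _ in hits).most_common(1)[0]
        # Median position of the markers that support the chosen group.
        position = median(cm for g, cm in hits if g == group)
        placements[group].append((position, scaffold))

    return {
        group: [scaffold for _, scaffold in sorted(items)]
        for group, items in placements.items()
    }


# Hypothetical marker hits, for illustration only.
hits = [
    ("s1", "A01", 10.0),
    ("s1", "A01", 12.0),
    ("s1", "C01", 50.0),  # minority hit, outvoted by the two A01 markers
    ("s2", "A01", 3.0),
    ("s3", "C01", 7.5),
]
order = anchor_scaffolds(hits)
```

Taking the median rather than the mean keeps a single misplaced marker from shifting a scaffold far along the map.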
Repeat Elements Annotation
The methods follow those described in the ‘Darmor-bzh’ and ‘ZS11’ genome articles. Repetitive sequences, including interspersed and tandem repeats, make up a major part of eukaryotic genomes, especially in most plant genomes. Transposable elements (TEs) of B. napus were annotated by integrating de novo and homology-based approaches. A local TE query database was built with three programs, LTR_FINDER, PILER, and RepeatScout, and the raw sequences in this database were then classified by RepeatModeler. RepeatMasker was used to search for TEs across the whole genome, based on the existing repeat database Repbase and the locally constructed database. To complement the homology-based methods, RepeatProteinMask was run to search the genome for TE-related proteins, based on the existing TE proteins in Repbase. Tandem repeat elements of B. napus were annotated using Tandem Repeats Finder (TRF). For ‘Darmor-bzh’, about 37.48% of the genome sequence (318 Mb) is repetitive, while only about 4.01% of that is tandem repeats. For ‘ZS11’, about 49.78% of the genome sequence (485 Mb) is repetitive, while only about 5.81% of that is tandem repeats. The higher repeat content of ‘ZS11’ was shown to be the main reason for its longer genome.
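Because de novo and homology-based annotations often overlap, a repeat percentage such as those quoted above is normally computed from the union of all annotated intervals, so that no base is counted twice. The sketch below shows that calculation under stated assumptions: intervals are half-open `(start, end)` pairs on a single sequence, and the function names and toy coordinates are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def masked_fraction(intervals, genome_length):
    """Fraction of the genome covered by at least one repeat annotation."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / genome_length


# Hypothetical annotations from two tools on a 1,000 bp toy sequence.
toy_repeats = [(0, 100), (50, 150), (300, 400)]
toy_fraction = masked_fraction(toy_repeats, 1000)
```

Here the first two intervals overlap and merge into `(0, 150)`, so 250 of 1,000 bases are counted as repetitive.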
Gene Annotation
The methods follow those described in the ‘Darmor-bzh’ genome article. Functional genes of ‘Darmor-bzh’ were identified iteratively using a combination of homology-based and de novo prediction algorithms. (1) Protein mapping. Proteomes of five species were first collected for homology searches, including Arabidopsis thaliana (TAIR10), B. rapa, B. oleracea and O. sativa (PlantGDB, release 186). To shorten the Genewise computation time, the proteins were first aligned to the genome with BLAT and candidate hits were extracted. Each candidate match was then refined with Genewise to confirm the exact gene structure and the open reading frame (ORF). (2) De novo prediction. The refined, curated gene models were used to train the hidden Markov model (HMM) parameters of the ab initio predictors Geneid and SNAP, using A. thaliana gene models; the two programs were then used to predict gene models in Darmor-bzh. (3) Transcriptome sequence mapping. A total of 643,937 cDNAs and 41,165 unigenes were collected from EMBL and brassica.info. All cDNAs and unigenes were aligned to the reference genome with BLAT to identify the best matches (identity >90%) and the initial gene structures. Est2genome was then used to realign each match to the cDNA sequences. In addition, RNA-Seq reads from the major tissues and developmental stages of Darmor-bzh were obtained with Illumina technology. All filtered reads were mapped and transcript models were identified using SOAP2 (Li et al. 2009) and Gmorse. This yielded 162,177 loci clustered from 930,181 models. (4) Integration of all predicted gene models. The consensus gene set of Darmor-bzh was obtained by integrating all predicted gene models using GAZE. The final non-redundant gene set contains 101,040 genes, which is consistent with the combined gene sets of B. rapa and B. oleracea. In three public assemblies, 41,174 non-redundant gene models were predicted in B. rapa accession Chiifu-401-42, and 45,758 and 59,225 gene models in B. oleracea var. capitata line 02-12 and accession TO1000, respectively.
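Two of the steps above lend themselves to a short illustration: keeping the best alignment per cDNA above an identity threshold, and clustering overlapping transcript models into loci. The sketch below is not the actual pipeline. The record layout, the function names, and the use of an alignment score to define "best" are assumptions made for illustration; only the >90% identity cut-off comes from the text.

```python
from collections import defaultdict


def best_hits(alignments, min_identity=90.0):
    """Keep, per query, the highest-scoring alignment with identity > min_identity.

    alignments: iterable of dicts with 'query', 'identity' (percent) and
    'score' keys (a hypothetical record layout).
    """
    best = {}
    for aln in alignments:
        if aln["identity"] <= min_identity:
            continue
        current = best.get(aln["query"])
        if current is None or aln["score"] > current["score"]:
            best[aln["query"]] = aln
    return best


def cluster_into_loci(models):
    """Group overlapping models on the same sequence and strand into loci.

    models: iterable of (seq_id, strand, start, end) with half-open coordinates.
    Returns a list of loci, each a list of models.
    """
    groups = defaultdict(list)
    for model in models:
        groups[(model[0], model[1])].append(model)

    loci = []
    for key in sorted(groups):
        current, current_end = [], None
        for model in sorted(groups[key], key=lambda m: (m[2], m[3])):
            if current and model[2] < current_end:
                # Overlaps the running locus: extend it.
                current.append(model)
                current_end = max(current_end, model[3])
            else:
                if current:
                    loci.append(current)
                current, current_end = [model], model[3]
        loci.append(current)
    return loci


# Hypothetical inputs, for illustration only.
toy_alignments = [
    {"query": "c1", "identity": 95.0, "score": 100},
    {"query": "c1", "identity": 99.0, "score": 80},
    {"query": "c2", "identity": 88.0, "score": 500},  # below the cut-off
]
toy_models = [
    ("A01", "+", 100, 500),
    ("A01", "+", 400, 900),
    ("A01", "+", 1000, 1200),
    ("A01", "-", 450, 800),
]
kept = best_hits(toy_alignments)
loci = cluster_into_loci(toy_models)
```

Models on opposite strands are kept in separate loci here even when their coordinates overlap; whether that matches the original clustering is not stated in the text.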