Brassica oleracea cultivar TO1000DH3 Whole Genome v1.0 Assembly & Annotation
Analysis Name | Brassica oleracea cultivar TO1000DH3 Whole Genome v1.0 Assembly & Annotation |
---|---|
Method | SOAPdenovo (v1.05) |
Source | A combination of Illumina and Roche 454 sequencing reads |
Date Performed | Thursday, May 22, 2014 |
Genome assembly
Initially, all Illumina and 454 reads were filtered for adapter contamination, PCR duplicates, ambiguous residues (N residues) and low quality regions. The initial backbone of the draft genome was assembled from Illumina reads using the de Bruijn graph-based assembler SOAPdenovo (version 1.05), run with a k-mer parameter of 47 and with each library ranked according to insert size from smallest to largest. The gaps within assembled scaffolds were filled with the short-insert paired-end (PE) reads using GapCloser (version 1.12). The resulting assembly consisted of a total of 35,436 contigs and short scaffolds, with a sequence span of 488 Mb and an N50 size of 265 kb. BAC end sequences for TO1434 were downloaded from NCBI (LIBGSS_011756) and trimmed for quality, ambiguous bases and adapter sequences. Bambus was used to overlay all the 454 mate-pair (MP) information and the BAC end sequence data onto the SOAPdenovo scaffolds to improve scaffold lengths, as described in [34]. In short, all 454 MP reads and BAC end sequence reads were aligned to the scaffolds using the Genomic Mapping and Alignment Program (GMAP). The output from GMAP was used to create a Bambus-compatible GDE-formatted contig file that indicated scaffold links. Redundant or multi-mapped mates, mates where only one read mapped, and those where both mates mapped to a single scaffold were considered invalid links. Links were considered in Bambus in ascending order of length, with scaffolding parameters including a redundancy level of 3 and a link size error of 5%. Any potentially ambiguous scaffolds were resolved using the 'untangle' utility of Bambus. Bambus was able to order, orient and merge 2,623 of these pre-assembled SOAPdenovo scaffolds into 646 superscaffolds, resulting in a greatly improved assembly with an N50 size of 850 kb.
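The link validity rules described above can be illustrated with a minimal Python sketch. It is not the pipeline that was actually run: the input layout (a dictionary from mate pair ID to the alignment hits of each read), the function names `valid_links` and `supported_joins`, and the reading of the redundancy level as a minimum number of supporting links per scaffold pair are all assumptions made for illustration.

```python
from collections import Counter


def valid_links(mate_hits):
    """Return one (scaffold_a, scaffold_b) tuple per valid mate pair.

    mate_hits maps a pair ID to (hits_read1, hits_read2); each hit is a
    (scaffold_name, position) tuple. This layout is hypothetical.
    """
    seen = set()
    links = []
    for pair_id in sorted(mate_hits):
        hits1, hits2 = mate_hits[pair_id]
        # Invalid: a read that did not map, or that mapped to several places.
        if len(hits1) != 1 or len(hits2) != 1:
            continue
        a, b = sorted((hits1[0], hits2[0]))
        # Invalid: both mates on the same scaffold (no scaffolding information).
        if a[0] == b[0]:
            continue
        # Invalid: redundant pair with an identical placement to an earlier one.
        if (a, b) in seen:
            continue
        seen.add((a, b))
        links.append((a[0], b[0]))
    return links


def supported_joins(links, min_links=3):
    """Keep scaffold pairs supported by at least min_links valid links."""
    counts = Counter(links)
    return {pair: n for pair, n in counts.items() if n >= min_links}
```

With a threshold of 3, a pair of scaffolds connected by only one or two surviving mate pairs would not be proposed as a join in this sketch.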
Construction of a high density genetic map and anchoring of the genome
A high-density genetic map representing nine linkage groups was constructed using a mapping population of 94 doubled haploid (DH) lines derived previously from a cross between TO1000 and Early Big. A total of 2,299 polymorphic loci (SNPs, simple sequence repeats and insertion/deletion polymorphisms) identified using the RAD approach were used to integrate assembled scaffolds with the genetic map. Collinearity between B. oleracea provisional pseudomolecules and A. thaliana and/or B. rapa was used to further assist with ordering and orientation of scaffolds in regions with a paucity of genetic recombination and markers. A total of 66 instances of false joins or insertions within Bambus superscaffolds were identified based on marker discontiguity and collinearity information. These scaffolds were split, and the correct position of each fragment was determined based on marker and collinearity information. Final scaffolds were renamed as ‘Scaffold’ and numbered sequentially based on their length from longest to shortest. The order and orientation of scaffolds within each pseudomolecule were determined based on marker order within each scaffold and the marker contiguity pattern between adjoining scaffolds. Scaffolds with too few markers were ordered and oriented using collinearity information. The final version of the draft genome, representing nine pseudochromosomes and 32,919 unanchored scaffolds, was collated using a custom Perl script, and the ordering and orientation information of scaffolds within each pseudochromosome was compiled in AGP files. The quality of the assembled genome was ascertained by performing several independent tests.
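As a rough illustration of marker-based anchoring, the sketch below orders the scaffolds of one linkage group by the mean genetic position of their markers and infers orientation from whether physical and genetic positions rise together. This is a simplified stand-in and not the custom Perl script mentioned above; the function names, the input layout and the scaffold names in the usage example are hypothetical. Scaffolds with fewer than two informative markers are flagged with `?`, corresponding to the cases that were resolved using collinearity information.

```python
def orient(markers):
    """Infer orientation from (scaffold_bp, map_cM) marker pairs.

    Returns "+" if genetic position increases with physical position,
    "-" if it decreases, and "?" if the markers cannot decide.
    """
    if len(markers) < 2:
        return "?"
    mean_bp = sum(bp for bp, _ in markers) / len(markers)
    mean_cm = sum(cm for _, cm in markers) / len(markers)
    # Sign of the covariance between physical and genetic positions.
    cov = sum((bp - mean_bp) * (cm - mean_cm) for bp, cm in markers)
    if cov > 0:
        return "+"
    if cov < 0:
        return "-"
    return "?"


def order_and_orient(scaffold_markers):
    """Order scaffolds of one linkage group by mean marker position (cM).

    scaffold_markers maps a scaffold name to its list of (bp, cM) markers.
    Scaffolds without any marker are left unanchored and skipped.
    """
    placed = []
    for name, markers in scaffold_markers.items():
        if not markers:
            continue
        mean_cm = sum(cm for _, cm in markers) / len(markers)
        placed.append((mean_cm, name, orient(markers)))
    placed.sort()
    return [(name, strand) for _, name, strand in placed]


# Usage with made-up marker data for three scaffolds.
example = {
    "Scaffold00002": [(1000, 12.0), (50000, 10.5), (90000, 9.8)],
    "Scaffold00001": [(500, 0.0), (70000, 3.1)],
    "Scaffold00003": [(200, 20.0)],
}
layout = order_and_orient(example)
```

The resulting list of (scaffold, orientation) pairs is the kind of information that an AGP file records for each pseudochromosome, alongside coordinates and gap lines that this sketch does not produce.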
Gene annotation
For accurate annotation of gene models, an integrated computational approach based on two major genome annotation pipelines, Maker and PASA, was adopted. Maker provides a simplified process for aligning expressed sequence tags (ESTs) and proteins to the genome, and integrates this external homology evidence with ab initio gene predictions to produce polished gene annotations with evidence-based quality statistics. Inputs for Maker included the repeat-masked B. oleracea genome assembly (masked against ‘te_proteins.fasta’ in the Maker package, which contains a generic list of common transposable elements (TEs)), PlantGDB ESTs from B. oleracea, B. rapa and B. napus, and the UniProt (SwissProt + TrEMBL) plant protein database. Ab initio gene predictions were made by Fgenesh and Augustus. Maker gene structure annotations were further updated by PASA using evidence from Sanger ESTs and multiple de novo RNA-seq assemblies. Post-Maker processing included splitting potentially fused genes, extending genes, resolving internal rearrangements, trimming overlapping genes, and removing proteins that were shorter than 50 amino acids and had no BLAST match to A. thaliana. The final annotation set contained a total of 59,225 gene models. Protein names were assigned using the AHRD pipeline, with names extracted from BLASTP hits in the TAIR v10, SwissProt and TrEMBL databases (queried in March 2012).
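The final filtering step can be expressed as a small predicate. The sketch below assumes that a gene model was removed only when both conditions held, that is, its protein was shorter than 50 amino acids and it had no BLAST match to A. thaliana; the text can also be read other ways, so this interpretation, the function name `keep_gene_model` and its parameters are illustrative assumptions and not part of the published pipeline.

```python
def keep_gene_model(protein_seq, has_arabidopsis_hit, min_length=50):
    """Return True if a predicted protein passes the length/homology filter.

    protein_seq: amino acid string; a trailing "*" stop symbol is ignored.
    has_arabidopsis_hit: whether a BLAST match to A. thaliana was found
    (how a "match" is scored is outside the scope of this sketch).
    """
    length = len(protein_seq.rstrip("*"))
    # Removed only if the protein is short AND lacks homology support.
    return length >= min_length or has_arabidopsis_hit
```

Under this reading, a 40-residue protein with an A. thaliana match is retained, while a 40-residue protein without one is dropped.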
Repeat annotation
A TE database compiled from de novo analysis of B. napus was merged with databases of TEs previously constructed from analysis of B. rapa and B. oleracea. The TEs were classified into major subclasses and superfamilies based on their structural features. Within each superfamily, elements sharing more than 80% sequence identity over more than 80% of their length, and over at least 80 bp, were considered to belong to the same family. The merged TE database was used for comparative repeat masking of the B. oleracea and B. rapa genomes.
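The family criterion above lends itself to a short worked sketch. The code below tests a pairwise alignment against the three thresholds and then groups elements by single-linkage clustering with a union-find structure. Two choices here are assumptions and not statements about how the database was actually built: coverage is measured against the shorter of the two elements, and families are formed transitively. The function names and the input layout are hypothetical.

```python
def same_family(identity, aligned_len, len_a, len_b,
                min_identity=0.80, min_coverage=0.80, min_aligned=80):
    """Apply the identity, coverage and aligned-length thresholds to one pair."""
    # Assumption: coverage is relative to the shorter element.
    coverage = aligned_len / min(len_a, len_b)
    return (identity > min_identity
            and coverage > min_coverage
            and aligned_len >= min_aligned)


def cluster_families(lengths, alignments):
    """Group elements into families by single-linkage clustering.

    lengths: dict of element name -> length in bp.
    alignments: iterable of (name_a, name_b, identity, aligned_len).
    Returns a sorted list of sorted member lists.
    """
    parent = {name: name for name in lengths}

    def find(x):
        # Follow parent pointers to the root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, aligned_len in alignments:
        if same_family(identity, aligned_len, lengths[a], lengths[b]):
            parent[find(a)] = find(b)

    families = {}
    for name in lengths:
        families.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in families.values())
```

Single linkage means two elements can share a family through an intermediate element even if they do not pass the thresholds directly; a stricter grouping would need a different clustering rule.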