Brassica oleracea cultivar 02-12 Whole Genome v1.1 Assembly & Annotation

Overview
Analysis Name: Brassica oleracea cultivar 02-12 Whole Genome v1.1 Assembly & Annotation
Method: SOAPdenovo (v1.04)
Source: A combination of Illumina Genome Analyser whole-genome shotgun and GS FLX Titanium sequencing reads
Date Performed: Friday, October 19, 2012

Genome assembly and validation

Reads were checked and filtered following the Illumina pipeline: low-quality reads, adaptor sequences and duplicates were removed. The filtered and corrected reads were then assembled with SOAPdenovo (v1.04) in three stages: contig construction, scaffold construction and gap filling. Finally, 20-kb-span paired-end data generated on the 454 platform and 105-kb-span BAC-end data downloaded from NCBI were used to extend scaffold length. The B. oleracea genome size was estimated from the distribution curve of 17-mer frequency.
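
The text above states only that the genome size was estimated from the 17-mer frequency distribution. A minimal sketch of one commonly used calculation follows, assuming genome size is approximated as the total number of k-mers divided by the depth of the main peak. The function name, the `min_depth` cutoff for discarding error k-mers and the toy histogram are illustrative assumptions, not values from the analysis.

```python
def estimate_genome_size(kmer_histogram, min_depth=3):
    """Approximate genome size as total k-mers / main peak depth.

    kmer_histogram maps a depth (how often a k-mer occurs in the reads)
    to the number of distinct k-mers observed at that depth.
    min_depth is a hypothetical cutoff that drops low-depth k-mers,
    which mostly come from sequencing errors.
    """
    usable = {depth: n for depth, n in kmer_histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")

    # Depth with the largest number of distinct k-mers (the main peak).
    peak_depth = max(usable, key=usable.get)

    # Total k-mer occurrences contributed by the retained depths.
    total_kmers = sum(depth * n for depth, n in usable.items())

    return total_kmers / peak_depth


# Toy histogram (made-up numbers, not B. oleracea data).
toy_histogram = {1: 1000, 2: 50, 19: 100, 20: 300, 21: 100}
toy_size = estimate_genome_size(toy_histogram)  # 10000 k-mers / depth 20 = 500.0
```

Real 17-mer histograms usually need more careful handling of heterozygous and repeat peaks than this sketch provides.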

To anchor the assembled scaffolds onto pseudo-chromosomes, a genetic map was developed using a doubled haploid population of 165 lines derived from an F1 cross between two homozygous lines, 02-12 (sequenced) and 0188 (re-sequenced). The genetic map contains 1,227 simple sequence repeat and single nucleotide polymorphism markers in nine linkage groups, which span a total of 1,180.2 cM with an average of 0.96 cM between adjacent loci. To position these markers on the scaffolds, marker primers were compared with the scaffold sequences using e-PCR (parameters -n2 -g1 -d 400-800), and the best-scoring match was chosen when a marker matched more than once.
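
The marker placement rule above (keep the best-scoring match when a marker hits several places) can be sketched as follows. The `Hit` record, the scaffold names and the scoring rule (fewest mismatches plus gaps) are assumptions made for illustration; the actual e-PCR output format and scoring are not described here. The last lines check the reported average spacing from the map length and marker count given in the text.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical, simplified record of one primer-pair match."""
    marker: str
    scaffold: str
    start: int
    mismatches: int
    gaps: int


def penalty(hit):
    # Assumed scoring: fewer mismatches and gaps means a better match.
    return hit.mismatches + hit.gaps


def best_hits(hits):
    """Return one hit per marker, keeping the lowest-penalty match."""
    best = {}
    for hit in hits:
        current = best.get(hit.marker)
        if current is None or penalty(hit) < penalty(current):
            best[hit.marker] = hit
    return best


# Made-up hits for two markers; marker m1 matches two scaffolds.
example_hits = [
    Hit("m1", "ScaffoldA", 1200, mismatches=2, gaps=1),
    Hit("m1", "ScaffoldB", 56000, mismatches=0, gaps=0),
    Hit("m2", "ScaffoldA", 98000, mismatches=1, gaps=0),
]
placed = best_hits(example_hits)  # m1 is placed on ScaffoldB, m2 on ScaffoldA

# Average spacing implied by the figures in the text: 1,180.2 cM over 1,227 markers.
mean_spacing_cm = round(1180.2 / 1227, 2)  # 0.96
```

Ties are resolved in favour of the first hit seen, which is another simplification of this sketch.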

The B. oleracea genome assembly was validated by comparing it with the published physical map constructed using 73,728 BAC clones and with a genetic map from B. napus. In addition, eleven Sanger-sequenced B. oleracea BAC sequences were used to assess the assembled genome with MUMmer (v3.22).

Gene prediction and annotation

Gene prediction was performed on the genome sequence after pre-masking for TEs, in the following steps: (i) De novo gene prediction used AUGUSTUS and GlimmerHMM with parameters trained on A. thaliana genes. (ii) For homology-based prediction, protein sequences from A. thaliana, O. sativa, C. papaya, V. vinifera and P. trichocarpa were mapped to the B. oleracea genome using tblastn with an E-value cutoff of 1e-5, and GeneWise (version 2.2.0) was used for gene annotation. (iii) For EST-aided annotation, Brassica ESTs from NCBI were aligned to the B. oleracea genome using BLAT (identity ≥ 0.95, coverage ≥ 0.90) and further assembled using PASA. Finally, all predictions were combined using GLEAN to produce the consensus gene set.

Functional annotation of B. oleracea genes was based on comparison with the SwissProt, TrEMBL, InterProScan and KEGG protein databases. tRNA genes were identified by tRNAscan-SE using default parameters, and rRNA sequences were aligned to the genome using blastn. Other non-coding RNAs, including miRNAs and snRNAs, were identified using INFERNAL by comparison with the Rfam database.
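
The EST alignment thresholds quoted above (identity ≥ 0.95, coverage ≥ 0.90) can be applied as a simple filter. The sketch below assumes identity is matches divided by aligned bases and coverage is aligned bases divided by EST length; the source does not define either measure, and the `EstAlignment` record is a hypothetical simplification rather than a BLAT output parser.

```python
from dataclasses import dataclass


@dataclass
class EstAlignment:
    """Hypothetical summary of one EST-to-genome alignment."""
    est_id: str
    est_length: int   # full length of the EST in bases
    matches: int      # aligned bases identical to the genome
    mismatches: int   # aligned bases that differ from the genome

    @property
    def aligned_bases(self):
        return self.matches + self.mismatches

    @property
    def identity(self):
        # Assumed definition: identical bases / aligned bases.
        return self.matches / self.aligned_bases if self.aligned_bases else 0.0

    @property
    def coverage(self):
        # Assumed definition: aligned bases / EST length.
        return self.aligned_bases / self.est_length if self.est_length else 0.0


def passes_filter(aln, min_identity=0.95, min_coverage=0.90):
    """Keep alignments that meet both thresholds from the text."""
    return aln.identity >= min_identity and aln.coverage >= min_coverage


# Made-up alignments: only the first meets both thresholds.
alignments = [
    EstAlignment("est1", 500, matches=470, mismatches=10),  # high identity, high coverage
    EstAlignment("est2", 500, matches=300, mismatches=5),   # coverage too low
    EstAlignment("est3", 500, matches=430, mismatches=50),  # identity too low
]
kept = [aln.est_id for aln in alignments if passes_filter(aln)]
```

Other definitions of identity and coverage (for example, ones that count gaps) would change which alignments pass.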

TE annotation

LTR-RTs were initially identified using the LTR_STRUC programme, and then manually annotated and checked based on structural characteristics and sequence homology. Refined intact elements were then used to identify other intact elements and solo LTRs. All LTR-RTs with clear boundaries and insertion sites were classified into superfamilies (Copia-like, Gypsy-like and unclassified retroelements) and families based on the internal protein sequence, 5' and 3' LTRs, primer-binding site and polypurine tracts. Non-LTR-RTs (long interspersed nuclear elements, LINEs, and short interspersed nuclear elements, SINEs) and DNA transposons (Tc1-Mariner, hAT, Mutator, Pong, PIF-Harbinger, CACTA and miniature inverted repeat TEs) were identified using conserved protein domains of reverse transcriptase or transposase as queries to search the assembled genome using tblastn. Upstream and downstream sequences of the candidate matches were then compared with each other to define their boundaries and structure. Helitron elements were identified by the HelSearch 1.0 programme and manually inspected. All TE categories were identified according to the criteria described previously. Typical elements of each category were selected and combined into a database for RepeatMasker analysis. Around 20× coverage of shotgun reads randomly sampled from the two Brassica genomes was masked with the same TE data set to confirm the different accumulation of TEs between the two genomes.

Reference

Liu, S., Liu, Y., Yang, X., Tong, C., Edwards, D., Parkin, I. A., ... & Wang, X. (2014). The Brassica oleracea genome reveals the asymmetrical evolution of polyploid genomes. Nature Communications, 5, 3930.

Download

All assembly and annotation files are available for download by selecting the desired data type in the left-hand sidebar. Each data type page provides a description of the available files and links to download them. Alternatively, you can browse all available files on the HTTP download repository.

Assembly

The Brassica oleracea cultivar 02-12 genome v1.1 assembly file:

Downloads

Brassica oleracea cultivar 02-12 genome v1.1 assembly: BOL.seq.lst.new.chr20110802_check.fa.gz

Gene Predictions

The Brassica oleracea cultivar 02-12 genome v1.1 gene prediction files:

Downloads

Predicted Genes (GFF3 file): BOL.seq.20110802.chr_check.gff.gz
CDS sequences (FASTA file): Scaffold.seq.110729_check.cds.gz
Protein sequences (FASTA file): Scaffold.seq.110729_check.pep.gz

Functional Analysis

The Brassica oleracea cultivar 02-12 genome v1.1 functional analysis files:

Downloads

InterPro result: Interpro
KEGG result: KEGG
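
The assembly and gene prediction files listed in the Download section are gzip-compressed FASTA and GFF3 files. A minimal sketch for inspecting them with the Python standard library follows, assuming the files have been downloaded to the working directory and follow the usual conventions (FASTA headers start with ">", GFF3 feature lines have nine tab-separated columns). The function names are illustrative.

```python
import gzip
from collections import Counter


def count_fasta_records(path):
    """Count sequences in a gzip-compressed FASTA file."""
    with gzip.open(path, "rt") as handle:
        # Each FASTA record starts with a header line beginning with ">".
        return sum(1 for line in handle if line.startswith(">"))


def count_gff3_feature_types(path):
    """Count features by type (column 3) in a gzip-compressed GFF3 file."""
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Skip directives, comments and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 9:
                counts[fields[2]] += 1
    return counts
```

For example, `count_fasta_records("Scaffold.seq.110729_check.pep.gz")` would report the number of predicted protein sequences, and `count_gff3_feature_types("BOL.seq.20110802.chr_check.gff.gz")` would summarise the feature types present in the annotation.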