Brassica oleracea cultivar 02-12 Whole Genome v1.1 Assembly & Annotation
Analysis Name | Brassica oleracea cultivar 02-12 Whole Genome v1.1 Assembly & Annotation |
---|---|
Method | SOAPdenovo (v1.04) |
Source | A combination of Illumina Genome Analyzer whole-genome shotgun and GS FLX Titanium sequencing reads |
Date Performed | Friday, October 19, 2012 |
Genome assembly and validation
Reads were checked and filtered following the Illumina pipeline: low-quality reads, adaptor sequences and duplicates were removed. The filtered and corrected reads were then assembled with SOAPdenovo (v1.04), covering contig construction, scaffold construction and gap filling. Finally, 20-kb-span paired-end data generated on the 454 platform and 105-kb-span BAC-end data downloaded from NCBI were used to extend scaffold length. The B. oleracea genome size was estimated from the distribution curve of 17-mer frequency.
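A k-mer based size estimate of this kind is commonly computed as the total number of k-mer occurrences divided by the depth at the main peak of the frequency curve. The following is a minimal sketch of that calculation, not the pipeline used for this assembly. The function name, the `min_depth` error cutoff and the toy histogram are all illustrative assumptions.

```python
def estimate_genome_size(histogram, min_depth=3):
    """Estimate genome size from a k-mer depth histogram.

    histogram maps depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are treated as sequencing errors (an assumption).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Main peak: the depth shared by the largest number of distinct k-mers.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences: each distinct k-mer counted once per copy.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram with made-up numbers (not B. oleracea data).
toy = {1: 5000, 2: 800, 19: 200, 20: 1000, 21: 200, 40: 50}
size, peak = estimate_genome_size(toy)
```

In practice the histogram would come from a k-mer counter run with k = 17 on the filtered reads.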
To anchor the assembled scaffolds onto pseudo-chromosomes, a genetic map was developed using a doubled haploid population of 165 lines derived from an F1 cross between two homozygous lines, 02-12 (sequenced) and 0188 (re-sequenced). The genetic map contains 1,227 simple sequence repeat and single nucleotide polymorphism markers in nine linkage groups, which span a total of 1,180.2 cM with an average of 0.96 cM between adjacent loci [16]. To position these markers on the scaffolds, marker primers were compared with the scaffold sequences using e-PCR (parameters -n2 -g1 -d 400-800), with the best-scoring match chosen when a marker had multiple matches.
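The rule of keeping one best-scoring match per marker can be sketched as follows. The `Hit` record and its field names are hypothetical, and "best-scoring" is assumed here to mean the fewest mismatches plus gaps. The source does not define the score, and this is not a parser for real e-PCR output.

```python
from collections import namedtuple

# Hypothetical record for one e-PCR hit; field names are illustrative.
Hit = namedtuple("Hit", "marker scaffold start mismatches gaps")


def best_hit_per_marker(hits):
    """Keep one hit per marker, preferring the fewest mismatches plus gaps."""
    best = {}
    for hit in hits:
        penalty = hit.mismatches + hit.gaps
        current = best.get(hit.marker)
        # Strict '<' means the first hit seen wins when penalties tie.
        if current is None or penalty < current.mismatches + current.gaps:
            best[hit.marker] = hit
    return best


# Made-up hits: marker m1 matches two scaffolds, m2 matches one.
hits = [
    Hit("m1", "scaffold_7", 1200, 2, 0),
    Hit("m1", "scaffold_3", 880, 0, 0),
    Hit("m2", "scaffold_3", 5400, 1, 1),
]
anchors = best_hit_per_marker(hits)
```

Here `anchors` maps each marker to a single scaffold position, which is the information needed to order scaffolds along a linkage group.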
The B. oleracea genome assembly was validated by comparing it with the published physical map constructed using 73,728 BAC clones and a genetic map from B. napus. Eleven Sanger-sequenced B. oleracea BAC sequences were used to assess the assembled genome using MUMmer-3.22.
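One simple summary of such a BAC-to-assembly comparison is the fraction of each BAC covered by at least one alignment. The sketch below merges overlapping alignment intervals and reports that fraction. It assumes the intervals have already been extracted from the aligner output as 1-based inclusive coordinates on the BAC. The function name and the example values are illustrative, and the source does not state which statistics were actually reported.

```python
def covered_fraction(intervals, length):
    """Fraction of a sequence covered by 1-based inclusive intervals."""
    if length <= 0:
        raise ValueError("length must be positive")
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end + 1:
            # Disjoint from the running block: close it and start a new one.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the running block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered / length


# Made-up alignment intervals on a hypothetical 1,000-bp BAC.
fraction = covered_fraction([(1, 400), (350, 600), (801, 1000)], 1000)
```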
Gene prediction and annotation
Gene prediction was performed on the genome sequence after pre-masking for TEs, in three steps: (i) De novo gene prediction used AUGUSTUS and GlimmerHMM with parameters trained on A. thaliana genes. (ii) For homologue prediction, protein sequences from A. thaliana, O. sativa, C. papaya, V. vinifera and P. trichocarpa were mapped to the B. oleracea genome using tblastn with an E-value cutoff of 10⁻⁵, and GeneWise (version 2.2.0) was used for gene annotation. (iii) For EST-aided annotation, Brassica ESTs from NCBI were aligned to the B. oleracea genome using BLAT (identity ≥ 0.95, coverage ≥ 0.90) and further assembled using PASA. Finally, all the predictions were combined using GLEAN to produce the consensus gene set. Functional annotation of B. oleracea genes was based on comparison with the SwissProt, TrEMBL, InterProScan and KEGG protein databases. The tRNA genes were identified by tRNAscan-SE using default parameters, and rRNA sequences were aligned to the genome using blastn. Other non-coding RNAs, including miRNAs and snRNAs, were identified using INFERNAL by comparison with the Rfam database.
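The numeric thresholds in steps (ii) and (iii) can be expressed as simple filters. This is a sketch under stated assumptions: the function names are hypothetical, hits at exactly the cutoff are assumed to pass, and identity and coverage are assumed to mean matches over aligned bases and aligned bases over EST length, since the source does not spell these definitions out.

```python
EVALUE_CUTOFF = 1e-5   # tblastn cutoff stated in the text
MIN_IDENTITY = 0.95    # BLAT identity threshold stated in the text
MIN_COVERAGE = 0.90    # BLAT coverage threshold stated in the text


def passes_protein_filter(evalue):
    """True if a tblastn hit meets the E-value cutoff."""
    return evalue <= EVALUE_CUTOFF


def passes_est_filter(matches, mismatches, est_length):
    """True if an EST alignment meets the identity and coverage thresholds.

    Identity = matches / aligned bases; coverage = aligned bases / EST length.
    Both definitions are assumptions made for this sketch.
    """
    aligned = matches + mismatches
    if aligned == 0 or est_length <= 0:
        return False
    identity = matches / aligned
    coverage = aligned / est_length
    return identity >= MIN_IDENTITY and coverage >= MIN_COVERAGE
```

For example, an EST of 500 bases with 480 matches and 10 mismatches passes both thresholds, while one with only 405 aligned bases fails on coverage.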
TE annotation
LTR-RTs were initially identified using the LTR_STRUC programme, and then manually annotated and checked based on structural characteristics and sequence homology. Refined intact elements were then used to identify other intact elements and solo LTRs. All the LTR-RTs with clear boundaries and insertion sites were classified into superfamilies (Copia-like, Gypsy-like and unclassified retroelements) and families based on the internal protein sequence, 5' and 3' LTRs, primer-binding site and polypurine tracts. Non-LTR-RTs (long interspersed nuclear elements, LINEs, and short interspersed nuclear elements, SINEs) and DNA transposons (Tc1-Mariner, hAT, Mutator, Pong, PIF-Harbinger, CACTA and miniature inverted-repeat TEs) were identified using conserved protein domains of reverse transcriptase or transposase as queries to search against the assembled genome using tblastn. Upstream and downstream sequences of the candidate matches were then compared with each other to define their boundaries and structure. Helitron elements were identified by the HelSearch 1.0 programme and manually inspected. All the TE categories were identified according to the criteria described previously. Typical elements of each category were selected and combined into a database for RepeatMasker analysis. Around 20× coverage of shotgun reads randomly sampled from the two Brassica genomes was masked with the same TE data set to confirm the different accumulation of TEs between the two genomes.
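Once sequences have been masked against the TE database, TE content can be summarised as the fraction of masked bases. The sketch below assumes soft-masked output, in which repeats appear in lowercase and all other bases in uppercase. The function name and the example sequence are illustrative and do not reproduce the original analysis.

```python
def masked_fraction(sequence):
    """Fraction of bases written in lowercase (soft-masked) in a sequence."""
    if not sequence:
        raise ValueError("sequence is empty")
    # Count only lowercase letters; uppercase bases are treated as unmasked.
    masked = sum(1 for base in sequence if base.islower())
    return masked / len(sequence)


# Made-up 20-base sequence with 6 soft-masked bases.
example = "ACGTacgtacACGTACGTAC"
fraction = masked_fraction(example)
```

Applying the same calculation to reads sampled from each genome gives comparable TE proportions, provided both sets are masked with the same database.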