Brassica rapa cultivar Chiifu Whole Genome v1.5 Assembly & Annotation
Analysis Name | Brassica rapa cultivar Chiifu Whole Genome v1.5 Assembly & Annotation |
---|---|
Method | SOAPdenovo |
Source | Paired short read sequences generated by Illumina GA II technology |
Date Performed | Friday, December 3, 2010 |
Genome assembly
Approximately 72-fold shotgun coverage was generated using Illumina GA II sequencing from short (~200 bp), medium (~500 bp) and long (~2 kb, 5 kb and 10 kb) insert libraries. The raw Illumina reads were filtered for duplicates, adaptor contamination and low quality before assembly into preliminary scaffolds using SOAPdenovo run with default parameters. Reads from the short insert size (≤500 bp) libraries were first assembled into contigs using k-mer overlap information in a de Bruijn graph, and the resulting contigs were kept unique by requiring an unambiguous path through the graph. This produced contigs with an N50 length of 1.1 kb and a total length of 222 Mb. The long insert size mate-paired libraries (≥2 kb) were not used at this stage because the chimaeric reads common to such libraries can generate incorrect sequence overlaps. After the unique contigs were obtained, all available paired-end reads were mapped to them to connect adjacent contigs. To avoid interleaving and to reduce the impact of insert-size deviation in any one sequencing library, a hierarchical assembly method was used: scaffolds were constructed step by step, adding data from each library separately in order of insert size from smallest to largest. This yielded scaffolds with an N50 length of 347 kb and a total genome length of 288 Mb. Because most of the remaining gaps between contigs probably occur in repetitive regions, paired-end reads with only one end mapped to a unique contig were identified, and local assembly was performed with the unmapped ends to fill small gaps within the scaffolds. In total, 32 Mb of gaps were closed, and the resulting assembly had a final contig N50 length of 27 kb. A total of 199,452 BAC-end Sanger sequences retrieved from http://www.brassica-rapa.org/BRGP/bacEndList.jsp were used to construct the super scaffolds. The gaps within the scaffolds were filled in as previously described.
The expected genome size of B. rapa was estimated from the distribution of 17-mer depth in the filtered sequence data, using methods previously described. The 17-mer depth peaked at 15-fold, and a total of 7,287,899,150 17-mers were obtained. Dividing the total number of 17-mers by the peak depth gave an estimated genome size of 485 Mb.
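The two summary quantities used in this section, N50 and the k-mer-based genome size estimate, can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the pipeline used for the assembly: the function names are hypothetical, and the contig lengths in the N50 example are toy values chosen only for illustration. The 17-mer count and peak depth are the figures quoted above.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must not be empty")


def genome_size_from_kmers(total_kmers, peak_depth):
    """Estimate genome size in bp as total k-mer count / peak k-mer depth."""
    return total_kmers / peak_depth


# Figures quoted in the text above.
TOTAL_17MERS = 7_287_899_150
PEAK_DEPTH = 15

estimate_bp = genome_size_from_kmers(TOTAL_17MERS, PEAK_DEPTH)
print(f"Estimated genome size: {estimate_bp / 1e6:.1f} Mb")

# Toy contig lengths (hypothetical, for illustration only).
print("Toy N50:", n50([100, 80, 60, 40, 20]))
```

With the quoted inputs the division gives roughly 485.9 Mb, consistent with the 485 Mb reported above.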
Validation of assembly
NUCmer was used to compare the sequence of chromosome A03 assembled here by whole-genome shotgun sequencing (WGS A03) with the previously reported sequence of the same chromosome assembled by BAC Sanger sequencing (BAC A03). The total sizes of WGS A03 and BAC A03 are approximately 31.72 Mb and 32.70 Mb, respectively, with slightly more repeat sequences assembled using the BAC approach (9.82 Mb in BAC A03 and 5.68 Mb in WGS A03). More gaps were observed in BAC A03 (1,035 gaps totalling 1,358,889 bp) than in WGS A03 (858 gaps totalling 844,319 bp). Forty-four obvious inversions (>1 kb) were identified between the two assemblies. For 38 of these inversions, evidence from the mapped paired ends, the depth of the mapped reads and gaps at the boundaries supported the WGS assembly; the remaining 6 inversions were ambiguous. To evaluate the accuracy of the assembly on a local scale, the sequences of 647 complete BAC clones (phase 2 and phase 3) that had been deposited in NCBI and had been genetically mapped (see URLs) were compared with their equivalent WGS sequences.
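Gap statistics of the kind quoted above (number of gaps and total gap size) can be tallied directly from an assembled sequence. The sketch below assumes that gaps are represented as runs of `N` characters, which is a common convention but is not stated in the text; `gap_stats` is a hypothetical helper and the toy sequence is for illustration only.

```python
import re


def gap_stats(sequence, min_run=1):
    """Count gaps (runs of N/n) and their total length in a sequence.

    Assumption: a gap is any run of at least `min_run` consecutive N's.
    Returns (number_of_gaps, total_gap_bp).
    """
    runs = [m.end() - m.start() for m in re.finditer(r"[Nn]+", sequence)]
    runs = [r for r in runs if r >= min_run]
    return len(runs), sum(runs)


# Toy sequence with two gaps of 4 bp and 2 bp.
toy = "ACGTNNNNACGTNNACGT"
print(gap_stats(toy))

# Mean gap size implied by the figures quoted in the text.
for name, n_gaps, gap_bp in [("BAC A03", 1035, 1_358_889),
                             ("WGS A03", 858, 844_319)]:
    print(f"{name}: mean gap size {gap_bp / n_gaps:.0f} bp")
```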
Integration of shotgun assembly with genetic maps
The scaffolds were anchored to the B. rapa genetic linkage map using 1,427 uniquely aligned markers from an integrated linkage map developed from four populations. In addition, 1,054 markers mapped to the B. napus A genome were used to verify and aid the alignment. Chromosomes were orientated by alignment to the reference A genome linkage groups from Parkin et al. (ref. 42; equivalent to N1-N10). Where genetic information was not available from Brassica maps, scaffold order and/or orientation was inferred from evidence of conserved collinearity with the A. thaliana gene order.
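The text does not spell out how marker positions were turned into scaffold order and orientation. As an illustration only, one simple heuristic is to order scaffolds by the mean genetic position of their markers, and to orient each scaffold by whether physical position and genetic position increase together. Everything in the sketch below (function names, data layout, toy values) is hypothetical and should not be read as the authors' method.

```python
from statistics import mean


def orient_scaffold(markers):
    """Infer orientation from (scaffold_bp, genetic_cM) marker pairs.

    Returns '+' if cM increases with bp, '-' if it decreases, and '?'
    when there are too few markers or no trend.
    """
    if len(markers) < 2:
        return "?"
    mean_bp = mean(bp for bp, _ in markers)
    mean_cm = mean(cm for _, cm in markers)
    cov = sum((bp - mean_bp) * (cm - mean_cm) for bp, cm in markers)
    if cov > 0:
        return "+"
    if cov < 0:
        return "-"
    return "?"


def order_scaffolds(scaffold_markers):
    """Order scaffolds on one linkage group by mean marker position (cM)."""
    placed = {name: m for name, m in scaffold_markers.items() if m}
    return sorted(placed, key=lambda name: mean(cm for _, cm in placed[name]))


# Toy data: scaffold name -> list of (position on scaffold in bp, cM).
toy = {
    "scafA": [(1_000, 12.0), (50_000, 15.5)],
    "scafB": [(2_000, 5.0), (90_000, 1.2)],
    "scafC": [(300, 20.0)],
}
for name in order_scaffolds(toy):
    print(name, orient_scaffold(toy[name]))
```

Scaffolds with a single marker, such as `scafC` here, can be placed but not oriented, which is the situation in which the text falls back on collinearity with A. thaliana.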
Protein coding gene annotation
In addition to available Brassica EST data (downloaded from dbEST at NCBI 10 July 2010), a total of 27.1 million Illumina RNA-Seq paired-end reads were generated to verify the predicted gene models: 19.9 million from Chiifu-401-42 and 7.2 million from a Caixin accession, L58. For Chiifu-401-42, equally mixed total RNA isolated from eight different tissues and growth conditions was used: leaves, roots and floral stems from plants grown in pots; 2-week-old etiolated seedlings; shoots from plants grown hydroponically under normal conditions; and leaves from plants treated with 0.5% NaCl at 4 °C and 37 °C for 24 h. For L58, equally mixed total RNA was isolated from similar tissues with the addition of germinating seeds, callus and pods. The genome assembly was premasked for class I and class II transposable elements, and Genscan and Augustus were used to carry out de novo predictions with gene model parameters trained from A. thaliana. Genes with less than 150 bp of coding sequence were filtered out. For homology-based gene prediction, A. thaliana, C. papaya, Populus trichocarpa, V. vinifera and Oryza sativa protein sequences were aligned to the B. rapa genome using TBLASTN (at an E value of 1 × 10⁻⁵) for fast alignment and Genewise for precise alignment. The Unigene sequences of B. rapa and the Brassica ESTs downloaded from NCBI were aligned to the B. rapa genome using BLAT and assembled by PASA based on genomic location. Because fragmentary exons in the EST data could produce spurious alignments, results containing any intron longer than 10,000 bp were filtered out. GLEAN was used to combine the de novo and homology-based gene sets, incorporating the expressed sequence data described above as supporting evidence. In addition, predicted B. rapa proteins that aligned to the Repbase transposable element protein database (E value 1 × 10⁻⁵ at ≥50%) were filtered out.
The B. rapa predicted proteins were annotated based on alignment to the Swiss-Prot and TrEMBL databases with BLASTP at an E value of 1 × 10⁻⁵. InterPro was used to annotate motifs and domains by comparison with publicly available databases including Pfam, PRINTS, PROSITE, ProDom and SMART. The Gene Ontology information for each gene was extracted from the InterPro results. To identify and estimate the number of potential orthologous gene families between B. rapa, V. vinifera, A. thaliana and C. papaya, the OrthoMCL pipeline was applied using standard settings (BLASTP E value < 1 × 10⁻⁵) to compute the all-against-all similarities.
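Two of the filters described in the gene annotation procedure are simple length thresholds: de novo predictions with less than 150 bp of coding sequence were dropped, and EST alignments with any intron longer than 10,000 bp were dropped. The sketch below shows one way such filters could be expressed. It assumes exons are given as 1-based inclusive `(start, end)` pairs on one strand; the function names and toy coordinates are hypothetical, and the actual tools and file formats used are not described in the text.

```python
MIN_CDS_BP = 150        # predictions with less coding sequence are dropped
MAX_INTRON_BP = 10_000  # alignments with a longer intron are dropped


def cds_length(exons):
    """Total length of coding exons given as 1-based inclusive (start, end)."""
    return sum(end - start + 1 for start, end in exons)


def intron_lengths(exons):
    """Lengths of the introns between consecutive exons."""
    ordered = sorted(exons)
    return [nxt[0] - prev[1] - 1 for prev, nxt in zip(ordered, ordered[1:])]


def keep_prediction(cds_exons):
    """Keep a de novo gene model only if its CDS is at least MIN_CDS_BP."""
    return cds_length(cds_exons) >= MIN_CDS_BP


def keep_est_alignment(exons):
    """Keep an EST alignment only if no intron exceeds MAX_INTRON_BP."""
    return all(length <= MAX_INTRON_BP for length in intron_lengths(exons))


# Toy examples (hypothetical coordinates).
print(keep_prediction([(1, 100), (201, 250)]))         # 150 bp of CDS
print(keep_prediction([(1, 149)]))                     # 149 bp of CDS
print(keep_est_alignment([(1, 100), (10_102, 10_200)]))  # 10,001 bp intron
```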