Brassica rapa cultivar Chiifu Whole Genome v3.0 Assembly & Annotation
Analysis Name | Brassica rapa cultivar Chiifu Whole Genome v3.0 Assembly & Annotation |
---|---|
Method | Canu (v1.5) |
Source | A combination of PacBio, BioNano and Hi-C sequencing reads |
Date Performed | Thursday, September 13, 2018 |
Genome assembly
To estimate genome size, six biological replicates of B. rapa (accession Chiifu-401-42) were analyzed by flow cytometry, using rice (O. sativa ssp. japonica cv. Nipponbare)25 as an internal reference (Supplementary Table S1). Previously, the genome size of an unknown ecotype of B. rapa was estimated at 529 Mb by flow cytometry without control analysis15, and that of Chinese cabbage (accession Chiifu-401-42) at 485 Mb using 17-mer analysis2. The size estimated here for the B. rapa genome is very close to that of the BioNano consensus map, suggesting that previous studies may have overestimated its size.
The same B. rapa L. ssp. pekinensis inbred line (Chiifu-401-42) used for the earlier assemblies v1.5 and v2.5 was used for whole-genome sequencing in this study. High-quality genomic DNA from 500 mg of frozen leaf tissue was used to generate PacBio libraries with an insert size of 20 kb. The libraries were then sequenced in four Sequel cells (Pacific Biosciences, CA, USA). Approximately 19.40 Gb of newly generated data and another 6.5 Gb of previous PacBio data6 were incorporated into our genome assembly. Next, the PacBio subreads were de novo assembled using Canu (v1.5)26 with default parameters. The Illumina reads obtained from BRAD (http://brassicadb.org) were mapped to the PacBio contigs using BWA (v0.7.15)27. This alignment was then used to polish and correct the assembly with Pilon (v1.22)28.
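The assemble, map, and polish steps above can be sketched as a list of shell commands. This is a minimal illustration, not the authors' actual invocation: the file names, the `brapa` prefix, the `genomeSize=485m` value (Canu requires a size estimate; 485 Mb is the earlier 17-mer figure), the Pilon jar name, and the use of samtools for sorting and indexing are all assumptions. The function only builds the command strings and runs nothing.

```python
import shlex


def assembly_commands(pacbio_reads, illumina_r1, illumina_r2,
                      genome_size="485m", prefix="brapa"):
    """Build (but do not run) commands for a Canu -> BWA -> Pilon workflow.

    All paths, the prefix, and genome_size are hypothetical placeholders.
    """
    q = shlex.quote
    outdir = f"{prefix}_canu"
    contigs = f"{outdir}/{prefix}.contigs.fasta"  # contig file written by Canu 1.x
    bam = f"{prefix}.illumina.sorted.bam"
    reads = " ".join(q(r) for r in pacbio_reads)
    return [
        # 1. De novo assembly of PacBio subreads (other parameters left at defaults)
        f"canu -p {q(prefix)} -d {q(outdir)} genomeSize={genome_size} -pacbio-raw {reads}",
        # 2. Map Illumina reads to the contigs
        f"bwa index {q(contigs)}",
        f"bwa mem {q(contigs)} {q(illumina_r1)} {q(illumina_r2)} | samtools sort -o {q(bam)} -",
        f"samtools index {q(bam)}",
        # 3. Polish the contigs using the alignment
        f"java -jar pilon-1.22.jar --genome {q(contigs)} --frags {q(bam)} --output {q(prefix + '_pilon')}",
    ]


if __name__ == "__main__":
    for cmd in assembly_commands(["subreads.fastq"], "R1.fastq", "R2.fastq"):
        print(cmd)
```

Returning plain strings keeps the sketch inspectable; in practice these steps would usually live in a workflow manager or shell script with explicit thread and memory settings.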
The Hi-C libraries of B. rapa were constructed following the procedures described in a previous study with minor modifications29. The resulting libraries were sequenced on an Illumina HiSeq 4000 instrument with 2 × 125 bp reads. Overall, we obtained ~584 million usable paired-end reads from two biological replicates. After alignment, ~27.87% of these read pairs could be uniquely mapped to the initial contigs.
The optical mapping (BioNano) data were generated using the BioNano Genomics Irys system (BioNano Genomics, CA, USA). The high-molecular-weight DNA was labeled with the nicking enzyme Nt.BspQ1 (New England Biolabs, MA, USA) using the IrysPrep Reagent Kit (BioNano Genomics, CA, USA) as described by the manufacturer. Molecules were then filtered by a minimum length of 100 kb and a signal-to-noise ratio of 3.5. The filtered molecules were de novo assembled into a consensus physical genome map with the BNG IrysView analysis software package, using manufacturer-recommended parameters for B. rapa (molecular length threshold: 100 kb; minimum labels per molecule: 8; maximum backbone intensity: 0.6; false positive density/100 kb: 1.5; false negative rate: 0.15%; scaling SD: 0; site SD: 0.2 kb; relative SD: 0.03; initial assembly p value cutoff: 1e-8; extension and refinement p value cutoff: 1e-9; merge p value cutoff: 1e-12; with autonoise adjustment and 4 iterations of computation).
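The molecule filtering step can be illustrated with a short sketch. It assumes that the signal-to-noise ratio of 3.5 is a minimum threshold, and the `Molecule` record and its field names are hypothetical; the real data would come from the Irys molecule files rather than from hand-built objects.

```python
from dataclasses import dataclass


@dataclass
class Molecule:
    """Hypothetical record for one optically mapped DNA molecule."""
    mol_id: int
    length_kb: float
    snr: float  # signal-to-noise ratio of the label signal


def filter_molecules(molecules, min_length_kb=100.0, min_snr=3.5):
    """Keep molecules at least min_length_kb long with SNR >= min_snr.

    Treating 3.5 as a lower bound is an assumption about the text.
    """
    return [m for m in molecules
            if m.length_kb >= min_length_kb and m.snr >= min_snr]


if __name__ == "__main__":
    example = [
        Molecule(1, 250.0, 4.2),  # passes both thresholds
        Molecule(2, 80.0, 5.0),   # too short
        Molecule(3, 150.0, 3.0),  # SNR too low
    ]
    print([m.mol_id for m in filter_molecules(example)])
```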
Genome annotation
We named the newly annotated gene models following the standards of gene model nomenclature for Brassica reference genomes (http://www.brassica.info/info/genome_annotation.php): Bra (for Brassica rapa) followed by the chromosome number and the letter “g” (for gene). Genes were numbered from the top to the bottom of each chromosome (in steps of 10) as five-digit numbers padded with leading zeros. To distinguish the genes in v3.0 from those of other lines of B. rapa, the number “3” (for the third version of the B. rapa reference genome) and a single capital letter “C” (for variety Chiifu-401-42) were appended after a “.” following the gene number; for example, BraA05g036760.3C.
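The naming scheme can be made concrete with a small parser and formatter. This is a sketch under stated assumptions: the regular expression only covers genes placed on A-genome chromosomes (such as A05), and because the example ID carries six digits, the formatter assumes that an ordinal position is multiplied by the step of 10 and zero-padded to six characters. The function names are hypothetical.

```python
import re

# Pattern inferred from the example BraA05g036760.3C; chromosome form A## is an assumption.
GENE_ID = re.compile(
    r"^Bra(?P<chrom>A\d{2})g(?P<number>\d+)\.(?P<version>\d+)(?P<line>[A-Z])$"
)


def parse_gene_id(gene_id):
    """Split a v3.0-style gene ID into its components."""
    m = GENE_ID.match(gene_id)
    if m is None:
        raise ValueError(f"not a recognized gene ID: {gene_id!r}")
    return {
        "chromosome": m["chrom"],
        "number": int(m["number"]),
        "version": int(m["version"]),
        "line": m["line"],
    }


def format_gene_id(chromosome, ordinal, version=3, line="C", step=10):
    """Build an ID from a 1-based ordinal position along the chromosome.

    Assumes number = ordinal * step, zero-padded to six digits.
    """
    return f"Bra{chromosome}g{ordinal * step:06d}.{version}{line}"


if __name__ == "__main__":
    print(parse_gene_id("BraA05g036760.3C"))
    print(format_gene_id("A05", 3676))
```

Numbering in steps of 10 leaves room to insert newly discovered genes between existing ones without renumbering the rest of the chromosome.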
After gene prediction, gene functions were assigned according to the best match of the alignments against various protein databases, including the KEGG33, Swiss-Prot, and TrEMBL databases34, using BLAST v2.2.31 (E-value = 1e-5). GO terms for each gene were obtained from the corresponding InterPro entries35. Overall, 44,539 genes (96.86%) were functionally annotated based on the results of searching the protein databases (Supplementary Table S18).
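Selecting the best match per gene can be sketched as follows. The example assumes BLAST results in the standard 12-column tabular format (`-outfmt 6`), treats 1e-5 as an upper bound on the E-value, and ranks hits by lowest E-value and then highest bit score; the source does not state the exact ranking rule, so that choice is an assumption.

```python
def best_hits(tabular_lines, max_evalue=1e-5):
    """Return {query_id: subject_id} for the best hit of each query.

    Expects BLAST tabular columns: qseqid sseqid pident length mismatch
    gapopen qstart qend sstart send evalue bitscore.
    """
    best = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit fails the E-value cutoff
        rank = (evalue, -bitscore)  # smaller is better
        if query not in best or rank < best[query][0]:
            best[query] = (rank, subject)
    return {query: subject for query, (_, subject) in best.items()}


if __name__ == "__main__":
    # Hypothetical hits for illustration only.
    hits = [
        "geneA\tprot1\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-30\t200",
        "geneA\tprot2\t95.0\t100\t2\t0\t1\t100\t1\t100\t1e-50\t300",
        "geneB\tprot3\t30.0\t50\t30\t2\t1\t50\t1\t50\t0.01\t30",
    ]
    print(best_hits(hits))
```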
Intact LTR-RTs were identified using LTR_finder36 and classified by predicting their RT domains using the Pfam database (version 26.0) and HMMER software37. Muscle38 was then employed to perform multiple RT sequence alignments, and RAxML39 was adopted to construct maximum likelihood (ML) trees based on the sequence alignments with 500 bootstrap replications. Finally, the interactive tree of life (iTOL)40 was used to plot the ML trees. The analysis of LTR insertion time was performed as previously reported4.
We also performed noncoding RNA annotation for our assembly. tRNA annotation was conducted using tRNAscan-SE (v1.3.1)41 according to its structural characteristics. Homology-based rRNAs were localized by mapping known full-length plant rRNAs to the B. rapa genome v3.0. snRNAs were predicted by Infernal (v1.1)42 using the Rfam database43. miRNA annotation was performed as previously described44.