Brassica juncea var. tumida Whole Genome v1.5 Assembly & Annotation
Analysis Name | Brassica juncea var. tumida Whole Genome v1.5 Assembly & Annotation |
---|---|
Method | ALLPATHS-LG |
Source | A combination of shotgun reads, PacBio long reads and BioNano sequencing reads |
Date Performed | Sunday, July 10, 2016 |
De novo assembly
The genome was assembled with ALLPATHS-LG. All corrected PacBio RS II reads were then used to fill gaps with PBjelly_V15.2.20. The RefAligner utility in IrysView was used to align Irys molecules to the draft assemblies and correct chimeric scaffold errors. Finally, the corrected scaffolds were anchored to the genomic (optical) maps assembled from BioNano data. This generated assembly v1.0.
Genome quality evaluation
CEGMA v.2.3 was used to search the assembly for 458 conserved core eukaryotic genes (CEG database) to assess the B. juncea genome assembly. The assembled genome was also validated by mapping 23,002 ESTs (length ≥ 500 bp) downloaded from NCBI. To assess the accuracy of the B. juncea genome, we aligned 10 randomly selected PacBio subreads longer than 40 kb and checked the paired-end relationships using SOAP.
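The EST validation step described above amounts to a length filter followed by a mapped-fraction count. A minimal sketch follows, assuming the ESTs are available as FASTA text and that the IDs of successfully mapped ESTs come from whatever aligner is used (the aligner is not named in this section). The function names `read_fasta`, `long_ests` and `mapped_fraction` are illustrative, not part of any published pipeline.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {sequence_id: sequence} dict."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Use the first whitespace-delimited token as the ID.
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def long_ests(records, min_length=500):
    """Keep only ESTs whose length is at least min_length bp."""
    return {n: s for n, s in records.items() if len(s) >= min_length}


def mapped_fraction(est_ids, mapped_ids):
    """Fraction of the given ESTs that appear in the set of mapped IDs."""
    est_ids = set(est_ids)
    if not est_ids:
        return 0.0
    return len(est_ids & set(mapped_ids)) / len(est_ids)
```

The 500 bp default mirrors the length cutoff stated above; the mapped fraction is only as meaningful as the alignment criteria used to decide that an EST is "mapped", which are not given here.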
Genetic map and pseudo-chromosome construction
A reference genetic map of B. juncea was constructed by genotyping-by-resequencing of 100 individuals from an F2 population [39]. After the resequencing reads were aligned with BWA [40], potential SNPs were identified with GATK v3.4. Pairwise recombination was calculated for this marker set on each scaffold; adjacent SNPs with a pairwise recombination rate below 0.001 were lumped into a genetic bin, and bins showing significantly distorted segregation (chi-squared test, P < 0.01) were excluded. The final set of bin markers was grouped into 18 linkage groups using HighMap. ALLMAPS was used to construct the initial pseudo-chromosomes of B. juncea from scaffolds by integrating the genetic map constructed in the present study (T84/DTC) with a published genetic map (SY/PM) [23]. The BjuA and BjuB subgenomes of B. juncea were assigned with reference to the final genetic map.
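The binning and segregation-distortion filter described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline actually used: it assumes codominant markers with an expected 1:2:1 genotype ratio in the F2 population, takes the recombination rates between adjacent markers as precomputed input, and uses the hypothetical function names `segregation_p_value` and `lump_into_bins`.

```python
import math


def segregation_p_value(n_aa, n_ab, n_bb):
    """Chi-squared goodness-of-fit P value against a 1:2:1 ratio (2 d.f.)."""
    total = n_aa + n_ab + n_bb
    if total == 0:
        raise ValueError("no genotype calls")
    # Assumed expectation for a codominant marker in an F2 population.
    expected = (total / 4, total / 2, total / 4)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 2 degrees of freedom the chi-squared survival function is exp(-x/2).
    return math.exp(-chi2 / 2)


def lump_into_bins(markers, adjacent_rates, threshold=0.001):
    """Group ordered markers into bins.

    adjacent_rates[i] is the recombination rate between markers[i] and
    markers[i + 1]; a new bin starts whenever that rate is >= threshold.
    """
    if not markers:
        return []
    if len(adjacent_rates) != len(markers) - 1:
        raise ValueError("need one rate per adjacent marker pair")
    bins = [[markers[0]]]
    for marker, rate in zip(markers[1:], adjacent_rates):
        if rate < threshold:
            bins[-1].append(marker)
        else:
            bins.append([marker])
    return bins
```

A bin would then be dropped when `segregation_p_value(...) < 0.01` for its genotype counts, matching the cutoff stated above. How the counts for a whole bin are derived from its member SNPs is not specified in this section.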
Genome annotation
The repetitive sequences of the B. juncea genome were identified with a combination of de novo and homology-based strategies. Four de novo programs, RepeatScout, LTR-FINDER, MITE and PILER, were used to generate the initial repeat library. The initial repeat database was classified into classes, subclasses, superfamilies and families by the PASTEClassifier with REPET. We then merged transposable element (TE) sequences of Brassica species with the Repbase database to construct a new repeat database, and used RepeatMasker to identify repeat sequences in the genome assembly. Genes were annotated iteratively using three main approaches: homology-based, de novo and EST/unigene-based. The results of these three methods were integrated by GLEAN to obtain high-confidence gene models. In addition, an RNA-seq-based method, mapping transcriptome data to the reference genome with TopHat and assembling transcripts with Cufflinks, was adopted to obtain gene structures and identify new genes.
tRNAscan-SE (version 1.23) was used to detect reliable tRNA positions. Noncoding RNAs were predicted by the Infernal program using default parameters. By comparing secondary-structure similarity between the B. juncea sequences and the Rfam (v12.0) database, the noncoding RNAs were classified into different families.
A stringent strategy was used to identify new TEs in the BjuA subgenome. The same strategy was used to identify new TEs in the subgenomes of B. juncea and B. napus relative to their corresponding ancestral genomes after divergence from a common ancestor.
All-against-all BLASTP (E = 1 × 10⁻⁵) was performed, and the BLASTP hits were chained by QUOTA-ALIGN (cscore = 0.5) with the ‘1:1 synteny screen’ to call synteny blocks. Because of the whole-genome triplication in the evolutionary history of Brassica, the ‘1:3 synteny screen’ model of QUOTA-ALIGN (cscore = 0.5) was used to identify synteny blocks between A. thaliana and Brassica. All gene losses were calculated based on the Brassica ancestor common gene sets of each species. Gene loss was also identified in the other Brassica subgenomes (BniB, BjuB, BolC, BnaA and BnaC).
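Two steps in the paragraph above lend themselves to a short sketch: applying the E-value cutoff to BLASTP hits and counting gene loss against an ancestral gene set. The code below is illustrative only. It assumes hits in the standard 12-column BLAST tabular format (E-value in column 11), and it models gene loss as a plain set difference between a hypothetical ancestral gene set and the ancestral genes retained in each subgenome. The synteny chaining done by QUOTA-ALIGN is not reproduced, and the function names and gene IDs are made up for the example.

```python
def filter_hits(lines, max_evalue=1e-5):
    """Return (query, subject) pairs from tabular BLAST lines passing the cutoff."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            # Skip comments, blank lines and malformed rows.
            continue
        # Column 11 (index 10) holds the E-value in the standard tabular layout.
        if float(fields[10]) <= max_evalue:
            kept.append((fields[0], fields[1]))
    return kept


def gene_loss(ancestral_genes, retained_by_subgenome):
    """Map each subgenome to the sorted ancestral genes it no longer retains."""
    ancestral = set(ancestral_genes)
    return {
        name: sorted(ancestral - set(retained))
        for name, retained in retained_by_subgenome.items()
    }


# Example with hypothetical ancestral gene IDs:
losses = gene_loss(
    {"anc1", "anc2", "anc3"},
    {"BjuA": ["anc1", "anc3"], "BjuB": ["anc1", "anc2", "anc3"]},
)
```

In this toy example `losses` records one lost gene for BjuA and none for BjuB. In the real analysis, "retained" would be decided from the synteny blocks rather than from a hand-written list.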