Brassica nigra cultivar YZ12151 Genome v1.1 Assembly & Annotation
Analysis Name | Brassica nigra cultivar YZ12151 Genome v1.1 Assembly & Annotation |
---|---|
Method | ALLPATHS-LG |
Date Performed | Sunday, July 10, 2016 |
Genome assembly
For the B. nigra genome assembly, high-quality Illumina reads were assembled de novo with ALLPATHS-LG using default parameters. GapCloser (v1.12, for SOAPdenovo) was then used to fill gaps and improve scaffold quality using the short-insert paired-end libraries (insert size < 1 kb).
Genome quality assessment
We used CEGMA v2.3, which includes 458 conserved core eukaryotic genes (CEG database), to assess the completeness of the final B. nigra genome assembly. The assembly was also validated by mapping 18,344 ESTs (length >= 500 bp) downloaded from NCBI GenBank to the genome. To assess the accuracy of the assembly built from HiSeq sequencing data, we randomly downloaded 15 B. nigra BAC sequences from GenBank. First, the 15 BAC sequences were anchored to the assembly using blastn. Then, the blastn hits were chained into larger syntenic regions to identify the corresponding scaffolds for each BAC. Finally, coverage and identity were calculated for each subread and BAC sequence using the formulas coverage = alignment length / BAC or subread length, and identity = matched length / BAC or subread length without gaps. Furthermore, to inspect the paired-end relationships, the mate-pair reads (3/5/10 kb libraries) were mapped to the whole assembly using SOAP.
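The two formulas above can be sketched in a few lines of Python. This is only an illustration, not the pipeline that was actually run: the function names are hypothetical, the alignment length is assumed to be the number of BAC bases covered by the chained blastn blocks (overlapping blocks counted once), and the "length without gaps" is taken as an input because the text does not say how it was derived.

```python
def merged_length(intervals):
    """Bases covered by (start, end) intervals, 1-based inclusive, overlaps counted once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Close the previous merged interval (if any) and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent block: extend the current interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def coverage(alignment_length, seq_length):
    """coverage = alignment length / BAC or subread length."""
    return alignment_length / seq_length


def identity(matched_length, ungapped_length):
    """identity = matched length / BAC or subread length without gaps (assumed input)."""
    return matched_length / ungapped_length


# Hypothetical chained blocks on a 500 bp sequence.
blocks = [(1, 100), (51, 150), (201, 300)]
aligned = merged_length(blocks)
print(coverage(aligned, 500), identity(240, aligned))
```

Merging the blocks before dividing keeps the coverage value at or below 1 when blastn reports overlapping hits for the same region.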
Genetic maps and pseudo-chromosome construction of B. nigra
ALLMAPS was used to construct the initial pseudo-chromosomes of B. nigra from the scaffolds, using the T84/DTC linkage groups.
Repeats annotation
Repeat sequences in the B. nigra genome were identified with a combination of de novo and homology-based strategies. The results from four de novo programs (RepeatScout, LTR-FINDER, MITE and PILER) were merged to form the initial repeat library. This initial library was classified into classes, subclasses, superfamilies and families with the PASTEClassifier.py script included with REPET. We then merged the TE sequences of Brassica species (B. juncea, B. nigra, B. rapa, B. oleracea and B. napus) with the known Repbase database to construct a new repeat database. Finally, this new repeat database was used with RepeatMasker to identify the repeat sequences in the genome assembly.
Gene model prediction and evaluation
Genes were annotated iteratively using three main approaches: homology-based (H), de novo (D) and EST/unigene-based (C). The results of these three methods were integrated with GLEAN to obtain high-confidence gene models by combining all evidence. For the homology-based method (H), protein sequences from two sequenced eudicot species, A. thaliana and B. rapa, were obtained from public databases and used for prediction. We used GeneWise (v2.2.0) to determine accurate gene structures. For de novo prediction (D), we used Augustus with parameters trained on unigenes from transcriptome data, and Genscan and GlimmerHMM with Arabidopsis parameters, to obtain de novo gene models. In the third approach (C), unigenes were aligned to the genome assembly using BLAT (identity >= 0.95, coverage >= 0.90) and then filtered using PASA.
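As an illustration of the BLAT thresholds (identity >= 0.95, coverage >= 0.90), the sketch below filters alignments in BLAT's tab-separated PSL output. The source does not state how identity and coverage were computed, so the definitions here are assumptions: identity is taken as matching bases over aligned bases, and coverage as the aligned span of the query over the query length. The function names are hypothetical.

```python
def psl_identity_coverage(line):
    """Return (identity, coverage) for one tab-separated PSL alignment line."""
    f = line.rstrip("\n").split("\t")
    # PSL columns: 0 matches, 1 misMatches, 2 repMatches, 10 qSize, 11 qStart, 12 qEnd
    matches, mismatches, rep_matches = int(f[0]), int(f[1]), int(f[2])
    q_size, q_start, q_end = int(f[10]), int(f[11]), int(f[12])
    aligned = matches + mismatches + rep_matches
    identity = (matches + rep_matches) / aligned if aligned else 0.0
    coverage = (q_end - q_start) / q_size if q_size else 0.0
    return identity, coverage


def passes(line, min_identity=0.95, min_coverage=0.90):
    """True if the alignment meets both thresholds quoted in the text."""
    identity, coverage = psl_identity_coverage(line)
    return identity >= min_identity and coverage >= min_coverage


# Hypothetical PSL record: a 1,000 bp unigene aligned over 980 bp with 20 mismatches.
example = "\t".join([
    "960", "20", "0", "0", "0", "0", "0", "0", "+",
    "unigene1", "1000", "10", "990",
    "scaffold1", "50000", "100", "1080",
    "1", "980,", "10,", "100,",
])
print(passes(example))
```

Header lines in a PSL file would need to be skipped before calling these functions, for example by running BLAT with its `-noHead` option.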
After combining all evidence to generate gene models with GLEAN, an RNA-seq-based method was adopted to obtain gene structures and new genes: transcriptome data were mapped to the reference genome using TopHat and transcripts were assembled with Cufflinks. We filtered out short gene models (< 150 bp) and single-exon gene models to generate the final gene set for further analysis.
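The final filtering step can be sketched as follows. This is a minimal example under stated assumptions: each gene model is represented as a list of exon coordinates (a hypothetical layout), and the 150 bp cutoff is applied to the summed exon length, which the text does not specify.

```python
def keep_gene(exons, min_length=150):
    """Keep a gene model only if it has more than one exon and is at least min_length bp.

    exons: list of (start, end) coordinates, 1-based inclusive.
    Length is taken as the summed exon length (an assumption).
    """
    total = sum(end - start + 1 for start, end in exons)
    return len(exons) > 1 and total >= min_length


# Hypothetical gene models.
models = {
    "geneA": [(1, 100), (201, 300)],   # 2 exons, 200 bp: kept
    "geneB": [(1, 600)],               # single exon: removed
    "geneC": [(1, 50), (101, 150)],    # 2 exons, 100 bp: removed
}
final_set = [name for name, exons in models.items() if keep_gene(exons)]
print(final_set)
```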
The resulting gene set contains 80,050 protein-coding gene models, with a mean CDS size of 1,111.07 bp and an average of 4.57 exons per gene. We used the RNA-seq data to evaluate the gene model predictions.
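Summary figures such as the mean CDS size and the average number of exons per gene can be derived from the annotation along these lines. The sketch assumes a hypothetical in-memory layout (gene ID mapped to its CDS exon coordinates) and uses made-up toy data; it does not reproduce the values reported above.

```python
def gene_set_stats(genes):
    """Summarise a gene set.

    genes: dict mapping gene ID -> list of (start, end) CDS exons, 1-based inclusive.
    """
    n = len(genes)
    if n == 0:
        raise ValueError("gene set is empty")
    cds_total = sum(end - start + 1 for exons in genes.values() for start, end in exons)
    exon_total = sum(len(exons) for exons in genes.values())
    return {
        "genes": n,
        "mean_cds_bp": cds_total / n,
        "mean_exons_per_gene": exon_total / n,
    }


# Toy data for illustration only.
toy = {
    "g1": [(1, 300), (401, 700)],
    "g2": [(1, 200), (301, 500), (601, 800)],
}
print(gene_set_stats(toy))
```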
Reference
Yang, J., Liu, D., Wang, X., Ji, C., Cheng, F., Liu, B., ... & Yao, P. (2016). The genome sequence of allopolyploid Brassica juncea and analysis of differential homoeolog gene expression influencing selection. Nature Genetics, 48(10), 1225.