Brassica nigra cultivar YZ12151 Genome v1.1 Assembly & Annotation

Overview
Analysis Name
Brassica nigra cultivar YZ12151 Genome v1.1 Assembly & Annotation
Method
ALLPATHS-LG
Date Performed
Sunday, July 10, 2016
Genome assembly

For the B. nigra genome assembly, high-quality Illumina reads were assembled de novo with ALLPATHS-LG using default parameters. GapCloser (v1.12, for SOAPdenovo) was then used to fill gaps and improve scaffold quality using the short-insert paired-end libraries (insert size < 1 kb).

Genome quality assessment

We used CEGMA v2.3, which includes 458 conserved core eukaryotic genes (CEG database), to assess the completeness of the final B. nigra genome assembly. The assembly was also validated by mapping 18,344 ESTs (length >= 500 bp) downloaded from NCBI (GenBank) to the genome. To assess the accuracy of the assembly, which was built from HiSeq sequencing data, we downloaded 15 randomly chosen B. nigra BAC sequences from GenBank. First, the 15 BAC sequences were anchored to the assembly using blastn. Then, the blastn results were chained into larger syntenic regions to identify the corresponding scaffolds for each BAC. Finally, coverage and identity were calculated for each sub-read and BAC sequence using the formulas coverage = alignment length / BAC or sub-read length, and identity = matched length / BAC or sub-read length without gaps. Furthermore, to inspect paired-end relationships, the mate-pair reads (3/5/10 kb libraries) were mapped to the whole assembly using SOAP.
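The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: the block tuple layout, the function name and the `gap_length` parameter are assumptions, and "length without gaps" is read here as the BAC length minus its gap bases.

```python
from typing import Iterable, Tuple

# Hypothetical record: (start, end, matched_bases) for one chained blastn block,
# with 0-based half-open coordinates on the BAC (or sub-read).
Block = Tuple[int, int, int]


def bac_coverage_identity(
    blocks: Iterable[Block], bac_length: int, gap_length: int = 0
) -> Tuple[float, float]:
    """Return (coverage, identity) for one BAC from non-overlapping blocks.

    coverage = alignment length / BAC length
    identity = matched length / (BAC length - gap bases)   # assumed reading
    """
    ungapped_length = bac_length - gap_length
    if bac_length <= 0 or ungapped_length <= 0:
        raise ValueError("BAC length must exceed its gap length and be positive")

    blocks = list(blocks)
    aligned = sum(end - start for start, end, _ in blocks)
    matched = sum(m for _, _, m in blocks)
    return aligned / bac_length, matched / ungapped_length


# Two chained blocks on a 1,000 bp BAC that contains 100 gap bases.
cov, ident = bac_coverage_identity([(0, 400, 396), (500, 900, 392)], 1000, 100)
print(f"coverage={cov:.3f} identity={ident:.3f}")
```

The sketch assumes the chained blocks do not overlap on the BAC; overlapping hits would need to be merged first to avoid counting bases twice.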

Genetic maps and pseudo-chromosome construction of B. nigra 

ALLMAPS was used to construct the initial B. nigra pseudo-chromosomes from scaffolds, using the linkage groups of T84/DTC.

Repeats annotation

Repeat sequences in the B. nigra genome were identified using a combination of de novo and homology-based strategies. The results from four de novo programs (RepeatScout, LTR-FINDER, MITE and PILER) were merged to form the initial repeat library. This library was classified into classes, subclasses, superfamilies and families using the PASTEClassifier.py script included with REPET. We then merged TE sequences of Brassica species (B. juncea, B. nigra, B. rapa, B. oleracea and B. napus) with the known Repbase database to construct a new repeat database. Finally, this new repeat database was used with RepeatMasker to identify repeat sequences in the genome assembly.
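The merging of de novo outputs into one initial repeat library can be illustrated with a small Python sketch. It is only an assumption about how such a merge might be done: the function names, the source-prefixed identifiers and the rule of dropping exact duplicate sequences are all hypothetical, and the source does not say how redundancy was handled.

```python
from typing import Dict, List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (identifier, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def merge_libraries(libraries: Dict[str, str]) -> List[Tuple[str, str]]:
    """Merge per-program FASTA libraries, keeping the first copy of each
    exact (case-insensitive) sequence and prefixing IDs with the program name."""
    seen = set()
    merged: List[Tuple[str, str]] = []
    for source, text in libraries.items():
        for name, seq in parse_fasta(text):
            key = seq.upper()
            if key in seen:
                continue
            seen.add(key)
            merged.append((f"{source}|{name}", seq))
    return merged


# Toy input: two tiny libraries sharing one identical consensus sequence.
toy = {
    "RepeatScout": ">R1\nACGT\nACGT\n>R2\nTTTT\n",
    "PILER": ">P1\nacgtacgt\n>P2\nGGGG\n",
}
for seq_id, seq in merge_libraries(toy):
    print(seq_id, seq)
```

Prefixing each identifier with its source program keeps names unique across tools, which matters because downstream classifiers generally expect unique sequence IDs.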

Gene model prediction and evaluation

Genes were annotated iteratively using three main approaches: homology-based (H), de novo (D) and EST/unigene-based (C). The results of these three methods were integrated using GLEAN to obtain high-confidence gene models by combining all evidence. For the homology-based method (H), protein sequences from two sequenced eudicot species, A. thaliana and B. rapa, were taken from public databases and used for prediction, with GeneWise (v2.2.0) used to determine accurate gene structures. For de novo prediction (D), we used Augustus with parameters trained on unigenes from transcriptome data, together with Genscan and GlimmerHMM with Arabidopsis parameters. In the third approach (C), unigenes were aligned to the genome assembly using BLAT (identity >= 0.95, coverage >= 0.90) and then filtered using PASA.
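The BLAT thresholds above (identity >= 0.95, coverage >= 0.90) could be applied to PSL output roughly as follows. This is a sketch under stated assumptions: the source does not define how identity and coverage were computed, so identity is taken here as (matches + repMatches) / (matches + misMatches + repMatches) and coverage as the aligned query span divided by query size. The function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def psl_identity_coverage(fields: List[str]) -> Tuple[float, float]:
    """Compute (identity, coverage) from the columns of one PSL record.

    Standard PSL columns used: 0 matches, 1 misMatches, 2 repMatches,
    10 qSize, 11 qStart, 12 qEnd. The formulas are assumptions (see lead-in).
    """
    matches, mismatches, rep_matches = (int(fields[i]) for i in range(3))
    q_size, q_start, q_end = (int(fields[i]) for i in (10, 11, 12))
    aligned = matches + mismatches + rep_matches
    identity = (matches + rep_matches) / aligned if aligned else 0.0
    coverage = (q_end - q_start) / q_size if q_size else 0.0
    return identity, coverage


def filter_psl(
    lines: Iterable[str], min_identity: float = 0.95, min_coverage: float = 0.90
) -> List[str]:
    """Return query names (PSL column 9) of alignments passing both thresholds."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip PSL header lines and anything that is not a 21-column record.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        identity, coverage = psl_identity_coverage(fields)
        if identity >= min_identity and coverage >= min_coverage:
            kept.append(fields[9])
    return kept
```

A typical call would pass an open PSL file handle, for example `filter_psl(open("unigenes.psl"))`, where the file name is a placeholder.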

After combining all evidence into gene models with GLEAN, an RNA-seq-based method was used to refine gene structures and identify new genes: transcriptome reads were mapped to the reference genome using TopHat, and transcripts were assembled with Cufflinks. We then filtered out short gene models (< 150 bp) and single-exon gene models to generate the final gene set for further analysis.
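The final filtering step might look like the following sketch when applied to GFF3 records. Several points are assumptions rather than the authors' documented procedure: the 150 bp cutoff is applied to total CDS length, exons are counted as CDS segments sharing a `Parent` attribute, and a model is dropped if either criterion holds. The function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def filter_gene_models(gff_lines: Iterable[str], min_cds_bp: int = 150) -> List[str]:
    """Return sorted IDs of transcripts that have more than one CDS segment
    and a total CDS length of at least min_cds_bp."""
    cds_by_parent: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        # Column 9 holds key=value attributes separated by semicolons.
        attrs = dict(item.split("=", 1) for item in cols[8].split(";") if "=" in item)
        parent = attrs.get("Parent")
        if parent is not None:
            # GFF3 coordinates are 1-based and inclusive.
            cds_by_parent[parent].append((int(cols[3]), int(cols[4])))

    kept = []
    for parent, segments in cds_by_parent.items():
        cds_length = sum(end - start + 1 for start, end in segments)
        if cds_length >= min_cds_bp and len(segments) > 1:
            kept.append(parent)
    return sorted(kept)
```

If the intended rule was instead to drop only models that are both short and single-exon, the final condition would change from `and` to `or`.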

The resulting gene set contains 80,050 protein-coding gene models, with a mean CDS size of 1,111.07 bp and an average of 4.57 exons per gene. We used the RNA-seq data to evaluate the gene model predictions.
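As a rough way to recompute the gene count and mean CDS size from a CDS FASTA file, a sketch such as the following could be used. The function name is hypothetical and the code only illustrates the arithmetic; it does not claim to reproduce how the reported figures were derived.

```python
from typing import Tuple


def cds_count_and_mean_length(fasta_text: str) -> Tuple[int, float]:
    """Return (number of sequences, mean sequence length in bp) for FASTA text."""
    lengths = []
    current = None  # running length of the record being read
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    if not lengths:
        raise ValueError("no FASTA records found")
    return len(lengths), sum(lengths) / len(lengths)


# Toy input with two CDS records of 9 bp and 6 bp.
count, mean_bp = cds_count_and_mean_length(">g1\nATGAAA\nTAA\n>g2\nATGTAA\n")
print(count, mean_bp)
```

For the compressed CDS file listed in the downloads, the text could be read with `gzip.open("BniB_cds.fa.gz", "rt").read()` before calling the function.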

Reference

Yang, J., Liu, D., Wang, X., Ji, C., Cheng, F., Liu, B., ... & Yao, P. (2016). The genome sequence of allopolyploid Brassica juncea and analysis of differential homoeolog gene expression influencing selection. Nature Genetics, 48(10), 1225.

Download

All assembly and annotation files are available for download by selecting the desired data type in the left-hand sidebar. Each data type page provides a description of the available files and links to download them. Alternatively, you can browse all available files in the HTTP download repository.

Assembly

The Brassica nigra cultivar YZ12151 genome v1.1 assembly file:

Downloads

Brassica nigra cultivar YZ12151 genome v1.1 assembly: BniB_genome.chr.fa.gz

Gene Predictions

The Brassica nigra cultivar YZ12151 genome v1.1 gene prediction files:

Downloads

Predicted Genes (GFF3 file): BniB.gene.chr.gff.gz
CDS sequences (FASTA file): BniB_cds.fa.gz
Protein sequences (FASTA file): BniB_pep.fa.gz

Repeats

The Brassica nigra cultivar YZ12151 genome v1.1 TE annotation file:

Downloads

TE annotation (GFF3 file): BniB.TE.chr.gff.gz