Brassica napus cultivar NY7 Whole Genome Assembly & Annotation v2
Analysis Name | Brassica napus cultivar NY7 Whole Genome Assembly & Annotation v2 |
---|---|
Method | SOAPdenovo2, Gapcloser, SSPACE, OPERA-LG and Lachesis |
Source | A combination of sequencing technologies, including Illumina, PacBio Single Molecule Real Time (SMRT) and Hi-C sequencing |
Date Performed | Wednesday, March 13, 2019 |
De novo assembly of scaffolds
To avoid systematic bias from sequencing reads, the raw Illumina paired-end (PE) reads were filtered using NGSQC v2.3.3 and corrected using Lighter with the default settings. The clean Illumina reads were assembled using SOAPdenovo2 and then reused to improve the SOAPdenovo2 assembly with Gapcloser v2.1, SSPACE v1.0 and OPERA-LG v2.1. Finally, PacBio reads, corrected against the clean Illumina reads with LoRDEC 0.6, were used to fill gaps with PBJelly. To reduce errors in the initial assembly, three linkage maps and the Ar and Co subgenomes were used to identify and correct chimeric scaffolds, based on reliable synteny relationships with the initial NY7 assembly. As a result, 180 chimeric scaffolds were corrected, and the NY7 assembly was improved to a scaffold N50 of 1.27 Mb and a contig N50 of 44.0 Kb.
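The N50 values quoted above summarize assembly contiguity: N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. A minimal sketch of that calculation follows; the function name `n50` and the toy lengths are illustrative only and are not taken from the NY7 pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (hypothetical helper)."""
    ordered = sorted(lengths, reverse=True)   # longest sequences first
    half_total = sum(ordered) / 2             # half of the total assembly length
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:             # first sequence to reach the halfway point
            return length
    raise ValueError("no lengths supplied")


# Toy example, not NY7 data: total = 30, half = 15, reached by the 8-unit sequence.
print(n50([10, 8, 6, 4, 2]))
```

Applied to scaffold lengths this gives the scaffold N50; applied to contig lengths (scaffolds split at gaps) it gives the contig N50.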
Super-scaffolding with Hi-C data
Approximately 117 Gb of Hi-C reads were mapped to the assembly using Bowtie v2.2.1 with the parameters ‘--reorder’ and ‘--very-sensitive’. SAMtools was used to manipulate the BAM files and remove potential PCR duplicates. Lachesis was then used to cluster, order and orient the scaffolds, producing the raw Hi-C assembly from these mapping results. To check the accuracy of the assembly, a Hi-C contact matrix was generated, and custom scripts were used to find and split weak points in the Hi-C assembly. The corrected Hi-C assembly was then aligned to the three linkage maps and the Ar and Co subgenomes to remove abnormal synteny relationships. Finally, we obtained the NY7 Hi-C assembly with a scaffold N50 of 6.91 Mb.
Pseudo-molecule construction
The backbone linkage map, BnaTNDH 2.3, and two other published linkage maps, DYDBAA and BS, were used to construct the pseudo-molecules. A final set of 13 164 unique SNP markers was used to anchor the Hi-C scaffolds using BLAST+ 2.3.0 with the parameter ‘-evalue 1e-10’. Markers with best hits in the NY7 genome that also passed filtering by ALLMAPS were defined as unique markers. ALLMAPS was used to construct the 19 pseudo-chromosomes, which covered 890 Mb in length.
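The marker-anchoring step can be sketched as a best-hit selection over BLAST tabular output. The sketch below assumes the standard 12-column tabular format (`-outfmt 6`), which the record does not state, and it treats a marker as ambiguous when its two top hits tie on bit score; that tie rule and the function name `best_hits` are assumptions for illustration, not the authors' documented criteria.

```python
import csv
import io


def best_hits(blast_tsv, max_evalue=1e-10):
    """Map each marker to the (scaffold, position) of its single best hit.

    Assumes BLAST tabular columns: qseqid, sseqid, pident, length, mismatch,
    gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    hits = {}
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue                              # skip blank and comment lines
        marker, scaffold = row[0], row[1]
        sstart = int(row[8])
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue                              # same cutoff as '-evalue 1e-10'
        hits.setdefault(marker, []).append((bitscore, scaffold, sstart))

    anchors = {}
    for marker, marker_hits in hits.items():
        marker_hits.sort(reverse=True)            # highest bit score first
        if len(marker_hits) > 1 and marker_hits[0][0] == marker_hits[1][0]:
            continue                              # assumed rule: tied top hits are ambiguous
        _, scaffold, position = marker_hits[0]
        anchors[marker] = (scaffold, position)
    return anchors
```

The resulting marker-to-scaffold positions are the kind of input that would then be paired with genetic map positions for ALLMAPS; the further filtering that ALLMAPS applies is not reproduced here.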
Improving the assembly of NY7 with GOGGs
Illumina RNA-seq reads from 45 DH lines of the BnaTNDH mapping population and the two parental lines (NY7 and Tapidor) were mapped to the NY7 reference genome using the methodology developed and deployed previously. Genome-ordered graphical genotypes (GOGGs; He and Bancroft, 2018) were applied to improve the genome assembly of NY7. In brief, we developed a new NY7 genome sequence resource by cutting and inserting 54 segments from the draft genome sequence based on GOGGs, with the reassembly of the split segments designed according to the congruence of genotypes across the mapping population. The new NY7 genome sequence was reassembled by automated concatenation of the sequence segments. Another iteration of SNP scoring and GOGG generation from this resource demonstrated improved congruence of genotypes.
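The cut-and-insert reassembly can be pictured as concatenating segments of the draft sequence according to an ordered plan. The sketch below is only an illustration of that idea: the plan format (source name, 0-based start, end, strand), the option to reverse-complement a segment, and the function names are all assumptions, not the actual GOGGs tooling.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (A/C/G/T/N, either case)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def reassemble(draft, plan):
    """Concatenate draft segments in the order given by a hypothetical plan.

    draft: dict mapping sequence name -> sequence string
    plan:  list of (name, start, end, strand) with 0-based, half-open coordinates
    """
    pieces = []
    for name, start, end, strand in plan:
        piece = draft[name][start:end]            # cut the segment out of the draft
        if strand == "-":
            piece = reverse_complement(piece)     # assumed: segments may be flipped
        pieces.append(piece)
    return "".join(pieces)                        # automated concatenation


# Toy example, not NY7 data: move the middle segment to the end and flip it.
draft = {"chrA": "AAAACCTTGGGG"}
plan = [("chrA", 0, 4, "+"), ("chrA", 8, 12, "+"), ("chrA", 4, 8, "-")]
print(reassemble(draft, plan))
```

In practice the plan would be derived from where genotype patterns in the mapping population are discontinuous, so that the rearranged sequence restores congruent genotypes along each chromosome.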
Assembly confirmation with Hi-C contact maps
Approximately 100x Hi-C data were remapped to the 19 pseudo-chromosomes and normalized using HiC-Pro 2.10.0 with the parameters ‘FILTER_LOW_COUNT_PERC = 0’ and ‘BIN = 100000’. The Hi-C contact maps generated by HiC-Pro at a bin size of 100 Kb were used to plot Hi-C contact heatmaps for the 19 pseudo-chromosomes using HiCPlotter 0.7.3 with the parameters ‘-tri 1 -wg 1 -o WholeGenome’.
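To make the 100 Kb binning concrete, the sketch below counts read-pair contacts per pair of genomic bins. It shows only the raw binning idea; HiC-Pro additionally handles mapping, pair filtering and matrix normalization, none of which is reproduced here. The input tuple format and the function name `bin_contacts` are assumptions for illustration.

```python
from collections import Counter

BIN_SIZE = 100_000  # 100 Kb, matching the bin size reported above


def bin_contacts(pairs, bin_size=BIN_SIZE):
    """Count contacts between genomic bins.

    pairs: iterable of (chrom1, pos1, chrom2, pos2) for each valid read pair
    Returns a Counter keyed by an order-independent pair of (chrom, bin_index).
    """
    matrix = Counter()
    for chrom1, pos1, chrom2, pos2 in pairs:
        a = (chrom1, pos1 // bin_size)            # bin containing the first read end
        b = (chrom2, pos2 // bin_size)            # bin containing the second read end
        matrix[tuple(sorted((a, b)))] += 1        # symmetric: (a, b) equals (b, a)
    return matrix


# Toy example, not NY7 data.
pairs = [
    ("A01", 50_000, "A01", 150_000),
    ("A01", 160_000, "A01", 10_000),
    ("A01", 5, "C01", 250_000),
]
for key, count in sorted(bin_contacts(pairs).items()):
    print(key, count)
```

A correctly ordered pseudo-chromosome is expected to show its strongest counts near the diagonal of such a matrix, which is what the contact heatmaps are used to confirm.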
Gene and repeat annotation
The de novo repeat library was built from the assembled genome using RepeatModeler. A total of 352.8 Mb of repetitive elements, covering 41.29% of the NY7 genome, were identified using RepeatMasker 4.0.8 with the default settings. De novo gene structure predictions were carried out using AUGUSTUS 3.2.2, GeneMark.hmm and FGENESH 2.6. The coding sequences of A. thaliana (TAIR10), B. rapa (IVFCAASv1), B. napus (Darmor-bzh v5 and ZS11 V201608) and B. oleracea (v2.1) were downloaded to perform homology-based predictions using GMAP. A total of ~20 Gb of RNA-seq reads generated by Bancroft et al. (2011) were aligned to the NY7 assembly using TopHat 2.1.1 and assembled into a set of transcripts using Cufflinks 2.2.1 with the default settings. In parallel, the RNA-seq data were assembled de novo using Trinity 2.4.0 with the default settings. The assembled transcripts from Cufflinks and Trinity were integrated using the PASA 2.0.2 pipeline to provide expression evidence for gene predictions. All candidate gene models from the above evidence were combined using EVM, with a higher weight given to the expression evidence from the RNA-seq results.