Brassica oleracea HDEM (broccoli) Whole Genome Assembly
Analysis Name | Brassica oleracea HDEM (broccoli) Whole Genome Assembly |
---|---|
Method | Ra, SMARTdenovo, Filtlong and wtdbg; Bionano Solve Pipeline (3.1.1) and Bionano Access (1.0a) |
Source | A combination of MinION, optical maps and Illumina sequencing reads |
Date Performed | Monday, May 7, 2018 |
Genome assembly
They used the Ra, SMARTdenovo and wtdbg assemblers with all nanopore raw (or corrected) reads or subsets of raw (or corrected) reads composed of either the longest reads or those selected by the Filtlong downsampling the read coverage can be beneficial for the assembly phase. They also tried to use Canu software, as it has been proven that but could not get to the final assembly stage owing to the high computational requirements. Moreover, they could not select any subsets of reads for B. oleracea, as the sequencing depth was too low to subsample the sequencing data. Ra, wtdbg and Filtlong were used with default parameters. They used the following options as inputs to SMARTdenovo: ‘-c 1’ to generate a consensus sequence, ‘-J 5000’ to remove sequences smaller than 5 kb and ‘-k 17’ to use 17-mers, as this is advised by the developers for large genomes.
Then, they selected the ‘best’ assembly for each organism, based on contiguity metrics, such as N50 or cumulative size. For all organisms, the Ra assembler produced the most contiguous assembly. The best B. oleracea assemblies were obtained using all of the reads and had contig N50s of 7.3 Mb.
A high-quality consensus was needed for both aligning the optical map onto the contigs and annotating genes. As nanopore reads contain systematic errors in homopolymeric regions, they polished the consensus of the selected assembly three times with the nanopore reads as input to the Racon additional times using Illumina reads as input to the Pilon software and then three tool. Both tools were used with default parameters. The polishing process significantly improved the number of complete BUSCOs detected in all organisms. The percentage of complete BUSCOs went from 74.2% to 97.3% for B. oleracea.
Long-range genome assembly
Two enzymes (BspQI and DLE-1) were used to generate optical maps and both maps were produced using a single chip for each genome. The DLE-1 map was generated using the new DLS technology, which significantly improved the contiguity of the optical maps (N50s are 6–15-times higher using DLS). Genome map assemblies for the three species were generated using Bionano Solve Pipeline version 3.1.1 and Bionano Access version 1.0a. A rough assembly was first performed with the following parameters: -i 0 -V 0 -A -z -u -m (pipelineCL.py). This first result was used as a reference for a second assembly, launched with the following parameters (as recommended by the supplier): -y -r (rough assembly cmap) -V 0 -m. They filtered out molecules smaller than 180 kb and molecules with less than nine labelling sites. The nanopore contigs were then organized using the two Bionano maps (DLE and BspQI), with the scaffolding procedure provided by BioNano Genomics, and negative gap sizes were checked with an internal procedure that fused overlapping contigs and greatly improved the contig size.
Construction of a high-density B. oleracea genetic map and validation and anchoring of our B. oleracea assembly
To construct a B. oleracea genetic map, an F2 population (95 progenies) was obtained from a cross between Richelain (B. oleracea ssp. capitata) and HDEM (B. oleracea var. botrytis italica). This population was genotyped using the Illumina 60 K array and a genetic map was constructed using the CarthaGène software. A total of 6,528 markers were genetically mapped, totalling 817.3 cm. The sequence contexts of all singlenucleotide polymorphism markers that were genetically mapped were blasted against our B. oleracea HDEM assembly to validate the quality of our assembly and to help with the ordering and orientation of scaffolds. Of these 6,528 markers, 5,449 were physically anchored on the HDEM assembly and, more specifically, onto the 20 largest scaffolds (out of 140), representing 96.96% of the whole assembly. The genetic and physical positions were discordant for only 95 markers (1.74%) due to an inaccurate position on the genetic map (of a few centimorgans in almost all cases). In most cases, only two scaffolds per pseudomolecule were obtained, with one end of each scaffold corresponding to a centromere region.
Transposable element annotation
Transposable elements and, more generally, transposable element-rich regions are under-represented in short-read assemblies. To investigate this aspect, they performed transposable element detection for our three genomes and the three available reference assemblies. Transposable elements were annotated using RepeatMasker (with default parameters, taxon eudicotyledons for Brassica) and transposable element libraries. The transposable element database generated was used to annotate the B. oleracea genome. They masked 37.82% of the genome of B. oleracea.
Gene prediction
Gene prediction was done using proteomes from homologous species. For B. oleracea, they used the proteomes of B. oleracea (UP000032141), B. napus and A. thaliana. Low complexity in protein sequences was masked with the SEG algorithm.
Low complexity in genomic sequences was masked using the DustMasker algorithm Tandem repeats were masked using Tandem Repeat Finder. The transposable elements detected by RepeatMasker were also masked for the gene prediction step.
The proteomes were aligned to the genomes in two steps. First, BLAT (default parameters) was used to quickly localize corresponding putative genes of the proteins on the genome. The best match and matches with a score of ≥ 90% (70% for A. thaliana proteins) of the best-match score were retained. Second, the alignments were refined using Genewise (default parameters), which is more precise for intron–exon boundary detection. Alignments were kept if more than 80% of the length of the protein was aligned to the genome.
They integrated the protein homologies using a combiner called Gmove to predict gene structures. This tool can find coding sequences based on the protein mapping structures. It is easy to use with no need for a pre-calibration step. Briefly, putative exons and introns, extracted from the alignments, were used to build a simplified graph by removing redundancies. Then, Gmove extracted all paths from the graph and searched for open reading frames consistent with the protein evidence. A selection step was applied to all candidate genes, essentially based on gene structure.
Finally, they used the pan-genome of B. oleracea to complete the gene catalogue. They aligned these two protein databases using the same workflow and integrated the results using Gmove. The final gene catalogues of B. oleracea is composed of the first Gmove results with the predicted genes (based on pan-genome) that do not overlap with any previous annotation.
Using this pipeline, 61,279 gene models were predicted for B. oleracea. They assessed the completeness of the annotation using BUSCO (embryophyta data set) and detected a similar (or higher) proportion of complete genes when compared with existing gene annotations. Moreover, they computed the annotation evaluation distance for the three genomes using the existing gene catalogues as reference annotation.