Nanopore assembly and variant calling

First follow the instructions here:

Step by step guide on how to use my pipelines
Click here for an introduction to Snakemake

ABOUT

This is a pipeline that uses Flye to create a nanopore assembly. It also does variant calling with long and short reads.
The pipeline starts by using porechop to trim the adaptors, then it uses Flye to create the assembly. After that, ntLink-arks from Lonstitch is used to scaffold the assembly using the nanopore reads. The scaffolded assembly is polished with polca. Polca also does variant calling with the short reads, while longshot does variant calling with the nanopore reads. To run longshot, first the long reads are aligned to the assembly with minimap2.
In the end, in addition to your assembly and variant calling results, you’ll also get assembly statistics and busco scores before and after the polishing.

Tools used:

Porechop - trim adaptors
Flye - assembly
Seqtk - convert fasta to one line fasta
LongStitch (ntLink-arks) - scaffolding with nanopore reads
BUSCO - assess assembly completeness
MaSuRCA (polca) - polish assembly and do variant calling with short reads
Python - get assembly stats
Minimap2 - map long reads to reference
Samtools - sort and index mapped reads and vcf files
Longshot - variant calling with nanopore reads
bcftools - vcf statistics


Pipeline workflow

Edit config.yaml with the paths to your files

LONGREADS: <nanopore_reads.fq.gz>
SHORTREADS:
  - /path/to/short/reads_1.fq.gz
  - /path/to/short/reads_2.fq.gz
GENOME_SIZE: <approximate genome size>
PREFIX: <prefix>
OUTDIR: /path/to/outdir
BUSCO_LINEAGE:
  - <lineage>

LONGREADS - name of file with long reads. This file should be in the working directory (where this config and the Snakefile are)
SHORTREADS - paths to short reads fq.gz
GENOME_SIZE - approximate genome size haploid genome size (bp)(e.g. '3e9' for human genome) from longstitch
PREFIX - prefix for the created files
OUTDIR - directory where snakemake will run and where the results will be written to
BUSCO_LINEAGE - lineage used for busco. Can be one or more (one per line). To see available lineages run busco --list-datasets

If you have your long reads in in several fastq files and need to create one file compressed file with all the reads:

In your pipeline directory create one file with all the reads
```
cat /path/to/fastq/directory/*.fastq > <name of file>.fq
```
Compress the file you just greated:
```
gzip <name of file>.fq
```

After installing the conda environment (first step of this guide) you’ll need to edit the polca.sh file.

First go to the directory where miniconda3 is installed (usually your home directory). Go to /<home>/miniconda/envs/<env_name>/bin and open the file polca.sh. In my case the path looks like this: /home/WUR/<username>/miniconda3/envs/<env_name>/bin/. In your editor open polca.sh and replace this line:

$SAMTOOLS sort -m $MEM -@ $NUM_THREADS <(samtools view -uhS $BASM.unSorted.sam) $BASM.alignSorted 2>>samtools.err && \

With this:

$SAMTOOLS sort -m $MEM -@ $NUM_THREADS <(samtools view -uhS $BASM.unSorted.sam) -o $BASM.alignSorted.bam 2>>samtools.err && \

RESULTS

The working directory will be messy with all the necessary files and results from the several pipeline steps. The most important files are and directories are:

**_files.txt** dated file with an overview of the files used to run the pipeline (for documentation purposes)
results directory that contains
- assembly_stats_.txt file with assembly statistics for the final assembly
- busco_{prefix}scaffolded** and **busco{prefix}_scaffolded_polished directories - contain busco results before and after polishing respectively
  - short_summary.specific.{lineage}.{prefix}_scaffolded.txt
  - short_summary.specific.{lineage}.{prefix}_scaffolded_polished.txt”
- variant_calling directory with variant calling VCF files with long and short reads, as well as VCF stats
  - {prefix}_shortreads.vcf.gz
  - {prefix}_shortreads.vcf.gz.stats
  - {prefix}_longreads.vcf.gz
  - {prefix}_longreads.vcf.gz.stats
- 3_mapped
  - {prefix}_longreads.mapped.sorted.bam - long reads mapped to the new assembly