Smoove split read support error

When mapping reads with bwa mem, if you use the -M flag, the split reads are maked as secondary and they will not be used by smoove to get split-read support. This will cause all the variants in your VCF file to have SR=0. If you don’t use split-read support in your analysis, you can run smoove as usual, if you want split-read support values in your structural variant file you can use the following solution.
This solution was created by Martijn Derks.

It requires python 2, and the pysam and argparse packages as well as samtools

module load samtools

sname=`samtools view -H <sample>.bam | grep '^@RG' | sed "s/.*SM:\([^\t]*\).*/\1/g" | uniq`

python bamgroupreads.py -f -M -i <sample>.bam | samblaster --ignoreUnmated -M -a -e -d <sample>.disc.sam -s <smoove call output dir>/<sample>.split.sam -o /dev/null

grep -v "SAMBLASTER" <smoove call output dir>/<sample>.split.sam > $sname.tmp.sam
mv $sname.tmp.sam <smoove call output dir>/<sample>.split.sam
grep -v "SAMBLASTER" <smoove call output dir>/<sample>.disc.sam > $sname.tmp.sam
mv $sname.tmp.sam <smoove call output dir>/<sample>.disc.sam

samtools sort -@ 12 -O bam <smoove call output dir>/<sample>.split.sam > <smoove call output dir>/$sname.split.bam
samtools sort -@ 12 -O bam <smoove call output dir>/<sample>.disc.sam > <smoove call output dir>/$sname.disc.bam

rm <smoove call output dir>/<sample>.split.sam <smoove call output dir>/<sample>.disc.sam

The final split and disc bam files should be in the same directory as the smoove call outdir. Smoove will then use these split and disc bam files for the smoove call step.

Step by step

Get sample name from the sample name read group tag in the bam file. The final files will be named according to the sample name, as expected by smoove.

sname=`samtools view -H <sample>.bam | grep '^@RG' | sed "s/.*SM:\([^\t]*\).*/\1/g" | uniq`

Then small script starts by creating two bam files, one with discordant reads and one with split reads using the python2 script bamgroupreads.py and samblaster.

python bamgroupreads.py -f -M -i <sample>.bam | samblaster --ignoreUnmated -M -a -e -d <sample>.disc.sam -s <smoove call output dir>/<sample>.split.sam -o /dev/null

I tried create sorted bam files from these split and disc bam files and use them with smoove, but I would always get this error:

panic: sam: duplicate program name: line 618: "@PG\tID:SAMBLASTER\tVN:0.1.26\tCL:samblaster -i stdin -o /dev/null -M --acceptDupMarks --excludeDups --ignoreUnmated -d <smoove output dir>/<sample>.disc.sam -s <smoove output dir>/<sample>.split.sam --maxSplitCount 2 --maxUnmappedBases 50 --minIndelSize 50 --minNonOverlap 20"

To avoid the previous error, the following steps are needed:

grep -v "SAMBLASTER" <smoove call output dir>/<sample>.split.sam > $sname.tmp.sam
mv $sname.tmp.sam <smoove call output dir>/<sample>.split.sam
grep -v "SAMBLASTER" <smoove call output dir>/<sample>.disc.sam > $sname.tmp.sam
mv $sname.tmp.sam <smoove call output dir>/<sample>.disc.sam

Then you can create the sorted bam files to be used by smoove

samtools sort -@ 12 -O bam <smoove call output dir>/<sample>.split.sam > <smoove call output dir>/$sname.split.bam
samtools sort -@ 12 -O bam <smoove call output dir>/<sample>.disc.sam > <smoove call output dir>/$sname.disc.bam

Note that the final split and disc bams are named according to the sample name present in the read group tag from the bam file.

Remove intermediate files

rm <smoove call output dir>/<sample>.split.sam <smoove call output dir>/<sample>.disc.sam

For an example on how to use this fix with smoove, see my Snakemake population level structural variant calling pipeline