Asmbl

Assembly and quality assessment of short-read Illumina data

Pipeline of tools used for quality assessment, assembly and gene detection from short-read Illumina sequences. Created for easy use at our microbiology research lab at Stavanger University Hopsital.

This script will quality- and adapter-trim your data with trim_galore, create a FastQC report, assemble the trimmed FASTQ reads using Unicycler, create a Quast QC-report, run mlst to identify sequence types and species, and will calculate average read depth (or coverage) of each sample.

The script creates a report summarising for each sample: Species, ST, no. reads, GC%, no. contigs, largest contig, total sequence length, N50, L50 and read depth.

Requirements
Basic Usage
Usage
Output
Detailed Explanation
Updates

Requirements

These need to be installed and in path for the entire pipeline to work. Other versions of these tools will possibly work too, but these are the ones I have tested.

Linux or MacOS
Python 3.9.7
Pandas (pip3 install pandas)
Paralell (conda install -c conda-forge parallel)
FastQC v0.11.9 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) (conda install -c bioconda fastqc)
MultiQC v1.11 (https://multiqc.info/) (conda install -c bioconda multiqc)
CutAdapt v3.5 (for TrimGalore) (conda install -c bioconda cutadapt)
TrimGalore v0.6.7 (conda install -c bioconda trim-galore)
Unicycler v0.5.0 (https://github.com/rrwick/Unicycler#installation)
SPAdes v3.15.3 (http://cab.spbu.ru/software/spades/) (conda install -c bioconda spades)
BLAST+ v2.12.0+ (conda install -c bioconda blast)
Bowtie2 v2.4.5 (conda install -c bioconda bowtie2)
Quast v5.0.2 (http://quast.sourceforge.net/quast) (conda install -c bioconda quast)
mlst v2.19.0 (https://github.com/tseemann/mlst) (conda install -c conda-forge -c bioconda -c defaults mlst)
BWA v0.7.17-r1188 (http://bio-bwa.sourceforge.net/) (conda install -c bioconda bwa)
SAMtools v1.14 (http://www.htslib.org/download/) (conda install -c bioconda samtools)
PicardTools v2.18.29-0 (https://broadinstitute.github.io/picard/) (conda install -c bioconda picard)
Optional: Kleborate v2.20 (https://github.com/katholt/Kleborate) including Kaptive v2.0.0
Optional but recommended: Install all in a conda environment

Basic usage

You must be in the directory containing the FASTQ-files to run this pipeline. Output-files will be stored in a specific file-structure in the input-directory.

cd ~/Directory_with_fastq/ #Enter directory with FASTQ-files
asmbl.py

Usage

You must be in the directory containing the FASTQ-files to run this pipeline. Output-files will be stored in a specific file-structure in the input-directory. In addition to the default pipeline, you can also run kleborate or abricate.

Usage:

ASMBL [-h] [-v] [-t THREADS] [--noex] [--nofqc] [--nomlst]
               [--noquast] [--nocov] [--klebs] [--argannot] [--resfinder]
               [--plasmidfinder] [--card] [--ncbi] [--ecoh] [--abricate_all]

ASMBL

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -t THREADS, --threads THREADS
                        Specify number of threads to use. Default: 4
  --noex                Do not run fastQC, multiQC, Quast, MLST or
                        read depth calculation.
  --nofqc               Do not run fastQC and multiQC
  --nomlst              Do not run MLST
  --noquast             Do not run Quast
  --nocov               Do not calculate read depth (X)
  --klebs               Run Kleborate, with option --all

You can add more FASTQ-files to the same output-directories/summary-report, by adding the FASTQ-files to the inital input-directory, leaving the output-folder structure as it was created and re-running the pipeline.

Note: This was initially created for scientist with little/no coding-experience to easily perform assembly, therefore, this script currently only works when you run it from the folder the FASTQ-files are in, and output-files are stored in the same directory. In the future, I will add input and output-options.

Output

The following output-files are created when running AsmPipe:

Fastq_raw: Any FASTQ-files that have been processed are placed here
trimmed_reads: The trimmed reads from TrimGalore are placed here
assembly: The Unicycler-assembled files will be placed here
assemblies: The FASTA-file from assembly will be copied to this direcory
QC: Contains reports from FastQC, multiQC, Quast, trimming and overall coverage calculation
analyses: Contains results from mlst, kleborate and abricate
sequence_list.txt: List of all samples that have been analysed
successful_sequences.txt: List of all samples that were successfully assembled
failed_sequences.txt: List of any samples that failed any stage of the pipeline
logs: Will contain run-logs from each tool for each sample
AsmPipe_date_time.csv: Overall summary report of: Species, ST, no. reads, GC%, no. contigs, largest contig, total sequence length, N50, L50 and sequence depth.

Detailed explanation

FastQC performs quality assessment of raw reads, indicating number of reads, GC%, adapter content, sequence length distribution, and more
TrimGalore - trims raw reads based on adapter sequences and Phred quality: trims 1 bp off 3' end of every read, removes low-quality (<Phred 20) 3' ends, removes adapter sequences and removes read-pairs if either of the reads' length is <20 bp
Unicycler functions as a SPAdes optimiser with short-reads only, and pilon polishing attempts to make imporvements on the genome
Quast quality assessment on assembly outputs the total length, GC%, number of contigs, N50, L50 and more.
MLST attempts to identify species and mlst based on the PubMLST schemes. Other tools may be needed for specification, e.g. Kleborate identifies locus variants for Klebsiella samples and separates klebsiella pneumoniae sensu lato into subspecies
Sequencing depth (X) - maps the reads against their assembled fasta-file to calculate the overall average depth of the genome.

Things to check QC-wise

That GC% matches the sample species
That the total length matches the sample species
That you do not have a high number of contigs (ideally <500)
That you do not have low average read depth (ideally >40X)
A low number og long contigs is preferable to a high number of contigs with short contigs

License

GNU General Public License, v3

Updates

2022-02-03: Updated pipeline to work with newest release of Unicycler (v0.5.0). Unicycler no longer uses pilon for polishing and read error correction is by default turned off, so these options have been removed from the pipeline. Also removed option for abricate as it does not work. Also updated some terms to stop confusion: The folder "assemblies" is now "fasta", "coverage" is now "read depth", and the final report from the pipeline is prefixed with "Asmbl" rather than "AsmPipe".

2021-04-21: Added version checks, fixed a bug with threading and most importantly: Added the flag --no_correct to the unicycler command to turn off spades read error correction. This is not needed as the files are QC'd with trim-galore first.

2019-07-18: Added options to find *fastq-gz files in subdirectories from previuos runs

2019-07-18: Added options to not run parts of the pipeline, and added option to run kleborate (https://github.com/katholt/Kleborate) at the end of the pipeline

2019-10-29: Added options to run ABRICATE as part of the pipeline, and created output-folder "analyses" to put mlst, kleborate and ABRICATE-outputs in. Also added merge_runs.sh which you can use to merge two parent folders with the same structure (from this script).

2019-10-30: Updated tool versions in README.

markus-soma/Asmbl

Asmbl

Table of Contents