DITTO
!!! For research purposes only !!!
NOTE: DITTO was previously hosted on a different remote Git provider, UAB GitLab. It was migrated to
GitHub in April 2023, and the GitLab version has been archived.
DITTO is an explainable neural network that supports accurate and rapid interpretation of the pathogenicity of small
genetic variants using a patient's genotype (VCF) information.
Getting Started
- Prerequisites
- Using DITTO
- Reproducing the DITTO model
- Download DITTO DB (Precomputed scores)
- How to cite?
- Contact
Prerequisites
The following prerequisites must be installed in the target environment to deploy and run the DITTO
prediction model.
Tools
- Python 3.10 - Install
- The specified OpenCravat version requires Python 3.10
- Anaconda3 25.7+ - install
- OpenCravat 2.4.1 - install
- Git
- Set up with your favorite git client. Here is a GitHub Guide
for different platforms.
- Nextflow 22.10.7+ - install
NOTE: The version of OpenCravat that we currently use doesn't support "Spanning or overlapping deletions"
variants, i.e. variants with `*` in the ALT allele column. More on these variants
here.
These variants will be ignored when running the pipeline.
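To see in advance how many records the note above applies to, a VCF can be scanned for the `*` allele before it is submitted. The following is a minimal sketch and not part of DITTO: the function names are hypothetical, and it only counts data lines whose ALT column lists `*`. How the pipeline treats multi-allelic lines that mix `*` with other alleles is not stated here, so treat the count as an upper bound on what may be skipped.

```python
import gzip
from pathlib import Path


def open_vcf(path):
    """Open a plain-text or gzip-compressed VCF in text mode."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def has_star_allele(vcf_line):
    """Return True if any ALT allele on a VCF data line is '*'."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 5:
        return False  # not a complete data line
    # ALT is the fifth column; multiple alleles are comma-separated.
    return "*" in fields[4].split(",")


def count_star_alleles(path):
    """Count data lines whose ALT column contains the '*' allele."""
    count = 0
    with open_vcf(path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            if has_star_allele(line):
                count += 1
    return count
```

For example, `count_star_alleles(".test_data/oc_test_data.vcf.gz")` would report the number of such lines in one of the bundled test files, assuming the command is run from the repository root.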
System Requirements
- CPU: >2
- RAM: ~25GB for a WGS VCF sample
- Storage: 1TB
- The storage requirement is driven by the OpenCravat annotators: ~600GB of data is required to store all of them
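The requirements above can be checked on a target machine before installing anything. This is a rough sketch under stated assumptions rather than an official DITTO check: the function names are hypothetical, "CPU: >2" is read as at least 3 logical CPUs, free disk space is compared with the ~600GB annotator footprint rather than the full 1TB, and total RAM is only measured where `os.sysconf` exposes it (typically Linux and macOS).

```python
import os
import shutil


def total_ram_gb():
    """Return total physical memory in GB, or None if it cannot be determined."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None  # os.sysconf or these names are unavailable on this platform
    return page_size * page_count / 1e9


def evaluate(cpus, ram_gb, free_gb, min_cpus=3, min_ram_gb=25, min_free_gb=600):
    """Compare measured resources with the thresholds; return a list of warnings."""
    warnings = []
    if cpus is not None and cpus < min_cpus:
        warnings.append(f"CPU: found {cpus}, expected at least {min_cpus}")
    if ram_gb is not None and ram_gb < min_ram_gb:
        warnings.append(f"RAM: found {ram_gb:.0f}GB, expected about {min_ram_gb}GB")
    if free_gb < min_free_gb:
        warnings.append(f"Storage: {free_gb:.0f}GB free, expected about {min_free_gb}GB")
    return warnings


def check_host(storage_path="."):
    """Measure this machine and return the warnings from evaluate()."""
    free_gb = shutil.disk_usage(storage_path).free / 1e9
    return evaluate(os.cpu_count(), total_ram_gb(), free_gb)


if __name__ == "__main__":
    for warning in check_host():
        print("WARNING:", warning)
```

Pass the directory that will hold the OpenCravat modules as `storage_path` so that the free-space figure refers to the right filesystem.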
Using DITTO
DITTO scores for variants can be obtained in the following three ways. The webapp and API are for single-variant
analysis, and the local setup is for batch/bulk variant predictions.
Webapp
DITTO is available for public use at this website.
API
DITTO is not hosted as a public API, but it can be served locally to query DITTO scores. Please follow the instructions
in this GitHub repo.
Prediction
Installation
To fetch the DITTO source code, change into a directory of your choice and run:
git clone https://github.com/uab-cgds-worthey/DITTO.git
cd DITTO
Local Prediction
NOTE: This setup allows one to annotate a VCF sample and make DITTO predictions. It has currently been tested only on
Cheaha (UAB HPC) because of the resources needed to download datasets from OpenCRAVAT.
Docker versions may need to be explored later to make it usable on Mac and Windows.
NextFlow Conda Vs. Mamba Setup
NOTE: If your environment uses Mamba instead of Conda, NextFlow can be configured to use Mamba
by editing the configs/nextflow/local.config file and updating the useMamba parameter to reflect your
environment:
# This parameter is defaulted to false, change to true if using Mamba
useMamba = true
Setup Steps
-
Setup OpenCravat (only one-time installation)
Please follow the steps mentioned in install_openCravat.md.
-
Setup Nextflow
Create an environment via conda. Below is an example to install nextflow.
# create environment. Needed only the first time. Please use the above link if you're not using Mac.
conda env create -f ./configs/conda/ditto-env.yaml
conda activate ditto-env
-
Sample Sheet
Please create a sample sheet, .test_data/file_list.txt, listing the VCF files (incl. path). Either relative or
absolute paths to the vcf.gz files can be supplied. Relative paths need to be relative to the work directory that DITTO
was executed from.
Example file_list.txt with relative paths (the first entry will become /Users/<username>/Workspace/DITTO/.test_data/oc_test_data.vcf.gz):
.test_data/oc_test_data.vcf.gz
.test_data/testing_variants_hg38.vcf.gz
Or with absolute paths (this example uses a MacOS Desktop folder with a test_data directory):
/Users/<username>/Desktop/test_data/oc_test_data.vcf.gz
/Users/<username>/Desktop/test_data/testing_variants_hg38.vcf.gz
This will run DITTO prediction for both VCF files in file_list.txt.
-
Run the NextFlow pipeline
Please make sure to edit the directory paths as needed and run the pipeline as shown below.
# Note: NextFlow work directory is defined as `-work-dir` in the run command parameters
# Note: `--outdir` cannot be relative, set a path nextflow can access. ex. `/tmp/DITTO/output`
nextflow run pipeline.nf \
  -work-dir ./work_dir \
  --build hg38 -c ./configs/nextflow/local.config -with-report \
  --sample_sheet .test_data/file_list.txt \
  --oc_modules /<path-to>/opencravat/modules \
  --outdir $PWD/data/output
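Because relative entries in the sample sheet are resolved against the directory DITTO is executed from, a missing or mistyped path is easier to catch before launching the pipeline. The sketch below is an illustration only and is not shipped with DITTO: `resolve_sample_sheet` is a hypothetical helper, and it assumes the sheet contains exactly one path per line with no comments.

```python
from pathlib import Path


def resolve_sample_sheet(sheet_path, run_dir="."):
    """Read a sample sheet and return (resolved_paths, missing_paths).

    Relative entries are resolved against run_dir, the directory the
    pipeline is launched from; absolute entries are used as given.
    """
    run_dir = Path(run_dir).resolve()
    resolved, missing = [], []
    for raw in Path(sheet_path).read_text().splitlines():
        entry = raw.strip()
        if not entry:
            continue  # ignore blank lines
        path = Path(entry)
        if not path.is_absolute():
            path = run_dir / path
        resolved.append(path)
        if not path.is_file():
            missing.append(path)
    return resolved, missing


if __name__ == "__main__":
    sheet = Path(".test_data/file_list.txt")
    if sheet.is_file():
        _, missing = resolve_sample_sheet(sheet)
        for path in missing:
            print("Missing VCF:", path)
```

Run from the repository root, an empty `missing` list means every entry in the sheet points at an existing file.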
HPC Prediction with Cheaha
To run on UAB Cheaha, first follow the installation step to clone the DITTO repository into a Cheaha directory.
-
Create a text file listing the path to VCF file(s) (1 path per line) with variants to score
- Paths can be full absolute paths or relative paths (relative to the directory where the pipeline will be run
from, note the directory where the pipeline.nf file is)
-
See the example input file .test_data/file_list.txt (lists 2 testing example input VCFs)
for reference or as an input file for testing (default behavior of model.job)
- Either relative or absolute paths to the vcf.gz files can be supplied. Relative paths need to be
relative to the work directory that DITTO was executed from.
Example file_list.txt with relative paths (the first entry will become /home/<username>/Workspace/DITTO/.test_data/oc_test_data.vcf.gz):
.test_data/oc_test_data.vcf.gz
.test_data/testing_variants_hg38.vcf.gz
Or with absolute paths (this example uses a Linux home directory with a test_data directory):
/home/<username>/test_data/oc_test_data.vcf.gz
/home/<username>/test_data/testing_variants_hg38.vcf.gz
-
Update model.job (change the --sample_sheet option to your input file with VCF path(s) and
--outdir to the desired output location of DITTO predictions), then submit the job:
sbatch model.job
Reproducing the DITTO model
Detailed instructions for reproducing the model are given in build_DITTO.md
Download DITTO DB (Precomputed scores)
Precomputed scores for all possible SNVs and known indels from gnomAD v3.0 on the main chromosomes of the hg38 reference
genome are available for download at https://s3.lts.rc.uab.edu/cgds-public/dittodb/dittodb.html
How to cite?
Mamidi, T.K.K.; Wilk, B.M.; Gajapathy, M.; Worthey, E.A. DITTO: An Explainable Machine-Learning Model for
Transcript-Specific Variant Pathogenicity Prediction. Preprints 2024, 2024040837. https://doi.org/10.20944/preprints202404.0837.v1
Contact information
For queries, please open a GitHub issue.
For urgent queries, send an email with a clear description to:
| Name | Email |
|---|---|
| Tarun Mamidi | tmamidi@uab.edu |
| Liz Worthey | lworthey@uab.edu |