Great tit (Parus major) INDEL analysis

Pipeline from Barton and Zeng (2019).

Henry Juho Barton
Department of Animal and Plant Sciences, The University of Sheffield

Introduction

This repository outlines the pipeline used to generate and analyse an INDEL dataset from 10 high-coverage (mean coverage = 44X) great tit (Parus major) genomes, which are described in Corcoran et al. (2017). The repository is subdivided by processing step.

Programs required

Python 2.7.2
GATK version 3.4-46-gbc02625 available from: https://software.broadinstitute.org/gatk/download/archive
VCFtools version 0.1.12b available from: https://sourceforge.net/projects/vcftools/files/
SAMtools version 1.2 available from: https://sourceforge.net/projects/samtools/files/samtools/
BCFtools version 1.3
bedtools version 2.23.0
anavar version 1.2.2
q_sub.py and qsub_gen.py available from https://github.com/henryjuho/python_qsub_wrapper
pysam version 0.11.2.1 available from https://github.com/pysam-developers/pysam

Note that most scripts use the script 'qsub_gen.py', which is designed to submit jobs in the form of shell scripts to the Sun Grid Engine. If only the shell scripts are required, the '-OM' option in the 'qsub_gen.py' command line within the scripts can be changed from 'q' to 'w'. Alternatively, some scripts use the Python qsub wrapper module qsub.py, described here: https://github.com/henryjuho/python_qsub_wrapper.

Pre-prepared files required for analysis

Reference genome: /fastdata/bop15hjb/GT_ref/Parus_major_1.04.rename.fa
Reference genome index file: /fastdata/bop15hjb/GT_ref/Parus_major_1.04.rename.fa.fai
GFF annotation file: /fastdata/bop15hjb/GT_ref/GCF_001522545.1_Parus_major1.0.3_genomic.gff.gz
All sites VCF: /fastdata/bop15hjb/GT_data/BGI/bgi_10birds.raw.snps.indels.all_sites.vcf
Repeat masker bed file: /fastdata/bop15hjb/GT_data/BGI_10_repeats/ParusMajorBuild1_v24032014_reps.bed
BAM files for SAMtools calling: /fastdata/bop15hjb/GT_data/BGI_10_BAM/*.bam

Pipeline

Generating the dataset

The variant calling and filtering pipeline for both SNPs and INDELs is described here: variant_calling/.

Multispecies alignment and INDEL polarisation

The generation of a multiple species alignment between zebra finch, great tit and flycatcher, and its use in polarising variants and identifying ancestral repeats, is described here: alignment_and_polarisation/.
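As a rough illustration of what polarisation involves, the sketch below assigns ancestral and derived states to a biallelic INDEL by parsimony. It assumes a variant is polarised only when both outgroups (zebra finch and flycatcher) carry the same allele and that allele matches one of the two great tit alleles. This rule, and the function names `polarise` and `indel_type`, are assumptions made for illustration; the scripts in alignment_and_polarisation/ define the actual procedure.

```python
def polarise(ref, alt, outgroup_alleles):
    """Return (ancestral, derived) for a biallelic variant, or None.

    ref, alt: the two alleles segregating in great tit.
    outgroup_alleles: alleles seen in the outgroups at the aligned position.
    Assumed rule: all outgroups must agree and match ref or alt.
    """
    unique = set(outgroup_alleles)
    if len(unique) != 1:
        return None  # no outgroup data, or outgroups disagree
    ancestral = unique.pop()
    if ancestral == ref:
        return ref, alt
    if ancestral == alt:
        return alt, ref
    return None  # outgroup allele matches neither great tit allele


def indel_type(ancestral, derived):
    """Classify a polarised INDEL by comparing allele lengths."""
    return 'insertion' if len(derived) > len(ancestral) else 'deletion'
```

For example, a variant with alleles 'A' and 'ATT' where both outgroups carry 'ATT' would be treated as a deletion under this rule, while a variant where the outgroups disagree would be left unpolarised.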

Annotating the data

Variant annotation using the NCBI GFF file is described here: annotation/.

Summary statistics and analyses

Summary statistic calculations and other data summary analyses are documented here: summary_analyses/.

Anavar analyses

Analysis of the INDEL data with the anavar package is described here: anavar_analyses/.

Proximity analyses

Analysis of INDEL data in windows of increasing distance from exons is described here: gene_proximity_analyses/.
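To make the idea of distance windows concrete, here is a minimal sketch that finds the distance from a variant position to the nearest exon and converts it to a window index. It assumes exons are supplied as a sorted, non-overlapping list of BED-style (0-based, half-open) intervals on one chromosome. The 2 kb window size and the function names are hypothetical and not taken from the pipeline.

```python
from bisect import bisect_right


def distance_to_nearest_exon(pos, exons):
    """Distance in bp from a 0-based position to the nearest exon base.

    exons: non-empty, sorted, non-overlapping list of (start, end)
    tuples, 0-based and half-open. Returns 0 if pos lies in an exon.
    """
    if not exons:
        raise ValueError('exons must not be empty')
    starts = [start for start, _ in exons]
    i = bisect_right(starts, pos)  # number of exons starting at or before pos
    distances = []
    if i > 0:
        start, end = exons[i - 1]
        if pos < end:
            return 0  # position falls within this exon
        distances.append(pos - (end - 1))  # to last base of upstream exon
    if i < len(exons):
        distances.append(exons[i][0] - pos)  # to first base of next exon
    return min(distances)


def window_index(distance, window_size=2000):
    """Map a distance to a 0-based window; None for exonic positions."""
    if distance == 0:
        return None
    return (distance - 1) // window_size
```

Variants can then be grouped by window index and summary statistics calculated per window, so that window 0 holds sites within 1 to 2000 bp of an exon, window 1 holds sites 2001 to 4000 bp away, and so on.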

Recombination analyses

The pipeline for relating INDEL diversity and Tajima's D to recombination rate is documented here: recombination_analyses/.
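For reference, the two statistics named above can be computed from a site frequency spectrum using the standard formulae (Tajima 1989). The sketch below is a generic implementation, not the code used in recombination_analyses/. It assumes an unfolded SFS given as a list of counts for derived allele frequencies 1 to n-1, where n is the number of sampled chromosomes (20 for 10 diploid birds). The function names are illustrative.

```python
from math import sqrt


def pi_from_sfs(sfs, n):
    """Sum of pairwise differences over all sites (not per site).

    sfs[i - 1] is the number of variants with derived allele count i.
    """
    pairs = n * (n - 1) / 2.0
    return sum(count * i * (n - i) / pairs
               for i, count in enumerate(sfs, start=1))


def tajimas_d(sfs, n):
    """Tajima's D from an SFS for n sampled chromosomes."""
    s = float(sum(sfs))  # number of segregating sites
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1.0) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2.0) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between pi and Watterson's estimator, scaled by its SD
    return (pi_from_sfs(sfs, n) - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))
```

Dividing the output of `pi_from_sfs` by the number of callable sites in a region gives per-site diversity, which is the quantity that would typically be compared across recombination rate bins.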

Length analyses

Analysis of the impact of INDEL length on the site frequency spectrum (SFS) is documented here: length_analyses/.

henryjuho/parus_indel