pvarelad/RNAseq-Nextflow-Pipeline
RNA-seq Analysis Pipeline
A comprehensive Nextflow pipeline for RNA-seq data analysis, encompassing quality control, read alignment, quantification, and differential expression analysis.
Overview
This pipeline performs end-to-end RNA-seq analysis from raw FASTQ files to differential expression results and pathway enrichment. It is designed to handle paired-end sequencing data and provides reproducible, scalable analysis with automated quality control and reporting.
Pipeline Workflow
The pipeline consists of the following major steps:
- Quality Control - Assessment of raw sequencing data quality
- Read Alignment - Splice-aware alignment to reference genome
- Quantification - Gene-level read counting
- Differential Expression Analysis - Statistical identification of differentially expressed genes
- Pathway Enrichment - Gene set enrichment analysis
- Visualization - PCA, heatmaps, and comparative plots
Requirements
Software Dependencies
- Nextflow (≥21.0)
- STAR (v2.7.11b)
- FastQC (v0.12.1)
- MultiQC (v1.25)
- VERSE (v0.1.5)
- Python (≥3.7)
- Biopython (v1.76)
- R (≥4.0)
- DESeq2
- fgsea
Input Requirements
- Paired-end FASTQ files (gzipped or uncompressed)
- Reference genome FASTA file
- Gene annotation GTF file
- Sample metadata/condition information
Installation
# Clone the repository
git clone https://github.com/pvarelad/RNAseq-Nextflow-Pipeline.git
cd RNAseq-Nextflow-Pipeline
# Install Nextflow (if not already installed)
curl -s https://get.nextflow.io | bash
Usage
Basic Usage
nextflow run main.nf \
--input samples.csv \
--genome reference.fasta \
--gtf annotations.gtf \
--outdir results
Input Format
Create a CSV file (samples.csv) with the following format:
sample,read1,read2
sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz
sample2,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz
Parameters
| Parameter | Description | Default |
|---|---|---|
| --input | Path to input CSV file | Required |
| --genome | Path to reference genome FASTA | Required |
| --gtf | Path to gene annotation GTF | Required |
| --outdir | Output directory | ./results |
| --star_index | Pre-built STAR index (optional) | None |
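The sample sheet format described under Input Format can be checked before launching a run. The following is a minimal Python sketch, not part of the pipeline itself: the function name `parse_samplesheet` and the specific checks (exact header, non-empty fields, unique sample names, distinct mate files) are assumptions chosen for illustration.

```python
import csv
import io

# Column names taken from the samples.csv format shown in this README.
REQUIRED_COLUMNS = ("sample", "read1", "read2")


def parse_samplesheet(handle):
    """Parse a samples.csv-style sheet into (sample, read1, read2) tuples.

    Raises ValueError on a wrong header, an empty field, a duplicate
    sample name, or identical read1/read2 paths.
    """
    reader = csv.DictReader(handle)
    if tuple(reader.fieldnames or ()) != REQUIRED_COLUMNS:
        raise ValueError(
            f"expected header {','.join(REQUIRED_COLUMNS)}, got {reader.fieldnames}"
        )
    samples, seen = [], set()
    for row in reader:
        line_no = reader.line_num
        sample, read1, read2 = (
            (row.get(col) or "").strip() for col in REQUIRED_COLUMNS
        )
        if not (sample and read1 and read2):
            raise ValueError(f"line {line_no}: empty field")
        if sample in seen:
            raise ValueError(f"line {line_no}: duplicate sample '{sample}'")
        if read1 == read2:
            raise ValueError(f"line {line_no}: read1 and read2 are the same file")
        seen.add(sample)
        samples.append((sample, read1, read2))
    return samples


# Usage with an in-memory sheet; a real file would be opened with open(path, newline="").
example = io.StringIO(
    "sample,read1,read2\n"
    "sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz\n"
    "sample2,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz\n"
)
for sample, read1, read2 in parse_samplesheet(example):
    print(sample, read1, read2)
```

The sketch deliberately does not test whether the FASTQ paths exist, since that depends on where the pipeline is executed.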
Pipeline Details
1. Reference Genome Indexing
A STAR genome index is generated from the reference FASTA and the GTF annotation, with all other parameters left at their defaults. If a pre-built index is provided via --star_index, this step is skipped.
STAR --runMode genomeGenerate \
--genomeDir index \
--genomeFastaFiles reference.fasta \
--sjdbGTFfile annotations.gtf
2. Quality Control (FastQC)
Raw FASTQ files are analyzed in parallel to assess:
- Per-base quality scores
- GC content distribution
- Adapter contamination
- Sequence duplication levels
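To make two of the metrics above concrete, here is a small Python sketch that computes mean quality per base position and overall GC fraction from a FASTQ stream. It is an illustration of what the metrics mean, not a reimplementation of FastQC; it assumes four-line FASTQ records and Phred+33 quality encoding, and the function names are hypothetical.

```python
import io
from collections import defaultdict


def fastq_records(handle):
    """Yield (sequence, quality_string) pairs from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        yield seq, qual


def summarize(handle, phred_offset=33):
    """Return (mean quality per base position, overall GC fraction)."""
    qual_sum = defaultdict(int)
    qual_n = defaultdict(int)
    gc = total = 0
    for seq, qual in fastq_records(handle):
        for pos, ch in enumerate(qual):
            qual_sum[pos] += ord(ch) - phred_offset  # ASCII to Phred score
            qual_n[pos] += 1
        gc += sum(base in "GCgc" for base in seq)
        total += len(seq)
    per_base = [qual_sum[p] / qual_n[p] for p in sorted(qual_n)]
    gc_fraction = gc / total if total else 0.0
    return per_base, gc_fraction


# Two toy reads: one with high qualities ('I' = Q40), one with the lowest ('!' = Q0).
toy = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\n!!!!\n")
per_base, gc_fraction = summarize(toy)
print(per_base, gc_fraction)
```

For gzipped input, the same functions can be given a handle from `gzip.open(path, "rt")`.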
3. Read Alignment (STAR)
Splice-aware alignment is performed using STAR with default settings optimized for RNA-seq data.
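The exact STAR invocation used by the pipeline is not shown in this README. As a rough sketch, a per-sample alignment command could be assembled as below. The STAR options used are real ones, but choosing them here (sorted BAM output, `zcat` for gzipped reads, the thread count, the output prefix) is an assumption made for illustration, and `star_align_command` is a hypothetical helper.

```python
import shlex


def star_align_command(sample, read1, read2, index_dir, threads=4):
    """Build a STAR argv list for one paired-end sample (illustrative settings)."""
    cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", index_dir,
        "--readFilesIn", read1, read2,
        # Assumption: coordinate-sorted BAM, matching the *.bam files in results/star/.
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", f"{sample}.",
    ]
    if read1.endswith(".gz"):
        # STAR needs a decompression command to read gzipped FASTQ files.
        cmd += ["--readFilesCommand", "zcat"]
    return cmd


cmd = star_align_command(
    "sample1",
    "/path/to/sample1_R1.fastq.gz",
    "/path/to/sample1_R2.fastq.gz",
    "index",
)
# Print a shell-safe version of the command instead of running it.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the command as a list keeps paths with spaces intact if it is later passed to `subprocess.run`.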
4. Quality Report Aggregation (MultiQC)
FastQC and STAR reports are aggregated into a single comprehensive HTML report for easy visualization across all samples.
5. Gene Quantification (VERSE)
Read counts are assigned to genomic features (exons/introns) to generate a raw count matrix. Individual sample count files are concatenated into a single CSV matrix.
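The concatenation step can be sketched in Python as follows. This is not the pipeline's own script: it assumes each per-sample file is tab-separated with a header row followed by `gene<TAB>count` lines, and it fills genes missing from a sample with 0. All function names are hypothetical.

```python
import csv
import io


def read_counts(handle):
    """Read one per-sample count file into {gene: count}, skipping the header row."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # header line, e.g. "gene<TAB>count"
    return {row[0]: int(row[1]) for row in reader if len(row) >= 2}


def merge_counts(per_sample):
    """Combine {sample: {gene: count}} into (header, rows) with one row per gene."""
    samples = sorted(per_sample)
    genes = sorted(set().union(*(counts.keys() for counts in per_sample.values())))
    header = ["gene"] + samples
    rows = [[g] + [per_sample[s].get(g, 0) for s in samples] for g in genes]
    return header, rows


def write_matrix(header, rows, handle):
    """Write the merged matrix as CSV."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


# Toy example with in-memory files standing in for per-sample count outputs.
per_sample = {
    "sample1": read_counts(io.StringIO("gene\tcount\nGENE_A\t5\nGENE_B\t0\n")),
    "sample2": read_counts(io.StringIO("gene\tcount\nGENE_A\t7\nGENE_C\t2\n")),
}
out = io.StringIO()
write_matrix(*merge_counts(per_sample), out)
print(out.getvalue())
```

Sorting samples and genes makes the matrix layout reproducible regardless of the order in which files are read.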
6. Differential Expression Analysis (DESeq2)
- Filtering: Lowly expressed genes are removed
- Normalization: Counts are normalized for library size (sequencing depth) using DESeq2 size factors
- Statistical Testing: Identification of significantly differentially expressed genes
- Visualization: PCA and heatmap generation for sample clustering
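The filtering and normalization steps above can be illustrated outside R. The sketch below shows a simple low-count filter and the median-of-ratios idea behind DESeq2 size factors in plain Python. It is a teaching aid under stated assumptions, not the pipeline's code: the pipeline runs DESeq2 in R, the `min_total=10` threshold is an assumed example value, and the function names are hypothetical.

```python
import math
from statistics import median


def filter_low_counts(matrix, min_total=10):
    """Keep genes whose summed count across samples is at least min_total."""
    return {gene: c for gene, c in matrix.items() if sum(c) >= min_total}


def size_factors(matrix):
    """Median-of-ratios size factors, one per sample (column).

    Genes with a zero count in any sample are skipped, because their
    geometric mean is zero. At least one gene must remain.
    """
    n_samples = len(next(iter(matrix.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for counts in matrix.values():
        if any(c == 0 for c in counts):
            continue
        log_geo_mean = sum(math.log(c) for c in counts) / n_samples
        for j, c in enumerate(counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


def normalize(matrix, factors):
    """Divide each sample's counts by that sample's size factor."""
    return {g: [c / f for c, f in zip(counts, factors)] for g, counts in matrix.items()}


# Toy matrix: the second sample was sequenced about twice as deeply as the first.
counts = {
    "GENE_A": [10, 20],
    "GENE_B": [100, 200],
    "GENE_C": [5, 10],
    "GENE_D": [0, 1],  # removed by the low-count filter
}
kept = filter_low_counts(counts)
factors = size_factors(kept)
print(factors)
print(normalize(kept, factors))
```

After normalization the two samples carry equal values for the retained genes, which is the intended effect when the only difference between them is depth.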
7. Pathway Enrichment (FGSEA)
Fast Gene Set Enrichment Analysis identifies biological pathways enriched among differentially expressed genes.
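For intuition, the running-sum enrichment score at the core of gene set enrichment analysis can be sketched in a few lines of Python. This computes only the score for one gene set from a ranked gene list; fgsea additionally estimates p-values, which is not attempted here. The function name, the toy statistics, and the gene names are illustrative assumptions.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Running-sum enrichment score for one gene set.

    ranked: iterable of (gene, statistic) pairs, e.g. a per-gene test statistic.
    The list is sorted here by decreasing statistic. Walking down the list,
    the sum rises at genes in the set (weighted by |statistic| ** weight) and
    falls at genes outside it; the score is the largest deviation from zero.
    """
    ranked = sorted(ranked, key=lambda pair: pair[1], reverse=True)
    hits = [gene in gene_set for gene, _ in ranked]
    n_miss = len(ranked) - sum(hits)
    hit_weight = sum(abs(stat) ** weight for (_, stat), hit in zip(ranked, hits) if hit)
    if n_miss == 0 or hit_weight == 0:
        raise ValueError("gene set must overlap the ranked list without covering it")
    running = best = 0.0
    for (_, stat), hit in zip(ranked, hits):
        running += abs(stat) ** weight / hit_weight if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranking: a set concentrated at the top scores positive, one at the bottom negative.
ranking = [("A", 3.0), ("B", 2.0), ("C", 1.0), ("D", -1.0), ("E", -2.0), ("F", -3.0)]
print(enrichment_score(ranking, {"A", "B"}))
print(enrichment_score(ranking, {"E", "F"}))
```

A positive score indicates that the set's genes cluster near the top of the ranking, and a negative score that they cluster near the bottom.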
8. Comparative Analysis
Results are compared with original study findings to highlight concordance and divergence.
Output Structure
results/
├── fastqc/ # FastQC reports for each sample
├── multiqc/ # Aggregated quality control report
├── star/ # STAR alignment files and logs
│ └── *.bam
├── counts/ # Individual and merged count files
│ ├── sample1_counts.txt
│ └── counts_matrix.csv
├── deseq2/ # Differential expression results
│ ├── dge_results.csv
│ ├── normalized_counts.csv
│ ├── pca_plot.pdf
│ └── heatmap.pdf
└── fgsea/ # Pathway enrichment results
└── enrichment_results.csv
Example Dataset
This pipeline was developed and tested on a dataset containing:
- 6 samples
- Paired-end sequencing
- 12 FASTQ files total (6 samples × 2 reads)
Tool Citations
- STAR: Dobin et al. (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics.
- FastQC: Andrews, S. (2010). FastQC: a quality control tool for high throughput sequence data.
- MultiQC: Ewels et al. (2016) MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics.
- DESeq2: Love et al. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology.
- fgsea: Korotkevich et al. (2021) Fast gene set enrichment analysis. bioRxiv.