RNA-seq Analysis Pipeline

A comprehensive Nextflow pipeline for RNA-seq data analysis, encompassing quality control, read alignment, quantification, and differential expression analysis.

Overview

This pipeline performs end-to-end RNA-seq analysis from raw FASTQ files to differential expression results and pathway enrichment. It is designed to handle paired-end sequencing data and provides reproducible, scalable analysis with automated quality control and reporting.

Pipeline Workflow

The pipeline consists of the following major steps:

Quality Control - Assessment of raw sequencing data quality
Read Alignment - Splice-aware alignment to reference genome
Quantification - Gene-level read counting
Differential Expression Analysis - Statistical identification of differentially expressed genes
Pathway Enrichment - Gene set enrichment analysis
Visualization - PCA, heatmaps, and comparative plots

Requirements

Software Dependencies

Nextflow (≥21.0)
STAR (v2.7.11b)
FastQC (v0.12.1)
MultiQC (v1.25)
VERSE (v0.1.5)
Python (≥3.7)
- Biopython (v1.76)
R (≥4.0)
- DESeq2
- fgsea

Input Requirements

Paired-end FASTQ files (gzipped or uncompressed)
Reference genome FASTA file
Gene annotation GTF file
Sample metadata/condition information

Installation

# Clone the repository
git clone https://github.com/yourusername/rnaseq-pipeline.git
cd rnaseq-pipeline

# Install Nextflow (if not already installed)
curl -s https://get.nextflow.io | bash

Usage

Basic Usage

nextflow run main.nf \
  --input samples.csv \
  --genome reference.fasta \
  --gtf annotations.gtf \
  --outdir results

Input Format

Create a CSV file (samples.csv) with the following format:

sample,read1,read2
sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz
sample2,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz
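
A malformed sample sheet is easier to catch before the pipeline starts than after an alignment job fails. The following is a minimal sketch of such a check, assuming only the three columns shown above; the function name `read_sample_sheet` and the specific checks are illustrative and are not part of the pipeline.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample", "read1", "read2")


def read_sample_sheet(handle):
    """Parse a sample sheet and return a list of (sample, read1, read2) tuples."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing column(s): " + ", ".join(missing))

    rows, seen = [], set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        # Short rows yield None for absent fields, so normalize to "" first.
        sample, read1, read2 = ((row[c] or "").strip() for c in REQUIRED_COLUMNS)
        if not (sample and read1 and read2):
            raise ValueError("line %d: empty field" % line_no)
        if sample in seen:
            raise ValueError("line %d: duplicate sample '%s'" % (line_no, sample))
        if read1 == read2:
            raise ValueError("line %d: read1 and read2 are the same file" % line_no)
        seen.add(sample)
        rows.append((sample, read1, read2))
    return rows


example = io.StringIO(
    "sample,read1,read2\n"
    "sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz\n"
)
for record in read_sample_sheet(example):
    print(record)
```

For a file on disk, pass `open("samples.csv", newline="")` instead of the in-memory example.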

Parameters

Parameter	Description	Default
`--input`	Path to input CSV file	Required
`--genome`	Path to reference genome FASTA	Required
`--gtf`	Path to gene annotation GTF	Required
`--outdir`	Output directory	`./results`
`--star_index`	Pre-built STAR index (optional)	None

Pipeline Details

1. Reference Genome Indexing

A STAR genome index is generated from the reference FASTA and GTF using STAR's default parameters. If a pre-built index is supplied via `--star_index`, this step is skipped.

STAR --runMode genomeGenerate \
  --genomeDir index \
  --genomeFastaFiles reference.fasta \
  --sjdbGTFfile annotations.gtf
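
The skip-if-provided behaviour can be sketched as a small helper that either returns the command above or signals that nothing needs building. This is an illustration only: the function name `star_index_command` is hypothetical, and the pipeline itself presumably makes this decision in its Nextflow code.

```python
def star_index_command(genome_fasta, gtf, index_dir="index", star_index=None):
    """Return the STAR genomeGenerate argv, or None when a pre-built index is supplied."""
    if star_index is not None:
        return None  # nothing to build; later steps would use star_index directly
    return [
        "STAR", "--runMode", "genomeGenerate",
        "--genomeDir", index_dir,
        "--genomeFastaFiles", genome_fasta,
        "--sjdbGTFfile", gtf,
    ]


cmd = star_index_command("reference.fasta", "annotations.gtf")
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where STAR is installed.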

2. Quality Control (FastQC)

Raw FASTQ files are analyzed in parallel to assess:

Per-base quality scores
GC content distribution
Adapter contamination
Sequence duplication levels
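
To make two of these metrics concrete, the sketch below computes mean quality per base position and overall GC fraction from FASTQ records. It is a simplified illustration, not a substitute for FastQC: it assumes four-line records and Phred+33 quality encoding, and the function names are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (sequence, quality_string) pairs from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line
        quality = handle.readline().strip()
        yield sequence, quality


def summarize(records, phred_offset=33):
    """Return (mean quality per base position, overall GC fraction)."""
    qual_sums, qual_counts = [], []
    gc = total = 0
    for sequence, quality in records:
        for pos, char in enumerate(quality):
            if pos == len(qual_sums):  # first read that reaches this position
                qual_sums.append(0)
                qual_counts.append(0)
            qual_sums[pos] += ord(char) - phred_offset
            qual_counts[pos] += 1
        gc += sum(base in "GCgc" for base in sequence)
        total += len(sequence)
    mean_quality = [s / n for s, n in zip(qual_sums, qual_counts)]
    return mean_quality, (gc / total if total else 0.0)


fastq_text = "@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n!!!!\n"
mean_quality, gc_fraction = summarize(read_fastq(io.StringIO(fastq_text)))
print(mean_quality, gc_fraction)
```

Gzipped input can be read by passing `gzip.open(path, "rt")` as the handle.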

3. Read Alignment (STAR)

Each sample's paired-end reads are aligned to the indexed reference genome with STAR, a splice-aware aligner, using its default alignment settings.
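
A per-sample alignment call might be assembled as in the sketch below. The flags are standard STAR options, but which of them the pipeline actually sets is an assumption, as are the function name, the thread count, and the unsorted BAM output.

```python
def star_align_command(index_dir, read1, read2, prefix, threads=4):
    """Build a STAR paired-end alignment command as an argv list."""
    cmd = [
        "STAR", "--runMode", "alignReads",
        "--runThreadN", str(threads),
        "--genomeDir", index_dir,
        "--readFilesIn", read1, read2,
        "--outSAMtype", "BAM", "Unsorted",
        "--outFileNamePrefix", prefix,
    ]
    # STAR reads plain text by default; gzipped input needs a decompression command.
    if read1.endswith(".gz") and read2.endswith(".gz"):
        cmd += ["--readFilesCommand", "zcat"]
    return cmd


cmd = star_align_command(
    "index",
    "/path/to/sample1_R1.fastq.gz",
    "/path/to/sample1_R2.fastq.gz",
    prefix="sample1.",
)
print(" ".join(cmd))
```

Handling the `.gz` case explicitly matters because the pipeline accepts both gzipped and uncompressed FASTQ files.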

4. Quality Report Aggregation (MultiQC)

FastQC and STAR reports are aggregated into a single HTML report so that quality metrics can be compared across all samples.

5. Gene Quantification (VERSE)

Aligned reads are assigned to genomic features (exons/introns) to produce raw counts per gene for each sample. The per-sample count files are then merged into a single CSV count matrix.
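
The merge step can be sketched as follows. This assumes each per-sample file is tab-separated with a header row followed by gene and count columns, and that a gene missing from one sample counts as zero; both are assumptions about the file layout, and the function names are hypothetical.

```python
import csv
import io


def read_counts(handle):
    """Read one two-column, tab-separated count file with a header row into a dict."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    return {row[0]: int(row[1]) for row in reader if row}


def merge_counts(per_sample):
    """Combine {sample: {gene: count}} into (header, rows) with one row per gene."""
    samples = sorted(per_sample)
    genes = sorted(set().union(*(counts.keys() for counts in per_sample.values())))
    header = ["gene"] + samples
    rows = [[g] + [per_sample[s].get(g, 0) for s in samples] for g in genes]
    return header, rows


def write_matrix(header, rows, handle):
    """Write the merged matrix as CSV."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


per_sample = {
    "sample1": read_counts(io.StringIO("gene\tcount\nGENE_A\t12\nGENE_B\t0\n")),
    "sample2": read_counts(io.StringIO("gene\tcount\nGENE_A\t30\nGENE_B\t4\n")),
}
out = io.StringIO()
write_matrix(*merge_counts(per_sample), out)
print(out.getvalue())
```

Sorting both samples and genes keeps the matrix layout deterministic between runs.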

6. Differential Expression Analysis (DESeq2)

Filtering: Lowly expressed genes are removed
Normalization: Counts are scaled by DESeq2 size factors to correct for differences in sequencing depth (library size) between samples
Statistical Testing: Identification of significantly differentially expressed genes
Visualization: PCA and heatmap generation for sample clustering
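
The filtering and normalization steps can be illustrated in plain Python. The sketch below removes genes below a total-count threshold and then computes median-of-ratios size factors, the method DESeq2 uses by default. The threshold of 10 is a hypothetical value (the pipeline's actual cutoff is not stated here), and in the pipeline itself this work is done by DESeq2 in R.

```python
import math
from statistics import median


def filter_low_counts(matrix, min_total=10):
    """Keep genes whose summed count across samples reaches min_total."""
    return {gene: counts for gene, counts in matrix.items() if sum(counts) >= min_total}


def size_factors(matrix):
    """Median-of-ratios size factors, one per sample (column)."""
    # Only genes with a nonzero count in every sample have a usable geometric mean.
    usable = [counts for counts in matrix.values() if all(c > 0 for c in counts)]
    if not usable:
        raise ValueError("no gene has nonzero counts in every sample")
    n_samples = len(usable[0])
    log_geo_means = [sum(math.log(c) for c in counts) / n_samples for counts in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(counts[j]) - lgm for counts, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(matrix, factors):
    """Divide each sample's counts by that sample's size factor."""
    return {gene: [c / f for c, f in zip(counts, factors)] for gene, counts in matrix.items()}


# Toy matrix: the second sample was sequenced about twice as deeply as the first.
counts = {"GENE_A": [10, 20], "GENE_B": [100, 200], "GENE_C": [1, 0], "GENE_D": [50, 100]}
kept = filter_low_counts(counts)
factors = size_factors(kept)
print(factors)
print(normalize(kept, factors))
```

After normalization, genes whose raw counts differ only because of sequencing depth end up with equal values in both samples, which is the property the statistical test relies on.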

7. Pathway Enrichment (FGSEA)

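Fast Gene Set Enrichment Analysis (fgsea) takes the genes ranked by their differential expression statistics and identifies biological pathways whose member genes cluster toward the top or bottom of that ranking.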
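
The statistic behind this kind of analysis is a running sum over the ranked gene list. The sketch below computes a GSEA-style enrichment score for one gene set; it is a simplified illustration of the idea and does not reproduce fgsea's p-value estimation. The ranking statistic and gene names are hypothetical.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Running-sum enrichment score for gene_set over (gene, statistic) pairs."""
    ordered = sorted(ranked, key=lambda item: item[1], reverse=True)
    hits = [gene in gene_set for gene, _ in ordered]
    n_hits = sum(hits)
    n_miss = len(ordered) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    hit_total = sum(abs(stat) ** weight for (_, stat), hit in zip(ordered, hits) if hit)
    if hit_total == 0:
        raise ValueError("gene set statistics are all zero")

    running = best = 0.0
    for (_, stat), hit in zip(ordered, hits):
        if hit:
            running += abs(stat) ** weight / hit_total  # step up, weighted by the statistic
        else:
            running -= 1.0 / n_miss  # step down by a constant for genes outside the set
        if abs(running) > abs(best):
            best = running  # keep the largest deviation from zero, with its sign
    return best


ranked = [("GENE_A", 3.0), ("GENE_B", 2.0), ("GENE_C", 1.0), ("GENE_D", -1.0)]
print(enrichment_score(ranked, {"GENE_A", "GENE_B"}))
print(enrichment_score(ranked, {"GENE_D"}))
```

A positive score means the set's genes sit near the top of the ranking, a negative score means they sit near the bottom.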

8. Comparative Analysis

Results are compared with the original study's findings to highlight where they agree and where they diverge.
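
One simple way to quantify concordance is to compare the sets of significant genes from the two analyses. The sketch below is an assumption about how such a comparison could be done, not a description of the pipeline's method; the function name and gene identifiers are hypothetical.

```python
def compare_gene_sets(ours, reference):
    """Summarize overlap between two collections of gene identifiers."""
    ours, reference = set(ours), set(reference)
    shared = ours & reference
    union = ours | reference
    return {
        "shared": sorted(shared),
        "only_ours": sorted(ours - reference),
        "only_reference": sorted(reference - ours),
        # Jaccard index: shared genes as a fraction of all genes reported by either analysis.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


summary = compare_gene_sets(
    ours=["GENE_A", "GENE_B", "GENE_C"],
    reference=["GENE_B", "GENE_C", "GENE_D"],
)
print(summary)
```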

Output Structure

results/
├── fastqc/                 # FastQC reports for each sample
├── multiqc/                # Aggregated quality control report
├── star/                   # STAR alignment files and logs
│   └── *.bam
├── counts/                 # Individual and merged count files
│   ├── sample1_counts.txt
│   └── counts_matrix.csv
├── deseq2/                 # Differential expression results
│   ├── dge_results.csv
│   ├── normalized_counts.csv
│   ├── pca_plot.pdf
│   └── heatmap.pdf
└── fgsea/                  # Pathway enrichment results
    └── enrichment_results.csv

Example Dataset

This pipeline was developed and tested on a dataset containing:

6 samples
Paired-end sequencing
12 FASTQ files total (6 samples × 2 read files per sample)

Tool Citations

STAR: Dobin et al. (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics.
FastQC: Andrews, S. (2010). FastQC: a quality control tool for high throughput sequence data.
MultiQC: Ewels et al. (2016) MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics.
DESeq2: Love et al. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology.
fgsea: Korotkevich et al. (2021) Fast gene set enrichment analysis. bioRxiv.

pvarelad/RNAseq-Nextflow-Pipeline