pvarelad/RNAseq-Nextflow-Pipeline

A comprehensive Nextflow pipeline for RNA-seq data analysis, encompassing quality control, read alignment, quantification, and differential expression analysis.

RNA-seq Analysis Pipeline

Overview

This pipeline performs end-to-end RNA-seq analysis from raw FASTQ files to differential expression results and pathway enrichment. It is designed to handle paired-end sequencing data and provides reproducible, scalable analysis with automated quality control and reporting.

Pipeline Workflow

The pipeline consists of the following major steps:

  1. Quality Control - Assessment of raw sequencing data quality
  2. Read Alignment - Splice-aware alignment to reference genome
  3. Quantification - Gene-level read counting
  4. Differential Expression Analysis - Statistical identification of differentially expressed genes
  5. Pathway Enrichment - Gene set enrichment analysis
  6. Visualization - PCA, heatmaps, and comparative plots

Requirements

Software Dependencies

  • Nextflow
  • FastQC
  • STAR
  • MultiQC
  • VERSE
  • R with the DESeq2 and fgsea packages

Input Requirements

  • Paired-end FASTQ files (gzipped or uncompressed)
  • Reference genome FASTA file
  • Gene annotation GTF file
  • Sample metadata/condition information

Installation

# Clone the repository
git clone https://github.com/pvarelad/RNAseq-Nextflow-Pipeline.git
cd RNAseq-Nextflow-Pipeline

# Install Nextflow (if not already installed)
curl -s https://get.nextflow.io | bash

Usage

Basic Usage

nextflow run main.nf \
  --input samples.csv \
  --genome reference.fasta \
  --gtf annotations.gtf \
  --outdir results

Input Format

Create a CSV file (samples.csv) with the following format:

sample,read1,read2
sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz
sample2,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz
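A malformed sample sheet is easier to catch before the pipeline starts than after an alignment job fails. The following is a minimal sketch of such a check in Python, not part of the pipeline itself: the function name `read_sample_sheet` and the specific rules (exact header, no empty fields, no duplicate sample names) are assumptions chosen for illustration.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample", "read1", "read2"]


def read_sample_sheet(handle):
    """Parse a samples.csv handle into row dicts, raising ValueError on problems."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != REQUIRED_COLUMNS:
        raise ValueError(f"expected header {REQUIRED_COLUMNS}, got {reader.fieldnames}")
    rows, seen = [], set()
    for line_no, row in enumerate(reader, start=2):
        # Every column must be present and non-empty.
        if any(not (row[col] or "").strip() for col in REQUIRED_COLUMNS):
            raise ValueError(f"line {line_no}: empty field")
        if row["sample"] in seen:
            raise ValueError(f"line {line_no}: duplicate sample {row['sample']!r}")
        if row["read1"] == row["read2"]:
            raise ValueError(f"line {line_no}: read1 and read2 are the same file")
        seen.add(row["sample"])
        rows.append(row)
    return rows


# The example sheet from above, read from memory instead of disk.
example = io.StringIO(
    "sample,read1,read2\n"
    "sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz\n"
    "sample2,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz\n"
)
rows = read_sample_sheet(example)
print([row["sample"] for row in rows])
```

For a file on disk, pass `open("samples.csv", newline="")` instead of the in-memory handle.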

Parameters

Parameter      Description                        Default
--input        Path to input CSV file             Required
--genome       Path to reference genome FASTA     Required
--gtf          Path to gene annotation GTF        Required
--outdir       Output directory                   ./results
--star_index   Pre-built STAR index (optional)    None

Pipeline Details

1. Reference Genome Indexing

A STAR genome index is generated using default parameters. If a pre-built index is provided via --star_index, this step is skipped.

STAR --runMode genomeGenerate \
  --genomeDir index \
  --genomeFastaFiles reference.fasta \
  --sjdbGTFfile annotations.gtf
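The skip-if-provided behaviour can be sketched in a few lines of Python. This is an illustration of the decision only, not the pipeline's Nextflow code; the function name `star_index_command` and its arguments are hypothetical, while the STAR options are the ones shown in the command above.

```python
def star_index_command(genome_fasta, gtf, index_dir="index", star_index=None):
    """Return the STAR genomeGenerate command as an argument list, or None to skip."""
    if star_index:
        # A pre-built index was supplied, so no indexing command is needed.
        return None
    return [
        "STAR", "--runMode", "genomeGenerate",
        "--genomeDir", index_dir,
        "--genomeFastaFiles", genome_fasta,
        "--sjdbGTFfile", gtf,
    ]


print(star_index_command("reference.fasta", "annotations.gtf"))
print(star_index_command("reference.fasta", "annotations.gtf", star_index="/data/star_index"))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` on a machine where STAR is installed.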

2. Quality Control (FastQC)

Raw FASTQ files are analyzed in parallel to assess:

  • Per-base quality scores
  • GC content distribution
  • Adapter contamination
  • Sequence duplication levels
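To make two of these metrics concrete, here is a small Python sketch that computes per-base mean quality and per-read GC content from FASTQ records. It is not how FastQC is implemented; it assumes four-line FASTQ records and Phred+33 quality encoding, and the function names are made up for this example.

```python
import io


def fastq_records(handle):
    """Yield (sequence, quality string) pairs from a four-line-per-record FASTQ handle."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        yield seq, qual


def per_base_mean_quality(records, offset=33):
    """Mean Phred score at each read position (Phred+33 encoding assumed)."""
    totals, counts = [], []
    for _, qual in records:
        for pos, char in enumerate(qual):
            if pos == len(totals):
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(char) - offset
            counts[pos] += 1
    return [total / count for total, count in zip(totals, counts)]


def gc_fraction(seq):
    """Fraction of G and C bases in a read."""
    return sum(base in "GCgc" for base in seq) / len(seq) if seq else 0.0


# Two toy reads: 'I' encodes Phred 40 and '5' encodes Phred 20.
fastq_text = "@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\n5555\n"
records = list(fastq_records(io.StringIO(fastq_text)))
mean_quality = per_base_mean_quality(records)
gc_values = [gc_fraction(seq) for seq, _ in records]
print(mean_quality, gc_values)
```

For gzipped input, `gzip.open(path, "rt")` yields a handle that works with `fastq_records` unchanged.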

3. Read Alignment (STAR)

Splice-aware alignment is performed with STAR using its default settings, which are designed for RNA-seq reads.
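As a rough illustration, a per-sample alignment command could be assembled as below. The exact options the pipeline passes are not listed in this README, so everything beyond `--genomeDir` and the two read files is an assumption: the thread count, unsorted BAM output, the output prefix, and the use of `zcat` for gzipped reads. The function name `star_align_command` is hypothetical.

```python
def star_align_command(sample, read1, read2, index_dir, threads=4):
    """Build a STAR paired-end alignment command as an argument list."""
    cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", index_dir,
        "--readFilesIn", read1, read2,
        "--outSAMtype", "BAM", "Unsorted",      # assumed output type
        "--outFileNamePrefix", f"{sample}.",    # assumed naming scheme
    ]
    if read1.endswith(".gz"):
        # STAR expects uncompressed text, so gzipped reads need a decompression command.
        cmd += ["--readFilesCommand", "zcat"]
    return cmd


cmd_gz = star_align_command(
    "sample1", "sample1_R1.fastq.gz", "sample1_R2.fastq.gz", "index"
)
cmd_plain = star_align_command("sample2", "sample2_R1.fastq", "sample2_R2.fastq", "index")
print(" ".join(cmd_gz))
```

Handling the `.gz` case explicitly matters because the input requirements allow both gzipped and uncompressed FASTQ files.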

4. Quality Report Aggregation (MultiQC)

FastQC and STAR reports are aggregated into a single comprehensive HTML report for easy visualization across all samples.

5. Gene Quantification (VERSE)

Reads are assigned to genomic features (exons/introns) to produce raw per-sample counts. The individual sample count files are then merged into a single CSV count matrix.
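The merge step can be sketched as follows. This is an illustrative Python version, not the pipeline's own script: it assumes each per-sample file is a two-column, tab-separated table with a header line (gene, count), and it fills genes missing from a sample with 0, which is also an assumption.

```python
import csv
import io


def merge_counts(sample_files):
    """Merge {sample name: file handle} count tables into a list of matrix rows."""
    samples = list(sample_files)
    counts = {}  # gene -> {sample: count}
    for sample, handle in sample_files.items():
        reader = csv.reader(handle, delimiter="\t")
        next(reader)  # skip the header line
        for row in reader:
            if not row:
                continue  # tolerate blank lines
            gene, count = row
            counts.setdefault(gene, {})[sample] = int(count)
    header = ["gene"] + samples
    body = [
        [gene] + [per_sample.get(s, 0) for s in samples]
        for gene, per_sample in sorted(counts.items())
    ]
    return [header] + body


# Toy per-sample tables held in memory for the example.
matrix = merge_counts({
    "sample1": io.StringIO("gene\tcount\nGENE_A\t10\nGENE_B\t5\n"),
    "sample2": io.StringIO("gene\tcount\nGENE_B\t7\n"),
})

out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(matrix)
print(out.getvalue())
```

Keying on the gene identifier rather than pasting columns side by side keeps the matrix correct even if the per-sample files list genes in different orders.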

6. Differential Expression Analysis (DESeq2)

  • Filtering: Lowly expressed genes are removed
  • Normalization: Counts are normalized for differences in library size (sequencing depth) between samples
  • Statistical Testing: Identification of significantly differentially expressed genes
  • Visualization: PCA and heatmap generation for sample clustering
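The filtering and normalization steps can be illustrated with a short Python sketch. DESeq2 itself is an R package and does considerably more (dispersion estimation, shrinkage, testing); the code below only mirrors the median-of-ratios idea behind its size factors, and the filtering threshold of 10 total reads is an assumption, since this README does not state the cutoff used.

```python
import math
from statistics import median


def filter_low_counts(matrix, min_total=10):
    """Keep genes whose summed count across samples reaches min_total (assumed cutoff)."""
    return {gene: row for gene, row in matrix.items() if sum(row) >= min_total}


def size_factors(matrix):
    """Median-of-ratios size factors, one per sample."""
    n_samples = len(next(iter(matrix.values())))
    # Only genes with non-zero counts in every sample have a defined log geometric mean.
    usable = [row for row in matrix.values() if all(count > 0 for count in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy matrix: the second sample was sequenced twice as deeply as the first.
counts = {"g1": [10, 20], "g2": [100, 200], "g3": [1, 0], "g4": [50, 100]}
kept = filter_low_counts(counts)
factors = size_factors(kept)
normalized = {
    gene: [count / factor for count, factor in zip(row, factors)]
    for gene, row in kept.items()
}
print(factors)
print(normalized)
```

After dividing by the size factors, each gene in the toy matrix has the same normalized value in both samples, which is the behaviour expected when samples differ only in sequencing depth.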

7. Pathway Enrichment (FGSEA)

Fast Gene Set Enrichment Analysis identifies biological pathways enriched among differentially expressed genes.
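Gene set enrichment methods of this family score a pathway by walking down a ranked gene list and tracking a running sum. The Python sketch below computes that enrichment score for one gene set. It is not the fgsea implementation, which is an R package that additionally estimates p-values; how this pipeline ranks genes is not stated here, so the ranking statistic in the example is made up.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Signed maximum deviation of the running sum for one gene set.

    ranked: list of (gene, statistic) pairs sorted by statistic, descending.
    """
    hits = [abs(stat) ** weight if gene in gene_set else 0.0 for gene, stat in ranked]
    hit_total = sum(hits)
    n_miss = sum(1 for gene, _ in ranked if gene not in gene_set)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for (gene, _), hit in zip(ranked, hits):
        # Step up (weighted by the statistic) on a set member, step down otherwise.
        running += hit / hit_total if gene in gene_set else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranking statistics for four genes.
ranked = [("A", 4.0), ("B", 3.0), ("C", -1.0), ("D", -2.0)]
es_top = enrichment_score(ranked, {"A", "B"})
es_bottom = enrichment_score(ranked, {"C", "D"})
print(es_top, es_bottom)
```

A set concentrated at the top of the ranking gets a positive score and one concentrated at the bottom gets a negative score, which is how up- and down-regulated pathways are distinguished.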

8. Comparative Analysis

Results are compared with the findings of the original study to highlight concordance and divergence.

Output Structure

results/
├── fastqc/                 # FastQC reports for each sample
├── multiqc/                # Aggregated quality control report
├── star/                   # STAR alignment files and logs
│   └── *.bam
├── counts/                 # Individual and merged count files
│   ├── sample1_counts.txt
│   └── counts_matrix.csv
├── deseq2/                 # Differential expression results
│   ├── dge_results.csv
│   ├── normalized_counts.csv
│   ├── pca_plot.pdf
│   └── heatmap.pdf
└── fgsea/                  # Pathway enrichment results
    └── enrichment_results.csv

Example Dataset

This pipeline was developed and tested on a dataset containing:

  • 6 samples
  • Paired-end sequencing
  • 12 FASTQ files total (6 samples × 2 read files)

Tool Citations

  • STAR: Dobin et al. (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics.
  • FastQC: Andrews, S. (2010). FastQC: a quality control tool for high throughput sequence data.
  • MultiQC: Ewels et al. (2016) MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics.
  • DESeq2: Love et al. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology.
  • fgsea: Korotkevich et al. (2021) Fast gene set enrichment analysis. bioRxiv.