lewisl23/RNAseq-analysis-pipeline
RNA seq analysis coursework that analyses the GSE268197 datasets through differential gene expression and functional enrichment analysis
RNA seq analysis of GSE268197
This is a RNA seq analysis coursework for MSc Bioinformatics that analyses the
RNA sequencing datasets from the paper "Effect of GRP75 deficiency on gene expression in DN3 thymocytes".
The study sequences wildtype and Hspa9 cKO DN3 thymocytes using Illumina HiSeq 2000 with 3 samples
for each condition. This project performed differential gene expression and functional enrichment analysis
on the datasets to understand changes in gene expression and biological insights related to the
change.
Methods
Raw reads from Illumina HiSeq 2000
Taken from "Effect of GRP75 deficiency on gene expression in DN3 thymocytes"
Accessed through GSE268197 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE268197
Wildtype
- GSM8287690
- GSM8287691
- GSM8287692
cKO (knockout)
- GSM8287693
- GSM8287694
- GSM8287695
1. Quality control and alignment
- Illumina reads undergoes quality control using FastQC and MultiQC to combine the QC reports together
- Alignment using STAR with the mouse reference genome to produce aligned BAM file
- Count the number of reads that overlaps with the genome annotation file using featureCounts
2. Exploratory analysis with PCA and Heatmap
- Data normalisation with rlog to ensures different samples are comparable
- 3D and 2D PCA analysis to identify whether the 2 conditions have distinct variations or
unwanted variations within the 3 samples of each condition - Heatmap is drawn with clustering to identify the distance of relationships between different samples
3. Differential gene expression analysis
- Data normalisation with Size factors to ensure comparatability
- Differential gene expression with DEseq2 to output the MA plot of the datasets, which
allows visualisation of the overall diffferential gene expression of the control vs knockout - Removal of insignificant differences using adjusted p-value of 0.5 (multiple testing problem)
and selecting the top 10 genes with the highest absolute log2 fold change (heatmap)
4. Functional enrichment analysis using GeneOntology (GO)
- Using mart to access "mmusculus_gene_ensembl" and Gene Ontology database
- Search for biological process, molecular function, and cellular component