RNAseq data analysis

Please note that the methods below are for projects delivered before April 7th, 2025.

Methods for projects delivered after April 7th, 2025 can be found here: bioinformatics.mskcc.org/bic/resources/methods/nf-rnaseq-pipeline-methods

Description

The output data (FASTQ files) are mapped to the target genome using the STAR aligner [Dobin2012] that maps reads genomically and resolves reads across splice junctions. We use the 2 pass mapping method outlined in [Engström2013] in which the reads are mapped twice. The first mapping pass uses a list of known annotated junctions from Ensemble. Novel junctions found in the first pass are then added to the known junctions and a second mapping pass is done (on the second pass the RemoveNoncanoncial flag is used). After mapping we post process the output SAM files using the PICARD tools to: add read groups, AddOrReplaceReadGroups which in additional sorts the file and coverts it to the compressed BAM format.

We then compute the expression count matrix from the mapped reads using HTSeq (www-huber.embl.de/users/anders/HTSeq) and one of several possible gene model databases. The raw count matrix generated by HTSeq are then be processed using the R/Bioconductor package DESeq (www-huber.embl.de/users/anders/DESeq) which is used to both normalize the full dataset and analyze differential expression between sample groups.

The data was clustered in several ways using the normalized counts of all genes that a total of 10 counts when summed across all samples.

Hierarchical cluster with the correlation metric (D_{ij} = 1 - cor(X_i,X_j)) with the Pearson correlation on the normalized log2 expression values.
Multidimensional scaling
Principal component analysis

Heatmap are generated using the heatmap.2 function from the gplots R package. For the Heatmaps the top 100 differentiall expressed genes are used. The data plot was the mean centered normalized log2 expression of the top 100 significant genes.

For human and mouse datasets we run a gene set analysis using GSEA (see GSEA Info and User Guide) with gene sets from the Broad mSigDb. If a sample group has fewer than three samples, we run GSEAPreranked using log2 fold changes from DESeq.

Program versions

Program	Version
HTSEQ	htseq/HTSeq-0.5.3
PICARD	picard/picard-tools-1.124
R	R/R-3.2.0
STAR	star/STAR-STAR_2.5.0a
SAMTOOLS	samtools/samtools-0.1.19

Data Files

Human:
- GENOME: UCSC HG19
- GTF: gencode.v18.annotation
Mouse:
- GENOME: UCSC MM10
- GTF: Mus_musculus.GRCm38.80

References:

[Dobin2012] A. Dobin et al, STAR: ultrafast universal RNA-seq aligner. Bioinformatics 2012.

[Engström2013] Engström, et. al., Systematic evaluation of spliced alignment programs for RNA-seq data, Nat. Meth., 10, pp. 1185, 2013.