Input files
             In this section, we will read in
the required input files for MARVEL. The formatting requirement for each
input file will be explained.
Splice junction counts matrix
             The rows of this matrix represent
the splice junction coordinates, the columns represent the sample IDs,
and the values represent the splice junction counts. The first column
should be named coord.intron
.
             Here, the splice junction counts were quantified using the
STAR aligner version 2.6.1d in 2-pass mode (Dobin et al.,
2013). An example code for one sample (ERR1562083) below. Note a
separate folder SJ
is created here to contain the splice
junction count files (SJ.out.tab) generated from 1st pass mode to be
used for 2nd pass mode.
# STAR in 1st pass mode
STAR --runThreadN 16 \
--genomeDir GRCh38_GENCODE_genome_STAR_indexed \
--readFilesCommand zcat \
--readFilesIn ERR1562083_1_val_1.fq.gz ERR1562083_2_val_2.fq.gz \
--outFileNamePrefix SJ/ERR1562083. \
--outSAMtype None
# STAR in 2nd pass mode
STAR --runThreadN 16 \
--genomeDir GRCh38_GENCODE_genome_STAR_indexed \
--readFilesCommand zcat \
--readFilesIn ERR1562083_1_val_1.fq.gz ERR1562083_2_val_2.fq.gz \
--outFileNamePrefix ERR1562083. \
--sjdbFileChrStartEnd SJ/*SJ.out.tab \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS nM XS \
--quantMode TranscriptomeSAM
             Once the individual splice junction
count files have been generated, they should be collated and read into R
as follows:
path <- "Data/SJ/"
file <- "SJ.txt"
sj <- as.data.frame(fread(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA"))
sj[!is.na(sj[,2]), ][1:5,1:5]
## coord.intron ERR1562083 ERR1562084 ERR1562085 ERR1562086
## 9 chr1:100007082:100022621 0 NA NA NA
## 10 chr1:100007091:100024554 0 NA NA NA
## 18 chr1:100007157:100011364 12 3 5 9
## 23 chr1:100007601:100023781 1 NA NA NA
## 39 chr1:100011534:100015301 15 NA 8 12
Splicing event metadata
             The rows of this metadata represent
the splicing events while the columns represent the splicing event
information such as the transcript ID and the corresponding gene
information. Compulsory columns are tran_id
and
gene_id
.
             The splicing events here were detected using rMATS version
4.1.0 (Shen et al., 2014).Example code for running rMATS as
follows.
             Note that any BAM files may be specified in
--b1
and --b2
. This is because rMATS requires
these specification for statistical testing of splicing events between
the two samples. But here, we will only be using the splicing events
detected (fromGTF.SE.txt, fromGTF.MXE.txt, fromGTF.RI.txt,
fromGTF.A5SS.txt, fromGTF.A3SS.txt), but not the statistical test
results, from this step for our downstream analysis.
rmats \
--b1 path_to_BAM_sample_1.txt \
--b2 path_to_BAM_sample_2.txt \
--gtf gencode.v31.annotation.gtf \
--od rMATS/ \
--tmp rMATS/ \
-t paired \
--readLength 125 \
--variable-read-length \
--nthread 8 \
--statoff
# SE
path <- "Data/rMATS/SE/"
file <- "SE_featureData.txt"
df.feature.se <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA")
head(df.feature.se)
## tran_id
## 1 chrX:156022314:156022459:+@chrX:156022699:156022834:+@chrX:156023012:156023209
## 2 chrX:154227192:154227360:+@chrX:154228828:154228993:+@chrX:154230548:154230598
## 3 chrX:154190054:154190222:+@chrX:154191688:154191853:+@chrX:154193408:154193458
## 4 chrX:154152940:154153108:+@chrX:154154574:154154739:+@chrX:154156294:154156344
## 5 chrX:152918983:152919081:+@chrX:152920328:152920411:+@chrX:152920707:152920748
## 6 chrX:152922173:152922256:+@chrX:152922720:152922809:+@chrX:152928575:152928661
## gene_id gene_short_name gene_type
## 1 ENSG00000182484.15 WASH6P transcribed_unprocessed_pseudogene
## 2 ENSG00000166160.9 OPN1MW2 protein_coding
## 3 ENSG00000268221.5 OPN1MW protein_coding
## 4 ENSG00000102076.9 OPN1LW protein_coding
## 5 ENSG00000147394.18 ZNF185 protein_coding
## 6 ENSG00000147394.18 ZNF185 protein_coding
# MXE
path <- "Data/rMATS/MXE/"
file <- "MXE_featureData.txt"
df.feature.mxe <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA")
# RI
path <- "Data/rMATS/RI/"
file <- "RI_featureData.txt"
df.feature.ri <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA")
# A5SS
path <- "Data/rMATS/A5SS/"
file <- "A5SS_featureData.txt"
df.feature.a5ss <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA")
# A3SS
path <- "Data/rMATS/A3SS/"
file <- "A3SS_featureData.txt"
df.feature.a3ss <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA")
# Merge
df.feature.list <- list(df.feature.se, df.feature.mxe, df.feature.ri, df.feature.a5ss, df.feature.a3ss)
names(df.feature.list) <- c("SE", "MXE", "RI", "A5SS", "A3SS")
Intron count matrix
             The rows of this matrix represent
intron coordinates, the columns represent the sample IDs, and the values
represent the total reads mapping to the intron. These values will be
used to compute the percent spliced-in (PSI) values of retained introns
(RI) splicing events downstream.
             Here, intron coverage was computed using Bedtools version
2.27.1 (Quinlan
et al., 2010). Example code for one sample
(ERR1562083) below. This code computes the counts at each base of a
given intron, the sum of which, will be the total counts for the given
intron. It is this total counts that is represented in the matrix.
             Note for GRCh38.primary_assembly.genome_bedtools.txt, the
first column consists of the chromosome name (chr1, chr2, chr3…) and the
second column consists of the chromosome size or length. Additionally,
the BED file RI_Coordinates.bed contains the intron coordinates from
RI_featureData.txt
generated from rMATS in the previous
step.
             Example data and codes to prepare the BEDTools input
(intron coordinates) and to tabulate the intron counts for each cell and
then across all cells available here:
https://drive.google.com/file/d/13YMkDg_oE3fD7jYyQna2QFlDltxxxRYW/view?usp=sharing.
bedtools coverage \
-g GRCh38.primary_assembly.genome_bedtools.txt \
-split \
-sorted \
-a RI_Coordinates.bed \
-b ERR1562083.Aligned.sortedByCoord.out.bam > \
ERR1562083.txt \
-d
             Once the individual splice junction
count files have been generated, they should be collated and read into R
as follows:
path <- "Data/MARVEL/PSI/RI/"
file <- "Counts_by_Region.txt"
df.intron.counts <- as.data.frame(fread(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA"))
df.intron.counts[1:5,1:5]
## coord.intron ERR1562083 ERR1562084 ERR1562085 ERR1562086
## 1 chr1:13375:13452 0 0 0 0
## 2 chr1:146510:146641 0 150 0 129
## 3 chr1:498457:498683 678 0 0 0
## 4 chr1:764801:765143 5375 4544 6794 1591
## 5 chr1:804967:806385 794 0 0 0
Gene expression matrix
             The rows of this matrix represent
the gene IDs, the columns represent the sample IDs, and the values
represent the normalised gene expression counts (e.g., RPKM/FPKM/TPM),
but not yet log2-transformed.
             Here, gene expression was quantified using RSEM version
1.2.31 (Li et al., 2011). Example code for one sample
(ERR1562083) as follows. Here, the values returned are in transcript per
million (TPM) unit.
rsem-calculate-expression --bam \
--paired-end \
-p 8 \
ERR1562083.Aligned.toTranscriptome.out.bam \
GRCh38_GENCODE_genome_RSEM_indexed/gencode.v31 \
ERR1562083
             Once the individual gene expression
files have been generated, they should be collated and read into R as
follows:
path <- "Data/RSEM/"
file <- "TPM.txt"
df.tpm <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)
df.tpm[1:5,1:5]
## gene_id ERR1562083 ERR1562084 ERR1562085 ERR1562086
## 1 ENSG00000000003.14 343.26 163.45 210.43 190.46
## 2 ENSG00000000005.6 5.69 0.00 0.00 0.00
## 3 ENSG00000000419.12 288.02 155.26 42.49 238.67
## 4 ENSG00000000457.14 1.58 8.71 0.00 1.06
## 5 ENSG00000000460.17 41.23 28.53 100.17 66.43
Gene transfer file (GTF)
             For consistency, this GTF should be
the same file and version (here v31) previously used for indexing the
genome (for STAR aligner here), computing gene expression (for RSEM
here), and detecting splicing events (for rMATS here). This file was
downloaded from GENCODE webpage (
https://www.gencodegenes.org).
path <- "Data/GTF/"
file <- "gencode.v31.annotation.gtf"
gtf <- as.data.frame(fread(paste(path, file, sep=""), sep="\t", header=FALSE, stringsAsFactors=FALSE, na.strings="NA", quote="\""))