Installation

             Complete instructions for installation available on MARVEL’s Github page: https://github.com/wenweixiong/MARVEL. While MARVEL is available on both CRAN and Github, prospective users are advised to install the Github version of MARVEL for the latest features.

Introduction

             This tutorial demonstrates the application of MARVEL for integrated gene and splicing analysis of RNA-sequencing data generated from plate-based library preparation methods such as Smart-seq2. Only the most salient options of the functions will be elaborated here. For more details on the function’s options, please launch the help page with ?function_name.
             The dataset used to demonstrate the utility of MARVEL here includes induced pluripotent stem cells (iPSCs) and iPSC-induced endoderm cells (Linker et al., 2019). The processed input files required for this tutorial may be downloaded from here: http://datashare.molbiol.ox.ac.uk/public/project/meadlab/wwen/MARVEL/Tutorial/Plate/Data.zip

Load package

# Load MARVEL package
library(MARVEL)

# Load adjunct packages for selected MARVEL features
  # General data processing, plotting
  library(ggnewscale)
  library(ggrepel)
  library(parallel)
  library(reshape2)
  library(stringr)
  library(textclean)

  # Dimension reduction analysis
  library(factoextra)
  library(FactoMineR)

  # Modality analysis
  library(fitdistrplus)

  # Differential splicing analysis
  library(kSamples)
  library(twosamples)

  # Gene ontology analysis
  library(AnnotationDbi)
  library(clusterProfiler)
  library(org.Hs.eg.db)
  library(org.Mm.eg.db)

  # Nonsense-mediated decay (NMD) analysis
  library(Biostrings)
  library(BSgenome)
  library(BSgenome.Hsapiens.NCBI.GRCh38)

# Load adjunct packages for this tutorial
library(data.table)
library(ggplot2)
library(gridExtra)

Input files

             In this section, we will read in the required input files for MARVEL. The formatting requirement for each input file will be explained.

Sample metadata

             This is a tab-delimited file created by the user whereby the rows represent the sample (cell) IDs and columns represent the cell information such as cell type, donor ID etc.. Compulsory column is sample.id while all other columns are optional.
path <- "Data/SJ/"
file <- "SJ_phenoData.txt"
df.pheno <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA")

head(df.pheno)
##    sample.id cell.type sample.type qc.seq
## 1 ERR1562083   Unknown Single Cell   pass
## 2 ERR1562084      iPSC Single Cell   pass
## 3 ERR1562085      iPSC Single Cell   pass
## 4 ERR1562086      iPSC Single Cell   pass
## 5 ERR1562087      iPSC Single Cell   pass
## 6 ERR1562088      iPSC Single Cell   pass

Splice junction counts matrix

             The rows of this matrix represent the splice junction coordinates, the columns represent the sample IDs, and the values represent the splice junction counts. The first column should be named coord.intron.
             Here, the splice junction counts were quantified using the STAR aligner version 2.6.1d in 2-pass mode (Dobin et al., 2013). An example code for one sample (ERR1562083) below. Note a separate folder SJ is created here to contain the splice junction count files (SJ.out.tab) generated from 1st pass mode to be used for 2nd pass mode.
# STAR in 1st pass mode
STAR --runThreadN 16 \
     --genomeDir GRCh38_GENCODE_genome_STAR_indexed \
     --readFilesCommand zcat \
     --readFilesIn ERR1562083_1_val_1.fq.gz ERR1562083_2_val_2.fq.gz \
     --outFileNamePrefix SJ/ERR1562083. \
     --outSAMtype None

# STAR in 2nd pass mode
STAR --runThreadN 16 \
     --genomeDir GRCh38_GENCODE_genome_STAR_indexed \
     --readFilesCommand zcat \
     --readFilesIn ERR1562083_1_val_1.fq.gz ERR1562083_2_val_2.fq.gz \
     --outFileNamePrefix ERR1562083. \
     --sjdbFileChrStartEnd SJ/*SJ.out.tab \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes NH HI AS nM XS \
     --quantMode TranscriptomeSAM
             Once the individual splice junction count files have been generated, they should be collated and read into R as follows:
path <- "Data/SJ/"
file <- "SJ.txt"
sj <- as.data.frame(fread(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA"))

sj[!is.na(sj[,2]), ][1:5,1:5]
##                coord.intron ERR1562083 ERR1562084 ERR1562085 ERR1562086
## 9  chr1:100007082:100022621          0         NA         NA         NA
## 10 chr1:100007091:100024554          0         NA         NA         NA
## 18 chr1:100007157:100011364         12          3          5          9
## 23 chr1:100007601:100023781          1         NA         NA         NA
## 39 chr1:100011534:100015301         15         NA          8         12

Splicing event metadata

             The rows of this metadata represent the splicing events while the columns represent the splicing event information such as the transcript ID and the corresponding gene information. Compulsory columns are tran_id and gene_id.
             The splicing events here were detected using rMATS version 4.1.0 (Shen et al., 2014).Example code for running rMATS as follows.
             Note that any BAM files may be specified in --b1 and --b2. This is because rMATS requires these specification for statistical testing of splicing events between the two samples. But here, we will only be using the splicing events detected (fromGTF.SE.txt, fromGTF.MXE.txt, fromGTF.RI.txt, fromGTF.A5SS.txt, fromGTF.A3SS.txt), but not the statistical test results, from this step for our downstream analysis.
rmats \
    --b1 path_to_BAM_sample_1.txt \
    --b2 path_to_BAM_sample_2.txt \
    --gtf gencode.v31.annotation.gtf \
    --od rMATS/ \
    --tmp rMATS/ \
    -t paired \
    --readLength 125 \
    --variable-read-length \
    --nthread 8 \
    --statoff
             The exon and intron coordinates from the fromGTF.*.txt files would need to be converted to official splicing nomenclatures (tran_id) as per here: https://wenweixiong.github.io/Splicing_Nomenclature. Example data and codes to do this available here: https://drive.google.com/file/d/1ilhgUdRQYC2ee1fbA54ZdxbbtB8tlQ2m/view?usp=sharing.
             Once the individual splicing event files for SE, MXE, RI, A5S5, and A3SS have been generated, they may be read into R as follows:
# SE
path <- "Data/rMATS/SE/"
file <- "SE_featureData.txt"
df.feature.se <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA")

head(df.feature.se)
##                                                                          tran_id
## 1 chrX:156022314:156022459:+@chrX:156022699:156022834:+@chrX:156023012:156023209
## 2 chrX:154227192:154227360:+@chrX:154228828:154228993:+@chrX:154230548:154230598
## 3 chrX:154190054:154190222:+@chrX:154191688:154191853:+@chrX:154193408:154193458
## 4 chrX:154152940:154153108:+@chrX:154154574:154154739:+@chrX:154156294:154156344
## 5 chrX:152918983:152919081:+@chrX:152920328:152920411:+@chrX:152920707:152920748
## 6 chrX:152922173:152922256:+@chrX:152922720:152922809:+@chrX:152928575:152928661
##              gene_id gene_short_name                          gene_type
## 1 ENSG00000182484.15          WASH6P transcribed_unprocessed_pseudogene
## 2  ENSG00000166160.9         OPN1MW2                     protein_coding
## 3  ENSG00000268221.5          OPN1MW                     protein_coding
## 4  ENSG00000102076.9          OPN1LW                     protein_coding
## 5 ENSG00000147394.18          ZNF185                     protein_coding
## 6 ENSG00000147394.18          ZNF185                     protein_coding
# MXE
path <- "Data/rMATS/MXE/"
file <- "MXE_featureData.txt"
df.feature.mxe <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA")
    
# RI
path <- "Data/rMATS/RI/"
file <- "RI_featureData.txt"
df.feature.ri <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA")
    
# A5SS
path <- "Data/rMATS/A5SS/"
file <- "A5SS_featureData.txt"
df.feature.a5ss <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA")
        
# A3SS
path <- "Data/rMATS/A3SS/"
file <- "A3SS_featureData.txt"
df.feature.a3ss <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA")
    
# Merge
df.feature.list <- list(df.feature.se, df.feature.mxe, df.feature.ri, df.feature.a5ss, df.feature.a3ss)
names(df.feature.list) <- c("SE", "MXE", "RI", "A5SS", "A3SS")

Intron count matrix

             The rows of this matrix represent intron coordinates, the columns represent the sample IDs, and the values represent the total reads mapping to the intron. These values will be used to compute the percent spliced-in (PSI) values of retained introns (RI) splicing events downstream.
             Here, intron coverage was computed using Bedtools version 2.27.1 (Quinlan et al., 2010). Example code for one sample (ERR1562083) below. This code computes the counts at each base of a given intron, the sum of which, will be the total counts for the given intron. It is this total counts that is represented in the matrix.
             Note for GRCh38.primary_assembly.genome_bedtools.txt, the first column consists of the chromosome name (chr1, chr2, chr3…) and the second column consists of the chromosome size or length. Additionally, the BED file RI_Coordinates.bed contains the intron coordinates from RI_featureData.txt generated from rMATS in the previous step.
             Example data and codes to prepare the BEDTools input (intron coordinates) and to tabulate the intron counts for each cell and then across all cells available here: https://drive.google.com/file/d/13YMkDg_oE3fD7jYyQna2QFlDltxxxRYW/view?usp=sharing.
bedtools coverage \
               -g GRCh38.primary_assembly.genome_bedtools.txt \
               -split \
               -sorted \
               -a RI_Coordinates.bed \
               -b ERR1562083.Aligned.sortedByCoord.out.bam > \
                  ERR1562083.txt \
               -d
             Once the individual splice junction count files have been generated, they should be collated and read into R as follows:
path <- "Data/MARVEL/PSI/RI/"
file <- "Counts_by_Region.txt"
df.intron.counts <- as.data.frame(fread(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA"))

df.intron.counts[1:5,1:5]
##         coord.intron ERR1562083 ERR1562084 ERR1562085 ERR1562086
## 1   chr1:13375:13452          0          0          0          0
## 2 chr1:146510:146641          0        150          0        129
## 3 chr1:498457:498683        678          0          0          0
## 4 chr1:764801:765143       5375       4544       6794       1591
## 5 chr1:804967:806385        794          0          0          0

Gene expression matrix

             The rows of this matrix represent the gene IDs, the columns represent the sample IDs, and the values represent the normalised gene expression counts (e.g., RPKM/FPKM/TPM), but not yet log2-transformed.
             Here, gene expression was quantified using RSEM version 1.2.31 (Li et al., 2011). Example code for one sample (ERR1562083) as follows. Here, the values returned are in transcript per million (TPM) unit.
rsem-calculate-expression --bam \
                          --paired-end \
                          -p 8 \
                          ERR1562083.Aligned.toTranscriptome.out.bam \
                          GRCh38_GENCODE_genome_RSEM_indexed/gencode.v31 \
                          ERR1562083
             Once the individual gene expression files have been generated, they should be collated and read into R as follows:
path <- "Data/RSEM/"
file <- "TPM.txt"
df.tpm <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)

df.tpm[1:5,1:5]
##              gene_id ERR1562083 ERR1562084 ERR1562085 ERR1562086
## 1 ENSG00000000003.14     343.26     163.45     210.43     190.46
## 2  ENSG00000000005.6       5.69       0.00       0.00       0.00
## 3 ENSG00000000419.12     288.02     155.26      42.49     238.67
## 4 ENSG00000000457.14       1.58       8.71       0.00       1.06
## 5 ENSG00000000460.17      41.23      28.53     100.17      66.43

Gene metadata

             The rows of this metadata represent the gene IDs while the columns represent the gene information such as the abbreviated gene names and gene type. Compulsory columns are gene_id, gene_short_name, and gene_type. All other columns are optional.
             Here, the metadata information was parsed and retrieved from gencode.v31.annotation.gtf.
path <- "Data/RSEM/"
file <- "TPM_featureData.txt"
df.tpm.feature <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)

head(df.tpm.feature)
##              gene_id gene_short_name      gene_type
## 1 ENSG00000000003.14          TSPAN6 protein_coding
## 2  ENSG00000000005.6            TNMD protein_coding
## 3 ENSG00000000419.12            DPM1 protein_coding
## 4 ENSG00000000457.14           SCYL3 protein_coding
## 5 ENSG00000000460.17        C1orf112 protein_coding
## 6 ENSG00000000938.13             FGR protein_coding

Gene transfer file (GTF)

             For consistency, this GTF should be the same file and version (here v31) previously used for indexing the genome (for STAR aligner here), computing gene expression (for RSEM here), and detecting splicing events (for rMATS here). This file was downloaded from GENCODE webpage (https://www.gencodegenes.org).
path <- "Data/GTF/"
file <- "gencode.v31.annotation.gtf"
gtf <- as.data.frame(fread(paste(path, file, sep=""), sep="\t", header=FALSE, stringsAsFactors=FALSE, na.strings="NA", quote="\""))

Create MARVEL object

marvel <- CreateMarvelObject(SpliceJunction=sj,
                             SplicePheno=df.pheno,
                             SpliceFeature=df.feature.list,
                             IntronCounts=df.intron.counts,
                             GeneFeature=df.tpm.feature,
                             Exp=df.tpm,
                             GTF=gtf
                             )

Detect additional events

             Majority of splicing detection tools, such as rMATS used here, detects SE, MXE, RI, A5SS, and A3SS splicing events. MARVEL detects additional two splicing events, namely alternative first and last exons (AFE, ALE).