Installation

             Instructions for installation available on MARVEL’s Github page: https://github.com/wenweixiong/MARVEL

Introduction

             This tutorial demonstrates the application of MARVEL for integrated gene and splicing analysis of RNA-sequencing data generated from droplet-based library preparation methods such as 10x Genomics. Only the most salient options of the functions will be elaborated here. For more details on the function’s options, please launch the help page with ?function_name.
             The dataset used to demonstrate the utility of MARVEL here includes induced pluripotent stem cells (iPSCs) and iPSC-induced cardiomyocytes at day-2, -4, and -10 (Ou et al., 2021). The processed input files required for this tutorial may be downloaded from here: http://datashare.molbiol.ox.ac.uk/public/project/meadlab/wwen/MARVEL/Tutorial/Droplet/Data.zip

Load package

# Load MARVEL package
library(MARVEL)

# Load adjunct packages for selected MARVEL features
  # General data processing, plotting
  library(ggnewscale)
  library(ggrepel)
  library(reshape2)
  library(plyr)
  library(stringr)
  library(textclean)

  # Gene ontology analysis
  library(AnnotationDbi)
  library(clusterProfiler)
  library(org.Hs.eg.db)
  library(org.Mm.eg.db)

  # ad hoc gene candidate gene analysis
  library(gtools)

  # Visualising splice junction location
  library(GenomicRanges)
  library(IRanges)
  library(S4Vectors)
  library(wiggleplotr)

# Load adjunct packages for this tutorial
library(Matrix)
library(data.table)
library(ggplot2)
library(gridExtra)

Input files

             In this section, we will read in the required input files for MARVEL. The formatting requirement for each input file will be explained.

Preprocessing

             Alignment of sequencing reads was performed using cellranger v2.1.1. Example code for one sample (SRR9008752) below:
cellranger count --id=SRR9008752 \
                 --fastqs=SRR9008752 \
                 --sample=SRR9008752 \
                 --transcriptome=refdata-cellranger-GRCh38-3.0.0 \
                 --expect-cells 10000
             The resulting BAM files for each sample were used as inputs for STARsolo 2.7.8a (Kaminow et al., 2021) to generate the gene and splice junction count matrices required for downstream processing. Example code for one sample (SRR9008752) below. Please note the cell barcode list specified in --soloCBwhitelist should match the 10x Genomics chemistry used to generated the scRNAseq dataset (https://kb.10xgenomics.com/hc/en-us/articles/115004506263-What-is-a-barcode-whitelist-).
STAR --runThreadN 16 \
     --genomeDir refdata-cellranger-GRCh38-3.0.0 \
     --soloType CB_UMI_Simple \
     --readFilesIn SRR9008752_possorted_genome_bam.bam \
     --readFilesCommand samtools view -F 0x100 \
     --readFilesType SAM SE \
     --soloInputSAMattrBarcodeSeq CR UR \
     --soloInputSAMattrBarcodeQual CY UY \
     --soloCBwhitelist 737K-august-2016.txt \
     --soloFeatures Gene SJ
             The raw gene count outputs (matrix.mtx, features.tsv, barcodes.tsv) may be found in the Solo.out/Gene/filtered/ folder whereas the raw splice junction counts may be found in theSolo.out/SJ/raw/ folder. The raw gene counts outputs will be used for the Normalised gene expression and Gene counts sections below whereas the raw splice junction counts will be used for the Splice junction counts section.

Normalised gene expression

             Gene expression normalisation and batch correction etc. should be performed prior to MARVEL.
             The outputs of STARsolo (matrix.mtx.gz, features.tsv.gz, barcodes.tsv.gz) were used as inputs for SingCellaR (Wang et al., 2022). For each sample, only cells passing QC were retained and gene expression normalisation was also performed. All 4 samples (iPSCs and iPSC-induced cardiomyocytes at day-2, -4, and -10) were subsequently integrated and the gene expression (sparse) matrix, cell metadata, and gene metadata were saved as separated files. These files were used as inputs for MARVEL in R.
             More details on data processing using SingCellaR on STAR Protocol (https://star-protocols.cell.com/protocols/1530).
             The matrix rows represent genes, the columns represent cells, and the values represent normalised gene expression values that have not yet been log2-transformed.
             The cell metadata rows represent the cell IDs while the columns represent the cell information. Compulsory column is cell.id, all other columns are optional. The cell IDs here should match the columns of the gene expression matrix.
             The gene metadata rows represent the abbreviated gene names. Compulsory column is gene_short_name, all other columns are optional. The gene names here should match the rows of the gene expression matrix.
# Matrix
path <- "Data/Gene_SingCellaR/"
file <- "matrix_normalised.mtx"
df.gene.norm <- readMM(paste(path, file, sep=""))

df.gene.norm[df.gene.norm[,1] !=0,][1:5,1:5]
## 5 x 5 sparse Matrix of class "dgTMatrix"
##                                                      
## [1,] 2.1168501 .        0.9731413 0.4139587 0.9874593
## [2,] 0.7056167 .        .         .         .        
## [3,] 2.1168501 3.453436 1.9462826 3.3116695 1.9749185
## [4,] 1.4112334 .        0.4865707 0.4139587 1.4811889
## [5,] 4.2337003 2.302291 4.3791359 3.3116695 1.9749185
# cell metadata
path <- "Data/Gene_SingCellaR/"
file <- "phenoData.txt"
df.gene.norm.pheno <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)

head(df.gene.norm.pheno)
##                       cell.id   donor.id    cell.type
## 1 AAACCTGAGCCGGTAA_SRR9008752 SRR9008752 Cardio day 4
## 2 AAACCTGAGTACCGGA_SRR9008752 SRR9008752 Cardio day 4
## 3 AAACCTGAGTGCTGCC_SRR9008752 SRR9008752 Cardio day 4
## 4 AAACCTGAGTGTACGG_SRR9008752 SRR9008752 Cardio day 4
## 5 AAACCTGAGTGTCCAT_SRR9008752 SRR9008752 Cardio day 4
## 6 AAACCTGCAGCCTGTG_SRR9008752 SRR9008752 Cardio day 4
# gene metadata
path <- "Data/Gene_SingCellaR/"
file <- "featureData.txt"
df.gene.norm.feature <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)

head(df.gene.norm.feature)
##   gene_short_name
## 1     MIR1302-2HG
## 2         FAM138A
## 3           OR4F5
## 4      AL627309.1
## 5      AL627309.3
## 6      AL627309.2

Gene counts

              Raw gene counts here will be used alongside raw splice junction counts (below) for splicing analysis.
              Each sample’s raw gene count files generated by STARsolo should be collated and read into R as follows.
              The matrix rows represent genes, the columns represent cells, and the values represent UMI-collapsed, non-normalised, non-log2-transformed gene counts.
             The cell metadata rows represent the cell IDs while the columns represent the cell information. Compulsory column is cell.id, all other columns are optional. The cell IDs here should match the columns of the gene expression matrix. The cell IDs should also match the cell metadata and matrix of the normalised gene expression data df.gene.norm.pheno above.
             The gene metadata rows represent the abbreviated gene names. Compulsory column is gene_short_name, all other columns are optional. The gene names here should match the rows of the gene expression matrix. The gene names should also match the gene metadata and matrix of the normalised gene expression data df.gene.norm.feature above.
# Matrix
path <- "Data/Gene_STARsolo/"
file <- "matrix_counts.mtx"
df.gene.count <- readMM(paste(path, file, sep=""))

df.gene.count[df.gene.count[,1] !=0,][1:5,1:5]
## 5 x 5 sparse Matrix of class "dgTMatrix"
##                  
## [1,] 1  2 1  3  2
## [2,] 1  2 .  .  .
## [3,] 1  1 .  .  .
## [4,] 5  6 8 16  8
## [5,] 3 12 9 10 13
# cell metadata
path <- "Data/Gene_STARsolo/"
file <- "phenoData.txt"
df.gene.count.pheno <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)

head(df.gene.count.pheno)
##                       cell.id
## 1 AAACCTGAGAAGATTC_SRR9008754
## 2 AAACCTGAGAAGGGTA_SRR9008754
## 3 AAACCTGAGCACACAG_SRR9008754
## 4 AAACCTGAGGGCTTGA_SRR9008754
## 5 AAACCTGAGTTGTCGT_SRR9008754
## 6 AAACCTGCAAGTCTGT_SRR9008754
# feature metadata
path <- "Data/Gene_STARsolo/"
file <- "featureData.txt"
df.gene.count.feature <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)

head(df.gene.count.feature)
##   gene_short_name
## 1     MIR1302-2HG
## 2         FAM138A
## 3           OR4F5
## 4      AL627309.1
## 5      AL627309.3
## 6      AL627309.2

Splice junction counts

              Raw splice junctions counts here will be used alongside raw gene counts (above) for splicing analysis.
              Each sample’s raw splice junction count files generated by STARsolo should be collated and read into R as follows.
              The matrix rows represent splice junction coordinates, the columns represent cells, and the values represent UMI-collapsed, non-normalised, non-log2-transformed splice junction counts.
             The cell metadata rows represent the cell IDs while the columns represent the cell information. Compulsory column is cell.id, all other columns are optional. The cell IDs here should match the columns of the splice junction count matrix. The cell IDs should also match the cell metadata and matrix of the normalised gene expression data df.gene.norm.pheno above.
             The gene metadata rows represent the splice junction coordinates. Compulsory column is coord.intron, all other columns are optional. The splice junction coordinates here should match the rows of the splice junction count matrix.
# Matrix
path <- "Data/SJ_STARsolo/"
file <- "matrix_counts.mtx"
df.sj.count <- readMM(paste(path, file, sep=""))

df.sj.count[df.sj.count[,1] !=0,][1:5,1:5]
## 5 x 5 sparse Matrix of class "dgTMatrix"
##                
## [1,] 5 4 7 14 6
## [2,] 1 1 2  1 1
## [3,] 1 . .  . .
## [4,] 1 1 .  . .
## [5,] 1 . .  . .
# cell metadata
path <- "Data/SJ_STARsolo/"
file <- "phenoData.txt"
df.sj.count.pheno <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)

head(df.sj.count.pheno)
##                       cell.id
## 1 AAACCTGAGAAGATTC_SRR9008754
## 2 AAACCTGAGAAGGGTA_SRR9008754
## 3 AAACCTGAGCACACAG_SRR9008754
## 4 AAACCTGAGGGCTTGA_SRR9008754
## 5 AAACCTGAGTTGTCGT_SRR9008754
## 6 AAACCTGCAAGTCTGT_SRR9008754
# feature metadata
path <- "Data/SJ_STARsolo/"
file <- "featureData.txt"
df.sj.count.feature <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)

head(df.sj.count.feature)
##        coord.intron
## 1  chr1:14830:14969
## 2 chr1:14830:185490
## 3 chr1:15039:186316
## 4  chr1:16766:16853
## 5  chr1:16766:16857
## 6  chr1:16766:16875

tSNE/UMAP coordinates

             Dimension reduction embeddings would have to be performed prior to MARVEL. These embeddings will be used to project gene expression and splice junction expression values by MARVEL. Compulsory columns are cell.id, x (x-axis coordinates) and y (y-axis coordinates). All other columns are optional.
             Here, the tSNE embeddings were generated by SingCellaR (Wang et al., 2022).
path <- "Data/Gene_SingCellaR/"
file <- "dim_red_coordinates_iPSC_CardioDay10.txt"
df.coord <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)

head(df.coord)
##                       cell.id         x          y
## 1 AAACCTGAGAAGATTC_SRR9008754  15.06473 -12.685700
## 2 AAACCTGAGAAGGGTA_SRR9008754  23.00697  10.624021
## 3 AAACCTGAGCACACAG_SRR9008754  12.97456 -18.985449
## 4 AAACCTGAGCTTTGGT_SRR9008753 -20.56841  -1.531026
## 5 AAACCTGAGGGCTTGA_SRR9008754  36.47875  -6.242368
## 6 AAACCTGAGTCGTTTG_SRR9008753 -24.82610 -18.792021

Gene transfer file

             For consistency, this GTF should be the same file previously used for cellranger and STARsolo above.
path <- "Data/GTF/"
file <- "refdata-cellranger-GRCh38-3.0.0.gtf"
gtf <- as.data.frame(fread(paste(path, file, sep=""), sep="\t", header=FALSE, stringsAsFactors=FALSE))

Create MARVEL object

marvel <- CreateMarvelObject.10x(gene.norm.matrix=df.gene.norm,
                                 gene.norm.pheno=df.gene.norm.pheno,
                                 gene.norm.feature=df.gene.norm.feature,
                                 gene.count.matrix=df.gene.count,
                                 gene.count.pheno=df.gene.count.pheno,
                                 gene.count.feature=df.gene.count.feature,
                                 sj.count.matrix=df.sj.count,
                                 sj.count.pheno=df.sj.count.pheno,
                                 sj.count.feature=df.sj.count.feature,
                                 pca=df.coord,
                                 gtf=gtf
                                 )

Pre-process data

             This step annotates the gene and splice junction metadata, and subsequently enables us to retain only high-quality splice junctions and genes of particular biotype, e.g., protein-coding genes.

Annotate gene metadata

             This step annotates the genes with their corresponding gene biotype, e.g., protein-coding, lincRNA, pseudogene etc.
marvel <- AnnotateGenes.10x(MarvelObject=marvel)

head(marvel$gene.metadata)
##   gene_short_name      gene_type
## 1     MIR1302-2HG        lincRNA
## 2         FAM138A        lincRNA
## 3           OR4F5 protein_coding
## 4      AL627309.1        lincRNA
## 5      AL627309.3        lincRNA
## 6      AL627309.2      antisense
table(marvel$gene.metadata$gene_type)
## 
##       antisense       IG_C_gene IG_C_pseudogene       IG_D_gene       IG_J_gene 
##            5494              14               9              37              18 
## IG_J_pseudogene       IG_V_gene IG_V_pseudogene         lincRNA  protein_coding 
##               3             144             188            7482           19893 
##       TR_C_gene       TR_D_gene       TR_J_gene TR_J_pseudogene       TR_V_gene 
##               6               4              79               4             106 
## TR_V_pseudogene 
##              33
             This step creates and then annotates the splice junction metadata. For each splice junction, the start and end exons are annotated with their corresponding gene names.

Annotate junction metadata

marvel <- AnnotateSJ.10x(MarvelObject=marvel)

head(marvel$sj.metadata)
##        coord.intron gene_short_name.start gene_short_name.end
## 1  chr1:14830:14969                  <NA>                <NA>
## 2 chr1:14830:185490                  <NA>                <NA>
## 3 chr1:15039:186316                  <NA>                <NA>
## 4  chr1:16766:16853                  <NA>                <NA>
## 5  chr1:16766:16857                  <NA>                <NA>
## 6  chr1:16766:16875                  <NA>                <NA>
##                               sj.type
## 1 start_unknown.gene|end_unknown.gene
## 2 start_unknown.gene|end_unknown.gene
## 3 start_unknown.gene|end_unknown.gene
## 4 start_unknown.gene|end_unknown.gene
## 5 start_unknown.gene|end_unknown.gene
## 6 start_unknown.gene|end_unknown.gene

Validate junctions

             Only splice junctions whose start and end mapped to same gene are retained.
marvel <- ValidateSJ.10x(MarvelObject=marvel)

Subset CDS genes

             Moving forward, we will only retain genes with coding sequence (CDS), i.e., protein-coding genes.
marvel <- FilterGenes.10x(MarvelObject=marvel,
                          gene.type="protein_coding"
                          )

Pre-flight check

             This step ensures that our data is ready for further downstream analysis including differential expression analysis, functional annotation, and candidate gene analysis.
             To this end, we will have to make sure the columns of the matrices align with the cell IDs of the sample metadata and the rows of the matrices align with the feature metadata. Finally, the columns across all matrices should align with one another.
marvel <- CheckAlignment.10x(MarvelObject=marvel)
             Our data is ready for downstream analysis when only MATCHED flags are reported. If any NOT MATCHED flags are reported, please double-check the input file requirements.

Save R object

             Given the time taken for data pre-processing, it’ll be a good idea to save the MARVEL object at this step. The object may be loaded whenever the user begins any one of the downstream analysis.
# Save object
path <- "Data/R object/"
file <- "MARVEL.RData"
save(marvel, file=paste(path, file, sep=""))

# Load object
# path <- "Data/R Object/"
# file <- "MARVEL.RData"
# load(paste(path, file, sep=""))

Explore expression

             Due to the huge number of splice junctions detected, it will not be time-effective to include all splice junctions for differential splicing analysis. Therefore, we will only include splice junctions expressed in at least a user-defined proportion of cells for differential splicing analysis. To determine this threshold, we will explore the number of cells expressing a given gene and splice junction.
             We will explore the gene and splice junction expression rate in iPSCs and day-10 cardiomyocytes as we will be performing differential splicing analysis on these two cell populations.

Gene expression rate

# Define cell groups
    # Retrieve sample metadata
    sample.metadata <- marvel$sample.metadata
    
    # Group 1 (reference)
    index <- which(sample.metadata$cell.type=="iPSC")
    cell.ids.1 <- sample.metadata[index, "cell.id"]
    length(cell.ids.1)
## [1] 11244
    # Group 2
    index <- which(sample.metadata$cell.type=="Cardio day 10")
    cell.ids.2 <- sample.metadata[index, "cell.id"]
    length(cell.ids.2)
## [1] 5937
# Explore % of cells expressing genes
marvel <- PlotPctExprCells.Genes.10x(MarvelObject=marvel,
                                     cell.group.g1=cell.ids.1,
                                     cell.group.g2=cell.ids.2,
                                     min.pct.cells=5
                                     )

marvel$pct.cells.expr$Gene$Plot