Installation

Instructions for installation available on MARVEL’s Github page: https://github.com/wenweixiong/MARVEL

Introduction

This tutorial demonstrates the application of MARVEL for integrated gene and splicing analysis of RNA-sequencing data generated from droplet-based library preparation methods such as 10x Genomics. Only the most salient options of the functions will be elaborated here. For more details on the function’s options, please launch the help page with ?function_name.
The dataset used to demonstrate the utility of MARVEL here includes induced pluripotent stem cells (iPSCs) and iPSC-induced cardiomyocytes at day-2, -4, and -10 (Ou et al., 2021). The processed input files required for this tutorial may be downloaded from here: http://datashare.molbiol.ox.ac.uk/public/project/meadlab/wwen/MARVEL/Tutorial/Droplet/Data.zip

Load package

# Load MARVEL package
library(MARVEL)

# Load adjunct packages for selected MARVEL features
  # General data processing, plotting
  library(ggnewscale)
  library(ggrepel)
  library(reshape2)
  library(plyr)
  library(stringr)
  library(textclean)

  # Gene ontology analysis
  library(AnnotationDbi)
  library(clusterProfiler)
  library(org.Hs.eg.db)
  library(org.Mm.eg.db)

  # ad hoc gene candidate gene analysis
  library(gtools)

  # Visualising splice junction location
  library(GenomicRanges)
  library(IRanges)
  library(S4Vectors)
  library(wiggleplotr)

# Load adjunct packages for this tutorial
library(Matrix)
library(data.table)
library(ggplot2)
library(gridExtra)

Input files

In this section, we will read in the required input files for MARVEL. The formatting requirement for each input file will be explained.

Preprocessing

Alignment of sequencing reads was performed using cellranger v2.1.1. Example code for one sample (SRR9008752) below:

cellranger count --id=SRR9008752 \
                 --fastqs=SRR9008752 \
                 --sample=SRR9008752 \
                 --transcriptome=refdata-cellranger-GRCh38-3.0.0 \
                 --expect-cells 10000

The resulting BAM files for each sample were used as inputs for STARsolo 2.7.8a (Kaminow et al., 2021) to generate the gene and splice junction count matrices required for downstream processing. Example code for one sample (SRR9008752) below. Please note the cell barcode list specified in --soloCBwhitelist should match the 10x Genomics chemistry used to generated the scRNAseq dataset (https://kb.10xgenomics.com/hc/en-us/articles/115004506263-What-is-a-barcode-whitelist-).

STAR --runThreadN 16 \
     --genomeDir refdata-cellranger-GRCh38-3.0.0 \
     --soloType CB_UMI_Simple \
     --readFilesIn SRR9008752_possorted_genome_bam.bam \
     --readFilesCommand samtools view -F 0x100 \
     --readFilesType SAM SE \
     --soloInputSAMattrBarcodeSeq CR UR \
     --soloInputSAMattrBarcodeQual CY UY \
     --soloCBwhitelist 737K-august-2016.txt \
     --soloFeatures Gene SJ

The raw gene count outputs (matrix.mtx, features.tsv, barcodes.tsv) may be found in the Solo.out/Gene/filtered/ folder whereas the raw splice junction counts may be found in theSolo.out/SJ/raw/ folder. The raw gene counts outputs will be used for the Normalised gene expression and Gene counts sections below whereas the raw splice junction counts will be used for the Splice junction counts section.

Normalised gene expression

             Gene expression normalisation and batch correction etc. should be performed prior to MARVEL.
             The outputs of STARsolo (matrix.mtx.gz, features.tsv.gz, barcodes.tsv.gz) were used as inputs for SingCellaR (Wang et al., 2022). For each sample, only cells passing QC were retained and gene expression normalisation was also performed. All 4 samples (iPSCs and iPSC-induced cardiomyocytes at day-2, -4, and -10) were subsequently integrated and the gene expression (sparse) matrix, cell metadata, and gene metadata were saved as separated files. These files were used as inputs for MARVEL in R.
             More details on data processing using SingCellaR on STAR Protocol (https://star-protocols.cell.com/protocols/1530).
             The matrix rows represent genes, the columns represent cells, and the values represent normalised gene expression values that have not yet been log2-transformed.
             The cell metadata rows represent the cell IDs while the columns represent the cell information. Compulsory column is cell.id, all other columns are optional. The cell IDs here should match the columns of the gene expression matrix.
             The gene metadata rows represent the abbreviated gene names. Compulsory column is gene_short_name, all other columns are optional. The gene names here should match the rows of the gene expression matrix.

# Matrix
path <- "Data/Gene_SingCellaR/"
file <- "matrix_normalised.mtx"
df.gene.norm <- readMM(paste(path, file, sep=""))

df.gene.norm[df.gene.norm[,1] !=0,][1:5,1:5]

## 5 x 5 sparse Matrix of class "dgTMatrix"
##                                                      
## [1,] 2.1168501 .        0.9731413 0.4139587 0.9874593
## [2,] 0.7056167 .        .         .         .        
## [3,] 2.1168501 3.453436 1.9462826 3.3116695 1.9749185
## [4,] 1.4112334 .        0.4865707 0.4139587 1.4811889
## [5,] 4.2337003 2.302291 4.3791359 3.3116695 1.9749185

# cell metadata
path <- "Data/Gene_SingCellaR/"
file <- "phenoData.txt"
df.gene.norm.pheno <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)

head(df.gene.norm.pheno)

##                       cell.id   donor.id    cell.type
## 1 AAACCTGAGCCGGTAA_SRR9008752 SRR9008752 Cardio day 4
## 2 AAACCTGAGTACCGGA_SRR9008752 SRR9008752 Cardio day 4
## 3 AAACCTGAGTGCTGCC_SRR9008752 SRR9008752 Cardio day 4
## 4 AAACCTGAGTGTACGG_SRR9008752 SRR9008752 Cardio day 4
## 5 AAACCTGAGTGTCCAT_SRR9008752 SRR9008752 Cardio day 4
## 6 AAACCTGCAGCCTGTG_SRR9008752 SRR9008752 Cardio day 4

# gene metadata
path <- "Data/Gene_SingCellaR/"
file <- "featureData.txt"
df.gene.norm.feature <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)

head(df.gene.norm.feature)

##   gene_short_name
## 1     MIR1302-2HG
## 2         FAM138A
## 3           OR4F5
## 4      AL627309.1
## 5      AL627309.3
## 6      AL627309.2

Gene counts

              Raw gene counts here will be used alongside raw splice junction counts (below) for splicing analysis.
              Each sample’s raw gene count files generated by STARsolo should be collated and read into R as follows.
              The matrix rows represent genes, the columns represent cells, and the values represent UMI-collapsed, non-normalised, non-log2-transformed gene counts.
             The cell metadata rows represent the cell IDs while the columns represent the cell information. Compulsory column is cell.id, all other columns are optional. The cell IDs here should match the columns of the gene expression matrix. The cell IDs should also match the cell metadata and matrix of the normalised gene expression data df.gene.norm.pheno above.
             The gene metadata rows represent the abbreviated gene names. Compulsory column is gene_short_name, all other columns are optional. The gene names here should match the rows of the gene expression matrix. The gene names should also match the gene metadata and matrix of the normalised gene expression data df.gene.norm.feature above.

# Matrix
path <- "Data/Gene_STARsolo/"
file <- "matrix_counts.mtx"
df.gene.count <- readMM(paste(path, file, sep=""))

df.gene.count[df.gene.count[,1] !=0,][1:5,1:5]

## 5 x 5 sparse Matrix of class "dgTMatrix"
##                  
## [1,] 1  2 1  3  2
## [2,] 1  2 .  .  .
## [3,] 1  1 .  .  .
## [4,] 5  6 8 16  8
## [5,] 3 12 9 10 13

# cell metadata
path <- "Data/Gene_STARsolo/"
file <- "phenoData.txt"
df.gene.count.pheno <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)

head(df.gene.count.pheno)

##                       cell.id
## 1 AAACCTGAGAAGATTC_SRR9008754
## 2 AAACCTGAGAAGGGTA_SRR9008754
## 3 AAACCTGAGCACACAG_SRR9008754
## 4 AAACCTGAGGGCTTGA_SRR9008754
## 5 AAACCTGAGTTGTCGT_SRR9008754
## 6 AAACCTGCAAGTCTGT_SRR9008754

# feature metadata
path <- "Data/Gene_STARsolo/"
file <- "featureData.txt"
df.gene.count.feature <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)

head(df.gene.count.feature)

##   gene_short_name
## 1     MIR1302-2HG
## 2         FAM138A
## 3           OR4F5
## 4      AL627309.1
## 5      AL627309.3
## 6      AL627309.2

Splice junction counts

              Raw splice junctions counts here will be used alongside raw gene counts (above) for splicing analysis.
              Each sample’s raw splice junction count files generated by STARsolo should be collated and read into R as follows.
              The matrix rows represent splice junction coordinates, the columns represent cells, and the values represent UMI-collapsed, non-normalised, non-log2-transformed splice junction counts.
             The cell metadata rows represent the cell IDs while the columns represent the cell information. Compulsory column is cell.id, all other columns are optional. The cell IDs here should match the columns of the splice junction count matrix. The cell IDs should also match the cell metadata and matrix of the normalised gene expression data df.gene.norm.pheno above.
             The gene metadata rows represent the splice junction coordinates. Compulsory column is coord.intron, all other columns are optional. The splice junction coordinates here should match the rows of the splice junction count matrix.

# Matrix
path <- "Data/SJ_STARsolo/"
file <- "matrix_counts.mtx"
df.sj.count <- readMM(paste(path, file, sep=""))

df.sj.count[df.sj.count[,1] !=0,][1:5,1:5]

## 5 x 5 sparse Matrix of class "dgTMatrix"
##                
## [1,] 5 4 7 14 6
## [2,] 1 1 2  1 1
## [3,] 1 . .  . .
## [4,] 1 1 .  . .
## [5,] 1 . .  . .

# cell metadata
path <- "Data/SJ_STARsolo/"
file <- "phenoData.txt"
df.sj.count.pheno <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)

head(df.sj.count.pheno)

##                       cell.id
## 1 AAACCTGAGAAGATTC_SRR9008754
## 2 AAACCTGAGAAGGGTA_SRR9008754
## 3 AAACCTGAGCACACAG_SRR9008754
## 4 AAACCTGAGGGCTTGA_SRR9008754
## 5 AAACCTGAGTTGTCGT_SRR9008754
## 6 AAACCTGCAAGTCTGT_SRR9008754

# feature metadata
path <- "Data/SJ_STARsolo/"
file <- "featureData.txt"
df.sj.count.feature <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)

head(df.sj.count.feature)

##        coord.intron
## 1  chr1:14830:14969
## 2 chr1:14830:185490
## 3 chr1:15039:186316
## 4  chr1:16766:16853
## 5  chr1:16766:16857
## 6  chr1:16766:16875

tSNE/UMAP coordinates

Dimension reduction embeddings would have to be performed prior to MARVEL. These embeddings will be used to project gene expression and splice junction expression values by MARVEL. Compulsory columns are cell.id, x (x-axis coordinates) and y (y-axis coordinates). All other columns are optional.
Here, the tSNE embeddings were generated by SingCellaR (Wang et al., 2022).

path <- "Data/Gene_SingCellaR/"
file <- "dim_red_coordinates_iPSC_CardioDay10.txt"
df.coord <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)

head(df.coord)

##                       cell.id         x          y
## 1 AAACCTGAGAAGATTC_SRR9008754  15.06473 -12.685700
## 2 AAACCTGAGAAGGGTA_SRR9008754  23.00697  10.624021
## 3 AAACCTGAGCACACAG_SRR9008754  12.97456 -18.985449
## 4 AAACCTGAGCTTTGGT_SRR9008753 -20.56841  -1.531026
## 5 AAACCTGAGGGCTTGA_SRR9008754  36.47875  -6.242368
## 6 AAACCTGAGTCGTTTG_SRR9008753 -24.82610 -18.792021

Gene transfer file

For consistency, this GTF should be the same file previously used for cellranger and STARsolo above.

path <- "Data/GTF/"
file <- "refdata-cellranger-GRCh38-3.0.0.gtf"
gtf <- as.data.frame(fread(paste(path, file, sep=""), sep="\t", header=FALSE, stringsAsFactors=FALSE))

Create MARVEL object

marvel <- CreateMarvelObject.10x(gene.norm.matrix=df.gene.norm,
                                 gene.norm.pheno=df.gene.norm.pheno,
                                 gene.norm.feature=df.gene.norm.feature,
                                 gene.count.matrix=df.gene.count,
                                 gene.count.pheno=df.gene.count.pheno,
                                 gene.count.feature=df.gene.count.feature,
                                 sj.count.matrix=df.sj.count,
                                 sj.count.pheno=df.sj.count.pheno,
                                 sj.count.feature=df.sj.count.feature,
                                 pca=df.coord,
                                 gtf=gtf
                                 )

Pre-process data

This step annotates the gene and splice junction metadata, and subsequently enables us to retain only high-quality splice junctions and genes of particular biotype, e.g., protein-coding genes.

Annotate gene metadata

This step annotates the genes with their corresponding gene biotype, e.g., protein-coding, lincRNA, pseudogene etc.

marvel <- AnnotateGenes.10x(MarvelObject=marvel)

head(marvel$gene.metadata)

##   gene_short_name      gene_type
## 1     MIR1302-2HG        lincRNA
## 2         FAM138A        lincRNA
## 3           OR4F5 protein_coding
## 4      AL627309.1        lincRNA
## 5      AL627309.3        lincRNA
## 6      AL627309.2      antisense

table(marvel$gene.metadata$gene_type)

## 
##       antisense       IG_C_gene IG_C_pseudogene       IG_D_gene       IG_J_gene 
##            5494              14               9              37              18 
## IG_J_pseudogene       IG_V_gene IG_V_pseudogene         lincRNA  protein_coding 
##               3             144             188            7482           19893 
##       TR_C_gene       TR_D_gene       TR_J_gene TR_J_pseudogene       TR_V_gene 
##               6               4              79               4             106 
## TR_V_pseudogene 
##              33

This step creates and then annotates the splice junction metadata. For each splice junction, the start and end exons are annotated with their corresponding gene names.

Annotate junction metadata

marvel <- AnnotateSJ.10x(MarvelObject=marvel)

head(marvel$sj.metadata)

##        coord.intron gene_short_name.start gene_short_name.end
## 1  chr1:14830:14969                  <NA>                <NA>
## 2 chr1:14830:185490                  <NA>                <NA>
## 3 chr1:15039:186316                  <NA>                <NA>
## 4  chr1:16766:16853                  <NA>                <NA>
## 5  chr1:16766:16857                  <NA>                <NA>
## 6  chr1:16766:16875                  <NA>                <NA>
##                               sj.type
## 1 start_unknown.gene|end_unknown.gene
## 2 start_unknown.gene|end_unknown.gene
## 3 start_unknown.gene|end_unknown.gene
## 4 start_unknown.gene|end_unknown.gene
## 5 start_unknown.gene|end_unknown.gene
## 6 start_unknown.gene|end_unknown.gene

Validate junctions

Only splice junctions whose start and end mapped to same gene are retained.

marvel <- ValidateSJ.10x(MarvelObject=marvel)

Subset CDS genes

Moving forward, we will only retain genes with coding sequence (CDS), i.e., protein-coding genes.

marvel <- FilterGenes.10x(MarvelObject=marvel,
                          gene.type="protein_coding"
                          )

Pre-flight check

This step ensures that our data is ready for further downstream analysis including differential expression analysis, functional annotation, and candidate gene analysis.
To this end, we will have to make sure the columns of the matrices align with the cell IDs of the sample metadata and the rows of the matrices align with the feature metadata. Finally, the columns across all matrices should align with one another.

marvel <- CheckAlignment.10x(MarvelObject=marvel)

Our data is ready for downstream analysis when only MATCHED flags are reported. If any NOT MATCHED flags are reported, please double-check the input file requirements.

Save R object

Given the time taken for data pre-processing, it’ll be a good idea to save the MARVEL object at this step. The object may be loaded whenever the user begins any one of the downstream analysis.

# Save object
path <- "Data/R object/"
file <- "MARVEL.RData"
save(marvel, file=paste(path, file, sep=""))

# Load object
# path <- "Data/R Object/"
# file <- "MARVEL.RData"
# load(paste(path, file, sep=""))

Explore expression

Due to the huge number of splice junctions detected, it will not be time-effective to include all splice junctions for differential splicing analysis. Therefore, we will only include splice junctions expressed in at least a user-defined proportion of cells for differential splicing analysis. To determine this threshold, we will explore the number of cells expressing a given gene and splice junction.
We will explore the gene and splice junction expression rate in iPSCs and day-10 cardiomyocytes as we will be performing differential splicing analysis on these two cell populations.

Gene expression rate

# Define cell groups
    # Retrieve sample metadata
    sample.metadata <- marvel$sample.metadata
    
    # Group 1 (reference)
    index <- which(sample.metadata$cell.type=="iPSC")
    cell.ids.1 <- sample.metadata[index, "cell.id"]
    length(cell.ids.1)

## [1] 11244

    # Group 2
    index <- which(sample.metadata$cell.type=="Cardio day 10")
    cell.ids.2 <- sample.metadata[index, "cell.id"]
    length(cell.ids.2)

## [1] 5937

# Explore % of cells expressing genes
marvel <- PlotPctExprCells.Genes.10x(MarvelObject=marvel,
                                     cell.group.g1=cell.ids.1,
                                     cell.group.g2=cell.ids.2,
                                     min.pct.cells=5
                                     )

marvel$pct.cells.expr$Gene$Plot

head(marvel$pct.cells.expr$Gene$Data)

##       cell.group gene_short_name n.cells.total n.cells.expr pct.cells.expr
## 2  cell.group.g1           NOC2L         11244         5590          49.72
## 5  cell.group.g1            HES4         11244          687           6.11
## 6  cell.group.g1           ISG15         11244         7172          63.79
## 7  cell.group.g1            AGRN         11244         2381          21.18
## 12 cell.group.g1            SDF4         11244         4440          39.49
## 13 cell.group.g1        C1QTNF12         11244          565           5.02

min.pct.cells option. The minimum percentage of cells expressing a given gene for that gene to be included for expression exploration analysis here.
Here, we observe majority of genes to be expressed in ~10% of cells.

SJ expression rate

marvel <- PlotPctExprCells.SJ.10x(MarvelObject=marvel,
                                  cell.group.g1=cell.ids.1,
                                  cell.group.g2=cell.ids.2,
                                  min.pct.cells.genes=5,
                                  min.pct.cells.sj=5,
                                  downsample=TRUE,
                                  downsample.pct.sj=10
                                  )

marvel$pct.cells.expr$SJ$Plot

head(marvel$pct.cells.expr$SJ$Data)

##       cell.group         coord.intron n.cells.total n.cells.expr pct.cells.expr
## 25 cell.group.g1 chr1:1629705:1630291         11244         1227          10.91
## 26 cell.group.g1 chr1:1634329:1634450         11244          832           7.40
## 32 cell.group.g1 chr1:1721712:1722707         11244          867           7.71
## 42 cell.group.g1 chr1:2400936:2402206         11244         1066           9.48
## 47 cell.group.g1 chr1:3775484:3775794         11244         1467          13.05
## 79 cell.group.g1 chr1:8868058:8870451         11244          674           5.99

min.pct.cells.genes option. The minimum percentage of cells expressing a given gene for that gene to be included for expression exploration analysis here. This threshold may be inferred from after running the PlotPctExprCells.Genes.10x function earlier.
min.pct.cells.sj option. The minimum percentage of cells expressing a given splice junction for that splice junction to be included for expression exploration analysis.
downsample option. Down-sample the number of splice junctions prior to expression exploration analysis.
downsample.pct.sj option. When downsample set to TRUE, the percentage of splice junctions to keep for expression exploration analysis.
Here, we observe majority of splice junctions to be expressed in ~10% of cells.

Differential analysis

Differential analysis is the cornerstone of RNA-sequencing analysis. This is the first step to identify candidate genes and isoforms for downstream experimental validation.
For differential splicing analysis, MARVEL utilises the permutation-based approach for comparing the PSI values between two pseudo-bulk samples (Efremova et al., 2020).

For differential gene expression analysis, MARVEL utilises the Wilcoxon rank-sum test to compare the normalised, log2-transformed gene expression values between the single-cells that constitute the two pseudo-bulk samples (Satija et al., 2015)

Differential splicing analysis

marvel <- CompareValues.SJ.10x(MarvelObject=marvel,
                               cell.group.g1=cell.ids.1,
                               cell.group.g2=cell.ids.2,
                               min.pct.cells.genes=10,
                               min.pct.cells.sj=10,
                               min.gene.norm=1.0,
                               seed=1,
                               n.iterations=100,
                               downsample=TRUE,
                               show.progress=FALSE
                               )

cell.group.g1 option. Cell IDs corresponding to the reference cell group.
cell.group.g2 option. Cell IDs corresponding to the non-reference cell group.
min.pct.cells.genes option. The minimum number of cells expressing a given gene for the gene to be included for DE analysis. This threshold may be determined from running the PlotPctExprCells.Genes.10x function earlier.
min.pct.cells.sj option. The minimum number of cells expressing a given splice junction for the splice junction to be included for DE analysis. This threshold may be determined from running the PlotPctExprCells.SJ.10x function earlier.
seed option. To ensure the null distributions and dowm-sampling process are always reproducible.
n.iterations option. The number of permutations to perform when building the null distribution.
downsample option. If set to TRUE, the number of cells of each cell group will be down-sampled. The number of cells to down-sample to will be based on the size of smaller cell group.
show.progress option. For brevity of this tutorial, we did not track the DE progress here. But users are advised to set this option to TRUE on their own devices.

Differential gene analysis

marvel <- CompareValues.Genes.10x(MarvelObject=marvel,
                                  show.progress=FALSE
                                  )

head(marvel$DE$SJ$Table)

##           coord.intron gene_short_name n.cells.total.g1 n.cells.expr.sj.g1
## 1 chr1:1373903:1373999        AURKAIP1             5937               5858
## 2 chr1:1402257:1405808          MRPL20             5937               2955
## 3 chr1:1405887:1406908          MRPL20             5937               1069
## 4 chr1:2189782:2193638          FAAP20             5937               4686
## 5 chr1:6197757:6199561           RPL22             5937               4839
## 6 chr1:6820251:6825091          CAMTA1             5937                479
##   pct.cells.expr.sj.g1 n.cells.expr.gene.g1 pct.cells.expr.gene.g1
## 1                98.67                 5897                  99.33
## 2                49.77                 5845                  98.45
## 3                18.01                 5845                  98.45
## 4                78.93                 5479                  92.29
## 5                81.51                 5937                 100.00
## 6                 8.07                 5753                  96.90
##   sj.count.total.g1 gene.count.total.g1 psi.g1 n.cells.total.g2
## 1             40782               49806  81.88             5937
## 2              4425               36002  12.29             5937
## 3              1199               36002   3.33             5937
## 4             10815               19491  55.49             5937
## 5             11697              193189   6.05             5937
## 6               498               27599   1.80             5937
##   n.cells.expr.sj.g2 pct.cells.expr.sj.g2 n.cells.expr.gene.g2
## 1               5798                97.66                 5858
## 2               3516                59.22                 5859
## 3               1249                21.04                 5859
## 4               5270                88.77                 5715
## 5               5499                92.62                 5937
## 6                675                11.37                 5717
##   pct.cells.expr.gene.g2 sj.count.total.g2 gene.count.total.g2 psi.g2
## 1                  98.67             32201               38876  82.83
## 2                  98.69              5754               35035  16.42
## 3                  98.69              1445               35035   4.12
## 4                  96.26             15551               25371  61.29
## 5                 100.00             19280              189020  10.20
## 6                  96.29               721               27367   2.63
##       log2fc delta pval mean.expr.gene.norm.g1.g2 n.cells.total.norm.g1
## 1 0.01644263  0.95    0                  2.138988                  5937
## 2 0.39040352  4.13    0                  1.901503                  5937
## 3 0.24177679  0.79    0                  1.901503                  5937
## 4 0.14100507  5.80    0                  1.410974                  5937
## 5 0.66780357  4.15    0                  4.074651                  5937
## 6 0.37454272  0.83    0                  1.631310                  5937
##   n.cells.expr.gene.norm.g1 pct.cells.expr.gene.norm.g1 mean.expr.gene.norm.g1
## 1                      5897                       99.33               2.501344
## 2                      5845                       98.45               2.108606
## 3                      5845                       98.45               2.108606
## 4                      5479                       92.29               1.436871
## 5                      5937                      100.00               4.350727
## 6                      5753                       96.90               1.821909
##   n.cells.total.norm.g2 n.cells.expr.gene.norm.g2 pct.cells.expr.gene.norm.g2
## 1                  5937                      5858                       98.67
## 2                  5937                      5859                       98.69
## 3                  5937                      5859                       98.69
## 4                  5937                      5715                       96.26
## 5                  5937                      5937                      100.00
## 6                  5937                      5717                       96.29
##   mean.expr.gene.norm.g2 log2fc.gene.norm pval.gene.norm pval.adj.gene.norm
## 1               1.776632      -0.72471240   0.000000e+00       0.000000e+00
## 2               1.694400      -0.41420606   0.000000e+00       0.000000e+00
## 3               1.694400      -0.41420606   0.000000e+00       0.000000e+00
## 4               1.385077      -0.05179485   8.407605e-13       9.278512e-13
## 5               3.798576      -0.55215116   0.000000e+00       0.000000e+00
## 6               1.440711      -0.38119821  8.205003e-297      1.623418e-296

show.progress option. For brevity of this tutorial, we did not track the DE progress here. But users are advised to set this option to TRUE on their own devices.
Please note that MARVEL will only perform differential gene expression analysis on genes included for differential splicing analysis. These genes constitute the splice junctions included for analysis using the CompareValues.SJ.10x function earlier.

Volcano plot: Splicing

marvel <- PlotDEValues.SJ.10x(MarvelObject=marvel,
                              pval=0.05,
                              delta=5,
                              min.gene.norm=1.0,
                              anno=FALSE
                              )

marvel$DE$SJ$VolcanoPlot$SJ$Plot

head(marvel$DE$SJ$VolcanoPlot$SJ$Data[,c("coord.intron", "gene_short_name", "sig")])

##           coord.intron gene_short_name  sig
## 1 chr1:1373903:1373999        AURKAIP1 n.s.
## 2 chr1:1402257:1405808          MRPL20 n.s.
## 3 chr1:1405887:1406908          MRPL20 n.s.
## 4 chr1:2189782:2193638          FAAP20   up
## 5 chr1:6197757:6199561           RPL22 n.s.
## 6 chr1:6820251:6825091          CAMTA1 n.s.

pval option. The p-value, below which, the splice junction is considered to be differentially spliced.
delta option. The absolute difference in mean PSI values between the two cell populations, above which, the splice junction is considered to be differentially spliced.
min.gene.norm option. The mean normalised gene expression value across all cells from both populations, above which, the corresponding splice junction is considered to be expressed.

Gene-splicing dynamics

MARVEL’s integrated differential gene and splicing analysis enables us to investigate how gene expression changes relative to splice junction changes when iPSCs differentiate into day-10 cardiomyocytes. The gene-splicing dynamics may be classified into four categories, namely coordinated, opposing, isoform-switching, and complex.

Coordinated gene-splicing relationship refers to the change in mean gene expression is in the same direction with the corresponding splicing junction(s).
Opposing gene-splicing relationship refers to the change in mean gene expression is in the opposite direction to the corresponding splicing junction(s).
Isoform-switching refers to genes that are differentially spliced without being differentially expressed.
Complex gene-splicing relationship refers to genes with both coordinated and opposing relationships with the corresponding splicing junctions.

Here, we will explore the gene-splicing dynamics of genes that are differentially spliced between iPSCs and day-10 cardiomyocytes. Representative examples of each dynamic will also be shown. This section also utilises the ad hoc plotting functions PlotValues.PCA.CellGroup.10x, PlotValues.PCA.Gene.10x, PlotValues.PCA.PSI.10x for plotting cell groups, PSI, and gene expression, respectively, on the reduced dimension space.
Please not users are required to run the CompareValues.SJ.10x and CompareValues.Genes.10x functions (refer to “Differential splicing analysis” section) before proceeding with this section.

Assign dynamics

marvel <- IsoSwitch.10x(MarvelObject=marvel,
                        pval.sj=0.05,
                        delta.sj=5,
                        min.gene.norm=1.0,
                        pval.adj.gene=0.05,
                        log2fc.gene=0.5
                        )

marvel$SJ.Gene.Cor$Proportion$Plot

marvel$SJ.Gene.Cor$Proportion$Table

##   sj.gene.cor freq       pct
## 2 Coordinated  101 18.738404
## 4    Opposing   83 15.398887
## 3  Iso-Switch  317 58.812616
## 1     Complex   38  7.050093

head(marvel$SJ.Gene.Cor$Data[,c("coord.intron", "gene_short_name", "cor.complete")])

##             coord.intron gene_short_name cor.complete
## 1   chr1:2189782:2193638          FAAP20   Iso-Switch
## 2 chr1:11054886:11054961             SRM     Opposing
## 3 chr1:11674817:11675081          MAD2L2     Opposing
## 4 chr1:20652529:20652620           DDOST   Iso-Switch
## 5 chr1:22086867:22091427           CDC42  Coordinated
## 6 chr1:23313703:23318482          HNRNPR  Coordinated

pval.sj option. The p-value, below which, the splice junction is considered to be differentially spliced.
delta.sj option. The absolute difference in mean PSI values between the two cell populations, above which, the splice junction is considered to be differentially spliced.
min.gene.norm option. The mean normalised gene expression value across all cells from both populations, above which, the corresponding splice junction is considered to be expressed.
pval.adj.gene option. The p-value, below which, the gene is considered to be differentially expressed.
log2fc.gene option. The absolute log2 fold change in gene expression, above which, the gene is considered to be differentially expressed.
Here, we observe majority of differentially spliced genes underwent isoform-switching followed by coordinated, opposing, and then complex gene expression changes relative to splicing changes.
Note that majority of differentially spliced genes may not be inferred directly from differential gene expression analysis alone. For example, only in coordinated gene-splicing relationship that the gene and splicing changes between iPSCs and day-10 cardiomyocytes are in the same direction. But this category only constitute around one-fifth of gene-splicing relationships.

Coordinated

# Define cell groups
    # Retrieve sample metadata
    sample.metadata <- marvel$sample.metadata
    
    # iPSC
    index <- which(sample.metadata$cell.type=="iPSC")
    cell.ids.1 <- sample.metadata[index, "cell.id"]
    length(cell.ids.1)

## [1] 11244

    # Cardio day 10
    index <- which(sample.metadata$cell.type=="Cardio day 10")
    cell.ids.2 <- sample.metadata[index, "cell.id"]
    length(cell.ids.2)

## [1] 5937

    # Save into list
    cell.group.list <- list("iPSC"=cell.ids.1,
                            "Cardio d10"=cell.ids.2
                            )
    
# Plot cell groups
marvel <- PlotValues.PCA.CellGroup.10x(MarvelObject=marvel,
                                       cell.group.list=cell.group.list,
                                       legendtitle="Cell group",
                                       type="tsne"
                                       )

plot_group <- marvel$adhocPlot$PCA$CellGroup

# Plot gene expression
marvel <- PlotValues.PCA.Gene.10x(MarvelObject=marvel,
                                  gene_short_name="VIM",
                                  color.gradient=c("grey","cyan","green","yellow","red"),
                                  type="tsne"
                                  )


plot_gene <- marvel$adhocPlot$PCA$Gene

# Plot PSI
marvel <- PlotValues.PCA.PSI.10x(MarvelObject=marvel,
                                 coord.intron="chr10:17235390:17235845",
                                 min.gene.count=3,
                                 log2.transform=FALSE,
                                 color.gradient=c("grey","cyan","green","yellow","red"),
                                 type="tsne"
                                 )
 
plot_sj <- marvel$adhocPlot$PCA$PSI

# Arrange and view plots
grid.arrange(plot_group, plot_gene, plot_sj, nrow=1)

For VIM, both gene expression (middle) and the corresponding splice junction (right) shown here are up-regulated in day-10 cardiomyocytes relative to iPSCs.
Note gene expression represented by log2(expression) scale whereas splicing rate represented by PSI scale.

Opposing

# Plot gene expression
marvel <- PlotValues.PCA.Gene.10x(MarvelObject=marvel,
                                  gene_short_name="UQCRH",
                                  color.gradient=c("grey","cyan","green","yellow","red"),
                                  type="tsne"
                                  )


plot_gene <- marvel$adhocPlot$PCA$Gene

# Plot PSI
marvel <- PlotValues.PCA.PSI.10x(MarvelObject=marvel,
                                 coord.intron="chr1:46310317:46316551",
                                 min.gene.count=3,
                                 log2.transform=FALSE,
                                 color.gradient=c("grey","cyan","green","yellow","red"),
                                 type="tsne"
                                 )
 
plot_sj <- marvel$adhocPlot$PCA$PSI

# Arrange and view plots
grid.arrange(plot_group, plot_gene, plot_sj, nrow=1)

For UQRCH, the gene expression (middle) is up-regulated in iPSCs relative to day-10 cardiomyocytes.
On the other hand, the corresponding splice junction (right) shown here is up-regulated in day-10 cardiomyocytes relative to iPSCs.
Note gene expression represented by log2(expression) scale whereas splicing rate represented by PSI scale.

Iso-switch

# Plot gene expression
marvel <- PlotValues.PCA.Gene.10x(MarvelObject=marvel,
                                  gene_short_name="RBM39",
                                  color.gradient=c("grey","cyan","green","yellow","red"),
                                  type="tsne"
                                  )


plot_gene <- marvel$adhocPlot$PCA$Gene

# Plot PSI: SJ-1
marvel <- PlotValues.PCA.PSI.10x(MarvelObject=marvel,
                                 coord.intron="chr20:35740888:35741940",
                                 min.gene.count=3,
                                 log2.transform=FALSE,
                                 color.gradient=c("grey","cyan","green","yellow","red"),
                                 type="tsne"
                                 )
 
plot_sj_1 <- marvel$adhocPlot$PCA$PSI

# Plot PSI: SJ-2
marvel <- PlotValues.PCA.PSI.10x(MarvelObject=marvel,
                                 coord.intron="chr20:35739018:35740823",
                                 min.gene.count=3,
                                 log2.transform=FALSE,
                                 color.gradient=c("grey","cyan","green","yellow","red"),
                                 type="tsne"
                                 )
 
plot_sj_2 <- marvel$adhocPlot$PCA$PSI

# Arrange and view plots
grid.arrange(plot_gene, plot_sj_1, plot_sj_2, nrow=1)

For RBM39, the gene (left) is not differentially expressed between iPSCs and day-10 cardiomyocytes.
On the other hand, both corresponding splice junctions (middle and right) are up-regulated in iPSCs relative to day-10 cardiomyocytes.
Note gene expression represented by log2(expression) scale whereas splicing rate represented by PSI scale.

Complex

# Gene 1
  # Plot gene expression
  marvel <- PlotValues.PCA.Gene.10x(MarvelObject=marvel,
                                    gene_short_name="TPM1",
                                    color.gradient=c("grey","cyan","green","yellow","red"),
                                    type="tsne"
                                    )

  plot.1_gene <- marvel$adhocPlot$PCA$Gene
  
  # Plot PSI: SJ-1
  marvel <- PlotValues.PCA.PSI.10x(MarvelObject=marvel,
                                   coord.intron="chr15:63064143:63065895",
                                   min.gene.count=3,
                                   log2.transform=FALSE,
                                   color.gradient=c("grey","cyan","green","yellow","red"),
                                   type="tsne"
                                   )
   
  plot.1_sj_1 <- marvel$adhocPlot$PCA$PSI
  
  # Plot PSI: SJ-2
  marvel <- PlotValues.PCA.PSI.10x(MarvelObject=marvel,
                                   coord.intron="chr15:63044153:63056984",
                                   min.gene.count=3,
                                   log2.transform=FALSE,
                                   color.gradient=c("grey","cyan","green","yellow","red"),
                                   type="tsne"
                                   )
   
  plot.1_sj_2 <- marvel$adhocPlot$PCA$PSI

# Gene 2
  # Plot gene expression
  marvel <- PlotValues.PCA.Gene.10x(MarvelObject=marvel,
                                    gene_short_name="TPM2",
                                    color.gradient=c("grey","cyan","green","yellow","red"),
                                    type="tsne"
                                    )
  
  
  plot.2_gene <- marvel$adhocPlot$PCA$Gene
  
  # Plot PSI: SJ-1
  marvel <- PlotValues.PCA.PSI.10x(MarvelObject=marvel,
                                   coord.intron="chr9:35684808:35685268",
                                   min.gene.count=3,
                                   log2.transform=FALSE,
                                   color.gradient=c("grey","cyan","green","yellow","red"),
                                   type="tsne"
                                   )
   
  plot.2_sj_1 <- marvel$adhocPlot$PCA$PSI
  
  # Plot PSI: SJ-2
  marvel <- PlotValues.PCA.PSI.10x(MarvelObject=marvel,
                                   coord.intron="chr9:35684551:35685063",
                                   min.gene.count=3,
                                   log2.transform=FALSE,
                                   color.gradient=c("grey","cyan","green","yellow","red"),
                                   type="tsne"
                                   )
   
  plot.2_sj_2 <- marvel$adhocPlot$PCA$PSI
  
# Arrange and view plots
grid.arrange(plot.1_gene, plot.1_sj_1, plot.1_sj_2, 
             plot.2_gene, plot.2_sj_1, plot.2_sj_2,
             nrow=2
             )

For both TPM1 and TPM2 genes, their gene expressions (left) are up-regulated in day-10 cardiomyocytes relative to iPSCs.
Similar to gene expression profile, one of the splice junctions (middle) for these genes are up-regulated in day-10 cardiomyocytes relative to iPSCs.
On the other hand, the other splice junction (right) for these genes are down-regulated in day-10 cardiomyocytes relative to iPSCs.
Note gene expression represented by log2(expression) scale whereas splicing rate represented by PSI scale.

Gene ontology analysis

Gene ontology analysis or pathway enrichment analysis categorises the differentially spliced genes between iPSCs and day-10 cardiomyocytes into biological pathways. This may identify sets of genes with similar function or belong to similar biological pathways that are concurrently spliced.

marvel <- BioPathways.10x(MarvelObject=marvel,
                          pval=0.05,
                          delta=5,
                          min.gene.norm=1.0,
                          method.adjust="fdr",
                          species="human"
                          )

head(marvel$DE$BioPathways$Table)

##                    ID
## GO:0006119 GO:0006119
## GO:0046034 GO:0046034
## GO:0000377 GO:0000377
## GO:0000398 GO:0000398
## GO:1903311 GO:1903311
## GO:0010257 GO:0010257
##                                                                                     Description
## GO:0006119                                                            oxidative phosphorylation
## GO:0046034                                                                ATP metabolic process
## GO:0000377 RNA splicing, via transesterification reactions with bulged adenosine as nucleophile
## GO:0000398                                                       mRNA splicing, via spliceosome
## GO:1903311                                                 regulation of mRNA metabolic process
## GO:0010257                                                  NADH dehydrogenase complex assembly
##            GeneRatio   BgRatio enrichment       pvalue     p.adjust
## GO:0006119    48/454 149/18866  13.386867 6.120887e-41 2.421423e-37
## GO:0046034    54/454 311/18866   7.215349 9.007387e-31 8.908306e-28
## GO:0000377    56/454 390/18866   5.966881 1.698617e-27 7.466364e-25
## GO:0000398    56/454 390/18866   5.966881 1.698617e-27 7.466364e-25
## GO:1903311    50/454 344/18866   6.039981 7.320281e-25 2.227618e-22
## GO:0010257    24/454  65/18866  15.343409 1.245443e-22 3.284649e-20
##                  qvalue
## GO:0006119 2.034712e-37
## GO:0046034 7.485613e-28
## GO:0000377 6.273955e-25
## GO:0000398 6.273955e-25
## GO:1903311 1.871858e-22
## GO:0010257 2.760077e-20
##                                                                                                                                                                                                                                                                                                                                                                                      geneID
## GO:0006119                               UQCRH/NDUFS2/ATP5F1C/NDUFS3/NDUFS8/ATP5MC2/ATP5F1B/NDUFA12/DNAJC15/NDUFB1/COX5A/NDUFB10/UQCRC2/NDUFAB1/COX4I1/ATP5PD/NDUFV2/ATP5F1A/ATP5F1D/NDUFS7/NDUFA3/ATP5MC3/NDUFB3/ATP5F1E/ATP5PF/NDUFA6/UQCRC1/NDUFB4/ATP5ME/NDUFC1/NDUFS6/NDUFS4/COX7C/UQCRQ/COX7A2/NDUFA4/CYCS/ATP5MF/NDUFA5/NDUFB2/COX6C/NDUFB9/CYC1/NDUFB6/STOML2/NDUFA8/NDUFB11/NDUFA1
## GO:0046034 UQCRH/NDUFS2/TPR/PARP1/GUK1/ATP5F1C/NDUFS3/NDUFS8/GAPDH/ATP5MC2/ATP5F1B/NDUFA12/DNAJC15/NDUFB1/COX5A/NDUFB10/UQCRC2/NDUFAB1/COX4I1/ATP5PD/NDUFV2/ATP5F1A/ATP5F1D/NDUFS7/NDUFA3/OLA1/ATP5MC3/NDUFB3/ATP5F1E/ATP5PF/APP/NDUFA6/UQCRC1/NDUFB4/ATP5ME/NDUFC1/NDUFS6/NDUFS4/COX7C/UQCRQ/COX7A2/NDUFA4/CYCS/ATP5MF/NDUFA5/NDUFB2/COX6C/NDUFB9/CYC1/NDUFB6/STOML2/NDUFA8/NDUFB11/NDUFA1
## GO:0000377                    HNRNPR/SRRM1/DNAJC8/SFPQ/THRAP3/YBX1/SRSF11/RBM8A/PRDX6/SNRPE/HNRNPU/RBM17/POLR2G/SF3B2/CWC15/ZCRB1/PCBP2/HNRNPA1/SNRPF/SAP18/HNRNPC/SRSF5/PAPOLA/SNRPA1/SNRNP25/SRRM2/FUS/DDX5/SNRPD1/CIRBP/HNRNPM/UBL5/SNRPD2/SF3B6/SRSF7/SNRPG/PRPF40A/SF3B1/SNRPB/SNRPB2/RALY/RBM39/SON/POLR2F/DDX17/LSM3/U2SURP/FXR1/TRA2B/LSM6/HNRNPH1/SNRPC/SYNCRIP/BUD31/POLR2J/PSIP1
## GO:0000398                    HNRNPR/SRRM1/DNAJC8/SFPQ/THRAP3/YBX1/SRSF11/RBM8A/PRDX6/SNRPE/HNRNPU/RBM17/POLR2G/SF3B2/CWC15/ZCRB1/PCBP2/HNRNPA1/SNRPF/SAP18/HNRNPC/SRSF5/PAPOLA/SNRPA1/SNRNP25/SRRM2/FUS/DDX5/SNRPD1/CIRBP/HNRNPM/UBL5/SNRPD2/SF3B6/SRSF7/SNRPG/PRPF40A/SF3B1/SNRPB/SNRPB2/RALY/RBM39/SON/POLR2F/DDX17/LSM3/U2SURP/FXR1/TRA2B/LSM6/HNRNPH1/SNRPC/SYNCRIP/BUD31/POLR2J/PSIP1
## GO:1903311                                                                       HNRNPR/YTHDF2/PSMB2/THRAP3/YBX1/SERBP1/PSMA5/RBM8A/PSMD4/PSMB4/PRDX6/HNRNPU/RBM17/VIM/PSMC3/POLR2G/YBX3/HNRNPA1/SAP18/HNRNPC/PSME2/PSMA3/SRSF5/PAPOLA/SLTM/PSMA4/FUS/PSMB6/PSMB3/PSMC5/DDX5/CIRBP/HNRNPM/UBA52/PSMD8/SRSF7/RBM39/YWHAB/SON/DDX17/DHX36/FXR1/TRA2B/NPM1/SYNCRIP/PSMB1/HSPB1/YWHAZ/PSMB7/SET
## GO:0010257                                                                                                                                                                                                       NDUFS2/NDUFS3/NDUFS8/NDUFA12/NDUFB1/NDUFB10/NDUFAB1/NDUFV2/NDUFS7/NDUFA3/NDUFB3/NDUFA6/NDUFB4/NDUFC1/NDUFS6/NDUFS4/NDUFA5/NDUFB2/NDUFB9/DMAC1/NDUFB6/NDUFA8/NDUFB11/NDUFA1
##            Count
## GO:0006119    48
## GO:0046034    54
## GO:0000377    56
## GO:0000398    56
## GO:1903311    50
## GO:0010257    24

pval option. The p-value, below which, the splice junction is considered to be differentially spliced.
delta option. The absolute difference in mean PSI values between the two cell populations, above which, the splice junction is considered to be differentially spliced.
min.gene.norm option. The mean normalised gene expression value across all cells from both populations, above which, the corresponding splice junction is considered to be expressed.
species option. MARVEL also supports GO analysis of "mouse".
Alternative to using the pval, delta, and min.gene.norm options for specifying differentially spliced genes for GO analysis, users may input their own custom genes for GO analysis using the custom.genes option.

# Plot pathways related to cardiomyocyte development
go.terms <- c("ATP metabolic process",
              "regulation of stem cell differentiation",
              "Wnt signaling pathway, planar cell polarity pathway",
              "non-canonical Wnt signaling pathway",
              "muscle filament sliding",
              "actin-myosin filament sliding",
              "regulation of ATPase activity",
              "regulation of fibroblast proliferation",
              "actomyosin structure organization",
              "myofibril assembly",
              "negative regulation of calcium-mediated signaling"
              )

marvel <- BioPathways.Plot.10x(MarvelObject=marvel,
                               go.terms,
                               y.label.size=8,
                               offset=0.5
                               )

marvel$DE$BioPathways$Plot

Candidate gene analysis

             Users may have specific candidate genes of interest to investigate in the single-cell RNA-sequencing dataset. These candidate genes may have been identified from initial differential gene expression analysis of the same dataset or from an earlier discovery dataset, e.g. bulk RNA-sequencing dataset, or that reported from the literature.
             Here, we will illustrate several related functions for investigating the gene-splicing relationship between a selected gene and its corresponding splice junctions. This may reveal candidate isoforms for downstream functional validation.
             We have chosen TPM2 for this illustration as we have seen earlier that this gene has a complex relationship with its splice junctions (see “Gene-splicing dynamics” section of this tutorial). Therefore, this gene will be a good example for demonstrating the intricate relationship betweeh splice junction usage and gene expression profile.

Define cell groups

Our earlier differential analysis included only iPSCs and day-10 cardiomyocytes. Here, we will include all cell types in this dataset, namely iPSCs, and day-2, -4, and -10 cardiomyocytes.

# Retrieve sample metadata
sample.metadata <- marvel$sample.metadata

# iPSC
index <- which(sample.metadata$cell.type=="iPSC")
cell.ids.1 <- sample.metadata[index, "cell.id"]
length(cell.ids.1)

## [1] 11244

# Cardio day 2
index <- which(sample.metadata$cell.type=="Cardio day 2")
cell.ids.2 <- sample.metadata[index, "cell.id"]
length(cell.ids.2)

## [1] 6240

# Cardio day 4
index <- which(sample.metadata$cell.type=="Cardio day 4")
cell.ids.3 <- sample.metadata[index, "cell.id"]
length(cell.ids.3)

## [1] 8635

# Cardio day 10
index <- which(sample.metadata$cell.type=="Cardio day 10")
cell.ids.4 <- sample.metadata[index, "cell.id"]
length(cell.ids.4)

## [1] 5937

# Save into list
cell.group.list <- list("iPSC"=cell.ids.1,
                        "day2"=cell.ids.2,
                        "day4"=cell.ids.3,
                        "day10"=cell.ids.4
                        )

Gene expression profiling

marvel <- adhocGene.TabulateExpression.Gene.10x(MarvelObject=marvel,
                                                cell.group.list=cell.group.list,
                                                gene_short_name="TPM2",
                                                min.pct.cells=10,
                                                downsample=TRUE
                                                )

marvel$adhocGene$Expression$Gene$Plot

marvel$adhocGene$Expression$Gene$Table

##   group mean.expr pct.cells.expr
## 1  iPSC  1.533693       93.39734
## 2  day2  2.064926       99.52838
## 3  day4  2.889787       99.89894
## 4 day10  2.919375       99.89894

min.pct.cells option. The minimum percentage of cells the gene is expressed in for the gene to be considered expressed and included for analysis here.
downsample option. If set to TRUE, the number of cells of each cell group will be down-sampled. The number of cells to down-sample to will be based on the size of smallest cell group.
We observe progressive increased in TPM2 gene expression from iPSCs to early cardiomyocytes and then late cardiomyocytes.

SJ usage profiling

marvel <- adhocGene.TabulateExpression.PSI.10x(MarvelObject=marvel,
                                               min.pct.cells=10
                                               )

marvel$adhocGene$Expression$PSI$Plot

head(marvel$adhocGene$Expression$PSI$Table)

##   group figure.column           coord.intron n.cells.total n.cells.expr.sj
## 1  iPSC          SJ-1 chr9:35682164:35684245          5937            5209
## 2  iPSC          SJ-2 chr9:35684316:35684487          5937            4860
## 3  iPSC          SJ-3 chr9:35684551:35684731          5937            2754
## 4  iPSC          SJ-4 chr9:35684808:35685268          5937            1013
## 5  iPSC          SJ-5 chr9:35685340:35685433          5937             919
## 6  iPSC          SJ-9 chr9:35684551:35685063          5937            1073
##   pct.cells.expr.sj sj.count.total gene.count.total   psi
## 1             87.74          15904            21778 73.03
## 2             81.86          12730            21778 58.45
## 3             46.39           4083            21778 18.75
## 4             17.06           1152            21778  5.29
## 5             15.48           1039            21778  4.77
## 6             18.07           1229            21778  5.64

min.pct.cells option. The minimum percentage of cells the splice junction is expressed in for the splice junction to be considered expressed and included for analysis here.
We observe 10 splice junctions to be expressed in at least one cell population.
SJ-1 is more highly expressed in iPSCs and early cardiomyocytes relative to late cardiomyocytes.
However, there is virtually no difference in SJ-2 expression across all cell populations.
On the other hand, SJ-3 is more highly expressed in cardiomyocytes relative to iPSCs.

Gene-splicing relationship

Let’s investigate how the top 3 most highly-expressed splice junctions’ usage (SJ-1, -2, and -3) changes relative to TPM2 gene expression changes for all possible pair-wise comparison between iPSCs, and day-2, -4, and -10 cardiomyocytes.

# Perform differential gene expression analysis for all pair-wise group comparison
marvel <- adhocGene.DE.Gene.10x(MarvelObject=marvel)

# Perform differential splicing expression analysis for all pair-wise group comparison
marvel <- adhocGene.DE.PSI.10x(MarvelObject=marvel)

# Retrieve SJ DE results table
results <- marvel$adhocGene$DE$PSI$Data

# SJ-1
  # Define SJ
  coord.intron <- results[which(results$figure.column=="SJ-1"), "coord.intron"]
  coord.intron <- unique(coord.intron)
  coord.intron

## [1] "chr9:35682164:35684245"

  # Plot
  marvel <- adhocGene.PlotDEValues.10x(MarvelObject=marvel,
                                       coord.intron=coord.intron,
                                       log2fc.gene=0.5,
                                       delta.sj=5,
                                       label.size=2,
                                       point.size=2,
                                       xmin=-2.0,
                                       xmax=2.0,
                                       ymin=-25,
                                       ymax=25
                                       )

  plot.1 <- marvel$adhocGene$DE$VolcanoPlot$Plot

# SJ-2
  # Define SJ
  coord.intron <- results[which(results$figure.column=="SJ-2"), "coord.intron"]
  coord.intron <- unique(coord.intron)
  coord.intron

## [1] "chr9:35684316:35684487"

  # Plot
  marvel <- adhocGene.PlotDEValues.10x(MarvelObject=marvel,
                                       coord.intron=coord.intron,
                                       log2fc.gene=0.5,
                                       delta.sj=5,
                                       label.size=2,
                                       point.size=2,
                                       xmin=-2.0,
                                       xmax=2.0,
                                       ymin=-25,
                                       ymax=25
                                       )
                                         
  plot.2 <- marvel$adhocGene$DE$VolcanoPlot$Plot    

# SJ-3
    # Define SJ
    coord.intron <- results[which(results$figure.column=="SJ-3"), "coord.intron"]
    coord.intron <- unique(coord.intron)
    coord.intron

## [1] "chr9:35684551:35684731"

    # Plot
    marvel <- adhocGene.PlotDEValues.10x(MarvelObject=marvel,
                                         coord.intron=coord.intron,
                                         log2fc.gene=0.5,
                                         delta.sj=5,
                                         label.size=2,
                                         point.size=2,
                                         xmin=-2.0,
                                         xmax=2.0,
                                         ymin=-25,
                                         ymax=25
                                         )
                                         
    plot.3 <- marvel$adhocGene$DE$VolcanoPlot$Plot    

# Arrange and view plots
grid.arrange(plot.1, plot.2, plot.3,
             nrow=3
             )

log2fc.gene option. The absolute log2 fold change, above which, the gene is considered differentially expressed.
delta.sj option. The absolute difference in mean PSI values between two cell populations, above which, the splice junction is considered differentially spliced.
SJ-1 is more highly expressed in iPSCs and early cardiomyocytes relative to late cardiomyocytes.
However, there is virtually no difference in SJ-2 expression across all cell populations.
On the other hand, SJ-3 is more highly expressed in cardiomyocytes relative to iPSCs.

Locate SJ position

Locating the position of the splice junctions on the isoforms corresponding to TPM2 will reveal the specific isoforms expressed in our dataset for this.

# SJ-1
    # Define SJs to plot
    coord.intron <- results[which(results$figure.column=="SJ-1"), "coord.intron"]
    coord.intron <- unique(coord.intron)
    coord.intron

## [1] "chr9:35682164:35684245"

    # Plot
    marvel <- adhocGene.PlotSJPosition.10x(MarvelObject=marvel,
                                          coord.intron=coord.intron,
                                          rescale_introns=FALSE,
                                          show.protein.coding.only=TRUE,
                                          anno.label.size=1.5
                                          )
    
    marvel$adhocGene$SJPosition$Plot

# SJ-2
    # Define SJs to plot
    coord.intron <- results[which(results$figure.column=="SJ-2"), "coord.intron"]
    coord.intron <- unique(coord.intron)
    coord.intron

## [1] "chr9:35684316:35684487"

    # Plot
    marvel <- adhocGene.PlotSJPosition.10x(MarvelObject=marvel,
                                          coord.intron=coord.intron,
                                          rescale_introns=FALSE,
                                          show.protein.coding.only=TRUE,
                                          anno.label.size=1.5
                                          )
    
    marvel$adhocGene$SJPosition$Plot

# SJ-3
    # Define SJs to plot
    coord.intron <- results[which(results$figure.column=="SJ-3"), "coord.intron"]
    coord.intron <- unique(coord.intron)
    coord.intron

## [1] "chr9:35684551:35684731"

    # Plot
    marvel <- adhocGene.PlotSJPosition.10x(MarvelObject=marvel,
                                          coord.intron=coord.intron,
                                          rescale_introns=FALSE,
                                          show.protein.coding.only=TRUE,
                                          anno.label.size=1.5
                                          )
    
    marvel$adhocGene$SJPosition$Plot

rescale_introns option. If set to TRUE, the introns will be minimise in order to emphasise the exons and splice junctions.
show.protein.coding.only option. If set to TRUE, only protein-coding isoforms will be displayed.
We may infer that ENST00000329305 is the dominant isoform expressed in this dataset because it contains the 3 top highly expressed splice junctions.
Due to differential splice junction usage across the different cell population, it is conceivable that isoforms other than ENST00000329305 are also expressed in this dataset, and are differentially expressed across the different cell population.

Aggregate figures

Finally, we may tabulate the earlier figures above into a publication ready format.

MARVEL

Sean Wen