Input files
             In this section, we will read in
the required input files for MARVEL. The formatting requirement for each
input file will be explained.
Preprocessing
             Alignment of sequencing reads was
performed using cellranger v2.1.1. Example code for one sample
(SRR9008752) below:
cellranger count --id=SRR9008752 \
--fastqs=SRR9008752 \
--sample=SRR9008752 \
--transcriptome=refdata-cellranger-GRCh38-3.0.0 \
--expect-cells 10000
             The resulting BAM files for each
sample were used as inputs for STARsolo 2.7.8a (Kaminow
et al.,
2021) to generate the gene and splice junction count matrices required
for downstream processing. Example code for one sample (SRR9008752)
below. Please note the cell barcode list specified in
--soloCBwhitelist
should match the 10x Genomics chemistry
used to generated the scRNAseq dataset (
https://kb.10xgenomics.com/hc/en-us/articles/115004506263-What-is-a-barcode-whitelist-).
STAR --runThreadN 16 \
--genomeDir refdata-cellranger-GRCh38-3.0.0 \
--soloType CB_UMI_Simple \
--readFilesIn SRR9008752_possorted_genome_bam.bam \
--readFilesCommand samtools view -F 0x100 \
--readFilesType SAM SE \
--soloInputSAMattrBarcodeSeq CR UR \
--soloInputSAMattrBarcodeQual CY UY \
--soloCBwhitelist 737K-august-2016.txt \
--soloFeatures Gene SJ
             The raw gene count outputs
(matrix.mtx, features.tsv, barcodes.tsv) may be found in the
Solo.out/Gene/filtered/ folder whereas the raw splice junction counts
may be found in theSolo.out/SJ/raw/ folder. The raw gene counts outputs
will be used for the Normalised gene expression
and
Gene counts
sections below whereas the raw splice junction
counts will be used for the Splice junction counts
section.
Normalised gene expression
             Gene expression normalisation and
batch correction etc. should be performed prior to MARVEL.
             The outputs of STARsolo (matrix.mtx.gz, features.tsv.gz,
barcodes.tsv.gz) were used as inputs for SingCellaR (Wang
et
al., 2022). For each sample, only cells passing QC were retained
and gene expression normalisation was also performed. All 4 samples
(iPSCs and iPSC-induced cardiomyocytes at day-2, -4, and -10) were
subsequently integrated and the gene expression (sparse) matrix, cell
metadata, and gene metadata were saved as separated files. These files
were used as inputs for MARVEL in R.
             More details on data processing using SingCellaR on STAR
Protocol (
https://star-protocols.cell.com/protocols/1530).
             The matrix rows represent genes, the columns represent
cells, and the values represent normalised gene expression values that
have not yet been log2-transformed.
             The cell metadata rows represent the cell IDs while the
columns represent the cell information. Compulsory column is
cell.id
, all other columns are optional. The cell IDs here
should match the columns of the gene expression matrix.
             The gene metadata rows represent the abbreviated gene
names. Compulsory column is
gene_short_name
, all other
columns are optional. The gene names here should match the rows of the
gene expression matrix.
# Matrix
path <- "Data/Gene_SingCellaR/"
file <- "matrix_normalised.mtx"
df.gene.norm <- readMM(paste(path, file, sep=""))
df.gene.norm[df.gene.norm[,1] !=0,][1:5,1:5]
## 5 x 5 sparse Matrix of class "dgTMatrix"
##
## [1,] 2.1168501 . 0.9731413 0.4139587 0.9874593
## [2,] 0.7056167 . . . .
## [3,] 2.1168501 3.453436 1.9462826 3.3116695 1.9749185
## [4,] 1.4112334 . 0.4865707 0.4139587 1.4811889
## [5,] 4.2337003 2.302291 4.3791359 3.3116695 1.9749185
# cell metadata
path <- "Data/Gene_SingCellaR/"
file <- "phenoData.txt"
df.gene.norm.pheno <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)
head(df.gene.norm.pheno)
## cell.id donor.id cell.type
## 1 AAACCTGAGCCGGTAA_SRR9008752 SRR9008752 Cardio day 4
## 2 AAACCTGAGTACCGGA_SRR9008752 SRR9008752 Cardio day 4
## 3 AAACCTGAGTGCTGCC_SRR9008752 SRR9008752 Cardio day 4
## 4 AAACCTGAGTGTACGG_SRR9008752 SRR9008752 Cardio day 4
## 5 AAACCTGAGTGTCCAT_SRR9008752 SRR9008752 Cardio day 4
## 6 AAACCTGCAGCCTGTG_SRR9008752 SRR9008752 Cardio day 4
# gene metadata
path <- "Data/Gene_SingCellaR/"
file <- "featureData.txt"
df.gene.norm.feature <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)
head(df.gene.norm.feature)
## gene_short_name
## 1 MIR1302-2HG
## 2 FAM138A
## 3 OR4F5
## 4 AL627309.1
## 5 AL627309.3
## 6 AL627309.2
Gene counts
              Raw gene counts here will be used
alongside raw splice junction counts (below) for splicing
analysis.
              Each sample’s raw gene count files generated by STARsolo
should be collated and read into R as follows.
              The matrix rows represent genes, the columns represent
cells, and the values represent UMI-collapsed, non-normalised,
non-log2-transformed gene counts.
             The cell metadata rows represent the cell IDs while the
columns represent the cell information. Compulsory column is
cell.id
, all other columns are optional. The cell IDs here
should match the columns of the gene expression matrix. The cell IDs
should also match the cell metadata and matrix of the normalised gene
expression data df.gene.norm.pheno
above.
             The gene metadata rows represent the abbreviated gene
names. Compulsory column is gene_short_name
, all other
columns are optional. The gene names here should match the rows of the
gene expression matrix. The gene names should also match the gene
metadata and matrix of the normalised gene expression data
df.gene.norm.feature
above.
# Matrix
path <- "Data/Gene_STARsolo/"
file <- "matrix_counts.mtx"
df.gene.count <- readMM(paste(path, file, sep=""))
df.gene.count[df.gene.count[,1] !=0,][1:5,1:5]
## 5 x 5 sparse Matrix of class "dgTMatrix"
##
## [1,] 1 2 1 3 2
## [2,] 1 2 . . .
## [3,] 1 1 . . .
## [4,] 5 6 8 16 8
## [5,] 3 12 9 10 13
# cell metadata
path <- "Data/Gene_STARsolo/"
file <- "phenoData.txt"
df.gene.count.pheno <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)
head(df.gene.count.pheno)
## cell.id
## 1 AAACCTGAGAAGATTC_SRR9008754
## 2 AAACCTGAGAAGGGTA_SRR9008754
## 3 AAACCTGAGCACACAG_SRR9008754
## 4 AAACCTGAGGGCTTGA_SRR9008754
## 5 AAACCTGAGTTGTCGT_SRR9008754
## 6 AAACCTGCAAGTCTGT_SRR9008754
# feature metadata
path <- "Data/Gene_STARsolo/"
file <- "featureData.txt"
df.gene.count.feature <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)
head(df.gene.count.feature)
## gene_short_name
## 1 MIR1302-2HG
## 2 FAM138A
## 3 OR4F5
## 4 AL627309.1
## 5 AL627309.3
## 6 AL627309.2
Splice junction counts
              Raw splice junctions counts here
will be used alongside raw gene counts (above) for splicing
analysis.
              Each sample’s raw splice junction count files generated by
STARsolo should be collated and read into R as follows.
              The matrix rows represent splice junction coordinates, the
columns represent cells, and the values represent UMI-collapsed,
non-normalised, non-log2-transformed splice junction counts.
             The cell metadata rows represent the cell IDs while the
columns represent the cell information. Compulsory column is
cell.id
, all other columns are optional. The cell IDs here
should match the columns of the splice junction count matrix. The cell
IDs should also match the cell metadata and matrix of the normalised
gene expression data df.gene.norm.pheno
above.
             The gene metadata rows represent the splice junction
coordinates. Compulsory column is coord.intron
, all other
columns are optional. The splice junction coordinates here should match
the rows of the splice junction count matrix.
# Matrix
path <- "Data/SJ_STARsolo/"
file <- "matrix_counts.mtx"
df.sj.count <- readMM(paste(path, file, sep=""))
df.sj.count[df.sj.count[,1] !=0,][1:5,1:5]
## 5 x 5 sparse Matrix of class "dgTMatrix"
##
## [1,] 5 4 7 14 6
## [2,] 1 1 2 1 1
## [3,] 1 . . . .
## [4,] 1 1 . . .
## [5,] 1 . . . .
# cell metadata
path <- "Data/SJ_STARsolo/"
file <- "phenoData.txt"
df.sj.count.pheno <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)
head(df.sj.count.pheno)
## cell.id
## 1 AAACCTGAGAAGATTC_SRR9008754
## 2 AAACCTGAGAAGGGTA_SRR9008754
## 3 AAACCTGAGCACACAG_SRR9008754
## 4 AAACCTGAGGGCTTGA_SRR9008754
## 5 AAACCTGAGTTGTCGT_SRR9008754
## 6 AAACCTGCAAGTCTGT_SRR9008754
# feature metadata
path <- "Data/SJ_STARsolo/"
file <- "featureData.txt"
df.sj.count.feature <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)
head(df.sj.count.feature)
## coord.intron
## 1 chr1:14830:14969
## 2 chr1:14830:185490
## 3 chr1:15039:186316
## 4 chr1:16766:16853
## 5 chr1:16766:16857
## 6 chr1:16766:16875
tSNE/UMAP coordinates
             Dimension reduction embeddings
would have to be performed prior to MARVEL. These embeddings will be
used to project gene expression and splice junction expression values by
MARVEL. Compulsory columns are cell.id
, x
(x-axis coordinates) and y
(y-axis coordinates). All other
columns are optional.
             Here, the tSNE embeddings were generated by SingCellaR
(Wang et al., 2022).
path <- "Data/Gene_SingCellaR/"
file <- "dim_red_coordinates_iPSC_CardioDay10.txt"
df.coord <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)
head(df.coord)
## cell.id x y
## 1 AAACCTGAGAAGATTC_SRR9008754 15.06473 -12.685700
## 2 AAACCTGAGAAGGGTA_SRR9008754 23.00697 10.624021
## 3 AAACCTGAGCACACAG_SRR9008754 12.97456 -18.985449
## 4 AAACCTGAGCTTTGGT_SRR9008753 -20.56841 -1.531026
## 5 AAACCTGAGGGCTTGA_SRR9008754 36.47875 -6.242368
## 6 AAACCTGAGTCGTTTG_SRR9008753 -24.82610 -18.792021
Gene transfer file
             For consistency, this GTF should be
the same file previously used for cellranger and STARsolo above.
path <- "Data/GTF/"
file <- "refdata-cellranger-GRCh38-3.0.0.gtf"
gtf <- as.data.frame(fread(paste(path, file, sep=""), sep="\t", header=FALSE, stringsAsFactors=FALSE))
Create MARVEL object
marvel <- CreateMarvelObject.10x(gene.norm.matrix=df.gene.norm,
gene.norm.pheno=df.gene.norm.pheno,
gene.norm.feature=df.gene.norm.feature,
gene.count.matrix=df.gene.count,
gene.count.pheno=df.gene.count.pheno,
gene.count.feature=df.gene.count.feature,
sj.count.matrix=df.sj.count,
sj.count.pheno=df.sj.count.pheno,
sj.count.feature=df.sj.count.feature,
pca=df.coord,
gtf=gtf
)
Pre-process data
             This step annotates the gene and
splice junction metadata, and subsequently enables us to retain only
high-quality splice junctions and genes of particular biotype, e.g.,
protein-coding genes.
Validate junctions
             Only splice junctions whose start
and end mapped to same gene are retained.
marvel <- ValidateSJ.10x(MarvelObject=marvel)
Subset CDS genes
             Moving forward, we will only retain
genes with coding sequence (CDS), i.e., protein-coding genes.
marvel <- FilterGenes.10x(MarvelObject=marvel,
gene.type="protein_coding"
)
Pre-flight check
             This step ensures that our data is
ready for further downstream analysis including differential expression
analysis, functional annotation, and candidate gene analysis.
             To this end, we will have to make sure the columns of the
matrices align with the cell IDs of the sample metadata and the rows of
the matrices align with the feature metadata. Finally, the columns
across all matrices should align with one another.
marvel <- CheckAlignment.10x(MarvelObject=marvel)
             Our data is ready for downstream
analysis when only MATCHED
flags are reported. If any
NOT MATCHED
flags are reported, please double-check the
input file requirements.
Save R object
             Given the time taken for data
pre-processing, it’ll be a good idea to save the MARVEL object at this
step. The object may be loaded whenever the user begins any one of the
downstream analysis.
# Save object
path <- "Data/R object/"
file <- "MARVEL.RData"
save(marvel, file=paste(path, file, sep=""))
# Load object
# path <- "Data/R Object/"
# file <- "MARVEL.RData"
# load(paste(path, file, sep=""))
Explore expression
             Due to the huge number of splice
junctions detected, it will not be time-effective to include all splice
junctions for differential splicing analysis. Therefore, we will only
include splice junctions expressed in at least a user-defined proportion
of cells for differential splicing analysis. To determine this
threshold, we will explore the number of cells expressing a given gene
and splice junction.
             We will explore the gene and splice junction expression
rate in iPSCs and day-10 cardiomyocytes as we will be performing
differential splicing analysis on these two cell populations.
Gene expression rate
# Define cell groups
# Retrieve sample metadata
sample.metadata <- marvel$sample.metadata
# Group 1 (reference)
index <- which(sample.metadata$cell.type=="iPSC")
cell.ids.1 <- sample.metadata[index, "cell.id"]
length(cell.ids.1)
## [1] 11244
# Group 2
index <- which(sample.metadata$cell.type=="Cardio day 10")
cell.ids.2 <- sample.metadata[index, "cell.id"]
length(cell.ids.2)
## [1] 5937
# Explore % of cells expressing genes
marvel <- PlotPctExprCells.Genes.10x(MarvelObject=marvel,
cell.group.g1=cell.ids.1,
cell.group.g2=cell.ids.2,
min.pct.cells=5
)
marvel$pct.cells.expr$Gene$Plot
