Integrative genomic data is not always available in a standard format and often requires some level of data cleaning and bioinformatic processing. The BCPLS data is no exception, which requires multiple pre-processing steps prior to learning the network and interpreting it. It is imperative to maintain the integrity of the Breast Cancer Proteogenomics Landscape Study (BCPLS) data through a transparent outline of every pre-processing step. Dimension reduction, missing value imputation, and reducing multi-colinearity, are specifically highlighted.
This document serves as an example on how one might organize annotations and high-dimensional omic data prior to leveraging the spacemap R package for network learning. We show how to construct Bioconductor’s BioBase::ExpressionSet
data structure for the various omic data profiles (e.g. CNA,RNA,protein). The ExpressionSet
data structure is part of the best practices of the Bioconductor community because it binds experimental metadata and biological annotations to the omic profiles. The annotations are formatted specifically for the spaceMap network analysis toolkit.
Note: the initial data files are not included in the neta-bcpls repository to keep it as light-weight as possible in terms of file sizes. If interested in reproducing this analysis with the initial data files, please contact the authors of spaceMap.
We process 105 breast tumour samples that are sourced from a collaboration between The Cancer Genome Atlas (TCGA) and the Clinical Proteomics Tumor Analysis Consortium (CPTAC). Assays on these 105 samples generated protein quantification, mRNA expression, and copy number alteration (CNA) forming a framework for proteogenomic analysis. Protein abundance profiles were downloaded from the supplementary table of Breast Cancer Proteogenomics Landscape Study, Nature 2016 (BCPLS). The protein identification and quantitation data had been previously processed according to the Common Data Analysis Pipeline associated with CPTAC. The RNA-seq gene expression data set was pre-processed by Michael Gatza, an author of BCPLS, where data has been normalized, log transformed, median centered, and filtered to contain only those genes present in >80% of samples. The CNA gains and losses from the aCGH data were previously called by the ‘Fixed-order clustering’ algorithm from Pei Wang’s group.
There are 108 samples of breast cancer tumours, 3 of which are not included because they were normal controls. The meta-data associated with the CPTAC experiment is listed in the following data.frame:
basedir <- "~/scratch-data//neta-bccptac/data/"
clin <- read.csv(file = file.path(basedir,"tumor-metadata.csv"), header = TRUE)
kable(head(clin))
Sample | iTRAQ.Experiment.number | PAM50 | ER.Status | PR.Status | HER2.Status |
---|---|---|---|---|---|
C8-A131 | 1 | Basal | Negative | Negative | Negative |
BH-A18Q | 2 | Basal | Negative | Negative | Negative |
BH-A0AV | 5 | Basal | Negative | Negative | Negative |
A2-A0D0 | 6 | Basal | Negative | Negative | Negative |
A2-A0CM | 7 | Basal | Negative | Negative | Negative |
AR-A0U4 | 9 | Basal | Negative | Negative | Negative |
Read in the RNA-seq expression data and the gene-level copy number variation (CNV) data. Although we will not use the gene-level CNV data, it is helpful for identifying sample names.
rna <- read.delim(file = file.path(basedir, "rna-seq-data-v1.txt"), header = TRUE, sep = "\t")
cnv <- read.delim(file = file.path(basedir, "CNA-data-v1.txt"), header = TRUE, sep = "\t")
There are only 77 high quality samples from the protein quantitation assays, which requires us to subset from the 105 RNA-seq assays. The high quality sample IDs are contained in Pei Wang’s clustering of the aCGH data. Load this analysis in preparation to extract out the high quality sample IDs.
pwf <- new.env()
load(file = file.path(basedir, "cnv.data.80.RData"), envir = pwf)
In order to reformat the RNA-seq raw data to contain only 77 samples of sufficiently high quality, we must find the high quality sample identifier, which is only a substring contained within of the column names of Pei’s aCGH analysis. Create an ID map to extract the high quality samples.
#drop the control, normal sample ids
idMap <- data.frame(sampleID = clin$Sample[-c(106,107, 108)])
#obtain the abbreviated sample ids
idMap$rnaOrder <- sub(pattern = ".", replacement = "-", fixed = TRUE,
x = substr(x = colnames(rna)[2:length(colnames(rna))],
start = 6, stop = 12))
#indicate which column indices
idMap$rnaIndex <- 2:length(colnames(rna))
idMap$cnvOrder <- sub(pattern = ".", replacement = "-", fixed = TRUE,
x = substr(x = colnames(cnv)[4:length(colnames(cnv))],
start = 6, stop = 12))
idMap$cnvIndex <- 4:length(colnames(cnv))
#assure sample columns for rna, cnv are the same order
stopifnot(identical(idMap$rnaOrder, idMap$cnvOrder))
#assure unique abreviated sample ids before using grep
stopifnot(!anyDuplicated(idMap$rnaOrder))
stopifnot(!anyDuplicated(pwf$cnv.part))
idMap$highQual <- FALSE
highQualIndex <- sapply(pwf$cnv.part, function(x) {
idx <- grep(x, idMap$rnaOrder)
if (length(idx))
return(idx)
else
return(NA)
})
# make sure no high quality samples are missing from search
stopifnot(!any(is.na(highQualIndex)))
idMap$highQual[highQualIndex] <- TRUE
After some processing, we have a data.frame indicating unique sample ID’s, the index in which they appear in the rna
and cnv
variangles, as well as a logical indicating high quality samples.
kable(head(idMap))
sampleID | rnaOrder | rnaIndex | cnvOrder | cnvIndex | highQual |
---|---|---|---|---|---|
C8-A131 | A2-A0CM | 2 | A2-A0CM | 4 | TRUE |
BH-A18Q | A2-A0D0 | 3 | A2-A0D0 | 5 | FALSE |
BH-A0AV | A2-A0D1 | 4 | A2-A0D1 | 6 | FALSE |
A2-A0D0 | A2-A0D2 | 5 | A2-A0D2 | 7 | TRUE |
A2-A0CM | A2-A0EQ | 6 | A2-A0EQ | 8 | TRUE |
AR-A0U4 | A2-A0EV | 7 | A2-A0EV | 9 | TRUE |
Record the high quality clinical information with samples sorted in the same order as the RNA samples.
clin.hq <- clin[match(idMap$rnaOrder[idMap$highQual], clin$Sample),]
Subset the high quality samples for further processing.
rna.hq <- as.matrix(rna[,idMap$rnaIndex[idMap$highQual]])
cnv.hq <- as.matrix(cnv[,idMap$cnvIndex[idMap$highQual]])
rownames(cnv.hq) <- cnv$Gene.Symbol
rownames(rna.hq) <- rna$CLID
The gene symbols are not entirely unique, 12 of them having duplicates.
symbol <- unlist(lapply(strsplit(x = as.character(rna$CLID), split = "|", fixed = TRUE), function(x) x[1]))
dupSymbolIndex <- duplicated(symbol)
sum(dupSymbolIndex)
## [1] 12
The entrez gene identifiers are unique. We will use entrez gene IDs as the accession number in downstream applications and the gene symbols will be useful to retain for network visualizations.
entrezgene <- unlist(lapply(strsplit(x = as.character(rna$CLID), split = "|", fixed = TRUE), function(x) x[2]))
dupEntrezGeneIndex <- duplicated(entrezgene)
sum(dupEntrezGeneIndex)
## [1] 0
Remove genes that exceed 30% of samples with missing values.
library(ggplot2)
gnumNA <- sapply(seq_len(nrow(rna.hq)), function(i) sum(is.na(rna.hq[i,])))
missingRateThresh <- 0.30
qplot(gnumNA / ncol(rna.hq)*100, geom = "bar") + xlab("% missing samples per gene")
No gene needs to be filtered out on this criteria. In fact, the degree of missingness is low in the CPTAC RNA-seq dataset. In future processing of the data, missing values will be imputed with mean zero after standardization.
stopifnot(!any(gnumNA > missingRateThresh*ncol(rna.hq)))
Limited sample size, high biological heterogeneity, amd computational limits of spaceMap
are compelling reasons to filter out genes that do not exhibit high dispersion across the samples. Restrict the analysis to mRNA that exceed the 90th percentile in standard deviation.
rna.sd <- apply(rna.hq, MARGIN = 1, FUN = sd, na.rm = TRUE)
keepRNAIndex <- rna.sd > quantile(x = rna.sd, probs = 0.90)
Include some cis genes that are positive controls of the 17q12 amplicon that may not have passed the high variance filter.
rna_entrez <- unlist(lapply(strsplit(x = as.character(rownames(rna.hq)), split = "|", fixed = TRUE), function(x) x[2]))
rna_symbol <- unlist(lapply(strsplit(x = as.character(rownames(rna.hq)), split = "|", fixed = TRUE), function(x) x[1]))
rna_symbol <- ifelse(rna_symbol == "?", rna_entrez, rna_symbol)
pc_index <- match(c("ERBB2", "ERBB4", "GRB7", "KRT17", "KRT19", "PNMT"), rna_symbol)
final_keep_rna_index <- sort(union(pc_index, which(keepRNAIndex)))
rnatofit <- rna.hq[final_keep_rna_index,]
The total number of mRNA expression considered for this analysis is 1657. Mean-center mRNA features and scale by the standard deviation. Follow this action by imputing mean-zero for missing values.
#transpose to be n x p dimension
rnatofit_t <- t(rnatofit)
#standardize features
rnatofit_t <- scale(rnatofit_t)
#zero-mean imputation for mRNA
rnatofit_t[is.na(rnatofit_t)] <- 0
rnatofit_std <- t(rnatofit_t)
BioBase::ExpressionSet
for mRNA data.Associate the correct sample meta data from TCGA with the mRNA expression data.
ltcga_data <- strsplit(x = colnames(rnatofit_std), split = ".", fixed = T)
tcga_vars <- c("Project", "Location", "Study_Participant", "Sample_Type", "Portion", "Plate_Order", "Center")
ltcga_data2 <- lapply(ltcga_data, function(l) {
names(l) <- tcga_vars
as.data.frame(t(data.frame(l)))
})
suppressPackageStartupMessages(library(data.table))
tcga_sample_info <- rbindlist(ltcga_data2)
kable(head(tcga_sample_info))
Project | Location | Study_Participant | Sample_Type | Portion | Plate_Order | Center |
---|---|---|---|---|---|---|
TCGA | A2 | A0CM | 01A | 31R | A034 | 07 |
TCGA | A2 | A0D2 | 01A | 21R | A034 | 07 |
TCGA | A2 | A0EQ | 01A | 11R | A034 | 07 |
TCGA | A2 | A0EV | 01A | 11R | A034 | 07 |
TCGA | A2 | A0EX | 01A | 21R | A034 | 07 |
TCGA | A2 | A0EY | 01A | 11R | A034 | 07 |
Combine this information with other clinical features previously imported.
pheno_data <- cbind(clin.hq, tcga_sample_info)
#simplify sample id now that we have extracted the information encoded therein (see previous block)
colnames(rnatofit_std) <- clin.hq$Sample
row.names(pheno_data) <- colnames(rnatofit_std)
Provide additional information on the sample covariates in the form of meta data.
suppressPackageStartupMessages(library(Biobase))
meta_data <- data.frame(labelDescription = c("Tissue Location - Study Participant",
"iTRAC experiment number",
"50 genes used to classify breast cancer subtypes",
"Estrogen Receptor Status",
"Progesterone Receptor Status",
"HER2 Subtype",
"Project",
"Location where the samples and the clincal metadata were collected",
"Study Participant",
"Sample and vial (01 = solid tumor, A = The first vial in a sequence of samples)",
"order of portion in a sequence of 100 - 120 mg samples",
"order of plate in a sequence of 96-well plates",
"Center that will receive the sample"),
row.names = colnames(pheno_data))
pheno_data <- new("AnnotatedDataFrame",
data=pheno_data, varMetadata=meta_data)
pheno_data
## An object of class 'AnnotatedDataFrame'
## rowNames: A2-A0CM A2-A0D2 ... E2-A15A (77 total)
## varLabels: Sample iTRAQ.Experiment.number ... Center (13 total)
## varMetadata: labelDescription
Identify the entrez gene and symbol features in the data. If a symbol is missing, replace it with its entrez gene id.
rna_entrez <- unlist(lapply(strsplit(x = as.character(rownames(rnatofit_std)), split = "|", fixed = TRUE), function(x) x[2]))
rna_symbol <- unlist(lapply(strsplit(x = as.character(rownames(rnatofit_std)), split = "|", fixed = TRUE), function(x) x[1]))
rna_symbol <- ifelse(rna_symbol == "?", rna_entrez, rna_symbol)
Import the Bioconductor package org.Hs.eg.db
regarding the homo sapien annotation. Prepare a function to extract mapped key values from entrez id features in the requested database.
suppressPackageStartupMessages(library(org.Hs.eg.db))
hs_feature_list <- function(db, eg) {
mk <- mappedkeys(db)
xx <- as.list(db[eg[eg %in% mk]])
}
Extract the chromosome, first start and last stop coordinates, as well as the strand of the mRNA features. Maintain these attributes in list format since some of the genes do not have complete information on the attributes (e.g. strand information missing).
start_chrloc <- hs_feature_list(org.Hs.egCHRLOC, rna_entrez)
rna_preattr <- list(chr = unlist(hs_feature_list(org.Hs.egCHR, rna_entrez)),
start = sapply( hs_feature_list(org.Hs.egCHRLOC, rna_entrez), function(x) min(abs(x))),
end = sapply(hs_feature_list(org.Hs.egCHRLOCEND, rna_entrez), function(x) max(abs(x))),
strand = ifelse(sapply(start_chrloc, function(x) sign(min(x))) == 1L, "+", "-"))
Reformat the list into a data.frame that can be stored in an Biobase::ExpressionSet
object.
suppressPackageStartupMessages(library(foreach))
lrna_node_atttributes <- foreach(id = rna_entrez, sym = rna_symbol) %do% {
id_attr <- sapply(rna_preattr, function(a) {
b <- a[names(a) %in% id]
ifelse(length(b), b, NA)
})
names(id_attr) <- names(rna_preattr)
id_attr_df <- as.data.frame(t(data.frame(id_attr)), stringsAsFactors = F, row.names = "")
id_attr_df$start <- as.integer(id_attr_df$start)
id_attr_df$end <- as.integer(id_attr_df$end)
id_attr_df$id <- id
id_attr_df$symbol <- sym
id_attr_df$cyto <- NA
id_attr_df$type <- "y"
id_attr_df$cytoscape <- sym
id_attr_df$chr <- paste0("chr", id_attr_df$chr)
id_attr_df[,c(5,1,2,3,6,7,4,8,9)]
}
rna_node_attributes <- as.data.frame(rbindlist(lrna_node_atttributes))
rownames(rna_node_attributes) <- rna_node_attributes$id
rna_node_attributes <- rna_node_attributes[,c(1,5,2,3,4,7)]
rna_node_attributes$start <- as.integer(rna_node_attributes$start)
rna_node_attributes$end <- as.integer(rna_node_attributes$end)
names(rna_node_attributes)[1] <- "id"
names(rna_node_attributes)[2] <- "alias"
kable(head(rna_node_attributes), row.names = F)
id | alias | chr | start | end | strand |
---|---|---|---|---|---|
143941 | TTC36 | chr11 | 118527432 | 118531197 | + |
54596 | L1TD1 | chr1 | 62194801 | 62212328 | + |
2151 | F2RL2 | chr5 | 76615481 | 76623434 | - |
23017 | FAIM2 | chr12 | 49866895 | 49903977 | - |
9856 | KIAA0319 | chr6 | 24544103 | 24646155 | - |
164284 | APCDD1L | chr20 | 58459100 | 58514938 | - |
Create the final mRNA ExpressionSet
object.
#subset high quality samples
idMapHq <- idMap[idMap$highQual,]
#assure sample order correct
stopifnot(identical(as.character(clin.hq$Sample), idMapHq$rnaOrder),
identical(as.character(clin.hq$Sample), idMapHq$cnvOrder))
suppressPackageStartupMessages(library(Biobase))
rownames(rnatofit_std) <- rownames(rna_node_attributes)
rnaset <- ExpressionSet(assayData = rnatofit_std,
phenoData = pheno_data,
featureData = AnnotatedDataFrame(data = rna_node_attributes))
rnaset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 1657 features, 77 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: A2-A0CM A2-A0D2 ... E2-A15A (77 total)
## varLabels: Sample iTRAQ.Experiment.number ... Center (13 total)
## varMetadata: labelDescription
## featureData
## featureNames: 143941 54596 ... 9760 (1657 total)
## fvarLabels: id alias ... strand (6 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
The protein quantitation data contains the 77 high quality samples plus a few normal tissue samples. The data has been normalized and missing data has been filtered away previously. Read in the data.
prot <- read.delim(file = file.path(basedir, "proteome-ratio-norm-noNA-unimodal-edit-header.gct"), header = TRUE, sep = "\t")
#exclude 3 normal control samples
prot <- prot[,-c((ncol(prot) - 2):ncol(prot))]
Reformat the sample-level ids from the protein file to match that found in the clinical meta data file.
pid <- gsub(pattern = ".", replacement = "-", fixed = TRUE,
x = substr(x = colnames(prot)[3:length(colnames(prot))],
start = 1, stop = 7))
head(pid)
## [1] "AO-A12D" "C8-A131" "AO-A12B" "BH-A18Q" "C8-A130" "C8-A138"
Duplicate sample IDs exist, which correspond to technical replicates. There are three sample ids with two technical replicates.
dupids <- pid[which(duplicated(pid))]
dupids
## [1] "AO-A12D" "C8-A131" "AO-A12B"
Identify the technical replicates.
#extract an extended id including the technical replicate id.
pid2 <- gsub(pattern = ".", replacement = "-", fixed = TRUE,
x = substr(x = colnames(prot)[3:length(colnames(prot))], start = 1, stop = 10))
#assure there is not duplicates among extended id
stopifnot(!anyDuplicated(pid2))
techreps <- sapply(dupids, function(di) pid2[grep(di, pid2)])
rownames(techreps) <- paste0("replicate-", 1:2)
kable(techreps)
AO-A12D | C8-A131 | AO-A12B | |
---|---|---|---|
replicate-1 | AO-A12D-01 | C8-A131-01 | AO-A12B-01 |
replicate-2 | AO-A12D-05 | C8-A131-32 | AO-A12B-34 |
Obtain the corresponding technical replicate indices in the extended protein sample ID variable pid2
.
trepids <- sapply(dupids, function(pat) grep(pattern = pat, x = pid2))
The correlation between technical replicates is sufficiently high. We replace each technical replicate measurements with the average.
tcors <- vector(mode = "numeric", length = 3)
avgtechreps <- matrix(data = NA, nrow = nrow(prot), ncol = 3)
colnames(avgtechreps) <- colnames(trepids)
rownames(avgtechreps) <- rownames(prot)
library(ggplot2)
for(i in 1:3) {
trep <- prot[,trepids[,i] + 2L]
avgtechreps[,i] <- rowMeans(trep)
prot[,trepids[,i] + 2L] <- avgtechreps[,i]
tcors[i] <- cor(trep[,1], trep[,2])
print(qplot(x = trep[,1], y = trep[,2]) + geom_point() +
xlab("replicate 1") + ylab("replicate 2") + theme_bw() +
annotate("text", x = -5, y = 5, label = paste0("Correlation: ", round(tcors[i], 4))))
}
The idMap object contains all 105 sample ids, excluding the 3 control samples, and indicates which is called a sufficiently high quality sample. Subset the 77 high quality samples. Assure the high quality samples agree between protein, CNA and RNA. The order of the samples for proteins does not natively match the RNA and CNA, which will be fixed here.
#remove duplicate protein sample ids
pidnodups <- pid[!duplicated(pid)]
#assure high quality samples agree between protein,rna,cna
stopifnot(!length(setdiff(pidnodups, idMapHq$cnvOrder)))
#change the order of the protein sample columns
idMapHq$protOrder <- pidnodups[match(idMapHq$cnvOrder, pidnodups)]
Subset the high quality samples from the original import of the protein quantitation data.
#pid
prot_hq <- prot[,c(1,2, match(idMapHq$protOrder, pid) + 2L)]
#assure ourselves that the sample ids match in the correct order of the protein expression matrix
tmp <- gsub(pattern = ".", replacement = "-", fixed = TRUE,
x = substr(x = colnames(prot_hq)[3:length(colnames(prot_hq))],
start = 1, stop = 7))
stopifnot(identical(tmp, idMapHq$cnvOrder))
stopifnot(identical(tmp, idMapHq$rnaOrder))
stopifnot(identical(tmp, idMapHq$protOrder))
#match the simplified sample names now that we order has been assured.
colnames(prot_hq)[3:length(colnames(prot_hq))] <- tmp
Filter out all proteins but those that have variance in the top 20%.
ed <- as.matrix(prot_hq[,3:ncol(prot_hq)])
#initial feature information
fd <- data.frame( RefSeqProteinID = prot_hq[,2], description = prot_hq[,1],
row.names = prot_hq[,2])
protallset <- ExpressionSet(assayData = ed,
phenoData = pheno_data,
featureData = AnnotatedDataFrame(data = fd))
pexp <- exprs(protallset)
#no missing proteins
stopifnot(!any(is.na(pexp)))
sd_pexp <- apply(X = pexp, MARGIN = 1, FUN = sd)
keep_prot_index <- sd_pexp > quantile(x = sd_pexp, probs = 0.80)
protsdf <- protallset[which(keep_prot_index),]
protsdf
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 1595 features, 77 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: A2-A0CM A2-A0D2 ... E2-A15A (77 total)
## varLabels: Sample iTRAQ.Experiment.number ... Center (13 total)
## varMetadata: labelDescription
## featureData
## featureNames: NP_112598 NP_001714 ... NP_997625 (1595 total)
## fvarLabels: RefSeqProteinID description
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
As seen in the ExpressionSet object
, we have filtered out samples and protein features, added the phenotypic data, but have yet to add gene coordinates of the feature level information.
Since the phenotypic data for the samples has been prepared earlier, arrange to annotate the protein features.
suppressPackageStartupMessages(library(Biobase))
suppressPackageStartupMessages(library(mygene))
## Warning: package 'GenomicRanges' was built under R version 3.4.1
p2r77map <- queryMany(qterms = featureNames(protsdf), scopes = "refseq",
fields = c("entrezgene", "symbol"))
## Querying chunk 1
## Querying chunk 2
## Finished
## Pass returnall=TRUE to return lists of duplicate or missing query terms.
p2r77map2 <- data.frame(refseq = p2r77map@listData$query, entrezgene = as.character(p2r77map@listData$entrezgene),
symbol = p2r77map@listData$symbol, stringsAsFactors = FALSE)
The RefSeq ID’s of the protein features is non-missing and unique.
stopifnot(!any(is.na(p2r77map2$refseq)))
stopifnot(!any(duplicated(p2r77map2$refseq)))
On the other hand, some of the entrez gene ID’s are missing.
missing_entrez_gene_ids <- p2r77map2$refseq[is.na(p2r77map2$entrezgene)]
length(missing_entrez_gene_ids)
## [1] 29
And duplication of entrez gene ID implies there are 194 isoforms.
sum(duplicated(p2r77map2$entrezgene))
## [1] 194
Identify the protein features that are not missing entrez gene identifiers so that we can find their gene coordinates. List the isoforms for each of the 1400 features that ought to have gene coordinates.
p2r77map3 <- p2r77map2[!is.na(p2r77map2$entrezgene),]
isoforms <- split(p2r77map3$refseq, f = as.factor(p2r77map3$entrezgene))
#unique entrez gene identifiers
egu <- names(isoforms)
Add the gene coordinate information as in the mRNA case where we have information available.
get_protein_gene_coords <- function(egu, isoforms) {
#find gene coordinate information for each unique entrez gene identifier
yyids <- egu
yy_chr <- hs_feature_list(org.Hs.egCHR, yyids)
#assign to X
yy_chr[[which(sapply(yy_chr, length) > 1)]] <- "X"
yy_chr <- unlist(yy_chr)
start_chrloc <- hs_feature_list(org.Hs.egCHRLOC, yyids)
#get minimum start site
min_start_chrloc <- sapply(start_chrloc, function(x) min(abs(x)))
#strand information
yy_strand <- ifelse(sapply(start_chrloc, function(x) sign(min(x))) == 1L, "+", "-")
#get maximum end site
end_chrloc <- hs_feature_list(org.Hs.egCHRLOCEND, yyids)
max_start_chrloc <- sapply(end_chrloc, function(x) max(abs(x)))
#get symbols
yy_symbols <- unlist(hs_feature_list(org.Hs.egSYMBOL, yyids))
#for each isoform provide gene coordinate information
library(foreach)
yy_node_attributes <- foreach(isos = isoforms, eg = names(isoforms), .combine= 'rbind') %do% {
#if symbol does not exist for isoform, provide refseq id
if (is.na(yy_symbols[eg])) {
cys <- isos
} else {
cys <- yy_symbols[eg]
}
data.frame(id = isos,
chr = paste0("chr", yy_chr[eg]),
start = min_start_chrloc[eg],
end = max_start_chrloc[eg],
symbol = yy_symbols[eg],
cyto = NA,
cytoscape = cys,
strand = yy_strand[eg],
type = "y",
stringsAsFactors = FALSE,
row.names = NULL)
}
stopifnot(!any(is.na(yy_node_attributes$id)))
#add missing feature values for those protein features that do not have gene coordinates
missing_feats <- data.frame(id = missing_entrez_gene_ids,
chr = NA,
start = NA,
end = NA,
symbol = NA,
cyto = NA,
cytoscape = missing_entrez_gene_ids,
strand = NA,
type = "y",
stringsAsFactors = FALSE,
row.names = NULL)
yy_node_attributes <- rbind(yy_node_attributes,missing_feats)
#sort to match the feature order of the expression data
yy_node_attributes[match(p2r77map2$refseq, yy_node_attributes$id),]
}
prot_gene_coordinates <- get_protein_gene_coords(egu, isoforms)
rownames(prot_gene_coordinates) <- prot_gene_coordinates$id
tmp <- pData(featureData(protsdf))
prot_node_attributes <- prot_gene_coordinates[match(tmp$RefSeqProteinID, prot_gene_coordinates$id),]
prot_node_attributes$description <- as.character(tmp$description)
prot_node_attributes <- prot_node_attributes[,c(1,7,2,3,4,8,10)]
prot_node_attributes$start <- as.integer(prot_node_attributes$start)
prot_node_attributes$end <- as.integer(prot_node_attributes$end)
names(prot_node_attributes)[1] <- "id"
names(prot_node_attributes)[2] <- "alias"
kable(head(prot_node_attributes), row.names = F)
id | alias | chr | start | end | strand | description |
---|---|---|---|---|---|---|
NP_112598 | EPPK1 | chr8 | 143857318 | 143878464 | - | epiplakin |
NP_001714 | DST | chr6 | 56457986 | 56954628 | - | bullous pemphigoid antigen 1 isoform 1e precursor |
NP_002465 | MYH11 | chr16 | 1361145 | 15857030 | - | myosin-11 isoform SM1A |
NP_001035202 | MYH11 | chr16 | 1361145 | 15857030 | - | myosin-11 isoform SM2B |
NP_001070654 | MYH14 | chr19 | 50203627 | 50310544 | + | myosin-14 isoform 1 |
NP_079005 | MYH14 | chr19 | 50203627 | 50310544 | + | myosin-14 isoform 2 |
We now form the | final ` | Expressi | onSet` for t | he protein q | uantitati | on data. Be sure to mean-center protein features and scale by the standard deviation. |
protset_no_std <- ExpressionSet(assayData = exprs(protsdf),
phenoData = pheno_data,
featureData = AnnotatedDataFrame(data =prot_node_attributes))
#n x p format
prot_expr_t <- t(exprs(protsdf))
#standardize and return to p x n format
prot_expr_std <- t(scale(prot_expr_t))
protset <- ExpressionSet(assayData = prot_expr_std,
phenoData = pheno_data,
featureData = AnnotatedDataFrame(data =prot_node_attributes))
In the copy number alteration data, we found that even after processing with ‘Fixed-order clustering’ (Pei Wang, 2010), there were still groupings of highly-correlated copy number alteration intervals (CNA). We seek to further cluster these CNA so that the extreme multicolinearity does not render the multivariate regression unstable.
We will use a density-based clustering algorithm because it has a number of virtues:
MinPnts
Parmater set to 2 to find even small blocks of multicolinearity.The data has been previously loaded in the pwf
environment. Create a temporary matrix X
to store the CNA data.
rownames(pwf$cnv.data) <- pwf$cnv.pos$Cyto
X <- t(pwf$cnv.data)
dim(X)
## [1] 77 1730
Match the samples order to that of the protein and mRNA data.
sidx <- match(idMap$cnvOrder[idMap$highQual], substring(colnames(pwf$cnv.data), 6, 12))
X <- X[sidx,]
Assure ourselves again that the sample names match up now that we are working with the CNA output from CLAC.
cna_sample_names <- substring(colnames(pwf$cnv.data), 6, 12)[sidx]
rna_sample_names <- sampleNames(rnaset)
stopifnot(identical(cna_sample_names, rna_sample_names))
Many variables have the same cytoband interval names.
sum(duplicated(colnames(X)))
## [1] 862
Load the fpc
package for using the dbscan
algorithm. Define the dissimilarity to be \(1 - abs(\rho(X))\), where \(\rho(X)\) is the spearman’s correlation of the CNA.
library(fpc)
dR <- 1 - abs(cor(X))
hist(dR[upper.tri(dR)], col = "#31a354", xlab = "Multicolinearity dissimilarity (|1 - cor(X,Y)|)", main = "")
We require a minimally small cluster of size 2 and a reacheability radius of 1e-3
.
clout <- dbscan(data = dR, MinPts = 2, eps = 1e-3, method = "dist")
The distribution on cluster size shows that most CNA are classified as sufficiently distinct from other CNA. This is desired behavior. Most extreme colinearity cases are limited to small clusters.
barplot(table(table(clout$cluster)[]), col = "#31a354", xlab = "Cluster Size", ylab = "Frequency", main = "")
To get a sense of how much spatial correlation affects the multicolinear clusters, we compute a distance on the CNA cytoband names. It is clear that most multicolinear clusters have the same cytoband label suggesting those intervals with similar distance will be clustered together.
suppressWarnings(library(stringdist))
library(foreach)
clusterDistr <- as.data.frame(table(clout$cluster), stringAsFactors = FALSE)
#Cluster `0` corresponds to sufficiently independent CNA's. Omit it from the multi-colinear analysis.
mcolinear <- clusterDistr$Var1[2:nrow(clusterDistr)]
nameDist <- foreach(cl = mcolinear, .combine = 'c', .packages = c("stringdist")) %do% {
cn <- colnames(X[,clout$cluster == cl])
cnIdx <- seq_along(cn)
nds <- length(cn) * (length(cn) - 1) / 2
sds <- numeric(nds)
cntr <- 1
for(i in cnIdx[-length(cnIdx)]) {
for(j in (i + 1):length(cnIdx)) {
#message(i,",", j)
sds[cntr] <- stringdist(cn[i], cn[j])
cntr <- cntr + 1
}
}
mean(sds)
}
hist(nameDist, col="#31a354", xlab = "restricted Damerau-Levenshtein distance of CNA names", main = "" )
We validate that the eps
parameter behaves as we anticipated by computing the mean correlation of each multicolinear cluster.
clCorMeans <- foreach(cl = mcolinear, .combine = 'c') %do% {
clcor <- cor(X[,clout$cluster == cl])
mean(1 - clcor[upper.tri(clcor)])
}
hist(clCorMeans, col="#31a354", xlab = "Dissimilarity (|1 - Cor(X,Y)|)", main = "" )
We will use the average of each mulitcolinear cluster to collapse (i.e. replace) the multi-colinear CNA into a single representative vector for future data analysis.
avgMultiCol <- foreach(cl = mcolinear, .combine = 'cbind') %do% {
rowMeans(X[,clout$cluster == cl])
}
Almost all of these multicolinear CNA are contiguous. We resolve the interval names of the multicolinear CNA by either (1) collapsing the identical interval names or (2) changing the interval names to reflect an expanded contiguous range of the multicolinear CNA.
namesMultiCol <- foreach(i = seq_along(mcolinear)) %do% {
colnames(X[,clout$cluster == mcolinear[i]])
}
Those multicolinear CNA with interval names that are not identical, but are contiguous are renamed to reflect the contiguity. Only one of the multicolinear entries is not contiguous (index 59); but they happen to be in the same chromosome and the same arm. We make note of this by separating the two cytoband regions with a semi-colon. Lastly we rename the averaged multicolinear CNA with the representative interval name that spans two or more multicolinear CNA variables.
toResolve <- namesMultiCol
namesMultiCol <- sapply(toResolve, function(x) paste0(unique(sort(unlist(strsplit(x=x, split = "-")))), collapse = "-"))
colnames(avgMultiCol) <- unlist(namesMultiCol)
Cluster 0
corresponds to sufficiently independent CNA’s; these are combined with the collapsed multicolinear CNA’s.
#assure that combining avg. multi-colinear cna intervals with the non-multi-colinear intervals are matched by sample
stopifnot(identical(rownames(t(pwf$cnv.data)[sidx,]), rownames(avgMultiCol)))
cna <- t(cbind(t(pwf$cnv.data)[sidx,clout$cluster == 0],avgMultiCol))
The dimension of CNA has now been reduced to 1662 from 1730 under this setting.
dim(cna)
## [1] 1662 77
Assure that the averaging of multicolinear features preserved the sample order.
cna_sample_names <- substring(colnames(cna), 6, 12)
stopifnot(identical(cna_sample_names, substring(rownames(avgMultiCol), 6, 12)))
rna_sample_names <- sampleNames(rnaset)
stopifnot(identical(cna_sample_names, rna_sample_names))
Create unique ID’s for the genomic coordinates by appending the position coordinates in a
cnapos <- pwf$cnv.pos
cnaids <- apply( cnapos[ , c(1,2,3) ] , 1 , paste , collapse = "_" )
#add multicolinear ids
cnaids <- paste(cnaids,clout$cluster, sep = "_")
stopifnot(!anyDuplicated(cnaids))
str(cnaids)
## chr [1:1730] "1_3218610_6972307_0" "1_6990271_14007849_0" ...
Confirm that multicolinear CNA have no direct overlaps in the genome.
suppressPackageStartupMessages(library(GenomicRanges))
gr <- GRanges(seqnames = Rle(paste0("chr", cnapos$ch.interval.iter2)),
ranges = IRanges(start = cnapos$begin.nuc.iter2, end = cnapos$end.nuc.iter2,
names = cnapos$Cyto))
no_overlap <- all(sapply(mcolinear, function(mc) {
grcl <- gr[clout$cluster == mc,]
grclu <- unique(grcl)
ov <- findOverlaps(grcl, grclu)
identical(ov@to, ov@from)
}))
stopifnot(no_overlap)
Improve the cytoband information to be as non-redundant as possible.
nr_cytobands <- function(cb) {
split_cb <- unlist(strsplit(x = cb, split = "-"))
if(length(split_cb) != 2) {
return(cb)
}
split_cb2 <- unlist(strsplit(x = split_cb, split = "p|q"))
pq1 <- regmatches(split_cb[1], m= regexpr(pattern = "[pq]+", text = split_cb[1]))
pq2 <- regmatches(split_cb[2], m= regexpr(pattern = "[pq]+", text = split_cb[2]))
same_arm <- pq1 == pq2
chr1 <- split_cb2[1]
chr2 <- split_cb2[3]
same_chr <- chr1 == chr2
loc1 <- as.numeric(split_cb2[2])
loc2 <- as.numeric(split_cb2[4])
same_loc <- loc1 == loc2
if( same_chr & same_arm & same_loc) {
paste0(chr1, pq1, loc1)
} else if (same_chr & same_arm & !same_loc) {
paste0(chr1, pq1, min(loc1,loc2), "-", max(loc1,loc2))
} else if (same_chr & !same_arm & same_loc) {
paste0(chr1, pq1, loc1, "-", pq2, loc2)
} else if (same_chr & !same_arm & !same_loc) {
paste0(chr1, pq1, loc1, "-", chr2, pq2, loc2)
} else if (!same_chr) {
paste0(chr1, pq1, loc1, "-", chr2, pq2, loc2)
}
}
tmp <- data.frame(original = cnapos$Cyto, nr = sapply(cnapos$Cyto, nr_cytobands))
nr_cytoband_ids <- as.character(tmp$nr)
Format the CNA feature’s genomic coordinates in the same fashion as the protein and mRNA features.
cna_node_attributes <- data.frame(id = cnaids,
alias = nr_cytoband_ids,
chr = paste0("chr", cnapos$ch.interval.iter2),
start = as.integer(cnapos$begin.nuc.iter2),
end = as.integer(cnapos$end.nuc.iter2),
stringsAsFactors = FALSE)
#combine genomic coordinates in the same manner as the CNA data (inclusive of multicolinear features )
cna_node_attributes2 <- cna_node_attributes[c(which(clout$cluster == 0), match(mcolinear, clout$cluster)),]
rownames(cna_node_attributes2) <- cna_node_attributes2$id
kable(head(cna_node_attributes2), row.names = F)
id | alias | chr | start | end |
---|---|---|---|---|
1_3218610_6972307_0 | 1p36.31-36.32 | chr1 | 3218610 | 6972307 |
1_6990271_14007849_0 | 1p36.21-36.31 | chr1 | 6990271 | 14007849 |
1_14028633_28692959_0 | 1p35.3-36.21 | chr1 | 14028633 | 28692959 |
1_28695597_30387702_0 | 1p35.2-35.3 | chr1 | 28695597 | 30387702 |
1_30392887_31908168_0 | 1p35.2 | chr1 | 30392887 | 31908168 |
1_31909072_32155716_0 | 1p35.2 | chr1 | 31909072 | 32155716 |
Format the CNA data into an ExpressionSet
object. Be sure to mean-center CNA features and scale by the standard deviation.
colnames(cna) <- idMapHq$cnvOrder
rownames(cna) <- cna_node_attributes2$id
cnaset <- ExpressionSet(assayData = t(scale(t(cna))),
phenoData = pheno_data,
featureData = AnnotatedDataFrame(data = cna_node_attributes2))
cnaset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 1662 features, 77 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: A2-A0CM A2-A0D2 ... E2-A15A (77 total)
## varLabels: Sample iTRAQ.Experiment.number ... Center (13 total)
## varMetadata: labelDescription
## featureData
## featureNames: 1_3218610_6972307_0 1_6990271_14007849_0 ...
## 18_46953239_47159916_49 (1662 total)
## fvarLabels: id alias ... end (5 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
Save the ExpressionSet
objects, which contain the data to train the spaceMap model and perform downstream network analysis illustrated in other sections of this website.
basedir2 <- "data/"
#rna
saveRDS(rnaset, file = file.path(basedir2, "rna-expression-set.rds"))
#protein
saveRDS(protset, file = file.path(basedir2, "prot-expression-set.rds"))
#cna
saveRDS(cnaset, file = file.path(basedir2, "cna-expression-set.rds"))
#all
save(rnaset, protset, cnaset, file = file.path(basedir, "bccptac-expression-sets.RData"))
save.image(file = file.path(basedir, "data-cleaning-environment.RData"))
Please see the Model Fitting article for the next step in the analysis.
sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] stringdist_0.9.4.6 fpc_2.1-10 mygene_1.12.0
## [4] GenomicFeatures_1.28.4 GenomicRanges_1.28.5 GenomeInfoDb_1.12.2
## [7] data.table_1.10.4 ggplot2_2.2.1 topGO_2.28.0
## [10] SparseM_1.77 GO.db_3.4.1 graph_1.54.0
## [13] foreach_1.4.3 org.Hs.eg.db_3.4.1 AnnotationDbi_1.38.2
## [16] IRanges_2.10.3 S4Vectors_0.14.4 Biobase_2.36.2
## [19] BiocGenerics_0.22.0 knitr_1.17
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-6 matrixStats_0.52.2
## [3] bit64_0.9-7 RColorBrewer_1.1-2
## [5] httr_1.3.1 rprojroot_1.2
## [7] prabclus_2.2-6 tools_3.4.0
## [9] backports_1.1.0 R6_2.2.2
## [11] rpart_4.1-11 Hmisc_4.0-3
## [13] DBI_0.7 lazyeval_0.2.0
## [15] colorspace_1.3-2 trimcluster_0.1-2
## [17] nnet_7.3-12 gridExtra_2.3
## [19] bit_1.1-12 curl_2.8.1
## [21] compiler_3.4.0 chron_2.3-50
## [23] htmlTable_1.9 DelayedArray_0.2.7
## [25] rtracklayer_1.36.4 diptest_0.75-7
## [27] scales_0.5.0 checkmate_1.8.3
## [29] DEoptimR_1.0-8 robustbase_0.92-7
## [31] mvtnorm_1.0-6 stringr_1.2.0
## [33] digest_0.6.12 Rsamtools_1.28.0
## [35] foreign_0.8-69 rmarkdown_1.6
## [37] XVector_0.16.0 base64enc_0.1-3
## [39] pkgconfig_2.0.1 htmltools_0.3.6
## [41] highr_0.6 htmlwidgets_0.9
## [43] rlang_0.1.2 RSQLite_2.0
## [45] jsonlite_1.5 mclust_5.3
## [47] BiocParallel_1.10.1 acepack_1.4.1
## [49] RCurl_1.95-4.8 magrittr_1.5
## [51] modeltools_0.2-21 GenomeInfoDbData_0.99.0
## [53] Formula_1.2-2 Matrix_1.2-11
## [55] Rcpp_0.12.12 munsell_0.4.3
## [57] proto_1.0.0 sqldf_0.4-11
## [59] stringi_1.1.5 yaml_2.1.14
## [61] MASS_7.3-47 SummarizedExperiment_1.6.3
## [63] zlibbioc_1.22.0 flexmix_2.3-14
## [65] plyr_1.8.4 grid_3.4.0
## [67] blob_1.1.0 lattice_0.20-35
## [69] Biostrings_2.44.2 splines_3.4.0
## [71] codetools_0.2-15 biomaRt_2.32.1
## [73] XML_3.98-1.9 evaluate_0.10.1
## [75] latticeExtra_0.6-28 gtable_0.2.0
## [77] kernlab_0.9-25 gsubfn_0.6-6
## [79] class_7.3-14 survival_2.41-3
## [81] tibble_1.3.4 iterators_1.0.8
## [83] GenomicAlignments_1.12.2 memoise_1.1.0
## [85] cluster_2.0.6
Copyright © 2017 Regents of the University of California. All rights reserved.