Network analysis toolkit

The goal of this vignette is to illustrate the network analysis toolkit in the spaceMap analysis pipeline. We use the fitted spaceMap networks to the bcpls data as an example.

Input

Before conducting the network analysis, you should have completed the model fitting step such that you have a network ensemble or a final network to analyze (e.g. output from bootVote). To fully realize the capability of the toolkit, you should also have annotations for the nodes.

Load the bcpls list.

library(spacemap)
data("bcpls")
#Attach "bcpls" data objects to the R Global Environment
attach(bcpls)

Next we list the contents of bcpls.

names(bcpls)

## [1] "net"   "yinfo" "xinfo" "bdeg"  "go2eg"

The following bullets go into detail of how each item of bcpls is used by the toolkit.

net: a list of two adjacency matrices that is output from bootVote. The protein–protein edges are encoded with 1’s in net$yy. Likewise, the CNA–protein edges are encoded with 1’s in net$xy.
yinfo: a data.frame of annotations, whose rows should correspond to the rows of and net$yy. We organize the columns by level of necessity for the toolkit. Columns with missing values are denoted with ‘NA’.
- required variables
  - id: a character vector for uniquely identifying each node (e.g. an Entrez gene accession).
- required variables for cis/trans identification
  - chr: a character vector for the chromosome numbe (e.g. “chr17”).
  - start: an integer for the starting positiion of the genomic feature (e.g. transcription start site).
  - end: an integer for the terminal positiion of the genomic feature (e.g. stop codon location).
- optional variables
  - alias: a character vector of human-readable labels for id column. For example, a gene symbol. Two protein isoforms must have different id but may have the same gene symbol.
  - strand: a vector of ’+/-/*‘indicating the 5’ to 3’ DNA strand direction.
  - description: additional functional description of gene product
  - : other variables of interest will be exported to visulization software such Cytoscape.

See the contents of yinfo for the BCPLS data set below.

head(yinfo,5)

id	alias	chr	start	end	strand	description
NP_112598	EPPK1	chr8	144939491	144952632	-	epiplakin
NP_001714	DST	chr6	56322784	56819426	-	bullous pemphigoid antigen 1 isoform 1e precursor
NP_002465	MYH11	chr16	15796991	15950887	-	myosin-11 isoform SM1A
NP_001035202	MYH11	chr16	15796991	15950887	-	myosin-11 isoform SM2B
NP_001070654	MYH14	chr19	50706884	50813801	+	myosin-14 isoform 1

xinfo: Same required variables as yinfo but the optional variables differ. See the contents of xinfo for the BCPLS data set below.

head(xinfo,5)

id	alias	chr	start	end
1_3218610_6972307_0	1p36.31-36.32	chr1	3218610	6972307
1_6990271_14007849_0	1p36.21-36.31	chr1	6990271	14007849
1_14028633_28692959_0	1p35.3-36.21	chr1	14028633	28692959
1_28695597_30387702_0	1p35.2-35.3	chr1	28695597	30387702
1_30392887_31908168_0	1p35.2	chr1	30392887	31908168

go2eg: A list of functional mappings where each key is associated with a GO ID and the value is a character vector specifying all the proteins that map to this GO ID. Proteins can map to multiple GO terms. The mapping go2eg will be used for conducting functional enrichment analysis, but is otherwise optional.

head(go2eg,2)

## $`GO:0000122`
##  [1] "NP_009330"    "NP_644671"    "NP_001120"    "NP_056082"   
##  [5] "NP_001041666" "NP_005321"    "NP_055398"    "NP_036233"   
##  [9] "NP_001441"    "NP_001257972" "NP_001177666" "NP_002492"   
## [13] "NP_005587"    "NP_061119"    "NP_002022"    "NP_060733"   
## [17] "NP_005310"    "NP_005312"    "NP_005311"    "NP_005313"   
## [21] "NP_006051"    "NP_001002295" "NP_116027"    "NP_005987"   
## [25] "NP_005985"    "NP_066285"    "NP_001138627" "NP_073576"   
## [29] "NP_001001548" "NP_004022"    "NP_000336"    "NP_001963"   
## [33] "NP_005323"    "NP_004771"    "NP_001744"    "NP_006262"   
## [37] "NP_620590"   
## 
## $`GO:0000165`
##  [1] "NP_001020029" "NP_000338"    "NP_003117"    "NP_006149"   
##  [5] "NP_004439"    "NP_005219"    "NP_005226"    "NP_003740"   
##  [9] "NP_001356"    "NP_005535"    "NP_057726"    "NP_001553"   
## [13] "NP_000213"    "NP_001744"    "NP_000791"    "NP_620590"

We can also create a biologically informative alias for each GO ID, which will be useful for reporting enriched hubs/modules later.

library(AnnotationDbi)
library(GO.db)
process_alias <- AnnotationDbi::Term(names(go2eg))
head(process_alias,3)

##                                                             GO:0000122 
## "negative regulation of transcription from RNA polymerase II promoter" 
##                                                             GO:0000165 
##                                                         "MAPK cascade" 
##                                                             GO:0000278 
##                                                   "mitotic cell cycle"

bdeg: A list from bootVote. It is optional for network analysis, but is useful for prioritizing both CNA- and protein- hubs. If bdeg is provided in the function rankHub, then hubs will be prioritized according to their average degree rank, so that highly ranked hubs would consistently have a large degree across the network ensemble. The contents of bdeg are two matrices yy and xy where the bth row is:
- yy[b,] an integer vector representing the degree distribution of the proteins in the network fitted on the $b$th bootstrap replicate.
- xy[b,] an integer vector representing the degree distribution of the CNAs.

Map Annotations

Convert the final network into an igraph object and map the annotations onto the network.

library(igraph)
ig <- spacemap::adj2igraph(yy = net$yy, xy = net$xy, yinfo = yinfo, xinfo = xinfo)

If we query the attribute names of the nodes in the graph, we notice that the columns of xinfo and yinfo have been applied.

vertex_attr_names(graph = ig)

## [1] "name"        "alias"       "chr"         "start"       "end"        
## [6] "strand"      "description" "levels"

The igraph package has a number of ways to access the annotation information. For example, if we wish to confirm the chromosome location of ERBB2, we can easily query:

vertex_attr(graph = ig, name = "chr", index = V(ig)[alias %in% "ERBB2"])

## [1] "chr17"

Hub Analysis

We first prioritize the CNA- and protein- hubs. If the bdeg argument is specified, then we rank the hubs according to the average degree rank. Accordingly, the most highly ranked hubs will have the most consistently high degree across network ensemble.

To rank the protein nodes, use the rankHub command and simply specify the level = "y" argument.

ig <- rankHub(ig = ig, bdeg = bdeg$yy, level = "y")

To rank the CNA nodes, specify the level = "x" argument.

ig <- rankHub(ig = ig, bdeg = bdeg$xy, level = "x")

Alternatively, if the bdeg argument is not available, the ranking is according to degree in the final network (coded by “ig”).

tmp <- rankHub(ig = ig,  level = "x")

Identify cis and trans

Next label $x-y$ edges as being regulated in cis or in trans. The GenomicRanges package and the genomic coordinates chr,start,end columns of xinfo and yinfo are required for this step. If you don’t have the package GenomicRanges installed yet, try:

## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")
library(GenomicRanges)

Now we can label the $x-y$ edges with either “cis” or “trans” in the cis_trans edge attribute of ig.

ig <- cisTrans(ig = ig, level = "x-y")

Report

We then report a well-organized table, as seen in Table 3 of the spaceMap publication. The top argument allows us to control how many hubs are displayed.

xhubs <- reportHubs(ig, top = 50, level = "x")

hub	# cis/ # trans	Potential # cis	cis genes
17q21.32 (46-46 Mb)	1 / 98	14	LRRC59
5q34 (160-170 Mb)	0 / 129	0	–
17q12 (38-38 Mb)	5 / 47	34	ERBB2, GRB7, KAT2A, PNMT, MIEN1
10p15.1-15.3 (0.42-4 Mb)	0 / 86	10	–
15q13.1-15.1 (29-43 Mb)	2 / 34	20	IVD, JMJD7
16q22.1-22.2 (68-71 Mb)	2 / 70	13	CDH1, CYB5B
11q13.5-14.1 (76-78 Mb)	4 / 7	7	RSF1, PGM2L1, GAB2, AAMDC
17q23.1-23.2 (58-59 Mb)	2 / 19	8	HEATR6, PTRH2
16p12.1-12.3 (19-27 Mb)	2 / 19	17	KNOP1, DCTPP1
5p12-5q11.1 (45-50 Mb)	1 / 24	4	NNT
16q22.1 (68-68 Mb)	0 / 16	8	–
17p11.2 (20-21 Mb)	0 / 5	11	–
17p11.2 (20-20 Mb)	1 / 6	11	SHMT1
12q21.1 (74-74 Mb)	0 / 4	2	–
8q21.2-22.1 (87-96 Mb)	1 / 0	10	RIDA

Similarly, we can report the top 10 protein hubs, their degrees in the final network, and a description of each hub, if the description column was specified in yinfo.

yhubs <- reportHubs(ig, top = 10, level = "y")

hub	degree	description
ARGLU1	5	arginine and glutamate-rich protein 1
SERPINA1	8	alpha-1-antitrypsin precursor
KLKB1	5	plasma kallikrein preproprotein
MMP2	5	72 kDa type IV collagenase isoform a preproprotein
C6	4	complement component C6 precursor
APOA1	6	apolipoprotein A-I preproprotein
HPX	4	hemopexin precursor
ALB	5	serum albumin preproprotein
C5	3	complement C5 preproprotein
CFI	3	complement factor I preproprotein

GO-neighbor percentage

A CNA neighborhood comprises all protein nodes that are directly connected to a CNA hub by an edge. CNA neighborhoods represent direct perturbations to the proteome by amplifications or deletions in the DNA. To quantify their functional relevance, we compute a score called the GO-neighbor percentage. Two protein nodes are called GO-neighbors if they share a common GO term in the same CNA neighborhood. We postulate that a high percentage of GO-neighbors within a CNA neighborhood associates the CNA hub with more functional meaning. These scores, as presented in Figure 5 of the publication, can be generated with a GO mapping to the proteins as follows.

hgp <- xHubEnrich(ig = ig, go2eg = go2eg)

hub	degree	neighbor_percentage
10p15.1-15.3	86	77.90698
11q13.5-14.1	11	45.45455
12q21.1	4	50.00000
15q13.1-15.1	36	63.88889
16p12.1-12.3	21	71.42857
16q22.1	16	43.75000
16q22.1-22.2	72	79.16667
17p11.2	7	71.42857
17p11.2	5	60.00000
17q12	52	78.84615
17q21.32	99	72.72727
17q23.1-23.2	21	33.33333
5q34	129	77.51938
5p12-5q11.1	25	60.00000

Module Analysis

There are many criteria to define modules of a network. This toolkit allows users to import the module membership information by themselves (see mods argument in modEnrich).

In the spaceMap publication, we use the edge-betweenness algorithm in igraph.

library(igraph)
mods <- cluster_edge_betweenness(ig)

The main goal of module analysis is identifying modules that are functionally enriched. The modEnrich function will test for significantly over-represented GO-terms (or any other valid functional grouping) within a module using hyper-geometric tests.

In the current example, only the protein nodes have functional mapping and we specify this through the levels = "y" argument. If both predictors and responses have functional mapping in the go2eg argument, then we can specify levels = c("y","x"). Other arguments are available to control the enrichment testing; see modEnrich for more details.

outmod <- modEnrich(ig = ig, mods = mods, levels = "y", 
                    go2eg = go2eg, 
                    prefix = "P", process_alias = process_alias)

The output of the module analysis is a list of 3 objects.

names(outmod)

## [1] "ig"   "etab" "eraw"

ig is the input igraph network object updated with a “process_id” attribute for nodes affiliated with a significant GO-term. The “process_id” and “module” attributes together are useful for visualizing which nodes are enriched for a specific biological function.
etab is the polished module enrichment table to report significant GO terms, the representation of the GO term in the module relative to the size of the GO term, and which CNA hubs belong to the module. The top ten hits as in Table S.5 of the spaceMap publication’s supplementary materials are as follows:

outmod$etab[1:10,]

Module (size)	GO Category	GO Obs./ Total	X-hubs (hits)
P1 (244)	small molecule metabolic process	61/280	5q34 (111); 10p15.1-15.3 (79)
	oxidation-reduction process	28/106
	transcription from RNA polymerase II promoter	9/20
	protein homotetramerization	9/20
P10 (28)	transcription, DNA-templated	11/77
	regulation of transcription, DNA-templated	7/41
P2 (25)	negative regulation of cell growth	6/22
	muscle organ development	5/17
P3 (57)	cell proliferation	6/37	17q12 (50)
P4 (97)	ion transmembrane transport	10/21	17q21.32 (79)

eraw contains details for each (module, GO-term) pair that was subjected to the hyper-geometric test. This output gives the user more control by reporting all tests, the relative over-representation of a GO-term in that module, the raw P-value, and the adjusted P-value.

outmod$eraw[1:5,]

Module	process_id	process_alias	GO Obs./Total	P_value	Adjusted_P_value
P1	GO:0044281	small molecule metabolic process	61/280	0.0002922	0.0139429
P1	GO:0055114	oxidation-reduction process	28/106	0.0008225	0.0327023
P1	GO:0006366	transcription from RNA polymerase II promoter	9/20	0.0010725	0.0398029
P1	GO:0051289	protein homotetramerization	9/20	0.0010725	0.0398029
P1	GO:0034641	cellular nitrogen compound metabolic process	17/61	0.0050060	0.1248049

Export for Visualization

There are many tools for network visualization. In the publication of spaceMap, we exported the annotated igraph network object ig to the “graphml” format, which maintained all the attributes associated with nodes when read into Cytoscape for visualization.

filename <- "~/repos/neta-bcpls/neta/spacemap-prot-boot-vote.graphml"
#only keep vertices that have edges for visualization
vis <- delete_vertices(graph = outmod$ig, v = V(outmod$ig)[igraph::degree(outmod$ig) == 0])
igraph::write_graph(graph = vis, file = filename, format = "graphml")

Here we list all the attributes associated with the nodes that can be used in tandem with Cytoscape’s filtering functions to identify specific nodes of interest.

vertex_attr_names(outmod$ig)

##  [1] "name"             "alias"            "chr"             
##  [4] "start"            "end"              "strand"          
##  [7] "description"      "levels"           "rank_hub"        
## [10] "mean_rank_hub"    "sd_rank_hub"      "ncis"            
## [13] "ntrans"           "regulates_in_cis" "potential_cis"   
## [16] "module"           "process_id"

We describe some of the most useful attributes for visualization:

‘name’: the unique node ID
‘alias’: the node alias (e.g. gene symbol ERBB2)
‘chr,start,end,strand’: the gene coordinates of nodes
‘description’: any note (e.g. breast cancer oncogene)
‘levels’: indicates whether the node belongs to predictors “x” or responses “y”
‘rank_hub’: the rank of the hub within its level (e.g. a value of “1” for a node of level “x” means that it is the most consistently high degree $x$ node in the network. )
‘regulates_in_cis’: list of genes regulated in cis
‘module’: the module ID that the node belongs to.
‘process_id’: the significant GO-term(s) associated with the node.

Also the edge attributes are exported to ‘graphml’ format.

edge_attr_names(outmod$ig)

## [1] "levels"    "cis_trans"

‘levels’ indicates whether an edge is $x-y$ or $y-y$.
‘cis_trans’ indicates whether an edge is regulated in cis or in trans.

Summary

This toolkit is useful for analyzing and summarizing the output of the spaceMap model fit in the context of integrative genomics. The biological context mapped onto the network object can easily be exported to existing network visualization tools like Cytoscape.

Christopher Conley, Pei Wang, Jie Peng

2017-05-19