Background

Predicting the interaction of two factors using binding and expression data

  • The binding of a transcription factor to a regulatory region (e.g. gene promoter) induces or represses its expression Rougemont and Naef (2012).

  • Transcription factors share their binding sites with other factor, co-factors and/or DNA-binding proteins forming complexes.

  • The integration of the overlapping binding sites and the effect on gene expression of perturbing the factors can be used to infer their combined function; cooperative or competitive.

Modeling the binding sites as the discounted distances of the ChIP peaks

Peak Score (\(S_p\)): is the distance (\(\Delta\)) from transcription start site (TSS) relative to a 100 kb Wang et al. (2013).

\[ S_p = e^{-(0.5+4\Delta)} \]

The shape of the function approximate empirical observations Tang et al. (2011)

# define distances
d <- 1:100

# define a decay function
decay_func <- function(x) exp(-(0.5 + (4 * x/100)))

# plot decay over distances
plot(d, decay_func(d),
     pch = 19, 
     xlab = 'd = 1:100', ylab = 'exp(-(0.5 + 4 * d/100))')

Modeling the regulatory potential as the sum of the weighted peaks

Gene Score (\(S_g\)): is the sum of the scores (\(S_p\)) of the \(k\) peaks nearby the TSS.

\[S_g = \sum_{i=1}^k S_{pi}\]

Regulatory potential increases with the number of binding sites Tang et al. (2011).

# define distances
d <- 1:100

# sample distances to assign genes
set.seed(123)
g <- sample(1:10, 100, replace = TRUE)

# count the number of instances
n <- as.numeric(table(g))

# aggregate the distances for each instance
s <- aggregate(d, list(g), sum)$x

# plot aggregate distances over the number of instances
plot(n, s,
     pch = 19,
     xlab = 'table(g)', ylab = 'aggregate(d, g, sum)')

Integrating the factor binding and the expression information

Rank Product (\(RP_g\)): the gene score (\(R_{gb}\)) rank is multiplied by the gene statistics rank (\(R_{ge}\)) from differential expression Breitling et al. (2004).

\[ RP_g = \frac{R_{gb}\times R_{ge}}{n^2} \]

Integrate the binding events and the functional effect Tang et al. (2011).

# define distances
d <- 1:100

# make a random stat variable
set.seed(12345)
t <- rnorm(200, 0, 5)

# calculate rank product
rp <- rank(abs(c(-d, d))) * rank(abs(t))

# get min and max ranks
pmax <- which(rp == max(rp))
pmin <- which(rp == min(rp))

# plot random stats over distances
plot(c(-d, d), t,
     pch = 19, col = 'gray',
     xlab = 'c(-d, d)', ylab = 'rnorm(200, 0, 5)')

# highlight the max rank
points(c(-d, d)[pmax], t[pmax],
       col = 'blue', pch = 19)
text(c(-d, d)[pmax]+10, t[pmax] + 1,
       col = 'blue', labels = 'Max')

# highlight the min rank
points(c(-d, d)[pmin], t[pmin],
       col = 'green', pch = 19)
text(c(-d, d)[pmin] + 10, t[pmin] + 1,
       col = 'green', label = 'Min')

# make quadrant lines
abline(h = 0, col = 'red')
abline(v = 0, col = 'red')

Modeling the interaction of two factors using independent perturbations

Regulatory Interaction (\(RI\)): is the product of the gene statistics from differential expression of the perturbation of the two factors (\(x\) and \(y\)) separately.

\[ RI_{g} = x_{ge}\times y_{ge} \] and,

\[ RP_g = \frac{R_{gb}\times RI_{ge}}{n^2} \]

# define two random stat variables
set.seed(12345); t <- rnorm(200, 0, 5)
set.seed(321); t2 <- rnorm(200, 0, 5)

# plot the two random stat variables
plot(t, t2, ylim = c(-10, 10),
     xlim = c(-10, 10), pch = 19, col = 'gray',
     xlab = 'rnorm(200, 0, 5)', ylab = 'rnorm(200, 0, 5)')

# highlight four quadrant by sign
text(5,5, '(+, +)', col = 'blue')
text(-5,-5, '(-, -)', col = 'blue')
text(-5,5, '(-, +)', col = 'green')
text(5,-5, '(+, -)', col = 'green')

# make quadrant lines
abline(h = 0, col = 'red')
abline(v = 0, col = 'red')

Aggregating the effect of the binding events

Empirical Cumulative Distribution Function (ECDF): the proportion of genes in a category (up- or down-regulated genes) that are ranked at or better than the x-axis (regulatory potential value) value Tang et al. (2011).

Aggregate the effect of the factor perturbation in relation to its regulatory potential.

# make a random stat variable
set.seed(12345)
t <- rnorm(200, 0, 5)

# calculate the cumulative distribution function
ecdf_fun <- ecdf(t)

# plot ecdf
plot(ecdf_fun,
     main = '', xlab = 'rnorm(200, 0, 5)', ylab = 'ECDF')

Workflow for integrating binding and expression data

Workflow

Functions in the target R package

Comparison with existing R packages

  • rTRM identifies the transcriptional regulatory modules (TRMs) which are complexes of transcription factors and co-factors by integrating ChIP, gene expression and protein-protein interactions Diez, Hutchins, and Miranda-Saavedra (2013).
  • TFEA.ChIP curates large quantities of data from different sources and uses this data to build a database where queries of transcription factor targets can be constructed Puente-Santamaria, Wasserman, and Peso (2019).
  • transcriptR uses the ChIP data to denovo identify transcripts which are then used to quantify the expression from the RNA-Seq data Karapetyan AR (2019).

Limitations of target

  • Comparable sets of data for the two factors are required; binding data using ChIP and gene expression data under factor perturbation (overexpression or knockdown).
  • Assume that the interaction between two DNA-binding proteins is linear which may not be the case always.
  • Cannot detect assisted binding.

Availability

  • target is available as an open source R/Bioconductor package, here
  • An interactive shiny application can be invoked locally through R or accessed directly on the web, here
  • The source code for the package and the interactive application is available here

Code Walkthrough (–> Next)

References

Breitling, Rainer, Patrick Armengaud, Anna Amtmann, and Pawel Herzyk. 2004. Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments.” FEBS Letters 573 (1-3): 83–92. https://doi.org/10.1016/j.febslet.2004.07.055.
Diez, Diego, Andrew Paul Hutchins, and Diego Miranda-Saavedra. 2013. Systematic identification of transcriptional regulatory modules from protein-protein interaction networks.” Nucleic Acids Research. https://doi.org/10.1093/nar/gkt913.
Karapetyan AR. 2019. An Integrative Tool for ChIP- And RNA-Seq Based Primary Transcripts Detection and Quantification.” R package. https://doi.org/10.18129/B9.bioc.transcriptR.
Puente-Santamaria, Laura, Wyeth W Wasserman, and Luis del Peso. 2019. TFEA.ChIP: a tool kit for transcription factor binding site enrichment analysis capitalizing on ChIP-seq datasets.” Bioinformatics. https://doi.org/10.1093/bioinformatics/btz573.
Rougemont, Jacques, and Felix Naef. 2012. Computational Analysis of Protein–DNA Interactions from ChIP-seq Data.” In Gene Regulatory Networks, 263–73. Springer. https://doi.org/10.1007/978-1-61779-292-2{\_}16.
Tang, Qianzi, Yiwen Chen, Clifford Meyer, Tim Geistlinger, Mathieu Lupien, Qian Wang, Tao Liu, Yong Zhang, Myles Brown, and Xiaole Shirley Liu. 2011. A comprehensive view of nuclear receptor cancer cistromes. Cancer Research 71 (22): 6940–47. https://doi.org/10.1158/0008-5472.CAN-11-2091.
Wang, Su, Hanfei Sun, Jian Ma, Chongzhi Zang, Chenfei Wang, Juan Wang, Qianzi Tang, Clifford A. Meyer, Yong Zhang, and X. Shirley Liu. 2013. Target analysis by integration of transcriptome and ChIP-seq data with BETA.” Nature Protocols. https://doi.org/10.1038/nprot.2013.150.