vignettes/workshop_code.Rmd
workshop_code.Rmd
# load required libraries
library(dplyr)
library(tidyr)
library(readr)
library(purrr)
library(GenomicRanges)
library(rtracklayer)
library(AnnotationDbi)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)
library(target)
YY1 and YY2 belongs to the same family of transcription factors.
Ying Yang 1 (YY1)
Ying Yang 2 (YY2)
Using the target analysis, we will attempt to answer the following questions:
To answer these questions, we use publicly available datasets.
Table 1. Expression and binding data of YY1 and YY2 in HeLa cells.
GEO ID | Data Type | Design | Ref. |
---|---|---|---|
GSE14964 | Microarrays | YY#-knockdown | Chen et al. (2010) |
GSE31417 | ChIP-Seq | YY1 vs input | Michaud et al. (2013) |
GSE96878 | ChIP-Seq | YY2 vs input | Wu et al. (2017) |
Data pre-processing:
bed
files from ChIP-Atlas.Data analysis:
The ChIP peaks were downloaded in the form of separate bed files for each factor. We first locate the files in the data/
directory and load the files using import.bed
. Then the data is transformed into a suitable format, GRanges
. The resulting object, peaks
, is a list
of two GRanges
items, one for each factor.
# locate the peaks bed files
peak_files <- c(YY1 = 'data/Oth.Utr.05.YY1.AllCell.bed',
YY2 = 'data/Oth.Utr.05.YY2.AllCell.bed')
# load the peaks bed files as GRanges
peaks <- map(peak_files, ~GRanges(import.bed(.x)))
# show the numbers of peaks of the two factors
lengths(peaks)
#> YY1 YY2
#> 31490 3368
# show the first few entries of the GRanges object
show(peaks$YY1)
#> GRanges object with 31490 ranges and 4 metadata columns:
#> seqnames ranges strand | name score
#> <Rle> <IRanges> <Rle> | <character> <numeric>
#> [1] chr1 9980-10149 * | ID=SRX092576;Name=YY.. 186
#> [2] chr1 10001-10265 * | ID=SRX190209;Name=YY.. 512
#> [3] chr1 10368-10452 * | ID=SRX190209;Name=YY.. 128
#> [4] chr1 11219-11322 * | ID=SRX190209;Name=YY.. 100
#> [5] chr1 29140-29432 * | ID=SRX190209;Name=YY.. 418
#> ... ... ... ... . ... ...
#> [31486] chrY 58996509-58996690 * | ID=SRX092576;Name=YY.. 135
#> [31487] chrY 58996927-58997108 * | ID=SRX092576;Name=YY.. 183
#> [31488] chrY 58996934-58997092 * | ID=SRX190209;Name=YY.. 111
#> [31489] chrY 59363025-59363426 * | ID=SRX190209;Name=YY.. 894
#> [31490] chrY 59363027-59363397 * | ID=SRX092576;Name=YY.. 404
#> itemRgb thick
#> <character> <IRanges>
#> [1] #00BDFF 9980-10149
#> [2] #0CFF00 10001-10265
#> [3] #0082FF 10368-10452
#> [4] #0066FF 11219-11322
#> [5] #00FF53 29140-29432
#> ... ... ...
#> [31486] #0089FF 58996509-58996690
#> [31487] #00BAFF 58996927-58997108
#> [31488] #0071FF 58996934-58997092
#> [31489] #FF6C00 59363025-59363426
#> [31490] #00FF61 59363027-59363397
#> -------
#> seqinfo: 56 sequences from an unspecified genome; no seqlengths
# show number of peaks in each chromosome
table(seqnames(peaks$YY1))
#>
#> chr1 chr10 chr11
#> 3126 1632 1590
#> chr12 chr13 chr14
#> 1817 605 971
#> chr15 chr16 chr17
#> 964 1246 1613
#> chr17_gl000204_random chr17_gl000205_random chr18
#> 1 12 548
#> chr19 chr19_gl000208_random chr1_gl000191_random
#> 1626 18 4
#> chr1_gl000192_random chr2 chr20
#> 4 2304 775
#> chr21 chr22 chr3
#> 332 728 1935
#> chr4 chr4_gl000193_random chr4_gl000194_random
#> 1369 1 1
#> chr5 chr6 chr7
#> 1479 1688 1538
#> chr7_gl000195_random chr8 chr9
#> 11 1134 1330
#> chr9_gl000198_random chr9_gl000199_random chrM
#> 1 37 10
#> chrUn_gl000211 chrUn_gl000212 chrUn_gl000214
#> 3 4 9
#> chrUn_gl000216 chrUn_gl000218 chrUn_gl000219
#> 44 1 5
#> chrUn_gl000220 chrUn_gl000221 chrUn_gl000222
#> 37 1 5
#> chrUn_gl000223 chrUn_gl000224 chrUn_gl000225
#> 4 8 51
#> chrUn_gl000226 chrUn_gl000227 chrUn_gl000228
#> 18 1 12
#> chrUn_gl000235 chrUn_gl000237 chrUn_gl000240
#> 3 1 1
#> chrUn_gl000242 chrUn_gl000243 chrUn_gl000245
#> 1 1 1
#> chrX chrY
#> 630 199
# show the the width of the peaks
summary(width(peaks$YY1))
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 85.0 111.0 151.0 192.9 230.0 1772.0
# show the strands of the peaks
unique(strand(peaks$YY1))
#> [1] *
#> Levels: + - *
The differential expression data were downloaded in tabular format.
data/
read_tsv
The resulting object, express
, is a list
of two tibble
items.
# locate the expression text files
expression_files <- c(YY1 = 'data/DataSet_01_18.tsv',
YY2 = 'data/DataSet_01_19.tsv')
# load the expression text files
express <- map(expression_files,
~read_tsv(.x, col_names = FALSE) %>%
dplyr::select(2, 3, 7, 9) %>% #9
setNames(c('tf', 'gene', 'fc', 'pvalue')) %>%
filter(tf %in% c('YY1', 'YY2')) %>%
na.omit())
# show the number of genes with recorded fold-change and p-value
map(express, nrow)
#> $YY1
#> [1] 16595
#>
#> $YY2
#> [1] 16595
# show same genes were recorded for both factors
all(express$YY1$gene %in% express$YY2$gene)
#> [1] TRUE
# show direction and significance of the regulation
table(express$YY1$fc > 0, express$YY1$pvalue < .01)
#>
#> FALSE TRUE
#> FALSE 6785 160
#> TRUE 9280 370
table(express$YY2$fc > 0, express$YY2$pvalue < .01)
#>
#> FALSE TRUE
#> FALSE 7435 132
#> TRUE 8804 224
The knockdown of either factor in HeLa cells seem to change the expression of many genes in either directions (Figure 1A&B). Moreover, the changes resulting from the knockdown of the factors individually are correlated (Figure 1C). This suggest that, many of the regulated genes are shared targets of the two factors or they respond similarly to the perturbation of either factor.
# Figure 1
par(mfrow = c(1, 3))
# volcano plot of YY1 knockdown
plot(express$YY1$fc,
-log10(express$YY1$pvalue),
xlab = 'Fold-change (log_2)',
ylab = 'P-value (-log_10)',
xlim = c(-4, 4), ylim = c(0, 6))
title('(A)')
# volcano plot of YY2 knockdown
plot(express$YY2$fc,
-log10(express$YY2$pvalue),
xlab = 'Fold-change (log_2)',
ylab = 'P-value (-log_10)',
xlim = c(-4, 4), ylim = c(0, 6))
title('(B)')
# plot fold-change of YY1 and YY2
plot(express$YY1$fc[order(express$YY1$gene)],
express$YY2$fc[order(express$YY2$gene)],
xlab = 'YY1-knockdown (log_2)',
ylab = 'YY2-knockdown (log_2)',
xlim = c(-4, 4), ylim = c(-4, 4))
title('(C)')
The gene information in express
are recorded using the gene SYMBOLS.
org.Hs.eg.db
TxDb.Hsapiens.UCSC.hg19.knownGene
# map symbols to entrez ids
symbol_entrez <- AnnotationDbi::select(org.Hs.eg.db,
keys = express$YY1$gene,
columns = 'ENTREZID',
keytype = 'SYMBOL')
# remove unmapped genes
symbol_entrez <- na.omit(symbol_entrez)
# resize regions
genome <- promoters(TxDb.Hsapiens.UCSC.hg19.knownGene,
upstream = 100000,
filter = list(gene_id = symbol_entrez$ENTREZID),
columns = c('tx_id', 'tx_name', 'gene_id'))
# match and add symbols
ind <- match(genome$gene_id@unlistData, symbol_entrez$ENTREZID)
genome$gene <- symbol_entrez$SYMBOL[ind]
# show the first few entries of the mapping data.frame
head(symbol_entrez)
#> SYMBOL ENTREZID
#> 1 IFI44L 10964
#> 2 C7orf57 136288
#> 3 KLHL32 114792
#> 4 TREM2 54209
#> 5 EPHX2 2053
#> 6 LRRC25 126364
# show txdb object
TxDb.Hsapiens.UCSC.hg19.knownGene
#> TxDb object:
#> # Db type: TxDb
#> # Supporting package: GenomicFeatures
#> # Data source: UCSC
#> # Genome: hg19
#> # Organism: Homo sapiens
#> # Taxonomy ID: 9606
#> # UCSC Table: knownGene
#> # Resource URL: http://genome.ucsc.edu/
#> # Type of Gene ID: Entrez Gene ID
#> # Full dataset: yes
#> # miRBase build ID: GRCh37
#> # transcript_nrow: 82960
#> # exon_nrow: 289969
#> # cds_nrow: 237533
#> # Db created by: GenomicFeatures package from Bioconductor
#> # Creation time: 2015-10-07 18:11:28 +0000 (Wed, 07 Oct 2015)
#> # GenomicFeatures version at creation time: 1.21.30
#> # RSQLite version at creation time: 1.0.0
#> # DBSCHEMAVERSION: 1.1
# show the
show(genome)
#> GRanges object with 56521 ranges and 4 metadata columns:
#> seqnames ranges strand | tx_id tx_name
#> <Rle> <IRanges> <Rle> | <integer> <character>
#> uc001abv.1 chr1 760530-860729 + | 21 uc001abv.1
#> uc001abw.1 chr1 761121-861320 + | 22 uc001abw.1
#> uc031pjl.1 chr1 761302-861501 + | 23 uc031pjl.1
#> uc031pjm.1 chr1 761302-861501 + | 24 uc031pjm.1
#> uc031pjn.1 chr1 761302-861501 + | 25 uc031pjn.1
#> ... ... ... ... . ... ...
#> uc011mgb.1 chrUn_gl000223 68588-168787 - | 82934 uc011mgb.1
#> uc011mgc.1 chrUn_gl000223 68588-168787 - | 82935 uc011mgc.1
#> uc011mgd.2 chrUn_gl000223 119531-219730 - | 82936 uc011mgd.2
#> uc031tgg.1 chrUn_gl000223 119531-219730 - | 82937 uc031tgg.1
#> uc031tgh.1 chrUn_gl000223 119531-219730 - | 82938 uc031tgh.1
#> gene_id gene
#> <CharacterList> <character>
#> uc001abv.1 148398 SAMD11
#> uc001abw.1 148398 SAMD11
#> uc031pjl.1 148398 SAMD11
#> uc031pjm.1 148398 SAMD11
#> uc031pjn.1 148398 SAMD11
#> ... ... ...
#> uc011mgb.1 7637 ZNF84
#> uc011mgc.1 7637 ZNF84
#> uc011mgd.2 7574 ZNF26
#> uc031tgg.1 7574 ZNF26
#> uc031tgh.1 7574 ZNF26
#> -------
#> seqinfo: 93 sequences (1 circular) from hg19 genome
Now the two objects can be merged. The merged object, regions
, is similarly a GRanges
that contains genome and expression information of all common genes.
# make regions by merging the genome and express data
regions <- map(express,
~{
# make a copy of genome
gr <- genome
# match gene names
ind <- match(gr$gene, .x$gene)
# add the expression info to the metadata
mcols(gr) <- cbind(mcols(gr), .x[ind,])
gr
})
# show the first few entries in the GRanges object
regions$YY1
#> GRanges object with 56521 ranges and 8 metadata columns:
#> seqnames ranges strand | tx_id tx_name
#> <Rle> <IRanges> <Rle> | <integer> <character>
#> uc001abv.1 chr1 760530-860729 + | 21 uc001abv.1
#> uc001abw.1 chr1 761121-861320 + | 22 uc001abw.1
#> uc031pjl.1 chr1 761302-861501 + | 23 uc031pjl.1
#> uc031pjm.1 chr1 761302-861501 + | 24 uc031pjm.1
#> uc031pjn.1 chr1 761302-861501 + | 25 uc031pjn.1
#> ... ... ... ... . ... ...
#> uc011mgb.1 chrUn_gl000223 68588-168787 - | 82934 uc011mgb.1
#> uc011mgc.1 chrUn_gl000223 68588-168787 - | 82935 uc011mgc.1
#> uc011mgd.2 chrUn_gl000223 119531-219730 - | 82936 uc011mgd.2
#> uc031tgg.1 chrUn_gl000223 119531-219730 - | 82937 uc031tgg.1
#> uc031tgh.1 chrUn_gl000223 119531-219730 - | 82938 uc031tgh.1
#> gene_id gene tf gene fc
#> <CharacterList> <character> <character> <character> <numeric>
#> uc001abv.1 148398 SAMD11 YY1 SAMD11 -0.54365
#> uc001abw.1 148398 SAMD11 YY1 SAMD11 -0.54365
#> uc031pjl.1 148398 SAMD11 YY1 SAMD11 -0.54365
#> uc031pjm.1 148398 SAMD11 YY1 SAMD11 -0.54365
#> uc031pjn.1 148398 SAMD11 YY1 SAMD11 -0.54365
#> ... ... ... ... ... ...
#> uc011mgb.1 7637 ZNF84 YY1 ZNF84 0.08999
#> uc011mgc.1 7637 ZNF84 YY1 ZNF84 0.08999
#> uc011mgd.2 7574 ZNF26 YY1 ZNF26 -0.47400
#> uc031tgg.1 7574 ZNF26 YY1 ZNF26 -0.47400
#> uc031tgh.1 7574 ZNF26 YY1 ZNF26 -0.47400
#> pvalue
#> <numeric>
#> uc001abv.1 0.05581
#> uc001abw.1 0.05581
#> uc031pjl.1 0.05581
#> uc031pjm.1 0.05581
#> uc031pjn.1 0.05581
#> ... ...
#> uc011mgb.1 0.68511
#> uc011mgc.1 0.68511
#> uc011mgd.2 0.09301
#> uc031tgg.1 0.09301
#> uc031tgh.1 0.09301
#> -------
#> seqinfo: 93 sequences (1 circular) from hg19 genome
# show the names of the columns in the object metadata
names(mcols(regions$YY1))
#> [1] "tx_id" "tx_name" "gene_id" "gene" "tf" "gene" "fc"
#> [8] "pvalue"
# show the width of the regions
unique(width(regions$YY1))
#> [1] 100200
The standard target analysis includes the identification of associated peaks using associated_peaks
and direct targets using direct_targets
.
The inputs for these functions are
peaks
regions
regions_col
, the column names for regionsstats_col
, the statistics column which is the fold-change in this case.The resulting objects are GRanges
for the identified peaks assigned to the regions or the ranked targets. Several columns is added to the metadata objects of the GRanges
to save the calculations.
# get associated peaks
ap <- map2(peaks, regions,
~associated_peaks(peaks=.x,
regions = .y,
regions_col = 'tx_id'))
# show associated_peak return
ap
#> $YY1
#> GRanges object with 160722 ranges and 7 metadata columns:
#> seqnames ranges strand | name
#> <Rle> <IRanges> <Rle> | <character>
#> [1] chr1 762751-762895 * | ID=SRX190209;Name=YY..
#> [2] chr1 762751-762895 * | ID=SRX190209;Name=YY..
#> [3] chr1 762751-762895 * | ID=SRX190209;Name=YY..
#> [4] chr1 762751-762895 * | ID=SRX190209;Name=YY..
#> [5] chr1 762751-762895 * | ID=SRX190209;Name=YY..
#> ... ... ... ... . ...
#> [160718] chrX 154611695-154611857 * | ID=SRX190209;Name=YY..
#> [160719] chrX 154611695-154611857 * | ID=SRX190209;Name=YY..
#> [160720] chrX 154611909-154612034 * | ID=SRX190209;Name=YY..
#> [160721] chrX 154611909-154612034 * | ID=SRX190209;Name=YY..
#> [160722] chrY 21154746-21154873 * | ID=SRX190209;Name=YY..
#> score itemRgb thick assigned_region distance
#> <numeric> <character> <IRanges> <integer> <numeric>
#> [1] 109 #006FFF 762751-762895 21 -47807
#> [2] 109 #006FFF 762751-762895 22 -48397
#> [3] 109 #006FFF 762751-762895 23 -48579
#> [4] 109 #006FFF 762751-762895 24 -48579
#> [5] 109 #006FFF 762751-762895 25 -48579
#> ... ... ... ... ... ...
#> [160718] 388 #00FF72 154611695-154611857 78347 -2110.0
#> [160719] 388 #00FF72 154611695-154611857 78348 -2110.0
#> [160720] 88 #0059FF 154611909-154612034 78347 -1914.5
#> [160721] 88 #0059FF 154611909-154612034 78348 -1914.5
#> [160722] 163 #00A6FF 21154746-21154873 78747 -49796.5
#> peak_score
#> <numeric>
#> [1] 0.0896107
#> [2] 0.0875207
#> [3] 0.0868859
#> [4] 0.0868859
#> [5] 0.0868859
#> ... ...
#> [160718] 0.5574402
#> [160719] 0.5574402
#> [160720] 0.5618165
#> [160721] 0.5618165
#> [160722] 0.0827559
#> -------
#> seqinfo: 56 sequences from an unspecified genome; no seqlengths
#>
#> $YY2
#> GRanges object with 20393 ranges and 7 metadata columns:
#> seqnames ranges strand | name
#> <Rle> <IRanges> <Rle> | <character>
#> [1] chr1 894578-894665 * | ID=SRX1530266;Name=Y..
#> [2] chr1 894578-894665 * | ID=SRX1530266;Name=Y..
#> [3] chr1 894578-894665 * | ID=SRX1530266;Name=Y..
#> [4] chr1 894578-894665 * | ID=SRX1530266;Name=Y..
#> [5] chr1 894578-894665 * | ID=SRX1530266;Name=Y..
#> ... ... ... ... . ...
#> [20389] chrX 154299744-154299864 * | ID=SRX1530266;Name=Y..
#> [20390] chrX 154444690-154444806 * | ID=SRX1530266;Name=Y..
#> [20391] chrX 154444690-154444806 * | ID=SRX1530266;Name=Y..
#> [20392] chrX 154611688-154611827 * | ID=SRX1530266;Name=Y..
#> [20393] chrX 154611688-154611827 * | ID=SRX1530266;Name=Y..
#> score itemRgb thick assigned_region distance
#> <numeric> <character> <IRanges> <integer> <numeric>
#> [1] 150 #0099FF 894578-894665 55 48555.5
#> [2] 150 #0099FF 894578-894665 56 47693.5
#> [3] 150 #0099FF 894578-894665 57 47513.5
#> [4] 150 #0099FF 894578-894665 58 47061.5
#> [5] 150 #0099FF 894578-894665 59 46787.5
#> ... ... ... ... ... ...
#> [20389] 141 #008FFF 154299744-154299864 78345 -49644.0
#> [20390] 125 #007FFF 154444690-154444806 76878 49948.0
#> [20391] 125 #007FFF 154444690-154444806 76879 49948.0
#> [20392] 192 #00C3FF 154611688-154611827 78347 -2128.5
#> [20393] 192 #00C3FF 154611688-154611827 78348 -2128.5
#> peak_score
#> <numeric>
#> [1] 0.0869676
#> [2] 0.0900185
#> [3] 0.0906690
#> [4] 0.0923232
#> [5] 0.0933406
#> ... ...
#> [20389] 0.0832623
#> [20390] 0.0822559
#> [20391] 0.0822559
#> [20392] 0.5570279
#> [20393] 0.5570279
#> -------
#> seqinfo: 41 sequences from an unspecified genome; no seqlengths
# show added columns in the output
names(mcols(peaks$YY1))
#> [1] "name" "score" "itemRgb" "thick"
names(mcols(ap$YY1))
#> [1] "name" "score" "itemRgb" "thick"
#> [5] "assigned_region" "distance" "peak_score"
# show information in the added columns
head(ap$YY1$assigned_region)
#> [1] 21 22 23 24 25 26
head(ap$YY1$distance)
#> [1] -47807 -48397 -48579 -48579 -48579 -48579
head(ap$YY1$peak_score)
#> [1] 0.08961075 0.08752069 0.08688586 0.08688586 0.08688586 0.08688586
# get direct targets
dt <- map2(peaks, regions,
~direct_targets(peaks=.x,
regions = .y,
regions_col = 'tx_id',
stats_col = 'fc'))
# show direct_targets return
dt
#> $YY1
#> GRanges object with 44403 ranges and 13 metadata columns:
#> seqnames ranges strand | tx_id tx_name gene_id
#> <Rle> <IRanges> <Rle> | <integer> <character> <list>
#> 1 chr1 760530-860729 + | 21 uc001abv.1 148398
#> 6 chr1 761121-861320 + | 22 uc001abw.1 148398
#> 11 chr1 761302-861501 + | 23 uc031pjl.1 148398
#> 16 chr1 761302-861501 + | 24 uc031pjm.1 148398
#> 21 chr1 761302-861501 + | 25 uc031pjn.1 148398
#> ... ... ... ... . ... ... ...
#> 160718 chrUn_gl000223 68588-168787 - | 82934 uc011mgb.1 7637
#> 160719 chrUn_gl000223 68588-168787 - | 82935 uc011mgc.1 7637
#> 160720 chrUn_gl000223 119531-219730 - | 82936 uc011mgd.2 7574
#> 160721 chrUn_gl000223 119531-219730 - | 82937 uc031tgg.1 7574
#> 160722 chrUn_gl000223 119531-219730 - | 82938 uc031tgh.1 7574
#> gene tf gene.1 fc pvalue score
#> <character> <character> <character> <numeric> <numeric> <numeric>
#> 1 SAMD11 YY1 SAMD11 -0.54365 0.05581 0.953844
#> 6 SAMD11 YY1 SAMD11 -0.54365 0.05581 0.944972
#> 11 SAMD11 YY1 SAMD11 -0.54365 0.05581 0.942342
#> 16 SAMD11 YY1 SAMD11 -0.54365 0.05581 0.942342
#> 21 SAMD11 YY1 SAMD11 -0.54365 0.05581 0.942342
#> ... ... ... ... ... ... ...
#> 160718 ZNF84 YY1 ZNF84 0.08999 0.68511 0.175405
#> 160719 ZNF84 YY1 ZNF84 0.08999 0.68511 0.175405
#> 160720 ZNF26 YY1 ZNF26 -0.47400 0.09301 0.273345
#> 160721 ZNF26 YY1 ZNF26 -0.47400 0.09301 0.273345
#> 160722 ZNF26 YY1 ZNF26 -0.47400 0.09301 0.273345
#> score_rank stat stat_rank rank
#> <integer> <numeric> <integer> <numeric>
#> 1 14440 -0.54365 7623 0.0558301
#> 6 14621 -0.54365 7623 0.0565299
#> 11 14671 -0.54365 7623 0.0567232
#> 16 14671 -0.54365 7623 0.0567232
#> 21 14671 -0.54365 7623 0.0567232
#> ... ... ... ... ...
#> 160718 37285 0.08999 35329 0.668099
#> 160719 37285 0.08999 35329 0.668099
#> 160720 33771 -0.47400 9775 0.167431
#> 160721 33771 -0.47400 9775 0.167431
#> 160722 33771 -0.47400 9775 0.167431
#> -------
#> seqinfo: 93 sequences from an unspecified genome; no seqlengths
#>
#> $YY2
#> GRanges object with 15296 ranges and 13 metadata columns:
#> seqnames ranges strand | tx_id tx_name gene_id
#> <Rle> <IRanges> <Rle> | <integer> <character> <list>
#> 1 chr1 795967-896166 + | 55 uc001aca.2 339451
#> 2 chr1 796829-897028 + | 56 uc001acb.1 339451
#> 3 chr1 797009-897208 + | 57 uc010nya.1 339451
#> 4 chr1 797461-897660 + | 58 uc001acc.2 339451
#> 5 chr1 797735-897934 + | 59 uc010nyb.1 339451
#> ... ... ... ... . ... ... ...
#> 20388 chrX 154134649-154234848 - | 78342 uc010nvi.1 2157
#> 20389 chrX 154255016-154355215 - | 78343 uc011mzx.1 2157
#> 20391 chrX 154299348-154399547 - | 78345 uc004fmz.2 4515
#> 20392 chrX 154563787-154663986 - | 78347 uc004fnf.3 1193
#> 20393 chrX 154563787-154663986 - | 78348 uc010nvj.1 1193
#> gene tf gene.1 fc pvalue score
#> <character> <character> <character> <numeric> <numeric> <numeric>
#> 1 KLHL17 YY2 KLHL17 -0.39191 0.30656 0.0869676
#> 2 KLHL17 YY2 KLHL17 -0.39191 0.30656 0.0900185
#> 3 KLHL17 YY2 KLHL17 -0.39191 0.30656 0.0906690
#> 4 KLHL17 YY2 KLHL17 -0.39191 0.30656 0.0923232
#> 5 KLHL17 YY2 KLHL17 -0.39191 0.30656 0.0933406
#> ... ... ... ... ... ... ...
#> 20388 F8 YY2 F8 0.32501 0.12452 0.3079124
#> 20389 F8 YY2 F8 0.32501 0.12452 0.5732370
#> 20391 MTCP1 YY2 MTCP1 -0.06521 0.65344 0.0832623
#> 20392 CLIC2 YY2 CLIC2 0.21019 0.35482 0.5570279
#> 20393 CLIC2 YY2 CLIC2 0.21019 0.35482 0.5570279
#> score_rank stat stat_rank rank
#> <integer> <numeric> <integer> <numeric>
#> 1 11139 -0.39191 2527 0.120308
#> 2 11002 -0.39191 2527 0.118829
#> 3 10973 -0.39191 2527 0.118515
#> 4 10902 -0.39191 2527 0.117749
#> 5 10865 -0.39191 2527 0.117349
#> ... ... ... ... ...
#> 20388 5339 0.32501 3699 0.0844090
#> 20389 1707 0.32501 3699 0.0269875
#> 20391 11837 -0.06521 12032 0.6087286
#> 20392 1908 0.21019 6592 0.0537576
#> 20393 1908 0.21019 6592 0.0537576
#> -------
#> seqinfo: 93 sequences from an unspecified genome; no seqlengths
# show added columns in the output
names(mcols(regions$YY1))
#> [1] "tx_id" "tx_name" "gene_id" "gene" "tf" "gene" "fc"
#> [8] "pvalue"
names(mcols(dt$YY1))
#> [1] "tx_id" "tx_name" "gene_id" "gene" "tf"
#> [6] "gene.1" "fc" "pvalue" "score" "score_rank"
#> [11] "stat" "stat_rank" "rank"
# show information in the added columns
head(dt$YY1$score)
#> [1] 0.9538438 0.9449725 0.9423423 0.9423423 0.9423423 0.9423423
head(dt$YY1$score_rank)
#> [1] 14440 14621 14671 14671 14671 14671
head(dt$YY1$rank)
#> [1] 0.05583011 0.05652992 0.05672324 0.05672324 0.05672324 0.05672324
To determine the dominant function of a factor,
Because the ranks rather than the absolute value of the regulatory potential is used, the lower the value the higher the potential. Then the groups of targets can be compared to each other or to a theoretical distribution.
# Figure 2
par(mfrow = c(1, 3))
# plot distance by score of associate peaks
plot(ap$YY1$distance, ap$YY1$peak_score,
xlab = 'Distance', ylab = 'Peak Score',
main = '(A)')
points(ap$YY2$distance, ap$YY2$peak_score)
# make labels, colors and groups
labs <- c('Down', 'None', 'Up')
cols <- c('green', 'gray', 'red')
# make three groups by quantiles
groups <- map(dt,~{
cut(.x$stat, breaks = 3, labels = labs)
})
# plot the group functions
pmap(list(dt, groups, c('(B)', '(C)')), function(x, y, z) {
plot_predictions(x$score_rank,
group = y, colors = cols, labels = labs,
xlab = 'Regulatory Potential', ylab = 'ECDF')
title(z)
})
The scores of the individual peaks are a decreasing function of the distance from the transcription start sites.
The closer the factor binding site from the start site the lower the score. The distribution of these scores is very similar for both factors (Figure 2A). The ECDF of the down-regulated targets of YY1 is higher than that of up- and none-regulated targets (Figure 2B). Therefore, the absence of YY1 on its targets result in aggregate in their down regulation.
The opposite is true for YY2 where more high ranking targets are up-regulated by the factor knockdown (Figure 2C).
# Table 2
# test individual factor functions
map2(dt, groups,
~test_predictions(.x$rank,
group = .y,
compare = c('Down', 'Up')))
#> $YY1
#>
#> Two-sample Kolmogorov-Smirnov test
#>
#> data: x and y
#> D = 0.78597, p-value < 2.2e-16
#> alternative hypothesis: two-sided
#>
#>
#> $YY2
#>
#> Two-sample Kolmogorov-Smirnov test
#>
#> data: x and y
#> D = 0.40354, p-value = 1.075e-12
#> alternative hypothesis: two-sided
To formally test these observations, we use the Kolmogorov-Smirnov (KS) test. The distribution of the two groups are compared for equality. If one lies one either side of the other then they must be drawn from different distributions.
Here, we compared the up- and down-regulated functions for both factors (Table 2).
In both cases, the distribution of the two groups were significantly different from one another.
target
package implements the BETA algorithm to rank targets based on the distance of the ChIP peaks of the transcription factor in the genes and the differential expression from the factor perturbation.Chen, Li, Toshi Shioda, Kathryn R. Coser, Mary C. Lynch, Chuanwei Yang, and Emmett V. Schmidt. 2010. “Genome-wide analysis of YY2 versus YY1 target genes.” Nucleic Acids Research 38 (12): 4011–26. https://doi.org/10.1093/nar/gkq112.
Michaud, Joëlle, Viviane Praz, Nicole James Faresse, Courtney K Jnbaptiste, Shweta Tyagi, Frédéric Schütz, and Winship Herr. 2013. “HCFC1 is a common component of active human CpG-island promoters and coincides with ZNF143, THAP11, YY1, and GABP transcription factor occupancy.” Genome Research 23 (6): 907–16. https://doi.org/10.1101/gr.150078.112.
Wu, Xiao-Nan, Tao-Tao Shi, Yao-Hui He, Fei-Fei Wang, Rui Sang, Jian-Cheng Ding, Wen-Juan Zhang, et al. 2017. “Methylation of transcription factor YY2 regulates its transcriptional activity and cell proliferation.” Cell Discovery 3: 17035. https://doi.org/10.1038/celldisc.2017.35.