METHOD Open Access
A statistical framework for modeling gene
expression using chromatin features and
application to modENCODE datasets
Chao Cheng1, Koon-Kiu Yan1, Kevin Y Yip1,2, Joel Rozowsky1, Roger Alexander1, Chong Shou1 and Mark Gerstein1,3,4*
Abstract
We develop a statistical framework to study the relationship between chromatin features and gene expression. This
can be used to predict gene expression of protein coding genes, as well as microRNAs. We demonstrate the
prediction in a variety of contexts, focusing particularly on the modENCODE worm datasets. Moreover, our
framework reveals the positional contribution around genes (upstream or downstream) of distinct chromatin
features to the overall prediction of expression levels.
Background
In eukaryotes, nuclear chromosomes are organized into
chains of nucleosomes, which are in turn composed of
octamers of four types of histones wrapped around
147 bp of DNA. Modifications of these core histones are
central to many biological processes, including transcriptional
regulation [1], replication [2], alternative spli-

ever, a study in yeast revealed only simple and cumula-
tive functional consequences for combinations of
histone H4 acetylation rather than a complicated synergistic
histone code [28]. Two other studies, one in yeast
and the other in Drosophila, also demonstrated that histone
modifications are highly correlated with each other
and are partially redundant in function [13,17], presumably
conferring robustness in relation to epigenetic regulation
[29]. Alternatively, the high correlation between
histone modifications may have been overestimated as a
result of differences in nucleosome density or other
unknown biases [29]. So far, knowledge about the effect
of histone modifications on transcriptional regulation is
still limited, and the degree of complexity of the histone
code is far from clear. To further understand the rela-
tionship between histone modifications and gene expres-
sion, we require a systematic analysis that integrates
histone modification maps with other genome-wide
datasets.
* Correspondence: [email protected]
1 Department of Molecular Biophysics and Biochemistry, Yale University, 260
Whitney Avenue, New Haven, CT 06520, USA
Full list of author information is available at the end of the article
Cheng et al. Genome Biology 2011, 12:R15
http://genomebiology.com/2011/12/2/R15
© 2011 Cheng et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
The model organism encyclopedia of DNA elements

applying the chromatin-based model to predict the
expression of coding genes and microRNAs at different
developmental stages, we further address the developmental
stage specificity of chromatin modifications and
suggest that chromatin features regulate transcription of
coding genes and microRNAs in a similar fashion.
As more genome-wide ChIP-Seq and RNA-Seq data are
generated by the modENCODE project and the ENCODE
project [2] in the near future, the methods of data
integration proposed in this work should have various
potential applications.
Results
Chromatin features show distinct signal patterns around
genic regions
To systematically study the genome-wide properties of
various chromatin features, we collected more than 50
ChIP-chip and ChIP-seq profiles of histone modifications
and DNA binding factors in C. elegans from the
modENCODE project (see Materials and methods). We
divided the DNA regions around (± 4 kb) the transcription
start site (TSS) and transcription termination site
(TTS) of each transcript into small 100-bp bins and
calculated the average signal of the chromatin features
in each bin. As a result, each bin was assigned a matrix
whose elements are the average signals of different features
in different transcripts (Figure 1). Figure 2a shows
the rich spatial pattern of 16 features in the early
embryonic (EEMB) stage, where the signals are averaged
over all transcripts. We first observed that the upstream
and downstream regions of TSSs and TTSs are clearly

stages of C. elegans, we quantified the expression level of
each gene. For each bin, we then calculated the correlation
between the gene expression levels and the average signals
of each chromatin feature of the bin. Figure 2b shows the
spatial variation of these correlation coefficients around
TSSs and TTSs. According to the correlation patterns,
there are two main types of chromatin features: ones that
are positively correlated with gene expression (such as
H3K79me1, H3K79me2 and H3K79me3); and ones that
are negatively correlated with gene expression (such as
H3K9me2 and H3K9me3). While some features show lar-
gely uniform correlations across the 16-kb regions, some
others are more variable across the regions. For example,
H3K79me2 has a high correlation coefficient (0.65) near
the TSS, but rather a low correlation (0.10) downstream of
the TTS. It is interesting to observe that the negative fea-
tures tend to have more uniform spatial patterns while the
positive features tend to show greater variation. In addition,
for chromatin features such as H3K79me2, although
the average signal intensity decreases with distance
downstream from the TSS, the correlation between the
feature signal and the expression level remains high. This
pattern suggests that, while some chromatin features have
the strongest average signals only at some highly specific
regions, the differences of their signals between genes with
Figure 1 Schematic diagram of our data binning and supervised analysis. (a) DNA regions around the transcription start site (TSS) and
transcription termination site (TTS) of each transcript were separated into 160 bins of 100 bp in size. Average signal of each chromatin feature was

of gene expression with feature signal at distant locations
does reflect the long-range effects of their regulation,
instead of an artifact caused by chromatin structure of the
nearby genes.
Furthermore, to assess whether the trends we
observed are universal to all developmental stages rather
than specific to the EEMB stage, we repeated the analysis
in other stages, including late embryo, larval stages
and young adult. Although the exact values of correlation
coefficients vary across stages, the spatial patterns
are consistent in all stages (Figure S4 in Additional file
4). In addition, a large number of genes are associated
with multiple transcripts corresponding to different
alternative splicing isoforms. In many cases, the overlap
between these transcripts is substantial, which might
affect the correlation patterns between chromatin features
and expression. We thus repeated the correlation
analysis using only genes with a single transcript,
and obtained the same qualitative results (Figure S5 in
Additional file 5).
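
The per-bin correlation analysis used throughout this section can be illustrated with a minimal sketch. It is not the pipeline used in the study: it assumes the Pearson coefficient (the text above does not name the coefficient), and the function names `pearson` and `per_bin_correlation` as well as the toy numbers are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.
    Raises ZeroDivisionError if either sequence has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def per_bin_correlation(signal, expression):
    """signal[t][b] is the average signal of one feature in bin b of
    transcript t. Returns one correlation coefficient per bin."""
    n_bins = len(signal[0])
    return [pearson([row[b] for row in signal], expression)
            for b in range(n_bins)]

# Toy data (hypothetical): 4 transcripts x 3 bins for a single feature.
signal = [
    [1.0, 4.0, 2.0],
    [2.0, 3.0, 2.5],
    [3.0, 2.0, 1.5],
    [4.0, 1.0, 3.0],
]
expression = [1.0, 2.0, 3.0, 4.0]
profile = per_bin_correlation(signal, expression)
```

Plotting such a profile against bin position for each feature would give a curve of the kind summarized in Figure 2b.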
Among the chromatin features shown in Figure 2,
MES-4 and MRG-1 are factors associated with
X-chromosome inactivation [37,38]. These factors are
expected to have different binding patterns on the X
chromosome than on autosomes. We therefore analyzed
their correlation patterns in X genes and autosomal
genes separately. As expected, we found that MES-4 and
MRG-1 associate predominantly with autosomal DNAs,
while the dosage compensation complex (DCC) subunits
bind specifically with X-chromosomal DNAs (data not

ters roughly correspond to genes with high expression
levels (H) and genes with low expression levels (L),
respectively (Figure 3b). These two clusters are charac-
terized by complementary patterns of chromatin fea-
tures. Cluster H is characterized by high signals of 11
features (the right component of the upper dendro-
gram), and low signals for the other 5 features. We note
in particular that highly expressed genes tend to have a
strong H3K36me3 signal, which is consistent with the
role of H3K36me3 as a chromatin mark that activates
transcription of associated genes. Similarly, the well-known
repressive mark H3K9me3 shows a low signal.
Compared to cluster H, genes in cluster L show the
opposite pattern of chromatin signals.
To explore which regions around the TSS and TTS
provide the greatest power in determining gene expression
levels, we repeated the two-way clustering procedure
for each of the 160 bins around TSSs and TTSs.
Figure 3c shows the resulting t-statistics. We observe
that the signals slightly downstream of TSSs are the
most informative. In general, the t-statistics decrease as
the distance from the TSS or TTS increases. The decay
is steeper at the region downstream of TTSs.
The above integrative analysis involves all chromatin
features. To examine how each feature individually
affects gene expression, for each feature we performed
hierarchical clustering of the genes based on the collec-
tive signals of the feature at all 160 bins. An example is
shown in Figure 3d, in which signals of the single fea-
ture H3K79me2 at the different bins were used to clus-

the training transcripts at a certain bin (Figure 1). The
model was then used to predict to which class each
transcript in the testing set belongs. We repeated the
procedure for all 160 bins, with 100 different random
splits of the transcripts into training and testing sets
for each bin (see Materials and methods). We represented
the overall performance of the model using the
receiver operating characteristic (ROC) curve and
further quantified the accuracy using the area under the
curve (AUC). Figure 4a shows the ROCs corresponding
to the prediction performance of five different bins.
Compared to random ordering, which would give a
diagonal ROC curve on average with an expected AUC
of 0.5, all five curves are much better than random
but differ in performance, which indicates
that all the bins are useful for classifying gene expression
but are not equally informative. This result is
consistent with what we have observed using the unsupervised
method described above (Figure 3f). In addition to
the SVM classifiers, we also learned support vector regression
(SVR) models using similar procedures (see Materials
and methods) to predict expression values directly.
Figure 4b shows that there is a high positive correlation
(0.75) between the predicted levels from an SVR model
and the actual expression levels measured by RNA-seq.
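
To make the AUC measure concrete, the following minimal sketch computes it from class labels and classifier scores. It is an illustration only, not the code used in the study: the function name `roc_auc` and the toy labels and scores are hypothetical. It uses the equivalent rank interpretation of the AUC, namely the probability that a randomly chosen highly expressed transcript receives a higher score than a randomly chosen lowly expressed one.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive (label 1)
    scores higher than a randomly chosen negative (label 0).
    Tied scores count as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical): 1 = highly expressed, 0 = lowly expressed.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)
```

A classifier that ranks the two classes no better than chance gives a value near 0.5 under this definition, matching the diagonal ROC curve mentioned above. The pairwise loop is quadratic in the number of transcripts, so a sort-based variant would be preferable for genome-scale data.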
Figure 3 Hierarchical clustering using either chromatin feature profiles (a-c) or bin profiles (d-f) discriminates highly and lowly
expressed genes. (a) Hierarchical clustering of 16 chromatin features in bin 1 (0 to 100 nucleotides upstream of a TSS). The resulting tree is

TSS (-2 kb to 2 kb). These comprehensive models achieve
slightly higher prediction accuracy than those based on
single bins, yet the enhancement is not dramatic, with an
average AUC of 0.94 for the classification model (SVM)
and an average correlation coefficient of 0.75 for the
regression model (SVR) (Figure 6 in Additional file 6).
We then learned SVM models using only features of
individual types. As shown in Figure 5a, the AUC
obtained by using all features (black) is comparable to
the AUCs obtained from models using only particular
subsets of features. Strikingly, the model involving only
the 9 histone modification features is almost as accurate
as the model involving all 16 features. We further
divided the histone modification features into four subsets:
modifications on K4, K9, K36 and K79, respectively.
While the integrated model with all histone
modifications achieves an AUC value of 0.9, using just
one of the subsets can yield an AUC higher than 0.8
(Figure 5b). In particular, the set H3K79 is found to be
most predictive, which again confirms our previous find-
ing of the importance of these histone modifications in
regulating gene expression (Figure 3f).
The results of the supervised analysis suggest that
chromatin features are not only correlated with expression
but are also predictive of the expression levels of
individual genes with good accuracy and could explain a
large portion of the expression differences between different
genes. We note that histone modifications may
have other regions of enrichment that are informative
about gene expression: for instance, the percentage of

explore to what extent the chromatin features are
redundant, and to what extent they are interacting in a
combinatorial fashion. Specifically, for each bin, we
modeled the expression level y as a linear combination
of the effects of individual histone modification features
$x_i$ and their products $x_i x_j$:
$$y \sim \sum_{i} x_i + \sum_{i<j} x_i x_j$$
We found that among the 66 (12 × 11/2) possible
interactions between the 12 distinct histone modification
features, many interactions are statistically significant.
For example, for bin 1, we detected 12 significant inter-
actions (P < 0.001, linear regression) between the histone
modifications (Table S7 in Additional file 7).
To quantify the importance of these interactions in
determining gene expression levels, we compared the
above regression model with a singleton model that
does not contain the interaction terms:
$$y \sim \sum_{i} x_i$$

feature for RNA polymerase II. (b) HIS, the 11 chromatin modification features; H3K79ME, H3K79me1, H3K79me2 and H3K79me3; H3K9ME,
H3K9me2, H3K9me3(Ab1) and H3K9me3(Ab2); H3K36ME, H3K36me2(Ab1), H3K36me2(Ab2) and H3K36me3; H3K4ME, H3K4me3 and H3K4me3.
gene expression. As expected, a combination of high
H3K9me3 signal and low H3K79me3 signal results in a
lower expression level than when both signals are low.
When the signals of both features are high, we observe a
significant difference in gene expression compared to the
other three cases, indicating that the features contribute to
gene expression regulation in a collective manner.
Our analyses of the interactions between the above
chromatin features only considered binary interactions
between two features. For higher-order relationships invol-
ving more features, it is infeasible to perform the same
type of analyses, as the number of feature combinations
would become intractable. Also, the above analyses only
suggest which features interact with each other, but do
not explain how the features interact. In particular, the
complex correlations between features and gene expres-
sion make it difficult to extract directional relationships
between them (Figure S10 in Additional file 10). We there-
fore used Bayesian networks to study the higher order
relationships between the chromatin features and gene
expression (see Additional file 11 for details).
The chromatin model is developmental stage-specific
We have previously constructed an integrative model
using chromatin features at the EEMB stage of C. elegans
development and used it to predict gene expression levels

between EEMB and L3 stages. Using the EEMB stage
chromatin model to predict the expression level of these
genes, the prediction accuracy further decreases (AUC =
0.70).
Chromatin features show different correlation patterns
with different genes in an operon
In C. elegans some neighboring genes are organized into
operons. The genes in an operon are co-transcribed as a
polycistronic pre-messenger RNA and processed into
monocistronic mRNAs [39, 40]. Here we investigate the
differential signals of chromatin features among genes
in operons and how this organization affects their
expression levels. We collected the first, second and last
genes in 881 C. elegans operons and calculated the signals
of chromatin features in each of the 160 bins
around their annotated TSS and TTS. We observed
strong correlations between expression levels and chromatin
feature signals for the first genes (Figure 8). In
comparison, the correlation patterns for the second and
last genes of the operons are not as apparent (Figure
S12 in Additional file 12). The weaker correlations
could be caused by the lack of signals for some histone
modification types. As we observed, the mark for active
promoters, H3K4me3, demonstrates strong signals
around the TSS of the first genes, which is the shared
promoter of genes in the same operon. In the upstream
region of the internal genes, the H3K4me3 signal is
often relatively weak. Alternatively, the weak correlation
for internal genes may also be explained by the inten-
sive post-transcriptional regulation of these genes,

input features for our chromatin model.
We predicted the expression levels of 162 worm
microRNAs with genomic locations obtained from miRBASE
[42]. We then compared our predictions with the
experimental measurements performed by Kato et al.
[43]. As shown in Figure 9, our predictions are in good
agreement with the experimental results in the EEMB
stage (see also the prediction results for the L3 stage in
Figure S13 in Additional file 13). Some microRNAs
are located within or near gene loci, which may confound the
prediction of microRNA expression. To address this
issue, we also checked the prediction accuracy using
only microRNAs that are away from any known gene,
and obtained similar prediction accuracy (PCC = 0.62).
Figure 8 Correlation patterns of H3K4me3 and H3K79me3 in the 160 bins around the TSS and TTS (from 4 kb upstream to 4 kb
downstream) with the expression levels of the first, second and last genes of 881 C. elegans operons.
Figure 9 Prediction of expression levels of microRNAs at the EEMB stage. (a) Predicted expression levels of the experimentally measured
highly and lowly expressed microRNAs based on small RNA-seq results. Expression levels of microRNAs at the EEMB stage were predicted using
an SVR regression model trained on data for protein-coding genes at the same stage. (b) Predicted versus experimentally measured expression
levels of microRNAs at the EEMB stage. R is the Pearson correlation coefficient.
It is interesting to see that the expression of microRNAs
can be accurately predicted using a chromatin
model trained by data for protein-coding genes. Consis-
tent with previous reports on microRNA transcriptional
regulation [44,45], this result suggests that microRNAs
and protein-coding genes share a similar mechanism of
transcriptional regulation by chromatin modifications.

expression is similar in tested organisms (and also in
different tissues, cell-lines, and developmental stages).
For example, H3K4me3 signals around the TSS of genes
show high predictive capability in all the analyses we
have performed. We also found that the models based
on expression levels measured by RNA-seq achieved
higher prediction accuracy than those by microarrays,
consistent with the higher measurement accuracy of
RNA-seq compared to microarrays. Our method can, of
course, be applied to multiple data sets in each species
Figure 10 Prediction accuracy of the chromatin model in four other species. (a-d) Expression levels of genes are predicted using the SVR
method. In yeast, average signals of chromatin features from the TSS to 500 bp upstream were used as predictors (a); in the other species,
signals of chromatin features within the bin at the TSS (bin 1) were used as predictors (b-d). E4-8 h: embryonic stage at 4 to 8 h; ESC, embryonic
stem cell.
(for example, different developmental stages in fruit fly).
Figure 10 shows only a single illustrative example for
each species. We show only an initial statistical analysis
here; further biological interpretation would, of course,
be the subject of future studies.
Discussion
In this study, we present a systematic analysis of the
genome-wide relationship between chromatin features
and gene expression. We have shown that, in terms of
gene expression prediction, information from different
histone modification features is considerably redundant.
Here in this paper, we use the modENCODE worm data
to exemplify our analysis. In fact, we have applied our

We have shown that chromatin features are strongly
correlated with gene expression. Nevertheless, it should
be noted that our models could not reveal if histone
modifications are the ‘cause’ or ‘consequence’ of transcription.
In fact, both directions of causality have been
previously reported. Some studies have proposed that
some histone modifications are the memory of past
transcriptional events resulting from previous active
transcription [52-54]. For instance, it has been shown
that phosphorylation in the tail of Pol II is required for
H3K4me3, revealing that it is a direct consequence of
Pol II passing through the TSS [55]. Other studies, however,
have shown that chromatin modification changes
precede changes in gene expression [56]. A recent study
in human T cells suggested that, for both protein-coding
and miRNA genes, activating histone marks were
already in place before induction of expression, and
these marks were maintained even after the genes were
silenced [45]. This finding shows that histone modification
can be both cause and consequence of gene transcription,
and that a full explanation will require
incorporation of additional data. Generalizing our model
to follow a time course of changing histone modifica-
tions might be helpful for understanding this issue.
The supervised chromatin model trained from expression
data for protein-coding genes can accurately predict
the abundance of both protein-coding genes and micro-
RNAs, which suggests that microRNAs and protein-cod-
ing genes share similar mechanisms of transcriptional
regulation by chromatin modifications [44,45]. To pre-

H3K79me1, H3K79me2 and H3K79me3), binding of
dosage compensation complex (DCC) proteins (SDC2,
SDC3, DPY27, DPY28 and MIX1) and other X-
chromosome inactivation factors (MES4 and MRG1). For
some chromatin features such as H3K9me3, biological
replicates using different antibodies were available. Profiles
of these chromatin features were measured for different
developmental stages, in particular at EEMB and L3 stages.
A list of the data, with their Gene Expression Omnibus IDs
can be found in Additional file 15. All these data are avail-
able from the modENCODE website at [57]. Operon infor-
mation for C. elegans was obtained from a previous study
by Blumenthal et al. [39]. The dataset contains a total of
881 operons with 2.6 genes in each of them on average.
MicroRNA expression levels at different developmental
stages of C. elegans were obtained from small RNA-seq
measurements performed by Kato et al. [43]. Annotation
of worm transcripts was downloaded from
WormBase at [58,59]. Annotation of nematode micro-
RNAs was downloaded from the microRNA database
miRBASE at [42,60]. Assembly version WS180 of C. elegans
was used for gene and microRNA annotations and
data processing of all the chromatin features.
Binning DNA regions
We obtained the genomic locations and structures of
27,310 protein-coding transcripts of C. elegans from
WormBase. The contribution of each chromatin feature

of transcripts and m is the number of chromatin features.
To make the signals for different chromatin features
comparable, we normalized the columns of A by
subtracting the median and then dividing by the standard
deviation of each column across all transcripts. We performed
hierarchical clustering analysis using the normalized
matrix for a given bin. To evaluate the capability of
a bin to discriminate between genes with high and low
expression levels, we divided the transcripts into two
clusters by splitting the resulting hierarchical tree at the
top level. The expression levels of transcripts in the two
clusters measured by RNA-seq experiments were com-
pared using t-test. We repeated this procedure for all
160 bins, which resulted in a t-score for each bin. Those
t-scores reflect the capability of chromatin features in
these bins to separate genes with low and high expres-
sion levels.
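
The column normalization and the t-score comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the hierarchical clustering step itself is omitted and the two clusters are taken as given, the sample standard deviation is used for normalization, and a pooled-variance Student's t-statistic is used because the text only says "t-test". The function names and toy values are hypothetical.

```python
import math
import statistics

def normalize_columns(matrix):
    """Subtract the column median and divide by the column standard
    deviation (sample standard deviation; an assumption)."""
    columns = list(zip(*matrix))
    medians = [statistics.median(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, medians, sds)]
            for row in matrix]

def t_statistic(a, b):
    """Two-sample Student's t-statistic with pooled variance
    (an assumption; a Welch test would be an equally valid reading)."""
    na, nb = len(a), len(b)
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled * (1 / na + 1 / nb))

# Toy matrix A (hypothetical): 3 transcripts x 2 chromatin features.
signals = [[1, 10], [2, 20], [3, 30]]
normalized = normalize_columns(signals)

# Expression levels of the transcripts in the two top-level clusters,
# assumed to come from a hierarchical clustering that is not shown here.
high_cluster = [5.0, 6.0, 7.0]
low_cluster = [1.0, 2.0, 3.0]
t_score = t_statistic(high_cluster, low_cluster)
```

Repeating the last step for each of the 160 bins would yield one t-score per bin, as in the procedure described above.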
Similarly, given a specific feature, we performed hier-
archical clustering using its signals across all 160 bins.
The clustering analysis was conducted for all chromatin
features, and the capability of each feature to predict
gene expression was evaluated and compared by their
t-scores calculated as described above.
Supervised models for gene expression prediction
We constructed supervised learning models to integrate
the chromatin features for gene expression prediction.
In principle, the chromatin features of each of the 160
bins could contribute to regulation of gene expression.
We therefore constructed the model in a bin-specific
manner to investigate the relative importance of each

suring the prediction power of classification models.
In the regression model, we directly predicted the expres-
sion levels of transcripts rather than classifying them into
two broad expression categories. The prediction power of
the regression model was also checked using cross-valida-
tion. The SVR model was trained on the training data and
applied to the testing data. Then the predicted expression
levels for transcripts in the testing data were compared
with their actual levels measured by RNA-seq experiment.
The correlation between predicted and actual expression
level indicates the prediction power of the model.
In a linear regression model, the square of the correlation
($R^2$) between the predicted values and the actual
values is equal to the fraction of total variance in the
observed data explained by the predictions. We used
this quantity to estimate how much variation of gene
expression can be explained by the chromatin features.
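
The identity stated above can be checked numerically. The sketch below fits a one-predictor least-squares line to hypothetical toy data and compares the squared Pearson correlation between predicted and actual values with the explained-variance fraction 1 - SS_res/SS_tot. All function names and numbers are illustrative assumptions; the identity holds for ordinary least squares with an intercept, not for arbitrary predictors.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def explained_variance(actual, predicted):
    """Fraction of total variance explained: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy data: one chromatin signal and one expression value
# per transcript.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]
a, b = fit_line(x, y)
predicted = [a + b * xi for xi in x]
r_squared = pearson(predicted, y) ** 2
fraction_explained = explained_variance(y, predicted)
```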
To estimate the predictive power of classification and
regression models for each of the 160 bins, we repeated
the cross-validation procedure 100 times. The mean and
standard deviation of the resulting 100 AUC scores
were calculated for each bin as a measurement of the
predictive power of the SVM classification model. Similarly,
the accuracy of the SVR model for a bin was
reflected by the mean and standard deviation of the 100
correlation coefficients.
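
The repeated evaluation described above can be sketched as a loop over random training/testing splits that collects one score per repeat and reports the mean and standard deviation. This is a minimal sketch, not the study's implementation: the split proportion (`test_fraction=0.5`), the seed, the function name `repeated_holdout` and the toy threshold scorer are all assumptions, and the scorer stands in for an SVM or SVR model, which is not implemented here.

```python
import random
import statistics

def repeated_holdout(n_items, score_fn, n_repeats=100,
                     test_fraction=0.5, seed=0):
    """Repeat a random train/test split n_repeats times.
    score_fn(train_idx, test_idx) must return one numeric score.
    Returns (mean, standard deviation, list of all scores)."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    n_test = max(1, int(round(n_items * test_fraction)))
    scores = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        scores.append(score_fn(train_idx, test_idx))
    return statistics.mean(scores), statistics.stdev(scores), scores

# Hypothetical toy data: one signal value and one class label per transcript.
values = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

def toy_accuracy(train_idx, test_idx):
    """Stand-in scorer: threshold at the mean training signal and
    report classification accuracy on the testing transcripts."""
    threshold = statistics.mean([values[i] for i in train_idx])
    correct = sum((values[i] > threshold) == bool(labels[i])
                  for i in test_idx)
    return correct / len(test_idx)

mean_score, sd_score, all_scores = repeated_holdout(len(values), toy_accuracy)
```

In the setting described above, `score_fn` would train the model on the training transcripts of one bin and return the AUC (or the correlation coefficient for the regression model) on the testing transcripts.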
Detecting combinatorial effects of chromatin features
using linear models

involving only two features:
$$y \sim x_i + x_j + x_i x_j$$
A significant interaction term would indicate that the
interaction between the two features has a significant
effect on gene expression.
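
As an illustration of the two-feature model, the sketch below fits y = b0 + b1*x_i + b2*x_j + b3*x_i*x_j by ordinary least squares using the normal equations. It is a minimal sketch under stated assumptions: an intercept term is included, the helper names `solve` and `fit_interaction_model` and the toy data are hypothetical, and the significance test on the interaction coefficient (the P-value) is not implemented.

```python
def solve(a, b):
    """Solve the square linear system a x = b by Gaussian elimination
    with partial pivoting. Assumes the system is non-singular."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_interaction_model(xi, xj, y):
    """Least-squares fit of y = b0 + b1*xi + b2*xj + b3*xi*xj.
    Returns [b0, b1, b2, b3]; b3 is the interaction coefficient."""
    design = [[1.0, a, b, a * b] for a, b in zip(xi, xj)]
    k = 4
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)]
           for p in range(k)]
    xty = [sum(row[p] * t for row, t in zip(design, y)) for p in range(k)]
    return solve(xtx, xty)

# Hypothetical toy data generated exactly from y = 1 + 2*xi + 3*xj + 4*xi*xj.
xi = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0]
xj = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
y = [1.0, 4.0, 3.0, 10.0, 1.0, 4.0, 3.0, 10.0]
coefficients = fit_interaction_model(xi, xj, y)
```

A nonzero interaction coefficient that is also statistically significant would correspond to the case described above. Testing significance would additionally require the standard error of the coefficient, which a statistics package would normally provide.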
Predicting expression levels of microRNAs
We downloaded the annotation of 162 C. elegans micro-
RNAs from the miRBASE database [42]. For most micro-
RNAs, the annotation provides no information about the
TSSs. Instead, only the start and end positions of the cor-
responding pre-microRNAs (about 100 nucleotides in
length) are available. To predict the expression levels of
microRNAs, we calculated the signals of all chromatin features
within the associated pre-microRNAs and applied
our model trained on chromatin features associated with
protein-coding genes. We applied both the SVM classifica-
tion and the SVR regression models to predict microRNA
expression. The resulting predictions were validated using
measured microRNA expression levels from small RNA
sequencing performed by Kato et al. [43].
Data sets for other organisms
In yeast, the expression levels of genes were measured
by microarrays and are available from Wang et al. [62]; the
histone modification data were generated by Pokholok et
al. [63]. In fruit fly, the gene expression and chromatin
data at 12 different developmental stages were obtained
by using RNA-seq and ChIP-seq experiments, respectively,
which are available from the modENCODE web-

transcripts.
Additional file 4: Correlation patterns of chromatin features with
gene expression at the L3 stage. Correlation was calculated based on
long transcripts (>8 kb).
Additional file 5: Correlation patterns of chromatin features with
gene expression at the EEMB stage based on single-transcript
genes only.
Additional file 6: Prediction of gene expression using chromatin
features in all the 40 bins around the TSS (from -2 kb to 2 kb). (a)
ROC curve of the SVM classification model. (b) Predicted expression
levels versus actual expression levels measured by RNA-seq experiment.
PCC, Pearson correlation coefficient.
Additional file 7: Interaction between all possible pairs of histone
modifications. Interaction between all possible pairs of histone
modification as indicated by linear model in bin 1. For each pair, both
the results of linear models with the interaction terms (Interaction
models) and without the interaction terms (Singleton models) are
shown.
Additional file 8: The significant interactions between chromatin
features based on a linear model. The significant interactions between
chromatin features based on a linear model with 12 different chromatin
features and their pairwise interaction terms.
Additional file 9: Mutual information between expression and
pairwise histone modification signals. For each pair of histone
modifications (denoted as H1, H2), the heat map shows the normalized
mutual information I(E, H1 AND H2)/max(I(E,H1),I(E,H2)). For pairs such as
H3K4me2 and H3K36me3, the combination of two features gives a
higher predictive power than the two individual features.
Additional file 10: Interactions among chromatin features and
expression. (a) Node colors indicate the correlation of the

RNA-seq: RNA-sequencing; ROC: receiver operating characteristic; SVM:
support vector machine; SVR: support vector regression; TSS: transcription
start site; TTS: transcription termination site.
Acknowledgements
This work was supported by the NHGRI modENCODE project and the AL
Williams Professorship funds. We thank Jason Lieb, Robert Waterston and
Frank Slack for their comments and suggestions.
Author details
1 Department of Molecular Biophysics and Biochemistry, Yale University, 260
Whitney Avenue, New Haven, CT 06520, USA. 2 Department of Computer
Science and Engineering, The Chinese University of Hong Kong, Rm 1006,
Ho Sin-Hang Engineering Bldg, Shatin, New Territories, Hong Kong.
3 Program in Computational Biology and Bioinformatics, Yale University, 260
Whitney Avenue, New Haven, CT 06520, USA. 4 Department of Computer
Science, Yale University, PO Box 208285, New Haven, CT 06520, USA.
Authors’ contributions
CC and MG conceived and designed the study. CC and KKY performed the
full analysis. CC, KKY, KYY, RA, JR, CS and MG wrote the manuscript.
Received: 21 December 2010 Revised: 26 January 2011
Accepted: 16 February 2011 Published: 16 February 2011
References
1. Li B, Carey M, Workman JL: The role of chromatin during transcription.
Cell 2007, 128:707-719.
2. Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR,

modification maps. Nat Rev Genet 2007, 8:286-298.
11. Berger SL: The complex language of chromatin regulation during
transcription. Nature 2007, 447:407-412.
12. Khan AU, Krishnamurthy S: Histone modifications as key regulators of
transcription. Front Biosci 2005, 10:866-872.
13. Schubeler D, MacAlpine DM, Scalzo D, Wirbelauer C, Kooperberg C, van
Leeuwen F, Gottschling DE, O’Neill LP, Turner BM, Delrow J, Bell SP,
Groudine M: The histone modification pattern of active genes revealed
through genome-wide chromatin analysis of a higher eukaryote. Genes
Dev 2004, 18:1263-1271.
14. Bernstein BE, Kamal M, Lindblad-Toh K, Bekiranov S, Bailey DK, Huebert DJ,
McMahon S, Karlsson EK, Kulbokas EJ, Gingeras TR, Schreiber SL, Lander ES:
Genomic maps and comparative analysis of histone modifications in
human and mouse. Cell 2005, 120:169-181.
15. Liu CL, Kaplan T, Kim M, Buratowski S, Schreiber SL, Friedman N, Rando OJ:
Single-nucleosome mapping of histone modifications in S. cerevisiae.
PLoS Biol 2005, 3:e328.
16. Millar CB, Grunstein M: Genome-wide patterns of histone modifications in
yeast. Nat Rev Mol Cell Biol 2006, 7:657-666.
17. Kurdistani SK, Tavazoie S, Grunstein M: Mapping global histone acetylation
patterns to gene expression. Cell 2004, 117:721-733.
18. Wang Z, Zang C, Rosenfeld JA, Schones DE, Barski A, Cuddapah S, Cui K,
Roh TY, Peng W, Zhang MQ, Zhao K: Combinatorial patterns of histone
acetylations and methylations in the human genome. Nat Genet 2008,
40:897-903.
19. Ercan S, Giresi PG, Whittle CM, Zhang X, Green RD, Lieb JD: X chromosome
repression by localization of the C. elegans dosage compensation

2009, 523:341-366.
32. Schones DE, Zhao K: Genome-wide approaches to studying chromatin
modifications. Nat Rev Genet 2008, 9:179-191.
33. Katayama S, Tomaru Y, Kasukawa T, Waki K, Nakanishi M, Nakamura M,
Nishida H, Yap CC, Suzuki M, Kawai J, Suzuki H, Carninci P, Hayashizaki Y,
Wells C, Frith M, Ravasi T, Pang KC, Hallinan J, Mattick J, Hume DA,
Lipovich L, Batalov S, Engstrom PG, Mizuno Y, Faghihi MA, Sandelin A,
Chalk AM, Mottagui-Tabar S, Liang Z, Lenhard B, et al: Antisense
transcription in the mammalian transcriptome. Science 2005,
309:1564-1566.
34. Baugh LR, Demodena J, Sternberg PW: RNA Pol II accumulates at
promoters of growth genes during developmental arrest. Science 2009,
324:92-94.
35. Core LJ, Waterfall JJ, Lis JT: Nascent RNA sequencing reveals widespread
pausing and divergent initiation at human promoters. Science 2008,
322:1845-1848.
36. Seila AC, Calabrese JM, Levine SS, Yeo GW, Rahl PB, Flynn RA, Young RA,
Sharp PA: Divergent transcription from active promoters. Science 2008,
322:1849-1851.
37. Bender LB, Suh J, Carroll CR, Fong Y, Fingerman IM, Briggs SD, Cao R,
Zhang Y, Reinke V, Strome S: MES-4: an autosome-associated histone
methyltransferase that participates in silencing the X chromosomes in
the C. elegans germ line. Development 2006, 133:3907-3917.
38. Takasaki T, Liu Z, Habara Y, Nishiwaki K, Nakayama J, Inoue K, Sakamoto H,
Strome S: MRG-1, an autosome-associated protein, silences X-linked
genes and protects germline immortality in Caenorhabditis elegans.
Development 2007, 134:757-767.
39. Blumenthal T, Evans D, Link CD, Guffanti A, Lawson D, Thierry-Mieg J,
Thierry-Mieg D, Chiu WL, Duke K, Kiraly M, Kim SK: A global analysis of

50. Sims RJ, Reinberg D: Is there a code embedded in proteins that is
based on post-translational modifications? Nat Rev Mol Cell Biol 2008,
9:815-820.
51. Schreiber SL, Bernstein BE: Signaling network model of chromatin.
Cell 2002, 111:771-778.
52. Ng HH, Robert F, Young RA, Struhl K: Targeted recruitment of Set1
histone methylase by elongating Pol II provides a localized mark and
memory of recent transcriptional activity. Mol Cell 2003, 11:709-719.
53. Li J, Moazed D, Gygi SP: Association of the histone methyltransferase
Set2 with RNA polymerase II plays a role in transcription elongation. J
Biol Chem 2002, 277:49383-49388.
54. Fischer JJ, Toedling J, Krueger T, Schueler M, Huber W, Sperling S:
Combinatorial effects of four histone modifications in transcription and
differentiation. Genomics 2008, 91:41-51.
55. Fuchs SM, Laribee RN, Strahl BD: Protein modifications in transcription
elongation. Biochim Biophys Acta 2009, 1789:26-36.
56. Chambeyron S, Bickmore WA: Chromatin decondensation and nuclear
reorganization of the HoxB locus upon induction of transcription. Genes
Dev 2004, 18:1119-1130.
57. modENCODE. [http://www.modencode.org].
58. WormBase. [http://www.wormbase.org].
59. Harris TW, Antoshechkin I, Bieri T, Blasiar D, Chan J, Chen WJ, De La Cruz N,
Davis P, Duesbury M, Fang R, Fernandes J, Han M, Kishore R, Lee R,
Muller HM, Nakamura C, Ozersky P, Petcherski A, Rangarajan A, Rogers A,
Schindelman G, Schwarz EM, Tuli MA, Van Auken K, Wang D, Wang X,
Williams G, Yook K, Durbin R, Stein LD, et al: WormBase: a comprehensive
resource for nematode research. Nucleic Acids Res 2010, 38:D463-467.
60. miRBase. [http://www.mirbase.org].
