
Prediction of coenzyme specificity in dehydrogenases/reductases
A hidden Markov model-based method and its application on complete genomes
Yvonne Kallberg1,2 and Bengt Persson1,2
1 IFM Bioinformatics, Linköping University, Sweden
2 Centre for Genomics and Bioinformatics, Karolinska Institutet, Stockholm, Sweden
Dehydrogenases and reductases are enzymes of fundamental
metabolic importance that utilize coenzymes for electron
transport (NAD(H), NADP(H) or FAD(H2), herein denoted NAD,
NADP and FAD). The enzymes bind the coenzyme through a
double βαβαβ fold, resulting in a six-stranded β-sheet
surrounded by α-helices, known as the Rossmann fold [1].
This domain is often found in combination with other
domains of different folding types either on the N-terminal
side, C-terminal side, or interrupting the Rossmann fold [2].
For example, glutathione reductases have two domains of the
Rossmann-fold type, one FAD-binding domain that is
interrupted by an NAD(P)-binding domain (PDB code 3grs [3]).
6-Phosphogluconate dehydrogenases have an NADP-binding
domain of the Rossmann-fold type followed by a
Keywords

kingdoms, only 3–8% of the Rossmann proteins are predicted to have
more than one membrane-spanning segment, which is much lower than the
frequency of membrane proteins in general. Analysis of the major protein
types in eukaryotes reveals that the most common type (26%) of the Ross-
mann proteins are short-chain dehydrogenases ⁄ reductases. In addition, the
identiﬁed Rossmann proteins were analyzed with respect to further protein
types, enzyme classes and redundancy. The described method is available
at where the preferred coenzyme and its
binding region are predicted given an amino acid sequence as input.
Abbreviations
ORF, open reading frame; SDR, short-chain dehydrogenase ⁄ reductase; TM, transmembrane.
FEBS Journal 273 (2006) 1177–1184 © 2006 The Authors Journal compilation © 2006 FEBS
C-terminal catalytic domain consisting of α-helices
only (PDB code 2pgd [4]).
In the first part of the Rossmann fold (β1α1β2), there
are three glycine residues surrounded by hydrophobic
residues, with the first glycine at the end of the β1
strand and the other two at the beginning of the α1
helix (Fig. 3, top right, Experimental procedures). The
first two glycine residues are involved in dinucleotide
binding, while the third is involved in the close packing
of the β-strands and the α-helix [5]. Most of the early

-strand, while FAD-preferring enzymes
instead have a glutamic acid residue at this position.
However, there are exceptions in both cases that prevent
this feature from being used to differentiate between
the two types.
We have now developed a method that from the
amino acid sequence alone identiﬁes a protein with
coenzyme binding of the Rossmann type, and predicts
the coenzyme speciﬁcity. The method is applied to all
eukaryotic and archaeal genomes and a representative
set of bacterial genomes.
Results and discussion
We have developed a method for prediction of coen-
zyme speciﬁcity, based upon hidden Markov models
(HMMs) and sequence motifs (see Experimental proce-
dures). To the best of our knowledge there is no pre-
diction method available with the same applicability as
the one presented here. A search in InterPro [9] using
key words such as ‘Rossmann’, ‘NAD’, ‘NADP’ and
‘FAD’ reveals many entries but there is no single entry
which can be used to identify the motifs of interest.
While most entries are on protein family level, there
are some on domain level as well, e.g. ‘NAD_BS’
(identiﬁer IPR000205) which identiﬁes NAD binding
sites. However, this motif only identiﬁes 29 gene prod-
ucts in the human Ensembl [10] database, a number
far below what could be expected.
Rossmann fold in completed genomes
The new method was applied to a selection of 68 com-
pleted genomes, representing archaea, bacteria and

Fig. 1. Number of coenzyme binding proteins (Rossmann folds)
in each genome plotted versus number of open reading frames
(ORFs) for Archaea, Bacteria and Eukaryota. The number of
Rossmann folds increases steeply for genomes with up to
10 000 ORFs, while it levels out for larger genomes.
coenzyme binding proteins than the others (655 and
646, respectively), but given the size of their genomes
(~61 000 and ~53 000) the proportions are still within
the same range as for other eukaryotes. There are four
eukaryotic parasites (Plasmodium falciparum, Plasmodium
yoelii, Leishmania major and Entamoeba histolytica) for
which the ratio of coenzyme binding proteins is
much lower than expected, possibly due to their ability
to rely on the dehydrogenase/reductase systems of the
host organism.
Redundancy
Prokaryotic species, with a typical maximum genome
size of 5000 ORFs, have a moderate sequence redundancy
among their coenzyme binding proteins. Using a
threshold of maximum 60% pairwise sequence identity,
0–10% of the sequences are redundant. Most of
the small eukaryotic genomes have a comparable level
of redundancy. In general, the redundancy of Rossmann
proteins is similar to that of other proteins in

insect, there is a majority of NADP-preferring enzymes
while mammals and chicken have a majority of NAD-
preferring enzymes. In a previous study of short chain
dehydrogenases ⁄ reductases (SDRs) it was found that
NADP is more frequent than NAD in human, mouse,
fruit ﬂy, worm, plant and yeast [8]. As mentioned
above, this is still valid when including all Rossmann-
fold proteins for the lower organisms, but in human
and mouse the balance is shifted and NAD is the most
frequent coenzyme.
Dual coenzyme sites
Some proteins have two Rossmann binding sites; for
example, the flavin monooxygenases with both an
FAD and an NAD binding site. Out of the ~9200 proteins
predicted to have a Rossmann fold, almost 700
have more than one such fold. For all kingdoms, the
fraction of Rossmann proteins with dual sites amounts
to 0–10%, with some exceptions. Among the eukaryotes
Entamoeba histolytica, Plasmodium falciparum and
Plasmodium yoelii the proportion is 15, 18 and 15%,
respectively. The bacterial genome of Chlamydophila
caviae also shows a dual-site proportion of 15%, while
the archaeal genomes of Thermococcus kodakaraensis
and Nanoarchaeum equitans show 17 and 20%, respectively.
These high ratios are partly caused by the low
number of Rossmann-fold proteins.
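As a rough worked check on these figures, the sketch below computes the overall dual-site proportion from the approximate totals quoted above (almost 700 of ~9200) and illustrates how a small denominator inflates the percentage. The function name and the small-genome counts are hypothetical, chosen only for illustration; they are not taken from the study.

```python
def dual_site_percentage(n_dual: int, n_rossmann: int) -> float:
    """Percentage of Rossmann-fold proteins that carry more than one fold."""
    if n_rossmann <= 0:
        raise ValueError("n_rossmann must be positive")
    return 100.0 * n_dual / n_rossmann


# Approximate totals quoted in the text: almost 700 of ~9200 proteins.
overall = dual_site_percentage(700, 9200)

# Hypothetical small genome (counts are NOT from the paper): with only
# 10 Rossmann-fold proteins, two dual-site proteins already give 20%.
small_genome = dual_site_percentage(2, 10)

print(f"overall: {overall:.1f}%")
print(f"small genome: {small_genome:.1f}%")
```

The overall value falls inside the 0–10% range reported for most genomes, which is consistent with the text.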
Protein families
Among the annotated human Rossmann proteins,
most proteins have EC numbers within main group 1
(oxidoreductases). However, there are several SDRs

                                                              Total proportion
                                                                           (%)  Human  Chimp  Mouse  Rat  Fish  Fly  Worm  Yeast  Sum
Short-chain dehydrogenases/reductases                                       26     71     62     68   67    79   57    75     13  492
FAD-dependent pyridine nucleotide-disulphide oxidoreductases                 7     17     13     16   23    11   11     8      6  105
Flavin-containing amine oxidases                                             5     18     17     12   17     5    8     5      0   82
FAD-dependent oxidoreductases                                                5     15     14     12   10     5    9     8      1   74
Zinc-containing alcohol dehydrogenases                                       4     12     12      8   11     7    5     6     11   72
Lactate/malate dehydrogenases                                                3      7      7      8   13     8    3     1      2   49
UBA/THIF-type NAD/FAD binding fold                                           3     10      8      8    8     1    4     3      3   45
Flavin-containing monooxygenases                                             3      7      7     10   10     3    2     5      1   45
D-isomer specific 2-hydroxyacid dehydrogenases                               2      6      5      4    5     7    6     1      6   40
Aldehyde dehydrogenases                                                      2      2      2      6   11     1    2     3      3   30
each family, but there are a few notable exceptions.
Rat aldehyde dehydrogenases, for instance, are almost
twice as frequent as mouse aldehyde dehydrogenases,
and FAD-dependent pyridine nucleotide-disulphide
oxidoreductases are also more numerous in rat than
in mouse. Another species that deviates from
the general pattern is yeast. In this species, the fifth
major group, zinc-containing alcohol dehydrogenases,
has almost as many members as the SDRs (Table 2).
Transmembrane regions
A number of dehydrogenases and reductases are mem-
brane-attached. The transmembrane (TM) helix can be
found in either the N-terminal part of the protein, as in
11-beta hydroxysteroid dehydrogenase type 1 [11], or

modium yoelii with one-third each, and Encephalito-
zoon cuniculi with as many as ﬁve of its six predicted
Rossmann proteins also being predicted as membrane
proteins.
The majority of proteins were found to harbor one or two TM
segments (~800 proteins vs. ~350 proteins with more than two
TM helices), with a single TM segment being most common
(~600 proteins). Positioning of the TM segments C-terminally
of the coenzyme binding site was twice as common as
N-terminal positioning. Looking at differences in TM
attachment between the various coenzyme specificities, it was
found that NADP-preferring enzymes are the most common type
to be membrane bound. Around 44% (~500 proteins) of the
Rossmann membrane proteins are NADP-preferring, which is a
larger proportion than for NADP-preferring Rossmann proteins
in general (~36%, Table 4). Conversely, NAD-preferring
membrane proteins amount to 33% (~400 proteins), which is
lower than the frequency in general (~43%, Table 4). Finally,
FAD preference is 15% (close to 200 proteins), also below
the general occurrence (~21%). Thus, NADP preference is
overrepresented, while NAD and FAD preferences are
underrepresented. Protein sequences predicted to have two or
more coenzyme binding sites were the least common to be
membrane bound, with only ~100 sequences out of ~670
predicted to have TM helices.
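The over- and underrepresentation described in this paragraph can be made explicit as a ratio between the share among membrane-bound Rossmann proteins and the share among Rossmann proteins in general. The following is a minimal sketch using only the approximate percentages quoted in the text; the function and variable names are illustrative and not part of the published method.

```python
# Approximate percentages quoted in the text.
membrane_share = {"NADP": 44.0, "NAD": 33.0, "FAD": 15.0}
overall_share = {"NADP": 36.0, "NAD": 43.0, "FAD": 21.0}


def representation_ratio(membrane: float, overall: float) -> float:
    """Ratio above 1 means over-represented among membrane proteins."""
    return membrane / overall


ratios = {
    coenzyme: representation_ratio(membrane_share[coenzyme], overall_share[coenzyme])
    for coenzyme in membrane_share
}

for coenzyme, ratio in ratios.items():
    status = "over" if ratio > 1 else "under"
    print(f"{coenzyme}: ratio {ratio:.2f} ({status}-represented)")
```

Because the inputs are rounded percentages, the ratios are only indicative; a formal test of enrichment would need the underlying counts.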
In the human genome, there are 45 Rossmann
proteins with predicted TM regions. The three main

dary structure elements. Arrows indicate positions of critical importance for coenzyme speciﬁcity prediction. In the ﬂow chart, the boxes
describe the different steps of the method.
our study demonstrates the power of sequence-based
predictions. It is our hope and belief that the presented
prediction tool will be a welcome addition to the
arsenal of analysis methods available for large scale
protein function exploration. The prediction tool is
available via where a
web form allows the user to enter one or several amino
acid sequence(s) and in return get the Rossmann-fold
prediction with estimated coenzyme preference and
position.
Experimental procedures
We have developed a method which identifies coenzyme
binding regions in proteins, and also predicts whether the
specificity is FAD, NAD or NADP. The method is based upon a
combination of HMMs and sequence motif matching as
outlined in Fig. 3. The HMMs are used to extract a number
of potential hits, which are subsequently subjected to a
filtering process followed by prediction of coenzyme
specificity. During the development phase, different
combinations of HMMs were tried: one for each type of
specificity, one for all, and one for FAD-binding combined
with one for NAD(P)-binding proteins. The latter was found
to be the best solution in terms of specificity and
selectivity. All HMMs were developed using the hmmbuild
command in HMMer [17], with the parameters -F and --fast,
followed by the hmmcalibrate command.
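The model-building step can be sketched as a pair of command lines per model. The example below only assembles and prints the commands; it assumes HMMER 2 command-line conventions (hmmbuild taking the output HMM file followed by the alignment file), and the file names are hypothetical placeholders, since the original alignments are not named in the text.

```python
import shlex
from typing import List


def hmmer2_build_commands(alignment: str, hmm_file: str) -> List[List[str]]:
    """Return the build and calibrate commands for a single model.

    Assumes HMMER 2 argument order: hmmbuild [options] <hmmfile> <alignment>.
    The options -F and --fast are the ones named in the text.
    """
    return [
        ["hmmbuild", "-F", "--fast", hmm_file, alignment],
        ["hmmcalibrate", hmm_file],
    ]


# Hypothetical file names for the two models described in the text
# (one FAD-binding model, one NAD(P)-binding model).
models = {
    "fad.hmm": "fad_binding.aln",
    "nad_p.hmm": "nad_p_binding.aln",
}

for hmm_file, alignment in models.items():
    for cmd in hmmer2_build_commands(alignment, hmm_file):
        # To execute, pass cmd to subprocess.run(cmd, check=True).
        print(shlex.join(cmd))
```

Keeping the commands as argument lists rather than strings avoids shell quoting problems if file names contain spaces.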

binding proteins can be lost either during the database
search or during the classiﬁcation. Only two FAD-binding
proteins are lost (false negatives): one is classiﬁed as
NADP-binding and the other is classiﬁed as false, i.e. non-
Rossmann fold. Among the NAD-binding proteins a total
of 10 are false negatives: four are lost during the database
search, ﬁve are classiﬁed as false, and one is classiﬁed as
NADP-binding. The group with most failures is NADP-
binding proteins, with a total of 13 false negatives: eight
are lost during database search, three are classiﬁed as false,
and two are falsely predicted to be NAD-binding.
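The false-negative breakdown given above can be tallied per coenzyme class and per failure mode. The sketch below uses the counts stated in the text; the dictionary keys are labels introduced here for illustration.

```python
# False negatives per coenzyme class and failure mode, as stated in the text.
false_negatives = {
    "FAD": {"lost_in_search": 0, "classified_non_rossmann": 1, "wrong_coenzyme": 1},
    "NAD": {"lost_in_search": 4, "classified_non_rossmann": 5, "wrong_coenzyme": 1},
    "NADP": {"lost_in_search": 8, "classified_non_rossmann": 3, "wrong_coenzyme": 2},
}

# Total false negatives for each coenzyme class.
per_class = {name: sum(modes.values()) for name, modes in false_negatives.items()}

# Total false negatives for each failure mode, summed over classes.
per_mode = {}
for modes in false_negatives.values():
    for mode, count in modes.items():
        per_mode[mode] = per_mode.get(mode, 0) + count

total = sum(per_class.values())

print(per_class)
print(per_mode)
print(total)
```

Summed this way, the database search accounts for roughly half of the losses, with the remainder arising in the classification step.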
False positives, i.e. protein sequences falsely predicted to
have certain coenzyme speciﬁcities, can be of two types:
either they do not bind the coenzymes of interest or they
do but the coenzyme preference is not correctly predicted.
Initially, during the database search, 62 proteins were
picked up which do not bind any of the coenzymes of inter-
est. However, only three of them remain as false positives
after the classiﬁcation step: molybdenum cofactor biosyn-
thesis protein (1jw9, MoeB), glycinamide ribonucleotide
transformylase (1kjq, PurT), and a cell division protein
(1ofu, FtsZ). Common to all three is a Rossmann-fold-like
structure at the predicted coenzyme binding site. MoeB
and PurT are ATP-binding proteins, but while the predicted
coenzyme binding region in MoeB is in contact with ATP,
in PurT it is the substrate (glycinamide ribonucleotide)
which is in contact with the corresponding region. FtsZ is a
GTPase and its coenzyme is in contact with the region
falsely predicted to be NADP-bound. In addition to these
three there are four Rossmann-fold proteins where the
three there are four Rossmann-fold proteins where the

were false positives, yielding an overall prediction sensitivity
of 79.2%, a speciﬁcity of 99.9% and a Matthews correla-
tion coefﬁcient of 0.86 (Table 5).
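For reference, the three measures named here follow the standard definitions based on true/false positives and negatives. The sketch below implements them; the counts in the example call are hypothetical and are not the evaluation data behind the figures reported in Table 5.

```python
import math


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true binders that are correctly predicted."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-binders that are correctly rejected."""
    return tn / (tn + fp)


def matthews_cc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts, for illustration only (NOT from the paper).
tp, fn, tn, fp = 80, 20, 990, 10

print(f"sensitivity: {sensitivity(tp, fn):.3f}")
print(f"specificity: {specificity(tn, fp):.3f}")
print(f"MCC:         {matthews_cc(tp, tn, fp, fn):.3f}")
```

The Matthews coefficient is useful here because the negative set is much larger than the positive set, a situation in which specificity alone can look high even for a weak predictor.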
The method, using HMMs trained on all six groups, was
applied to 68 genomes: all available among eukaryotes (30)
and archaea (18), and a representative selection of 20
bacterial genomes. Genome sequences were downloaded from
ENSEMBL, NCBI and TIGR (ftp://ftp.tigr.org/pub/data/).
TM regions were predicted using Phobius [19], a tool
based on HMMs with the ability to differentiate between
signal sequences and true transmembrane sequences. The TM
regions were subsequently scrutinized, and in those cases
where they overlapped with a predicted Rossmann-fold region
(coenzyme binding site plus 65 residues), the transmembrane
prediction was ignored.
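The overlap filter described above can be sketched as a simple interval test. The example assumes that the "plus 65 residues" extension runs C-terminally from the end of the predicted binding site; the text does not state the direction, so this is an assumption, and the coordinates and function names are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # 1-based inclusive (start, end) residue positions


def overlaps(a: Segment, b: Segment) -> bool:
    """True if the two inclusive intervals share at least one residue."""
    return a[0] <= b[1] and b[0] <= a[1]


def filter_tm_segments(
    tm_segments: List[Segment],
    binding_site: Segment,
    extension: int = 65,
) -> List[Segment]:
    """Drop TM predictions overlapping the assumed Rossmann-fold region.

    The region is taken as the binding site extended C-terminally by
    `extension` residues (an assumption, see the lead-in).
    """
    region = (binding_site[0], binding_site[1] + extension)
    return [tm for tm in tm_segments if not overlaps(tm, region)]


# Hypothetical example: predicted binding site at residues 30-36.
tm_predictions = [(5, 25), (40, 60), (200, 220)]
kept = filter_tm_segments(tm_predictions, binding_site=(30, 36))
print(kept)
```

In this example the segment at residues 40 to 60 lies inside the extended region and is discarded, while the N-terminal and the distant C-terminal segments are retained.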
References
1 Rossmann MG, Liljas A, Brändén C-I & Banaszak LJ
(1975) In The Enzymes (Boyer PD, ed.), Vol. 11, 3rd
edn, pp. 61–102. Academic Press, New York.
2 Brenner SE, Chothia C, Hubbard TJP & Murzin AG
(1996) Understanding protein structure: using scop for
fold interpretation. Methods Enzymol 266, 635–643.
3 Schulz GE, Schirmer RH, Sachsenheimer W & Pai EF
(1978) The structure of the ﬂavoenzyme glutathione

tion in the endoplasmic reticulum membrane. J Biol
Chem 274, 28762–28770.
12 Binda C, Hubalek F, Li M, Edmondson DE & Mattevi
A (2004) Crystal structure of human monoamine oxi-
dase B, a drug target enzyme monotopically inserted
into the mitochondrial outer membrane. FEBS Lett 564,
225–228.
13 Jackson JB, Peake SJ & White SA (1999) Structure and
mechanism of proton-translocating transhydrogenase.
FEBS Lett 464, 1–8.
14 Liu J & Rost B (2001) Comparing function and struc-
ture between entire proteomes. Protein Sci 10, 1970–
1979.
15 Krogh A, Larsson B, von Heijne G & Sonnhammer EL
(2001) Predicting transmembrane protein topology with
a hidden Markov model: application to complete gen-
omes. J Mol Biol 305, 567–580.
16 Nilsson J, Persson B & von Heijne G (2005) Compara-
tive analysis of amino acid distributions in integral
membrane proteins from 107 genomes. Proteins 60,
606–616.
17 Eddy SR (1998) Profile hidden Markov models.
Bioinformatics 14, 755–763.
18 Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl
P, Levitt M & Brenner SE (2004) The ASTRAL Com-
pendium in 2004. Nucleic Acids Res 32, 189–192.
19 Käll L, Krogh A & Sonnhammer EL (2004) A combined
transmembrane topology and signal peptide pre-
