Domain deletions and substitutions in the modular protein
evolution
January Weiner 3rd, Francois Beaussart and Erich Bornberg-Bauer
Division of Bioinformatics, School of Biological Sciences, The Westfalian Wilhelms University of Münster, Germany
Proteins are well known to evolve not only by point
mutations, but also by modular rearrangements [1–
3]. By and large, these rearrangements occur at the
level of domains, which are independent folding units
and have been proposed to represent the unit of
modular evolution [3,4]. Most domains always form
the same combinations; that is, they are always
found next to the same neighbours. For example,
domains found in ribosomal proteins are not found
elsewhere and are always present in the same context. Also, it has been reported that many domains
appear in a very much conserved order (suprado-
mains) [5], and that the frequent occurrence of cer-
tain modular arrangements (arrangements of modules
along a sequence) across phyla is the result of con-
servation [6].
While few domains co-occur with many others at
least once in the same protein, most domains have few
partner domains, or are even always singletons [3,7–9].
Well-known examples of highly linked domains occur-
ring in many different combinations are the P-loop
nucleotide triphosphate hydrolase domain, the epider-
mal growth factor (EGF) domain, the SH3 domain,
the P-kinase domain and the domains involved in the
blood clotting cascade [1,10].
more frequent at the ends of proteins. We showed that losses can be
explained by the introduction of start and stop codons which render the
terminal domains nonfunctional, such that further shortening, until the
whole domain is lost, is not evolutionarily selected against. We demon-
strated that domains which also occur as single-domain proteins are less
likely to be lost at the N terminus and in the middle than at the C terminus. We conclude that fission/fusion events with single-domain
proteins occur mostly at the C terminus. We found that domain substitutions are rare, in particular in the middle of proteins. We also showed
that many cases of substitutions or losses result from erroneous annota-
tions, but we were also able to find courses of evolutionary events where
domains vanish over time. This is explained by a case study on the bac-
terial formate dehydrogenases.
Abbreviations
Domain ID, domain identification number; EGF, epidermal growth factor; FDHF, formate dehydrogenase H.
FEBS Journal 273 (2006) 2037–2047 © 2006 The Authors Journal compilation © 2006 FEBS
modules or small arrangements are being transferred
from one protein to another. Considering that often
two modules or larger arrangements as such are
fused into one protein, it becomes difficult to define
which of the modules is ‘mobile’ and which is ‘static’. Therefore, it has been suggested that the term
versatility should be used instead of domain mobility
[3,12]. Independently of the perspective taken, the
underlying mechanisms of modular rearrangements
are mostly gene fusion and domain loss and, prob-
ably to a lesser extent, domain shuffling of exons
and recombination [13–17].
While the emergence of domain combinations is well
documented [4,6,7,18–21], relatively little is known
frequent. For that purpose, we categorized and des-
cribed misannotations of domains to discern them
from real substitutions or deletions of domains. Next,
we studied whether some domains are more often lost
and whether frequencies of domain deletions depend
on domain versatility. Finally, we discussed the impli-
cations of our results for a wider understanding of
modular protein evolution and the possibilities for gen-
erating a model in which modular protein evolution is
formally described in terms of module edit operations
and cost functions.
Results and Discussion
Single domain deletions
The first question we asked was whether the probabil-
ity of a domain deletion is evenly distributed through-
out a protein. The null hypothesis was that genetic
mechanisms which lead to domain deletions (for exam-
ple, deletions and insertions of sequence fragments,
intron recombinations, etc.) do not depend on the
position within the sequence. However, two factors
could cause a bias. First, any point mutation that cre-
ates a premature stop codon will cause a C-terminal
deletion of a protein. Likewise, a mutation leading to
the emergence of an alternative transcription or trans-
lation start will cause an N-terminal deletion. Second,
a fission producing two genes from one will result in
the deletion of a terminal fragment from a protein or,
vice versa, a fusion of two smaller proteins into one
will result in the observed pattern.
We first grouped proteins by the number of domains
half of the domains of the full length arrangement was
preserved, to ensure that homologous arrangements
were being compared. The results were similar to those
of single domain deletions, in that the terminal dele-
tions were prevalent (see the Supplementary Material).
In many cases, a deleted domain is a part of a lar-
ger, deleted fragment. We have found that fragments
deleted at either terminus are, in general, much longer
than fragments deleted within a protein sequence. The
deletions within the protein are much more often single
domain deletions (Fig. 2). The total number of dele-
tions that concern only one, single domain, is higher
for the positions between the termini. However, the
number of major deletions (deletions that span more
than one domain) is higher at terminal positions. This
supports the view that the deletions generally involve
the protein termini.
In-detail analysis of the deletion events
During our analyses, we noted that some of the appar-
ent domain deletions are actually just misannotations.
A lack of a domain identifier at a given position in a
protein annotation does not necessarily mean that the
corresponding domain is physically deleted. Likewise,
a different identifier does not necessarily signify a
physical substitution. To address this problem, we con-
structed clusters of similar proteins that contained at
Fig. 1. Statistics of single domain deletions in the whole SwissProt/TrEMBL set of proteins. The figure shows the relative proportion of
domain deletions (y-axis: proportion of domains deleted) at different positions (x-axis: position) within the proteins of length 4, 6, 10 and 11 domains. Dark grey, Pfam; light grey, ProDom.
Fig. 2. Number of occurrences of domain deletions (y-axis) as a function of
the length, in domains, of the deleted fragment (x-axis). Diamonds, N-terminal deletions; squares, deletions within the protein; circles, C-terminal deletions. Single domain losses occur preferentially at one of
the middle positions, whereas longer fragments tend to be deleted
at the termini.
least six ProDom domains. We aligned the domain
arrangements within a cluster using a simple progres-
sive multiple alignment algorithm [24], based on
pairwise alignments generated using the Needleman-
Wunsch algorithm [25] (Supplementary material).
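The pairwise step can be illustrated with a minimal sketch of Needleman-Wunsch global alignment applied to domain arrangements, where each arrangement is a list of domain identifiers instead of a list of amino acids. The function name `align_arrangements` and the scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen for illustration; the scoring scheme actually used is described in the Supplementary material and is not reproduced here.

```python
def align_arrangements(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two domain arrangements (sequences of domain IDs).

    Returns two equally long lists; None marks a gap (an apparently
    missing domain). Scoring values are illustrative assumptions.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic programming matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append(None)
            i -= 1
        else:
            row_a.append(None)
            row_b.append(b[j - 1])
            j -= 1
    return row_a[::-1], row_b[::-1]
```

Aligning the arrangement ABCDEF with ABDEF places a gap opposite domain C, which is the kind of column that is then inspected to decide whether the domain is physically deleted or merely not annotated.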
We were able to distinguish five types of phenom-
ena that resulted in an apparent deletion from the
domain arrangement (Table 1, Fig. 3). The first two
were real substitutions and physical deletions of
[Fig. 3: schematic of the event types; caption not recovered. Recoverable panel labels: B, shadow domain; C, physical deletion; D, camouflage; E, erosion.]
physical deletions, shadow domains and erosions, the
numbers of these events were simply counted. How-
ever, in the case of substitutions and camouflage, it
is not reasonable to count the number of occur-
rences of such an event without inferring a direction
of the substitution. For example, if at a certain posi-
tion in a cluster, domain A occurs in two sequences,
and each of the domains B and C occurs five times,
then what frequency of the substitutions should be
assumed here? We have used the following routine:
all possible pairwise combinations of domains from
different proteins occurring at the same domain posi-
tion in a cluster were analysed. If the two domains
in a pair were different, then an event (substitution
or camouflage) was recorded. Therefore, the calculated numbers of substitution and camouflage events
cannot be used to draw any conclusions about the actual substitution rate of domains; however, because the
numbers of camouflage and substitution events have been calculated in the same way at all domain positions,
the relative frequencies of the camouflage and substitution events at different positions can be inferred.
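The pairwise counting routine can be sketched as follows. This is a minimal illustration, not the original implementation: the function name `count_differing_pairs` is hypothetical, and the sketch only counts differing pairs at one aligned position. Deciding whether a differing pair is a substitution or a camouflage additionally requires a sequence comparison, which is not shown.

```python
from itertools import combinations


def count_differing_pairs(column):
    """Count pairs of proteins whose domains differ at one aligned position.

    `column` holds one domain identifier per protein in the cluster.
    """
    return sum(1 for x, y in combinations(column, 2) if x != y)


# The example from the text: A occurs twice, B and C five times each.
column = ["A"] * 2 + ["B"] * 5 + ["C"] * 5
events = count_differing_pairs(column)  # 2*5 + 2*5 + 5*5 = 45
```

For the example in the text, 45 of the 66 possible pairs differ, which shows why these counts grow with cluster size and are only meaningful when compared between positions.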
The relative frequencies of physical domain dele-
tions, substitutions and shadow domains are all
higher at the termini. The average domain deletion
frequency is 9% overall, 7% at the nonterminal positions
and 20% at the termini (Table 3). This trend cannot
be seen in the case of annotation artefacts (Fig. 4,
Table 3). Furthermore, annotation artefacts are 10
times rarer than real, physical events (Table 3).
We tackled this problem as follows. We have con-
structed clusters of proteins. Each cluster contained
proteins with the same domain arrangement, or with
an arrangement shortened by a terminal domain dele-
tion, either N terminal or C terminal. We recorded
the length of the N- or C-terminal amino acid
Fig. 4. Results of the protein clusters analysis: relative percentages
of different evolutionary events and annotation artefacts (panels: ‘Evolutionary events’ and ‘Annotation artefacts’) at different
domain positions within the analysed sequences. Error bars indicate the standard error of the calculated proportion. The values for
the ‘Middle position’ were averaged from the values for all nonterminal positions.
sequence and plotted the distribution of its length
(see the Materials and methods for details). The
lengths were normalized for every protein cluster and
then averaged for evaluation. A length of 0 corres-
ponds to the case when the terminal domain is com-
pletely deleted, and 100 to the average length of the
terminal domain in the whole cluster. Furthermore,
we refined these results by counting only the protein
sequence fragments that are similar, at the amino acid
sequence level, to the remaining sequence of the dele-
ted domain, given one of two E-value thresholds.
These E-values between those fragments and the
intact domain were recorded and put in three bins,
each for a different range of E-values (any E-value,
0 ≤ E ≤ 0.01; 0 ≤ E ≤ 1 × 10
Fig. 5. Length distributions of the remaining
fragment from a terminal domain (panels: ‘Pfam, N-terminus’ and ‘Pfam, C-terminus’; y-axis: number of occurrences). Distribution of the length of the terminal sequences
is based on comparison of domain arrangement alignments. Left, distribution on
the N-termini; right, distribution on the
C-termini. The lengths are relative to the size
of the deleted domain (= 100%). White bars,
all terminal fragments; light grey, terminal
fragments similar to the deleted domain
(E < 0.01); dark grey, terminal fragments
recorded how often domains that are lost form single-
domain proteins.
First, we calculated the fraction of domains that also
occur as single-domain genes in the sets of domains
that are deleted at an N-terminal, C-terminal or central position. We found that the domains which also
occur as single-domain proteins are found two to four
times more frequently at the termini, and twice as frequently at the C terminus as at the N terminus
(Table 2). Surprisingly, the average fraction of
domains that also occur as single-domain genes is
lower for the domains that partake in deletion events
than the average for all domains.
The ability of a domain to form autonomous, sin-
gle-domain proteins may be related to its versatility.
We have therefore calculated the domain connectivity
and found that it is highest for the nonterminal
domains. However, as the domains at a nonterminal
position have, on average, two neighbours, whereas
the terminal domains have only one, the averages for
this type of domains must be halved. In that case,
the percentages of domains that form autonomous,
single-domain proteins are higher for domains that
undergo deletions at the termini, and lower for
domains that undergo deletions at a nonterminal
position (Table 2). Again, the numbers of domains
that form autonomous, single-domain proteins are
highest for the domains that are deleted at the
accumulation of mutations in a domain that it is no
longer similar to the original sequence.
There are three variable regions in the domain
arrangement of the protein cluster. First, at position 6
in the arrangement, in some proteins there are similar
sequences that were not annotated in ProDom (‘ero-
sion’) or domains which were annotated differently
because of high sequence divergence (‘camouflage’).
Next, at position 8, there is a substitution in two of
the sequences. Finally, the C-terminal part is missing,
truncated or eroded in many sequences, for example in
the illustrated structure (Fig. 6A,B).
Conclusions
Our main conclusions are as follows: (a) domain deletion events occur frequently at either of the termini,
(b) the deletions occur domain-wise; that is, in most of
the cases the whole domain is lost, (c) domain losses
correlate with domain versatility (i.e. the number of
different combinations in which a domain occurs), (d)
versatile domains are more frequently found at the
C terminus and (e) clear definitions can be given to
distinguish misannotations from physical deletions.
Ultimately, the question ‘What is the probability of a
domain deletion?’ can only be answered using domain
phylogenies. However, our study shows that the dele-
tion events are quite frequent; in the collected protein
clusters, the frequencies of proteins in a cluster with a
domain deleted at either of the termini were ≈ 9%
Table 3. Results of the analysis of protein clusters for the ProDom
database. Numbers in the table correspond to the absolute num-
domains and their ability to form single-domain pro-
teins, we have found that, while gene fusion and fission
indeed play a significant role in the deletion events at
the termini, the introduction of new start and stop codons
also plays a major role. The fraction of the deleted
domains that can be found as single-domain
proteins was twice as high at the C terminus (Table 2),
as was the connectivity of the C-terminally deleted
domains. This suggests that in a gene fusion or fission
event, the versatile, single-domain protein is more
likely to be found at the C terminus. This may be
explained by the fact that in a gene fusion/fission
event, or in the case of introduction of new start and
stop codons, the N-terminal part of the coding
sequence remains connected to its promoter region and
regulatory sites. Thus, a versatile domain that is fused
with the C terminus of a much larger protein will not
have an effect on the regulation of the whole protein,
because it will not modify the promoter region and
regulatory sites. Our results suggest such a selective
disequilibrium: the function (and regulation) of the
protein is connected to its N-terminal part, and therefore
the fusion/fission events involving smaller, versatile
domains will occur more frequently at the
C terminus.
Moreover, we have found that the event of domain
deletion occurs mostly in a modular manner. This can
have two explanations. First, the apparent domain
deletion can be caused by gene fusion or fission. Sec-
ond, a domain fragment truncated (e.g. by a nonsense
ent ProDom domains and are the same as
on (A). The black thin boxes on position 6
correspond to ‘shadow domains’.
Materials and methods
For the analyses, ProDom [22] version 2004.1 was used. The
main results were confirmed using Pfam, release 17 [29].
Each database contains a number of domain arrangements,
that is, proteins annotated in terms of domains. All supplementary
materials can be found on our web page (http://www.uni-muenster.de/Bioinformatics/services/domdel/).
Overall single deletion statistics
Proteins from the ProDom database and, separately, from
the Pfam database, were divided into sets according to the
number of domains. Each set contained all proteins with a
fixed number of domains, for example ‘set6’ contained pro-
teins with six domains.
Each protein from a given set containing proteins of
length N domains was compared with each protein from
the set containing proteins of length N−1 domains. For
example, a protein with six domains was compared with all
proteins that have five domains. If the shorter arrangement
was identical to the longer one, with the exception of a sin-
gle, missing domain, a deletion was registered. The position
of the deletion within the domain arrangement was recorded.
For example, the five-domain arrangement
ABDEF (where A to F are domains) is identical to the
six-domain arrangement ABCDEF, with the exception of
the deleted domain C.
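The comparison of N-domain with (N−1)-domain arrangements can be sketched as below. This is an illustrative reading of the procedure, not the original code: the function names are hypothetical, arrangements are represented as strings with one letter per domain, and when a domain is repeated (for example AAB versus AB) the sketch simply reports the first matching position, a choice the text does not specify.

```python
from collections import Counter


def deletion_position(longer, shorter):
    """Return the 1-based position of the single domain missing from
    `shorter`, or None if the two arrangements do not differ by exactly
    one deleted domain."""
    if len(longer) != len(shorter) + 1:
        return None
    for k in range(len(longer)):
        # Remove the k-th domain and compare with the shorter arrangement.
        if longer[:k] + longer[k + 1:] == shorter:
            return k + 1
    return None


def count_deletions(set_n, set_n_minus_1):
    """Tally deletion positions over all pairs from the two sets."""
    counts = Counter()
    for longer in set_n:
        for shorter in set_n_minus_1:
            pos = deletion_position(longer, shorter)
            if pos is not None:
                counts[pos] += 1
    return counts


# ABCDEF versus ABDEF: domain C at position 3 is deleted.
example = deletion_position("ABCDEF", "ABDEF")
```

Dividing each tally by the number of comparisons made for that protein length would give proportions of the kind plotted per position in Fig. 1; that normalization is an assumption about how the figure was derived.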
one domain less than the given protein. If a given protein
matched the examined arrangement by all but one domain,
a deletion event was recorded. Starting with a single protein,
a number of hits was recorded and added to the cluster; fur-
thermore, these proteins were used to obtain the next set of
hits (i.e. proteins that have one domain less than the protein
that was used in the search). The procedure stopped for a
given cluster when no further similar domain arrangements
were found. Only clusters containing at least 10 proteins
and 10 ProDom domains were used for further analysis.
Additionally, the amino acid sequences of all the sequences
in the cluster were collected. The resulting clusters were sub-
sequently aligned with a simple multiple-domain arrange-
ment alignment algorithm (progressive alignment). The
length (in terms of domains) of a cluster was defined as the
length of the multiple-domain arrangement alignment.
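The iterative cluster construction can be sketched as a breadth-first search over arrangements. This is a simplified illustration under stated assumptions: `build_cluster` and `is_single_deletion` are hypothetical names, the sketch works on unique arrangements (strings with one letter per domain) rather than on individual proteins, and the size filter (at least 10 proteins and 10 ProDom domains) and the subsequent alignment step are omitted.

```python
from collections import deque


def is_single_deletion(longer, shorter):
    """True if `shorter` equals `longer` with exactly one domain removed."""
    return len(longer) == len(shorter) + 1 and any(
        longer[:k] + longer[k + 1:] == shorter for k in range(len(longer))
    )


def build_cluster(seed, arrangements):
    """Collect all arrangements reachable from `seed` by repeatedly
    removing one domain at a time, using only arrangements that occur
    in `arrangements`."""
    cluster = {seed}
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        for candidate in arrangements:
            if candidate not in cluster and is_single_deletion(current, candidate):
                cluster.add(candidate)
                queue.append(candidate)  # hits seed the next round of searches
    return cluster


database = ["ABCDEF", "ABDEF", "ABCDE", "ABDE", "BDE", "XYZ"]
cluster = build_cluster("ABCDEF", database)
```

The search stops on its own once no further arrangement with one domain less is found, which mirrors the stopping condition described in the text.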
Calculation of the relative event frequency at
different domain positions in protein clusters
For each of the events, e, and for each of the sets of clus-
ters of a given length, l, the frequency of the event at a
position, k, was defined as:
$$f_{e,k} = \frac{n_{e,k}}{\sum_{i=1}^{l} n_{e,i}}$$
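As a small numerical illustration of this definition, the sketch below divides the count of an event at each position by the total count of that event over all l positions of the cluster length. The reading of n as "number of events of type e recorded at a position" is an assumption, because the sentence defining it is not preserved in this text; the function name is hypothetical.

```python
def relative_frequencies(counts):
    """Convert per-position event counts (positions 1..l) into relative
    frequencies that sum to 1. Returns zeros if no event was recorded."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [n / total for n in counts]


# Hypothetical counts of one event type at positions 1..3.
freqs = relative_frequencies([2, 1, 1])  # [0.5, 0.25, 0.25]
```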
database. Only alignments which contained at least one
complete sequence and one sequence with a missing domain
(depending on the set, either N- or C terminal) were consid-
ered.
For each alignment in each set, the average size of the
deleted domain was calculated for the proteins with the
complete arrangement. To take into account the variability
of the length of the complete domain, the length of the
N-terminal fragment was defined as the length of the
amino acid sequence preceding the next domain in
the arrangements, expressed as the percentage of the calcu-
lated average length of the deleted domain in this align-
ment. Finally, the distribution of these values throughout
all of the analysed alignments was calculated.
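The normalization described above can be sketched as follows. The function name and the input layout are assumptions for illustration: one list holds the lengths (in amino acids) of the terminal domain in proteins with the complete arrangement, the other holds the lengths of the terminal fragments in proteins where that domain is missing from the annotation.

```python
from statistics import mean


def relative_fragment_lengths(complete_domain_lengths, fragment_lengths):
    """Express each terminal fragment length as a percentage of the
    average length of the deleted domain in the same alignment.

    0 means the domain is completely gone; 100 means a fragment as long
    as the average intact domain.
    """
    average = mean(complete_domain_lengths)
    return [100.0 * length / average for length in fragment_lengths]


# Hypothetical alignment: intact domains of 100, 120 and 80 residues.
percentages = relative_fragment_lengths([100, 120, 80], [0, 50, 100])
```

Pooling these percentages over all alignments gives a distribution of the kind summarized in Fig. 5.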
References
1 Patthy L (1999) Protein Evolution. Blackwell Science,
Oxford.
2 Liu J & Rost B (2004) CHOP: parsing proteins into
structural domains. Nucleic Acids Res 32, W569–W571.
3 Bornberg-Bauer E, Beaussart F, Kummerfeld S, Teich-
mann S & Weiner J 3rd (2005) The evolution of domain
arrangements in proteins and interaction networks. Cell
Mol Life Sci 62, 435–445.
4 Vogel C, Teichmann S & Pereira-Leal J (2005) The
relationship between domain duplication and recombi-
nation. J Mol Biol 346, 355–365.
5 Vogel C, Berzuini C, Bashton M, Gough J & Teichmann S (2004) Supra-domains: evolutionary units larger
than single protein domains. J Mol Biol 336, 809–823.
lar evolution of DNA methyltransferases. BMC Evol
Biol 2, 3.
17 Weiner J 3rd, Thomas G & Bornberg-Bauer E (2005)
Rapid motif-based prediction of circular permutations
in multi-domain proteins. Bioinformatics 21, 932–937.
18 Apic G, Gough J & Teichmann S (2001) Domain com-
binations in archaeal, eubacterial and eukaryotic pro-
teomes. J Mol Biol 310, 311–325.
19 Bashton M & Chothia C (2002) The geometry of
domain combination in proteins. J Mol Biol 315, 927–
939.
20 Vogel C, Bashton M, Kerrison N, Chothia C & Teich-
mann S (2004) Structure, function and evolution of
multidomain proteins. Curr Opin Struct Biol 14, 208–
216.
21 Kummerfeld S & Teichmann S (2005) Relative rates of
gene fusion and fission in multi-domain proteins. Trends
Genet 21, 25–30.
22 Corpet F, Servant F, Gouzy J & Kahn D (2000) Pro-
Dom and ProDom-CG: tools for protein domain analy-
sis and whole genome comparisons. Nucleic Acids Res
28, 267–269.
23 Zhang Y, Chandonia J, Ding C & Holbrook S (2005)
Comparative mapping of sequence-based and structure-
based protein domains. BMC Bioinformatics 6, 77.
24 Feng D & Doolittle R (1987) Progressive sequence
alignment as a prerequisite to correct phylogenetic trees.
J Mol Evol 25, 351–360.
25 Needleman S & Wunsch C (1970) A general method
applicable to the search for similarities in the amino
domain arrangements.
This material is available as part of the online article
from