Human-blind probes and primers for dengue virus
identification
Exhaustive analysis of subsequences present in the human and
83 dengue genome sequences
Catherine Putonti1, Sergei Chumakov2, Rahul Mitra3, George E. Fox4, Richard C. Willson4,5 and Yuriy Fofanov1,4
1 Department of Computer Science, University of Houston, Houston, TX, USA
2 Department of Physics, University of Guadalajara, Guadalajara, Jalisco, Mexico
3 Genomics USA, Houston, TX, USA
4 Department of Biology and Biochemistry, University of Houston, Houston, TX, USA
5 Department of Chemical Engineering, University of Houston, Houston, TX, USA
Members of the Flavivirus genus are responsible for a
number of diseases, including yellow fever, West Nile,
St Louis encephalitis, and dengue fever. One or
more of the four serotypes of the dengue virus are
endemic in many parts of the world, including all of
south-east Asia, parts of Africa, and Southern and
Central America. The Aedes aegypti mosquito, which
prefers to feed on humans, is a carrier of the dengue
virus and is commonly found on the US Gulf Coast
according to the CDC (Centers for Disease Control
species of closely related pathogens and absent in the genomes of the host
or the organisms that contribute to the sample background. Here we des-
cribe ‘host-blind probe design’ – a novel strategy of designing probes based
on highly frequent genomic signatures found in the pathogen genomes of
interest but absent from the host genome. Upon hybridization, an array of
such informative probes will produce a unique pattern that is a genetic fin-
gerprint for each pathogen strain. This multiprobe approach was applied
to 83 dengue virus genome sequences, available in public databases, to
design and perform in silico microarray experiments. The resulting patterns
allow one to unequivocally distinguish the four major serotypes, and within
each serotype to identify the most similar strain among those that have
been completely sequenced. In an environment where dengue is indigenous,
this would allow investigators to determine if a particular isolate belongs
to an ongoing outbreak or is a previously circulating version. Using our
probe set, the probability that misdiagnosis at the serotype level would
occur is ≈ 1:$10^{150}$.
398 FEBS Journal 273 (2006) 398–408 © 2005 The Authors Journal compilation © 2005 FEBS
the field. Thus, serology has emerged as the primary
method for dengue diagnosis. Serological tests are easy
to use and able to accommodate a great number of
samples, both necessities when confronting an epi-
demic. These benefits, however, come at a cost; tests
such as hemagglutination inhibition, IgG-ELISA and
MAC-ELISA cannot easily distinguish dengue at the
serotype level and are likely to misidentify other flavi-
viruses as dengue [1,2]. Recently, specific tests have
been developed for dengue identification using nucleic
acid-based technologies [3] such as the PCR [4–14] and
particular organism or grouping. In either approach,
analysis is further complicated because viruses are obli-
gate intracellular parasites; they are found in conjunc-
tion with host cells whose DNA might contain
sequences that would interfere with the test. As separ-
ation of viral from host nucleic acids is quite difficult,
it is important that the sequences used for virus detec-
tion are absent from any potentially contaminating
DNA.
We have recently developed a set of novel algo-
rithms that make it possible to efficiently calculate
the frequency of all subsequences (n-mers) of length
5–25+ nucleotides in any sequenced genome within no
more than a few hours, depending on the genome size.
This allows exclusion of all subsequences that are
present in a selected host/background genome (e.g.
human) in the PCR primer/microarray probe design
step, which has greatly increased speed, predictability
and effectiveness compared with current design meth-
ods. The microarray format is particularly attractive as
it permits testing for multiple pathogens simulta-
neously (e.g. the set of viral pathogens causing similar
symptoms in hosts or those rampant in the same
regions in which the infection has occurred). We refer
to the sequences that are present in the genome of
interest and absent from the host genome as being
‘host-blind’ (human-blind, mosquito-blind, mouse-
blind, rat-blind, etc.) sequences. The greater the num-
ber of changes necessary to ‘convert’ such a host-blind
sequence to a sequence found in the host genome, the
present in dengue and human-blind for all possible
changes of one, two, three or four nucleotides. Several
hundred human-blind sequences were identified, inclu-
ding those that were (a) present in each individual viral
strain’s genome, (b) present in all 83 dengue strains
regardless of their serotype, (c) unique to each serotype
of the virus (present in all strains of the serotype), and
(d) unique to each individual viral strain’s genome
(present in the strain and absent from all other
strains).
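The four categories (a) to (d) can be illustrated with plain set operations. The following is a minimal sketch, not the authors' algorithm: it assumes the n-mers of each strain and of the host have already been collected into Python sets, and the names `classify_host_blind`, `strain_nmers`, `serotype_of` and `host_nmers` are hypothetical. It treats "human-blind" as simple absence from the host set and interprets category (c) as present in every strain of a serotype and absent from the strains of all other serotypes.

```python
from typing import Dict, Set, Tuple


def classify_host_blind(
    strain_nmers: Dict[str, Set[str]],
    serotype_of: Dict[str, str],
    host_nmers: Set[str],
) -> Tuple[Dict[str, Set[str]], Set[str], Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Group host-blind n-mers following categories (a)-(d) of the text.

    Assumption: every key of strain_nmers also appears in serotype_of.
    """
    # (a) host-blind n-mers present in each individual strain
    per_strain = {s: nm - host_nmers for s, nm in strain_nmers.items()}

    # (b) host-blind n-mers present in all strains, regardless of serotype
    in_all = set.intersection(*per_strain.values())

    # (c) n-mers shared by all strains of a serotype and absent elsewhere
    per_serotype: Dict[str, Set[str]] = {}
    for sero in set(serotype_of.values()):
        members = [v for s, v in per_strain.items() if serotype_of[s] == sero]
        others = [v for s, v in per_strain.items() if serotype_of[s] != sero]
        per_serotype[sero] = set.intersection(*members) - set().union(*others)

    # (d) n-mers present in one strain and absent from all other strains
    unique: Dict[str, Set[str]] = {}
    for s, nm in per_strain.items():
        rest = set().union(*(v for t, v in per_strain.items() if t != s))
        unique[s] = nm - rest

    return per_strain, in_all, per_serotype, unique
```

Holding whole n-mer sets in memory is only practical for small genomes such as the ≈ 10 kb dengue genome; the human genome would need a more compact presence/absence structure.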
The results demonstrate that any method of identifi-
cation based solely on hybridization with a particular
unique sequence or a small set (typically less than six)
of sequences, as used in the existing tests of dengue
diagnosis, would not be able to reliably accommodate
potential mispriming. To minimize the probability of
misdiagnosis, sequences that require three, four or
more bases to be altered for a mispriming to occur are
considered ideal for identification purposes. A multiple-probe
approach was therefore taken, in which characteristic
sequences are used to detect and identify any dengue virus
strain in the presence of human DNA. A sample probe set
that could be used in a microarray format was developed and
tested by in silico hybridization. This probe set was
designed to contain the minimal number of probes necessary
to detect and identify dengue at the strain level and to
unequivocally distinguish between the four major
serotypes.
Results
one, two, three, or four changes away from the
nearest human sequence (Fig. 2). These results
are provided in the Supplementary data. The presence
or absence of each n-mer was calculated, rather than
the frequency of occurrence. There were no 16-mers
three changes away from the nearest human sequence
in any of the viral genomes and only a single 17-mer
(and its complementary 17-mer), which was found in
Fig. 2. Number of unique human-blind sequences found in each of
the 83 complete dengue virus strain genomes considering different
sizes of n and number of changes away from the nearest human
sequence. Also listed is the average number of 22-mers present in
an individual genome and absent from the human sequence given
any one, two, three or four changes. The ideal set of probes would
be 22-mers that are four changes away; 16-, 17- and 18-mers with
one change away will lead to false-positive results because the
mismatches could be tolerated in the hybridization between the
host target and dengue probes.
Human-blind sequences for dengue identification C. Putonti et al.
just two of the dengue strains. It is not until 19-mers
were considered that all of the dengue genomes were
found to have some human-blind sequences at least
three changes away from the nearest human sequence.
The sequences four changes away are ideal candidates
for use in recognizing dengue because it is unlikely
that mispriming will occur and a false positive will
be reported. Each of the 83 strains had sequences
at least four changes away when considering n-mers
for n ≥ 21.
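The notion of an n-mer being "at least d changes away from the nearest human sequence" can be sketched as a check of its mismatch neighborhood. The example below is an illustration under stated assumptions rather than the authors' implementation: it counts only nucleotide substitutions (no insertions or deletions), it assumes the host n-mers fit in a Python set, and the function names `neighbors` and `is_host_blind` are hypothetical.

```python
from itertools import combinations, product
from typing import Iterator, Set


def neighbors(nmer: str, max_changes: int) -> Iterator[str]:
    """Yield nmer itself and every sequence reachable from it by
    1..max_changes substitutions."""
    yield nmer
    for k in range(1, max_changes + 1):
        for positions in combinations(range(len(nmer)), k):
            # at each chosen position, try the three other bases
            alternatives = [[b for b in "ACGT" if b != nmer[p]] for p in positions]
            for replacement in product(*alternatives):
                seq = list(nmer)
                for p, b in zip(positions, replacement):
                    seq[p] = b
                yield "".join(seq)


def is_host_blind(nmer: str, host_nmers: Set[str], changes: int) -> bool:
    """True if nmer is at least `changes` substitutions away from every
    host n-mer, i.e. nothing within changes - 1 substitutions is in the host."""
    return not any(v in host_nmers for v in neighbors(nmer, changes - 1))
```

With `changes=1` the check reduces to simple absence from the host. The neighborhood grows quickly with n and with the number of changes, so this brute-force form is suitable only for small illustrations.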
even likely, that this sequence could mispair to the
host sequence or related flavivirus genomes. Our
results lead us to the conclusion that there are no
human-blind sequences common to all 83 dengue
strains that are at least three changes away from the
nearest human sequence.
Sequences unique for serotype 1 and 2
We also calculated the number of sequences unique to
each of the dengue type 1 (DENV-1) and DENV-2 serotypes,
as these types comprise the great majority of the 83
genomes considered (Table 2). It is likely that the results
will be similar when a more extensive sample of DENV-3
and DENV-4 genomes becomes available. Although far more
human-blind n-mers are shared within each serotype than
across all 83 genomes, the number of common n-mers
decreases as the sequence length and stringency
increase. In the case of DENV-2,
there are no n-mers four changes away from the near-
est human sequence shared amongst all 46 virus
genomes. Further analysis of all serotype-specific
sequences is required to verify that they are unique
with respect to other flavivirus genomes as well. Selecting
host-blind primers/probes that are unique to the
serotype and host-blind with the most changes possible
Table 1. The number of n-mers present simultaneously in all 83
dengue genomes. The first row does not consider if the sequences
are absent from the human genome, just that they are present in
all of the dengue genomes.
n                                 16  17  18  19  20  21  22
Human-blind three changes away     0   0   0   0   0   0   4
Human-blind four changes away      0   0   0   0   0   0   0
from the nearest human sequence would ensure a more
reliable method of detection with a lower false-positive
rate than the currently available techniques.
Sequences unique for each individual viral
genome
For each n-mer present in a dengue genome, the num-
ber of other dengue genomes that also contain this
particular n-mer was calculated. On average, 4.4%
(16-mers) to 8.3% (22-mers) of the viral genome is
comprised of n-mers that are not present in any of the
other dengue genomes. For example, in the genome of
the DENV-4 China Guangzhou B5 strain (AF289029),
75.4% of the 22-mers are unique to this genome. Three
genomes (one DENV-1, two DENV-2, and one
DENV-4) do not have any 16- to 22-mers that do not
occur in any other dengue strain’s genomic sequence.
Thus, no single sequence could be used as a primer/probe
to identify one of these strains. Figure 3
shows the distribution of the percentage of unique
n-mers per genome for 16- to 22-mers. This analysis
was next extended to those n-mers that are human-blind.
The average percentage of host-blind n-mers that
are unique to a particular genome is less than 8%, and
many genomes have no human-blind n-mers at least
two changes away from the nearest human sequence.
Despite this low average, there are several genomes
cials to rapidly determine if an isolate causing hem-
orrhagic fever represents a new outbreak or belongs to
known circulating versions of the virus. The ability to
quickly, inexpensively, and reliably diagnose dengue at
the strain level in such a manner is not possible with
existing techniques.
Based upon the results, presented above, for human-
blind n-mers in the 83 dengue genomes, a set of 216
probes (22-mers, at least three changes away from the
nearest human sequence) was designed for in silico
Fig. 3. Distribution of the percentage of n-mers per genome that
are unique (i.e. not contained in any of the other dengue genomes
considered).
Fig. 4. Distribution of the percentage of human-blind 22-mers per
genome that are unique (i.e. not contained in any of the other den-
gue genomes considered).
experiments. The 216-probe set was computed as the
minimum number of probes possible to uniquely iden-
tify each of the 83 genomes such that each genome
was required to contain a subset of at least 28% (in
this case 61) of the 216 22-mers or probe sequences.
Furthermore, for any two strains' genomes, the subsets
contained in each must differ by at least two
sequences. If two strains differ only by two sequences
and mutations occur in these two sequences, the
strains will be indistinguishable. The likelihood of such
an occurrence can be reduced by demanding more
sequences for distinguishing between any two individ-
ility that such an event would occur is ≈ 1:$10^{150}$.
The microarray of 216 probes represents what many
researchers can produce in-house at low cost. We
determined the pattern that would appear on the
microarray given a particular genome’s ability to
hybridize with the probe sequences. Figure 5 shows the
overlapping expression patterns for two pairs of
genomes for the set of 216 probes. The distribution of
the number of probes present on the 216-probe set
microarray for each of 83 genomes ranges from 61 to
95. Because dengue infections occur in regions in
which other flaviviruses are also prevalent, it is impera-
tive that a diagnostic tool is able to discriminate
between the different viruses [2]. Considering the close
relative of dengue virus, West Nile virus, we computed
the number of probes expected to be present in 26
publicly available strains. Of the 216 dengue probes, at
most three would hybridize with a West Nile
strain. In fact, 24 of the West Nile virus strains share
these same three 22-mers with the dengue virus strains.
In the event that the clinical sample contained West
Nile virus and not dengue virus, the expression pattern
is expected to show only 1% of the probes hybridized,
far less than the 28% required during the set design.
Thus, it is highly unlikely that a misidentification of
the presence of dengue will be made using the 216-
probe set, even in the presence of another flavivirus.
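The comparison between the observed fraction of hybridized probes and the 28% design threshold can be sketched as follows. This is a simplified illustration: it assumes a probe hybridizes only when its exact sequence occurs in the sample genome (no mismatch tolerance), and the function names and the `min_fraction` parameter are hypothetical.

```python
from typing import List, Set


def hybridization_pattern(genome_nmers: Set[str], probes: List[str]) -> List[int]:
    """Binary array pattern: 1 where the probe sequence occurs in the genome."""
    return [1 if probe in genome_nmers else 0 for probe in probes]


def fraction_hybridized(pattern: List[int]) -> float:
    """Fraction of probes on the array that hybridized."""
    return sum(pattern) / len(pattern)


def looks_like_dengue(pattern: List[int], min_fraction: float = 0.28) -> bool:
    """Apply the design requirement that a dengue genome contains at
    least 28% of the probe set."""
    return fraction_hybridized(pattern) >= min_fraction
```

For a 216-probe array, three hybridized probes give 3/216 ≈ 1.4%, well below the threshold, whereas 61 hybridized probes give 61/216 ≈ 28.2%, which meets it.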
To estimate the ability of such arrays to distinguish
its simplicity; the distance is 0 if both genomes produce
the same pattern and 1 if they do not share any of the
same probes.
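The exact distance formula is not reproduced in this excerpt. As an assumption, the sketch below uses the Jaccard distance, which has the two stated properties: it is 0 for identical patterns and 1 for patterns with no shared probes. Each pattern is represented as the set of probes that hybridized; the function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def pattern_distance(a: Set[str], b: Set[str]) -> float:
    """Jaccard distance between two sets of hybridized probes
    (assumed metric, not necessarily the one used in the paper)."""
    union = a | b
    if not union:
        return 0.0  # two empty patterns are treated as identical
    return 1.0 - len(a & b) / len(union)


def distance_matrix(patterns: Dict[str, Set[str]]) -> Tuple[List[str], List[List[float]]]:
    """Return strain names (sorted) and the symmetric matrix of pairwise distances."""
    names = sorted(patterns)
    matrix = [[pattern_distance(patterns[x], patterns[y]) for y in names] for x in names]
    return names, matrix
```

A matrix of this kind is the input a distance-based tree-building program expects, once written out in that program's input format.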
By computing the distances between each pair of 83
patterns (the distance matrix), we were able to group
virus isolates using PHYLIP's KITSCH (University of
Washington, Seattle, WA, USA) [22] and visualize these
groups using publicly available software packages
[23,24] based on the distances between the patterns
observed on the microarray (Fig. 6). The trees generated
clearly separate DENV-1 strains from the remainder of
the serotypes. DENV-3 and DENV-4 are most closely
clustered within their own respective serotypes but are
nested within the DENV-2 branch. While this may be
attributed to the fact that there are far fewer DENV-3
and DENV-4 genomes available to be included in this analysis, it
is much more probable that it is a result of the design
process itself. Because each strain must contain a
percentage of the overall probe set, sequences that are
unique to a strain are, in essence, selected against.
A second probe set was designed containing a random
sampling of 4000 sequences (18-mers, two changes away
from the nearest human sequence). Because members
of this set were chosen at random, many more
sequences unique to a single strain or to just a few
strains are included. A tree displaying similarity
between isolates was also created using this set
(Fig. 7). This allows the dengue strains to be grouped
both diagnosis and phylogenetic tree construction. This
assay will be able, without necessitating viral isolation,
to quickly detect a new pattern signifying a new strain
of dengue almost akin to sequencing the genome.
The ability to identify strains very similar to an
unknown isolate in the data set of sequenced dengue
genomes may be especially valuable in epidemiological
studies where one would like to rapidly understand the
origins of an outbreak of hemorrhagic fever. For
example, if such an outbreak were to occur in a loca-
tion where dengue fever is indigenous, it may be the
result of a new variant of the virus which is common
in that region, a re-emergence of an earlier version, a
continuation of an outbreak from the previous season
or the introduction of a new strain as the result of
Fig. 6. Dengue groupings based on the similarity of the observed
hybridization patterns for the 216-probe hypothetical microarray of
22-mers at least three changes away from the nearest human
sequence.
travel. If the needed complete sequences are obtained
initially, hybridization arrays will allow these alternat-
ive explanations to be monitored on an ongoing basis.
Such monitoring might be conducted routinely to
detect changes in the local virus population before
cases of hemorrhagic fever occur.
If reduction of cost and size of this test are critical,
the ability to identify dengue at the strain level can be
sacrificed such that specificity is available only at the
bly does not allow the assembly of each chromosome with-
out gaps, all n-mers having a subsequence belonging to one
file and the remaining sequence in another file were not
included in our calculations. All calculations on the human
genome utilized both the original and complementary
strand sequences.
Eighty-three complete sequences of the dengue virus (28
DENV-1, 46 DENV-2, two DENV-3, and seven DENV-4)
were considered. This set of sequences, including their
accession numbers, is provided in the Supplementary data.
The dengue genome is ≈ 10 kb with minor variations in
length. Although dengue is a single-stranded RNA positive-
Fig. 7. Dengue groupings obtained from the
similarity of the observed hybridization
patterns for the hypothetical 4000-probe
microarray of randomly sampled 18-mers at
least two changes away from the nearest
human sequence.
strand virus with no DNA stage, both the original and
complementary strand sequences were used in our calcula-
tions as a precautionary measure.
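Collecting the n-mers of a sequence on both strands can be sketched as below. This naive version is only an illustration and is not the fast algorithm of [28–30]: it assumes the genome is given as a DNA-alphabet string (as viral RNA genomes usually are in sequence databases), takes the complementary strand to mean the reverse complement, and skips any window containing an ambiguous base. The function names are hypothetical.

```python
from typing import Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_BASES = set("ACGT")


def reverse_complement(seq: str) -> str:
    """Complementary strand, read in the 5'->3' direction."""
    return seq.translate(_COMPLEMENT)[::-1]


def nmers_both_strands(seq: str, n: int) -> Set[str]:
    """Set of all n-mers present on either strand (presence, not frequency)."""
    seq = seq.upper()
    found: Set[str] = set()
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - n + 1):
            window = strand[i:i + n]
            if set(window) <= _BASES:  # skip windows with N or other codes
                found.add(window)
    return found
```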
Calculations
We have recently developed a set of novel algorithms that
make it possible to analyze the occurrence frequency of all
short subsequences (n-mers) of length 5–25+ nucleotides
in any sequenced genome within a reasonable time (hours)
[28–30]. The unique properties of this new approach are:
extension of the algorithms to a PCR primer set design is
now in progress.
Probe selection
It is our intent to define the minimum optimal set of subsequences,
$s_{\min}$, that can both identify the presence of a particular
pathogen and distinguish between different strains of the pathogen.
To ensure the sensitivity needed to properly identify a genomic
sequence, each genome under consideration must contain at least a
subsequences from $s_{\min}$. If applicable, each subclass or type
must be distinguishable from any other subclass or type by at least
b subsequences. The set of subsequences present in each genome must
differ by at least c subsequences from the set present in every
other genome. Furthermore, for each element k in $s_{\min}$, its
complement k′ must not be a member of $s_{\min}$.
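The selection criteria can be expressed as a check that a candidate probe set either passes or fails. The sketch below is an illustration only and does not reproduce the evolutionary-programming search: it covers the per-genome minimum (a), the pairwise difference between genomes (c) and the complement rule, and omits the subclass criterion (b). It assumes "differ by at least c subsequences" means the symmetric difference of the two genomes' probe subsets, and that the complement k′ is the reverse complement; the function and parameter names are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def satisfies_criteria(
    probe_set: Set[str],
    genome_nmers: Dict[str, Set[str]],
    a: int,
    c: int,
) -> bool:
    """Check a candidate probe set against criteria a and c and the
    complement rule described in the text."""
    # no probe may appear together with its (reverse) complement;
    # a self-complementary probe therefore also fails this rule
    for k in probe_set:
        if k.translate(_COMPLEMENT)[::-1] in probe_set:
            return False

    # subset of probes present in each genome
    subsets = {g: probe_set & nmers for g, nmers in genome_nmers.items()}

    # every genome must contain at least `a` probes
    if any(len(s) < a for s in subsets.values()):
        return False

    # every pair of genomes must differ in at least `c` probes
    for g1, g2 in combinations(subsets, 2):
        if len(subsets[g1] ^ subsets[g2]) < c:
            return False
    return True
```

A search procedure, evolutionary or otherwise, could call such a check on each candidate set and then rank the passing sets with a fitness function.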
In designing this optimal set, an evolutionary program-
ming approach was taken. While many sets, s, may meet
the criteria above, a fitness function is needed to measure
how ‘good’ a particular set s is in order to determine whe-
ther it is, in fact, the optimal solution. For instance, a par-
ticular set may exceed the minimum values required of a, b
and c and, in fact, have values A, B and G, where A ≥ a,
public health. J Microbiol Immunol Infect 38, 5–16.
3 Relman DA (1998) Detection and identification of pre-
viously unrecognized microbial pathogens. Emerg Infect
Dis 4, 382–389.
4 Lanciotti RS, Calisher CH, Gubler DJ, Chang GJ &
Vorndam AV (1992) Rapid detection and typing of den-
gue viruses from clinical samples by using reverse tran-
scriptase-polymerase chain reaction. J Clin Microbiol
30, 545–551.
5 Harris E, Roberts TG, Smith L, Selle J, Krammer LD,
Valle S, Sandoval E & Balmaseda A (1998) Typing of
dengue viruses in clinical specimens and mosquitoes by
single-tube multiplex reverse transcriptase PCR. J Clin
Microbiol 36, 2634–2639.
6 De Paula SOD, Lima CDM, Torres MP, Pereira MR &
da Fonseca BAL (2004) One-step RT-PCR protocols
improve the rate of dengue diagnosis compared to two-
step RT-PCR approaches. J Clin Virol 30, 297–301.
7 Wang WK, Sung TL, Tsai YC, Kao CL, Chang SM &
King CC (2002) Detection of dengue virus replication in
peripheral blood mononuclear cells from dengue virus
type 2-infected patients by a reverse transcription-real-
time PCR assay. J Clin Microbiol 40, 4472–4478.
8 Sudiro TM, Zivny J, Ishiko H, Green S, Vaughn DW,
Kalayanorooj S, Nisalak A, Norman JE, Ennis FA &
Rothman AL (2001) Analysis of plasma viral RNA
levels during acute dengue virus infection using quanti-
tative competitor reverse transcription-polymerase chain
15 Wu SJL, Lee EM, Pubatana R, Shurtliff RN, Porter
KR, Suharyono W, Watts DM, King CC, Murphey
GS, Hayes CG et al. (2001) Detection of dengue viral
RNA using a nucleic acid sequence-based amplification
assay. J Clin Microbiol 39, 2794–2798.
16 Baeumner AJ, Schlesinger NA, Slutzki NS, Romano J,
Lee EM & Montagna RA (2002) Biosensor for dengue
virus detection: sensitive, rapid, and serotype specific.
Anal Chem 74, 1442–1448.
17 Schena M, Shalon D, Davis RW & Brown PO (1995)
Quantitative monitoring of gene expression patterns
with a complementary DNA microarray. Science 270,
467–470.
18 Lipshutz RJ, Fodor SP, Gingeras TR & Lockhart DJ
(1999) High density synthetic oligonucleotide arrays.
Nat Genet 21, 20–24.
19 Woese CR, Maniloff J & Zablen LB (1980) Phyloge-
netic analysis of the mycoplasmas. Proc Natl Acad Sci
USA 77, 494–498.
20 McGill TR, Jurka J, Sobieski JM, Pickett MH, Woese
CR & Fox GE (1986) Characteristic Archaebacterial
16S rRNA Oligonucleotides. Syst Appl Microbiol 7,
194–197.
21 Zhang Z, Willson RC & Fox GE (2002) Identifica-
tion of characteristic oligonucleotides in the 16S
ribosomal RNA sequence dataset. Bioinformatics 18,
244–250.
22 Felsenstein J (2005) PHYLIP (Phylogeny Inference
Package), Version 3.6. Distributed by the Author. Depart-
on Mathematics and Engineering Techniques in
Medicine and Biological Sciences (Valafar F &
Valafar H, eds), pp. 363–367. CSREA Press, Las
Vegas, NV.
30 Fofanov V, Putonti C, Chumakov S, Pettitt BM &
Fofanov Y (2005) Fast Algorithm for the Analysis of the
Presence of Short Oligonucleotide Sequences in Genomic
Sequences. UH Technical Report #UH-CS-05–11, Uni-
versity of Houston, Houston, Texas.
[Online http://www.cs.uh.edu/Preprints/preprint/uh-cs-05-11.pdf]
Supplementary material
The following material is available online:
Table S1. Data set of publicly available dengue strains
considered.
Table S2. Estimated number of sequences absent in a
genome of size 2 Gb (the approximate size of the
human genome excluding highly repeated elements).
Table S3. Number of n-mers absent from the human
genome.
Table S4. Number of human-blind n-mers, one, two,
three or four changes away, present in each dengue
genome.
Table S5. Number of unique human-blind n-mers, one,
two, three or four changes away.
This material is available as part of the online article
at http://www.blackwell-synergy.com