
Isochore structures in the chicken genome
Feng Gao and Chun-Ting Zhang
Department of Physics, Tianjin University, China
The first draft genome sequence of the red jungle
fowl, Gallus gallus, was published in December 2004.
The chicken (G. gallus) is an important model organ-
ism that bridges the evolutionary gap between mam-
mals and other vertebrates and serves as a main
laboratory model for the ~9600 extant avian species.
The chicken also represents the first agricultural ani-
mal to have its genome sequenced. Like most bird
species, the chicken has a relatively small genome of
~1200 million base pairs, or ~39% of the size of
the human genome [1].
The nuclear genomes of vertebrates are mosaics of
isochores, very long stretches [> 300 kilobases (kb)] of
DNA that are fairly homogeneous in base composi-
tion. Isochores can be partitioned into a small number
of families that cover a range of GC levels, which is
narrow in cold-blooded vertebrates, but broad in
warm-blooded vertebrates [2,3]. The large-scale variation
in base composition affects both coding and
noncoding sequences and seems to reflect a fundamental
level of genome organization [4]. This isochore
organization shows marked variation in a number of
important genomic features, including gene density [5],
chromosome bands [6,7], patterns of codon usage [8],
gene length [9], replication timing [10], recombination
rate [11,12], and the distribution of transposable ele-
ments [13]. By in situ hybridization of fractionated
DNA on mitotic and meiotic chromosomes, a com-

chicken genome. These isochores have a fairly homogeneous G + C con-
tent and often correspond to meaningful biological units. With the aid of
the technique of cumulative GC profile, we proposed an intuitive picture
to display the distribution of segmentation points. The relationships
between G + C content and the distributions of genes (CpG islands, and
other genomic elements) were analyzed in a perceivable manner. The cumu-
lative GC profile, equipped with the new segmentation algorithm, would be
an appropriate starting point for analyzing the isochore structures of
higher eukaryotic genomes.
Abbreviations
SNP, single nucleotide polymorphism.
FEBS Journal 273 (2006) 1637–1648 © 2006 The Authors Journal compilation © 2006 FEBS
opportunity to study the global genome organization
at the sequence level.
In this article, we analyzed the isochore structures of
the chicken genome using a new segmentation algorithm [15].
Applying the algorithm to 24 chicken chromosome sequences
yielded the isochore boundaries for each chromosome. The
chicken genome was found to be organized into a mosaic
structure of isochores. In particular, 25 isochores longer
than 2 Mb were identified: eight GC-rich isochores and
17 GC-poor isochores.
Results and discussion
The isochores in the chicken genome
It should be noted that the chicken genome sequence
still contains a large number of gaps (Table 1). In the
case of GGA1, there are 9847 gaps remaining. There-
fore, applying the segmentation algorithm to each frag-

Chromosome  Length (bp)   Number of  Percent of gaps in   G + C        Number of
                          gaps       the chromosome (%)   content (%)  isochores
1           188 239 860   9847        2.45                39.78        186
2           147 590 765   7333        2.64                39.61        151
3           108 638 738   4411        2.59                39.82        110
4            90 634 903   4122        3.04                39.91         89
5            56 310 377   2599        4.20                40.91         50
6            33 893 787   1531        1.48                41.54         36
7            37 338 262   1505        5.46                41.24         37
8            30 024 636   1252        6.55                41.79         24
9            23 409 228   1145        1.54                42.73         23
10           20 909 726   1233       10.32                42.96         16
11           19 020 054   1395        5.67                41.40         17
12           19 821 895    880        4.10                43.13         17
13           17 279 963   1132        2.87                44.25         12
14           20 603 938   1423        2.21                44.17         20
15           12 438 626    722        1.78                45.10         14
16              239 457     37       25.86                52.55          –
17           10 632 206    832        7.47                47.42          6
18            8 919 268    473        1.38                45.67         12
19            9 463 882    563        1.57                46.52          5
20           13 506 680    767        1.59                45.60          9

leads to
more segmentation points and shorter segmented sub-
sequences. Similar procedures were carried out for
macrochromosomes, intermediate chromosomes and
Fig. 1. The negative cumulative GC profiles for the chicken genome. The gaps in the chicken chromosome sequences are left empty in the
curves. Note that sharp peaks correspond to the sites where the G + C content undergoes abrupt changes, from GC-rich regions to GC-poor
regions, and vice versa, indicating a mosaic structure of the chromosomes. A jump in the $-z'_n$ curve indicates an increase of the G + C
content, whereas a drop in the $-z'_n$ curve indicates a decrease of the G + C content. An approximately straight region in the $-z'_n$ curve
implies that the G + C content in this region is roughly constant.
sex chromosome Z, respectively. Consequently, for
macrochromosomes, intermediate chromosomes and
sex chromosome Z, the threshold $t_0$ is set to 1000 to
partition these chromosomes into compositionally distinct
domains. For microchromosomes, which are much smaller and
contain a higher density of CpG islands and genes, $t_0$

relative magnitude of the G + C content of isochores
with respect to the genomic G + C content. Accord-
ing to this classification, the G + C content of GC-
rich isochores (GC-poor isochores) is higher (lower)
than the genomic G + C content.
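
The classification rule above can be sketched in a few lines of Python. This is an illustration only: the `Segment` type, the `classify` function and the example genomic G + C value of 41.0% are hypothetical, and the 300 kb cut-off is the isochore length quoted in the Introduction.

```python
from dataclasses import dataclass

MIN_ISOCHORE_LENGTH = 300_000  # isochores are longer than 300 kb (from the text)


@dataclass
class Segment:
    start: int   # first base of the segment (1-based, inclusive)
    end: int     # last base of the segment (1-based, inclusive)
    gc: float    # G + C content of the segment, in percent

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def classify(segment: Segment, genomic_gc: float) -> str:
    """Label a segment relative to the genomic G + C content (percent)."""
    if segment.length <= MIN_ISOCHORE_LENGTH:
        return "not an isochore"
    # A segment exactly at the genomic value is treated as GC-poor here;
    # the text does not say how ties are handled, so this is an assumption.
    return "GC-rich" if segment.gc > genomic_gc else "GC-poor"


# The GGA28 segment discussed in the text; 41.0 is a made-up genomic value.
print(classify(Segment(2_021_043, 2_644_230, 37.08), genomic_gc=41.0))
```

Only the comparison against the genomic G + C content comes from the text; how segments are obtained is the job of the segmentation algorithm described under Methods.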
Biological implications of isochores
With the aid of the technique of cumulative GC pro-
file, we proposed an intuitive picture to display the dis-
tribution of segmentation points. The relationships
between G + C content and the distributions of genes
(CpG islands, and other genomic elements) can be
analyzed in a perceivable manner. The cumulative GC
profile is also called the $z'_n$ curve, which is a discrete
function of the nucleotide position n in a genome or
Fig. 2. The negative cumulative GC profile for GGA14 marked with
the segmentation points obtained. The bottom four plots show the
distributions of the G + C content and CpG islands along chicken
chromosome 14. The G + C contents are calculated
for the domains segmented at $t_0$ = 1000, 500, and 100,
respectively. Note that the distribution of CpG islands is closely correlated
with the segmented regions with distinct G + C content. The
notation used here is described as follows. Besides the position
coordinates, the order of occurrence for each point in the segmentation
process is also labeled in the figure. We used ‘f’, ‘l’, ‘r’, and an
integer to label the order of occurrence, where f denotes the first point

$n_0 + \Delta$ is represented by a blank interval in this plot.
Here, $n_0$ and n are the relative coordinates with respect to the
contig without gaps. Other gaps are dealt with using a similar procedure.
chromosome. Before studying the features of the
cumulative GC profiles of the chicken genome, some
basic characteristics of the cumulative GC profile need
to be addressed. It was shown that the average G + C
content of a genome or chromosome over the interval
n → n + Δn satisfies
$\overline{G + C} \propto \Delta(-z'_n)/\Delta n$ [16].
Therefore, a jump in the $-z'_n$ curve indicates an
increase of the G + C content, whereas a drop
in the $-z'_n$ curve indicates a decrease of the G + C
content. An approximately straight region in the $-z'_n$

tion points have clear biological implications. Note
that the distribution of CpG islands is closely correla-
ted with the segmented regions with distinct G + C
content. We therefore investigated the correlation
between the G + C content of isochores and the
distribution of CpG islands throughout the chicken genome
(Fig. 6). With $t_0$ = 100, a total of 811
segments longer than 300 kb were considered as isochores,
according to our definition of an isochore
(Table 1). There is a positive and
highly significant correlation between the G + C content
of these isochores and the corresponding density
of CpG islands (R = 0.82, P < 0.001).
The positive correlation between the G + C content
and the density of CpG islands is a well-known fact.
It is nevertheless worth pointing out that the
segmentation points obtained here coincide with the
boundaries of the related regions. For example, there
is an abrupt increase (decrease) of the density of CpG
islands at the first (second) boundary of the short
GC-rich region between nucleotides 15 908 133 and
16 385 348 on GGA12 (Fig. 4). Similar phenomena are
observed in other regions with distinct G + C content.
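
As a rough illustration of the correlation analysis described above, the sketch below computes a CpG island density per segment and a correlation coefficient against the G + C content. It assumes that R is a Pearson coefficient, which the text does not state; the function names and all input values are hypothetical, and no significance test is included.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def cpg_island_density(island_starts, seg_start, seg_end):
    """CpG islands per Mb whose start lies inside [seg_start, seg_end]."""
    count = sum(seg_start <= s <= seg_end for s in island_starts)
    return count / ((seg_end - seg_start + 1) / 1_000_000)


# Made-up values for three isochores: G + C content (%) and islands per Mb.
gc_percent = [37.1, 41.5, 46.0]
densities = [2.0, 9.0, 21.0]
print(round(pearson_r(gc_percent, densities), 3))
```

Counting an island by its start coordinate is one of several reasonable conventions; islands that straddle a segment boundary could equally be assigned by midpoint or by overlap.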
The precise boundary coordinates obtained by the
segmentation algorithm and the associated cumulative
Fig. 3. Histogram of length and G + C content based on all the
segments obtained at $t_0$

with G + C content and CpG island density, it seems
that the gene density predicted by SGP-2 is more
reasonable than that predicted by Ensembl and TWINSCAN
in the region between nucleotides 15 908 133 and
16 385 348 on GGA12 (Fig. 4).
The obtained isochore map can also be displayed in
the UCSC Genome Browser as a custom track, together
with a series of tracks aligned with the genomic sequence
[21]. As an example, the top track in Fig. 5 shows the
isochore structure of chicken chromosome 28, integra-
ted with comprehensive genome information, such as
the G + C content, isochores from Pennsylvania State
University (PSU) [22], gene density predicted by
Ensembl, CpG islands, best alignments with the human
genome, single nucleotide polymorphisms (SNPs) and
repeat densities. This graphical interface allows rapid
visual inspection of the correlation of different types of
information [21]. Note that the density distributions of
CpG islands and genes are correlated with the segmented
regions with distinct G + C content. Here, the
region from nucleotide 2 021 043 to 2 644 230 was
deemed an isochore (length = 623 kb); it is
the longest region among the obtained segments on
GGA28. The G + C content of this isochore is 37.08%,
the lowest G + C content among the identified isochores.
This isochore clearly corresponds
to a region that is a desert of genes/CpG islands/SNPs and
contains high-density simple tandem repeats. It can also be
seen from Fig. 5 that our result is more reasonable than
that obtained from PSU. The isochore data from PSU
were generated using the method described in
[22], in which a measure, the compositional heterogeneity
(or variability) index, was proposed to compare the
differences in compositional heterogeneity between long
genomic sequences. Some of the boundary coordinates of the
isochores identified by PSU appear to be misplaced.
For example, the region from
nucleotide 1 935 001 to 2 075 000 was deemed an
isochore in the PSU result, whereas both the cumulative
GC profile for GGA28 (Fig. 1) and the G + C content
in five-base windows clearly show an abrupt change
in the G + C content within this region.
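
For readers who want to load a segment list into the UCSC Genome Browser as a custom track, as described above, the following sketch writes segments in BED format. It is not the authors' program: the function name, the track description and the score mapping are hypothetical, and the coordinates printed in the text are assumed to be 1-based and inclusive.

```python
def isochores_to_bed(chrom, segments, track_name="isochores"):
    """Format (start, end, gc_percent) segments as a BED custom track.

    BED uses 0-based starts and exclusive ends, so a 1-based inclusive
    interval keeps its end coordinate and has its start shifted by one.
    """
    lines = [f'track name="{track_name}" description="Isochore map" useScore=1']
    for start, end, gc in segments:
        # Hypothetical shading: map G + C percent (0-100) onto BED scores 0-1000,
        # so that GC-richer segments are drawn darker when useScore=1.
        score = max(0, min(1000, round(gc * 10)))
        lines.append(f"{chrom}\t{start - 1}\t{end}\tGC={gc:.2f}\t{score}")
    return "\n".join(lines) + "\n"


# The GGA28 isochore quoted in the text.
print(isochores_to_bed("chr28", [(2_021_043, 2_644_230, 37.08)]), end="")
```

The resulting text can be pasted into the browser's custom track form; the chromosome name must match the naming used by the chosen assembly.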
Based on the present method, other chicken chromosomes
were also analyzed; the detailed analysis is accessible at />
The program implementing the new segmentation algorithm is
also available on request.
Comparison with the other segmentation
algorithms
Traditionally, the G + C content distribution of a
genome is usually assessed by computing the G + C
content in sliding windows moving along the genome.
Fig. 4. The negative cumulative GC profile
for GGA12 marked with the segmentation
points obtained. The bottom five plots show
the distributions of G + C content, genes
and CpG islands along chicken chromosome
12, respectively. Here, the distribution of
gene density is plotted based on the predic-
ted results by SGP-2, Ensembl and

These segments are represented by rectangular blocks, and the corresponding G + C contents are labeled on the left of the segments.
Segments with higher G + C content are more darkly shaded. The precise boundary coordinates can be found at />
The region from nucleotide 2 021 043 to 2 644 230 was identified as an isochore, with the lowest G + C content (37.08%) among the
obtained segments on GGA28. This isochore clearly corresponds to a region that is a desert of genes/CpG islands/SNPs and
contains high-density simple tandem repeats. Note that there are abrupt changes in the density distributions of CpG islands, genes and other
elements at the boundaries of this isochore identified by the present algorithm.
The disadvantage of this routinely used window-based
method is that its resolution is low, i.e. the method is
not sensitive to small changes in the
G + C content. In addition, the distribution pattern
of G + C content obtained depends strongly on
the window size.
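
The dependence on window size is easy to see in a minimal sketch of the window-based approach. The function below is an illustration with a hypothetical name and a toy sequence; it is not taken from any of the tools cited here.

```python
def gc_content_windows(seq, window, step=None):
    """G + C fraction in windows of `window` bases moved by `step` bases.

    Bases other than A, C, G and T (for example N in gaps) are ignored;
    a window with no valid bases yields None.
    """
    step = step or window  # default: nonoverlapping windows
    seq = seq.upper()
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        at = chunk.count("A") + chunk.count("T")
        profile.append(gc / (gc + at) if gc + at else None)
    return profile


# Toy sequence: an AT-only half followed by a GC-only half.
toy = "AT" * 4 + "GC" * 4
print(gc_content_windows(toy, 8))    # two windows, one per half
print(gc_content_windows(toy, 16))   # one window that averages over the boundary
```

With an 8-base window the two halves are resolved, whereas a 16-base window reports a single intermediate value and hides the transition entirely.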
Historically, other windowless methods have been
developed to calculate the G + C content, which are
usually given the name of ‘segmentation of DNA
sequences’. Among them, the methods of entropic seg-
mentation [23,24], hidden Markov model [25,26] and
wavelet shrinkage technique [27] should be mentioned.
The advantages and disadvantages of the latter two
methods were discussed in [28]. As the entropic
segmentation algorithm is widely used to find segmentation
points for various genomes, one may wonder whether
the two algorithms (the entropic algorithm and ours)
produce the same results. We therefore compare the two
segmentation algorithms, restricting the comparison to the
entropic segmentation algorithm. Both segmentation
algorithms possess the highest resolution (single
nucleotide accuracy). By applying the new algorithm to

shown to correlate to a number of important genomic
features. Furthermore, quantitative analysis of compo-
sitional heterogeneity reveals the statistical properties
of DNA sequences, which is useful to locate the origin
and terminus of replication in bacterial [32] and archa-
eal [33] genomes, and detect horizontally transferred
genes and genomic islands [28].
In this paper, it has been shown that the chicken
genome is organized into a mosaic structure of iso-
chores. A new algorithm has been applied to segment
24 chicken chromosome sequences, and the boundaries
of isochores obtained for each chromosome have been
determined precisely.
In summary, the cumulative GC profile marked with
the coordinates of resulting segmentation points is a
useful tool for genome analysis. This leads to a neat
graphical representation of G + C content variations
along a genome or chromosome, and a clear-cut definition
of isochores. This technique allowed us to
show/confirm that GC-rich isochores in a chicken
chromosome have higher gene and CpG island densities
than AT-rich isochores. Although these are well-known
characteristics of isochores of the vertebrate
organisms, the advantage of the technique is that an
investigator is able to study all of these in a perceiv-
able and precise manner. We believe that a plot similar
to Fig. 4 could become a common tool for analyzing
Fig. 6. Correlation between the G + C content of isochores and the
density distribution of CpG islands. With $t_0$

of CpG islands and genes were calculated in 100 kb long,
nonoverlapping windows.
A new segmentation algorithm of DNA
sequences
The genome order index S is defined by
$$S = S(P) = a^2 + c^2 + g^2 + t^2 \qquad (1)$$
where a, c, g and t denote the occurrence frequencies of
A, C, G and T, respectively, in a genome or a DNA
sequence. The genome order index S defined in Eqn 1 is
a useful statistical quantity to reflect the compositional
characteristics of a genome [29], which can serve as an
appropriate divergence measure to quantify the composi-
tional difference between two DNA sequences [15]. The
new segmentation algorithm proposed here is based on
the quadratic divergence (see Eqn 2). Consider a genome
with N bases. Let n be an integer, 2 ≤ n ≤ N − 1. For a
given n, the genome sequence is partitioned into two
subsequences, one left and the other right. Let $w_1 = n/N$
and $w_2$

$a_r$, $c_r$, $g_r$, $t_r$ are the occurrence
frequencies of bases A, C, G and T in the left and
right subsequences, respectively. Thus,
$$\Delta S(P_l, P_r) = \frac{n}{N} S(P_l) + \frac{N - n}{N} S(P_r) - S\!\left(\frac{n}{N} P_l + \frac{N - n}{N} P_r\right), \qquad (2)$$
where S(P) is defined by Eqn 1. If we suppose that n* is a
position at which $\Delta S(P_l, P_r)$ reaches its maximum, then n* is

sequences, whereas a smaller threshold $t_0$ leads to more
segmentation points and shorter segmented subsequences. For
an obtained segmentation point, it is important to know
whether the halting parameter value is significantly different
from that of a random sequence. In order to halt the
segmentation at different significance levels, we estimated the
distribution of the halting parameter based on 100 000 random
sequences with a length of 1 Mb. For each of these
sequences, we calculated the halting parameter for the first
point occurring during the course of segmentation and
thus obtained 100 000 values. The cumulative
frequency and the counts were then plotted against the halting
parameter (Fig. 7). For example, a significance
level of 5% corresponds to $t_0$ = 6.194. However, a
much more stringent stopping criterion is actually required
in most cases. It should be noted that in some cases the
segmentation procedure also halts when the resulting
subsequence is shorter than a given minimum length. Here, we
chose 3000 nucleotides as the minimum length according to
a requirement imposed by the experimental characterization
of isochores through DNA centrifugation [3]. In general,
the choice of $t_0$ and the minimum length is heuristic and
must be determined on a case-by-case basis [15].
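
A single split step based on Eqns 1 and 2 can be sketched as follows. This is a hedged illustration, not the authors' program: the function names are hypothetical, the sequence is assumed to contain only A, C, G and T, and the halting parameter and minimum-length rule are left out because their definitions are not reproduced in the text above. The sketch uses the fact that the weighted mixture of the left and right frequency vectors in Eqn 2 is simply the frequency vector of the whole sequence.

```python
from collections import Counter

BASES = "ACGT"


def order_index(counts, total):
    """Genome order index S = a^2 + c^2 + g^2 + t^2 (Eqn 1)."""
    return sum((counts[b] / total) ** 2 for b in BASES)


def best_split(seq):
    """Return (n_star, max_delta_s), maximizing Eqn 2 over 2 <= n <= N - 1.

    n is the length of the left subsequence.
    """
    N = len(seq)
    if N < 3:
        raise ValueError("sequence too short to split")
    total = Counter(seq)
    # (n/N) P_l + ((N - n)/N) P_r equals the composition of the whole
    # sequence, so the last term of Eqn 2 does not depend on n.
    s_whole = order_index(total, N)
    left = Counter(seq[:1])
    best_n, best_ds = None, float("-inf")
    for n in range(2, N):
        left[seq[n - 1]] += 1  # extend the left subsequence by one base
        right = {b: total[b] - left[b] for b in BASES}
        ds = ((n / N) * order_index(left, n)
              + ((N - n) / N) * order_index(right, N - n)
              - s_whole)
        if ds > best_ds:
            best_n, best_ds = n, ds
    return best_n, best_ds


# Toy sequence with one obvious compositional boundary after base 10.
print(best_split("A" * 10 + "G" * 10))
```

Updating the left counts incrementally keeps the scan linear in the sequence length, which matters for chromosome-sized inputs.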
Cumulative GC profile

nents of the Z-curve, which is a three-dimensional curve
that uniquely represents a DNA sequence [34,35]. Usually,
for an AT-rich (GC-rich) genome, $z_n$ is approximately
a monotonically increasing (decreasing) linear
function of n. To amplify the deviations of $z_n$, the curve
of $z_n$ ~ n is fitted by a straight line using the least
squares technique,
$$z = kn \qquad (5)$$
where (z, n) is the coordinate of a point on the fitted straight
line and k is its slope. Instead of using the curve of
$z_n$ ~ n, we will use the $z'$ curve, or cumulative GC profile,
hereafter, where
$$z'_n = z_n - kn. \qquad (6)$$
If we let
$\overline{G + C}$ denote the average G + C content within a

24 chromosomes of the chicken genome, respectively
(Fig. 1). Note that the cumulative GC profile is not the
G + C content itself; rather, the derivative of the
cumulative GC profile with respect to the base position n is
negatively proportional to the G + C content at the
given position, i.e. $G + C \propto -\mathrm{d}z'/\mathrm{d}n$. Therefore, the
average slope of the cumulative GC profile within a region
reflects the average G + C content of the sequence within
this region.
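
The construction of the cumulative GC profile can be sketched as below. Two points are assumptions rather than statements from the text above: $z_n$ is taken to be the usual Z-curve component, $(A_n + T_n) - (G_n + C_n)$ over the first n bases, and the least-squares line of Eqn 5 is fitted through the origin. The function names are hypothetical. Under that definition of $z_n$, a window of Δn bases satisfies Δz = Δn(1 − 2·GC), which gives GC = (1 − k − Δz′/Δn)/2 and is what `gc_from_slope` evaluates.

```python
def cumulative_gc_profile(seq):
    """Return (z, k, z_prime) for a DNA sequence of A, C, G and T.

    z[n-1] = (A_n + T_n) - (G_n + C_n) over the first n bases (assumed
    Z-curve definition), k is the least-squares slope of z = k*n, and
    z_prime[n-1] = z[n-1] - k*n (Eqn 6).
    """
    z, running = [], 0
    for base in seq.upper():
        if base in "AT":
            running += 1
        elif base in "GC":
            running -= 1
        z.append(running)
    N = len(z)
    # Least-squares fit of z = k*n with no intercept: k = sum(n*z_n) / sum(n^2).
    k = (sum(n * zn for n, zn in enumerate(z, start=1))
         / sum(n * n for n in range(1, N + 1)))
    z_prime = [zn - k * n for n, zn in enumerate(z, start=1)]
    return z, k, z_prime


def gc_from_slope(z_prime, k, start, end):
    """Average G + C fraction over bases start+1..end from the z' slope."""
    previous = z_prime[start - 1] if start else 0.0  # z'_0 = 0
    slope = (z_prime[end - 1] - previous) / (end - start)
    return 0.5 * (1.0 - k - slope)


toy = "AT" * 4 + "GC" * 4
z, k, zp = cumulative_gc_profile(toy)
print(gc_from_slope(zp, k, 0, 8), gc_from_slope(zp, k, 8, 16))
```

Plotting the negated `z_prime` against n gives the negative cumulative GC profile: in the toy sequence it falls over the AT-only half and rises over the GC-only half, as described for Fig. 1.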
We should point out that the method of cumulative GC
profile ($z'$-curve method) shares some basic features with
the cusum method, in that both are based on a cumulative
calculation. However, the two methods differ: the
former is designed specifically for genome analysis, whereas the
latter is a general method, suitable for econometrics,
time-series analysis, etc.
Acknowledgements
We are grateful to the referees for their constructive
comments, which were very important in strengthening
the presentation of the paper. We would also like to
thank Drs. R. Zhang and L.-L. Chen for invaluable
assistance. Suggestions for writing the manuscript from
Feng-Biao Guo and Wen-Xin Zheng are gratefully
acknowledged. The present work was supported in
part by National Natural Science Foundation of China
Grant no. 90408028.
References
1 Hillier LW, Miller W, Birney E, Warren W, Hardison
RC, Ponting CP, Bork P, Burt DW, Groenen MA,

6 Saccone S, De Sario A, Della Valle G & Bernardi G
(1992) The highest gene concentrations in the human
genome are in telomeric bands of metaphase chromo-
somes. Proc Natl Acad Sci USA 89, 4913–4917.
7 Saccone S, De Sario A, Wiegant J, Raap AK, Della
Valle G & Bernardi G (1993) Correlations between iso-
chores and chromosomal bands in the human genome.
Proc Natl Acad Sci USA 90, 11929–11933.
8 Sharp PM, Averof M, Lloyd AT, Matassi G & Peden JF
(1995) DNA sequence evolution: the sounds of silence.
Philos Trans R Soc Lond B Biol Sci 349, 241–247.
9 Duret L, Mouchiroud D & Gautier C (1995) Statistical
analysis of vertebrate sequences reveals that long genes
are scarce in GC-rich isochores. J Mol Evol 40, 308–317.
10 Tenzen T, Yamagata T, Fukagawa T, Sugaya K, Ando
A, Inoko H, Gojobori T, Fujiyama A, Okumura K &
Ikemura T (1997) Precise switching of DNA replication
timing in the GC content transition area in the human
major histocompatibility complex. Mol Cell Biol 17,
4043–4050.
11 Eisenbarth I, Vogel G, Krone W, Vogel W & Assum G
(2000) An isochore transition in the NF1 gene region
coincides with a switch in the extent of linkage disequili-
brium. Am J Hum Genet 67, 873–880.
12 Fullerton SM, Bernardo Carvalho A & Clark AG
(2001) Local rates of recombination are positively corre-
lated with GC content in the human genome. Mol Biol
Evol 18, 1139–1142.
13 Smit AF (1999) Interspersed repeats and other memen-
tos of transposable elements in mammalian genomes.

tional heterogeneity within and between eukaryotic gen-
omes. Genome Res 10, 1986–1995.
23 Oliver JL, Bernaola-Galvan P, Carpena P & Roman-
Roldan R (2001) Isochore chromosome maps of eukar-
yotic genomes. Gene 276, 47–56.
24 Li W, Bernaola-Galvan P, Haghighi F & Grosse I
(2002) Applications of recursive segmentation to the
analysis of DNA sequences. Comput Chem 26, 491–
510.
25 Churchill GA (1992) Hidden Markov chains and the
analysis of genome structure. Comput Chem 16, 107–
115.
26 Peshkin L & Gelfand MS (1999) Segmentation of yeast
DNA using hidden Markov models. Bioinformatics 15,
980–986.
27 Lio P & Vannucci M (2000) Finding pathogenicity
islands and gene transfer events in genome data. Bio-
informatics 16, 932–940.
28 Zhang R & Zhang CT (2004) A systematic method to
identify genomic islands and its applications in analyz-
ing the genomes of Corynebacterium glutamicum and
Vibrio vulnificus CMCP6 chromosome I. Bioinformatics
20, 612–622.
29 Zhang CT & Zhang R (2004) A nucleotide composition
constraint of genome sequences. Comput Biol Chem 28,
149–153.
30 Zhang CT & Wang J (2000) Recognition of protein
coding genes in the yeast genome at better than 95%
accuracy based on the Z curve. Nucleic Acids Res 28,
2804–2814.

