| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
Janos Demeter,3
Suzanne E. Stroup,4 and
Upinder Singh1,2*
Division of Infectious Diseases, Department of Internal Medicine,1 Department of Microbiology and Immunology, Stanford University School of Medicine,2 Department of Genetics, Center for Clinical Sciences Research, Stanford University, Stanford, California,3 Department of Internal Medicine, University of Virginia, Charlottesville, Virginia4
Received 26 September 2004/ Accepted 21 December 2004
| ABSTRACT |
|---|
|
|
|---|
10% of
infected individuals; 90% remain asymptomatically colonized.
Entamoeba dispar, a closely related species, is avirulent. To
determine the extent of genetic diversity among Entamoeba
isolates and potential genotype-phenotype correlations, we have
developed an E. histolytica genomic DNA microarray
and used it to genotype strains of E. histolytica and
E. dispar. On the basis of the identification of
divergent genetic loci, all strains had unique genetic fingerprints.
Comparison of divergent genetic regions allowed us to distinguish
between E. histolytica and E.
dispar, identify novel genetic regions usable for strain and
species typing, and identify a number of genes restricted to virulent
strains. Among the four E. histolytica strains, a
strain with attenuated virulence was the most divergent and
phylogenetically distinct strain, raising the intriguing possibility
that genetic subtypes of E. histolytica may be
partially responsible for the observed variability in clinical
outcomes. This microarray-based genotyping assay can readily be applied
to the study of E. histolytica clinical isolates to
determine genetic diversity and potential genotypic-phenotypic
associations. | INTRODUCTION |
|---|
|
|
|---|
50% of
children have serological evidence of exposure to E.
histolytica by age 5
(34,
35). Infected children
suffer from significant morbidity with malnourishment and growth
delays. A number of interesting epidemiologic trends have been
identified for amebic disease. The data, although limited, indicate
that a minority (
10%) of individuals who become
infected with E. histolytica progress to clinically
overt disease; others remain asymptomatically colonized
(29). In individuals with
invasive disease, occurrence of amebic liver abscess (ALA) is 5 to 50
times less common than diarrhea
(65). Geographically
variable disease predilections have been observed, and invasive disease
predominantly affects men
(1,
2,
11,
61). The extent of genetic diversity among E. histolytica clinical isolates is unclear. Studies have analyzed a small number of highly repetitive and polymorphic genetic loci by techniques, such as randomly amplified polymorphic DNA (RAPD), RNA arbitrarily primed PCR, and restriction fragment length polymorphism (RFLP), to conclude that there is significant genetic diversity (7, 18, 23, 31, 33, 64, 72, 73). Contradictory data were obtained by two studies; in one study, the lectin gene was sequenced, and in the other study, the intergenic region between the superoxide dismutase gene and the actin 3 gene was analyzed (9, 30). These studies found minimal genetic diversity among clinical isolates. Although limited, these data indicate that the extensive genetic diversity identified by analysis of repetitive regions may not be indicative of a genome-wide phenomenon. A number of studies have attempted to identify an association between parasite genotype and the clinical phenotype and although most have failed, two studies (one using RAPD analysis of amebic isolates and one using PCR and RFLP analyses of the serine-rich E. histolytica protein [SREHP]) did give some preliminary indication that it may be feasible to find a genotypic pattern predictive of a phenotypic outcome (7, 64). Host factors that contribute to variable disease manifestations have also been studied. In a study of Mexican mestizo adults and children, an increased frequency of HLA-DR3 was found to be associated with ALA (5, 6, 63). A recent study identified a potential protective association of the HLA class II allele DQB1*0601 and the heterozygous haplotype DQB1*0601/DRB1*1501 for intestinal amebiasis (26).
E.
histolytica was recently reclassified into E.
histolytica Schaudinn (EH) (previously described as pathogenic
E. histolytica) and Entamoeba dispar Brumpt
(ED) (previously described as nonpathogenic E.
histolytica)
(25). The two species are
morphologically similar, exhibit
98% identity at the
rRNA level, and have similar host range and cell biology. However,
their pathogenicity in vivo is vastly different. E.
dispar colonizes humans, but only E.
histolytica is able to cause invasive disease. To date, only
four genes have been reported to be present in E.
histolytica but absent (EhCP1, Ehapt2, and
Ariel1) or significantly degenerate (EhCP5) in
E. dispar
(12,
67-69).
However, the molecular basis of E. histolytica
virulence and E. dispar nonvirulence is not yet
known.
The parasite and host variables that contribute to the epidemiologic trends described above are not clear. Undoubtedly, there is a complex interplay between host genetics, immunity, enteric flora, nutrition, and parasite genetics that occurs and contributes to disease. Whether there are subtypes of EH that have higher or lower virulence potential or predilection for infection of certain organ systems is not known. The World Health Organization has prioritized efforts to determine whether functional subgroups of EH exist (71). Using an E. histolytica 11,328-clone genomic DNA microarray, we have genotyped four E. histolytica laboratory strains and two E. dispar laboratory strains. This first genome-wide analysis of Entamoeba strains reveals that genotypic fingerprints can be used to distinguish E. histolytica from E. dispar, identify genes restricted to virulent strains, and find potential genotypic-phenotypic associations.
| MATERIALS AND METHODS |
|---|
|
|
|---|
|
2X genome
coverage of ED. Microarray construction and quality control. We utilized 11,328 sequenced clones from the genomic library of EH (HM-1:IMSS) prepared by TIGR to generate our DNA microarrays. Approximately 62% of these clones were used in the final sequence assembly. Microarray construction was similar to previously described methods (19). Briefly, we PCR amplified the genomic insert directly from 5 µl of bacterial cultures using M13F and M13R primers that are part of the multiple cloning sites of the cloning vector. A total of 35 cycles of PCR was performed, with 1 cycle consisting of 1 min at 94°C, 1 min at 50°C, and 2 min at 72°C. PCR was performed with 2 mM MgCl2 in a Tetrad Thermo cycler (MJ Research, Waltham, Mass.). The products were ethanol precipitated, resuspended in water, dried, resuspended in 3x SSC buffer (1x SSC is 0.15 M NaCl plus 0.015 M sodium citrate), and printed on polylysine-coated glass slides (19). To check the accuracy of printing, quality controls were performed. Probes from the rRNA episome, EhActin, EhPFK, and EhATPase were used to verify clone orientation and contamination issues. One hundred nanograms of each amplified PCR product was labeled with Cy5-dUTP and hybridized on the arrays (see "Microarray hybridization" for details). Primer sequences are available for review (see Table 1 in the supplemental material at http://microarray-pubs.stanford.edu/entamoeba_diversity/index.shtml).
Clone annotation. The assembly, locus information, and final TIGR/Sanger Center gene annotation were kindly provided by Brendan Loftus (annotated on 12 February 2004). Clones from potentially contaminated plates and clones from the rRNA episome and repeat elements (Ehapt2 and EhRLE4) were identified. The final information was compiled and stored in the Stanford Microarray Database (SMD) (http://smd.stanford.edu/) for data retrieval and analysis. We also identified a subset of clones which were fully sequenced, used in the final assembly, contained one open reading frame (ORF) according to the TIGR annotation (>200-bp overlap at >98% identity), and were not from contaminated plates. These clones (2,802 good clones which contain 2,112 unique genes) were used for analyses in which we considered data from clones with a single ORF.
Microarray hybridization. Genomic DNA (4 to 8 µg) from the EH and ED strains was resuspended in 30.5 µl of Tris-EDTA (pH 7.6), 3 µl of random nanomers (4 µg/µl) was added to the solution, and the solution was boiled for 1 min and cooled for 5 min. The DNA was mixed with 10x deoxynucleoside triphosphates (0.25 mM concentrations of dATP, dCTP, and dGTP and 0.09 mM concentration of dTTP), 1.5 µl of Cy5-dUTP, and 2 µl of Klenow fragment and incubated at 37°C for 4 h. A reference sample (350 ng of a 140-bp PCR product which labeled all spots on the array) was labeled similarly with Cy3-dUTP. The labeled products were mixed, purified with a Microcon 30 filter (Amicon/Millipore, Billerica, Mass.), and hybridized on arrays for 16 to 18 h at 65°C. The arrays were washed at room temperature sequentially for 2 min each with three solutions: 2x SSC and 0.1% sodium dodecyl sulfate, 1x SSC, and 0.2x SSC (19). The arrays were scanned using GenePix 4000B microarray scanner (Axon Instruments, Union City, Calif.). We performed a minimum of two replicate experiments per strain, except for EH (HM-1:IMSS) and ED (SAW760) for which five replicate arrays were generated. Clones with significant cross-hybridization with bacterial genomic DNA were identified by labeling and hybridizing bacterial genomic DNA on the arrays as described above.
Microarray data analysis. Data were analyzed using ScanAlyze software (http://rana.lbl.gov/EisenSoftware.htm) to determine the fluorescence signal intensity of each dye at each spot. The intensities of channel 2 (Cy5-labeled Entamoeba genomic DNA) and channel 1 (Cy3-labeled reference DNA) measured the Cy5 and Cy3 signals, respectively.
(i) Data normalization. For all arrays hybridized with Entamoeba genomic DNA, the channel 2 signal intensity was adjusted such that the mean channel 2/channel 1 intensity, or R/G ratio, was 1.0 for the entire array (19). For arrays hybridized with control DNA, nonnormalized R/G signal was used; the cutoff for positive hybridization was the mean R/G value plus 3 standard deviations.
(ii) Filters used. Spots were filtered out on the basis of two criteria: poor-quality spots were discarded if they were flagged or had a channel 1 net mean intensity of <250, and spots with high noise/signal ratios were discarded if they had extremely different values in replicate arrays (average deviation/average of >0.4). Additionally, clones from contaminated plates and clones that cross-hybridized with bacterial genomic DNA were not analyzed.
Comparative genomic hybridizations (CGH). For each EH and ED strain, the hybridization signal to the clones on the array was compared to the hybridization signal for the reference strain, EH (HM-1:IMSS), and the relative hybridization was analyzed by genomotyping analysis by the method of Charlie Kim et al. (GACK) (40). This algorithm generates a graded output and assigns a range of values from 0.50 to +0.50 in 0.05 increments. For a GACK value of +0.50, there is 100% likelihood that a gene is present; for a GACK value of 0.50, there is 100% likelihood that a gene is absent, divergent, or present at reduced genomic abundance. The GACK output was put into four categories (on the basis of Southern blot hybridizations and sequence analysis): A (absent or highly divergent; GACK values of 0.50 and 0.45), B (significantly divergent; GACK values of 0.45 to 0.15), C (moderately divergent; GACK values of 0.10 to +0.40), and D (highly conserved; GACK values of +0.45 and +0.50).
Estimation of copy number. The average signal intensity for low-copy-number genes (1 or 2 loci), multicopy genes (6 to 10 loci), or high-copy-number repeat elements was calculated by analyzing a minimum of 10 clones in each category. Genes were selected on the basis of previously published literature and genome annotation. All clones analyzed contained a single ORF. To estimate the signal for specific regions of the rRNA episome, we chose clones that represented (i) a portion of the upper intergenic spacer (22,000 to 24,000 bp), (ii) a part of the transcribed unit (15,000 to 19,000 bp), and (iii) a part of the lower intergenic spacer (12,000 to 14,000 bp) (58). For regions with >2-kb sequence, clones with 100% match were selected. For regions with <2-kb sequence, clones with the highest E value were chosen, although these clones would always contain genomic sequences other than the sequence of interest. For each of the categories, the R/G signal for each EH and ED strain was retrieved, spot quality filters were applied, and the mean signal intensity ± standard error was calculated. Lists of the clones used in these analyses are available for review (see Table 4 in the supplemental material at http://microarray-pubs.stanford.edu/entamoeba_diversity/index.shtml).
Estimation of Cy5-labeled probe size. To determine the average probe length of the Cy5-labeled genomic DNA, labeled genomic DNA was electrophoresed in an agarose gel. Cy5 and ethidium bromide were visualized with the Molecular Dynamics Typhoon 8600 multimode imaging system (excitation wavelengths of 633 and 493 nm, respectively, and emission filters of 670 and 620 nm, respectively) (Amersham Biosciences, Piscataway, N.J.).
Estimation of sequence identity for clones in divergence categories. To determine the reliability of the divergence categories, all clones from categories A, B, and C and 200 random clones from category D of the ED (SAW760) analysis were compared to those in the ED (SAW760) sequence database. BLASTN analysis was performed, and the match length and percent identity of the first hit was retrieved.
Statistical analysis. A Student's t test (two-tailed distribution and two-sample equal variance) was utilized. The values in comparisons of sample pairs were considered significantly different if the P value was <0.05.
Southern blotting and PCR and sequence analysis.
Southern blotting was
performed by the standard protocol
(58). Genomic DNA (10 to
15 µg) was digested with EcoRI or HindIII, electrophoretically
separated on 1% agarose gels, blotted onto Hybond
N+ nylon membranes, UV cross-linked, and hybridized
with radiolabeled probes. The probes were PCR amplified from the
appropriate clone, and the probe sequence was verified, labeled with
[
-32P]dATP with the Random Primed DNA
labeling kit (Roche, Mannheim, Germany), and hybridized
with Express-Hyb (Clontech, Palo Alto, Calif.). Blots were subjected to
autoradiography, scanned, and prepared for publication using Adobe
Photoshop (version 6.01; Adobe Systems, San Jose, Calif.). Blots were
stripped using H2O-0.5% sodium dodecyl
sulfate and reused for subsequent hybridizations. The probes used were:
category A (ENTI070, ENTIL41, and ENTOU30), category B (ENTPF64,
ENTON45, ENTNW18, and ENTNQ01), and category D (ENTOI63, ENTOE77,
ENTOJ81, and ENTPJ36). Four loci (423.m00019, 271.m00049, 353.m00047,
and 8.m00351) were PCR amplified when possible from all EH strains and
ED (SAW760). PCR products were directly sequenced and analyzed using
CLUSTALW. Primer sequences are available for review (see Table
1 in the
supplemental material at
http://microarray-pubs.stanford.edu/entamoeba_diversity/index.shtml).
Cluster and dendrogram analysis.
The GACK divergence output was used
for clustering. Clones identified as being highly conserved in all
strains were removed from the analysis to facilitate use of the
clustering and visualization programs. The remaining clones
(
2,389) were clustered using the program XCLUSTER
(http://smd.stanford.edu/download/),
using a Pearson correlation, uncentered metric algorithm, and
hierarchical clustering. The clustering output was viewed using Java
Treeview version 1.0.3. (Alok Saldahna,
http://sourceforge.net/projects/jtreeview).
To generate dendrograms based on genotypic differences, the clones were
analyzed using the Mega 2.2 program (dMEGA2: Molecular Evolutionary
Genetics Analysis Software, Arizona State University)
(http://www.megasoftware.net/)
using an unweighted pair group method with arithmetic mean (UPGMA)
distance matrix algorithm and bootstrap values to estimate confidence
intervals.
Nucleotide sequence accession numbers. Nucleotide sequence data for the 423.m00019, 271.m00049, and 353.m00047 loci have been deposited in GenBank. The locus names and accession numbers in GenBank follow: Rahman_423.m00019, AY857547; NIH_353.m00047, AY857548; Rahman_353.m00047, AY857549; HK9_353.m00047, AY857550; and SAW_271.m00049, AY857551.
| RESULTS |
|---|
|
|
|---|
Microarray signal intensity estimates genomic abundance. In order to determine whether the hybridization signal corresponded to genomic abundance, we analyzed the signal from clones with no genomic insert, low-copy-number genetic loci (1 to 2 loci), high-copy-number genetic loci (6 to 10 loci), and highly repetitive genomic elements (see Fig. 1 in the supple-mental material at http://microarray-pubs.stanford.edu/entamoeba_diversity/index.shtml).The average probe length of Cy5-labeled genomic DNA was 500 to 1,500 bp, and all clones had comparable AT content allowing this comparison (data not shown). For clones with no genomic insert, the hybridization signal intensity was extremely low (R/G, 0.009 ± 0.002), indicating that Cy5-labeled E. histolytica genomic DNA did not cross-hybridize to the reference DNA. Signal intensities for 1-copy (R/G, 0.19 ± 0.01), 2-copy (R/G, 0.31 ± 0.05), and 6- to 10-copy-number genes (R/G, 1.65 ± 0.37) were proportional to the copy number. Clones with repetitive loci (Ehapt2 and EhRLE4) had high signal intensities (R/G, 1.03 ± 0.18 and 1.75 ± 0.16, respectively), although not proportional to the copy number, likely due to signal dampening from the adjacent nonrepetitive genomic regions. Clones from nontranscribed regions of the rRNA episome had half (R/G, 1.54 ± 0.17) the signal from transcribed regions (R/G, 3.95 ± 0.36) (each rRNA episome contains two identical transcribed repeats) (58). We determined the mean signal intensity for all clones on the array that reliably contained one ORF. Of the genes in EH (HM-1:IMSS), 13.8% had a mean signal intensity R/G of >1.0, indicating that these genetic loci were in high genomic abundance or were members of highly similar gene families.
|
Analysis of the rRNA episome revealed significant differences between EH and ED (see Fig. 2 in the supplemental material at http://microarray-pubs.stanford.edu/entamoeba_diversity/index.shtml).In EH (HM-1:IMSS), this 24.5-kb extrachromosomal DNA is present in 200 copies per genome and consists of two 5.9-kb identical inverted repeat regions that are transcribed and two nontranscribed regions of 9.2 kb (upper intergenic sequence) and 3.5 kb (lower intergenic sequence), respectively (58). Our analysis revealed that the upper and lower nontranscribed intergenic sequences had significantly higher signal in EH than ED. We performed sequence comparison of these regions with the ED (SAW760) genome sequence. The nontranscribed intergenic sequences of EH had little homology to comparable regions in ED (at most 91% identity over 64 to 141 bp). Thus, the low signal intensity of the rRNA nontranscribed region in ED accurately reflected the significant sequence dissimilarity of these two regions. Signal from the transcribed region of the rRNA episome was significantly lower in EH than in both ED strains (sequences of EH and ED exhibit 98.6% identity), and Southern blot analysis confirmed that the copy number of the rRNA episome was lower in EH than ED (see Fig. 2B in the supplemental material at http://microarray-pubs.stanford.edu/entamoeba_diversity/index.shtml).
|
Southern blot
analyses were performed with EH (HM-1:IMSS) and ED (SAW760) genomic DNA
and yielded the expected results (Fig.
2 [some data not
shown]). For clones in the divergence categories A and B, the
probes did not hybridize to genomic DNA from ED (SAW760). We used
genome sequence data from ED (SAW760) to further analyze the validity
of the divergence algorithm and performed BLASTN analysis on all clones
in categories A, B, and C and 200 random clones in category D. For
categories A, B, C, and D, the numbers of sequences that had no
significant BLASTN match (E value of >1E4) were 24.6,
13.3, 9.8, and 7.5%, respectively (Fig.
3A). The 7.5% of sequences in category D without a BLAST match most
likely represent incomplete genome sequence coverage of ED (
2X
at present). The higher proportion of such sequences in the highly and
moderately divergent categories indicated that those genomic loci had
no matches in the ED database on the basis of divergence (rather than
incomplete coverage alone). For those sequences with significant
matches, the BLASTN results were analyzed (Fig.
3B and C). The average
identity to the ED sequence database of clones from categories A, B, C,
and D was 89% ± 0.4%, 90% ±
0.3%, 92% ± 0.2%, and 93% ±
0.3%, respectively. The mean match length for clones in
categories A, B, C, and D was 191 ± 11, 242 ± 11, 273
± 7, and 368 ± 16 bp, respectively. Therefore, in the
highest divergence category (category A), although the query EH
sequences were
800 bp, the average match in the ED database
was 191 bp with 89% identity. Thus, both the Southern blot
hybridizations and large-scale sequence analyses indicated that the
divergence categories were quantitatively valid. Although these
analyses were performed for ED (SAW760), the validity of the categories
should be applicable to other Entamoeba
strains.
|
In order to obtain an overview of the comparative hybridization data, the divergence data from all usable clones were clustered with the XCLUSTER program using a Pearson correlation, uncentered metric algorithm, and hierarchical clustering (Fig. 4). Clones that were highly conserved in all strains were removed to facilitate the clustering program. Overall, 2,372 clones (27% of clones analyzed) were divergent in at least one strain of Entamoeba. Each Entamoeba strain had a unique genetic fingerprint compared to the EH (HM-1:IMSS) reference strain. Previous studies of EH clinical isolates relying almost exclusively on PCR-based analyses of a few highly polymorphic loci have shown that significant genetic diversity exists among EH strains (32, 33, 72, 73). The microarray data were generated from the analysis of >7,600 clones and suggest that genetic diversity exists on a genome-wide scale, including divergence in nonrepetitive genomic regions.
|
|
|
The genotype of EH (Rahman), an attenuated-virulence strain, is distinct from the genotypes of the virulent EH strains. To generate genotypic dendrograms, the divergence categories were analyzed using the UPGMA algorithm (Fig. 5). The analysis revealed that we could successfully differentiate the genotypes of E. histolytica and E. dispar. Additionally, strain EH (Rahman), a less virulent strain as assessed by decreased monolayer destruction, disease in animal models, cytotoxicity, and the only EH strain isolated from an asymptomatic individual was the most divergent of the EH strains and clustered separately from the virulent EH strains with significant bootstrap values (13, 27, 48-51). The separation of the EH (Rahman) strain from the other EH strains could also reflect the geographic origins of the strains and be unrelated to parasite phenotypes.
|
30 years ago; both
ED strains are xenic; and all four EH strains are axenic); thus, the
overall selection pressures within the ED and EH strains should be
comparable. Identification of E. histolytica genomic assemblies with unusually high divergence rates. In order to assess whether any regions of the EH genome were more prone to be divergent, we analyzed the extent of divergence in assemblies. The 12X genome sequence coverage of EH (HM-1:IMSS) at this time has been organized into 888 assemblies (assembly size ranges from 297 to 2 kb; assemblies contain 1 to 151 ORFs); of these assemblies, 554 are on the array (Neil Hall and Brendan Loftus [TIGR], personal communication). The size of the assembly correlated with the number of predicted ORFs (R = 0.9766), and the number of predicted ORFs correlated with the number of clones on the microarray (R = 0.9063) (data not shown).
We analyzed all the clones on the microarray for which we had a definitive assembly and ORF assignment, a total of 4,555 clones. Assemblies were categorized by the number of Entamoeba strains that were divergent in that assembly. The majority of assemblies (39%) had no divergence in any Entamoeba strain, 16% had divergence in one strain, 19% had divergence in two strains, 12% had divergence in three strains, 9% had divergence in four strains, and 5.4% had divergence in all five Entamoeba strains analyzed (see Fig. 3 in the supplemental material at http://microarray-pubs.stanford.edu/entamoeba_diversity/index.shtml). The assemblies in all five categories were characterized and were highly similar for the following: (i) mean ORF length (387 to 403 bp); (ii) percentage of ORFs that are hypothetical proteins (55 to 62%); (iii) average genomic abundance as assessed by signal intensity (R/G, 0.50 to 0.57); and (iv) the percentage of the assembly that is composed of simple (2.1 to 3.0%), low-complexity (9.7 to 12.6%), SINE (short interspersed nuclear element) (1 to 2.2%), or LINE (long interspersed nuclear element) (8.7 to 10.9%) repeats.
Therefore, the identification of 30 assemblies in which all five amebic strains were divergent but whose structural features and composition were apparently not different from the rest of the genome may indicate that these genomic regions are under different and higher divergence pressures than the remainder of the genome. Although we were able to identify a number of assemblies that apparently have high rates of divergence, we did not identify large portions of individual assemblies that were divergent in a given Entamoeba strain (data not shown). Thus, unlike bacterial genomes, large islands of genomic DNA do not appear to have been affected by the pressures that cause genetic divergence (39, 57). Since we do not have full genome coverage on the microarray, it is possible that there are larger regions of divergence that we did not identify due to lack of clones in a given genomic region. Additionally, single-nucleotide polymorphisms cannot be detected by our approach.
Identification of genetic loci associated with virulence. In order to identify genes associated with virulence, we analyzed the divergence data for genes divergent in ED compared to EH (Table 1). Overall, 1,640 to 1,817 loci were analyzed; the majority were highly conserved (94 to 98% in EH and 88 to 91% in ED), with 1,534 loci that were not divergent in any strain. The two ED strains (SAW1734 and SAW760) had the greatest number of divergent loci: 147 and 196, respectively. The EH strains had lower divergence: 111, 81, and 45 genes in 200:NIH, HK-9, and Rahman, respectively.
As previously mentioned, to date, only four genes have been identified as being absent (CP1, Ariel1, and Ehapt2) or highly divergent (CP5) in ED compared to EH (12, 67-69). We have now identified 80 novel genes that were divergent in both ED strains but conserved in all EH strains. The majority of these genes are hypothetical proteins; however, we identified an Arabidopsis thaliana avrRpt2-induced gene (AIG1) (565.m00023) (a plant gene involved in resistance to bacteria), a putative protein kinase (95.m00150), a peroxiredoxin 1-related gene (105.m00128), and a Ras family gene (395.m00027). We also identified nine genes that were divergent in the two ED strains and in the attenuated-virulent strain EH (Rahman) but conserved in virulent EH strains. Seven of the genes encoded hypothetical proteins; one gene coded for Rab geranylgeranyltransferase beta subunit (505.m00018), and one gene had a dihydrouridine synthase domain (34.m00240) (Table 2).
In order to determine whether despite their sequence dissimilarity, these genes encode functionally similar proteins in ED, we performed TBLASTX against the ED (SAW760) genome database on the 67 loci from the highly and significantly divergent categories and 30 random loci from the highly conserved category. The overall homology for the predicted proteins was much lower in the two divergent categories than in the highly conserved set (see Table 8 in the supplemental material at http://microarray-pubs.stanford.edu/entamoeba_diversity/index.shtml). Some genes from the highly or significantly divergent categories did have highly similar orthologues; these were likely identified as divergent on the basis of decreased copy number.
The overall paucity of genes restricted to virulent Entamoeba strains indicates that the presence or absence of a given gene (or sets of genes) may not fully explain the vastly different phenotypes of E. histolytica and E. dispar (Table 1). Expression differences in highly conserved genes, as known for the amebapore genes, undoubtedly play large roles in amebic pathogenic potential (46, 52). Since we did not have full genome coverage on our microarray, there are likely other genes restricted to EH that we did not identify. However, our array was composed of a random subset of genes, and a whole-genome analysis will likely reveal similar levels of divergence in ED. To validate that the genes we identified as divergent were functionally different, we checked the expression profiles of these genes. Genes divergent in ED (SAW760) had significantly lower expression levels than those in EH (HM-1:IMSS) (Ryan MacFarlane, personal communication).
In order to determine whether genes in category A were highly divergent or absent, we analyzed the 22 genes that were in this category for ED (SAW760). For each gene, we looked for homologues in the ED (SAW760) genome sequence database. Using these data, we identified 45% of genetic loci that appear to have diverged as a consequence of genetic drift, 23% that were divergent on the basis of copy number differences, and 32% that appear to be absent in the ED genome (although the limited 2X genome coverage makes it difficult to definitively state that a given gene is missing). On the basis of the CGH data and the limited sequence analysis, it appears that the majority of differences between ED (SAW760) and EH (HM-1:IMSS) resulted from genetic drift, rather than gene loss; however, a comprehensive genome analysis of ED will be necessary to make definitive conclusions.
We identified a number of loci that were divergent between EH strains including two hypothetical proteins (423.m00019 and 353.m00047) and a Ras guanine nucleotide releasing factor (271.m00049) (see Tables 7 and 9 in the supplemental material at http://microarray-pubs.stanford.edu/entamoeba_diversity/index.shtml). In order to confirm the CGH data, we PCR amplified and sequenced these loci (Table 3). For all loci categorized as highly divergent, PCR gave negative results (40). The average sequence identity for loci in categories B, C, and D were 78, 84, and 99%, respectively. Some variability was seen in the sequence identities for categories B and C. Since the array data were based on clones that contained ORFs and adjacent genomic sequences, the divergence we measured by CGH for 423.m00019 may have been in noncoding regions. Intergenic differences can, however, reflect functional differences (i.e., divergent promoters may affect expression levels). Thus, as a screening tool to identify divergent genetic loci, array-based CGH were valid for EH-EH comparisons.
| DISCUSSION |
|---|
|
|
|---|
In addition to giving a genome-wide overview of the extent of genetic diversity, CGH allowed identification of genetic regions that were highly prone to divergence. We identified assemblies in which every EH and ED strain studied had some divergence, indicating that these genomic regions may be hot spots for genetic variability. No obvious structural component of these regions that would predispose them to diverge could be identified, although their organization in scaffolds is not known. Genomic architecture can predispose to divergence, as spontaneous subtelomeric deletions of Plasmodium chromosomes have been observed and genomic regions rich in tRNA genes or with transposon recognition sequences are more prone to integration (3, 21, 41, 44, 53). Irrespective of structural issues, genomic regions can be under different evolutionary pressures (introns and pseudogenes are rapidly evolving) or functional pressures (highly antigenic genes may be under immunogenic pressures), and these pressures may affect the overall divergence in a given genomic context (45, 47).
A number of factors likely contribute to this observed genome plasticity in Entamoeba. Retrotransposons, which are abundant in EH, play a significant role in genome evolution by their ability to move to new chromosomal locations (10, 59). Significant numbers of E. histolytica genes are in multiple copies or are members of highly homologous gene family members. Duplicate genes have higher rates of intron gain or loss in comparison to orthologues and are under accelerated evolutionary pressures either through decreased functional constraints or via positive selection (8, 16, 28, 38, 74). E. histolytica contains a number of episomally maintained DNA circles, which in other systems contribute to genome plasticity due to intramolecular homologous recombination between direct tandem repeats (20, 24). Whether genetic plasticity is enhanced in clinical scenarios to allow more-virulent strains to survive and disseminate or has resulted in a large subset of avirulent EH strains (which may account for the overall low rates of invasive E. histolytica infection) is not clear.
The genomic differences between E. histolytica and E. dispar were the most extensive. We identified a number of genes conserved in all EH strains and highly divergent in both ED strains, including a putative protein kinase, a peroxiredoxin 1-related gene, and a Ras family member. Peroxiredoxins are antioxidant enzymes that protect against reactive oxygen species and reactive nitrogen species (14, 37). Peroxiredoxin-null Saccharomyces cerevisiae cells are hypersensitive to oxidative stress and genomically unstable (70). Small GTP-binding proteins regulate many diverse processes in eukaryotic cells, including signal transduction, cell proliferation, cytoskeletal organization, and intracellular membrane trafficking. In Entamoeba, small GTP-binding proteins play roles in polarity, motility, phagocytosis, and perhaps encystation; ED may be affected in a number of cellular functions if functionally divergent in some of these genes (43, 54, 66). Nine genes were identified that were divergent in both ED and EH (Rahman) but conserved in all virulent EH strains. Most of these were hypothetical genes, but two had homologues in other systems: a Rab geranylgeranyltransferase and a gene with a dihydrouridine synthase domain. Functional studies will be necessary to elucidate the roles of these genes in amebic pathogenesis. The relatively low number of divergent loci in avirulent versus virulent Entamoeba is surprising, considering the vast biological differences between the two species. This implies that gain or loss of a gene may not be sufficient to confer pathogenicity. Thus, comparative genomics alone cannot unravel the complexities of virulence in these parasites. Expression levels, coordination, and timing, along with the genomic context, may all contribute to virulence and need further investigation.
Using CGH, we identified that the genomic fingerprint of an EH strain with attenuated virulence was phylogenetically distinct from virulent EH strains; it is an intriguing observation, because it implies that genotypes may be associated with phenotypes. If a similar association were identified in clinical isolates, it would have enormous clinical implications, since only those individuals with invasive subtypes of EH would need treatment. At this time, studies to determine the extent of genetic diversity among clinical isolates and potential genotype-phenotype correlations are under way. Our work has laid the foundation for pursuing genome-wide population studies of E. histolytica and understanding the impact of parasite genetics on human disease.
| ACKNOWLEDGMENTS |
|---|
We gratefully acknowledge the help of Brendan Loftus, Neil Hall, and Iain Anderson (TIGR and Sanger Center) for access to EH clones and sequence and genome data. We also thank David Mirelman and Dan Eichinger for parasite strains, Sandeep Jaggi for help with data analysis, Charlie Kim for input on the GACK algorithm, Kevin Visconti for microarray printing, and all members of the lab for helpful suggestions and discussions.
| FOOTNOTES |
|---|
Present address: Lewis-Sigler Institute for Integrative Genomics,Princeton, NJ 08544. ![]()
| REFERENCES |
|---|
|
|
|---|