Simple, experimentally tractable systems such as Saccharomyces cerevisiae, Chlamydomonas reinhardtii, and Arabidopsis thaliana are powerful models for dissecting basic biological processes. The unicellular green alga C. reinhardtii is amenable to a diversity of genetic and molecular manipulations. This haploid organism grows rapidly in axenic cultures, on both solid and liquid medium, with a sexual cycle that can be precisely controlled. Vegetative diploids are readily selected through the use of complementing auxotrophic markers and are useful for analyses of deleterious recessive alleles. These genetic features have permitted the generation and characterization of a wealth of mutants with lesions in structural, metabolic, and regulatory genes. Another important feature of C. reinhardtii is that it has the capacity to grow with light as a sole energy source (photoautotrophic growth) or on acetate in the dark (heterotrophically), facilitating detailed examination of genes and proteins critical for photosynthetic or respiratory function. Other important topics being studied using C. reinhardtii, many of which have direct application to elucidation of protein function in animal cells (26), include flagellum structure and assembly, cell wall biogenesis, gametogenesis, mating, phototaxis, and adaptive responses to light and nutrient environments (32, 44). Some of these studies are directly relevant to applied problems in biology, including the production of clean, solar-generated energy in the form of H2, and bioremediation of heavy metal wastes.
Recent years have seen the development of a molecular toolkit for C. reinhardtii (42, 44, 66, 98, 99). Selectable markers are available for nuclear and chloroplast transformation (4, 5, 12, 13, 30, 44, 56, 82). The Arg7 (22) and Nit1 (30) genes are routinely used to rescue recessive mutant phenotypes. The bacterial ble gene (which codes for zeocin resistance [70, 112]) is an easily scored marker for nuclear transformation, and the bacterial aadA gene (which codes for spectinomycin and streptomycin resistance) is a reliable marker for chloroplast transformation (39). Nuclear transformation can be achieved by particle bombardment (22, 23, 57, 73), agitation with glass beads (56, 81), or electroporation (105, 121). Generation of tagged insertional mutations by nuclear transformation has led to the rapid identification of mutant alleles (3, 17, 20, 21, 60, 108, 109, 120, 132, 138). Plasmid, cosmid (92, 139), and bacterial artificial chromosome (BAC) (66) libraries are used to rescue nuclear mutations. Expression of specific genes can be repressed using both antisense (65, 103) and RNA interference technologies (50, 58, 107; N. F. Wilson and P. A. Lefebvre, abstract presented at the 10th International Chlamydomonas Conference, 2002). In addition, endogenous transposable elements (31, 102, 127), marker rescue of Escherichia coli mutants (89, 136), direct rescue of C. reinhardtii mutants (38, 94, 132), and map-based techniques are being used to clone specific genes. Chloroplast transformation (12, 83) has permitted disruption (118) and site-specific mutagenesis of genes on the chloroplast genome (33, 34, 35, 43, 45, 46, 63, 64, 76, 129, 131, 134, 140). Reporter genes such as green fluorescent protein (36, 37), Ars (arylsulfatase) (19), and Luc (luciferase) (77; M. Fuhrmann, L. Ferbitz, A. Eichler-Stahlberg, A. Hausherr, and P. Hegemann, abstract presented at the 10th International Chlamydomonas Conference, 2002) are helping to elucidate processes such as transcriptional regulation (16, 49, 87, 93, 125) and polyadenylation-mediated chloroplast RNA decay (59).
Ongoing genome projects offer the scientific community a wealth of information concerning the sequence and organization of the C. reinhardtii genome. Combined with the molecular toolkit, these data expand our ability to analyze gene function, organization, and evolution and to examine how environmental parameters and specific mutations alter global gene expression.
Generation of C. reinhardtii expressed sequence tag (EST) information was initiated in Japan (www.kazusa.or.jp/en/plant/chlamy/EST) and augmented by a National Science Foundation-supported project (www.biology.duke.edu/chlamy_genome/) that has generated over 200,000 additional sequences assembled into over 10,000 “unique” cDNAs (106; unpublished data). Microarrays with representation for all of the plastid genes and approximately 3,000 nuclear genes (48, 68) have been used to probe global gene expression in wild-type (48, 68) and mutant strains (Z. Zhang and A. R. Grossman, unpublished results). Furthermore, the genomic information has aided in the generation of tools for map-based cloning, based on linkage of genetic and physical markers (55, 126).
The accumulation of cDNA sequence information and the development of robust molecular markers have stimulated the interest of the Joint Genome Institute (JGI), Department of Energy; under the leadership of one of us (D. Rokhsar), a rough draft of the near-complete genome sequence was made publicly accessible in the early part of 2003. This sequence has been partially annotated, and both cDNA information and molecular markers have been anchored to the sequence. These advances have dramatically enhanced the utility of C. reinhardtii as a model system.
NUCLEAR GENOME SEQUENCE
Assembly and annotation of the genome. The nuclear genome of C. reinhardtii is 100 to 110 million bp, comprising 17 genetic linkage groups (55), with a very high GC content (nearly 65%) that results in cloning difficulties and limits the length of reads from shotgun sequencing reactions. Generating a high-quality genome sequence has therefore presented unusual challenges. Sequencing strategies being used involve production of random genomic fragments of ∼3 and ∼6 kbp, cloning of the fragments into plasmids, and obtaining paired end sequences of the insert DNA. Paired end sequences from 35 to 40 kbp fragments in fosmid vectors are also being generated. This information is being integrated with end sequence data from 15,000 BAC clones (see “Alignment of Genetic and Physical Maps”).
With a sequence redundancy of nearly 10-fold, the randomly sequenced fragments generated by the strategies described above can be assembled into “contigs” (contiguous stretches of reconstructed sequence obtained from overlapping end sequences) that are further linked together into “scaffolds” (longer stretches of reconstructed sequence interrupted by “gaps” whose size is roughly known based on spanning clones). A preliminary rough draft of the C. reinhardtii genome is already available at the JGI Chlamydomonas Web site (see below). A high-quality draft genome assembly is anticipated by the fall of 2004.
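As a rough illustration of what a nearly 10-fold redundancy implies, the following sketch applies the idealized Lander-Waterman model, in which reads are assumed to land uniformly at random so that the expected fraction of bases with no read at coverage c is e^(-c). This model is an assumption introduced here for illustration and is not described as part of the sequencing project; the genome size is the midpoint of the range given above, and the read length is a hypothetical value.

```python
import math

def expected_uncovered_fraction(coverage: float) -> float:
    """Idealized Lander-Waterman estimate: P(a base has zero reads) = exp(-c)."""
    return math.exp(-coverage)

def reads_needed(genome_bp: float, coverage: float, read_len_bp: float) -> int:
    """Number of reads required to reach the requested average coverage."""
    return math.ceil(genome_bp * coverage / read_len_bp)

genome_bp = 105e6     # midpoint of the 100 to 110 million bp estimate in the text
coverage = 10.0       # "nearly 10-fold" redundancy
read_len_bp = 500.0   # hypothetical read length, for illustration only

frac = expected_uncovered_fraction(coverage)
print(f"expected uncovered fraction: {frac:.2e}")
print(f"expected uncovered bases:    {frac * genome_bp:,.0f}")
print(f"reads needed:                {reads_needed(genome_bp, coverage, read_len_bp):,}")
```

Under these assumptions only a few thousand bases would be expected to go unsampled by chance. Gaps in a real assembly are likely to be dominated by other effects, such as the cloning difficulties associated with the high GC content noted above, which the uniform model does not capture.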
We plan to generate a complete sequence reconstruction of C. reinhardtii chromosomes by linking together sequence scaffolds using genetic and clone-based physical maps (see “Alignment of Genetic and Physical Maps”). Sequencing of selected regions of the genome is likely to be finished by further targeted efforts to close gaps in scaffolds and by resequencing low-quality regions to achieve a uniform error rate of less than one error per 10,000 bases. The ultimate goal is a high-quality reference genome sequence.
Annotation of the gene content of C. reinhardtii is being facilitated by copious EST information produced by several projects (see below) and availability of modern gene-finding methods that exploit expressed sequence evidence, statistical signatures of coding regions, and conservation of deduced polypeptide sequences with known proteins from other organisms. One intriguing possibility for further analysis of the C. reinhardtii genomic sequence is to compare it with sequence information from the colonial alga Volvox carteri, with the goal of highlighting coding regions that may be unique to the chlorophyte algae, and possibly to identify putative conserved regulatory regions. While computational methods can only reliably predict coding regions, the large scale EST collections will enable many 5′ and 3′ untranslated regions (UTRs) to be directly determined. Furthermore, probes synthesized based on ab initio gene predictions can be used to identify and clone rare transcripts. Integration of complementary community informatics resources centered on the genome will provide a comprehensive view of the C. reinhardtii genome that is readily accessed by many different network locations (see “Toward an Integrated Database”).
Chlamydomonas Genome Portal. Genomic information generated at JGI can be accessed through the JGI Chlamydomonas Genome Portal (www.jgi.doe.gov/chlamy), which is intended as an archival, Web-based source for C. reinhardtii genomic sequence information and associated annotations (Fig. 1A). Prior to initial publication of the genome sequence and its annotation and analysis, items presented on the JGI site should be considered to be preliminary results and a community resource.
Fig. 1. (A) Schematic of JGI genome portal. The diagram shows the internal connections of the JGI Genome Portal. Information on BLAST results, EST alignments, and gene models can be accessed through the Search page. From the gene model information page, or protein page, InterPro domains and Smith-Waterman alignments to protein databases are displayed with a graphical interface. With the version 2.0 release GO and KEGG will be available, as well as the ability to annotate gene models. The Chlamydomonas Genome Portal is accessible at www.jgi.doe.gov/chlamy. (B) Browse view. Screen shot of the browse view for several gene models displayed on the genome. Displayed simultaneously are overlapping EST alignments and Blastx results. (C) Protein page. The protein page displays information about a gene model. InterPro results, Smith-Waterman alignments, and the protein and transcript sequence for this model can be retrieved from this page.
Various precalculated features identified on the genome (exons; genes; mRNA, EST, or unigene alignments; markers for mapping; protein BLAST hits; etc.) are organized in “tracks” using a graphical interface similar to that developed at Santa Cruz for the human genome (54) (Fig. 1B). Clicking on (selecting) a predicted gene will display a page (Fig. 1C) showing protein and transcript sequences, precalculated BLAST results (1), and InterPro (79) determinations of protein domains. Clicking on an EST, unigene, or mRNA alignment displays a graphical view of the alignment as well as information at the sequence level and BLAST results relative to known proteins from various organisms.
Users can reach a genomic region of interest in a variety of ways. One can perform BLAST analysis against the genome and view resulting alignments in the context of all the other database features. For example, comparing an Arabidopsis protein to the C. reinhardtii genome with BLAST would access the region of the C. reinhardtii genome with a similar sequence, immediately recovering the gene at that location. There are tracks for predicted gene structures based on the GeneWise (9) and GreenGenie (Susan Dutcher, personal communication) algorithms, as well as for alignments of publicly available ESTs (106), molecular markers (55), array elements, and known protein sequences from specific organisms.
Since a BLAST analysis of the genome against all proteins in GenBank has already been performed and will be periodically updated, one can text search through the names of precomputed alignments. Other access points include the GO (gene ontology) and KEGG (Kyoto Encyclopedia of Genes and Genomes) links that organize genes into functional groupings.
The JGI Chlamydomonas Genome Portal is in a dynamic state of development. Assignment of gene functions is a feature of any genome project that is continually being informed by sequence similarities, experimental evidence, phylogenetic data, and expression profiles. To capture the richest annotation of the C. reinhardtii genome, the JGI portal includes interfaces for community annotation, allowing experts around the world to add their input, and incorporates links to publications, experiments, and descriptive text. New features being integrated into the JGI Portal include tracks showing spanning BACs and fosmids. Improved gene models will merge ab initio predictions with EST/mRNA evidence, increasing the number of complete gene predictions (including UTRs) and revealing alternatively spliced transcripts. Sequence signals for transmembrane spanning regions, signal peptides, and targeting sequences will also be computed and added to the site. Linkages to and from JGI pages to other community resources, notably ChlamyDB, are being developed, as described below under “Toward an Integrated Database.”
THE TRANSCRIPTOME
Efforts are currently under way to identify transcribed regions of the genome and to analyze their expression patterns.
cDNA information. After a pilot experiment by S. Purton, a collection of 37,940 5′-end ESTs was generated for C. reinhardtii by the Kazusa DNA Research Institute in Japan (2). Normalized, size-selected libraries were generated from cells grown under low- or high-CO2 conditions. A National Science Foundation-supported cDNA project performed at the Carnegie Institution of Washington and the Genome Technology Center at Stanford has led to the generation of cDNA libraries constructed from RNA isolated from cells exposed to a variety of different conditions (Table 1); these libraries were normalized prior to sequencing individual clones. One library is from the field isolate S1D2 (41), which has numerous sequence polymorphisms but is interfertile with the laboratory strain 21gr, and is used for map-based cloning of mutant alleles (55). Nearly 200,000 clones have been sequenced from their 3′ and 5′ ends (106), and full-length sequences are being generated. Our assembly protocol is based on the commonly used Phrap program, which takes into account sequence quality. The assembly generates assemblies of contiguous ESTs (ACEs), which theoretically represent unique genes (106; J. Shrager, C.-W. Chang, J. Davies, E. H. Harris, C. Hauser, R. Tamse, R. Surzycki, M. Gurjal, Z. Zhang, and A. R. Grossman, presented at the proceedings of the 12th International Congress on Photosynthesis, 2001) (www.biology.duke.edu/chlamy/PDF/Shrager2003.pdf). Sequences from the ∼10,000 ACEs in the assembly designated 20021010 (dated 10 October 2002) have been annotated on the basis of BlastX homology to potential homologs in other organisms. We are currently preparing a final assembly of all of the EST data, which will include those from S1D2 as well as from the Purton and Kazusa projects. Knowing the distribution of ESTs among the cDNA libraries and the conditions used for library generation, we can infer a qualitative image of the expression pattern of specific genes.
Accordingly, we have identified several genes represented by multiple cDNAs in the stress libraries (including arylsulfatase, phosphatases, and regulatory proteins) that are not represented in the core library.
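The kind of inference described above can be sketched as a simple tally of ESTs per library for each ACE. The example below is only an illustration of the idea: the ACE identifiers, library names, counts, and the two-EST cutoff are all invented here and do not come from the project data. Because the libraries were normalized, such counts can support only a qualitative reading.

```python
from collections import Counter, defaultdict

# Hypothetical (ACE id, library) pairs, one per EST; all names are invented.
est_assignments = [
    ("ACE_0001", "core"), ("ACE_0001", "core"), ("ACE_0001", "stress"),
    ("ACE_0002", "stress"), ("ACE_0002", "stress"), ("ACE_0002", "stress"),
    ("ACE_0003", "core"),
]

def counts_by_library(assignments):
    """Tally the number of ESTs contributed by each library to each ACE."""
    table = defaultdict(Counter)
    for ace, library in assignments:
        table[ace][library] += 1
    return table

def library_specific(table, library, absent_from, min_ests=2):
    """ACEs with at least min_ests ESTs in `library` and none in `absent_from`."""
    return sorted(
        ace for ace, counts in table.items()
        if counts[library] >= min_ests and counts[absent_from] == 0
    )

table = counts_by_library(est_assignments)
print(library_specific(table, "stress", "core"))
```

With the invented data, only ACE_0002 is reported as represented by multiple ESTs in the stress library and absent from the core library.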
Table 1. cDNA libraries
Microarray construction and application. The DNA microarray is currently the most commonly used and widely applicable technique for the global analysis of gene expression. We have completed and are distributing a first-generation cDNA array. A region of each cDNA 3′ end was amplified using a universal primer in the vector and a specific primer ∼400 bp upstream of the 3′ end. PCR products were purified and printed onto GAPS II amino silane-coated slides (Corning), with each slide carrying four replicate spots of each cDNA fragment. For version 1.0, we chose clones with high-quality sequence information from 2,761 distinct ACEs. As of January 2003, a slightly different version is being distributed (version 1.1), with ∼300 additional genes amplified either from our EST libraries or from other sources; many were kindly provided by other laboratories. Within 2 years we plan to generate an array representing the entire C. reinhardtii genome.
We and others have already performed experiments with these arrays. Recently, we have identified genes activated by high-intensity light under low-CO2 conditions (48); these genes encode photorespiratory proteins, proteins that combat the accumulation of toxic oxygen radicals, polypeptides that function in concentrating inorganic carbon, and several proteins of unknown function. Expression studies have also been performed with wild-type and mutant cells transferred from nutrient-replete to sulfur-deficient medium. For example, the Sac1 gene controls the acclimation of cells to sulfur deprivation conditions and encodes a regulatory protein (17, 18) that has some similarity to transporters with 12 membrane-spanning helices. Figure 2 shows a set of microarrays generated for CC-425 and the sac1 mutant following imposition of sulfur deprivation. A number of transcripts were found to increase dramatically during starvation. Some encode proteins involved in sulfur metabolism (e.g., the Ars gene [which encodes arylsulfatase] and the Ats1 gene [which encodes ATP sulfurylase]) or other cellular processes (e.g., Ecp76, which encodes a cell wall polypeptide specific to sulfur stress cells [116]), while the functions of several others remain unknown (Fig. 2).
Fig. 2. Chlamydomonas microarrays. Shown are microarray images generated after 24 h of sulfur starvation of the parental strain (CC-425, left panel) and the sac1 mutant in the same genetic background (CC-3794, right panel). Red fluorescence indicates an increase in the level of the transcript during sulfur deprivation, while green fluorescence indicates a decrease in the level of the transcript. Spot 1 represents the Ecp76 gene (963017H04), and spot 6 represents an LI818 gene (894097E05), which encodes a polypeptide that is part of the light-harvesting protein family. The functions of the genes represented by spots 2 to 5 and 7 are not known (all of these spots are circled). The orange arrow marks a gene [encoding a putative poly(A) binding protein; 894006E07] for which the transcript increases in both CC-425 and the sac1 mutant, while the white arrow marks a gene whose transcript increases in sac1 but not CC-425 cells.
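The red and green signals described in the legend are commonly summarized as a log2 ratio per spot. The sketch below shows that calculation under stated assumptions: the red channel is taken to be the sulfur-deprived sample (consistent with the legend), the intensities are invented background-subtracted values, and the twofold threshold is an arbitrary choice made here rather than a criterion used in the experiments described.

```python
import math

def log2_ratio(red: float, green: float, floor: float = 1.0) -> float:
    """log2(red/green), with intensities clamped to a small floor to avoid log(0)."""
    return math.log2(max(red, floor) / max(green, floor))

def call(ratio_log2: float, threshold: float = 1.0) -> str:
    """Classify a spot; |log2 ratio| >= 1 (twofold) is an arbitrary cutoff."""
    if ratio_log2 >= threshold:
        return "increased"
    if ratio_log2 <= -threshold:
        return "decreased"
    return "unchanged"

# Hypothetical background-subtracted intensities as (red, green) pairs.
spots = {
    "spot_A": (8000.0, 1000.0),
    "spot_B": (500.0, 2000.0),
    "spot_C": (1200.0, 1000.0),
}

for name, (red, green) in spots.items():
    ratio = log2_ratio(red, green)
    print(name, round(ratio, 2), call(ratio))
```

In practice, replicate spots (four per cDNA on these slides) and channel normalization would need to be taken into account before any call is made; neither is modeled here.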
Similar studies are being conducted in the Grossman laboratory on phosphorus and nitrogen limitation, as well as on the physiological effects of different light qualities. Other microarray studies have been initiated with Krishna Niyogi (high-light-activated genes), Donald Weeks (CO2-activated genes), and Jean-David Rochaix (mutants in photosynthetic function). We have also distributed several hundred arrays to researchers working on C. reinhardtii, and it is expected that a large corpus of data will be generated in the coming months that should begin to reveal global and interacting regulatory features of the genome. A specific microarray section is being introduced into the Chlamydomonas Genome Project Database in which all relevant information regarding array elements (sequence, position on the array, ACE and gene models and their annotation, etc.) will be made available.
ALIGNMENT OF GENETIC AND PHYSICAL MAPS
An important component of the genome project has been the placement of molecular markers onto the C. reinhardtii genetic map, with the aim of facilitating map-based cloning of genes identified by mutations. Over the last 50 years, more than 200 phenotypic markers (mostly mutations) have been mapped onto the 17 C. reinhardtii linkage groups, and recently, more than 270 molecular markers have been placed on the linkage map. Some of these have been correlated with mutant data, allowing for the alignment of the physical and genetic maps. The defined physical markers are either restriction fragment length polymorphism- or PCR-based markers. The positioning of these markers onto linkage groups provides, on average, a map in which any given point on the C. reinhardtii genome is within 2 cM of a mapped molecular marker (55, 126), corresponding to 150 to 200 kbp of genomic sequence.
To facilitate the use of the molecular map for map-based cloning, a BAC library of more than 15,000 clones has been generated and arrayed, providing an eightfold coverage of the nuclear genome. (Individual BAC clones or the entire library can be obtained from the Clemson University Genomics Institute: www.genome.clemson.edu.) JGI has sequenced both ends of all clones in this library, and this information is available and can be searched using BLAST on the JGI Web site (bahama.jgi-psf.org/prod/bin/chlamy/home.chlamy.cgi). More than 2,500 of these clones, focusing on those containing mapped molecular markers, have been fingerprinted and placed into overlapping BAC contigs. The BAC contigs now cover more than 25% of the genome. As the assembly of the nuclear genome proceeds, by linking together sequence scaffolds, it will be increasingly useful to compare BAC end sequences with the genomic sequence to place additional BACs onto the physical/genetic map. Ultimately, a tiling path of BAC clones corresponding to the complete C. reinhardtii genetic and physical maps will be generated.
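The library figures quoted above can be cross-checked with simple arithmetic. The sketch below derives the mean insert size implied by roughly 15,000 clones at eightfold coverage of a 100 to 110 Mbp genome, and then applies the standard Clarke-Carbon relation, P = 1 - (1 - f)^N, to estimate the chance that a given locus is present in at least one clone. The derived insert sizes are a back-calculation made here, not values reported for the library, and the relation assumes randomly distributed clones.

```python
def implied_mean_insert_kbp(genome_mbp: float, fold_coverage: float, n_clones: int) -> float:
    """Mean insert size implied by the stated coverage and clone count."""
    return genome_mbp * 1000.0 * fold_coverage / n_clones

def prob_locus_represented(insert_kbp: float, genome_mbp: float, n_clones: int) -> float:
    """Clarke-Carbon estimate: P = 1 - (1 - f)^N, where f = insert size / genome size."""
    f = insert_kbp / (genome_mbp * 1000.0)
    return 1.0 - (1.0 - f) ** n_clones

n_clones = 15000       # "more than 15,000 clones"
fold_coverage = 8.0    # "eightfold coverage"

for genome_mbp in (100.0, 110.0):
    insert = implied_mean_insert_kbp(genome_mbp, fold_coverage, n_clones)
    p = prob_locus_represented(insert, genome_mbp, n_clones)
    print(f"genome {genome_mbp:.0f} Mbp: mean insert ~{insert:.1f} kbp, P(locus cloned) ~{p:.4f}")
```

Under these assumptions the implied mean insert is in the range of about 53 to 59 kbp, and nearly every locus would be expected to be represented in the library at least once.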
The information already available has made it possible to apply map-based cloning strategies to the identification of mutant alleles in C. reinhardtii, e.g., lf1 (R. Nguyen and P. Lefebvre, presented at the 10th International Conference on the Cell and Molecular Biology of Chlamydomonas, 2002) and bld2 (27). The Bld2 gene was cloned by identifying overlapping BAC clones covering 720 kbp of genomic sequence corresponding to 4.5 cM on linkage group III. The BAC clone containing the wild-type Bld2 gene was identified by transforming individual BAC clones into bld2 mutant cells to rescue the mutant phenotype.
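The relationship between genetic and physical distance used in these map-based approaches can be made concrete with the numbers given in this section. The sketch below is simple arithmetic performed here for illustration; the comparison between the genome-wide average and the Bld2 interval is not one drawn by the original studies.

```python
def kbp_per_cm(physical_kbp: float, genetic_cm: float) -> float:
    """Physical distance per unit of genetic distance."""
    return physical_kbp / genetic_cm

# Genome-wide average implied by "2 cM ... corresponding to 150 to 200 kbp".
avg_low = kbp_per_cm(150.0, 2.0)
avg_high = kbp_per_cm(200.0, 2.0)

# Local value for the Bld2 region: 720 kbp spanning 4.5 cM on linkage group III.
bld2_local = kbp_per_cm(720.0, 4.5)

print(f"genome-wide average: {avg_low:.0f} to {avg_high:.0f} kbp/cM")
print(f"Bld2 region:         {bld2_local:.0f} kbp/cM")
```

The Bld2 interval works out to about 160 kbp per cM, higher than the 75 to 100 kbp per cM implied by the average marker spacing. A difference of this kind is not unexpected, because the amount of recombination per unit of physical distance generally varies along chromosomes, so the number of BAC clones needed to span a mapped interval can differ from region to region.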
Map-based cloning will be greatly accelerated by a high density of genetically mapped polymorphisms between the laboratory strain 21gr and field isolate S1C5, which is very similar to S1D2. Sequence information already available suggests that the frequency of polymorphisms between the laboratory and wild-isolate strains is surprisingly high. In a survey of more than 29,000 nucleotides from the 3′ UTR of 62 transcripts, there were 2.7 base substitutions and 0.54 insertions or deletions per 100 bases. This level of sequence polymorphism will allow any new mutation in a laboratory strain to be mapped both genetically and physically. A protocol for mapping any new mutation by crosses to S1C5 followed by PCR-based detection of a set of molecular markers was recently described (55). Once a mutation has been mapped to a genetic interval, more detailed fine-structure mapping may require that additional molecular markers in the interval of interest be identified. Such markers can be easily obtained from DNA sequence in regions of interest by searching for microsatellite sequences [usually (GT)n repeats]. Thousands of microsatellites, dispersed throughout the genome, can be converted into PCR-based molecular markers by designing specific oligonucleotide primers for PCR amplification of the microsatellite-containing sequence, followed by identification of the different alleles by sizing products on gels (the different alleles will have different numbers of GT repeats). Kang and Fawley (52) have used this procedure to map microsatellite sequences in C. reinhardtii.
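A search for (GT)n microsatellites of the kind described above can be sketched with a regular expression. This is a minimal illustration, not the procedure used in the cited studies: the minimum of six repeat units, the 20-base flank width, and the example sequence are all arbitrary choices made here, and only the given strand is scanned (a (GT)n run on the opposite strand would appear as (AC)n).

```python
import re

def find_gt_microsatellites(seq: str, min_repeats: int = 6):
    """Return (start, end, n_repeats) for each run of >= min_repeats GT units.

    Coordinates are 0-based with an exclusive end.
    """
    pattern = re.compile(r"(?:GT){%d,}" % min_repeats)
    return [
        (m.start(), m.end(), (m.end() - m.start()) // 2)
        for m in pattern.finditer(seq.upper())
    ]

def flanks(seq: str, start: int, end: int, width: int = 20):
    """Sequence on each side of a repeat, where PCR primers could be placed."""
    return seq[max(0, start - width):start], seq[end:end + width]

# Invented example sequence containing one (GT)8 run.
example = "ACGTTAGC" + "GT" * 8 + "CCATGGA"
hits = find_gt_microsatellites(example)
print(hits)
for start, end, n in hits:
    print(n, "repeats; flanks:", flanks(example, start, end))
```

Alleles that differ in repeat number would yield PCR products differing in length by 2 bp per GT unit, which is the size difference resolved on gels in the approach described above.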
ORGANELLE GENOMES
A complete C. reinhardtii mitochondrial genome sequence is available (GenBank accession U03843). This 15.7-kb genome encodes the cytochrome b and cytochrome oxidase apoproteins, six NADH dehydrogenase subunits, a protein resembling reverse transcriptase, large and small mitochondrial rRNAs (fragmented), and three tRNAs. All other mitochondrial components are presumably encoded in the nuclear genome.
Completion of the entire sequence of the chloroplast genome of C. reinhardtii has permitted the generation of mutations in all of the genes on that genome (except where the lesions are lethal) and an analysis of transcripts that emanate from different genomic regions. The complete sequence has also enabled the production of a chloroplast genome microarray that can be used for analyzing the global accumulation of chloroplast transcripts under different environmental conditions.
Chloroplast genes and their expression. The C. reinhardtii chloroplast genome is 203.8 kbp (GenBank accession number BK000554) and contains 99 genes, including 5 rRNA genes, 17 ribosomal protein genes, 30 tRNAs specifying all of the amino acids, and 5 genes encoding the catalytic core of a eubacterial-type RNA polymerase (72). Figure 3 depicts the circular genome, its known genes, and the positions of those that have been disrupted. The genome contains a staggering number of small dispersed repeats (SDRs) that mostly populate intergenic regions.
Fig. 3. Chloroplast genome. The C. reinhardtii chloroplast genome and its genes are shown. Those that have been disrupted are highlighted.
The structure and gene content of the C. reinhardtii chloroplast chromosome are conventional, with a ribosomal DNA-containing inverted repeat separating two single copy regions. When compared to the chloroplast DNA (cpDNA) of land plants, the C. reinhardtii genome has a few noteworthy features: (i) an unusual gene, tscA, that encodes an RNA that is involved in trans-splicing of psaA transcriptional segments; (ii) a split rpoC1 gene; (iii) the presence of tufA, which encodes elongation factor Ef-Tu; (iv) two large open reading frames (ORFs) (1,995 and 2,971) of unknown but essential function; and (v) an absence of ndh genes, which encode polypeptides critical for chlororespiration, a process first reported in C. reinhardtii (6). The ndh genes are ubiquitous on land plant cpDNA.
Gene disruption is routine for C. reinhardtii chloroplast genes, and even the so-called essential genes can be functionally analyzed by weakening their translation initiation codons (71). The completion of the genome sequence does not offer many new gene candidates for functional analyses but does provide landmarks necessary for gene manipulation and the analysis of global plastid gene expression. Table 2 lists genes marked in Fig. 3 as having been disrupted; the total is an impressive 35 genes in which only 6 could not be brought to homoplasmicity.
Table 2. Genes disrupted on the chloroplast genome of C. reinhardtii
The analysis of the chloroplast genome enables researchers to define previously undiscovered genes and to measure expression of known genes. Sequence alone does not necessarily presage identification of a full genomic complement, and some genes (like tscA) may not encode proteins. To complicate matters, three of the four major photosynthetic complexes (photosystem I, photosystem II, and the cytochrome b6f complex) contain small chloroplast-encoded polypeptides with ORF sizes that would frequently arise by chance in the genome. For this reason, annotation of the ORFs was limited to those at least 100 residues long. Since small genes or non-protein-encoding genes should nonetheless be represented in the transcript pool, a comprehensive RNA filter blot analysis was undertaken, using RNA isolated from cells grown under a range of environmental conditions. As reported by Lilly et al. (68), the accumulation of chloroplast transcripts is strongly affected by culture conditions. Under the conditions in which most investigators grow their cells (rich medium and continuous light), chloroplast transcript accumulation is relatively high. This is consistent with the observations that substantial decreases in the cpRNA content do not, in the short term, visibly affect the synthesis of most chloroplast polypeptides (28). Under conditions of abiotic stress, changes in transcript accumulation range from subtle to as much as eightfold. Increases in the levels of some transcripts in response to phosphate deprivation appear to be mediated, at least in part, by polynucleotide phosphorylase (Y. Komine and D. Stern, unpublished results), a nuclear-encoded, chloroplast RNase whose activity is modulated by physiologically relevant phosphate concentrations (135).
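The effect of restricting ORF annotation to those at least 100 residues long can be illustrated with a minimal ORF scan. The sketch below is not the annotation pipeline used for the chloroplast genome; it assumes ATG start codons and the stop codons TAA, TAG, and TGA, reports only ORFs that end in a stop codon, and uses invented test sequences.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_strand(seq: str, min_residues: int):
    """Yield (start, end, n_residues) for ATG-to-stop ORFs in the three frames.

    Coordinates are 0-based on the strand given; end excludes the stop codon.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                n_residues = (i - start) // 3
                if n_residues >= min_residues:
                    yield start, i, n_residues
                start = None

def find_orfs(seq: str, min_residues: int = 100):
    """Scan both strands; minus-strand coordinates refer to the reverse complement."""
    seq = seq.upper()
    hits = [("+",) + orf for orf in orfs_in_strand(seq, min_residues)]
    hits += [("-",) + orf for orf in orfs_in_strand(reverse_complement(seq), min_residues)]
    return hits

# Invented sequences: a 31-residue ORF followed by a 121-residue ORF.
short_orf = "ATG" + "GCT" * 30 + "TAA"
long_orf = "ATG" + "GCT" * 120 + "TAA"
print(find_orfs(short_orf + long_orf))
```

With the 100-residue cutoff only the longer ORF is reported, which mirrors the point made above: genuinely small genes fall below such a threshold and have to be detected by other means, such as transcript analysis.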
SDRs. The SDRs that have colonized intergenic regions of the cpDNA (Fig. 4) present a fascinating evolutionary puzzle. Of sequenced cpDNAs within the chlorophytes, which include land plants as well as green algae, only Chlorella sp. appears to have numerous SDRs (72). Surprisingly, there is almost no sequence similarity between the SDRs of Chlorella and C. reinhardtii, suggesting that SDR amplification might share a common mechanism but be sequence independent. The relatively balanced distribution of SDRs in the C. reinhardtii chloroplast genome raises questions concerning both their origin and function. Did an ancient invasion of a transposable element subsequently lead to the dispersal of smaller fragments, or did a nuclear mutation somehow permit or foment accumulation of SDRs? It has been suggested (15) that short repeats may be associated with rearrangement of chloroplast genes or that they might function as binding sites for proteins that participate in gene expression. Interestingly, SDR-rich sequences upstream of petA exhibit a conformational (torsional) response to light, which is correlated with increased transcriptional activity (122).
Fig. 4. SDR sequences on the chloroplast chromosome. The first 100 kb of the chloroplast chromosome were analyzed for SDRs using a genome self-comparison with the program Pipmaker (bio.cse.psu.edu/cgi-bin/pipmaker?basic). The approximate locations of genes are shown on the top row only; the “Wendy” transposon and its disabled duplicate copy are shown on the top row at around position 75,000. The thin gray line represents one copy of the large inverted repeat. Each dot represents a repeat of the sequence along the top line; e.g., Wendy is duplicated, so a second line appears underneath it. The SDRs are represented by the large numbers of dots, whose sequence identity to the particular place on the genome ranges from 50 to 100% as shown in the scale on the right.
In summary, chloroplast genomics in C. reinhardtii has provided sophisticated tools for analyzing and manipulating cpDNA and has raised fascinating evolutionary questions. Recent years have seen accelerated cloning and analysis of nuclear genes encoding chloroplast regulatory factors (97, 99), which will stimulate studies on their interactions with chloroplast mRNAs and with one another (24, 137). With the sequencing of the C. reinhardtii nuclear genome, whole new families of putative regulators of chloroplast gene expression will emerge, presenting an opportunity to build an integrated image of genetic interactions between the nuclear and chloroplast genomes and how they are fine-tuned by critical features of the environment.
TOWARD AN INTEGRATED DATABASE
Use of available databases. One strength of C. reinhardtii as a model system lies in the extent to which it has been used for genetic and physiological characterization of biological processes. With the advent of C. reinhardtii genomics, we are poised to link phenotypes, alleles, and expression and sequence features into an integrated database.
The major goals of database construction are to (i) provide user-friendly points of access for the sequence data, (ii) connect genomic features to the classical biology of the organism, (iii) provide tools for viewing and querying genomic and gene expression data, and (iv) generate resources and tools for cross-species comparisons as data from related algal species become available.
Currently the genomic and organismal data are dispersed among three databases: (i) ChlamyDB, which contains information on genetic loci, mutant alleles, and sequenced genes, descriptions of strains, bibliographical citations, and community member information; (ii) ChlamyEST, which contains sequence data (EST, contigs, unigene, chloroplast, mitochondria) and gene annotations; and (iii) the JGI Chlamydomonas Genome Portal (see “Chlamydomonas Genome Portal” above), which contains the nuclear genome sequence, gene model predictions, and preliminary annotation data. All three databases are accessible through search engines, and both the Chlamydomonas Genome Project and the JGI Web sites include on-line Blast utilities, with additional specialized datasets available at ChlamyEST containing sequences from the Volvocales (including Chlamydomonas, Volvox, Eudorina, Pandorina, Dunaliella, and Haematococcus, among others) and BAC end sequences.
Integration of the databases. (i) Unification of ChlamyDB and ChlamyEST. The near-term challenge is to link all C. reinhardtii-related data sets in a seamless manner. To this end we will unify data maintained in ChlamyDB and ChlamyEST and establish links between this unified database and the JGI Chlamydomonas Genome Portal. The Chlamydomonas Genome Project is implementing a version of the Generic Model Organism Database (111) with the aim of integrating genetic, sequence, and bibliographic information. Figure 5 presents a schematic of the proposed unifications. At the core of this project is the underlying “chado” database schema, designed to integrate the Drosophila melanogaster data in FlyBase into distinct modular components with tightly defined dependencies (“Sequence,” which contains biological sequences and annotation; “Genetics,” which houses alleles and relationships between alleles and phenotypes; “Map,” which contains any type of localization excluding sequence localizations; “Expression,” which depicts transcriptional events and protein expression; “Companalysis,” an adjunct to the sequence module for in silico comparisons; “CV,” which applies the controlled vocabularies and ontologies; “Organism,” which handles species and taxonomy data; “Pub,” which contains bibliographic, publication, and reference data). As depicted in Fig. 5, data currently in ChlamyDB (loci, alleles, strains, phenotypes, species, bibliographic data, genetic data, and physical maps) will be incorporated into the genetics, organism, publication, and map modules. The sequence module will be populated by nuclear, chloroplast, and mitochondrial genomic sequences, EST sequences and their assembled contigs, complete cDNA sequences obtained from our expression libraries or from information in the literature, and DNA sequences that have been used to build microarrays.
In addition, the sequence module maintains relationships that link sequence records to annotation data derived from automated resources (GenBank, SwissProt, InterPro, GO, SO, etc.) and to more accurate, manually curated annotation. In the future, the expression module will accommodate global gene expression data derived from the analysis of microarrays. Researchers requesting microarrays from our facility will be asked to deposit a summary of their results in this module, in addition to making their data sets publicly accessible.
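The routing of data types into modules described above can be pictured as a simple lookup table. The following is a minimal sketch only and does not reproduce the actual chado schema: the names `MODULE_CONTENTS`, `module_for`, and `route` are hypothetical, and the placement of loci and strains in particular modules is an assumption made for illustration. Only the module names and their general contents are taken from the text.

```python
# Hypothetical routing table: module name -> kinds of data it would hold.
# Module names follow the text; the exact assignments are illustrative.
MODULE_CONTENTS = {
    "sequence": {"genomic sequence", "EST", "contig", "cDNA",
                 "microarray probe sequence"},
    "genetics": {"allele", "phenotype", "locus"},   # "locus" here is an assumption
    "map": {"physical map"},
    "organism": {"species", "strain"},              # "strain" here is an assumption
    "pub": {"bibliographic data"},
    "expression": {"microarray expression data"},
}


def module_for(data_type):
    """Return the name of the module that would hold this kind of record."""
    for module, kinds in MODULE_CONTENTS.items():
        if data_type in kinds:
            return module
    raise KeyError(f"no module registered for data type: {data_type!r}")


def route(records):
    """Group (data_type, payload) pairs by their destination module."""
    routed = {}
    for data_type, payload in records:
        routed.setdefault(module_for(data_type), []).append(payload)
    return routed
```

In such a design each record type has exactly one home module, which is what allows the dependencies between modules to stay tightly defined.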
Fig. 5. Integration of databases. Data for an integrated C. reinhardtii database are gathered from ChlamyDB, ChlamyEST, the Chlamy database at JGI, and a variety of outside sources before being integrated in the relational database chado and served to users on the Internet. Links connecting ChlamyDB and JGI will be established to provide robust data retrieval.
(ii) Interconnecting ChlamyDB and the JGI databases. To provide a genome that has robust annotation and to avoid unnecessary duplications, ChlamyDB and the JGI will establish interdatabase links, enabling users who enter one database to retrieve data maintained by the other (Fig. 5). For example, a query of the new ChlamyDB for a particular gene or gene product will return as complete a response as is available, including information from the JGI data set.
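The idea of a single query that draws on both databases can be sketched as follows. This is an illustration under stated assumptions, not a description of either site's interface: the two databases are stood in for by in-memory dictionaries, and the function `federated_query`, the gene identifier `geneA`, and all field names and values are hypothetical placeholders.

```python
def federated_query(gene_id, local_db, remote_db):
    """Merge what two sources hold for one gene, recording where each value came from."""
    result = {"gene_id": gene_id, "sources": [], "fields": {}}
    for source_name, db in (("ChlamyDB", local_db), ("JGI", remote_db)):
        record = db.get(gene_id)
        if record is None:
            continue  # this source has nothing on the gene
        result["sources"].append(source_name)
        for field, value in record.items():
            # Keep each value under its source so that differing entries stay visible.
            result["fields"].setdefault(field, {})[source_name] = value
    return result


# Placeholder data standing in for the two databases (not real records).
chlamydb = {"geneA": {"alleles": ["a-1", "a-2"], "description": "placeholder"}}
jgi = {"geneA": {"gene_model": "model_001", "description": "placeholder (automated)"}}

merged = federated_query("geneA", chlamydb, jgi)
```

Recording the source of every value, rather than overwriting one entry with the other, leaves the user able to distinguish curated genetic information from automated annotation.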
DOWN THE ROAD
Several important trends are emerging in C. reinhardtii research. Analysis of mutant phenotypes (forward genetics) will undoubtedly remain a central route for defining gene function. The availability of genomic sequence information will spur the development of insertional mutagenesis, and sequences of DNA flanking insertion sites will immediately identify putative genes responsible for specific phenotypes. Defined BAC clones will be used for rescuing mutant phenotypes, which will help establish gene function. In addition, researchers will begin to use genetic mapping of mutations on the nuclear genome to routinely clone genes; one primary goal of the C. reinhardtii genome initiative is to provide sets of mapping primers in a 96-well format to stimulate the use of this approach. However, as genome sequence and annotation become more precise, we expect that reverse genetics will emerge as the centerpiece of functional genomics in C. reinhardtii, as it is now for Arabidopsis. This approach will exploit RNA interference and antisense RNA technologies to suppress gene expression and use TILLING (targeting induced local lesions in genomes) (74, 75, 123) to identify allelic series for specific genes; the phenotypes associated with the different alleles will help elucidate the relationship between gene structure and function.
In the very near future, global expression analyses are likely to take a central position in C. reinhardtii genomics. As our knowledge of transcribed regions in the genome becomes secure, construction of a full-genome microarray will be possible, enabling the synthesis of a more complete picture of the control of gene expression. Integration of the expression data will generate a catalog that describes the activity of each gene and facilitates construction of “coregulation graphs,” providing clues to the physiological role of many genes of unknown function. Finally, microarray analyses applied to strains mutated for putative regulators will identify suites of genes subject to common control mechanisms.
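One simple way to build a coregulation graph of the kind mentioned above is to connect genes whose expression profiles are strongly correlated across conditions. The sketch below is illustrative only: the use of Pearson correlation, the threshold of 0.9, the function names, and the toy profiles are all assumptions, not a method prescribed by the text.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no covariation signal
    return cov / sqrt(var_x * var_y)


def coregulation_graph(profiles, threshold=0.9):
    """Return {gene: set of neighbors} linking gene pairs with r >= threshold."""
    graph = {gene: set() for gene in profiles}
    for g1, g2 in combinations(profiles, 2):
        if pearson(profiles[g1], profiles[g2]) >= threshold:
            graph[g1].add(g2)
            graph[g2].add(g1)
    return graph


# Toy profiles (arbitrary units over four conditions), for illustration only.
profiles = {
    "gene1": [1, 2, 3, 4],
    "gene2": [2, 4, 6, 8],   # rises with gene1
    "gene3": [4, 3, 2, 1],   # falls as gene1 rises
}
graph = coregulation_graph(profiles)
```

A gene of unknown function that clusters with well-characterized genes in such a graph becomes a candidate for involvement in the same process, which is the sense in which the graph provides clues to physiological role.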
While analysis of transcript behavior in dynamic environments will be one of the most rapid outcomes of whole genome information, many key cellular processes must be studied at the level of protein abundance and activity. The European Community is committed to building a program around C. reinhardtii proteomics. Initially, the focus will be to identify components localized to specific subcellular compartments, and in particular those that traffic to the chloroplast and mitochondrion. While no program currently available can accurately predict organellar targeting for C. reinhardtii, the results obtained by proteomic analyses should generate training sets that stimulate the development of robust predictor algorithms. Quantitative proteomics will also shape our understanding of environmental pressures that modulate levels and activities of specific proteins. Global analyses at both the protein and transcript levels, combined with computational and informatic approaches, will help predict functions of specific gene products in both metabolic and regulatory pathways and identify promoter sequences important for controlling suites of genes. Sequence information concerning promoter structure and function can be coupled with biochemical data (84, 90, 128) to determine, in a direct way, cis-acting sequences that modulate promoter activity. Antibodies to specific regulatory proteins identified in mutant screens can be used for chromatin immunoprecipitation (80, 88, 130), which would help establish specific protein-DNA interactions. Furthermore, two-hybrid (51, 53, 67, 124) and tandem-affinity purification (91) methodologies can be used to explore functional protein-protein interactions.
As with any organism, a strictly statistical analysis of genome sequence properties can be used to identify general and local properties of the genome, such as isochores, large and small duplications, consensus sequences for splice junctions, and codon bias and its relationship to the level of expression of a gene or its evolutionary history. However, because of the large underlying body of genetic, gene expression, and biochemical data, we can also predict breakthroughs in our ability to describe metabolic and regulatory pathways, and identify novel pathways as well as those that are absent or modified in specific organisms.
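As a small illustration of this kind of statistical analysis, codon usage can be tabulated directly from a coding sequence. The sketch below counts codons and reports the fraction of codons ending in G or C (GC3), one common summary of codon bias. The function names and the short test sequence are assumptions made for the example; the sequence is not taken from any real gene.

```python
from collections import Counter


def codon_counts(cds):
    """Count the codons in an in-frame coding sequence."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return Counter(cds[i:i + 3] for i in range(0, len(cds), 3))


def gc3_fraction(cds):
    """Fraction of codons whose third position is G or C."""
    counts = codon_counts(cds)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty coding sequence")
    gc3 = sum(n for codon, n in counts.items() if codon[2] in "GC")
    return gc3 / total


# Made-up four-codon sequence: ATG GCC GCC TAA
example = "atggccgcctaa"
counts = codon_counts(example)
fraction = gc3_fraction(example)
```

Comparing such per-gene summaries with expression levels is one way to examine the relationship between codon bias and expression noted above.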
How C. reinhardtii genomics is going to evolve in the next few years is a question for the whole community. Already, the developments described here have attracted new investigators to the organism and invigorated established investigators, offering them a new palette of tools that will undoubtedly create new landscapes in biological knowledge.
ACKNOWLEDGMENTS
We acknowledge both the National Science Foundation (for grants MCB-9975765, MCB-8819133, and DBI-9970022) and the Department of Energy for their forward-thinking support of this work.
We also thank the personnel at the Stanford Genome Center, especially Raquel Tamse, for their work in sequencing the cDNAs.
A.G.W. was the Principal Investigator. All other authors in the byline are listed alphabetically.
- Copyright © 2003 American Society for Microbiology