Eukaryotic Cell, October 2004, p. 1088-1100, Vol. 3, No. 5
1535-9778/04/$08.00+0 DOI: 10.1128/EC.3.5.1088-1100.2004
Copyright © 2004, American Society for Microbiology. All Rights Reserved.
Scott D. Drabenstot,2 Kent L. Buchanan,3 Hongshing Lai,1 Hua Zhu,1 David W. Dyer,2 Bruce A. Roe,1 and Juneann W. Murphy2*
Department of Microbiology and Immunology, University of Oklahoma Health Sciences Center, Oklahoma City,2 Department of Chemistry and Biochemistry, University of Oklahoma, Norman, Oklahoma,1 Department of Microbiology and Immunology, Tulane University Health Sciences Center, New Orleans, Louisiana3
Received 30 October 2003/ Accepted 12 July 2004
| ABSTRACT |
|---|
| INTRODUCTION |
|---|
Fungal gene expression, like that in higher eukaryotes, depends on accurate splicing via the coordinated efforts of a spliceosome. In metazoan systems, spliceosomes are composed of five small nuclear RNAs and over 60 proteins that function as splicing factors (40). The spliceosome coordinates with conserved cis elements in the intron to identify correctly the 5' and 3' splice sites (ss) (40). These cis elements consist of the 5' and 3' ss, found universally to be predominantly GU and AG, respectively (10, 44); the branch point A and surrounding motif (9, 48, 65); and polypyrimidine tracts (13). In the vertebrate model, early in spliceosome assembly, U1 snRNP binds to the 5' ss, and a heterodimeric protein in the U2 complex, U2AF, binds to the polypyrimidine tract near the 3' ss via one subunit, U2AF65, and to the 3' ss via the other subunit, U2AF35 (40). These splicing complexes facilitate correct excision of the intron sequences and joining of the exon sequences, events which are necessary to obtain an mRNA that can be translated into a functional protein (40). Intron and exon features differ between groups of eukaryotes, and these differences influence the mechanism of recognition of ss, excision of introns, and joining of adjacent exons (14, 39). Information content in the introns at the 5' and 3' ss and the branch site, information content in the exon regions adjacent to the introns, and the locations of the polypyrimidine tracts are parameters that vary between groups of eukaryotic organisms and have the potential to influence splicing mechanisms.
In this study, we analyzed intron size distributions, distances from the branch point to the 3' or 5' ss, the information content of the ss and the branch site, the information content of the exon regions adjacent to the introns, the distribution of polypyrimidine tracts within the introns, and homologs of selected proteins that have been associated with spliceosomes. We confirmed that fungal introns are typically short and exons are long relative to their mammalian counterparts. The information content needed for splicing is found in fungal introns. Yeast introns have a broader length distribution and a higher information content than introns of the two filamentous fungi and the C. neoformans strain that we analyzed. Since fungal introns are short and have polypyrimidine tracts primarily in the region between the 5' ss and the branch point, we suspect that the splicing mechanisms of fungi differ from the generally accepted splicing mechanisms described for metazoans. Homologs of U2AF proteins that have been associated with spliceosomes were found in all of the studied fungi except for S. cerevisiae. In addition, homologs of spliceosomal proteins such as Nam8p, a yeast U1 snRNP, and TIA-1, a splicing regulator in metazoans that is associated with splicing of introns with polypyrimidine tracts upstream of the branch site, were found in the fungi. Together, our findings suggest that further studies of fungal splicing mechanisms focusing on novel or nonclassical mechanisms are needed, since the available evidence indicates significant differences between fungi and metazoans.
| MATERIALS AND METHODS |
|---|
From the S. cerevisiae complete genomic sequence (20), an annotated intron sequence database was constructed at the Ares Laboratory (23); we downloaded the database from http://www.cse.ucsc.edu/research/compbio/yeast_introns.html. For the calculation of information in the intron and exon regions, the complete genomic sequence was downloaded from GenBank (http://www.ncbi.nlm.nih.gov/), and the predicted coding sequences were downloaded from the Saccharomyces genome database (http://www.yeastgenome.org/). Introns and exons for Histoplasma capsulatum and Coccidioides immitis were obtained by downloading the preformatted GenBank 132 exon-intron database (http://mcb.harvard.edu/gilbert/eid/) (55); purging was done with the exon-intron database scripts filter_exp_keyw1.pl and filter_exp_keyw2.pl to remove genes that were identified only in silico. The intron and exon selection and validation methods were described previously (15).
Definitions applied. The International Union of Pure and Applied Chemistry standard abbreviations for nucleotides were used throughout this work (28). Briefly, they are as follows: A, adenine; C, cytosine; G, guanine; T, thymine; U, uridine; Y, thymine, uridine, or cytosine; R, adenine or guanine; W, adenine, thymine, or uridine; M, adenine or cytosine; S, guanine or cytosine; K, guanine, thymine, or uridine; and N, adenine, guanine, cytosine, thymine, or uridine. To be called a consensus nucleotide, the nucleotide frequency must exceed 40% at the given position. For a degenerate call, the second nucleotide must occur more than 30% of the time, and its frequency must be equal to or greater than two times the frequency of the third most frequently found nucleotide at that position (57).
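The consensus-calling rule above can be illustrated with a minimal Python sketch. It is not the procedure of reference 57 itself: the function names are hypothetical, and it assumes one particular reading of the rule, namely that a degenerate two-base call is considered only when the most frequent nucleotide already exceeds 40%, and that positions failing the 40% threshold are reported as N.

```python
from collections import Counter

# IUPAC two-base codes in the RNA alphabet, as listed in the text.
TWO_BASE_CODES = {
    frozenset("AG"): "R", frozenset("CU"): "Y", frozenset("AU"): "W",
    frozenset("AC"): "M", frozenset("CG"): "S", frozenset("GU"): "K",
}

def call_position(column):
    """Return a consensus symbol for one alignment column (a string of bases)."""
    bases = [b for b in column.upper().replace("T", "U") if b in "ACGU"]
    if not bases:
        return "N"
    counts = Counter(bases)
    # Rank the four nucleotides from most to least frequent.
    ranked = sorted(((counts[b] / len(bases), b) for b in "ACGU"), reverse=True)
    (f1, b1), (f2, b2), (f3, _b3), _rest = ranked
    if f1 <= 0.40:
        return "N"  # assumption: no nucleotide exceeds 40%, so no call is made
    if f2 > 0.30 and f2 >= 2 * f3:
        return TWO_BASE_CODES[frozenset(b1 + b2)]  # degenerate call
    return b1

def consensus(sequences):
    """Call every column of a set of equal-length aligned sequences."""
    return "".join(call_position("".join(col)) for col in zip(*sequences))
```

For example, `consensus(["GUAAGU", "GUGAGU", "GUAUGU", "GUGAGU"])` calls the third position as R because A and G each occur in half of the sequences.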
Preparation of the cDNA library. C. neoformans strain B3501 was kindly provided by J. Kwon-Chung (National Institutes of Health, Bethesda, Md.). C. neoformans yeast cells were cultured in yeast extract-peptone-dextrose broth at 30°C with shaking for 16 h. Following incubation and washing of the yeast cells, RNA was isolated by using a Mini-BeadBeater 8 apparatus (Biospec Products, Bartlesville, Okla.) in combination with 0.5-mm Zr/Si beads and an RNeasy kit (Qiagen, Santa Clarita, Calif.) according to the manufacturer's directions. The integrity of the total RNA was confirmed by formaldehyde-agarose gel electrophoresis. Poly(A)+ RNA was purified from total RNA by using a PolyATtract kit (Promega, Madison, Wis.). The cDNA library was constructed by synthesizing cDNA from poly(A)+ RNA by using a cDNA synthesis kit (Stratagene, La Jolla, Calif.) according to the manufacturer's instructions, except that priming for first-strand synthesis was done with a cocktail of three 1-base-anchored poly(dT)-containing primers. The three primers consisted of a protected XhoI restriction site followed by a 5-nucleotide (nt) tag sequence (GACAC), an 18-base poly(dT) sequence, and an A, a C, or a G. Thus, the three primers differed only in the final base (A, C, or G). Following second-strand synthesis, the cDNA ends were made blunt, and EcoRI linkers were ligated. The cDNAs were digested with XhoI, size selected (400 bp and greater), directionally ligated into predigested Uni-ZAP XR (Stratagene), and packaged with Gigapack III Gold packaging extracts. The titer of the resulting cDNA primary library was determined. Samples of the primary library were subjected to mass excision by using ExAssist helper phage (Stratagene), and individual clones were picked from the resulting primary library for sequencing.
Procedures for sequencing of the cryptococcal cDNA library. A C. neoformans double-stranded DNA template was isolated from the selected clones in a 96-well sample format by using a cleared-lysis method (72). For sequencing, approximately 0.2 µg of DNA was used with 20 pM universal M13 forward primer (5'-TGTAAAACGACGGCCAGT-3') or T3 primer (5'-CGAAATTAACCCTCACTAAAG-3') and 2 µl of ABI BigDye terminator mixture (PE-ABI 4303150) diluted 1:3 with 5x TM buffer (400 mM Tris-HCl [pH 9.0], 10 mM MgCl2). Thermocycling was done for 1 cycle of 95°C for 30 s, 60 cycles of 95°C for 10 s, 50°C for 5 s, and 60°C for 4 min, and holding at 4°C. DNA in the reactions was ethanol precipitated. Electrophoresis was performed by using ABI 3700 sequencers with a POP5 polymer for 2 h 50 min at an EP voltage of 6.5 kV and an EP current of 550 mA. Electrophoresis sequencing data were transferred to networked Sun workstations.
C. neoformans EST database preparation and assembly. The EST database was constructed as follows. The sequences obtained with the universal forward primer were designated 3' ESTs and given a .f1 suffix, while the sequences generated with the T3 primer were designated 5' ESTs and given a .r1 suffix. A piped set of scripts was used in a semiautomatic process to screen each sequence for overall base quality by using Phred (P. Green, http://www.phrap.org/) and to remove vector, mitochondrial, ribosomal, and Escherichia coli contaminating sequences. The sequences passing the screen were termed high-quality ESTs and were subjected to a BLASTX search of the nonredundant protein (nr) database in GenBank. A FASTA sequence file of the ESTs and corresponding cloning and sequence data has been placed at http://www.genome.ou.edu/.
The 3' ESTs were assembled separately by using Phrap (P. Green, http://www.phrap.org/) with a minmatch of 14 and a minscore of 80 in a cumulative fashion in order to monitor the level of clone redundancy. Both the 3' and the 5' ESTs were assembled by using Phrap as described above. The EST contigs were examined for chimeric sequences in a cursory fashion by examining any ESTs that did not align in the expected pattern of 5' EST and reverse complement 3' EST. Misaligned sequences were removed, and the entire database was reassembled. All members of the assembled EST database were examined for homology to the GenBank nr database by batch analysis. The assembled C. neoformans EST database and BLAST results were placed at http://www.genome.ou.edu/ in separate directories. The C. neoformans B3501 EST database contained 1,965 contigs and 1,168 singlets.
Intron and exon database construction. The intron and exon databases for all organisms except S. cerevisiae were created by using FELINES (15). Briefly, the genomic sequence databases were formatted by using formatdb (2). Each FASTA-formatted genomic sequence and each FASTA-formatted EST were placed into separate files by using all2many.pl (J. D. White and B. A. Roe, unpublished data; http://www.genome.ou.edu/informatics.html). Next, two list files were created for each organism, one containing the names of all of the EST files created above and one containing all of the genomic sequence files created above. A FELINES option file was customized for each organism and contained the following parameters, which were constant between organisms: BLAST e-value, 0.001; minimum HSP value, 50; splice site scoring matrix, v; minimum intronless exon length, 300; intron length range, 20 to 2,000; acceptable splicing classes, GUAG, GCAG, AUAC, AUAG, and AUAA; minimum exon number, 1; mRNA and genomic identity minimum percentage, 90; minimum mRNA coverage percentage, 80; maximum number of mRNA gaps, 10; and maximum number of mRNA mismatches, 200. The intron and exon sequence databases then were constructed by running wiscrs.pl to create the Spidey alignment files and gumbie.pl in the default-filtered mode to extract the intron and exon sequences into their respective databases (15).
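Two of the FELINES parameters listed above, the intron length range and the acceptable splicing classes, amount to a simple per-intron filter. The following minimal Python sketch illustrates that filter only. It is not the FELINES code, the function names are hypothetical, and treating the length bounds as inclusive is an assumption.

```python
# Parameter values taken from the option file described in the text.
ACCEPTED_CLASSES = {"GUAG", "GCAG", "AUAC", "AUAG", "AUAA"}
MIN_INTRON_LEN = 20
MAX_INTRON_LEN = 2000

def splice_class(intron):
    """Return the first two plus last two nucleotides, in the RNA alphabet."""
    rna = intron.upper().replace("T", "U")
    return rna[:2] + rna[-2:]

def passes_filter(intron):
    """True if the intron length and splice-site dinucleotides are acceptable."""
    length_ok = MIN_INTRON_LEN <= len(intron) <= MAX_INTRON_LEN  # inclusive bounds assumed
    return length_ok and splice_class(intron) in ACCEPTED_CLASSES
```

A candidate such as `"GT" + "A" * 30 + "AG"` passes, whereas a 14-nt candidate or one beginning with CU is rejected.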
Branch sites. Branch sites were identified by using the icat.pl program (15). The sequence CURAY was chosen as the primary motif, UURAY was chosen as the secondary motif, and a modified YURAY motif in which either the first, third, or fifth position was allowed to be any nucleotide was chosen as the alternate motif. The icat.pl program searched for each regular expression of the motifs and then chose the 3'-most instance of the motif in each intron sequence based on branch site motifs previously described for metazoans and S. cerevisiae (6, 9, 23).
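The motif search described above can be sketched with regular expressions. This is an illustrative Python reconstruction, not the icat.pl source: the function name is hypothetical, and the strict fallback order (primary, then secondary, then alternate, taking the 3'-most hit of the first motif class that matches) is an assumption about how the three motifs are prioritized.

```python
import re

# R = [AG], Y = [CU], N = [ACGU]; the branch point is the A at motif offset 3.
MOTIFS = [
    ("primary", "CU[AG]A[CU]"),    # CURAY
    ("secondary", "UU[AG]A[CU]"),  # UURAY
    # YURAY with position 1, 3, or 5 relaxed to any nucleotide
    ("alternate", "[ACGU]U[AG]A[CU]|[CU]U[ACGU]A[CU]|[CU]U[AG]A[ACGU]"),
]

def find_branch_site(intron):
    """Return (motif_class, motif_start, branch_point_index) or None.

    Indices are 0-based from the 5' end of the intron.
    """
    rna = intron.upper().replace("T", "U")
    for label, pattern in MOTIFS:
        # A lookahead lets overlapping occurrences be found as well.
        starts = [m.start() for m in re.finditer("(?=(?:%s))" % pattern, rna)]
        if starts:
            start = starts[-1]  # 3'-most instance
            return label, start, start + 3
    return None
```

The distance from the branch point to the 3' end of the intron then follows as `len(intron) - 1 - branch_point_index`.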
Polypyrimidine tracts. Polypyrimidine tracts were defined as at least six consecutive nonadenine nucleotides containing no fewer than three uridines (13, 54, 59, 60) and were identified by using FELINES perl scripts, cattracts.pl, and icat.pl (15).
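The tract definition above can be expressed compactly. The sketch below is a hedged Python illustration rather than the cattracts.pl implementation: the function name is hypothetical, and it assumes that a tract is a maximal adenine-free run of at least six nucleotides that contains three or more uridines anywhere in the run.

```python
import re

def find_polypyrimidine_tracts(seq, min_len=6, min_u=3):
    """Return (start, end) pairs, 0-based and end-exclusive, for each tract."""
    rna = seq.upper().replace("T", "U")
    tracts = []
    # Maximal runs of non-adenine nucleotides of at least min_len.
    for match in re.finditer(r"[CGU]{%d,}" % min_len, rna):
        if match.group().count("U") >= min_u:
            tracts.append((match.start(), match.end()))
    return tracts
```

Under this reading, `find_polypyrimidine_tracts("GUAAGUUUCUCUAACAG")` reports a single tract spanning positions 4 through 11.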
Information content determinations. The information content of the intron and exon regions was determined by using the CONSENSUS utility (27). Briefly, FASTA sequences containing 20 nt of the 5' end of each intron, 13 nt flanking the branch point A (9 nt to the 5' side of the branch point A and 3 nt to the 3' side of the branch point A) of each intron, 20 nt of the 3' end of each intron, and 5 nt in the exon region adjacent to each intron were created. These sequences were formatted for use in the CONSENSUS utility by using FASTA consensus version 2c. A frequency matrix for the sequences was created by using make-matrix version 2.4. Information content then was calculated by using gmat-inf-gc version 2c. Finally, the P values for comparisons of individual alignment regions with random sequences were calculated by using P value version 3a. The background nucleotide distribution was assumed to be completely random for all of the organisms studied.
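For orientation, the quantity being measured can be sketched as the relative entropy of each alignment column against a uniform background, summed over the window. The Python sketch below is an assumption-laden illustration, not the CONSENSUS code: the function names are hypothetical, it uses log base 2 to report bits, it fixes the background frequency at 0.25 per nucleotide (matching the random background stated above), and it omits any small-sample correction and the P value calculation.

```python
import math
from collections import Counter

def position_bits(seqs, background=0.25):
    """Per-column information (bits) for equal-length aligned sequences."""
    bits = []
    for column in zip(*seqs):
        counts = Counter(column)
        n = len(column)
        # Sum f * log2(f / p) over the nucleotides observed in this column.
        bits.append(sum((c / n) * math.log2((c / n) / background)
                        for c in counts.values()))
    return bits

def information_content(seqs, background=0.25):
    """Total information (bits) across all columns of the alignment."""
    return sum(position_bits(seqs, background))
```

A fully conserved position contributes 2 bits, and a position at which all four nucleotides are equally frequent contributes 0 bits.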
Homologs of spliceosomal proteins. To identify fungal homologs of the human branch point binding protein (BBP), S. pombe U2AF65 and U2AF35 subunits, S. cerevisiae Nam8p, and Homo sapiens TIA-1 and TIAR, the individual protein homolog sequences were compared to the genomic sequence and EST databases of S. pombe, A. nidulans, N. crassa, and C. neoformans B3501 by using TBLASTN. All of the sequences used were from GenBank, and the accession numbers were as follows: U2AF65 and U2AF35, homolog accession numbers CAB46760 and Q09176, respectively; BBP, accession number AF073779 1; Nam8p, accession number NP_0011954; S. pombe CSX1, accession number NP_594243; H. sapiens TIA-1, accession number NP_071505; and H. sapiens TIAR, accession number NP_003243. Translation of the DNA sequences was done by using FGENESH at the Softberry website (http://www.softberry.com/) with either the S. pombe or the N. crassa matrix provided or by using the TBLASTN output when the FGENESH matrix was inadequate. Domain searches were performed by using the National Center for Biotechnology Information conserved domain database (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml) (41) and the protein family website (http://www.sanger.ac.uk/Software/Pfam/index.shtml) (3), which uses a hidden Markov model-based similarity search.
Multiple alignments of spliceosomal proteins. Multiple alignments were created by using xced (version 3.93), an X-windows-based multiple-alignment program (http://www.biophys.kyoto-u.ac.jp/~katoh/programs/align/xced/), to align the sequences and ClustalX (ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/, version 1.83), a local stand-alone windowed version of ClustalW, to produce the multiple alignments (29). The scoring matrix used was JTT200PAM, with an AO of +0.06, an OGC of 2.4, and an EGC of 0.0. Percent identity was calculated by using MacVector with ClustalW pairwise alignment in the slow alignment mode with the default parameters.
Phylograms. Phylograms were prepared by using a website-based implementation of ClustalW (http://clustalw.genome.jp/). The pairwise parameters used were as follows: fast/approximate, K-tuple size, 1; window size, 5; gap penalty, 3; number of top diagonals, 5; and scoring method, percentage. The multiple-alignment parameters used were as follows: gap open penalty, +10.0; gap extension penalty, 0.05; weight transition, no; hydrophobic residues for proteins, GPSNDQERK; hydrophilic gaps, yes; and weight matrix, BLOSUM.
Statistics. Intron lengths were compared by using the Kruskal-Wallis test with Dunn's multiple-comparison posttest. Correlation coefficients were calculated by using the Correl function in the OpenOffice.org spreadsheet (http://www.openoffice.org/).
Nucleotide sequence accession numbers. The ESTs determined here were submitted to dbEST under accession numbers CF182795 through CF194965.
| RESULTS |
|---|
Characteristics of the genomes for the five organisms and the derived intron and exon data sets used in this study are shown in Table 1. The genome size for this group of fungi ranged from 12 to 43 Mb (Table 1). The estimated mean number of introns per gene varied considerably among the five fungi, the lowest being found in S. cerevisiae (0.04) and the highest being found in C. neoformans (2.42) (Table 1). The total number of introns in the data set for each organism ranged from 253 (approximately 100% of the S. cerevisiae introns) to 5,725 (approximately 36.5% of the C. neoformans introns). The intron data sets for S. pombe (1,280 introns), A. nidulans (2,115 introns), and N. crassa (1,897 introns) were sizable but represented only 27.1, 17.3, and 10.8%, respectively, of the estimated total introns (Table 1). With the exception of the S. cerevisiae data set, intron and exon data sets were based on alignments of ESTs with genomic sequences for each respective organism, allowing accurate intron-exon boundary predictions.
The utility that we used for generating our intron data sets, FELINES (15), filtered out all identified sequences that did not conform to the 5'GU...AG3', 5'GC...AG3', or 5'AU...AC3' dinucleotide ss pairs. The percentages of sequences not meeting these criteria were 17% for S. pombe, 11% for A. nidulans, 19% for N. crassa, and 7% for C. neoformans. The majority (98 to 99.9%) of the remaining introns from all five fungi in this study had the canonical 5'GU...AG3' donor-acceptor ss pairs. The percentages of the 5'GC...AG3' class for S. cerevisiae, A. nidulans, N. crassa, and C. neoformans were 1.19, 1.15, 0.86, and 1.98%, respectively. These percentages are higher than that found for the mammalian 5'GC...AG3' class (0.56%) based on a GenBank representation (11). S. pombe 5'GC...AG3' introns, at 0.08%, were the exception. In the human genome, 5'AU...AC3' introns represent 0.04% of the total (11). We found that 0.09% or 2 of the A. nidulans introns were in the 5'AU...AC3' ss class. In both cases, two or more ESTs showing excellent alignment with the genomic sequence defined the A. nidulans 5'AU...AC3' introns. The remaining intron data sets had no 5'AU...AC3' introns.
5' ss (donor sequence). The consensus sequence for the region of the 5' ss for each of the fungal organisms in this study is shown as a component of the structure logos in Fig. 2. The fungal consensus sequence derived from the five organisms, based on 11,083 introns and including the 5'GU...AG3', 5'GC...AG3', and 5'AU...AC3' classes, was found to be NG|GURWGU (where the vertical bar indicates the splice junction and the GU immediately following it represents the first two bases of the intron), which was more degenerate than the metazoan 5' ss consensus sequence, CAG|GUAAGU (9, 35). The fungal +4 base was an A or a U, a variation from the metazoan consensus sequence. When the two yeasts were excluded, the two filamentous members of the Ascomycota and the member of the Basidiomycota had a 5' ss consensus sequence (NG|GURAGU) that was closer to the metazoan consensus sequence. For the yeast 5' ss, a separate and longer consensus sequence (NG|GUAWGUW) could be constructed. U12-type introns have a 5' ss motif (RUAUCCUUU) that differs from that of U2-type introns (9, 58), but this motif was not present in our data sets.
Branch site consensus sequence. The branch site is a key intron consensus sequence element required for lariat formation during the splicing process (50). The metazoan and S. cerevisiae branch site consensus sequences have been determined to be YNCURAY and UACUAAC, respectively, where the A immediately preceding the 3'-terminal nucleotide of each motif is the branch point (6). The reported branch site consensus sequence for S. pombe, CURAY, is more degenerate than that for S. cerevisiae (71).
More than 98% of the introns in our data sets could be shown to have potential branch sites by using the FELINES icat.pl script (15). As a training and validation step, the icat.pl script identified branch sites in 100% of the 253 S. cerevisiae introns (15) from the Ares Laboratory yeast intron database (YIDB) (23). The consensus sequences for the putative branch sites for each organism are shown in Fig. 2. From these data, we could derive a general fungal branch site consensus sequence of RCURAY, where the A preceding the final Y is the branch point. An earlier survey of a small number of annotated introns of H. capsulatum and C. immitis, two dimorphic pathogenic fungi, also showed that their branch site consensus sequences were in agreement with the general fungal branch site sequence defined above (data not shown).
U12-type introns have a unique branch site consensus sequence, UCCUURAC (9, 67), which was not found in the yeast data sets but which was present in 0.08% of A. nidulans introns, 0.22% of N. crassa introns, and 0.11% of C. neoformans introns.
The branch point A could be localized to a position between 13 and 36 nt from the 3' end of the intron and between 52 and 220 nt from the 5' ss, depending on the organism (Fig. 2). There was a very high correlation between the total number of nucleotides in an intron and the number of nucleotides in the region from the 5' ss to the branch point A. Thus, variability in intron length seems to be a function of the distance between the 5' ss and the branch point (Table 2).
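The relationship described above can be checked with a Pearson correlation coefficient, which is the statistic that a spreadsheet Correl function returns. The Python sketch below is purely illustrative: the function name is hypothetical, and the intron lengths used in the usage note are invented values, not data from this study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either sequence is constant.
    return cov / math.sqrt(var_x * var_y)
```

If the branch point lies a nearly constant distance from the 3' end, the distance from the 5' ss to the branch point is approximately the intron length minus that constant, so a call such as `pearson_r([60, 75, 120, 300], [40, 55, 100, 280])` returns a value very close to 1.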
Having found that the introns of S. cerevisiae have a broader size distribution than those of the other fungi (Fig. 1), we also wondered whether the information content in the fungal introns and exons changes as the length of the introns increases. To address this question, we analyzed the sequence conservation (information content in bits) in the introns in 20 nt at the 5' or 3' ss and in 13 nt at the branch site and in the exons in 5 nt adjacent to the 5' and 3' ss of the five fungi in this study. The introns were divided into six bins with increasing length ranges, and P values were calculated as described by Hertz and Stormo (27). From comparisons of bits of information in a random sequence to bits of information in conserved regions, we found that the information content for each intron length range for each organism was significantly higher than that in a random sequence for the number of nucleotides represented (P < 0.001) (Fig. 3). S. cerevisiae and S. pombe had the highest information content in each conserved region of the introns, irrespective of the intron length range, relative to the other three fungi in this study (Fig. 3). The information content in the exon regions was higher than that in a random sequence for exons adjacent to introns that were less than 240 nt long for each organism (P < 0.01). The numbers and percentages of introns in each length range are shown in Table 3.
The information content in the exon regions adjacent to each intron was determined for each bin of introns for each organism and was found to be minimal or absent in some cases (Fig. 2; see also Fig. S1 in the supplemental material). The 5 nt in the exon bordering the 5' end of the intron had a significant amount of information compared to that in a random sequence (P < 0.01), except for exons adjacent to S. cerevisiae 400- to 2,000-nt introns and exons adjacent to A. nidulans and C. neoformans introns longer than 240 nt (Fig. 2; see also Fig. S1A in the supplemental material). In the 5 nt in the exon adjacent to the 3' ss, the highest information content was observed in exons adjacent to S. cerevisiae introns shorter than 320 nt; however, for introns 320 nt long or longer, the highest information content was observed in exons adjacent to S. pombe introns (see Fig. S1B in the supplemental material). The information content for A. nidulans, N. crassa, and C. neoformans exons adjacent to the 3' ss was lower than that for S. cerevisiae and S. pombe exons but above that in a random sequence for exons adjacent to introns shorter than 240 nt. For this latter group of fungi, only N. crassa exons adjacent to the 3' ss of introns longer than 240 nt had information content significantly higher than that in a random sequence (P < 0.02) (see Fig. S1B in the supplemental material).
Polypyrimidine tracts. Polypyrimidine tracts in mammalian introns are conserved elements typically found near the 3' ss, and they function as a binding site for spliceosomal protein U2AF65 (49). We screened for polypyrimidine tracts in the introns in our data sets by using a minimal definition of six consecutive nucleotides with at least 3 U's and no A's (13, 15, 59). We defined two intron regions, the region from the 5' ss to the branch point and the region from the branch point to the 3' ss, screening both for the presence of polypyrimidine tracts. From our results, introns could be placed into four classes based on the locations of the polypyrimidine tracts with reference to the branch point. Table 4 shows the percentages of polypyrimidine tracts in each of the four classes. Figure 4 shows the distribution of the distances (in nucleotides) from the 5' ss or the 3' ss to the branch point for polypyrimidine tracts for each organism. The majority of introns in all five organisms (83.2 to 93.7%) had polypyrimidine tracts in the region from the 5' ss to the branch point (Table 4). Surprisingly, 48 to 62% of the introns in S. pombe, A. nidulans, N. crassa, and C. neoformans had polypyrimidine tracts only in the region from the 5' ss to the branch point. Additionally, most of the polypyrimidine tracts in this region were located close to the 5' ss (Fig. 4, left panels). In the region from the 3' ss to the branch point, where one might expect to find polypyrimidine tracts, based on metazoan introns, we found polypyrimidine tracts in only 27.6 to 68.8% of the introns (Table 4 and Fig. 4, right panels).
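The four-way grouping described above can be sketched as follows. This Python illustration assumes that the four classes are "tracts only 5' of the branch point," "tracts only 3' of the branch point," "tracts in both regions," and "no tracts"; the class labels and function names are hypothetical, and the tract coordinates are taken to be 0-based, end-exclusive intervals such as those produced by any tract finder. Because a tract contains no adenines, it cannot span the branch point A.

```python
from collections import Counter

def classify_intron(tracts, branch_point):
    """Assign one intron to a class from its tract intervals and branch point index."""
    upstream = any(end <= branch_point for _start, end in tracts)
    downstream = any(start > branch_point for start, _end in tracts)
    if upstream and downstream:
        return "both regions"
    if upstream:
        return "5' ss to branch point only"
    if downstream:
        return "branch point to 3' ss only"
    return "no tract"

def class_percentages(introns):
    """introns: iterable of (tracts, branch_point); returns percent per class."""
    tally = Counter(classify_intron(t, bp) for t, bp in introns)
    total = sum(tally.values())
    return {label: 100.0 * n / total for label, n in tally.items()}
```

For example, an intron with a single tract at positions 4 through 11 and a branch point at index 18 falls into the "5' ss to branch point only" class.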
(i) BBP. BBP would be expected to be found in all organisms that form spliceosomes, and candidate BBP homologs have been identified for S. pombe and N. crassa as well as for humans and S. cerevisiae. Analysis of the multiple-sequence alignment of BBPs allowed identification of the expected homologs of BBP in A. nidulans and C. neoformans (see Fig. S2A in the supplemental material). All of the BBP homologs had the MSL5 domain characteristic of BBPs (63) and an adjacent AIR domain, a ring Zn finger motif associated with posttranslational modification (http://www.ncbi.nlm.nih.gov/COG/). A phylogram of the fungal BBP homologs (see Fig. S2B in the supplemental material) shows clustering of S. pombe, A. nidulans, and N. crassa BBPs, which have between 23 and 33% identity with C. neoformans BBP, between 15 and 30% identity with S. cerevisiae BBP, and between 20 and 31% identity with human BBP, whereas C. neoformans BBP has 25% identity with human BBP and 27% identity with S. cerevisiae BBP.
(ii) U2AF. U2AF is a heterodimeric protein consisting of a U2AF65 subunit and a U2AF35 subunit. It binds to the polypyrimidine tract at the 3' end of the intron, associates with the 3' acceptor site early in the establishment of the metazoan spliceosome, and is essential for correct splicing (69). Homologs for human U2AF65 and U2AF35 subunits have been identified in S. pombe (66). S. cerevisiae has Mud2p, a functional equivalent of U2AF65, but no U2AF35 homolog (1).
To search for fungal U2AF homologs, sequences of both the human and the S. pombe U2AF large and small subunits were used to screen A. nidulans, N. crassa, and C. neoformans genomes and their EST databases. Homologs were found in all of the fungal data sets examined.
The human U2AF65 subunit contains three RNA recognition motifs (RRM) and a serine- and arginine-rich region (43). The C. neoformans, N. crassa, and A. nidulans U2AF65 subunit homologs also contain three RRM; however, the S. pombe U2AF65 subunit homolog has only two RRM, and Mud2p, the U2AF65 functional homolog in S. cerevisiae, has only one RRM (56) (see Fig. S3 in the supplemental material).
U2AF35 subunit homologs also were observed in all fungi except for S. cerevisiae. The U2AF35 subunit homologs in all of the fungi that had them were similar; all of them contained one RRM, a KOG2202 domain, and two Zn finger domains (see Fig. S4 in the supplemental material).
A phylogram for the U2AF65 splicing factor homologs and Mud2p of S. cerevisiae shows, as would be expected, that Mud2p falls outside the cluster containing the other fungal proteins, having only 14 to 17% identity with the other fungal U2AF65 homologs (Fig. 5A). The filamentous Ascomycota (A. nidulans and N. crassa) U2AF65 homologs have between 27 and 28% identity with the C. neoformans U2AF65 homolog and between 29 and 30% identity with the S. pombe U2AF65 homolog (Fig. 5A). The S. pombe, A. nidulans, and N. crassa U2AF65 homologs as well as the C. neoformans U2AF65 homolog have between 25 and 29% identity with the human U2AF65 protein, indicating the U2AF65 homologs of these fungi are more similar to the human U2AF65 protein than to Mud2p, the functional equivalent in S. cerevisiae.
(iii) Nam8p, TIA-1, and TIAR homologs. Having found high percentages of fungal introns with polypyrimidine tracts only in the region from the 5' ss to the branch point, we searched for homologs of another family of RNA binding proteins that have been implicated in the splicing of introns that have U-rich regions upstream of the branch point. Nam8p, TIA-1, and TIAR are proteins that share homologies in RRM (22, 25, 31, 47, 64), are involved in stabilizing the functional association of U1 snRNP with the 5' ss, and have activities dependent on polypyrimidine regions downstream of the 5' ss (17, 22, 25, 47). By examining the fungal genomes for homologs of the three proteins, we found Nam8p homologs in the S. pombe, A. nidulans, and N. crassa genome data sets and TIA-1 homologs in the A. nidulans, N. crassa, and C. neoformans genome data sets (see Fig. S5 in the supplemental material). All of the homologs have the characteristic KOG0148 domain found in TIA-1 and TIAR, which includes three RRM (see Fig. S5 in the supplemental material). Consequently, these homologs are candidates for factors that could bind to 5' polypyrimidine tracts or that could associate with other proteins in the commitment complex. The phylogram in Fig. 6 shows that the filamentous fungi have both a Nam8p homolog with 25 to 27% identity and a TIA-1 homolog with 14 to 25% identity, whereas S. pombe has only a Nam8p homolog (19% identity) and C. neoformans has only a TIA-1 homolog (29% identity). Except for the S. pombe CSX1 protein (51), there are limited biological data on the functions of this group of fungal RNA binding protein homologs.
| DISCUSSION |
|---|
On the other hand, we detected features of fungal introns that set them apart from metazoan introns. Our large data set confirmed what had been surmised from the limited available data, namely, that fungal introns are characteristically short, with mean intron lengths ranging from 69 nt for C. neoformans to 256 nt for the model yeast S. cerevisiae (7). This length difference is accounted for by the distance from the 5' ss to the branch point. We also found that the information content in fungal introns at the 5' and 3' ss and the branch site is substantial compared to the information content of the exon regions adjacent to the introns. These findings, in conjunction with the fact that fungal introns are short, suggest that splicing in fungi fits the intron definition model (39).
Of the five fungi surveyed, S. cerevisiae introns have the broadest length distribution pattern, followed by S. pombe introns. Furthermore, introns of these two yeasts have the highest intron information content of the five fungi studied. N. crassa, A. nidulans, and C. neoformans introns fall into a narrow length range, with peak numbers of introns within the size range of 50 to 70 nt. The latter three fungi have less information in their introns than is found in the introns of the two yeasts. Lim and Burge reported that for the short introns of S. cerevisiae, the nematode C. elegans, the fruit fly D. melanogaster, the dicot plant Arabidopsis thaliana, and the primate H. sapiens, the intron length distribution peaks occurring at higher numbers of nucleotides indicate that there must be increasing bits of information in the introns for accurate identification of the introns (39). These observations suggested to us that as the mean lengths of the short introns surveyed by Lim and Burge (39) increased, so did the information content in the introns. In keeping with this idea, Fields has demonstrated that the short C. elegans introns (<75 nt) have less information content at the 5' ss than the somewhat rare long C. elegans introns (>75 nt) (16).
We thought it would be of interest to determine whether the intron length and the intron information content showed a similar relationship in the introns of the fungi. We found that, with the exception of a single intron size range, as the S. cerevisiae and S. pombe intron lengths increase, so does the information content. This pattern of an increase in information content with an increase in the size of the introns is more pronounced at the 5' ss and 3' ss of S. pombe introns than at those of S. cerevisiae introns. However, the information content at the branch sites across most intron size ranges was higher for S. cerevisiae than for S. pombe. With N. crassa, A. nidulans, and C. neoformans introns, the direct correlation between information content and intron length was more subtle than that observed for the introns of the two yeasts. As mentioned earlier, the majority of the introns of N. crassa, A. nidulans, and C. neoformans fall into a more restricted length range than do the introns of S. cerevisiae and S. pombe; therefore, effective splicing over this narrow range of intron lengths may not require a large increase in information content as intron length increases. We did note a slight increase in information content for intron lengths between 240 and 2,000 nt for N. crassa, A. nidulans, and C. neoformans introns relative to what we saw at the lower nucleotide ranges. It is not completely clear why introns of the two yeasts have higher information content than do introns of N. crassa, A. nidulans, and C. neoformans. These differences may suggest subtle differences in splicing mechanisms between the yeasts and the three other fungi.
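The comparison of information content across intron size ranges can be illustrated with a minimal Python sketch. It assumes the common Shannon-based measure (2 bits minus the entropy of the base frequencies at each aligned position) with no small-sample correction, which may differ from the calculation used in this study. The function names, the 6-nt 5' ss window, and the bin edges are hypothetical choices made only for illustration.

```python
import math
from collections import Counter, defaultdict


def information_content(sites):
    """Total information (bits) over the columns of equal-length aligned sites.

    Each column contributes 2 - H, where H is the Shannon entropy of the
    observed base frequencies. No small-sample correction is applied
    (an assumption of this sketch), so small groups are overestimated.
    """
    total = 0.0
    for i in range(len(sites[0])):
        counts = Counter(seq[i] for seq in sites)
        n = sum(counts.values())
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - entropy
    return total


def information_by_length(introns, bin_edges, site_len=6):
    """Bin introns by length and score the first site_len bases of each bin.

    introns: intron sequences starting at the 5' ss (DNA alphabet).
    bin_edges: ascending edges; an intron of length n falls in [lo, hi).
    site_len: hypothetical width of the 5' ss window.
    """
    bins = defaultdict(list)
    for seq in introns:
        for lo, hi in zip(bin_edges, bin_edges[1:]):
            if lo <= len(seq) < hi:
                bins[(lo, hi)].append(seq[:site_len])
                break
    return {b: information_content(s) for b, s in sorted(bins.items())}


# Toy, made-up sequences (not data from this study).
toy_introns = [
    "GTAAGT" + "A" * 44,
    "GTAAGT" + "C" * 50,
    "GTATGT" + "A" * 100,
    "GTAAGA" + "A" * 150,
]
by_bin = information_by_length(toy_introns, [0, 100, 300])
```

With real data, each length bin would hold many introns, and the same scoring could be applied to windows around the 3' ss and the branch site to compare the three signals across size ranges.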
The most striking difference that we observed between the fungal introns and mammalian introns was the absence of polypyrimidine tracts between the 3' ss and the branch point in a sizable population (31.3 to 72.5%) of introns from each organism in the study. Considering that we used a minimal definition for identifying polypyrimidine tracts, it is clear that polypyrimidine tracts are absent in many fungal introns downstream of the branch point. The polypyrimidine tracts that we did observe between the 3' ss and the branch point are relatively weak and may function in a manner different from that of the classical polypyrimidine tracts defined for metazoans. These observations are consistent with the absence of polypyrimidine tracts between the branch point and the 3' ss that also has been observed for certain Drosophila introns (46). Splicing mechanisms in small introns of Drosophila that lack a 3'-end polypyrimidine tract but instead have a pair of polypyrimidine tracts in the region from the 5' ss to the branch point are different from those found in mammals (17, 32, 33). Bon et al. also noted that poly(T) tracts were found in S. cerevisiae introns upstream of the branch site (7). Our observations in conjunction with those of Bon et al. (7) suggest that the classical splicing signals defined for metazoans may differ from those in fungi.
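The kind of scan that yields the fraction of introns lacking a tract between the branch point and the 3' ss can be sketched as follows. This is an illustration only: the tract criterion used here (a run of at least `min_len` consecutive pyrimidines) and the default of 6 are hypothetical and are not the minimal definition applied in this study. The branch point position is assumed to be known for each intron.

```python
import re


def has_polypyrimidine_tract(intron, branch_pos, min_len=6):
    """True if a pyrimidine run of >= min_len lies between branch point and 3' ss.

    intron: intron sequence (DNA alphabet) ending with the 3' ss dinucleotide.
    branch_pos: 0-based index of the branch point A (assumed known).
    min_len: hypothetical run-length threshold, not the study's definition.
    """
    # Exclude the branch point itself and the terminal 3' ss dinucleotide.
    region = intron[branch_pos + 1 : -2].upper()
    return re.search(r"[CT]{%d,}" % min_len, region) is not None


def fraction_lacking_tract(introns_with_bp, min_len=6):
    """Fraction of (sequence, branch_pos) pairs with no downstream tract."""
    lacking = sum(
        not has_polypyrimidine_tract(seq, bp, min_len)
        for seq, bp in introns_with_bp
    )
    return lacking / len(introns_with_bp)


# Toy, made-up introns (not data from this study); branch point A at index 10.
with_tract = "GTAAGTACTAACTTTTCCTTTAG"
without_tract = "GTAAGTACTAACGAGATAG"
fraction = fraction_lacking_tract([(with_tract, 10), (without_tract, 10)])
```

Applying the same test to the region from the 5' ss to the branch point, `intron[:branch_pos]`, would flag upstream tracts of the kind described for small Drosophila introns.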
The classical splicing of metazoan pre-mRNA involves the recruitment of U1 snRNP to the 5' ss and U2 snRNP with its two associated U2AF subunits. One of the two U2AF subunits, U2AF65, binds to the polypyrimidine tract in the 3' ss region, and the other subunit, U2AF35, associates with the 3' ss acceptor site to facilitate correct splicing in metazoan introns (8, 24, 42, 45, 69). In a small Drosophila intron that lacks the polypyrimidine tract in the 3' end of the intron, the pair of polypyrimidine tracts in the region between the 5' ss and the branch point are required for U2AF binding and efficient splicing (32, 33). Forch et al. (17) also have reported that U2AF65 promotes the recruitment of U1 snRNP to weak 5' ss that have downstream U-rich sequences, and these authors have suggested that U2AF65 plays this role in splicing by binding to polypyrimidine tracts in the region from the 5' ss to the branch site. We identified homologs of both U2AF subunits in A. nidulans, N. crassa, and C. neoformans, and U2AF65 and U2AF35 subunits have been found in S. pombe (66). S. cerevisiae has a functional equivalent of U2AF65 (Mud2p) but no U2AF35 homolog (1). These findings are consistent with the observations described above and suggest that the splicing mechanism in S. pombe, the filamentous ascomycetes, and C. neoformans may differ from that in metazoans but may be similar to that described for a small intron of Drosophila that lacks a 3'-end polypyrimidine tract (32, 33). Based on the observations that S. pombe and S. cerevisiae have higher information content in their introns than the other three fungi and that S. cerevisiae does not have a U2AF35 homolog, one may speculate that pre-mRNA splicing may be different in the yeasts and the other fungi. Our observations indicate a need for defining a new model fungal organism other than S. cerevisiae and possibly S. pombe that could be exploited for establishing how the U2AF subunit homologs function in protein-protein and protein-RNA interactions during pre-mRNA splicing.
Considering that splicing of the small introns of Drosophila has many characteristics that match those of fungal introns, the results of splicing studies with Drosophila could serve as an excellent guide for future studies of model fungal organisms. Another protein in Drosophila that is associated with the splicing of pre-mRNAs containing a 5' ss with downstream polypyrimidine tracts is TIA-1 (18). A protein similar to TIA-1 is Nam8p of S. cerevisiae (47). Nam8p is associated with yeast U1 snRNP, and the activity of Nam8p is optimal when there are pyrimidine-rich sequences downstream of the 5' ss (47, 70). Because TIA-1 and Nam8p function in splicing in introns that are similar to many of the introns characterized in our study, we screened for homologs of these proteins in genomic and EST data sets of the fungi used in this study. The finding of homologs of both TIA-1 and Nam8p in the A. nidulans and N. crassa data sets, a Nam8p homolog in S. pombe, and a TIA-1 homolog in C. neoformans suggests that splicing in these organisms is dependent on mechanisms similar to those described for S. cerevisiae or Drosophila, in which Nam8p or TIA-1 is involved. The mechanisms described for the splicing of small introns with polypyrimidine tracts upstream of the branch point may be unique to eukaryotic organisms other than vertebrates because Nam8p has no counterpart in mammalian U1 snRNP (22).
Taken together, our findings show that while there are significant similarities between introns of vertebrates and fungi, there are also some important differences that will have an impact on the mechanisms used for excising the introns. Fungi may be excellent model organisms for studying the splicing machinery needed for efficient splicing in groups of organisms that have short introns with polypyrimidine tracts only in the region downstream of the 5' ss but upstream of the branch site. Within the fungal organisms studied here, the introns of S. cerevisiae and S. pombe, the two yeasts, were found to differ in many ways from the introns of the two filamentous ascomycetes and the one basidiomycete. Based on these differences among groups of fungi, it seems necessary to select for splicing studies a new model organism that more accurately reflects the characteristics of the filamentous ascomycetes.
| ACKNOWLEDGMENTS |
|---|
This work was supported by grant AI-47079 from the National Institutes of Health and EPSCOR grant EPS9550478 from the National Science Foundation.
| FOOTNOTES |
|---|
Supplemental material for this article may be found at http://ec.asm.org/.
Present address: Mutational Profiling, Genome Sequencing Center, Washington University School of Medicine, St. Louis, MO 63108.
| REFERENCES |
|---|