Transcription factor binding sites were identified by consulting multiple online prediction tools, which quickly returned over two hundred predicted cis-motifs, many of which had low probability scores. In a few select cases, the odds of identifying functional cis-motifs were increased by adding 5 bp of flanking sequence on either side of the core motif, based on previously identified target sites for WUS, ARF1, and ARR1. The enlarged binding sites were then mapped to the CLV3 genomic sequence, tolerating up to 2 mismatches in the flanking regions. To account for transcription factors whose cis-motifs are not currently known, MEME analysis was employed to identify motifs shared between genes that are co-expressed with CLV3. Overall, 231 potential cis-motifs and transcription factor binding sites were identified. Most were randomly distributed over the entire CLV3 genomic sequence, but irregular clusters could be recognized near the coding region. The largest cluster occurred in the upstream 500 bp of the 5’ promoter, while up to three smaller clusters occurred in the 3’ enhancer region. The list of potential factors was then filtered to include only those found inside the previously identified CLV3 regulatory regions, which left 157 predictions. Many of the remaining predictions had overlapping sequences, though it is unclear how well this might predict their actual function in vivo. One notable example is a predicted MYB-like binding site located at -155 bp, which was predicted by four different databases. In other cases, two structurally different transcription factors were predicted to have overlapping cis-motifs, such as the bZIP/homeodomain pair Opaque-2/ALFIN-1 in the 3’ enhancer region.
Interestingly, the data also revealed four partial miR414 targets, three of which overlapped with the DNA/Mariner family transposable element At2gTE50670 in the 3’ enhancer, while the fourth occurred in the 3rd exon.
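The flank-tolerant mapping step described above (exact core motif, up to 2 mismatches in the 5 bp flanks) can be illustrated with a minimal sketch. The function name, the flank and core sequences, and the test sequence are all hypothetical placeholders chosen for illustration; they are not the WUS, ARF1, or ARR1 sites used in the analysis, and the sketch scans only the forward strand.

```python
def find_extended_sites(sequence, left_flank, core, right_flank,
                        max_flank_mismatches=2):
    """Return (start, flank_mismatches) for every window whose core matches
    exactly and whose flanks differ from the query at no more than
    max_flank_mismatches positions. Forward strand only."""
    sequence = sequence.upper()
    site = (left_flank + core + right_flank).upper()
    core_start = len(left_flank)
    core_end = core_start + len(core)
    hits = []
    for start in range(len(sequence) - len(site) + 1):
        window = sequence[start:start + len(site)]
        # The core motif must be matched without any mismatch.
        if window[core_start:core_end] != site[core_start:core_end]:
            continue
        # Count mismatches in the flanking positions only.
        mismatches = sum(
            1
            for i, (a, b) in enumerate(zip(window, site))
            if a != b and not (core_start <= i < core_end)
        )
        if mismatches <= max_flank_mismatches:
            hits.append((start, mismatches))
    return hits


# Hypothetical example: a TAAT core with made-up 5 bp flanks.
example = ("AAAA" "GGCCTTAATGGATC" "CCCC" "GGACTTAATGCATC"
           "TTTT" "ACGTATAATCCCCC")
hits = find_extended_sites(example, "GGCCT", "TAAT", "GGATC")
print(hits)
```

In this toy sequence the first site matches perfectly, the second carries one mismatch in each flank and is still accepted, and the third shares only the core and is rejected.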
In an alternative approach to identify unknown cis-motifs, phylogenetic footprinting was used to compare CLV3 orthologous sequences from different species. In this method, functional regulatory structures are identified by their conservation over evolutionary time, which often requires little more than performing a sequence alignment. The method is also quite robust, as previous studies found that the identified footprints matched 80% and 85% of known transcription factor binding sites. To begin this analysis, three CLV3 orthologs were identified by their syntenic relationships within the Brassicaceae using the tools in the Brassica Genome.org database. Their cDNA sequences were aligned with 27 CLE family paralogs identified in A. thaliana in order to identify features that were unique to CLV3 orthologs, before expanding the search to additional species. This analysis revealed three potentially unique traits that might be used to distinguish orthologs from the multitude of closely related CLE genes: three consecutive histidines at the C-terminal end of the CLE motif, a C-terminal oligo extension, and a 3-exon gene structure, all of which had been previously identified in the CLV3 sub-group. Additional orthologs were then identified using tBlastn searches with the AtCLV3 protein, which yielded nine species that met the criteria described above: Brachypodium distachyon, Oryza sativa, Ricinus communis, Glycine max, A. thaliana, Arabidopsis lyrata, Brassica rapa, Capsella grandiflora, and Camelina sativa. No AtCLV3 orthologs were identified in the gymnosperms, basal angiosperms, or the Asterids using these search parameters. The Euphorbiaceae and Fabaceae each contributed one species in the closely related Eurosids I, while the monocots are represented by two species in the Poaceae. As a result, this sampling is heavily biased towards the Brassicaceae, which provides more than half of the total number of species.
In order to footprint the promoter regions, initial sequence alignments were performed using 8 kb genomic fragments, containing up to 5 kb of upstream and downstream sequences on either side of the coding region. However, little or no homology was found when all nine orthologs were aligned simultaneously. This was not improved by removing the monocot clade, as the two grass orthologs failed to align with each other.
Repeating this pattern, both R. communis and G. max also failed to align with each other, or with any of the remaining orthologs. In contrast, conserved regions became clearly visible when the five Brassicaceae species were aligned separately. This result appears to reflect the optimum degree of sequence divergence for this gene, as previous studies have found that orthologs outside of the Brassicaceae were less informative due to excessive divergence, whereas sequences obtained entirely within the Brassicaceae have been found to have too little divergence. Three of the remaining species had complete genomic sequences, while the other two consisted of two contigs separated by a gap of unknown size. In the B. rapa ortholog, the gap was located in the 3’ region, and was flanked by 256 and 452 bp sequences that did not align with any of the other Brassicaceae orthologs, despite strong sequence conservation in the surrounding regions. This indicates the recent insertion of a large DNA fragment, potentially >700 bp in size. BLAST searches for the source of the two end-fragment sequences in the B. rapa genome unexpectedly found that each was present in multiple copies distributed across several different chromosomes. No evidence of transposable element sequences was found, so the flanking regions were interpreted here as contaminating scaffold sequences from the original genome assembly. A similar gap of unknown size occurred in C. grandiflora, where one contig aligned with the CDS and 3’ UTR, while the entire 5’ upstream contig failed to align with any other ortholog. In both cases, the non-aligning sequences were removed from the analysis, providing a final alignment consisting of four orthologs in the 5’ promoter region, and five orthologs spanning the CDS and 3’ UTR. Overall, the five orthologs shared between 27% and 65% sequence similarity, and grouped into two closely related pairs. One pair contained C. grandiflora and C. sativa, and the other contained A. thaliana and A. lyrata. In contrast, B. rapa was found to be distinct from all other Brassicaceae orthologs, which accurately recapitulates its predicted evolutionary relationship with the rest of the family. Upon closer inspection, the coding regions were found to be 79-93% similar, while similarity dropped to just 14-34% in regions with no significant alignments. The initial alignment was considerably fragmented, with many insertions, deletions, and isolated nucleotides. In many cases, the position of these features varied with the settings in the alignment software, and they were interpreted here as artifacts of the alignment procedure.
To correct such artifacts, isolated nucleotides were manually adjusted left or right to maximize local sequence alignments within ±5 bp. Where variation in the length of tandem repeats was apparent, gaps were introduced into one or more ortholog sequences to accommodate the largest number of repeats present. Conserved regions were then identified by using a 5 bp sliding window to identify regions with more than 60% identity. This window is unusually small compared to previous studies that have used 15-50 bp sliding windows, but was chosen here to more accurately reflect the minimum size of known transcription factor binding sites. Where large contiguous conserved regions were found, the presence of small 1-3 bp indels within their sequences was used to break them into smaller fragments, as disruption of these sites indicates that they do not contain functional cis-motifs. In all, 42 conserved regions were identified, ranging in size from 5 to over 111 bp long. Fourteen footprints were found in the coding sequence, of which nine were clustered around the three exons. Only one footprint was found entirely within the 5’ UTR, and the remaining four were scattered in the 3’ UTR. Several predicted transcription factor sites were found within the coding regions, but these were interpreted to be non-functional, as previous GUS-reporter systems did not reveal any significant regulatory elements within this region. Among other notable features was a predicted signal peptide in the first exon, identified with SignalP 4.0, which was almost entirely conserved and is consistent with the secretion of the mature CLV3 oligopeptide. In addition, the second exon was found to be completely conserved with no intervening gaps. The second exon also completely overlapped with several predicted transcription factors, including HOX2a, as well as cytokinin and gibberellic acid responsive motifs.
This suggests an as-yet unrecognized functional role for the second exon, which might explain why it has been retained in a family that consists largely of single-exon genes. The 3rd exon was also highly conserved, although curiously the most conserved region only partially overlapped with the CLE motif and instead included part of the C-terminal extension. In the 3’ UTR, the footprints were found to overlap with potential zinc-finger and MYB binding sites, as well as a cytokinin responsive ARR10 site. In the upstream regulatory region, the 5’ promoter contained ten conserved footprints, eight of which formed a large and nearly contiguous block near the TSS. The two isolated footprints were located at -204 bp and -167 bp upstream, corresponding to the palindromic Motif#2 and the redundantly predicted MYB binding site, respectively. In the remaining footprints, additional predictions were found for an overlapping AGL15/CBF site, an auxin response element, overlapping GT1 and AGAMOUS sites, and one prediction for a TATA-less promoter. The latter may be related to the position of the only recognizable TATA box-like sequence, which lies at -68 bp upstream, more than double the usual 25-35 bp described for other TATA-based promoters. In contrast, the 3’ enhancer region contained seventeen footprints arranged in roughly three clusters, spanning a region nearly 600 bp long. Two of these clusters closely corresponded with the previously noted clusters of predicted transcription factor sites, while the third was distinctly isolated and had no predicted transcription factors. Together, the footprints contained one of the three known WUS binding sites, two predicted AtHB1 binding sites, a cytokinin responsive element, several bZIP motifs, a KNOX-like site, and a predicted cis-motif for NPR1. Strikingly, the majority of the footprints also overlapped with a DNA transposable element in A. thaliana, At2TE50665.
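The conserved-region scan described above (a 5 bp sliding window, retained when identity exceeds 60%) can be sketched as follows. This is only an illustration under stated assumptions: a column is counted as identical when every ortholog carries the same base and none has a gap, window identity is the fraction of such columns, and overlapping or adjacent windows are merged into one footprint. The text does not specify how identity was scored across multiple orthologs, so these definitions, the function names, and the toy alignment are hypothetical.

```python
def column_is_identical(column, gap="-"):
    """True when all sequences share the same base and none has a gap."""
    return gap not in column and len(set(column)) == 1


def conserved_regions(aligned, window=5, min_identity=0.6, gap="-"):
    """Return merged half-open (start, end) alignment intervals covered by
    windows whose fraction of identical columns exceeds min_identity."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    columns = zip(*(seq.upper() for seq in aligned))
    identical = [column_is_identical(col, gap) for col in columns]
    regions = []
    for start in range(length - window + 1):
        score = sum(identical[start:start + window]) / window
        if score > min_identity:  # strictly more than 60%, i.e. >= 4 of 5
            end = start + window
            if regions and start <= regions[-1][1]:
                # Extend the previous footprint when windows overlap or touch.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


# Toy three-sequence alignment (made up for illustration).
toy_alignment = [
    "ACGTACGTTTGACCA",
    "ACGTACGAATG-CCA",
    "ACGTACGCTTGACCT",
]
footprints = conserved_regions(toy_alignment)
print(footprints)
```

With a 5 bp window, "more than 60%" means at least 4 of 5 identical columns, so a footprint can include a single variable position at its edge. Splitting footprints at 1-3 bp indels, as described in the text, would be a separate step.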
It has previously been implied that WUS controls CLV3 expression in a concentration-dependent manner, which is consistent with the close proximity of two demonstrated WUS binding sites.
The region around these two sites also contains several other TAAT cores within a single stretch about 100 bp long, much of which is represented by four conserved footprints, which together might form a WUS binding site cluster. However, only the +970 WUS binding site was found to be perfectly conserved, while the other TAAT cores displayed mutations or were interrupted by indel sequences in one or more orthologs. Instead, when the region around the known WUS binding sites was examined in more detail with a 5 bp sliding window, a strikingly periodic pattern was observed, in which four different conserved motifs were regularly spaced about 15 bp apart. In order from 5’ to 3’, these motifs were identified as CCGTTGGG, AGTAC, TTGTCAA, and TAATTAATGG, the latter two of which correspond to a predicted W-box motif and the +970 WUS binding site, respectively. In addition, a perfectly conserved sequence was found just 25-36 bp downstream in all orthologs, which consisted almost entirely of tandem repeats containing ATG. The ATG repeats also overlapped with a predicted ALFIN-1 homeodomain/Opaque-2 binding site, suggesting that this sequence may actually represent a modified bZIP motif, or perhaps an atypical homeodomain binding site containing a TGAT core motif. It is not clear how many potential binding sites are present in these ATG repeats, but in consideration of the size of the conserved region, it seems likely that they could accommodate up to three transcription factor proteins simultaneously. The potential functional role of the TGAT motifs is further supported by the observation that they are 4x over-represented in the surrounding 124 bp conserved region, while the TAAT cores are actually 5x under-represented. In addition, pairwise distance measurements between the two cores revealed a skewed distribution, in which few sites were found closer together than the median value of 5 bp.
When several median-length pairs were aligned, they corresponded to the 13 bp motif TAATnnWnnTGAT. When this motif was subjected to PatMatch searches of the A. thaliana genome, it was found to be 26x over-represented among the genes directly targeted by WUS. Multiple copies of the 13 bp motif were also found in several target genes, including two in the 3’ enhancer of AtCLV3.
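As a rough illustration of how a degenerate motif such as TAATnnWnnTGAT can be located and its enrichment expressed, the sketch below translates IUPAC codes into a regular expression and scans both strands. It is a stand-in written for this example, not PatMatch itself, and the choice to scan both strands is an assumption. The helper names, the test sequence, and the counts passed to `fold_enrichment` are invented; the background model behind the reported 26x value is not described in the text.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Translate an IUPAC motif string into a regex pattern string."""
    return "".join(IUPAC[base] for base in motif.upper())


def reverse_complement(seq):
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_motif(sequence, motif="TAATNNWNNTGAT"):
    """Return sorted (start, strand, matched_text) hits on both strands.
    start is the 0-based forward-strand coordinate of the leftmost base."""
    # A lookahead is used so that overlapping matches are all reported.
    pattern = re.compile("(?=(" + iupac_to_regex(motif) + "))")
    seq = sequence.upper()
    hits = [(m.start(), "+", m.group(1)) for m in pattern.finditer(seq)]
    for m in pattern.finditer(reverse_complement(seq)):
        start = len(seq) - m.start() - len(motif)
        hits.append((start, "-", m.group(1)))
    return sorted(hits)


def fold_enrichment(hits_in_targets, n_targets, hits_in_background, n_background):
    """Ratio of per-gene hit frequency in a target set versus a background set."""
    return (hits_in_targets / n_targets) / (hits_in_background / n_background)


# Hypothetical test sequence containing one forward-strand match.
motif_hits = find_motif("GGGTAATCCACGTGATCCC")
print(motif_hits)
# Invented counts, used only to show the arithmetic.
print(fold_enrichment(6, 100, 30, 1000))
```

Any real enrichment estimate would also need to account for sequence length and base composition of the two gene sets, which this simple ratio ignores.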