DYNAMIC QUERYING FOR PATTERN IDENTIFICATION in Microarray and Genomic Data

 

Harry Hochheiser1, Eric H. Baehrecke2 Stephen M. Mount3, Ben Shneiderman1

                       

1Unversity of Maryland, Department of Computer Science; 2University of Maryland Biotechnology Research, Center for Biosystems Research; 3University of Maryland, Department of Cell Biology and Molecular Genetics

 

 


Abstract

 

Data sets involving linear ordered sequences are a recurring theme in bioinformatics. Dynamic query tools that support exploration of these data sets can be useful for identifying patterns of interest. This paper describes the use of one such tool – TimeSearcher - to interactively explore linear sequence data sets taken from two bioinformatics problems.  Microarray time course data sets involve expression levels for large numbers of genes over multiple time points. TimeSearcher can be used to interactively search these data sets for genes with expression profiles of interest. The occurrence frequencies of short sequences of DNA in aligned exons can be used to identify sequences that play a role in the pre-mRNA splicing. TimeSearcher can be used to search these data sets for candidate splicing signals.

1. Introduction

 

Data sets involving linear ordered sequences of measurements are a recurring theme in bioinformatics work. Microarray time courses include gene expression levels for thousands of genes over multiple time points, providing biologists with a history of relative changes under different experimental conditions. Similarly, frequency counts for oligonucleotides in aligned sequences can help to identify signals for transcription and RNA processing.

 

Although these data sets involve fundamentally different questions and methods, they are both amenable to analysis via interactive querying for patterns involving differences in measured levels. For microarray results, this might involve finding genes that have low expression levels at one time point and higher levels at the next. For sequence data, the analogous query might involve identifying short sequences 5 nucleotides long (pentamers) that occur more frequently than expected in particular regions.

 

TimeSearcher (http://www.cs.umd.edu/hcil/timesearcher) [8] is a dynamic query tool originally designed for identification of time series data.  This paper describes the use of TimeSearcher for identification of splicing signals in aligned sequence data, and patterns in microarray time course data.

 

2. TimeBOXES AND TIMESEARCHER

 

Timeboxes are rectangular query regions drawn directly on a two-dimensional graph.  The extent of the Timebox on the time (x) axis specifies the time period of interest, while the extent on the value (y) axis specifies a constraint on the range of values of interest in the given time period.  More specifically, if an item in a data set is to satisfy a timebox that goes between (x1,y1 ) and (x2,y2), for every point in the time range x1£x£x2, the value of that item must be in the range y1£y£y2 (assuming y2³y1 and x2³x1). Multiple timeboxes can be drawn to specify conjunctive queries.  Items in a data set must match all of the constraints implied by the active timeboxes in order to be included in the result set.

 

To create a timebox, the user clicks on the desired starting point of the timebox and drags the pointer to the desired location of the opposite corner.  As this is identical to the mechanism used for creating rectangles in widely used drawing programs, this operation should be familiar to most users.  Once the timebox is created, it may be dragged to a new location or resized via appropriate resize handles on the corners, using similarly familiar interactions.  In all cases, the query is reprocessed with each mouse event.  As the mouse is moved, the current position of the timebox is stored, and the result display is updated.

 

Construction of timeboxes is aided by the display of the graphs for all of the items in the data set directly on the query area.  This “graph overview” display provides additional insight into the density, distributions, and patterns of change found among items in the data set (Figure 1).

 


3. MICROARRAY DATA

 


Numerous published reports of microarray data have used the examination of changes in gene expression levels over time to examine the effects of various stimuli on genetic expression. As these data sets typically involve measurements for thousands of genes taken over numerous (5-30) time points, interpretation is often a challenge. Currently, analyses of these data sets are generally conducted via some sort of mathematical grouping of genes with similar expression profiles. Clustering techniques that have been used include hierarchical clustering [5,14], self-organizing maps [17,19], and singular value decomposition [9]. The use of TimeSearcher to interactively query and explore these data sets may be a useful complement to such techniques.

 

 

In order to study the shift between anaerobic to aerobic metabolism, DeRisi, Iyer, and Brown examined gene expression changes in yeast (Saccharomyces cerevisiae) cells at several points in time after their placement in fresh medium  [4]. Microarray measurements were made every 2 hours between 9 and 21 hours after initial placement, for a total of 7 time points.  Figure 1 shows a “graph overview” of a subset of the data set, containing 1051 profiles overlaid on a single pair of axes. Starting from this overview, the user might draw a series of timeboxes to identify some genes with expression levels that had local peaks at 19 hours  (Figure 2). This provides a subset of 57 genes that appear to have strongly similar expression profiles.

 

 

 


 


To identify potential regulatory genes, this query can be shifted one time period earlier, thus finding genes that had similar peaks at 17 hours. To do this, the user lasso-selects the three query boxes and drags them to the left. The resulting query identifies a single profile with a similar local peak at 17 hours (Figure 3).  As this single gene precedes the expression of the genes identified in Figure 2, it might be considered for examination as a potential downstream regulator of those genes with local maxima at 19 hours.

 

 

 


 

 

 


4. NUCLEOTIDE SEQUENCE DATA

 

Molecular biologists interested in understanding the process of converting genes into proteins must examine and understand the structure of nucleotide sequences. Often, this work involves analyzing the frequencies of subsequence/oligonucleotide/words at different positions within aligned DNA sequences.  A variety of computational and statistical tools have been proposed to help with the challenge of analyzing the large volume of sequence information that is available [6,7,12]. As with the microarray data sets, interactive exploration complements these tools.

 

Specifically, we have been using TimeSearcher to identify consensus branch site splicing signals in the plant Arabidopsis thaliana. These are secondary signals in the RNA transcripts of genes that help to determine which sequences (introns) are removed from RNA.  The segments that remain are known as exons.

 

The data set being used for this purpose was generated from the genomic sequences surrounding 8550 internal exons that were internally truncated and aligned with respect to their boundaries [13]. It consists of the normalized frequencies of each of the 1024 possible pentamers at each of 192 possible positions.

 

Figure 4 shows a “data envelope” overview of the whole data set.  Formed by plotting a contour defined by the minimum and maximum values of any item in the data set at a given time point, this display can provide useful feedback when the graph envelope (Figure 1) becomes too cluttered. Two peaks, indicating the boundaries between the exon in the middle and the introns on the ends, are immediately apparent. These peaks represent well-known conservation of sequences at splice sites.

 

 

 

To identify candidate splicing signals, we create a query using two timeboxes. One component of the query will identify those pentamers that are frequently found 23-27 nucleotides upstream of intron-exon boundaries. The second identifies pentamers that are infrequently found elsewhere within introns (Figure 5). Taken together, these criteria identify candidate branch site consensus sequences (e.g. CTAAT, CTGAT) that correspond closely to known examples [15]. In addition to identifying known consensus motifs, TimeSearcher was useful for identifying their location.

 

 

 

 

 


 

 

 

 

 


 

 

 

 

 


5. RELATED WORK

 

A variety of methods have been proposed for analyzing microarray data sets. Mathematical clustering of gene expression profiles, together with clustering mosaic plots, has been used to examine time series and other microarray results [4,5,9,11,17,19]. As these results generally provide output in static forms, interactive exploration is generally not supported. Similarly, much of the work to date in analysis of oligomer count data has focused heavily on computational and statistical approaches [6,7,12]. These computational approaches can be very helpful for understanding these data sets, but they suffer from problems such as lack of interactivity and possible sensitivity to parameters. TimeSearcher and other tools that let users work more directly from the data may help users “see” their data better. Ideally, interactive approaches such as TimeSearcher would be integrated with these computational approaches.

 

Recently, tools that apply principles of information visualization to microarray data have been developed. The hierarchical clustering explorer (HCE) provides support for dynamic querying of dendrograms that result from hierarchical clustering algorithms [14]. VxInsight has been used to visualize clustered gene expression profiles in a 3d-projected “mountain”  [11].

 

Interactive tools for querying time series data might be used to find patterns in microarray and sequence data. QuerySketch is an innovative query-by-example tool that uses an easily drawn sketch of a time-series profile to retrieve similar profiles, with similarity defined by Euclidean distance  [18]. This approach is simple and intuitive, but accurate sketches may be difficult to draw, and query constraints, including the similarity threshold, are not adjustable. Spotfire’s Array Explorer 3  [16] supports graphically editable queries of temporal patterns, but the result set is generated by complex metrics in a multidimensional space.  This potent approach produces useful results, but users may wish to constrain result sets more precisely.  Spotfire also includes integrated support for a variety of clustering algorithms.

 

Algorithmic tools for working with microarray time series data sets have addressed difficulties caused by missing data and differences in experimental time scales that might require time warping [1,4]. The combination of these tools with TimeSearcher’s interactive visualization might be an interesting area for future work.

 

 

6. DISCUSSION AND CONCLUSIONS

 

The identification of interesting transitions is a key element of analysis of the microarray and genomic data sets described above. For example, identification of genes with characteristic increases or decreases in expression levels is necessary for finding genes that respond to certain stimuli. Similarly, the candidate regulatory splice sequences were those that had high frequencies at one position and low frequencies at another, requiring a conjunctive query.

 

Timeboxes and TimeSearcher are particularly well suited to support these queries. As timeboxes are drawn directly on a graph space that is also used for plotting data, the queries are easily interpreted at a glance.  Complex queries containing multiple timeboxes provide visual feedback that illustrates the pattern defined by the query. The graph envelopes drawn directly on the two-dimensional query space provide additional feedback that can aid the process of creating queries and interpreting result sets.

 

The power of the timebox model lies in its flexibility. For queries involving identical constraints over w adjacent attributes, a single timebox of width w can be used to express all of the constraints.  This represents a substantial improvement over single-attribute query widgets, which would have required manipulation of w individual widgets to specify the same number of parameters.  When desired query constraints vary from one attribute to the next, multiple boxes can be combined to specify a complex, conjunctive query (Figures 2, 3, and 5).

 

As an interactive, dynamic query tool, timeboxes can assist analyses of microarray and oligomer count data sets by providing rapid feedback that links the results of queries to the query criteria. Together with TimeSearcher’s graph envelope and data envelope overviews that provide high-level summaries, these queries can help biologists understand data sets. Timebox queries that describe patterns of interests may be interesting results in themselves, possibly providing parameters that might be used with algorithmic approaches.

 

The design and functionality of TimeSearcher has been influenced by our needs in examining data sets similar to those discussed above. In addition to providing preliminary validation of the timebox model, this work has led to numerous suggestions for extensions to the query model.

 

For example, Variable Time Timeboxes (VTTs) can be used to specify queries requiring that values remain in a given range for a given amount of time, occurring within some larger range. These queries might be used to find genes that have peak expression levels for three consecutive measurements anywhere in a window containing 5 time points. VTTs have proven useful in the construction of queries that separate two classes of profiles in a larger data set [10].  Future work will involve incorporation of VTTs and other extensions into TimeSearcher.

 

ACKNOWLEDEGMENTS

 

The first author is supported by a fellowship from America Online.  Thanks to Steven Salzberg from The Institute for Genomics Research for providing the nucleotide sequence data.

 

References

 

[1] J. Aach and G.M. Church Aligning Gene Expression Time Series with Time Warping Algorithms. Bioinformatics 17(6): 495-508.

[2] Z. Bar-Joseph, G. Gerber, D.K. Gifford, and T.S. Jaakkola. A new approach to analyzing gene expression time series data. In The Sizth Annual International Conference on Research in Computational Molecular Biology, 2002.

[3] M. de Berg, M. van Kreveld, M. Overmars,  and O. Schwarzkopf Computational Geometry: Algorithms and Applications, Springer-Verlag: Berlin, 2000.

[4]   J. DeRisi, V. Iyer, and P. Brown. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278: 680–686, 24 October 1997.

[5] M.B .Eisen, P.T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc. Nat. Acad. Sci USA 9514863-14868.December, 1998.

[6]  W.G. Fairbrother, R.F. Yeh, P.A. Sharp, and C..B. Burge Predictive identification of exonic splicing enhancers in human genes. Science 297, 1007-1013, 9 August 2002.

[7] J. van Helden, J., B. Andre, B. and J. Collado-Vides Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. Journal of Molecular Biology 281(5), 827-842, September 4, 1998.

[8]. H.S. Hochheiser and B. Shneiderman Visual specification of  queries for finding patterns in time series data. In K.P. Jante and A. Shinohara, editors, Proceedings of Discovery Science 2001, Lecture Notes in Artificial Intelligence 2226, 441-446. Berlin, 2001. Springer

[9] N.S. Holter, M. Mitra, A. Maritan, M. Cieplak, J. Banavar, and N. Federoff Fundamental patterns underlying gene expression profiles: simplicity from complexity. Proc. Nat. Acad. Sci USA 97(15):8409-8414. 18 July 2000.

[10] E. Keogh, H. Hochheiser, and B. Shneiderman. An Augmented Visual Query Mechanism for Finding Patterns in Time Series Data. Proceedings of Flexible Query Answering Systems 2002, Lecture Notes in Artificial Intelligence 2522, 240-250. Berlin, 2002. Springer.

[11] S. K. Kim, J. Lund, M. Kiraly, K. Duke, M. Jiang, J.M. Stuart, A. Eizinger, B.N. Wylie, and S.G. Davidson. A gene expression map for Caenorhabditis elegans Science 293:2087-2092.

[12] U. Ohler, U., and H. Niemann, Identification and analysis of eukaryotic promoters: recent computational approaches. Trends in Genetics 17(2), 56-60, Feb 2001.

[13] S. Salzberg. Personal Communication, 2002.

[14] J. Seo and B. Shneiderman Understanding hierarchical clustering results by interactive exploration of dendrograms: A case study with genetic microarray data IEEE Computer, 35(7), 80-86, July 2002.

[15] C. G. Simpson, G. Thow, G. P. Clark, S. N. Jennings, J. A. Watters and J. W. S. Brown. Mutational analysis of a plant branchpoint and polypyrimidine tract required forc onstitutive splicing of a mini-exon. RNA 8:47-56. January 2002

[16] Spotfire. http: //www.spotfire.com.

[17] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan,  E. Dmitrovsky, E. Lander, and T. Golub Interpreting patterns of gene expression with self-organizing maps: methods and applications to hematopoietic differentiation. Proc. Nat. Acad. Sci USA 96:2907-2912, March 1999,

[18] M. Wattenberg. Sketching a graph to query a time series database. In Proceedings of the 2001 Conference Human Factors in Computing Systems, Extended Abstracts, pages 381–382, Seattle WA, March 31- April 5 2001. ACM Press.

[19] K. P. White, S.A. Rifkin, P. Hurban, and David Hogness. Microarray Analysis of Drosophila Development During Metamorphosis. Science 286:2179-2184. 10 December 1999.