Jinwook Seo and Ben Shneiderman*
Department of Computer Science and
Human-Computer Interaction Laboratory
Institute for Advanced Computer Studies
University of Maryland
College Park, MD 20742
* Correspondence:
Ben Shneiderman
Department of Computer Science
University of Maryland
College Park, MD 20742
Phone: (301) 405-2680
Fax: (301) 405-6707
Email: ben@cs.umd.edu
Key Words: Cluster Analysis, Interactive Design, Information Visualization, Coordination, Domain Knowledge, Graphical User Interfaces, Dynamic Queries
Cluster analysis of multidimensional data is widely used in many research areas including
financial, economic, sociological, and biological analyses. Finding natural subclasses
in a data set not only reveals interesting patterns but also serves as a basis for further analyses.
One of the troubles with cluster analysis is that evaluating how interesting a
clustering result is to researchers is subjective, application-dependent, and
even difficult to measure. This problem generally gets worse as the dimensionality
and the number of items grow. The
remedy is to enable researchers to apply domain knowledge to facilitate insight
about the significance of the clustering result. This article presents a way to better understand a clustering
result by combining insights from two interactively coordinated visual displays
of domain knowledge. The first is a parallel coordinates view powered by a
direct-manipulation search. The second is a domain knowledge view
containing well-understood and meaningful tabular or hierarchical information
for the same data set. Our examples depend on hierarchical clustering of gene
expression data, coordinated with a parallel coordinates view and with the gene
annotation and gene ontology.
Cluster analysis is used in numerous research domains, including business, economic,
sociological, and biological analyses. Data sets in these domains are
usually large (tens of thousands of items) and have more than three
attributes/variables, making them multidimensional or multivariate [1]. A cluster is
a group of data items that are similar to others within the same group
and different from items in other groups.
Clustering enables researchers to see overall distribution patterns, identify interesting
unusual patterns, and spot potential outliers. Moreover, clusters can serve as
effective inputs to other analysis methods such as classification.
Researchers in various areas are still developing their own clustering algorithms
even though a large number of general-purpose clustering algorithms already exist.
One reason is that it is difficult to
understand a clustering algorithm well enough to apply it to a new data
set. A more important reason is that it
is difficult for researchers to validate or understand the clustering results
in their own way or in terms of their knowledge of the data set. Even the same clustering algorithm might
generate a completely different clustering result when the distance/similarity
measure changes. A clustering result
could make sense to some researchers but not to others, because the validity of a
clustering result depends heavily on users’ interests and is
application-dependent. Therefore,
researchers’ domain knowledge plays a key role in understanding and evaluating
the clustering result.
A large number of clustering algorithms have been
developed, but only a small number of cluster visualization tools are available
to facilitate researchers’ understanding of the clustering results. Current visual cluster analysis tools can be improved by allowing
researchers to incorporate their domain knowledge into visual displays that are
well coordinated with the clustering result view. This paper describes two additions to our interactive visual cluster
analysis tool, the Hierarchical Clustering Explorer [3]. Both additions are coordinated views
for the researchers’ domain knowledge:
- a parallel coordinates view enables researchers to search for profiles similar to a candidate pattern, which is specified by direct manipulation.
- a domain knowledge view allows users to compare their clustering results with well-understood and meaningful tabular or hierarchical information for the same data set.
Visual analysis by techniques such as dynamic queries
has been successfully used in supporting researchers who are interested in
analyses of multidimensional data [2][7].
Well-designed visual coordination with researchers’ domain knowledge
facilitates users’ understanding of the analysis result.
We first briefly explain the interactive exploration
of clustering results using our current version, HCE 3.0. In section 3, the design considerations for
the direct-manipulation search tool and the dynamic queries are explained in detail.
Section 4 presents a tabular view showing gene annotation and the gene ontology
browser, and section 5 covers some implementation issues.
Some clustering algorithms, such as k-means, require
users to specify the number of clusters as an input, but it is hard to know the
right number of natural clusters beforehand.
Other clustering algorithms automatically determine the number of
clusters, but users may not be convinced of the result since they had little or
no control over the clustering process.
To avoid this dilemma, researchers prefer the hierarchical clustering
algorithm since it does not require users to enter a predetermined number of
clusters and it allows users to control the desired resolution of a clustering result. HCE 3.0 is an interactive knowledge
visualization tool for hierarchical clustering results with a rich set of user controls
(dendrograms, color mosaic displays, etc.) (Figure 1). A hierarchical clustering result is
generally represented as a binary tree called a dendrogram, whose subtrees are
clusters. HCE 3.0 users can see the
overall clustering result in a single screen, and zoom in to see more detail.
Considering that the lower a subtree is, the tighter the cluster is, we
implemented two dynamic controls, the minimum similarity bar and the detail cutoff bar,
which are shown over the dendrogram display.
Users can control the number of clusters by using the minimum similarity
bar, whose y-coordinate determines the minimum similarity threshold. As users pull down the minimum similarity
bar, they get tighter clusters (lower subtrees) that satisfy the current
minimum similarity threshold. Users can
control the level of detail by using the detail cutoff bar. All the subtrees below
the detail cutoff bar are rendered using the average intensity of items in the
subtree so that users can see the overall patterns of clusters without distraction
by too much detail.
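The effect of the minimum similarity bar can be illustrated by cutting a dendrogram at a threshold. The following is only a minimal sketch, not HCE's implementation: the `Node` class, the `cut_dendrogram` function, and the similarity values are hypothetical. It assumes that merge similarities never increase toward the root and that single items below the cut count as their own clusters.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """A dendrogram node (hypothetical structure, not HCE's)."""
    similarity: float = 1.0          # similarity at which the two children merged
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    item: Optional[str] = None       # set only for leaves


def leaves(node: Node) -> List[str]:
    """Collect the item names under a subtree, left to right."""
    if node.item is not None:
        return [node.item]
    return leaves(node.left) + leaves(node.right)


def cut_dendrogram(root: Node, min_similarity: float) -> List[List[str]]:
    """Return the clusters (subtrees) whose merge similarity meets the threshold.

    Raising min_similarity corresponds to pulling the bar down: the cut moves
    to lower subtrees and the clusters become tighter.
    """
    if root.item is not None or root.similarity >= min_similarity:
        return [leaves(root)]
    return (cut_dendrogram(root.left, min_similarity)
            + cut_dendrogram(root.right, min_similarity))


# Made-up four-item dendrogram: (a, b) merge at 0.9, (c, d) at 0.7, all at 0.2.
ab = Node(0.9, Node(item="a"), Node(item="b"))
cd = Node(0.7, Node(item="c"), Node(item="d"))
root = Node(0.2, ab, cd)

print(cut_dendrogram(root, 0.5))   # two clusters
print(cut_dendrogram(root, 0.8))   # (a, b) stays together, c and d separate
```

In this sketch the threshold is a similarity value rather than a screen coordinate; mapping the bar's y-coordinate to a similarity is left out.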

Figure 1. Overall layout of HCE 3.0. The minimum similarity bar was pulled down to
get 55 clusters in the dendrogram view.
A cluster of 113 genes is selected in the dendrogram view, and its genes are
highlighted in the scatterplots, the detail view, and the parallel coordinates view tab
window (see section 3). Users can select
a tab among the seven tab windows in the bottom pane to investigate the data
set in coordination with different views.
Users can see the names of the selected genes and the actual expression
values in the detail views.
Since we get a different clustering result when a different linkage method or similarity measure is used in hierarchical clustering, we need some mechanisms to evaluate clustering results. HCE 3.0 implements three different evaluation mechanisms. First, HCE 3.0 users can compare two dendrograms (or hierarchical clustering results) in a single dendrogram view to visually compare the effects of different clustering parameters. The two dendrograms are shown face to face, and when users double-click on a cluster in one dendrogram, they can see lines connecting the items in the cluster to the same items in the other dendrogram. Second, HCE 3.0 users can compare a hierarchical clustering result and a k-means clustering result. When users click on a cluster in the dendrogram view, the items in the cluster are also highlighted in the k-means clustering result view (the last tab in Figure 1) so that users can see if the two clustering results are consistent. Third, HCE 3.0 enables users to evaluate a clustering result using an external evaluation measure (F-measure) when they know the correct clustering result in advance. Through these three mechanisms, HCE 3.0 helps users determine the most appropriate clustering parameters for their data set.
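The third mechanism can be made concrete with a small sketch of an external F-measure. The text does not say which variant HCE 3.0 uses, so the code below assumes one common definition: for each known class, take the best F score over all clusters, then average these scores weighted by class size. The function name and the sample labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def f_measure(true_labels: Sequence[Hashable],
              cluster_labels: Sequence[Hashable]) -> float:
    """Class-size-weighted best-match F-measure (an assumed variant)."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    overlap = Counter(zip(true_labels, cluster_labels))

    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = overlap[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j      # share of the cluster that is in the class
            recall = n_ij / n_i         # share of the class that is in the cluster
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total


# Made-up labels: a clustering that matches the known classes scores 1.0,
# while lumping everything into one cluster scores lower.
known = ["x", "x", "y", "y"]
print(f_measure(known, [0, 0, 1, 1]))
print(f_measure(known, [0, 0, 0, 0]))
```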
HCE 3.0 was successfully used in two case
studies with gene expression data. We
proposed a general method of using HCE 3.0 to identify the optimal signal/noise
balance in Affymetrix gene chip data analyses.
HCE 3.0's interactive features help researchers find the optimal combination of three
variables (probe set signal algorithms, noise filtering methods, and clustering linkage
methods) to maximize the effect of the desired biological variable
on data interpretation [8]. HCE 3.0 was also used to analyze in
vivo murine muscle regeneration expression profiling data using Affymetrix
U74Av2 (12,488 probe sets) chips measured at 27 time points. HCE 3.0's visual analysis techniques and dynamic
query controls played an important role in finding 12 novel downstream targets
that are biologically relevant during myoblast differentiation [9]. In sections 3 and 4, we will use this data
set to demonstrate how HCE 3.0 combines users’ domain knowledge with other
views to facilitate insight about the clustering result and the data set.
Many microarray experiments measure gene
expression over time [5][9]. Researchers
would like to group genes with similar expression profiles or find interesting
time-varying patterns in the data set by performing cluster analysis. Another way to identify
genes with profiles similar to known genes is to directly search for the genes
by specifying the expected pattern of a known gene. When researchers have some
domain knowledge, such as the expected pattern of a previously characterized
gene, they can try to find genes similar to the expected pattern. Since
it is not easy to specify the expected pattern in a single try, they have to
conduct a series of searches for the expression profiles similar to the
expected pattern. Therefore, they need an interactive visual analysis tool that
allows easy modification of the expected pattern and rapid update of the search
result.
Clustering and direct profile search can complement
each other. Since no clustering algorithm is right for all data
sets and applications, direct profile search can be used to validate the
clustering result by projecting the search result onto the clustering result
view. Conversely, a clustering result
can be used to validate the profile search by projecting the clustering result onto
the profile view. Therefore,
coordination between a clustering result and a direct search result makes the
identification process more valid and effective.
‘Profile Search’ in the Spotfire DecisionSite (www.spotfire.com) calculates the
similarity to a search pattern (the so-called 'master profile') for all genes in
the data set and adds the result as a new column to the data set. The built-in profile editor makes it possible to
edit the search pattern, but the editor view is separate from the profile chart view where all
matching profiles are shown, so users need to switch between two views to try a
series of queries. The modification of the
master profile in the profile editor view is interactive, but search results
are not updated dynamically as the master profile changes.
TimeSearcher [7] supports interactive querying and exploration of time-series data. Users can specify interactive timeboxes over the time-varying patterns, and get back the profiles that pass through all the timeboxes. Users can drag and drop an item from the data set into the query window to create a query with a separate timebox for each time point over the item in the data set. Each timebox at each time point can be modified to change the query.
HCE 3.0 reproduces Spotfire’s and TimeSearcher’s basic
functions with a novel interface, the parallel coordinates view powered by a direct-manipulation search,
which allows for rapid creation and modification of desired profiles using novel
visual metaphors. Key design concepts are:
- interactive specification of a search pattern on the information space: Users can submit their queries simply by mouse drags over the search space rather than using a separate query specification window.
- dynamic query control: Users get the query results instantaneously as they change the search pattern, similarity function, or similarity threshold.
- sequential query refinement: Users can keep the current query results as a new narrowed search space for subsequent queries. This enables users to refine their query results step by step, which follows the process of general problem solving.
The parallel coordinates view consists of three
parts (Figure 2): the information space where input profiles are drawn and
queries are specified, the range slider to specify similarity thresholds, and a
set of controls to specify query parameters. Users specify a search pattern by simple mouse drags. As they drag the mouse over the information
space, the intersection points of the mouse cursor and the vertical time lines define
control points. A search pattern is a
set of line segments connecting the contiguous control points specified. Users choose a search method and a
similarity measure on the control panel. They can change the current search pattern by moving a
control point (a rectangular point on the search pattern), by moving a line segment vertically or
horizontally, or by adding or removing control points. All of these modifications are done by mouse clicks or drags, and the
results are updated instantaneously. This
integration of the space where the data is shown and the space where the search
pattern is composed reduces users' cognitive load by removing the overhead of
context switching between two different spaces.

Figure 2. Parallel coordinates view: Layout of the
parallel coordinates view and an example of a model-based query on the mouse
muscle regeneration data. The data
silhouette (the gray shadow) represents the coverage of all expression
profiles. The red bold line is a
search pattern specified by users’ mouse drags. Thin solid lines are the profiles returned by the current query, that is, those that
satisfy the given similarity threshold (more than 96.3% similar to the search
pattern). The data set shown is a temporal gene
expression profile of mouse muscle regeneration [9].
Incremental query processing enables rapid updates (within
100 ms) so that dynamic query control is possible for most microarray data
sets. The easy and fast search for
interesting patterns enables researchers to attempt multiple queries in a short
period of time to get important insights into the underlying data set.
In the parallel coordinates view, users can submit a new query over the current query result. If users click the “Pin This Result” button after submitting a query, the query result becomes a new narrowed search space (Figure 2). We call this “pinning.” Pinning enables sequential query refinement, which makes it easy to find target patterns without losing the focus of the current analysis process. If users click on a cluster in the dendrogram view, all items in the cluster are shown in the parallel coordinates view. By pinning this result, users can limit the search to the cluster to isolate more specific patterns within it.
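Pinning can be pictured as replacing the search space with the latest result. The sketch below is an illustration under assumptions, not HCE's code: the `ProfileSearch` class, its method names, and the sample profiles are hypothetical, and a query is reduced to an arbitrary predicate over a profile.

```python
from typing import Callable, Dict, List, Set

Profile = List[float]


class ProfileSearch:
    """Hypothetical model of sequential query refinement by pinning."""

    def __init__(self, profiles: Dict[str, Profile]) -> None:
        self.profiles = profiles
        self.search_space: Set[str] = set(profiles)   # everything is searchable at first
        self.result: Set[str] = set(profiles)

    def query(self, predicate: Callable[[Profile], bool]) -> Set[str]:
        """Evaluate a query over the current (possibly pinned) search space only."""
        self.result = {name for name in self.search_space
                       if predicate(self.profiles[name])}
        return self.result

    def pin(self) -> None:
        """Make the latest result the new, narrowed search space."""
        self.search_space = set(self.result)


# Made-up profiles with three time points each.
search = ProfileSearch({"g1": [1, 5, 2], "g2": [1, 4, 9], "g3": [7, 8, 9]})
search.query(lambda p: p[0] < 3)           # first query keeps g1 and g2
search.pin()                               # later queries only see g1 and g2
print(search.query(lambda p: p[2] > 5))    # g3 also ends high, but it was pinned out
```

Selecting a cluster in the dendrogram view and pinning it would correspond to setting the result to the cluster's items before calling `pin()`.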
Genes included in the search result are highlighted
in the dendrogram view. Conversely, if users click
on a cluster in the dendrogram view, profiles of the genes in the cluster are shown in the parallel
coordinates view so that users can see the patterns of genes in a
view other than the color mosaic.
Through the coordination between the parallel coordinates view and the dendrogram
view, users can easily see the representative patterns of clusters and compare patterns
between clusters. Since queries done in
the parallel coordinates view identify genes with a similar
profile, the search results should be consistent with clustering results if the same
similarity function is used. In
this regard, the parallel coordinates view helps researchers validate the clustering
results by applying their domain knowledge through direct-manipulation searches.
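HCE 3.0 supports this consistency check visually, by highlighting. As a rough numerical counterpart, the overlap between a search result and a cluster could be summarized as below. The function name, the two ratios, and the gene names are illustrative assumptions and are not features described for HCE.

```python
from typing import Dict, Iterable, Union


def overlap_summary(search_result: Iterable[str],
                    cluster: Iterable[str]) -> Dict[str, Union[int, float]]:
    """Summarize how well a search result agrees with a cluster."""
    found, members = set(search_result), set(cluster)
    shared = found & members
    return {
        "in_both": len(shared),
        # fraction of the cluster that the search recovered
        "cluster_coverage": len(shared) / len(members) if members else 0.0,
        # fraction of the search result that lies inside the cluster
        "search_purity": len(shared) / len(found) if found else 0.0,
    }


# Made-up gene names.
print(overlap_summary(["a", "b", "c"], ["b", "c", "d", "e"]))
```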
In the parallel coordinates view, users can run a text
search (called search-by-name query) by typing in a text string to find items whose name or description
contains the string. Moreover, two
different types of direct-manipulation queries are possible in the parallel
coordinates view: model-based queries and ceiling-and-floor queries.
Model-based queries: Users can specify a model pattern (or a search pattern) simply by mouse drags as shown in Figure 2, select one of three distance/similarity measures, and assign the similarity/distance threshold values. All profiles satisfying the similarity/distance threshold range will be rapidly shown in the information space. The three measures are ‘Pearson correlation coefficient’, ‘Euclidean distance’, and ‘absolute distance from each control point’. The first measure is useful when the up-down trends of profiles are more important than the magnitudes, while the second and the third measures are useful when the actual magnitudes are more important. When users know the name of a biologically relevant gene, they can perform a text-based search first by entering a name or a description of the gene (Figure 4). Then they can choose one of the matching genes and make it a model pattern by right-clicking on the pattern and selecting “Make it a model pattern.” They can adjust or delete some control points depending on their domain knowledge. Finally, they adjust the similarity thresholds to get satisfactory results and project them onto other views including the dendrogram view.
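A model-based query with the three measures might be sketched as follows. This is an illustration, not HCE's implementation. It assumes that profiles are compared with the model pattern only at the time points that carry control points, and that ‘absolute distance from each control point’ means the largest absolute difference over those points. All function names, parameters, and sample profiles are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 for a constant input (an assumption)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def max_abs_distance(x: Sequence[float], y: Sequence[float]) -> float:
    return max(abs(a - b) for a, b in zip(x, y))


def model_based_query(profiles: Dict[str, List[float]],
                      control_points: Dict[int, float],
                      measure: str, threshold: float) -> List[str]:
    """Return names of profiles that match the pattern under the chosen measure."""
    times = sorted(control_points)
    target = [control_points[t] for t in times]
    hits = []
    for name, profile in profiles.items():
        values = [profile[t] for t in times]
        if measure == "pearson":          # similarity: higher is better
            matched = pearson(values, target) >= threshold
        elif measure == "euclidean":      # distance: lower is better
            matched = euclidean(values, target) <= threshold
        elif measure == "absolute":       # every control point within threshold
            matched = max_abs_distance(values, target) <= threshold
        else:
            raise ValueError(f"unknown measure: {measure}")
        if matched:
            hits.append(name)
    return hits


# Made-up profiles: same trend at two magnitudes, and the opposite trend.
profiles = {"up": [1.0, 2.0, 3.0, 4.0],
            "up_big": [10.0, 20.0, 30.0, 40.0],
            "down": [4.0, 3.0, 2.0, 1.0]}
pattern = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
print(model_based_query(profiles, pattern, "pearson", 0.9))    # trend only
print(model_based_query(profiles, pattern, "euclidean", 1.0))  # magnitude matters
```

The example mirrors the distinction drawn in the text: the correlation measure accepts both rising profiles regardless of magnitude, while the distance measures accept only profiles close to the pattern in value.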
Ceiling-and-Floor queries: Ceilings and
floors are novel visual metaphors to specify satisfactory value ranges using
direct manipulation. A ceiling imposes
upper bounds and a floor imposes lower bounds on the corresponding time
points. Users can define ceilings and
floors on the information space so that only the profiles between ceilings and
floors are shown as a result (Figure 3). Users can specify a ceiling by dragging with the left mouse button
depressed, and a floor by dragging with the right mouse button depressed. They can change ceilings and floors with
mouse actions in the same way as they did for changing search patterns in model-based
queries. This type of query is useful
when users know the up-down patterns and the appropriate value ranges at the
corresponding time points of the target profiles. Compared to model-based queries, ceiling-and-floor queries allow
users to specify separate bounds for each control point.
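A ceiling-and-floor query reduces to a per-time-point range test. The sketch below assumes that ceilings and floors are given as bounds at individual time point indices and that the bounds are inclusive; the function name and the sample data are hypothetical, and the interpolation along dragged line segments is omitted.

```python
from typing import Dict, List


def ceiling_floor_query(profiles: Dict[str, List[float]],
                        ceilings: Dict[int, float],
                        floors: Dict[int, float]) -> List[str]:
    """Return names of profiles that stay under every ceiling and over every floor.

    Time points without a ceiling or floor are unconstrained.
    """
    hits = []
    for name, profile in profiles.items():
        under = all(profile[t] <= bound for t, bound in ceilings.items())
        over = all(profile[t] >= bound for t, bound in floors.items())
        if under and over:
            hits.append(name)
    return hits


# Made-up profiles with three time points each.
profiles = {"g1": [1.0, 5.0, 2.0], "g2": [1.0, 9.0, 2.0], "g3": [4.0, 5.0, 2.0]}
# Low at time 0, and between 4 and 6 at time 1; time 2 is unconstrained.
print(ceiling_floor_query(profiles, ceilings={0: 2.0, 1: 6.0}, floors={1: 4.0}))
```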

Figure 3. An example of the Ceiling-and-Floor
query. Bold line segments above the
profiles define ceilings, and bold line segments below the profiles define
floors. Profiles below ceilings and
above floors at the time points where ceilings or floors are defined are shown
as a result. Users can move a line
segment or a control point of ceilings or floors to modify the current query. The highlighted region gives users informative visual feedback on
the current query. The data set shown
is a temporal gene expression profile of mouse muscle regeneration [9].
Coordination example: Researchers generated in vivo
murine muscle regeneration expression profiling data using Affymetrix U74Av2
(12,488 probe sets) chips. They
measured expression levels at 27 time points to find genes that are biologically relevant
to the muscle regeneration process. They already have the domain knowledge that MyoD
is one of the genes most relevant to muscle regeneration. They run hierarchical clustering on
the data set and identify a relevant cluster that has a peak on day 3 (Figure
4). In the parallel coordinates view, they search for MyoD using a
search-by-name query, then make it a model pattern to perform a model-based
query. They adjust the similarity
thresholds to get a search result that mostly overlaps with the relevant 3-day
cluster (Figure 4). Finally, they
confirm through other biological experiments that 2 genes (Cdh15 and Stam)
in the overlapping result set are novel downstream targets of MyoD.

Figure 4(a). Run a search-by-name query with ‘MyoD’
to find 5 genes whose names contain MyoD. The 5 genes are projected
onto the current clustering result visualization, shown by triangles under the
color mosaic. Select a gene (myogenic
differentiation 1) and make it a model pattern for the next query.

Figure 4(b). Modify the model pattern to emphasize the 3-day
peak (notice the bold red line), and run a model-based query to find a small
set of candidate genes. The updated
search result will be highlighted in the dendrogram view and the gene ontology
browser (see section 4).
Figure 4. An example of coordination with the parallel
coordinates view
Interactive visualization techniques combined with
cluster analysis help researchers discover meaningful groups in the data
set. A direct-manipulation search
coordinated with clustering result visualization facilitates insight about
the clustering result and the data set.
Further improvement is possible if there is another well-understood and meaningful knowledge structure for the same
data set. For example, when
marketers perform a cluster analysis on the customer transaction data, they
discover customer groups based on purchasing patterns. If they have another knowledge structure on
the data such as the customer preferences or demographic information, they can
acquire more insight into the clustering results by projecting the additional
information onto the clustering result.
In this market analysis example, if a geographic hierarchy of states,
counties, and cities were available, it might be possible to discover that
purchasers of expensive toys reside in large southern cities. They are likely
to be older grandparents in retirement communities.
Coordination between clustering results and external
domain knowledge, such as the Gene Ontology, is also being added to commercial software tools,
such as Spotfire DecisionSite and CoMotion (www.mayaviz.com). We expand on this
important idea by allowing rapid multiple selection in secondary databases
through tabular and hierarchical views. The paper continues with the genomic
data case study.
Tabular View
In recent decades, biological knowledge has been
accumulated in public genomic databases (GenBank, LocusLink, FlyBase, MGI, and
so on) and it will increase rapidly in the future [4]. These databases are useful sources of
external domain knowledge with which biologists gain insights into their data
sets and clustering results. Biologists frequently utilize those databases to
obtain information about genomic instances that they are interested in.
However, those databases are so diverse that researchers have difficulty
identifying relevant information in the databases and combining it.
HCE 3.0 implements a tabular view (Figure 5) as a hub
of database annotations where users can see annotations extracted from those databases
for items in the data set. Each row represents an item and each column represents
an annotation from an external knowledge source. The tabular view is interactively
coordinated with other views in HCE 3.0 as shown in Figure 8. If users select a group of items in other
views, rows of the selected items are highlighted in the tabular view. By carefully looking at the annotations for
the selected items in the tabular view and looking them up in the corresponding
databases, users can gain more insight into the items by utilizing the domain
knowledge from the databases.
Conversely, if users select a group of rows in the tabular view, the
selected items are also highlighted in other views. Researchers can
annotate the data set either manually or by using annotation files provided by gene chip
makers. For example, Affymetrix
provides annotation files for all their GeneChips, and users can easily import
the annotation file and combine it with the data set.

Figure 5. Tabular view: Each row has annotations for a gene.
Each column represents an annotation from an external database. All
12,422 genes are in the tabular view, and there are 28 annotation columns. When
users select a cluster of 113 genes in the dendrogram view, the annotation
information for those genes is highlighted in the tabular view. The Affymetrix U74Av2 chip annotation
file downloaded from www.affymetrix.com was imported and combined with the data
set. The data set shown is a temporal gene expression profile of mouse
muscle regeneration [9].
Hierarchy View: Gene Ontology Browser
One of the major reasons that biologists cannot
efficiently utilize the abundant knowledge in public genomic databases is the lack
of a shared controlled vocabulary. The Gene Ontology (GO) project [6] is a
collaborative effort of biologists to build consistent descriptions of
gene products in different databases. The GO collaborators have been developing three ontologies - structured,
controlled vocabularies with which gene products are described in terms of their associated biological
processes, molecular functions, and cellular components in a
species-independent manner.
The good news is that Gene Ontology (GO) annotation is
a widely accepted, well-understood and meaningful knowledge structure for gene expression data. GO annotations of genes in a cluster or a
direct manipulation search result might reveal a clue about why the genes are
grouped together. With the GO
annotation, researchers can easily recognize the biological process, molecular
function, and cellular component that genes in a cluster are associated
with. Furthermore, it is possible to test
a hypothesis that an unknown gene might have the same or a similar biological
role as the known genes in the same cluster.
Interactive coordination with the GO annotation enables researchers to
deepen their insights by drawing on generally accepted knowledge from other
researchers.
HCE 3.0 integrates the three ontologies (molecular
function, biological process, and cellular component) into the process of
understanding clusters and patterns in gene expression profile data. The ontologies are shown in a hierarchical
structure as in Figure 6. The gene ontology hierarchy is a directed acyclic
graph (DAG), but we use a tree structure to show the hierarchy since the tree
structure is easier for users to understand and easier for developers to
implement than a DAG. Thus, a gene
ontology term may appear several times in different branches, but the path from
the root to each node of the tree is unique.
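Showing a DAG as a tree amounts to giving each distinct root-to-term path its own tree node. The sketch below enumerates those paths for a toy hierarchy. It is not HCE's implementation: the child-to-parents dictionary, the function name, and the term names are hypothetical placeholders rather than real GO identifiers.

```python
from typing import Dict, List


def paths_from_root(parents: Dict[str, List[str]], term: str) -> List[List[str]]:
    """List every root-to-term path in a DAG given as term -> list of parents.

    Each returned path corresponds to one occurrence of the term in the
    unfolded tree, so a term with two parents appears in two branches.
    """
    if not parents.get(term):          # no parents: this is a root
        return [[term]]
    paths = []
    for parent in parents[term]:
        for path in paths_from_root(parents, parent):
            paths.append(path + [term])
    return paths


# Toy hierarchy: term C has two parents, A and B, which share one root.
parents = {"root": [], "A": ["root"], "B": ["root"], "C": ["A", "B"]}
for path in paths_from_root(parents, "C"):
    print(" > ".join(path))
```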
Users can download the latest gene ontologies from the Gene Ontology Consortium’s ftp server (‘Get
Latest Ontology’ button), and browse the ontology hierarchy on its own (‘Load
Ontology’ button). Coordination between the gene ontology browser and other
views in HCE 3.0 is bi-directional.
Figure 6. HCE 3.0 with the gene ontology browser on. Users can select a cluster in the dendrogram
view (at the top left corner), which is highlighted with a rectangle. The 113 genes in the selected cluster are shown
in the gene list control at the bottom right corner. All paths to the selected GO terms (associated with myogenin)
are shown with a flag-shape icon in the ontology tree control at the bottom
left corner. ‘I’ represents an ‘IS-A’
relationship and ‘P’ represents a ‘PART-OF’ relationship. The data set shown is in vivo
murine muscle regeneration expression profiling data using Affymetrix U74Av2
(12,488 probe sets) chips measured at 27 time points.
Coordination from other views to the Gene Ontology
Browser: Selection of genes in other views, such as a click on a cluster in the
dendrogram view, a direct-manipulation search in the parallel coordinates view,
or a rubber-band selection in a scatterplot, populates the gene list control
with the selected genes and their GO identifiers as shown in Figure 6 (bottom
right corner). Gene names are
preceded by the ‘G’-shape icon and GO identifiers are preceded by a flag-shape
icon. GO identifiers are listed below
the gene name with an indentation. If users select a GO identifier in the gene list
control, all possible paths from the root to the selected GO identifier in the
entire GO hierarchy are shown in the
ontology tree control (in the bottom left corner of Figure 6). To reduce clutter, irrelevant paths are hidden. If users select a gene in the gene list control
as in Figure 6, paths for all GO identifiers of the gene are shown in the
ontology tree control. By looking
at the GO term names shown in the ontology tree control, users can see the details of the
biological functions related to the gene, described using a shared
controlled vocabulary. Clicking on the ‘← All’ or ‘← Selected’ button shows all paths from the root to the GO
identifiers of all or selected genes in the gene list control. By carefully investigating the shared paths
in the ontology tree control, users can learn which molecular function,
biological process, or cellular component is related to the genes in the
cluster. For example, if all genes in a
cluster are mapped to GO nodes below physiological process in the
biological process ontology, genes in the cluster are likely to be involved in a
physiological process.
Coordination from the Gene Ontology Browser to other views: When gene expression profile data are loaded into HCE 3.0, each gene is
mapped to its associated gene ontology identifiers. Each item in the gene ontology tree control shows, in parentheses following the
gene ontology identifier, the number of genes mapped to the item or its descendants.
By scrutinizing the numbers next to GO identifiers,
researchers can get an idea of which known gene ontology terms better
describe the gene expression profile data. If
users right-click on an item (that is, a GO term), all genes mapped to the
item or its descendants are highlighted in all other views including the
dendrogram view, and they are listed in the gene list control (Figure 7). If users want more information about a GO
identifier, they can double-click on it, and HCE 3.0 will launch a web browser and open a web
page for the identifier at godatabase.org, where users can also find all
associated genes across available public data sources (FlyBase, MGI, SRS,
etc.).
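The number shown next to each ontology item, and the set of genes highlighted on a right-click, can both be derived by collecting the genes annotated to a term and to all of its descendants. The following is a minimal sketch under assumptions: the parent-to-children dictionary, the annotation dictionary, the function name, and the term and gene names are hypothetical. It counts each gene once even when the gene is reachable through several branches.

```python
from typing import Dict, List, Optional, Set


def genes_under(term: str,
                children: Dict[str, List[str]],
                annotations: Dict[str, Set[str]],
                cache: Optional[Dict[str, Set[str]]] = None) -> Set[str]:
    """Return the distinct genes mapped to a term or to any of its descendants."""
    if cache is None:
        cache = {}
    if term in cache:                       # shared descendants are visited once
        return cache[term]
    genes = set(annotations.get(term, ()))  # genes annotated directly to the term
    for child in children.get(term, ()):
        genes |= genes_under(child, children, annotations, cache)
    cache[term] = genes
    return genes


# Toy hierarchy in which term C is a child of both A and B.
children = {"root": ["A", "B"], "A": ["C"], "B": ["C"]}
annotations = {"A": {"g1"}, "B": {"g3"}, "C": {"g2", "g3"}}
for term in ("root", "A", "B", "C"):
    print(term, len(genes_under(term, children, annotations)))
```

Using sets rather than summing child counts matters in a DAG, because a gene annotated under a shared descendant would otherwise be counted more than once.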
Coordination example: Researchers annotate
their 27 time point murine muscle data set with GO identifiers using the
annotation file downloaded from the Affymetrix website (www.affymetrix.com). They
click on the 3-day cluster in the dendrogram view, or perform a model-based
query in the parallel coordinates view, and check the GO annotations of the genes
in the result to see if there are any shared ontology terms. Conversely, they can browse the ontology tree
control and perform a text-based search for GO:0007519 (myogenesis), which
is one of the terms most biologically relevant to their experiment. By right-clicking on the GO term, they see
that all genes mapped to myogenesis and its descendants are
highlighted in the dendrogram view, and then they realize that many of the
genes are in the 3-day cluster. All
genes in the cluster thus become candidates for novel downstream
targets of MyoD and deserve further biological experiments. The
coordination with GO would produce more meaningful insights as GO becomes more comprehensive
and as more genes are annotated with GO terms.

Figure 7(a). Users right-click on a GO identifier in the ontology tree control
to highlight all genes mapped to the identifier or its descendants in the dendrogram
view. The selected genes are also
listed in the gene list control.

Figure 7(b). The
selected genes are also shown in the parallel coordinates view to enable users
to check the result in a different view.
Figure 7. An example of coordination with the Gene
Ontology Browser
HCE 3.0 was implemented as a stand-alone
application using Microsoft Visual C++ 6.0.
The Microsoft Foundation Class (MFC) library was statically linked. HCE 3.0 runs on personal computers running
Windows (Windows 95 or later) without special hardware or external library
support. HCE
3.0 is freely available at http://www.cs.umd.edu/hcil/hce/ for
academic or research purposes.
Figure 8 shows four tightly coupled components of HCE and the linkages
between them. Updates through each linkage in Figure 8 are effectively instantaneous
(they take less than 100 ms) for most microarray data sets.

Figure 8. Diagram of interactions between components of
HCE 3.0. All interactions are
bi-directional. This paper describes
coordination between the dendrogram view, parallel coordinates view, and
knowledge tables/hierarchies view.
Knowledge tables/hierarchies incorporate external domain knowledge while
others show the internal data using different visual representations.
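One common way to obtain this kind of bi-directional linkage is a shared selection model that every view both writes to and listens to. The sketch below illustrates that idea in Python. It is an assumption made for illustration, since the paper does not describe how HCE couples its components internally; the class name, the view names, and the gene names are hypothetical.

```python
class SelectionModel:
    """Shared selection state that notifies every registered view of a change."""

    def __init__(self):
        self._callbacks = []
        self.selected = frozenset()

    def register(self, on_selection_changed):
        """Add a view callback that receives each new selection."""
        self._callbacks.append(on_selection_changed)

    def select(self, items):
        """Record a selection made in any view and push it to all views."""
        self.selected = frozenset(items)
        for callback in self._callbacks:
            callback(self.selected)


shown = {}  # stand-in for what each view currently highlights


def make_view(name):
    """Build a callback that records the selection for the named view."""
    def on_selection_changed(items):
        shown[name] = items
    return on_selection_changed


model = SelectionModel()
for view_name in ("dendrogram", "parallel_coordinates", "knowledge"):
    model.register(make_view(view_name))

# A click in the dendrogram and a later click in the knowledge view both go
# through the same model, so the linkage works in either direction.
model.select({"gene_a", "gene_b"})
model.select({"gene_c"})
print(shown)
```

Because no view talks to another view directly, adding a further component only requires one more call to `register`.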
To achieve rapid responses to users’ actions, hash and map data structures were used because they enable constant-time lookup of items with only a modest storage overhead. Incremental data structures were used to support rapid query updates in the parallel coordinates view by maintaining active index sets for intermediate query results.
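The two ideas in this paragraph can be illustrated with a small Python sketch. It is one possible reading, not the HCE code: the class name, the per-dimension range constraints, and the toy data are assumptions. A dictionary provides hash-based lookup from an item identifier to its row, and one index set per constrained dimension is kept so that changing a single constraint recomputes only that set before the sets are intersected again.

```python
class IncrementalRangeQuery:
    """Keep one set of matching row indices per constrained dimension."""

    def __init__(self, rows):
        self.rows = rows                      # rows[i][d] is the value of item i in dimension d
        self.all_items = set(range(len(rows)))
        self.hits = {}                        # dimension -> indices passing that dimension's range

    def set_range(self, dim, low, high):
        """Set or change the range on one dimension; only that index set is rebuilt."""
        self.hits[dim] = {
            i for i, row in enumerate(self.rows) if low <= row[dim] <= high
        }
        return self.result()

    def clear_range(self, dim):
        """Remove the constraint on one dimension."""
        self.hits.pop(dim, None)
        return self.result()

    def result(self):
        """Intersect the stored per-dimension sets to get the current matches."""
        matches = set(self.all_items)
        for indices in self.hits.values():
            matches &= indices
        return matches


# Hypothetical expression profiles over three time points.
gene_ids = ["gene_a", "gene_b", "gene_c", "gene_d"]
rows = [
    [1.0, 5.0, 2.0],
    [2.0, 1.0, 2.5],
    [3.0, 4.0, 0.5],
    [0.5, 4.5, 2.2],
]
row_of = {gene_id: i for i, gene_id in enumerate(gene_ids)}  # hash lookup by identifier

query = IncrementalRangeQuery(rows)
first = query.set_range(0, 0.0, 2.0)    # constrain time point 0
second = query.set_range(1, 4.0, 6.0)   # add a constraint on time point 1
third = query.set_range(0, 0.0, 0.75)   # tighten time point 0; the time point 1 set is reused
print(sorted(gene_ids[i] for i in third))
```

Rebuilding one set still scans all items in this sketch; a sorted index per dimension would be a natural refinement for larger data sets.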
A microarray experiment data set can be imported into HCE 3.0
from a tab-delimited text file or an Excel spreadsheet. The latest
gene ontology annotation data is automatically downloaded from the Gene
Ontology Consortium’s ftp server. The current annotation file with GO annotations for
most Affymetrix chips is downloadable from www.affymetrix.com and it can be
automatically attached to the input data.
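As a rough illustration of the tab-delimited import path, the Python sketch below reads a small table with the standard `csv` module. The file layout is an assumption, since the paper does not specify the format HCE expects: here the first row holds column names, the first column holds an item identifier, and the remaining columns hold numeric expression values. The function name and the sample data are hypothetical.

```python
import csv
import io


def read_expression_table(stream):
    """Parse tab-delimited text into sample names and per-item value lists."""
    reader = csv.reader(stream, delimiter="\t")
    header = next(reader)
    sample_names = header[1:]          # first column is assumed to be the identifier
    profiles = {}
    for row in reader:
        if not row:                    # skip blank lines
            continue
        profiles[row[0]] = [float(value) for value in row[1:]]
    return sample_names, profiles


# Hypothetical input; a real file would be opened with open(path, newline="").
text = "id\tday_0\tday_3\ngene_a\t1.5\t2.0\ngene_b\t0.1\t-3\n"
sample_names, profiles = read_expression_table(io.StringIO(text))
print(sample_names)
print(profiles)
```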
Cluster analysis has been the focus of numerous
research projects conducted in various fields.
It reveals the underlying structure of an input data set, interesting unusual patterns, and potential outliers. Understanding the clustering result has been a
tedious process of checking items one by one.
With HCE 3.0, we believe users can quickly apply their own or external
domain knowledge to interpret a cluster through coordinated visual displays.
This paper presented two coordinated views to
incorporate users’ domain knowledge with visual analysis of the data set and
clustering results. First, when users
know an approximate pattern of a candidate group of interest, they can use the
parallel coordinates view to quickly compose the search pattern according to
their domain knowledge and run a direct manipulation search. Second, when there is well-understood and meaningful tabular or
hierarchical information for their data set, they can utilize other
researchers’ knowledge to make interpretations based on the clustering
result. Well-designed interactive
coordination among visual displays helps users to evaluate and understand the
clustering results as well as the data set by visually facilitating human
intuition.
This work is part of our continuing effort to give users more control over data analysis processes and to enable more interaction with analysis results through interactive visual techniques. These efforts are designed to help users perform exploratory data analysis, establish meaningful hypotheses, and verify results. In this paper, we showed how these visualization methods can help molecular biologists analyze and understand multidimensional gene expression profile data. Empirical validation on standard tasks, more case studies with biological researchers, and feedback from users will help refine this and similar software tools.
1. A. Inselberg and T. Avidan, “Classification and visualization for
high-dimensional data,” Proc. 6th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 2000, pp. 370-374.
2. E. Kandogan, “Visualizing multi-dimensional clusters,
trends, and outliers using star coordinates,” Proc. 7th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, 2001, pp. 107-116.
3. J. Seo and B. Shneiderman, “Interactively exploring hierarchical
clustering results,” IEEE Computer, Vol. 35, No. 7, 2002, pp. 80-86.
4. A.D. Baxevanis, “The Molecular Biology Database Collection: 2003 update,” Nucleic
Acids Research, Vol. 31, 2003, pp. 1-12.
5. A. Butte, “The use and analysis of microarray data,” Nature Reviews Drug
Discovery, Vol. 1, No. 12, 2002, pp. 951-960.
6. Gene Ontology Consortium, “Gene Ontology: tool for the unification of biology,” Nature Genetics, Vol. 25, 2000, pp. 25-29.
7. H. Hochheiser and B. Shneiderman, “Visual specification of queries for
finding patterns in time-series data,” Proceedings of Discovery Science,
Springer, Berlin, 2001, pp. 441-446.
8. J. Seo, M. Bakay, P. Zhao, Y. Chen, P. Clarkson, B. Shneiderman, and E.P. Hoffman, “Interactive Color Mosaic and Dendrogram Displays for Signal/Noise Optimization in Microarray Data Analysis,” Proc. IEEE International Conference on Multimedia and Expo, 2003, pp. III-461~III-464.
9. P. Zhao, J. Seo, Z. Wang, Y. Wang, B. Shneiderman, and E.P. Hoffman, “In vivo filtering of in vitro
MyoD target data: An approach for identification of biologically relevant novel
downstream targets of transcription factors,” Comptes Rendus Biologies,
Vol. 326, Issues 10-11, October-November 2003, pp. 1049-1065.