Interface and Data Architecture for
Query Preview in
Networked Information Systems

Khoa Doan, Catherine Plaisant, Ben Shneiderman#, and Tom Bruns

Human-Computer Interaction Laboratory
University of Maryland Institute for Advanced Studies
# Department of Computer Science and Institute for Systems Research

HCIL/UMIACS A.V. Williams Building
University of Maryland, College Park, MD 20742

e-mail: plaisant@cs.umd.edu
http://www.cs.umd.edu/projects/hcil




ABSTRACT

There are numerous problems associated with formulating queries on networked information systems. These include data diversity, data complexity, network growth, varied user base, and slow network access. This paper proposes a new approach to a network query user interface which consists of two phases: query preview and query refinement. This new approach is based on the concepts of dynamic queries and query previews, which guide users in rapidly and dynamically eliminating undesired datasets, reducing the data volume to a manageable size, and refining queries locally before submission over a network. Examples of two applications are given: a Restaurant Finder and a prototype for NASA's Earth Observing Systems--Data Information Systems (EOSDIS). Data architecture is discussed and users' feedback is presented. Dynamic queries and query previews provide solutions to many existing problems in querying networked information systems.

Keywords: user interface, direct manipulation, dynamic query, information system, metadata, query preview, query refinement, science data, NASA EOSDIS.

INTRODUCTION

The exploration of networked information resources becomes increasingly difficult as the volume of data grows. We identified at least the following problems of information retrieval in networked environments:

· Data Volume: The amount of data available is rapidly increasing. For example, some sensor data in NASA's Earth Observing Systems is growing at the rate of gigabytes per day. Organizing and indexing the volume of new datasets is difficult. Many users in search of specific information have little interest in most of the available data. A rapid way to focus on information of interest is needed.

· Data Diversity: Datasets come in a variety of forms, such as text, image, audio, movies, or combinations of these. Some formats are application specific, making it difficult for search and retrieval tools to identify and categorize them.

· Data Complexity: Some datasets have dynamic, non-traditional structures. The attributes and content are also dynamic and complex. But in many cases simplifying the complexity and making datasets accessible via simple search is desired.

· Slow Network Access: Slow network access is a well-known problem of information retrieval in networked environments. When network traffic is high, data transmission suffers considerable delay. Therefore, it is desirable to minimize the network activities.

· Network Growth: Networks like the Internet are large and rapidly growing. Changing technologies make it difficult to standardize information organization and discovery. New network technology brings fundamental changes to information repositories (e.g. the World-Wide Web). Therefore, it is important to design a schema or standard that can be extended to capture the structure and semantics of new information and can be adapted to work with new network technologies.

· Varied User Base: Along with network growth, the user population is also rapidly increasing in size and diversity. It is challenging to provide an interface to serve users with diverse backgrounds and needs.

In this paper, we present a dynamic query user interface to support efficient query formulation for networked information systems using dynamic queries and query previews.

Dynamic queries build on the work of graphical query interfaces based on aggregation/generalization hierarchies [WS93, Shn94]. Dynamic query user interfaces apply the principles of direct manipulation to query formulation and imply:

· visual representation of the query

· visual representation of the results

· rapid, incremental, and reversible control of the query

· selection by pointing, not typing

· immediate and continuous feedback

Dynamic queries involve the interactive control by a user of visual query parameters that generate rapid, animated, visual displays of database search results. As users adjust sliders or buttons, results are updated (within 100 msec.) on the display.

The enthusiasm users have for query previews emanates from the sense of control they gain over the query. Empirical results have shown that dynamic queries are effective for novice and expert users to find trends and spot exceptions [Will93].

Early implementations of dynamic queries used relatively small datasets of a few thousand datapoints as they required the data to be stored in memory to guarantee rapid update of the display. We worked on algorithms and data structures that allow larger datasets to be handled (up to 100,000 datapoints) [Tan96] but slow network performance and limited local memory are obstacles when trying to use dynamic queries for large distributed datasets.

Query previews offer a solution to this problem. A simple example of query previews, the Restaurant Finder, is described first to illustrate the basic principles. Then the two-phase query formulation process and a system architecture are presented. A dynamic query user interface prototype for NASA's EOSDIS (Earth Observing Systems - Data Information Systems) is used to show how this approach has been applied. Evaluation and users' feedback are reported. Finally, related work, conclusions, and future work are presented.

QUERY PREVIEWS

Traditionally, there are two strategies for information seekers to quickly and efficiently obtain data from large information systems [Mar95]. Analytical strategies depend on careful planning, the recall of query terms, relevant iterative query formulation and examination of results. Browsing strategies are heuristic and opportunistic, and depend on recognizing relevant information. Analytical strategies require users to have a good knowledge of the application domain, and be skillful in reasoning. Browsing can be difficult when the volume of data to be browsed is large.

Keyword-oriented or form-based interfaces are widely used today for formulating queries on networked information systems. They often generate zero-hit queries, or query results that contain a large number of datasets through which users still have to browse. Users can limit how much data a query should return (e.g. 20 "hits") to limit the duration of the search, but it is then impossible to estimate how much data was not returned, and how representative of the entire search space the returned data was. Users also often fail to find data if appropriate keywords cannot be guessed.

Query previews combine browsing and querying. Summary data about the database (such as the number of datasets in pre-defined categories) are used to guide users to reduce the scope of their queries and to focus only on the datasets of interest. The summary data, which will vary with the database and application, provides an overview of the database from several perspectives. It is generally orders of magnitude smaller than the database itself, and can be downloaded quickly to drive a dynamic query interface locally on the user's machine. Therefore, query previews support a dynamic query user interface where the visual display of the summary is updated in real time in response to users' selections. Users are able to formulate queries dynamically, and to rapidly reduce the number of datasets to a manageable size.

Query previews empower users to perform more complex searches by using visual strategies. The approach has many advantages:

· eliminates zero-hit queries

· reduces network activity and browsing effort by preventing the retrieval of undesired datasets

· represents statistical information of the database visually to aid comprehension and exploration

· supports dynamic queries, which help users discover dataset patterns and exceptions

· is suitable for novice, intermittent, or expert users

A SIMPLE EXAMPLE OF QUERY PREVIEW: THE RESTAURANT FINDER

The Restaurant Finder (Figure 1a and 1b) illustrates the concept of visual interaction with database statistics, the essence of dynamic query previews. The Restaurant Finder is designed to help users identify restaurants that match certain criteria. Users first specify criteria of the restaurants they want, such as type of food or price range. This reduces the number of selected restaurants to a more manageable size (Figure 1b).

Figure 1a: Restaurant Finder. Users can choose an area on the map and make choices with buttons and sliders.

Figure 1b: Restaurant Finder. The user has now selected 2 cuisine types, a price range, and a geographical area, reducing the number of restaurants to review to 95, as shown on the result bar, which is updated continuously as users adjust their queries.

The request is then submitted to the network, which retrieves more data on the selected restaurants. Users can then continue to refine their queries with additional, more specific, criteria. Consider a database of 50,000 restaurants in the mid-Atlantic region. The Restaurant Finder's user interface provides sliders and buttons for selecting desired cuisine, range of cost, range of hours, geographic regions, rating, and accepted charge cards. As selections are made, the result bar shown at the bottom of the screen changes length proportionally to the number of selected restaurants that satisfy the users' selection (possibly thousands of restaurants). Zero-hit queries are eliminated: users can quickly see if there are any Chinese restaurants open after midnight and they will rapidly realize that there are no cheap French restaurants in the DC area. Database trends are visible: users may discover that there are more Chinese restaurants than Italian restaurants, but more Italian restaurants are open after midnight. In the query preview only statistical information is downloaded from the network, allowing real time interaction and reducing network access until a useful subset of the data has been identified. Then more details will be downloaded from the network about this subset (e.g. geographical location indicated on a zoom-able local map, data for parking availability, number of seats, or handicap access) to allow users to refine their query. Finally users can click on individual restaurants and review their menus and directions to make the final selection.

MAIN EXAMPLE AND PROTOTYPE: THE CASE OF EOSDIS

We used the example of NASA's Earth Observing System Data Information System (EOSDIS) to illustrate our approach. A prototype two-phase dynamic query preview interface for NASA's EOSDIS project is described and users' feedback is reported.

EOSDIS SCIENCE DATA

Soon users (scientists, teachers, students etc.) will be able to retrieve earth science data from hundreds of thousands of datasets containing pictures, measurements, or processed data, from centers around the country. Data about the datasets (called metadata) is available and is used to search for useful datasets. Standard EOSDIS metadata includes spatial coverage, time coverage, type of data, sensor type, campaign name, level of processing etc. Classic form fill-in interfaces for EOSDIS (Figure 2) permit searches of the already large holdings but zero-hit queries are a problem and it is difficult to estimate how much data is available on a given topic and what to do to increase or reduce the result set.

PROTOTYPES

An early draft of our proposed two-step approach was implemented in Visual Basic and described in [DPS96]. Later, a more complete prototype was implemented in Tcl/Tk (available in video [PBD96]), and more recently a Java implementation was prepared to demonstrate the feasibility on the World-Wide Web (WWW). The prototype consists of two dynamic interfaces: query preview and query refinement. The data shown in the current prototype is fictitious and we are now working with NASA to include current summary data in the query preview implementation at two data centers.

Figure 2: Classic form fill-in interfaces for EOSDIS permit searches of the already large holdings, but zero-hit queries are a problem and it is difficult to estimate how much data is available on a given topic.

EOSDIS QUERY PREVIEW

In the Query Previewer (Figure 3) users select rough ranges for three attributes: geographical location (a world map with 12 regions is shown at the top of the screen), parameters (a menu list of parameters such as vegetation, land classification or precipitation), and temporal coverage (in the lower third of the screen). The spatial coverage of datasets is generalized into continents and oceans. The temporal coverage is defined by discrete years.

The number of datasets for each parameter, region, and year is shown on preview bars. The length of the preview bars is proportional to the number of the datasets containing data corresponding to the attribute value. At a glance we can see that the datasets seem to cover all areas of the globe but there is more data on North America than South America, and that parameters and years are covered relatively uniformly in this hypothetical EOSDIS dataset collection. The result preview bar, at the bottom of the interface, displays the total number of datasets.

Note that only rough queries are possible since the spatial coverage of datasets is generalized into continents and oceans and the temporal coverage is defined by discrete years.

A query is formulated by selecting attribute values. As each value is selected, the preview bars in the other attribute groups adjust to reflect the number of datasets available.

For example, a user might be interested only in datasets that contain data for North America, which are selected by clicking on the North America checkbox (left of the map) or by clicking on the image of North America on the map. The interface changes immediately (in a few milliseconds) in response to this selection (see Figure 3b). First, the preview bars for the other parameter groups (dataset attributes and years) change to reflect the distribution of datasets for North America only. The query preview bar at the bottom of the interface changes size to illustrate the number of datasets selected by picking North America (660 in this example).

The user continues to define a preview query by selecting from other parameter groups. In this example, the user picks the two largest attributes illustrated for North America, "Vegetation" and "Land Classification" (see Figure 3b and c). The preview bars in the spatial and year parameter groups adjust to reflect the new query, showing the number of datasets having vegetation or land classification data in North America.

The OR operation is used within an attribute, the AND operation between attributes [WS93]. Those AND/OR operations are made visible by the behavior of the bars, which become smaller when an attribute is specified for the first time (e.g. picking the first year) and longer when additional values are added for a given attribute (e.g. when more years are added).
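
The bar behavior described above can be sketched in a few lines of Python. This is only an illustration, not the prototype's code: the sample records, the attribute names, and the helpers `matches`, `result_count`, and `preview_bars` are all hypothetical, and each dataset is assumed to carry a set of values per attribute.

```python
from collections import Counter

# Hypothetical records: every attribute holds a set of values.
DATASETS = [
    {"region": {"North America"}, "parameter": {"Vegetation"}, "year": {1986, 1987}},
    {"region": {"North America"}, "parameter": {"Precipitation"}, "year": {1987}},
    {"region": {"Africa"}, "parameter": {"Vegetation"}, "year": {1986}},
    {"region": {"North America", "Europe"}, "parameter": {"Land Classification"}, "year": {1988}},
]

def matches(dataset, selection, skip=None):
    """AND between attributes, OR within an attribute.

    An attribute with no selected values imposes no constraint.
    """
    for attr, chosen in selection.items():
        if attr == skip or not chosen:
            continue
        if not (dataset[attr] & chosen):  # none of the selected values present
            return False
    return True

def result_count(datasets, selection):
    """Length of the result bar: datasets satisfying the whole query."""
    return sum(1 for d in datasets if matches(d, selection))

def preview_bars(datasets, selection, attr):
    """Bars for one attribute group, filtered by the other groups only."""
    bars = Counter()
    for d in datasets:
        if matches(d, selection, skip=attr):
            bars.update(d[attr])  # count the dataset once per value it holds
    return bars
```

With these sample records, selecting North America alone leaves three datasets. Adding "Vegetation" shrinks the result to one (AND between attributes), and adding "Land Classification" to the same group grows it back to two (OR within an attribute).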

Continuing, the user further reduces the number of selected datasets by choosing specific years, in the example 1986, 1987, and 1988, three years which have data as shown on the preview bar (Figure 3d). These selections change the number of datasets in the other parameter groups, and the preview bars are updated.

When the "Submit" button is pressed the query previewer submits the specified rough query to the EOSDIS search engine and the metadata of the datasets that satisfy the query is downloaded for the query refinement phase. In the example the query previewer had narrowed the search to 66 datasets.


Figure 3a: The query preview screen displays summary data on preview bars. Users learn about the holdings of the collection and can make selections over a few parameters (here geographic, environmental parameter and year).




Figure 3b: The query preview screen displays summary data on preview bars. Users learn about the holdings of the collection and can make selections over a few parameters (geographic, environmental parameter and year). Here the user has selected North America and all preview bars are updated.


Figure 3c: Vegetation and Land Classification are now selected. The preview bars show which years have data.




Figure 3d: Three years (1986, 1987, and 1988) have been selected. The result bar shows that an estimated 66 datasets will satisfy this query. The query can now be submitted.

EOSDIS QUERY REFINEMENT

The query refinement interface supports dynamic queries over the metadata, i.e. over all the attributes of the datasets including: the detailed spatial extent and temporal interval, parameters measured in the dataset, the sensor used to generate the dataset, the platform on which the sensor resides, the project with which the platform is associated, the data archive center where the data is stored, and data processing level which indicates raw sensor data (level 0) to highly processed data (level 4).

A temporal overview of the datasets is given in the top left of the screen (Figure 4a). Each dataset is now individually represented by a selectable line. Controls are provided to select values for the common attributes: the data archive center, project, platform, sensor, and data processing level. Besides those common attributes, additional attributes can be included in the metadata, but since the number of attributes may be large, menu access needs to be provided for those less common attributes. At the bottom of the screen a table lists all the datasets and gives exact values for the attributes. This table is most likely very wide to show all attributes.

In the refinement phase of the query, users can select precise values for the attributes. The map, already zoomed to the area selected in the query preview, should be zoom-able to allow precise selection. The time line of the overview, already narrowed to the years selected in the query previewer, can be re-scaled to specify narrower periods of interest.

Figure 4a: In the query refinement users can browse all the information about individual datasets. The result set can be narrowed again by making more precise selections on more attributes.

In this second dynamic query interface the result of the query is immediately visualized on the overview. As attribute values are selected, the number of lines on the overview changes to reflect the query within a few milliseconds, since there is no access to the network.

All controls are tightly coupled to:

Figure 4b: Partial screen showing highlighted parameter values corresponding to a dataset selected on the timeline overview.

In Figure 4c the number of datasets was reduced by selecting the processing levels 2 and 3, two archive centers, and three projects. More details about a dataset such as descriptive information and sample data can be retrieved on demand from the network before the decision to download a full dataset is made. The Java implementation also illustrates the benefit of the World-Wide Web by allowing interface objects to act as links to relevant WWW information sources. For example, each platform name is linked to a NASA page containing information about that platform.

Figure 4c: Here the query has been refined by selecting 2 archive centers, 3 projects and 2 processing levels. More filtering could be done by zooming on the timeline or on the map. The timeline overview and the dataset table reflect the remaining datasets. Details and sample images can be downloaded from the network (window on the right) before the long process of ordering the large datasets.

SYSTEM ARCHITECTURE

The architecture supporting the two-phase query formulation consists of three layers: interface, local storage, and network (Figure 5).

At the interface layer, users formulate and refine queries as described above. The query preview and query refinement interfaces provide a visual representation of the preview statistics, selected datasets, and query parameters.

The local storage layer maintains the data used to drive the dynamic query interfaces of the interface layer. This data consists of volume preview statistics for the query preview, and dataset metadata for the refinement. When a user initiates a query preview session, preview statistics are downloaded from the networked databases and a volume preview table is constructed.

The network layer is where the network activities take place. These network activities include updating the volume preview tables, providing the metadata for datasets selected from a query preview, and retrieving the details of a dataset selected in the query refinement.

Figure 5: Architecture of two-phase dynamic query approach for networked information systems.

Volume Preview table

The size and dimensionality of the volume preview table is a function of the number of preview attributes and the number of discrete preview values for each attribute. Consider a Restaurant Finder with three preview attributes: cuisine type, rating, and accepted credit cards. Imagine five types of cuisine, four ratings, and two acceptable credit cards. In the simplifying case where each restaurant's attribute can only take a single value the volume preview table would be a five-by-four-by-two table, with a total of 40 combinations. But in our example of the Restaurant Finder, allowable credit cards may be grouped. The cells of the volume preview table must be independent so there must be cells for each possible combination of credit cards. Two credit cards create four possible combinations (including neither being acceptable), so the Volume Preview table has five-by-four-by-four or 80 combinations. Each cell in the table (i.e. each attribute value combination) holds an integer representing the number of restaurants in the database for that particular combination. In Figure 6, corresponding to the "three-star rated" restaurants, the cell for 3-star Indian restaurants that accept Visa and MasterCard holds the value 98.

Such tables are used to update preview widgets in the query preview phase interface.
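
As a rough illustration of how such a table could be built, the following Python sketch counts restaurants per (cuisine, rating, card combination) cell. The record layout, the rating values 1 to 4, and the helper names are assumptions made for the example. Only the dimensions (five cuisines, four ratings, two cards) come from the text.

```python
from itertools import combinations, product

CUISINES = ["French", "Mexican", "American", "Indian", "Italian"]
RATINGS = [1, 2, 3, 4]          # assumed rating scale
CARDS = ["Visa", "MC"]

def card_combinations(cards):
    """All subsets of the accepted cards, including the empty set (no card)."""
    return [frozenset(c)
            for r in range(len(cards) + 1)
            for c in combinations(cards, r)]

def build_volume_preview(restaurants):
    """Map every (cuisine, rating, card subset) cell to a restaurant count."""
    cells = product(CUISINES, RATINGS, card_combinations(CARDS))
    table = {cell: 0 for cell in cells}
    for r in restaurants:
        # Each restaurant falls into exactly one cell, so cells stay independent.
        table[(r["cuisine"], r["rating"], frozenset(r["cards"]))] += 1
    return table

# Hypothetical sample data.
restaurants = [
    {"cuisine": "Indian", "rating": 3, "cards": ["Visa", "MC"]},
    {"cuisine": "Indian", "rating": 3, "cards": ["MC", "Visa"]},
    {"cuisine": "French", "rating": 1, "cards": []},
]
table = build_volume_preview(restaurants)
```

The table always has 5 x 4 x 4 = 80 cells regardless of how many restaurants are counted, which is why its size does not grow with the database.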

N preview attributes yield an N-dimensional Volume Preview table. The total size of the table is many orders of magnitude smaller than the size of the database, or the size of the datasets' metadata. Furthermore, the volume preview table does not change size as the database grows. The size of the volume preview table allows it to be loaded into local high-speed storage to support dynamic queries in the query preview phase.

Controlling the size of the table

Nevertheless, the number of attributes and the number of possible values need to be carefully chosen if the objects being searched (e.g. restaurants or datasets) can take any combination of values for their attributes. In the simple case of the Restaurant Finder, each restaurant could have a combination of credit cards. The interface widget only had 2 buttons for credit cards but the volume preview table needed 4 rows to represent the combinations. In the case of EOSDIS a given dataset can contain measurements of several parameters, covering several areas over several years. In the worst case (i.e. if all combinations are possible) the size of the preview table could become 2^12 x 2^12 x 2^10 (for 12 areas, 12 parameters and 10 time periods) which would lead to megabytes of data, much too large to load over the network and use in the previewer.


             French  Mexican  American  Indian  Italian
None             18        9         5       6        8
Visa             45       22        40      34       23
MC               12       56        40      23       12
Visa & MC        80       90       120      98      160

Figure 6: A slice of the volume preview table for an example Restaurant Finder. This 2D table results from specifying one of three preview attributes. In this case, the third attribute, rating, has been specified. This table is used to update preview statistic widgets in the query preview phase interface.

A first solution is to ignore the possible combinations and count a dataset that has 2 parameters twice, once in the cell for each parameter it contains. This will result in correct individual preview bars (e.g. the preview bar for 1990 really gives the total number of datasets that have any data for that year) but will inflate the total result preview bar, since some datasets are counted multiple times. This might be acceptable if combinations are a small proportion of the data, which is likely to be common because of the high granularity of the selections in the query previewer.
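
The trade-off of this first solution can be seen on a toy example. The Python sketch below uses three hypothetical datasets and a single attribute (year); the helper names are ours and are not part of the prototype.

```python
from collections import Counter

# Hypothetical datasets; each may cover several years.
DATASETS = {
    "A": {1989, 1990},
    "B": {1990},
    "C": {1990, 1991},
}

def per_value_counts(datasets):
    """Count a dataset once in each cell whose value it contains."""
    counts = Counter()
    for years in datasets.values():
        counts.update(years)
    return counts

def estimated_total(counts, selected):
    """Result bar obtained by summing the selected cells (may double count)."""
    return sum(counts[y] for y in selected)

def true_total(datasets, selected):
    """Exact number of datasets with data for at least one selected year."""
    return sum(1 for years in datasets.values() if years & selected)

counts = per_value_counts(DATASETS)
```

Here the bar for 1990 is exact (three datasets), but selecting 1989 and 1990 together gives an estimated total of four while only three distinct datasets exist, because dataset A is counted in both cells.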

Another more accurate solution to the problem is to analyze the number of combinations, either by looking at the type of attribute (e.g. year combinations are typically year ranges, reducing the number of combinations to 55 instead of 1024 for 10 values), or because of the distribution of the data itself (e.g. EOSDIS parameters are grouped into only a limited number of compatible combinations).
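
The reduction from 1024 to 55 can be checked by enumeration. In this Python sketch the ten year values are hypothetical. Note that 1024 counts every subset of ten values, including the empty one, while 55 counts the non-empty contiguous ranges.

```python
def contiguous_ranges(values):
    """All (start, end) pairs of consecutive runs, with start <= end."""
    return [(values[i], values[j])
            for i in range(len(values))
            for j in range(i, len(values))]

def subset_count(values):
    """Number of arbitrary combinations of values, including the empty set."""
    return 2 ** len(values)

years = list(range(1986, 1996))          # ten hypothetical years
n_ranges = len(contiguous_ranges(years)) # 10 * 11 / 2 = 55
n_subsets = subset_count(years)          # 2 ** 10 = 1024
```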

The first solution has the advantage of keeping the size of the volume preview table very small (e.g. 12x12x10 integers for our EOSDIS prototype, i.e. much smaller than the world map graphic); the second gives a more accurate preview but requires more time and space.

In our current prototype we chose to simply duplicate datasets because we did not have access to large amounts of real EOSDIS metadata. The attributes were arbitrarily selected. However, it is not difficult to replace the set of attributes used in the prototype. We are now working with the operation data center to select attributes and value ranges that will lead to reasonably sized preview tables.

To summarize, volume preview tables can become rather large if combinations are to be previewed accurately or if large numbers of previewing attributes or attribute values are chosen. But a benefit of the query preview technique is that it always remains possible to reduce the number of attributes or the granularity of the selections so that query preview is possible, allowing users to reduce the scope of the query in an informed and rapid way. The size of the preview table can also be adapted to users' work environment (network speed, workstation type) or preferences.

Updating the Volume Preview table

Since the data of the networked information system changes regularly, Volume Preview tables have to be updated. Our approach depends on the data providers being willing and able to produce and publish Volume Preview tables on a regular basis (weekly, daily or hourly depending on the application), or on third party businesses running series of queries to build the tables. Since the previewer is only meant to enter rough queries, it is acceptable to use slightly out of date volume tables. The query previewer interface needs to make clear that the volume preview is an approximation of the real volume and give the "age" of the statistical information used. When the rough query is submitted, the (up-to-date) databases are queried and will return up-to-date data for the query refinement. At this point the number of datasets returned might be slightly different than predicted by the query preview. This might be a problem when the query preview predicts zero hits while a new dataset that would answer the query has just been added to EOSDIS. This risk has to be evaluated and adequate scheduling of the updates enforced. The Cubetree implementation of datacubes [RKR97] seems a promising data structure as it has efficient query update.

Limiting the download of metadata

Unless users have large amounts of memory available and very rapid network connection it is preferable not to submit the previewer's query unless the result set has been reduced significantly. The submit button can be disabled when the number of datasets is above a recommended level (75 in our current prototype) which restricts the downloading of large amounts of metadata over the network. The actual number would be tuned to network and user platform characteristics.
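
A minimal sketch of this rule in Python follows. The constant and function names are hypothetical; only the threshold of 75 comes from the current prototype, and it is expected to be tuned per deployment.

```python
RECOMMENDED_MAX = 75  # level used in the current prototype; meant to be tuned

def submit_enabled(estimated_datasets, limit=RECOMMENDED_MAX):
    """Enable Submit only when the previewed result set is at or below the limit."""
    return estimated_datasets <= limit
```

With this rule the 660 datasets previewed for North America alone would keep the button disabled, while the 66 datasets of the refined example query would enable it.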

Merging volume preview tables

In the case of distributed information systems like EOSDIS, the volume preview table may have to be built by merging a series of volume preview tables prepared by individual data centers. An algorithm is proposed to accomplish this task:


LIMITATIONS OF THE CURRENT EOSDIS PROTOTYPE

The present implementation of the query refinement interface has several limitations. The implementation of the query refinement overview will not scale up very well when more than 100 datasets are returned from the query preview. The timeline of intervals will get too tall and occupy too much screen space if intervals are not allowed to overlap. Better methods of handling large numbers of intervals are needed. Possible directions include: zooming, optimizing the line packing to make use of screen space, or using line thickness to indicate overlaps. The quantitative and qualitative overview of the large number of datasets is needed to monitor their filtering, but the ability to select individual lines is important when numbers have decreased enough to require browsing of individual datasets one by one.

In our EOSDIS prototype the zooming and panning of the overview has no filtering effect, but we have implemented other examples which demonstrate the benefit of the technique (e.g. for the Library of Congress historical special collections browsing [PMB96]). Similarly, the filtering by geographical location has not been developed yet in the query refinement. Zooming and selecting rectangular areas is easy, but more sophisticated selection mechanisms drawn from geographical information systems are probably necessary.

The query preview allows users to specify the most common Boolean queries (OR within attributes and AND between attributes). There is no flexibility to specify different operations. This is appropriate since the query preview is only meant to be a rough query, but more precise control over the Boolean combinations needs to be provided in the query refinement. Our current prototype does not offer such capability yet. Menu options can be provided to change the "behavior" of widgets, or graphical tools can be provided to allow Boolean combination of the widgets [Young93].

EVALUATION AND USER FEEDBACK

The prototype dynamic query preview interface was presented to subjects as part of a Prototyping Workshop organized by Hughes Applied Information Systems (HAIS) in Landover, MD [Pos96]. A dozen NASA earth scientists who use EOSDIS to extract data for their research participated in the evaluation and reviewed several querying interfaces during the day.

The hands-on review of our prototype lasted about an hour and a half. Groups were formed with two or three evaluators and an observer / note-taker in each group. They received no training but were given 5 directions or starting points to explore the prototype. For example, one direction was to "Examine the relationship between the map at the top and the data shown on the bottom half of the window. Try selecting a geographic region and various attributes. How are the data displayed?" Evaluators were encouraged to "think aloud" during the session and their comments and suggestions were recorded. The evaluation took place in a single room and HCIL members were able to go from group to group answering questions and collecting comments.

The 12 professionals reacted positively to the new concepts in the query preview and query refinement interfaces. They agreed that the visual feedback provided in the query preview interface allows users to understand the distribution of datasets. A group of evaluators remarked that it would be an effective tool for subjects who did not know what data exists and what does not. Others remarked that some users would not even need to go to the refinement phase as they would realize immediately that no data was available for them. The query preview interface was said to "allow to select data, see relationships among data, and explore available resources".

Subjects were also favorably inclined to the interval overview concept -- the ability to display the results in a two-dimensional space. Subjects liked being able to select or deselect processing levels and see the changes in the overview. The tight coupling among different tables supports the discovery of relationships among attributes of datasets. Subjects felt that the prototype "led the user", and was "an intuitive way to search data".

Some subjects suggested that the map regions and selectable attributes be customizable so users could interact with information in which they are interested (different specialties may require different query preview attributes). There were questions about the scalability of the prototype and the computing power required by client software to implement the designs on a large scale.

At the time of the test, the prototype was set to perform an AND operation within an attribute. This meant that clicking on 1991 just after a click on 1990 would result in all the bars being shorter (since the query had been restricted to the datasets which had data about 1990 AND 1991). After some confusion, all groups of evaluators were able to figure out that an AND was being performed by seeing the bars grow or shrink. But it was clear that they had expected the interface to perform an OR within an attribute (i.e., retrieving all datasets having data from 1990 or 1991). This was an important change made to the prototype following the evaluation. This anecdote confirms that the visual feedback helped users understand the operations performed by the system.
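The difference the evaluators observed can be reproduced with a small count over a single attribute. The sketch below uses made-up year coverage for four hypothetical datasets; it is not data from the workshop, and the function names are assumptions.

```python
# Hypothetical year coverage for four datasets (illustrative only).
COVERAGE = {
    "d1": {1990},
    "d2": {1990, 1991},
    "d3": {1991},
    "d4": {1992},
}


def count_and(selected_years):
    """AND within the attribute: a dataset must cover every selected year."""
    return sum(1 for years in COVERAGE.values() if selected_years <= years)


def count_or(selected_years):
    """OR within the attribute: a dataset must cover at least one selected year."""
    return sum(1 for years in COVERAGE.values() if selected_years & years)


# First click selects 1990; second click adds 1991.
for clicks in ({1990}, {1990, 1991}):
    print(sorted(clicks), "AND:", count_and(clicks), "OR:", count_or(clicks))
```

With these sample values, the AND count drops from 2 to 1 after the second click, whereas the OR count grows from 2 to 3, which is the behavior the evaluators had expected.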

After the evaluation, subjects were given a questionnaire with 30 positive statements to be rated from 1 to 5 (with 1 = strongly disagree, 3 = average, and 5 = strongly agree). The average for all thirty positive statements was 3.6. For example, the statement "Being able to preview the anticipated query results prior to submitting my query is an important capability" had an average rating of 4.4. The statement "The interval overview is a good tool for displaying information about databases" received an average rating of 3.6.

For a complete list of subject comments and questionnaire results, see [Pos96].

SUMMARY AND DISCUSSION

The two examples we described illustrate a query formulation process for networked information systems consisting of two phases: query preview and query refinement.

Query Preview

In the query preview phase, users form a rough query by selecting rough values over a small number of attributes. The scope of the query is large, but the resolution is limited (see Figure 7). Statistical information is maintained for each of the query preview attributes, and cross-referenced to all other preview attributes.

The total number of items selected by the user's query is visualized on a "result preview bar" (at the bottom of the screen for both the EOSDIS and restaurant finder examples). Preview statistics can also be rendered on maps, charts, etc., as illustrated in the EOSDIS prototype. These renderings must be dynamic, since the statistics must change within milliseconds in response to user input.
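One way to keep the result preview bar responsive is to precompute a table of counts for every combination of preview attribute values and to sum the relevant cells locally on each selection. The sketch below assumes single-valued preview attributes and invented records; the names `build_preview_table` and `preview_count` are hypothetical and do not come from the prototypes.

```python
from collections import Counter
from itertools import product

# Hypothetical preview attributes and item records (one value per attribute).
ATTRIBUTES = ("cuisine", "price")
ITEMS = [
    {"cuisine": "French", "price": "high"},
    {"cuisine": "French", "price": "medium"},
    {"cuisine": "Thai", "price": "low"},
    {"cuisine": "Thai", "price": "medium"},
    {"cuisine": "Thai", "price": "medium"},
]


def build_preview_table(items):
    """Count items for each combination of preview attribute values.

    This table is the only data the client needs during the preview phase.
    """
    return Counter(tuple(item[a] for a in ATTRIBUTES) for item in items)


def preview_count(table, selection):
    """Total hits for a selection, computed from the table alone.

    `selection` maps an attribute to its selected values; an attribute that
    is missing or empty is unconstrained (all of its values are allowed).
    """
    domains = []
    for position, attribute in enumerate(ATTRIBUTES):
        chosen = selection.get(attribute)
        if not chosen:
            chosen = {key[position] for key in table}
        domains.append(chosen)
    # A Counter returns 0 for combinations that never occur.
    return sum(table[combo] for combo in product(*domains))


table = build_preview_table(ITEMS)
print(preview_count(table, {"cuisine": {"Thai"}, "price": {"medium", "high"}}))
```

Because every count is derived from the local table, no network access is needed while the user adjusts the selection; only the final query is sent to the database.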

Query previews require attribute values to be aggregated. For example, the spatial attribute of the EOSDIS datasets is determined by a pair of latitudes and longitudes that define its coverage area. There are millions of possible combinations of coordinates. But these datasets could be categorized into grids or highly generalized geo-political or geo-physical objects, such as continents or oceans. Then, millions of datasets could be grouped into a few categories.
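The grid-based aggregation can be sketched as follows. The 90-degree cell size, the bounding-box format, and the function name `grid_cells` are assumptions chosen for illustration; boxes that cross the antimeridian are not handled.

```python
CELL_DEG = 90  # assumed cell size in degrees; coarser cells mean fewer categories


def grid_cells(south, west, north, east, cell=CELL_DEG):
    """Return the set of (row, col) grid cells overlapped or touched by a bounding box.

    Latitudes are in [-90, 90] and longitudes in [-180, 180], with
    south <= north and west <= east (no antimeridian crossing).
    """
    rows = 180 // cell
    cols = 360 // cell

    def index(value, origin, limit):
        # Clamp so that the upper boundary (90 or 180) falls in the last cell.
        return min(int((value - origin) // cell), limit - 1)

    row_lo, row_hi = index(south, -90, rows), index(north, -90, rows)
    col_lo, col_hi = index(west, -180, cols), index(east, -180, cols)
    return {
        (r, c)
        for r in range(row_lo, row_hi + 1)
        for c in range(col_lo, col_hi + 1)
    }


# A box spanning latitudes 10 to 40 and longitudes -100 to -20.
print(sorted(grid_cells(south=10, west=-100, north=40, east=-20)))
```

With 90-degree cells the globe is divided into eight categories, and the sample box falls into two of them. A dataset would then be counted under each cell it overlaps, so the preview only has to track a handful of spatial categories instead of exact coordinates.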

Selecting appropriate attribute values or categories rapidly reduces the data volume to a manageable size. Zero-hit queries are eliminated since invalid combinations are easily reversed. Once users are satisfied with the formulated query, it is submitted over the network to the database. More details about individual items are then retrieved to refine the query.

                                   | Query preview              | Query refinement
Number of datasets                 | Very large                 | Manageable (each one is selectable for details-on-demand)
Number of attributes for selection | Few                        | More or all of the attributes
Selection of attribute values      | Rough ranges or metavalues | More precise or exact values

Figure 7: A comparison table of the two phases of the query formulation process.

Query refinement

In the query refinement phase, users construct detailed queries over all database attributes, which are applied only to those items selected in the query preview phase. The scope of the query is smaller, but the resolution is finer. The interface provides access to all database attributes and their full range of values.

A characteristic of the refinement phase is the rendering of each item in a graphical overview. The overview is closely related to the widgets used to refine a query, and reflects the query. By selecting appropriate values of relevant attributes, users continue to reduce the data volume and explore the correlation among the attributes of the items through the visual feedback. Complete details can then be obtained at any time by accessing the network for individual items.
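
Because refinement operates only on the items returned by the preview query, it can run entirely on the client. The sketch below filters locally held metadata by an exact year range and a set of processing levels; the record fields, the values, and the `refine` function are hypothetical and merely stand in for the widgets of a dynamic query interface.

```python
# Hypothetical metadata for the items returned by the preview query.
ITEMS = [
    {"id": 1, "start_year": 1988, "end_year": 1992, "level": 1},
    {"id": 2, "start_year": 1990, "end_year": 1990, "level": 2},
    {"id": 3, "start_year": 1993, "end_year": 1995, "level": 2},
]


def refine(items, year_range, levels):
    """Keep items whose time span overlaps year_range and whose level is selected.

    Runs on local data only, so it can be re-evaluated on every widget movement.
    """
    first, last = year_range
    return [
        item
        for item in items
        if item["start_year"] <= last
        and item["end_year"] >= first
        and item["level"] in levels
    ]


# Simulate a user narrowing the time range, then deselecting a processing level.
print([i["id"] for i in refine(ITEMS, (1989, 1994), {1, 2})])
print([i["id"] for i in refine(ITEMS, (1989, 1991), {1, 2})])
print([i["id"] for i in refine(ITEMS, (1989, 1991), {2})])
```

Each call corresponds to one redraw of the graphical overview; full details for an individual item would still be fetched over the network on demand.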

RELATED WORK

An early proposal for volume previews in a database search is described in [HES85]. The "Dining out in Carlton" example was provided to illustrate a search technique (for a specific restaurant) based on the volume preview of the number of the available restaurants. However, query previews were not exploited to support dynamic queries and querying in networked information systems.

Retrieval by reformulation is a method that supports incremental query formation by building on query results [Wil84]. Each time a user specifies a query, the system responds with query reformulation cues that give users an indication of how the repository is structured and what terms are used to index objects. Users can then incrementally improve a query by critiquing the results of previous queries. Rabbit [Wil84] and Helgon [FNL89] are examples of retrieval systems based on the retrieval by reformulation paradigm, which is also the basis of the two-phase query formulation approach. In the following paragraphs, different querying techniques in networked information systems are described and compared with the two-phase approach.

Harvest [BDH94] is a system that was designed and implemented to solve problems common to Internet users. Harvest provides an integrated set of customizable tools for gathering information from diverse repositories, building topic-specific indexes, and searching. Harvest could be used to maintain and update the metadata servers from which users can extract information and store it locally in order to support dynamic queries in both the query preview and query refinement phases.

However, Harvest, just like other WWW browsers, still applies the traditional querying technique based on keywords. In order to express a complex query, a more visual query interface may be effective. In [CMP95], Marmotta, a visual tool devised as a form used within WWW clients to query networked databases, is presented. The ease of use of form-based interfaces is preserved (users need not know the structure of the database). Within Marmotta, icons are used to present the domain of interest and the retrieval requests in a structured form-based interface, and the same icons are used to formulate a query. The system then translates the query into a format which can be handled by HTTP. Therefore, the Marmotta system can ensure the syntactic correctness of the query formulation.

In order to cope with the increasing data volume, in the case of libraries containing millions of documents, it is common to formulate queries on a library catalog. In [VN95], a prototype interface using a ranked-output information retrieval system (called INQUERY) for a library catalog (called Compendex, containing about 300,000 documents) has been implemented. The interface supports a visualization scheme which illustrates how the query results are related to the query words. Visualizing the results of the query keeps the user more informed about how the system computed the ranking of documents. Another technique, TileBars [Hea95], visualizes term distribution information to supplement result lists in full text retrieval systems.

Butterfly [MRC95] was developed for simultaneously exploring multiple DIALOG bibliographic databases across the Internet using 3D interactive animation techniques. The key technique used by Butterfly is to create a virtual environment that grows under user control as asynchronous query processes link bibliographic records to form citation graphs. Asynchronous query processes reduce the overhead associated with accessing networked databases, and automatically formulated link-generating queries reduce the number of queries that must be formulated by the user. The Butterfly system provides a visually appealing display. However, it was not designed to support the formulation of complex queries.

CONCLUSIONS

In this paper, the concepts of query previews and dynamic queries are presented, and two prototypes are described. The evaluation results from a NASA Prototyping Workshop support the relevance of the query preview approach to querying networked information systems. Suggestions are given to control the size of the volume preview table. A benefit of the query preview technique is that it always remains possible to reduce the number of attributes or the granularity of the selections so that a query preview is feasible, allowing users to reduce the scope of the query in an informed and rapid way.

ACKNOWLEDGMENTS

This work is supported in part by NASA (NAG 52895 and NAGW 2777) and by the NSF grants NSF EEC 94-02384 and NSF IRI 96-15534. We thank Teresa Cronnell for her graphic design of the Restaurant Finder prototype.

REFERENCES

AS94 C. Ahlberg and B. Shneiderman. Visual information seeking: Tight coupling of dynamic query filters with starfield displays. In Proc. of the ACM CHI94 Conf., 1994, pages 313-319.

BDH94 C. M. Bowman, P. B. Danzig, D. R. Hardy, U. Manber, and M. F. Schwartz. The Harvest information discovery and access system. In Proc. of the Second International Conf. on the World Wide Web, 1994, pages 763-771.

CMP95 F. Capobianco, M. Mosconi, and L. Pagnin. Progressive HTTP-based querying of remote databases within the Marmotta iconic VQS. In Proc. of the IEEE Workshop on Visualization, 199, pages 122-125.

DPS96 K. Doan, C. Plaisant, and B. Shneiderman. Query previews in networked information systems. In Proc. of the Forum on Advances in Digital Libraries. IEEE Computer Society Press, 1996, pages 120-129.

DPS97 K. Doan, C. Plaisant, B. Shneiderman, and T. Bruns. Query previews in networked information systems: a case study with NASA environment data. In SIGMOD Record, Vol. 26, No. 1, March 1997, pages 75-81.

FNL89 G. Fischer and H. Nieper-Lemke. HELGON: Extending the retrieval by reformulation paradigm. In Proc. of ACM CHI'89 Conf., 1989, pages 333-352.

Hea95 M. Hearst. TileBars: Visualization of term distribution information in full text information access. In Proc. of ACM CHI 95 Conf., Denver CO, 1995, pages 59-66.

HES85 D. L. Heppe, W. H. Edmondson, and R. Spence. Helping both the novice and advanced user in menu-driven information retrieval systems. In Proc. of British HCI85 Conf., 1985, pages 92-101.

Mar95 G. Marchionini. Information Seeking in Electronic Environments. Cambridge University Press, 1995.

MRC95 J. D. Mackinlay, R. Rao, and S. K. Card. An organic user interface for searching citation links. In Proc. of the ACM CHI95 Conf., 1995, pages 67-75.

PBD96 Plaisant, C., T. Bruns, K. Doan, and B. Shneiderman. Query Previews in Networked Information Systems: the case of EOSDIS. In CHI 97 Technical Video Program. Atlanta, GA, 1997, ACM New York. (also in HCIL 1996 video reports, HCIL/UMIACS, University of Maryland).

PMB96 Plaisant, C., Marchionini, G., Komlodi, A., Bruns, T., Campbell, L. Bringing treasures to the surface: the case of the Library of Congress Digital Library Program. Proc. of CHI 97, ACM New York, March 1997, pages 518-525.

Pos96 J. Poston. Prototype Workshop 2 (PW2) Results Report. Technical Report 167-TP-001-001, ECS Development Team, Hughes Applied Information Systems, Landover MD, 1996.

Shn94 B. Shneiderman. Dynamic queries for visual information seeking. IEEE Software 11, 6, 1994, pages 70-77.

RKR97 Roussopoulos, N., Kotidis, Y., Roussopoulos, M. Cubetree: organization of and bulk incremental updates on the data cube. To appear in Proc. SIGMOD 97.

Tan96 E. Tanin, R. Beigel, and B. Shneiderman, Incremental Data structures and algorithms for dynamic query interfaces. ACM SIGMOD Record 25, 4, Dec. 96, pages 21-24.

VN95 A. Veerasamy and S. Navathe. Querying, navigating and visualizing a digital library catalog. In Proc. of the Second International Conf. on the Theory and Practice of Digital Libraries, 1995. (URL: http://www.csdl.tamu.edu/DL95/)

WS93 W. Weiland and B. Shneiderman, A graphical query interface based on aggregation/generalization hierarchies, Information Systems, vol. 18, #4, 1993, pages 215-232.

Wil84 M. D. Williams. What makes RABBIT run? In International Journal of Man-Machine Studies 21, 1984, pages 333-335.

Will93 C. Williamson and B. Shneiderman. The dynamic HomeFinder: Evaluating dynamic queries in a real-estate information exploration system, Proc. ACM SIGIR '92 Conference, ACM, New York, NY, 1992, pages 338-346.

YOU93 D. Young, and B. Shneiderman, A graphical filter/flow representation of boolean queries: a prototype implementation and evaluation, Journal of American Society for Information Science, vol. 44, #6, July 1993, pages 327-339.