FACILITATING DATA EXPLORATION:
DYNAMIC QUERIES ON A HEALTH STATISTICS MAP

Catherine Plaisant, Human-Computer Interaction Laboratory,
Center for Automation Research
A.V. Williams Bldg., University of Maryland, College Park MD 20742

KEY WORDS:

User interface, high interaction graphics, cartographic animation, time trends.

ABSTRACT: Users with no specialized computer training are often discouraged by the complex syntax of query languages and by output consisting of long tables of alphanumeric values. The Human-Computer Interaction Laboratory has recently developed the concept of dynamic queries, which allow user control of animated visual displays of information. Experiments with our first applications have shown that dynamic queries can help reveal trends or global properties as well as assist users in answering specific questions. We present a new application developed with the National Center for Health Statistics and running on a simple PC. A thematic map of the United States is animated by adjusting sliders displayed beside the map. A time slider illustrates time trends. The other sliders filter out areas of the map according to parameters such as demographics. Detailed data about a particular area are obtained by clicking directly on its location on the map. We have received encouraging feedback from users. We also hypothesize that this new tool will facilitate the finding of confounders.

Background

Maps and epidemiology

Epidemiology deals with the incidence, distribution and spread of diseases in conjunction with causation. An important part of the material explored consists of statistical data such as those collected by the National Center for Health Statistics (NCHS). Traditionally, epidemiologists have had to sort through huge tables of averages and numerous graphs, and refer to books of detailed data, before they could start formulating hypotheses (e.g. proposing that a demographic factor might be linked to a cause of death).

More recently, thematic maps have helped epidemiologists identify "hot spots" and trends in the U.S. by providing a visualization of geographic patterns of mortality that are not apparent from tabular statistics. Printed atlases such as the Atlas of U.S. Cancer Mortality [1] have triggered many important studies. As a result of the rapid advancement of computer technology, maps are playing an increasing role in the presentation of statistical information, as they can be produced in a small fraction of the time it took before.

One of the drawbacks of a map printed on paper or in an atlas is that access to detail and exact values is lost. Since all values are split into only a few categories (e.g. 5 colors for a grouping in quintiles), users still have to refer to tables to find the exact rate in an area. Research on map graphic design (e.g. [2]) shows that careful graphic design allows the map to be tuned to favor either cluster identification or point value estimates, but a printed map has to settle on a single middle-ground design.
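
To make the loss of detail concrete, the following is a minimal sketch in C of a grouping in quintiles, assuming that areas are simply ranked by rate and cut into five classes of roughly equal size. The function name `quintile_of` and the handling of ties are illustrative assumptions, not the method of any particular atlas.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the quintile class (0 = lowest fifth,
   4 = highest fifth) of rates[i] among the n rates, by rank.
   Ties are handled by counting only strictly smaller values. */
int quintile_of(const double *rates, size_t n, size_t i)
{
    size_t rank = 0;   /* number of areas with a strictly lower rate */
    size_t k;

    assert(rates != NULL && i < n);
    for (k = 0; k < n; k++) {
        if (rates[k] < rates[i]) {
            rank++;
        }
    }
    /* rank is at most n - 1, so the result is always in 0..4 */
    return (int)(rank * 5 / n);
}
```

Two areas whose rates differ noticeably can still receive the same class, and therefore the same color, which is why a printed map alone cannot give the exact rate.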

Geographic Information Systems (GIS) are increasingly being used. They are powerful tools for preparing and printing maps but remain complex to use (i.e. difficult to master for people with no computer training). Producing a map requires typing or selecting many options and parameters, then waiting for the map to appear. GIS are aimed at producing the "best possible map" to show the result of a study or to illustrate a fact probably discovered another way. So far they are not appropriate for the rapid browsing between maps that exploration and hypothesis formulation require. In addition, GIS do not include query mechanisms.

Showing trends

Epidemiology often involves the study of time trends (to look at the spread of disease, the effect of treatment, etc.). Besides the usual tabular displays, this can be done by showing a series of maps. For example, Carat and Valleron report a study of the spread of an influenza-like illness in France using a series of six maps showing the extent of the sickness at six time points [3]. Specialized software (such as MacroMind Director) allows canned animations to be prepared in advance, and tools have been developed specifically to prepare canned map animations (e.g. [4]). But there are reports of users' frustration with canned animation, as they wish to have direct control of the animation. "Real" interactive animation still seems to be only a wish, or reserved to high-end equipment [5]. Unfortunately, the high-end equipment of information visualization research centers is typically used for extremely large datasets and therefore remains slow at producing maps. These centers tend to rely on video to simulate the animation of changes over time. Such animations are powerful for showing time trends, but once again they are used primarily to illustrate findings and not for rapid exploration of largely unknown material.

There is a need to facilitate the immediate use and exploration of newly released datasets. An example of such a tool was developed by Dunn to examine the nature of the linkage between the statistical and spatial distribution of bivariate data [6]. Egbert and Slocum also proposed a complex tool to explore maps but animation was not implemented [7]. Our approach is to give researchers tools which can be mastered without special computer training and permit rapid browsing and querying of the data.

Dynamic queries:

The Human-Computer Interaction Laboratory has developed the concept of dynamic queries, which allow user control of animated displays of information [8, 9]. Dynamic queries apply the concept of Direct Manipulation [10] to query formulation and output. The central ideas of Direct Manipulation are the visibility of the objects and actions of interest, immediate and continuous feedback, and rapid, incremental and reversible actions. The Apple Macintosh interface is considered to have made those principles better known. In the case of dynamic queries, this means toggles to make choices, sliders to set ranges, and results presented in real time on a meaningful two-dimensional space. One of our first applications of dynamic queries was the DC Home-Finder, which allowed users to locate houses for sale according to price, number of bedrooms, commuting distance, etc. Sliders were used to set acceptable ranges (e.g. a price range), and dots representing houses would appear on or disappear from a map of DC in real time according to the position of the sliders. An experiment was conducted comparing this interface with conventional printed listings and a natural language interface [11]. This experiment demonstrated the strengths of dynamic queries for complex queries, trend analysis (e.g. finding the expensive areas) and finding exceptions (e.g. a bargain house with many bedrooms).

Dynamic queries were immediately identified by NCHS as a promising tool for the exploration of health statistics. We started the development of a series of prototypes, one of which is being considered for accompanying the release of selected NCHS datasets.

General description of the interfaces

First, users select the cause of death they want to study, a population group (e.g. male/female), then three filters they want to explore in conjunction with the selected cause of death.

A map of the USA is then displayed with the mortality rate color coded for each state (e.g. the Cervix Uteri cancer rates [figure 1]). The states where the cancer rate is high are shaded red, the states where the rates are low are shaded blue, and the others are gray. (Note that on the figures the colors were changed to be readable in black and white, and high rates are simply darker.) On the right side of the screen four sliders give control over the map animation. A pointing device such as a mouse is used to "grab" a slider handle, which can then be moved laterally.

Time trends:

When the year slider is moved from 1950 to 1970 the color of each state changes in real time, reflecting each state's rate change. In the case of cervical cancer, most of the states change from red shades (high rates) to blue shades (low rates), showing the overall decline in cervical cancer rates over the years. This improvement is probably due to the increasing use of pap smears for early detection. The map shows a cluster of high rates in the east. The cluster remains present along the whole range of the time slider, as states in that area appear to stay "reddish" or gray longer than other areas as time advances. In the case of lung cancer the reverse effect is observed and the map turns from blues to reds. Unlike with canned animations, users have direct control over the speed and direction of the animation, and can replay it at will. All changes are done in real time. Because the animation is controlled by a natural movement of the hand (forward to the right, backward to the left), users can concentrate on their task (observing the map changes) and not on controlling the system.

Querying or filtering

The three other sliders allow queries to be made. For example imagine that the three chosen parameters are the average number of school years, per-capita income, and the percentage of population who are currently smokers. When users adjust the position of a slider's handle the states with values falling outside the boundaries of the sliders are set to the background color (black) and seem to disappear, while states with values within all the boundaries are left in the color determined by their corresponding cervical cancer rate.

The three filter sliders are range sliders (double-sided sliders), making it possible to look only within particular ranges of values. For example, users can look at only the states having a per-capita income between $9000 and $11000.

The sliders' filtering effects can be combined. For example, figure 2 shows a map with only the states having more than 15% of the population below the poverty level AND an average of less than 12 years of school. The speed and ease of formulation reduce the need for some kinds of queries. For example, an OR can be achieved by looking alternately at two sets (e.g. all the states with high income OR a high number of smokers can be found by asking the two queries consecutively). A special "number of deaths" filter is also available to remove from the map the areas with a number of deaths judged too small to be relevant.
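
The combined filtering rule can be summarized by a short C sketch. It is an illustration only, assuming one record per state holding its three filter values and its death count; the names `struct area`, `struct range` and `area_visible` are hypothetical and are not taken from the prototype.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_FILTERS 3   /* assumed: the three user-selected filters */

/* Hypothetical record for one state; field names are illustrative. */
struct area {
    double filter_value[NUM_FILTERS]; /* e.g. school years, income, % smokers */
    long   deaths;                    /* number of deaths behind the rate */
};

/* One double-sided slider: the currently selected [low, high] range. */
struct range {
    double low;
    double high;
};

/* Returns 1 if the area keeps its rate color, 0 if it should be set to
   the background color. An area stays visible only when every filter
   value lies inside its slider range (a logical AND) and its death
   count reaches the chosen minimum. */
int area_visible(const struct area *a,
                 const struct range sliders[NUM_FILTERS],
                 long min_deaths)
{
    int f;

    assert(a != NULL && sliders != NULL);
    if (a->deaths < min_deaths) {
        return 0;
    }
    for (f = 0; f < NUM_FILTERS; f++) {
        if (a->filter_value[f] < sliders[f].low ||
            a->filter_value[f] > sliders[f].high) {
            return 0;
        }
    }
    return 1;
}
```

In this sketch an OR is obtained the same way as in the interface: the function is evaluated once for each of the two slider settings, and the two resulting maps are viewed one after the other.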

Of course only a limited set of boolean queries can be formulated, but the technique is immediately mastered by novice users. Unlike traditional boolean query formulation (e.g. in SQL), which is well known to be error-prone and difficult to learn, specifying queries with sliders is easy to learn and cannot produce syntax or "out of range" errors.

Seeing patterns and identifying possible confounders

Because results are displayed in real time, it is possible to ask a series of neighboring queries and look at the trends in the results. For example, when the slider marked "college %" (for percentage of population with college education) is slid up from 10% (the minimum) to 28% (the maximum), all the states progressively disappear from the map. The states with a low percentage of college education disappear first, the states with a higher percentage remain colored longer, and the state with the highest percentage is the last to disappear. In the case of cervical cancer it can be seen that the states disappearing first tend to be blue and the ones remaining at the end are red, suggesting a possible correlation. In this case additional studies have shown that college educated women are more likely to be followed by a doctor and to seek diagnosis and treatment early. Statistical analysis and additional field studies are still needed to verify any hypothesis, but it is hoped that such a tool could help the process of hypothesis formulation when new datasets are released.
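
What the eye does during such a sweep can be approximated numerically. The C sketch below is only an illustration, assuming two parallel arrays with one filter value and one mortality rate per state; `mean_rate_visible` is a hypothetical name, and the mean is just one possible summary of the colors still on screen.

```c
#include <assert.h>
#include <stddef.h>

/* Mean mortality rate of the areas still visible when the lower handle
   of a filter slider sits at `low` (areas whose filter value is below
   `low` have disappeared). Returns -1.0 when no area remains. */
double mean_rate_visible(const double *filter_value, const double *rate,
                         size_t n, double low)
{
    double sum = 0.0;
    size_t visible = 0;
    size_t i;

    assert(filter_value != NULL && rate != NULL);
    for (i = 0; i < n; i++) {
        if (filter_value[i] >= low) {
            sum += rate[i];
            visible++;
        }
    }
    return visible > 0 ? sum / (double)visible : -1.0;
}
```

Calling the function for increasing values of `low` mimics sliding the handle upward. A mean that drifts steadily in one direction corresponds to the visual impression that the remaining states share a color, and, like that impression, it is only a hint that calls for proper statistical analysis.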

Access to details:

In addition, users can obtain the exact value of rates and demographics for each area by clicking on the area with the mouse. The information appears in a box where precise readings can be made, along with any other usual information (e.g. confidence intervals).

A Help button explains how to use the interface and gives information about how the data was obtained. The Reset button resets the sliders to their initial states; the Quit button is used to leave the program. Two other buttons allow users to change the cause of death or to use other filters.

Development environment:

The software is written in Borland C. None of the user interface development toolkits available at that time provided double-sided sliders, so we had to build them ourselves.

In the "state" version described above, the repainting of the map appears instantaneous when run on our 386-25 PC (i.e. all states appear to change at the same instant). On slower PCs it can be seen that the states are repainted in a certain order, but the whole map update still takes around half a second, fast enough to keep the feeling of immediate and continuous feedback. The traditional technique of flood-filling regions is too slow, so we chose the simple but powerful palette switching technique. Each state is assigned a different logical color, and a color change on the map is done by reassigning palette colors rather than by repainting the bitmap. We run in a SuperVGA 640x400x256 graphics mode, which gives us enough colors while remaining compatible with the equipment commonly available at NCHS.
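
The palette switching idea can be sketched in C as follows. This is an illustrative reconstruction, not the prototype's code: the names `struct rgb` and `update_palette`, the color values, and the convention that state s is drawn once with logical color s + 1 are all assumptions, and the step that hands the table to the display hardware is left out.

```c
#include <assert.h>
#include <stddef.h>

#define PALETTE_SIZE 256   /* 256-color mode, as in the text */
#define BACKGROUND   0     /* assumed: logical color 0 is the background */

struct rgb { unsigned char r, g, b; };

/* Assumed color scheme following the text: red = high rate,
   gray = middle, blue = low, black = filtered out. */
static const struct rgb RED   = {200,   0,   0};
static const struct rgb GRAY  = {128, 128, 128};
static const struct rgb BLUE  = {  0,   0, 200};
static const struct rgb BLACK = {  0,   0,   0};

/* Recolor the map by rewriting palette entries only; the pixels, which
   hold each state's logical color, are never touched.
   level[s]:   0 = low rate, 1 = middle, 2 = high.
   visible[s]: 0 if the state is filtered out by the sliders. */
void update_palette(struct rgb palette[PALETTE_SIZE],
                    const int *level, const int *visible,
                    size_t num_states)
{
    size_t s;

    assert(palette != NULL && level != NULL && visible != NULL);
    assert(num_states < PALETTE_SIZE);   /* one entry per state + background */

    palette[BACKGROUND] = BLACK;
    for (s = 0; s < num_states; s++) {
        struct rgb c = BLACK;
        if (visible[s]) {
            if (level[s] == 2) {
                c = RED;
            } else if (level[s] == 0) {
                c = BLUE;
            } else {
                c = GRAY;
            }
        }
        palette[s + 1] = c;
    }
}
```

The cost of an update in this sketch depends only on the number of states, not on the number of pixels they cover, which is what makes the technique attractive on slow hardware. It also shows the limitation: the number of areas cannot exceed the number of palette entries.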

Smaller areas:

A more recent prototype uses 800 smaller areas. This makes visible clusters which can be hidden in the state map. We had to abandon the pure palette technique because the number of areas was larger than the number of colors in the graphics mode available on the target equipment (i.e. the ordinary NCHS staff equipment). The 800-area version uses a mix of palette switching (for the large areas) and bitmap mask painting (for the smaller areas). Unfortunately it pushes the graphics capabilities of an ordinary PC (and of an ordinary PC programmer) to their limit. It takes more than a second to refresh the map, and some of the dynamism of the interface is lost. Despite those difficulties we feel confident that better algorithms exist to improve the speed of update. The quality of certain animations found in commercial video games demonstrates that expert graphics programmers can do magic with ordinary PCs.
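
For the small areas, a bitmap mask paint might look like the following C sketch. The mask layout (one byte per pixel over the area's bounding box) and the name `paint_masked` are assumptions made for illustration; the data structures actually used in the prototype are not described here.

```c
#include <assert.h>
#include <stddef.h>

/* Paint one small area into a w x h frame buffer of 8-bit pixels.
   `mask` covers the area's bounding box, mw x mh pixels with its
   top-left corner at (x0, y0); a non-zero mask byte marks a pixel
   that belongs to the area. Other pixels are left unchanged. */
void paint_masked(unsigned char *frame, size_t w, size_t h,
                  const unsigned char *mask,
                  size_t x0, size_t y0, size_t mw, size_t mh,
                  unsigned char color)
{
    size_t x, y;

    assert(frame != NULL && mask != NULL);
    assert(x0 + mw <= w && y0 + mh <= h);   /* box must fit on screen */

    for (y = 0; y < mh; y++) {
        for (x = 0; x < mw; x++) {
            if (mask[y * mw + x]) {
                frame[(y0 + y) * w + (x0 + x)] = color;
            }
        }
    }
}
```

Unlike a palette switch, the work here grows with the number of pixels in every bounding box that must be repainted, which is consistent with the slower refresh reported for the 800-area version.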

Zooming:

Of course zooming is very useful. In the 800-area version we chose to permit zooming on nine predefined regions. The zoomed region is enlarged to fit the screen. To indicate context, a small view of the US shows the location of the zoomed area. Once zoomed, the number of areas is again small enough to allow instantaneous update of the map and adequate interactivity.

Users' feedback

Even though no formal testing of the system has been conducted yet, we have received enthusiastic feedback from the user community. It seems to be a compelling tool for novices and experts alike, as users are immediately drawn into the exploration of the sample data. Unless people are unfamiliar with pointing devices such as the mouse, it only takes a minute to get new users started.

Possible future work

For a better product:

In order to develop a better "product", speed of refresh is a priority for the small area version. Other features like print, save, bookmarks etc. should be added. Easy customization is also a requirement as we keep receiving requests for variants to the prototype: custom colors, custom legends, handling of special datasets, etc. A direct connection to statistical packages might also be helpful.

On the research side:

Some formal evaluation needs to be conducted to test whether this tool assists users in identifying possible correlations. When the datasets grow, automatic "grand tours" could be added for reviewing samples of the whole data. Sound may also be a possible direction to increase the chances of finding disappearance patterns, as visual patterns could be reinforced by audio patterns.

Conclusions

Dynamic queries on health statistics maps allow rapid exploration of thematic maps. Time trends are easily shown. Users retain access to detailed data and some simple querying is possible.

We also hypothesize that this new tool will facilitate the finding of confounders. These methods hold much promise for improving access to very large data sets because they utilize the remarkable human capabilities for visual pattern recognition. We have also demonstrated that useful animations can be implemented on today's ordinary equipment. We have received encouraging feedback from users and we hope that this tool will accompany the release of future NCHS datasets. In some cases better tools exist to illustrate findings, but dynamic queries are a novel and promising approach for the initial exploration of new data. The ease of use and the active engagement in the search process might also benefit the teaching of health related issues.

Acknowledgments

Partial support for this project was provided by the National Center for Health Statistics. We want to thank Vinit Jain and Boon-Teck Kuah, former computer science graduate students who worked on the implementation, Linda Pickle from NCHS for her collaboration and Ben Shneiderman for leading the Dynamic Queries research.

References

1 - Pickle, L., Mason, T., Howard, N., Hoover, R., Fraumeni, J., Atlas of U.S. cancer mortality among whites: 1950-1980, DHHS Publication No. (NIH) 87-2900, 1987.

2 - Pickle, L., Kerwin, J., Croner, C., Herrman, D., White, A., The impact of statistical graphic design on interpretation of disease rate maps, Proc. of the A.S.A. 153rd annual meeting (San Francisco, August 8-12, 1993). Government stat. section, ASA, Alexandria, VA, 1993.

3 - Carat, F. and Valleron, A-J., Epidemiologic mapping using the kriging method: application to an influenza-like illness epidemic in France. American Journal of Epidemiology, Vol 135, 11, 1293-1298, 1992.

4 - Peterson, M., Interactive cartographic animation, Cartography and Geographic Information Systems, 20 (1), 1993, pp. 40-44

5 - Dorling, D., Stretching space and splicing time: from cartographic animation to interactive visualization. Cartography and Geographic Information Systems, 19 (4), 1992, pp. 215-227, 267-270.

6 - Dunn, R., A dynamic approach to two variable color mapping, The American Statistician, vol 43, 4, 1989

7 - Egbert, S. and Slocum, T., Exploremap: an exploration system for choropleth maps. Annals of the Association of American Geographers, 82(2), 1992, pp. 275-288

8 - Ahlberg, C., Williamson, C., Shneiderman, B., Dynamic queries for information exploration: an implementation and evaluation. ACM CHI '92 Conference Proc. (Monterey, CA, May 3-7, 1992) 619-626. ACM, New York.

9 - Shneiderman, B., Dynamic Queries: a step beyond database languages, Technical report CAR-TR-3022, University of Maryland, College Park, MD 20782 (Jan. 1993).

10 - Shneiderman, B., Designing the User interface, Addison-Wesley Pub., Reading, MA, 1992

11 - Williamson, C., Shneiderman, B., The DC Home-Finder: evaluating dynamic queries in a real-estate information exploration system. Proc. ACM SIGIR '92 (Copenhagen, June 21-24, 1992) 338-346, ACM, New York.