HCIL-TR-99-04 February, 1999

LAP-TR-1999-01

VisAGE Usability Study

Elizabeth D. Murphy, Kent L. Norman, and

Daniel Y. Moshinsky

Laboratory for Automation Psychology,

Department of Psychology, and the

Human-Computer Interaction Lab, Institute for Advanced Computer Studies, University of Maryland

College Park, MD 20742
 
 
 
 
 

VisAGE Usability Study


1.0 Introduction

In the context of human-computer interaction, the term "usability" generally refers to the ease of use and operational suitability of the interactive displays and controls that serve as the user interface to a computing system. Studies of user-interface usability employ various methods to assess the effectiveness of user interfaces in supporting human performance in operational settings (e.g., Nielsen, 1997; Uehling, 1994). Methods of usability testing are flexible and adaptable to the purposes of a particular study.

A study was undertaken to evaluate the usability of a visualization tool developed by Century Computing, Inc. for NASA/Goddard (then Code 522, now Code 588). At the time of the usability study, the Visual Analysis Graphical Environment (VisAGE) tool was a working prototype. VisAGE was under development as an aid to visualizing spacecraft health and safety data as well as telemetry. The usability study was conducted under a grant from the NASA/Goddard Space Flight Center to the University of Maryland (NAG-5-3425).

1.1 Purpose

The usability study was conducted to collect human performance data and verbal feedback on VisAGE features. VisAGE developers were interested in getting user response to the tool as a basis for continued design and development.

1.2 Scope

The usability evaluation focused on two-dimensional (2-D) and three-dimensional (3-D) displays generated from test data representing simulated heaters on board a hypothetical, unmanned spacecraft. Two data feeds were created. One data feed represented the performance (temperature) of two heaters; the second represented the performance of five heaters. To clarify, the data were simulated; they were not drawn from telemetry archives or from a "live" spacecraft. The following display types were generated using the two data feeds:

- 2-D text display
- 2-D strip chart
- 2-D bar chart
- 2-D pie chart
- 3-D text display
- 3-D strip chart
- 3-D bar chart
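To make the test-data setup concrete, the sketch below shows one way such simulated heater-temperature feeds could be generated. It is only an illustration: the feed structure, sampling length, set points, and random-walk model are assumptions, not a description of the actual VisAGE test-data generator.

```python
import numpy as np

def simulated_heater_feed(n_heaters, n_samples, seed=0):
    """Illustrative temperature feed: one column per heater (assumed model).

    Each heater wanders around a nominal set point as a bounded random walk;
    the set points, step size, and bounds are arbitrary choices for the sketch.
    """
    rng = np.random.default_rng(seed)
    set_points = np.linspace(40.0, 60.0, n_heaters)           # nominal temperatures
    steps = rng.normal(0.0, 0.5, size=(n_samples, n_heaters))
    temps = set_points + np.cumsum(steps, axis=0)
    return np.clip(temps, 20.0, 80.0)

# One feed for two heaters and one for five, mirroring the study design.
feed_two_heaters = simulated_heater_feed(n_heaters=2, n_samples=600)
feed_five_heaters = simulated_heater_feed(n_heaters=5, n_samples=600)
```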

Several software controls associated with the 3-D displays were also evaluated. These controls included the zooming, panning, and rotating capabilities.

2.0 Method

Although the original intention was to test VisAGE only with operational personnel at NASA/Goddard, the availability of the testing version on the World Wide Web made it possible to set up a between-groups study, with undergraduate students at the University of Maryland making up a second group. Independent variables in the study were type of display and status of the participant (student or professional). Performance measures included time to complete the questions and accuracy of answers in comparison to the actual values that had been displayed. Accuracy measures included accuracy of detecting current trends in the data and accuracy of making predictions based on the displayed data. Participants provided written comments and suggestions for improving aspects of the VisAGE displays (e.g., labeling of axes, presentation of scales, use of color) to make them conform more closely to user expectations.

2.1 Participants

Several undergraduate students served as pilot subjects during the training of research assistants in running VisAGE and administering the questionnaire. Seventeen undergraduate students at the University of Maryland participated as test subjects in the VisAGE usability testing. This group included eight men and nine women, with an average age of 19.4 years (range = 18 - 23). Student performance was tested in the Laboratory for Automation Psychology, located in the Department of Psychology (Room 3118) at the University of Maryland, College Park. Students received credit in their Psychology 100 class for their participation.

Seven professional developers and operational personnel who work at NASA/Goddard were tested in the Usability Laboratory in Building 23 on-site at the Goddard facility in Greenbelt, Maryland. This group comprised four men and three women, with an average age of 36.7 years (range = 27 - 53). All participants read and signed an informed consent agreement before beginning the testing.

2.2 Testing Environment and Instrumentation

In both testing venues, VisAGE was accessed from a personal computer (PC) communicating with a Goddard server across the World Wide Web (http://nipper.gsfc.nasa.gov/visage/index.html). Century Computing developed a timer function to support the testing. When activated by a press of the <Start Test> button, the timer froze the display and recorded the interval in seconds until the <Stop Test> button was pressed. The timer function also recorded the data values that had been displayed while a display was frozen, providing scorers with a means of checking participants' accuracy.
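The actual timer was custom software built by Century Computing for the web-based prototype; the minimal sketch below, with hypothetical names, only illustrates the logic described above: pressing <Start Test> freezes the display values and starts timing, and pressing <Stop Test> records the interval in seconds along with the frozen values for later accuracy scoring.

```python
import time

class TestTimer:
    """Illustrative stand-in for the study's timer function (assumed structure)."""

    def __init__(self):
        self.start_time = None
        self.frozen_values = None
        self.elapsed_seconds = None

    def start_test(self, current_display_values):
        # Pressing <Start Test>: capture the values shown on the frozen display
        # and start the clock.
        self.frozen_values = list(current_display_values)
        self.start_time = time.monotonic()

    def stop_test(self):
        # Pressing <Stop Test>: record the interval in seconds.
        self.elapsed_seconds = time.monotonic() - self.start_time
        return self.elapsed_seconds, self.frozen_values
```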

2.3 Materials

Before the start of a session, participants provided background information about themselves and their computer experience on the first page of a paper-based questionnaire (Appendix A). The remaining pages of the questionnaire were used to record participants' answers to questions about the various displays and controls. The order of presenting the displays was randomized across participants, following the randomized order of pages in the questionnaire.
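As a sketch of this counterbalancing, the snippet below shuffles a per-participant presentation order. The display-type names are taken from the displays evaluated in this study, but the per-participant seeding scheme is an assumption for illustration, not a description of how the questionnaires were actually assembled.

```python
import random

DISPLAY_TYPES = [
    "2-D text", "2-D strip chart", "2-D bar chart", "2-D pie chart",
    "3-D text", "3-D strip chart", "3-D bar chart",
]

def presentation_order(participant_id):
    """Return a randomized display order for one participant (illustrative)."""
    rng = random.Random(participant_id)   # reproducible per-participant order
    order = DISPLAY_TYPES.copy()
    rng.shuffle(order)
    return order
```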

Questions were designed to evaluate the support provided by the various types of displays for user extraction of data values, detection of current trends, and prediction of future values. Rating scales were used to gather participant responses to the software controls for zooming, panning, and rotating the 3-D displays. Goddard participants also rated the usefulness of the various display types for their operational environments.

2.4 Procedures

The experimenter provided a brief introduction to VisAGE and began a testing session with the first display type indicated by the participant's questionnaire. For example, the first display for a particular participant may have been the 2-D strip chart with five heater sources. The experimenter set up the display using the VisAGE user interface and then asked the participant to observe the changing values on the display for up to one minute. The purpose of this brief observational period was to allow the participant to become familiar with the display and the depicted situation. In this case, there would have been five lines in different colors depicting the temperatures of five heaters over time.

The participant was then asked to press the <Start Test> button in the timer box. This action froze the display. The participant then answered the questions about that display on the corresponding page in the questionnaire. Upon completing the questions, the participant pressed the <Stop Test> button in the timer box. The participant then provided any further comments about the display, including suggestions for improving it.

Upon encountering the first 3-D display in a participant's questionnaire, the experimenter demonstrated the zoom, pan, rotate, and reorient controls. After participants practiced for a few minutes with the controls, the experimenter re-oriented the display to its original configuration, and participants were given a short time to observe the behavior of the displayed values and graphics. They were then prompted to press the <Start Test> button and to answer the questions for that display. With the ensuing 3-D displays, participants were asked not to use the controls to reorient a display until after pressing the <Start Test> button. The purpose of this instruction was to capture any time used for reorienting a display as part of the participant's total recorded task-performance time.

This procedure continued until the participant had worked through the entire questionnaire, including the pages which elicited ratings of the 3-D controls. A full testing session took between 45 minutes and one hour, or more, depending on the quantity of comments offered by a specific participant.

3.0 Results

The comments of operational participants contained many insights and useful suggestions for developers. The 3-D display controls were generally rated easy or very easy to use. Statistical findings are presented in support of various inferences that can be drawn about the performance of the two groups with this version of the VisAGE user-interface prototype. These findings demonstrate that collecting human performance data during usability testing can provide valuable information that cannot be obtained simply from participants' comments.
 
 

3.1 Comments from Participants

All participants were given the opportunity to provide written comments on the various displays. Typically, students provided very few comments, reflecting their lack of experience in an operational setting where such displays might be used. The professional participants, however, tended to provide detailed comments on the operational suitability of the various displays. They often suggested uses that they could envision for types of displays that are not currently used by their missions. Many of the professional participants commented aloud while they were viewing the displays; in those cases, the researcher made notes on their comments and prompted them to summarize their comments in the spaces provided on the questionnaire.

Comments from the professional participants were transcribed from the questionnaires and, as applicable, cross-checked against the researcher's notes. Comments were edited for readability, with brackets ([]) indicating any editorial insertions. The set of comments provided in Appendix B was forwarded to the development team soon after comments were collected and transcribed.

3.2 Statistical Findings

Various statistical techniques, primarily correlational analysis and testing of the differences between means, were applied to participants' responses and rating data, as appropriate. Results of these analyses demonstrate that valuable information can be obtained if human performance data are collected during usability testing. In this study, human performance is defined in terms of accuracy and time to complete tasks: accuracy of detecting trends, accuracy of making predictions, and time spent responding to comparable sets of questions for each display type. A somewhat unexpected finding was that, for most performance measures, the two groups -- students and professionals -- did not differ significantly.
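As a minimal sketch of these two techniques, assuming the per-participant counts, times, and scores have already been tabulated, the analyses look like this in modern terms (the data values below are made up for illustration and are not the study's data):

```python
import numpy as np
from scipy import stats

# Illustrative values only -- not the study's data.
age = np.array([18, 19, 21, 23, 27, 36, 53])
unable_counts = np.array([4, 6, 5, 9, 11, 14, 20])    # "can't tell"/"can't predict" answers

student_times = np.array([21.3, 18.0, 35.2, 14.7])    # seconds on one display
professional_times = np.array([40.1, 28.9, 55.0])

# Spearman rank correlation, as used for the age, experience, and gender variables.
rho, p_rho = stats.spearmanr(age, unable_counts)

# Independent-samples t-test on mean task times between the two groups.
t, p_t = stats.ttest_ind(student_times, professional_times)

print(f"rho = {rho:.3f} (p = {p_rho:.3f}); t = {t:.2f} (p = {p_t:.3f})")
```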

3.2.1 Correlations

A correlational analysis yielded several significant findings of general interest:

For the student group (N = 17), there was a significant correlation between gender and the total number of times a participant answered "can't tell" or "unable to predict" (Spearman's rho = .542, p < .05). Male students were more likely than female students to report an inability to detect a trend or to make a prediction. This finding did not occur for the professional group.

When the two groups were combined for purposes of analysis (N = 24), there was a significant correlation of age with number of times unable to tell or predict (Spearman's rho = .426, p < .05). As compared to younger participants, older participants more often answered "can't tell" and "can't predict." This correlation with age is largely due to the difference in mean age between the two groups (students and professionals). This finding did not occur within the groups because of the restricted range of within-group ages.

As shown in Table 1, other significant positive correlations occurred for age and general computer experience, for both groups; spreadsheet experience and general computer experience for the students; and graphics experience and programming experience for the students. Experience with 3-D games and other 3-D applications was significantly correlated with several other background variables. A significant negative correlation was found between word processing experience and accuracy for students responding to questions on the 2-D pie chart.

Table 1. Sample Correlational Statistics

Variables                                                           Spearman's rho      p <

Age with general computer experience:
     for students                                                        .561           .05
     for professionals                                                   .801           .05

3-D game experience with interactive 3-D experience:
     for students                                                        .659           .01
     for professionals                                                   .991           .01

Gender with word-processing experience, across groups
     (males had greater experience; see explanation in text)

Gender with spreadsheet experience (more experience for males)

Word-processing experience with accuracy using the 2-D pie chart
     (5 heaters), for students

The correlations with gender occurred because male students had more word-processing experience than did female students; but female professionals had more word-processing experience as compared to male professionals. Other, similar comparisons between demographic variables and performance outcomes were non-significant for both groups.

3.2.2 Analysis of Differences Between Means

A central issue addressed by this study was that of the relative usefulness of 2-D displays versus 3-D displays. The 3-D displays available in VisAGE actually display two-dimensional data with added visual perspective. It cannot be assumed, however, that such "3-D" displays of two-dimensional data will improve performance times or accuracy. In a set of comparisons of mean performance times using the comparable 2-D and 3-D displays (on a comparable series of questions), the 2-D bar chart supported faster performance than did the 3-D bar chart, for the student group only (t = 2.69, df = 13, p < .05). Students' mean performance time with the 2-D bar chart, for two heaters, was 47.93 seconds (SD = 17.86), while their mean performance time with the 3-D bar chart, for two heaters, was 92.50 seconds (SD = 65.75). This finding held for the 2-D versus 3-D bar charts with five heaters for both groups (for Group 1, t = 3.14, df = 15, p < .01; for Group 2, t = 3.10, df = 4, p < .05). That is, performance using the 2-D bar chart was faster than performance using the 3-D bar chart. The respective means and standard deviations are given in Table 2. These differences may be attributable to time taken for zooming, panning, and rotating the 3-D bar charts. There were no other significant differences in performance times in the 2-D versus 3-D comparisons. The 3-D displays never yielded faster performance as compared to performance times for the 2-D displays.

_____________________________________________________________________

Table 2. Means and Standard Deviations for Performance Times with the 2-D and 3-D Bar Charts, with 5 Heaters, for Group 1 (N = 16) and Group 2 (N = 5)

Group                     Mean Time/S.D. (2-D bar chart)          Mean Time/S.D. (3-D bar chart)

1 -- Students             53.81 sec./15.17 sec.                   88.19 sec./49.62 sec.

2 -- Professionals        60.60 sec./24.98 sec.                   102.80 sec./34.08 sec.

_____________________________________________________________________

When performance times were compared between groups, the only significant difference between the students and the professionals was in their performance times when using the 2-D pie chart: For the 2-D pie chart with two heaters, the professionals' performance time was significantly longer when compared to the students' time (t = 2.37, df = 21, p < .05); for the 2-D pie chart with five heaters, performance time for professionals was again significantly longer than it was for the students (t = 2.15, df = 19, p < .05). Means and standard deviations are shown in Table 3. All other comparisons of performance times between groups showed non-significant differences.

____________________________________________________________________

Table 3. Means and Standard Deviations for Performance Times with the 2-D Pie Chart, with Two and Five Heaters, for Group 1 and Group 2

Group                     Mean Time/S.D. (2 heaters)              Mean Time/S.D. (5 heaters)

1 -- Students             21.31 sec./14.75 (N = 16)               25.60 sec./13.18 (N = 15)

2 -- Professionals        40.00 sec./22.70 (N = 7)                52.83 sec./46.22 (N = 6)

_____________________________________________________________________
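The between-groups comparison can be checked directly from the summary statistics in Table 3. The sketch below recomputes the reported t for the two-heater pie chart from the group means, standard deviations, and sample sizes, using a pooled-variance (equal-variance) test; its magnitude matches the t = 2.37 (df = 21) reported above.

```python
from scipy import stats

# Two-heater 2-D pie chart, summary statistics from Table 3.
t, p = stats.ttest_ind_from_stats(
    mean1=21.31, std1=14.75, nobs1=16,   # Group 1 -- students
    mean2=40.00, std2=22.70, nobs2=7,    # Group 2 -- professionals
    equal_var=True,                      # pooled variance, df = 16 + 7 - 2 = 21
)
print(f"t(21) = {t:.2f}, p = {p:.3f}")   # |t| is approximately 2.37
```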

In comparisons of accuracy between the several kinds of 2-D displays, the 2-D strip chart supported more accurate detection of current trends by students than did the 2-D text display (t = 2.95, df = 16, p < .01); the 2-D bar chart (t = 3.35, df = 16, p < .01); or the 2-D pie chart (t = 2.95, df = 16, p < .01). In comparisons of accuracy between the comparable 2-D and 3-D displays, there was no difference in detection of current trends or in prediction of future trends between the 2-D and 3-D strip chart with two heaters or with five heaters. Across the groups (N = 24), there was a correlation of .552 (p < .01) between accuracy of prediction with the 2-D strip chart (two heaters) and accuracy of prediction with the 3-D strip chart (two heaters); there was no difference between the means for accuracy of prediction with these two displays when the groups were combined for data analysis. There was no difference between the groups on accuracy of detecting current trends with the 3-D strip chart (two heaters).

When the 2-D and 3-D bar charts were compared, there were no differences in detection of current trends between the groups; likewise, no differences between trend-detection performance emerged for the overall 2-D versus 3-D bar chart comparisons. Detection of temperatures showed no differences between the groups whether they used a 2-D or a 3-D bar chart. There were significant correlations between detection of trends using the 2-D and 3-D bar charts: For the comparable question on Heater 1, the correlation was .512 (N = 19, p < .05), and for the comparable question on Heater 2, the correlation was .721 (N = 18, p < .01).

When different types of displays were compared, the 2-D strip chart with two heaters supported better prediction accuracy than did the 3-D text display with two heaters (t = 2.48, df = 18, p < .05). This difference did not quite emerge within the student group, but it did occur within the professional group (t = 6.00, df = 6, p < .01).

A comparison of errors in using the 2-D text display, with five heaters, found no significant difference between the means for the student and professional groups (t = .78, df = 21, p > .05). The respective means were .82 (SD = .95) for the student group and .50 (SD = .55) for the professional group. Likewise, a comparison of errors using the 3-D text display found no significant differences between the means for the student and professional groups (t = 1.65, df = 21, p > .05). The respective means were 1.12 (SD = 1.05) for the student group and .33 (SD = .82) for the professional group.

Analysis of responses to the direct questions about the usefulness of the third dimension in the 3-D displays found that the combined groups rated the third dimension in the 3-D strip chart more useful than they rated the third dimension in the 3-D text display (t = 2.96, df = 20, p < .01). With the highest possible rating being one (1.00), the respective means were .52 (SD = .51) for the third dimension in the 3-D strip chart and .14 (SD = .36) for the third dimension in the 3-D text. When this question was analyzed by groups, the student group's rating of the usefulness of the third dimension in the 3-D strip chart differed from their rating of its usefulness in the 3-D text (t = 2.45, df = 14, p < .05), but there was no difference between the respective ratings made by the professional group (t = 1.58, df = 5, p > .05).

3.2.3 Summary of Differences Between Groups

When differences between means were tested, the two groups differed on some of the experience variables, on only one time variable, and on only one error outcome, as shown in Table 4. The experience variables were rated along a scale from zero to 100 (Appendix A).

Table 4. Mean Differences Between Students (Group 1) & Professionals (Group 2)

Variable                                         Mean/S.D. (Group 1)    Mean/S.D. (Group 2)    t-value    df    p <

General computer experience                      52.5/15.8 (N = 17)     81.4/16.6 (N = 7)      4.03       22    .01

Word-processing experience                       58.9/17.3 (N = 17)     74.9/15.5 (N = 7)      2.11       22    .05

Spreadsheet experience                           34.0/23.8 (N = 17)     64.1/18.6 (N = 7)      2.98       22    .01

Number of times unable to tell or predict        7.1/4.9 (N = 17)       13.3/8.0 (N = 6)       2.29       21    .05

Performance time using the 2-D pie chart         21.3/14.8 (N = 16)     40.0/22.7 (N = 7)      2.37       21    .05
(2 heaters), in seconds

Error in using the 2-D strip chart (5 heaters)   .941/.827 (N = 17)     .167/.408 (N = 6)      2.18       21    .05
Not every participant answered all of the questions, which accounts for the occasional differences in the number of participants per comparison. All of the reported results are considered significant at the .05 level. Applying a Bonferroni correction to adjust for the number of t-tests and accepting only results at the stricter .01 level, the two groups differ on only two of the experience variables: general computer experience and spreadsheet experience.
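For reference, the Bonferroni adjustment simply divides the family-wise alpha by the number of comparisons. Taking the six between-group comparisons in Table 4 as the family (an illustrative choice, since the full count of t-tests is not enumerated here), the adjusted threshold works out close to the .01 criterion used above.

```python
# Bonferroni correction: divide the family-wise alpha by the number of comparisons.
alpha = 0.05
n_comparisons = 6                        # e.g., the six between-group tests in Table 4
adjusted_alpha = alpha / n_comparisons
print(round(adjusted_alpha, 4))          # 0.0083 -- close to the stricter .01 level
```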

3.2.4 Ratings of Operational Usefulness

The professional participants rated the types of displays for operational usefulness on a scale from zero to 100. They were permitted to mark anywhere along the scale. The following ranking of the displays is based on their mean ratings (N = 7):

Tests of the differences between these mean ratings found the 2-D text display and the 2-D strip chart to be significantly different from each other and from all of the other display types. It is apparent that this group of professionals considered the 3-D text display and the 3-D bar chart to be about the same in operational usefulness.

4.0 Discussion

Comments from the professional group provided the basis for extracting key insights about the VisAGE prototype displays, as discussed below. The statistical findings support several observations about the VisAGE displays and the use of college students as participants in a study of this kind.

4.1 Insights from Comments

Several themes emerged from the feedback provided by the professional group:

- violation of user expectations by non-standard orderings and dynamically changing scales;
- the need for prior history in support of trend detection;
- the display of 2-D data in three dimensions; and
- possible applications of non-traditional display types.

Each of these points is discussed in turn:

Violation of Expectations. User expectations develop from prior experience with graphic designs that follow standard practice. In this study, user expectations were violated by higher-to-lower numbering of items along the x-axis: HTR5, HTR4, HTR3, HTR2, HTR1. Standard practice is to number left to right, from one to n. A similar violation of standard practice was the top-down numbering of heaters in legends, from HTR5 at the top of a list to HTR1 at the bottom. Standard practice is to start with one at the top of the list and to end with the largest number at the bottom of the list.

Although some participants reported that they were not bothered by these non-standard orderings, many errors or near errors could be attributed to the non-standard ordering of elements. In many cases, participants changed their answers when they realized they were assuming that Heater 1 was in the left-most position. Time spent in error detection and correction contributed to longer performance times than could be expected with standard, low-to-high ordering. Some participants never realized that they had used values for Heater 5 in answer to questions about Heater 1. Such errors are induced by a design that violates standard practice.

User expectations were also violated by the dynamically changing scales used in the bar-chart and strip-chart displays. Participants reported having trouble developing a sense of trends because of the constant changes in the scales. When both values and scales are changing, it requires highly focused concentration -- near the limits of human capabilities -- to extract trends.

Need for Prior History. In the operational environment, analysts are interested in the prior performance of the various measured parameters or "mnemonics". A display, such as the ASCII text or pie chart, which offers no insight into past performance is of little use for purposes of trend detection. Expecting the analyst to extract trends from such a display places unreasonable demands on short-term or working memory, with its limited storage capacity. The 2-D strip chart, which does provide information on prior history, supported better performance by the participants than did any of the other 2-D displays. It was also rated the most useful by professional participants, who use 2-D strip charts regularly in support of mission operations.

Three-D Display of 2-D Data. One of the most highly respected authorities on graphic design, Tufte (1983), contends that it is inappropriate and misleading to display 2-D data in three dimensions (p. 118):

The addition of a fake perspective to the data structure clutters many graphics [bold added]. This variety of chartjunk, now at high fashion in the world of Boutique Data Graphics, abounds in corporate annual reports, the phoney statistical studies presented in advertisements, the mass media, and the more muddled sorts of social science research.

In his 1990 book, Tufte describes as "content-empty...cosmetic decoration" a third dimension added to the display of 2-D data (p. 34). A key point is that such gratuitous, non-functional "decoration...frequently distorts the data," thus misleading the viewer (Tufte, 1990, p. 34).

Relying in part on Tufte's guidance, students in Shneiderman's fall 1997 graduate class on information visualization developed a web site, where they provide research-based recommendations on the uses of various kinds of graphics (http://otal.umd.edu/Olive/). Their recommendation on the 3-D display of 2-D data reads as follows:

We considered objects such as 3D bar and/or pie charts, where data may be effectively visualized in 2 dimensions, to be forced 3D objects [bold added]. If you have a bar chart that is rendered in 3D instead of 2D, has the data changed? The answer is an emphatic, "No!" The data is still 2D and thus should not be included when speaking about 3D visualization. Although 3D renderings of bar charts or pie charts may be appealing to some, empirical evidence suggests it makes the data more difficult to comprehend. In other words, what the presenter has created is "chart junk" as coined by Edward Tufte. According to Tufte, one dimension is being wasted when this "overcoding" is applied.

Although some students in the VisAGE usability study recognized that the third dimension did not contribute to their understanding of the data, some students appeared to have been misled into thinking that it did. They seemed to think that they were getting more than they actually were because, in fact, nothing but extraneous chartjunk is added by displaying 2-D data in three dimensions. Most of the professional participants indicated that the third dimension did not add to their understanding of the data.

Those students and professionals who said that the 3-D display did aid their understanding may have been thinking of the ability to manipulate the 3-D displays, rather than of the non-functional third dimension provided by a perspective rendering of 2-D data. Participants generally seemed to enjoy using the zoom and rotate controls to change the orientation of the 3-D displays. It was, in fact, often necessary to move labels and change the orientation of a 3-D display in order to see the relationships between the data feeds.

When they were necessary, these manipulations did aid participants' understanding of the data because, without them, the data would have been incomprehensible. Labels were sometimes not visible, and the data feeds often overlapped. To a large extent, the controls distinguished the 3-D displays from the 2-D displays, more so than did the dimensional aspects of the display sets. It may be that the ability to manipulate the 3-D displays came to the forefront when participants were asked whether the third dimension added to their understanding. In other words, "3-D" may have meant "moveable" to some participants. Thus, some participants may have interpreted the questions on 3-D as an aid to understanding differently than intended: they may have thought these questions were asking whether the controls aided their understanding.

Some of the professional participants were intrigued by the possibilities of 3-D displays and suggested ways of using the third dimension for truly functional purposes, as discussed next.

Possible Applications of Non-Traditional Display Types. One professional participant suggested that the 2-D bar chart might be useful for representing various aspects of fuel usage, such as keeping track of thruster pulses. Using a 2-D bar chart, analysts could monitor values, such as number of thruster pulses, that must be reached before other spacecraft functions can be executed. The same participant suggested that 2-D pie charts could be used to display changes in solar panel output, using one pie chart per solar panel, for a rotating spacecraft.

A professional participant suggested that the 3-D bar chart might be useful in studying anomalies that can lead to secondary anomalies. His idea was that the delay between the onset of the first and second anomalies might be informative about the cause of the original anomaly. The delay could be displayed using the third dimension of the 3-D bar chart. Another professional participant suggested that the 3-D displays might be useful, but only if the third dimension were used for a functional purpose.

4.2 Statistical Findings

Some of the correlational results can be understood in a cultural context. That is, the male students were, perhaps, more willing than were the female students to answer "can't tell" or "can't predict," possibly reflecting a male attitude of "telling it like it is" and/or a female attitude of wanting to please the experimenter by giving an answer. This male-female difference was not found in the professional group, possibly indicating the beneficial effect of experience on such attitudes.

This interpretation is supported by the finding that older participants were more willing than were the younger participants to answer "can't tell" or "can't predict." The older, more experienced participants may have been more attuned to the validity or lack of validity in giving answers based on the displayed values; they may have been less inhibited by a concern for the experimenter's opinion if they answered "can't tell" or "can't predict." In many cases, "can't tell" or "can't predict" was the best answer because it was simply not possible to give a specific value on the basis of the display.

The performance differences between the two groups when using the pie chart can be understood with reference to their relative experience in the operational setting. The students accepted the pie chart as a viable method of displaying the data. They took it at face value and did not agonize over whether it was an appropriate means of displaying mission data. As can be seen from their comments about the pie chart, however, the professional subjects questioned the validity of using a pie chart to display operational data. Many of them spent time trying to figure out how to interpret the pie-chart data.

These findings underscore the importance of including a diversity of ages and an equal number of male and female participants in a study of this nature. Without such a balance, statistical outcomes may be biased by the age or gender of participants, and erroneous conclusions may be drawn.

Perhaps the most interesting result from an experimental perspective was the finding of close similarity between the two groups of participants: undergraduate students and professional personnel. On the basis of the findings, it is apparent that students can be considered representative of professionals for purposes of gauging accuracy and performance times in a usability-testing context. In such a context, performance demands are placed largely on participants' general capabilities in visual cognition and basic reasoning. If the testing does not require specialized knowledge, which most of the VisAGE testing did not, it makes no difference whether the participants are professional employees or college students. Professionals should be consulted, however, when issues of operational suitability are raised.

In cases where speed of performance is critical, student times may be deceptively fast from an operational standpoint. Although performance times did not differ significantly, except for the pie charts, there was often a large practical difference between student times and professional times. In absolute terms, professional times were longer. Thus, some caution in accepting student times is merited. Because professionals have more knowledge and experience to bring to an unexpected, anomalous situation and, consequently, a longer thought process to execute, it is likely that they will take longer than students will to come to a conclusion or decision. In a highly familiar and well-understood situation, however, the professionals are likely to respond more quickly (Klein, 1993). The inequality in the size of the groups may also have influenced the results. With an equal number of professionals, performance times may have differed significantly between the groups.

This study should not be considered as a full evaluation of 2-D versus 3-D displays for at least two reasons: 1) the comparisons could not be strictly controlled; and 2) the "3-D" displays available for comparison were actually displays of 2-D data with added visual perspective and controls (e.g., rotate, zoom, pan). Because the displays had already been constructed before the study began, it was not possible to control for variables such as the placement of legends or color schemes, which varied widely within and across each group of displays. The 3-D displays differed from the 2-D displays not only in the addition of perspective, but also in their associated controls for zooming, rotating, panning, and so forth. To some extent, therefore, the study is a comparison of 2-D displays without controls and 2-D, perspective displays with soft controls.

5.0 Recommendations

The study shows that collection and analysis of human performance data can produce instructive results that complement comments received from participants. Although the comments were considered the most valuable feedback to the developers, the performance data reveal that adding a third dimension to two-dimensional data does little to benefit accuracy or task time. It must be emphasized, however, that the displays studied were initial prototypes. A strong recommendation emerging from this study is to use the third dimension for a functional purpose, not just to add visual perspective to 2-D data. Since operational teams deal with truly 3-D data, it might be to their benefit to be able to inspect this data in a truly 3-D representation. Such 3-D representations should always be fully viewable upon first presentation, that is, the operator or analyst should not have to reposition a display to read labels or to see the full graphic. If labels are not easily viewable, an analyst may make erroneous assumptions about relationships within a display that represents multiple data streams.

Future prototypes and future releases of VisAGE should follow sound design principles and human factors guidelines, for example those found in Code 588's original and revised user-interface guidelines (Carlow, 1992; CTA, 1995). It is incumbent upon the designer to search out such guidelines and to follow standard graphical design practice, which takes user expectations into account. The "language" of visual design (e.g., Dondis, 1973; Tufte, 1983, 1990) codifies user expectations. Imagine the chaos that would result if highway signage violated drivers' understandings of common symbols and standard practice -- if, for example, the green light was sometimes in the middle or the bottom of traffic lights; or the octagonal shape, with red background, reserved for stop signs, was sometimes used in place of the triangular yield shape, with yellow background. Think of what does happen on the road when signage is missing or inconsistent with the actual situation.

Although the conventional meanings of visual-design elements may not be as well known in the user-interface design community as are the conventional meanings of traffic signs and symbols, ignoring or misusing visual conventions can cause unnecessary confusion for the user. If users are expected to be computer literate, user-interface designers should be literate in the meanings and implications of the elements of visual symbology and their combinations. Knowledge of visual conventions should be part of the graphic designer's tool kit.

A final recommendation is for continued, iterative usability testing of enhanced releases of VisAGE. The more the final product can be refined based on input from test participants, the more useful it is likely to be in the long run. Collection of human performance data to supplement comments is strongly recommended.
 
 

Acknowledgements

This study was prepared for NASA/Goddard Space Flight Center (Code 588) under Grant No. NAG5-3425. The authors are grateful to those who participated in this study and to those who supported their participation. In particular, we thank the group of professionals who participated at the NASA/Goddard Space Flight Center. We appreciate their patience with the logistics of the study and their generosity in offering useful suggestions to be passed on to the developers. We express thanks to Matthew Brandt and his development team, especially David Fout, Vincent Pell, and Melissa Hess of Century Computing, Inc., who built customized timing and data-tracking software for use in this study. The study could not have been conducted without their support. Thanks to Tom Miller and Kirk Norman, technical assistants in the Laboratory for Automation Psychology. Our thanks go, as well, to Walt Truszkowski (Code 588) for his encouragement and support through the grant to the University of Maryland.
 


References

Carlow International Incorporated. (1992). Human-computer interface guidelines (DSTL-92-007). Greenbelt, MD: NASA/Goddard Space Flight Center.

CTA Incorporated. (1995). User-interface guidelines (DSTL-95-033). Greenbelt, MD: NASA/Goddard Space Flight Center (available on the Internet at http://groucho.gsfc.nasa.gov:8080/Code_520/Code_522/Documents/UG_96/UserGuide1.html).

Dondis, D. A. (1973). A primer of visual literacy. Cambridge, MA: MIT Press.

Klein, G. A. (1993). A recognition-primed decision (RPD) model of rapid decision making. In G. A. Klein, J. Orasanu, R. Calderwood, & C. E. Zsambok (Eds.), Decision making in action: Models and methods (pp. 138-147). Norwood, NJ: Ablex.

Nielsen, J. (1997). Usability testing. In G. Salvendy (Ed.), Handbook of human factors and ergonomics (2nd ed., pp. 1543-1568). New York: John Wiley.

Tufte, E. (1983). The visual display of quantitative information. Cheshire, CT: Graphics Press.

Tufte, E. (1990). Envisioning information. Cheshire, CT: Graphics Press.

Uehling, D. L. (1994). Usability testing handbook (DSTL-94-002). Greenbelt, MD: NASA/Goddard Space Flight Center.