CAR-TR-763 May 1995

CS-TR-3451

ISR-TR-95-51

Evaluating spatial and textual style of displays

Ben Shneiderman*, Richard Chimera and Ninad Jog

Ren Stimart†, David White†

Human-Computer Interaction Laboratory
Department of Computer Science
Institute for Systems Research*
University of Maryland, College Park, MD 20742-3255 USA
ben@cs.umd.edu

General Electric Information Service†
401 N. Washington Street, Rockville, MD 20850 USA






Abstract

The next generation of Graphic User Interfaces (GUIs) will offer rapid access to perceptually rich, information-abundant, and cognitively consistent interfaces. These new GUIs will be subjected to usability tests and expert reviews, plus new analysis methods and novel metrics to help guide designers. We have developed and tested first-generation concordance tools to help developers review terminology, capitalization, and abbreviation. We have also developed a dialog box summary table to help developers spot patterns and identify possible inconsistencies in layout, color, fonts, font size, font style, and ordering of widgets. In this study we also explored the use of metrics such as widget counts, balance, alignment, density, and aspect ratios to provide further clues about where redesigns might be appropriate. Preliminary experience with several commercial projects is encouraging.










1. Introduction and Literature Review

Designing a user interface is a complex process (Hix & Hartson, 1993; Shneiderman, 1992). It begins with analysis of the users and their tasks, goes through creative stages in which key screens are designed and reviewed, and proceeds with detailed design of hundreds or thousands of screens, dialog boxes, form fill-in layouts, output formats, visual information presentation, help screens, tutorials, etc. Usability testing can begin early and be repeated with larger groups of users as the design becomes more stabilized and complete. The development process has been sped up in remarkable ways by the presence of user interface management systems and powerful development tools. It is possible to build running systems with an elaborate design in weeks and make refinements in hours.

While powerful tools enable designers to create excellent systems rapidly, designers can still produce poor designs. Commercial pressures are forcing many novice designers to turn out larger and more numerous systems at a more rapid pace, so concerns about quality are greater than ever. Furthermore, when several designers contribute to a large project, coordination is needed to prevent unnecessary diversity. Quality control and acceptance testing procedures are being introduced in many organizations, but system auditors are often at a loss for evaluation methods, criteria, and norms.

The most popular and effective methods appear to be usability testing and expert reviews. The dramatic expansion of usability testing has helped to improve designs, because designers are forced to work to a clear schedule and feedback from structured testing has proven to be powerful in revealing flaws early. Unfortunately, usability testing cannot reveal what performance will be after months of usage, and it is usually not possible for usability lab test participants to experience every dialog box. This "coverage" problem, a term borrowed from software testing, is increasingly important as systems grow in complexity and size. By contrast, expert reviews can cope with the coverage problem through diligent examination of each dialog box, but reviewers may differ in their opinions and can hardly be expected to notice all differences, omissions, or flaws when there are hundreds or thousands of dialog boxes. A further problem with usability testing and expert reviews is that they are relatively costly and time consuming compared with automated evaluations and interface metrics.

The criteria for excellent interface design are still emerging from creative graphics designers and from the results of empirical studies. Guidelines documents from Apple (1992), IBM (1991), Microsoft (1993), and others are a first step, but many design issues are not addressed by these already voluminous books (Brown, 1988; Galitz, 1989; Marcus, 1992). While there will always be room for innovative designs, there is a growing need for methods that enable ordinary designers to create effective systems reliably, on-time, and on-budget.

Interface development is greatly facilitated with widely used tools such as Visual Basic (Microsoft Corp.), Reality, or PowerBuilder, and more complex cross-platform systems such as Galaxy (Visix Corp.), XVT (XVT Corp.), or Open Interface (Neuron Data). These tools can facilitate standardization across platforms and provide the software infrastructure for new evaluation tools. Software tools to assist designers are being implemented by practitioners for their products while researchers have begun to develop some exploratory systems that help automate the design process (Kim & Foley, 1993; Sears, 1994).

While automated layout holds promise for standardized situations, in more complex situations simple automated evaluations can provide feedback to designers, even at early stages of development. Many of the guidelines documents include recommendations about appropriate numbers of menu items, colors, widgets, etc. Sometimes there is experimental support for these recommendations, but often they are based only on subjective judgments and thoughtful analyses. Streveler and Wasserman (1987) proposed novel visual metrics such as symmetry, balance, percentage of screen used, and average distance between groups of items, but they did not apply or test these notions.

Tullis (1988a, 1988b) carried these ideas further and implemented a system for evaluating the visual displays of character-based interfaces only. He implemented metrics for overall and local density (based on the number of characters filled), grouping (number of groups and their sizes), and layout complexity (vertical and horizontal alignment). His metrics were partially validated in a useful series of studies and his tool was distributed. The shift to graphic user interfaces with multiple font sizes, three-dimensional widgets, etc. means that new analyses and metrics are necessary.

The availability of more graphic design features has raised interest in spatial properties such as balance, symmetry, regularity, alignment, proportion, horizontality, simplicity, economy, neutrality, unity, grouping, predictability, sequentiality, etc. These and a dozen more were identified and discussed in the context of traditional and multimedia layouts (Vanderdonckt & Gillo, 1994). These properties were intended to serve as a basis for an automatic placement tool (Bodart et al., 1994); however, specific metrics and acceptable ranges were not tested. Other efforts at automatic layout may lead to useful tools for some situations (Feiner, 1988; Kim & Foley, 1993; Byrne et al., 1994), but there is very little experience with substantial commercial applications.

Esthetically pleasing layouts are important, but the layouts should also match the sequence and frequency of the users' tasks. The term "layout appropriateness" (Sears, 1993, 1994) was chosen to convey the correspondence between layout and task. Layout appropriateness requires additional input concerning usage patterns, but it is far more powerful in providing reliable evaluations and can even be used to generate layouts that are optimal with respect to the distance traversed or other metrics. Early testing has demonstrated its effectiveness in analyzing simple dialog boxes and complex control panels from NASA applications.

These preliminary efforts have all been helpful in identifying potential metrics and evaluation tools that could be used in the context of modern user interface building tools that generate graphic user interfaces. We sought to take these ideas from the laboratory into field testing and to develop tools for professional developers working for General Electric Information Services.

2. Our methods

Our research expanded from single-screen analyses toward evaluations across the dozens or hundreds of dialog boxes found in many user interfaces. We focused on consistency across screens and on feedback that guides designers to issues that might require further analysis. In earlier work (Chimera & Shneiderman, 1993) we demonstrated that consistency in color, terminology, layout, instructions, etc. does make a difference in users' perceptions, performance, and subjective satisfaction. While consistency is a complex concept and sometimes violations are appropriate, some aspects of consistency checking are candidates for automation. It would seem appropriate for designers to preserve spatial properties such as position of similar items, size or aspect ratios of related dialog boxes, minimal wasted space, consistent margins, and aligned, balanced layouts. Similarly, visual properties of text such as color, fonts, font sizes, font styles, and justification of labels would be more acceptable if they were used consistently. Finally, terminological consistency and standard spelling, abbreviations, and capitalization seem important in simplifying an interface for novice and expert users alike, whether they are first-time, intermittent, or frequent users.

Our goal was to give designers rapid feedback as they developed their designs and to steer them to examine certain screens in detail to see if improvements were needed. We wanted to provide a kind of medical lab report (like a blood test) for a set of dialog boxes that revealed potential anomalies but did not prescribe cures. To do this we used the descriptions of dialog boxes generated by tools such as Visual Basic. We developed a canonical format for dialog box descriptions that becomes the input to our evaluation programs. Other development tools produce different descriptive outputs, but we assumed that a knowledgeable developer could write a conversion program to put the information into the canonical format.
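The canonical format itself is not reproduced in this report. Purely as an illustration, a record for one dialog box might contain lines such as the following; the file name, field names, and values here are hypothetical, not the ones used by the project:

DIALOG sample.cft WIDTH 560 HEIGHT 300 BGCOLOR 1
WIDGET CommandButton LABEL "OK"      LEFT 464 TOP 8  WIDTH 81  HEIGHT 25 TYPEFACE 1
WIDGET CommandButton LABEL "Cancel"  LEFT 464 TOP 40 WIDTH 81  HEIGHT 25 TYPEFACE 1
WIDGET Label         LABEL "Message" LEFT 8   TOP 8  WIDTH 120 HEIGHT 17 TYPEFACE 1

Because the evaluation programs read only records of this kind, the analyses are independent of the tool that produced the original interface description.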

We developed two reports. The first, a dialog box summary table, gave a compact overview of spatial and visual properties; each row described a dialog box and each column a metric. The second, a concordance, was built by extracting all the words that appear in every dialog box and sorting them into one file with references to the dialog boxes they came from.

2.1 Dialog box summary table

The dialog box summary table was intended to provide designers with a compact overview of the dozens or hundreds of dialog boxes that had been designed by the single or multiple designers on a project. Each row represents a single dialog box and each column represents a single metric. Typical use would be to scan down each column looking for extreme values, spotting inconsistencies, and understanding patterns within the design.

The order of the rows was initially alphabetical and the dialog box summary table was printed on paper, but other orderings, such as functional groupings of dialog boxes (so that all the dialog boxes related to installation or printing are seen as a group), might be useful. Viewing the dialog box summary table within an electronic spreadsheet would be logical. The order of the columns was less clear to us, and we simply appended new columns as the software was created. A compact presentation that squeezed as many columns as possible across a wide printout was seen as advantageous.

The choice of metrics was our most critical issue. The University of Maryland and General Electric groups brainstormed independently for a week, consulting with colleagues and generating two lists of approximately 40 proposed metrics each. The specific items were grouped into categories such as consistency, spatial layout, alignment, clustering, cluttering, color usage, fonts, attention getting, etc. The two lists had many similar items and categories, which was encouraging. A second independent brainstorming session was used to choose an ordered list of metrics for implementation. Highly ranked items were ones that we expected to have high payoff and to be easy to implement.

The implementation, written in C++, revealed problems in obtaining the required values in a complete and consistent manner. Definitions of the metrics were revised, special conditions were handled, and bugs were resolved one by one as the columns emerged. The current columns are explained below, followed by a short sketch of how several of the measures might be computed and a portion of the dialog box summary table:

Dialog Name: Name of the file in which the dialog is contained.

Aspect Ratio: The ratio of the height of a dialog to its width. Numbers in the range 0.3 through 1.7 are desirable.

Widget Totals: Counts of all the widgets and of the top-level widgets. A larger difference between the two counts indicates greater nesting of widgets, such as buttons inside containers.

Non-Widget Area: The ratio of the non-widget area to the total area of the dialog, expressed as a percentage. Numbers closer to 100 indicate high utilization, and low numbers (< 30) indicate possibilities of redesign.

Widget Density: The number of top-level widgets divided by the total area of the dialog (multiplied by 100,000 to normalize it). High numbers indicate that a comparatively large number of widgets are present in a small area. This number is a measure of the 'crowding' of widgets in the dialog.

Margins: The number of pixels between the dialog box border and the closest widget. The left, right, top and bottom margins should all be approximately equal to each other in a dialog, and should also be the same across different dialogs. Dialogs that contain widgets which extend beyond the dialog's bounds (e.g., lists) give rise to negative figures for bottom margins.

Gridedness: The ratio of the total number of widgets in a dialog to the number of distinct x or y positions that the widgets occupy, giving separate x-axis and y-axis measures. If all the widgets in a dialog have distinct x-coordinates, the x-gridedness will be 1; a number greater than 1 is evidence of grouping. If the x-gridedness is greater than the y-gridedness in a single dialog, widgets are stacked into columns rather than rows.

Area Balances: A measure of how evenly widgets are spread over the dialog box. There are two measures: the horizontal balance, which is the ratio of the total widget area in the left half of the dialog to the total widget area in the right half; and the vertical balance, which divides the top-half widget area by the bottom-half widget area. Dialogs in which all widgets are centered about the vertical midline have a horizontal balance of 1 (Left Area = Right Area). In general we expect the horizontal balance to be greater than 1 because many dialogs consist of large widgets in the left and top halves and small widgets (such as buttons) at the right and bottom.

Distinct Typefaces: A typeface consists of a font, font size, bold, and italics information. Each distinct typeface across all the dialog boxes is arbitrarily assigned an integer and is described in detail at the end of the table. For each dialog box, all the integers representing its distinct typefaces are listed so that typeface inconsistencies can easily be spotted locally within each dialog box and globally among all the dialog boxes. Ideally the same typeface is used across all the dialog boxes, and too many typefaces within one dialog box may be undesirable.

Distinct Colors: (This column is not shown below because of lack of space.) All the distinct background colors in a dialog box are displayed. Each distinct color across all the dialog boxes has been arbitrarily assigned an integer for display and comparison convenience and is described in detail at the end of the table. The purpose of this metric is to check whether all the dialog boxes have the same background colors; multiple background colors in a dialog box may indicate inconsistency.
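To make the definitions above concrete, the following C++ fragment sketches how three of the measures (left margin, x-gridedness, and horizontal area balance) might be computed once each top-level widget has been reduced to a rectangle in dialog coordinates. The structure and function names are ours for illustration and are not taken from the project's C++ implementation; in particular, the handling of widgets that straddle the vertical midline is our assumption.

#include <algorithm>
#include <climits>
#include <set>
#include <vector>

struct Rect { int left, top, width, height; };   // one top-level widget

// Left margin: pixels between the dialog border and the closest widget
// (assumes widget coordinates are relative to the dialog's client area).
int leftMargin(const std::vector<Rect>& widgets) {
    int m = INT_MAX;
    for (const Rect& w : widgets) m = std::min(m, w.left);
    return widgets.empty() ? 0 : m;
}

// X-gridedness: number of widgets divided by the number of distinct x
// positions; a value above 1 suggests widgets aligned into columns.
double xGridedness(const std::vector<Rect>& widgets) {
    std::set<int> xs;
    for (const Rect& w : widgets) xs.insert(w.left);
    return xs.empty() ? 0.0 : double(widgets.size()) / double(xs.size());
}

// Horizontal balance: widget area in the left half of the dialog divided by
// widget area in the right half; a widget straddling the vertical midline
// contributes part of its area to each side.
double horizontalBalance(const std::vector<Rect>& widgets, int dialogWidth) {
    double leftArea = 0.0, rightArea = 0.0;
    const int mid = dialogWidth / 2;
    for (const Rect& w : widgets) {
        const int l = w.left, r = w.left + w.width;
        leftArea  += w.height * std::max(0, std::min(r, mid) - l);
        rightArea += w.height * std::max(0, r - std::max(l, mid));
    }
    return rightArea > 0.0 ? leftArea / rightArea : leftArea;
}

The remaining measures (aspect ratio, widget density, non-widget area, and the vertical analogues) follow the same pattern.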

This table reveals some interesting anomalies that led to reconsideration of designs. The test user interface had about 140 dialog boxes and was a well-reviewed and polished design. Very few obvious bugs appeared, but many interesting questions were raised as we reviewed the detailed analysis. For example, the varying aspect ratios were a surprise, and irregular margins were a sign of lack of coordination. The gridedness values did lead to some review of layouts, but we are not yet sure how to refine this measure. The balance ratios were effective in finding unusual layouts, which will be reconsidered. The unusual variety of typefaces in contacts.cft was a surprise; it turned out to be the work of a specific designer who had created other dialog boxes of the application with his distinctive style. Similar surprises occurred in the distinct typefaces and colors columns.

No.  Dialog        Aspect  -WIDGET-    Non-    Widget   --- M A R G I N S ---  Gridedness  -Balances-    Distinct
     Name          Ratio   TOTALS      Widget  Density  Left Right Top Bottom     X    Y   Area Ratios   Typefaces
                   (H/W)   All  Top-   Area    widget/  (pixels)                           Horiz  Vert
                                Level  (%)     area                                        (L/R)  (T/B)

1 aboutedi.cft 0.49 6 5 74.4 76 64 30 8 6 1.0 1.0 1.0 0.7 1

2 actlog.cft 0.67 16 14 -0.0 60 0 6 0 -241 2.0 1.4 1.1 0.4 1

3 addexp.cft 0.43 3 2 46.0 38 8 33 8 17 1.0 1.0 1.1 3.2 1

4 addfamdf.cft 0.77 25 13 28.3 74 8 26 8 4 1.3 2.6 1.0 0.7 1

5 addr.cft 0.73 47 8 23.9 36 8 26 8 4 1.1 2.7 1.1 0.9 1

6 addrbk.cft 0.84 45 29 15.5 177 0 13 0 6 1.7 2.2 1.1 0.8 1

7 addsec.cft 0.50 7 6 32.9 84 8 23 8 9 1.2 2.0 1.5 0.8 1

8 addseg.cft 0.63 7 6 42.4 103 16 23 8 12 1.0 2.0 1.4 0.6 1

9 addstand.cft 0.41 3 2 60.9 44 24 38 24 12 1.0 1.0 1.0 2.1 1

10 admpwd.cft 0.70 14 6 31.9 63 16 21 8 5 1.0 2.0 1.0 0.7 1

11 adrmsg.cft 0.74 28 14 23.2 61 8 21 8 7 1.4 2.3 1.0 0.5 1

12 adrmsg2.cft 0.74 28 14 23.0 61 8 21 8 6 1.4 2.3 1.0 0.5 1

13 adrmsg3.cft 0.76 28 14 24.7 60 8 13 8 7 1.4 2.3 1.0 0.5 1

14 advsched.cft 0.82 4 3 35.6 42 16 23 16 13 1.0 1.5 1.0 1.3 1

15 afile2.cft 0.66 4 3 47.7 57 16 33 8 13 1.0 1.5 1.0 1.6 1

16 alert1.cft 0.47 5 4 42.4 129 8 18 8 4 1.0 1.3 1.0 1.5 1

17 archive.cft 0.57 23 14 44.6 86 8 33 8 26 1.6 1.8 0.9 0.6 1

18 archok.cft 0.60 13 12 48.8 130 8 11 8 6 1.5 3.0 1.1 1.1 1

19 asgnfam.cft 0.49 12 11 50.8 82 16 7 8 3 1.2 2.2 1.0 0.4 1

20 autoff.cft 0.42 10 9 38.7 87 16 23 8 4 1.1 3.0 1.2 0.8 1

21 autofile.cft 0.39 10 5 34.1 74 16 31 16 4 1.2 1.7 1.3 0.7 1

22 autoupd.cft 0.37 8 5 52.6 89 16 26 8 4 1.0 1.7 1.1 1.1 1

23 backnow.cft 0.49 12 11 56.4 102 24 24 8 14 1.4 2.8 0.8 1.0 1

24 btmail.cft 0.48 3 2 50.3 109 8 20 8 10 1.0 1.0 1.0 3.1 1

25 buildcl.cft 0.58 4 3 38.8 56 8 18 8 11 1.0 1.5 1.0 1.7 1

26 cc.cft 0.44 3 2 76.5 76 32 49 16 15 1.0 1.0 1.1 1.4 1

27 chgstat.cft 0.43 3 2 47.6 87 8 20 8 7 1.0 1.0 0.9 2.9 1

28 ckdoc.cft 0.55 3 2 47.9 47 8 31 8 12 1.0 1.0 1.0 3.5 1

29 conhost.cft 0.47 3 2 46.7 55 8 13 8 14 1.0 1.0 0.9 3.8 1

30 connect.cft 0.61 17 16 49.7 142 16 31 16 2 1.8 1.8 0.6 1.9 1

31 contacts.cft 0.73 105 13 -358.8 47 8 12 8 -1403 1.6 1.9 1.3 0.3 1234

32 create.cft 0.71 92 12 0.2 44 0 14 0 9 4.0 1.3 1.0 0.9 1

33 dbback.cft 0.37 14 5 45.2 65 16 29 16 11 1.0 1.7 1.1 0.8 1

34 dearch.cft 0.53 15 14 48.6 119 8 11 8 5 1.8 3.5 1.3 0.8 1

35 dearch2.cft 0.59 10 9 41.1 68 24 26 0 16 1.1 4.5 1.5 1.2 1

36 dearchok.cft 0.63 10 9 35.2 63 16 26 8 12 1.0 4.5 1.6 1.0 1

37 delconf.cft 0.41 6 5 63.2 118 16 14 16 7 1.7 1.0 0.9 2.7 1

38 dociduti.cft 0.69 29 3 23.8 14 16 28 16 16 1.5 1.5 1.3 1.0 1

...

Maximum 1.00 170 31 97.5 271 80 56 24 27 4.4 4.5 6.2 8.6

Minimum 0.32 3 2 0.0 14 0 0 0 0 1.0 1.0 0.3 0.0

Average 0.60 17 8 35.3 86 11 19 7 7 1.6 1.7 1.1 1.4

1 = MS Sans Serif 8.25 Bold 2 = MS Sans Serif 8.25 3 = MS Sans Serif 9.75 Bold Italic

4 = MS Sans Serif 8.25 Bold Italic 5 = Arial 8.25 Bold 6 = MS Sans Serif 18 Bold

7 = MS Sans Serif 9.75 Bold

Minimum, maximum, and average values were computed for the metrics. Dialog boxes with extreme values should be examined as candidates for redesign.

A second part of the dialog box summary table (shown below) displays information on frequently used buttons: OK, Cancel, Help and Close. The columns enabled us to spot the highly inconsistent sizes and relative placements of these buttons in this application.

Presence of OK and Cancel Buttons: If a dialog has OK or Cancel buttons, their height and width in pixels are printed. The idea is that they should have the same sizes, and designers can verify the presence of these fundamental controls.

OK and Cancel Button Relative Positions: For dialogs that have both OK and Cancel buttons, this metric indicates their relative position. If the Cancel button is to the right of the OK button, the offset in pixels is printed as x + offset; if it is below the OK button, it is printed as y + offset (a short sketch of this computation follows these descriptions).

Help and Close Button Sizes: If a dialog has Help or Close buttons, their height and width in pixels are printed. The size of the buttons should be consistent.
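The report does not state exactly how the offset is measured. As one plausible reading, the sketch below treats it as the gap in pixels between the facing edges of the two buttons; the Rect structure repeats the hypothetical one from the earlier sketch.

#include <cstdio>
#include <string>

struct Rect { int left, top, width, height; };

// Returns "x + offset" when Cancel sits to the right of OK, otherwise
// "y + offset" on the assumption that Cancel sits below OK.
std::string relativePosition(const Rect& ok, const Rect& cancel) {
    char buf[32];
    if (cancel.left >= ok.left + ok.width)
        std::snprintf(buf, sizeof buf, "x + %d", cancel.left - (ok.left + ok.width));
    else
        std::snprintf(buf, sizeof buf, "y + %d", cancel.top - (ok.top + ok.height));
    return std::string(buf);
}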

No.  Dialog Box       OK Button        Cancel Button    Relative   Help Button      Close Button
     Name             (height,width)   (height,width)   Position   (height,width)   (height,width)

1 aboutedi.cft 25,89

2 actlog.cft 25,123

3 addexp.cft 25,97

4 addfamdf.cft 25,73 25,73 y + 7 25,73

5 addr.cft 25,81 25,81 y + 7 25,81

6 addrbk.cft 25,89

7 addsec.cft 25,73 25,73 y + 7 25,73

8 addseg.cft 25,73 25,73 y + 7 25,73

9 addstand.cft 25,89

10 admpwd.cft 25,65 25,65 y + 7 25,65

11 adrmsg.cft 25,57 25,57 y + 7 25,57

12 adrmsg2.cft 25,57 25,57 y + 7 25,57

13 adrmsg3.cft 25,57 25,57 y + 7 25,57

14 advsched.cft 25,97

15 afile2.cft 25,65

16 alert1.cft 25,97

17 archive.cft 25,73 25,73 x + 23 25,73

2.2 Concordance

The idea of the string concordance output is to list all occurrences of words that appear in labels, buttons, menus, user messages, etc. throughout the user interface canonical format file. Designers can use the concordance to identify many aspects of appropriate word use such as spelling, case consistency, passive/active voice, noun/verb choice, etc.

There is a short format and a long format of the string concordance. Both formats create a file that is an ASCII table with multiple columns. The first column of both formats lists individual words, one per line, sorted in alphabetical order. Occurrences that differ only in case are preserved as unique words and are listed one after another in the sorted list so that differences in case are clearly pointed out. The normal sort order is a..zA..Z, but this would separate occurrences of "find" from "Find" or "FIND", so our program sorts words in aAbB...zZ order.
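A comparator along the following lines would produce that ordering; this is our reconstruction for illustration rather than the program's actual sorting code.

#include <algorithm>
#include <cctype>
#include <string>

// Orders words case-insensitively first, so "find", "Find", and "FIND" stay
// adjacent; among words that differ only in case, lower case sorts first,
// giving the aAbB...zZ order described above.
bool concordanceLess(const std::string& a, const std::string& b) {
    size_t n = std::min(a.size(), b.size());
    for (size_t i = 0; i < n; ++i) {
        int la = std::tolower((unsigned char)a[i]);
        int lb = std::tolower((unsigned char)b[i]);
        if (la != lb) return la < lb;
    }
    if (a.size() != b.size()) return a.size() < b.size();
    for (size_t i = 0; i < a.size(); ++i)
        if (a[i] != b[i]) return std::islower((unsigned char)a[i]) != 0;
    return false;
}

// Typical use: std::sort(words.begin(), words.end(), concordanceLess);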

The short format lists the word and the number of times it appears. The long format also identifies the files in which the word appears (see below). The word "Message" appears 18 times in the files whose names follow it, "Message:" appears 2 times, and further down the list the term "messages" (uncapitalized) appears once and "msgs" appears once. These variant forms are not spelling errors and may be acceptable, but they may be something that should be reconsidered.

Message 18

addr.cft addr.cft dociduti.cft

docsearc.cft docsearc.cft docsearc.cft

docsort.cft docsort.cft famdef.cft

ffadd.cft in.cft moreinfo.cft

moreinfo.cft moreinfo.cft moreinfo.cft

profile.cft profile.cft profile.cft

Message: 2

profile.cft remfam.cft

MessageIDs 1

dociduti.cft

Messages 4

archive.cft autofile.cft autofile.cft

profile.cft

messages 1

dbback.cft

msgs 1

dbback.cft

3. Testing our methods

Our testing has included applying the metrics to a prototype application, reviewing the results for concept validity, and gathering reactions from developers. The prototype, with 140 highly varied dialog boxes, was a GE Information Services Electronic Data Interchange application. The user interface was written in Microsoft Visual Basic, independently of our efforts to create metrics for evaluating the spatial and textual aspects of displays. The prototype simulated typical actions to show what the user would see. It served as the portion of the functional specification to which the final product was designed. It was also used in an early usability test to confirm the design concepts.

A translator was written to convert Visual Basic .FRM files into the canonical format that could then be input to the evaluation program. Screen shots were also taken of all the dialog boxes in the prototype and printed out. Output from the metric evaluation program was then scanned for patterns and anomalies, which were compared to the screen printouts. Several iterations of generating output, comparing it to the screen printouts, and reworking the program took place until a stable and accurate set of metrics was produced. These metrics were then shown to developers and quality assurance people at GE Information Services for preliminary feedback.
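The translator itself is not included in this report. As a rough sketch, assuming the textual "Begin ... End" blocks and "Property = value" lines of Visual Basic form files of that era, a minimal converter might look like the following; the canonical field names repeat the hypothetical ones from the earlier illustration, and unit conversion (Visual Basic stores positions in twips rather than pixels) is omitted.

#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

int main(int argc, char* argv[]) {
    if (argc < 2) { std::cerr << "usage: frm2canon file.frm\n"; return 1; }
    std::ifstream in(argv[1]);
    // Stack of open "Begin <Type> <Name>" blocks with their properties.
    std::vector<std::pair<std::string, std::map<std::string, std::string> > > stack;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        std::string first;
        ss >> first;
        if (first == "Begin") {
            std::string type, name;
            ss >> type >> name;
            stack.push_back(std::make_pair(type + " " + name,
                                           std::map<std::string, std::string>()));
        } else if (first == "End" && !stack.empty()) {
            // Emit one canonical record as each widget (or form) block closes.
            std::map<std::string, std::string>& p = stack.back().second;
            std::cout << "WIDGET " << stack.back().first
                      << " LEFT " << p["Left"] << " TOP " << p["Top"]
                      << " WIDTH " << p["Width"] << " HEIGHT " << p["Height"]
                      << " LABEL " << p["Caption"] << "\n";
            stack.pop_back();
        } else if (!stack.empty()) {
            // Property lines look like:  Caption = "OK"
            std::istringstream kv(line);
            std::string key, eq;
            kv >> key >> eq;
            if (eq == "=") {
                std::string value;
                std::getline(kv, value);
                if (!value.empty() && value[0] == ' ') value.erase(0, 1);
                stack.back().second[key] = value;
            }
        }
    }
    return 0;
}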

We are in the process of repeating the testing with a GE Information Services commercial product. This application (also written in Visual Basic) contains a larger and more complex set of dialog boxes. The output of the metric evaluation program will then be given to a number of developers and quality assurance specialists for feedback.

4. Conclusions and future work

As the complexity of GUIs increases, developers are finding that they need more help in the analysis and testing of their designs. Quality assurance groups are also finding it increasingly difficult to test all aspects of current user interfaces adequately. Initial feedback on the metric evaluation program from these groups at GE Information Services indicates a definite perceived value in such tools. While the feedback was positive on the concept and initial output, several issues and suggestions were raised. In its current form, the output must be manually scanned for anomalies and patterns; there is a desire to have these highlighted by the tool.

For developers:

- prescriptive directions on how to correct interface problems. In some cases it is obvious what to do, such as when there are multiple capitalization strategies for the "Cancel" button or different fonts and button sizes are in use; easier still would be a message telling the developer which font and capitalization to use as each violation is encountered. The more difficult cases, such as widget density or gridedness, are more of a mystery.

- an interactive tool that displays problems as they are encountered. This may be beyond the processing speeds of most current PCs, but intuitively it seems better for developers to correct problems as they go rather than discover them in a printout and then find their way back to the corresponding location in the application.

- a tool that is usable in all of their development environments; i.e., if they are using C++ or one of the new cross-platform development tools, they want the same capability we demonstrated with Microsoft Visual Basic. This supports the canonical-format approach and implies the need for a translator from whichever environment they are using.

For Quality Assurance people:

- a summative tool that checks across the entire application and reports back those areas that have problems.

- an indicator of the severity of each problem. While any item uncovered is probably worth trying to resolve, quality assurance is more worried about items that have a large user impact than about those that do not.

- a check for consistency of wording and layout. Doing this manually is becoming very difficult: there are so many different things to look at that it is often a challenge just to make sure every dialog box has been seen. They would also like the evaluation tool to validate against the design specification.

There are many measures that were not attempted in this first effort. We needed to start someplace, and the metrics we chose represented a variety of things to look at so that we could test the concept. We know from this initial exploration that we are touching only the tip of the iceberg of display aspects that could be evaluated automatically. Where practical, assessment against industry standards should be undertaken. We need to expand the number of metrics to get better measures of usage consistency and, if possible, to get at conformity to an organization's "look and feel" for its products.

Perhaps the hardest part of future refinement of the evaluation tools is to provide "goodness" measures for the metric values. This is clearly what is needed by those not schooled in human factors. In many cases it is acceptable if there is not a clear, research-supported recommendation; educated judgments will suffice to provide the rules for development and to gain consistency across applications.

The next steps, now underway, are to develop a tool that supports some of the ways developers would like to use the metrics established thus far. A sample layout is described below.

The tool will allow developers and quality assurance specialists to view on-line summative metrics for multiple dialog boxes, as well as metrics for individual dialog boxes, all with anomaly highlighting. They would receive feedback on whatever dialog box has focus by pressing the "Analyze" button. Once the dialog box has been analyzed, reviewers could walk through each set of discrepancies or review the scores for that dialog box.

Our first attempts led to lengthy outputs of uncertain merit, but as we refined our choices of metrics the outputs became more provocative and productive. New ideas flowed more easily and new metrics, output formats, and theories of automated evaluation emerged. We are still at the beginning phases, but see that there is potential for these evaluation tools since they are quick and simple to apply, and they reveal interesting properties of complex designs.


ACKNOWLEDGEMENTS

We appreciate the support for this project from GE Information Services and the Maryland Industrial Partnerships program. We are grateful for draft comments from Vic Basili, Catherine Plaisant, and Anne Rose, and for programming assistance from Rohit Mahajan.

REFERENCES

Apple Computer, Inc. (1992), Macintosh Human Interface Guidelines, Addison-Wesley Publishing Co., Reading, MA.

Bodart, F., Hennebert, A.-M., Leheureux, J.-M., and Vanderdonckt, J. (1994), "Towards a dynamic strategy for computer-aided visual placement", In Catarci, T., Costabile, M., Levialdi, S., and Santucci, G. (Editors), Proc. Advanced Visual Interfaces Conference '94, ACM Press, New York, 78-87.

Brown, C. M. (1988), Human-Computer Interface Design Guidelines, Ablex Publishing Co., Norwood, NJ.

Byrne, M., Wood, S., Sukaviriya, P., Foley, J., and Kieras, D. (1994), "Automating Interface Evaluation", Proc. of CHI '94, ACM, New York, 232-237.

Chimera, R. and Shneiderman, B. (1993), "User interface consistency: An evaluation of original and revised interfaces for a videodisk library", In Sparks of Innovation in Human-Computer Interaction (B. Shneiderman, editor), Ablex Publishers, Norwood, NJ, 259-271.

Feiner, S. (1988), "A grid-based approach to automating display layout", Proc. of Graphics Interface '88, 192-197.

Galitz, W. O. (1989), Handbook of Screen Format Design: Third Edition, Q. E. D. Information Sciences, Inc., P. O. Box 181, Wellesley, MA 02181.

Hix, D. and Hartson, H. R. (1993), Developing User Interfaces: Ensuring Usability Through Product & Process, John Wiley & Sons, New York, NY.

IBM (1991), Systems Application Architecture: Common User Access, Advanced Interface Design Reference, IBM Document SC34-4290-00, Cary, NC.

Kim, W. and Foley, J. (1993), "Providing high-level control and expert assistance in the user interface presentation design", Proc. of CHI '93, ACM, New York, 430-437.

Marcus, A. (1992), Graphic Design for Electronic Documents and User Interfaces, ACM Press, New York, NY.

Sears, A. (1993), "Layout Appropriateness: A metric for evaluating user interface widget layouts", IEEE Transactions on Software Engineering 19, 7, 707-719.

Sears, A. (1994), "Using automated metrics to design and evaluate user interfaces", DePaul University Dept. of Computer Science Technical Report #94-002, Chicago, IL.

Shneiderman, B. (1992), Designing the User Interface: Strategies for Effective Human-Computer Interaction: Second Edition, Addison-Wesley Publ. Co., Reading, MA.

Streveler, D. and Wasserman, A. (1987), "Quantitative measures of the spatial properties of screen designs", Proc. of INTERACT '87, Elsevier Science, Amsterdam, 125-133.

Tullis, T. S. (1988a), "Screen design", In Helander, M. (Editor), Handbook of Human-Computer Interaction, Elsevier Science, Amsterdam, The Netherlands, 377-411.

Tullis, T. S. (1988b), "A system for evaluating screen formats: Research and application", In Hartson, H. R. and Hix, D. (Editors), Advances in Human-Computer Interaction: Volume 2, Ablex Publishing Corp., Norwood, NJ, 214-286.

Vanderdonckt, J. and Gillo, X. (1994), "Visual techniques for traditional and multimedia layouts", In Catarci, T., Costabile, M., Levialdi, S., and Santucci, G. (Editors), Proc. Advanced Visual Interfaces Conference '94, ACM Press, New York, 95-104.