Selecting appropriate methods and planning an evaluation are not trivial. Many factors need to be taken into account. Some are concerned with the stage of the system development at which feedback is required, the purpose of the evaluation and the kind of information that is needed; others are concerned with the practicalities of doing the actual evaluation such as time, the availability and involvement of users, specialist equipment, the expertise of the evaluators and so on.
In this article we compare the methods discussed in the previous articles in terms of the level of interface development required to use them, the involvement of users, the type of data collected and practical issues that influence evaluation practice.
Differences between methods
A number of factors need to be taken into account when you are deciding which kind of evaluation methods to use, including the purpose of the evaluation, the stage of system development, the involvement of users, the kind of data collected and how it is analyzed, and the practical constraints associated with actually doing the evaluation.
The purpose of the evaluation
The purpose of the evaluation is a key factor, which I have mentioned several times already. Here is a brief recap:
Engineering towards a target: is it good enough?
Comparing alternative designs: which is the best?
Understanding the real world: how well does it work in the real world?
Checking conformance to a standard: does this product conform to the standard?
The way that the data is collected and analyzed must be suitable for the practical and philosophical nature of the evaluation. For example, laboratory experiments are not suitable for understanding how users work in the real world. Similarly, ethnographic methods are not appropriate for testing conformance to a standard.
Stage of system development
Different methods are appropriate for different stages of system development. Some methods, like walkthroughs or keystroke-level analysis, can be done on a formal or semi-formal specification very early in design. Other kinds of evaluation, such as benchmarking, observation and monitoring, and interpretative studies, are usually done later on a prototype or working system. Benchmarking takes place in a laboratory setting, observation may take place in a laboratory or in the field, and interpretative studies always take place in natural field settings. Often quick informal evaluations are done in which ideas are tested out with users and where the setting is not important. Such studies take place early in design or at the time when decisions about screen design are being made. The important thing is that the ideas are at an early formative stage and the purpose of the evaluation is to get rapid feedback to improve them.
Involvement of users in the evaluation process
You can consider user involvement in three ways:
The participation of typical end users
The control that users have over their own tasks during the evaluation
The control that users have in running the evaluation
The more formal and scientific the evaluation, the less control users tend to have over both their own tasks and the evaluation procedure. Maximum user control occurs in interpretative studies, where users often work in their normal environments. Predictive evaluations make a number of assumptions about the cognitive operations of users, but there is no direct involvement with users except in the discount method, and even here involvement is low.
Type of data
The major distinction here is between qualitative and quantitative data. Quantitative data deal with either user performance or attitudes that can be recorded in a numerical form. Qualitative data focus on reports and opinions. Some data are inherently either quantitative or qualitative. For example, ethnographic data are qualitative whereas task completion time is quantitative. Other data, like a stream of video, can be either, depending on the purpose of the study and the way that it is analyzed. Questionnaire data are typically dealt with quantitatively in order to produce statements of the type: "x% of the users could not guess the meaning of the first icon". At the other extreme, qualitative data may be treated more holistically, as in the case of interpretative studies. Quantitative data have the advantage of allowing consistent, detailed analysis across the users tested, which can be validated statistically, but they do not contain qualitative "richness", as there is no account of individuals' responses, opinions and feelings. Qualitative data are limited to some form of descriptive account of the information gathered. However, in the case of interpretative studies the data may be very rich and may provide new insights into the way the technology is used, which will enable designers to produce more appropriate products.
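As a minimal sketch of how questionnaire answers turn into a statement of the "x% of the users" type, the following Python fragment summarizes a set of yes/no results. The function name and the sample data are hypothetical and serve only as an illustration; a real study would use its own coding scheme for responses.

```python
def percent_failing(responses):
    """Return the percentage of users whose answer was recorded as incorrect.

    `responses` is a list of booleans: True if the user answered correctly.
    """
    if not responses:
        raise ValueError("no responses to summarize")
    failures = sum(1 for correct in responses if not correct)
    return 100.0 * failures / len(responses)


# Hypothetical data: True means the user guessed the icon's meaning correctly.
icon_guesses = [True, False, False, True, False, True, True, False]

print(f"{percent_failing(icon_guesses):.0f}% of the users could not guess "
      "the meaning of the first icon")
```

A summary like this gives a consistent figure across all users tested, but it says nothing about why individual users failed, which is the qualitative "richness" that is lost.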
Some important practical constraints that need to be taken into account are:
Absence of specialist equipment for the evaluation, such as a video recorder or means of logging user interactions,
Lack of specialist skills, for example, for designing experiments, undertaking complex statistical analyses, or doing ethnographic work,
Time constraints concerned with conducting either the evaluation or the data analysis,
Access to users (if required) and the system, for example, restricted access to both the interface software (which means that evaluations involving changes to an interface are ruled out) and users required for the evaluation (who may be completely unavailable or accessible only for a very short period of time).
One way or another, many of these constraints are related to cost. If the evaluation is too expensive, for whatever reasons, it is unlikely to happen, especially in companies that still regard evaluation as an additional luxury to be included if time and cost permit. In order to estimate the cost of an evaluation it will be important to consider both data collection and data analysis.
Two issues affect the ease of data gathering. Firstly, the number of users needed is an important factor for empirical evaluations. These evaluations normally observe or test users individually, and user testing can take from several days up to a few months. The number of users required depends on the technique and the statistical analyses undertaken. Secondly, the size of the task set used, that is, the number of tasks and their complexity, must be considered. The size of the task set directly affects the time taken by each user and therefore the time spent on the whole evaluation. Experimental evaluations and usability benchmarking typically have an organized approach to the control of task structure, usually with a set number of tasks of equal complexity.
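The relationship between the number of users, the size of the task set and the time spent gathering data can be sketched roughly as follows. This is an illustrative estimate only: it assumes that users are tested one at a time, and the function name, the per-user setup allowance and all of the figures are hypothetical.

```python
def data_gathering_hours(num_users, task_minutes, setup_minutes_per_user=0.0):
    """Estimate total session time in hours for an empirical evaluation.

    num_users: number of users tested individually, one after another.
    task_minutes: expected duration of each task in the task set, in minutes.
    setup_minutes_per_user: assumed allowance for briefing and debriefing.
    """
    if num_users < 0:
        raise ValueError("num_users must not be negative")
    minutes_per_user = sum(task_minutes) + setup_minutes_per_user
    return num_users * minutes_per_user / 60.0


# Hypothetical study: 12 users, three tasks, 30 minutes of briefing per user.
hours = data_gathering_hours(12, [5, 10, 15], setup_minutes_per_user=30)
print(f"Estimated data gathering time: {hours:.1f} hours")
```

The estimate covers session time only. Recruiting and scheduling users, and any later analysis of the data, add to the overall cost.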
Before starting the evaluation, consideration needs to be given to the overheads associated with data analysis, which can become a very time-consuming process. Large surveys can also generate a huge volume of data, but there are now many software packages available to support analysis, statistical testing and report generation.
Each category of techniques also has its own advantages and disadvantages. The choice of evaluation technique depends, in part, on making the most of the potential benefits and being aware of, or possibly reducing, the disadvantages associated with it. Two other criteria that need to be considered are:
Technical criteria, which deal with the details of using the evaluation method,
Scope of the information needed, which is concerned with the relevance of the data collected.
Technical criteria are concerned with the type of information produced and issues relating to how that information is collected. Three technical criteria may be identified: validity, reliability and biases in data collection.
In the present context validity refers to whether an evaluation method is measuring what is required given the specified purpose of the evaluation, and it operates at two levels. Firstly, it is necessary to determine whether the method is valid for a particular purpose. Secondly, the validity of the measurements made must be considered.
The reliability or consistency of the measurement process is important. A reliable evaluation method is one that produces the same results on separate occasions under the same circumstances. Different evaluation methods vary in their reliability. Well-designed experiments, in which the tasks that the subjects perform and the selection of subjects are carefully controlled, tend to have high reliability. Observational evaluations tend to have much lower reliability.
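One common way to put a number on reliability, although not the only one, is to repeat the same measurement on two occasions and correlate the results. The sketch below computes a Pearson correlation coefficient in plain Python; the technique is a choice made here for illustration, and the task times are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Return the Pearson correlation between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 values")
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / (spread_x * spread_y)


# Hypothetical task completion times (seconds) for the same five users,
# measured in two sessions under the same conditions.
session_1 = [42.0, 55.0, 38.0, 61.0, 47.0]
session_2 = [44.0, 53.0, 40.0, 63.0, 45.0]

print(f"Test-retest correlation: {pearson_r(session_1, session_2):.2f}")
```

A coefficient close to 1 suggests that the measurement ranks users consistently across occasions; a low value suggests that the method, the setting or the tasks introduce variation of their own.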
Using an evaluation technique is not a neutral activity; there may be biases in data collection and analysis and in the deduction of information from those analyses. Basically, there are two main sources of bias:
Selective data gathering,
Manipulation of the evaluation situation.
In selective data gathering, attention is focused only on particular aspects of the information available, and this can distort the whole evaluation process. For example, the opinions of the experts used in an expert review may be heavily biased because of previous experience and knowledge. The evaluation can also be manipulated by both users and evaluators. For instance, in any situation where a user is prompted for a response, such as in verbal protocols and interviews, subtle influences can guide the respondent to give particular responses. Users can also respond in a calculated way and give a false impression, so that, for example, responses on questionnaires may not reflect true opinions. Other criteria that need to be considered are the scope of information needed and the ecological validity of an evaluation.
Scope of information needed
This refers to the completeness of the data collected in relation to eventual use of the system. There are two issues to consider in this respect: the limitations of the information that result from the evaluation technique, and the extent to which findings can be generalized to other situations. Evaluation methods vary considerably in their scope, depending on the amount of contact with users and the structure of the tasks that users perform. To increase their scope some evaluations use more than one evaluation method.
Ecological validity refers to the environment in which the evaluation takes place and the degree to which the evaluation may affect the results it obtains. Almost all empirical evaluations affect the situation and working practices they are trying to study. The degree to which this occurs depends on the level of intrusion into users' work and the control exercised over the users' tasks by the evaluators. Obtrusive techniques are those where users are constantly aware that their behavior is being monitored or where they have to interrupt their work to provide some information. When selecting an evaluation method, it is important to consider the artificiality created by the method and its possible effects on the evaluation itself. It is also important to take account of how artificial the test environment is when interpreting the findings of an evaluation. Because many laboratory tests are so unrealistic, many evaluators and design teams are starting to move away from traditional laboratory testing towards unobtrusive forms of observation. They are also trying to develop working practices in which designers and users work alongside each other so that users' opinions and needs can be taken into account throughout system development.