Correspondence: Anthony R. Artino, Jr., PhD, Associate Professor of Preventive Medicine and Biometrics, Uniformed Services University of the Health Sciences, 4301 Jones Bridge Road, Bethesda, MD 20814-4712, USA. Tel: +1-301-295-3693; E-mail: anthony.artino@usuhs.edu
In this AMEE Guide, we consider the design and development of self-administered surveys, commonly called questionnaires. Questionnaires are widely employed in medical education research. Unfortunately, the processes used to develop such questionnaires vary in quality and lack consistent, rigorous standards. Consequently, the quality of the questionnaires used in medical education research is highly variable. To address this problem, this AMEE Guide presents a systematic, seven-step process for designing high-quality questionnaires, with particular emphasis on developing survey scales. These seven steps do not address all aspects of survey design, nor do they represent the only way to develop a high-quality questionnaire. Instead, these steps synthesize multiple survey design techniques and organize them into a cohesive process for questionnaire developers of all levels. Addressing each of these steps systematically will improve the probability that survey designers accurately measure what they intend to measure.
Surveys are used throughout medical education. Examples include the ubiquitous student evaluation of medical school courses and clerkships, as well as patient satisfaction and student self-assessment surveys. In addition, survey instruments are widely employed in medical education research. In our recent review of original research articles published in Medical Teacher in 2011 and 2012, we found that 37 articles (24%) included surveys as part of the study design. Similarly, surveys are commonly used in graduate medical education research. Across the same two-year period (2011–2012), 75% of the research articles published in the Journal of Graduate Medical Education used surveys.
Questionnaires are widely used in medical education research, yet the processes employed to develop questionnaires vary in quality and lack consistent, rigorous standards.
This AMEE Guide introduces a systematic, seven-step design process for creating high-quality survey scales fit for program evaluation and research purposes.
The seven-step design process synthesizes the multiple techniques that survey designers employ into a cohesive process.
The survey design process described in this Guide includes the following seven steps: (1) conduct a literature review, (2) carry out interviews and/or focus groups, (3) synthesize the literature review and interviews/focus groups, (4) develop items, (5) collect feedback on the items through an expert validation, (6) employ cognitive interviews to ensure that respondents understand the items as intended and (7) conduct pilot testing.
This seven-step design process differs from previously described processes in that it blends input from other experts in the field with input from potential participants. In addition, this process front-loads the task of establishing validity by focusing heavily on careful item development.
Despite the widespread use of surveys in medical education, the medical education literature provides limited guidance on the best way to design a survey (Gehlbach et al. 2010). Consequently, many surveys fail to use rigorous methodologies or “best practices” in survey design. As a result, the reliability of the scores that emerge from surveys is often inadequate, as is the validity of the scores’ intended interpretation and use. Stated another way, when surveys are poorly designed, they may fail to capture the essence of what the survey developer is attempting to measure due to different types of measurement error. For example, poor question wording, confusing question layout and inadequate response options can all affect the reliability and validity of the data from surveys, making it extremely difficult to draw useful conclusions (Sullivan 2011). With these problems as a backdrop, our purpose in this AMEE Guide is to describe a systematic process for developing and collecting reliability and validity evidence for survey instruments used in medical education and medical education research. In doing so, we hope to provide medical educators with a practical guide for improving the quality of the surveys they design for evaluation and research purposes.
The term “survey” is quite broad and could include the questions used in a phone interview, the set of items employed in a focus group and the questions on a self-administered patient survey (Dillman et al. 2009). Although the processes described in this AMEE Guide can be used to improve all of the above, we focus primarily on self-administered surveys, which are often referred to as questionnaires. For most questionnaires, the overarching goals are to develop a set of items that every respondent will interpret the same way, respond to accurately and be willing and motivated to answer. The seven steps depicted in Table 1, and described below, do not address all aspects of survey design, nor do they represent the only way to develop a high-quality questionnaire. Rather, these steps consolidate and organize the plethora of survey design techniques that exist in the social sciences and guide questionnaire developers through a cohesive process. Addressing each step systematically will optimize the quality of medical education questionnaires and improve the chances of collecting high-quality survey data.
A seven-step, survey scale design process for medical education researchers.
Step | Purpose |
---|---|
1. Conduct a literature review | To ensure that the construct definition aligns with relevant prior research and theory and to identify existing survey scales or items that might be used or adapted |
2. Conduct interviews and/or focus groups | To learn how the population of interest conceptualizes and describes the construct of interest |
3. Synthesize the literature review and interviews/focus groups | To ensure that the conceptualization of the construct makes theoretical sense to scholars in the field and uses language that the population of interest understands |
4. Develop items | To ensure items are clear, understandable and written in accordance with current best practices in survey design |
5. Conduct expert validation | To assess how clear and relevant the items are with respect to the construct of interest |
6. Conduct cognitive interviews | To ensure that respondents interpret items in the manner that the survey designer intends
7. Conduct pilot testing | To check for adequate item variance, reliability and convergent/discriminant validity with respect to other measures |
Adapted with permission from Lippincott Williams and Wilkins/Wolters Kluwer Health: Gehlbach et al. (2010). AM last page: Survey development guidance for medical education researchers. Acad Med 85:925.
Questionnaires are good for gathering data about abstract ideas or concepts that are otherwise difficult to quantify, such as opinions, attitudes and beliefs. In addition, questionnaires can be useful for collecting information about behaviors that are not directly observable (e.g. studying at home), assuming respondents are willing and able to report on those behaviors. Before creating a questionnaire, however, it is imperative to first decide if a survey is the best method to address the research question or construct of interest. A construct is the model, idea or theory that the researcher is attempting to assess. In medical education, many constructs of interest are not directly observable – student satisfaction with a new curriculum, patients’ ratings of their physical discomfort, etc. Because documenting these phenomena requires measuring people’s perceptions, questionnaires are often the most pragmatic approach to assessing these constructs.
In medical education, many constructs are well suited for assessment using questionnaires. However, because psychological, non-observable constructs such as teacher motivation, physician confidence and student satisfaction do not have a commonly agreed upon metric, they are difficult to measure with a single item on a questionnaire. In other words, for some constructs such as weight or distance, almost everyone agrees on the units and the approach to measurement, and so a single measurement may be adequate. However, for non-observable, psychological constructs, a survey scale is often required for more accurate measurement. Survey scales are groups of similar items on a questionnaire designed to assess the same underlying construct (DeVellis 2003). Although scales are more difficult to develop and take longer to complete, they offer researchers many advantages. In particular, scales more completely, precisely and consistently assess the underlying construct (McIver & Carmines 1981). Thus, scales are commonly used in many fields, including medical education, psychology and political science. As an example, consider a medical education researcher interested in assessing medical student satisfaction. One approach would be to simply ask one question about satisfaction (e.g. How satisfied were you with medical school?). A better approach, however, would be to ask a series of questions designed to capture the different facets of this satisfaction construct (e.g. How satisfied were you with the teaching facilities? How effective were your instructors? How easy was the scheduling process?). Using this approach, a mean score of all the items within a particular scale can be calculated and used in the research study.
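To make the idea of a scale score concrete, the short sketch below shows one common way to compute such a mean score. It is an illustration only: the item names, the 1–5 scoring and the response data are hypothetical, and the calculation is not a prescribed analysis.

```python
# Hypothetical responses to a three-item satisfaction scale, each item
# scored 1-5 (e.g. 1 = not at all satisfied, 5 = extremely satisfied).
# One value per respondent, in the same order for every item.
responses = {
    "satisfaction_facilities": [4, 5, 3],
    "satisfaction_instructors": [5, 5, 4],
    "satisfaction_scheduling": [3, 4, 2],
}

n_respondents = len(next(iter(responses.values())))

# A simple scale score: the mean of each respondent's answers across all items.
scale_scores = [
    sum(item_values[i] for item_values in responses.values()) / len(responses)
    for i in range(n_respondents)
]

print(scale_scores)  # one mean satisfaction score per respondent
```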
Because of the benefits of assessing these types of psychological constructs through scales, the survey design process that we now turn to will focus particularly on the development of scales.
The first step to developing a questionnaire is to perform a literature review. There are two primary purposes for the literature review: (1) to clearly define the construct and (2) to determine if measures of the construct (or related constructs) already exist. A review of the literature helps to ensure the construct definition aligns with related theory and research in the field, while at the same time helping the researcher identify survey scales or items that could be used or adapted for the current purpose (Gehlbach et al. 2010).
Formulating a clear definition of the construct is an indispensable first step in any validity study (Cook & Beckman 2006). A good definition will clarify how the construct is positioned within the existing literature, how it relates to other constructs and how it is different from related constructs (Gehlbach & Brinkworth 2011). A well-formulated definition also helps to determine the level of abstraction at which to measure a given construct (the so-called “grain size”, as defined by Gehlbach & Brinkworth 2011). For example, to examine medical trainees’ confidence to perform essential clinical skills, one could develop scales to assess their confidence to auscultate the heart (at the small-grain end of the spectrum), to conduct a physical exam (at the medium-grain end of the spectrum) or to perform the clinical skills essential to a given medical specialty (at the large-grain end of the spectrum).
Although many medical education researchers prefer to develop their own surveys independently, it may be more efficient to adapt an existing questionnaire – particularly if the authors of the existing questionnaire have collected validity evidence in previous work – than it is to start from scratch. When this is the case, a request to the authors to adapt their questionnaire will usually suffice. It is important to note, however, that the term “previously validated survey” is a misnomer. The validity of the scores that emerge from a given questionnaire or survey scale is sensitive to the survey’s target population, the local context and the intended use of the scale scores, among other factors. Thus, survey developers collect reliability and validity evidence for their survey scales in a specified context, with a particular sample, and for a particular purpose.
As described in the Standards for Educational and Psychological Testing, validity refers to the degree to which evidence and theory support a measure’s intended use (AERA, APA, & NCME 1999). The process of validation is the most fundamental consideration in developing and evaluating a measurement tool, and the process involves the accumulation of evidence across time, settings and samples to build a scientifically sound validity argument. Thus, establishing validity is an ongoing process of gathering evidence (Kane 2006). Furthermore, it is important to acknowledge that reliability and validity are not properties of the survey instrument, per se, but of the survey’s scores and their interpretations (AERA, APA, & NCME 1999). For example, a survey of trainee satisfaction might be appropriate for assessing aspects of student well-being, but such a survey would be inappropriate for selecting the most knowledgeable medical students. In this example, the survey did not change, only the score interpretation changed (Cook & Beckman 2006).
Many good reasons exist to use, or slightly adapt, an existing questionnaire. By way of analogy, we can compare this practice to a physician who needs to decide on the best medical treatment. The vast majority of clinicians do not perform their own comparative research trials to determine the best treatments to use for their patients. Rather, they rely on the published research, as it would obviously be impractical for clinicians to perform such studies to address every disease process. Similarly, medical educators cannot develop their own questionnaires for every research question or educational intervention. Just like clinical trials, questionnaire development requires time, knowledge, skill and a fair amount of resources to accomplish correctly. Thus, an existing, well-designed questionnaire can often permit medical educators to put their limited resources elsewhere.
Continuing with the clinical research analogy, when clinicians identify a research report that is relevant to their clinical question, they must decide if it applies to their patient. Typically, this includes determining if the relationships identified in the study are causal (internal validity) and if the results apply to the clinician’s patient population (external validity). In a similar way, questionnaires identified in a literature search must be reviewed critically for validity evidence and then analyzed to determine if the questionnaire could be applied to the educator’s target audience. If survey designers find scales that closely match their construct, context and proposed use, such scales might be useable with only minor modification. In some cases, the items themselves might not be well written, but the content of the items might be helpful in writing new items (Gehlbach & Brinkworth 2011). Making such determinations will be easier the more the survey designer knows about the construct (through the literature review) and the best practices in item writing (as described in Step 4).
Once the literature review has shown that it is necessary to develop a new questionnaire, and helped to define the construct, the next step is to ascertain whether the conceptualization of the construct matches how prospective respondents think about it (Gehlbach & Brinkworth 2011). In other words, do respondents include and exclude the same features of the construct as those described in the literature? What language do respondents use when describing the construct? To answer these questions and ensure the construct is defined from multiple perspectives, researchers will usually want to collect data directly from individuals who closely resemble their population of interest.
To illustrate this step, another clinical analogy might be helpful. Many clinicians have had the experience of spending considerable time developing a medically appropriate treatment regimen, only to find that the patient does not comply with it (e.g. because the treatment is too expensive). The clinician and patient then must develop a new plan that is acceptable to both. Had the patient’s perspective been considered earlier, the original plan would likely have been more effective. Many clinicians have also experienced difficulty treating a patient, only to have a peer reframe the problem, which subsequently leads to a better approach to treatment. A construct is no different. To this point, the researcher developing the questionnaire, like the clinician treating the patient, has given a great deal of thought to defining the construct. However, the researcher unavoidably brings his/her perspectives and biases to this definition, and the language used in the literature may be technical and difficult to understand. Thus, other perspectives are needed. Most importantly, how does the target population (the patient from the previous example) conceptualize and understand the construct? Just like the patient example, these perspectives are sometimes critical to the success of the project. For example, in reviewing the literature on student satisfaction with medical school instruction, a researcher may find no mention of the instructional practice of providing students with video or audio recordings of lectures (as these practices are fairly new). However, in talking with students, the researcher may find that today’s students are accustomed to such practices and consider them when forming their opinions about medical school instruction.
In order to accomplish Step 2 of the design process, the survey designer will need input from prospective respondents. Interviews and/or focus groups provide a sensible way to get this input. Irrespective of the approach taken, this step should be guided by two main objectives. First, researchers need to hear how participants talk about the construct in their own words, with little to no prompting from the researcher. Following the collection of unprompted information from participants, the survey designers can then ask more focused questions to evaluate if respondents agree with the way the construct has been characterized in the literature. This procedure should be repeated until saturation is reached; this occurs when the researcher is no longer hearing new information about how potential respondents conceptualize the construct (Gehlbach & Brinkworth 2011). The end result of these interviews and/or focus groups should be a detailed description of how potential respondents conceptualize and understand the construct. These data will then be used in Steps 3 and 4.
At this point, the definition of the construct has been shaped by the medical educator developing the questionnaire, the literature and the target audience. Step 3 seeks to reconcile these definitions. Because the construct definition directs all subsequent steps (e.g. development of items), the survey designer must take care to perform this step properly.
One suitable way to conduct Step 3 is to develop a comprehensive list of indicators for the construct by merging the results of the literature review and interviews/focus groups (Gehlbach & Brinkworth 2011). When these data sources produce similar lists, the process is uncomplicated. When these data are similar conceptually, but the literature and potential respondents describe the construct using different terminology, it makes sense to use the vocabulary of the potential respondents. For example, when assessing teacher confidence (sometimes referred to as teacher self-efficacy), it is probably more appropriate to ask teachers about their “confidence in trying out new teaching techniques” than to ask them about their “efficaciousness in experimenting with novel pedagogies” (Gehlbach et al. 2010). Finally, if an indicator is included from one source but not the other, most questionnaire designers will want to keep the item, at least initially. In later steps, designers will have opportunities to determine, through expert reviews (Step 5) and cognitive interviews (Step 6), if these items are still appropriate to the construct. Whatever the technique used to consolidate the data from Steps 1 and 2, the final definition and list of indicators should be comprehensive, reflecting both the literature and the opinions of the target audience.
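As a schematic illustration of this synthesis, the sketch below merges two invented indicator lists, one drawn from the literature and one from interviews/focus groups, keeping indicators that appear in only one source and recording where each came from. The construct, the indicator wording and the approach are purely hypothetical, offered as one way to organize the merge rather than a required procedure.

```python
# Hypothetical indicators for a "satisfaction with instruction" construct.
from_literature = {"clarity of lectures", "quality of feedback", "approachability of faculty"}
from_interviews = {"clarity of lectures", "quality of feedback", "availability of lecture recordings"}

# Keep everything initially; later steps (expert validation, cognitive
# interviews) determine whether single-source indicators remain.
all_indicators = from_literature | from_interviews

for indicator in sorted(all_indicators):
    sources = [name for name, group in
               [("literature", from_literature), ("interviews", from_interviews)]
               if indicator in group]
    print(f"{indicator}: {', '.join(sources)}")
```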
It is worth noting that scholars may have good reasons to settle on a final construct definition that differs from what is found in the literature. However, when this occurs, it should be clear exactly how and why the construct definition is different. For example, is the target audience’s perception different from that described in previous work? Does a new educational theory apply? Whatever the reason, this justification will be needed for publication of the questionnaire. Having an explicit definition of the construct, with an explanation of how it differs from other versions of the construct, will help peers and researchers alike decide how best to use the questionnaire, both in comparison with previous studies and in the development of new areas of research.
The goal of this step is to write survey items that adequately represent the construct of interest in a language that respondents can easily understand. One important design consideration is the number of items needed to adequately assess the construct. There is no easy answer to this question. The ideal number of items depends on several factors, including the complexity of the construct and the level at which one intends to assess it (i.e. the grain size). In general, it is good practice to develop more items than will ultimately be needed in the final scale (e.g. developing 15 potential items in the hopes of ultimately creating an eight-item scale), because some items will likely be deleted or revised later in the design process (Gehlbach & Brinkworth 2011). Ultimately, deciding on the number of items is a matter of professional judgment, but for most narrowly defined constructs, scales containing from 6 to 10 items will usually suffice in reliably capturing the essence of the phenomenon in question.
The next challenge is to write a set of clear, unambiguous items using the vocabulary of the target population. Although some aspects of item-writing remain an art form, an increasingly robust science and an accumulation of best practices should guide this process. For example, writing questions rather than statements, avoiding negatively worded items and biased language, matching the item stem to the response anchors and using response anchors that emphasize the construct being measured rather than employing general agreement response anchors (Artino et al. 2011) are all well-documented best practices. Although some medical education researchers may see these principles as “common sense”, experience tells us that these best practices are often violated.
Reviewing all the guidelines for how best to write items, construct response anchors and visually design individual survey items and entire questionnaires is beyond the scope of this AMEE Guide. As noted above, however, there are many excellent resources on the topic (e.g. DeVellis 2003; Dillman et al. 2009; Fowler 2009). To assist readers in grasping some of the more important and frequently ignored best practices, Table 2 presents several item-writing pitfalls and offers solutions.
Item-writing “best practices” based on scientific evidence from questionnaire design research.
Pitfall | Example(s) of the pitfall | Why it’s a problem | Solution(s) | Improved example(s) | References |
---|---|---|---|---|---|
Creating a double- barreled item | – How often do you talk to your nurses and administrative staff when you have a problem? | Respondents have trouble answering survey items that contain more than one question (and thus could have more than one answer). In this example, the respondent may talk to his nurses often but talk to administrative staff much less frequently. If this were the case, the respondent would have a difficult time answering the question. Survey items should address one idea at a time. | When you have multiple questions/premises within a given item, either (1) create multiple items for each question that is important or (2) include only the more important question. Be especially wary of conjunctions in your items. | – How often do you talk to your nurses when you have a problem? – How often do you talk to your administrative staff when you have a problem? | Tourangeau et al. 2000; Dillman et al. 2009 |
Creating a negatively worded item | – In an average week, how many times are you unable to start class on time? – The chief resident should not be responsible for denying admission to patients. | Negatively worded survey items are challenging for respondents to comprehend and answer accurately. Double negatives are particularly problematic and increase measurement error. If a respondent has to say “yes” in order to mean “no” (or “agree” in order to “disagree”), the item is flawed. | Make sure “yes” means yes and “no” means no. This generally means wording items positively. | – In an average week, how many times do you start class on time? – Should the chief resident be responsible for admitting patients? | Dillman et al. 2009 |
Using statements instead of questions | I am confident I can do well in this course. • Not at all true • A little bit true • Somewhat true • Mostly true • Completely true | A survey represents a conversation between the surveyor and the respondents. To make sense of survey items, respondents rely on “the tacit assumptions that govern the conduct of conversation in everyday life” (Schwarz 1999). Only rarely do people engage in rating statements in their everyday conversations. | Formulate survey items as questions. Questions are more conversational, more straightforward and easier to process mentally. People are more practiced at responding to them. | How confident are you that you can do well in this course? • Not at all confident • Slightly confident • Moderately confident • Quite confident • Extremely confident | Krosnick 1999; Schwarz 1999; Tourangeau et al. 2000; Dillman et al. 2009 |
Using agreement response anchors | The high cost of health care is the most important issue in America today. • Strongly disagree • Disagree • Neutral • Agree • Strongly agree | Agreement response anchors do not emphasize the construct being measured and are prone to acquiescence (i.e. the tendency to endorse any assertion made in an item, regardless of its content). In addition, agreement response options may encourage respondents to think through their responses less thoroughly while completing the survey. | Use construct-specific response anchors that emphasize the construct of interest. Doing so reduces acquiescence and keeps respondents focused on the construct in question; this results in less measurement error. | How important is the issue of high healthcare costs in America today? • Not at all important • Slightly important • Moderately important • Quite important • Extremely important | Krosnick 1999; Tourangeau et al. 2000; Dillman et al. 2009 |
Using too few or too many response anchors | How useful was your medical school training in clinical decision making? • Not at all useful • Somewhat useful • Very useful | The number of response anchors influences the reliability of a set of survey items. Using too few response anchors generally reduces reliability. There is, however, a point of diminishing returns beyond which more response anchors do not enhance reliability. | Use five or more response anchors to achieve stable participant responses. In most cases, using more than seven to nine anchors is unlikely to be meaningful to most respondents and will not improve reliability. | How useful was your medical school training in clinical decision making? • Not at all useful • Slightly useful • Moderately useful • Quite useful • Extremely useful | Weng 2004 |
Adapted with permission from Lippincott Williams and Wilkins/Wolters Kluwer Health: Artino et al. 2011. AM last page: Avoiding five common pitfalls in survey design. Acad Med 86:1327.
Another important part of the questionnaire design process is selecting the response options that will be used for each item. Closed-ended survey items can have unordered (nominal) response options, which have no natural order, or ordered (ordinal) response options. Moreover, survey items can ask respondents to complete a ranking task (e.g. “rank the following items, where 1 = best and 6 = worst”) or a rating task that asks them to select an answer on a Likert-type response scale. Although it is outside the scope of this AMEE Guide to review all of the response options available, questionnaire designers are encouraged to tailor these options to the construct(s) they are attempting to assess (and to consult one of the many outstanding resources on the topic; e.g. Dillman et al. 2009; McCoach et al. 2013). To help readers understand some frequently ignored best practices, Table 2 and Figure 1 present several common mistakes designers commit when writing and formatting their response options. In addition, because Likert-type response scales are by far the most popular way of collecting survey responses – due, in large part, to their ease of use and adaptability for measuring many different constructs (McCoach et al. 2013) – Table 3 provides several examples of five- and seven-point response scales that can be used when developing Likert-scaled survey instruments.
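As a further illustrative sketch (not a reproduction of Table 3), the snippet below shows how a construct-specific five-point response scale, such as the confidence anchors shown in Table 2, might be coded numerically before computing scale scores. The anchor wording follows the Table 2 example, while the 1–5 codes and the respondent answers are hypothetical conventions rather than a standard.

```python
# Example numeric coding of a construct-specific five-point response scale.
# The anchor wording follows the confidence example in Table 2; the 1-5
# codes and the answers below are illustrative, not a fixed standard.
confidence_anchors = {
    "Not at all confident": 1,
    "Slightly confident": 2,
    "Moderately confident": 3,
    "Quite confident": 4,
    "Extremely confident": 5,
}

# Convert one respondent's verbal answers (hypothetical) into item scores.
answers = ["Quite confident", "Moderately confident", "Extremely confident"]
item_scores = [confidence_anchors[answer] for answer in answers]
print(item_scores)  # [4, 3, 5]
```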