This book is distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits noncommercial use and distribution provided the original author(s) and source are credited. (See https://creativecommons.org/licenses/by-nc-nd/4.0/.)
Structured Abstract
Background:
Discrete choice experiments (DCEs) measure preferences by presenting choice tasks in which respondents choose a preferred alternative. We identified 4 knowledge gaps related to DCE design and analysis: (1) Current trends (since 2012) in design, analysis, and reporting are unknown. (2) DCE design decisions can affect findings by influencing respondent behavior, and task complexity and respondent fatigue may play important roles. These and other problems, such as sampling error, selection bias, and unmeasured interactions, can reduce the usefulness of DCE findings, but their combined effects are unknown. (3) Random parameter logit (RPL) models, commonly used in DCEs, are sensitive to the number of Halton draws (numeric sequences) used for simulation to estimate parameters, but little detail is available about the effects of number of draws on results with different numbers of random parameters. (4) The multinomial logit model, also commonly used in DCEs, does not account for repeated observations from the same individual and therefore underestimates standard errors. It is not known whether bootstrapping at the level of the individual can be used to generate correct confidence intervals for multinomial logit models in the presence of preference heterogeneity.
Objectives:
This report describes a project to explore issues with DCE design, analysis, and reporting. (1) We collaborated on a systematic review to capture the state of the science of health-related DCEs. We then conducted computer simulations to (2) improve understanding of the effects of selected DCE design features and statistical model assumptions on DCE results, (3) demonstrate problems in RPL estimation with numerous random parameters and inadequate numbers of Halton draws, and (4) explore the use of bootstrap methods to improve variance estimation when using fixed-effect multinomial logit models in lieu of RPL models.
Methods:
We systematically reviewed health-related DCEs published in 2013 to 2017. We implemented 864 simulation scenarios based on data from 2 actual DCEs to examine the effects of DCE design features and model assumptions on bias, variance, and overall error. We analyzed real and simulated data to demonstrate problems with RPL estimation using numerous random parameters and different numbers of Halton draws. Finally, using simulated data that reflect preference heterogeneity, we ran multinomial logit models with both bootstrapped and conventional variance estimation.
Results:
(1) Our review found growing use of health-related DCEs (an average of 60 per year from 2013-2017) with increasingly sophisticated methods (design, software, and econometric models), as well as inadequate reporting of methodologic details (eg, incorporation of interactions into the study design, use of blocking, method used to create choice sets, distributional assumptions, number of draws, use of internal validity testing). (2) In our simulations involving DCE design features and model assumptions, problems in the pilot phase tended to have little effect on the main DCE results; however, in 1 of the 2 study settings, problems in the pilot and main DCEs had widespread effects on bias and variance estimation, possibly related to correlations among attributes. (3) In analyses of real DCE data, the stability of RPL results depended on the number of Halton draws used for a given number of random parameters; some parameters, especially deviation parameters, failed to stabilize even with 20 000 draws. (4) Compared with traditional variance estimation, bootstrapping in simple DCE data sets with 3 random parameters did not yield confidence intervals with actual coverage closer to the nominal (95%) coverage.
Conclusions:
Reporting DCE design and analysis methods in greater detail would strengthen health-related DCEs. Reporting guidelines may be a means to that end. Small problems in a pilot study may not have drastic effects on the main DCE results, but certain DCE designs (eg, those with correlated attributes) may require special care, such as added sensitivity analyses. In settings with numerous random parameters, RPL models should use more Halton draws or perhaps a different type of draws to produce valid findings that can support good health and health policy decisions. Sensitivity analyses can increase confidence in estimating a given number of random parameters with a particular number of draws. Although bootstrapping is known to correct variance estimates in settings with correlated errors, bootstrapping should not be used as an alternative to traditional variance estimation for multinomial logit models in the presence of preference heterogeneity.
Limitations:
Our conclusions largely depend on our specific source data sets and simulation parameters and should therefore be replicated under a wide range of conditions. For example, our simulations involving DCE design features and model assumptions were complex and included 1 study setting (study 1) that was not typical of recent health-related DCEs. Also, our simulated random parameters followed the normal distribution, which does not apply to all DCE parameters. Finally, our findings on Halton draws in RPL estimation enable us to make only general suggestions, not to specify precise numeric thresholds related to the number of random parameters to estimate or the number of draws to use.
Background
Patient centeredness requires capturing patient preferences, and a well-known method of doing so is the discrete choice experiment (DCE).1 Based on economic theory, DCEs are common in the marketing field and have become increasingly popular in health economics.2 DCEs typically present multiple scenarios, or choice tasks; in each task, respondents must indicate which alternatives they prefer. By capturing stakeholders' stated choices in these repeated tasks, DCEs enable researchers to measure quantitatively the relative value (“utility”) that stakeholders place on each alternative (eg, health care intervention or potential research study) and attribute (eg, intervention or study characteristic).
As models of individual decision-making, DCEs rely on random utility maximization (RUM) theory. Following work by Thurstone and Hull,3,4 Marschak5 developed RUM, and McFadden6 expressed the theory in the form of discrete choice models.7 The theory and models are based on the idea that people choose among alternatives in a way that maximizes the utility, or value, that can be derived.
DCEs rest on the assumption that actual behavior will be consistent with stated preferences. To the extent that this assumption holds true, DCEs can help health services researchers engage with stakeholders in formulating research questions, defining intervention and comparator conditions, and selecting study outcomes. Therefore, DCEs can be an important tool for the design of health care interventions. In addition, DCE findings may have a growing impact on health policy because regulatory agencies in Europe and the United States are beginning to encourage or require the incorporation of patient preferences into risk-benefit assessments.8-10 The expanding role of DCEs in health care decision-making means that improvements in DCE study design and reporting can strengthen the incorporation of stakeholder preferences into health care interventions and policies.
This report describes a project to explore issues with DCE design, analysis, and reporting. The intended audience includes DCE researchers and consumers of DCE research who have advanced technical knowledge. In completing this project, we worked with 3 stakeholders who are internationally known experts in health-related DCEs. (Details about stakeholder engagement appear in the following section.) First, (1) we collaborated with a stakeholder and 2 additional researchers on a systematic review to capture the state of the science of health-related DCEs. We describe the systematic review in a later section of this report by incorporating the resulting published peer-reviewed journal article. After completing the review, we conducted computer simulations to (2) improve understanding of the effects of DCE design features and statistical model assumptions on DCE results; (3) demonstrate problems in random parameter logit (RPL) estimation with numerous random parameters and inadequate numbers of Halton draws; and (4) explore the use of bootstrap methods to improve variance estimation when using fixed-effect multinomial logit models in lieu of RPL models. The remainder of this section provides background information related to the last 3 study components (those involving simulations) and describes the organization of the report.
Effects of DCE Design Features and Statistical Model Assumptions on DCE Results
DCE designs possess several features, including the number of choice tasks, the number of alternatives (eg, health care interventions or diagnostic procedures) compared in each task, the number of attributes (eg, out-of-pocket cost, wait time) per alternative, and the range of attribute levels (eg, $0 to $100 for out-of-pocket cost). Formulas and specialized software help the analyst compare different DCE designs with regard to statistical efficiency and directly calculate (given a particular DCE design and analysis model, the expected parameter values, and the desired power and significance levels) the minimum sample size required to estimate parameters.7,11 However, despite the apparent ease of designing a DCE with adequate sample size for parameter estimation or with high overall statistical efficiency, DCE design generally does not account for “response efficiency”: the amount of error associated with inconsistent respondent behavior. Specifically, DCE analysis models usually assume that respondents will use all available information and resolve choice tasks in a consistent way, regardless of the DCE design. However, because DCE design features affect task complexity and the level of cognitive effort required, some designs can lead respondents to become fatigued, miss relevant information, and/or use heuristics that ignore some attributes (ie, lexicographic behavior). These problems can result in biased parameter estimates and standard errors, thus rendering current study design and sample size determination methods less useful.12 For example:
- Increasing the number of attributes per alternative increases error variance,13-15 and the number of attributes may strongly influence variance relative to other design dimensions.13,15 Possible explanations are that the increased task complexity associated with larger numbers of attributes may introduce random error because of fatigue or excessive cognitive effort or may lead respondents to use heuristics.16 Taking into account the effect that the number of attributes has on variance, changing the number of attributes may also affect parameter estimates14 or willingness-to-pay estimates.17
- As the number of choice tasks increases, error variance may initially decrease because of learning, and then increase because of fatigue.13,18 However, 1 study found a positive association,19 and 1 found no association.14 Two studies found effects on willingness to pay,17,19 and 2 found no effect on willingness to pay13 or (taking into account the effect of the number of tasks on variance) on parameter estimates.14
Clearly, DCE design decisions can affect study findings by influencing respondent behavior, and task complexity and respondent fatigue may play important roles. DCE designs can account for common information-processing strategies. For example, Rose and Bliemer20 demonstrated DCE design generation with priors associated with the probability of ignoring attributes. However, this method has not been used in practice.
Other key issues in DCE design include sample size determination, the potential for selection bias, and the potential for unmeasured interactions among attributes. For example, using a specific DCE design and with point estimates from an actual DCE as the expected parameter values, de Bekker-Grob and colleagues11 estimated that the sample size required to estimate several parameters in a simplistic multinomial logit (MNL) model with 80% power and a significance level of α = .05 ranged from a high of 190 to low values between 3 and 7. (For a given DCE design, analysis model, significance level, and level of statistical power, the sample size required for an estimate statistically distinguishable from 0 depends only on the analyst's belief about the actual parameter value11; therefore, a wide range of sample size estimates across different parameters is not unusual.) Such low sample size estimates lack face validity. In addition, when the analysis model includes indicator variables, incorrect sample size determination can lead to problems with model identification (eg, insufficient sample size to estimate interaction effects).21 Together, these problems can make it difficult to design a DCE to yield parameter estimates that meaningfully describe the preferences of a policy-relevant target population. In this sense, the parameter estimates, standard errors, confidence intervals, and statistical significance levels from DCE models may sometimes be misleading. We conducted a series of computer simulations according to the standard 2-stage process for DCEs (a pilot study to obtain preliminary parameter estimates and refine the study design, followed by a main DCE study). By simulating this process under varying conditions, we sought to describe the compound effects of DCE design features and assumption violations on DCE results and to identify potential improvements in the DCE design process.
RPL Estimation With DCE Data
Increasingly, researchers analyze health-related DCE data by using the RPL model, which appeared in 39% (118/301) of recent publications.22 Compared with the traditional MNL model, the RPL model adds an important degree of modeling flexibility by accounting for the panel nature of DCE data (ie, repeated measurements from the same individuals), accounting for unobserved preference heterogeneity between individuals, and allowing flexible specification of parameter distributions. This modeling flexibility may yield better translation of population preferences into health care and health policy decisions by facilitating more accurate portrayal of population preferences and more accurate characterization of the level of confidence in estimating those preferences.
In practice, implementations of the RPL model often assume independent (uncorrelated) random parameters with specific distributions.7 To estimate these parameters, RPL analysis usually follows the simulated maximum likelihood approach, which uses sequences of numbers to simulate multiple potential distributions (“draws”) for each random parameter. (For example, if parameter b1 represents the utility of travel time for an appointment, and the model assumes that b1 varies randomly across individuals according to the normal distribution, then an estimation process with 100 draws would involve simulating 100 possible distributions of b1 for the study sample, where a “distribution” includes a value for each individual in the sample.) For this simulation, health-related DCEs almost universally use Halton sequences, which are systematic sequences of numbers between 0 and 1, generated using prime numbers. Mapping Halton sequences onto the assumed (eg, normal) distribution of a parameter yields well-spaced (ie, nearly uniformly spaced) draws from that distribution, which speeds estimation.7,23 (Intuitively, finding a dropped key in a field with no prior information about the key's location would happen more quickly with a uniform grid search, on average, than with a search that focused more on certain parts of the field. Similar logic applies to finding parameter estimates with the RPL model.)
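As a minimal sketch of this mapping (with illustrative values rather than estimates from any study discussed in this report), the following R code generates Halton draws and converts them to draws from an assumed normal distribution for a single random parameter such as b1:

```r
# Minimal sketch: Halton draws standing in for random draws from an assumed
# normal parameter distribution. The mean, SD, and number of draws are
# illustrative values only.
library(randtoolbox)

n_draws <- 100
h <- halton(n_draws, dim = 1)      # Halton sequence values in (0, 1)

b1_mean <- -0.05                   # assumed mean utility of travel time
b1_sd   <- 0.02                    # assumed SD across individuals
b1_draws <- qnorm(h, mean = b1_mean, sd = b1_sd)  # map (0, 1) values onto N(mean, SD)

# Because the Halton values cover (0, 1) nearly uniformly, fewer draws are
# needed than with runif() to approximate the assumed distribution.
summary(b1_draws)
```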
Although Halton draws can make estimation proceed more quickly than it would with pseudo-random draws, RPL results are sensitive to the number of draws used,24 and theory suggests that the number of draws should increase with the number of random parameters.7 Because Halton sequences are systematic and the initial values in each Halton sequence are relatively far apart, using too few draws could violate distributional assumptions. Using too few draws could also violate the independence assumption because sequences based on primes of similar magnitude start out similarly. Despite these potential problems, available reviews and best-practice documents offer little guidance about the number of draws needed to estimate a given number of random parameters in an RPL model.2,22,24-29 (One exception is a recent simulation paper by Czajkowski and Budziński,30 which compared 4 types of draws and recommended minimum numbers of Sobol draws.) In addition, data from our systematic review22 showed no relation between the number of random parameters estimated and the number of draws used. Of the 118 recent health-related DCEs with RPL analyses, 40 (33.9%) reported the number of Halton draws used. Among those 40 studies, the correlation between the number of draws and the number of random parameters was −0.05 (Figure 1). These facts indicate a need for information about the effects of estimating RPL models with different numbers of Halton draws. In this study, we addressed this information gap by combining systematic-review data, Halton sequences, simulations, and real data to show how Halton sequences operate, demonstrate the effects of correlation among random parameters, demonstrate how the number of Halton draws affects random parameter estimation, and compare our simulation parameters with current practice.
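The sketch below is a hypothetical example, not a reproduction of any analysis in this report; it shows where the number of draws enters an RPL ("mixl") fit in the gmnl package (assuming gmnl's default Halton draws) and illustrates the kind of sensitivity check described in the next paragraph: refit the model with increasing numbers of draws and compare the estimates. The data object dce_ml, the formula, and the random-parameter specification are placeholders.

```r
# Hypothetical RPL fits with increasing numbers of simulation draws.
# `dce_ml` is a placeholder data set prepared with mlogit.data().
library(gmnl)

fits <- lapply(c(500, 1000, 2000), function(R_draws) {
  gmnl(choice ~ wait + qol_pre + qol_post | 1,
       data  = dce_ml,
       model = "mixl",
       ranp  = c(wait = "n", qol_pre = "n", qol_post = "n"),  # normally distributed random parameters
       R     = R_draws)                                       # number of draws used in simulation
})

# If mean and SD estimates shift materially as R increases,
# the smaller numbers of draws were not sufficient for stable estimation.
t(sapply(fits, coef))
```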
In this study component, some analyses, rather than providing new information, describe issues with correlation among Halton draws and coverage of the multivariate normal distribution in ways that may be helpful to health economists. In particular, they describe the number of draws needed to prevent violation of the independence and normality assumptions (with different numbers of random parameters) and compare those numbers with current practice. In the parts examining (1) the relation between correlated random parameters and findings and (2) the stability of parameter estimation with different numbers of random parameters and different numbers of draws, we do provide some new information. Although we do not propose specific thresholds for the number of random parameters or the number of Halton draws, we demonstrate a sensitivity analysis technique for determining whether enough draws have been used to provide stable results. Also, although we do not compare multiple types of draws, our analysis differs from much of the previous work in that we include a real-data example with a large sample (N = 2051), many attributes (up to 15 random and 25 total), and some continuous parameters.30 Taken together, the findings from this component of our study may help analysts consider whether to use Halton draws, decide how many to use, and conduct sensitivity analyses that may indicate whether they have used too few. The issues examined in this portion of our study are relevant for accurate analysis, interpretation, and reporting of DCE results and therefore for the accurate translation of population preferences into health care and health policy decisions.
Variance Estimation in MNL Models
Unlike the RPL model, the MNL model assumes fixed rather than random parameters. Also, the MNL model does not account for repeated observations from the same individual, which are more strongly correlated (ie, less variable) than observations from different individuals. DCEs often rely on repeated observations from the same individual but fail to account for this clustering, which artificially reduces standard errors.1,31-33 The downward bias in SEs results in inflated statistical significance and artificially narrow confidence intervals. Rose and Bliemer may have been referring to this issue when they stated:
one might require a relatively small sample size to generate statistically significant parameter estimates for a given study; however, a much larger sample size might be required in order to be able to infer something about the preferences of the wider population [emphasis added] (p 1039).34
The bottom line is that because of the way in which most DCEs are currently implemented, analyzing a large amount of preference information from a small sample of individuals may mislead researchers and their audiences into concluding that the results reliably describe the preferences of a larger population when in fact they do not.
This variance estimation problem may affect sample size determination. For example, most pilot studies are too small to support an RPL model and therefore do not control for clustering. In this case, the SE estimates used to determine sample size for a future study would be artificially small, as would the resulting sample size estimates. Some DCE pilot studies have generated remarkably small estimates of required sample size. For example, based on prior information from a pilot study (n = 120),35 de Bekker-Grob et al11 estimated that with a sample of 7 people, with 80% power and a significance level of α = .05, it would be possible to estimate the utility (or relative weight) of a €1-increase in treatment cost—or, with a sample of 3 people, to do the same with the utility of nausea adverse effects. These estimates assumed a specific DCE design and a simplistic MNL model, and other parameters required larger estimated sample sizes—as high as 190. Nonetheless, it seems doubtful that a sample of 3 or 7 individuals would yield meaningful information about treatment preferences in the larger population of interest. Failure to control for clustering may have been a factor in generating these improbably small estimates of required sample size. Without additional information, however, it is impossible to know whether the small sample size estimates were primarily the result of uncontrolled clustering, truly strong patient preferences, sampling error, or some other reason.
Fortunately, researchers have access to relatively straightforward methods to control for clustering. Hierarchical models (also known as mixed models or random-effects models) account for clustering and describe the amount of variation (ie, variance components) at different levels of analysis (eg, the extent to which preferences vary between individuals vs within individuals). The RPL model is 1 kind of hierarchical model. Other solutions to the clustering problem include robust variance estimation (a simple adjustment to the variance estimates, if available in the analysis software) and, potentially, bootstrapping (which involves drawing repeated samples of equal size from the original analysis sample, with replacement). Hensher et al7 described both methods, but their bootstrapping method apparently samples single observations. In the DCE setting, appropriate bootstrapping would sample respondents (ie, clusters of observations) rather than single observations.36 Like hierarchical models, both robust variance estimation and bootstrapping estimate SEs correctly; however, unlike hierarchical models, they treat clustering as a nuisance in that they do not explore the clustering by estimating variance components. Bootstrapping has the advantage that it is independent of the statistical model being used for analysis. Demonstrating the effective use of bootstrapping to correct DCE variance estimation could lead to improvements in the reporting of DCE results and, by making pilot study results more accurate, could also lead to improved estimation of sample size requirements. With this goal in mind, we conducted a series of simulations to compare variance estimates from bootstrapping with those from traditional MNL estimation.
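As an illustration of the respondent-level resampling described above, the following sketch (generic code written for this report, not the method of Hensher et al) resamples whole respondents with replacement and recomputes the coefficients on each bootstrap sample. The data frame dce_long, the id column, and the fitting function fit_fun are placeholders for whatever long-format data and MNL estimation routine an analyst is using.

```r
# Sketch of respondent-level (cluster) bootstrap for a DCE in long format.
# `dce_long` holds one row per respondent x task x alternative with respondent
# identifier `id`; `fit_fun` is any function returning a named coefficient vector.
cluster_boot_se <- function(dce_long, fit_fun, id_var = "id", B = 500) {
  ids <- unique(dce_long[[id_var]])
  boot_est <- replicate(B, {
    # Resample respondents (clusters), not individual rows, with replacement
    sampled <- sample(ids, length(ids), replace = TRUE)
    # Stack the full block of rows for each sampled respondent; duplicated
    # respondents keep their repeated-choice structure, preserving the clustering
    boot_data <- do.call(rbind, lapply(seq_along(sampled), function(i) {
      block <- dce_long[dce_long[[id_var]] == sampled[i], , drop = FALSE]
      block[[id_var]] <- i   # relabel so duplicated respondents stay distinct
      block
    }))
    fit_fun(boot_data)
  })
  # Bootstrap SE = SD of each coefficient across the B replicate fits
  apply(boot_est, 1, sd)
}
```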
Organization of This Report
The following section describes our stakeholder engagement process and its positive impact on our project. The section following that documents changes to the research protocol. The subsequent 4 sections describe, in turn, the 4 major components of the project: (1) the systematic review, (2) the simulations of the effects of DCE design features and statistical model assumptions on DCE results, (3) the simulations involving RPL estimation with DCE data, and (4) the simulations comparing variance estimation methods in MNL models. We used these 4 project components to describe the state of the science in health-related DCEs and to identify potential improvements in DCE design, analysis, and reporting. The final section discusses the results and their implications.
Patient and Stakeholder Engagement
Many PCORI projects include patients as stakeholders. For technical projects, PCORI encourages researchers to think broadly about which stakeholder group or groups to involve in the research process. Because this project focused on technical aspects of research methods, we recruited expert researchers as stakeholders. One of the co-investigators, Kirsten Howard, MPH, PhD, recruited the stakeholders from her international network of research collaborators. Our stakeholders (all economists) are world-renowned experts on DCEs, mostly in the area of health. We kept stakeholders informed about plans, progress, and findings and solicited their feedback and advice. From Australia, the Netherlands, and the United Kingdom, our stakeholders provided feedback mostly through conference calls, email conversations, and manuscript revisions. We increasingly used email, 1-on-1 meetings, and separate conferences by continent (Europe, Australia) to make the meeting schedule convenient for stakeholders and to obtain more thorough feedback. We found Skype useful because it enabled free communication and screen sharing. We were also able to hold an in-person meeting, in conjunction with a pair of health economics conferences, to discuss findings and dissemination plans.
Stakeholder feedback directly influenced the research protocol, the interpretation of results, and the research manuscripts. Specifically, stakeholder feedback resulted in (1) a more thorough grounding of the project in literature on DCEs and health-related DCEs, (2) a focus on key DCE design and analysis assumptions related to choice behavior, (3) an additional investigation of RPL analysis methods that not only informed our simulation design but also was well received at international health economics conferences and is expected to lead to an influential publication, and (4) decreased focus on sample-size determination and variance estimation. For that reason, 2 of the 4 study components (the systematic review and the investigation involving RPL analysis) reflect expansions in study scope. In turn, the other 2 components—the simulations involving (1) the effects of DCE design features and statistical model assumptions on DCE results and (2) bootstrapping vs traditional MNL estimation—reflect simplification of the initial study protocol to allow for these expansions in scope. Overall, our incorporation of strong stakeholder feedback made the study more relevant, more robust, and likely more impactful. Stakeholder engagement also increased the amount of learning that could take place for investigators during this project and provided a foundation for future collaboration.
Changes to the Research Protocol
In consultation with stakeholders and for reasons described in the preceding section, we made the following changes to the original research protocol:
- Added collaboration on the systematic review
- Added the analysis of problems related to the number of Halton draws used in RPL estimation
- Separated the bootstrapping simulations from the simulations examining DCE design features and model assumptions
- Reworded the aims to provide more specific information
- In the simulations examining DCE design features and model assumptions, we did the following:
  – Added simulation parameters involving unmeasured interactions
  – Added simulation parameters that involved ignoring attributes in complex tasks
  – Reduced the number of real DCEs (on which simulations were based) from 3 to 2
  – Reduced the number of pilot study sample sizes from 7 to 4
  – Used the original study design in the pilot phase and a D-efficient design in the main DCE phase rather than implementing orthogonal, D-efficient, and S-efficient designs
  – Used only model-estimated standard errors, not bootstrapped standard errors
  – Eliminated a planned comparison of sample size estimation methods
  – Reduced the number of main DCE sample sizes from 10 to 3
  – Reduced the number of blocking strategies from 2 to 1 for studies 1 and 2
  – Reduced the number of potential interaction terms from 2-4 to 1 for studies 1 and 2
  – Dropped estimation of random parameter, scaled, and generalized logit models from the main DCE phase
  – Made minor changes in selection of the terms to be omitted, ignored, or included in interactions
  – Used a slightly smaller set of outcome measures to compare simulation scenarios
  – Substituted relative standard error for 95% CI coverage as an outcome measure
Systematic Review of DCEs in Health Economics
Because our review was published under a Creative Commons License, we have adapted some of the text for this section and for the Discussion section, and we are reproducing the article in its entirety (please see Soekhai et al22). The published paper follows a traditional structure: Introduction, Methods, Results, Discussion, and Conclusions.
Early in the project period, our expert stakeholders advised us to conduct an up-to-date review of the literature on the design of DCEs. In addition to expanding and updating the informal review that had led to our original research proposal, we collaborated with a stakeholder and 2 of her colleagues on a systematic review that covered the past 5 years (2013–2017):
Soekhai V, de Bekker-Grob EW, Ellis AR, Vass C. Discrete choice experiments in health economics: past, present and future. Pharmacoeconomics. 2018;37(2):201-226. doi:10.1007/s40273-018-0734-2 [PMC free article: PMC6386055] [PubMed: 30392040] [CrossRef]
In this review, we aimed to describe the general state of the science of health-related DCEs. We updated prior reviews (1990-2012); identified all health-related DCEs; and described trends, current practice, and future challenges. We observed increasing use of DCEs (an average of 60 per year from 2013-2017), with broadening areas of application; increased geographic scope; and increased sophistication in design, software, and econometric models.
Most recent health-related DCEs have focused on consumer experiences (35%) or trade-offs between consumer experiences and health outcomes (16%). The vast majority (89%) have used fractional designs, usually (82%) with 2 alternatives per task and 4 to 9 attributes per alternative (39% with 4-5 attributes, 22% with 6 attributes, 21% with 7-9 attributes). The design plan has most commonly allowed only for main effects (29%), has sometimes allowed for 2-way interactions (17%), and often has not been clearly reported (49%). We did not examine the proportion of studies that used pilot data; however, the percentage of studies using qualitative methods for pretesting a questionnaire was 38%, which might be considered a lower bound. We did not collect information about the method of sample size determination, but we did collect sample size, which ranged from 35 to 30 600 individuals, with a mean of 728 and a median of 401. On average, studies with blocking (to limit the number of choice tasks per respondent) had 709 participants, each of whom completed 11 choice tasks, whereas studies with unblocked designs had 439 participants, each of whom completed 13 choice tasks.
With regard to statistical analysis, mixed logit (usually RPL rather than error components models) and MNL models were the most common, each representing 39% of published DCEs. Of the studies using mixed logit models, 22% reported details such as distributional assumptions and the number of draws used. We also observed an increase in the use of more complex models, such as latent class, generalized multinomial logit, or heteroskedastic MNL models. In terms of outcome measures, most studies (56%) reported coefficients, and some reported other measures, such as willingness to pay, utility scores, probability scores, or odds ratios. One of our conclusions was that inadequate reporting of methodologic details (eg, incorporation of interactions into the study design, use of blocking, method used to create choice sets, distributional assumptions, number of draws, use of internal validity testing) makes it difficult to assess study quality, which may reduce decision-makers' confidence in and ability to act on the findings. We did not examine the factors influencing the level of detail in reporting study methods, which may include author, reviewer, and editor behavior; journal restrictions, such as word limits; and even cultural norms.
Simulations of the Effects of Design Features and Model Assumptions on DCE Results
Methods
To assess the effects of DCE design features and model assumptions on DCE results, we completed a series of computer simulations based on 2 actual DCEs. One of the co-investigators provided access to the data from the 2 DCEs, which we selected based on data availability and applicability to US health care. Study 1 examined preferences for organ allocation among adults (N = 2051) in the Australian general public.37 That study used an efficient design, with respondents randomly assigned to 5 blocks of 30 choice tasks. Each task included 2 alternatives; the alternatives had 15 attributes, with 2 to 6 levels per attribute. (Appendix A: Table A1 lists the study 1 attributes and variable names.) Study 2 examined preferences for labor induction among women (N = 362) who were participating in a randomized trial of labor induction alternatives (n = 260) or who were pregnant and volunteered only for the DCE (n = 102).38 That study used an efficient fractional factorial design, with participants randomly assigned to 2 blocks of 25 choice tasks. Each task included 3 alternatives (outpatient, basic inpatient, and enhanced inpatient care for labor induction). The alternatives, based on the actual treatment arms of the randomized trial, had 6 attributes, with 1 to 4 levels per attribute. In study 2, some attributes were alternative specific (eg, relevant only for a particular type of care), and there was a fixed status quo option (basic inpatient care). (In a DCE, a status quo option, if present, is a fixed set of attributes that respondents compare with the other alternatives presented. The status quo option often represents usual care or the current state of affairs. Appendix A: Table A2 lists the study 2 attributes and variable names.)
The purpose of conducting simulations based on these 2 studies was to use realistic data to explore how violations of certain assumptions, alone or in combination, affect parameter estimates and their standard errors. The assumptions were an unbiased sample, independence of utility from the set of attributes presented (eg, omitting an attribute does not affect the utility of other attributes), no ignoring of attributes, and no unmeasured interactions (which could be tested by including an interaction in the data-generating process but not in the analysis model). We explored these effects with samples of various sizes. We had expectations for the effects of individual simulation parameters. For example, selection bias in which sample selection is positively associated with a given parameter should bias the estimate of that parameter upward; omission of attributes that cause variance inflation or deflation because of respondent behavior should lead to inflated or deflated SEs, respectively; ignoring attributes should decrease their apparent importance and increase the apparent importance of other attributes; and an unmeasured interaction should result in biased main-effect estimates for the variables involved in the interaction. However, because of the complexity of the simulation setup, we did not specify hypotheses for the combined effects of simulation parameters. Rather, we aimed to describe those effects.
We originally considered generating our simulated populations using 4 actual DCEs, which we selected based on data availability. The final research protocol provided for 3 actual DCEs and, as described in the section on stakeholder engagement, we reduced the number of studies to 2 to expand other aspects of the project. Compared with the studies examined in our systematic review, study 1 was atypical in that it had a fairly large sample, focused on developing a priority-setting framework (as did 9% of the studies reviewed), and had relatively large numbers of choice tasks and attributes. Study 2 had a more typical sample size, consumer experience focus, and number of attributes but still had a higher-than-average number of choice tasks. Of the studies that we dropped, both had small sample sizes (130 and 219). One had a common number of attributes (4) but many choice tasks (32); the other had many attributes (9) and choice tasks (32). The implications of our choice of studies are discussed in the Limitations section toward the end of the report.
Beginning with the actual data from each study, we verified that we could replicate the original results. We then simulated a population of 100 000 individuals based on the estimates from the original study. For each fixed parameter, the entire sample received the estimated mean parameter value. For each random parameter, we used the estimated mean and SD to simulate a normally distributed parameter across the population. The simulated population included an additional dichotomous covariate, x, which had 50% prevalence and was correlated with selected random parameters at 0, 0.20, or 0.50. We used the BinNor package in R to simultaneously induce correlations between x and those parameters.
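The sketch below illustrates the general idea of simulating a population in which a normally distributed random parameter is correlated with a 50%-prevalence binary covariate x. It uses a simple Gaussian-threshold shortcut with MASS::mvrnorm rather than the BinNor package used in the study; the qol_post mean and SD and the 0.35 target correlation come from the study 1 population described below, and everything else is illustrative.

```r
# Simplified sketch: a binary covariate x (50% prevalence) correlated with a
# normally distributed random parameter. The study used the BinNor package;
# this Gaussian-threshold version is for illustration only.
library(MASS)

set.seed(1)
N <- 100000
latent <- mvrnorm(N, mu = c(0, 0),
                  Sigma = matrix(c(1,    0.35,
                                   0.35, 1), nrow = 2))
x        <- as.integer(latent[, 1] > 0)    # dichotomize at the median: ~50% prevalence
qol_post <- 0.114 + 0.108 * latent[, 2]    # individual-level random parameter

mean(x)            # about 0.50
cor(x, qol_post)   # smaller than 0.35 because x was dichotomized; BinNor instead
                   # adjusts the intermediate correlation to hit the target
```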
In theory, high task complexity may cause respondents to ignore less important attributes, which would violate the assumptions of the usual DCE analysis models. To flag complex tasks, we used the parameter estimates from the original study to measure the entropy (complexity) of each task. Entropy measures the amount of information contained in a task. For a given choice task, entropy is calculated as E = −Σ_k p_k ln(p_k), where p_k is the probability of choosing alternative k.39 Tasks in which the alternatives differ clearly in attractiveness have smaller entropy values, whereas tasks with similarly attractive alternatives have larger entropy values. For example, in a 2-alternative task, entropy is maximized (at 0.69) when each alternative has probability 0.5 of being selected; if the probabilities are close to 0 and 1, as in a task with a dominant alternative, the entropy is near 0. Tasks with higher entropy (ie, tasks without a clearly dominant alternative) require more thinking. Therefore, we flagged tasks with entropy above the 75th percentile as complex. In our simulated populations, described further in the next paragraph, some people responded to these complex tasks by ignoring certain attributes.
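For concreteness, the following sketch computes entropy for a few hypothetical 2-alternative tasks and flags the most complex ones; the probability values are made up for illustration.

```r
# Sketch of computing task entropy from predicted choice probabilities and
# flagging complex tasks. `p` has one row per choice task and one column per
# alternative; the values are hypothetical.
p <- rbind(c(0.50, 0.50),   # evenly matched alternatives: entropy = log(2) ~ 0.69
           c(0.90, 0.10),   # near-dominant alternative: entropy close to 0
           c(0.65, 0.35))

entropy <- -rowSums(p * log(p))                  # E = -sum_k p_k * ln(p_k)
complex <- entropy > quantile(entropy, 0.75)     # flag tasks above the 75th percentile
data.frame(entropy = round(entropy, 3), complex = complex)
```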
For each study, we used North Carolina State University's high-performance computing cluster to simulate the usual DCE study process: a pilot study, followed by a redesign based on pilot parameter estimates, followed by a larger main DCE. We used the set of simulation parameters shown in Table 1. Starting with either study 1 or study 2 as a data source, we selected pilot samples of 4 sizes from the population. Pilot sample selection was either completely random or minimally or moderately biased (ie, correlated with the unmeasured variable x at r = 0.2 or r = 0.5, where x was correlated with selected preference parameters). Then, using the design of the real DCE, we simulated respondents' choices. We either (1) omitted selected attributes from the data-generating process and analysis model, in which case variance was either increased or decreased by 10% (because simpler tasks could either improve measurement precision by decreasing fatigue or worsen precision by withholding information that participants need to make choices); or (2) included those attributes in the data-generating process and analysis model, with or without an unmeasured interaction (because failing to account for an interaction could bias DCE results). When including the selected attributes in the data-generating process (condition 2 in the previous sentence), we set their parameter values such that the utility ranges associated with those attributes would be 20% of the maximum utility range of any attribute in the original study (ie, in a sense, the selected attributes would be 20% as important as the most important attribute). We randomly selected 50% of the study population to ignore the least important attributes (as measured using the parameter estimates from the original study) in complex tasks (those with entropy above the 75th percentile).
In the simulated population of 100 000 people for study 1, a total of 49 984 people (50%) were randomly assigned to ignore selected attributes in high-entropy tasks, and 50 166 (50%) were randomly assigned a value of 1 for unmeasured variable x, which had a correlation of 0.35 with qol_post and a correlation of −0.35 with qol_pre. The population was generated using parameters with the following means (SDs for random parameters only): asc1 0.097, age5 0.673, age15 0.573, age25 0.388, age55 −0.283, age70 −1.179, prev −0.149, adher −0.063, sex 0.08, donor 0.199, depend 0.356, donor*depend 0.365, wait 0.043 (0.052), le_pre −0.089 (0.076), le_post 0.06 (0.05), qol_pre −0.059 (0.059), qol_post 0.114 (0.108), comb1 0.002, comb2 −0.068, comb3 −0.096, comb4 −0.23, smok1 −0.263, smok2 −0.787, drink1 −0.095, drink2 −0.359, and obes −0.272.
For study 2, in the simulated population of 100 000 people, 49 917 people (50%) were randomly assigned to ignore selected attributes in high-entropy tasks, and 50 297 (50%) were randomly assigned a value of 1 for unmeasured variable x, which had a correlation of 0.35 with midw. The population was generated using normally distributed parameters with the following means (SDs for random parameters only): asc1 1.635, env_home 0.324, env_pvt_shrd 0.886, env_pvt_pvt 1.8578, pain_1dose 0.178, pain_mild 0.606, pain_mild_nd 0.572, chk_ph_avl −0.791, chk_mw_avl 1.361, midw 0.173 (0.057), trip −0.265 (0.088), tt −0.014 (0.005), and trip*tt −0.005.
To increase the likelihood of stable estimation with 1000 Halton draws, we generated only 5 random parameters for each study. Because parameters had to vary to be correlated with unmeasured variable x (and therefore with sample selection in certain scenarios), we included as random parameters the variables that were correlated with x. After generating the simulated population, for each simulation scenario we estimated an MNL model (Appendix B: Table B1) with utility as a function of the attributes of alternatives. We conducted the pilot simulations using R version 3.5.1, with the gmnl package for MNL estimation. We used MNL estimation for the pilot simulations because the pilot sample sizes were insufficient to support estimation of RPL models. As an example, the pilot analysis model for study 1 in scenarios with no ignored attributes or omitted terms was:
U_nk = asc1 + b_1·x_1nk + b_2·x_2nk + … + b_J·x_Jnk + ε_nk
where U_nk = utility for person n and alternative k; x_1nk, …, x_Jnk are the levels of the J attributes of alternative k as presented to person n; b_1, …, b_J are the corresponding preference parameters; asc1 is an alternative-specific constant for alternative 1; n = 1, 2, …, N; and ε_nk is the independently and identically distributed random error, following the Gumbel distribution.
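A minimal sketch of the kind of pilot-phase MNL call this implies, assuming the simulated choices are stored in long format (one row per respondent, task, and alternative) with placeholder column names, follows; it is illustrative rather than the study's actual code.

```r
# Hypothetical pilot-phase MNL fit with the gmnl package. `pilot_df`, `choice`,
# `alt`, and `id` are placeholder names for the long-format simulated data;
# the attributes shown are a subset of the study 1 variables, for illustration.
library(mlogit)
library(gmnl)

pilot_ml <- mlogit.data(pilot_df, choice = "choice", shape = "long",
                        alt.var = "alt", id.var = "id")

mnl_fit <- gmnl(choice ~ wait + le_pre + le_post + qol_pre + qol_post | 1,
                data = pilot_ml, model = "mnl")   # fixed-parameter multinomial logit
summary(mnl_fit)
```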
For the redesign step, we used SAS version 9.4 with a set of SAS macros to generate a new, efficient design for each simulation scenario based on the parameter estimates from each pilot study.40 The redesign software was programmed to generate a D-efficient design for MNL estimation, assuming that the parameter estimates from the pilot study were correct and keeping constant the numbers of tasks, blocks, and alternatives.
We based the main DCE simulation on the new DCE design. In the main DCE simulation, we varied 2 parameters: sample size and analysis model. We randomly selected a sample of each size from the simulated population. The analysis models included a full model (main effects plus interaction), a model omitting the interaction, and a model omitting both the interaction and 1 of the main effects. For the main DCE, as for the pilot, we used the gmnl package in R version 3.5.1. Although we had originally planned to use both MNL and RPL models in the main DCE phase for comparison, we implemented only MNL models, which by definition were misspecified because they ignored preference heterogeneity. We made this simplification based on consultation with stakeholders, as described previously. As in the pilot phase, the MNL models expressed utility as a function of attributes of alternatives.
As an example, the full model for study 1 (main effects plus interaction) is as follows:
U_nk = asc1 + b_1·x_1nk + b_2·x_2nk + … + b_J·x_Jnk + b_int·(donor_nk × depend_nk) + ε_nk
where U_nk = utility for person n and alternative k; x_1nk, …, x_Jnk are the levels of the attributes of alternative k as presented to person n; donor_nk × depend_nk is the interaction between the donor and depend attributes, with coefficient b_int; asc1 is an alternative-specific constant for alternative 1; n = 1, 2, …, N; and ε_nk is the independently and identically distributed random error, following the Gumbel distribution.
Using study 1 as an example, Figure 2 shows the flow of the simulations, from simulation of the study population to pilot design to pilot simulation to main DCE simulation. In the pilot design phase, the parameters involving omission of selected attributes (with variance inflation or deflation), inclusion of those attributes (with or without an interaction in the data-generating process), and ignoring selected attributes in complex tasks account for 2 × 2 × 2 = 8 sets of simulation scenarios. In the pilot simulation phase, the sample size and sample selection simulation parameters account for 4 × 3 = 12 sets. Finally, in the main DCE phase, the sample size and analysis model simulation parameters account for 3 × 3 = 9 sets. Therefore, the total number of simulation scenarios for each original study was 8 × 12 × 9 = 864. For each simulation scenario, we used 5000 iterations in the pilot simulation phase. Based on the results of each pilot study, we obtained a design for the main DCE phase in which we used 1000 iterations for study 1 and 5000 iterations for study 2. We used convergence plots to select the number of iterations, enabling us to conserve computing resources while ensuring that we had used enough iterations for stable estimation.
In the convergence plots, we plotted parameter estimates, ratios of parameter estimates, and standard errors by number of iterations, from 100 up to the maximum number of iterations (1000 or 5000) in increments of 100. We produced convergence plots for “best-case” and “worst-case” simulation scenarios. The best case included pilot N = 240, all attributes included in the pilot data-generating process and model, no unmeasured interaction in the pilot DCE, no selection bias, no attributes ignored, and the correct main DCE analysis model. The worst case included pilot N = 30, selected attributes omitted from the pilot data-generating process and model, variance inflated by 10%, selection bias (r = 0.5), selected attributes ignored in complex tasks, and an unmeasured interaction in the main DCE analysis model.
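The sketch below shows the basic mechanics of such a plot under the assumption that one estimate per simulation iteration has been collected in a vector; the values here are simulated purely for illustration, and the running mean is one simple summary an analyst might track.

```r
# Sketch of a convergence plot: track a running summary of the simulation output
# (here, the running mean of one parameter estimate) as iterations accumulate,
# in increments of 100. `est` holds one illustrative estimate per iteration.
set.seed(1)
est <- rnorm(5000, mean = 0.114, sd = 0.03)

checkpoints  <- seq(100, length(est), by = 100)
running_mean <- sapply(checkpoints, function(k) mean(est[1:k]))

plot(checkpoints, running_mean, type = "l",
     xlab = "Number of iterations", ylab = "Running mean of estimate")
# Flat behavior over the later checkpoints suggests enough iterations were used.
```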
To assess the combined effects of the pilot and main DCE simulation parameters, we estimated bias, relative standard error (RSE, or the standard error divided by the parameter estimate), and D-error (the determinant of the asymptotic covariance matrix, a measure of overall error in parameter estimation). Because DCE parameter estimates are relative rather than absolute, we estimated bias in ratios of parameter estimates (relative to qol_post in study 1 and midw in study 2) rather than estimating bias directly. For pairs of attributes that appear only in linear terms in the analysis model, this ratio is known as a marginal rate of substitution (MRS).7 We expressed each ratio as a percentage of the denominator (percent bias). (For qol_post and midw, respectively, we used qol_pre and env_pvt_pvt as the denominator when estimating bias.) High RSE values indicate imprecise estimates.
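As a worked illustration of these 3 measures, the following sketch uses made-up estimates and standard errors together with the study 1 data-generating values for age70 and qol_post; the percent bias calculation reflects one straightforward reading of the ratio-based definition above.

```r
# Sketch of the three outcome measures for one simulation scenario. The
# estimates and standard errors are illustrative placeholders.
est  <- c(age70 = -1.10,  qol_post = 0.12)    # estimated coefficients
se   <- c(age70 = 0.15,   qol_post = 0.02)    # their standard errors
true <- c(age70 = -1.179, qol_post = 0.114)   # data-generating values (study 1)

# Percent bias in a ratio of estimates (age70 relative to qol_post)
est_ratio  <- est["age70"] / est["qol_post"]
true_ratio <- true["age70"] / true["qol_post"]
pct_bias   <- 100 * (est_ratio - true_ratio) / true_ratio

# Relative standard error (RSE): SE divided by the parameter estimate
rse <- se / est

# D-error: determinant of the asymptotic covariance matrix of the estimates
vcov_hat <- diag(se^2)   # stand-in; in practice, use vcov() on the fitted model
d_error  <- det(vcov_hat)
```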
We then plotted bias, RSE, and D-error for different combinations of pilot and main DCE simulation parameters, holding other simulation parameters constant at their default values. We assigned the default values with the goal of making the comparisons as informative as possible. The default pilot sample size was 240. With all attributes included in the pilot data-generating process, the default was to include no unmeasured interaction; with selected attributes excluded, the default variance adjustment was to decrease variance by 10%. By default, selection bias was absent, no one ignored attributes in complex tasks, the main DCE sample size was 1000, and the main DCE included the full-analysis model. When comparing across different main DCE analysis models, we included an unmeasured interaction in the data-generating process (and not in the pilot analysis model).
Results
We completed the simulations without errors, except that for study 2, a coding error caused some lines of output to be garbled at random because multiple processes were writing to the same output file. This error affected <1% of output (18 815 of 4 296 865 lines). Because of computing resource limitations, we were unable to complete as many iterations for study 1 as for study 2. However, the convergence plots (Appendix A: Figures A1-A4) showed little fluctuation in estimates with 1000 to 5000 iterations.
The following subsections report the findings as percent bias, RSE, and D-error, using selected figures as examples. Full sets of figures appear in Appendix A. Each figure in this section represents a combined effect of pilot DCE simulation parameters and main DCE simulation parameters. For example, Figure 3 shows percent bias in the estimate of the parameter age70 (relative to the qol_post parameter estimate) by pilot sample size and main DCE sample size.
Percent Bias and Main DCE Sample Size
Little bias emerged in the default scenario. (In study 1, an exception was that estimates for 1 of the comorbidity variables, comb1, were strongly biased. However, the true population mean of that parameter was 0.002—near 0—which may explain the large differences in the estimated vs actual parameter ratios.) Larger main DCE sample sizes generally resulted in percent bias closer to 0. We observed no consistent combined effect of pilot sample size and pilot selection bias on the one hand and main DCE sample size on the other hand (eg, Figure 3). (For example, the leftmost column of Appendix A: Figure A5 shows generally low levels of bias that appear to fluctuate randomly across pilot study sample sizes; for a main DCE sample size of 400, the percentage of plots in which the absolute value of bias increased from 1 pilot study sample size to the next—30 to 100, 100 to 170, or 170 to 240—was 52%, 56%, and 40%, respectively.) In study 1, we saw no consistent combined effect of the other 3 pilot simulation parameters (ignoring attributes, unmeasured interaction, and omission of attributes plus variance inflation) and the main DCE sample size (Appendix A: Figure A5). However, in study 2, we observed that when 50% of simulated respondents ignored the professional availability attributes (chk_ph_avl and chk_mw_avl) in complex tasks, many estimates were slightly biased, usually upward (eg, Figure 4; see also Appendix A: Figure A6). Given that bias was measured in reference to a ratio of parameter estimates, the bias and its direction were consistent, with an apparent decrease in the magnitude of the ignored parameters' utility and proportional increases in the magnitude of other parameters' utility. A similar pattern emerged in study 2 when the pilot DCE involved an unmeasured interaction (which is known to cause bias in the coefficients of variables correlated with the interaction term) and also when selected attributes (trip, tt) were omitted from the pilot study and the variance was inflated by 10% (Appendix A: Figure A6). When the trip and tt attributes were omitted in the pilot study, positive bias appeared in the ratios of most parameters to midw, with the possible exception of trip and tt themselves, and negative bias appeared in the ratio of midw to env_pvt_pvt. Omission of the trip and tt attributes would have led to 0 (ie, uninformative) priors for those 2 parameters in the redesign stage. This should not have biased the priors for other parameters, but greater measurement error for trip and tt in the main DCE phase may have inflated the apparent importance of other parameters in the main DCE analysis model.
Percent Bias and Main DCE Analysis Model
In study 1, misspecified models resulted in bias in certain parameter estimates. Model 1, which omitted an interaction, yielded biased estimates for the 2 terms (donor and depend) that were involved in the interaction (Appendix A: Figure A7). Model 0, which omitted the interaction and also the main effect of depend, yielded even greater bias in the estimate for donor (Figure 5) as well as bias in the alternative-specific constant and some of the comorbidity variables (Appendix A: Figure A7). We observed no clear combined effect of pilot and main DCE simulation parameters (Appendix A: Figure A7).
In study 2, we observed similar effects of main DCE model on the parameter estimates for the terms involved in the interaction (Appendix A: Figure A8). We also observed that when the interaction term was omitted (model 1), other parameter estimates reflected some bias (Appendix A: Figure A8). When a main effect was also omitted (model 0), many parameter estimates reflected large bias. In 2 instances, pilot simulation parameters appeared to have a combined effect with main DCE simulation parameters. First, when the main DCE model omitted both the interaction term (trip*tt) and a main effect (tt) (model 0), the amount of selection bias in the pilot study appeared to affect the amount of bias, increasing the positive bias in tt but often reducing the amount of negative bias in other parameter estimates (relative to midw) (eg, Figure 6). Second, when the pilot data-generating process and model omitted certain attributes and the pilot variance was inflated by 10%, the main DCE parameter estimates tended to be more positively biased (relative to midw) (Appendix A: Figure A8).
RSE and Main DCE Sample Size
In both study 1 and study 2, main DCE sample size was negatively associated with RSE, and all plots showed nearly parallel lines, indicating that pilot and main DCE simulation parameters had no combined effect on RSE (Appendix A: Figures A9 and A10).
RSE and Main DCE Analysis Model
In study 1, the main DCE analysis model appeared not to have a consistent effect on RSE (Appendix A: Figure A11). The full model (model 2) resulted in a higher RSE for both terms involved in the interaction (depend and donor) (eg, Figure 7). We observed no consistent combined effect of pilot and main DCE simulation parameters. In study 2, when the analysis model omitted the interaction (model 1), we observed a greater RSE for trip (but not tt) (Figure 8) and no effect on most other parameters (Appendix A: Figure A12). With both the interaction and the main effect of trip omitted (model 0), we observed a greater RSE for some parameters but a lower RSE for others (eg, trip, pain_mild_nd, midw). We saw no apparent combined effect of pilot and main DCE simulation parameters.
D-Error
In both study 1 (Appendix A: Figures A13 and A15) and study 2 (Appendix A: Figures A14 and A16), when the main DCE had a small sample size or the main DCE model omitted the interaction, we observed greater overall error in parameter estimation, especially when the pilot sample size was small, the interaction term was omitted from the pilot data-generating process and analysis model, or selected attributes were omitted from the pilot data-generating process and model and variance was reduced (vs increased) by 10% (eg, Figure 9). Especially in study 2, omitting both the interaction and a main effect from the main DCE analysis model resulted in a greater increase in overall error when the pilot sample size was small or the pilot study had strong selection bias (Figure 10).
Examination of RPL Estimation With DCE Data
The goal of our RPL investigation was to assess the number of Halton draws required for stable estimation with different numbers of random parameters. This analysis included 5 parts. We demonstrated (1) how Halton sequences correlate with each other; (2) how well Halton draws simulate normal distributions, depending on the number of sequences and the number of draws from each sequence; and (3) how increasing the number of draws improves coverage (ie, distributes values more evenly across the 0-1 interval, allowing more thorough mapping to the normal distribution) and decreases correlation among random parameters. Using RPL models with simulated DCE data, we then examined (4) how correlated random parameters affect bias and variance estimation. Finally, (5) we examined RPL models of real data to assess the combined effects of the number of random parameters and the number of draws on parameter estimates and standard errors. We have organized this section according to those 5 components, reporting both methods and results within each subsection.
Correlations Among Halton Sequences
Method
The purpose of this analysis was to demonstrate how Halton sequences correlate with each other because (1) correlations among Halton sequences could lead to correlations among random parameters, which would violate the assumptions of the RPL model; and (2) correlations among Halton sequences could cause the sequences to do a poor job of covering the 0 to 1 interval, which in turn would limit the simulated values of random parameters during the estimation process, possibly causing errors in parameter estimation. With R version 3.5.0,41 we used the randtoolbox package to generate 10 000 draws from each of 50 Halton sequences. Because we used 50 sequences and the 50th prime is 229, we skipped the first 229 elements of each sequence to avoid excessive correlations between sequences.23 To show correlations between Halton sequences and bivariate coverage of the 0 to 1 interval, we created miniature scatter plots between pairs of adjacent Halton sequences with 500, 1000, 5000, and 10 000 draws. We used these plots to demonstrate how correlation and coverage varied with the number of Halton sequences and the number of draws from each sequence. For selected scatter plots that showed evidence of correlation and incomplete coverage, we also generated surface plots to demonstrate departures from bivariate normality.
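As a rough illustration of this setup (not the study's actual code), the R sketch below generates Halton draws with the randtoolbox package, discards the 229 burn-in elements, and plots a few adjacent-sequence pairs; the number of draws kept and the pairs plotted are placeholders chosen for illustration.

```r
# Sketch: generate Halton draws and inspect pairwise coverage of the unit
# square for adjacent sequences.
library(randtoolbox)

n_draws <- 1000   # number of draws to keep (the study used 500-10 000)
n_seq   <- 50     # number of Halton sequences (dimensions)
burn_in <- 229    # skip the first 229 elements (50th prime) to limit correlation

# Generate burn_in + n_draws rows across n_seq dimensions, then drop the burn-in.
h <- halton(burn_in + n_draws, dim = n_seq)
h <- h[-(1:burn_in), ]

# Miniature scatter plots for a few adjacent pairs of sequences; correlation and
# poor coverage become visible for pairs based on larger primes.
op <- par(mfrow = c(2, 2), mar = c(2, 2, 2, 1))
for (j in c(1, 12, 17, 41)) {
  plot(h[, j], h[, j + 1], pch = ".", xlab = "", ylab = "",
       main = paste0("Sequences ", j, " and ", j + 1))
}
par(op)
```

Repeating this for each draw count (500, 1000, 5000, and 10 000) and for all 49 adjacent pairs yields the kind of grid summarized in Figure 11.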
Results
Figure 11 shows scatter plots of adjacent Halton sequences with different numbers of draws. In each panel, plot 1 represents Halton sequences 1 and 2, plot 2 represents sequences 2 and 3, and so forth. With 500 draws, increasing correlation and decreasing coverage of the 0 to 1 interval became apparent with 12 to 14 sequences (panel a, plots 11 and 13). With 1000 draws, correlation and coverage issues began with 18 sequences (panel c, plot 17). With 5000 and 10 000 draws, issues appeared once the number of sequences reached 34 (panel e, plot 33) and 42 (panel f, plot 41), respectively. The 4 plots referenced here were also associated with departures from bivariate normality (not shown). To summarize, Halton sequences based on larger primes showed greater correlation and more coverage issues; using more draws mitigated these issues.
Univariate and Multivariate Normality
Method
Because the purpose of Halton sequences in RPL estimation is to simulate random parameters with specific distributions, the next step after generating the sequences is to convert the Halton draws to Halton-normal draws (evenly spaced draws from the normal distribution, generated using Halton sequences). We effected this conversion by treating the Halton sequences described in the previous subsection as lists of normal quantiles. We examined univariate and multivariate normality because the RPL model assumes specific distributions for random parameters, and violations of this assumption for normally distributed random parameters could lead to error in parameter estimation.
To examine univariate normality, we conducted Shapiro-Wilk tests for each of the 50 Halton sequences, using 50 to 5000 draws in increments of 50. We created heat maps showing, for each combination of number of draws and number of random parameters (Halton sequences), the percentage of tests in which the null hypothesis was retained, the minimum P value, and the median P value.
To examine multivariate normality, we conducted Henze-Zirkler tests for each of the 50 Halton sequences, using between 50 and 10 000 draws in increments of 50. We created a heat map showing, for each combination of number of draws and number of random parameters, the percentage of tests in which the null hypothesis was retained. We also created a line plot showing the minimum number of draws required for the Henze-Zirkler P value to remain above thresholds of .25 and .05.
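The conversion to Halton-normal draws and the univariate tests can be sketched in R as follows; the grid of draw counts and numbers of sequences is illustrative, and the Henze-Zirkler step (which requires a package implementing that multivariate test) is omitted here.

```r
# Sketch: convert Halton draws to Halton-normal draws by treating them as normal
# quantiles, then tabulate the minimum Shapiro-Wilk P value by number of draws
# and number of sequences (random parameters).
library(randtoolbox)

burn_in   <- 229
max_draws <- 5000   # shapiro.test() accepts at most 5000 observations
h  <- halton(burn_in + max_draws, dim = 50)[-(1:burn_in), ]
hn <- qnorm(h)      # Halton-normal draws

draw_grid  <- seq(500, 5000, by = 500)       # subset of the 50-5000 grid
param_grid <- c(5, 10, 15, 20, 30, 40, 50)   # numbers of sequences to test

min_p <- matrix(NA, length(draw_grid), length(param_grid),
                dimnames = list(draw_grid, param_grid))
for (i in seq_along(draw_grid)) {
  for (j in seq_along(param_grid)) {
    p_vals <- sapply(seq_len(param_grid[j]), function(k)
      shapiro.test(hn[seq_len(draw_grid[i]), k])$p.value)
    min_p[i, j] <- min(p_vals)   # minimum P value across sequences
  }
}
# A heat map of min_p (eg, with image()) shows where at least 1 sequence
# departs from univariate normality.
```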
Results: Univariate Normality
The heat map in Figure 12, panel (a) shows the percentage of Shapiro-Wilk tests for which the null hypothesis of univariate normality was retained, by number of draws and number of random parameters. Panels (b) and (c) show the minimum and median P values, respectively. With 5 random parameters, the P values for some Shapiro-Wilk tests began to decrease (panel b). With 500 draws and 10 random parameters, or with 1000 draws and 12 random parameters, at least 1 random parameter departed from normality (panels a and b). With >10 random parameters, the number of draws required to maintain univariate normality increased sharply (panels a and b); with 10 to 15 random parameters, using 1000 draws did not prevent a substantial proportion of random parameters from being non-normally distributed (panel a). With 500 draws and 17 random parameters, or with 1000 draws and 22 random parameters, half of the random parameters departed from normality (panel c).
Results: Multivariate Normality
Figure 13 shows that as the number of random parameters increased into and beyond the range of 7 to 12, the P value for the Henze-Zirkler test of multivariate normality decreased, meaning that increasing the number of random parameters increased the likelihood of departures from multivariate normality. The number of draws required to keep the P value above .25 (Figure 14, dashed line) was 500 with 10 random parameters and increased sharply as additional random parameters were introduced. Similarly, the number of draws required to keep the P value above .05 (Figure 14, solid line) was 4000 with 11 random parameters and increased sharply with the number of random parameters. By overlaying Figure 1 onto Figure 14, we observed that, among the 40 recently published health-related DCEs that described RPL results and reported the number of draws, the 40% (16/40) with ≥10 random parameters used too few draws to prevent a noticeable departure from the multivariate normal distribution. (We did not assess the distributional assumptions in those studies, however.)
Correlations Among Halton-Normal Draws
Method
The purpose of this analysis was to assess, for different numbers of Halton sequences (random parameters to be estimated), how many Halton draws were required to keep the correlations among draws below a given level. As mentioned previously, excessive correlations among random parameters would violate the assumptions of the RPL model and could cause errors in parameter estimation. Using the same data as for the multivariate normality tests (in the previous subsection), we created a heat map showing, for each combination of number of draws and number of random parameters, the maximum Spearman correlation between any pair of Halton-normal draws. We superimposed a line plot showing the minimum number of draws required for the maximum correlation to remain below thresholds of r < 0.2 and r < 0.1.
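A minimal sketch of this check appears below; the specific numbers of draws and sequences are placeholders for the full grid examined in the study.

```r
# Sketch: maximum pairwise Spearman correlation among Halton-normal draws for a
# given number of draws and number of sequences.
library(randtoolbox)

burn_in <- 229
n_draws <- 1000
n_seq   <- 17
hn <- qnorm(halton(burn_in + n_draws, dim = n_seq)[-(1:burn_in), ])

rho <- cor(hn, method = "spearman")
diag(rho) <- 0          # ignore the trivial self-correlations
max(abs(rho))           # compare against thresholds such as 0.1 or 0.2
```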
Results
Figure 15 shows that as the number of draws decreased or the number of random parameters increased, the maximum Spearman correlation between pairs of Halton-normal variables increased, meaning that using too few draws (or increasing the number of random parameters without increasing the number of draws) led to violations of the independence assumption. Keeping the maximum correlation below 0.2 required about 250 draws with 10 to 15 random parameters, 500 draws with 17 random parameters, and ≥1000 draws with ≥22 random parameters. Meeting the more conservative threshold of 0.1 required 500 draws with 13 random parameters and 1000 draws with 17 random parameters. With either threshold, the required number of draws increased sharply as the number of random parameters increased beyond 15 or 20.
Overlaying Figure 1 onto Figure 15 showed that, among the 75% (30/40) of studies with ≤12 random parameters, most used enough draws to keep correlations among normally distributed random parameters below 0.1. Among the 10 studies with ≥13 random parameters, 7 could be expected to have correlations >0.2 if they assumed normally distributed random parameters.
Effect of Correlated Random Parameters on Bias and Variance Estimation
Method
To examine the effect of correlated random parameters on bias and variance estimation, we used the R version 3.5.0 software41 to simulate a simple DCE with 3 alternatives, three 3-level attributes, a fractional-factorial orthogonal design with 6 choice tasks, and 500 participants. The DCE design did not include an opt-out alternative. We generated the data with no alternative-specific constant, no effect of the ordering of the alternatives on preferences, and a random parameter for each attribute (b1, b2, and b3). We set the true means at 1/3, 2/3, and 1.0 and the SDs at 0.5.
We created 10 simulation scenarios, with all pairs of random parameters (b1 and b2, b2 and b3, and b1 and b3) correlated at r = 0.0 to 0.9 by steps of 0.1. For each scenario, we used the R version 3.5.1 software41 to simulate 1000 RPL models, each using 1000 pseudorandom draws with the specified pairwise correlations. We used the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm in the R maxLik package42 for estimation and a different random number seed for each of the 1000 iterations. We estimated the 3 random parameters as linear effects, along with their SDs and the intercept. Therefore, the model (Appendix B: Table B1) estimated 7 parameters with (3 − 1) × 6 = 12 df. Across all iterations within each scenario, we measured for each parameter the mean bias (estimate minus true effect), mean squared error (MSE) (mean of squared bias), SE (SD of estimates), and 95% CI coverage (percentage of confidence intervals that contained the true value). We compared these statistics across correlation levels.
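The sketch below illustrates only the generation of correlated pseudorandom draws for the 3 random parameters, using MASS::mvrnorm as one convenient option; the RPL estimation step with maxLik is not reproduced, and the variable names are placeholders.

```r
# Sketch: pseudorandom draws for 3 random parameters with a specified common
# pairwise correlation, the building block of the correlated-draw scenarios.
library(MASS)

r       <- 0.3           # common pairwise correlation (0.0-0.9 in the study)
n_draws <- 1000
Sigma   <- matrix(r, nrow = 3, ncol = 3)
diag(Sigma) <- 1

set.seed(1)              # a different seed was used for each simulation iteration
draws <- mvrnorm(n_draws, mu = rep(0, 3), Sigma = Sigma)

# Scale by the assumed SDs and shift by the assumed means to obtain simulated
# individual-level coefficients, ie b = mean + sd * draw.
b_means <- c(1/3, 2/3, 1.0)
b_sds   <- rep(0.5, 3)
b_draws <- sweep(sweep(draws, 2, b_sds, `*`), 2, b_means, `+`)
cor(draws)               # check the realized pairwise correlations
```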
Results
For each level of simulated correlation among pairs of pseudorandom sequences, Figure 16 shows the mean bias, mean absolute bias, MSE, and standard error of each parameter estimate across simulation runs. The top and bottom rows of Figure 16 correspond to the mean parameters (b1, b2, b3) and deviation parameters (sd1, sd2, sd3), respectively. When the pseudorandom sequences were uncorrelated, mean bias was 0. With correlations of 0.1, 0.2, and 0.3 among pseudorandom sequences, bias in b1 reached 8%, 16%, and 24%, respectively. Larger correlations sometimes resulted in much greater bias. Naturally, MSE followed a similar pattern. Coverage of the 95% CIs started at 94% to 96% and decreased drastically with increasing correlation among pseudorandom sequences; coverage for b1 was only 77% with a correlation of 0.2. Once the correlation among pseudorandom sequences reached 0.7, bias, MSE, and coverage improved somewhat for mean estimates but continued to worsen for deviation estimates. With the correlation set at 0.9, 1 model out of 1000 failed to produce a standard error for sd1, and another failed to produce standard errors for b1 through b3, sd1, and sd2.
Real-Data Example
Method
To examine the joint effect of the number of random parameters and the number of Halton draws on parameter estimates and standard errors, we ran a series of RPL models with the actual (not simulated) data from a DCE on preferences for allocation of organs for transplant.37 In the original study, each of 2051 respondents completed 30 choice tasks with 2 alternatives and 15 attributes. The analysis variables included 24 attribute-related parameters and 1 alternative-specific constant. Based on published health-related DCEs (Figure 1), we selected 3 levels for the number of random parameters (5, 10, and 15) and 8 levels for the number of draws (250, 500, 750, 1000, 2500, 5000, 7500, and 10 000). Because some results did not stabilize with 10 000 draws, we implemented 4 additional levels for the number of draws (12 500, 15 000, 17 500, and 20 000). Using the R gmnl package, we ran RPL models (not generalized MNL models, despite what the package name might suggest) with all permutations of the number of random parameters and the number of draws, assuming normally distributed random parameters (Appendix B: Table B1). We arbitrarily selected as random effects the first 5, 10, or 15 attribute-related parameters. For each combination of number of random parameters and number of draws, we ran a single model.
As an example, the model below specified the first 5 attribute-related parameters as random:
Unk = asc1 + Σp=1…5 βpnXpnk + Σq=6…24 βqXqnk + εnk, where Unk = utility for person n and alternative k; asc1 is an alternative-specific constant for alternative 1; Xpnk and Xqnk are the levels of attributes p and q presented to person n in alternative k; n = 1, 2, …, N; βpn is a normally distributed random parameter associated with attribute p; βq is a population-level parameter associated with attribute q; and εnk is the independently and identically distributed random error, following the Gumbel distribution.
We compared bias and the precision of results across models. For each number of random parameters, we treated the model with the maximum number of draws (20 000) as the gold standard. To assess bias in estimated means, we calculated the MRS by dividing each parameter estimate by the parameter estimate for posttransplant quality of life (QOL). (For posttransplant QOL, we divided by the parameter estimate for pretransplant QOL.) We compared each MRS to the corresponding MRS from the maximum-draws model. To assess bias in estimated deviations, we calculated the coefficient of variation (COV)—the SD estimate divided by the corresponding mean estimate—and compared each COV to the corresponding COV from the maximum-draws model. To assess precision, we calculated relative efficiency by dividing the RSE (SE divided by parameter estimate) from the maximum-draws model by the corresponding RSEs from other models. A percentage <100% would indicate that the current estimate is less efficient than the gold-standard estimate because the current estimate has a higher RSE. We expressed bias and relative efficiency as percentages relative to the estimates from the maximum-draws model.
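Because these comparison metrics are simple ratios, the sketch below shows the arithmetic with made-up numbers; the parameter names and values are placeholders, and expressing bias as (current − reference)/reference × 100 is our reading of the text rather than a formula stated explicitly in the report.

```r
# Sketch of the comparison metrics. est/se are estimates from a model with a
# given number of draws; est_ref/se_ref come from the maximum-draws (20 000)
# model. All values are illustrative.
est     <- c(attr1 = 0.42, qol_post = 0.85)
se      <- c(attr1 = 0.06, qol_post = 0.09)
est_ref <- c(attr1 = 0.40, qol_post = 0.84)
se_ref  <- c(attr1 = 0.05, qol_post = 0.08)

# Marginal rate of substitution relative to posttransplant QOL, and its bias
# relative to the gold-standard (maximum-draws) value
mrs      <- est["attr1"] / est["qol_post"]
mrs_ref  <- est_ref["attr1"] / est_ref["qol_post"]
bias_pct <- 100 * (mrs - mrs_ref) / mrs_ref

# Relative standard error and relative efficiency (<100% = less efficient)
rse     <- se["attr1"] / est["attr1"]
rse_ref <- se_ref["attr1"] / est_ref["attr1"]
rel_eff <- 100 * rse_ref / rse

# The COV for a deviation parameter is computed analogously as the SD estimate
# divided by the corresponding mean estimate and compared with the
# maximum-draws value in the same way as the MRS.
```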
We also recorded the central processing unit (CPU) time, memory use, and log-likelihood (LL) for each model and plotted these values by number of random parameters and number of draws.
Results
For the real-data models, Table 2 summarizes the degree of bias (top half of table) and the relative efficiency (bottom half of table) compared with the estimates from the maximum-draws model, by number of random parameters and parameter type, across models with different numbers of draws. Models with 5 random parameters rarely showed substantial bias. Models with 10 random parameters sometimes exhibited bias, especially in estimated deviations. For models with 15 random parameters, bias often exceeded 50%, even for estimated means with no corresponding random effect, and the most biased estimate differed from the maximum-draws estimate by a factor of 362 (maximum, 36 238.7). Nineteen models, all with 15 random parameters, had at least 1 estimate with bias >1000% in absolute value (not shown). Ten of those 19 models used >1000 draws, and 2 used 17 500 draws, indicating that some parameter estimates in models with 15 random parameters had not yet stabilized even as the number of draws approached 20 000.
With regard to relative efficiency (ratio of standard error to estimate in the maximum-draws model divided by the same ratio from the current model), in models with 5 random parameters, the RSEs for estimated means differed little from the RSEs in the maximum-draws model. The relative efficiency of estimated deviations exhibited some variation. Models with 10 random parameters produced a similar pattern of results but much greater variation in the relative efficiency of estimated deviations. With 15 random parameters, relative efficiency fluctuated greatly for all types of parameter estimates. For the models with 15 random parameters, 16% (54/330) of the relative efficiencies for random parameters and their deviations could not be estimated because the corresponding model did not produce a standard error estimate. The number of draws in these problematic models ranged from 250 to 20 000. The large relative efficiencies with smaller numbers of draws, mostly for deviation parameters, suggest that the stability of deviation parameters was especially sensitive to the number of draws used.
Figure 17 shows how the bias (panels a1, b1, and c1) and relative efficiency (panels a2, b2, and c2) of parameter estimates varied by number of draws (x-axis) across models with 5, 10, or 15 random parameters (panels a1-a2, b1-b2, and c1-c2, respectively); for visibility, the y-axis scale varies across plots. Each point in the figure represents a single parameter estimate from a single model. Estimates became increasingly stable as the number of draws increased. In models with 5 random parameters, bias and the relative efficiency of estimated means (panels a1 and a2, plots 1-5) varied little but were more stable with ≥1000 draws. For certain estimated deviations, greater bias and efficiency differences emerged (panels a1 and a2, plot 6). Bias and the relative efficiency of estimated deviations stabilized after 1000 to 2500 draws.
In models with 10 random parameters, the bias and relative efficiency of most estimated means (Figure 17, panels b1 and b2, plots 1-5) varied little but appeared slightly more stable after about 1000 to 2500 draws. In 1 exceptional case (panels b1 and b2, plot 4), bias and relative efficiency did not stabilize before the number of draws reached 17 500 (bias near 5% and relative efficiency near 105%) and may not have stabilized at all. For some estimated deviations (panels b1 and b2, plots 6 and 7), estimates clearly did not stabilize even as the number of draws approached 20 000.
Models with 15 random parameters (Figure 17, panels c1 and c2) showed substantial bias, relative efficiency differences, and fluctuation in bias and relative efficiency. These results emerged for all types of parameters, including estimated means with no corresponding random effect. Results stabilized somewhat after 1000 draws but did not completely stabilize even with 20 000 draws.
The required amount of computer memory was a linear function of the number of Halton draws, with 12 to 13 GB required for 1000 draws, 27 to 29 GB required for 2500 draws, and 221 to 242 GB for 20 000 draws (not shown). CPU time was a quadratic function of the number of draws, with 6 to 7 hours required for 1000 draws, 9 to 11 hours for 2500 draws, and 92 to 122 hours (4-5 days) for 20 000 draws. LL appeared to stabilize after about 2500 draws with 5 random parameters and after 15 000 draws with 10 random parameters but never clearly stabilized with 15 random parameters (not shown).
Simulations Assessing Bootstrapped MNL Estimation of Parameter Means in the Presence of Preference Heterogeneity
Methods
Using R version 3.5.1, we simulated 500 study samples with n = 500 individuals per sample. The sample data included 3 normally distributed random parameters (b1, b2, b3) with means of 1/3, 2/3, and 1.0 and SDs of 1/9, 2/9, and 1/3, respectively. The data-generation model was Unk = b1nX1nk + b2nX2nk + b3nX3nk + εnk, where Unk = utility for person n and alternative k; bpn is the value of random parameter p for person n; Xpnk is the level of attribute p presented to person n in alternative k; n = 1, 2, …, N; and εnk is the random error, following the Gumbel distribution.
We designed a simple DCE with 6 choice tasks, 3 alternatives per task, and 3 attributes (corresponding to b1, b2, and b3) per alternative. Then, for each of the 500 samples, we generated 2 sets of bootstrapped samples with replacement at 100% of the original sample size. One set of bootstrapped samples used 1000 replications, and the other used 2500 replications. We ran MNL models (Appendix B: Table B1) using the original sample and each set of bootstrapped samples as well as an RPL model using the original sample. For simplicity and ease of comparison, we generated the data and estimated the models without intercepts. To obtain bootstrapped estimates for each parameter, we calculated the mean of the parameter estimate across each set of bootstrapped samples. To estimate bootstrapped confidence intervals, we used 3 methods: (1) the percentile method (2.5th and 97.5th percentiles of the parameter estimate across bootstrapped samples); (2) the standard deviation method (SD of the parameter estimate across bootstrapped samples); and (3) the bias-corrected and accelerated method, which is similar to the percentile method but adjusts the percentiles to account for bias (the percentage of bootstrapped estimates that are less than the conventional estimate) and acceleration (sensitivity of the bootstrapped standard error to the true value of the parameter being estimated).43 We then compared the estimates to the true parameter values with respect to mean percent bias (the average difference between the estimate and the true parameter value, expressed as a percentage of the true parameter value), mean absolute percent bias (the average absolute difference between the estimate and the true parameter value, expressed as a percentage of the true parameter value), MSE (the mean of the square of bias), and 95% CI coverage (the percentage of samples in which the 95% CI included the true parameter value). We expected that, compared with the conventional MNL model, bootstrapping would yield confidence interval coverage closer to the nominal (95%) coverage.
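A minimal sketch of the individual-level resampling and 2 of the 3 interval methods appears below. The data object (dce_data, in long format with an id column), the fit_mnl() estimation wrapper, and the use of a 1.96 × SD normal approximation for the SD method are assumptions made for illustration; the report does not specify the implementation, and the bias-corrected and accelerated intervals are not shown.

```r
# Sketch: bootstrap at the level of the individual, keeping all choice tasks for
# each resampled respondent, then form percentile and SD-based intervals.
set.seed(42)
ids   <- unique(dce_data$id)
B     <- 1000                              # 1000 or 2500 replications in the study
boots <- matrix(NA, nrow = B, ncol = 3,
                dimnames = list(NULL, c("b1", "b2", "b3")))

for (b in seq_len(B)) {
  sampled_ids <- sample(ids, length(ids), replace = TRUE)   # resample individuals
  boot_data <- do.call(rbind, lapply(seq_along(sampled_ids), function(i) {
    rows <- dce_data[dce_data$id == sampled_ids[i], ]
    rows$id <- i                           # relabel so duplicated respondents stay distinct
    rows
  }))
  boots[b, ] <- fit_mnl(boot_data)         # hypothetical MNL estimation step
}

boot_mean <- colMeans(boots)                                     # bootstrapped estimates
ci_pct    <- apply(boots, 2, quantile, probs = c(0.025, 0.975))  # percentile CIs
ci_sd     <- rbind(boot_mean - 1.96 * apply(boots, 2, sd),       # SD-based CIs
                   boot_mean + 1.96 * apply(boots, 2, sd))       # (normal approximation)
```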
As described previously in the sections on stakeholder engagement and changes to the research protocol, these methods reflect simplification of the initial study protocol to allow completion of additional study components. Specifically, we separated the bootstrapping analysis from the simulations exploring the effects of DCE design features and model assumptions on DCE results. In addition to reducing the computing resources required for the study, this change enabled us to conduct a thorough comparison of bootstrapped vs traditional methods without confounding the effects of the variance estimation method with those of DCE design features or model assumptions.
Results
Table 3 shows the results of the bootstrapping simulations. The bootstrapped parameter estimates had little bias (mean percent bias, −4.5% to 0.6%). The MSE for the bootstrapped estimates of b1 and b2 was similar to the MSE for the RPL estimates, but the MSE for the bootstrapped estimate of b3 was higher than the MSE for the RPL estimate. Confidence interval coverage for b1 was close to the nominal value of 95% regardless of estimation method, but only the RPL model provided adequate coverage for all 3 parameters, and coverage for b3 was <75% for the conventional and bootstrapped MNL models. In fact, as the true parameter means and SDs increased from b1 to b2 to b3, bias in the MNL estimates increased, and the confidence interval coverage of those estimates (regardless of bootstrapping method) decreased.
Discussion
We conducted a study with 4 diverse components: (1) a systematic review, (2) an examination of the effects of DCE design features and statistical model assumptions on DCE results, (3) an examination of the limits of RPL estimation with DCE data, and (4) an exploration of bootstrapping for variance estimation in MNL models. The systematic review provided a useful description of recent trends in health-related DCEs and identified important gaps in reporting from such studies. The simulations involving design features and model assumptions yielded no broad truths but did result in a caveat for researchers using certain types of DCE designs. The simulations of RPL estimation demonstrated problems associated with the use of Halton draws in health-related DCEs. Finally, the bootstrapping analyses suggested that bootstrapping at the level of the individual fails to produce correct confidence intervals for MNL models in the presence of preference heterogeneity.
Systematic Review of DCEs in Health Economics
Our systematic review, produced in collaboration with 1 of our stakeholders and 2 colleagues outside our team, provided an overview of the applications and methods used by DCEs in health. Because we replicated the methods of previous systematic reviews, our review was both limited by our decision not to expand search terms and enhanced by our ability to observe trends over a long period. We noted growing use of empirical DCEs in health economics, with more sophisticated designs and analysis methods and enhanced use of qualitative methods and validity checks. However, we also identified many studies that did not report methodological details (eg, incorporation of interactions into the study design, use of blocking, method used to create choice sets, distributional assumptions, number of draws, use of internal validity testing). Reports of DCEs should include more detailed information about design and analysis decisions. Development of DCE reporting guidelines could strengthen DCEs and improve the ability of decision-makers to act on DCE results. How and when to integrate health-related DCE findings into person-, program-, and policy-level decision-making remains an important area for future research.
Simulations of the Effects of Design Features and Model Assumptions on DCE Results
We simulated the effects of pilot and main DCE design features and model assumptions on DCE results (bias in parameter ratios, RSE, and D-error) in study 1 (the study of organ allocation preferences, with 2 alternatives and 15 attributes) and study 2 (the study of labor induction preferences, with 3 alternatives and 6 attributes). Based on past research,13-19,44,45 the main effects of simulation parameters were as expected. For example, random error decreases with increasing sample size, and omitting an interaction term from a model yields bias in the main effect estimates for the terms involved in the interaction. However, we also observed in study 2 that model misspecification biased estimation of parameters not directly involved in the omitted interaction. In addition, when we examined the combined effects of pilot and main DCE simulation parameters—the main purpose for conducting these simulations—we observed in study 2 that problems in the pilot study (ignoring of attributes in complex tasks, unmeasured interaction, omission of attributes, small sample size, strong selection bias) can exacerbate bias or random error created by problems in the main DCE (because of small sample size or model misspecification).
One key difference between study 1 (organ allocation) and study 2 (labor induction) was that some of the attributes in study 2 were highly correlated with each other. Although this is usually a situation to be avoided in DCEs, that particular study was designed to compare specific labor interventions with each other, and to some extent those interventions included mutually exclusive treatment modalities (eg, midwife contact in person vs by phone, pain management methods available in the hospital that are not available at home). The strong correlations among attributes may have caused small problems (eg, with sampling or model misspecification) to affect many parameter estimates at once. Considered together with the finding described in the previous paragraph, this finding implies that (1) because results from study 1 and study 2 were not consistent with each other, we have no reason to believe that pilot study problems generally have large effects on estimates from the main DCE; (2) researchers should avoid correlated attributes where possible; (3) if correlated attributes are unavoidable, the pilot study should be designed with great care; and (4) in the presence of correlated attributes, researchers should interpret pilot and main DCE results with heightened caution and keep the pilot study's limitations in mind when interpreting main DCE results. One potential approach would be to conduct more sophisticated sensitivity analyses than usual when correlated attributes are present. For example, the analyst could simulate the effects of bias in 1 parameter estimate on the sign or significance of other, correlated parameter estimates.
Substudy Limitations
One limitation of this component of the study is its complexity. We were able to conduct simulations based on only 2 actual DCEs, and we reduced the number of simulation parameters as well as the number of levels for certain parameters in the parent DCEs. Nonetheless, we obtained a massive quantity of simulated data. We compared selected simulation scenarios with a limited number of default scenarios (pilot sample size of 240 participants, no unmeasured interaction [if all attributes were included in the data-generating process], variance decreased by 10% [with selected attributes excluded], no selection bias, no ignoring of attributes, main DCE sample size of 1000 participants, full analysis model for the main DCE, unmeasured interaction in the data-generating process when comparing across different main DCE analysis models). Although we believe that the comparisons and default scenarios were well chosen, we may have merely scratched the surface of the available information.
Related to the complexity of this study component, our aim was descriptive rather than explanatory, which limits the degree to which we can describe the mechanisms underlying the observed patterns and discuss the practical implications. Another limitation is that our findings and interpretations may apply only to the specific simulation scenarios we examined. For example, the data-generating mechanism included only normally distributed random parameters, and some of our simulation parameters—especially those not as strongly supported by research evidence (eg, the amount of selection bias, 50% of the population ignoring certain attributes)—may have been unrealistic. A related limitation involves the selection of real DCEs on which to base the simulations. Using data from studies typical of recent health-related DCEs would be ideal. However, we believed that replication of the original results was an important step in re-creating the original analysis models, and we were able to access and transmit data from only a few studies. Although study 2 was typical except for the large number of choice tasks, study 1 was atypical (ie, it had a large sample, unusual study focus, and many choice tasks and attributes). To the extent that these studies fail to represent recent health-related DCEs, our conclusions may have limited applicability. Also, any flaws in the original studies may have affected the sets of parameter estimates on which we based our simulations. On the other hand, our simulations were much more realistic than they would have been if we had simply invented population parameters.
Summary
Researchers may take comfort from the fact that in our simulation settings, pilot study problems had no general effect on main DCE results. However, our findings also suggest that unusual DCE designs (eg, those with highly correlated attributes) may require an abundance of caution (eg, thorough sensitivity analyses). Research should continue to explore the compound effects of pilot and main DCE design features and assumption violations on findings, searching both more deeply (eg, with more analyses of the same scenarios) and more broadly (eg, with different sets of simulation parameters).
Examination of RPL Estimation With DCE Data
In the past 10 years, the RPL model with Halton draws has become a popular way to analyze data from health-related DCEs.22 In this project, we have used density and scatter plots to demonstrate that the required number of Halton draws increases with the number of random parameters to be estimated; using too few draws leads to assumption violations (ie, correlated random parameters and insufficient coverage of the parameter space to be searched). For example, with 500 draws and 12 Halton sequences (or 1000 draws and 18 Halton sequences), we observed correlated sequences, decreased coverage of the 0 to 1 interval (and therefore the parameter space to be searched), and visible departures from bivariate normality.
When we used significance testing to detect departures from normality, the univariate tests indicated that using <500 draws with 10 random parameters, or <1000 draws with 12 random parameters, leads to non-normally distributed random parameters. Multivariate tests indicated that departures from multivariate normality may not happen with 5 random parameters, but that with 11 random parameters, as many as 4000 draws may be required to meet the multivariate normality assumption. In addition, with ≥10 random parameters, the number of draws required to avoid noticeable departures from multivariate normality rapidly increased to 10 000.
In terms of correlations among random parameters, our findings suggest that with ≤17 random parameters, 1000 draws would be sufficient to keep between-parameter correlations below 0.1. Our simulations of correlated pseudorandom sequences indicated that at least in this setting, with 3 normally distributed random parameters, even correlations as low as 0.1 can result in noticeable bias and severely degraded confidence interval coverage.
Our real-data example confirmed that the number of draws required increases with the number of random parameters being estimated. With 5 random parameters, we encountered few problems in estimating means and deviations, and estimates stabilized when the number of Halton draws reached 1000 to 2500. In contrast, models with 15 random parameters yielded unstable estimates and standard errors even with as many as 20 000 draws. We saw fewer problems with 10 random parameters than with 15, but with 10 random parameters, we were still unable to rely on the estimated SDs, and the estimated means did not always stabilize by the time the number of draws reached 20 000. In addition, stabilizing the LL required at least 2500, 15 000, and 20 000 draws with 5, 10, and 15 random parameters, respectively. Therefore, between-model comparisons with fewer draws would likely be invalid. We also observed that some models with 10 and several models with 15 random parameters failed to produce standard errors for ≥1 parameters. This phenomenon may indicate an empirical identification problem—namely, insufficient data to support model estimation. Using too few draws can mask a lack of identification by producing estimates that appear to be valid but are not.46
Taken together, these results suggest that RPL analyses with numerous random parameters can require thousands of draws for stable estimation. (Indeed, with 15 random parameters, we were unable to produce stable results. Because simulation error decreases as the number of draws increases,47 the required number of draws for that simulation scenario likely exceeded 20 000.) Our findings further suggest that depending on the study setting and the number of Halton draws used, health-related DCEs with RPL analyses may be subject to violations of the independence and normality assumptions and to unstable results. We observed that recent published reports of RPL analyses included the number of Halton draws about a third of the time. When this information was included, we detected no relation between the number of random parameters and the number of draws used. In addition, among the studies reporting the number of Halton draws used, 75% (30/40) included ≥6 random parameters, 90% (36/40) used <2500 Halton draws, and 97.5% (39/40) did one or the other (Figure 1). Based on our analyses, it is possible that many of these studies required a larger number of Halton draws (or estimation of fewer random parameters) for stable results. Given that using thousands of draws requires large amounts of computer memory and processing time (up to 242 GB and 4-5 days for 20 000 draws in our setting and with our computing resources), this finding implies a tradeoff between modeling more random parameters for theoretical or empirical reasons and modeling fewer random parameters to conserve resources and support stable estimation. This issue may prove to be an important consideration both in planning study resources and in interpreting DCE results and assessing their applicability for decision-making.
Limitations
This component of our study has at least 2 important limitations. First, coverage of the 0 to 1 interval, correlations among random parameters, and significance of normality tests do not translate directly to the validity and reliability of findings. However, we used simulation to demonstrate the effects of correlated random parameters on bias and confidence interval coverage, and we used real data to examine how the number of random parameters and number of draws affect the validity and reliability of estimates. Second, our real-data example came from a single study, and our simulations focused on specific parameter combinations; our findings and interpretations are limited to the specific data set and simulation scenarios we examined. Replication with different data sources and simulation parameters would provide more context for our findings, interpretations, and conclusions. Specifically, although we suspect that health-related DCEs should use far more Halton draws than they have in the past and we know that adding random parameters requires increasing the number of draws, our findings do not provide precise thresholds by which to judge DCEs.
Summary
Despite these limitations, given our findings and the current state of health-related DCEs, we make the following recommendations:
- The number of random parameters being estimated should inform the number of Halton draws to use.
- RPL analysis should include a sensitivity analysis to assess the stability of estimates with larger numbers of draws, especially when estimating many random parameters.
- Reports of RPL analyses should include the number of random parameters estimated, the number of draws used, and the sensitivity analysis results.
- Researchers should develop more detailed guidelines for RPL analyses.
- Researchers should continue to explore alternatives to Halton draws for simulated maximum likelihood estimation.
With regard to the last recommendation, randomized and shuffled Halton sequences do not solve the problems we have observed, but Sobol sequences may improve on the performance of Halton sequences, and modified Latin hypercube sampling—the current standard in the transport field—also deserves more attention.7 Pseudorandom sampling might prevent the problems we observed but would greatly increase the amount of computing time required.23
For the analyst who chooses to use Halton draws, because our simulation results are specific to our simulation settings and parameters, we have recommended no specific numeric thresholds for number of random parameters or number of draws. These thresholds should be decided by the analyst. However, our analyses describing violations of the independence and normality assumptions, which were not simulations but rather examinations of Halton sequences, clearly indicated that estimating more than 5 to 7 normally distributed random parameters without using thousands of draws would lead to such violations. In our simulations examining bias and confidence interval coverage with 3 correlated, normally distributed random parameters, we found that in our setting, minor assumption violations in the form of correlations of 0.1 or 0.2 led to substantial estimation problems. One analytic approach would be to decide on acceptable thresholds by combining these findings, run initial analyses using those thresholds, and follow up with a sensitivity analysis using the methods from our real-data example (essentially convergence plots of estimates and SEs for different numbers of draws). The set of acceptable thresholds should incorporate the knowledge that the number of draws should increase more quickly than the square root of the number of observations (individuals × choice tasks).7,23 This strategy may provide a realistic way to manage the observed resource constraints without sacrificing study quality. The strategy could be enhanced by using software that provides warnings regarding the number of draws used48 or allows comparison of multiple methods.49
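One way to implement such a convergence check is sketched below; fit_rpl() is a hypothetical wrapper around whatever RPL routine is in use (the report itself used the gmnl package) and is assumed, for illustration only, to return a list with coef and se vectors.

```r
# Sketch of a draws-sensitivity check: refit the same RPL specification with
# increasing numbers of Halton draws and plot each coefficient against the
# number of draws used.
draw_grid <- c(500, 1000, 2500, 5000, 10000)

fits  <- lapply(draw_grid, function(R) fit_rpl(n_draws = R))  # hypothetical wrapper
coefs <- sapply(fits, function(f) f$coef)   # parameters x draw settings
ses   <- sapply(fits, function(f) f$se)

# Convergence plot: one line per parameter. Estimates that keep drifting at the
# right-hand side of the plot signal that more draws (or fewer random
# parameters) are needed; the same plot can be produced for the SEs.
matplot(draw_grid, t(coefs), type = "b", log = "x",
        xlab = "Number of Halton draws", ylab = "Parameter estimate")
```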
Simulations Assessing Bootstrapped MNL Estimation of Random Parameter Means in the Presence of Preference Heterogeneity
Using MNL models with simulated DCE data, we estimated bootstrapped confidence intervals for parameter means in the presence of preference heterogeneity. In our simulation setting, which had a small number of normally distributed random parameters and no intercept term in the data-generation or analysis models, bootstrapped estimates did not provide better confidence interval coverage than that of the conventional MNL estimates. In fact, the bootstrapped confidence intervals provided inconsistent coverage, and their coverage was <75% for 1 of the simulated parameters. Our finding that bias in estimating random parameters increased with the variability in those parameters was consistent with work by Daly and Hess,50 indicating that bootstrapping and other variance correction techniques work well to address correlated errors but only in the absence of preference heterogeneity.
Conclusions
Based on our systematic review, reporting DCE design and analysis methods in greater detail would strengthen health-related DCEs, and reporting guidelines may be a means to that end. In our examination of DCE design features and model assumptions, we concluded that small problems in a pilot study do not necessarily have drastic effects on main DCE results but that certain DCE designs (eg, those with correlated attributes) may require greater care than others (eg, more thorough sensitivity analyses). Our simulations of RPL models led us to the conclusion that such models should use greater numbers of Halton draws or perhaps a different type of draws altogether to produce valid findings that can support good health and health policy decisions. In this context, sensitivity analyses can increase confidence in estimates of a given number of random parameters with a particular number of draws. Finally, in our bootstrapping analyses, we concluded that bootstrapping did not improve variance estimation in the MNL model in the presence of preference heterogeneity.
Our findings have important implications for analysts, other DCE researchers, and consumers of health-related DCE research. Improving DCE analysis methods (eg, paying closer attention to the number of random parameters being estimated and the number of Halton draws used, seeking alternatives to Halton draws, conducting careful sensitivity analyses in DCEs with constrained designs) would benefit analysts and other DCE researchers by making DCE-based estimates more robust. Improving reporting of DCE methods would help people assess the strength of evidence from health-related DCEs. Together, these actions could improve health preference research and increase the potential for findings to be used, and used well, in guiding policy and health care decisions.
References
- 1.
- Louviere JJ, Woodworth G. Design and analysis of simulated consumer choice or allocation experiments: an approach based on aggregate data. J Mark Res. 1983;20(4):350-367.
- 2.
- de Bekker-Grob EW, Ryan M, Gerard K. Discrete choice experiments in health economics: a review of the literature. Health Econ. 2012;21(2):145-172. [PubMed: 22223558]
- 3.
- Thurstone LL. A law of comparative judgment. Psychol Rev. 1927;34(4):273.
- 4.
- Hull C. Principles of Behaviour. Prentice Hall; 1943.
- 5.
- Marschak J. Binary choice constraints and random utility indicators. In: Karlin S, Suppes P, Arrow KJ, eds. Mathematical Methods in the Social Sciences. Stanford University Press; 1959:312-329.
- 6.
- McFadden D. Conditional logit analysis of qualitative choice behavior. In: Zarembka P, ed. Frontiers in Econometrics. Academic Press; 1974:105-142.
- 7.
- Hensher DA, Rose JM, Greene WH. Applied Choice Analysis. 2nd ed. Cambridge University Press; 2015.
- 8.
- Mühlbacher AC, Juhnke C, Beyer AR, Garner S. Patient-focused benefit-risk analysis to inform regulatory decisions: the European Union perspective. Value Health. 2016;19(6):734-740. [PubMed: 27712699]
- 9.
- US Food and Drug Administration. Patient Preference Information – Voluntary Submission, Review in Premarket Approval Applications, Humanitarian Device Exemption Applications, and De Novo Requests, and Inclusion in Decision Summaries and Device Labeling: Guidance for Industry, Food and Drug Administration Staff, and Other Stakeholders. Published August 24, 2016. Accessed October 30, 2019. https://www.fda.gov/media/92593/download
- 10.
- US Food and Drug Administration. Factors to Consider When Making Benefit-Risk Determinations for Medical Device Investigational Device Exemptions: Guidance for Investigational Device Exemption Sponsors, Sponsor-Investigators and Food and Drug Administration Staff. Published January 13, 2017. Accessed October 30, 2019. https://www.fda.gov/media/92427/download
- 11.
- de Bekker-Grob EW, Donkers B, Jonker MF, Stolk EA. Sample size requirements for discrete-choice experiments in healthcare: a practical guide. Patient. 2015;8(5):373-384. [PMC free article: PMC4575371] [PubMed: 25726010]
- 12.
- Rose JM, Hess S, Collins AT. What if my model assumptions are wrong? the impact of non-standard behaviour on choice model estimation. J Transp Econ Policy. 2013;47(2):245-263.
- 13.
- Caussade S, Ortúzar J de D, Rizzi LI, Hensher D. Assessing the influence of design dimensions on stated choice experiment estimates. Transp Res B Methodol. 2005;39(7):621-640.
- 14.
- Arentze T, Borgers A, Timmermans H, DelMistro R. Transport stated choice responses: effects of task complexity, presentation format and literacy. Transp Res E Logist Transp Rev. 2003;39(3). Accessed October 27, 2019. https://trid.trb.org/view/645969
- 15.
- DeShazo JR, Fermo G. Designing choice sets for stated preference methods: the effects of complexity on choice consistency. J Environ Econ Manage. 2002;44(1):123-143.
- 16.
- Green PE, Srinivasan V. Conjoint analysis in marketing: new developments with implications for research and practice. J Mark. 1990;54(4):3-19.
- 17.
- Hensher DA. Revealing differences in willingness to pay due to the dimensionality of stated choice designs: an initial assessment. Environ Resour Econ. 2006;34(1):7-44.
- 18.
- Brazell JD, Louviere JJ. Length Effects in Conjoint Choice Experiments and Surveys: an Explanation Based on Cumulative Cognitive Burden. Department of Marketing, The University of Sydney; 1998.
- 19.
- Bech M, Kjaer T, Lauridsen J. Does the number of choice sets matter? Results from a web survey applying a discrete choice experiment. Health Econ. 2011;20(3):273-286. [PubMed: 20143304]
- 20.
- Rose JM, Bliemer MC. Incorporating analyst uncertainty in model specification of respondent processing strategies into efficient designs for logit models. In: ISI World Statistics Congress, Hong Kong; 2013:25-30.
- 21.
- Reed Johnson F, Lancsar E, Marshall D, et al. Constructing experimental designs for discrete-choice experiments: report of the ISPOR Conjoint Analysis Experimental Design Good Research Practices Task Force. Value Health. 2013;16(1):3-13. [PubMed: 23337210]
- 22.
- Soekhai V, de Bekker-Grob EW, Ellis AR, Vass C. Discrete choice experiments in health economics: past, present and future. Pharmacoeconomics. 2018;37(2):201-226. doi:10.1007/s40273-018-0734-2 [PMC free article: PMC6386055] [PubMed: 30392040] [CrossRef]
- 23.
- Train KE. Discrete Choice Methods With Simulation. Cambridge University Press; 2009.
- 24.
- Hauber AB, González JM, Groothuis-Oudshoorn CG, et al. Statistical methods for the analysis of discrete choice experiments: a report of the ISPOR Conjoint Analysis Good Research Practices Task Force. Value Health. 2016;19(4):300-315. [PubMed: 27325321]
- 25.
- Ryan M, Gerard K. Using discrete choice experiments to value health care programmes: current practice and future research reflections. Appl Health Econ Health Policy. 2003;2(1):55-64. [PubMed: 14619274]
- 26.
- Clark MD, Determann D, Petrou S, Moro D, de Bekker-Grob EW. Discrete choice experiments in health economics: a review of the literature. Pharmacoeconomics. 2014;32(9):883-902. [PubMed: 25005924]
- 27.
- Louviere JJ, Lancsar E. Choice experiments in health: the good, the bad, the ugly and toward a brighter future. Health Econ Policy Law. 2009;4(4):527-546. [PubMed: 19715635]
- 28.
- Bridges JF, Hauber AB, Marshall D, et al. Conjoint analysis applications in health—a checklist: a report of the ISPOR Good Research Practices for Conjoint Analysis Task Force. Value Health. 2011;14(4):403-413. [PubMed: 21669364]
- 29.
- Lancsar E, Louviere J. Conducting discrete choice experiments to inform healthcare decision making. Pharmacoeconomics. 2008;26(8):661-677. [PubMed: 18620460]
- 30.
- Czajkowski M, Budziński W. Simulation error in maximum likelihood estimation of discrete choice models. J Choice Model. 2019;31:73-85.
- 31.
- Kreft IGG, Leeuw J de. Introducing Multilevel Modeling. Sage Publications; 1998.
- 32.
- Snijders TAB, Bosker R. Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling. Sage Publications; 1999.
- 33.
- Raudenbush SW, Bryk AS. Hierarchical Linear Models: Applications and Data Analysis Methods. Sage Publications; 2002.
- 34.
- Rose J, Bliemer M. Sample size requirements for stated choice experiments. Transportation. 2013;40(5):1021-1041.
- 35.
- de Bekker-Grob EW, Essink-Bot ML, Meerding WJ, Pols HAP, Koes BW, Steyerberg EW. Patients' preferences for osteoporosis drug treatment: a discrete choice experiment. Osteoporos Int. 2008;19(7):1029-1037. [PMC free article: PMC2440927] [PubMed: 18193329]
- 36.
- Flynn TN, Peters TJ. Use of the bootstrap in analysing cost data from cluster randomised trials: some simulation results. BMC Health Serv Res. 2004;4(1):33. doi:10.1186/1472-6963-4-33 [PMC free article: PMC535558] [PubMed: 15550169] [CrossRef]
- 37.
- Howard K, Jan S, Rose JM, et al. Community preferences for the allocation of donor organs for transplantation: a discrete choice study. Transplantation. 2015;99(3):560-567. [PubMed: 25700169]
- 38.
- Howard K, Gerard K, Adelson P, Bryce R, Wilkinson C, Turnbull D. Women's preferences for inpatient and outpatient priming for labour induction: a discrete choice experiment. BMC Health Serv Res. 2014;14(1):330. doi:10.1186/1472-6963-14-330 [PMC free article: PMC4128401] [PubMed: 25073486] [CrossRef]
- 39.
- Swait J, Adamowicz W. Choice environment, market complexity, and consumer behavior: a theoretical and empirical approach for incorporating decision complexity into models of consumer choice. Organ Behav Hum Decis Process. 2001;86(2):141-167.
- 40.
- SAS Institute. SAS macros for experimental design and choice modeling. 2019. Accessed October 31, 2019. https://support.sas.com/rnd/app/macros/
- 41.
- R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; 2018. Accessed September 22, 2020. https://www.R-project.org/
- 42.
- Henningsen A, Toomet O. maxLik: a package for maximum likelihood estimation in R. Comput Stat. 2011;26(3):443-458.
- 43.
- Tibshirani RJ, Efron B. Better bootstrap confidence intervals. In: Monographs on Statistics and Applied Probability: An Introduction to the Bootstrap. Vol 57. Chapman and Hall; 1993:178-199.
- 44.
- Box GE. Use and abuse of regression. Technometrics. 1966;8(4):625-629.
- 45.
- Slutsky DJ. Statistical errors in clinical studies. J Wrist Surg. 2013;2(4):285-287. [PMC free article: PMC3826246] [PubMed: 24436830]
- 46.
- Chiou L, Walker JL. Masking identification of discrete choice models under simulation methods. J Econom. 2007;141(2):683-703.
- 47.
- Hess S. Advanced discrete choice models with applications to transport demand. PhD thesis. Imperial College London; 2005. Accessed March 5, 2021. https://www.researchgate.net/publication/230663966_Advanced_discrete_choice_models_with_applications_to_transport_demand
- 48.
- StataCorp LLC. Panel-data mixed logit. Accessed May 22, 2020. https://www.stata.com/new-in-stata/panel-data-mixed-logit/
- 49.
- Czajkowski M. Models for discrete choice experiments. Accessed May 22, 2020. http://czaj.org/research/estimation-packages/dce
- 50.
- Daly A, Hess S. Simple approaches for random utility modelling with panel data. Presented at: European Transport Conference 2010; October 11-13, 2010; Glasgow, Scotland, UK. Accessed March 5, 2021. https://trid.trb.org/view/1118335
Related Publications
To date, this project has produced the following publication:
- Soekhai V, de Bekker-Grob EW, Ellis AR, Vass C. Discrete choice experiments in health economics: past, present and future. Pharmacoeconomics. 2018;37(2):201-226. doi:10.1007/s40273-018-0734-2 [PMC free article: PMC6386055] [PubMed: 30392040] [CrossRef]
Development and submission of 1 or more additional manuscripts is underway.
Acknowledgments
I gratefully acknowledge the contributions of my co-investigators, Kirsten Howard (University of Sydney) and Kathleen C. Thomas (University of North Carolina at Chapel Hill); our stakeholders, Esther W. de Bekker-Grob (Erasmus University Rotterdam), Mandy Ryan (University of Aberdeen), and Emily Lancsar (Australian National University); and John M. Rose (University of Technology Sydney). They have not read this report and bear no responsibility for its content, but they deserve recognition for providing valuable information, guidance, and feedback throughout this project. Also, I gratefully acknowledge the work of Vikas Soekhai and Caroline M. Vass, who collaborated with me on the systematic review and who extracted some of the data used in the study component addressing RPL estimation.
Research reported in this report was funded through a Patient-Centered Outcomes Research Institute® (PCORI®) Award (ME-1602-34572). Further information available at: https://www.pcori.org/research-results/2016/improving-methods-discrete-choice-experiments-measure-patient-preferences
Appendices
Appendix A.
Table A1. Study 1 attributes and variables (PDF, 36K)
Table A2. Study 2 attributes and variables (PDF, 34K)
Figure A1. Study 1 convergence plots, “best case” scenario (PDF, 714K)
Figure A2. Study 2 convergence plots, “best case” scenario (PDF, 498K)
Figure A3. Study 1 convergence plots, “worst case” scenario (PDF, 647K)
Figure A4. Study 2 convergence plots, “worst case” scenario (PDF, 452K)
Figure A9. Relative standard error by pilot simulation parameters and main DCE sample size, Study 1 (PDF, 928K)
Figure A10. Relative standard error by pilot simulation parameters and main DCE sample size, Study 2 (PDF, 525K)
Figure A11. Relative standard error by pilot simulation parameters and main DCE analysis model, Study 1 (PDF, 716K)
Figure A12. Relative standard error by pilot simulation parameters and main DCE analysis model, Study 2 (PDF, 496K)
Figure A13. D-error by pilot simulation parameters and main DCE sample size, Study 1 (PDF, 123K)
Figure A14. D-error by pilot simulation parameters and main DCE sample size, Study 2 (PDF, 139K)
Figure A15. D-error by pilot simulation parameters and main DCE analysis model, Study 1 (PDF, 114K)
Figure A16. D-error by pilot simulation parameters and main DCE analysis model, Study 2 (PDF, 121K)
Footnotes
Note: This figure shows how the bias (panels a1, b1, and c1) and relative efficiency (panels a2, b2, and c2) of parameter estimates varied by the number of draws (x-axis) across models with 5, 10, or 15 random parameters (panels a1-a2, b1-b2, and c1-c2, respectively). For visibility, the y-axis scale varies across plots. Each point in the figure represents a single parameter estimate from a single model.
Some models failed to produce standard errors for certain parameters. When this occurred, relative efficiency could not be calculated.
aBias in the ratio of the parameter estimate to qol_pre estimate or ratio of the qol_pre estimate to the qol_post estimate.
Suggested citation:
Ellis AR. (2021). Improving Methods for Discrete Choice Experiments to Measure Patient Preferences. Patient-Centered Outcomes Research Institute (PCORI). https://doi.org/10.25302/03.2021.ME.160234572
Disclaimer
The views, statements, and opinions presented in this report are solely the responsibility of the author(s) and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute® (PCORI®), its Board of Governors or Methodology Committee.