Comparing Public Opinion Data Across Countries: Challenges & Solutions

The challenge with comparing public opinion data across countries

Morning Consult conducts daily brand tracking surveys in more than forty countries, in addition to running hundreds of custom surveys in many more. Naturally, we want to be able to compare public opinion data across countries. Because the questions are identical across some tracking surveys, one might think this is a straightforward task: compare the means or distributions of responses to a question among people living in one country with those among people living in another. One would be wrong. Differences in culture, norms, and institutional contexts can lead people to understand and respond to identical questions in different ways (Brady 1985).

For reasons unrelated to the survey or question at hand, survey respondents in some countries will strongly tend toward agreement on Likert-scale questions, for example, while respondents in others will tend toward disagreement. Similarly, intangible concepts like ‘democracy’ and ‘political efficacy’ might elicit different reactions in different countries given heterogeneous understandings of what those concepts mean (Sen 2002, King et al. 2004). The most widely used term for this interpersonal incomparability is differential item functioning (DIF). The result of DIF is measurement error that is correlated with the fielding country and, consequently, biased comparative estimates.

How we have tried to mitigate differential item functioning

Half a century of scientific work has not been able to do away with DIF. More recently, researchers have tried to use anchoring vignettes to measure differences in the standards respondents use when asked to evaluate themselves in survey questions. In this framework, survey respondents read several short vignettes, each of which describes a situation in which a greater or lesser degree of a concept is present. They then rate the vignettes according to how much or how little of the concept they believe is present in each and subsequently answer a question in which they place themselves on the same scale. Using vignette ratings, King et al. (2004), for example, found that Mexicans have much higher standards for what constitutes political efficacy than do the Chinese, resolving the puzzle of why Chinese respondents, who live in an autocracy, report higher levels of efficacy than Mexicans, who live in a democracy.
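
To make the rescaling concrete, here is a minimal sketch of the nonparametric recoding King et al. (2004) describe, using hypothetical ratings on a five-point scale and assuming three vignettes written to convey increasing levels of the concept:

```python
import numpy as np

# Hypothetical data: each respondent rates three vignettes (written to convey
# low, medium, and high levels of the concept) and then places themselves on
# the same 1-5 scale.
vignette_ratings = np.array([
    [2, 3, 5],   # respondent 1
    [1, 2, 4],   # respondent 2
    [2, 3, 4],   # respondent 3
])
self_placement = np.array([4, 2, 1])

def nonparametric_rank(self_rating, vignettes):
    """Recode a self-assessment relative to the respondent's own vignette
    ratings (King et al. 2004): 1 = below the lowest vignette, 2 = equal to
    the lowest, ..., 2J + 1 = above the highest of J vignettes."""
    rank = 1
    for v in vignettes:          # vignettes assumed rated in increasing order
        if self_rating > v:
            rank += 2            # strictly above this vignette
        elif self_rating == v:
            rank += 1            # tied with this vignette
            break
        else:
            break                # below this vignette
    return rank

rescaled = [nonparametric_rank(s, v) for s, v in zip(self_placement, vignette_ratings)]
print(rescaled)   # [5, 4, 1]
```

Because each self-assessment is expressed relative to the respondent’s own vignette ratings, the recoded scale (2J + 1 categories for J vignettes) is purged of respondent-specific response standards.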

Leveraging anchoring vignettes

Morning Consult’s Research Science department set out to replicate King et al.’s study across three countries using our own surveys, to explore whether it was possible to leverage the rescaling that anchoring vignettes offer to reweight the daily tracking survey we field in 40+ countries. Specifically, we wanted to know whether we could use the vignette approach not only to render responses to a single question comparable across countries, but to render all of the questions in our daily tracking surveys comparable in one shot.

We fielded a survey replicating King et al.’s design in the US, India, and Germany and estimated a between-country difference factor using a Compound Hierarchical Ordered Probit (CHOPIT) model, in which self-placement questions were regressed on vignette evaluations, selected individual-level sociodemographic characteristics, and country indicators. Based on this model, we predicted each respondent’s probability of being in each of the five self-placement categories for each question and recoded respondents as belonging to the category to which they had the highest probability of belonging.
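
The recoding step can be illustrated with a short sketch; the probabilities below are hypothetical stand-ins for the actual CHOPIT predictions:

```python
import numpy as np

# Hypothetical predicted probabilities from the fitted CHOPIT model: rows are
# respondents, columns are the five self-placement categories.
predicted_probs = np.array([
    [0.05, 0.10, 0.20, 0.40, 0.25],
    [0.30, 0.35, 0.20, 0.10, 0.05],
    [0.10, 0.15, 0.40, 0.25, 0.10],
])

# Recode each respondent to the category with the highest predicted
# probability (categories numbered 1-5 to match the original Likert scale).
recoded = predicted_probs.argmax(axis=1) + 1
print(recoded)   # [4 2 3]

# Proportions of the recoded variable, which we later considered as raking targets.
adjusted_proportions = np.bincount(recoded, minlength=6)[1:] / len(recoded)
print(adjusted_proportions)
```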

We wanted to experiment with using the proportions of this recategorized variable as targets for the original, non-vignette-adjusted responses when constructing new raking weights. While our existing weighting approach rakes to demographic targets usually provided by national census or statistical offices, we hoped that raking opinion questions to their CHOPIT-rescaled versions would help us eliminate DIF from our surveys.
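
A minimal raking sketch, assuming a simple pandas data frame and purely illustrative marginal targets (one demographic margin and one CHOPIT-rescaled opinion margin; all variable and category names are hypothetical), could look like this:

```python
import numpy as np
import pandas as pd

def rake(df, targets, weight_col="weight", max_iter=50, tol=1e-6):
    """Iterative proportional fitting: adjust weights so the weighted margins
    of each raking variable match the supplied target proportions."""
    df = df.copy()
    df[weight_col] = 1.0
    for _ in range(max_iter):
        max_change = 0.0
        for var, target in targets.items():
            # Current weighted proportions for this variable's categories.
            current = df.groupby(var)[weight_col].sum() / df[weight_col].sum()
            factors = pd.Series(target) / current
            old = df[weight_col].copy()
            df[weight_col] *= df[var].map(factors)
            max_change = max(max_change, (df[weight_col] - old).abs().max())
        if max_change < tol:
            break
    return df

# Hypothetical example: rake to a gender margin and a CHOPIT-rescaled
# opinion margin (names and targets are illustrative only).
data = pd.DataFrame({
    "gender": ["m", "f", "f", "m", "f", "m"],
    "efficacy_rescaled": [1, 2, 3, 2, 1, 3],
})
targets = {
    "gender": {"m": 0.49, "f": 0.51},
    "efficacy_rescaled": {1: 0.3, 2: 0.4, 3: 0.3},
}
weighted = rake(data, targets)
print(weighted)
```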

Outcomes and limitations

Unfortunately, this approach failed for a variety of reasons. To begin with, it was not immediately clear, or necessarily testable, how many anchoring vignettes were necessary to correct an entire survey, and we quickly observed violations of the assumption that respondents perceive the vignettes in the order they were designed to imply.
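
One simple way to quantify such ordering violations, assuming the vignettes were written to imply a strict low-to-high ordering, is to count respondents whose ratings are not monotone (the data here are hypothetical):

```python
import numpy as np

# Hypothetical vignette ratings with columns ordered low -> high by design.
vignette_ratings = np.array([
    [2, 3, 5],
    [4, 2, 3],   # ratings reverse the intended ordering
    [3, 3, 4],   # ties are also ambiguous for the rescaling
])

# A respondent is consistent only if their ratings strictly increase.
consistent = np.all(np.diff(vignette_ratings, axis=1) > 0, axis=1)
print(f"{(~consistent).sum()} of {len(consistent)} respondents violate the ordering")
```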

We also found that the CHOPIT model tended to ‘overcorrect,’ such that some categories were not the most probable for any respondent and thus remained empty. The reweighted data were also generally sensitive to categorization decisions when respondents had nearly equal probabilities of belonging to two categories.
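
A quick way to gauge that sensitivity is to flag respondents whose two most probable categories are nearly tied; the probabilities and threshold below are illustrative only:

```python
import numpy as np

# Hypothetical predicted probabilities; flag respondents whose two most
# probable categories are nearly tied, since their recoded category is
# effectively arbitrary.
predicted_probs = np.array([
    [0.05, 0.10, 0.20, 0.40, 0.25],
    [0.28, 0.30, 0.22, 0.12, 0.08],
])

top_two = np.sort(predicted_probs, axis=1)[:, -2:]
near_tie = (top_two[:, 1] - top_two[:, 0]) < 0.05   # illustrative threshold
print(near_tie)   # [False  True]
```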

Finally, we noticed that even once we had resolved these issues to the extent possible, raking to the CHOPIT-derived targets changed the proportions of other variables in unexpected ways. Ultimately, the assumptions required to use anchoring vignettes in this way did not appear to be more defensible than the assumptions underpinning the comparison of basic unadjusted responses.
