Tamanna Hossain-Kay

Paper Summary: Whose Opinions Do Language Models Reflect?

Tamanna Hossain-Kay / 2023-08-05

Paper Link: https://arxiv.org/pdf/2303.17548.pdf
Authors: Shibani Santurkar \(^1\), Esin Durmus\(^1\), Faisal Ladhak\(^2\), Cinoo Lee\(^1\), Percy Liang\(^1\), Tatsunori Hashimoto\(^1\) (\(^1\) Stanford, \(^2\) Columbia University)

Language models, or LMs, can give “opinionated” answers to open-ended, subjective questions. But whose opinions are these? This is important to understand as LMs become more integrated into open-ended applications.

Recent studies have shown that LMs can exhibit specific political stances and even mirror beliefs of certain demographics. To investigate this further, the authors used a framework built on public opinion surveys. This framework, utilizing the OpinionQA dataset formed from Pew Research’s American Trends Panels (ATP), offers insights via expertly curated topics, clear wording, and standardized multiple-choice responses.

This post summarizes the paper’s analysis of nine LMs from OpenAI and AI21 Labs:

OpinionQA Dataset
Measuring Human-LM Alignment
Results: Whose views do current LMs express?

OpinionQA Dataset

When trying to curate a dataset to discern the viewpoints of Language Models (LMs), researchers face several challenges. These include selecting relevant topics, creating effective questions to extract the LM’s views, and establishing a benchmark of human opinions for comparison. A promising solution is to utilize public opinion surveys, a proven tool for capturing human sentiments.

OpinionQA is built on the Pew American Trends Panel (ATP) surveys.

OpinionQA uses 15 such ATP surveys on diverse topics, from politics to health, gathering responses from thousands of participants in the US. The collected data, including individual answers, demographics, and participant weights, helps sketch the human opinion landscape. The questions are then organized into broad and fine-grained topic classes. It’s crucial to note that the OpinionQA dataset covers only English and the US demographic.

Measuring Human-LM Alignment

To facilitate the comparison between humans and LMs, LMs are queried using conventional question answering (QA) techniques, transforming each question into a specific format as demonstrated in Figure 1. The sequence in which options are presented follows the original design from the surveys, acknowledging the ordinal nature of the options.

Figure 1: Evaluating the opinions reflected by language models using the OpinionQA dataset. The pipeline is as follows: an LM (here, text-davinci-003) is prompted with a multiple-choice survey question from our dataset, preceded by an optional context (QA/BIO/PORTRAY) to steer it towards a persona (here, Democrats). The next-token log probabilities from the LM are then obtained for each of the answer choices (excluding refusal) and normalized to obtain the model’s opinion distribution. Finally, this quantity is compared to reference human opinion distributions, obtained by aggregating human responses to the same survey question at a population level and by demographic. Model and human refusal rates are compared separately.
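The exact prompt template appears in Figure 1; as an illustration, a minimal sketch of this kind of multiple-choice prompt construction (the function name and lettering scheme here are assumptions, not the paper’s exact template) might look like:

```python
def format_survey_prompt(question: str, options: list[str]) -> str:
    """Render a survey question as a multiple-choice QA prompt.

    Illustrative approximation of the Figure 1 format: options are
    labeled A, B, C, ... in their original survey order, preserving
    the ordinal structure of the answer scale.
    """
    letters = [chr(ord("A") + i) for i in range(len(options))]
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {option}" for letter, option in zip(letters, options)]
    lines.append("Answer:")
    return "\n".join(lines)


prompt = format_survey_prompt(
    "How worried are you about climate change?",
    ["Very worried", "Somewhat worried", "Not too worried", "Not at all worried"],
)
```

The prompt ends at "Answer:" so that the model’s next-token probabilities over the option letters can be read off directly.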

The evaluation of the LMs is split into representativeness (where no context is given) and steerability (where contextual cues guide the LM to mimic a certain demographic). LMs are “steered” towards mimicking a specific demographic using three prompting methods: QA, BIO, and PORTRAY (illustrated in Figure 1).


The study evaluated models from both OpenAI and AI21 Labs (see Table 5 for the complete list). When posed with a question, a model scores the likelihood of each potential answer, and these scores are normalized to obtain the model’s opinion distribution. Due to API restrictions, OpenAI returns at most 100 next-token log probabilities, while AI21 provides up to 10. If an answer isn’t among the returned probabilities, its probability is bounded by the minimum of the remaining probability mass and the smallest returned token probability.

Comparing human and LM opinion distributions

The metric must account for the ordinal nature of the survey answers, so KL-divergence is not used. Instead, the 1-Wasserstein distance (WD) is used, defined as the minimum cost of transforming distribution \(D_1\) into distribution \(D_2\). To place the ordinal answers in an appropriate space for WD, they are mapped to consecutive positive integers.
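With the ordinal answers mapped to the integers 1..N, the 1-Wasserstein distance reduces to the L1 distance between the two cumulative distributions. A minimal sketch:

```python
def ordinal_wasserstein(p: list[float], q: list[float]) -> float:
    """1-Wasserstein distance between two distributions over ordinal
    answer options mapped to 1..N. On the integer line with unit
    spacing, this is the sum of absolute differences between the two
    cumulative distribution functions.
    """
    sp, sq = sum(p), sum(q)
    dist = cp = cq = 0.0
    for a, b in zip(p, q):
        cp += a / sp  # running CDF of p
        cq += b / sq  # running CDF of q
        dist += abs(cp - cq)
    return dist
```

Identical distributions have distance 0, and shifting all mass by one ordinal step costs exactly 1, which is why this metric respects answer ordering where KL-divergence does not.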

Results: Whose views do current LMs express?


Overall representativeness: most models’ alignment with the overall population is only about as strong as the alignment between agnostic and orthodox people on abortion, or between Democrats and Republicans on climate change (Figure 2).

Figure 2: Overall representativeness \(R_O^m\) of LMs: A higher score (lighter) indicates that, on average across the dataset, the LM’s opinion distribution is more similar to that of the total population of survey respondents (Section 4.1). For context, we show the representativeness measures for: (i) demographic groups that are randomly chosen (‘avg’) and least representative of the overall US population (‘worst’), and (ii) pairs of demographic groups on topics of interest.
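As a rough sketch of how a representativeness score of this kind might be computed from per-question Wasserstein distances (the normalization by \(N-1\), which bounds each distance in [0, 1], is an assumption here; the paper’s exact definition is in its Section 4.1):

```python
def representativeness(model_dists: list[list[float]],
                       human_dists: list[list[float]]) -> float:
    """Hypothetical sketch of a representativeness score: one minus
    the mean 1-Wasserstein distance between model and human opinion
    distributions across questions. Each distance is divided by N - 1
    (the maximum over N ordinal options) so the score lies in [0, 1],
    with 1 meaning perfect alignment.
    """
    scores = []
    for p, q in zip(model_dists, human_dists):
        sp, sq = sum(p), sum(q)
        cp = cq = wd = 0.0
        for a, b in zip(p, q):
            cp += a / sp
            cq += b / sq
            wd += abs(cp - cq)  # CDF-based 1-Wasserstein distance
        scores.append(wd / (len(p) - 1))
    return 1.0 - sum(scores) / len(scores)
```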

Group representativeness (selected results shown in Figure 3):

Figure 3: Group representativeness scores \(R_G^m\) of LMs as a function of political ideology and income: A higher score (lighter) indicates that, on average across dataset questions, the LM’s opinion distribution is more similar to that of survey respondents from the specified group (i.e., \(R_G^m(Q)\) is larger). The coloring is normalized by column to highlight the groups a given model (column) is most/least aligned to. We find that the demographic groups with the highest representativeness shift from the base LMs (moderate to conservative, with low income) to the RLHF-trained ones (liberal, with high income). Other demographic categories are shown in Appendix 8.

Modal representativeness: text-davinci-003 has a sharp, low-entropy opinion distribution that converges to the modal views of liberals and moderates.

Figure 9: A comparison of the entropy of LM response distributions: text-davinci-003 tends to assign most of its probability mass to a single option. This is in contrast to human opinions, which tend to have a fair amount of variability.


Figure 4: (a) The alignment of LM opinions with the actual and modal views of different ideological groups on contentious topics. (b) Steerability of LMs towards specific demographic groups: we compare the group representativeness of models by default (x-axis, \(R_G^m\)) and with steering (y-axis, \(S_G^m\)). Each point represents a choice of model m and target group G, and points above the x = y line indicate pairs where the model’s opinion alignment improves under steering. Shaded lines indicate linear trends for each model m, and we generally observe that models improve from steering (above x = y) but the amount of improvement is limited.
Figure 11: A breakdown of the post-steering representativeness scores of different LMs by the subgroup they are steered to.


A consistency score \(C_m\) is defined as the fraction of topics where the LM’s best-aligned subgroup on a given topic matches its best-aligned subgroup overall. The score ranges from 0 to 1; a higher score means the model aligns with the same subgroups across all topics.
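A hypothetical sketch of this score; here the model’s overall best-aligned group is approximated by the most frequent per-topic winner, which may differ from the paper’s exact computation (there, the overall group is determined across all questions):

```python
from collections import Counter


def consistency_score(best_group_by_topic: dict[str, str]) -> float:
    """Fraction of topics whose best-aligned group matches the model's
    overall best-aligned group. `best_group_by_topic` maps each topic
    to the demographic group the model is most aligned with on that
    topic; the overall group is taken as the most frequent per-topic
    winner (a simplifying assumption).
    """
    overall_best, _ = Counter(best_group_by_topic.values()).most_common(1)[0]
    matches = sum(g == overall_best for g in best_group_by_topic.values())
    return matches / len(best_group_by_topic)
```

A model that favors Liberals on two of three topics and Conservatives on the third would score 2/3 under this reading.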

Figure 6: Consistency of LM opinions Cm, where a higher score (lighter) indicates that an LM aligns with the same set of groups across topics.
Figure 5: Consistency of different LMs (columns) across topics (rows) on different demographic attributes (panels). Each dot indicates an LM-topic pair, with the color indicating the group to which the model is best aligned, and the size of the dot indicates the strength of this alignment (computed as the ratio of the best and worst subgroup representativeness for that topic, see Appendix B.3 for details). We find significant topic-level inconsistencies, especially for base LMs, and strong educational attainment consistency for RLHF trained LMs.