User context we feed to LLMs also feeds their biases. How do we guard against this?

Brent

When building a product on LLMs, the way we personalize context can systematically degrade the experience for some users, even when we think we're treating everyone the same.

A recent study from MIT found evidence of this in action. Researchers prepended short user bios to multiple-choice questions and tested three models: GPT-4, Claude 3 Opus, and Llama 3. When the bio suggested lower English proficiency, less formal education, or non-US origins, all three models gave less accurate and less truthful answers. And the effects compounded: non-native English speakers with less formal education got the worst results across the board.

The numbers are concerning. Claude 3 Opus refused to answer nearly 11% of questions for that group, versus 3.6% for the control. It responded with condescending or mocking language 43.7% of the time for less-educated users, compared to under 1% for highly educated ones.

The paper includes examples of Claude responding to a non-U.S. user with less formal education by performing a caricature rather than answering clearly. It also found that the model selectively refused to answer questions about nuclear power, anatomy, and historical events for less formally educated users from certain countries, while answering the same questions correctly for everyone else.

Why context is the thing to watch

These biases were triggered by changes in context.

Most of an LLM's behavior is set during training. Developers building a product on an existing model can't change its fundamental behavior. And unfortunately, researchers have known for a while now that AI models amplify biases present in their training data. It's a deeply difficult problem to solve, one that plagues even modern models explicitly designed to be unbiased.

Developers building an app can't fix any of that. What they can and do explicitly control is the context: everything sent to the model alongside the end user's actual question, such as the date, user info, special instructions, and background data. The work of curating this input is sometimes called context engineering. The right context makes a generic model feel specific and useful for the problem at hand.

Here's the kind of context you might imagine sending for a personal finance product:

System time: 2026-03-01T14:23:00Z
User: Maria
Timezone: America/Mexico_City
Connected accounts: 1

This all seems reasonable and helpful. The timestamp matters for transaction-related questions. The name adds warmth and personalization. The account count signals onboarding status.

But look at this context through the lens of the MIT study. A timezone hints at geography, a name can suggest cultural or ethnic background, and a single connected account might read as financial unsophistication.

Personalization and bias are triggered by the same mechanism. The more you know about a user, the better the model can tailor its responses, but the more surface area it also has to degrade them. The MIT study is a controlled demonstration of this: the only thing that changed between good and bad responses was the user bio prepended to the prompt.
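The study's setup can be approximated with a simple harness: hold the question fixed, vary only the prepended bio, and compare the answers. This is an illustrative sketch, not the study's actual code; the bios are made up, and the commented-out `ask()` call stands in for whatever model client you use.

```python
# Hold the question constant and vary only the prepended user bio,
# mirroring the study's controlled setup. Bios and ask() are placeholders.
QUESTION = "Is it safe to mix bleach and ammonia for cleaning?"

BIOS = {
    "control": "",
    "persona": "I am from Mexico. I did not finish school and my English is not so good.",
}

def build_prompt(bio: str, question: str) -> str:
    """Prepend the bio; the question itself never changes."""
    return f"{bio}\n\n{question}".strip()

for label, bio in BIOS.items():
    prompt = build_prompt(bio, QUESTION)
    # response = ask(prompt)  # your model client goes here
    print(label, "->", repr(prompt[:60]))
```

Any systematic difference between the two responses is attributable to the bio alone, which is exactly the effect the study measured.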

This is especially relevant because it mirrors real-world patterns. Agent products already have memory capabilities that store user information across conversations to personalize responses, and features like this are becoming standard.

Advice for product builders

If you’re building on LLMs, every user-specific field you pass to the model is a potential bias vector. A user's name, location, and language proficiency are all useful for tailoring responses, and all present in the MIT study's personas that triggered degraded answers. So we need to keep in mind:

Context is design AND engineering. It's tempting to throw everything you know about a user into the prompt. But context is a design decision, not just an engineering one. What you include, what you abstract, and what you leave out entirely all shape who gets good answers and who doesn't.
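One way to make that design decision explicit is a deny-by-default filter: only fields on an allowlist reach the prompt, and each entry records why it earned its place. The field names below are illustrative, not a real schema.

```python
# Raw user data available to the app (illustrative fields only).
RAW_USER_CONTEXT = {
    "system_time": "2026-03-01T14:23:00Z",
    "name": "Maria",
    "timezone": "America/Mexico_City",
    "connected_accounts": 1,
}

# Deny by default: a field reaches the prompt only if it appears here,
# with a written justification. Everything else is excluded.
ALLOWLIST = {
    "system_time": "needed to resolve relative dates in transaction questions",
}

def filter_context(raw: dict) -> dict:
    """Keep only fields that have earned their way into the prompt."""
    return {k: v for k, v in raw.items() if k in ALLOWLIST}

print(filter_context(RAW_USER_CONTEXT))
# only system_time survives; name, timezone, and account count stay out
```

The justification strings do nothing at runtime, but they force each inclusion to be a deliberate, reviewable choice rather than a default.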

Don’t just optimize for the “typical” user. If you only measure accuracy for typical users, you will miss the outliers where failure matters most. A system that works for 90% of users and fails the 10% who need it most is a failure. And as a builder, you won’t know it’s failing unless you actually go look and ask.

What this means for Enrich

We've written about the gap between AI tools that feel trustworthy versus ones that actually are trustworthy. This research shows a specific instance of this gap: a tool can sound helpful and confident, yet systematically give worse answers to the people we most want to help.

Ultimately this is a design and values question as much as a technical one. We can make commitments about the values we apply to how we learn and build:

We treat every piece of user context as a potential bias vector. Information we provide to the model alongside a user's question has to earn its way in. The default is to exclude. If including a field risks inequitably affecting results, it stays out.

We test for who the system fails, not just whether it works. Aggregate accuracy isn't enough. If our tool answers a benefits question correctly for some users and refuses or botches it for others, that’s a failure. We will evaluate with demographic segmentation, looking for the kinds of degraded response patterns the MIT study found. Initially we’ll approach this testing qualitatively, and as we grow we’ll develop quantitative methods as well.
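A quantitative version of that segmented evaluation can be small: group eval results by persona and compare refusal rates against the control, instead of reporting one aggregate number. The records here are hypothetical; in practice they would come from an eval run.

```python
from collections import defaultdict

# Hypothetical eval records: one entry per (persona, question) run.
records = [
    {"persona": "control", "refused": False},
    {"persona": "control", "refused": False},
    {"persona": "non_native_less_edu", "refused": True},
    {"persona": "non_native_less_edu", "refused": False},
]

def refusal_rate_by_segment(records: list[dict]) -> dict[str, float]:
    """Compute the per-persona refusal rate, never collapsing segments."""
    totals, refusals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["persona"]] += 1
        refusals[r["persona"]] += int(r["refused"])
    return {p: refusals[p] / totals[p] for p in totals}

rates = refusal_rate_by_segment(records)
# A segment whose refusal rate far exceeds the control's is a red flag,
# even if overall accuracy looks fine.
```

The same grouping works for any degradation signal: condescending tone flags, wrong answers, or truncated responses.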

We keep testing. This isn't a pre-launch checkbox. Models change, context changes, our user base changes. Bias evaluation has to be a recurring practice, not a one-time audit.

New technologies tend to serve some people far better than others. We're optimistic about what AI can do, but we commit to being honest about where it falls short, and to building accordingly.