Turning complex LLM value-alignment findings into an accessible, interactive evaluation platform.


time
June 2024 – Sep 2024
company
Microsoft Research
team
1 Partner Research Manager
1 Senior Research PM
1 Principal Research PM Manager
1 Product Manager
5 Researchers
1 Engineer
2 UX Designers
overview
Microsoft Research’s work on Societal AI evaluates how large language models align with human values across multiple ethical and cultural frameworks. The research introduced a rigorous evaluation framework to uncover underlying value tendencies in LLMs. I translated these research findings into a practical benchmarking tool for real-world comparison and interpretation.
During my 4-month internship, I worked as a Product Manager in the Social Computing Group, collaborating with researchers, designers, and engineers to transform complex value-alignment research into a functional, web-based evaluation platform. I led product direction, defined feature scope, and designed experiences that made value-alignment insights accessible to broader audiences.
design highlight
Value Compass: A Unified Benchmark for
Comparing and Interpreting LLM Value Alignment
Unified Benchmarking
Across 4 Value Systems
A unified leaderboard that ranks LLMs across four foundational value systems. Users can compare models at a glance and drill down into specific dimensions for deeper analysis.
From Static Scores
to Meaningful Interpretation
The model detail page moves beyond static scores. By combining value profiles, radar charts, and real evaluation cases, users can see how a model’s responses are interpreted and translated into specific value scores.
Side-by-Side Comparison Across Models
Users can compare up to five models and generate a comparison report. A “value space” view shows how models cluster based on similar value patterns, helping users see broader cultural differences at a glance.
intro
Evaluating how LLMs align with
cultural, ethical, and social values
Microsoft Research has studied AI safety and value alignment for years. However, much of this work remains difficult for non-technical audiences to interpret or apply.
Value Compass translates AI alignment research into an interactive public tool. Built on benchmarks evaluating 30+ LLMs across multiple human value systems, it enables fine-grained exploration of how different models behave across cultures and contexts.
Adaptive Framework
Evaluate model performance across different cultural and social contexts, rather than relying on a single static framework.
Comparable Scoring
Make value differences visible through consistent, comparable scoring grounded in observable model behavior.
Grounded in Social Science
Draw on insights from sociology, ethics, and AI safety to define and interpret human values in AI systems.
problem framing
Rapid Discovery With Researchers
& Stakeholders Under Constraint
In the first two weeks, I ran rapid interviews and syncs with researchers and stakeholders. The value alignment research itself was solid, but translating it into a usable product under tight time constraints exposed four key challenges.
No Clear, Dynamic Overview Across Models
Comparing models across four value systems required manually piecing together papers and spreadsheets. There was no single interface to see overall patterns or switch context easily.
Scores Without Intuitive Explanations
Final scores were available, but it wasn’t clear how specific responses led to those value interpretations, making results hard to trust.
Too Academic for Broader Audiences
Frameworks remained paper-centric with dense tables and jargon. Non-technical users struggled to meaningfully compare models without clearer visuals.
A Three-Month Public Launch Deadline
We had less than three months to ship a public version, with almost no room for iteration.
scoping
Phased Execution Plan Under a 12-Week Constraint

impact
Led Solo PM Efforts to Ship Value Compass
Turning an Internal Research Benchmark Into a Public Tool in Under 3 Months
30+ LLMs
Enabled fine-grained comparison of cultural and ethical alignment across 30+ LLMs.
4 core modules
Took the platform from MVP prototype to launched product as the sole product owner.
~ 30%
Lowered cognitive barriers for non-research users by ~30%, validated by user surveys.
Product Strategy 1
Dynamic Overview for Diverse Users
decision
Defined User Personas
Using insights from primary stakeholder interviews and internal reviews, I developed three detailed personas representing our target audiences. These personas guided feature design and prioritization, and I shared them during initial syncs with the research leadership team.

23 years old · Master's Student

35 years old · Product Manager
The original structure separated research and evaluation into different sites, creating a fragmented experience. I restructured the Leaderboard as the primary entry point and consolidated research content under a “Resources” section.

Product Strategy 2
Intuitive Explanations Making Research Transparent
decision
Making Scoring Transparent and Traceable
Static rankings alone weren’t enough. I redesigned the product to make the scoring logic visible and traceable. Users can move across value systems, adjust dimensions dynamically, and compare models visually. Each score links back to real evaluation cases, so people can see how specific responses led to particular value interpretations.
For those unfamiliar with the underlying frameworks (such as the Schwartz theory of basic values and its metrics), layered explanations make the research understandable without requiring users to read academic papers.

Product Strategy 3
Accessibility for Diverse Needs
decision
Supporting Different Depths of Exploration
To boost retention and engagement, I designed the benchmarks with progressive depth. Professional users can dive deeper into value distributions and cultural correlation maps through interactive visuals in the model detail analysis and model comparison results.
Casual users get “Test Your Values”: a quick, fun 14-question quiz that matches models to their personal values in minutes. This turns dense research into an approachable, useful tool for everyone.

Reflection & Next Step
This project is the result of a collaborative effort. Below is part of the team that contributed to this project. If you're interested in learning more, please visit our project website and check out the paper, which has been accepted to ACL 2025. 👏

Reflection
Research-based products must balance usability with methodological rigor.
I worked with researchers on an internal survey to identify usability gaps, focusing on whether the UX obscured the research logic or weakened the experience.
Design should make complex knowledge accessible to broader audiences without diluting it.
LLM research processes are inherently complex. I wasn’t trying to simplify the findings, but to create clearer pathways into them. By segmenting user needs, introducing visual comparisons, and layering explanations, I lowered the entry barrier while preserving technical depth.
As models become more capable, their value leanings and potential risks also scale. Making these patterns visible is critical for responsible AI development.
next step
From Static Benchmark to Interactive Exploration
As a next step, I explored more interactive ways to surface cultural differences in model behavior. I introduced map-based cultural visualization and Arena-style comparison to move beyond aggregate scores. Instead of only presenting final rankings, users can now explore how value alignment shifts across regions and models. This shifted the product from a static benchmark into a more exploratory tool.
