Researcher · Strategist
Anila Alexander
is in pursuit of
the next great problem.

I study what happens to people when intelligent systems enter their lives: how AI reshapes cognition, trust, mental health, and the legibility of human judgment at scale. The product work gave me a close-up view: eight years inside Google, Meta, and The New York Times watching those effects in high-stakes conditions, with real people doing real work.

The through-line is communication. NYT was about how people read and discover information. Google was about what happens to the conversation between a company and its customers when AI enters it. Meta was about the next surface those conversations will happen on. The research question is always the same: does the system’s model of the person match how the person actually behaves?

© 2026 Anila Alexander

Work · 2015 – present

What happens to people
when intelligent systems
enter their lives.

Fear. Trust. The legibility of human judgment at scale. Eight years at NYT, Google, and Meta, always at the moment before the product existed, always asking what happens to people when the system enters their lives. The through-line: communication. How people read. How companies talk to customers. What the next surface for human connection looks like.

The work is organized into three registers: (1) a question about humans-with-AI, (2) a question about what research itself is for, and (3) a question about where systems and the people they’re built for diverge.

01
Who is in the loop with AI

The human-in-the-loop problem is a cognitive architecture problem, not an adoption problem. These studies tracked what happened to judgment, trust, and attention as AI took over the routine.

02
What research is for

Research as a decision tool. Telling a team when to ship and telling a team when to stop carry equal weight. The stimulus should match the question, not the format convention.

03
Where the system and the person diverge

Systems get built around a model of how people work and how people consume. The research question is whether that model is right, and what the cost is when it isn’t.


Curriculum Vitae

UX Researcher and Product Strategist with 8+ years leading research and strategy at The New York Times, Google, and Meta.

The question I keep returning to is how people adapt when the systems around them get smarter: how that changes what they trust, how they work, and how they understand their own role. The through-line across eight years at The New York Times, Google, and Meta isn’t the domain or the method. It’s that question. I’ve studied it inside newsrooms, inside AI-assisted service workflows, inside XR developer tooling. The answer is never simple and never just a product problem.

2025 — Present
Meta
UX Research and Product Strategy

Leading research to define product strategy for next-generation VR/AR/XR platforms, developers, and creators.

2021 — 2025
Google
UX Research and Program Lead

Led two multi-year 0→1 generative AI research programs from early exploration through production launch.

2018 — 2021
The New York Times
Product Strategy & UX Research

Led product strategy and UX research on mobile experiences, new content formats, and internal content management systems.

Teaching
2016 — 2020
The New School — Eugene Lang College
Instructor · Data Visualization
General Assembly
Lead Instructor · Product Management
Earlier experience
Peter G. Peterson Foundation · Product Strategy & UX Research · 2017–2018
Zelis · Product & UX Strategy · 2015–2016
Education
Hunter College — M.S. Applied Digital Sociology
Pratt Institute — UX/UI Mobile Design Certificate
New York University — B.A. Journalism and Politics
Meta · 2025–Present

Research for the next generation of Mixed Reality experiences

Three connected studies building a research foundation for simulation and performance tooling across VR, AR, and MR.

XR is a technically complex, fast-moving domain where product pivots are common. When I joined Meta, the tools that creators and developers rely on to build VR, AR, and MR apps had to evolve to handle current and future hardware launches.

I built the knowledge base across three connected studies, using AI systematically to compress what would normally take months into weeks. The question I kept returning to: what does a developer actually need at the moment they’re trying to build something that has never been built before?

Projects
Chapter 02 of 04 — Knowing What to Fix Before the Hardware Shipped

Building the knowledge base, then testing it.

Little prior research existed on XR developer tools at Meta. I used AI to build the knowledge base across two secondary studies, then validated it with primary research in the week of launch.

Summary

In emerging technical fields, the first research problem isn’t what to study. It’s how to build enough knowledge to study anything at all. XR tooling in 2025 had almost no prior research coverage. I used AI systematically to compress the knowledge-building phase: NotebookLM for literature review synthesis across 15+ fragmented internal sources, AI-assisted social listening via Vocal across Reddit, YouTube, and Meta Community forums, and AI infographic generation for cross-functional readouts.

Then I ran usability sessions on XR Simulator 2.0 the week of launch. Seven sessions, AI-assisted synthesis, report delivered within days. Three studies, one argument: you can build a strategic research foundation in weeks if you use AI as a methodological choice, not a convenience.

I surfaced a hidden AI integration opportunity funded for the 2026 roadmap, defined Meta’s MR competitive bet, and shaped the H1 2026 hand simulation roadmap, which had been waiting on a research signal that didn’t exist.

Key takeaway

AI doesn’t replace researcher judgment. It removes the bottleneck between having the data and being able to act on it. The methodology here was built on that principle: AI handles the volume and compression problem, the researcher handles the synthesis, the so-what, and the decision. I build systems that are adaptable to each team’s needs while remaining cognitively safe for researchers and participants.

Problem space

When I joined Meta in 2025, the developer tools org had almost no dedicated UXR coverage. A Workplace post from the performance team had explicitly called for research — not as a courtesy, but because product decisions were being made without it. Both new and experienced creators and developers use profiling and simulation tools to build, test, and optimize VR, MR, and 2D applications. Key pain points had been named anecdotally: inconsistent terminology across tools, steep learning curves, no actionable guidance, simulators that couldn’t keep pace with what creators actually needed. But no one had synthesized what was known, mapped the competitive landscape, or run primary research in this space at Meta before.

XR Simulator 2.0 was launching in December 2025. A major hardware generation was coming. The question wasn’t whether these tools mattered — they did. The question was which of Meta’s existing positions were worth building on, which were liabilities, and what the next hardware generation would expose before the team had time to fix it. I had weeks, not months.

Two developer personas anchored the research across all three studies. They weren’t demographic categories. They were different relationships to the tools.

Persona 01
The Established Innovator

An experienced Unity developer who addresses performance during the Test and Optimize stages. Uses OVR Metrics for real-time monitoring, then moves to deeper tools when something is wrong. Values speed and doesn’t have time to re-learn an interface mid-sprint.

Persona 02
The Experimental Explorer

An Android developer adapting mobile apps for MR. Motivated by revenue and audience growth, but hard to attract and easy to lose to fragmentation and monetization uncertainty. Uses performance tools at two moments: making a 2D app MR-compatible, and preparing to publish.

The distinction matters because the tools that work for Persona 01 (RenderDoc, Perfetto) are the tools that fail Persona 02, and vice versa. A research question framed at the level of "how do developers use profiling tools" misses this entirely.

Business opportunity

Three gaps, each with a compounding cost if left unaddressed before the next hardware generation.

Performance Analyzer had a return on investment sitting unclaimed. Only 3% of organizations used it before submission, but those organizations saw 35% higher app approval rates. A tool with that effect size, with 3% penetration, is a growth lever the team hadn’t pulled. The barrier wasn’t the tool itself. It was onboarding and discoverability.

Meta’s competitive position in MR was the right bet, but the team needed research to confirm it. Apple led on polish; Android led on flexibility. Meta led on MR and multiplayer simulation while carrying the most usability debt of the three platforms. The research needed to say: own that position, don’t chase parity on someone else’s terms.

Hand simulation was about to become a bottleneck no one had surfaced yet. Six of seven usability participants independently asked for the same thing: custom, recordable, uploadable hand gestures. Apps were already being built that required them. If the simulator didn’t support them before the next hardware generation shipped, porting would stall. The research named it early enough to act on it.

Methods

Three studies, each calibrated to a different knowledge gap. AI was the methodological through-line: not a convenience, but a deliberate choice about where researcher time is most valuable when the domain has no prior coverage and the clock is running.

Literature Review · Competitive Analysis · Social Listening · Usability Testing · AI Methods

Study 01 · NotebookLM for literature synthesis

The knowledge base for XR performance tools was fragmented across 15+ internal Workplace posts and research reports with no prior synthesis (the Workplace call-for-research post had flagged this explicitly). Sequential reading would have taken weeks. I uploaded the entire corpus to NotebookLM and ran structured queries to surface patterns, contradictions, and gaps across sources simultaneously. The synthesis identified the 24-point satisfaction gap, the Performance Analyzer 3%/35% finding, the two-persona framework, and the four key challenges (tool fragmentation, misaligned metrics, insufficient actionable guidance, and onboarding barriers) that shaped the research agenda for everything that followed.

The cross-functional readout was built using AI-assisted infographic generation: translating dense quantitative findings into a visual artifact the design and PM teams could work from directly. In a domain where most stakeholders are engineers, making the research readable is its own methodological problem.

Study 02 · Vocal for competitive analysis and social listening

Developer communities for XR tools are active and technically specific: Reddit threads on simulator bugs, YouTube tutorials surfacing workflow pain points, Meta Community forum posts on feature requests. Manually reading and synthesizing that volume of unstructured feedback would have taken weeks. I ran two structured prompts through Vocal: one on simulation tool competitive analysis across Meta, Apple, and Android; one specifically on gaze and hand input issues across platforms, drawing on social mentions, Quest Store reviews, bug reports, YouTube, and customer service tickets across a seven-week window.

This produced the competitive positioning finding (Meta leads on MR and multiplayer, carries the most usability debt), the XID ambivalence finding (developers building for next-gen hardware are uncertain about the audience and monetization path, which directly affects how much they invest in tool mastery), and the early signal on hand tracking as a differentiator.

Study 03 · Usability testing with AI-assisted synthesis

Seven moderated usability sessions on XR Simulator 2.0 in the week of launch. Participants were recruited across the Ainsley and Wren personas (established Unity developers and Android developers adapting mobile apps for MR) to test whether the reliability improvements had landed and what remained blocked. I used NotebookLM to synthesize session notes in parallel during data collection, enabling pattern identification before the final session was complete. The full report was delivered within days of closing. In a launch-week context where the team needed findings before the post-launch sprint was scoped, that turnaround was the point.

Key insights

Theme 01 · Performance tools

A 24-point satisfaction gap across the tool suite, with adoption and reliability problems concentrated in the tools developers need most as they scale.

Performance is continuous, not a one-time fix. Creators and developers address issues both proactively and reactively throughout the build lifecycle, averaging 20 to 30 minutes per profiling session excluding setup.

Tool satisfaction and reliability vary widely. OVR Metrics leads at 77% CSAT and 85% perceived reliability. Perfetto sits at 53%. RenderDoc %Bad is at 40% against a 20% EoY target, signaling significant UX debt.

Developer experience determines tool needs. New and generalist creators and developers rely on lightweight tools like OVR Metrics. Experienced and specialized creators and developers need RenderDoc and Perfetto, but these carry the highest onboarding barriers.

Performance Analyzer drives app approval rates. Only 3% of organizations used it before submission, yet those organizations saw 35% higher approval rates, signaling strong untapped value.

Figure 01 · Performance tools landscape
Five tools, five terminologies, five learning curves. Satisfaction varied by 24 points.
Tool: OVR Metrics (lightweight)
Functionality: Real-time app performance data (FPS, CPU/GPU load) shown as an in-headset HUD overlay.
Key strength: Easy to use, first line of defense. Quick detection of CPU/GPU issues.
H1'25 signal: 77% CSAT · Reliability 85%, highest in class

Tool: Performance Analyzer (via MQDH)
Functionality: Real-time system metrics monitor, views system processes, flags when metric thresholds are hit. Used for deeper hardware / system investigation.
Key strength: Clear data presentation. Works well in tandem with OVR Metrics. Finer-grained data than engine profilers alone.
H1'25 signal: 74% CSAT · Reliability 79%

Tool: Perfetto (system-wide)
Functionality: Android system-wide tracing tool (CPU scheduling, system processes).
Key strength: Comprehensive view of system performance. Useful for optimizing complex multi-threaded applications.
H1'25 signal: 53% CSAT · Reliability 56%, below target

Tool: RenderDoc (via Meta Fork)
Functionality: Graphics debugging tool. Captures frames to analyze draw calls, shading, and rendering pipelines.
Key strength: Provides granular insights into graphics performance that no other tool offers.
H1'25 signal: 40% %Bad · Goal 20% EoY 2025 · not in DevX Tracker
Sources · CSAT & perceived reliability surveys · usage analysis
Figure 02 · The 24-point gap
A 24-point spread between the highest and lowest CSAT scores. The shape of UX debt across the suite.
CSAT by tool (scale 0 to 100, target 80%): OVR Metrics 77% · Performance Analyzer 74% · Perfetto 53% · 24-point gap
Source · H1 2025 DevX tracker · CSAT survey · RenderDoc tracked as %Bad: 40% against 20% EoY target. Not in DevX Tracker.
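
The arithmetic behind the gap is small enough to check by hand. The sketch below only restates the CSAT figures and the 80% target line from Figure 02; the variable names are illustrative, and RenderDoc is left out because it is tracked as %Bad rather than CSAT.

```python
# CSAT scores as reported in Figure 02 (H1 2025 DevX tracker).
# RenderDoc is excluded: it is tracked as %Bad, not CSAT.
csat = {"OVR Metrics": 77, "Performance Analyzer": 74, "Perfetto": 53}
TARGET = 80  # target line as read from the figure

highest = max(csat, key=csat.get)
lowest = min(csat, key=csat.get)

# The "24-point gap" is the spread between the best and worst CSAT score.
gap = csat[highest] - csat[lowest]

# Distance of each tool from the target, in percentage points.
shortfall = {tool: TARGET - score for tool, score in csat.items()}

print(f"{highest} ({csat[highest]}%) vs {lowest} ({csat[lowest]}%): {gap}-point gap")
for tool, points in shortfall.items():
    print(f"{tool}: {points} points below target")
```

Read this way, every tool in the CSAT set sits below target, and the gap describes how unevenly that shortfall is distributed across the suite.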

Theme 02 · XID ambivalence

The social listening surfaced something the product questions hadn’t anticipated: a layer of ambivalence about building for XR at all. Developers building for next-generation hardware weren’t just uncertain about which tools to use. They were uncertain about whether the audience would be there, whether the monetization path was viable, and whether the investment in tool mastery would pay off. This matters for the research because it reframes the adoption problem. Low adoption of Performance Analyzer or XR Simulator isn’t just a discoverability or UX problem. It’s partly a motivation problem. Developers who aren’t confident in the platform aren’t going to invest in learning tools they might not need.

The implication for the product team: fixing onboarding and documentation is necessary but not sufficient. The tools need to visibly reduce risk — by making it easier to hit quality gates, easier to port existing apps, easier to test scenarios that would otherwise require multiple headsets. Each of those is an argument for building for XR, not just an argument for using a specific tool.

Theme 03 · Simulation tools

Meta leads on MR and multiplayer simulation, but adoption lags and competitors are closing the gap. The strategic position is clear; the execution gap is onboarding and discoverability.

Simulation tools appeal to indie studios and 3D creators, but mobile developers remain hard to convert. 47% of game developers globally focus on 3D development, making them well-positioned to build for XR. Indie studios and solo creators are a key segment, motivated by audience growth and new creative possibilities, but hard to attract without strong monetization and discoverability support.

XR Simulator drives publish rates but adoption lags. XRSim has a +0.1% effect on app publishing but only 40% adoption within 6 months, below Link’s 50%. WATM@14 sits at 18.27% against a 24% target.

Spatial Simulator is a major relief for Android creators and developers. Early dogfooding feedback described dramatically faster iteration speeds. No more don/doff. No dead batteries. Development anywhere on the go.

Competitors set a baseline expectation. Apple leads on polish. Android leads on flexibility. Meta leads on MR and multiplayer testing, but faces the most usability challenges of the three.

From lifecycle to workflow

The social listening component was the methodological move that made the competitive analysis possible at the speed it needed to happen. Using Vocal with structured prompts, I ran targeted queries by platform, by tool, by developer type, and synthesized findings across thousands of posts into a competitive signal the team could act on. Three platforms, distinct competitive positions, Meta’s MR advantage as the strategic bet. All of this came directly from that synthesis.

The competitive data told us where Meta stood. The developer interview data told us how developers actually used simulation tools inside the build process. These were different questions, and together they shaped the strategic recommendation: own the Build phase, because that is where simulation earns its place and where Meta has the strongest differentiated position.

Figure 03 · From lifecycle to workflow
Six phases. One that matters most for sim tools. Two workflows the data made unmissable inside it.
Part 1 — where sim tools live in the lifecycle
01 Discover (Awareness · Consider)
02 Learn (Onboard · Educate · Explore)
03 Build (Develop · Test & Optimize): the focus phase, where sim tools live
04 Distribute (Validate · Release)
05 Grow (Promote · Engage)
06 Monetize (Revenue · Analyze)
Part 2 — two workflows inside the Build phase
Workflow 01
Interchangeable
Headset → Sim → Headset → Sim → Headset → Sim
Don / doff, don / doff. Context lost at every switch.
Workflow 02
Time-blocked
XR Simulator (sustained) → Headset
Uninterrupted sim blocks. Headset reserved for final validation.
Framework · Canonical Builder Lifecycle (internal) · Workflows from competitive analysis + dev interviews

Figure 04 · Side-by-side user journey: three simulators
Apple leads on polish. Android leads on flexibility. Meta leads on MR and multiplayer, but carries the most usability debt.
Platforms compared: Apple visionOS Simulator · Meta XR Simulator · Android XR Emulator (Jetpack)

01 Setup (IDE, SDKs, platform support)
Apple: Xcode · visionOS SDK · Swift/SwiftUI. Mac only.
Meta: Unity (Unreal in progress) · OpenXR runtime. Windows, Mac, Linux in progress.
Android: Android Studio · Jetpack XR SDK, ARCore, SceneCore. Windows, Mac, Linux.

02 Launch (device selection, environment)
Apple: Run app in Xcode, select visionOS sim. Virtual spatial environment.
Meta: Activate in Unity, set OpenXR runtime. Simulates Quest 2, Quest Pro, Rift S.
Android: Create XR AVD in Device Manager. Home Space (2D panels), Full Space (3D).

03 Interact (input, spatial, accessibility)
Apple: Mouse/trackpad for hand/eye gestures. Multi-window, built-in accessibility.
Meta: Keyboard, mouse, controller, data forwarding for real controllers. Headset motion, record/replay.
Android: Mouse/keyboard for head, hand, controller. Spatial panels, Orbiters, ARCore scene understanding.

04 Test & Debug (tools, automation, profiling)
Apple: Xcode debugger, console logs. Automation limited. Profiling integrated in Xcode.
Meta: Unity/Unreal debugger, logs, telemetry. Record/replay head/input actions.
Android: Android Studio debugger, logcat, inspector. Emulator snapshots, record/replay.

05 Iterate (UI/UX, device switching)
Apple: SwiftUI / Reality Composer Pro. Device switching N/A.
Meta: Unity/Unreal UI, spatial layout. Toggle between Quest devices. MR simulation on roadmap.
Android: Jetpack Compose for XR, adaptive layouts. Change AVD config, test multiple devices.

06 Export (capture, distribution)
Apple: Built-in capture tools. App Store, TestFlight.
Meta: Unity/Unreal capture, external tools. Quest Store, App Lab, direct install.
Android: Emulator capture tools. Play Store, direct install.

Strengths
Apple: Seamless ecosystem integration, polished spatial UI/UX, high fidelity, accessibility focus.
Meta: Wide hardware support (Quest 2/Pro/Rift S), Unity/Unreal integration, robust controller simulation, strong gaming focus.
Android: Cross-platform (Win/Mac/Linux), open ecosystem, strong Android tooling, flexible emulator configuration.

Limitations
Apple: Limited hardware simulation, Mac-only, no cross-platform, closed ecosystem.
Meta: Complex setup for beginners, less integrated debugging, platform fragmentation, limited OS-level spatial features.
Android: Newer platform, evolving SDKs, device fragmentation in the wild, less mature MR features.
Source · Competitive audit, Q4 2025 · synthesized from public docs, SDK reviews, and developer forums

Theme 04 · XR Simulator 2.0

7 of 7 participants praised the new UI. The standalone app and clearer panel organization were described as less clunky, more intuitive, and accessible enough that "anybody can use this." 4 of 7 noted improved stability and fewer crashes.

"My biggest complaint previously was it only worked like 40% of the time. It was quite frustrating. But with this new app, I feel like it just works every time."

Creator / developer participant, Session 2

Versioning and discoverability remain blockers. 3 of 7 participants were confused by two coexisting versions (v81 and v83) and the split between Unity package manager and standalone website installation, creating friction at the moment of adoption.

Gaze and eye tracking readiness is low. Most creators and developers had no hands-on experience with gaze inputs due to Quest 3/3S hardware limitations and EU privacy concerns, creating uncertainty about next-generation hardware readiness.

Hand simulation is the top feature request. 6 of 7 requested custom, recordable, and uploadable hand gestures, critical for creators building sign language apps, escape rooms, and gesture-driven experiences. 4 of 7 needed complex multi-hand interactions their apps required but could not simulate.

Deep dive: Hand simulation, the top feature request

Hands are the interface for next-generation XR, and 6 of 7 participants asked for the same thing in different ways: custom gestures, two-handed interactions, controller-to-hand mapping. Not edge cases. Load-bearing requirements for the apps creators were already trying to build.

Finding 01

Custom, uploadable, multi-type hand gestures.

Participants expressed a strong desire for the ability to create, record, and upload custom hand gestures, as well as simulate complex interactions that go beyond the default set. 3 of 7 mentioned uploadable gestures explicitly; 4 of 7 needed compound gestures (e.g., using both hands to open a book) the simulator could not support.

"If we may be programmed our own custom hand gestures. Like, oh, I'm going to open a book and that would then cue the headset to do something. That would be great to be able to have that in XR Simulator that we've, you know, just press a button and the hands do that."

P2
Recommendation

Add features to record, create, and upload custom hand gestures. Expand capabilities to handle complex, multi-hand interactions.

Finding 02

Flexible hand assignment and role customization.

Developers emphasized supporting both dominant and non-dominant hands, and assigning distinct roles to each. 2 of 7 design interactions where one hand performs actions while the other manages UI or menus. Testing for left- and right-handed users is a baseline inclusivity requirement.

"That was basically one of the first requests that I got from players when I didn't have full left handed support in my games. People would cry out and I think for hand tracking this will be even more important than with controllers because it's just even closer to your own physical body."

P3
Recommendation

Add the ability to switch between testing right-, left-, and both-hand gestures. Provide tooling and documentation for assigning distinct roles to each hand.

Finding 03

Guidance on controller-to-hand mapping.

Developers want clearer guidance and tooling for how controller inputs translate to hand gestures, and how to simulate or debug those mappings. Confusion at this layer slows everything downstream. A barrier to porting existing apps for next-gen hardware.

"So we wanted to incorporate these for handshaking because they might work because it's hard to translate a lot of controls on the controller to hand."

P6
Implication

VR/AR developers may be slow to port their apps for next-gen hardware if they are unsure how to convert controller inputs to hand gestures. Provide clearer guidance and improved tooling within the simulator.

Impact

Three studies, one argument: AI-assisted research can build a strategic knowledge base in a product area with no prior coverage, then validate it in the launch window.

The performance tools study defined the 2026 tooling strategy and surfaced a specific AI integration opportunity: embedding AI-driven suggestions directly into performance tools to surface root causes and actionable guidance, not just flag issues. Developers, particularly new and generalist ones, consistently hit a wall where a tool would tell them something was wrong but not what to do about it. The recommendation was to close that gap with AI. That finding was funded for the 2026 roadmap. The broader fix — standardizing terminology across the suite and rebuilding the onboarding layer — addresses a structural problem that no individual tool fix can solve.

The simulation study established MR as Meta’s strategic competitive bet and defined the H1 2026 research agenda for discoverability and adoption. The recommendation was to double down on MR and multiplayer rather than chasing Apple’s polish or Android’s cross-platform flexibility. The XID ambivalence finding added a dimension the team hadn’t framed explicitly: adoption is partly a motivation problem, not just a UX problem, and the tools need to visibly reduce risk for developers who aren’t yet confident the platform is worth the investment.

The usability study moved 13 features and 3 bug fixes into the post-launch sprint and surfaced hand simulation as the top feature request, named independently by 6 of 7 participants, before it could become a porting bottleneck for the upcoming hardware generation. The recommendation was immediate: resolve the version confusion between v81 and v83, consolidate the installation path, and add hand simulation capabilities before the next hardware generation ships. The H1 2026 hand simulation roadmap had been waiting on a research signal that didn’t exist. It does now.

The methodology is the other outcome. NotebookLM, Vocal, and AI-assisted synthesis are a reusable reference model for building research coverage fast in a domain with no prior dedicated coverage, applicable to any new product area the team enters next.

Google · Jun 2021–May 2025

What generative AI does to the humans in the loop

When AI takes over the routine, something has to happen to human judgment. Four years studying that question inside a live service context, where the AI, the agent, and the customer form a three-party conversation with its own failure modes, its own power asymmetries, and its own emotional register.

Across nearly four years at Google, I led research at the points where generative AI was reshaping how Google’s products meet their users: from customers reaching support, to ML engineers building the models that ship inside consumer apps, to the organizational infrastructure that makes research scale.

Each program answered a distinct question, but they share a through-line: how does generative AI change the role of the human in the loop, and what does the experience look like at each stage?

Projects
Chapter 01 of 01 — GenAI CX Support Solution

When customers reach support, they want short, precise resolution: not a robot.

Two years and 25 studies. A published patent on a human-in-the-loop system. As Google moved support from manual to AI-assisted to AI-supervised, the research kept asking: does the customer still feel heard? A cross-channel synthesis study answered the question the lifecycle research kept raising: the problem wasn’t in any single channel. It was structural.

Summary

When voice, chat, and email agents were studied in parallel (three separate teams, no visibility into each other’s work), they surfaced the same anxiety independently: the AI didn’t know their customer. That convergence finding reframed the platform problem. It wasn’t a channel-specific implementation issue. It was structural. The program that produced this finding ran for two years across 25 studies: prototype testing, ethnographic field research in Manila, NASA-TLX cognitive load assessments, chat transcript analysis, and a cross-channel synthesis study tracking how the customer experience held up as support shifted from manual to assistive to supervised AI.

Findings shaped a human-in-the-loop system published as a patent (US20250209307A1) that automated a significant share of agent messages at scale. The convergence finding reframed the platform design brief: from channel-by-channel AI feature additions to a unified, customer-first architecture.

Key takeaway

The human-in-the-loop problem is not an adoption problem. It is a cognitive architecture problem. Content relevance is the gating variable, cognitive load is the constraint, and whether a human trusts AI depends on whether the right suggestion arrives at the moment they have bandwidth to evaluate it. Design for both or the AI gets ignored.

Problem space

When customers contact Google support, they’ve usually already failed at self-help. They want short, precise resolution: not a robot, not a script. The bar they set is high and specific: “I will find the solution from beginning to end. And if it’s not something I can help you with, I will jump through the hoops and I will explain it to someone else.” That’s what a good support experience feels like from the customer side. A committed human who owns the problem. As Google moved to scale support with generative AI, the design question wasn’t just whether the AI worked technically. It was whether the customer on the other end of the chat would still feel that.

Google’s support infrastructure was organized by channel. Voice had its own team, its own tooling, its own research agenda. So did chat, and email. A cross-channel synthesis study ran alongside the lifecycle research to ask whether that structure was serving the customer. What it surfaced was a convergence finding. Voice, chat, and email agents, working in separate teams with no visibility into each other’s work, surfaced the same anxiety independently: the AI didn’t know their customer. Different channels, different features, the same structural problem. The question the research kept raising was whether the infrastructure was built for that customer, or for the org chart.

Business opportunity

Google was shifting its 1:1 support model from fully manual to AI-assisted to AI-supervised, with the goal of scaling support capacity without degrading customer satisfaction. The research program was designed to track whether that shift was working, and for whom: agents, customers, and the organization.

The cost of getting it wrong was threefold. For customers: a support experience that felt less human the more automated it became, eroding the trust that converts a frustrated user into a retained one. For agents: a tool that added cognitive load rather than reducing it, creating rational resistance to adoption and undermining the productivity gains the system was built to deliver. For the organization: shipping an AI layer that didn’t know the customer, then building the next generation of the platform on top of that same structural gap.

Figure 01 · The lifecycle the research investigated
As Google shifted from manual to AI-assisted to AI-supervised support, the research followed the human role at each stage.
Current state · Manual
Agents solve every customer query without help.
Human role · Full

Interim state · Assistive
Agents use AI features to solve more customer queries at once.
Human role · Partial

Future state · Supervised
Agents supervise AI as it solves customer queries.
Human role · Oversight
Research question · What is the role of the human at each stage, and how does the customer experience hold up?

Methods

A mixed-methods program of 25 studies over two years, spanning the full product lifecycle from pre-Beta through launch, alongside a parallel cross-channel synthesis study with 17 frontline agents and 2 workforce leads across voice, chat, and email. Primary methods included prototype testing, in-depth interviews, shadowing, focus groups, surveys, NASA-TLX cognitive load assessments, chat transcript analysis, and ethnographic field research with frontline agents in Manila. The 25 studies weren’t a single research push: each one was sequenced to where the product was, what the team needed to know next, and what could move the agent experience forward.

Mixed Methods · Ethnography · Longitudinal · International · Cross-channel
Research timeline · 25 studies, Q4 2021 – Q4 2023
Pre-Beta through full launch and cross-channel synthesis.
Q4 ’21 · Q1 ’22 · Q2 ’22 · Q3 ’22 · Q4 ’22 · Q1 ’23 · Q2 ’23 · Q3 ’23 · Q4 ’23
Pre-Beta studies · Prototype testing, IDIs, transcript analysis
Beta pilot · Shadowing, auto-send evaluation
Vision sprint · Sprint + transcript text analysis
Multi-chat & RITE · Cognitive load, GenAI Rewriter
Cross-channel study · Cross-channel synthesis
Manila ethnography · Field study, 4 sites, 60 agents
Post-launch surveys · Focus groups, shadowing
Intercept CSAT · Continuous survey monitoring

Key insights

Theme 01 · What customers want

Customers who reach 1:1 support have already failed at self-help. By the time they open a chat, they’ve read an article, tried a fix, and run out of options. They arrive resolution-focused, not relationship-focused. They want short, precise conversations (“no lollygagging”), an agent who has already read what they wrote in the Contact Us Form, and a clear signal of where they are in the process. What they do not want: to re-explain themselves, to wait without knowing why, or to receive a scripted greeting that signals the agent hasn’t read anything.

The research surfaced five consistent pain points in the current 1:1 support experience.

  1. Scripted welcoming. Responses that sounded robotic regardless of the customer’s issue or emotional state.
  2. Lack of transparency. No signal of where the customer was in the resolution process.
  3. Emotional volatility. Peaking during troubleshooting, where customers were already frustrated and the process was least predictable.
  4. Generic interactions. Experiences that felt built for everyone and therefore for no one.

    “It very rarely feels personal. You start to doubt the sincerity of the help being provided.”

    Customer participant
  5. No closure. Customers who left a chat without knowing whether their issue had actually been resolved.

    “I don’t know if, at that time, it was ever resolved.”

    Customer participant

The research also surfaced what a good experience looked like from the customer side. Not a faster resolution, necessarily. A trustworthy one. Customers wanted to feel that the person on the other end understood their specific situation, not a category of issue. That the agent had read what they’d written. That the AI, if present, was directing the conversation toward a goal rather than stalling it. The vision the research shaped named this directly: trustworthy support chat experiences, human-first intelligence, and end-to-end transparency.

Figure 02 · Customer expectations and sentiment across 6 stages of 1:1 support
From the moment an issue surfaces through closure: a six-stage map of what customers expect, where sentiment dips, and where pain points cluster.
01 Issue awareness & realization (pre 1:1)
Customer expects · Self-resolve simple issues; find help fast.
Sentiment · Frustrated
Pain points · Issue occurs. Hard to find 1:1 support; can’t self-resolve.

02 Getting connected to a live agent
Customer expects · Fast, easy connection to a human.
Sentiment · Relieved
Pain points · GSE can’t resolve. Robotic greeting from agent.

03 Helping agents understand issue
Customer expects · Agent reads the answers in CUF; doesn’t re-ask.
Sentiment · Repeating
Pain points · Process takes long. Repeating issue to multiple agents.

04 Troubleshooting
Customer expects · Clear, fast guidance. Communication on what’s happening.
Sentiment · Unstable
Pain points · Dead-air during waiting. No clear progress signal.

05 Verifying resolution
Customer expects · Quick resolution. Knows when issue is resolved.
Sentiment · Resolving
Pain points · Chat ends too quickly. Restarts and interruptions.

06 Closure
Customer expects · Issue resolved. Knows how to prevent recurrence.
Sentiment · Cared for
Pain points · No closure in chat. Don’t know how to prevent recurrence.
Source · Synthesis of in-depth interviews, shadowing, focus groups, and chat transcript analysis · n = 25 studies

Theme 02 · Designing for the three-party conversation

AI suggestion relevance is everything. Agents found AI suggestions frequently out of context with the customer’s issue. Content relevance, shaped by job role, linguistic tone, conversation flow, and customer sentiment, became the key determinant of adoption.

The pilot surfaced something the product team hadn’t fully anticipated: introducing an LLM into a 1:1 support chat didn’t just change how agents worked. It created a three-party conversation with its own failure modes. Customers arrived with a completed Contact Us Form and assumed the agent had read it. The LLM, responding in real time, often hadn’t. The result was over-confirmation, phrase repetition, and dead-air that left customers feeling unheard before troubleshooting had even started. Agents were aware of this: “Customers ask me, ‘are you a robot or a human?’” The question wasn’t rhetorical. Customers genuinely couldn’t tell. And the inability to tell was eroding their confidence in the conversation.

From the customer side, the experience was legible and frustrating: “You can tell they must have a big book of like a ton of different current answers depending on what you say. I found that I had to repeat the question to try to get like a more precise answer.” That participant had identified the core problem precisely. The LLM was pattern-matching to categories, not responding to their specific situation. The conversation felt scripted because, behaviorally, it was.

I ran a focused study on these breakdown patterns: analyzing pilot transcripts across conversation types and mapping where the LLM’s dialog management fell short. The taxonomy that emerged framed every chat by how much of the issue was already known at handoff: Issue Clear, Needs More Information, or Too Much Information. Each category required a different LLM opening move. Without categorization, the LLM defaulted to the same greeting regardless: asking customers to re-explain what they’d already written.

Figure 03 · Conversation types: how the LLM should open based on CUF signal
Every 1:1 chat starts with a Contact Us Form. What’s in it (too much, too little, or just enough) determines the LLM’s opening move.
CUF received at chat handoff · How much is known about the issue?

Issue clear (sparse / clear) · CUF contains enough to begin
Customer described issue clearly. Agent/LLM can begin without probing.
LLM opening move · Verify → confirm → troubleshoot.

Needs more info (missing / contradictory) · CUF sparse or contradictory
Customer rushed CUF. Symptom category mismatches description.
LLM opening move · Ask probing questions first.

Too much info (overwhelming) · CUF over-detailed, issue buried
Customer over-explained. Real issue requires parsing and disambiguation.
LLM opening move · Parse → reflect back → confirm.

Issue understood → troubleshoot → resolve
Source · Conversation design study, H2 2022 · Transcript analysis across conversation types
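The taxonomy above lends itself to a simple lookup. What follows is a minimal Python sketch, not the production logic. The three categories and their opening moves come from Figure 03. The names `CufSignal`, `OPENING_MOVES`, and `opening_move` are hypothetical, and the sketch assumes a category label is already available at handoff (how a chat gets categorized is out of scope).

```python
from enum import Enum


class CufSignal(Enum):
    """How much of the issue is already known from the Contact Us Form (CUF)."""
    ISSUE_CLEAR = "issue clear"
    NEEDS_MORE_INFO = "needs more info"
    TOO_MUCH_INFO = "too much info"


# Opening moves as listed in Figure 03. The dict structure is illustrative only.
OPENING_MOVES = {
    CufSignal.ISSUE_CLEAR: ["verify", "confirm", "troubleshoot"],
    CufSignal.NEEDS_MORE_INFO: ["ask probing questions"],
    CufSignal.TOO_MUCH_INFO: ["parse", "reflect back", "confirm"],
}


def opening_move(signal: CufSignal) -> list[str]:
    """Return the ordered opening steps for a chat, given its CUF category.

    Hypothetical helper: it only looks up the move, it does not classify.
    """
    return list(OPENING_MOVES[signal])


if __name__ == "__main__":
    for signal in CufSignal:
        print(f"{signal.value}: {' -> '.join(opening_move(signal))}")
```

The point of the sketch is the shape of the decision: the opening is chosen from what the customer already wrote, so no category falls back to a generic greeting.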

The dialog management failures were systematic across five dimensions:

  1. Over-confirmation. The LLM confirmed customer statements with identical phrasing across multiple exchanges, making customers feel they weren’t being heard. The transcript analysis found “I see,” “I’m sorry,” and “Thanks for the information” repeated three or more times in a single conversation.

    “You can tell they must have a big book of like a ton of different current answers depending on what you say. I found that I had to repeat the question to try to get like a more precise answer.”

    Customer participant
  2. Reactive turn-taking. The LLM waited for customer input rather than proactively guiding the conversation, creating awkward silences and dead-air.
  3. Missing conversation markers. Customers had no sense of where they were in the resolution process, no “halfway there” or “one more step.”
  4. No emotion detection. The LLM applied the same tone to a customer locked out of their account and a customer whose toddler had made an accidental purchase. One participant described being asked to replicate a purchase to verify an error: “We want you to try to, like, replicate this error. And I’m like, wow, I’m not going to try to buy like more additional movie tickets, so that was kind of frustrating.” The LLM had no model of what that request cost the customer emotionally or practically.
  5. Mismatched conversation style. Some customers front-load all their information; others provide it incrementally. The LLM treated both the same.

The downstream effect: customers disengaging, expressing frustration, disconnecting.

Figure 04 · Conversation flow mapping: five types, five behavioral specifications
Each conversation type required its own behavioral specification for the LLM: how to open, what to offer, when to escalate, and how to close. This is what it looked like to map that at scale.
Resolution offered · LLM offers a direct resolution before agent is needed
LLM opening · Greet + present resolution option with confirmation prompt
Agent role · Monitor, intervene if customer rejects or escalates
Escalation trigger · Customer rejects resolution or signals frustration
Handoff type · Warm handoff to agent with context summary
Close condition · Customer confirms resolution accepted

Consult enabled · Agent can bring in a specialist for consultation within the conversation
LLM opening · Greet + triage issue type to identify whether specialist consultation is needed
Agent role · Initiate consult request; brief specialist on context before joining
Escalation trigger · Specialist unavailable or issue exceeds consult scope
Handoff type · Specialist joins conversation; LLM steps back
Close condition · Specialist confirms resolution; customer satisfied

Transfer to email · Issue requires async follow-up; conversation moves to email
LLM opening · Greet + explain why email is appropriate for this issue type
Agent role · Confirm transfer, set expectation for response time
Escalation trigger · Customer disputes transfer or requests live resolution
Handoff type · Channel switch with case summary passed to email queue
Close condition · Customer accepts email transfer

Transfer to chat · Issue requires specialist; customer transferred to new chat
LLM opening · Greet + identify issue as requiring specialist routing
Agent role · Initiate transfer, brief receiving agent on context
Escalation trigger · No specialist available; customer wait time too high
Handoff type · Warm transfer with full conversation context
Close condition · Receiving agent confirms handoff complete

End chat · Issue resolved or customer disengages; conversation closes
LLM opening · Summarise resolution, invite follow-up if needed
Agent role · Confirm close, log outcome for QA
Escalation trigger · Customer re-engages with new issue
Handoff type · New case opened if re-engagement detected
Close condition · Session closed; CSAT survey triggered
Source · Conversation flow mapping study · Google CX · 2022 · Connect Genie / Google One

The conversation type determined the opening move. But correctly opening a conversation is only the first behavioral decision. The service design blueprint below maps what the full interaction should look like across all six stages: what the LLM can contribute at each step, where the line of interaction runs, and where human judgment is non-negotiable. The two dividing lines (interaction and visibility) are the structural frame for understanding where the three-party conversation holds together and where it breaks down.

Figure 05 · Service design blueprint for the reimagined 1:1 support interaction
The synthesis artifact from the conversation design study, mapping customer tasks, agent tasks, LLM abilities, signals, and metrics across all six stages of support.
Stages · 01 Issue awareness & realization (pre 1:1) · 02 Getting connected to a live agent · 03 Helping agents understand issue · 04 Troubleshooting · 05 Verifying resolution · 06 Closure

Customer expects
  Issue resolution. No self-resolution path works.
  Self-help can’t resolve issue.
  Transfers / long time / high effort.
  Dead air / no visibility.
  No reso / end chat abruptly.

Customer tasks
  Get content of HC URL · Find 1:1 support
  Get connected
  Prep info / screenshot · Answer probing qs
  Follow reso steps · Wait for agent reply
  Check fixes · Wait for next step
  Understand next steps

Line of interaction

Agent tasks
  Skip greeter · Hold 1–2 mins · Provide proactive support
  Ask for new info / screenshot · Branding, friendly tone
  Provide reso steps · Show empathy · Proactive reso in product
  Transfer in context · Set up connection to next steps
  No dead ends · Follow-up with educational resources · Gather CSAT · Lay out next steps

Line of visibility

LLM abilities
  Personalized HC · Custom labels in HC
  Agent status · Auto-generated issue description
  Agent status · Auto follow-up, check-in · Image transcribe · Shared glossary
  Suggestive reso steps · Agent status · Image transcribe
  Validate fixed / reso · “Do you need more help?” · Auto-escalate
  Issue summary in chat · Train agents for technical issues

Metrics & signals

LLM signals
  HC article info last page · HC article viewed
  Case ID · Issue summary · Additional info
  Answer effort
  ICS tools · Agent knowledge base · Nalanda, Atlas

Metrics
  Clicks of symptom chip, HC · Time in Self-Help · Assistive qs & answers · User account status
  No dead air · Agent SLA
  Time · Customer effort
  Customer sentiment
  CLTV · Customer sentiment
Source · Conversation design study, H2 2022 · Synthesis of Auto-send pilot transcript analysis, agent surveys, and focus groups

Theme 03 · Go where the agents were

The pilot data told us what agents were doing. It didn’t tell us why, or what the floor actually looked like. In Q4 2023, I co-led an ethnographic field study in Manila: visiting four vendor sites across four product areas, shadowing 60 agents across shifts, running roundtable discussions with 30 agents, and facilitating four co-design workshops. The decision to go in person was methodological: you can’t survey your way to an accurate mental model of a concurrent-chat agent at peak volume. You have to be in the room.

What the room revealed was a set of systemic blockers that remote research had named but not fully explained. The trust cliff, for instance, had shown up in pilot surveys as a content relevance problem. In person it looked different: agents handling 7 to 20 chats per shift encountered irrelevant AI suggestions early in the day, cancelled out, and never returned, not because they decided against the feature but because, behaviorally, they had adapted to cancelling. By the next chat the cancellation reflex was already automatic. Even a run of perfect suggestions afterward would be ignored. The window for building trust wasn’t the pilot. It was the first two or three suggestions of the first chat. That’s a design constraint, not a training problem.

The tool fragmentation finding surfaced a workaround culture that was invisible in the data. Agents had built personal workarounds for everything: personal help-center documents when KB articles ran out, SMEs as human search engines for emerging issues, vendor-hired language experts to double-check real-time translation quality because the automated service wasn’t trusted. Each workaround was a signal that the official tool ecosystem had failed at that point. The co-design workshops made this productive. Agents sketched what they actually needed: KB articles surfaced at chat start, external tools embedded in the case tool, color-coded SLA timers in the active panel, automated post-chat wrap-up. They were designing the product they wished existed.

The synthesis was a finding I wrote up as a hot take: UI quality and LLM performance are coupled. A new interface fails if the LLM hasn’t earned trust, because agents don’t engage with UI they associate with a tool that let them down. And LLM value is invisible to agents who haven’t adopted the UI, because they never see the suggestions. The two problems look separate on a roadmap. In practice they’re the same adoption problem.

The training loop was breaking upstream

Another systemic blocker the pilot data couldn’t see: the training loop itself. New features moved through a chain of handoffs before reaching the frontline agent. Product team to vendor manager. Vendor manager to team lead. Team lead to agent during a shift huddle. At each handoff, context was stripped. What started as a value proposition arrived as a mechanic. Agents could describe what a feature did. They couldn’t describe why it had been built or how it was supposed to change their work.

Dedicated training time varied by vendor. Some agents completed trainings during low-volume periods between chats, using materials that sometimes included outdated screenshots. When questions surfaced about a new feature, team leads weren’t always sure who to direct them to. The chain that should have carried information downward had quietly inverted: it carried uncertainty upward, which then circled back to agents as silence. Training became, as one participant described it, an insulated space of guessing.

This mattered more for AI features than for UI changes. AI features rely on agents understanding the judgment call, not just the mechanic: when to accept a suggestion, when to edit it, when to ignore it entirely. A training chain that drops context can teach the mechanic. It can’t teach the judgment. The result was agents operating new AI features without a model for when they were supposed to help, which fed directly into the trust cliff and the cancellation reflex. The fix wasn’t a better training deck. The fix was shortening the chain.

Fear-based mental models

The QA finding explained why the trust cliff was structural rather than behavioral. Agents operated under a pass/fail accountability system that ran from a central Google quality team down through vendor managers, auditors, and team leads to frontline agents. Markdowns were binary, applied to the agent even when the AI caused the error. The AI sat entirely outside this chain: no markdowns, no accountability, no consequences. The result was a rational adoption barrier. Agents weren’t failing to understand the technology. They were calculating risk correctly. Introducing a feature that could trigger a markdown, with no mechanism for the AI to share the downside, was structurally incompatible with adoption unless QA parameters were redesigned alongside the product.

Figure 06 · QA accountability chain and the AI’s position outside it
The AI makes suggestions. The agent bears the consequences. The accountability chain that governs agent behavior has no equivalent for the AI.
Layers · Governance layer · QA execution
Central QA team · sets quality guidelines
Vendor manager · cascades to frontline
Auditors · audits vendor QA
Team lead · policy to agents
Vendor QA · first case audit
Frontline agent · receives markdowns
LLM auto-send · makes suggestions, no markdowns
Accountability flows down
Source · Manila ethnography, Q4 2023 · Multi-site field study, QA process analysis

This structural finding reframed the design challenge. The fear-based mental model wasn’t a training problem or a communication problem. It was a labor problem. The accountability system that governs how agents are evaluated hadn’t been updated to account for a new actor in the conversation. Until it was, agents had every rational reason to cancel AI suggestions rather than risk a markdown on a feature they didn’t control. The figure below maps how that fear manifested across every stage of a case.

Figure 07 · Agent fear-based mental model across the case lifecycle
QA anxiety doesn’t appear at one stage: it contaminates every step. Cross-channel study, Q3 2023, n=17 frontline agents across voice, chat, and email.
Taking the case
Tasks · Greet customer · Read CUF · Check case log
Thoughts · “Is the customer angry already? Do I need to calm them down?”
Emotion · Anxious

Diagnosing
Tasks · Authenticate · Verify issue details · Ask probing questions
Thoughts · “Will they get annoyed if I ask too many questions?”
Emotion · Cautious

Troubleshooting
Tasks · Put customer on hold · Check secondary tools · Run diagnostics
Thoughts · “I’ve had them on hold too long. They won’t like what I’ve found.”
Emotion · Stressed

Resolution
Tasks · Check appeals & policies · Communicate resolution · Explain why
Thoughts · “Due to Google policy, I can’t help further. Will they be angry?”
Emotion · Tense

Closing
Tasks · Talk about next steps · Send follow-up email · Update issue tracking
Thoughts · “Will the customer reopen the case to change the outcome?”
Emotion · Vigilant

Post-case
Tasks · Write case note · Receive QA feedback · Get team lead feedback on case handling
Thoughts · “Will QA pass me? Did I show proper evidence? Am I missing anything? I don’t want to get markdown.”
Emotion · Unresolved

Anxiety baseline: never neutral, never resolved
Source · Cross-channel research study, Q3 2023 · 17 frontline agents, 1:1 sessions with storyboards and Figma prototype usability testing

Cognitive load

The fear-based mental model didn’t disappear once an agent opened a case. It ran alongside every other cognitive demand. And in the multi-chat environment, those demands were already at the limit. The QA anxiety meant agents were simultaneously managing the customer, monitoring the AI, and calculating whether each intervention would hold up to an audit. The interface then added four distinct load pressures on top of that. The combination produced what the multi-chat cognitive load study named “a poverty of attention”: the agent’s capacity to focus on the customer was structurally depleted before the conversation began.

The cognitive load problem in AI-assisted support isn’t that agents are overwhelmed in the abstract. It’s that the UI architecture forces a mental model mismatch at the exact moment agents need full attention on the customer. A parallel research program (tracking agents across single and multi-chat scenarios) made this concrete.

Customers can detect when agents are splitting attention. One participant described waiting on hold for three to five minutes: “It kind of feels like the agent may have been helping someone else. I just want to get my problem resolved as quickly as possible so I can get on with my day.” The agent’s cognitive state is not invisible to the person on the other end. It leaks.

The study identified four distinct cognitive load pressures that compound in multi-chat scenarios. Multitasking creates intrinsic load: agents simultaneously execute on two separate customer journeys. Context switching between chat windows creates extraneous load: the UI transition logic conflicts with agents’ natural attention patterns. Information density creates further intrinsic load even in a larger window: more real estate doesn’t solve the problem if the content hierarchy isn’t right. And underneath all of it, a germane load problem: the agent’s mental model for formulating a response actively mismatches the framework the UI imposes.

That last finding was the most actionable. Agents follow a consistent three-step process when deciding how to respond: understand the customer’s current situation, identify what kind of response is needed, then decide which tool to use. The UI was structured around the tools, not around that decision process. Fixing the cognitive load problem meant redesigning around how agents actually think, not around the features available to them.

Figure 08 · Four cognitive load dimensions in multi-chat AI-assisted support
Each dimension compounds the others. Addressing one without the others produces partial relief at best.
Agent Cognitive Load (ACL)
“How much of the agent’s attention capacity is used during handling 1:1 support cases.”
Multitasking, Multi-chat
#IntrinsicCL
# of tasks. Look into resolution in Cases, or secondary tools while chatting.
# of chats. Concurrent chats.
×
Context switching
#ExtraneousCL
Transition from case A to case B.
Transition from small to large chat window.
+
Mental model mismatch
#GermaneCL
Understandability of icons, notifications, and window transitions.
Schema of today’s workflow and Cases IA.
+
Info density
#IntrinsicCL
Signal-to-noise ratio.
Timing of information surfacing.
Result
High cognitive load in real-time multi-chat scenario creates a poverty of attention: the agent’s capacity to focus on the customer is structurally depleted before the conversation begins.
Source · Multi-chat scenario observation and cognitive load analysis

Impact on customer satisfaction

The cognitive load problem had a direct customer-facing consequence that the pilot data initially obscured. Agent satisfaction scores (ASAT) ran at 95–100% favorable during the pilot phases, which looked like a strong adoption signal. But the pilot design itself was confounding the measurement. Agents felt heard because of continuous feedback loops and rapid iteration, not necessarily because the feature was performing. ASAT was measuring the experience of being in the pilot, not the quality of the tool.

When an agent was managing high cognitive load, splitting attention across multiple chats and monitoring AI suggestions under QA pressure, customers could detect it. One participant described waiting on hold for three to five minutes: “It kind of feels like the agent may have been helping someone else. I just want to get my problem resolved as quickly as possible so I can get on with my day.” The agent’s cognitive state was not invisible to the person on the other end. It leaked into the conversation as delay, as distraction, as the same scripted phrase appearing twice.

The recommendation was to define new metrics suited to the supervised chat context: efficiency (time to resolution, cancellation rate, suggestion adoption rate) and trust (agent-reported confidence with AI, QA markdown rate on AI-assisted cases). Those metrics would separate the experience of being supported from the experience of being served. That reframing shaped how the team instrumented Phase 3 and post-launch measurement.
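As a rough illustration of how those rates might be instrumented, here is a minimal Python sketch. The source names the metrics but does not define them, so every definition below is an assumption. The record shapes (`SuggestionEvent`, `CaseRecord`), field names, and sample values are hypothetical. Agent-reported confidence is a survey measure and is left out.

```python
from dataclasses import dataclass


@dataclass
class SuggestionEvent:
    """One AI suggestion shown to an agent (hypothetical record shape)."""
    accepted: bool   # agent sent the suggestion, edited or not
    cancelled: bool  # agent dismissed the suggestion


@dataclass
class CaseRecord:
    """One closed support case (hypothetical record shape)."""
    ai_assisted: bool
    qa_markdown: bool
    minutes_to_resolution: float


def rate(numerator: int, denominator: int) -> float:
    """Safe ratio: returns 0.0 when there is nothing to measure."""
    return numerator / denominator if denominator else 0.0


def suggestion_metrics(events: list[SuggestionEvent]) -> dict[str, float]:
    """Assumed definitions: share of shown suggestions accepted or cancelled."""
    total = len(events)
    return {
        "adoption_rate": rate(sum(e.accepted for e in events), total),
        "cancellation_rate": rate(sum(e.cancelled for e in events), total),
    }


def case_metrics(cases: list[CaseRecord]) -> dict[str, float]:
    """Assumed definitions: mean time to resolution over all cases, and
    QA markdown rate restricted to AI-assisted cases."""
    assisted = [c for c in cases if c.ai_assisted]
    total_minutes = sum(c.minutes_to_resolution for c in cases)
    return {
        "avg_minutes_to_resolution": total_minutes / len(cases) if cases else 0.0,
        "qa_markdown_rate_ai_assisted": rate(
            sum(c.qa_markdown for c in assisted), len(assisted)
        ),
    }


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    events = [
        SuggestionEvent(accepted=True, cancelled=False),
        SuggestionEvent(accepted=False, cancelled=True),
        SuggestionEvent(accepted=False, cancelled=True),
        SuggestionEvent(accepted=False, cancelled=False),
    ]
    cases = [
        CaseRecord(ai_assisted=True, qa_markdown=True, minutes_to_resolution=10.0),
        CaseRecord(ai_assisted=True, qa_markdown=False, minutes_to_resolution=20.0),
        CaseRecord(ai_assisted=False, qa_markdown=False, minutes_to_resolution=30.0),
    ]
    print(suggestion_metrics(events))
    print(case_metrics(cases))
```

Keeping the markdown rate scoped to AI-assisted cases is a deliberate choice in this sketch: it isolates the accountability risk agents described from their baseline QA performance.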

Theme 04 · The channel data kept showing the same thing

The convergence finding. Voice, chat, and email agents, working in separate teams, surfaced the same anxiety independently: the AI didn’t know their customer. The convergence finding reframed the platform problem from channel-by-channel AI feature additions to a unified, customer-first architecture.

The cross-channel study was designed as a parallel investigation, with separate storyboards for voice, chat, and email and separate feature sets to test. What it produced, unexpectedly, was a convergence finding. The same core anxieties appeared independently across all three channels, in agents who had no visibility into each other’s work.

Voice agents worried that AI-suggested scripts would sound robotic to premium customers they’d built relationships with. Email agents worried that AI-generated drafts would send the same response to a repeat client who had already received it twice. Chat agents worried that past AI failures, irrelevant auto-replies that had angered customers, would keep happening. Different channels, different features, different failure modes. The same underlying concern: this tool doesn’t know my customer.

Customers were arriving at the same conclusion from the other side. The experience felt generic, scripted, built for a category of issue rather than their specific situation. One participant put it plainly: “It very rarely feels personal. You start to doubt the sincerity of the help being provided.” That doubt, once it set in, wasn’t just a satisfaction problem. It was a trust problem. And trust, once lost in a support conversation, is almost impossible to recover within the same session.

The cognitive load data compounded this. Agents across all three channels described the same experience of overwhelm when AI features were added to their existing workflows, with information competing for attention at the exact moments that required the most focus. A voice agent managing a live call couldn’t simultaneously review a transcript, monitor a conversation summary, and evaluate a suggested script. A chat agent under a two-minute SLA couldn’t pause to evaluate whether an AI-generated response was appropriate for this customer at this moment in this conversation.

The convergence was the finding. It meant the problem wasn’t specific to any one channel’s implementation. It was structural. The three-party conversation failure, the fear-based mental model, the cognitive load problem: each had looked like a separate design challenge. The channel data showed they were all symptoms of the same root cause. The infrastructure was built for the org chart. It needed to be built for the customer.

Figure 09 · Generic → Personalized → Individualized CX
Support has moved from generic to personalized. Individualized is the next necessary step, and a fundamentally different design problem.
Past · Generic
No relation to the customer’s segment, account, or history.
Ex. Google Help Center for all advertisers

Current · Personalized
Built for a customer segment (SMB, large customer, agency) but still treating the customer as a category member.
Ex. Ads Help Guide for a defined advertiser tier

Future · Target · Individualized
Built for a customer at the intersection of multiple segments simultaneously, drawing on everything the platform knows about that person across accounts, channels, and history.
Ex. GenZ SMB advertiser in a non-brand-unsafe market, managing 10+ campaigns
Source · Cross-channel research study, Q3 2023 · Synthesis across voice, chat, and email channel findings

Conclusion

Research findings don’t move organizations on their own. They move organizations when the people closest to the problem can pick them up and use them without a researcher in the room. By the end of the program, three findings had proven durable enough to warrant a designed artifact: the trust finding from Manila, the resolution journey from the transcript analysis, and the relevance finding from two years of adoption data. Each became a framework. Not because these were the most dramatic findings, but because they were the ones that kept coming up in rooms where decisions were being made without a researcher present.

The Trust Ladder came out of the Manila ethnography and the cross-channel fear-based mental model study: agents weren’t rejecting AI because they didn’t understand it, they were rejecting it because the accountability structures made any error existential. The ladder gave Change Management a model for sequencing training that addressed the fear, not just the feature. The Resolution Journey came out of the transcript analysis and customer interviews: a map of what customers expect, where sentiment dips, and where pain points cluster, used by cross-functional teams to identify where LLM value ends and human judgment is non-negotiable. The Relevance Matrix came out of the earliest prototype tests and was refined across two years of adoption data: content relevance, shaped by five intersecting factors, was the gating variable for whether any AI suggestion got used at all.

Figure 10 · Three portable frameworks from this research
Each framework was designed to be used independently by PMs, designers, and engineers.
The Trust Ladder · Old → Emerging → Arrived
What it is: A 3-stage model of how agents move from fear-based to growth-based thinking when adopting AI.
What it does: Sequences feature rollout and training design so adoption interventions address the fear, not just the feature.
Used by: PM, Change Mgmt, L&D

The Resolution Journey · 6 stages, issue to closure
What it is: Maps customer tasks, expectations, sentiment, and pain points across the full support journey.
What it does: Identifies where LLM value ends and human judgment is non-negotiable. Used in roadmap prioritization and UI hierarchy decisions.
Used by: UXD, Eng, Product Mktg

The Relevance Matrix · 5 factors, 1 gating variable
What it is: Frames content relevance as a function of job role, linguistic tone, conversation flow, customer issue type, and customer sentiment.
What it does: Specs contextual triggers for AI suggestions and informs information hierarchy in the agent UI.
Used by: Eng, UXD, Data Analytics

Impact

A human-in-the-loop system published as a patent that scaled customer support with generative AI, without losing the human read of the room. A cross-channel synthesis that reframed the platform design problem before the next generation was built.

The HITL system published as a patent (US20250209307A1) automated a significant share of agent messages at scale across four product areas. The patent documents the structural move the research identified: keeping human judgment in the loop when generative AI handles routine response generation, so the system preserves the trust and transparency customers depend on. This isn’t a Google-specific problem. Every large organization deploying AI in customer-facing workflows has to solve for the same behavioral question, which is why a structural, publishable answer mattered more than a point solution.

The convergence finding did something harder to measure: it changed what the platform team thought they were building. Before the cross-channel synthesis, the roadmap was organized around individual channel improvements. After it, the design brief shifted. The Generic-to-Individualized typology gave product and design a shared vocabulary for the direction of travel: not more AI features per channel, but a platform that knows the customer well enough to make every interaction feel built for them specifically. That framing directly shaped the brief for the next generation of the platform, establishing customer-level context, cross-channel continuity, and account-aware AI as the requirements that channel-level optimization alone could never satisfy.

The QA finding and the cognitive load research had a more immediate impact: both fed directly into the design of Phase 3 instrumentation, changing how the team measured agent adoption and what success looked like in a supervised-AI context. Research that names the right metric before a product ships at scale is worth more than research that explains why the numbers are wrong afterward.

More projects
Chapter 01 of 04 — Two Studies, Two Outcomes: One Shipped, One Stopped

Two studies, two outcomes: one shipped at I/O 2024, the other stopped a v2 that wasn’t ready.

Foundational research that shaped the v1 launch, and follow-up research that found the v2 IDE prototype didn’t have product-market fit, redirecting engineering capacity to other I/O launches.

Summary

Visualization is often assumed to help developers understand complex models. This study asked a more specific question: where in the actual model development lifecycle does making something visible add value, and where does it just add noise? Two studies over two years with on-device ML developers. The first mapped the workflow and identified where visualization earns its place. The second tested a Model Explorer 2.0 IDE prototype and found the prototype intuitive only in the narrow scenario the demo was designed for.

The v1 research shaped a successful I/O 2024 launch. The v2 research stopped a launch that would have shipped against unvalidated assumptions, redirecting an estimated quarter of engineering effort to other I/O 2025 launches that did ship.

Key takeaway

Research can function as a decision tool, offering direct input to a specific action the team is about to take. Used that way, telling a team when to ship and telling a team when to stop carry equal weight. Both protect the product and the people using it.

Problem space

ML researchers and engineers need to identify and debug architecture, quality, and performance issues in large models, especially for on-device deployments where conversion and optimization processes can significantly alter a model from its original state. Existing tools couldn’t render large models and had significant usability problems incompatible with modern model architecture.

The two studies were each built around a specific decision point. The first: where does visualization actually earn its place in the on-device ML lifecycle? The second, a year later: does this IDE prototype solve the fragmented-tool problem well enough to ship? Different questions, different methods, the same underlying discipline — research tied to an action, not an open brief.

Methods

Two studies, one year apart, each calibrated to a different product decision. Study 01 was foundational: a literature review of approximately 15 internal research reports followed by eight contextual inquiry interviews with on-device ML developers, conducted against a hard pre-I/O 2024 deadline. The goal was to map the actual model development workflow end-to-end before the team committed engineering resources to a direction. Study 02 was evaluative: six prototype testing sessions to answer whether the Model Explorer IDE was ready to ship, or whether the team was building confidence on a scenario that didn’t reflect how developers actually worked.

For both studies, AI-assisted synthesis compressed the analysis cycle. In a pre-launch context where findings needed to reach the team before decisions were already made, turnaround speed was part of the methodology.

Contextual Inquiry · Literature Review · Prototype Testing · Usability Testing · AI Methods
Research timeline · Two studies, two outcomes · Q4 2023 – Q2 2025
Literature review (~15 internal reports) → Study 01 (contextual inquiry, usability testing, pre-I/O deadline) → Google I/O 2024 (v1 launch, shipped) → Metrics monitoring (post-launch dashboard) → Study 02 (prototype testing) → Google I/O 2025 (v2 decision, no-go)

Study 01 · Mapping the on-device ML lifecycle

Eight contextual inquiry interviews with 1P and 3P on-device ML developers (LLM and non-LLM specialists), conducted ahead of a hard pre-I/O 2024 deadline, preceded by a literature review of approximately 15 internal research reports. The goal was to map the actual model development workflow end-to-end and surface where visualization tools added load-bearing value, not just where they were nice-to-have.

The research established four core ODML stages developers follow to launch models on device: Build, Adapt, Integrate (Optimize), and Release (Deploy). Each stage has distinct tasks, evaluation targets, and tools. Developers frequently revisit earlier stages as they learn more about quality vs. performance trade-offs. An XFN alignment workshop followed the contextual inquiry, with results shared pre-I/O and a metrics dashboard monitored post-launch.

Figure 01 · Where visualization actually matters in the on-device ML lifecycle
Four stages, three where visualization is load-bearing. Developers frequently revisit earlier stages as they learn quality vs. performance trade-offs.
01 · Visualization critical
Build

Tracks architectural changes across model versions during early experimentation, when models are moving fast.

02 · Visualization critical
Adapt

Helps developers onboard to new models and verify quantization placement post-conversion.

03 · Visualization critical
Optimize

Surfaces performance bottlenecks and supports cross-team alignment before launch.

04 · Not load-bearing
Deploy

Production deployment and monitoring: visualization plays a minimal role here.

Source · Contextual inquiry with 8 on-device ML developers · Q1 2024

Key insights from Study 01

Visualization is load-bearing in three of four stages, and the role shifts at each one. In Build, it tracks architectural changes across model versions during early experimentation, when models are moving fast and developers need to compare versions to find best performance. In Adapt, it helps developers onboard to new models after handoff, and post-quantization lets them verify placement and configuration of quantized nodes, understand impacts on data types and operator fusions, and debug potential issues. In Optimize, it surfaces performance bottlenecks, supports model version comparison to assess performance impact, and functions as a visual aid for communicating changes to stakeholders and leadership. In Deploy, visualization plays a minimal role.

Code preference is a maturity signal, not a rejection of visualization. Some developers shift to code-based checks once they've worked with a model for about a year and architectures stabilize. This means visualization tools should optimize for the early-development phase where they earn their keep, not for ongoing power-user workflows where code is faster.

Top requested features cut across multiple stages: side-by-side model graph comparison (currently requires multiple browser tabs and mental snapshots); richer per-node metrics, including FLOP counts, MAC counts, and parameter sizes; tighter Colab integration to move from development to debugging seamlessly; and improved error reporting, with clearer messages and integration with inference validation tools.

Figure 02 · Feature opportunities across the ODML lifecycle
Requested features cut across multiple stages. Most valuable earlier in the lifecycle, where developers move fast and the cost of getting it wrong compounds.
Columns: Feature · Build · Adapt · Optimize · Deploy
Side-by-side model comparison: Compare versions without multiple browser tabs or mental snapshots
Per-node metrics: FLOP counts, MAC counts, parameter sizes per layer and model
Colab integration: Move from development to debugging without context switching
Model editing capabilities: Experiment with graph changes without writing code
Tool integration: XProf and inference validation tools for debugging and profiling
Annotations and image examples: Add context to model graphs for communication and alignment
Legend: Relevant to stage · Not applicable
Source · Contextual inquiry with 8 on-device ML developers · Q1 2024
Study 01 outcome

Model Explorer shipped at Google I/O 2024. 934+ GitHub stars at launch, 61.8k unique GitHub visitors, coverage by VentureBeat and Hacker News.

Google Research Blog → Model Explorer

Study 02 · The IDE prototype that didn’t ship

A year later, the team built a prototype Model Explorer IDE: a unified environment combining visualization, quantization, numerical analysis, and layer modification in one tool. The hypothesis was that an IDE would solve the fragmented-tool problem the first study had surfaced. Six participants experienced in quantization and performance debugging were shown a demo video of the prototype, which walked through an engineer quantizing a small image segmentation model: opening files, running quantization without and then with "numerical diff" enabled, and using the exclude-selection feature to address problematic nodes.

The ask was simple: how would this fit into your existing workflows? The answer was more complicated than the demo suggested.

Key insights from Study 02

Debugging is iterative and hard to pin down. Quantization and performance debugging requires many iterations, driven by two factors: model size and developer familiarity with the issue. Engineers often struggle to pinpoint the layer or node causing problems, relying on past experience and manual fallbacks like printf statements when error messages are hard to interpret. Most participants took a hybrid approach: debugging locally for instant feedback loops while accessing cloud resources for compute-intensive tasks.

There is no canonical debugging workflow across Google. Participants used different tool combinations based on comfort and product use case: visualization (Model Explorer, Netron), debuggers (LLDB, printf, Simpleperf), frameworks (TensorFlow, TF Lite, DarwiNN AutoQuantizer), and code editors (Cider, Colab, Xcode). The fragmentation the IDE was meant to solve also made developers hesitant to add yet another tool to their stack.

Figure · Study 02, Tool fragmentation
No canonical debugging workflow. Each developer assembled their own stack.
The question every session was trying to answer: Which layer or node is causing the problem?
Visualization: Model Explorer, Netron
Debuggers: LLDB, printf, Simpleperf
Frameworks: TensorFlow, TF Lite, DarwiNN AQ
Code editors: Cider, Colab, Xcode
Every session iterated across tools. The tradeoff was constant: local (instant feedback loops) vs. cloud (compute-intensive tasks).

The prototype was intuitive in the demo’s narrow scenario. Participants found the IDE easy to use, saw potential time savings, and specifically praised the exclude-selection feature. But the demo showcased a small image segmentation model in a clean quantization workflow. The scenarios it left out (complex layer-specific quantization schemes, models with thousands of nodes, non-image inputs like audio or text generation) were exactly the scenarios participants most needed it to handle.

Three feature gaps that would block real-world adoption

01
Framework support beyond TensorFlow
Participants needed JAX and GMAX support. The demo’s TensorFlow-only scope didn’t reflect the framework diversity across Google’s ML teams.
02
Multiple data inputs for post-training quantization
A single representative image isn’t sufficient for real quantization workflows. Participants needed folder paths to datasets, not a single image.
03
Model surgery and editing capabilities
Participants needed to select and remove unnecessary operators from the model graph, a core part of real debugging that the prototype didn’t support.

A UI clarity problem also surfaced: half the participants confused the color-coded numerical diff signals for quantization status markers, suggesting the legend and color scheme needed rethinking before the prototype could be validated in realistic conditions.

Product-market fit

Part of a UX researcher’s job is to tell the team not just what users want, but whether the product they’ve built has earned the right to ship. The hypothesis for Study 02 was straightforward: developers use fragmented tools, an IDE would unify them, therefore an IDE has product-market fit. The data gave a more complicated answer.

The prototype tested well in the hero-story scenario: a small image segmentation model, a clean quantization workflow, tasks the demo was designed to showcase. But every participant who worked on real models at scale immediately raised the scenarios the demo didn’t cover: large models with thousands of nodes, complex layer-specific quantization schemes, audio and text inputs, the full end-to-end workflow. The prototype didn’t have product-market fit as scoped because it had been validated against a simplified version of the problem, not the real one.

“The overall experience looks intuitive and extremely useful. But I worry with a segmentation model, they’re normally quite small. How would this work if the model has 10,000 nodes? So that could be harder.”

P2 · Core ML Frameworks + ODML

Impact

Most portfolio research shows what helped something ship. This case study also shows what earned a team permission to stop.

“Your research uncovered a key misunderstanding regarding how users want to develop and debug model quality and performance. The clarity you provided allowed us to quickly course-correct our technical roadmap for Model Explorer. This prevented a significant detour, saving engineers and our PM an estimated quarter of work and ensuring we’re now aligned with actual user needs.”

Engineering Lead, Google
Study 01, shipped

Study 01 research shaped the v1 lifecycle map, directed engineering to the three stages where visualization is load-bearing, and informed the UI changes that shipped at Google I/O 2024. 934+ GitHub stars at launch. 61.8k unique GitHub visitors. VentureBeat and Hacker News coverage.

Gen AI on mobile and web · Google I/O 2024
Visualize models with Model Explorer
Google Research Blog → Model Explorer
Study 02, stopped

Study 02 found that the IDE prototype had been validated against a hero-story scenario rather than the real workflows ML engineers actually run. No I/O 2025 launch. The team redirected an estimated quarter of engineering effort to other launches that did ship, because the research gave them something more valuable than a green light: a clear reason to wait.

More projects
Chapter 03 of 04 — Three Questions, Three Instruments: Measuring, Trending, Testing

A CSAT score on its own isn’t research. The schema that turns open-ended feedback into a pattern is.

Took over an inherited longitudinal CSAT program and ran it for 1.5 years, building a 25-category coding schema for open-ended responses, then running a targeted qualitative study when the survey signal raised a question quant alone couldn’t answer.

Summary

Three questions. Three instruments, each chosen because the previous one couldn’t answer what came next. A CSAT survey to measure satisfaction at scale. A 25-category inductive coding schema to make open-ended responses tractable and trended. And an API technical document (not a prototype, not a mockup) to test whether developers would actually adopt a proposed integration before anyone spent a quarter building it.

The schema turned 1.5 years of quarterly data into a longitudinal signal that informed Servomatic’s 2024 roadmap. The one-pager study found no product-market fit for the proposed dev-to-prod integration and saved two quarters of engineering work.

Key takeaway

The stimulus should match the question, not the format convention. A quantitative survey, a longitudinal coding schema, and a technical one-pager are not three different research approaches. They are three tools calibrated to three different questions in the same program. Knowing which instrument fits which question is the skill.

Problem space

Servomatic is the Google internal tool ML developers use to serve models in production. The quarterly CSAT survey had been running since 2022, started by a previous researcher. When I inherited the program in 2024, it produced a single number and a stack of free-text responses no one had time to read systematically. The number was useful as a temperature check; the responses were where the actual signal lived, but only if you could make them tractable.

The question wasn’t whether users were satisfied. They mostly were. The questions were what was driving that satisfaction, whether the drivers were stable, and whether there were structural friction points the product hadn’t addressed. And lurking in the open responses was a further question the survey couldn’t answer at all: whether a proposed change to how developers moved between dev and production environments would actually be adopted. Each question needed a different instrument.

Methods

The survey was already running. What it lacked was a way to read the open-ended responses at scale. I designed the coding schema inductively, working through responses and clustering them until the categories stabilized, rather than imposing categories from the outside. That produced 25 categories rolling up into four top-level areas: Documentation, Ease of Use, Support, and Performance. The decision to hold the schema fixed once it stabilized was deliberate: you can’t trend data across categories that shifted mid-program. Every quarter’s responses from that point could be coded and compared against prior quarters, turning a one-time report into a longitudinal signal.
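One way to picture the roll-up described above, as a minimal sketch rather than the tooling the program actually used: each response carries one or more category tags, each category maps to exactly one top-level area, and a tag outside the fixed schema is rejected instead of quietly becoming a new category. The mapping below is a small slice of the schema, and the quarter labels, function name, and sample responses are hypothetical.

```python
from collections import Counter

# Partial category -> area mapping. Category and area names come from the
# schema figure; the real schema has more categories than shown here.
CATEGORY_TO_AREA = {
    "Usage examples": "Documentation",
    "Error message clarity": "Documentation",
    "Deployment": "Ease of Use",
    "Configuration": "Ease of Use",
    "Support responsiveness": "Support",
    "Outage communication": "Performance",
}


def roll_up(tagged_responses):
    """Count responses per quarter and area.

    A response counts once for each area it touches, even if it carries
    several categories from the same area.
    """
    totals = {}
    for quarter, categories in tagged_responses:
        unknown = [c for c in categories if c not in CATEGORY_TO_AREA]
        if unknown:
            # The schema is held fixed so quarters stay comparable:
            # an unrecognized tag is treated as a coding error.
            raise ValueError(f"not in schema: {unknown}")
        areas = {CATEGORY_TO_AREA[c] for c in categories}
        totals.setdefault(quarter, Counter()).update(areas)
    return totals


# Made-up tagged responses, for illustration only.
sample = [
    ("2024-Q3", ["Usage examples", "Configuration"]),
    ("2024-Q3", ["Usage examples", "Error message clarity"]),
    ("2024-Q4", ["Outage communication"]),
]
totals = roll_up(sample)
```

Counting once per area rather than once per tag is a design choice in this sketch; it keeps a single long response from inflating an area's total.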

Each quarter, invites were sent automatically via a Python script to active Servomatic users through internal chat, consistently producing response rates around 47%, high by survey standards, and a function of the internal distribution channel rather than anything special about the instrument. What made the responses useful wasn’t the volume. It was the schema.

When the longitudinal data surfaced a recurring undercurrent about dev-to-prod readiness that the survey couldn’t answer, I ran eight moderated qualitative interviews. The stimulus for the concept test was a technical one-pager describing the proposed API integration. Not a prototype. The fidelity matched the question: we weren’t testing whether an interface worked, we were testing whether developers would adopt the concept at all. A higher-fidelity prototype would have answered the wrong question.

Longitudinal · Survey · Inductive Coding · Qualitative Follow-up · Concept Testing
Figure 01 · 25-category inductive coding schema
25 categories, four top-level areas. The schema stabilized after three rounds of coding and held across 1.5 years of quarterly data.
Area · Documentation
Documentation quality: Out-of-date, incomplete, or missing docs for common workflows and edge cases
Onboarding: First-time setup difficulty, lack of getting-started guidance
Usage examples: Requests for concrete, copy-and-adapt examples for common tasks
Error message clarity: Uninformative or misleading error messages that didn’t help users resolve issues
Documentation for edge cases: Multi-label classification, turn-up/turn-down, validation errors: cases docs didn’t cover
Access / permissions: Unclear or cumbersome permission structures for accessing tools and resources
Area · Ease of Use
UI: Interface friction, navigation issues, visual clarity
Deployment: Difficulty deploying models to production, validating configurations
Configuration: Complex or opaque configuration options, resource allocation difficulty
Iteration speed: Slow feedback loops between making changes and seeing results
Model versioning: Difficulty tracking, comparing, or rolling back model versions
Integration with upstream tools: Friction connecting Servomatic with XManager, Servolab, and other pipeline tools
Area · Support
Support responsiveness: Slow or inconsistent response times for troubleshooting requests
On-call coverage: Limited support outside US-unfriendly timezones, gaps in incident coverage
Quota: Resource quota limits that blocked or slowed development and production workflows
Customization: Insufficient flexibility for custom binaries and non-standard use cases
Cross-team handoff: Information lost at the boundary between development and production teams
Area · Performance
Reliability: Outages, unexpected failures, inconsistent behavior in production
API stability: Breaking changes, deprecated endpoints without adequate notice
Outage communication: Lack of proactive communication before, during, and after outages
Tooling currency: Outdated dependencies, lag between platform changes and tool updates
Validation: Difficulty validating model behavior before and after deployment
Resource allocation: Difficulty understanding, predicting, or optimizing compute and memory usage
Source · Inductive coding of open-ended CSAT responses · 2024–2025
Figure 02 · The schema in practice
Each open-response row was tagged across one or more of the 25 categories, then rolled into four quarterly totals. Multi-category tagging captured responses that cut across themes.
Response excerpt Categories Config Docs Model mgmt Monitor
"Documentation is spread across multiple internal sites — hard to find the right one for edge cases." documentation, support 0 1 0 0
"Clearer error messages when validating models would save hours of debugging." model management 0 0 1 0
"Configuration docs don't cover what I need, and resource limits aren't clear until deployment fails." documentation, configuration, resource management 1 1 0 0
"Would be helpful to see model health dashboards in one place instead of switching tools." monitoring 0 0 0 1
Quarterly totals 5 11 21
Illustrative sample. Excerpts shown are representative of the signal structure, not verbatim participant responses. Totals reflect actual quarterly category counts.

Key insights

Reading the longitudinal signal

Figure 03 · CSAT trend 2022–2025
Score held steady. The signal was in what the open responses said about why.
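Chart · Quarterly CSAT score against an 80% target. Axis: 25% to 100%, with quarters marked Q1 2022, Q1 2023, Q3 2023, Q1 2024, Q3 2024, Q1 2025. Plotted values: 77%, 76%, 78%, 76%, 76%. Program inherited Q1 2024. Series: previous researcher (no open-ended coding); with inductive coding schema (2024–2025).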
Source · Quarterly CSAT program · Servomatic · 2022–2025

With the schema in place, every quarter’s data could be read against prior quarters. CSAT held steady at around 77%, consistent across the program. That stability was itself a finding: users found Servomatic reliable and trustworthy for the common case. The product wasn’t broken.

The volatility was in the open-response category mix. Four rolled-up categories accounted for most of the movement quarter to quarter.

What the categories revealed

The 25-category schema was designed to be exhaustive at the coding layer but tractable at the reporting layer. In practice, the open-response signal consolidated into four rolling themes that behaved differently across quarters. Documentation and Ease of Use stayed consistently high, meaning these weren’t one-off complaints but structural friction points the product hadn’t addressed. Support and Performance moved more, often spiking after specific incidents. The schema let me distinguish persistent friction from incident response: two categories of problem with very different fixes.

01 Documentation. The most persistent area. Users flagged out-of-date or missing docs for edge cases like multi-label classification, turn-up/turn-down procedures, and validation errors. The consistent ask was for more usage examples they could copy and adapt.
02 Ease of Use. Friction clustered around deployment and configuration: deploying to production, validating models, accessing logs, and understanding resource allocation were all consistently difficult. Users wanted clearer, more actionable error messages and faster iteration on configurations.
03 Support. Issues spiked around specific incidents rather than persisting across quarters. Slow troubleshooting and limited coverage for custom binaries were recurring complaints, alongside requests for more responsive channels and on-call coverage in US-friendly timezones.
04 Performance. Moved most quarter to quarter, often tied to specific outages or releases with regressions. Users wanted more proactive communication about outages and gradual rollouts to non-production environments before full deployment.
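The distinction between persistent friction and incident response can be sketched numerically. This is an illustration under stated assumptions, not how the program's reporting was built: the rule (an area is persistent when its quietest quarter still reaches a set fraction of its average quarter), the `floor_ratio` threshold, and the quarterly counts are all hypothetical.

```python
from statistics import mean


def classify_area(counts, floor_ratio=0.6):
    """Label a series of per-quarter counts for one area.

    Assumption for this sketch: 'persistent' means the lowest quarter is at
    least floor_ratio of the average quarter; anything spikier is labeled
    'incident-driven'.
    """
    avg = mean(counts)
    if avg == 0:
        return "no signal"
    if min(counts) / avg >= floor_ratio:
        return "persistent"
    return "incident-driven"


# Made-up quarterly counts, for illustration only.
series = {
    "Documentation": [11, 12, 10, 11],
    "Performance": [2, 9, 1, 3],
}
labels = {area: classify_area(counts) for area, counts in series.items()}
```

A rule like this only works because the schema stays fixed: the counts for an area mean the same thing in every quarter being compared.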

The categories also raised a question the survey itself couldn’t answer: a recurring undercurrent in the open responses suggested that production readiness wasn’t being considered early enough in the development cycle, creating costly rework downstream.

When the signal needed depth

I ran a targeted qualitative study in April 2024 to answer the question the survey couldn’t: whether and when production readiness gets considered in the development cycle, and whether developers experience downstream losses because they don’t factor it in early enough. Eight 45- to 60-minute moderated interviews with XManager and Servolab users. Half the session was about current workflows and pain points; the other half was concept testing against the proposed API integration, using a technical one-pager as the stimulus.

Figure 04 · Production workflow elicitation protocol
A blank instrument filled in per participant. Five stages, four dimensions. The structure forced consistent coverage across interviews without scripting the answers.
Stages: Prepare (prepare candidate model for release) · Validate (pre-deployment checks) · Deploy (set up endpoint, host on servers) · Monitor (access and monitor model in production) · Automate (faster, more frequent refreshes)
Dimensions: Tasks · Tools · Pain points · Trade-offs
Stimulus · Qualitative study · 8 moderated interviews · April 2024

The structural barrier the API couldn't solve

Dev-to-prod is structurally siloed. Experiments are run to find the best model for the project; that model is then handed off to a different team or person to handle production readiness. The skillsets aren’t the same. The handoff is where time, context, and quality decisions get lost.

Participants were receptive to the proposed API integration in principle, but raised practical concerns about how it would fit existing org structures and developer skill sets. The proposal was technically sound. The adoption barrier was organizational: the integration would require cross-team coordination that didn’t currently exist.

The XID finding made this concrete. The integration proposed a unified experiment ID shared between XManager and Servolab. Participants had no strong preference either way. It wasn’t opposition. They didn’t understand what benefit it would bring. That ambivalence was the signal: if developers couldn’t articulate why the unified ID would help them, the integration hadn’t cleared the bar of solving a problem they actually felt. No product-market fit for the dev-to-prod simultaneous experimentation concept as scoped.

Figure 05 · The dev-to-prod handoff where information gets lost
Two teams, two sets of constraints, one handoff moment. Production requirements entered the loop too late.
Development: Train model (optimize for evals) → Run experiments (XManager / Servolab) → Select best model (quality looks good)
Handoff: info lost here
Production: Deploy model → New constraints surface → Fail → Costly rework sprint
Proposed: Simultaneous dev/prod experimentation. XManager + Servolab API integration: dev and prod constraints evaluated in parallel, not in sequence.
Source · 8 moderated interviews · XManager and Servolab users · April 2024
Figure 06 · The cost of late-stage production assessment

“We trained the model. We deployed it. It turns out that the particular model type was really badly optimized for the processor. We deployed it because it looked really good in the evals and it looked really good in our energy estimation, but when we deployed it on the specific accelerator, it was terrible. We had to run a sprint to figure out how to get it to run on this processor.”

P8 · XManager User · Pixel sleep tracking model · passed eval, failed on hardware

Impact

Turned an inherited CSAT program into a research engine: the schema as the bridge between quant breadth and qual depth.

The schema made 1.5 years of quarterly data comparable for the first time, surfaced Documentation and Ease of Use as structural friction points rather than noise, and gave the product team a signal they could act on rather than a number they could only watch. The 2024 roadmap was directly informed by the category trends.

The qualitative study produced the more consequential finding. Testing a one-pager for a proposed API integration rather than a prototype was the right call: the question wasn’t whether the interface worked, it was whether the concept would be adopted. It wouldn’t. The adoption barrier was organizational, not technical, and a prototype would have answered a different question. The no-product-market-fit finding saved two quarters of engineering work and redirected capacity to higher-confidence bets. This is the same outcome as the Model Explorer v2 study: stopping a launch before it ships against unvalidated assumptions is a recurring and undervalued form of research impact.

Chapter 04 of 04 — Treating the Research Program as a Product

The bottleneck was never method. It was infrastructure.

Built and led a continuous discovery program from 2021 to 2023 that scaled cross-functional research capacity without scaling researcher headcount. Five designers, 100+ sessions, 20+ initiatives, and a program designed to teach itself.

Summary

Product teams were scaling faster than research. Designers needed iterative insight in days, not weeks. The obvious fix was hiring; the better fix was infrastructure. I built a self-service program where designers ran their own sessions against shared scaffolding, and I held the quality line through templates, coaching, and a recurring cross-functional debrief. The program ran for two years, supported 20+ product initiatives across five designers, and became the model other teams adopted when they hit the same bottleneck.

Treated the program itself as a research problem. Measured it with quarterly stakeholder interviews and a satisfaction and perceived-efficiency survey running every six months, producing four comparable data points over the program’s two-year lifetime. Iterated the scaffolding on what the research surfaced, and produced a framework (Discover → Direction → Deliver) that reframed where continuous discovery fits in the product lifecycle.

Key takeaway

Research programs are themselves research problems. The way to scale research isn’t to standardize it into a pipeline; it’s to treat the program as having users, needs, and unknowns of its own. Designers were the users of this program. What they needed wasn’t permission to do research. It was pre-built templates, facilitation guidance, a quality gate they could lean on, and a ritual that converted sessions into decisions. Once that was clear, the researcher’s job stopped being sole executor and started being architect.

Problem space

When I joined the product area in 2021, research operated reactively. Designers filed requests, research triaged them, studies landed weeks later, and insights were filed in decks only the research team knew how to find. As the team grew and AI product work accelerated, the gap between when a designer needed a signal and when research could deliver one widened. Research was becoming the bottleneck it was supposed to relieve.

The program ran across three distinct user populations, each with different recruiting logistics, session protocols, and legal review requirements. Frontline support agents operating AI-assisted tooling in live customer interactions. The customers those agents served. And internal operations staff, including QA auditors, team leads, and subject matter experts who shaped how the product got used in practice. Standardizing the recruiting pipeline across these populations was one of the quieter high-leverage moves in the program. It was what made weekly cadence possible across all three user types without forcing designers to reinvent the operational work for each study.

The obvious fix was hiring more researchers. The better fix was recognizing that not every research question required a researcher. Tactical concept tests and early usability sessions could be run by designers themselves, given the right scaffolding. What they couldn’t run on their own: participant recruiting, legal compliance, methodological judgment on when their question was outside tactical scope, and the synthesis discipline that turns sessions into institutional knowledge. Those were the things a researcher could own at scale if everything else was self-service.

This program was built before GenAI was permitted for UXR work, so every piece of it was manual: the facilitation, the synthesis, the templates, the debriefs. The constraint mattered. It meant every scaffolding decision had to earn its place. If a template didn’t save a designer real time, it was dead weight. That discipline shaped the program more than the eventual tooling would have.

Methods

I treated the program itself as a research problem. Before designing the scaffolding, I ran structured interviews with cross-functional stakeholders: designers, PMs, program managers, and design leadership. Four questions shaped the intake: what did they want the program to look like, how would they know it was working, how would insights flow across functions, and what hesitations did they have about participating in or relying on the program? The answers shaped every subsequent design decision.

The operating model that came out of that research: weekly self-service sessions, run by designers against pre-built templates, with me as coach and quality gate. I handled participant recruiting, legal review, and methodological consult. Designers handled facilitation, note-taking, and synthesis with template support. A 15-minute cross-functional debrief ritual converted sessions into decisions and surfaced duplicate work before teams accidentally re-tested the same question.

Two existing literatures grounded the design. Teresa Torres’s work on continuous discovery (Continuous Discovery Habits, 2021) argued that weekly customer touchpoints should be standard product team practice rather than a research-team privilege, which shaped how I positioned cadence and cross-functional participation as non-negotiable. Nielsen’s small-sample usability research (Nielsen and Landauer 1993; Nielsen 2000) established the quantitative floor: five participants surface roughly 85% of usability issues in a given round, and even three participants surface the majority of findable issues. For the tactical and directional questions the program was designed to answer, those thresholds were sufficient to iterate confidently. The small-N design wasn’t a shortcut. It was a principled match between sample size and question type, which let designers run weekly rather than quarterly without trading rigor for speed.
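The small-sample arithmetic can be checked directly. The sketch below evaluates the problem-discovery curve from Nielsen (2000), share found = 1 − (1 − L)^n, using the 31% per-participant detection rate that article reports as an average across projects. The function name and the default value are illustrative assumptions; the real rate varies by product and task, so treat the output as a planning heuristic rather than a guarantee.

```python
def share_of_issues_found(participants: int, detection_rate: float = 0.31) -> float:
    """Expected share of usability issues surfaced after `participants` sessions.

    Uses the Nielsen-Landauer curve 1 - (1 - L)^n. The default L = 0.31 is the
    cross-project average quoted in Nielsen (2000); it is an assumption here,
    not a value measured in this program.
    """
    if participants < 0:
        raise ValueError("participants must be non-negative")
    if not 0.0 <= detection_rate <= 1.0:
        raise ValueError("detection_rate must be between 0 and 1")
    return 1 - (1 - detection_rate) ** participants


if __name__ == "__main__":
    for n in (1, 3, 5):
        # With L = 0.31 this prints roughly 31%, 67%, and 84%.
        print(f"{n} participants: {share_of_issues_found(n):.0%} of issues")
```

Under that assumption, three participants already surface about two thirds of findable issues and five surface about 84%, which is the basis for running small weekly rounds instead of one large quarterly study.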

To keep the program honest, I ran two complementary measurement instruments. The first was quarterly qualitative stakeholder interviews throughout the two-year run, which surfaced friction that hadn’t yet shown up as a complaint. The second was a satisfaction and perceived-efficiency survey sent to all program participants every six months, producing four comparable data points across the program’s lifetime. The survey measured satisfaction on a percentage-agree scale and asked whether designers felt their iteration speed had actually improved, whether they saw the program as replacing or supplementing researcher-led work, and how confident they were in the synthesis coming out of their own sessions. It also included one open-ended question: “What could be improved about the Continuous Discovery process?” Each round’s free-text responses went into the next cycle of program iteration, which is how most of the six insights in Figure 01 originated. Running the survey twice a year rather than once gave me a trend instead of a snapshot, and made it possible to see when program changes landed (satisfaction peaked in Q3 2022, after the v2 template library rolled out). The program that existed in Q3 2023 was meaningfully different from the one I launched in Q3 2021. Six of the iterations that got us there came from a single cycle of stakeholder research and are documented below.

Program Design · Stakeholder Research · Research Ops · Coaching & Enablement · Concept Testing · Usability Testing

Key insights

Figure 01 · Six insights, six program changes
Stakeholder interviews and the open-ended survey question surfaced six specific friction points. Each one produced a change in how the program worked. The iteration cycle is why the program that existed in Q3 2023 was different from the one I launched in Q3 2021.
Area | What designers told us | Program change
Bandwidth | High upfront effort was discouraging first-time participants from signing up. | Built v2 template library: note-taking, debrief, and synthesis templates pre-built and ready to copy.
Bandwidth | Designers wanted feedback from more participants per cycle than originally conceived. | Educated on usability testing participant thresholds; adjusted guidance to set clearer expectations.
Education | Designers didn’t know when in their process to use the program, defaulting to the end. | Created the Discover → Direction → Deliver framework; embedded it into the signup flow.
Education | Designers wanted concrete facilitation guidance, not just permission to run sessions. | Surfaced existing training materials in the v2 template; added facilitation tips by question type.
Process | No visibility into what other teams had already tested, leading to duplicated effort. | Introduced a brief post-session synthesis email template shared with PM, Eng, and UXD.
Process | Post-session alignment was inconsistent; insights weren’t being converted to decisions. | Added structured 15-minute cross-functional debrief (PM, Eng, UXR, UXD) with debrief template.
Source · Quarterly stakeholder interviews and satisfaction survey, 2021–2023

Where continuous discovery actually belongs

The most consequential finding from the stakeholder research: designers and PMs thought the program was for tactical concept and usability testing at the end of the design process. That was how I’d initially scoped it too. But the interviews surfaced a missed opportunity. Continuous discovery could run across the full product lifecycle, not just before launch. The question wasn’t whether to test an existing design; the question was what phase of discovery the design was in.

The three-phase framework came out of that reframe. Each phase calibrates the question, the method, and the expected outcome to where the product actually is, not to what the program had conventionally been used for.

Figure 02 · Discover → Direction → Deliver
A framework for matching the research question to the product phase. Each phase anchors to a specific zone of product maturity, not to a single event. Continuous discovery belongs across the full lifecycle, not just before launch.
PRODUCT MATURITY · Ambiguous space · Multiple concepts · Prototype ready · Launch
01 · DISCOVER · Understand the space
QUESTION · How are users handling this now? What are the workflow pain points?
METHODS · Shadowing, Q&A, hypotheticals.
OUTCOME · Current flows, workarounds, and mental models become visible.
02 · DIRECTION · Validate direction
QUESTION · I have 2–3 concepts. Which direction should I pursue?
METHODS · Low-to-mid fidelity concept testing.
OUTCOME · Confidence in which direction to move forward with.
03 · DELIVER · De-risk before launch
QUESTION · My prototype is ready for pilot. Are there any red flags?
METHODS · High fidelity pre-launch usability testing.
OUTCOME · Surface red flags, new user flows, or blockers before production.
Source · Continuous Discovery Program v2, framework introduced Q2 2022

The template was the program

The single highest-leverage artifact was the v2 research template. It was a copy-ready Google Doc with sections for research plan, interview guide, note-taking, post-session debrief, and a synthesis email. Each section had prose guidance explaining what belonged there and why, with worked examples. The template let a designer walk from research plan to stakeholder-ready synthesis in a single document without needing to invent the structure.

The post-session debrief section was the hinge. It asked for four specific observations per session: pain points, surprise findings, least surprising findings, and key insights. The distinction between “surprise” and “least surprising” mattered. It forced designers to separate what they’d learned from what they’d already suspected, which kept the research honest and made the synthesis durable. A debrief that only listed confirmations wasn’t discovery; it was validation. The template surfaced that difference without me having to name it in every session.

Figure 03 · The v2 research template, post-session debrief section
Designers copied the template at the start of each study. Structure, prompts, and worked examples were pre-built. What looked like a form was actually a facilitation guide.
Post-Session Debrief [~15 min]

A debrief is a way for us to reflect and share observations from the sessions. It enables us to tease out patterns, themes, pain points, insights, and next steps. Below is a guide for the 15-minute debrief session. Don’t feel pressured to fill out all of these items.

Pain Points [Where a user struggled with a task, was confused, disappointed, unsure]
Example: Participants had trouble finding the confirmation button and didn’t know how to get to the next screen.
Surprise Findings [Something that made you think, challenged an assumption or perception]
Example: I didn’t know users kept their notes in so many different places.
Least Surprising Findings [Issues we already knew about]
Example: We expected users would get confused during this flow. We need to make sure we have good documentation for teams.
Key Insights [Major findings or insights]
Example: All participants preferred concept #2 because of the clearer information hierarchy.
Next Steps [Where we might go from here, document any major decisions]
Example: Rethink concept #1, move forward with concept #2.
Reproduction of the v2 template debrief section. Prose and examples adapted to remove internal product references.

A voice from inside the program

“Anila is very knowledgeable and easy to work with. She coordinated the recruiting of users, as well as the legal aspects, while providing help and guidance as I was creating my research plans, moderating, and synthesizing the user feedback. Without her help I wouldn’t be able to iterate and improve so quickly on my designs.”

Interaction Designer · Continuous Discovery Program participant

This quote was the signal I watched for. Not “the research was insightful” but “I wouldn’t be able to iterate so quickly.” The program’s purpose wasn’t to produce more research. It was to shorten the loop between a designer having a question and having an answer they could act on. When designers described their own iteration speed, the program was working. When they described the research, it wasn’t quite landing.

What the program couldn’t do

The program was designed for tactical and directional questions. It wasn’t designed for deep foundational research, and trying to use it that way produced thin findings. When teams asked questions that needed more than a 60-minute session, more than five participants, or more than a single designer’s synthesis capacity, I pulled the question out of the program and ran it as dedicated research. Protecting that distinction mattered. A self-service program that tries to handle everything ends up handling nothing well.

The other constraint was capacity. With manual ops, the ceiling was five active designers. Beyond that, I couldn’t maintain quality on recruiting, coaching, and the debrief ritual simultaneously. I knew this when I built it; it was the explicit tradeoff for keeping the rigor high. But it meant the program was a supplement to researcher-led work, not a replacement. Senior stakeholders sometimes wanted it to be the latter. The right answer was always to surface what the program could and couldn’t do and let them decide, rather than let the scope quietly drift.

What changes with AI

The manual overhead that capped the program at five designers can now be compressed. In 2026, I would redesign three of the operational bottlenecks with AI in the loop, preserving the researcher’s role as the judgment layer rather than the execution layer.

  1. Living knowledge repositories. Product-area notebooks in a tool like NotebookLM, built from the accumulated session notes, synthesis emails, and debrief outputs. Instantly queryable by researchers and cross-functional partners. No more hunting through old decks or asking “did we test this?” The notebooks become the program’s institutional memory without requiring anyone to maintain a knowledge base manually.
  2. AI-assisted synthesis with researcher review. LLMs draft initial themes and findings from interview transcripts; researchers review, revise, and ground the output in the raw data. The researcher’s judgment is applied to pattern validity, not to transcription labor. This is where the Human-in-the-Loop principle matters most: AI outputs that go out without researcher grounding carry the authority of research without the accountability.
  3. Smart participant screening. LLM-drafted screener questions, scored responses against ideal participant profiles, edge cases flagged for human review. Recruiting was the biggest ops bottleneck in the original program. Most of it is pattern matching against known criteria, which LLMs handle well. The researcher applies judgment to edge cases and to the small number of recruits where fit matters more than screener alignment.

The underlying principle: AI scales and supplements UX researchers. It doesn’t replace them. The program should adapt to each team’s needs while remaining cognitively safe for researchers and participants. What changed between 2023 and 2026 isn’t the problem. It’s the ceiling on how many teams one researcher can support while holding the quality line.

Impact

Over two years, the program supported 20+ product initiatives through 100+ sessions, run by five designers with me as coach and quality gate. Four signals tracked whether it was working. Designers reported faster iteration on the biannual satisfaction and perceived-efficiency survey, with self-reported confidence in their own synthesis rising alongside it. PMs reported more confident direction-setting in the quarterly stakeholder interviews. The cross-functional debrief ritual embedded itself into team workflows beyond research sessions. And the model was adopted by adjacent teams that hit the same bottleneck.

The program also became the artifact that shaped how I think about research infrastructure in general. Research teams spend a lot of energy arguing that research matters. The bet this program placed was that research matters more when the surface area of who can do it gets wider and the researcher’s role gets more leveraged. Two years of data suggests the bet was right.

Figure 04 · Satisfaction trend across the program
Four data points, every six months across the program’s two-year run. The v2 template library rollout in mid-2022 was the strongest driver of satisfaction gains.
% SATISFIED · Baseline 78%
Q1 2022 (6 mo in) · 78%
Q3 2022 (1 yr in) · 81%
Q1 2023 (18 mo in) · 76%
Q3 2023 (2 yr in) · 78%
Source · Program satisfaction and perceived-efficiency survey, H1 2022 – H2 2023 · n = program participants

References

Nielsen, Jakob, and Thomas K. Landauer. “A Mathematical Model of the Finding of Usability Problems.” Proceedings of ACM INTERCHI ’93 Conference, Amsterdam, 24–29 April 1993, pp. 206–213.

Nielsen, Jakob. “Why You Only Need to Test with 5 Users.” Nielsen Norman Group, 19 March 2000. nngroup.com/articles/why-you-only-need-to-test-with-5-users.

Torres, Teresa. Continuous Discovery Habits: Discover Products that Create Customer Value and Business Value. Product Talk LLC, 2021. ISBN 9781736633304.

Chapter 01 of 02 — When the Workflow Doesn’t Match the Tool

When the Workflow Doesn’t Match the Tool

Newsroom editors avoided a strategic recirculation feature. The product team thought they didn’t see the value. I shadowed five editors to find out what was actually happening, and found a workflow problem the tool had never been designed for.

Summary

Tools get designed for idealized versions of the people who use them. The idealized version of an editor curating related links is someone with time, attention, and a clean cognitive slate at the end of the publication process. The actual version is someone under deadline pressure who has already mentally moved on to the next story. I shadowed five editors across five desks, watching the whole publication workflow, and found the answer wasn’t about motivation. The tool required 10 to 15 minutes at a stage when editors had already moved on.

Reframed the problem from “editors aren’t adopting this” to “the workflow is broken. The tool was designed for an idealized version of how editors work, not how they actually work.” The findings directly shaped the design of Prism, NYT’s new CMS, built around the workflow that actually existed.

Key takeaway

A tool that requires attention at the wrong moment won’t be used correctly, regardless of how well it works. The CMS had embedded a wrong assumption about when and how editors had cognitive bandwidth for curation. The shadowing study found where in the workflow that assumption broke down.

Problem space

Bottom-of-the-page recirculation is one of the highest-leverage moments in the NYT reader experience: it guides someone from one story into the next and converts casual readers into subscribers. The editorially curated tool for that moment was called Related Links, a block of related stories at the bottom of an article, picked by hand by an editor.

But most readers weren’t seeing it. The data showed that most editors barely touched the tool. The product team’s working theory was that the newsroom didn’t see the value.

Methods

Contextual inquiry and shadowing with five editors across five desks, watching the full publication workflow rather than just the Related Links step. Shadowing was the right method because the question was about workflow integration, not tool usability in isolation. Synthesis used affinity diagramming across three lenses: business value, reader value, and editor workflow.

Contextual Inquiry · Shadowing
Figure 01 · Affinity map: synthesis across three lenses
Business value, reader value, and editor workflow: the three questions that structured the synthesis.
Business Value
does it improve CTR?
what’s the strategy for use
where’s the data
do users care about this
sees it as a way to get users to hit the paywall
Reader Value
lots of recirc modules so why bother filling it out
thinks it looks like spam on the page
a way to surprise the reader
“extra dressing on the salad”
can excite the reader to go to the next story
Workflow
editors forget which Related Links boxes exist already
copy/paste workflow leads to errors
doesn’t promote use of articles from cross desks
some reporters forget how to make them
manually updating Related Links boxes is difficult
easy to accidentally delete in Oak
When Newsroom Uses It
when should I use them
helps with packaging for breaking news series
not a priority at all
adds evergreen content so it doesn’t need to be updated
reporters see it as a way to promote their stories
requires reporters to fill it out
Source · Affinity mapping synthesis · Related Links contextual inquiry · NYT · 2019

Key insights

The answer wasn’t about motivation. It was about time and cognitive load. The tool required 10 to 15 minutes per article at a stage in the workflow when editors were already under deadline pressure and had mentally moved on to the next story. The behavior the product team read as disengagement was rational adaptation to real constraints.

Theme 01 · Workflow and friction

The workflow took 10 to 15 minutes per article. Editors manually found 3 to 7 article URLs, picked a display option, and wrote a title and description. By the time editors got to it, they were exhausted.

It was a final-step task with no clear ROI. Adding Related Links sat at step 9 of an 11-step publication process, after editors had already coordinated with print, visual, SEO, and copy desks. Without metrics to justify it, it dropped to the bottom of the list.

Post-publish updates were the worst friction. Every time a new article published in the topic area, editors had to go back and manually update every previous Related Links block. The maintenance burden compounded over time.

Figure 02 · The editors and their workflow
Two editors, two motivations, one shared workflow. Both hit the wall at the same step, and again two steps after publish.
Persona 01
News Editor · Politics, 6 yrs
Wants speed and consistency.
Persona 02
Features Editor · Styles, 4 yrs
Wants reach and breadth.
ARTICLE CREATION · PUBLISH ARTICLE · POST PUBLISH
01 · Article alerted as ready
02 · Updates made
03 · Other disciplines alerted
04 · Coordinate with journalist
05 · Publish date set
06 · Visual editor adds imagery
07 · SEO headline, summary, byline
08 · Print editor checks
09 · Final check: Related Links added here
10–15 min per article · find URLs · pick display · write title
10 · Article published
11 · Collaborate on article updates
12 · Determine new related articles
13 · Manually update every previous Related Links block in that topic area
compounds with every new article
EDITOR SENTIMENT THROUGH THE WORKFLOW
NEWS EDITOR · “It’s extra dressing on the salad.”
FEATURES EDITOR · “Is it worth it? Probably not.”
Source · Contextual interviews with five newsroom editors · Q1 2020

Theme 02 · Awareness and guidelines

Editors had no shared mental model of the strategic value. Was Related Links a display feature? A recirculation engine? A subscriber conversion tool? Editors gave conflicting answers, which meant the product team and the newsroom weren’t aligned on what success looked like.

No guidelines on how, when, or how much. Editors didn’t know which display style drove the most CTR, how many links to include (3 or 7+), or which articles deserved the treatment. Decisions were made by gut, not data.

Editors didn’t know what else existed in their topic area. Without a discovery layer for related coverage, editors couldn’t make connections across the archive, even when they wanted to.

Manual selection led to inconsistent packaging. The same news event could end up with three different Related Links blocks, each curated by a different editor, none aware of the others. India’s COVID coverage was a vivid example.

Theme 03 · Behavior and motivation

Two distinct user groups, same frustration. News Editors needed speed and consistency. Features Editors wanted to compete with the news desk for placement and reader attention. Both wanted to find, save, and reuse previous blocks. Neither could.

Feature desks treated their story pages as mini sections. For desks like Styles, Related Links was a way to showcase the breadth of their coverage and create a destination for readers. The most strategic users were also the most underserved.

Some reporters used it for self-promotion. A use case the product team hadn’t imagined: reporters added Related Links to their own articles to showcase their previous work to readers.

Usage tracked workflow fit, not tenure. The 13-year veteran used it on 1 of 5 articles. The 3-year editor used it on 3 of 5. Familiarity didn’t fix what the workflow couldn’t accommodate.

Figure 03 · Where to fix the workflow: Oak vs. Prism
Two paths emerged: incremental improvements to the legacy CMS, or relocating the feature to the new Google-Docs-based CMS already in development.
Priority | Editor need | Oak (legacy) | Prism (new)
High | Editors need clear guidelines on how and when to use Related Links, based on metrics.
High | Find and reuse previous Related Links blocks instead of rebuilding from scratch.
Medium | Copy and paste Related Links boxes from article to article.
Medium | Save Related Links blocks as templates for reuse across stories.
Medium | Feature desks can curate Related Links to compete with news desks for placement.
Source · Synthesized from contextual interviews + product capability audit

What this looked like in practice

The research pointed toward Prism. The legacy CMS could absorb incremental improvements, but the workflow problem was structural: Related Links required a manual step at the wrong moment in the publication process. Prism, still in development, offered a different architecture: one where the recirculation step could be repositioned, automated in part, and stripped of the cognitive load that was causing editors to skip it entirely. Figure 04 shows what that workflow looked like in practice.

Figure 04 · Proposed editor workflow in Prism
More setup upfront. But the maintenance burden disappears: the post-publish update step is gone entirely.
Oak · Today
01–08 · Article creation, checks, coordination
09 · Final check: Related Links added here
10–15 min · find URLs · pick display · write title
10 · Article published
11–12 · Collaborate on updates, determine new related articles
13 · Manually update every previous Related Links block
In that topic area · compounds with every new article
Prism · Proposed
01–08 · Article creation, checks, coordination
Steps A–D · Storyline + Pharmacy ruleset
One-time setup · embedded in workflow
10 · Article published
11–12 · Collaborate on updates, determine new related articles
Step 13 · eliminated — placement updates automatically
Source · Recommended workflow design based on research findings · Q1 2020

Impact

Reframed the product team’s understanding from “editors aren’t adopting this” to “the workflow is broken, and that’s why readers aren’t getting the recirculation experience the strategy depends on.”

The findings directly shaped the design of Prism, NYT’s new CMS, built around the actual publication workflow rather than the idealized one. Bottom-of-page recirculation is one of the highest-leverage moments for reader conversion, which made the workflow fix a direct lever on subscriber growth. The synthesis methodology (affinity diagramming across business value, reader value, and editor workflow) became a template for adjacent CMS research at NYT.

Chapter 02 of 02 — Mobile Discovery

When Discoverability Is the Product Problem

An A/B test on a tab name found that reader-need framing beat service framing by 11 points. But a single diary entry revealed the real problem: discoverability, not naming, was the bottleneck. That finding shaped the Elections Tab strategy, which reached nearly 2M WAU in election week.

Summary

Adding a feature to a product doesn’t automatically add it to a user’s mental model of that product. NYT hadn’t launched a new mobile feature for readers in two years. I led research across the full product lifecycle on a proposed tab aggregating breaking news and personalized alerts: competitive scan, pre-launch usability, A/B framing, and a four-week post-launch diary study. The A/B test found that reader-need framing beat service framing by 11 points. A single diary entry surfaced the deeper finding: readers who had the tab weren’t using it because they didn’t know it was there.

The discoverability finding shaped the Elections Tab strategy, which reached nearly 2 million weekly active users in election week and became the template for two more mobile launches.

Key takeaway

News consumption is event-driven, not habit-driven. Readers don’t build new information routines on their own. They build them around moments that demand attention.

Problem space

NYT was working toward 10 million subscribers by 2025, and mobile apps were the highest-converting consumer surface. But the company hadn’t launched a new mobile feature for readers in two years. I led research on a proposed new tab aggregating breaking news, top stories, and personalized alerts into a reverse-chronological feed going back 72 hours, built for readers who wanted a fast catch-up.

The underlying question wasn’t really about the feature. It was about how readers form habits with information products. A reader who opens the NYT app has a mental model of what’s there. Adding a new tab doesn’t automatically update that model. The research question was: what does it take for a new surface to become part of how someone actually uses a product, rather than an option they never discover?

Methods

Full product lifecycle research spanning four phases: a competitive analysis to validate the problem space; pre-launch usability testing with a hi-fi prototype and internal dogfooding; an A/B framing test, in-product survey, and four-week diary study at launch; and a post-launch assessment measuring against KPIs. The diary study was chosen for post-launch because it captures behavior in natural context over time, which survey and interview methods cannot replicate for a habit-formation question.

Research lifecycle · From competitive scan to Elections Tab
Four studies. The first three asked what to name it. A single diary entry answered the real question.
Competitive scan · 6 competitors, 7 features · Finding: naming is inherited, not validated
Pre-launch usability · 24 participants, hi-fi prototype · Finding: concept validated, launch ready
A/B test · 15% iOS audience, 8 weeks · +11pt · Catch Up beats Alerts by 11 points
Diary study · 16 participants, 4 weeks · Pivot · Single diary entry: discoverability, not naming
Elections Tab · Hypothesis at scale · ~2M WAU · Major news driver confirms discoverability hypothesis

Experiment design for the A/B test

The A/B test was designed, not just executed. Working from the July 2020 iOS baseline (5.1M WAU, 34% tap-through rate), I used NYT’s internal Abra Experiment Duration Calculator to model the tradeoff between audience allocation, test duration, and minimum detectable lift at a 95% confidence level. The modeling made the calibration choice explicit: a 1% minimum detectable lift would need ~918K users and 27 days; a 3% lift would need ~103K users and 16 days. The team committed to a detection sensitivity that matched the scale of effect we could act on, rather than over-powering the test for a lift that wouldn’t change the product decision.

Figure · Experiment sizing: the detection-sensitivity tradeoff
Smaller lifts need larger samples and longer runs. Choosing a detection threshold is a product decision, not a statistical one.
Minimum detectable lift | Users needed | Duration
1% | 918,129 | 27 days
2% | 230,517 | 18 days
3% | 102,885 | 16 days
5% | 37,347 | 15 days
10% | 9,525 | 15 days
Source · NYT Abra Experiment Duration Calculator · July 2020 iOS ET2 baseline · 95% confidence level · 3 test variants
Mixed Methods · Experimental Design · Competitive Analysis · Survey · A/B Testing · Diary Study
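
For readers who want a feel for where numbers like those in the sizing figure come from, here is a minimal sketch using the textbook sample-size formula for comparing two proportions. It is not the Abra calculator, whose method is not described here. It assumes the lift is relative to the 34% baseline tap-through rate, 80% power (not stated in the source), a two-sided test at the 5% significance level, and three equally sized arms. The function name `users_needed` is hypothetical, and test duration is not modeled.

```python
from math import ceil
from statistics import NormalDist


def users_needed(baseline_rate, relative_lift, arms=3, alpha=0.05, power=0.80):
    """Total users across all arms for a two-sided two-proportion z-test.

    Assumptions (not from the source): unpooled variance, 80% power,
    lift expressed relative to the baseline rate, equal-sized arms.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    per_arm = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return ceil(per_arm) * arms


# Sample sizes for the lifts listed in the sizing figure, at a 34% baseline.
for lift in (0.01, 0.02, 0.03, 0.05, 0.10):
    print(f"{lift:.0%} lift: {users_needed(0.34, lift):,} users")
```

Under these assumptions the required sample shrinks roughly with the square of the lift, which is the shape of the tradeoff in the figure: a 1% lift needs on the order of 900K users, while a 10% lift needs fewer than 10K.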

Key insights

Mapping the competitive landscape

Every competitor called their equivalent feature some version of Alerts. The category had converged on a pattern, including the name. A reverse-chronological feed, bell icon, 48 to 72 hour timeline: the design language was settled. That convergence became the question worth testing: if everyone called it Alerts, was that the right name, or just the inherited one?

Figure 01 · Competitive analysis: six competitors, seven features
Feature Wash. Post HuffPo Guardian Bleacher Daily News WSJ NYT (proposed)
Reverse chronological feed
Badging on icon for new alerts
Tab name Alerts Alerts Live Alerts Alerts Notifs TBD
Read / unread state
Timeline 48h 72h+ 72h+ 24h 72h 72h
Sign up for alerts in feed
Multiple tabs in feed
Present
Not present
Source · Direct review of competitor mobile apps · June 2020

Validating the concept before launch

Pre-launch usability testing with 24 participants across registered users and paying subscribers. Unmoderated remote sessions with a high-fidelity prototype. Participants moved through the user flow without prompting, found the feed readable and straightforward, and were open to seeing personalization added later. No major red flags surfaced. Subscribers wanted a morning-and-breaking-news habit. Registered users found the reverse-chronological framing made the feed feel timely and live.

The A/B experiment

15% of the iOS audience, equally split across two variants and a control. Both variants had identical design and content. The only difference was the tab name. The headline question: does framing a tab around a reader need (Catch Up) or a service (Alerts) drive more adoption?

Reader-need framing beat service framing on every adoption and satisfaction metric. Before the election, Catch Up averaged 14,300 weekly active users (5.9% of eligible users) versus 12,500 for Alerts (5.1%). During election week, both variants roughly doubled, with Catch Up reaching 9.58% of eligible users versus 8.96% for Alerts. Tap-through rate followed the same pattern: 16.45% for Catch Up versus 14.14% for Alerts. The in-product survey put it most starkly: 67% of Catch Up users rated the experience Good versus 56% for Alerts, a gap that held across the entire test window.

But two findings complicated the picture. First, stickiness was identical across variants at 18% DAU/WAU: the name affected discovery and first impressions but not whether readers came back. Second, the tab was cannibalizing existing behavior rather than growing it. Sessions with the Catch Up or Alerts tab showed a slight decrease in For You tab activity. The tab wasn’t creating a new habit. It was redistributing an existing one.

The bigger insight was what the data couldn’t fully explain: some users confused the tab with For You, and others discovered it by accident weeks after launch. Passive discovery wasn’t enough. New tabs need a major news driver, active onboarding, or promotion to overcome the inertia of an existing mental model.

Figure 02 · In-product survey results: Catch Up vs. Alerts
Same design. Same content. Only the name differed: the gap was 11 points.
Variant | Good | Not sure | Bad
Catch Up | 67% | 23% | 10%
Alerts | 56% | 26% | 18%
Difference in Good ratings: +11pt
Source · In-product survey · n = 15% of iOS audience · ~8 weeks
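
As a rough check on the pre-election adoption gap reported above (5.9% versus 5.1% of eligible users), the comparison can be sketched as a two-proportion z-test. This is an illustration, not the analysis the team ran. The eligible-user counts are back-calculated from the reported WAU and percentages, and treating a weekly average as a single binomial sample is a simplifying assumption. The function name `two_proportion_z` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Reported pre-election weekly active users per variant.
catch_up_wau, alerts_wau = 14_300, 12_500
# Assumption: eligible users back-calculated from the reported percentages.
catch_up_eligible = round(catch_up_wau / 0.059)
alerts_eligible = round(alerts_wau / 0.051)

z, p_value = two_proportion_z(catch_up_wau, catch_up_eligible,
                              alerts_wau, alerts_eligible)
print(f"z = {z:.2f}, two-sided p = {p_value:.3g}")
```

With arms of this size, a gap of under one percentage point sits far outside what sampling noise alone would produce under these assumptions. That is a statement about the name's effect on adoption only; it says nothing about stickiness, which was identical across variants.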

After launch: the diary study

Thirteen participants, split across the two variants and mixing registered users and subscribers, kept diaries over four weeks. The goal was depth: understanding habit formation and engagement over time, not just first impressions. Catch Up users incorporated the tab more consistently than Alerts users, but several felt ambivalent about it. The defining entry came from a participant who had had the tab for weeks without knowing it was there.

“I’m a bit embarrassed to tell you that I discovered the Alerts tab only yesterday. I touched the button on my iPhone by mistake and saw the news about the winners for the Nobel Prize for physics. Great feature! I will be using it regularly.”

Participant 10 · post-launch diary

A great feature discovered by accident, weeks after launch. The finding pointed directly at what would solve it: a major news driver large enough that users couldn’t miss the new tab.

The decision: three paths, one recommendation

The research memo laid out three options. Rebuild the tab as a webview to allow faster iteration outside the native release cycle. Pause the tab and fold the alerts functionality into the broader personalized notification strategy. Or pause the tab and explore the concept of a live/breaking news destination as part of Storylines work. The memo recommended options two and three.

The reasoning was direct: the core hypothesis was right (readers want a place to track timely updates), but the MVP had delivered a lukewarm experience and native implementation was too slow to iterate toward a compelling one. The Elections Tab had already demonstrated that webview-based tabs could perform well. The question was whether to keep iterating on this specific concept or let it inform something bigger. The team chose the latter.

The discoverability hypothesis at scale

The 2020 US election was exactly the kind of news driver the Catch Up research had identified as the missing ingredient. Working from that hypothesis, I led the launch of a temporary Elections Tab: a focused destination for election coverage that ran for approximately six weeks. The hypothesis going in was that a pop-up, time-bound tab centered on a major news event would not disrupt habituated users and would drive new users to develop the habit the Catch Up tab had failed to instill on its own. Both predictions held.

Impact

Nearly two million weekly active users in election week, and a research-led template that ended a two-year drought of mobile product launches.

The Elections Tab reached nearly 2 million weekly active users in election week, ran for approximately six weeks, and drove measurable lifts in registration and subscription conversion among readers who entered through the tab. It confirmed the discoverability hypothesis at scale: a major news driver solved the adoption problem that passive launch couldn’t.

The research plan, experiment design, and post-launch memo were reused for two additional mobile launches including the Covid-19 Tab, making this the template for how mobile feature research was run at NYT going forward.

Writing

Essays & Notes

Published on After Legibility ↗

Contact

Let's talk about
the next problem.

Open to research collaborations and connecting with people working on hard problems at the edges of AI, cognition, and society. I take on a small number of advisory engagements and am actively interested in academic collaborations, particularly on AI ethics, epistemic risk, mental health, and the societal impact of intelligent systems at scale.
