Research Case Study · CHI PLAY · 2026

Turning Chatbots into Game Engines for Learning

A taxonomy for embedding game mechanics directly into LLM prompts, transforming transactional AI interactions into immersive, playful learning experiences.

Role: Lead Researcher
Methods: Participatory Design, Grounded Theory
Venue: CHI PLAY 2026
60
Co-designed Prompts
960
Coded Units Analyzed
2.4×
Immersion Lift (L0→L2)
97%
Taxonomy Coverage
Quick Read · The essentials in 60 seconds
01
The Problem

LLM interactions in education are transactional and inert. Students become passive consumers, outsourcing thinking to the AI instead of engaging deeply with material.

02
The Approach

Participatory design with educators, students, and game researchers to co-create 60 gamified prompts, then grounded theory analysis to derive a formal taxonomy.

03
The Taxonomy

5 primary dimensions and 19 sub-dimensions mapping the design space of gamified prompts: Game Director, Game Mechanics, The Teacher, AI Control, and NPCs.

04
The Impact

Taxonomy-guided prompts significantly increased immersion (2.2→5.2) and enjoyment (3.8→5.4) for learners, and substantially reduced designer workload (NASA-TLX frustration dropped from a median of 4 to 2, mental demand from 5 to 3).

Large Language Models have become central infrastructure in education, yet their interactions remain stubbornly transactional—students ask, the AI answers, learning stays shallow. We asked: what if the chatbot wasn't a tutor, but a game engine?

Despite the rise of orchestration platforms like Playlab.ai, CircleIn, and Magic School AI, the resulting AI interactions are frequently linear and inert. Without deliberate pedagogical structuring, these assistants default to efficient information delivery—inadvertently encouraging the very passivity educators seek to avoid.

We define gameful prompting as the strategic embedding of game mechanics—such as narrative roles, resource constraints, and rule-based feedback—directly into the system instructions of an LLM. This approach transforms the general-purpose chatbot into a lightweight game engine, democratizing the creation of educational games.
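To make the definition concrete, here is a minimal sketch of a gamified system prompt in the standard chat-message format. It is loosely modeled on the climate-change example quoted later in this case study, but the wording and mechanics below are invented for illustration, not drawn from the study corpus.

```python
# Hypothetical gamified system prompt; wording invented, not from the study corpus.
GAMEFUL_SYSTEM_PROMPT = """\
You are the Game Master of "Guardians of Earth", a climate-science learning game.
Narrative role: the student is a new recruit protecting habitats from pollution.
Rules:
- Award 10 points for answers that reason from evidence, 3 for surface-level answers.
- The student starts with 3 lives; an unsupported claim costs 1 life.
- Win condition: 50 points. Loss condition: 0 lives.
- After each turn, show a plain-text scoreboard (points, lives, current habitat).
- Stay in character; if the student tries to skip the game, gently steer them back.
"""

# Standard chat-completion message format; pass to any chat-capable LLM API.
messages = [
    {"role": "system", "content": GAMEFUL_SYSTEM_PROMPT},
    {"role": "user", "content": "Why do coral reefs bleach?"},
]
```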

Examples of gamified prompts: ARG game with food labels, image generation on world shifts, and ASCII point visualization

Examples from the corpus: (a) An ARG-styled game where participants upload food labels; (b) LLM-generated images on game world shifts; (c) Point systems visualized through ASCII characters.

Identifying the Research Gap

Recent studies warn of cognitive offloading—a phenomenon where learners outsource mental effort to the AI, treating it as an "answer engine" rather than a thinking partner. The central challenge is no longer providing access to AI, but designing interactions that resist passivity and sustain the "productive struggle" essential for deep learning.

Historically, achieving deep cognitive engagement required Game-Based Learning (GBL)—standalone video games or complex simulations. While pedagogically effective, traditional GBL is hindered by specialized development, substantial budgets, and extensive teacher training. Furthermore, once built, these games are static artifacts, difficult for educators to customize.

A critical literature review revealed that established serious games frameworks—the LM-GM model, Game Object Model, Four-Dimensional Framework—all assume a deterministic game engine where rules are hard-coded. None provide guidance for orchestrating mechanics in the probabilistic, unstructured environment of a Large Language Model. Meanwhile, existing prompt engineering frameworks (Lo et al.'s four-step framework, the GPEI model) excel at clarity and functional utility, but never address the motivational or immersive dynamics required to counter student disengagement.

Research Gap → Research Question

There is no formal design language that bridges static prompt engineering with dynamic gameplay. Educators have a powerful engine (the LLM) but no manual for constructing game-like experiences. This gap led to our research question: What game design elements and strategies can be embedded into LLM prompts to enhance student engagement?

We deliberately framed this as a design question rather than an efficacy question because the field first needs a shared vocabulary before it can systematically study optimization. You can't measure what you can't name.

Research Methodology

Our research followed a three-phase approach: (1) constructing a corpus of gamified prompts through participatory activities, (2) building the taxonomy through grounded theory, and (3) evaluating the taxonomy for coverage, usability, and impact.

Participant involvement across taxonomy development and evaluation phases

Participant involvement across taxonomy development and evaluation phases spanning generative co-design, prompt collection, taxonomy construction, and evaluative studies.

Why Participatory Design → Grounded Theory?

We considered two approaches: (a) starting from existing serious games theory and deductively mapping mechanics to prompts, or (b) starting from real artifacts created by real stakeholders and letting categories emerge inductively. We chose (b) because the LLM context is fundamentally different from traditional game engines—we needed to discover what practitioners actually do when they gamify prompts, not what they theoretically should do. A top-down framework would inherit the deterministic assumptions of existing models.

Participatory design ensured the taxonomy was grounded in the creative practices of its intended users, while grounded theory (Charmaz, 2006) ensured the categories emerged from data rather than preconceptions.

Building the Research Tool: StudyHelper

Before data collection could begin, we needed a controlled environment where participants could create, test, and iterate gamified prompts without the confounds of different LLM interfaces. We built StudyHelper, a custom platform with two modes: Playground for open-ended prompt creation, and Taxonomy Mode that overlays our taxonomy as an interactive design aid. This dual-mode architecture was crucial for our later within-subjects evaluation—the same interface, with and without the taxonomy layer.
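The dual-mode design can be pictured as one authoring pipeline with an optional taxonomy overlay. The sketch below is a hypothetical simplification, not StudyHelper's actual implementation; `AuthoringSession` and `load_taxonomy` are invented names for illustration.

```python
# Hypothetical simplification of StudyHelper's dual-mode design (not the real code).

def load_taxonomy() -> dict:
    # Abbreviated; the full five-dimension structure is sketched later under
    # "Contribution 2: The Taxonomy of Gamified Prompts".
    return {"Game Director": ["Game Type", "Conditions", "Pathway", "World"]}

class AuthoringSession:
    def __init__(self, mode: str):
        # "playground": free-form prompt creation and testing.
        # "taxonomy": the same editor plus an interactive taxonomy overlay.
        self.mode = mode
        self.overlay = load_taxonomy() if mode == "taxonomy" else None

    def build_request(self, system_prompt: str, student_turn: str) -> list[dict]:
        # Both modes send identical messages to the same LLM backend, so the
        # within-subjects comparison isolates the effect of the taxonomy layer.
        return [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": student_turn},
        ]
```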

StudyHelper platform architecture and interface

StudyHelper: (a) Architecture and API calls; (b) Playground mode for creating and testing prompts; (c) Taxonomy Mode for exploring the taxonomy; (d) How designers apply taxonomy elements while creating prompts.

Phase 1: Corpus Generation — Why Three Stakeholder Groups?

We deliberately triangulated prompt generation across three stakeholder types—each contributing a different perspective that a single group couldn't provide alone.

01

Co-Design with Instructors

Why: Instructors bring pedagogical intentionality—they think about learning objectives, scaffolding, and assessment. Their prompts established the "floor" of pedagogical rigor. 3 instructors with GBL expertise co-designed 4 prompts with interdisciplinary support (engineers, learning scientists, game researchers).

02

Gamified Hackathon

Why: Students are the other key stakeholder—understanding their vision for gamified learning is equally essential. Senior HCI and Game Design graduate students brought creative ambition and player empathy. The competitive format ($100 prize pool) ensured high-quality outputs. 16 prompts across diverse topics.

03

Course Assignments

Why: To reach saturation, we needed volume and diversity. Embedding prompt creation into two game design courses (45 + 6 students) produced 40 additional prompts while ensuring ecological validity—students applied real game design coursework to the task.

04

Grounded Theory Analysis

Why: With 60 prompts, we needed a systematic bottom-up approach. Two researchers segmented prompts into 960 units of analysis (each answering "This part of the prompt asks the LLM to..."), then used constant comparison to derive 75 open codes → 19 axial codes → 5 selective codes. Inter-rater reliability (κ = 0.78) was established on 30% of the dataset before independent coding.
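For illustration, the snippet below shows how such an agreement check can be computed, assuming κ refers to Cohen's kappa for the two coders (the paper's exact procedure may differ); the label assignments are toy data, not the study's.

```python
# Toy illustration of the inter-rater reliability check (assumes Cohen's kappa
# for two coders; example labels are invented, not the study data).
from sklearn.metrics import cohen_kappa_score

# Axial-code labels assigned independently by each researcher to the same
# ~30% subset of the 960 units of analysis.
coder_a = ["Points", "NPC Establishing", "Hints", "World", "Points", "Goals"]
coder_b = ["Points", "NPC Establishing", "Hints", "Pathway", "Points", "Goals"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa = {kappa:.2f}")  # the study reports kappa = 0.78
```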

Why This Unit of Analysis?

A key methodological decision was how to segment prompts. We rejected word-level and sentence-level coding as too granular, and prompt-level coding as too coarse. Instead, we adopted a functional unit: every segment that instructs the LLM to behave in a specific way. This gave us 960 analyzable units across 60 prompts—enough granularity to capture distinct design patterns while preserving the instructional intent of each segment.
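As a concrete, hypothetical representation of one such unit: the field names below are invented, but the three code levels mirror the open → axial → selective hierarchy described above, and the example segment paraphrases the weighted-points prompt quoted later in this case study.

```python
# Hypothetical record for one functional unit of analysis (field names invented).
from dataclasses import dataclass

@dataclass
class CodedUnit:
    prompt_id: int        # which of the 60 corpus prompts the segment came from
    segment: str          # "This part of the prompt asks the LLM to..."
    open_code: str        # one of 75 open codes
    axial_code: str       # one of 19 axial codes (sub-dimensions)
    selective_code: str   # one of 5 selective codes (primary dimensions)

unit = CodedUnit(
    prompt_id=12,
    segment="Award more points for deeper analysis of the character.",
    open_code="Weighted points",
    axial_code="Points",
    selective_code="Game Mechanics",
)
```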

Hackathon poster, participants interacting with prompts, and assignment instructions

(Left) Poster advertising the Bot-a-thon hackathon; (Middle) Participants comparing gamified prompts; (Right) Assignment instructions for course-embedded prompt creation.

Contribution 1: The Gamified Prompts Corpus

Through our generative activities, we obtained a corpus of 60 gamified learning prompts—the first available corpus of its kind. The average prompt was 394 words, and topics spanned programming, STEM, life skills, humanities, climate science, and more.

Distribution of prompts across topics and taxonomy element density

(Left) Distribution of gamified prompts across learning topics; (Right) Density of taxonomy elements across the corpus.

An annotated prompt from the corpus with highlighted codes and resulting interactions

A prompt from the corpus with taxonomy codes highlighted. The corpus lets users search prompts, filter them by tag, and interact with them directly.

Contribution 2: The Taxonomy of Gamified Prompts

Through grounded theory analysis of the 960 coded units, we derived the Taxonomy of Gamified Prompts (TGP): 5 primary dimensions and 19 sub-dimensions that map the full design space of gamified LLM prompts.

The complete Gamified Prompts Taxonomy with all dimensions, sub-dimensions, and examples

The complete TGP taxonomy illustrated with examples across all levels of codes—5 selective codes, 19 axial codes, and 75 open codes.

🎮

Game Director

Structural elements: game type, conditions, pathways, and world-building.

Game Type · Conditions · Pathway · World
⚙️

Game Mechanics

Reward systems, survival mechanics, inventory management, and randomization.

Points · Survival · Inventory · Randomization
📚

The Teacher

Pedagogical goals, learning pathways, hints/feedback, and metacognitive reflection.

Goals · Pathways · Hints · Metacognition
🤖

AI Control

Output restrictions, input mechanics, visual elements, and emoji systems.

Output · Input · Visuals · Emojis
🧙

NPCs

Character creation, personality, customization, and interaction patterns.

Establishing · Interaction Patterns
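For readers who want to use the taxonomy programmatically (for example as a checklist in an authoring tool), the dimensions and the abbreviated sub-dimension labels from the cards above can be collected into a simple lookup structure. This is a sketch based on the card labels, which cover 18 entries; the paper's full taxonomy lists 19 axial codes, so exact names and the split may differ.

```python
# Sketch of the TGP as a lookup table, using the abbreviated card labels above.
# These labels cover 18 entries; the full taxonomy has 19 axial codes.
TGP = {
    "Game Director": ["Game Type", "Conditions", "Pathway", "World"],
    "Game Mechanics": ["Points", "Survival", "Inventory", "Randomization"],
    "The Teacher": ["Goals", "Pathways", "Hints", "Metacognition"],
    "AI Control": ["Output", "Input", "Visuals", "Emojis"],
    "NPCs": ["Establishing", "Interaction Patterns"],
}

assert len(TGP) == 5  # five primary dimensions (selective codes)
```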

Game Director

The Game Director encompasses structural elements that define how gamified learning experiences unfold over time—ranging from single-player to multiplayer and ARG framings, win/loss conditions, level progressions, and immersive game worlds.

"You have been chosen to join the Guardians of Earth, a secret league dedicated to protecting our planet. In the first level [...], in the next habitat [...]" — Climate change prompt demonstrating Game Pathway design

Game Mechanics

Reward systems, survival mechanics (health, lives, stamina), inventory management, and randomization (dice rolls, loot boxes) transform educational interactions into engaging game-like experiences where progress is tracked and stakes feel real.

"A weighted point system, where players are awarded more points for deeper analysis of the character, rather than surface level observations." — Literature teaching prompt with pedagogically-aligned rewards

The Teacher

The Teacher dimension contains the educational elements that make these prompts effective learning tools—pedagogical goals (what to teach and what to avoid), structured learning pathways, adaptive hints and feedback, and metacognitive reflection prompts.

AI Control & NPCs

AI Control governs output length, language restrictions, choice-based input, and visual elements (ASCII art, generated images, emojis). NPCs create companions, mentors, and adversaries with distinct personalities that students interact with through conflict, persuasion, and collaboration.

Multiplayer prompt examples showing name collection, achievements, and dice-based randomization

(a) A multiplayer prompt collecting user names and roles; (b) A programming prompt using titles and achievements; (c) A D&D-styled prompt using dice rolls for randomization.

Key Results at a Glance

Taxonomy-guided prompts (L2) outperformed standard prompts (L0) on every engagement dimension and outperformed non-taxonomy gamified prompts (L1) on most, with the largest gains in immersion and enjoyment.

5.2
Immersion (L2)
Up from 2.2 (L0). Friedman χ² = 18.4, p < 0.001; post-hoc L0–L2, p < 0.01.
5.4
Enjoyment (L2)
Up from 3.8 (L0). Friedman χ² = 16.8, p < 0.001; post-hoc L0–L2, p < 0.01.
7.0
Ease of Use (Med.)
Median ease-of-use rating with the taxonomy vs. 4/7 without (prompt-creation-flow item). p < 0.001.
0.68
Frustration Effect Size (r)
NASA-TLX frustration dropped from a median of 4 to 2 with the taxonomy, the largest workload effect.

Evaluation Strategy

Why These Three Constructs?

Drawing on established principles from taxonomy evaluation literature in information systems (Nickerson et al., 2013; Kaplan et al., 2022) and HCI (Tabassi et al., 2023), we operationalized our assessment through three constructs: Coverage, Usability, and Impact. These aren't arbitrary—they map to the lifecycle of a taxonomy: it must first be comprehensive enough to classify the domain (Coverage), then practical enough for its intended users (Usability), and finally demonstrably beneficial in improving outcomes (Impact). Evaluating only one or two would leave critical gaps in our validation.

Coverage — Can It Classify Any Gamified Prompt?

The strongest test of coverage is classification of artifacts created without knowledge of the taxonomy. We invited game designers to a Game-AI workshop and asked them to create gamified prompts with no exposure to our categories. This approach prevents confirmation bias—if participants had seen the taxonomy, they might unconsciously design toward its categories.

From 21 prompts, we identified 103 granular units. 100 were successfully categorized within our taxonomy (97% coverage). Only 3 units fell outside: voice input, real-time timers, and video generation—all tied to emerging LLM modalities not yet widely available.

Distribution of taxonomy codes in coverage analysis

Distribution of selective and axial codes on gamified learning prompts analyzed as part of testing the taxonomy for coverage. The Teacher (31) and Game Director (27) were most frequent.

Usability Study Design

Why Within-Subjects? Why These Instruments?

Study design: We chose a within-subjects, repeated-measures design (N=33, randomized order) over between-subjects because prompt creation ability varies enormously across individuals. A within-subjects design lets each participant serve as their own control, dramatically reducing noise from individual differences in creativity, game design experience, and AI familiarity.

Why TAM + NASA-TLX together: These instruments measure complementary dimensions. The NASA-TLX captures the cost of the creative task itself—how mentally demanding, frustrating, and effortful it felt. The TAM captures the perceived value of the taxonomy as a tool—is it useful? Is it easy? Would you use it again? Together, they answer: does the taxonomy reduce the burden while increasing the perceived quality? One without the other tells an incomplete story.

Why Wilcoxon signed-rank test: Our data is ordinal (7-point Likert scales), paired (same participants in both conditions), and we cannot assume normality with N=33. The Wilcoxon signed-rank test is the appropriate non-parametric alternative to a paired t-test for exactly this scenario—it tests whether the distribution of differences is symmetric around zero without requiring interval-scale or normally distributed data.
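A minimal sketch of this analysis with toy ratings is below. scipy.stats.wilcoxon is the real function, but the data are invented; Z is recovered from the two-sided p-value via the normal approximation to yield the r = Z/√N effect size reported in the tables that follow.

```python
# Sketch of the paired Wilcoxon analysis with toy Likert ratings (not study data).
import numpy as np
from scipy.stats import wilcoxon, norm

without_tax = np.array([4, 5, 4, 5, 6, 4, 5, 4, 5, 5])  # 7-point Likert, toy values
with_tax    = np.array([7, 7, 6, 7, 7, 6, 7, 6, 7, 7])

stat, p = wilcoxon(without_tax, with_tax)  # paired, non-parametric

# Effect size r = Z / sqrt(N); Z recovered from the two-sided p-value via the
# normal approximation.
n = len(without_tax)
z = norm.isf(p / 2)
r = z / np.sqrt(n)
print(f"p = {p:.4f}, r = {r:.2f}")
```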

TAM Results: Perceived Ease of Use & Usefulness

All four Perceived Ease of Use items showed significant positive shifts with large effect sizes. Notably, "prompt creation flow was straightforward" moved from a median of 4 (neutral) to 7 (strongly agree)—the largest single shift (r = 0.71). The improvements in ease-of-use translated directly into higher perceived usefulness: designers rated prompts created with the taxonomy as significantly better and faster to produce.

Construct Survey Question No-Tax Med. Tax Med. p-value Effect (r)
PEOU Prompt creation flow was straightforward 4 7 < 0.001 0.71
PEOU Easy to generate ideas 5 7 < 0.001 0.75
PEOU Easy to learn how to use 5 7 < 0.01 0.58
PEOU Easy to interact with 5 7 < 0.01 0.64
PU Helped create prompts more quickly 5 7 < 0.01 0.61
PU Helped create a better gamified prompt 5 7 < 0.01 0.56
PU Improved ability to think of gamification strategies 5 6 < 0.05 0.45
Attitude Felt confident using this tool 5 7 < 0.01 0.58
Intention Would like to use this in the future 5 7 < 0.001 0.68
Intention Would prefer this version over other tools 4 7 < 0.001 0.69

Wilcoxon signed-rank test (non-parametric, paired, ordinal data). N=33. Effect size: r = Z/√N. Conventions: 0.1=small, 0.3=medium, 0.5=large.

The only non-significant result was "made the process more effective" (median 6→6, p > 0.05). We interpret this as a ceiling effect—participants already rated the baseline tool as effective, leaving little room for improvement on this general measure. The more specific usefulness questions showed clear gains.

NASA-TLX: Workload Reduction

The NASA-TLX corroborated the TAM findings from the cost side: the taxonomy reduced what designers had to invest, not just what they got out.

NASA-TLX Dimension No-Tax Median Tax Median p-value Effect (r)
Mental Demand 5 3 < 0.01 0.63
Temporal Demand 4 2 < 0.01 0.52
Performance ↑ 5 6 < 0.05 0.46
Effort 4 3 < 0.01 0.57
Frustration 4 2 < 0.001 0.68

Frustration showed the largest effect (r = 0.68, large)—dropping from median 4 to 2. This aligns with qualitative data: without the taxonomy, participants described feeling "stuck" or "overwhelmed by possibilities." The taxonomy converted open-ended creative paralysis into structured exploration.

Impact — Designer Perspective

Open-ended responses from the usability study revealed three themes in how the taxonomy changed the design process:

"I just prefer this new library version over the first one by a lot. Coming up with ideas took me a while for the first one. I felt like I was firing off ideas so much faster with this because it was easier to translate my ideas into a game with the library." — P19, Prompt Designer
"I saw 'inventory,' 'NPCs,' and realized I could actually organize the chaos... I wouldn't have thought of resource management at all without the library." — P29, Prompt Designer
Distribution of taxonomy elements before and after taxonomy use

Distribution of taxonomy elements before and after the use of taxonomy (n=16). Total coded instances jumped from 118 to 213, with Game Director showing the largest increase (27→54).

Impact — Learner Engagement

Study Design: Why Three Levels? Why These Instruments?

Three-level comparison (L0/L1/L2): Rather than a simple with/without test, we designed three conditions to isolate the taxonomy's contribution. L0 (standard instructional prompt, no gamification) serves as baseline. L1 (gamified by an instructor without the taxonomy) tests whether any gamification helps. L2 (gamified with the taxonomy) tests whether the taxonomy produces measurably better gamification. This design lets us distinguish "does gamification help?" from "does structured gamification help more?"

Instrument selection: We selected specific subscales from three validated instruments, each targeting a distinct engagement dimension: PENS (Player Experience of Need Satisfaction) for immersion and emotional engagement—designed specifically for game-like experiences; IMI (Intrinsic Motivation Inventory) Interest/Enjoyment subscale for intrinsic motivation; and the Situational Interest Survey Attention Quality subscale for cognitive focus. Using game-specific instruments (PENS) rather than general UX scales (e.g., SUS) was a deliberate choice—we're evaluating play experiences, not software usability.

Why Friedman test + post-hoc Wilcoxon: With three repeated conditions (L0, L1, L2) and ordinal data from 10 participants, the Friedman test is the appropriate non-parametric alternative to repeated-measures ANOVA. Where Friedman showed significance, we ran post-hoc Wilcoxon signed-rank tests with Bonferroni correction (α = 0.017) to control for the inflated Type I error from multiple pairwise comparisons.
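Below is a sketch of this analysis pipeline with invented ratings for ten participants; friedmanchisquare and wilcoxon are real scipy functions, and the Bonferroni threshold of 0.05 / 3 ≈ 0.017 matches the correction described above.

```python
# Sketch of the Friedman + post-hoc Wilcoxon analysis with toy ratings (n=10),
# not the study data.
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon

# One rating per participant per condition (7-point Likert, invented values).
L0 = [2, 3, 2, 2, 3, 2, 1, 3, 2, 2]
L1 = [4, 3, 4, 3, 4, 4, 3, 4, 3, 4]
L2 = [5, 5, 6, 5, 5, 6, 4, 5, 6, 5]

chi2, p = friedmanchisquare(L0, L1, L2)
print(f"Friedman chi2 = {chi2:.1f}, p = {p:.4f}")

# Post-hoc pairwise Wilcoxon tests with Bonferroni correction: alpha = 0.05 / 3.
alpha = 0.05 / 3
for (name_a, a), (name_b, b) in combinations([("L0", L0), ("L1", L1), ("L2", L2)], 2):
    stat, p_pair = wilcoxon(a, b)
    print(f"{name_a} vs {name_b}: p = {p_pair:.4f}, significant = {p_pair < alpha}")
```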

Ten students interacted with prompts at all three levels across two topics (climate change and basic math), providing both quantitative ratings and qualitative interview data.

Dimension L0 (Standard) L1 (No Tax) L2 (With Tax) Friedman χ² L0–L2 (Z)
Immersion 2.2 3.6 5.2 18.4*** −2.8**
Emotional Engagement 2.3 3.8 4.6 16.1*** −2.8**
Attention 3.5 4.3 4.4 11.2** −2.6**
Enjoyment 3.8 5.1 5.4 16.8*** −2.8**

Composite medians on 7-point Likert scale (n=10). Friedman test for overall effects; post-hoc Wilcoxon signed-rank with Bonferroni correction (α = 0.017). *p<0.017, **p<0.01, ***p<0.001.

Reading the Data: What the Pattern Tells Us

Immersion and enjoyment showed the strongest effects and continued improving from L1 to L2 (both significant), suggesting the taxonomy's structured game mechanics—progression systems, NPCs, inventory—add measurable value beyond basic gamification.

Attention improved significantly from L0→L1 but plateaued between L1→L2 (not significant). This is a meaningful null result: it suggests that basic narrative framing captures most of the attention benefit, while the taxonomy's additional mechanics contribute more to immersion and emotional engagement than sustained focus.

Cross-domain consistency: The pattern held across both topics (climate change and basic math), suggesting the taxonomy captures fundamental engagement mechanisms rather than domain-specific effects.

"I think the points actually cheered me up. I use a lot of ChatGPT in general, but this AI session felt more exciting." — P6, Student Participant
"Because of the character, I feel I can learn a little bit better, like my retention is increased... maybe not just retention, but my ability to stay focused on it for a little longer." — P5, Student Participant
"Here I am not just answering, instead I am talking to a person. We're thinking, brainstorming together... It's like a small community. Talking to these characters and then taking decisions in the chat." — P7, on how NPC interactions shifted their role from answerer to participant
Comparison of Level 0, Level 1, and Level 2 prompts showing increasing sophistication

Illustration of Level 0 (standard), Level 1 (gamified without taxonomy), and Level 2 (gamified with taxonomy) prompts, showing progressive design sophistication.

Research Reflections & Theoretical Contribution

Beyond the taxonomy itself, this work surfaced important methodological insights about studying AI-mediated learning experiences and the relationship between game design theory and generative AI.

Why Existing Frameworks Didn't Work

A natural question is: why not just apply the LM-GM model or MDA framework directly to LLM prompts? We initially considered this. The critical realization was that existing frameworks share a deterministic assumption—the game engine is fixed code. In GOM or LM-GM, a designer "selects" a mechanic and "implements" it. The relationship between rule and outcome is rigid. In LLM environments, we must persuade the model to simulate mechanics. This shifts the design challenge from implementation to orchestration—and existing frameworks have no vocabulary for orchestration.

The Orchestration Layer

Our taxonomy doesn't replace LM-GM or MDA—it provides the missing translation layer. For example, where LM-GM maps "Guidance → Tutor," our taxonomy specifies how to make the LLM maintain that mapping: through "NPC Establishing" codes to create the tutor character, "Interaction Patterns" to define how student and tutor exchange, and "AI Control" to prevent the tutor from breaking character.

The Prompt as Game Engine

Through "AI Control" (output/input restrictions) and "Game Conditions" (win/loss states), the prompt functions as the game engine described in MDA—ensuring mechanics generate the desired dynamics. Without these controls, the learner can easily "break" the game by forcing the LLM to behave differently, making the boundary between designer intent and player experience porous.

Meaningful Null Results

The attention plateau (L1≈L2) was as informative as the positive results. It suggests designers should prioritize NPC development and visual feedback for maximum engagement impact, while basic narrative framing is sufficient for sustained attention. This kind of differential insight is only possible because we measured multiple engagement dimensions independently.

Secondary Prompt Analysis

We limited the before/after taxonomy comparison to 16 participants who created prompts without the taxonomy first. This was deliberate—including participants who saw the taxonomy first would introduce carryover effects, inflating the "without" condition. This methodological choice reduced our N but increased the validity of the comparison (118→213 coded instances).

Implications for Stakeholders

01

For Educators

The taxonomy serves as an accessible "creative checklist," lowering the barrier to designing rich game-like learning activities without code. Participants described it as a "catalyst" that let them "fire off ideas faster"—it scaffolds rather than constrains creativity.

02

For HCI Practitioners

A structured vocabulary for designing and critiquing LLM-based educational interfaces. The statistically significant reductions in cognitive load (NASA-TLX) and improvements in perceived usefulness (TAM) suggest value as a design aid integrated into authoring environments.

03

For Researchers

An analytical tool to code and compare educational chatbots, and a basis for generative research questions: How do different combinations of Game Mechanics and NPC Interaction Patterns affect specific learning outcomes? Can we train a meta-LLM to auto-generate prompts from learning objectives?

Researchers developing taxonomy dimensions through iterative process

Researchers developing the dimensions of the taxonomy through an iterative grounded theory process with constant comparison.

Limitations & What I'd Do Differently

Skills & Methods Demonstrated

Research

Participatory Design, Co-Design Workshops, Grounded Theory, Within-Subjects Experiments, Semi-Structured Interviews

Analysis

Qualitative Coding, Constant Comparison, Inter-Rater Reliability, Wilcoxon Signed-Rank, Friedman Tests, NASA-TLX, TAM

Design

Taxonomy Development, Prompt Engineering, Game Mechanics Design, Educational Platform (StudyHelper) Design

Domain

Game-Based Learning, Serious Games Frameworks (LM-GM, MDA, DPE), LLM Orchestration, Educational Technology