Large Language Models have become central infrastructure in education, yet their interactions remain stubbornly transactional—students ask, the AI answers, learning stays shallow. We asked: what if the chatbot wasn't a tutor, but a game engine?
Despite the rise of orchestration platforms like Playlab.ai, CircleIn, and Magic School AI, the resulting AI interactions are frequently linear and inert. Without deliberate pedagogical structuring, these assistants default to efficient information delivery—inadvertently encouraging the very passivity educators seek to avoid.
We define gameful prompting as the strategic embedding of game mechanics—such as narrative roles, resource constraints, and rule-based feedback—directly into the system instructions of an LLM. This approach transforms the general-purpose chatbot into a lightweight game engine, democratizing the creation of educational games.
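To make this concrete, here is a minimal sketch of a gameful prompt in code, assuming an OpenAI-style chat API; the game, prompt wording, and model name are illustrative inventions, not items from our corpus.

```python
# Minimal sketch of gameful prompting: game mechanics written directly into the
# system instructions of a chat LLM. The game, wording, and model name are
# illustrative assumptions, not examples from the corpus.
from openai import OpenAI

narrative_role = "You are the Dungeon Master of 'Fraction Quest'."
resource_constraints = "The player starts with 3 hearts and 0 gold."
rule_based_feedback = (
    "Each turn, pose one fraction problem as an encounter. "
    "On a correct answer, award 10 gold and describe the scene; "
    "on a wrong answer, remove 1 heart and offer a hint."
)
win_loss_conditions = "At 0 hearts the run ends in defeat; at 50 gold the player wins."
ai_control = "Never reveal the answer outright and never break character."

system_prompt = " ".join(
    [narrative_role, resource_constraints, rule_based_feedback,
     win_loss_conditions, ai_control]
)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def play_turn(history: list[dict]) -> str:
    """Send the running conversation to the model under the gameful system prompt."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": system_prompt}, *history],
    )
    return response.choices[0].message.content

print(play_turn([{"role": "user", "content": "Start the game."}]))
```

The same system prompt works with any chat-style LLM endpoint; the point is that the "engine" lives entirely in the instructions, not in application code.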
Examples from the corpus: (a) An ARG-styled game where participants upload food labels; (b) LLM-generated images produced as the game world shifts; (c) Point systems visualized through ASCII characters.
Identifying the Research Gap
Recent studies warn of cognitive offloading—a phenomenon where learners outsource mental effort to the AI, treating it as an "answer engine" rather than a thinking partner. The central challenge is no longer providing access to AI, but designing interactions that resist passivity and sustain the "productive struggle" essential for deep learning.
Historically, achieving deep cognitive engagement required Game-Based Learning (GBL): standalone video games or complex simulations. While pedagogically effective, traditional GBL demands specialized development expertise, substantial budgets, and extensive teacher training. Furthermore, once built, these games are static artifacts that are difficult for educators to customize.
A critical literature review revealed that established serious games frameworks (the LM-GM model, the Game Object Model, the Four-Dimensional Framework) all assume a deterministic game engine in which rules are hard-coded. None provides guidance for orchestrating mechanics in the probabilistic, unstructured environment of a Large Language Model. Meanwhile, existing prompt engineering frameworks (Lo et al.'s four-step framework, the GPEI model) excel at clarity and functional utility but do not address the motivational or immersive dynamics required to counter student disengagement.
There is no formal design language that bridges static prompt engineering with dynamic gameplay. Educators have a powerful engine (the LLM) but no manual for constructing game-like experiences. This gap led to our research question: What game design elements and strategies can be embedded into LLM prompts to enhance student engagement?
We deliberately framed this as a design question rather than an efficacy question because the field first needs a shared vocabulary before it can systematically study optimization. You can't measure what you can't name.
Research Methodology
Our research followed a three-phase approach: (1) constructing a corpus of gamified prompts through participatory activities, (2) building the taxonomy through grounded theory, and (3) evaluating the taxonomy for coverage, usability, and impact.
Overview of participant involvement across the taxonomy development and evaluation phases: generative co-design, prompt collection, taxonomy construction, and evaluation studies.
We considered two approaches: (a) starting from existing serious games theory and deductively mapping mechanics to prompts, or (b) starting from real artifacts created by real stakeholders and letting categories emerge inductively. We chose (b) because the LLM context is fundamentally different from traditional game engines—we needed to discover what practitioners actually do when they gamify prompts, not what they theoretically should do. A top-down framework would inherit the deterministic assumptions of existing models.
Participatory design ensured the taxonomy was grounded in the creative practices of its intended users, while grounded theory (Charmaz, 2006) ensured the categories emerged from data rather than preconceptions.
Building the Research Tool: StudyHelper
Before data collection could begin, we needed a controlled environment where participants could create, test, and iterate gamified prompts without the confounds of different LLM interfaces. We built StudyHelper, a custom platform with two modes: Playground for open-ended prompt creation, and Taxonomy Mode that overlays our taxonomy as an interactive design aid. This dual-mode architecture was crucial for our later within-subjects evaluation—the same interface, with and without the taxonomy layer.
StudyHelper: (a) Architecture and API calls; (b) Playground mode for creating and testing prompts; (c) Taxonomy Mode for exploring the taxonomy; (d) How designers apply taxonomy elements while creating prompts.
Phase 1: Corpus Generation — Why Three Stakeholder Groups?
We deliberately triangulated prompt generation across three stakeholder types—each contributing a different perspective that a single group couldn't provide alone.
Co-Design with Instructors
Why: Instructors bring pedagogical intentionality, thinking in terms of learning objectives, scaffolding, and assessment. Their prompts established the "floor" of pedagogical rigor. Three instructors with GBL expertise co-designed four prompts with interdisciplinary support from engineers, learning scientists, and game researchers.
Gamified Hackathon
Why: Students are the other key stakeholder, and understanding their vision for gamified learning is equally essential. Senior HCI and Game Design graduate students brought creative ambition and player empathy, and the competitive format ($100 prize pool) encouraged high-quality outputs. This activity yielded 16 prompts across diverse topics.
Course Assignments
Why: To reach saturation, we needed volume and diversity. Embedding prompt creation into two game design courses (45 + 6 students) produced 40 additional prompts while ensuring ecological validity—students applied real game design coursework to the task.
Grounded Theory Analysis
Why: With 60 prompts, we needed a systematic bottom-up approach. Two researchers segmented prompts into 960 units of analysis (each answering "This part of the prompt asks the LLM to..."), then used constant comparison to derive 75 open codes → 19 axial codes → 5 selective codes. Inter-rater reliability (κ = 0.78) was established on 30% of the dataset before independent coding.
A key methodological decision was how to segment prompts. We rejected word-level and sentence-level coding as too granular, and prompt-level coding as too coarse. Instead, we adopted a functional unit: every segment that instructs the LLM to behave in a specific way. This gave us 960 analyzable units across 60 prompts—enough granularity to capture distinct design patterns while preserving the instructional intent of each segment.
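For readers who want to reproduce the reliability check, the sketch below computes Cohen's κ over a jointly coded subset; the axial-code labels and toy data are invented for illustration.

```python
# Illustrative sketch of the inter-rater reliability check: two researchers
# assign axial codes to the same subset of prompt units and Cohen's kappa
# measures their chance-corrected agreement. Labels and data are invented.
from sklearn.metrics import cohen_kappa_score

# One entry per jointly coded unit (a toy subset; the study overlapped on ~30%
# of the 960 units before the researchers coded the remainder independently).
coder_a = ["reward_system", "npc_persona", "win_condition", "hint_feedback", "npc_persona"]
coder_b = ["reward_system", "npc_persona", "win_condition", "reward_system", "npc_persona"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa = {kappa:.2f}")
```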
(Left) Poster advertising the Bot-a-thon hackathon; (Middle) Participants comparing gamified prompts; (Right) Assignment instructions for course-embedded prompt creation.
Contribution 1: The Gamified Prompts Corpus
Through our generative activities, we obtained a corpus of 60 gamified learning prompts, the first available corpus of its kind. The average prompt was 394 words long, and the corpus spans diverse topics including programming, STEM, life skills, humanities, and climate science.
(Left) Distribution of gamified prompts across learning topics; (Right) Density of taxonomy elements across the corpus.
A prompt from the corpus with taxonomy codes highlighted. The corpus allows users to search and filter prompts by tag and to interact with them directly.
Contribution 2: The Taxonomy of Gamified Prompts
Through grounded theory analysis of the 960 coded units, we derived the Taxonomy of Gamified Prompts (TGP), a taxonomy with 5 primary dimensions and 19 sub-dimensions that maps the full design space of gamified LLM prompts.
The complete TGP taxonomy illustrated with examples across all levels of codes—5 selective codes, 19 axial codes, and 75 open codes.
Game Director
Structural elements: game type, conditions, pathways, and world-building.
Game Mechanics
Reward systems, survival mechanics, inventory management, and randomization.
The Teacher
Pedagogical goals, learning pathways, hints/feedback, and metacognitive reflection.
AI Control
Output restrictions, input mechanics, visual elements, and emoji systems.
NPCs
Character creation, personality, customization, and interaction patterns.
Game Director
The Game Director encompasses structural elements that define how gamified learning experiences unfold over time: single-player, multiplayer, and ARG framings; win/loss conditions; level progressions; and immersive game worlds.
Game Mechanics
Reward systems, survival mechanics (health, lives, stamina), inventory management, and randomization (dice rolls, loot boxes) transform educational interactions into engaging game-like experiences where progress is tracked and stakes feel real.
The Teacher
The Teacher dimension contains the educational elements that make these prompts effective learning tools—pedagogical goals (what to teach and what to avoid), structured learning pathways, adaptive hints and feedback, and metacognitive reflection prompts.
AI Control & NPCs
AI Control governs output length, language restrictions, choice-based input, and visual elements (ASCII art, generated images, emojis). NPCs create companions, mentors, and adversaries with distinct personalities that students interact with through conflict, persuasion, and collaboration.
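To show how the taxonomy can be operationalized, the sketch below encodes the five dimensions and their sub-dimensions as a plain data structure that a tool like StudyHelper's Taxonomy Mode could use to tag prompt segments. The sub-dimension names are paraphrased from the summaries above, not the taxonomy's exact code labels, and the keyword matcher is deliberately naive.

```python
# Sketch: the TGP's dimensions and sub-dimensions as a nested structure that a
# tagging tool can walk. Names are paraphrased from the summaries above, not
# the taxonomy's exact code labels.
TGP = {
    "Game Director": ["game type", "win/loss conditions", "pathways", "world-building"],
    "Game Mechanics": ["reward systems", "survival mechanics", "inventory", "randomization"],
    "The Teacher": ["pedagogical goals", "learning pathways", "hints/feedback", "metacognitive reflection"],
    "AI Control": ["output restrictions", "input mechanics", "visual elements", "emoji systems"],
    "NPCs": ["character creation", "personality", "customization", "interaction patterns"],
}

def tag_segment(segment: str) -> list[tuple[str, str]]:
    """Naively match a prompt segment against sub-dimension names."""
    hits = []
    for dimension, subdimensions in TGP.items():
        for sub in subdimensions:
            if any(keyword in segment.lower() for keyword in sub.lower().split("/")):
                hits.append((dimension, sub))
    return hits

print(tag_segment("You earn 10 gold for each correct answer (reward systems)."))
```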
(a) A multiplayer prompt collecting user names and roles; (b) A programming prompt using titles and achievements; (c) A D&D-styled prompt using dice rolls for randomization.
Key Results at a Glance
Taxonomy-guided prompts (L2) outperformed standard prompts (L0) across all engagement dimensions and exceeded non-taxonomy gamified prompts (L1) on immersion, emotional engagement, and enjoyment.
Evaluation Strategy
Drawing on established principles from taxonomy evaluation literature in information systems (Nickerson et al., 2013; Kaplan et al., 2022) and HCI (Tabassi et al., 2023), we operationalized our assessment through three constructs: Coverage, Usability, and Impact. These aren't arbitrary—they map to the lifecycle of a taxonomy: it must first be comprehensive enough to classify the domain (Coverage), then practical enough for its intended users (Usability), and finally demonstrably beneficial in improving outcomes (Impact). Evaluating only one or two would leave critical gaps in our validation.
Coverage — Can It Classify Any Gamified Prompt?
The strongest test of coverage is classification of artifacts created without knowledge of the taxonomy. We invited game designers to a Game-AI workshop and asked them to create gamified prompts with no exposure to our categories. This approach prevents confirmation bias—if participants had seen the taxonomy, they might unconsciously design toward its categories.
From 21 prompts, we identified 103 granular units. Of these, 100 were successfully categorized within our taxonomy (97% coverage). Only 3 units fell outside the taxonomy: voice input, real-time timers, and video generation, all tied to emerging LLM modalities not yet widely available.
Distribution of selective and axial codes on gamified learning prompts analyzed as part of testing the taxonomy for coverage. The Teacher (31) and Game Director (27) were most frequent.
Usability Study Design
Study design: We chose a within-subjects, repeated-measures design (N=33, randomized order) over between-subjects because prompt creation ability varies enormously across individuals. A within-subjects design lets each participant serve as their own control, dramatically reducing noise from individual differences in creativity, game design experience, and AI familiarity.
Why TAM + NASA-TLX together: These instruments measure complementary dimensions. The NASA-TLX captures the cost of the creative task itself—how mentally demanding, frustrating, and effortful it felt. The TAM captures the perceived value of the taxonomy as a tool—is it useful? Is it easy? Would you use it again? Together, they answer: does the taxonomy reduce the burden while increasing the perceived quality? One without the other tells an incomplete story.
Why Wilcoxon signed-rank test: Our data is ordinal (7-point Likert scales), paired (same participants in both conditions), and we cannot assume normality with N=33. The Wilcoxon signed-rank test is the appropriate non-parametric alternative to a paired t-test for exactly this scenario—it tests whether the distribution of differences is symmetric around zero without requiring interval-scale or normally distributed data.
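For transparency about the computation, here is a small sketch of this analysis with fabricated 7-point ratings; the effect size r = Z/√N is derived from the test's p-value rather than from real study data.

```python
# Sketch of the paired analysis: Wilcoxon signed-rank test on ordinal Likert
# ratings from the same participants in two conditions, plus the effect size
# r = Z / sqrt(N). The ratings below are fabricated for illustration.
import math
from scipy.stats import norm, wilcoxon

no_taxonomy   = [4, 5, 4, 3, 5, 6, 6, 4, 5, 4, 3, 5]  # 7-point Likert ratings
with_taxonomy = [6, 7, 6, 5, 7, 5, 7, 5, 7, 6, 5, 6]  # same participants, other condition

stat, p = wilcoxon(no_taxonomy, with_taxonomy)

# Convert the two-sided p-value into an equivalent Z score, then r = Z / sqrt(N).
z = norm.isf(p / 2)
r = z / math.sqrt(len(no_taxonomy))

print(f"W = {stat:.1f}, p = {p:.4f}, r = {r:.2f}")
```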
TAM Results: Perceived Ease of Use & Usefulness
All four Perceived Ease of Use items showed significant positive shifts with large effect sizes. Notably, "prompt creation flow was straightforward" moved from a median of 4 (neutral) to 7 (strongly agree), a three-point shift with a large effect (r = 0.71). The improvements in ease of use translated directly into higher perceived usefulness: designers rated prompts created with the taxonomy as significantly better and faster to produce.
| Construct | Survey Question | No-Tax Med. | Tax Med. | p-value | Effect (r) |
|---|---|---|---|---|---|
| PEOU | Prompt creation flow was straightforward | 4 | 7 | < 0.001 | 0.71 |
| PEOU | Easy to generate ideas | 5 | 7 | < 0.001 | 0.75 |
| PEOU | Easy to learn how to use | 5 | 7 | < 0.01 | 0.58 |
| PEOU | Easy to interact with | 5 | 7 | < 0.01 | 0.64 |
| PU | Helped create prompts more quickly | 5 | 7 | < 0.01 | 0.61 |
| PU | Helped create a better gamified prompt | 5 | 7 | < 0.01 | 0.56 |
| PU | Improved ability to think of gamification strategies | 5 | 6 | < 0.05 | 0.45 |
| Attitude | Felt confident using this tool | 5 | 7 | < 0.01 | 0.58 |
| Intention | Would like to use this in the future | 5 | 7 | < 0.001 | 0.68 |
| Intention | Would prefer this version over other tools | 4 | 7 | < 0.001 | 0.69 |
Wilcoxon signed-rank test (non-parametric, paired, ordinal data). N=33. Effect size: r = Z/√N. Conventions: 0.1=small, 0.3=medium, 0.5=large.
The only non-significant result was "made the process more effective" (median 6→6, p > 0.05). We interpret this as a ceiling effect—participants already rated the baseline tool as effective, leaving little room for improvement on this general measure. The more specific usefulness questions showed clear gains.
NASA-TLX: Workload Reduction
The NASA-TLX corroborated the TAM findings from the cost side: the taxonomy reduced what designers had to invest, not just what they got out.
| NASA-TLX Dimension | No-Tax Median | Tax Median | p-value | Effect (r) |
|---|---|---|---|---|
| Mental Demand | 5 | 3 | < 0.01 | 0.63 |
| Temporal Demand | 4 | 2 | < 0.01 | 0.52 |
| Performance ↑ | 5 | 6 | < 0.05 | 0.46 |
| Effort | 4 | 3 | < 0.01 | 0.57 |
| Frustration | 4 | 2 | < 0.001 | 0.68 |
Frustration showed the largest effect (r = 0.68, large)—dropping from median 4 to 2. This aligns with qualitative data: without the taxonomy, participants described feeling "stuck" or "overwhelmed by possibilities." The taxonomy converted open-ended creative paralysis into structured exploration.
Impact — Designer Perspective
Open-ended responses from the usability study revealed three themes in how the taxonomy changed the design process.
Distribution of taxonomy elements before and after the use of taxonomy (n=16). Total coded instances jumped from 118 to 213, with Game Director showing the largest increase (27→54).
Impact — Learner Engagement
Three-level comparison (L0/L1/L2): Rather than a simple with/without test, we designed three conditions to isolate the taxonomy's contribution. L0 (standard instructional prompt, no gamification) serves as baseline. L1 (gamified by an instructor without the taxonomy) tests whether any gamification helps. L2 (gamified with the taxonomy) tests whether the taxonomy produces measurably better gamification. This design lets us distinguish "does gamification help?" from "does structured gamification help more?"
Instrument selection: We selected specific subscales from three validated instruments, each targeting a distinct engagement dimension: PENS (Player Experience of Need Satisfaction) for immersion and emotional engagement—designed specifically for game-like experiences; IMI (Intrinsic Motivation Inventory) Interest/Enjoyment subscale for intrinsic motivation; and the Situational Interest Survey Attention Quality subscale for cognitive focus. Using game-specific instruments (PENS) rather than general UX scales (e.g., SUS) was a deliberate choice—we're evaluating play experiences, not software usability.
Why Friedman test + post-hoc Wilcoxon: With three repeated conditions (L0, L1, L2) and ordinal data from 10 participants, the Friedman test is the appropriate non-parametric alternative to repeated-measures ANOVA. Where Friedman showed significance, we ran post-hoc Wilcoxon signed-rank tests with Bonferroni correction (α = 0.017) to control for the inflated Type I error from multiple pairwise comparisons.
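The sketch below mirrors this pipeline on fabricated ratings: a Friedman test across the three conditions, followed by pairwise Wilcoxon tests judged against the Bonferroni-corrected threshold. Condition labels follow the study, but all numbers are invented.

```python
# Sketch of the three-condition analysis: Friedman test across L0/L1/L2, then
# post-hoc pairwise Wilcoxon signed-rank tests judged against the
# Bonferroni-corrected alpha of 0.05 / 3 ≈ 0.017. All ratings are invented.
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon

# One value per participant (n = 10) per condition, on a 7-point scale.
ratings = {
    "L0": [2, 3, 2, 1, 3, 2, 2, 3, 2, 2],
    "L1": [4, 3, 4, 3, 4, 3, 5, 4, 3, 4],
    "L2": [5, 5, 6, 4, 5, 5, 6, 5, 5, 6],
}

chi2, p_overall = friedmanchisquare(ratings["L0"], ratings["L1"], ratings["L2"])
print(f"Friedman chi^2 = {chi2:.1f}, p = {p_overall:.4f}")

alpha_corrected = 0.05 / 3  # Bonferroni correction for three pairwise comparisons
if p_overall < 0.05:
    for a, b in combinations(ratings, 2):
        stat, p = wilcoxon(ratings[a], ratings[b])
        verdict = "significant" if p < alpha_corrected else "n.s."
        print(f"{a} vs {b}: W = {stat:.1f}, p = {p:.4f} ({verdict})")
```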
Ten students interacted with prompts at all three levels across two topics (climate change and basic math), providing both quantitative ratings and qualitative interview data.
| Dimension | L0 (Standard) | L1 (No Tax) | L2 (With Tax) | Friedman χ² | L0–L2 (Z) |
|---|---|---|---|---|---|
| Immersion | 2.2 | 3.6 | 5.2 | 18.4*** | −2.8** |
| Emotional Engagement | 2.3 | 3.8 | 4.6 | 16.1*** | −2.8** |
| Attention | 3.5 | 4.3 | 4.4 | 11.2** | −2.6** |
| Enjoyment | 3.8 | 5.1 | 5.4 | 16.8*** | −2.8** |
Composite medians on 7-point Likert scale (n=10). Friedman test for overall effects; post-hoc Wilcoxon signed-rank with Bonferroni correction (α = 0.017). *p<0.017, **p<0.01, ***p<0.001.
Immersion and enjoyment showed the strongest effects and continued improving from L1 to L2 (both significant), suggesting the taxonomy's structured game mechanics—progression systems, NPCs, inventory—add measurable value beyond basic gamification.
Attention improved significantly from L0→L1 but plateaued between L1→L2 (not significant). This is a meaningful null result: it suggests that basic narrative framing captures most of the attention benefit, while the taxonomy's additional mechanics contribute more to immersion and emotional engagement than sustained focus.
Cross-domain consistency: The pattern held across both topics (climate change and basic math), suggesting the taxonomy captures fundamental engagement mechanisms rather than domain-specific effects.
Illustration of Level 0 (standard), Level 1 (gamified without taxonomy), and Level 2 (gamified with taxonomy) prompts, showing progressive design sophistication.
Research Reflections & Theoretical Contribution
Beyond the taxonomy itself, this work surfaced important methodological insights about studying AI-mediated learning experiences and the relationship between game design theory and generative AI.
Why Existing Frameworks Didn't Work
A natural question is: why not just apply the LM-GM model or MDA framework directly to LLM prompts? We initially considered this. The critical realization was that existing frameworks share a deterministic assumption—the game engine is fixed code. In GOM or LM-GM, a designer "selects" a mechanic and "implements" it. The relationship between rule and outcome is rigid. In LLM environments, we must persuade the model to simulate mechanics. This shifts the design challenge from implementation to orchestration—and existing frameworks have no vocabulary for orchestration.
The Orchestration Layer
Our taxonomy doesn't replace LM-GM or MDA; it provides the missing translation layer. For example, where LM-GM maps "Guidance → Tutor," our taxonomy specifies how to make the LLM maintain that mapping: through "NPC Establishing" codes to create the tutor character, "Interaction Patterns" to define how the student and the tutor exchange turns, and "AI Control" to prevent the tutor from breaking character.
The Prompt as Game Engine
Through "AI Control" (output/input restrictions) and "Game Conditions" (win/loss states), the prompt functions as the game engine described in MDA—ensuring mechanics generate the desired dynamics. Without these controls, the learner can easily "break" the game by forcing the LLM to behave differently, making the boundary between designer intent and player experience porous.
Meaningful Null Results
The attention plateau (L1≈L2) was as informative as the positive results. It suggests designers should prioritize NPC development and visual feedback for maximum engagement impact, while basic narrative framing is sufficient for sustained attention. This kind of differential insight is only possible because we measured multiple engagement dimensions independently.
Secondary Prompt Analysis
We limited the before/after taxonomy comparison to 16 participants who created prompts without the taxonomy first. This was deliberate—including participants who saw the taxonomy first would introduce carryover effects, inflating the "without" condition. This methodological choice reduced our N but increased the validity of the comparison (118→213 coded instances).
Implications for Stakeholders
For Educators
The taxonomy serves as an accessible "creative checklist," lowering the barrier to designing rich game-like learning activities without code. Participants described it as a "catalyst" that let them "fire off ideas faster"—it scaffolds rather than constrains creativity.
For HCI Practitioners
A structured vocabulary for designing and critiquing LLM-based educational interfaces. The statistically significant reductions in cognitive load (NASA-TLX) and improvements in perceived usefulness (TAM) suggest value as a design aid integrated into authoring environments.
For Researchers
An analytical tool to code and compare educational chatbots, and a basis for generative research questions: How do different combinations of Game Mechanics and NPC Interaction Patterns affect specific learning outcomes? Can we train a meta-LLM to auto-generate prompts from learning objectives?
Researchers developing the dimensions of the taxonomy through an iterative grounded theory process with constant comparison.
Limitations & What I'd Do Differently
- Engagement ≠ Learning: Our evaluation deliberately measured engagement, immersion, and attention—not learning outcomes. This was a scoping decision: we first needed to establish that the taxonomy produces more engaging experiences before investing in learning transfer studies. Future work will incorporate pre/post assessments and measure knowledge retention.
- Sample size trade-offs: The impact study (n=10) used a mixed-methods approach precisely because the sample size limits statistical power. The qualitative interview data carries the primary weight; the quantitative results serve as supporting triangulation, not standalone evidence. A larger-scale validation is planned.
- Carryover in within-subjects design: While we randomized condition order in the usability study, the creative task itself may have a learning effect—participants might generate better prompts simply from having done it once. We mitigated this by analyzing before/after only for participants who completed the no-taxonomy condition first (n=16), but a between-subjects replication would strengthen the finding.
- Coverage blind spots: The three uncovered units (voice, timers, video) all relate to emerging LLM modalities. As these capabilities mature, the taxonomy will need continuous refinement—it's a living framework, not a fixed standard.
- Mechanic combinations: This work maps the design space but doesn't optimize within it. We don't yet know which combinations of taxonomy elements produce maximum benefit. A rubric for prompt quality assessment is a key next step.
Skills & Methods Demonstrated
Participatory Design, Co-Design Workshops, Grounded Theory, Within-Subjects Experiments, Semi-Structured Interviews
Qualitative Coding, Constant Comparison, Inter-Rater Reliability, Wilcoxon Signed-Rank, Friedman Tests, NASA-TLX, TAM
Taxonomy Development, Prompt Engineering, Game Mechanics Design, Educational Platform (StudyHelper) Design
Game-Based Learning, Serious Games Frameworks (LM-GM, MDA, DPE), LLM Orchestration, Educational Technology