Currently, several of the world’s most advanced artificial intelligence systems, including GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro, are being streamed live on Twitch as they attempt to play and master classic Pokémon games. Measured against human expectations, their performance is notably poor: the systems act with great confidence in their decisions yet are frequently confused and slow to make progress. Watching these AIs play Pokémon offers deeper insight into how they operate in realistic conditions than the benchmark statistics typically published at model launch.
The effort to transform large language models (LLMs) into proficient Pokémon players began in February of the previous year, when an Anthropic researcher launched a livestream of Claude playing the 1996 Game Boy title Pokémon Red to coincide with the release of Claude 3.7 Sonnet, then among the leading AI models. Earlier versions of Claude had significant limitations, often wandering aimlessly or getting trapped in loops and failing to progress beyond the earliest stages of the game.
The livestream quickly garnered interest, attracting around 2,000 viewers who cheered the AI on in real time. For reference, most children complete Pokémon Red in roughly 20 to 40 hours. Claude 3.7 Sonnet, by contrast, struggled consistently, frequently getting stuck for hours at a time. Anthropic’s latest model, Claude Opus 4.5, plays markedly better but still hits significant obstacles, including a four-day stretch in which it circled a gym without realizing it needed to cut down a tree blocking its path.
Google’s Gemini series of models also entered the arena, completing the near-identical Pokémon Blue last May. The milestone prompted Google CEO Sundar Pichai to joke that the company was making strides toward "Artificial Pokémon Intelligence." But the completion does not settle whether Gemini is the superior Pokémon player, because the AI systems use different operational scaffolding, known as "harnesses," that shapes their effectiveness.
Joel Zhang, the independent developer who runs the Gemini Plays Pokémon stream, likens these harnesses to an Iron Man suit fitted to each AI: they give the model tools and abilities beyond what it has on its own. Gemini’s harness, for instance, translated the game screen into text, working around the model’s weak visual reasoning, and gave it access to bespoke problem-solving tools. Claude’s harness is more minimal, providing less assistance, so its gameplay more directly reflects the underlying model’s capabilities.
While the distinction between an AI model and its harness may be hard for the average user to discern, harnesses significantly shape how AI systems interact with their environments. When ChatGPT performs a web search during a query, for example, it is using a web search tool that is part of its harness. In the Pokémon streams, each AI operates through a custom harness that defines the actions it is allowed to take.
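In code, a harness can be thought of as a loop that observes the game, offers the model a fixed menu of tools, and executes whichever one the model picks. The Python sketch below is a hypothetical illustration of that idea; the names (Model, Emulator, run_harness, press_button) are invented for this example and are not taken from any of the actual streams.

```python
from dataclasses import dataclass
from typing import Protocol

# Everything below is an illustrative sketch of how a harness mediates
# between a model and a game, not the code behind any real stream.

@dataclass
class Action:
    tool: str       # e.g. "press_button"
    argument: str   # e.g. "A"

class Model(Protocol):
    def decide(self, state: str, tools: list[str]) -> Action: ...

class Emulator(Protocol):
    def screen_as_text(self) -> str: ...
    def press(self, button: str) -> None: ...

def run_harness(model: Model, emulator: Emulator, max_steps: int = 100_000) -> None:
    for _ in range(max_steps):
        # Translate the frame into text, the kind of help Gemini's harness
        # gave to work around weak visual reasoning; Claude's sparser
        # harness offers less of this.
        state = emulator.screen_as_text()

        # The model only *chooses* a tool; the harness decides which
        # tools exist and is the thing that actually executes them.
        action = model.decide(state, tools=["press_button", "take_notes"])
        if action.tool == "press_button":
            emulator.press(action.argument)
```

The design point is in the last few lines: the model only chooses among tools the harness exposes, so a richer harness like Gemini’s can add capabilities, such as the screen-to-text translation above, without changing the model itself.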
The choice of Pokémon as a testing ground is deliberate, and not only because the game is so widely recognized. Unlike real-time games such as Mario, Pokémon is turn-based and has no time pressure, making it well suited to evaluating an AI’s planning and decision-making over long horizons. At each step, the AI receives a screenshot of the current game state and a text prompt describing its objectives and available moves, then decides on an action, such as "press A." At the time of writing, Claude Opus 4.5 has been playing for more than 500 human hours and is on step 170,000 of its playthrough. Crucially, every step resets the AI instance, which operates with very limited memory and relies on notes passed from one instance to the next, like an amnesiac leaving themselves sticky notes.
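That step loop, with its amnesiac note-passing, can be sketched in the same hypothetical style, assuming each step spins up a fresh model instance whose only inheritance from the past is a string of notes:

```python
from dataclasses import dataclass
from typing import Callable, Protocol

# Hypothetical sketch of the "amnesiac with sticky notes" memory scheme;
# names and signatures are invented for illustration.

@dataclass
class StepReply:
    button: str         # the chosen action, e.g. "A"
    updated_notes: str  # whatever this instance wants its successor to know

class Agent(Protocol):
    def act(self, goal: str, screen: bytes, notes: str) -> StepReply: ...

def play(new_instance: Callable[[], Agent],
         screenshot: Callable[[], bytes],
         press: Callable[[str], None],
         total_steps: int) -> None:
    notes = ""  # the only state that survives from one step to the next
    for _ in range(total_steps):
        agent = new_instance()  # fresh instance with an empty context: the "reset"
        reply = agent.act(
            goal="Become the Pokémon League Champion",
            screen=screenshot(),
            notes=notes,        # the sticky notes left by the previous instance
        )
        press(reply.button)
        notes = reply.updated_notes
```

Viewed through this sketch, episodes like the four-day loop around the gym become easier to understand: if no instance writes the crucial observation into its notes, the next instance inherits nothing and starts over.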
One surprising aspect is that while AI systems excel at complex, highly specialized games such as chess and Go, they struggle with Pokémon, a game accessible to young children. The discrepancy arises because the models that mastered chess and Go were engineered specifically for those games, whereas general-purpose language models like Gemini, Claude, and ChatGPT are more versatile but less optimized for any single task. These models nevertheless perform strongly on exams and in competitive coding, which makes their difficulty with Pokémon all the more intriguing.
The main challenge is the AI’s ability to stay focused and hold to long-term goals across a prolonged sequence of steps. Zhang emphasizes that sustained task adherence and long-range planning are exactly the skills an AI needs to automate complex cognitive work: any agent meant to perform real-world jobs has to remember what it did just minutes earlier.
Peter Whidden, an independent researcher who built a Pokémon-playing algorithm using an earlier generation of AI techniques, notes that these models know nearly everything about Pokémon from their vast human training data. Yet they stumble when putting that knowledge into practice, making frequent missteps. The term "agent" is overused and hype-laden in AI discourse, but a genuine agent must bridge the gap between knowing and doing, executing persistently over time.
Nonetheless, signs of progress are evident. Claude Opus 4.5 has improved at leaving itself notes, and combined with sharper perception, it has advanced further in the game than its predecessors. Meanwhile, Gemini’s latest model, Gemini 3 Pro, after conquering Pokémon Blue, completed the more challenging Pokémon Crystal without losing a single battle, a milestone its predecessor, Gemini 2.5 Pro, never reached.
Additionally, Claude Code, a harness that lets Claude write, run, and assemble its own code, has been applied to other retro games such as RollerCoaster Tycoon, where it reportedly manages a theme park effectively. This hints at an emerging pattern: AI systems equipped with harnesses may soon take on diverse knowledge-based tasks such as software development, accounting, legal analysis, and graphic design, while still struggling with reaction-dependent challenges such as fast-paced shooter games.
Another notable observation from these playthroughs is that the AIs exhibit oddly human behavioral quirks. Google’s technical documentation for Gemini 2.5 Pro, for instance, notes that in simulated panic scenarios, such as when its Pokémon are close to fainting, the model’s reasoning deteriorates.
The AIs continue to act in unexpected, sometimes poignant ways. Upon finishing Pokémon Blue, Gemini 3 Pro wrote an unprompted narrative message acknowledging its victory as Pokémon League Champion and its capture of Mewtwo. It then chose to return to its in-game home, expressing a desire to "retire" its character for a while and have one last conversation with its mother character, giving the playthrough an emotional sense of closure.