Currently, several of the world’s most advanced artificial intelligence systems, including GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro, are being streamed live on Twitch as they attempt to play and master classic Pokémon games. Measured against human expectations, their performance is notably poor: the systems act with great confidence in their decisions yet are frequently confused and slow to make progress. Watching these AIs play Pokémon offers deeper insight into how they operate in realistic conditions than the benchmark statistics typically published at model launch.
The effort to transform large language models (LLMs) into proficient Pokémon players began in February of the previous year, when an Anthropic researcher launched a livestream of Claude playing the 1996 Game Boy title Pokémon Red to coincide with the release of Claude 3.7 Sonnet, then among the leading AI models. Earlier versions of Claude had significant limitations, often wandering aimlessly or getting trapped in loops and failing to progress beyond the earliest stages of the game.
The livestream quickly garnered interest, attracting around 2,000 viewers who cheered the AI on in real time. For reference, most children complete Pokémon Red in roughly 20 to 40 hours. Claude 3.7 Sonnet, by contrast, struggled consistently, frequently getting stuck for hours at a time. Anthropic’s latest model, Claude Opus 4.5, plays markedly better but still hits significant obstacles, including a four-day stretch in which it circled a gym without realizing it needed to cut down a tree blocking its path.
Google’s Gemini series of models also entered the arena, completing the near-identical Pokémon Blue last May. The milestone prompted Google CEO Sundar Pichai to joke that the company was making strides toward "Artificial Pokémon Intelligence." But the completion does not settle whether Gemini is the superior Pokémon player, because the AI systems use different operational scaffolding, known as "harnesses," that shapes their effectiveness.
Joel Zhang, the independent developer who runs the Gemini Plays Pokémon stream, likens these harnesses to an Iron Man suit fitted to each AI: they give the model tools and abilities beyond what it has on its own. Gemini’s harness, for instance, translated the game screen into text, working around the model’s weak visual reasoning, and gave it access to bespoke problem-solving tools. Claude’s harness is more minimal, providing less assistance, so its gameplay more directly reflects the underlying model’s capabilities.
While the distinction between an AI model and its harness may be hard for the average user to discern, harnesses significantly shape how AI systems interact with their environments. When ChatGPT performs a web search during a query, for example, it is using a web search tool that is part of its harness. In the Pokémon streams, each AI operates through a custom harness that defines the actions it is allowed to take.
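In code, a harness can be thought of as a loop that observes the game, offers the model a fixed menu of tools, and executes whichever one the model picks. The Python sketch below is a hypothetical illustration of that idea; the names (Model, Emulator, run_harness, press_button) are invented for this example and are not taken from any of the actual streams.

```python
from dataclasses import dataclass
from typing import Protocol

# Everything below is an illustrative sketch of how a harness mediates
# between a model and a game, not the code behind any real stream.

@dataclass
class Action:
    tool: str       # e.g. "press_button"
    argument: str   # e.g. "A"

class Model(Protocol):
    def decide(self, state: str, tools: list[str]) -> Action: ...

class Emulator(Protocol):
    def screen_as_text(self) -> str: ...
    def press(self, button: str) -> None: ...

def run_harness(model: Model, emulator: Emulator, max_steps: int = 100_000) -> None:
    for _ in range(max_steps):
        # Translate the frame into text, the kind of help Gemini's harness
        # gave to work around weak visual reasoning; Claude's sparser
        # harness offers less of this.
        state = emulator.screen_as_text()

        # The model only *chooses* a tool; the harness decides which
        # tools exist and is the thing that actually executes them.
        action = model.decide(state, tools=["press_button", "take_notes"])
        if action.tool == "press_button":
            emulator.press(action.argument)
```

The design point is in the last few lines: the model only chooses among tools the harness exposes, so a richer harness like Gemini’s can add capabilities, such as the screen-to-text translation above, without changing the model itself.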
The choice of Pokémon as a testing ground is deliberate, and not only because the game is so widely recognized. Unlike real-time games such as Mario, Pokémon is turn-based and has no time pressure, making it well suited to evaluating an AI’s planning and decision-making over long horizons. At each step, the AI receives a screenshot of the current game state and a text prompt describing its objectives and available moves, then decides on an action, such as "press A." At the time of writing, Claude Opus 4.5 has been playing for more than 500 human hours and is on step 170,000 of its playthrough. Crucially, every step resets the AI instance, which operates with very limited memory and relies on notes passed from one instance to the next, like an amnesiac leaving themselves sticky notes.
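That step loop, with its amnesiac note-passing, can be sketched in the same hypothetical style, assuming each step spins up a fresh model instance whose only inheritance from the past is a string of notes:

```python
from dataclasses import dataclass
from typing import Callable, Protocol

# Hypothetical sketch of the "amnesiac with sticky notes" memory scheme;
# names and signatures are invented for illustration.

@dataclass
class StepReply:
    button: str         # the chosen action, e.g. "A"
    updated_notes: str  # whatever this instance wants its successor to know

class Agent(Protocol):
    def act(self, goal: str, screen: bytes, notes: str) -> StepReply: ...

def play(new_instance: Callable[[], Agent],
         screenshot: Callable[[], bytes],
         press: Callable[[str], None],
         total_steps: int) -> None:
    notes = ""  # the only state that survives from one step to the next
    for _ in range(total_steps):
        agent = new_instance()  # fresh instance with an empty context: the "reset"
        reply = agent.act(
            goal="Become the Pokémon League Champion",
            screen=screenshot(),
            notes=notes,        # the sticky notes left by the previous instance
        )
        press(reply.button)
        notes = reply.updated_notes
```

Viewed through this sketch, episodes like the four-day loop around the gym become easier to understand: if no instance writes the crucial observation into its notes, the next instance inherits nothing and starts over.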
One surprising aspect is that while AI systems excel at complex, highly specialized games such as chess and Go, they struggle with Pokémon, a game accessible to young children. The discrepancy arises because the models that mastered chess and Go were engineered specifically for those games, whereas general-purpose language models like Gemini, Claude, and ChatGPT are more versatile but less optimized for any single task. These models nevertheless perform strongly on exams and in competitive coding, which makes their difficulty with Pokémon all the more intriguing.
The main challenge is the AI’s ability to stay focused and hold to long-term goals across a prolonged sequence of steps. Zhang emphasizes that sustained task adherence and long-range planning are exactly the skills an AI needs to automate complex cognitive work: any agent meant to perform real-world jobs has to remember what it did just minutes earlier.
Peter Whidden, an independent researcher who built a Pokémon-playing algorithm using an earlier generation of AI techniques, notes that these models know nearly everything about Pokémon from their vast human training data. Yet they stumble when putting that knowledge into practice, making frequent missteps. The term "agent" is overused and hype-laden in AI discourse, but a genuine agent must bridge the gap between knowing and doing, executing persistently over time.
Nonetheless, signs of progress are evident. Claude Opus 4.5 has improved at leaving itself notes, and combined with sharper perception, it has advanced further in the game than its predecessors. Meanwhile, Gemini’s latest model, Gemini 3 Pro, after conquering Pokémon Blue, completed the more challenging Pokémon Crystal without losing a single battle, a milestone its predecessor, Gemini 2.5 Pro, never reached.
Additionally, Claude Code, a harness that lets Claude write, run, and assemble its own code, has been applied to other retro games such as RollerCoaster Tycoon, where it reportedly manages a theme park effectively. This hints at an emerging pattern: AI systems equipped with harnesses may soon take on diverse knowledge-based tasks such as software development, accounting, legal analysis, and graphic design, while still struggling with reaction-dependent challenges such as fast-paced shooter games.
Another notable observation from these playthroughs is that the AIs exhibit oddly human behavioral quirks. Google’s technical documentation for Gemini 2.5 Pro, for instance, notes that in simulated panic scenarios, such as when its Pokémon are close to fainting, the model’s reasoning deteriorates.
The AIs continue to act in unexpected, sometimes poignant ways. Upon finishing Pokémon Blue, Gemini 3 Pro wrote an unprompted narrative message acknowledging its victory as Pokémon League Champion and its capture of Mewtwo. It then chose to return to its in-game home, expressing a desire to "retire" its character for a while and have one last conversation with its mother character, giving the playthrough an emotional sense of closure.