Whether you are using ChatGPT (GPT-5), DeepSeek, or Claude, you might have noticed a subtle decline in intelligence during extended sessions. Recent scientific research confirms this intuition: even the most advanced AI models suffer significant performance degradation in long conversations.
The Research: Longer Chats, Lower Accuracy
Researcher Philippe Laban and his team recently conducted in-depth stress tests on the latest generation of Large Language Models (LLMs). The study covered six key domains, including coding, database querying, math, and summarization. The results were startling: accuracy dropped by up to 33% when task-relevant information was spread across multiple turns (sharding) rather than given in a single prompt.
Study shows a clear performance dip as conversation turns increase.
Why Does This Happen?
Despite modern models having massive context windows, “remembering” data isn’t the same as “understanding” it in context. As dialogue grows, background noise accumulates. When information is sharded across messages, the model’s logical reasoning is disrupted. Interestingly, technical tweaks like lowering temperature values do not solve this underlying architectural flaw.
3 Practical Tips to Keep Your AI Smart
Start Fresh Often: Avoid keeping a single chat window open all day. Initiate a “New Chat” whenever a topic shifts or a task enters a new phase.
The Summary Inheritance Method: Before starting a new chat, ask the AI to summarize all key conclusions from the current session. Paste this summary as the first prompt in your new window.
Prioritize Single-Prompt Input: Try to provide all necessary background information in one go rather than drip-feeding details across five separate messages.
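For readers who script their prompts, the last two tips can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the wording of the summary request, and the prompt layout are all hypothetical, and no particular chat API is called. The code just builds the text you would paste into a new chat.

```python
from typing import List

# Hypothetical wording for the "Summary Inheritance" request; adjust to taste.
SUMMARY_REQUEST = (
    "Summarize all key conclusions, decisions, and open questions "
    "from this conversation as a short bulleted list."
)


def consolidate_shards(task: str, shards: List[str]) -> str:
    """Merge details that would otherwise be drip-fed across messages
    into a single prompt (the single-prompt tip)."""
    # Drop empty fragments and surrounding whitespace.
    cleaned = [s.strip() for s in shards if s.strip()]
    details = "\n".join(f"- {s}" for s in cleaned)
    return f"Task: {task.strip()}\n\nRequirements:\n{details}"


def build_handoff_prompt(summary: str, next_task: str) -> str:
    """Build the first prompt of a fresh chat from the summary of the
    previous session (the summary inheritance tip)."""
    return (
        "Context carried over from a previous session:\n"
        f"{summary.strip()}\n\n"
        f"New task: {next_task.strip()}"
    )


if __name__ == "__main__":
    print(consolidate_shards(
        "Write a CSV parser",
        ["use Python", "no external libraries", "handle quoted fields"],
    ))
    print()
    print(build_handoff_prompt("- Chose SQLite for storage", "Add an index"))
```

In practice you would send SUMMARY_REQUEST as the last message of the old chat, then paste the model's answer into build_handoff_prompt to open the new one.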
Conclusion
AI performance is dynamic, not static. Understanding these limitations is key to maintaining high productivity in 2026. Keep your conversation windows “fresh” to ensure the best results.
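Surprisingly, OpenAI Whisper Large v3, once the industry benchmark, ranked only mid-table in this evaluation (4.2%), a clear gap behind first place. This suggests that for commercial applications that demand the highest accuracy, Whisper may no longer be the first choice.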
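3. A Pleasant Surprise: Google Gemini's All-Round Performance
Google's showing is equally noteworthy. Gemini 3 Pro ranked second with an error rate of 2.9%. Most importantly, Google did not train Gemini extensively for transcription; its strong performance comes mainly from its general multimodal capabilities. This suggests that top-tier large language models (LLMs) may, through their general reasoning and comprehension abilities, come to outclass purpose-built tools in specialized domains.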
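4. The AgentTalk Test: The Future of AI Assistants
For interactive scenarios such as voice assistants, Artificial Analysis also ran a dedicated AA-AgentTalk test. The results show that ElevenLabs Scribe v2 and Gemini 3 Pro still hold a clear lead, with error rates of just 1.6% and 1.7%, respectively. This gives developers building AI smart-home and in-car systems highly useful reference data.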