Why Benchmarks Alone Can’t Define AI Progress: A Critical Look at Grok 3
Benchmarks have long been a staple of AI evaluation, but any serious AI scientist knows they can be gamed. I’ve written about this in detail before, and even LMsys had to adjust its blind-test format, masking Grok 3 under a different label rather than merely hiding the brand, to mitigate brand bias. With highly capable models, especially those at the GPT-4 level or those relying on test-time compute, the problem isn’t just raw performance metrics. There are two fundamental challenges that no benchmark can fully capture.
The first major issue is the inability of current models to perform multi-layered strategic reasoning. Break any complex problem into distinct layers (scan, optimize & plan, implement) and a single mistake at one stage cascades catastrophically into the final output, because each stage consumes the output of the one before it, so errors compound rather than cancel. No amount of test-time compute will fix this, because the issue is embedded in how these models process information sequentially.
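To make the cascading-error point concrete, here is a minimal back-of-the-envelope sketch. The stage names mirror the layers above, but the per-stage accuracies are illustrative assumptions of mine, not measurements of any particular model.

```python
# Minimal sketch: how per-stage accuracy compounds across sequential layers.
# The accuracies below are illustrative assumptions, not measured values.

stage_accuracy = {
    "scan": 0.95,
    "optimize_and_plan": 0.90,
    "implement": 0.92,
}

end_to_end = 1.0
for stage, acc in stage_accuracy.items():
    # Each stage consumes the previous stage's output, so under an
    # independence assumption the accuracies multiply.
    end_to_end *= acc
    print(f"after {stage:<18} cumulative accuracy = {end_to_end:.3f}")

# Respectable per-stage accuracies still yield roughly 0.79 end to end:
# the cascading-error effect described above.
```

Even generous per-stage numbers leave a noticeably lower end-to-end success rate, which is why more compute at inference time does not, by itself, repair a pipeline whose stages feed each other sequentially.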
The second issue concerns comprehending new knowledge. The knowledge cutoff of large models typically lags the present by around 6–8 months. Even when they are fine-tuned with newer information, there is evidence that contradictions arise between the newly introduced facts and the foundational knowledge established during pre-training. The core problem is that these models don’t operate on fixed logical principles; their “logic” is dictated by the weights that emerge during pre-training. Some argue that retrieval-augmented generation (RAG) can address this, but RAG isn’t comprehension: it is still pattern-matching against an external database.
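A minimal sketch of the retrieval step makes the point structurally. I use a toy bag-of-words similarity in place of a real embedding model, and a tiny hypothetical corpus; the key observation is that the “augmentation” is just string concatenation into the prompt, with nothing updating or reconciling the model’s internal weights.

```python
# Toy RAG retrieval sketch: similarity search plus prompt concatenation.
# The corpus and similarity function are stand-ins for illustration only.

from collections import Counter
import math

def similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts (stand-in for an embedding model)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
           math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

documents = [  # hypothetical external database
    "Grok 3 was released by xAI in February 2025.",
    "GPT-4 was released by OpenAI in March 2023.",
    "Transformers rely on self-attention over token sequences.",
]

def retrieve_and_prompt(question: str, k: int = 1) -> str:
    top = sorted(documents, key=lambda d: similarity(question, d), reverse=True)[:k]
    # The "augmentation" is literally pasting retrieved text into the prompt.
    return "Context:\n" + "\n".join(top) + f"\n\nQuestion: {question}"

print(retrieve_and_prompt("When was Grok 3 released?"))
```

Retrieval picks whatever looks lexically or semantically closest and hands it to the generator; whether the model reconciles that context with contradictory pre-trained beliefs is exactly the part RAG does not solve.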
To assess Grok 3’s real-world capabilities, I tested it with four exercises (reference checks for the first three are sketched in code after the list):
- Basic text recognition: How many ‘r’s are in “strawberry”?
- Numerical comparison: Between 8.11 and 8.9, which is larger?
- Morse code decoding: A structured but non-trivial translation task.
- Advanced mathematical reasoning: Writing an RNN that operates correctly on a nested matrix, where each element is itself a matrix rather than a simple number.
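For reproducibility, here is a minimal reference script for the first three exercises. The Morse test string is my own illustrative choice, not the exact prompt I gave Grok 3, and the lookup table covers only the letters it needs.

```python
# 1. Character counting: how many 'r's in "strawberry"?
print("strawberry".count("r"))            # -> 3

# 2. Numerical comparison: 8.11 vs 8.9 (as numbers, not version strings).
print(max(8.11, 8.9))                     # -> 8.9

# 3. Morse decoding with a partial lookup table (test string chosen by me).
MORSE = {"-..": "D", ".": "E", "-.-.": "C", "---": "O"}
message = "-.. . -.-. --- -.. ."          # hypothetical input: "DECODE"
print("".join(MORSE[c] for c in message.split()))   # -> DECODE
```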
The first three tests were straightforward, but the last one posed a real challenge. Grok 3 managed to generate an RNN, but it cheated by flattening the nested matrix into an ordinary one. That shortcut was still better than Gemini’s approach, while GPT-4, despite trying to stay faithful to the original nested concept, performed even worse. To be fair, no AI today can truly solve this problem. Addressing it requires rethinking computation beyond NumPy’s conventional matrix handling and developing new methods for manipulating nested matrices. Moreover, no publicly available training corpus exists for this kind of problem, which makes it fundamentally difficult for any current model to learn.
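To show what “flattening” means here, the sketch below contrasts a step that keeps the block structure with the shortcut that collapses it. This is my own illustration of the setup, not Grok 3’s actual output and not a solution to the exercise.

```python
import numpy as np

# Illustration only: a 2x2 outer grid whose entries are themselves 2x2 matrices.
rng = np.random.default_rng(0)
nested_state = rng.normal(size=(2, 2, 2, 2))

inner_W = rng.normal(size=(2, 2)) * 0.1   # transform applied to each inner block
flat_W = rng.normal(size=(4, 4)) * 0.1    # transform used by the flattened shortcut

def nested_step(state):
    """One recurrent-style update that preserves nesting: every inner matrix
    is transformed as a matrix, so the block structure survives the step."""
    out = np.empty_like(state)
    for i in range(state.shape[0]):
        for j in range(state.shape[1]):
            out[i, j] = np.tanh(inner_W @ state[i, j])
    return out

def flattened_step(state):
    """The shortcut: collapse the nested state into a plain 4x4 array and apply
    one big transform. It runs, but the inner matrices stop being objects."""
    return np.tanh(flat_W @ state.reshape(4, 4))

print(nested_step(nested_state).shape)     # (2, 2, 2, 2): nesting preserved
print(flattened_step(nested_state).shape)  # (4, 4): nesting discarded
```

The flattened version produces plausible-looking output, which is precisely why it reads as a successful answer while quietly abandoning the structure the exercise was about.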
Final Verdict
Is Grok 3 impressive? Yes, but it’s not a revolutionary leap. It represents an incremental improvement in the landscape of large language models, but the fundamental limitations of contemporary AI remain. While Grok 3 may not redefine the field, its release does serve one crucial purpose: pushing OpenAI to accelerate the development of GPT-4.5 and GPT-5. In that sense, competition remains the real driving force behind AI progress.