Claude 3.0 is impressive, but is it worth it? It depends on the job!
Just today, Anthropic announced the launch of the Claude 3.0 family of three models: the largest is named Opus, the mid-sized one Sonnet, and the smallest Haiku (a clever naming scheme that plays on the length of literary works: an opus is a major work, a sonnet is medium-sized, and a haiku is, of course, a short Japanese poem).
The model card for Claude indicates that Opus outperforms the current champion, GPT-4, in nearly every metric.
Given that SIU now has API access to almost every major AI model on the market, it falls to us to evaluate just how capable each one is.
Measuring the intelligence of large models is not easy, because on everyday questions they are nearly indistinguishable. We need clearly defined tasks and boundaries before we can assess their quality with any accuracy.
Just the other day, I enjoyed listening to Professor Geoffrey Hinton deliver the Romanes Lecture at the University of Oxford. Hinton’s ideas often conflict with those of Yann LeCun, representing different schools of thought, but both have profoundly influenced the AI field, along with Yoshua Bengio; the three are recognized as the godfathers of AI (not counting the newer wave, such as Demis Hassabis of Google DeepMind and Ilya Sutskever of OpenAI).
From that lecture, I got the idea of “nested matrices”: a matrix whose entries are themselves matrices rather than vectors, and whose member matrices can in turn nest further inside one another.
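To make the idea concrete, here is a minimal sketch of how such a structure might look in Python; this is my own illustration written for this article, not code produced by any of the models tested. It represents a nested matrix as a NumPy object array whose entries are matrices, with a product that recurses into the blocks:

```python
import numpy as np

def nested_matmul(A, B):
    """Multiply two nested matrices, recursing into matrix-valued entries."""
    if A.dtype != object:              # base case: ordinary numeric blocks
        return A @ B
    n, k = A.shape
    m = B.shape[1]
    C = np.empty((n, m), dtype=object)
    for i in range(n):
        for j in range(m):
            acc = nested_matmul(A[i, 0], B[0, j])
            for p in range(1, k):
                acc = acc + nested_matmul(A[i, p], B[p, j])
            C[i, j] = acc
    return C

# A 2x2 nested matrix whose entries are themselves 2x2 matrices.
A = np.empty((2, 2), dtype=object)
A[0, 0], A[0, 1] = 1.0 * np.eye(2), 2.0 * np.eye(2)
A[1, 0], A[1, 1] = 3.0 * np.eye(2), 4.0 * np.eye(2)

C = nested_matmul(A, A)                # block-wise product; C[0, 0] == 7 * I
```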
I drafted the idea in LaTeX for future use, thinking it could be useful for some task, though none came to mind at the time; with other work in the pipeline, it was shelved. I never expected to dig it out for AI testing.
I set the same task for three contenders: 1) GPT-4 (the incumbent standard, used through ChatGPT), 2) the newcomer Claude 3 Opus (used through Anthropic’s Workbench), and 3) Google’s Gemini, in both its Gemini 1.0 Ultra (used through Gemini Advanced) and latest Gemini 1.5 Pro (used through AI Studio) versions. The task: write Python code implementing an RNN using the nested-matrices concept.
Since I only came up with this idea yesterday, it has never appeared anywhere before; there is no training data covering it, which makes this a test of pure logical capability rather than recall.
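To pin down what I was asking for, the update rule I had in mind looks roughly like the sketch below. This is my own framing of the task, reusing the hypothetical nested_matmul helper from above; the other helpers are equally hypothetical, not any model’s output:

```python
import numpy as np

def nested_map(f, M):
    """Apply f block-wise, recursing into matrix-valued entries."""
    if M.dtype != object:
        return f(M)
    out = np.empty(M.shape, dtype=object)
    for i in range(M.shape[0]):
        for j in range(M.shape[1]):
            out[i, j] = nested_map(f, M[i, j])
    return out

def nested_add(A, B):
    """Block-wise addition of two nested matrices of the same shape."""
    if A.dtype != object:
        return A + B
    out = np.empty(A.shape, dtype=object)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            out[i, j] = nested_add(A[i, j], B[i, j])
    return out

def rnn_step(W_h, W_x, h, x):
    """h_t = tanh(W_h @ h_{t-1} + W_x @ x_t), every product a nested one."""
    pre = nested_add(nested_matmul(W_h, h), nested_matmul(W_x, x))
    return nested_map(np.tanh, pre)
```

A solution that genuinely honors the concept would keep the block structure intact through every step, which is exactly where the models started cutting corners.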
The results were as follows:
- GPT-4’s first implementation failed with a compile error. When I asked it to fix the code, it cheated: it abandoned the nested-matrices concept and fell back on a standard RNN implementation (which was bound to pass).
- Claude 3.0 Opus’s code ran successfully on the first try, but with a slight cheat of its own: it collapsed the nested-matrix entries into vectors, which made the task much easier to pass (see the sketch after this list).
- Gemini 1.0 Ultra struggled significantly, responding to corrections without understanding the task, while Gemini 1.5 Pro refused outright, arguing that implementing an RNN with nested matrices is not a standard approach and might be inefficient (which is true; I was testing their limits). So neither passed. However, when shown Claude’s code, they critiqued it and suggested improvements well, showing a good understanding of the concept while still declining the more complex implementation.
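For context, the “collapse” shortcut looks roughly like the following; this is my reconstruction of the pattern, not Claude’s actual code. Once the nested state is flattened into one long vector, the update is an entirely ordinary RNN step, and the block structure plays no role in the computation:

```python
import numpy as np

def flatten_nested(M):
    """Collapse a nested matrix into one flat vector, depth first."""
    if M.dtype != object:
        return M.ravel()
    return np.concatenate([flatten_nested(M[i, j])
                           for i in range(M.shape[0])
                           for j in range(M.shape[1])])

rng = np.random.default_rng(0)

# Nested hidden state: a 2x2 matrix of 2x2 blocks (16 numbers in total).
h = np.empty((2, 2), dtype=object)
for i in range(2):
    for j in range(2):
        h[i, j] = rng.standard_normal((2, 2))

h_vec = flatten_nested(h)              # 16-dimensional flat vector
W = rng.standard_normal((16, 16))      # ordinary dense recurrent weights
x = rng.standard_normal(16)            # input, already projected to 16 dims
h_next = np.tanh(W @ h_vec + x)        # a perfectly standard RNN step
```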
In the second round of experiments, I used GPT-4 as the base and incorporated feedback from Gemini 1.5 Pro, aiming for a full RNN with an N-dimensional topology rather than the usual 2-D. It took several iterations to get a first version, which still had errors; Claude managed to fix those (after two attempts, with the error messages supplied).
However, the resulting code wasn’t ideal: Claude still collapsed the nested matrices into vectors in order to use TensorFlow functions, somewhat defeating the original purpose of the nested-matrix design. Still, it was enough to draw a preliminary conclusion from the experiment.
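In spirit, that compromise looks like the sketch below: my reconstruction under the assumption of a stock tf.keras.layers.SimpleRNNCell, with illustrative shapes and variable names rather than the actual notebook code. The N-dimensional state is reshaped to a flat vector so the standard cell can consume it, then reshaped back:

```python
import tensorflow as tf

# An N-dimensional hidden state (here 3-D, shape 2x2x4) is flattened so a
# standard SimpleRNNCell can process it, then reshaped back afterwards.
state_shape = (2, 2, 4)
units = 2 * 2 * 4                               # 16 units once flattened

cell = tf.keras.layers.SimpleRNNCell(units)
x_t = tf.random.normal((1, 8))                  # one batch element, 8 features
h_nd = tf.random.normal((1, *state_shape))      # N-D hidden state

h_flat = tf.reshape(h_nd, (1, units))           # collapse the topology...
out, [h_flat_next] = cell(x_t, [h_flat])        # ...run a standard RNN step...
h_nd_next = tf.reshape(h_flat_next, (1, *state_shape))  # ...restore the shape
```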
In summary, Claude 3.0 Opus proved somewhat superior to GPT-4, but not overwhelmingly so. GPT-4 held its own thanks to better user interaction and understanding of instructions, which owes more to ChatGPT’s interactive fine-tuning than to raw model capability.
Anthropic has priced Opus quite high, at the same level as OpenAI’s offering, signaling that it aims to compete with the upcoming GPT-5. Given that cost, careful cost-benefit consideration is needed to decide whether it is worth using over GPT-4 for tasks like the one demonstrated here.
All of this puts pressure on OpenAI, first from Gemini and now from the three Claude models, and raises expectations for GPT-5; if it fails to deliver, OpenAI could face significant repercussions.
Note: You can access the Colab notebook with the final code here.