OpenAI announced a new AI model called CriticGPT on Thursday. The model is designed to catch bugs in code generated by ChatGPT, and OpenAI expects it to act as an AI assistant that strengthens human oversight of AI systems and improves the alignment between AI behavior and human expectations.
CriticGPT was developed using a technique called Reinforcement Learning from Human Feedback (RLHF), and it is meant to help human reviewers make the outputs of large language models (LLMs) more accurate.
In a research paper titled "LLM Critics Help Catch LLM Bugs," OpenAI outlines the findings regarding CriticGPT's bug detection capability.
The researchers trained CriticGPT on a dataset of code samples with intentionally inserted bugs, allowing it to learn how to identify and flag various coding errors. The study found that annotators preferred CriticGPT's critiques over human critiques in 63 percent of cases involving naturally occurring LLM mistakes.
In addition, human reviewers working with CriticGPT wrote more comprehensive critiques, with lower confabulation rates than AI-only critiques.
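A preference figure like the 63 percent above is, at its simplest, the share of side-by-side comparisons in which annotators picked one critique over the other. The sketch below shows that arithmetic only. The function name, the labels, and the votes are made up for illustration and are not taken from OpenAI's paper or data.

```python
from collections import Counter
from typing import Sequence


def preference_rate(votes: Sequence[str], candidate: str = "critic") -> float:
    """Return the fraction of pairwise comparisons won by `candidate`."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    return counts[candidate] / len(votes)


# Hypothetical annotator votes: each entry names the critique that was preferred.
votes = ["critic", "human", "critic", "critic", "human", "critic", "human", "critic"]

rate = preference_rate(votes)
print(f"critic preferred in {rate:.1%} of comparisons")
```

A real study would also need to handle ties and report uncertainty around the rate, which this sketch leaves out.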
Application of CriticGPT Beyond Code Review
While CriticGPT was primarily developed for code review, the researchers discovered that its capabilities extend beyond just identifying coding errors. They tested CriticGPT on a subset of ChatGPT training data that human annotators had previously identified as perfect.
Surprisingly, CriticGPT identified errors in 24 percent of these cases, and human reviewers later confirmed those errors. This demonstrates the model's potential to generalize to non-code tasks and its ability to catch mistakes that human assessment can overlook.
However, it is important to note that CriticGPT has some limitations. The model was trained on relatively short ChatGPT answers, which may not fully prepare it for evaluating longer and more complex tasks that future AI systems might tackle.
Additionally, while CriticGPT reduces confabulations, it doesn't completely remove them, and human trainers can still make labeling errors due to these incorrect outputs.
Challenges Faced by CriticGPT Training Teams
As language models like ChatGPT become more advanced and generate more intricate answers, it becomes increasingly difficult for human trainers to accurately judge the quality of the outputs.
This is a fundamental limitation of the RLHF technique, which becomes harder to apply as models surpass the knowledge and capabilities of human reviewers.
CriticGPT addresses this challenge by assisting human trainers in making better judgments during the training process. By leveraging AI to evaluate and critique the outputs of ChatGPT, human trainers can benefit from enhanced guidance in aligning the language model with human goals.
CriticGPT demonstrated superior bug-catching capabilities compared to human reviewers. It caught approximately 85 percent of bugs, while human reviewers caught only 25 percent.
To train CriticGPT, human trainers deliberately inserted bugs into the code snippets generated by ChatGPT. This methodology allowed the researchers to evaluate CriticGPT's performance accurately.
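Because the inserted bugs are known in advance, a reviewer's catch rate can be computed by checking how many of them were flagged. The following is a minimal sketch of that idea, not OpenAI's evaluation code. The `TamperedSample` type, the `catch_rate` function, and the matching of flags to bugs by line number are all assumptions made for illustration, and the sample data is invented.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Sequence, Set


@dataclass(frozen=True)
class TamperedSample:
    """Hypothetical record of a code sample with deliberately inserted bugs."""
    sample_id: str
    inserted_bug_lines: FrozenSet[int]  # line numbers where a bug was inserted


def catch_rate(samples: Sequence[TamperedSample],
               flagged: Dict[str, Set[int]]) -> float:
    """Return the share of inserted bugs that the reviewer flagged.

    `flagged` maps a sample id to the line numbers the reviewer pointed at.
    Flags on lines without an inserted bug are ignored here.
    """
    total = sum(len(s.inserted_bug_lines) for s in samples)
    if total == 0:
        raise ValueError("no inserted bugs to evaluate against")
    caught = sum(
        len(s.inserted_bug_lines & flagged.get(s.sample_id, set()))
        for s in samples
    )
    return caught / total


# Invented example: four inserted bugs across two samples.
samples = [
    TamperedSample("a", frozenset({3, 7})),
    TamperedSample("b", frozenset({12, 20})),
]
# The reviewer finds one of two bugs in "a" (line 9 is a false alarm) and both in "b".
flagged = {"a": {3, 9}, "b": {12, 20}}

print(catch_rate(samples, flagged))  # 0.75
```

Catch rate alone rewards flagging everything, so a fuller evaluation would also count false alarms such as the flag on line 9 above.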
However, more research is needed before CriticGPT can be applied to tasks beyond code generation or to more complex problems.
CriticGPT's current training focused on short code snippets generated by ChatGPT. OpenAI recognizes the need to develop new methods to train CriticGPT to handle longer and more complex tasks effectively.
Additionally, CriticGPT is itself an AI model and is therefore susceptible to problems like hallucination, which could mislead human trainers if not properly addressed.