KEYWORDS: ChatGPT, Chain-of-Thought, fine-tuning, GPT-4
Just a few days ago, OpenAI co-founder Greg Brockman demonstrated ChatGPT’s new plug-in capabilities live at TED2023.
According to TED, Brockman showed how ChatGPT could help you create a recipe for dinner, generate a picture of the final look, create the corresponding grocery list in Instacart, and post it to your Twitter account. All of this happened within the chatbot.[1]
Since the release of ChatGPT last December, Large Language Models (LLMs) have iterated at a remarkable pace. OpenAI released GPT-4 just three and a half months later, upgrading it to a multimodal model, and it can now even perform tasks autonomously based on simple queries.
It seems that LLMs like GPT-4 are getting smarter and more human-like.
Behind this lies an important change: the phenomenon of “emergence” in large language models. Abilities such as Chain-of-Thought (CoT) reasoning appear suddenly once the number of parameters and the volume of training data exceed a certain threshold, making the AI seem far more intelligent all at once.
1. Reasoning ability makes ChatGPT smart
Let’s start with an interesting example.
In the picture above, a user showed a world map made of chicken nuggets with a short description and asked GPT-4 to explain it.
GPT-4 not only recognizes and understands the image but also deduces that it’s a joke and gives its own deduction and understanding based on the text and the picture.
In this case, ChatGPT, based on GPT-4, can surprisingly understand the human sense of humor.
In addition, GPT-4 also has impressive comprehension and reasoning skills.
The following examples show the difference in competence and reasoning ability between GPT-4 and ChatGPT-3.5.
Please note that the following tests were released before ChatGPT-4, so we use ChatGPT-3.5 to refer to the earlier version and ChatGPT-4 to refer to the current one.
As you can see in the images above, ChatGPT-3.5 immediately gave up on answering the questions, while GPT-4 not only gave the correct answer but also showed its ability to reason and analyze. The yellow part highlights key successful reasoning steps.
So, the question is, how come ChatGPT-4 seems much smarter?
What’s particularly noteworthy is that as LLMs become large enough, they exhibit “emergent abilities” that were previously unattainable with smaller models.
One of the most compelling of these emergent abilities is a capacity for human-like, step-by-step reasoning, known as Chain-of-Thought (CoT). This means that LLMs can break down complex tasks into smaller, more manageable sub-tasks and lay out a step-by-step reasoning process that mimics human thinking.
This remarkable feature of LLMs allows the model to reason through a series of examples, building on the information provided at each step to arrive at a final answer. With the ability to reason through complex problems, LLMs have the potential to revolutionize many fields, including education, healthcare, and scientific research.
This approach allows LLMs to perform tasks that were previously considered too challenging for artificial intelligence systems.
The examples below show the results both with and without the CoT prompt in mathematics.
In this math question, with only a few few-shot examples in the prompt, the Chain-of-Thought method provides a more natural way of prompting LLMs to reason through complex problems.
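To make the idea concrete, here is a minimal sketch of how a few-shot CoT prompt can be assembled. The `Exemplar` class, the `build_few_shot_cot_prompt` function, and the sample questions are our own illustrations (assumptions, not taken from the tests shown in this article). The key point is that each worked example spells out its reasoning before the answer, and the new question is left open for the model to continue in the same style.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """One worked example: a question, its written-out reasoning, and the answer."""
    question: str
    reasoning: str
    answer: str


def build_few_shot_cot_prompt(exemplars: list[Exemplar], question: str) -> str:
    """Join the worked examples, then append the new question with an open answer slot."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in exemplars
    ]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Illustrative exemplar written for this sketch (hypothetical data).
exemplars = [
    Exemplar(
        question="Tom has 3 boxes with 4 pencils each. How many pencils does he have?",
        reasoning="Each box holds 4 pencils and there are 3 boxes, so 3 * 4 = 12.",
        answer="12",
    )
]

prompt = build_few_shot_cot_prompt(
    exemplars,
    "A shelf holds 5 rows of 6 books. How many books are on the shelf?",
)
print(prompt)
```

The resulting string would then be sent to whichever LLM you are using; no model is called in this sketch.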
What if we don’t give hint prompts, but just simply give a prompt like “Let’s think step by step” to see how the model reacts?
With such a “magic” phrase, GPT-4 not only gave the correct answer but also explained the reasoning that led to its conclusion.
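The zero-shot variant needs no worked examples at all: the trigger phrase is simply appended after the question. The sketch below shows one way to build such a prompt and to pull a numeric answer out of a reply. The function names and the sample reply are assumptions made for illustration, and taking the last number in the reply is only a common heuristic, not a guaranteed way to find the answer.

```python
import re
from typing import Optional

TRIGGER = "Let's think step by step."


def build_zero_shot_cot_prompt(question: str) -> str:
    """Append the trigger phrase so the model starts by writing out its reasoning."""
    return f"Q: {question}\nA: {TRIGGER}"


def extract_final_number(completion: str) -> Optional[str]:
    """Return the last number in the reply, or None if the reply has no number."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None


prompt = build_zero_shot_cot_prompt(
    "A shelf holds 5 rows of 6 books. How many books are on the shelf?"
)

# Hand-written stand-in for a model reply; no model is called here.
sample_completion = "There are 5 rows. Each row has 6 books. 5 * 6 = 30. So the answer is 30."
print(prompt)
print(extract_final_number(sample_completion))
```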
This newfound ability has generated considerable excitement in the research community. Reasoning is often considered a key feature of human intelligence, so the fact that LLMs can now perform similar tasks represents a major breakthrough in AI research.
2. How does CoT improve AI’s reasoning ability?
To see how CoT improves the reasoning ability of language models, consider one striking example: ScienceQA, a new benchmark that aims to address the limitations of existing scientific question-answering datasets.
Specifically, the authors design language models that learn to generate lectures and explanations as a chain of thought (CoT) to mimic the multi-hop reasoning process in answering ScienceQA questions.
The ScienceQA benchmark consists of approximately 21k multiple-choice questions on various scientific topics, with answers annotated with corresponding lectures and explanations.
The authors use a CoT approach to prompt the language models. They formulate the task to output a natural explanation alongside the predicted answer.
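As a rough illustration of what "an explanation alongside the predicted answer" can look like in practice, the sketch below formats a multiple-choice question and parses a reply that gives the answer letter followed by an explanation. The field labels, the `BECAUSE:` separator, and the sample question are our own assumptions for this sketch and should not be read as the exact format used by the ScienceQA authors.

```python
import re
from typing import List, Tuple


def format_question(question: str, context: str, options: List[str]) -> str:
    """Render one multiple-choice item (supports up to five options, A to E)."""
    letters = "ABCDE"
    option_text = " ".join(f"({letters[i]}) {opt}" for i, opt in enumerate(options))
    return (
        f"Question: {question}\n"
        f"Context: {context}\n"
        f"Options: {option_text}\n"
        "Answer:"
    )


def parse_completion(completion: str) -> Tuple[str, str]:
    """Split a reply into (answer letter, explanation); empty strings if missing."""
    match = re.search(r"The answer is \(?([A-E])\)?", completion)
    letter = match.group(1) if match else ""
    _, _, explanation = completion.partition("BECAUSE:")
    return letter, explanation.strip()


prompt = format_question(
    "Which of these is a solid?",
    "Solids keep their own shape.",
    ["water in a glass", "a rock", "air in a balloon"],
)

# Hand-written stand-in for a model reply; no model is called here.
reply = "The answer is B. BECAUSE: A rock keeps its own shape, which is a property of solids."
letter, explanation = parse_completion(reply)
print(prompt)
print(letter, "|", explanation)
```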
The results are remarkable.
The team finds that CoT can improve the question-answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA. They also explore the upper bound for models to leverage explanations by feeding them into the input; they observe that it improves the few-shot performance of GPT-3 by 18.96%.
There are several methods of CoT, such as in-context learning and fine-tuning the language model.
In-context learning involves injecting few-shot samples into the prompt, allowing the model to follow the in-context examples and produce outputs in the same style, without any update to its parameters.
Fine-tuning the language model to facilitate the reasoning process involves leveraging CoT data to update the parameters of the language model. By updating the parameters, the model can better reason about complex problems and provide accurate solutions.
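For fine-tuning, the CoT data first has to be turned into training pairs in which the target contains the reasoning as well as the answer. The sketch below covers only that data-preparation step; the record schema (`question`, `reasoning`, `answer`) and the `input`/`target` keys are hypothetical, and the actual training loop depends on the framework you use.

```python
import json


def to_training_pair(record: dict) -> dict:
    """Build one (input, target) pair whose target spells out the reasoning."""
    return {
        "input": f"Q: {record['question']}\nA:",
        "target": f" {record['reasoning']} The answer is {record['answer']}.",
    }


# Illustrative CoT records written for this sketch (hypothetical data).
records = [
    {
        "question": "Tom has 3 boxes with 4 pencils each. How many pencils does he have?",
        "reasoning": "Each box holds 4 pencils and there are 3 boxes, so 3 * 4 = 12.",
        "answer": "12",
    },
    {
        "question": "A shelf holds 5 rows of 6 books. How many books are on the shelf?",
        "reasoning": "There are 5 rows with 6 books each, so 5 * 6 = 30.",
        "answer": "30",
    },
]

# One JSON object per line (JSONL), a common input format for fine-tuning tools.
jsonl = "\n".join(json.dumps(to_training_pair(r)) for r in records)
print(jsonl)
```

Training on targets like these is what exposes the model to the intermediate steps, rather than to the final answer alone.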
To find out more information about Chain-of-Thought and related methods, please read:
3. Is ChatGPT smart enough in science?
As we mentioned above, the effectiveness of the reasoning ability is highly dependent on the size of the models.
Let’s take a look at how ChatGPT performs in the field of science.
In physics, according to Nautilus [4], Sidney Perkowitz, Charles Howard Candler Professor of Physics Emeritus at Emory University, used “E = mc²” as an example.
GPT-3.5 correctly identified the equation but incorrectly stated that it implies a large mass can be changed into a small amount of energy. Only when Perkowitz re-entered “E = mc²” did GPT-3.5 correctly state that a small mass can produce a large amount of energy.
But for GPT-4, even when Perkowitz typed E = mc² several times, GPT-4 always stated that a small mass would yield a large energy.
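A quick calculation shows why the "small mass, large energy" reading is the correct one. In the sketch below, the function name and the one-gram example are our own choices; the speed of light is the exact SI value. Because c² is on the order of 10^17 in SI units, even one gram of mass corresponds to roughly 9 × 10^13 joules.

```python
C = 299_792_458  # speed of light in metres per second (exact by SI definition)


def rest_energy_joules(mass_kg: float) -> float:
    """Energy equivalent of a mass at rest, E = m * c^2, in joules."""
    return mass_kg * C ** 2


energy = rest_energy_joules(0.001)  # one gram, expressed in kilograms
print(f"{energy:.3e} J")
```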
As Perkowitz concluded: “Compared to GPT-3.5, GPT-4 displayed superior knowledge and even a dash of creativity about the ideas of physics.”
Furthermore, Perkowitz tested GPT-3.5 and GPT-4 by asking them about often-misunderstood fringe ideas in physics and space science, and both gave correct answers.
But he also argued that the distinction can be harder to make when factors such as politicization or public policy sway the presentation of scientific issues, which may themselves be under study without definitive answers.
Although ChatGPT is criticized as “still somewhat illusory”, we believe that in the near future, as more emergent capabilities are discovered and used properly, ChatGPT-like AI systems will become more reliable and thoughtful virtual assistants in specific areas, such as science and business scenarios.