When a student encounters a challenging mathematics problem or a programmer needs to write a complex algorithm, they will rarely solve it all in one go. Instead, they will reason through the task, jotting down notes and intermediate steps to arrive at a final solution. Likewise, large language models (LLMs) -- artificial intelligence (AI) systems that process and generate human language -- perform better at complex tasks when they write down their reasoning process before blurting out an answer than when they do not. In a paper in Nature, the DeepSeek AI team reports that LLMs can be incentivized to learn to reason without ever being shown examples of human reasoning trajectories, using a trial-and-error process called reinforcement learning.
So, what needs to be done to get an LLM to write out its reasoning process? Early efforts to elicit reasoning in LLMs simply added an extra instruction. Instead of prompting the LLM with "Q: Is 119 a prime number? A:" and expecting it to answer yes or no, researchers might input "Q: Is 119 prime? A: Let's think step by step." A small change in language was enough to induce the LLM to produce a step-by-step explanation -- called a reasoning trace -- before giving its answer. Other efforts taught LLMs to show their reasoning by presenting them with examples of humans using reasoning to solve problems. The LLM then learnt to produce reasoning traces that looked like the ones in the data -- this is called supervised learning. However, prompting or training the LLM using human inputs can introduce biases, and these approaches prevent the model from developing its own ways of reasoning, which might perform better than human examples.
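To make this concrete, the snippet below sketches how such a reasoning cue might be added to a prompt. It is a minimal illustration in Python, assuming a generic text-based prompting interface; the build_prompt helper is hypothetical and not tied to any particular model's API.

```python
# Minimal sketch of chain-of-thought prompting. The build_prompt() helper is
# hypothetical and stands in for whatever interface feeds text to an LLM.

def build_prompt(question: str, with_reasoning_cue: bool) -> str:
    """Return a prompt that either asks for a bare answer or nudges the
    model to write out a reasoning trace before answering."""
    if with_reasoning_cue:
        return f"Q: {question}\nA: Let's think step by step."
    return f"Q: {question}\nA:"

# Direct prompt: the model is expected to reply with just "yes" or "no".
print(build_prompt("Is 119 a prime number?", with_reasoning_cue=False))

# Reasoning cue: the small change in wording invites a step-by-step trace
# (for example, "119 = 7 x 17, so it is not prime") before the final answer.
print(build_prompt("Is 119 a prime number?", with_reasoning_cue=True))
```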
The researchers introduced a paradigm in which the reasoning steps elicited from an LLM are separate from the production of its answer. They implemented this in a model called DeepSeek-R1, which was released in January 2025. Rather than hoping that the LLM would reason when it was instructed to do so, or guiding it using examples of the human reasoning process, the researchers used a type of algorithm called reinforcement learning. Reinforcement-learning algorithms resemble how a child might learn to play a video game. As the child navigates their avatar through the game world, they learn through trial and error that some actions (such as collecting gold coins) earn points, whereas others (such as running into enemies) set their score back to zero. In a similar vein, DeepSeek-R1 was awarded a high score when it answered questions correctly and a low score when it gave wrong answers.
The researchers realized that, because maths and programming questions typically have verifiable answers, they could create a scoring system that helped the LLM to improve during the training process. The researchers' main discovery was that, when the LLM was trained to produce correct answers using the trial-and-error process of reinforcement learning, it naturally learnt to output its reasoning (Fig. 1). This contrasts with previous prompting-based approaches, which were more akin to expecting a child to learn to master a video game by having them read the instructions, or supervised-learning approaches, which can be likened to expecting the child to master a game by watching a sibling play it hundreds of times.
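To illustrate, the sketch below shows one way such a scoring system could be written for questions with verifiable answers. The answer-extraction rule and the 1.0/0.0 scores are illustrative assumptions, not the reward actually used to train DeepSeek-R1.

```python
# Illustrative rule-based reward for questions with verifiable answers.
# The answer-extraction rule and the scores are assumptions for the sake of
# the example, not the reward actually used to train DeepSeek-R1.

def extract_final_answer(response: str) -> str:
    """Treat the last non-empty line of the model's response as its answer."""
    lines = [line.strip() for line in response.strip().splitlines() if line.strip()]
    return lines[-1] if lines else ""

def reward(response: str, reference_answer: str) -> float:
    """Score 1.0 if the final answer matches the known correct answer to a
    maths or programming question, and 0.0 otherwise."""
    return 1.0 if extract_final_answer(response) == reference_answer else 0.0

# Example: a response whose reasoning trace precedes a verifiable answer.
response = "119 = 7 * 17, so it has divisors other than 1 and itself.\nno"
print(reward(response, reference_answer="no"))   # 1.0
print(reward(response, reference_answer="yes"))  # 0.0
```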
Because it was trained using reinforcement learning, the LLM was not limited to learning human-defined reasoning patterns; it could also discover its own behaviours that earned high rewards. The researchers found that the LLM learnt to evaluate its own in-progress reasoning by reflecting on the statements it had already generated, and that it learnt to explore alternative approaches in its responses. As one example of this, the model learnt to insert phrases into its reasoning such as "Wait. That's an aha moment I can flag here."
However, the LLM also learnt certain behaviours that, although they might have helped it to produce better responses, resulted in reasoning traces that were difficult to understand. For example, the LLM adopted a behaviour in which its reasoning would switch back and forth between Chinese and English (the two languages the LLM was optimized to understand). The researchers also found that the LLM learnt to produce extremely long reasoning traces, which could contain 10,000 words or more. Furthermore, the reinforcement-learning method required training questions with clear-cut right or wrong answers (such as maths problems), which meant that the LLM didn't learn how to handle questions requiring nuanced, subjective or long-form responses.
The researchers show that many of these issues were resolved by using a multistage training framework, in which the LLM was exposed to alternating stages of reinforcement learning and supervised learning. Trained in this way, DeepSeek-R1 achieved state-of-the-art accuracy on tasks that assessed maths and coding skills, factual knowledge and other forms of language understanding, in both Chinese and English.
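At a high level, this alternation might be organized as in the sketch below; the stage functions and their ordering are placeholders rather than the exact pipeline reported in the paper.

```python
# High-level sketch of a multistage training loop that alternates supervised
# learning with reinforcement learning. The stage functions are placeholders,
# not the actual DeepSeek-R1 training pipeline.

def supervised_finetune(model, labelled_examples):
    """Fit the model to curated (prompt, response) pairs so that its
    reasoning traces stay readable."""
    ...

def reinforcement_learning(model, verifiable_questions):
    """Update the model by trial and error, rewarding correct final answers
    on questions whose answers can be checked automatically."""
    ...

def multistage_training(model, sft_stages, rl_stages):
    # Alternate the two kinds of training so that the model both reasons
    # effectively and remains intelligible to human readers.
    for sft_data, rl_data in zip(sft_stages, rl_stages):
        supervised_finetune(model, sft_data)
        reinforcement_learning(model, rl_data)
    return model
```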
Ultimately, the question of what makes a good reasoning LLM is a philosophical as much as a technical one. What behaviours do users want from an AI when they ask it hard questions? At one extreme, imagine an AI that has learnt to reason in a gibberish language that no human can hope to understand. Should we care that its reasoning is completely unintelligible, so long as it arrives at the correct answer? The version of DeepSeek-R1 that was trained through reinforcement learning alone tended to produce responses that were convoluted, long or otherwise difficult for humans to read. In the end, the researchers found that they needed to introduce some supervised learning to strike a balance between effective reasoning and intelligible responses to a broad variety of user queries.
DeepSeek-R1 has developed from a powerful but opaque solution-finder into a system capable of human-like conversations. This journey reflects the need for AI systems that not only solve problems accurately but also serve as tools that humans can understand, trust and meaningfully collaborate with.