Xuzeng He

#### Supervised Fine-tuning, Reinforcement Learning from Human Feedback and the latest SteerLM

<figure>
<img
src="https://cdn-images-1.medium.com/max/1024/0*jX0U0LEuxA1yZ1L5" />
<figcaption>Source: Generated using Google’s Gemini</figcaption>
</figure>

### Author

· Xuzeng He (**ORCID:**
[0009-0005-7317-7426](https://orcid.org/0009-0005-7317-7426))

### Introduction

Large Language Models (LLMs), usually trained on extensive text data,
can demonstrate remarkable capabilities across a wide range of tasks,
often with state-of-the-art performance. However, users increasingly
want something more personalised than a general-purpose solution. For
example, one user may want an LLM to assist with writing code, while
another may seek a model specialised in medical knowledge. To better
align an LLM with such preferences, we can fine-tune a pre-trained
model so that it specialises in a specific domain or task.

In this post, we introduce three different algorithms for fine-tuning
LLMs, including SteerLM, the latest fine-tuning method proposed by
[NVIDIA](https://www.nvidia.com/en-au/).

### Supervised Fine-tuning (SFT)

Supervised Fine-tuning (SFT) is the most common approach to adapting a
pre-trained model to a specific task. The model is trained on a labelled
dataset and learns to predict the correct label for each input. It
usually consists of three steps:

1.  Pre-train the model: The base model should be pre-trained beforehand
    to give it a basic understanding of language.
2.  Label the dataset: Because SFT is a supervised learning algorithm,
    each data point in the task-specific training dataset must be paired
    with a label (the desired output).
3.  Fine-tune the model: The parameters of the model are adjusted to
    improve its performance on the given task by minimising the loss
    between the prediction and the label for each data point.
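
As a rough illustration of the loss used in step 3, the sketch below
computes a cross-entropy loss between a model's raw scores and the
correct labels. It is a minimal, standard-library-only example, not the
code of any particular training framework; the function names and the
toy numbers are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability the model assigns to the correct label."""
    return -math.log(softmax(logits)[label])

def sft_loss(batch_logits, labels):
    """Average the per-example loss over a batch of labelled data points."""
    losses = [cross_entropy(l, y) for l, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

# Toy batch: two data points, three possible labels (made-up numbers).
batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
labels = [0, 2]
loss = sft_loss(batch_logits, labels)
```

During fine-tuning, an optimiser would repeatedly adjust the model
parameters in the direction that lowers this value.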

<figure>
<img
src="https://cdn-images-1.medium.com/max/1024/1*mamn6jRzocDJeeiqQ15bKw.png" />
<figcaption>Supervised Fine-Tuning process flow</figcaption>
</figure>

For hands-on practice, see the SFTTrainer class from the
[TRL](https://github.com/lvwerra/trl) library (developed by Hugging
Face), which is designed to facilitate the SFT process. This class can
train on a text column of your training dataset (for example, one loaded
from a CSV file) in which the system instructions, questions, and
answers are combined into the prompt structure.
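
To make the idea of a single prompt column concrete, here is a small
sketch in plain Python that joins a system instruction, a question, and
an answer into one training string. The template, the `build_prompt`
helper, and the column names are hypothetical; they are not a format
required by TRL, and a real project should follow the template expected
by the chosen base model.

```python
def build_prompt(system, question, answer):
    """Join the three parts into one training string (hypothetical template)."""
    return (
        f"### System:\n{system}\n\n"
        f"### Question:\n{question}\n\n"
        f"### Answer:\n{answer}"
    )

# Toy dataset rows (made-up content, assumed column names).
rows = [
    {
        "system": "You are a helpful coding assistant.",
        "question": "How do I reverse a list in Python?",
        "answer": "Use my_list[::-1] or my_list.reverse().",
    },
]

# Add a single "text" column holding the full prompt for each row.
for row in rows:
    row["text"] = build_prompt(row["system"], row["question"], row["answer"])
```

A column built this way is the kind of field a trainer could then read
as its training text.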

### Reinforcement Learning from Human Feedback (RLHF)

SFT is relatively simple, so we now move on to a more involved
algorithm: Reinforcement Learning from Human Feedback (RLHF). As its
name suggests, RLHF uses reinforcement learning to optimise a language
model directly with human feedback. It has enabled language models to
be trained to align with complex human values. It mainly comprises
three core steps:

1.  Pretraining the model
2.  Gathering data and training a reward model
3.  Fine-tuning the LLM with reinforcement learning.

As a starting point, RLHF must be applied to an LLM that has already
been pre-trained. As with SFT, this step can be skipped if a
pre-trained model is available.

Next, one uses the LLM to generate data for training a reward model, so
that human preferences can be integrated into the algorithm. The goal
is to obtain a model or system that takes a sequence of text as input
and outputs a scalar reward that numerically represents the human
preference.
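
One common way to train such a reward model, which this post does not
cover in detail, is to show annotators two responses to the same prompt
and penalise the model whenever it scores the rejected response above
the preferred one. The sketch below shows that pairwise loss on plain
numbers; the function name and the example scores are assumptions, and
a real reward model would produce the scores from text.

```python
import math

def pairwise_loss(reward_chosen, reward_rejected):
    """Compute -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the preferred response receives the higher
    scalar reward, and large when the ordering is reversed.
    """
    diff = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-diff))

# Made-up scalar rewards for two responses to the same prompt.
agrees = pairwise_loss(2.0, -1.0)     # reward model matches the annotator
disagrees = pairwise_loss(-1.0, 2.0)  # reward model contradicts the annotator
```

Minimising this loss over many comparisons pushes the scalar reward
towards the ordering that humans prefer.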

Finally, the LLM is fine-tuned with a policy-gradient Reinforcement
Learning (RL) algorithm called [Proximal Policy Optimization
(PPO)](https://huggingface.co/blog/deep-rl-ppo). The model is
essentially fine-tuned using the reward value output by the reward model
together with an additional penalty term, which is a scaled version of
the [Kullback–Leibler (KL)
divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence)
between the outputs of the fine-tuned model and those of the initial
pre-trained model. This penalty term discourages the fine-tuned model
from moving substantially away from the initial pre-trained model, so
that it continues to output reasonably coherent content.
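
The way the reward and the penalty combine can be sketched as follows.
This is only the reward-shaping step, not PPO itself. The function name,
the scale `beta`, and the use of summed per-token log-probability
differences as the KL estimate for one sampled response are assumptions
made for illustration.

```python
def penalised_reward(reward, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward-model score minus a scaled KL estimate for one sampled response.

    policy_logprobs: log-probabilities the fine-tuned model gave each token.
    reference_logprobs: log-probabilities the initial model gave the same tokens.
    """
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, reference_logprobs))
    return reward - beta * kl_estimate

# Made-up numbers: the fine-tuned model is far more confident than the
# initial model on every token, so part of the reward is taken away.
shaped = penalised_reward(
    reward=1.0,
    policy_logprobs=[-0.2, -0.1, -0.3],
    reference_logprobs=[-1.0, -0.9, -1.2],
)

# If both models assign identical log-probabilities, no penalty applies.
unchanged = penalised_reward(1.0, [-0.5, -0.7], [-0.5, -0.7])
```

A larger `beta` keeps the fine-tuned model closer to the initial model,
at the cost of following the reward model less aggressively.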

<figure>
<img
src="https://cdn-images-1.medium.com/max/1024/1*6lO6xJTGDyK_ZlYtTWrMkw.png" />
<figcaption>RLHF Process flow. Source: <a
href="https://bmanikan.medium.com/demystifying-chatgpt-a-deep-dive-into-reinforcement-learning-with-human-feedback-1b695a770014">Demystifying
ChatGPT: A Deep Dive into Reinforcement Learning with
Human Feedback</a></figcaption>
</figure>

There are already a few active repositories for RLHF in PyTorch. The
primary ones are [Transformers Reinforcement
Learning (TRL)](https://github.com/lvwerra/trl),
[TRLX](https://github.com/CarperAI/trlx), which originated as a fork of
TRL, and [Reinforcement Learning for Language models
(RL4LMs)](https://github.com/allenai/RL4LMs).

### SteerLM

Beyond SFT and RLHF, NVIDIA recently proposed a novel approach called
SteerLM to overcome some limitations of conventional SFT and RLHF
methods. Similar to RLHF, SteerLM incorporates additional reward
signals, in this case by leveraging the attributes (e.g., quality,
humour, toxicity) annotated for each response in the [Open-Assistant
dataset](https://huggingface.co/datasets/OpenAssistant/oasst1/viewer/OpenAssistant--oasst1/validation).
It generally comprises four steps:

1.  Attribute Prediction Model: The base language model is trained as an
    Attribute Prediction Model to assess the quality of responses by
    predicting attribute values.
2.  Annotating Datasets using Attribute Prediction Model: The attribute
    prediction model is used to annotate response quality across diverse
    datasets.
3.  Attribute Conditioned SFT: Given a prompt and desired attribute
    values, a new base model is fine-tuned to generate responses that
    align with the specified attributes.
4.  Bootstrapping with High Quality Samples: Multiple responses are
    sampled from the fine-tuned model in the last step, specifying
    maximum quality. The sampled responses are evaluated by the trained
    attribute prediction model, leading to another round of fine-tuning.
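
Steps 3 and 4 above can be sketched in plain Python as follows. The
prompt layout, the helper names, and the `toy_predictor` stand-in for a
trained attribute prediction model are all hypothetical; NVIDIA's actual
implementation uses its own template and a learned model.

```python
def format_attributes(attributes):
    """Serialise attribute values as 'name:value' pairs in a fixed order."""
    return ",".join(f"{name}:{value}" for name, value in sorted(attributes.items()))

def conditioned_example(prompt, attributes, response):
    """Step 3: build one training string for attribute-conditioned SFT."""
    return f"{prompt}\n[attributes] {format_attributes(attributes)}\n{response}"

def bootstrap_examples(prompt, sampled_responses, predict_attributes):
    """Step 4: re-label sampled responses with predicted attributes,
    producing data for another round of fine-tuning."""
    return [
        conditioned_example(prompt, predict_attributes(r), r)
        for r in sampled_responses
    ]

def toy_predictor(response):
    """Hypothetical stand-in: longer responses get a higher quality score."""
    words = len(response.split())
    return {"quality": min(4, words // 2), "humour": 0}

examples = bootstrap_examples(
    "Explain overfitting.",
    ["It is bad.", "The model memorises training data and fails on new data."],
    toy_predictor,
)
```

At inference time, the same attribute string can be set by the user to
steer the style of the generated response.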

<figure>
<img
src="https://cdn-images-1.medium.com/max/1024/1*1b2Ktr8ln19F0wV_SypA5Q.png" />
<figcaption>SteerLM Process Flow. Source: <a
href="https://docs.nvidia.com/nemo-framework/user-guide/latest/modelalignment/steerlm.html">Nvidia
Docs Hub</a></figcaption>
</figure>

For hands-on practice, refer to this
[post](https://developer.nvidia.com/blog/announcing-steerlm-a-simple-and-practical-technique-to-customize-llms-during-inference/)
published by NVIDIA for a complete tutorial. Note that since this
method is developed by NVIDIA, AMD GPUs are currently not supported.

### Conclusion

The use of Large Language Models has advanced significantly in multiple
directions, and a growing number of users are seeking task-specific
models. In this post, we introduced three algorithms for fine-tuning
LLMs: SFT, RLHF and SteerLM. With continued investigation and
refinement, we believe that Large Language Models will open up exciting
opportunities in the future.

### References

- Lambert, N.; Castricato, L.; von Werra, L.; and Havrilla, A., 2022.
  Illustrating reinforcement learning from human feedback (rlhf).
  <https://huggingface.co/blog/rlhf>
- Dong, Y., Wang, Z., Sreedhar, M. N., Wu, X., & Kuchaiev, O. (2023).
  SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative
  to RLHF (Version 1). arXiv.
  <https://doi.org/10.48550/ARXIV.2310.05344>





Fine-tuning Large Language Models: A Brief Introduction