This blog was originally published on Medium as a collaborative effort with MantisNLP. See the next part of this blog series here.
Introduction
In recent years, the field of natural language processing (NLP) has witnessed remarkable advancements, primarily fueled by the development of large language models like GPT-3.5. These models, capable of generating coherent and contextually relevant text, have revolutionized various applications, ranging from chatbots and virtual assistants to content generation and translation. However, despite their impressive capabilities, these models often lack precision, exhibit biases, and struggle with nuanced understanding. This is where the importance of supervised fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) comes into play.
However, RLHF "...is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model" [1].
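For context, that second stage is usually framed as a KL-constrained reward maximization. Following the notation of [1], it looks roughly like this:

$$\max_{\pi_\theta}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x)\,\|\,\pi_{\mathrm{ref}}(y \mid x)\big]$$

Here $r_\phi$ is the fitted reward model, $\pi_{\mathrm{ref}}$ is the original (supervised fine-tuned) model, and $\beta$ controls how far the optimized policy $\pi_\theta$ is allowed to drift from it.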
In this article, the Mantis NLP and Argilla teams have joined efforts to showcase alternatives to RLHF, more specifically Direct Preference Optimization (DPO) and Chain of Hindsight, including a hands-on with Argilla's annotation platform to prepare the human preference data (PD) and with the Transformers Reinforcement Learning (trl) library to fine-tune the Large Language Model (LLM) using DPO.
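As a preview of that hands-on, here is a minimal sketch of what the DPO step can look like with trl's DPOTrainer. The model name, training arguments, and the tiny in-line preference dataset are placeholders, and the exact arguments differ across trl versions (newer releases move options such as `beta` into a `DPOConfig`):

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

# Toy preference data: each record pairs a prompt with a preferred ("chosen")
# and a dispreferred ("rejected") completion.
preference_data = Dataset.from_dict({
    "prompt": ["Can you write an email answering what the HTTP Error 500 means?"],
    "chosen": ["Dear Customer, Thank you for reaching out..."],
    "rejected": ["An HTTP Error 500 usually indicates an internal server error..."],
})

model_name = "tiiuae/falcon-7b"  # placeholder; ideally a model already put through SFT
model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # frozen reference policy
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # DPO training needs a padding token

trainer = DPOTrainer(
    model,
    ref_model,
    args=TrainingArguments(output_dir="falcon-7b-dpo", per_device_train_batch_size=1),
    beta=0.1,                      # strength of the implicit KL penalty
    train_dataset=preference_data,
    tokenizer=tokenizer,
)
trainer.train()
```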
Supervised fine-tuning (SFT)
Large language models are typically pre-trained on massive amounts of text data, enabling them to learn the statistical regularities of language. However, they do not possess specific knowledge about particular domains or exhibit a comprehensive understanding of context. To bridge this gap, fine-tuning is employed, which involves training the model on domain-specific or task-specific datasets. Fine-tuning allows the model to acquire specialized knowledge and adapt its behavior to better serve specific applications, making it more useful and reliable.
Supervised fine-tuning (SFT) is a more focused approach that involves training a model on a labeled dataset that directly maps inputs to desired outputs. SFT, including instruction-tuning, teaches a model to respond based on what humans define.
To clarify the difference, let's look at an example. Imagine you are creating a conversational agent to answer customer support questions about your software. Suppose you start from a foundational Large Language Model, for example Falcon-7B; the model you end up with will be a fine-tuned version of it.
Falcon can produce high-quality text, but it may not match the style or specificity you are looking for, since it has been pre-trained to predict the next token in a sequence rather than to carry out a specific task, in this case acting as a bot that emulates how your customer support team would answer.
This is an example input:
Can you write an email answering what the HTTP Error 500 means?
To this, Falcon (40B version) answers:
An HTTP Error 500 usually indicates an internal server error. This can be caused by a number of issues such as a problem with the server code or configuration, insufficient memory or CPU usage, or a conflict with other processes running on the server. To resolve the issue, you can try restarting the server or checking the server logs for more information on the specific error message.
The response is correct in terms of content, but it does not match the requirements of customer support messages, which usually:
- Sound more polite;
- Have a clear format;
- Are focused on providing instructions to solve the problem;
- Contain information about what to do if the problem continues.
Hence, a corrected response for that prompt should look something like this:
Dear Customer,
Thank you for reaching out to our customer support team regarding the HTTP Error 500 you are experiencing. We understand the frustration this may have caused, and we are here to assist you in resolving this issue.
An HTTP Error 500 typically signifies an internal server error. This error can arise due to various factors, including problems with the server code or configuration, inadequate memory or CPU usage, or conflicts with other processes running on the server.
To help you resolve this problem, we recommend the following steps:
- Restart the server: …
- Review server logs: …
If the problem persists after following these steps, please provide us with any additional details or error messages you may have encountered. This information will greatly assist us in further investigating the issue and finding a suitable resolution.
Once again, we apologize for any inconvenience this may have caused. Our dedicated support team is committed to resolving this matter for you promptly. If you have any further questions or concerns, please do not hesitate to reach out to us. We are always here to help.
Thank you for your patience and cooperation.
Best regards, Customer Support Representative
To have your LLM produce text that matches your style and content guidelines, you need to fine-tune it using supervised fine-tuning.
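To make this concrete, here is a minimal sketch of supervised fine-tuning with trl's SFTTrainer, using databricks-dolly-15k as an example instruction dataset. The base model, prompt template, and training arguments are illustrative placeholders, and the exact SFTTrainer arguments vary across trl versions (newer releases move options such as `max_seq_length` into an `SFTConfig`):

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Load an open instruction dataset (databricks-dolly-15k as an example).
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

# Turn each instruction/response pair into a single training text.
def to_text(example):
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
    }

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model="tiiuae/falcon-7b",      # base model to fine-tune (placeholder)
    train_dataset=dataset,
    dataset_text_field="text",     # column holding the full training text
    max_seq_length=512,
    args=TrainingArguments(output_dir="falcon-7b-sft", per_device_train_batch_size=2),
)
trainer.train()
```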
Supervised fine-tuning (SFT) based on Instructions
To carry out SFT on an LLM, you need data. More specifically, if you want to fine-tune a chatbot to successfully answer user requests, you need instruction data. There are many open-source, instruction-based datasets you can use (although most of them come with a non-commercial license!). This GitHub repository lists some of them:
| Dataset | Used by | Type | Language | Size | Description |
|---|---|---|---|---|---|
| GPT-4all Dataset | GPT-4all | Pairs | English | 400k entries | A combination of some subsets of OIG, P3 and Stackoverflow. Covers topics like general QA and customized creative questions. |
| RedPajama-Data-1T | RedPajama | | Primarily English | 1.2T tokens | A fully open pretraining dataset following the LLaMA method. |
| OASST1 | OpenAssistant | Pairs, Dialog | Multilingual (English, Spanish, etc.) | 66,497 conversation trees | A large, human-written, human-annotated, high-quality conversation dataset. It aims at making LLMs generate more natural responses. |
| databricks-dolly-15k | Dolly2.0 | Pairs | English | 15K+ entries | A dataset of human-written prompts and responses, featuring tasks such as open-domain question-answering, brainstorming, summarization, and more. |
| AlpacaDataCleaned | Some Alpaca/LLaMA-like models | Pairs | English | | Cleaned version of Alpaca, GPT_LLM and GPTeacher. |
| GPT-4-LLM Dataset | Some Alpaca-like models | Pairs, RLHF | English, Chinese | 52K entries for English and Chinese respectively, 9K unnatural-instruction entries | NOT the dataset used by GPT-4!! It is generated by GPT-4 and some other LLMs for better pairs and RLHF. It includes instruction data as well as comparison data in RLHF style. |
| GPTeacher | | Pairs | English | 20k entries | A dataset containing targets generated by GPT-4, including many of the same seed tasks as the Alpaca dataset, with the addition of some new tasks such as roleplay. |
| Alpaca data | Alpaca, ChatGLM-fine-tune-LoRA, Koala | Dialog, Pairs | English | 52K entries | |
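Whichever dataset you pick, each record boils down to an instruction (optionally with some context) paired with the response a human would expect. For instance, you can inspect databricks-dolly-15k like this (the field names shown are the ones published with that dataset):

```python
from datasets import load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

# Each record contains 'instruction', 'context', 'response' and 'category' fields.
print(dolly[0].keys())
print(dolly[0]["instruction"])
print(dolly[0]["response"])
```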