Reinforcement Learning from Human Feedback: From Zero to ChatGPT

In this talk, we will cover the basics of Reinforcement Learning from Human Feedback (RLHF) and how this technology is being used to enable state-of-the-art ML tools like ChatGPT. Most of the talk will be an overview of the interconnected ML models, covering the basics of natural language processing and RL that one needs in order to understand how RLHF is used on large language models. It will conclude with open questions in RLHF.

RLHF Blogpost:
The Deep RL Course:
Slides from this talk:
Nathan Twitter:
Thomas Twitter:

Nathan Lambert is a Research Scientist at Hugging Face. He received his PhD from the University of California, Berkeley, working at the intersection of machine learning and robotics. He was advised by Professor Kristofer Pister in the Berkeley Autonomous Microsystems Lab and Roberto Calandra at Meta AI Research. He was lucky to intern at Facebook AI and DeepMind during his PhD. Nathan was awarded the UC Berkeley EECS Demetri Angelakos Memorial Achievement Award for Altruism for his efforts to better community norms.


And we are live. Hello, and welcome to this live session where we're going to talk about Reinforcement Learning from Human Feedback. Hello, Nathan.

Hi, how are you?

I'm good, I'm excited to be here. Everyone's already here, so we're going to start in two minutes, just to give people time to join. In the meantime, don't hesitate to tell us where you're from in the chat. I'm from Paris, in France, and you, Nathan?

I'm in Oakland, California.

Nice. Hello from the UK, New York City, Berkeley, China, Singapore, Germany, Turkey, Moldova, the Netherlands, London, Israel, Finland, Spain, India, the United States, and quite a few people from France. We really do have people from around the world. Okay, we're going to start in one minute, just to let people join.

So let's get started. Welcome to this live session: Reinforcement Learning from Human Feedback, from zero to ChatGPT. This is one of the live sessions of the Deep Reinforcement Learning Course, and today it is presented by Nathan Lambert, a reinforcement learning researcher at Hugging Face. To give you a quick outline, this session will be in two parts. First, we'll have a presentation from Nathan about Reinforcement Learning from Human Feedback, which will be about 35 minutes. Then we'll have a Q&A section of about 20 minutes, so don't hesitate to ask your questions in the chat; I'll save them for the Q&A, and if we don't have time to answer yours, please join the Discord, where we have reinforcement learning channels and we will be around to answer them. You can also ask questions in the comment section on YouTube after the live. From my side, I'm Thomas Simonini, developer advocate at Hugging Face and the writer of the Deep Reinforcement Learning Course; you can find me on Twitter at ThomasSimonini.

Just a quick note: the Deep Reinforcement Learning Course is a free course we made at Hugging Face, going from beginner to expert, where you'll learn everything from Q-learning to advanced topics such as PPO and other state-of-the-art algorithms. If you're interested in studying deep reinforcement learning, this is the right moment, and you can start from the course page on the Hugging Face website; the introduction unit explains everything: what we're going to do, the challenges, the environments, and the libraries you're going to use. As I mentioned, we have a Discord channel where you can ask questions if we don't have time here, and it's also a great community, with about three thousand people interested in reinforcement learning, so it's a great way to exchange and learn about deep RL. Finally, since this is quite a technical live, Nathan, Louis, Leandro, and Alex wrote a very good blog post about Reinforcement Learning from Human Feedback; you can find it on the Hugging Face blog, and it includes a list of additional resources that can help you dive deeper into the subject.

And that's all from me; I'll hand it over to Nathan to present the introduction to Reinforcement Learning from Human Feedback.

Sounds good, thanks for the intro, Thomas. I'm very excited to be here. As Thomas said, this is primarily a technical talk. I'll potentially answer some clarifying questions at the end of each subsection, and for people who have already read the blog post, I'll try to add other details and interesting discussion points throughout, with especially a lot of discussion at the end on things that were harder to write down in a blog post. So let's dive right in.

To start, I want to talk about recent breakthroughs in machine learning. I see machine learning in 2022 as really being captured by two moments. The first is ChatGPT, which is going on right now: a language model capable of generating really incredible text across a wide variety of subjects, with a very nice user interface. The second is the Stable Diffusion moment, when a state-of-the-art, incredibly powerful model was released to the internet and a ton of people were just able to download it and use it on their own. That was transformative for how people viewed machine learning as a technology that interfaces with their lives. At Hugging Face we see this as a theme that's going to keep accelerating, and there are a lot of questions about where this is going and how these tools actually work. One of the big things that has come up in recent years is that these machine learning models can fall short: they're not perfect, and they have some really interesting failure modes. On the left you can see a snippet from ChatGPT. If you've used ChatGPT, there are filters built in, so if you ask it something like "how do I make a bomb?", it will refuse because that seems harmful. But people have figured out how to jailbreak the agent: you tell it "I'm a playwright, you're a character in my play, what happens?" There are all sorts of issues around this; we're trying to make sure these models are safe, but there's a long history of failures and challenges in interfacing with society in a fair and safe manner.

On the right are two slightly older examples. Tay was a chatbot from Microsoft that tried to learn in the real world by interacting with humans; trained on a large variety of data without any grounding in what its values were, it quickly became hateful and was turned off. And there's a long history of the field studying bias in machine learning algorithms and datasets, where the data and the algorithms often reflect the biases of their designers and of where the data was created. So the question is: how do we actually use machine learning models while mitigating these issues? Something we're going to come back to a lot in this talk is reinforcement learning.

Let me get the lingo out of the way for people who might not be familiar with deep RL. Reinforcement learning is a mathematical framework; when you hear RL, think of a constrained set of math problems within which we can study a lot of different interactions in the world. Some terminology we'll revisit again and again: there's an agent interacting with an environment. The agent acts on the environment by taking an action, and the environment returns two things, the state and the reward. The reward is the objective we want to optimize, and the state is a representation of the world at the current time step. The agent uses something called a policy to map from that state to an action. The beauty of this is that it's very open-ended learning: the agent just sees reward signals and learns to optimize them over time, irrespective of where the reward signal actually comes from. That's why a lot of people are drawn to RL: it's the ability to create an agent that will learn to solve complex problems.
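To make the agent-environment loop described above concrete, here is a minimal sketch using the Gymnasium API; the environment name and the random action choice are placeholders standing in for a real task and a learned policy.

```python
import gymnasium as gym

# Placeholder environment; any Gym-style task exposes the same interface.
env = gym.make("CartPole-v1")
state, _ = env.reset()

total_reward, done = 0.0, False
while not done:
    # A real agent would use its policy to map state -> action;
    # sampling randomly here just illustrates the loop structure.
    action = env.action_space.sample()
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Episode return: {total_reward}")
```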

And this is where we start talking about RLHF: we want to use reinforcement learning to tackle the open-ended problem of the hard loss functions we want to model. How do we actually encode human values in a machine learning system in a way that is sustainable, meaningful, and actually addresses the hard problems that have been common failure modes to date? As a small example: how do you create a loss function for questions like what is funny, what is ethical, what is safe? If you try to write these down on a piece of paper, you're either going to have a hard time or be very wrong. The goal of reinforcement learning from human feedback is to integrate these complex datasets into machine learning models, encoding these values in a model rather than in an equation ("in code" on the slide may be somewhat unclear, but really we want to learn these values directly from humans rather than assigning them to all humans and mislabeling what the actual values are). RLHF is one of many methods, and one that has been really timely and successful at actually addressing this problem of creating a complex loss function for our models.

From here I'm going to talk about the origins of RLHF and where this field came from, with some interesting back-pointers you can look at if you want more; then go through the conceptual overview, which will be a detailed walkthrough of the blog post we wrote; and then go into future directions and conclusions, reading between the lines of how RLHF works at these companies, what people may not have said, and where RLHF is going.

For history: RLHF really originated in decision making, before deep reinforcement learning, when people were creating autonomous agents that didn't use neural networks to represent a value function or a policy. What they built was a machine learning system that created a policy by having humans label the actions an agent took as correct or incorrect. It was a simple decision rule where humans labeled every action as good or bad, which was essentially a reward model and a policy rolled into one. That paper introduced the TAMER framework to solve Tetris, and it's interesting because the reward model and the policy were all in one; in later systems we'll see them become separated. This line of work was then popularized in deep RL: there was a paper on Atari games using a reward predictor trained on human feedback over trajectories, where a bunch of states (also called observations in the RL framework) were given to a human to label, and the reward predictor became another signal into the policy that was solving the task. So RLHF really originated outside of language models, and there's a ton of RLHF literature outside of language models, but for most of the rest of the talk we'll focus on language modeling, because that's why everyone is here.

Some more recent history: OpenAI ran experiments with RLHF trying to train a model to summarize text well. It's a really interesting problem, because summarization is something a lot of humans have been asked to do in standardized tests for a long time, like reading comprehension, so there are really human qualities to it, but it's hard to pin down. The diagram on the right has been around for a few years, and you'll keep seeing variations of it as we go; OpenAI kept iterating on it, and we have our own take on it. Just to get the idea going, here's an example of RLHF from that learning-to-summarize paper from OpenAI.

The prompt, to read part of it, was a Reddit post, an Ask Reddit question about whether to pursue a computer science PhD or continue working, especially if one has no real intention to work in academia even after grad school. The post is quite lengthy, and the idea is to summarize it: "Has anyone, after working for a period of time, decided for whatever reason to head back into academia to pursue a PhD in computer science, with no intention to join the world of academia but intending to head back into industry? If so, what were the reasons, and how did it turn out?" It continues for paragraphs; you get the idea of what this type of post is, and the question is how to summarize it. If you pass this into a language model that is just trained on summarizing, the output would be something like: "I'm considering pursuing a PhD in computer science, but I'm worried about the future. I'm currently employed full-time, but I'm worried about the future." You can see the language model is repetitive; that's not how a human would write it, and there are sometimes grammatical errors that aren't nice to read. OpenAI also had a human write an example, a very good output, and the human annotation was: "Software engineer with a job I'm happy at for now, deciding whether to pursue a PhD to improve qualifications and explore interests and a new challenge." What the early experiments did was use RLHF to combine these signals and get an output from a machine learning model that is a bit nicer to read; here you can see it: "Currently employed, considering pursuing a PhD in computer science to avoid being stuck with no residency visa ever again. Has anyone pursued a PhD purely for the sake of research with no intention of joining the academic world?" This is better, and there are tons of examples like this, so it's easy to see why you might want to use RLHF: you get models whose text is actually more compelling, and especially if the task covers sensitive subjects where you really don't want misinformation, there are plenty of reasons to try it.

Next comes ChatGPT, which is why a ton of people are here. What has OpenAI told us about it? Really, we don't know much, because OpenAI is not as open as they once were, but there are some really interesting rumors going around.

The rumor mill says that OpenAI is supposedly spending tons of money on human annotation, orders of magnitude more than for the summarization paper or the academic works they were doing in the past. They hire a bunch of people to write annotations like the one I showed in that example, and then they change the training. There are a lot of rumors about them modifying RLHF, but they haven't told us how, so we'll go through the overview, and presumably one of those pieces is what changes. But the impact is clear: everyone here has used it, and it's an amazing preview of what's coming for machine learning systems.

Okay, before the actual technical details, are there any pressing questions I can take?

I'm saving most questions for the Q&A, but we can quickly take a couple that are easy to answer now. First: can I download ChatGPT and fine-tune it on my own data?

No, you can't yet. Hopefully someone will release a model that you can do that with.

And can ChatGPT be trained continuously with new data?

Yeah, ChatGPT is definitely going to keep being trained on the data you're giving it, and we'll talk about that more later.

Okay, let's continue. Let's dive into RLHF. I'm going to break it down into three conceptual parts that you can keep track of in your head. You don't need to read everything on this slide; I'll go into each of these figures in detail. It's a three-phase process: language model pre-training, where you need some language model that you're going to fine-tune with RL; reward model training, which is the process of getting a reward function to train with in the RL system; and finally actually doing the RL, where you fine-tune the language model based on that reward in order to get this more interesting behavior.

Let's start on the left with language model pre-training. NLP since the Transformer paper has really been transformed (that was a rough sentence), but NLP has really taken off with standardized practices for getting a language model: scrape data from the internet, use unsupervised sequence prediction, and these very large models become incredible at generating sequences of text that mirror the distribution given to them by the human training corpus. In RLHF there is really no single best answer on what the model size should be; the industry experiments on RLHF have ranged from 10 billion to 280 billion parameters, and I suspect academic labs will try even smaller models. This is a common theme you'll see: there's a lot of variation in the method and no one knows exactly what is best. Then there's this human-augmented text, which is optional, and we'll get to that. Just to cover the dataset: there's a prompts-and-text dataset that will look like Reddit (like the Ask Reddit question I read before), forums, news, and books, and then there's an optional step to include human-written text from predefined prompts. That would be things like: you've asked ChatGPT a question, and in the future, when OpenAI trains the next ChatGPT, they can have an initial model that knows that kind of prompt is coming and train on datasets that reflect it.

Generally, there's another important optional step: a company can pay humans to write responses to important prompts it has identified, and these responses become really high quality training data used to continue training the initial language model a bit more. Some papers refer to this as supervised fine-tuning (SFT), and one way to think about it is as a high quality parameter initialization for the RLHF process that comes later. It's really expensive to do, because you have to hire people who are relatively focused to actually write in-depth responses.
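As a rough illustration of that supervised fine-tuning step, here is a minimal sketch that continues training a pre-trained causal language model on human-written (prompt, response) demonstrations with the standard next-token objective. The model name, the toy demonstration, and the hyperparameters are placeholders for illustration, not what any particular lab used.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # small stand-in backbone
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical human-written demonstration pairs.
demonstrations = [
    {"prompt": "Summarize: Should I leave my job to start a CS PhD? ...",
     "response": "Software engineer deciding whether to pursue a PhD ..."},
]

def to_features(example):
    text = example["prompt"] + "\n" + example["response"] + tokenizer.eos_token
    enc = tokenizer(text, truncation=True, padding="max_length", max_length=256)
    # Causal LM objective: predict the next token; ignore padding in the loss.
    enc["labels"] = [t if t != tokenizer.pad_token_id else -100
                     for t in enc["input_ids"]]
    return enc

train_dataset = [to_features(ex) for ex in demonstrations]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-model",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=train_dataset,
)
trainer.train()
```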

So now we have this language model. The next step is to figure out how to use it to generate some sort of preferences, because this whole time we've been talking about how to capture preferences that mirror ours as humans without assigning a specific equation to them. This step is reward model training, and it looks like a lot, but think about the high-level goal: we want a model that maps from an input text sequence to a scalar reward value. The scalar notion is important, because reinforcement learning is really built around optimizing a single scalar number over time that it sees from the environment, so we're trying to create a system that mirrors that: how do we get the blocks to fit together correctly so that we can use RL in this impactful way?

Again, reward model training starts with a specific dataset. This dataset will be different from the one used in language model pre-training, because it's more focused on the prompts the model expects people to send. There are datasets on the internet that are essentially preference datasets, or prompts from using a chatbot; a lot of specific datasets can be useful at different parts of the process, but the best practices are not that well known. In reality these prompt datasets will be orders of magnitude smaller than the text corpora used to pre-train a language model, because they're aiming at a more specific notion of text that is really human and interactive, rather than everything on the internet, which everyone knows is very noisy and hard to work with. Then we generate text from these prompts, and the downstream goal is to rank it. You pass the same prompt through a language model, or in some cases multiple language models; if you have multiple models, they can be like players in a chess tournament. The same prompt goes through each model, each model generates different text, and a human labels those different texts and creates a relative ranking. The goal is to take this generated text, pass it through some black box, and have the output be something that can be transformed into a scalar. There are multiple ways to do this; one of them is the Elo method with head-to-head rankings, but essentially it's a very human component, where a human uses some interface to map the text to a downstream score.
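For reference, here is the standard Elo update for one head-to-head comparison between two model outputs (or two models); aggregating many such comparisons gives each competitor a scalar rating. The k-factor is a conventional choice, not something tied to any specific RLHF paper.

```python
def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """One head-to-head Elo update: the winner gains rating, the loser drops."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: output A beats output B in a human comparison.
print(elo_update(1000.0, 1000.0, a_won=True))  # -> (1016.0, 984.0)
```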

Once we have that, we need to think about the input and output pairs for training a model with supervised learning: we train on a sequence of text as the input, the model decodes it and does its Transformer things, and the output is trained against a specific scalar value for the reward. We call the result the reward model or preference model. Because there are multiple parts to the system, in this talk I'll call the initial language model the initial language model or the initial policy, and then there's a separate model, the reward model. It's also a very large Transformer-based language model, so it can have many parameters, even 50 billion, and there's variation in the size: for example, InstructGPT was based on a 175 billion parameter language model with a 6 billion parameter reward model. The key is that it outputs scalars from a text input, and there are still some variations in how it can actually be trained.
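To make the reward model concrete, here is a minimal sketch: a Transformer backbone with a small head that outputs one scalar per sequence, trained with a pairwise comparison loss so the human-preferred completion scores higher than the rejected one. The backbone choice, pooling strategy, and variable names are illustrative assumptions, not a description of any specific lab's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    """Maps a text sequence to a single scalar reward."""
    def __init__(self, backbone_name: str = "gpt2"):  # small stand-in backbone
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Summarize the sequence with the last non-padding token's hidden state.
        last = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last]
        return self.value_head(pooled).squeeze(-1)  # shape: (batch,)

def preference_loss(r_chosen, r_rejected):
    """Pairwise comparison loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
reward_model = RewardModel()

chosen = tokenizer(["Prompt ... a clear, accurate summary."],
                   return_tensors="pt", padding=True)
rejected = tokenizer(["Prompt ... a repetitive, sloppy summary."],
                     return_tensors="pt", padding=True)
loss = preference_loss(reward_model(**chosen), reward_model(**rejected))
loss.backward()  # gradients for one step; an optimizer update would follow
```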

So now that we have this reward model, it can act as the scalar reward from the environment. Then we need to understand what the policy is and what the states and actions are, so that when we go into the final step of fine-tuning with RL, which looks very complex, we can see that the states and actions are both language, and the reward model is what translates from the environment, from these states of language, to a scalar reward value we can use in a reinforcement learning system. Let me break down the few common steps in this iterative loop. We take some prompt, something a user may have said or something we want the model to generate well for, and we pass it through what is going to become our policy, a trained large language model, which generates some text. We pass that text into the trained reward model and get a scalar value out. That's the core of the system, and we need to put it into a feedback loop so we can update it over time, but there are a few more important steps.

One of them, which all the popular papers have used some variation of, is a Kullback-Leibler divergence term. The KL divergence has long been popular in machine learning; it's effectively a distance metric between distributions. Without getting too deep into how sampling from a language model works: when you pass in a prompt, the language model generates a distribution over tokens at each time step, and we can compare those distributions to each other. What's going on is that we're constraining the policy, the language model on the right, so that as we iterate on it over time it doesn't drift too far from the initial language model that we knew was a pretty accurate text generator. The failure mode this prevents is the language model outputting gibberish just to get high reward from the reward model; we want it to get high reward while still producing useful text, so this constraint keeps us in the part of the optimization landscape we want to be in. One note: DeepMind doesn't put this term in the reward, but rather applies it in the actual update rule of the RL algorithm. A common theme: the implementation details vary, but the ideas are often similar. So now we have the reward model output and this KL divergence constraint on the text, and we simply combine the scalar reward with a scaling factor lambda that says how much we care about the reward from the reward model versus how much we care about the KL constraint.
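Putting that together, the per-sample reward the RL optimizer sees typically looks something like r = r_RM - lambda * KL(policy || initial model). Here is a minimal sketch where the KL term is approximated from the log-probabilities of the sampled tokens; the exact bookkeeping of where and how the penalty is applied differs across implementations, and kl_coef stands in for the lambda mentioned above.

```python
import torch

def kl_shaped_reward(reward_model_score: torch.Tensor,
                     logprobs_policy: torch.Tensor,
                     logprobs_initial: torch.Tensor,
                     kl_coef: float = 0.1) -> torch.Tensor:
    """r = r_RM - kl_coef * sum_t [log pi(y_t) - log pi_init(y_t)].

    `logprobs_*` hold per-token log-probabilities of the sampled response
    under the current policy and under the frozen initial model.
    """
    kl_estimate = (logprobs_policy - logprobs_initial).sum(dim=-1)
    return reward_model_score - kl_coef * kl_estimate

# Toy shapes: batch of 2 responses, 4 generated tokens each.
r = kl_shaped_reward(torch.tensor([1.2, -0.3]),
                     torch.randn(2, 4), torch.randn(2, 4))
print(r.shape)  # torch.Size([2])
```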

In reality there are options to add even more terms to the sum. For example, InstructGPT adds a term rewarding the text outputs of the model being trained for matching some of the high quality annotations they paid human annotators to write for specific prompts; it would be matching the summary the human wrote about the grad school question, making sure the text stays close to the human text they have access to. But that's reliant on data, so not everyone does this step. Finally, we plug this reward into an RL optimizer. Generally the RL optimizer just operates as if the reward were given to it by the environment, and we have a traditional RL loop: the language model is the policy, the reward model plus the text sampling technique is the environment, we get the state and reward back out, and the RL update rule can work. There are some tricks, like freezing some parameters of the RL policy to make the optimization landscape more tractable, but in reality it is mostly just applying PPO, a policy gradient algorithm, to the language model.

As a brief review, PPO stands for Proximal Policy Optimization, a relatively old on-policy reinforcement learning algorithm. On-policy means that as a batch of data is passed through the system, the gradients are computed with respect to that data only, rather than keeping a replay buffer of recent transitions. PPO works on discrete or continuous actions, which is why it can work okay with language, and it's been around for a long time, which really means it has been optimized for these parallel approaches. That has been really important, because these language models are way bigger than any policies reinforcement learning would normally use.
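Here is a condensed sketch of what that final RL step can look like in code, using the TRL library's PPOTrainer roughly as it existed around the time of this talk; treat the exact class and method names as indicative and check the current TRL documentation. The reward here is a placeholder scalar standing in for a call to a trained reward model, and the model choice is toy-scale.

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="gpt2", learning_rate=1.41e-5)  # toy-scale stand-in
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# 1. A prompt goes through the current policy to generate a response.
query = tokenizer.encode("Summarize: should I leave my job to start a PhD?",
                         return_tensors="pt")[0]
response = ppo_trainer.generate([query], return_prompt=False,
                                max_new_tokens=32)[0]

# 2. A trained reward model would score prompt + response; a fixed scalar
#    stands in for that call here.
reward = torch.tensor(1.0)

# 3. PPO updates the policy; the KL penalty against ref_model (the frozen
#    initial model) is handled inside the trainer.
stats = ppo_trainer.step([query], [response], [reward])
```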

Okay, I'm going to pause here; it's a good time to answer one or two conceptual questions if there are any, and then we'll get into a fun wrap-up part of this talk with some open areas of investigation.

We'll save most of them for the Q&A just after, but one question was: can the model be manipulated through the human feedback, for example if the human feedback is not correct?

Yeah, I might touch on this later too, but it's a really nuanced question in RLHF. Thomas and I are going to have different values: what if the dataset contains "the best sport in the world is football" and you have both Americans and Europeans answering? There's real discordance in the text data you can collect. There's also some interesting work from Facebook on BlenderBot, where they try to train a model to detect whether people are trolling in their feedback, to see whether the feedback given to the model is actually bogus. The number of different machine learning models all going into one chatbot system is pretty wild. Something we've discussed internally that would help is a model to predict whether a prompt is hard: a prompt like "the capital of Alaska is ___" hasn't really changed, but a timely prompt about climate change or current events is hard because it changes with the data so much. These things aren't all done yet, but they're the sort of additions you might expect people to make to the system.

Let's see, there's another one. Do the human annotators write the prompts, or the responses as well?

People definitely write both. The prompts are probably sourced from a wider distribution of people; what I've typed into ChatGPT could be used in the future. But the responses, at least for ChatGPT, come from a relatively closed set of contractors. A question when trying to build an open source ChatGPT is how to get this high quality data; even though everyone in the Hugging Face community is amazing, there seem to be strict requirements on the responses to make them high quality enough for this to work, so crowdsourcing that data is hard because it can't be written by just anyone. There's an advantage to having diverse prompts, which is why those are taken from everyone, but the data for the feedback part needs to be really high quality, so it comes from a subset of people.

Awesome. Anyway, I'm going to continue. This is probably my favorite part of the talk, where we get into some interesting parts of RLHF.

Just to summarize, this is a good interweaving of the concepts we've covered and what is confusing about them. Almost all the papers to date that have been popular have tweaks to the methods I've described. Anthropic is great: they've released open source data for this (it's on the Hub, and we can link to it once I'm done talking) and a really long document detailing their findings in multiple ways, and they have some complex additions. The initial policy they use for RLHF has context distillation to improve helpfulness, honesty, and harmlessness, and we'll show in a second an example of how this could change text between two RLHF implementations. Then they have another step, preference model pre-training: because the reward model itself is a different language model, you might want to train it differently, so they first trained it like a language model to predict actual tokens, and then they used ranking datasets that already exist on the internet with binary rankings for responses, for example a Reddit question with two responses where one is labeled thumbs up and one thumbs down. They fine-tune the reward model on this before training it on their own comparisons of generated text, to help initialize the reward model. They also tried online iterated RLHF, where during the RL feedback loop they iteratively update the reward model, to help the model keep learning while it's interacting with the world. This online version only works in some applications, like chat, where you can keep getting user engagement; if you think about using RLHF outside of chat applications, where this data is more complicated to get and might actually be proprietary, the online version may not be applicable to every experiment.

Then there's OpenAI; this is mostly based on InstructGPT. They're the ones who pioneered having humans generate the language model training text, and they've pushed that really far by also adding the RL policy reward for matching it. Other companies are definitely starting to imitate this, but it's constrained by cost; OpenAI has the advantage and the scale to invest millions of dollars into this, and otherwise it's an open question how people replicate it.

DeepMind coming into the space and doing things totally differently has probably been great for the research field, adding diversity. They're the first ones to use a non-PPO optimization algorithm: they use advantage actor-critic (A2C), another on-policy RL algorithm. My interpretation is that the algorithm chosen often depends more on infrastructure and expertise than on the algorithm itself. OpenAI has been using PPO more than anyone; DeepMind has highly specific infrastructure to deploy RL experiments at scale in a distributed manner and to monitor them, so I'm guessing the algorithm they used was really easy for them to scale up and monitor, rather than PPO, which they would have had to start over on. Also, DeepMind trains on more than just alignment with human preferences: they also try to encode specific rules about things a model should not do, so they're training on multiple objectives at once, rules about structure and things it should or should not say, alongside clear human preferences.

And there's more out there. Studying this has been a crash course for me over the last couple of weeks, so if there's anything I missed, please add it to the chat and we can update the resources everyone will use in the future. The field is moving really fast; OpenAI might release the ChatGPT paper tomorrow, this would instantly be out of date, and we'd go update all of it. So thanks for your feedback there.

The next really interesting thing to me is the reward model feedback interface, which is about how machine learning is going beyond a research and technical domain to one that is inherently human and has user interface and UX questions. Look at Anthropic's text interface, which they show in their paper (you should really go check it out): they made a chatbot, and during the chat the human has to rank which response they think is better on a sliding scale. It's really important; there are all these places where you could say "I thought the assistant was ___", so there's a ton of data going into this system.

We're only at the first couple of iterations of what these feedback interfaces will look like, and Anthropic is actually a couple of steps ahead of what others have done. On the left is BlenderBot from Facebook; it's not confirmed that they use RLHF, but they're still collecting this data to update the model. On the right is ChatGPT, where users can give thumbs-up and thumbs-down feedback. But some of the people I've talked to who go deeper into RLHF say that thumbs up / thumbs down is used because it's easy data to get, not because it's the best data you could have. An example of something better is giving humans the ability to directly edit the outputs: red-line edits, changing words, removing things, fixing punctuation, because that crowdsources the really high quality data that OpenAI has been getting. It's maybe not quite as good as a contractor being paid to write it, but it's much better and much higher signal than thumbs up and thumbs down. These interfaces will continue to evolve over time.

Changing gears a bit, let's walk through some recent examples and connect them to the figures you may have seen before. Here's the most popular figure from InstructGPT, and you can see where the three-step process I was describing was inspired from; OpenAI walks you through it. Step one: collect demonstration data and train a supervised policy; this is training the initial language model. Step two: collect comparison data and train a reward model; you can see the different datasets, the samples and the human-generated text, and the comparison data is the ranking system. Step three: optimize the policy against the reward model using reinforcement learning. This last one is, I think, oversimplified relative to what is actually happening, and that's really why I wanted to explain it and elucidate the space: there's a lot that can go into this final step that is not always documented.

Then there's the Anthropic figure. This one does away with the three-step framing but adds in all the complex things that would make it hard to follow as a newcomer. You start with the pre-trained language model, which captures a lot of what I would call step one, and branching out of it immediately are the two Anthropic-specific modifications I mentioned: preference model pre-training, which is pre-training the reward model using those thumbs-up/thumbs-down datasets scraped from the web, and harmlessness/helpfulness prompt context distillation, which is figuring out how to add a context before your prompt to help initialize the reinforcement learning part. Then they detail their feedback interface and how this actually iterates over time.

Comparing the diagrams, it's also interesting to see what Anthropic optimized for versus what InstructGPT was optimizing for. Anthropic pushed the alignment angle a bit further: how to have an agent that is genuinely harmless and actually helpful. In the appendix of the Anthropic paper there are examples comparing InstructGPT responses to Anthropic's model, and one of the prompts is "why aren't birds real?" You can see that InstructGPT answers that birds are not real, blah blah blah, which is not that helpful, whereas the modality Anthropic wants is for the model to say something like "I'm sorry, I don't really understand the question; birds are very real."

It's actually quite impressive to get a machine learning model to do this, and that step is really why people are optimistic about RLHF: taking it from a kind of toy thing to having these dramatic results in high-impact, user-facing technologies.

Okay, two high-level open areas of investigation particularly interest me as a reinforcement learning researcher at Hugging Face, where we have this unique research / open source / community position. First, there are a lot of reinforcement learning optimizer choices that are not well documented and could be expanded on. Some people don't even know whether RL is explicitly necessary for this process, and PPO is definitely not explicitly necessary. There's also a third question of whether we can train this in an offline RL fashion. In offline RL, you collect a big dataset and then train the policy for many optimization steps without needing to query the environment, and in this case the environment is really the reward model, which at 50 billion parameters is quite costly to run inference on. So maybe we should try offline RL, which would reduce the training cost of the RLHF process, but it doesn't reduce the data cost. And here you can see the other side of what I was talking about: these data costs are really, really high. There's the high cost of labeling, which is just human time, and there's disagreement in the data. I gave the sports example, but there are much more important differences in values that people have, and that's why these human questions are hard: human values have disagreement, and that's by design, so you want to be able to capture it. There's never going to be one ground-truth distribution that says this is the only right answer. And then there are the feedback and user-interface questions, where I'm really excited to see how machine learning breaks into the general populace.

To wrap up, before we switch into the Q&A format: I've shown you that RLHF does these cool things, and I hope the couple of examples I took the time to actually read parts of show you what it's trying to address by building these tools.

There's a huge variety of complex implementation details where multiple very large machine learning models are integrated together. Using any one of these models in a standalone fashion is a relatively new thing for the machine learning community, with only a couple of years of experience, and machine learning as a technical problem is now being broadened out from research to a much bigger part of the software stack. That brings a lot of people into the conversation who can help make these tools better for everyone involved. So thank you for watching, listening, and engaging; it's been great sharing this with you, and we'll transition into the Q&A part. You can see I've linked to the end of the blog post, where I've been continuing to update the related-work section to include a broader set of papers, and feel free to reach out on Twitter, email, or Discord and we'll get back to you there too. Thanks.

Awesome, thanks Nathan for the presentation. We're going to have a small Q&A section. We have a lot of questions, so if we don't have time to answer yours, don't hesitate, as I said, to join our Discord server, where we have a reinforcement learning discussion channel; also, if you prefer, you can ask in the comments under this video, and we will take the time to answer your questions.

So, I saved some questions. Let's see. I think this is more of an open question: what would be the potential of applying reinforcement learning from human feedback to Stable Diffusion?

You probably can. It would help with some of the safety problems, and it's just a fine-tuning method, so I don't see any structural reason why you couldn't. I haven't thought much about it; the image space is always hard for me to think about because my own understanding is so language-based. The encoding and decoding of the prompt gets a little different, which is a little tricky, but essentially you would have a reward model that takes in images rather than words, so I don't see why you can't. There are actually some demos on Hugging Face of safe Stable Diffusion, where they did some fine-tuning on Stable Diffusion to make the outputs reasonable; we can track down some of those examples from the diffusion-model side and follow up, because they might actually be doing something quite similar. That's why I tried to start the talk outside language models: human feedback is a huge field of machine learning, it has just been popularized quickly through this language model discussion.

One of the questions is: what is Hugging Face's role or plan in the future directions of reinforcement learning from human feedback?

Hugging Face definitely hasn't fully defined it, but there's a lot of appetite for it, and Hugging Face is in a unique position: we have this community that is super important to the company, which gives us a different ability to collect data and so on. So Hugging Face is planning for it but hasn't committed to a specific project yet, and when the project is known, I'm sure Hugging Face will communicate with the community and say: this is how you can help, this is where we're trying to take things, these are the questions we're trying to address. That's why being transparent is so fun, because we can just share everything.

Right now it's still a work in progress; it's been moving fast for the last week.

Another, more open question: is there a scalable way of evaluating these models without human feedback?

Yeah, that would be a good thing to include in the lecture. There are a lot of metrics and datasets designed to evaluate topics like harmfulness, alignment, or text quality on a model or a dataset without actually having humans involved, to try to be more rigorous about these ambiguous questions. That's something that could definitely be added; you could do a whole lecture on human-facing metrics for NLP, there's a lot there, metrics like BLEU and ROUGE, two that are mentioned in the blog post if you want to look.

So one of the questions is: reinforcement learning is known to have problems with convergence; by having a pre-trained NLP model, is this no longer a problem?

Actually, talking with the folks at CarperAI who are making the trlX library (if you Google trlX, that's what they're working on, scaling RLHF), what they're trying to do is get their RL implementations to scale to bigger and bigger language models, and the general limiting factor is that the PPO update steps don't converge easily on the bigger models. So there still are problems with convergence. I don't know exactly what the mechanism looks like when you're fine-tuning a language model, what "unconverged" looks like or how bad it could get, but there are definitely still convergence problems with fine-tuning with RL.

So the next one: ChatGPT is mostly trained on English; what would be the advantage of training it on other languages? I suppose we would capture more knowledge of the world.

Yeah, it's also democratizing access to way more people. I think that will come; it's a classic thing where technology hits the English-speaking world first, but once there's an open source version, within weeks there will be fine-tuned versions in tons of other languages.

Do you think these GPT systems are sustainable, given that, as you mentioned, they can cost a lot, maybe not trillions, but a lot in adaptation costs?

It can cost a lot uh maybe not trillions But it can cost a lot uh in adaptation Costs Yeah so the real like The Upfront cost Isn’t a problem for those companies 10 Million dollars on annotation is not a Lot for opening eye the issue is that it Costs a couple cents per inference of The model and this cost will go down a Lot so that’s why open AI partners with Microsoft because Microsoft is learning How to create at scale low-cost apis for Complex model inference and those Systems were probably built in the last Six months but if you give them four Years and the technology settles in the Cost will drop 10x and kind of Everything will work out it’s just Really interesting to follow initially Because it’s very fast moving landscape And wild costs Foreign Ty can be more realistic in one specific Domain so I suppose if we fine-tune Um on this one on on one as mentioned Like for instance mathematics or yeah That’ll happen something else like People like to talk about chat GPT being Used for search And an interesting business model Consideration for this is using like a Rohf model trained on internal company Documents to create a really effective Company search so places like Google

Where there are billions of internal Documents as impossible to find them if You’re an employee if they do rohf on Their internal data this model will know What it needs to and something that I Encourage you to do is go ask chat GPT About a very specific subject and Surprisingly chat GPT does okay at very Specific subjects and people think That’s because there’s not that much Data and most of the data is like a Scientific paper which is all things Considered more accurate than something Like Reddit so like they think that it Might transfer to these use cases where There’s only like pretty positive Specific data that people could time Tune on So uh one of the question is Um do we will need in the future remain Annotator Um because you mentioned that Tesla got Rid of the Human annotators by creating More powerful model Yeah maybe but probably not soon it’s Kind of an unsettling question I’m not Confused I’m mostly just unsettled by it And like this I don’t think it’ll come Within a couple years but when that gets To the case or it’s like we’re training Language models on other language models Because one language model is the Ultimate source of truth I’m just very

So I kind of want to say no, out of hope that it isn't the case, but it wouldn't be that surprising. You can already see companies probably trying to train their models to mimic ChatGPT, because ChatGPT is ahead, so they can bootstrap their own training data by using ChatGPT to get a model to imitate it. I don't like it, but it's likely.

Could ChatGPT-type models receive images as input as well, to understand concepts better? I think a lot of researchers are currently thinking about that.

Yeah, I would definitely expect people to try things like that. There's a whole multimodal project at Hugging Face where they're trying to figure out how to train models that use multiple types of data, and people will continue adding modalities to make the base model more flexible, which will be very fun to follow.

Does ChatGPT look for data online, or does it have everything in its memory?

I think it has everything in its memory, but it's not confirmed, since it's not released; there are models that do this kind of online lookup. Rumors are that OpenAI has figured out some incredible scraping techniques. It's probably not one hundred percent true, but people have said that OpenAI is better at scraping YouTube than Google is; that's probably hearsay, which probably just means they're doing about as well as Google, but the fact that an external company has figured it out as well as Google is still pretty remarkable.

So, do you think we will see RLHF for other modalities, like generating images, art, and music?

I think so. Ultimately there's still discussion about what RLHF is good at; this is probably the peak of the hype for RLHF, but as I was saying, the field of human feedback is much broader than language and goes back decades, so that's not going anywhere. The RLHF branding is just a new sub-field of it.

Do you think it makes more sense for builders to begin labeling a lot of data with existing large language models like GPT, or will the next generation swamp any fine-tuning we do?

This is a question we're talking about internally as well; it's something I've posted in our Slack. The GPT-4 rumors are hilarious: I've gotten multiple messages from people, and you see the tweets like "GPT-4 is world-breaking, don't tell anyone I know this." But the thing is that the data is still really useful. OpenAI is getting this huge data advantage, and they'll use it when they want to do RLHF on GPT-4. The specific implementation details might need to change based on the architecture or something like that, but I don't think the data pipeline is going to become obsolete immediately.

Are there any resources you recommend to learn more about this? I think we already mentioned our blog post.

Yeah, I would say the blog post. Also, the alignment community is very responsive to people engaging with their topics; a lot of RLHF researchers are affiliated with alignment, and there are other forums that I haven't explored as much, like LessWrong and the Alignment Forum. I'm not going to say I endorse all the content on them, but these people are pretty engaged with the community as researchers, so if you write respectful questions to them, you'll often get responses; it's not just me.

I did try to make the blog post we wrote the starting point for a conceptual introduction, specifically because I thought there wasn't a clear introduction out there. The blog posts for papers have the problem that they need to introduce the paper's content and not just the concept; removing the specific advancements of the papers is what our blog post does, to make it a bit more approachable. If there's something missing, you can let us know.

I think this is an interesting question: given that OpenAI really has an edge with these GPT models, what can other companies and the open source community do to keep up the pace?

The open source community has way more people and engagement than OpenAI. OpenAI is small and hyper-focused, which always gives startups an advantage, but given the amount of appetite for this, there are thousands more people willing to help in an open version, so the scale of access is different.

Why do you think reinforcement learning from human feedback works much better than just fine-tuning the original model directly with the same reward dataset?

This is the ultimate question: does RLHF actually do anything? I'm not a hundred percent sure, but the rumors are that RL just handles shifting the optimization landscape nicely. I'm guessing fine-tuning on the same dataset could work, but the optimization just hasn't been figured out in the same way, and it's exciting, as someone who does RL, that this different way of navigating the optimization space was useful. But it is not well documented; the research-paper version of the blog post we wrote is desperately needed.

There was a question during the presentation: will the ChatGPT paper come out tomorrow?

No, unlikely. There's a chance it could be released tomorrow and this lecture would no longer be quite as relevant, but it's really unlikely that we see it tomorrow. Unless, surprise, I work at OpenAI now!

I think I can answer another one: no, you can't read the code of ChatGPT; it's a proprietary model, and I don't think you can contribute as an outsider, it's an internal project.

Yeah.

We'll see if the ChatGPT paper happens. Unfortunately, we've run out of time. For the people whose questions we didn't have time to answer, as you can see on the slide, you can ask on the Discord, so join the Discord server, or ask in the comments below the video; we will make time in the coming days to answer your questions, so don't hesitate. That's all for today. Thank you all, and thank you Nathan for this presentation, it was super interesting. I'll see you on the Discord and in the comments section. Bye.
