RLHF, DPO, GRPO: the alphabet soup of preference data, demystified
If you've done any preference-labelling work for an AI lab in the past three years, you've been labelling data for one of three training algorithms — usually without being told which. The rubric you got, the pay rate you were offered, and the cognitive load of the work all depend on which one. This is a working guide for evaluators who want to read the room before they accept a brief.
The base layer: SFT
Before any of the preference methods come in, every modern LLM goes through supervised fine-tuning (SFT). The lab takes a base pre-trained model and shows it thousands of (prompt, ideal response) pairs. The model learns to produce the kinds of answers the lab wants.
SFT data is written by humans — that's annotation work, not preference work. Pay tends to be the highest of any single-pass labelling job (often $40–80/hr) because writing a really good response from scratch is much harder than picking between two model outputs.
The preference methods below come after SFT, to refine the model's outputs into something users would actually prefer.
RLHF (Reinforcement Learning from Human Feedback)
The original. Published in OpenAI's 2022 InstructGPT paper, made famous by ChatGPT.
What you do as a rater: see a prompt and 2–7 candidate responses. Rank them best-to-worst, or pick the single best. Sometimes you also rate each response on a Likert scale (1–7) for individual properties: helpfulness, harmlessness, factuality.
How it gets used: the lab trains a separate reward model on your rankings — a neural network that scores any (prompt, response) pair the way you would. Then they use reinforcement learning to push the LLM toward higher reward-model scores.
What the rubric usually asks: compare on multiple axes at once and produce a single ranking. The rubrics tend to be long, with explicit tiebreakers ("if responses are equally helpful, prefer the shorter one; if equally harmless, prefer the more direct refusal"). Many labs ask for free-text justification on each ranking — a one-sentence reason why A beat B.
Tells: if you're seeing 4+ responses to compare, dense Likert grids, and a justification box at the bottom — you're in RLHF.
DPO (Direct Preference Optimization)
Published by Rafailov et al. in May 2023. The pitch: skip the separate reward model and train the LLM directly on the pairwise preferences. Simpler pipeline, comparable quality, much lower compute cost. Most labs have switched to DPO or a DPO variant for non-frontier model training.
What you do as a rater: see a prompt and exactly two responses. Pick which one is better. That's usually it — no Likert scores, no ranking, no justifications.
How it gets used: the lab feeds your (prompt, winner, loser) triples directly into the training loop. The model learns to maximize the probability of the winning response relative to the losing one.
What the rubric usually asks: short and binary. "Which is more helpful?" or "Which is more accurate?" — pick one. Some labs add a "they're tied" option to filter out low-signal pairs.
Tells: only ever two responses, single-question rubric, fast UX. Pay tends to be a little lower than RLHF because the cognitive load per item is much lower.
GRPO (Group Relative Policy Optimization)
Published by DeepSeek in early 2024 and quickly adopted by frontier reasoning labs (most notably for chain-of-thought reasoning training in models like o-series and r1-class systems). The pitch: for tasks where you can verify the right answer (math problems, code that compiles and passes tests, theorem proofs), you don't need humans to rank — you can let the model generate many candidate solutions and reward the ones that get the right answer.
What you do as a rater: for the parts of GRPO that do need humans, you write or verify the ground truth, not the preference. Example workflows:
- Write a math problem with a known answer + a reference solution path
- Verify whether the model's chain-of-thought is logically valid even when it gets the right final answer (catches "right answer for the wrong reason")
- Write unit tests for code-generation tasks so the verifier can score model outputs
How it gets used: the model generates many candidate solutions to each problem; an automated verifier scores them; the model is trained to prefer the solutions that scored well relative to their group.
What the rubric usually asks: rigour, not preference. Did the chain-of-thought actually justify the answer? Is your reference solution correct under multiple solving approaches? Are your unit tests comprehensive enough to catch shortcut reasoning?
Tells: the brief mentions "verifier," "reference solution," "chain-of-thought validity," "ground truth answer," or "reward signal." Pay is the highest of any preference-data work — typically $60–120/hr because the bar for "is this reasoning step valid?" requires real domain expertise.
How to tell which one a brief is for
Read the interface, then the rubric, in that order:
- One response, asking you to write the gold answer? SFT.
- 4+ responses, asking you to rank and justify? RLHF.
- Two responses, single binary question? DPO.
- One response, asking you to verify a chain of reasoning or write ground truth for a verifier? GRPO.
You won't usually be told explicitly. Labs tend to keep their training methodology opaque to raters because the rubric is supposed to capture everything you need to know. But knowing the method tells you what the rubric is actually measuring, which makes you faster and more consistent — both things that move you up the platform's internal quality rankings.
What's coming next
The frontier is hybrid. The methods you'll see in 2026 and later increasingly combine SFT for the base behavior, DPO for the bulk preference signal, RLHF for the long-tail safety judgments, and GRPO for the reasoning steps. Some shops call this "RL+verifier+preferences" and there's no settled industry name yet.
Practically, this means a single rater might cycle through all four UIs in a single weekend's work for one lab. Knowing which one you're in keeps you efficient. Treating them all the same gets you flagged for low-quality output on the methods that need depth, or for too-slow throughput on the methods that need volume.