The single highest-leverage thing an evaluator does isn't the work. It's choosing which briefs to accept. A well-scoped brief at $80/hr earns you $80/hr. A poorly-scoped brief at $80/hr earns you $40/hr by the time you finish answering appeals, redoing rejected items, and arguing with calibration. The difference between the two is visible in the rubric, and it takes about five minutes to see.
This is the read I run on every brief before I let anyone on our roster accept it. None of it is secret; the platforms that pay you simply have no incentive to teach it.
Step 1: Find the failure mode the rubric is actually trying to prevent
Every rubric exists to stop a specific bad outcome from making it into the model. Sometimes the document tells you which one — "we are training the model to refuse medical-diagnosis requests from non-clinicians." Often it doesn't, and you have to read between the lines.
The shortcut: look at the most-punished failure mode in the scoring matrix. The thing that drops you from a 5 to a 1 is the thing the lab is actually afraid of. If the rubric says "any factual error → score 1", they're training a knowledge model and they need ground truth. If it says "any safety violation → score 1, any small refusal of a safe question → score 2", they're training a safety model and they'd rather over-refuse than under-refuse.
Knowing which failure mode you're optimising against is most of the job. The rest is recognising it on the page.
Step 2: Count the unscored states
Every rubric defines what a 5 looks like and what a 1 looks like. Good rubrics also define the 3 — the middle. Bad rubrics leave it implicit, and that's where the appeal wars happen.
Concretely: scan the rubric for the words "borderline", "ambiguous", "unclear", "judgement call", or any worked example that lives at score 3. If none of those exist, the lab is asking you to score a continuum on a 1–5 scale without telling you where the cuts are. Whatever you put down, half the time the calibration team will disagree. That's not a sign you're a bad rater — it's a sign the rubric isn't ready.
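A minimal sketch of that scan in Python, assuming the rubric has been pasted into a plain-text string. The function names, the "score 3" pattern, and the extra American spelling "judgment call" are illustrative assumptions, not part of any platform's tooling.

```python
import re

# Marker phrases from the paragraph above. "judgment call" is an added
# spelling variant (assumption); extend the list to suit the rubric.
MIDDLE_MARKERS = ["borderline", "ambiguous", "unclear", "judgement call", "judgment call"]


def find_middle_markers(rubric_text: str) -> dict:
    """Count case-insensitive, whole-phrase hits for each marker."""
    counts = {}
    for marker in MIDDLE_MARKERS:
        pattern = r"\b" + re.escape(marker) + r"\b"
        hits = re.findall(pattern, rubric_text, flags=re.IGNORECASE)
        if hits:
            counts[marker] = len(hits)
    return counts


def has_score_3_example(rubric_text: str) -> bool:
    """Rough heuristic: does 'score 3' (or 'score: 3') appear anywhere?"""
    return re.search(r"\bscore\s*[:=]?\s*3\b", rubric_text, flags=re.IGNORECASE) is not None


sample = "Score 5: fully correct. Score 1: any factual error. Borderline cases are a judgement call."
print(find_middle_markers(sample))   # {'borderline': 1, 'judgement call': 1}
print(has_score_3_example(sample))   # False
```

A hit only tells you the rubric mentions the middle; you still have to read whether it actually says where the cuts are.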
Step 3: Find the dependency the rubric doesn't admit
Most rubric defects are hidden dependencies — a rule on page 2 that secretly depends on something on page 6 that the writer forgot. Examples we see weekly:
- "Score 1 if the model refuses a safe request" — but the rubric's own definition of "safe" lives in a footnote three pages later, and the footnote uses a category that overlaps with the "always refuse" list.
- "Score the answer for factual accuracy" — but the brief's example questions include opinions, predictions, and policy questions, none of which have a factual ground truth at all.
- "Score the model's reasoning, not just its conclusion" — but the rubric's worked examples only score the conclusion, so calibration silently rewards you for ignoring the rule.
When you find one, flag it before you accept. A serious employer will go back to the lab and either fix the rubric or pull the brief. A non-serious one will tell you to "use your best judgement", which is code for "we know it's broken and we hope you absorb the cost."
Step 4: Read one calibration item end-to-end
Most briefs ship with 3–10 worked examples ("calibration items") — a sample input, a sample model output, and the rubric's official score with explanation. These are gold. If the official answer surprises you, you've just learned what the rubric actually scores, regardless of what it claims to score.
The test I run: cover up the official score, score the item yourself, then reveal. If I get within ±1 of the official score on 4 out of 5 calibration items, the rubric is internally consistent and I'll accept the brief. If I get 1 out of 5, the rubric is either broken or measuring something the prose didn't tell me about, and I'll either decline or open a clarifying-question thread before the first real item.
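The blind-scoring test can be written down as a small Python sketch. The 4-of-5 threshold and the ±1 tolerance come from the paragraph above; scaling the threshold to other item counts, and treating every result below it as "decline or clarify", are assumptions made for illustration.

```python
def calibration_hits(my_scores, official_scores, tolerance=1):
    """Number of items where my blind score is within `tolerance` of the official one."""
    if len(my_scores) != len(official_scores):
        raise ValueError("score lists must be the same length")
    return sum(abs(mine - official) <= tolerance
               for mine, official in zip(my_scores, official_scores))


def calibration_verdict(my_scores, official_scores, tolerance=1):
    """'accept' at 4-of-5 agreement or better (scaled to the item count), else 'decline_or_clarify'."""
    hits = calibration_hits(my_scores, official_scores, tolerance)
    # Integer comparison avoids float rounding: hits / n >= 4 / 5.
    if hits * 5 >= 4 * len(my_scores):
        return "accept"
    return "decline_or_clarify"


mine = [5, 3, 1, 4, 2]
official = [4, 3, 3, 5, 2]
print(calibration_hits(mine, official))     # 4
print(calibration_verdict(mine, official))  # accept
```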
Step 5: Check the pay shape against the work shape
A brief paid at $0.40/item where each item takes 30 minutes works out to $0.80/hr. The same $0.40/item at an average of 90 seconds/item is $16/hr. Paid hourly at $40/hr, it is $40/hr regardless of pace. Read these three things from the brief page before you click "accept":
- Unit of pay. Per item, per hour, per rollout, per brief, or a hybrid.
- Expected time per unit. If the brief doesn't volunteer this, ask. A serious employer publishes a target — "median rater finishes one item in 4 minutes" — because they want raters self-selecting honestly.
- Quality bar that flips you between tiers. Some per-item briefs pay 1.5× or 2× for "expert-tier" items. Others pay nothing extra. The expected hourly is a function of which tier you land in, not the headline rate.
If the unit is per item and the expected time per unit isn't disclosed, ask before accepting. If the answer is vague, the brief is not yet ready to staff.
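The per-item arithmetic in this step is easy to check with a few lines of Python. This is a sketch: the function and parameter names are hypothetical, and `share_at_tier` (the fraction of your items that earn the tier multiplier) is a number you would have to estimate yourself.

```python
def hourly_from_per_item(rate_per_item, seconds_per_item):
    """Convert a per-item rate and an average time per item into an hourly figure."""
    if seconds_per_item <= 0:
        raise ValueError("seconds_per_item must be positive")
    return rate_per_item * 3600 / seconds_per_item


def hourly_with_tier(rate_per_item, seconds_per_item, tier_multiplier=1.0, share_at_tier=0.0):
    """Same conversion, blending in a tier bonus paid on a fraction of items."""
    blended_rate = rate_per_item * (1 + (tier_multiplier - 1) * share_at_tier)
    return hourly_from_per_item(blended_rate, seconds_per_item)


print(round(hourly_from_per_item(0.40, 30 * 60), 2))  # 0.8  (30 minutes per item)
print(round(hourly_from_per_item(0.40, 90), 2))       # 16.0 (90 seconds per item)
print(round(hourly_with_tier(0.40, 90, tier_multiplier=1.5, share_at_tier=0.5), 2))  # 20.0
```

The last line shows why the tier matters: if half your items land in a 1.5× tier, the same $0.40/item brief at 90 seconds/item moves from $16/hr to $20/hr.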
Step 6: Identify your safe exit
Every brief should have a clear answer to two questions: "What happens to my pay if calibration disagrees with me on most items?" and "What happens if I have to drop the brief halfway through?" If the brief doesn't tell you, ask. The good answers are: "calibration disputes go to a third reviewer, and you're paid for whichever score holds" and "you're paid for everything you completed at acceptance quality, no clawback." If either answer is missing, treat the brief as higher-risk and price the risk into your expected hourly.
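One way to "price the risk in" is a simple probability-weighted hourly, sketched here in Python. The model itself is an assumption rather than anything a brief will publish: `p_paid_in_full` is your own estimate of the chance that your work is paid at the full rate, and `fraction_paid_otherwise` is the share of pay you expect to keep if it is not.

```python
def risk_adjusted_hourly(headline_hourly, p_paid_in_full, fraction_paid_otherwise=0.0):
    """Expected hourly when some work may be clawed back or go unpaid (illustrative model)."""
    if not 0.0 <= p_paid_in_full <= 1.0:
        raise ValueError("p_paid_in_full must be between 0 and 1")
    if not 0.0 <= fraction_paid_otherwise <= 1.0:
        raise ValueError("fraction_paid_otherwise must be between 0 and 1")
    paid_share = p_paid_in_full + (1.0 - p_paid_in_full) * fraction_paid_otherwise
    return headline_hourly * paid_share


# A $40/hr brief with no dispute or exit terms, where you guess 20% of the work
# might go entirely unpaid, is worth comparing against a $32/hr brief with clean terms.
print(round(risk_adjusted_hourly(40.0, 0.8), 2))  # 32.0
```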
The five-minute checklist
To summarise:
- Which failure mode is this rubric optimising against?
- Is there a worked example at score 3, not just at score 1 and score 5?
- Are there any cross-page dependencies that contradict each other?
- Do I get within ±1 of the official answer on 4 of 5 calibration items?
- Is the pay shape disclosed end-to-end (rate, expected time, tier bonuses, dispute rules)?
- Is there a clean exit path if the brief turns out to be broken?
Five minutes. Done before you click accept. The briefs that pass this read pay full rate; the briefs that don't, almost never do.
What we do at OBG
We run this same checklist on every brief before it reaches the roster. If the rubric fails steps 2 or 3, we send it back to the lab with the specific defects flagged. If the pay shape fails step 5, we renegotiate or decline the brief on the roster's behalf. The roster only sees briefs that pass the read.
That's most of the work of being a specialist employer rather than a platform. The platforms hand you whatever the lab wrote and let you absorb the cost of rubric defects through appeals and unpaid time. We do the reading for you before the brief opens.