A binary classifier answers yes or no. You are tracking how often it gets the answer right: maybe 85%, maybe 92%. That single number hides something important: there are two different ways to be wrong, and they do not cost the same.
One kind of wrong pulls in data that undermines the credibility of your product. The other kind loses data that will take effort to get back (if you can get it back at all). Until you know which one is happening, you cannot make a good decision about what to fix.
Two kinds of wrong
Imagine you have 100 job postings. 30 of them are cybersecurity roles. Your classifier flags 20 as cybersecurity.
Of those 20 flags, 18 are correct and 2 are not. Those 2 are false positives: the classifier said yes when the answer was no. A facilities manager posting got flagged as a cybersecurity role. Your users see irrelevant listings, and over time that erodes the portal's credibility. On the ops side, a reviewer has to spend time removing listings that should never have been there.
But there were 30 real cybersecurity jobs and the classifier only found 18. The other 12 are false negatives: the classifier said no when the answer was yes. Your portal may be missing important listings, and nobody knows they are missing.
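A quick way to see the two error types is to count them directly. Here is a minimal sketch in Python using the made-up numbers from this example (the label encoding and variable names are just for illustration):

```python
# Worked example: 100 postings, 30 truly cybersecurity, 20 flagged, 18 of them correctly.
true_labels = [1] * 30 + [0] * 70                        # 1 = cybersecurity, 0 = everything else
predictions = [1] * 18 + [0] * 12 + [1] * 2 + [0] * 68   # aligned with true_labels

tp = sum(1 for t, p in zip(true_labels, predictions) if t == 1 and p == 1)  # correctly flagged
fp = sum(1 for t, p in zip(true_labels, predictions) if t == 0 and p == 1)  # flagged, but not cybersecurity
fn = sum(1 for t, p in zip(true_labels, predictions) if t == 1 and p == 0)  # real roles the classifier missed
tn = sum(1 for t, p in zip(true_labels, predictions) if t == 0 and p == 0)  # correctly ignored

print(tp, fp, fn, tn)  # 18 2 12 68
```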
False positives create noise. False negatives create gaps.
Precision and recall
These two failure modes have names.
Precision measures the false positive problem. Out of everything the classifier flagged, how many were actually correct? In the example above, 18 out of 20 flags were real, so precision is 90%. High precision means less noise for your reviewers.
Recall measures the false negative problem. Out of everything that was actually positive, how many did we catch? In the example, we caught 18 out of 30 real cybersecurity jobs, so recall is 60%. High recall means fewer gaps.
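The arithmetic is small enough to do in your head, but here it is as a sketch, using the counts from the example above:

```python
tp, fp, fn = 18, 2, 12

precision = tp / (tp + fp)  # 18 / 20 = 0.90 -> how much of what we flagged was real
recall = tp / (tp + fn)     # 18 / 30 = 0.60 -> how much of what was real we caught

print(f"precision={precision:.0%}, recall={recall:.0%}")  # precision=90%, recall=60%
```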
Which error costs you more
This is the question that matters. It is a product decision, not a technical one.
On a job classifier I rebuilt, the original system had 89% precision and 68% recall. When it flagged a job, it was usually right. But it was missing nearly a third of the cybersecurity roles, and nobody knew they were missing until a human operator went hunting for them, spending hours every day.
Missing a real high-paying cybersecurity job listing was worse than bringing in a non-cybersecurity job. So we optimized for recall.
The tradeoff is real: tightening the classifier to reduce false positives usually means it starts missing more real ones. Loosening it to catch more usually means it flags more junk. You cannot maximize both. You pick the side that matters more for your use case. On one project we ran an ensemble: an ML model tuned for recall, an LLM tuned for precision, and a human reviewer who focused on the cases where the two disagreed.
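The disagreement-routing idea can be sketched as a small function. This is not the production code, just the shape of it: `ml_flag` and `llm_flag` stand in for the two models' yes/no verdicts.

```python
def route(ml_flag: bool, llm_flag: bool) -> str:
    """Route one job posting based on the two classifiers' verdicts.

    ml_flag:  verdict from the recall-oriented ML model (casts a wide net).
    llm_flag: verdict from the precision-oriented LLM check (strict).
    """
    if ml_flag and llm_flag:
        return "accept"        # both agree it is a cybersecurity role
    if not ml_flag and not llm_flag:
        return "reject"        # both agree it is not
    return "human_review"      # disagreement: the hard-to-judge cases go to a reviewer
```

The point of the design is that the reviewer's time goes only to the ambiguous slice, instead of to everything the system touched.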
F1: a single number for dashboards
Sometimes you need one metric to put on a slide or a dashboard. F1 combines precision and recall into a single score between 0 and 1.
What makes it useful: F1 is the harmonic mean of precision and recall, so if either metric is low, the score drops toward it. A classifier with 95% precision and 50% recall looks decent if you average the two (72.5%). Its F1 is 65.5%, which better reflects how skewed the system actually is.
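The gap between the plain average and F1 is easy to verify:

```python
precision, recall = 0.95, 0.50

arithmetic_mean = (precision + recall) / 2          # 0.725 -> looks fine
f1 = 2 * precision * recall / (precision + recall)  # ~0.655 -> exposes the low recall

print(f"average={arithmetic_mean:.1%}, F1={f1:.1%}")  # average=72.5%, F1=65.5%
```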
The job classifier started at F1 77%. After the rebuild, the LLM component reached F1 90%. Those two numbers communicate the improvement without requiring the audience to parse four separate metrics.
F1 is a summary, not a replacement. It does not tell you which side of the tradeoff is hurting you. Use it for reporting, but look at precision and recall separately when making tuning decisions.
Closing
A single accuracy number tells you how often the classifier is right. Precision and recall tell you how it is wrong. That distinction is what turns “the classifier needs to be better” into a specific conversation: are we losing data, or are we creating noise? Which one matters more? That is the conversation that leads to better system design.
Appendix
The confusion matrix
| | Actually positive | Actually negative |
|---|---|---|
| Predicted positive | True Positive (TP) | False Positive (FP) |
| Predicted negative | False Negative (FN) | True Negative (TN) |
Formulas
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
- F1: 2 × (precision × recall) / (precision + recall)
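The formulas translate directly into a few lines of Python. This is a sketch, checked against the worked example from the post (18 TP, 2 FP, 12 FN):

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

p, r = precision(18, 2), recall(18, 12)
print(p, r, f1(p, r))  # 0.9, 0.6, 0.72
```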
Further reading
- From 68% Recall to 95%: Fixing a Job Classifier Without Increasing LLM Spend: the full case study behind the examples in this post
- 8 Standards for Building Production-Ready Features Using LLMs: standard 2 covers how to define performance thresholds and decide whether to optimize for precision or recall