slop-cop/references/calibration.md
Density formula, severity tiers, genre adjustments, model fingerprints, contested-tell handling. Read this first.
slop-cop — Calibration
slop-cop scores prose on two parallel axes:
- AI-Slop — does this read like AI wrote it? (texture, rhythm, vocabulary tells, formatting)
- Comprehension — can a fresh reader follow this? (acronyms, named-entity bombing, telegraphic compression, missing thesis, structure, readability)
Each axis has its own catalog, its own density formula, and its own verdict. A piece can fail one and pass the other:
- Dense academic prose → may PASS AI-Slop (no delve, no em-dash clusters) but CRITICAL Comprehension (jargon-bombed, no thesis)
- Sycophantic ChatGPT marketing → CRITICAL AI-Slop, MEDIUM Comprehension
- Hand-written cover letter → PASS both
- Twitter-thread summary written by a human in a hurry → LOW AI-Slop, HIGH Comprehension (telegraphic, named-entity bombing)
The audit reports both verdicts. The combined recommendation is driven by whichever is worse.
Table of contents
AI-Slop axis (sections 1–8)
- AI-Slop density scoring
- Severity tiers (AI-Slop)
- Genre adjustments
- Model fingerprints
- Contested tells
- The sanding-off problem
- The uncanny-valley rule
- Burstiness approximation
Comprehension axis (sections 9–11)
- Comprehension density scoring
- Audience calibration
- Cross-axis recommendations
1. Density-based scoring
A single tell is not a signal. Real writers use individual tells all the time. The signal is how many show up per 500 words, weighted by severity.
The formula
For a draft of N words, normalize to 500-word units (U = N / 500). Count violations by severity:
- H = high-severity tells (always-cut items)
- M = medium-severity tells
- L = low-severity tells (informational only)
Compute the density score:
density = (H × 3) + (M × 1) + (L × 0.25) per 500 words
= ((H × 3) + (M × 1) + (L × 0.25)) / U for the full draft
Verdict thresholds
| Density score | Verdict | Action |
|---|---|---|
| 0–2 | PASS | Polish-pass at most |
| 2–5 | LOW | Spot-fix the listed items |
| 5–10 | MEDIUM | Significant cleanup needed, but spot-fixes are sufficient |
| 10–18 | HIGH | Substantial revision required |
| 18+ | CRITICAL | Recommend rewrite from scratch |
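
A minimal Python sketch of the formula and threshold table above. The names `density_score` and `verdict_for` are illustrative, not the scanner's API. The table's ranges share endpoints (a score of exactly 2 sits in both PASS and LOW); this sketch assumes a boundary value falls in the higher tier, which the source does not specify.

```python
def density_score(h: int, m: int, l: int, n_words: int) -> float:
    """Weighted tells per 500 words: ((H*3) + (M*1) + (L*0.25)) / U, with U = N / 500."""
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    units = n_words / 500
    return (h * 3 + m * 1 + l * 0.25) / units


# Lower bound of each tier, taken from the verdict table.
# Assumption: a score exactly on a boundary belongs to the higher tier.
TIERS = [(18, "CRITICAL"), (10, "HIGH"), (5, "MEDIUM"), (2, "LOW"), (0, "PASS")]


def verdict_for(density: float) -> str:
    for lower_bound, name in TIERS:
        if density >= lower_bound:
            return name
    return "PASS"


# Worked example: a 1,250-word draft with 4 H, 6 M, and 8 L tells.
# U = 2.5, weighted sum = 12 + 6 + 2 = 20, density = 20 / 2.5 = 8.0
score = density_score(h=4, m=6, l=8, n_words=1250)
print(score, verdict_for(score))  # 8.0 MEDIUM
```

Counting H, M, and L over the full draft and dividing by U once gives the same result as averaging per-window scores only when windows are equal in size, so the sketch uses the full-draft form.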
Compound triggers (escalate one tier)
- A high-severity rhetorical pattern + 5+ vocabulary hits + 1+ formatting tell within the same 500 words → escalate one tier
- Three or more H-severity tells in a single paragraph → escalate one tier
- The "uncanny valley" condition (see §7) → escalate one tier even when no individual tell is high-severity
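
The three triggers above could be checked roughly as follows. This is a sketch: the parameter names are hypothetical, the counts are assumed to come from the scanner for one 500-word window, and it assumes several triggers firing at once still escalate by only one tier (the source does not say whether they stack).

```python
TIER_ORDER = ["PASS", "LOW", "MEDIUM", "HIGH", "CRITICAL"]


def escalate(verdict: str, steps: int = 1) -> str:
    """Move a verdict up by `steps` tiers, capped at CRITICAL."""
    index = TIER_ORDER.index(verdict)
    return TIER_ORDER[min(index + steps, len(TIER_ORDER) - 1)]


def compound_trigger_fired(
    rhetorical_h_in_window: int,
    vocab_hits_in_window: int,
    formatting_tells_in_window: int,
    max_h_in_one_paragraph: int,
    uncanny_valley: bool,
) -> bool:
    """True if any of the three compound triggers applies."""
    stacked = (
        rhetorical_h_in_window >= 1
        and vocab_hits_in_window >= 5
        and formatting_tells_in_window >= 1
    )
    return stacked or max_h_in_one_paragraph >= 3 or uncanny_valley


verdict = "MEDIUM"
if compound_trigger_fired(1, 6, 1, 0, False):
    verdict = escalate(verdict)
print(verdict)  # HIGH
```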
What density does and doesn't tell you
It tells you whether the prose reads as AI-shaped. It does not tell you whether AI wrote it — humans imitate AI, and AI imitates humans. Treat the verdict as "this prose has the shape of AI writing," not "this prose was generated by AI."
2. Severity tiers (AI-Slop)
The patterns, vocabulary, and formatting-tells files all tag every item H/M/L. The definitions:
High (H) — always cut
The phrase or pattern is essentially never the right choice. Even one instance in casual prose lowers the verdict tier. Examples:
- Em dashes in clusters (3+ per 500 words)
- Bold-first bullets in any short prose piece
- Sycophancy openers/closers ("Great question!", "I hope this helps!")
- "Delve" / "tapestry" / "showcasing"
- Grandiose framing ("stands as a testament to")
- Copula avoidance ("serves as", "boasts")
- Knowledge-cutoff disclaimer leakage ("As of my last update...")
- Vague-authority weasels with no citation
Cut without exception unless the phrase is being used in scare quotes or ironically.
Medium (M) — usually cut
The phrase or pattern survives in narrow contexts. Default is to cut; keep only if the word/structure is doing specific work that nothing else can. Examples:
- "While X, Y" sentence opener (one is fine; three is a pattern)
- "actually" (survives only contrasting concrete reality with theory)
- Hedged superlatives (sometimes warranted in genuinely uncertain claims)
- Symmetrical sentence pairs (one is rhetoric; three is a tic)
- Two-word punchlines (once per piece is forgivable)
In an audit, M-severity items are listed with the question: is this doing specific work? If not, cut.
Low (L) — context-dependent
Weak tell on its own. Note in the audit report but don't down-score the verdict. Examples:
- Em dashes alone (1–2 in a long piece, post-GPT-5.1)
- Absent contractions (formal register may justify it)
- Universal Oxford comma + American spelling
- "Actually" used to contrast theory and reality
L items inform the diagnosis ("this prose has these formal-register tells") but do not tip the verdict.
3. Genre adjustments
The same tell carries different weight depending on genre. The audit infers genre from the draft (or accepts a --genre flag for the scanner) and adjusts thresholds.
Casual / first-person / blog / Reddit / email
Default thresholds. Every tell weighted at full. This is the strictest mode and the most common case — users invoking the skill on their own writing usually want this.
Marketing / sales / landing copy
Marketing copy legitimately uses some intensifiers ("transformative", "groundbreaking") and some structure (TL;DRs, bulleted benefits). Adjust:
- Reduce buzzword-density penalty by 30%
- "Comprehensive", "robust", "seamless" allowed at 1 instance per 500 words before flagging
- Sycophancy still always-cut (no genre justifies "I hope this helps!")
- Performative openers still always-cut
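
As an illustration of the marketing allowance, the sketch below counts how many uses of "comprehensive", "robust", and "seamless" would still be flagged. It assumes the allowance is per word (one free use of each per 500 words) and that drafts under 500 words count as one unit; the source could also be read as one free use across all three. `flaggable_buzzwords` is a hypothetical helper, not a scanner function.

```python
import re

ALLOWED_IN_MARKETING = ("comprehensive", "robust", "seamless")


def flaggable_buzzwords(text: str, genre: str) -> dict[str, int]:
    """Count instances of each listed word that should still be flagged."""
    words = re.findall(r"[a-z']+", text.lower())
    # Assumption: drafts shorter than 500 words are treated as one unit.
    units = max(len(words) / 500, 1.0)
    allowance = int(units) if genre == "marketing" else 0
    return {
        word: max(words.count(word) - allowance, 0)
        for word in ALLOWED_IN_MARKETING
    }


sample = "A robust, seamless, robust platform."
print(flaggable_buzzwords(sample, "marketing"))
# {'comprehensive': 0, 'robust': 1, 'seamless': 0}
print(flaggable_buzzwords(sample, "casual"))
# {'comprehensive': 0, 'robust': 2, 'seamless': 1}
```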
Academic / research / formal
Academic prose legitimately hedges ("studies show" with citations is fine), uses some passive constructions, and follows section conventions ("Methods", "Results"). Adjust:
- Vague-authority phrases: only flag when uncited. "Studies show [Smith 2024]" is not a tell.
- Hedge stacking: only flag when the hedging exceeds the genuine uncertainty (the research catalog notes that calibrated hedging is fine; saturation hedging is the tell)
- "Comprehensive review", "novel approach" allowed in title position
- "Challenges and Future Directions" section is normal here — only flag in non-academic prose
Encyclopedic / reference / Wikipedia-style
LLMs were trained heavily on Wikipedia. Encyclopedic prose triggers false positives across all detectors (GPTZero documents this). Adjust:
- Reduce all severity tiers by one for the duration of encyclopedic passages
- Copula avoidance ("serves as") still flagged — Wikipedia editors actually use "is" most of the time
- Synonym cycling more tolerated
- Burstiness threshold relaxed (encyclopedic prose runs uniform)
Fiction / dialogue / character voice
Voice-aware judgment. A character's voice may legitimately use any of these patterns, so apply the rules to narration but not to dialogue or interior monologue. The tells target the AI model's house voice; a strong character voice can override them.
If unsure of genre, default to casual (strictest). Users can override.
4. Model fingerprints
When density indicates AI shape, identify the likely model. This serves diagnosis ("this looks like Gemini, not Claude — adjust your prompts") and prompt engineering.
GPT-4 / 4o / 5
- Verb signature: delve, underscore, navigate, leverage, harness, showcase
- Adjective signature: noteworthy, commendable, intricate, meticulous, comprehensive
- Power words: supercharge, unleash, dive in, game-changing
- Trigrams: "individuals with diabetes", "characterized by elevated", "ranging from", "play a significant role"
- Format signature: heavy bullets and headers, em dashes (pre-5.1; opt-out exists since Nov 2025)
- Register: formal/clinical; reads like a slick consultancy deck
Claude
- Verb signature: examine, consider, distinguish, illuminate (lighter touch than GPT)
- Adjective signature: meaningful, careful, specific, worth examining
- Trigrams: "the distinction is worth", "meaningfully reduces", "I notice that", "it's worth examining"
- Format signature: clean paragraphs over heavy formatting; less bullet-heavy than GPT
- Sycophancy style: softer — "I notice…", "I should be careful here…", "it's worth examining…" rather than "Great question!"
- Register: academic-but-approachable
Gemini
- Verb signature: explore, navigate, understand (tutorial verbs)
- Adjective signature: simpler vocabulary than GPT (uses "high blood sugar" where GPT uses "elevated blood glucose levels")
- Trigrams: "the way for", "the cascade of", "is not a", "in the world of"
- Format signature: verbose; over-explains; longer paragraphs than necessary
- Register: "Google search result that learned to write paragraphs"
Reporting
The scanner uses these clusters as a heuristic. The audit report includes a "Likely model fingerprint" line: none / GPT / Claude / Gemini, with 2-3 specific markers as evidence. When two clusters are equally likely (mixed-model edits, or human polish on top of AI output), report "mixed" rather than picking.
Per Scientific American (cross-stylometry across thousands of outputs): all three models cluster tightly in stylometric space, while humans spread broadly. So the fingerprint signal is strong when present — but only when the prose is unedited or lightly edited.
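
The reporting rule (none / GPT / Claude / Gemini / mixed, with up to three markers as evidence) might be sketched like this. The marker lists are a subset of the clusters above. Plain substring matching, the two-hit minimum, and reporting "mixed" on an exact tie are assumptions made for the example, not documented scanner behavior.

```python
FINGERPRINTS = {
    "GPT": ["delve", "underscore", "leverage", "harness", "showcase",
            "noteworthy", "meticulous", "play a significant role"],
    "Claude": ["illuminate", "meaningful", "worth examining",
               "the distinction is worth", "i notice that"],
    "Gemini": ["explore", "the way for", "the cascade of", "in the world of"],
}


def likely_fingerprint(text: str, min_hits: int = 2) -> tuple[str, list[str]]:
    """Return (label, evidence); label is none / GPT / Claude / Gemini / mixed."""
    lowered = text.lower()
    hits = {
        model: [marker for marker in markers if marker in lowered]
        for model, markers in FINGERPRINTS.items()
    }
    best = max(len(found) for found in hits.values())
    if best < min_hits:
        return "none", []
    leaders = [model for model, found in hits.items() if len(found) == best]
    if len(leaders) > 1:
        # Equally likely clusters: report "mixed" rather than picking one.
        evidence = sorted(m for model in leaders for m in hits[model])
        return "mixed", evidence[:3]
    return leaders[0], hits[leaders[0]][:3]


sample = "Let's delve into how teams leverage data to showcase results."
print(likely_fingerprint(sample))  # ('GPT', ['delve', 'leverage', 'showcase'])
```

Substring matching over-counts (for example, "explore" also matches "explored"), which is acceptable for a heuristic but would need word-boundary handling in a real scanner.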
5. Contested tells
Some tells are contested in the research. The audit acknowledges contestation rather than pretending unanimity.
Em dashes
The most-cited AI tell of 2024-2025. Rolling Stone, TechRadar, NYT all covered it. But OpenAI added an em-dash opt-out in GPT-5.1 (Nov 2025), and many human writers (Cory Doctorow, Cormac McCarthy estate, half of literary fiction) use them constantly.
This skill's default: em dashes in clusters (3+ per 500 words) = H severity. Em dashes alone (1–2 in a long piece) = L severity. Single em dashes are noted but don't down-score.
User override: if the user is Mahmoud or any writer with an explicit no-em-dash voice rule, treat ALL em dashes as H. Pass --strict-em-dash to the scanner.
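A small sketch of the default and the strict override. It assumes "clusters" means em dashes per 500 words over the whole draft, with short drafts counted as one 500-word unit; the `strict` parameter stands in for the scanner's --strict-em-dash flag, and the function name is hypothetical.

```python
from typing import Optional

EM_DASH = "\u2014"


def em_dash_severity(text: str, strict: bool = False) -> Optional[str]:
    """Return "H", "L", or None for the em dashes in `text`."""
    count = text.count(EM_DASH)
    if count == 0:
        return None
    if strict:
        return "H"  # no-em-dash voice rule: every em dash is high severity
    # Assumption: drafts shorter than 500 words are treated as one unit.
    units = max(len(text.split()) / 500, 1.0)
    return "H" if count / units >= 3 else "L"


print(em_dash_severity("one \u2014 dash"))               # L
print(em_dash_severity("a \u2014 b \u2014 c \u2014 d"))  # H
print(em_dash_severity("one \u2014 dash", strict=True))  # H
```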
"Actually" and decorative adverbs
Most decorative adverbs ("genuinely", "truly", "honestly", "frankly", "ultimately") are H. But "actually" survives when contrasting concrete reality with theory ("the model actually works under load"). The rule: if removing the adverb leaves the sentence unchanged or stronger, it was filler. The scanner flags every instance; the audit uses judgment.
Em-dashed asides vs comma asides
When a writer has been told "no em dashes" and they convert em dashes to comma asides, the prose can read awkwardly punctuated. The audit notes when comma-aside density spikes — sometimes that's a signal of em-dash conversion rather than natural rhythm.
Tricolons in formal prose
Three-beat structures ("life, liberty, and the pursuit of happiness") are a literary tradition. They survive in speeches, formal essays, and explicitly rhetorical contexts. The pattern flag is for unintended tricolon abuse — three adjectives strung together because the model defaulted to it.
"From X to Y" ranges
Sometimes a real range. Sometimes false. The audit asks: are X and Y genuinely the endpoints of a spectrum, or just two illustrative examples? If the latter, the construction is a tell.
When a tell is contested, the audit notes the contestation in the calibration section of the report.
6. The sanding-off problem
Sophisticated authors prompt-engineer around famous tells. After "delve" went viral in early 2024, its frequency in arXiv papers dropped sharply within months. The flagship vocabulary list is now less reliable than it was.
Implication for the scanner
Newer / less-famous tells are weighted higher than the v1 vocabulary list. Specifically:
- Boost by 1.5x: copula avoidance ("serves as", "boasts"), present-participle "-ing" tails, anaphora abuse, false ranges, hedge stacking, "while X, Y" openers
- Standard weight: vocabulary tells from category 2A (verbs), 2B (metaphors), 2C (intensifiers) — the famous list
- Standard weight, but flagged when present: sycophancy openers/closers (RLHF artifact, hard to sand off)
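
One way to apply the boost, assuming it multiplies the severity weight from §1 (the source says these tells are "weighted higher" without saying exactly what is multiplied). The tell identifiers are hypothetical labels for the categories listed above.

```python
# Hypothetical identifiers for the less-famous structural tells.
BOOSTED_TELLS = {
    "copula_avoidance", "ing_tail", "anaphora_abuse",
    "false_range", "hedge_stacking", "while_x_y_opener",
}
SEVERITY_WEIGHT = {"H": 3.0, "M": 1.0, "L": 0.25}


def weighted_tell(tell_id: str, severity: str) -> float:
    """Severity weight, with the 1.5x boost for less-famous tells."""
    boost = 1.5 if tell_id in BOOSTED_TELLS else 1.0
    return SEVERITY_WEIGHT[severity] * boost


print(weighted_tell("copula_avoidance", "H"))  # 4.5
print(weighted_tell("delve", "H"))             # 3.0
```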
The rough heuristic
If a draft is clean on category 2A famous-vocabulary tells but dirty on category B sentence-level tells, that's a strong signal of sanded prose: the writer (or the prompt) removed the easy vocabulary tells but didn't catch the structural ones. The audit report flags this as "sanded-prose signature" when present.
Conversely, a draft heavy on famous vocabulary but clean on structural tells is more likely human imitation of AI than actual AI output.
7. The uncanny-valley rule
Multiple weak tells stacking causes "subliminal discomfort" — readers feel something is off before identifying why. Many sources describe this effect (LitHub, Pangram, The Ignorance Field Guide).
The trigger
When all three of the following are true:
- Zero high-severity tells
- Eight or more medium-or-low-severity tells per 500 words
- Burstiness ratio below 0.5 (sentence-length variance too uniform)
…escalate the verdict by one tier even though no individual violation is severe.
The diagnosis line in the audit reads: "Uncanny-valley pattern — multiple weak tells stacking. No single phrase reads as AI; the cumulative texture does."
This catches sanded prose (see §6) and well-prompted output where the writer removed the famous tells but didn't fix the underlying rhythm.
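The three-part trigger translates directly into a predicate. This sketch assumes the weak-tell rate is normalized the same way as in §1 (count divided by N / 500); the function name is illustrative.

```python
def uncanny_valley(h_count: int, m_count: int, l_count: int,
                   n_words: int, burstiness: float) -> bool:
    """True when all three uncanny-valley conditions hold."""
    if n_words <= 0:
        return False
    weak_per_500 = (m_count + l_count) / (n_words / 500)
    return h_count == 0 and weak_per_500 >= 8 and burstiness < 0.5


# 0 H, 7 M, 5 L in 600 words: 12 weak tells / 1.2 units = about 10 per 500 words.
print(uncanny_valley(0, 7, 5, 600, burstiness=0.35))  # True
print(uncanny_valley(1, 7, 5, 600, burstiness=0.35))  # False (an H tell is present)
```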
8. Burstiness approximation
Burstiness measures sentence-level variation in length and structure. The scanner approximates it as the standard deviation of sentence length divided by the mean sentence length. On that ratio, humans cluster around 0.6-1.2 and LLMs cluster around 0.2-0.4.
The scanner reports the burstiness ratio. The audit uses it as a hint, not a verdict — once a human edits AI output, burstiness rises and the signal weakens.
How the scanner computes it
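from math import sqrt

# `sentences` is a list of sentence strings; assumes whitespace-separated words
sentence_lengths = [len(s.split()) for s in sentences]
mean = sum(sentence_lengths) / len(sentence_lengths)
# population standard deviation of sentence length
std = sqrt(sum((x - mean) ** 2 for x in sentence_lengths) / len(sentence_lengths))
burstiness = std / mean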
Interpretation
- Below 0.3: strong AI rhythm signal
- 0.3 to 0.5: AI-leaning, but light editing can push prose into this range
- 0.5 to 0.8: ambiguous; light human polish on AI output, or naturally rhythmic AI prompting
- Above 0.8: strong human rhythm signal
Caveats
- Encyclopedic / reference prose runs uniform regardless of source. Burstiness is unreliable in that genre.
- Very short drafts (<100 words) don't have enough sentences to compute burstiness reliably. The scanner reports n/a below 5 sentences.
- Burstiness alone is never the verdict. It's one signal among many.
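
Putting the computation, the interpretation bands, and the five-sentence floor together gives something like the sketch below. The sentence splitter is a naive assumption (split after ., !, or ? followed by whitespace); a real scanner would need a proper tokenizer. The bands share the value 0.5, so this sketch places 0.5 in the ambiguous band, which the source does not specify.

```python
import re
from math import sqrt
from typing import Optional


def burstiness(text: str, min_sentences: int = 5) -> Optional[float]:
    """Std dev of sentence length / mean sentence length, or None (n/a)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) < min_sentences:
        return None
    lengths = [len(s.split()) for s in sentences]
    mean = sum(lengths) / len(lengths)
    std = sqrt(sum((x - mean) ** 2 for x in lengths) / len(lengths))
    return std / mean


def interpret(ratio: Optional[float]) -> str:
    """Map a burstiness ratio onto the interpretation bands."""
    if ratio is None:
        return "n/a"
    if ratio < 0.3:
        return "strong AI rhythm signal"
    if ratio < 0.5:
        return "AI-leaning"
    if ratio <= 0.8:
        return "ambiguous"
    return "strong human rhythm signal"


uniform = "One two three. Four five six. Seven eight nine. Ten eleven twelve. A b c."
print(burstiness(uniform), interpret(burstiness(uniform)))  # 0.0 strong AI rhythm signal
print(interpret(burstiness("Too short. To tell.")))         # n/a
```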
GPTZero, Pangram, and Quillbot all document burstiness as a metric and its limitations; the metric is widely used but increasingly bypassed by prompt engineering. See sources.md for citations.
How to use this file during an audit
- Run the scanner. It produces raw counts of H / M / L tells per axis, plus burstiness, readability metrics, and density signals.
- Compute both density scores (AI-Slop §1, Comprehension §9).
- Apply genre adjustments (§3) and audience calibration (§10).
- Check for compound triggers and the uncanny-valley condition (§7).
- Check for sanded-prose signature (§6).
- Identify the model fingerprint if present (§4).
- Note contested tells in the calibration section of the report (§5).
- Output both verdicts plus the cross-axis recommendation (§11).
The audit report template (audit-report-template.md) defines the exact output format, including the dual-verdict header and the Calibration Notes section that surfaces all of the above.
9. Comprehension density scoring
The comprehension axis uses the same density formula as the AI-Slop axis but counts a different catalog (the patterns in comprehension.md).
The formula
comp_density = (compH × 3) + (compM × 1) + (compL × 0.25) per 500 words
= ((compH × 3) + (compM × 1) + (compL × 0.25)) / U for the full draft
Where:
- compH = high-severity comprehension violations (acronym stacking, named-entity bombing, stat bombing, telegraphic colon-labeling, density-without-headings, long sentences, run-on sentences, coined terms used as known, curse of knowledge, buried lede, missing thesis, no topic sentence, first sentence doesn't hook)
- compM = medium-severity (wall of text, list-pretending-to-be-prose, definition-by-synonym, mixed audience, forward-reference, missing transitions, hierarchy collapse, no concrete examples, nut-graf missing, no skim layer, old-to-new inversion, parallelism failure, passive voice excess, nominalization, abstract noun stacking, hedge stacking, ambiguous pronoun, dangling modifier)
- compL = low-severity (glue-word bloat, prose-pretending-to-be-list, decorative qualifiers, negative construction)
Verdict thresholds
Same scale as AI-Slop:
| comp_density | Verdict | Action |
|---|---|---|
| 0–2 | PASS | Reader can follow it; polish at most |
| 2–5 | LOW | Spot-fix listed items |
| 5–10 | MEDIUM | Significant cleanup; reader will struggle in places |
| 10–18 | HIGH | Substantial revision; reader will lose the thread |
| 18+ | CRITICAL | Recommend rewrite; cold reader has no chance |
Compound triggers (escalate one tier)
- 3+ undefined acronyms in any 100-word window → escalate
- 5+ named entities introduced without context in any 100-word window → escalate
- 3+ numeric claims in a single sentence (no comparative anchor) → escalate
- 3+ telegraphic colon-labels in one paragraph → escalate
- Any sentence over 40 words → escalate
- Any paragraph over 150 words with no subheading → escalate
- Combined: any 100-word window with H-density > 5 → escalate
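
Two of these checks are mechanical enough to sketch: the 40-word sentence limit and the "N or more in any 100-word window" pattern. The helper names are hypothetical, the sentence split is naive, and deciding which words count as undefined acronyms or context-free named entities is left to the caller, who passes their word indices.

```python
import re


def long_sentences(text: str, limit: int = 40) -> list[str]:
    """Sentences longer than `limit` words (naive split after ., !, ?)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(s.split()) > limit]


def max_hits_in_window(hit_positions: list[int], window: int = 100) -> int:
    """Largest number of hits that fit inside one `window`-word span.

    `hit_positions` are word indices of flagged items.
    """
    positions = sorted(hit_positions)
    best = 0
    start = 0
    for end, pos in enumerate(positions):
        # Shrink the span from the left until it covers fewer than `window` words.
        while pos - positions[start] >= window:
            start += 1
        best = max(best, end - start + 1)
    return best


# Undefined acronyms at word indices 3, 50, 99, and 250:
# the first three share one 100-word window, so the acronym trigger fires.
print(max_hits_in_window([3, 50, 99, 250]) >= 3)  # True
```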
Readability metric panel
The scanner also computes 8 readability metrics (Flesch RE, FK Grade, SMOG, Coleman-Liau, Dale-Chall, lexical density, avg sentence length, passive voice %) — see readability-metrics.md. They appear in the audit report as a diagnostic panel under the comprehension verdict, but they don't directly feed the verdict score. The catalog patterns drive the score; the metrics calibrate.
When a piece scores PASS on patterns but the metrics show grade 16 / lexical density 68% / avg sentence 35 words, the audit notes the disconnect — typically academic prose where every individual sentence is fine but the cumulative texture is opaque.
Audience-aware scoring
The verdict is then adjusted by audience (see §10). A grade-12 score for technical docs is fine; for marketing copy it's HIGH.
10. Audience calibration
The same prose hits different verdicts depending on who's supposed to read it. Audience is the most important calibration input.
Threshold table by audience
| Audience | Flesch RE target | FK Grade target | Avg sentence | Passive % | Acronyms |
|---|---|---|---|---|---|
| General web / blog | 60–70 | 7–9 | 15–18 | <10% | Define all on first use |
| Marketing copy | 65–80 | 6–8 | 12–16 | <5% | Avoid; spell out every term |
| GOV.UK / civic / accessibility (WCAG AAA) | 70+ | 4–6 | 12–15 | <5% | Spell out always |
| Healthcare patient-facing | 70–80 | 6–8 | 12–15 | <5% | Spell out always |
| Tech blog (developer audience) | 50–65 | 9–12 | 18–22 | <10% | Define non-obvious only; standard ones (API, JSON, HTML, CSS) OK |
| Internal technical docs | 40–55 | 11–14 | 18–25 | <15% | Industry-standard OK |
| Academic / scientific | 30–50 | 12–16 | 20–28 | 10–20% | Field-standard OK |
How to apply
- Detect audience. The scanner infers from cues (citation patterns, code blocks, marketing CTAs, persona pronouns). User can override with the --audience flag.
- Map metrics to targets. For each readability metric, compute distance from the audience-specific target.
- Adjust verdict. A piece that scores HIGH on the catalog but sits well within the audience's metric band may be downgraded to MEDIUM. A piece that PASSes the catalog but overshoots the metric band by 50% may be escalated to a worse verdict.
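
The "distance from target" step might look like the sketch below. Only three audiences and three metrics from the table are included. Reading the 50% figure as distance past the nearest band edge, relative to that edge, is an assumption; the source does not define the measure.

```python
# Subset of the threshold table: (low, high) target bands per metric.
AUDIENCE_TARGETS = {
    "general_web": {"flesch_re": (60, 70), "fk_grade": (7, 9), "avg_sentence": (15, 18)},
    "marketing": {"flesch_re": (65, 80), "fk_grade": (6, 8), "avg_sentence": (12, 16)},
    "academic": {"flesch_re": (30, 50), "fk_grade": (12, 16), "avg_sentence": (20, 28)},
}


def band_overshoot(value: float, band: tuple[float, float]) -> float:
    """Relative distance outside the band; 0.0 when the value is inside it."""
    low, high = band
    if value < low:
        return (low - value) / low
    if value > high:
        return (value - high) / high
    return 0.0


# An FK grade of 12 is inside the academic band but 50% past the marketing band.
print(band_overshoot(12, AUDIENCE_TARGETS["academic"]["fk_grade"]))   # 0.0
print(band_overshoot(12, AUDIENCE_TARGETS["marketing"]["fk_grade"]))  # 0.5
```

The sketch treats both directions the same. For Flesch RE, landing above the band means easier prose, so a real implementation would probably penalize only the harder-to-read side.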
Reader-test simulations (qualitative)
When the scanner can't tell, fall back on these:
- Smart 12-year-old test (Feynman): Could a smart 12-year-old or someone outside the field follow this? If you can't explain it simply, you don't truly understand it.
- Cold-reader test (Pinker): Show the draft to someone who hasn't been working on it. Ask: What's the main point? Where did you get confused? What terms did you not know? This is the prescription for exorcising the curse of knowledge.
- 5-second skim test: Show the page for 5 seconds. Ask "what did you see?" Tests whether the H1, first sentence, bolded keywords, and TL;DR convey the gist. (55% of web visitors leave within 15 seconds.)
- Cloze test (empirical): Delete every 6th word; have the reader fill in the blanks. Higher restoration rate = more comprehensible. >57% exact restoration = mastery.
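
The cloze procedure is simple to mechanize. The sketch below blanks every 6th word and scores a reader's guesses. Ignoring case and surrounding punctuation when comparing is an assumption made for the example; both function names are hypothetical.

```python
def make_cloze(text: str, every: int = 6) -> tuple[str, list[str]]:
    """Blank every `every`-th word; return the gapped text and removed words."""
    words = text.split()
    answers = []
    for i in range(every - 1, len(words), every):
        answers.append(words[i])
        words[i] = "_____"
    return " ".join(words), answers


def exact_restoration_rate(answers: list[str], guesses: list[str]) -> float:
    """Share of blanks the reader restored exactly."""
    def norm(word: str) -> str:
        return word.strip(".,;:!?\"'()").lower()

    if not answers:
        return 0.0
    correct = sum(norm(a) == norm(g) for a, g in zip(answers, guesses))
    return correct / len(answers)


gapped, answers = make_cloze(
    "one two three four five six seven eight nine ten eleven twelve"
)
print(gapped)   # one two three four five _____ seven eight nine ten eleven _____
print(answers)  # ['six', 'twelve']
print(exact_restoration_rate(answers, ["six", "dozen"]))  # 0.5
```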
When the audience is unknown
Default to casual (general web/blog). It's the strictest practical baseline and produces the most actionable verdict for unspecified contexts.
11. Cross-axis recommendations
When both verdicts are computed, the audit produces a single combined recommendation based on whichever axis is worse. The matrix:
| AI-Slop | Comprehension | Combined recommendation |
|---|---|---|
| PASS | PASS | Ship it. Polish-pass at most. |
| PASS / LOW | LOW | Spot-fix the comprehension items. Reader will follow with minor friction. |
| PASS / LOW | MEDIUM | Significant comprehension cleanup. Define acronyms, break up paragraphs, add a thesis. |
| PASS / LOW | HIGH / CRITICAL | Comprehension rewrite. The texture is fine but the reader can't follow. |
| MEDIUM | PASS / LOW | Slop spot-fix. Replace delve/em-dashes/sycophancy. Reader can follow already. |
| MEDIUM | MEDIUM | Both cleanup. Often the same fixes (telegraphic em-dashes hurt both axes). |
| HIGH / CRITICAL | PASS / LOW | Slop rewrite. Replace AI texture; reader-friendly structure already exists. |
| HIGH / CRITICAL | HIGH / CRITICAL | Full rewrite. Both axes failing = the prose isn't salvageable through editing. |
Top-fix combination
The audit's "Top 3 fixes" list pulls from both axes, ordered by impact:
- The single highest-impact item from whichever axis scored worse
- The highest-impact item from the other axis
- The next highest-impact item from whichever axis scored worse
This way the reader gets the most leverage in the smallest read.
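The ordering rule can be sketched as below. It assumes each axis hands over its fixes already sorted by impact, highest first, and that a tie between verdicts treats the AI-Slop axis as the worse one; the source does not define a tie-break.

```python
TIER_ORDER = ["PASS", "LOW", "MEDIUM", "HIGH", "CRITICAL"]


def top_three_fixes(slop_verdict: str, comp_verdict: str,
                    slop_fixes: list[str], comp_fixes: list[str]) -> list[str]:
    """Order fixes: worse axis, other axis, worse axis again."""
    slop_worse = TIER_ORDER.index(slop_verdict) >= TIER_ORDER.index(comp_verdict)
    worse, other = (slop_fixes, comp_fixes) if slop_worse else (comp_fixes, slop_fixes)
    return worse[:1] + other[:1] + worse[1:2]


print(top_three_fixes(
    "LOW", "HIGH",
    slop_fixes=["cut 'delve'"],
    comp_fixes=["add a thesis", "define acronyms"],
))
# ['add a thesis', "cut 'delve'", 'define acronyms']
```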
What "passes" means in this dual-axis world
A piece "passes slop-cop" when both verdicts are PASS or LOW. A piece can technically pass the AI-Slop axis with HIGH Comprehension and still be unshippable for any audience that isn't already initiated.
This is the gap that drove v2: the tool used to say "PASS" on prose that no fresh reader could follow. Two axes fix the gap.
Sources for comprehension calibration
- CDC Clear Communication Index — reading-level benchmarks for healthcare
- GOV.UK style guide — civic-content readability targets
- Microsoft style guide — technical-content guidance
- WCAG 3.1.5 Reading Level (AAA) — accessibility threshold
- Pinker on the curse of knowledge — Harvard
- Cloze test — NN/g
- F-pattern reading — NN/g
The full bibliography is in sources.md.