Density formula, severity tiers, genre adjustments, model fingerprints, contested-tell handling. Read this first.

slop-cop — Calibration

slop-cop scores prose on two parallel axes:

AI-Slop — does this read like AI wrote it? (texture, rhythm, vocabulary tells, formatting)
Comprehension — can a fresh reader follow this? (acronyms, named-entity bombing, telegraphic compression, missing thesis, structure, readability)

Each axis has its own catalog, its own density formula, and its own verdict. A piece can fail one and pass the other:

Dense academic prose → may PASS AI-Slop (no delve, no em-dash clusters) but CRITICAL Comprehension (jargon-bombed, no thesis)
Sycophantic ChatGPT marketing → CRITICAL AI-Slop, MEDIUM Comprehension
Hand-written cover letter → PASS both
Twitter-thread summary written by a human in a hurry → LOW AI-Slop, HIGH Comprehension (telegraphic, named-entity bombing)

The audit reports both verdicts. The combined recommendation is driven by whichever is worse.

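Table of contents

AI-Slop axis (sections 1–8)

1. Density-based scoring
2. Severity tiers explicit
3. Genre adjustments
4. Model fingerprints
5. Contested tells
6. The sanding-off problem
7. The uncanny-valley rule
8. Burstiness approximation

Comprehension axis (sections 9–11)

9. Comprehension density scoring
10. Audience calibration
11. Cross-axis recommendations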

1. Density-based scoring

A single tell is not a signal. Real writers use individual tells all the time. The signal is how many show up per 500 words, weighted by severity.

The formula

For a draft of N words, normalize to 500-word units (U = N / 500). Count violations by severity:

H = high-severity tells (always-cut items)
M = medium-severity tells
L = low-severity tells (informational only)

Compute the density score:

density = (H × 3) + (M × 1) + (L × 0.25)    per 500 words
        = ((H × 3) + (M × 1) + (L × 0.25)) / U   for the full draft
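
As a rough illustration, the formula can be sketched in a few lines. The function name `density_score` and its parameters are assumptions made for this example; they are not taken from the scanner itself.

```python
def density_score(h: int, m: int, l: int, word_count: int) -> float:
    """Severity-weighted tell density, normalized to 500-word units."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    units = word_count / 500             # U = N / 500
    weighted = h * 3 + m * 1 + l * 0.25  # raw weighted count for the full draft
    return weighted / units


# A 1,250-word draft with 4 high, 6 medium, and 8 low tells:
# weighted = 12 + 6 + 2 = 20, U = 2.5, so density = 8.0
print(density_score(4, 6, 8, 1250))
```

Dividing by U keeps long drafts from being penalized simply for being long: the same 20 weighted points spread over 5,000 words would score 2.0.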

Verdict thresholds

Density score	Verdict	Action
0–2	PASS	Polish-pass at most
2–5	LOW	Spot-fix the listed items
5–10	MEDIUM	Spot-fixes still suffice, but expect significant cleanup
10–18	HIGH	Substantial revision required
18+	CRITICAL	Recommend rewrite from scratch
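
The table can be read as a simple lookup. The sketch below is one possible mapping; because adjacent bands share a boundary value (2, 5, 10, 18), it assumes each band includes its lower bound and excludes its upper bound. That tie-breaking choice and the name `verdict` are assumptions, not something the table specifies.

```python
def verdict(density: float) -> str:
    """Map a density score to a verdict tier."""
    # Assumption: a score sitting exactly on a boundary falls into the
    # worse band, so 2.0 reads as LOW rather than PASS.
    bands = [(2, "PASS"), (5, "LOW"), (10, "MEDIUM"), (18, "HIGH")]
    for upper, label in bands:
        if density < upper:
            return label
    return "CRITICAL"


print(verdict(8.0))  # MEDIUM
```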

Compound triggers (escalate one tier)

A high-severity rhetorical pattern + 5+ vocabulary hits + 1+ formatting tell within the same 500 words → escalate one tier
Three or more H-severity tells in a single paragraph → escalate one tier
The "uncanny valley" condition (see §7) → escalate one tier even when no individual tell is high-severity
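
A minimal sketch of the escalation step, under two assumptions the list above does not settle: multiple triggers do not stack (any number of fired triggers moves the verdict by one tier), and CRITICAL is the ceiling. The names `escalate` and `paragraph_trigger` are hypothetical.

```python
TIERS = ["PASS", "LOW", "MEDIUM", "HIGH", "CRITICAL"]


def paragraph_trigger(h_counts_per_paragraph: list[int]) -> bool:
    """True when any single paragraph holds three or more H-severity tells."""
    return any(count >= 3 for count in h_counts_per_paragraph)


def escalate(current: str, trigger_fired: bool) -> str:
    """Move the verdict one tier worse when a compound trigger fired."""
    if not trigger_fired:
        return current
    index = TIERS.index(current)
    # Assumption: CRITICAL cannot escalate further.
    return TIERS[min(index + 1, len(TIERS) - 1)]


print(escalate("LOW", paragraph_trigger([0, 3, 1])))  # MEDIUM
```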

What density does and doesn't tell you

It tells you whether the prose reads as AI-shaped. It does not tell you whether AI wrote it — humans imitate AI, and AI imitates humans. Treat the verdict as "this prose has the shape of AI writing," not "this prose was generated by AI."

2. Severity tiers explicit

The patterns, vocabulary, and formatting-tells files all tag every item H/M/L. The definitions:

High (H) — always cut

The phrase or pattern is essentially never the right choice. Even one instance in casual prose pushes the verdict toward a worse tier. Examples:

Em dashes in clusters (3+ per 500 words)
Bold-first bullets in any short prose piece
Sycophancy openers/closers ("Great question!", "I hope this helps!")
"Delve" / "tapestry" / "showcasing"
Grandiose framing ("stands as a testament to")
Copula avoidance ("serves as", "boasts")
Knowledge-cutoff disclaimer leakage ("As of my last update...")
Vague-authority weasels with no citation

Cut without exception unless the phrase is being used in scare quotes or ironically.

Medium (M) — usually cut

The phrase or pattern survives in narrow contexts. Default is to cut; keep only if the word/structure is doing specific work that nothing else can. Examples:

"While X, Y" sentence opener (one is fine; three is a pattern)
"actually" (survives only contrasting concrete reality with theory)
Hedged superlatives (sometimes warranted in genuinely uncertain claims)
Symmetrical sentence pairs (one is rhetoric; three is a tic)
Two-word punchlines (once per piece is forgivable)

In an audit, M-severity items are listed with the question: is this doing specific work? If not, cut.

Low (L) — context-dependent

Weak tell on its own. Note in the audit report but don't down-score the verdict. Examples:

Em dashes alone (1–2 in a long piece, post-GPT-5.1)
Absent contractions (formal register may justify it)
Universal Oxford comma + American spelling
"Actually" used to contrast theory and reality

L items inform the diagnosis ("this prose has these formal-register tells") but do not tip the verdict.

3. Genre adjustments

The same tell carries different weight depending on genre. The audit infers genre from the draft (or accepts a --genre flag for the scanner) and adjusts thresholds.

Casual / first-person / blog / Reddit / email

Default thresholds. Every tell carries full weight. This is the strictest mode and the most common case, since users invoking the skill on their own writing usually want this.

Marketing / sales / landing copy

Marketing copy legitimately uses some intensifiers ("transformative", "groundbreaking") and some structure (TL;DRs, bulleted benefits). Adjust:

Reduce buzzword-density penalty by 30%
"Comprehensive", "robust", "seamless" allowed at 1 instance per 500 words before flagging
Sycophancy still always-cut (no genre justifies "I hope this helps!")
Performative openers still always-cut

Academic / research / formal

Academic prose legitimately hedges ("studies show" with citations is fine), uses some passive constructions, and follows section conventions ("Methods", "Results"). Adjust:

Vague-authority phrases: only flag when uncited. "Studies show [Smith 2024]" is not a tell.
Hedge stacking: only flag when the hedging exceeds the genuine uncertainty (the research catalog notes that calibrated hedging is fine; saturation hedging is the tell)
"Comprehensive review", "novel approach" allowed in title position
"Challenges and Future Directions" section is normal here — only flag in non-academic prose

Encyclopedic / reference / Wikipedia-style

LLMs were trained heavily on Wikipedia. Encyclopedic prose triggers false positives across all detectors (GPTZero documents this). Adjust:

Reduce all severity tiers by one for the duration of encyclopedic passages
Copula avoidance ("serves as") still flagged — Wikipedia editors actually use "is" most of the time
Synonym cycling more tolerated
Burstiness threshold relaxed (encyclopedic prose runs uniform)

Fiction / dialogue / character voice

Voice-aware judgment. A character's voice may legitimately use any of these patterns. Apply the rules to narration but not to dialogue or interior monologue. The tells catalog targets the AI model's house voice; a strong character voice can override it.

If unsure of genre, default to casual (strictest). Users can override.

4. Model fingerprints

When density indicates AI shape, identify the likely model. This serves diagnosis ("this looks like Gemini, not Claude — adjust your prompts") and prompt engineering.

GPT-4 / 4o / 5

Verb signature: delve, underscore, navigate, leverage, harness, showcase
Adjective signature: noteworthy, commendable, intricate, meticulous, comprehensive
Power words: supercharge, unleash, dive in, game-changing
Trigrams: "individuals with diabetes", "characterized by elevated", "ranging from", "play a significant role"
Format signature: heavy bullets and headers, em dashes (pre-5.1; opt-out exists since Nov 2025)
Register: formal/clinical; reads like a slick consultancy deck

Claude

Verb signature: examine, consider, distinguish, illuminate (lighter touch than GPT)
Adjective signature: meaningful, careful, specific, worth examining
Trigrams: "the distinction is worth", "meaningfully reduces", "I notice that", "it's worth examining"
Format signature: clean paragraphs over heavy formatting; less bullet-heavy than GPT
Sycophancy style: softer — "I notice…", "I should be careful here…", "it's worth examining…" rather than "Great question!"
Register: academic-but-approachable

Gemini

Verb signature: explore, navigate, understand (tutorial verbs)
Adjective signature: simpler vocabulary than GPT (uses "high blood sugar" where GPT uses "elevated blood glucose levels")
Trigrams: "the way for", "the cascade of", "is not a", "in the world of"
Format signature: verbose; over-explains; longer paragraphs than necessary
Register: "Google search result that learned to write paragraphs"

Reporting

The scanner uses these clusters as a heuristic. The audit report includes a "Likely model fingerprint" line: none / GPT / Claude / Gemini, with 2-3 specific markers as evidence. When two clusters are equally likely (mixed-model edits, or human polish on top of AI output), report "mixed" rather than picking.
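
One way such a heuristic could look is sketched below. The marker lists are a small subset of the signatures listed above, and the scoring rule (count distinct markers, report "mixed" on a tie) is an assumption for illustration; the scanner's actual method is not documented here.

```python
import re

# Subset of the signature words and trigrams from the lists above.
MARKERS = {
    "GPT": ["delve", "underscore", "leverage", "harness", "showcase", "meticulous"],
    "Claude": ["examine", "distinguish", "illuminate", "i notice that", "worth examining"],
    "Gemini": ["explore", "understand", "in the world of", "the cascade of"],
}


def likely_fingerprint(text: str) -> tuple[str, list[str]]:
    """Return (label, up to three evidence markers) for a draft."""
    lowered = text.lower()
    hits = {
        model: [m for m in markers if re.search(r"\b" + re.escape(m), lowered)]
        for model, markers in MARKERS.items()
    }
    best = max(len(found) for found in hits.values())
    if best == 0:
        return "none", []
    leaders = [model for model, found in hits.items() if len(found) == best]
    if len(leaders) > 1:
        return "mixed", []  # equally likely clusters: do not pick one
    return leaders[0], hits[leaders[0]][:3]


print(likely_fingerprint("We delve into how teams leverage and harness data."))
```

Matching on word prefixes means "delves" and "delving" both count as hits for "delve"; a real scanner would likely want stricter tokenization.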

Per Scientific American (cross-stylometry across thousands of outputs): all three models cluster tightly in stylometric space, while humans spread broadly. So the fingerprint signal is strong when present — but only when the prose is unedited or lightly edited.

5. Contested tells

Some tells are contested in the research. The audit acknowledges contestation rather than pretending unanimity.

Em dashes

The most-cited AI tell of 2024-2025. Rolling Stone, TechRadar, NYT all covered it. But OpenAI added an em-dash opt-out in GPT-5.1 (Nov 2025), and many human writers (Cory Doctorow, Cormac McCarthy estate, half of literary fiction) use them constantly.

This skill's default: em dashes in clusters (3+ per 500 words) = H severity. Em dashes alone (1–2 in a long piece) = L severity. Single em dashes are noted but don't down-score.

User override: if the user is Mahmoud or any writer with an explicit no-em-dash voice rule, treat ALL em dashes as H. Pass --strict-em-dash to the scanner.
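
The default and the override can be sketched together. The `strict` parameter mirrors the `--strict-em-dash` flag; the function name and the handling of drafts under 500 words (treated as a single unit, so one em dash in a short note is not scaled up into a cluster) are assumptions.

```python
from typing import Optional

EM_DASH = "\u2014"


def em_dash_severity(text: str, strict: bool = False) -> Optional[str]:
    """Return "H", "L", or None when the draft has no em dashes."""
    count = text.count(EM_DASH)
    if count == 0:
        return None
    if strict:
        return "H"  # explicit no-em-dash voice rule: every em dash is H
    # Assumption: drafts shorter than 500 words count as one unit.
    units = max(len(text.split()) / 500, 1.0)
    return "H" if count / units >= 3 else "L"


print(em_dash_severity("one \u2014 dash in a short note"))  # L
```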

"Actually" and decorative adverbs

Most decorative adverbs ("genuinely", "truly", "honestly", "frankly", "ultimately") are H. But "actually" survives when contrasting concrete reality with theory ("the model actually works under load"). The rule: if removing the adverb leaves the sentence unchanged or stronger, it was filler. The scanner flags every instance; the audit uses judgment.

Em-dashed asides vs comma asides

When a writer has been told "no em dashes" and they convert em dashes to comma asides, the prose can read awkwardly punctuated. The audit notes when comma-aside density spikes — sometimes that's a signal of em-dash conversion rather than natural rhythm.

Tricolons in formal prose

Three-beat structures ("life, liberty, and the pursuit of happiness") are a literary tradition. They survive in speeches, formal essays, and explicitly rhetorical contexts. The pattern flag is for unintended tricolon abuse — three adjectives strung together because the model defaulted to it.

"From X to Y" ranges

Sometimes a real range. Sometimes false. The audit asks: are X and Y genuinely the endpoints of a spectrum, or just two illustrative examples? If the latter, the construction is a tell.

When a tell is contested, the audit notes the contestation in the calibration section of the report.

6. The sanding-off problem

Sophisticated authors prompt-engineer around famous tells. After "delve" went viral in early 2024, its frequency on arXiv dropped sharply within months. The flagship vocabulary list is now less reliable than it was.

Implication for the scanner

Newer / less-famous tells are weighted higher than the v1 vocabulary list. Specifically:

Boost by 1.5x: copula avoidance ("serves as", "boasts"), present-participle "-ing" tails, anaphora abuse, false ranges, hedge stacking, "while X, Y" openers
Standard weight: vocabulary tells from category 2A (verbs), 2B (metaphors), 2C (intensifiers) — the famous list
Standard weight, but flagged when present: sycophancy openers/closers (RLHF artifact, hard to sand off)
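
A sketch of how the boost might combine with the severity weights from §1. The text does not say what the 1.5x multiplies, so this example assumes it scales the per-tell severity weight. The category keys are hypothetical labels, not identifiers from the catalog files.

```python
SEVERITY_WEIGHT = {"H": 3.0, "M": 1.0, "L": 0.25}

# Hypothetical keys for the structural tells that get the 1.5x boost.
BOOSTED = {
    "copula_avoidance",
    "ing_tail",
    "anaphora_abuse",
    "false_range",
    "hedge_stacking",
    "while_opener",
}


def tell_weight(category: str, severity: str) -> float:
    """Weight of one tell, boosted for the harder-to-sand structural tells."""
    base = SEVERITY_WEIGHT[severity]
    return base * 1.5 if category in BOOSTED else base


print(tell_weight("copula_avoidance", "H"))  # 4.5
```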

The rough heuristic

If a draft is clean on category 2A famous-vocabulary tells but dirty on category B sentence-level tells, that's a strong signal of sanded prose: the writer (or the prompt) removed the easy vocabulary tells but didn't catch the structural ones. The audit report flags this as "sanded-prose signature" when present.

Conversely, a draft heavy on famous vocabulary but clean on structural tells is more likely human imitation of AI than actual AI output.

7. The uncanny-valley rule

Stacked weak tells cause "subliminal discomfort": readers feel something is off before they can identify why. Many sources describe this effect (LitHub, Pangram, The Ignorance Field Guide).

The trigger

When all three of the following are true:

Zero high-severity tells
Eight or more medium-or-low-severity tells per 500 words
Burstiness ratio below 0.5 (sentence-length variance too uniform)

…escalate the verdict by one tier even though no individual violation is severe.

The diagnosis line in the audit reads: "Uncanny-valley pattern — multiple weak tells stacking. No single phrase reads as AI; the cumulative texture does."
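
The three-part condition translates directly into a boolean check. This is a sketch only; the function name and argument layout are assumptions.

```python
def uncanny_valley(h: int, m: int, l: int, word_count: int, burstiness: float) -> bool:
    """True when all three uncanny-valley conditions hold."""
    units = word_count / 500
    weak_per_500 = (m + l) / units       # medium-or-low tells per 500 words
    return h == 0 and weak_per_500 >= 8 and burstiness < 0.5


# 1,000 words, no H tells, 12 M and 6 L tells (9 per 500 words), flat rhythm
print(uncanny_valley(0, 12, 6, 1000, 0.35))  # True
```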

This catches sanded prose (see §6) and well-prompted output where the writer removed the famous tells but didn't fix the underlying rhythm.

8. Burstiness approximation

Burstiness measures sentence-level variance: variation in length and structure. The ratio is the standard deviation of sentence length divided by the mean sentence length. Humans cluster around 0.6–1.2; LLMs cluster around 0.2–0.4.

The scanner reports the burstiness ratio. The audit uses it as a hint, not a verdict — once a human edits AI output, burstiness rises and the signal weakens.

How the scanner computes it

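from math import sqrt

# sentences: list of sentence strings produced by the scanner's splitter
sentence_lengths = [len(s.split()) for s in sentences]
mean = sum(sentence_lengths) / len(sentence_lengths)
# population standard deviation of sentence length
std = sqrt(sum((x - mean) ** 2 for x in sentence_lengths) / len(sentence_lengths))
burstiness = std / mean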

Interpretation

Below 0.3: strong AI rhythm signal
0.3 to 0.5: AI-leaning, but light editing can push prose into this range
0.5 to 0.8: ambiguous; light human polish on AI output, or naturally rhythmic AI prompting
Above 0.8: strong human rhythm signal

Caveats

Encyclopedic / reference prose runs uniform regardless of source. Burstiness is unreliable in that genre.
Very short drafts (<100 words) don't have enough sentences to compute burstiness reliably. The scanner reports n/a below 5 sentences.
Burstiness alone is never the verdict. It's one signal among many.
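
Putting the computation, the interpretation bands, and the five-sentence floor together gives something like the sketch below. The regex sentence splitter is a naive stand-in (it will mis-split abbreviations), and the boundary handling at exactly 0.5 and 0.8 is an assumption, since the bands above share their endpoints.

```python
import re
from statistics import mean, pstdev
from typing import Optional


def burstiness(text: str) -> Optional[float]:
    """Std / mean of sentence length in words; None below 5 sentences."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) < 5:
        return None  # reported as n/a
    lengths = [len(s.split()) for s in sentences]
    return pstdev(lengths) / mean(lengths)


def interpret(ratio: Optional[float]) -> str:
    if ratio is None:
        return "n/a"
    if ratio < 0.3:
        return "strong AI rhythm signal"
    if ratio < 0.5:
        return "AI-leaning"
    if ratio <= 0.8:
        return "ambiguous"
    return "strong human rhythm signal"


print(interpret(burstiness("Short one. Too few.")))  # n/a
```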

GPTZero, Pangram, and Quillbot all document burstiness as a metric and its limitations; the metric is widely used but increasingly bypassed by prompt engineering. See sources.md for citations.

How to use this file during an audit

Run the scanner. It produces raw counts of H / M / L tells per axis, plus burstiness, readability metrics, and density signals.
Compute both density scores (AI-Slop §1, Comprehension §9).
Apply genre adjustments (§3) and audience calibration (§10).
Check for compound triggers and the uncanny-valley condition (§7).
Check for sanded-prose signature (§6).
Identify the model fingerprint if present (§4).
Note contested tells in the calibration section of the report (§5).
Output both verdicts plus the cross-axis recommendation (§11).

The audit report template (audit-report-template.md) defines the exact output format, including the dual-verdict header and the Calibration Notes section that surfaces all of the above.

9. Comprehension density scoring

The comprehension axis uses the same density formula as the AI-Slop axis but counts a different catalog (the patterns in comprehension.md).

The formula

comp_density = (compH × 3) + (compM × 1) + (compL × 0.25)    per 500 words
            = ((compH × 3) + (compM × 1) + (compL × 0.25)) / U   for the full draft

Where:

compH = high-severity comprehension violations (acronym stacking, named-entity bombing, stat bombing, telegraphic colon-labeling, density-without-headings, long sentences, run-on sentences, coined terms used as known, curse of knowledge, buried lede, missing thesis, no topic sentence, first sentence doesn't hook)
compM = medium-severity (wall of text, list-pretending-to-be-prose, definition-by-synonym, mixed audience, forward-reference, missing transitions, hierarchy collapse, no concrete examples, nut-graf missing, no skim layer, old-to-new inversion, parallelism failure, passive voice excess, nominalization, abstract noun stacking, hedge stacking, ambiguous pronoun, dangling modifier)
compL = low-severity (glue-word bloat, prose-pretending-to-be-list, decorative qualifiers, negative construction)

Verdict thresholds

Same scale as AI-Slop:

comp_density	Verdict	Action
0–2	PASS	Reader can follow it; polish at most
2–5	LOW	Spot-fix listed items
5–10	MEDIUM	Significant cleanup; reader will struggle in places
10–18	HIGH	Substantial revision; reader will lose the thread
18+	CRITICAL	Recommend rewrite; cold reader has no chance

Compound triggers (escalate one tier)

3+ undefined acronyms in any 100-word window → escalate
5+ named entities introduced without context in any 100-word window → escalate
3+ numeric claims in a single sentence (no comparative anchor) → escalate
3+ telegraphic colon-labels in one paragraph → escalate
Any sentence over 40 words → escalate
Any paragraph over 150 words with no subheading → escalate
Combined: any 100-word window with H-density > 5 → escalate
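
The windowed triggers (undefined acronyms, uncontextualized named entities) share one shape: at least k flagged items inside any 100-word span. A sketch of that check, plus the long-sentence trigger, follows. It assumes the scanner already supplies the word index of each flagged item; both function names are hypothetical.

```python
def window_trigger(positions: list[int], threshold: int, window: int = 100) -> bool:
    """True if any `window`-word span contains at least `threshold` flagged items.

    `positions` holds the word index of each flagged item in the draft.
    """
    ordered = sorted(positions)
    for i in range(len(ordered) - threshold + 1):
        # Distance from the i-th hit to the hit `threshold - 1` places later.
        if ordered[i + threshold - 1] - ordered[i] < window:
            return True
    return False


def long_sentence_trigger(sentences: list[str], limit: int = 40) -> bool:
    """True if any sentence runs over `limit` words."""
    return any(len(s.split()) > limit for s in sentences)


# Three undefined acronyms at words 10, 50, and 109 fit in one 100-word span.
print(window_trigger([10, 50, 109], threshold=3))  # True
```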

Readability metric panel

The scanner also computes 8 readability metrics (Flesch RE, FK Grade, SMOG, Coleman-Liau, Dale-Chall, lexical density, avg sentence length, passive voice %) — see readability-metrics.md. They appear in the audit report as a diagnostic panel under the comprehension verdict, but they don't directly feed the verdict score. The catalog patterns drive the score; the metrics calibrate.

When a piece scores PASS on patterns but the metrics show grade 16 / lexical density 68% / avg sentence 35 words, the audit notes the disconnect — typically academic prose where every individual sentence is fine but the cumulative texture is opaque.

Audience-aware scoring

The verdict is then adjusted by audience (see §10). A grade-12 score for technical docs is fine; for marketing copy it's HIGH.

10. Audience calibration

The same prose hits different verdicts depending on who's supposed to read it. Audience is the most important calibration input.

Threshold table by audience

Audience	Flesch RE target	FK Grade target	Avg sentence	Passive %	Acronyms
General web / blog	60–70	7–9	15–18	<10%	Define all on first use
Marketing copy	65–80	6–8	12–16	<5%	Avoid; spell out every term
GOV.UK / civic / accessibility (WCAG AAA)	70+	4–6	12–15	<5%	Spell out always
Healthcare patient-facing	70–80	6–8	12–15	<5%	Spell out always
Tech blog (developer audience)	50–65	9–12	18–22	<10%	Define non-obvious only; standard ones (API, JSON, HTML, CSS) OK
Internal technical docs	40–55	11–14	18–25	<15%	Industry-standard OK
Academic / scientific	30–50	12–16	20–28	10–20%	Field-standard OK

How to apply

Detect audience. The scanner infers from cues (citation patterns, code blocks, marketing CTAs, persona pronouns). User can override with --audience flag.
Map metrics to targets. For each readability metric, compute distance from the audience-specific target.
Adjust verdict. A piece that scores HIGH on the catalog but sits well within the audience's metric band may be downgraded to MEDIUM. A piece that PASSes the catalog but blows the metric band by 50% may be escalated to a worse tier.
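
The "distance from target" step might be sketched as follows for FK Grade, using the bands from the threshold table. The dictionary keys are hypothetical audience labels, and measuring the overshoot relative to the band's upper edge is an assumption; the text does not define how the 50% figure is computed.

```python
# FK Grade target bands (low, high) taken from the threshold table.
FK_GRADE_TARGETS = {
    "general_web": (7, 9),
    "marketing": (6, 8),
    "civic": (4, 6),
    "healthcare": (6, 8),
    "tech_blog": (9, 12),
    "internal_docs": (11, 14),
    "academic": (12, 16),
}


def grade_overshoot(audience: str, fk_grade: float) -> float:
    """Fraction by which the grade exceeds the band's upper edge (0.0 if it does not)."""
    _, high = FK_GRADE_TARGETS[audience]
    if fk_grade <= high:
        return 0.0
    return (fk_grade - high) / high


# Grade-12 prose aimed at a marketing audience overshoots the band by 50%.
print(grade_overshoot("marketing", 12))  # 0.5
```

Only overshoot is penalized here because prose that reads below the target grade is rarely a comprehension problem; other metrics, such as Flesch RE, run in the opposite direction and would need the comparison flipped.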

Reader-test simulations (qualitative)

When the scanner can't tell, fall back on these:

Smart 12-year-old test (Feynman): Could a smart 12-year-old or someone outside the field follow this? If you can't explain it simply, you don't truly understand it.
Cold-reader test (Pinker): Show the draft to someone who hasn't been working on it. Ask: What's the main point? Where did you get confused? What terms did you not know? This is the prescription for exorcising the curse of knowledge.
5-second skim test: Show the page for 5 seconds. Ask "what did you see?" Tests whether the H1, first sentence, bolded keywords, and TL;DR convey the gist. (55% of web visitors leave within 15 seconds.)
Cloze test (empirical): Delete every 6th word; have the reader fill in the blanks. Higher restoration rate = more comprehensible. >57% exact restoration = mastery.
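
The cloze procedure is mechanical enough to script. The sketch below blanks every 6th word and scores a reader's guesses; treating matches as case-insensitive and leaving punctuation attached to words are simplifying assumptions, and both function names are hypothetical.

```python
def make_cloze(text: str, every: int = 6, blank: str = "_____") -> tuple[str, list[str]]:
    """Blank out every `every`-th word; return the test text and the answers."""
    words = text.split()
    answers = []
    for i in range(every - 1, len(words), every):
        answers.append(words[i])
        words[i] = blank
    return " ".join(words), answers


def restoration_rate(answers: list[str], guesses: list[str]) -> float:
    """Share of blanks restored exactly (case-insensitive)."""
    exact = sum(a.lower() == g.lower() for a, g in zip(answers, guesses))
    return exact / len(answers)


cloze, answers = make_cloze(
    "one two three four five six seven eight nine ten eleven twelve"
)
print(cloze)    # one two three four five _____ seven eight nine ten eleven _____
print(answers)  # ['six', 'twelve']
```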

When the audience is unknown

Default to casual (general web/blog). It's the strictest practical baseline and produces the most actionable verdict for unspecified contexts.

11. Cross-axis recommendations

When both verdicts are computed, the audit produces a single combined recommendation based on whichever axis is worse. The matrix:

AI-Slop	Comprehension	Combined recommendation
PASS	PASS	Ship it. Polish-pass at most.
PASS / LOW	LOW	Spot-fix the comprehension items. Reader will follow with minor friction.
PASS / LOW	MEDIUM	Significant comprehension cleanup. Define acronyms, break up paragraphs, add a thesis.
PASS / LOW	HIGH / CRITICAL	Comprehension rewrite. The texture is fine but the reader can't follow.
MEDIUM	PASS / LOW	Slop spot-fix. Replace `delve`/em-dashes/sycophancy. Reader can follow already.
MEDIUM	MEDIUM	Both cleanup. Often the same fixes (telegraphic em-dashes hurt both axes).
HIGH / CRITICAL	PASS / LOW	Slop rewrite. Replace AI texture; reader-friendly structure already exists.
HIGH / CRITICAL	HIGH / CRITICAL	Full rewrite. Both axes failing = the prose isn't salvageable through editing.
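
Since the matrix is driven by the worse axis, the first step is picking it. A minimal sketch; reporting Comprehension on a tie is an assumption (the matrix treats tied rows such as MEDIUM/MEDIUM as "both"), and `worse_axis` is a hypothetical name.

```python
TIERS = ["PASS", "LOW", "MEDIUM", "HIGH", "CRITICAL"]


def worse_axis(slop: str, comprehension: str) -> tuple[str, str]:
    """Return (axis name, verdict) for whichever axis scored worse."""
    # Assumption: on a tie, report the Comprehension axis.
    if TIERS.index(comprehension) >= TIERS.index(slop):
        return "Comprehension", comprehension
    return "AI-Slop", slop


print(worse_axis("PASS", "HIGH"))  # ('Comprehension', 'HIGH')
```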

Top-fix combination

The audit's "Top 3 fixes" list pulls from both axes, ordered by impact:

The single highest-impact item from whichever axis scored worse
The highest-impact item from the other axis
The next highest-impact item from whichever axis scored worse

This way the reader gets the most leverage in the smallest read.
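
The ordering rule amounts to an interleave. In this sketch both input lists are assumed to be sorted by impact already, highest first; how impact is ranked is not specified here, and `top_three` is a hypothetical name.

```python
def top_three(worse_fixes: list[str], other_fixes: list[str]) -> list[str]:
    """Pick fixes in the order: worse axis, other axis, worse axis again."""
    # Slices tolerate short lists, so fewer than three fixes may come back.
    return worse_fixes[:1] + other_fixes[:1] + worse_fixes[1:2]


print(top_three(
    ["define acronyms", "add a thesis", "split long paragraphs"],
    ["cut sycophancy opener", "thin out em dashes"],
))
# ['define acronyms', 'cut sycophancy opener', 'add a thesis']
```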

What "passes" means in this dual-axis world

A piece "passes slop-cop" when both verdicts are PASS or LOW. A piece can technically pass the AI-Slop axis with HIGH Comprehension and still be unshippable for any audience that isn't already initiated.

This is the gap that drove v2: the tool used to say "PASS" on prose that no fresh reader could follow. Two axes fix the gap.

Sources for comprehension calibration

CDC Clear Communication Index — reading-level benchmarks for healthcare
GOV.UK style guide — civic-content readability targets
Microsoft style guide — technical-content guidance
WCAG 3.1.5 Reading Level (AAA) — accessibility threshold
Pinker on the curse of knowledge — Harvard
Cloze test — NN/g
F-pattern reading — NN/g

The full bibliography is in sources.md.
