Comprehensionslop-cop/references/readability-metrics.md

Readability metrics

Eight readability formulas with audience targets and the three cold-reader density signals (acronyms, named entities, numerics).

Raw on GitHub ↗·1,692 words·12 KB

Readability Metrics

Quantitative formulas the comprehension axis computes alongside the pattern catalog. These are well-validated, decades-old measurements with known limitations. The scanner reports them as a panel; calibration thresholds are in calibration.md.

The metrics measure mechanical proxies for difficulty (sentence length, syllable count, word familiarity), not actual comprehension. They're useful as a panel — no single metric is decisive — and they ground the verdict in something more than regex hits.

The 8 metrics the scanner computes

The scanner computes these eight (chosen for breadth of coverage and lack of redundancy):

  1. Flesch Reading Ease — overall ease of reading (0–100 scale, higher = easier)
  2. Flesch-Kincaid Grade Level — U.S. school grade required to understand
  3. SMOG Index — grade level for full comprehension (gold standard for healthcare)
  4. Coleman-Liau Index — grade level via character counts (no syllable estimation)
  5. Dale-Chall Score — grade level via word familiarity against a 3,000-word list
  6. Lexical Density — content words ÷ total words
  7. Average Sentence Length + variance
  8. Passive Voice Percentage

Plus the comprehension-specific density signals (acronym density, named-entity density, numeric density) defined in comprehension.md.


1. Flesch Reading Ease (FRE)

206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words)

Scale: 0–100. Higher = easier.

ScoreGrade equivalentDescription
90–1005th gradeVery easy
80–896th gradeEasy
70–797th gradeFairly easy
60–698th–9th gradePlain English
50–5910th–12thFairly difficult
30–49CollegeDifficult
0–29College graduateVery difficult

Targets (per audience, see calibration.md):

  • General blog: 60–70
  • Marketing copy: 65–80
  • Technical docs: 40–50
  • Healthcare patient material: 70–80
  • Academic: 30–50

Limitations: Ignores reader prior knowledge, syntax complexity, jargon density, organization, and layout. Random character strings can produce absurd scores. Use as one signal among several.

Source: Wikipedia: Flesch–Kincaid, Readable on FRE


2. Flesch-Kincaid Grade Level (FKGL)

0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59

What it measures: U.S. school grade level required to understand the text.

Targets:

  • General audience: 7–9
  • Marketing: 7–8
  • Technical docs: 10–12
  • Hemingway editor's default: grade 9
  • GOV.UK: age 9 (~grade 4)
  • Average U.S. adult reads at: ~grade 8

Limitations: Two metrics only (sentence and word length). Doesn't measure comprehension, only mechanical proxies. Built on 1970s Navy training material; may not generalize to modern topics.

Source: Hemingway on FK, AHRQ on readability formula limitations


3. SMOG Index (Simple Measure of Gobbledygook)

1.0430 × √(polysyllables × 30 / sentences) + 3.1291

Where polysyllables = words of 3+ syllables.

What it measures: Years of education needed for full comprehension (vs. 50–75% for Flesch-Kincaid). The healthcare gold standard.

Targets:

  • Patient-facing healthcare materials: grade 6–8
  • General public: 7–9
  • Best for shorter texts (≥30 sentences); may be unreliable on very short samples

Limitations: Designed for full comprehension, so scores skew higher than FK. Overestimates difficulty for short pieces.

Source: Readability Formulas: how to choose, Gorby readability guide


4. Coleman-Liau Index (CLI)

0.0588 × L − 0.296 × S − 15.8

Where:

  • L = average letters per 100 words
  • S = average sentences per 100 words

What it measures: Grade level using character counts instead of syllables.

Targets: Grade 8–10 for general; 12+ for academic.

Strength: Avoids syllable-counting ambiguity, which makes it reliable for technical text with acronyms and abbreviations (where syllable counting often errors).

Limitations: Doesn't capture vocabulary difficulty within similar character counts. "Bake" and "work" score the same as "axes" and "bond".

Source: Gorby readability guide


5. Dale-Chall Readability Score

raw_score = 0.1579 × (% difficult words) + 0.0496 × (words / sentences)
if % difficult words > 5:
    raw_score += 3.6365

Where difficult = not in the Dale-Chall list of 3,000 words that 80% of 4th-graders know.

ScoreGrade
< 5.04th grade or below
5.0–5.95th–6th
6.0–6.97th–8th
7.0–7.99th–10th
8.0–8.911th–12th
9.0–9.9College
10.0+College graduate

Strength: The only major formula that catches simple-syntax-but-obscure-vocabulary problems. Direct check of word familiarity, not length proxy. "The man eschewed the indolent quotidian routine" scores high here even though every word is short.

Limitations: Word list last updated 1995; missing modern vocabulary (it doesn't know email, online, startup, etc.). The scanner uses a curated subset since the full 3,000-word list is large.

Source: Wikipedia: Dale–Chall, Dale-Chall word list


6. Lexical Density

(content_words / total_words) × 100

Where content words = nouns, main verbs, adjectives, adverbs (everything except function words: articles, prepositions, conjunctions, pronouns, auxiliaries).

Thresholds:

DensityType
30–40%Spoken English
40–50%Written prose, news
50–55%Magazine features
55–65%Academic, legal, technical
65%+Information-dense, hard to scan

What it measures: How "packed" with content words a text is. High density = hard to skim; reader has to absorb every word.

Limitations: Only measures vocabulary, not syntactic complexity or organization. Density alone doesn't equal difficulty if structure is clear. The scanner uses POS-tag heuristics (no full parser) so the number is approximate.

Source: Wikipedia: lexical density, TextInspector on lexical density


7. Average Sentence Length + Variance

mean = total_words / total_sentences
stddev = sqrt(Σ(sentence_length − mean)² / n)
variance_ratio = stddev / mean   # also called burstiness

Comprehension thresholds:

Avg lengthComprehension
≤ 8 words~100%
14 words> 90%
25 wordsnoticeable drop
43+ words< 10%

Targets:

  • General prose: 15–18 words average
  • Marketing copy: 12–16
  • GOV.UK / civic: 12–15
  • Technical docs: 18–22
  • Academic: 20–28

Variance: humans cluster 0.6–1.2 (variance / mean ratio). LLMs cluster 0.2–0.4 (uniform sentences).

Mechanism: Working memory holds the sentence in a buffer. Long sentences overflow it; the reader loses the thread. This metric also feeds the AI-slop axis (low burstiness = AI tell).

Source: Letter Counter on sentence length, Siteimprove on long sentences


8. Passive Voice Percentage

passive_count / total_sentences × 100

Detected via regex: be-verb (is/was/were/been/being/are/am) + past participle (-ed or irregular).

Thresholds:

ToolThreshold
Yoast< 10% (cap)
Readable< 3% (recommendation)
Monash< 5%
Common consensus4–10% for general prose

What it measures: Sentences where the subject receives the action rather than performs it.

Why it matters: Passive constructions hide the agent, lengthen sentences, and reverse subject-verb-object expectation. "Mistakes were made" obscures who made them.

Limitations: Some passive is fine and necessary (when the agent is unknown, irrelevant, or being de-emphasized for tact). The regex catches many but not all forms; some active sentences match the pattern ("The book was finished by Tuesday" — finished here is past tense, not participle).

Source: Yoast on passive voice, Readable on active voice


Cold-reader-specific density signals

These three are not classic readability formulas. They're new measurements that target the failure modes in comprehension.md patterns F1, F2, F3.

Acronym density per 100 words

undefined_acronyms / words × 100

Definition: Undefined = uppercase token of 2–5 letters not in the known-acronyms allowlist (USB, FAQ, URL, API, JSON, HTML, CSS, SQL, AWS, CEO, CFO, CTO, etc.) and not introduced earlier in the document with a parenthetical expansion.

Threshold: 3+ undefined acronyms per 100 words = high cognitive load.

Named-entity density per 100 words

proper_noun_runs / words × 100

Definition: Proper noun runs = capitalized non-sentence-start tokens (excluding common-noun-uses-of-capitalization like in titles).

Threshold: 5+ named-entity introductions without context per 100 words = "named-entity bombing."

Numeric density per sentence

max(numeric_claims_per_sentence)

Threshold: Any sentence with 3+ numeric claims (with units like %, $, K, M, year) is a stat-bombing flag.


How the metrics feed the verdict

The verdict is computed from the pattern catalog (comprehension.md) — not from the readability formulas directly. The metrics serve three purposes:

  1. Diagnosis. When the verdict is HIGH/CRITICAL on comprehension, the metrics tell the reader why. "FK Grade 16, average sentence 32 words, lexical density 68%" is a different failure mode than "FK Grade 4 but acronym density 8 per 100 words."

  2. Audience calibration. The same Flesch score has different meanings for different audiences. calibration.md maps the metrics to audience targets.

  3. Compound triggers. A draft can pass the pattern catalog (no individual H violations) and still fail the metrics panel (FK Grade 16, lexical density 70%, no concrete examples). The metrics catch the cumulative texture of dense academic prose that any single pattern would miss.

The audit report includes the metrics panel under the verdict line. See audit-report-template.md for the format.


Why these eight, not all twenty researched

There are 20+ readability formulas in the literature (Gunning Fog, ARI, FORCAST, Linsear Write, etc.). The scanner ships with the 8 above because:

  • Three Flesch-family (FRE, FKGL) covers the most-cited and most-validated baseline
  • SMOG is the healthcare gold standard
  • Coleman-Liau is character-based, robust against acronyms
  • Dale-Chall uniquely captures vocabulary familiarity
  • Lexical density, sentence length variance, passive voice are independent signals that the others don't catch

Other formulas are largely correlated with FK and don't add diagnostic information. They're documented in sources.md for reference but not computed by the scanner.

Limitations of metrics overall

Janice Redish's classic critique applies: readability formulas measure mechanical proxies (length) for properties they don't actually measure (comprehension). A short-sentence document can be incoherent. A long-sentence document can be lucid. The scanner reports the metrics because they correlate with comprehension on average, but the patterns in comprehension.md are the load-bearing checks. The metrics calibrate; the patterns rule.

Source: Redish on readability formulas