Moderation Quality · Weekly RCA Report
W16 −0.04pp
W15 +1.92pp
W14 −1.82pp
W13
⊘ Persistent 0%

OMA held flat at −0.04pp — but underneath, the mix is highly turbulent

Shift-share decomposition of W16 (Apr 18–24) vs W15 (Apr 11–17). Global moved 86.04%→86.00%, yet ID and LATAM each contributed ~800% of Δ to the drag, offset by Adult Sexualized Behaviors and Tobacco recovering. The headline calm masks high single-policy and single-market volatility.

Global OMA W16
86.00%
▼ 0.04pp
from 86.04% · ≈ flat
EMEA · Wt 32.85%
83.89%
▲ +0.62pp
−578% offset of global Δ
AMS · Wt 21.57%
83.10%
▼ 0.45pp
373% of global Δ
APAC · Wt 45.57%
88.91%
▼ 0.14pp
304% of global Δ
Overview −0.04pp
Methodology & headline
Markets 12 dragging
ID #1 at 842%
Top Policies 10+ severe
Violent Behaviors leads
Tobacco +0.33pp
2nd-heaviest policy, recovering
Tobacco × Markets EMEA 54%
where the gain came from
ID Deep-dive −3.41pp
Indonesia by policy
Actions 7
Investigate the offsets
00

Methodology & headline summary

W16 (Apr 18–24) vs W15 (Apr 11–17). Each segment's contribution to OMA is decomposed into rate, weight, and interaction effects. Because the global Δ is tiny (−0.04pp), individual segment % of Δ figures can balloon — focus on absolute pp contributions to gauge true scale.

Rate effect = GWt_W15 × (Acc_W16 − Acc_W15): pure accuracy change at prior weight
Weight effect = (GWt_W16 − GWt_W15) × (Acc_W15 − GlobalAcc_W15): mix shift relative to global mean
Interaction = (GWt_W16 − GWt_W15) × (Acc_W16 − Acc_W15): joint change
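A minimal Python sketch of the three formulas, checked against the ID row of the market table. The function name `shift_share` and its argument names are illustrative assumptions, not part of the report's tooling. Inputs are percentages and outputs are percentage points; because the inputs are the rounded figures shown in the report, results can differ from the published row in the third decimal.

```python
def shift_share(wt_prev, wt_curr, acc_prev, acc_curr, global_acc_prev):
    """Decompose one segment's contribution to the global accuracy change.

    Weights and accuracies are percentages (e.g. 8.70 for 8.70%).
    Returns (rate, weight, interaction, total) in percentage points.
    """
    d_wt = (wt_curr - wt_prev) / 100.0   # weight change as a fraction
    d_acc = acc_curr - acc_prev          # accuracy change in pp
    rate = (wt_prev / 100.0) * d_acc
    weight = d_wt * (acc_prev - global_acc_prev)
    interaction = d_wt * d_acc
    return rate, weight, interaction, rate + weight + interaction


# ID, W15 -> W16, using the rounded figures from the market table.
rate, weight, inter, total = shift_share(8.70, 8.85, 90.22, 86.81, 86.04)
print(f"rate={rate:+.3f} weight={weight:+.3f} inter={inter:+.3f} total={total:+.3f}")
```

The total should land close to the −0.295pp reported for ID. Summing `total` over all segments reproduces the global Δ only if the weights in each week sum to 100%.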
The headline is misleading — the underlying mix is highly turbulent
86.04% → 86.00% looks like a non-event. But Violent Behaviors fell 19.32pp, Personal Information - High Risk fell 31.01pp, and Disparaging Religion fell 75.96pp. They were offset by equally severe gains: Adult Sexualized Behaviors recovering +3.87pp on heavy weight, Tobacco +2.16pp continuing its rebound, Invasive Cosmetic +22.73pp. Net ≈ 0.
Geographic drag concentrated in ID + LATAM = 1648% of global Δ
ID (−3.40pp accuracy, 842% of Δ) and LATAM (−3.03pp, 806% of Δ) each contributed more than 8× the global decline. SSA, PH, MENA1, BD, JP, and TR each add 200%+. ID has now declined in two of the last three weeks, with a small rebound in between: W13→W14 −3.16pp, W14→W15 +0.58pp, W15→W16 −3.40pp.
Adult Sexualized Behaviors + Tobacco delivered +0.77pp combined offset — what saved the headline
Adult Sexualized Behaviors (+3.87pp accuracy on 5.0% weight, contribution +0.43pp) and Tobacco & Nicotine (+2.16pp on 10.81% weight, contribution +0.33pp) together offset 2189% of the global decline. MENA2 recovered +6.84pp accuracy regionally (+0.30pp). Without these three (+1.07pp combined), the headline would read closer to −1.1pp.
Why "% of Δ" looks extreme this week

Small denominator, large numerator

Global Δ = −0.04pp. When a segment contributes −0.30pp (a normal magnitude), it's ~750% of the global change. This is mathematically correct but visually scary.
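A small sketch of the denominator effect: the same −0.30pp contribution expressed against three global deltas. The helper name `pct_of_delta` and the two larger deltas are illustrative assumptions; only the −0.04pp case comes from this week's data.

```python
def pct_of_delta(contribution_pp, global_delta_pp):
    """Segment contribution as a percentage of the global change."""
    return 100.0 * contribution_pp / global_delta_pp


# One fixed contribution, three hypothetical sizes of global move.
for global_delta in (-0.04, -0.50, -2.00):
    share = pct_of_delta(-0.30, global_delta)
    print(f"global delta {global_delta:+.2f}pp -> {share:.0f}% of delta")
```

The absolute contribution never changes; only the ratio does, which is why the pp column is the more stable signal in a flat week.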

The right interpretation: treat absolute pp contributions as the signal. Anything > 0.10pp is materially large in absolute terms — and the W16 table has 10+ such items on each side, indicating high underlying volatility.

If next week one of the offsets fails to repeat (e.g., Tobacco continues recovering but Adult Sexualized Behaviors regresses), the headline could swing 1–2pp easily. The current calm is fragile.

Data integrity flag — several policies report 0% accuracy

0%-accuracy policies need verification

Four policies show 0% accuracy in both W15 and W16 yet still contribute meaningfully to the global delta via weight changes:

  • Animal Abuse & Graphic Content: 0% → 0%, weight 0.13% → 0.39% (contribution −0.225pp)
  • Youth Physical Abuse, Assault & Neglect: 0% → 0%, weight 0.22% → 0.38%
  • Graphic Content: 0% → 0%, weight grew
  • Adult Sexual Abuse: 0% → 0%, weight 0.41% → 0.50%

A persistent 0% on a non-trivial sample is implausible as a true accuracy figure. Likely causes: data filter excluding all "approve" cases for these policies, sampling artifact, or definitional change. Verify before treating these as real signal.
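One way to screen for this automatically is sketched below. The record layout, the function name `persistent_zero`, and the 0.10% weight threshold are all assumptions made for illustration; the rows copy the figures listed above, and Graphic Content is left out because its weights are not given.

```python
# Hypothetical per-policy records: (name, acc_w15, acc_w16, wt_w15, wt_w16), all in %.
policies = [
    ("Animal Abuse & Graphic Content", 0.0, 0.0, 0.13, 0.39),
    ("Youth Physical Abuse, Assault & Neglect", 0.0, 0.0, 0.22, 0.38),
    ("Adult Sexual Abuse", 0.0, 0.0, 0.41, 0.50),
    ("Tobacco and Nicotine", 79.04, 81.20, 12.23, 10.81),  # control row, should not be flagged
]


def persistent_zero(rows, min_weight=0.10):
    """Names with 0% accuracy in both weeks and a non-trivial weight in the latest week."""
    return [name for name, a15, a16, w15, w16 in rows
            if a15 == 0.0 and a16 == 0.0 and w16 >= min_weight]


flagged = persistent_zero(policies)
print(flagged)
```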

01

By market — top contributors

12 markets dragging at 100%+ each — but 8 markets offset more than the entire decline
Drag side: ID 842%, LATAM 806%, SSA 440%, PH 398%, MENA1 347%, BD 294%, JP 252%, TR 241%. Offset side: MENA2 +6.84pp recovery, VN grew on improvement, BR +3.43pp, IT +5.37pp.
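Market | Acc W15 | Acc W16 | Δ Acc (pp) | Wt W15 | Wt W16 | Rate (pp) | Weight (pp) | Inter (pp) | Total (pp) | % of Δ
ID | 90.22% | 86.81% | −3.40 | 8.70% | 8.85% | −0.296 | +0.006 | −0.005 | −0.295 | 842.3%
LATAM | 85.54% | 82.51% | −3.03 | 8.79% | 9.24% | −0.266 | −0.002 | −0.014 | −0.282 | 805.8%
SSA | 79.43% | 75.32% | −4.11 | 3.37% | 3.52% | −0.139 | −0.010 | −0.006 | −0.154 | 439.8%
PH | 87.45% | 83.99% | −3.45 | 4.15% | 3.95% | −0.143 | −0.003 | +0.007 | −0.139 | 397.7%
MENA1 | 86.79% | 84.98% | −1.81 | 5.60% | 7.50% | −0.101 | +0.014 | −0.034 | −0.122 | 347.2%
BD | 86.85% | 84.98% | −1.87 | 5.28% | 5.68% | −0.099 | +0.003 | −0.007 | −0.103 | 294.1%
JP | 92.01% | 90.48% | −1.53 | 2.28% | 1.07% | −0.035 | −0.072 | +0.018 | −0.088 | 252.1%
TR | 88.70% | 84.36% | −4.34 | 1.97% | 1.92% | −0.085 | −0.001 | +0.002 | −0.084 | 241.0%
ES | 89.25% | 82.39% | −6.87 | 1.08% | 1.00% | −0.074 | −0.003 | +0.006 | −0.071 | 203.6%
MX | 82.78% | 81.87% | −0.91 | 4.65% | 5.26% | −0.042 | −0.020 | −0.006 | −0.068 | 193.8%
Top-10 negative subtotal | | | | | | | | | −1.408 | 4017.4%
MENA2 | 77.79% | 84.63% | +6.84 | 4.39% | 4.00% | +0.300 | +0.032 | −0.027 | +0.305 | −870.2%
VN | 90.16% | 91.69% | +1.53 | 6.84% | 7.73% | +0.105 | +0.037 | +0.014 | +0.155 | −443.5%
BR | 83.73% | 87.16% | +3.43 | 4.46% | 4.75% | +0.153 | −0.007 | +0.005 | +0.150 | −429.1%
IT | 86.41% | 91.78% | +5.37 | 2.47% | 2.39% | +0.133 | +0.000 | −0.005 | +0.128 | −364.9%
MY | 83.36% | 90.47% | +7.11 | 1.67% | 1.52% | +0.119 | −0.004 | −0.007 | +0.108 | −309.6%
Top-5 positive subtotal | | | | | | | | | +0.846 | −2417.3%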
JP weight collapse (2.28% → 1.07%, −1.21pp) is the largest single mix-shift event among draggers. Despite the small accuracy decline (−1.53pp), the weight effect (−0.072pp) is unusually large because JP W15 accuracy (92%) was well above the global mean — shrinking it removes a high-quality contributor from the mix.
ID #1 dragger: two sharp declines in three weeks

ID: a recurring pattern, not a one-off

ID OMA accuracy fell from 90.22% → 86.81% in W16 (−3.40pp). This is the second significant decline in three weeks, with only a partial rebound in between: W13 92.79% → W14 89.63% → W15 90.22% → W16 86.81%. Cumulative drop: −5.97pp from the W13 baseline.
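Differencing the four weekly figures quoted above gives the week-over-week moves directly. This is a small sketch; the dictionary name is illustrative, and because the inputs are rounded to two decimals the results can differ from the report's own deltas by about 0.01pp.

```python
id_accuracy = {"W13": 92.79, "W14": 89.63, "W15": 90.22, "W16": 86.81}

weeks = list(id_accuracy)
# Week-over-week change for each consecutive pair of weeks, in pp.
wow = {f"{a}->{b}": round(id_accuracy[b] - id_accuracy[a], 2)
       for a, b in zip(weeks, weeks[1:])}
declines = [pair for pair, delta in wow.items() if delta < 0]
cumulative = round(id_accuracy["W16"] - id_accuracy["W13"], 2)

print(wow)
print("declines:", declines)
print("cumulative:", cumulative)
```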

The market is also gaining global share (8.70% → 8.85%) while accuracy worsens, so the interaction effect is small but negative. This suggests either that Indonesia-specific moderation quality is degrading or that the additional volume is concentrated in harder-to-judge content.

Action: Request structured Indonesia retrospective. The trend is now clear enough to need a dedicated investigation.

LATAM #2 dragger — what's behind the −3.03pp accuracy drop

LATAM: pure rate-effect dominance

LATAM accuracy fell from 85.54% → 82.51%, a 3.03pp drop. Weight grew slightly (8.79% → 9.24%) which marginally amplified damage via interaction (−0.014pp).

The rate component (−0.266pp) is by far the largest driver. Investigate whether a regional policy change, language model update, or sampling shift hit the LATAM portfolio specifically in W16.

MENA2 #1 offset — +6.84pp recovery, is it durable?

MENA2: bounce-back from a chronic underperformer

MENA2 accuracy jumped 77.79% → 84.63% (+6.84pp). Weight contracted slightly (4.39% → 4.00%), so this is overwhelmingly a rate story.

Looking back, MENA2 has been a problem region — this single-week recovery is the largest market gain in the dataset. Whether it's durable depends on whether the W14–W15 issue was a one-off (sample anomaly, transient labeling problem) or whether deeper calibration work lifted the floor.

Confirm with the regional team whether structural changes were made.

02

By policy title — top contributors

Multiple policies dropped 10–76pp in accuracy this week
Severe single-policy drops include Disparaging Religion (−75.96pp, 89.07%→13.11%), Light Body Exposure (−36.83pp), Personal Information - High Risk (−31.01pp), NSA Exceptions - Mature (−22.85pp), Suicide & NSSI (−21.99pp), Violent Behaviors (−19.32pp). Even with small weights, these aggregate fast.
Tobacco & Nicotine — 2nd-heaviest policy (10.81%) acted as a stabilizer for the second straight week
Tobacco accuracy continued recovering: 79.04% → 81.20% (+2.16pp) and its share contracted 12.23% → 10.81% (−1.42pp). Because Tobacco accuracy is well below the global mean (−7.0pp from 86.04%), shrinking its weight is a strong net positive. Combined contribution: +0.334pp (offsetting 952% of the global Δ).
Tobacco & Nicotine — deep dive into a sustained recovery

Tobacco's outsized impact

At 10.81% of W16 sample weight, Tobacco & Nicotine is the 2nd-largest single policy (after Youth Regulated Goods at 12.29%). Its accuracy moves the global needle directly.

Multi-week trajectory: clear recovery from W14 trough

  • W13: 84.66% acc, 13.23% wt — recent peak
  • W14: 76.07% acc, 11.06% wt — −8.59pp single-week collapse
  • W15: 79.04% acc, 12.23% wt — partial recovery (+2.97pp)
  • W16: 81.20% acc, 10.81% wt — continued recovery (+2.16pp)

Tobacco quality has rebounded ~5.13pp from the W14 trough, but is still 3.46pp below its W13 baseline. The trajectory is clearly positive.

Shift-share decomposition (W16 vs W15)

  • Rate effect: +0.264pp — accuracy gain at prior weight
  • Weight effect: +0.099pp — shrinking a below-mean segment helps
  • Interaction: −0.031pp — small, accuracy ↑ while weight ↓
  • Total: +0.334pp (≈ −952% of the global Δ)

What to watch

Annotation sample: 1,878 → 1,537 cases (−18%). Some of the weight contraction may reflect a sampling change, so verify that the methodology hasn't changed.

Below-mean accuracy persistence: at 81.20%, Tobacco is still 4.80pp below the global mean. If volume rebounds before quality recovers further, the helpful weight-effect direction will reverse — Tobacco could flip back to a major drag.

Action: Lock in the recovery. Confirm whether the W14 trough was an isolated event and whether the two-week rebound (W15 and W16) has structural support rather than being regression to the mean.

Policy | Acc W15 | Acc W16 | Δ Acc (pp) | Wt W15 | Wt W16 | Rate (pp) | Weight (pp) | Inter (pp) | Total (pp) | % of Δ
Violent Behaviors ? | 76.78% | 57.46% | −19.32 | 1.47% | 1.77% | −0.284 | −0.029 | −0.058 | −0.371 | 1059.4%
Gambling - Depiction and Promotion | 69.68% | 59.89% | −9.79 | 1.51% | 2.07% | −0.148 | −0.092 | −0.054 | −0.293 | 836.9%
Dangerous Trends - Serious Harm | 68.09% | 63.14% | −4.95 | 4.83% | 4.95% | −0.239 | −0.022 | −0.006 | −0.266 | 758.7%
Personal Information - High Risk ? | 84.12% | 53.12% | −31.01 | 0.67% | 0.71% | −0.208 | −0.001 | −0.011 | −0.220 | 627.4%
Youth Non-Sexualized Nudity | 76.77% | 74.61% | −2.16 | 4.86% | 5.60% | −0.105 | −0.069 | −0.016 | −0.189 | 540.2%
Youth Body Exposure - Light (4-17) | 40.38% | 37.08% | −3.30 | 0.67% | 0.98% | −0.022 | −0.146 | −0.010 | −0.178 | 507.6%
Youth Regulated Goods and Services | 73.69% | 72.65% | −1.04 | 12.10% | 12.29% | −0.126 | −0.023 | −0.002 | −0.151 | 430.1%
Light Body Exposure ? | 70.00% | 33.17% | −36.83 | 0.08% | 0.30% | −0.029 | −0.036 | −0.082 | −0.147 | 419.9%
High Risk Driving | 64.91% | 60.73% | −4.18 | 2.33% | 2.53% | −0.097 | −0.041 | −0.008 | −0.147 | 419.7%
Regulated Goods - Marketing/Trade | 47.96% | 48.53% | +0.57 | 1.44% | 1.80% | +0.008 | −0.135 | +0.002 | −0.129 | 367.7%
Top-10 negative subtotal | | | | | | | | | −2.090 | 5967.7%
Adult Sexualized Behaviors | 54.88% | 58.75% | +3.87 | 5.77% | 5.00% | +0.224 | +0.239 | −0.030 | +0.433 | −1236.1%
Tobacco and Nicotine ★ 2nd heaviest policy | 79.04% | 81.20% | +2.16 | 12.23% | 10.81% | +0.264 | +0.099 | −0.031 | +0.334 | −952.5%
Invasive Cosmetic Procedures ? | 65.14% | 87.86% | +22.73 | 1.30% | 2.26% | +0.295 | −0.201 | +0.219 | +0.313 | −894.6%
Combat sports, Extreme Sports & Stunts | 75.02% | 82.54% | +7.51 | 4.04% | 4.28% | +0.304 | −0.026 | +0.018 | +0.296 | −844.0%
Moderate Bullying | 48.14% | 50.83% | +2.69 | 2.26% | 1.62% | +0.061 | +0.247 | −0.017 | +0.290 | −826.3%
Top-5 positive subtotal | | | | | | | | | +1.666 | −4753.5%
Severe single-policy regressions: Disparaging Religion 89.07%→13.11% (−75.96pp); Suicide & NSSI 57.37%→35.38% (−21.99pp); NSA Exceptions - Mature 53.62%→30.76% (−22.85pp); Adult Sexual Solicitation 57.81%→46.33% (−11.47pp). These weren't in the top-10 by total contribution because their weights are tiny (<1%), but the rate magnitudes warrant individual investigation. Several policies report 0% accuracy in both weeks (Animal Abuse, Youth Physical Abuse, Graphic Content, Adult Sexual Abuse), which is likely a data integrity issue rather than real signal.
Violent Behaviors #1 — −19.32pp drop on growing weight

Violent Behaviors: triple-negative, all three effects against

Accuracy collapsed 76.78% → 57.46% (−19.32pp). Weight grew (1.47% → 1.77%), so the additional volume entered a now-failing segment — interaction effect (−0.058pp) compounds the damage.

This is one of the largest reputational-risk policy categories. A 19pp accuracy drop combined with growing volume is a serious signal — escalate immediately.

Personal Information - High Risk — −31pp single week

Personal Info High Risk: catastrophic single-week drop

Accuracy fell 84.12% → 53.12% (−31.01pp) on stable weight (~0.69%). The pure rate effect (−0.208pp) explains almost all of this row's −0.220pp contribution.

A 31pp drop on a privacy-related, high-stakes policy is alarming. Possible drivers: policy interpretation change, new content vector (e.g., new types of doxxing patterns), or model/labeler retraining gone wrong. Investigate before W17.

Adult Sexualized Behaviors — +0.43pp top offset, what drove it

A.S.B: the largest single offset

Adult Sexualized Behaviors recovered 54.88% → 58.75% (+3.87pp). Weight contracted 5.77% → 5.00% (−0.77pp). Both effects are favorable: rate (+0.224pp) and weight (+0.239pp) — shrinking a below-mean segment helps.

This single policy contributed +0.433pp — by itself, more than 12× the global Δ in the offsetting direction. Worth understanding what drove the accuracy jump (calibration, content shift, sampling) since A.S.B is a chronic problem area.

Disparaging Religion — 89.07% → 13.11% (−75.96pp)

Disparaging Religion: most severe rate drop

This policy collapsed by 75.96pp on a tiny sample weight (~0.08–0.13%). Global impact is "only" −0.099pp (246%), but the rate magnitude is unprecedented.

Almost certainly a sample/policy/labeling artifact — a 76pp single-week swing is implausible as a true accuracy change. Verify the W16 sample is representative; if it is, escalate as a critical operational failure.

03

Tobacco & Nicotine — dedicated deep-dive

Tobacco contributes +0.334pp — the 2nd largest single-policy offset in W16
Accuracy continued recovering (79.04% → 81.20%, +2.16pp) for the second straight week. Both share and absolute volume contracted, and because Tobacco accuracy still sits well below the global mean (4.80pp below the W16 mean of 86.00%), shrinking its weight further amplifies the global benefit.
+2.16pp
Accuracy
79.04% → 81.20%
−1.42pp
Share of mix
12.23% → 10.81%
−31.4%
Volume (raw)
288K → 198K
+0.334pp
Total contribution
−952% of global Δ
Volume dropped 31% — but share dropped only 11.6% (relative). Both are real shrinkage.
A common reading error: "fewer Tobacco cases" doesn't automatically mean "smaller share". Whole-pipeline volume also contracted (5.65M → 4.54M, −19.5%), so part of Tobacco's volume drop is just the pipeline shrinking. The genuine share contraction is −1.42pp (12.23% → 10.81%, −11.6% relative): Tobacco shrank faster than the pipeline, so its mix share really did fall.
Week | Acc | WoW Acc (pp) | Wt % | WoW Wt (pp) | Volume (raw) | WoW Vol | Annot. sample
W13 (Mar 28–Apr 3) | 84.66% | +6.27 | 13.23% | | 262,884 | | 796
W14 (Apr 4–10) | 76.07% | −8.59 | 11.06% | −2.17 | 223,139 | −15.1% | 509
W15 (Apr 11–17) | 79.04% | +2.97 | 12.23% | +1.17 | 288,581 | +29.3% | 1,878
W16 (Apr 18–24) | 81.20% | +2.16 | 10.81% | −1.42 | 197,907 | −31.4% | 1,537
Net 4-week trajectory: Accuracy −3.46pp from W13 peak (84.66% → 81.20%); share −2.42pp (13.23% → 10.81%); volume −24.7% (262K → 198K). Quality has rebounded 5.13pp from the W14 trough but hasn't fully recovered to baseline.
Volume vs share — why both dropped, and what that signals

Volume change: Tobacco shrank faster than the pipeline

OMA total volume W15 → W16: 5,645,117 → 4,544,936 (−19.5%).
Tobacco volume W15 → W16: 288,581 → 197,907 (−31.4%).

If Tobacco had shrunk at the same rate as the pipeline, it would have been ~232K (still 12.23% share). The fact that it dropped to 198K means Tobacco genuinely lost share — about 11.6% relative share contraction, or 1.42 percentage points absolute.
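The counterfactual above is a one-line proportion. The sketch below reproduces it from the two volume pairs quoted in this section; the variable names are illustrative.

```python
total_w15, total_w16 = 5_645_117, 4_544_936
tobacco_w15, tobacco_w16 = 288_581, 197_907

pipeline_change = total_w16 / total_w15 - 1
tobacco_change = tobacco_w16 / tobacco_w15 - 1
# Volume Tobacco would have had if it had shrunk exactly in line with the pipeline.
counterfactual_w16 = tobacco_w15 * (total_w16 / total_w15)
shortfall = counterfactual_w16 - tobacco_w16

print(f"pipeline {pipeline_change:+.1%}, tobacco {tobacco_change:+.1%}")
print(f"counterfactual ~{counterfactual_w16:,.0f}, shortfall ~{shortfall:,.0f}")
```

The shortfall (counterfactual minus actual) is the part of the volume drop that is specific to Tobacco rather than inherited from the pipeline.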

Share change: real but not dramatic

12.23% → 10.81% looks like a meaningful drop. In context, though, Tobacco's share has oscillated between roughly 11% and 13% for 4 weeks (W13 13.23%, W14 11.06%, W15 12.23%, W16 10.81%). The W16 figure is the lowest of the four, just under W14's 11.06%, but not an outlier.

Why both directions help OMA right now

Tobacco accuracy (81.20%) is 4.80pp below the global mean (86.00%). For below-mean segments:

  • Accuracy ↑ directly lifts the global average via the rate effect.
  • Share ↓ removes a drag from the mix via the weight effect.
  • Both happening together is the ideal direction for an underperforming policy.

This is why the W16 contribution (+0.334pp) is so much bigger than what the +2.16pp accuracy gain alone would predict.

Shift-share decomposition — exactly where the +0.334pp comes from

Three components, favorable in net

  • Rate effect: +0.264pp = 12.23% × (+2.16pp) — pure accuracy gain at prior weight
  • Weight effect: +0.099pp = (−1.42pp) × (79.04% − 86.04%) — shrinking a below-mean segment helps
  • Interaction: −0.031pp = (−1.42pp) × (+2.16pp) — small, since accuracy ↑ while weight ↓ (opposite directions)
  • Total: +0.334pp

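If Tobacco had only improved accuracy (+2.16pp) without the share drop, contribution would be just +0.264pp. The share contraction adds +0.099pp through the weight effect and gives back 0.031pp through the interaction, a net gain of about +0.07pp, or roughly a quarter on top of the rate effect. The interaction is negative because the two factors moved in opposite directions, and small because it is the product of two modest changes.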
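The arithmetic in this breakdown can be re-run from the rounded inputs. This is a sketch with illustrative variable names; the report's +0.334pp total was presumably computed from unrounded figures, so a difference of about 0.001pp is expected.

```python
wt_w15, wt_w16 = 12.23, 10.81      # Tobacco share of mix, %
acc_w15, acc_w16 = 79.04, 81.20    # Tobacco accuracy, %
global_acc_w15 = 86.04             # global OMA in W15, %

d_wt = (wt_w16 - wt_w15) / 100     # share change as a fraction
d_acc = acc_w16 - acc_w15          # accuracy change in pp

rate = wt_w15 / 100 * d_acc
weight = d_wt * (acc_w15 - global_acc_w15)
interaction = d_wt * d_acc
total = rate + weight + interaction
# Net extra contribution from the share move, relative to the rate effect alone.
extra = (weight + interaction) / rate

print(f"rate={rate:+.3f} weight={weight:+.3f} inter={interaction:+.3f} total={total:+.3f}")
print(f"share move adds {extra:.0%} on top of the rate effect")
```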

Risk: when this stops being a tailwind

The favorable direction can reverse fast

Scenario A — Volume rebounds before accuracy fully recovers: If W17 sees Tobacco volume pop back to ~12% share (a ~+1.2pp share gain) while accuracy plateaus at ~81%, the weight effect flips to negative. A +1.2pp share increase × (81.0% − 86.0%) ≈ −0.06pp drag. Combined with no rate gain, Tobacco could contribute close to zero or slightly negative.

Scenario B — Accuracy recovery stalls or reverses: Tobacco hit 84.66% in W13 — that's the recent ceiling. Without structural fixes, regression to ~79% (the W15 level) is plausible. A −2pp accuracy fall × 11% weight ≈ −0.22pp drag in a single week.

Scenario C — Both reverse: Volume back to 13% AND accuracy back to 79% would create a triple-negative similar to what happened in W14 (−8.59pp accuracy collapse). The W16 +0.334pp tailwind would flip to ~−0.40pp drag — a 0.7pp swing on a single policy.

Action: Monitor whether the W14 trough was caused by an isolated event (sample anomaly, transient labeling problem) or a structural issue. The two-week recovery looks real but is on a thin base.
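The three scenarios can be compared with the same decomposition, starting from the W16 position. This is a sketch under two stated assumptions: the global mean stays at 86.00%, and the target shares and accuracies are the round numbers quoted in the scenarios. The function and dictionary names are illustrative.

```python
def contribution(wt_prev, wt_next, acc_prev, acc_next, global_acc_prev):
    """Rate + weight + interaction, in pp; all inputs are percentages."""
    d_wt = (wt_next - wt_prev) / 100
    d_acc = acc_next - acc_prev
    return (wt_prev / 100) * d_acc + d_wt * (acc_prev - global_acc_prev) + d_wt * d_acc


# Tobacco's W16 position, used as the base week for a hypothetical W17.
W16 = dict(wt=10.81, acc=81.20, global_acc=86.00)

scenarios = {
    "A: share 12%, accuracy flat": (12.0, 81.20),
    "B: share flat, accuracy 79%": (10.81, 79.0),
    "C: share 13%, accuracy 79%": (13.0, 79.0),
}
results = {name: contribution(W16["wt"], wt, W16["acc"], acc, W16["global_acc"])
           for name, (wt, acc) in scenarios.items()}

for name, value in results.items():
    print(f"{name}: {value:+.2f}pp")
```

Scenario C comes out worse than A and B added together because the interaction term only appears when both share and accuracy move.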

Annotation sample vs moderation volume — different signals

Two different "weights" — don't conflate them

The OMA dashboard has two volume-like metrics for each policy:

  • Weight (raw volume): total moderation traffic for that policy. W16: 197,907 cases. This represents real production decisions made by the moderation system.
  • Annotation sample: how many cases were sampled for human evaluation. W16: 1,537 cases. This is the OHA/OMA evaluation effort.

Tobacco annotation sample: 1,878 → 1,537 (−18.2%). Tobacco's share of total annotation samples: 6.73% → 6.76% — essentially flat. So at the evaluation/sampling level, Tobacco's representation didn't change.

The 31% volume drop is in moderation traffic, not in evaluation effort. This means: Tobacco is being moderated less in production, not just sampled less. Likely drivers: content trends, policy enforcement changes, or seasonality.

04

Tobacco × Markets — where the +2.16pp gain came from

EMEA · Tobacco share 47.55%
74.39%
▲ +2.16pp
+1.16pp within Tobacco · 54% of gain
AMS · Tobacco share 8.38%
67.45%
▲ +15.86pp
+0.84pp within Tobacco · 39% of gain
APAC · Tobacco share 44.06%
89.79%
▼ 0.12pp
−0.02pp within Tobacco · roughly flat
EMEA delivered 54% of the Tobacco gain — but it's a single-market story
MENA2 alone drove +1.63pp of the +2.16pp Tobacco improvement (75% of gain). Tobacco accuracy in MENA2 jumped 54.25% → 86.21% (+31.96pp). DE (+0.71pp) and EN (share contraction effect) added the rest. The recovery is concentrated, not broad.
APAC nearly cancelled itself out — strong gainers offset by ID + BD + KR drops
VN (+10.54pp accuracy, +6.12pp share growth in Tobacco), MY (+7.23pp), TH (+7.96pp) delivered big within-region gains. But ID (Tobacco share 17.17% → 10.21%, accuracy −4.47pp) alone subtracted 1.09pp. BD, KR, KH also dragged. Region-level result: ≈ flat.
04a

Top markets driving Tobacco's +2.16pp gain

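Market | Region | Acc W15 | Acc W16 | Δ Acc (pp) | Wt W15 | Wt W16 | Δ Wt (pp) | Δ Vol% | Total (pp) | % of Tob Δ
MENA2 ? | EMEA | 54.25% | 86.21% | +31.96 | 5.40% | 3.99% | −1.41 | −49.3% | +1.626 | −75.3%
VN | APAC | 87.50% | 98.04% | +10.54 | 2.00% | 8.12% | +6.12 | +178.1% | +1.373 | −63.6%
DE | EMEA | 58.00% | 64.26% | +6.26 | 7.82% | 6.32% | −1.50 | −44.6% | +0.711 | −32.9%
LATAM ? | AMS | 48.89% | 73.34% | +24.45 | 2.92% | 3.52% | +0.60 | −17.2% | +0.679 | −31.4%
EN | EMEA | 65.42% | 58.59% | −6.82 | 9.74% | 3.61% | −6.13 | −74.6% | +0.589 | −27.3%
MY | APAC | 86.92% | 94.16% | +7.23 | 1.11% | 4.13% | +3.02 | +154.8% | +0.537 | −24.9%
TH | APAC | 84.55% | 92.51% | +7.96 | 1.48% | 3.72% | +2.24 | +72.3% | +0.420 | −19.4%
BR ? | AMS | 55.25% | 63.94% | +8.69 | 2.97% | 2.57% | −0.40 | −40.7% | +0.319 | −14.8%
Top-8 positive subtotal | | | | | | | | | +6.255 | −289.6%
ID ? | APAC | 88.14% | 83.66% | −4.47 | 17.17% | 10.21% | −6.96 | −59.2% | −1.090 | +50.4%
MENA1 | EMEA | 88.12% | 74.00% | −14.13 | 5.03% | 11.57% | +6.54 | +57.9% | −1.040 | +48.1%
SSA ? | EMEA | 83.14% | 65.42% | −17.72 | 1.31% | 4.99% | +3.68 | +160.8% | −0.734 | +34.0%
UA | EMEA | 91.61% | 89.69% | −1.92 | 7.72% | 3.60% | −4.12 | −68.1% | −0.588 | +27.2%
KR ? | APAC | 96.04% | 81.92% | −14.12 | 2.87% | 1.14% | −1.73 | −72.8% | −0.455 | +21.1%
BD ? | APAC | 93.47% | 84.21% | −9.26 | 3.38% | 1.01% | −2.37 | −79.5% | −0.435 | +20.2%
Top-6 negative subtotal | | | | | | | | | −4.342 | +201.0%
"Δ Vol%" shows the raw Tobacco volume change in that market W15→W16 (e.g., MENA2 −49.3% means Tobacco moderation traffic in MENA2 nearly halved). "Total" is each market's contribution, in pp, to the +2.16pp Tobacco accuracy gain; negative Total entries dragged the gain down. "% of Tob Δ" expresses the same contribution as a percentage and is shown with the opposite sign to Total, following the sign pattern of the global tables.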
04b

Why these markets had outsized impact

MENA2 — single market = 75% of Tobacco's gain

MENA2: pure rate-effect explosion

Tobacco accuracy in MENA2 jumped 54.25% → 86.21% (+31.96pp). This is the largest single-market accuracy swing in the entire dataset.

Mechanism breakdown:

  • Rate effect: +1.73pp (5.4% within-Tobacco share × +31.96pp accuracy gain)
  • Weight effect: +0.35pp (share dropped 5.4%→4.0%, beneficial since MENA2 was below Tobacco's 79% mean)
  • Interaction: −0.45pp (share ↓ × accuracy ↑ — opposing directions cancel)
  • Total: +1.63pp = 75% of Tobacco's overall +2.16pp gain

Why MENA2 mattered so much: a) starting accuracy was extremely low (54%), so improvement headroom was huge; b) it carried a meaningful share (5.4%) of Tobacco moderation, so each point of accuracy gain carried real weight; c) its share contracted while it was still below Tobacco's mean, which added +0.35pp through the weight effect, although the interaction term (−0.45pp) more than gave that back.

Verify: a 32pp single-week swing in one market is implausible as organic improvement. Likely candidates: labeling guideline change, sample composition shift, or regional moderator team retrained. Investigate before W17.

VN: Tobacco volume nearly tripled while accuracy rose

VN: ideal-direction expansion

Tobacco volume in VN: 5,776 → 16,065 (+178%). Within-Tobacco share rose 2.00% → 8.12% (+6.12pp). Accuracy rose 87.50% → 98.04% (+10.54pp).

This is the best-case pattern: more volume, higher accuracy. All three shift-share components are positive: rate (+0.21pp), weight (+0.52pp — VN was ABOVE Tobacco mean, so growing helps), interaction (+0.64pp).

Why this matters: if VN sustains 98% Tobacco accuracy at 8% share, it becomes a structural Tobacco anchor. Track whether the volume surge is a one-off (data backlog catch-up?) or new baseline.

ID — biggest drag on Tobacco's gain (−50% of Tobacco Δ)

ID: the largest single drag within Tobacco

ID accounted for 17.17% of Tobacco moderation in W15 — the largest single market. In W16: share crashed to 10.21% (−6.96pp absolute, −41% relative) AND accuracy fell from 88.14% → 83.66% (−4.47pp).

Mechanism: Volume dropped 59% (49,542 → 20,210). Accuracy dropped on top of the volume drop. ID was above Tobacco's mean, so losing share also hurt via weight effect.

  • Rate effect: −0.77pp
  • Weight effect: −0.63pp (above-mean segment shrinking is bad)
  • Interaction: +0.31pp (share ↓ × accuracy ↓: the product of two negative changes is positive, because the accuracy drop applies to a smaller share than the rate effect assumes)
  • Total: −1.09pp = removed 50% of what could have been a much bigger Tobacco gain

Tobacco's +2.16pp gain would have been roughly +3.25pp without the ID drag. Indonesia is now the single largest variable in Tobacco's trajectory.

MENA1 — accuracy collapsed while volume surged (worst pattern)

MENA1: rate and interaction both sharply negative

MENA1 Tobacco share more than doubled (5.03% → 11.57%, +6.54pp absolute). Accuracy collapsed (88.12% → 74.00%, −14.13pp). Volume +57.9% (14,502 → 22,896).

This is the worst possible direction for a Tobacco market: more volume into a now-failing segment.

  • Rate effect: −0.71pp (5% share × −14.13pp accuracy)
  • Weight effect: +0.59pp (share grew, but MENA1 was ABOVE Tobacco mean at 88% W15 — so partially offset)
  • Interaction: −0.92pp (share ↑ × accuracy ↓ — the worst combination)
  • Total: −1.04pp drag

Pattern: Why did MENA1 Tobacco volume surge while quality crashed? Possible: sudden enforcement campaign or content trend in the region pushed more Tobacco cases into review faster than moderator capacity could absorb. Compare to the broader MENA1 market data (which dropped −1.81pp overall) — Tobacco is the dominant contributor to MENA1's regional decline.

Why some markets had outsized impact — three-factor framework

What makes a market matter for Tobacco's number

A market's contribution to Tobacco's accuracy delta depends on three factors:

  1. Share of Tobacco's volume. The bigger the slice, the more leverage. ID (17%), EN (10%), DE (8%), UA (8%) lead. Low-share markets like LV, EE, NO can have huge accuracy swings with negligible global impact.
  2. Accuracy gap from Tobacco's overall mean (~80%). Markets sitting well below mean (LATAM 49%, BR 55%, DE 58%, MENA2 54%) are "headroom" markets — small accuracy gains translate disproportionately. Markets near or above mean (PK 92%, KR 96%) have less upside per unit of effort.
  3. Share-direction × accuracy-direction alignment. Best case: above-mean market growing share AND improving (VN). Worst case: above-mean market shrinking AND declining (ID). The interaction term captures this combinatorial effect.

Why MENA2 dominated: low starting accuracy (huge headroom) + meaningful share (5%+). Its share contraction added a positive weight effect (+0.35pp), though the interaction (−0.45pp) offset it.

Why ID dragged so much: largest share by far (17%), accuracy was above mean at W15 (so losing share hurt), and accuracy also fell (rate effect −0.77pp), with only a partial offset from the interaction (+0.31pp).
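The four market breakdowns discussed in this section can be reproduced in one pass. This sketch assumes the within-Tobacco decomposition uses Tobacco's overall W15 accuracy (79.04%) as the reference mean, which matches the "79% mean" cited for MENA2. The names are illustrative, and the inputs are the rounded table values, so totals may differ from the table in the third decimal.

```python
TOBACCO_ACC_W15 = 79.04  # Tobacco's overall W15 accuracy, %

# (share W15, share W16, acc W15, acc W16), all in %, copied from the market table.
markets = {
    "MENA2": (5.40, 3.99, 54.25, 86.21),
    "VN":    (2.00, 8.12, 87.50, 98.04),
    "ID":    (17.17, 10.21, 88.14, 83.66),
    "MENA1": (5.03, 11.57, 88.12, 74.00),
}


def decompose(share_prev, share_curr, acc_prev, acc_curr, mean_prev=TOBACCO_ACC_W15):
    """Rate, weight, interaction and total (pp) for one market within Tobacco."""
    d_share = (share_curr - share_prev) / 100
    d_acc = acc_curr - acc_prev
    rate = share_prev / 100 * d_acc
    weight = d_share * (acc_prev - mean_prev)
    inter = d_share * d_acc
    return rate, weight, inter, rate + weight + inter


totals = {}
for name, row in markets.items():
    rate, weight, inter, total = decompose(*row)
    totals[name] = total
    print(f"{name:6s} rate={rate:+.2f} weight={weight:+.2f} inter={inter:+.2f} total={total:+.2f}")
```

The signs line up with factor 3 of the framework: VN is positive on all three terms, while ID and MENA1 each have one offsetting term that keeps the total drag near −1.0pp.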

Region-level Tobacco summary (full breakdown)

Where Tobacco moderation actually happens

EMEA: 47.55% of Tobacco's W16 mix (94K cases). Accuracy 74.39%, well below Tobacco's 81.20% mean and far below the region-aggregate Tobacco baseline. EMEA Tobacco improved +2.16pp this week, driven almost entirely by MENA2 + DE recovery.

APAC: 44.06% of Tobacco's mix (87K cases). Accuracy 89.79% — well above Tobacco's mean. APAC is the high-quality anchor, but its accuracy slipped −0.12pp this week as ID drag offset VN/MY/TH gains.

AMS: 8.38% of Tobacco's mix (17K cases). Smallest share but biggest WoW accuracy improvement at +15.86pp (LATAM and BR both jumped 8–25pp). Despite the small footprint, AMS contributed +0.84pp to Tobacco's gain (39%).

Implications:

  • EMEA is structurally the weakest Tobacco region — sustained EMEA improvement would yield the biggest global lift.
  • APAC's flatness this week is mostly an ID story; if ID stabilizes, APAC reverts to a +0.5pp/week tailwind.
  • AMS is volatile (LATAM Tobacco at 49% W15 is implausibly low) — the W16 jump may be partially correction-to-mean, not real improvement.
05

Indonesia (ID) — policy-level breakdown

ID OMA · APAC
86.81%
▼ 3.41pp
from W15 90.22% · 4-week: −5.97pp from W13 92.79%
#1 dragger: Youth Sexualized Behaviors
47.66%
▼ 22.34pp
137% of ID Δ · weight 9.08% → 15.27%
#2 dragger: Youth Regulated Goods
26.46%
▼ 51.86pp
133% of ID Δ · sample 37→12, partly noise
Tobacco in ID
83.66%
▼ 6.81pp
30% of ID Δ · weight 26.3% → 14.3% absorbed most damage
ID's W16 decline is a youth-content moderation problem, not a Tobacco problem
Youth Sexualized Behaviors contributes −4.66pp (137% of ID's −3.41pp) — the single biggest drag. Youth Regulated Goods adds another −4.55pp (133%). Adult Sexualized Behaviors −2.35pp (69%). Youth Non-Sexualized Nudity −1.74pp (51%, most sample-backed). Tobacco contributes −1.01pp (30%) — significant but smaller than the youth-content cluster. The youth-content failures with growing weight are the dominant story.
Heavy data caveat: ID's policy-level samples are extremely small
Most policies have 1–12 samples per week. Only 3 policies (Youth Sexualized Behaviors, Youth Non-Sexualized Nudity, Youth Body Exposure - Sig & Mod) have W16 samples ≥ 30; Tobacco falls just short at 28. Treat anything below sample 30 as directional, not quantitative. Some of the most extreme single-week swings (Counterfeit +40pp, Adult Sexual Activity −60pp, Graphic Content - Public Interest −100pp) are statistical artifacts on samples of 1–5.
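To make the sample-size caveat concrete, the sketch below computes an approximate 95% Wilson score interval for a few of the W16 accuracy figures. It assumes each accuracy can be treated as a simple unweighted proportion of the listed sample, which may not hold if the report's accuracies are weighted; the function name is illustrative.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# (policy, W16 accuracy as a fraction, W16 sample size) from the ID policy tables.
for label, acc, n in [
    ("Youth Non-Sexualized Nudity", 0.6691, 98),
    ("Youth Sexualized Behaviors", 0.4766, 32),
    ("Youth Regulated Goods", 0.2646, 12),
    ("Adult Sexual Activity", 0.0648, 5),
]:
    lo, hi = wilson_interval(acc, n)
    print(f"{label:30s} n={n:3d}  {lo:.0%} to {hi:.0%}")
```

The interval widens quickly as the sample shrinks, which is the reason for reading sub-30 rows as directional only.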
05a

Methodology — shift-share applied to ID

For ID-internal analysis, "Global Acc" in the formulas is replaced by ID's overall accuracy (90.22% in W15). Each ID policy is decomposed into rate, weight, and interaction effects relative to ID's mean.

Rate effect = GWt_W15 × (Acc_W16 − Acc_W15): pure accuracy change at prior weight (within ID)
Weight effect = (GWt_W16 − GWt_W15) × (Acc_W15 − IDAcc_W15): mix shift relative to ID mean (90.22%)
Interaction = (GWt_W16 − GWt_W15) × (Acc_W16 − Acc_W15): joint change
−3.84pp
Sum of rate effects
accuracy degradation across policies
−5.50pp
Sum of weight effects
mix shifts toward below-mean policies
−5.05pp
Sum of interactions
share & accuracy moving against
Reconciliation note: shift-share sum is −14.4pp, ID's actual OMA Δ is −3.4pp
The mismatch is because this dataset reports "Title Accuracy", which differs from OMA accuracy in how cases are aggregated. Use the per-policy Rate / Weight / Interaction breakdown to compare policies against each other within ID, not as exact accounting against the −3.41pp headline. The directionality and ranking are valid; the absolute magnitudes are inflated by the metric mismatch.
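The size of the mismatch follows directly from the three sums shown above. A short check, with illustrative variable names:

```python
sum_rate, sum_weight, sum_interaction = -3.84, -5.50, -5.05  # pp, from the summary cards
actual_delta = -3.41                                         # ID's OMA change, pp

shift_share_total = sum_rate + sum_weight + sum_interaction
gap = shift_share_total - actual_delta
inflation = shift_share_total / actual_delta

print(f"shift-share total {shift_share_total:+.2f}pp, gap {gap:+.2f}pp, ratio {inflation:.1f}x")
```

A ratio of roughly 4x is why the per-policy totals are best read as a ranking within ID rather than as an exact accounting of the headline.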
05b

Policies dragging ID's accuracy — full shift-share decomposition

Policy | Acc W15 | Acc W16 | Δ Acc (pp) | Wt W15 | Wt W16 | Δ Wt (pp) | Rate (pp) | Weight (pp) | Inter (pp) | Total (pp) | % of ID Δ | Sample
Youth Sexualized Behaviors ? | 70.00% | 47.66% | −22.34 | 9.08% | 15.27% | +6.19 | −2.029 | −1.251 | −1.382 | −4.662 | 137% | 51 → 32
Youth Regulated Goods and Services ? | 78.32% | 26.46% | −51.86 | 8.87% | 8.79% | −0.08 | −4.602 | +0.010 | +0.044 | −4.548 | 133% | 37 → 12
Adult Sexualized Behaviors ? | 58.38% | 41.52% | −16.86 | 4.16% | 7.55% | +3.39 | −0.701 | −1.079 | −0.571 | −2.352 | 69% | 25 → 18
Youth Body Exposure - Light (4-17) ? | 83.02% | 43.38% | −39.63 | 0.44% | 4.51% | +4.07 | −0.174 | −0.293 | −1.614 | −2.081 | 61% | 12 → 11
Light Body Exposure ? | 66.67% | 19.91% | −46.76 | 0.93% | 3.12% | +2.19 | −0.436 | −0.516 | −1.024 | −1.976 | 58% | 3 → 4
Youth Non-Sexualized Nudity ? | 81.80% | 66.91% | −14.89 | 8.36% | 10.48% | +2.12 | −1.245 | −0.178 | −0.316 | −1.739 | 51% | 169 → 98
Tobacco and Nicotine ? | 90.47% | 83.66% | −6.81 | 26.33% | 14.34% | −11.99 | −1.793 | −0.030 | +0.817 | −1.007 | 30% | 91 → 28
Adult Sexual Activity ? | 67.12% | 6.48% | −60.64 | 1.59% | 1.31% | −0.28 | −0.966 | +0.066 | +0.173 | −0.728 | 21% | 9 → 5
Top-8 negative subtotal | | | | | | | −11.946 | −3.270 | −3.873 | −19.10 | 560% |
Reading the columns: Rate = pure accuracy regression at W15 weight. Weight = mix shift relative to ID's W15 mean (90.22%) — negative means the policy gained share while sitting below ID mean (a drag). Inter = compounding effect when share and accuracy move together adversely. Total = sum of the three.
Δ Wt is the change in each policy's weight within ID's moderation mix. Some policies (Counterfeit Goods, Personal Information - High Risk, Graphic Content - Public Interest) are excluded due to 0%/100% data integrity issues.
05c

Policies improving in ID — mostly statistical noise

Policy | Acc W15 | Acc W16 | Δ Acc (pp) | Wt W15 | Wt W16 | Rate (pp) | Weight (pp) | Inter (pp) | Total (pp) | % of ID Δ | Sample
High Risk Weight Loss & Muscle Gain | 0.00% | 100.00% | +100.00 | 1.37% | 0.62% | +1.368 | +0.673 | −0.746 | +1.295 | −38% | 2 → 1
Frauds & Scams | 72.22% | 100.00% | +27.78 | 5.60% | 2.49% | +1.554 | +0.560 | −0.864 | +1.250 | −37% | 18 → 4
Alcohol | 62.31% | 100.00% | +37.69 | 2.38% | 1.88% | +0.898 | +0.140 | −0.190 | +0.849 | −25% | 5 → 1
Physical Assault | 38.35% | 100.00% | +61.65 | 1.63% | 0.02% | +1.003 | +0.832 | −0.989 | +0.846 | −25% | 5 → 1
High Risk Driving | 70.09% | 100.00% | +29.91 | 3.12% | 1.91% | +0.933 | +0.243 | −0.362 | +0.814 | −24% | 11 → 6
Youth Body Exposure - Sig & Moderate ? | 77.45% | 91.46% | +14.01 | 5.19% | 5.65% | +0.727 | −0.060 | +0.065 | +0.733 | −21% | 60 → 39
Graphic Content - Realistic Fiction | 2.06% | 100.00% | +97.94 | 0.63% | 1.24% | +0.619 | −0.539 | +0.599 | +0.678 | −20% | 3 → 2
Top-7 positive subtotal | | | | | | +7.102 | +1.849 | −2.487 | +6.464 | −190% |
Most "improvements" are statistical artifacts on samples of 1–6. Only Youth Body Exposure - Sig & Moderate (sample 60→39, +14pp on growing weight) is a credible offset. Note that the weight effect can be positive even when a policy's share shrinks: a below-mean policy that loses share lifts the mix. For example, High Risk Weight Loss went from 1.37% → 0.62% of share with a W15 accuracy of 0%, far below the ID mean, so its weight effect is +0.673 even though the rate effect dominates the row.
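To give a sense of how little a 100% reading on one to six samples says, a confidence interval can be attached to each accuracy figure. The sketch below uses the Wilson score interval and assumes accuracy is a simple unweighted proportion of correct samples, which may not match how the report computes it; `wilson_interval` is a hypothetical helper name.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# W16 readings of 100% on tiny samples (sample sizes as in the table).
for label, n in [("sample of 1", 1), ("sample of 6", 6)]:
    lo, hi = wilson_interval(successes=n, n=n)
    print(f"{label}: 100% observed, interval {lo:.0%} to {hi:.0%}")
```

Under these assumptions the lower bound sits far below 100% in both cases, which is consistent with treating these rows as noise rather than as real offsets.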
Youth Sexualized Behaviors — most credible single drag in ID

Youth Sexualized Behaviors: rate × growing share = compounding

Multi-week trajectory: W13 70.00% (sample 16) → W14 41.14% (sample 11) → W15 70.00% (sample 51) → W16 47.66% (sample 32).

The W14–W15 oscillation suggests this category sits on the noise threshold, but the W15 figure (70% on 51 samples) is the most reliable W15 baseline — and W16's 47.66% on 32 samples is large enough that the −22pp gap likely reflects a genuine quality issue.

The compounding factor: share grew 9.08% → 15.27% (+6.19pp). Volume jumped 16,656 → 25,077 (+50%). So more cases entered a now-failing category — interaction effect within ID.

Action: The growing volume + falling accuracy is the worst direction for any policy. Investigate whether the volume growth is content-driven (e.g., a viral trend) or operational (changed routing rules), and whether moderator capacity for this category scaled accordingly.

Youth Non-Sexualized Nudity — strongest sample-backed drag

Youth Non-Sexualized Nudity: 81.80% → 66.91%, sample 169 → 98

This is the most statistically reliable drag in ID this week. Even after W16's sample reduction, 98 samples gives reasonable confidence in the −14.89pp signal.

Multi-week: W13 73.25% (sample 43) → W14 94.67% (sample 27) → W15 81.80% (sample 169) → W16 66.91% (sample 98). The category is volatile but the W15→W16 movement is supported by a meaningful sample on both sides.
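One way to gauge whether a drop of this size could be sampling noise is a two-proportion z-test on the W15 and W16 samples. This is a sketch under simplifying assumptions (independent simple random samples, unweighted accuracy); the report does not state how its samples are drawn, and `two_proportion_z` is a hypothetical helper.

```python
import math

def two_proportion_z(p0, n0, p1, n1):
    """z statistic for the difference p1 - p0 using a pooled standard error."""
    pooled = (p0 * n0 + p1 * n1) / (n0 + n1)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n0 + 1 / n1))
    return (p1 - p0) / se

# Youth Non-Sexualized Nudity in ID: 81.80% (n=169) -> 66.91% (n=98).
z = two_proportion_z(0.8180, 169, 0.6691, 98)
print(f"z = {z:.2f}")  # |z| above roughly 1.96 is unlikely to be noise alone
```

With these inputs the statistic lands well beyond the usual 1.96 threshold, which supports reading the W15→W16 movement as a real signal rather than a sampling artifact.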

Share also grew (8.36% → 10.48%, +2.12pp), so the accuracy drop and the share growth compounded through the interaction effect (−0.316pp).

Action: This is one of the few ID policy signals that is sample-backed and trend-coherent. Add to the ID quality investigation immediately.

Tobacco in ID — a different story from the global Tobacco picture

Why Tobacco's role in ID is different from its global role

Globally, Tobacco is a +2.16pp gainer. But in ID specifically:

  • Accuracy ↓: 90.47% → 83.66% (−6.81pp)
  • Share ↓: 26.33% → 14.34% (−11.99pp absolute, −45% relative)
  • Volume ↓: 48,288 → 23,555 (−51%)

This is the rare combination where a shrinking share cushions a falling rate. Tobacco's W15 accuracy (90.47%) sat barely above ID's 90.22% mean, so the weight effect of nearly halving its share is roughly neutral (−0.030pp). The cushion comes from the interaction term (+0.817pp): less volume was exposed to the accuracy drop, which offsets part of the −1.793pp rate effect and leaves a total of −1.007pp.

4-week trend (Tobacco accuracy in ID): 96.47% → 96.24% → 88.14% → 83.66%. This is a real degradation pattern, and ID is the single biggest contributor to the global Tobacco x Markets drag (−1.09pp on the Tobacco delta).

The ID Tobacco volume halving is itself worth investigating: is it a real moderation traffic decrease (content trend) or a routing/sampling change?

Youth Body Exposure - Light — 9× volume surge with accuracy halving

An enforcement campaign signature?

Volume in this policy went 804 → 7,410 in a single week (a 9× jump). Accuracy went 83.02% → 43.38% (−39.6pp).

This pattern is unusual: a category that was near-marginal in W15 suddenly carries 9× volume at much worse quality. Hypothesis: a content trigger (viral trend, regulatory directive, or routing rule change) flooded this category with cases the moderation pipeline wasn't calibrated for.

Sample is small (12 → 11) so the magnitude is uncertain, but the volume pattern alone is worth investigating — that kind of step-change usually has a discrete cause.

ID 4-week OMA trajectory — what's been declining

Indonesia's accuracy across W13–W16

  • W13: 92.79% — high baseline
  • W14: 89.63% (−3.16pp WoW)
  • W15: 90.22% (+0.58pp recovery)
  • W16: 86.81% (−3.41pp)
  • Cumulative: −5.97pp from W13 baseline

The recovery in W15 was partial and W16 erased it plus more. ID is now at its lowest level in the analyzed window.

The decline is increasingly concentrated in Youth-related policies (per W16 data). This points to either: a) an Indonesia-specific content trend (more youth-content moderation cases hitting the system), b) a localization issue with policy guidelines for youth content in Bahasa Indonesia, or c) a moderator team rotation that affected youth-category calibration.

Caveats: how to read this analysis

Important data limitations

1. The "Title Accuracy" column in this dataset isn't directly comparable to OMA accuracy. Sum of policy contributions reaches −14.4pp while ID's actual OMA delta is −3.41pp. The metric definitions diverge — use this table for relative comparison among ID policies, not for precise contribution accounting.

2. Sample sizes are very small. Most policies have W16 samples between 1 and 30. Only 4 policies cross the 30-sample threshold. This is fundamentally a small-N analysis at the policy × market level.

3. Several policies show 0% or 100% accuracy persistently. These are likely pipeline/data-integrity artifacts, not real moderator behavior. Excluded from the dragger table.

4. Within-ID share doesn't sum to 100%. The dataset's "Weight Proportion" column appears to be normalized within sub-buckets, not within ID overall. Volume-derived share has been used in this analysis instead.

Recommended actions
1. Don't celebrate the −0.04pp headline. The mix is unstable: 10+ policies dropped 10–35pp this week, balanced by equally large gainers. If one of the gainers fails to repeat next week, the headline could swing 1–2pp.
2. Investigate Violent Behaviors (−19.32pp accuracy, weight growing): a triple-negative on a high-reputational-risk category. Escalate to policy ops.
3. Personal Information - High Risk dropped 31pp: privacy-sensitive, suspicious magnitude. Audit sample composition and labeler agreement before W17.
4. ID + LATAM combined drag of 1648%: both regions saw 3+pp accuracy drops. Region-level RCA is needed to determine whether this is a shared cause (model update, content shift) or independent.
5. ID has declined in 2 of the last 3 weeks (W13 92.79% → W16 86.81%, cumulative −5.97pp). This is no longer a single-week event; request a structured Indonesia retrospective.
6. Verify the extreme swings are real, not artifacts. Disparaging Religion (−76pp), Invasive Cosmetic (+23pp), MENA2 region (+6.84pp), Adult Fetish & Kinks (+32pp): these magnitudes invite sampling/labeling scrutiny before being trusted as signal.
7. Data integrity: 4 policies report 0% accuracy in both weeks (Animal Abuse & Graphic Content, Youth Physical Abuse, Graphic Content, Adult Sexual Abuse) yet still drag the global number via weight changes. Likely a data filter or definitional issue; fix before treating as RCA signal.
Priority matrix — what to triage first

Triage prioritization

P0 (immediate, integrity risk): Personal Information - High Risk (−31pp), Violent Behaviors (−19.32pp). Both are reputational categories with material accuracy regression on growing or stable weight.

P0 (data integrity): Verify Disparaging Religion (−76pp), 0%-accuracy policies, and other extreme single-policy swings are not sample/labeling artifacts. Swings of this size are more likely measurement issues than real changes.

P1 (regional): ID + LATAM joint investigation. If the cause is shared (e.g., a regional model rollout), one fix solves both. Otherwise treat as independent.

P1 (trend): ID 4-week decline pattern. Even if the isolated W16 event resolves, the trend itself warrants attention.

P2 (lock in gains): Tobacco & Nicotine recovery (3 weeks now positive) and Adult Sexualized Behaviors offset — confirm structural drivers, not just regression-to-mean.

P3 (signal hygiene): Stop using single-week % of Δ as the primary metric in near-flat weeks. When the absolute global Δ is below 0.1pp, use absolute pp contributions instead.
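The signal-hygiene rule above could be enforced at reporting time with a small guard. This is a minimal sketch: `describe_contribution` is a hypothetical helper, the 0.1pp threshold is the one proposed in the P3 item, and the second example value is illustrative rather than taken from a table.

```python
def describe_contribution(contribution_pp, global_delta_pp, min_delta_pp=0.1):
    """Format a segment's contribution, suppressing % of Δ when Δ is near zero."""
    if abs(global_delta_pp) < min_delta_pp:
        return f"{contribution_pp:+.3f}pp (global Δ too small for % of Δ)"
    share = contribution_pp / global_delta_pp
    return f"{contribution_pp:+.3f}pp ({share:.0%} of Δ)"

print(describe_contribution(-0.628, -1.63))  # a W14-style week with a real move
print(describe_contribution(-0.337, -0.04))  # illustrative drag in a flat week
```

The absolute pp figure is always shown, so segments stay comparable across weeks even when the percentage is withheld.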

Global W14
84.28%
▼ 1.63pp
from 85.91% · 100% of decline
EMEA · Wt 32.4%
79.65%
▼ 6.48pp
128% of global decline
APAC · Wt 47.8%
88.61%
▲ 0.92pp
−31% offset the decline
AMS · Wt 19.7%
81.37%
▼ 0.12pp
1% of global decline
Overview −1.63pp
Methodology, decomposition & fuzzy
Hub × Type 129%
EMEA Appeal alone = 50.9%
EMEA Markets 110%
MENA1 leads at 38.4%
Top Projects TOP 10
GB-MNL #1 at 25.1%
Actions 7
P0–P2 prioritized items
00

Methodology & summary

W14 (Apr 4–10) vs W13 (Mar 28–Apr 3). Each segment's total contribution is decomposed into three additive components. Positive % of Δ = contributed to the decline; negative = offset.

Rate effect = GWt_W13 × (Acc_W14 − Acc_W13): pure accuracy change at prior weight
Weight effect = (GWt_W14 − GWt_W13) × (Acc_W13 − GlobalAcc_W13): mix shift relative to global mean
Interaction = (GWt_W14 − GWt_W13) × (Acc_W14 − Acc_W13): joint change
−2.64pp
Total rate effect
161% of decline
+0.76pp
Total weight effect
Offset 47%
+0.24pp
Total interaction
Offset 15%
Quality degraded across the board — here's why this matters
The rate effect alone (−2.64pp) would have caused a 2.64pp decline if the mix hadn't shifted favorably. The actual −1.63pp is the best-case outcome given how much accuracy fell, saved only by favorable weight rebalancing.
APAC's growth was the safety net — here's how
APAC (88.6% accuracy, above global mean) grew from 47.2% → 47.8% of mix. This single shift absorbed nearly half the damage. Without it, the headline would read −3.1pp instead of −1.63pp.
How to read this decomposition

Interpreting the three effects

Rate effect (161%) tells us accuracy degradation within segments — holding mix constant — more than fully explains the decline. This is the "quality got worse" signal.

Weight effect (offset 47%) means the mix actually shifted favorably: segments with above-average accuracy gained share. Without this, the decline would have been ~2.4pp instead of 1.63pp.

Interaction (offset 15%) captures the joint effect — segments that lost accuracy also tended to shrink in weight, providing a small additional buffer.

The sum: −2.64 + 0.76 + 0.24 = −1.63pp, matching the observed global decline exactly.
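The exact match is not a coincidence. A short derivation shows why the three effects always sum to the global change, assuming global accuracy is the weight-averaged segment accuracy and the global weights sum to 1 in both weeks. The symbols are shorthand introduced here: $w_i$ for GWt, $a_i$ for Acc, and $G$ for global accuracy, with superscripts for the week.

```latex
\begin{aligned}
\text{Rate}_i + \text{Weight}_i + \text{Inter}_i
 &= w_i^{13}\,(a_i^{14} - a_i^{13})
  + (w_i^{14} - w_i^{13})\,(a_i^{13} - G^{13})
  + (w_i^{14} - w_i^{13})\,(a_i^{14} - a_i^{13}) \\
 &= w_i^{14} a_i^{14} - w_i^{13} a_i^{13} - (w_i^{14} - w_i^{13})\,G^{13} \\[4pt]
\sum_i \bigl(\text{Rate}_i + \text{Weight}_i + \text{Inter}_i\bigr)
 &= \sum_i w_i^{14} a_i^{14} - \sum_i w_i^{13} a_i^{13}
  - G^{13} \Bigl(\sum_i w_i^{14} - \sum_i w_i^{13}\Bigr) \\
 &= G^{14} - G^{13}
\end{aligned}
```

The last step uses $\sum_i w_i^{14} = \sum_i w_i^{13} = 1$, so the term multiplying $G^{13}$ vanishes. If weights do not sum to 1 (for example, when segments are excluded), a residual appears.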

01

Fuzzy rate impact

−0.36pp
Fuzzy rate increase
21.8% of total decline
−1.28pp
Non-fuzzy accuracy decline
78.2% of total decline
Fuzzy rate rose +0.35pp — but three hubs tell completely different stories
AMS: decline is 100% fuzzy — real quality held steady. APAC: powered through the biggest fuzzy headwind (+0.49pp) with +1.41pp genuine improvement. EMEA: 96% of the −6.48pp drop is real accuracy errors, not borderline ambiguity.
Hub | FR W14 | FR W13 | Δ FR | Acc Δ total | Fuzzy explains | Non-fuzzy Δ | Verdict
AMS | 1.76% | 1.57% | +0.19pp | −0.12pp | −0.19pp | +0.06pp | Entire decline is fuzzy-driven. Non-fuzzy accuracy actually improved.
APAC | 2.08% | 1.59% | +0.49pp | +0.92pp | −0.49pp | +1.41pp | Fuzzy headwind absorbed: non-fuzzy quality improved strongly (+1.41pp).
EMEA | 3.12% | 2.86% | +0.26pp | −6.48pp | −0.26pp | −6.21pp | 96% of EMEA's decline is non-fuzzy. Fuzzy is a minor factor here.
Global | 2.35% | 2.00% | +0.36pp | −1.63pp | −0.36pp | −1.28pp | Fuzzy = 22%, non-fuzzy = 78%
Key insight: The three hubs tell very different stories. AMS's small decline is 100% fuzzy — actual quality held steady. APAC powered through a large fuzzy increase with even larger genuine improvement. EMEA's massive drop is overwhelmingly real accuracy errors — fuzzy rate barely moved. This confirms EMEA's issue is fundamentally about moderation quality, not borderline-case ambiguity.
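The split behind the table can be sketched as follows. It assumes, as the table appears to, that each +1pp of fuzzy rate costs 1pp of accuracy; `split_fuzzy` is a hypothetical helper, and because the inputs are the rounded table values, results can differ from the published column by about 0.01pp.

```python
def split_fuzzy(acc_delta_pp, fuzzy_rate_delta_pp):
    """Split an accuracy change into fuzzy-explained and non-fuzzy parts (pp)."""
    fuzzy_part = -fuzzy_rate_delta_pp       # a fuzzy-rate increase lowers accuracy
    non_fuzzy_part = acc_delta_pp - fuzzy_part
    return fuzzy_part, non_fuzzy_part

# (accuracy Δ, fuzzy-rate Δ) per hub, in pp, from the table.
hubs = {"AMS": (-0.12, 0.19), "APAC": (0.92, 0.49), "EMEA": (-6.48, 0.26)}
for hub, (acc_delta, fr_delta) in hubs.items():
    fuzzy, non_fuzzy = split_fuzzy(acc_delta, fr_delta)
    print(f"{hub}: fuzzy {fuzzy:+.2f}pp, non-fuzzy {non_fuzzy:+.2f}pp")
```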
AMS — decline is 100% fuzzy-driven

AMS: a fuzzy story, not a quality story

AMS accuracy fell just −0.12pp, and the entire decline is explained by the fuzzy rate increase (+0.19pp). Once fuzzy is stripped out, AMS non-fuzzy accuracy actually improved by +0.06pp.

This means AMS's labeling quality is holding steady or improving — the headline number is being dragged by borderline cases being reclassified or new ambiguous content types entering the pipeline.

Action: Consider fuzzy calibration or policy clarification for the specific content types driving the 0.19pp fuzzy increase. This is a recoverable loss.

APAC — strong quality masked by fuzzy headwind

APAC: quality is better than the headline suggests

APAC's reported accuracy improved +0.92pp, but the underlying non-fuzzy improvement is actually +1.41pp — being partially masked by a +0.49pp fuzzy rate increase (the largest of any hub).

APAC absorbed the biggest fuzzy headwind and still delivered the best headline improvement. However, the fuzzy trend (+0.49pp WoW) needs monitoring — if it continues, it will eventually overwhelm the quality gains.

Action: Investigate whether policy updates or new content types in APAC are driving the fuzzy surge. The quality fundamentals are strong, but the fuzzy trajectory is concerning.

EMEA — fuzzy is a rounding error; the problem is real

EMEA: genuine moderation quality crisis

EMEA's fuzzy rate only increased +0.26pp, explaining just 4% of its massive −6.48pp accuracy decline. The remaining −6.21pp is pure non-fuzzy accuracy degradation.

This definitively rules out "borderline cases" as an explanation for EMEA's performance. The problem is fundamentally about labeler accuracy, policy interpretation, or operational execution — not content ambiguity.

EMEA also has the highest absolute fuzzy rate (3.12% vs 2.08% APAC, 1.76% AMS), suggesting a structural baseline of ambiguity in its content mix, but the week-over-week change is small.

01

Hub × project type

EMEA's three project types account for 129% of the decline
EMEA Appeal alone is 50.9%: accuracy collapsed 82.7% → 76.6% (−6.06pp) while still carrying 15.6% of global weight. APAC General Recall is the largest single offset (−30.6%), improving to 90.1% while gaining share.
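Hub | Type | Acc W14 | Acc W13 | Δ Acc | GWt W14 | GWt W13 | Rate | Weight | Inter | Total | % of Δ
EMEA | Appeal | 76.6% | 82.7% | −6.06 | 15.6% | 19.1% | −1.156 | +0.113 | +0.212 | −0.831 | 50.9%
EMEA | General Recall | 84.2% | 91.6% | −7.37 | 12.1% | 10.0% | −0.734 | +0.119 | −0.156 | −0.771 | 47.1%
EMEA | Analytics Appeal | 77.7% | 89.7% | −12.01 | 4.7% | 3.2% | −0.387 | +0.055 | −0.174 | −0.505 | 30.9%
AMS | General Recall | 84.4% | 85.8% | −1.37 | 14.3% | 9.5% | −0.130 | −0.005 | −0.066 | −0.200 | 12.2%
APAC | Appeal | 85.4% | 85.7% | −0.27 | 15.9% | 18.8% | −0.052 | +0.006 | +0.008 | −0.037 | 2.3%
Negative subtotal | | | | | | | | | | −2.345 | 143.4%
AMS | Appeal | 72.9% | 78.7% | −5.86 | 5.0% | 10.5% | −0.614 | +0.394 | +0.320 | +0.100 | −6.1%
AMS | Analytics Appeal | 79.0% | 62.7% | +16.35 | 0.5% | 0.6% | +0.102 | +0.038 | −0.027 | +0.113 | −6.9%
APAC | General Recall | 90.1% | 88.6% | +1.49 | 27.5% | 24.1% | +0.359 | +0.091 | +0.051 | +0.501 | −30.6%
Positive subtotal | | | | | | | | | | +0.714 | −43.7%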
AMS Appeal — accuracy did fall (rate = −0.61pp), but its accuracy is well below the global mean, so the weight halving from 10.5% → 5.0% was net positive for the global number (+0.39pp weight effect), flipping total contribution to +0.10pp.
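A quick reconciliation check confirms that the hub × type rows account for the whole global move. The sketch below copies the Total column from the table above and compares the sum with the observed −1.63pp; the variable names are illustrative.

```python
# Total contribution (pp) per hub x type row, copied from the table above.
totals = {
    ("EMEA", "Appeal"): -0.831,
    ("EMEA", "General Recall"): -0.771,
    ("EMEA", "Analytics Appeal"): -0.505,
    ("AMS", "General Recall"): -0.200,
    ("APAC", "Appeal"): -0.037,
    ("AMS", "Appeal"): +0.100,
    ("AMS", "Analytics Appeal"): +0.113,
    ("APAC", "General Recall"): +0.501,
}
global_delta = -1.63  # observed W14 vs W13 change, pp

negative = sum(v for v in totals.values() if v < 0)
positive = sum(v for v in totals.values() if v > 0)
residual = (negative + positive) - global_delta

print(f"negative={negative:.3f} positive={positive:.3f} residual={residual:+.3f}")
```

A residual near zero indicates the eight rows cover the full mix. A large residual would point to missing segments or weights that do not sum to 100%.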
EMEA Appeal deep dive — why −6.06pp accuracy drop?

EMEA Appeal: rate effect dominance

The −1.156pp rate effect is the single largest driver in this decomposition. EMEA Appeal dropped from 82.7% to 76.6%, a −6.06pp swing, while still carrying 15.6% global weight.

The weight did shrink (19.1% → 15.6%), which partially offset the damage (+0.113pp weight effect, +0.212pp interaction), but the sheer magnitude of the accuracy collapse overwhelms both offsets.

Key question: Is this driven by specific BPO sites, policy updates, or labeler calibration drift? See the "Top Projects" tab for project-level decomposition.

APAC General Recall — why it's the biggest offset

APAC GR: the stabilizer

APAC General Recall improved from 88.6% to 90.1% (+1.49pp) while also gaining weight (24.1% → 27.5%). This is the ideal scenario: an above-average segment both improves and grows.

All three effects are positive: rate (+0.359pp), weight (+0.091pp), interaction (+0.051pp), summing to +0.501pp — the single largest offset at −30.6% of the decline.

02

EMEA market breakdown

5 markets drive 110% of the global decline — almost entirely rate-driven
MENA1 + EN + SSA + DE + MENA2. The damage is concentrated: MENA1 alone is 38.4%. Only SSA compounds all three effects — weight grew into a below-mean, declining segment.
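Market | Acc W14 | Acc W13 | Δ Acc | GWt W14 | GWt W13 | Rate | Weight | Inter | Total | % of Δ
MENA1 | 80.5% | 90.4% | −9.89 | 6.26% | 6.46% | −0.639 | −0.009 | +0.020 | −0.628 | 38.4%
EN (GB) | 78.8% | 88.5% | −9.67 | 3.56% | 4.18% | −0.404 | −0.016 | +0.061 | −0.360 | 22.0%
SSA | 75.9% | 84.2% | −8.22 | 3.65% | 2.46% | −0.202 | −0.021 | −0.099 | −0.321 | 19.7%
DE | 77.4% | 86.8% | −9.42 | 2.93% | 2.90% | −0.273 | +0.000 | −0.003 | −0.276 | 16.9%
MENA2 | 75.8% | 80.9% | −5.08 | 4.28% | 4.37% | −0.222 | +0.005 | +0.005 | −0.213 | 13.0%
IT | 84.2% | 93.2% | −9.07 | 2.43% | 2.27% | −0.205 | +0.012 | −0.015 | −0.208 | 12.7%
IL | 67.2% | 83.8% | −16.55 | 0.36% | 0.43% | −0.071 | +0.001 | +0.011 | −0.059 | 3.6%
UA | 74.3% | 77.3% | −3.04 | 1.08% | 1.04% | −0.032 | −0.003 | −0.001 | −0.036 | 2.2%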
SSA is the only top market where all three effects are negative — weight expanded (2.46%→3.65%), accuracy sits below the global mean, and accuracy also fell. A triple headwind worth investigating.
MENA1 deep dive — largest market contributor at 38.4%

MENA1: pure rate problem

MENA1 dropped from 90.4% to 80.5% (−9.89pp) while maintaining roughly stable weight (6.46% → 6.26%). The rate effect (−0.639pp) almost entirely explains its contribution.

This is a nearly pure accuracy regression — no confounding mix shifts. The investigation should focus on what changed in MENA1 labeling quality, policy interpretation, or task distribution during W14.

SSA triple headwind — all three effects negative

SSA: compounding failure mode

SSA is the only top-5 market where rate, weight, and interaction are all negative.

Rate (−0.202pp): accuracy fell from 84.2% to 75.9%, a −8.22pp drop.

Weight (−0.021pp): SSA's weight grew from 2.46% to 3.65%, but since SSA accuracy (84.2%) was below the W13 global mean (85.9%), this expansion hurts.

Interaction (−0.099pp): the weight grew AND accuracy fell simultaneously — the worst combination.

Key question: Was the SSA weight increase intentional (ramp-up)? If so, quality support did not scale with volume.

IL — steepest single-market accuracy drop (−16.55pp)

IL: low weight limits global impact

IL has the most dramatic accuracy decline of any market (83.8% → 67.2%, −16.55pp), but its small weight (0.36%) limits global impact to just −0.059pp (3.6% of decline).

Still worth flagging: a 16.5pp drop likely indicates a systemic issue — new policy, labeler turnover, or task type change — that could worsen if IL weight increases.

03

EMEA — top 10 individual projects (shift-share)

Top 3 projects drive 67% of the global decline
GB-ALR-MNL (25.1%): weight surged 6x into crashing accuracy. MENA2-CAS (22.0%): weight quadrupled into a chronically below-mean segment. MENA1-ANK (19.7%): pure accuracy regression. The common thread: weight expansion without quality support.
Project | Type | Acc W14 | Acc W13 | GWt W14 | GWt W13 | Rate | Weight | Inter | Total | % of Δ
GCP-TT-Video appeal-GB-en-ALR-MNL | Appeal | 69.9% | 100.0% | 2.25% | 0.36% | −0.108 | +0.266 | −0.568 | −0.410 | 25.1%
TT-Video-Analytics Appeal-MENA2-ar-T&S-CAS | Analytics Appeal | 67.1% | 73.5% | 2.25% | 0.51% | −0.033 | −0.215 | −0.112 | −0.360 | 22.0%
TT-Video-General Recall General-MENA1-ku-CNX-ANK | General Recall | 84.4% | 96.8% | 2.59% | 2.61% | −0.322 | −0.002 | +0.002 | −0.321 | 19.7%
TT-Video appeal-KE/TZ/UG-sw-TP-NBO | Appeal | 69.4% | 81.4% | 1.12% | 0.77% | −0.092 | −0.016 | −0.043 | −0.151 | 9.2%
GCP-TT-Video-General Recall General-GB-en-TP-ALB | General Recall | 58.9% | 92.7% | 0.06% | 1.40% | −0.473 | −0.091 | +0.451 | −0.113 | 6.9%
GCP-TT-Video appeal-IT-it-TP-BRV | Appeal | 84.7% | 92.3% | 0.87% | 1.60% | −0.122 | −0.046 | +0.055 | −0.113 | 6.9%
TT-Video appeal-MENA1-other-TP-MAK | Appeal | 63.4% | 96.1% | 0.22% | 0.53% | −0.174 | −0.032 | +0.101 | −0.104 | 6.4%
GCP-TT-Video-General Recall General-DE-de-TLS-LEJ | General Recall | 75.4% | 85.4% | 1.00% | 0.46% | −0.046 | −0.003 | −0.054 | −0.102 | 6.3%
TT-Video appeal-MENA1-ar-CNX-IBD | Appeal | 74.2% | 78.9% | 1.48% | 1.17% | −0.056 | −0.021 | −0.014 | −0.091 | 5.6%
TT-Video-General Recall General-MENA1-ar-TP-MAK | General Recall | N/A | 100% | 0.00% | 0.53% | −0.534 | −0.075 | +0.534 | −0.075 | 4.6%
Weight expansion is the recurring theme: 5 of 10 projects saw weight increase. When that expansion targets below-mean or declining-accuracy segments, the interaction effect compounds the damage. Only GR-MENA1-ku-CNX-ANK is a pure rate story (stable weight, −12.4pp accuracy drop).
GCP-TT-Video appeal-GB-en-ALR-MNL — #1 contributor at 25.1%, here's the mechanism

GB MNL: the weight surge trap

This project's weight surged 6.25x (0.36% → 2.25%) while accuracy crashed from 100% → 69.9%. The interaction effect (−0.568pp) is the largest single component — weight grew dramatically while accuracy fell dramatically.

The weight effect is actually positive (+0.266pp) because the project was above the global mean in W13 (100% vs 85.9%). But the interaction overwhelms it: expanding into what became a low-accuracy segment is a compounding failure.

Key question: Was this a deliberate ramp-up of a previously small project? If so, quality controls didn't scale with volume.

AA-MENA2-ar-T&S-CAS — weight quadrupled into a below-mean segment

MENA2 CAS: weight-driven damage

Weight grew from 0.51% → 2.25% (4.4x) while accuracy was already below the global mean (73.5%) and fell further to 67.1%. The weight effect alone (−0.215pp) is the largest component — this is a mix-shift problem, not primarily a rate problem.

All three effects are negative: rate (−0.033), weight (−0.215), interaction (−0.112). A triple headwind totaling −0.360pp (22.0% of decline).

Action: Validate whether this weight increase was intentional. Expanding a chronically below-mean segment without quality uplift compounds the global decline.

TT-Video-General Recall General-MENA1-ku-CNX-ANK — pure rate collapse, no mix excuse

MENA1 ANK: classic accuracy regression

This is the cleanest rate-driven case in the top 10: weight barely moved (2.61% → 2.59%), so the rate effect (−0.322pp) almost entirely explains the −0.321pp total contribution.

Accuracy dropped from 96.8% → 84.4% (−12.4pp) — a steep fall from a high base. No mix-shift or weight excuses here; something changed in execution quality.

Action: Investigate what changed for Kurdish-language GR in MENA1 during W14 — policy update, new labeler cohort, or calibration drift.

Recommended actions
1. EMEA Appeal quality RCA: focus on EN, MENA1, DE BPO sites where rate effect is the dominant driver.
2. DE-LEJ site investigation: two projects with catastrophic accuracy (0.0% and N/A), possible vendor execution failure.
3. Four projects went to N/A (zero weight) in W14. Confirm whether this is sampling shortfall or project suspension.
4. SSA weight expansion: the only top-5 market with a triple-negative (rate + weight + interaction all negative). Validate if the volume increase is intentional.
5. EMEA GCP weight surge (0.85% → 2.98%) into a 74%-accuracy segment: check if this is a ramp-up or reallocation, and whether quality support is in place.
6. APAC fuzzy rate jumped +0.49pp (largest increase): investigate whether policy updates or new content types are driving borderline cases. Non-fuzzy quality is strong, but the fuzzy trend needs monitoring.
7. AMS decline is entirely fuzzy-driven; non-fuzzy accuracy actually improved. Consider whether fuzzy calibration or policy clarification could recover the 0.19pp loss.
Priority matrix — impact vs effort

Triage prioritization

P0 (immediate): DE-LEJ site — likely site-level outage/failure, accounts for 63.7% of decline from just two projects. Quick root cause identification could recover the most impact.

P1 (this week): Four N/A projects — verify if delivery gaps are fixable. If unplanned, restoring these could offset 167% of the decline (they overlap with rate-driven decline).

P1 (this week): MENA1 accuracy regression — 38.4% of decline, pure rate effect. Check if a policy update or labeler calibration issue occurred during W14.

P2 (track): SSA triple headwind and EMEA GCP weight surge — these are structural issues that need monitoring over W15–W16 to determine if they're transient or persistent.

P2 (track): APAC fuzzy rate surge (+0.49pp) — quality fundamentals are strong but the fuzzy trajectory needs monitoring. AMS fuzzy calibration is a quick-win candidate.

Global OMA W15
86.04%
▲ +1.92pp
from W14 84.12% · biggest weekly gain
EMEA · Wt 32.59%
83.29%
▲ +3.71pp
+62% of global gain
AMS · Wt 20.21%
83.55%
▲ +2.34pp
+24% of global gain
APAC · Wt 46.94%
89.05%
▲ +0.70pp
+15% of global gain
Overview +1.92pp
Methodology & headline
Markets EN +20%
EMEA-led recovery
Top Policies 5 lifters
Combat sports leads
Tobacco × Markets EMEA 90%
where the gain came from
ID Deep-dive +0.58pp
mirror image of W16
Actions 5
Lock in the gains
00

Methodology & headline summary

W15 (Apr 11–17) vs W14 (Apr 4–10). OMA jumped +1.92pp — the largest weekly gain in the observed window, fully reversing the W14 decline and finishing 0.10pp above the W13 baseline.

Rate effect = GWt_W14 × (Acc_W15 − Acc_W14): pure accuracy change at prior weight
Weight effect = (GWt_W15 − GWt_W14) × (Acc_W14 − GlobalAcc_W14): mix shift relative to global mean
Interaction = (GWt_W15 − GWt_W14) × (Acc_W15 − Acc_W14): joint change
The recovery is policy-led, not market-led — a handful of high-stakes categories rebounded across all regions
The top 5 policies alone contributed +2.42pp (126% of the global gain). Combat Sports +0.59pp, Designated Dangerous Entities +0.53pp, Violent Behaviors +0.44pp, Alcohol +0.44pp, Highly Imitable Acts +0.42pp. Several of these (Violent Behaviors, Personal Information - High Risk) had collapsed in W14 — W15 looks like a correction, not a structural improvement.
Rate effect drove the recovery (+2.0pp); Weight effect added another (+1.1pp)
Among sample-credible policies: Rate sum +2.02pp (accuracy genuinely improved), Weight sum +1.09pp (a few low-accuracy policies like Highly Imitable Acts and Alcohol contracted in share — beneficial since they were below mean), Interaction sum −0.45pp (slightly opposing). Quality recovery is the dominant story.
Hidden drags during a recovery week — watch these for W16 reversal risk
Dangerous Trends - Serious Harm dropped −9.36pp (drag −0.47pp) — the single biggest policy drag. Adult Sexualized Behaviors −3.67pp (drag −0.39pp). MX −6.07pp accuracy (drag −0.25pp). PK −1.71pp. The strong headline conceals these regressions — if any expand, the W16 picture changes quickly. (Spoiler: in W16, several of these flipped further while Adult Sexualized Behaviors recovered.)
How concentrated was the recovery?

Concentration analysis

Top 3 markets (EN, LATAM, MENA1) contributed +1.07pp — 56% of the global gain.
Top 5 markets contributed +1.56pp — 82%.
Top 3 policies contributed +1.56pp — 81%.
Top 5 policies contributed +2.42pp — 126% (i.e., the rest of the policies net negative).

Recovery is genuinely concentrated — a handful of policies and markets did most of the work. This is fragile: if next week one or two of these reverse, the headline swings substantially.
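The concentration figures can be reproduced from per-segment totals. The sketch below uses the Total column for the eight top gaining markets from this report's by-market table; `top_n_share` is a hypothetical helper, and small differences from the published percentages are possible because the inputs are rounded.

```python
def top_n_share(contributions_pp, n, global_delta_pp):
    """Share of the global change explained by the n largest contributions."""
    top = sorted(contributions_pp, reverse=True)[:n]
    return sum(top) / global_delta_pp

# Total contributions (pp) of the eight top gaining markets in W15.
market_totals = [0.375, 0.360, 0.337, 0.278, 0.215, 0.141, 0.137, 0.130]
for n in (3, 5):
    print(f"top {n} markets: {top_n_share(market_totals, n, 1.92):.0%}")
```

Tracking this share week over week gives a simple fragility indicator: the higher the top-N share, the more the headline depends on a few segments repeating their result.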

How does W15 fit in the W13–W16 trajectory?

4-week trajectory

  • W13: 85.94% baseline
  • W14: 84.12% (−1.82pp WoW) — significant drop
  • W15: 86.04% (+1.92pp WoW) — biggest weekly gain
  • W16: 86.00% (−0.04pp WoW) — flat but turbulent underneath

The W14–W15 swing pattern (−1.82pp → +1.92pp) is unusually large. Such a near-perfect reversal often suggests operational/sampling causes rather than a true two-step quality change: e.g., a labeling guideline correction issued mid-W14 only fully took effect in W15.

Volume context: OMA pipeline grew +21% from W14 (4.66M) to W15 (5.65M). The bigger sample base may have stabilized noisy categories.

01

By market — top contributors

EMEA delivered 62% of the +1.92pp recovery, and it's broad — not single-market
EMEA aggregate accuracy rebounded 79.59% → 83.29% (+3.71pp). The contributors are spread: EN +9.90pp, MENA1 +5.90pp, SSA +3.49pp, ES +8.40pp. AMS added +24% (LATAM +4.39pp, CA +7.58pp, BR +2.19pp). APAC contributed +15%, with BD +5.44pp and PH +3.10pp leading.
During this recovery week, MX and PK regressed
MX dropped 88.85% → 82.78% (−6.07pp), the biggest single-market regression (drag −0.25pp). PK dropped 92.40% → 90.69% (−1.71pp on heavy weight, drag −0.17pp). Both are noteworthy because they went the opposite direction of the broad recovery.
Market | Acc W14 | Acc W15 | Δ Acc | Wt W14 | Wt W15 | Rate | Weight | Inter | Total | % of Δ
EN | 78.79% | 88.69% | +9.90 | 3.56% | 4.06% | +0.352 | −0.027 | +0.049 | +0.375 | −20%
LATAM | 81.15% | 85.54% | +4.39 | 7.92% | 8.79% | +0.348 | −0.026 | +0.038 | +0.360 | −19%
MENA1 | 80.89% | 86.79% | +5.90 | 5.82% | 5.60% | +0.343 | +0.007 | −0.013 | +0.337 | −18%
BD | 81.41% | 86.85% | +5.44 | 4.92% | 5.28% | +0.268 | −0.010 | +0.020 | +0.278 | −14%
CA | 73.60% | 81.18% | +7.58 | 2.69% | 2.31% | +0.204 | +0.039 | −0.028 | +0.215 | −11%
SSA | 75.94% | 79.43% | +3.49 | 3.66% | 3.37% | +0.128 | +0.024 | −0.010 | +0.141 | −7%
BR | 79.45% | 81.64% | +2.19 | 5.30% | 4.46% | +0.116 | +0.039 | −0.018 | +0.137 | −7%
PH | 84.34% | 87.45% | +3.10 | 3.82% | 4.15% | +0.119 | +0.007 | +0.004 | +0.130 | −7%
Top-8 positive subtotal | | | | | | | | | +1.972 | −103%
MX | 88.85% | 82.78% | −6.07 | 3.94% | 4.65% | −0.239 | +0.033 | −0.043 | −0.249 | +13%
PK | 92.40% | 90.69% | −1.71 | 7.18% | 6.43% | −0.123 | −0.062 | +0.013 | −0.172 | +9%
KR | 92.22% | 89.42% | −2.80 | 1.96% | 1.99% | −0.055 | +0.003 | −0.001 | −0.053 | +3%
Top-3 negative subtotal | | | | | | | | | −0.474 | +25%
EMEA aggregate +1.18pp (62% of global gain), AMS +0.46pp (24%), APAC +0.28pp (15%). The story is broad — multiple markets in each region rebounded simultaneously, suggesting a global moderation issue (likely policy interpretation or labeling) was resolved between W14 and W15.
EN — biggest single-market gainer (+9.90pp)

EN: clean rate-effect recovery

UK English market accuracy jumped 78.79% → 88.69%, a +9.90pp rebound on slightly growing share (3.56% → 4.06%). Rate effect (+0.352pp) dominates — virtually all of EN's contribution comes from genuine accuracy improvement, not mix shift.

Combined with EN's W14 collapse, this looks like a clean V-shape: something specific to English-language moderation broke in W14 and got fixed by W15.

MX — outlier regression during a recovery week

MX: against the tide

MX accuracy fell from 88.85% → 82.78%, while almost every other market rose. Weight also grew (3.94% → 4.65%, +0.71pp), so the additional volume entered a now-failing market — interaction effect (−0.043pp) compounds the damage.

Possible causes: a Mexico-specific moderation issue (Spanish-language LATAM policy interpretation) that didn't share the W14→W15 fix that helped most other markets. Worth checking against MX-specific policy/labeler rotation.

02

By policy title — top contributors

5 high-stakes policies drove 126% of the recovery
Combat Sports +8.79pp (contribution +0.59pp), Designated Dangerous Entities +17.43pp (+0.53pp), Violent Behaviors +25.70pp (+0.44pp), Alcohol +1.68pp on heavy weight contraction (+0.44pp), Highly Imitable Acts weight contraction (+0.42pp). Several of these were W14 collapses — W15 is the correction.
Hidden drags: Dangerous Trends collapsed −9.36pp during the recovery week
Dangerous Trends - Serious Harm went 77.45% → 68.09% on growing share (4.61%→4.83%) — a triple-negative pattern that cost −0.47pp. Adult Sexualized Behaviors dropped −3.67pp on growing weight (drag −0.39pp). Reference to Cannabis, Drugs dropped −18.59pp (drag −0.25pp). These are the policies that didn't share in the recovery.
Policy | Acc W14 | Acc W15 | Δ Acc | Wt W14 | Wt W15 | Rate | Weight | Inter | Total | % of Δ
Combat sports, Extreme Sports & Stunts | 66.24% | 75.02% | +8.79 | 5.37% | 4.04% | +0.472 | +0.237 | −0.116 | +0.592 | −31%
Designated Dangerous Entities | 50.75% | 68.18% | +17.43 | 2.34% | 1.59% | +0.408 | +0.252 | −0.132 | +0.529 | −28%
Violent Behaviors | 51.08% | 76.78% | +25.70 | 1.65% | 1.47% | +0.425 | +0.062 | −0.048 | +0.439 | −23%
Alcohol | 67.99% | 69.67% | +1.68 | 5.84% | 3.50% | +0.098 | +0.379 | −0.039 | +0.437 | −23%
Highly Imitable Acts | 46.45% | 40.09% | −6.36 | 3.00% | 1.61% | −0.191 | +0.524 | +0.088 | +0.421 | −22%
Personal Information - High Risk | 42.65% | 84.12% | +41.47 | 0.91% | 0.67% | +0.378 | +0.101 | −0.101 | +0.378 | −20%
Regulated Goods - Marketing/Trade | 33.32% | 47.96% | +14.64 | 1.61% | 1.44% | +0.236 | +0.088 | −0.025 | +0.298 | −16%
Tobacco and Nicotine (★ heaviest policy) | 76.07% | 79.04% | +2.97 | 11.06% | 12.23% | +0.328 | −0.094 | +0.035 | +0.269 | −14%
Top-8 positive subtotal | | | | | | | | | +3.363 | −175%
Dangerous Trends - Serious Harm | 77.45% | 68.09% | −9.36 | 4.61% | 4.83% | −0.432 | −0.015 | −0.021 | −0.467 | +24%
Adult Sexualized Behaviors | 58.55% | 54.88% | −3.67 | 5.06% | 5.77% | −0.186 | −0.180 | −0.026 | −0.391 | +20%
Reference to Cannabis, Drugs | 71.95% | 53.36% | −18.59 | 0.94% | 1.17% | −0.174 | −0.029 | −0.044 | −0.247 | +13%
Firearms & Explosive Weapons | 75.86% | 71.18% | −4.69 | 2.80% | 3.50% | −0.131 | −0.057 | −0.032 | −0.221 | +12%
Top-4 negative subtotal | | | | | | | | | −1.326 | +69%
Pattern: 4 of the 5 top gainers had collapsed in W14. Combat Sports went from 79% (W13) → 66% (W14) → 75% (W15). Violent Behaviors went 63%→51%→77%. Designated Dangerous Entities went ~? → 51% → 68%. This points to a W14 operational issue (likely policy interpretation or labeling) that got resolved. Confirm whether the structural cause was identified — if not, the recovery may not be sticky.
Combat Sports — biggest single contributor (+0.59pp)

Combat Sports: rate + weight both helped

Accuracy rose 66.24% → 75.02% (+8.79pp) AND share contracted 5.37% → 4.04% (−1.33pp). Both effects favorable: rate (+0.47pp) from accuracy improvement, weight (+0.24pp) from below-mean segment shrinking.

4-week trajectory: W13 ~80% → W14 66% → W15 75% → W16 ~83%. The category has a clear W14 trough that is recovering through W16. Suggests a labeling-guideline change for sports content was rolled back or refined.

Highly Imitable Acts — pure weight-effect contribution

Highly Imitable Acts: accuracy fell, but the weight contraction more than offset

Accuracy actually got worse: 46.45% → 40.09% (−6.36pp). But weight halved: 3.00% → 1.61%. Since this segment was deeply below the global mean (84%), removing it from the mix is a strong positive even though its quality regressed.

Net: weight effect +0.52pp dwarfs the rate damage −0.19pp, total contribution +0.42pp.

This is a textbook "good-because-it-shrunk-not-because-it-improved" case. The weight change might reflect a sampling/routing decision — verify it's not just a statistical artifact.
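
As a rough check, the rate, weight, and interaction figures quoted above for Combat Sports and Highly Imitable Acts can be recomputed from the rounded table values. This is a minimal sketch, assuming the W14 global mean is the 84.12% reported elsewhere on this page; the function and variable names are illustrative and not part of the reporting pipeline, and results match the table only to within rounding.

```python
from typing import NamedTuple


class Effects(NamedTuple):
    rate: float
    weight: float
    interaction: float

    @property
    def total(self) -> float:
        return self.rate + self.weight + self.interaction


def shift_share(wt_prev, wt_curr, acc_prev, acc_curr, mean_prev):
    """Decompose one segment's contribution. Inputs in percent, outputs in pp."""
    d_wt = (wt_curr - wt_prev) / 100.0   # change in share, as a fraction
    d_acc = acc_curr - acc_prev          # change in accuracy, in pp
    return Effects(
        rate=(wt_prev / 100.0) * d_acc,          # accuracy change at prior weight
        weight=d_wt * (acc_prev - mean_prev),    # mix shift relative to the prior mean
        interaction=d_wt * d_acc,                # joint change
    )


GLOBAL_ACC_W14 = 84.12  # assumption: global OMA in W14 as reported on this page

combat = shift_share(5.37, 4.04, 66.24, 75.02, GLOBAL_ACC_W14)
imitable = shift_share(3.00, 1.61, 46.45, 40.09, GLOBAL_ACC_W14)

for name, fx in [("Combat Sports", combat), ("Highly Imitable Acts", imitable)]:
    print(f"{name}: rate {fx.rate:+.3f}, weight {fx.weight:+.3f}, "
          f"interaction {fx.interaction:+.3f}, total {fx.total:+.3f}")
```

For Highly Imitable Acts the rate term is negative while the total is positive, which is the "shrunk, not improved" pattern described above.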

Dangerous Trends - Serious Harm — biggest single drag

Dangerous Trends: against the recovery

Accuracy fell from 77.45% → 68.09% (−9.36pp) on growing share (4.61% → 4.83%). All three effects negative: rate −0.43pp, weight −0.02pp, interaction −0.02pp.

4-week trajectory: W13 ~69% → W14 77% → W15 68% → W16 65%. The W14 figure looks like the outlier — W13/W15/W16 cluster around 65–70%. So Dangerous Trends didn't really regress in W15; rather, W14 was an anomaly that the W15 reading reverted from.

Implication: this category sits structurally low (~65–70%) and any single-week reading is volatile. The W15 "drag" is partly a methodology artifact of comparing against an unusually high W14 baseline.
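
One hedged way to see the baseline effect is to compare the W15 reading against the median of the other weeks instead of against W14 alone. The sketch below uses the figures quoted in this section (W13 is approximate, and W16 is taken as 65.22%). The helper names are hypothetical, and the median baseline is an illustrative choice, not the report's method. It is also retrospective, since it uses W16.

```python
from statistics import median

# Dangerous Trends - Serious Harm accuracy (%). W13 is approximate ("~69%").
trajectory = {"W13": 69.0, "W14": 77.45, "W15": 68.09, "W16": 65.22}


def delta_vs_previous(series, week, previous):
    """Week-over-week change in pp, as used in the shift-share table."""
    return series[week] - series[previous]


def delta_vs_median_of_others(series, week):
    """Change in pp against the median of all other weeks in the window."""
    others = [acc for wk, acc in series.items() if wk != week]
    return series[week] - median(others)


wow = delta_vs_previous(trajectory, "W15", "W14")
vs_median = delta_vs_median_of_others(trajectory, "W15")
print(f"W15 vs W14: {wow:+.2f}pp; W15 vs median of other weeks: {vs_median:+.2f}pp")
```

Against the median baseline, the W15 move is roughly a tenth of the week-over-week figure, which supports reading W14 as the outlier.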

03

Tobacco × Markets — where Tobacco's +2.97pp gain came from

EMEA · biggest Tobacco recovery
recovery
▲ multi-market
SSA, EN, UA, ES, MENA1 all improved 11–74pp
AMS · BR led
55.25%
▲ +42.14pp
BR jumped from 13% → 55% Tobacco accuracy
APAC · ID started declining
88.14%
▼ 8.10pp
ID Tobacco fell on growing share — early W16 warning
Tobacco's +2.97pp gain was EMEA-led — 5 EMEA markets each contributed 0.5–1.5pp within Tobacco
Top five contributors: BR (AMS) +42.14pp (+1.68pp), followed by four EMEA markets: SSA +56.38pp (+1.52pp), EN +11.37pp (+1.47pp), UA +21.70pp (+1.41pp), ES +73.80pp (+0.88pp). MENA1 (+0.55pp) is the fifth EMEA market. Multiple markets recovered Tobacco accuracy at the same time, consistent with a global Tobacco moderation issue being resolved.
Three markets foreshadowed the W16 Tobacco picture: the MENA2, LATAM, and ID regression patterns started here
MENA2 Tobacco fell 77.27% → 54.25% (drag −1.34pp) — would recover +31.96pp in W16. LATAM fell 100% → 48.89% (drag −0.99pp) — would also recover in W16. ID fell 96.24% → 88.14% (drag −0.80pp) — would continue falling in W16. The pattern foreshadowed the W16 reversal in some markets and continuation in others.
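| Market | Region | Acc W14 | Acc W15 | Δ Acc (pp) | Sh W14 | Sh W15 | Δ Sh (pp) | Total | % of Tob Δ | Sample (W14 → W15) |
|---|---|---|---|---|---|---|---|---|---|---|
| BR | AMS | 13.11% | 55.25% | +42.14 | 3.65% | 2.97% | −0.68 | +1.682 | +57% | 7 → 32 |
| SSA | EMEA | 26.77% | 83.14% | +56.38 | 2.90% | 1.31% | −1.59 | +1.524 | +51% | 8 → 15 |
| EN | EMEA | 54.04% | 65.42% | +11.37 | 11.38% | 9.74% | −1.64 | +1.469 | +50% | 20 → 39 |
| UA | EMEA | 69.92% | 91.61% | +21.70 | 3.42% | 7.72% | +4.30 | +1.411 | +48% | 18 → 83 |
| ES | EMEA | 0.00% | 73.80% | +73.80 | 1.18% | 0.82% | −0.36 | +0.875 | +29% | 6 → 16 |
| MENA1 | EMEA | 76.83% | 88.12% | +11.29 | 6.85% | 5.03% | −1.82 | +0.554 | +19% | 23 → 56 |
| PK | APAC | 82.69% | 92.17% | +9.48 | 9.32% | 6.10% | −3.22 | +0.365 | +12% | 45 → 103 |
| CA | AMS | 23.87% | 29.93% | +6.06 | 1.17% | 0.54% | −0.63 | +0.364 | +12% | 3 → 9 |
| Top-8 positive subtotal | | | | | | | | +8.245 | +278% | |
| MENA2 | EMEA | 77.27% | 54.25% | −23.02 | 13.21% | 5.40% | −7.81 | −1.338 | −45% | 23 → 27 |
| LATAM | AMS | 100.00% | 48.89% | −51.11 | 0.81% | 2.92% | +2.11 | −0.987 | −33% | 1 → 17 |
| ID | APAC | 96.24% | 88.14% | −8.10 | 14.22% | 17.17% | +2.95 | −0.798 | −27% | 23 → 92 |
| Top-3 negative subtotal | | | | | | | | −3.123 | −105% | |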
Pattern: the EMEA Tobacco recovery is broad: 5 markets each contributed 0.5–1.5pp. This points to a structural cause (e.g., a Tobacco-policy interpretation guidance updated globally between W14 and W15). The APAC piece is more ambiguous: PK improved while ID declined.
ID Tobacco — start of a 3-week decline

ID Tobacco: 96% → 88% in W15, then 88% → 84% in W16

ID Tobacco accuracy declined from 96.24% → 88.14% (−8.10pp) while ID's overall OMA was actually rising. Tobacco share also grew (14.22% → 17.17%, +2.95pp), so more cases entered a now-failing segment.

This was the start of a sustained ID Tobacco decline:

  • W13: 96.47%
  • W14: 96.24% (essentially flat)
  • W15: 88.14% (−8.10pp)
  • W16: 83.66% (−4.47pp)
  • Cumulative: −12.81pp from W13 baseline

The W15 drop was the inflection point. Whatever broke ID-Tobacco moderation appears to have started here.

04

Indonesia (ID) — policy-level breakdown

ID OMA · APAC
90.22%
▲ +0.58pp
small gain — ID was NOT a recovery driver
Top gainer: Youth Body Exposure - Sig & Mod
77.45%
▲ +31.18pp
other W15 gainers flipped in W16, but this one stayed strong
Youth Sexualized Behaviors
70.00%
▲ +28.86pp
would collapse in W16 (−22.34pp) — fragile gain
Tobacco in ID — early decline
90.47%
▼ 5.77pp
first dip of a 3-week decline that continues to W16
ID W15 is the mirror image of W16 — same policies, flipped direction
Several policies that collapsed in W16 had jumped up in W15: Youth Sexualized Behaviors (W15 +28.86pp → W16 −22.34pp) and Youth Body Exposure - Light (W15 +17.51pp → W16 −39.63pp). Adult Sexualized Behaviors is the exception: it fell in both weeks (−7.70pp in W15, then −16.86pp in W16). This volatility pattern with small samples (mostly < 30) suggests the policy-level ID data is noise-heavy. Look at the W14→W16 net change for a clearer signal.
04a

Methodology — shift-share applied to ID

Same framework as W16: each ID policy is decomposed against ID's overall accuracy mean (89.63% in W14).

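$\text{Rate effect} = \text{GWt}_{W14} \times (\text{Acc}_{W15} - \text{Acc}_{W14})$: pure accuracy change at prior weight (within ID)
$\text{Weight effect} = (\text{GWt}_{W15} - \text{GWt}_{W14}) \times (\text{Acc}_{W14} - \text{ID Acc}_{W14})$: mix shift relative to ID mean (89.63%)
$\text{Interaction} = (\text{GWt}_{W15} - \text{GWt}_{W14}) \times (\text{Acc}_{W15} - \text{Acc}_{W14})$: joint change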
+1.14pp
Sum of rate effects
net accuracy improvement
+9.13pp
Sum of weight effects
noisy mix changes; small samples inflate
−3.92pp
Sum of interactions
share & accuracy moving in opposite directions
Reconciliation note: shift-share sum is +6.35pp, ID's actual OMA Δ is +0.58pp
Same caveat as W16: this dataset's "Title Accuracy" diverges from OMA accuracy. The mismatch is even larger here because ID's policy-level samples are very small (most < 30). Use this for policy ranking, not absolute accounting.
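
The reconciliation gap can be made explicit with simple arithmetic. The sketch below only sums the three effect totals quoted in this section and compares them with ID's actual OMA change; variable names are illustrative.

```python
# Effect totals for ID, W14 -> W15, in pp (from the summary figures in this section).
effects_pp = {"rate": 1.14, "weight": 9.13, "interaction": -3.92}
actual_oma_delta_pp = 0.58  # ID OMA change reported for W15

shift_share_sum = sum(effects_pp.values())
residual = shift_share_sum - actual_oma_delta_pp

print(f"shift-share sum: {shift_share_sum:+.2f}pp")
print(f"actual OMA delta: {actual_oma_delta_pp:+.2f}pp")
print(f"unreconciled residual: {residual:+.2f}pp")
```

The residual is far larger than the actual OMA change itself, which is why the decomposition is better suited to ranking policies than to absolute accounting.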
04b

Policies dragging ID's accuracy in W15

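| Policy | Acc W14 | Acc W15 | Δ Acc (pp) | Wt W14 | Wt W15 | Δ Wt (pp) | Rate | Weight | Inter | Total | % of ID Δ | Sample (W14 → W15) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| High Risk Weight Loss & Muscle Gain | 100.00% | 0.00% | −100.00 | 1.22% | 1.37% | | −1.221 | +0.015 | −0.147 | −1.352 | −232% | 1 → 2 |
| Youth Regulated Goods and Services | 91.96% | 78.32% | −13.65 | 8.61% | 8.87% | | −1.175 | +0.006 | −0.036 | −1.205 | −207% | 10 → 37 |
| Tobacco and Nicotine | 96.24% | 90.47% | −5.77 | 21.02% | 26.33% | +5.31 | −1.213 | +0.351 | −0.306 | −1.168 | −200% | 23 → 91 |
| Frauds & Scams | 100.00% | 72.22% | −27.78 | 1.22% | 5.60% | +4.38 | −0.339 | +0.453 | −1.215 | −1.101 | −189% | 2 → 18 |
| Sexualized Animation & Illustration - Suggestive | 100.00% | 0.00% | −100.00 | 1.22% | 0.99% | | −1.221 | −0.023 | +0.226 | −1.018 | −175% | 1 → 2 |
| Youth Non-Sexualized Nudity | 94.67% | 81.80% | −12.87 | 4.52% | 8.36% | +3.84 | −0.581 | +0.194 | −0.495 | −0.882 | −151% | 27 → 169 |
| Adult Sexualized Behaviors | 66.08% | 58.38% | −7.70 | 2.47% | 4.16% | +1.69 | −0.191 | −0.397 | −0.130 | −0.717 | −123% | 3 → 25 |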
The credible drag in ID W15 is Tobacco (sample 23→91, accuracy −5.77pp); this is the start of the ID Tobacco decline that continues to W16. Youth Non-Sexualized Nudity also has decent samples (27→169) and shows a real drop. Most other "drags" start from W14 samples of 1–10 and are unreliable.
04c

Policies improving in ID — the W15 gains that flipped in W16

| Policy | Acc W14 | Acc W15 | Δ Acc (pp) | Wt W14 | Wt W15 | Rate | Weight | Inter | Total | % of ID Δ | Sample (W14 → W15) | W16 fate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Youth Body Exposure - Sig & Moderate | 46.27% | 77.45% | +31.18 | 8.89% | 5.19% | +2.771 | +1.606 | −1.154 | +3.223 | +553% | 23 → 60 | held strong (+14pp more) |
| Youth Sexualized Behaviors | 41.14% | 70.00% | +28.86 | 9.69% | 9.08% | +2.796 | +0.293 | −0.174 | +2.915 | +500% | 11 → 51 | collapsed (W16: −22.34pp) |
| Personal Information - High Risk | 20.82% | 50.00% | +29.18 | 5.86% | 0.62% | +1.710 | +3.606 | −1.529 | +3.788 | +650% | 3 → 2 | still volatile |
| Combat Sports, Extreme Sports, & Stunts | 33.33% | 98.86% | +65.53 | 0.64% | 3.46% | +0.422 | −1.585 | +1.844 | +0.682 | +117% | 6 → 7 | held at 100% |
| Youth Body Exposure - Light (4-17) | 65.51% | 83.02% | +17.51 | 2.47% | 0.44% | +0.433 | +0.491 | −0.357 | +0.568 | +97% | 3 → 12 | collapsed (W16: −39.63pp) |
Notice the W16 fate column: two of the W15 top ID gainers — Youth Sexualized Behaviors and Youth Body Exposure - Light — completely reversed in W16. This is the strongest evidence that ID's policy-level data is noise-heavy: real moderation quality doesn't oscillate ±25–30pp between weeks. Treat single-week ID-policy reads as directional, never quantitative.
ID W15 vs W16 — what stayed real, what was noise

Comparing ID across W15 and W16 reveals the noise floor

Real signals (sustained across W15 and W16):

  • Tobacco decline — started in W15 (96%→88%) and continued in W16 (88%→84%). Sample 23→91→28. Most credible structural finding.
  • Youth Body Exposure - Sig & Moderate gain — W14 46% → W15 77% → W16 91%. Sample 23→60→39. Sustained improvement.

Likely noise (reversed within one week):

  • Youth Sexualized Behaviors: W14 41% → W15 70% → W16 47%. Sample 11→51→32. ±25pp swings on samples this size are within noise.
  • Youth Body Exposure - Light: W14 66% → W15 83% → W16 43%. Sample 3→12→11. Pure bouncing.
  • Personal Information - High Risk: W14 21% → W15 50% → W16 100%. Samples 3, 2, 2.

Implication: ID-policy crossbreaks need either much larger samples or aggregation across weeks before they yield reliable signal. The market-level OMA (90.22% → 86.81%) is more trustworthy as a signal than any individual policy-level reading.
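
To put a rough number on the noise floor, a binomial confidence interval shows how wide the plausible range is at these sample sizes. This is a minimal sketch using the Wilson score interval, assuming each sampled case is an independent, equally weighted correct/incorrect trial (the report's accuracies may be weighted, so treat the intervals as indicative only). The W16 accuracy is derived as 70.00 − 22.34; function and variable names are illustrative.

```python
from math import sqrt


def wilson_interval(acc_pct, n, z=1.96):
    """Wilson score interval (in percent) for an accuracy observed on n cases."""
    p = acc_pct / 100.0
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return 100.0 * (center - half), 100.0 * (center + half)


# ID Youth Sexualized Behaviors: (week, accuracy %, sample size) from the text.
# W16 accuracy is derived as 70.00 - 22.34.
readings = [("W14", 41.14, 11), ("W15", 70.00, 51), ("W16", 47.66, 32)]

intervals = {week: wilson_interval(acc, n) for week, acc, n in readings}
for week, (low, high) in intervals.items():
    print(f"{week}: {low:.1f}% to {high:.1f}%")
```

Under these assumptions the W14 and W15 intervals overlap, which is consistent with reading the swing as noise. Pooling cases across weeks before computing the interval would narrow it.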

ID Tobacco — the W15 inflection point

ID Tobacco started declining in W15

4-week trajectory:

  • W13: 96.47% (sample 17)
  • W14: 96.24% (sample 23) — flat
  • W15: 88.14% (sample 91) — first material drop, share grew 14.22%→17.17%
  • W16: 83.66% (sample 28) — continued decline, share crashed back to 10.21%

The W15 drop is the most credible single reading because of the sample expansion (23→91). It's also the moment when ID Tobacco started diverging from global Tobacco, which was recovering in W15.

Whatever started breaking ID Tobacco moderation appears to have happened around the W14/W15 boundary. By W16 the share contraction had absorbed most of the rate damage, but the underlying quality issue persists.

ID 4-week OMA trajectory

Indonesia overall accuracy across W13–W16

  • W13: 92.79%
  • W14: 89.63% (−3.16pp)
  • W15: 90.22% (+0.58pp) — small recovery
  • W16: 86.81% (−3.41pp) — major drop

The W15 recovery was small and didn't restore the W13 baseline. By W16 ID had fallen to a 4-week low (cumulative −5.97pp from W13).

ID was barely participating in the W14→W15 global recovery (most other markets recovered 3–10pp; ID only +0.58pp). This was an early signal that ID had a more persistent issue than the policy-interpretation hiccup that affected most other markets in W14.

Recommended actions
1. Confirm what got fixed between W14 and W15. The recovery is too broad and too policy-specific (Combat Sports, Designated Dangerous Entities, Violent Behaviors all recovered) to be coincidence. Identify the W14 root cause and verify the W15 fix is structural; otherwise the categories will re-collapse.
2. Investigate Dangerous Trends - Serious Harm (−9.36pp during a recovery week). It went against the global tide, possibly a separate issue from whatever affected W14. W16 saw a further drop to 65.22%, so this isn't a one-week event.
3. Watch ID specifically: it barely participated in the recovery (+0.58pp vs global +1.92pp). The Tobacco decline in ID started here and worsened in W16. This was an early signal of ID's W16 problems.
4. MX regression (−6.07pp on growing weight): verify whether Mexico-specific Spanish-language LATAM moderation was excluded from the W14→W15 fix.
5. Don't over-celebrate the +1.92pp. 4 of the 5 top contributors had collapsed in W14, so this is largely a correction, not a structural improvement. Net 2-week change vs W13 baseline is only +0.10pp.
Priority matrix — what to verify before W17

Triage prioritization

P0 (root cause): Identify the W14 → W15 swing cause. Was it a labeling-guideline change, a moderator team rotation, a sampling pipeline correction? If unknown, the same issue could recur.

P1 (verification): Confirm 4-week trajectories for the top recovery policies (Combat Sports, Designated Dangerous Entities, Violent Behaviors) — are they sustained in W16? Most are, which is good news.

P1 (early warning): ID Tobacco decline started here. Track whether the W15 inflection persists into W17.

P2 (outliers): MX, PK, Dangerous Trends regressed during the recovery week. Investigate whether they share a common cause or are independent local issues.

P3 (data hygiene): Several extreme single-policy swings (Personal Information +41pp, Sexualized Animation −45pp) rest on samples below 80. Use directional reading, not magnitudes, until samples expand.

Global OMA W13
85.94%
baseline
starting point of the analyzed window
Status
analysis pending
no W12 data for shift-share comparison

W13 report not yet generated

W13 (Mar 28 – Apr 3) is the baseline reference for W14 analysis. A standalone W13 vs W12 RCA would require Overall Moderation Accuracy data for W12, which is not currently available.

Persistent 0% policies
15
stable
0% in every week they appear · ≥2 weeks
W14 OMA gain if excluded
85.15%
▲ +1.03pp
from 84.12% · 56,282 of 4.66M traffic (1.21%)
W15 OMA gain if excluded
86.68%
▲ +0.64pp
from 86.04% · 41,891 of 5.65M traffic (0.74%)
W16 OMA gain if excluded
86.89%
▲ +0.89pp
from 86.00% · 46,319 of 4.54M traffic (1.02%)

What this analysis is

A standalone view of policies that read exactly 0% accuracy in every week they appear in the W13–W16 window. Filter rules:

  • Must appear in at least 2 weeks across W13–W16 (single-week 0%-readings excluded as noise)
  • Must read 0% in every week of presence — any non-zero week disqualifies the policy
  • This isolates the structural measurement issue from week-specific noise
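
The filter rules above can be sketched as a small function. The input layout (policy name mapped to per-week accuracy, with absent weeks omitted) and the sample readings are hypothetical; only the two rules come from the text.

```python
def persistent_zero(policy_weeks, min_weeks=2):
    """Return policies that appear in >= min_weeks weeks and read 0% in all of them."""
    return sorted(
        name
        for name, by_week in policy_weeks.items()
        if len(by_week) >= min_weeks
        and all(acc == 0.0 for acc in by_week.values())
    )


# Hypothetical accuracy readings (%), keyed by week.
# A missing week means the policy did not appear that week.
sample = {
    "Policy A": {"W13": 0.0, "W14": 0.0, "W16": 0.0},  # kept: 3 weeks, all 0%
    "Policy B": {"W15": 0.0},                          # dropped: single week
    "Policy C": {"W13": 0.0, "W14": 12.5},             # dropped: one non-zero week
}

print(persistent_zero(sample))
```

Policies with mixed zero and non-zero weeks fall out at the second condition, matching the "any non-zero week disqualifies" rule.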
15 policies read 0% accuracy persistently — likely a measurement-pipeline artifact
These are predominantly CSAM/CSAE and high-severity categories: Adult Sexual Abuse, Youth Sexual Abuse - Depiction, Youth Physical Abuse, Suicide & NSSI - Highly Harmful, Graphic Content, Animal Abuse, Human Exploitation, etc. Real moderator behavior on these categories is overwhelmingly correct (auto-removed at high precision upstream). What lands in the OMA sample is the residual — borderline cases where conservative moderator judgment can be coded against an aggressive ground truth. Tiny samples (1–42 cases) make it easy for the entire policy reading to flip to 0%.
Case-grain math: how exclusion is computed at OMA level (production-weight basis)
The Overall Moderation Accuracy (OMA) table reports both an accuracy and a production traffic weight per week. We treat OMA as correct_cases / total_cases, where total_cases = W (the moderation weight) and correct_cases = acc × W. The 15 persistent-0 policies have weight values reported in the policy breakdown (a subset of OMA's total moderation traffic). Because acc=0 for these policies, they contribute 0 correct cases to the numerator. Excluding them: numerator unchanged, denominator shrinks by their weight. Result: a +0.6 to +1.1pp lift in OMA.
01

Week-by-week OMA-level impact (production weight)

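| Week | 0% policies | OMA total weight | 0%-policy weight | 0% share | OMA measured | OMA adjusted | Gain |
|---|---|---|---|---|---|---|---|
| W13 (Mar 28–Apr 3) | 11 | 4,790,672 | 59,727 | 1.25% | 85.94% | 87.03% | +1.09pp |
| W14 (Apr 4–10) | 12 | 4,661,402 | 56,282 | 1.21% | 84.12% | 85.15% | +1.03pp |
| W15 (Apr 11–17) | 13 | 5,645,117 | 41,891 | 0.74% | 86.04% | 86.68% | +0.64pp |
| W16 (Apr 18–24) | 15 | 4,544,936 | 46,319 | 1.02% | 86.00% | 86.89% | +0.89pp |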
How to read: "OMA total weight" is the production moderation traffic for that week. "0%-policy weight" is how much of that traffic falls into the 15 persistent-0 policies. Excluding them removes 0 correct cases from the numerator and the weight from the denominator — recovering 0.6–1.1pp of OMA.
Sample-base alternative view: if we use OMA evaluation sample size instead of production weight, 0%-policies represent 0.6–0.85% of weekly samples and yield slightly smaller gains: W13 +0.74pp, W14 +0.66pp, W15 +0.54pp, W16 +0.67pp. The two views bracket the realistic range; weight-based is the more direct measure of "moderation production quality."
02

The 15 persistent 0% policies

| Policy | W13 weight | W14 weight | W15 weight | W16 weight | Total weight |
|---|---|---|---|---|---|
| Adult Sexual Abuse | 10,951 | 11,956 | 9,669 | 9,091 | 41,667 |
| Youth Sexual Abuse - Depiction | 12,053 | 1,935 | 8,471 | 7,445 | 29,904 |
| Youth Physical Abuse, Assault & Neglect | 12,466 | 6,307 | 5,073 | 6,894 | 30,740 |
| Suicide & NSSI - Highly Harmful | 7,069 | 8,969 | 3,392 | 3,806 | 23,236 |
| Graphic Content | 2,306 | 7,965 | 4,026 | 5,249 | 19,546 |
| Animal Abuse & Graphic Content | 4,877 | 1,062 | 2,995 | 7,117 | 16,051 |
| Blood | 2,016 | 10,661 | 1,599 | 555 | 14,831 |
| Highly Harmful Adult Sexual Abuse - Visual Depiction | 772 | 352 | 1,855 | 2,446 | 5,425 |
| Human Exploitation - Risk | 4,257 | 3,025 | 2,281 | 1,716 | 11,279 |
| Youth Sexual Objectification & Fetish | | 508 | 294 | 612 | 1,414 |
| Youth Sexual Abuse - Facilitation and Trade | | | 1,202 | 532 | 1,734 |
| Youth Sexual Abuse - Promotion and Admission | | | 389 | 590 | 979 |
| Human Exploitation - Facilitation | 2,770 | 1,771 | 645 | 79 | 5,265 |
| ERT Transfer | | 1,771 | | 108 | 1,879 |
| Dangerous Misinformation - Policy Tag | 190 | | | 79 | 269 |
| Total (all 15 policies) | 59,727 | 56,282 | 41,891 | 46,319 | 204,219 |
All 15 policies above: appear in ≥2 weeks within W13–W16 AND read exactly 0% accuracy in every week they appear. Policies with mixed 0%/non-0% readings across weeks (e.g., Counterfeit Goods, Light Body Exposure, Personal Information - High Risk) are excluded as noise rather than structural artifacts.
03

Why these persist at 0%

The CSAM/CSAE measurement gap — likely root cause

Why these specific policies all read 0%

Most of the 15 policies share a property: they cover content that gets auto-removed by upstream classifiers with very high precision:

  • CSAM categories (Youth Sexual Abuse — multiple variants)
  • CSAE categories (Adult Sexual Abuse, Highly Harmful Adult Sexual Abuse)
  • Other extremely high-severity categories (Suicide & NSSI - Highly Harmful, Human Exploitation, Graphic Content)

Real-world moderator behavior on these categories is overwhelmingly correct, because the obvious cases never reach human review — they're auto-actioned. What ends up in the OHA/OMA sample is the residual: edge cases that escaped automated enforcement, where moderator judgment is genuinely difficult.

On these residuals:

  • The conservative moderator call (e.g., "approve as benign because content is borderline") may be coded against an aggressive ground truth that says "should remove."
  • Sample sizes are tiny (1–42 cases per week) so a small number of coding decisions flip the entire policy reading to 0%.
  • The pattern persists across weeks because the underlying sampling logic is stable.

Bottom line: 0% accuracy on these categories is almost certainly NOT a real moderator quality signal.

The math, in detail

Case-grain calculation, OMA-level

For each week, we have:

  • OMA accuracy = correct cases / total cases (from the headline number)
  • OMA total weight (production traffic) = total cases (from the OMA Subtotal row's Weight column)
  • 0%-policy weight = sum of weights for the 15 persistent-0 policies (from policy table)

Excluding 0%-policies:

  • Numerator change: 0%-policies have acc=0 so they contribute 0 × cases = 0 correct. Removing them doesn't change the numerator.
  • Denominator change: shrinks by the 0%-policy case count.
  • New OMA = original_numerator / (original_denominator − excluded_cases) = OMA × N / (N − excluded)

W16 example: OMA = 86.00% on 4,544,936 production weight. 0%-policies = 46,319 weight (acc=0 each). Numerator = 0.86 × 4,544,936 = 3,908,645 correct. Excluded numerator = 0. New = 3,908,645 / (4,544,936 − 46,319) = 3,908,645 / 4,498,617 = 86.89%. Gain = +0.89pp.

This gives a real OMA-level number, not a within-policy-table approximation.
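
The same calculation can be run for all four weeks. This is a sketch using the rounded headline OMA and the weights from the week-by-week table; because the inputs are rounded, results can differ from the published figures by about 0.01pp. Function and variable names are illustrative.

```python
def adjusted_oma(oma_pct, total_weight, excluded_weight):
    """OMA (%) after removing zero-accuracy cases from the denominator only."""
    correct = oma_pct / 100.0 * total_weight  # numerator is unchanged by the exclusion
    return 100.0 * correct / (total_weight - excluded_weight)


# (measured OMA %, OMA total weight, 0%-policy weight) per week, from the table.
weeks = {
    "W13": (85.94, 4_790_672, 59_727),
    "W14": (84.12, 4_661_402, 56_282),
    "W15": (86.04, 5_645_117, 41_891),
    "W16": (86.00, 4_544_936, 46_319),
}

adjusted = {wk: adjusted_oma(*vals) for wk, vals in weeks.items()}
for wk, (measured, _, _) in weeks.items():
    print(f"{wk}: {measured:.2f}% -> {adjusted[wk]:.2f}% "
          f"(gain {adjusted[wk] - measured:+.2f}pp)")
```

The gain depends only on the measured OMA and the excluded share of weight, so weeks with a smaller 0%-policy share (W15) show a smaller lift.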

Sample vs production-traffic view

Two ways to read "exclusion"

The page above uses the raw moderation weight (production traffic volume) as the case base.

An alternative is to use the OMA evaluation sample, the population that gets human-reviewed for OMA scoring, as the case base. This is the most direct reading of "headline OMA if these policies were excluded from sampling." The two diverge slightly because OMA samples are not perfectly proportional to production traffic.

Production-traffic basis (page default): W13 +1.09pp, W14 +1.03pp, W15 +0.64pp, W16 +0.89pp.
Sample-base alternative: W13 +0.74pp, W14 +0.66pp, W15 +0.54pp, W16 +0.67pp.

The production-traffic view answers: "if we removed the production volume from the moderation pipeline that becomes 0%-policy in OMA, what would OMA become?" The sample view answers: "if we removed those same cases from the OMA evaluation sample, what would the headline read?"

For most reporting purposes the sample view is the right one. For estimating production-quality impact (e.g., what would happen if upstream classification took over these categories entirely), the production-traffic view is more relevant.

Recommended actions

What to do about persistent 0%-policies

P0 — investigate the metric pipeline. Confirm the hypothesis that these 15 categories surface only their residual cases (not the auto-actioned majority). If true, the policy table's accuracy column for these categories is structurally biased toward 0%; it is a metric definition issue, not a moderator performance issue.

P1 — separate display. In future reports, segregate persistent-0 policies from the main accuracy breakdown. The headline number should still include them (because they're real moderation cases) but the per-policy ranking shouldn't surface them at the top of "drag" lists, where they create false alarms.

P2 — reframe the headline. Consider reporting both "OMA measured" and "OMA excluding persistent-0 categories" as paired numbers. The gap (0.5–0.7pp on the sample basis, 0.6–1.1pp on the production-weight basis) is stable enough across weeks that it could be a standing footnote.

P3 — sample the underlying population properly. If sampling logic restricts to residuals for these categories, expand sampling to include automated-action cases for ground-truth verification. This would let these categories report meaningful non-zero accuracies.