How it works

Blify Mod doesn't keyword-match. Every message eligible for moderation goes through a multi-stage pipeline designed to be cheap on easy cases and thorough on hard ones.

#The pipeline

   message
      │
      ▼
┌─────────────┐   trivially safe / rate-limited / immune
│  pre-filter ├─────────────────────────────────────────►  skip
└─────┬───────┘
      │
      ▼
┌─────────────┐   "no" — clearly doesn't match any rule
│   screen    ├─────────────────────────────────────────►  skip
└─────┬───────┘
      │  "yes — worth a closer look"
      ▼
┌─────────────┐
│    judge    │   full reasoning against the rule list
└─────┬───────┘
      │  context_sufficient = false
      ▼
┌─────────────┐
│  reconsider │   pulls replied-to + prior author messages
└─────┬───────┘
      │
      ▼
   apply action  +  log  +  DM the offender

#1. Pre-filter

Before any AI call, the bot drops:

  • Trivial content (very short, mostly emoji, etc.)
  • Messages from users currently inside a per-user screen rate-limit window
  • Messages from immune users or users with an immune role
  • Messages already flagged by Discord's native AutoMod (no double-action)
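
As a rough illustration, the four checks above could be arranged as a single function that returns the reason for skipping, or nothing if the message should go on to the screen. This is a minimal sketch, not the bot's actual code: the `Message` and `ServerConfig` types, the `MIN_WORD_CHARS` threshold, and the word-character heuristic for "mostly emoji" are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical threshold; the docs only say "very short, mostly emoji".
MIN_WORD_CHARS = 4


@dataclass
class Message:
    content: str
    author_id: int
    author_role_ids: set[int] = field(default_factory=set)
    flagged_by_automod: bool = False


@dataclass
class ServerConfig:
    immune_user_ids: set[int] = field(default_factory=set)
    immune_role_ids: set[int] = field(default_factory=set)


def is_trivial(text: str) -> bool:
    """Crude heuristic: too few letters/digits left once emoji and punctuation are ignored."""
    return len(re.findall(r"\w", text)) < MIN_WORD_CHARS


def prefilter_skip_reason(
    msg: Message, cfg: ServerConfig, rate_limited_user_ids: set[int]
) -> Optional[str]:
    """Return why the message is skipped, or None if it should reach the screen."""
    if is_trivial(msg.content):
        return "trivial"
    if msg.author_id in rate_limited_user_ids:
        return "rate_limited"
    if msg.author_id in cfg.immune_user_ids or msg.author_role_ids & cfg.immune_role_ids:
        return "immune"
    if msg.flagged_by_automod:
        return "automod"
    return None
```

All four checks are local lookups, which is why this stage can run before any model call.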

#2. Screen

A tiny, fast model is asked one yes/no question: "Could this plausibly violate one of THIS server's listed rules?" On every tier this is OpenAI's gpt-5-nano with a Claude Haiku 4.5 fallback if OpenAI is rate-limiting. The screen is a binary gate — nano is plenty for it, and the savings keep the bot affordable. The actual judgment is where tier matters.

The screen is intentionally conservative — it answers "yes" on anything ambiguous so the judge gets the final call.
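
The gate-plus-fallback behaviour can be sketched as below. The two model calls are passed in as plain functions so the example stays self-contained; `ProviderUnavailable`, `ScreenFn`, and `screen` are hypothetical names for illustration, not part of any real SDK or of the bot's code.

```python
from typing import Callable


class ProviderUnavailable(Exception):
    """Hypothetical error a provider wrapper raises on rate limits or outages."""


# A screen function takes (message_text, rules) and returns the model's raw answer.
ScreenFn = Callable[[str, list[str]], str]


def screen(message: str, rules: list[str], primary: ScreenFn, fallback: ScreenFn) -> bool:
    """Return True if the message should be passed to the judge."""
    try:
        answer = primary(message, rules)
    except ProviderUnavailable:
        # Primary provider is rate-limiting or down; ask the fallback model instead.
        answer = fallback(message, rules)
    # Conservative gate: only an explicit "no" skips the judge.
    # Anything ambiguous or malformed counts as "yes".
    return answer.strip().lower() != "no"
```

Treating every answer other than "no" as "yes" is one way to get the conservative behaviour described above: a confused screen costs an extra judge call rather than a missed violation.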

#3. Judge

If the screen says "yes", the bot calls the judge model with the full system prompt: your rule list, the target message, and a slice of channel context (8–20 messages on Free, 15–100 on Pro). The model returns structured JSON:

{
  "violation": true,
  "rule_broken": "No harassment or hate speech",
  "severity": 4,
  "action": "timeout",
  "timeout_minutes": 60,
  "reason": "Targeted slur at another member.",
  "context_sufficient": true
}

Which model acts as the judge depends on your tier:

  • Free uses OpenAI gpt-5-mini, which is fast, cheap, and accurate on the vast majority of cases. Claude Haiku 4.5 is the automatic fallback if OpenAI is unavailable.
  • Pro/Premium uses Claude Sonnet 4.6, which is noticeably better on subtle context, sarcasm, and adversarial framing. OpenAI gpt-5-mini is the automatic fallback if Anthropic is unavailable.
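
For readers who want to work with verdicts of this shape, here is a minimal parsing sketch. The field names come from the JSON example above; the `Verdict` class, the `parse_verdict` helper, the default values, and the `"warn"` action string are assumptions (the page names warnings, timeouts, kicks, bans, and `"none"`, but does not list the exact strings).

```python
import json
from dataclasses import dataclass
from typing import Optional

# Assumed action vocabulary; "warn" in particular is a guess at the literal string.
ALLOWED_ACTIONS = {"none", "warn", "timeout", "kick", "ban"}


@dataclass
class Verdict:
    violation: bool
    rule_broken: Optional[str]
    severity: int
    action: str
    timeout_minutes: Optional[int]
    reason: str
    context_sufficient: bool


def parse_verdict(raw: str) -> Verdict:
    """Parse the judge's JSON reply, rejecting actions outside the known set."""
    data = json.loads(raw)
    action = data.get("action", "none")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unexpected action: {action!r}")
    return Verdict(
        violation=bool(data["violation"]),
        rule_broken=data.get("rule_broken"),
        severity=int(data.get("severity", 0)),
        action=action,
        timeout_minutes=data.get("timeout_minutes"),
        reason=data.get("reason", ""),
        # In this sketch a missing flag is treated as "context was sufficient".
        context_sufficient=bool(data.get("context_sufficient", True)),
    )
```

Validating the action against a fixed set means a malformed or manipulated reply fails loudly instead of being applied.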

#4. Reconsider (only when needed)

If the model says context_sufficient: false — i.e. "I can't be sure without seeing what this was a reply to" — the bot pulls the replied-to message and the user's prior consecutive messages, then re-runs the judge with that targeted context. This is what lets the bot correctly handle:

  • Sarcasm: "yeah do it" is harmless after a joke but an endorsement after a threat.
  • Callbacks: "same" referring to a slur used three messages back.
  • Ongoing thoughts: a single sentence that's part of a longer thought the author split across several messages.
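
The reconsider step amounts to "judge once, and if the verdict reports insufficient context, gather targeted context and judge again". The sketch below shows that control flow under stated assumptions: `Msg`, `prior_consecutive`, and `judge_with_reconsider` are hypothetical names, the judge is passed in as a function returning a dict, and "prior consecutive messages" is interpreted as the author's unbroken run of messages immediately before the target.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Msg:
    id: int
    author_id: int
    content: str
    reply_to_id: Optional[int] = None


def prior_consecutive(history: list[Msg], target: Msg) -> list[Msg]:
    """Collect the author's unbroken run of messages immediately before the target."""
    idx = next(i for i, m in enumerate(history) if m.id == target.id)
    run: list[Msg] = []
    for m in reversed(history[:idx]):
        if m.author_id != target.author_id:
            break  # another author spoke, so the run ends here
        run.append(m)
    return list(reversed(run))  # restore chronological order


def judge_with_reconsider(
    target: Msg,
    history: list[Msg],
    judge: Callable[[Msg, list[Msg]], dict],
) -> dict:
    """Run the judge; re-run once with targeted context if it asks for more."""
    verdict = judge(target, [])
    if verdict.get("context_sufficient", True):
        return verdict
    extra: list[Msg] = []
    if target.reply_to_id is not None:
        extra += [m for m in history if m.id == target.reply_to_id]
    extra += prior_consecutive(history, target)
    return judge(target, extra)
```

The second call is made at most once here, so a model that keeps reporting insufficient context cannot cause a loop.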

#5. Apply, log, and DM

The action is applied (or skipped, if dry-run is on), a log embed is posted to your log channel, and the offender gets a DM explaining what happened, the rule, and the appeal link if you've set one.

#Strictness caps

After the model decides on an action, the configured strictness level can downgrade it before it's applied. See /setstrictness:

Level     What gets capped
Low       Timeouts, kicks, and bans are all converted to warnings
Medium    Kicks and bans are converted to timeouts
High      No cap; full action ladder (default)

Strictness only ever softens the bot's decision; it never escalates one.
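
One way to express the capping rule is as a ceiling on an ordered action ladder. This is a sketch only: the action strings (especially `"warn"`), the lowercase level names, and the `apply_strictness` helper are assumptions for illustration.

```python
# Actions ordered from mildest to harshest. The exact strings are assumed.
ACTION_LADDER = ["none", "warn", "timeout", "kick", "ban"]

# The harshest action each strictness level allows.
STRICTNESS_CEILING = {
    "low": "warn",        # timeouts, kicks, and bans become warnings
    "medium": "timeout",  # kicks and bans become timeouts
    "high": "ban",        # no cap
}


def apply_strictness(action: str, level: str) -> str:
    """Downgrade an action to the level's ceiling; never escalate it."""
    ceiling = STRICTNESS_CEILING[level]
    if ACTION_LADDER.index(action) > ACTION_LADDER.index(ceiling):
        return ceiling
    return action
```

For example, `apply_strictness("ban", "medium")` returns `"timeout"`, while `apply_strictness("warn", "medium")` is left as `"warn"`, since the cap only ever softens.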

#What the bot will not do

  • Enforce rules you didn't list. If a message contains content you find awful but no listed rule covers it, the bot returns action: "none". Servers choose what to moderate.
  • Follow instructions inside messages. Every user message is sent inside a fenced "untrusted data" block, and the system prompt explicitly forbids treating message content as instructions. Prompt-injection attempts ("ignore previous instructions", fake JSON, fake admin claims) are themselves treated as evidence of bad-faith behavior.
  • Action the server owner. Discord doesn't allow it.
  • Action anyone whose top role is at or above the bot's own top role. If that blocks moderation you want, move the bot's role higher in the role list.
  • Punish the user who @mentioned it. Manual scans only judge the content being scanned.

#Caching and deduplication

  • Cheap-judge cache: a message whose text and rule set both match a recent "clean" verdict reuses that verdict for a few minutes, so re-pasted text doesn't trigger repeated full-judge calls.
  • AutoMod dedup: if Discord's native AutoMod has already flagged a message, Blify Mod skips it.
  • Per-user rate limits: rapid-fire messages from a single author share a screen result for a short window.
  • Per-server mention cooldown: manual @mention scans are limited to one every 5 minutes per server.
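
A short-lived cache of "clean" verdicts, keyed on message text plus rule set, could look like the sketch below. The class name, the 300-second default (the page only says "a few minutes"), and the choice to sort the rules so that their order doesn't affect the key are all assumptions.

```python
import hashlib
import time
from typing import Callable


class CleanVerdictCache:
    """Remembers (text, rule set) pairs recently judged clean, for a limited time."""

    def __init__(self, ttl_seconds: float = 300.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires: dict[str, float] = {}

    @staticmethod
    def _key(text: str, rules: list[str]) -> str:
        h = hashlib.sha256()
        h.update(text.encode("utf-8"))
        for rule in sorted(rules):  # rule order should not change the key
            h.update(b"\x00" + rule.encode("utf-8"))
        return h.hexdigest()

    def mark_clean(self, text: str, rules: list[str]) -> None:
        self._expires[self._key(text, rules)] = self._clock() + self._ttl

    def is_known_clean(self, text: str, rules: list[str]) -> bool:
        key = self._key(text, rules)
        expiry = self._expires.get(key)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._expires[key]  # entry is stale; drop it
            return False
        return True
```

Including the rule set in the key means that editing a server's rules naturally invalidates older verdicts, because the same text then hashes to a different key.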