Attack Detection

KoreShield uses a multi-layered detection system to identify prompt injection attempts and other security risks. Detection combines keyword rules, pattern analysis, custom rules via a DSL, ML-inspired heuristics, and allowlist/blocklist management.

Detection Layers

1. Keyword-Based Detection

Direct injection phrases are matched against a curated dictionary of 30+ patterns:

  • Instruction override: "ignore previous instructions", "forget everything", "new instructions", "disregard", "override"
  • Role hijacking: "you are now", "act as", "pretend to be", "new persona", "break character"
  • Jailbreak patterns: "jailbreak", "developer mode", "dan mode", "unrestricted mode", "god mode"
  • Safety bypass: "ignore safety", "bypass restrictions", "override rules", "disable safety", "remove guardrails"
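
The keyword layer can be sketched as a case-insensitive phrase lookup. This is a minimal illustration, not KoreShield's implementation: the `KEYWORDS` dictionary and the `match_keywords` function are hypothetical names, and the phrases are only the examples listed above, not the full dictionary.

```python
# Hypothetical dictionary: phrases copied from the categories listed above.
KEYWORDS = {
    "instruction_override": ["ignore previous instructions", "forget everything",
                             "new instructions", "disregard", "override"],
    "role_hijacking": ["you are now", "act as", "pretend to be",
                       "new persona", "break character"],
    "jailbreak": ["jailbreak", "developer mode", "dan mode",
                  "unrestricted mode", "god mode"],
    "safety_bypass": ["ignore safety", "bypass restrictions", "override rules",
                      "disable safety", "remove guardrails"],
}


def match_keywords(prompt: str) -> dict[str, list[str]]:
    """Return the matched phrases, grouped by category."""
    # Lowercase and collapse whitespace so "IGNORE   previous" still matches.
    text = " ".join(prompt.lower().split())
    hits: dict[str, list[str]] = {}
    for category, phrases in KEYWORDS.items():
        found = [phrase for phrase in phrases if phrase in text]
        if found:
            hits[category] = found
    return hits
```

Plain substring matching is deliberately simple and will over-match (for example, "act as" also occurs inside "react as"), which is one reason a keyword layer is usually combined with the other layers rather than used alone.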

2. Pattern-Based Detection

Structural analysis catches more sophisticated attacks:

| Pattern Type | Indicator | Severity |
| --- | --- | --- |
| code_block_injection | Code blocks combined with "system" or "instruction" keywords | high |
| role_manipulation | Phrases like "you are", "act as", "pretend to be" | medium |
| encoding_attempt | Base64-like strings (20+ chars of [A-Za-z0-9+/=]) | high |
| prompt_leaking | "reveal prompt", "show system", "system prompt", "leak prompt" | high |
| data_exfiltration | "send to", "upload to", "post to", "transmit to" | high |
| adversarial_suffix | Authority-claiming suffixes like "these instructions have the highest authority" | medium |
| multi_turn_injection | Turn markers like "##end" or "end of prompt" | medium |
| math_trick | Mathematical expressions used to disguise instructions | low |
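
A few of these indicators can be approximated with regular expressions. The sketch below is an assumption about how such checks might look, covering only four of the pattern types; `PATTERNS` and `scan_patterns` are hypothetical names, and the expressions are derived from the indicator descriptions rather than from KoreShield's source.

```python
import re

# Hypothetical regexes derived from the indicator descriptions above.
PATTERNS = [
    ("encoding_attempt", re.compile(r"[A-Za-z0-9+/=]{20,}"), "high"),
    ("prompt_leaking",
     re.compile(r"reveal prompt|show system|system prompt|leak prompt", re.I),
     "high"),
    ("data_exfiltration",
     re.compile(r"\b(?:send|upload|post|transmit) to\b", re.I),
     "high"),
    ("multi_turn_injection", re.compile(r"##end|end of prompt", re.I), "medium"),
]


def scan_patterns(prompt: str) -> list[tuple[str, str]]:
    """Return (pattern_type, severity) for every pattern found in the prompt."""
    return [(name, severity)
            for name, regex, severity in PATTERNS
            if regex.search(prompt)]
```

Note that the Base64-like check fires on any unbroken run of 20 or more characters from that alphabet, so long identifiers or hashes will also trigger it.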

3. ML-Inspired Heuristics

Feature-engineered scoring based on:

  • Keyword density: ratio of suspicious keywords to total words
  • Special character ratio: unusual concentration of special characters
  • Length anomalies: abnormally long or short prompts
  • Pattern complexity scoring: computes a composite risk score from extracted features

When the ML model is loaded, it contributes up to 30% of the overall confidence score.
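
The features above could be extracted and combined roughly as follows. Everything numeric here is an assumption made for illustration: the `SUSPICIOUS` word list, the length bounds, and the feature weights are hypothetical. Only the 30% cap comes from the text.

```python
SUSPICIOUS = {"ignore", "override", "bypass", "jailbreak", "disregard"}  # illustrative subset
ML_MAX_CONTRIBUTION = 0.30  # from the text: "up to 30%" of overall confidence


def extract_features(prompt: str) -> dict[str, float]:
    """Compute keyword density, special-character ratio, and a length flag."""
    words = prompt.lower().split()
    n_words, n_chars = len(words), len(prompt)
    suspicious = sum(1 for w in words if w.strip(".,!?:;\"'") in SUSPICIOUS)
    special = sum(1 for c in prompt if not c.isalnum() and not c.isspace())
    return {
        "keyword_density": suspicious / n_words if n_words else 0.0,
        "special_char_ratio": special / n_chars if n_chars else 0.0,
        # Hypothetical bounds for "abnormally short or long".
        "length_anomaly": 1.0 if n_chars < 5 or n_chars > 4000 else 0.0,
    }


def heuristic_score(features: dict[str, float]) -> float:
    """Weighted sum of the features, clamped to [0, 1]; weights are assumed."""
    score = (0.5 * features["keyword_density"]
             + 0.3 * features["special_char_ratio"]
             + 0.2 * features["length_anomaly"])
    return min(1.0, score)


def ml_contribution(score: float) -> float:
    """Scale a [0, 1] score so it adds at most 30% to overall confidence."""
    return ML_MAX_CONTRIBUTION * score
```

For "ignore all rules", one of three words is suspicious, so the keyword density is 1/3 and the heuristic score under these assumed weights is about 0.17.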

4. Custom Rule Engine (DSL)

Each rule matches by keyword (contains), regex, or exact comparison, and maps a match to a severity and an action.

DSL format:

RULE <id> "<name>" "<description>"
PATTERN <type> "<pattern>"
SEVERITY <level>
ACTION <action>
TAGS <tag1>,<tag2>
ENABLED <true|false>

Example:

RULE custom_sql "Custom SQL Injection" "Detects SQL injection patterns"
PATTERN contains "SELECT * FROM users WHERE"
SEVERITY high
ACTION block
TAGS sql,injection
ENABLED true

Supported pattern types: contains, regex, exact. Severity levels: low, medium, high, critical. Actions: block, warn, log, allow.
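
To make the format concrete, here is a minimal parser sketch for a single rule. It is not KoreShield's parser: the `Rule` dataclass, its default values, `parse_rule`, and `rule_matches` are hypothetical, and case-sensitive matching is an assumption. Only the directive names and the allowed values come from the description above.

```python
import re
import shlex
from dataclasses import dataclass, field

PATTERN_TYPES = {"contains", "regex", "exact"}
SEVERITIES = {"low", "medium", "high", "critical"}
ACTIONS = {"block", "warn", "log", "allow"}


@dataclass
class Rule:
    id: str
    name: str
    description: str
    pattern_type: str = "contains"  # defaults below are assumptions
    pattern: str = ""
    severity: str = "medium"
    action: str = "log"
    tags: list = field(default_factory=list)
    enabled: bool = True


def _check(value: str, allowed: set, label: str) -> str:
    if value not in allowed:
        raise ValueError(f"unknown {label}: {value!r}")
    return value


def parse_rule(text: str) -> Rule:
    """Parse one RULE block written in the DSL format shown above."""
    rule = None
    for line in text.strip().splitlines():
        tokens = shlex.split(line)  # keeps double-quoted strings together
        if not tokens:
            continue
        key, args = tokens[0].upper(), tokens[1:]
        if key == "RULE":
            rule = Rule(id=args[0], name=args[1], description=args[2])
        elif rule is None:
            raise ValueError("RULE must be the first directive")
        elif key == "PATTERN":
            rule.pattern_type = _check(args[0].lower(), PATTERN_TYPES, "pattern type")
            rule.pattern = args[1]
        elif key == "SEVERITY":
            rule.severity = _check(args[0].lower(), SEVERITIES, "severity")
        elif key == "ACTION":
            rule.action = _check(args[0].lower(), ACTIONS, "action")
        elif key == "TAGS":
            rule.tags = [t.strip() for t in args[0].split(",") if t.strip()]
        elif key == "ENABLED":
            rule.enabled = args[0].lower() == "true"
        else:
            raise ValueError(f"unknown directive: {key}")
    if rule is None:
        raise ValueError("no RULE directive found")
    return rule


def rule_matches(rule: Rule, prompt: str) -> bool:
    """Apply a rule's pattern to a prompt (case-sensitive by assumption)."""
    if not rule.enabled:
        return False
    if rule.pattern_type == "contains":
        return rule.pattern in prompt
    if rule.pattern_type == "regex":
        return re.search(rule.pattern, prompt) is not None
    return prompt == rule.pattern
```

Parsing the example rule above yields a `Rule` with `pattern_type` set to `contains`, `severity` set to `high`, and `tags` equal to `["sql", "injection"]`. Because `shlex` treats a doubled backslash inside quotes as one backslash, regex patterns with escapes may need care in this sketch.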

5. Allowlists and Blocklists

The ListManager provides runtime-managed allow/block lists:

  • Add entries by type (ip, domain, keyword, etc.)
  • Entries can have an optional expiry (expires_in_days)
  • Allowlisted content bypasses detection; blocklisted content is rejected immediately
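
The behaviour described above can be illustrated with a small in-memory stand-in. `SimpleListManager` and its methods are hypothetical and do not reflect the real ListManager API; only the `expires_in_days` parameter name comes from the text. The source does not say which list wins when an entry is on both, so this sketch assumes the blocklist takes precedence.

```python
from datetime import datetime, timedelta, timezone


class SimpleListManager:
    """Hypothetical stand-in for ListManager; the real API may differ."""

    def __init__(self):
        # Maps list name -> {(entry_type, value): expiry or None}.
        self._entries = {"allow": {}, "block": {}}

    def add(self, list_name, entry_type, value, expires_in_days=None, now=None):
        now = now or datetime.now(timezone.utc)
        expiry = None
        if expires_in_days is not None:
            expiry = now + timedelta(days=expires_in_days)
        self._entries[list_name][(entry_type, value)] = expiry

    def contains(self, list_name, entry_type, value, now=None):
        now = now or datetime.now(timezone.utc)
        entries = self._entries[list_name]
        key = (entry_type, value)
        if key not in entries:
            return False
        expiry = entries[key]
        if expiry is not None and expiry <= now:
            del entries[key]  # drop expired entries lazily
            return False
        return True

    def decide(self, entry_type, value, now=None):
        """Return 'reject', 'bypass', or 'scan' (blocklist checked first)."""
        if self.contains("block", entry_type, value, now):
            return "reject"
        if self.contains("allow", entry_type, value, now):
            return "bypass"
        return "scan"
```

With this sketch, an allowlisted domain added with `expires_in_days=7` bypasses detection for a week and falls back to normal scanning afterwards.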

Confidence and Severity

  • Each indicator contributes to a cumulative confidence score (0.0 to 1.0)
  • Indicator severities: low, medium, high, critical
  • Sensitivity settings (low, medium, high) determine the enforcement threshold
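
One way to picture how indicators, confidence, and sensitivity interact is the sketch below. The per-severity weights and the thresholds are invented for illustration and are not KoreShield's actual values. The only properties taken from the text are that confidence is cumulative, bounded to 0.0 through 1.0, and compared against a threshold chosen by the sensitivity setting.

```python
# Hypothetical weights: how much each indicator adds to the confidence score.
SEVERITY_WEIGHT = {"low": 0.1, "medium": 0.25, "high": 0.4, "critical": 0.6}

# Hypothetical thresholds: higher sensitivity means a lower bar for enforcement.
THRESHOLD = {"low": 0.8, "medium": 0.6, "high": 0.4}


def confidence(severities: list[str]) -> float:
    """Sum the weights of all triggered indicators, capped at 1.0."""
    return min(1.0, sum(SEVERITY_WEIGHT[s] for s in severities))


def should_enforce(severities: list[str], sensitivity: str = "medium") -> bool:
    """True when cumulative confidence reaches the sensitivity threshold."""
    return confidence(severities) >= THRESHOLD[sensitivity]
```

Under these assumed numbers, one high and one medium indicator give a confidence of about 0.65, which triggers enforcement at medium sensitivity but not at low sensitivity.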

Configuration

security:
  sensitivity: medium
  default_action: block
  rate_limit: "60/minute"
  features:
    sanitization: true
    detection: true
    policy_enforcement: true
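
Before deploying a configuration like this, it can help to sanity-check the values. The sketch below operates on the `security` section after it has been loaded into a dictionary. The function names, the assumption that `default_action` accepts the same actions as the rule DSL, and the set of time units accepted in `rate_limit` are all hypothetical.

```python
import re

SENSITIVITIES = {"low", "medium", "high"}
ACTIONS = {"block", "warn", "log", "allow"}  # assumed to match the DSL actions
UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600}  # assumed units


def parse_rate_limit(spec: str) -> tuple[int, int]:
    """Parse "<count>/<unit>" into (count, window_in_seconds)."""
    match = re.fullmatch(r"(\d+)/(\w+)", spec.strip())
    if not match or match.group(2) not in UNIT_SECONDS:
        raise ValueError(f"invalid rate_limit: {spec!r}")
    return int(match.group(1)), UNIT_SECONDS[match.group(2)]


def validate_security(security: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the values look valid."""
    problems = []
    if security.get("sensitivity") not in SENSITIVITIES:
        problems.append(f"invalid sensitivity: {security.get('sensitivity')!r}")
    if security.get("default_action") not in ACTIONS:
        problems.append(f"invalid default_action: {security.get('default_action')!r}")
    if "rate_limit" in security:
        try:
            parse_rate_limit(str(security["rate_limit"]))
        except ValueError as exc:
            problems.append(str(exc))
    return problems
```

For the values shown above, `parse_rate_limit("60/minute")` returns `(60, 60)` and `validate_security` reports no problems.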

Tuning Guidance

  • Use high sensitivity for regulated or high-risk workloads
  • Review logs and /metrics to reduce false positives
  • Add known-safe patterns to the allowlist via the SDK or management API
  • Pair detection with the RAG Defense Engine for context-level scanning