Attack Detection

KoreShield uses a multi-layered detection system to identify prompt injection attempts and other security risks. Detection combines keyword rules, pattern analysis, custom rules via a DSL, ML-inspired heuristics, and allowlist/blocklist management.

Detection Layers

1. Keyword-Based Detection

Direct injection phrases are matched against a curated dictionary of 30+ patterns:

Instruction override: "ignore previous instructions", "forget everything", "new instructions", "disregard", "override"
Role hijacking: "you are now", "act as", "pretend to be", "new persona", "break character"
Jailbreak patterns: "jailbreak", "developer mode", "dan mode", "unrestricted mode", "god mode"
Safety bypass: "ignore safety", "bypass restrictions", "override rules", "disable safety", "remove guardrails"

2. Pattern-Based Detection

Structural analysis catches more sophisticated attacks:

Pattern Type	Indicator	Severity
`code_block_injection`	Code blocks combined with "system" or "instruction" keywords	high
`role_manipulation`	Phrases like "you are", "act as", "pretend to be"	medium
`encoding_attempt`	Base64-like strings (20+ chars of `[A-Za-z0-9+/=]`)	high
`prompt_leaking`	"reveal prompt", "show system", "system prompt", "leak prompt"	high
`data_exfiltration`	"send to", "upload to", "post to", "transmit to"	high
`adversarial_suffix`	Authority-claiming suffixes like "these instructions have the highest authority"	medium
`multi_turn_injection`	Turn markers like "##end" or "end of prompt"	medium
`math_trick`	Mathematical expressions used to disguise instructions	low

3. ML-Inspired Heuristics

Feature-engineered scoring based on:

Keyword density: ratio of suspicious keywords to total words
Special character ratio: unusual concentration of special characters
Length anomalies: abnormally long or short prompts
Pattern complexity scoring: computes a composite risk score from extracted features

When the ML model is loaded, it contributes up to 30% of the overall confidence score.

4. Custom Rule Engine (DSL)

Rules support keyword (contains), regex, or exact matching and map to severity and action.

DSL format:

RULE <id> "<name>" "<description>"
PATTERN <type> "<pattern>"
SEVERITY <level>
ACTION <action>
TAGS <tag1>,<tag2>
ENABLED <true|false>

Example:

RULE custom_sql "Custom SQL Injection" "Detects SQL injection patterns"
PATTERN contains "SELECT * FROM users WHERE"
SEVERITY high
ACTION block
TAGS sql,injection
ENABLED true

Supported pattern types: contains, regex, exact. Severity levels: low, medium, high, critical. Actions: block, warn, log, allow.

5. Allowlists and Blocklists

The ListManager provides runtime-managed allow/block lists:

Add entries by type (ip, domain, keyword, etc.)
Entries can have an optional expiry (expires_in_days)
Allowlisted content bypasses detection; blocklisted content is rejected immediately

Confidence and Severity

Each indicator contributes to a cumulative confidence score (0.0 to 1.0)
Indicator severities: low, medium, high, critical
Sensitivity settings (low, medium, high) determine the enforcement threshold

Configuration

security:
  sensitivity: medium
  default_action: block
  rate_limit: "60/minute"
  features:
    sanitization: true
    detection: true
    policy_enforcement: true

Tuning Guidance

Use high sensitivity for regulated or high-risk workloads
Review logs and /metrics to reduce false positives
Add known-safe patterns to the allowlist via the SDK or management API
Pair detection with the RAG Defense Engine for context-level scanning

Detection Layers​

1. Keyword-Based Detection​

2. Pattern-Based Detection​

3. ML-Inspired Heuristics​

4. Custom Rule Engine (DSL)​

5. Allowlists and Blocklists​

Confidence and Severity​

Configuration​

Tuning Guidance​