Attack Detection
KoreShield uses a multi-layered detection system to identify prompt injection attempts and other security risks. Detection combines keyword rules, pattern analysis, custom rules via a DSL, ML-inspired heuristics, and allowlist/blocklist management.
Detection Layers
1. Keyword-Based Detection
Direct injection phrases are matched against a curated dictionary of 30+ patterns:
- Instruction override: "ignore previous instructions", "forget everything", "new instructions", "disregard", "override"
- Role hijacking: "you are now", "act as", "pretend to be", "new persona", "break character"
- Jailbreak patterns: "jailbreak", "developer mode", "dan mode", "unrestricted mode", "god mode"
- Safety bypass: "ignore safety", "bypass restrictions", "override rules", "disable safety", "remove guardrails"
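A minimal sketch of how dictionary matching of this kind could work, assuming case-insensitive substring comparison after whitespace normalization. The `KEYWORDS` table and the `match_keywords` function are hypothetical names used for illustration; they are not KoreShield's internal API, and the phrase lists are only the examples quoted above.

```python
# Illustrative subset: only the phrases quoted in this section.
KEYWORDS = {
    "instruction_override": ["ignore previous instructions", "forget everything",
                             "new instructions", "disregard", "override"],
    "role_hijacking": ["you are now", "act as", "pretend to be",
                       "new persona", "break character"],
    "jailbreak": ["jailbreak", "developer mode", "dan mode",
                  "unrestricted mode", "god mode"],
    "safety_bypass": ["ignore safety", "bypass restrictions", "override rules",
                      "disable safety", "remove guardrails"],
}


def match_keywords(prompt: str) -> dict:
    """Return the phrases found in the prompt, grouped by category."""
    # Assumption: matching is case-insensitive and ignores repeated whitespace.
    text = " ".join(prompt.lower().split())
    hits = {}
    for category, phrases in KEYWORDS.items():
        found = [phrase for phrase in phrases if phrase in text]
        if found:
            hits[category] = found
    return hits
```

Plain substring matching is deliberately broad: short entries such as "override" or "act as" also fire inside benign sentences, which is one reason keyword hits are usually combined with the other layers rather than enforced on their own.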
2. Pattern-Based Detection
Structural analysis catches more sophisticated attacks:
| Pattern Type | Indicator | Severity |
|---|---|---|
| code_block_injection | Code blocks combined with "system" or "instruction" keywords | high |
| role_manipulation | Phrases like "you are", "act as", "pretend to be" | medium |
| encoding_attempt | Base64-like strings (20+ chars of [A-Za-z0-9+/=]) | high |
| prompt_leaking | "reveal prompt", "show system", "system prompt", "leak prompt" | high |
| data_exfiltration | "send to", "upload to", "post to", "transmit to" | high |
| adversarial_suffix | Authority-claiming suffixes like "these instructions have the highest authority" | medium |
| multi_turn_injection | Turn markers like "##end" or "end of prompt" | medium |
| math_trick | Mathematical expressions used to disguise instructions | low |
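Several rows of the table translate naturally into regular expressions. The sketch below covers four of them; the regexes are reconstructed from the indicator descriptions and are assumptions, not the patterns KoreShield ships. `PATTERNS` and `detect_patterns` are hypothetical names.

```python
import re

# (pattern_type, regex, severity) for a subset of the table rows.
# The expressions are assumptions derived from the indicator column.
PATTERNS = [
    ("encoding_attempt", re.compile(r"[A-Za-z0-9+/=]{20,}"), "high"),
    ("prompt_leaking",
     re.compile(r"reveal prompt|show system|system prompt|leak prompt", re.I),
     "high"),
    ("data_exfiltration",
     re.compile(r"\b(?:send|upload|post|transmit) to\b", re.I),
     "high"),
    ("multi_turn_injection", re.compile(r"##end|end of prompt", re.I), "medium"),
]


def detect_patterns(prompt: str) -> list:
    """Return one indicator dict per pattern type that matches the prompt."""
    return [
        {"type": name, "severity": severity}
        for name, regex, severity in PATTERNS
        if regex.search(prompt)
    ]
```

Note that the Base64-like expression matches any run of 20 or more characters from that class, including long identifiers or ordinary long words, so this indicator is prone to false positives when used alone.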
3. ML-Inspired Heuristics
Feature-engineered scoring based on:
- Keyword density: ratio of suspicious keywords to total words
- Special character ratio: unusual concentration of special characters
- Length anomalies: abnormally long or short prompts
- Pattern complexity scoring: computes a composite risk score from extracted features
When the ML model is loaded, it contributes up to 30% of the overall confidence score.
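The feature list above can be sketched as a small scoring function. Everything numeric here except the 30% cap is an assumption made for illustration: the suspicious-word list, the length bounds, and the feature weights are hypothetical, and `extract_features` and `heuristic_contribution` are not KoreShield functions.

```python
import re

SUSPICIOUS = {"ignore", "override", "bypass", "jailbreak", "disregard"}  # assumed
WEIGHTS = {"keyword_density": 0.6, "special_char_ratio": 0.3, "length_anomaly": 0.1}  # assumed
MIN_LEN, MAX_LEN = 5, 4000  # assumed bounds for a "normal" prompt length
ML_MAX_CONTRIBUTION = 0.30  # from the text: up to 30% of overall confidence


def extract_features(prompt: str) -> dict:
    """Compute the three features named above, each in the range [0, 1]."""
    words = re.findall(r"[a-z']+", prompt.lower())
    total = len(prompt)
    special = sum(1 for c in prompt if not c.isalnum() and not c.isspace())
    return {
        "keyword_density": sum(w in SUSPICIOUS for w in words) / len(words) if words else 0.0,
        "special_char_ratio": special / total if total else 0.0,
        "length_anomaly": 1.0 if total < MIN_LEN or total > MAX_LEN else 0.0,
    }


def heuristic_contribution(prompt: str) -> float:
    """Weighted composite risk in [0, 1], scaled so it never exceeds the cap."""
    features = extract_features(prompt)
    risk = sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return ML_MAX_CONTRIBUTION * risk
```

Because the weights sum to 1.0 and each feature is bounded by 1.0, the returned value stays within 0.0 to 0.30 regardless of input.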
4. Custom Rule Engine (DSL)
Each custom rule matches by keyword (contains), regex, or exact comparison, and maps a match to a severity and an action.
DSL format:
RULE <id> "<name>" "<description>"
PATTERN <type> "<pattern>"
SEVERITY <level>
ACTION <action>
TAGS <tag1>,<tag2>
ENABLED <true|false>
Example:
RULE custom_sql "Custom SQL Injection" "Detects SQL injection patterns"
PATTERN contains "SELECT * FROM users WHERE"
SEVERITY high
ACTION block
TAGS sql,injection
ENABLED true
Supported pattern types: contains, regex, exact.
Severity levels: low, medium, high, critical.
Actions: block, warn, log, allow.
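To make the format concrete, here is a minimal parser sketch for a single rule block. It assumes one directive per line and double-quoted string fields, as in the example above. The `Rule` dataclass, `parse_rule`, `rule_matches`, the default values, and the case-insensitive `contains` behaviour are all assumptions for illustration, not the actual rule engine.

```python
import re
import shlex
from dataclasses import dataclass, field

PATTERN_TYPES = {"contains", "regex", "exact"}
SEVERITIES = {"low", "medium", "high", "critical"}
ACTIONS = {"block", "warn", "log", "allow"}


@dataclass
class Rule:
    id: str
    name: str
    description: str
    pattern_type: str = "contains"
    pattern: str = ""
    severity: str = "medium"  # default is an assumption
    action: str = "log"       # default is an assumption
    tags: list = field(default_factory=list)
    enabled: bool = True


def _expect(value: str, allowed: set, label: str) -> str:
    if value not in allowed:
        raise ValueError(f"unknown {label}: {value!r}")
    return value


def parse_rule(text: str) -> Rule:
    """Parse one RULE block; raises ValueError on unknown directives or values."""
    rule = None
    for line in text.strip().splitlines():
        tokens = shlex.split(line)  # keeps double-quoted fields as single tokens
        if not tokens:
            continue
        key, args = tokens[0].upper(), tokens[1:]
        if key == "RULE":
            if len(args) != 3:
                raise ValueError("RULE needs an id, a name, and a description")
            rule = Rule(*args)
        elif rule is None:
            raise ValueError("RULE must be the first directive")
        elif key == "PATTERN":
            rule.pattern_type = _expect(args[0].lower(), PATTERN_TYPES, "pattern type")
            rule.pattern = args[1]
        elif key == "SEVERITY":
            rule.severity = _expect(args[0].lower(), SEVERITIES, "severity")
        elif key == "ACTION":
            rule.action = _expect(args[0].lower(), ACTIONS, "action")
        elif key == "TAGS":
            rule.tags = [t for t in "".join(args).split(",") if t]
        elif key == "ENABLED":
            rule.enabled = args[0].lower() == "true"
        else:
            raise ValueError(f"unknown directive: {key}")
    if rule is None:
        raise ValueError("no RULE directive found")
    return rule


def rule_matches(rule: Rule, prompt: str) -> bool:
    """Apply the rule's pattern to a prompt; disabled rules never match."""
    if not rule.enabled:
        return False
    if rule.pattern_type == "contains":
        return rule.pattern.lower() in prompt.lower()  # assumed case-insensitive
    if rule.pattern_type == "regex":
        return re.search(rule.pattern, prompt) is not None
    return prompt.strip() == rule.pattern  # exact
```

Feeding the `custom_sql` example through `parse_rule` yields a rule with severity `high`, action `block`, and tags `["sql", "injection"]`; validating the enumerated values at parse time means a typo such as `SEVERITY hgih` fails early instead of silently producing a rule that never enforces.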
5. Allowlists and Blocklists
The ListManager provides runtime-managed allow/block lists:
- Add entries by type (ip, domain, keyword, etc.)
- Entries can have an optional expiry (expires_in_days)
- Allowlisted content bypasses detection; blocklisted content is rejected immediately
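The behaviour described above (typed entries, optional expiry, allow versus block) can be sketched as follows. `SimpleListManager` is a hypothetical stand-in written for this example and does not reproduce the real ListManager's methods or signatures; only the `expires_in_days` parameter name comes from the text. The choice that a blocklist entry wins over an allowlist entry for the same value is an assumption.

```python
import time

SECONDS_PER_DAY = 86400


class SimpleListManager:
    """Hypothetical in-memory allow/block lists with optional expiry."""

    def __init__(self, now=time.time):
        self._now = now  # injectable clock, which makes expiry easy to test
        # list name -> {(entry_type, value): expiry timestamp or None}
        self._entries = {"allow": {}, "block": {}}

    def add(self, list_name, entry_type, value, expires_in_days=None):
        expiry = None
        if expires_in_days is not None:
            expiry = self._now() + expires_in_days * SECONDS_PER_DAY
        self._entries[list_name][(entry_type, value.lower())] = expiry

    def _active(self, list_name, entry_type, value):
        entries = self._entries[list_name]
        key = (entry_type, value.lower())
        if key not in entries:
            return False
        expiry = entries[key]
        if expiry is not None and expiry <= self._now():
            del entries[key]  # lazily drop expired entries
            return False
        return True

    def check(self, entry_type, value):
        """Return 'block', 'allow', or None when the value is on neither list."""
        # Assumption: a blocklist hit takes precedence over an allowlist hit.
        if self._active("block", entry_type, value):
            return "block"
        if self._active("allow", entry_type, value):
            return "allow"
        return None
```

A caller would consult `check` before running detection: `"allow"` skips the detection layers, `"block"` rejects the request immediately, and `None` falls through to normal scanning.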
Confidence and Severity
- Each indicator contributes to a cumulative confidence score (0.0 to 1.0)
- Indicator severities: low, medium, high, critical
- Sensitivity settings (low, medium, high) determine the enforcement threshold
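One way to picture the relationship between indicators, confidence, and sensitivity is the sketch below. The per-severity weights and the per-sensitivity thresholds are invented for illustration; the source states only that indicators accumulate into a score between 0.0 and 1.0 and that sensitivity sets the enforcement threshold. The sketch assumes that higher sensitivity means a lower threshold.

```python
# Assumed weights: how much one indicator of each severity adds to the score.
SEVERITY_WEIGHT = {"low": 0.1, "medium": 0.25, "high": 0.4, "critical": 0.6}

# Assumed thresholds: higher sensitivity enforces at a lower confidence.
THRESHOLD = {"low": 0.8, "medium": 0.5, "high": 0.3}


def confidence(severities) -> float:
    """Sum the indicator weights and clamp the result to the 0.0-1.0 range."""
    return min(1.0, sum(SEVERITY_WEIGHT[s] for s in severities))


def should_enforce(severities, sensitivity="medium") -> bool:
    """True when the cumulative confidence reaches the sensitivity threshold."""
    return confidence(severities) >= THRESHOLD[sensitivity]
```

Under these assumed numbers, a single high-severity indicator triggers enforcement at high sensitivity but not at medium or low, which illustrates why raising sensitivity increases both coverage and the false-positive rate.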
Configuration
security:
  sensitivity: medium
  default_action: block
  rate_limit: "60/minute"
  features:
    sanitization: true
    detection: true
    policy_enforcement: true
Tuning Guidance
- Use high sensitivity for regulated or high-risk workloads
- Review logs and /metrics to reduce false positives
- Add known-safe patterns to the allowlist via the SDK or management API
- Pair detection with the RAG Defense Engine for context-level scanning