Safety
Plexa has four layers of safety. They run in this order:Safety rules
Sync functions that examine a validated command and return{ allowed }. Add as many as you need. The first blocker wins.
safety_blocked and the dispatch is rejected.
Approval hook
A single async function that runs after safety. May returntrue, false, or a modified command.
approval_error and approval_rejected.
There is one approval hook per Space. Calling addApprovalHook again replaces the previous one.
Prompt injection sanitizer
The aggregator scrubs body-supplied strings before they reach the brain. Patterns it strips:- Role prefixes:
system:,user:,assistant:,human: - Chat template tokens:
<|im_start|>,<|im_end|>,<|endoftext|>, any<|...|> - Anthropic markers:
\n\nHuman:,\n\nAssistant: - Known directives:
ignore previous instructions,disregard previous,you are now,forget the above,new instructions:
[redacted]. Tool definitions are left alone (they are developer-authored and may legitimately use words like user).
Confidence gating
PatternStore and AdaptiveMemory return a confidence on every hit. The body forwards that confidence when it reports a local decision via notifyDecision. Plexa classifies it:
space.getStats().avgConfidenceByBody.
Best practices for hardware bodies
- Put a hard cap on actuator inputs in a reflex inside the body. The brain can ask for
magnitude: 1.5; the body should clamp before the motor sees it. - Add a safety rule for the dangerous tool names you do not want the brain calling at all.
- Use the approval hook for human-in-the-loop on irreversible actions (writing to a database, sending an email, firing a motor).
- Bind the introspection server (port 4747) to
127.0.0.1only. There is no auth. - Run with
space.installShutdownHandlers()so memory is saved onSIGINT/SIGTERM. Without this, an unclean stop loses the session.