RPC and EventBus Architecture

This document describes the RPC (Remote Procedure Call) and EventBus architecture that enables multi-collector coordination and lifecycle management in DaemonEye.

Overview

DaemonEye uses a dual-protocol architecture for inter-component communication:

IPC Protocol: Protobuf over Unix sockets/named pipes for CLI-to-agent communication
EventBus Protocol: Topic-based pub/sub messaging for collector-to-agent coordination

The daemoneye-agent runs an embedded EventBus broker that provides:

Topic-based message routing with wildcard patterns
RPC patterns for collector lifecycle management
Correlation metadata for distributed tracing
Health monitoring and metrics collection

Architecture Diagram

flowchart TB
    subgraph "daemoneye-agent Process"
        AGENT[Agent Core]
        BROKER[Embedded EventBus Broker]
        RPC[RPC Service]
        IPC[IPC Server]
        PM[Process Manager]

        AGENT --> BROKER
        AGENT --> RPC
        AGENT --> IPC
        AGENT --> PM
        RPC --> BROKER
        PM --> RPC
    end

    subgraph "Collector Processes"
        PROC[procmond]
        NET[netmond]
        FS[fsmond]
    end

    subgraph "CLI Process"
        CLI[daemoneye-cli]
    end

    PROC -->|EventBus Topics| BROKER
    NET -->|EventBus Topics| BROKER
    FS -->|EventBus Topics| BROKER

    CLI -->|IPC Protobuf| IPC

    PM -->|RPC Lifecycle| PROC
    PM -->|RPC Lifecycle| NET
    PM -->|RPC Lifecycle| FS

EventBus Topic Hierarchy

The EventBus uses a hierarchical topic structure for message routing:

Event Topics

events.process.* - Process monitoring events
- events.process.lifecycle - Process start/stop events
- events.process.metadata - Process metadata updates
- events.process.tree - Process tree relationships
- events.process.integrity - Executable integrity events
- events.process.anomaly - Anomaly detection events
- events.process.batch - Batch process updates
events.network.* - Network monitoring events (future)
events.filesystem.* - Filesystem monitoring events (future)
events.performance.* - Performance monitoring events (future)

Control Topics

control.collector.* - Collector management
- control.collector.lifecycle - Start/stop/restart commands
- control.collector.config - Configuration updates
- control.collector.task - Task distribution
control.health.* - Health monitoring
- control.health.heartbeat - Heartbeat messages
- control.health.status - Status reports
- control.health.diagnostics - Diagnostic information

Wildcard Patterns

The EventBus supports two wildcard patterns:

+: Single-level wildcard; matches exactly one topic segment (e.g., events.process.+ matches events.process.lifecycle)
#: Multi-level wildcard; matches all remaining topic segments (e.g., events.# matches all event topics)
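The two wildcards can be illustrated with a small matcher. This is a minimal sketch, not the broker's actual routing code: the function name `topic_matches` is hypothetical, and it assumes that `#` is only valid as the final pattern segment and that it also matches zero remaining segments.

```rust
/// Returns true if `topic` matches `pattern`.
/// Assumed semantics (not confirmed by the source): `+` matches exactly one
/// segment; `#` matches any remaining segments, including none, and is
/// honored only as the last segment of the pattern.
pub fn topic_matches(pattern: &str, topic: &str) -> bool {
    let mut pattern_segments = pattern.split('.');
    let mut topic_segments = topic.split('.');
    loop {
        match (pattern_segments.next(), topic_segments.next()) {
            // `#` swallows the rest of the topic, but only in final position.
            (Some("#"), _) => return pattern_segments.next().is_none(),
            // `+` consumes exactly one topic segment, whatever its value.
            (Some("+"), Some(_)) => continue,
            // Literal segments must be identical.
            (Some(p), Some(t)) if p == t => continue,
            // Both sides exhausted at the same time: full match.
            (None, None) => return true,
            // Length mismatch or differing literal segment.
            _ => return false,
        }
    }
}
```

Under these assumptions, `events.process.+` matches `events.process.lifecycle` but not `events.process`, because `+` requires a segment to be present.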

RPC Patterns

Collector Lifecycle Operations

The RPC service provides structured request/response patterns for collector management:

Start Collector

use daemoneye_eventbus::rpc::{
    CollectorLifecycleRequest, CollectorOperation, CollectorRpcClient, RpcRequest,
};
use std::time::Duration;

// Build a Start request addressed to procmond's control topic.
let request = RpcRequest::lifecycle(
    "agent-id".to_string(),
    "control.collector.procmond".to_string(),
    CollectorOperation::Start,
    CollectorLifecycleRequest::start("procmond", None),
    Duration::from_secs(30),
);

// `rpc_client` is assumed to be a CollectorRpcClient created elsewhere.
let response = rpc_client.call(request, Duration::from_secs(30)).await?;

Stop Collector

let request = RpcRequest::lifecycle(
    "agent-id".to_string(),
    "control.collector.procmond".to_string(),
    CollectorOperation::Stop,
    CollectorLifecycleRequest::stop("procmond", true), // graceful=true
    Duration::from_secs(60)
);

let response = rpc_client.call(request, Duration::from_secs(60)).await?;

Health Check

let request = RpcRequest::health_check(
    "agent-id".to_string(),
    "control.collector.procmond".to_string(),
    Duration::from_secs(10)
);

let response = rpc_client.call(request, Duration::from_secs(10)).await?;

RPC Message Structure

All RPC messages include:

Correlation ID: Unique identifier for request/response matching
Timestamp: Request creation time
Timeout: Maximum execution time
Operation: Type of operation (Start, Stop, Restart, HealthCheck, etc.)
Payload: Operation-specific data
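The fields listed above could be carried in an envelope like the following. This is an illustrative sketch only: `RpcEnvelope`, `Operation`, and `is_expired` are hypothetical names, not the types exported by `daemoneye_eventbus`, and the byte-vector payload is an assumption.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical operation kinds. The list in the text ends with "etc.",
/// so the real set of operations may be larger.
#[derive(Debug, Clone, PartialEq)]
pub enum Operation {
    Start,
    Stop,
    Restart,
    HealthCheck,
}

/// Illustrative envelope carrying the five fields described above.
#[derive(Debug, Clone)]
pub struct RpcEnvelope {
    pub correlation_id: String,
    pub timestamp: SystemTime,
    pub timeout: Duration,
    pub operation: Operation,
    /// Operation-specific data; raw bytes are an assumption of this sketch.
    pub payload: Vec<u8>,
}

impl RpcEnvelope {
    /// True once `now` is later than the creation time plus the timeout.
    pub fn is_expired(&self, now: SystemTime) -> bool {
        match now.duration_since(self.timestamp) {
            Ok(elapsed) => elapsed > self.timeout,
            // `now` is earlier than the timestamp (clock skew): not expired.
            Err(_) => false,
        }
    }
}
```

Keeping the timestamp and timeout in the message itself lets a receiver discard a request that has already expired instead of executing it late.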

Error Handling

RPC calls can fail with:

Timeout: Operation exceeded timeout duration
NotFound: Target collector not found
InvalidOperation: Operation not supported
ExecutionError: Operation failed during execution
CommunicationError: Network or transport error
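A caller usually needs to decide which of these failures are worth retrying. The sketch below models the five categories as an enum; the type name `RpcFailure`, the string payloads, and the retry policy are all assumptions made for illustration, not the crate's actual error type.

```rust
/// Hypothetical error type mirroring the failure categories listed above.
#[derive(Debug, Clone, PartialEq)]
pub enum RpcFailure {
    Timeout,
    NotFound,
    InvalidOperation,
    ExecutionError(String),
    CommunicationError(String),
}

impl RpcFailure {
    /// Assumed policy: only transient failures (timeouts and transport
    /// errors) are retried. The other categories would fail again with
    /// the same request, so they are reported to the caller immediately.
    pub fn is_retryable(&self) -> bool {
        matches!(
            self,
            RpcFailure::Timeout | RpcFailure::CommunicationError(_)
        )
    }
}
```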

Correlation Metadata

All EventBus messages include correlation metadata for distributed tracing:

pub struct CorrelationMetadata {
    pub correlation_id: String,
    pub parent_correlation_id: Option<String>,
    pub root_correlation_id: String,
    pub sequence_number: u64,
    pub workflow_stage: Option<String>,
    pub tags: HashMap<String, String>,
}

This enables:

Request/response correlation across multiple hops
Workflow tracking for complex operations
Forensic analysis of event chains
Performance profiling of distributed operations
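To show how the parent and root IDs support multi-hop tracing, the following sketch derives child metadata from a parent. The struct repeats the definition above so the example is self-contained; the helpers `new_root` and `derive_child` are hypothetical, and incrementing `sequence_number` per hop is an assumption about how the field is used.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone)]
pub struct CorrelationMetadata {
    pub correlation_id: String,
    pub parent_correlation_id: Option<String>,
    pub root_correlation_id: String,
    pub sequence_number: u64,
    pub workflow_stage: Option<String>,
    pub tags: HashMap<String, String>,
}

impl CorrelationMetadata {
    /// Hypothetical helper: starts a new chain whose root is itself.
    pub fn new_root(id: &str) -> Self {
        Self {
            correlation_id: id.to_string(),
            parent_correlation_id: None,
            root_correlation_id: id.to_string(),
            sequence_number: 0,
            workflow_stage: None,
            tags: HashMap::new(),
        }
    }

    /// Hypothetical helper: the child points at this message as its parent
    /// and inherits the root ID and tags unchanged.
    pub fn derive_child(&self, id: &str, stage: Option<&str>) -> Self {
        Self {
            correlation_id: id.to_string(),
            parent_correlation_id: Some(self.correlation_id.clone()),
            root_correlation_id: self.root_correlation_id.clone(),
            // Assumption: the sequence number advances by one per hop.
            sequence_number: self.sequence_number + 1,
            workflow_stage: stage.map(str::to_string),
            tags: self.tags.clone(),
        }
    }
}
```

Because every descendant keeps the same `root_correlation_id`, all messages in one workflow can be found with a single lookup, while `parent_correlation_id` reconstructs the order of hops.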

Multi-Collector Coordination

Task Distribution

The agent distributes tasks to collectors based on capabilities:

1. Agent receives detection task (e.g., SQL query)
2. Agent determines required collector capabilities
3. Agent publishes task to appropriate topic (e.g., control.collector.task)
4. Collectors with matching capabilities receive and execute task
5. Collectors publish results to result topic (e.g., events.process.batch)
6. Agent aggregates results and generates alerts

Result Aggregation

Results from multiple collectors are aggregated using:

Correlation IDs: Match results to original request
Sequence Numbers: Order results from same collector
Deduplication: Remove duplicate results across collectors
Timeout Handling: Handle slow or failed collectors gracefully
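The first three mechanisms can be sketched as a single pass over the collected results. This is a simplified illustration: `CollectorResult`, its `record_key` field (the identity used for deduplication), and the `aggregate` function are hypothetical, and timeout handling is left out.

```rust
use std::collections::HashSet;

/// Hypothetical result record; `record_key` is an assumed identity used
/// to detect the same finding reported by more than one collector.
#[derive(Debug, Clone, PartialEq)]
pub struct CollectorResult {
    pub correlation_id: String,
    pub collector: String,
    pub sequence_number: u64,
    pub record_key: String,
}

/// Keeps the results that belong to `request_id`, orders them by collector
/// and sequence number, and drops records whose key was already seen.
pub fn aggregate(request_id: &str, mut results: Vec<CollectorResult>) -> Vec<CollectorResult> {
    // Correlation IDs: discard results that answer a different request.
    results.retain(|r| r.correlation_id == request_id);

    // Sequence numbers: order results from the same collector.
    results.sort_by(|a, b| {
        a.collector
            .cmp(&b.collector)
            .then(a.sequence_number.cmp(&b.sequence_number))
    });

    // Deduplication: `insert` returns false for a key that is already present,
    // so only the first occurrence in sorted order is kept.
    let mut seen = HashSet::new();
    results.retain(|r| seen.insert(r.record_key.clone()));
    results
}
```

In this sketch the first occurrence in sorted order wins; a real aggregator might instead prefer a particular collector or merge the duplicate records.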

Load Balancing

When multiple instances of the same collector type are available:

Tasks are distributed round-robin across instances
Failed instances are detected via health checks
Tasks are automatically redistributed to healthy instances
Collectors can advertise capacity limits
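Round-robin selection that skips unhealthy instances might look like the sketch below. The `RoundRobin` type and its methods are hypothetical names for illustration; capacity limits are not modeled, and the health flag is assumed to be updated by a separate health-check loop.

```rust
/// One collector instance as seen by the balancer (illustrative).
pub struct Instance {
    pub id: String,
    pub healthy: bool,
}

/// Hypothetical round-robin selector over instances of one collector type.
pub struct RoundRobin {
    instances: Vec<Instance>,
    next: usize,
}

impl RoundRobin {
    /// All instances start out healthy.
    pub fn new(ids: Vec<&str>) -> Self {
        let instances = ids
            .into_iter()
            .map(|id| Instance { id: id.to_string(), healthy: true })
            .collect();
        Self { instances, next: 0 }
    }

    /// Records the outcome of a health check for the named instance.
    pub fn mark_health(&mut self, id: &str, healthy: bool) {
        for instance in self.instances.iter_mut().filter(|i| i.id == id) {
            instance.healthy = healthy;
        }
    }

    /// Returns the next healthy instance, or None if none is healthy.
    pub fn pick(&mut self) -> Option<String> {
        let len = self.instances.len();
        // Scan at most one full cycle, starting at the rotation cursor.
        for offset in 0..len {
            let idx = (self.next + offset) % len;
            if self.instances[idx].healthy {
                self.next = (idx + 1) % len;
                return Some(self.instances[idx].id.clone());
            }
        }
        None
    }
}
```

Because unhealthy instances are skipped at selection time, tasks shift to the remaining instances as soon as a health check fails, without any separate redistribution step.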

Configuration

Broker Configuration

broker:
  socket_path: /tmp/daemoneye-eventbus.sock
  startup_timeout_seconds: 30
  max_subscribers: 100
  message_buffer_size: 10000
  enable_metrics: true

RPC Configuration

rpc:
  default_timeout_seconds: 30
  health_check_interval_seconds: 60
  enable_correlation_tracking: true
  max_pending_requests: 1000

Process Manager Configuration

process_manager:
  graceful_shutdown_timeout_seconds: 60
  force_shutdown_timeout_seconds: 5
  health_check_interval_seconds: 120
  enable_auto_restart: true
  max_restart_attempts: 3
  restart_backoff_seconds: 5
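The three restart settings above could interact as in the following sketch. `RestartPolicy` and `next_delay` are hypothetical names, and the sketch assumes a constant delay between attempts; the source does not say whether the actual backoff is fixed or grows with each attempt.

```rust
use std::time::Duration;

/// Hypothetical in-memory form of the restart settings shown above.
pub struct RestartPolicy {
    pub enable_auto_restart: bool,
    pub max_restart_attempts: u32,
    pub restart_backoff: Duration,
}

impl RestartPolicy {
    /// Returns how long to wait before the next restart, or None when the
    /// collector should not be restarted. `attempts_so_far` counts the
    /// restarts that have already been made.
    pub fn next_delay(&self, attempts_so_far: u32) -> Option<Duration> {
        if !self.enable_auto_restart || attempts_so_far >= self.max_restart_attempts {
            return None;
        }
        // Assumption: a constant delay; a real implementation might scale
        // the delay with the attempt number.
        Some(self.restart_backoff)
    }
}
```

With the values shown in the configuration, a crashed collector would be restarted up to three times with a 5-second pause before each attempt, then left stopped.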