BLACKSHIELD

Public guide

Network Sensor High Availability & Architecture

Design resilient network sensor deployments with failover, load balancing, and multi-region architectures. Audience: platform architects, network engineers, SRE teams. Typical setup time: 15 minutes.

reference

Before you begin

  • You have a single-sensor deployment running and stable.
  • You understand your organization's RTO and RPO requirements.
  • You have access to infrastructure-as-code tools (CDK, Terraform, or Bicep) for your cloud.

Guide walkthrough

Step 1

Baseline: Single-sensor deployment

Start here for small environments (<10 Gbps). Verify the sensor is working before adding complexity.

  • Deploy one sensor instance per VPC, VNet, or GCP region.
  • Use the local health check (port 8080) to confirm that the sensor is running.
  • Store the API key in Secrets Manager, with read access restricted to the sensor's IAM role.
  • Monitor CPU, memory, and ingestion metrics; move to a larger instance type if utilization exceeds 80%.
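The two checks in this step can be sketched in a few lines of Python. This is a minimal illustration, not part of the product: only port 8080 and the 80% threshold come from the guide, while the `/health` path and the function names are assumptions.

```python
import http.client
import urllib.request

UTILIZATION_THRESHOLD = 80.0  # percent, from the guidance in this step


def sensor_is_healthy(url="http://127.0.0.1:8080/health", timeout=2.0):
    """Return True if the local health check answers HTTP 200.

    The /health path is an assumption; the guide only specifies port 8080.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or malformed reply.
        return False


def needs_larger_instance(cpu_pct, mem_pct, threshold=UTILIZATION_THRESHOLD):
    """Return True if CPU or memory utilization exceeds the threshold."""
    return cpu_pct > threshold or mem_pct > threshold
```

In practice the utilization figures would come from your monitoring system rather than being passed in by hand.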

Step 2

Active-passive failover (HA pair)

Deploy a primary and a standby sensor; a failed health check on the primary triggers failover.

  • Create two sensor instances in different subnets (primary and standby).
  • Use an Auto Scaling Group with min=1, desired=1, max=1 so that a failed instance is replaced automatically.
  • Traffic Mirror Session: target the primary sensor's ENI. A session has a single target, so failover means repointing the session at the standby ENI.
  • If the primary health check fails, switch the session to the standby, either manually or automatically via a Lambda function. AWS does not perform this switch on its own.
  • Monitoring: set a CloudWatch alarm on health check status and page the on-call engineer when it fires.
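The failover decision described above can be sketched as a small state machine. This is an illustrative outline under stated assumptions: the class name, the `switch_target` hook, and the threshold of three consecutive failures are all hypothetical. On AWS, the hook could wrap the EC2 `ModifyTrafficMirrorSession` API call, for example from a Lambda function.

```python
class FailoverController:
    """Switch to the standby after repeated primary health check failures."""

    def __init__(self, primary, standby, switch_target, max_failures=3):
        self.primary = primary
        self.standby = standby
        # Hypothetical hook: a callable that repoints mirrored traffic
        # at the given target ID.
        self.switch_target = switch_target
        self.max_failures = max_failures
        self.failures = 0
        self.active = primary

    def record(self, primary_healthy):
        """Record one primary health check result; return the active target."""
        if self.active != self.primary:
            # Already failed over. Failback is left as a manual step here.
            return self.active
        if primary_healthy:
            self.failures = 0
            return self.active
        self.failures += 1
        if self.failures >= self.max_failures:
            self.switch_target(self.standby)
            self.active = self.standby
        return self.active
```

Requiring several consecutive failures before switching avoids flapping on a single dropped health check, at the cost of a slightly longer detection time.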

Step 3

Active-active load distribution

Deploy multiple sensors and load-balance traffic across them for higher throughput.

  • Create 2–4 sensor instances behind a load balancer (NLB on AWS, ILB on GCP/Azure).
  • VPC Traffic Mirroring or Packet Mirroring sends mirrored traffic to the load balancer.
  • Each sensor ingests independently; the platform deduplicates findings.
  • Scale with Auto Scaling based on CPU and memory metrics.
  • RTO: <1 min; RPO: 0 (sensors are stateless).
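To illustrate what platform-side deduplication could look like when several sensors report the same event, here is a minimal sketch. The guide does not describe the platform's actual method; the field names (`rule_id`, `src_ip`, `dst_ip`, `dst_port`, `timestamp`) and the 60-second time bucket are assumptions made for the example.

```python
import hashlib


def finding_key(finding, bucket_seconds=60):
    """Build a stable fingerprint from fields that identify the same event."""
    bucket = int(finding["timestamp"]) // bucket_seconds
    parts = (
        finding["rule_id"],
        finding["src_ip"],
        finding["dst_ip"],
        str(finding["dst_port"]),
        str(bucket),
    )
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def deduplicate(findings, bucket_seconds=60):
    """Keep the first finding seen for each fingerprint, preserving order."""
    seen = set()
    unique = []
    for finding in findings:
        key = finding_key(finding, bucket_seconds)
        if key not in seen:
            seen.add(key)
            unique.append(finding)
    return unique
```

Fixed time buckets are simple but imperfect: two reports of one event that straddle a bucket boundary would both be kept.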

Step 4

Multi-region architecture

Deploy sensors in each region for local capture and resilience to data center failures.

  • Repeat the active-active or active-passive design in each production region.
  • Each region has its own VPC/VNet with independent sensors.
  • The platform backend is replicated: the primary region is the active upstream, and the secondary region is a warm standby.
  • Use DNS failover (Route 53, Cloud DNS, Azure DNS) or application-level failover to switch regions.
  • RPO: 0 (findings are streamed in real time); RTO: <5 min (DNS propagation plus application failover).
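A quick way to sanity-check the 5-minute RTO target is to add up the components of a regional failover. The sketch below is a rough worst-case estimate, not a measurement; the function names and the breakdown into detection, DNS TTL, and application cutover are assumptions for illustration.

```python
def estimated_rto_seconds(check_interval_s, failure_threshold, dns_ttl_s, app_failover_s):
    """Worst-case failover time: detection + DNS cache expiry + app cutover."""
    detection = check_interval_s * failure_threshold
    return detection + dns_ttl_s + app_failover_s


def meets_rto(rto_seconds, target_seconds=300):
    """Compare an estimate against the RTO target (5 minutes in this guide)."""
    return rto_seconds < target_seconds
```

For example, a 30-second health check interval with 3 failures required, a 60-second DNS TTL, and 60 seconds of application cutover add up to 210 seconds, which is inside the target. Raising the DNS TTL to 300 seconds pushes the estimate to 450 seconds and breaks it.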

Step 5

High-availability checklist

Before going to production, confirm every part of your HA design.

  • Health checks: the API endpoint responds with HTTP 200 within 2 seconds.
  • Auto-scaling: the ASG replaces failed instances within 3 minutes.
  • Secrets rotation: the API key can be rotated without downtime (the sensor pulls the latest value from Secrets Manager).
  • Backup and restore: sensor configuration is backed up, and you can redeploy from a template in <10 minutes.
  • Failover test: disable the primary sensor and confirm that traffic switches to the secondary within your SLA.
  • Metrics: exported to central monitoring (CloudWatch, Datadog, Splunk, etc.).
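The measurable items on this checklist can be turned into an automated gate. The following sketch assumes you have already collected the timings from a test run; the `HaMeasurements` type and its field names are hypothetical, while the thresholds (200 within 2 seconds, 3 minutes, 10 minutes) are taken from the list above.

```python
from dataclasses import dataclass


@dataclass
class HaMeasurements:
    health_status: int        # HTTP status returned by the health endpoint
    health_latency_s: float   # time the health endpoint took to respond
    asg_replacement_s: float  # time for the ASG to replace a failed instance
    redeploy_s: float         # time to redeploy a sensor from the template
    failover_s: float         # observed failover time in the failover test
    failover_sla_s: float     # your own failover SLA


def checklist_failures(m):
    """Return the checklist items that the measurements do not satisfy."""
    failures = []
    if m.health_status != 200 or m.health_latency_s > 2.0:
        failures.append("health check")
    if m.asg_replacement_s > 180:
        failures.append("auto-scaling replacement")
    if m.redeploy_s >= 600:
        failures.append("redeploy from template")
    if m.failover_s > m.failover_sla_s:
        failures.append("failover test")
    return failures
```

An empty list means every measured item passed. Secrets rotation and metrics export are not covered here and still need to be verified separately.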

Run this

active-passive-failover.yaml

yaml
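AWSTemplateFormatVersion: "2010-09-09"
Description: "Network sensor failover using ASG"

# Parameter names are placeholders; pass values from your own stack.
Parameters:
  SensorSubnetIds: {Type: "List<AWS::EC2::Subnet::Id>"}
  SensorLaunchTemplateId: {Type: String}
  SensorLaunchTemplateVersion: {Type: String}
  SensorTargetGroupArn: {Type: String}

Resources:
  SensorASG:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "1"
      DesiredCapacity: "1"
      MaxSize: "1"
      VPCZoneIdentifier: !Ref SensorSubnetIds  # subnets the sensor may launch in
      LaunchTemplate:
        LaunchTemplateId: !Ref SensorLaunchTemplateId
        Version: !Ref SensorLaunchTemplateVersion
      TargetGroupARNs: [!Ref SensorTargetGroupArn]  # required for ELB health checks
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300  # seconds to wait after launch before checking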

What success looks like

  • All sensor instances are healthy and passing health checks.
  • Failover occurs automatically within your defined RTO.
  • Findings continue to flow to the platform during sensor maintenance.