Network Sensor High Availability & Architecture

Design resilient network sensor deployments with failover, load balancing, and multi-region architectures. Audience: platform architects, network engineers, SRE teams. Typical setup time: 15 minutes.

Before you begin

You have a single-sensor deployment running and stable.
You understand your organization's RTO and RPO requirements.
You have access to infrastructure-as-code tools (CDK, Terraform, or Bicep) for your cloud.

Guide walkthrough

Step 1

Baseline: Single-sensor deployment

Start here for small environments (<10 Gbps). Verify the sensor is working before adding complexity.

Deploy one sensor instance per VPC, VNet, or GCP region.
Use the local health check (port 8080) to validate that the sensor is running.
Store the API key in Secrets Manager, with read access restricted to the sensor IAM role.
Monitor CPU, memory, and ingestion metrics; move to a larger instance type if utilization stays above 80%.
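
The secret-access and utilization bullets above could be expressed in CloudFormation roughly as follows. This is a minimal sketch, not an official template: all four parameters (SensorRoleName, SensorApiKeySecretArn, SensorInstanceId, AlertTopicArn) are hypothetical names, and the 5-minute period and three evaluation periods are assumed values to tune.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: "Sketch: secret read access and a CPU alarm for a single sensor"

Parameters:
  SensorRoleName:          # hypothetical: IAM role attached to the sensor instance
    Type: String
  SensorApiKeySecretArn:   # hypothetical: ARN of the secret that holds the API key
    Type: String
  SensorInstanceId:        # hypothetical: the running sensor instance
    Type: AWS::EC2::Instance::Id
  AlertTopicArn:           # hypothetical: SNS topic that notifies the on-call team
    Type: String

Resources:
  # Grants the sensor role permission to read only the API key secret.
  SensorSecretReadPolicy:
    Type: AWS::IAM::Policy
    Properties:
      PolicyName: sensor-read-api-key
      Roles:
        - !Ref SensorRoleName
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Action: secretsmanager:GetSecretValue
            Resource: !Ref SensorApiKeySecretArn

  # Fires when average CPU stays above 80% for three consecutive periods.
  SensorCpuHighAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: "Sensor CPU above 80%; consider a larger instance type"
      Namespace: AWS/EC2
      MetricName: CPUUtilization
      Dimensions:
        - Name: InstanceId
          Value: !Ref SensorInstanceId
      Statistic: Average
      Period: 300            # assumed value
      EvaluationPeriods: 3   # assumed value
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref AlertTopicArn
```

The policy only grants access to the sensor role; keeping other principals out is a separate review of the account's existing IAM policies. EC2 does not publish memory metrics by default, so a memory alarm would need an agent on the instance to export them.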

Step 2

Active-passive failover (HA pair)

Deploy a primary and a standby sensor; failover is triggered when the primary's health check fails.

Create two sensor instances in different subnets (primary and standby).
Use an Auto Scaling Group with min=1, desired=1, max=1 to replace a failed instance automatically.
Traffic Mirror Session: point the session at the primary sensor's ENI and register the standby ENI as the failover target.
If the primary health check fails, switch the session to the standby target, either manually or automatically with a Lambda function. A mirror session references a single target, so AWS does not make this switch on its own.
Monitoring: set a CloudWatch alarm on health check status and page on incident.
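
The mirror-session wiring described above might look like the following CloudFormation sketch. It assumes three hypothetical parameters (SourceEniId, PrimarySensorEniId, StandbySensorEniId) and a filter that mirrors all IPv4 traffic, which may be broader than you need. Because a session references one target at a time, the standby target is only created here. The failover step itself, whether manual or via Lambda, is assumed to repoint the session with the EC2 `ModifyTrafficMirrorSession` API.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: "Sketch: traffic mirror session with primary and standby targets"

Parameters:
  SourceEniId:          # hypothetical: ENI of the workload being mirrored
    Type: String
  PrimarySensorEniId:   # hypothetical: capture ENI of the primary sensor
    Type: String
  StandbySensorEniId:   # hypothetical: capture ENI of the standby sensor
    Type: String

Resources:
  PrimaryTarget:
    Type: AWS::EC2::TrafficMirrorTarget
    Properties:
      Description: "Primary sensor"
      NetworkInterfaceId: !Ref PrimarySensorEniId

  StandbyTarget:
    Type: AWS::EC2::TrafficMirrorTarget
    Properties:
      Description: "Standby sensor"
      NetworkInterfaceId: !Ref StandbySensorEniId

  MirrorFilter:
    Type: AWS::EC2::TrafficMirrorFilter
    Properties:
      Description: "Accept all IPv4 traffic (assumed scope)"

  MirrorIngressRule:
    Type: AWS::EC2::TrafficMirrorFilterRule
    Properties:
      TrafficMirrorFilterId: !Ref MirrorFilter
      TrafficDirection: ingress
      RuleNumber: 100
      RuleAction: accept
      SourceCidrBlock: 0.0.0.0/0
      DestinationCidrBlock: 0.0.0.0/0

  MirrorEgressRule:
    Type: AWS::EC2::TrafficMirrorFilterRule
    Properties:
      TrafficMirrorFilterId: !Ref MirrorFilter
      TrafficDirection: egress
      RuleNumber: 100
      RuleAction: accept
      SourceCidrBlock: 0.0.0.0/0
      DestinationCidrBlock: 0.0.0.0/0

  # The session starts on the primary target; failover repoints it to StandbyTarget.
  MirrorSession:
    Type: AWS::EC2::TrafficMirrorSession
    Properties:
      NetworkInterfaceId: !Ref SourceEniId
      TrafficMirrorFilterId: !Ref MirrorFilter
      TrafficMirrorTargetId: !Ref PrimaryTarget
      SessionNumber: 1

Outputs:
  MirrorSessionId:
    Value: !Ref MirrorSession
  StandbyTargetId:
    Value: !Ref StandbyTarget
```

The two outputs are the identifiers a failover script or function would need in order to move the session to the standby target.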

Step 3

Active-active load distribution

Deploy multiple sensors and load-balance traffic across them for higher throughput.

Create 2–4 sensor instances behind a load balancer (NLB for AWS, ILB for GCP/Azure).
VPC Traffic Mirroring (AWS) or Packet Mirroring (GCP) sends mirrored traffic to the load balancer target.
Each sensor ingests independently; findings are deduplicated at the platform.
Scale with Auto Scaling based on CPU/memory metrics.
RTO: <1 min; RPO: 0 (stateless sensors).
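
On AWS, the active-active layout above could be sketched as below. The parameter names are hypothetical, the 60% CPU target is an assumed value, and the health check is a plain TCP probe on port 8080 because this guide does not state the HTTP path of the sensor's health endpoint. Traffic Mirroring delivers VXLAN-encapsulated packets on UDP port 4789, which is why the listener and target group use that port.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: "Sketch: active-active sensors behind an internal NLB"

Parameters:
  VpcId:
    Type: AWS::EC2::VPC::Id
  SubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
  SensorLaunchTemplateId:        # hypothetical: launch template that boots a sensor
    Type: String
  SensorLaunchTemplateVersion:   # hypothetical: version number of that template
    Type: String

Resources:
  SensorNLB:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Type: network
      Scheme: internal
      Subnets: !Ref SubnetIds

  SensorTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      VpcId: !Ref VpcId
      TargetType: instance
      Protocol: UDP
      Port: 4789                 # VXLAN port used by mirrored traffic
      HealthCheckProtocol: TCP   # assumed: TCP probe of the local health port
      HealthCheckPort: "8080"

  MirrorListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      LoadBalancerArn: !Ref SensorNLB
      Protocol: UDP
      Port: 4789
      DefaultActions:
        - Type: forward
          TargetGroupArn: !Ref SensorTargetGroup

  SensorASG:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "2"
      MaxSize: "4"
      VPCZoneIdentifier: !Ref SubnetIds
      LaunchTemplate:
        LaunchTemplateId: !Ref SensorLaunchTemplateId
        Version: !Ref SensorLaunchTemplateVersion
      TargetGroupARNs:
        - !Ref SensorTargetGroup
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300

  # Adds or removes sensors to hold average CPU near the target value.
  CpuTargetTracking:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref SensorASG
      PolicyType: TargetTrackingScaling
      TargetTrackingConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ASGAverageCPUUtilization
        TargetValue: 60          # assumed target; tune for your traffic

  # Mirror sessions point at this target instead of an individual sensor ENI.
  MirrorTarget:
    Type: AWS::EC2::TrafficMirrorTarget
    Properties:
      NetworkLoadBalancerArn: !Ref SensorNLB
```

Scaling on memory, as the list above also suggests, would need a custom metric, since the predefined target-tracking metrics do not include memory utilization.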

Step 4

Multi-region architecture

Deploy sensors in each region for local capture and resilience to data center failures.

Repeat the active-active or active-passive design in each production region.
Each region has its own VPC/VNet with independent sensors.
The platform backend is replicated (primary region as the active upstream, secondary region as a warm standby).
Use DNS failover (Route 53, Cloud DNS, Azure DNS) or application-level failover to switch regions.
RPO: 0 (findings are streamed in real time); RTO: <5 min (DNS propagation + app failover).
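
For the Route 53 option, DNS failover between the primary and secondary regions might be declared as in this sketch. Every parameter is hypothetical, as is the `/health` path, since this guide does not document the backend's health endpoint. The 30-second interval, failure threshold of 3, and 60-second TTL are assumed values chosen to keep detection plus DNS caching comfortably inside a 5-minute RTO.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: "Sketch: Route 53 failover between primary and secondary regions"

Parameters:
  HostedZoneId:
    Type: AWS::Route53::HostedZone::Id
  IngestRecordName:    # hypothetical: name the sensors send to, e.g. ingest.example.com
    Type: String
  PrimaryEndpoint:     # hypothetical: hostname of the primary-region backend
    Type: String
  SecondaryEndpoint:   # hypothetical: hostname of the warm-standby backend
    Type: String

Resources:
  PrimaryHealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTPS
        FullyQualifiedDomainName: !Ref PrimaryEndpoint
        Port: 443
        ResourcePath: /health   # assumed path
        RequestInterval: 30     # assumed value
        FailureThreshold: 3     # assumed value

  # Served while the primary health check passes.
  PrimaryRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: !Ref HostedZoneId
      Name: !Ref IngestRecordName
      Type: CNAME
      TTL: "60"
      SetIdentifier: primary
      Failover: PRIMARY
      HealthCheckId: !Ref PrimaryHealthCheck
      ResourceRecords:
        - !Ref PrimaryEndpoint

  # Served only when the primary record is unhealthy.
  SecondaryRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: !Ref HostedZoneId
      Name: !Ref IngestRecordName
      Type: CNAME
      TTL: "60"
      SetIdentifier: secondary
      Failover: SECONDARY
      ResourceRecords:
        - !Ref SecondaryEndpoint
```

A short TTL matters here because clients keep using a cached answer until it expires, so the TTL adds directly to the time it takes traffic to reach the secondary region.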

Step 5

High-availability checklist

Before going to production, confirm every aspect of your HA design.

Health checks: API endpoint responds 200 within 2 seconds.
Auto-scaling: ASG replaces failed instances within 3 minutes.
Secrets rotation: API key rotated without downtime (sensor pulls latest from Secrets Manager).
Backup and restore: sensor configuration backed up; can redeploy from template in <10 minutes.
Failover test: disable the primary sensor and confirm that traffic switches to the secondary within your SLA.
Metrics exported to central monitoring (CloudWatch, Datadog, Splunk, etc.).

Demonstration only

This configuration is designed for ease of use. To deploy sensors at scale, plan your deployment architecture accordingly or contact us for enterprise best practices.

Run

active-passive-failover.yaml

yaml

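AWSTemplateFormatVersion: "2010-09-09"
Description: "Network sensor failover using ASG"

Parameters:  # assumed inputs; names are illustrative, not from the original template
  LaunchTemplateId: {Type: String}        # launch template that boots the sensor
  LaunchTemplateVersion: {Type: String}
  SubnetIds: {Type: "List<AWS::EC2::Subnet::Id>"}
  TargetGroupArn: {Type: String}          # target group that health-checks the sensor

Resources:
  SensorASG:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "1"
      DesiredCapacity: "1"
      MaxSize: "1"
      VPCZoneIdentifier: !Ref SubnetIds   # where a replacement instance may launch
      LaunchTemplate:
        LaunchTemplateId: !Ref LaunchTemplateId
        Version: !Ref LaunchTemplateVersion
      TargetGroupARNs: [!Ref TargetGroupArn]  # gives the ELB health check a source
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300         # seconds to wait before the first check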

What success looks like

All sensor instances are healthy and passing health checks.
Failover occurs automatically within your defined RTO.
Findings continue to flow to the platform during sensor maintenance.
