BLACKSHIELD

Public Guide

Troubleshooting and Platform Limits

Use this if

Run a practical multi-scanner troubleshooting playbook for ingestion, authentication, provider connectivity, and throughput so teams can isolate failures fast across CI, cloud, SaaS, Kubernetes, and VM sources.

Audience
Tenant admins, DevOps teams, scanner operators, integration owners, and support engineers
Typical time
20-30 minutes

Before You Begin

  • Collect failing and last-successful job IDs, UTC timestamps, connector key, scanner type/version, and target context (repository, cloud account, cluster, or SaaS tenant).
  • Capture current API key metadata, connector configuration snapshot, and environment variable values used by the failing workflow so drift can be ruled out quickly.
  • Prepare one representative payload sample and expected output shape before re-running ingestion so you can compare parser behavior deterministically.
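
To compare parser behavior deterministically, keep a fixed payload sample and an expected normalized shape side by side. The sketch below is illustrative only: every field name is a placeholder for your scanner's actual schema, not the platform's real contract.

```python
# Hypothetical payload sample; all field names are illustrative and must be
# replaced with the schema your scanner actually emits.
payload_sample = {
    "scanner_type": "container-image",
    "scanner_version": "2.4.0",
    "resource_id": "registry.example.com/app@sha256:ab12cd34",
    "severity": "HIGH",
    "timestamp": "2024-01-15T09:30:00Z",
    "evidence": ["CVE-2023-0001"],
}

# Expected output shape after normalization (also illustrative): compare the
# keys of a re-ingested finding against this set to spot parser drift.
expected_shape = {
    "id", "scanner_type", "resource_id", "severity",
    "first_seen", "evidence_refs",
}
```

Re-running ingestion with the same frozen sample before and after a change makes any difference in parsed output attributable to the platform, not to a shifting input.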

Guide walkthrough

Step 1

Check authentication and access first

Most ingestion failures begin with identity drift between API keys, connector credentials, and workspace permissions, so prove auth is healthy before tuning anything else.

  • Confirm the API key used by each scanner path (pipeline, cloud, Kubernetes, SaaS, VM) is active, correctly scoped to the expected workspace, and still within your rotation window in `/api-keys`.
  • Verify API base URL, tenant/workspace context, and environment variables for every third-party provider connection (for example AWS account, GCP project, GitHub org, GitLab group, or SaaS tenant) so credentials are not being sent to the wrong endpoint or environment.
  • Check that the operator account has permission to inspect connectors and alerts in `/integrations` and `/integrations/alerts`, because missing UI permissions can mask whether the issue is auth, connector health, or policy rejection.
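
Two of the checks above can be scripted as local preflight tests before touching any dashboard. This is a minimal sketch under assumptions: the rotation window (90 days here) and the helper names are ours, not the platform's API.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlparse

def key_within_rotation(created_at, rotation_days=90, now=None):
    """True if an API key's age is still inside the rotation window.

    rotation_days=90 is an assumed policy; use your tenant's actual value.
    """
    now = now or datetime.now(timezone.utc)
    return now - created_at <= timedelta(days=rotation_days)

def context_matches(base_url, expected_host):
    """Guard against credentials being sent to the wrong endpoint/environment
    by comparing the configured base URL's host to the expected one."""
    return urlparse(base_url).hostname == expected_host
```

Running these against the key metadata and environment variables captured in "Before You Begin" rules out rotation expiry and endpoint drift in seconds.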

What success looks like

API keys resolve to the intended workspace and scope, provider credentials reach the correct endpoints, and the operator can open both `/integrations` and `/integrations/alerts` without permission errors, so any remaining failure can be attributed to something other than auth.

Step 2

Validate payload format and metadata

Schema mismatch and unstable metadata are the most common causes of partial ingestion, silent skipping, and duplicate findings when multiple scanners feed the same assets.

  • Validate each scanner payload against the expected format before submission and confirm required fields are present (scanner type, resource identity, severity, timestamps, and evidence references) so normalized findings are not dropped during parsing.
  • Use stable dedup inputs across recurring runs by anchoring on durable identifiers such as provider account/project/subscription, repository or image digest, workload identity, and rule/check ID, rather than temporary execution IDs.
  • Use ingestion job status and downstream evidence in `/findings` to separate parse failures, schema validation errors, dedup collisions, and post-processing delays before escalating as a platform incident.
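
The first two bullets can be sketched as a pre-submission check: validate required fields locally, and derive the dedup key only from durable identifiers. The required-field set and key fields below are illustrative assumptions, not the platform's published schema.

```python
import hashlib

# Illustrative required fields; substitute the platform's actual schema.
REQUIRED_FIELDS = {"scanner_type", "resource_id", "severity", "timestamp", "evidence"}

def missing_fields(finding):
    """Fields whose absence would cause the finding to be dropped in parsing."""
    return REQUIRED_FIELDS - finding.keys()

def dedup_key(finding):
    """Stable dedup key anchored on durable identifiers (account, resource,
    rule) -- never on run/execution IDs, which change every scan."""
    parts = (finding["provider_account"], finding["resource_id"], finding["rule_id"])
    return hashlib.sha256("|".join(parts).encode()).hexdigest()
```

Because `dedup_key` ignores per-run fields, two scans of the same asset produce the same key, which is exactly the property recurring runs need to avoid duplicate findings.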

What success looks like

A representative payload passes schema validation with all required fields present, recurring runs produce identical dedup keys for the same assets, and `/findings` shows normalized findings arriving without parse errors or duplicates.

Step 3

Tune throughput and escalation

After auth and payload checks pass, optimize scan cadence and escalation quality so bursts from many connectors do not overwhelm ingestion and support can triage quickly when needed.

  • Distribute large scans across time windows and connector groups, especially when multiple third-party providers are scheduled simultaneously, to prevent queue spikes and ingestion lag.
  • Use bounded retries with exponential backoff and jitter for transient network and provider API failures, and stop retry storms by capping attempts per scanner run.
  • Escalate recurring failures with a complete packet: workspace identifier, connector key, scanner type and version, provider account/project context, job IDs, UTC timestamps, sample error response, and the exact dashboard links used for triage.
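
The retry guidance above can be sketched as a small helper: bounded attempts, exponential backoff, and full jitter. The function name and the choice of which exceptions count as transient are our assumptions; adapt both to your client library.

```python
import random
import time

def retry_with_backoff(op, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Run op() with bounded retries for transient network/provider failures.

    Caps attempts at max_attempts to stop retry storms; sleeps an
    exponentially growing, jittered delay between attempts.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # retry budget exhausted; surface the error for escalation
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, delay))  # full jitter spreads load
```

Full jitter (a uniform draw up to the backoff ceiling) keeps many connectors that fail at the same moment from retrying in lockstep against the same provider API.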

What success looks like

Scheduled scans complete without queue spikes, transient provider failures resolve within bounded retries instead of retry storms, and any escalation carries the complete diagnostic packet on first contact with support.

What success looks like

  • Root cause is narrowed to a concrete layer (credential scope, connector auth, payload schema, dedup behavior, permission boundary, or throughput saturation).
  • Mitigation and verification evidence are documented, and any escalation includes full diagnostics plus links to the exact dashboard views used during triage.