Skip to content

Feature Request: Add PreTargetSelection probe mode for stateful workload validation before pod selection #778

@XploY04

Description

@XploY04

Problem

When testing stateful workloads with asynchronous state propagation, chaos experiments fail with TARGET_SELECTION_ERROR because target pods are selected before their labels are updated by the operator.

Example: In CloudNativePG, after deleting the primary pod:

  1. Replica is promoted (cluster status updates immediately)
  2. Pod labels update 60-120s later via reconciliation
  3. Next chaos iteration tries to select cnpg.io/instanceRole=primary → no pods found → error

Current workaround: Increase CHAOS_INTERVAL to 300s, but this is inflexible and wastes time.

Why Existing Probes Don't Work

Looking at chaoslib/litmus/pod-delete/lib/pod-delete.go:

probe.RunProbes(&chaosDetails, clients, resultDetails, "DuringChaos", eventsDetails)  // Starts goroutines

for duration < chaosDetails.ChaosDuration {
    targetPodList, err := common.GetTargetPods(chaosDetails, clients)  // Selection happens here
    // ... chaos injection
}
  • Continuous probes run as non-blocking goroutines (go triggerInlineContinuousCmdProbe(...))
  • SOT/EOT/Edge run once per experiment, not per iteration
  • OnChaos runs after target selection

None validate state synchronously before target selection in each iteration.

Proposed Solution

Add a PreTargetSelection probe mode that runs synchronously before target selection in each iteration:

for duration < chaosDetails.ChaosDuration {
    // NEW: Block until validation passes
    if err := probe.RunProbes(&chaosDetails, clients, resultDetails, "PreTargetSelection", eventsDetails); err != nil {
        return errors.Errorf("pre-target selection validation failed: %v", err)
    }
    
    targetPodList, err := common.GetTargetPods(chaosDetails, clients)
    // ... chaos injection
}

Example usage:

probe:
- name: wait-for-primary-label
  type: cmdProbe
  mode: PreTargetSelection  # Blocks until ready
  runProperties:
    probeTimeout: 120
    retry: 50
    interval: 2
  cmdProbe/inputs:
    command: kubectl get pods -l cnpg.io/instanceRole=primary -o name | grep -q '.'

Benefits:

  • Blocks execution until state is ready
  • Supports retry logic with configurable timeout
  • Eliminates need for inflated CHAOS_INTERVAL
  • Works for any stateful workload (Kafka, Elasticsearch, Redis, etc.)

Implementation

Changes needed in litmus-go:

  1. Add PreTargetSelection to probe mode enum
  2. Update pkg/probe/probe.go to handle synchronous execution
  3. Add probe invocation before GetTargetPods() in chaos libraries
  4. Update documentation

Willingness to Contribute

I can implement this feature, add tests, and update documentation if the design is approved.


Question: Does this approach align with Litmus architecture? Open to alternative suggestions.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions