-
Notifications
You must be signed in to change notification settings - Fork 135
Description
Problem
When testing stateful workloads with asynchronous state propagation, chaos experiments fail with TARGET_SELECTION_ERROR because target pods are selected before their labels are updated by the operator.
Example: In CloudNativePG, after deleting the primary pod:
- Replica is promoted (cluster status updates immediately)
- Pod labels update 60-120s later via reconciliation
- Next chaos iteration tries to select
cnpg.io/instanceRole=primary→ no pods found → error
Current workaround: Increase CHAOS_INTERVAL to 300s, but this is inflexible and wastes time.
Why Existing Probes Don't Work
Looking at chaoslib/litmus/pod-delete/lib/pod-delete.go:
probe.RunProbes(&chaosDetails, clients, resultDetails, "DuringChaos", eventsDetails) // Starts goroutines
for duration < chaosDetails.ChaosDuration {
targetPodList, err := common.GetTargetPods(chaosDetails, clients) // Selection happens here
// ... chaos injection
}- Continuous probes run as non-blocking goroutines (
go triggerInlineContinuousCmdProbe(...)) - SOT/EOT/Edge run once per experiment, not per iteration
- OnChaos runs after target selection
None validate state synchronously before target selection in each iteration.
Proposed Solution
Add a PreTargetSelection probe mode that runs synchronously before target selection in each iteration:
for duration < chaosDetails.ChaosDuration {
// NEW: Block until validation passes
if err := probe.RunProbes(&chaosDetails, clients, resultDetails, "PreTargetSelection", eventsDetails); err != nil {
return errors.Errorf("pre-target selection validation failed: %v", err)
}
targetPodList, err := common.GetTargetPods(chaosDetails, clients)
// ... chaos injection
}Example usage:
probe:
- name: wait-for-primary-label
type: cmdProbe
mode: PreTargetSelection # Blocks until ready
runProperties:
probeTimeout: 120
retry: 50
interval: 2
cmdProbe/inputs:
command: kubectl get pods -l cnpg.io/instanceRole=primary -o name | grep -q '.'Benefits:
- Blocks execution until state is ready
- Supports retry logic with configurable timeout
- Eliminates need for inflated
CHAOS_INTERVAL - Works for any stateful workload (Kafka, Elasticsearch, Redis, etc.)
Implementation
Changes needed in litmus-go:
- Add
PreTargetSelectionto probe mode enum - Update
pkg/probe/probe.goto handle synchronous execution - Add probe invocation before
GetTargetPods()in chaos libraries - Update documentation
Willingness to Contribute
I can implement this feature, add tests, and update documentation if the design is approved.
Question: Does this approach align with Litmus architecture? Open to alternative suggestions.