Feature Request: Add PreTargetSelection probe mode for stateful workload validation before pod selection

## Problem

When testing stateful workloads with asynchronous state propagation, chaos experiments fail with `TARGET_SELECTION_ERROR` because target pods are selected before their labels are updated by the operator.

**Example**: In CloudNativePG, after deleting the primary pod:
1. Replica is promoted (cluster status updates immediately)
2. Pod labels update 60-120s later via reconciliation
3. Next chaos iteration tries to select `cnpg.io/instanceRole=primary` → no pods found → error

**Current workaround**: Increase `CHAOS_INTERVAL` to 300s, but this is inflexible and wastes time.

## Why Existing Probes Don't Work

Looking at `chaoslib/litmus/pod-delete/lib/pod-delete.go`:

```go
probe.RunProbes(&chaosDetails, clients, resultDetails, "DuringChaos", eventsDetails)  // Starts goroutines

for duration < chaosDetails.ChaosDuration {
    targetPodList, err := common.GetTargetPods(chaosDetails, clients)  // Selection happens here
    // ... chaos injection
}
```

- **Continuous probes** run as non-blocking goroutines (`go triggerInlineContinuousCmdProbe(...)`)
- **SOT/EOT/Edge** run once per experiment, not per iteration
- **OnChaos** runs after target selection

None validate state synchronously **before** target selection in each iteration.

## Proposed Solution

Add a **PreTargetSelection** probe mode that runs synchronously before target selection in each iteration:

```go
for duration < chaosDetails.ChaosDuration {
    // NEW: Block until validation passes
    if err := probe.RunProbes(&chaosDetails, clients, resultDetails, "PreTargetSelection", eventsDetails); err != nil {
        return errors.Errorf("pre-target selection validation failed: %v", err)
    }
    
    targetPodList, err := common.GetTargetPods(chaosDetails, clients)
    // ... chaos injection
}
```

**Example usage:**

```yaml
probe:
- name: wait-for-primary-label
  type: cmdProbe
  mode: PreTargetSelection  # Blocks until ready
  runProperties:
    probeTimeout: 120
    retry: 50
    interval: 2
  cmdProbe/inputs:
    command: kubectl get pods -l cnpg.io/instanceRole=primary -o name | grep -q '.'
```

**Benefits:**
- Blocks execution until state is ready
- Supports retry logic with configurable timeout
- Eliminates need for inflated `CHAOS_INTERVAL`
- Works for any stateful workload (Kafka, Elasticsearch, Redis, etc.)

## Implementation

Changes needed in `litmus-go`:
1. Add `PreTargetSelection` to probe mode enum
2. Update `pkg/probe/probe.go` to handle synchronous execution
3. Add probe invocation before `GetTargetPods()` in chaos libraries
4. Update documentation

## Willingness to Contribute

I can implement this feature, add tests, and update documentation if the design is approved.

---

**Question**: Does this approach align with Litmus architecture? Open to alternative suggestions.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Feature Request: Add PreTargetSelection probe mode for stateful workload validation before pod selection #778

Problem

Why Existing Probes Don't Work

Proposed Solution

Implementation

Willingness to Contribute

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Feature Request: Add PreTargetSelection probe mode for stateful workload validation before pod selection #778

Description

Problem

Why Existing Probes Don't Work

Proposed Solution

Implementation

Willingness to Contribute

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions