Problem: Validate resilience under pod kills and regional impairment.
Assumptions/Constraints
- Cost note: Two clusters simulate regions.
- Install Chaos Mesh or Litmus Chaos in each cluster (one installation per cluster).
ASCII Diagram
[Chaos Mesh]
   |
   +--> PodChaos: kill one pod in app=echo
   |
   +--> NetworkChaos: add 200ms delay to ns=echo
Effects:
- Envoy retries + multiple replicas keep SLI green
- p99 rises during NetworkChaos window
Steps (Chaos Mesh)
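- Install the Chaos Mesh operator with Helm into the chaos-testing namespace.
- Create a PodChaos that kills one app=echo pod in the echo namespace.
- Create a NetworkChaos that adds 200ms of latency (50ms jitter) to all pods in the echo namespace for 5 minutes.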
Example
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
helm upgrade --install chaos-mesh chaos-mesh/chaos-mesh \
  -n chaos-testing --create-namespace --set dashboard.create=true
Pod kill
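apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata: { name: kill-echo, namespace: chaos-testing }
spec:
  action: pod-kill
  mode: one
  # pod-kill is a one-shot action, so duration should have no effect here;
  # it matters for actions such as pod-failure.
  duration: '1m'
  selector:
    namespaces: [ "echo" ]
    labelSelectors: { app: echo }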
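A minimal sketch of applying the PodChaos manifest above and checking that a replacement pod comes up. The file name kill-echo.yaml and the function name apply_pod_kill are assumptions for illustration; the object name, namespaces, and label come from the manifest.

```shell
#!/usr/bin/env bash
# Hypothetical helper: apply the PodChaos manifest and inspect the result.
# Assumes the manifest was saved locally (kill-echo.yaml is a made-up file name).
apply_pod_kill() {
  local manifest="${1:-kill-echo.yaml}"
  # Create the PodChaos object; Chaos Mesh then kills one matching pod.
  kubectl apply -f "$manifest"
  # Show the experiment's status and events as recorded on the object.
  kubectl -n chaos-testing describe podchaos kill-echo
  # List app=echo pods; the replacement pod should show a small AGE.
  kubectl -n echo get pods -l app=echo
}
```

Calling apply_pod_kill with no argument uses the assumed file name; pass a path to use a different one.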
Network latency
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata: { name: delay-echo, namespace: chaos-testing }
spec:
  action: delay
  mode: all
  selector:
    namespaces: [ "echo" ]
  delay:
    latency: '200ms'
    jitter: '50ms'
  duration: '5m'
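If an experiment needs to be aborted before its duration expires, deleting the chaos object is the usual way to make Chaos Mesh recover the injected fault. The sketch below assumes the two object names used above; the function name stop_chaos is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: inspect the latency experiment, then remove both experiments.
stop_chaos() {
  # Show the current status and events of the NetworkChaos object.
  kubectl -n chaos-testing describe networkchaos delay-echo
  # Deleting the objects asks Chaos Mesh to roll back the injected faults.
  kubectl -n chaos-testing delete networkchaos delay-echo --ignore-not-found
  kubectl -n chaos-testing delete podchaos kill-echo --ignore-not-found
}
```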
Verification/DoD
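- The service keeps responding while pods are killed, because the remaining replicas and retries absorb the loss.
- The added latency raises p99 during the NetworkChaos window; p99 either stays within tolerance or triggers the expected alerts.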
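One way to check both criteria is to probe the service while an experiment runs and summarize the samples. This is a rough sketch, not a replacement for the real SLI dashboards: the function names probe and summarize are hypothetical, the endpoint URL is whatever the echo service actually exposes, and p99 is computed with the simple nearest-rank method.

```shell
#!/usr/bin/env bash
# probe: request a URL N times and print "<http_code> <seconds>" per request.
# The URL is an assumption; substitute the echo service's real endpoint.
probe() {
  local url="$1" count="${2:-100}"
  for _ in $(seq "$count"); do
    # curl prints 000 as the status code when the request fails or times out.
    curl -s -o /dev/null --max-time 5 -w '%{http_code} %{time_total}\n' "$url" || true
    sleep 0.2
  done
}

# summarize: read "<http_code> <seconds>" lines on stdin and print
# the success rate (2xx/3xx responses) and the nearest-rank p99 latency.
summarize() {
  LC_ALL=C sort -k2,2n | awk '
    { lat[NR] = $2; if ($1 >= 200 && $1 < 400) ok++ }
    END {
      if (NR == 0) { print "no samples"; exit 1 }
      idx = int((NR * 99 + 99) / 100)   # ceil(0.99 * NR)
      printf "success=%.1f%% p99=%.3fs\n", 100 * ok / NR, lat[idx]
    }'
}
```

For example, running `probe http://echo.example.internal/ 300 | summarize` (a placeholder URL) once before and once during the NetworkChaos window gives a baseline and an impaired p99 to compare against the tolerance.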