fix(discovery/kubernetes): preserve cached endpoints when payload is empty #13299
Open
andrefun wants to merge 2 commits into apache:master from
Conversation
fix(discovery/kubernetes): preserve cached endpoints when payload is empty
test: add unit test for empty endpoint guard
Type of change
What this PR does / why we need it
The Kubernetes service discovery currently writes an empty endpoint payload
to the shared dict whenever the inbound `Endpoints`/`EndpointSlice` event
contains no ready addresses. Subsequent requests against that service then
receive `no valid upstream node: nil` (HTTP 503): a route is matched and an
upstream is selected, but `dis.nodes(service_name)` returns nil because the
dict-stored endpoint object has no entry for the requested port.
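
For illustration, a minimal sketch of the failing lookup. The payload shape and
the `lookup` helper are simplified assumptions, not the exact internals of the
discovery module:

```lua
-- Hypothetical, simplified shape of the dict-stored endpoint payload.
-- A healthy write for a service looks roughly like:
local healthy = {
    ["80"] = { { host = "10.0.0.12", port = 8080, weight = 50 } },
}

-- After an "all not ready" event the stored payload becomes empty:
local empty = {}

-- dis.nodes(service_name) boils down to a per-port lookup on that payload;
-- with the empty payload every port resolves to nil, which surfaces as
-- "no valid upstream node: nil" (HTTP 503).
local function lookup(payload, port)
    return payload[tostring(port)]
end

assert(lookup(healthy, 80) ~= nil)
assert(lookup(empty, 80) == nil)
```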
This is the failure mode reported in #12923: a small burst of 503s on a
single APISIX pod during a canary rollout. The error log only shows
`init.lua:541: handle_upstream(): failed to set upstream: no valid upstream node: nil`;
there is no other warn / info / error before it.

Why this happens
`on_endpoint_modified` (and `on_endpoint_slices_modified`) only considers
`subset.addresses` (i.e. ready pods); `notReadyAddresses` is ignored. While
that is the correct routing decision, the write to the discovery dict still
replaces the previous endpoint payload, even when the new payload is empty.
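
A simplified sketch of that pre-patch flow (function and variable names
approximate `core.lua`; this is not the exact implementation):

```lua
-- Build the per-port node buffer from an Endpoints event; only ready pods
-- (subset.addresses) are considered, subset.notReadyAddresses is ignored.
local function build_endpoint_buffer(endpoint)
    local buffer = {}
    for _, subset in ipairs(endpoint.subsets or {}) do
        if subset.addresses then
            for _, port in ipairs(subset.ports or {}) do
                local nodes = {}
                for _, addr in ipairs(subset.addresses) do
                    nodes[#nodes + 1] = { host = addr.ip, port = port.port, weight = 50 }
                end
                buffer[tostring(port.port)] = nodes
            end
        end
    end
    return buffer
end

-- An event whose only subset carries notReadyAddresses yields an empty buffer:
assert(next(build_endpoint_buffer({
    subsets = { { notReadyAddresses = { { ip = "10.0.0.12" } },
                  ports = { { port = 8080 } } } },
})) == nil)

-- Pre-patch, the dict write happens unconditionally, so this empty buffer
-- overwrites the previously stored payload.
```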
Several legitimate transient k8s states can produce an "all not ready"
payload at one watch tick:
- the watch stream skipped
Once the empty payload is written, every request to that service on this
worker returns 503 until the next non-empty event arrives. The window can be
arbitrarily long (we observed ~13s in production); the dict does not
self-heal because nothing forces a re-list, and
`endpoint_lrucache` keys on the dict-stored version, so the empty result is cached.
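
A rough sketch of why the empty result sticks. This is simplified: the real
code uses an `ngx.shared` dict plus `core.lrucache`, while the sketch below
uses plain Lua tables to show the version-keyed caching:

```lua
-- Plain-table stand-ins for the shared dict and the per-worker lrucache.
local cache = {}

local function nodes(endpoint_dict, service_name, port)
    -- the cache key includes the version stored next to the payload, so the
    -- cached value is only recomputed when the dict payload itself changes
    local version = endpoint_dict[service_name .. "#version"]
    local key = service_name .. ":" .. port .. "#" .. tostring(version)
    if cache[key] ~= nil then
        return cache[key].nodes
    end
    local payload = endpoint_dict[service_name]   -- JSON-decoded in the real code
    cache[key] = { nodes = payload and payload[tostring(port)] }
    return cache[key].nodes
end

-- Until a non-empty event bumps the stored version, every request keeps
-- resolving to the cached empty result, so the 503s persist.
```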
What this PR changes
In `apisix/discovery/kubernetes/core.lua`:

- `on_endpoint_modified`: if the computed `endpoint_buffer` is empty
  (no port has any node), skip `update_endpoint_dict` and emit a warn log.
  The previous payload is preserved.
- `on_endpoint_slices_modified`: same guard, applied to the merged
  `cached_endpoints` view across all cached slices for the service.

Both callbacks now emit a warn-level diagnostic when an inbound event has
no ready addresses (independent of whether the guard fires), so operators
can see the empty events even when the guard prevents the 503s. A sketch of
the guard follows below.
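
A minimal sketch of the guard. The surrounding code is simplified and the
argument names are assumptions, not the exact `core.lua` signature:

```lua
local core = require("apisix.core")

-- "empty" means no port carries a single node
local function endpoint_buffer_is_empty(buffer)
    for _, nodes in pairs(buffer) do
        if #nodes > 0 then
            return false
        end
    end
    return true
end

local function on_endpoint_modified(handle, endpoint, endpoint_buffer, operate)
    if endpoint_buffer_is_empty(endpoint_buffer) then
        core.log.warn("endpoints event with no ready addresses for ",
                      endpoint.metadata.namespace, "/", endpoint.metadata.name,
                      ", keeping previously cached endpoints")
        -- the operate == "list" accounting still marks this key as "seen"
        -- so post_list's dirty-data cleanup does not delete the payload
        -- we intentionally kept
        return
    end
    -- non-empty payload: overwrite the dict as before
    -- update_endpoint_dict(handle, endpoint, endpoint_buffer)
end
```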
The legitimate "service truly went away" path is unchanged; it is handled
by `on_endpoint_deleted` / `on_endpoint_slices_deleted`, which delete the
key outright. The `operate == "list"` accounting in the guarded path still
marks the key as "seen" so that `post_list`'s dirty-data cleanup does not
remove a key we intentionally left in place.
What about truly-dead nodes still in the dict?
If every pod really is gone, the cached payload now points at IPs that fail
to connect. APISIX active health checks (default 5s interval, 3 failures)
detect and evict them within ~15s, after which the upstream balancer
naturally returns no candidates and the request fails. Crucially, the
failure mode degrades from "instant 503 with no upstream attempted" to
"~15s of routing to dead pods, then upstream balancer says no node" — which
is the same end state, but exposed via the normal upstream-down path
(observable via active-check logs, retry counters, etc.) instead of the
silent discovery-state failure.
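
For reference, an illustrative upstream `checks` block matching the timing
above, written as a Lua table. The values mirror the "5s interval, 3 failures"
figure from this description; the exact field set should be checked against
the upstream schema of your APISIX version:

```lua
-- Example active health-check settings: with a 5s probe interval and a
-- 3-failure threshold, a dead pod is marked unhealthy after roughly 15s.
local checks = {
    active = {
        type = "http",
        http_path = "/healthz",          -- example probe path, adjust to your service
        healthy   = { interval = 5, successes = 1 },
        unhealthy = { interval = 5, http_failures = 3 },
    },
}

return checks
```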
Reproduction
A minimal local reproduction (no k8s required) is included as a unit test
in `t/discovery/kubernetes_empty_endpoint.t`. The unit test fails on master
without the patch (the dict gets overwritten to `{}` after the empty events)
and passes with it.
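
The core of what the test exercises, sketched in plain Lua. The actual `.t`
file wraps this in the usual Test::Nginx harness; the event tables and the
pseudo-assertions here are illustrative, not the literal test content:

```lua
-- 1. feed an Endpoints event with one ready address  -> dict holds the nodes
-- 2. feed the same service with only notReadyAddresses -> without the patch
--    the dict is overwritten to {}, with the patch it still holds step 1
local ready_event = {
    metadata = { namespace = "default", name = "demo" },
    subsets = {
        { addresses = { { ip = "10.0.0.12" } }, ports = { { port = 8080 } } },
    },
}

local not_ready_event = {
    metadata = { namespace = "default", name = "demo" },
    subsets = {
        { notReadyAddresses = { { ip = "10.0.0.12" } }, ports = { { port = 8080 } } },
    },
}

-- pseudo-assertions:
-- on_endpoint_modified(handle, ready_event)
-- assert(dict:get("default/demo") ~= "{}")
-- on_endpoint_modified(handle, not_ready_event)
-- assert(dict:get("default/demo") ~= "{}")  -- fails on master, passes with the patch
```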
End-to-end verification on apisix 3.10.0
The same logic was back-ported to apisix 3.10.0 (which still has the
single-file `discovery/kubernetes/init.lua`) and exercised against the
`apache/apisix:3.10.0-debian` image, using a serverless plugin to
synthesize `on_endpoint_modified` events:

| synthesized event             | without patch | with patch |
| ----------------------------- | ------------- | ---------- |
| `subsets[0].addresses = [ip]` | 502           | 502        |
| `notReadyAddresses = [ip]`    | 503           | 502        |
| `subsets = []`                | 503           | 502        |

(502 here means `set_upstream` succeeded and the request reached the
load-balancer phase; the fake upstream is intentionally unreachable so
the test isolates the discovery layer from real network success. The
relevant assertion is "no longer 503".)
When the guard fires, the warn log is emitted.
Compatibility
No public API change, no schema change, no metric / log format change other
than the two new warn lines. The dict layout is identical. No operator could
have silently relied on "empty endpoint events clear the dict": that behavior
was a bug, and its only observable consequence was 503s.
Checklist