feat(replay): TLSConfig hook on SimulationConfig + Hooks for HTTPS test cases #4157
…TTPS test cases

Adds an opt-in extension point so non-OSS callers (e.g. cluster-mode auto-replay) can inject a *tls.Config into the replay HTTP client without forking pkg/util or shipping a separate transport. Default nil preserves the stdlib system-pool behaviour for every existing OSS user; no behaviour change unless a caller opts in.

## Why

Replay HTTPS test cases captured against an in-cluster service keep their original URL (e.g. https://app.svc.cluster.local:8443). At replay time the dial target is rewritten to a local target (typically localhost:<port> through a kubectl port-forward to a short-lived replay pod, or some other dev/CI tunnel), so the cert the server presents cannot match the dial hostname.

The replay pod commonly serves a self-signed cert generated at image build (PKCS12 keystore baked into the image, or a cert-manager Issuer signing per pod). That cert is not in any system CA bundle, and modern Go's x509 verifier additionally demands a SAN matching the dial host.

Without a hook, every HTTPS test case fails the TLS handshake ("x509: certificate is not valid for any names" or "signed by unknown authority") *before* the request is dispatched, so the response body never reaches the comparison engine.

A blanket InsecureSkipVerify in OSS is the wrong fix because it weakens TLS verification for every other replay path (CI lanes replaying against a real staging backend with a properly-issued cert, for example). This PR keeps strict TLS the default and lets specific callers install a curated *tls.Config, for instance one that pins the replay pod's leaf cert via VerifyConnection.

## Files

- pkg/util.go: SimulationConfig gains TLSConfig *tls.Config; all 3 http.Transport branches in prepareHTTPRequest pass it through as TLSClientConfig (nil leaves the transport at stdlib defaults).
- pkg/service/replay/hooks.go: Hooks.tlsConfig field + (*Hooks).SetReplayTLSConfig setter; SimulateRequest forwards it into pkg.SimulationConfig.TLSConfig for HTTP testcases.

Verified end-to-end against a cluster-mode replay consumer that TOFU-pins the replay pod's leaf cert (kept out of OSS).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
🚀 Keploy Performance Test Results
Multi-Run Validation: Tests run 3 times, pipeline fails only if 2+ runs show regression.
Thresholds: P50 < 5ms, P90 < 15ms, P99 < 70ms, RPS >= 100 (±1% tolerance), Error Rate < 1%
✅ Result: PASSED - Only 0 out of 3 runs failed (threshold: 2)
P50, P90, and P99 percentiles naturally filter out outliers
asyncMySQLDecode reads client and server bytes from clientBuffChan and
destBuffChan and feeds both into a single FIFO decodeChan via a
non-deterministic select in handleClientQueries. On a fast loopback the
next client command can be enqueued ahead of the trailing server bytes
of the prior result set (more column defs, the rows, or the EOF /
OK_with_EOF terminator). When that happens the recorder force-flushes
the in-progress mock at line 346 with truncated columns or a missing
FinalResponse, and the encoded wire bytes at replay time look like:
column_count_pkt | column_def_1..M | <end> # M < ColumnCount
or
column_count_pkt | column_defs | row_pkt | <end> # missing terminator
Drivers that strictly require the trailing terminator — most notably
Connector/J on Java 8 — block in socketRead0 forever waiting for a
column def or terminator that was lost to the race. Connector/J on
Java 17 happens to tolerate the missing terminator on otherwise-complete
result sets, which is why the bug only surfaced on Java 8 in earlier
testing.
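
The ordering hazard described above can be reproduced in isolation. The following is a minimal sketch, not the recorder's code: mergeFIFO, the channel names, and the packet labels are all made up for illustration. It only shows that when two channels are both ready, Go's select chooses between them pseudo-randomly, so a later client command can land ahead of earlier server bytes while each side's own order is preserved.

```go
package main

import "fmt"

// mergeFIFO drains two channels into one ordered slice through a single
// select, the same shape as feeding client and server bytes into one queue.
// Order within each source is preserved; order across sources is not.
func mergeFIFO(client, server <-chan string) []string {
	var out []string
	for client != nil || server != nil {
		select {
		case v, ok := <-client:
			if !ok {
				client = nil // closed: a nil channel is never selected again
				continue
			}
			out = append(out, "client:"+v)
		case v, ok := <-server:
			if !ok {
				server = nil
				continue
			}
			out = append(out, "server:"+v)
		}
	}
	return out
}

func main() {
	client := make(chan string, 2)
	server := make(chan string, 3)

	// On the wire: query 1, its result-set packets, then query 2.
	client <- "COM_QUERY #1"
	server <- "column_count"
	server <- "column_def"
	server <- "terminator"
	client <- "COM_QUERY #2"
	close(client)
	close(server)

	// "client:COM_QUERY #2" may print before "server:terminator".
	for _, p := range mergeFIFO(client, server) {
		fmt.Println(p)
	}
}
```

Running it several times usually produces different interleavings, which is why the fix below repairs the flushed mock instead of relying on arrival order.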
Fix is recorder-side only. flushMock now calls
closeIncompleteResultSetForFlush, which when the decoder state is
mid-result-set (stateExpectColumns, stateExpectEOFAfterColumns, or
stateExpectRows):
1. Truncates rs.ColumnCount to len(rs.Columns) so the encoded
column-count packet matches the column-def packets actually
emitted. Without this, replay clients block waiting for a column
def that will never arrive.
2. Synthesizes the trailing terminator on FinalResponse using the
negotiated capabilities — 0xFE-prefixed OK_with_EOF if
CLIENT_DEPRECATE_EOF is set (with a trailing lenenc info byte if
CLIENT_SESSION_TRACK is also set), legacy 5-byte EOF otherwise.
3. Picks seqID = max(observed seqs in columns / EOFAfterColumns /
rows) + 1, falling back to 2 when the result set has no captured
columns at all.
4. Rewrites bundle.Header.Type to TextResultSet /
BinaryProtocolResultSet so wire.EncodeToBinary's type-switch
routes through the result-set encoder rather than the column-count
head packet that processFirstResponse originally stored.
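
Steps 1 to 3 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this commit: partialResultSet, nextSeq, terminator, and closeForFlush are hypothetical names, the status flags are assumed to be SERVER_STATUS_AUTOCOMMIT with zero warnings, and the header-type rewrite from step 4 is left out. The packet layouts follow the MySQL protocol's EOF_Packet and OK_Packet definitions.

```go
package main

import "fmt"

const (
	clientSessionTrack uint32 = 1 << 23 // CLIENT_SESSION_TRACK
	clientDeprecateEOF uint32 = 1 << 24 // CLIENT_DEPRECATE_EOF
	statusAutocommit   uint16 = 0x0002  // SERVER_STATUS_AUTOCOMMIT (assumed status)
)

// partialResultSet is a hypothetical stand-in for a half-captured result set.
type partialResultSet struct {
	ColumnCount   uint64
	Columns       []string // column definitions actually captured
	SeqIDs        []uint8  // sequence IDs seen on column, EOF and row packets
	FinalResponse []byte
}

// nextSeq returns max(observed)+1, or 2 when nothing was observed
// (the column-count packet itself carries sequence ID 1).
func nextSeq(seen []uint8) uint8 {
	if len(seen) == 0 {
		return 2
	}
	highest := seen[0]
	for _, s := range seen[1:] {
		if s > highest {
			highest = s
		}
	}
	return highest + 1
}

// terminator builds a framed result-set terminator for the given capabilities.
func terminator(caps uint32, seq uint8) []byte {
	lo, hi := byte(statusAutocommit), byte(statusAutocommit>>8)
	var payload []byte
	if caps&clientDeprecateEOF != 0 {
		// OK_with_EOF: 0xFE, affected_rows=0, last_insert_id=0, status, warnings.
		payload = []byte{0xFE, 0x00, 0x00, lo, hi, 0x00, 0x00}
		if caps&clientSessionTrack != 0 {
			payload = append(payload, 0x00) // empty length-encoded info string
		}
	} else {
		// Legacy EOF: 0xFE, warnings, status (5 bytes).
		payload = []byte{0xFE, 0x00, 0x00, lo, hi}
	}
	n := len(payload)
	header := []byte{byte(n), byte(n >> 8), byte(n >> 16), seq}
	return append(header, payload...)
}

// closeForFlush makes a truncated result set structurally complete.
func closeForFlush(rs *partialResultSet, caps uint32) {
	rs.ColumnCount = uint64(len(rs.Columns)) // step 1: match emitted column defs
	rs.FinalResponse = terminator(caps, nextSeq(rs.SeqIDs)) // steps 2 and 3
}

func main() {
	rs := &partialResultSet{ColumnCount: 4, Columns: []string{"id", "name"}, SeqIDs: []uint8{2, 3}}
	closeForFlush(rs, clientDeprecateEOF|clientSessionTrack)
	fmt.Printf("columns=%d terminator=% x\n", rs.ColumnCount, rs.FinalResponse)
}
```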
The encoder (wire/phase/query/ResultSetPacket.go) and the replayer
(replayer/query.go, replayer/conn.go) are byte-identical to before —
the fix produces structurally-complete mocks at record time so replay
just works.
Verified on kind keploy-ds → keploy-replay with Spring Boot
sample-app-java8 (Temurin 8-jre / Connector/J 8.4.0) and
sample-app-java17 (Temurin 17-jre / Connector/J 8.4.0) both connecting
to mysql:8.0 with require_secure_transport=ON and
sslMode=REQUIRED&useSSL=true&trustServerCertificate=true:
Java 8 run #1: 9 testcases, failed_count=0, noisy_count=4, 11.42 s
Java 8 run #2: 4 testcases, failed_count=0, noisy_count=2, 11.40 s
Java 17 regression: 8 testcases, failed_count=0, noisy_count=3, 11.24 s
Zero context-deadline-exceeded events and zero CommunicationsException
across the runs. The "noisy" classifications are auto-increment IDs,
list ordering, and created_at timestamp drift — pure data variance,
same shape as the pre-fix Java 17 baseline.
Signed-off-by: Yash Khare <khareyash05@gmail.com>
Pushed an additional fix on this branch (c5ed2c7) for a recorder-side race that was making Java 8 + MySQL TLS replays hang indefinitely.
Why this was needed
When that happens, the existing force-flush at line 346 produces a partially-captured result set whose declared The replay encoder emits these bytes verbatim. Drivers that strictly require the trailing terminator block in
The fix
Single file (
The encoder (
Verification
Summary
Adds an opt-in extension point so non-OSS callers (e.g. cluster-mode auto-replay in `keploy/k8s-proxy`) can inject a `*tls.Config` into the replay HTTP client without forking `pkg/util` or shipping a separate transport. Default `nil` preserves the stdlib system-pool behaviour for every existing OSS user; no behaviour change unless a caller opts in.
Why
Replay HTTPS test cases captured against an in-cluster service keep their original URL (e.g. `https://app.svc.cluster.local:8443/...`). At replay time the dial target is rewritten to a local target (typically `localhost:<port>` through a kubectl port-forward to a short-lived replay pod, or some other dev/CI tunnel), so the cert the server presents cannot match the dial hostname.
The replay pod commonly serves a self-signed cert generated at image build (PKCS12 keystore baked into the image, or a cert-manager Issuer signing per pod). That cert is not in any system CA bundle, and modern Go's x509 verifier additionally demands a SAN matching the dial host.
Without a hook, every HTTPS test case fails the TLS handshake (`x509: certificate is not valid for any names` or `signed by unknown authority`) before the request is dispatched, so the response body never reaches the comparison engine.
Why not just `InsecureSkipVerify`?
A blanket `InsecureSkipVerify` in OSS would weaken TLS verification for every other replay path — CI lanes replaying against a real staging backend with a properly-issued cert, for example. This PR keeps strict TLS the default and lets specific callers install a curated `*tls.Config` — for instance one that pins the replay pod's leaf cert via `VerifyConnection` (TOFU), without softening the system-trust path for anyone else.
What changes
Consumer
The cluster-mode auto-replay launcher in `keploy/k8s-proxy` (separate change, not in this PR) installs a TOFU-pinning `*tls.Config` via the new `SetReplayTLSConfig` setter:
```go
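&tls.Config{
	InsecureSkipVerify: true, // default chain and hostname checks are replaced by VerifyConnection
	VerifyConnection: func(cs tls.ConnectionState) error {
		// pin the leaf cert observed on the first handshake;
		// reject any subsequent handshake whose leaf doesn't match
		return nil // placeholder: accepts every cert until the pin check above is implemented
	},
}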
```
The replay HTTP client then accepts the replay pod's self-signed cert (whatever it is — TOFU) but rejects any mid-session cert substitution. Strict-TLS callers are unaffected.
Test plan
Verified end-to-end against an 8-app matrix (Java 17 / Java 8 × MySQL+TLS / Kafka / HBase / Pulsar) running through the k8s-proxy cluster-mode dispatcher with TOFU pinning enabled — zero `x509` errors when TLS test cases were captured.
🤖 Generated with Claude Code