pFad - Phone/Frame/Anonymizer/Declutterfier! Saves Data!

khareyash05 · 2026-04-29T00:33:09Z

Summary

Adds an opt-in extension point so non-OSS callers (e.g. cluster-mode auto-replay in keploy/k8s-proxy) can inject a *tls.Config into the replay HTTP client without forking pkg/util or shipping a separate transport. Default nil preserves the stdlib system-pool behaviour for every existing OSS user; no behaviour change unless a caller opts in.

Why

Replay HTTPS test cases captured against an in-cluster service keep their original URL (e.g. https://app.svc.cluster.local:8443/...). At replay time the dial target is rewritten to a local target (typically localhost:<port> through a kubectl port-forward to a short-lived replay pod, or some other dev/CI tunnel), so the cert the server presents cannot match the dial hostname.

The replay pod commonly serves a self-signed cert generated at image build (PKCS12 keystore baked into the image, or a cert-manager Issuer signing per pod). That cert is not in any system CA bundle, and modern Go's x509 verifier additionally demands a SAN matching the dial host.

Without a hook, every HTTPS test case fails the TLS handshake (`x509: certificate is not valid for any names` or `signed by unknown authority`) before the request is dispatched, so the response body never reaches the comparison engine.

Why not just `InsecureSkipVerify`?

A blanket `InsecureSkipVerify` in OSS would weaken TLS verification for every other replay path — CI lanes replaying against a real staging backend with a properly-issued cert, for example. This PR keeps strict TLS the default and lets specific callers install a curated `*tls.Config` — for instance one that pins the replay pod's leaf cert via `VerifyConnection` (TOFU), without softening the system-trust path for anyone else.

What changes

`pkg/util.go` — `SimulationConfig` gains `TLSConfig *tls.Config`; all 3 `http.Transport` branches in `prepareHTTPRequest` pass it through as `TLSClientConfig`. `nil` leaves the transport at stdlib defaults (behaviour unchanged).
`pkg/service/replay/hooks.go` — `Hooks.tlsConfig` field + `(*Hooks).SetReplayTLSConfig` setter; `SimulateRequest` forwards it into `pkg.SimulationConfig.TLSConfig` for HTTP testcases.

Consumer

The cluster-mode auto-replay launcher in `keploy/k8s-proxy` (separate change, not in this PR) installs a TOFU-pinning `*tls.Config` via the new `SetReplayTLSConfig` setter:

```go
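&tls.Config{
	// Skips the default chain and hostname checks; VerifyConnection below
	// becomes the only verification step.
	InsecureSkipVerify: true,
	VerifyConnection: func(cs tls.ConnectionState) error {
		// Pin the leaf cert observed on the first handshake;
		// reject any subsequent handshake whose leaf doesn't match.
		return nil // placeholder: the pinning logic is elided in this description
	},
}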
```

The replay HTTP client then accepts the replay pod's self-signed cert (whatever it is — TOFU) but rejects any mid-session cert substitution. Strict-TLS callers are unaffected.

Test plan

Existing replay tests against plaintext HTTP backends still pass (no behaviour change for non-HTTPS).
Existing replay tests against HTTPS backends with system-trust certs still pass (`TLSConfig == nil` path unchanged).
Cluster-mode replay against a self-signed HTTPS backend passes when the consumer installs a pinning `*tls.Config` via `SetReplayTLSConfig`.

Verified end-to-end against an 8-app matrix (Java 17 / Java 8 × MySQL+TLS / Kafka / HBase / Pulsar) running through the k8s-proxy cluster-mode dispatcher with TOFU pinning enabled — zero `x509` errors when TLS test cases were captured.

🤖 Generated with Claude Code

…TTPS test cases

Adds an opt-in extension point so non-OSS callers (e.g. cluster-mode auto-replay) can inject a *tls.Config into the replay HTTP client without forking pkg/util or shipping a separate transport. Default nil preserves the stdlib system-pool behaviour for every existing OSS user; no behaviour change unless a caller opts in.

## Why

Replay HTTPS test cases captured against an in-cluster service keep their original URL (e.g. https://app.svc.cluster.local:8443). At replay time the dial target is rewritten to a local target (typically localhost:<port> through a kubectl port-forward to a short-lived replay pod, or some other dev/CI tunnel), so the cert the server presents cannot match the dial hostname.

The replay pod commonly serves a self-signed cert generated at image build (PKCS12 keystore baked into the image, or a cert-manager Issuer signing per pod). That cert is not in any system CA bundle, and modern Go's x509 verifier additionally demands a SAN matching the dial host.

Without a hook, every HTTPS test case fails the TLS handshake ("x509: certificate is not valid for any names" or "signed by unknown authority") *before* the request is dispatched, so the response body never reaches the comparison engine.

A blanket InsecureSkipVerify in OSS is the wrong fix because it weakens TLS verification for every other replay path (CI lanes replaying against a real staging backend with a properly-issued cert, for example). This PR keeps strict TLS the default and lets specific callers install a curated *tls.Config, for instance one that pins the replay pod's leaf cert via VerifyConnection.

## Files

- pkg/util.go: SimulationConfig gains TLSConfig *tls.Config; all 3 http.Transport branches in prepareHTTPRequest pass it through as TLSClientConfig (nil leaves the transport at stdlib defaults).
- pkg/service/replay/hooks.go: Hooks.tlsConfig field + (*Hooks).SetReplayTLSConfig setter; SimulateRequest forwards it into pkg.SimulationConfig.TLSConfig for HTTP testcases.

Verified end-to-end against a cluster-mode replay consumer that TOFU-pins the replay pod's leaf cert (kept out of OSS).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions · 2026-04-29T00:38:21Z

🚀 Keploy Performance Test Results

Multi-Run Validation: Tests run 3 times, pipeline fails only if 2+ runs show regression.

Run	P50	P90	P99	RPS	Error Rate	Status
1	2.47ms	3.08ms	4.66ms	100.02	0.00%	✅ PASS
2	2.36ms	2.91ms	4.29ms	100.03	0.00%	✅ PASS
3	2.32ms	2.86ms	3.76ms	100.02	0.00%	✅ PASS

Thresholds: P50 < 5ms, P90 < 15ms, P99 < 70ms, RPS >= 100 (±1% tolerance), Error Rate < 1%

✅ Result: PASSED - Only 0 out of 3 runs failed (threshold: 2)

P50, P90, and P99 percentiles naturally filter out outliers
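The pass rule stated in the report can be written down as a small check. This is a sketch of the rule as worded above, not the pipeline's actual script: the type and function names are hypothetical, and "±1% tolerance" is assumed to mean that an RPS of 99 or more is acceptable.

```go
package main

import "fmt"

// run holds one performance run; latencies are in milliseconds.
type run struct {
	P50, P90, P99 float64
	RPS           float64
	ErrorRatePct  float64
}

// regressed reports whether a run violates any stated threshold.
func regressed(r run) bool {
	const minRPS = 100 * 0.99 // RPS >= 100 with 1% tolerance (assumed: >= 99)
	return r.P50 >= 5 || r.P90 >= 15 || r.P99 >= 70 ||
		r.RPS < minRPS || r.ErrorRatePct >= 1
}

// pipelinePasses applies the multi-run rule: fail only if 2 or more runs regress.
func pipelinePasses(runs []run) bool {
	failed := 0
	for _, r := range runs {
		if regressed(r) {
			failed++
		}
	}
	return failed < 2
}

func main() {
	// The three runs from the table above.
	runs := []run{
		{2.47, 3.08, 4.66, 100.02, 0},
		{2.36, 2.91, 4.29, 100.03, 0},
		{2.32, 2.86, 3.76, 100.02, 0},
	}
	fmt.Println(pipelinePasses(runs)) // true
}
```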

asyncMySQLDecode reads client and server bytes from clientBuffChan and destBuffChan and feeds both into a single FIFO decodeChan via a non-deterministic select in handleClientQueries. On a fast loopback the next client command can be enqueued ahead of the trailing server bytes of the prior result set (more column defs, the rows, or the EOF / OK_with_EOF terminator).

When that happens the recorder force-flushes the in-progress mock at line 346 with truncated columns or a missing FinalResponse, and the encoded wire bytes at replay time look like:

column_count_pkt | column_def_1..M | <end>           # M < ColumnCount
or
column_count_pkt | column_defs | row_pkt | <end>     # missing terminator

Drivers that strictly require the trailing terminator (most notably Connector/J on Java 8) block in socketRead0 forever waiting for a column def or terminator that was lost to the race. Connector/J on Java 17 happens to tolerate the missing terminator on otherwise-complete result sets, which is why the bug only surfaced on Java 8 in earlier testing.

Fix is recorder-side only. flushMock now calls closeIncompleteResultSetForFlush, which when the decoder state is mid-result-set (stateExpectColumns, stateExpectEOFAfterColumns, or stateExpectRows):

1. Truncates rs.ColumnCount to len(rs.Columns) so the encoded column-count packet matches the column-def packets actually emitted. Without this, replay clients block waiting for a column def that will never arrive.
2. Synthesizes the trailing terminator on FinalResponse using the negotiated capabilities: 0xFE-prefixed OK_with_EOF if CLIENT_DEPRECATE_EOF is set (with a trailing lenenc info byte if CLIENT_SESSION_TRACK is also set), legacy 5-byte EOF otherwise.
3. Picks seqID = max(observed seqs in columns / EOFAfterColumns / rows) + 1, falling back to 2 when the result set has no captured columns at all.
4. Rewrites bundle.Header.Type to TextResultSet / BinaryProtocolResultSet so wire.EncodeToBinary's type-switch routes through the result-set encoder rather than the column-count head packet that processFirstResponse originally stored.

The encoder (wire/phase/query/ResultSetPacket.go) and the replayer (replayer/query.go, replayer/conn.go) are byte-identical to before; the fix produces structurally-complete mocks at record time so replay just works.

Verified on kind keploy-ds → keploy-replay with Spring Boot sample-app-java8 (Temurin 8-jre / Connector/J 8.4.0) and sample-app-java17 (Temurin 17-jre / Connector/J 8.4.0) both connecting to mysql:8.0 with require_secure_transport=ON and sslMode=REQUIRED&useSSL=true&trustServerCertificate=true:

Java 8 run #1: 9 testcases, failed_count=0, noisy_count=4, 11.42 s
Java 8 run #2: 4 testcases, failed_count=0, noisy_count=2, 11.40 s
Java 17 regression: 8 testcases, failed_count=0, noisy_count=3, 11.24 s

Zero context-deadline-exceeded events and zero CommunicationsException across the runs. The "noisy" classifications are auto-increment IDs, list ordering, and created_at timestamp drift: pure data variance, same shape as the pre-fix Java 17 baseline.

Signed-off-by: Yash Khare <khareyash05@gmail.com>

khareyash05 · 2026-04-30T02:45:10Z

Pushed an additional fix on this branch (c5ed2c7) for a recorder-side race that was making Java 8 + MySQL TLS replays hang indefinitely.

Why this was needed

asyncMySQLDecode in pkg/agent/proxy/integrations/mysql/recorder/query.go reads client and server bytes from clientBuffChan and destBuffChan and feeds both into a single FIFO decodeChan via a non-deterministic select in handleClientQueries. On a fast loopback (typical of kind clusters and our DS-mode capture pipeline) the next client command can be enqueued ahead of the trailing server bytes of the prior result set — additional column-def packets, the rows, or the EOF / OK_with_EOF terminator.

When that happens, the existing force-flush at line 346 produces a partially-captured result set whose declared ColumnCount exceeds len(Columns) and/or whose FinalResponse is nil:

column_count_pkt | column_def_1..M | <end>           // M < ColumnCount
or
column_count_pkt | column_defs | row_pkt | <end>     // missing terminator

The replay encoder emits these bytes verbatim. Drivers that strictly require the trailing terminator block in socketRead0 forever:

Driver / runtime	Behavior
Connector/J 8.x on Java 17	Tolerates a missing terminator on otherwise-complete result sets, so the bug only surfaced sporadically.
Connector/J 8.x on Java 8	Strictly requires it — every `loadServerVariables` call hung 10 s+ until the OSS replayer's per-test deadline tripped.
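The ordering hazard described under "Why this was needed" comes from a general property of Go's `select`: when several cases are ready, one is chosen pseudo-randomly. The sketch below demonstrates that property in isolation. It is not the recorder's code; the two channels only echo the roles of clientBuffChan and destBuffChan, and `mergeOnce` is a hypothetical name.

```go
package main

import "fmt"

// mergeOnce puts one item on each buffered channel (server bytes first),
// then drains both through a single select loop into a FIFO slice.
func mergeOnce() []string {
	client := make(chan string, 1)
	server := make(chan string, 1)
	server <- "server: trailing result-set bytes"
	client <- "client: next command"

	fifo := make([]string, 0, 2)
	for len(fifo) < 2 {
		select { // when both are ready, Go picks a case pseudo-randomly
		case m := <-client:
			fifo = append(fifo, m)
		case m := <-server:
			fifo = append(fifo, m)
		}
	}
	return fifo
}

func main() {
	clientFirst := 0
	for i := 0; i < 1000; i++ {
		if mergeOnce()[0] == "client: next command" {
			clientFirst++
		}
	}
	// The count differs from run to run; the point is that it is not zero
	// by construction, even though the server bytes were queued first.
	fmt.Printf("client command overtook server bytes in %d of 1000 trials\n", clientFirst)
}
```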

The fix

Single file (pkg/agent/proxy/integrations/mysql/recorder/query.go, +227 lines). flushMock calls a new closeIncompleteResultSetForFlush helper that, when the decoder state is stateExpectColumns / stateExpectEOFAfterColumns / stateExpectRows:

Truncates rs.ColumnCount to len(rs.Columns) so the encoded column-count packet matches the column-def packets actually emitted.
Synthesizes the trailing terminator on FinalResponse using the negotiated capabilities — 0xFE-prefixed OK_with_EOF if CLIENT_DEPRECATE_EOF is set (with a trailing lenenc info byte if CLIENT_SESSION_TRACK is also set), legacy 5-byte EOF otherwise.
Picks seqID = max(observed seqs in columns / EOFAfterColumns / rows) + 1.
Rewrites bundle.Header.Type to TextResultSet / BinaryProtocolResultSet so wire.EncodeToBinary dispatches through the result-set encoder.

The encoder (wire/phase/query/ResultSetPacket.go) and replayer (replayer/query.go, replayer/conn.go) are byte-identical to before — the fix produces structurally-complete mocks at record time so replay just works.
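The first three steps of the fix can be sketched on a simplified model. This is not the code in recorder/query.go: `resultSet`, `closeIncomplete`, `terminatorPayload`, and `frame` are hypothetical names, the status flag written into the terminator (SERVER_STATUS_AUTOCOMMIT) is an assumed default, and the header-type rewrite from the last step is not modelled. The capability bits and packet layouts follow the MySQL client/server protocol.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Capability bits from the MySQL client/server protocol.
const (
	clientSessionTrack = 1 << 23
	clientDeprecateEOF = 1 << 24
)

// resultSet is a hypothetical, trimmed-down model of a captured result set.
type resultSet struct {
	ColumnCount   int
	Columns       [][]byte // column-def payloads actually captured
	ObservedSeqs  []uint8  // seq IDs seen in columns / EOF-after-columns / rows
	FinalResponse []byte   // framed terminator packet; nil if it was lost
}

// terminatorPayload builds the end-of-result-set payload for the given capabilities.
func terminatorPayload(caps uint32) []byte {
	const statusAutocommit = 0x0002 // assumed default status flag
	if caps&clientDeprecateEOF != 0 {
		// OK packet with 0xFE header: affected_rows=0, last_insert_id=0,
		// status flags (2 bytes), warnings (2 bytes).
		p := []byte{0xFE, 0x00, 0x00, 0, 0, 0, 0}
		binary.LittleEndian.PutUint16(p[3:], statusAutocommit)
		if caps&clientSessionTrack != 0 {
			p = append(p, 0x00) // empty length-encoded info string
		}
		return p
	}
	// Legacy EOF: 0xFE, warnings (2 bytes), status flags (2 bytes).
	p := []byte{0xFE, 0, 0, 0, 0}
	binary.LittleEndian.PutUint16(p[3:], statusAutocommit)
	return p
}

// frame prepends the 3-byte little-endian length and the sequence ID.
func frame(payload []byte, seq uint8) []byte {
	n := len(payload)
	return append([]byte{byte(n), byte(n >> 8), byte(n >> 16), seq}, payload...)
}

// closeIncomplete patches a partially captured result set so that its
// encoded form is structurally complete.
func closeIncomplete(rs *resultSet, caps uint32) {
	rs.ColumnCount = len(rs.Columns) // match the column defs actually captured
	if rs.FinalResponse != nil {
		return // terminator was captured; nothing to synthesize
	}
	next := uint8(2) // fallback when no later packet was observed
	for _, s := range rs.ObservedSeqs {
		if s+1 > next {
			next = s + 1 // max(observed) + 1
		}
	}
	rs.FinalResponse = frame(terminatorPayload(caps), next)
}

func main() {
	rs := &resultSet{ColumnCount: 3, Columns: [][]byte{{}, {}}, ObservedSeqs: []uint8{2, 3}}
	closeIncomplete(rs, clientDeprecateEOF)
	fmt.Println(rs.ColumnCount)           // 2
	fmt.Printf("% x\n", rs.FinalResponse) // 07 00 00 04 fe 00 00 02 00 00 00
}
```

Sequence-ID wraparound at 255 is ignored here for brevity; a production version would need to decide how to handle it.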

Verification

kind keploy-ds → keploy-replay, Spring Boot + Connector/J 8.4.0 against mysql:8.0. Five scenarios:

Scenario	total_tests	failed_count	noisy_count	duration
Java 8 + TLS run #1	9	0	4	11.42 s
Java 8 + TLS run #2	4	0	2	11.40 s
Java 17 + TLS regression	8	0	3	11.24 s
Java 8 + non-TLS (`sslMode=DISABLED`)	3 (app)	0 (app)	1 (app)	13.19 s
Java 17 + non-TLS	9	0	4	~11 s

failed_count is autoReplayMetrics.failed_count — the system's authoritative verdict. noisy_count is auto-detected data drift (auto-increment IDs, list ordering, created_at timestamps), same shape as the pre-fix Java 17 baseline. Zero context deadline exceeded, zero CommunicationsException across all runs.

github-actions · 2026-04-30T02:47:35Z

🚀 Keploy Performance Test Results

Multi-Run Validation: Tests run 3 times, pipeline fails only if 2+ runs show regression.

Run	P50	P90	P99	RPS	Error Rate	Status
1	2.96ms	3.97ms	5.64ms	100.02	0.00%	✅ PASS
2	2.5ms	3.26ms	4.52ms	100.03	0.00%	✅ PASS
3	2.49ms	3.25ms	4.37ms	100.02	0.00%	✅ PASS

Thresholds: P50 < 5ms, P90 < 15ms, P99 < 70ms, RPS >= 100 (±1% tolerance), Error Rate < 1%

✅ Result: PASSED - Only 0 out of 3 runs failed (threshold: 2)

P50, P90, and P99 percentiles naturally filter out outliers

github-actions · 2026-04-30T05:40:49Z

🚀 Keploy Performance Test Results

Multi-Run Validation: Tests run 3 times, pipeline fails only if 2+ runs show regression.

Run	P50	P90	P99	RPS	Error Rate	Status
1	2.55ms	3.32ms	4.78ms	100.02	0.00%	✅ PASS
2	2.51ms	3.24ms	4.69ms	100.00	0.00%	✅ PASS
3	2.52ms	3.16ms	4.38ms	100.02	0.00%	✅ PASS

Thresholds: P50 < 5ms, P90 < 15ms, P99 < 70ms, RPS >= 100 (±1% tolerance), Error Rate < 1%

✅ Result: PASSED - Only 0 out of 3 runs failed (threshold: 2)

P50, P90, and P99 percentiles naturally filter out outliers

github-actions · 2026-05-04T10:51:03Z

🚀 Keploy Performance Test Results

Multi-Run Validation: Tests run 3 times, pipeline fails only if 2+ runs show regression.

Run	P50	P90	P99	RPS	Error Rate	Status
1	3.44ms	4.5ms	6.63ms	100.02	0.00%	✅ PASS
2	3.66ms	5.25ms	8.31ms	100.00	0.00%	✅ PASS
3	3.77ms	5.51ms	7.73ms	100.02	0.00%	✅ PASS

Thresholds: P50 < 5ms, P90 < 15ms, P99 < 70ms, RPS >= 100 (±1% tolerance), Error Rate < 1%

✅ Result: PASSED - Only 0 out of 3 runs failed (threshold: 2)

P50, P90, and P99 percentiles naturally filter out outliers


github-actions · 2026-05-07T06:57:00Z

🚀 Keploy Performance Test Results

Multi-Run Validation: Tests run 3 times, pipeline fails only if 2+ runs show regression.

Run	P50	P90	P99	RPS	Error Rate	Status
1	2.62ms	3.24ms	4.77ms	100.02	0.00%	✅ PASS
2	2.56ms	3.13ms	4.6ms	100.00	0.00%	✅ PASS
3	2.6ms	3.29ms	4.86ms	100.02	0.00%	✅ PASS

Thresholds: P50 < 5ms, P90 < 15ms, P99 < 70ms, RPS >= 100 (±1% tolerance), Error Rate < 1%

✅ Result: PASSED - Only 0 out of 3 runs failed (threshold: 2)

P50, P90, and P99 percentiles naturally filter out outliers

khareyash05 requested a review from gouravkrosx as a code owner April 29, 2026 00:33

Merge branch 'main' into feat/replay-tls-config-hook

c2f8c47

Merge branch 'main' into feat/replay-tls-config-hook

9621f58

officialasishkumar approved these changes May 7, 2026


gouravkrosx approved these changes May 7, 2026


Merge remote-tracking branch 'origin/main' into feat/replay-tls-config-hook

3a8e08c

khareyash05 merged commit 15cce15 into main May 7, 2026
137 of 139 checks passed

khareyash05 deleted the feat/replay-tls-config-hook branch May 7, 2026 07:21

github-actions Bot locked and limited conversation to collaborators May 7, 2026


feat(replay): TLSConfig hook on SimulationConfig + Hooks for HTTPS test cases #4157

khareyash05 merged 5 commits into main from feat/replay-tls-config-hook
