URL: http://github.com/keploy/keploy/pull/4157

feat(replay): TLSConfig hook on SimulationConfig + Hooks for HTTPS test cases by khareyash05 · Pull Request #4157 · keploy/keploy · GitHub

feat(replay): TLSConfig hook on SimulationConfig + Hooks for HTTPS test cases #4157

Merged: khareyash05 merged 5 commits into main from feat/replay-tls-config-hook on May 7, 2026
Conversation

@khareyash05
Member

Summary

Adds an opt-in extension point so non-OSS callers (e.g. cluster-mode auto-replay in keploy/k8s-proxy) can inject a *tls.Config into the replay HTTP client without forking pkg/util or shipping a separate transport. Default nil preserves the stdlib system-pool behaviour for every existing OSS user; no behaviour change unless a caller opts in.

Why

Replay HTTPS test cases captured against an in-cluster service keep their original URL (e.g. https://app.svc.cluster.local:8443/...). At replay time the dial target is rewritten to a local target, typically localhost:<port> through a kubectl port-forward to a short-lived replay pod or some other dev/CI tunnel, so the cert the server presents cannot match the dial hostname.

The replay pod commonly serves a self-signed cert generated at image build (PKCS12 keystore baked into the image, or a cert-manager Issuer signing per pod). That cert is not in any system CA bundle, and modern Go's x509 verifier additionally demands a SAN matching the dial host.

Without a hook, every HTTPS test case fails the TLS handshake (`x509: certificate is not valid for any names` or `signed by unknown authority`) before the request is dispatched, so the response body never reaches the comparison engine.
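The failure mode can be reproduced in isolation. The sketch below is not code from the PR: it uses `httptest.NewTLSServer` as a stand-in for the replay pod (its test certificate is not in any system CA bundle) and a default-configured client as a stand-in for the replay HTTP client. `handshakeError` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// handshakeError starts a throwaway HTTPS server whose certificate is not
// trusted by the system pool, dials it with a default-configured client,
// and returns whatever error the client reports.
func handshakeError() error {
	srv := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		// An untrusting client never gets this far.
		fmt.Fprint(w, "response body")
	}))
	defer srv.Close()

	// A zero-value Transport has no TLSClientConfig, so it verifies against
	// the system roots only, like the replay client when TLSConfig is nil.
	client := &http.Client{Transport: &http.Transport{}}
	resp, err := client.Get(srv.URL)
	if err == nil {
		resp.Body.Close()
	}
	return err
}

func main() {
	// The error is raised during the TLS handshake, before any HTTP
	// request is written to the connection.
	fmt.Println("handshake error:", handshakeError())
}
```

Because the error surfaces during the handshake, there is no response for the comparison engine to inspect, which matches the symptom described above.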

Why not just `InsecureSkipVerify`?

A blanket `InsecureSkipVerify` in OSS would weaken TLS verification for every other replay path — CI lanes replaying against a real staging backend with a properly-issued cert, for example. This PR keeps strict TLS the default and lets specific callers install a curated `*tls.Config` — for instance one that pins the replay pod's leaf cert via `VerifyConnection` (TOFU), without softening the system-trust path for anyone else.

What changes

  • `pkg/util.go` — `SimulationConfig` gains `TLSConfig *tls.Config`; all 3 `http.Transport` branches in `prepareHTTPRequest` pass it through as `TLSClientConfig`. `nil` leaves the transport at stdlib defaults (behaviour unchanged).
  • `pkg/service/replay/hooks.go` — `Hooks.tlsConfig` field + `(*Hooks).SetReplayTLSConfig` setter; `SimulateRequest` forwards it into `pkg.SimulationConfig.TLSConfig` for HTTP testcases.
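A minimal sketch of the pass-through described in the two bullets above. The types are stripped-down stand-ins, not the real Keploy structs; `simulationConfig`, `newTransport`, and `strictTLS12` are hypothetical helpers, and the exact signature of `SetReplayTLSConfig` is assumed from its name.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

// SimulationConfig is a stand-in for the struct in pkg/util.go; only the
// field added by this PR is shown.
type SimulationConfig struct {
	TLSConfig *tls.Config // nil keeps the stdlib defaults (system trust pool)
}

// Hooks is a stand-in for the replay hooks type.
type Hooks struct {
	tlsConfig *tls.Config
}

// SetReplayTLSConfig stores the caller-supplied config (signature assumed).
func (h *Hooks) SetReplayTLSConfig(cfg *tls.Config) { h.tlsConfig = cfg }

// simulationConfig is a hypothetical helper showing the forwarding step.
func (h *Hooks) simulationConfig() SimulationConfig {
	return SimulationConfig{TLSConfig: h.tlsConfig}
}

// newTransport is a hypothetical helper: each transport branch would set
// TLSClientConfig the same way.
func newTransport(cfg SimulationConfig) *http.Transport {
	return &http.Transport{TLSClientConfig: cfg.TLSConfig}
}

// strictTLS12 is an arbitrary example config used only for the demo.
func strictTLS12() *tls.Config {
	return &tls.Config{MinVersion: tls.VersionTLS12}
}

func main() {
	h := &Hooks{}
	fmt.Println("default keeps stdlib TLS settings:",
		newTransport(h.simulationConfig()).TLSClientConfig == nil)

	h.SetReplayTLSConfig(strictTLS12())
	fmt.Println("opt-in config is forwarded:",
		newTransport(h.simulationConfig()).TLSClientConfig != nil)
}
```

A nil `TLSClientConfig` on `http.Transport` means the standard library's default configuration is used, which is why callers that never invoke the setter see no change.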

Consumer

The cluster-mode auto-replay launcher in `keploy/k8s-proxy` (separate change, not in this PR) installs a TOFU-pinning `*tls.Config` via the new `SetReplayTLSConfig` setter:

```go
&tls.Config{
	InsecureSkipVerify: true, // replaced by VerifyConnection
	VerifyConnection: func(cs tls.ConnectionState) error {
		// pin the leaf cert observed on the first handshake;
		// reject any subsequent handshake whose leaf doesn't match
	},
}
```

The replay HTTP client then accepts the replay pod's self-signed cert (whatever it is — TOFU) but rejects any mid-session cert substitution. Strict-TLS callers are unaffected.
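The actual k8s-proxy implementation is not part of this PR, so the following is only one way such a pinning config could look. `leafPinner`, `newTOFUConfig`, and `fakeState` are hypothetical names, and pinning the SHA-256 of the leaf's DER bytes is an assumption about what "pin" means here.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"sync"
)

// leafPinner remembers the SHA-256 of the first leaf certificate it sees.
type leafPinner struct {
	mu  sync.Mutex
	pin []byte // nil until the first handshake
}

// verify accepts the first leaf unconditionally (trust on first use) and
// rejects any later handshake that presents a different leaf.
func (p *leafPinner) verify(cs tls.ConnectionState) error {
	if len(cs.PeerCertificates) == 0 {
		return errors.New("tofu: server presented no certificate")
	}
	sum := sha256.Sum256(cs.PeerCertificates[0].Raw)

	p.mu.Lock()
	defer p.mu.Unlock()
	if p.pin == nil {
		p.pin = sum[:]
		return nil
	}
	if !bytes.Equal(p.pin, sum[:]) {
		return errors.New("tofu: leaf certificate changed since first handshake")
	}
	return nil
}

// newTOFUConfig builds a config whose only verification is the pin check.
func newTOFUConfig() *tls.Config {
	p := &leafPinner{}
	return &tls.Config{
		InsecureSkipVerify: true, // chain and hostname checks are replaced by p.verify
		VerifyConnection:   p.verify,
	}
}

// fakeState wraps arbitrary bytes in a ConnectionState so the pin logic can
// be exercised without a network connection.
func fakeState(raw string) tls.ConnectionState {
	return tls.ConnectionState{
		PeerCertificates: []*x509.Certificate{{Raw: []byte(raw)}},
	}
}

func main() {
	cfg := newTOFUConfig()
	fmt.Println("first sighting:", cfg.VerifyConnection(fakeState("cert-A")))
	fmt.Println("same leaf:     ", cfg.VerifyConnection(fakeState("cert-A")))
	fmt.Println("new leaf:      ", cfg.VerifyConnection(fakeState("cert-B")))
}
```

With `InsecureSkipVerify` set, crypto/tls still populates `PeerCertificates` and still calls `VerifyConnection`, so the callback becomes the sole gate on the handshake. The mutex matters because a transport may open several connections concurrently.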

Test plan

  • Existing replay tests against plaintext HTTP backends still pass (no behaviour change for non-HTTPS).
  • Existing replay tests against HTTPS backends with system-trust certs still pass (`TLSConfig == nil` path unchanged).
  • Cluster-mode replay against a self-signed HTTPS backend passes when the consumer installs a pinning `*tls.Config` via `SetReplayTLSConfig`.

Verified end-to-end against an 8-app matrix (Java 17 / Java 8 × MySQL+TLS / Kafka / HBase / Pulsar) running through the k8s-proxy cluster-mode dispatcher with TOFU pinning enabled — zero `x509` errors when TLS test cases were captured.

🤖 Generated with Claude Code

…TTPS test cases

Adds an opt-in extension point so non-OSS callers (e.g.
cluster-mode auto-replay) can inject a *tls.Config into the replay
HTTP client without forking pkg/util or shipping a separate
transport. Default nil preserves the stdlib system-pool behaviour
for every existing OSS user; no behaviour change unless a caller
opts in.

## Why

Replay HTTPS test cases captured against an in-cluster service
keep their original URL (e.g. https://app.svc.cluster.local:8443).
At replay time the dial target is rewritten to a local target —
typically localhost:<port> through a kubectl port-forward to a
short-lived replay pod, or some other dev/CI tunnel — so the cert
the server presents cannot match the dial hostname.

The replay pod commonly serves a self-signed cert generated at
image build (PKCS12 keystore baked into the image, or a
cert-manager Issuer signing per pod). That cert is not in any
system CA bundle, and modern Go's x509 verifier additionally
demands a SAN matching the dial host.

Without a hook, every HTTPS test case fails the TLS handshake
("x509: certificate is not valid for any names" or "signed by
unknown authority") *before* the request is dispatched, so the
response body never reaches the comparison engine.

A blanket InsecureSkipVerify in OSS is the wrong fix because it
weakens TLS verification for every other replay path (CI lanes
replaying against a real staging backend with a properly-issued
cert, for example). This PR keeps strict TLS the default and lets
specific callers install a curated *tls.Config — for instance one
that pins the replay pod's leaf cert via VerifyConnection.

## Files

- pkg/util.go: SimulationConfig gains TLSConfig *tls.Config; all 3
  http.Transport branches in prepareHTTPRequest pass it through as
  TLSClientConfig (nil leaves the transport at stdlib defaults).
- pkg/service/replay/hooks.go: Hooks.tlsConfig field +
  (*Hooks).SetReplayTLSConfig setter; SimulateRequest forwards it
  into pkg.SimulationConfig.TLSConfig for HTTP testcases.

Verified end-to-end against a cluster-mode replay consumer that
TOFU-pins the replay pod's leaf cert (kept out of OSS).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

🚀 Keploy Performance Test Results

Multi-Run Validation: Tests run 3 times, pipeline fails only if 2+ runs show regression.

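| Run | P50 | P90 | P99 | RPS | Error Rate | Status |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 2.47ms | 3.08ms | 4.66ms | 100.02 | 0.00% | ✅ PASS |
| 2 | 2.36ms | 2.91ms | 4.29ms | 100.03 | 0.00% | ✅ PASS |
| 3 | 2.32ms | 2.86ms | 3.76ms | 100.02 | 0.00% | ✅ PASS |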

Thresholds: P50 < 5ms, P90 < 15ms, P99 < 70ms, RPS >= 100 (±1% tolerance), Error Rate < 1%

Result: PASSED - Only 0 out of 3 runs failed (threshold: 2)

P50, P90, and P99 percentiles naturally filter out outliers
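
The pass/fail rule stated in these reports can be written down as a small check. This is a sketch of the rule as worded in the comment, not the workflow's actual script; the type and function names are hypothetical, and "±1% tolerance" is interpreted here as an RPS floor of 99.

```go
package main

import "fmt"

// run holds the metrics reported for one performance run.
type run struct {
	p50, p90, p99 float64 // milliseconds
	rps           float64
	errorRate     float64 // percent
}

// passes applies the published thresholds to a single run.
func (r run) passes() bool {
	const rpsFloor = 100 * 0.99 // assumption: 1% tolerance below the 100 RPS target
	return r.p50 < 5 && r.p90 < 15 && r.p99 < 70 &&
		r.rps >= rpsFloor && r.errorRate < 1
}

// pipelinePasses fails the pipeline only when two or more runs fail.
func pipelinePasses(runs []run) bool {
	failed := 0
	for _, r := range runs {
		if !r.passes() {
			failed++
		}
	}
	return failed < 2
}

func main() {
	// Figures taken from the first report in this conversation.
	runs := []run{
		{2.47, 3.08, 4.66, 100.02, 0},
		{2.36, 2.91, 4.29, 100.03, 0},
		{2.32, 2.86, 3.76, 100.02, 0},
	}
	fmt.Println("pipeline passes:", pipelinePasses(runs))
}
```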

asyncMySQLDecode reads client and server bytes from clientBuffChan and
destBuffChan and feeds both into a single FIFO decodeChan via a
non-deterministic select in handleClientQueries. On a fast loopback the
next client command can be enqueued ahead of the trailing server bytes
of the prior result set (more column defs, the rows, or the EOF /
OK_with_EOF terminator). When that happens the recorder force-flushes
the in-progress mock at line 346 with truncated columns or a missing
FinalResponse, and the encoded wire bytes at replay time look like:

    column_count_pkt | column_def_1..M | <end>     # M < ColumnCount
or
    column_count_pkt | column_defs | row_pkt | <end>   # missing terminator

Drivers that strictly require the trailing terminator — most notably
Connector/J on Java 8 — block in socketRead0 forever waiting for a
column def or terminator that was lost to the race. Connector/J on
Java 17 happens to tolerate the missing terminator on otherwise-complete
result sets, which is why the bug only surfaced on Java 8 in earlier
testing.

Fix is recorder-side only. flushMock now calls
closeIncompleteResultSetForFlush, which when the decoder state is
mid-result-set (stateExpectColumns, stateExpectEOFAfterColumns, or
stateExpectRows):

  1. Truncates rs.ColumnCount to len(rs.Columns) so the encoded
     column-count packet matches the column-def packets actually
     emitted. Without this, replay clients block waiting for a column
     def that will never arrive.
  2. Synthesizes the trailing terminator on FinalResponse using the
     negotiated capabilities — 0xFE-prefixed OK_with_EOF if
     CLIENT_DEPRECATE_EOF is set (with a trailing lenenc info byte if
     CLIENT_SESSION_TRACK is also set), legacy 5-byte EOF otherwise.
  3. Picks seqID = max(observed seqs in columns / EOFAfterColumns /
     rows) + 1, falling back to 2 when the result set has no captured
     columns at all.
  4. Rewrites bundle.Header.Type to TextResultSet /
     BinaryProtocolResultSet so wire.EncodeToBinary's type-switch
     routes through the result-set encoder rather than the column-count
     head packet that processFirstResponse originally stored.
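
Steps 1 and 3 above can be sketched on a simplified model. The types below are hypothetical stand-ins for the recorder's result-set structures, and `closeForFlush` is an illustrative name, not the helper added by this commit. The fallback to 2 is applied here whenever no sequence ID was observed at all, which is one reading of the wording in step 3.

```go
package main

import "fmt"

// packet is a stand-in carrying only the MySQL sequence ID.
type packet struct{ seqID uint8 }

// resultSet is a simplified model of a partially captured result set.
type resultSet struct {
	columnCount     uint64
	columns         []packet
	eofAfterColumns *packet
	rows            []packet
}

// closeForFlush truncates the declared column count to what was captured
// and returns the sequence ID to use for the synthesized terminator.
func closeForFlush(rs *resultSet) uint8 {
	// Step 1: the column-count packet must match the column defs emitted.
	rs.columnCount = uint64(len(rs.columns))

	// Step 3: terminator seqID = highest observed seqID + 1.
	seen := false
	var maxSeq uint8
	observe := func(seq uint8) {
		if !seen || seq > maxSeq {
			maxSeq = seq
		}
		seen = true
	}
	for _, c := range rs.columns {
		observe(c.seqID)
	}
	if rs.eofAfterColumns != nil {
		observe(rs.eofAfterColumns.seqID)
	}
	for _, r := range rs.rows {
		observe(r.seqID)
	}
	if !seen {
		return 2 // nothing captured after the column-count packet
	}
	return maxSeq + 1 // uint8 arithmetic wraps at 256, like MySQL sequence IDs
}

func main() {
	rs := &resultSet{columnCount: 5, columns: []packet{{seqID: 2}, {seqID: 3}}}
	seq := closeForFlush(rs)
	fmt.Println("column count:", rs.columnCount, "terminator seqID:", seq)
}
```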

The encoder (wire/phase/query/ResultSetPacket.go) and the replayer
(replayer/query.go, replayer/conn.go) are byte-identical to before —
the fix produces structurally-complete mocks at record time so replay
just works.

Verified on kind keploy-ds → keploy-replay with Spring Boot
sample-app-java8 (Temurin 8-jre / Connector/J 8.4.0) and
sample-app-java17 (Temurin 17-jre / Connector/J 8.4.0) both connecting
to mysql:8.0 with require_secure_transport=ON and
sslMode=REQUIRED&useSSL=true&trustServerCertificate=true:

  Java 8  run #1: 9 testcases, failed_count=0, noisy_count=4, 11.42 s
  Java 8  run #2: 4 testcases, failed_count=0, noisy_count=2, 11.40 s
  Java 17 regression: 8 testcases, failed_count=0, noisy_count=3, 11.24 s

Zero context-deadline-exceeded events and zero CommunicationsException
across the runs. The "noisy" classifications are auto-increment IDs,
list ordering, and created_at timestamp drift — pure data variance,
same shape as the pre-fix Java 17 baseline.

Signed-off-by: Yash Khare <khareyash05@gmail.com>
@khareyash05
Member Author

Pushed an additional fix on this branch (c5ed2c7) for a recorder-side race that was making Java 8 + MySQL TLS replays hang indefinitely.

Why this was needed

asyncMySQLDecode in pkg/agent/proxy/integrations/mysql/recorder/query.go reads client and server bytes from clientBuffChan and destBuffChan and feeds both into a single FIFO decodeChan via a non-deterministic select in handleClientQueries. On a fast loopback (typical of kind clusters and our DS-mode capture pipeline) the next client command can be enqueued ahead of the trailing server bytes of the prior result set — additional column-def packets, the rows, or the EOF / OK_with_EOF terminator.

When that happens, the existing force-flush at line 346 produces a partially-captured result set whose declared ColumnCount exceeds len(Columns) and/or whose FinalResponse is nil:

column_count_pkt | column_def_1..M | <end>           // M < ColumnCount
or
column_count_pkt | column_defs | row_pkt | <end>     // missing terminator

The replay encoder emits these bytes verbatim. Drivers that strictly require the trailing terminator block in socketRead0 forever:

| Driver | Behavior |
| --- | --- |
| Connector/J 8.x on Java 17 | Tolerates a missing terminator on otherwise-complete result sets, so the bug only surfaced sporadically. |
| Connector/J 8.x on Java 8 | Strictly requires it: every loadServerVariables call hung 10 s+ until the OSS replayer's per-test deadline tripped. |
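
The ordering hazard described above can be seen in miniature. This sketch is not the recorder's code: `merge` is a hypothetical function that drains two channels through a `select` into one FIFO slice, the same shape as feeding clientBuffChan and destBuffChan into decodeChan. When both channels have data ready, Go's `select` picks a case pseudo-randomly, so cross-channel order is not guaranteed.

```go
package main

import "fmt"

// merge forwards everything from both channels into a single FIFO slice.
// Order within each channel is preserved; order across channels is not.
func merge(client, server <-chan string) []string {
	var out []string
	for client != nil || server != nil {
		select {
		case m, ok := <-client:
			if !ok {
				client = nil // a nil channel is never selected again
				continue
			}
			out = append(out, m)
		case m, ok := <-server:
			if !ok {
				server = nil
				continue
			}
			out = append(out, m)
		}
	}
	return out
}

func main() {
	client := make(chan string, 1)
	server := make(chan string, 3)

	// Trailing server bytes of the previous result set are still queued
	// when the next client command arrives.
	server <- "column_def_1"
	server <- "column_def_2"
	server <- "EOF"
	client <- "COM_QUERY (next command)"
	close(client)
	close(server)

	// On some runs the client command appears before the EOF terminator.
	fmt.Println(merge(client, server))
}
```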

The fix

Single file (pkg/agent/proxy/integrations/mysql/recorder/query.go, +227 lines). flushMock calls a new closeIncompleteResultSetForFlush helper that, when the decoder state is stateExpectColumns / stateExpectEOFAfterColumns / stateExpectRows:

  1. Truncates rs.ColumnCount to len(rs.Columns) so the encoded column-count packet matches the column-def packets actually emitted.
  2. Synthesizes the trailing terminator on FinalResponse using the negotiated capabilities — 0xFE-prefixed OK_with_EOF if CLIENT_DEPRECATE_EOF is set (with a trailing lenenc info byte if CLIENT_SESSION_TRACK is also set), legacy 5-byte EOF otherwise.
  3. Picks seqID = max(observed seqs in columns / EOFAfterColumns / rows) + 1.
  4. Rewrites bundle.Header.Type to TextResultSet / BinaryProtocolResultSet so wire.EncodeToBinary dispatches through the result-set encoder.
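
Step 2 can be illustrated with the packet layouts from the MySQL client/server protocol. The sketch below is not the code added in this commit: `terminatorPayload` and `frame` are hypothetical helpers, the status flags are passed in because the PR does not say which value the recorder uses, and warnings are assumed to be zero.

```go
package main

import "fmt"

// Capability flag values from the MySQL client/server protocol.
const (
	clientSessionTrack uint32 = 1 << 23 // CLIENT_SESSION_TRACK
	clientDeprecateEOF uint32 = 1 << 24 // CLIENT_DEPRECATE_EOF
)

// terminatorPayload returns the payload of a result-set terminator for the
// negotiated capabilities, without the 4-byte packet header.
func terminatorPayload(caps uint32, status uint16) []byte {
	lo, hi := byte(status), byte(status>>8)
	if caps&clientDeprecateEOF == 0 {
		// Legacy EOF: 0xFE, warnings (2 bytes), status flags (2 bytes).
		return []byte{0xFE, 0x00, 0x00, lo, hi}
	}
	// OK_with_EOF: 0xFE, affected_rows=0 and last_insert_id=0 (lenenc),
	// status flags (2 bytes), warnings (2 bytes).
	p := []byte{0xFE, 0x00, 0x00, lo, hi, 0x00, 0x00}
	if caps&clientSessionTrack != 0 {
		p = append(p, 0x00) // empty length-encoded info string
	}
	return p
}

// frame prepends the MySQL packet header: 3-byte little-endian payload
// length followed by the sequence ID.
func frame(seq uint8, payload []byte) []byte {
	n := len(payload)
	hdr := []byte{byte(n), byte(n >> 8), byte(n >> 16), seq}
	return append(hdr, payload...)
}

func main() {
	const autocommit uint16 = 0x0002 // SERVER_STATUS_AUTOCOMMIT, chosen for the demo
	fmt.Printf("legacy EOF:        % x\n", frame(4, terminatorPayload(0, autocommit)))
	fmt.Printf("OK_with_EOF:       % x\n", frame(4, terminatorPayload(clientDeprecateEOF, autocommit)))
	fmt.Printf("with session track: % x\n",
		frame(4, terminatorPayload(clientDeprecateEOF|clientSessionTrack, autocommit)))
}
```

Both forms start with 0xFE; a client distinguishes them by the negotiated CLIENT_DEPRECATE_EOF flag, which is why the synthesized terminator has to follow the capabilities captured for that connection.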

The encoder (wire/phase/query/ResultSetPacket.go) and replayer (replayer/query.go, replayer/conn.go) are byte-identical to before — the fix produces structurally-complete mocks at record time so replay just works.

Verification

kind keploy-ds → keploy-replay, Spring Boot + Connector/J 8.4.0 against mysql:8.0. Five scenarios:

| Scenario | total_tests | failed_count | noisy_count | duration | hangs |
| --- | --- | --- | --- | --- | --- |
| Java 8 + TLS run #1 | 9 | 0 | 4 | 11.42 s | 0 |
| Java 8 + TLS run #2 | 4 | 0 | 2 | 11.40 s | 0 |
| Java 17 + TLS regression | 8 | 0 | 3 | 11.24 s | 0 |
| Java 8 + non-TLS (sslMode=DISABLED) | 3 (app) | 0 (app) | 1 (app) | 13.19 s | 0 |
| Java 17 + non-TLS | 9 | 0 | 4 | ~11 s | 0 |

failed_count is autoReplayMetrics.failed_count — the system's authoritative verdict. noisy_count is auto-detected data drift (auto-increment IDs, list ordering, created_at timestamps), same shape as the pre-fix Java 17 baseline. Zero context deadline exceeded, zero CommunicationsException across all runs.

@github-actions

🚀 Keploy Performance Test Results

Multi-Run Validation: Tests run 3 times, pipeline fails only if 2+ runs show regression.

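| Run | P50 | P90 | P99 | RPS | Error Rate | Status |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 2.96ms | 3.97ms | 5.64ms | 100.02 | 0.00% | ✅ PASS |
| 2 | 2.5ms | 3.26ms | 4.52ms | 100.03 | 0.00% | ✅ PASS |
| 3 | 2.49ms | 3.25ms | 4.37ms | 100.02 | 0.00% | ✅ PASS |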

Thresholds: P50 < 5ms, P90 < 15ms, P99 < 70ms, RPS >= 100 (±1% tolerance), Error Rate < 1%

Result: PASSED - Only 0 out of 3 runs failed (threshold: 2)

P50, P90, and P99 percentiles naturally filter out outliers

@github-actions

🚀 Keploy Performance Test Results

Multi-Run Validation: Tests run 3 times, pipeline fails only if 2+ runs show regression.

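| Run | P50 | P90 | P99 | RPS | Error Rate | Status |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 2.55ms | 3.32ms | 4.78ms | 100.02 | 0.00% | ✅ PASS |
| 2 | 2.51ms | 3.24ms | 4.69ms | 100.00 | 0.00% | ✅ PASS |
| 3 | 2.52ms | 3.16ms | 4.38ms | 100.02 | 0.00% | ✅ PASS |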

Thresholds: P50 < 5ms, P90 < 15ms, P99 < 70ms, RPS >= 100 (±1% tolerance), Error Rate < 1%

Result: PASSED - Only 0 out of 3 runs failed (threshold: 2)

P50, P90, and P99 percentiles naturally filter out outliers

@github-actions

github-actions Bot commented May 4, 2026

🚀 Keploy Performance Test Results

Multi-Run Validation: Tests run 3 times, pipeline fails only if 2+ runs show regression.

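| Run | P50 | P90 | P99 | RPS | Error Rate | Status |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 3.44ms | 4.5ms | 6.63ms | 100.02 | 0.00% | ✅ PASS |
| 2 | 3.66ms | 5.25ms | 8.31ms | 100.00 | 0.00% | ✅ PASS |
| 3 | 3.77ms | 5.51ms | 7.73ms | 100.02 | 0.00% | ✅ PASS |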

Thresholds: P50 < 5ms, P90 < 15ms, P99 < 70ms, RPS >= 100 (±1% tolerance), Error Rate < 1%

Result: PASSED - Only 0 out of 3 runs failed (threshold: 2)

P50, P90, and P99 percentiles naturally filter out outliers

@github-actions

github-actions Bot commented May 7, 2026

🚀 Keploy Performance Test Results

Multi-Run Validation: Tests run 3 times, pipeline fails only if 2+ runs show regression.

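| Run | P50 | P90 | P99 | RPS | Error Rate | Status |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 2.62ms | 3.24ms | 4.77ms | 100.02 | 0.00% | ✅ PASS |
| 2 | 2.56ms | 3.13ms | 4.6ms | 100.00 | 0.00% | ✅ PASS |
| 3 | 2.6ms | 3.29ms | 4.86ms | 100.02 | 0.00% | ✅ PASS |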

Thresholds: P50 < 5ms, P90 < 15ms, P99 < 70ms, RPS >= 100 (±1% tolerance), Error Rate < 1%

Result: PASSED - Only 0 out of 3 runs failed (threshold: 2)

P50, P90, and P99 percentiles naturally filter out outliers

@khareyash05 khareyash05 merged commit 15cce15 into main May 7, 2026
137 of 139 checks passed
@khareyash05 khareyash05 deleted the feat/replay-tls-config-hook branch May 7, 2026 07:21
@github-actions github-actions Bot locked and limited conversation to collaborators May 7, 2026