dcoppa commented Dec 5, 2025

Add a configurable idle timeout for HBONE connections between proxies and ztunnel, to address stale connection reuse when pod IPs are recycled.

This is particularly critical in environments with aggressive IP address reuse, such as AWS EKS with VPC CNI (default 30s cooldown period). Without an explicit idle timeout, Envoy defaults to 1 hour, causing proxies to reuse stale connections from connection pools when target pod IPs are recycled, resulting in 503 errors and upstream reset failures.

The new hbone_idle_timeout field in MeshConfig allows operators to configure the idle timeout appropriately for their environment. For AWS VPC CNI, a value of 15 seconds is recommended.
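To illustrate the reasoning above, here is a minimal sketch of why the idle timeout needs to be shorter than the IP cooldown. It is a simplified model, not Istio or Envoy code: the function and constant names are hypothetical, and the only inputs are the figures quoted in this description (1 hour Envoy default, 30s VPC CNI cooldown, 15s suggested value).

```python
from datetime import timedelta

# Figures quoted in the PR description; the names are made up for this sketch.
ENVOY_DEFAULT_IDLE_TIMEOUT = timedelta(hours=1)
VPC_CNI_IP_COOLDOWN = timedelta(seconds=30)
SUGGESTED_HBONE_IDLE_TIMEOUT = timedelta(seconds=15)


def stale_reuse_possible(idle_timeout: timedelta, ip_cooldown: timedelta) -> bool:
    """Simplified model of the failure mode.

    A pooled connection that sits idle is closed after `idle_timeout`, and a
    released pod IP can be handed to a new pod once `ip_cooldown` has elapsed.
    If the connection can stay pooled at least as long as the cooldown, it may
    still be open when the IP is recycled.
    """
    return idle_timeout >= ip_cooldown


for label, timeout in [
    ("Envoy default", ENVOY_DEFAULT_IDLE_TIMEOUT),
    ("suggested for AWS VPC CNI", SUGGESTED_HBONE_IDLE_TIMEOUT),
]:
    risk = stale_reuse_possible(timeout, VPC_CNI_IP_COOLDOWN)
    print(f"{label}: idle_timeout={timeout.total_seconds():.0f}s, stale reuse possible={risk}")
```

Under this model, any idle timeout below the cooldown closes the pooled connection before the IP can be reassigned, which is consistent with the 15-second suggestion for a 30-second cooldown.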

See: istio/istio#58389

dcoppa requested a review from a team as a code owner on December 5, 2025, 12:46
@istio-policy-bot

😊 Welcome @dcoppa! This is either your first contribution to the Istio api repo, or it's been
a while since you've been here.

You can learn more about the Istio working groups, Code of Conduct, and contribution guidelines
by referring to Contributing to Istio.

Thanks for contributing!

Courtesy of your friendly welcome wagon.

istio-testing added the size/XS and needs-ok-to-test labels on Dec 5, 2025
@istio-testing
Collaborator

Hi @dcoppa. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ilrudie
Contributor

ilrudie commented Dec 5, 2025

/ok-to-test

istio-testing added the ok-to-test label and removed the needs-ok-to-test label on Dec 5, 2025
istio-testing added the size/S label and removed the size/XS label on Dec 5, 2025

Contributor

keithmattix left a comment

Just one comment on docs and placement of the field.

/cc @howardjohn

@keithmattix
Contributor

GitHub is being weird for me. Here's my review comment:

Isn't this idle timeout just for Envoy? Ztunnel doesn't respect it, right? We should be clear about that in the comment (which gets turned into docs).

Also, should this kind of setting be part of proxy config, so that different Envoys can have different values?

@dcoppa
Author

dcoppa commented Dec 11, 2025

> GitHub is being weird for me. Here's my review comment:
>
> Isn't this idle timeout just for Envoy? Ztunnel doesn't respect it, right? We should be clear about that in the comment (which gets turned into docs).
>
> Also, should this kind of setting be part of proxy config, so that different Envoys can have different values?

I believe the current placement in MeshConfig is more appropriate because the underlying issue is infrastructure-wide: IP address recycling in the AWS VPC CNI affects all workloads equally, and the 30-second cooldown period is applied cluster-wide. As a result, there is no clear justification for giving different workloads distinct HBONE idle timeouts. This choice is also consistent with the existing connect_timeout, which already resides in MeshConfig and represents a similar connection-level timeout.

As for the documentation, I tried to make the comment clearer following your advice. Is it better now?
