Skip to content

feat(vm): replace time-based token refill with completion-based return#20618

Draft
guzalv wants to merge 2 commits into
masterfrom
ambient/session-72b20ecb-252a-4ed2-bece-66108fecf32d
Draft

feat(vm): replace time-based token refill with completion-based return#20618
guzalv wants to merge 2 commits into
masterfrom
ambient/session-72b20ecb-252a-4ed2-bece-66108fecf32d

Conversation

@guzalv
Copy link
Copy Markdown
Contributor

@guzalv guzalv commented May 15, 2026

Description

Replace the VM index report rate limiter's time-based token refill mechanism (golang.org/x/time/rate.Limiter) with a completion-based token return. Instead of refilling tokens at a fixed rate per second, tokens are returned to the bucket when processing completes. This turns the rate limiter into a concurrency limiter: the bucket capacity controls how many reports can be in-flight simultaneously, and throughput naturally tracks Scanner V4's actual processing speed.

  • TryConsume() decrements available tokens; Return() increments them back when work completes
  • Per-client fair allocation with automatic rebalancing on connect/disconnect
  • OnClientDisconnect reclaims all in-flight tokens for the disconnected client
  • All methods are nil-safe (no-op on nil *Limiter)
  • Removed golang.org/x/time/rate dependency

Files changed

File Change
pkg/rate/limiter.go Replace golang.org/x/time/rate.Limiter with clientBucket; add Return()
pkg/rate/metrics.go Replace PerClientRate gauge with InFlightTokens gauge
pkg/rate/limiter_test.go Rewrite: 19 tests for completion-based behavior
pkg/env/virtualmachine.go Update env var docs for concurrency semantics
central/sensor/service/connection/connection_impl.go Add Return() to rateLimiter interface
central/sensor/service/connection/sensorevents.go Wire Return() at processing completion and dedup drop
central/sensor/service/connection/connection_test.go Add noopRL helper

User-facing documentation

No user-facing behavior change. The ROX_VM_INDEX_REPORT_BUCKET_CAPACITY env var semantics shift from "burst size" to "max concurrent in-flight", but the configuration surface is unchanged.

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

VM feature is gated behind ROX_VIRTUAL_MACHINES feature flag. Rate limiter only activates when ROX_VM_INDEX_REPORT_RATE_LIMIT > 0.

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

19 unit tests in pkg/rate/limiter_test.go covering Return(), concurrency, rebalancing, disconnect, and nil safety. Comparison benchmarks in pkg/rate/benchmark_comparison_test.go.

How I validated my change

Unit tests:

go test -v ./pkg/rate/   # 19 tests pass

Comparison benchmarks (limiter-level, simulated processing):

go test -v -run TestComparisonReport -count=1 -timeout=300s ./pkg/rate/
go test -v -run TestComparisonTimeSeries -count=1 -timeout=120s ./pkg/rate/

E2E cluster test on ga-acp (OCP 4.21, 3 masters + 4 workers), ~21 minutes total wall clock, 6-minute monitoring window per variant. Load: 100 fake VMs with 500 real RHEL 9 packages each, reports every 10 seconds, Scanner V4 performing real vulnerability matching (404 CVEs per VM). Rate limiter: capacity=30, refill=0.3/sec.

Metric Time-Based (baseline) Completion-Based (this PR)
Reports enriched/min 19 182
Sustained throughput 0.3 rps 3.0 rps (10x)
Central memory (avg) 252 Mi 522 Mi
Central memory (peak) 263 Mi 544 Mi
Reports dropped/10s ~97 ~0
Matcher CPU (per pod) 65-143m 714-824m
Scanner V4 DB CPU 213m 3,159m

The time-based limiter's sustained throughput (0.3 rps) matches the token refill rate exactly -- Scanner V4's actual capacity (~3 rps) is completely wasted. The completion-based limiter processes at Scanner V4's real speed while keeping Central memory stable and bounded (500-544 Mi over 6 minutes, no upward trend).

Full E2E test report and step-by-step reproduction guide in commit 4ef3694 (remove before merging):

Guzman Alvarez and others added 2 commits May 14, 2026 14:40
The VM index report rate limiter previously refilled tokens at a fixed
rate over time, which could leave capacity unused when processing was
fast or overcommit when processing was slow.

Tokens are now returned explicitly when pipeline processing completes
(or when a message is dropped by deduplication), so available capacity
always tracks actual processing throughput. This eliminates both wasted
and overused capacity.

Key changes:
- Replace golang.org/x/time/rate per-client buckets with counter-based
  clientBucket tracking available/capacity
- Add Return(clientID, msg) to the Limiter and rateLimiter interface
- Wire Return into sensorEventHandler.handleMessages (after pipeline
  completes) and addMultiplexed (after dedup drop)
- Replace PerClientRate metric with InFlightTokens gauge
- Update env var documentation to reflect concurrency semantics

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add E2E test report documenting the cluster comparison between time-based
and completion-based rate limiters, reproduction guide, and benchmark
tests that demonstrate the throughput difference at the limiter level.

These files are intended for PR review only and should be removed before
merging.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 15, 2026

🚀 Build Images Ready

Images are ready for commit 4ef3694. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.11.x-978-g4ef369487d

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant