feat(backend/kernel): add use_kernel=True flag — route through the Rust kernel via PyO3#787
vikrantpuppala wants to merge 9 commits into
Conversation
Phase 2 of the PySQL × kernel integration plan (databricks-sql-kernel/docs/designs/pysql-kernel-integration.md). Wires `use_sea=True` to a new `backend/kernel/` module that delegates to the Rust kernel via the `databricks_sql_kernel` PyO3 extension (kernel PR #13).

New module: `src/databricks/sql/backend/kernel/`

- `client.py` — `KernelDatabricksClient(DatabricksClient)`. Lazy-imports `databricks_sql_kernel` so a connector install without the kernel wheel doesn't `ImportError` at startup; only `use_sea=True` surfaces the missing-extra message. Implements open/close_session, sync + async execute_command (async_op=True goes through `Statement.submit()` and stashes the handle in a dict keyed on `CommandId`), cancel/close_command, get_query_state, get_execution_result, and the metadata calls (catalogs / schemas / tables / columns) via `Session.metadata().list_*`. Real server-issued session and statement IDs flow through (no synthetic UUIDs).
- `auth_bridge.py` — translates the connector's `AuthProvider` into kernel `Session` kwargs. PAT (including federation-wrapped PAT — `get_python_sql_connector_auth_provider` always wraps the base in `TokenFederationProvider`, so a naive isinstance check never matches) routes through `auth_type="pat"`. Everything else routes through `auth_type="external"` with a callback that delegates to `auth_provider.add_headers({})`. (External is rejected by the kernel today at `build_auth_provider`; the separate kernel-side enablement PR will flip it on.)
- `result_set.py` — `KernelResultSet(ResultSet)`. Duck-typed over `databricks_sql_kernel.ExecutedStatement` (sync execute) and `ResultStream` (metadata + async await_result) since both expose `arrow_schema()` / `fetch_next_batch()` / `fetch_all_arrow()` / `close()`. Same FIFO batch buffer the prior ADBC POC used, so `fetchmany(n)` for n smaller than the kernel's natural batch size doesn't re-fetch.
- `type_mapping.py` — Arrow → PEP 249 description-string mapper. Lifted from the prior ADBC POC; centralised here so future kernel-result wrappers reuse the same mapping.

Kernel errors → PEP 249 exceptions: `KernelError.code` is mapped in a single table to `ProgrammingError` / `OperationalError` / `DatabaseError`. The structured fields (`sql_state`, `error_code`, `query_id`, …) are copied onto the re-raised exception so callers can branch on them without reaching through `__cause__`.

Routing: `Session._create_backend` flips the `use_sea=True` branch to instantiate `KernelDatabricksClient` instead of the native `SeaDatabricksClient`. The native `backend/sea/` module is left in place (no users on `use_sea=True` after this PR; its long-term fate is out of scope here).

Packaging: `[tool.poetry.extras] kernel = ["databricks-sql-kernel"]`. `pip install 'databricks-sql-connector[kernel]'` pulls in the kernel wheel; `use_sea=True` without the extra raises a pointed ImportError telling the user how to install it.

Known gaps (acknowledged, will be follow-ups):

- Parameter binding (`execute_command(parameters=[...])`) raises NotSupportedError — PyO3 `Statement.bind_param` lands in a follow-up.
- Statement-level `query_tags` raises NotSupportedError.
- `get_tables(table_types=[...])` returns unfiltered rows (the native SEA backend's filter is keyed on `SeaResultSet`; needs a small port to operate on `KernelResultSet`).
- External-auth end-to-end is blocked on the kernel-side `AuthConfig::External` enablement PR.
- Volume PUT/GET (staging operations): the kernel has no Volume API.

Test plan:

- Unit: 37 new tests across `tests/unit/test_kernel_auth_bridge.py` (auth provider → kwargs mapping, including federation-wrapped PAT and the External trampoline call-counter check), `tests/unit/test_kernel_type_mapping.py` (Arrow type mapping + description shape), and `tests/unit/test_kernel_result_set.py` (buffer semantics, fetchmany across batch boundaries, idempotent close, close() swallowing handle-close failures). All pass.
- Full unit suite: 600 pre-existing tests still pass; one pre-existing failure (`test_useragent_header` — agent detection adds `agent/claude-code` in this env) was already failing on main, unrelated to this change.
- Live e2e against dogfood with `use_sea=True`: SELECT 1, `range(10000)`, `fetchmany` pacing, `fetchall_arrow`, all four metadata calls (returned 75 catalogs / 144 schemas in main / 47 tables in `system.information_schema` / 15 columns), `session_configuration={'ANSI_MODE': 'false'}` round-trips, and bad SQL surfaces as DatabaseError with `code='SqlError'` and `sql_state='42P01'` on the exception. All checks pass.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <[email protected]>
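The single-table error mapping described above can be sketched roughly like this. The class bodies, the helper name `map_kernel_error`, and most of the code strings are illustrative stand-ins — only `"SqlError"` is confirmed by the PR's e2e notes:

```python
# Minimal sketch of a KernelError.code -> PEP 249 mapping table.
# Class names follow PEP 249; the code strings other than "SqlError"
# are assumptions, not the kernel's actual error codes.
class DatabaseError(Exception):
    """PEP 249 base for server-reported failures (stand-in)."""

class OperationalError(DatabaseError):
    """Transient / infrastructure failures (stand-in)."""

class ProgrammingError(DatabaseError):
    """SQL or usage errors (stand-in)."""

_CODE_TO_EXCEPTION = {
    "SqlError": ProgrammingError,
    "InvalidArgument": ProgrammingError,
    "Unavailable": OperationalError,
    "Internal": DatabaseError,
}

# Structured fields copied onto the re-raised exception so callers can
# branch on them without reaching through __cause__.
_STRUCTURED_FIELDS = ("code", "sql_state", "error_code", "query_id")

def map_kernel_error(exc):
    exc_class = _CODE_TO_EXCEPTION.get(getattr(exc, "code", None), DatabaseError)
    mapped = exc_class(str(exc))
    for field in _STRUCTURED_FIELDS:
        setattr(mapped, field, getattr(exc, field, None))
    return mapped
```

Unknown codes fall back to `DatabaseError`, so a new kernel code degrades gracefully instead of crashing the mapper.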
The earlier auth_bridge routed OAuth/MSAL/federation through the kernel's External token-provider trampoline (a Python callable the kernel invoked per HTTP request). Removing that for now.

Why: routing OAuth into the kernel inherently requires per-request token resolution to keep refresh working during a long-running session. There are two viable mechanisms (kernel-native OAuth, or the External callback); both have costs (duplicate OAuth flows vs GIL-per-request). Punting the decision until there's actual demand on use_sea=True.

Today: the bridge accepts PAT (including TokenFederationProvider-wrapped PAT, which is how `get_python_sql_connector_auth_provider` always shapes it). Any non-PAT auth_provider raises a clear NotSupportedError pointing the user at use_sea=False (Thrift).

This shrinks the auth_bridge to ~50 lines and means the kernel-side External enablement PR is no longer on the connector's critical path — there's no kernel-side prerequisite for shipping use_sea=True for PAT users.

Unit tests updated:

- TokenFederationProvider-wrapped PAT still routes to PAT (kept).
- Generic OAuth provider raises NotSupportedError (new).
- ExternalAuthProvider raises NotSupportedError (new).
- Silent non-PAT provider raises NotSupportedError (new) — reject the type itself rather than trying to extract a token we already know we can't use.

Live e2e against dogfood with use_sea=True (PAT): all checks still pass (SELECT 1, range(10000), fetchmany pacing, four metadata calls, session_configuration round-trip, structured DatabaseError on bad SQL).

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <[email protected]>
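The PAT-only dispatch described above can be sketched as follows. The provider classes and the `kernel_auth_kwargs` signature mirror the commit text but are stand-ins, not the connector's real classes:

```python
# Sketch of the PAT-only auth bridge, assuming the wrapper shape the
# commit describes; all class definitions here are illustrative stubs.
class AccessTokenAuthProvider:
    def __init__(self, access_token: str):
        self.access_token = access_token

class TokenFederationProvider:
    """Wrapper shape produced by get_python_sql_connector_auth_provider."""
    def __init__(self, external_provider):
        self.external_provider = external_provider

class NotSupportedError(Exception):
    pass

def kernel_auth_kwargs(auth_provider) -> dict:
    # Unwrap the federation shell first: the connector always wraps the
    # base provider, so an isinstance check on the outer object alone
    # would never match PAT.
    inner = getattr(auth_provider, "external_provider", auth_provider)
    if isinstance(inner, AccessTokenAuthProvider):
        return {"auth_type": "pat", "access_token": inner.access_token}
    raise NotSupportedError(
        "Kernel backend supports PAT auth only; "
        "use use_sea=False (Thrift) for OAuth/external auth."
    )
```

Rejecting the provider type itself (rather than probing for a token) matches the "silent non-PAT provider" test above.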
Moves the previously-ad-hoc /tmp/connector_smoke.py into the repo
as a real pytest module under tests/e2e/ — same convention as the
rest of the e2e suite. Uses the existing session-scoped
`connection_details` fixture from the top-level conftest so it
shares the credential surface with every other live test.
11 tests cover:
- connect() with use_sea=True opens a session.
- SELECT 1: rows + description shape (column name + dbapi type slug).
- SELECT * FROM range(10000): multi-batch drain.
- fetchmany() pacing across the buffer boundary.
- fetchall_arrow() returns a pyarrow Table.
- All four metadata methods (catalogs / schemas / tables / columns).
- session_configuration={'ANSI_MODE': 'false'} round-trips.
- Bad SQL surfaces as DatabaseError with `code='SqlError'` and
`sql_state='42P01'` attached as exception attributes.
Module-level skips:
- `databricks_sql_kernel` not importable → whole module skipped via
pytest.importorskip (the wheel hasn't been installed).
- Live creds missing → fixture-level skip with a pointed message.
Run: `pytest tests/e2e/test_kernel_backend.py -v`. All 11 pass
against dogfood in ~20s.
Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <[email protected]>
Two updates since the initial PR: 1. Dropped External auth — PAT-only on the kernel backend (25723627). 2. Live e2e tests moved into the repo (6b308156), replacing the previous ad-hoc smoke script. The auth_bridge unit tests are updated: OAuth providers / ExternalAuthProvider now assert NotSupportedError.
CI is failing across all jobs at `poetry lock` time:

    Because databricks-sql-connector depends on databricks-sql-kernel
    (^0.1.0) which doesn't match any versions, version solving failed.

The kernel wheel isn't yet published to PyPI — we verified the name is available via the Databricks proxy, but the package itself hasn't been built and uploaded yet. Declaring it as a poetry dep (even an optional one inside an extra) requires the version to be resolvable, and `poetry lock` runs as the setup step for every CI job: unit tests, linting, type checks, all of them.

Fix: drop the `databricks-sql-kernel` dep declaration and the `[kernel]` extra from pyproject.toml until the wheel is on PyPI. The lazy import in `backend/kernel/client.py` still raises a clear ImportError pointing at `pip install databricks-sql-kernel` (or local maturin) when use_sea=True is invoked without the kernel present.

When the kernel is published, a small follow-up will add back:

    databricks-sql-kernel = {version = "^0.1.0", optional = true}
    [tool.poetry.extras]
    kernel = ["databricks-sql-kernel"]
A pointed comment in pyproject.toml documents the deferred change.
Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <[email protected]>
Three CI failures after the poetry-lock fix uncovered three real issues:

1. pyarrow is optional in the connector. The default-deps CI test job installs without it; the +PyArrow job installs with it. The kernel backend's result_set.py + type_mapping.py import pyarrow eagerly (the kernel always returns pyarrow), and the unit tests import the backend at collection time — which crashes the default-deps job with ModuleNotFoundError. Fix: gate the three kernel unit tests on `pytest.importorskip("pyarrow")` so they skip on default-deps and run on +PyArrow. Verified locally: 39 pass with pyarrow, 3 skipped without. No change to the backend module itself — nothing imports it until use_sea=True is invoked, and pyarrow is on the kernel wheel's runtime dep list so use_sea=True can't hit this either.

2. mypy: KernelDatabricksClient.open_session returns self._session_id, which mypy types as Optional[SessionId] because the field starts as None. Fix: bind the new id to a local non-Optional variable, assign to the field, return the local. CI's check-types runs cleanly on backend/kernel/ now; pre-existing mypy noise elsewhere isn't mine.

3. black --check: black 22.12.0 (the version CI pins) wants reformatting on result_set.py / type_mapping.py / client.py. Applied. Verified locally with the same black version.

All 39 kernel unit tests + 619 pre-existing unit tests pass.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <[email protected]>
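The Optional-narrowing fix in item 2 is a standard mypy pattern; a minimal sketch with stand-in classes (`SessionId` here is illustrative, not the connector's real class):

```python
# Sketch of the mypy fix: bind the server-issued id to a local
# non-Optional variable, assign the field, return the local.
from typing import Optional

class SessionId:
    def __init__(self, guid: str) -> None:
        self.guid = guid

class KernelClientSketch:
    def __init__(self) -> None:
        # Field starts as None, so mypy types it Optional[SessionId].
        self._session_id: Optional[SessionId] = None

    def open_session(self) -> SessionId:
        session_id = SessionId("server-issued-id")  # local: SessionId, never None
        self._session_id = session_id
        return session_id  # returning the local satisfies the SessionId return type
```

Returning `self._session_id` directly would force either a cast or an `assert`, so returning the local is the cleanest narrowing.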
The +PyArrow CI matrix installs pyarrow but not the databricks-sql-kernel wheel (the wheel isn't on PyPI yet, and the [kernel] extra is deferred — see commit 31ca581). The previous fix gated unit tests on `pytest.importorskip("pyarrow")`, but test_kernel_auth_bridge.py was still pulled into a kernel-wheel ImportError because:

    src/databricks/sql/backend/kernel/__init__.py
    -> from databricks.sql.backend.kernel.client import KernelDatabricksClient
    -> import databricks_sql_kernel  # ImportError on +PyArrow CI

The eager re-export from `kernel/__init__.py` was a convenience that broke every consumer that only needed a submodule (type_mapping, result_set, auth_bridge) — they all triggered the kernel wheel import for no reason.

Fix:

- Drop the eager re-export from `kernel/__init__.py`. A comment documents why and points callers (= session.py::_create_backend, already this shape) at the direct `from .client import ...`.
- Drop the no-longer-needed `pytest.importorskip("pyarrow")` / `importorskip("databricks_sql_kernel")` from test_kernel_auth_bridge.py — auth_bridge.py itself has neither dep, so the test now runs on every CI matrix variant.
- test_kernel_result_set.py and test_kernel_type_mapping.py keep the pyarrow importorskip because they themselves use pyarrow.

Verified locally across the three matrix shapes:

- both pyarrow + kernel installed: 39 pass.
- pyarrow only (no kernel wheel — the +PyArrow CI shape): 39 pass.
- neither: 9 pass (auth_bridge only), 2 modules skip (the others use pyarrow).

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <[email protected]>
The connector's coverage CI job runs the full e2e suite, several of
whose test classes parametrize ``extra_params`` over ``{}`` and
``{"use_sea": True}``. With ``use_sea=True`` now routing through
the Rust kernel via PyO3, those cases die at ``connect()`` with our
pointed ImportError because the ``databricks-sql-kernel`` wheel
isn't yet on PyPI — and that CI job (sensibly) doesn't try to
build it from a sibling repo.
Fix: ``pytest_collection_modifyitems`` hook in the top-level
``conftest.py`` that adds a ``skip`` marker to any parametrize case
with ``extra_params={"use_sea": True, ...}`` when
``importlib.util.find_spec("databricks_sql_kernel")`` returns
``None``. Behavior change is CI-only — local dev with the kernel
wheel installed (via ``maturin develop`` from the kernel repo)
runs those cases as before.
Once the kernel wheel is published, the [kernel] extra in
pyproject.toml gets enabled (see comment block there) and the
default-deps CI matrix will install it; the skip then becomes a
no-op.
Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <[email protected]>
CI status, final: 33 / 34 checks pass. The one failing check (…). Failure breakdown:
My PR contributions to this job (all working as intended):
I don't believe my PR should be responsible for fixing the 7 pre-existing failures — they need their own fixes (server-side investigation for MST metadata; switching …).
    logger.debug("Creating Thrift backend client")
    databricks_client_class = ThriftDatabricksClient
    # `use_sea=True` now routes through the Rust kernel via
    # PyO3. The native pure-Python SEA backend
High severity — multi-reviewer consensus (architecture, agent-compat, devils-advocate, ops, test)
use_sea=True previously routed to SeaDatabricksClient (the pure-Python SEA REST backend). After this PR, the same kwarg routes to a Rust PyO3 wheel that isn't on PyPI yet. Side effects on existing use_sea=True users:
- ImportError on upgrade unless they separately install databricks-sql-kernel
- OAuth / federation / external auth → NotSupportedError (auth_bridge.py)
- Parameter binding → NotSupportedError (client.py:476-484)
- query_tags → NotSupportedError
- Volume PUT/GET → unsupported
- Telemetry mis-reports kernel sessions as DatabricksClientType.SEA
- The native SEA backend (backend/sea/, ~700 LOC) is now zombie code: still in the tree, no longer reachable through any documented entry point.
The module docstring at backend/kernel/__init__.py even says the module's identity is deliberately decoupled from SEA REST (the kernel may switch transport SEA REST → SEA gRPC → …). That contradicts the flag name.
Recommend: introduce use_kernel=True as a new explicit flag. Leave use_sea=True routing to SeaDatabricksClient for now. Deprecate use_sea on a published timeline once the kernel reaches feature parity. Bundles cleanly with the related issues: docstring update at client.py:117 ("Use the SEA backend instead of the Thrift backend") is now factually wrong; CHANGELOG.md has no entry for this behavior change; no version bump; telemetry still reports DatabricksClientType.SEA for kernel sessions.
Addressed in 24e9a5c — introduced a dedicated use_kernel=True flag. use_sea=True once again routes to the native pure-Python SeaDatabricksClient (unchanged); the new flag is opt-in and mutually exclusive. PR title + description updated to match.
    logger.debug("Creating kernel-backed client for use_sea=True")
    return KernelDatabricksClient(
        server_hostname=server_hostname,
High severity — flagged by ops, devils-advocate, architecture, maintainability, language
The kernel branch hardcodes 8 kwargs into KernelDatabricksClient(...). The Thrift branch (below) splats **kwargs. Silently dropped on use_sea=True:
- _socket_timeout
- All _retry_* knobs (_retry_stop_after_attempts_count, _retry_delay_*, …)
- _tls_no_verify, _tls_verify_hostname, _tls_trusted_ca_file
- pool_maxsize
- use_cloud_fetch, use_hybrid_disposition, enable_query_result_lz4_compression
- staging_allowed_local_path
- query_tags
- User-agent extras
- _use_arrow_native_complex_types
KernelDatabricksClient.__init__ accepts **kwargs and never references them — accept-and-ignore with zero log line.
The most dangerous case: a user setting _tls_no_verify=True on Thrift (for an on-prem proxy / self-signed cert) gets that honored. On use_sea=True it silently no-ops and the kernel's own TLS stack will verify the cert. The operator believes verification is disabled when it isn't. Same for custom CA bundle (_tls_trusted_ca_file).
Worse: DriverConnectionParameters in telemetry (client.py:396-398) continues reporting socket_timeout=kwargs.get("_socket_timeout", None) — so dashboards lie about what's actually applied.
Recommend: at minimum, log a single WARNING at session-open enumerating which kwargs the kernel backend cannot honor (one-shot per process). Long-term, plumb retry/timeout/proxy/TLS through to the kernel, or refuse to start when unsupported knobs are set.
Deferred — called out as a known gap in the updated PR description. The kernel manages its own HTTP stack today (TLS, retry, timeout, pool); we'll plumb a per-knob bridge as those surfaces appear kernel-side. Switching to use_kernel=True (24e9a5c) means existing use_sea=True callers — who relied on ssl_options / http_headers / retry knobs being honored on the SEA backend — are unaffected.
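The reviewers' minimum recommendation (a one-shot WARNING enumerating ignored knobs) could be sketched like this. The knob set comes from the review comment; the helper name and mechanism are illustrative, not what the PR ships:

```python
# Sketch of a one-shot-per-process warning about connection kwargs the
# kernel backend accepts but cannot honor. Hypothetical helper.
import logging

logger = logging.getLogger(__name__)

_UNSUPPORTED_KWARGS = frozenset({
    "_socket_timeout", "_tls_no_verify", "_tls_verify_hostname",
    "_tls_trusted_ca_file", "pool_maxsize", "use_cloud_fetch",
    "staging_allowed_local_path",
})
_warned_once = False

def report_ignored_kwargs(kwargs: dict) -> list:
    """Return the ignored knobs; log a single WARNING per process."""
    global _warned_once
    ignored = sorted(_UNSUPPORTED_KWARGS.intersection(kwargs))
    if ignored and not _warned_once:
        _warned_once = True
        logger.warning(
            "Kernel backend cannot honor these connection kwargs: %s",
            ", ".join(ignored),
        )
    return ignored
```

Called once at session-open, this turns a silent no-op (the dangerous `_tls_no_verify` case above) into an auditable log line.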
    logger.warning("Error closing kernel handle: %s", exc)
    self._buffer.clear()
    self._kernel_handle = None
    self._exhausted = True
High severity — flagged by architecture, ops, performance
The base ResultSet.close() (src/databricks/sql/result_set.py:166-190) calls self.backend.close_command(self.command_id) to free server-side state and to drop client-side tracking. KernelResultSet.close() overrides the base entirely and calls only self._kernel_handle.close().
For the sync execute path this is OK (the executed handle owns the server statement; closing it releases server state). For the async path (execute_command(async_op=True)), the handle is also tracked in KernelDatabricksClient._async_handles. Result:
- User calls cursor.execute(..., async_op=True) → handle stored in _async_handles.
- User calls cursor.fetchall() → result set built.
- User calls cursor.close() → KernelResultSet.close() calls _kernel_handle.close() directly.
- _async_handles[command_id.guid] still holds the now-closed handle.
- Later, close_session() iterates _async_handles.values() and calls .close() again on the dead handle.
The kernel's close() is idempotent per the PR's docstring, so this isn't a crash — but the bookkeeping is inconsistent and leaks an entry for every async-submitted statement that closes through the result-set path. Long-lived connections accumulate dead entries.
Recommend: KernelResultSet.close() should either (a) call super().close() (which calls backend.close_command), or (b) explicitly self.backend._async_handles.pop(self.command_id.guid, None) after closing the handle.
Fixed in 24e9a5c. KernelResultSet.close() now pops the entry from backend._async_handles (under the new _async_handles_lock). No-op for sync-execute and metadata paths, which never register there.
    self.has_more_rows = False
    self.status = CommandState.SUCCEEDED
    return False
    if batch.num_rows > 0:
High severity — performance, empirically verified
    def _buffered_rows(self) -> int:
        if not self._buffer:
            return 0
        first = self._buffer[0].num_rows - self._buffer_offset
        rest = sum(b.num_rows for b in list(self._buffer)[1:])  # allocates + O(M)
        return first + rest

Two issues:

- `list(self._buffer)[1:]` allocates a fresh list on every call — gratuitous. Use `itertools.islice(self._buffer, 1, None)` or iterate the deque.
- `_ensure_buffered` calls `_buffered_rows()` in a loop, once per pulled batch → O(M²) in batch count.
Empirically verified with a synthetic harness:
- 5,000 single-row batches → ~447ms just in the row-count loop (Python-side, no pyarrow).
Hot path: fetchmany(small_N) / fetchone() repeatedly when the kernel returns many small batches, or fetchall() over a deep stream.
Fix (~10 lines): track self._buffered_count: int as a running counter — += batch.num_rows in _pull_one_batch, -= take in _take_buffered, recompute on _drain. _buffered_rows() becomes O(1); _ensure_buffered becomes O(M).
Fixed in 37fa544. Replaced _buffered_rows with a running counter _buffered_count maintained by _pull_one_batch / _take_buffered / _drain. _buffered_rows is now O(1); _ensure_buffered is O(M) in batch count instead of O(M²).
    @pytest.fixture(scope="session")
    def host():
        return os.getenv("DATABRICKS_SERVER_HOSTNAME")
High severity — flagged by test, ops
The new pytest_collection_modifyitems skips every extra_params={"use_sea": True} case when the kernel wheel isn't importable. CI installs --all-extras (.github/workflows/code-coverage.yml:39), but pyproject explicitly does NOT declare a [kernel] extra (per the PR's own comment at pyproject.toml:55-68). Result: every one of those cases is reported SKIPPED on every CI run.
Verified via grep against tests/e2e/test_driver.py — there are ~14 distinct @pytest.mark.parametrize("extra_params", [{}, {"use_sea": True}]) functions (test_query_with_large_wide_result_set, test_long_running_query, test_execute_async__*, test_unicode, test_fetchone, test_fetchall, test_fetchmany_*, test_iterator_api, test_multi_timestamps_arrow, …) plus 4 SEA-parametrized retry tests in tests/e2e/common/retry_test_mixins.py.
Before this PR they ran against SeaDatabricksClient. After this PR they vanish from the CI matrix entirely. This is a silent capability loss across the whole e2e suite, not just within the new code. Combined with F9 (no unit tests for the 511-LOC client.py), coverage of the kernel backend in CI may also fail the 85% threshold gate at .github/workflows/code-coverage.yml:80-81.
Recommend: (a) install the kernel wheel in CI before running e2e, OR (b) document this regression in the PR description so reviewers know use_sea=True e2e coverage in CI is currently 0, AND (c) file a follow-up to land kernel-wheel CI coverage.
Addressed in 24e9a5c — removed the conftest collection hook entirely. With use_sea=True back on the native SEA backend, the existing extra_params=[{}, {"use_sea": True}] parametrized cases run as they did before this PR (no skip needed). The kernel backend is now opt-in via use_kernel=True and doesn't intercept existing e2e parametrizations.
    @@ -0,0 +1,511 @@
    """``DatabricksClient`` backed by the Rust kernel via PyO3.
High severity — flagged by test, devils-advocate
auth_bridge, result_set, and type_mapping each get unit coverage; the largest, most behavior-rich file in the PR has none. Only the e2e file exercises it — and that file skips silently when creds OR the kernel wheel are missing (which is the CI default, see F8).
Concretely uncovered by any non-live test:
- _CODE_TO_EXCEPTION mapping (14 entries) — trivially testable with a fake _kernel.KernelError; a typo on "Unauthenticated" collapses silently to DatabaseError.
- _reraise_kernel_error attribute forwarding — copies 7 structured fields (code, sql_state, error_code, vendor_code, http_status, retryable, query_id). E2E only verifies code + sql_state.
- open_session double-open guard — raise InterfaceError(...) not exercised.
- All 5 "no open session" InterfaceError guards — execute_command, get_catalogs, get_schemas, get_tables, get_columns.
- get_columns catalog_name required check — e2e always passes a catalog.
- execute_command parameters / query_tags NotSupportedError — regression risk that accepting parameters would silently dispatch to a kernel Statement with no bind_param.
- cancel_command / close_command no-handle tolerant path.
- close_session cleanup of _async_handles including swallow-on-KernelError.
- get_query_state sync-path SUCCEEDED shortcut and Failed-state re-raise.
- _STATE_TO_COMMAND_STATE mapping (6 entries).
- max_download_threads property.
- get_tables table_types warning behavior.
All achievable with a MagicMock _kernel module via monkeypatch.setattr. The auth-bridge and result-set tests demonstrate the pattern. The absence of tests/unit/test_kernel_client.py is the single biggest test-quality issue in this PR.
Recommend: add tests/unit/test_kernel_client.py — ~150 LOC covers the items above.
Addressed in 24e9a5c. Added tests/unit/test_kernel_client.py with 38 cases covering: full _CODE_TO_EXCEPTION (14-entry parametrize), _reraise_kernel_error attribute forwarding, full _STATE_TO_COMMAND_STATE (6-entry parametrize), all 5 no-open-session guards, open_session double-open, parameters and query_tags rejection, get_columns catalog-required, cancel_command / close_command tolerance, get_query_state sync-path SUCCEEDED and Failed-state re-raise, synthetic CommandId UUID shape, and close_session cleanup-on-failure. Uses a fake databricks_sql_kernel module installed into sys.modules so it runs without the Rust extension. 77/77 kernel unit tests pass locally.
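The fake-module pattern that reply describes can be sketched generically. The stub's attributes are illustrative; the real test file would shape the stub to match whatever the client imports:

```python
# Sketch: install a stub databricks_sql_kernel into sys.modules so the
# client's lazy `import databricks_sql_kernel` resolves without the
# Rust extension being built. Hypothetical helper for illustration.
import sys
import types

def install_fake_kernel() -> types.ModuleType:
    fake = types.ModuleType("databricks_sql_kernel")

    class KernelError(Exception):
        """Stub for the extension's error type."""

    fake.KernelError = KernelError
    sys.modules["databricks_sql_kernel"] = fake
    return fake
```

Once registered, every `import databricks_sql_kernel` in the process resolves to the stub, so unit tests exercise the client's mapping and guard logic on any CI matrix shape.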
    federated.add_headers = base.add_headers
    kwargs = kernel_auth_kwargs(federated)
    assert kwargs == {"auth_type": "pat", "access_token": "dapi-abc"}
Medium severity — flagged by language, maintainability, devils-advocate, test, security
    federated = TokenFederationProvider.__new__(TokenFederationProvider)
    federated.external_provider = base
    federated.add_headers = base.add_headers

This bypasses __init__ (which requires http_client and normalizes hostname) AND monkey-patches over the real TokenFederationProvider.add_headers — the one containing all the token-exchange logic. The assertion kwargs == {"auth_type": "pat", "access_token": "dapi-abc"} therefore says nothing about whether the bridge handles the real federation flow. It passes "by accident" because a dapi-… token isn't a JWT, so the real path's _should_exchange_token happens to return False.
Any future change to TokenFederationProvider.add_headers (added telemetry, new refresh trigger, eager exchange) silently breaks the bridge while this test stays green. With a real-init federated provider, add_headers writes the federation-exchanged token (not the original PAT) — the bridge would extract that token, not the underlying PAT. The test name test_federation_wrapped_pat_routes_to_kernel_pat overstates what's verified.
tests/unit/test_token_federation.py:31-36 already demonstrates clean construction with a MagicMock http_client.
Recommend:
    federated = TokenFederationProvider(
        hostname="https://example.cloud.databricks.com",
        external_provider=base,
        http_client=Mock(),
    )
Fixed in 37fa544. The test now constructs a real TokenFederationProvider(http_client=Mock()) and exercises its actual add_headers path; for a plain dapi-… PAT _should_exchange_token returns False (not a JWT) so no exchange fires and the mock http_client is never invoked.
    def get_catalogs(
        self,
        session_id: SessionId,
        max_rows: int,
Medium severity — flagged by architecture, devils-advocate, ops
_async_handles is a single dict per KernelDatabricksClient. Multiple cursors on the same connection share it via self.backend. Mutations happen in execute_command (insert), close_command (pop), cancel_command (get), close_session (iterate-then-clear) — all unlocked.
Two threads issuing async statements concurrently are safe in CPython by GIL accident but not by design. Worse, close_session does for handle in list(self._async_handles.values()): ...; self._async_handles.clear() — a thread mid-execute_command(async_op=True) could add a new handle after the iterator copy is taken but before clear(). That handle is dropped on the floor with no .close() called — kernel-side state leaks.
The connector explicitly documents thread-safety per cursor; this regresses below that bar for shared async tracking.
Recommend: wrap mutations in a threading.RLock, or document non-thread-safety in the class docstring with an explicit warning.
Fixed in 24e9a5c. Added self._async_handles_lock = threading.RLock() and wrapped every read/mutation site (execute_command insert, cancel_command / close_command / get_query_state / get_execution_result reads, close_session iterate+clear). The close_session pattern is now snapshot-under-lock then close-outside-lock — newly-added handles after the snapshot stay in the dict for the next sweep instead of being dropped on the floor.
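The snapshot-under-lock, close-outside-lock pattern from that reply can be sketched in isolation; the class and method names here are stand-ins for the kernel client:

```python
# Sketch of locked async-handle bookkeeping: mutations happen under an
# RLock, and close_session snapshots under the lock but closes outside
# it, so a handle registered after the snapshot survives for the next
# sweep instead of being dropped without a close() call.
import threading

class KernelClientHandles:
    def __init__(self):
        self._async_handles = {}
        self._async_handles_lock = threading.RLock()

    def register(self, guid, handle):
        with self._async_handles_lock:
            self._async_handles[guid] = handle

    def close_session(self):
        # Snapshot under the lock...
        with self._async_handles_lock:
            handles = list(self._async_handles.items())
        # ...close outside it (handle.close() may block on the server).
        for guid, handle in handles:
            handle.close()
            with self._async_handles_lock:
                self._async_handles.pop(guid, None)
```

Popping per-guid (rather than a blanket `clear()`) is what closes the race the review identified.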
    None,
    )
    for field in schema
    ]
Medium severity — flagged by agent-compat, maintainability
return str(arrow_type) for unrecognized types produces strings like "fixed_size_binary[16]" or "timestamp[ns, tz=UTC]" in cursor.description[i][1]. Code (and LLM agents) branching on description type strings — a common pattern, e.g., if col_type == "timestamp": — silently miss these cases.
The unit test only verifies pa.null() falls through to "null"; the visually ugly cases aren't covered.
pa.timestamp("us") and pa.timestamp("ns", tz="UTC") both pass is_timestamp and map to "timestamp" (fine), but other parametrized types (fixed-size, dictionary-encoded, union) fall to str(arrow_type).
Recommend: either (a) document the fallback shape in the module docstring so callers know to handle parameterized type strings, (b) lowercase + strip parameters before returning, or (c) extend the explicit list to cover the parametrized variants the kernel actually returns.
Deferred. Documented as a follow-up — the kernel will eventually return a richer Arrow type surface, and the right shape is to expand the explicit table when kernel adds parameterized types we care about, not to lossily lowercase/strip on the connector side.
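For reference, the deferred option (b) is a one-liner; this sketch is illustrative only, since the PR keeps the explicit table instead:

```python
# Illustrative sketch of recommendation (b): strip bracketed parameters
# from Arrow type strings before they land in cursor.description, so
# "timestamp[ns, tz=UTC]" and "fixed_size_binary[16]" collapse to
# stable slugs. Not what the PR ships.
def normalize_type_slug(arrow_type_str: str) -> str:
    # Keep only the base name before any bracketed parameters.
    return arrow_type_str.split("[", 1)[0].strip().lower()
```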
    return
    try:
        handle.close()
    except _kernel.KernelError as exc:
Medium severity — flagged by language
Signature is (exc: BaseException) -> "Error" with # type: ignore[return-value] on the non-KernelError passthrough. The ignore is masking a real type issue: this function's contract is "either re-raise as Error or pass through."
Every caller in the file does raise _reraise_kernel_error(exc) after an isinstance(exc, KernelError) check (12 callers), so the non-KernelError passthrough is unreachable in practice.
Recommend: delete the dead branch and tighten the signature to (exc: _kernel.KernelError) -> Error (called only after an isinstance check). Or, if the passthrough is intentional, change return-type to Union[BaseException, Error] and drop the ignore. Deleting is simpler.
Fixed in 37fa544. Tightened the signature to (exc: _kernel.KernelError) -> Error, dropped the unreachable passthrough branch and the # type: ignore[return-value], and replaced the defensive setattr try/except with a plain setattr(new, attr, getattr(exc, attr, None)) since none of the PEP 249 exception classes use __slots__.
    buffer_size_bytes=cursor.buffer_size_bytes,
    )

    # ── Metadata ───────────────────────────────────────────────────
Medium severity — flagged by maintainability, language
_use_arrow_native_complex_types: Optional[bool] = True is accepted in __init__ but never read. Not passed by session.py::_create_backend either. Drop entirely.
Also: Optional[bool] = True is essentially Union[bool, None] defaulted to True — the Optional is misleading because None and True are both acceptable for what would be a "use default" sentinel. If kept, restrict to bool = True.
Fixed in 37fa544. Removed _use_arrow_native_complex_types from KernelDatabricksClient.__init__ — it was accepted but never read, and session.py::_create_backend doesn't pass it for the kernel branch.
```python
batch size; ``fetchall`` drains the whole stream.
"""

from __future__ import annotations
```
Medium severity — flagged by test, language
The class docstring claims the duck-typed kernel_handle must implement arrow_schema() / fetch_next_batch() / fetch_all_arrow() / close(). The production code never calls fetch_all_arrow — KernelResultSet streams via fetch_next_batch() + a custom _drain. The _FakeKernelHandle test double also doesn't implement fetch_all_arrow.
If the kernel adds a required method (e.g., fetch_next_batch(timeout=...) becomes mandatory), unit tests still pass and the regression lands in e2e — which is silently skipped per F8.
Recommend:
- Drop `fetch_all_arrow` from the docstring contract, OR
- Add a contract test that (when `databricks_sql_kernel` is importable) asserts the duck-typed methods are actually exposed.
Fixed in 37fa544. Dropped `fetch_all_arrow` from the class-docstring contract — production code streams via `fetch_next_batch()` and never calls it, so the documented duck-type now matches what `KernelResultSet` (and the `_FakeKernelHandle` test double) actually implement.
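The second recommendation — a contract test — can be sketched as a small helper that checks the duck-typed surface once. The required method names come from the thread; wiring this into a real test module (and gating it on `databricks_sql_kernel` being importable) is left to the reader.

```python
REQUIRED_HANDLE_METHODS = ("arrow_schema", "fetch_next_batch", "close")


def assert_handle_contract(cls) -> None:
    """Fail loudly if a duck-typed kernel handle class drops a required method."""
    missing = [m for m in REQUIRED_HANDLE_METHODS
               if not callable(getattr(cls, m, None))]
    assert not missing, f"{cls.__name__} is missing: {missing}"
```

Run against both `ExecutedStatement` and `ResultStream`, this turns a kernel-side API change into a unit-test failure instead of a silently-skipped e2e regression.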
```python
    arraysize: int,
    buffer_size_bytes: int,
):
    schema = kernel_handle.arrow_schema()
```
Low severity — flagged by performance
buffer_size_bytes is accepted by the constructor and forwarded to the base ResultSet, but never consulted by the kernel backend. The kernel currently caps buffer by rows-pulled, not bytes.
Recommend: document the no-op (a comment in the class docstring or constructor) so callers tuning buffer_size_bytes for memory ceilings on Thrift know it doesn't apply on use_sea=True.
Fixed in 37fa544. Added a paragraph to the KernelResultSet class docstring explicitly documenting that buffer_size_bytes is accepted for base-class contract compatibility but is not consulted — kernel currently caps by rows pulled, not bytes.
```python
    lz4_compressed=False,
    arrow_schema_bytes=None,
)
self._kernel_handle = kernel_handle
```
Low severity — flagged by maintainability
The base ResultSet expects a results_queue with a next_n_rows / remaining_rows / close interface — both ThriftResultSet and SeaResultSet use it. KernelResultSet passes results_queue=None and duplicates _pull_one_batch / _ensure_buffered / _take_buffered / _drain (~80 lines) plus its own fetch overrides.
The docstring even acknowledges the duplication: "Buffer shape mirrors the prior ADBC POC's AdbcResultSet."
Recommend (follow-up): extract BufferedArrowQueue(results_queue) wrapping any handle implementing arrow_schema() / fetch_next_batch() / close(). Both KernelResultSet and any future AdbcResultSet become 15-line constructor-only subclasses, and the base ResultSet.fetchmany_arrow / fetchall_arrow works unchanged.
Deferred — this is a cross-cutting refactor that would touch both kernel and SEA backends. Tracked as a follow-up; appropriate to land alongside the ADBC POC if/when it gets revived.
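The proposed extraction could look roughly like the following. The `next_n_rows` / `remaining_rows` / `close` queue interface is named in the finding; the handle protocol (`fetch_next_batch()` returning a batch or `None`) is an assumption, and plain Python lists stand in for Arrow record batches to keep the sketch self-contained.

```python
from collections import deque


class BufferedArrowQueue:
    """FIFO buffer over any handle exposing fetch_next_batch()/close()."""

    def __init__(self, handle):
        self._handle = handle
        self._buffer = deque()     # rows pulled but not yet handed out
        self._exhausted = False

    def _pull_one_batch(self) -> None:
        batch = self._handle.fetch_next_batch()
        if batch is None:
            self._exhausted = True
        else:
            self._buffer.extend(batch)

    def next_n_rows(self, n: int):
        # fetchmany(n) semantics: only pull when the buffer runs short,
        # so small n never re-fetches within a batch
        while len(self._buffer) < n and not self._exhausted:
            self._pull_one_batch()
        return [self._buffer.popleft()
                for _ in range(min(n, len(self._buffer)))]

    def remaining_rows(self):
        # fetchall semantics: drain the stream, then empty the buffer
        while not self._exhausted:
            self._pull_one_batch()
        rows = list(self._buffer)
        self._buffer.clear()
        return rows

    def close(self) -> None:
        self._handle.close()
```

With this in place, `KernelResultSet` (and a revived `AdbcResultSet`) would pass the queue to the base `ResultSet` and drop their private buffer code.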
```python
    buffer_size_bytes=cursor.buffer_size_bytes,
)


def _synthetic_command_id(self) -> CommandId:
```
Low severity — flagged by security
self._auth_kwargs = {"auth_type": "pat", "access_token": <raw_token>} is stored on the KernelDatabricksClient instance for its life. Thrift / native-SEA materialize the token only inside per-request add_headers calls. The kernel client elevates the cleartext token to a long-lived attribute on a connector object — at risk of accidental pickling, debugger dumps, or telemetry capture.
Recommend: clear self._auth_kwargs (or just self._auth_kwargs["access_token"]) immediately after _kernel.Session(...) returns in open_session. Or move it into a closure rather than an instance attribute.
Fixed in 37fa544. Added a finally block to open_session that pops access_token from self._auth_kwargs after the kernel Session is constructed (or failed). Kernel owns the credential from then on; no cleartext copy stays on the long-lived connector object.
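A minimal runnable sketch of the finally-pop shape described in the reply; the `_kernel.Session` call is replaced with a stub, and attribute names follow the thread. The point is only the control flow: the cleartext token leaves the long-lived connector object whether or not session construction succeeds.

```python
class _StubSession:
    """Stand-in for _kernel.Session."""

    def __init__(self, **kwargs):
        if kwargs.get("access_token") == "bad":
            raise RuntimeError("auth failed")


class KernelClient:
    def __init__(self, auth_kwargs):
        self._auth_kwargs = dict(auth_kwargs)
        self._session = None

    def open_session(self):
        try:
            self._session = _StubSession(**self._auth_kwargs)
        finally:
            # kernel owns the credential from here on; drop our copy
            # even when construction raised
            self._auth_kwargs.pop("access_token", None)
```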
```python
    kernel side."""
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM range(10000)")
        rows = cur.fetchall()
```
Low severity — flagged by test
Docstring says SELECT * FROM range(10000) "exercises the CloudFetch / multi-batch path on the kernel side". 10000 BIGINT rows is ~80 KB — almost certainly a single inline chunk on a typical warehouse. Existing CloudFetch-aimed tests (tests/e2e/test_driver.py:145-180) use 100 MB / 12.5M rows.
The comment overstates scope. A future reader may believe CloudFetch is covered when it isn't.
Recommend: drop the misleading claim, or scale to range(2_000_000) to actually cross a chunk boundary.
Fixed in 37fa544. Replaced the misleading 'exercises CloudFetch / multi-batch path' claim with a note that it covers end-of-stream drain over multiple fetch_next_batch calls and isn't large enough for CloudFetch — pointing to test_driver for CloudFetch coverage.
```python
from __future__ import annotations

from unittest.mock import MagicMock
```
Low severity — flagged by test
from unittest.mock import MagicMock is imported but never used in this file. Dead import.
Fixed in 37fa544. Replaced MagicMock import with Mock since the federation test now needs the latter; no longer unused.
Code Review Squad — Failed Inline Comments

Could not post inline comments for: F2, F4, F7, F10, F11, F13, F14, F19, F22, F25 — see body below.

F2 — Federated-PAT token refresh is dead — single snapshot at construct time

High severity — flagged by security + devils-advocate
The bridge captures only the first exchanged token and never re-extracts. Long-running kernel sessions outlive the exchanged token's TTL and start failing.

The bridge's own docstring acknowledges this failure mode while justifying OAuth rejection: "routing OAuth through PAT would silently break token refresh during long-running sessions." That exact failure mode applies to the federated-PAT case the bridge accepts.

Recommend: either (a) propagate a refresh callback into the kernel

F4 —
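Option (a) — propagating a refresh callback instead of snapshotting the token — could be sketched as below. `access_token_provider` is a hypothetical kernel kwarg; the kernel's actual refresh hook (if it grows one) may be named differently, and the `add_headers` re-exchange behaviour is assumed from the thread.

```python
def kernel_auth_kwargs(auth_provider):
    """Build Session kwargs that re-extract the bearer token on demand."""

    def fetch_token() -> str:
        headers: dict = {}
        auth_provider.add_headers(headers)   # federation re-exchanges if expired
        return headers["Authorization"].removeprefix("Bearer ")

    # hypothetical kwarg: the kernel would call fetch_token per request
    return {"auth_type": "pat", "access_token_provider": fetch_token}
```

Because the closure holds the provider rather than a token string, no cleartext credential ever lands on the connector instance either — which also addresses the F24-style retention concern.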
Cleanup pass on the kernel-backend PR addressing reviewer feedback that doesn't change observable behaviour:

- result_set.py: replace O(M²) `_buffered_rows` with running counter `_buffered_count` maintained by pull/take/drain (perf F6).
- result_set.py: docstring corrections — drop nonexistent `fetch_all_arrow` from kernel-handle contract (F20); document `buffer_size_bytes` as no-op on the kernel backend (F21).
- client.py: tighten `_reraise_kernel_error` signature to `_kernel.KernelError` only; drop dead passthrough branch and the defensive setattr try/except (F17).
- client.py: drop unused `_use_arrow_native_complex_types` kwarg (F18).
- client.py: collapse three `KernelResultSet(...)` construction sites through `_make_result_set` (renamed from `_metadata_result`) (F19).
- client.py: drop `metadata-` prefix from synthetic CommandId; use a plain `uuid.uuid4().hex` so anything reading `cursor.query_id` downstream sees a UUID-shaped string (F14).
- client.py: clear the raw access token from `_auth_kwargs` after the kernel session is constructed — kernel owns the credential from then on, no need to retain a cleartext copy on the connector instance (F24).
- auth_bridge.py: reject bearer tokens containing ASCII control characters at extraction time (defense-in-depth against header injection if a misbehaving HTTP stack ever places the token back into a header without scrubbing) (F25).
- tests/unit/test_kernel_auth_bridge.py: construct a real `TokenFederationProvider(http_client=Mock())` instead of bypassing `__init__` with `__new__` + monkey-patching `add_headers`. Exercises the real federation passthrough path the bridge sees in production (F12). Drop unused `MagicMock` import (F27).
- tests/e2e/test_kernel_backend.py: drop misleading CloudFetch claim on `test_drain_large_range_to_arrow` — 10000 BIGINT rows is ~80 KB, single inline chunk on a typical warehouse (F26).

All 39 existing kernel unit tests pass.

Co-authored-by: Isaac
…ve review fixes
Major change: route the kernel backend through a new ``use_kernel=True``
connection kwarg instead of repurposing ``use_sea=True``. ``use_sea=True``
once again routes to the native pure-Python SEA backend (no behaviour
change); ``use_kernel=True`` routes to the Rust kernel via PyO3. The
two flags are mutually exclusive.
This addresses the largest reviewer concern from the multi-agent
review: silently hijacking a documented public flag broke OAuth /
federation / parameter-binding callers on ``use_sea=True`` who had no
opt-out. With the new flag, the kernel backend is fully opt-in and
existing ``use_sea=True`` users continue to get the native SEA backend
they signed up for.
Other substantive fixes:
- session.py: restore ``SeaDatabricksClient`` import + routing. Reject
``use_kernel=True`` + ``use_sea=True`` together with a clear
``ValueError``.
- client.py (kernel ``Cursor.columns``): update docstring to flag the
``catalog_name=None`` divergence — kernel requires a catalog,
Thrift / native SEA do not (F13).
- conftest.py: drop the collection-time ``pytest_collection_modifyitems``
hook that was skipping ``extra_params={"use_sea": True}`` cases. With
``use_sea=True`` back on the native SEA backend, those cases run as
they did before this PR (F8).
- kernel/client.py: ``get_tables`` now applies the ``table_types``
filter client-side using ``ResultSetFilter._filter_arrow_table``
(the same helper the native SEA backend uses), wrapped in a tiny
``_StaticArrowHandle`` that flows the filtered table back through
the normal ``KernelResultSet`` path. Replaces the previous
"log a warning and return unfiltered" behaviour (F4).
- kernel/client.py: guard ``_async_handles`` with ``threading.RLock``
so concurrent cursors on the same connection don't race on
submit / close / close-session (F15).
- kernel/result_set.py: ``KernelResultSet.close()`` now drops the
entry from ``backend._async_handles`` so async-submitted statements
don't leave stale references behind (F5).
- kernel/{__init__,client,auth_bridge}.py, tests/e2e/test_kernel_backend.py:
update docstrings, error messages, and the e2e fixture to refer to
``use_kernel=True`` instead of ``use_sea=True``.
- client.py (``Connection`` docstring): document the new
``use_kernel`` kwarg + its Phase-1 limitations.
New tests:
- tests/unit/test_kernel_client.py (38 cases): cover the 14-entry
``_CODE_TO_EXCEPTION`` table, ``_reraise_kernel_error`` attribute
forwarding, the 6-entry ``_STATE_TO_COMMAND_STATE`` table, the
no-open-session guards on every method, ``open_session`` double-open,
``parameters`` / ``query_tags`` rejection, ``get_columns``'
catalog-required check, ``cancel_command`` / ``close_command``
no-handle tolerance, ``get_query_state`` sync-path SUCCEEDED, the
Failed-state re-raise, the synthetic-command-id UUID shape, and
``close_session`` cleanup even when per-handle close errors fire.
Uses a fake ``databricks_sql_kernel`` module installed into
``sys.modules`` so the test runs with no Rust extension dependency
(F9).
77/77 kernel unit tests pass.
Co-authored-by: Isaac
Code-review responses — summary-only findings

Replies to the findings that landed in the summary comment (couldn't be posted as inline replies because the cited line was outside the diff hunk):
77/77 kernel unit tests passing locally on the new branch tip
gopalldb
left a comment
Review summary
Verdict: ship after addressing 8 recommended items. No blockers, but several Thrift/SEA-parity and error-mapping gaps should be closed before users opt into use_kernel=True.
Counts: 0 blocker / 8 major / 6 minor / 1 nit.
Required before merge (blockers + majors):
- agent-1-001 src/databricks/sql/backend/kernel/client.py:371 — Async ExecutedAsyncStatement handle leaks on normal cursor.close()
- agent-1-002 src/databricks/sql/backend/kernel/client.py:293 — Non-KernelError PyO3 exceptions from execute_command bypass error mapping
- agent-1-003 src/databricks/sql/backend/kernel/client.py:307 — kernel_handle.arrow_schema() raised inside KernelResultSet.init bypasses
- agent-1-004 src/databricks/sql/backend/kernel/result_set.py:108 — fetch_next_batch holds the GIL across blocking network IO
- agent-2-001 src/databricks/sql/backend/kernel/type_mapping.py (summary-only) — VARCHAR/CHAR columns surface as `string` on use_kernel=True, diverging from Thrift
- agent-2-002 src/databricks/sql/backend/kernel/type_mapping.py (summary-only) — DECIMAL precision and scale dropped from cursor.description
- agent-2-003 src/databricks/sql/backend/kernel/type_mapping.py (summary-only) — VARIANT type slug not detected — falls through to opaque str(arrow_type)
- agent-2-004 src/databricks/sql/backend/kernel/result_set.py (summary-only) — Volume PUT/GET silently returns rows instead of raising — is_staging_operation
Recommended (minors):
- agent-1-005 src/databricks/sql/backend/kernel/client.py:75 — ImportError + docstring tell users `pip install databricks-sql-kernel` but the wheel is not yet published
- agent-1-006 src/databricks/sql/backend/kernel/client.py:461 — table_types client-side filter / _StaticArrowHandle path has no unit coverage
- agent-1-007 src/databricks/sql/backend/kernel/client.py:342 — get_query_state returns SUCCEEDED for any unknown command_id, including already-closed async commands
- agent-1-008 src/databricks/sql/backend/kernel/auth_bridge.py:101 — Misconfigured-PAT path raises ValueError instead of ProgrammingError
- agent-1-009 src/databricks/sql/backend/kernel/type_mapping.py:73 — description's null_ok slot is always None despite Arrow field.nullable being available
- agent-2-005 src/databricks/sql/backend/kernel/client.py (summary-only) — get_columns(catalog_name=None) raises ProgrammingError, SEA raises DatabaseError
Nits (author's call):
- agent-1-010 src/databricks/sql/backend/kernel/result_set.py:237 — Manual lock acquire/release block instead of `with backend._async_handles_lock`
Adjustments made:
- Tightened reasoning on agent-1-003, agent-1-007, agent-1-008, agent-2-001, agent-2-004 to ≤3 sentences.
- No findings dropped: cross-checked all 15 against the 38-entry self-reply digest; the related digest items (close pop, _reraise tightening, str() fallback deferred, _async_handles_lock added) addressed adjacent concerns but not the defects raised here.
- No severity changes: all majors describe contract/parity defects that meet the major rubric; preserved per the asymmetric tie-break.
Findings without a specific location
[major] VARCHAR/CHAR columns surface as string on use_kernel=True, diverging from Thrift (src/databricks/sql/backend/kernel/type_mapping.py)
type_mapping.py maps every pyarrow.string() / large_string() to the type-name "string", but the Thrift backend (thrift_backend.py:737-743, _col_to_description) lowercases the underlying TTypeId so a VARCHAR or CHAR column comes back as "varchar" / "char" in cursor.description[i][1]. Consumers (and LLM agents) that branch on the description type slug — a long-standing pattern in this driver — will see a different value for the same query depending on use_kernel. Fix: read Arrow field metadata (Spark:DataType:SqlName) the same way _col_to_description does so VARCHAR/CHAR are preserved.
[major] DECIMAL precision and scale dropped from cursor.description (src/databricks/sql/backend/kernel/type_mapping.py)
description_from_arrow_schema returns (name, type_code, None, None, None, None, None) for every column, including decimals, while the Thrift backend (thrift_backend.py:750-764) reads precision and scale off the type qualifier and fills slots 4 and 5 of the PEP 249 7-tuple. pyarrow Decimal128Type / Decimal256Type carry precision and scale on the Arrow type itself (arrow_type.precision, arrow_type.scale), so the kernel backend has the data — populate the slots so SQLAlchemy / pandas / Polars adapters and any user code reading cursor.description[i][4:6] keep working on use_kernel=True.
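A sketch of filling PEP 249 description slots 4–5 from the Arrow type, per the finding above. pyarrow's `Decimal128Type` / `Decimal256Type` expose `.precision` and `.scale` as attributes; a duck-typed stand-in keeps the example runnable without pyarrow installed.

```python
def precision_scale(arrow_type):
    """Return (precision, scale) for decimal Arrow types, (None, None) otherwise."""
    precision = getattr(arrow_type, "precision", None)
    scale = getattr(arrow_type, "scale", None)
    if precision is None or scale is None:
        # non-decimal types carry neither attribute
        return (None, None)
    return (precision, scale)
```

The real mapper would splice these into slots 4 and 5 of the 7-tuple so `cursor.description[i][4:6]` keeps working for SQLAlchemy / pandas / Polars adapters.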
[major] VARIANT type slug not detected — falls through to opaque str(arrow_type) (src/databricks/sql/backend/kernel/type_mapping.py)
The Thrift backend specifically inspects Arrow field metadata for Spark:DataType:SqlName == VARIANT and surfaces "variant" as the description type_code (thrift_backend.py:766-775). _arrow_type_to_dbapi_string takes only the bare pyarrow.DataType — it has no access to the field's metadata dict and no variant branch, so a VARIANT column falls into the final str(arrow_type) returning something like "extension<arrow.json>" or the underlying physical type. Fix: pass pyarrow.Field (not DataType) into the mapper and replicate the metadata check Thrift uses.
[major] Volume PUT/GET silently returns rows instead of raising — is_staging_operation (src/databricks/sql/backend/kernel/result_set.py)
`KernelResultSet.__init__` hardcodes is_staging_operation=False, but Cursor._handle_staging_operation is gated on self.active_result_set.is_staging_operation (client.py:1356, client.py:1461) — Thrift/SEA flip this flag for PUT/GET/REMOVE so the cursor intercepts and runs the upload/download. On use_kernel=True a user calling cursor.execute("PUT 'local' INTO ...") gets the raw presigned-URL metadata rows back as plain data with no error and no NotSupportedError — staging is silently broken despite the PR description claiming "Not supported". Fix: detect PUT/GET/REMOVE in execute_command and raise NotSupportedError, or set the flag and raise from the cursor's staging handler.
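The first suggested fix — reject staging commands up front in execute_command — could look like this sketch. The regex is an approximation of the ingestion grammar (`PUT ... INTO` / `GET ... TO` / `REMOVE`); the real check may need a tighter match, and the `NotSupportedError` here is a local stand-in for the connector's PEP 249 class.

```python
import re


class NotSupportedError(Exception):
    """Stand-in for the connector's PEP 249 NotSupportedError."""


_STAGING_RE = re.compile(r"^\s*(PUT|GET|REMOVE)\b", re.IGNORECASE)


def reject_staging_operation(operation: str) -> None:
    """Raise instead of returning presigned-URL metadata rows as data."""
    if _STAGING_RE.match(operation):
        raise NotSupportedError(
            "Volume PUT/GET/REMOVE is not supported on use_kernel=True"
        )
```

Calling this at the top of execute_command makes the documented "Not supported" limitation loud instead of silent.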
[minor] get_columns(catalog_name=None) raises ProgrammingError, SEA raises DatabaseError (src/databricks/sql/backend/kernel/client.py)
KernelDatabricksClient.get_columns raises ProgrammingError when catalog_name is falsy; the native SEA backend raises DatabaseError for the identical precondition (sea/backend.py:802). PEP 249 §5.2 says ProgrammingError is the right class for missing-required-argument, so kernel is more correct — but two backends in the same connector raising different exception classes for the same missing-arg case is a divergence callers branching on exception type will notice. Fix: either update SEA to match, or document the kernel exception in the Cursor.columns() docstring alongside the existing catalog_name=None note.
```python
    stream = handle.await_result()
except _kernel.KernelError as exc:
    raise _reraise_kernel_error(exc)
return self._make_result_set(stream, cursor, command_id)
```
[major] get_execution_result wraps the ResultStream from handle.await_result() in a KernelResultSet whose _kernel_handle is the stream, not the async_exec; cursor.close() -> result_set.close() pops the async_exec from _async_handles but never calls async_exec.close(), so the ExecutedAsyncStatement is leaked server-side until close_session sweeps it. The leak is silent: only explicit cursor.close_command() (rarely called by users) frees the handle. Fix: either close async_exec inside KernelResultSet.close() when the kernel_handle was produced from an async path, or call async_exec.close() inside get_execution_result and stop tracking it after handing back the stream.
```python
        self._async_handles[command_id.guid] = async_exec
        return None
    executed = stmt.execute()
except _kernel.KernelError as exc:
```
[major] execute_command's try/except only catches _kernel.KernelError around stmt.set_sql / submit / execute; PyO3 native exceptions (TypeError / OverflowError / ValueError raised by argument conversion or extension-internal Python errors) propagate unwrapped to the connector caller, breaking the documented contract that connector users only see PEP 249 exception types. Same shape applies to cancel_command / close_command / get_query_state / get_execution_result / all metadata methods. Fix: catch (Exception,) where the kernel call surface meets Python and route non-KernelError through a generic InterfaceError/OperationalError wrapper, or whitelist the documented PyO3 exception classes.
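The "catch where the kernel surface meets Python" fix could be factored into a small guard like this sketch. Exception names mirror PEP 249, `KernelError` stands in for the PyO3 type, and `_reraise_kernel_error` is collapsed to a one-liner; only the catch ordering is the point.

```python
class Error(Exception): ...
class OperationalError(Error): ...


class KernelError(Exception):
    """Stand-in for the PyO3 extension's exception type."""


def _reraise_kernel_error(exc: KernelError) -> Error:
    return Error(str(exc))   # stand-in for the real code→class table


def guarded(fn, *args, **kwargs):
    """Run a kernel call, mapping every escape hatch to a PEP 249 type."""
    try:
        return fn(*args, **kwargs)
    except KernelError as exc:
        raise _reraise_kernel_error(exc) from exc
    except Error:
        raise                 # already a DB-API exception; pass through
    except Exception as exc:
        # PyO3-native TypeError / OverflowError / ValueError get wrapped
        raise OperationalError(f"kernel call failed: {exc}") from exc
```

Routing every kernel-facing call (submit / execute / cancel / close / metadata) through one such guard keeps the "connector users only see PEP 249 types" contract in a single place.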
```python
command_id = CommandId.from_sea_statement_id(executed.statement_id)
cursor.active_command_id = command_id
return self._make_result_set(executed, cursor, command_id)
```
[major] Sync execute calls _make_result_set(executed, ...) AFTER the except _kernel.KernelError block at L293; `KernelResultSet.__init__` then calls kernel_handle.arrow_schema() (result_set.py:70) which can raise KernelError if the kernel must materialise the schema lazily, surfacing as raw KernelError instead of a mapped PEP 249 exception. Same gap exists for every metadata method (get_catalogs / get_schemas / get_tables / get_columns) and for get_execution_result. Fix: wrap the _make_result_set call inside the try/except, or have `KernelResultSet.__init__` defer arrow_schema() until first fetch.
```python
    is exhausted."""
    if self._exhausted:
        return False
    batch = self._kernel_handle.fetch_next_batch()
```
[major] _pull_one_batch and _drain call self._kernel_handle.fetch_next_batch() with no Python-side allow_threads / GIL release; if the PyO3 extension does not internally release the GIL during the underlying network/CloudFetch read, other Python threads on the same interpreter (e.g. concurrent cursors on the same connection, telemetry sinks) stall for the duration. This is the main reason Python users opt into a Rust backend in the first place. Fix: confirm the PyO3 binding releases the GIL inside fetch_next_batch (Py::detach / allow_threads); if not, file a kernel-side issue and document the limitation.
```python
# (doing so breaks `poetry lock`). Once published the install
# hint will move to `pip install 'databricks-sql-connector[kernel]'`.
raise ImportError(
    "use_kernel=True requires the databricks-sql-kernel package. Install it with:\n"
```
[minor] The ImportError hint at client.py:75 and the use_kernel docstring at databricks/sql/client.py:124 both instruct pip install databricks-sql-kernel, but pyproject.toml's own comment (lines 38-46) plus the PR description explicitly state the wheel is not yet published. A user hitting the ImportError today will run pip install databricks-sql-kernel and either install nothing or install an unrelated/squatted package. Fix: until the wheel is published, point users to the maturin-develop dev-install path only, or pre-publish the package name.
```python
    )
except _kernel.KernelError as exc:
    raise _reraise_kernel_error(exc)
if not table_types:
```
[minor] get_tables(..., table_types=[...]) drains the kernel stream through _drain_kernel_handle, applies ResultSetFilter._filter_arrow_table, and re-wraps in _StaticArrowHandle — none of which is exercised by test_kernel_client.py, test_kernel_result_set.py, or the e2e suite (the metadata e2e call passes no table_types). A regression in column-index 5 = TABLE_TYPE or in the case-sensitivity argument would ship silently. Fix: add a unit test that feeds a fake handle returning a known-shape arrow table and asserts the filtered output.
```python
# No tracked async handle means execute_command ran
# sync and the result was materialised before returning;
# the command is terminal by construction.
return CommandState.SUCCEEDED
```
[minor] After close_command(cid) pops the handle, a subsequent get_query_state(cid) returns SUCCEEDED — the lookup misses and the 'sync paths are terminal by construction' branch fires, but for an async-then-closed command the correct state is CLOSED, not SUCCEEDED. Tolerated by the cursor polling loop today but misleading for any caller branching on state. Fix: track closed command_ids in a small set (or store state alongside the handle) so the lookup can disambiguate sync-never-tracked from already-closed.
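The suggested fix — track closed command ids so the lookup can disambiguate — can be sketched as below. Enum values mirror the thread's `CommandState` names; the handle-tracking class and its method names are illustrative, not the connector's actual API.

```python
from enum import Enum


class CommandState(Enum):
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    CLOSED = "CLOSED"


class HandleTracker:
    def __init__(self):
        self._handles = {}       # guid -> live async handle
        self._closed = set()     # guids whose handle was explicitly closed

    def track(self, guid, handle):
        self._handles[guid] = handle

    def close_command(self, guid):
        if self._handles.pop(guid, None) is not None:
            self._closed.add(guid)

    def get_query_state(self, guid):
        if guid in self._handles:
            return CommandState.RUNNING   # real code would poll the handle
        if guid in self._closed:
            return CommandState.CLOSED
        # never tracked: sync execution, terminal by construction
        return CommandState.SUCCEEDED
```

The memory cost is one guid per closed async statement for the session's lifetime; sweeping `_closed` in close_session keeps it bounded.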
```python
if _is_pat(auth_provider):
    token = _extract_bearer_token(auth_provider)
    if not token:
        raise ValueError(
```
[minor] kernel_auth_kwargs raises bare ValueError when a PAT provider produces no Bearer header (and _extract_bearer_token raises ValueError on control-char-laden tokens), while the rest of the kernel-backend error surface routes misuse through PEP 249 exception types (ProgrammingError / NotSupportedError). A user catching DB-API exceptions will miss this case. Fix: raise ProgrammingError instead, or wrap in NotSupportedError so the auth-bridge error type is consistent with the surrounding API contract.
```python
field.name,
_arrow_type_to_dbapi_string(field.type),
None,
None,
```
[minor] description_from_arrow_schema hardcodes the 7th tuple element (null_ok) to None even though pyarrow Field exposes field.nullable directly. PEP 249 §Cursor.description says null_ok 'is True if NULL values are allowed'; callers that branch on null_ok will see all-None and lose useful information the kernel already provides. Fix: emit field.nullable for the 7th slot (None only when truly unknown).
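A sketch of emitting `field.nullable` into the 7th description slot. A duck-typed field stand-in (`name` / `type` / `nullable`) keeps it runnable without pyarrow, and the type-name mapper is collapsed to `str()` here; the real code would call its Arrow-to-DB-API mapper instead.

```python
def description_from_fields(fields):
    """Build PEP 249 7-tuples, filling null_ok from Arrow's field.nullable."""
    return [
        (f.name, str(f.type), None, None, None, None, f.nullable)
        for f in fields
    ]
```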
```python
# register in ``_async_handles``.
guid = getattr(self.command_id, "guid", None)
if guid is not None:
    self.backend._async_handles_lock.acquire()
```
[nit] result_set.py:237-241 uses an explicit acquire/finally/release pair where the rest of the codebase (and the same lock at client.py:233 / 289 / 310 / 325) uses a context-manager with block. Cosmetic but breaks consistency. Fix: with self.backend._async_handles_lock: self.backend._async_handles.pop(guid, None).
Summary
Phase 2 of the PySQL × kernel integration plan (design doc). Adds a new opt-in
`use_kernel=True` connection flag that routes through a new `backend/kernel/` module delegating to the Rust kernel via the `databricks_sql_kernel` PyO3 extension (kernel PR #13).

This replaces the previous ADBC POC branches (`backend/adbc/` and `backend/adbc_dm/` on `adbc-rust-backend-via-dm`, which were never merged) with a clean port that uses the kernel's v0 Databricks-native API directly instead of layering through ADBC.

Flag semantics
| Flag | Backend |
| --- | --- |
| `use_kernel=True` | `KernelDatabricksClient` (Rust kernel via PyO3) |
| `use_sea=True` | `SeaDatabricksClient` (pure-Python SEA) |
| (default) | `ThriftDatabricksClient` (Thrift) |

`use_kernel=True` and `use_sea=True` are mutually exclusive — passing both raises `ValueError`. Existing `use_sea=True` callers are unaffected.

What `use_kernel=True` does

New module layout
- `src/databricks/sql/backend/kernel/client.py` — `KernelDatabricksClient(DatabricksClient)`. Lazy-imports `databricks_sql_kernel` so a connector install without the kernel wheel doesn't fail at startup — only `use_kernel=True` surfaces the missing-extra ImportError.
- `src/databricks/sql/backend/kernel/auth_bridge.py` — translates the connector's `AuthProvider` to kernel `Session` auth kwargs. PAT (including `TokenFederationProvider`-wrapped PAT — every provider is wrapped, so the naive `isinstance(AccessTokenAuthProvider)` check has to look through the wrapper) routes through `auth_type='pat'`. Anything else raises `NotSupportedError` until the kernel exposes a full external-auth surface.
- `src/databricks/sql/backend/kernel/result_set.py` — `KernelResultSet(ResultSet)`. Duck-typed over the kernel's `ExecutedStatement` (sync exec) and `ResultStream` (metadata + async `await_result`); both expose `arrow_schema` / `fetch_next_batch` / `close`. FIFO batch buffer for `fetchmany(n)` semantics, with O(1) buffered-row accounting via a running counter.
- `src/databricks/sql/backend/kernel/type_mapping.py` — Arrow → PEP 249 description-string mapper.

Error mapping
`KernelError.code` → PEP 249 exception class, in a single table in `client.py`. Structured fields (`sql_state`, `error_code`, `query_id`, `http_status`, `retryable`, `vendor_code`) are copied onto the re-raised exception so callers can branch on `err.code` / `err.sql_state` directly. Live e2e verified: bad SQL on `use_kernel=True` surfaces as `DatabaseError (code='SqlError', sql_state='42P01')`.
Without the kernel wheel,
use_kernel=Trueraises:Local dev:
cd databricks-sql-kernel/pyo3 && maturin develop --releaseinto the connector's venv. (The[kernel]extra is intentionally not declared inpyproject.tomlyet —databricks-sql-kernelisn't on PyPI, and declaring an unpublished dep breakspoetry lockfor every CI job. The extra will land once the wheel is on PyPI.)execute_command(parameters=[...])raisesNotSupportedErrorStatement.bind_paramlands in a small follow-up PR on the kernel repoquery_tagsNotSupportedErrorstatement_confget_columns(catalog_name=None)ProgrammingError(kernel'sSHOW COLUMNScannot span catalogs)Cursor.columns()docstringNotSupportedErrorfrom the auth bridgeexecution_result/retry_count/ chunk-level latency under-report onuse_kernel=Truessl_options/http_headers/http_clientare accepted-and-ignoredCode review feedback addressed in this revision
Multi-reviewer (architecture, security, ops, performance, test, maintainability, agent-compat, language, devil's advocate) review surfaced several issues; the highest-impact ones are addressed in commits `37fa5446` (mechanical) and `24e9a5c2` (substantive):

- Moved the kernel backend off `use_sea=True` to a dedicated `use_kernel=True`; native SEA routing is unchanged. (Was the largest reviewer concern.)
- Client-side `table_types` filter in `get_tables` using the SEA backend's `_filter_arrow_table` helper, replacing the previous "log a warning and return unfiltered" behaviour.
- Replaced O(M²) buffered-row accounting in `KernelResultSet` with a running counter (`_buffered_count`).
- `KernelResultSet.close()` now drops the entry from `backend._async_handles` to avoid stale references.
- `threading.RLock` around `_async_handles` mutations / reads for concurrent-cursor safety.
- Synthetic `CommandId`s use plain `uuid.uuid4().hex` (no `metadata-` prefix) so `cursor.query_id` stays parseable downstream.
- Cleared the raw access token from `KernelDatabricksClient._auth_kwargs` after the kernel `Session` is constructed.
- Dropped the unused `_use_arrow_native_complex_types` kwarg; tightened the `_reraise_kernel_error` signature and dropped its dead branch; collapsed three `KernelResultSet(...)` construction sites through one `_make_result_set` helper.
- `Cursor.columns()` docstring now documents the `catalog_name=None` divergence on `use_kernel=True`.
- Auth-bridge federation test constructs a real `TokenFederationProvider(http_client=Mock())` instead of bypassing `__init__`.

Test plan
- `tests/unit/test_kernel_client.py` (new in this revision, 38 cases) — covers `_CODE_TO_EXCEPTION` (14 entries), `_reraise_kernel_error` attribute forwarding, `_STATE_TO_COMMAND_STATE` (6 entries), all no-open-session guards, `open_session` double-open, parameters / query_tags rejection, `get_columns` catalog-required, `cancel_command` / `close_command` tolerance, `get_query_state` sync-path SUCCEEDED and Failed-state re-raise, synthetic CommandId shape, `close_session` cleanup on partial close failures. Uses a fake `databricks_sql_kernel` module so the test runs with no Rust extension dependency.
- `tests/unit/test_kernel_auth_bridge.py` — PAT, federation-wrapped PAT (now via a real `TokenFederationProvider(http_client=Mock())`), non-PAT rejection paths.
- `tests/unit/test_kernel_type_mapping.py` — Arrow type mapping per type, description-tuple shape, fallback to `str()` for unknowns.
- `tests/unit/test_kernel_result_set.py` — buffer semantics, `fetchmany` slicing within batch + across batch boundaries, idempotent close, close() swallowing handle-close failures, empty stream.
- The pre-existing `test_useragent_header` failure (agent detection adds `agent/claude-code` in this env; fails on `main` too) is unrelated to this change.
- Live e2e on `use_kernel=True` (PAT): SELECT 1, `SELECT * FROM range(10000)`, `fetchmany` pacing, `fetchall_arrow`, all four metadata calls (catalogs / schemas / tables / columns), `session_configuration={'ANSI_MODE': 'false'}` round-trips, bad SQL surfaces as DatabaseError with `code='SqlError'` and `sql_state='42P01'`.

This pull request and its description were written by Isaac.