
feat(wait, workflows): Async toggle, chained-wait resume fix, execution status API #4514

Merged: TheodoreSpeaks merged 13 commits into staging from fix/resume-poll-partially-resumed on May 16, 2026

TheodoreSpeaks (Collaborator) commented May 8, 2026

Summary

Three changes bundled in one PR — all related to the wait-block / paused-execution lifecycle.

1. Async toggle on the Wait block (new)

  • New Async switch on the Wait block (off by default)
  • Off → in-process sleep, capped at 5 minutes; units are seconds/minutes (restores the behavior from before #4331, "feat(block): Allow wait block to wait up to 30 days")
  • On → always suspends via PauseMetadata regardless of duration, up to 30 days, units are minutes/hours/days (seconds disabled)
  • Wait Amount input description shows the limits: `Max 5 minutes (300 seconds). Enable Async for up to 30 days.`
  • Drops the now-unused `WAIT_INPROCESS_MAX_MS` env knob — the toggle is the explicit replacement
  • Known break: workflows saved after #4331 ("feat(block): Allow wait block to wait up to 30 days") with `timeUnit: 'hours'` or `'days'` will throw `Wait time exceeds maximum of 5 minutes; enable async mode to wait up to 30 days` at runtime. This is intentional: re-toggling Async is the migration. Flagged by Greptile and kept as-is per author guidance
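
The sync/async split described in the list above can be sketched roughly as follows. This is a minimal illustration, not the actual `wait-handler.ts` code: the function name `planWait`, the input field names, and the two non-quoted error messages are assumptions. Only the limits, the unit sets, the `'minutes'` default for a missing long unit, and the 5-minute error text come from this PR.

```typescript
// Sketch only: names are hypothetical; the limits and the 5-minute error text come from the PR description.
type Unit = 'seconds' | 'minutes' | 'hours' | 'days'

interface WaitInputs {
  timeValue: number
  async?: boolean
  timeUnit?: Unit // read when Async is off
  timeUnitLong?: Unit // read when Async is on
}

interface WaitPlan {
  mode: 'sleep' | 'suspend' // in-process sleep vs. persisted pause
  waitMs: number
}

const UNIT_MS: Record<Unit, number> = {
  seconds: 1000,
  minutes: 60 * 1000,
  hours: 60 * 60 * 1000,
  days: 24 * 60 * 60 * 1000,
}
const SYNC_MAX_MS = 5 * UNIT_MS.minutes
const ASYNC_MAX_MS = 30 * UNIT_MS.days

function planWait(inputs: WaitInputs): WaitPlan {
  if (inputs.async) {
    // A missing long unit falls back to minutes rather than seconds.
    const unit = inputs.timeUnitLong ?? 'minutes'
    if (unit === 'seconds') {
      throw new Error('Seconds are not available in async mode') // hypothetical message
    }
    const waitMs = inputs.timeValue * UNIT_MS[unit]
    if (waitMs > ASYNC_MAX_MS) {
      throw new Error('Wait time exceeds maximum of 30 days') // hypothetical message
    }
    return { mode: 'suspend', waitMs } // async always suspends, however short the wait
  }
  const waitMs = inputs.timeValue * UNIT_MS[inputs.timeUnit ?? 'seconds']
  if (waitMs > SYNC_MAX_MS) {
    throw new Error('Wait time exceeds maximum of 5 minutes; enable async mode to wait up to 30 days')
  }
  return { mode: 'sleep', waitMs }
}
```

Under these assumptions, a legacy config such as `{ timeValue: 2, timeUnit: 'hours' }` with Async off hits the 5-minute error, which matches the known break noted above.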

2. Cron poll picks up `partially_resumed` rows

  • After #4450 ("fix(terminal): terminal console update for child spans + hitl state machine"), `persistPauseResult` merges new pause points with existing ones, so a chained-wait workflow ends up with a row in `status = 'partially_resumed'` (wait1 marked resumed, wait2 still paused) once wait1 finishes
  • The cron poll's `WHERE status = 'paused'` filter excluded those rows, so wait2 was never dispatched. Verified in prod logs: execution `2e9e4780...` had wait1 fire, agent run, then suspended for wait2 — and wait2 sat for 1h+ with `next_resume_at` long elapsed while every cron tick reported `claimedRows: 0`
  • Filter now matches `paused` and `partially_resumed`. `setNextResumeAt` was widened to the same set so it can null `nextResumeAt` after dispatching a `partially_resumed` row (without this, dispatched rows kept reappearing in every poll batch — flagged by both Greptile and Bugbot)
  • The existing partial index is gated on `status = 'paused'` so partially-resumed rows fall through to a sequential scan; volume is small enough that this is acceptable. Widening the index predicate can be a follow-up if scan time becomes an issue
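
To make the two halves of this fix concrete, here is a small in-memory model of the poll filter and the `setNextResumeAt` guard. It is only a sketch: the real code runs SQL through the ORM, and the row shape and function names below (`claimDueRows`, `clearNextResumeAt`) are hypothetical stand-ins.

```typescript
// In-memory model of the cron poll; the row shape and function names are hypothetical.
interface PausedRow {
  id: string
  status: string
  nextResumeAt: Date | null
}

const POLL_ELIGIBLE: readonly string[] = ['paused', 'partially_resumed']

// Mirrors: WHERE status IN (...) AND nextResumeAt <= now
function claimDueRows(
  rows: PausedRow[],
  now: Date,
  eligible: readonly string[] = POLL_ELIGIBLE
): PausedRow[] {
  return rows.filter(
    (row) =>
      eligible.indexOf(row.status) !== -1 &&
      row.nextResumeAt !== null &&
      row.nextResumeAt.getTime() <= now.getTime()
  )
}

// Mirrors the guarded update: it only touches rows whose status is in the eligible set.
function clearNextResumeAt(row: PausedRow, eligible: readonly string[] = POLL_ELIGIBLE): boolean {
  if (eligible.indexOf(row.status) === -1) return false // the update matches nothing
  row.nextResumeAt = null
  return true
}
```

With `eligible` narrowed to `['paused']` (the old behavior), a due `partially_resumed` row is never claimed. If only the filter is widened and the guard is not, the row is claimed but `clearNextResumeAt` does nothing, so the same row shows up again on the next tick.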

3. New `GET /api/workflows/[id]/executions/[executionId]` status endpoint

Lightweight polling primitive that normalizes execution status across `workflowExecutionLogs` and `pausedExecutions` in one response — bridges the gap between `/api/jobs/{jobId}` (only tracks async-launched jobs, marks complete on first suspend) and `/api/v1/logs/executions/{eid}` (heavy snapshot dump, no live paused state).

Request:
```
GET /api/workflows/{workflowId}/executions/{executionId}
?includeOutput=true (optional — include finalOutput when status=completed)
?selectedOutputs=blockId,blockId.field,blockId.nested.path
X-API-Key: <your-api-key>
```
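
As a rough illustration of how a client might assemble this request path, the helper below joins the two optional parameters with `&`, as in any standard query string. The helper name and option shape are hypothetical, and it assumes the server splits `selectedOutputs` on literal commas.

```typescript
// Hypothetical client-side helper; only the route and parameter names come from the PR.
interface StatusQuery {
  includeOutput?: boolean
  selectedOutputs?: string[] // e.g. ['blockId', 'blockId.field']
}

function buildStatusPath(workflowId: string, executionId: string, query: StatusQuery = {}): string {
  const params: string[] = []
  if (query.includeOutput) {
    params.push('includeOutput=true')
  }
  if (query.selectedOutputs && query.selectedOutputs.length > 0) {
    // Encode each selector on its own so the separating commas stay literal (assumed server behavior).
    const encoded = query.selectedOutputs.map((selector) => encodeURIComponent(selector))
    params.push('selectedOutputs=' + encoded.join(','))
  }
  const base =
    '/api/workflows/' + encodeURIComponent(workflowId) +
    '/executions/' + encodeURIComponent(executionId)
  return params.length > 0 ? base + '?' + params.join('&') : base
}
```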

Response:
```jsonc
{
  "executionId": "...",
  "workflowId": "...",
  "status": "pending" | "running" | "paused" | "completed" | "failed" | "cancelled",
  "trigger": "manual" | "api" | "schedule" | "webhook" | "chat",
  "level": "info" | "warning" | "error",
  "startedAt": "...",
  "endedAt": "..." | null,
  "totalDurationMs": 153035 | null,
  "paused": null | {
    "pausedAt": "...",
    "resumeAt": "...", // earliest next-resume across active pause points
    "pauseKind": "time" | "human" | null,
    "blockedOnBlockId": "..." | null,
    "pausedExecutionId": "...",
    "pausePointCount": 1,
    "resumedCount": 0
  },
  "cost": { "total": 0.005 } | null,
  "error": "Wait 1: ..." | null,
  "finalOutput": { ... } | null, // only when ?includeOutput=true and status=completed
  "blockOutputs": { // only when ?selectedOutputs is set
    "<blockId>": { ...full output },
    "<blockId>.<field>": <value at that path>
  } | null
}
```
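
The `paused` object aggregates over the execution's pause points. A minimal sketch of that aggregation is shown below, assuming each pause point carries a `resumeStatus` and an optional ISO `resumeAt`. The `PausePoint` shape, the `'paused'`/`'resumed'` values, and the choice to count all points in `pausePointCount` are assumptions made for illustration only.

```typescript
// Assumed shapes; only the field names resumeAt, pausePointCount, and resumedCount come from the response above.
interface PausePoint {
  blockId: string
  resumeStatus: 'paused' | 'resumed' // assumed values
  resumeAt: string | null // ISO timestamp; null when the pause has no timer
}

interface PauseSummary {
  resumeAt: string | null
  pausePointCount: number
  resumedCount: number
}

function summarizePausePoints(points: PausePoint[]): PauseSummary {
  let earliest: string | null = null
  let resumedCount = 0
  for (const point of points) {
    if (point.resumeStatus === 'resumed') {
      resumedCount += 1
      continue // resumed points no longer influence resumeAt
    }
    if (point.resumeAt === null) continue
    if (earliest === null || Date.parse(point.resumeAt) < Date.parse(earliest)) {
      earliest = point.resumeAt // keep the earliest timestamp among active points
    }
  }
  return { resumeAt: earliest, pausePointCount: points.length, resumedCount }
}
```

For a chained-wait run where wait1 has resumed and wait2 is still pending, this yields `resumedCount: 1` and a `resumeAt` equal to wait2's timestamp.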

Auth uses `validateWorkflowAccess` — same scoping as `/paused`. The URL workflow-scoping and DB `(executionId AND workflowId)` query are belt-and-suspenders against cross-workflow ID typos; the executionId is globally unique. Returns 404 for both unknown executionId and mismatched workflowId/executionId pairs.
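
The 404 behavior can be illustrated with a small in-memory lookup that requires both IDs to match. This is a sketch of the idea rather than the route code: the row shape, function name, and error body are hypothetical.

```typescript
// In-memory stand-in for the (executionId AND workflowId) query; names and error text are hypothetical.
interface ExecutionLogRow {
  executionId: string
  workflowId: string
  status: string
}

type LookupResult =
  | { httpStatus: 200; row: ExecutionLogRow }
  | { httpStatus: 404; error: string }

function findExecution(rows: ExecutionLogRow[], workflowId: string, executionId: string): LookupResult {
  for (const row of rows) {
    // Both IDs must match, so a valid executionId under the wrong workflow is indistinguishable from an unknown one.
    if (row.executionId === executionId && row.workflowId === workflowId) {
      return { httpStatus: 200, row }
    }
  }
  return { httpStatus: 404, error: 'Execution not found' }
}
```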

When to use what:

| Goal | Endpoint |
| --- | --- |
| Track async-launched job (queue lifecycle, doesn't span suspends) | `/api/jobs/{jobId}` |
| Full lifecycle of any execution, including post-suspend resumes | `GET /api/workflows/{id}/executions/{executionId}` (this PR) |
| List currently-paused executions for a workflow | `/api/workflows/{id}/paused` |
| Historical log inspection / dashboards | `/api/v1/logs?workflowId=...` |

Type of Change

  • Bug fix (cron filter, setNextResumeAt guard)
  • New feature (Async toggle, status endpoint, selectedOutputs)

Testing

  • Wait handler unit tests: 22/22 passing
  • End-to-end against a 3-wait workflow via sync, stream, and async `/execute` APIs; cron resume cycled through wait1 → wait2 → wait3 to completion (~2m32s = 130s of waits + cron-poll latency)
  • New status endpoint hit against all four states: completed (cost + duration), failed (error string surfaced from `executionData.finalOutput.error`), currently-paused (`resumeAt`, `pauseKind`, `blockedOnBlockId`), unknown executionId (404)
  • `?selectedOutputs=blockId,blockId.field` returns per-block outputs filtered from `executionData.traceSpans`
  • `bun run lint` clean. `bun run check:api-validation:strict` passing (baseline bumped 734 → 745: +9 from staging, +1 from this PR)
  • Both Greptile and Bugbot findings on the partially_resumed flow addressed; one Bugbot dead-code finding (`tooltip` on a switch sub-block) removed
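
The `selectedOutputs` filtering mentioned above (read trace spans, walk children recursively, resolve dot-paths, return the full output for a bare block ID) can be sketched as follows. The `TraceSpan` shape and all function names are assumptions. The sketch also assumes block IDs contain no dots and that the last span seen for a block wins.

```typescript
// Assumed span shape; the real traceSpans structure may carry more fields.
interface TraceSpan {
  blockId?: string
  output?: unknown
  children?: TraceSpan[]
}

// Walk spans depth-first and index each block's output by blockId (last occurrence wins).
function collectOutputs(
  spans: TraceSpan[],
  acc: Map<string, unknown> = new Map<string, unknown>()
): Map<string, unknown> {
  for (const span of spans) {
    if (span.blockId !== undefined && span.output !== undefined) {
      acc.set(span.blockId, span.output)
    }
    if (span.children) collectOutputs(span.children, acc)
  }
  return acc
}

// Follow a dot-path into a value; returns undefined as soon as a segment cannot be resolved.
function resolvePath(value: unknown, path: string[]): unknown {
  let current: unknown = value
  for (const key of path) {
    if (typeof current !== 'object' || current === null) return undefined
    current = (current as Record<string, unknown>)[key]
  }
  return current
}

function selectOutputs(spans: TraceSpan[], selectors: string[]): Record<string, unknown> {
  const byBlock = collectOutputs(spans)
  const result: Record<string, unknown> = {}
  for (const selector of selectors) {
    const [blockId, ...path] = selector.split('.')
    if (blockId === undefined || !byBlock.has(blockId)) continue // unknown blocks are skipped
    result[selector] = resolvePath(byBlock.get(blockId), path) // empty path returns the full output
  }
  return result
}
```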

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

The chained-pause flow leaves a row in 'partially_resumed' status (wait1 done, wait2 still waiting). The poll's WHERE filter only matched 'paused', so wait2 was never picked up. Include 'partially_resumed' in the filter.

cursor Bot commented May 8, 2026

PR Summary

Medium Risk
Medium risk because it changes wait-block runtime behavior (new async mode + stricter sync limits) and broadens the cron resume poller to process additional paused-execution states, which could affect execution lifecycle and resumption behavior in production.

Overview
Adds a new polling endpoint, GET /api/workflows/{id}/executions/{executionId}, that returns a normalized execution lifecycle status (including paused-state details), optional finalOutput, and optional per-block outputs via selectedOutputs; docs/OpenAPI are updated with the new route and WorkflowExecutionStatus schema.

Updates the Wait block to introduce an explicit Async toggle: sync waits are now capped at 5 minutes (seconds/minutes), while async waits always suspend via pause metadata (minutes/hours/days, up to 30 days), with updated UI copy and expanded unit tests.

Fixes chained-wait resumption by having the cron resume poller and PauseResumeManager.setNextResumeAt treat pausedExecutions.status='partially_resumed' as poll-eligible (alongside paused), preventing missed resumes and repeated polling of dispatched rows.

Reviewed by Cursor Bugbot for commit 2ba29c6.


greptile-apps Bot commented May 8, 2026

Greptile Summary

This PR bundles three wait/pause lifecycle improvements: an explicit Async toggle on the Wait block that separates in-process sleeps (≤5 min) from disk-persisted suspensions (≤30 days), a cron-poll fix that picks up partially_resumed rows so chained-wait workflows actually resume, and a new GET /api/workflows/{id}/executions/{executionId} status endpoint for polling execution lifecycle across suspensions.

  • Async toggle (wait.ts, wait-handler.ts): Replaces the implicit threshold with an explicit user-controlled switch; adds timeUnitLong for async mode (minutes/hours/days) while capping the sync dropdown at seconds/minutes.
  • Cron poll fix (poll/route.ts, human-in-the-loop-manager.ts): Widens the WHERE status = 'paused' filter to include partially_resumed in both the dispatch query and the setNextResumeAt guard, preventing rows from re-appearing in every poll batch.
  • Status endpoint ([executionId]/route.ts, contracts/workflows.ts): New route that merges workflowExecutionLogs and pausedExecutions into a unified status response, with optional ?includeOutput and ?selectedOutputs support.

Confidence Score: 5/5

Safe to merge — the cron-poll fix is targeted and the new endpoint introduces no auth bypasses or data leaks.

The partially_resumed poll fix is a narrow two-line change with clear before/after semantics, verified against production logs and covered by the existing test suite. The async toggle is additive with an explicit fallback for legacy configs. The new status endpoint reuses the same validateWorkflowAccess middleware as existing paused-execution routes, and its DB queries are scoped by both executionId and workflowId. The resumedCount field is notNull().default(0) in the schema, so the non-nullable Zod field is safe. No data mutation occurs on the GET path.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| `apps/sim/app/api/resume/poll/route.ts` | Widens WHERE status = 'paused' to inArray(['paused', 'partially_resumed']), the core cron-poll bug fix; inArray was already imported in this file. |
| `apps/sim/lib/workflows/executor/human-in-the-loop-manager.ts` | Widens setNextResumeAt guard to include 'partially_resumed' so the cron can null nextResumeAt after dispatching, preventing permanent re-appearance in poll batches. |
| `apps/sim/executor/handlers/wait/wait-handler.ts` | Replaces the single duration threshold with an isAsync flag; sync path capped at 5 min, async path always suspends via PauseMetadata and caps at 30 days. Logic and defaults are correct. |
| `apps/sim/app/api/workflows/[id]/executions/[executionId]/route.ts` | New status endpoint merging workflowExecutionLogs and pausedExecutions; two DB queries run sequentially where parallelism is possible. PausePoint.resumeStatus field usage is correct per types.ts. |
| `apps/sim/blocks/blocks/wait.ts` | Adds Async switch and a second unit dropdown (timeUnitLong) conditioned on async=true; removes hours/days from the sync dropdown. Conditions and defaults are consistent with the handler. |
| `apps/sim/lib/api/contracts/workflows.ts` | Adds WorkflowExecutionStatusResponse schema and getWorkflowExecutionContract; correctly reuses existing workflowExecutionParamsSchema. |
| `apps/sim/executor/handlers/wait/wait-handler.test.ts` | Updates existing tests for hours/days to use async mode; adds new cases for 5-min rejection, async-always-suspends, seconds-in-async rejection, and missing timeUnitLong default. 22/22 passing per PR. |

Sequence Diagram

sequenceDiagram
    participant Client
    participant StatusAPI as GET /executions/{id}
    participant PollCron as Cron /resume/poll
    participant HITLMgr as HumanInTheLoopManager
    participant DB

    Client->>StatusAPI: "GET ?selectedOutputs=blockId"
    StatusAPI->>DB: SELECT workflowExecutionLogs WHERE (executionId, workflowId)
    StatusAPI->>DB: SELECT pausedExecutions WHERE executionId
    DB-->>StatusAPI: logRow + pausedRow (status: partially_resumed)
    StatusAPI-->>Client: "{status: paused, resumeAt, blockedOnBlockId}"

    Note over PollCron,DB: Chained-wait resume fix
    PollCron->>DB: "SELECT pausedExecutions WHERE status IN ('paused','partially_resumed') AND nextResumeAt <= now"
    DB-->>PollCron: partially_resumed row (wait2 due)
    PollCron->>HITLMgr: dispatchRow(executionId)
    HITLMgr->>DB: setNextResumeAt(null) WHERE status IN ('paused','partially_resumed')
    Note right of DB: Previously WHERE status='paused' was a no-op for partially_resumed rows

    Note over Client,DB: Async Wait block
    Client->>StatusAPI: GET (after async wait completes)
    StatusAPI->>DB: SELECT workflowExecutionLogs
    DB-->>StatusAPI: status: completed
    StatusAPI-->>Client: "{status: completed, finalOutput: ...}"
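
The first exchange in the diagram, where a `partially_resumed` row in `pausedExecutions` surfaces to the client as `status: paused`, could be modeled along these lines. This is a guess at the shape of the normalization, not the route's actual logic: the function name, the inputs, and the rule that a terminal log status takes precedence are all assumptions.

```typescript
// Hypothetical normalization; only the status vocabulary and the partially_resumed -> paused mapping come from the PR.
type ExecutionStatus = 'pending' | 'running' | 'paused' | 'completed' | 'failed' | 'cancelled'

interface PausedExecutionRow {
  status: string // e.g. 'paused' or 'partially_resumed'
}

const ACTIVE_PAUSE_STATUSES: readonly string[] = ['paused', 'partially_resumed']

function normalizeStatus(
  logStatus: ExecutionStatus,
  pausedRow: PausedExecutionRow | null
): ExecutionStatus {
  // Assumption: once the log row is terminal, a leftover paused row is ignored.
  const isTerminal = logStatus === 'completed' || logStatus === 'failed' || logStatus === 'cancelled'
  if (!isTerminal && pausedRow !== null && ACTIVE_PAUSE_STATUSES.indexOf(pausedRow.status) !== -1) {
    return 'paused'
  }
  return logStatus
}
```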

Reviews (3): Last reviewed commit: "docs(api): tighten getWorkflowExecution ..."

Adds WAIT_INPROCESS_MAX_MS env var (default 300000ms = 5 min). Lower it locally (e.g. 5000) to exercise the suspend/cron-resume path with short waits.
TheodoreSpeaks changed the title from "fix(wait): poll partially_resumed rows so chained waits resume" to "feat(wait): Suspend Workflow toggle + chained-wait resume fix" on May 15, 2026
- setNextResumeAt now matches paused OR partially_resumed; otherwise the
  cron poller can't null nextResumeAt after dispatching a chained-wait
  row, so it keeps reappearing in every poll batch until execution ends
  (flagged by both greptile and bugbot)
- Suspend mode now defaults missing timeUnitLong to 'minutes' instead of
  falling back to 'seconds' and immediately erroring (flagged by bugbot)
TheodoreSpeaks (Collaborator, Author) commented:

@BugBot review @greptile review

cursor Bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Reviewed by Cursor Bugbot for commit 4c7bc83.

Restores the pre-#4331 description on the Wait Amount field so the limit is visible before submit instead of only at runtime. Mentions both the 5 min default and the 30 day cap with Suspend Workflow.
…tus endpoint

Normalized status (pending|running|paused|completed|failed|cancelled)
across workflowExecutionLogs and pausedExecutions in a single response.
Surfaces paused-state details (resumeAt, pauseKind, blockedOnBlockId)
when a row exists in pausedExecutions, and the error string for failed
runs. finalOutput is opt-in via ?includeOutput=true.
Returns per-block outputs filtered by selectedOutputs paths (same
shape as the execute endpoint). Reads from executionData.traceSpans,
walks children recursively, and resolves dot-paths into each block's
output. Bare blockId returns the full output.
Switch sub-blocks return null from renderLabel (sub-block.tsx:238), so
the tooltip never reached the user. The trade-off explanation already
lives in longDescription and bestPractices. Flagged by bugbot.
…rtially-resumed

# Conflicts:
#	scripts/check-api-validation-contracts.ts
TheodoreSpeaks changed the title from "feat(wait): Suspend Workflow toggle + chained-wait resume fix" to "feat(wait, workflows): Async toggle, chained-wait resume fix, execution status API" on May 16, 2026
Adds the WorkflowExecutionStatus schema and the getWorkflowExecution
operation to the OpenAPI spec, including completed/paused/failed
response examples and the includeOutput + selectedOutputs query params.
Registers the page in the Workflows section of the API reference.
TheodoreSpeaks (Collaborator, Author) commented:

@BugBot review @greptile review

cursor Bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Reviewed by Cursor Bugbot for commit 2ba29c6.

@TheodoreSpeaks TheodoreSpeaks merged commit fffb879 into staging May 16, 2026
14 checks passed
@waleedlatif1 waleedlatif1 deleted the fix/resume-poll-partially-resumed branch May 16, 2026 02:10