development_jobs #18
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "development_jobs"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
PR Review:
development_jobs(#18)14 commits, ~200 files, +11,777 / -2,378 lines
Architecture — Solid
zinit_sdk,zinit_server,zinit_ui,zinit_jobs,zinit_rhai,zinit_pid1,zinit(CLI). Zero cross-deps between binaries — all through SDK.rust-embed(no CDN). Single binary ships everything.make run/make rundevscreen sessions + health checks.Critical Issues (must fix before merge)
JobAttemptrecords never populated —Job.attemptsis alwaysVec::new(). Never written during completion or retry.job.attempts()RPC returns empty forever. Persisted jobs haveattempts: []. Data loss.zinit_jobs/src/manager.rsJobManagerwith a freshscript_filesRwLock disconnected from the parent. Orphaned temp files, no cancellation ifjob.delete()called mid-retry.zinit_jobs/src/manager.rs:~549sync_from_zinit()— Readswas_terminalthen separately updates. Another task could transition the job between read and write, causing duplicate terminal processing (persist + cleanup + retry fired twice).zinit_jobs/src/manager.rsWaitingforever.detect_cycle()exists but only called for graph display.zinit_jobs/src/manager.rs:~101afterfield in JobSpec silently ignored — Parses but does nothing. Must usedepends_onwith explicit{"job_sid":"<SID>","dep_type":"after"}. Documented in test plan but not in API. Footgun.zinit_sdk/src/jobs.rs,zinit_jobs/src/manager.rsspec.timeout_msmaps tostop_timeout_ms(graceful shutdown), not a wall-clock deadline. A job withtimeout_ms: 5000can run forever.zinit_jobs/src/manager.rs:~117High-Severity Issues (should fix soon)
Runningbefore zinit starts it — Phase set to Running immediately, butservice.startmay fail/delay. Stale state visible to consumers.zinit_jobs/src/manager.rs:~188wake_dependents()only checks first 100 jobs —list()uses default pagination limit. >100 waiting jobs = some never woken.zinit_jobs/src/manager.rs:~567client.set()fails, job removed from store but script file + log dir left behind. If set succeeds but response fails, orphaned zinit service with no job.zinit_jobs/src/manager.rslet _ = client.stop/delete(...). Job state updates proceed even when zinit ops fail → state divergence.zinit_jobs/src/manager.rsArc<JobManager>, runs forever. No cancellation token.zinit_jobs/src/manager.rs:~80Medium / Design Concerns
.unwrap()in IPC afterget_by_name()— low probability panic under concurrent graph mutation/openrpc.jsonreturns 404 — onlyrpc.discoverworksCI Status
Builds are failing — likely related to some of the above.
Verdict: Architecture is well-structured. Workspace split and SDK-as-sole-dependency is the right pattern. Job system needs concurrency/data-integrity fixes before merge. Will start working on critical + high-severity fixes.
Fixes Applied
All critical and high-severity issues from review have been addressed in two commits:
[
4c8d977] fix: job manager concurrency bugs, data integrity, and pre-existing test failuresCritical (1–6):
job.attempts()returns real datascript_filesshared viaArc<RwLock>, spawned retry usesCancellationTokenfor cancellationsync_from_zinit()eliminated —was_terminalcheck + state update in single write lock acquisitionJOB_CYCLE_DETECTEDafter/requiresshorthand onJobSpec— auto-converted todepends_onentries vianormalize_deps()timeout Nscommand,stop_timeout_msset to timeout + 5s graceHigh-severity (7–11):
set()succeeds, not beforewake_dependents()unlimited — useslimit: u32::MAXinstead of default 100let _ =replaced withwarn!()on all zinit client callsCancellationToken+Dropimpl cancels on manager dropPre-existing bug fixes:
retry()now acceptsRetryingphase (auto-retry sets this before calling retry — was causing jobs to get stuck)JobFilter::Defaultgiveslimit: 100instead of0(was breaking all filtered list queries)[
191eb6d] fix: remove unsafe unwrap in IPC, clean up warnings, update test plan.unwrap()with proper error response inhandle_statusandhandle_get_configIPC handlersXinetConfigimport in integration test fixturesTEST_PLAN.md: markafterfield known issue as fixedTest Results
zinit_sdkunit testszinit_serverunit testszinit_jobsintegration testsxinetintegration testsThe 2 previously failing job tests (
test_job_retry_policy,test_demo_pipeline_e2e) now pass. The 6 xinet failures are pre-existing and unrelated to the job system.Remaining (deferred, non-blocking)
Server deserializes params directly as XinetConfig but the client was sending {config: {...}} causing 'missing field name' deserialization error. Send the config directly, matching the async_client behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>The fallthrough case returned Some(PortOwner { pid: 0, name: None }) even when the port was not found, causing test_check_port_in_use_free_port to fail. Return None instead to correctly indicate the port is not in use. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>Same bug as find_port_owner_linux — when lsof doesn't find a listener, the fallthrough returns Some(PortOwner { pid: 0, name: None }) instead of None. This may also help if the previous CI run was stale. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>CI Fixes Summary
All CI tests now pass. Here's a summary of the fixes made across this PR:
Build & Infrastructure
cargo build --workspacebeforecargo testso integration tests can find binaries$(CARGO_ENV)totest-bashandtest-rhaitargets socargois in PATH on CI--features fullwith--workspace(5 scripts:run-all.sh,harness.sh,run.sh,run-tui.sh, rhairun-all.sh)*.rhaito[0-9][0-9]_*.rhaito skip debug scripts likerepro_deadlock.rhaiBinary Discovery
find_pid1_binary(): Fixed workspace root detection — added|| content.contains("zinit_pid1")checkfind_server_binary(): Madepubso pid1 tests can use itzinit_pid1::spawn_server(): AddedZINIT_SERVER_BINenv var support instead of only checking hardcoded system paths (/sbin/zinit_server)ZINIT_SERVER_BINenv var to pid1 processSDK & Protocol Fixes
ZinitClient::xinet_set: Removed extra{config: ...}wrapper — server expects flat paramsuse zinit::...→use zinit_sdk::...Test Correctness
test_oneshot_service_success→State::Success(notInactive),test_manual_stop_prevents_restart→State::Exited(notInactive),test_oneshot_service→State::Success(notExited)ExternalPortConflictwhen services like PostgreSQL are runningfind_port_owner_linux/macos: Fixed fallthrough returningSome(PortOwner)when port is NOT in use — now returnsNonedrop(listener)to let kernel flush/proc/net/tcpentry on CI