lab: service core leaves the host in a non-recoverable state after a failed start; related CLI/docs polish #282
lhumina_code/hero_skills#282
Summary
Five separate medium-severity issues observed during a fresh-install test of `lab` on Ubuntu 24. Grouped here for triage convenience; happy to split into individual issues if preferred. None are install-blockers on their own, but each makes `lab` less pleasant or harder to recover from.

Environment
- `lab` version: `lab 0.1.0`
- `curl … install.sh | bash` + `lab user init` + (eventually-successful) `lab install core`

1. `lab service <X> --stop` fails when the service is in `failed` state

After a `lab service core` run that failed at the `hero_aibroker_server` smoke-test phase, the service is left in `failed` state with `restarts: 4` and there is no `lab` verb that can clean it up.

Reproduce:
Note the contradictory wording: "may already be stopped" (i.e., fine) immediately followed by "stop failed — state 'failed'" (i.e., not fine).
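
One way to avoid that contradiction is to decide up front which states need a stop signal and which only need cleanup, and to derive the message from that single decision. A minimal sketch in Rust, assuming a three-state model; `ServiceState`, `StopAction`, `plan_stop`, and `stop_message` are hypothetical names for illustration, not lab's actual API, and the message text is made up:

```rust
/// Service states as a supervisor might report them (illustrative, not hero_proc's real model).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ServiceState {
    Running,
    Stopped,
    Failed,
}

/// What `--stop` has to do for a given state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StopAction {
    /// The process is alive: signal it first, then clean up.
    SignalThenCleanup,
    /// Nothing is running: only deregister and remove leftovers.
    CleanupOnly,
}

/// `failed` is handled exactly like `stopped`: there is no process to signal.
pub fn plan_stop(state: ServiceState) -> StopAction {
    match state {
        ServiceState::Running => StopAction::SignalThenCleanup,
        ServiceState::Stopped | ServiceState::Failed => StopAction::CleanupOnly,
    }
}

fn state_label(state: ServiceState) -> &'static str {
    match state {
        ServiceState::Running => "running",
        ServiceState::Stopped => "stopped",
        ServiceState::Failed => "failed",
    }
}

/// Builds one message per outcome, so "fine" and "not fine" cannot both be printed.
pub fn stop_message(name: &str, state: ServiceState) -> String {
    match plan_stop(state) {
        StopAction::SignalThenCleanup => format!("{}: stopped", name),
        StopAction::CleanupOnly => {
            format!("{}: not running (state '{}'), cleaned up", name, state_label(state))
        }
    }
}
```

With this shape, `stop_message("core", ServiceState::Failed)` reports a successful cleanup instead of a stop failure, and the caller can exit with status 0 in both branches.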
Fix: Treat `failed` state as already-stopped. `--stop` should deregister with hero_proc, clear the restart counter, remove socket directories, and return success. Anything else leaves the user with no recovery path short of editing hero_proc's SQLite DB by hand.

2. Failed-start services are left registered with hero_proc

Closely related to #1. When `lab service core` fails partway through, the half-started service stays registered:

This makes `lab service core` non-idempotent: the next run will see the broken registration and behave inconsistently. The leftover socket directories (`~/hero/var/sockets/hero_aibroker/{admin,billing,chat,embedder,images,memory,meta,models,speech,video}/`) also linger as empty placeholders.

Fix: On smoke-test or start failure, lab should:
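
A rollback of this kind could be sketched as follows. This is a minimal sketch in Rust, not lab's actual code: `start_with_rollback` and `remove_socket_dirs` are hypothetical names, the socket root is passed in rather than hard-coded, and deregistration with hero_proc appears only as a comment because its interface is not shown in this issue.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Removes `<socket_root>/<service>` and everything under it.
/// Idempotent: a directory that is already gone counts as success.
pub fn remove_socket_dirs(socket_root: &Path, service: &str) -> io::Result<()> {
    match fs::remove_dir_all(socket_root.join(service)) {
        Ok(()) => Ok(()),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(()),
        Err(e) => Err(e),
    }
}

/// Runs `start`; on failure, rolls back leftovers before returning the original error.
pub fn start_with_rollback<F>(socket_root: &Path, service: &str, start: F) -> Result<(), String>
where
    F: FnOnce() -> Result<(), String>,
{
    match start() {
        Ok(()) => Ok(()),
        Err(start_err) => {
            // Assumption: deregistering from hero_proc and clearing the restart
            // counter would happen here, through whatever interface lab uses.
            if let Err(cleanup_err) = remove_socket_dirs(socket_root, service) {
                return Err(format!("{} (cleanup also failed: {})", start_err, cleanup_err));
            }
            Err(start_err)
        }
    }
}
```

Because the cleanup tolerates a missing directory, running it again after a partial rollback is harmless, which is the property the next `lab service core` run depends on.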
3. `lab user init` warns "lab not installed" when it actually is

After installing via the documented curl one-liner (`cargo install --root ~/hero`), lab lands at `~/hero/bin/lab`. Then:

The init step is looking for `lab` in `~/.local/bin/lab` and complaining when it's not there, but the binary is already in the destination it was trying to move it to (`~/hero/bin/lab`). The warning is incorrect, and worse, its suggested remediation ("run lab_install.sh or lab_build.sh first") implies lab isn't installed when it already is.

Fix: Before issuing the warning, check whether `~/hero/bin/lab` already exists; if so, silently treat the move as already done.

4. `lab service` (no args) errors with "no .git directory found"

Run from `/root` (or any non-repo directory):

There is no top-level `lab service` discovery command: to find out what services are registered, the user has to already know a service name. This is especially awkward because the older `lab --status` (which served as a global discovery command) was retired in favor of `lab service <name> --status`.

Fix: When no name is given AND the cwd isn't in a git repo, fall back to listing all registered services (the equivalent of the old top-level `lab --status`). That preserves the git-repo-aware ergonomics when inside a service repo, while giving outside-the-repo callers a useful default.

5. README is significantly out of date relative to the current binary
`crates/lab/README.md` documents top-level flags and verbs that have been retired or renamed in the binary on `development`:
- `lab --start hero_code_server`, `lab --stop`, `lab --status` → `lab service <name> --start | --stop | --status`
- `--release` / `--install` / `--bin` flags → `lab build`
- `CODEROOT` env var → `PATH_CODE`
- `lab path`, `lab completions`, `lab infocheck`
- `lab user init` is now a mandatory step between `install.sh` and using lab

Concrete consequences during the fresh-install test:
- … `lab --start hero_code_server` and gets `error: unrecognized subcommand`.
- … `CODEROOT` not set, and is confused because `lab path` references `PATH_CODE` instead.
- … `install.sh` and follows README's "Install" section, never finds the `lab user init` step, and `lab path` errors with `PATH_ROOT is not set`.

Fix: Audit the README section by section against the current binary. Specifically:
- … `lab service`, `lab --start/stop/status`)" section with the new `lab service <name> --start|--stop|--status` syntax.
- … `lab [flags]`)" section under `lab build [flags]`.
- … `CODEROOT` references with `PATH_CODE` (or document both if both work).
- … `lab user init`.
- … `lab path`, `lab completions`, `lab infocheck`.

Priority
All five are medium severity: they don't block install outright, but each leaves a user stuck or confused at a predictable point. #1 and #2 are the most consequential because they make `lab service core` non-idempotent after a failure. #5 is the umbrella that, if fixed, would prevent users from hitting many of the others by accident.

Closed as solved by PR #296