-
Notifications
You must be signed in to change notification settings - Fork 1
sdf-cli Migration to slurmrest
#13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
2Ryan09
wants to merge
30
commits into
main
Choose a base branch
from
sdf-cli-to-slurmrest
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
30 commits
Select commit
Hold shift + click to select a range
9c25f22
add doc
2Ryan09 564c548
additional alternatives
2Ryan09 9ebb191
use ansible
2Ryan09 55fde55
use slurmrest
2Ryan09 d661315
add to ci/cd
2Ryan09 e9b7818
update readme
2Ryan09 e54d4ee
streamline job processing and enhance slurmrest integration by using …
2Ryan09 ed091db
update default URL to sdf-slurmrest-dev.slac.stanford.edu
2Ryan09 1395b93
add typing to slurmrest
2Ryan09 54435f0
update doc to include autogenerated client
2Ryan09 c364bbd
fix test imports
2Ryan09 4553784
rm port from new URL
2Ryan09 ed37f62
Merge branch 'main' into sdf-cli-to-slurmrest
2Ryan09 bab6c24
slurmrest v0.0.44 -> v0.0.42 endpoints
2Ryan09 a28b0aa
query associations one at a time
2Ryan09 3054a6c
add doc about individual association fetching
2Ryan09 c5106b4
fix broken data formatting
2Ryan09 7b539aa
rm deprecated slurmremap
2Ryan09 f30bd35
account for data format changes
2Ryan09 4aa333f
REST API supports @cluster in account name
2Ryan09 6ac8670
add test for data format consistency
2Ryan09 699b399
update submodule
2Ryan09 ef529a5
SLURM_JWT -> SLURMREST_JWT
2Ryan09 64532d3
add helpful logging
ea2c9cd
update submodule
2Ryan09 cb40e71
typo
2Ryan09 dd06bd1
typos
2Ryan09 85a40a8
typo
2Ryan09 e4620b5
documentation updates
2Ryan09 7d34cef
Merge remote-tracking branch 'origin/main' into sdf-cli-to-slurmrest
2Ryan09 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -21,3 +21,9 @@ lib64 | |
|
|
||
| # VS Code | ||
| .vscode/ | ||
|
|
||
| # Secrets | ||
| .env | ||
|
|
||
| # Generated Client | ||
| slurmrest_client/ | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Submodule project
updated
from 5f3c7c to f32c1a
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,106 @@ | ||
| # Migration of `sacctmgr` and `sacct` Calls to `slurmrest` | ||
|
|
||
| Currently, `sdf-cli` runs `sacctmgr` and `sacct` CLI tools directly via `subprocess` in order to gather account association information, job accounting information, and toggling the number of nodes assigned to a facility in the event of an overage. In a step towards containerization and more robust interaction with SLURM, it is desired to migrate these CLI tool usages to `slurmrest` endpoints. | ||
|
Comment on lines
+1
to
+3
|
||
|
|
||
| ## Current Usage | ||
|
|
||
| There are currently three read operations and one write operations: | ||
|
|
||
| ### Read Operations | ||
|
|
||
| - `sacctmgr show assoc where account={','.join(list_of_assoc)} --noheader -P format=Account,GrpNodes,GrpJobs,MaxJobs` | ||
| - `modules/coact.py` | ||
| - Equivalent endpoint: `GET /slurmdb/v0.0.44/associations/` | ||
| - `SLURM_TIME_FORMAT=%s sacct --allusers --duplicates --allclusters --allocations --starttime="{start}T00:00:00" --endtime="{start}T23:59:59" --truncate --parsable2 --format=JobID,User,UID,Account,Partition,QOS,Submit,Start,End,Elapsed,NCPUS,AllocNodes,AllocTRES,CPUTimeRAW,NodeList,Reservation,ReservationId,State` | ||
| - `api/scripts/jobs2usage.py:37` | ||
| - Equivalent endpoint: `GET /slurmdb/v0.0.44/jobs/` | ||
| - `SLURM_TIME_FORMAT=%s {sacct_bin_path} --allusers --duplicates --allclusters --allocations --starttime="{date}T{start_time}" --endtime="{date}T{end_time}" --truncate --parsable2 --format=JobID,User,UID,Account,Partition,QOS,Submit,Start,End,Elapsed,NCPUS,AllocNodes,AllocTRES,CPUTimeRAW,NodeList,Reservation,ReservationId,State` | ||
| - `modules/coact.py` | ||
| - Equivalent endpoint: `GET /slurmdb/v0.0.44/jobs/` | ||
|
|
||
| ### Write Operations | ||
|
|
||
| - `sacctmgr modify -i account name=$facility:_regular_@$cluster set GrpTRES=node=$nodes` | ||
| - `cli/modules/coact.py` | ||
| - NOT directly possible, `slurmrest` does not support account resource allocation assignments | ||
|
|
||
| ## Migration Implications | ||
|
|
||
| Currently, three daemon tasks can be migrated easily: | ||
| - `coact-jobs-import.sh` | ||
| - `coact-reporegistration-daemon.sh` | ||
| - `coact-userregistration-daemon.sh` | ||
|
|
||
| One remains difficult due to the un-migratable write operation: | ||
| - `coact-facility-overage-daemon.sh` | ||
|
|
||
| ## Possible Alternatives | ||
|
|
||
| All potential workarounds within `slurmrest` have significant disadvantages vs the current `GrpTRES=node=0` approach. `sacctmgr` appears to be the only reliable way to make modifications to account allocations. There is [a ticket](https://support.schedmd.com/show_bug.cgi?id=24356) with SLURM to support more `sacctmgr` features, however there is no activity on it other than the original post. | ||
|
|
||
| ### Path Forward | ||
|
|
||
| > Execute same CLI tools via ansible | ||
|
|
||
| We already have to maintain an ssh connection between the future container and the SLURM infrastructure through Ansible. Ansible allows for direct, ad-hoc command execution via its [`command`](https://docs.ansible.com/projects/ansible/latest/collections/ansible/builtin/command_module.html) module. | ||
|
|
||
| e.g. | ||
| ``` | ||
| ansible [pattern] -m command -a 'sacctmgr ...' | ||
| ``` | ||
|
|
||
| ## Use of the Autogenerated Python Client | ||
|
|
||
| `slurmrest` ships with the ability to [generate a Python client](https://slurm.schedmd.com/rest.html#python-guide) via `openapi-generator-cli`. The openapi spec was generated once using the `slurmrest` instance within [`slurm-docker-cluster`](https://github.com/giovtorres/slurm-docker-cluster) and now lives in `openapi-specs/`. | ||
|
|
||
| Any future updates to `slurmrest` should support previous endpoints, but any new endpoints will require regenerating the openapi spec, which requires a live `slurmrest` instance. | ||
|
|
||
| For local development, the client can be created via `make generate-client` (a Java runtime is needed). For containerization, the client is built in CI/CD and will be packaged inside the container for usage. This was chosen to keep the client out of the git history, as it is large and not managed by SLAC, while still keeping it available for usage. | ||
|
|
||
| ## Associations Fetching | ||
|
|
||
| One notable deviation from the previous CLI data gathering to `slurmrest` is the fetching of associations. Previously, all associations were fetched with: | ||
|
|
||
| ```bash | ||
| sacctmgr show assoc where account={','.join(list_of_assoc)} --noheader -P format=Account,GrpNodes,GrpJobs,MaxJobs | ||
| ``` | ||
|
|
||
| In the `slurmrest` implementation, now associations are collected one by one as the data volume over network has caused instability. | ||
|
|
||
| ## Data Format Differences | ||
|
|
||
| Several format differences between sacct/sacctmgr CLI output and slurmrest REST API responses required explicit handling in the migration. | ||
|
|
||
| ### Jobs: Memory TRES units | ||
|
|
||
| sacct serialises memory TRES with a unit suffix (K/M/G), e.g.: | ||
|
|
||
| ``` | ||
| AllocTRES=cpu=128,mem=512G,node=4,billing=128,gres/gpu:a100=4 | ||
| ``` | ||
|
|
||
| slurmrest returns memory TRES `count` as a **bare integer in megabytes** with no suffix. To remain compatible with `_kilos_to_int()`, which was written for sacct-style suffixed strings, the migration appends an `M` suffix when serialising memory from the REST response: | ||
|
|
||
| ```python | ||
| value = f"{tres.count}M" if tres.type == "mem" else str(tres.count) | ||
| ``` | ||
|
|
||
| Without this, `_kilos_to_int("524288")` would interpret the value as bytes rather than megabytes, producing a ~1024× underestimate relative to `cluster["mem"]` (which is stored in bytes from `nodememgb * 1073741824`). | ||
|
|
||
| ### Jobs: Time fields | ||
|
|
||
| sacct returns Unix timestamps as integers when `SLURM_TIME_FORMAT=%s` is set. The migration code then called `parse_datetime(int(d["Start"]), force_tz=True)` to convert them. | ||
|
|
||
| slurmrest returns timestamps as integers in the same epoch-second format, but nested under a `time` struct: | ||
|
|
||
| ```python | ||
| # sacct: int(d["Start"]) → unix timestamp | ||
| # REST: job.time.start → unix timestamp (same value, different path) | ||
| pendulum.from_timestamp(job.time.start) | ||
| ``` | ||
|
|
||
| No unit conversion is needed, but a guard for `0` / falsy values is required since slurmrest uses `0` to indicate "not set" (e.g. a job that never started has `time.start == 0`). | ||
|
|
||
| ### Jobs: TRES key format for GPUs | ||
|
|
||
| sacct uses `gres/gpu:a100=4` (type/name:subtype=count). slurmrest splits this into `tres.type = "gres/gpu"`, `tres.name = "a100"`, `tres.count = 4`. The migration reconstructs the sacct-style key as `f"{tres.type}/{tres.name}"` where a name is present, otherwise just `tres.type`. The `_calc_resource_hours` GPU detection (`if "gpu" in k`) handles both forms correctly. | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this client generator doesn't support v3.1, may be fine but something to keep in mind
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
downside is you don't get pydantic models using this: https://github.com/openapi-generators/openapi-python-client. but things are at least typed..