[ci] Make "Linux > Build" reruns recover from immutable-artifact collision #11834
simonrozsival wants to merge 3 commits into
When the "Linux > Build" job's first attempt succeeds far enough to publish
the `nuget-linux-unsigned` pipeline artifact but then fails in a later
(often post-job) step, the job is marked failed even though the SDK was
already built and published. Pipeline artifact names are immutable per
build, so a "rerun failed jobs" re-executes the full ~20+ minute build and
then dies at the publish step with:
##[error]Artifact nuget-linux-unsigned already exists for build <id>.
This makes the build un-rerunnable: every rerun hits the same wall, and the
only recovery is queuing a brand-new build.
Detect the already-published artifact at the start of the job and skip the
rebuild and the colliding republish, letting the rerun go green and reuse
the artifact the earlier attempt produced. The check falls back to building
normally on any API error (safe default).
The 1ES SDK publish is moved out of `templateContext.outputs` (which always
runs at job end and cannot be conditioned) into inline
`1ES.PublishPipelineArtifact@1` steps guarded by the same condition, mirroring
the existing `publish-artifact.yaml` pattern already used for build-result
artifacts in this job.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR updates the Linux build pipeline templates to make reruns resilient when the Linux SDK pipeline artifact has already been published in a previous attempt (artifact names are immutable per buildId + artifactName, so republishing on rerun can permanently fail).
Changes:
- Added an early step in `build-linux-steps.yaml` to query the Azure DevOps build artifacts API and set a variable used to skip rebuild/republish on rerun.
- Added `condition:` guards to skip the expensive build/packaging steps when the artifact already exists.
- Moved Linux SDK artifact publishing out of `templateContext.outputs` (unconditional) and into explicit (conditionally guarded) publish steps.
Show a summary per file
| File | Description |
|---|---|
| build-tools/automation/yaml-templates/build-linux.yaml | Removes templateContext.outputs artifact publishing so publishing can be conditioned within steps. |
| build-tools/automation/yaml-templates/build-linux-steps.yaml | Adds artifact-existence probing and conditions to skip rebuild/republish work on reruns. |
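
As a rough illustration of the early probe step described in the changes above, the Python sketch below models its control flow: list the build's artifacts, check for one name, fall back to "not present" on any error, and emit a variable for later steps. It is not the step from the PR. The function names, the `api-version=7.1` value, and the use of Python instead of the pipeline's own script are assumptions made for illustration; only the `_apis/build/builds/<id>/artifacts` path, the `"name":"<name>"` match, and the variable names come from the PR text.

```python
import urllib.parse
import urllib.request
from typing import Callable


def artifacts_url(collection_uri: str, project: str, build_id: str) -> str:
    """Build the artifact-listing URL for one build.

    The path comes from the PR text; the api-version value is an assumption.
    """
    base = collection_uri.rstrip("/")
    return (
        f"{base}/{urllib.parse.quote(project)}"
        f"/_apis/build/builds/{build_id}/artifacts?api-version=7.1"
    )


def fetch_artifacts(url: str, token: str, timeout: float = 30.0) -> str:
    """GET the artifact list as raw text, authenticating with the job token."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def probe(name: str, fetch: Callable[[], str]) -> bool:
    """Return True only if the listing clearly contains the artifact.

    Any error (network, auth, decoding) falls back to False, which means
    "build normally" in the pipeline.
    """
    try:
        return f'"name":"{name}"' in fetch()
    except Exception:
        return False


def set_variable_command(variable: str, present: bool) -> str:
    """Format the Azure Pipelines logging command that sets a job variable."""
    value = "true" if present else "false"
    return f"##vso[task.setvariable variable={variable}]{value}"
```

In a pipeline step, the output of `set_variable_command("AlreadyBuiltLinuxSdk", probe(...))` would be printed to stdout so that later steps can read the variable in their `condition:`. Passing `fetch` as a callable keeps the fallback logic testable without network access.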
Copilot's findings
- Files reviewed: 2/2 changed files
- Comments generated: 3
```yaml
${{ if eq(parameters.use1ESTemplate, true) }}:
  templateContext:
    outputs:
    - output: pipelineArtifact
      displayName: upload linux sdk
      artifactName: ${{ parameters.nugetArtifactName }}
      targetPath: ${{ parameters.xaSourcePath }}/bin/Build$(XA.Build.Configuration)/nuget-linux
    - output: pipelineArtifact
      displayName: upload linux sdk symbols
      artifactName: ${{ parameters.nugetArtifactName }}-symbols
      targetPath: ${{ parameters.xaSourcePath }}/bin/Build$(XA.Build.Configuration)/nuget-linux-symbols
```
Did these just move? We might want to queue a main/Xamarin.Android/DevDiv pipeline build to make sure nothing broke there.
Yes — they moved out of templateContext.outputs (which always runs at job end and can't be conditioned) into inline 1ES.PublishPipelineArtifact@1 steps so they can be skipped on reruns. The public dotnet-android run here exercises them, but agreed it's worth queueing an internal Xamarin.Android/DevDiv build to confirm the 1ES production template path still publishes correctly. I can kick one off — or if you'd prefer to queue it, happy to hold for that before merge.
I guess this one is from a fork, so we can't queue the internal pipeline...
Ok, I pushed your branch here to origin, testing at:
Probe the SDK and -symbols pipeline artifacts independently and set a per-artifact variable. Rebuild steps now run unless BOTH artifacts already exist, and each publish step is skipped only for the artifact that is already present.

This handles a partial prior publish (e.g. the SDK artifact was published but the build failed before publishing -symbols) by rebuilding and publishing just the missing artifact without colliding on the one that already exists.

Addresses code review feedback on dotnet#11834.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
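
The per-artifact behavior in this commit message can be summarized as a small truth table. The Python sketch below is a model of the step conditions, not pipeline code: `RerunPlan` and `plan_rerun` are hypothetical names, and the `succeeded()` part of the real condition is left out. It assumes the variables arrive as strings and are compared to `'true'` ignoring case, as Azure Pipelines expressions do for strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RerunPlan:
    rebuild: bool          # run the build/packaging/copy steps
    publish_sdk: bool      # publish nuget-linux-unsigned
    publish_symbols: bool  # publish nuget-linux-unsigned-symbols


def plan_rerun(already_built_sdk: str, already_built_symbols: str) -> RerunPlan:
    """Decide what a job attempt should do from the two probe variables.

    An unset variable is modeled as an empty string, which is not 'true',
    so the attempt builds and publishes normally.
    """
    sdk_present = already_built_sdk.lower() == "true"
    symbols_present = already_built_symbols.lower() == "true"
    return RerunPlan(
        # or(ne(sdk, 'true'), ne(symbols, 'true')): rebuild unless BOTH exist
        rebuild=(not sdk_present) or (not symbols_present),
        # each publish is skipped only for the artifact that already exists
        publish_sdk=not sdk_present,
        publish_symbols=not symbols_present,
    )
```

For example, `plan_rerun("true", "false")` models the partial-publish case: the rerun rebuilds, skips the SDK publish, and publishes only the `-symbols` artifact.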
Summary
Make the `Linux > Build` CI job recover on a rerun instead of getting permanently stuck once it has published its SDK artifact. Today, if the first attempt builds and publishes `nuget-linux-unsigned` but then fails in a later (often post-job) step, every "rerun failed jobs" attempt fails at the publish step with:

…because pipeline artifact names are immutable per build. The only recovery is to queue an entirely new build. This PR detects the already-published artifact up front and skips the rebuild + republish so the rerun goes green and reuses what the earlier attempt produced.
Background: what actually happens
Observed on public build `1489305` (dotnet-android, dnceng-public):

Attempt 1 (original run): the build itself succeeded:
- `make jenkins` ✓ → `make create-nupkgs` ✓ → `copy linux sdk` ✓
- `upload linux sdk` ✓ and `upload linux sdk symbols` ✓; `nuget-linux-unsigned` was successfully published
- The post-job `Cache@2` save ("cache Android toolchain archives") failed because the agent disk was full:

Attempt 2 (rerun failed jobs):
- Starts from scratch (`workspace: clean: all`, ephemeral 1ES pool), rebuilds everything (~20+ min), re-uploads the content, then fails at the final associate step:

Why the build becomes un-rerunnable
Pipeline artifacts are keyed by (buildId + artifactName) and are immutable. The artifact name only registers after its content is fully committed; that is literally the step that fails with "already exists". So once attempt 1 registers `nuget-linux-unsigned` on the build, every rerun of that job re-executes the publish and collides. No rerun can ever go green; only a brand-new build (new buildId) will pass.

The underlying trigger here is a full agent disk during the post-job toolchain-cache save, which is infrastructure we do not control on the public pool. Rather than chase the disk-full root cause, this PR makes the job robust on reruns regardless of why the earlier attempt failed after publishing.
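
The collision described above can be pictured with a toy model of the artifact registry. This Python sketch is purely illustrative: `BuildArtifacts` and `ArtifactAlreadyExists` are hypothetical names, the error text is borrowed from the commit message, and the second build ID is made up. It shows only the keying rule, namely that a (buildId, artifactName) pair can be registered once.

```python
class ArtifactAlreadyExists(Exception):
    """Raised when a (build_id, name) pair is registered a second time."""


class BuildArtifacts:
    """Toy registry: artifact names are immutable per build."""

    def __init__(self) -> None:
        self._registered: set[tuple[int, str]] = set()

    def publish(self, build_id: int, name: str) -> None:
        key = (build_id, name)
        if key in self._registered:
            raise ArtifactAlreadyExists(
                f"Artifact {name} already exists for build {build_id}."
            )
        self._registered.add(key)


registry = BuildArtifacts()
registry.publish(1489305, "nuget-linux-unsigned")      # attempt 1: succeeds

try:
    registry.publish(1489305, "nuget-linux-unsigned")  # rerun: same build ID
    rerun_collided = False
except ArtifactAlreadyExists:
    rerun_collided = True

registry.publish(1489306, "nuget-linux-unsigned")      # new build (hypothetical ID): succeeds
```

Under this model a rerun can only avoid the error by not calling `publish` for a name that is already registered, which is the approach this PR takes.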
What this PR changes
`build-tools/automation/yaml-templates/build-linux-steps.yaml`:
- Adds an early step that queries the build artifacts REST API (`_apis/build/builds/$(Build.BuildId)/artifacts`) using `$(System.AccessToken)` and probes each artifact independently, setting a per-artifact job variable: `AlreadyBuiltLinuxSdk` for `nuget-linux-unsigned` and `AlreadyBuiltLinuxSdkSymbols` for `nuget-linux-unsigned-symbols`. Any API/network error falls back to `false` → build normally (safe default).
- `make jenkins`, `make create-nupkgs`, the symbols `CopyFiles`/`DeleteFiles`, and `copy linux sdk` gain `condition: and(succeeded(), or(ne(variables['AlreadyBuiltLinuxSdk'], 'true'), ne(variables['AlreadyBuiltLinuxSdkSymbols'], 'true')))`, so a partial prior publish still rebuilds the files needed for the missing artifact.
- Each publish is guarded per artifact: the SDK publish on `AlreadyBuiltLinuxSdk`, the `-symbols` publish on `AlreadyBuiltLinuxSdkSymbols`. If a prior attempt published the SDK but died before `-symbols`, the rerun rebuilds, skips the SDK publish (no collision), and publishes only the missing `-symbols` artifact.

`build-tools/automation/yaml-templates/build-linux.yaml`:
- Moves the SDK publish out of `templateContext.outputs` (which always runs at job end and cannot be conditioned) into inline `1ES.PublishPipelineArtifact@1` steps in the steps template, guarded by the skip condition. This mirrors the already-approved pattern in `publish-artifact.yaml` (used for the "Build Results - Linux" artifact in this same job), so it stays within 1ES governance.

Behavior after this change
On a rerun where both artifacts already exist, the job: checks the API → sets both `AlreadyBuiltLinuxSdk=true` and `AlreadyBuiltLinuxSdkSymbols=true` → skips the rebuild and both publishes → finishes green in ~1–2 minutes, reusing the valid artifacts from the earlier attempt. If only one artifact was published before the earlier attempt failed, the rerun rebuilds and publishes just the missing one. Downstream stages already consume `nuget-linux-unsigned` from the build via `DownloadPipelineArtifact`, so they are unaffected. Because nothing heavy runs, the disk never re-fills, so the post-job cache save no longer fails either.

First runs are unchanged: the artifact doesn't exist yet, so the guard returns `false` and the full build runs as before.

Why not other approaches
- Per-attempt artifact names (the `(Attempt N)` suffix used for "Build Results"): can't be used for `nuget-linux-unsigned`, because downstream stages reference the fixed name.
- Conditioning the `templateContext.outputs` entries: output entries always run at job end and there's no in-repo precedent for conditioning them; using inline `1ES.PublishPipelineArtifact@1` (already used elsewhere in the repo) is the more certain, precedented mechanism.

Testing / validation
- Against build `1489305`: the REST API cleanly reports each artifact present via the exact `"name":"<name>"` fixed-string match, so probing `nuget-linux-unsigned` does not false-match `nuget-linux-unsigned-symbols` and each variable is set independently.
- Public `dotnet-android` run (first run exercises the normal build path; the skip path is exercised on any rerun).

Notes
Since this only affects CI pipeline YAML, there is no product/runtime impact.
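
To make the fixed-string match from the testing notes concrete, here is a minimal Python sketch of why `"name":"<name>"` does not confuse the two artifact names. The response string is hand-written and abbreviated, the function names are hypothetical, and the check assumes the API returns compact JSON with no space after the colon. The JSON-parsing variant is an alternative shown for comparison, not something the PR uses.

```python
import json


def listed_by_fixed_string(response_text: str, name: str) -> bool:
    """Exact `"name":"<name>"` match, including the closing quote."""
    return f'"name":"{name}"' in response_text


def listed_by_substring(response_text: str, name: str) -> bool:
    """Naive check, shown only for contrast: it also matches name prefixes."""
    return name in response_text


def listed_by_parsing(response_text: str, name: str) -> bool:
    """Alternative (not from the PR): parse the JSON and compare names."""
    artifacts = json.loads(response_text).get("value", [])
    return any(item.get("name") == name for item in artifacts)


# Hand-written sample in which only the -symbols artifact was published.
ONLY_SYMBOLS = (
    '{"count":1,"value":[{"id":1,"name":"nuget-linux-unsigned-symbols"}]}'
)
```

With `ONLY_SYMBOLS`, the fixed-string and parsing checks both report `nuget-linux-unsigned` as absent, while the naive substring check would wrongly report it as present. The closing quote in the needle is what separates the two names.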