Add EnvironmentValidator TSG: Test System Drive Free Space#302
Conversation
New troubleshooting guide for the AzStackHci_Hardware_Test_SystemDrive_Free_Space environment validator check, which fails when an Azure Local machine's system drive (C:) drops below the 30 GB minimum. Covers where the failure surfaces (the Azure portal update readiness view, the AzStackHciEnvironmentChecker event log EventID 17205, and the HealthCheckResult JSON on the infrastructure share), tiered space reclamation with production-safety labels, and re-validation via the Environment Checker module and the pre-update health check. Adds the file to the EnvironmentValidator README index. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Adds a new Environment Validator troubleshooting guide (TSG) for the AzStackHci_Hardware_Test_SystemDrive_Free_Space check (30 GB free space requirement on C:), and indexes it in the EnvironmentValidator README so it’s discoverable alongside existing validators/known-issues documentation.
Changes:
- Introduces a new TSG documenting symptoms, where to find the failure output (portal + on-box), discovery, tiered remediation, and re-validation steps.
- Adds the new TSG link to
TSG/EnvironmentValidator/README.md.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
TSG/EnvironmentValidator/Troubleshoot-Hardware-Test-SystemDrive-Free-Space.md |
New end-to-end TSG for diagnosing and remediating low C: free space that blocks Azure Local deployment/updates. |
TSG/EnvironmentValidator/README.md |
Adds an index entry pointing to the new TSG. |
…ty link - Rename to Troubleshooting-Test-SystemDrive-Free-Space.md to match the EnvironmentValidator sibling naming (Troubleshooting-Test-<Name>) and drop the outlier Hardware- segment; update the README index link text and path. - Add the Azure Local low-capacity requirements link (https://aka.ms/azurelocallowcapacityrequirements) to Related. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…holder - Wrap the Windows Update cache clear in try/finally with -ErrorAction Stop on the service stop, so wuauserv/bits are always restarted even if Remove-Item fails or the session is interrupted. - Replace the concrete D:\logbackup export path with a placeholder, since a node may not have a D: volume; reinforces that the destination must be a non-C: volume or a network share. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Thanks for the review. Addressed in 9ac32c9:
The updated Windows Update cache mitigation block was re-validated end-to-end on a live Azure Local cluster: the drive was driven below 30 GB, the TSG's own clear reclaimed the space (about 37 GB), and the check returned to SUCCESS. |
|
Validated this TSG end-to-end on a live Azure Local lab cluster (build 10.2607) — the remediation works, and I have one accuracy issue to flag in the diagnostic section. I reproduced the failure and recovery exactly as written: filled the system drive below the threshold → the check reported FAILURE with the documented detail line ( I also exercised the multi-node guidance on a four-node cluster:
The Tier 3 "do not delete" call-out is accurate — on a real node the large platform-managed folders it names (e.g. One issue — the names in the "On the machine" section don't match current builds. On build 10.2607, the result is emitted under:
Two consequences a reader will hit:
Caveat: this may be build drift — an earlier build may have emitted the longer Net: accurate and remediation-precise; just the diagnostic-section names need a refresh for current builds. |
…tion Per AlBurns-MSFT's live validation on build 10.2607, the EventID 17205 body and the portal readiness row surface the telemetry/health-scanner name (AzStackHci_Hardware_SystemDriveFreeSpace / "System Drive Free Space"), while earlier builds emit the env-checker name (the longer "..._Test_..." form). Make the on-box diagnostics work on any build: - Get-WinEvent filter now matches either name (regex alternation), so it returns the record regardless of build. - Example JSON Name/DisplayName/Title updated to the current-build form, with a [!NOTE] documenting the earlier form and that the Detail line is identical. - Portal "appears as" text notes both the current and earlier display names. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Thanks for the thorough live validation, Alex, and the build-drift catch. I confirmed it both ways: an earlier build emits the env-checker name in the event body ( Pushed a build-robust fix (3de8ef8) so the diagnostic section works on any build:
The |
…space Add a callout to "Verify the fix" clarifying that the portal readiness view and the HealthCheckResult.EnvironmentChecker JSON reflect the LAST health check, so they can keep showing the failure until refreshed (by the pre-update health check or the next scheduled periodic health check). The fast targeted validator reflects live free space immediately, so it is the right way to confirm a fix without waiting on the portal. Learning from a live detect -> mitigate -> revalidate validation of this TSG on a lab cluster, where the targeted check returned SUCCESS immediately while the cluster-wide HealthCheckResult JSON was still ~20h stale.
Live end-to-end validation: grade AI validated this TSG end to end on a live 2-node Azure Local lab cluster using an automated detect -> mitigate -> revalidate harness driven by this PR's own published guidance. Summary: the documented bad state trips the real Environment Validator check, the failure is discoverable exactly where the TSG says, the TSG's own mitigation reclaims the space, and the check returns to SUCCESS.
Notes fed back into the TSG (latest commit): during the run the fast targeted validator returned SUCCESS immediately, while the cluster-wide The TSG's other reclamation tiers (DISM component cleanup, dump/log removal, etc.) are documented for other consumers and were not exercised by this single disk-pressure injection. INTERNAL note: validated on lab cluster b88rb1605; this comment contains no customer data. |
AlBurns-MSFT
left a comment
There was a problem hiding this comment.
LGTM — thanks for the quick turnaround on the build-drift fix. Verified the refreshed diagnostic section: both check-name forms (AzStackHci_Hardware_Test_SystemDrive_Free_Space and AzStackHci_Hardware_SystemDriveFreeSpace) are present in the shipping AzStackHci.EnvironmentChecker module, so the match-either Get-WinEvent filter is correct, and the cmdlets all verify. Approving.
What
Adds a troubleshooting guide for the AzStackHci_Hardware_Test_SystemDrive_Free_Space Environment Validator check, which fails when an Azure Local machine's system drive (
C:) drops below the required 30 GB of free space. The check is Critical and blocks deployment and update operations until resolved, and there was no dedicated TSG for it in the EnvironmentValidator folder.What is covered
AzStackHciEnvironmentCheckerevent log (Event ID 17205), and theHealthCheckResult.EnvironmentChecker.*.jsonresult file on the infrastructure share. Includes the validator result JSON and how to read theDetailline.Invoke-AzStackHciHardwareValidation -Include Test-SystemDriveFreeSpace -PassThru) and the authoritativeInvoke-SolutionUpdatePrecheck.All PowerShell examples were tested on a live Azure Local cluster. The file is also added to the EnvironmentValidator README index.