Skip to content

Tests: SNIClose close-during-async-IO regression tests (ADO #43847 / ICM 775308542)#4420

Open
paulmedynski wants to merge 5 commits into
mainfrom
dev/paul/sni-close
Open

Tests: SNIClose close-during-async-IO regression tests (ADO #43847 / ICM 775308542)#4420
paulmedynski wants to merge 5 commits into
mainfrom
dev/paul/sni-close

Conversation

@paulmedynski

@paulmedynski paulmedynski commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Summary

Investigation and regression tests for ADO #43847 / ICM 775308542 — "SNIClose can deadlock with in-flight async I/O during connection close."

The original incident was reported against the retired System.Data.SqlClient; this work verifies whether the same deadlock exists in current Microsoft.Data.SqlClient native SNI by reproducing "close while an async read is in flight" across every relevant connection phase.

Result: no deadlock reproduced in any tested path on net9.0 or net462. Current native SNI drains/cancels the pending async I/O and closes promptly in each scenario.

Tests added

UnitTests — SNICloseDeadlockTest.cs (in-process TDS server)

  • Post-login query read (Close/Dispose): strengthened to assert the async read is genuinely pending at close (readTask not completed, connection Open).
  • Pre-login stall (Close/Dispose): a bare TcpListener withholds the pre-login response so OpenAsync's read pends during connection establishment.
  • TLS-over-TDS handshake (Close/Dispose): the listener advertises ENCRYPT_ON, reads the client's TLS ClientHello, then withholds the ServerHello, leaving the client inside SslStream.AuthenticateAsClient with a pending SNI read on the SSL-over-TDS transport at close time.

ManualTests — MarsCloseDeadlockTest.cs (live SQL Server)

  • MARS smux logical session (Close/Dispose): MultipleActiveResultSets=true with WAITFOR DELAY to keep an async read pending on a MARS logical session at close. This path requires a real server because the in-process TDS test server does not support MARS. Verified against a local SQL Server on net9.0 and net462.

Each scenario closes on a worker thread under a bounded wait; a deadlock would manifest as the wait timing out.

Results

Scenario net9.0 net462
Post-login query-read close pass pass
Pre-login response close pass pass
TLS-over-TDS handshake close pass pass
MARS smux session close (live server) pass pass

Notes

  • The native TdsParserStateObjectNative.Dispose() still carries an UNDONE comment about blocking for pending callbacks on logical (MARS) connections, and DisposeCounters() waits on _readingCount but not _pendingCallbacks. These remain latent hardening opportunities, but no live deadlock was found in current MDS.

Checklist

  • Tests added
  • Public API changes documented (none)
  • Verified against customer repro (original repro was System.Data.SqlClient; not reproducible in current MDS)
  • No breaking changes

Draft: opened for review/discussion of the investigation outcome.

Adds gated-server repro tests that close/dispose a connection while an
async network read is in flight, using the in-proc TDS server with a
query engine that stalls the response.

Findings:
- On managed SNI (Linux) these tests PASS: disposing the socket completes
  the pending ReadAsync with ObjectDisposedException, so there is no
  deadlock in the managed teardown path.
- The ICM deadlock is specific to the native SNIClose path on Windows,
  which cannot be exercised from a Linux host. These tests are intended
  to reproduce that hang on Windows/native SNI and serve as bounded-wait
  regression guards elsewhere.

No fix applied yet.
…ke, MARS)

Expands the ADO #43847 / ICM 775308542 investigation into whether SNIClose can deadlock with in-flight async I/O during connection close.

UnitTests (SNICloseDeadlockTest.cs), in-process TDS server:
- Strengthen the existing post-login query-read scenarios to assert the async read is genuinely pending at close (readTask not completed, connection Open).
- Add pre-login stall scenarios (bare TcpListener withholds the pre-login response) for Close and Dispose.
- Add faithful TLS-over-TDS handshake scenarios: the listener advertises ENCRYPT_ON, reads the client's ClientHello, then withholds the ServerHello so the client is inside SslStream.AuthenticateAsClient with a pending SNI read at close.

ManualTests (MarsCloseDeadlockTest.cs), live SQL Server:
- Add MARS (smux logical-session) Close/Dispose scenarios using WAITFOR DELAY to keep an async read pending; the in-process TDS server does not support MARS.

All scenarios pass on net9.0 and net462: current native SNI drains the pending async I/O and closes promptly in every path (post-login, pre-login, TLS handshake, and MARS).
Copilot AI review requested due to automatic review settings July 2, 2026 18:01
@github-project-automation github-project-automation Bot moved this to To triage in SqlClient Board Jul 2, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds regression coverage to ensure SqlConnection.Close() / Dispose() do not deadlock when native SNI has an in-flight async read during connection teardown, spanning post-login, pre-login, TLS-handshake, and MARS logical-session phases.

Changes:

  • Added in-process simulated-server unit tests covering close/dispose with a pending async read after login, during pre-login, and during TLS-over-TDS handshake.
  • Added a live SQL Server manual test covering the MARS SMUX logical-session path with a pending async read during close/dispose.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
src/Microsoft.Data.SqlClient/tests/UnitTests/SimulatedServerTests/SNICloseDeadlockTest.cs New simulated-server unit tests that force pending reads and verify close/dispose completes within a bounded timeout across multiple connection phases.
src/Microsoft.Data.SqlClient/tests/ManualTests/SQL/AsyncTest/MarsCloseDeadlockTest.cs New manual live-server MARS test that keeps an async read pending on a logical session and verifies close/dispose completes promptly.

@paulmedynski paulmedynski added this to the 7.1.0-preview3 milestone Jul 2, 2026
…on-test docs

Resolves copilot-pull-request-reviewer feedback on #4420:

- Guard the finally-block connection.Dispose() cleanup with closedInTime in all
  four scenarios (post-login read, pre-login, TLS handshake, MARS). When a close
  deadlock is detected (closedInTime == false), skip the potentially-blocking
  Dispose() so the bounded-wait Assert fails fast instead of hanging.
- Correct the SNICloseDeadlockTest class doc: these are regression guards that
  PASS on the current code base (MDS drains in-flight async I/O and closes
  promptly); they would fail only if a future change reintroduced the deadlock.

Tests: SNICloseDeadlockTest 6/6 pass on net9.0 and net462; MarsCloseDeadlockTest
2/2 pass on net9.0 against a live server.
Replace each pair of Close/Dispose [Fact] methods plus their private helper with a single [Theory] parameterized by disposeInsteadOfClose ([InlineData(false)]/[InlineData(true)]), matching the MARS manual test's style. No behavior change; still 6 cases (3 theories x 2).
Copilot AI review requested due to automatic review settings July 3, 2026 09:28

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

…anup

Second round of copilot-pull-request-reviewer feedback on #4420:

- Run the Close/Dispose worker on a dedicated long-running thread
  (Task.Factory.StartNew(..., TaskCreationOptions.LongRunning)) instead of a
  thread-pool thread, so a hypothetical deadlock regression parks one dedicated
  thread rather than starving the shared pool.
- Wrap each scenario body in try/finally so cleanup always runs even if a
  precondition assert throws: the release event is always signaled, the TCP
  listener is always stopped, and the command/connection are disposed when the
  close worker never started or completed. The blocking connection.Dispose() is
  still skipped when a deadlock is actually detected (closedInTime == false), so
  the failure stays bounded and no listener/connection is leaked.

Applies to all four scenarios (post-login read, pre-login, TLS handshake, MARS).
Tests: SNICloseDeadlockTest 6/6 pass on net9.0 and net462; MarsCloseDeadlockTest
2/2 pass on net9.0.
@paulmedynski paulmedynski moved this from To triage to In progress in SqlClient Board Jul 3, 2026
@paulmedynski paulmedynski added the Area\Tests Issues that are targeted to tests or test projects label Jul 3, 2026
@paulmedynski paulmedynski marked this pull request as ready for review July 3, 2026 17:21
@paulmedynski paulmedynski requested a review from a team as a code owner July 3, 2026 17:22
Copilot AI review requested due to automatic review settings July 3, 2026 17:22
@paulmedynski paulmedynski moved this from In progress to In review in SqlClient Board Jul 3, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment on lines +251 to +264
try
{
// Read one byte of the client's pre-login packet. Once this
// returns the client has sent its pre-login and posted its
// async read for the response. We only need to know data
// arrived, so a single byte is sufficient.
stream.ReadByte();
}
catch
{
// The client may tear down the socket; ignore.
}

clientConnected.Set();
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area\Tests Issues that are targeted to tests or test projects

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

4 participants