feat: add NGWMN getters as an ogc sibling; extract a shared OGC engine#324
feat: add NGWMN getters as an ogc sibling; extract a shared OGC engine#324thodson-usgs wants to merge 1 commit into
ogc sibling; extract a shared OGC engine#324Conversation
35500b3 to
9e4c0ca
Compare
ogc sibling; extract a shared OGC engine
3ba001e to
803a15d
Compare
|
@ehinman, |
`ogc/engine.py` had absorbed ~1940 LOC spanning six unrelated concerns -- constants/config, HTTP error handling, arg validation, request building, response shaping, and the async pagination driver. (DOI-USGS#324 extracted this generic engine out of waterdata/utils.py, so the god-module relocated here rather than disappearing -- this is DOI-USGS#318's original intent, retargeted onto the post-DOI-USGS#324 layout.) Split it into cohesive private siblings under `ogc/`, moving every definition AST-identically (no signature/logic change): _constants.py URLs, OgcDialect, regexes, param sets, context vars, gpd probe _http.py headers, error-body, _raise_for_non_200, retry-after _validate.py arg normalization/validation, id switching _requests.py request building (GET/CQL2, date formatting, pagination URLs) _responses.py wire response -> DataFrame (parse + shape + finalize) engine.py async pagination driver + get_ogc_data, re-exporting the above engine.py (1937 -> 570 LOC) stays the public facade: it re-exports every name so existing `from dataretrieval.ogc.engine import ...` sites in waterdata, ngwmn, and the tests keep working unchanged. The geopandas parse chain lives in _responses.py (its boundary is the domain, not a test patch target); the single gpd monkeypatch seam was relocated to `_responses` -- the only test change. Behavior-preserving: all 42 top-level symbols moved with byte-identical AST bodies; 296 offline tests pass; ruff + mypy --strict clean; no import cycles. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`ogc/engine.py` had absorbed ~1940 LOC spanning unrelated concerns -- constants/config, per-call request context, HTTP error handling, arg validation, request building, and response shaping -- on top of the async pagination driver. (DOI-USGS#324 extracted this generic engine out of waterdata/utils.py, so the god-module relocated here rather than disappearing; this is DOI-USGS#318's original intent retargeted onto the post-DOI-USGS#324 layout.) Split it into cohesive private siblings under `ogc/`, moving every definition AST-identically (no signature/logic change): _constants.py URLs, OgcDialect, regexes, param sets, the geopandas probe _context.py per-call request context (base-url / dialect / row-cap vars) _http.py headers, error-body, _raise_for_non_200, retry-after _validate.py pure arg normalization/validation + id switching _requests.py request building (GET/CQL2, dates, pagination URLs, queryables) _responses.py wire response -> DataFrame (parse + shape + finalize) engine.py async pagination driver + get_ogc_data, re-exporting the above engine.py (1937 -> 570 LOC) stays the public facade: it re-exports every name so existing `from dataretrieval.ogc.engine import ...` sites in waterdata, ngwmn, and the tests keep working unchanged. The DAG is acyclic and leaf-first (_constants <- _context <- _http/_validate/_requests/_responses <- engine); _validate is a pure leaf -- the lone HTTP probe (_check_ogc_requests) lives in _requests with the other request builders. The geopandas parse chain lives in _responses; its single gpd monkeypatch seam was relocated there -- the only test change. Behavior-preserving: all 42 top-level symbols moved with byte-identical AST bodies; 296 offline tests pass; ruff + mypy --strict clean; no import cycles. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`ogc/engine.py` had absorbed ~1940 LOC spanning unrelated concerns -- constants/config, per-call request context, HTTP error handling, arg validation, request building, and response shaping -- on top of the async pagination driver. (DOI-USGS#324 extracted this generic engine out of waterdata/utils.py, so the god-module relocated here rather than disappearing; this is DOI-USGS#318's original intent retargeted onto the post-DOI-USGS#324 layout.) Split it into cohesive private siblings under `ogc/`, moving every definition AST-identically (no signature/logic change): _constants.py URLs, OgcDialect, regexes, param sets, the geopandas probe _context.py per-call request context (base-url / dialect / row-cap vars) _http.py headers, error-body, _raise_for_non_200, retry-after _validate.py pure arg normalization/validation + id switching _requests.py request building (GET/CQL2, dates, pagination URLs, queryables) _responses.py wire response -> DataFrame (parse + shape + finalize) engine.py async pagination driver + get_ogc_data, re-exporting the above engine.py (1937 -> 570 LOC) stays the public facade: it re-exports every name so existing `from dataretrieval.ogc.engine import ...` sites in waterdata, ngwmn, and the tests keep working unchanged. The DAG is acyclic and leaf-first (_constants <- _context <- _http/_validate/_requests/_responses <- engine); _validate is a pure leaf -- the lone HTTP probe (_check_ogc_requests) lives in _requests with the other request builders. The geopandas parse chain lives in _responses; its single gpd monkeypatch seam was relocated there -- the only test change. Behavior-preserving: all 42 top-level symbols moved with byte-identical AST bodies; 296 offline tests pass; ruff + mypy --strict clean; no import cycles. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
4a869d1 to
acff56c
Compare
That sounds good to me, @thodson-usgs. I like having ngwmn as its own module. |
Port the NGWMN functions from the USGS R `dataRetrieval` package and refactor the Water Data OGC machinery into a generic, API-agnostic engine, so NGWMN and Water Data are sibling layers on top of it -- NGWMN does not depend on Water Data. dataretrieval/ogc/ generic OGC engine (no service-specific config) engine.py request build, pagination, parse/finalize, get_ogc_data chunking.py URL-byte multi-value chunker filters.py CQL `filter` splitting progress.py self-updating status line The engine is parameterized by an `OgcDialect` and a base-url context variable rather than branching on service names: Water Data POSTs CQL2 for `monitoring-locations` and renders `daily` time args date-only; NGWMN needs neither. Adding a sibling API is a new dialect + base URL, not engine edits. dataretrieval/ngwmn.py sibling getters that import only dataretrieval.ogc: get_sites, get_water_level, get_lithology, get_well_construction, get_providers dataretrieval/waterdata/ thin Water Data layer on the engine; the Statistics API lives in its own waterdata/stats.py module. Unified `state` parameter across the modern getters, accepting a full name, a two-letter postal code, or a two-digit ANSI/FIPS code; normalized by codes.states.to_state (50 states + DC, fails fast on a typo) and resolved at the getter layer. The native state_code/state_name parameters remain as an escape hatch. Also: export ChunkInterrupted at the package top level and document the catch-and-resume affordance on the OGC getters; key the empty-result short-circuit off `features` (NGWMN omits `numberReturned`) and tolerate geometry-less features; always return PEP-8 snake_case columns; add a pre-commit mypy hook mirroring the CI type-check; and skip anyio in mypy (its 3.10 `match` syntax can't be parsed under the 3.9 type-check target). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
acff56c to
114e213
Compare
Ports the NGWMN functions from the R
dataRetrievalPR and, per review, refactors the Water Data OGC machinery into a shared engine so NGWMN and Water Data are sibling layers on top of it — NGWMN does not depend on Water Data.Architecture
Layering / import rules (the organization a reviewer can check):
ogc/*is fully API-agnostic — it imports nodataretrieval.waterdataordataretrieval.ngwmn.waterdata/*andngwmn.pyare siblings on the engine;ngwmn.pyimports onlydataretrieval.ogc(+codes.states), neverdataretrieval.waterdata.api.py/stats.py/ratings.py/nearest.pyare thin getter layers; the shared Water-Data state (service→id map, dialect, theget_ogc_datawrapper) lives once inutils.py.The engine is parameterized, not branched:
get_ogc_data(args, service, output_id, *, base_url, extra_id_cols, dialect). AnOgcDialect(cql2_services, date_only_services, …)(threaded via a context variable, like the base-url context) carries per-API quirks — Water Data POSTs CQL2 formonitoring-locationsand rendersdailytime args date-only; NGWMN needs neither. Adding a sibling API is a new dialect + base URL, not engine edits.The multi-value chunker (recently fixed in #322) is generic and applies to NGWMN unchanged — verified that a forced-small-budget multi-site NGWMN query chunks and unions correctly.
Unified
stateparameterA single, canonical
stateparameter spans the modern getters and accepts a full name ("Wisconsin"), a two-letter postal code ("WI"), or a two-digit ANSI/FIPS code ("55"). It is normalized once bycodes.states.to_state()(50 states + DC; fails fast on a typo) and resolved at the getter layer into whatever native queryable each endpoint wants —state_namefor the OGC getters,US:XXstate_codefor stats, the per-collection queryable for NGWMN. The nativestate_code/state_nameparameters remain as an escape hatch (e.g. non-US FIPS codes); passingstatetogether with either raises.Engine fixes (NGWMN's API differs from the main one)
featuresrather than thenumberReturnedthat NGWMN omits (otherwise pages with data were silently dropped).geometrykey (GeoDataFrame.from_featurescan't index a missing key).PEP naming
The engine snake_cases any non-snake column in
finalize, so the package always returns PEP-8 column names regardless of the upstream API — a no-op today (both APIs are already snake_case) but enforced going forward.Tests & checks
Live NGWMN tests for all five getters (
tests/ngwmn_test.py); unit tests forto_state, the_with_state/ getter-layer state resolution, and_to_snake_case;mock.patchsites repointed toogc.engine; a module fixture activatesWATERDATA_DIALECTfor the direct_construct_api_requestsunit tests. The unit suite passes andmypy --strict+ruffare clean. A pre-commitmypyhook (pinned to the samemypy<2major as CI, withhttpx/anyioin the hook env) now mirrors the CI type-check locally.Note
CI will show 3 pre-existing failures (
test_get_daily_properties/_id,test_get_continuous) — the live-API drift fixed by #323, not introduced here (branch is offmain). They go green once #323 merges.🤖 Generated with Claude Code