[Feature] : Add REST API endpoints for access through AI #1117
Draft
pulk17 wants to merge 1 commit into
Draft
Conversation
ab72a13 to
1441393
Compare
1f3c1d5 to
094f3af
Compare
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.



Please prefix your pull request with one of the following: [FEATURE] [FIX] [IMPROVEMENT].
In raising this pull request, I confirm the following (please check boxes):
My familiarity with the project is as follows (check one):
Feature: JSON REST API for the CCExtractor CI Platform (
mod_api)Summary
This PR introduces a complete, spec-driven JSON REST API for the CCExtractor Sample Platform, mounted at
/api/v1. It replaces the need for scrapers or direct database access by exposing every CI data point (runs, samples, results, diffs, logs, errors, artifacts, queue status) through authenticated, validated, and rate-limited endpoints.The API is fully described by the
openapi-ci-api.yamlspecification and validated against it using Schemathesis property-based testing. It integrates cleanly with the existing data model (mod_test,mod_regression,mod_sample,mod_auth) without modifying any of those modules.Motivation
The existing platform serves server-rendered HTML pages and lacks a programmatic interface. This makes it impossible for:
This API solves all of the above by providing a standard REST interface with proper authentication, authorization, pagination, filtering, and error handling.
Architecture
The module is organized into five layers, each with a single responsibility:
Why this structure matters
before_request/after_requesthooks, so route handlers never contain auth, validation, or rate-limit boilerplateload) and output serialization (dump), ensuring the API contract matches the OpenAPI specservices/status.pyto prevent inconsistent status logic across endpointsEndpoints
Authentication (
/auth/tokens)POST/auth/tokensGET/auth/tokenstokens:manage?all=true)DELETE/auth/tokens/currentDELETE/auth/tokens/{id}Runs (
/runs)GET/runsruns:readplatform,branch,commit_sha,repository,status,created_after/before,sortPOST/runsruns:writeGET/runs/{id}runs:readGET/runs/{id}/summaryruns:readGET/runs/{id}/progressruns:readGET/runs/{id}/configruns:readPOST/runs/{id}/cancelruns:writeSamples (
/samples,/runs/{id}/samples,/regression-tests)GET/runs/{id}/samplesruns:readGET/runs/{id}/samples/{sid}runs:readGET/samplesruns:readname,extension,tag,sha256,statusGET/samples/{id}runs:readGET/samples/{id}/historyruns:readGET/regression-testsruns:readactive,category,tag,sample_idfiltersResults (
/runs/{id}/samples/{sid}/regression-tests/{rid}/outputs/{oid})GET.../expectedresults:readGET.../actualresults:readGET.../diffresults:readPOST/runs/{id}/samples/{sid}/baseline-approvalbaselines:write+ admin/contributorErrors & Logs (
/runs/{id}/errors,/runs/{id}/logs)GET/runs/{id}/errorsresults:readGET/runs/{id}/infrastructure-errorssystem:readGET/runs/{id}/error-summaryresults:readGET/runs/{id}/logssystem:readlevel,source,containsfiltersGET/runs/{id}/samples/{sid}/logssystem:readSystem (
/system)GET/system/healthGET/system/queuesystem:readGET/runs/{id}/artifactsresults:readSecurity
Token Authentication
spci_for easy identification in logs and secret scannersScope-Based Authorization
Six scopes control access granularity:
runs:readruns:writeresults:readbaselines:writesystem:readtokens:manageNon-admin users cannot request
baselines:write. The scope check returns 403 with details about which scopes are missing and which scopes the token holds, making debugging straightforward.Role Enforcement
Certain operations additionally require specific user roles:
POST /runs/{id}/cancel— admin, contributor, or testerPOST /runs/{id}/samples/{sid}/baseline-approval— admin or contributorPOST /runs(Triggering runs):runs:writescope, regardless of the repository.admin,contributor, ortester. To trigger runs for your own fork repository, any authenticated user (who owns the fork and hasruns:writescope) can do so.Rate Limiting
Per-client sliding window rate limits protect against abuse:
POST /auth/tokensRate limit headers (
X-RateLimit-Limit,X-RateLimit-Remaining,X-RateLimit-Reset,Retry-After) are included on every response. A background eviction process prunes stale entries every 100 requests to bound memory usage.Rate limiting is automatically disabled when
app.config['TESTING']isTrueto prevent interference with test suites. In production, theTESTINGflag is never set.Security Headers
Every API response includes:
Strict-Transport-Security: max-age=31536000; includeSubDomainsContent-Security-Policy: default-src 'none'; frame-ancestors 'none'X-Content-Type-Options: nosniffX-Frame-Options: DENYPath Traversal Protection
All file-serving endpoints (
/expected,/actual,/diff) use_safe_resolve()which callsos.path.realpath()and verifies the resolved path is within theTestResultsbase directory. File extensions from the database are sanitized to remove/,\, and..sequences before being appended to filenames.Input Validation
All validation happens at the middleware layer through reusable decorators:
@validate_body(Schema)— Validates JSON body against a Marshmallow schema. Returns 415 for non-JSON content types, 400 for malformed JSON, and 400 with field-level error details for schema violations@validate_offset_pagination()— Extracts and validateslimit(1–100) andoffset(≥0, ≤2147483647). Rejects requests that mix cursor and offset pagination@validate_cursor_pagination()— Extracts and validateslimitandcursorfor log endpoints@validate_path_id('param')— Ensures URL path parameters are positive integers within int32 range@validate_date_range— Parsescreated_after/created_beforeISO 8601 params and rejects inverted ranges@validate_sort()— Validates thesortquery parameter against a whitelist (created_at,-created_at,run_id,-run_id)GET /samplesname filter escapes%and_to prevent SQL LIKE injectionStatus Derivation
Status logic is centralized in
services/status.pyto prevent inconsistencies:Run Status
The existing
test.failedproperty only checks forTestStatus.canceledand is not reliable for determining whether regression tests passed. Instead, we derive status from the latestTestProgressrow and, for completed runs, count actual failures fromTestResultrows:queuedrunningpreparationortestingcanceledcanceledpasscompleted+ zero failuresfailcompleted+ one or more failuresincompleteSample Status
passfailmissing_outputnot_startedDummy Row Detection
The CI worker writes a sentinel
TestResultFilerow withregression_test_output_id = -1andgot = 'error'when a test produces no output. This row is filtered from all API responses and drives themissing_outputstatus.Pagination
Two pagination strategies are used:
Offset Pagination (most endpoints)
{ "data": [...], "pagination": { "limit": 50, "offset": 0, "total": 142, "next_offset": 50 } }Cursor Pagination (build logs)
{ "data": [...], "pagination": { "limit": 100, "next_cursor": 3847 } }Cursor pagination is used for logs because they are read from files, not from the database, and offset-based access would require reading and discarding lines on every request.
Error Response Format
All errors follow a consistent structure:
{ "code": "validation_error", "message": "Request failed schema validation.", "details": { "fields": { "email": ["Missing data for required field."], "password": ["Shorter than minimum length 1."] } } }HTTP exceptions (404, 405, 500) are caught and reformatted into this structure so that API clients never receive HTML error pages.
Performance Considerations
Batch Status Derivation
batch_get_run_data()inservices/status.pypreloads allTestProgress,TestResult, andTestResultFilerows for a batch of runs in three queries (instead of N+1), then computes statuses in-memory. This is critical for theGET /runslist endpoint.Status Filtering Caveat
When
GET /runs?status=runningis requested, the API must load up to 1,000 runs and compute their derived status in-memory (because status is not a database column). Atruncated: trueflag is included in the pagination response when the 1,000-run cap is hit. This is documented in the response schema.Result File Preloading
Endpoints that iterate over
TestResultrows (run summary, run samples, sample history) usedefaultdict-based preloading ofTestResultFilerows to avoid N+1 queries.Diff Size Limits
The diff endpoint rejects files larger than 10 MiB and directs clients to the download URL instead. Output files are truncated at 1 MiB with a
truncated: trueflag and a GCS download link for the full file.Database Changes
New Table:
api_tokenidINT PRIMARY KEYuser_idINT FK → user.idtoken_nameVARCHAR(50)uq_user_token_name)token_hashVARCHAR(255)token_prefixVARCHAR(16)scopes_jsonTEXTcreated_atDATETIME(tz)expires_atDATETIME(tz)revoked_atDATETIME(tz)No existing tables are modified.
Test Coverage
Test Organization (265 tests total)
Schemathesis (Property-Based Testing)
The
test_schemathesis.pymodule validates every endpoint against the OpenAPI specification using Schemathesis, which generates hundreds of edge-case inputs per endpoint. It covers:Hypothesis is configured with
max_examples=5anddeadline=Noneto keep CI runtime reasonable (~7 minutes locally) while still providing meaningful coverage. Rate limiting is bypassed during testing viaapp.config['TESTING'].What the tests mock
config_parser.parse_config→ returns an in-memory SQLite configgoogle.cloud.storage.Client.from_service_account_json→ returns a mock GCS clientNew Files
mod_api/__init__.pymod_api/utils.pymod_api/models/api_token.pymod_api/middleware/auth.pymod_api/middleware/error_handler.pymod_api/middleware/rate_limit.pymod_api/middleware/security.pymod_api/middleware/validation.pymod_api/schemas/*.pymod_api/services/status.pymod_api/services/diff_service.pymod_api/services/error_service.pymod_api/services/log_service.pymod_api/services/storage.pymod_api/routes/*.pytests/api/*.pyopenapi-ci-api.yamlTotal new code: ~5,500 lines (source) + ~3,000 lines (tests) + ~2,700 lines (OpenAPI spec)
Dependencies
New pip dependencies:
argon2-cffi— Token hashing (argon2id)marshmallow— Schema validation and serializationschemathesis— OpenAPI contract testing (test dependency only)hypothesis— Property-based testing engine (test dependency only, pulled in by schemathesis)All other dependencies (
flask,sqlalchemy,passlib,google-cloud-storage) were already inrequirements.txt.How to Test Locally
Known Limitations & Future Work
GET /runs?status=Xrequires in-memory filtering with a 1,000-run cap. A future improvement would be to materialize run status as a database column updated by a trigger or the CI worker.GET /runs/{id}/samples/{sid}/logsreturns 404 because the CI worker doesn't produce per-sample log files. The endpoint is implemented and documented as a placeholder for when the worker adds this capability.blob.exists()before generating signed URLs (to avoid latency). Clients should handle 404s from the signed URL gracefully.