Add Oracle SQL/PGQ support to Awesome-Text2GQL#68

Open
ayoubmoussaid wants to merge 11 commits into ldbc:master from ayoubmoussaid:master

Summary

This PR adds Oracle SQL/PGQ support to Awesome-Text2GQL and introduces tooling to translate, validate, compare, and export Oracle SQL/PGQ datasets from Text2GQL-Bench.

The main goals are:

  • Convert framework/TuGraph-style schemas into Oracle SQL/PGQ artifacts.
  • Translate supported Cypher/GQL-like benchmark queries into Oracle SQL/PGQ GRAPH_TABLE queries.
  • Validate translated queries against Oracle Database.
  • Compare Oracle SQL/PGQ results with Neo4j Cypher results for the same benchmark records.
  • Export only validated Oracle SQL/PGQ records where Oracle and Neo4j results match.
  • Document the Oracle SQL/PGQ generation workflow and dataset preparation workflow.

A new dataset containing only the validated Oracle SQL/PGQ queries is available at Text2GQL-Dataset.

What Changed

Oracle SQL/PGQ implementation

Added Oracle SQL/PGQ support under app/impl/oracle_sqlpgq/, including:

  • Schema conversion from framework/TuGraph-style graph schemas to Oracle relational tables and property graph DDL.
  • Oracle SQL/PGQ query translation from Graph-IL.
  • Oracle SQL/PGQ AST visitor support.
  • Oracle DB client integration using python-oracledb.
  • Template-based Oracle SQL/PGQ corpus generation.
  • Query generalization and corpus combination utilities.
  • SQL/PGQ helper utilities.
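
To make the translation target concrete, here is a minimal sketch of rendering a single directed one-hop pattern as an Oracle SQL/PGQ GRAPH_TABLE query. It is not the translator from this PR: `OneHopPattern`, `render_graph_table`, and the `social_graph` / `person` / `knows` names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OneHopPattern:
    """Hypothetical container for one directed edge pattern (not from the PR)."""
    src_var: str
    src_label: str
    edge_var: str
    edge_label: str
    dst_var: str
    dst_label: str


def render_graph_table(graph: str, pattern: OneHopPattern,
                       where: str, columns: dict) -> str:
    """Render SELECT ... FROM GRAPH_TABLE for a source -> destination pattern.

    `columns` maps an output alias to a property expression such as "a.name".
    """
    match = (
        f"({pattern.src_var} IS {pattern.src_label}) "
        f"-[{pattern.edge_var} IS {pattern.edge_label}]-> "
        f"({pattern.dst_var} IS {pattern.dst_label})"
    )
    cols = ", ".join(f"{expr} AS {alias}" for alias, expr in columns.items())
    lines = ["SELECT *", f"FROM GRAPH_TABLE ({graph}", f"  MATCH {match}"]
    if where:
        # The WHERE clause is optional inside GRAPH_TABLE.
        lines.append(f"  WHERE {where}")
    lines.append(f"  COLUMNS ({cols})")
    lines.append(")")
    return "\n".join(lines)


# Illustrative graph and labels; these are assumptions, not benchmark data.
example_sql = render_graph_table(
    "social_graph",
    OneHopPattern("a", "person", "e", "knows", "b", "person"),
    where="a.name = 'Alice'",
    columns={"a_name": "a.name", "b_name": "b.name"},
)
print(example_sql)
```

The sketch interpolates identifiers directly into the SQL text, so it is only suitable for trusted, schema-derived names; literal values would normally be passed as bind variables when the query is executed.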

Dataset preparation utilities

Added dataset_prep/ scripts for working with the published Text2GQL-Bench dataset:

  • Discover benchmark query files and graph import configs.
  • Translate Cypher/GQL-like records to Oracle SQL/PGQ.
  • Optionally validate translated SQL/PGQ against Oracle.
  • Analyze translation and runtime failures.
  • Compare Oracle SQL/PGQ results with Neo4j results.
  • Export validated Oracle SQL/PGQ dataset records.
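
The comparison step can be pictured as an order-insensitive match over normalized rows. The following is a rough sketch under stated assumptions, not the comparison code in `dataset_prep/`: `normalize_value` and `rows_match` are hypothetical names, and the six-decimal rounding is an arbitrary choice.

```python
from collections import Counter
from decimal import Decimal


def normalize_value(value, places=6):
    """Map a cell to a comparable form: round non-integer numbers, strip strings."""
    if value is None or isinstance(value, (bool, int)):
        return value
    if isinstance(value, (float, Decimal)):
        # Drivers may return floats or Decimals for the same numeric value.
        return round(float(value), places)
    if isinstance(value, str):
        return value.strip()
    return value


def rows_match(oracle_rows, neo4j_rows, places=6):
    """Compare two result sets as multisets of normalized rows (order ignored)."""
    def canon(rows):
        return Counter(
            tuple(normalize_value(v, places) for v in row) for row in rows
        )
    return canon(oracle_rows) == canon(neo4j_rows)


# Same rows in a different order, with small numeric noise.
print(rows_match([(1, "a"), (2.0000001, "b")], [(2, "b"), (1.0, "a")]))
```

Ignoring row order is only appropriate for queries without ORDER BY; an ordered query would need a positional comparison instead. Using a multiset rather than a set keeps duplicate rows significant.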

Examples and documentation

Added examples for schema conversion, Oracle graph setup, Cypher-to-Oracle SQL/PGQ translation, template corpus generation, LLM-based corpus generation, and corpus combination.

Added detailed documentation for:

  • Oracle SQL/PGQ data generation workflows.
  • Dataset preparation workflows for Text2GQL-Bench.

The main README now points to those focused docs instead of duplicating long Oracle SQL/PGQ command sequences.

Why

This adds a practical Oracle SQL/PGQ path to the Text2GQL data generation workflow. Queries can be generated with LLMs, or existing Text2GQL-Bench Cypher/GQL-like records can be translated into Oracle SQL/PGQ where supported. Translated queries are then validated against a live Oracle property graph and compared against Neo4j results to identify records that are semantically safe to export.

Validation Notes

Final export validation stats:

  • Databases processed: 33
  • Total records selected: 22,407
  • Records considered for Oracle/Neo4j comparison: 20,653
  • Records exported: 19,633
  • Failed comparisons: 325
  • Skipped records: 2,449
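
As a quick consistency check on these figures, the exported, failed, and skipped counts add up to the total number of selected records. The PR does not spell out how the "considered for comparison" figure relates to the other counts, so the sketch below leaves it out; the variable names are illustrative.

```python
# Figures copied from the validation stats above.
selected = 22_407
exported = 19_633
failed = 325
skipped = 2_449

# Every selected record ends in exactly one of the three outcomes.
outcome_total = exported + failed + skipped
assert outcome_total == selected

# Share of selected records that made it into the exported dataset.
export_rate = exported / selected
print(f"exported share of selected records: {export_rate:.1%}")
```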

Reviewer Notes: Source Dataset / Schema Issues Found

While validating Text2GQL-Bench records, the new tooling found several cases where the source Cypher/GQL-like query appears inconsistent with its own graph import schema. These are classified as unsupported instead of emitting potentially incorrect Oracle SQL/PGQ.

The following are representative examples for reviewers to inspect.

1. Relationship direction mismatch

Dataset: dev/Address
Record: bird_address_0

MATCH (t1:zip_data)<-[zip_code:ZIP_CODE]-(t2:country)
WHERE t2.county = 'ARECIBO'
RETURN sum(t1.households)

Why this matters:
The query uses the ZIP_CODE relationship in a direction that does not match the discovered import schema. The translator intentionally refuses to emit Oracle SQL/PGQ for this because reversing the relationship could change query semantics.
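
A direction check of this kind can be sketched as a lookup of the edge's declared endpoints. This is an illustration only: `EdgeDef` and `check_direction` are hypothetical, and the schema entry below assumes ZIP_CODE is declared from zip_data to country, which the PR implies but does not show.

```python
from typing import NamedTuple


class EdgeDef(NamedTuple):
    """Hypothetical schema entry: an edge label with its declared endpoints."""
    label: str
    src: str
    dst: str


def check_direction(schema_edges, edge_label, src_label, dst_label):
    """Return 'ok', 'reversed', or 'unknown' for a directed edge usage."""
    reversed_seen = False
    for edge in schema_edges:
        if edge.label.lower() != edge_label.lower():
            continue
        if (edge.src, edge.dst) == (src_label, dst_label):
            return "ok"
        if (edge.src, edge.dst) == (dst_label, src_label):
            reversed_seen = True
    return "reversed" if reversed_seen else "unknown"


# ASSUMPTION: illustrative schema, not read from the dev/Address import config.
schema = [EdgeDef("ZIP_CODE", "zip_data", "country")]

# (t1:zip_data)<-[:ZIP_CODE]-(t2:country) runs from country to zip_data.
print(check_direction(schema, "ZIP_CODE", "country", "zip_data"))
```

Returning a distinct "reversed" status, instead of silently flipping the arrow, mirrors the decision described above to classify such records as unsupported.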

2. Invalid or mismatched label

Dataset: dev/Address
Record: bird_address_7

MATCH (t1:state)<-[state:STATE]-(t2:country)
WHERE t1.name = 'Alabama'
RETURN count(t2.county)

Why this matters:
The query references a label or relationship shape that does not align cleanly with the graph schema discovered from the dataset import config.

3. Invalid or missing property

Dataset: dev/FInancial_Financial_Management
Record: at2gsynth_financialfinancialmanagement_73

MATCH (b:BUDGET)-[alloc:AllocatedTo]->(a:ACCOUNT)
WHERE b.currency <> a.currency
RETURN b.budget_id, b.category, a.account_number,
       b.currency AS budget_currency,
       a.currency AS account_currency

Why this matters:
The query references properties that are not present on the corresponding schema elements. The tooling treats this as a source/schema mismatch and does not emit SQL/PGQ, because generating SQL against absent properties would produce invalid or misleading output.
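
Strict property validation along these lines can be approximated by resolving each `variable.property` reference against the properties declared for the variable's label. The sketch below is an assumption-laden illustration, not the `CypherSchema` implementation: `missing_properties` is a hypothetical helper, and the property sets are invented so that `currency` is absent on BUDGET.

```python
import re

# Matches references such as b.currency; numeric literals like 1.5 do not match.
PROPERTY_REF = re.compile(r"\b([A-Za-z_]\w*)\.([A-Za-z_]\w*)\b")


def missing_properties(query, var_labels, schema_props):
    """List (variable, property) pairs that the schema does not declare."""
    missing = []
    for var, prop in PROPERTY_REF.findall(query):
        label = var_labels.get(var)
        if label is None:
            continue  # not a known pattern variable
        declared = schema_props.get(label, set())
        if prop not in declared and (var, prop) not in missing:
            missing.append((var, prop))
    return missing


query = """
MATCH (b:BUDGET)-[alloc:AllocatedTo]->(a:ACCOUNT)
WHERE b.currency <> a.currency
RETURN b.budget_id, b.category, a.account_number,
       b.currency AS budget_currency,
       a.currency AS account_currency
"""

# ASSUMPTION: illustrative property sets, not the real dataset schema.
schema_props = {
    "BUDGET": {"budget_id", "category"},
    "ACCOUNT": {"account_number", "currency"},
}
issues = missing_properties(query, {"b": "BUDGET", "a": "ACCOUNT"}, schema_props)
print(issues)
```

A regex scan like this would also match dotted text inside string literals, so a real validator would more likely walk the parsed query than the raw text.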

- Updated pyproject.toml to include oracledb dependency.
- Introduced new test suite for translating Cypher queries to Oracle SQL/PGQ.
- Implemented dataset preparation tests for Oracle integration.
- Added live tests for OracleDB client functionality.
- Created query generalizer and template instantiator for Oracle SQL/PGQ.
- Enhanced corpus combiner to handle Oracle-specific queries and validation.
- Included schema parser for generating Oracle DDL statements.
…strict validation

- Added support for node and edge primary key mappings in OracleSqlPgqQueryTranslator.
- Introduced strict property validation to ensure properties are defined for variables.
- Updated methods to normalize label maps and handle aggregate functions in WITH clauses.
- Enhanced validation for translated queries, including handling of string predicates and label predicates.
- Improved error handling for missing properties when strict validation is enabled.
- Added new command-line arguments for validation timeout and fetch limit in dataset preparation.
- Updated tests to cover new features, including primary key mapping and strict validation scenarios.
- Enhance `test_detect_unsupported_oracle_sqlpgq_features` with additional assertions for various unsupported query patterns.
- Introduce `test_failure_analysis_groups_unsupported_query_shapes` to analyze failure signatures for unsupported queries.
- Implement `test_failure_analysis_uses_manifest_for_invalid_schema` to validate schema direction and property checks against a manifest.
- Add normalization tests in `test_compare_normalizes_temporal_strings_and_numeric_precision` and `test_compare_normalizes_oracle_and_neo4j_node_identity`.
- Create tests for path normalization in `test_compare_normalizes_single_neo4j_path_to_flat_element_sequence`.
- Include checks for nondeterministic limits in `test_compare_detects_nondeterministic_limit_without_order_by`.
- Expose file stem label aliases in `test_loader_exposes_file_stem_label_aliases`.
…atches

- Introduced `is_supported_correlated_optional_match` to validate correlated optional matches in Cypher queries.
- Updated `detect_unsupported_features` to remove "optional_match" feature if correlated optional matches are supported.
- Removed redundant optional match translation logic from `cypher2oracle_sqlpgq`.
- Added comprehensive tests for various optional match scenarios, including correlated optional matches and their translations to SQL.
- Improved handling of optional match clauses in the dataset preparation and query translation processes.
…ement

- Implemented CypherSchema to manage and validate graph schema based on provided configuration.
- Added methods for detecting validation issues in Cypher queries, including node and edge label checks, property validation, and unsafe numeric conversions.
- Introduced utility functions for parsing Cypher variable labels, property references, and edge relationships.
- Included comprehensive handling of schema name aliases and property types.
- Ensured deduplication of validation issues for cleaner output.
- Introduced checks for unique schema ownership of properties in CypherSchema.
- Added detection for unsafe temporal arithmetic in aggregate queries.
- Improved handling of broad bounded variable length relationships in translation.
- Updated tests to cover new features and edge cases, including disambiguation of complex aggregate property aliases.
- Refactored unsupported feature detection to exclude expensive variable length paths.
- Enhanced query translation to preserve real ID properties over pseudo identities.
- Added stable tiebreakers for ordered queries with limits in comparison functions.