RubricBasedEvaluator _normalize_text too basic — fails on judge model markdown output #6072

@tottenjordan

Description

_normalize_text() in rubric_based_evaluator.py only does text.lower().strip(). When LLM judge models return rubric verdicts with markdown formatting (bullets, smart quotes, extra whitespace, non-ASCII characters), the exact-match rubric lookup fails silently, producing incorrect scores.

Reproduction

Judge model returns: "• The response correctly identifies the tool"
Expected rubric: "the response correctly identifies the tool"
_normalize_text() produces: "• the response correctly identifies the tool"
Result: No match → score defaults to 0 or lowest rubric

Common Patterns That Fail

  • Leading bullets: •, *, -
  • Smart quotes: “...”, ‘...’
  • Non-ASCII: accented characters, em-dashes
  • Multi-space: "the  response" vs "the response"
  • Trailing whitespace/newlines
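
The patterns above can be checked against the current behaviour with a small table-driven sketch. Here `current_normalize` is a stand-in for `_normalize_text()` as described in this issue (lowercase and strip only), and the judge outputs are hypothetical strings written for illustration, not captured model output.

```python
def current_normalize(text: str) -> str:
    """Stand-in for the behaviour described in this issue: lowercase and strip only."""
    return text.lower().strip()


RUBRIC = "the response correctly identifies the tool"

# Hypothetical judge outputs, one per pattern in the list above.
JUDGE_OUTPUTS = {
    "leading bullet": "\u2022 The response correctly identifies the tool",
    "smart quotes": "\u201cThe response correctly identifies the tool\u201d",
    "em-dash": "The response correctly identifies the tool \u2014",
    "multi-space": "The  response correctly identifies the tool",
    "trailing newline": "The response correctly identifies the tool\n",
}


def find_mismatches() -> list[str]:
    """Return the pattern names whose normalized text no longer equals RUBRIC."""
    return [
        name
        for name, raw in JUDGE_OUTPUTS.items()
        if current_normalize(raw) != RUBRIC
    ]
```

In this sketch the first four entries fail to match, while the plain trailing newline is already removed by `strip()`, so trailing whitespace presumably only causes a miss when it is combined with other markup.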

Suggested Fix

Enhanced normalization:

import re

def _normalize_text(text: str) -> str:
    if not isinstance(text, str):
        return ""
    # Remove non-ASCII first, so any gaps it leaves are collapsed below
    text = text.encode('ascii', 'ignore').decode()
    text = re.sub(r'^[\s*•\-]+', '', text)   # Strip leading bullets
    text = re.sub(r'[\s*•\-]+$', '', text)   # Strip trailing
    text = re.sub(r'\s+', ' ', text)          # Collapse whitespace
    return text.lower().strip()
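
One caveat with an ASCII-only filter is that it deletes characters rather than transliterating them, so a precomposed "café" becomes "caf" and smart quotes vanish entirely. A possible variant, sketched below under the assumption that rubric texts should survive accents and typographic punctuation, maps those characters to ASCII equivalents first. The name `normalize_text_unicode` and the `_PUNCT_MAP` table are hypothetical and not part of google-adk.

```python
import re
import unicodedata

# Hypothetical translation table: typographic punctuation mapped to ASCII.
_PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single smart quotes
    "\u201c": '"', "\u201d": '"',   # double smart quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
})


def normalize_text_unicode(text: object) -> str:
    """Hypothetical variant that transliterates instead of dropping characters."""
    if not isinstance(text, str):
        return ""
    text = text.translate(_PUNCT_MAP)
    # NFKD splits "é" into "e" plus a combining accent; the ASCII filter
    # below then drops only the accent (and characters such as the bullet).
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Strip leading/trailing whitespace, asterisks, and dashes.
    text = re.sub(r"^[\s*\-]+|[\s*\-]+$", "", text)
    # Collapse internal whitespace runs to a single space.
    text = re.sub(r"\s+", " ", text)
    return text.lower().strip()
```

Whether quotes should be kept or removed depends on how the rubric texts themselves are stored; the same function would need to be applied to both sides of the comparison.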

Additionally, a substring fallback when exact match fails would prevent silent scoring failures:

# If exact match fails, try substring match.
# Skip empty input: "" is a substring of every rubric text.
if normalized_response:
    for rubric_text, score in rubric_map.items():
        if normalized_response in rubric_text or rubric_text in normalized_response:
            return score
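
A substring fallback can itself mis-score when the response is empty or when it matches more than one rubric. The following is a minimal sketch of a stricter lookup, assuming `rubric_map` maps normalized rubric text to a numeric score as the fragment above implies. The helper name `lookup_score` and the `Optional[float]` return type are hypothetical.

```python
from typing import Optional


def lookup_score(
    normalized_response: str, rubric_map: dict[str, float]
) -> Optional[float]:
    """Hypothetical helper: exact match first, then an unambiguous substring match."""
    if not normalized_response:
        # "" is a substring of every string, so never fall back on empty input.
        return None
    if normalized_response in rubric_map:
        return rubric_map[normalized_response]
    candidates = [
        score
        for rubric_text, score in rubric_map.items()
        if rubric_text
        and (normalized_response in rubric_text or rubric_text in normalized_response)
    ]
    # Accept the fallback only when exactly one rubric matches.
    return candidates[0] if len(candidates) == 1 else None
```

Returning `None` rather than a default score lets the caller log or raise on an unmatched verdict instead of silently recording the lowest score.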

Impact

Without this fix, GEPA optimization produces unreliable rubric-based scores (rubric_based_final_response_quality_v1, rubric_based_tool_use_quality_v1), leading to suboptimal prompt evolution.

Environment

  • google-adk 2.2.0
  • Judge models: gemini-2.5-pro, gemini-3.5-flash

Metadata

Labels

  • eval: [Component] This issue is related to evaluation
  • needs review: [Status] The PR/issue is awaiting review from the maintainer
