Use Flex processing for the OpenAI judge model #152

Open
JosephMarinier wants to merge 1 commit into main from joseph/use-flex-tier-for-openai-judge-model
@JosephMarinier (Collaborator)

Use Flex processing for the OpenAI judge model, which will halve its cost in exchange for slower response times and occasional resource unavailability. This doesn't affect benchmarking an OpenAI model. More details on the flex tier [here](https://developers.openai.com/api/docs/guides/flex-processing).
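To make the trade-off concrete, here is a minimal sketch of how a caller might cope with the occasional resource unavailability: try the flex tier a few times with backoff, then fall back to the standard tier. It is not necessarily what this PR does. `call_with_flex_fallback`, `ResourceUnavailable`, and the `send` callable are hypothetical names introduced for illustration, and the `"flex"` / `"default"` values for `service_tier` are assumed to match the OpenAI API.

```python
import time


class ResourceUnavailable(Exception):
    """Hypothetical stand-in for the provider's 'resource unavailable' error."""


def call_with_flex_fallback(send, params, retries=2, base_delay=1.0, sleep=time.sleep):
    """Try the cheaper flex tier first, then fall back to the standard tier.

    `send` is any callable that takes a dict of request parameters and
    returns a response (an assumption made for this sketch).
    """
    flex_params = {**params, "service_tier": "flex"}
    for attempt in range(retries):
        try:
            return send(flex_params)
        except ResourceUnavailable:
            # Exponential backoff before retrying on the flex tier.
            sleep(base_delay * 2 ** attempt)
    # Flex capacity never freed up: pay the standard price instead of failing.
    return send({**params, "service_tier": "default"})
```

Passing `sleep` as a parameter keeps the sketch testable without real delays.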
@JosephMarinier self-assigned this Jun 16, 2026
)
category = "accuracy"
default_model = "us.anthropic.claude-opus-4-6-v1"
default_params = {"max_tokens": 100000} # Drop the OpenAI-only flex tier inherited from TextJudgeMetric.

Collaborator

Are we sure 100k is enough? Did you check the prompt size after a 10-minute conversation?

@JosephMarinier (Collaborator, Author), Jun 17, 2026

I did not check that myself, but it has been 100k since the beginning of EVA (inherited from src/eva/metrics/base.py). Is that OK?

Collaborator

ah right, I guess so then.
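For readers following the thread, the override under review relies on plain Python class-attribute shadowing. The sketch below is a hedged illustration, not the EVA source: the contents of `TextJudgeMetric.default_params`, the placeholder base model name, and the subclass name `BedrockJudgedMetric` are all assumptions; only the `category`, `default_model`, and `default_params` lines in the subclass come from the diff.

```python
class TextJudgeMetric:
    """Stand-in for the base judge metric; the real class lives in the EVA source."""

    default_model = "hypothetical-openai-judge-model"  # assumed placeholder
    default_params = {"max_tokens": 100000, "service_tier": "flex"}  # assumed shape


class BedrockJudgedMetric(TextJudgeMetric):  # hypothetical subclass name
    category = "accuracy"
    default_model = "us.anthropic.claude-opus-4-6-v1"
    # Rebinding the attribute replaces the inherited dict wholesale, so the
    # OpenAI-only "service_tier" key is dropped while max_tokens stays at 100k.
    default_params = {"max_tokens": 100000}
```

Because the subclass rebinds `default_params` rather than mutating the inherited dict, the base class keeps its flex setting and only the non-OpenAI judge loses it.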
