Bring Code Review to GA on Self-Hosted

This issue is to bring Code Review to GA on Self-Hosted. Validation of each model × feature combination can be accomplished via the Self-Hosted platform on the Evaluation Runner.

The following models and platforms should be considered and evaluated for use with the feature on Self-Hosted. Validation can be done via the Evaluation Runner with the relevant dataset for each feature. Additional information on each currently supported model can be found here.

Model families

  • Mistral
  • OpenAI
  • Anthropic Claude 3.5 Sonnet
  • Llama 3

Platforms

  • vLLM
  • Azure OpenAI
  • AWS Bedrock

Definition of Done

The following criteria are the definition of done by which the Custom Models team has historically determined readiness for GA; they are enabled by the Self-Hosted platform. Each feature team owns the assessment of its own GA readiness, with the collaborative support of Custom Models.

  • Each model has been validated for performance on the feature across all supported platforms.
  • Examine individual inputs and outputs that scored poorly (scores of 1-2); look for and document any patterns of poor feature performance or poor LLM-judge calibration, then iterate on the model prompt to eliminate those patterns.
  • Achieve less than 20% poor answers (defined as scores of 1 or 2 from an LLM judge, or a cosine similarity below 0.8) with each supported model in those areas for which we have supporting validation datasets (see the sketch after this list).
  • Update the Self-Hosted documentation with the compatibility level.
  • Feature team determines GA readiness and a roll-out plan.
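
As a rough illustration of how the under-20% threshold could be checked, here is a minimal Python sketch. All names (`is_poor`, `poor_answer_rate`), the result-record format, and the 1-5 judge scale are assumptions for illustration; the Evaluation Runner's actual output format may differ.

```python
# Hypothetical sketch: computing the "poor answer" rate from evaluation results.
# The record format and 1-5 judge scale are assumptions, not Evaluation Runner's
# real schema.

from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def is_poor(judge_score: int | None, similarity: float | None) -> bool:
    """An answer counts as poor if the LLM judge scored it 1 or 2,
    or if its cosine similarity to the reference answer is below 0.8."""
    if judge_score is not None:
        return judge_score <= 2
    return similarity is not None and similarity < 0.8

def poor_answer_rate(results: list[dict]) -> float:
    """Fraction of poor answers; GA readiness requires this to be < 0.20."""
    poor = sum(
        is_poor(r.get("judge_score"), r.get("cosine_similarity"))
        for r in results
    )
    return poor / len(results)

# Example: 1 poor answer out of 4 -> 25%, above the 20% threshold.
results = [
    {"judge_score": 4},
    {"judge_score": 2},               # poor: judge score <= 2
    {"cosine_similarity": 0.91},
    {"judge_score": 5},
]
print(f"poor answer rate: {poor_answer_rate(results):.0%}")
```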