Booking.com built Judge-LLM, a framework where strong LLMs evaluate other models against a carefully curated golden dataset. Clear metric definitions, rigorous annotation, and iterative prompt engineering make evaluations more scalable and consistent than relying solely on humans. **The takeaway**: Robust LLM evaluation isn't just about scores; it requires well-defined metrics, trusted judges, and disciplined processes to be reliable in production.
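
One way to decide whether a judge can be trusted is to compare its verdicts with the human labels in the golden dataset. The sketch below is illustrative only: the source does not say how Judge-LLM measures agreement, so the function name `agreement_metrics`, the `pass`/`fail` labels, and the choice of raw accuracy plus Cohen's kappa are all assumptions.

```python
from collections import Counter


def agreement_metrics(golden, judged):
    """Compare judge verdicts with golden (human) labels.

    Hypothetical helper, not part of any published Judge-LLM API.
    Returns raw accuracy and Cohen's kappa (agreement corrected for chance).
    """
    if not golden or len(golden) != len(judged):
        raise ValueError("golden and judged must be non-empty and the same length")

    n = len(golden)
    # Fraction of items where the judge matches the human label.
    observed = sum(g == j for g, j in zip(golden, judged)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    golden_counts = Counter(golden)
    judged_counts = Counter(judged)
    expected = sum(
        golden_counts[label] * judged_counts[label] for label in golden_counts
    ) / (n * n)

    # If both raters only ever use one identical label, kappa is undefined;
    # report 1.0 in that degenerate case (an assumption of this sketch).
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)
    return {"accuracy": observed, "kappa": kappa}


# Made-up labels for illustration; not data from the source.
golden_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "fail", "fail", "fail"]
metrics = agreement_metrics(golden_labels, judge_labels)
print(metrics)
```

Kappa is reported next to accuracy because a judge that always answers with the majority label can score high accuracy on an imbalanced golden set while agreeing with humans no better than chance.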