LLM Evaluation: Practical Tips at Booking.com
A new LLM evaluation framework uses an "LLM-as-judge" setup, in which a strong model plays the role of a human annotator. The judge is prompted (or fine-tuned) to mimic human scores and rate outputs from other LLMs. It runs on a tightly labeled golden dataset, handles both pointwise and head-to-head comparisons, and sh..
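
To make the two comparison modes concrete, here is a minimal Python sketch. It is not the framework's actual code. Every name in it (`GoldenExample`, `pointwise_agreement`, `pairwise_win_rate`, the toy judges) is hypothetical, the binary label scheme is an assumption, and the judges are simple stand-in functions where a real setup would call a prompted or fine-tuned LLM.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class GoldenExample:
    """One human-labeled item from the golden dataset (hypothetical schema)."""
    prompt: str
    response: str
    human_label: int  # assumed binary scheme: 1 = acceptable, 0 = not acceptable


# A pointwise judge scores a single (prompt, response) pair.
PointwiseJudge = Callable[[str, str], int]
# A pairwise judge picks the better of two responses: "a" or "b".
PairwiseJudge = Callable[[str, str, str], str]


def pointwise_agreement(judge: PointwiseJudge,
                        golden: Sequence[GoldenExample]) -> float:
    """Fraction of golden examples where the judge matches the human label."""
    if not golden:
        raise ValueError("golden dataset is empty")
    matches = sum(judge(ex.prompt, ex.response) == ex.human_label for ex in golden)
    return matches / len(golden)


def pairwise_win_rate(judge: PairwiseJudge,
                      pairs: Sequence[Tuple[str, str, str]]) -> float:
    """Fraction of (prompt, response_a, response_b) triples where the judge picks "a"."""
    if not pairs:
        raise ValueError("no pairs to compare")
    wins = sum(judge(prompt, a, b) == "a" for prompt, a, b in pairs)
    return wins / len(pairs)


# Toy stand-ins for the LLM judge, used only so the sketch runs offline.
def toy_pointwise_judge(prompt: str, response: str) -> int:
    # Accept any response of at least three words.
    return 1 if len(response.split()) >= 3 else 0


def toy_pairwise_judge(prompt: str, response_a: str, response_b: str) -> str:
    # Prefer the longer response; ties go to "a".
    return "a" if len(response_a) >= len(response_b) else "b"


golden = [
    GoldenExample("q1", "a full helpful answer", 1),
    GoldenExample("q2", "no", 0),
    GoldenExample("q3", "short reply", 1),
]
pairs = [
    ("q1", "long answer here", "no"),
    ("q2", "x", "longer"),
]

agreement = pointwise_agreement(toy_pointwise_judge, golden)
win_rate = pairwise_win_rate(toy_pairwise_judge, pairs)
print(f"pointwise agreement with human labels: {agreement:.2f}")
print(f"win rate of response A in head-to-head: {win_rate:.2f}")
```

In this sketch the golden dataset serves as the check on the judge itself: low agreement with the human labels would suggest the judge prompt needs work before its scores are trusted on other models' outputs.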