Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 8 - LLM Evaluation

109 minutes | Advanced | AI for Business

Stanford Online. Methodical pass through rule-based metrics, LLM-as-judge biases, factuality and agent evaluation, and the failure modes of static benchmarks. Use it as the theory companion to the article's section on choosing what to measure and why most off-the-shelf metrics under-predict real regressions.

AI Expert note

Use this for evaluation theory, not as a product checklist. Translate the ideas into small tests tied to your own risk categories, user tasks and release process.

What you should get from this

Understand why benchmark scores, model judges and factuality metrics can fail to predict product regressions.

Watch or know first

You should be comfortable with technical lecture material and basic LLM evaluation terminology.
