Introduction
Evaluating LLMs: Introduction
1. Foundations of LLM Evaluation
Topics
Introduction to evaluation: Why it matters
Generative versus understanding tasks
Key metrics for common tasks
2. Evaluating Generative Tasks
Topics
Evaluating multiple-choice tasks
Evaluating free text response tasks, part 1
Evaluating free text response tasks, part 2
AIs supervising AIs: LLM as a judge
3. Evaluating Understanding Tasks
Topics
Evaluating embedding tasks
Evaluating classification tasks
Building an LLM classifier with BERT and GPT
4. Using Benchmarks Effectively
Topics
The role of benchmarks
Interrogating common benchmarks
Evaluating LLMs with benchmarks
5. Probing LLMs for a World Model
Topics
Probing LLMs for knowledge
Probing LLMs to play games
6. Evaluating LLM Fine-Tuning
Topics
Fine-tuning objectives
Metrics for fine-tuning success
Practical demonstration: Evaluating fine-tuning
Evaluating and cleaning data
7. Case Studies
Topics
Evaluating AI agents: Task automation and tool integration
Measuring retrieval-augmented generation (RAG) systems
Building and evaluating a recommendation engine using LLMs
Using evaluation to combat AI drift
Time-series regression
8. Summary of Evaluation and Looking Ahead
Topics
When and how to evaluate
Looking ahead: Trends in LLM evaluation
Conclusion
Evaluating LLMs: Summary