This beta introduces datasets and evaluation workflows, enabling you to organize LLM spans, run evaluations with custom evaluators, compare alternative responses with Remix, and track results over time.
Features & Enhancements
- Datasets — Create and manage datasets of LLM spans with full CRUD operations, version history, and multi-row selection. Use datasets to organize spans for evaluation and comparison workflows.
- Dataset Remix — Generate alternative LLM responses for dataset spans using different models or providers. Compare outputs side by side in an inline, expandable comparison view and track results in a leaderboard.
- Evaluators & Evaluations — Define custom evaluators with prompt templates, including built-in defaults. Run evaluations against individual spans or entire datasets with live streaming progress updates, and stop or restart runs as needed (see the sketch after this list).
- Manual Scoring — Manually score dataset spans with custom score titles for human-in-the-loop evaluation workflows.
- Evaluation Results — View evaluation results with delta comparisons from previous runs, variance statistics per evaluator, and visual stat bars for quick insights.
- Improved Onboarding — Redesigned welcome page with interactive demo project containing pre-seeded data and automatic navigator expansion for first-time users.
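For readers wiring up evaluators, the following sketch shows the general shape of a prompt-template evaluator run over dataset spans. All names here (`Evaluator`, `runEvaluation`, the `relevance` template, the judge model id) are illustrative assumptions for this sketch, not the product's actual API.

```typescript
// Hypothetical sketch -- names and signatures are illustrative, not the actual API.

// An evaluator pairs a prompt template with a judge model.
interface Evaluator {
  name: string;
  // {{input}} and {{output}} are filled in from each span under evaluation.
  promptTemplate: string;
  model: string;
}

// A built-in default might look like this relevance check.
const relevance: Evaluator = {
  name: "relevance",
  promptTemplate:
    "Given the user input:\n{{input}}\n\n" +
    "Rate how relevant this response is from 0 to 1, answering with only the number:\n{{output}}",
  model: "gpt-4o-mini", // assumed judge model id
};

// Run the evaluator over every span, emitting one score per span as results stream in.
async function runEvaluation(
  evaluator: Evaluator,
  judge: (model: string, prompt: string) => Promise<string>,
  spans: Array<{ input: string; output: string }>,
): Promise<number[]> {
  const scores: number[] = [];
  for (const span of spans) {
    const prompt = evaluator.promptTemplate
      .replace("{{input}}", span.input)
      .replace("{{output}}", span.output);
    const raw = await judge(evaluator.model, prompt);
    const parsed = Number.parseFloat(raw);
    scores.push(Number.isNaN(parsed) ? 0 : parsed);
    // A real run would stream this progress update to the UI.
    console.log(`${evaluator.name}: span ${scores.length}/${spans.length} scored`);
  }
  return scores;
}
```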
Fixes & Improvements
- Column Persistence — Fixed column order and visibility not persisting correctly in spans table.
- Timestamp Handling — Improved nanosecond timestamp handling across the codebase to prevent precision loss (see the sketch after this list).
- Cost Tracking — Fixed floating-point precision artifacts in cost calculations and improved cost chart accuracy.
- UI Improvements — Various fixes for table styling, tooltip behavior, context menu focus, and layout stability.
- Model Prices — Updated model pricing and context window data.
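As background on the two precision fixes above: JavaScript numbers are IEEE-754 doubles with 53 bits of integer precision, so nanosecond Unix timestamps (around 1.7 × 10^18) cannot be stored exactly in a plain number, and repeatedly summing tiny per-token dollar amounts accumulates rounding artifacts. The sketch below demonstrates both hazards and a common mitigation; the rates and values are made up, and this is not the project's actual code.

```typescript
// Illustrative sketch of the two precision hazards; not the project's actual code.

// 1. Nanosecond timestamps exceed Number's 53-bit integer precision.
const ns = 1715000000123456789n;             // keep as BigInt end to end
console.log(BigInt(Number(ns)) === ns);      // false: the round trip loses digits
const ms = Number(ns / 1_000_000n);          // downscale first; milliseconds fit safely
console.log(ms);                             // 1715000000123

// 2. Per-token costs are tiny floats, so naive sums accumulate artifacts.
const perTokenCost = 0.15 / 1_000_000;       // e.g. $0.15 per 1M tokens (assumed rate)
let naive = 0;
for (let i = 0; i < 3; i++) naive += 1234 * perTokenCost;
console.log(naive);                          // likely 0.0005552999... rather than 0.0005553

// Mitigation: accumulate in an integer unit (here nano-dollars), divide once at the end.
const nanoDollarsPerToken = Math.round(perTokenCost * 1e9); // 150
const exact = (3 * 1234 * nanoDollarsPerToken) / 1e9;
console.log(exact.toFixed(7));               // "0.0005553"
```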