A benchmark for evaluating AI agents as meeting participants
Every major category of AI agents has a benchmark. SWE-bench for code. tau-bench for customer service. GAIA for general assistance. None tests whether an AI agent can join a professional meeting, understand complex domain context across turns, and produce real deliverables. ModylBench fills that gap.
Six professional verticals -- financial analysis, deep research, business strategy, optimization, business analysis, and clinical science -- each with 48 to 52 scripted turns that progress through context-setting, active work, adversarial edge cases, and deliverable handoff. The scoring system applies substance-weighted evaluation across 11 dimensions with anti-gaming layers that prevent agents from inflating scores through politeness or formatting alone. What matters is what the agent produces, not how politely it produces it.
Scenarios span audio, chat, screen share, and data channels. Agents must operate across modalities, not just generate text in a single stream.
Optional edit-by-edit trajectory capture shows how work products evolved. Measures mutation efficiency, convergence rate, and destructive overwrites.
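As a rough illustration, here is one way such metrics could be operationalized. The `Edit` shape, the overwrite threshold, and both formulas below are assumptions made for the sketch, not the benchmark's actual definitions:

```python
from dataclasses import dataclass

@dataclass
class Edit:
    inserted: str
    deleted: str

def trajectory_metrics(edits: list[Edit], final_text: str) -> tuple[float, int]:
    # Mutation efficiency (assumed definition): how much of the total
    # churn survives into the final deliverable.
    churn = sum(len(e.inserted) + len(e.deleted) for e in edits)
    efficiency = len(final_text) / churn if churn else 0.0
    # Destructive overwrite (assumed definition): an edit that removes
    # far more text than it adds.
    destructive = sum(1 for e in edits if len(e.deleted) > 3 * max(len(e.inserted), 1))
    return efficiency, destructive
```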
Six layers of defense -- substance weighting, hard floors, pessimistic consensus, disagreement flags, mutation tracking, and multi-judge scoring -- prevent score inflation.
The scoring formula weights destination (work products) at 60% and journey (turn quality) at 40%. Correctness and completeness dominate. Charm cannot compensate for wrong answers.
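A minimal sketch of that combination, assuming 1--10 inputs, simple mean aggregation, and illustrative names (the 60/40 split is from the formula above; everything else is an assumption):

```python
def overall_score(product_scores: list[float], turn_scores: list[float]) -> float:
    # Destination (work products) carries 60%, journey (turn quality) 40%.
    destination = sum(product_scores) / len(product_scores)
    journey = sum(turn_scores) / len(turn_scores)
    return 0.6 * destination + 0.4 * journey
```

Under this weighting, perfect turns (10) paired with failed deliverables (2) land at 0.6 × 2 + 0.4 × 10 = 5.2 out of 10: charm cannot buy back a wrong answer.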
The harness drives a human persona through scripted turns, collects agent responses and work products, scores each turn across 11 dimensions, and computes a weighted scorecard with tier assignment.
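The flow looks roughly like the sketch below. Every identifier is an assumption about the harness API, not its real interface; it reuses the `overall_score` helper above and the `consensus` helper sketched under the defense layers, and it omits tier assignment from the final number:

```python
def run_scenario(agent, scenario, judges):
    turn_scores = []
    for turn in scenario.turns:                          # scripted persona turns
        reply = agent.respond(turn.persona_message)      # agent answers in character
        per_judge = [j.score_turn(turn, reply) for j in judges]
        turn_scores.append(consensus(per_judge))         # pessimistic multi-judge consensus

    product_scores = [
        consensus([j.score_product(p) for j in judges])
        for p in agent.work_products                     # collected deliverables
    ]
    return overall_score(product_scores, turn_scores)    # 60/40 weighted scorecard
```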
Each scenario is a complete meeting simulation where the agent must demonstrate domain expertise, produce real deliverables, and handle adversarial curveballs under time pressure.
| Vertical | Scenario | Tier | Turns | Edge Cases |
|---|---|---|---|---|
| Financial Analyst | CloudSync LBO Model | Consultant | 52 | 5 |
| Deep Researcher | Solid-State Battery Intelligence Briefing | Mentor | 48 | 5 |
| Business Strategist | SEA Telehealth Market Entry Strategy | Consultant | 50 | 5 |
| Optimization Solver | Q4 Supply Chain Distribution Optimization | Peer | 48 | 4 |
| Business Analyst | Q3 Pipeline Conversion Rate Diagnosis | Peer | 50 | 5 |
| Scientist | Phase II Hypertension Trial Statistical Analysis | Consultant | 52 | 5 |
Turns: scored 1--10 each, substance-weighted 70/30.
Work products: scored 1--10 each, correctness-weighted.
If context_accuracy < 4.0 or task_progress < 4.0 on any turn, that turn's score is capped at 4.0 regardless of the other dimensions. A fundamentally broken turn cannot be rescued by charm. The same floor applies to work products when correctness < 4.0.
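In code, the floor is a one-line cap. The dimension names and thresholds come from the rule above; the function shape is an assumption:

```python
def apply_hard_floor(dims: dict[str, float], weighted: float) -> float:
    # A failed core dimension caps the whole turn at 4.0.
    if dims["context_accuracy"] < 4.0 or dims["task_progress"] < 4.0:
        return min(weighted, 4.0)
    return weighted
```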
Benchmarks that can be gamed produce misleading rankings. ModylBench includes structural defenses at every level of the evaluation pipeline.
70% substance, 30% style for turn scoring. Politeness without accuracy yields a low score.
Scores are capped when core dimensions fail. Broken turns cannot be rescued by charm.
When judges disagree by more than 3 points, the lower score is preferred.
Standard deviation above 2 triggers manual review, so inconsistencies surface instead of hiding; see the sketch after this list.
Tracks edit-by-edit evolution. No credit for a final state without demonstrating the work.
Multiple judge models score independently. No single-model bias in evaluation.
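A minimal sketch of the consensus and flagging rules: the two thresholds come from the descriptions above, while the mean fallback and the function shapes are assumptions.

```python
from statistics import mean, stdev

def consensus(scores: list[float]) -> float:
    # Judges disagreeing by more than 3 points: prefer the lower score.
    if max(scores) - min(scores) > 3:
        return min(scores)
    return mean(scores)  # the mean fallback is an assumption

def needs_manual_review(scores: list[float]) -> bool:
    # A standard deviation above 2 surfaces the turn for human inspection.
    return len(scores) > 1 and stdev(scores) > 2
```

For example, judge scores of 8, 4, and 7 spread across 4 points, so the turn records the pessimistic 4 rather than the 6.3 mean.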
Run scenarios through the evaluation harness, or load the dataset directly.
Baseline results coming soon
Submit your own results via the evaluation harness.