ModylBench

a benchmark for evaluating AI agents as meeting participants

6 verticals · 300 turns · 11 dimensions · 29 edge cases

Why this matters

Every major category of AI agents has a benchmark. SWE-bench for code. tau-bench for customer service. GAIA for general assistance. None tests whether an AI agent can join a professional meeting, understand complex domain context across turns, and produce real deliverables. ModylBench fills that gap.

Six professional verticals -- financial analysis, deep research, business strategy, optimization, business analysis, and clinical science -- each with 48 to 52 scripted turns that progress through context-setting, active work, adversarial edge cases, and deliverable handoff. The scoring system applies substance-weighted evaluation across 11 dimensions with anti-gaming layers that prevent agents from inflating scores through politeness or formatting alone. What matters is what the agent produces, not how politely it produces it.

What sets this apart

Multimodal Evaluation

Scenarios span audio, chat, screen share, and data channels. Agents must operate across modalities, not just generate text in a single stream.

CRDT Mutation Tracking

Optional edit-by-edit trajectory capture shows how work products evolved. Measures mutation efficiency, convergence rate, and destructive overwrites.

Anti-Gaming Framework

Six layers of defense -- substance weighting, hard floors, pessimistic consensus, disagreement flags, mutation tracking, and multi-judge scoring -- prevent score inflation.

Substance Over Style

The scoring formula weights destination (work products) at 60% and journey (turn quality) at 40%. Correctness and completeness dominate. Charm cannot compensate for wrong answers.

Evaluation pipeline

The harness drives a human persona through scripted turns, collects agent responses and work products, scores each turn across 11 dimensions, and computes a weighted scorecard with tier assignment.

SCENARIO              AGENT                 JUDGE                 SCORECARD
+-----------------+   +-----------------+   +-----------------+   +-----------------+
|                 |   |                 |   |                 |   |                 |
| human persona   |   | model under     |   | multimodal      |   | turn scores     |
| scripted turns  +-->+ test responds   +-->+ judge (LLM or   +-->+ product scores  |
| expected        |   | to each turn    |   | programmatic)   |   | tier assignment |
| outputs         |   |                 |   |                 |   | pass@k          |
|                 |   | responses +     |   |                 |   |                 |
+-----------------+   | work products   |   +-----------------+   +-----------------+
                      +-----------------+
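The loop the harness runs can be sketched as follows. This is illustrative only: the function and field names (`run_scenario`, `persona_message`, `work_products`, `score_turn`, `score_product`) are assumptions, not the harness's actual API.

```python
# Illustrative sketch of the evaluation loop; names are assumptions,
# not the harness's real API.
def run_scenario(scenario, agent, judge):
    turn_scores, products = [], []
    for turn in scenario["turns"]:
        # The agent (model under test) answers each scripted persona turn.
        response = agent(turn["persona_message"])
        turn_scores.append(judge.score_turn(turn, response))
        products.extend(response.get("work_products", []))
    # Work products are scored separately from the conversation itself.
    product_scores = [judge.score_product(p) for p in products]
    journey = sum(turn_scores) / len(turn_scores)
    destination = (
        sum(product_scores) / len(product_scores) if product_scores else 0.0
    )
    # Final combination uses the published 40/60 journey/destination split.
    return 0.4 * journey + 0.6 * destination
```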

Six professional verticals

Each scenario is a complete meeting simulation where the agent must demonstrate domain expertise, produce real deliverables, and handle adversarial curveballs under time pressure.

Vertical             Scenario                                          Tier        Turns  Edge Cases
Financial Analyst    CloudSync LBO Model                               Consultant  52     5
Deep Researcher      Solid-State Battery Intelligence Briefing         Mentor      48     5
Business Strategist  SEA Telehealth Market Entry Strategy              Consultant  50     5
Optimization Solver  Q4 Supply Chain Distribution Optimization         Peer        48     4
Business Analyst     Q3 Pipeline Conversion Rate Diagnosis             Peer        50     5
Scientist            Phase II Hypertension Trial Statistical Analysis  Consultant  52     5

11 dimensions, substance-weighted

modylbench_score = 0.4 × journey + 0.6 × destination
Substance over style. What the agent produces weighs more than how it gets there.
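In code, the top-level formula is a single weighted sum:

```python
def modylbench_score(journey: float, destination: float) -> float:
    """Combine the journey (turn quality) and destination (work product)
    averages, both on a 1-10 scale; destination dominates at 60%."""
    return 0.4 * journey + 0.6 * destination
```

For example, strong turns (8.0) paired with a weak deliverable (6.0) yield 6.8: a Peer-tier result, not Mentor, because the deliverable drags the score down.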

Turn Quality (Journey)

Scored 1--10 per turn, substance-weighted 70/30

context_accuracy 0.25 Did the agent correctly understand the domain context?
task_progress 0.25 Did this turn advance toward the meeting goal?
iteration_quality 0.20 How well did the agent incorporate feedback?
adaptability 0.15 Did the agent handle unexpected inputs gracefully?
presentation_quality 0.10 Was the output well-formatted and professional?
social_quality 0.05 Was the conversational interaction natural?

Work Product Quality (Destination)

Scored 1--10 per product, correctness-weighted

correctness 0.30 Are the facts, calculations, and data accurate?
completeness 0.25 Does the product contain all requested components?
actionability 0.20 Can a professional use this deliverable as-is?
professional_quality 0.15 Does it meet industry-standard formatting?
format_presentation 0.10 Is the visual and structural presentation polished?
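Both weight tables reduce to a dot product over per-dimension scores. A minimal sketch, with the weights copied from the tables above (the function name is illustrative):

```python
# Weights from the turn-quality and work-product tables above.
TURN_WEIGHTS = {
    "context_accuracy": 0.25, "task_progress": 0.25,
    "iteration_quality": 0.20, "adaptability": 0.15,
    "presentation_quality": 0.10, "social_quality": 0.05,
}
PRODUCT_WEIGHTS = {
    "correctness": 0.30, "completeness": 0.25, "actionability": 0.20,
    "professional_quality": 0.15, "format_presentation": 0.10,
}

def weighted_score(dim_scores: dict, weights: dict) -> float:
    """Weighted sum of per-dimension 1-10 scores."""
    return sum(weights[d] * dim_scores[d] for d in weights)
```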

Hard Floor Rule

If context_accuracy < 4.0 or task_progress < 4.0 on any turn, that turn is capped at 4.0 regardless of other scores. A fundamentally broken turn cannot be rescued by charm. Same rule applies to correctness < 4.0 on work products.
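As a sketch, the cap is a simple post-processing step applied after the weighted score is computed:

```python
def apply_hard_floor(score: float, core_dims: dict, floor: float = 4.0) -> float:
    """Cap a turn or product score at `floor` when any core dimension
    (context_accuracy / task_progress for turns, correctness for
    products) falls below the floor; otherwise pass the score through."""
    if any(v < floor for v in core_dims.values()):
        return min(score, floor)
    return score
```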

Tier thresholds

Peer        ≥ 6.0   Competent colleague. Gets the job done reliably.
Mentor      ≥ 7.5   Senior expert. Insightful, anticipatory, thorough.
Consultant  ≥ 9.0   Top-tier advisory. Polished and comprehensive.
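Tier assignment is a threshold lookup. A sketch, noting that the label for scores below 6.0 is an assumption, since no sub-Peer tier is named here:

```python
def assign_tier(score: float) -> str:
    # Thresholds match the tier definitions above, checked highest first.
    if score >= 9.0:
        return "Consultant"
    if score >= 7.5:
        return "Mentor"
    if score >= 6.0:
        return "Peer"
    return "Unranked"  # assumed label; no sub-6.0 tier is named
```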

Six layers of defense

Benchmarks that can be gamed produce misleading rankings. ModylBench includes structural defenses at every level of the evaluation pipeline.

Substance Weighting

70% substance, 30% style for turn scoring. Politeness without accuracy yields a low score.

Hard Floors

Score caps when core dimensions fail. Broken turns cannot be rescued by charm.

Pessimistic Consensus

When judges disagree by more than 3 points, the lower score is preferred.

Disagreement Flags

Standard deviation above 2 triggers manual review, so inconsistencies surface rather than hide.

Mutation Trajectory

Tracks edit-by-edit evolution. No credit for a final state without demonstrating the work.

Multi-Judge

Multiple judge models score independently. No single-model bias in evaluation.
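Three of these layers (pessimistic consensus, disagreement flags, and multi-judge scoring) compose naturally. A minimal sketch, assuming the minimum score stands in for "the lower score" when more than two judges disagree; the 3-point gap and σ > 2 triggers come from the text:

```python
import statistics

def consensus(judge_scores: list[float]) -> tuple[float, bool]:
    """Aggregate independent judge scores into (score, flagged).

    If the judges' spread exceeds 3 points, take the minimum score
    (pessimistic consensus) instead of the mean; if the standard
    deviation exceeds 2, flag the turn for manual review.
    """
    spread = max(judge_scores) - min(judge_scores)
    score = min(judge_scores) if spread > 3 else statistics.mean(judge_scores)
    flagged = len(judge_scores) > 1 and statistics.stdev(judge_scores) > 2
    return score, flagged
```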

Three commands

# install
pip install modylbench
# score pre-recorded agent responses
modylbench evaluate --responses responses.jsonl --output scorecard.json
# see all available scenarios
modylbench list-scenarios

or load the dataset directly

from datasets import load_dataset

ds = load_dataset("use-aleatoric/modylbench", split="test")

Results

Baseline results coming soon

Submit your own results via the evaluation harness.

# submit results to the leaderboard
modylbench evaluate --responses responses.jsonl --output scorecard.json
modylbench submit --scorecard scorecard.json --model "your-model-name"

Cite this work

@misc{modylbench2026,
  title={{ModylBench}: A Multimodal Benchmark for Evaluating {AI} Agents as Meeting Participants},
  author={{aleatoric research}},
  year={2026},
  url={https://huggingface.co/datasets/use-aleatoric/modylbench},
  note={Dataset, evaluation harness, and leaderboard at https://github.com/use-aleatoric/modylbench}
}