A benchmark for evaluating AI agents as meeting participants
Every major category of AI agents has a benchmark. SWE-bench for code. tau-bench for customer service. GAIA for general assistance. None tests whether an AI agent can join a professional meeting, understand complex domain context across turns, and produce real deliverables. ModylBench fills that gap.
Six professional verticals -- financial analysis, deep research, business strategy, optimization, business analysis, and clinical science -- each with 48 to 52 scripted turns that progress through context-setting, active work, adversarial edge cases, and deliverable handoff. The scoring system applies substance-weighted evaluation across 11 dimensions with anti-gaming layers that prevent agents from inflating scores through politeness or formatting alone. What matters is what the agent produces, not how politely it produces it.
Scenarios span audio, chat, screen share, and data channels. Agents must operate across modalities, not just generate text in a single stream.
Optional edit-by-edit trajectory capture shows how work products evolved. Measures mutation efficiency, convergence rate, and destructive overwrites.
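As a rough illustration, here is one way such metrics could be operationalized. The `Edit` shape, the overwrite threshold, and both formulas below are assumptions made for the sketch, not the benchmark's actual definitions:

```python
from dataclasses import dataclass

@dataclass
class Edit:
    inserted: str
    deleted: str

def trajectory_metrics(edits: list[Edit], final_text: str) -> tuple[float, int]:
    # Mutation efficiency (assumed definition): how much of the total
    # churn survives into the final deliverable.
    churn = sum(len(e.inserted) + len(e.deleted) for e in edits)
    efficiency = len(final_text) / churn if churn else 0.0
    # Destructive overwrite (assumed definition): an edit that removes
    # far more text than it adds.
    destructive = sum(1 for e in edits if len(e.deleted) > 3 * max(len(e.inserted), 1))
    return efficiency, destructive
```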
Six layers of defense -- substance weighting, hard floors, pessimistic consensus, disagreement flags, mutation tracking, and multi-judge scoring -- prevent score inflation.
The scoring formula weights destination (work products) at 60% and journey (turn quality) at 40%. Correctness and completeness dominate. Charm cannot compensate for wrong answers.
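A minimal sketch of that combination, assuming 1--10 inputs, simple mean aggregation, and illustrative names (the 60/40 split is from the formula above; everything else is an assumption):

```python
def overall_score(product_scores: list[float], turn_scores: list[float]) -> float:
    # Destination (work products) carries 60%, journey (turn quality) 40%.
    destination = sum(product_scores) / len(product_scores)
    journey = sum(turn_scores) / len(turn_scores)
    return 0.6 * destination + 0.4 * journey
```

Under this weighting, perfect turns (10) paired with failed deliverables (2) land at 0.6 × 2 + 0.4 × 10 = 5.2 out of 10: charm cannot buy back a wrong answer.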
The harness drives a human persona through scripted turns, collects agent responses and work products, scores each turn across 11 dimensions, and computes a weighted scorecard with tier assignment.
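The flow looks roughly like the sketch below. Every identifier is an assumption about the harness API, not its real interface; it reuses the `overall_score` helper above and the `consensus` helper sketched under the defense layers, and it omits tier assignment from the final number:

```python
def run_scenario(agent, scenario, judges):
    turn_scores = []
    for turn in scenario.turns:                          # scripted persona turns
        reply = agent.respond(turn.persona_message)      # agent answers in character
        per_judge = [j.score_turn(turn, reply) for j in judges]
        turn_scores.append(consensus(per_judge))         # pessimistic multi-judge consensus

    product_scores = [
        consensus([j.score_product(p) for j in judges])
        for p in agent.work_products                     # collected deliverables
    ]
    return overall_score(product_scores, turn_scores)    # 60/40 weighted scorecard
```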
Each scenario is a complete meeting simulation where the agent must demonstrate domain expertise, produce real deliverables, and handle adversarial curveballs under time pressure.
| Vertical | Scenario | Tier | Turns | Edge Cases |
|---|---|---|---|---|
| Financial Analyst | CloudSync LBO Model | Consultant | 52 | 5 |
| Deep Researcher | Solid-State Battery Intelligence Briefing | Mentor | 48 | 5 |
| Business Strategist | SEA Telehealth Market Entry Strategy | Consultant | 50 | 5 |
| Optimization Solver | Q4 Supply Chain Distribution Optimization | Peer | 48 | 4 |
| Business Analyst | Q3 Pipeline Conversion Rate Diagnosis | Peer | 50 | 5 |
| Scientist | Phase II Hypertension Trial Statistical Analysis | Consultant | 52 | 5 |
Turns: scored 1--10 each, substance-weighted 70/30.
Work products: scored 1--10 each, correctness-weighted.
If context_accuracy < 4.0 or task_progress < 4.0 on any turn, that turn's score is capped at 4.0 regardless of the other dimensions. A fundamentally broken turn cannot be rescued by charm. The same floor applies to work products when correctness < 4.0.
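In code, the floor is a one-line cap. The dimension names and thresholds come from the rule above; the function shape is an assumption:

```python
def apply_hard_floor(dims: dict[str, float], weighted: float) -> float:
    # A failed core dimension caps the whole turn at 4.0.
    if dims["context_accuracy"] < 4.0 or dims["task_progress"] < 4.0:
        return min(weighted, 4.0)
    return weighted
```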
Benchmarks that can be gamed produce misleading rankings. ModylBench includes structural defenses at every level of the evaluation pipeline.
70% substance, 30% style for turn scoring. Politeness without accuracy yields a low score.
Scores are capped when core dimensions fail. Broken turns cannot be rescued by charm.
When judges disagree by more than 3 points, the lower score is preferred.
Standard deviation above 2 triggers manual review, so inconsistencies surface instead of hiding; see the sketch after this list.
Tracks edit-by-edit evolution. No credit for a final state without demonstrating the work.
Multiple judge models score independently. No single-model bias in evaluation.
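A minimal sketch of the consensus and flagging rules: the two thresholds come from the descriptions above, while the mean fallback and the function shapes are assumptions.

```python
from statistics import mean, stdev

def consensus(scores: list[float]) -> float:
    # Judges disagreeing by more than 3 points: prefer the lower score.
    if max(scores) - min(scores) > 3:
        return min(scores)
    return mean(scores)  # the mean fallback is an assumption

def needs_manual_review(scores: list[float]) -> bool:
    # A standard deviation above 2 surfaces the turn for human inspection.
    return len(scores) > 1 and stdev(scores) > 2
```

For example, judge scores of 8, 4, and 7 spread across 4 points, so the turn records the pessimistic 4 rather than the 6.3 mean.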
Run scenarios through the evaluation harness, or load the dataset directly.
Baseline results coming soon
Submit your own results via the evaluation harness.