𝜏³-bench: advancing agent benchmarking to knowledge and voice
𝜏³-bench is here. We've expanded agent evaluation to two new frontiers: knowledge retrieval and voice.
March 18, 2026 · Product, Thought leadership

𝜏²-bench: evaluating conversational agents in a dual-control environment
𝜏²-bench challenges AI agents not just to reason and act, but also to coordinate with, guide, and assist a user in achieving a shared objective. This shift from solo operation to co-ownership of a task puts agents in a much more demanding setting.
June 11, 2025 · Product, Thought leadership

𝜏-bench: benchmarking AI agents for the real-world
Sierra’s AI research team is on a mission to advance the frontier of conversational AI agents. In this research paper, we present a new benchmark for evaluating AI agents' performance and reliability in real-world settings, with dynamic user and tool interaction.
June 17, 2024 · Product, Thought leadership