AnyAi.fyi - Discover ANY AI to make more online for less.

venturebeat

Has this stealth startup finally cracked the code on enterprise AI agent reliability? Meet AUI's Apollo-1

For more than a decade, conversational AI has promised human-like assistants that can do more than chat. Yet even as large language models (LLMs) like ChatGPT, Gemini, and Claude learn to reason, explain, and code, one critical category of interaction remains largely unsolved — reliably completing tasks for people outside of chat. Even the best AI models score only in the 30th percentile on Terminal-Bench Hard, a third-party benchmark designed to evaluate the performance of AI agents on completing a variety of browser-based tasks, far below the reliability demanded by most enterprises and users. And task-specific benchmarks like TAU-Bench airline, which measures the reliability of AI agents on finding and booking flights on behalf of a user, also don't have much higher pass rates, with only 56% for the top performing agents and models (Claude 3.7 Sonnet) — meaning the agent fails nearly half the time. New York City-based Augmented Intelligence (AUI) Inc., co-founded by Ohad Elhelo and Ori Cohen, believes it has finally come with a solution to boost AI agent reliability to a level where most enterprises can trust they will do as instructed, reliably. The company’s new foundation model, called Apollo-1 — which remains in preview with early testers now but is close to an impending general release — is built on a principle it calls stateful neuro-symbolic reasoning.It's a hybrid architecture championed by even LLM skeptics like Gary Marcus, designed to guarantee consistent, policy-compliant outcomes in every customer interaction.“Conversational AI is essentially two halves,” said Elhelo in a recent interview with VentureBeat. “The first half — open-ended dialogue — is handled beautifully by LLMs. They’re designed for creative or exploratory use cases. The other half is task-oriented dialogue, where there’s always a specific goal behind the conversation. That half has remained unsolved because it requires certainty.”AUI defines certainty as the difference between an agent that “probably” performs a task and one that almost “always” does. For example, on TAU-Bench Airline, it performs at a staggering 92.5% pass rate, leaving all the other current competitors far behind in the dust — according to benchmarks shared with VentureBeat and posted on AUI's website.Elhelo offered simple examples: a bank that must enforce ID verification for refunds over $200, or an airline that must always offer a business-class upgrade before economy. “Those aren’t preferences,” he said. “They’re requirements. And no purely generative approach can deliver that kind of behavioral certainty.”AUI and its work on improving reliability was previously covered by subscription news outlet The Information, but has not received widespread coverage in publicly accessible media — until now. From Pattern Matching to Predictable ActionThe team argues that transformer models, by design, can’t meet that bar. Large language models generate plausible text, not guaranteed behavior. “When you tell an LLM to always offer insurance before payment, it might — usually,” Elhelo said. “Configure Apollo-1 with that rule, and it will — every time.”That distinction, he said, stems from the architecture itself. Transformers predict the next token in a sequence. Apollo-1, by contrast, predicts the next action in a conversation, operating on what AUI calls a typed symbolic state.Cohen explained the idea in more technical terms. “Neuro-symbolic means we’re merging the two dominant paradigms,” he said. “The symbolic layer gives you structure — it knows what an intent, an entity, and a parameter are — while the neural layer gives you language fluency. The neuro-symbolic reasoner sits between them. It’s a different kind of brain for dialogue.”Where transformers treat every output as text generation, Apollo-1 runs a closed reasoning loop: an encoder translates natural language into a symbolic state, a state machine maintains that state, a decision engine determines the next action, a planner executes it, and a decoder turns the result back into language. “The process is iterative,” Cohen said. “It loops until the task is done. That’s how you get determinism instead of probability.”A Foundation Model for Task ExecutionUnlike traditional chatbots or bespoke automation systems, Apollo-1 is meant to serve as a foundation model for task-oriented dialogue — a single, domain-agnostic system that can be configured for banking, travel, retail, or insurance through what AUI calls a System Prompt.“The System Prompt isn’t a configuration file,” Elhelo said. “It’s a behavioral contract. You define exactly how your agent must behave in situations of interest, and Apollo-1 guarantees those behaviors will execute.”Organizations can use the prompt to encode symbolic slots — intents, parameters, and policies — as well as tool boundaries and state-dependent rules. A food delivery app, for example, might enforce “if allergy mentioned, always inform the restaurant,” while a telecom provider might define “after three failed payment attempts, suspend service.” In both cases, the behavior executes deterministically, not statistically.Eight Years in the MakingAUI’s path to Apollo-1 began in 2017, when the team started encoding millions of real task-oriented conversations handled by a 60,000-person human agent workforce. That work led to a symbolic language capable of separating procedural knowledge — steps, constraints, and flows — from descriptive knowledge like entities and attributes.“The insight was that task-oriented dialogue has universal procedural patterns,” said Elhelo. “Food delivery, claims processing, and order management all share similar structures. Once you model that explicitly, you can compute over it deterministically.”From there, the company built the neuro-symbolic reasoner — a system that uses the symbolic state to decide what happens next rather than guessing through token prediction.Benchmarks suggest the architecture makes a measurable difference. In AUI’s own evaluations, Apollo-1 achieved over 90 percent task completion on the τ-Bench-Airline benchmark, compared with 60 percent for Claude-4. It completed 83 percent of live booking chats on Google Flights versus 22 percent for Gemini 2.5-Flash, and 91 percent of retail scenarios on Amazon versus 17 percent for Rufus.“These aren’t incremental improvements,” said Cohen. “They’re order-of-magnitude reliability differences.”A Complement, Not a CompetitorAUI isn’t pitching Apollo-1 as a replacement for large language models, but as their necessary counterpart. In Elhelo’s words: “Transformers optimize for creative probability. Apollo-1 optimizes for behavioral certainty. Together, they form the complete spectrum of conversational AI.”The model is already running in limited pilots with undisclosed Fortune 500 companies across sectors including finance, travel, and retail. AUI has also confirmed a strategic partnership with Google and plans for general availability in November 2025, when it will open APIs, release full documentation, and add voice and image capabilities. Interested potential customers and partners can sign up to receive more information when it becomes available on AUI's website form.Until then, the company is keeping details under wraps. When asked about what comes next, Elhelo smiled. “Let’s just say we’re preparing an announcement,” he said. “Soon.”Toward Conversations That ActFor all its technical sophistication, Apollo-1’s pitch is simple: make AI that businesses can trust to act — not just talk. “We’re on a mission to democratize access to AI that works,” Cohen said near the end of the interview.Whether Apollo-1 becomes the new standard for task-oriented dialogue remains to be seen. But if AUI’s architecture performs as promised, the long-standing divide between chatbots that sound human and agents that reliably do human work may finally start to close.

Discover Copy

Rating

Innovation

Pricing

Technology

Usability

We have discovered similar tools to what you are looking for. Check out our suggestions for similar AI tools.

venturebeat

The beginning of the end of the transformer era? Neuro-symbolic AI startup

The buzzed-about but still stealthy New York City startup <a href="https://www.aui.io/">Augmented Intelligence Inc (AUI)</a>, which seeks to go beyond the popular &q [...]

More Copy

Match Score: 289.60

venturebeat

Most enterprises can't stop stage-three AI agent threats, VentureBeat

A rogue AI agent at Meta <a href="https://venturebeat.com/security/meta-rogue-ai-agent-confused-deputy-iam-identity-governance-matrix">passed every identity check and still ex [...]

More Copy

Match Score: 172.37

Noble Audio FoKus Apollo review: The high price of pristine audio

I don’t review a lot of $650 <a data-i13n="cpos:1;pos:1" href="https://www.engadget.com/reviews/headphones/">headphones</a>. That’s because most audio comp [...]

More Copy

Match Score: 160.50

venturebeat

Claude Code 2.1.0 arrives with smoother workflows and smarter agents

Anthropic has <a href="https://github.com/anthropics/claude-code/blob/main/CHANGELOG.md">released Claude Code v2.1.0</a>, a notable update to its "vibe coding&a [...]

More Copy

Match Score: 151.11

venturebeat

Claude’s next enterprise battle is not models: it’s the agent control p

New VB Pulse data shows Microsoft and OpenAI leading enterprise agent orchestration, but Anthropic’s first measurable foothold points to a larger fight over who controls the infras [...]

More Copy

Match Score: 143.38

venturebeat

Perplexity takes its ‘Computer’ AI agent into the enterprise, taking ai

<a href="https://www.perplexity.ai/">Perplexity</a>, the AI-powered search company valued at $20 billion, announced on Wednesday at its inaugural <a href="https: [...]

More Copy

Match Score: 119.16

venturebeat

Microsoft takes Agent 365 out of preview as shadow AI becomes an enterprise

<a href="https://microsoft.com/">Microsoft</a> last week took <a href="https://www.microsoft.com/en-us/microsoft-agent-365">Agent 365</a>, its mana [...]

More Copy

Match Score: 110.34

venturebeat

Microsoft launches MXC, an OS-level sandbox for AI agents, with OpenAI and

For the past two years, the technology industry has raced to make AI agents more capable — teaching them to write code, navigate software interfaces, manage files, and orchestrate multi-ste [...]

More Copy

Match Score: 110.30

venturebeat

Testing autonomous agents (Or: how I learned to stop worrying and embrace c

Look, we've spent the last 18 months building production AI systems, and we'll tell you what keeps us up at night — and it's not whether the model can answer ques [...]

More Copy

Match Score: 108.58