AI isn’t just a feature anymore – it’s becoming a teammate! From drafting emails to designing slides, researching markets, or building financial models, a new layer of “agentic” tools is emerging that resemble an AI-native Office suite.
But here’s the challenge: as of now the landscape is fragmented, with new tools emerging every week. Anthropic just launched an “creating and editing file” feature for Claude this week! Consumers are left wondering: which tool should I actually use, and in what scenarios can I start embedding agentic tools into my daily work?
To find out how these tools perform in practice, we mapped the market and benchmarked AI-native tooling across a variety of everyday office tasks – making spreadsheets, taking meeting notes, and writing emails. Our benchmarks found impressive performance by a number of generalist tools, some standout vertical applications, and a few hints about how the marketplace is developing.
The market is splitting into two approaches to agentic productivity. On one side are the “all-in-one” horizontal tools, built to handle anything across apps and tasks. On the other are vertical specialists, designed to go deep into a single workflow like email, slides, or spreadsheets. Both are evolving quickly – and both have tradeoffs.
Generalists – Horizontal tools
Generalist tools are designed for flexibility. They can move across different contexts, apps, and tasks, but often at the cost of polish and precision. Within this camp, three formats stand out:
Specialists – Vertical Tools
Specialist tools are built for depth and reliability. Instead of trying to do everything, these tools focus on structured workflows where trust, polish, and user control are critical. Today’s vertical landscape is anchored by tools covering core professional workflows.
To see how these tools perform for real-world tasks, we tested them against benchmarks to measure where they succeed and where they fall short.
The prompts were designed to span six core dimensions: summarization, communication, file understanding, research, planning, and execution.
Use Case 1: PowerPoint
Prompt: Design a visual-heavy, 7-slide deck about Gen Z internet behavior trends in 2025.
Gamma acts as a verticalized AI-powered presentation tool with built-in templates and design features that allow a deck to be generated in under two minutes. As a full presentation editor, it offers a wide range of controls for editing after generation – users can adjust layouts, change visuals and fonts, add charts, and prompt the AI for text or design suggestions.
Genspark and Manus, operating as general assistants, tend to produce decks that are more content-heavy, often closer to research reports. Their outputs take longer to generate but tend to exhibit deeper analysis and stronger prompt alignment. ChatGPT Agent produced simpler decks resembling text-based reports, with weaker design capabilities and much longer generation time.
Anthropic just launched file creation and editing in Claude this week. On the presentation generation task, it’s the fastest general-purpose agent we’ve tested, though the design still needs refinement.
Overall, if you need a presentation for external use where visual quality and post-generation control are critical, Gamma is the best pick. If you’re looking for a content-heavy deck to inspire research or analysis, Genspark is the better choice.
Use Case 2: Spreadsheet
Prompt: Extract all the data from this PDF and calculate operating margin.
Spreadsheets are a sophisticated use case. Their complexity is particularly pronounced for outputs like complex financial models, where both formatting and exact accuracy are crucial. Still, AI spreadsheet tools are beginning to show signs of working for more basic and mid-level tasks, such as extracting data from PDFs and performing basic financial calculations.
In this test, we uploaded a page from an S-1 filing and asked the tools to calculate the company’s operating margin. Among the horizontal agents, Manus performed best: it extracted the data into a structured spreadsheet format and quickly returned accurate results. Claude was also the fastest in spreadsheet tasks and produced the correct answer, but its output was limited – providing minimal analysis and failing to pull the full set of data into the sheet.
Shortcut, as a vertical Excel-focused agent, offered a more comprehensive analysis in a native Excel environment, though it took longer to run and extracted only the data relevant to the calculation rather than the full dataset.
Use Case 3: Email
Prompt: email to schedule a dinner on next Thursday
Fyxer, Serif, and Jace function as vertical assistants for email. Each can generate competent drafts and maintain context across threads. Serif stands out for its level of customization: it supports playbooks, email labels, and preference settings – giving users a way to encode best practices and apply consistent workflows across similar scenarios.
Their approaches to scheduling diverge – but all were able to execute on a simple scheduling task:
By contrast, Comet brings general-assistant capabilities into email. It can draft replies, follow prompts to schedule meetings, send invites, and search the inbox. But it lacks built-in customization features like playbooks, labels, or preferences – so drafts feel less tailored compared to dedicated email assistants.
Use Case 4: Research
Prompt: Summarize and compare the latest quarterly cloud revenue growth for Microsoft, Amazon, and Google in a table with sources, then analyze the drivers behind the results in a short report.
Thanks to AI tools, consumers can now generate deep and research-informed analyses in seconds – work that previously might have taken hours of work and years of prior experience.
All of the products we tested were able to pull the correct cloud revenue growth numbers and organize them into tables. The differences came in nuance and speed – reflecting each product’s underlying optimizations and constraints.
Comet and Dia, the two AI-native browsers, were the fastest. They returned results in under 20 seconds, but their outputs were lighter on analysis and less structured compared to Manus, which delivered more comprehensive tables and deeper explanations of the drivers behind the numbers.
Source quality also varied. Comet and ChatGPT Agent stood out for drawing directly from authoritative sources such as earnings reports and Yahoo Finance, often including inline citations that made it easier to verify accuracy.
Overall, the tradeoff is clear: if you prioritize deeper analysis and are less sensitive to process time, Manus is the strongest choice. If you value speed and want a quick, decent answer, Comet is a better fit.
Use Case 5: Meeting Note-taking
The notepad is on during a meeting
Meeting note-taking is one of the most natural AI applications, saving consumers energy by letting them focus on the conversation instead of typing. Tools in this category generally operate in a notepad format, automatically transcribing and structuring the discussion, while ChatGPT’s record mode offers a more lightweight alternative. All of the products benchmarked support retrieval through keyword search, but their strengths diverge across note quality, customization, and collaboration.
Mem produces the most exhaustive records, capturing discussions and action items in detail, while ChatGPT’s record mode offers higher-level summaries that are easier to skim but less complete. Granola differentiates with customizable templates that adapt to different meeting types, giving users more control over structure and output.
Granola, Mem, and Notion all allow users to prepare notes in advance, add guidance during the meeting, and follow along with real-time transcription. Notion stands out for collaboration: tasks can be assigned directly within notes, synced to Notion Calendar, and aligned with broader team workflows.
Overall, if you want comprehensive capture, Mem is the best fit; for structure and customization, Granola excels; and for team coordination, Notion is the strongest choice.
From running tests across these use cases, several patterns emerged: