## Parallel Quality Benchmarks
Give your AI the highest-quality web search tools available
When building applications that rely on web data to make decisions or answer questions, nothing matters more than accuracy. These benchmarks help to measure different web search offerings on their ability to answer prompts accurately. By obsessing over accuracy, we consistently lead the market with state-of-the-art quality. In addition to leading in accuracy, Parallel often leads in pricing.
### Search API
#### HLE
| Series   | Model        | Cost (CPM) | Accuracy (%) |
| -------- | ------------ | ---------- | ------------ |
| Parallel | parallel     | 82         | 47           |
| Others   | exa          | 138        | 24           |
| Others   | tavily       | 190        | 21           |
| Others   | perplexity   | 126        | 30           |
| Others   | openai gpt-5 | 143        | 45           |
### About this benchmark
This benchmark[benchmark]($https://lastexam.ai/) consists of 2,500 questions developed by subject-matter experts across dozens of subjects (e.g. math, humanities, natural sciences). Each question has a known solution that is unambiguous and easily verifiable, but requires sophisticated web retrieval and reasoning. Results are reported on a sample of 100 questions from this benchmark.
### Methodology
- **Evaluation**: Results are based on tests run using each provider's official Search MCP server, supplied as an MCP tool to OpenAI's GPT-5 model via the Responses API. In all cases, the MCP tools were limited to only the appropriate web search tool. Answers were evaluated using an LLM as a judge (GPT-4.1). A minimal harness sketch follows this list.
- **Cost Calculation**: Cost reflects the average cost per query across all questions run, including both the search API call and LLM token cost.
- **Testing Dates**: Testing was conducted from November 3rd to November 5th.
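For concreteness, the harness can be approximated with the OpenAI Python SDK as below. This is a minimal sketch under stated assumptions, not the exact benchmark code: the MCP `server_url` and the `web_search` tool name are placeholders for whichever provider's official Search MCP server is under test, and the judging prompt is illustrative.

```python
from openai import OpenAI

client = OpenAI()

def answer_with_search(question: str, server_url: str) -> str:
    """Ask GPT-5 to answer a question, restricted to a single MCP web search tool."""
    resp = client.responses.create(
        model="gpt-5",
        tools=[{
            "type": "mcp",
            "server_label": "search_provider",
            "server_url": server_url,          # placeholder: provider's official Search MCP server
            "allowed_tools": ["web_search"],   # placeholder: the provider's web search tool name
            "require_approval": "never",
        }],
        input=question,
    )
    return resp.output_text

def judge(question: str, gold: str, candidate: str) -> bool:
    """LLM-as-judge grading with GPT-4.1; prompt wording is illustrative."""
    verdict = client.responses.create(
        model="gpt-4.1",
        input=(
            f"Question: {question}\nReference answer: {gold}\n"
            f"Candidate answer: {candidate}\n"
            "Does the candidate answer match the reference? Reply YES or NO."
        ),
    )
    return verdict.output_text.strip().upper().startswith("YES")
```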
#### BrowseComp
| Series   | Model        | Cost (CPM) | Accuracy (%) |
| -------- | ------------ | ---------- | ------------ |
| Parallel | parallel     | 156        | 58           |
| Others   | exa          | 233        | 29           |
| Others   | tavily       | 314        | 23           |
| Others   | perplexity   | 256        | 22           |
| Others   | openai gpt-5 | 253        | 53           |
### About this benchmark
This benchmark[benchmark]($https://openai.com/index/browsecomp/), created by OpenAI, contains 1,266 questions requiring multi-hop reasoning, creative search formulation, and synthesis of contextual clues across time periods. Results are reported on a sample of 100 questions from this benchmark.
### Methodology
- **Evaluation**: Results are based on tests run using each provider's official Search MCP server, supplied as an MCP tool to OpenAI's GPT-5 model via the Responses API. In all cases, the MCP tools were limited to only the appropriate web search tool. Answers were evaluated using an LLM as a judge (GPT-4.1).
- **Cost Calculation**: Cost reflects the average cost per query across all questions run, including both the search API call and LLM token cost.
- **Testing Dates**: Testing was conducted from November 3rd to November 5th.
#### WebWalker
| Series   | Model        | Cost (CPM) | Accuracy (%) |
| -------- | ------------ | ---------- | ------------ |
| Parallel | parallel     | 42         | 81           |
| Others   | exa          | 107        | 48           |
| Others   | tavily       | 156        | 79           |
| Others   | perplexity   | 91         | 67           |
| Others   | openai gpt-5 | 88         | 73           |
### About this benchmark
This benchmark[benchmark]($https://arxiv.org/abs/2501.07572) is designed to assess the ability of LLMs to perform web traversal. Answering its questions successfully requires crawling and extracting content from website subpages. Results are reported on a sample of 100 questions from this benchmark.
### Methodology
- **Evaluation**: Results are based on tests run using each provider's official Search MCP server, supplied as an MCP tool to OpenAI's GPT-5 model via the Responses API. In all cases, the MCP tools were limited to only the appropriate web search tool. Answers were evaluated using an LLM as a judge (GPT-4.1).
- **Cost Calculation**: Cost reflects the average cost per query across all questions run, including both the search API call and LLM token cost.
- **Testing Dates**: Testing was conducted from November 3rd to November 5th.
#### FRAMES
| Series   | Model        | Cost (CPM) | Accuracy (%) |
| -------- | ------------ | ---------- | ------------ |
| Parallel | parallel     | 42         | 92           |
| Others   | exa          | 81         | 81           |
| Others   | tavily       | 122        | 87           |
| Others   | perplexity   | 95         | 83           |
| Others   | openai gpt-5 | 68         | 90           |
### About this benchmark
This benchmark[benchmark]($https://huggingface.co/datasets/google/frames-benchmark) contains 824 challenging multi-hop questions designed to test factuality, retrieval accuracy, and reasoning. Results are reported on a sample of 100 questions from this benchmark.
### Methodology
- **Evaluation**: Results are based on tests run using each provider's official Search MCP server, supplied as an MCP tool to OpenAI's GPT-5 model via the Responses API. In all cases, the MCP tools were limited to only the appropriate web search tool. Answers were evaluated using an LLM as a judge (GPT-4.1).
- **Cost Calculation**: Cost reflects the average cost per query across all questions run, including both the search API call and LLM token cost.
- **Testing Dates**: Testing was conducted from November 3rd to November 5th.
#### SimpleQA
| Series   | Model        | Cost (CPM) | Accuracy (%) |
| -------- | ------------ | ---------- | ------------ |
| Parallel | parallel     | 17         | 98           |
| Others   | exa          | 57         | 87           |
| Others   | tavily       | 110        | 93           |
| Others   | perplexity   | 52         | 92           |
| Others   | openai gpt-5 | 37         | 98           |
### About this benchmark
This benchmark[benchmark]($https://openai.com/index/introducing-simpleqa/), created by OpenAI, contains 4,326 questions focused on short, fact-seeking queries across a variety of domains. Results are reported on a sample of 100 questions from this benchmark.
### Methodology
- **Evaluation**: Results are based on tests run using each provider's official Search MCP server, supplied as an MCP tool to OpenAI's GPT-5 model via the Responses API. In all cases, the MCP tools were limited to only the appropriate web search tool. Answers were evaluated using an LLM as a judge (GPT-4.1).
- **Cost Calculation**: Cost reflects the average cost per query across all questions run, including both the search API call and LLM token cost.
- **Testing Dates**: Testing was conducted from November 3rd to November 5th.
#### Batched SimpleQA
| Series   | Model        | Cost (CPM) | Accuracy (%) |
| -------- | ------------ | ---------- | ------------ |
| Parallel | parallel     | 50         | 90           |
| Others   | exa          | 119        | 71           |
| Others   | tavily       | 227        | 59           |
| Others   | perplexity   | 100        | 74           |
| Others   | openai gpt-5 | 91         | 88           |
### About this benchmark
This benchmark was created by batching 3 independent questions from the original SimpleQA dataset[SimpleQA dataset]($https://openai.com/index/introducing-simpleqa/) into 100 composite, more complex questions.
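As an illustration of how such a composite set can be constructed, here is a sketch under assumed field names; it is not necessarily the exact script used to build the benchmark.

```python
import random
from typing import Dict, List

def batch_simpleqa(items: List[Dict[str, str]], batch_size: int = 3,
                   n_batches: int = 100, seed: int = 0) -> List[Dict]:
    """Group independent SimpleQA items into composite multi-part questions.

    `items` is assumed to be a list of {"problem": ..., "answer": ...} records
    loaded from the SimpleQA dataset; the field names are illustrative.
    """
    rng = random.Random(seed)
    sample = rng.sample(items, batch_size * n_batches)
    batched = []
    for i in range(0, len(sample), batch_size):
        group = sample[i:i + batch_size]
        prompt = "Answer each of the following questions:\n" + "\n".join(
            f"{j + 1}. {q['problem']}" for j, q in enumerate(group)
        )
        batched.append({"prompt": prompt, "answers": [q["answer"] for q in group]})
    return batched
```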
### Methodology
- **Evaluation**: Results are based on tests run using each provider's official Search MCP server, supplied as an MCP tool to OpenAI's GPT-5 model via the Responses API. In all cases, the MCP tools were limited to only the appropriate web search tool. Answers were evaluated using an LLM as a judge (GPT-4.1).
- **Cost Calculation**: Cost reflects the average cost per query across all questions run, including both the search API call and LLM token cost.
- **Testing Dates**: Testing was conducted from November 3rd to November 5th.
### Task API
#### DeepSearchQA
| Series   | Model                    | Cost (CPM) | Accuracy (%) |
| -------- | ------------------------ | ---------- | ------------ |
| Parallel | Pro                      | 100        | 62           |
| Parallel | Ultra                    | 300        | 68.5         |
| Parallel | Ultra2x                  | 600        | 72.6         |
| Others   | Gemini Deep Research     | 2500       | 64.3         |
| Others   | OpenAI GPT 5.2 Pro       | 1830       | 61           |
| Others   | Exa                      | 740        | 30           |
| Others   | Perplexity Deep Research | 1540       | 25           |
CPM: USD per 1000 requests.
### Benchmark
This benchmark, created by researchers at Google, consists of 900 prompts for evaluating agents on difficult multi-step information-seeking tasks across 17 different fields.
### Methodology
Accuracy refers to answers that are “Fully Correct”: a response is fully correct if and only if the submitted set is semantically identical to the ground-truth set, i.e., the agent identifies every correct answer while including zero incorrect answers.
Exa, Perplexity, GPT 5.2 Pro, and the Gemini Deep Research API were evaluated at their highest thinking and search-context settings.
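The “fully correct” criterion amounts to exact set agreement between submission and ground truth. A minimal sketch follows; string normalization here stands in for the semantic matching performed by the grader.

```python
def _norm(item: str) -> str:
    # Stand-in for semantic matching; the benchmark judges equivalence, not strings.
    return " ".join(item.lower().split())

def fully_correct(submitted: list[str], ground_truth: list[str]) -> bool:
    """A response is fully correct iff the submitted set equals the ground-truth set:
    every correct answer is present and no incorrect answer is included."""
    return {_norm(x) for x in submitted} == {_norm(y) for y in ground_truth}
```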
### Testing dates
December 15-18, 2025
#### SealQA-SEAL0
| Series   | Model            | Cost (CPM) | Accuracy (%) |
| -------- | ---------------- | ---------- | ------------ |
| Parallel | Core             | 25         | 42.3         |
| Parallel | Core2x           | 50         | 49.5         |
| Parallel | Pro              | 100        | 52.3         |
| Parallel | Ultra            | 300        | 55.9         |
| Parallel | Ultra8x          | 2400       | 56.8         |
| Others   | Perplexity DR    | 1258.2     | 38.7         |
| Others   | Exa Research Pro | 2043.2     | 45           |
| Others   | GPT-5            | 189        | 48.6         |
CPM: USD per 1000 requests.
### About the benchmark
SealQA[SealQA]($https://arxiv.org/abs/2506.01062) is a challenge benchmark for evaluating search-augmented language models on fact-seeking questions where web search typically yields conflicting, noisy, or unhelpful results.
SEAL-0 is a core set of problems where even frontier models with browsing consistently fail. It's named "zero" due to its high failure rate.
### Methodology
**Benchmark Details:** We tested on the full SEAL-0 (111 questions) dataset. Questions require reconciling conflicting web sources.
**LLM Evaluator:** We evaluated responses using an LLM-as-a-judge, measuring factual accuracy against verified ground truth.
**Benchmark Dates:** Testing took place between October 20 and 28, 2025.
**Cost Standardization:** Parallel uses deterministic per-query pricing. For token-based APIs, we normalized to cost per thousand queries (CPM) as measured on the benchmark.
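The CPM normalization for token-based APIs is simple arithmetic; a short sketch with illustrative figures:

```python
def cpm(total_cost_usd: float, n_queries: int) -> float:
    """Cost per thousand queries (CPM), in USD."""
    return 1000.0 * total_cost_usd / n_queries

# Illustrative only: a token-based API spending $21.00 in total across the
# 111 SEAL-0 questions would be reported at roughly $189 CPM.
print(round(cpm(21.00, 111)))  # 189
```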
#### SealQA SEAL HARD
| Series   | Model            | Cost (CPM) | Accuracy (%) |
| -------- | ---------------- | ---------- | ------------ |
| Parallel | Core             | 25         | 60.6         |
| Parallel | Core2x           | 50         | 65.7         |
| Parallel | Pro              | 100        | 66.9         |
| Parallel | Ultra            | 300        | 68.5         |
| Parallel | Ultra8x          | 2400       | 70.1         |
| Others   | Perplexity DR    | 1221.5     | 50.1         |
| Others   | Exa Research Pro | 2192.4     | 59.1         |
| Others   | GPT-5            | 161.7      | 64.6         |
CPM: USD per 1000 requests.
### About the benchmark
SEAL-HARD[SEAL-HARD]($https://arxiv.org/abs/2506.01062) contains a broader set of queries that includes SEAL-0 and additional highly challenging questions.
### Methodology
**Benchmark Details:** We tested on the full SEAL-0 (111 questions) and SEAL-HARD (254 questions) datasets. Questions require reconciling conflicting web sources.
**LLM Evaluator:** We evaluated responses using an LLM-as-a-judge, measuring factual accuracy against verified ground truth.
**Benchmark Dates:** Testing took place between October 20 and 28, 2025.
**Cost Standardization:** Parallel uses deterministic per-query pricing. For token-based APIs, we normalized to cost per thousand queries (CPM) as measured on the benchmark.
#### BrowseComp
| Series   | Model      | Cost (CPM) | Accuracy (%) |
| -------- | ---------- | ---------- | ------------ |
| Parallel | Ultra      | 300        | 45           |
| Parallel | Ultra2x    | 600        | 51           |
| Parallel | Ultra4x    | 1200       | 56           |
| Parallel | Ultra8x    | 2400       | 58           |
| Others   | GPT-5      | 488        | 38           |
| Others   | Anthropic  | 5194       | 7            |
| Others   | Exa        | 402        | 14           |
| Others   | Perplexity | 709        | 6            |
CPM: USD per 1000 requests.
### About the benchmark
This benchmark[benchmark]($https://openai.com/index/browsecomp/), created by OpenAI, contains 1,266 questions requiring multi-hop reasoning, creative search formulation, and synthesis of contextual clues across time periods. Results are reported on a random sample of 100 questions from this benchmark. Read the blog[blog]($https://parallel.ai/blog/deep-research-benchmarks).
### Methodology
- Dates: All measurements were made between 08/11/2025 and 08/29/2025.
- Configurations: For all competitors, we report the highest numbers we were able to achieve across multiple configurations of their APIs. The exact configurations are below.
  - GPT-5: high reasoning, high search context, default verbosity
  - Exa: Exa Research Pro
  - Anthropic: Claude Opus 4.1
  - Perplexity: Sonar Deep Research, reasoning effort high
#### DeepResearchBench
| Series   | Model      | Cost (CPM) | Win Rate vs Reference (%) |
| -------- | ---------- | ---------- | ------------------------- |
| Parallel | Ultra      | 300        | 82                        |
| Parallel | Ultra2x    | 600        | 86                        |
| Parallel | Ultra4x    | 1200       | 92                        |
| Parallel | Ultra8x    | 2400       | 96                        |
| Others   | GPT-5      | 628        | 66                        |
| Others   | O3 Pro     | 4331       | 30                        |
| Others   | O3         | 605        | 26                        |
| Others   | Perplexity | 538        | 6                         |
CPM: USD per 1000 requests.
### About the benchmark
This benchmark[benchmark]($https://github.com/Ayanami0730/deep_research_bench) contains 100 expert-level research tasks designed by domain specialists across 22 fields, primarily Science & Technology, Business & Finance, and Software Development. It evaluates AI systems' ability to produce rigorous, long-form research reports on complex topics requiring cross-disciplinary synthesis. Results are reported from the subset of 50 English-language tasks in the benchmark. Read the blog[blog]($https://parallel.ai/blog/deep-research-benchmarks).
### Methodology
- Dates: All measurements were made between 08/11/2025 and 08/29/2025.
- Win Rate: Calculated by comparing RACE[RACE]($https://github.com/Ayanami0730/deep_research_bench) scores in direct head-to-head evaluations against reference reports. A sketch of this aggregation follows this list.
- Configurations: For all competitors, we report the highest numbers we were able to achieve across multiple configurations of their APIs. The exact GPT-5 configuration is high reasoning, high search context, and high verbosity.
- Excluded API Results: Exa Research Pro (0% win rate), Claude Opus 4.1 (0% win rate).
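Once per-task RACE scores are available (the scoring itself comes from the deep_research_bench repository), the win-rate aggregation is a straightforward per-task comparison. The tie-handling choice below is an assumption, not the benchmark's specification.

```python
from typing import Sequence

def win_rate(candidate_scores: Sequence[float], reference_scores: Sequence[float]) -> float:
    """Percentage of tasks on which the candidate's RACE score beats the reference report's."""
    assert len(candidate_scores) == len(reference_scores) > 0
    wins = sum(c > r for c, r in zip(candidate_scores, reference_scores))  # ties count as losses (assumption)
    return 100.0 * wins / len(candidate_scores)
```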
#### WISER-Atomic
| Series   | Model          | Cost (CPM) | Accuracy (%) |
| -------- | -------------- | ---------- | ------------ |
| Parallel | Core           | 25         | 77           |
| Parallel | Base           | 10         | 75           |
| Parallel | Lite           | 5          | 64           |
| Others   | o3             | 45         | 69           |
| Others   | 4.1 mini low   | 25         | 63           |
| Others   | gemini 2.5 pro | 36         | 56           |
| Others   | sonar pro high | 16         | 64           |
| Others   | sonar low      | 5          | 48           |
CPM: USD per 1000 requests.
### About the benchmark
This benchmark, created by Parallel, contains 121 questions intended to reflect real-world web research queries across a variety of domains. Read our blog here[here]($https://parallel.ai/blog/parallel-task-api).
### Steps of reasoning
- 50% Multi-Hop questions
- 50% Single-Hop questions
### Distribution
- 40% Financial Research
- 20% Sales Research
- 20% Recruitment
- 20% Miscellaneous
### FindAll API
#### WISER
| Series   | Model                   | Cost (CPM) | Recall (%) |
| -------- | ----------------------- | ---------- | ---------- |
| Parallel | FindAll Base            | 60         | 30.3       |
| Parallel | FindAll Core            | 230        | 52.5       |
| Parallel | FindAll Pro             | 1430       | 61.3       |
| Others   | OpenAI Deep Research    | 250        | 21         |
| Others   | Anthropic Deep Research | 1000       | 15.3       |
| Others   | Exa                     | 110        | 19.2       |
CPM: USD per 1000 requests.
### Benchmark
This benchmark, created by Parallel, contains 40 complex multi-criteria queries covering public companies, startups, SMBs, specialized entities, and people (e.g., executives, researchers, professionals).
### Methodology
Recall is measured as the number of correct matches divided by the total number of entities in the ground-truth dataset, where the ground-truth dataset is the union of all correct matches across the competitor set. Cost is calculated as the average cost to find 1000 correct matches.
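A sketch of these calculations, following the definitions above; the input structures and field names are hypothetical.

```python
from typing import Dict, Set

def findall_metrics(correct_matches: Dict[str, Set[str]],
                    total_cost_usd: Dict[str, float]) -> Dict[str, Dict[str, float]]:
    """Recall and cost-per-1000-correct-matches for each provider.

    correct_matches[p]: set of verified-correct entities provider p returned
    across all 40 queries; total_cost_usd[p]: provider p's total spend.
    """
    # Ground truth is the union of correct matches across the competitor set.
    ground_truth = set().union(*correct_matches.values())
    metrics = {}
    for provider, found in correct_matches.items():
        metrics[provider] = {
            "recall_pct": 100.0 * len(found) / len(ground_truth),
            "cost_per_1000_matches_usd": 1000.0 * total_cost_usd[provider] / len(found),
        }
    return metrics
```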
### Testing dates
Nov 13th-17th, 2025