## Parallel Quality Benchmarks
Give your AI the highest-quality web search tools available
When building applications that rely on web data to make decisions or answer questions, nothing matters more than accuracy. These benchmarks help to measure different web search offerings on their ability to answer prompts accurately. By obsessing over accuracy, we consistently lead the market with state-of-the-art quality. In addition to leading in accuracy, Parallel often leads in pricing.
### Search API
#### HLE
| Series   | Model        | Cost (CPM) | Accuracy (%) |
| -------- | ------------ | ---------- | ------------ |
| Parallel | parallel     | 82         | 47           |
| Others   | exa          | 138        | 24           |
| Others   | tavily       | 190        | 21           |
| Others   | perplexity   | 126        | 30           |
| Others   | openai gpt-5 | 143        | 45           |
### About this benchmark
This benchmark[benchmark]($https://lastexam.ai/) consists of 2,500 questions developed by subject-matter experts across dozens of subjects (e.g. math, humanities, natural sciences). Each question has a known solution that is unambiguous and easily verifiable, but requires sophisticated web retrieval and reasoning. Results are reported on a sample of 100 questions from this benchmark.
### Methodology
- **Evaluation**: Results are based on tests run using each provider's official Search MCP server, supplied as an MCP tool to OpenAI's GPT-5 model via the Responses API. In all cases, the MCP tools were limited to only the appropriate web search tool. Answers were evaluated using an LLM as a judge (GPT-4.1). A minimal harness sketch follows this list.
- **Cost Calculation**: Cost reflects the average cost per query across all questions run, including both the search API call and LLM token cost.
- **Testing Dates**: Testing was conducted from November 3rd to November 5th.
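For concreteness, the harness can be approximated with the OpenAI Python SDK as below. This is a minimal sketch under stated assumptions, not the exact benchmark code: the MCP `server_url` and the `web_search` tool name are placeholders for whichever provider's official Search MCP server is under test, and the judging prompt is illustrative.

```python
from openai import OpenAI

client = OpenAI()

def answer_with_search(question: str, server_url: str) -> str:
    """Ask GPT-5 to answer a question, restricted to a single MCP web search tool."""
    resp = client.responses.create(
        model="gpt-5",
        tools=[{
            "type": "mcp",
            "server_label": "search_provider",
            "server_url": server_url,          # placeholder: provider's official Search MCP server
            "allowed_tools": ["web_search"],   # placeholder: the provider's web search tool name
            "require_approval": "never",
        }],
        input=question,
    )
    return resp.output_text

def judge(question: str, gold: str, candidate: str) -> bool:
    """LLM-as-judge grading with GPT-4.1; prompt wording is illustrative."""
    verdict = client.responses.create(
        model="gpt-4.1",
        input=(
            f"Question: {question}\nReference answer: {gold}\n"
            f"Candidate answer: {candidate}\n"
            "Does the candidate answer match the reference? Reply YES or NO."
        ),
    )
    return verdict.output_text.strip().upper().startswith("YES")
```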
#### BrowseComp
| Series   | Model        | Cost (CPM) | Accuracy (%) |
| -------- | ------------ | ---------- | ------------ |
| Parallel | parallel     | 156        | 58           |
| Others   | exa          | 233        | 29           |
| Others   | tavily       | 314        | 23           |
| Others   | perplexity   | 256        | 22           |
| Others   | openai gpt-5 | 253        | 53           |
### About this benchmark
This benchmark[benchmark]($https://openai.com/index/browsecomp/), created by OpenAI, contains 1,266 questions requiring multi-hop reasoning, creative search formulation, and synthesis of contextual clues across time periods. Results are reported on a sample of 100 questions from this benchmark.
### Methodology
- **Evaluation**: Results are based on tests run using each provider's official Search MCP server, supplied as an MCP tool to OpenAI's GPT-5 model via the Responses API. In all cases, the MCP tools were limited to only the appropriate web search tool. Answers were evaluated using an LLM as a judge (GPT-4.1).
- **Cost Calculation**: Cost reflects the average cost per query across all questions run, including both the search API call and LLM token cost.
- **Testing Dates**: Testing was conducted from November 3rd to November 5th.
#### WebWalker
| Series   | Model        | Cost (CPM) | Accuracy (%) |
| -------- | ------------ | ---------- | ------------ |
| Parallel | parallel     | 42         | 81           |
| Others   | exa          | 107        | 48           |
| Others   | tavily       | 156        | 79           |
| Others   | perplexity   | 91         | 67           |
| Others   | openai gpt-5 | 88         | 73           |
### About this benchmark
This benchmark[benchmark]($https://arxiv.org/abs/2501.07572) is designed to assess the ability of LLMs to perform web traversal. Answering its questions successfully requires crawling and extracting content from website subpages. Results are reported on a sample of 100 questions from this benchmark.
### Methodology
- **Evaluation**: Results are based on tests run using each provider's official Search MCP server, supplied as an MCP tool to OpenAI's GPT-5 model via the Responses API. In all cases, the MCP tools were limited to only the appropriate web search tool. Answers were evaluated using an LLM as a judge (GPT-4.1).
- **Cost Calculation**: Cost reflects the average cost per query across all questions run, including both the search API call and LLM token cost.
- **Testing Dates**: Testing was conducted from November 3rd to November 5th.
#### FRAMES
| Series   | Model        | Cost (CPM) | Accuracy (%) |
| -------- | ------------ | ---------- | ------------ |
| Parallel | parallel     | 42         | 92           |
| Others   | exa          | 81         | 81           |
| Others   | tavily       | 122        | 87           |
| Others   | perplexity   | 95         | 83           |
| Others   | openai gpt-5 | 68         | 90           |
### About this benchmark
This benchmark[benchmark]($https://huggingface.co/datasets/google/frames-benchmark) contains 824 challenging multi-hop questions designed to test factuality, retrieval accuracy, and reasoning. Results are reported on a sample of 100 questions from this benchmark.
### Methodology
- **Evaluation**: Results are based on tests run using each provider's official Search MCP server, supplied as an MCP tool to OpenAI's GPT-5 model via the Responses API. In all cases, the MCP tools were limited to only the appropriate web search tool. Answers were evaluated using an LLM as a judge (GPT-4.1).
- **Cost Calculation**: Cost reflects the average cost per query across all questions run, including both the search API call and LLM token cost.
- **Testing Dates**: Testing was conducted from November 3rd to November 5th.
#### SimpleQA
| Series   | Model        | Cost (CPM) | Accuracy (%) |
| -------- | ------------ | ---------- | ------------ |
| Parallel | parallel     | 17         | 98           |
| Others   | exa          | 57         | 87           |
| Others   | tavily       | 110        | 93           |
| Others   | perplexity   | 52         | 92           |
| Others   | openai gpt-5 | 37         | 98           |
### About this benchmark
This benchmark[benchmark]($https://openai.com/index/introducing-simpleqa/), created by OpenAI, contains 4,326 questions focused on short, fact-seeking queries across a variety of domains. Results are reported on a sample of 100 questions from this benchmark.
### Methodology
- **Evaluation**: Results are based on tests run using each provider's official Search MCP server, supplied as an MCP tool to OpenAI's GPT-5 model via the Responses API. In all cases, the MCP tools were limited to only the appropriate web search tool. Answers were evaluated using an LLM as a judge (GPT-4.1).
- **Cost Calculation**: Cost reflects the average cost per query across all questions run, including both the search API call and LLM token cost.
- **Testing Dates**: Testing was conducted from November 3rd to November 5th.
#### Batched SimpleQA
| Series   | Model        | Cost (CPM) | Accuracy (%) |
| -------- | ------------ | ---------- | ------------ |
| Parallel | parallel     | 50         | 90           |
| Others   | exa          | 119        | 71           |
| Others   | tavily       | 227        | 59           |
| Others   | perplexity   | 100        | 74           |
| Others   | openai gpt-5 | 91         | 88           |
### About this benchmark
This benchmark was created by batching 3 independent questions from the original SimpleQA dataset[SimpleQA dataset]($https://openai.com/index/introducing-simpleqa/) into 100 composite, more complex questions.
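As an illustration of how such a composite set can be constructed, here is a sketch under assumed field names; it is not necessarily the exact script used to build the benchmark.

```python
import random
from typing import Dict, List

def batch_simpleqa(items: List[Dict[str, str]], batch_size: int = 3,
                   n_batches: int = 100, seed: int = 0) -> List[Dict]:
    """Group independent SimpleQA items into composite multi-part questions.

    `items` is assumed to be a list of {"problem": ..., "answer": ...} records
    loaded from the SimpleQA dataset; the field names are illustrative.
    """
    rng = random.Random(seed)
    sample = rng.sample(items, batch_size * n_batches)
    batched = []
    for i in range(0, len(sample), batch_size):
        group = sample[i:i + batch_size]
        prompt = "Answer each of the following questions:\n" + "\n".join(
            f"{j + 1}. {q['problem']}" for j, q in enumerate(group)
        )
        batched.append({"prompt": prompt, "answers": [q["answer"] for q in group]})
    return batched
```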
### Methodology
- **Evaluation**: Results are based on tests run using each provider's official Search MCP server, supplied as an MCP tool to OpenAI's GPT-5 model via the Responses API. In all cases, the MCP tools were limited to only the appropriate web search tool. Answers were evaluated using an LLM as a judge (GPT-4.1).
- **Cost Calculation**: Cost reflects the average cost per query across all questions run, including both the search API call and LLM token cost.
- **Testing Dates**: Testing was conducted from November 3rd to November 5th.
### Task API
#### DeepSearchQA
| Series   | Model                    | Cost (CPM) | Accuracy (%) |
| -------- | ------------------------ | ---------- | ------------ |
| Parallel | Pro                      | 100        | 62           |
| Parallel | Ultra                    | 300        | 68.5         |
| Parallel | Ultra2x                  | 600        | 72.6         |
| Others   | Gemini Deep Research     | 2500       | 64.3         |
| Others   | OpenAI GPT 5.2 Pro       | 1830       | 61           |
| Others   | Exa                      | 740        | 30           |
| Others   | Perplexity Deep Research | 1540       | 25           |
CPM: USD per 1000 requests.
### Benchmark
This benchmark, created by researchers at Google, consists of 900 prompts for evaluating agents on difficult multi-step information-seeking tasks across 17 different fields.
### Methodology
Accuracy refers to answers that are “Fully Correct”: a response is fully correct if and only if the submitted set is semantically identical to the ground-truth set, i.e., the agent identifies every correct answer while including zero incorrect answers.
Exa, Perplexity, GPT 5.2 Pro, and the Gemini Deep Research API were evaluated at their highest thinking and search-context settings.
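The “fully correct” criterion amounts to exact set agreement between submission and ground truth. A minimal sketch follows; string normalization here stands in for the semantic matching performed by the grader.

```python
def _norm(item: str) -> str:
    # Stand-in for semantic matching; the benchmark judges equivalence, not strings.
    return " ".join(item.lower().split())

def fully_correct(submitted: list[str], ground_truth: list[str]) -> bool:
    """A response is fully correct iff the submitted set equals the ground-truth set:
    every correct answer is present and no incorrect answer is included."""
    return {_norm(x) for x in submitted} == {_norm(y) for y in ground_truth}
```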
### Testing dates
December 15-18, 2025
#### SealQA-SEAL0
| Series   | Model            | Cost (CPM) | Accuracy (%) |
| -------- | ---------------- | ---------- | ------------ |
| Parallel | Core             | 25         | 42.3         |
| Parallel | Core2x           | 50         | 49.5         |
| Parallel | Pro              | 100        | 52.3         |
| Parallel | Ultra            | 300        | 55.9         |
| Parallel | Ultra8x          | 2400       | 56.8         |
| Others   | Perplexity DR    | 1258.2     | 38.7         |
| Others   | Exa Research Pro | 2043.2     | 45           |
| Others   | GPT-5            | 189        | 48.6         |
CPM: USD per 1000 requests.
### About the benchmark
SealQA[SealQA]($https://arxiv.org/abs/2506.01062) is a challenge benchmark for evaluating search-augmented language models on fact-seeking questions where web search typically yields conflicting, noisy, or unhelpful results.
SEAL-0 is a core set of problems where even frontier models with browsing consistently fail. It's named "zero" due to its high failure rate.
### Methodology
**Benchmark Details:** We tested on the full SEAL-0 (111 questions) dataset. Questions require reconciling conflicting web sources.
**LLM Evaluator:** We evaluated responses using an LLM-as-a-judge, measuring factual accuracy against verified ground truth.
**Benchmark Dates:** Testing took place between October 20 and 28, 2025.
**Cost Standardization:** Parallel uses deterministic per-query pricing. For token-based APIs, we normalized to cost per thousand queries (CPM) as measured on the benchmark.
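The CPM normalization for token-based APIs is simple arithmetic; a short sketch with illustrative figures:

```python
def cpm(total_cost_usd: float, n_queries: int) -> float:
    """Cost per thousand queries (CPM), in USD."""
    return 1000.0 * total_cost_usd / n_queries

# Illustrative only: a token-based API spending $21.00 in total across the
# 111 SEAL-0 questions would be reported at roughly $189 CPM.
print(round(cpm(21.00, 111)))  # 189
```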
#### SealQA SEAL HARD
| Series   | Model            | Cost (CPM) | Accuracy (%) |
| -------- | ---------------- | ---------- | ------------ |
| Parallel | Core             | 25         | 60.6         |
| Parallel | Core2x           | 50         | 65.7         |
| Parallel | Pro              | 100        | 66.9         |
| Parallel | Ultra            | 300        | 68.5         |
| Parallel | Ultra8x          | 2400       | 70.1         |
| Others   | Perplexity DR    | 1221.5     | 50.1         |
| Others   | Exa Research Pro | 2192.4     | 59.1         |
| Others   | GPT-5            | 161.7      | 64.6         |
CPM: USD per 1000 requests.
### About the benchmark
SEAL-HARD[SEAL-HARD]($https://arxiv.org/abs/2506.01062) contains a broader set of queries that includes SEAL-0 and additional highly challenging questions.
### Methodology
**Benchmark Details:** We tested on the full SEAL-0 (111 questions) and SEAL-HARD (254 questions) datasets. Questions require reconciling conflicting web sources.
**LLM Evaluator:** We evaluated responses using an LLM-as-a-judge, measuring factual accuracy against verified ground truth.
**Benchmark Dates:** Testing took place between October 20 and 28, 2025.
**Cost Standardization:** Parallel uses deterministic per-query pricing. For token-based APIs, we normalized to cost per thousand queries (CPM) as measured on the benchmark.
#### BrowseComp
| Series   | Model      | Cost (CPM) | Accuracy (%) |
| -------- | ---------- | ---------- | ------------ |
| Parallel | Ultra      | 300        | 45           |
| Parallel | Ultra2x    | 600        | 51           |
| Parallel | Ultra4x    | 1200       | 56           |
| Parallel | Ultra8x    | 2400       | 58           |
| Others   | GPT-5      | 488        | 38           |
| Others   | Anthropic  | 5194       | 7            |
| Others   | Exa        | 402        | 14           |
| Others   | Perplexity | 709        | 6            |
CPM: USD per 1000 requests.
### About the benchmark
This benchmark[benchmark]($https://openai.com/index/browsecomp/), created by OpenAI, contains 1,266 questions requiring multi-hop reasoning, creative search formulation, and synthesis of contextual clues across time periods. Results are reported on a random sample of 100 questions from this benchmark. Read the blog[blog]($https://parallel.ai/blog/deep-research-benchmarks).
### Methodology
- Dates: All measurements were made between 08/11/2025 and 08/29/2025.
- Configurations: For all competitors, we report the highest numbers we were able to achieve across multiple configurations of their APIs. The exact configurations are below.
  - GPT-5: high reasoning, high search context, default verbosity
  - Exa: Exa Research Pro
  - Anthropic: Claude Opus 4.1
  - Perplexity: Sonar Deep Research, reasoning effort high
#### DeepResearchBench
| Series   | Model      | Cost (CPM) | Win Rate vs Reference (%) |
| -------- | ---------- | ---------- | ------------------------- |
| Parallel | Ultra      | 300        | 82                        |
| Parallel | Ultra2x    | 600        | 86                        |
| Parallel | Ultra4x    | 1200       | 92                        |
| Parallel | Ultra8x    | 2400       | 96                        |
| Others   | GPT-5      | 628        | 66                        |
| Others   | O3 Pro     | 4331       | 30                        |
| Others   | O3         | 605        | 26                        |
| Others   | Perplexity | 538        | 6                         |
CPM: USD per 1000 requests.
### About the benchmark
This benchmark[benchmark]($https://github.com/Ayanami0730/deep_research_bench) contains 100 expert-level research tasks designed by domain specialists across 22 fields, primarily Science & Technology, Business & Finance, and Software Development. It evaluates AI systems' ability to produce rigorous, long-form research reports on complex topics requiring cross-disciplinary synthesis. Results are reported from the subset of 50 English-language tasks in the benchmark. Read the blog[blog]($https://parallel.ai/blog/deep-research-benchmarks).
### Methodology
- Dates: All measurements were made between 08/11/2025 and 08/29/2025.
- Win Rate: Calculated by comparing RACE[RACE]($https://github.com/Ayanami0730/deep_research_bench) scores in direct head-to-head evaluations against reference reports. A sketch of this aggregation follows this list.
- Configurations: For all competitors, we report the highest numbers we were able to achieve across multiple configurations of their APIs. The exact GPT-5 configuration is high reasoning, high search context, and high verbosity.
- Excluded API Results: Exa Research Pro (0% win rate), Claude Opus 4.1 (0% win rate).
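Once per-task RACE scores are available (the scoring itself comes from the deep_research_bench repository), the win-rate aggregation is a straightforward per-task comparison. The tie-handling choice below is an assumption, not the benchmark's specification.

```python
from typing import Sequence

def win_rate(candidate_scores: Sequence[float], reference_scores: Sequence[float]) -> float:
    """Percentage of tasks on which the candidate's RACE score beats the reference report's."""
    assert len(candidate_scores) == len(reference_scores) > 0
    wins = sum(c > r for c, r in zip(candidate_scores, reference_scores))  # ties count as losses (assumption)
    return 100.0 * wins / len(candidate_scores)
```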
#### WISER-Atomic
| Series   | Model          | Cost (CPM) | Accuracy (%) |
| -------- | -------------- | ---------- | ------------ |
| Parallel | Core           | 25         | 77           |
| Parallel | Base           | 10         | 75           |
| Parallel | Lite           | 5          | 64           |
| Others   | o3             | 45         | 69           |
| Others   | 4.1 mini low   | 25         | 63           |
| Others   | gemini 2.5 pro | 36         | 56           |
| Others   | sonar pro high | 16         | 64           |
| Others   | sonar low      | 5          | 48           |
CPM: USD per 1000 requests.
### About the benchmark
This benchmark, created by Parallel, contains 121 questions intended to reflect real-world web research queries across a variety of domains. Read our blog here[here]($https://parallel.ai/blog/parallel-task-api).
### Steps of reasoning
- 50% Multi-Hop questions
- 50% Single-Hop questions
### Distribution
- 40% Financial Research
- 20% Sales Research
- 20% Recruitment
- 20% Miscellaneous
### FindAll API
#### WISER
| Series   | Model                   | Cost (CPM) | Recall (%) |
| -------- | ----------------------- | ---------- | ---------- |
| Parallel | FindAll Base            | 60         | 30.3       |
| Parallel | FindAll Core            | 230        | 52.5       |
| Parallel | FindAll Pro             | 1430       | 61.3       |
| Others   | OpenAI Deep Research    | 250        | 21         |
| Others   | Anthropic Deep Research | 1000       | 15.3       |
| Others   | Exa                     | 110        | 19.2       |
CPM: USD per 1000 requests.
### Benchmark
This benchmark, created by Parallel, contains 40 complex multi-criteria queries covering public companies, startups, SMBs, specialized entities, and people (e.g., executives, researchers, professionals).
### Methodology
Recall is measured as the number of correct matches divided by the total number of entities in the ground-truth dataset, where the ground-truth dataset is the union of all correct matches across the competitor set. Cost is calculated as the average cost to find 1000 correct matches.
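A sketch of these calculations, following the definitions above; the input structures and field names are hypothetical.

```python
from typing import Dict, Set

def findall_metrics(correct_matches: Dict[str, Set[str]],
                    total_cost_usd: Dict[str, float]) -> Dict[str, Dict[str, float]]:
    """Recall and cost-per-1000-correct-matches for each provider.

    correct_matches[p]: set of verified-correct entities provider p returned
    across all 40 queries; total_cost_usd[p]: provider p's total spend.
    """
    # Ground truth is the union of correct matches across the competitor set.
    ground_truth = set().union(*correct_matches.values())
    metrics = {}
    for provider, found in correct_matches.items():
        metrics[provider] = {
            "recall_pct": 100.0 * len(found) / len(ground_truth),
            "cost_per_1000_matches_usd": 1000.0 * total_cost_usd[provider] / len(found),
        }
    return metrics
```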
### Testing dates
Nov 13th-17th, 2025