

## Parallel Quality Benchmarks

Give your AI the highest-quality web search tools available

When building applications that rely on web data to make decisions or answer questions, nothing matters more than accuracy. These benchmarks measure how accurately different web search offerings answer prompts. By obsessing over accuracy, we consistently lead the market with state-of-the-art quality. In addition to leading in accuracy, Parallel often leads in pricing.

### Search API

#### HLE

| Series    | Model        | Cost  (CPM) | Accuracy (%) |
| --------- | ------------ | ----------- | ------------ |
| Parallel  | parallel     | 82          | 47           |
| Others    | exa          | 138         | 24           |
| Others    | tavily       | 190         | 21           |
| Others    | perplexity   | 126         | 30           |
| Others    | openai gpt-5 | 143         | 45           |

### About this benchmark

This benchmark, [Humanity's Last Exam](https://lastexam.ai/), consists of 2,500 questions developed by subject-matter experts across dozens of subjects (e.g. math, humanities, natural sciences). Each question has a known solution that is unambiguous and easily verifiable, but requires sophisticated web retrieval and reasoning. Results are reported on a sample of 100 questions from this benchmark.

### Methodology

- **Evaluation**: Results are based on tests run using official Search MCP servers provided as an MCP tool to OpenAI's GPT-5 model using the Responses API. In all cases, the MCP tools were limited to only the appropriate web search tool. Answers were evaluated using an LLM as a judge (GPT-4.1). A sketch of this setup appears below.
- **Cost Calculation**: Cost reflects the average cost per query across all questions run. This cost includes both the search API call and LLM token cost.
- **Testing Dates**: Testing was conducted from November 3rd to November 5th.
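
For readers who want a concrete picture of this setup, here is a minimal sketch, assuming the OpenAI Python SDK's Responses API with a remote MCP tool. The MCP server URL, tool name, and judge prompt are placeholders, not the exact harness used for these results.

```python
# Sketch of one evaluation step: ask GPT-5 a benchmark question with a single
# MCP-provided web search tool, then grade the answer with an LLM judge (GPT-4.1).
# SERVER_URL and TOOL_NAME are placeholders for the provider under test.
from openai import OpenAI

client = OpenAI()
SERVER_URL = "https://search-provider.example.com/mcp"  # placeholder MCP endpoint
TOOL_NAME = "web_search"                                # placeholder tool name

def answer_with_search(question: str) -> str:
    resp = client.responses.create(
        model="gpt-5",
        tools=[{
            "type": "mcp",
            "server_label": "search_provider",
            "server_url": SERVER_URL,
            "allowed_tools": [TOOL_NAME],   # limit to the web search tool only
            "require_approval": "never",
        }],
        input=question,
    )
    return resp.output_text

def judge(question: str, answer: str, ground_truth: str) -> bool:
    # LLM-as-a-judge: GPT-4.1 decides whether the answer matches the known solution.
    verdict = client.responses.create(
        model="gpt-4.1",
        input=(
            f"Question: {question}\nGround truth: {ground_truth}\n"
            f"Candidate answer: {answer}\n"
            "Reply with exactly CORRECT or INCORRECT."
        ),
    )
    return verdict.output_text.strip().upper().startswith("CORRECT")
```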

#### BrowseComp

| Series    | Model        | Cost  (CPM) | Accuracy (%) |
| --------- | ------------ | ----------- | ------------ |
| Parallel  | parallel     | 156         | 58           |
| Others    | exa          | 233         | 29           |
| Others    | tavily       | 314         | 23           |
| Others    | perplexity   | 256         | 22           |
| Others    | openai gpt-5 | 253         | 53           |

### About this benchmark

This [benchmark](https://openai.com/index/browsecomp/), created by OpenAI, contains 1,266 questions requiring multi-hop reasoning, creative search formulation, and synthesis of contextual clues across time periods. Results are reported on a sample of 100 questions from this benchmark.

### Methodology

- **Evaluation**: Results are based on tests run using official Search MCP servers provided as an MCP tool to OpenAI's GPT-5 model using the Responses API. In all cases, the MCP tools were limited to only the appropriate web search tool. Answers were evaluated using an LLM as a judge (GPT-4.1).
- **Cost Calculation**: Cost reflects the average cost per query across all questions run. This cost includes both the search API call and LLM token cost.
- **Testing Dates**: Testing was conducted from November 3rd to November 5th.

#### WebWalker

| Series    | Model        | Cost  (CPM) | Accuracy (%) |
| --------- | ------------ | ----------- | ------------ |
| Parallel  | parallel     | 42          | 81           |
| Others    | exa          | 107         | 48           |
| Others    | tavily       | 156         | 79           |
| Others    | perplexity   | 91          | 67           |
| Others    | openai gpt-5 | 88          | 73           |

### About this benchmark

This [benchmark](https://arxiv.org/abs/2501.07572) is designed to assess the ability of LLMs to perform web traversal. Answering its questions requires the ability to crawl and extract content from website subpages. Results are reported on a sample of 100 questions from this benchmark.

### Methodology

- **Evaluation**: Results are based on tests run using official Search MCP servers provided as an MCP tool to OpenAI's GPT-5 model using the Responses API. In all cases, the MCP tools were limited to only the appropriate web search tool. Answers were evaluated using an LLM as a judge (GPT-4.1).
- **Cost Calculation**: Cost reflects the average cost per query across all questions run. This cost includes both the search API call and LLM token cost.
- **Testing Dates**: Testing was conducted from November 3rd to November 5th.

#### FRAMES

| Series    | Model        | Cost  (CPM) | Accuracy (%) |
| --------- | ------------ | ----------- | ------------ |
| Parallel  | parallel     | 42          | 92           |
| Others    | exa          | 81          | 81           |
| Others    | tavily       | 122         | 87           |
| Others    | perplexity   | 95          | 83           |
| Others    | openai gpt-5 | 68          | 90           |

### About this benchmark

This [benchmark](https://huggingface.co/datasets/google/frames-benchmark) contains 824 challenging multi-hop questions designed to test factuality, retrieval accuracy, and reasoning. Results are reported on a sample of 100 questions from this benchmark.

### Methodology

- **Evaluation**: Results are based on tests run using official Search MCP servers provided as an MCP tool to OpenAI's GPT-5 model using the Responses API. In all cases, the MCP tools were limited to only the appropriate web search tool. Answers were evaluated using an LLM as a judge (GPT-4.1).
- **Cost Calculation**: Cost reflects the average cost per query across all questions run. This cost includes both the search API call and LLM token cost.
- **Testing Dates**: Testing was conducted from November 3rd to November 5th.

#### SimpleQA

| Series    | Model        | Cost  (CPM) | Accuracy (%) |
| --------- | ------------ | ----------- | ------------ |
| Parallel  | parallel     | 17          | 98           |
| Others    | exa          | 57          | 87           |
| Others    | tavily       | 110         | 93           |
| Others    | perplexity   | 52          | 92           |
| Others    | openai gpt-5 | 37          | 98           |

### About this benchmark

This [benchmark](https://openai.com/index/introducing-simpleqa/), created by OpenAI, contains 4,326 questions focused on short, fact-seeking queries across a variety of domains. Results are reported on a sample of 100 questions from this benchmark.

### Methodology

- **Evaluation**: Results are based on tests run using official Search MCP servers provided as an MCP tool to OpenAI's GPT-5 model using the Responses API. In all cases, the MCP tools were limited to only the appropriate web search tool. Answers were evaluated using an LLM as a judge (GPT-4.1).
- **Cost Calculation**: Cost reflects the average cost per query across all questions run. This cost includes both the search API call and LLM token cost.
- **Testing Dates**: Testing was conducted from November 3rd to November 5th.

#### Batched SimpleQA

| Series    | Model        | Cost  (CPM) | Accuracy (%) |
| --------- | ------------ | ----------- | ------------ |
| Parallel  | parallel     | 50          | 90           |
| Others    | exa          | 119         | 71           |
| Others    | tavily       | 227         | 59           |
| Others    | perplexity   | 100         | 74           |
| Others    | openai gpt-5 | 91          | 88           |

### About this benchmark

This benchmark was created by batching 3 independent questions from the original [SimpleQA dataset](https://openai.com/index/introducing-simpleqa/) to create 100 more complex composite questions.
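
As an illustration only, a sketch of how such composites might be assembled; the record fields, sampling scheme, and prompt wording below are assumptions, not the published construction procedure.

```python
# Illustrative sketch: assemble composite prompts by sampling 3 SimpleQA items at a
# time. The field names ("question", "answer"), the sampling scheme, and the prompt
# wording are assumptions, not the exact procedure used to build this benchmark.
import random

def make_composites(simpleqa_items: list[dict], n_composites: int = 100,
                    k: int = 3, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    composites = []
    for _ in range(n_composites):
        picked = rng.sample(simpleqa_items, k)
        prompt = "Answer each of the following questions:\n" + "\n".join(
            f"{i + 1}. {item['question']}" for i, item in enumerate(picked)
        )
        composites.append({"prompt": prompt,
                           "answers": [item["answer"] for item in picked]})
    return composites
```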

### Methodology

- **Evaluation**: Results are based on tests run using official Search MCP servers provided as an MCP tool to OpenAI's GPT-5 model using the Responses API. In all cases, the MCP tools were limited to only the appropriate web search tool. Answers were evaluated using an LLM as a judge (GPT-4.1).
- **Cost Calculation**: Cost reflects the average cost per query across all questions run. This cost includes both the search API call and LLM token cost.
- **Testing Dates**: Testing was conducted from November 3rd to November 5th.

### Task API

#### DeepSearchQA

| Series   | Model                    | Cost (CPM) | Accuracy (%) |
| -------- | ------------------------ | ---------- | ------------ |
| Parallel | Pro                      | 100        | 62           |
| Parallel | Ultra                    | 300        | 68.5         |
| Parallel | Ultra2x                  | 600        | 72.6         |
| Others   | Gemini Deep Research     | 2500       | 64.3         |
| Others   | OpenAI GPT 5.2 Pro       | 1830       | 61           |
| Others   | Exa                      | 740        | 30           |
| Others   | Perplexity Deep Research | 1540       | 25           |

CPM: USD per 1000 requests (e.g. 100 CPM corresponds to $0.10 per request). Cost is shown on a Log scale.

### About the benchmark

This benchmark, created by researchers at Google, consists of 900 prompts for evaluating agents on difficult multi-step information-seeking tasks across 17 different fields.

### Methodology

Accuracy refers to answers that are “Fully Correct”: A response is fully correct if and only if the submitted set is semantically identical to the ground-truth set. The agent must identify all correct answers while including zero incorrect answers.

We evaluate Exa, Perplexity, GPT 5.2 Pro, and the Gemini Deep Research API at their highest thinking and search-context settings.
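
In code, the "fully correct" criterion above amounts to exact set equality between the submitted and ground-truth answer sets. The sketch below uses simple string normalization in place of the semantic matching described above.

```python
# Sketch of the "fully correct" criterion: the submitted answer set must match the
# ground-truth set exactly -- every correct answer present and zero incorrect extras.
# The real grader matches answers semantically; this sketch only normalizes strings.
def normalize(answer: str) -> str:
    return " ".join(answer.lower().split())

def fully_correct(submitted: list[str], ground_truth: list[str]) -> bool:
    return {normalize(a) for a in submitted} == {normalize(a) for a in ground_truth}
```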

### Testing dates

December 15-18, 2025

#### SealQA SEAL-0

| Series   | Model            | Cost (CPM) | Accuracy (%) |
| -------- | ---------------- | ---------- | ------------ |
| Parallel | Core             | 25         | 42.3         |
| Parallel | Core2x           | 50         | 49.5         |
| Parallel | Pro              | 100        | 52.3         |
| Parallel | Ultra            | 300        | 55.9         |
| Parallel | Ultra8x          | 2400       | 56.8         |
| Others   | Perplexity DR    | 1258.2     | 38.7         |
| Others   | Exa Research Pro | 2043.2     | 45           |
| Others   | GPT-5            | 189        | 48.6         |

CPM: USD per 1000 requests. Cost is shown on a Log scale.

### About the benchmark

[SealQA](https://arxiv.org/abs/2506.01062) is a challenge benchmark for evaluating search-augmented language models on fact-seeking questions where web search typically yields conflicting, noisy, or unhelpful results.

SEAL-0 is a core set of problems where even frontier models with browsing consistently fail. It's named "zero" due to its high failure rate.

### Methodology

**Benchmark Details:** We tested on the full SEAL-0 (111 questions) dataset. Questions require reconciling conflicting web sources.

**LLM Evaluator:** We evaluated responses using an LLM-as-a-judge, measuring factual accuracy against verified ground truth.

**Benchmark Dates:** Testing took place between October 20 and 28, 2025.

**Cost Standardization:** Parallel uses deterministic per-query pricing. For token-based APIs, we normalized to cost per thousand queries (CPM) as measured on the benchmark.
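
The normalization itself is simple arithmetic: total observed spend divided by the number of queries, scaled to 1,000 requests. A minimal sketch with hypothetical numbers:

```python
# Normalize a token-billed benchmark run to cost per 1,000 queries (CPM).
def cost_per_thousand(total_cost_usd: float, num_queries: int) -> float:
    return total_cost_usd / num_queries * 1000

# Hypothetical example: $10 of token spend across 250 queries works out to 40 CPM.
assert cost_per_thousand(10.0, 250) == 40.0
```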

#### SealQA SEAL-HARD

| Series   | Model            | Cost (CPM) | Accuracy (%) |
| -------- | ---------------- | ---------- | ------------ |
| Parallel | Core             | 25         | 60.6         |
| Parallel | Core2x           | 50         | 65.7         |
| Parallel | Pro              | 100        | 66.9         |
| Parallel | Ultra            | 300        | 68.5         |
| Parallel | Ultra8x          | 2400       | 70.1         |
| Others   | Perplexity DR    | 1221.5     | 50.1         |
| Others   | Exa Research Pro | 2192.4     | 59.1         |
| Others   | GPT-5            | 161.7      | 64.6         |

CPM: USD per 1000 requests. Cost is shown on a Log scale.

### About the benchmark

[SEAL-HARD](https://arxiv.org/abs/2506.01062) contains a broader set of queries that includes SEAL-0 and additional highly challenging questions.

### Methodology

**Benchmark Details:** We tested on the full SEAL-0 (111 questions) and SEAL-HARD (254 questions) datasets. Questions require reconciling conflicting web sources.

**LLM Evaluator:** We evaluated responses using an LLM-as-a-judge, measuring factual accuracy against verified ground truth.

**Benchmark Dates:** Testing took place between October 20 and 28, 2025.

**Cost Standardization:** Parallel uses deterministic per-query pricing. For token-based APIs, we normalized to cost per thousand queries (CPM) as measured on the benchmark.

#### BrowseComp

| Series    | Model      | Cost (CPM) | Accuracy  (%) |
| --------- | ---------- | ---------- | ------------- |
| Parallel  | Ultra      | 300        | 45            |
| Parallel  | Ultra2x    | 600        | 51            |
| Parallel  | Ultra4x    | 1200       | 56            |
| Parallel  | Ultra8x    | 2400       | 58            |
| Others    | GPT-5      | 488        | 38            |
| Others    | Anthropic  | 5194       | 7             |
| Others    | Exa        | 402        | 14            |
| Others    | Perplexity | 709        | 6             |

CPM: USD per 1000 requests. Cost is shown on a Log scale.

### About the benchmark

This [benchmark](https://openai.com/index/browsecomp/), created by OpenAI, contains 1,266 questions requiring multi-hop reasoning, creative search formulation, and synthesis of contextual clues across time periods. Results are reported on a random sample of 100 questions from this benchmark. Read the [blog](https://parallel.ai/blog/deep-research-benchmarks).

### Methodology

- **Dates**: All measurements were made between 08/11/2025 and 08/29/2025.
- **Configurations**: For all competitors, we report the highest numbers we were able to achieve across multiple configurations of their APIs. The exact configurations are below.
  - GPT-5: high reasoning, high search context, default verbosity
  - Exa: Exa Research Pro
  - Anthropic: Claude Opus 4.1
  - Perplexity: Sonar Deep Research, reasoning effort high

#### DeepResearchBench

| Series   | Model      | Cost (CPM) | Win Rate vs Reference (%) |
| -------- | ---------- | ---------- | ------------------------- |
| Parallel | Ultra      | 300        | 82                        |
| Parallel | Ultra2x    | 600        | 86                        |
| Parallel | Ultra4x    | 1200       | 92                        |
| Parallel | Ultra8x    | 2400       | 96                        |
| Others   | GPT-5      | 628        | 66                        |
| Others   | O3 Pro     | 4331       | 30                        |
| Others   | O3         | 605        | 26                        |
| Others   | Perplexity | 538        | 6                         |

CPM: USD per 1000 requests. Cost is shown on a Log scale.

### About the benchmark

This [benchmark](https://github.com/Ayanami0730/deep_research_bench) contains 100 expert-level research tasks designed by domain specialists across 22 fields, primarily Science & Technology, Business & Finance, and Software Development. It evaluates AI systems' ability to produce rigorous, long-form research reports on complex topics requiring cross-disciplinary synthesis. Results are reported from the subset of 50 English-language tasks in the benchmark. Read the [blog](https://parallel.ai/blog/deep-research-benchmarks).

### Methodology

- **Dates**: All measurements were made between 08/11/2025 and 08/29/2025.
- **Win Rate**: Calculated by comparing [RACE](https://github.com/Ayanami0730/deep_research_bench) scores in direct head-to-head evaluations against reference reports; a sketch of this calculation follows the list.
- **Configurations**: For all competitors, we report the highest numbers we were able to achieve across multiple configurations of their APIs. The exact GPT-5 configuration is high reasoning, high search context, and high verbosity.
- **Excluded API Results**: Exa Research Pro (0% win rate), Claude Opus 4.1 (0% win rate).
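
A sketch of the win-rate calculation, under one assumption: each task contributes a single head-to-head comparison of RACE scores, and ties are not counted as wins.

```python
# Sketch of the win-rate calculation: the share of tasks where the candidate
# report's RACE score beats the reference report's score head-to-head.
# Assumption: one comparison per task; ties are not counted as wins.
def win_rate(candidate_scores: list[float], reference_scores: list[float]) -> float:
    wins = sum(c > r for c, r in zip(candidate_scores, reference_scores, strict=True))
    return 100 * wins / len(candidate_scores)
```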

#### WISER-Atomic

| Series   | Model          | Cost (CPM) | Accuracy (%) |
| -------- | -------------- | ---------- | ------------ |
| Parallel | Core           | 25         | 77           |
| Parallel | Base           | 10         | 75           |
| Parallel | Lite           | 5          | 64           |
| Others   | o3             | 45         | 69           |
| Others   | 4.1 mini low   | 25         | 63           |
| Others   | gemini 2.5 pro | 36         | 56           |
| Others   | sonar pro high | 16         | 64           |
| Others   | sonar low      | 5          | 48           |

CPM: USD per 1000 requests. Cost is shown on a Log scale.

### About the benchmark

This benchmark, created by Parallel, contains 121 questions intended to reflect real-world web research queries across a variety of domains. Read our blog [here](https://parallel.ai/blog/parallel-task-api).

### Steps of reasoning

- 50% Multi-Hop questions
- 50% Single-Hop questions

### Distribution

- 40% Financial Research
- 20% Sales Research
- 20% Recruitment
- 20% Miscellaneous

### FindAll API

#### WISER

| Series   | Model                   | Cost (CPM) | Recall (%) |
| -------- | ----------------------- | ---------- | ---------- |
| Parallel | FindAll Base            | 60         | 30.3       |
| Parallel | FindAll Core            | 230        | 52.5       |
| Parallel | FindAll Pro             | 1430       | 61.3       |
| Others   | OpenAI Deep Research    | 250        | 21         |
| Others   | Anthropic Deep Research | 1000       | 15.3       |
| Others   | Exa                     | 110        | 19.2       |

CPM: USD per 1000 requests. Cost is shown on a Log scale.

### About the benchmark

This benchmark, created by Parallel, contains 40 complex multi-criteria queries covering public companies, startups, SMBs, specialized entities, and people (e.g., executives, researchers, professionals).

### Methodology

Recall is measured as the number of correct matches divided by the total number of entities in the ground-truth dataset, where the ground-truth dataset is the union of all correct matches across the competitor set. Cost is calculated as the average cost to find 1,000 correct matches. A sketch of both calculations appears below.
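
A sketch of these two calculations, assuming per-system sets of verified-correct matches and an average per-query cost for each system; how cost is attributed per query here is an assumption.

```python
# Sketch of the FindAll scoring described above.
# Assumptions: `correct_matches` maps each system to its set of verified-correct
# entities, `avg_cost_per_query` maps each system to its average query cost in USD,
# and every system found at least one correct match.
def findall_metrics(correct_matches: dict[str, set[str]],
                    avg_cost_per_query: dict[str, float],
                    num_queries: int) -> dict[str, dict[str, float]]:
    # Ground truth = union of every system's verified-correct matches.
    ground_truth = set().union(*correct_matches.values())
    metrics = {}
    for system, correct in correct_matches.items():
        recall_pct = 100 * len(correct) / len(ground_truth)
        # Average cost to find 1,000 correct matches, derived from per-query cost.
        total_cost = avg_cost_per_query[system] * num_queries
        cost_per_1000_matches = 1000 * total_cost / len(correct)
        metrics[system] = {"recall_pct": recall_pct,
                           "cost_per_1000_matches": cost_per_1000_matches}
    return metrics
```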

### Testing dates

November 13-17, 2025
