Web Crawler - Solution & Explanation

Q: Is Web Crawler easy or hard?

Web Crawler is considered a Medium problem. The traversal logic is straightforward BFS or DFS, but candidates must correctly filter URLs by hostname and prevent revisiting nodes using a set.

Q: Web Crawler Python/Java solution

Both Python and Java implementations use the same structure: extract the hostname, maintain a visited set, and traverse URLs using BFS or DFS. Python typically uses a deque and set, while Java uses Queue and HashSet with repeated calls to HtmlParser.getUrls.

Q: How to solve Web Crawler in O(n)?

Use BFS or DFS with a hash set to track visited URLs. Start from startUrl, extract its hostname, and repeatedly fetch neighbors using HtmlParser.getUrls. Only add URLs with the same hostname and skip ones already visited, ensuring each page is processed exactly once.

Q: What is the best approach for Web Crawler?

Graph traversal using BFS or DFS with a visited set is the best approach. Start from startUrl, extract its hostname, and only crawl URLs that share the same hostname. Each discovered URL is processed once, giving O(N + E) time complexity, where N is the number of pages and E is the number of links.

Q: Is Web Crawler asked at Google/Amazon/Meta?

Web crawler style problems appear in interviews at companies like Google and Meta because they test graph traversal, URL parsing, and handling visited states. The problem models real distributed crawler systems but reduces it to BFS/DFS traversal over URLs.

Q: What data structure is used in Web Crawler?

The core data structures are a queue (for BFS) or recursion/stack (for DFS) and a hash set to store visited URLs. String parsing is used to extract and compare hostnames so the crawler stays within the same hostname.

Q: What is the time complexity of Web Crawler?

The optimal solution runs in O(N + E) time. Each URL (node) is visited once and every hyperlink (edge) returned by HtmlParser.getUrls is processed once during traversal. Space complexity is O(N) for storing visited URLs and the BFS queue or DFS recursion stack.

Difficulty: Medium
Availability: Premium (free on FleetCode)
Topics: String, Depth-First Search, Breadth-First Search, Interactive
Reading time: 6 min
Asked at: Amazon, Microsoft, Meta +4


Problem Statement

Given a url startUrl and an interface HtmlParser, implement a web crawler to crawl all links that are under the same hostname as startUrl.

Return all urls obtained by your web crawler in any order.

Your crawler should:

Start from the page: startUrl
Call HtmlParser.getUrls(url) to get all urls from the webpage at the given url.
Do not crawl the same link twice.
Explore only the links that are under the same hostname as startUrl.

In a url such as http://example.org/test, the hostname is example.org. For simplicity's sake, you may assume all urls use the http protocol without any port specified. For example, the urls http://leetcode.com/problems and http://leetcode.com/contest are under the same hostname, while the urls http://example.org/test and http://example.com/abc are not under the same hostname.

The HtmlParser interface is defined as such:

interface HtmlParser {
  // Return a list of all urls from a webpage of given url.
  public List<String> getUrls(String url);
}

Below are two examples explaining the functionality of the problem. For custom testing purposes, you'll have three variables: urls, edges and startUrl. Notice that you will only have access to startUrl in your code, while urls and edges are not directly accessible to you in code.

Note: Consider the same URL with the trailing slash "/" as a different URL. For example, "http://news.yahoo.com", and "http://news.yahoo.com/" are different urls.

Example 1:

Input:
urls = [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.google.com",
  "http://news.yahoo.com/us"
]
edges = [[2,0],[2,1],[3,2],[3,1],[0,4]]
startUrl = "http://news.yahoo.com/news/topics/"
Output: [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.yahoo.com/us"
]

Example 2:

Input: 
urls = [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.google.com"
]
edges = [[0,2],[2,1],[3,2],[3,1],[3,0]]
startUrl = "http://news.google.com"
Output: ["http://news.google.com"]
Explanation: The startUrl links to all other pages that do not share the same hostname.
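
For local experiments it can help to rebuild the hidden urls and edges inputs as a stand-in parser. The sketch below is only an illustration: MockHtmlParser is a hypothetical name, and it assumes that an edge [a, b] means the page urls[a] links to urls[b], which is the reading that matches both expected outputs above.

```python
from typing import Dict, List


class MockHtmlParser:
    """Hypothetical local stand-in for the judge's HtmlParser."""

    def __init__(self, urls: List[str], edges: List[List[int]]) -> None:
        # Assumption: edge [a, b] means the page urls[a] links to urls[b].
        self._links: Dict[str, List[str]] = {url: [] for url in urls}
        for src, dst in edges:
            self._links[urls[src]].append(urls[dst])

    def getUrls(self, url: str) -> List[str]:
        # Return a copy so callers cannot mutate the stored graph.
        return list(self._links.get(url, []))


# Data from Example 1.
urls = [
    "http://news.yahoo.com",
    "http://news.yahoo.com/news",
    "http://news.yahoo.com/news/topics/",
    "http://news.google.com",
    "http://news.yahoo.com/us",
]
edges = [[2, 0], [2, 1], [3, 2], [3, 1], [0, 4]]
parser = MockHtmlParser(urls, edges)

# Links found on the start page of Example 1.
print(parser.getUrls("http://news.yahoo.com/news/topics/"))
```

Under that assumption, the start page of Example 1 links to http://news.yahoo.com and http://news.yahoo.com/news, and http://news.yahoo.com/us is only reachable through http://news.yahoo.com.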

Constraints:

1 <= urls.length <= 1000
1 <= urls[i].length <= 300
startUrl is one of the urls.
A hostname label must be from 1 to 63 characters long, including the dots, and may contain only the ASCII letters from 'a' to 'z', digits from '0' to '9' and the hyphen-minus character ('-').
The hostname may not start or end with the hyphen-minus character ('-').
See: https://en.wikipedia.org/wiki/Hostname#Restrictions_on_valid_hostnames
You may assume there are no duplicates in the url library.

Approach Overview

Problem Overview: Starting from startUrl, crawl every reachable page that belongs to the same hostname. The only way to discover links is through the provided HtmlParser.getUrls(url) API, so the problem becomes a graph traversal where URLs are nodes and hyperlinks are edges.

Approach 1: Breadth-First Search (BFS) Traversal (Time: O(N + E), Space: O(N))

Treat each URL as a node in a graph. Use a queue to perform breadth-first search starting from startUrl. Extract the hostname from the starting URL, then repeatedly dequeue a URL, call getUrls, and enqueue any discovered link that shares the same hostname and hasn’t been visited. A set tracks visited URLs to avoid infinite loops caused by cyclic links. BFS is straightforward and processes pages level by level, which mirrors how many real crawlers explore the web.
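
The BFS steps described above can be sketched in Python as follows. This is a minimal illustration rather than a reference solution: DictHtmlParser, get_hostname and crawl_bfs are hypothetical names, and the stub parser only imitates the getUrls call from the problem statement using the link structure of Example 1.

```python
from collections import deque
from typing import Dict, List


class DictHtmlParser:
    """Hypothetical stub for local runs: maps a url to the urls it links to."""

    def __init__(self, links: Dict[str, List[str]]) -> None:
        self._links = links

    def getUrls(self, url: str) -> List[str]:
        return self._links.get(url, [])


def get_hostname(url: str) -> str:
    # Assumes the "http://" prefix and missing port promised by the problem.
    return url[len("http://"):].split("/", 1)[0]


def crawl_bfs(start_url: str, html_parser) -> List[str]:
    hostname = get_hostname(start_url)
    visited = {start_url}          # every url that has been enqueued
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        for nxt in html_parser.getUrls(url):
            # Enqueue only unseen links that stay on the start hostname.
            if nxt not in visited and get_hostname(nxt) == hostname:
                visited.add(nxt)
                queue.append(nxt)
    return list(visited)


# Link structure of Example 1.
links = {
    "http://news.yahoo.com/news/topics/": [
        "http://news.yahoo.com",
        "http://news.yahoo.com/news",
    ],
    "http://news.google.com": [
        "http://news.yahoo.com/news/topics/",
        "http://news.yahoo.com/news",
    ],
    "http://news.yahoo.com": ["http://news.yahoo.com/us"],
}
result = crawl_bfs("http://news.yahoo.com/news/topics/", DictHtmlParser(links))
print(sorted(result))
```

Marking a URL as visited at enqueue time, rather than at dequeue time, keeps the same link from entering the queue more than once.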

Approach 2: Depth-First Search (DFS) Traversal (Time: O(N + E), Space: O(N))

Another option is recursive or iterative depth-first search. Start with startUrl, store the hostname, and recursively visit every neighbor returned by getUrls. Before visiting a URL, check two conditions: it matches the original hostname and it hasn’t been visited before. DFS explores one branch fully before moving to another, which keeps the implementation concise with recursion and a visited set. From a complexity standpoint, DFS and BFS are identical because each valid URL is processed once.
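
A recursive version of this traversal might look like the sketch below. The names DictHtmlParser, get_hostname and crawl_dfs are hypothetical, and the tiny link graph is made up for illustration: it contains a cycle between two pages and one external link, so both checks are exercised.

```python
from typing import Dict, List, Set


class DictHtmlParser:
    """Hypothetical stub for local runs: maps a url to the urls it links to."""

    def __init__(self, links: Dict[str, List[str]]) -> None:
        self._links = links

    def getUrls(self, url: str) -> List[str]:
        return self._links.get(url, [])


def get_hostname(url: str) -> str:
    # Assumes the "http://" prefix and missing port promised by the problem.
    return url[len("http://"):].split("/", 1)[0]


def crawl_dfs(start_url: str, html_parser) -> List[str]:
    hostname = get_hostname(start_url)
    visited: Set[str] = set()

    def visit(url: str) -> None:
        visited.add(url)
        for nxt in html_parser.getUrls(url):
            # Recurse only into unseen links on the start hostname.
            if nxt not in visited and get_hostname(nxt) == hostname:
                visit(nxt)

    visit(start_url)
    return list(visited)


# Made-up graph: /a and /b link to each other, /b also links off-site.
links = {
    "http://example.org/a": ["http://example.org/b"],
    "http://example.org/b": ["http://example.org/a", "http://example.com/abc"],
}
result = crawl_dfs("http://example.org/a", DictHtmlParser(links))
print(sorted(result))  # ['http://example.org/a', 'http://example.org/b']
```

The visited set is what stops the recursion when /b links back to /a, and the hostname check keeps http://example.com/abc out of the result.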

Hostname filtering is the key constraint. Extract the hostname of the starting URL once and compare it with the hostname of each discovered link. This ensures the crawler never leaves the target site even when getUrls returns external links. Because every URL is guaranteed to start with http:// and to carry no port, the hostname is simply the text between that prefix and the next slash, so one split per URL keeps the parsing overhead minimal.
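
As a rough illustration of the hostname check, the sketch below shows two ways to extract a hostname in Python: plain string slicing, which leans on the problem's guarantees, and urlsplit from the standard library. The helper names are hypothetical, and http://example.org.evil.com/x is a made-up URL used only to show why comparing whole hostnames is safer than a prefix test.

```python
from urllib.parse import urlsplit


def hostname_by_slicing(url: str) -> str:
    # Relies on the problem's guarantee: "http://" prefix and no port.
    rest = url[len("http://"):]
    return rest.split("/", 1)[0]


def hostname_by_urlsplit(url: str) -> str:
    # Standard-library parser; also copes with ports and other schemes.
    return urlsplit(url).hostname or ""


def same_hostname(a: str, b: str) -> bool:
    return hostname_by_slicing(a) == hostname_by_slicing(b)


print(same_hostname("http://leetcode.com/problems", "http://leetcode.com/contest"))  # True
print(same_hostname("http://example.org/test", "http://example.com/abc"))            # False

# A prefix test would wrongly accept this made-up URL; a hostname comparison does not.
tricky = "http://example.org.evil.com/x"
print(tricky.startswith("http://example.org"))        # True
print(same_hostname("http://example.org", tricky))    # False
```

Both helpers agree on the URLs in this problem; the slicing version is shorter, while urlsplit is the more general choice outside the problem's simplified setting.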

Recommended for interviews: Either BFS or DFS is acceptable since both run in O(N + E) time and O(N) space. BFS with a queue and visited set is the most commonly expected approach because it clearly demonstrates graph traversal fundamentals and avoids recursion depth concerns. Mentioning DFS as an alternative shows strong understanding of traversal patterns.
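
To make the recursion depth concern concrete: in CPython the default recursion limit is typically 1000 frames, so a recursive DFS can fail on a site whose pages form one long chain. One way around this is DFS with an explicit stack, sketched below. ChainHtmlParser, get_hostname and crawl_dfs_iterative are hypothetical names, and the chain of 1000 pages is made-up test data sized to the upper bound in the constraints.

```python
from typing import List


class ChainHtmlParser:
    """Hypothetical stub: page i links only to page i + 1."""

    def __init__(self, length: int) -> None:
        self._length = length

    def getUrls(self, url: str) -> List[str]:
        index = int(url.rsplit("/", 1)[1])
        if index + 1 < self._length:
            return [f"http://example.org/page/{index + 1}"]
        return []


def get_hostname(url: str) -> str:
    # Assumes the "http://" prefix and missing port promised by the problem.
    return url[len("http://"):].split("/", 1)[0]


def crawl_dfs_iterative(start_url: str, html_parser) -> List[str]:
    hostname = get_hostname(start_url)
    visited = {start_url}
    stack = [start_url]            # explicit stack replaces the call stack
    while stack:
        url = stack.pop()
        for nxt in html_parser.getUrls(url):
            if nxt not in visited and get_hostname(nxt) == hostname:
                visited.add(nxt)
                stack.append(nxt)
    return list(visited)


pages = crawl_dfs_iterative("http://example.org/page/0", ChainHtmlParser(1000))
print(len(pages))  # 1000
```

The only difference from the BFS version is the container: popping from the end of a list gives depth-first order, while popping from the front of a deque gives breadth-first order.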

Detailed Complexity Analysis

Approach	Time	Space	When to Use
Breadth-First Search (Queue)	O(N + E)	O(N)	General case. Preferred in interviews for clear level-by-level graph traversal.
Depth-First Search (Recursive/Stack)	O(N + E)	O(N)	Cleaner implementation with recursion when stack depth is manageable.

Problem Info

Difficulty: Medium
Acceptance: 68.8%
Approaches: 1
Reading time: 6 min
Asked at: Amazon, Microsoft, Meta, Snowflake, Rubrik
