1236. Web Crawler

Given a url startUrl and an interface HtmlParser, implement a web crawler to crawl all links that are under the same hostname as startUrl.

Return all urls obtained by your web crawler in any order.

Your crawler should:

Start from the page startUrl.
Call HtmlParser.getUrls(url) to get all urls from the webpage of the given url.
Do not crawl the same link twice.
Explore only the links that are under the same hostname as startUrl.

As in the example url http://example.org/foo/bar, the hostname is example.org. For simplicity's sake, you may assume all urls use the http protocol without any port specified.
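Extracting the hostname from such a url can be sketched as follows (a minimal example; the class and method names are hypothetical helpers, not part of the problem's interface):

```java
// Hypothetical helper: extract the hostname from a url of the form
// "http://hostname/path" (the problem guarantees http and no port).
public class HostnameDemo {
    static String getHostname(String url) {
        String rest = url.substring("http://".length()); // drop the protocol prefix
        int slash = rest.indexOf('/');
        return slash == -1 ? rest : rest.substring(0, slash);
    }

    public static void main(String[] args) {
        System.out.println(getHostname("http://example.org/foo/bar")); // prints "example.org"
        System.out.println(getHostname("http://news.google.com"));     // prints "news.google.com"
    }
}
```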

The HtmlParser interface is defined like this:

interface HtmlParser {
  // Returns a list of urls contained in the page at url.
  public List<String> getUrls(String url);
}

Below are two examples explaining the functionality of the problem. For custom testing purposes you'll have three variables: urls, edges, and startUrl. Note that you will only have access to startUrl, while urls and edges are hidden from you in the rest of the testcases.

 

Example 1:

Input:

urls = [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.google.com",
  "http://news.yahoo.com/us"
]
edges = [[2,0],[2,1],[3,2],[3,1],[0,4]]
startUrl = "http://news.yahoo.com/news/topics/"
Output:

[
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.yahoo.com/us"
]

Example 2:

Input:

urls = [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.google.com"
]
edges = [[0,2],[2,1],[3,2],[3,1],[3,0]]
startUrl = "http://news.google.com"
Output:

["http://news.google.com"]
Explanation: 

The startUrl links only to pages that do not share its hostname, so no other page is crawled.
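A standard approach is a BFS from startUrl with a visited set, enqueuing only urls whose hostname matches startUrl's. Below is a minimal sketch under that approach; the class name, the getHostname helper, and the mock parser (wired from Example 1's urls and edges) are illustrative, not given by the problem:

```java
import java.util.*;

public class CrawlerDemo {
    // Minimal stand-in for the problem's HtmlParser interface.
    interface HtmlParser {
        List<String> getUrls(String url);
    }

    // Hypothetical helper: hostname of "http://hostname/path".
    static String getHostname(String url) {
        String rest = url.substring("http://".length());
        int slash = rest.indexOf('/');
        return slash == -1 ? rest : rest.substring(0, slash);
    }

    // BFS from startUrl; visit each same-hostname url exactly once.
    static List<String> crawl(String startUrl, HtmlParser htmlParser) {
        String host = getHostname(startUrl);
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(startUrl);
        queue.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            for (String next : htmlParser.getUrls(url)) {
                // seen.add returns false for already-visited urls,
                // so no link is crawled twice.
                if (getHostname(next).equals(host) && seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Mock parser built from Example 1's urls and edges.
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("http://news.yahoo.com/news/topics/",
                Arrays.asList("http://news.yahoo.com", "http://news.yahoo.com/news"));
        graph.put("http://news.google.com",
                Arrays.asList("http://news.yahoo.com/news/topics/", "http://news.yahoo.com/news"));
        graph.put("http://news.yahoo.com",
                Arrays.asList("http://news.yahoo.com/us"));
        HtmlParser parser = url -> graph.getOrDefault(url, Collections.emptyList());

        List<String> result = crawl("http://news.yahoo.com/news/topics/", parser);
        Collections.sort(result); // sort for deterministic display; any order is accepted
        System.out.println(result);
        // prints the four news.yahoo.com urls, matching Example 1's output
    }
}
```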

 

Constraints:

1 <= urls.length <= 1000
1 <= urls[i].length <= 300
startUrl is one of the urls.
The hostname consists of lowercase English letters, digits, the hyphen-minus character '-', and '.', and does not start or end with '-'.
You may assume there are no duplicates in the url library.

Difficulty:

Medium

Lock:

Prime

Company:

Amazon, Dropbox, Facebook

Solution (originally in Chinese):

LeetCode 1236. Web Crawler: solution approach and analysis