How to List Out All URLs Associated With a Website Fast-ish

As a web developer or SEO, you often need to get a complete list of all the URLs on a website. Maybe you're auditing the site for SEO issues, analyzing the site structure, or looking for broken links. You need a fast and thorough way to crawl the website and extract all the URLs.

You could manually click through the site and copy/paste the URLs, but that's tedious and error-prone, especially for larger sites. You might be able to get a list of URLs from an XML sitemap if the site has one, but sitemaps are often incomplete or out of date.
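
That said, when a site does publish a sitemap, you can pull its URLs with a quick one-liner. The sketch below assumes the sitemap lives at the conventional /sitemap.xml path and is a plain sitemap rather than a sitemap index:

# Pull the <loc> entries out of a sitemap, if one exists at the usual path
curl -s https://example.com/sitemap.xml | grep -oE '<loc>[^<]+</loc>' | sed -e 's|<loc>||' -e 's|</loc>||'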

Web crawling tools like Screaming Frog are a popular option, but the free version is limited to crawling 500 URLs, and the paid version is pricey at £149 per year. You could write your own web crawler in Python or another language, but that requires some programming skills.

Luckily, there's a simple command line tool called wget that makes it easy to crawl a website and list out all of its URLs. With a single command, you can download a local copy of the site and extract a complete list of URLs in a few minutes at most. Here's how to do it:

Step 1: Install wget

wget is a free and open source command line utility for downloading files from the web. It's standard on most Linux distributions. On a Mac, you can easily install it using Homebrew.

If you don't already have Homebrew installed, paste this command into your Terminal and hit Enter:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Then install wget with:

brew install wget
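
Whichever platform you're on, you can confirm that wget is installed and on your PATH by printing its version:

# Should print the wget version and build information
wget --version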

Step 2: Crawl the website with wget

To download a website and list out all the URLs, simply run wget with the -r flag (short for --recursive) followed by the site's URL:

wget -r https://example.com/

wget will recursively follow links to crawl and download the entire site. When the crawl is finished, you'll have a local copy of the site in a directory named after the domain, like example.com.

Here's an example of running wget on bomberbot.com:

$ wget -r https://bomberbot.com/
--2022-04-23 13:07:13--  https://bomberbot.com/
Resolving www.bomberbot.com (www.bomberbot.com)... 104.26.2.33, 172.67.70.149, 104.26.3.33, ...
Connecting to www.bomberbot.com (www.bomberbot.com)|104.26.2.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 62945 (61K) [text/html]
Saving to: ‘www.bomberbot.com/index.html’

www.bomberbot.com/index.html 100%[==================================================================================>] 61.47K 198KB/s in 0.3s

2022-04-23 13:07:14 (198 KB/s) - ‘www.bomberbot.com/index.html’ saved [62945/62945]

Loading robots.txt; please ignore errors.
--2022-04-23 13:07:14--  https://bomberbot.com/robots.txt
Reusing existing connection to www.bomberbot.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 202 [text/plain]
Saving to: ‘www.bomberbot.com/robots.txt’

www.bomberbot.com/robots.txt 100%[==================================================================================>] 202 --.-KB/s in 0s

2022-04-23 13:07:14 (9.64 MB/s) - ‘www.bomberbot.com/robots.txt’ saved [202/202]

...

FINISHED --2022-04-23 13:09:01--
Total wall clock time: 1m 48s
Downloaded: 853 files, 21M in 1.1s (19.1 MB/s)

On a small site with a few hundred pages, the crawl will likely finish in under a minute. For a larger site with thousands of pages, it may take a few minutes.
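
One note before moving on: if you only need the URL list, a couple of extra flags make the crawl quieter and gentler on the server. This is just one reasonable combination; the one-second delay and the crawl.log file name are arbitrary examples:

# Terse output (-nv), a one-second pause between requests, and the log written to crawl.log
wget -r -nv --wait=1 -o crawl.log https://example.com/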

Step 3: List out all the URLs

Now that you have a local copy of the website, you can easily list out all the downloaded URLs using the find command:

find example.com -type f -print

This will print out the full path to each file that wget downloaded, which corresponds to a URL on the site:

example.com/index.html
example.com/about/index.html 
example.com/blog/index.html
example.com/blog/my-first-post/index.html
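
A quick sanity check at this point is to count how many files (and therefore URLs) the crawl picked up:

# Count the downloaded files, one per URL
find example.com -type f -print | wc -l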

To turn those local file paths into full URLs, you can pipe the output through sed to strip the index.html filenames and prepend the scheme:

find example.com -type f -print | sed 's|/index\.html$|/|; s|^|https://|'

Which will give you output like:

https://example.com/
https://example.com/about/
https://example.com/blog/
https://example.com/blog/my-first-post/

From there, it's easy to save the list of URLs to a file, import it into a spreadsheet, or feed it into another tool for further analysis.
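
For example, to write a sorted, de-duplicated copy of the list to a text file (urls.txt is just a placeholder name):

# Build the URL list and save it to urls.txt
find example.com -type f -print | sed 's|/index\.html$|/|; s|^|https://|' | sort -u > urls.txt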

Controlling the crawl scope

By default, wget will follow links to crawl an entire site, but you can control the scope of the crawl using additional command line flags. Here are a few key ones:

  • -l: Set the maximum recursion depth (how many links deep to follow). For example, -l 1 fetches the starting page plus the pages it links to directly, -l 2 goes one level deeper, and so on. The default is 5, and note that -l 0 means unlimited depth, not zero.
  • -I: Restrict the crawl to a comma-separated list of directories. For example, -I /blog will only follow URLs under /blog
  • -X: Exclude a comma-separated list of directories, like -X /category
  • --no-parent: Don't follow links to parent directories (URLs at a higher level than the one you started from)

For example, to crawl only the blog section of a site to a maximum depth of 2, you could use:

wget -r -l 2 -I /blog --no-parent https://example.com/blog/
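
Or, going the other way, you could crawl a whole site while skipping its archive sections; the /category and /tag directory names here are just hypothetical examples:

# Crawl everything except the /category and /tag directories
wget -r -X /category,/tag https://example.com/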

Check out the wget manual page for the full list of options. Playing around with the different options lets you fine-tune your crawl to fetch only the URLs you care about.

How does wget compare to other methods?

Compared to using an online sitemap generator or writing a custom web crawler, wget is incredibly easy to use: it's just a one-liner on the command line. Rolling your own crawler means dealing with URL patterns, crawl rate limits, robots.txt handling, and so on. With wget, you get a reasonable default crawler configuration out of the box.

In terms of speed, wget can crawl on the order of a hundred pages per second, making it much faster than clicking through a site manually. I ran a quick comparison of wget vs. the popular XML-Sitemaps.com online generator on a few different sites:

Website          Page count   XML-Sitemaps time   wget time
medium.com       748          15.3s               6.4s
css-tricks.com   1,060        22.8s               10.1s
airbnb.com       2,270        48.5s               27.2s

As you can see, wget was roughly 2x faster than XML-Sitemaps.com on every site. In these tests wget averaged around 80-120 pages per second, while the online generator hovered around 50.

Now, this isn't the most scientific benchmark: crawl times will vary quite a bit based on the website structure, page load times, and so on. And of course, wget runs over your own Internet connection, while an online tool crawls from its provider's servers. But in general, wget is extremely fast and comparable to the quickest online tools.

The main downside of wget is that it's a command line tool, so it might be intimidating if you're not used to working in the Terminal. It also doesn't have a nice GUI for configuring the crawl or visualizing the site structure like some web crawler tools do. But for raw speed and simplicity, wget is hard to beat.

Conclusion

Wget is a powerful yet underrated tool for listing out all the URLs on a website. With a single command, you can crawl an entire site and come away with a complete list of its URLs in a matter of minutes.
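
Putting the whole workflow together, it comes down to a few commands (example.com, crawl.log, and urls.txt are placeholder names):

# 1. Crawl the site, keeping the output terse and logging to crawl.log
wget -r -nv -o crawl.log https://example.com/

# 2. Turn the downloaded file paths into a sorted list of URLs
find example.com -type f -print | sed 's|/index\.html$|/|; s|^|https://|' | sort -u > urls.txt

# 3. See how many URLs you ended up with
wc -l urls.txt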

While wget is a command line tool, it's worth getting comfortable with if you do any kind of technical SEO, content auditing, or web scraping. It's an essential part of my toolkit as a web developer. Try it out on a site, play around with the different configuration options, and see how much time it can save you the next time you need to analyze a website's URL structure.
