There are several open-source tools for mirroring websites, but personally I think the easiest to use are wget and httrack.
wget has a mirror option, -m, so mirroring a site is simply a matter of passing that option along with the URL of the site to mirror.
wget -m http://mysite.net
If you want to ignore the robots.txt file, use the --execute robots=off option.
wget -m --execute robots=off http://mysite.net
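If the goal is a copy you can browse offline, -m is usually combined with a few more of wget's standard options: --convert-links rewrites links to point at the local copies, --adjust-extension adds .html extensions where needed, and --page-requisites fetches the images and stylesheets each page depends on. A typical invocation might look like this (the URL is just a placeholder):
wget -m --convert-links --adjust-extension --page-requisites http://mysite.net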
httrack is another great tool that runs from the command line. Just executing httrack with the URL is enough.
httrack http://mysite.net
If you want to save the site to a specific path, use the -O option.
httrack http://mysite.net -O /home/user/mysite_mirror
Similarly, you can ignore robots.txt using the -s0 option.
httrack http://mysite.net -s0
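These options can of course be combined; for example, to mirror into a chosen directory while also ignoring robots.txt (the output path here is just an illustration):
httrack http://mysite.net -O /home/user/mysite_mirror -s0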
Ignoring robots.txt lets these crawlers retrieve files that the site would otherwise disallow. However, this may generate heavy traffic on the target server, so use it at your own discretion.
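One way to be gentler on the server is to throttle the crawl. wget's standard --wait and --limit-rate options pause between requests and cap the download bandwidth; the values below are just examples, not recommendations:
wget -m -e robots=off --wait=1 --limit-rate=200k http://mysite.net
httrack has a transfer-rate limit as well (the -A option, which takes a maximum rate in bytes per second, if I remember the flag correctly).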
By the way, httrack also has a GUI called webhttrack. On Ubuntu systems, you can install the entire package by running
sudo apt install webhttrack
which will install both httrack and the GUI.
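Once installed, running the webhttrack command from a terminal should open a step-by-step wizard in your web browser, since the GUI is browser-based (at least that is how the Ubuntu package behaves).
webhttrack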