Tuesday, February 23, 2021

Mirror website with wget and httrack

Several open source tools can mirror websites, but personally, I find wget and httrack the easiest to use.
 
wget has a mirror option, -m (short for --mirror); simply pass it along with the URL of the website to mirror.
wget -m http://mysite.net
If you want to ignore the robots.txt file, use the --execute robots=off option.
wget -m --execute robots=off http://mysite.net
 
httrack is another great tool which runs from the command line. Just executing httrack with the URL is enough.
httrack http://mysite.net
If you want to save the site at a specific path, use the -O option.
httrack http://mysite.net -O /home/user/mysite_mirror
Similarly, you can ignore robots.txt using the -s0 option.
httrack http://mysite.net -s0
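The -O and -s0 options can also be combined in a single call. Here is a minimal sketch that does so; mirror_site is a hypothetical wrapper name of my own, not something httrack provides, and it uses only the two options described above.

```shell
# mirror_site is a hypothetical helper, not an httrack feature.
# Usage: mirror_site URL DEST_DIR
mirror_site() {
    if [ "$#" -ne 2 ]; then
        echo "usage: mirror_site URL DEST_DIR" >&2
        return 1
    fi
    # -O sets the output path; -s0 tells httrack to ignore robots.txt.
    httrack "$1" -O "$2" -s0
}
```

For example, mirror_site http://mysite.net /home/user/mysite_mirror should behave like running the two earlier httrack commands as one.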

Ignoring robots.txt lets these crawling programs retrieve files that the site's robots.txt would otherwise exclude. However, this may result in high network traffic, so use it at your own discretion.
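To limit the traffic a mirror generates, the download can be throttled. The following is a minimal sketch, assuming GNU wget: polite_mirror is a hypothetical wrapper name, and the 1-second wait and 200k rate limit are arbitrary values picked for illustration, so adjust them to suit the site.

```shell
# polite_mirror is a hypothetical helper, not part of wget.
# Usage: polite_mirror URL
polite_mirror() {
    # --wait=1           pause about 1 second between requests
    # --random-wait      vary that pause so requests are less regular
    # --limit-rate=200k  cap the download speed at about 200 KB/s
    wget -m --wait=1 --random-wait --limit-rate=200k "$1"
}
```

Calling polite_mirror http://mysite.net then mirrors the site as before, only more slowly.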

By the way, httrack also has a GUI called webhttrack. On Ubuntu systems, you can install it with
sudo apt install webhttrack
which will install both httrack and the GUI.
