ewiget
Admin
how to mirror your website using wget - 2005/07/16 11:47
If you have a static web site (one that does not use PHP, ASP, or another server-side language, and is not database driven) and you wish to mirror it (make a copy of its content), you can use an FTP program, or a simpler method is to use wget, which works on most operating systems, including Windows.
Introduction
GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive command-line tool (available on most 'nix systems), so it may easily be called from scripts, cron jobs, terminals without X support, etc.
Wget has many features to make retrieving large files or mirroring entire web or FTP sites easy, including:
* Can resume aborted downloads, using REST and RANGE
* Can use filename wild cards and recursively mirror directories
* NLS-based message files for many different languages
* Optionally converts absolute links in downloaded documents to relative, so that downloaded documents may link to each other locally
* Runs on most UNIX-like operating systems as well as Microsoft Windows
* Supports HTTP and SOCKS proxies
* Supports HTTP cookies
* Supports persistent HTTP connections
* Unattended / background operation
* Uses local file timestamps to determine whether documents need to be re-downloaded when mirroring

GNU wget is distributed under the GNU General Public License.
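As a rough sketch of the resume and timestamp features listed above, the two shell functions below wrap the relevant invocations. The function names and the example URL are made up for illustration; --continue (-c) and --timestamping (-N) are standard wget options.

```shell
# Hypothetical helper names; only the wget options are real.

# Resume a partially downloaded file instead of starting over.
resume_download() {
    wget --continue "$1"
}

# Re-download a file only if the remote copy is newer than the local one.
refresh_if_newer() {
    wget --timestamping "$1"
}

# Example usage (placeholder URL, commented out so nothing runs on paste):
# resume_download 'http://www.example.com/big_file.iso'
# refresh_if_newer 'http://www.example.com/index.html'
```

Both functions simply pass the URL through to wget, so any other options can be added on the same line.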
The 'nix version of wget is available from GNU.org FTP
The Windows version of wget is available from this web page: wget for windows
If you prefer not to use command-line programs on Windows, have a look at this graphical user interface for wget: wGetGUI
Mac users: wget is included in some versions of Mac OS 10.x up to 10.2. Beginning with OS X 10.3, Apple has included curl, which has some features similar to wget. If you wish to use wget on an OS X version later than 10.2, you can find a binary download at StatusQ, or use the direct download link: http://www.statusq.org/images/wget.zip
NOTE: Xtreme Web Hosts did not compile this binary download, nor were we able to test it. If it breaks your system, we cannot be held liable or responsible. The download link is offered only as a convenience to our visitors. You may find an alternative to the download link by using a search engine such as Google.
An excellent page that provides additional information, links, program descriptions, tips, tricks, etc. related to wget is Lachlan Cranswick's web site.
So, above I have provided you with some basic information. Here are the three commands I use from within Gentoo Linux. (You can find out more about my experience with Linux at the Maysville Linux Users Group, MLUG.)
wget --mirror http://www.site_i_want_to_mirror/
The above command line uses the long option --mirror to mirror a web site. The short option is simply -m. Either form will mirror the domain listed as http://www.site_i_want_to_mirror/. It follows all links and downloads all graphics for each linked page, so that the entire domain will display in a browser on your local computer system. It also keeps the directory structure intact. It is very useful when a web site is moving over to our hosting system. Mirroring a web site can take a long time if there is a lot of content and images.
One word of caution: if the web site is database driven, wget will not download the database in a format that can easily be restored, although it will download most of the database content and save it as standard HTML. For example, if the domain has a bulletin board like phpBB, it will download each post as a separate HTML file. If the web site uses PHP code or other server-side includes, wget does not download the code itself; it downloads each page as it would appear in a web browser's "view source" or similar.
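If the goal is a copy you can browse offline, a few extra options are commonly combined with --mirror. The following is only a sketch, not the exact command I use: the function name is hypothetical and the one-second wait is an arbitrary choice, but every option shown is a standard wget option.

```shell
# Hypothetical wrapper; pass the site's starting URL as the only argument.
mirror_for_local_browsing() {
    # --mirror           recursion plus timestamping, with unlimited depth
    # --convert-links    rewrite links so the local copy works offline
    # --page-requisites  also fetch images and stylesheets each page needs
    # --no-parent        never ascend above the starting directory
    # --wait=1           pause one second between requests (arbitrary value)
    wget --mirror --convert-links --page-requisites --no-parent --wait=1 "$1"
}

# Example usage (placeholder URL, commented out so nothing runs on paste):
# mirror_for_local_browsing 'http://www.site_i_want_to_mirror/'
```

The wait is there to go easy on the server being mirrored; drop it or raise it as the situation requires.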
wget -p http://www.site_i_want_to_mirror/page_to_mirror.html
The above command line uses the short option -p for page requisites; the long option is --page-requisites. This command will download all of the files needed by page_to_mirror.html, including its graphics. It does not follow links on the page; it downloads only that single page and all graphics needed to display it properly on the local computer in the web browser of your choice.
Words of caution: if the web site uses PHP code or other server-side includes, wget does not download the PHP code itself; it downloads the page as it would appear in a web browser's "view source" or similar. Another note concerning server-side pages such as PHP or ASP: if the page's URL has options in it, as in the next example, you will only get the default main page. See the next example for how to get the actual content.
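To keep a single saved page and its graphics together in a folder of your choosing, -p can be combined with --directory-prefix (-P) and --convert-links. This is a sketch under assumptions: the function name, the folder name, and the way the arguments are ordered are all hypothetical.

```shell
# Hypothetical helper: $1 = local directory to save into, $2 = page URL.
save_single_page() {
    # --directory-prefix puts everything wget saves under the given directory.
    # --convert-links makes the saved page point at the local copies.
    wget --page-requisites --convert-links --directory-prefix="$1" "$2"
}

# Example usage (placeholder names, commented out so nothing runs on paste):
# save_single_page saved_pages 'http://www.site_i_want_to_mirror/page_to_mirror.html'
```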
wget --page-requisites 'http://www.site_i_want_to_mirror/index.php?option=whatever&secondoption=whatever'
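The above command uses the long option --page-requisites (short option -p) and causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets. You must surround the URL with single quotes in order to get the actual content of these types of pages; otherwise the shell interprets the & itself (it ends the command there and runs it in the background), so wget only receives the part of the URL before it.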
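To see why the quotes matter without touching the network, the small demonstration below prints exactly what the shell hands to a command. The show_args function is a hypothetical helper written for this post, not part of wget.

```shell
# Print each argument the shell actually passes along, one per line.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Quoted: the whole URL, including ? and &, arrives as one argument.
show_args 'http://www.site_i_want_to_mirror/index.php?option=whatever&secondoption=whatever'

# Unquoted, the shell would end the command at the & and run it in the
# background, so the command would only see the URL up to "option=whatever".
```

With the quotes in place this prints the complete URL inside one pair of brackets, which is what wget needs to receive.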
Hope you have found this information useful!
Post edited by: ewiget, at: 2005/07/16 11:52
Ed Wiget Technical Support http://www.xtremewebhosts.com