ewiget
Admin

SEO (search engine optimization) robots.txt - 2008/02/04 10:15

This is an often overlooked file in web site development, yet it is probably one of the most important files your web site can have. The robots.txt file basically tells search engines what they can and cannot index. Most people believe they want a search engine to index everything, but I would assume most people do not actually want that to happen.

For instance, have you ever placed files inside a directory as temporary storage? What about doing automatic database backups? Did you ever place a blank index.htm file in a folder thinking search engines could not access the other files there? Do you use a CMS on your web site like Drupal, Mambo, Joomla, PHP-Nuke, etc., or another web site feature that has an administration section? Most people have done this or have these things on their web sites. Search engines, unless told otherwise, will index everything they can find on your web site, including personal files stored in hidden directories, backup files, database SQL exports, etc.

The way to fix that is with the robots.txt file, and it is very simple to do. Remember, the robots.txt file simply tells search engines and crawlers what they can and cannot index. This helps keep them out of folders that you do not want indexed, like the admin or stats folder.

The robots.txt file uses a couple of fields. Here is a list of what you can include and what each one means:
1 - User-agent: names the robot that the rules following it apply to, or “*” for all robots (explained further in the examples)
2 - Disallow: the files and folders to exclude from the crawl
3 - #: the number sign starts a comment
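
To make the three elements above concrete, here is a minimal Python sketch that splits robots.txt text into field/value pairs and drops comments. The function name `parse_robots_lines` and the sample rules are made up for illustration; real crawlers use their own, more thorough parsers.

```python
def parse_robots_lines(text):
    """Return (field, value) pairs, ignoring comments and blank lines."""
    pairs = []
    for raw in text.splitlines():
        # Everything after a number sign is a comment.
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        pairs.append((field.strip().lower(), value.strip()))
    return pairs


# Hypothetical sample file, not taken from a real site.
SAMPLE = """\
# keep robots out of the stats folder
User-agent: *
Disallow: /stats/   # trailing comment
"""

PAIRS = parse_robots_lines(SAMPLE)
print(PAIRS)
```

Each field sits on its own line, which is also how the file must be laid out on the server.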

EXAMPLES

example 1 - this allows all robots to index everything
Code:

  User-agent: *
  Disallow:



example 2 - this example blocks all robots and spiders from indexing the entire cgi-bin directory
Code:

  User-agent: *
  Disallow: /cgi-bin/
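
If you want to check a rule set like this before uploading it, one option is Python's standard-library `urllib.robotparser`. This is only a sketch: the robot name "ExampleBot" and the example.com URLs are placeholders, and the library approximates rather than guarantees how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch() returns True when the robot may crawl the URL.
cgi_ok = parser.can_fetch("ExampleBot", "http://www.example.com/cgi-bin/form.pl")
index_ok = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
print("cgi-bin allowed:", cgi_ok)
print("index allowed:", index_ok)
```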



example 3 - this example lets Googlebot index everything while all other robots and spiders are blocked from indexing the admin.php file, the cgi-bin directory, the admin directory, and the stats directory
Code:

  User-agent: googlebot
  Disallow:

  User-agent: *
  Disallow: /admin.php
  Disallow: /cgi-bin/
  Disallow: /admin/
  Disallow: /stats/
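
The per-robot behavior in this example can be sanity-checked with Python's standard-library `urllib.robotparser`. Treat it as a rough sketch: "ExampleBot" and the example.com URLs are placeholders, and real crawlers may resolve groups slightly differently than this library does.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /admin.php
Disallow: /cgi-bin/
Disallow: /admin/
Disallow: /stats/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

ADMIN_URL = "http://www.example.com/admin/"

# The googlebot group has an empty Disallow, so nothing is blocked for it.
google_admin = parser.can_fetch("Googlebot", ADMIN_URL)
# Any other robot falls through to the "*" group.
other_admin = parser.can_fetch("ExampleBot", ADMIN_URL)
other_home = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
print(google_admin, other_admin, other_home)
```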



example 4 - this example uses no trailing /, so the rule acts like a wildcard * and will match any url whose path begins with /admin, including /admin, /adminisrator, /administrator/, /adminibator, /admin/, /admin.html, /admin.php
Code:

  User-agent: *
  Disallow: /admin
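
A quick way to see which paths a rule without a trailing slash catches is to run a few through Python's standard-library `urllib.robotparser`. This is a sketch under assumptions: "ExampleBot" and the path `/blog/admin` are made up for illustration, and the library matches on the start of the path only.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /admin"])

PATHS = ["/admin", "/admin/", "/admin.php", "/administrator/", "/blog/admin"]

# Map each path to True (may be crawled) or False (blocked).
RESULTS = {path: parser.can_fetch("ExampleBot", path) for path in PATHS}
for path, allowed in RESULTS.items():
    print(path, "allowed" if allowed else "blocked")
```

Note that `/blog/admin` stays crawlable here, because the rule is compared against the beginning of the path rather than anywhere inside it.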



There are also various suggestions for some content management systems and those running dynamic web sites (database backends).

example 5 - this is an example WordPress robots.txt (it may need to be adjusted based on the root of your WordPress installation)
Code:

  User-agent: *
  Disallow: /cgi-bin
  Disallow: /wp-admin
  Disallow: /wp-includes
  Disallow: /wp-content/plugins
  Disallow: /wp-content/cache
  Disallow: /wp-content/themes
  Disallow: /trackback
  Disallow: /comments
  Disallow: /category/*/*
  Disallow: */trackback
  Disallow: */comments
  Disallow



example 6 - PHP-Nuke - Since all PHP-Nuke functions are handled through the modules.php and admin.php files, all other directories should be excluded. Since the admin.php functions are for administrators only, this file should also be specifically excluded.
Code:

  User-agent: *
  Disallow: /admin.php
  Disallow: /GoogleTap/
  Disallow: /abuse/
  Disallow: /admin/
  Disallow: /backup/
  Disallow: /blocks/
  Disallow: /cgi-bin/
  Disallow: /db/
  Disallow: /depository/
  Disallow: /download/
  Disallow: /downloads/
  Disallow: /images/
  Disallow: /import/
  Disallow: /includes/
  Disallow: /jpcache/
  Disallow: /kernel/
  Disallow: /language/
  Disallow: /modules/
  Disallow: /themes/



example 7 - this will work for most Mambo and Joomla sites
Code:

  User-agent: *
  Disallow: /administrator/
  Disallow: /cache/
  Disallow: /components/
  Disallow: /editor/
  Disallow: /help/
  Disallow: /includes/
  Disallow: /language/
  Disallow: /mambots/
  Disallow: /media/
  Disallow: /modules/
  Disallow: /templates/
  Disallow: /installation/
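
Long lists like these are easy to mistype by hand, so you may prefer to generate the file from a list of directories. Below is a minimal Python sketch of that idea; the helper name `build_robots_txt` is made up for illustration, and the short directory list is just a subset of the example above.

```python
def build_robots_txt(disallowed_paths, user_agent="*"):
    """Build robots.txt text with one Disallow line per path."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallowed_paths]
    return "\n".join(lines) + "\n"


# A few of the Mambo/Joomla directories from the example above.
CMS_DIRS = ["/administrator/", "/cache/", "/installation/"]

ROBOTS_TXT = build_robots_txt(CMS_DIRS)
print(ROBOTS_TXT, end="")
```

The output can be saved as robots.txt in the root of the web site.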



Post edited by: ewiget, at: 2008/02/04 10:20
Ed Wiget
Technical Support
http://www.xtremewebhosts.com