ewiget
SEO (search engine optimization) robots.txt - 2008/02/04 10:15
The robots.txt file is often overlooked in web site development, yet it is probably one of the most important files your web site can have. It tells search engine crawlers which parts of your site they may and may not crawl and index. Many people assume they want a search engine to index everything. Once they think about what is actually sitting on their server, I would expect most of them do not.
For instance, have you ever placed files inside a directory as temporary storage? What about automatic database backups? Did you ever place a blank index.htm file in a folder thinking search engines could not access the other files there? Do you use a CMS on your web site like Drupal, Mambo, Joomla, PHP-Nuke, etc., or another web site feature that has an administration section? Most people have done this or have these things on their web sites. Search engines, unless told otherwise, will index anything they can reach on your web site, including personal files stored in hidden directories, backup files, database SQL exports, etc.
The way to fix that is the robots.txt file, a plain text file placed in the root of your web site, and it is very simple to write. Remember, the robots.txt file simply tells well-behaved search engines and crawlers what they can and cannot index. This is helpful for keeping them out of folders that you do not want indexed, like the admin or stats folder, or away from other content that they should not index.
The robots.txt file uses only a few fields. Here is a list of what you can include and what each one means:
1 - User-agent: names the robot that the rules which follow apply to, or “*” for all robots (explained further in the examples)
2 - Disallow: names a file or folder to leave out of the crawl, one path per Disallow line
3 - #: the number sign starts a comment
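If you want to check a set of rules before uploading them, here is a minimal sketch using Python's standard urllib.robotparser module. The rules, the robot name "examplebot", and the helper is_allowed are made up for illustration. They are assumptions, not something from a real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules written only for this illustration.
ROBOTS_TXT = """\
# keep every crawler out of the stats folder
User-agent: *
Disallow: /stats/
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under the given rules."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for path in ("/stats/report.html", "/index.html"):
        print(path, is_allowed(ROBOTS_TXT, "examplebot", path))
```

The comment line is ignored by the parser, the "*" record applies to any robot name, and only paths under /stats/ come back as not allowed.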
EXAMPLES
example 1 - this allows all robots to index everything
| Code: |
User-agent: *
Disallow:
|
example 2 - this example blocks all robots and spiders from indexing the entire cgi-bin directory
| Code: |
User-agent: *
Disallow: /cgi-bin/
|
example 3 - this example lets googlebot index everything, while all other robots and spiders are blocked from indexing the admin.php file, the cgi-bin directory, the admin directory, and the stats directory
| Code: |
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /admin.php
Disallow: /cgi-bin/
Disallow: /admin/
Disallow: /stats/
|
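To see how the two records in example 3 are picked per robot, here is a rough sketch, again with Python's standard urllib.robotparser. The robot name "otherbot" and the test paths are assumptions chosen for the demonstration.

```python
from urllib.robotparser import RobotFileParser

# The rules from example 3, one record per robot group.
EXAMPLE_3 = """\
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /admin.php
Disallow: /cgi-bin/
Disallow: /admin/
Disallow: /stats/
"""


def can_fetch(user_agent: str, path: str) -> bool:
    """Check one path for one robot against the example 3 rules."""
    parser = RobotFileParser()
    parser.parse(EXAMPLE_3.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "otherbot" is a made-up name standing in for any robot except googlebot.
    for agent in ("googlebot", "otherbot"):
        for path in ("/admin/login.php", "/index.html"):
            print(agent, path, can_fetch(agent, path))
```

A robot uses the record that names it and falls back to the "*" record only when no record matches its name, so googlebot never sees the Disallow lines in the second group.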
example 4 - this example has no trailing /, so the rule is matched as a prefix, as if it ended in a wildcard *. It blocks any URL whose path starts with /admin, including /admin, /administrator, /administrator/, /adminibator, /admin/, /admin.html, /admin.php
| Code: |
User-agent: *
Disallow: /admin
|
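The difference a trailing / makes can be sketched in a few lines of Python. This is a simplified model of the prefix rule only (no wildcards, no URL encoding), and the function name blocked_by and the sample paths are assumptions for illustration.

```python
def blocked_by(path: str, disallow_rules: list[str]) -> bool:
    """Return True if path starts with any non-empty Disallow value."""
    return any(rule and path.startswith(rule) for rule in disallow_rules)


if __name__ == "__main__":
    paths = ["/admin", "/administrator/", "/admin.php", "/admin/users", "/blog/admin"]
    for rule in ("/admin", "/admin/"):
        print("Disallow:", rule)
        for path in paths:
            print("   ", path, "blocked" if blocked_by(path, [rule]) else "allowed")
```

With "/admin" everything that begins with those six characters is blocked, while "/admin/" only blocks what is inside that directory. Note that /blog/admin is allowed in both cases because the match is anchored at the start of the path.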
There are also various suggestions for some content management systems and those running dynamic web sites (database backends).
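example 5 - this is an example WordPress robots.txt (it may need to be adjusted based on the root of your WordPress installation). The * wildcards inside the paths are honored by major crawlers such as googlebot, but some robots may ignore them.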
| Code: |
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /trackback
Disallow: /comments
Disallow: /category/*/*
Disallow: */trackback
Disallow: */comments
|
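As far as I know, Python's built-in urllib.robotparser treats each Disallow value as a plain prefix, so it is not a good way to test the * lines above. Here is a small sketch of how a wildcard-aware robot might match them instead. The helper names and sample URLs are assumptions, and real crawlers may differ in the details.

```python
import re

# A subset of the WordPress rules above: one plain prefix and three wildcard rules.
RULES = ["/wp-admin", "/category/*/*", "*/trackback", "*/comments"]


def rule_to_regex(rule: str) -> re.Pattern:
    """Turn a Disallow value into a regex where each * matches any run of characters."""
    return re.compile(".*".join(re.escape(part) for part in rule.split("*")))


def is_blocked(path: str, rules: list[str]) -> bool:
    """Return True if any non-empty rule matches from the start of path."""
    return any(bool(rule) and rule_to_regex(rule).match(path) is not None for rule in rules)


if __name__ == "__main__":
    samples = [
        "/wp-admin/options.php",
        "/category/news/2008",
        "/category/news",
        "/2008/02/my-post/trackback",
        "/2008/02/my-post/",
    ]
    for path in samples:
        print(path, "blocked" if is_blocked(path, RULES) else "allowed")
```

Under this model /category/news stays crawlable because the rule needs a second / after the category name, while anything ending in /trackback or /comments is blocked at any depth.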
example 6 - PHP-Nuke - since all PHP-Nuke functions are handled through the modules.php and admin.php files, all other directories should be excluded. Since the admin.php functions are for administrators only, that file should also be specifically excluded.
| Code: |
User-agent: *
Disallow: /admin.php
Disallow: /GoogleTap/
Disallow: /abuse/
Disallow: /admin/
Disallow: /backup/
Disallow: /blocks/
Disallow: /cgi-bin/
Disallow: /db/
Disallow: /depository/
Disallow: /download/
Disallow: /downloads/
Disallow: /images/
Disallow: /import/
Disallow: /includes/
Disallow: /jpcache/
Disallow: /kernel/
Disallow: /language/
Disallow: /modules/
Disallow: /themes/
|
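Typing out a long list like this by hand is easy to get wrong. One possible shortcut is to generate the Disallow lines from the folders that actually exist in your web root. The sketch below assumes a hypothetical function build_robots_txt and a keep list of folder names you still want crawled. Review the output before using it.

```python
import tempfile
from pathlib import Path


def build_robots_txt(web_root: str, keep: tuple[str, ...] = ()) -> str:
    """Emit a Disallow line for every top-level directory not named in keep."""
    lines = ["User-agent: *"]
    for entry in sorted(Path(web_root).iterdir()):
        if entry.is_dir() and entry.name not in keep:
            lines.append(f"Disallow: /{entry.name}/")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Build a throwaway folder layout so the example runs anywhere.
    with tempfile.TemporaryDirectory() as root:
        for name in ("admin", "backup", "images", "public"):
            (Path(root) / name).mkdir()
        (Path(root) / "modules.php").touch()
        print(build_robots_txt(root, keep=("public",)))
```

Files such as admin.php are not picked up by this sketch, so a line like Disallow: /admin.php still has to be added by hand.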
example 7 - this will work for most Mambo and Joomla sites
| Code: |
User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /components/
Disallow: /editor/
Disallow: /help/
Disallow: /includes/
Disallow: /language/
Disallow: /mambots/
Disallow: /media/
Disallow: /modules/
Disallow: /templates/
Disallow: /installation/
|
Post edited by: ewiget, at: 2008/02/04 10:20
Ed Wiget Technical Support http://www.xtremewebhosts.com |