Web Crawler
A web crawler, also known as a spider or a search bot (short for "search robot"), is a program used by search engines to collect data for their search results.
Behind the scenes of search engines
When you use Google, Yahoo, or any other search engine, it doesn't go looking for sites "out there" on the Internet. It searches its own private database, which was built in advance and contains site names, addresses, keywords, and so on. The more complete the database, the more search results the engine can return for your query.
With over 25 billion pages on the Internet (as of 2009), it's impossible to visit each site and manually collect information about it. Instead, search engines use crawler software. The crawler simulates a regular user: it requests a page from your site, reads the content, and catalogs it by keywords for the search engine's use. If the page contains links to other pages, the crawler may (or may not) follow them and repeat the process.
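The request, read, catalog, and follow-links cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular search engine works: the PAGES dictionary is a hypothetical stand-in for real HTTP requests, and the names LinkAndTextParser and crawl are invented for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
import re

# Hypothetical in-memory "web" used instead of real network requests.
PAGES = {
    "http://example.com/": '<a href="/about">About</a> Welcome to our crawler demo',
    "http://example.com/about": '<a href="/">Home</a> About spiders and crawler software',
}


class LinkAndTextParser(HTMLParser):
    """Collects link targets and lowercase words from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(re.findall(r"[a-z]+", data.lower()))


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; returns a keyword -> set of URLs index."""
    index = {}
    queue = deque([start_url])
    seen = {start_url}
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        html = fetch(url)  # "asks for a page"
        if html is None:
            continue
        visited += 1
        parser = LinkAndTextParser()
        parser.feed(html)  # "reads the content"
        parser.close()     # flush any text still buffered by the parser
        for word in parser.words:  # "catalogs it by keywords"
            index.setdefault(word, set()).add(url)
        for href in parser.links:  # follows links to other pages
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


index = crawl("http://example.com/", PAGES.get)
print(sorted(index["crawler"]))
```

A real crawler would replace PAGES.get with an HTTP client and would add politeness delays, error handling, and a far larger index, but the loop has the same shape.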
Getting your site crawled
If you want the pages of your website to appear in a search engine's results, they must first be visited by that engine's crawler. There are two main ways to achieve that:
- Submit your site directly to the search engine. By giving the search engine your domain name, you invite it to crawl your site. Depending on the engine's software, it will gradually discover other pages linked to your home page.
Additionally, some search engines allow you to submit a site map: a file listing all available pages on your site. This may speed up the crawling and indexing of your pages.
- Have other well-crawled sites link to your site. Not only will crawlers reach your site through those links, but being linked to also earns you "search engine points": your pages will likely rank higher, which means they will appear nearer the top of search results lists, making your site more visible.
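As an illustration of the site map mentioned in the first option, the sketch below builds a minimal one in Python. It assumes the widely used Sitemaps XML format; the function name build_sitemap and the example.com URLs are invented for this example, and each search engine documents which formats it actually accepts.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol (assumed format for this sketch).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal sitemap document listing the given page URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap(["http://example.com/", "http://example.com/about"])
print(sitemap_xml)
```

The resulting text would typically be saved to a file on the site and its address given to the search engine.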
Preventing your site from being crawled
Sometimes you don't want search engines to catalog your site's content, or certain pages on your site. You can control most crawler behavior by creating a special file with a list of commands. These commands can tell a crawler not to visit a certain page, directory, etc. Write the commands according to your needs, name the file robots.txt, and upload it to the root directory of your site. Most search bots automatically look for a robots.txt file and obey its instructions.
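To make this concrete, here is a small sketch of what such a file might contain and how a well-behaved crawler could consult it, using Python's standard urllib.robotparser module. The rules, the paths, and the crawler name ExampleBot are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: applies to all crawlers ("*") and
# blocks one directory and one individual page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/secret.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="ExampleBot"):
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(allowed("http://example.com/index.html"))         # True
print(allowed("http://example.com/private/data.html"))  # False
```

Note that the file only states the site owner's wishes; a crawler that chooses to ignore it is not technically prevented from requesting the pages.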
Crawlers and site traffic
Since crawlers request pages from your site and mostly act like a normal user, their activity is counted by most traffic-monitoring programs. When you track the number of visitors or page hits on your site, for example, take care to count only human traffic (some statistics programs make this distinction for you).
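One common, though imperfect, way to separate crawler hits from human ones is to inspect the User-Agent string that accompanies each request. The sketch below is a rough heuristic under stated assumptions: the marker list, the function name is_probable_bot, and the sample hits are all illustrative, and crawlers that do not identify themselves will slip through.

```python
# Substrings that often appear in crawler User-Agent strings (illustrative,
# not exhaustive).
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")


def is_probable_bot(user_agent):
    """Guess whether a request came from a crawler, based on its User-Agent."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


# Hypothetical log entries: (requested path, User-Agent string).
hits = [
    ("/", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/120.0"),
    ("/", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"),
    ("/about", "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"),
    ("/about", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15"),
]

human_hits = [hit for hit in hits if not is_probable_bot(hit[1])]
print(len(human_hits))  # 2
```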
