
Web Crawler

A web crawler, also known as a spider or a search-bot (short for search-robot), is a program used by search engines to collect data for their search results.

Behind the scenes of search engines

When you use Google, Yahoo, or any other search engine, it doesn't go looking for sites "out there" on the Internet. It searches through its own private database, which was built in advance and contains site names, addresses, keywords, etc. The more complete the database, the more search results the engine can return for your query.

With over 25 billion pages on the Internet (as of 2009), it's impossible for people to visit each site and collect information about it manually. Instead, search engines use crawler software. The crawler simulates a regular user: it requests a page on your site, reads the content, and catalogs it by keywords for the search engine's use. If the page contains links to other pages, the crawler may (or may not) follow them and repeat the process.
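The request, read, catalog, and follow-links loop described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the PAGES dictionary is a made-up stand-in for real HTTP requests, and the names PageParser and crawl are hypothetical.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://example.com/": '<p>Welcome to our bakery</p><a href="/about.html">About</a>',
    "http://example.com/about.html": '<p>We bake fresh bread daily</p><a href="/">Home</a>',
}

class PageParser(HTMLParser):
    """Collects the text and the link targets of one HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text.append(data)

def crawl(start_url, fetch, max_pages=100):
    """Visit pages breadth-first and return a keyword -> set-of-URLs catalog."""
    index = {}
    seen = {start_url}
    queue = deque([start_url])
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        html = fetch(url)              # "asks for a page"
        if html is None:
            continue
        visited += 1
        parser = PageParser()
        parser.feed(html)              # "reads the content"
        parser.close()
        words = re.findall(r"[a-z]+", " ".join(parser.text).lower())
        for word in words:             # "catalogs it by keywords"
            index.setdefault(word, set()).add(url)
        for href in parser.links:      # follows links it has not seen yet
            target = urljoin(url, href)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return index

index = crawl("http://example.com/", PAGES.get)
print(index["bread"])
```

A real crawler would replace PAGES.get with an HTTP client, and would typically also limit its request rate and respect the site's crawling rules.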

Getting your site crawled

If you want the pages of your website to appear in a search engine's results, they must first be visited by that engine's crawler. There are two main ways to achieve that:

 

Preventing your site from being crawled

Sometimes you don't want search engines to catalog your site content, or specific pages on your site. You can control most crawler behavior by creating a special file with a list of commands. These commands can ask a crawler not to visit a certain page, directory, etc. Write the list of commands according to your needs, name the file robots.txt, and upload it to your site. Most search-bots automatically look for a robots.txt file and obey its instructions.
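As a rough illustration, the sketch below holds a small robots.txt file in a Python string and uses the standard library's urllib.robotparser module to check which pages a well-behaved crawler would be allowed to visit. The paths, the example.com address, the bot name "ExampleBot", and the helper allowed are all invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: these rules apply to every crawler ("*").
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/secret.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(path, agent="ExampleBot"):
    """Return True if the rules above let this crawler request the path."""
    return parser.can_fetch(agent, "http://example.com" + path)

print(allowed("/index.html"))          # not covered by any Disallow rule
print(allowed("/private/data.html"))   # inside the disallowed directory
print(allowed("/drafts/secret.html"))  # the single disallowed page
```

Note that robots.txt is a request, not a lock: it only affects crawlers that choose to read and honor it.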

Crawlers and site traffic

Since crawlers request pages from your site and mostly behave like normal users, most traffic monitoring programs count their activity as visits. When you track the number of visitors or page hits on your site, for example, take care to count only human traffic (some statistics programs make this distinction for you).
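One common, though imperfect, way to separate the two kinds of traffic is to look at the User-Agent string each visitor sends. The sketch below assumes hits are available as (path, user agent) pairs. The sample data, the BOT_MARKERS list, and the is_crawler helper are made up for illustration, and a crawler that disguises its User-Agent would slip through.

```python
# Substrings that often appear in crawler User-Agent strings (an assumed,
# incomplete list; real statistics programs use far longer ones).
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")

def is_crawler(user_agent):
    """Heuristic check: does the User-Agent look like a search-bot?"""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)

# Hypothetical page hits as (requested path, User-Agent) pairs.
hits = [
    ("/index.html", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/115.0"),
    ("/index.html", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"),
    ("/about.html", "ExampleSpider/1.0"),
    ("/about.html", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) Safari/605.1.15"),
]

human_hits = [hit for hit in hits if not is_crawler(hit[1])]
print(len(hits), "total hits,", len(human_hits), "from human visitors")
```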

 
