
What is a crawler? The functions of the Yandex and Google search robots

Every day a great deal of new material appears on the Internet: websites are created, old web pages are updated, photos and video files are uploaded. Without invisible search robots, it would be impossible to find any of these documents on the World Wide Web, and no alternative to such robotic programs exists at present. What is a search robot, why is it needed, and how does it work?

What is a crawler

A search robot is an automated program used by search engines that can visit millions of web pages, quickly navigating the Internet without operator intervention. Bots constantly scan the World Wide Web, finding new Internet pages and regularly revisiting those already indexed. Other names for search robots are spiders, crawlers, and bots.

Why are search robots needed?

The main function that search robots perform is the indexing of web pages, along with the texts, images, audio, and video files on them. Bots check links, mirror sites (copies), and updates. Robots also check HTML code for compliance with the standards of the World Wide Web Consortium (W3C), which develops and implements technological standards for the World Wide Web.

What is indexing and why is it needed?

Indexing is, in essence, the process by which a search robot visits a particular web page. The program scans the texts, images, videos, and outbound links posted on the site, after which the page appears in the search results. In some cases a site cannot be crawled automatically; it can then be added to the search engine manually by the webmaster. This typically happens when no external links point to a particular (often only recently created) page.
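
To make the idea concrete, the sketch below (an illustration only, not any search engine's actual implementation) shows the kind of structure indexing produces: each word found in scanned page text is mapped to the URLs on which it occurs, which is what later allows a page to appear in the search results. The page addresses and texts are hypothetical.

    from collections import defaultdict

    def build_index(pages):
        """Build a tiny inverted index: word -> set of URLs containing it."""
        index = defaultdict(set)
        for url, text in pages.items():
            for word in text.lower().split():
                index[word].add(url)
        return index

    # Hypothetical pages that a robot has already visited and scanned.
    pages = {
        "https://example.com/": "Search robots index web pages",
        "https://example.com/about": "Robots follow links between pages",
    }

    index = build_index(pages)
    print(sorted(index["robots"]))  # both URLs contain the word "robots"
    print(sorted(index["links"]))   # only the /about page mentions "links"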

How search crawlers work

Each search engine has its own bot, and Google's search robot can differ noticeably in how it operates from the analogous Yandex program or the bots of other systems.

In general, a robot works as follows: the program "arrives" at a site via external links and, starting from the main page, "reads" the web resource (including service data that the user does not see). The bot can move between the pages of one site as well as move on to other sites.
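
As a rough illustration of this principle, here is a minimal sketch of a crawl loop in Python (a toy under simplifying assumptions, not Googlebot's or Yandex's actual code): it starts from a single seed address, fetches each page, extracts the links, and queues any new addresses it finds, whether on the same site or on others.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collect the href targets of all <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_url, max_pages=10):
        """Breadth-first crawl: visit pages, follow links, stop after max_pages."""
        queue = deque([seed_url])
        visited = set()
        while queue and len(visited) < max_pages:
            url = queue.popleft()
            if url in visited:
                continue
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
            except Exception:
                continue  # skip pages that cannot be fetched
            visited.add(url)
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                absolute = urljoin(url, href)  # resolve relative links
                if absolute.startswith("http") and absolute not in visited:
                    queue.append(absolute)     # same site or a different one
        return visited

    # Hypothetical starting point; a real robot begins with many seed sites.
    print(crawl("https://example.com/"))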

How does the program choose which site to index? Most often a spider's "journey" begins with news sites or with large resources, directories, and aggregators that have a large link mass. The crawler continuously scans pages one by one; the speed and order of indexing are affected by the following factors:

  • Internal: interlinking (internal links between pages of the same resource), the size of the site, the correctness of the code, usability for visitors, and so on;
  • External: the total link mass pointing to the site.

The first thing a search robot looks for on any site is the robots.txt file. Further indexing of the resource is carried out on the basis of the information obtained from this document. The file contains precise instructions for the spiders, which makes it possible to increase the chances of pages being visited by crawlers and, consequently, to get the site into the Yandex or Google results as early as possible.
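
To show what "instructions for the spiders" look like from a robot's side, here is a small sketch using Python's standard robotparser module; the site address and the user-agent names are hypothetical, and a real crawler applies these rules as one step in a much larger pipeline.

    from urllib import robotparser

    # Fetch and parse the robots.txt of a (hypothetical) site.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # A well-behaved robot asks before fetching each page.
    for agent in ("Googlebot", "Yandex", "*"):
        allowed = rp.can_fetch(agent, "https://example.com/private/page.html")
        print(agent, "may crawl the page:", allowed)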

Programs similar to search robots

The term "search robot" is often confused with intelligent, user, or autonomous agents, "ants," or "worms." Significant differences exist only in comparison with agents; the other terms denote very similar types of robots.

So, agents can be:

  • Intelligent: programs that move from site to site and decide independently what to do next; they are not widespread on the Internet;
  • Autonomous: these agents help the user choose a product, search for something, or fill out forms; they are so-called filters that have little to do with network programs;
  • User: programs that ease the user's interaction with the World Wide Web, such as browsers (e.g. Opera, IE, Google Chrome, Firefox), instant messengers (Viber, Telegram), or email clients (MS Outlook or Qualcomm Eudora).

"Ants" and "worms" are more similar to the search "spiders". The former form a network among themselves and interact like the real ant colony, "worms" are self-reproducing, otherwise they act the same way as the standard search robot.

Varieties of search robots

There are many types of search robots. Depending on the purpose of the program, they can be:

  • "Mirror" - they look through duplicate sites.
  • Mobile - are aimed at mobile versions of Internet pages.
  • Fast - fix new information promptly, viewing the latest updates.
  • Links - index links, count their number.
  • Indexers of different types of content - separate programs for text, audio and video recordings, images.
  • "Spyware" - look for pages that are not yet displayed in the search engine.
  • "Woodpeckers" - periodically visit sites to check their relevance and efficiency.
  • National - browse web resources located on the domains of one country (for example, .ru, .kz or .ua).
  • Global - all national sites are indexed.

Robots of major search engines

The major search engines also have robots of their own. In theory their functionality may vary significantly, but in practice the programs are almost identical. The main differences in how the robots of the two main search engines index web pages are as follows:

  • Strictness of checking. The Yandex search robot is believed to be somewhat stricter in assessing a site's compliance with World Wide Web standards.
  • Site coverage. Google's crawler indexes the entire site (including media content), whereas Yandex may view pages selectively.
  • Speed of indexing new pages. Google adds a new resource to the search results within a few days; with Yandex the process can take two weeks or more.
  • Frequency of re-indexing. The Yandex search robot checks for updates several times a week, while Google does so once every 14 days.

The Internet, of course, is not limited to two search engines. Other search engines have their own robots that follow their own indexing parameters. In addition, there are several "spiders" developed not by the large search services but by individual teams or webmasters.

Common misconceptions

Contrary to popular belief, "spiders" do not process the information they collect. The program only scans and saves web pages; further processing is done by entirely different robots.

Also, many users believe that search robots have a negative impact and are "harmful" to the Internet. Indeed, individual versions of spiders can significantly overload servers. There is also the human factor: the webmaster who created the program can make mistakes in the robot's settings. Nevertheless, most existing programs are well designed and professionally managed, and any problems that arise are quickly eliminated.

How to manage indexing

Search robots are automatic programs, but the indexing process can be partially controlled by the webmaster. External and internal optimization of the resource helps greatly here. In addition, a new site can be added to a search engine manually: the large search services have special forms for registering web pages.
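
One common way to help robots discover pages that have no external links yet is to publish a list of URLs as a sitemap.xml file and submit it through those registration forms. The sitemap protocol is not covered in this article, so treat the following as an assumption-based sketch with hypothetical addresses.

    import xml.etree.ElementTree as ET

    # Hypothetical list of pages the webmaster wants robots to discover.
    urls = [
        "https://example.com/",
        "https://example.com/new-article",
    ]

    # Build a minimal sitemap in the standard sitemaps.org format.
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for address in urls:
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = address

    ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
    print(open("sitemap.xml", encoding="utf-8").read())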
