Web search can be viewed as an information retrieval problem. Compared with searching a database, searching document contents is more difficult because the contents are unstructured. Documents are therefore indexed to make searching easier and less time-consuming.
Each document has objective terms, namely the author’s name, the document URL, and the date of publication. It may also have non-objective terms, known as content terms, that are intended to reflect the information in the document. The effectiveness of a search engine’s indexing is governed by two main parameters:
- Indexing exhaustivity
- Term specificity
Indexing is the process of building a document representation by assigning content descriptors, or terms, to the document.
The quality of indexing is characterized by recall (the ratio of the number of relevant documents retrieved to the total number of relevant documents in the collection) and precision (the ratio of the number of relevant documents retrieved to the total number of documents retrieved). Indexing can be performed either manually or automatically, but the sheer number of Websites makes manual indexing quite impractical.
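The ratios used to judge retrieval quality can be computed directly from two sets of document IDs. The following is a minimal sketch in Python: recall divides the number of relevant documents retrieved by the number of relevant documents, while the companion measure, precision, divides it by the number of documents retrieved. The function name `precision_recall` and the document IDs are illustrative assumptions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: set of document IDs the engine returned
    relevant:  set of document IDs that are actually relevant
    """
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents returned, 5 relevant in the collection.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5", "d6", "d7"}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.5 0.4
```

Two of the four retrieved documents are relevant, so precision is 0.5; two of the five relevant documents were found, so recall is 0.4.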
Automatic indexing includes single-term indexing and statistical methods, as well as information-theoretic and probabilistic methods. In addition, it uses linguistic methods and multi-term (phrase) indexing.
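As an illustration of the simplest of these approaches, single-term indexing with a basic statistic, the sketch below builds an inverted index that records how often each term occurs in each document. It is only one possible arrangement; the names `tokenize` and `build_index` and the sample documents are assumptions made for this example.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to {doc_id: term frequency} (an inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


# Hypothetical mini-collection.
documents = {
    "doc1": "Search engines index Web pages",
    "doc2": "A spider crawls the Web and the index grows",
}
index = build_index(documents)
print(index["web"])  # {'doc1': 1, 'doc2': 1}
print(index["the"])  # {'doc2': 2}
```

Looking up a term then costs one dictionary access instead of a scan of every document, which is the main reason indexing makes searching less time-consuming.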
Because the Internet is a vast collection of information, it is difficult to find the specific information you actually need. The search features in a Web browser such as Internet Explorer therefore provide easy access to a special facility called a search engine. Search engines scan the Internet for the words or topics you are looking for.
A search engine is software that searches through a database of Web pages for specific information.
Types of Search Engines
Google
Google is a search engine with many unique features. Suppose, for example, that you want to find information about a company. Google is useful for company searches because of the unusual way it ranks Websites. Type http://www.google.com in the Address bar and press Enter to go to the Google home page.
Once there, type the company name in the search box and click the Google Search button. Because Google is so good at finding the best-matching Websites, it also offers a feature that automatically looks for the best possible match and loads it. To use this feature, type a company’s name in the search box and click the I’m Feeling Lucky button. Google then takes you directly to the related company information.
Yahoo!
Yahoo! is basically a search directory: a hierarchically organized subject catalog of the Web that is both browsable and searchable. It is a good search tool that provides the required information quickly.
Links to various resources are added in two ways:
- By user’s submissions
- Through robots that retrieve new links from known pages
Yahoo! indexes Web pages, Usenet, and e-mail addresses. It lists 14 categories on its homepage, each divided into several subcategories. A search box is provided for searching within all of these options. You can search Yahoo! in two modes:
- Yahoo! search page
- Yahoo! search options
The Yahoo! search page supports operators such as (+) for inclusion and (-) for exclusion. The Yahoo! search options provide switches for fine-tuning your search. These switches use relevancy ranking to produce the query output, which is a list of documents and related Yahoo! categories, along with the first few lines of each document.
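The effect of the (+) and (-) operators can be sketched in a few lines of Python. This is not Yahoo!'s implementation; the functions `parse_query` and `matches`, the sample pages, and the treatment of unprefixed words as optional are all assumptions made for illustration.

```python
def parse_query(query):
    """Split a query into required (+), excluded (-), and optional terms."""
    required, excluded, optional = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.add(token[1:])
        else:
            optional.add(token)
    return required, excluded, optional


def matches(text, query):
    """True if text contains every + term and no - term.

    When the query has no + terms, at least one optional term must match.
    """
    words = set(text.lower().split())
    required, excluded, optional = parse_query(query)
    if not required <= words or excluded & words:
        return False
    return bool(required) or not optional or bool(optional & words)


# Hypothetical pages.
pages = {
    "p1": "python snake care guide",
    "p2": "python programming tutorial",
    "p3": "java programming tutorial",
}
query = "+python -snake tutorial"
print([p for p, text in pages.items() if matches(text, query)])  # ['p2']
```

Page p1 is rejected because it contains the excluded word, and p3 is rejected because it lacks the required one.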
If a search request fails in Yahoo!, it is automatically routed to AltaVista for a further search. Yahoo! also offers many extra services, such as free e-mail accounts, region-specific sites, people searches, site reviews, and a customizable news page.
AltaVista
AltaVista was created by the research facility of Digital Equipment Corporation (DEC) of the USA. It has a spider called Scooter that traverses the Web and Usenet newsgroups.
Indexing is based on the full text of a document, and the first few lines are used as an abstract. AltaVista supports full Boolean, phrase, and case-sensitive searches. The engine has two search modes: simple and advanced.
In a simple search, AltaVista attempts to find pages that include as many of your search words as possible and ranks those pages from highest to lowest in the results. An advanced search uses the same syntax rules as a simple search but adds Boolean operators to make searches much more flexible. The operators include & (AND), | (OR), and ! (NOT).
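Boolean operators of this kind map naturally onto set operations over the pages that contain each term. The sketch below shows the idea in Python with a hypothetical table of postings; it does not parse AltaVista's query syntax, and the page IDs and the helper `pages_with` are assumptions.

```python
# Hypothetical postings: term -> set of page IDs containing that term.
postings = {
    "search": {1, 2, 3},
    "engine": {2, 3, 4},
    "spider": {3, 5},
}
all_pages = {1, 2, 3, 4, 5}


def pages_with(term):
    """Return the set of pages containing the term (empty if unknown)."""
    return postings.get(term, set())


# search & engine  -> pages containing both terms
and_result = pages_with("search") & pages_with("engine")
# search | spider  -> pages containing either term
or_result = pages_with("search") | pages_with("spider")
# search & !spider -> pages containing "search" but not "spider"
not_result = pages_with("search") & (all_pages - pages_with("spider"))

print(sorted(and_result))  # [2, 3]
print(sorted(or_result))   # [1, 2, 3, 5]
print(sorted(not_result))  # [1, 2]
```

Python's set operators `&` and `|` happen to use the same symbols as the AND and OR operators listed above, which keeps the correspondence easy to read.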
Advanced search ranks results by giving a higher score to documents that contain the query terms in the first few words, or in which the query terms occur close to each other.
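The two ranking signals mentioned here, early occurrence and closeness of the query terms, can be combined into a toy scoring function. The formula below is purely an assumption for illustration, as are the name `rank_score` and the `early_window` parameter; the actual scoring used by AltaVista is not described in this text.

```python
def rank_score(text, query_terms, early_window=10):
    """Score a document: bonus for early terms, bonus for close-together terms."""
    words = text.lower().split()
    positions = {
        term: [i for i, w in enumerate(words) if w == term]
        for term in query_terms
    }
    if not positions or any(not p for p in positions.values()):
        return 0.0  # no query, or a query term is missing entirely

    # Bonus 1: one point per query term first seen within the early window.
    early = sum(1 for p in positions.values() if p[0] < early_window)

    # Bonus 2: proximity, from the distance between the first occurrences
    # of the query terms (a simplification of true minimum-span proximity).
    firsts = [p[0] for p in positions.values()]
    span = max(firsts) - min(firsts)
    proximity = 1.0 / (1 + span)

    return early + proximity


# Hypothetical documents.
query = ["search", "engines"]
doc_a = "web search engines rank pages by many signals"
doc_b = ("engines of every kind were on display long before "
         "anyone built tools for web search")
print(rank_score(doc_a, query))  # 2.5
print(rank_score(doc_a, query) > rank_score(doc_b, query))  # True
```

In the first document both terms appear early and next to each other, so it outranks the second, where the terms are fourteen words apart.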
HotBot
HotBot retrieves and indexes Web documents using a robot called Slurp and a parallel network of workstations. HotBot comes in two types: Like (ordinary HTML) and ActiveX. It offers simple keyword searches as well as Boolean searches.
This search engine is most suitable for searching for specific words or phrases. The HotBot search page contains a text box in which users enter their query string and a list box for choosing the appropriate rule, such as all words, any words, or exact phrase. HotBot is primarily used for fine-tuning your search: you can select whether the target page must or must not contain the words or exact phrases.
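The three rules offered in the list box (all words, any words, exact phrase) differ only in how the query words are tested against a page. A minimal sketch, assuming whitespace-separated words and a hypothetical function `matches_rule`, might look like this:

```python
def matches_rule(text, query, rule):
    """Apply one of three match rules: 'all', 'any', or 'phrase'."""
    words = text.lower().split()
    terms = query.lower().split()
    if rule == "all":
        return all(t in words for t in terms)
    if rule == "any":
        return any(t in words for t in terms)
    if rule == "phrase":
        n = len(terms)
        # Slide a window of n words across the text and compare it to the query.
        return any(words[i:i + n] == terms for i in range(len(words) - n + 1))
    raise ValueError(f"unknown rule: {rule}")


text = "the quick brown fox"
print(matches_rule(text, "brown quick", "all"))     # True
print(matches_rule(text, "brown quick", "phrase"))  # False
print(matches_rule(text, "quick brown", "phrase"))  # True
print(matches_rule(text, "fox cat", "any"))         # True
```

A "must not contain" option could be expressed by negating the result of the chosen rule.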
WebCrawler
WebCrawler has powerful search customization and a good selection of site reviews. Its Web robot, called a Webbot, creates a daily index of keywords from documents all over the Web. The robot starts with a known set of HTML documents and uses the URLs in them to retrieve new documents.
The search engine directs this navigation in a modified breadth-first mode. It indexes both the title and the full text of HTML documents, and terms are weighted by their frequency of occurrence in the document. WebCrawler also features WebRoulette, which suggests randomly selected sites for you to visit. Another option, Surf the Web Backwards, lets you enter a URL and get a list of all the sites that link directly to it.
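The following sketch shows plain breadth-first navigation over a small in-memory link graph, together with a backlink lookup of the kind Surf the Web Backwards provides. It does not reproduce WebCrawler's modifications to breadth-first order, performs no network access, and uses made-up URLs; `crawl_breadth_first` and `backlinks` are hypothetical names.

```python
from collections import deque


def crawl_breadth_first(link_graph, seeds):
    """Visit pages level by level, starting from the seed URLs."""
    visited = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        visited.append(url)
        for target in link_graph.get(url, []):
            if target not in seen:  # skip pages already queued or visited
                seen.add(target)
                queue.append(target)
    return visited


def backlinks(link_graph, url):
    """Return the pages that link directly to the given URL."""
    return sorted(src for src, targets in link_graph.items() if url in targets)


# Hypothetical link graph: page -> pages it links to.
link_graph = {
    "a.example/": ["a.example/x", "b.example/"],
    "a.example/x": ["c.example/"],
    "b.example/": ["c.example/", "a.example/"],
    "c.example/": [],
}
print(crawl_breadth_first(link_graph, ["a.example/"]))
# ['a.example/', 'a.example/x', 'b.example/', 'c.example/']
print(backlinks(link_graph, "c.example/"))
# ['a.example/x', 'b.example/']
```

Because the queue is first-in, first-out, every page one link away from the seed is visited before any page two links away.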
Excite
Excite uses a spider and an indexer for full-text search of documents. The spider retrieves only Web and Usenet newsgroup documents, and users can submit URLs for indexing. The indexer generates index terms and a short summary of each document. The Excite index consists of about 50 million URLs.
Excite is a full-featured search engine that offers services such as case-sensitive searches. The Boolean operators it supports are AND, NOT, and OR.
InfoSeek
InfoSeek is a popular search engine with a robot that retrieves HTML and PDF documents. It indexes the full text and generates a short summary of each document. InfoSeek allows searches of the Web, Usenet groups, and Web Frequently Asked Questions (FAQs). It offers indexed site searches and divides the Web into a number of convenient baskets. Unlike Yahoo!, InfoSeek aims to catalog more Websites than virtually any other search engine on the Internet.
Lycos
Lycos contains 66 million pages in its database. It has a robot that uses heuristics to navigate the Web and build a searchable index. For each document indexed, the robot keeps the outgoing links in a queue and selects the next URL from that queue.
One heuristic, for example, may force the robot to select a URL that points to a Web server’s homepage. Users can submit URLs for indexing. Lycos indexes the titles, headings, and subheadings of HTML, FTP, and Gopher documents. It also offers a lot of content, such as news, site reviews, links, and a people finder, and it can search for images and sounds.
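A heuristic such as "prefer homepages" can be expressed as a priority attached to each queued URL. The sketch below uses Python's `heapq` and `urllib.parse` modules; the priority rule, the function names `priority` and `order_queue`, and the URLs are assumptions, not a description of the actual Lycos robot.

```python
import heapq
from urllib.parse import urlparse


def priority(url):
    """Lower value = fetched sooner. Homepages (empty or '/' path) come first."""
    path = urlparse(url).path
    return 0 if path in ("", "/") else 1


def order_queue(urls):
    """Return the URLs in the order a heuristic-driven robot would fetch them."""
    # The index i breaks ties so equal-priority URLs keep their queue order.
    heap = [(priority(u), i, u) for i, u in enumerate(urls)]
    heapq.heapify(heap)
    ordered = []
    while heap:
        _, _, url = heapq.heappop(heap)
        ordered.append(url)
    return ordered


# Hypothetical outgoing links collected from an indexed document.
queue = [
    "http://a.example/docs/page.html",
    "http://b.example/",
    "http://c.example",
    "http://a.example/faq.html",
]
for url in order_queue(queue):
    print(url)
# http://b.example/
# http://c.example
# http://a.example/docs/page.html
# http://a.example/faq.html
```

Other heuristics could be added by changing only the `priority` function, leaving the queue handling untouched.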
Other types of Search Engines
There are many other search engines on the Web, such as InfoMarket, MetaCrawler, IndiaInfo.co, All4one, and Highways61.com. Choosing the right search engine takes patience and experience. Metasearch engines can reduce your search effort to a great extent. Search engines are evolving every day to improve Web retrieval efficiency.
Searching Criterion
Search Tools
Search tools find specific information in two ways:
- Directories
- Spiders
The problem with directories, which store knowledge in some structure, is that classification is a labor-intensive activity, and there are far more publishers than classifiers on the Web. If the information you are looking for is not reflected in the classification structure, you are out of luck, and this happens very often.
An alternative is intensive automation in the form of a spider, or robot, that explores the Web and helps to find Web pages. Spiders are also known as crawlers, robots, or bots. Tools built on them can test their databases against queries and order the resulting matches, and they provide a user interface for entering queries and presenting results. A spider ignores a page’s visual presentation and simply sees the HTML source. However, a spider is blind to the information contained in images and in audio or video clips.
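What a spider can and cannot see in a page is easy to demonstrate with Python's standard `html.parser` module. In this sketch the class `PageReader` and the sample HTML are hypothetical; the parser collects visible text and link targets, while the picture itself contributes nothing beyond its tag.

```python
from html.parser import HTMLParser


class PageReader(HTMLParser):
    """Collect the text and the link targets a spider can read from HTML."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Record the href of every anchor so the spider can follow it later.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Keep only non-empty text between tags.
        if data.strip():
            self.text.append(data.strip())


html = (
    "<html><body><h1>Fruit</h1>"
    '<img src="apple.jpg">'
    '<p>See <a href="/pears.html">pears</a>.</p>'
    "</body></html>"
)
reader = PageReader()
reader.feed(html)
reader.close()
print(reader.links)  # ['/pears.html']
print(" ".join(reader.text))
```

The word "apple" never appears in the collected text: whatever the image shows is invisible to the spider unless the page also describes it in words.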
Search Services
Search services broadcast user queries to several search engines and various other information sources simultaneously. They then merge the results submitted by these different sources, check for duplicates, and present them to the users as an HTML page with clickable URLs. Search sites are basically of the following two types:
- Search directories
- Search engines
Search Directories
Search directories contain a list of Websites organized hierarchically into categories and subcategories. These are created manually.
Search Engine
A search engine continuously sends out so-called spiders, which start on the homepage of a server and follow all links step by step. Word indexes are created from the individual pages, and the databases are updated.
To avoid having to consult several search engines, use a Metasearch site. Metasearch sites pass your request to various search engines and give you better coverage. They do not have search capabilities of their own.
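The forwarding and merging that a Metasearch site performs can be sketched as follows. The two engine functions are stand-ins that return fixed, made-up results, the queries are sent one after another rather than simultaneously, and the duplicate check (ignoring case and a trailing slash) is an assumption chosen for simplicity.

```python
def metasearch(query, engines):
    """Send the query to every engine and merge results without duplicates."""
    merged = []
    seen = set()
    for engine in engines:
        for url in engine(query):
            key = url.rstrip("/").lower()  # treat trivially different URLs as equal
            if key not in seen:
                seen.add(key)
                merged.append(url)
    return merged


# Hypothetical stand-ins for real search engines.
def engine_a(query):
    return ["http://one.example/", "http://two.example/"]


def engine_b(query):
    return ["http://two.example", "http://three.example/"]


print(metasearch("spiders", [engine_a, engine_b]))
# ['http://one.example/', 'http://two.example/', 'http://three.example/']
```

The Metasearch function holds no index of its own; all of its results come from the engines it calls.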
How to Search using Search Engines?
Using a search engine is pretty simple. Just type the terms to be searched for into the space provided on the search engine’s page and click Search. The results are displayed as clickable URLs leading to the pages you seek. In some search engines, the data is related only after editorial processing.