The official online job search resource hosted by Dick Bolles, author of "What Color Is Your Parachute?"
Research on the Internet
 
 
The UnderWeb & Deep Searching
Search engines cannot reach all of the Internet. Some experts say that half or more of the Web is inaccessible to them and, by extension, to you. (For an explanation of why this is so, see "Why Search Engines Miss So Much" at the end of this page.) The part of the Internet that is beyond the reach of search engines goes by several names, including the UnderWeb, the Invisible Web, and the Deep Web. Here are some other places to look for data on the UnderWeb.
LibDex
This is an index to 18,000 libraries, many of which have online materials and databases that you can access.

The Digital Library Project at U.C. Berkeley
A glimpse of the future, when everything is online. There is so much here that I am not going to try to describe it. Instead, when you have some time, jump in and see what is available at the site.

Resources and Databases at Purdue University Library
An excellent page that directs you straight to many UnderWeb databases, grouped by subject. Take a look around the whole site while you are here, too; the Library has a strong Web presence, with current information and many special features.

Finding Information on the Internet
From the library at U.C. Berkeley; this is the best single article I've found on Internet research.

Deep Web White Paper
From BrightPlanet, this is a really good one as well. Excellent source list of references and links.

Technical Communication Library
This is a site especially for technical writers. It's classic UnderWeb: there are many resources here, but little of the Library's content is likely to show up in the results of most search engines, so you need to poke around a little. This page (http://tc.eserver.org/sitemaps/categories.lasso) lets you browse (or search) through the subjects; click on one, and the resources under that subject are listed. In this case, a "resource" could be an article; it could be a list of links; it could be a pointer to other databases like this one. Nice interface, too.

Educator's Reference Desk Database
This is a database of over a million abstracts related to education, a field that covers a wide swath. You may use the database to locate the actual documents, find various libraries with more data available, or view the documents online if you choose to subscribe to that service.

U.S. Patent & Trademark Office
Huge databases containing trademark information, and patent data going back to 1790. As with many UnderWeb locations, the site is spare and utilitarian, and you may need a little practice to feel comfortable here.

Searching the Internet: Recommended Sites and Search Techniques

Copyright©1996-2013 by Richard N. Bolles
All rights reserved. No part of this site may be quoted or reproduced without written permission.
For any suggested additions, updating or corrections to this site, please e-mail the Webmaster.
More on the UnderWeb --- Why Search Engines Miss So Much
Experts say that anywhere from 50% to 90% of the Web is hidden from search engines --- all search engines. To understand why, we have to know a little about how search engines work.

A search engine sends out a rover, or "bot", that looks at various web pages, and indexes those pages into a large database that the search engine keeps. When you enter a search query, the engine looks in its database of indexed pages for a match, and returns results accordingly.
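
To make the index-then-lookup idea concrete, here is a minimal sketch in Python. It is a toy, not how any real search engine is built: the page URLs and text are made up, the "bot" step is skipped (the pages are assumed to have been fetched already), and the names `PAGES`, `build_index`, and `search` are invented for this illustration.

```python
from collections import defaultdict

# Hypothetical pages that a bot has already fetched (URL -> page text).
PAGES = {
    "http://example.com/a": "Job search tips for technical writers",
    "http://example.com/b": "Patent search tips and trademark data",
}

def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word in the query, sorted."""
    words = query.lower().split()
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)

# The engine's "database": built once, then consulted for every query.
INDEX = build_index(PAGES)
```

Notice that `search` only ever consults `INDEX`. A page the bot never visited is simply not in there, so no query can find it, no matter how relevant the page is.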

But at the end of the day, not all pages on the Web will be in the search engine's database. The parts of the Web that search engines cannot reach are often called by names like the UnderWeb, the Hidden Web, etc., while what is available through standard search engine technologies is, logically enough, called the Visible Web.

Common reasons that sites are not part of the Visible Web are as follows:

  • Many Web sites make money through advertising. They sell ad space based on the number of people who visit the site, and people visit the site because of the data that is there. It is not in the site's interest to have its data available through a search engine like Google; if that were the case, people could just go to Google to get that data. Google would get the "hit", and therefore the advertising money, instead of the site that originally had the data. (Special commands can be embedded in a Web site's page data to stop search engine bots from entering and indexing what is there.)
  • When a search engine indexes the data on a Web site, that consumes a certain amount of the search engine's computing resources. If a site's data is of interest to a relatively small number of people, then the search engine company will not want to index the data there; it is not a profitable use of resources. Why clog up its database with information hardly anybody is going to want? Similarly, some data changes too fast to keep up with economically, like the millions of listings on eBay and their current prices. No search engine would want such data.
  • Even if a search engine wants to index a certain Web page, sometimes the data is hidden behind an interface that requires a human hand. If the data cannot be indexed automatically, then it cannot be indexed.

An example is the phone number lookup site at Switchboard.com. You can go there and look up phone numbers, do reverse number searches, etc. The site is built around its large database of names and phone numbers; and as you look at the page, you can see the ads that help to make the site profitable.

Now, it is obviously not in Switchboard's interest to have all of its names and numbers stored in Google's database. If people could find Switchboard's data by just going to Google, then, again, it would be Google making money from selling advertising, not Switchboard. Likewise, Google probably doesn't want to clog up its database with every phone number in the country; it is too low a return for the resources consumed. And lastly, the interface may be impenetrable to Google's automatic search bot. If the bot cannot access the database of names and numbers directly, it usually has no idea how to use this one-at-a-time-lookup interface to get at the data.
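
One widely used way for a site to keep bots out is a robots.txt file, which lists the parts of the site that well-behaved bots should not visit. Whether that is exactly the mechanism meant by the "special commands" mentioned above is my assumption, and I have no idea what rules Switchboard actually uses. The sketch below uses Python's standard `urllib.robotparser` module to show how a polite bot would check such rules before fetching a page; the rules, the `/lookup/` path, the bot name, and the helper `bot_may_fetch` are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a lookup site: every bot is asked to stay
# out of the /lookup/ area, where the database results live.
ROBOTS_TXT = """\
User-agent: *
Disallow: /lookup/
"""

def bot_may_fetch(robots_txt, user_agent, url):
    """Return True if the given rules allow this bot to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# The front page is open to bots; the lookup results are not.
front_page_ok = bot_may_fetch(ROBOTS_TXT, "ExampleBot", "http://example.com/about.html")
lookup_ok = bot_may_fetch(ROBOTS_TXT, "ExampleBot", "http://example.com/lookup/smith")
```

Keep in mind that rules like these are a request, not a lock: they only work because reputable search engines choose to honor them.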
