Thursday, November 6, 2008

Unit 10: Reading notes

David Hawking, Web Search Engines: Part I and II, IEEE Computer, June 2006
  • Interesting point - today's search engines not only index many more sites, they also provide "much higher quality" answers to queries.
  • "For redundancy and fault tolerance, large search engines operate multiple, geographically distributed data centers"
  • Crawling algorithms fetch and traverse web pages at enormous scale
  • "If each HTTP request takes one second to complete—some will take much longer or fail to respond at all—the simple crawler can fetch no more than 86,400 pages per day. At this rate, it would take 634 years to crawl 20 billion pages. In practice, crawling is carried out using hundreds of distributed crawling machines." Wow.
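The arithmetic in that quote is easy to verify. A quick sanity check (just redoing Hawking's back-of-the-envelope numbers, nothing more):

```python
# Sanity-check Hawking's single-crawler arithmetic: one HTTP request
# per second means one page fetched per second of the day.
SECONDS_PER_DAY = 24 * 60 * 60       # 86,400 seconds
pages_per_day = SECONDS_PER_DAY      # one page per second
total_pages = 20_000_000_000         # 20 billion pages

days_needed = total_pages / pages_per_day
years_needed = days_needed / 365

print(pages_per_day)        # 86400
print(round(years_needed))  # 634
```

Both figures match the article, which is why real crawls run on hundreds of machines in parallel.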
  • "Search engines use an inverted file to rapidly identify indexing terms—the documents that contain a particular word or phrase"
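To make the inverted-file idea concrete, here's a toy sketch of one in Python. This is just the core data structure (term → set of documents containing it), not anything like a production index, which would also store term positions, use compression, and so on:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it.
    A toy inverted file: real engines add positions, ranking data,
    and compressed on-disk postings lists."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Three tiny "documents" to index
docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick dog"}
index = build_inverted_index(docs)

print(sorted(index["quick"]))  # [1, 3]
print(sorted(index["dog"]))    # [2, 3]
```

The point of the structure is that answering "which documents contain this word?" is a single dictionary lookup rather than a scan of every document.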
Shreeves, S.L., Habing, T. O., Hagedorn, K., & Young, J. A. (2005). Current developments and future trends for the OAI protocol for metadata harvesting. Library Trends, 53(4), 576-589.
  • Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) - means to federate access to diverse e-print archives through metadata harvesting
  • OAI-PMH was initially designed to meet the needs of the e-print archives community, but it is applicable in a broad range of communities, including libraries, museums, and archives.
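Mechanically, OAI-PMH harvesting is just HTTP GET requests against a repository's base URL, with the request type given by a `verb` parameter (the spec defines verbs like `Identify`, `ListRecords`, and `GetRecord`). A small sketch of building such a request; the base URL here is a made-up placeholder, not a real repository:

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint, for illustration only.
BASE_URL = "https://repository.example.org/oai"

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords harvest URL. 'verb' and 'metadataPrefix'
    are defined by the OAI-PMH spec; oai_dc (unqualified Dublin Core)
    is the metadata format every repository must support."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional selective harvesting by set
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"

url = list_records_url(BASE_URL)
print(url)
```

A harvester would fetch this URL, parse the XML response, and follow `resumptionToken`s to page through large result sets — that's the whole "federate access through metadata harvesting" mechanism in miniature.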
The Deep Web: Surfacing Hidden Value by Michael K. Bergman
  • "Searching on the Internet today can be compared to dragging a net across the surface of the ocean"
  • A survey in 2000 concluded that public information on the deep Web was 400 to 550 times larger than the commonly defined World Wide Web - what about today?
  • Deep Web documents are on average 27% smaller than surface Web documents
  • Largest deep web site: National Climatic Data Center (NOAA)
  • "These observations suggest a splitting within the Internet information search market: search directories that offer hand-picked information chosen from the surface Web to meet popular search needs; search engines for more robust surface-level searches; and server-side content-aggregation vertical "infohubs" for deep Web information to provide answers where comprehensiveness and quality are imperative."

2 comments:

NA said...

Hi,
I also found some of the claims in "The Deep Web" to be interesting. I imagine the amount of information in the deep web has grown incredibly since the 2000 survey--as one of the articles states, the deep web is the fastest growing part of the internet.
I think search directories with "hand-picked information chosen from the surface Web to meet popular search needs" are a very good idea. I'm sure some exist by now.

Rian said...

I agree with your comment (wow) about the simple crawler managing to get through only 86,400 pages per day, though I'm not sure how to interpret it. I have two takes on it. First, I'm amazed it can even go through that many pages as it is, let alone that being "slow", and second, that was a lot of information in two little sentences.