Web-Searching Agents
(a subtopic of Agents)

Good Places to Start

Readings Online

Related Web Sites

Related Pages

More Readings
(see FAQ)

Recent News about Web-Searching Agents (annotated)

The World Wide Web has become a vast resource of information. The problem is that finding the information that an individual desires is often quite difficult, because of the complexity in organization and the quantity of information stored.

- from Web Hunting: Design of a Simple Intelligent Web Search Agent


Good Places to Start

Text Parsing - Get a Job. Part of It's Alive! - From airport tarmacs to online job banks to medical labs, artificial intelligence is everywhere. By Jennifer Kahn. Wired (March 2002; 10.03). "The vast job bank Monster.com, for instance, uses an intelligent Web crawler called FlipDog to find new customers. Wandering the Web, the crawler develops a sense for which parts of sites are more likely to contain jobs, then parses the pages to pull out the relevant information (company, salary, kind of work, address for sending a resume) and files it in a database. The first time the crawler ran, it came back with more than half a million jobs. The real feat was not that FlipDog found the postings, but that it was able to organize them."
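
The extraction step described here is easy to picture in miniature. Below is a minimal sketch (not FlipDog's actual code) of pulling labeled fields out of a page already judged likely to contain postings; the field names and patterns are illustrative assumptions.

```python
# A toy FlipDog-style field extractor: given the text of a page likely to
# contain job postings, pull out labeled fields and file them as a record.
# The patterns and field names are illustrative, not FlipDog's.
import re

POSTING_FIELDS = {
    "company": re.compile(r"Company:\s*(.+)", re.IGNORECASE),
    "salary":  re.compile(r"Salary:\s*(.+)", re.IGNORECASE),
    "title":   re.compile(r"Position:\s*(.+)", re.IGNORECASE),
    "contact": re.compile(r"Apply to:\s*(\S+@\S+)", re.IGNORECASE),
}

def parse_posting(page_text):
    """Extract whatever labeled fields the page exposes into a record."""
    record = {}
    for field, pattern in POSTING_FIELDS.items():
        match = pattern.search(page_text)
        if match:
            record[field] = match.group(1).strip()
    return record

sample = ("Position: Research Engineer\n"
          "Company: Example Corp\n"
          "Salary: $90,000\n"
          "Apply to: jobs@example.com")
print(parse_posting(sample))  # {'title': 'Research Engineer', ...}
```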

Web Hunting: Design of a Simple Intelligent Web Search Agent. By G. Michael Youngblood. ACM Crossroads Student Magazine (Summer 1999). "The goal of this article is to introduce the reader to the basic elements of an intelligent agent, and then apply those elements to a Web search agent to provide the framework for the construction of a simple intelligent Web search agent. An overview of typical artificial intelligence search algorithms will be presented and performance metrics will be discussed. This article presents a collection of ideas and pointers to resources that will hopefully provide some insight and basis for further inquiry into the subject matter."
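
For a taste of the search algorithms the article surveys, here is a best-first search sketch run over a toy in-memory "web" so that it executes without a network; the keyword-overlap scoring heuristic and the page data are illustrative assumptions.

```python
# Best-first search: always expand the most promising unvisited page,
# as judged by a relevance heuristic (here, query-term overlap).
import heapq

TOY_WEB = {  # page -> (text, outgoing links); illustrative data only
    "start": ("portal of links", ["a", "b"]),
    "a":     ("intelligent agents and web search", ["c"]),
    "b":     ("cooking recipes", []),
    "c":     ("web search agents in depth", []),
}

def score(page, query_terms):
    return len(set(TOY_WEB[page][0].split()) & query_terms)

def best_first_search(start, query_terms, max_pages=10):
    frontier = [(-score(start, query_terms), start)]  # max-heap via negation
    visited, results = set(), []
    while frontier and len(visited) < max_pages:
        neg, page = heapq.heappop(frontier)
        if page in visited:
            continue
        visited.add(page)
        results.append((page, -neg))
        for link in TOY_WEB[page][1]:
            if link not in visited:
                heapq.heappush(frontier, (-score(link, query_terms), link))
    return results

print(best_first_search("start", {"search", "agents"}))
```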

Is There an Intelligent Agent in Your Future? By James A. Hendler. Nature Web Matters (March 11, 1999). "A good internet agent needs these same capabilities. It must be communicative: able to understand your goals, preferences and constraints. It must be capable: able to take actions rather than simply provide advice. It must be autonomous: able to act without the user being in control the whole time. And it should be adaptive: able to learn from experience about both its tasks and about its user's preferences. Let's look at each of these in turn...."

Diving Deep Into The Web - Pair's search engine scours 'hidden' sites. By Michael Bazeley. The Mercury News (August 17, 2005; registration req'd.). "You think the Web is big? In truth, it's far bigger than it appears. The Web is made up of hundreds of billions of Web documents -- far more than the 8 billion to 20 billion claimed by Google or Yahoo. But most of these Web pages are largely unreachable by most search engines because they are stored in databases that cannot be accessed by Web crawlers. Now a San Mateo start-up called Glenbrook Networks says it has devised a way to tunnel far into the 'deep Web' and extract this previously inaccessible information. ... Komissarchik and her father, Edward Komissarchik, say they have figured out how to analyze the forms on Web pages and understand the type of information the sites are looking for. Then, Glenbrook's Web crawlers use artificial intelligence to walk themselves through sometimes complex Web forms, answering questions, such as the location of their desired job, in the same way a human would."
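
Glenbrook's form-analysis technology is proprietary; the sketch below shows only the mechanical last step the article describes, submitting a form the way a browser would once the agent has inferred what each field is asking for. The URL and field names are hypothetical.

```python
# Once an agent has inferred what a form's fields want, it can submit the
# form the way a browser would. The action URL and field names below are
# hypothetical; the real difficulty is the inference, not the POST.
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def submit_form(action_url, inferred_answers):
    """POST the agent's inferred answers as a form submission."""
    data = urlencode(inferred_answers).encode("utf-8")
    request = Request(action_url, data=data)  # becomes a POST when data is set
    with urlopen(request) as response:
        return response.read()

# e.g. a job-search form where the agent inferred the "location" field:
# submit_form("http://example.com/jobs/search",
#             {"location": "San Mateo, CA", "keywords": "engineer"})
```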

The Semantic Web. By Tim Berners-Lee, James Hendler, and Ora Lassila. Scientific American (May 2001). "The Semantic Web will bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users. ... The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation."

  • Also see:
    • Building The Web of Tomorrow - Creators Say 'Semantic' Web Will Be Smarter. By Sophia Kingman. ABCNEWS.com (December 21, 2001). "Semantic means 'of or relating to meaning,' and this new Web will have content better identified so that, for example, future search engines will be able to understand context and discard the irrelevant."
    • Tiny Circuits: Tim Berners-Lee discusses the future of the Web. NPR Talk of the Nation: Science Friday With Ira Flatow. [Radio Interview; November 1, 2002]
    • When the web starts thinking for itself. By David Green. vnunet's Ebusinessadvisor (December 20, 2002). "While XML, RDF and ontologies provide the basic infrastructure of the semantic web, it is intelligent agents that will realise its power."
    • The Semantic Web. From Semaview's "At-a-Glance" Illustration Series. "Designed as a one minute overview of the Semantic Web, this illustration discusses a half dozen key points in language that can be understood by managers and techies alike." Be sure to see #4: "Ontologies give the metadata meaning."
    • The Web's Father Expects a Grandchild - Tim Berners-Lee is working on the "Semantic Web," with its richer information links that unlock the power of "unplanned reuse of data." Interviewed by Andy Reinhardt. BusinessWeek online (October 22, 2004). "Q: You're working now on the Semantic Web, which will allow richer associations among data and, as the name implies, start to create a sense of "meaning" in online information. Where are things heading? A: The impact of the Semantic Web will be different from [today's] hypermedia Web. ... The Semantic Web is different. It's a space of data. It's all the information which is now in databases, spreadsheets, and application-specific files, like calendar files or photo metadata. What's exciting about the Semantic Web is its potential for serendipity, the unplanned reuse of data. The effect will be even more powerful for the Semantic Web because you won't have to be a person following the links. A machine will be able to follow links. Q: Can you give me an example? ..."

Sony lab tips 'emergent semantics' to make sense of Web. By Junko Yoshida and R. Colin Johnson. EE Times (November 1, 2004). "Sony Computer Science Laboratory is positioning its 'emergent semantics' as a self-organizing alternative to the W3C's Semantic Web that does not require any recoding of the data currently available online. Based on successful experiments with communities of robots, emergent-semantic technology is built on the principles of human learning, representatives of the Sony lab said at an open house here last month. Much as these communities of 'agents' extract meaning (semantics) from the character of their interactions, emergent semantics extracts the meaning of Web documents from the manner in which people use them, the researchers said."

AI gets down to business. By Matthew Broersma. ZDNet UK. (January 23, 2001). "Web robots don't necessarily carry out tasks for one Web site. Many researchers envision a world of semi-autonomous 'agents', roaming the Web and carrying out various tasks for their owners. Present software such as the 'mobile agents' of Netherlands-based Tryllian could be the forerunner of intelligent bots making purchases and carrying out other business transactions without human intervention."

The Push for News Returns. By Kendra Mayfield. Wired News (March 30, 2002). "The University of Michigan is working on a similar service called NewsInEssence, which also uses natural language techniques to find and summarize multiple news articles on the Web." Also see: AI-Generated News Collections
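
NewsInEssence's techniques go well beyond this, but the core extractive idea can be sketched simply: score each sentence across the collected articles by how many high-frequency content words it contains, and keep the top few. The stop-word list and scoring rule are illustrative choices.

```python
# Naive multi-document extractive summarization: frequent content words
# mark important sentences. A stand-in for far richer NLP techniques.
from collections import Counter
import re

STOP = {"the", "a", "an", "of", "to", "and", "in", "is", "on", "for"}

def summarize(articles, n_sentences=2):
    sentences = [s.strip() for text in articles
                 for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for s in sentences for w in re.findall(r"[a-z']+", s.lower())
             if w not in STOP]
    freq = Counter(words)
    def score(sentence):
        terms = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in terms if t not in STOP) / (len(terms) or 1)
    return sorted(sentences, key=score, reverse=True)[:n_sentences]

print(summarize(["Agents roam the web. Agents summarize news for readers.",
                 "News summarization helps readers cope with many articles."]))
```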

Filtering with Intelligent Software Agents. By Shaun Abushar and Naoki Hirata. (A course project from 1998.) "Information overload is a problem of the world today, but intelligent agents help reduce this problem. Using them to filter the oncoming 'traffic' of the 'information highway' can help reduce cost, effort, and time."

Readings Online

Intelligent Systems and the Internet - A Special Issue of AI Magazine. 18(2), Summer 1997. "The articles describe a broad and diverse set of systems. The AI technologies used span the gamut from machine learning to natural language processing, from case-based reasoning to knowledge representation, and more. Applications include Web page filtering, a grant finder, a FAQ finder, a home page finder, a shopping assistant, and more." - from the Introduction, by Oren Etzioni.

AI think, therefore I am. Virtual agents feature - Computerised characters that look, sound, move and seemingly think like real people are emerging from the realms of science fiction into everyday life. Superguide by David Braue. apcmag.com (December 16, 2003). "Agents are all over the Internet, across which search engine 'spiders' interactively locate and index sites, and are also common in subscription news services. ... Many researchers believe such agents will become pervasive personal assistants, helping people keep up with a constant flood of information by proactively sorting, cataloguing and presenting it in a meaningful way."

Spinning a smarter Web - The Semantic Web promises to make information more meaningful. By Sue Bowness. Information Highways (March/April 2004). "Besides the development of a shared system that will help computers to understand the relationships between databases and documents, the second aspect of the Semantic Web is to help computers develop reasoning and communication. Artificial intelligence researchers like Sheila McIlraith are all over this concept, developing applications called 'agents' that would interact with each other and be able to carry out research commands on the user's behalf. ... While the development of Semantic Web languages and agents are already a rudimentary reality, some aspects of the Semantic Web are almost entirely theoretical at the present time. For instance, what if an agent is confronted with conflicting information, how will it know which statement is true? This is where proof-checking mechanisms will be a necessary addition to the process. Through digital signatures, applications on the Semantic Web would learn to trust certain data based on its author, and allow transactions based on that shared context."
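
The proof-checking layer is, as the article notes, still largely theoretical; the sketch below shows only the shape of the idea: accept a statement when its signature verifies against a key for an author the agent already trusts. Real proposals use public-key digital signatures; HMAC with shared keys is used here purely to keep the sketch runnable from Python's standard library.

```python
# Trust-by-signature in miniature: a statement is accepted only if it
# verifies against the key of an already-trusted author.
import hashlib
import hmac

TRUSTED_AUTHOR_KEYS = {"alice": b"alice-secret-key"}  # hypothetical trust store

def sign(author_key, statement):
    return hmac.new(author_key, statement.encode(), hashlib.sha256).hexdigest()

def accept(author, statement, signature):
    key = TRUSTED_AUTHOR_KEYS.get(author)
    if key is None:
        return False  # unknown author: no basis for trust
    return hmac.compare_digest(sign(key, statement), signature)

stmt = "conference_date 2005-07-30"
sig = sign(TRUSTED_AUTHOR_KEYS["alice"], stmt)
print(accept("alice", stmt, sig))    # True: signed by a trusted author
print(accept("mallory", stmt, sig))  # False: author not in the trust store
```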

Personalized and Focused Web Spiders. By Michael Chau and Hsinchun Chen. In Web Intelligence (February 2003, pp. 197-217; Springer-Verlag). N. Zhong, J. Liu, Y. Yao, editors. Abstract: "As the size of the Web continues to grow, searching it for useful information has become increasingly difficult. Researchers have studied different ways to search the Web automatically using programs that have been known as spiders, crawlers, Web robots, Web agents, Webbots, etc. In this chapter, we will review research in this area, present two case studies, and suggest some future research directions."

  • Check out the University of Arizona Artificial Intelligence Lab's Spiders are Us web site and their related publications. [The lab is part of the Management Information Systems Department and is headed by Dr. Hsinchun Chen.]

Weaving A Web of Ideas - Engines that search for meaning rather than words will make the Web more manageable. By Steven M. Cherry. IEEE Spectrum (September 2002). "What companies like Google, Autonomy, and Verity are doing, in other words, is figuring out better ways of doing what search engines have always tried to do: deliver the best documents the existing Web has on a given topic. The advocates of the Semantic Web, on the other hand, are looking beyond the current Web to one in which agent-like search engines will be able to not just deliver documents, but get at the facts inside them as well. ... Valuable as the Semantic Web might be, it won't replace regular Web searching. Peter Pirolli, a principal scientist in the user interface research group at the Palo Alto Research Center (PARC), notes that usually a Web querier's goal isn't an answer to a specific question. 'Seventy-five percent of the time, people are engaged in what we call sense-making,' Pirolli says. ... PARC researchers think there's plenty of room for improving Web searches. One method, which they call scatter/gather, takes a random collection of documents and gathers them into clusters, each denoted by a single topic word, such as 'medicine,' 'cancer,' 'radiation,' 'dose,' 'beam.' The user picks several of the clusters, and the software rescatters and reclusters them, until the user gets a particularly desirable set. ... For Autonomy, Bayesian networks are the starting point for improved searches. The heart of the company's technology, which it sells to corporations like General Motors and Ericsson, is a pattern-matching engine that distinguishes different meanings of the same term and so 'understands' them as concepts."
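
One round of the scatter/gather cycle can be approximated in a few lines; this is an illustrative sketch rather than PARC's implementation, and it assumes scikit-learn is available. Cluster the documents, label each cluster by its dominant term, and the user would then pick clusters to re-scatter.

```python
# One "gather" round of scatter/gather: cluster documents and name each
# cluster by the strongest term in its centroid.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def gather(documents, n_clusters):
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(documents)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(matrix)
    terms = vectorizer.get_feature_names_out()
    clusters = {}
    for label in range(n_clusters):
        members = [doc for doc, lab in zip(documents, km.labels_) if lab == label]
        topic = terms[km.cluster_centers_[label].argmax()]  # dominant term
        clusters[topic] = members
    return clusters

docs = ["radiation dose in cancer therapy", "beam radiation treatment",
        "medicine for chronic pain", "cancer medicine trials"]
for topic, members in gather(docs, 2).items():
    print(topic, members)
```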

Agent-Based Engineering, the Web, and Intelligence. By Charles J. Petrie, Stanford Center for Design Research. IEEE Expert, 11:6, pp. 24-29, (December 1996). "This article concerns Internet-based 'agents', about which there has been much hyperbole recently. There has been much discussion on the software agents email list about the defining nature of agents on the Internet. Some have tried to offer the general definition of agents as someone or something that acts on one's behalf, but that seems to cover all of computers and software. Other than such generalities, there has been no consensus on the essential nature of agents. This suggests that the word is overloaded for a variety of contexts. In this article I will survey the types and definitions of agents eventually focusing on those useful for engineering. Because it is simply silly to discuss software agents without distinguishing them from other known types of software, I will venture to offer a definition."

Going where no search engine has gone before - Connotate Technologies uses information agents to extract data from Deep Web. By Dibya Sarkar. FCW.com (May 30, 2005). "Google, one of the most popular search engines, at best can index and search about 4 billion to 5 billion Web pages, representing only 1 percent of the World Wide Web. But officials from Connotate Technologies, a company based in New Brunswick, N.J., said they have developed technology that can mine and extract data from the Deep Web, which contains an estimated 500 billion Web pages, and deliver it in any format and through any delivery mechanism. The Deep Web refers to content in databases that rarely shows up in Web searches. Through the use of intelligence-based software modules called information agents, corporate and government organizations can quickly and easily target specific unstructured data from intranets and password-protected Web sites on a continual basis. 'What the agents do is they automate time-consuming Web interaction,' said Bruce Molloy, the company's chief executive officer. 'So an agent can act on your behalf, type in information, search terms, can click on links, can know your password — but we would keep it protected — can automatically go to sites and bring back information, format and cut and paste results.' ... Connotate was formed in 1999 by three Rutgers University professors, whose Web-mining technology research was funded by the Defense Advanced Research Projects Agency and the university. ... 'It's a lot like showing something to a small child for the first time,' said Chris Giarretta, Connotate's customer relationship manager. Essentially, he said, the more you show what a user wants, the better the agent will get at finding it."

Intelligent Searching Agents on the Web. Search Engines column by Tracey Stanley. Ariadne (Issue 7; January 1997). "Intelligent agents can utilise the spider technology used by traditional web search engines, and employ this in new kinds of ways. Typically, these tools are spiders which can be trained by the user to search the web for specific types of information resources. The agent can be personalised by its owner so that it can build up a picture of individual likes, dislikes and precise information needs. An intelligent agent can also be autonomous - so that it is capable of making judgements about the likely relevance of material."
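
The trainable behavior Stanley describes can be sketched as simple relevance feedback: the agent keeps a profile of term weights, nudges them up for pages its owner marks relevant and down for rejected ones, and scores new pages against the profile. The update rule below is an illustrative choice, not any particular product's.

```python
# A trainable relevance profile: per-term weights adjusted by feedback.
from collections import defaultdict
import re

class SpiderProfile:
    def __init__(self, learning_rate=0.5):
        self.weights = defaultdict(float)
        self.learning_rate = learning_rate

    @staticmethod
    def _terms(text):
        return re.findall(r"[a-z']+", text.lower())

    def train(self, page_text, relevant):
        """Nudge term weights toward (or away from) this page's vocabulary."""
        step = self.learning_rate if relevant else -self.learning_rate
        for term in set(self._terms(page_text)):
            self.weights[term] += step

    def score(self, page_text):
        return sum(self.weights[t] for t in self._terms(page_text))

profile = SpiderProfile()
profile.train("intelligent agents for web search", relevant=True)
profile.train("celebrity gossip roundup", relevant=False)
print(profile.score("survey of web search agents"))  # positive: worth fetching
```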

Robots Invade the Net. By Mike Wooldridge. CNET (March 16, 1998). "As you read these words, thousands of nonhuman entities are also using the Web, harvesting text, submitting search queries, even posing as people. They carry out their missions without a hint of compassion--which sounds bad unless they happen to be on a mission for you. They are bots (short for 'software robots'), software programs that run all by themselves on the Web, sifting through data and making their own decisions. In many ways, they hold the key to making the Web a more valuable tool." This very easy-to-read series of articles is a great place to begin to understand just what bots do and what they can do for you.

Related Web Sites

Aware, from Stottler Henke Associates, Inc. "is a new tool for searching the Internet that learns what the user is looking for and helps gather highly targeted results. Aware uses patent pending intelligent agent technology to analyze the terms and documents that are relevant to the user’s research area, enabling it to search more deeply and broadly than unaided users can."

iVia: High Octane Software for Internet portal and Virtual Library Creation and Management. "The iVia system is an INFOMINE creation generously funded by the National Science Digital Library of the National Science Foundation, the National Leadership Grant Program of the U.S. Institute of Museum and Library Services, the Fund for the Improvement of Post-Secondary Education of the U.S. Department of Education and the Library of the University of California, Riverside." As explained on the New Technologies page: "iVia utilizes a range of programs known as crawlers to traverse the Web and identify new Internet resources. iVia's crawlers are used to help identify important academic resources on the Internet. The crawlers function as 'collection development' tools."

InfoSpiders: Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery. From Filippo Menczer and the Adaptive Agents Research Group, University of Iowa. "An artificial life-inspired multi-agent adaptive system for autonomous, scalable information search in the Web." In addition to the links you'll find on this page to related news articles, papers, and even narrated demos, there's one that invites you to give a troop of spiders their marching orders:

  • "MySpiders is a java applet that uses intelligent, autonomous, adaptive software agents to search the internet on behalf of the user for information about the user's query. MySpiders complement, rather than replace, traditional search engines, by locating recent documents that may not have been indexed by search engines yet."

Letizia. From Henry Lieberman of the Media Laboratory at the Massachusetts Institute of Technology. "Letizia is a user interface agent that assists a user browsing the World Wide Web. As the user operates a conventional Web browser such as Netscape, the agent tracks user behavior and attempts to anticipate items of interest by doing concurrent, autonomous exploration of links from the user's current position. The agent automates a browsing strategy consisting of a best-first search augmented by heuristics inferring user interest from browsing behavior."

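The strategy as described can be sketched in a few lines (this is not Lieberman's code): infer an interest profile from pages the user has just visited, then rank the outgoing links of the current page to decide which to explore first while the user reads.

```python
# Letizia-style lookahead in miniature: browsing history stands in for
# user interests; outlinks of the current page are ranked against it.
from collections import Counter
import re

def interest_profile(visited_page_texts):
    """Term frequencies over recently visited pages approximate interests."""
    return Counter(w for text in visited_page_texts
                   for w in re.findall(r"[a-z']+", text.lower()))

def rank_links(outlinks, profile):
    """outlinks: {url: anchor text}. Returns URLs in best-first order."""
    def score(item):
        _, anchor = item
        return sum(profile[w] for w in re.findall(r"[a-z']+", anchor.lower()))
    return [url for url, _ in sorted(outlinks.items(), key=score, reverse=True)]

profile = interest_profile(["software agents tutorial", "agents on the web"])
print(rank_links({"/a": "more about web agents", "/b": "site map"}, profile))
```
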
The Semantic Web. From Cycorp, Inc. "The Semantic Web is an exciting vision for the future of information technology, but it is a vision that presupposes the ability to represent web content with efficiency and expressiveness. If a scalable way to add semantics to the World Wide Web (WWW) can be found, the Semantic Web will create a world where agents, search engines, and other programs can read semantic markup to decipher the real meaning of a web page. The Semantic Web-aware agents will be able to retrieve computer readable facts, integrate and reason about those facts, answer questions, solve problems, and generally bring a new level of intelligence to the WWW that is unimaginable with today’s technology. ... The key to harvesting this new semantic information will be the creation of the Semantic Web-aware agents that can cope with a diversity of meanings and inconsistencies across local ontologies. These agents will need the capability to interpret, understand, elaborate, and translate among the many heterogeneous local ontologies that will populate the Semantic Web."

  • Also see this article: Super Searches - IBM's WebFountain, a new internet tool, helps companies spot online trends before they emerge. By Laura A. Locke. Time Magazine (November 8, 2004).
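
At its simplest, the "retrieve computer readable facts, integrate and reason about those facts" capability the Cycorp page describes can be pictured as a store of (subject, predicate, object) triples, the data model underlying RDF, plus inference rules. The sketch below hard-codes a single rule (type membership propagates up subclass links); Cyc's machinery is of course far richer, and the facts shown are illustrative.

```python
# A toy triple store with one inference rule: if X isA C and
# C subclassOf D, then X isA D. Applied until a fixed point.
def infer_types(triples):
    subclass_of = {(s, o) for s, p, o in triples if p == "subclassOf"}
    types = {(s, o) for s, p, o in triples if p == "isA"}
    changed = True
    while changed:  # repeat until no new facts can be derived
        changed = False
        for thing, cls in list(types):
            for sub, sup in subclass_of:
                if cls == sub and (thing, sup) not in types:
                    types.add((thing, sup))
                    changed = True
    return types

facts = [("FlipDog", "isA", "WebCrawler"),
         ("WebCrawler", "subclassOf", "SoftwareAgent"),
         ("SoftwareAgent", "subclassOf", "Program")]
print(infer_types(facts))
# Derives ("FlipDog", "SoftwareAgent") and ("FlipDog", "Program")
```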

Softbots. Computer Science Department, University of Washington. You can read about softbot projects and view online demonstrations.

WebFountain. "IBM Research developed WebFountain, which can transform passive, reactive organizations and processes into more proactive and agile businesses that can sense and respond to real-time internal and external events, issues, and marketplace changes. ... WebFountain is comprised of three primary components: 1. The Platform: ... It has integrated miners, crawlers and applications which focus on specific or global tasks. ... 3. Multi-Disciplinary Text Analytics: WebFountain provides an integrated platform for multi-disciplinary text analytic approaches. This includes natural language processing, statistics, probabilities, machine learning, pattern recognition and artificial intelligence."

WebMate. From the Software Agents Group at Carnegie Mellon University. "WebMate, a personal digital assistant, is a promising solution to the problem of finding useful information among a sea of texts and other web documents."

Web Robots Pages. Includes a FAQ, a database of current webcrawlers, some online articles and a few related web sites.
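
Python's standard library can consult the robots exclusion standard these pages document: a well-behaved crawler checks a site's robots.txt before fetching, roughly as below. The URL and user-agent string are illustrative, and the read() call needs network access.

```python
# Checking robots.txt before crawling, via the standard library.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("http://example.com/robots.txt")
parser.read()  # fetch and parse the file (requires network access)
if parser.can_fetch("MyResearchBot/1.0", "http://example.com/jobs/"):
    print("allowed to crawl")
else:
    print("disallowed; skipping")
```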

Related Pages

More Readings

Leonard, Andrew. 1997. Bots: The Origin of New Species. San Francisco: Hardwired. Surveys the vast spectrum of software agents--from bots that retrieve information to bots that chat--and compares them to evolving organisms.