Data Mining and Discovery

(a subtopic of Machine Learning)

Why do we not, since the phenomena are well known, build a "knowledge refinery" as the basis of a new industry, comparable in some ways to the industry of petroleum refining, only more important in the long run? The product to be refined is codified human knowledge.

-Donald Michie, A Prototype Knowledge Refinery

Data mining is an AI-powered tool that can discover useful information in a database, which can then be used to improve actions. To appreciate why businesses are so excited about data mining, you need only imagine that a major department store chain is looking for ways to boost sales. The chain has a large database containing information about customers and the nature of their purchases (with particulars such as identity of items, price, date, and time of sale). Suppose a data mining utility unearthed a pattern in the data which indicated that customers who shopped on Saturday afternoons and who made their initial purchase of the day in the shoe department tended to make, on average, 4 additional purchases from other departments and that the average member of this group spent more per visit than the typical shopper. Can you now envision the sort of advertising campaign that the department store chain might want to embark upon?
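The department store scenario can be made concrete with a small sketch. The code below only checks one candidate pattern (Saturday afternoon visits that begin in the shoe department) against a handful of made-up visit records; a real data mining utility would search over many candidate patterns on its own. The `Visit` record, the field names, the afternoon hours (12 to 18), and all of the data are assumptions made for illustration, not part of any system described here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Visit:
    customer: str
    weekday: str          # e.g. "Sat"
    hour: int             # hour of the first purchase, 0-23
    departments: list     # departments visited, in purchase order
    total_spent: float    # total spent during the visit


def segment_summary(visits, weekday="Sat", first_dept="shoes",
                    start_hour=12, end_hour=18):
    """Summarize visits on `weekday` afternoon whose first purchase was in `first_dept`."""
    segment = [
        v for v in visits
        if v.weekday == weekday
        and start_hour <= v.hour < end_hour
        and v.departments
        and v.departments[0] == first_dept
    ]
    if not segment:
        return None
    return {
        "visits": len(segment),
        # Purchases after the first one in the visit.
        "avg_additional_purchases": mean(len(v.departments) - 1 for v in segment),
        "avg_spent_segment": mean(v.total_spent for v in segment),
        "avg_spent_all": mean(v.total_spent for v in visits),
    }


# Hypothetical toy data, invented for this example.
visits = [
    Visit("a", "Sat", 14, ["shoes", "menswear", "toys", "home", "cosmetics"], 310.0),
    Visit("b", "Sat", 15, ["shoes", "toys", "home", "menswear", "books"], 290.0),
    Visit("c", "Tue", 10, ["books"], 20.0),
    Visit("d", "Sat", 9, ["shoes"], 60.0),
]

summary = segment_summary(visits)
print(summary)
```

On this toy data the segment spends more per visit than the average shopper, which is the kind of finding that might motivate a targeted campaign.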

Good Places to Start

A golden vein - Computing: Analysis of customer information, better known as “data mining”, is finally delivering on its promises—and expanding into some promising new areas. The Economist Technology Quarterly (June 10, 2004). "In the old days, knowing your customers was part and parcel of running a business, a natural consequence of living and working in a community. But for today's big firms, it is much more difficult: a big retailer such as Wal-Mart has no chance of knowing every single one of its customers. So the idea of gathering huge amounts of information and analysing it to pick out trends indicative of customers' wants and needs -- data mining -- has long been trumpeted as a way to return to the intimacy of a small-town general store. But for many years, data mining's claims were greatly exaggerated. ... In recent years, however, improvements in both hardware and software, and the rise of the world wide web, have enabled data mining to start delivering on its promises."

Knowledge Discovery in Databases: Tools and Techniques. By Peggy Wright. Crossroads. 1998. "The purpose of this paper is to present the results of a literature survey outlining the state-of-the-art in KDD techniques and tools. The paper is not intended to provide an in-depth introduction to each approach; rather, we intend it to acquaint the reader with some KDD approaches and potential uses."

Eureka! Knowledge Discovery. By Neena Buck. Software Magazine. December 2000/January 2001 cover story. "Knowledge discovery and data mining (KDD) is evolving from an esoteric art and a point solution, to a mainstream technology embedded in a variety of solutions, to help businesses turn information into insight."

Research: From lab to market. By Michael Kanellos. CNET News (June 16, 2004). "Data mining, the ability to find unexpected patterns in accumulated data, was born during a lunch break. At a customer conference in the early 1990s, an executive at British department store chain Marks & Spencer was explaining his database woes to Rakesh Agrawal, an information retrieval specialist at IBM. The store was collecting all sorts of data but didn't know what to do with it. So Agrawal and his team began devising algorithms for asking open-ended queries, eventually authoring a 1993 paper that would become required reading in data-mining science. The report has been cited in more than 650 other studies, making it one of the most widely cited papers of its kind. ... Agrawal, the data-mining pioneer, is today working on a system that will scramble customer data in a way that will allow companies to study buying trends or other patterns while preserving strict privacy."

... and here is a link to Rakesh Agrawal's publications.

Advanced Scout: Data Mining and Knowledge Discovery in NBA Data, a Brief Application Description by Inderpal Bhandari, et al. Data Mining and Knowledge Discovery 1, 121-125 (1997). Available from NEC ResearchIndex. "We describe Advanced Scout software from the perspective of data mining and knowledge discovery. This paper highlights the pre-processing of raw data that the program performs, describes the data mining aspects of the software and how the interpretation of patterns supports the process of knowledge discovery."

Data Mining Research: Opportunities and Challenges. A Report of three NSF Workshops on Mining Large, Massive, and Distributed Data. By Robert Grossman, Simon Kasif, Reagan Moore, David Rocke, and Jeff Ullman. January 1999. "Data mining is the semi-automatic discovery of patterns, associations, changes, anomalies, rules, and statistically significant structures and events in data. That is, data mining attempts to extract knowledge from data. Data mining differs from traditional statistics in several ways:...." And be sure to read the impressive "Success Stories" in Section 5! Made available by The National Center for Data Mining (NCDM) at the University of Illinois at Chicago (UIC).

Mining for trends at the help desk. By John Boyd. IBM Think Research (1999). "Ordinary data mining simply looks for keywords, but the text-mining system -- dubbed TAKMI (an abbreviation for Text Analysis and Knowledge Mining but also a Japanese word meaning 'skilled craftsman') -- spots grammatical relationships, as well. Knowing which word is the subject, which the verb, and which the object, TAKMI can categorize calls according to whether they are, say, complaints or questions and according to the product that is causing difficulty." Also see:

Knowledge Discovery & Data Mining Research at IBM. "The challenge of extracting knowledge from data draws upon research in statistics, databases, pattern recognition, machine learning, data visualization, optimization, and high-performance computing, to deliver advanced business intelligence and web discovery solutions."

Financial Services data mining example: Identifying risky borrowers. From Salford Systems. "To introduce you to data mining with the CART decision tree software we are going to walk through a real world example drawn from the Financial Services industry. The database is an extract from a group of customers who selected a financial loan product, some of whom went 'BAD'. The information we will make use of comes from standard credit reports provided by all the major credit bureaus...."

Data Mining Glossary. From Two Crows Corporation.

Readings Online

Data Mining and Knowledge Discovery, published by Kluwer Academic Publishers.

The DBMS Guide to Data Mining Solutions (1998). A collection of articles by Estelle Brand and Rob Gerritsen. Titles include Predicting Credit Risk, Neural Networks, and Decision Trees.

Data-Mining. California Computer News (October 27, 2004). "The Andrew W. Mellon Foundation is funding the two-year, nearly $600,000 multi-institutional project, which John Unsworth, dean of Illinois' Graduate School of Library and Information Science (GSLIS), will lead. In his winning project, titled 'Web-based Text-Mining and Visualization for Humanities Digital Libraries,' Unsworth expects to produce software 'for discovering, visualizing and exploring significant patterns across large collections of full-text humanities resources in digital libraries and collections.' ... In traditional 'search-and-retrieval' projects, scholars bring specific queries to collections of text and get back more or less useful answers to those queries, Unsworth said. 'By contrast, the goal of data-mining, including text-mining, is to produce new knowledge by exposing unanticipated similarities or differences, clustering or dispersal, co-occurrence and trends.' ... With its roots in statistics, artificial intelligence and machine learning, data-mining has been around since the 1990s. ... With data-mining tools, Unsworth said, you first select a body of material that you think is important in some way, next select features of those materials that you similarly think are important, and then 'map the occurrence of those features in the selected materials to see whether patterns emerge. If patterns do emerge, you analyze them and from that analysis emerges -- if you are lucky -- new insights into the materials.' For example, in the planning grant for this project, members of his research team, using the full set of Shakespeare's plays, selected five 'circulation-of characters' features...."

Duo-Mining - Combining Data and Text Mining. By Guy Creese. DMReview.com (September 16, 2004). "As standalone capabilities, the pattern-finding technologies of data mining and text mining have been around for years. However, it is only recently that enterprises have started to use the two in tandem - and have discovered that it is a combination that is worth more than the sum of its parts. First of all, what are data mining and text mining? They are similar in that they both 'mine' large amounts of data, looking for meaningful patterns. However, what they analyze is quite different. ... Collections and recovery departments in banks and credit card companies have used duo-mining to good effect. Using data mining to look at repayment trends, these enterprises have a good idea on who is going to default on a loan, for example. When logs from the collection agents are added to the mix, the understanding gets even better. For example, text mining can understand the difference in intent between, 'I will pay,' 'I won't pay,' 'I paid' and generate a propensity to pay score - which, in turn, can be data mined. To take another example, if a customer says, 'I can't pay because a tree fell on my house;' all of a sudden it is clear that it's not a 'bad' delinquency - but rather a sales opportunity for a home loan."
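As a rough illustration of the duo-mining idea in the excerpt above, the sketch below turns a collection agent's free-text note into a propensity-to-pay score and attaches it to a structured account record, so that the score can be mined alongside the other fields. The literal phrase matching, the score values, and the field names are all assumptions made for illustration; the article does not describe how any real product computes the score, and a real text-mining system would rely on linguistic analysis rather than substring lookup.

```python
# Hypothetical phrase-to-score table (values invented for this example).
INTENT_SCORES = {
    "i paid": 1.0,
    "i will pay": 0.7,
    "i won't pay": 0.0,
}


def propensity_to_pay(note, default=0.5):
    """Return a score for the first known phrase found in the note, else `default`."""
    text = note.lower()
    for phrase, score in INTENT_SCORES.items():
        if phrase in text:
            return score
    return default


def combine(account, note):
    """Copy a structured account record and add the text-derived score to it."""
    record = dict(account)
    record["propensity_to_pay"] = propensity_to_pay(note)
    return record


# Hypothetical account record and agent note.
account = {"id": 1, "days_late": 40}
record = combine(account, "Customer said: I will pay on Friday.")
print(record)
```

The combined record is what a downstream data mining step would consume; the original account dictionary is left unchanged.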

Data Mining. Edmund X. DeJesus' introduction to this collection of three articles from the October 1995 issue of Byte begins with: "There's gold in your data, but you can't see it." The three articles which follow this introduction are: The Data Gold Rush, by Sara Reese Hedberg; A Data Miner's Tools, by Karen Watterson; and, Data-Mining Dynamite, by Cheryl D. Krivda.

The Rebirth of Artificial Intelligence. Lisa DiCarlo. Forbes (May 16, 2000). "Oracle is promoting its Intelligent WebHouse tools. These tools give companies a detailed survey of their Web-surfing customers, determining what sites they have visited before and what their relationship is to that site. This, Howard says, 'enables companies to do a better job cross-selling and up-selling customers. You can [discover] sales programs on other sites and do competitive analysis.'"

The race to computerise biology. The Economist Technology Quarterly (December 12, 2002). "It is in data mining, however, where bioinformatics hopes for its biggest pay-off. First applied in banking, data mining uses a variety of algorithms to sift through storehouses of data in search of 'noisy' patterns and relationships among the different silos of information. The promise for bioinformatics is that public genome data, mixed with proprietary sequence data, clinical data from previous drug efforts and other stores of information, could unearth clues about possible candidates for future drugs."

Data Mining: Exploiting the Hidden Trends in Your Data. By Herb Edelstein. DB2 Online Magazine (Spring 1997). "Essentially, data mining discovers patterns and relationships hidden in your data. It's part of a larger process called knowledge discovery; specifically, the step in which advanced statistical analysis and modeling techniques are applied to the data to find useful patterns and relationships. The knowledge-discovery process as a whole is essential for successful data mining because it describes the steps you must take to ensure meaningful results."

From Data Mining to Knowledge Discovery in Databases. Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth (1996). AI Magazine 17(3): 37-54. "Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. What is all the excitement about? This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases. The article mentions particular real-world applications, specific data-mining techniques, challenges involved in real-world applications of knowledge discovery, and current and future research directions in the field."

Knowledge Discovery in Databases: An Overview. By William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus. AI Magazine 13(3): Fall 1992, 57-70. "Definition of Knowledge Discovery: Knowledge discovery is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. Given a set of facts (data) F, a language L, and some measure of certainty C, we define a pattern as a statement S in L that describes relationships among a subset F_S of F with a certainty c, such that S is simpler (in some sense) than the enumeration of all facts in F_S. A pattern that is interesting (according to a user-imposed interest measure) and certain enough (again according to the user’s criteria) is called knowledge. The output of a program that monitors the set of facts in a database and produces patterns in this sense is discovered knowledge."
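One way to read the definition above is sketched below. It makes several assumptions that the quoted text does not: facts are dictionaries, a pattern is an if-then rule given as two predicates, the certainty c is taken to be the fraction of covered facts (F_S) for which the rule holds, and the user-imposed interest measure is taken to be coverage (the share of all facts the rule applies to). The requirement that S be simpler than enumerating F_S is not modeled. All names and data are hypothetical.

```python
def evaluate_pattern(facts, applies_to, holds_for):
    """Return (certainty, coverage) for the rule 'if applies_to(f) then holds_for(f)'."""
    covered = [f for f in facts if applies_to(f)]  # plays the role of F_S
    if not covered:
        return 0.0, 0.0
    certainty = sum(1 for f in covered if holds_for(f)) / len(covered)
    coverage = len(covered) / len(facts)
    return certainty, coverage


def is_knowledge(certainty, interest, min_certainty, min_interest):
    """A pattern counts as knowledge if it meets both user-chosen thresholds."""
    return certainty >= min_certainty and interest >= min_interest


# Hypothetical facts, invented for this example.
facts = [
    {"income": 20, "defaulted": True},
    {"income": 25, "defaulted": True},
    {"income": 28, "defaulted": False},
    {"income": 50, "defaulted": False},
    {"income": 70, "defaulted": False},
]

# Candidate pattern: "if income < 30 then defaulted".
certainty, coverage = evaluate_pattern(
    facts,
    applies_to=lambda f: f["income"] < 30,
    holds_for=lambda f: f["defaulted"],
)
print(certainty, coverage, is_knowledge(certainty, coverage, 0.6, 0.5))
```

The thresholds are the "user's criteria" in the definition: raising `min_certainty` above two thirds would cause the same pattern to be rejected.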

Knowledge-based Scientific Discovery from Geological Databases. By C. Li and G. Biswas. (1995). "It is common knowledge in the oil industry that the typical cost of drilling a new offshore well is in the range of $30-40 million, but the chance of that site being an economic success is 1 in 10. Recent advances in drilling technology and data collection methods have led to oil companies and their ancillaries collecting large amounts of geophysical/geographical data ... Can this vast amount of history from previously explored fields be systematically utilized to evaluate new plays and prospects?"

Software: Text Mining. By Cade Metz. One of PC Magazine's Future Tech - 20 Hot Technologies to Watch (July 1, 2003). "Text-mining software is one of the front-line tools that the government is now using to tease out valuable connections. These specialized search engines can quickly sift through mountains of unstructured text -- anything that's not carefully arranged in a database or spreadsheet -- and pull out the meaningful stuff. They can infer relationships within data that are not stated explicitly."

Machine Learning, Neural and Statistical Classification. D. Michie, D.J. Spiegelhalter, C.C. Taylor (eds). "[This] book (originally published in 1994 by Ellis Horwood) is now out of print. The copyright now resides with the editors who have decided to make the material freely available on the web."

Machine Learning and Data Mining. By Tom M. Mitchell, Center for Automated Learning and Discovery at Carnegie Mellon University. (1999). Communications of the ACM, Vol. 42, No. 11; pages 30 - 36. "To illustrate one important research issue, consider again the problem of predicting risk of emergency C-section for pregnant women. One key limitation of current data mining methods is that in fact they cannot utilize the full patient record that is already routinely captured in hospital medical records! This is because current hospital records for pregnant women often contain sequences of images (e.g., the ultrasound images taken during pregnancy), other raw instrument data...."

Statistical Data Mining Tutorials - Tutorial Slides by Andrew Moore, professor of Robotics and Computer Science at the School of Computer Science, Carnegie Mellon University. "The following links point to a set of tutorials on many aspects of statistical data mining, including the foundations of probability, the foundations of statistical data analysis, and most of the classic machine learning and data mining algorithms."

Virtual Prospecting. From oil exploration to neurosurgery, new tools are revealing the secrets hidden in mountains of data. By Otis Port. BusinessWeek Online. (March 23, 2001).

Smart Tools - Companies in health care, finance, and retailing are using artificial-intelligence systems to filter huge amounts of data and identify suspicious transactions. By Otis Port, with Michael Arndt and John Carey. Business Week's 2003 edition of The BusinessWeek50.

Coaxing Meaning Out Of Raw Data. By John W. Verity. Business Week (February 3, 1997). "First developed to help scientists make sense of experimental data, this software has enough smarts to 'see' meaningful patterns and relationships on its own--to see patterns that might otherwise take tens of man-years to find. That's a huge leap beyond conventional computer databases, which are powerful but unimaginative: They must be told precisely what to look for. Data-mining tools can sift through immense collections of customer, marketing, production, and financial data and, using statistical and artificial-intelligence techniques, identify what's worth noting and what's not."

Business Intelligence - The Value in Mining Data. By Jonathan Wu. DM Review (February 2002). "Data mining can best be described as a business intelligence (BI) technology that has various techniques to extract comprehensible, hidden and useful information from a population of data. This BI technology makes it possible to discover hidden trends and patterns in large amounts of data. The output of a data mining exercise can take the form of patterns, trends or rules that are implicit in the data. ... The following are examples of practical uses of data mining and the value it provides those who use this technology to mine their data. ... Fraud Detection ... Inventory Logistics ... Defect Analysis ... Focused Hiring."

Related Web Sites

ACM Special Interest Group on Knowledge Discovery in Data and Data Mining.

AI on the Web: Machine Learning. A resource companion to Stuart Russell and Peter Norvig's "Artificial Intelligence: A Modern Approach" with links to reference material, people, research groups, books, companies and much more.

The Data Mine. Maintained by Andy Pryke.

Data Mining Product Features. Profiles of, and links to, many data mining commercial products. From Exclusive Ore Inc.

"The Intelligent Data Understanding (IDU) subproject of NASA's Intelligent Systems Project develops techniques for transforming data into scientific understanding."

"The IDU Data Mining (DM) technical area is about techniques for processing and combining raw data -- from large, distributed, heterogeneous, multidimensional data sets with complex spatial and/or temporal dynamics -- to detect patterns and regularities. This includes techniques for dealing with highly skewed, non-representative data (as when searching large datasets for 'small' patterns, such as failure modes within operations logs)."

"The IDU Machine Learning (ML) technical area -- or Machine Learning for Decision-Making and Action -- focuses on the ultimate use of knowledge extracted from data. It develops data-driven techniques to assist NASA scientists and engineers in monitoring, controlling, and maintaining complex systems."

"The IDU Knowledge Discovery (KD) technical area -- or, more properly, Knowledge Discovery for Understanding and Analysis -- develops interactive methods for discovering classification rules and inferring causation, starting from background knowledge and observed associations. New KD methods can help scientists and engineers understand causal relationships and processes in the physical world."

Research Task List

KDnuggets. Offers links to reference collections, newsletters, mailing lists, datasets, companies, job openings, competitions, and more, including:

Software for Data Mining and Knowledge Discovery:
- Data-mining software guide: Categories include: Agents; Bayesian and Dependency Networks; Deviation and Fraud Detection; and Text Analysis, Text Mining, and Information Retrieval (IR).
- Domain-specific data-mining solutions. Categories include: Bioinformatics and Pharmaceutical; eCommerce; Personalization; Sports and Entertainment; and Travel.

Knowledge Discovery Laboratory at the University of Massachusetts Amherst, Department of Computer Science. "KDL investigates how to find useful patterns in large and complex databases. We study the underlying principles of data analysis algorithms, develop innovative techniques for knowledge discovery, and apply those techniques to practical tasks in areas such as fraud detection, scientific data analysis, and web mining."

Microsoft's Machine Learning and Applied Statistics (MLAS) group "is focused on learning from data and data mining. By building software that automatically learns from data, we enable applications that (1) do intelligent tasks such as handwriting recognition and natural-language processing, and (2) help human data analysts more easily explore and better understand their data."

Related Pages

Astronomy & Space Exploration
Banking
Business
Ethical & Social Implications
Fraud Detection & Prevention
General Index to AI in the news: Machine Learning
Knowledge Management
Law Enforcement
Machine Learning
Marketing, Customer Relations & E-Commerce
Scientific Discovery

More Readings

Toward Automated Discovery in the Biological Sciences. By Bruce G. Buchanan and Gary R. Livingston. AI Magazine (Spring 2004) 25(1): 69-84. "Knowledge discovery programs in the biological sciences require flexibility in the use of symbolic data and semantic information. Because of the volume of nonnumeric, as well as numeric, data, the programs must be able to explore a large space of possibly interesting relationships to discover those that are novel and interesting. Thus, the framework for the discovery program must facilitate proposing and selecting the next task to perform and performing the selected tasks. ... Our results demonstrate that both reasons given for performing tasks and estimates of the interestingness of the concepts and hypotheses examined by HAMB contribute to its performance and that the program can discover novel, interesting relationships in biological data."

Artificial Intelligence and Link Analysis - Papers from the 1998 AAAI Fall Symposium. David Jensen and Henry Goldberg, Program Cochairs. "Computer-based link analysis is increasingly used in law enforcement investigations, insurance fraud detection, telecommunications network analysis, pharmaceuticals research, epidemiology, and a host of other specialized applications. Link analysis explores associations among large numbers of objects of different types. For example, a law enforcement application might examine familial relationships among suspects and victims, the addresses at which those persons reside, and the telephone numbers that they called during a specified period. The ability of link analysis to represent relationships and associations among objects of different types has proven crucial in assisting human investigators to comprehend complex webs of evidence and draw conclusions that are not apparent from any single piece of information. However, there is both a need and opportunity to apply new technologies. Much of the current software for link analysis is little more than a graphical display tool. While visualizing networks has proven useful, some advanced applications of link analysis involve tens of thousands of objects and links as well as a rich array of possible data models. Manual construction and analysis of such networks has proven difficult. In addition, a large number of related techniques in artificial intelligence and several other fields have the potential to assist human reasoning about complex networks of relationships. These techniques draw on work from search, semantic networks, ontological engineering, autonomous agents, inductive logic programming, graph theory, social network analysis, knowledge discovery in databases, entity-relationship modeling, information extraction, information retrieval, and metaphor."

Proceedings of the International Conference on Knowledge Discovery and Data Mining. From the AAAI Press.
