Data Mining & Warehousing
SUBBA REDDY.P.V BALAJI.S
Reg No: 08L21A1203 Reg No: 08L21A1284
III B.Tech (Information Technology) III B.Tech(Information Technology)
Email: 08L21A1203@gmail.com Email: email@example.com
VAAGDEVI INSTITUTE OF TECHNOLOGY & SCIENCE
PEDDASETTIPALII (VI), PRODDATUR-516361
In today's fast-paced, information-based economy, companies must be able to integrate vast amounts of heterogeneous data and applications from disparate sources in order to support strategic IT initiatives such as Business Intelligence, Business Process Management, Business Process Reengineering, Business Activity Monitoring and Business Performance Management. Since its inception, it has continued to build on its unique software architecture to make the integration process easier to learn and use, faster to implement and maintain, and able to operate at the best possible performance: in other words, Simply Faster Integration.
Relational database management systems (RDBMS) are designed to store data according to the most efficient method of data cataloging, which is that defined by mathematical set theory as expressed in the relational paradigm. In many cases, however, the most efficient method for cataloging data is not the most efficient method for storing and retrieving such data. Relational databases do well where the data is most appropriately managed as flat lists of simple data types with few associations to data in other lists. When data must be kept in complex interdependent structures, or must be rapidly retrieved by following paths of associations rather than by simply walking down simple lists, the relational database begins to show burdensome characteristics such as multiple-index management and traversal and complex normalized schema structures. These impediments, along with limits on row length or table size, can in some cases be such profound encumbrances that an RDBMS must be regarded as impractical for certain data management tasks. Although leading RDBMS vendors have been introducing features that enable their products to support data outside the relational paradigm, the fundamental means of managing and accessing such data remains relational and, for the most part, SQL based. This fact will continue to make RDBMS products unnecessarily difficult to set up and manage, and too inefficient, for some kinds of databases.
An Introduction to Data Mining
Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line. When implemented on high performance client/server or parallel processing computers, data mining tools can analyze massive databases to deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?"
This paper provides an introduction to the basic technologies of data mining. Examples of profitable applications illustrate its relevance to today's business environment, and a basic description shows how data warehouse architectures can evolve to deliver the value of data mining to end users. The past two decades have seen a dramatic increase in the amount of information or data being stored in electronic format. This accumulation of data has taken place at an explosive rate.
Figure 1 shows the data explosion and the growing base of data.
Data storage became easier as large amounts of computing power became available at low cost; that is, the falling cost of processing power and storage made data cheap to keep.
Representation of Data in Data Mining
A cube is used to represent multidimensional data. The cube is created from a star schema or snowflake schema of tables. The star schema (sometimes referred to as a star join schema) is the simplest data warehouse schema. It consists of a single "fact table" with a compound primary key, with one segment for each "dimension", and with additional columns of additive, numeric facts. The snowflake schema is a variation of the star schema used in a data warehouse. Both the snowflake and the star schema are methods of storing data that are multidimensional in nature (i.e. that can be analyzed by any or all of a number of independent factors) in a relational database.
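The star schema described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table names, column names, and sample rows are hypothetical and do not come from any particular warehouse. The fact table carries a compound primary key with one segment per dimension plus two additive numeric facts, and the helper function sums one fact along a chosen dimension, which is one way of looking at a face of the cube.

```python
import sqlite3

# Hypothetical star schema: one fact table keyed by three dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    units      INTEGER,  -- additive numeric fact
    revenue    REAL,     -- additive numeric fact
    PRIMARY KEY (product_id, store_id, date_id)  -- one key segment per dimension
);
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)", [(1, "pen"), (2, "ink")])
conn.executemany("INSERT INTO dim_store VALUES (?, ?)", [(1, "north"), (2, "south")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?)", [(1, 2009), (2, 2010)])
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 1, 10, 50.0), (1, 2, 1, 4, 20.0), (2, 1, 2, 3, 45.0), (1, 1, 2, 6, 30.0)],
)


def revenue_by(dimension_table, key, label):
    """Sum revenue along one dimension of the cube."""
    sql = (
        f"SELECT d.{label}, SUM(f.revenue) FROM fact_sales f "
        f"JOIN {dimension_table} d ON f.{key} = d.{key} "
        f"GROUP BY d.{label} ORDER BY d.{label}"
    )
    return conn.execute(sql).fetchall()


by_product = revenue_by("dim_product", "product_id", "name")
by_year = revenue_by("dim_date", "date_id", "year")
print(by_product)
print(by_year)
```

The same fact rows answer both questions, by product and by year, without being stored twice. That is the sense in which the data can be analyzed by any or all of a number of independent factors.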
Fig. 1. Cube presentation of data in multidimensional model
An Architecture for Data Mining
To be applied to best effect, these advanced techniques must be fully integrated with a data warehouse as well as with flexible interactive business analysis tools. Many data mining tools currently operate outside of the warehouse, requiring extra steps for extracting, importing, and analyzing the data. Furthermore, when new insights require operational implementation, integration with the warehouse simplifies the application of results from data mining. The resulting analytic data warehouse can be applied to improve business processes throughout the organization, in areas such as promotional campaign management, fraud detection, new product rollout, and so on.
The term data mining has been stretched beyond its limits to apply to any form of data analysis. Some of the numerous definitions of Data Mining, or Knowledge Discovery in Databases are:
Data Mining, or Knowledge Discovery in Databases (KDD) as it is also known, is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. This encompasses a number of different technical approaches, such as clustering, data summarization, learning classification rules, finding dependency networks, analyzing changes, and detecting anomalies.
Data mining is the search for relationships and global patterns that exist in large databases but are `hidden' among the vast amount of data, such as a relationship between patient data and their medical diagnosis. These relationships represent valuable knowledge about the database and the objects in the database and, if the database is a faithful mirror, about the real world registered by the database.
The following diagram summarizes some of the stages/processes identified in data mining and knowledge discovery.
The phases depicted start with the raw data and finish with the extracted knowledge which was acquired as a result of the following stages:
1 Selection: Selecting or segmenting the data according to some criteria, e.g. all those people who own a car. In this way subsets of the data can be determined.
2 Preprocessing: This is the data cleansing stage, where information that is deemed unnecessary and may slow down queries is removed; for example, it is unnecessary to note the sex of a patient when studying pregnancy. The data is also reconfigured to ensure a consistent format, because data drawn from several sources may be formatted inconsistently, e.g. sex may be recorded as f or m and also as 1 or 0.
3 Transformation: The data is not merely transferred across but transformed, in that overlays may be added, such as the demographic overlays commonly used in market research. The data is made usable and navigable.
4 Data mining: This stage is concerned with the extraction of patterns from the data. A pattern can be defined as follows: given a set of facts (data) F, a language L, and some measure of certainty C, a pattern is a statement S in L that describes relationships among a subset Fs of F with a certainty c, such that S is simpler in some sense than the enumeration of all the facts in Fs.
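The first three stages above can be sketched in a few lines of Python. The records, the field names, and the age-band overlay are hypothetical, and the reading of 1 as f and 0 as m is an assumption made only for this illustration; a real system would take that mapping from the documentation of each source.

```python
# Hypothetical raw records drawn from two sources with different sex codings.
raw = [
    {"name": "A", "owns_car": True,  "sex": "f", "age": 34, "fax": "n/a"},
    {"name": "B", "owns_car": False, "sex": "m", "age": 51, "fax": "n/a"},
    {"name": "C", "owns_car": True,  "sex": 1,   "age": 67, "fax": "n/a"},
    {"name": "D", "owns_car": True,  "sex": 0,   "age": 22, "fax": "n/a"},
]

# Assumption for this sketch only: 1 stands for f and 0 stands for m.
SEX_CODES = {"f": "f", "m": "m", 1: "f", 0: "m"}


def select(records):
    """Selection: keep only the subset that meets a criterion (car owners)."""
    return [r for r in records if r["owns_car"]]


def preprocess(records):
    """Preprocessing: drop an unneeded field and make the sex coding consistent."""
    cleaned = []
    for r in records:
        row = {k: v for k, v in r.items() if k != "fax"}
        row["sex"] = SEX_CODES[row["sex"]]
        cleaned.append(row)
    return cleaned


def transform(records):
    """Transformation: add an overlay, here a coarse age band."""
    return [dict(r, age_band="60+" if r["age"] >= 60 else "under 60") for r in records]


prepared = transform(preprocess(select(raw)))
for row in prepared:
    print(row)
```

The output of the last stage is the input that a pattern-extraction step would then work on.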
Applications of Data Mining
Data mining has many and varied fields of application, some of which are listed below.
1. Retail/Marketing
1 Identify buying patterns from customers
2 Find associations among customer demographic characteristics
3 Market basket analysis
2. Banking
1 Detect patterns of fraudulent credit card use
2 Identify `loyal' customers
3 Predict customers likely to change their credit card affiliation
4 Determine credit card spending by customer groups
3. Insurance and Health Care
1 Claims analysis, i.e. which medical procedures are claimed together
2 Predict which customers will buy new policies
3 Identify behaviour patterns of risky customers
4. Medicine
1 Characterise patient behaviour to predict office visits
2 Identify successful medical therapies for different illnesses
Data Mining Functions
Data mining methods may be classified by the function they perform or according to the class of application they can be used in. Some of the main techniques used in data mining are…
1. Classification
Data mining tools have to infer a model from the database, and in the case of supervised learning this requires the user to define one or more classes. The database contains one or more attributes that denote the class of a tuple; these are known as predicted attributes, whereas the remaining attributes are called predicting attributes. A combination of values for the predicted attributes defines a class.
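The distinction between predicting and predicted attributes can be made concrete with a very small sketch. The technique used here is a one-attribute majority rule, chosen only because it is short; it is not a method named in this paper, and the training tuples and attribute names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical training tuples: 'income' and 'owns_home' are predicting
# attributes, 'responds' is the predicted attribute (the class).
training = [
    {"income": "high", "owns_home": "yes", "responds": "yes"},
    {"income": "high", "owns_home": "no",  "responds": "yes"},
    {"income": "low",  "owns_home": "yes", "responds": "no"},
    {"income": "low",  "owns_home": "no",  "responds": "no"},
    {"income": "high", "owns_home": "yes", "responds": "no"},
    {"income": "high", "owns_home": "no",  "responds": "yes"},
]


def learn_rule(rows, predicting, predicted):
    """Map each value of one predicting attribute to its majority class."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[predicting]][row[predicted]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def accuracy(rule, rows, predicting, predicted):
    """Fraction of rows whose class the rule reproduces."""
    hits = sum(rule[row[predicting]] == row[predicted] for row in rows)
    return hits / len(rows)


rule = learn_rule(training, "income", "responds")
score = accuracy(rule, training, "income", "responds")
print(rule, score)
```

The inferred model is simply the dictionary `rule`. The accuracy is measured on the training tuples themselves, so it says nothing about how the rule would perform on new data.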
2. Associations
Given a collection of items and a set of records, each of which contains some number of items from the given collection, an association function is an operation against this set of records which returns affinities or patterns that exist among the collection of items. These patterns can be expressed by rules such as "72% of all the records that contain items A, B and C also contain items D and E." The specific percentage of occurrences (in this case 72) is called the confidence factor of the rule. Also, in this rule, A, B and C are said to be on the opposite side of the rule from D and E. Associations can involve any number of items on either side of the rule.
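The confidence factor can be computed directly from its definition. In the sketch below the records are hypothetical and the function name `confidence` is chosen only for illustration. It counts the records that contain every item on the left side of the rule and reports the share of those that also contain every item on the right side.

```python
def confidence(records, left, right):
    """Share of records containing all of `left` that also contain all of `right`."""
    left, right = set(left), set(right)
    with_left = [r for r in records if left <= r]
    if not with_left:
        return 0.0  # the rule never applies, so there is nothing to measure
    return sum(1 for r in with_left if right <= r) / len(with_left)


# Hypothetical records, each a set of items.
records = [
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C", "D"},
    {"A", "B", "C", "D", "E"},
    {"A", "B"},
    {"D", "E"},
]

rule_confidence = confidence(records, {"A", "B", "C"}, {"D", "E"})
print(rule_confidence)
```

Four records contain A, B and C, and three of those also contain D and E, so the rule holds with a confidence factor of 75% on this sample.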
Comprehensive data warehouses that integrate operational data with customer, supplier, and market information have resulted in an explosion of information. Competition requires timely and sophisticated analysis on an integrated view of the data. However, there is a growing gap between more powerful storage and retrieval systems and the users’ ability to effectively analyze and act on the information they contain. Both relational and OLAP technologies have tremendous capabilities for navigating massive data warehouses, but brute force navigation of data is not enough. A new technological leap is needed to structure and prioritize information for specific end-user problems. The data mining tools can make this leap. Quantifiable business benefits have been proven through the integration of data mining with current information systems, and new products are on the horizon that will bring this integration to an even wider audience of users.
When your strategy is deep and far reaching, then what you gain by your calculations is much, so you can win before you even fight. When your strategic thinking is shallow and near-sighted, then what you gain by your calculations is little, so you lose before you do battle. Much strategy prevails over little strategy, so those with no strategy can only be defeated. So it is said that victorious warriors win first and then go to war, while defeated warriors go to war first and then seek to win. It is obvious to anyone that culls through the voluminous information technology (I/T) literature, attends industry seminars, user group meetings or expositions, reads the ever accelerating new product announcements of I/T vendors, or listens to the advice of industry gurus and analysts, that there are four subjects that overwhelmingly dominate I/T industry attention as we move into the late 1990s:
Why we need Data Warehousing
Data mining potential can be enhanced if the appropriate data has been collected and stored in a data warehouse. A data warehouse is a relational database management system (RDBMS) designed specifically to meet the needs of decision support rather than those of transaction processing systems. It can be loosely defined as any centralized data repository which can be queried for business benefit, but this will be more clearly defined later.
Data warehousing is a powerful new technique that makes it possible to extract archived operational data and overcome inconsistencies between different legacy data formats. As well as integrating data throughout an enterprise, regardless of location, format, or communication requirements, it makes it possible to incorporate additional or expert information. It is the logical link between what the managers see in their decision support EIS applications and the company's operational activities.
In other words the data warehouse provides data that is already transformed and summarized, therefore making it an appropriate environment for more efficient DSS and EIS applications.
Characteristics of a Data Warehouse
According to Bill Inmon, author of Building the Data Warehouse and the guru who is widely considered to be the originator of the data warehousing concept, there are generally four characteristics that describe a data warehouse:
1 Subject-Oriented: Data are organized according to subject instead of application e.g. an insurance company using a data warehouse would organize their data by customer, premium, and claim, instead of by different products (auto, life, etc.). The data organized by subject contain only the information necessary for decision support processing.
2 Integrated: When data resides in many separate applications in the operational environment, encoding of data is often inconsistent. For instance, in one application gender might be coded as "m" and "f", and in another as 0 and 1. When data are moved from the operational environment into the data warehouse, they assume a consistent coding convention, e.g. gender data is transformed to "m" and "f".
3 Time-Variant: The data warehouse contains a place for storing data that are five to 10 years old, or older, to be used for comparisons, trends, and forecasting. These data are not updated.
4 Non-Volatile: Data are not updated or changed in any way once they enter the data warehouse, but are only loaded and accessed.
Processes In Data Warehousing
The first phase in data warehousing is to "insulate" your current operational information, i.e. to preserve the security and integrity of mission-critical OLTP applications, while giving you access to the broadest possible base of data. The resulting database or data warehouse may consume hundreds of gigabytes, or even terabytes, of disk space. What is required, then, are efficient techniques for storing and retrieving massive amounts of information. Increasingly, large organizations have found that only parallel processing systems offer sufficient bandwidth.
The data warehouse thus retrieves data from a variety of heterogeneous operational databases. The data is then transformed and delivered to the data warehouse/store based on a selected model (or mapping definition). The data transformation and movement processes are executed whenever an update to the warehouse data is required, so there should be some form of automation to manage and execute these functions. The information that describes the model and definition of the source data elements is called "metadata". The metadata is the means by which the end-user finds and understands the data in the warehouse, and it is an important part of the warehouse. The metadata should at the very least contain:
1 The structure of the data;
2 The algorithm used for summarization;
3 The mapping from the operational environment to the data warehouse.
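As a rough illustration of these minimum contents, the sketch below records the structure, the summarization, and the source mapping for one warehouse table. The class name, the field names, and the table and column names are all hypothetical; real warehouse products store metadata in their own catalogs.

```python
from dataclasses import dataclass, field


@dataclass
class WarehouseTableMetadata:
    """Hypothetical metadata record for one warehouse table."""
    table: str
    structure: dict      # column name -> data type
    summarization: str   # how the detail rows were summarized
    source_mapping: dict = field(default_factory=dict)  # warehouse column -> operational source


monthly_sales = WarehouseTableMetadata(
    table="monthly_sales",
    structure={"customer_id": "INTEGER", "month": "TEXT", "total": "REAL"},
    summarization="SUM(order_amount) grouped by customer and month",
    source_mapping={
        "customer_id": "orders_db.orders.cust_no",
        "month": "orders_db.orders.order_date",
        "total": "orders_db.orders.order_amount",
    },
)


def describe(meta, column):
    """Tell an end user what a warehouse column holds and where it comes from."""
    return f"{meta.table}.{column} ({meta.structure[column]}) <- {meta.source_mapping[column]}"


description = describe(monthly_sales, "total")
print(description)
```

A lookup of this kind is what lets an end user find and understand a column without reading the load programs.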
Data cleansing is an important aspect of creating an efficient data warehouse, in that it is the removal of certain aspects of operational data, such as low-level transaction information, which slow down query times. The cleansing stage has to be as dynamic as possible to accommodate all types of queries, even those which may require low-level information. Data should be extracted from production sources at regular intervals and pooled centrally, but the cleansing process has to remove duplication and reconcile differences between various styles of data collection.
The current detail data is central in importance as it:
1 Reflects the most recent happenings, which are usually the most interesting;
2 Is voluminous, as it is stored at the lowest level of granularity;
3 Is almost always stored on disk storage, which is fast to access but expensive and complex to manage.
Uses of Data Warehousing
1 Retail: Analysis of scanner check-out data; tracking, analysis, and tuning of sales promotions; and so on.
2 Telecommunications: Analysis of call volumes, equipment sales, customer profitability, costs, and inventory.
Criteria for a Data Warehouse
The criteria for data warehouse RDBMS are as follows:
1 Load Performance: Data warehouses require incremental loading of new data on a periodic basis within narrow time windows; performance of the load process should be measured in hundreds of millions of rows and gigabytes per hour and must not artificially constrain the volume of data required by the business.
2 Load Processing: Many steps must be taken to load new or updated data into the data warehouse including data conversions, filtering, reformatting, integrity checks, physical storage, indexing, and metadata update. These steps must be executed as a single, seamless unit of work.
3 Data Quality Management: The shift to fact-based management demands the highest data quality. The warehouse must ensure local consistency, global consistency, and referential integrity despite "dirty" sources and massive database size. While loading and preparation are necessary steps, they are not sufficient. Query throughput is the measure of success for a data warehouse application. As more questions are answered, analysts are catalysed to ask more creative and insightful questions.
4 Query Performance: Fact-based management and ad-hoc analysis must not be slowed or inhibited by the performance of the data warehouse RDBMS; large, complex queries for key business operations must complete in seconds, not days.
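The requirement under Load Processing that the load steps run "as a single, seamless unit of work" can be illustrated with a database transaction. The sketch below uses Python's built-in sqlite3 module as a stand-in for a warehouse RDBMS. The table names, the sample batches, and the particular checks are assumptions made for illustration, and a production loader would do far more.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
CREATE TABLE sales (customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1), (2);
""")


def load_batch(conn, rows):
    """Convert, filter, check, and store a batch; undo the whole batch on any failure."""
    try:
        with conn:  # commits if the block finishes, rolls back if it raises
            for customer_id, amount_text in rows:
                amount = float(amount_text)   # data conversion
                if amount <= 0:               # filtering of empty transactions
                    continue
                known = conn.execute(
                    "SELECT 1 FROM customers WHERE customer_id = ?", (customer_id,)
                ).fetchone()
                if known is None:             # referential integrity check
                    raise ValueError(f"unknown customer {customer_id}")
                conn.execute("INSERT INTO sales VALUES (?, ?)", (customer_id, amount))
        return True
    except ValueError:
        return False


good = load_batch(conn, [(1, "10.5"), (2, "0"), (2, "4.0")])
bad = load_batch(conn, [(1, "7.0"), (99, "3.0")])
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(good, bad, row_count)
```

The second batch fails its integrity check on the second row, so its first row is rolled back as well, and the warehouse is left holding only the two valid rows from the first batch.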
Our strategic analysis of data warehousing is as follows: Strategy is about, and only about, building advantage. The business need to build, compound, and sustain advantage is the most fundamental and dominant business need and it is insatiable. Advantage is built through deep and far-reaching strategic thinking. The strategic ideas that support data warehousing as a strategic initiative are learning, maneuverability, prescience, and foreknowledge. Data warehousing meets the fundamental business needs to compete in a superior manner across the elementary strategic dimension of time. Data warehousing is a rare instance of a rising tide strategy. A rising tide strategy occurs when an action yields tremendous leverage. Data warehousing raises the ability of all employees to serve their customers and out-think their competitors.
References
1. Ralph Kimball, The Data Warehouse Toolkit (New York, NY: John Wiley & Sons, Inc., 1996), pp. 15-16.
2. W. H. Inmon, Claudia Imhoff, and Ryan Sousa, Corporate Information Factory (New York, NY: John Wiley & Sons, Inc., 1998), pp. 87-100.
3. Len Silverston, W. H. Inmon, and Kent Graziano, The Data Model Resource Book (New York, NY: John Wiley & Sons, Inc., 1997).
4. Douglas Hackney, Understanding and Implementing Successful Data Marts (Reading, MA: Addison-Wesley, 1997), pp. 52-54, 183-84, 257, 307-309.
5. White Paper, available at http://www.informatica.com.
6. Hackney, op. cit.
7. Informatica, op. cit.