GETTING TO GRIPS WITH DATA MINING
So what is Data Mining? Well, one commentator I heard recently reframed this question as, ‘How does knowledge discovery in databases work?’ The following is a step-by-step walk through the processes involved in data mining.
Well, we start with raw data, lots and lots of raw data, held either internally within an organization or gathered externally, which may or may not have some bearing on the organization.
Target or Selected Data
We then decide, as mentioned in my blog on Big Data, what we want or hope to achieve in terms of information from the data. With this target in mind, we begin a selection process to weed out data that is irrelevant or not yet ready to use; what remains is our ‘Target Data’.
Pre-Processed or Adjusted Data
We should then do some preliminary testing on portions of that data using various algorithms (an algorithm is a step-by-step set of operations to be performed – Wikipedia), mainly to eliminate outliers (in statistical terms, an outlier is an observation point that is distant from other observation points – Wikipedia) and null or missing values. Where missing data and outliers occur, it is usual to apply the same algorithms to another subset of data from the same target data source, to try to understand why the outliers are occurring or why data is missing from where it is missing. Different algorithms may also be worth trying to explain the anomalies or null values.
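To make this pre-processing step concrete, here is a minimal sketch in Python. The readings, the cutoff, and the function name are invented for illustration, and the robust median-based score used here is just one common way of flagging outliers, not the only one:

```python
import statistics

def preprocess(values, cutoff=3.5):
    """Drop missing entries, then flag outliers using a robust
    (median/MAD-based) score rather than the raw mean, so one
    extreme value cannot hide itself by inflating the average."""
    present = [v for v in values if v is not None]          # remove nulls
    med = statistics.median(present)
    mad = statistics.median(abs(v - med) for v in present)  # median absolute deviation
    kept, outliers = [], []
    for v in present:
        score = 0.6745 * (v - med) / mad if mad else 0.0
        (outliers if abs(score) > cutoff else kept).append(v)
    return kept, outliers

readings = [10, 12, 11, None, 13, 12, 500]   # one null, one extreme value
kept, outliers = preprocess(readings)
# kept -> [10, 12, 11, 13, 12], outliers -> [500]
```

In practice one would log the flagged values and, as the text suggests, re-run the same check on another subset of the data before deciding whether the outliers are errors or genuine signal.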
Transformed or Further Adjusted Data
We then seek to transform the data, which may include normalization (database normalization is the process of organizing the attributes and tables of a relational database to minimize data redundancy – Google). We then apply statistical tests to see whether variables within the data are correlated (the extent to which two or more variables fluctuate together). Correlation should not be taken to mean that one variable definitely causes the other to fluctuate, as there may be other factors at work causing the two variables to fluctuate together. A positive correlation does, however, show that two variables rise and fall in parallel, while a negative correlation shows the extent to which one variable increases as the other decreases. Ideally one is looking for variables that do not correlate; where correlation exists, one can apply steps to decorrelate the data.
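As a rough sketch of this transformation step, here is what rescaling and a correlation test can look like in Python. Note that in data mining, ‘normalization’ often means rescaling values to a common range (as below), which is distinct from the database normalization defined above; the ad-spend and sales figures are invented for illustration:

```python
import statistics

def min_max(values):
    """Rescale a series to the [0, 1] range (min-max normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series:
    +1 means they move in lockstep, -1 means they move in opposition."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

ad_spend = [1, 2, 3, 4, 5]
sales    = [12, 14, 15, 17, 18]   # moves with ad_spend -> positive correlation
returns  = [9, 8, 6, 5, 3]        # moves against it    -> negative correlation

scaled = min_max(ad_spend)        # -> [0.0, 0.25, 0.5, 0.75, 1.0]
```

A strong coefficient in either direction tells us the two variables carry overlapping information, which is exactly the situation the decorrelation steps mentioned above are meant to address.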
Data Mining Model Infographic
The next step is data mining proper, where we look for patterns in the data. Here one takes the transformed data and applies algorithms that provide classification, clustering, statistical learning, association analysis, and link mining, which are among the most important topics in data mining research and development.
Types of information obtainable from data mining:
Associations: Occurrences linked to a single event
Sequences: Events linked over time
Classification: Recognizes patterns that describe the group to which an item belongs
Clustering: Similar to classification, but used when no groups have been defined; finds groupings within the data
Forecasting: Uses a series of existing values to forecast what other values will be
An academic paper published by the University of Maryland says the following about one of the most popular data mining algorithms, Apriori:
“Many of the pattern finding algorithms such as decision tree, classification rules and clustering techniques that are frequently used in data mining have been developed in the machine learning research community. Frequent pattern and association rule mining is one of the few exceptions to this tradition. The introduction of this technique boosted data mining research and its impact is tremendous. The algorithm is quite simple and easy to implement. Experimenting with Apriori-like algorithm is the first thing that data miners try to do” (www.cs.umd.edu/~samir/498/10Algorithms-08.pdf).
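The quote notes how simple the algorithm is, and a bare-bones, Apriori-style frequent-itemset search really can be sketched in a few lines. The shopping baskets and the support threshold below are invented, and a full Apriori implementation prunes candidates more carefully than this generate-and-test version:

```python
from itertools import combinations

def apriori(transactions, min_support=2):
    """Apriori-style level-wise search for itemsets appearing in at
    least `min_support` transactions. Key insight: an itemset can only
    be frequent if all of its subsets are frequent, so each level is
    built only from the survivors of the previous one."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    singles = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    level = {s for s in singles if support(s) >= min_support}
    while level:
        frequent.update({s: support(s) for s in level})
        # grow candidates by joining pairs of frequent k-itemsets
        level = {a | b for a, b in combinations(level, 2)
                 if len(a | b) == len(a) + 1
                 and support(a | b) >= min_support}
    return frequent

baskets = [{"bread", "milk"},
           {"bread", "butter"},
           {"bread", "milk", "butter"},
           {"milk"}]
freq = apriori(baskets, min_support=2)
# {bread}: 3, {milk}: 3, {butter}: 2, {bread, milk}: 2, {bread, butter}: 2
```

Note that {milk, butter} never reaches the result, because it appears in only one basket; this early pruning is what keeps the level-wise search tractable on large transaction sets.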
In future blogs we will go into more detail about the algorithms, like Apriori, used for data mining.
Evaluation and Interpretation Leading to Knowledge
Finally, the data scientist must interpret these patterns and produce new knowledge from the data mining process. The data scientist may return to any phase of the process and repeat the experiment with a different subset of the data in order to refine and/or improve upon the knowledge gained.