A credit card company has 1,000,000 customers and wants to know how many of them will default on their payment this month. How does one answer this question?
A company wants to start a paid value-added service for its 1,000,000 existing customers and wants to know how many of them are likely to subscribe to the service. How does one answer that question?
If the database were small, say 100 records, both questions would be easy to answer. When the database is very large, however, extracting trends and patterns from the data is a tedious and difficult task with little chance of success. Yet this is critical for a business, because it enables proactive, data-based decisions on expected or predictable behaviours and trends, and so it cannot be neglected.
This creates the need for a tool, technique or methodology that can solve the problem and identify hidden trends and patterns in large databases.
Data Mining helps in answering the above questions.
What is Data Mining?
Data is any information about a variable or a set of variables from which knowledge is derived. One of the most important reasons for collecting, storing and analysing data is that it can help in predicting the trends and behaviours of the variables for which the data has been collected. In the early 1960s, the practice of data collection started gaining momentum as a way to answer questions on historical data, such as “number of units sold in the last 3 years”. Initially this was done manually, with calculators and with the data written down in registers and books. As computers became popular in companies, data collection and storage became easier. However, data collection answered questions relating to major business metrics only. If a company wanted to know the number of goods sold by area or by location, mere data collection could not answer the question. This led to the evolution of data warehousing, which involved collecting all the data and categorizing it for easy use, and which answered many questions about historical data. However, it too had a limitation: data warehousing helped in identifying trends and patterns only where the database was small. Where the database is huge, as in the examples mentioned in the abstract section, data warehousing does not solve the problem, as it is likely to miss the hidden trends and patterns. The problem is even worse today, as even small databases can run into terabytes (1 terabyte = 1,000,000,000,000 bytes).
Solving this problem is a must for any business that wants to succeed and take data-based decisions. People have been working on it for years, and a solution has been found.
The solution is Data Mining.
Data mining is the extraction of hidden trends and patterns in a huge database to convert the data into knowledge. This knowledge helps in predicting future trends or patterns which enables companies to take data driven proactive decisions to increase revenue or reduce cost or both. Data mining is also called knowledge discovery. Data mining is increasingly becoming popular in many organizations across the globe today. It uses various automated analytical tools to discover relationships and trends in data which may help in predicting the future behaviour.
What is required for data mining?
There are 3 major constituents of data mining:-
1. Huge database
Data mining techniques are useful only when the data set is so large that analysing it from every angle and for each category is not possible with traditional techniques.
Today, the average database in most organizations is very large, which makes it easy for them to use data mining techniques.
2. Powerful computers
Analysing massive databases can be a very tedious job, and without a powerful computer the analysis may never end. Expectations from an ordinary computer must also be realistic: analysing gigabytes or terabytes of data requires powerful processors that can handle the data size without crashing.
3. Algorithms
The analysis in data mining is based upon algorithms which analyse data using advanced statistics and hypothesis testing. These algorithms identify sequences, trends and patterns in the data.
What does data mining do?
Data mining mostly has two functions:-
1. Automated future behaviour prediction
Data mining automatically predicts the future behaviour of a variable or variables in a database on the basis of historical data. Traditionally, this activity was done manually. Many data analysis tools have existed in the past, but they were limited to performing a test on the data and returning a result. The result was then interpreted manually, and the expected future behaviour was derived from that interpretation. The problem with this approach is that different people may interpret the same result differently, mostly because of subjectivity. Data mining, with the help of powerful algorithms, automates the entire process, leaving the subjectivity out and completing the work in a much shorter time than the traditional technique. For example, with customer segmentation becoming popular, selective targeting for products has become a necessity. Data mining on historical data helps to identify which customers are most likely to buy a new product. This enables organizations to target these customers, making the process fast and cost-efficient, unlike the traditional method.
2. Identification of hidden trends
Data mining helps to identify or discover new or hidden trends in data. With large databases, the assumption that all trends and consumer behaviours are already known can be very dangerous for an organization. Data mining helps to identify any such new trends existing in a database. For example, data mining techniques help in credit card fraud detection.
Application of data mining
Today, data mining is used in many organizations across industries and countries; however, it is mostly used to understand and predict consumer behaviour. This makes data mining techniques more popular in retail and in financial institutions. Data mining helps to draw relationships between variables on the basis of historical data. For example, a credit card company could recommend certain products to customers on their monthly credit card statements based on their shopping habits, frequent purchases or spending habits. This also opens a new source of revenue for the credit card company, which can place the recommendations on the statements and charge a commission or a fee to the company selling the product. Alternatively, the same data could be sold to various companies for selective customer targeting. As another example, many book stores send promotional offers on certain books to customers based on their reading habits. If a new work of fiction is launched, the store can send a promotional offer to all its customers interested in fiction. This gives the book store a better success rate in selling the book, in a very efficient manner.
Data mining is also used in credit card fraud detection, education, genetic engineering, customer relationship management, marketing, cross-selling etc.
How does data mining work?
Data mining works on the concept of predictive modelling. This is a process in which a model is built from sample data for which the result is already known, and is then applied to similar data for which the result is unknown. In data mining, the modelling is done in the same manner. The data mining system already holds various scenarios and models where the solution or the result is known. The data mining activity is then carried out on the data where the result is not known. For example, a credit card company wants to increase its market share by adding new customers. It can either send a promotional offer to all prospective customers without capturing their needs and requirements, thus incurring a huge cost, or use modelling. The company has a database of its existing customers that captures their spending habits, age, location, gender, occupation etc. The company offers various types of credit cards. It also has a list of prospective customers with details such as age, location, gender and occupation. A model is built from the existing customer database, for instance that most customers in the age group of 18-22 years prefer a student’s credit card, or that most employees working in multinational companies prefer an international credit card. This model is applied to the database of prospective customers to determine which credit card would be the first choice for each prospective customer. This enables the organization to target customers selectively for each of its products.
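The credit card example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the customer records, the two age groups, the segment key (age group plus occupation) and the function names build_model and predict are all hypothetical, and the "model" is simply the most common card in each segment of existing customers.

```python
from collections import Counter, defaultdict

# Hypothetical existing customers: (age, occupation, card they hold).
existing_customers = [
    (19, "student", "student card"),
    (21, "student", "student card"),
    (20, "student", "student card"),
    (22, "multinational employee", "international card"),
    (34, "multinational employee", "international card"),
    (41, "multinational employee", "international card"),
    (38, "local business owner", "standard card"),
    (52, "local business owner", "standard card"),
]

def age_group(age):
    """Bucket an age into the two groups assumed by this sketch."""
    return "18-22" if 18 <= age <= 22 else "23+"

def build_model(customers):
    """Learn the most common card for each (age group, occupation) segment."""
    counts = defaultdict(Counter)
    for age, occupation, card in customers:
        counts[(age_group(age), occupation)][card] += 1
    return {segment: cards.most_common(1)[0][0] for segment, cards in counts.items()}

def predict(model, age, occupation, default="standard card"):
    """Apply the model to a prospect whose preferred card is unknown."""
    return model.get((age_group(age), occupation), default)

# Build the model where the result is known, then apply it where it is not.
model = build_model(existing_customers)
prospects = [(20, "student"), (29, "multinational employee"), (45, "teacher")]
for age, occupation in prospects:
    print(age, occupation, "->", predict(model, age, occupation))
```

A prospect who falls into a segment the model has never seen (the teacher here) receives the assumed default card, which is one simple way to handle cases the known data does not cover.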
Various data mining models
There are many types of modelling done in data mining. The most popular ones are:-
1. Neural networks
Neural networks are among the most common modelling techniques used in data mining. This model is very efficient in establishing the relationships between many variables at the same time. A database may have hundreds of independent variables stored in various classes. The dependent variable may or may not have a direct relationship with them; the relationship may be indirect, with another variable sitting between the primary independent variable and the dependent variable. Once the relationships are established, a neural network helps in predicting the dependent variable. The resulting structure may be very complex, similar to a biological neural network.
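The idea of an indirect relationship can be illustrated with a very small network. In the sketch below, the output depends on the two inputs only through two intermediate (hidden) neurons. The weights and biases are hand-picked assumptions chosen so that the network outputs 1 when exactly one input is 1; a real neural network would learn such weights from data rather than have them written in.

```python
def step(x):
    """Threshold activation: fire (1) when the weighted sum is positive."""
    return 1 if x > 0 else 0

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs followed by activation."""
    return step(sum(i * w for i, w in zip(inputs, weights)) + bias)

def network(a, b):
    """Two inputs -> two hidden neurons -> one output neuron."""
    either = neuron([a, b], [1, 1], -0.5)   # fires if at least one input is 1
    both = neuron([a, b], [1, 1], -1.5)     # fires only if both inputs are 1
    # The output sees the inputs only indirectly, through the hidden neurons.
    return neuron([either, both], [1, -1], -0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", network(a, b))
```

No single neuron of this kind can produce this output directly from the two inputs, which is why the intermediate layer is needed.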
2. Nearest neighbour
Nearest neighbour is a simple modelling technique. It uses the solutions of similar problems to predict the dependent variable. The technique identifies the records in the database that lie nearest to the new record and places the new record in the class of those neighbours. To do this, the distance between the new record and every other record is measured and compared to identify the nearest neighbours. The limitation is that the technique is very time consuming, as the measurement and calculation are lengthy and tedious.
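A minimal sketch of this technique follows. The records, the two attributes (age and monthly spend), the choice of Euclidean distance and the value k = 3 are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical labelled records: ((age, monthly spend in thousands), card type).
known = [
    ((19, 5), "student card"),
    ((21, 6), "student card"),
    ((20, 4), "student card"),
    ((35, 40), "international card"),
    ((42, 55), "international card"),
    ((38, 48), "international card"),
]

def nearest_neighbours(records, query, k=3):
    """Return the k records closest to the query by Euclidean distance."""
    return sorted(records, key=lambda rec: math.dist(rec[0], query))[:k]

def classify(records, query, k=3):
    """Assign the query to the majority class among its k nearest neighbours."""
    votes = Counter(label for _, label in nearest_neighbours(records, query, k))
    return votes.most_common(1)[0][0]

print(classify(known, (22, 7)))    # query close to the student records
print(classify(known, (40, 50)))   # query close to the international records
```

Every query is compared with every stored record, which is where the time cost mentioned above comes from as the database grows.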
3. Decision trees
Decision trees are a modelling technique based on a rule or a series of rules which lead to a particular variable or class. For example, a credit card company wants to classify all its credit card holders as defaulters or non-defaulters (customers who have made their payments). For this model, the rule can be set as whether the payment was received by the due date or not, and the classification can be done accordingly. A decision tree can be a single, binary or multi-way tree. A number of algorithms are used to build decision trees.
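The defaulter example can be written as a small binary tree. In this sketch the tree is fixed by hand rather than produced by a tree-building algorithm, and the account fields (due_date, paid_on), the dates and the extra first question (was any payment received at all) are assumptions for illustration.

```python
from datetime import date

def paid(account):
    """First question: was any payment received?"""
    return account["paid_on"] is not None

def paid_by_due_date(account):
    """Second question: was the payment received on or before the due date?"""
    return account["paid_on"] <= account["due_date"]

# A binary decision tree stored as nested tuples:
# (question, branch_if_yes, branch_if_no); a plain string is a leaf (the class).
tree = (paid,
        (paid_by_due_date, "non-defaulter", "defaulter"),
        "defaulter")

def classify(node, account):
    """Walk from the root to a leaf, answering one question per level."""
    while not isinstance(node, str):
        question, if_yes, if_no = node
        node = if_yes if question(account) else if_no
    return node

accounts = [
    {"id": 1, "due_date": date(2024, 1, 15), "paid_on": date(2024, 1, 10)},
    {"id": 2, "due_date": date(2024, 1, 15), "paid_on": date(2024, 1, 20)},
    {"id": 3, "due_date": date(2024, 1, 15), "paid_on": None},
]
for account in accounts:
    print(account["id"], classify(tree, account))
```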
4. Rule induction
This modelling technique uses statistics to predict the dependent variable. Like the decision tree, rule induction is based on rules; however, it does not have the limitation of forming a tree-like structure. In rule induction, the rules may or may not branch out, depending on each rule's classification. For this reason it is often believed to be better than a decision tree.
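One very simple way to induce rules from data is the OneR ("one rule") method, used here only as an illustration and not necessarily the method the text has in mind. For each attribute it forms rules of the form "IF attribute = value THEN class" using the majority class for each value, counts the misclassified records, and keeps the attribute with the fewest errors. The records and attribute names below are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical history: attributes of a customer and whether they defaulted.
records = [
    {"income": "low",  "employed": "no",  "defaulted": "yes"},
    {"income": "low",  "employed": "yes", "defaulted": "yes"},
    {"income": "low",  "employed": "no",  "defaulted": "yes"},
    {"income": "high", "employed": "yes", "defaulted": "no"},
    {"income": "high", "employed": "no",  "defaulted": "no"},
    {"income": "high", "employed": "yes", "defaulted": "no"},
]

def induce_rules(rows, attribute, target):
    """For one attribute, build 'IF attribute = value THEN class' rules."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[attribute]][row[target]] += 1
    # Each value predicts its majority class; the rest count as errors.
    rules = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
    errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
    return rules, errors

def one_r(rows, attributes, target):
    """Keep the attribute whose rules misclassify the fewest rows."""
    candidates = {a: induce_rules(rows, a, target) for a in attributes}
    best = min(attributes, key=lambda a: candidates[a][1])
    return best, candidates[best][0]

attribute, rules = one_r(records, ["income", "employed"], "defaulted")
for value, outcome in rules.items():
    print(f"IF {attribute} = {value} THEN defaulted = {outcome}")
```

The result is a flat list of independent rules rather than a tree, which is the structural difference described above.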
5. Genetic Algorithm
The objective of a genetic algorithm is not to identify trends or patterns in a data set but to identify models which may be used in data mining.
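As a rough sketch of this idea, the code below uses a genetic algorithm to search among candidate models rather than to analyse the data directly. Everything here is an assumption for illustration: the records, the form of the candidate model (a set of selected attributes, predicting default if any of them is set), and the population size, mutation rate and fixed random seed.

```python
import random

# Hypothetical records: four yes/no attributes and whether the customer defaulted.
records = [
    ((1, 0, 1, 0), 1),
    ((1, 1, 0, 0), 1),
    ((0, 0, 1, 1), 0),
    ((0, 1, 0, 1), 0),
    ((1, 0, 0, 1), 1),
    ((0, 1, 1, 0), 0),
]

def predict(mask, attributes):
    """Candidate model: predict default if any selected attribute is set."""
    return int(any(m and a for m, a in zip(mask, attributes)))

def fitness(mask):
    """Number of records the candidate model classifies correctly."""
    return sum(predict(mask, attrs) == label for attrs, label in records)

def evolve(generations=20, size=6, mutation_rate=0.1, seed=0):
    """Evolve a population of attribute masks and return the fittest one."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    population = [tuple(rng.randint(0, 1) for _ in range(4)) for _ in range(size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: size // 2]      # selection: keep the fitter half
        children = []
        while len(parents) + len(children) < size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randint(1, 3)            # single-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = tuple(bit ^ (rng.random() < mutation_rate) for bit in child)
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

best = evolve()
print("selected attributes:", best, "correct:", fitness(best), "of", len(records))
```

Because the fitter half of each generation is carried over unchanged, the best model found never gets worse from one generation to the next, although the search is not guaranteed to find the best possible model.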
Data mining process
The data mining process consists of the following steps:-
1. Data Collection
This step involves identifying the data to be collected, its sources, format and description, and assessing the quality of the data. Some of the data collected may not be usable in raw form and may require additional work to make it usable. At times the data has multiple sources and the data retrieved from these sources is not complete; in such cases the data needs to be first reconciled and then integrated.
2. Data management
The data collected has to be classified into various segments and categories, and then stored accordingly in the data warehouse. If the storage is not done right, it can lead to the failure of the entire data mining activity.
3. Data Access
The data stored in data warehouses must be accessible and retrievable at will, in the desired form and categories. This is very important for all the analysis. Today many data warehouses struggle in this area, as making the data accessible in the required format for all categories is not an easy task.
4. Data analysis
This is the most critical step, as all the data mining takes place here. All the analysis of the dependent variable based on the independent variables, through various models, statistical tools etc., takes place in this step.
5. Result presentation
Once the analysis is complete, it produces the output of the data mining activity, establishing the relationships and predicting the trends and behaviours of the dependent variable. This output must be presented in the right form and manner.
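The five steps above can be sketched end to end as a chain of small functions. This is a toy illustration under stated assumptions: the two data sources, their fields, the in-memory "warehouse" and the analysis (share of unpaid accounts per segment) are all hypothetical stand-ins for real systems.

```python
from collections import defaultdict

# Two hypothetical sources with partly overlapping, partly incomplete records.
billing = [{"id": 1, "city": "Pune", "paid": True},
           {"id": 2, "city": "Delhi", "paid": False}]
crm = [{"id": 2, "segment": "student"},
       {"id": 1, "segment": "salaried"},
       {"id": 3, "segment": "student"}]

def collect(*sources):
    """1. Data collection: reconcile records by customer id, then integrate."""
    merged = defaultdict(dict)
    for source in sources:
        for record in source:
            merged[record["id"]].update(record)
    return list(merged.values())

def manage(records, required=("id", "segment", "paid")):
    """2. Data management: drop incomplete records and store by segment."""
    warehouse = defaultdict(list)
    for record in records:
        if all(field in record for field in required):
            warehouse[record["segment"]].append(record)
    return warehouse

def access(warehouse, segment):
    """3. Data access: retrieve one category on demand."""
    return warehouse.get(segment, [])

def analyse(records):
    """4. Data analysis: share of customers in the set who did not pay."""
    return sum(not r["paid"] for r in records) / len(records) if records else None

def present(segment, rate):
    """5. Result presentation: a readable one-line summary."""
    return f"{segment}: no data" if rate is None else f"{segment}: {rate:.0%} unpaid"

warehouse = manage(collect(billing, crm))
for segment in ("student", "salaried"):
    print(present(segment, analyse(access(warehouse, segment))))
```

Customer 3 appears in only one source and has no payment information, so it is dropped at the management step; this is the kind of incomplete record that the collection step is meant to reconcile.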
Conclusion
Data mining is very helpful in analysing huge volumes of data and predicting trends and patterns that organizations can use in many ways. However, unrealistic expectations, such as 100% accuracy, can lead to extensive damage. Data mining is only a tool and may give wrong results. For this reason the results obtained from data mining must be validated before they are acted upon. If used wisely, data mining can play an important role in increasing the profitability of an organization. Finally, although most people pay attention to the models in the data mining process, quality data collection and storage are even more critical.








