February 20, 2023

2 Introduction to Predictive Modeling

Data analysis involves a number of modeling techniques that can be classified into three main types:

1 Predictive Modeling

Predictive modeling, sometimes called supervised learning, focuses on understanding or making predictions about a given variable based on one or more other variables. The ultimate goal of predictive analysis is accuracy. For example, you may want to predict which potential customers are more likely to apply for a credit card based on knowledge of their demographic and financial information. In this scenario, there are two types of variables: the dependent variable you want to predict (whether a customer applies for the card) and the independent variables, or predictors, used to predict it (the demographic and financial information).

Below are several additional applications of predictive modeling:

2 Cluster Analysis

Cluster analysis, or segmentation, is a type of unsupervised learning problem that is very different from predictive modeling. Cluster analysis is appropriate when you can associate each case in a dataset with other cases that share similar characteristics. You end up with several groups, in which the members of each group are very similar to each other but very different from the members of other groups.

Cluster analysis is often used in marketing campaigns so that customers receive ads tailored for the group to which they belong rather than generic ads. These groups are identified via cluster analysis. In such a scenario, there is no dependent variable; only independent variables are used to segment the cases.
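The grouping idea can be sketched with a small k-means routine. This is only an illustration under stated assumptions: k-means is one common clustering method among many, the customer records (age, monthly spend) are hypothetical, and the function name `kmeans` and its parameters are made up for this example.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Group points into k clusters by alternating assignment and centroid updates."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct cases
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each case joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Hypothetical customers: (age, monthly spend). No dependent variable is used;
# only the independent variables decide which cases end up together.
customers = [
    (22, 150), (25, 180), (24, 160),
    (58, 900), (61, 950), (60, 870),
]
labels, centroids = kmeans(customers, k=2)
print(labels)
```

On this toy data the three younger, low-spend customers receive one label and the three older, high-spend customers the other. Which group is numbered 0 or 1 depends on the random starting points, so the labels carry no meaning beyond group membership.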

Cases in which the variables have similar values are grouped into clusters in an attempt to find homogeneous subsets. Determining the number of clusters to create is part of the challenge of such techniques. Since clustering is exploratory in nature, the resulting clusters are not necessarily right or wrong. Instead, cluster solutions should be judged by their ability to address the specific business problem you are trying to solve. Applications of cluster analysis include:

3 Association Modeling

Companies like Amazon and Netflix have made association modeling commonplace. As consumers, most of us have encountered recommendation engines where a movie is recommended based upon our prior viewing habits or books are recommended based on our prior purchases (or even just our prior browsing behavior). Association models use transactional data to predict future transactions. The idea is that you may be able to suggest additional items that a person may want or need based on their previous buying behavior. This results in statements such as, “People who bought product A and product B might also like product C” appearing on their computer screens.

Association modeling is often described primarily as a kind of market basket analysis, but while it is strongly associated with retail data, it can also be applied in other areas. For example, in predictive maintenance, a pair of part failures might frequently be associated with the failure of a third part even though that third part doesn’t show evidence of trouble at the time the first two fail.

Transactions in association models often occur at the same time. For instance, many items might be listed on a grocery store receipt, but there is no indication of which purchases occurred before others. Hot dogs and hot dog buns may be frequently purchased together, but there is no “rule” that indicates which purchase occurred first. In a predictive model, you might say that hot dogs “predict” buns or that buns “predict” hot dogs. A further refinement involves sequence analysis, which does take into consideration the order in which events, such as purchases, occur. This can be useful in predictive maintenance or in web mining, where it might be beneficial to know the sequence in which website visitors click on links and buttons on a page or move to other pages in the site.
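The hot dog example can be made concrete with a short sketch. The receipts below are hypothetical. Support and confidence are standard measures for association rules, although the text above does not name them, and the helper names `support` and `confidence` are chosen for this illustration only.

```python
from collections import Counter
from itertools import combinations

# Hypothetical receipts: each set lists the items on one receipt, with no ordering.
receipts = [
    {"hot dogs", "buns", "mustard"},
    {"hot dogs", "buns"},
    {"buns", "milk"},
    {"hot dogs", "buns", "milk"},
    {"milk", "eggs"},
]

# How often each item, and each unordered pair of items, appears on a receipt.
item_counts = Counter(item for r in receipts for item in r)
pair_counts = Counter(
    frozenset(pair) for r in receipts for pair in combinations(sorted(r), 2)
)


def support(a, b):
    """Share of all receipts that contain both items."""
    return pair_counts[frozenset((a, b))] / len(receipts)


def confidence(antecedent, consequent):
    """Share of receipts with the antecedent that also contain the consequent."""
    return pair_counts[frozenset((antecedent, consequent))] / item_counts[antecedent]


print(support("hot dogs", "buns"))     # 0.6
print(confidence("hot dogs", "buns"))  # 1.0
print(confidence("buns", "hot dogs"))  # 0.75
```

Support is symmetric because a receipt records no order, while confidence can be read in either direction: here every receipt with hot dogs also has buns, but only three of the four receipts with buns have hot dogs. Sequence analysis would need ordered events rather than sets.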

Applications of association modeling include:

4 Overview of Predictive Models

This course covers only some of the most popular predictive models. Specifically, we’ll take a look at the following:

4.1 Statistical Models

Statistical models produce equations, and statistical tests guide predictor selection. These models make certain assumptions, whereas rule induction and machine learning models do not. Here are several characteristics of statistical predictive models:

Below is a list of some statistical models.

4.2 Decision Tree Models

A decision tree or rule induction model is an important type of predictive model. It derives a set of rules in relation to a dependent variable. The model’s output shows the reasoning for each rule and can therefore be used to understand the decision-making process that drives a particular outcome. Models that produce decision trees belong to this class of models. Generally, decision tree predictive models:

Below is a list of several rule induction models:

4.3 Machine Learning Models

Machine learning models are optimized for learning complex patterns. Unlike traditional statistical techniques, they make no specific distributional assumptions about the data. Machine learning models do not produce a set of rules like rule induction models, nor do they produce easy-to-understand equations like statistical models. Thus, machine learning models are often said to be “black box” models. They produce a set of equations, but because there is a hidden layer (possibly several hidden layers), the interpretation of the coefficient weights is not as straightforward as it is with traditional statistical models or rule induction models. Machine learning predictive models:

Below is a list of several machine learning models:

5 Model Validation

The process of statistical hypothesis testing, which involves a result’s statistical significance in the context of certain data distribution assumptions (such as having normally distributed errors), helps us determine when we have found a valid and reliable result. However, most data-mining methods do not depend on specific data distribution assumptions for drawing inferences from the sample to the population. So how is validation achieved? Model validation in data mining is usually done by partitioning the data into training and testing datasets. Models are developed from the training data and then the models’ predictions are tested on the testing data. Validity is established by demonstrating that the model applies to data different from what was used to derive the model. Statisticians often recommend such validation for statistical models, but it is crucial for more general (less distribution-bound) data-mining techniques.
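The partitioning described above can be sketched as follows. Everything here is an assumption made for illustration: the synthetic data (an income predictor with 10% label noise), the 70/30 split, and the deliberately simple one-cutoff model stand in for whatever data and model you actually use.

```python
import random


def make_cases(n, rng):
    """Synthetic cases: income drives the outcome, with 10% label noise."""
    cases = []
    for _ in range(n):
        income = rng.uniform(20, 120)  # hypothetical predictor, in thousands
        applies = income > 60
        if rng.random() < 0.1:
            applies = not applies  # flip some labels to mimic noise
        cases.append((income, applies))
    return cases


def accuracy(cases, threshold):
    """Share of cases where the rule 'income > threshold' matches the outcome."""
    hits = sum((income > threshold) == applies for income, applies in cases)
    return hits / len(cases)


def fit_threshold(train):
    """Pick the income cutoff that classifies the training cases best."""
    candidates = sorted(income for income, _ in train)
    return max(candidates, key=lambda t: accuracy(train, t))


rng = random.Random(42)
cases = make_cases(1000, rng)
rng.shuffle(cases)
split = int(0.7 * len(cases))  # 70% training, 30% testing
train, test = cases[:split], cases[split:]

threshold = fit_threshold(train)      # model is derived from training data only
train_acc = accuracy(train, threshold)
test_acc = accuracy(test, threshold)  # validity is judged on unseen cases
print(round(threshold, 1), round(train_acc, 3), round(test_acc, 3))
```

The number to report is the testing accuracy. If it falls far below the training accuracy, the model has fitted quirks of the training sample rather than a pattern that generalizes.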

6 How to Choose a Model

Choosing a model is difficult. Obviously, if you have a variable in the data file that you want to predict, then any of the predictive models (depending on the target variable’s level of measurement) will perform the task, albeit with varying degrees of success. If you want to find groups of individuals that behave similarly on a number of fields in the data, then any of the clustering methods is appropriate. Association rules, while not directly giving you the ability to make predictions, are extremely useful as a tool for understanding the various patterns within the data.

However, determining which particular prediction technique will work best depends specifically on how the variables you want to predict are related to the predictors. There are suggested guidelines as to when one technique may work better than another, but these are only suggestions and not rules.

From the previous discussion, it follows that more than one prediction model can be used to predict an outcome. The business context provides the first deciding factor in selecting a model. For example, if your goal is to extract a set of rules from the model, a rule induction model is the only choice. Alternatively, if the model itself is of no interest but must nevertheless be as accurate as possible, then any of the models could be a candidate for the task. When one class of model is preferred but there are many models within that class, how do you choose a specific model?

Each model has different characteristics when it comes to the way in which:

There are many subtle differences between the models. In the end, however, there is a wide range of possibilities, and it is always the business users who balance the pros and cons and decide which model or combination of models should be used.

Data analytics and reporting tools such as KNIME make it simple to build models. Machine learning models, rule induction models (decision trees), and statistical models can be built with great ease and speed, and their results compared. Remember that data mining is an iterative process: models will be built, broken down, and often even combined before the user is satisfied with the results.

One final yet important point to keep in mind when building models is that software will only find rules or patterns in data if they actually exist. You cannot extract a model with high predictive accuracy if there are no associations between the predictors and dependent variables.
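A small experiment illustrates this point. The sketch below is only an illustration with made-up data: the outcome is a coin flip that has nothing to do with the predictor, and the model is the best single-cutoff rule found on the training half.

```python
import random

rng = random.Random(7)

# Hypothetical data in which the outcome is a coin flip, unrelated to the predictor.
cases = [(rng.uniform(0, 1), rng.random() < 0.5) for _ in range(4000)]
train, test = cases[:2000], cases[2000:]


def accuracy(cases, threshold, above_is_true):
    """Accuracy of the rule: predict True above the cutoff (or below it, if flipped)."""
    hits = sum(((x > threshold) == above_is_true) == y for x, y in cases)
    return hits / len(cases)


# Try 101 cutoffs in both directions and keep whichever fits the training data best.
rules = [(t / 100, above) for t in range(101) for above in (True, False)]
best = max(rules, key=lambda r: accuracy(train, *r))

train_acc = accuracy(train, *best)
test_acc = accuracy(test, *best)
print(round(train_acc, 3), round(test_acc, 3))
```

Even the best rule found on the training half stays close to 50% on the testing half, which is what guessing would achieve. No search procedure can recover an association that the data do not contain.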

Reference

Course text: UCI. (2020). Introduction to Predictive Modeling.
