Target Leakage

Use of the Assisted Modeling tool requires participation in the Alteryx Analytics Beta program. Visit the the Alteryx beta program, also known as the Alteryx Customer Feedback Program, to find out more. All Alteryx Beta Program notifications and disclaimers apply to this content. Functionality described here may or may not be available as part of the Beta Program. To send feedback about the documentation, send an email message to helpfeedback@alteryx.com.

This article provides information about a type of machine learning error that occurs more often with complex dataset.

Target leakage: Also known as data leakage. Target leakage occurs when a variable that is not a feature is used to predict the target. This occurs when the model is built, or trained, with information (known as the training dataset) that will not be available in unseen data. This is one of the causes of model error and results in a model that seems too good to be true. Target leakage occurs more often with complex datasets. For example, if your training dataset that was normalized or standardized using missing value imputation (such as min, max, mean) - such that the training dataset had full knowledge of the distribution of data in the training dataset. An unseen dataset would not have any knowledge of the distribution of data in the dataset - that is, a row in the unseen dataset would not know if there were three records or three million other records. It wouldn't be able to normalize or standardize itself using unseen data. As a result, the model ends up overfitting to the training data and produces a higher accuracy when run on unseen data than it would if the model had not fit the training dataset so well (too well).

Prevent target leakage

The following actions may help prevent target leakage:

Cross validation - for time series this means selecting data points from your dataset and randomly assigning them to training and testing sets
Create and keep a validation dataset for performing a final reality check later