Machine Learning: Data Drift vs Concept Drift

As models are deployed in real-world scenarios, they encounter new data and face the challenge of adapting to changes in the underlying patterns and distributions. Understanding the differences between data drift and concept drift is crucial for effectively addressing these challenges and ensuring the continued accuracy and relevance of machine learning models. In this article, we delve into the distinctions between data drift and concept drift, explore their implications, and discuss strategies to mitigate their effects. Join us as we unravel the intricacies of data drift and concept drift, and discover how they shape the dynamic field of machine learning.


What is Drift?

Drift is the phenomenon where the statistical properties of the data, or the relationship between inputs and outputs, change over time, degrading the accuracy of a machine learning model. Data drift, specifically, refers to a change in the distribution of the input data.


Why do Machine Learning Models Drift?

Machine learning models drift because the data they were trained on becomes outdated or no longer represents current conditions. This can happen for various reasons, such as changes in the external world, in the context of the application, in user preferences or behaviors, or in data quality or availability. When the data changes, the relationship between the input and output data may also change, making the model less accurate or relevant for the current task.


Types of Drift

There are two main types of drift in machine learning:

  1. Data Drift

  2. Concept Drift

Data Drift:

Data drift, also known as covariate shift, is a phenomenon in which the distribution of data inputs that an ML model was trained on deviates from the distribution of the data inputs that the model encounters during application. This discrepancy can cause the model to lose accuracy and effectiveness in making predictions or decisions.


For instance, consider an ML model trained on customer data from a specific retail store to predict purchase likelihood based on age, income, and location. If the distribution of the input data (age, income, and location) for the new data differs significantly from the training dataset, it can lead to data drift and result in reduced model accuracy.


To address data drift, various techniques can be employed, such as weighing or sampling, to account for the disparities in data distributions. For example, one approach is to assign weights to examples in the training dataset in order to align them more closely with the input data distribution of the new data that the model will be applied to.
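As a rough sketch of the weighting idea, consider a single feature whose distribution shifts between training and deployment. The data and distributions below are hypothetical; the density ratio is computed in closed form only because both distributions are known Gaussians here, whereas in practice it would have to be estimated (for example, with a classifier that separates the two samples):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D feature: the new data is mean-shifted (0.5) vs. training (0.0).
x_train = rng.normal(loc=0.0, scale=1.0, size=5000)

# Importance weight w(x) = p_new(x) / p_train(x). Exact here only because
# both densities are known Gaussians; in practice this ratio is estimated.
def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

w = gaussian_pdf(x_train, 0.5, 1.0) / gaussian_pdf(x_train, 0.0, 1.0)

# The weighted mean of the training sample now estimates the mean of the
# new distribution (0.5) rather than that of the training one (0.0).
est = np.average(x_train, weights=w)
print(round(est, 3))
```

Training a model with these weights (for example, via a `sample_weight` argument) makes the training sample behave, on average, like a sample from the new input distribution.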


[Figures: input data distributions compared over time — one where training and current distributions overlap (no drift), and one where they have diverged (drift)]


Example:

Data drift can occur when a machine learning model trained to recognize handwritten digits is applied to a new set of digits written in different styles, sizes, or orientations than those it was trained on. Because the appearance of the inputs has changed, the model may misclassify digits it would otherwise recognize correctly.


How to Detect and Prevent Data Drift?

To prevent and detect data drift, you can use some of the following methods:

  • Regular model retraining: You can retrain your model with new data periodically or when you detect a significant change in the data distribution. This can help your model adapt to the current state of the data and maintain its performance.

  • Continuous model monitoring and evaluation: You can monitor and evaluate your model’s performance over time using metrics such as accuracy, precision, recall, F1-score, etc. You can also use techniques such as confusion matrix, ROC curve, AUC score, etc. to visualize and analyze your model’s performance. If you notice a drop in your model’s performance, you can investigate the possible causes of data drift and take corrective actions.

  • Statistical tests: You can use various statistical tests to compare the distribution of the input data over time and detect any significant changes. Some examples of statistical tests are hypothesis tests, divergence measures, or clustering methods. You can also use visualization tools such as histograms, box plots, scatter plots, etc. to explore the data distribution and identify any outliers or anomalies.

  • Model-based approach: You can use a classification model to determine whether certain data points are similar to another set of data points. If the model has a hard time differentiating between the data sets, then there is no significant data drift. If the model is able to correctly separate the data sets, then there is likely some data drift.

  • Adaptive windowing (ADWIN): You can use this approach for streaming data, where there is a large amount of infinite data flowing in and it is infeasible to store it all. ADWIN is an algorithm that dynamically adjusts the size of a sliding window that contains the most recent data points. It detects changes in the data distribution by comparing the statistical properties of two sub-windows within the sliding window. If there is a significant difference between the sub-windows, it means that there is a change in the data distribution and ADWIN shrinks or expands the window accordingly.
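The model-based approach above can be sketched with a "domain classifier": train a model to separate a reference sample from a current sample, and score it with ROC AUC. An AUC near 0.5 means the samples are statistically indistinguishable (no drift); a materially higher AUC suggests drift. The data and function name below are illustrative, assuming scikit-learn is available:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

def drift_score(reference, current):
    """ROC AUC of a classifier separating reference from current data.
    ~0.5 means indistinguishable (no drift); close to 1.0 suggests drift."""
    X = np.vstack([reference, current])
    y = np.r_[np.zeros(len(reference)), np.ones(len(current))]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, random_state=0, stratify=y
    )
    clf = LogisticRegression().fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

ref = rng.normal(0.0, 1.0, size=(2000, 3))      # reference sample
same = rng.normal(0.0, 1.0, size=(2000, 3))     # same distribution
shifted = rng.normal(0.8, 1.0, size=(2000, 3))  # mean-shifted: simulated drift

print(drift_score(ref, same))     # near 0.5
print(drift_score(ref, shifted))  # well above 0.5
```

A convenient property of this check is that it works in many dimensions at once, whereas most statistical tests compare one feature at a time.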

Algorithms to Detect Data Drift

To detect data drift, you can use some of the following algorithms:

  • Kullback-Leibler Divergence: This algorithm measures the difference between two probability distributions by calculating the amount of information lost when one distribution is used to approximate the other. It can be used to compare the input data distribution in the current dataset with the input data distribution in the training dataset. If the divergence is high, it means that data drift has occurred.

  • Jensen-Shannon Divergence: This algorithm is a variant of Kullback-Leibler Divergence that is symmetric and always has a finite value. It measures the difference between two probability distributions by calculating the average of Kullback-Leibler Divergences between each distribution and their mean. It can also be used to compare the input data distribution in the current dataset with the input data distribution in the training dataset. If the divergence is high, it means that data drift has occurred.

  • Kolmogorov-Smirnov Test: This algorithm is a non-parametric test that compares the empirical distribution functions of two samples. It calculates the maximum absolute difference between the two distribution functions and sets an alarm for drift when this value is larger than a user-defined threshold. This algorithm is sensitive to differences in both location and shape of the distributions. It is well suited for numerical data.

  • CUSUM (Cumulative Sum Control Chart): This algorithm is a statistical process control method that detects changes in the mean of a sequence of data points. It calculates the cumulative sum of deviations from a reference mean and sets an alarm for drift when this value crosses a user-defined threshold. This algorithm can detect small and gradual changes in the data distribution.

  • EWMA (Exponentially Weighted Moving Average): This algorithm is another statistical process control method that detects changes in the mean of a sequence of data points. It calculates an exponentially weighted moving average of deviations from a reference mean and sets an alarm for drift when this value crosses a user-defined threshold. This algorithm can detect small and gradual changes in the data distribution and gives more weight to recent observations.

  • ADWIN (Adaptive Windowing): This algorithm dynamically adjusts the size of a sliding window that contains the most recent data points. It detects changes in the data distribution by comparing the statistical properties of two sub-windows within the sliding window. If there is a significant difference between the sub-windows, it means that there is a change in the data distribution and ADWIN shrinks or expands the window accordingly.
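Two of the measures above — the Kolmogorov-Smirnov test and a Jensen-Shannon comparison — can be sketched directly with SciPy. The feature data below is synthetic and the drift is simulated by a mean shift; note that SciPy's `jensenshannon` returns the Jensen-Shannon *distance* (the square root of the divergence) between binned histograms:

```python
import numpy as np
from scipy.stats import ks_2samp
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(2)
train_feature = rng.normal(0.0, 1.0, 10000)
live_feature = rng.normal(0.4, 1.0, 10000)  # mean-shifted: simulated drift

# Kolmogorov-Smirnov two-sample test: a small p-value signals drift.
stat, p_value = ks_2samp(train_feature, live_feature)
print(f"KS statistic={stat:.3f}, p-value={p_value:.2e}")

# Jensen-Shannon distance between binned histograms (0 = identical).
bins = np.histogram_bin_edges(np.r_[train_feature, live_feature], bins=50)
p, _ = np.histogram(train_feature, bins=bins, density=True)
q, _ = np.histogram(live_feature, bins=bins, density=True)
print(f"JS distance={jensenshannon(p, q):.3f}")
```

In practice, such checks are typically run per feature on a schedule, with an alert raised when the statistic crosses a threshold chosen for that feature.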


Concept Drift

Concept drift, also referred to as model drift, occurs when the relationship between a model's inputs and the target it predicts changes over time. For instance, consider a machine learning model trained to identify spam emails based on their content. If the types of spam emails people receive undergo significant changes, the model may struggle to accurately detect spam.


Types of Concept Drift

Concept drift can be categorized into four types:

  1. Sudden drift: This type occurs when there is an abrupt and immediate change in the relationship between inputs and outputs. For example, if a new law is implemented that affects the way people send emails, it can cause a sudden drift in spam detection.

  2. Gradual drift: This type occurs when the old and new concepts alternate for a period of time, with the new concept appearing more and more often until it fully replaces the old one. For instance, if people gradually start using new slang words or abbreviations in their emails while older styles persist, it can result in a gradual drift in spam detection.

  3. Incremental drift: This type occurs when there is a continuous and smooth change in the relationship between inputs and outputs over time. For example, if people's preferences or behaviors gradually evolve, it can lead to an incremental drift in customer churn prediction.

  4. Recurring concepts: This type occurs when previously encountered concepts resurface after a certain period. For example, if seasonal trends impact the way people send emails, it can introduce recurring concepts in spam detection.

To overcome concept drift, one approach is to continuously monitor the performance of machine learning models over time and retrain them using new data or adjust their parameters as necessary. This enables the model to adapt to the changing data distribution and maintain its accuracy and effectiveness.

[Figure: concept drift — the input-output relationship changing over time]

Example

Concept drift can occur when a machine learning model, which was trained to predict customer churn based on behavior and preferences, is used on a new group of customers who have different expectations or needs than the ones it learned from. This can lead to incorrect predictions because the relationship between what the model sees as input and what it predicts as output has changed.


Another example of concept drift is when a machine learning model, designed to identify spam emails by analyzing their content, is used on a new set of emails that have different types of spam or new ways of hiding spam compared to what it was trained on. As a result, the model may struggle to identify some spam emails because the task it was trained for has changed.


How to Detect and Prevent Concept Drift?

To prevent and detect concept drift, you can use some of the following methods:

  • Weighting the importance of new data: You can use techniques such as online learning or ensemble learning to weight the importance of new data over old data. This can help your model adjust to changes in data relationships more quickly and smoothly.

  • Creating new models to solve sudden or recurring concept drift: You can create new models to solve sudden or recurring concept drift when your existing model is no longer able to handle the changes in data relationships. For example, you might create a new model for a new season or a new market segment.

  • Maintaining a static model as a baseline for comparison: You can maintain a static model as a baseline for comparison to measure how much your current model has drifted from its original state. This can help you identify when and where concept drift occurs and how severe it is.
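The first point above — weighting new data via online learning — can be sketched with scikit-learn's `partial_fit`, which updates a model one batch at a time so recent batches gradually override older ones. The stream below is synthetic: the input-output rule flips halfway through, simulating sudden concept drift:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(3)
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])

# Hypothetical stream of 200 batches; the labeling rule flips at batch 100
# (sudden concept drift): before, y = 1 when x0 > 0; after, the opposite.
for step in range(200):
    X = rng.normal(size=(50, 2))
    y = (X[:, 0] > 0).astype(int)
    if step >= 100:
        y = 1 - y                           # the concept has changed
    clf.partial_fit(X, y, classes=classes)  # incremental update on new batch

# Evaluate against the *new* concept: the model has adapted to it.
X_test = rng.normal(size=(1000, 2))
y_test = (X_test[:, 0] <= 0).astype(int)
print(clf.score(X_test, y_test))
```

Because each `partial_fit` call nudges the weights toward the latest batch, the model recovers from the flip without ever being retrained from scratch.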


Algorithms to detect concept drift

To detect concept drift, you can use some of the following algorithms:

  • DDM (Drift Detection Method): This algorithm monitors the error rate and the standard deviation of a machine learning model over time and divides the data state into three states: normal, warning, and drift. If the error rate increases significantly and exceeds a certain threshold, it indicates that concept drift has occurred.

  • ADWIN (Adaptive Windowing): This algorithm dynamically adjusts the size of a sliding window that contains the most recent data points. It detects changes in the data relationship by comparing the statistical properties of two sub-windows within the sliding window. If there is a significant difference between the sub-windows, it means that there is a change in the data relationship and ADWIN shrinks or expands the window accordingly.

  • EDDM (Early Drift Detection Method): This algorithm is an extension of DDM that uses the distance between errors instead of the error rate to detect concept drift. It assumes that the distance between errors increases when concept drift occurs. It also uses two thresholds to classify the data state into normal, warning, and drift.

  • Page-Hinkley Test: This algorithm is a statistical test that detects changes in the mean of a sequence of data points. It calculates the cumulative sum of deviations from a reference mean and sets an alarm for drift when this value is larger than a user-defined threshold. This algorithm is sensitive to the parameter values, resulting in a tradeoff between false alarms and detecting true drifts.

  • Kolmogorov-Smirnov Test: This algorithm is a non-parametric test that compares the empirical distribution functions of two samples. It calculates the maximum absolute difference between the two distribution functions and sets an alarm for drift when this value is larger than a user-defined threshold. This algorithm is sensitive to differences in both location and shape of the distributions. It is well suited for numerical data.

  • ExStreamModel: This algorithm uses model explanation to detect concept drift. It triggers the computation of the model explanation at equidistant intervals and calculates the average dissimilarity between the current model explanation and the previous ones. If the dissimilarity exceeds a certain threshold, it indicates that concept drift has occurred.

  • ExStreamAttr: This algorithm is a variant of ExStreamModel that uses more detailed monitoring of dissimilarities at the attribute level. It calculates the dissimilarity between each attribute’s contribution to the model explanation and compares it with a reference distribution. If any attribute’s dissimilarity exceeds a certain threshold, it indicates that concept drift has occurred.

  • Meta-ADD (Active Drift Detection with Meta-learning): This algorithm uses meta-learning to classify concept drift by tracking the changed pattern of error rates. It extracts meta-features based on the error rates of various concept drifts and trains a meta-detector using a prototypical neural network. The meta-detector is then fine-tuned to adapt to the corresponding data stream via stream-based active learning.
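Of the detectors above, the Page-Hinkley test is simple enough to sketch from scratch. The class below is a minimal illustration, not a production implementation; the stream is a deterministic sequence of per-sample error values that jumps from 0.1 to 0.6 at step 500 to simulate concept drift:

```python
import numpy as np

class PageHinkley:
    """Minimal Page-Hinkley sketch: flags an increase in the mean of a
    stream (e.g. a model's per-sample error)."""
    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # tolerance for small fluctuations
        self.threshold = threshold  # alarm level (lambda)
        self.mean = 0.0             # running mean of the stream
        self.n = 0
        self.cum = 0.0              # cumulative deviation m_t
        self.min_cum = 0.0          # running minimum of m_t

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        # Alarm when the cumulative sum rises far above its past minimum.
        return (self.cum - self.min_cum) > self.threshold

ph = PageHinkley()
# Error level jumps from 0.1 to 0.6 at index 500: simulated concept drift.
stream = np.r_[np.full(500, 0.1), np.full(500, 0.6)]
alarm_at = next((t for t, x in enumerate(stream) if ph.update(x)), None)
print(alarm_at)  # shortly after index 500
```

The `delta` parameter absorbs small fluctuations and `threshold` trades false alarms against detection delay, which is the parameter sensitivity mentioned in the Page-Hinkley description above.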


Conclusion

Data drift and concept drift are two critical factors that can significantly impact the performance and effectiveness of machine learning models. While data drift refers to the shift in the distribution of input data, concept drift pertains to the changes in the relationship between inputs and outputs over time. Both phenomena pose challenges in maintaining model accuracy and reliability as they encounter new datasets or evolving tasks.
