
Ensure Data Quality and Data Integrity in Machine Learning



In our previous blog, we explored the potential challenges and pitfalls associated with the data used in machine learning models. From data processing issues to changing data schemas, lost data, and broken upstream models, a range of factors can impact the quality and integrity of the data. While we strive for a flawless data ecosystem, it's crucial to be realistic and focus on the timely detection and mitigation of these issues.


Data reliability and consistency are typically addressed by data engineering teams, which often incorporate checks and monitoring systems at the database level. However, when it comes to machine learning systems, the focus shifts to the specific subset of data consumed by a particular model. In this context, overall data quality across the entire warehouse becomes less relevant, and the emphasis is on monitoring and ensuring the integrity of the data subset relevant to the model's performance. Additionally, the feature processing code plays a vital role in the pipeline and requires its own dedicated monitoring setup.



Thus, in the domain of data quality and integrity, a convergence between MLOps (Machine Learning Operations) and DataOps becomes essential. This symbiotic union allows us to address the unique considerations of machine learning systems while upholding the best practices of data operations.



Let us embark on a journey to explore the key facets that demand our unwavering attention:

1. Model Calls

The first question we must answer pertains to the functionality of our model. To shed light on this, we scrutinize the number of model responses as a basic yet valuable check layered atop software monitoring. This helps us determine if the model is operational even when the service itself appears functional.


Furthermore, it aids in identifying instances where a fallback mechanism, such as a business rule, may be utilized more frequently than anticipated. By establishing a "normal" usage pattern, such as the model's deployment on an e-commerce website or its regular consumption of new sales data, we gain clarity on the expected level of model interaction.


Keeping a watchful eye on the number of model calls enables us to easily detect anomalies and initiate debugging procedures when required.
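
As a concrete illustration, here is a minimal sketch of such a check in Python. The counters, the scikit-learn-style `model.predict` call, and the alert thresholds are all hypothetical; in production, these counts would typically be exported to a monitoring system rather than printed.

```python
from collections import Counter

# Hypothetical counters for a prediction service; in production these would
# usually be metrics exported to a monitoring backend.
call_counts = Counter()

def predict_with_fallback(model, features, fallback_value):
    """Return a model prediction, falling back to a business rule on failure."""
    try:
        prediction = model.predict([features])[0]  # scikit-learn-style interface (assumption)
        call_counts["model"] += 1
        return prediction
    except Exception:
        call_counts["fallback"] += 1
        return fallback_value

def check_call_volume(expected_min_calls_per_hour, max_fallback_share=0.05):
    """Flag anomalies against the 'normal' usage pattern established earlier."""
    total = call_counts["model"] + call_counts["fallback"]
    if total < expected_min_calls_per_hour:
        print(f"ALERT: only {total} calls this hour, expected >= {expected_min_calls_per_hour}")
    if total and call_counts["fallback"] / total > max_fallback_share:
        print(f"ALERT: fallback used for {call_counts['fallback'] / total:.0%} of calls")
```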

2. Data Schema Consistency

As previously discussed, the evolution of data schemas poses a risk to data integrity. Whether due to inadvertent errors or well-intentioned modifications, we aspire to detect any deviations promptly. A meticulous feature-by-feature examination allows us to investigate if the feature set remains intact.


For tabular data, we establish the number of columns, ensuring that nothing is missing or added unexpectedly. Furthermore, we validate the consistency of feature data types, preventing instances where numerical values are erroneously replaced with categorical values or vice versa.


By conducting schema validation, we obtain a concise overview that affirms whether the incoming dataset aligns with our expectations.
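
Below is a minimal schema-validation sketch, assuming the incoming batch arrives as a pandas DataFrame. The column names and dtypes in `EXPECTED_SCHEMA` are purely illustrative; in practice, the reference schema would be captured from the training data.

```python
import pandas as pd

# Reference schema: column name -> expected dtype (illustrative values).
EXPECTED_SCHEMA = {"age": "int64", "country": "object", "total_spend": "float64"}

def validate_schema(batch: pd.DataFrame) -> list[str]:
    """Compare an incoming batch against the expected column set and dtypes."""
    issues = []
    missing = set(EXPECTED_SCHEMA) - set(batch.columns)
    unexpected = set(batch.columns) - set(EXPECTED_SCHEMA)
    if missing:
        issues.append(f"missing columns: {sorted(missing)}")
    if unexpected:
        issues.append(f"unexpected columns: {sorted(unexpected)}")
    for column, expected_dtype in EXPECTED_SCHEMA.items():
        if column in batch.columns and str(batch[column].dtype) != expected_dtype:
            issues.append(f"{column}: expected {expected_dtype}, got {batch[column].dtype}")
    return issues
```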

3. Illuminating Missing Data

The detection of missing data is a vital aspect of maintaining data integrity. While a certain degree of missing values may be deemed acceptable, we aim to compare the level of missing data against the "normal" range for both the entire dataset and individual features. This scrutiny ensures that critical features are not inadvertently lost during the data pipeline.


It is essential to acknowledge that missing values can manifest in various forms, ranging from empty entries to placeholders like "unknown" or "999." Employing comprehensive checks that account for standard expressions of missing data, such as "N/A," "NaN," or "undefined," helps safeguard against oversight. Occasional manual audits, complemented by domain expertise, serve as an added layer of assurance in detecting any deviations from expected missing data patterns.


Furthermore, setting data validation thresholds empowers organizations to determine when it is necessary to pause the model or employ fallback mechanisms, especially if the number of missing features exceeds predefined thresholds.
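
The sketch below illustrates such a check for a pandas DataFrame. The placeholder list, the per-feature thresholds, and the default allowance are illustrative assumptions; real values would come from the "normal" missing-data ranges observed historically.

```python
import pandas as pd

# Placeholder values treated as missing alongside real NaNs (illustrative list).
MISSING_PLACEHOLDERS = {"", "unknown", "N/A", "NaN", "undefined", "999"}

def missing_share(batch: pd.DataFrame) -> pd.Series:
    """Per-feature share of missing values, counting known placeholders."""
    as_missing = batch.isna() | batch.astype(str).isin(MISSING_PLACEHOLDERS)
    return as_missing.mean()

def check_missing(batch: pd.DataFrame, allowed_share: dict[str, float], default=0.1):
    """Compare each feature's missing share against its 'normal' threshold."""
    for feature, share in missing_share(batch).items():
        threshold = allowed_share.get(feature, default)
        if share > threshold:
            print(f"ALERT: {feature} is {share:.0%} missing (allowed {threshold:.0%})")
```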

4. Ensuring Reliable Feature Values

Simply having data at our disposal does not guarantee its correctness. Subtle errors can permeate the dataset, compromising its integrity.


Consider a scenario such as mistakenly rescaling numbers in Excel, leaving an "age" column with values ranging from 0.18 to 0.80 instead of the expected 18 to 80.


Another instance could involve a physical sensor malfunctioning and producing constant values for an extended period. Furthermore, human error during feature calculation, such as the accidental omission of a minus sign, may introduce negative sales numbers.


To combat these issues, monitoring the ranges and statistics of features becomes crucial. For numerical features, ensuring values fall within reasonable ranges is essential, while categorical attributes warrant defining a list of possible values and scrutinizing for any novelties. Expert input and domain knowledge can aid in establishing expectations for specific inputs.


Tracking key feature statistics, such as the mean, median, minimum, maximum, and quantiles, allows us to identify anomalies and promptly address any issues that compromise data integrity.
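
A minimal sketch of these value checks might look as follows, again assuming a pandas DataFrame. The ranges in `NUMERIC_RANGES`, the category lists, and the stuck-sensor heuristic are illustrative; real expectations would be set with domain experts and training-data statistics.

```python
import pandas as pd

# Illustrative expectations; real ranges and category lists would come from
# domain experts and from statistics observed on training data.
NUMERIC_RANGES = {"age": (18, 80), "sales": (0, 1_000_000)}
KNOWN_CATEGORIES = {"country": {"US", "DE", "FR"}}

def check_feature_values(batch: pd.DataFrame):
    """Range checks for numerical features, novelty checks for categorical ones."""
    for feature, (low, high) in NUMERIC_RANGES.items():
        out_of_range = ~batch[feature].between(low, high)
        if out_of_range.any():
            print(f"ALERT: {feature} has {out_of_range.sum()} values outside [{low}, {high}]")
        if batch[feature].nunique() == 1:
            print(f"ALERT: {feature} is constant, possible stuck sensor")
    for feature, allowed in KNOWN_CATEGORIES.items():
        novel = set(batch[feature].dropna().unique()) - allowed
        if novel:
            print(f"ALERT: {feature} has unseen categories: {sorted(novel)}")
    # Key statistics to track over time (mean, min, max, quantiles).
    print(batch[list(NUMERIC_RANGES)].describe())
```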

5. Validating Feature Processing

A vital aspect to consider is where to conduct data validation checks within the pipeline. When encountering incorrect data, our initial inquiry revolves around the root cause. Ideally, we want to locate the error as close to its source as possible, as soon as it is detected.


In cases where the source data remains intact, but anomalies arise during its transformation into model features, broken joins or faulty feature code may be responsible.


By conducting separate validation checks for inputs and outputs at each step of the pipeline, we enhance our ability to pinpoint and rectify the issue efficiently. This approach minimizes the need for extensive retracing of steps in complex pipelines, ultimately saving valuable time and resources.
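
The sketch below shows one way to structure such step-by-step validation around a hypothetical feature pipeline. The step names, column names (`user_id`, `country`, `gross_sales`, `refunds`), and the specific checks are assumptions made for illustration.

```python
import pandas as pd

def validate(step_name: str, data: pd.DataFrame, checks: list) -> pd.DataFrame:
    """Run check functions on a step's output; each returns an error message or None."""
    errors = [msg for check in checks if (msg := check(data)) is not None]
    if errors:
        raise ValueError(f"Validation failed after '{step_name}': {errors}")
    return data

# Hypothetical pipeline: validate raw inputs, then validate again after the
# join and the feature computation, so a broken join or faulty feature code
# is caught at its own step rather than downstream.
def build_features(raw_orders: pd.DataFrame, raw_users: pd.DataFrame) -> pd.DataFrame:
    validate("raw_orders", raw_orders, [lambda df: "empty input" if df.empty else None])
    validate("raw_users", raw_users, [lambda df: "empty input" if df.empty else None])

    joined = raw_orders.merge(raw_users, on="user_id", how="left")
    validate("join", joined, [
        lambda df: "join changed row count" if len(df) != len(raw_orders) else None,
        lambda df: "unmatched users" if df["country"].isna().any() else None,
    ])

    features = joined.assign(net_sales=joined["gross_sales"] - joined["refunds"])
    validate("feature_code", features, [
        lambda df: "negative sales" if (df["net_sales"] < 0).any() else None,
    ])
    return features
```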


Conclusion:

Nurturing data excellence stands as the first line of defense for production machine learning systems. By vigilantly monitoring data quality and integrity, we can preemptively detect and rectify a plethora of issues before they impact the overall performance of our models.


This practice serves as a fundamental health check, akin to monitoring latency or memory consumption. Furthermore, it extends to both human-generated and machine-generated inputs, as each possesses its unique set of potential errors.


Embracing data monitoring empowers organizations to uncover abandoned or unreliable data sources, illuminating pathways for continual improvement and the cultivation of truly impactful machine learning systems.
