Data validation is the process by which a program checks data to make sure it meets certain rules or restrictions. Many different validation checks can be performed. For example, we may check that the data:
Is of the correct data type, for example a number and not a string
Does not contain invalid values, such as a zip code that includes a letter
Is not out of range, such as an age given as a negative number, or a divisor of zero
Meets constraints, for example a given date must be in the future, or a message must not exceed a maximum length
Is consistent with other data or constraints, for example a student's test score agrees with the student's letter grade
Is valid, such as a given filename referring to an existing file
Is complete, such as making sure all required form fields have data
Where we check the data is program-dependent, but here are some guidelines:
Any time we get direct input from the user, check the data type and the valid range.
When our code is contained in a library that anyone can use, check all input to all methods so that a caller cannot corrupt or break our code.
Any time a caller changes a class attribute, check both that the value is allowed to change and that the new value is valid.
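As a sketch of the last point, attribute changes can be validated with a property. The Student class and the 0-100 bounds here are hypothetical:

```python
class Student:
    """Hypothetical class that validates changes to its score attribute."""

    def __init__(self, score):
        self.score = score  # goes through the property setter below

    @property
    def score(self):
        return self._score

    @score.setter
    def score(self, value):
        # Reject wrong types and out-of-range values before storing.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError('score must be a number')
        if not 0 <= value <= 100:
            raise ValueError('score must be between 0 and 100')
        self._score = value

s = Student(85)
s.score = 92          # valid change
try:
    s.score = -5      # invalid: the setter raises ValueError
except ValueError as e:
    print(e)
```

Because every assignment runs through the setter, a caller cannot put the object into an invalid state.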
Types of Validation in Python
There are three common types of validation in Python:
Type Check: checks the data type of the given input, for example int or float.
Length Check: checks the length of the given input string.
Range Check: checks whether a given number falls between two bounds.
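A minimal sketch of all three checks together; the field names and bounds are illustrative assumptions:

```python
def validate_entry(age_text, username):
    """Apply a type check, a range check, and a length check."""
    # Type check: age must parse as an integer.
    try:
        age = int(age_text)
    except ValueError:
        return False, 'age must be a whole number'
    # Range check: age must fall between two bounds.
    if not 0 <= age <= 120:
        return False, 'age must be between 0 and 120'
    # Length check: username must not exceed a maximum length.
    if not 1 <= len(username) <= 20:
        return False, 'username must be 1-20 characters'
    return True, 'ok'

print(validate_entry('34', 'alice'))    # (True, 'ok')
print(validate_entry('abc', 'alice'))   # type check fails
print(validate_entry('150', 'alice'))   # range check fails
```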
Using a flag
flagName = False
while not flagName:
    if [Do check here]:
        flagName = True
    else:
        print('error message')
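Filling in the placeholder, a runnable version of the flag pattern might look like this; the prompt and the 0-120 range are illustrative, and the input source is parameterised so the loop can also be exercised without a keyboard:

```python
def read_age(prompt_input=input):
    """Loop with a flag until the supplied value passes the checks."""
    valid = False
    while not valid:
        text = prompt_input('Enter an age between 0 and 120: ')
        if text.isdigit() and 0 <= int(text) <= 120:
            valid = True
        else:
            print('error: please enter a whole number between 0 and 120')
    return int(text)

# age = read_age()  # prompts interactively until the input is valid
```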
Using an exception
while True:
    try:
        [run code that might fail here]
        break
    except:
        print('This is the error message if the code fails')

print('run the code from here if code is successfully run in the try block of code above')

#Reference: http://www.easypythondocs.com/validation.html
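Concretely, the same loop with the placeholder filled in: int() raises ValueError on bad input, and catching that specific exception is preferable to a bare except. The prompt text is an illustrative assumption:

```python
def read_number(prompt_input=input):
    """Keep asking until int() succeeds, using try/except instead of a flag."""
    while True:
        try:
            number = int(prompt_input('Enter a number: '))
            break  # reached only if the conversion above did not raise
        except ValueError:
            print('This is the error message if the code fails')
    return number

# n = read_number()  # prompts interactively until int() succeeds
```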
Data Validation Tools
While working with data, validation is a crucial task that ensures the data is clean, correct and useful. Cerberus is an open source data validation and transformation tool for Python. The library provides powerful yet lightweight data validation functionality and is easily extensible with custom validation rules. The Cerberus 1.x versions can be used with Python 2, while version 2.0 and later rely on Python 3 features.
Colander is a Python library for validating and deserialising data obtained via XML, JSON, an HTML form post or any other equally simple data serialisation. It is a good basis for form generation systems, data description systems, and configuration systems. The library has been tested on Python 2.7 and above. It can be used to define a data schema, serialise an arbitrary Python structure to a data structure composed of strings, mappings, and lists, and deserialise such a data structure back into an arbitrary Python structure after validating it against the schema.
Schema is a library for validating Python data structures, such as those obtained from config files, forms, external services or command-line parsing, converted from JSON/YAML (or something else) to Python data types. If the data is valid, Schema.validate returns the validated data; if the data is invalid, Schema raises a SchemaError exception.
Voluptuous is a Python data validation library, primarily intended for validating data coming into Python as JSON, YAML, etc. The library has three main goals: simplicity, support for complex data structures, and useful error messages. It offers several practical benefits: validators are simple callables, errors are simple exceptions, schemas are basic Python data structures, and so on.
Valideer is a lightweight data validation and adaptation library for Python. It supports both validation (checking whether a value is valid) and adaptation (converting a valid input to an appropriate output). It is extensible: new custom validators and adaptors can be easily defined and registered. Validation schemas can be specified in a declarative and extensible language.
Schematics is a Python library for data validation which combines types into structures, validates them, and transforms the shapes of your data based on simple descriptions. It can also be used for a range of tasks, such as designing and documenting specific data structures, converting structures to and from formats such as JSON or MsgPack, validating API inputs, defining message formats for communication protocols such as RPC, and much more.
Steps to Data Validation
Step 1: Determine data sample
Determine the data to sample. If you have a large volume of data, you will probably want to validate a sample of your data rather than the entire set. You’ll need to decide what volume of data to sample, and what error rate is acceptable to ensure the success of your project.
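The sampling step might be sketched with the standard library as below; the records, sample size, and validity rule are arbitrary assumptions:

```python
import random

def sample_error_rate(records, sample_size, is_valid, seed=0):
    """Validate a random sample of records and report the observed error rate."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(records, min(sample_size, len(records)))
    errors = sum(1 for r in sample if not is_valid(r))
    return errors / len(sample)

# Hypothetical records: ages, some of them invalid (negative).
records = [25, 31, -4, 56, 18, -1, 40, 73, 29, 62]
rate = sample_error_rate(records, 5, lambda age: age >= 0)
print(f'observed error rate: {rate:.0%}')
```

Comparing the observed rate against the acceptable error rate chosen up front tells you whether the full data set needs deeper cleaning.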
Step 2: Validate the database
Before you move your data, you need to ensure that all the required data is present in your existing database. Determine the number of records and unique IDs, and compare the source and target data fields.
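The record-count and unique-ID comparison might be sketched as follows, with made-up source and target IDs:

```python
def compare_ids(source_ids, target_ids):
    """Compare record counts and report IDs missing from either side."""
    source, target = set(source_ids), set(target_ids)
    return {
        'source_count': len(source),
        'target_count': len(target),
        'missing_in_target': sorted(source - target),
        'unexpected_in_target': sorted(target - source),
    }

# Hypothetical migration: target is missing ID 103 and gained ID 999.
report = compare_ids([101, 102, 103, 104], [101, 102, 104, 999])
print(report)
```

An empty `missing_in_target` list is the signal that all required records made it across.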
Step 3: Validate the data format
Determine the overall health of the data and the changes that will be required of the source data to match the schema in the target. Then search for incongruent or incomplete counts, duplicate data, incorrect formats, and null field values.
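Checks for duplicates, null field values, and incorrect formats could be sketched like this over hypothetical rows; the email pattern is a deliberately rough assumption:

```python
import re
from collections import Counter

rows = [  # hypothetical source rows: (id, email)
    (1, 'ada@example.com'),
    (2, None),                # null field value
    (2, 'bob@example.com'),   # duplicate ID
    (3, 'not-an-email'),      # incorrect format
]

duplicate_ids = [i for i, n in Counter(r[0] for r in rows).items() if n > 1]
null_rows = [r for r in rows if any(v is None for v in r)]
email_ok = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')
bad_format = [r for r in rows if r[1] and not email_ok.match(r[1])]

print('duplicate ids:', duplicate_ids)
print('rows with nulls:', null_rows)
print('bad email format:', bad_format)
```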
Methods for Data Validation
You can perform data validation in one of the following ways:
Scripting: Data validation is commonly performed using a scripting language such as Python to write scripts for the validation process. For example, you can create an XML file with source and target database names, table names, and columns to compare. The Python script can then take the XML as an input and process the results. However, this can be very time intensive, as you must write the scripts and verify the results by hand.
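The XML-driven approach described above might look like the following sketch; the XML layout, database names, and table names are all invented for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical comparison spec: which databases, tables, and columns to check.
spec = ET.fromstring("""
<validation source="prod_db" target="warehouse_db">
  <table name="customers">
    <column>id</column>
    <column>email</column>
  </table>
</validation>
""")

print('compare', spec.get('source'), 'against', spec.get('target'))
for table in spec.findall('table'):
    cols = [c.text for c in table.findall('column')]
    # A real script would query both databases here and diff the results.
    print('table', table.get('name'), '-> columns', cols)
```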
Enterprise tools: Enterprise tools are available to perform data validation. For example, FME data validation tools can validate and repair data. Enterprise tools have the benefit of being more stable and secure, but can require infrastructure and are costlier than open source options.
Open source tools: Open source options are cost-effective, and if they are cloud-based, can also save you money on infrastructure costs. But they still require a level of knowledge and hand-coding to be able to use effectively. Some open source tools are SourceForge and OpenRefine.
Challenges in Data Validation
Data validation can be challenging for a couple of reasons:
Validating the database can be challenging because data may be distributed in multiple databases across your organization. The data may be siloed, or it may be outdated.
Validating the data format can be an extremely time-consuming process, especially if you have large databases and you intend to perform the validation manually. However, sampling the data for validation can help to reduce the time needed.
The Tech Platform