
What is Data Processing? Importance and the Different Stages of Data Processing

Data processing

Data processing occurs when data is collected and translated into usable information. Usually performed by a data scientist or a team of data scientists, data processing must be done correctly so that it does not negatively affect the end product, or data output.


Data processing starts with data in its raw form and converts it into a more readable format (graphs, documents, etc.), giving it the form and context necessary to be interpreted by computers and utilized by employees throughout an organization.



Importance of Data Processing

Many data scientists spend too much time thinking about learning rates, neuron structures, and epochs before actually using correctly optimized data. Without properly formatted data, your neural network will be useless, regardless of the hours you may spend optimizing its hyperparameters. Preprocessing data tends to revolve around the following tasks:

  • Data Cleaning

  • Outlier Removal

  • Transformation

  • Normalization


Data cleaning

Cleaning data and removing outliers are tedious but compulsory tasks. Whether a dataset comes from the internet, has a conflicting file format, or was created by non-technical people, you will nearly always find issues with syntax, row delimiters, and so on.


For example, sometimes a dataset might have quote marks around numeric values (‘4’, ‘5’, ‘6’), which would be interpreted as strings. We need a way to strip irrelevant surrounding characters from each value. Another example is row delimiters. Some datasets separate their rows with line breaks (‘\n’), some with spaces, and some with unusual characters like a pipe (|). Some don’t separate them at all and assume you know the field count, so you can split rows just by counting commas. That can be dangerous, because aligning the fields incorrectly could shuffle your data and render it completely useless (repeated fields in the same row, etc.). You can also get empty values, which in a text file might look like two commas side by side.


To prevent errors later on, you may need a place-holding value for these. It is very normal for datasets not to handle this automatically: to the dataset, it is just a blank value, but to you, it is a potential breakpoint.
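As a rough illustration of this kind of cleanup, here is a minimal sketch in plain Python that strips surrounding quote marks from values and substitutes a placeholder for empty fields (the placeholder choice and the sample row are assumptions for the example):

```python
# Minimal cleaning sketch: strip stray quote marks and fill empty values.
PLACEHOLDER = "0"  # hypothetical place-holding value for empty fields

def clean_value(value: str) -> str:
    """Remove surrounding whitespace and quote marks; fill empty values."""
    value = value.strip().strip("'\"‘’“”")  # handles plain and curly quotes
    return value if value else PLACEHOLDER

def clean_row(raw_row: str, delimiter: str = ",") -> list[str]:
    """Split a raw row on the chosen delimiter and clean each field."""
    return [clean_value(v) for v in raw_row.split(delimiter)]

print(clean_row("‘4’,’5’,,’6’"))  # -> ['4', '5', '0', '6']
```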


Occasionally, you can even run into Unicode character issues. Datasets from around the world may use different comma-like characters that will not match the delimiter you specify. In the following untouched example dataset, you can see a few of these issues:



Outlier removal

Outlier removal is also very important in this example. Few datasets ever come without data that is irrelevant to the problem at hand.


In this case, we can see that the first row and the first column both contain data that we don’t want to feed through our neural network. The first column is an ID column. Our algorithm has no interest in this data, as it does not inform the neural network about the dataset’s subject, so this column should be removed. Likewise, the first row gives the field labels; this, too, should not be considered by the neural network. Once we have a dataset that consists of only meaningful data, the real preprocessing begins.
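A minimal sketch of this step with pandas (the file name and column layout are assumptions based on the example described above):

```python
import pandas as pd

# Hypothetical file; the first row holds the field labels and the first column is an ID.
df = pd.read_csv("diamonds.csv", header=0)   # the label row becomes column names, not data
df = df.drop(columns=df.columns[0])          # drop the ID column, which carries no meaning

print(df.head())  # only meaningful fields remain, ready for preprocessing
```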


Transformation

The dataset above is also a good example of when transformation is needed, specifically alpha classification. It is obvious that we need to do something with columns that contain values that are non-numeric, such as the “Good” and “Premium” descriptions, as well as values like “VVS2,” “SI2,” “VS2,” and so on.


Generally, transformation refers to any conversion between two formats of data, where both still attempt to represent the same thing. A great example of this is how alphabetic values that represent a classification are converted to numeric binary vectors. The process of this conversion is simple. First, find all the distinct classes in an alphabetic column. From just what we can see in the example above, the “cut” column would list all of these:

  • “Good”

  • “Fair”

  • “Very Good”

  • “Ideal”

  • “Premium”


So, we have five classes. We now need to replace the single “cut” column with five columns, one named after each class. This transformation would look like the following, where the resulting data would be incorporated into the original dataset.



Each row is a binary vector, where [0,1,0,0,0] represents the class “Ideal.” Not only does this actually use less memory (although it may not look like it), but it is now fully interpretable by a machine learning algorithm. The meaning behind the data remains, as long as all rows use the binary values consistently.
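A minimal sketch of this transformation with pandas, using a few hypothetical rows drawn from the five classes listed above:

```python
import pandas as pd

# Hypothetical rows containing only the "cut" column from the example.
df = pd.DataFrame({"cut": ["Ideal", "Premium", "Good", "Fair", "Very Good"]})

# Replace the single "cut" column with one binary column per class.
encoded = pd.get_dummies(df, columns=["cut"], dtype=int)
print(encoded)  # each row is a binary vector with a single 1 in its class's column
```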


Here is a real-life demonstration showing my software, Perceptron, transforming an alpha-heavy dataset, with sample rows from the original dataset (all alphabetic).


Every field is a classification, with each value being the first letter of the classification name.

For example, the first field has only two classes: “p” and “e,” meaning poisonous or edible. It is worth mentioning that this dataset is for predicting whether mushrooms are edible or not, based mainly on visual attributes.



Here is our resulting transformation. On the left, we can see all the classes that were found per field. Just above, we can see how each field has become multiple fields based on all the classes of that field. Every 0 and 1 (on a given row) will be one neuron on an input layer to a neural network (bar target values).


Let’s have a closer look at the first row, which went from

p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u

to this…

[1, 0],[1, 0, 0, 0, 0, 0],[1, 0, 0, 0],[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],[1, 0],[1, 0, 0, 0, 0, 0, 0, 0, 0],[1, 0],[1, 0],[1, 0],[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],[1, 0],[1, 0, 0, 0, 0],[1, 0, 0, 0],[1, 0, 0, 0],[1, 0, 0, 0, 0, 0, 0, 0, 0],[1, 0, 0, 0, 0, 0, 0, 0, 0],[1],[1, 0, 0, 0],[1, 0, 0],[1, 0, 0, 0, 0],[1, 0, 0, 0, 0, 0, 0, 0, 0],[1, 0, 0, 0, 0, 0],[1, 0, 0, 0, 0, 0, 0]


The first classes of the fields are p, x, s, n, etc., which happens to match up with our first-row example. Then, if we look at the result, the first item of each vector is hot. The number of classes in each field will also match the length of that field’s sample binary vector.
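To make the mapping concrete, here is a minimal sketch of this per-field encoding in plain Python. The first row is the sample row from above; the second row is a made-up one, added only so that most fields have more than one class (in the real dataset the classes come from all of the rows):

```python
# Per-field one-hot encoding sketch (illustrative, not the Perceptron implementation).
rows = [
    "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u",  # sample row from the article
    "e,b,y,w,f,n,a,w,b,p,t,b,k,k,n,n,p,n,n,e,n,y,d",  # made-up row for illustration
]
split_rows = [row.split(",") for row in rows]

# Find the distinct classes of each field, preserving first-seen order.
classes_per_field = [[] for _ in split_rows[0]]
for row in split_rows:
    for i, value in enumerate(row):
        if value not in classes_per_field[i]:
            classes_per_field[i].append(value)

# Encode a row as one binary vector per field.
def encode(row):
    return [[1 if value == cls else 0 for cls in classes_per_field[i]]
            for i, value in enumerate(row)]

# The first item of each vector is hot for the first row, as described above.
print(encode(split_rows[0]))
```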


Normalization

Classification fields never really require normalization. However, numeric values that reflect a specific amount of something (rather than a binary structure or a classification identity) nearly always need to be normalized. Normalization is all about scale. If we look at the dataset from earlier,



fields such as depth, price, x, y, and z all have very different scales, because they are measured in entirely different units. To us, this makes sense. However, feeding these different scales into a neural network can cause a massive imbalance in weight values and in the learning process. Instead, all values need to be on the same scale while still representing the varying quantities being described. Most commonly, we do this by bringing values between 0 and 1, or at least close to this range. Most simply, we could use a divisor on each field, as shown in the table and the short sketch below:

Price of Diamond    Normalised Price of Diamond
326                 0.326
334                 0.334
423                 0.423
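A minimal sketch of this divide-by-a-constant approach, assuming the prices sit in a pandas DataFrame (the column names are illustrative):

```python
import pandas as pd

# Hypothetical price values from the table above.
df = pd.DataFrame({"price": [326, 334, 423]})

# Divide by a constant large enough to bring the values between 0 and 1.
df["price_normalised"] = df["price"] / 1000
print(df)  # 0.326, 0.334, 0.423
```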

Because the prices are mainly three digits, we can divide them all by 1,000. For other fields, we would use any divisor that consistently gets the values between 0 and 1. Using an Iris dataset, I am going to demonstrate how a badly scaled dataset performs. Here are some sample rows:


You can see the first field has values as large as 50, the second field has values around 3, and finally there are some values under 1. Training a neural network with this data produced the following results:



Although there is a decrease in error, the performance is awful at 60%. Now, if we normalize the data so that the dataset looks like this:



We then get much better results



There are a few issues with this trivial method of normalization. If we look at the last numeric column, all the values are 0.02, 0.03, etc. Although these are now between 0 and 1, they are still not very well scaled and are still out of proportion to the other fields. To solve this, we can use a much better method of normalization that takes a field’s highest and lowest values into account and then calculates what all the other values should be based on this range. The equation to do this is:

x = R1 + ((i − min) × (R2 − R1)) / (max − min)

where x is your normalized value, i is your unnormalized value, min and max are the lowest and highest values in the field, and R1 and R2 are your desired bounds for the normalized value (0 and 1). This will result in every value being properly scaled within each field.
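A minimal sketch of this min-max scaling in plain Python, applied one field at a time (the sample values are illustrative):

```python
def min_max_normalise(values, r1=0.0, r2=1.0):
    """Scale values so the field's minimum maps to r1 and its maximum maps to r2."""
    lo, hi = min(values), max(values)
    return [r1 + (v - lo) * (r2 - r1) / (hi - lo) for v in values]

# A poorly scaled field like the 0.02, 0.03, ... column spreads out to fill 0..1.
print(min_max_normalise([0.02, 0.03, 0.02, 0.05, 0.04]))
```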



Different Stages of Data Processing


1. Data collection

Collecting data is the first step in data processing. Data is pulled from available sources, including data lakes and data warehouses. It is important that the data sources available are trustworthy and well-built so the data collected (and later used as information) is of the highest possible quality.


2. Data preparation

Once the data is collected, it then enters the data preparation stage. Data preparation, often referred to as “pre-processing,” is the stage at which raw data is cleaned up and organized for the following stage of data processing. During preparation, raw data is diligently checked for any errors. The purpose of this step is to eliminate bad data (redundant, incomplete, or incorrect data) and begin to create the high-quality data needed for the best business intelligence.


3. Data input

The clean data is then entered into its destination (perhaps a CRM like Salesforce or a data warehouse like Redshift) and translated into a language that the destination system can understand. Data input is the first stage at which raw data begins to take the form of usable information.


4. Processing

During this stage, the data input to the computer in the previous stage is actually processed for interpretation. Processing is done using machine learning algorithms, though the process itself may vary slightly depending on the source of the data being processed (data lakes, social networks, connected devices, etc.) and its intended use (examining advertising patterns, medical diagnosis from connected devices, determining customer needs, etc.).


5. Data output/interpretation

The output/interpretation stage is the stage at which data is finally usable to non-data scientists. It is translated, readable, and often in the form of graphs, videos, images, or plain text. Members of the company or institution can now begin to self-serve the data for their own data analytics projects.


6. Data storage

The final stage of data processing is storage. After all of the data is processed, it is then stored for future use. While some information may be put to use immediately, much of it will serve a purpose later on. Plus, properly stored data is a necessity for compliance with data protection legislation like GDPR. When data is properly stored, it can be quickly and easily accessed by members of the organization when needed.




Resource: Medium, Wikipedia


The Tech Platform
