
What is Data Leakage? Types, Detection and Prevention. How to Fix Data Leakage?

“Data leakage” refers to the unauthorized passage of data or information from inside an organization to a destination outside its secured network. Data leakage can refer to electronic data, which can be transmitted via the web; or physical data, which can be stored and moved on devices like USB sticks or hard drives. Data leakage is one of the most important aspects of cybersecurity businesses have to consider today and can be avoided through the use of tools and education.

Types of Data Leakage

Data leakage happens in various ways and can originate from either an internal or an external source. Protective measures are needed to stop the most common leakage attempts. The most common types of data leakage are listed below.

1. The Accidental Breach

Data leakage is not always malicious or caused by an external source. Many data leaks are unintentional and happen when, for example, a message is misaddressed or the wrong recipient is chosen. Such unintentional leaks can still attract the same penalties and cause the same reputational damage.

2. Unhappy Employees

Data leaks do not always come from an external source; they are often caused by unhappy employees within the organization. Much of this leakage reportedly happens through cameras, printers, photocopiers, and USB drives. Discarded documents retrieved by dumpster diving are another route. An employee who is determined to leak data is very difficult to stop. This kind of deliberate leakage is known as data exfiltration.

3. Electronic communication medium

Communication is vital in today’s work environment. Employees in most organizations have access to email, the internet, and instant messaging, and each of these channels creates a possibility of data leaking to outside sources. A commonly used weapon is malware delivered over any of these channels, which is found to have a high success rate. In another common scheme, cybercriminals send a message from a spoofed address that looks like a legitimate business email account and request sensitive information. A user who believes the message is legitimate sends the requested information, which can range from financial details to pricing details.

Phishing is another technique that has been quite successful. Clicking a link that carries malicious code sent by a cybercriminal can hand control of the victim’s system to the attacker, who can then access the victim’s laptop or network and gather the information required.

Data Leakage Detection

A data distributor has given sensitive data to a set of supposedly trusted agents (third parties). Some of the data is leaked and found in an unauthorized place (e.g., on the web or somebody’s laptop). The distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. We propose data allocation strategies (across the agents) that improve the probability of identifying leakages. These methods do not rely on alterations of the released data (e.g., watermarks). In some cases we can also inject “realistic but fake” data records to further improve our chances of detecting leakage and identifying the guilty party.


Traditionally, leakage detection is handled by watermarking, e.g., a unique code is embedded in each distributed copy. If that copy is later discovered in the hands of an unauthorized party, the leaker can be identified.

Disadvantages of Existing Systems:

Watermarks can be very useful in some cases, but they involve some modification of the original data. Furthermore, watermarks can sometimes be destroyed if the data recipient is malicious.

Data sharing of this kind is common. For example, a hospital may give patient records to researchers who will devise new treatments. Similarly, a company may have partnerships with other companies that require sharing customer data. Another enterprise may outsource its data processing, so data must be given to various other companies. We call the owner of the data the distributor and the supposedly trusted third parties the agents.


Our goal is to detect when the distributor’s sensitive data has been leaked by agents, and if possible to identify the agent that leaked the data. Perturbation is a very useful technique where the data is modified and made “less sensitive” before being handed to agents, but like watermarking it alters the original data. We instead develop unobtrusive techniques for detecting leakage of a set of objects or records, together with a model for assessing the “guilt” of agents.

We also present algorithms for distributing objects to agents, in a way that improves our chances of identifying a leaker.

Finally, we also consider the option of adding “fake” objects to the distributed set. Such objects do not correspond to real entities but appear realistic to the agents.

In a sense, the fake objects act as a type of watermark for the entire set, without modifying any individual members. If it turns out that an agent was given one or more fake objects that were later leaked, the distributor can be more confident that the agent was guilty.

Problem Setup and Notation:

A distributor owns a set T = {t1, …, tm} of valuable data objects. The distributor wants to share some of the objects with a set of agents U1, U2, …, Un, but does not wish the objects to be leaked to other third parties. The objects in T could be of any type and size, e.g., they could be tuples in a relation, or relations in a database. An agent Ui receives a subset of objects, determined either by a sample request or an explicit request:

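1. Sample request: the agent asks for a given number of objects, and any subset of T of that size can be given to it.

2. Explicit request: the agent receives all the objects in T that satisfy a condition it specifies.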
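
The two request types can be sketched in Python as follows. This is only an illustration: the function names sample_request and explicit_request, the seeded random generator, and the example objects are assumptions and not part of the original system.

```python
import random

def sample_request(objects, m, seed=None):
    """Return any m objects from T; the agent does not care which ones."""
    if m > len(objects):
        raise ValueError("cannot sample more objects than T contains")
    rng = random.Random(seed)
    # sorted() gives a stable order, so a fixed seed is reproducible
    return set(rng.sample(sorted(objects), m))

def explicit_request(objects, condition):
    """Return every object in T that satisfies the agent's condition."""
    return {t for t in objects if condition(t)}

# Hypothetical data set T and two agents' requests
T = {"t1", "t2", "t3", "t4", "t5"}
R1 = sample_request(T, 2, seed=0)                      # any 2 objects
R2 = explicit_request(T, lambda t: t in {"t2", "t4"})  # all matching objects
```

With a sample request the distributor is free to choose which objects to hand over, which is the freedom the allocation strategies exploit.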

Guilt Model Analysis: To check whether the interactions between our model parameters match our intuition, in this section we study two simple scenarios: the impact of the probability p, and the impact of the overlap between Ri and S. In each scenario we have a target that has obtained all the distributor’s objects, i.e., T = S.
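
The text does not spell out the guilt model itself, so the sketch below is one possible formalization under explicit assumptions: each leaked object was either guessed independently by the target with probability p, or leaked by exactly one of the agents that received it, with each such agent equally likely. The function name and the example allocation are hypothetical.

```python
def guilt_probability(agent, allocations, leaked, p):
    """Estimate the probability that `agent` leaked at least one object in S.

    Assumptions (not taken from the article): each leaked object was guessed
    by the target with probability p; otherwise exactly one of the agents
    holding it leaked it, and all holders are equally likely.
    """
    received = allocations[agent]
    prob_innocent = 1.0
    for obj in leaked & received:
        holders = sum(1 for objs in allocations.values() if obj in objs)
        # Probability that this agent leaked this particular object
        prob_leaked_it = (1.0 - p) / holders
        prob_innocent *= 1.0 - prob_leaked_it
    return 1.0 - prob_innocent

# Hypothetical allocation: U1 holds t1 and t2, U2 holds only t1
allocations = {"U1": {"t1", "t2"}, "U2": {"t1"}}
S = {"t1", "t2"}  # leaked set

guilt_u1 = guilt_probability("U1", allocations, S, p=0.5)
guilt_u2 = guilt_probability("U2", allocations, S, p=0.5)
```

Under these assumptions the estimate grows with the overlap between Ri and S and shrinks as p increases, which is the kind of behaviour the two scenarios are meant to confirm.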


1. Evaluation of Explicit Data Request Algorithms

In the first place, the goal of these experiments was to see whether fake objects in the distributed data sets yield significant improvement in our chances of detecting a guilty agent. In the second place, we wanted to evaluate our e-optimal algorithm relative to a random allocation.

2. Evaluation of Sample Data Request Algorithms

With sample data requests agents are not interested in particular objects. Hence, object sharing is not explicitly defined by their requests. The distributor is “forced” to allocate certain objects to multiple agents only if the number of requested objects exceeds the number of objects in set T. The more data objects the agents request in total, the more recipients on average an object has; and the more objects are shared among different agents, the more difficult it is to detect a guilty agent.
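
To illustrate why sharing matters, the sketch below serves sample requests with a simple greedy rule that always hands out the objects held by the fewest agents so far. This is an illustrative heuristic, not the algorithm evaluated in the experiments, and the function name is hypothetical.

```python
def allocate_least_shared(objects, requests):
    """Serve sample requests while keeping object sharing as low as possible.

    objects:  list of object identifiers (the set T)
    requests: dict mapping agent name -> number of objects requested
    """
    recipients = {obj: 0 for obj in objects}  # how many agents hold each object
    allocation = {}
    for agent, m in requests.items():
        if m > len(objects):
            raise ValueError("an agent cannot request more objects than T holds")
        # Prefer objects with the fewest recipients; break ties by identifier
        chosen = sorted(objects, key=lambda obj: (recipients[obj], obj))[:m]
        for obj in chosen:
            recipients[obj] += 1
        allocation[agent] = set(chosen)
    return allocation

# Hypothetical example: 4 objects, 3 agents asking for 2 objects each
allocation = allocate_least_shared(
    ["t1", "t2", "t3", "t4"], {"U1": 2, "U2": 2, "U3": 2}
)
```

Here the first two agents receive disjoint sets, and overlap only appears once the total number of requested objects exceeds the size of T.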


1. Data Allocation Module:

The main focus of our project is the data allocation problem: how can the distributor “intelligently” give data to agents in order to improve the chances of detecting a guilty agent?

2. Fake Object Module:

Fake objects are objects generated by the distributor in order to increase the chances of detecting agents that leak data. The distributor may be able to add fake objects to the distributed data in order to improve his effectiveness in detecting guilty agents. Our use of fake objects is inspired by the use of “trace” records in mailing lists.
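
A minimal sketch of the idea, assuming every agent receives the same real records plus a few fake records unique to that agent. The record format, the function names, and the number of fakes per agent are all hypothetical.

```python
import secrets

def distribute(real_records, agents, fakes_per_agent=2):
    """Give each agent the real records plus fake records unique to that agent."""
    fake_owner = {}  # fake record -> agent that received it
    released = {}    # agent -> full set of records handed over
    for agent in agents:
        fakes = set()
        while len(fakes) < fakes_per_agent:
            candidate = f"customer-{secrets.token_hex(4)}"
            # Skip the unlikely case of colliding with an existing record
            if candidate not in fake_owner and candidate not in real_records:
                fakes.add(candidate)
        for fake in fakes:
            fake_owner[fake] = agent
        released[agent] = set(real_records) | fakes
    return released, fake_owner

def suspects(leaked_records, fake_owner):
    """Return the agents whose fake records appear in the leaked data."""
    return {fake_owner[r] for r in leaked_records if r in fake_owner}

released, fake_owner = distribute({"customer-alice", "customer-bob"}, ["U1", "U2"])
flagged = suspects(released["U1"], fake_owner)  # pretend U1's copy leaked
```

A leaked fake record points to the agent that received it, while a leak containing only real records gives no such signal.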

3. Optimization Module:

The distributor’s data allocation to agents has one constraint and one objective. The constraint is to satisfy agents’ requests, by providing them with the number of objects they request or with all available objects that satisfy their conditions. The objective is to be able to detect an agent who leaks any portion of the data.

4. Data Distributor:

A data distributor has given sensitive data to a set of supposedly trusted agents (third parties). Some of the data is leaked and found in an unauthorized place (e.g., on the web or somebody’s laptop). The distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means.

Hardware Required:

System : Pentium IV 2.4 GHz

Hard Disk : 40 GB

Floppy Drive : 1.44 MB

Monitor : 15 VGA colour

Mouse : Logitech.

Keyboard : 110 keys enhanced.

RAM : 256 MB

Software Required:

O/S : Windows XP.

Language : ASP.NET, C#.

Data Base : SQL Server 2005

How to fix the problem of Data Leakage?

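In machine learning, “data leakage” has a related but different meaning: information that would not be available at prediction time finds its way into the training process. The main culprit is how and when we split our dataset. The following steps can be crucial in preventing it: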

Idea-1 (Extracting the appropriate set of Features)

To fix the problem of data leakage, the first method we can try is to extract the appropriate set of features for a machine learning model. While choosing features, we should be suspicious of any feature that is almost perfectly correlated with the target variable, and we should make sure that no feature contains information about the target that is not naturally available at the time of prediction.
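
One rough way to screen for such features is to flag any column whose correlation with the target is suspiciously close to 1. The sketch below uses plain Python; the 0.95 threshold, the function names, and the example columns are assumptions, and a flagged feature still needs a manual check of when its value becomes known.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (dev_x * dev_y)

def flag_suspicious(features, target, threshold=0.95):
    """Return names of features that track the target almost perfectly."""
    return [
        name for name, column in features.items()
        if abs(pearson(column, target)) >= threshold
    ]

# Hypothetical data: "refund_issued" is only recorded after the outcome is known
target = [0, 1, 0, 1, 1, 0]
features = {
    "age": [25, 40, 31, 38, 52, 29],
    "refund_issued": [0, 1, 0, 1, 1, 0],
}
suspicious = flag_suspicious(features, target)
```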

Idea-2 (Create a Separate Validation Set)

To minimize or avoid the problem of data leakage, we should set aside a validation set in addition to the training and test sets if possible. The validation set mimics the real-life scenario: it is kept out of all training and tuning and is used only as a final step. This helps identify overfitting, which in turn acts as a warning against deploying models that are likely to underperform in the production environment.
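
A minimal sketch of a three-way split in plain Python. The 60/20/20 proportions, the fixed seed, and the function name are assumptions chosen for illustration.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle once, then cut the data into three disjoint subsets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

# Hypothetical data set of 100 row identifiers
train, val, test = train_val_test_split(range(100))
# The validation set is held back and touched only for the final check
```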

Idea-3 (Fit Data Preprocessing on the Training Set Only)

When working with neural networks, it is common practice to normalize the input data before feeding it into the model. A typical normalization subtracts the mean and divides by the standard deviation. More often than not, these statistics are computed on the overall data set, so information from the test set influences the training set, which results in data leakage. Hence, to avoid data leakage, we should split the data first, compute the normalization parameters on the training subset only, and then apply those same parameters to the test subset.
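
A common way to avoid this leak is to compute the normalization statistics on the training set only and reuse them for the test set. The sketch below uses the standard library; the function names and the example values are hypothetical.

```python
import statistics

def fit_scaler(train_values):
    """Compute mean and standard deviation from the training data only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return mean, (std or 1.0)  # avoid dividing by zero for constant data

def transform(values, mean, std):
    """Apply previously fitted statistics to any subset."""
    return [(v - mean) / std for v in values]

train = [10.0, 12.0, 14.0, 16.0]
test = [100.0, 18.0]  # the outlier must not influence the statistics

mean, std = fit_scaler(train)              # fitted on train only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)   # reuses the train statistics
```

Had the statistics been computed on train and test together, the test outlier of 100.0 would have shifted the mean used to scale the training data.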

Idea-4 (Time-Series Data)

Problem with the Time-Series Type of data:

When dealing with time-series data, we should pay even more attention to data leakage. For example, if we use data from the future when computing current features or predictions, we are very likely to end up with a leaky model. This generally happens when the data is randomly split into train and test subsets.

So, when working with time-series data, we set a cutoff time: only data from before the cutoff is used for training, which prevents the model from seeing any information from after the time of prediction.
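
A cutoff-based split might look like the sketch below. The row layout with a "day" field, the function name, and the cutoff date are assumptions for illustration.

```python
from datetime import date

def split_by_cutoff(rows, cutoff):
    """Train on everything strictly before the cutoff, test on the rest."""
    train = [r for r in rows if r["day"] < cutoff]
    test = [r for r in rows if r["day"] >= cutoff]
    return train, test

# Hypothetical daily observations for the first ten days of January 2024
rows = [{"day": date(2024, 1, d), "value": d * 1.5} for d in range(1, 11)]
train, test = split_by_cutoff(rows, cutoff=date(2024, 1, 8))
```

Unlike a random split, every training row here precedes every test row in time.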

Idea-5 (Cross-Validation)

When we have a limited amount of data to train our machine learning algorithm, it is good practice to use cross-validation in the training process. Cross-validation splits the complete data into k folds and iterates over the dataset k times, each time using k-1 folds for training and 1 fold for testing the model.

The advantage of this approach is that the entire dataset is used for both training and testing. However, if you suspect data leakage, it is better to compute the scaling or normalization parameters separately within each fold of cross-validation, using only the training portion of that fold.
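
The per-fold scaling can be sketched as follows, assuming a simple interleaved fold assignment and mean and standard deviation scaling. The function names are hypothetical and no model is trained here.

```python
import statistics

def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, folds[i]

def scaled_folds(values, k):
    """Scale each fold using statistics from that fold's training part only."""
    for train_idx, test_idx in k_fold_indices(len(values), k):
        train = [values[j] for j in train_idx]
        mean = statistics.fmean(train)
        std = statistics.pstdev(train) or 1.0
        train_scaled = [(v - mean) / std for v in train]
        test_scaled = [(values[j] - mean) / std for j in test_idx]
        yield train_scaled, test_scaled

values = [float(v) for v in range(1, 11)]
folds = list(scaled_folds(values, k=5))
```

Each iteration refits the scaler, so the held-out fold never contributes to the statistics used to scale its own training data.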

Data Leakage Prevention

The techniques and technologies used to prevent data leaks are mostly the same as those used to prevent data breaches. Most data loss prevention strategies start with carrying out risk assessments (including third-party risk assessments) and defining policies and procedures based on those assessments. However, in order to carry out a risk assessment, you must first understand what data you have, and where it is located.

1. Data discovery and classification

Use a solution which can automatically discover and classify your sensitive data. Once you have done this, carefully remove any ROT (Redundant, Obsolete and Trivial) data to help streamline your data protection strategy. Classifying your data will make it easier to assign the appropriate controls and keep track of how users interact with your sensitive data.

2. Restrict access rights

As always, it’s a good idea to limit the number of users who have access to sensitive data, as this will reduce the risk of data leakage.

3. Email content filtering

Use a content filtering solution that uses deep content inspection technology to find sensitive data in text, images and attachments in emails. If sensitive data is found, it will send an alert to the administrator, who can verify the legitimacy of the transfer.
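
As a very rough illustration of content inspection, the sketch below scans message text for two patterns. The patterns, their names, and the function name are assumptions; real content filtering products inspect far more deeply, for example by validating checksums and parsing attachments and images.

```python
import re

# Deliberately simple, hypothetical patterns for illustration only
PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_sensitive(text):
    """Return the names of all patterns found in the text, sorted."""
    return sorted(
        name for name, pattern in PATTERNS.items() if pattern.search(text)
    )

hits = find_sensitive("Please charge card 4111 1111 1111 1111 today.")
if hits:
    # A real system would alert the administrator here
    alert = f"Possible sensitive data in outgoing email: {', '.join(hits)}"
```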

4. Controlling print

Sensitive files can be stored on printers that may be accessed by an unauthorised party. Ask users to sign in to access the printer, limit the functionality of the printer based on their role and ensure that documents containing sensitive data can only be printed once. You will also need to make sure that users don’t leave any printed documents containing sensitive data in the printer tray.

5. Encryption

It’s always a good idea to encrypt sensitive data both at rest and in transit. This is especially relevant when storing sensitive data in the cloud.

6. Endpoint protection

A Data Loss Prevention (DLP) solution can be used to prevent endpoints (desktops, laptops, mobiles, servers) from leaking sensitive data. Some DLP solutions can automatically block, quarantine or encrypt sensitive data as it leaves an endpoint. A DLP solution can also be used to restrict certain functions, such as copy, print, or the transferring of data to a USB drive or cloud storage platform.

7. Device control

It is common for users to store sensitive documents on their smartphones and tablets. In addition to device management policies, you will need a solution which monitors and controls what devices are being used, and by whom. You will also need to use Mobile Device Management (MDM) software, as this will make it easier for security teams to enforce the use of complex passwords, service the device remotely and control which applications can be installed on the device. Most MDM solutions can also track the location of the device and even wipe the contents of the device if it gets lost or stolen.

8. Cloud storage configuration

Data leaks caused by misconfigured storage repositories are common. For example, many data breaches were reportedly caused by Amazon S3 buckets that had been configured to allow public access. Likewise, GitHub repositories and Azure file shares have also been known to expose data when they are not configured correctly. As such, it is crucially important to have a formalized process for validating the configuration of any cloud storage repositories you use.

9. Real-time auditing and reporting

Arguably one of the most effective ways to prevent data leakage is to keep track of changes made to your sensitive data. Administrators should have an immutable record of who has access to what data, what actions were performed, and when. The administrators should be informed (in real-time) when sensitive data is accessed, moved, shared, modified or removed in a suspicious manner or by an unauthorized party. This can be especially useful for monitoring access to sensitive data stored in the cloud. If an alert is raised, the administrator can launch an investigation into the issue – perhaps starting off by verifying the permissions of the storage container.
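
One way to approximate an immutable record is a hash-chained log, in which every entry stores the hash of the previous one so that later edits become detectable. The sketch below is tamper-evident rather than truly immutable, and the field names, function names, and example events are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry

def append_event(log, user, action, resource, timestamp):
    """Append an audit record linked to the previous record by its hash."""
    record = {
        "user": user,
        "action": action,
        "resource": resource,
        "timestamp": timestamp,
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["hash"] = hashlib.sha256(payload).hexdigest()
    log.append(record)

def verify(log):
    """Return True only if no record has been altered or reordered."""
    prev_hash = GENESIS
    for record in log:
        body = {k: v for k, v in record.items() if k != "hash"}
        payload = json.dumps(body, sort_keys=True).encode()
        expected = hashlib.sha256(payload).hexdigest()
        if record["prev_hash"] != prev_hash or record["hash"] != expected:
            return False
        prev_hash = record["hash"]
    return True

log = []
append_event(log, "alice", "read", "payroll.xlsx", "2024-01-01T09:00:00Z")
append_event(log, "bob", "share", "payroll.xlsx", "2024-01-01T09:05:00Z")
```

Changing any field of an earlier record makes verify return False, which gives administrators a simple integrity check before they rely on the log in an investigation.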

10. Security awareness training

As mentioned previously, data leaks are often caused by negligent employees. The reality is, people make mistakes. Such mistakes might include emailing sensitive data to the wrong recipient, losing a USB drive, or leaving a printed document containing sensitive data in the printer tray. The most effective way to reduce the number of mistakes that employees make is to ensure that they are well informed about data security best practices. Having an intuitive classification schema, such as public, internal and restricted, will help employees determine how certain types of data should be handled.

The Tech Platform
