What is Big Data?

Ever wondered how much data tech giants like Facebook and Twitter generate?

According to a recent report, Facebook generates about 4 petabytes of data per day. That's four million gigabytes. The total adds up to about 120 petabytes in a month and more than an exabyte (1,000 petabytes) in a year.
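The arithmetic behind these figures can be checked with a short sketch. It assumes decimal (SI) units, a 30-day month, and a 365-day year; the variable names are illustrative only.

```python
# Decimal (SI) units: 1 PB = 1,000,000 GB and 1 EB = 1,000 PB.
GB_PER_PB = 1_000_000
PB_PER_EB = 1_000

daily_pb = 4  # daily figure quoted in the article

daily_gb = daily_pb * GB_PER_PB       # gigabytes per day
monthly_pb = daily_pb * 30            # assumes a 30-day month
yearly_pb = daily_pb * 365            # assumes a 365-day year
yearly_eb = yearly_pb / PB_PER_EB     # convert petabytes to exabytes

print(f"Per day:   {daily_gb:,} GB")
print(f"Per month: {monthly_pb} PB")
print(f"Per year:  {yearly_pb:,} PB = {yearly_eb:.2f} EB")
```

At 4 petabytes a day, the yearly total works out to roughly one and a half exabytes, which matches the "more than an exabyte" claim.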

This huge chunk of data can be collectively referred to as Big Data.

“Big data is a term that describes the large volume of data — both structured and unstructured — that inundates a business on a day-to-day basis.”

Characteristics of Big Data

Some of the characteristics of these data sets include the following:

1. Volume

The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can be considered big data or not.

2. Variety

The type and nature of the data. Earlier technologies like RDBMSs were capable of handling structured data efficiently and effectively. However, the shift from structured to semi-structured or unstructured data (variety) challenged the existing tools and technologies. Big Data technologies evolved with the prime intention of capturing, storing, and processing semi-structured and unstructured data that is generated at high speed (velocity) and is huge in size (volume).

3. Velocity

The speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real time. Compared to small data, big data is produced more continuously. Two kinds of velocity related to big data are the frequency of generation and the frequency of handling, recording, and publishing.

4. Veracity

Veracity is an extension of the original definition of big data and refers to data quality and data value. The quality of captured data can vary greatly, which affects the accuracy of the analysis.

The Problem

Volume/Size Problem

There has been a huge explosion in the data available. Compare today with a few years back, and there has been an exponential increase in the data that enterprises can access. They have data for everything, from what a consumer likes, to how they react to a particular scent, to the amazing restaurant that opened up in Italy last weekend.

The volume of this data now exceeds what many organizations can store, process, and retrieve. The challenge is not so much the availability of the data as its management. With one widely quoted estimate claiming that by 2020 the world's data, stored on a stack of tablets, would reach 6.6 times the distance between the Earth and the Moon, this is definitely a challenge.

Velocity/Speed Problem

The term “data” is not limited to the “stagnant” data that is already stored and readily available. A lot of data is updated every second, and organizations need to be aware of that too. For instance, if a retail company wants to analyze customer behavior, real-time data from customers' current purchases can help.

It is important for businesses to keep up with this fast-changing data, along with the “stagnant,” always-available data. This will help build better insights and enhance decision-making capabilities.

Variety Problem

In addition to volume and velocity, variety is fast becoming a third big data “V-factor.” The problem is especially prevalent in large enterprises, which have many systems of record and also an abundance of data under management that is structured, semi-structured, and unstructured. These enterprises often have multiple purchasing, manufacturing, sales, finance, and other departmental functions in separate subsidiaries and branch facilities, and they end up with “siloed” systems because of the functional duplication.
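To make the three kinds of data concrete, here is a minimal Python sketch. The sample records (an order row, a JSON document, and a customer comment) are made up for illustration and do not come from any real system.

```python
import csv
import io
import json

# Structured: fixed columns, the kind of data an RDBMS table holds.
structured = "order_id,customer,amount\n1001,alice,25.50\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing, but fields can vary from record to record.
semi_structured = '{"order_id": 1002, "customer": "bob", "tags": ["gift"]}'
record = json.loads(semi_structured)

# Unstructured: free text with no schema at all.
unstructured = "Loved the quick delivery, but the box arrived damaged."
words = unstructured.lower().replace(",", "").replace(".", "").split()

print(rows[0]["amount"])    # structured data is addressed by column name
print(record.get("tags"))   # semi-structured fields may or may not be present
print(len(words))           # unstructured text needs parsing before analysis
```

Each format needs a different access pattern, which is one reason a single tool built for structured tables struggles once all three kinds of data have to be handled together.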

Veracity Problem

Data veracity, in general, is how accurate or truthful a data set may be. In the context of big data, however, it takes on a bit more meaning. More specifically, when it comes to the accuracy of big data, it's not just the quality of the data itself but how trustworthy the data source, type, and processing are. Removing bias, abnormalities, inconsistencies, duplication, and volatility are just a few of the steps that factor into improving the accuracy of big data.

The second side of data veracity entails ensuring that the processing method applied to the data makes sense for the business needs and that the output is pertinent to objectives. This is especially important when combining primary market research with big data. Interpreting big data in the right way ensures results are relevant and actionable. Further, access to big data means you could spend months sorting through information without focus and without a method of identifying which data points are relevant. Data therefore has to be analyzed in a timely manner, which is difficult with big data; otherwise the insights will no longer be useful.
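A simple way to picture the cleanup work described above (removing abnormalities, inconsistencies, and duplication) is the following Python sketch. The record fields, the validity rules, and the function name `clean_records` are all assumptions made for illustration, not a standard recipe.

```python
def clean_records(records):
    """Normalize, validate, and de-duplicate a list of record dicts."""
    seen = set()
    cleaned = []
    for rec in records:
        # Fix inconsistencies: trim whitespace and normalize letter case.
        email = (rec.get("email") or "").strip().lower()
        age = rec.get("age")
        # Drop abnormal records: malformed email or implausible age.
        if "@" not in email or not isinstance(age, int) or not 0 < age < 120:
            continue
        # Drop duplicates, keyed on the normalized email.
        if email in seen:
            continue
        seen.add(email)
        cleaned.append({"email": email, "age": age})
    return cleaned


# Hypothetical raw input with typical quality problems.
raw = [
    {"email": "Ann@Example.com ", "age": 34},
    {"email": "ann@example.com", "age": 34},   # duplicate after normalization
    {"email": "bob@example.com", "age": 430},  # implausible age
    {"email": "not-an-email", "age": 29},      # malformed email
    {"email": "cy@example.com", "age": 51},
]

cleaned = clean_records(raw)
print(cleaned)
```

Of the five raw records, only two survive the checks. At big data scale the same idea applies, but the rules have to run across many machines and many data sources.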

Managing Big Data the Hadoop Way

What is Hadoop?

“Hadoop, an open-source software framework, uses HDFS (the Hadoop Distributed File System) and MapReduce to analyze big data on clusters of commodity hardware — that is, in a distributed computing environment.”

The Hadoop Distributed File System (HDFS) was developed to allow companies to more easily manage huge volumes of data in a simple and pragmatic way. Hadoop allows big problems to be decomposed into smaller elements so that analysis can be done quickly and cost effectively. HDFS is a versatile, resilient, clustered approach to managing files in a big data environment.

HDFS is not the final destination for files. Rather it is a data “service” that offers a unique set of capabilities needed when data volumes and velocity are high.

MapReduce is a software framework that enables developers to write programs that can process massive amounts of unstructured data in parallel across a distributed group of processors. MapReduce was designed by Google as a way of efficiently executing a set of functions against a large amount of data in batch mode.

In the “map” step, the programming problem is broken into tasks that are distributed across a large number of systems, with the placement of the tasks handled in a way that balances the load and manages recovery from failures. After the distributed computation is completed, another function called “reduce” aggregates all the elements back together to provide a result.
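The map and reduce idea can be sketched with the classic word-count example. This is a single-process Python illustration, not Hadoop's actual API: in a real cluster each chunk would be processed by a map task on a different node, and the framework itself would perform the grouping step (called `shuffle` here, a name chosen for this sketch) between the two phases.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word, 1) for word in chunk.lower().split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate all values for one key into a single result."""
    return key, sum(values)


# Each string stands in for a block of a large file stored on a different node.
chunks = ["big data is big", "hadoop handles big data"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())

print(counts)
```

Because each call to `map_phase` depends only on its own chunk, the map work can run in parallel on as many machines as there are chunks, which is what makes the model suit very large data sets.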

Companies that use Hadoop

Various multinational corporations around the world use Hadoop to manage their Big Data. Let's look at some of them:


Amazon uses Elastic MapReduce (EMR) and Elastic Compute Cloud (EC2)

Amazon Web Services (AWS), the cloud computing arm of the e-commerce giant Amazon, simplifies big data processing and analytics through its Elastic MapReduce web service. EMR provides a managed Hadoop framework that offers an easy, fast, and cost-effective mechanism to distribute and process vast amounts of data across Amazon EC2 instances. The major functions performed by Hadoop in Amazon Web Services include log analysis, data warehousing, web indexing, financial analysis, machine learning, scientific simulation, and bioinformatics.


Facebook uses Hadoop and Hive

According to some experts and company professionals, Hadoop is used across every Facebook product and in a variety of ways. User actions such as ‘like,’ ‘status update,’ or ‘add comment’ are stored in a heavily customized, distributed MySQL database. Similarly, the Facebook Messenger application runs on HBase, and all the messages sent and received on Facebook are stored in HBase.

In addition, all of the external advertisers' and developers' campaigns and applications running on this social media platform use Hive to generate their success reports. Facebook has built a higher-level data warehousing infrastructure using features of Hive that help in querying the data with an SQL-like language, HiveQL.


Adobe uses Apache HBase and Apache Hadoop

Adobe currently has 30 nodes running HDFS, HBase, and Hadoop in clusters ranging from 5 to 14 nodes for its production and development operations, and it is planning a deployment on a cluster of around 80 nodes. Adobe is a well-known international enterprise whose products and services are used worldwide; one of its units is the Digital Marketing business unit. The Hadoop ecosystem has been deployed on Adobe's VMware vSphere for several Adobe users. By using existing servers, this deployment has reduced both costs and the time needed to gain insight from data.

Source: Medium