How Do Big MNCs like Google, Facebook and Instagram Store, Manage and Manipulate Thousands of Terabytes of Data with High Speed and High Efficiency?

Did you know? The New York Stock Exchange generates about one terabyte of new trade data per day. More than 500 terabytes of new data are ingested into the databases of the social media site Facebook every day. A single jet engine can generate more than 10 terabytes of data in 30 minutes of flight time. So, how do these companies store and manage all this data?

Before looking at the solution, let us discuss the problem, i.e., Big Data.

What is Big Data?

Types Of Big Data

  1. Structured
  2. Unstructured
  3. Semi-structured

Structured:- Any data that can be stored, accessed and processed in a fixed format is termed ‘structured’ data. Over time, computer science has developed effective techniques for working with this kind of data (where the format is well known in advance) and for deriving value from it. However, issues now arise when the size of such data grows to a huge extent, with typical sizes in the range of multiple zettabytes.

Data stored in a relational database management system is one example of ‘structured’ data.

Example:- An ‘Employee’ table in a database is an example of structured data.
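To make the idea of a fixed format concrete, here is a minimal sketch of such an ‘Employee’ table using Python’s built-in sqlite3 module. The table name, the column names and the three sample rows are made up purely for illustration; they are assumptions and do not come from any real system.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema is fixed in advance: every row must have these four columns.
# (Table and column names are hypothetical.)
conn.execute(
    "CREATE TABLE employee ("
    " employee_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " department TEXT NOT NULL,"
    " salary INTEGER NOT NULL)"
)

# Made-up sample rows, for illustration only.
rows = [
    (1, "Asha", "Finance", 65000),
    (2, "Bilal", "Engineering", 82000),
    (3, "Chen", "Finance", 71000),
]
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", rows)

# Because the format is known, querying is straightforward.
cursor = conn.execute(
    "SELECT name FROM employee WHERE department = ? ORDER BY name",
    ("Finance",),
)
finance_names = [name for (name,) in cursor.fetchall()]
print(finance_names)
```

Since every row follows the same schema, the database can answer a question like "who works in Finance?" without having to guess where each value lives.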

Unstructured:- Any data with an unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges when it is processed to derive value from it. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays organizations have a wealth of data available to them but, unfortunately, they don’t know how to derive value from it since this data is in its raw or unstructured form.

Example:- The output returned by ‘Google Search’.

Semi-structured:- Semi-structured data can contain both forms of data. It looks structured in form, but it is not actually defined by, e.g., a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file.

Example:- Personal data stored in an XML file.

<rec><name>Ananya Sharma</name><sex>Female</sex><age>18</age></rec>
<rec><name>Anand Sharma</name><sex>Male</sex><age>41</age></rec>
<rec><name>Atul Dixit</name><sex>Male</sex><age>29</age></rec>
<rec><name>Kartik Sharma</name><sex>Male</sex><age>26</age></rec>
<rec><name>Deepshika</name><sex>Female</sex><age>35</age></rec>
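As a rough sketch of how such records could be read, the snippet below parses the five records above with Python’s standard xml.etree.ElementTree module. The wrapping `<people>` element is an assumption added only because an XML document needs a single root; the variable names are likewise illustrative.

```python
import xml.etree.ElementTree as ET

# The five records from the example above.
records_xml = """
<rec><name>Ananya Sharma</name><sex>Female</sex><age>18</age></rec>
<rec><name>Anand Sharma</name><sex>Male</sex><age>41</age></rec>
<rec><name>Atul Dixit</name><sex>Male</sex><age>29</age></rec>
<rec><name>Kartik Sharma</name><sex>Male</sex><age>26</age></rec>
<rec><name>Deepshika</name><sex>Female</sex><age>35</age></rec>
"""

# XML needs exactly one root element, so wrap the records in a
# hypothetical <people> tag before parsing.
root = ET.fromstring("<people>" + records_xml + "</people>")

# The tags describe each value, even though no table schema exists.
people = [
    {
        "name": rec.findtext("name"),
        "sex": rec.findtext("sex"),
        "age": int(rec.findtext("age")),
    }
    for rec in root.findall("rec")
]

average_age = sum(p["age"] for p in people) / len(people)
print(len(people), "records, average age", average_age)
```

The tags give the data enough structure to be processed, yet nothing stops one record from having extra or missing fields, which is what separates it from a relational table.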

Characteristics Of Big Data

(i) Volume:- The name Big Data itself relates to an enormous size. The size of data plays a crucial role in determining the value that can be derived from it. Whether a particular data set can actually be considered Big Data also depends on its volume. Hence, ‘volume’ is one characteristic that needs to be considered when dealing with Big Data.

(ii) Variety:- The next aspect of Big Data is its variety. Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analyzing data.

(iii) Velocity:- The term ‘velocity’ refers to the speed at which data is generated. How fast the data is generated and processed to meet demand determines the real potential in the data. Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.

(iv) Veracity:- This refers to the inconsistency that data can show at times, which hampers the ability to handle and manage the data effectively.

Now, let us talk about the solution. The solution is a distributed system.

What is a Distributed System?

A distributed system is a computer network that consists of independent components communicating through a decentralized protocol. The computers in a distributed system are not necessarily constrained to the same geographic locale, but rather can be located anywhere across the world.

Master-Slave Model

Master/slave is a model of communication for hardware devices where one device has unidirectional control over one or more other devices. It is often used in electronic hardware, where one device acts as the controller and the other devices are the ones being controlled. In short, one is the master and the others are slaves controlled by the master. The most common example of this is the master/slave configuration of IDE disk drives attached to the same cable, where the master is the primary drive and the slave is the secondary drive.
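The one-way control described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class names `Master` and `Worker`, the round-robin dispatch and the command strings are all hypothetical and are not taken from any real hardware protocol.

```python
class Worker:
    """A controlled device: it only executes what it is told."""

    def __init__(self, name):
        self.name = name
        self.log = []  # commands this device has executed, in order

    def execute(self, command):
        self.log.append(command)


class Master:
    """The controller: it decides which device runs which command."""

    def __init__(self, workers):
        self.workers = workers

    def dispatch(self, commands):
        # Control flows one way, from the master to the workers.
        # Commands are handed out in round-robin order (an assumption).
        for index, command in enumerate(commands):
            target = self.workers[index % len(self.workers)]
            target.execute(command)


primary = Worker("primary")
secondary = Worker("secondary")
master = Master([primary, secondary])
master.dispatch(["spin up", "read block 7", "spin down"])

print(primary.log)
print(secondary.log)
```

Note that the workers never call back into the master; that one-directional flow is the defining property of the model.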

We can use Hadoop to build this kind of system, i.e., a distributed system.

What is Hadoop?

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

Hadoop HDFS:- Data is stored in a distributed manner in HDFS. HDFS has two components: the name node and the data nodes. While there is only one active name node, there can be multiple data nodes.
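The split between one name node and many data nodes can be illustrated with a small toy model. This is a sketch of the idea only, not HDFS code: the class names, the tiny block size, the round-robin placement and the file name `trades.log` are all assumptions chosen to keep the example short.

```python
from itertools import cycle


def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class DataNode:
    """Stores the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes


class NameNode:
    """Stores only metadata: which data node holds which block."""

    def __init__(self, data_nodes):
        self.data_nodes = {node.name: node for node in data_nodes}
        self.block_map = {}  # file name -> list of (block id, node name)

    def write(self, filename, data, block_size):
        placements = []
        nodes = cycle(self.data_nodes.values())  # round-robin (assumption)
        for index, block in enumerate(split_into_blocks(data, block_size)):
            node = next(nodes)
            block_id = f"{filename}#{index}"
            node.blocks[block_id] = block
            placements.append((block_id, node.name))
        self.block_map[filename] = placements

    def read(self, filename):
        # Look up the metadata, then fetch each block from its data node.
        return b"".join(
            self.data_nodes[node_name].blocks[block_id]
            for block_id, node_name in self.block_map[filename]
        )


name_node = NameNode([DataNode("dn1"), DataNode("dn2"), DataNode("dn3")])
payload = b"0123456789" * 2 + b"abc"  # 23 bytes of dummy data
name_node.write("trades.log", payload, block_size=8)

print(name_node.block_map["trades.log"])
print(name_node.read("trades.log") == payload)
```

The name node never holds file contents here, only the map of where each block lives. That is why a single name node can coordinate many data nodes.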

HDFS is specially designed for storing huge datasets on commodity hardware. An enterprise-grade server costs roughly $10,000 per terabyte. If you need to buy 100 of these enterprise servers, assuming about one terabyte each, the cost goes up to a million dollars (100 × $10,000).

Hadoop enables you to use commodity machines as your data nodes. This way, you don’t have to spend millions of dollars just on your data nodes. However, the name node is typically run on an enterprise-grade server, since it holds the metadata for the whole file system.

Features of HDFS

  • Ability to store and process huge amounts of any kind of data, quickly:- With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that’s a key consideration.
  • Computing power:- Hadoop’s distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
  • Fault tolerance:- Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.
  • Flexibility:- Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.
  • Low cost:- The open-source framework is free and uses commodity hardware to store large quantities of data.
  • Scalability:- You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
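The fault tolerance point above rests on keeping multiple copies of each block on different nodes. The snippet below is a minimal sketch of that idea, assuming a replication factor of 2, three nodes and round-robin placement; the function names and block ids are hypothetical, and real clusters use more elaborate placement rules.

```python
def place_replicas(block_ids, node_names, replication):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(node_names):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for index, block_id in enumerate(block_ids):
        placement[block_id] = [
            node_names[(index + offset) % len(node_names)]
            for offset in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one copy on a live node."""
    return {
        block_id
        for block_id, nodes in placement.items()
        if any(node not in failed_nodes for node in nodes)
    }


node_names = ["dn1", "dn2", "dn3"]
block_ids = ["b0", "b1", "b2", "b3"]
placement = place_replicas(block_ids, node_names, replication=2)

# One node down: every block still has a surviving copy.
after_one_failure = readable_blocks(placement, failed_nodes={"dn2"})

# Two nodes down: blocks whose copies were both on failed nodes are lost.
lost_after_two = set(block_ids) - readable_blocks(placement, {"dn1", "dn2"})

print(sorted(after_one_failure))
print(sorted(lost_after_two))
```

With two copies per block the toy cluster survives any single node failure, but not every double failure. Raising the replication factor trades extra storage for more resilience.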

Thanks for reading :)