- Introduction to Big Data
- What is Big Data
- Types of data
- 6 Vs of big data
- Challenges in Big data
- Big Data Technologies
- Hadoop Introduction
- Distributed Computing
- So why Hadoop?
- Challenges of SuperComputing
- Hadoop History
- Hadoop Framework
Big data – Introduction
We will start with questions like: What is big data? Why big data? What does big data offer that is driving companies and industries to move away from legacy systems? Is it worth learning big data technologies, and will we be paid well as professionals? Why, why, why?
What is Big data?
As the name implies, big data is data of huge size. We receive large amounts of data in different forms from different sources, both human and machine generated, characterized by high volume, velocity, and variety.
Since we are talking about data, let us look at the types of data to understand the logic behind big data.
Types of data:
Data can be classified into three types:
Structured data: Data represented in tabular form, which can be stored, accessed, and processed in a fixed format. Ex: databases, tables
Semi-structured data: Data which does not conform to a formal data model but still carries some structure. Ex: XML files
Unstructured data: Data which does not have a pre-defined data model. Ex: text files, web logs
Let us dig on 6 Vs of big data:
Volume: The amount of data from various sources, now measured in TB, PB, ZB, etc. Data sizes keep rising; we are well past the GB scale.
Velocity: The speed at which big data is generated, e.g. high-frequency stock data.
Veracity: Refers to the biases, noise, and abnormality in data.
Variety: Refers to the different forms of data. Data can come in various forms and shapes, such as visual data (pictures and videos), log data, etc. For most businesses this is the biggest problem to handle.
Variability: To what extent, and how fast, is the structure of your data changing? How often does the meaning or shape of your data change?
Value: Describes what value you can extract from the data, and how big data can produce better results from stored data.
Challenges in Big data:
Complex: No proper understanding of the underlying data
Storage: How to store amounts of data too large for a single physical machine
Performance: How to process large amounts of data efficiently and effectively so that performance keeps up
Big Data Technologies:
Big Data is a broad field surrounded by many trends and new technology developments. The top emerging technologies below help users cope with and handle Big Data in a cost-effective manner.
1. Apache Hadoop
2. Apache Spark
3. Apache Hive
There are many other technologies, but we will learn about the above 3 in detail.
Hadoop is a distributed parallel processing framework, which facilitates distributed computing.
To dig deeper into Hadoop, we first need an understanding of “distributed computing”; it is the root idea behind Hadoop.
In simple English, distributed computing is also called parallel processing. Let’s take an example: say we have the task of painting a room in our house, and we hire a painter who takes approximately 2 hours to paint one surface. With 4 walls and 1 ceiling to be painted, it may take one day (~10 hours) for one man to finish, if he works non-stop.
If instead 5 people share the work, one surface each, the same task finishes in about 2 hours. This simple real-world problem illustrates the logic behind distributed computing.
Now let’s take an actual data related problem and analyse the same.
Suppose we have an input file of, say, 1 GB of numbers and we need to calculate their sum; the operation may take 50 seconds to produce the result.
Now divide the dataset into 2 parts and feed each half to a different machine: the operation may take only about 25 seconds to produce the same sum.
This is the fundamental idea of parallel processing.
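The split-and-sum idea above can be sketched on a single machine with worker processes standing in for the separate machines. This is an illustrative sketch, not part of any Hadoop API; the function names (`partial_sum`, `parallel_sum`) are made up for this example.

```python
# Sketch of the parallel-sum idea: split the input into chunks and
# sum each chunk in a separate worker process, then combine the results.
from multiprocessing import Pool

def partial_sum(chunk):
    """Sum one slice of the data (the work given to one 'machine')."""
    return sum(chunk)

def parallel_sum(numbers, workers=2):
    """Split `numbers` into `workers` parts and sum the parts in parallel."""
    size = len(numbers) // workers
    chunks = [numbers[i * size:(i + 1) * size] for i in range(workers - 1)]
    chunks.append(numbers[(workers - 1) * size:])  # last chunk takes the remainder
    with Pool(workers) as pool:
        # Each partial sum is computed concurrently, then combined.
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    data = list(range(1_000_000))
    assert parallel_sum(data, workers=2) == sum(data)
```

On real big data the chunks live on different physical machines, and combining the partial results is the job of the framework; that is exactly the role Hadoop plays.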
So why Hadoop?
The idea of parallel processing is not something new!
It has existed since the era of supercomputers, back in the 1970s.
Building supercomputers required an army of network engineers and extensive cabling, and a few research organizations still use this kind of infrastructure, known as “supercomputers”.
Let’s see what the challenges of supercomputing were.
• A general-purpose, operating-system-like framework for parallel computing needs did not exist
• Companies procuring supercomputers were locked to specific vendors for hardware support
• High initial cost of the hardware
• Custom software had to be developed for individual use cases
• High cost of software maintenance and upgrades, which had to be handled in-house by the organizations using a supercomputer
• Not simple to scale horizontally
There had to be a better way!
HADOOP comes to the rescue
• A general-purpose, operating-system-like framework for parallel computing needs
• Free, open-source software with free upgrades
• Opens up the power of distributed computing to a wider audience
• Mid-sized organizations need not be locked to specific vendors for hardware support: Hadoop works on commodity hardware
• Organizations no longer have to write proprietary software for each use case
Data is everywhere. People upload videos, take pictures, use several apps on their phones, search the web, and more. Machines, too, are generating and keeping more and more data. Existing tools are incapable of processing such large data sets, so Hadoop, and large-scale distributed data processing in general, is rapidly becoming an important skill set for many programmers. Hadoop is an open-source framework for writing and running distributed applications that process large amounts of data. This course introduces Hadoop both as a distributed system and as a data processing system. It gives an overview of the MapReduce programming model using a simple word-counting example, along with existing tools that highlight the challenges of processing data at large scale, and then implements the example using Hadoop to build a deeper appreciation of its simplicity.
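To make the MapReduce word-count idea concrete before touching Hadoop itself, here is a minimal single-machine sketch. The function names (`map_phase`, `reduce_phase`, `word_count`) are illustrative, not Hadoop APIs; in real Hadoop the map and reduce steps run on many nodes, with a shuffle phase in between.

```python
# Single-machine sketch of MapReduce word count:
# map emits (word, 1) pairs; reduce sums the counts per word.
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line of text."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

def word_count(lines):
    pairs = []
    for line in lines:
        pairs.extend(map_phase(line))   # map step over every input line
    return reduce_phase(pairs)          # shuffle + reduce collapsed into one step

print(word_count(["hello big data", "hello hadoop"]))
# → {'hello': 2, 'big': 1, 'data': 1, 'hadoop': 1}
```

The appeal of the model is that `map_phase` and `reduce_phase` are independent per word, so the framework can spread them across a cluster without the programmer writing any distribution logic.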
- The need of the hour was a scalable search engine for the growing internet
- Internet Archive search director Doug Cutting and University of Washington graduate student Mike Cafarella set out to build a search engine; the project, named Nutch, began in 2001-2002
- Google’s distributed file system paper came out in 2003, and its MapReduce paper came out in 2004
- In 2006 Doug Cutting joined Yahoo and created an open-source framework called Hadoop (the name of his son’s toy elephant). Hadoop traces its roots to Nutch, Google’s distributed file system, and the MapReduce processing engine
- It went on to become a full-fledged Apache project, and a stable version of Hadoop was in use at Yahoo by 2008
Hadoop Framework: Stepping into Hadoop.
Let us look at some Key terms used while discussing Hadoop.
● Commodity hardware: Ordinary, inexpensive PCs which can be used to build a cluster
● Cluster/grid: Interconnection of systems in a network
● Node: A single instance of a computer
● Distributed System: A system composed of multiple autonomous computers that communicate through a computer network
● ASF: Apache Software Foundation
● HA: High Availability
● Hot stand-by: Failover is uninterrupted, whereas with cold stand-by there will be a noticeable delay; if the system goes down, you will have to reboot.