Apache Hadoop is an open-source software framework for distributed storage and distributed processing of big data on clusters of commodity hardware. Its Hadoop Distributed File System (HDFS) splits files into large blocks (64 MB or 128 MB by default, depending on the Hadoop version) and distributes the blocks among the nodes in the cluster.
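The block size and replication factor can also be set per file through the HDFS client API. A minimal sketch in Java, assuming a hypothetical NameNode address and file path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address
            FileSystem fs = FileSystem.get(conf);

            // create(path, overwrite, bufferSize, replication, blockSize)
            FSDataOutputStream out = fs.create(
                    new Path("/tmp/example.txt"), // hypothetical path
                    true,                         // overwrite if it exists
                    4096,                         // io buffer size in bytes
                    (short) 3,                    // replicas per block (HDFS default)
                    128L * 1024 * 1024);          // 128 MB block size
            out.writeUTF("hello hdfs");
            out.close();
            fs.close();
        }
    }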
Modules
HDFS
definition
the filesystem that Hadoop uses to store data on the nodes of the cluster
structure
NameNode
A cluster has only one NameNode.
File content is split into blocks (128 MB by default). Each block is replicated on multiple DataNodes (3 replicas by default).
Files and directories are represented by inodes. Inodes record attributes such as permissions, modification and access times, namespace quotas and disk-space quotas. The NameNode maintains all of this metadata.
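That inode metadata is visible to clients through the HDFS client API. A minimal sketch, assuming a hypothetical path (quota calls only make sense on directories with quotas set):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.ContentSummary;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class InodeAttrsDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/user/alice/data"); // hypothetical path

            FileStatus st = fs.getFileStatus(p);
            System.out.println("permissions : " + st.getPermission());
            System.out.println("modified    : " + st.getModificationTime()); // epoch millis
            System.out.println("accessed    : " + st.getAccessTime());

            ContentSummary cs = fs.getContentSummary(p); // namespace/space quota info
            System.out.println("name quota  : " + cs.getQuota());
            System.out.println("space quota : " + cs.getSpaceQuota());
        }
    }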
DataNodes
Each block replica on a DataNode is stored as two files: one contains the data itself, the other contains the block's metadata (checksums and the generation stamp). The size of the data file equals the actual length of the block, so no extra space is used to round it up to the nominal block size.
At startup, each DataNode performs a handshake with the NameNode to verify the namespace ID. The namespace ID is stored persistently on all DataNodes in the cluster.
After the handshake, the DataNode registers with the NameNode. A DataNode receives a unique storage ID when it registers for the first time; the ID never changes afterwards.
DataNodes identify block replicas by block ID and send these IDs to the NameNode in a block report. A block report is sent immediately after the DataNode connects to the NameNode, and then once every hour. Block reports let the NameNode track where each block's replicas are located in the cluster.
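Because the NameNode builds its block map from these reports, a client can ask it where a file's blocks live. A minimal sketch, assuming a hypothetical file path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus st = fs.getFileStatus(new Path("/tmp/example.txt")); // hypothetical file

            // One BlockLocation per block; each lists the DataNodes holding a replica.
            for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
                System.out.println("offset " + loc.getOffset()
                        + ", length " + loc.getLength()
                        + ", hosts " + String.join(",", loc.getHosts()));
            }
        }
    }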
Every 3 seconds each DataNode sends a heartbeat to the NameNode. If no heartbeat arrives for about 10 minutes, the NameNode assumes the DataNode is dead and its block replicas unavailable.
Heartbeats also carry information about the DataNode: total storage capacity, storage in use, and the number of data transfers currently in progress.
The NameNode also uses heartbeat replies to send instructions to DataNodes:
replicate blocks to other nodes
remove local block replicas
re-register and send an immediate block report
shut down the node
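The 3-second and ~10-minute figures map to two HDFS settings. A sketch of how the dead-node timeout is derived; the formula matches Hadoop's default behaviour, but treat the exact constants as version-dependent:

    import org.apache.hadoop.conf.Configuration;

    public class HeartbeatTimeoutDemo {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Defaults: heartbeat every 3 s, liveness recheck every 5 min.
            long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3);
            long recheckMs = conf.getLong("dfs.namenode.heartbeat.recheck-interval", 300_000);

            // The NameNode declares a DataNode dead after roughly:
            // 2 * recheck-interval + 10 * heartbeat-interval ~= 10.5 minutes
            long timeoutMs = 2 * recheckMs + 10 * 1000 * heartbeatSec;
            System.out.println("dead-node timeout: " + timeoutMs / 60_000.0 + " min");
        }
    }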
MapReduce
definition
a framework for processing large amounts of structured or unstructured data in parallel across a cluster
tasks
Map
takes the input elements and breaks them into tuples (key/value pairs)
Reduce
takes the map output as its input and combines those tuples into a smaller set of tuples
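Word count is the canonical illustration of these two tasks. A minimal sketch against the org.apache.hadoop.mapreduce API (driver/job setup omitted):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map: break each input line into (word, 1) tuples.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (token.isEmpty()) continue;
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }

        // Reduce: combine the tuples for one word into a single (word, total) tuple.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }
    }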
trackers
Job tracker
schedules a job's component tasks on the slave nodes, monitors them and re-executes failed tasks
task trackers
execute tasks.
YARN
also called MRv2; splits resource management and job scheduling/monitoring into separate daemons
There is one global ResourceManager and one ApplicationMaster per application. An application is either a single job or a directed acyclic graph (DAG) of jobs.
The ApplicationMaster is a framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks. Its responsibility is negotiating appropriate resource containers from the Scheduler, and tracking and monitoring their status and progress.
The NodeManager is the per-node slave; together with the ResourceManager it forms the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all applications in the system. The NodeManager is responsible for containers, monitoring their resource usage and reporting it to the ResourceManager/Scheduler.
The ResourceManager has 2 main components
Scheduler
allocates resources to the running applications; it is a pure scheduler: it performs no monitoring or tracking of application status, and offers no guarantee about restarting tasks that fail due to application or hardware failures. Scheduling is based on an abstract notion of a resource Container, which incorporates elements such as memory, cpu, disk, network, etc.
ApplicationsManager
accepts job submissions, negotiates the first container for executing the application-specific ApplicationMaster, and provides the service for restarting the ApplicationMaster container on failure
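A minimal sketch of the ApplicationMaster's side of this negotiation, using YARN's AMRMClient. The host name and container size are hypothetical, and a real AM would also handle the allocated containers and unregister when done:

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class AmSketch {
        public static void main(String[] args) throws Exception {
            AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
            rm.init(new YarnConfiguration());
            rm.start();

            // Register this ApplicationMaster with the ResourceManager.
            rm.registerApplicationMaster("am-host", 0, ""); // hypothetical host

            // Ask the Scheduler for one container: 1024 MB of memory, 1 vcore.
            Resource capability = Resource.newInstance(1024, 1);
            rm.addContainerRequest(new ContainerRequest(
                    capability, null, null, Priority.newInstance(0)));

            // Heartbeat/allocate call; the response carries any granted containers.
            rm.allocate(0.0f);
        }
    }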
Hadoop Common packages
provide filesystem- and OS-level abstractions, the MapReduce engine (MR1 or MR2) and HDFS, plus the JAR files and scripts needed to start Hadoop (e.g. sbin/start-dfs.sh and sbin/start-yarn.sh).
Supporting tools
Pig
Pig allows you to write complex MapReduce transformations using a simple scripting language. Pig is a high-level scripting platform used with Apache Hadoop. Its language, called Pig Latin, abstracts Java MapReduce into a form similar to SQL. Users can extend Pig Latin by writing their own functions (UDFs) in Java, Python, Ruby or other scripting languages; a Java sketch follows the list of modes below.
runs in 2 modes
local
gives access to a single machine; all files are installed and run using the local host and local file system
mapreduce
the default mode; requires access to a Hadoop cluster
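A minimal sketch of such a user-defined function in Java; the class name is hypothetical:

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // A trivial Pig UDF: upper-cases its first argument.
    public class UpperUdf extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null; // Pig treats null as missing data
            }
            return input.get(0).toString().toUpperCase();
        }
    }

Once registered from a Pig Latin script (REGISTER the jar containing it, then invoke the function by class name), it can be called like a built-in function.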
Hive
provides access to data on top of MapReduce using SQL-like queries (HiveQL)
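A minimal sketch of issuing such a query from Java over HiveServer2's standard JDBC interface; the connection string, table and credentials are hypothetical, and the hive-jdbc driver must be on the classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryDemo {
        public static void main(String[] args) throws Exception {
            // HiveServer2 listens on port 10000 by default.
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = conn.createStatement();
                 // SQL-like HiveQL; Hive compiles it into MapReduce jobs.
                 ResultSet rs = stmt.executeQuery(
                         "SELECT word, COUNT(*) FROM words GROUP BY word")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }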