Apache Hadoop is an open-source software framework for distributed storage and distributed processing of big data on clusters of commodity hardware. Its Hadoop Distributed File System (HDFS) splits files into large blocks (64 MB or 128 MB by default, depending on the Hadoop version) and distributes the blocks among the nodes in the cluster.
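The block size and replication factor can also be set per file through the HDFS client API. A minimal sketch in Java, assuming a hypothetical NameNode address and file path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address
            FileSystem fs = FileSystem.get(conf);

            // create(path, overwrite, bufferSize, replication, blockSize)
            FSDataOutputStream out = fs.create(
                    new Path("/tmp/example.txt"), // hypothetical path
                    true,                         // overwrite if it exists
                    4096,                         // io buffer size in bytes
                    (short) 3,                    // replicas per block (HDFS default)
                    128L * 1024 * 1024);          // 128 MB block size
            out.writeUTF("hello hdfs");
            out.close();
            fs.close();
        }
    }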
Modules
HDFS
definition
the filesystem that Hadoop uses to store data on the nodes of the cluster
structure
NameNode
A cluster has only one NameNode.
File content is split into blocks (128 MB by default). Each block is replicated on multiple DataNodes (3 replicas by default).
Files and directories are represented by inodes. Inodes record attributes such as permissions, modification and access times, namespace quotas and disk-space quotas. The NameNode maintains all of this metadata.
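That inode metadata is visible to clients through the HDFS client API. A minimal sketch, assuming a hypothetical path (quota calls only make sense on directories with quotas set):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.ContentSummary;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class InodeAttrsDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/user/alice/data"); // hypothetical path

            FileStatus st = fs.getFileStatus(p);
            System.out.println("permissions : " + st.getPermission());
            System.out.println("modified    : " + st.getModificationTime()); // epoch millis
            System.out.println("accessed    : " + st.getAccessTime());

            ContentSummary cs = fs.getContentSummary(p); // namespace/space quota info
            System.out.println("name quota  : " + cs.getQuota());
            System.out.println("space quota : " + cs.getSpaceQuota());
        }
    }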
DataNodes
Each block replica on a DataNode is stored as two files: one contains the data itself, the other contains the block's metadata (checksums and the generation stamp). The size of the data file equals the actual length of the block, so no extra space is used to round it up to the nominal block size.
At startup, each DataNode performs a handshake with the NameNode to verify the namespace ID. The namespace ID is stored persistently on all DataNodes in the cluster.
After the handshake, the DataNode registers with the NameNode. A DataNode receives a unique storage ID when it registers for the first time; the ID never changes afterwards.
DataNodes identify block replicas by block ID and send these IDs to the NameNode in a block report. A block report is sent immediately after the DataNode connects to the NameNode, and then once every hour. Block reports let the NameNode track where each block's replicas are located in the cluster.
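Because the NameNode builds its block map from these reports, a client can ask it where a file's blocks live. A minimal sketch, assuming a hypothetical file path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus st = fs.getFileStatus(new Path("/tmp/example.txt")); // hypothetical file

            // One BlockLocation per block; each lists the DataNodes holding a replica.
            for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
                System.out.println("offset " + loc.getOffset()
                        + ", length " + loc.getLength()
                        + ", hosts " + String.join(",", loc.getHosts()));
            }
        }
    }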
Every 3 seconds each DataNode sends a heartbeat to the NameNode. If no heartbeat arrives for about 10 minutes, the NameNode assumes the DataNode is dead and its block replicas unavailable.
Heartbeats also carry information about the DataNode: total storage capacity, storage in use, and the number of data transfers currently in progress.
The NameNode also uses heartbeat replies to send instructions to DataNodes:
replicate blocks to other nodes
remove local block replicas
re-register and send an immediate block report
shut down the node
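The 3-second and ~10-minute figures map to two HDFS settings. A sketch of how the dead-node timeout is derived; the formula matches Hadoop's default behaviour, but treat the exact constants as version-dependent:

    import org.apache.hadoop.conf.Configuration;

    public class HeartbeatTimeoutDemo {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Defaults: heartbeat every 3 s, liveness recheck every 5 min.
            long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3);
            long recheckMs = conf.getLong("dfs.namenode.heartbeat.recheck-interval", 300_000);

            // The NameNode declares a DataNode dead after roughly:
            // 2 * recheck-interval + 10 * heartbeat-interval ~= 10.5 minutes
            long timeoutMs = 2 * recheckMs + 10 * 1000 * heartbeatSec;
            System.out.println("dead-node timeout: " + timeoutMs / 60_000.0 + " min");
        }
    }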
MapReduce
definition
a framework for processing large amounts of structured or unstructured data in parallel across a cluster
tasks
Map
takes the input elements and breaks them into tuples (key/value pairs)
Reduce
takes the map output as its input and combines those tuples into a smaller set of tuples
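Word count is the canonical illustration of these two tasks. A minimal sketch against the org.apache.hadoop.mapreduce API (driver/job setup omitted):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map: break each input line into (word, 1) tuples.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (token.isEmpty()) continue;
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }

        // Reduce: combine the tuples for one word into a single (word, total) tuple.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }
    }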
trackers
Job tracker
schedules a job's component tasks on the slave nodes, monitors them and re-executes failed tasks
task trackers
execute tasks.
YARN
also called MRv2; splits resource management and job scheduling/monitoring into separate daemons
There is one global ResourceManager and one ApplicationMaster per application. An application is either a single job or a directed acyclic graph (DAG) of jobs.
The ApplicationMaster is a framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks. Its responsibility is negotiating appropriate resource containers from the Scheduler, and tracking and monitoring their status and progress.
The NodeManager is the per-node slave; together with the ResourceManager it forms the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all applications in the system. The NodeManager is responsible for containers, monitoring their resource usage and reporting it to the ResourceManager/Scheduler.
The ResourceManager has 2 main components
Scheduler
allocates resources to the running applications; it is a pure scheduler: it performs no monitoring or tracking of application status, and offers no guarantee about restarting tasks that fail due to application or hardware failures. Scheduling is based on an abstract notion of a resource Container, which incorporates elements such as memory, cpu, disk, network, etc.
ApplicationsManager
accepts job submissions, negotiates the first container for executing the application-specific ApplicationMaster, and provides the service for restarting the ApplicationMaster container on failure
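A minimal sketch of the ApplicationMaster's side of this negotiation, using YARN's AMRMClient. The host name and container size are hypothetical, and a real AM would also handle the allocated containers and unregister when done:

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class AmSketch {
        public static void main(String[] args) throws Exception {
            AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
            rm.init(new YarnConfiguration());
            rm.start();

            // Register this ApplicationMaster with the ResourceManager.
            rm.registerApplicationMaster("am-host", 0, ""); // hypothetical host

            // Ask the Scheduler for one container: 1024 MB of memory, 1 vcore.
            Resource capability = Resource.newInstance(1024, 1);
            rm.addContainerRequest(new ContainerRequest(
                    capability, null, null, Priority.newInstance(0)));

            // Heartbeat/allocate call; the response carries any granted containers.
            rm.allocate(0.0f);
        }
    }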
Hadoop Common packages
provide filesystem- and OS-level abstractions, the MapReduce engine (MR1 or MR2) and HDFS, plus the JAR files and scripts needed to start Hadoop (e.g. sbin/start-dfs.sh and sbin/start-yarn.sh).
Supporting tools
Pig
Pig allows you to write complex MapReduce transformations using a simple scripting language. Pig is a high-level scripting platform used with Apache Hadoop. Its language, called Pig Latin, abstracts Java MapReduce into a form similar to SQL. Users can extend Pig Latin by writing their own functions (UDFs) in Java, Python, Ruby or other scripting languages; a Java sketch follows the list of modes below.
runs in 2 modes
local
gives access to a single machine; all files are installed and run using the local host and local file system
mapreduce
the default mode; requires access to a Hadoop cluster
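A minimal sketch of such a user-defined function in Java; the class name is hypothetical:

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // A trivial Pig UDF: upper-cases its first argument.
    public class UpperUdf extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null; // Pig treats null as missing data
            }
            return input.get(0).toString().toUpperCase();
        }
    }

Once registered from a Pig Latin script (REGISTER the jar containing it, then invoke the function by class name), it can be called like a built-in function.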
Hive
provides access to data on top of MapReduce using SQL-like queries (HiveQL)
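A minimal sketch of issuing such a query from Java over HiveServer2's standard JDBC interface; the connection string, table and credentials are hypothetical, and the hive-jdbc driver must be on the classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryDemo {
        public static void main(String[] args) throws Exception {
            // HiveServer2 listens on port 10000 by default.
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = conn.createStatement();
                 // SQL-like HiveQL; Hive compiles it into MapReduce jobs.
                 ResultSet rs = stmt.executeQuery(
                         "SELECT word, COUNT(*) FROM words GROUP BY word")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }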