Siddesh BG's Build Release Config mgmt Blog


Friday, 5 July 2013

What is Apache Hadoop?

Posted on 03:57 by Unknown
Newcomers can get a clean, simple introduction to Hadoop from the following Pivotal blog posts:

1) Demystifying Apache Hadoop in 5 Pictures
2) Hadoop 101: Programming MapReduce with Native Libraries, Hive, Pig, and Cascading
3) 20+ Examples of Getting Results with Big Data

Highlights

  • Hadoop was developed to support big data analysis.
  • Hadoop uses distributed computing to process large data sets quickly.
  • Hadoop divides work and distributes it across a large number of computers, spreading data processing logic across tens, hundreds, or even thousands of commodity servers.

Hadoop's main components

1) Hadoop Distributed File System (HDFS): splits the data, places it on different nodes, replicates it, and manages it.

2) MapReduce: processes the data on each node in parallel and calculates the results of the job.
     a) Map: performs computation on the local data set on each node and outputs a list of key-value pairs.
     b) Reduce: the output of the map step is sent to other nodes as input for the reduce step. Before reduce runs, the key-value pairs are shuffled to the reducer nodes and sorted by key. The reduce phase then combines the list of values for each key into a single entry, for example by summing them.

3) Managing Hadoop jobs: in Hadoop, the entire process is called a job.
    a) The job tracker divides the job into tasks and schedules those tasks to run on the nodes. It also keeps track of the participating nodes, monitors the processes, orchestrates data flow, and handles failures.
    b) Task trackers run the tasks and report to the job tracker.
  With this layer of management automation, Hadoop can distribute jobs across a large number of nodes in parallel and scale as more nodes are added.
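
The split, map, shuffle/sort, and reduce steps described above can be sketched in plain Python. This is only a single-process illustration, not Hadoop code: the function names (split_into_blocks, map_phase, shuffle_and_sort, reduce_phase, word_count), the round-robin split, and the word-count task are all assumptions made for the example. Real HDFS splits files into blocks and also replicates them, which this sketch leaves out.

```python
from collections import defaultdict


def split_into_blocks(lines, num_nodes):
    """Deal lines out round-robin to imitate data spread across nodes."""
    blocks = [[] for _ in range(num_nodes)]
    for i, line in enumerate(lines):
        blocks[i % num_nodes].append(line)
    return blocks


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one node's local block."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle_and_sort(mapped_outputs):
    """Gather the values for each key from all map outputs, ordered by key."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(grouped):
    """Reduce: sum each key's list of values into a single entry."""
    return {key: sum(values) for key, values in grouped}


def word_count(lines, num_nodes=3):
    """Run the whole 'job': split, map each block, shuffle/sort, reduce."""
    blocks = split_into_blocks(lines, num_nodes)
    mapped = [map_phase(block) for block in blocks]  # parallel on a real cluster
    return reduce_phase(shuffle_and_sort(mapped))


if __name__ == "__main__":
    sample = ["big data big cluster", "data node", "big job"]
    print(word_count(sample))
    # {'big': 3, 'cluster': 1, 'data': 2, 'job': 1, 'node': 1}
```

On a real cluster the map calls would run on the nodes that hold each block, and the job tracker would schedule and monitor them. Here they simply run one after another in a list comprehension.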

Hadoop programming

There are four coding approaches:
1) Native Hadoop library: offers the greatest performance and the most fine-grained control.
2) Pig: resembles SQL, but is procedural rather than declarative.
3) Hive: started by Facebook. It provides a more SQL-like interface and is often considered the slowest of these approaches.
4) Cascading: a set of .jar files that define data processing APIs, integration APIs, and a process planner and scheduler.
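
To make the "procedural, not declarative" distinction a little more concrete, here is a rough analogy in plain Python. It is not Pig Latin and does not use any Hadoop API. The sample records and variable names are hypothetical. The point is only that a procedural data flow spells out each step (load, filter, group, aggregate) in order, whereas a declarative SQL-style query would state the desired result in one statement and leave the steps to the engine.

```python
from collections import defaultdict

# Hypothetical sample records: (user, page, seconds_on_page)
visits = [
    ("alice", "home", 30),
    ("bob", "home", 5),
    ("alice", "docs", 120),
    ("carol", "docs", 45),
    ("bob", "docs", 8),
]

# Step 1: "load" the raw tuples into records with named fields.
records = [{"user": u, "page": p, "seconds": s} for u, p, s in visits]

# Step 2: filter out very short visits.
long_visits = [r for r in records if r["seconds"] >= 10]

# Step 3: group the remaining visits by page.
by_page = defaultdict(list)
for r in long_visits:
    by_page[r["page"]].append(r["seconds"])

# Step 4: aggregate each group into a single total.
total_seconds = {page: sum(secs) for page, secs in by_page.items()}

print(total_seconds)  # {'home': 30, 'docs': 165}
```

Each intermediate result has its own name and can be inspected or reused, which is the main practical difference from writing the same logic as one declarative query.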
