1 . What is big data? And is Hadoop the only tool to handle big data problems?

Answer :

As the name suggests, big data is all about handling large amounts of data. The term describes collections of data, structured or unstructured, that grow so large and so quickly that they become difficult to manage with regular database and statistics tools.
Here is a bit about the scale of bytes of data:

1 bit = binary digit
8 bits = 1 byte
1000 bytes = 1 kilobyte
1000 kilobytes = 1 megabyte
1000 megabytes = 1 gigabyte
1000 gigabytes = 1 terabyte
1000 terabytes = 1 petabyte
1000 petabytes = 1 exabyte
1000 exabytes = 1 zettabyte
1000 zettabytes = 1 yottabyte
1000 yottabytes = 1 brontobyte
1000 brontobytes = 1 geopbyte

Data grows like this day by day. 

For example, in 2011 Facebook was handling about 10 terabytes of data each day and Twitter about 7 terabytes; interestingly, roughly 80% of that data was unstructured. These volumes increase every year, and in order to store and analyze such large data sets, reliable and efficient hardware and software are needed.

So when it comes to deriving insight from structured and unstructured data together, an RDBMS cannot cope with such large collections of unstructured data.

Hadoop is not a replacement for an RDBMS.

Hadoop is one such tool: it offers reliable storage through the Hadoop Distributed File System (HDFS), which keeps several replicas of each block, and efficient analysis through MapReduce (its programming model).

The Hadoop framework is written in Java and was originally created by Doug Cutting. Hadoop is not suitable for online transaction processing (OLTP) or online analytical processing (OLAP).

Apart from Hadoop, Google runs its own MapReduce framework on a different file system, the Google File System (GFS).

Presently, Hadoop is being promoted by Yahoo.

Several projects that integrate with Hadoop have been released from the Apache incubator and are designed for certain use cases: 
Pig, developed at Yahoo, is a high-level scripting language for working with big data, while Hive is a SQL-like query language for big data in a warehouse configuration.

HBase, developed at Facebook, is a column-oriented database often used as a datastore on which MapReduce jobs can be executed. 

ZooKeeper provides distributed coordination services, and Chukwa is a data collection system for monitoring large distributed systems.

Mahout is a library for scalable machine learning, part of which can use Hadoop. 

Cascading (Chris Wensel), Oozie (Yahoo) and Azkaban (LinkedIn) provide MapReduce job workflows and scheduling. 

Apart from Hadoop, there are several other tools, such as:

BashReduce 

Unlike Hadoop, BashReduce is just a script! BashReduce implements MapReduce for standard Unix commands such as sort, awk, grep, join etc. It supports mapping/partitioning, reducing, and merging. 

Disco Project 

Disco was initially developed by Nokia Research and has been around silently for a few years. Developers write MapReduce jobs in simple, beautiful Python. Disco’s backend is written in Erlang, a scalable functional language with built-in support for concurrency, fault tolerance and distribution — perfect for a MapReduce system! Similar to Hadoop, Disco distributes and replicates data, but it does not use its own file system. Disco also has efficient job scheduling features. 

Spark 

Spark is one of the newest players in the MapReduce field. Its purpose is to make data analytics fast to write and fast to run. Unlike many MapReduce systems, Spark allows in-memory querying of data (even distributed across machines) rather than relying on disk I/O. Spark is implemented in Scala.

GraphLab, Storm and HPCC Systems (from LexisNexis) are other systems that handle big data.

With all these alternatives, why use Hadoop? 

One word: HDFS. For a moment, assume you could bring all of your files and data with you everywhere you go. No matter what system, or type of system, you login to, your data is intact waiting for you. Suppose you find a cool picture on the Internet. You save it directly to your file store and it goes everywhere you go. HDFS gives users the ability to dump very large datasets (usually log files) to this distributed filesystem and easily access it with tools, namely Hadoop. Not only does HDFS store a large amount of data, it is fault tolerant. Losing a disk, or a machine, typically does not spell disaster for your data. HDFS has become a reliable way to store data and share it with other open-source data analysis tools.
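
To give a concrete feel for how applications talk to HDFS, here is a minimal sketch that writes a small file through the Java FileSystem API; the path and content are purely illustrative, and the cluster address is assumed to come from your core-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS in core-site.xml points at your HDFS cluster;
        // with no configuration this falls back to the local file system.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // HDFS replicates the blocks of this file (3 copies by default),
        // which is why losing a disk or a machine is usually not a disaster.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }
    }
}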


1 . Problems running Hadoop on Mac OS X with the default example?
 

Answer :

Make sure that you turn on Remote Login under System Preferences, then File Sharing.

To check whether Hadoop is running, try the jps command from the command line. You should get something like:

43730 TaskTracker
43674 JobTracker
43730 NameNode
43582 DataNode
43611 SecondaryNameNode
43821 Jps

And now try some of the hadoop examples:

$ hadoop jar /usr/local/Cellar/hadoop/2.3.0/libexec/hadoop-examples-*.jar pi 10 100

2 . Pig and Hadoop connection error?
 

Answer :

Found the solution.
The line:

Caused by: java.net.ConnectException: Call From localhost.localdomain/127.0.0.1 to 0.0.0.0:10020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

comes when JobHistoryServer is not up.
So starting the JobHistoryServer would eliminate this issue.
To start it, go to Hadoop's sbin directory and run the command:

mr-jobhistory-daemon.sh start historyserver

Do a jps to check if JobHistoryServer is up, and then re-execute your Pig commands.


3 . How does Hadoop -getmerge work?
 

Answer :

The getmerge command was created specifically for merging files from HDFS into a single file on the local file system.

This command is very useful for downloading the output of a MapReduce job, which could have generated multiple part-* files, and combining them into a single file locally, which you can then use for other operations (e.g. put it in an Excel sheet for presentation).

Answers to your questions:

  1. If the destination file system does not have enough space, an IOException is thrown. getmerge internally uses the IOUtils.copyBytes() function (see IOUtils.copyBytes()) to copy one file at a time from HDFS to the local file. This function throws an IOException whenever there is an error in the copy operation.

  2. This command is very similar to the hdfs dfs -get command, which copies a file from HDFS to the local file system. The only difference is that hdfs dfs -getmerge merges multiple files from HDFS into a single file on the local file system.

If you want to merge multiple files in HDFS, you can achieve it using copyMerge() method from FileUtil class (see FileUtil.copyMerge()).

This API copies all files in a directory to a single file (merges all the source files).
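
If you would rather do the merge from Java, here is a minimal sketch using FileUtil.copyMerge(), which is available in Hadoop 2.x (it was removed in 3.x); the paths below are only examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CopyMergeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        FileSystem local = FileSystem.getLocal(conf);

        // Merge every file under the HDFS output directory into one local file.
        // deleteSource=false keeps the source files; the last argument is an
        // optional string written after each file (null means nothing is added).
        FileUtil.copyMerge(hdfs, new Path("/user/demo/job-output"),
                           local, new Path("/tmp/merged-output.txt"),
                           false, conf, null);
    }
}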


4 . Can't find start-all.sh in the Hadoop installation?
 

Answer :
hduser@ubuntu:~$ /usr/local/hadoop/sbin/start-all.sh

start-all.sh and stop-all.sh are located in the sbin directory, while the hadoop binary is located in the bin directory.

Also update your .bashrc with:

export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

so that you can run start-all.sh directly.


5 . Run Hadoop job without using JobConf?
 

Answer :
import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MapReduceExample extends Configured implements Tool {

    // Mapper written against the new (org.apache.hadoop.mapreduce) API -- no JobConf involved.
    static class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.getCounter("mygroup", "jeff").increment(1);
            context.write(key, value);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        // Job (new API) replaces JobConf; getConf() is provided by Configured.
        Job job = Job.getInstance(getConf());
        job.setMapperClass(MyMapper.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // Clean the local output directory so the job can be re-run.
        FileUtils.deleteDirectory(new File("data/output"));
        args = new String[] { "data/input", "data/output" };
        ToolRunner.run(new MapReduceExample(), args);
    }
}


6 . Hadoop - Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext?
 

Answer :

This probably means you compiled your code against an earlier version of Hadoop than you're running on.

This might have something to do with it...

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.2.0</version>
</dependency>

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version>
</dependency>

You have two different versions of Hadoop here: hadoop-core 1.2.1 belongs to the 1.x line, where TaskAttemptContext is a class, while hadoop-common 2.2.0 belongs to the 2.x line, where it is an interface, which is exactly the mismatch the error reports. Align both dependencies on the same release line.


7 . Apache Hadoop vs Google Bigdata?

Answer :

The simple answer would be: it depends on what you want to do with your data.

Hadoop is used for massive storage of data and batch processing of that data. It is very mature and popular, and a lot of libraries support this technology. But if you want to do real-time analysis and queries on your data, Hadoop is not suitable for it.

Google's BigQuery was developed specifically to solve this issue. You can do real-time processing on your data using Google's BigQuery.

You can use BigQuery in place of Hadoop, or you can use BigQuery together with Hadoop to query datasets produced by MapReduce jobs.

So, it entirely depends on how you want to process your data. If a batch processing model is required and sufficient, you can use Hadoop; if you want real-time processing, you have to choose Google's offering.

Edit: You can also explore other technologies that work with Hadoop, like Spark, Storm, Hive etc. (and choose depending on your use case).


8 . Hadoop Text class?
 

Answer :

Let me shed some light. When we talk about distributed systems, efficient serialization/deserialization plays a vital role. It appears in two quite distinct areas of distributed data processing:

  • IPC
  • Persistent Storage

To be specific to Hadoop, IPC between nodes is implemented using RPCs. The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node, which then deserializes the binary stream into the original message. So, it is very important to have a solid Serialization/Deserialization framework in order to store and process huge amounts of data efficiently. In general, it is desirable that an RPC serialization format is:

  • Compact
  • Fast
  • Extensible
  • Interoperable

Hadoop uses its own types because developers wanted the storage format to be compact (to make efficient use of storage space), fast (so the overhead in reading or writing terabytes of data is minimal), extensible (so we can transparently read data written in an older format), and interoperable (so we can read or write persistent data using different languages).

A few points to remember before concluding that having dedicated MapReduce types is redundant:

  1. Hadoop’s Writable-based serialization framework provides a more efficient and customized serialization and representation of the data for MapReduce programs than using the general-purpose Java’s native serialization framework.
  2. As opposed to Java’s serialization, Hadoop’s Writable framework does not write the type name with each object expecting all the clients of the serialized data to be aware of the types used in the serialized data. Omitting the type names makes the serialization process faster and results in compact, random accessible serialized data formats that can be easily interpreted by non-Java clients.
  3. Hadoop’s Writable-based serialization also has the ability to reduce the object-creation overhead by reusing the Writable objects, which is not possible with the Java’s native serialization framework.
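
To make these points concrete, here is a minimal sketch of a custom Writable (the class name and fields are hypothetical): only the field values are written, no type names, and the same object can be reused for every record by calling readFields() again.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class PersonWritable implements Writable {
    private final Text name = new Text();
    private long visits;

    public void set(String n, long v) {
        name.set(n);
        visits = v;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Only the raw field values go to the stream -- no class metadata.
        name.write(out);
        out.writeLong(visits);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // The framework reuses this object, so the fields are simply overwritten.
        name.readFields(in);
        visits = in.readLong();
    }
}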

9 . MultipleOutputFormat in hadoop?
 

Answer :

Each reducer uses an OutputFormat to write records. That is why you are getting one set of odd and even files per reducer. This is by design, so that each reducer can perform writes in parallel.

If you want just a single odd file and a single even file, you'll need to set mapred.reduce.tasks to 1. But performance will suffer, because all the mappers will be feeding into a single reducer.
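
As a small illustration (not tied to your exact job), this is how you could force a single reducer, with either the property named above or the equivalent new-API call:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SingleReducerJob {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // The old-style property name mentioned above:
        conf.set("mapred.reduce.tasks", "1");

        Job job = Job.getInstance(conf);
        // The preferred, explicit way with the new API:
        job.setNumReduceTasks(1);
        return job;
    }
}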

Another option is to change the process that reads these files to accept multiple input files, or to write a separate process that merges these files together.


10 . Hadoop Documentation for Eclipse?
 

Answer :

Another easy solution, for those who want to stay updated, is to edit the Javadoc path as Chris said:

"In Eclipse, right click your Java project and select Build Path -> Configure Build Path. Now click the Libraries tab and locate the entry for hadoop-core-x.x.x.jar. Expand the entry to show options for Source, Javadoc etc locations and click the Javadoc location entry. Now click the Edit button to the right and enter the location as the path"

but instead of pointing it at the copy of the API docs stored on your hard disk, enter the URL of the online Javadoc so that it always stays up to date.



11 . Hadoop error on Windows : java.lang.UnsatisfiedLinkError?
 

Answer :
  1. Go to https://github.com/steveloughran/winutils
  2. Download "winutils.exe" and "hadoop.dll" for the Hadoop version you use.
  3. Copy the two files to HADOOP_HOME\bin.

Note: if the versions of "winutils.exe" and "hadoop.dll" do not match the Hadoop version you are using, it will not work.


12 . How to find the CDH version of Hadoop?
 

Answer :

In CDH, on the cluster I am using, there is no cdh_version.properties (or I couldn't find it).

If your cluster uses "Parcels", you can check which version of CDH is in use by listing the parcels directory:

ls /opt/cloudera/parcels

And you could see the version as the name of the folder:

CDH-5.5.1-1.cdh5.5.1.p0.11

13 . Hadoop “Unable to load native-hadoop library for your platform” error on docker-spark?
 

Answer :

Adding the Hadoop native library directory to LD_LIBRARY_PATH fixes this problem:

export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:$LD_LIBRARY_PATH

14 . Snapshottable directory hadoop admin?
 

Answer :

HDFS snapshots are read-only, point-in-time copies of the entire HDFS file system or of a subtree/portion of it. These snapshots can be used for data recovery and backup.

In the example posted,

drwxr-xr-x   - hadoop supergroup          0 2017-03-09 13:04 /Snap/.snapshot/Sanpnew

/Snap is a snapshottable directory, which means snapshots can be created for this directory. Every snapshottable directory will contain a subdirectory .snapshot to store the snapshots created.

And there are two snapshots (Sanpnew and Sanpnew1) created for the directory /Snap.

These snapshot folders hold the image of the /Snap directory as it was at the time of snapshot creation. They can be used if, for example, the contents of the /Snap directory need to be rolled back to an earlier point in time.
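
For completeness, here is a minimal sketch of creating and deleting such a snapshot from Java with the FileSystem API; the snapshot name Sanpnew2 is just for illustration, and the directory must already have been made snapshottable by an administrator.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SnapshotExample {
    public static void main(String[] args) throws Exception {
        // Assumes the admin has already run: hdfs dfsadmin -allowSnapshot /Snap
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path dir = new Path("/Snap");

        // Creates a read-only point-in-time copy under /Snap/.snapshot/Sanpnew2
        fs.createSnapshot(dir, "Sanpnew2");

        // Removes the snapshot once it is no longer needed
        fs.deleteSnapshot(dir, "Sanpnew2");
    }
}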


15 . Hadoop Mapper Context object?
 

Answer :

The run() method is called using Java runtime polymorphism (i.e. method overriding). As you can see at line 569 of the MapTask source, the extended mapper/reducer is instantiated using the Java reflection APIs. The MapTask class gets the name of the extended mapper/reducer from the job configuration object, which the client program has configured using job.setMapperClass().

mapperContext = contextConstructor.newInstance(mapper, job, getTaskID(),
                                               input, output, committer,
                                               reporter, split);

input.initialize(split, mapperContext);
mapper.run(mapperContext);
input.close();

Line 621 is an example of runtime polymorphism. On this line, MapTask calls the run() method of the configured mapper with the mapper Context as a parameter. If run() is not overridden, it calls the run() method of org.apache.hadoop.mapreduce.Mapper, which in turn calls the map() method of the configured mapper.

At line 616 of the same source, MapTask creates the context object with all the details of the job configuration, etc., as mentioned by @harpun, and then passes it to the run() method at line 621.

The above explanation holds good for the reduce task as well, with the corresponding ReduceTask class being the main entry class.
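
To make the call chain concrete, the default Mapper.run() essentially does the following (paraphrased from the Hadoop source, not copied verbatim); you only override run() like this when you need to change the record-iteration loop itself:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LoopMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        try {
            // The context hands over one key/value pair at a time,
            // and map() is invoked for each of them.
            while (context.nextKeyValue()) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } finally {
            cleanup(context);
        }
    }
}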