1 . Should I learn Hadoop (Big Data)?

Answer :

Big Data & Hadoop have a huge career scope ahead, especially as traditional SQL-based systems struggle to handle such huge volumes of data. It is no surprise that your friends are switching to careers in Hadoop; this is an ongoing trend in the major IT markets. Organizations like eBay, Amazon, etc. are hiring Hadoop professionals, even freshers, at lucrative packages. So if you wish to work in the data analysis field in the future, Hadoop is a good way to go.
If you have basic knowledge of Java (object-oriented programming), you can learn Hadoop. If you take a course, you can get a gist of the technology in less than 4 weeks; practice will do the rest of the work. You can try online learning platforms like WizIQ.com for a Hadoop course. They have a good range of courses in the Developer, Admin & Analyst domains.
Wishing you good luck in your Hadoop career.


1 . Problems running Hadoop on Mac OS X with the default example?
 

Answer :

Make sure that you turn on Remote Login under System Preferences -> Sharing (the same pane that lists File Sharing).


To check whether it is running, try the jps command from the command line. You should get something like:

43730 TaskTracker
43674 JobTracker
43730 NameNode
43582 DataNode
43611 SecondaryNameNode
43821 Jps

And now try some of the hadoop examples:

$ hadoop jar /usr/local/Cellar/hadoop/2.3.0/libexec/hadoop-examples-*.jar pi 10 100

2 . Pig and Hadoop connection error?
 

Answer :

Found the solution.
The line:

Caused by: java.net.ConnectException: Call From localhost.localdomain/127.0.0.1 to 0.0.0.0:10020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

comes when the JobHistoryServer is not up, so starting the JobHistoryServer eliminates this issue.
To start it, go to Hadoop's sbin directory and run:

./mr-jobhistory-daemon.sh start historyserver

Do a jps to check if JobHistoryServer is up, and then re-execute your Pig commands.
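
As a side note, the 0.0.0.0:10020 address in the error is the default value of the mapreduce.jobhistory.address property. If your JobHistoryServer runs on another host, you can point clients at it explicitly; a minimal sketch (the helper class and host name are made up for illustration):

import org.apache.hadoop.conf.Configuration;

public class HistoryServerConfig {

    public static Configuration withHistoryServer(String hostPort) {
        Configuration conf = new Configuration();
        // The default is 0.0.0.0:10020, which is what the ConnectException above points at.
        // Pass the real host of the JobHistoryServer, e.g. "historyhost:10020".
        conf.set("mapreduce.jobhistory.address", hostPort);
        return conf;
    }
}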


3 . How does Hadoop -getmerge work?
 

Answer :

The getmerge command was created specifically for merging files from HDFS into a single file on the local file system.

This command is very useful for downloading the output of a MapReduce job, which may have generated multiple part-* files, and combining them into a single local file that you can use for other operations (e.g. put it in an Excel sheet for a presentation).

Answers to your questions:

  1. If the destination file system does not have enough space, an IOException is thrown. getmerge internally uses the IOUtils.copyBytes() function to copy one file at a time from HDFS to the local file system, and this function throws an IOException whenever there is an error in the copy operation.

  2. This command is along the same lines as the hdfs dfs -get command, which gets a file from HDFS to the local file system. The only difference is that hdfs dfs -getmerge merges multiple files from HDFS into one file on the local file system.

If you want to merge multiple files within HDFS, you can achieve it using the copyMerge() method of the FileUtil class.

This API copies all files in a directory to a single file (merges all the source files).
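
For illustration, here is a minimal sketch of FileUtil.copyMerge(), assuming the Hadoop 2.x signature (the method was removed in Hadoop 3) and hypothetical paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeExample {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Merge every file under the (hypothetical) job output directory
        // into a single file that also lives in HDFS.
        FileUtil.copyMerge(
                fs, new Path("/user/hadoop/job-output"),   // source file system and directory
                fs, new Path("/user/hadoop/merged.txt"),   // destination file system and file
                false,                                     // keep the source files
                conf,
                null);                                     // nothing appended between files
    }
}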


4 . Can't find start-all.sh in Hadoop installation?
 

Answer :
hduser@ubuntu:~$ /usr/local/hadoop/sbin/start-all.sh

start-all.sh and stop-all.sh are located in the sbin directory, while the hadoop binary is located in the bin directory.

Also update your .bashrc with:

export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

so that you can run start-all.sh directly.


5 . Run Hadoop job without using JobConf?
 

Answer :
import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MapReduceExample extends Configured implements Tool {

    // Identity-style mapper that also increments a custom counter per record.
    static class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.getCounter("mygroup", "jeff").increment(1);
            context.write(key, value);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        // Use the new mapreduce API's Job instead of the old JobConf.
        Job job = Job.getInstance(getConf());
        job.setJarByClass(MapReduceExample.class);
        job.setMapperClass(MyMapper.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // Clean the output directory (FileUtils is from commons-io).
        FileUtils.deleteDirectory(new File("data/output"));
        args = new String[] { "data/input", "data/output" };
        System.exit(ToolRunner.run(new MapReduceExample(), args));
    }
}


6 . Hadoop - Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext?
 

Answer :

This probably means you compiled your code against an earlier version of Hadoop than you're running on.

This might have something to do with it...

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.2.0</version>
</dependency>

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version>
</dependency>

You have two different versions of Hadoop here: hadoop-common is from the 2.x line while hadoop-core is from the 1.x line. Make sure all Hadoop artifacts in your POM come from the same release.


7 . Apache Hadoop vs Google Bigdata?

Answer :

The simple answer would be: it depends on what you want to do with your data.

Hadoop is used for massive storage of data and batch processing of that data. It is very mature and popular, and there are a lot of libraries that support this technology. But if you want to do real-time analysis or queries on your data, Hadoop is not suitable.

Google's BigQuery was developed specifically to solve this issue. You can do real-time processing on your data using Google's BigQuery.

You can use BigQuery in place of Hadoop, or you can use BigQuery alongside Hadoop to query datasets produced by MapReduce jobs.

So it entirely depends on how you want to process your data. If a batch processing model is required and sufficient, you can use Hadoop; if you want real-time processing, you have to choose Google's BigQuery.

Edit: You can also explore other technologies that work with Hadoop, like Spark, Storm, Hive, etc., and choose depending on your use case.


8 . Hadoop Text class?
 

Answer :

Let me shed some light. When we talk about distributed systems, efficient serialization/deserialization plays a vital role. It appears in two quite distinct areas of distributed data processing:

  • IPC
  • Persistent Storage

To be specific to Hadoop, IPC between nodes is implemented using RPCs. The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node, which then deserializes the binary stream into the original message. So, it is very important to have a solid Serialization/Deserialization framework in order to store and process huge amounts of data efficiently. In general, it is desirable that an RPC serialization format is:

  • Compact
  • Fast
  • Extensible
  • Interoperable

Hadoop uses its own types because developers wanted the storage format to be compact (to make efficient use of storage space), fast (so the overhead in reading or writing terabytes of data is minimal), extensible (so we can transparently read data written in an older format), and interoperable (so we can read or write persistent data using different languages).

A few points to remember before concluding that having dedicated MapReduce types is redundant (a short sketch follows this list):

  1. Hadoop’s Writable-based serialization framework provides a more efficient and customized serialization and representation of data for MapReduce programs than Java’s general-purpose native serialization framework.
  2. As opposed to Java’s serialization, Hadoop’s Writable framework does not write the type name with each object; it expects all clients of the serialized data to be aware of the types used in it. Omitting the type names makes the serialization process faster and results in compact, randomly accessible serialized data formats that can be easily interpreted by non-Java clients.
  3. Hadoop’s Writable-based serialization also reduces the object-creation overhead by reusing the Writable objects, which is not possible with Java’s native serialization framework.
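
As a small illustration of points 2 and 3, here is a hedged sketch that round-trips a Text and an IntWritable through the Writable API; the class name is made up and the snippet is not tied to any particular MapReduce job:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {

    public static void main(String[] args) throws Exception {
        // Serialize: write(DataOutput) emits only the raw field data,
        // no class names, which keeps the format compact.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        new Text("hadoop").write(out);
        new IntWritable(42).write(out);

        // Deserialize: the reader must already know the types and their order.
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        Text word = new Text();              // reusable object, no per-record allocation
        IntWritable count = new IntWritable();
        word.readFields(in);
        count.readFields(in);
        System.out.println(word + " -> " + count.get());
    }
}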

9 . MultipleOutputFormat in hadoop?
 

Answer :

Each reducer uses an OutputFormat to write records. That's why you are getting a set of odd and even files per reducer. This is by design, so that each reducer can perform writes in parallel.

If you want just a single odd and single even file, you'll need to set mapred.reduce.tasks to 1. But performance will suffer, because all the mappers will be feeding into a single reducer.
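
For example, with the old mapred API this is a single call on the JobConf (a hedged sketch; the helper class is hypothetical):

import org.apache.hadoop.mapred.JobConf;

public class SingleReducerSetup {

    // Force a single reducer so only one odd file and one even file are produced,
    // at the cost of reduce-side parallelism. Same effect as mapred.reduce.tasks=1.
    public static void configure(JobConf conf) {
        conf.setNumReduceTasks(1);
    }
}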

Another option is to change the process that reads these files so it accepts multiple input files, or to write a separate process that merges these files together.


10 . Hadoop Documentation for Eclipse?
 

Answer :

Another easy solution for those who want to stay updated is to edit the Javadocs path like Chris said:

"In Eclipse, right click your Java project and select Build Path -> Configure Build Path. Now click the Libraries tab and locate the entry for hadoop-core-x.x.x.jar. Expand the entry to show options for Source, Javadoc etc locations and click the Javadoc location entry. Now click the Edit button to the right and enter the location as the path"

but instead of pointing it at the API copy stored on your hard disk, point the Javadoc location at the online Hadoop API documentation so that it stays up to date.



11 . Hadoop error on Windows : java.lang.UnsatisfiedLinkError?
 

Answer :
  1. Go to https://github.com/steveloughran/winutils.
  2. Download "winutils.exe" and "hadoop.dll" for the Hadoop version you are using.
  3. Copy both files to HADOOP_HOME\bin.

Note: if the "winutils.exe" and "hadoop.dll" files do not match the Hadoop version you are using, it will not work.
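
If you cannot change the HADOOP_HOME environment variable, a commonly used alternative is to set the hadoop.home.dir system property from code before any Hadoop class is touched; a hedged sketch with a hypothetical path:

public class WindowsHadoopSetup {

    public static void main(String[] args) {
        // Hypothetical folder that contains bin\winutils.exe and bin\hadoop.dll.
        // hadoop.home.dir is the system property Hadoop's Shell utility falls back to
        // when HADOOP_HOME is not set in the environment.
        System.setProperty("hadoop.home.dir", "C:\\hadoop");

        // ... create Configuration / FileSystem / Job objects after this point ...
    }
}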


12 . How to find the CDH version of Hadoop?
 

Answer :

In CDH, in the cluster I am using, there is no cdh_version.properties (or I couldn't find it).

If your cluster uses "Parcels", you can check which version of CDH is used by looking at:

/opt/cloudera/parcels

And you will see the version as the name of the folder:

CDH-5.5.1-1.cdh5.5.1.p0.11

13 . Hadoop “Unable to load native-hadoop library for your platform” error on docker-spark?
 

Answer :

Adding the Hadoop native library directory to LD_LIBRARY_PATH fixes this problem:

export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:$LD_LIBRARY_PATH
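
To confirm that the native library is now found, you can run hadoop checknative -a, or check NativeCodeLoader from a small program (a hedged sketch; the class name is made up):

import org.apache.hadoop.util.NativeCodeLoader;

public class NativeCheck {

    public static void main(String[] args) {
        // Prints true once libhadoop.so is resolvable via java.library.path / LD_LIBRARY_PATH.
        System.out.println("native hadoop loaded: " + NativeCodeLoader.isNativeCodeLoaded());
    }
}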

14 . Snapshottable directory hadoop admin?
 

Answer :

HDFS snapshots are read-only, point-in-time copies of the entire HDFS file system or of a subtree/portion of it. These snapshots can be used for data recovery and backup.

In the example posted,

drwxr-xr-x   - hadoop supergroup          0 2017-03-09 13:04 /Snap/.snapshot/Sanpnew

/Snap is a snapshottable directory, which means snapshots can be created for this directory. Every snapshottable directory will contain a subdirectory .snapshot to store the snapshots created.

There are two snapshots (Sanpnew and Sanpnew1) created for the directory /Snap.

These snapshot folders hold the image of the /Snap directory as it was at the time of snapshot creation. For example, if the contents of the /Snap directory ever need to be rolled back to an earlier point in time, these snapshots can be used.
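
The usual way to manage snapshots is the CLI (hdfs dfsadmin -allowSnapshot and hdfs dfs -createSnapshot), but the same can be done through the HDFS Java API; a hedged sketch assuming admin rights, fs.defaultFS pointing at HDFS, and the /Snap directory from the example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SnapshotExample {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // DistributedFileSystem when defaultFS is hdfs://
        Path dir = new Path("/Snap");

        // Mark the directory as snapshottable (CLI: hdfs dfsadmin -allowSnapshot /Snap).
        ((DistributedFileSystem) fs).allowSnapshot(dir);

        // Create a read-only, point-in-time snapshot; it appears under /Snap/.snapshot/Sanpnew.
        Path snapshot = fs.createSnapshot(dir, "Sanpnew");
        System.out.println("Created snapshot at: " + snapshot);
    }
}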


15 . Hadoop Mapper Context object?
 

Answer :

The run() method is called using Java run-time polymorphism (i.e. method overriding). As you can see around line #569 of the MapTask source (snippet below), the extended mapper/reducer is instantiated using the Java Reflection APIs. The MapTask class gets the name of the extended mapper/reducer from the Job configuration object, which the client program will have configured via job.setMapperClass():

mapperContext = contextConstructor.newInstance(mapper, job, getTaskID(),
                                               input, output, committer,
                                               reporter, split);

input.initialize(split, mapperContext);
mapper.run(mapperContext);
input.close();

Line #621 is an example of run-time polymorphism. On this line, MapTask calls the run() method of the configured mapper with the Mapper Context as a parameter. If run() is not overridden, the call resolves to the run() method of org.apache.hadoop.mapreduce.Mapper, which in turn calls the map() method of the configured mapper.

At line #616, MapTask creates the context object with all the details of the job configuration, etc., as mentioned by @harpun, and then passes it on to the run() method at line #621.

The above explanation holds good for the reduce task as well, with the corresponding ReduceTask class being the main entry class.
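
To make the last step concrete, the inherited run() in org.apache.hadoop.mapreduce.Mapper is essentially the following loop (a paraphrased sketch, not a verbatim copy of any particular Hadoop release):

import java.io.IOException;

import org.apache.hadoop.mapreduce.Mapper;

// Sketch of what the inherited run() does when a mapper does not override it:
// set up once, pass every record of the split to map(), then clean up.
public class DefaultRunMapper<KI, VI, KO, VO> extends Mapper<KI, VI, KO, VO> {

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
        cleanup(context);
    }
}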