Tuesday, April 5, 2016

Chapter 7. Introduction to Hadoop MapReduce


7.1 Introduction to WordCount.java
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
7.2 Editing WordCount.java
Step 1: Create the wordcount directory
mkdir -p ~/wordcount/input
cd  ~/wordcount
Step 2: Edit WordCount.java
gedit WordCount.java
Step 3: Import the required libraries and declare the WordCount class
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // The code from Steps 4 to 6 goes inside this class body.
}
Step 4: Create the main function
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
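The main function registers IntSumReducer as both the combiner and the reducer. That works here because addition gives the same total whether the counts are summed all at once or summed per map task first and then combined. The following is a minimal plain-Java sketch of that idea, not Hadoop code; the class name CombinerSketch and the sample counts are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: plain Java, no Hadoop classes involved.
class CombinerSketch {
    // Same summing logic that IntSumReducer applies to its values.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical counts for one word, emitted by two map tasks.
        List<Integer> fromMapTask1 = Arrays.asList(1, 1, 1);
        List<Integer> fromMapTask2 = Arrays.asList(1, 1);

        // Without a combiner the reducer receives every single 1.
        int withoutCombiner = sum(Arrays.asList(1, 1, 1, 1, 1));

        // With a combiner each map task pre-sums, and the reducer adds the partial sums.
        int withCombiner = sum(Arrays.asList(sum(fromMapTask1), sum(fromMapTask2)));

        System.out.println(withoutCombiner + " " + withCombiner); // prints "5 5"
    }
}
```

A reducer whose result depends on seeing all values at once (an average, for example) could not be reused as a combiner this way.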
Step 5: Create the TokenizerMapper class
 public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
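To see what the mapper emits for a single line of input, the tokenizing loop can be tried outside Hadoop. The sketch below is plain Java under that assumption; MapSketch and mapLine are hypothetical names, and each emitted (word, 1) pair is represented as a tab-separated string rather than Text and IntWritable objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Illustration only: mimics the loop in TokenizerMapper.map() without Hadoop.
class MapSketch {
    // Returns one "word<TAB>1" record per whitespace-separated token.
    static List<String> mapLine(String line) {
        List<String> emitted = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            emitted.add(itr.nextToken() + "\t1");
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Sample sentence chosen for illustration.
        for (String record : mapLine("the quick fox, the lazy dog")) {
            System.out.println(record);
        }
    }
}
```

StringTokenizer splits on whitespace only, so "fox," keeps its comma and "The" and "the" count as different words. This is why the job output contains entries with attached punctuation.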
Step 6: Create the IntSumReducer class
 public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
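The reduce step can also be sketched in plain Java. Hadoop groups the mapper output by key before calling reduce; the sketch below stands in for that grouping with a TreeMap, which is an assumption made for illustration and not how Hadoop implements its shuffle. The names ReduceSketch and countWords are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: plain Java, no Hadoop classes involved.
class ReduceSketch {
    // Same summing loop as IntSumReducer.reduce(), on plain Integers.
    static int reduce(Iterable<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    // Groups a 1 under each word (stand-in for the shuffle), then reduces each group.
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String w : words) {
            grouped.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList("the", "fox", "the"))); // prints {fox=1, the=2}
    }
}
```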

7.3 Compiling WordCount.java
Step 1: Edit the environment variable file needed for compilation
gedit ~/.bashrc
Add the following lines:
export PATH=${JAVA_HOME}/bin:${PATH}
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
Step 2: Apply the changes made to ~/.bashrc
source ~/.bashrc
Step 3: Compile
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
ll
7.4 Uploading a text file to HDFS
hadoop fs -mkdir -p /user/hduser/wordcount/input
cd /usr/local/hadoop
ll LICENSE.txt
hadoop fs -copyFromLocal LICENSE.txt /user/hduser/wordcount/input
hadoop fs -ls /user/hduser/wordcount/input
7.5 Running WordCount
cd ~/wordcount
hadoop jar wc.jar WordCount /user/hduser/wordcount/input/LICENSE.txt /user/hduser/wordcount/output
7.6 Viewing the results
hadoop fs -ls /user/hduser/wordcount/output
hadoop fs -cat /user/hduser/wordcount/output/part-r-00000
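With the default text output format, each line of part-r-00000 holds a word and its count separated by a tab. If the file is copied to the local machine, lines of that shape can be post-processed with ordinary Java. The sketch below finds the most frequent word; OutputSketch, mostFrequent, and the sample lines are hypothetical and do not come from an actual run.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: parses "word<TAB>count" lines such as those in part-r-00000.
class OutputSketch {
    // Returns the word with the highest count, or null if no line could be parsed.
    static String mostFrequent(List<String> lines) {
        String best = null;
        int bestCount = -1;
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // skip lines without a tab separator
            }
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            if (count > bestCount) {
                bestCount = count;
                best = line.substring(0, tab);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not real job output.
        List<String> sample = Arrays.asList("a\t2", "the\t7", "of\t5");
        System.out.println(mostFrequent(sample));
    }
}
```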



 


The content above is excerpted from the following book, which covers it in detail:
  Hadoop+Spark大數據巨量分析與機器學習整合開發實戰 http://www.books.com.tw/products/0010695285



10 comments:

  1. Hello, I followed the steps in the book and got this error message when compiling:
    WordCount.java:38: error: incompatible types
    for (IntWritable val : values) {
    ^
    required: IntWritable
    found: Object
    Note: WordCount.java uses unchecked or unsafe operations.
    Note: Recompile with -Xlint:unchecked for details.
    1 error

    I'm not sure how to fix this, so I'm asking here. Thank you.

    1. You can download the WordCount.java file from the sample code and copy and paste its contents.

  2. 1. Original example
    public void reduce(Text key, Iterable values,
    2. Changed to what the book has
    public void reduce(Text key, Iterable values,
    3. Tested OK

  3. Sorry, I pasted the previous comment incorrectly.
    1. Original example
    public void reduce(Text key, Iterable values,
    2. Changed to what the book has
    public void reduce(Text key, Iterable values,
    3. Tested OK

  4. I didn't realize part of the content would be stripped out.
    Change it to what the book has:
    public void reduce(Text key, Iterable values,


  5. Change it to what the book has: Iterable values,

  6. Change it to what the book has: public void reduce(Text key, Iterable{angle bracket}IntWritable{angle bracket} values,

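  7. Teacher, could you share the VOF file for the Hadoop cluster environment? I followed every step in the book but still get errors: each time only one node is live and the others do not show up.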
