We forgot something. Combiner.

The title of this blog post may say it clearly enough, but we did manage to miss something rather important.  This is based on my understanding, and I haven’t had time to test the results yet, so any help testing it would be greatly appreciated, as would any help explaining it further or clearing up any misconceptions I have about the class.  I’ll do another post on the Partitioner soon, as well.

I noticed it on Wednesday, and couldn’t help but dig a little deeper.  I’ll give a few hints so we can see what we missed, starting with a look back at the MapReduce WordCount program from earlier.  The one I’m using here is the latest posted on Blackboard, which critically differs by a few lines from the one posted earlier in the Titanpad instructions and found in this tutorial.  The main difference is how the classes are written, and what they do.

This is the latest one on Blackboard:

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper 
       extends Mapper<Object, Text, Text, IntWritable>{
    
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
      
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  
  public static class IntSumReducer 
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, 
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 2) {
      System.err.println("Usage: wordcount <in> [<in>...] <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    for (int i = 0; i < otherArgs.length - 1; ++i) {
      FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }
    FileOutputFormat.setOutputPath(job,
      new Path(otherArgs[otherArgs.length - 1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

And this is the code from the Hortonworks tutorial:


package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}

You can see there isn’t much difference if you look at the two side by side; the map and reduce methods are nearly identical.  The main differences are the class names, and whether value.toString() is passed to the StringTokenizer directly or stored in a line variable first.  Yet, although they look very similar, there’s a big difference in how the <key,value> pairs are handled, and the reason is how the jobs are set up.

In the first:

Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

And in the second:

Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

Hmm, notice how IntSumReducer is set twice in the first one, once as the combiner and once as the reducer?  Let’s find out what’s going on there, starting with the documentation for setCombinerClass.  So what’s happening here, and what’s the difference between the two?

The combiner is a local reduction phase.  In the one from the Hortonworks tutorial (the second) we’re not combining our <key,value> pairs before sending them to the reducer, and in the Apache one we are!  The combiner is essentially a local call of the reducer on each map node, so that the <key,value> pairs are combined before being sent off to the reducer.  From a practical standpoint, that means the first program sends <word,N> to the reducer, while the second sends <word,1> N times.  This results in far less data being transferred between nodes, and improved efficiency.
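If I’m reading the setCombinerClass documentation right, the second program only needs one extra line to get the same local reduction.  Here’s an untested sketch of what its driver would look like with the combiner added, reusing the Reduce class shown above (which should be safe here, since its output types match the mapper’s output types):

// Untested sketch: the Hortonworks driver with a combiner added.
// The only new line is job.setCombinerClass(Reduce.class);
// Hadoop may run the combiner zero, one, or many times per map task,
// so this only works because summing partial counts is still correct.
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

job.setMapperClass(Map.class);
job.setCombinerClass(Reduce.class); // local reduction on each map node
job.setReducerClass(Reduce.class);

job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.waitForCompletion(true);

With that one line in place, a map task that saw “the” fifty times would send a single <the,50> to the reducer instead of fifty <the,1> pairs.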

You can learn more about the combiner here.

From the blog ... Something here. » cs-wsu by ilundhild and used with permission of the author. All other rights reserved by the author.

Posted in cs-wsu, CS@Worcester | Comments Off

OpenMRS Meeting

During today’s OpenMRS meeting, the developers discussed security issues and bugs, as well as the many changes they have made in the past month. One of the developers noted that many of the problems OpenMRS has been having are due to recent commits that were a bit sloppy.

One of the developers went into the registration aspect of OpenMRS and provided a demo. He noted that they made the system keyboard friendly so that the arrow keys can be used for easy drop-down menu navigation. He also noted that real-time editing of patient medical records was added, along with address hierarchies for different countries such as Haiti. Another fix they mentioned was allowing earlier birth years for older patients, because previously OpenMRS wouldn’t let admins enter a birthday that would make the patient older than a hundred.

Some issues the developers hope to work on next include adding a format for patient phone numbers such as (xxx) xxx-xxxx rather than just xxxxxxxxxx.

In conclusion, the usage of Java 8 was mentioned. One developer thought it would be better to use with OpenMRS because it has better language features to fit the needs of developers.

From the blog wellhellooosailor » CS@Worcester by epaiz and used with permission of the author. All other rights reserved by the author.

Posted in CS@Worcester | Comments Off

OpenMrs meeting 03/05/2015

The meeting was a discussion of the different changes implemented during this past sprint. They fixed issues such as the limit on the year a patient was born; previously the software wouldn’t allow admins to enter dates more than 100 years in the past. They also added a number of functions to help admins better identify a patient. When a patient arrives at the hospital unconscious, the staff cannot identify the patient by asking for basic information such as their name, so they created a method that allows them to update the patient’s information once the individual has woken up. Another feature was a box for recording a more accurate birth date, since the recorded date is not always accurate.

Rodrigo Roldan

From the blog rroldan1 » CS@Worcester by rroldan1 and used with permission of the author. All other rights reserved by the author.

Posted in Assignment, CS401-01, CS@Worcester | Comments Off

Introduction

Hello, my name is Patrick Mahoney. I attend Worcester State University and major in Computer Science. I enjoy working in the computer field and hope to learn as much as I can in the time to come.

From the blog pmahones6 » cs-wsu by pmahones's blog and used with permission of the author. All other rights reserved by the author.

Posted in cs-wsu | Comments Off

Hello world!

This is your very first post. Click the Edit link to modify or delete it, or start a new post. If you like, use this post to tell readers why you started this blog and what you plan to do with it.

Happy blogging!

From the blog pmahones6 » cs-wsu by pmahones's blog and used with permission of the author. All other rights reserved by the author.

Posted in cs-wsu | Comments Off

Capital Letters in Pound Signs

Based on the material I learned today in CS-383, I have created an Excel spreadsheet containing each of the 26 letters of the English alphabet in an ASCII form made of pound symbols and periods. I still do not remember what this style is called, and I have no idea if these are the correct forms of each letter.

Feel free to look at the file and improve if necessary: Letters in Pounds
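I haven’t seen the spreadsheet itself, but the idea as I understand it is just a small grid per letter, with ‘#’ for a filled cell and ‘.’ for an empty one. A quick Java sketch of one made-up letter (the pattern for ‘A’ below is my own guess, not copied from the file):

public class PoundLetter {
    public static void main(String[] args) {
        // A hypothetical 5x4 pattern for the letter 'A'.
        String[] a = {
            ".##.",
            "#..#",
            "####",
            "#..#",
            "#..#"
        };
        for (String row : a) {
            System.out.println(row);
        }
    }
}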

From the blog jdongamer » cs-wsu by jd22292 and used with permission of the author. All other rights reserved by the author.

Posted in cs-wsu | Comments Off

HDP Sandbox Startup Problems?

When I installed the Hortonworks HDP 2.2 Sandbox in my VMware Player, I had a problem after connecting to the first web page indicated on the startup screen. That web page had bad links.

The quick and dirty solution is to explicitly include the port number 8888 in your URL.
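For example, if the startup screen tells you to browse to something like http://127.0.0.1 (that address is just an illustration; use whatever your sandbox reports), try http://127.0.0.1:8888 instead.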

If you want to see a more detailed description of the problem (as I saw it) and the workaround (that worked for me), go to this blog entry:

https://kevpfowler.wordpress.com/2015/01/28/bad-links-in-hortonworks-hdp-2-2-sandbox-startup/

From the blog kevpfowler » wsu-cs by kevpfowler and used with permission of the author. All other rights reserved by the author.

Posted in Hadoop, WSU CS | Comments Off