Category Archives: cs-wsu

Worcester State University

I am currently attending Worcester State University, where I am enrolled in the Software Development Capstone course. In it we get to take a look at OpenMRS and get a feel for what real-world development is like. We each work in a small group, similar to any work environment, and collaborate to accomplish tasks that help the OpenMRS project. I am looking forward to the challenges ahead!

From the blog CS@worcester – Greg Tzikas by Greg Tzikas and used with permission of the author. All other rights reserved by the author.

We forgot something. Combiner.

The title of this blog post may say it clearly enough, but we did manage to miss something rather important. This is just based on my understanding, and I haven’t had the time to test the results, so any help in doing so would be greatly appreciated, as would any help further explaining it or clearing up any misconceptions I have about the class. I’ll do another post on the Partitioner soon as well.

I noticed it on Wednesday and couldn’t help but dig a little deeper. Let’s look back at the MapReduce WordCount program from earlier, with a few hints along the way so we can see what we missed. The one I’m using here is the latest version posted on Blackboard, which critically differs by a few lines from the one posted earlier in the Titanpad instructions and found in this tutorial. The main difference is how the classes are written and what they do.

This is the latest one on Blackboard:

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper 
       extends Mapper<Object, Text, Text, IntWritable>{
    
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
      
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  
  public static class IntSumReducer 
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, 
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 2) {
      System.err.println("Usage: wordcount <in> [<in>...] <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    for (int i = 0; i < otherArgs.length - 1; ++i) {
      FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }
    FileOutputFormat.setOutputPath(job,
      new Path(otherArgs[otherArgs.length - 1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

And this is the code from the Hortonworks tutorial:


package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}

You can see there isn’t much difference if you look at the two side by side; the map and reduce methods are nearly identical. The main differences are the class names and that the first calls value.toString() directly inside the tokenizer while the second stores it in a line variable first. Still, although they seem very similar, there’s a big difference in how the <key,value> pairs are handled, and the reason lies in how the jobs are configured.

In the first:

Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

And in the second:

Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

Hmm, notice that in the first version IntSumReducer is set twice, once as the combiner and once as the reducer? Let’s find out what’s going on there, starting with the documentation for setCombinerClass. So what’s happening here, and what’s the difference between the two?

The combiner is a local reduction phase that runs on each mapper’s output. In the one from the Hortonworks tutorial (the second) we’re not combining our <key,value> pairs before sending them to the reducer, while in the Apache one (the first) we are! The combiner is essentially a local call of the reducer, so that each mapper’s <key,value> pairs are combined before being sent off to the actual reducer. From a practical standpoint, that means what’s being sent to the reducer in the first is <word,N>, while in the second it’s <word,1> N times. This results in far less data being transferred between nodes and improved efficiency, so don’t forget to include it.
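
To make that concrete, here is a minimal sketch of my own (not from either tutorial) of how the missing combiner could be added to the Hortonworks version’s job setup. Reusing the existing Reduce class as the combiner works here because summing counts is associative and commutative, so partial sums computed on each mapper node can safely be summed again by the real reducer:

    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    // Local reduction: each mapper now emits <word,N> once instead of <word,1> N times.
    job.setCombinerClass(Reduce.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

One caveat: Hadoop treats the combiner as an optimization and may run it zero, one, or several times per mapper, so this only works when the reduce logic can be applied to partial results the way a sum can.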

You can learn more about the combiner here.

From the blog ... Something here. » cs-wsu by ilundhild and used with permission of the author. All other rights reserved by the author.

Introduction

Hello, my name is Patrick Mahoney. I attend Worcester State University and major in Computer Science. I enjoy working in the computer field and hope to learn as much as I can in the time to come.

Hello world!

This is your very first post. Click the Edit link to modify or delete it, or start a new post. If you like, use this post to tell readers why you started this blog and what you plan to do with it.

Happy blogging!

Capital Letters in Pound Signs

Based on the material I learned today in CS-383, I have created an Excel spreadsheet containing each of the 26 letters of the English alphabet drawn in an ASCII-art style out of pound signs and periods. I still do not remember what this technique is called, and I have no idea if these are the correct forms of each letter.

Feel free to look at the file and improve if necessary: Letters in Pounds

From the blog jdongamer » cs-wsu by jd22292 and used with permission of the author. All other rights reserved by the author.

Simple MapReduce Example

MapReduce can be very complicated to understand at first.  I found this simple example that takes you through the steps so that you can see how the data gets mapped and reduced.  You can find it at this site:

http://ayende.com/blog/4435/map-reduce-a-visual-explanation
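
For a rough idea of the steps such walkthroughs cover, here is a tiny single-machine sketch of my own (not the example from the linked post): the map step emits a <word,1> pair for every word, and the reduce step groups the pairs by key and sums them.

    import java.util.*;

    public class TinyMapReduce {
      public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick fox", "the lazy dog");

        // Map step: emit a <word, 1> pair for every word on every line.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
          for (String word : line.split("\\s+")) {
            mapped.add(new AbstractMap.SimpleEntry<>(word, 1));
          }
        }

        // Group-and-reduce step: collect pairs with the same key and sum their values.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
          counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }

        System.out.println(counts); // {dog=1, fox=1, lazy=1, quick=1, the=2}
      }
    }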

From the blog ddgoddard » cs-wsu by ddgoddard and used with permission of the author. All other rights reserved by the author.

Hadoop MapReduce

The article on YDN about Hadoop was a great read, but I found it very dense for someone who first heard of Hadoop a couple of weeks ago. I went hunting for a simpler explanation of what MapReduce is and how to apply it, and found this article from IBM. It uses a simple data set of cities and their temperatures to explain MapReduce, and it also provides an interesting analogy for how the early Romans may have ‘used’ MapReduce to conduct censuses.
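
As a rough sketch of what that kind of example looks like in Hadoop terms (my own guess at the shape of it, not the IBM article’s code, and assuming input lines like “Toronto, 20”), the mapper emits a <city, temperature> pair per reading and the reducer keeps the maximum per city. The imports are the same as in the WordCount examples above.

    // Hypothetical mapper: parses "city, temperature" lines into <city, temperature> pairs.
    public static class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] parts = value.toString().split(",");
        if (parts.length == 2) {
          context.write(new Text(parts[0].trim()),
                        new IntWritable(Integer.parseInt(parts[1].trim())));
        }
      }
    }

    // Hypothetical reducer: keeps the highest temperature seen for each city.
    public static class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable val : values) {
          max = Math.max(max, val.get());
        }
        context.write(key, new IntWritable(max));
      }
    }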

I also found a set of HTML slides from a Google Research publication that goes deeper into the inner workings of MapReduce, with an example of how they used it at Google in 2004. The PDF version of the publication also provides full C code for a word count example.

From the blog mrnganga » cs-wsu by mrnganga and used with permission of the author. All other rights reserved by the author.