Amazon Web Services (AWS)

From Worcester State University Computer Science Department Wiki
Revision as of 17:02, 28 January 2017 by Karl Wurst (Talk | contribs) (Update Hadoop version to the version currently being used by AWS)


AWS Command Line Tools

You may want to install the AWS command line tools, which will allow you to (among other things) download files in bulk from S3.

If you have problems getting it to work under Windows, you may want to install it under Cygwin instead:
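A minimal install sketch, assuming you have Python and pip available (the bundled installer and the Windows MSI are alternatives):

```shell
# Install the AWS CLI via pip (works on Mac, Linux, and under Cygwin).
pip install awscli

# Verify the install.
aws --version
```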

Configuring AWS CLI

Once it is installed, run:

aws configure

You will have to enter your:

AWS Access Key ID:

AWS Secret Access Key:

Default region name: us-east-1

Default output format: (doesn't matter)
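If you prefer to script the setup rather than answer the prompts, `aws configure set` writes the same values to your AWS config files (a sketch; the key values are placeholders you must replace with your own):

```shell
# Each call writes one value to ~/.aws/credentials or ~/.aws/config.
aws configure set aws_access_key_id YOUR_ACCESS_KEY_ID      # placeholder
aws configure set aws_secret_access_key YOUR_SECRET_KEY     # placeholder
aws configure set region us-east-1
```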

Example AWS CLI Command

To upload a full directory of files to an S3 bucket:

aws s3 cp ghcnd_all/ s3://ghcnd-all/ --recursive
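To go the other direction and download in bulk, the same command works in reverse; `aws s3 sync` is an alternative that copies only new or changed files (bucket name taken from the example above):

```shell
# Download the whole bucket into a local directory.
aws s3 cp s3://ghcnd-all/ ghcnd_all/ --recursive

# Or copy only files that are new or have changed.
aws s3 sync s3://ghcnd-all/ ghcnd_all/
```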

Elastic Map Reduce (Hadoop)

To run a MapReduce program on Amazon Web Services Elastic MapReduce:

These instructions assume that you are using a particular directory for all your MapReduce projects, e.g. CS-383.

Get the Hadoop jars into a local directory where you can use them in your classpath when compiling

  1. Download the latest version of Hadoop: (choose the hadoop-2.7.3.tar.gz)
  2. Unzip and untar the archive (to untar on Windows, use: ; instructions and other methods here: )
  3. Move the whole hadoop-2.7.3 directory to the top level of your projects directory (e.g. CS-383)
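On Mac/Unix (or under Cygwin), the three steps above can be sketched as shell commands; the archive.apache.org URL is one plausible mirror, and ~/CS-383 stands in for your own projects directory:

```shell
# 1. Download Hadoop 2.7.3.
curl -O https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz

# 2. Unzip and untar in one step.
tar -xzf hadoop-2.7.3.tar.gz

# 3. Move it to the top level of your projects directory.
mv hadoop-2.7.3 ~/CS-383/
```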

Edit and compile your Java code

  1. Create a directory to hold your code e.g. WordCount
  2. Change to that directory
  3. Write your code
  4. Compile your code with the correct version of Java, and with the Hadoop libraries:
  • Mac/Unix:

javac -target 1.7 -source 1.7 -classpath ../hadoop-2.7.3/share/hadoop/yarn/*:../hadoop-2.7.3/share/hadoop/yarn/lib/*:../hadoop-2.7.3/share/hadoop/mapreduce/*:../hadoop-2.7.3/share/hadoop/mapreduce/lib/*:../hadoop-2.7.3/share/hadoop/common/* WordCount.java

  • Windows

javac -target 1.7 -source 1.7 -classpath ../hadoop-2.7.3/share/hadoop/yarn/*;../hadoop-2.7.3/share/hadoop/yarn/lib/*;../hadoop-2.7.3/share/hadoop/mapreduce/*;../hadoop-2.7.3/share/hadoop/mapreduce/lib/*;../hadoop-2.7.3/share/hadoop/common/* WordCount.java

  5. Make a jar file with all the classes:

jar -cvf WordCount.jar WordCount*.class
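Before uploading, you can sanity-check the jar; `jar -tf` lists the archive's contents without extracting anything:

```shell
# List the contents; you should see WordCount.class (and any inner classes).
jar -tf WordCount.jar
```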

Upload your jar and data files

  1. Log in to:
  2. Go to S3
  3. Create a bucket - username-cs-383 (S3 bucket names must be lowercase)
  4. Upload WordCount.jar (set to reduced-redundancy storage - cheaper)
  5. Create folder "input"
  6. Upload text file into input folder (set to reduced-redundancy storage - cheaper)
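If you installed the AWS CLI above, the same upload can be done from the command line (username-cs-383 and input.txt are placeholders; `--storage-class REDUCED_REDUNDANCY` matches the cheaper storage setting in the console):

```shell
# Create the bucket.
aws s3 mb s3://username-cs-383

# Upload the jar with reduced-redundancy storage (cheaper).
aws s3 cp WordCount.jar s3://username-cs-383/ --storage-class REDUCED_REDUNDANCY

# Upload the input text file; the input/ prefix acts as the folder.
aws s3 cp input.txt s3://username-cs-383/input/input.txt --storage-class REDUCED_REDUNDANCY
```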

Create a cluster and run your job

  1. Go to EMR
  2. Create Cluster
    1. name: username-cs-383
    2. termination protection: no
    3. logging enabled
    4. log folder: your folder
    5. debugging enabled
    6. software configuration: Hadoop: Amazon
    7. steps: add step: custom jar
    8. configure and add: choose jar from your bucket
    9. arguments: WordCount s3://kwurst-cs-383/input s3://kwurst-cs-383/output
    10. autoterminate: yes
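The console steps above correspond roughly to a single `aws emr create-cluster` call. This is a sketch, not a tested recipe: the release label, instance type and count, and bucket names are assumptions you should adjust to your own account:

```shell
aws emr create-cluster \
  --name "username-cs-383" \
  --release-label emr-5.3.0 \
  --applications Name=Hadoop \
  --use-default-roles \
  --instance-type m4.large \
  --instance-count 3 \
  --log-uri s3://username-cs-383/logs/ \
  --enable-debugging \
  --auto-terminate \
  --steps Type=CUSTOM_JAR,Name="WordCount",Jar=s3://username-cs-383/WordCount.jar,Args=[WordCount,s3://username-cs-383/input,s3://username-cs-383/output]
```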

Fixing Connection Timeout Errors When Connecting to EMR Cluster with SSH

  1. Log in to AWS.
  2. Go to EC2.
  3. Choose Security Groups from the lefthand menu.
  4. Select the group with Group Name ElasticMapReduce-master.
  5. In the bottom portion of the screen, select the Inbound tab.
  6. Click Edit.
  7. Scroll down and click Add Rule.
  8. Set Type to SSH
  9. Enter 22 for Port Range.
  10. For Source, add the IP:
  11. Click Save.
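The same rule can be added from the CLI. Here sg-xxxxxxxx is a placeholder - look up the group ID of ElasticMapReduce-master in the console - and the CIDR shown is a documentation example; substitute your own public IP:

```shell
# Allow inbound SSH (TCP port 22) from your IP address only.
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxxxxxxx \
  --protocol tcp \
  --port 22 \
  --cidr 203.0.113.5/32   # example IP; use your own address
```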