Amazon Web Services (AWS)
- 1 AWS Command Line Tools
- 2 Elastic MapReduce (Hadoop)
- 3 Fixing Connection Timeout Errors When Connecting to EMR Cluster with SSH
AWS Command Line Tools
You may want to install the AWS command line tools, which (among other things) let you upload and download files in bulk from S3.
If you have trouble getting them to work under Windows, you can install them under Cygwin: http://www.thecoatlessprofessor.com/programming/installing-amazon-web-services-command-line-interface-aws-cli-for-windows-os-x-and-linux-2
Configuring AWS CLI
Once you have it installed, run:
aws configure
You will have to enter your:
AWS Access Key ID:
AWS Secret Access Key:
Default region name: us-east-1
Default output format: (doesn't matter)
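Behind the scenes, aws configure saves these values in two small files in your home directory; if a prompt goes wrong, you can edit them directly. A sketch of what they contain (the key values here are placeholders):

```
# ~/.aws/credentials
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

# ~/.aws/config
[default]
region = us-east-1
```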
Example AWS CLI Command
To upload a full directory of files to an S3 bucket:
aws s3 cp ghcnd_all/ s3://ghcnd-all/ --recursive
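The same command also works in the other direction, for the bulk downloads mentioned above (this assumes your CLI is configured and you have access to the bucket):

```shell
# Download an entire bucket (or prefix) into a local directory:
aws s3 cp s3://ghcnd-all/ ghcnd_all/ --recursive

# Or use sync, which only copies files that are new or have changed:
aws s3 sync s3://ghcnd-all/ ghcnd_all/
```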
Elastic MapReduce (Hadoop)
To run a MapReduce program on Amazon Web Services Elastic MapReduce:
These instructions assume that you are using a particular directory for all your MapReduce projects, e.g. CS-383.
Get the Hadoop jars into a local directory where you can use them in your classpath when compiling
- Download Hadoop: http://www.apache.org/dyn/closer.cgi/hadoop/common/ (these instructions use hadoop-2.7.3.tar.gz)
- Unzip and untar it (to untar on Windows, use http://tartool.codeplex.com/releases/view/31505; instructions and other methods are here: https://wiki.haskell.org/How_to_unpack_a_tar_file_in_Windows)
- Move the whole hadoop directory to the top level of your MapReduce projects directory (e.g. CS-383)
Edit and compile your Java code
- Create a directory to hold your code e.g. WordCount
- Change to that directory
- Write your code
- Compile your code with the correct version of Java, and with the Hadoop libraries on the classpath.
On macOS/Linux (classpath entries separated by colons):
javac -target 1.7 -source 1.7 -classpath ../hadoop-2.7.3/share/hadoop/yarn/*:../hadoop-2.7.3/share/hadoop/yarn/lib/*:../hadoop-2.7.3/share/hadoop/mapreduce/*:../hadoop-2.7.3/share/hadoop/mapreduce/lib/*:../hadoop-2.7.3/share/hadoop/common/* WordCount.java
On Windows (classpath entries separated by semicolons):
javac -target 1.7 -source 1.7 -classpath ../hadoop-2.7.3/share/hadoop/yarn/*;../hadoop-2.7.3/share/hadoop/yarn/lib/*;../hadoop-2.7.3/share/hadoop/mapreduce/*;../hadoop-2.7.3/share/hadoop/mapreduce/lib/*;../hadoop-2.7.3/share/hadoop/common/* WordCount.java
- Make a jar file with all the classes:
jar -cvf WordCount.jar WordCount*.class
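The compile-and-package steps above can be sketched as a small script. It assembles the long classpath into a variable once, then prints the resulting commands so you can check them before running (drop the echo to actually run them). The HADOOP path and the WordCount name are assumptions to adjust to your own layout:

```shell
# Sketch of the compile-and-package steps (Unix-style paths with colon
# separators; on Windows, join the classpath entries with semicolons instead).
HADOOP=../hadoop-2.7.3   # path to your unpacked Hadoop distribution

# Build the classpath from the Hadoop jar directories:
CP="$HADOOP/share/hadoop/yarn/*:$HADOOP/share/hadoop/yarn/lib/*"
CP="$CP:$HADOOP/share/hadoop/mapreduce/*:$HADOOP/share/hadoop/mapreduce/lib/*"
CP="$CP:$HADOOP/share/hadoop/common/*"

# Compile against Java 7, then bundle all generated classes into one jar:
echo javac -target 1.7 -source 1.7 -classpath "$CP" WordCount.java
echo jar -cvf WordCount.jar "WordCount*.class"
```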
Upload your jar and data files
- Log in to: https://cs-worcester.signin.aws.amazon.com/console
- Go to S3
- Create a bucket named username-cs-383 (S3 bucket names must be lowercase)
- Upload WordCount.jar (set it to reduced-redundancy storage, which is cheaper)
- Create a folder named "input"
- Upload your text file into the input folder (again, set reduced-redundancy storage)
Create a cluster and run your job
- Go to EMR
- Create Cluster
- name: username-cs-383
- termination protection: no
- logging enabled
- log folder: your folder
- debugging enabled
- software configuration: Hadoop: Amazon
- steps: add step: custom jar
- configure and add: choose jar from your bucket
- arguments: WordCount s3://username-cs-383/input s3://username-cs-383/output (substitute your own bucket name)
- autoterminate: yes
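If you prefer the command line, the console steps above correspond roughly to a single aws emr create-cluster call. This is a hedged sketch, not a drop-in command: the release label, instance type and count, key name, and bucket names are all assumptions to adjust (emr-5.7.0 is one release that bundles Hadoop 2.7.3):

```shell
# Sketch: create a cluster, run one custom-jar step, and auto-terminate.
aws emr create-cluster \
  --name "username-cs-383" \
  --release-label emr-5.7.0 \
  --applications Name=Hadoop \
  --instance-type m4.large --instance-count 3 \
  --no-termination-protected \
  --log-uri s3://username-cs-383/logs/ \
  --enable-debugging \
  --steps Type=CUSTOM_JAR,Name=WordCount,ActionOnFailure=CONTINUE,Jar=s3://username-cs-383/WordCount.jar,Args=[WordCount,s3://username-cs-383/input,s3://username-cs-383/output] \
  --auto-terminate
```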
Fixing Connection Timeout Errors When Connecting to EMR Cluster with SSH
- Log in to AWS.
- Go to EC2.
- Choose Security Groups from the lefthand menu.
- Select the group with Group Name ElasticMapReduce-master.
- In the bottom portion of the screen, select the Inbound tab.
- Click Edit.
- Scroll down and click Add Rule.
- Set Type to SSH.
- Enter 22 for Port Range.
- For Source, enter 0.0.0.0/0. (This allows SSH from any IP address; for tighter security, choose My IP instead.)
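Once the rule is in place, you should be able to connect to the master node. The key file name and public DNS below are placeholders; use the EC2 key pair you chose when creating the cluster and the Master public DNS shown on the cluster's summary page:

```shell
# The default login user on an EMR master node is "hadoop";
# -i points at your EC2 key pair's private key file.
ssh -i ~/mykey.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
```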