Amazon Web Services (AWS)


AWS Command Line Tools

You may want to install the AWS command line tools, which will allow you to (among other things) download files in bulk from S3.

If you are having problems getting it to work under Windows, you may want to install it under Cygwin: http://www.thecoatlessprofessor.com/programming/installing-amazon-web-services-command-line-interface-aws-cli-for-windows-os-x-and-linux-2

Configuring AWS CLI

Once you get it installed, run:

aws configure

You will have to enter your:

AWS Access Key ID:

AWS Secret Access Key:

Default region name: us-east-1

Default output format: (any of json, text, or table; it doesn't matter for these instructions)
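
A typical session looks like this. The two key values shown are the placeholder examples from the AWS documentation, not real credentials; use your own keys.

aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-east-1
Default output format [None]: json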

Example AWS CLI Commands

To upload a full directory of files to an S3 bucket:

aws s3 cp ghcnd_all/ s3://ghcnd-all/ --recursive
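
The same command works in the other direction, so you can also download a whole bucket (or folder) in bulk:

aws s3 cp s3://ghcnd-all/ ghcnd_all/ --recursive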

Elastic Map Reduce (Hadoop)

Running FlightsByCarrier on Amazon Web Services

This section walks through running a Java MapReduce program on Amazon Web Services Elastic MapReduce.

These instructions assume that you are using a single directory for all your MapReduce projects, e.g. CS-383.

Get the Hadoop jars into a local directory where you can use them in your classpath when compiling

  1. Download Hadoop (these instructions use version 2.7.3):
    1. Go to http://www.apache.org/dyn/closer.cgi/hadoop/common/
    2. Choose hadoop-2.7.3.tar.gz
  2. Unzip and untar the file (see the example command after this list).
    1. Instructions for Windows here.
  3. Move the whole hadoop directory to the top level of the directory you are using for the course.
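
For step 2 on Linux, OS X, or Cygwin, a single command handles both the unzip and the untar:

tar xzvf hadoop-2.7.3.tar.gz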

Create a directory for your FlightsByCarrier code

  1. Create a directory below your course directory to hold your code, e.g. FlightsByCarrier.

Get the OpenCSV jar

The FlightsByCarrier program uses OpenCSV to read the flight data CSV files.

  1. Download the latest version of OpenCSV (these instructions use 3.8):
    1. Go to https://sourceforge.net/projects/opencsv/?source=typ_redirect
  2. Unzip the opencsv-3.8.jar file to a folder called opencsv-3.8 (see the example command after this list).
  3. Move the whole com subdirectory to the top level of your FlightsByCarrier directory.
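
For step 2, a jar file is just a zip archive, so an ordinary unzip works:

unzip opencsv-3.8.jar -d opencsv-3.8

This leaves the com subdirectory (containing the OpenCSV classes) inside opencsv-3.8, ready to move in step 3.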

Edit and compile your Java code

  1. Change to your FlightsByCarrier directory
  2. Write your code
    1. Changes to the code in the textbook (a sketch of the full program with these changes applied appears after this list):
      1. Leave out all @@# comments; your code will not compile with them.
      2. FlightsByCarrier driver application:

        Replace Job job = new Job(); with Job job = Job.getInstance();
      3. FlightsByCarrier mapper code:

        Replace import au.com.bytecode.opencsv.CSVParser; with import com.opencsv.CSVParser;
  3. Compile your code with the correct version of Java, and with the Hadoop, and OpenCSV libraries:

    This version is split across multiple lines for readability. It will not work if you copy and paste it.

    javac -target 1.7 -source 1.7 -classpath ../hadoop-2.7.3/share/hadoop/yarn/*:
    ../hadoop-2.7.3/share/hadoop/yarn/lib/*:../hadoop-2.7.3/share/hadoop/mapreduce/*:
    ../hadoop-2.7.3/share/hadoop/mapreduce/lib/*:../hadoop-2.7.3/share/hadoop/common/*:
    . *.java

    Single line version to copy and paste:

    javac -target 1.7 -source 1.7 -classpath ../hadoop-2.7.3/share/hadoop/yarn/*:../hadoop-2.7.3/share/hadoop/yarn/lib/*:../hadoop-2.7.3/share/hadoop/mapreduce/*:../hadoop-2.7.3/share/hadoop/mapreduce/lib/*:../hadoop-2.7.3/share/hadoop/common/*:. *.java
  4. Make a jar file with all the classes:

    jar cfv FlightsByCarrier.jar com FlightsByCarrier*.class
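
For reference, here is a minimal sketch of the whole program with both textbook changes applied. This is an illustration, not the textbook's exact code: the mapper and reducer names, the header-skipping check, and the use of column 8 (UniqueCarrier) as the carrier code are assumptions about the flight data, so adjust them to match your book and your CSV files.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import com.opencsv.CSVParser;  // the new package, replacing au.com.bytecode.opencsv

public class FlightsByCarrier {

    public static class FlightsMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final CSVParser parser = new CSVParser();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (key.get() > 0) {  // skip the header line at byte offset 0
                String[] fields = parser.parseLine(value.toString());
                // Column 8 (UniqueCarrier) holds the carrier code in the
                // airline on-time data; emit (carrier, 1) for each flight.
                context.write(new Text(fields[8]), new IntWritable(1));
            }
        }
    }

    public static class FlightsReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the 1s emitted by the mapper to count flights per carrier.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // Job.getInstance() replaces the deprecated new Job() constructor.
        Job job = Job.getInstance();
        job.setJarByClass(FlightsByCarrier.class);
        job.setJobName("FlightsByCarrier");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(FlightsMapper.class);
        job.setReducerClass(FlightsReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Because the mapper and reducer are nested classes, they compile to FlightsByCarrier$*.class files, which the FlightsByCarrier*.class pattern in the jar command above picks up.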

(Optional) Install AWS Command Line Interface

The AWS CLI makes it easier to upload and download multiple files, and lets you create and run clusters without having to use the AWS web interface.

To install it, follow the AWS CLI installation instructions linked in the AWS Command Line Tools section above.

Upload your jar and data files (Web Interface)

  1. Log in to AWS
  2. Go to S3
  3. Upload FlightsByCarrier.jar to your bucket
  4. Create a folder flight-data
  5. Upload your CSV file to the flight-data folder

Upload your jar and data files (CLI Interface)

Note: Adjust bucket names to correspond to your own buckets.

  1. Upload FlightsByCarrier.jar to your bucket:

    aws s3 cp FlightsByCarrier.jar s3://kwurst-cs-383/

  2. Create a folder flight-data and upload your CSV file to the flight-data folder:

    aws s3 cp 1987.csv.bz2 s3://kwurst-cs-383/flight-data/
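
You can confirm that both uploads landed where you expect with a recursive listing (again, substituting your own bucket name):

aws s3 ls s3://kwurst-cs-383/ --recursive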

Create a cluster and run your job (Web Interface)

Note: Adjust bucket names to correspond to your own buckets.

  1. Go to EMR
  2. Click Create Cluster
  3. Click on Go to advanced options
  4. Step 1: Software and Steps:
    1. Uncheck all tools except Hadoop 2.7.3
    2. Step type: Custom jar
    3. Click on Configure
      1. Name: FlightsByCarrier
      2. JAR Location: select the jar you uploaded
      3. Arguments: FlightsByCarrier s3://kwurst-cs-383/flight-data s3://kwurst-cs-383/flight-output

        Note: the output folder (flight-output) must not already exist or your cluster will fail. Either delete any previous folder with that name, or use a new name.
      4. Action on failure: Terminate Cluster
      5. Click on Add
    4. Click on Next
  5. Step 2: Hardware:
    1. Click on Next
  6. Step 3: General Cluster Settings:
    1. Cluster name: kwurst-FlightsByCarrier
    2. Uncheck Debugging
    3. Uncheck Termination protection
    4. Scale down behavior: Terminate at task completion
    5. Click on Next
  7. Step 4: Security:
  8. Click on Create Cluster

Create a cluster and run your job (CLI Interface)

Note: Adjust bucket names to correspond to your own buckets.

Note: the output folder (flight-output) must not already exist or your cluster will fail. Either delete any previous folder with that name, or use a new name.

This command was generated by clicking the AWS CLI export button on the cluster page for the cluster created above. The subnet ID, security group IDs, and log bucket it contains are specific to the account that generated it; yours will differ.

This version is split across multiple lines for readability. It will not work if you copy and paste it.

aws emr create-cluster --applications Name=Hadoop --ec2-attributes
'{"InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-0330be74",
"EmrManagedSlaveSecurityGroup":"sg-2620a442","EmrManagedMasterSecurityGroup":"sg-2520a441"}' 
--release-label emr-5.3.0 --log-uri 's3n://aws-logs-318063560001-us-east-1/elasticmapreduce/' 
--steps '[{"Args":["FlightsByCarrier","s3://kwurst-cs-383/flight-data",
"s3://kwurst-cs-383/flightoutput"],"Type":"CUSTOM_JAR","ActionOnFailure":"TERMINATE_CLUSTER",
"Jar":"s3://kwurst-cs-383/FlightsByCarrier.jar","Properties":"","Name":"FlightsByCarrier"}]' 
--instance-groups '[{"InstanceCount":2,"InstanceGroupType":"CORE","InstanceType":"m3.xlarge",
"Name":"Core - 2"},{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"m3.xlarge",
"Name":"Master - 1"}]' --auto-terminate --auto-scaling-role EMR_AutoScaling_DefaultRole 
--service-role EMR_DefaultRole --name 'kwurst-FlightsByCarrier' 
--scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region us-east-1

Single line version to copy and paste:

aws emr create-cluster --applications Name=Hadoop --ec2-attributes '{"InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-0330be74","EmrManagedSlaveSecurityGroup":"sg-2620a442","EmrManagedMasterSecurityGroup":"sg-2520a441"}' --release-label emr-5.3.0 --log-uri 's3n://aws-logs-318063560001-us-east-1/elasticmapreduce/' --steps '[{"Args":["FlightsByCarrier","s3://kwurst-cs-383/flight-data","s3://kwurst-cs-383/flight-output"],"Type":"CUSTOM_JAR","ActionOnFailure":"TERMINATE_CLUSTER","Jar":"s3://kwurst-cs-383/FlightsByCarrier.jar","Properties":"","Name":"FlightsByCarrier"}]' --instance-groups '[{"InstanceCount":2,"InstanceGroupType":"CORE","InstanceType":"m3.xlarge","Name":"Core - 2"},{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"m3.xlarge","Name":"Master - 1"}]' --auto-terminate --auto-scaling-role EMR_AutoScaling_DefaultRole --service-role EMR_DefaultRole --name 'kwurst-FlightsByCarrier' --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region us-east-1
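
create-cluster prints the new cluster's id (something like j-1ABCDEFGH2IJK). If you want to watch progress from the CLI instead of the web console, these standard EMR commands should work (the cluster id below is a placeholder):

aws emr list-clusters --active

aws emr describe-cluster --cluster-id j-1ABCDEFGH2IJK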

Download Results (Web Interface)

  1. Log in to AWS
  2. Go to S3
  3. Go to the folder flight-output
  4. Download each part-r-##### file individually

Download Results (CLI Interface)

Note: Adjust bucket names to correspond to your own buckets.

aws s3 cp s3://kwurst-cs-383/flight-output flight-output --recursive

Fixing Connection Timeout Errors When Connecting to EMR Cluster with SSH

  1. Log in to AWS.
  2. Go to EC2.
  3. Choose Security Groups from the lefthand menu.
  4. Select the group with Group Name ElasticMapReduce-master.
  5. In the bottom portion of the screen, select the Inbound tab.
  6. Click Edit.
  7. Scroll down and click Add Rule.
  8. Set Type to SSH.
  9. Enter 22 for Port Range.
  10. For Source, enter 0.0.0.0/0 (this allows SSH from any address; you can restrict it to your own IP for better security).
  11. Click Save.
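
Once the rule is saved, SSH to the master node should connect. EMR master nodes accept logins as the hadoop user; the key file name and public DNS name below are placeholders for your own:

ssh -i ~/mykeypair.pem hadoop@ec2-54-123-45-67.compute-1.amazonaws.com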