Amazon Web Services (AWS)

From Worcester State University Computer Science Department Wiki

AWS Command Line Tools

You may want to install the AWS command line tools, which will allow you to (among other things) download files in bulk from S3.

If you have problems getting it to work under Windows, you may want to install it under Cygwin: http://www.thecoatlessprofessor.com/programming/installing-amazon-web-services-command-line-interface-aws-cli-for-windows-os-x-and-linux-2

Configuring AWS CLI

Once you get it installed, run:

aws configure

You will have to enter your:

  AWS Access Key ID:
  AWS Secret Access Key:
  Default region name: us-east-1
  Default output format: (doesn't matter)
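
A sample session might look like this (the key values are the placeholders from the AWS documentation, not real credentials):

$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-east-1
Default output format [None]: json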

Example AWS CLI Command

To upload a full directory of files to an S3 bucket:

aws s3 cp ghcnd_all/ s3://ghcnd-all/ --recursive

Elastic Map Reduce (Hadoop)

Running FlightsByCarrier on Amazon Web Services

Running a Java MapReduce program on Amazon Web Services Elastic MapReduce

These instructions assume that you are using a particular directory for all your MapReduce projects, e.g. CS-383.

Get the Hadoop jars into a local directory where you can use them in your classpath when compiling

  1. Download the latest version of Hadoop:
    1. Go to http://www.apache.org/dyn/closer.cgi/hadoop/common/
    2. Choose hadoop-2.7.3.tar.gz
  2. Unzip and untar the file.
    1. Instructions for Windows: https://wiki.haskell.org/How_to_unpack_a_tar_file_in_Windows
  3. Move the whole hadoop-2.7.3 directory to the top level of the directory you are using for the course.
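
On Mac or Linux, a single command both unzips and untars the archive (assuming it is in the current directory):

tar xzf hadoop-2.7.3.tar.gz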

Create a directory for your FlightsByCarrier code

  1. Create a directory below your course directory to hold your code, e.g. FlightsByCarrier

Get the OpenCSV jar

The FlightsByCarrier program uses OpenCSV to read the flight data CSV files.

  1. Download the latest version of OpenCSV:
    1. Go to https://sourceforge.net/projects/opencsv/?source=typ_redirect
  2. Unzip the opencsv-3.8.jar file to a folder called opencsv-3.8
  3. Move the whole com subdirectory to the top level of your FlightsByCarrier directory.
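
The jar tool can do the extraction, since jar files are zip archives:

mkdir opencsv-3.8
cd opencsv-3.8
jar xf ../opencsv-3.8.jar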

Edit and compile your Java code

  1. Change to your FlightsByCarrier directory
  2. Write your code
    1. Changes to the code in the textbook (a sketch with these changes applied appears after this list):
      1. Leave out all @@# comments; your code will not compile with them included.
      2. FlightsByCarrier driver application:

        Replace Job job = new Job(); with Job job = Job.getInstance();
      3. FlightsByCarrier mapper code:

        Replace import au.com.bytecode.opencsv.CSVparser; with import com.opencsv.CSVParser;
  3. Compile your code with the correct version of Java and with the Hadoop and OpenCSV libraries:

    This version is split across multiple lines for readability. It will not work if you copy and paste it.

    javac -target 1.7 -source 1.7 -classpath ../hadoop-2.7.3/share/hadoop/yarn/*:
    ../hadoop-2.7.3/share/hadoop/yarn/lib/*:../hadoop-2.7.3/share/hadoop/mapreduce/*:
    ../hadoop-2.7.3/share/hadoop/mapreduce/lib/*:../hadoop-2.7.3/share/hadoop/common/*:
    . *.java

    Single line version to copy and paste:

    javac -target 1.7 -source 1.7 -classpath ../hadoop-2.7.3/share/hadoop/yarn/*:../hadoop-2.7.3/share/hadoop/yarn/lib/*:../hadoop-2.7.3/share/hadoop/mapreduce/*:../hadoop-2.7.3/share/hadoop/mapreduce/lib/*:../hadoop-2.7.3/share/hadoop/common/*:. *.java
  4. Make a jar file with all the classes:

    jar cfv FlightsByCarrier.jar com FlightsByCarrier*.class
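
For reference, here is a minimal, self-contained sketch of a driver, mapper, and reducer with both changes from step 2 applied. It is a sketch, not the textbook's exact code: the nested-class layout, the header-line check, and the carrier-code column index (8, the UniqueCarrier field in the airline on-time data) are assumptions to adjust against your own code.

import java.io.IOException;

import com.opencsv.CSVParser;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlightsByCarrier {

    // Mapper: emits (carrier code, 1) for every flight record.
    public static class FlightsByCarrierMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        // com.opencsv.CSVParser, not the old au.com.bytecode package
        private final CSVParser parser = new CSVParser();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (key.get() > 0) {  // skip the header line at byte offset 0
                String[] fields = parser.parseLine(value.toString());
                // Field 8 is assumed to be the UniqueCarrier column
                context.write(new Text(fields[8]), new IntWritable(1));
            }
        }
    }

    // Reducer: sums the counts for each carrier.
    public static class FlightsByCarrierReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text carrier, Iterable<IntWritable> counts,
                Context context) throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable count : counts) {
                total += count.get();
            }
            context.write(carrier, new IntWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        // Job.getInstance() replaces the deprecated new Job() constructor
        Job job = Job.getInstance();
        job.setJarByClass(FlightsByCarrier.class);
        job.setJobName("FlightsByCarrier");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(FlightsByCarrierMapper.class);
        job.setReducerClass(FlightsByCarrierReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}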

(Optional) Install AWS Command Line Interface

The AWS CLI makes it easier to upload and download multiple files, and can create and run clusters without having to connect to the AWS web interface.

To install it, go to https://aws.amazon.com/cli/

Upload your jar and data files (Web Interface)

  1. Log in to AWS
  2. Go to S3
  3. Upload FlightsByCarrier.jar to your bucket
  4. Create a folder flight-data
  5. Upload your CSV file to the flight-data folder

Upload your jar and data files (CLI Interface)

Note: Adjust bucket names to correspond to your own buckets.

  1. Upload FlightsByCarrier.jar to your bucket:

    aws s3 cp FlightsByCarrier.jar s3://kwurst-cs-383/

  2. Create a folder flight-data and upload your CSV file to the flight-data folder:

    aws s3 cp 1987.csv.bz2 s3://kwurst-cs-383/flight-data/
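
To confirm that the files arrived, you can list the bucket contents:

aws s3 ls s3://kwurst-cs-383/ --recursive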

Create a cluster and run your job (Web Interface)

Note: Adjust bucket names to correspond to your own buckets.

  1. Go to EMR
  2. Click Create Cluster
  3. Click on Go to advanced options
  4. Step 1: Software and Steps:
    1. Uncheck all tools except Hadoop 2.7.3
    2. Step type: Custom jar
    3. Click on Configure
      1. Name: FlightsByCarrier
      2. JAR Location: select the jar you uploaded
      3. Arguments: FlightsByCarrier s3://kwurst-cs-383/flight-data s3://kwurst-cs-383/flight-output

        Note: the output bucket name must not exist or your cluster will fail. Either delete any previous bucket with that name, or use a new name.
      4. Action on failure: Terminate Cluster
      5. Click on Add
    4. Click on Next
  5. Step 2: Hardware:
    1. Click on Next
  6. Step 3: General Cluster Settings:
    1. Cluster name: kwurst-FlightsByCarrier
    2. Uncheck Debugging
    3. Uncheck Termination protection
    4. Scale down behavior: Terminate at task completion
    5. Click on Next
  7. Step 4: Security:
  8. Click on Create Cluster

Create a cluster and run your job (CLI Interface)

Note: Adjust bucket names to correspond to your own buckets.

Note: the output bucket name must not exist or your cluster will fail. Either delete any previous bucket with that name, or use a new name.

This was generated by clicking on the AWS CLI export button on the cluster page for the cluster created above.

This version is split across multiple lines for readability. It will not work if you copy and paste it.

aws emr create-cluster --applications Name=Hadoop --ec2-attributes
'{"InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-0330be74",
"EmrManagedSlaveSecurityGroup":"sg-2620a442","EmrManagedMasterSecurityGroup":"sg-2520a441"}' 
--release-label emr-5.3.0 --log-uri 's3n://aws-logs-318063560001-us-east-1/elasticmapreduce/' 
--steps '[{"Args":["FlightsByCarrier","s3://kwurst-cs-383/flight-data",
"s3://kwurst-cs-383/flightoutput"],"Type":"CUSTOM_JAR","ActionOnFailure":"TERMINATE_CLUSTER",
"Jar":"s3://kwurst-cs-383/FlightsByCarrier.jar","Properties":"","Name":"FlightsByCarrier"}]' 
--instance-groups '[{"InstanceCount":2,"InstanceGroupType":"CORE","InstanceType":"m3.xlarge",
"Name":"Core - 2"},{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"m3.xlarge",
"Name":"Master - 1"}]' --auto-terminate --auto-scaling-role EMR_AutoScaling_DefaultRole 
--service-role EMR_DefaultRole --name 'kwurst-FlightsByCarrier' 
--scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region us-east-1

Single line version to copy and paste:

aws emr create-cluster --applications Name=Hadoop --ec2-attributes '{"InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-0330be74","EmrManagedSlaveSecurityGroup":"sg-2620a442","EmrManagedMasterSecurityGroup":"sg-2520a441"}' --release-label emr-5.3.0 --log-uri 's3n://aws-logs-318063560001-us-east-1/elasticmapreduce/' --steps '[{"Args":["FlightsByCarrier","s3://kwurst-cs-383/flight-data","s3://kwurst-cs-383/flight-output"],"Type":"CUSTOM_JAR","ActionOnFailure":"TERMINATE_CLUSTER","Jar":"s3://kwurst-cs-383/FlightsByCarrier.jar","Properties":"","Name":"FlightsByCarrier"}]' --instance-groups '[{"InstanceCount":2,"InstanceGroupType":"CORE","InstanceType":"m3.xlarge","Name":"Core - 2"},{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"m3.xlarge","Name":"Master - 1"}]' --auto-terminate --auto-scaling-role EMR_AutoScaling_DefaultRole --service-role EMR_DefaultRole --name 'kwurst-FlightsByCarrier' --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region us-east-1
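
The create-cluster command prints the ID of the new cluster. You can then check on the cluster from the command line by substituting that ID (the j-XXXXXXXXXXXXX below is a placeholder):

aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX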

Download Results (Web Interface)

  1. Log in to AWS
  2. Go to S3
  3. Go to the folder flight-output (the output location from the job arguments)
  4. Download each part-r-##### file individually

Download Results (CLI Interface)

Note: Adjust bucket names to correspond to your own buckets.

aws s3 cp s3://kwurst-cs-383/flight-output flight-output --recursive

Fixing Connection Timeout Errors When Connecting to EMR Cluster with SSH

  1. Log in to AWS.
  2. Go to EC2.
  3. Choose Security Groups from the lefthand menu.
  4. Select the group with Group Name ElasticMapReduce-master.
  5. In the bottom portion of the screen, select the Inbound tab.
  6. Click Edit.
  7. Scroll down and click Add Rule.
  8. Set Type to SSH.
  9. Enter 22 for Port Range.
  10. For Source, add the IP: 0.0.0.0/0 (this allows SSH connections from any IP address)
  11. Click Save.
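
Once the rule is in place, you can connect to the master node over SSH. The default user on an EMR cluster is hadoop; the key file name and master public DNS name below are placeholders, so substitute your own (the DNS name is shown on the cluster's summary page):

ssh -i ~/your-key.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com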