Category Archives: wsucs

Finding Documentation Becomes a Pain

Each time you set up software you may encounter issues or complications that make the task much more difficult. To counteract them, one can usually find comprehensive documentation describing the problem and how to fix it. Perhaps the worst experience when trying to deploy software is having trouble finding any documentation in the first place.

Installing (or attempting to install) Hadoop this semester has given me a greater appreciation for documentation and complete step-by-step instructions. Looking through the Hadoop documentation, I noticed they have recently updated the software and describe some of the advantages of the newer version. However, the install instructions they provide seem to be for one of the original versions, and searching for other install instructions hasn’t yielded much help either. I’m sure most of you reading this blog can understand the frustration associated with that. Finding the correct way to set up the base system so the different parts of Hadoop work has proven to be anything but simple.

At this point the strategy is to keep looking for any documentation that might be useful, piece together the fragments I’ve found from different sources to see if something works, and consider contacting people who have set up the new system before. Following this plan, and devoting the necessary time to the project, should get results before the end of the semester.

At this point, my understanding of Hadoop is slowly growing and as I do certain things, I continue to update my own documentation which hopefully can be used to redeploy the system if necessary in the future.

From the blog WSU Comp. Sci. Servers » wsucs by dillonmurphy10 and used with permission of the author. All other rights reserved by the author.

Semester Update (24 November 2013)

With Thanksgiving week approaching and classes nearing their end, here is an update on where I stand in the current independent study on moving and setting up servers, and where I plan to go from here.

I’ll start by covering some of the tasks we have accomplished with the computer science servers.

  1. We have successfully installed and set up a new server for computer science
    1. Running on a newer version of VMware
    2. Server installed on CentOS 6.4
    3. Used to host the computer science wiki and blog
  2. Moved over the computer science blog and plugins
  3. Changed IPs for machines
  4. Set up the Git server (with Chad)
  5. Installed base systems on the computer science cluster

Although this list appears to be in a good place, there is still a good amount of work to be done in the time we have left this semester. Setting up the base system on the machines in the computer science server room was a good start to building a Hadoop cluster, but the real work lies in setting up the cluster itself and learning how it operates. After installing the cluster on the host machines, I plan to look into how faculty can use the cluster for examples in class, as well as some good assignment ideas, but that will come after the cluster itself is set up. The remaining list, in order, looks something like this:

  1. Present on the current status of the independent study
  2. Install the Hadoop cluster on the machines
    1. Write and update documentation of the install
  3. Test for any conflicts with the Eucalyptus cluster (with Chad)
  4. Test and develop a practical understanding of the Hadoop cluster
  5. Develop examples/assignments professors can use with the Hadoop cluster

The first item on the list is to present what I’ve been doing this semester at the event the computer science department is holding to go over independent studies and internships (3 December 2013, ST-107). This is a good time to share with other students and faculty what’s been going on this semester and what plans I have for continuing the work.

It may be a bit difficult to finish the planned material before the end of finals (19 December 2013), but I will be continuing to work until the work is complete. More updates will continue as tasks on the list are completed.

From the blog WSU Comp. Sci. Servers » wsucs by dillonmurphy10 and used with permission of the author. All other rights reserved by the author.

What is Hadoop and Why are we Interested

Recently the computer science department at Worcester State has chosen to add two tracks to the major. One track is software development, which is similar to the current single track, and the other is Big Data Analytics. Because of the new track system, courses offered had to be changed which meant a change in a few course materials. Part of my independent study this semester has been to implement a Hadoop cluster, which is a tool that can be used for data analytics.

Part of big data analytics is dealing with very large sets of data. Often, especially when we think of companies like Google or Amazon, it becomes evident that a single machine, or even a handful of machines, won’t get the job done. This is where distributing tasks across a cluster of machines becomes much more efficient. One software platform designed for this job is Apache Hadoop, which is installed on a cluster of machines and handles large data sets, and the jobs associated with them, much more efficiently.

From the Hadoop website (hadoop.apache.org):

“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”
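
The “simple programming models” the quote refers to are, above all, MapReduce. As a purely illustrative sketch (plain Python, not the Hadoop API), here is the classic word-count job expressed as the map, shuffle, and reduce steps Hadoop performs across a cluster:

```python
from collections import defaultdict
from itertools import chain

# Map step: each "mapper" turns one line of input into (word, 1) pairs.
def map_line(line):
    return [(word.lower(), 1) for word in line.split()]

# Shuffle step: group pairs by key, as Hadoop does between map and reduce.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce step: each "reducer" sums the counts for one word.
def reduce_counts(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "data wants to be free"]
counts = reduce_counts(shuffle(chain.from_iterable(map_line(l) for l in lines)))
# counts["big"] == 2, counts["data"] == 2
```

On a real cluster, the map and reduce functions run in parallel on many nodes and the framework handles the shuffle and any node failures; the logic per step, though, is this simple.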

In this independent study, I am looking to implement a cluster running Hadoop and to test some applications of it. By the end of the semester, the goal is to have a fully functioning cluster that can be used in the courses dealing with big data analytics, and a complete set of install instructions that can be used to rebuild the Hadoop cluster at Worcester State if something goes wrong (something which is currently tough to come by).

Keep checking for updates about the install, setup, and use of Hadoop here at Worcester State University.

From the blog WSU Comp. Sci. Servers » wsucs by dillonmurphy10 and used with permission of the author. All other rights reserved by the author.

Learning the Hadoop Install

After discussing with Chad Day the possible solutions for setting up both a Hadoop cluster and a Eucalyptus cluster on the same set of machines, we have concluded we should first attempt to run the services side by side. We are planning on using CentOS 6.4 as the operating system on the machines, and then installing the services we need for the nodes. The nine machines we have currently carry operating systems and packages left over from the 401 class that focused on working with Eucalyptus. Because of this, Chad and I will install the latest version of CentOS, 6.4, on the machines and then install the necessary packages for the servers.

Before attempting to install the Hadoop software on the CS machines, I am planning to set up a small environment of Hadoop servers on my personal machine using VirtualBox. This gives me a little more flexibility to play with packages and will be easier to access as I am learning the install process and how the head node and data nodes work together. Once I am confident with the install process on the virtual machines also running CentOS 6.4, I will install the packages on the machines that will have both Hadoop and Eucalyptus.

As for installing the base operating system on the nine machines, Chad and I will install CentOS this week, hoping to have the base system on all the machines by the end of the upcoming week. My plan is to prepare a couple of install media, either CDs or USB drives with the CentOS image, to help the install go a little quicker. After all the machines are set up and I’ve experimented in the virtual environment, I will be ready to install the cluster that will be used in the computer science department.

From the blog WSU Comp. Sci. Servers » wsucs by dillonmurphy10 and used with permission of the author. All other rights reserved by the author.

Successful Move of the Blog and Wiki

I am pleased to report a seamless move to the new wiki and blog server for the Worcester State University Computer Science Department! After collaborating with Dr. Karl R. Wurst and Chad Day, we moved the IP of the old wiki and blog server to the new server and assigned another IP to the old one. After running tests on both the wiki and the blog, cs.worcester.edu/wiki and cs.worcester.edu/blog are working as expected.

At this point in the semester, the blog and wiki have been migrated to an updated operating system. However, we are still looking into ways to secure the wiki from potential spammers. One option, which Chad will look into, is to modify the code to limit email registration to the worcester.edu domain. The git server is up and running, and the CS department is in the process of testing it for use in classes, both for homework submission and as a good way to teach version control early in the major.
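
The registration restriction itself is a simple domain check; as a sketch (hypothetical function name, separate from whatever hook the wiki software actually exposes), the rule we would want to enforce looks like this:

```python
import re

ALLOWED_DOMAIN = "worcester.edu"

def registration_allowed(email: str) -> bool:
    """Accept only addresses under worcester.edu (including subdomains)."""
    match = re.fullmatch(r"[^@\s]+@([^@\s]+)", email.strip())
    if not match:
        return False  # not a well-formed address at all
    domain = match.group(1).lower()
    return domain == ALLOWED_DOMAIN or domain.endswith("." + ALLOWED_DOMAIN)

assert registration_allowed("student@worcester.edu")
assert not registration_allowed("spammer@example.com")
```

The real change would wire a check like this into the wiki’s registration path; the logic above is only the policy itself.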

Looking forward, Chad and I will be building two clusters with the Computer Science machines, a Eucalyptus cluster and a Hadoop cluster, to be used throughout the semester. Before we install the software on the machines, we must investigate the best solution for sharing hardware. We currently have nine machines at our disposal, and we must partition them between the clusters. The possibilities we have investigated so far include:

  1. Splitting the physical allotment in half –  4 machines for each cluster
  2. Installing the services alongside each other – each cluster shares the nine machines and their resources
  3. Installing a virtual host on each machine and running two virtual machines per host – each cluster would have 9 machines and share resources, but the clusters would remain independent of each other

We should have a strategy and start formatting the host machines by the end of next week, followed by another update.

From the blog WSU Comp. Sci. Servers » wsucs by dillonmurphy10 and used with permission of the author. All other rights reserved by the author.

Migration of cs.worcester.edu/blog

One of the tasks for this fall semester for the server migration/creation study was to move the Worcester State Computer Science blog to a new server alongside a new wiki platform. For both servers, WordPress will be used to manage the blogs, and all posts will be aggregated from other sites. In moving the blog to a new server the following steps were followed:

  1. Investigate the structure of the original server
    1. Web directory stored in a separate lampp folder
    2. MySQL server run from lampp folder
  2. Choose a structure for the new server
    1. Use a standard install of MySQL and Apache
    2. Set up an Apache configuration that will work when the IP is changed
    3. Create a MySQL database that matches the old server
    4. Create a MySQL user with the same credentials and permissions as on the old server
  3. Move the old web directory to the new server
  4. Move the database to the new server
    1. Make a database dump of the WordPress database
    2. Move dump to new server and restore to newly created database
  5. Check that web directory works as expected with migrated database settings
  6. Find admin and home URLs settings in database and update them through the database
    1. This allows you to use the admin page on the new server instead of redirecting to the old admin page
    2. Change any necessary settings on the new page
  7. Run a final test to make sure everything is up to date
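
Step 6 above amounts to updating two rows, siteurl and home, in WordPress’s wp_options table. As a sketch (assuming the default wp_ table prefix, which a real install may override), a small helper that generates the SQL:

```python
def wp_url_updates(new_url: str, prefix: str = "wp_") -> list:
    """Generate SQL to repoint a migrated WordPress install at its new address.

    WordPress reads the admin and front-page URLs from the 'siteurl' and
    'home' rows of its options table; updating both avoids redirects back
    to the old server.
    """
    return [
        f"UPDATE {prefix}options SET option_value = '{new_url}' "
        f"WHERE option_name = '{name}';"
        for name in ("siteurl", "home")
    ]

for stmt in wp_url_updates("http://cs.worcester.edu/blog"):
    print(stmt)
```

Running the generated statements against the restored database is what lets the new server’s admin page work instead of redirecting to the old one.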

Following the steps above, the migration was straightforward, and the server was about ready for an IP change. The only issue was that updates to the pages (e.g. Order of the Rubber Duck) did not carry over. After investigating the database for possible URL conflicts, a few URL settings were changed, and as soon as WordPress ran a scheduled check, the pages were restored.

A final check was made with the blogs; navigation and aggregation are all set, and a scheduled update of the IP address to match cs.worcester.edu will occur tomorrow, October 25, 2013. After the IP has been migrated, a thorough check will be made, and if everything is a success, the new server will be used for the blogs.

From the blog WSU Comp. Sci. Servers » wsucs by dillonmurphy10 and used with permission of the author. All other rights reserved by the author.

Blogging for Updates

Attending my final semester at Worcester State University in computer science has given me a great opportunity to be in an independent study where I can work on something that will benefit the program. Throughout the semester I will move the current CS blog from an outdated operating system to the latest CentOS, help set up a Git server using GitLab, and research and implement a Hadoop cluster.

The hope this semester is that the work being done will be used in the computer science department and will help courses offered in the future. Writing solid documentation throughout the semester will make maintaining, and possibly duplicating, the results as easy as possible. To keep the computer science department in the loop, I will be blogging weekly with the current status of the project, problems encountered, interesting bits of information I’ve come across, and research finds.

This semester allows me to stay active in the computer science department while finishing up my final liberal arts requirements. It should be both a fun and productive semester, and in the end there should be some solid results.

From the blog WSU Comp. Sci. Servers » wsucs by dillonmurphy10 and used with permission of the author. All other rights reserved by the author.

Week of 30 April 2012

This week the semester officially wound down. We went over what is expected of our final presentations, as well as last-minute work to be done. I am so glad we finally have images running on the cluster. This has been a neat experience, about as close to working on a real-life development team as you get in school.

The open source movement is good for innovation and software development. Everyone’s work is transparent, so there is more accountability. They say that open source projects have fewer defects per line of code than proprietary software. Having more eyes on the code makes for a better product.

That is one reason Ubuntu is often said to have little need for antivirus software: so many people have reviewed the code that security flaws have largely been weeded out.

There is a serious competitor to Eucalyptus on the horizon called OpenStack. It is completely open source. It is behind Eucalyptus in features and reliability, but it is catching up fast.

One thing I’ve learned this semester: no matter what, you are not going to be an expert in every language. What you need to do is find out what the requirements are for a particular job, then become an expert in those areas. You can take classes until the cows come home, but in the end you are responsible for your own learning. The only way to become a good coder is, you guessed it, to do a lot of coding!

From the blog danspc.net Blog » wsucs by danspc.net Blog » wsucs and used with permission of the author. All other rights reserved by the author.

Week of 23 April 2012

This has been a week of “finishing up”; putting a cap on the semester.

Brian, Nadia, and I have been working on a Eucalyptus Architecture Overview. I have been working on Walrus, Nadia on the Storage Controller and Machine Images, and Brian on the Cloud Controller. Nadia is pretty much done with the Storage Controller; I am done with Walrus. In addition, my glossary is complete, at least for someone with a rudimentary knowledge of networking and IT infrastructure.

We discussed in class with Karl what format the folks at Eucalyptus would like our research delivered in. Apparently they are working on their own wiki that we will be able to add our content to once it is up.

This semester has been a real “jumping in the pool” for me. In some of my previous classes we have sort of dipped our toes in. This has been a cannonball. We had to learn how to make all facets of a project work, from environment to code to infrastructure.

I particularly liked the exercises in the open source textbook. The exercises on Freeciv were the most illuminating. We had to make sure we had all the Unix add-ons and that they worked correctly, and then do a full build. This showed us some of the real-world challenges in compiling and running a large software project.

From the blog danspc.net Blog » wsucs by danspc.net Blog » wsucs and used with permission of the author. All other rights reserved by the author.

Week of 16 April 2012

The Walrus overview I am working on is taking shape. At first I had a large volume of unorganized facts about Walrus and was uncertain how to present them. Upon reviewing the data, I managed to group the information into some main categories: access and security, buckets and data storage, VM manipulation, and object version control.

Here is what I have so far:

Walrus Architecture Overview Dan Adams

Walrus is the component of Eucalyptus that allows you to store your data in the cloud. You can store your data as objects or buckets, which are collections of objects. These data stores are secure, and you as the administrator decide who has access to the data and what privileges they have (read, write, etc.).
Walrus provides three types of functionality:
• Users that have access to EUCALYPTUS can use Walrus to stream data into/out of the cloud as well as from instances that they have started on nodes.

• In addition, Walrus acts as a storage service for VM images. Root filesystem as well as kernel and ramdisk images used to instantiate VMs on nodes can be uploaded to Walrus and accessed from nodes. Walrus is a put/get storage service.

• Walrus is also used to store snapshots of VMs for easy restoration and data protection

Accessing the Cloud
Access to Walrus and other components of Eucalyptus is accomplished through a pair of keys, both alphanumeric strings. One is a public key that identifies the user; the other is a secret key used to validate requests. Eucalyptus shares users’ credentials with the Cloud Controller’s user database.

The model of Walrus is similar in many ways to the Amazon S3 service. Both store collections of objects in buckets. Both Walrus and S3 are accessed using a pair of keys, one public and one private. Amazon, however, requires 20- and 40-character strings for the public and private keys, respectively, while the Eucalyptus keys are much longer.
Once the user is authenticated, interaction can take place via a web interface or the command line.
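
Since Walrus is interface-compatible with S3, request signing presumably follows S3’s scheme: the secret key signs a canonical description of each request, and the public access key identifies the caller. A simplified sketch (AWS signature version 2, with no x-amz headers; the key values below are made up):

```python
import base64
import hmac
from hashlib import sha1

def sign_request(secret_key: str, verb: str, date: str, resource: str,
                 content_md5: str = "", content_type: str = "") -> str:
    """Build an S3 v2-style request signature (simplified sketch)."""
    # The canonical "string to sign" for S3 v2: verb, MD5, type, date, resource.
    string_to_sign = "\n".join([verb, content_md5, content_type, date, resource])
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(), sha1).digest()
    return base64.b64encode(digest).decode()

# The Authorization header pairs the public access key with the signature.
access_key = "AKIAEXAMPLEEXAMPLE00"                       # 20-char public key (made up)
secret_key = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"   # 40-char secret (made up)
sig = sign_request(secret_key, "GET", "Tue, 27 Mar 2012 19:36:42 +0000",
                   "/mybucket/photo.jpg")
auth_header = f"AWS {access_key}:{sig}"
```

Because only the holder of the secret key can produce a valid signature, the server can authenticate the request without the secret ever crossing the wire.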

Requests to read/write storage can be made via the Amazon S3 Curl tool or through the Eucalyptus web interface. Interactions with the service are either authenticated or anonymous. Buckets and objects are owned by individual accounts. To share resources, you must grant permissions as an administrator.

Walrus can use standard web services technologies like Axis and Mule, and is interface-compatible with S3. It implements the REST (query) interface via HTTP, as well as the SOAP interface.

Walrus is accessible to end users, whether they are running a client outside the cloud or a virtual machine instance inside the cloud.
You can use Euca2ools commands to put or get data from Walrus, or one of the standard S3 tools. Examples of these are S3 Curl, s3cmd, and s3fs. s3fs allows users to access S3 buckets as local directories.

Buckets and Data Storage via Walrus
Create a bucket → add an object to a bucket → view/move/delete the object.

Objects can be grouped into folders. Folders can, in turn, be grouped into other folders. Objects can be public or private, with specific rights given to different users. Bucket names need to be unique within the individual cloud; a good naming convention is to start with the name of your group or department. Think of a bucket as analogous to a folder on a Windows system. Bucket storage primarily holds machine images and machine snapshots. The maximum size of a Walrus bucket is 5GB. Walrus is a file-level storage system, as compared to the block-level storage system of the Storage Controller.
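
As a toy illustration of the create → add → view/move/delete workflow above (an in-memory model, not the Walrus API):

```python
class Cloud:
    """Toy in-memory model of buckets and objects (illustration only)."""

    def __init__(self):
        self.buckets = {}

    def create_bucket(self, name):
        if name in self.buckets:            # bucket names are unique per cloud
            raise ValueError(f"bucket {name!r} already exists")
        self.buckets[name] = {}

    def put_object(self, bucket, key, data):
        self.buckets[bucket][key] = data

    def move_object(self, src_bucket, key, dst_bucket):
        self.buckets[dst_bucket][key] = self.buckets[src_bucket].pop(key)

    def delete_object(self, bucket, key):
        del self.buckets[bucket][key]

cloud = Cloud()
cloud.create_bucket("cs-dept-images")       # name starts with the department
cloud.put_object("cs-dept-images", "centos64.img", b"...")
```

The real service adds per-object access rights and the 5GB bucket limit on top of this basic shape.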

Virtual Machine Image Manipulation via Walrus
Walrus also serves as a VM image storage and management service. Root file system images, kernel images, and ramdisk images can be uploaded via the Walrus service and will then be accessible from the different nodes. Images can be compressed, encrypted using user keys, and split into multiple parts. These parts are described in an image description file, sometimes called the manifest.
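
The split-and-manifest idea can be sketched as follows (the real Eucalyptus manifest is an XML file that also records encryption details; the dictionary format here is invented for illustration):

```python
import hashlib

def split_image(data: bytes, part_size: int):
    """Split an image into parts and build a manifest describing them."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    manifest = {
        "size": len(data),
        "digest": hashlib.md5(data).hexdigest(),   # lets a reader verify reassembly
        "parts": [{"index": i, "size": len(p)} for i, p in enumerate(parts)],
    }
    return parts, manifest

def reassemble(parts, manifest):
    image = b"".join(parts)
    assert hashlib.md5(image).hexdigest() == manifest["digest"]
    return image

# A 10 MB image split into 1 MiB parts: nine full parts plus a remainder.
parts, manifest = split_image(b"\x00" * 10_000_000, part_size=1_048_576)
```

Splitting lets very large images be uploaded and transferred piecewise, with the manifest telling the receiving node how to put them back together.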

When a node controller (NC) needs an image from Walrus before starting it on a node, it transmits an image download request. The request is authenticated with an internal set of credentials. Walrus then verifies and decrypts the images the users have uploaded, and transfers them to the proper directory. Walrus supports parallel and serial data transfers. Here is a sample write from a Walrus log:
[Fri April 13 04:21:21 2012][001283][EUCADEBUG ] walrus_request(): wrote 5242880000 bytes in 3421570 writes
[Fri April 13 04:21:21 2012][001283][EUCAINFO ] walrus_request(): saved image in /var/lib/eucalyptus/instances//admin/i-515C08DF/disk

Image instances need to be on the same subnet as Walrus.

Snapshot Storage in Walrus

The volumes that are created with the Storage Controller can be the basis of point-in-time snapshots that are stored on Walrus. The snapshots can, in turn, be used to create volumes if needed.

Snapshots are created using the euca-create-snapshot command:
uecadmin@client1:~$ euca-create-snapshot vol-333C04B8

The euca-describe-snapshots command lists available snapshots:
uecadmin@client1:~$ euca-describe-snapshots
SNAPSHOT snap-32A804A2 vol-333C04B8 completed 2010-04-15T13:48:32.01Z 100%

Volumes can be created from these snapshots by using the snapshot ID:
uecadmin@client1:~$ euca-create-volume -s 10 --snapshot snap-32A804A2 --zone mycloud
A volume can only be created after the snapshot status is completed. Snapshots of a volume can only be created after the volume status is created, not during the creating stage.

To delete a snapshot:

uecadmin@client1:~$ euca-delete-snapshot snap-32A804A2

Object Version Control in Walrus
Walrus doesn’t provide write-locking for object writes. Users are, however, guaranteed that a consistent copy of the object will be saved even if there are concurrent writes to the same object. If a write to an object arrives while a previous write to the same object is in progress, the previous write is invalidated. Walrus responds with the MD5 checksum of the object that was stored. Once a request has been verified (the user authenticated as a valid EUCALYPTUS user and checked against the access control lists for the requested object), writes and reads are streamed over HTTP.
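
This last-writer-wins behavior can be modeled in a few lines (an illustrative simulation, not Walrus code):

```python
import hashlib

class ObjectStore:
    """Models Walrus's no-write-lock policy: racing writes are not locked,
    the last write to complete wins, and each write is acknowledged with
    the MD5 checksum of the bytes that were actually stored."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data: bytes) -> str:
        # No locking: a later write simply replaces (invalidates) an earlier
        # one, but readers always see some complete, consistent copy.
        self.objects[key] = data
        return hashlib.md5(data).hexdigest()

store = ObjectStore()
etag_a = store.put("report", b"first draft")
etag_b = store.put("report", b"second draft")   # overlapping write wins
assert store.objects["report"] == b"second draft"
```

The returned checksum lets the client confirm which version of the object the server ended up keeping.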

To increase performance, and because VM images can be very big, Walrus keeps a repository of images that have already been decrypted. Cache invalidation happens when an image manifest is overwritten.

Troubleshooting
Walrus provides some good clues for solving problems in its log files. Particularly useful logs are walrus-stats, walrus-digest, and registration.log.

Problems with SSH keypairs commonly cause Walrus errors. Verify your credentials.

From the blog danspc.net Blog » wsucs by danspc.net Blog » wsucs and used with permission of the author. All other rights reserved by the author.