Rendering HTML with SXML and GNU Guile

GNU Guile provides modules for working with XML documents using SXML, an
elegant way of writing XML as s-expressions that can be easily manipulated
in Scheme. Here’s an example:

(sxml->xml '(foo (bar (@ (attr "something")))))
<foo><bar attr="something" /></foo>

I don’t know about you, but I work with HTML documents much more often
than XML. Since HTML is very similar to XML, we should be able to
represent it with SXML, too!

(sxml->xml '(html
             (head
              (title "Hello, world!")
              (script (@ (src "foo.js"))))
             (body
              (h1 "Hello!"))))
<html>
  <head>
    <title>Hello, world!</title>
    <script src="foo.js" /> <!-- what? -->
  </head>
  <body>
    <h1>Hello!</h1>
  </body>
</html>

That <script> tag doesn’t look right! Script tags don’t close
themselves like that. Well, we could hack around it:

(sxml->xml '(html
             (head
              (title "Hello, world!")
              (script (@ (src "foo.js")) ""))
             (body
              (h1 "Hello!"))))
<html>
  <head>
    <title>Hello, world!</title>
    <script src="foo.js"></script>
  </head>
  <body>
    <h1>Hello!</h1>
  </body>
</html>

Note the use of the empty string in (script (@ (src "foo.js")) "").
The output looks correct now, great! But script isn’t the only element
that needs this treatment. We’d have to remember the empty-string hack
every time a non-void element — one that requires a closing tag, unlike
br or img — happens to be empty. That doesn’t sound very elegant.
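To see the distinction, compare a genuinely void element like br with
script (ordinary sxml->xml calls; I’ve written the outputs by hand rather
than pasting a session):

(sxml->xml '(br))
<br />

(sxml->xml '(script (@ (src "foo.js"))))
<script src="foo.js" />

The self-closing form is harmless for br, which really is void, but wrong
for script and every other element that requires an explicit closing tag.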

Furthermore, text isn’t even escaped properly!

(sxml->xml "Copyright © 2015  David Thompson <davet@gnu.org>")
Copyright © 2015  David Thompson &lt;davet@gnu.org&gt;

The < and > brackets were escaped, but © should’ve been
rendered as &copy;. Why does this fail, too? Is there a bug in
SXML?

There’s no bug. The improper rendering happens because HTML, while
similar to XML, has a number of different syntax rules. Instead of
using sxml->xml, we need a new procedure tailored to HTML syntax.
Introducing sxml->html:

(define* (sxml->html tree #:optional (port (current-output-port)))
  "Write the serialized HTML form of TREE to PORT."
  (match tree
    (() *unspecified*)
    (('doctype type)
     (doctype->html type port))
    ;; Unescaped, raw HTML output
    (('raw html)
     (display html port))
    (((? symbol? tag) ('@ attrs ...) body ...)
     (element->html tag attrs body port))
    (((? symbol? tag) body ...)
     (element->html tag '() body port))
    ((nodes ...)
     (for-each (cut sxml->html <> port) nodes))
    ((? string? text)
     (string->escaped-html text port))
    ;; Render arbitrary Scheme objects, too.
    (obj (object->escaped-html obj port))))

In addition to being aware of void elements and escape characters, it
can also render '(doctype "html") as <!DOCTYPE html>, or
render an unescaped HTML string using '(raw "frog &amp; toad").
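
For instance, here’s roughly how those pieces fit together (a quick
sketch; the output is what I’d expect rather than a pasted session, and
sxml->html doesn’t add any indentation of its own):

(sxml->html '((doctype "html")
              (html
               (body
                (p "5 < 7")
                (raw "<!-- kept as-is -->")))))
<!DOCTYPE html><html><body><p>5 &lt; 7</p><!-- kept as-is --></body></html>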

Here’s the full version of my (sxml html) module. It’s quite
brief, if you don’t count the ~250 lines of escape codes! This code
requires Guile 2.0.11 or greater.

Happy hacking!

(define-module (sxml html)
  #:use-module (sxml simple)
  #:use-module (srfi srfi-26)
  #:use-module (ice-9 match)
  #:use-module (ice-9 format)
  #:use-module (ice-9 hash-table)
  #:export (sxml->html))

(define %void-elements
  '(area
    base
    br
    col
    command
    embed
    hr
    img
    input
    keygen
    link
    meta
    param
    source
    track
    wbr))

(define (void-element? tag)
  "Return #t if TAG is a void element."
  (pair? (memq tag %void-elements)))

(define %escape-chars
  (alist->hash-table
   '((#\" . "quot")
     (#\& . "amp")
     (#\' . "apos")
     (#\< . "lt")
     (#\> . "gt")
     (#\¡ . "iexcl")
     (#\¢ . "cent")
     (#\£ . "pound")
     (#\¤ . "curren")
     (#\¥ . "yen")
     (#\¦ . "brvbar")
     (#\§ . "sect")
     (#\¨ . "uml")
     (#\© . "copy")
     (#\ª . "ordf")
     (#\« . "laquo")
     (#\¬ . "not")
     (#\® . "reg")
     (#\¯ . "macr")
     (#\° . "deg")
     (#\± . "plusmn")
     (#\² . "sup2")
     (#\³ . "sup3")
     (#\´ . "acute")
     (#\µ . "micro")
     (#\¶ . "para")
     (#\· . "middot")
     (#\¸ . "cedil")
     (#\¹ . "sup1")
     (#\º . "ordm")
     (#\» . "raquo")
     (#\¼ . "frac14")
     (#\½ . "frac12")
     (#\¾ . "frac34")
     (#\¿ . "iquest")
     (#\À . "Agrave")
     (#\Á . "Aacute")
     (#\Â . "Acirc")
     (#\Ã . "Atilde")
     (#\Ä . "Auml")
     (#\Å . "Aring")
     (#\Æ . "AElig")
     (#\Ç . "Ccedil")
     (#\È . "Egrave")
     (#\É . "Eacute")
     (#\Ê . "Ecirc")
     (#\Ë . "Euml")
     (#\Ì . "Igrave")
     (#\Í . "Iacute")
     (#\Î . "Icirc")
     (#\Ï . "Iuml")
     (#\Ð . "ETH")
     (#\Ñ . "Ntilde")
     (#\Ò . "Ograve")
     (#\Ó . "Oacute")
     (#\Ô . "Ocirc")
     (#\Õ . "Otilde")
     (#\Ö . "Ouml")
     (#\× . "times")
     (#\Ø . "Oslash")
     (#\Ù . "Ugrave")
     (#\Ú . "Uacute")
     (#\Û . "Ucirc")
     (#\Ü . "Uuml")
     (#\Ý . "Yacute")
     (#\Þ . "THORN")
     (#\ß . "szlig")
     (#\à . "agrave")
     (#\á . "aacute")
     (#\â . "acirc")
     (#\ã . "atilde")
     (#\ä . "auml")
     (#\å . "aring")
     (#\æ . "aelig")
     (#\ç . "ccedil")
     (#\è . "egrave")
     (#\é . "eacute")
     (#\ê . "ecirc")
     (#\ë . "euml")
     (#\ì . "igrave")
     (#\í . "iacute")
     (#\î . "icirc")
     (#\ï . "iuml")
     (#\ð . "eth")
     (#\ñ . "ntilde")
     (#\ò . "ograve")
     (#\ó . "oacute")
     (#\ô . "ocirc")
     (#\õ . "otilde")
     (#\ö . "ouml")
     (#\÷ . "divide")
     (#\ø . "oslash")
     (#\ù . "ugrave")
     (#\ú . "uacute")
     (#\û . "ucirc")
     (#\ü . "uuml")
     (#\ý . "yacute")
     (#\þ . "thorn")
     (#\ÿ . "yuml")
     (#\Œ . "OElig")
     (#\œ . "oelig")
     (#\Š . "Scaron")
     (#\š . "scaron")
     (#\Ÿ . "Yuml")
     (#\ƒ . "fnof")
     (#\ˆ . "circ")
     (#\˜ . "tilde")
     (#\Α . "Alpha")
     (#\Β . "Beta")
     (#\Γ . "Gamma")
     (#\Δ . "Delta")
     (#\Ε . "Epsilon")
     (#\Ζ . "Zeta")
     (#\Η . "Eta")
     (#\Θ . "Theta")
     (#\Ι . "Iota")
     (#\Κ . "Kappa")
     (#\Λ . "Lambda")
     (#\Μ . "Mu")
     (#\Ν . "Nu")
     (#\Ξ . "Xi")
     (#\Ο . "Omicron")
     (#\Π . "Pi")
     (#\Ρ . "Rho")
     (#\Σ . "Sigma")
     (#\Τ . "Tau")
     (#\Υ . "Upsilon")
     (#\Φ . "Phi")
     (#\Χ . "Chi")
     (#\Ψ . "Psi")
     (#\Ω . "Omega")
     (#\α . "alpha")
     (#\β . "beta")
     (#\γ . "gamma")
     (#\δ . "delta")
     (#\ε . "epsilon")
     (#\ζ . "zeta")
     (#\η . "eta")
     (#\θ . "theta")
     (#\ι . "iota")
     (#\κ . "kappa")
     (#\λ . "lambda")
     (#\μ . "mu")
     (#\ν . "nu")
     (#\ξ . "xi")
     (#\ο . "omicron")
     (#\π . "pi")
     (#\ρ . "rho")
     (#\ς . "sigmaf")
     (#\σ . "sigma")
     (#\τ . "tau")
     (#\υ . "upsilon")
     (#\φ . "phi")
     (#\χ . "chi")
     (#\ψ . "psi")
     (#\ω . "omega")
     (#\ϑ . "thetasym")
     (#\ϒ . "upsih")
     (#\ϖ . "piv")
     (#\x2002 . "ensp")
     (#\x2003 . "emsp")
     (#\x2009 . "thinsp")
     (#\– . "ndash")
     (#\— . "mdash")
     (#\‘ . "lsquo")
     (#\’ . "rsquo")
     (#\‚ . "sbquo")
     (#\“ . "ldquo")
     (#\” . "rdquo")
     (#\„ . "bdquo")
     (#\† . "dagger")
     (#\‡ . "Dagger")
     (#\• . "bull")
     (#\… . "hellip")
     (#\‰ . "permil")
     (#\′ . "prime")
     (#\″ . "Prime")
     (#\‹ . "lsaquo")
     (#\› . "rsaquo")
     (#\‾ . "oline")
     (#\⁄ . "frasl")
     (#\€ . "euro")
     (#\ℑ . "image")
     (#\℘ . "weierp")
     (#\ℜ . "real")
     (#\™ . "trade")
     (#\ℵ . "alefsym")
     (#\← . "larr")
     (#\↑ . "uarr")
     (#\→ . "rarr")
     (#\↓ . "darr")
     (#\↔ . "harr")
     (#\↵ . "crarr")
     (#\⇐ . "lArr")
     (#\⇑ . "uArr")
     (#\⇒ . "rArr")
     (#\⇓ . "dArr")
     (#\⇔ . "hArr")
     (#\∀ . "forall")
     (#\∂ . "part")
     (#\∃ . "exist")
     (#\∅ . "empty")
     (#\∇ . "nabla")
     (#\∈ . "isin")
     (#\∉ . "notin")
     (#\∋ . "ni")
     (#\∏ . "prod")
     (#\∑ . "sum")
     (#\− . "minus")
     (#\∗ . "lowast")
     (#\√ . "radic")
     (#\∝ . "prop")
     (#\∞ . "infin")
     (#\∠ . "ang")
     (#\∧ . "and")
     (#\∨ . "or")
     (#\∩ . "cap")
     (#\∪ . "cup")
     (#\∫ . "int")
     (#\∴ . "there4")
     (#\∼ . "sim")
     (#\≅ . "cong")
     (#\≈ . "asymp")
     (#\≠ . "ne")
     (#\≡ . "equiv")
     (#\≤ . "le")
     (#\≥ . "ge")
     (#\⊂ . "sub")
     (#\⊃ . "sup")
     (#\⊄ . "nsub")
     (#\⊆ . "sube")
     (#\⊇ . "supe")
     (#\⊕ . "oplus")
     (#\⊗ . "otimes")
     (#\⊥ . "perp")
     (#\⋅ . "sdot")
     (#\⋮ . "vellip")
     (#\⌈ . "lceil")
     (#\⌉ . "rceil")
     (#\⌊ . "lfloor")
     (#\⌋ . "rfloor")
     (#\〈 . "lang")
     (#\〉 . "rang")
     (#\◊ . "loz")
     (#\♠ . "spades")
     (#\♣ . "clubs")
     (#\♥ . "hearts")
     (#\♦ . "diams"))))

(define (string->escaped-html s port)
  "Write the HTML escaped form of S to PORT."
  (define (escape c)
    (let ((escaped (hash-ref %escape-chars c)))
      (if escaped
          (format port "&~a;" escaped)
          (display c port))))
  (string-for-each escape s))

(define (object->escaped-html obj port)
  "Write the HTML escaped form of OBJ to PORT."
  (string->escaped-html
   (call-with-output-string (cut display obj <>))
   port))

(define (attribute-value->html value port)
  "Write the HTML escaped form of VALUE to PORT."
  (if (string? value)
      (string->escaped-html value port)
      (object->escaped-html value port)))

(define (attribute->html attr value port)
  "Write ATTR and VALUE to PORT."
  (format port "~a="" attr)
  (attribute-value->html value port)
  (display #" port))

(define (element->html tag attrs body port)
  "Write the HTML TAG to PORT, where TAG has the attributes in the
list ATTRS and the child nodes in BODY."
  (format port "<~a" tag)
  (for-each (match-lambda
             ((attr value)
              (display #\space port)
              (attribute->html attr value port)))
            attrs)
  (if (and (null? body) (void-element? tag))
      (display " />" port)
      (begin
        (display #\> port)
        (for-each (cut sxml->html <> port) body)
        (format port "</~a>" tag))))

(define (doctype->html doctype port)
  (format port "<!DOCTYPE ~a>" doctype))

(define* (sxml->html tree #:optional (port (current-output-port)))
  "Write the serialized HTML form of TREE to PORT."
  (match tree
    (() *unspecified*)
    (('doctype type)
     (doctype->html type port))
    ;; Unescaped, raw HTML output
    (('raw html)
     (display html port))
    (((? symbol? tag) ('@ attrs ...) body ...)
     (element->html tag attrs body port))
    (((? symbol? tag) body ...)
     (element->html tag '() body port))
    ((nodes ...)
     (for-each (cut sxml->html <> port) nodes))
    ((? string? text)
     (string->escaped-html text port))
    ;; Render arbitrary Scheme objects, too.
    (obj (object->escaped-html obj port))))
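
With the module somewhere on Guile’s load path, using it looks something
like this (a minimal sketch; the string-port round trip is just for
demonstration):

(use-modules (sxml html) (srfi srfi-26))

(call-with-output-string
  (cut sxml->html '(p "Hello, " (strong "world") "!") <>))
"<p>Hello, <strong>world</strong>!</p>"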

From the blog dthompson by David Thompson and used with permission of the author. All other rights reserved by the author.


We forgot something. Combiner.

The title of this blog post may say it clearly enough, but we did manage to miss something rather important.  This is just based on my understanding, and I haven’t had time to test the results, so any help doing so would be greatly appreciated, as would any help explaining it further or clearing up any misconceptions I have about the class.  I’ll do another post on the Partitioner soon, as well.

I noticed it on Wednesday and couldn’t help but dig a little deeper.  I’ll start with a few hints so we can see what we missed, beginning with a look back at the MapReduce WordCount program from earlier.  The one I’m using here is the latest version posted on Blackboard, which critically has a few lines that differ from the one posted earlier in the Titanpad instructions and found in this tutorial.  The main difference is how the classes are written and what they do.

This is the latest one on Blackboard:

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper 
       extends Mapper<Object, Text, Text, IntWritable>{
    
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
      
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  
  public static class IntSumReducer 
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, 
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 2) {
      System.err.println("Usage: wordcount <in> [<in>...] <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    for (int i = 0; i < otherArgs.length - 1; ++i) {
      FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }
    FileOutputFormat.setOutputPath(job,
      new Path(otherArgs[otherArgs.length - 1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

And this is the code from the Hortonworks tutorial:


package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}

You can see there isn’t much difference if you look at the two side by side; the map and reduce methods are nearly the same.  The main differences are the class names and whether the StringTokenizer is built directly from value.toString() or from an intermediate line string.  Even so, there’s a big difference in how the <key,value> pairs are handled, and the reason is how the jobs are configured.

In the first:

Job job = new Job(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

And in the second:

Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

Hmm, notice that IntSumReducer gets set twice in the first version, once as the combiner and once as the reducer?  Let’s find out what’s going on there, starting with the documentation for setCombinerClass.  So what’s happening here, and what’s the difference between the two?

The combiner is a local reduction phase.  In the version from the Hortonworks tutorial (the second) we’re not combining our <key,value> pairs before sending them to the reducer, and in the Apache one we are!  The combiner is essentially a local run of the reducer on each mapper’s output, so that the <key,value> pairs are combined before being sent off to the reducer.  From a practical standpoint, that means the first version sends <word,N> to the reducer, while the second sends <word,1> N times.  This clearly results in far less data being transferred between nodes, and improved efficiency.
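
If we wanted the tutorial version to do the same thing, I believe the fix is a one-line addition to its driver, since the existing Reduce class already has matching <Text, IntWritable> input and output types and summing is safe to apply locally.  A sketch, untested:

job.setMapperClass(Map.class);
job.setCombinerClass(Reduce.class);  // run a local reduce on each mapper's output
job.setReducerClass(Reduce.class);

// For the input "cat cat dog" on a single node:
//   without the combiner: <cat,1>, <cat,1>, <dog,1> all travel to the reducer
//   with the combiner:    <cat,2>, <dog,1> travel instead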

You can learn more about the combiner here.

From the blog ... Something here. » cs-wsu by ilundhild and used with permission of the author. All other rights reserved by the author.


OpenMRS Meeting

During today’s OpenMRS meeting, the developers discussed security issues and bugs, as well as many changes they have made over the past month. One of the developers noted that many of the problems OpenMRS has been having are due to recent commits that were a bit sloppy.

One of the developers walked through the registration aspect of OpenMRS and provided a demo. He noted that they made the system keyboard friendly so the arrow keys can be used for easy drop-down menu navigation. He also noted the addition of real-time editing of patient medical records, and of address hierarchies for different countries such as Haiti. A few other issues they fixed include allowing earlier birth years for older patients; before, OpenMRS wouldn’t let admins enter a birthday that would make the patient more than a hundred years old.

Some issues the developers hope to work on next include adding a format for patient phone numbers such as (xxx) xxx-xxxx rather than just xxxxxxxxxx.
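
As a rough illustration of that kind of formatting (my own sketch, not OpenMRS code; the class and method names are made up), a bare ten-digit string could be rewritten like this:

public class PhoneFormat {
  // Hypothetical helper: format a bare 10-digit string as (xxx) xxx-xxxx.
  static String format(String digits) {
    if (digits == null || !digits.matches("\\d{10}")) {
      return digits;  // leave anything unexpected untouched
    }
    return String.format("(%s) %s-%s",
        digits.substring(0, 3), digits.substring(3, 6), digits.substring(6));
  }

  public static void main(String[] args) {
    System.out.println(format("5085551234"));  // prints (508) 555-1234
  }
}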

In conclusion, the use of Java 8 was mentioned. One developer thought it would be a better fit for OpenMRS because its newer language features better serve the developers’ needs.

From the blog wellhellooosailor » CS@Worcester by epaiz and used with permission of the author. All other rights reserved by the author.


OpenMrs meeting 03/05/2015

The meeting was a discussion of the different changes implemented this past spring. They fixed issues such as the limit on the year a patient could be born: the software wouldn’t allow admins to enter birth dates more than 100 years in the past. They also added a number of functions to help admins better identify a patient. When a patient arrives at the hospital unconscious, the staff can’t identify the patient by asking for basic information such as their name, so they created a way to update the patient’s information once the individual has awakened. Another feature was a box for recording a more accurate birth date, since the one on file is not always accurate.

Rodrigo Roldan

From the blog rroldan1 » CS@Worcester by rroldan1 and used with permission of the author. All other rights reserved by the author.


Introduction

Hello, my name is Patrick Mahoney. I attend Worcester State University and major in Computer Science. I enjoy working in the computer field and hope to learn as much as I can in the time to come.

From the blog pmahones6 » cs-wsu by pmahones's blog and used with permission of the author. All other rights reserved by the author.



Capital Letters in Pound Signs

Based on the material I learned today in CS-383, I have created an Excel spreadsheet containing each of the 26 letters of the English alphabet rendered in an ASCII form of pound symbols and periods. I still do not remember what it’s called, and I have no idea if these are the correct forms of each letter.

Feel free to look at the file and improve if necessary: Letters in Pounds

From the blog jdongamer » cs-wsu by jd22292 and used with permission of the author. All other rights reserved by the author.
