Java, Strings and Custom Comparators

I recently needed to sort a list of strings according to a custom, non-native order defined at run time. For example given a list of strings:

  1. seven
  2. one
  3. four
  4. three
  5. two
  6. six
  7. 5
  8. N|ne
  9. eight
  10. zten
  11. aleven
  12. btwelve
  13. dthirteen
  14. cfourteen

one might need to sort these dynamically at run time in perhaps the following order, which may or may not change during runtime:

  1. One
  2. Two
  3. three
  4. four
  5. 5
  6. six
  7. seven
  8. eight
  9. n|ne

In addition, the algorithm would need to be able to handle strings that were not defined in the ¬†‘desiredOrder’ list such that unrecognized strings would be placed at the end of the known, ordered list, and then sorted natively from there. An example of how the original list above should look after the sort is:

  1. one
  2. two
  3. three
  4. four
  5. 5
  6. six
  7. seven
  8. eight
  9. N|ne
  10. aleven
  11. btwelve
  12. cfourteen
  13. dthirteen
  14. zten

I decided that creating a new class implementing the Java Comparator¬†interface would be the most efficient way of doing this and was hoping that there would be some immediate cut & paste code gleaned from a simple Google search, but, no dice. Hopefully, at some point in the future, someone else in my position will find this code useful. Here’s what I came up with (comments, suggestions, criticism is welcome):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
public class ADynamicStringComparator implements Comparator
{
    private ArrayList order = new ArrayList();
 
    /**
     * String Comparator class that takes an explicit desired order. 
     * Case insensitive. If an encountered String is not contained within the
     * provided desiredOrder, then it will be placed at the end of the known
     * list within desiredOrder and then sorted natively.  
     * @param desiredOrder 
     */
    ADynamicStringComparator(List desiredOrder)
    {
        if(desiredOrder == null){
            throw new IllegalArgumentException("desiredOrder cannot be null.");
        }
        //set everything to lowercase
        for (String string : desiredOrder) {
            this.order.add(string.toLowerCase());
        }
    }
 
    @Override
    public int compare(String s1, String s2) 
    {
        s1 = s1.toLowerCase();
        s2 = s2.toLowerCase();
 
        //get index for 1
        int indexOfS1 = order.indexOf(s1);
        if(indexOfS1 == -1){
            indexOfS1 = order.size();
        }
 
        //get index for 2
        int indexOfS2 = order.indexOf(s2);
        if(indexOfS2 == -1){
            indexOfS2 = order.size();
        }
 
        if(indexOfS1 == order.size() && indexOfS2 == order.size())
        {
            return s1.compareTo(s2);
        }else{
            return indexOfS1 - indexOfS2;
        }
    }
}

Because the Java List interface preserves order, we can use a list as a parameter to the comparator’s constructor to define our order at runtime. Below is some simple test code as an example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
public void testCompare() 
{
    ArrayList list = new ArrayList();
    list.add("seven");
    list.add("one");
    list.add("four");
    list.add("three");
    list.add("two");
    list.add("six");
    list.add("5");
    list.add("N|ne");
    list.add("eight");
    list.add("zten");
    list.add("aleven");
    list.add("btwelve");
    list.add("dthirteen");
    list.add("cfourteen");
 
    /*The desired sort order. */
    ArrayList desiredOrder = new ArrayList();
    desiredOrder.add("One");
    desiredOrder.add("Two");
    desiredOrder.add("three");
    desiredOrder.add("four");
    desiredOrder.add("5");
    desiredOrder.add("six");
    desiredOrder.add("seven");
    desiredOrder.add("eight");
    desiredOrder.add("n|ne");
 
    /*Another desired sort order. */
    ArrayList desiredOrder2 = new ArrayList();
    desiredOrder.add("Two");
    desiredOrder.add("three");
    desiredOrder.add("four");
    desiredOrder.add("six");
    desiredOrder.add("One");
    list.add("dthirteen");        
    desiredOrder.add("seven");
    desiredOrder.add("eight");
    desiredOrder.add("5");
    desiredOrder.add("n|ne");        
 
    System.out.println("----------------");
    System.out.println("before sort 1");
    System.out.println("----------------");
    for (String string : list) {
        System.out.println(string);
    }        
 
    System.out.println("----------------");
    System.out.println("desiredOrder 1");
    System.out.println("----------------");
    for (String string : desiredOrder) {
        System.out.println(string);
    }
 
    Collections.sort(list, new ADynamicStringComparator(desiredOrder));
    System.out.println("----------------");
    System.out.println("after sort 1");
    System.out.println("----------------");
    for (String string : list) {
        System.out.println(string);
    }
 
    assertEquals(list.get(0), "one");
    assertEquals(list.get(1), "two");
    assertEquals(list.get(2), "three");
    assertEquals(list.get(3), "four");
    assertEquals(list.get(4), "5");
    assertEquals(list.get(5), "six");
    assertEquals(list.get(6), "seven");
    assertEquals(list.get(7), "eight");
    assertEquals(list.get(8), "N|ne");
    assertEquals(list.get(9), "aleven");
    assertEquals(list.get(10), "btwelve");
    assertEquals(list.get(11), "cfourteen");
 
    Collections.sort(list, new ADynamicStringComparator(desiredOrder2));
    System.out.println("----------------");
    System.out.println("after sort 2");
    System.out.println("----------------");
    for (String string : list) {
        System.out.println(string);
    }        
 
    /*Sorry, got lazy writing asserts. Feel free to add your own ;)*/
}

For easier to read code or contributions, check out the google code project here.

 

Scraping HTML using Java Servlets and TagSoup

I’ve been working on a simple java project to scrape some data from vendor websites in order to compare prices. I found a neat little library called “TagSoup” to help parse through the HTML tags returned from my URL connection. I ran into a few hiccups along the way which I figured might be worth documenting not only for my own sanity but hopefully for other code monkeys searching the net for solutions to these problems.

First of all, good luck finding any clearly written usage guides for the TagSoup library.  I was able to find a nice writeup over at HackDiary written in 2003 that gave me a nice starting point. The code provided there looks like this:

1
2
3
4
5
6
7
8
URL url = new URL("http://example.com");
// build a JDOM tree from a SAX stream provided by tagsoup
SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
Document doc = builder.build(url);
JDOMXPath titlePath = new JDOMXPath("/h:html/h:head/h:title");
titlePath.addNamespace("h","http://www.w3.org/1999/xhtml");
String title = ((Element)titlePath.selectSingleNode(doc)).getText();
System.out.println("Title is "+title);

This Implementation worked just fine for me, though I did run into a couple of issues.

  1. I am building my projects using Apache-maven so I needed to add a dependencey for a newer version of Saxon than what is shipped with the JDK6 library.
    1. 1
      2
      3
      4
      5
      
      <dependency>
          <groupId>net.sf.saxon</groupId>
          <artifactId>saxon</artifactId>
          <version>8.7</version>
      </dependency>
  2. The default User-Agent that was being added to my GET headers was ‘Java/1.6.0_13′. This was causing some of the pages I needed to scrape to return back an 403 Forbidden error.Changing that was easy enough if I was not running this code from within a servlet.
    1. 1
      2
      3
      
      System.setProperty("http.agent", "Mozilla/5.0 "
              + "(Windows NT 6.1; U; ru; rv:5.0.1.6) "
              + "Gecko/20110501 Firefox/5.0.1 Firefox/5.0.1");

    However, When running this code from within a servlet using GlassFish3, I got stuck with the same 403 forbidden errors. I was able to resolve that with the following code changes:

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    
    URL url1 = new URL("http://example.php");
    HttpURLConnection conn = (HttpURLConnection) url1.openConnection();
    conn.addRequestProperty("User-Agent", "Mozilla/5.0 "
            + "(Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) "
            + "Gecko/20061204 Firefox/2.0.0.1");
    // build a JDOM tree from a SAX stream provided by tagsoup
    SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser"); 
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(conn.getInputStream()));
    Document doc;
    String price = null;
    doc = builder.build(reader);

    Notice that I’m now passing the build() method a BufferedReader object instead of the Document object.

    For whatever reason, setting the system property from the servlet was not getting the job done.

  3. TagSoup seems to expect an ‘h:’ name space in front of all of the XPath elements. I was using the Firebug plugin for Firefox. The String returned from the ‘Copy XPath’ feature needed to be modified to include ‘h:’ after every node. Additionally, the ‘Tidying up’ that TagSoup performs seemed to have removed any ‘/tbody’ nodes. Those needed to be removed from my XPath string. here is an example of the XPath string I needed to use in order to grab the Silver Bid/Ask price from ‘http://bullion.nwtmint.com/silver_panam.php':
    1
    2
    3
    
    JDOMXPath titlePath = new JDOMXPath("/h:html/h:body/h:table/h:tr[3]"
            + "/h:td/h:table/h:tr/h:td[3]/h:table/h:tr[3]/h:td/h:table"
            + "/h:tr[2]/h:td[2]");

I hope this information is able to help someone else out there.