Scraping HTML using Java Servlets and TagSoup

I’ve been working on a simple Java project that scrapes data from vendor websites in order to compare prices. I found a neat little library called “TagSoup” to help parse the HTML returned from my URL connection. I ran into a few hiccups along the way that seemed worth documenting, not only for my own sanity but hopefully for other code monkeys searching the net for solutions to these problems.

First of all, good luck finding any clearly written usage guides for the TagSoup library. I did find a nice writeup over at HackDiary, written in 2003, that gave me a good starting point. The code provided there looks like this:

URL url = new URL("http://example.com");
// build a JDOM tree from a SAX stream provided by tagsoup
SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
Document doc = builder.build(url);
JDOMXPath titlePath = new JDOMXPath("/h:html/h:head/h:title");
titlePath.addNamespace("h","http://www.w3.org/1999/xhtml");
String title = ((Element)titlePath.selectSingleNode(doc)).getText();
System.out.println("Title is "+title);

This implementation worked just fine for me, though I did run into a few issues.

  1. I am building my projects with Apache Maven, so I needed to add a dependency on a newer version of Saxon than what ships with JDK 6:
      <dependency>
          <groupId>net.sf.saxon</groupId>
          <artifactId>saxon</artifactId>
          <version>8.7</version>
      </dependency>
  2. The default User-Agent added to my GET headers was ‘Java/1.6.0_13’. This caused some of the pages I needed to scrape to return a 403 Forbidden error. Changing it is easy enough when the code is not running inside a servlet:
      System.setProperty("http.agent", "Mozilla/5.0 "
              + "(Windows NT 6.1; U; ru; rv:5.0.1.6) "
              + "Gecko/20110501 Firefox/5.0.1 Firefox/5.0.1");

    However, when running this code from within a servlet on GlassFish 3, I still got the same 403 Forbidden errors. I was able to resolve that with the following code changes:

    URL url1 = new URL("http://example.php");
    HttpURLConnection conn = (HttpURLConnection) url1.openConnection();
    conn.addRequestProperty("User-Agent", "Mozilla/5.0 "
            + "(Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) "
            + "Gecko/20061204 Firefox/2.0.0.1");
    // build a JDOM tree from a SAX stream provided by tagsoup
    SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser"); 
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(conn.getInputStream()));
    Document doc;
    String price = null;
    doc = builder.build(reader);

    Notice that I’m now passing the build() method a BufferedReader object instead of the URL object, so the document is read through the connection that carries the custom User-Agent header.

    For whatever reason, setting the system property from the servlet was not getting the job done.

  3. TagSoup places every element in the XHTML namespace, so each element name in the XPath needs a namespace prefix (‘h:’ here, bound with addNamespace() as in the first example). I was using the Firebug plugin for Firefox, and the string returned by its ‘Copy XPath’ feature had to be modified to put ‘h:’ in front of every node name. Additionally, the tree TagSoup built did not contain any of the ‘tbody’ nodes that appeared in Firebug’s XPath, so those had to be removed from my XPath string. Here is an example of the XPath string I needed in order to grab the Silver Bid/Ask price from ‘http://bullion.nwtmint.com/silver_panam.php’:
    JDOMXPath titlePath = new JDOMXPath("/h:html/h:body/h:table/h:tr[3]"
            + "/h:td/h:table/h:tr/h:td[3]/h:table/h:tr[3]/h:td/h:table"
            + "/h:tr[2]/h:td[2]");
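
The XPath clean-up described in item 3 is tedious to do by hand, so here is a minimal sketch of how it could be automated. The class name FirebugXPath and the method toNamespaced are hypothetical names of my own, not part of TagSoup, JDOM, or Firebug. It assumes Java 8 or later and only handles simple absolute paths made of element steps (no ‘//’ shortcuts and no predicates that contain a slash).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: rewrites an XPath copied from a browser tool
// into the prefixed, tbody-free form used with the TagSoup-built tree.
class FirebugXPath {

    /**
     * Drops every tbody step and puts "prefix:" in front of each
     * remaining element name. Assumes a simple absolute path such as
     * /html/body/table/tbody/tr[3]/td.
     */
    public static String toNamespaced(String browserXPath, String prefix) {
        List<String> steps = new ArrayList<>();
        for (String step : browserXPath.split("/")) {
            if (step.isEmpty()) {
                continue; // the leading slash yields an empty first token
            }
            // Skip tbody steps, with or without a positional predicate.
            if (step.equals("tbody") || step.startsWith("tbody[")) {
                continue;
            }
            steps.add(prefix + ":" + step);
        }
        return "/" + String.join("/", steps);
    }

    public static void main(String[] args) {
        // Made-up path in the style of a browser's "Copy XPath" output.
        String copied = "/html/body/table/tbody/tr[3]/td/table/tbody/tr/td[3]";
        System.out.println(toNamespaced(copied, "h"));
    }
}
```

The result can be passed straight to the JDOMXPath constructor, as long as the same prefix is registered with addNamespace() the way the first example does.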

I hope this information is able to help someone else out there.
 
