More web scraping and fun with Java, Scala, XML libraries and XPath

April 17, 2013

I’ve posted some more code to GitHub for Project Miracle and now have at least one alternative to Python for web scraping. I decided to use Scala for my second attempt, partly because I want to get better at writing Scala code, and partly because I thought that support for XML/XHTML would be superior on the JVM. This was true in a sense, but as with the Python code, things never seemed as straightforward as they should have been. I also got some more practice using SBT, which worked pretty well and was very easy to start using for a simple project.

To start with downloading the HTML, it’s as simple in Scala as one would like it to be:

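import scala.io.Source
import scala.util.Properties

// queenpediaSongList holds the URL of the song list page (defined elsewhere)
val sb = new StringBuilder
for (line <- Source.fromURL(queenpediaSongList).getLines())
  sb.append(line).append(Properties.lineSeparator)
val songListHTML = sb.toString()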
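For what it’s worth, the same download can be written a little more compactly. The sketch below is only an illustration, not the code in the repository: the names FetchPage and fetch are made up, and the UTF-8 default is an assumption about the page encoding. It joins the lines with mkString and closes the Source when it is done.

```scala
import scala.io.Source

object FetchPage {
  // Reads the whole resource at `url` into one string, joining lines with "\n",
  // and closes the underlying stream afterwards.
  // The UTF-8 default is an assumption about the page being fetched.
  def fetch(url: String, encoding: String = "UTF-8"): String = {
    val source = Source.fromURL(url, encoding)
    try source.getLines().mkString("\n")
    finally source.close()
  }
}
```

Calling FetchPage.fetch(queenpediaSongList) would then stand in for the StringBuilder loop above.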

I download the HTML first so that I can perform any necessary pre-processing before the parsing steps. As it turns out, the Queenpedia Song List that I’ve been using as my test case has some malformed HTML, so I needed to clean that up before using an XML-based (or even strict HTML-based) parsing approach. I decided to try JTidy, a Java port of HTML Tidy, which I had used in the Python code via PyTidyLib. The key word here, however, is port, which means that the behavior of the Java and the C libraries is not the same. It turns out that JTidy complains about a missing table summary or something like that, a feature mostly used for accessibility. The error message kindly suggests that I update the HTML document before running it through JTidy again. I thought that’s what you were supposed to be doing, JTidy!

A simple setting update to “force output” gets us past this step, but somewhere along the line, I was getting no results from the function call to clean up the original HTML. I experimented with TagSoup, an alternative library for cleaning up HTML, which seems to have a decent pedigree as it is used by Apache Tika. I wasn’t having any luck with that either, so I switched back to JTidy and created a new Java project in IntelliJ IDEA to test things out, just in case I was running into some idiosyncrasies with the Scala environment. With this project, I eventually got the results I wanted from JTidy and could get back to parsing the HTML/XHTML content.

While I was trying to figure out what was going wrong with cleaning up the HTML, I was also trying to decide which XML approach/library to use in parsing out the song list. As I mentioned in a previous post, I had already figured out a single-line XPath statement that would extract the relevant links from the main song list page. It looked like this:

/html/body//div[@id="bodyContent"]/(table/tr/td/ul/li/a | ul/li/a)

Simple enough it seems. All I should need to do is instantiate some kind of XPathEvaluator object and evaluate this XPath statement against the cleaned up HTML string, et voilà. Right?

First, a note on the use of XML in Scala. Scala has built-in support for XML, but that package is relatively notorious for being outdated and “beyond fixing”. The project Anti-XML was started by Daniel Spiewak, who now works at Precog, as a clean-room replacement for Scala’s built-in XML support. The project looks promising but it looks like it hasn’t had any activity in over a year. The Scala Wiki contains information on other alternative XML libraries, the most promising of which seems to be Scales Xml. However, since I’m not familiar with any of these libraries, I decide to stay conservative and simply use the built-in Java XML libraries (i.e. JAXP). That seems like a low-risk approach.

Low-risk and low-functioning approach as it turns out. To test out the XML/XHTML interactively, I’ve been using the excellent <oXygen/> XML Editor, which has first-class support for XPath and XQuery, including XPath 2.0. Turns out I’ve been building XPath 2.0 expressions the whole time and Java’s built-in JAXP implementation does not support XPath 2.0. So I can either port my XPath statement back to be XPath 1.0 compliant or find a compatible XPath 2.0 library. After investigating XOM briefly (and wondering why I’d never really run across it in all my years of working with XML), I decide to go with Michael Kay’s wonderful Saxon library, which, for some unknown reason, appears to be one of the very few libraries actually implementing support for XPath 2.0 (and even XPath 3.0 in the commercial versions).
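For comparison, the first option (porting the expression back to XPath 1.0) might look roughly like the sketch below. This is not the route I took and not the code in the repository. XPath 1.0 does not allow a parenthesized union as a step in the middle of a path, so the union has to be spelled out as two full paths. The names SongLinks, songLinksXPath1 and hrefs are made up for illustration, and the sketch assumes the input is already well-formed, has no DOCTYPE, and declares no default namespace; a document in the XHTML namespace would need prefixed names and a NamespaceContext.

```scala
import java.io.StringReader
import javax.xml.xpath.{XPathConstants, XPathFactory}
import org.w3c.dom.{Element, NodeList}
import org.xml.sax.InputSource

object SongLinks {
  // XPath 1.0 version of the expression: the union is written as two complete paths.
  val songLinksXPath1: String =
    """/html/body//div[@id="bodyContent"]/table/tr/td/ul/li/a""" +
      " | " +
      """/html/body//div[@id="bodyContent"]/ul/li/a"""

  // Evaluates the expression with the JDK's built-in JAXP XPath engine and
  // returns the href attribute of every matching <a> element.
  // Assumes well-formed, namespace-free markup (see the note above).
  def hrefs(xhtml: String): Seq[String] = {
    val xpath = XPathFactory.newInstance().newXPath()
    val nodes = xpath
      .evaluate(songLinksXPath1, new InputSource(new StringReader(xhtml)), XPathConstants.NODESET)
      .asInstanceOf[NodeList]
    (0 until nodes.getLength).map { i =>
      nodes.item(i).asInstanceOf[Element].getAttribute("href")
    }
  }
}
```

The trade-off is a longer, more repetitive expression in exchange for not needing anything beyond the JDK.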

No, not that Saxon.

I briefly attempt to use Saxon via the normal JAXP XPath APIs. After tinkering for a while and getting no love from this approach, I decide to use Saxon’s native API, s9api. Using s9api, I initialize all of the required objects in 11 steps instead of 1, but I am finally able to extract the information I wanted from the HTML document.

Another interesting thing I learned while using the Saxon API was a new feature in Scala 2.10 for converting between Java and Scala collections. The result of XPathSelector.evaluate() in s9api is an XdmValue, which implements Iterable<XdmItem>. In order to use the standard Scala for/foreach loop, I needed to convert it to a Scala Iterable. Normally, importing scala.collection.JavaConversions._ takes care of this.

In Scala 2.10.1 (maybe before, I’m not sure), you can be more explicit about the implicit conversions you want to enable, so I used the following import statements instead:

import scala.language.implicitConversions
import scala.collection.convert.WrapAsScala.iterableAsScalaIterable
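Another way to keep the conversion explicit is the JavaConverters style, where nothing is converted until you call .asScala. The sketch below is an illustration only, not what the repository uses: ConvertDemo, javaTitles and upperCased are made-up names, and the list of strings is a stand-in for the Iterable that a Java API such as s9api hands back. On Scala 2.13 and later the same .asScala call comes from scala.jdk.CollectionConverters instead, and JavaConverters is deprecated there.

```scala
import java.util.{Arrays => JArrays}
import scala.collection.JavaConverters._

object ConvertDemo {
  // Stand-in for a Java API that returns a java.lang.Iterable (placeholder data).
  def javaTitles: java.lang.Iterable[String] =
    JArrays.asList("Song A", "Song B")

  // .asScala wraps the Java Iterable as a Scala Iterable, so it can be used
  // directly in a for comprehension.
  def upperCased: List[String] =
    (for (title <- javaTitles.asScala) yield title.toUpperCase).toList
}
```

The conversion is visible at the call site, which makes it easier to see where Java collections cross into Scala code.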

The final result, though not complete (it doesn’t extract the lyrics from the individual song pages yet), is viewable in the current GitHub repository. Am I satisfied with the results? For now, using Scala was unnecessary, but it also didn’t cause any extra problems. I think it’s safe to say I was using Scala as simply a “better Java” with some easier syntax and relaxed rules on when static types need to be declared. But truthfully, I wrote most of the code that ended up working in Java using IntelliJ IDEA first. IntelliJ automates so much for you that you never really have to worry about what types to declare.

The next step may be to try Node.js; after that, I will try to do something interesting with the data I’ve acquired.

One Comment
  1. ern2150 permalink

    Can’t wait til you find some kind of UTF WTF where someone’s used a slanted quote mark or ndash. Not that I run into that problem every single iteration of my app at work or anything, especially not with i18n or whatever they’re calling it these days.
