
Web scraping Queenpedia with Python

My first working code for Project Miracle is now posted to GitHub. Rejoice and be thankful!

The code is definitely a work in progress, but I wanted to share some of my initial thoughts on the process so far. I started with the approach I had described in my earlier post by writing a Python script to do the initial scraping. This helps me learn Python better and gives me a baseline web scraping approach against which to compare other potential solutions.

I won’t bother to annotate all of the code at this point because it’s not terribly interesting.  Just a few notes:

  • BeautifulSoup worked decently. However, being the kind of developer who is very comfortable with XML, I found myself testing out XPath statements against the downloaded song index from Queenpedia and then trying to figure out how to translate them into BeautifulSoup/Python code. This turned a one-line XPath statement into 30-40 lines of Python. That doesn’t seem very efficient, but I’m willing to allow for the fact that I’m still learning Python and had no familiarity with the BeautifulSoup API before now; there may be more efficient ways to do what I’ve done so far (see the sketches after this list). As a side note, the most popular answer in the Stack Overflow topic on web scraping with Python leaves a lot to be desired. For one, the read() function called on urllib2.urlopen() might not return all of the contents of the URL every time. I suppose one has to accept the fact that Stack Overflow users aren’t going to write all of your code for you.
  • I made use of the lxml Python XML library via BeautifulSoup. According to the Beautiful Soup documentation: “If you’re using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it’s essential that you install lxml or html5lib–Python’s built-in HTML parser is just not very good in older versions.” I was running Python 2.7.2 and ran into weird issues parsing the HTML with the BeautifulSoup API until I installed the lxml parser. My next step might be to try to use lxml directly and bypass BeautifulSoup.
  • If I were to use more of an XML-centric API (lxml in Python or another XML library in some other language), the one drawback with regard to web scraping is that many, if not most, web pages are not strictly XML-compliant.  I was able to easily work around this when I started to scrape the actual song/lyrics pages by using PyTidyLib, a Python wrapper around the time-tested HTML Tidy program.
  • I wasn’t sure how to properly handle URLs at first without something like the java.net.URL class, but I quickly found urlparse.urljoin in Python, which did the trick of constructing an absolute URL from a base URL and the relative links contained in the HTML docs.
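
To make the comparison concrete, here is a minimal sketch of the BeautifulSoup approach, collecting absolute song-page URLs from the index with urljoin. The index URL and the assumption that the song pages hang off plain <a> links are mine for illustration, not necessarily Queenpedia’s actual structure:

    # A minimal sketch of the BeautifulSoup approach described above.
    # INDEX_URL and the page structure are assumptions for illustration.
    import urllib2
    from urlparse import urljoin
    from bs4 import BeautifulSoup

    INDEX_URL = "http://www.queenpedia.com/index.php?title=Song_Index"  # hypothetical

    html = urllib2.urlopen(INDEX_URL).read()
    soup = BeautifulSoup(html, "lxml")  # ask for the lxml parser explicitly

    song_urls = []
    for anchor in soup.find_all("a"):
        href = anchor.get("href")
        if href:
            # urljoin resolves a relative link against the index page's URL
            song_urls.append(urljoin(INDEX_URL, href))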
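
For comparison, here is a sketch of the XML-centric route: run the page through PyTidyLib first so that lxml gets well-formed markup, after which the extraction is a one-line XPath. Again, the URL and the XPath expression are illustrative assumptions:

    # Sketch of the lxml-direct approach: tidy the HTML, then one XPath.
    import urllib2
    import lxml.html
    from tidylib import tidy_document

    INDEX_URL = "http://www.queenpedia.com/index.php?title=Song_Index"  # hypothetical

    html = urllib2.urlopen(INDEX_URL).read()
    clean_html, errors = tidy_document(html, options={"output-xhtml": 1})

    doc = lxml.html.fromstring(clean_html)
    hrefs = doc.xpath("//a/@href")  # the one-liner that took 30-40 lines above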

I think that is about it for now. I’ll finish up this script by having it download all of the songs and may then try some alternative approaches for the web scraping to compare/contrast. Then I need to extract the lyrics and other relevant info from the song/lyric pages themselves. Stay tuned for more info on that.

Installing MusicBrainz: was it all worth it?

In my quest to acquire the definitive database of Queen songs for Project Miracle, I first thought of turning to MusicBrainz. MusicBrainz is, according to their website, “an open music encyclopedia that collects music metadata and makes it available to the public.” The important aspect of MusicBrainz is that it is completely open, so one can download their database for free and easily query all of the music metadata contained therein.

That was my initial thought. I still haven’t quite made use of the MusicBrainz data yet, but I do have a working copy of the database on my machine. The process to get there was, at times, exasperating, frustrating, and most importantly, riddled with copious amounts of swearing. In other words, I was back in my element.

I kept a log of my progress in building the MusicBrainz server software, and I eventually got it all built correctly so that I could import the database (which was all I wanted to do in the first place). I thought about just copying my raw notes here, but I will instead try to summarize in a more coherent form. Just to give a taste of the notes in raw form, however, here are a few snippets:

Couldn’t connect to postgres
I skipped the authentication steps in the install notes
That doesn’t matter right

The data directory was created with PG 9.1 but that’s not compatible with PG 9.2
You’ve got to be shitting me

Turns out it looks like i didn’t install icu-dev (icu4c via homebrew)
Yes I skipped that step earlier
When will I ever learn

So, back to the main story of installing MusicBrainz. Here are the steps I went through:

First, I downloaded the MusicBrainz database.  At first I thought that perhaps I could simply import the database into PostgreSQL using some simple commands like pg_restore, but the README file contained in the archive with the database files said “you need a compatible version of the MusicBrainz server software” in order to import the database. So I proceeded to the MusicBrainz Server Setup page and instead of downloading the easy-to-use virtual machine, I decided to grab the code from Git and build it all on my machine. What could go wrong?

I grabbed the code from the MusicBrainz git repository and proceeded to read through the INSTALL file to get things working correctly (again, in theory). The requirements called for Unix (Ubuntu was recommended, but I’m running on a Mac, so I’m good, I thought), Perl, Postgres, memcached, and a version of GCC and Make. On Perl, I had a sufficiently recent version (5.12.4), and I was fine on GCC and Make. On PostgreSQL, it had been a few months since I had run it, so I decided to do an update using Homebrew (a package manager for Mac OS X). I upgraded from 9.1.4 to 9.2.3. Smart idea, right? Newer is always better? (Sorry for the blatant foreshadowing.) Finally, I installed memcached even though I thought I just needed the database and wouldn’t need the web server software running. Because I’m a completist, you see?

OK, now I’ve got all of the prerequisites installed correctly. Next, it’s time to edit the MusicBrainz config files per the instructions. I edit lib/DBDefs.pm, updating the MB_SERVER_ROOT and WEB_SERVER variables and setting REPLICATION_TYPE to MB_STANDALONE. (As a side note, it’s interesting to observe that people are apparently still coding in Perl.)

Now the instructions say to install Carton, which is apparently some sort of Perl package manager. I enter sudo cpan Carton, which takes a while because I haven’t used Perl in ages. Perhaps I’ve never used it on this particular laptop. After installing Carton, I proceed with the instructions in the README to type carton install --deployment

I immediately receive an error ‘Can’t locate cpanfile’. Turns out the latest code from Git doesn’t have a cpanfile, and I find from this page that I have to type cat Makefile.PL | grep ^requires > cpanfile

Thank goodness for searchable IRC chat logs. Why doesn’t someone update the Git repository so people don’t get this error? It’s open source, that’s why.

Anyway, now I am back to proceeding with the install. The Carton install runs mostly okay, but I do get some error about “Installing modules failed”. I never know whether to trust those types of error messages, so I proceed. I mean what could go wrong?

Turns out this database requires some specific extensions to PostgreSQL. So I guess I was right to install the software rather than trying to do a shortcut at importing the database without it. The first extension (unaccent) installs correctly, but the second extension (collate) fails because it can’t find <unicode/utypes.h>. I decide that the second extension can’t be that important so I move on. Remember what I said about being a completist?

The next several steps involved trying to get my PostgreSQL database running correctly. If you read the excerpts from the raw notes above, you got a glimpse of what I went through. In short: don’t upgrade your PostgreSQL installation unless you really have to; the on-disk data directory format is not compatible across major versions such as 9.1 and 9.2. I attempted to run pg_upgrade but ended up giving up and reinitializing the database. The second lesson was that PostgreSQL access permissions are always a pain in the ass. I eventually had to grant the superuser privilege to the ‘postgres’ role, I think with the ‘ALTER ROLE’ command (likely something like ALTER ROLE postgres SUPERUSER). I can’t remember exactly, but it seems like you have to do something different every time with PostgreSQL roles, privileges, access permissions, or pg_hba.conf.

Finally, the MusicBrainz Perl scripts were able to connect to my local PostgreSQL instance successfully, and it turns out they really do need the collate extension. So I proceed to figure out why the collate extension won’t build. Turns out I skipped over the step in the install instructions that said to install libicu-dev. That’s a Debian/Ubuntu-specific name for ICU, a well-known library for handling Unicode. I search Homebrew, find icu4c, and think everything is hunky dory. I re-run the MusicBrainz Perl scripts and, bam, still no worky. I eventually find out that icu4c is a ‘keg-only’ formula, which means its commands and libraries aren’t in the default paths.

I spent some time trying to figure out how to alter the build for the PostgreSQL collate extension by adding the ICU headers to the list of include directories and the ICU binaries to the path. Turns out you can easily augment the list of variables used by make like so: make CFLAGS+="-I /usr/local/Cellar/icu4c/50.1/include"

Et voilà! After this step, the database install is able to proceed and I now have a fully populated MusicBrainz local database. So the question remains: was it all worth it? For now, I’m not doing anything with the database, but I am working on the lyrics scraping scripts, which I’ll describe in the next post.

Learning statistics

Alex Popescu has an interesting post entitled “Programmers Need to Learn Statistics”. He mentions hating having to learn about probability and statistics in college. I never took a statistics course in college, but I did take Probability (Math 361 at UIUC) and let’s just say I did very poorly.

So it’s a little disconcerting when one realizes that “machine learning” and “data science” are actually both just fancy terms for “Applied Statistics”. In any case, I’ve been trying to get smarter about statistics as I go along, with the help of some of the resources below:

  • OpenIntro Statistics – A free textbook on introductory statistics (you can also order the paperback at Amazon for less than $10).  The OpenIntro project aims to produce free textbooks in many different subject areas.  On a related note: Boundless is a new startup devoted to producing free textbook content, but they don’t have a Statistics textbook yet.
  • Elements of Statistical Learning – Another textbook that you can download for free.  Covers more advanced topics in statistical/machine learning.
  • Think Stats – Another book free for download, this one focusing on using Python for statistical computing.  Also published by O’Reilly.
  • Think Bayes – Another free textbook recommended by a colleague.  You probably should know some of the more basic statistics material before diving too deeply into Bayesian statistics.
  • I’ve mentioned it before, but Jeff Leek’s Coursera course on Data Analysis (video and slides) gives a good background on some statistical concepts like distributions, regression, etc. while using the R programming language to perform data analysis and some basic machine learning techniques.

There are of course many other resources available at extra cost, but the above should get you started learning enough statistics to be dangerous without having to spend money on a textbook, an activity virtually no one outside of a college course would ever undertake willingly.

Recently posted on Twitter

Fun in Space and its inspiration. Very meta.

Acquiring data: the not-so-fun part of data analysis

OK, so it’s been a few weeks since my last post.  In that time, I’ve been doing some more research on getting Project Miracle started.  At the risk of stating the obvious, the tricky part of doing data analysis, or any data-intensive project for that matter, is actually acquiring the relevant data to analyze.

I need to start with building a Queen song database. My first thought was to utilize the MusicBrainz open source database, at least as a starting point.  I have downloaded their database dump and have begun the process of building the MusicBrainz server software and setting up the database (more to come on that later).  But the MusicBrainz database doesn’t contain any lyrics, so I’ll need to acquire those elsewhere.

Some possible candidates:

  • Andy’s Queen Page – My favorite Queen web site from the 1990s was run by Andy Young. Even though it hasn’t been actively updated since 2000, it’s still one of my favorite resources. This page contains all of the Queen and solo songs up to 2000. The solo material will require further discussion, but I think we will want it included; what solo material to include and what constitutes our ‘master’ database will probably warrant a separate blog post.
  • Queenpedia Song List – This looks like the most comprehensive of the modern Queen discography web sites.
  • Queen Online Official Discography – What the hell; might as well see what we can get from the “official” discography, though I think we will be hard pressed to get as much info as Queenpedia.

There are plenty of non-Queen lyric sites available, but unless the three sites above fall short, I don’t think I’ll need to look into those yet.  As a side note, I just received The Complete Illustrated Lyrics from Amazon, so, for all of the songs from the main albums I should have a master reference to use to verify the quality of the downloaded lyrics.  (I should also be able to verify if the official Queen web site contains the same lyrical content as the book.)

I’ll need to scrape the lyrics data from the aforementioned web sites using some kind of web scraping approach.  Some possible solutions for this:

  • Python – This Stack Overflow question contains some good answers on basic web scraping with Python.  Scrapy is also a possible approach using Python, but it might be overkill for the relatively simple scraping I’ll need to do.
  • A friend of mine has recommended Node.js as the best solution for web scraping. He mentioned that you can use a full JavaScript and jQuery approach just as if you were running in a browser. I usually manage to keep JavaScript at arm’s length, but I probably should get better at it, so I may have to try this out.

I also recently reviewed the videos for Getting Data (Part 1 and Part 2) that were presented by Jeff Leek as part of the Coursera online course Data Analysis.  (For some reason, unlike other Coursera courses, the content for the course is no longer available on the Coursera site, but you can review all of the videos on YouTube here).  “Getting Data (Part 2)” actually covers scraping web data using R and the R XML package.  I think that could be an interesting approach to try out, but since part of my goal is to be able to use environments other than R, I think I’ll try one of the other approaches first.

Machine learning libraries to use for Project Miracle

For the machine learning portion of Project Miracle, I may end up trying several different languages and libraries to see which works best.  This will also give me a chance to learn more about the various algorithms and approaches.  Here are a few I’m considering:

  • R and related libraries: I’ve learned about R through Coursera’s courses Computing for Data Analysis and Data Analysis; I think I have one or more books on R lying around somewhere in either paper or ebook form, but I can’t seem to find them at the moment.
  • Mahout: Familiar with this one because of its usage within the Java and Hadoop communities.
  • Octave/Matlab: Familiar with this language/environment through Andrew Ng’s machine learning course from both Coursera and iTunes U.
  • Breeze from scalanlp.org: My friend Devon is using this library for his Master’s thesis. I’m interested in using Scala, so I may have to learn more about this library.
  • Weka: Another general purpose machine learning library written in Java. Also appears to be used by the folks at Pentaho.
  • MLC++: A machine learning library written in C++ at Stanford; now distributed by SGI.
  • Python and related machine learning libraries: I need to research the best options available in this environment.

Any other suggestions?

Project Miracle

OK, I’ve started this blog in part to motivate me to put together a project that will allow me to tie together several different pieces of technology and subject matter that are of interest to me at the moment.

I’d like to build an application that brings together some text processing (including basic text analysis/indexing as well as natural language processing), data analysis (including experimenting with alternate types of data models), machine learning, and maybe some data visualization to make it all look pretty.

The subject matter I’m interested in is analyzing music, specifically songs from the rock group Queen.  I’d like to assemble a database of Queen songs, focusing initially on the ‘metadata’ for each song (track title, length, composer, release date, etc.) in addition to lyrics.  Later, I may add actual audio processing to the application, but for now, we’ll deal with the mostly structured data that consists of song metadata and of course the textual content of the lyrics.
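
As a rough sketch, a single record in that database might look something like the following. The field names are my own guesses, not a settled schema, and the lyrics are elided:

    # A sketch of one possible song record; field names are assumptions,
    # not a finalized schema.
    song = {
        "title": "Bohemian Rhapsody",
        "album": "A Night at the Opera",
        "length_seconds": 355,            # 5:55
        "composer": "Freddie Mercury",
        "release_date": "1975",
        "lyrics": "...",                  # full text to be scraped later
    }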

For what purposes would this application be constructed?  In part, it’s to give me a reason to experiment with different technologies, but I do envision some potentially practical usages (‘practical’ may be a stretch if you’re not a devoted Queen fan, however).

First, a little bit of background. In 1989, after a roughly three-year hiatus, Queen released the album ‘The Miracle’. The song credits on that album were all attributed to Queen (that is, Freddie Mercury, Brian May, Roger Taylor and John Deacon), as opposed to being credited to the member who initially had the idea for the song or wrote most of the lyrics, as was previously the case. This was in recognition of the fact that most of Queen’s songs had always been created collaboratively, i.e. all four members contributed something to the finished product (thus the inspiration for the somewhat creepy album cover shown below). However, most songs still had a genesis with one of the band members, and from analysis and interviews conducted over the years, the Queen fan community has a fairly good idea of which song was originally conceived by which band member (some are also assumed to be co-authored by two members).

My question is: given a database of songs from other Queen albums, can we use so-called data science (for example, machine learning techniques) to predict who wrote a particular song? Hence the inspiration for the project’s name: Project Miracle. What information gives us the best hint as to who wrote a song? Is it the lyrics (e.g. lyrical complexity), the musical key, the time signature, or something else?
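
To make the idea concrete, here is a minimal sketch of what such a prediction might look like in Python, assuming scikit-learn (just one of the Python options I still need to research) and a toy set of lyrics labeled by composer. The training data below is obviously made up and far too small to mean anything:

    # A minimal sketch, assuming scikit-learn; the lyrics and labels are
    # placeholders, not a real training set.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    lyrics = [
        "placeholder lyrics for a Mercury song",
        "placeholder lyrics for a May song",
        "placeholder lyrics for a Taylor song",
        "placeholder lyrics for a Deacon song",
    ]
    composers = ["Mercury", "May", "Taylor", "Deacon"]

    # TF-IDF turns each song's lyrics into a weighted bag-of-words vector;
    # naive Bayes then learns per-composer word distributions from them.
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(lyrics, composers)

    # Guess the composer of a song credited only to 'Queen'.
    print(model.predict(["placeholder lyrics for an unattributed song"]))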

More to come shortly.

Welcome to Fun In Space

Welcome to my new technology-related blog called Fun In Space. The blog with this name started out on Blogger many years ago while I tried to figure out what to do with it. I’ve since started Strange Frontier, which has mostly focused on topics related to politics, economics, and some philosophy. Here I will write about technology-focused topics that interest me, such as machine learning, information retrieval, data analysis, and data science. Stay tuned for more updates.