Acquiring data: the not-so-fun part of data analysis
OK, so it’s been a few weeks since my last post. In that time, I’ve been doing some more research on getting Project Miracle started. At the risk of stating the obvious, the tricky part of doing data analysis, or any data-intensive project for that matter, is actually acquiring the relevant data to analyze.
I need to start with building a Queen song database. My first thought was to utilize the MusicBrainz open source database, at least as a starting point. I have downloaded their database dump and have begun the process of building the MusicBrainz server software and setting up the database (more to come on that later). But the MusicBrainz database doesn’t contain any lyrics, so I’ll need to acquire those elsewhere.
Some possible candidates:
- Andy’s Queen Page – My favorite Queen web site from the 1990s was run by Andy Young. Even though it hasn’t been actively updated since 2000, it’s still one of my favorite resources. This page contains all of the Queen and solo songs up to 2000. The solo material will require further discussion, but I think we will want it included. The inclusion of solo material and what constitutes our ‘master’ database will probably require a separate blog post.
- Queenpedia Song List – This looks like the most comprehensive of the modern Queen discography web sites.
- Queen Online Official Discography – What the hell; might as well see what we can get from the “official” discography, though I think we will be hard-pressed to get as much info as Queenpedia.
There are plenty of non-Queen lyric sites available, but unless the three sites above fall short, I don’t think I’ll need to look into those yet. As a side note, I just received The Complete Illustrated Lyrics from Amazon, so for all of the songs from the main albums I should have a master reference for verifying the quality of the downloaded lyrics. (I should also be able to check whether the official Queen web site contains the same lyrical content as the book.)
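Checking downloaded lyrics against a reference could be roughed out along these lines. This is only a sketch using Python's standard `difflib` module; it assumes the reference text from the book has already been typed in by hand, and the function names (`normalize`, `similarity`, `differing_lines`) are made up for illustration.

```python
import difflib
import re


def normalize(text):
    """Lowercase, drop punctuation (except apostrophes), collapse whitespace."""
    text = re.sub(r"[^\w\s']", "", text.lower())
    return " ".join(text.split())


def similarity(downloaded, reference):
    """Return a 0.0-1.0 similarity score between two normalized lyric texts."""
    matcher = difflib.SequenceMatcher(None, normalize(downloaded), normalize(reference))
    return matcher.ratio()


def differing_lines(downloaded, reference):
    """Return a unified diff (as a list of lines) for manual review."""
    diff = difflib.unified_diff(
        reference.splitlines(),
        downloaded.splitlines(),
        fromfile="reference",
        tofile="downloaded",
        lineterm="",
    )
    return list(diff)
```

A low similarity score would flag a song for a closer look, and the diff shows which lines to check against the book. What score counts as "too low" is something I would have to tune by trial.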
I’ll need to scrape the lyrics data from the aforementioned web sites. Some possible solutions for this:
- Python – This Stack Overflow question contains some good answers on basic web scraping with Python. Scrapy is also a possible approach using Python, but it might be overkill for the relatively simple scraping I’ll need to do.
- Node.js – A friend of mine has recommended Node.js as the best solution for web scraping. He mentioned that you can use a full JavaScript and jQuery approach, just as if you were running in a browser. I usually manage to keep JavaScript at arm’s length, but I probably should get better at it, so I may have to try this out.
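To get a feel for the Python option, here is a minimal sketch that uses only the standard library. It assumes the lyrics sit inside a `<div class="lyrics">` element, which is a hypothetical page layout; I have not checked the actual markup of any of the sites above, so the selector logic would need adjusting per site.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LyricsExtractor(HTMLParser):
    """Collect text inside <div class="lyrics"> (a hypothetical container)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # 0 = outside the container; >0 = div nesting depth inside it
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1
            elif tag == "br":
                self.chunks.append("\n")  # treat <br> as a line break
        elif tag == "div" and ("class", "lyrics") in attrs:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_lyrics(html):
    """Return the lyric lines from an HTML string, one per line, blanks removed."""
    parser = LyricsExtractor()
    parser.feed(html)
    parser.close()
    lines = [line.strip() for line in "".join(parser.chunks).splitlines()]
    return "\n".join(line for line in lines if line)


def fetch_lyrics(url):
    """Download a page and extract its lyrics (url is supplied by the caller)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return extract_lyrics(response.read().decode(charset, errors="replace"))
```

For example, `extract_lyrics('<div class="lyrics">First line<br>Second line</div>')` returns the two lines separated by a newline. Note that this drops blank lines, so stanza breaks are lost; a library such as Scrapy would handle messier pages more gracefully, at the cost of more setup.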
I also recently reviewed the videos for Getting Data (Part 1 and Part 2) that were presented by Jeff Leek as part of the Coursera online course Data Analysis. (For some reason, unlike other Coursera courses, the content for the course is no longer available on the Coursera site, but you can review all of the videos on YouTube here.) “Getting Data (Part 2)” actually covers scraping web data using R and the R XML package. I think that could be an interesting approach to try out, but since part of my goal is to be able to use environments other than R, I think I’ll try one of the other approaches first.