How long would it take you to copy and paste 230 things from one text file to another? If your answer is more than 10 minutes, you could probably use this short script I wrote tonight.

I needed to expand the list of URLs that serve as “seeds” for my search engine project. I knew I had a nice collection of design blogs in my Google Reader, so I thought I could export those as an OPML file and extract the URLs. The OPML file is just XML that looks like this:


<outline title="Design Feeds" text="Design Feeds">
    <outline text="10 Steps" title="10 Steps" type="rss"
        xmlUrl="http://feeds.feedburner.com/10Steps" htmlUrl="http://10steps.sg"/>
    <outline text="1st Web Designer" title="1st Web Designer" type="rss"
        xmlUrl="http://feeds.feedburner.com/1stwebdesigner" htmlUrl="http://www.1stwebdesigner.com"/>

I didn’t want the RSS feed URLs, just the homepage URLs for my web crawler. I could have sat around copying and pasting for hours, but I’m lazy and have other things to do, so I wrote a quick Perl script to do it for me.

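#!/usr/bin/perl
use strict;
use warnings;

# Expect exactly one argument: the OPML file to read.
die "Need OPML file to extract URLs from, e.g. google-reader-subscriptions.xml\n"
    unless @ARGV == 1;

open(my $fh, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!\n";

# Hash keys are unique, so duplicate URLs collapse automatically.
my %urls;

while (<$fh>) {
    # Record a URL only when this line really contains an htmlUrl attribute.
    $urls{$1} = 1 if /htmlUrl="(.*?)"/;
}

close $fh;

print join("\n", keys %urls), "\n";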

The code just says to open a file, look for anything of the form htmlUrl="http://domain.com", and print all of the unique URLs. Storing each URL as a hash key is what removes the duplicates, since a hash can only hold a given key once. There might be a simpler or shorter way to write this in Perl, but I thought this was straightforward enough. Leave a comment with your simplified or shorter method :)
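As one possible variation, here is a minimal sketch that keeps the URLs in the order they first appear and also catches several htmlUrl attributes on the same line, which the line-by-line script above would miss. The helper name extract_html_urls and the inline sample data are my own assumptions for illustration, not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): returns the unique
# htmlUrl values found in a string of OPML, in first-seen order.
sub extract_html_urls {
    my ($opml) = @_;
    my (%seen, @urls);
    # With /g, the while loop visits every match, even several on one line.
    while ($opml =~ /htmlUrl="(.*?)"/g) {
        push @urls, $1 unless $seen{$1}++;
    }
    return @urls;
}

# Sample input modelled on the OPML excerpt above, with one repeated entry.
my $sample = <<'OPML';
<outline text="10 Steps" title="10 Steps" type="rss"
    xmlUrl="http://feeds.feedburner.com/10Steps" htmlUrl="http://10steps.sg"/>
<outline text="1st Web Designer" title="1st Web Designer" type="rss"
    xmlUrl="http://feeds.feedburner.com/1stwebdesigner" htmlUrl="http://www.1stwebdesigner.com"/>
<outline text="10 Steps" title="10 Steps" type="rss"
    xmlUrl="http://feeds.feedburner.com/10Steps" htmlUrl="http://10steps.sg"/>
OPML

print "$_\n" for extract_html_urls($sample);
```

Run as is, this prints http://10steps.sg followed by http://www.1stwebdesigner.com, each once. To use it on a real export you would read the whole file into a string first and pass that in.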

I can then run it in Terminal with ./extractURLfromOPML google-reader-subscriptions.xml > seeds.txt and I’ll have a nice text file with one URL on each line. I copied and pasted all of those into my original seeds list, and within a couple of minutes I had around 250 URLs to crawl. This is just another reason why Perl (and programming in general) is awesome.

That code solves a specific problem, but it could easily be modified or expanded to solve other problems. For example, say I want to know how many different artists I have in my iTunes library (I know iTunes will tell you, but go with it). I could change the regular expression in the code above to m|<key>Artist</key><string>(.*?)</string>| and it would match and print each distinct artist in my iTunes Library XML file.
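A rough sketch of that modification might look like the following. It assumes each Artist key and its string value sit together on one line of the library XML, as the regular expression above implies. The helper name count_artists and the sample lines are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: maps each artist name to the number of lines
# (tracks) on which it appears.
sub count_artists {
    my (@lines) = @_;
    my %tracks_by_artist;
    for my $line (@lines) {
        # Same pattern as in the text; only lines with an Artist key count.
        if ($line =~ m|<key>Artist</key><string>(.*?)</string>|) {
            $tracks_by_artist{$1}++;
        }
    }
    return %tracks_by_artist;
}

# Made-up sample lines standing in for a real iTunes Library XML file.
my @sample = (
    "\t\t\t<key>Name</key><string>Song One</string>\n",
    "\t\t\t<key>Artist</key><string>Artist A</string>\n",
    "\t\t\t<key>Name</key><string>Song Two</string>\n",
    "\t\t\t<key>Artist</key><string>Artist B</string>\n",
    "\t\t\t<key>Name</key><string>Song Three</string>\n",
    "\t\t\t<key>Artist</key><string>Artist A</string>\n",
);

my %artists = count_artists(@sample);

# The number of hash keys is the number of different artists.
print scalar(keys %artists), " different artists\n";
print "$_: $artists{$_}\n" for sort keys %artists;
```

With the sample lines this reports 2 different artists, with two tracks for Artist A and one for Artist B. Pointing it at a real library would mean reading lines from the XML file instead of the sample array.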

If you wanted to expand this modification, you could use the Perl modules for Facebook, MySpace, and Twitter to automatically send a friend request to all of the artists in your iTunes library on all of your social networks. It might not be a perfect solution, but it would save you some time and effort for sure.

It looks like I’ll have a few months this summer to freelance, so I’ll try to post some more simple code tutorials and examples. Enjoy!