Quick Perl Script to Extract Text from a File
How long would it take you to copy and paste 230 things from one text file to another? If your answer is more than 10 mins, you could probably use this short script I wrote tonight.
I needed to expand a list of URLs I already had that are serving as “seeds” for my search engine project. I knew I had a nice collection of design blogs in my Google Reader, so I thought I could export those as an OPML file and extract the URLs. The OPML file is just XML that looks like this:
<outline title="Design Feeds" text="Design Feeds">
<outline text="10 Steps" title="10 Steps" type="rss" xmlUrl="http://feeds.feedburner.com/10Steps" htmlUrl="http://10steps.sg"/>
<outline text="1st Web Designer" title="1st Web Designer" type="rss" xmlUrl="http://feeds.feedburner.com/1stwebdesigner" htmlUrl="http://www.1stwebdesigner.com"/>
I didn’t want their RSS feed URL, just the URL to their homepage for my web crawler. I could have sat around copying and pasting for hours, but I’m lazy and have other things to do so I wrote a quick Perl script to do that for me.
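#!/usr/bin/perl
use strict;
use warnings;

# Expect exactly one argument: the OPML file exported from Google Reader.
die "Need OPML file to extract URLs from, e.g. google-reader-subscriptions.xml\n" unless @ARGV == 1;

open(my $fh, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!\n";

my %urls;    # hash keys keep each URL only once
while (my $line = <$fh>) {
    # /g picks up every htmlUrl on the line; $1 is only read after a successful match
    while ($line =~ /htmlUrl="(.*?)"/g) {
        $urls{$1} = 1;
    }
}
close $fh;

print join("\n", keys %urls), "\n";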
The code just says to open a file, look for wherever you see htmlUrl="http://domain.com", and print all of the unique URLs. There might be a simpler or shorter way to write this in Perl, but I thought this was straightforward enough. Leave a comment with your simplified or shorter method.
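For what it's worth, here is one shorter take, as a rough sketch rather than the definitive way to do it. It leans on the fact that a /g match in list context returns every capture at once, and it uses a %seen hash to drop duplicates while keeping the order the URLs first appeared in. The extract_urls name is something I made up for this example, and the sample data is just the OPML excerpt from earlier in the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): returns the unique
# htmlUrl values found in a string of OPML, in order of first appearance.
sub extract_urls {
    my ($opml) = @_;
    my %seen;
    # In list context a /g match returns every captured group, so several
    # <outline> elements on the same line are all picked up.
    return grep { !$seen{$_}++ } $opml =~ /htmlUrl="(.*?)"/g;
}

# Sample data: the two feeds from the OPML excerpt shown earlier.
my $sample = <<'OPML';
<outline text="10 Steps" title="10 Steps" type="rss" xmlUrl="http://feeds.feedburner.com/10Steps" htmlUrl="http://10steps.sg"/>
<outline text="1st Web Designer" title="1st Web Designer" type="rss" xmlUrl="http://feeds.feedburner.com/1stwebdesigner" htmlUrl="http://www.1stwebdesigner.com"/>
OPML

print "$_\n" for extract_urls($sample);
```

To use it on a real export you would read the file into a string and pass that in. It is still plain regex matching, not a real XML parser, so it assumes the attribute is always written as htmlUrl="..." with double quotes.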
I can then run it in Terminal with ./extractURLfromOPML google-reader-subscriptions.xml > seeds.txt and I’ll have a nice text file with a URL on each line. I copied and pasted all of those into my original seeds list, and within a couple of minutes I had around 250 URLs to crawl. This is just another reason why Perl (and programming in general) is awesome.
That code solves a specific problem, but it could easily be modified or expanded to solve other problems. For example, say I want to know how many different artists I have in my iTunes library (I know iTunes will tell you, but go with it). I could change the regular expression in the above code to m|<key>Artist</key><string>(.*?)</string>| and it’ll match and print each unique artist in my iTunes Library XML file. Counting the hash keys then gives the number of artists.
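Here is roughly what that artist variation could look like, again as a sketch under a few assumptions. The unique_artists name is hypothetical, the sample entries are made up, and I'm assuming each Artist key and its string sit next to each other exactly as the regular expression expects. XML entities in artist names are left undecoded.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): collects the distinct
# artist names from iTunes Library XML text and returns them sorted.
sub unique_artists {
    my ($xml) = @_;
    my %artists;
    # m|...| as the delimiter avoids escaping the slashes in the closing tags.
    $artists{$1} = 1 while $xml =~ m|<key>Artist</key><string>(.*?)</string>|g;
    return sort keys %artists;
}

# Made-up sample entries in the shape the regex expects.
my $sample = <<'XML';
<key>Name</key><string>Song One</string>
<key>Artist</key><string>Artist A</string>
<key>Name</key><string>Song Two</string>
<key>Artist</key><string>Artist B</string>
<key>Name</key><string>Song Three</string>
<key>Artist</key><string>Artist A</string>
XML

my @artists = unique_artists($sample);
print scalar(@artists), " artists\n";
print "$_\n" for @artists;
```

Because the pattern requires the key to be exactly Artist, entries such as Album Artist are not counted.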
If you wanted to expand this modification, you could take the Perl modules for Facebook, MySpace, and Twitter to automatically send a friend request to all of the artists in your iTunes library on all of your social networks. It might not be a perfect solution, but it would save you some time and effort for sure.
It looks like I’ll have a few months this summer to freelance so I’ll try to post some more simple code tutorials and examples. Enjoy!

May 6th, 2009 at 12:16 pm
Hi, nice post. I have been wondering about this issue, so thanks for posting. I’ll certainly be coming back to your site. Keep up the good posts.
May 18th, 2009 at 8:54 am
very GOOD!!! =)
June 10th, 2009 at 8:52 pm
I disagree. ticketslayer@gmail.com
June 12th, 2009 at 9:21 pm
What do you disagree with?