December 20, 2003

perl module awesomeness

There's no such thing as a complete ISBN database. Some places, like the Library of Congress, have really big databases, but none of them is complete. I learned a lot about ISBNs tonight while trying to find a database I could use: I ended up writing a Perl script to search the LOC database, and I also wrote a sweetly simple screen-scraping script.

The LOC database isn't international, and it's not complete either (some records lack ISBNs for some reason). Somehow places like amazon.com compile nearly complete ISBN databases, though. You can't really get at Amazon's database, however.

I wanted to programmatically access book information by ISBN. I thought I'd have to write a screen-scraper to simulate a search and parse the information out of the results page. But LOC supports a Z39.50 search interface, and there's a Perl module for that (Net::Z3950). The search results come back in MARC format, and there's a Perl module for that too (MARC::Record). It took a while to get right, but I now have a very small Perl script that takes an ISBN and prints the title, author, edition, and date! Awesome!

#!/usr/bin/perl -w
use strict;
use Net::Z3950;
use MARC::Record;          # provides new_from_usmarc() and the title/author accessors
use MARC::File::USMARC;

# Print a usage line and quit if no ISBN was given.
print "syntax: search.pl ISBN\n" and exit if !exists $ARGV[0];

my $isbn = $ARGV[0];
my $host = 'z3950.loc.gov';
my $port = 7090;
my $db   = 'Voyager';

# Connect to the Library of Congress Z39.50 server.
my $conn = Net::Z3950::Connection->new($host, $port, databaseName => $db)
    or die $!;

# Search on use attribute 7 (ISBN).
my $rs = $conn->search("\@attr 1=7 $isbn")
    or die $conn->errmsg();
my $n = $rs->size();

# Ask for full records in USMARC format.
$rs->option(elementSetName => "f");
$rs->option(preferredRecordSyntax => "USMARC");

# Record numbers start at 1.
foreach my $i (1..$n) {
    my $rec = $rs->record($i) or die $rs->errmsg();
    my $m = MARC::Record->new_from_usmarc($rec->rawdata());
    print $m->title(), "\n";
    print $m->title_proper(), "\n";
    print $m->author(), "\n";
    print $m->edition(), "\n";
    print $m->publication_date(), "\n";
    print "\n";
}

$conn->close();
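The script above hands whatever is in $ARGV[0] straight to the server. One thing that could be bolted on in front is a check-digit test, so a mistyped ISBN gets caught before any network traffic. This is only a minimal sketch, assuming ten-digit ISBNs; the name is_valid_isbn10 is made up for illustration and is not part of the original script or of any module used above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): returns 1 when the
# string is a well-formed ten-digit ISBN with a correct check digit.
sub is_valid_isbn10 {
    my ($isbn) = @_;
    (my $s = uc $isbn) =~ s/[\s-]//g;        # drop hyphens and spaces
    return 0 unless $s =~ /^\d{9}[\dX]$/;    # nine digits, then a digit or X

    my @chars = split //, $s;
    my $sum = 0;
    for my $i (0 .. 9) {
        my $value = $chars[$i] eq 'X' ? 10 : $chars[$i];
        $sum += $value * (10 - $i);          # weights run 10, 9, ..., 1
    }
    return $sum % 11 == 0 ? 1 : 0;           # valid when the sum divides by 11
}

# The second candidate has its last digit changed, so it should fail.
for my $candidate ('0-306-40615-2', '0306406153') {
    printf "%s: %s\n", $candidate,
        is_valid_isbn10($candidate) ? 'ok' : 'bad check digit';
}
```

In search.pl this would presumably go right after the usage check, something like bailing out with a message unless is_valid_isbn10($isbn) is true.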

I did write a screen-scraper tonight, too, but for a different reason. Using Template::Extract, I wrote some terribly simple templates to extract links and data from web pages. Looping through the returned data has a funky syntax, but I managed (with some help from Data::Dump):

foreach my $course (@{$data->{'courses'}}) {
    print $course->{'dept'}, "\t", $course->{'number'}, "\t", $course->{'section'}, "\n";
}
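The funky part of that loop is the dereferencing. Here is a small sketch of it that runs on its own, with a hand-built $data standing in for the extracted result. The course values are invented, and the shape (a hash holding a reference to an array of hashes) is only inferred from the loop above. It uses the core Data::Dumper module rather than Data::Dump so nothing needs installing.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Hand-built stand-in for the extracted data; the values are made up and
# the nesting is an assumption based on how the loop above reads it.
my $data = {
    courses => [
        { dept => 'CS',   number => '101', section => '001' },
        { dept => 'MATH', number => '221', section => '002' },
    ],
};

# Dump the structure first to see the nesting before writing the loop.
$Data::Dumper::Sortkeys = 1;
print Dumper($data);

# $data->{courses} is an array reference, so @{ ... } turns it back into
# a list that foreach can walk. Each element is itself a hash reference.
my @lines;
foreach my $course (@{ $data->{courses} }) {
    # Hash slice on a reference: pulls the three fields out in order.
    push @lines, join("\t", @{$course}{qw(dept number section)});
}
print "$_\n" for @lines;
```

Dumping the structure first is what makes the `@{ ... }` wrapper obvious: the dump shows square brackets under courses, which means an array reference that has to be dereferenced before looping.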

Posted by Dave at December 20, 2003 03:02 AM