I've had this tiny project rattling around in my head for a while: parsing my viewing history from Hoopla. Unfortunately, they only show 24 shows per page, and my requests to have all of my shows listed at once have fallen on deaf ears. So I decided to see how hard it would be to scrape or parse the data myself.
The pages are all generated by JavaScript, and I'm not a web programmer, so I went the old-fashioned manual way: I went to each page and saved it as text, then used Perl to parse the text. Fortunately, the data saved this way was pretty structured, so I wrote some awful Perl code to parse each page:
    #!/usr/bin/perl

    my $line;
    my $incard = 11;

    while ($line = <>) {
        chomp($line);
        # print "parsing: $line\n";
        if ($line =~ /^card$/) {
            $incard = 0;
            # print "New Card - In Card: $incard\n";
        }
        if ($incard < 10) {
            # Process date
            if ($line =~ /Returned/) {
                my @group = split /\ /, $line;
                # print "Date: $group[2] - Incard line: $incard\n";
                $date = $group[2];
                $incard++;
            }
            # Process episode title
            if ($incard == 2) {
                # print "Episode Title: $line - Incard line: $incard\n";
                $episode = $line;
                $incard++;
            }
            # Process series title
            if ($line =~ /^Hide/) {
                $line =~ m/Hide.([A-Z].*).from.my.borrow.*$/;
                $series = $1;
                # print "Series: $1 - Incard line: $incard\n";
                $incard++;
            }
            # Process actor
            if ($incard == 4) {
                # print "Actor: $line - Incard line: $incard\n";
                $actor = $line;
                $incard++;
            }
            # Process link
            if ($line =~ /hoopladigital/) {
                # print "Link: $line - Incard line: $incard\n";
                $incard++;
            }
            # Process blank line
            if ($line =~ /^$/) {
                # print "Blank line: $incard \n";
                $incard++;
            }
            # print "parsing: $line\n";
            # should be the end of the record. Print it.
            if ($incard == 9) {
                print "$series\t$episode\t$date\t$actor\n";
            }
        }
    }
That way I can cat all the saved pages through this script and make a tab-separated table. I picked tabs since I assume there may be commas in either the series name or the episode titles from time to time. I don't expect tabs to show up in them.
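To make the tab choice concrete, here is a rough sketch of how the output can be sliced afterwards. The file names in the comment (page*.txt, hoopla.pl, history.tsv) are my own placeholders, and the sample rows are made up; they only mimic the series/episode/date/actor layout the script prints.

```shell
# Hypothetical file names, shown only as a comment since they are assumptions:
#   cat page*.txt | perl hoopla.pl > history.tsv

# Made-up sample rows in the same layout: series, episode, date, actor
sample=$(printf 'Some Show, Part Two\tPilot\t01/02/2020\tJane Doe\nOther Show\tHello, World\t03/04/2020\tJohn Roe\n')

# cut splits on tabs by default, so titles containing commas stay in one field
series=$(printf '%s\n' "$sample" | cut -f1)
episodes=$(printf '%s\n' "$sample" | cut -f2)

printf '%s\n' "$series"
printf '%s\n' "$episodes"
```

Splitting the same rows on commas would cut "Some Show, Part Two" in half, which is the failure the tab delimiter avoids.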