Saturday, January 21, 2017

Post-midnight wget vs xkcd

I re-watched The Social Network yesterday. Dunno why, really.
Anyways, I noticed the wget moment at the start of the film, where he downloads some pics to put up on his site. I figured I'd give it a go. Turns out wget is a pretty fun tool.

After reading some docs and fooling around for a while, I was looking for something actually useful to do with wget. And that's when it hit me: what if you could download every xkcd comic up to the Nth one?
And the game was on.

First, I was gonna do this in Python. It's the only hacky/easy language I'm fluent in and there's no way I would mess around in Java or, God forbid, C. So Python it is then.
First things first, how about just getting the goddamn comics with wget? Not so fast, buddy. Tried that. All wget did was fetch xkcd's robots.txt and then refuse to download anything. Ughhh.
For those of you who don't know what that means: robots.txt is a file a site uses to tell "robots" (crawlers, bots, scripts, that kind of stuff) which parts of the site they're not supposed to fetch. Apparently wget falls under that category, and apparently it also complies with those rules by default, so you can't get past that. But actually, you can: there's an option to ignore them. So I just add "-e robots=off" and "--wait 1" to wget's arguments and we're set!
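Roughly speaking, the call ends up looking something like this. It's just a sketch from Python (since that's where the whole thing ended up anyway), and the -p flag and the URL are only there to show the shape of the command, not the exact thing I ran:

    import subprocess

    # Run wget with robots.txt checking turned off and a 1-second pause
    # between requests, so we're not hammering the server.
    subprocess.call([
        "wget",
        "-e", "robots=off",   # ignore robots.txt rules
        "--wait=1",           # wait a second between retrievals
        "-p",                 # also fetch the page's images and other requisites
        "https://xkcd.com/1/",
    ])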
Ughhh. Now it doesn't download anything. What the hell?
Apparently, xkcd has some pretty good defenses set up, 'cause you can't just wget the whole thing. For reference, I tried the same trick (robots.txt override and all) on another site and did manage to grab all of its images, or most of them anyway.
BUT, there's a solution. It's really hacky and ugly, and I'm sure there's a better way, but here's how I did it:
I noticed that, while wget couldn't grab the image through the normal xkcd page, the image URL is always sitting right there in the HTML file.
So what I did is: fetch the index.html, run through it to find the image URL, then wget the image from that URL. Pretty simple, right?
So it all ties together like this: you input how many comics you want, the program runs through pages xkcd.com/1/ through xkcd.com/N/, downloads and parses each .html, adds each image URL to a list, and once that's done it downloads all the images from those URLs and saves them in a folder.
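If you want to play along, here's a bare-bones sketch of that flow. It's not my actual script; the imgs.xkcd.com/comics/ pattern, the temp file names, and the output folder are just my assumptions about one way to wire it up:

    import os
    import re
    import subprocess

    def grab_comics(n, out_dir="xkcd"):
        # Pass 1: fetch each comic page and collect the image URLs.
        urls = []
        for i in range(1, n + 1):
            page = "page_%d.html" % i
            subprocess.call(["wget", "-e", "robots=off", "--wait=1",
                             "-O", page, "https://xkcd.com/%d/" % i])
            with open(page, "rb") as f:
                html = f.read().decode("utf-8", "ignore")
            # The page HTML contains a direct link to the comic image.
            match = re.search(r"https?://imgs\.xkcd\.com/comics/[^\"'\s<>]+", html)
            if match:
                urls.append(match.group(0))
            else:
                print("No image URL found for comic %d" % i)
        # Pass 2: download every image we found into one folder.
        if not os.path.isdir(out_dir):
            os.makedirs(out_dir)
        for url in urls:
            subprocess.call(["wget", "-e", "robots=off", "--wait=1",
                             "-P", out_dir, url])

    grab_comics(30)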
The only problem I've had with the whole procedure is that some .htmls would come down as unreadable gibberish (binary? hex? I honestly couldn't tell), so I couldn't read the image URL off of them. Page 3 was giving me a lot of shit: I tried wget -v, and then it worked on 3 but not on 2. After more testing I realized it's just intermittent: the same page sometimes downloads fine and sometimes doesn't.
So finally, partly because I'm just bored and partly because it's almost 3 am and I'm really tired, I just worked around the problem: you input a maximum number of tries, wget attempts each page up to that many times, and if it still doesn't succeed the script moves on. I tried having it retry forever, but some pages (#16, I think) just won't download properly no matter what. Anyway, with something like 100 tries you lose maybe 2-3 comics out of 30, so it's not that bad, AND they're numbered, so at the end you know exactly which ones you lost.
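The retry logic itself is nothing fancy; it boils down to something like this (again, just a sketch, with the fetch-and-parse step from above folded into a helper):

    import re
    import subprocess

    def fetch_image_url(i):
        # One attempt: wget the page for comic #i and look for the image URL.
        page = "page_%d.html" % i
        subprocess.call(["wget", "-e", "robots=off", "--wait=1",
                         "-O", page, "https://xkcd.com/%d/" % i])
        with open(page, "rb") as f:
            html = f.read().decode("utf-8", "ignore")
        match = re.search(r"https?://imgs\.xkcd\.com/comics/[^\"'\s<>]+", html)
        return match.group(0) if match else None

    def fetch_with_retries(i, max_tries):
        # Keep retrying a flaky page up to max_tries times before giving up.
        for _ in range(max_tries):
            url = fetch_image_url(i)
            if url is not None:
                return url
        print("Giving up on comic %d after %d tries" % (i, max_tries))
        return None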
Anyways, I'm off to bed now. But do try this. It's a fun exercise.

P.S.: You can download my script here (and you need to have wget and Python 2 installed, obviously).