December 19, 2008

lxml + eventlet mashup

Ian was kind enough to send me instructions that finally got me a working lxml (I had never been able to compile it before), so I thought I'd write a quick scraper by mashing lxml together with eventlet.

The result is a thing of beauty:

from os import path
import sys

from eventlet import coros
from eventlet import httpc
from eventlet import util

from lxml import html

## Make httpc work -- I'll make it work without this soon
util.wrap_socket_with_coroutine_socket()

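def get(linknum, url):
    # Save one file in the current directory, named after the last URL segment.
    print "[%s] downloading %s" % (linknum, url)
    out = open(path.basename(url), 'wb')
    try:
        out.write(httpc.get(url))
    finally:
        # Close explicitly instead of relying on garbage collection.
        out.close()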

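def scrape(url):
    root = html.parse(url).getroot()
    # Resolve relative hrefs against the page URL so httpc.get sees full URLs.
    root.make_links_absolute(url)
    # The pool runs at most 8 downloads at a time.
    pool = coros.CoroutinePool(max_size=8)
    linknum = 0
    for link in root.cssselect('a'):
        href = link.get('href', '')
        if href.endswith('.mp3'):
            linknum += 1
            pool.execute(get, linknum, href)
    # Block until every queued download has finished.
    pool.wait_all()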

if __name__ == '__main__':
    if len(sys.argv) == 2:
        scrape(sys.argv[1])
    else:
        print "usage: %s url" % (sys.argv[0], )

This script manages to max out my bandwidth -- 800KB/sec at home and 2.5MB/sec at work -- without breaking a sweat. It oscillates between about 10% and 20% CPU on my MacBook Pro. Nice!
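
For anyone without eventlet or lxml installed, the same shape (collect the .mp3 links, then push them through a pool capped at 8 workers) can be roughed out with nothing but the Python 3 standard library. This is only a sketch and not the script above: it swaps coroutines for threads via ThreadPoolExecutor, and the names LinkCollector, mp3_links, download_all and the fetch callback are all made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from posixpath import basename
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect the href of every <a> tag while parsing."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def mp3_links(page_url, page_html):
    """Return absolute URLs of the .mp3 links in page_html, in page order."""
    collector = LinkCollector()
    collector.feed(page_html)
    # Resolve relative links against the page URL before filtering.
    absolute = (urljoin(page_url, href) for href in collector.hrefs)
    # Check only the path, so a query string after ".mp3" does not hide a match.
    return [url for url in absolute if urlsplit(url).path.endswith(".mp3")]


def download_all(urls, fetch, max_workers=8):
    """Call fetch(linknum, url) for every URL, at most max_workers at a time.

    Threads stand in here for the coroutine pool; results come back in
    submission order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch, n, url) for n, url in enumerate(urls, 1)]
        return [future.result() for future in futures]


def local_name(url):
    """File name a download would be saved under: the last path segment."""
    return basename(urlsplit(url).path)
```

Here fetch is whatever does the real work. A callback that downloads the URL and writes it to local_name(url) would mirror get() above, while a stub that only returns the file name is enough to try the pool without touching the network.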

Posted by Donovan at 21:14:31


Comments

Hey Donovan,

If you're doing what I think you're doing, also have a look at barbipes:

http://code.google.com/p/barbipes/

It's an mp3 spider with some smarts to keep it from downloading the same songs over and over after you delete them, plus a few other things. It is one of my main sources of new music these days.

It doesn't use lxml yet (just throwing out the admittedly ugly regexes didn't merit a new dependency in my case), and it's not exactly elegant, but it maxes out *my* glass fiber connection, and it does that without hitting any individual site too hard.

Posted by: eric casteleijn on December 24, 2008 9:49 AM

