Multi-threaded map() for Python

2007-09-05 at 04:39 | In devel, lang:en, talk | 8 Comments

The idea of a multi-processing map() for Python is quite nice. But what about a multi-threaded one? Threads usually cause less overhead than processes. If the mapping function is reasonably free of side effects (even if it performs some HTTP GETs, which are idempotent), the result does not depend on which parallel execution model you’ve selected. If it is not, any such approach is error-prone. I’ve implemented a very simple, exception-aware threaded map() that uses one thread per call. This is the basic usage scenario:

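# `measured` is a timing decorator and `map` is the threaded map() described
# above, shadowing the built-in (neither definition is shown in the post).
# `url` and `count` are set beforehand.
@measured
def single_threaded():
  # Fetch the same URL `count` times, one request after another.
  return [urlopen(url) for x in range(count)]

@measured
def multi_threaded():
  # The same requests, but each urlopen() call runs in its own thread.
  return map(lambda x: urlopen(url), range(count))

ps_s = single_threaded()
ps_m = multi_threaded()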

The results for url = "http://ya.ru/" and count = 1000:

single_threaded() is finished in 121.333 s
multi_threaded() is finished in 29.692 s
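The measured decorator that printed these lines is not shown in the post. A minimal sketch that would produce the same output format, assuming Python 3 and a wall-clock timer (the body is a guess at the behaviour, not the original code):

```python
import functools
import time


def measured(func):
    """Print how long each call to func takes, then return its result."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Report the elapsed time even if func raised.
            elapsed = time.perf_counter() - start
            print("%s() is finished in %.3f s" % (func.__name__, elapsed))
    return wrapper
```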

A multi-threaded map() is rather useful, isn’t it?

P. S. The first exception in a map() thread will be re-raised (with its traceback) in the main thread while others will be suppressed.
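The implementation itself is not included in the post. The behaviour described (one thread per call, first exception re-raised in the main thread, the rest suppressed) could be sketched roughly as follows, assuming Python 3. The name threaded_map and everything inside it are hypothetical, not the original code:

```python
import threading


def threaded_map(func, iterable):
    """Apply func to every item, running each call in its own thread.

    Results are returned in input order. If any call raises, the first
    exception recorded is re-raised in the calling thread (it keeps its
    original traceback); exceptions from other threads are dropped.
    """
    items = list(iterable)
    results = [None] * len(items)
    first_error = []          # holds at most one exception
    lock = threading.Lock()

    def worker(index, item):
        try:
            results[index] = func(item)
        except Exception as exc:
            with lock:
                if not first_error:
                    first_error.append(exc)

    threads = [threading.Thread(target=worker, args=(i, item))
               for i, item in enumerate(items)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    if first_error:
        raise first_error[0]
    return results
```

With count = 1000 this starts 1000 threads at once. A bounded pool such as the standard library's concurrent.futures.ThreadPoolExecutor, which has its own map() method, is a common alternative when that is too many.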

8 Comments »


  1. The reason I wrote that using processes and not threads is that Python uses a global lock around object access, so the current implementation might be a bit lacking in performance.

    We’ve been using NetWorkSpaces a lot lately. It requires a lot more advance planning, but the end results have been wonderful so far. You basically build a cluster, push requests to the network, and collect the results afterward. It also comes with a “sleigh” that can use SSH (or other methods) to automatically connect to remote servers and spawn the child processes to handle the workload.

    Of course, network overhead (especially involving SSHing to remote hosts and launching programs) will be significant depending on what it is you’re doing, but we’ve been pretty happy so far.

  2. @kirk: Yep, the GIL is the issue. I’m currently browsing docs on various threading implementations in different languages/environments. Many of those in interpreted languages are much better than Python’s. AFAIK, GvR won’t remove the GIL in Python 3.0, will he?

    Thanks for the link to NetWorkSpaces. I’ll take a look at it.

  3. I haven’t really followed Python 3 very closely, so I’m not sure. But I think they pretty much have to do something different if they don’t want to get left behind, since multi-core machines and heavy multithreading are becoming more common.

    I’d much rather use multithreading than multiprocessing here because you can get rid of serialization and all the limitations that brings (such as not being able to return file handles or sockets, not to mention the overhead). At this moment, though, the fork() version scales more linearly than the threaded version.

    But both are better than not having them at all. :-)

  4. In my case, the need for multi-threaded downloading eventually led me to write a threadpool decorator ;)

  5. @kirk: You might be interested in learning more about the GIL, especially in connection with Guido’s recent post. Here is one of the latest resources on this subject.

  6. Is this code still available? I can’t seem to find it.

