Related Work#

Writing the “related work” section for a project called “distributed” is a Sisyphean task. Below we list a few notable projects that you have probably already heard of.

You may also find the Dask comparison with Spark of interest.

Big Data World#

The venerable Hadoop provides batch processing with the MapReduce programming paradigm. Python users typically use Hadoop Streaming or MRJob.
Spark builds on top of HDFS with a nicer API and in-memory processing. Python users typically use PySpark.
Storm provides streaming computation. Python users typically use streamparse.

This is a woefully inadequate representation of the excellent work blossoming in this space. A variety of other projects have emerged that rival or complement the projects above. Still, most “Big Data” processing hype probably centers around the three projects above, or their derivatives.

Python Projects#

There are dozens of Python projects for distributed computing. Here we list a few of the more prominent projects that we see in active use today.

Task scheduling#

Celery: An asynchronous task scheduler, focusing on real-time processing.
Luigi: A bulk big-data/batch task scheduler, with hooks to a variety of interesting data sources.

Ad hoc computation#

IPython Parallel: Allows for stateful remote control of several running IPython sessions.
Scoop: Implements the concurrent.futures API on distributed workers. Notably allows tasks to spawn more tasks.

Direct Communication#

MPI4Py: Wraps the Message Passing Interface popular in high performance computing.
PyZMQ: Wraps ZeroMQ, the high-performance asynchronous messaging library.

Venerable#

There are a couple of older projects that often get mentioned:

Dispy: Embarrassingly parallel function evaluation
Pyro: Remote objects / RPC

Relationship#

In relation to these projects, distributed…

Supports data-local computation like Hadoop and Spark
Uses a task-graph abstraction with data dependencies, like Luigi
Supports ad hoc applications, like IPython Parallel and Scoop

In depth comparison to particular projects#

IPython Parallel#

Short Description

IPython Parallel is a distributed computing framework from the IPython project. It uses a centralized hub to farm out jobs to several ipengine processes running on remote workers. All communication travels over ZeroMQ sockets and is routed through that hub.

IPython Parallel has been around for a while and, while not particularly fancy, is quite stable and robust.

IPython Parallel offers parallel map and remote apply functions that route computations to remote workers:

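>>> from ipyparallel import Client
>>> view = Client(...)[:]  # direct view over all engines
>>> results = view.map(func, sequence)  # parallel map across the engines
>>> result = view.apply(func, *args, **kwargs)  # remote call on every engine in the view
>>> future = view.apply_async(func, *args, **kwargs)  # non-blocking variant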

It also provides direct execution of code in the remote process and collection of data from the remote namespace.

>>> view.execute('x = 1 + 2')
>>> view['x']
[3, 3, 3, 3, 3, 3]

Brief Comparison

Distributed and IPython Parallel are similar in that they provide map and apply/submit abstractions over distributed worker processes running Python. Both manage the remote namespaces of those worker processes.

They are dissimilar in terms of their maturity, how worker nodes communicate with each other, and the complexity of the algorithms that they enable.

Distributed Advantages

The primary advantages of distributed over IPython Parallel include:

Peer-to-peer communication between workers
Dynamic task scheduling

Distributed workers share data in a peer-to-peer fashion, without having to send intermediate results through a central bottleneck. This allows distributed to be more effective for more complex algorithms and to manage larger datasets in a more natural manner.

IPython Parallel does not provide a mechanism for workers to communicate with each other, except by using the central node as an intermediary for data transfer or by relying on some other medium, like a shared file system. Data transfer through the central node can easily become a bottleneck, so IPython Parallel has been mostly helpful in embarrassingly parallel work (the bulk of applications) but has not been used extensively for more sophisticated algorithms that require non-trivial communication patterns.

The distributed client includes a dynamic task scheduler capable of managing deep data dependencies between tasks. The IPython Parallel docs include a recipe for executing task graphs with data dependencies. This same idea is core to all of distributed, which uses a dynamic task scheduler for all operations. Notably, distributed.Future objects can be used within submit/map/get calls before they have completed.

>>> x = client.submit(f, 1)  # returns a future
>>> y = client.submit(f, 2)  # returns a future
>>> z = client.submit(add, x, y)  # consumes futures

The ability to use futures cheaply within submit and map methods enables the construction of very sophisticated data pipelines with simple code. Additionally, distributed can serve as a full dask task scheduler, enabling support for distributed arrays, dataframes, machine learning pipelines, and any other application built on dask graphs. The dynamic task schedulers within distributed are adapted from the dask task schedulers and so are fairly sophisticated and efficient.
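To make the idea of a task graph with data dependencies concrete, here is a minimal sketch in plain Python. It assumes the dict-of-tuples graph style used by dask, where a key maps to a `(function, *arguments)` tuple and an argument that names another key is a dependency. The names `inc` and `toy_get` are hypothetical, chosen only for illustration; this toy evaluator runs sequentially in one process and is not the scheduler that distributed actually uses.

```python
from operator import add


def inc(x):
    """Hypothetical stand-in for the function `f` in the example above."""
    return x + 1


# Each key maps to a task tuple: (function, *arguments).
# A string argument that is also a key in the graph is a data dependency.
graph = {
    "x": (inc, 1),
    "y": (inc, 2),
    "z": (add, "x", "y"),
}


def toy_get(graph, key, cache=None):
    """Evaluate `key`, computing each dependency at most once."""
    if cache is None:
        cache = {}
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # Resolve arguments that refer to other keys before calling func.
        resolved = [
            toy_get(graph, arg, cache)
            if isinstance(arg, str) and arg in graph
            else arg
            for arg in args
        ]
        result = func(*resolved)
    else:
        # Anything that is not a task tuple is treated as a literal value.
        result = task
    cache[key] = result
    return result


print(toy_get(graph, "z"))  # 5
```

A dynamic scheduler works from the same kind of dependency information, but it can run independent tasks such as "x" and "y" at the same time and decide where each one runs.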

IPython Parallel Advantages

IPython Parallel has the following advantages over distributed:

Maturity: IPython Parallel has been around for a while.
Explicit control over the worker processes: IPython Parallel allows you to execute arbitrary statements on the workers, allowing it to serve in system administration tasks.
Deployment help: IPython Parallel has built-in mechanisms to aid deployment on SGE, MPI, etc. Distributed does not have any such sugar, though it is fairly simple to set up by hand.
Various other advantages: Over the years IPython Parallel has accrued a variety of helpful features like IPython interaction magics, @parallel decorators, etc.

concurrent.futures#

The distributed.Client API is modeled after concurrent.futures and PEP 3148. It has a few notable differences:

distributed accepts Future objects within calls to submit/map. When chaining computations, it is preferable to submit Future objects directly rather than wait on them before submission.
The map() method returns Future objects, not concrete results, and it returns immediately.
Despite sharing a similar API, distributed Future objects cannot always be substituted for concurrent.futures.Future objects, especially when using wait() or as_completed().
Distributed generally does not support callbacks.

If you need full compatibility with the concurrent.futures.Executor API, use the object returned by the get_executor() method.
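For contrast, the sketch below shows the standard-library side of the first two differences, using only `concurrent.futures`. The helper `inc` is a hypothetical function used for illustration. With the standard-library executor, a Future passed as an argument is not unwrapped, so the caller has to call `.result()` before chaining, and `Executor.map` yields concrete results rather than Future objects.

```python
from concurrent.futures import ThreadPoolExecutor
from operator import add


def inc(x):
    """Hypothetical example function."""
    return x + 1


with ThreadPoolExecutor(max_workers=2) as executor:
    x = executor.submit(inc, 1)  # returns a concurrent.futures.Future
    y = executor.submit(inc, 2)  # returns a concurrent.futures.Future

    # The standard library does not resolve futures passed as arguments,
    # so we block on the inputs ourselves before submitting the next task.
    z = executor.submit(add, x.result(), y.result())
    total = z.result()

    # Executor.map returns an iterator of concrete results, not futures.
    mapped = list(executor.map(inc, range(3)))

print(total)   # 5
print(mapped)  # [1, 2, 3]
```

With distributed, the equivalent chain passes `x` and `y` directly to `submit`, as in the `client.submit(add, x, y)` example earlier on this page, which avoids blocking the client between steps.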