Concurrent Requests with Python3

Intro

Pulling data from websites is often the first step of a data-analytic process.

The number of data resources required for an analysis influences the time this process takes. A few resources, of course, require little time to gather. But gathering data from 1000 resources (i.e. making 1000 API calls) could take a substantial amount of time. If the resources must be gathered on a repeating basis, the problem is compounded.

People new to Python might be uncertain how to make this process faster; here’s a demonstration and comparison of some approaches!

We start with a list of resources:

subs = [
    'politics', 'canada', 'funny', 'news', 'gifs', 'python',
    'worldnews', 'aww', 'movies', 'books', 'space', 'creepy',
]
endpoints = ['https://reddit.com/r/%s/top.json?t=day&limit=10' % s for s in subs]

Blocking

With the requests library, we can put the response from each resource into a list, as shown below.

Note, the resources are downloaded sequentially. The total time is approximately:

time_per_resource * number_of_resources.

import requests
%%timeit
done_blocking = [requests.get(u) for u in endpoints]
1 loop, best of 3: 8.46 s per loop
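
To check the time_per_resource * number_of_resources estimate without hitting the network, here is a minimal sketch. It assumes a stand-in function, fake_fetch (hypothetical, not part of requests), that just sleeps for a fixed delay, and made-up resource names.

```python
import time

def fake_fetch(url, delay=0.05):
    """Hypothetical stand-in for requests.get: sleep to simulate latency."""
    time.sleep(delay)
    return url

# Made-up resource names, used only for this illustration
urls = ['resource-%d' % i for i in range(6)]

start = time.perf_counter()
results = [fake_fetch(u) for u in urls]  # one after the other
elapsed_blocking = time.perf_counter() - start

# Estimate: delay * number of resources = 0.05 * 6 = 0.3 s
print('blocking: %.2f s' % elapsed_blocking)
```

The measured time should land close to the estimate, since each call has to finish before the next one starts.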

Parallel

Parallel methods split the acquisition of resources across workers. Workers can be threads or processes and are accessed through the Executor classes (ThreadPoolExecutor and ProcessPoolExecutor) from the concurrent.futures module. Users can overlook some of these details: requests_futures provides the same API as requests, with a parallel implementation underneath.

Each worker handles its tasks sequentially. *If the number of workers (threads or processes) is close to the number of tasks, the total time stays roughly fixed for any number of tasks; specifically, it is approximately the time of the longest task*:

from requests_futures.sessions import FuturesSession
from concurrent.futures import wait

session = FuturesSession(max_workers=len(endpoints))
%%timeit
futures = [
    session.get(u) for u in endpoints
]
done, incomplete = wait(futures)
1 loop, best of 3: 189 ms per loop

*More generally, the process requires roughly the time to process (number of tasks / number of workers) tasks in sequence*:

session = FuturesSession(max_workers=2)
%%timeit
futures = [
    session.get(u) for u in endpoints
]
done, incomplete = wait(futures)
1 loop, best of 3: 1.1 s per loop
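
The two timings above can be reproduced offline with concurrent.futures directly. This is a sketch under assumptions: fake_fetch is a hypothetical stand-in that sleeps instead of making a web request, and timed_fetch_all is a helper name invented for this example. With equal-length tasks, the estimate is ceil(tasks / workers) * time_per_task.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor, wait

def fake_fetch(url, delay=0.05):
    """Hypothetical stand-in for a web request: sleep to simulate latency."""
    time.sleep(delay)
    return url

# Made-up resource names, used only for this illustration
urls = ['resource-%d' % i for i in range(6)]

def timed_fetch_all(urls, max_workers):
    """Fetch every url on a thread pool; return (elapsed seconds, results)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fake_fetch, u) for u in urls]
        done, incomplete = wait(futures)
    elapsed = time.perf_counter() - start
    return elapsed, [f.result() for f in futures]

for workers in (len(urls), 2):
    elapsed, fetched = timed_fetch_all(urls, workers)
    estimate = math.ceil(len(urls) / workers) * 0.05
    print('%d workers: %.2f s (estimate %.2f s)' % (workers, elapsed, estimate))
```

With one worker per task, the total is about one delay; with 2 workers and 6 tasks, it is about three delays.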

Asyncio

A third method is asynchronous. In this case, nothing is guaranteed to happen in sequence. Tasks must have entry/exit points where the worker (i.e. the main thread) can leave them and work on something else. Here, the web request is such a point: once the first web request is started, the main thread moves on to something else, i.e. starting the next web request.

I have a hard time coming up with an expression for the duration of the asynchronous case. I suppose it’s something like:

time_not_waiting + max(time_for_task_i - time_task_i_started)

import asyncio
import aiohttp
import json

loop = asyncio.get_event_loop()
client = aiohttp.ClientSession(loop=loop)

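async def get_json(client, url):
    # Leaving the "async with" / "await" lines is where the loop can switch tasks
    async with client.get(url) as response:
        # Returns the raw body as bytes; pass it to json.loads to parse it
        return await response.read()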
%%timeit
result = loop.run_until_complete(
    asyncio.gather(
        *[get_json(client, e) for e in endpoints]
    )
)
1 loop, best of 3: 741 ms per loop
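
To see the task switching without aiohttp or a network connection, here is a minimal sketch that assumes a hypothetical coroutine, fake_get, which awaits asyncio.sleep in place of a web request. It uses asyncio.run, which starts and closes the event loop for you in a plain script.

```python
import asyncio
import time

async def fake_get(url, delay=0.05):
    """Hypothetical stand-in for a web request."""
    # The await is the exit point: the event loop runs other tasks meanwhile
    await asyncio.sleep(delay)
    return url

async def fetch_all(urls):
    # gather runs the coroutines concurrently and keeps the input order
    return await asyncio.gather(*[fake_get(u) for u in urls])

# Made-up resource names, used only for this illustration
urls = ['resource-%d' % i for i in range(6)]

start = time.perf_counter()
async_results = asyncio.run(fetch_all(urls))
elapsed_async = time.perf_counter() - start

print('async: %.2f s for %d tasks' % (elapsed_async, len(async_results)))
```

All six waits overlap on a single thread, so the total should be close to one delay rather than six. Inside a notebook, where an event loop is already running, asyncio.run raises an error; there you would write await fetch_all(urls) in the cell instead.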

When to use which?

There are a few ways to look at this. The key for me is that, in terms of simplicity, sequential > parallel > asynchronous. That’s my a priori preference.

  • For a few tasks, use sequential.

  • With a large number of tasks that cannot be meaningfully entered/exited (i.e. they are not waiting for input/output), use parallel. A good example here is running an operation on the rows of a data set which is already in memory.

  • With a large number of tasks which are usually waiting for input/output, use asynchronous. For web requests, asynchronous fits the bill when the number of tasks is large.

Large depends on how long a task takes and on your time sensitivity.

Bonus

Asynchronous parallel would be fascinating and useful for a very large number of i/o heavy tasks; if you have any idea how to achieve this do share!
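
One possible direction, offered as a sketch rather than a tested recipe: split the tasks into chunks and give each worker its own event loop. The example below assumes thread workers and a hypothetical fake_get coroutine that sleeps instead of making a request; chunked, fetch_chunk and run_chunk are names made up for this illustration.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

async def fake_get(url, delay=0.05):
    """Hypothetical stand-in for a web request."""
    await asyncio.sleep(delay)
    return url

async def fetch_chunk(chunk):
    # Within a chunk, all requests wait concurrently on one event loop
    return await asyncio.gather(*[fake_get(u) for u in chunk])

def run_chunk(chunk):
    # Each worker thread starts (and closes) its own event loop
    return asyncio.run(fetch_chunk(chunk))

def chunked(items, n_chunks):
    """Deal items into n_chunks lists, round-robin."""
    return [items[i::n_chunks] for i in range(n_chunks)]

# Made-up resource names, used only for this illustration
urls = ['resource-%d' % i for i in range(12)]

with ThreadPoolExecutor(max_workers=3) as pool:
    chunk_results = list(pool.map(run_chunk, chunked(urls, 3)))

# Flatten the per-chunk lists back into one list
combined = [u for chunk in chunk_results for u in chunk]
print('%d results from %d chunks' % (len(combined), len(chunk_results)))
```

With threads, this mostly spreads the event-loop bookkeeping across workers. If handling each response is CPU heavy, process workers would likely be needed instead, and I have not measured whether either variant beats a single event loop.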
