
Downloading images concurrently (Threads)

  • Writer: Andy Brave
  • Apr 23, 2022
  • 5 min read


In the previous section, we explored how to check whether a resource is an image, with the downside of slow execution. Imagine that you have a list of resources and you need to check whether all of them are images. You can process them one at a time, like people in a queue.

Imagine you are at an ATM with nine people ahead of you. If there is only one ATM, you have to wait for all nine of them to finish before you can withdraw your cash.

But what if there were 2 or 3 ATMs instead of only one? You would still have to wait for the people ahead of you to finish their withdrawals, but the time spent waiting for your turn would be shorter.

The same goes for our program: what if, instead of a single image downloader, we create 2, 3, or n instances of it?

Threads to the rescue!




Introduction to Threads

A thread is the smallest sequence of instructions that can be scheduled and executed by the operating system. A program can be composed of a single thread of execution or multiple threads of execution. When multiple CPU cores are available, the instructions of different threads can be executed at the same time, in parallel, on multiple cores. If only a single core is available, the threads share time on that core.

In either scenario, the result is that the use of multiple threads allows a process to perform multiple tasks at once. For example, in a media player, one thread can be playing the current song, while another is figuring out the next song to play and downloading it, while yet another thread is responding to user clicks and navigation.

Another example is a web or application server that uses a pool of threads to respond to multiple requests simultaneously. Each request is handled by a thread from the pool. The thread executes whatever task is assigned, and then when it is completed, it returns to the pool to wait for the next request. Multithreading is supported by virtually all operating systems and almost all programming languages, and Python is no exception.
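The pool idea can be sketched in Python with ThreadPoolExecutor from the standard library's concurrent.futures module. This is only a minimal sketch, not a real server: handle_request and the request ids are hypothetical names made up for illustration, and time.sleep stands in for the real work of a request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(request_id):
    """Hypothetical request handler; sleep stands in for real work."""
    time.sleep(0.05)
    return f"response for request {request_id}"


def serve(request_ids, pool_size=3):
    # The pool keeps at most pool_size threads; each thread picks up the
    # next request as soon as it has finished the previous one.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # map() yields the results in the same order as the inputs.
        return list(pool.map(handle_request, request_ids))


if __name__ == "__main__":
    for response in serve(range(6)):
        print(response)
```

With a pool you choose the number of threads once and reuse them, instead of creating one thread per request.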

Creating Threads in Python

Python has had support for threading since version 1.5.2 via the threading module in the standard library. This module allows you to create thread objects that map directly to operating system threads.

The simplest way to create a thread is to instantiate an object of the Thread class, passing in the thread function as well as any function arguments, and then call the start method on the thread object you just created. In the example below, we start by importing the threading module, and then we define the function that we want our thread to run. There are plenty of blog entries about threads and their implications that are worth reading.

Let’s explore the basics of Thread creation. (These are the features we will use).

import threading

def my_task(name):
    print(f"Hello from a thread, {name}")

name = 'Andy'
# args must be a tuple, so don't forget the trailing comma
t = threading.Thread(target=my_task, args=(name,))
t.start()
t.join()  # the main thread waits here until the new thread finishes

The target is the function to be invoked. You can give the thread a name using the name parameter. If you choose not to, a default name is used: the word Thread followed by a counter.

The args parameter is the argument tuple for the target function invocation, while the kwargs parameter is a dictionary of keyword arguments for the function, if you prefer to use that instead. This way of executing tasks with threads is the most common usage pattern for threading.
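Here is a small sketch of those parameters used together. The greeting keyword argument and the thread name "greeter" are invented for this illustration; everything else comes from the threading module.

```python
import threading


def my_task(name, greeting="Hello"):
    # current_thread() returns the Thread object that is running this code.
    print(f"{greeting} from {threading.current_thread().name}, {name}")


# Positional arguments go in args, keyword arguments go in kwargs.
named = threading.Thread(
    target=my_task,
    name="greeter",
    args=("Andy",),
    kwargs={"greeting": "Hi"},
)
# No name given: Python assigns a default one that starts with "Thread-".
unnamed = threading.Thread(target=my_task, args=("Andy",))

named.start()
unnamed.start()
named.join()
unnamed.join()

print(named.name)
print(unnamed.name)
```

Naming threads is optional, but it makes log lines and debugger output much easier to read once several threads are running.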

How Threads Work

When the program starts, there’s only one thread in existence, the main thread. The main thread executes the instructions for importing the threading library, defining the my_task function, and creating the name variable.


The main thread then creates the new thread. At this point, the new thread is in the new state. When the main thread calls start on the new thread object, the new thread goes into the ready state, which means that it is now available for the OS to schedule to run on a CPU. After this, the main thread calls join on the new thread. Until then the main thread had been in the running state, but now it goes into the blocked state, which means that it is suspended and can’t execute until something happens, in this case until the new thread finishes.
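Python does not expose the operating system states (ready, running, blocked) directly, but the is_alive method lets you observe the edges of the cycle. The following is a rough sketch under that limitation; time.sleep is only a stand-in for real work.

```python
import threading
import time


def my_task():
    time.sleep(0.1)  # stand-in for real work


t = threading.Thread(target=my_task)
alive_before_start = t.is_alive()   # created but not started yet

t.start()
alive_after_start = t.is_alive()    # started and still working

t.join()                            # the main thread blocks until t finishes
alive_after_join = t.is_alive()     # the thread has terminated

print(alive_before_start, alive_after_start, alive_after_join)
# prints: False True False
```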


The whole thread life cycle is shown here:

[Figure: thread life cycle diagram]



Enough theory, let’s go to the code!



import pytest
import time
from image_service import ImageDownloader
img_urls = \
       ["https://upload.wikimedia.org/wikipedia/commons/thumb/e/e3/Coat_types_3.jpg/500px-Coat_types_3.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/3/38/Anatomy_dog.png",
        "https://upload.wikimedia.org/wikipedia/commons/8/8c/Poligraf_Poligrafovich.JPG",
        "https://upload.wikimedia.org/wikipedia/commons/thumb/5/5c/Great_Dane_and_Chihuahua_Skeletons.jpg/1280px-Great_Dane_and_Chihuahua_Skeletons.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/4/42/Eye_of_a_dog.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/2/2b/Dog_nose2.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/2/23/Dog_retrieving_stick.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/thumb/4/44/Dog_puppy.jpg/1024px-Dog_puppy.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c0/Figueras_%28RPS_24-07-2020%29_sujeci%C3%B3n_para_perros.png/2560px-Figueras_%28RPS_24-07-2020%29_sujeci%C3%B3n_para_perros.png",
        "https://upload.wikimedia.org/wikipedia/commons/thumb/5/51/USMC-06639.jpg/1920px-USMC-06639.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/a/aa/AHey_Fatty.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/thumb/9/99/Cancer_beagle.jpg/1920px-Cancer_beagle.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/b/b6/Ejemplares_h%C3%ADbridos_de_la_raza_pekines_%28pequines%29.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/a/ae/Wilde_huendin_am_stillen.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/Puppies_Fighting.jpg/1920px-Puppies_Fighting.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/thumb/7/72/Cave_canem.JPG/1280px-Cave_canem.JPG",
        "https://upload.wikimedia.org/wikipedia/commons/c/cf/Big_and_small_dog.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Lucky_en_Panzerwiese%2C_M%C3%BAnich%2C_Alemania%2C_2014-12-24%2C_DD_02.JPG/1920px-Lucky_en_Panzerwiese%2C_M%C3%BAnich%2C_Alemania%2C_2014-12-24%2C_DD_02.JPG",
        "https://upload.wikimedia.org/wikipedia/commons/c/c0/Perrovaca_UNMSM.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Hanging_18.jpg/1920px-Hanging_18.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/2/27/Truemmer_18.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c7/U.S._Air_Force_military_working_dog_Jackson_sits_on_a_U.S._Army_M2A3_Bradley_Fighting_Vehicle_before_heading_out_on_a_mission_in_Kahn_Bani_Sahd%2C_Iraq%2C_Feb._13%2C_2007.jpg/1920px-thumbnail.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/thumb/1/10/Tursiops_truncatus_01.jpg/1920px-Tursiops_truncatus_01.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/b/b0/Dolphins_gesture_language.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/3/38/Orca_porpoising.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/5/5a/Baby_wolphin_by_pinhole.jpeg",
        "https://upload.wikimedia.org/wikipedia/commons/e/e1/Commdolph01.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/b/b1/DELFIN_DEL_ORINOCO2.JPG",
        "https://upload.wikimedia.org/wikipedia/commons/2/2d/Dolphin-Musandam_2.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/thumb/d/d8/Delphinus_capensis.JPG/1920px-Delphinus_capensis.JPG",
        "https://upload.wikimedia.org/wikipedia/commons/2/2b/Frazer%C2%B4s_dolphin_group.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/2/24/Northern_right_whale_dolphin.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Anim1018_-_Flickr_-_NOAA_Photo_Library.jpg/1920px-Anim1018_-_Flickr_-_NOAA_Photo_Library.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/7/7e/Steno_bredanensis_2.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/thumb/1/10/Tursiops_truncatus_01.jpg/1920px-Tursiops_truncatus_01.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/3/38/LF_Pilot_Whale_Goban_Spur.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4e/Peponocephala_electra_Mayotte.jpg/1920px-Peponocephala_electra_Mayotte.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/thumb/3/37/Killerwhales_jumping.jpg/1920px-Killerwhales_jumping.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a2/Pseudorca_crassidens_at_Okichan_Theater_20070201.jpg/1920px-Pseudorca_crassidens_at_Okichan_Theater_20070201.jpg",
        "https://upload.wikimedia.org/wikipedia/commons/b/b8/Kentriodon_BW.jpg"
        ]
@pytest.fixture
def checker():
   return ImageDownloader()
def test_is_image_not_concurrent(checker):
   imgs_bytes = []
   print("beginning image downloads")
   start = time.perf_counter()
   for url in img_urls:
       imgs_bytes.append(checker.get_bytes(url))
   end = time.perf_counter()
   print("downloaded {} images in {} seconds".format(len(img_urls),
         end - start))
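def test_is_image_concurrent(checker):
   print("beginning image downloads")
   start = time.perf_counter()
   # One call downloads every URL and returns only after all threads finish
   imgs_bytes = checker.get_bytes_concurrent(img_urls)
   end = time.perf_counter()
   assert len(imgs_bytes) == len(img_urls) == 40
   print("downloaded {} images in {} seconds".format(len(img_urls),
         end - start))

The first part is the test. I’m going to download 40 images from Wikipedia, store them in a list of bytes and measure the time elapsed. There are two tests: test_is_image_not_concurrent and test_is_image_concurrent.

The second part is the method that the concurrent test calls:

# image_service.py (method of the ImageDownloader class)
from queue import Empty, Queue
from threading import Thread

def get_bytes_concurrent(self, urls: list, num_dl_threads: int = 10) -> list:
    """
    :param urls: URLs for the resources
    :param num_dl_threads: number of download threads
    :return: bytes of each image, in the same order as urls
    """
    results = [None] * len(urls)
    jobs = Queue()
    for job in enumerate(urls):
        jobs.put(job)

    def worker():
        # Each thread keeps taking URLs until the queue is empty.
        # If get_bytes raises, this worker stops and that slot stays None.
        while True:
            try:
                index, url = jobs.get_nowait()
            except Empty:
                return
            results[index] = self.get_bytes(url)

    threads = [Thread(target=worker) for _ in range(num_dl_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every download before returning
    return results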

Now the performance

test_is_image_concurrent.py::test_is_image_not_concurrent PASSED [100%]
beginning image downloads
downloaded 40 images in 9.386144178 seconds

test_is_image_concurrent.py::test_is_image_concurrent PASSED     [100%]
beginning image downloads
downloaded 40 images in 0.831834199 seconds

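We go from about 9.4 seconds to less than 1 with 10 threads! But treat the result with care. The clock must be stopped only after every thread has been joined; otherwise it measures how long it takes to start the threads, not how long the downloads take. Adding more threads will also not always improve the performance of the app, because switching between threads costs time and CPU. For your app, you will need to figure out the ideal balance.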
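One way to look for that balance is to time the same workload with different thread counts. The sketch below is an assumption-heavy stand-in, not the ImageDownloader from this post: fake_download is a hypothetical function that sleeps instead of touching the network, and the example.com URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url):
    """Hypothetical stand-in for a download: sleep instead of using the network."""
    time.sleep(0.05)
    return url


def timed_run(num_threads, urls):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(fake_download, urls))
    # Leaving the with-block waits for all workers, so the clock stops
    # only after every task has finished.
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    urls = [f"https://example.com/img_{i}.jpg" for i in range(20)]
    for num_threads in (1, 2, 5, 10, 20):
        _, elapsed = timed_run(num_threads, urls)
        print(f"{num_threads:>2} threads: {elapsed:.2f} s")
```

Because the stand-in only sleeps, the time keeps dropping until there is one thread per task. Real downloads are limited by bandwidth, the remote server and CPU, so the curve usually flattens earlier; measuring with your own workload is the only reliable guide.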


Unfortunately, it doesn’t always end happily

Using threads opens the door to another type of problem: thread interference.

Thread interference is also commonly referred to as a race condition: two or more threads race to read or update the same variable, and you may get different results depending on which thread runs which code at what time. Since this is out of the programmer’s control, sometimes your program produces a correct outcome and other times it doesn’t. This can be very hard to debug, as the data corruption may occur only intermittently in your production environment under heavy load, while the code runs fine in your local development environment. To solve this problem, we use a technique called thread synchronization. There are plenty of blogs about this topic.
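A lock is the most basic synchronization tool. The sketch below uses threading.Lock to protect a shared counter; the Counter class and the run helper are hypothetical names written for this illustration.

```python
import threading


class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # "value += 1" is a read, an add and a write. The lock lets only
        # one thread at a time run these three steps.
        with self._lock:
            self.value += 1


def run(num_threads=8, increments_per_thread=10_000):
    counter = Counter()

    def worker():
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(run())  # num_threads * increments_per_thread
```

Without the lock, two threads can read the same old value and one of the updates is lost, which is exactly the intermittent kind of bug described above.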

Conclusion: When to use threads?

Not every problem should be solved with this technique, because not every problem can be parallelized. Unfortunately, there are no hard and fast rules for using threads. If you have too many threads, the processor will spend much of its time creating and switching between them. On the other hand, with too few threads you will not get the throughput you want from your application.

The situations in which I most often need to consider creating threads explicitly are:

· Asynchronous operations

· Operations that can be parallelized

· Continual running background operations
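As an illustration of the last case, here is a minimal sketch of a background thread that runs until it is told to stop. The background_worker function and the ticks list are hypothetical; a real worker would poll a queue, refresh a cache or something similar.

```python
import threading
import time


def background_worker(stop_event, ticks, interval=0.01):
    # Event.wait() returns False on timeout and True once the event is set,
    # so the loop wakes up every `interval` seconds until asked to stop.
    while not stop_event.wait(interval):
        ticks.append("tick")


stop_event = threading.Event()
ticks = []
worker = threading.Thread(
    target=background_worker,
    args=(stop_event, ticks),
    daemon=True,  # a daemon thread does not keep the program alive on exit
)
worker.start()

time.sleep(0.1)   # the main thread does its own work here
stop_event.set()  # ask the background thread to stop
worker.join()
print(f"background thread ticked {len(ticks)} times")
```

Using an Event gives the thread a clean way to finish, instead of relying only on the daemon flag to cut it off when the program ends.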

That’s enough folks! Hope this helps you! See you soon!




©2022 by andybravo.
