Faster Web Scraping

Web scraping is the process of extracting data from web pages. It can be quite slow, especially when scraping thousands of pages. In this article, we'll cut that time down as much as we can.

In order to measure every scraping approach, we need a common test: scraping the list of stocks from the Nasdaq website. This may seem unnecessary because Nasdaq offers a downloadable CSV with the list of companies quoted on the exchange, but that CSV is not well-formed: commas separate the columns, yet some company names also contain commas, which breaks naive parsing. And since the stock market is a popular web scraping target, I thought it was a good example to work with.
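To see why (using a made-up row, since the exact CSV contents vary), a naive comma split returns the wrong number of fields as soon as the company name contains a comma:

# hypothetical CSV row: the company name itself contains a comma
row = "ACME,Acme Robotics, Inc. Common Stock,$12.34"

# naive parsing yields 4 fields instead of the expected 3
print(row.split(","))
# ['ACME', 'Acme Robotics', ' Inc. Common Stock', '$12.34']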

Avoid Selenium if you can

Selenium is a library that provides functions to control your browser. While this mimics a regular user almost perfectly, it is extremely slow: it took 1058.95 seconds = 17 min 39 sec to acquire the list of companies from the 333 pages of the Nasdaq website, and it spawned 6 processes to do the job.

While this is extremely slow compared to the other solutions presented in this blog post, it is also the only one that executes the JavaScript contained in those pages. Perhaps you're in that situation and cannot avoid using Selenium.

Install Selenium:

  1. Install the Python library: pip install selenium
  2. Install the WebDriver that matches your browser (geckodriver for Firefox in my case)
  3. Don't forget to put the WebDriver in your PATH (you can check the setup with the smoke test below)
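Here is a minimal smoke test to check that everything is wired up before running the full scraper (assuming Firefox and geckodriver, as in the code below; adapt it to your browser):

from selenium import webdriver

# opens Firefox through the WebDriver; fails immediately if the driver isn't on the PATH
driver = webdriver.Firefox()
driver.get("https://www.nasdaq.com")
print(driver.title)  # should print the page title
driver.quit()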

Here's the code I used:

import time
import selenium

from selenium import webdriver
from selenium.webdriver.common.by import By


URL = 'https://www.nasdaq.com/market-activity/stocks/screener'
CLOSE_BTN_XPATH = '//*[@id="onetrust-accept-btn-handler"]'
TABLE_XPATH = '/html/body/div[2]/div/main/div[2]/article/div[3]/div[1]/div/div/div[3]/div[3]/table/tbody'
NEXT_BTN_XPATH = '/html/body/div[2]/div/main/div[2]/article/div[3]/div[1]/div/div/div[3]/div[5]/button[2]'


def close_cookie_banner(driver, close_btn_xpath):
    """
    returns True if it closed the cookie banner
    """
    try:
        close_btn = driver.find_element(By.XPATH, close_btn_xpath)
        close_btn.click()
        return True
    except Exception:
        return False

def get_companies(driver, table_xpath):
    """
    returns the text of every cell in the table (symbols, company names
    and prices, as a flat list)
    if it couldn't read the table, it returns an empty list
    """
    try:
        table = driver.find_element(By.XPATH, table_xpath)
        cells = table.find_elements(By.XPATH, ".//*")
        return [cell.text for cell in cells]
    except Exception:
        return []

def click_next_page(driver, next_btn_xpath):
    """
    returns True if it could click on the 'next' button
    """
    try:
        next_page_btn = driver.find_element(By.XPATH, next_btn_xpath)
        next_page_btn.click()
        return True
    except Exception:
        return False


driver = webdriver.Firefox()
print("1. scraper instantiated")

driver.get(URL)
print("2. page loaded")

while not close_cookie_banner(driver, CLOSE_BTN_XPATH):
    time.sleep(1.00)
print("3. cookie banner closed")


nb_pages = 333
all_companies = []
for page in range(nb_pages):
    print(f"page {page} on {nb_pages}")

    # get the data
    companies = get_companies(driver, TABLE_XPATH)
    while not companies:
        time.sleep(1.00)
        companies = get_companies(driver, TABLE_XPATH)
    all_companies.extend(companies)

    # get to the next page
    while not click_next_page(driver, NEXT_BTN_XPATH):
        time.sleep(1.00)
print("4. data acquired")

Requests

Requests is a Python library used to make HTTP requests. It's often used by web scrapers because it's much faster than Selenium. Instead of driving a browser, the code below queries the JSON API behind the Nasdaq screener page directly, so no JavaScript needs to be executed. However, requests works synchronously: it sends one request at a time and does not use parallelization. It ran for 106.07 seconds = 1 min 46 sec.

pip install requests
import json
import requests


def get_partial_stock_list(page_num: int) -> list:
    headers = {
        'authority': 'api.nasdaq.com',
        'accept': 'application/json, text/plain, */*',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75',
        'origin': 'https://www.nasdaq.com',
        'sec-fetch-site': 'same-site',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': 'https://www.nasdaq.com/',
        'accept-language': 'fr,fr-FR;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    }
    params = (
        ('tableonly', 'true'),
        ('limit', '25'),
        ('offset', str(page_num * 25)),
    )
    url = "https://api.nasdaq.com/api/screener/stocks"
    response = requests.get(url, headers=headers, params=params)

    new_data_table: list = json.loads(response.text)['data']['table']['rows']
    return new_data_table


nb_pages: int = 333
stock_list: list = []

for page in range(0, nb_pages):
    print(f"page {page} on {nb_pages}")
    stock_list.extend(get_partial_stock_list(page))
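A small optional tweak that was not part of the timing above: reusing a requests.Session keeps the underlying TCP/TLS connection open between pages, which usually shaves a little more time off the synchronous version. A sketch (a hypothetical variant of the same function):

import requests

session = requests.Session()      # reuses the connection across requests
session.headers.update(headers)   # the same headers dict as in get_partial_stock_list

def get_partial_stock_list_with_session(page_num: int) -> list:
    params = (
        ('tableonly', 'true'),
        ('limit', '25'),
        ('offset', str(page_num * 25)),
    )
    url = "https://api.nasdaq.com/api/screener/stocks"
    response = session.get(url, params=params)
    return response.json()['data']['table']['rows']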

Asynchronous Requests

This time, we're not ditching requests for a faster library the way we ditched Selenium for requests. Instead, we'll speed up those requests by sending them asynchronously. The normal (synchronous) behavior of requests is to send an HTTP request to the web server, wait for the response, process it once it arrives, then send the next request, and so on. Waiting is very inefficient: a response can take seconds to arrive while the computer sits idle instead of doing something useful. To cut the program's running time, we can keep sending new requests to the web server while waiting for earlier responses instead of staying idle.

It ran for 94.93 seconds = 1 min 35 sec.

pip install requests (asyncio is part of the Python standard library)
import json
import asyncio
import requests


async def get_partial_stock_list(page_num: int, nb_pages: int) -> list:
    print(f"page {page_num} on {nb_pages}")
    headers = { ... }  # same headers as in the previous example
    params = (
        ('tableonly', 'true'),
        ('limit', '25'),
        ('offset', str(page_num * 25)),
    )
    url = "https://api.nasdaq.com/api/screener/stocks"
    response = requests.get(url, headers=headers, params=params)

    new_data_table: list = json.loads(response.text)['data']['table']['rows']
    return new_data_table

async def get_stock_list(nb_pages) -> list:
    responses: list = []

    for page in range(nb_pages):
        responses.append(asyncio.create_task(get_partial_stock_list(page, nb_pages)))
    
    return await asyncio.gather(*responses)


nb_pages: int = 333

# asyncio.run() creates the event loop, runs the coroutine and closes the loop
stock_list: list = asyncio.run(get_stock_list(nb_pages))
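One caveat: requests.get is a blocking call, so the coroutines above never yield control while they wait, and asyncio alone adds only a modest amount of concurrency. To get real overlap while still using requests, one option (a sketch, not the version that was timed above) is to hand each blocking call to a worker thread with asyncio.to_thread, available since Python 3.9, reusing the synchronous get_partial_stock_list from the Requests section:

import asyncio

# get_partial_stock_list is the synchronous, single-argument function
# from the Requests section above

async def get_stock_list(nb_pages: int) -> list:
    # asyncio.to_thread runs each blocking call in a worker thread,
    # so many requests can be in flight at the same time
    tasks = [
        asyncio.to_thread(get_partial_stock_list, page)
        for page in range(nb_pages)
    ]
    return await asyncio.gather(*tasks)


stock_list: list = asyncio.run(get_stock_list(333))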

Parallelization with Joblib

This time, we won't make asynchronous requests; instead, we'll use the multiple cores our computer possesses to parallelize sending the HTTP requests, receiving the responses and processing them. We'll use the Joblib library, which is not part of the Python standard library.

It ran for 61.18 seconds = 1 min 1 sec.

pip install joblib requests
import json
import requests

from joblib import Parallel, delayed

def get_partial_stock_list(page_num: int, nb_pages: int) -> list:
    print(f"page {page_num} on {nb_pages}")
    headers = { ... }
    params = (
        ('tableonly', 'true'),
        ('limit', '25'),
        ('offset', str(page_num * 25)),
    )
    url = "https://api.nasdaq.com/api/screener/stocks"
    response = requests.get(url, headers=headers, params=params)

    new_data_table: list = json.loads(response.text)['data']['table']['rows']
    return new_data_table

nb_pages: int = 333
ParallelTask = Parallel(n_jobs=-1) # -1 => all cores
ParallelFunc = (delayed(get_partial_stock_list)(page, nb_pages) for page in range(nb_pages))
stock_list: list = ParallelTask(ParallelFunc)
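Note that Parallel returns one list of 25 rows per page, so stock_list is a list of lists here. If you want a single flat list like in the earlier examples, you can chain the per-page results:

from itertools import chain

# flatten [[row, ...], [row, ...], ...] into [row, row, ...]
flat_stock_list: list = list(chain.from_iterable(stock_list))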

Parallelization with Multiprocessing

Again, we won't make asynchronous requests; we'll use multiple cores to parallelize sending the HTTP requests, receiving the responses and processing them. This time, however, we'll use multiprocessing, a standard Python library.

It ran for 49.78 seconds, less than a minute.

import json
import requests

from multiprocessing import Pool, cpu_count

def get_partial_stock_list(page_num: int) -> list:
    print(f"page {page_num} on {nb_pages}")
    headers = { ... }
    params = (
        ('tableonly', 'true'),
        ('limit', '25'),
        ('offset', str(page_num * 25)),
    )
    url = "https://api.nasdaq.com/api/screener/stocks"
    response = requests.get(url, headers=headers, params=params)

    new_data_table: list = json.loads(response.text)['data']['table']['rows']
    return new_data_table

nb_pages: int = 333
nb_cpu: int = cpu_count()

# the __main__ guard is required for multiprocessing on Windows and macOS,
# where worker processes import this module instead of forking
if __name__ == "__main__":
    with Pool(nb_cpu) as pool:
        responses: list = list(pool.map(get_partial_stock_list, range(nb_pages)))

Parallelization with Multithreading

And finally, we'll use concurrent.futures, a standard Python library, to send the requests from a pool of threads. Because the job is mostly waiting on the network, threads are a great fit: while one thread waits for a response, the others keep sending requests and processing answers. This speeds our program up another notch!

It ran for 16.63 seconds, less than 20 seconds!

import json
import requests

from concurrent.futures import ThreadPoolExecutor

def get_partial_stock_list(page_num: int) -> list:
    print(f"page {page_num} on {nb_pages}")
    headers = { ... }
    params = (
        ('tableonly', 'true'),
        ('limit', '25'),
        ('offset', str(page_num * 25)),
    )
    url = "https://api.nasdaq.com/api/screener/stocks"
    response = requests.get(url, headers=headers, params=params)

    new_data_table: list = json.loads(response.text)['data']['table']['rows']
    return new_data_table

nb_pages: int = 333
with ThreadPoolExecutor() as executor:
    responses: list = list(executor.map(get_partial_stock_list, range(nb_pages)))
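By default, ThreadPoolExecutor uses min(32, os.cpu_count() + 4) worker threads (Python 3.8+). Since the work is almost entirely waiting on the network, it can be worth experimenting with a bigger pool and re-measuring; for example (64 is an arbitrary value, reusing get_partial_stock_list and nb_pages from above):

# larger thread pool; tune max_workers for your own connection and machine
with ThreadPoolExecutor(max_workers=64) as executor:
    responses: list = list(executor.map(get_partial_stock_list, range(nb_pages)))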

Conclusion

Getting information from the web has become a challenge because of the sheer amount of data available. To acquire it faster, we have to use libraries that are optimized for the job, make our programs asynchronous, and spread the work over threads or processes. In our test, that brought the running time down from almost 18 minutes with Selenium to under 20 seconds with a thread pool.