Web scraping is the process of extracting data from web pages. It can be quite slow, especially when scraping thousands of pages. In this article, we'll cut that time down as much as we can.
In order to measure every approach, we need a benchmark: scraping the list of stocks from the Nasdaq website. This may not seem necessary, since Nasdaq lets you download a CSV with the list of companies quoted on the exchange, but that CSV is not well-formed: commas are used to separate the columns, yet some company names also contain commas, which breaks naive parsing (see the short illustration below). And since working with the stock market is a popular web scraping goal, I thought it was a good subject to work with.
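To make the problem concrete, here is a small, hypothetical illustration (the row and company name are just examples, not taken from the actual CSV): a comma inside an unquoted company name shifts every column after it.
import csv
from io import StringIO

# an unquoted name containing a comma splits into one field too many
bad_row = "A,Agilent Technologies, Inc.,$135.00"
print(bad_row.split(","))
# ['A', 'Agilent Technologies', ' Inc.', '$135.00']  -> 4 columns instead of 3

# with proper quoting, the csv module keeps the name in a single field
good_row = '"A","Agilent Technologies, Inc.","$135.00"'
print(next(csv.reader(StringIO(good_row))))
# ['A', 'Agilent Technologies, Inc.', '$135.00']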
Selenium is a library that provides functions to control your browser. While this mimics a regular user almost perfectly, it is extremely slow: it took 1058.95 seconds = 17 min 39 sec to acquire the list of companies from the 333 pages of the Nasdaq website, and it spawned 6 processes to do the job.
While this is extremely slow compared to the other solutions presented in this blog post, it is also the only way to execute the JavaScript contained in those pages. Perhaps you're in that situation and cannot avoid using Selenium.
Install selenium:
pip install selenium
Here's the code I used:
import time
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
URL = 'https://www.nasdaq.com/market-activity/stocks/screener'
CLOSE_BTN_XPATH = '//*[@id="onetrust-accept-btn-handler"]'
TABLE_XPATH = '/html/body/div[2]/div/main/div[2]/article/div[3]/div[1]/div/div/div[3]/div[3]/table/tbody'
NEXT_BTN_XPATH = '/html/body/div[2]/div/main/div[2]/article/div[3]/div[1]/div/div/div[3]/div[5]/button[2]'
def close_cookie_banner(driver, close_btn_xpath):
    """
    returns True if it closed the cookie banner
    """
    try:
        close_btn = driver.find_element(By.XPATH, close_btn_xpath)
        close_btn.click()
        return True
    except:
        return False
def get_companies(driver, table_xpath):
    """
    returns the text of every cell in the table as a flat list
    (symbols, company names and prices);
    if it couldn't get the table, it returns an empty list
    """
    try:
        table = driver.find_element(By.XPATH, table_xpath)
        cells = table.find_elements(By.XPATH, ".//*")
        symbols_companies_prices = [cell.text for cell in cells]
        return symbols_companies_prices
    except:
        return []
def click_next_page(driver, next_btn_xpath):
    """
    returns True if it could click on the 'next' button
    """
    try:
        next_page_btn = driver.find_element(By.XPATH, next_btn_xpath)
        next_page_btn.click()
        return True
    except:
        return False
driver = webdriver.Firefox()
print("1. scraper instantiated")
driver.get(URL)
print("2. page loaded")
while not close_cookie_banner(driver, CLOSE_BTN_XPATH):
    time.sleep(1.00)
print("3. cookie banner closed")
nb_pages = 333
all_companies = []
for page in range(nb_pages):
    print(f"page {page} on {nb_pages}")
    # get the data
    companies = get_companies(driver, TABLE_XPATH)
    while not companies:
        time.sleep(1.00)
        companies = get_companies(driver, TABLE_XPATH)
    all_companies.extend(companies)
    # get to the next page
    while not click_next_page(driver, NEXT_BTN_XPATH):
        time.sleep(1.00)
print("4. data acquired")
Requests is a Python library used to make HTTP requests to websites. It's often used by web scrapers because it's much faster than Selenium. However, it works synchronously: it sends only one request at a time and does not use any parallelization. It ran for 106.07 seconds = 1 min 46 sec.
pip install requests
import json
import requests
def get_partial_stock_list(page_num: int) -> list:
    headers = {
        'authority': 'api.nasdaq.com',
        'accept': 'application/json, text/plain, */*',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75',
        'origin': 'https://www.nasdaq.com',
        'sec-fetch-site': 'same-site',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': 'https://www.nasdaq.com/',
        'accept-language': 'fr,fr-FR;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    }
    params = (
        ('tableonly', 'true'),
        ('limit', '25'),
        ('offset', str(page_num * 25)),
    )
    url = "https://api.nasdaq.com/api/screener/stocks"
    response = requests.get(url, headers=headers, params=params)
    new_data_table: list = json.loads(response.text)['data']['table']['rows']
    return new_data_table
nb_pages: int = 333
stock_list: list = []
for page in range(0, nb_pages):
    print(f"page {page} on {nb_pages}")
    stock_list.extend(get_partial_stock_list(page))
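A small optimization that stays within the synchronous approach (and is not included in the timing above) is to reuse a single connection with requests.Session, so the 333 calls skip repeated connection and TLS handshakes. A minimal sketch with a hypothetical helper fetch_page, assuming the same URL and the full header set from the example above (abbreviated here):
import requests

URL = "https://api.nasdaq.com/api/screener/stocks"
HEADERS = {'user-agent': 'Mozilla/5.0'}  # reuse the full headers from the example above

def fetch_page(session: requests.Session, page_num: int) -> list:
    params = (('tableonly', 'true'), ('limit', '25'), ('offset', str(page_num * 25)))
    response = session.get(URL, headers=HEADERS, params=params)
    return response.json()['data']['table']['rows']

nb_pages: int = 333
stock_list: list = []
with requests.Session() as session:  # one session, one reused connection
    for page in range(nb_pages):
        stock_list.extend(fetch_page(session, page))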
This time, we're not going to ditch requests for a faster library the way we ditched Selenium for requests. Instead, we'll speed things up by sending our requests asynchronously. The normal (synchronous) behavior of requests is to send an HTTP request to the web server, wait for the response, process it once it arrives, and only then send the next request. All that waiting is very inefficient: a response may take seconds to arrive while the computer sits idle instead of doing something useful. To cut the program's running time, we can send our requests asynchronously, i.e. fire off new requests to the web server while we're still waiting for earlier responses.
It ran for 94.93 seconds = 1 min 35 sec.
pip install requests  # asyncio is part of the Python standard library
import json
import asyncio
import requests
async def get_partial_stock_list(page_num: int, nb_pages: int) -> list:
    print(f"page {page_num} on {nb_pages}")
    headers = { ... }  # same headers as in the requests example
    params = (
        ('tableonly', 'true'),
        ('limit', '25'),
        ('offset', str(page_num * 25)),
    )
    url = "https://api.nasdaq.com/api/screener/stocks"
    response = requests.get(url, headers=headers, params=params)
    new_data_table: list = json.loads(response.text)['data']['table']['rows']
    return new_data_table
async def get_stock_list(nb_pages) -> list:
    responses: list = []
    for page in range(nb_pages):
        responses.append(asyncio.create_task(get_partial_stock_list(page, nb_pages)))
    return await asyncio.gather(*responses)
nb_pages: int = 333
stock_list: list = asyncio.run(get_stock_list(nb_pages))
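A related sketch: because requests.get itself is a blocking call, one common pattern for overlapping it further is to push each call onto a thread with asyncio.to_thread (Python 3.9+, an assumption on your Python version). This reuses the plain, non-async get_partial_stock_list(page_num) from the requests example above:
import asyncio

async def get_stock_list_threaded(nb_pages: int) -> list:
    # each blocking requests.get runs in a worker thread, so the event loop
    # can launch the next page while earlier responses are still in flight
    tasks = [asyncio.to_thread(get_partial_stock_list, page) for page in range(nb_pages)]
    return await asyncio.gather(*tasks)

stock_list: list = asyncio.run(get_stock_list_threaded(333))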
This time, we'll not make asynchronous requests but use the multiple cores our computer possesses to parallelize sending our HTTP requests, receiving the responses and processing them. We'll use Joblib, which is not part of the Python standard library.
It ran for 61.18 seconds = 1 min 1 sec.
pip install joblib requests
import json
import requests
from joblib import Parallel, delayed
def get_partial_stock_list(page_num: int, nb_pages: int) -> list:
    print(f"page {page_num} on {nb_pages}")
    headers = { ... }  # same headers as in the requests example
    params = (
        ('tableonly', 'true'),
        ('limit', '25'),
        ('offset', str(page_num * 25)),
    )
    url = "https://api.nasdaq.com/api/screener/stocks"
    response = requests.get(url, headers=headers, params=params)
    new_data_table: list = json.loads(response.text)['data']['table']['rows']
    return new_data_table
nb_pages: int = 333
ParallelTask = Parallel(n_jobs=-1) # -1 => all cores
ParallelFunc = (delayed(get_partial_stock_list)(page, nb_pages) for page in range(nb_pages))
stock_list: list = ParallelTask(ParallelFunc)
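By default, Joblib runs the jobs in separate worker processes; because this workload is I/O-bound, you can also hint it to use threads with prefer="threads", which skips the cost of starting processes. A small variation of the snippet above (not benchmarked here), reusing get_partial_stock_list and nb_pages:
from joblib import Parallel, delayed

# prefer="threads" asks joblib for its threading backend, which is usually
# enough when the work is waiting on the network rather than on the CPU
stock_list: list = Parallel(n_jobs=-1, prefer="threads")(
    delayed(get_partial_stock_list)(page, nb_pages) for page in range(nb_pages)
)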
Again, we'll not make asynchronous requests but use our computer's multiple cores to parallelize sending the HTTP requests and processing the responses. This time, however, we'll use multiprocessing, which is part of the Python standard library.
It ran for 49.78 seconds, less than a minute.
import json
import requests
from multiprocessing import Pool, cpu_count
def get_partial_stock_list(page_num: int) -> list:
    print(f"page {page_num} on {nb_pages}")
    headers = { ... }  # same headers as in the requests example
    params = (
        ('tableonly', 'true'),
        ('limit', '25'),
        ('offset', str(page_num * 25)),
    )
    url = "https://api.nasdaq.com/api/screener/stocks"
    response = requests.get(url, headers=headers, params=params)
    new_data_table: list = json.loads(response.text)['data']['table']['rows']
    return new_data_table
nb_pages: int = 333
nb_cpu: int = cpu_count()

# the __main__ guard prevents the worker processes spawned by multiprocessing
# from re-running the Pool creation when they import this module
if __name__ == '__main__':
    with Pool(nb_cpu) as pool:
        responses: list = list(pool.map(get_partial_stock_list, range(nb_pages)))
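Note that pool.map returns one list of rows per page, so responses is a list of lists; to get the same flat stock_list as in the earlier examples, you can chain them together (this goes at the end of the __main__ block above):
from itertools import chain

# flatten the per-page lists returned by pool.map into one list of stock rows
stock_list: list = list(chain.from_iterable(responses))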
And finally, we'll use concurrent.futures, a standard Python library whose ThreadPoolExecutor runs many requests concurrently in a pool of threads: while one thread waits for its response, the others keep sending and processing requests. This speeds our program up another notch!
It ran for 16.63 seconds, less than 20 seconds!
import json
import requests
from concurrent.futures import ThreadPoolExecutor
def get_partial_stock_list(page_num: int) -> list:
    print(f"page {page_num} on {nb_pages}")
    headers = { ... }  # same headers as in the requests example
    params = (
        ('tableonly', 'true'),
        ('limit', '25'),
        ('offset', str(page_num * 25)),
    )
    url = "https://api.nasdaq.com/api/screener/stocks"
    response = requests.get(url, headers=headers, params=params)
    new_data_table: list = json.loads(response.text)['data']['table']['rows']
    return new_data_table
nb_pages: int = 333
with ThreadPoolExecutor() as executor:
    responses: list = list(executor.map(get_partial_stock_list, range(nb_pages)))
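By default, ThreadPoolExecutor caps the pool at roughly min(32, os.cpu_count() + 4) threads (Python 3.8+). Since the threads spend most of their time waiting on the network, it can be worth experimenting with a larger pool through max_workers; the value below is an arbitrary starting point, not a recommendation:
# tune max_workers for your machine and the server's rate limits
with ThreadPoolExecutor(max_workers=64) as executor:
    responses: list = list(executor.map(get_partial_stock_list, range(nb_pages)))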
Getting information from the web has become a challenge because of the sheer amount of data available. To acquire it faster, we have to use libraries optimized for the job, make our programs asynchronous, and spread the work across multiple cores in parallel.