Downloading data¶
Overview¶
Here is the big picture of the different components that allow for data download:
The following sections will introduce each component of this architecture diagram.
Download script¶
A fetcher must define a download.py script:
#!/usr/bin/env python3
from dbnomics_toolbox.fetcher_utils.cli_utils.download_cli import DownloadCLI
from dbnomics_toolbox.fetcher_utils.cli_utils.download_cli_args import DownloadCLIArgs

import abc_fetcher
from abc_fetcher.downloader import Downloader


def main() -> None:
    cli = DownloadCLI(package_name=abc_fetcher.__name__)
    cli.start()
    downloader = Downloader(**cli.args.as_downloader_kwargs)
    downloader.start()
    cli.finalize(downloader)


if __name__ == "__main__":
    main()
This script can be called manually from the command line, and will be called by the DBnomics infrastructure in production.
Note the separation of concerns:
- the DownloadCLI class handles the CLI arguments and options, and saves an output state file (in the finalize method),
- the Downloader class takes care of the download logic.
Downloader¶
The Downloader class, inherited from BaseDownloader, is responsible for downloading data from the provider infrastructure.
To achieve its goal, it creates file resources (see the File resources section below) and downloads them.
Example of Downloader:
from collections.abc import Iterator
from typing import Unpack, override

from dbnomics_toolbox.fetcher_utils import BaseDownloader, BaseResource
from dbnomics_toolbox.fetcher_utils.resources import ResourceId
from yarl import URL

from dummy_fetcher.constants import API_BASE_URL
from dummy_fetcher.resources import DummyApiResource
from dummy_fetcher.source_data_repo import SourceDataRepo

__all__ = ["Downloader"]


class Downloader(BaseDownloader):
    """Download files from the provider infrastructure."""

    def __init__(
        self,
        **kwargs: Unpack[BaseDownloader.InitKwargs],
    ) -> None:
        super().__init__(**kwargs)
        self._source_data_repo = SourceDataRepo(source_data_dir=kwargs["source_data_dir"])

    @override
    def _iter_resources(self) -> Iterator[BaseResource]:
        # Defined later
        ...
The __init__ method instantiates the SourceDataRepo class, which is a fundamental pattern detailed in the next section.
Source data repository¶
We don’t want the Downloader to know too much about source data, especially file paths inside the source-data directory, and how to read data from files.
Remember: the Converter will also have to read data back from this directory, and we don’t want to duplicate this logic.
So we can define a SourceDataRepo class and use it from both the Downloader and the Converter.
For example, let’s say that the ABC provider exposes its datasets in a catalog.json file such as:
[
{"dataset_id": "BOP", "dataset_name": "Balance of payments"},
{"dataset_id": "GDP", "dataset_name": "Gross domestic product"}
]
Those catalog items can be modeled by a CatalogItem dataclass:
from dataclasses import dataclass

__all__ = ["CatalogItem"]


@dataclass(frozen=True, kw_only=True)
class CatalogItem:
    dataset_id: str
    dataset_name: str
The SourceDataRepo class can define methods that parse source-data files and expose their content as model objects from the provider domain.
To validate Python dicts and load them into dataclass instances, we recommend using a data loading library such as typedload or pydantic.
The dbnomics_toolbox.json_utils.load_json_file function uses typedload under the hood.
Example of SourceDataRepo:
from collections.abc import Iterator
from pathlib import Path

import daiquiri
from dbnomics_toolbox.json_utils import load_json_file

from abc_fetcher.source_data_model import CatalogItem

__all__ = ["SourceDataRepo"]

logger = daiquiri.getLogger(__name__)


class SourceDataRepo:
    """Load data from the `source-data` directory.

    Useful both to the Downloader (e.g. to know where to write the downloaded files)
    and to the Converter (e.g. to load those files).
    """

    def __init__(self, *, source_data_dir: Path) -> None:
        self._source_data_dir = source_data_dir
        self.catalog_file = source_data_dir / "catalog.json"

    def iter_catalog_items(self) -> Iterator[CatalogItem]:
        catalog_items = load_json_file(self.catalog_file, type_=list[CatalogItem])
        yield from catalog_items
The iter_catalog_items method will be called by the Downloader in order to download all the datasets, and by the Converter to convert them all.
File resources¶
A resource represents some data distributed by the provider that will be stored as a single file.
The abstract method FileResource._download is responsible for writing data to the target file.
Once the file is written, the resource validates the MIME type of the target file and reformats it according to its format (e.g. XML or JSON).
Those features are enabled by default with auto-detection, but can be disabled by passing arguments to __init__.
Most of the time you will use the HttpResource, a child class of FileResource which uses the requests library to download files.
For example we can define the following resource to download a JSON file from https://abc-provider.com/data/catalog.json:
from dbnomics_toolbox.fetcher_utils import HttpResource


class Downloader(BaseDownloader):
    # [...]

    @override
    def _iter_resources(self) -> Iterator[BaseResource]:
        yield HttpResource(
            id="catalog",
            request="https://abc-provider.com/data/catalog.json",
            target_file=self._source_data_repo.catalog_file,
        )
The target_file argument references the catalog_file attribute from the SourceDataRepo.
The request argument can be a URL or a requests.Request object.
See also the file resources page.
Simulate the provider web API¶
Obviously the URLs of the example, starting with https://abc-provider.com/, do not actually exist.
To make the downloader work with them, we’re going to intercept those requests and respond with fake data by using the responses package.
Install the package:
uv add responses
Override the BaseDownloader.start method:
import responses


class Downloader(BaseDownloader):
    # [...]

    @override
    def start(self) -> None:
        with responses.RequestsMock(assert_all_requests_are_fired=False) as rsps:
            rsps.add(
                responses.GET,
                "https://abc-provider.com/data/catalog.json",
                json=[
                    {"dataset_id": "BOP", "dataset_name": "Balance of payments"},
                    {"dataset_id": "GDP", "dataset_name": "Gross domestic product"},
                ],
                status=200,
            )
            super().start()
Run the download script¶
$ python download.py source-data
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created target directory: 'source-data'
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created cache directory: 'source-data/.cache'
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created debug directory: 'source-data/.debug'
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 1 resource... -- ids: ['catalog']
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource HttpResource(id='catalog') (1/1)
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Received HTTP response: <Response [200]>
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Started writing to temporary file 'source-data/catalog.part' (0 bytes)...
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Wrote file 'source-data/catalog.json' (127 bytes) -- duration: 0 seconds
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.json_utils: Start reformatting JSON file 'source-data/catalog.json' (127 bytes) with command '/usr/bin/jq --indent 2 < source-data/catalog.json > source-data/catalog.tmp'
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Reformatted JSON file 'source-data/catalog.json' (127 bytes) -- initial_file_size: 127 bytes duration: 0.01 seconds
INFO dbnomics_toolbox.fetcher_utils.processors.base_downloader: Finished downloading resource 'catalog' successfully -- duration: 0.06 seconds
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 1 resource -- duration: 0.06 seconds
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: The report has been saved to 'download_report.json' (59 bytes)
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: ProcessorStats(fail_count=0, skip_count=0, success_count=1)
DEBUG dbnomics_toolbox.fetcher_utils.cli_utils.download_cli: Downloader state not saved to a file, but logged: {'dataset_updates': {}}
The log of the script mainly shows that the resource having the ID catalog was downloaded successfully.
The other things shown by the log will be detailed in the following sections.
We can see that the catalog.json file was downloaded:
$ tree source-data
source-data
└── catalog.json
1 directory, 1 file
$ cat source-data/catalog.json
[
{
"dataset_id": "BOP",
"dataset_name": "Balance of payments"
},
{
"dataset_id": "GDP",
"dataset_name": "Gross domestic product"
}
]
Note: as shown in the logs, the JSON file was reformatted.
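To illustrate what this reformatting does to the file, here is a small standard-library sketch. It is only a stand-in: the logs show that the toolbox calls jq with --indent 2, whereas this example assumes that json.dumps with indent=2 plus a trailing newline is an acceptable approximation of that output for this catalog.

```python
import json

catalog = [
    {"dataset_id": "BOP", "dataset_name": "Balance of payments"},
    {"dataset_id": "GDP", "dataset_name": "Gross domestic product"},
]

# Single-line form, as json.dumps produces it by default.
compact = json.dumps(catalog)

# Re-indented form: 2 spaces per level and a trailing newline.
reformatted = json.dumps(catalog, indent=2) + "\n"

# Reformatting changes the layout and the size of the file, never the data.
assert json.loads(reformatted) == json.loads(compact)
print(len(compact.encode()), len(reformatted.encode()))
```

A stable layout such as this one keeps the source data readable and makes differences between two downloads easier to review.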
Resume mode¶
When the file of a resource already exists, that resource is skipped. This behavior is called the resume mode.
For example, if we run the download script again, we can read in the logs that the resource is skipped:
$ python download.py source-data
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data' as the target directory
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data/.cache' as the cache directory
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data/.debug' as the debug directory
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Skipped resource HttpResource(id='catalog'): [Resume mode] Skipping resource 'catalog' because its file already exists: 'source-data/catalog.json' (158 bytes)
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 0 resources... (1 found, 1 skipped) -- ids: []
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 0 resources -- duration: 0 seconds
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: The report has been saved to 'download_report.json' (59 bytes)
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: ProcessorStats(fail_count=0, skip_count=1, success_count=0)
DEBUG dbnomics_toolbox.fetcher_utils.cli_utils.download_cli: Downloader state not saved to a file, but logged: {'dataset_updates': {}}
To force downloading the resource again, the resume mode can be disabled by passing the --no-resume option:
$ python download.py --no-resume source-data
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data' as the target directory
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data/.cache' as the cache directory
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data/.debug' as the debug directory
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 1 resource... -- ids: ['catalog']
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource HttpResource(id='catalog') (1/1)
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Received HTTP response: <Response [200]>
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Started writing to temporary file 'source-data/catalog.part' (0 bytes)...
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Wrote file 'source-data/catalog.json' (127 bytes) -- duration: 0 seconds
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.json_utils: Start reformatting JSON file 'source-data/catalog.json' (127 bytes) with command '/usr/bin/jq --indent 2 < source-data/catalog.json > source-data/catalog.tmp'
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Reformatted JSON file 'source-data/catalog.json' (127 bytes) -- duration: 0.01 seconds initial_file_size: 127 bytes
INFO dbnomics_toolbox.fetcher_utils.processors.base_downloader: Finished downloading resource 'catalog' successfully -- duration: 0.02 seconds
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 1 resource -- duration: 0.02 seconds
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: The report has been saved to 'download_report.json' (59 bytes)
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: ProcessorStats(fail_count=0, skip_count=0, success_count=1)
DEBUG dbnomics_toolbox.fetcher_utils.cli_utils.download_cli: Downloader state not saved to a file, but logged: {'dataset_updates': {}}
When disabling the resume mode, all the resources will be downloaded, whether their file already exists or not.
To re-download a particular resource only while keeping the resume mode enabled, just delete its file and re-execute the script.
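The rule behind the resume mode can be sketched in a few lines. This is not the BaseDownloader implementation; select_resources_to_download and the second resource "BOP" are hypothetical names used only to show the decision: with resume enabled a resource is downloaded only if its target file is missing, and with resume disabled everything is downloaded.

```python
import tempfile
from pathlib import Path


def select_resources_to_download(
    target_files: dict[str, Path], *, resume: bool = True
) -> list[str]:
    """Return the IDs of the resources that should be downloaded (hypothetical helper)."""
    if not resume:
        # Equivalent of --no-resume: download everything again.
        return list(target_files)
    # Resume mode: skip the resources whose file already exists.
    return [resource_id for resource_id, path in target_files.items() if not path.exists()]


with tempfile.TemporaryDirectory() as tmp:
    source_data_dir = Path(tmp)
    target_files = {
        "catalog": source_data_dir / "catalog.json",
        "BOP": source_data_dir / "BOP.json",  # hypothetical second resource
    }
    target_files["catalog"].write_text("[]")  # pretend it was downloaded earlier

    with_resume = select_resources_to_download(target_files)
    without_resume = select_resources_to_download(target_files, resume=False)
    print(with_resume)
    print(without_resume)
```

Deleting catalog.json before the call would put "catalog" back in the first list, which mirrors the advice above about re-downloading a single resource.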
Error handling¶
If an exception occurs while downloading a resource, the error is logged, the resource is skipped, and the script continues without crashing.
If the target file of the resource was written, even partially, the BaseDownloader will move the file to the debug directory for further inspection.
By default, the debug directory is a sub-directory of the source data directory named .debug.
Its path can be customized by passing the --debug-dir option.
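The behavior described above can be pictured with the following standard-library sketch. It is an illustration under assumptions, not the toolbox code: download_all, write_ok and write_then_fail are hypothetical names, and each "download" is reduced to a function that writes its target file.

```python
import logging
import shutil
import tempfile
from collections.abc import Callable
from pathlib import Path

logger = logging.getLogger(__name__)


def download_all(
    downloads: dict[Path, Callable[[Path], None]], *, debug_dir: Path
) -> tuple[int, int]:
    """Run each download and return (success_count, fail_count)."""
    success_count = fail_count = 0
    debug_dir.mkdir(parents=True, exist_ok=True)
    for target_file, write_to in downloads.items():
        try:
            write_to(target_file)
        except Exception:
            # Log the error and carry on with the next resource.
            logger.exception("Error downloading %s", target_file.name)
            fail_count += 1
            if target_file.exists():
                # Keep the partial file for inspection, away from valid source data.
                shutil.move(str(target_file), str(debug_dir / target_file.name))
            continue
        success_count += 1
    return success_count, fail_count


def write_ok(path: Path) -> None:
    path.write_text("[]")


def write_then_fail(path: Path) -> None:
    path.write_text("[tru")  # partial content written before the failure
    raise ConnectionError("connection reset")


with tempfile.TemporaryDirectory() as tmp:
    source_data_dir = Path(tmp)
    debug_dir = source_data_dir / ".debug"
    stats = download_all(
        {source_data_dir / "a.json": write_then_fail, source_data_dir / "b.json": write_ok},
        debug_dir=debug_dir,
    )
    moved_to_debug = (debug_dir / "a.json").exists()
    absent_from_source_data = not (source_data_dir / "a.json").exists()
    print(stats, moved_to_debug, absent_from_source_data)
```

The failing resource does not prevent the second one from being downloaded, and its partial file ends up in the debug directory rather than in the source-data directory.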
Example: simulate a 404 page not found¶
To simulate an error with the resource catalog, modify the Downloader.start method that we created earlier to return a 404 HTML page for the URL of the catalog JSON file:
import responses


class Downloader(BaseDownloader):
    # [...]

    @override
    def start(self) -> None:
        with responses.RequestsMock(assert_all_requests_are_fired=False) as rsps:
            # rsps.add(
            #     responses.GET,
            #     "https://abc-provider.com/data/catalog.json",
            #     json=[
            #         {"dataset_id": "BOP", "dataset_name": "Balance of payments"},
            #         {"dataset_id": "GDP", "dataset_name": "Gross domestic product"},
            #     ],
            #     status=200,
            # )
            rsps.add(
                responses.GET,
                "https://abc-provider.com/data/catalog.json",
                body="Page not found!",
                content_type="text/html",
                status=404,
            )
            super().start()
Run the script:
$ python download.py source-data
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data' as the target directory
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data/.cache' as the cache directory
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data/.debug' as the debug directory
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 1 resource... -- ids: ['catalog']
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource HttpResource(id='catalog') (1/1)
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Invalid HTTP response: <Response [404]>
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Dumped HTTP request and response to 'source-data/.debug/catalog.json.http_dump.attempt_1.txt' (298 bytes)
ERROR dbnomics_toolbox.fetcher_utils.processors.base_downloader: Error downloading resource 'catalog' -- duration: 0.01 seconds
Traceback (most recent call last):
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/processors/base_downloader.py", line 136, in _download_resource
resource._start() # type: ignore[reportPrivateUsage] # noqa: SLF001
~~~~~~~~~~~~~~~^^
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/file_resource.py", line 225, in _start
run_retrying_attempts(
~~~~~~~~~~~~~~~~~~~~~^
retrying=self._retrying,
^^^^^^^^^^^^^^^^^^^^^^^^
run_attempt=run_attempt,
^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/run.py", line 26, in run_retrying_attempts
for attempt in retrying:
^^^^^^^^
File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 445, in __iter__
do = self.iter(retry_state=retry_state)
File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 378, in iter
result = action(retry_state)
File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 400, in <lambda>
self._add_action_func(lambda rs: rs.outcome.result())
~~~~~~~~~~~~~~~~~^^
File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
~~~~~~~~~~~~~~~~~^^
File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/run.py", line 30, in run_retrying_attempts
result = run_attempt(retry_state=retry_state)
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/file_resource.py", line 222, in run_attempt
self._download(retry_state=retry_state)
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 108, in _download
with self._fetch_response(retry_state=retry_state) as response:
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/contextlib.py", line 141, in __enter__
return next(self.gen)
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 145, in _fetch_response
self._validate_response(response)
~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 208, in _validate_response
response.raise_for_status()
~~~~~~~~~~~~~~~~~~~~~~~~~^^
File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/requests/models.py", line 1026, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://abc-provider.com/data/catalog.json
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 1 resource -- duration: 0.03 seconds
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: The report has been saved to 'download_report.json' (59 bytes)
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: ProcessorStats(fail_count=1, skip_count=0, success_count=0)
DEBUG dbnomics_toolbox.fetcher_utils.cli_utils.download_cli: Downloader state not saved to a file, but logged: {'dataset_updates': {}}
We see in the logs the HTTPError exception that was raised because of the 404 response status code.
In this case the error was raised before starting to write the target file catalog.json, so we won’t find it in source-data.
This is important because we don’t want invalid files to be written to the source-data directory.
However, as shown in the logs, since the HTTP request failed, a textual dump of the request and the response was saved to the debug directory:
$ tree -a source-data
source-data
├── .cache
│ └── .gitignore
└── .debug
├── catalog.json.http_dump.attempt_1.txt
└── .gitignore
3 directories, 3 files
$ cat source-data/.debug/catalog.json.http_dump.attempt_1.txt
< GET /data/catalog.json HTTP/1.1
< Host: abc-provider.com
< User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0
< Accept-Encoding: gzip, deflate
< Accept: */*
< Connection: keep-alive
<
> HTTP/? 404 Not Found
> Content-Type: text/html
>
Page not found!%
This allows us to quickly spot the problem in context, without having to reproduce that request with curl or in the browser.
Retrying¶
It’s not unusual for servers to be too busy to respond with something useful to the client. In this case we may receive a response that tells us to retry after a delay.
This retrying logic is implemented by the HttpResource.
See also: retrying downloads.
Example: simulate a busy server¶
This time, let’s simulate a server that responds with something like “Server busy, retry later” the first time the URL is called, then responds with the JSON catalog as expected the second time.
We’ll use the HTTP response code 429 Too Many Requests.
Modify again the Downloader.start method:
import responses


class Downloader(BaseDownloader):
    # [...]

    @override
    def start(self) -> None:
        with responses.RequestsMock(assert_all_requests_are_fired=False) as rsps:
            # When calling `rsps.add` many times for the same URL, the `RequestsMock` will consume each one in order for each incoming request from the client.
            rsps.add(
                responses.GET,
                "https://abc-provider.com/data/catalog.json",
                body="<p>Server busy, retry later<p>",
                content_type="text/html",
                status=429,
            )
            rsps.add(
                responses.GET,
                "https://abc-provider.com/data/catalog.json",
                json=[
                    {"dataset_id": "BOP", "dataset_name": "Balance of payments"},
                    {"dataset_id": "GDP", "dataset_name": "Gross domestic product"},
                ],
                status=200,
            )
            super().start()
Run the script:
$ python download.py source-data
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created target directory: 'source-data'
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created cache directory: 'source-data/.cache'
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created debug directory: 'source-data/.debug'
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 1 resource... -- ids: ['catalog']
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource HttpResource(id='catalog') (1/1)
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Invalid HTTP response: <Response [429]>
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Dumped HTTP request and response to 'source-data/.debug/catalog.json.http_dump.attempt_1.txt' (321 bytes)
ERROR dbnomics_toolbox.retry_utils.loggers: Error during attempt 1 after 0.01 seconds
Traceback (most recent call last):
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/loggers.py", line 41, in log_failed_attempt
outcome.result()
~~~~~~~~~~~~~~^^
File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
~~~~~~~~~~~~~~~~~^^
File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/run.py", line 30, in run_retrying_attempts
result = run_attempt(retry_state=retry_state)
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/file_resource.py", line 222, in run_attempt
self._download(retry_state=retry_state)
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 108, in _download
with self._fetch_response(retry_state=retry_state) as response:
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/contextlib.py", line 141, in __enter__
return next(self.gen)
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 145, in _fetch_response
self._validate_response(response)
~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 208, in _validate_response
response.raise_for_status()
~~~~~~~~~~~~~~~~~~~~~~~~~^^
File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/requests/models.py", line 1026, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://abc-provider.com/data/catalog.json
DEBUG dbnomics_toolbox.retry_utils.loggers: Sleeping 1.5 seconds
DEBUG dbnomics_toolbox.retry_utils.loggers: Starting attempt 2
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Received HTTP response: <Response [200]>
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Started writing to temporary file 'source-data/catalog.part' (0 bytes)...
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Wrote file 'source-data/catalog.json' (127 bytes) -- duration: 0 seconds
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.json_utils: Start reformatting JSON file 'source-data/catalog.json' (127 bytes) with command '/usr/bin/jq --indent 2 < source-data/catalog.json > source-data/catalog.tmp'
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Reformatted JSON file 'source-data/catalog.json' (127 bytes) -- duration: 0.01 seconds initial_file_size: 127 bytes
INFO dbnomics_toolbox.fetcher_utils.processors.base_downloader: Finished downloading resource 'catalog' successfully -- duration: 1.56 seconds
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 1 resource -- duration: 1.56 seconds
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: The report has been saved to 'download_report.json' (59 bytes)
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: ProcessorStats(fail_count=0, skip_count=0, success_count=1)
DEBUG dbnomics_toolbox.fetcher_utils.cli_utils.download_cli: Downloader state not saved to a file, but logged: {'dataset_updates': {}}
After the first failed attempt, we see that the download script slept for 1.5 seconds. This is because the default wait strategy grows exponentially, starting with a low value.
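As a rough picture of an exponentially growing wait, consider the sketch below. All parameters are assumptions made for illustration: the initial delay of 1.5 seconds merely echoes the value seen in the log above, while the multiplier and the cap are hypothetical and do not describe the actual defaults of the toolbox, which may also add randomness.

```python
def exponential_delays(
    *,
    initial: float = 1.5,  # assumed, echoes the log above
    multiplier: float = 2.0,  # hypothetical growth factor
    maximum: float = 60.0,  # hypothetical cap
    attempts: int = 5,
) -> list[float]:
    """Return the wait before each retry, growing geometrically up to a cap."""
    return [min(maximum, initial * multiplier**n) for n in range(attempts)]


delays = exponential_delays()
print(delays)
```

Starting low keeps transient errors cheap to recover from, while the growth avoids hammering a server that stays busy for a longer period.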
We can now inspect the files corresponding to the first failed attempt, then the successful second one:
$ tree -a source-data
source-data
├── .cache
│ └── .gitignore
├── catalog.json
└── .debug
├── catalog.json.http_dump.attempt_1.txt
└── .gitignore
3 directories, 4 files
$ cat source-data/.debug/catalog.json.http_dump.attempt_1.txt
< GET /data/catalog.json HTTP/1.1
< Host: abc-provider.com
< User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0
< Accept-Encoding: gzip, deflate
< Accept: */*
< Connection: keep-alive
<
> HTTP/? 429 Too Many Requests
> Content-Type: text/html
>
<p>Server busy, retry later<p>%
$ cat source-data/catalog.json
[
{
"dataset_id": "BOP",
"dataset_name": "Balance of payments"
},
{
"dataset_id": "GDP",
"dataset_name": "Gross domestic product"
}
]
Example: simulate Retry-After response header¶
If the server responds with a Retry-After HTTP response header, the value of that header is used as the wait delay instead of the default one.
Let’s simulate this:
import responses


class Downloader(BaseDownloader):
    # [...]

    @override
    def start(self) -> None:
        with responses.RequestsMock(assert_all_requests_are_fired=False) as rsps:
            # When calling `rsps.add` many times for the same URL, the `RequestsMock` will consume each one in order for each incoming request from the client.
            rsps.add(
                responses.GET,
                "https://abc-provider.com/data/catalog.json",
                body="<p>Server busy, retry later<p>",
                content_type="text/html",
                headers={"Retry-After": "4"},
                status=429,
            )
            rsps.add(
                responses.GET,
                "https://abc-provider.com/data/catalog.json",
                json=[
                    {"dataset_id": "BOP", "dataset_name": "Balance of payments"},
                    {"dataset_id": "GDP", "dataset_name": "Gross domestic product"},
                ],
                status=200,
            )
            super().start()
This time the script waits for 4 seconds:
$ python download.py source-data
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data' as the target directory
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data/.cache' as the cache directory
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data/.debug' as the debug directory
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 1 resource... -- ids: ['catalog']
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource HttpResource(id='catalog') (1/1)
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Invalid HTTP response: <Response [429]>
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Dumped HTTP request and response to 'source-data/.debug/catalog.json.http_dump.attempt_1.txt' (339 bytes)
ERROR dbnomics_toolbox.retry_utils.loggers: Error during attempt 1 after 0.01 seconds
Traceback (most recent call last):
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/loggers.py", line 41, in log_failed_attempt
outcome.result()
~~~~~~~~~~~~~~^^
File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
~~~~~~~~~~~~~~~~~^^
File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/run.py", line 30, in run_retrying_attempts
result = run_attempt(retry_state=retry_state)
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/file_resource.py", line 222, in run_attempt
self._download(retry_state=retry_state)
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 108, in _download
with self._fetch_response(retry_state=retry_state) as response:
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/contextlib.py", line 141, in __enter__
return next(self.gen)
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 145, in _fetch_response
self._validate_response(response)
~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 208, in _validate_response
response.raise_for_status()
~~~~~~~~~~~~~~~~~~~~~~~~~^^
File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/requests/models.py", line 1026, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://abc-provider.com/data/catalog.json
DEBUG dbnomics_toolbox.fetcher_utils.http_utils.requests_utils.waiters: The HTTP response has a HTTP header Retry-After: 4
DEBUG dbnomics_toolbox.retry_utils.loggers: Sleeping 4 seconds
DEBUG dbnomics_toolbox.retry_utils.loggers: Starting attempt 2
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Received HTTP response: <Response [200]>
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Started writing to temporary file 'source-data/catalog.part' (0 bytes)...
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Wrote file 'source-data/catalog.json' (127 bytes) -- duration: 0 seconds
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.json_utils: Start reformatting JSON file 'source-data/catalog.json' (127 bytes) with command '/usr/bin/jq --indent 2 < source-data/catalog.json > source-data/catalog.tmp'
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Reformatted JSON file 'source-data/catalog.json' (127 bytes) -- duration: 0.01 seconds initial_file_size: 127 bytes
INFO dbnomics_toolbox.fetcher_utils.processors.base_downloader: Finished downloading resource 'catalog' successfully -- duration: 4.05 seconds
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 1 resource -- duration: 4.05 seconds
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: The report has been saved to 'download_report.json' (59 bytes)
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: ProcessorStats(fail_count=0, skip_count=0, success_count=1)
DEBUG dbnomics_toolbox.fetcher_utils.cli_utils.download_cli: Downloader state not saved to a file, but logged: {'dataset_updates': {}}
Example: simulate a busy server with 200 status code¶
Sometimes the server responds with a message such as “Server busy, retry later” but still returns a 200 OK status code.
In this case, the default retry strategy of the HttpResource considers the response successful: it does not inspect the response body to decide whether to retry downloading the resource.
Let’s first simulate the server responses:
import responses
class Downloader(BaseDownloader):
    # [...]
    @override
    def start(self) -> None:
        with responses.RequestsMock(assert_all_requests_are_fired=False) as rsps:
            # When calling `rsps.add` many times for the same URL, the `RequestsMock` will consume each one in order for each incoming request from the client.
            rsps.add(
                responses.GET,
                "https://abc-provider.com/data/catalog.json",
                body="<p>Server busy, retry later<p>",
                content_type="text/html",
                status=200,
            )
            rsps.add(
                responses.GET,
                "https://abc-provider.com/data/catalog.json",
                json=[
                    {"dataset_id": "BOP", "dataset_name": "Balance of payments"},
                    {"dataset_id": "GDP", "dataset_name": "Gross domestic product"},
                ],
                status=200,
            )
            super().start()
If we run the script at this point, we get an InvalidMimeType exception because the HTML response does not match the extension of the catalog.json file name (cf. MIME type validation), but no retry is attempted.
Although we could customize the retry strategy, it’s better to customize the HTTP response validation directly to make it fail, by passing the validate_response kwarg to the constructor of HttpResource:
import responses
from dbnomics_toolbox.retry_utils.requests.errors import RetryHttpRequest
from requests import Response
class Downloader(BaseDownloader):
    # [...]
    @override
    def _iter_resources(self) -> Iterator[BaseResource]:
        def validate_response(response: Response) -> None:
            response.raise_for_status()
            # A 200 response may still carry a "busy" page, so inspect the body.
            if "Server busy" in response.text:
                msg = "Server is busy"
                raise RetryHttpRequest(msg, response=response)
        yield HttpResource(
            id="catalog",
            request="https://abc-provider.com/data/catalog.json",
            target_file=self._source_data_repo.catalog_file,
            validate_response=validate_response,
        )
In validate_response, raise_for_status does not raise an exception because the response status code is 200.
Raising RetryHttpRequest (which inherits from requests.HTTPError) makes the resource download fail and allows the request to be retried.
In contrast, raising a plain requests.HTTPError would make the resource download fail without allowing the request to be retried, because the response status code is 200.
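To make the mechanism concrete outside of the toolbox, here is a minimal standard-library sketch of the same idea: a validation step inspects the body of an apparently successful response and raises a dedicated exception that the retry loop recognises. The names `RetryRequest`, `validate_body` and `fetch_with_retries` are hypothetical stand-ins, not toolbox APIs, and the "server" is simulated by an iterator of response bodies.

```python
import time
from collections.abc import Callable


class RetryRequest(Exception):
    """Hypothetical stand-in for RetryHttpRequest: the response is unusable, try again."""


def validate_body(body: str) -> None:
    # The status code is 200 in both cases, so only the body can reveal the problem.
    if "Server busy" in body:
        raise RetryRequest("Server is busy")


def fetch_with_retries(
    fetch: Callable[[], str], *, max_attempts: int = 3, wait_seconds: float = 0.0
) -> tuple[str, int]:
    """Call fetch() until validate_body accepts the result; return (body, attempts used)."""
    attempt = 0
    while True:
        attempt += 1
        body = fetch()
        try:
            validate_body(body)
        except RetryRequest:
            if attempt >= max_attempts:
                raise  # give up: the last failure propagates to the caller
            time.sleep(wait_seconds)
        else:
            return body, attempt


# Simulated server: busy on the first request, valid JSON on the second.
bodies = iter(["<p>Server busy, retry later<p>", '[{"dataset_id": "BOP"}]'])
body, attempts = fetch_with_retries(lambda: next(bodies))
print(attempts)  # 2
```

Any exception other than `RetryRequest` propagates immediately, which mirrors the difference described above between a retryable error and a plain failure.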
Let’s execute the script:
$ python download.py source-data
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created target directory: 'source-data'
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created cache directory: 'source-data/.cache'
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created debug directory: 'source-data/.debug'
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 1 resource... -- ids: ['catalog']
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource HttpResource(id='catalog') (1/1)
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Invalid HTTP response: <Response [200]>
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Dumped HTTP request and response to 'source-data/.debug/catalog.json.http_dump.attempt_1.txt' (306 bytes)
ERROR dbnomics_toolbox.retry_utils.loggers: Error during attempt 1 after 0 seconds
Traceback (most recent call last):
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/loggers.py", line 41, in log_failed_attempt
outcome.result()
~~~~~~~~~~~~~~^^
File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
~~~~~~~~~~~~~~~~~^^
File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/run.py", line 30, in run_retrying_attempts
result = run_attempt(retry_state=retry_state)
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/file_resource.py", line 220, in run_attempt
self._download(retry_state=retry_state)
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 106, in _download
with self._fetch_response(retry_state=retry_state) as response:
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/contextlib.py", line 141, in __enter__
return next(self.gen)
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 143, in _fetch_response
self._validate_response(response)
~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 209, in _validate_response
validate_response_callback(response)
~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/src/abc_fetcher/downloader.py", line 53, in validate_response
raise RetryHttpRequest(msg, response=response)
dbnomics_toolbox.retry_utils.requests.errors.RetryHttpRequest: Server is busy
DEBUG dbnomics_toolbox.retry_utils.loggers: Sleeping 1.5 seconds
DEBUG dbnomics_toolbox.retry_utils.loggers: Starting attempt 2
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Received HTTP response: <Response [200]>
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Started writing to temporary file 'source-data/catalog.part' (0 bytes)...
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Wrote file 'source-data/catalog.json' (127 bytes) -- duration: 0 seconds
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.json_utils: Start reformatting JSON file 'source-data/catalog.json' (127 bytes) with command '/usr/bin/jq --indent 2 < source-data/catalog.json > source-data/catalog.tmp'
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Reformatted JSON file 'source-data/catalog.json' (127 bytes) -- initial_file_size: 127 bytes duration: 0.01 seconds
INFO dbnomics_toolbox.fetcher_utils.processors.base_downloader: Finished downloading resource 'catalog' successfully -- duration: 1.54 seconds
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 1 resource -- duration: 1.54 seconds
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: The report has been saved to 'download_report.json' (59 bytes)
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: ProcessorStats(fail_count=0, skip_count=0, success_count=1)
DEBUG dbnomics_toolbox.fetcher_utils.cli_utils.download_cli: Downloader state not saved to a file, but logged: {'dataset_updates': {}}
Now the custom response validation applies and 2 download attempts are made as expected: the first one fails because the server is busy, and the second one succeeds.
Resource groups¶
Sometimes we want to download many files, but if any of them fails, we want none of them. In other words, we want to keep the files if and only if all the downloads succeed.
The ResourceGroup provides such a mechanism.
By default, a resource group stores the files of its child resources in the source-data directory, but if the target_dir kwarg is passed to its constructor, they are stored under that base directory.
Let’s demonstrate that by simulating downloading a dataset composed of 2 files: data.xml and structure.xml.
import responses
class Downloader(BaseDownloader):
    # [...]
    @override
    def start(self) -> None:
        with responses.RequestsMock(assert_all_requests_are_fired=False) as rsps:
            rsps.add(
                responses.GET,
                "https://abc-provider.com/dataset1/data.xml",
                body="<p>Page Not Found<p>",
                content_type="text/html",
                status=404,
            )
            rsps.add(
                responses.GET,
                "https://abc-provider.com/dataset1/structure.xml",
                body='<?xml version="1.0" encoding="UTF-8"?><structure />',
                content_type="application/xml",
            )
            super().start()
    @override
    def _iter_resources(self) -> Iterator[BaseResource]:
        yield ResourceGroup(
            id="dataset1",
            resources=[
                HttpResource(
                    id="data",
                    request="https://abc-provider.com/dataset1/data.xml",
                    target_file="data.xml",
                ),
                HttpResource(
                    id="structure",
                    request="https://abc-provider.com/dataset1/structure.xml",
                    target_file="structure.xml",
                ),
            ],
            target_dir="dataset1",
        )
Let’s run the script:
$ python download.py source-data
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created target directory: 'source-data'
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created cache directory: 'source-data/.cache'
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created debug directory: 'source-data/.debug'
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 1 resource... -- ids: ['dataset1']
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource ResourceGroup(id='dataset1') (1/1)
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 2 resources... -- ids: ['data', 'structure']
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource HttpResource(id='data') (1/2)
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/dataset1/data.xml' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Invalid HTTP response: <Response [404]>
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Dumped HTTP request and response to 'source-data/.debug/dataset1/data.xml.http_dump.attempt_1.txt' (303 bytes)
ERROR dbnomics_toolbox.fetcher_utils.resources.resource_group: Error downloading resource 'data' of group 'dataset1'
Traceback (most recent call last):
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/resource_group.py", line 65, in _download_resource
self._downloader._download_resource( # noqa: SLF001 # type: ignore[reportPrivateUsage]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
resource,
^^^^^^^^^
progression=progression,
^^^^^^^^^^^^^^^^^^^^^^^^
reraise=True,
^^^^^^^^^^^^^
)
^
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/processors/base_downloader.py", line 136, in _download_resource
resource._start() # type: ignore[reportPrivateUsage] # noqa: SLF001
~~~~~~~~~~~~~~~^^
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/file_resource.py", line 223, in _start
run_retrying_attempts(
~~~~~~~~~~~~~~~~~~~~~^
retrying=self._retrying,
^^^^^^^^^^^^^^^^^^^^^^^^
run_attempt=run_attempt,
^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/run.py", line 26, in run_retrying_attempts
for attempt in retrying:
^^^^^^^^
File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 445, in __iter__
do = self.iter(retry_state=retry_state)
File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 378, in iter
result = action(retry_state)
File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 400, in <lambda>
self._add_action_func(lambda rs: rs.outcome.result())
~~~~~~~~~~~~~~~~~^^
File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
~~~~~~~~~~~~~~~~~^^
File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/run.py", line 30, in run_retrying_attempts
result = run_attempt(retry_state=retry_state)
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/file_resource.py", line 220, in run_attempt
self._download(retry_state=retry_state)
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 108, in _download
with self._fetch_response(retry_state=retry_state) as response:
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/contextlib.py", line 141, in __enter__
return next(self.gen)
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 145, in _fetch_response
self._validate_response(response)
~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 206, in _validate_response
response.raise_for_status()
~~~~~~~~~~~~~~~~~~~~~~~~~^^
File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/requests/models.py", line 1026, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://abc-provider.com/dataset1/data.xml
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource HttpResource(id='structure') (2/2)
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/dataset1/structure.xml' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Received HTTP response: <Response [200]>
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Started writing to temporary file 'source-data/.cache/dataset1/structure.part' (0 bytes)...
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Wrote file 'source-data/.cache/dataset1/structure.xml' (51 bytes) -- duration: 0 seconds
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.xml_utils.reformatters: Start reformatting XML file 'source-data/.cache/dataset1/structure.xml' (51 bytes) with command '/usr/bin/xmlindent -i 2 source-data/.cache/dataset1/structure.xml -o source-data/.cache/dataset1/structure.tmp'
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Reformatted XML file 'source-data/.cache/dataset1/structure.xml' (51 bytes) -- initial_file_size: 51 bytes duration: 0 seconds
INFO dbnomics_toolbox.fetcher_utils.processors.base_downloader: Finished downloading resource 'structure' successfully -- duration: 0.02 seconds
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 2 resources -- duration: 0.03 seconds
INFO dbnomics_toolbox.fetcher_utils.processors.base_downloader: Finished downloading resource 'dataset1' successfully -- duration: 0.03 seconds
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 1 resource -- duration: 0.03 seconds
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: The report has been saved to 'download_report.json' (59 bytes)
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: ProcessorStats(fail_count=1, skip_count=0, success_count=1)
DEBUG dbnomics_toolbox.fetcher_utils.cli_utils.download_cli: Downloader state not saved to a file, but logged: {'dataset_updates': {}}
Let’s look at the source-data directory:
$ tree -a source-data
source-data
├── .cache
│ ├── dataset1
│ │ └── structure.xml
│ └── .gitignore
└── .debug
├── dataset1
│ └── data.xml.http_dump.attempt_1.txt
└── .gitignore
5 directories, 4 files
The source-data/dataset1 directory does not exist, which is what we want: one of the resources of the group failed, so we want none of them.
The structure.xml file is stored in the cache directory so that a subsequent download will take advantage of the resume mode to skip downloading the file again.
The response dump is stored in the debug directory as data.xml.http_dump.attempt_1.txt and allows us to inspect what’s going on.
Let’s fix the simulated response to make the data resource succeed:
# downloader.py
import responses
class Downloader(BaseDownloader):
    # [...]
    @override
    def start(self) -> None:
        with responses.RequestsMock(assert_all_requests_are_fired=False) as rsps:
            rsps.add(
                responses.GET,
                "https://abc-provider.com/dataset1/data.xml",
                body='<?xml version="1.0" encoding="UTF-8"?><data />',
                content_type="application/xml",
            )
            rsps.add(
                responses.GET,
                "https://abc-provider.com/dataset1/structure.xml",
                body='<?xml version="1.0" encoding="UTF-8"?><structure />',
                content_type="application/xml",
            )
            super().start()
Let’s run the script again:
$ python download.py source-data
Let’s look at the source-data directory:
$ tree -a source-data
source-data
├── .cache
│ └── .gitignore
├── dataset1
│ ├── data.xml
│ └── structure.xml
└── .debug
└── .gitignore
Now the source-data/dataset1 directory exists, and contains all the files of the resource group.
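The all-or-nothing behaviour observed in this section can be sketched with the standard library alone. The following is not the ResourceGroup implementation, only an illustration of the idea under stated assumptions: files are first written to a staging directory (playing the role of the cache directory), and are moved to the target directory only when every download of the group has succeeded. The names `DownloadError` and `download_group` are hypothetical, and downloads are simulated by plain callables.

```python
import tempfile
from collections.abc import Callable
from pathlib import Path


class DownloadError(Exception):
    """Hypothetical error raised by a download callable."""


def download_group(
    downloads: dict[str, Callable[[], bytes]], *, staging_dir: Path, target_dir: Path
) -> bool:
    """Stage every file, then publish the whole group only if no download failed."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    all_succeeded = True
    for file_name, download in downloads.items():
        staged_file = staging_dir / file_name
        if staged_file.exists():
            continue  # staged by a previous run: do not download it again
        try:
            content = download()
        except DownloadError:
            all_succeeded = False  # keep staging the other files for a later run
            continue
        staged_file.write_bytes(content)
    if not all_succeeded:
        return False  # nothing is published; the staged files stay where they are
    target_dir.mkdir(parents=True, exist_ok=True)
    for file_name in downloads:
        (staging_dir / file_name).replace(target_dir / file_name)
    staging_dir.rmdir()  # assumes the staging directory only held this group's files
    return True


def fail_with_404() -> bytes:
    raise DownloadError("404 Not Found")


structure_calls: list[str] = []


def download_structure() -> bytes:
    structure_calls.append("structure.xml")
    return b"<structure />"


with tempfile.TemporaryDirectory() as tmp:
    staging = Path(tmp) / ".cache" / "dataset1"
    target = Path(tmp) / "dataset1"
    # First run: data.xml fails, so the target directory must not be created.
    first_ok = download_group(
        {"data.xml": fail_with_404, "structure.xml": download_structure},
        staging_dir=staging,
        target_dir=target,
    )
    target_exists_after_failure = target.exists()
    # Second run: data.xml succeeds and structure.xml is reused from staging.
    second_ok = download_group(
        {"data.xml": lambda: b"<data />", "structure.xml": download_structure},
        staging_dir=staging,
        target_dir=target,
    )
    published = sorted(path.name for path in target.iterdir())
```

Keeping the successfully staged files after a failed run is what allows the second run to skip them, in the same spirit as the resume mode mentioned above.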
Scraping the provider website¶
When a provider does not distribute machine-parseable data, we can scrape its website to extract the missing data.
For example, the list of datasets may be available only as a list of links in an HTML page.
When using web scraping, fetchers should define a website.py module that exposes a Website class which encapsulates the details and knowledge about the website (URLs, data iterators, etc.) and makes high-level data available through methods:
# constants.py
from typing import Final
from yarl import URL
WEBSITE_BASE_URL: Final = URL("https://abc-provider.com/data")
# website.py
from yarl import URL
from abc_fetcher.constants import WEBSITE_BASE_URL
class Website:
    def __init__(self, *, base_url: URL | str | None = None) -> None:
        # Fall back to the provider's default base URL, and accept plain strings.
        if base_url is None:
            base_url = WEBSITE_BASE_URL
        if isinstance(base_url, str):
            base_url = URL(base_url)
        self._base_url = base_url
    def build_series_url(self, series_id: str) -> URL:
        return self._base_url / f"series/{series_id}"
Real-world examples:
ons-fetcher relies on web scraping to extract the category tree of datasets.
SDMX¶
Providers that distribute SDMX data can be handled by sub-classing BaseSdmxDownloader.
This base class handles many things related to SDMX data:
downloading global SDMX resources (e.g. dataflow, categorisation, categoryscheme)
downloading datasets by iterating the dataflow
extracting the last update date to avoid downloading non-updated datasets again and again
This base class also defines abstract methods that must be implemented in the fetcher.
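The idea of skipping datasets that have not been updated can be sketched as a comparison between the last update dates recorded by a previous run and those announced by the provider. This is only an illustration: the function `select_updated_datasets` and the shape of both mappings (dataset ID to last update date) are assumptions, not the API of BaseSdmxDownloader.

```python
from datetime import datetime, timezone


def select_updated_datasets(
    previous_updates: dict[str, datetime], current_updates: dict[str, datetime]
) -> list[str]:
    """Return the IDs of datasets that are new or whose last update date has advanced."""
    return [
        dataset_id
        for dataset_id, updated_at in current_updates.items()
        if dataset_id not in previous_updates or updated_at > previous_updates[dataset_id]
    ]


# Hypothetical state saved by a previous run.
previous = {
    "BOP": datetime(2025, 1, 10, tzinfo=timezone.utc),
    "GDP": datetime(2025, 1, 10, tzinfo=timezone.utc),
}
# Hypothetical dates extracted from the provider during the current run.
current = {
    "BOP": datetime(2025, 1, 10, tzinfo=timezone.utc),  # unchanged: skipped
    "GDP": datetime(2025, 2, 1, tzinfo=timezone.utc),  # updated: downloaded
    "CPI": datetime(2025, 2, 1, tzinfo=timezone.utc),  # new: downloaded
}

to_download = select_updated_datasets(previous, current)
print(to_download)  # ['GDP', 'CPI']
```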
TODO SdmxApi
Real-world examples: