# Downloading data

## Overview

Here is the big picture of the different components that allow for data download:

![Download architecture](../_static/download-architecture.drawio.svg)

The following sections will introduce each component of this architecture diagram.

## Download script

A fetcher must define a `download.py` script:

```{code-block} python
:caption: download.py

#!/usr/bin/env python3

from dbnomics_toolbox.fetcher_utils.cli_utils.download_cli import DownloadCLI
from dbnomics_toolbox.fetcher_utils.cli_utils.download_cli_args import DownloadCLIArgs

import abc_fetcher
from abc_fetcher.downloader import Downloader


def main() -> None:
    cli = DownloadCLI(package_name=abc_fetcher.__name__)
    cli.start()

    downloader = Downloader(**cli.args.as_downloader_kwargs)
    downloader.start()

    cli.finalize(downloader)


if __name__ == "__main__":
    main()
```

This script can be called manually from the command line, and will be called by the DBnomics infrastructure in production.

Note the separation of concerns:

- the [`DownloadCLI`](DownloadCLI) class handles the CLI arguments and options, and saves an output state file (in the [`finalize`](DownloadCLI.finalize) method),
- the `Downloader` class takes care of the download logic.

## Downloader

The `Downloader` class, which inherits from [`BaseDownloader`](BaseDownloader), is responsible for downloading data from the provider infrastructure.

To achieve its goal, it creates file resources (see the "File resources" section below) and downloads them.

Example of `Downloader`:

```{code-block} python
:caption: downloader.py

from collections.abc import Iterator
from typing import Unpack, override

from dbnomics_toolbox.fetcher_utils import BaseDownloader, BaseResource
from dbnomics_toolbox.fetcher_utils.resources import ResourceId
from yarl import URL

from dummy_fetcher.constants import API_BASE_URL
from dummy_fetcher.resources import DummyApiResource
from dummy_fetcher.source_data_repo import SourceDataRepo

__all__ = ["Downloader"]


class Downloader(BaseDownloader):
    """Download files from the provider infrastructure."""

    def __init__(
        self,
        **kwargs: Unpack[BaseDownloader.InitKwargs],
    ) -> None:
        super().__init__(**kwargs)
        self._source_data_repo = SourceDataRepo(source_data_dir=kwargs["source_data_dir"])

    @override
    def _iter_resources(self) -> Iterator[BaseResource]:
        # Defined later
        ...
```

The `__init__` method instantiates the `SourceDataRepo` class, which is a fundamental pattern detailed in the next section.

## Source data repository

We don't want the `Downloader` to know too much about source data, especially file paths inside the `source-data` directory, and how to read data from files. Remember: the `Converter` will also have to read data back from this directory, and we don't want to duplicate this logic.

So we can define a `SourceDataRepo` class and use it from both the `Downloader` and the `Converter`.

For example, let's say that the ABC provider exposes its datasets in a `catalog.json` file such as:

```json
[
  {"dataset_id": "BOP", "dataset_name": "Balance of payments"},
  {"dataset_id": "GDP", "dataset_name": "Gross domestic product"}
]
```

Those catalog items can be modeled by a `CatalogItem` dataclass:

```{code-block} python
:caption: source_data_model.py

from dataclasses import dataclass

__all__ = ["CatalogItem"]


@dataclass(frozen=True, kw_only=True)
class CatalogItem:
    dataset_id: str
    dataset_name: str
```

The `SourceDataRepo` class can define methods that parse source-data files and expose their content as model objects from the provider domain.

To validate Python dicts and load them into dataclass instances, it is advised to use a data loading library such as [typedload](https://ltworf.codeberg.page/typedload/) or [`pydantic`](https://docs.pydantic.dev/). The [`dbnomics_toolbox.json_utils.load_json_file`](dbnomics_toolbox.json_utils.load_json_file) function uses `typedload` under the hood.

Example of `SourceDataRepo`:

```{code-block} python
:caption: source_data_repo.py

from collections.abc import Iterator
from pathlib import Path

import daiquiri
from dbnomics_toolbox.json_utils import load_json_file

from abc_fetcher.source_data_model import CatalogItem

__all__ = ["SourceDataRepo"]


logger = daiquiri.getLogger(__name__)


class SourceDataRepo:
    """Load data from the `source-data` directory.

    Useful both to the Downloader (e.g. to know where to write the downloaded files)
    and to the Converter (e.g. to load those files).
    """

    def __init__(self, *, source_data_dir: Path) -> None:
        self._source_data_dir = source_data_dir
        self.catalog_file = source_data_dir / "catalog.json"

    def iter_catalog_items(self) -> Iterator[CatalogItem]:
        catalog_items = load_json_file(self.catalog_file, type_=list[CatalogItem])
        yield from catalog_items
```

The `iter_catalog_items` method will be called by the `Downloader` in order to download all the datasets, and by the `Converter` to convert them all.

## File resources

A resource represents some data distributed by the provider that will be stored as a single file.

The abstract method [`FileResource._download`](FileResource._download) is responsible for writing data to the target file. Once the file is written, the resource validates its MIME type and reformats it according to its format (e.g. XML, JSON, etc.). Those features are enabled by default with auto-detection, but can be disabled by passing arguments to `__init__`.

Most of the time you will use the [`HttpResource`](HttpResource), a child class of [`FileResource`](FileResource) which uses the [`requests`](https://requests.readthedocs.io/en/latest/) library to download files.

For example, we can define the following resource to download a JSON file from `https://abc-provider.com/data/catalog.json`:

```{code-block} python
:caption: downloader.py

from dbnomics_toolbox.fetcher_utils import HttpResource


class Downloader(BaseDownloader):
    # [...]

    @override
    def _iter_resources(self) -> Iterator[BaseResource]:
        yield HttpResource(
            id="catalog",
            request="https://abc-provider.com/data/catalog.json",
            target_file=self._source_data_repo.catalog_file,
        )
```

The `target_file` argument references the `catalog_file` attribute from the `SourceDataRepo`.

The `request` argument can be a URL or a `requests.Request` object.

See also the [file resources](resources/file-resources.md) page.

## Simulate the provider web API

Obviously the URLs of the example, starting with `https://abc-provider.com/`, do not actually exist.
To make the downloader work with them, we're going to intercept those requests and respond with fake data by using the [`responses`](https://github.com/getsentry/responses) package.

Install the package:

```bash
uv add responses
```

Override the [`BaseDownloader.start`](BaseDownloader.start) method:

```{code-block} python
:caption: downloader.py

import responses


class Downloader(BaseDownloader):
    # [...]

    @override
    def start(self) -> None:
        with responses.RequestsMock(assert_all_requests_are_fired=False) as rsps:
            rsps.add(
                responses.GET,
                "https://abc-provider.com/data/catalog.json",
                json=[
                    {"dataset_id": "BOP", "dataset_name": "Balance of payments"},
                    {"dataset_id": "GDP", "dataset_name": "Gross domestic product"},
                ],
                status=200,
            )
            super().start()
```

## Run the download script

```bash
$ python download.py source-data
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created target directory: 'source-data'
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created cache directory: 'source-data/.cache'
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created debug directory: 'source-data/.debug'
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 1 resource... -- ids: ['catalog']
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource HttpResource(id='catalog') (1/1)
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Received HTTP response:
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Started writing to temporary file 'source-data/catalog.part' (0 bytes)...
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Wrote file 'source-data/catalog.json' (127 bytes) -- duration: 0 seconds
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.json_utils: Start reformatting JSON file 'source-data/catalog.json' (127 bytes) with command '/usr/bin/jq --indent 2 < source-data/catalog.json > source-data/catalog.tmp'
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Reformatted JSON file 'source-data/catalog.json' (127 bytes) -- initial_file_size: 127 bytes duration: 0.01 seconds
INFO dbnomics_toolbox.fetcher_utils.processors.base_downloader: Finished downloading resource 'catalog' successfully -- duration: 0.06 seconds
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 1 resource -- duration: 0.06 seconds
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: The report has been saved to 'download_report.json' (59 bytes)
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: ProcessorStats(fail_count=0, skip_count=0, success_count=1)
DEBUG dbnomics_toolbox.fetcher_utils.cli_utils.download_cli: Downloader state not saved to a file, but logged: {'dataset_updates': {}}
```

The log mainly shows that the resource with the ID `catalog` was downloaded successfully. The other details in the log are explained in the following sections.

We can see that the `catalog.json` file was downloaded:

```bash
$ tree source-data
source-data
└── catalog.json

1 directory, 1 file

$ cat source-data/catalog.json
[
  {
    "dataset_id": "BOP",
    "dataset_name": "Balance of payments"
  },
  {
    "dataset_id": "GDP",
    "dataset_name": "Gross domestic product"
  }
]
```

Note: as shown in the logs, the JSON file was reformatted.

## Resume mode

When the file of a resource already exists, that resource is skipped. This behavior is called the *resume mode*.
For example, if we run the download script again, we can read in the logs that the resource is skipped:

```{code-block} bash
:emphasize-lines: 5

$ python download.py source-data
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data' as the target directory
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data/.cache' as the cache directory
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data/.debug' as the debug directory
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Skipped resource HttpResource(id='catalog'): [Resume mode] Skipping resource 'catalog' because its file already exists: 'source-data/catalog.json' (158 bytes)
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 0 resources... (1 found, 1 skipped) -- ids: []
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 0 resources -- duration: 0 seconds
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: The report has been saved to 'download_report.json' (59 bytes)
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: ProcessorStats(fail_count=0, skip_count=1, success_count=0)
DEBUG dbnomics_toolbox.fetcher_utils.cli_utils.download_cli: Downloader state not saved to a file, but logged: {'dataset_updates': {}}
```

To force downloading the resource again, the resume mode can be disabled by passing the `--no-resume` option:

```{code-block} bash
:emphasize-lines: 6-13

$ python download.py --no-resume source-data
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data' as the target directory
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data/.cache' as the cache directory
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data/.debug' as the debug directory
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 1 resource... -- ids: ['catalog']
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource HttpResource(id='catalog') (1/1)
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Received HTTP response:
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Started writing to temporary file 'source-data/catalog.part' (0 bytes)...
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Wrote file 'source-data/catalog.json' (127 bytes) -- duration: 0 seconds
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.json_utils: Start reformatting JSON file 'source-data/catalog.json' (127 bytes) with command '/usr/bin/jq --indent 2 < source-data/catalog.json > source-data/catalog.tmp'
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Reformatted JSON file 'source-data/catalog.json' (127 bytes) -- duration: 0.01 seconds initial_file_size: 127 bytes
INFO dbnomics_toolbox.fetcher_utils.processors.base_downloader: Finished downloading resource 'catalog' successfully -- duration: 0.02 seconds
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 1 resource -- duration: 0.02 seconds
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: The report has been saved to 'download_report.json' (59 bytes)
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: ProcessorStats(fail_count=0, skip_count=0, success_count=1)
DEBUG dbnomics_toolbox.fetcher_utils.cli_utils.download_cli: Downloader state not saved to a file, but logged: {'dataset_updates': {}}
```

When disabling the resume mode, all the resources will be downloaded, whether their file already exists or not.
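
The resume-mode rule can be pictured with a small, self-contained sketch. This is not the toolbox's implementation: the `should_download` helper and its `resume` parameter are hypothetical names chosen for illustration, and the sketch models only the rule stated in this section (skip a resource when its file already exists, unless the resume mode is disabled).

```python
import tempfile
from pathlib import Path


def should_download(target_file: Path, *, resume: bool = True) -> bool:
    """Return True when a resource should be downloaded (hypothetical helper)."""
    if not resume:
        # Resume mode disabled (e.g. `--no-resume`): always download.
        return True
    # Resume mode enabled: skip the resource when its file already exists.
    return not target_file.exists()


with tempfile.TemporaryDirectory() as tmp:
    catalog_file = Path(tmp) / "catalog.json"

    first_run = should_download(catalog_file)  # no file yet
    catalog_file.write_text("[]")
    second_run = should_download(catalog_file)  # file exists, resume mode on
    forced_run = should_download(catalog_file, resume=False)  # resume mode off

print(first_run, second_run, forced_run)
```

The three calls mirror the three runs shown in this section: the first download, the skipped second run, and the forced run with the resume mode disabled.
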
To re-download a particular resource only while keeping the resume mode enabled, just delete its file and re-execute the script.

## Error handling

If an exception occurs while downloading a resource, the error is logged, the resource is skipped, and the script continues without crashing.

If the target file of the resource was written, even partially, the [`BaseDownloader`](BaseDownloader) will move the file to the debug directory for further inspection.

By default, the debug directory is a sub-directory of the source data directory named `.debug`. Its path can be customized by passing the `--debug-dir` option.

### Example: simulate a 404 page not found

To simulate an error with the resource `catalog`, modify the `Downloader.start` method that we created earlier to return a 404 HTML page for the URL of the catalog JSON file:

```{code-block} python
:caption: downloader.py

import responses


class Downloader(BaseDownloader):
    # [...]

    @override
    def start(self) -> None:
        with responses.RequestsMock(assert_all_requests_are_fired=False) as rsps:
            # rsps.add(
            #     responses.GET,
            #     "https://abc-provider.com/data/catalog.json",
            #     json=[
            #         {"dataset_id": "BOP", "dataset_name": "Balance of payments"},
            #         {"dataset_id": "GDP", "dataset_name": "Gross domestic product"},
            #     ],
            #     status=200,
            # )
            rsps.add(
                responses.GET,
                "https://abc-provider.com/data/catalog.json",
                body="Page not found!",
                content_type="text/html",
                status=404,
            )
            super().start()
```

Run the script:

```bash
$ python download.py source-data
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data' as the target directory
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data/.cache' as the cache directory
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data/.debug' as the debug directory
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to 
download 1 resource... -- ids: ['catalog'] DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource HttpResource(id='catalog') (1/1) DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)... DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Invalid HTTP response: DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Dumped HTTP request and response to 'source-data/.debug/catalog.json.http_dump.attempt_1.txt' (298 bytes) ERROR dbnomics_toolbox.fetcher_utils.processors.base_downloader: Error downloading resource 'catalog' -- duration: 0.01 seconds Traceback (most recent call last): File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/processors/base_downloader.py", line 136, in _download_resource resource._start() # type: ignore[reportPrivateUsage] # noqa: SLF001 ~~~~~~~~~~~~~~~^^ File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/file_resource.py", line 225, in _start run_retrying_attempts( ~~~~~~~~~~~~~~~~~~~~~^ retrying=self._retrying, ^^^^^^^^^^^^^^^^^^^^^^^^ run_attempt=run_attempt, ^^^^^^^^^^^^^^^^^^^^^^^^ ) ^ File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/run.py", line 26, in run_retrying_attempts for attempt in retrying: ^^^^^^^^ File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 445, in __iter__ do = self.iter(retry_state=retry_state) File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 378, in iter result = action(retry_state) File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 400, in self._add_action_func(lambda rs: rs.outcome.result()) ~~~~~~~~~~~~~~~~~^^ File 
"/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 449, in result return self.__get_result() ~~~~~~~~~~~~~~~~~^^ File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result raise self._exception File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/run.py", line 30, in run_retrying_attempts result = run_attempt(retry_state=retry_state) File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/file_resource.py", line 222, in run_attempt self._download(retry_state=retry_state) ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 108, in _download with self._fetch_response(retry_state=retry_state) as response: ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/contextlib.py", line 141, in __enter__ return next(self.gen) File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 145, in _fetch_response self._validate_response(response) ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^ File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 208, in _validate_response response.raise_for_status() ~~~~~~~~~~~~~~~~~~~~~~~~~^^ File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/requests/models.py", line 1026, in raise_for_status raise HTTPError(http_error_msg, response=self) requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://abc-provider.com/data/catalog.json DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 1 resource -- duration: 0.03 seconds INFO 
dbnomics_toolbox.fetcher_utils.processors.base_processor: The report has been saved to 'download_report.json' (59 bytes)
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: ProcessorStats(fail_count=1, skip_count=0, success_count=0)
DEBUG dbnomics_toolbox.fetcher_utils.cli_utils.download_cli: Downloader state not saved to a file, but logged: {'dataset_updates': {}}
```

The logs show the `HTTPError` exception that was raised because of the 404 response status code.

In this case the error was raised before starting to write the target file `catalog.json`, so we won't find it in `source-data`. This is important because we don't want invalid files to be written to the `source-data` directory.

However, as shown in the logs, since the HTTP request failed, a textual dump of the request and the response was saved to the debug directory:

```bash
$ tree -a source-data
source-data
├── .cache
│   └── .gitignore
└── .debug
    ├── catalog.json.http_dump.attempt_1.txt
    └── .gitignore

3 directories, 3 files

$ cat source-data/.debug/catalog.json.http_dump.attempt_1.txt
< GET /data/catalog.json HTTP/1.1
< Host: abc-provider.com
< User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0
< Accept-Encoding: gzip, deflate
< Accept: */*
< Connection: keep-alive
<
> HTTP/? 404 Not Found
> Content-Type: text/html
>
Page not found!%
```

This allows us to quickly spot the problem in context, without having to reproduce that request with curl or in the browser.

## Retrying

It's not unusual for servers to be too busy to return a useful response to the client. In this case we may receive a response that tells us to retry after a delay.

This retrying logic is implemented by the [`HttpResource`](HttpResource).

See also: [retrying downloads](resources/http-resources.md#retrying-downloads).

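
To make the waiting rule concrete, here is a minimal sketch of one way to choose the delay before the next attempt: honor a `Retry-After` header when the server sends one, otherwise fall back to a delay that grows exponentially with the attempt number. This is not the code of `HttpResource`. The function name `compute_wait_seconds`, the base delay, the growth factor and the cap are all assumptions made for illustration (they are not the toolbox defaults), and only the numeric-seconds form of `Retry-After` is handled.

```python
from collections.abc import Mapping


def compute_wait_seconds(
    attempt_number: int,
    response_headers: Mapping[str, str],
    *,
    base_delay: float = 1.0,  # assumed value, not the toolbox default
    factor: float = 2.0,  # assumed value, not the toolbox default
    max_delay: float = 60.0,  # assumed value, not the toolbox default
) -> float:
    """Return how long to sleep before the next attempt (hypothetical helper)."""
    # Note: a plain dict lookup is case-sensitive, unlike real HTTP header mappings.
    retry_after = response_headers.get("Retry-After")
    if retry_after is not None and retry_after.strip().isdigit():
        # The server told us how many seconds to wait: use that value.
        return float(retry_after.strip())
    # Otherwise wait longer after each failed attempt, up to a cap.
    return min(base_delay * factor ** (attempt_number - 1), max_delay)


print(compute_wait_seconds(1, {}))  # 1.0
print(compute_wait_seconds(3, {}))  # 4.0
print(compute_wait_seconds(1, {"Retry-After": "4"}))  # 4.0
print(compute_wait_seconds(10, {}))  # 60.0
```

Keeping the delay computation in a pure function like this makes it easy to check without sleeping or sending any request.
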
### Example: simulate a busy server

This time, let's simulate a server that responds with something like "Server busy, retry later" the first time the URL is requested, then returns the JSON catalog as expected the second time.

We'll use the HTTP status code [429 Too Many Requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/429).

Modify the `Downloader.start` method again:

```{code-block} python
:caption: downloader.py

import responses


class Downloader(BaseDownloader):
    # [...]

    @override
    def start(self) -> None:
        with responses.RequestsMock(assert_all_requests_are_fired=False) as rsps:
            # When calling `rsps.add` many times for the same URL, the `RequestsMock`
            # will consume each one in order for each incoming request from the client.
            rsps.add(
                responses.GET,
                "https://abc-provider.com/data/catalog.json",
                body="\n\nServer busy, retry later\n\n",
                content_type="text/html",
                status=429,
            )
            rsps.add(
                responses.GET,
                "https://abc-provider.com/data/catalog.json",
                json=[
                    {"dataset_id": "BOP", "dataset_name": "Balance of payments"},
                    {"dataset_id": "GDP", "dataset_name": "Gross domestic product"},
                ],
                status=200,
            )
            super().start()
```

Run the script:

```bash
$ python download.py source-data
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created target directory: 'source-data'
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created cache directory: 'source-data/.cache'
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created debug directory: 'source-data/.debug'
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 1 resource... -- ids: ['catalog']
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource HttpResource(id='catalog') (1/1)
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Invalid HTTP response: DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Dumped HTTP request and response to 'source-data/.debug/catalog.json.http_dump.attempt_1.txt' (321 bytes) ERROR dbnomics_toolbox.retry_utils.loggers: Error during attempt 1 after 0.01 seconds Traceback (most recent call last): File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/loggers.py", line 41, in log_failed_attempt outcome.result() ~~~~~~~~~~~~~~^^ File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 449, in result return self.__get_result() ~~~~~~~~~~~~~~~~~^^ File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result raise self._exception File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/run.py", line 30, in run_retrying_attempts result = run_attempt(retry_state=retry_state) File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/file_resource.py", line 222, in run_attempt self._download(retry_state=retry_state) ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 108, in _download with self._fetch_response(retry_state=retry_state) as response: ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/contextlib.py", line 141, in __enter__ return next(self.gen) File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 145, in _fetch_response self._validate_response(response) ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^ File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 208, in 
_validate_response response.raise_for_status() ~~~~~~~~~~~~~~~~~~~~~~~~~^^ File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/requests/models.py", line 1026, in raise_for_status raise HTTPError(http_error_msg, response=self) requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://abc-provider.com/data/catalog.json DEBUG dbnomics_toolbox.retry_utils.loggers: Sleeping 1.5 seconds DEBUG dbnomics_toolbox.retry_utils.loggers: Starting attempt 2 DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)... DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Received HTTP response: DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Started writing to temporary file 'source-data/catalog.part' (0 bytes)... DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Wrote file 'source-data/catalog.json' (127 bytes) -- duration: 0 seconds DEBUG dbnomics_toolbox.fetcher_utils.file_utils.json_utils: Start reformatting JSON file 'source-data/catalog.json' (127 bytes) with command '/usr/bin/jq --indent 2 < source-data/catalog.json > source-data/catalog.tmp' DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Reformatted JSON file 'source-data/catalog.json' (127 bytes) -- duration: 0.01 seconds initial_file_size: 127 bytes INFO dbnomics_toolbox.fetcher_utils.processors.base_downloader: Finished downloading resource 'catalog' successfully -- duration: 1.56 seconds DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 1 resource -- duration: 1.56 seconds INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: The report has been saved to 'download_report.json' (59 bytes) INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: ProcessorStats(fail_count=0, skip_count=0, success_count=1) DEBUG 
dbnomics_toolbox.fetcher_utils.cli_utils.download_cli: Downloader state not saved to a file, but logged: {'dataset_updates': {}} ``` After the first failed attempt, we see that the download script slept for 1.5 seconds. This is because the default wait strategy grows exponentially, starting with a low value. We can now inspect the files corresponding to the first failed attempt, then the successful second one: ```bash $ tree -a source-data source-data ├── .cache │   └── .gitignore ├── catalog.json └── .debug ├── catalog.json.http_dump.attempt_1.txt └── .gitignore 3 directories, 4 files $ cat source-data/.debug/catalog.json.http_dump.attempt_1.txt < GET /data/catalog.json HTTP/1.1 < Host: abc-provider.com < User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0 < Accept-Encoding: gzip, deflate < Accept: */* < Connection: keep-alive < > HTTP/? 429 Too Many Requests > Content-Type: text/html >

Server busy, retry later

%
$ cat source-data/catalog.json
[
  {
    "dataset_id": "BOP",
    "dataset_name": "Balance of payments"
  },
  {
    "dataset_id": "GDP",
    "dataset_name": "Gross domestic product"
  }
]
```

### Example: simulate `Retry-After` response header

If the server responds with a `Retry-After` HTTP response header, its value is used as the delay before the next attempt.

Let's simulate this:

```{code-block} python
:caption: downloader.py

import responses


class Downloader(BaseDownloader):
    # [...]

    @override
    def start(self) -> None:
        with responses.RequestsMock(assert_all_requests_are_fired=False) as rsps:
            # When calling `rsps.add` many times for the same URL, the `RequestsMock`
            # will consume each one in order for each incoming request from the client.
            rsps.add(
                responses.GET,
                "https://abc-provider.com/data/catalog.json",
                body="\n\nServer busy, retry later\n\n",
                content_type="text/html",
                headers={"Retry-After": "4"},
                status=429,
            )
            rsps.add(
                responses.GET,
                "https://abc-provider.com/data/catalog.json",
                json=[
                    {"dataset_id": "BOP", "dataset_name": "Balance of payments"},
                    {"dataset_id": "GDP", "dataset_name": "Gross domestic product"},
                ],
                status=200,
            )
            super().start()
```

This time the script waits for 4 seconds:

```bash
$ python download.py source-data
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data' as the target directory
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data/.cache' as the cache directory
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data/.debug' as the debug directory
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 1 resource... -- ids: ['catalog']
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource HttpResource(id='catalog') (1/1)
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Invalid HTTP response: DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Dumped HTTP request and response to 'source-data/.debug/catalog.json.http_dump.attempt_1.txt' (339 bytes) ERROR dbnomics_toolbox.retry_utils.loggers: Error during attempt 1 after 0.01 seconds Traceback (most recent call last): File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/loggers.py", line 41, in log_failed_attempt outcome.result() ~~~~~~~~~~~~~~^^ File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 449, in result return self.__get_result() ~~~~~~~~~~~~~~~~~^^ File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result raise self._exception File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/run.py", line 30, in run_retrying_attempts result = run_attempt(retry_state=retry_state) File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/file_resource.py", line 222, in run_attempt self._download(retry_state=retry_state) ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 108, in _download with self._fetch_response(retry_state=retry_state) as response: ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/contextlib.py", line 141, in __enter__ return next(self.gen) File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 145, in _fetch_response self._validate_response(response) ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^ File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 208, in 
_validate_response response.raise_for_status() ~~~~~~~~~~~~~~~~~~~~~~~~~^^ File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/requests/models.py", line 1026, in raise_for_status raise HTTPError(http_error_msg, response=self) requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://abc-provider.com/data/catalog.json DEBUG dbnomics_toolbox.fetcher_utils.http_utils.requests_utils.waiters: The HTTP response has a HTTP header Retry-After: 4 DEBUG dbnomics_toolbox.retry_utils.loggers: Sleeping 4 seconds DEBUG dbnomics_toolbox.retry_utils.loggers: Starting attempt 2 DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)... DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Received HTTP response: DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Started writing to temporary file 'source-data/catalog.part' (0 bytes)... 
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Wrote file 'source-data/catalog.json' (127 bytes) -- duration: 0 seconds
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.json_utils: Start reformatting JSON file 'source-data/catalog.json' (127 bytes) with command '/usr/bin/jq --indent 2 < source-data/catalog.json > source-data/catalog.tmp'
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Reformatted JSON file 'source-data/catalog.json' (127 bytes) -- duration: 0.01 seconds initial_file_size: 127 bytes
INFO dbnomics_toolbox.fetcher_utils.processors.base_downloader: Finished downloading resource 'catalog' successfully -- duration: 4.05 seconds
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 1 resource -- duration: 4.05 seconds
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: The report has been saved to 'download_report.json' (59 bytes)
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: ProcessorStats(fail_count=0, skip_count=0, success_count=1)
DEBUG dbnomics_toolbox.fetcher_utils.cli_utils.download_cli: Downloader state not saved to a file, but logged: {'dataset_updates': {}}
```

### Example: simulate a busy server with 200 status code

Sometimes the server answers with a message such as "Server busy, retry later", yet returns a [200 OK](https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/200) status code.

In this case, the default retry strategy of the [`HttpResource`](HttpResource) considers the response successful and does not inspect its contents to decide whether to retry downloading the resource.

Let's first simulate the server responses:

```{code-block} python
:caption: downloader.py

import responses


class Downloader(BaseDownloader):
    # [...]

    @override
    def start(self) -> None:
        with responses.RequestsMock(assert_all_requests_are_fired=False) as rsps:
            # When `rsps.add` is called several times for the same URL, the `RequestsMock`
            # consumes the registered responses in order, one per incoming request.
            rsps.add(
                responses.GET,
                "https://abc-provider.com/data/catalog.json",
                body="Server busy, retry later",
                content_type="text/html",
                status=200,
            )
            rsps.add(
                responses.GET,
                "https://abc-provider.com/data/catalog.json",
                json=[
                    {"dataset_id": "BOP", "dataset_name": "Balance of payments"},
                    {"dataset_id": "GDP", "dataset_name": "Gross domestic product"},
                ],
                status=200,
            )
            super().start()
```

If we run the script at this point, we get an [`InvalidMimeType`](InvalidMimeType) exception because the HTML response does not match the extension of the `catalog.json` file name (cf. [MIME type validation](resources/file-resources.md#mime-type-validation)), but no retry is attempted.

Although we could customize the retry strategy, it's better to customize the HTTP response validation directly to make it fail, by passing the `validate_response` kwarg to the constructor of [`HttpResource`](HttpResource):

```{code-block} python
:caption: downloader.py

import responses
from dbnomics_toolbox.retry_utils.requests.errors import RetryHttpRequest
from requests import Response


class Downloader(BaseDownloader):
    # [...]

    @override
    def _iter_resources(self) -> Iterator[BaseResource]:
        def validate_response(response: Response) -> None:
            # Keep the default behavior for 4xx and 5xx status codes.
            response.raise_for_status()
            # A 200 response can still carry a "busy" message: ask for a retry.
            if "Server busy" in response.text:
                msg = "Server is busy"
                raise RetryHttpRequest(msg, response=response)

        yield HttpResource(
            id="catalog",
            request="https://abc-provider.com/data/catalog.json",
            target_file=self._source_data_repo.catalog_file,
            validate_response=validate_response,
        )
```

In `validate_response`, `raise_for_status` does not raise an exception because the response code is 200.

Raising [`RetryHttpRequest`](dbnomics_toolbox.retry_utils.requests.errors.RetryHttpRequest) (a subclass of [`requests.HTTPError`](https://requests.readthedocs.io/en/latest/api/#requests.HTTPError)) makes the resource download fail and lets the request be retried. In contrast, raising a plain [`requests.HTTPError`](https://requests.readthedocs.io/en/latest/api/#requests.HTTPError) would make the resource download fail without letting the request be retried, as the response code is 200.
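The idea behind this validation hook can be pictured without the toolbox at all. The following is a minimal, self-contained sketch: `FakeResponse`, `RetryableResponseError` and `fetch_with_retries` are hypothetical names invented for this illustration, not part of `dbnomics_toolbox`, and the loop only approximates what a retry strategy does (no waiting between attempts, for instance).

```python
from collections.abc import Callable, Iterator
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeResponse:
    """Hypothetical stand-in for an HTTP response (not a requests.Response)."""

    status_code: int
    text: str


class RetryableResponseError(Exception):
    """Hypothetical error meaning "this response is invalid, try again"."""


def validate_response(response: FakeResponse) -> None:
    # Error status codes fail immediately and are not retried in this sketch.
    if response.status_code >= 400:
        msg = f"HTTP error {response.status_code}"
        raise RuntimeError(msg)
    # A 200 status is not enough: the body is inspected as well.
    if "Server busy" in response.text:
        msg = "Server is busy"
        raise RetryableResponseError(msg)


def fetch_with_retries(
    fetch: Callable[[], FakeResponse], *, max_attempts: int = 3
) -> tuple[FakeResponse, int]:
    """Call `fetch` until the response validates; return it with the attempt count."""
    for attempt in range(1, max_attempts + 1):
        response = fetch()
        try:
            validate_response(response)
        except RetryableResponseError:
            if attempt == max_attempts:
                raise
            continue  # a real retry strategy would sleep here before retrying
        return response, attempt
    msg = "max_attempts must be at least 1"
    raise ValueError(msg)


# Simulate a server that is busy once, then answers normally.
simulated: Iterator[FakeResponse] = iter(
    [
        FakeResponse(200, "Server busy, retry later"),
        FakeResponse(200, '[{"dataset_id": "BOP"}]'),
    ]
)
response, attempts = fetch_with_retries(lambda: next(simulated))
```

Here the first simulated response is rejected by the validation function even though its status is 200, and the second one is accepted on the next attempt. Only the dedicated error type triggers a new attempt, which mirrors the distinction made above between a retryable failure and a plain one.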
Let's execute the script: ```bash $ python download.py source-data DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created target directory: 'source-data' DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created cache directory: 'source-data/.cache' DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created debug directory: 'source-data/.debug' DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 1 resource... -- ids: ['catalog'] DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource HttpResource(id='catalog') (1/1) DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)... DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Invalid HTTP response: DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Dumped HTTP request and response to 'source-data/.debug/catalog.json.http_dump.attempt_1.txt' (306 bytes) ERROR dbnomics_toolbox.retry_utils.loggers: Error during attempt 1 after 0 seconds Traceback (most recent call last): File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/loggers.py", line 41, in log_failed_attempt outcome.result() ~~~~~~~~~~~~~~^^ File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 449, in result return self.__get_result() ~~~~~~~~~~~~~~~~~^^ File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result raise self._exception File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/run.py", line 30, in run_retrying_attempts result = run_attempt(retry_state=retry_state) File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/file_resource.py", line 220, in run_attempt 
self._download(retry_state=retry_state) ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 106, in _download with self._fetch_response(retry_state=retry_state) as response: ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/contextlib.py", line 141, in __enter__ return next(self.gen) File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 143, in _fetch_response self._validate_response(response) ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^ File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 209, in _validate_response validate_response_callback(response) ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^ File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/src/abc_fetcher/downloader.py", line 53, in validate_response raise RetryHttpRequest(msg, response=response) dbnomics_toolbox.retry_utils.requests.errors.RetryHttpRequest: Server is busy DEBUG dbnomics_toolbox.retry_utils.loggers: Sleeping 1.5 seconds DEBUG dbnomics_toolbox.retry_utils.loggers: Starting attempt 2 DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)... DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Received HTTP response: DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Started writing to temporary file 'source-data/catalog.part' (0 bytes)... 
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Wrote file 'source-data/catalog.json' (127 bytes) -- duration: 0 seconds
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.json_utils: Start reformatting JSON file 'source-data/catalog.json' (127 bytes) with command '/usr/bin/jq --indent 2 < source-data/catalog.json > source-data/catalog.tmp'
DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Reformatted JSON file 'source-data/catalog.json' (127 bytes) -- initial_file_size: 127 bytes duration: 0.01 seconds
INFO dbnomics_toolbox.fetcher_utils.processors.base_downloader: Finished downloading resource 'catalog' successfully -- duration: 1.54 seconds
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 1 resource -- duration: 1.54 seconds
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: The report has been saved to 'download_report.json' (59 bytes)
INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: ProcessorStats(fail_count=0, skip_count=0, success_count=1)
DEBUG dbnomics_toolbox.fetcher_utils.cli_utils.download_cli: Downloader state not saved to a file, but logged: {'dataset_updates': {}}
```

Now the custom response validation applies and 2 download attempts are made as expected: the first one fails because the server is busy, and the second one succeeds.

## Resource groups

Sometimes we want to download many files, but if any of them fails, we want none of them. In other words, we want to keep the files if and only if all the downloads succeed.

The [`ResourceGroup`](ResourceGroup) provides such a mechanism.

By default, resource groups store the files of their child resources in the `source-data` directory, but if the `target_dir` kwarg is passed to the constructor, the files are stored under that base directory.

Let's demonstrate that by simulating the download of a dataset composed of 2 files: `data.xml` and `structure.xml`.
```{code-block} python
:caption: downloader.py

import responses


class Downloader(BaseDownloader):
    # [...]

    @override
    def start(self) -> None:
        with responses.RequestsMock(assert_all_requests_are_fired=False) as rsps:
            # The `data` resource fails with a 404 error page.
            rsps.add(
                responses.GET,
                "https://abc-provider.com/dataset1/data.xml",
                body="Page Not Found",
                content_type="text/html",
                status=404,
            )
            # The `structure` resource succeeds.
            rsps.add(
                responses.GET,
                "https://abc-provider.com/dataset1/structure.xml",
                body='',
                content_type="application/xml",
            )
            super().start()

    @override
    def _iter_resources(self) -> Iterator[BaseResource]:
        yield ResourceGroup(
            id="dataset1",
            resources=[
                HttpResource(
                    id="data",
                    request="https://abc-provider.com/dataset1/data.xml",
                    target_file="data.xml",
                ),
                HttpResource(
                    id="structure",
                    request="https://abc-provider.com/dataset1/structure.xml",
                    target_file="structure.xml",
                ),
            ],
            target_dir="dataset1",
        )
```

Let's run the script:

```bash
$ python download.py source-data
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created target directory: 'source-data'
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created cache directory: 'source-data/.cache'
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created debug directory: 'source-data/.debug'
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 1 resource... -- ids: ['dataset1']
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource ResourceGroup(id='dataset1') (1/1)
DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 2 resources... -- ids: ['data', 'structure']
DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource HttpResource(id='data') (1/2)
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/dataset1/data.xml' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Invalid HTTP response: DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Dumped HTTP request and response to 'source-data/.debug/dataset1/data.xml.http_dump.attempt_1.txt' (303 bytes) ERROR dbnomics_toolbox.fetcher_utils.resources.resource_group: Error downloading resource 'data' of group 'dataset1' Traceback (most recent call last): File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/resource_group.py", line 65, in _download_resource self._downloader._download_resource( # noqa: SLF001 # type: ignore[reportPrivateUsage] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ resource, ^^^^^^^^^ progression=progression, ^^^^^^^^^^^^^^^^^^^^^^^^ reraise=True, ^^^^^^^^^^^^^ ) ^ File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/processors/base_downloader.py", line 136, in _download_resource resource._start() # type: ignore[reportPrivateUsage] # noqa: SLF001 ~~~~~~~~~~~~~~~^^ File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/file_resource.py", line 223, in _start run_retrying_attempts( ~~~~~~~~~~~~~~~~~~~~~^ retrying=self._retrying, ^^^^^^^^^^^^^^^^^^^^^^^^ run_attempt=run_attempt, ^^^^^^^^^^^^^^^^^^^^^^^^ ) ^ File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/run.py", line 26, in run_retrying_attempts for attempt in retrying: ^^^^^^^^ File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 445, in __iter__ do = self.iter(retry_state=retry_state) File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 378, in iter result = action(retry_state) File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 400, in 
self._add_action_func(lambda rs: rs.outcome.result()) ~~~~~~~~~~~~~~~~~^^ File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 449, in result return self.__get_result() ~~~~~~~~~~~~~~~~~^^ File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result raise self._exception File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/run.py", line 30, in run_retrying_attempts result = run_attempt(retry_state=retry_state) File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/file_resource.py", line 220, in run_attempt self._download(retry_state=retry_state) ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 108, in _download with self._fetch_response(retry_state=retry_state) as response: ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/contextlib.py", line 141, in __enter__ return next(self.gen) File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 145, in _fetch_response self._validate_response(response) ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^ File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 206, in _validate_response response.raise_for_status() ~~~~~~~~~~~~~~~~~~~~~~~~~^^ File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/requests/models.py", line 1026, in raise_for_status raise HTTPError(http_error_msg, response=self) requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://abc-provider.com/dataset1/data.xml DEBUG dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start 
downloading resource HttpResource(id='structure') (2/2) DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/dataset1/structure.xml' (connect timeout: 1 minute, read timeout: 1 minute)... DEBUG dbnomics_toolbox.fetcher_utils.resources.http_resource: Received HTTP response: DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Started writing to temporary file 'source-data/.cache/dataset1/structure.part' (0 bytes)... DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Wrote file 'source-data/.cache/dataset1/structure.xml' (51 bytes) -- duration: 0 seconds DEBUG dbnomics_toolbox.fetcher_utils.file_utils.xml_utils.reformatters: Start reformatting XML file 'source-data/.cache/dataset1/structure.xml' (51 bytes) with command '/usr/bin/xmlindent -i 2 source-data/.cache/dataset1/structure.xml -o source-data/.cache/dataset1/structure.tmp' DEBUG dbnomics_toolbox.fetcher_utils.file_utils.common: Reformatted XML file 'source-data/.cache/dataset1/structure.xml' (51 bytes) -- initial_file_size: 51 bytes duration: 0 seconds INFO dbnomics_toolbox.fetcher_utils.processors.base_downloader: Finished downloading resource 'structure' successfully -- duration: 0.02 seconds DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 2 resources -- duration: 0.03 seconds INFO dbnomics_toolbox.fetcher_utils.processors.base_downloader: Finished downloading resource 'dataset1' successfully -- duration: 0.03 seconds DEBUG dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 1 resource -- duration: 0.03 seconds INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: The report has been saved to 'download_report.json' (59 bytes) INFO dbnomics_toolbox.fetcher_utils.processors.base_processor: ProcessorStats(fail_count=1, skip_count=0, success_count=1) DEBUG dbnomics_toolbox.fetcher_utils.cli_utils.download_cli: Downloader state not saved to a file, but logged: 
{'dataset_updates': {}}
```

Let's look at the `source-data` directory:

```bash
$ tree -a source-data
source-data
├── .cache
│   ├── dataset1
│   │   └── structure.xml
│   └── .gitignore
└── .debug
    ├── dataset1
    │   └── data.xml.http_dump.attempt_1.txt
    └── .gitignore

5 directories, 4 files
```

The `source-data/dataset1` directory does not exist, which is what we want: one of the resources of the group failed, so we want none of them.

The `structure.xml` file is stored in the cache directory so that a subsequent download can take advantage of the resume mode and skip downloading the file again.

The response dump is stored in the debug directory as `data.xml.http_dump.attempt_1.txt` and allows us to inspect what went wrong.

Let's fix the simulated response to make the `data` resource succeed:

```{code-block} python
:caption: downloader.py

import responses


class Downloader(BaseDownloader):
    # [...]

    @override
    def start(self) -> None:
        with responses.RequestsMock(assert_all_requests_are_fired=False) as rsps:
            rsps.add(
                responses.GET,
                "https://abc-provider.com/dataset1/data.xml",
                body='',
                content_type="application/xml",
            )
            rsps.add(
                responses.GET,
                "https://abc-provider.com/dataset1/structure.xml",
                body='',
                content_type="application/xml",
            )
            super().start()
```

Let's run the script again:

```bash
python download.py source-data
```

Let's look at the `source-data` directory again:

```bash
$ tree -a source-data
source-data
├── .cache
│   └── .gitignore
├── dataset1
│   ├── data.xml
│   └── structure.xml
└── .debug
    └── .gitignore
```

Now the `source-data/dataset1` directory exists and contains all the files of the resource group.

## Scraping the provider website

When providers do not distribute machine-parseable data, we can scrape their website to extract the missing data. For example, the list of datasets may only be available as a list of links in an HTML page.
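To make the "list of links in an HTML page" case concrete, here is a minimal sketch that uses only the Python standard library. The HTML snippet, the `DatasetLinkParser` class and the URL layout are assumptions made up for this illustration; a real provider page will differ, and a fetcher may well prefer a dedicated HTML parsing library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical page content: the real provider page will differ.
CATALOG_HTML = """
<ul id="datasets">
  <li><a href="/data/BOP">Balance of payments</a></li>
  <li><a href="/data/GDP">Gross domestic product</a></li>
</ul>
"""


class DatasetLinkParser(HTMLParser):
    """Collect (absolute URL, link text) pairs for every <a href> of a page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self._base_url = base_url
        self._current_href: str | None = None
        self._current_text: list[str] = []
        self.links: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                # Resolve relative links against the page URL.
                self._current_href = urljoin(self._base_url, href)
                self._current_text = []

    def handle_data(self, data):
        # Only keep text found inside a link.
        if self._current_href is not None:
            self._current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._current_text).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


parser = DatasetLinkParser(base_url="https://abc-provider.com/")
parser.feed(CATALOG_HTML)
parser.close()
dataset_links = parser.links
```

After parsing, `dataset_links` holds one `(url, name)` pair per dataset link, which is the kind of high-level data a scraping helper would hand over to the rest of the fetcher.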
When using web scraping, fetchers should define a `website.py` module that exposes a `Website` class. This class encapsulates the details and knowledge about the website (URLs, data iterators, etc.) and makes high-level data available through methods:

```{code-block} python
:caption: constants.py

from typing import Final

from yarl import URL

WEBSITE_BASE_URL: Final = URL("https://abc-provider.com/data")
```

```{code-block} python
:caption: website.py

from yarl import URL

from abc_fetcher.constants import WEBSITE_BASE_URL


class Website:
    def __init__(self, *, base_url: URL | str | None = None) -> None:
        if base_url is None:
            base_url = WEBSITE_BASE_URL
        if isinstance(base_url, str):
            base_url = URL(base_url)
        self._base_url = base_url

    def build_series_url(self, series_id: str) -> URL:
        # Build the URL from the private attribute set in `__init__`.
        return self._base_url / f"series/{series_id}"
```

Real-world examples:

- [ons-fetcher](https://git.nomics.world/dbnomics-fetchers/ons-fetcher/-/blob/master/src/ons_fetcher/website.py) relies on web scraping to extract the category tree of datasets.

## SDMX

Providers that distribute SDMX data can be handled by subclassing `BaseSdmxDownloader`.

This base class handles many things related to SDMX data:

- downloading global SDMX resources (e.g. dataflow, categorisation, categoryscheme)
- downloading datasets by iterating the dataflow
- extracting the last update date to avoid downloading non-updated datasets again and again

This base class also defines abstract methods that must be implemented in the fetcher.

TODO `SdmxApi`

Real-world examples:

- [oecd-fetcher](https://git.nomics.world/dbnomics-fetchers/oecd-fetcher/-/blob/main/src/oecd_fetcher/downloader.py)
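The last-update check mentioned in the SDMX section can be pictured with a small sketch. This is not how `BaseSdmxDownloader` is implemented: the function name, the dictionaries and the dates below are hypothetical and only illustrate the principle of comparing the provider's update dates with those recorded during a previous run.

```python
from datetime import datetime, timezone

# Hypothetical data: last update dates announced by the provider, per dataset,
# and the dates recorded during the previous download.
provider_updates = {
    "BOP": datetime(2025, 3, 1, tzinfo=timezone.utc),
    "GDP": datetime(2025, 1, 15, tzinfo=timezone.utc),
    "CPI": datetime(2025, 2, 10, tzinfo=timezone.utc),
}
previous_updates = {
    "BOP": datetime(2025, 2, 1, tzinfo=timezone.utc),
    "GDP": datetime(2025, 1, 15, tzinfo=timezone.utc),
}


def select_datasets_to_download(provider, previous):
    """Return the IDs of datasets that are new or updated since the last run."""
    return [
        dataset_id
        for dataset_id, updated_at in provider.items()
        if dataset_id not in previous or updated_at > previous[dataset_id]
    ]


to_download = select_datasets_to_download(provider_updates, previous_updates)
```

In this example, `BOP` is selected because the provider announces a more recent date, `CPI` because it was never downloaded, and `GDP` is skipped because its date has not changed.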