# Downloading data

## Overview

The diagram below shows the components involved in downloading data:

![Download architecture](../_static/download-architecture.drawio.svg)

The following sections will introduce each component of this architecture diagram.

## Download script

A fetcher must define a `download.py` script:

```{code-block} python
:caption: download.py
#!/usr/bin/env python3

from dbnomics_toolbox.fetcher_utils.cli_utils.download_cli import DownloadCLI
from dbnomics_toolbox.fetcher_utils.cli_utils.download_cli_args import DownloadCLIArgs

import abc_fetcher
from abc_fetcher.downloader import Downloader


def main() -> None:
    cli = DownloadCLI(package_name=abc_fetcher.__name__)
    cli.start()
    downloader = Downloader(**cli.args.as_downloader_kwargs)
    downloader.start()
    cli.finalize(downloader)


if __name__ == "__main__":
    main()
```

This script can be called manually from the command line, and will be called by the DBnomics infrastructure in production.

Note the separation of concerns:

- the [`DownloadCLI`](DownloadCLI) class handles the CLI arguments and options, and saves an output state file (in the [`finalize`](DownloadCLI.finalize) method),
- the `Downloader` class takes care of the download logic.

## Downloader

The `Downloader` class, which inherits from [`BaseDownloader`](BaseDownloader), is responsible for downloading data from the provider infrastructure.

To do so, it creates file resources (see the "File resources" section below) and downloads them.

Example of `Downloader`:

```{code-block} python
:caption: downloader.py
from collections.abc import Iterator
from typing import Unpack, override

from dbnomics_toolbox.fetcher_utils import BaseDownloader, BaseResource

from abc_fetcher.source_data_repo import SourceDataRepo

__all__ = ["Downloader"]


class Downloader(BaseDownloader):
    """Download files from the provider infrastructure."""

    def __init__(
        self,
        **kwargs: Unpack[BaseDownloader.InitKwargs],
    ) -> None:
        super().__init__(**kwargs)
        self._source_data_repo = SourceDataRepo(source_data_dir=kwargs["source_data_dir"])

    @override
    def _iter_resources(self) -> Iterator[BaseResource]:
        # Defined later
        ...
```

The `__init__` method instantiates the `SourceDataRepo` class, which is a fundamental pattern detailed in the next section.

## Source data repository

We don't want the `Downloader` to know too much about source data, especially file paths inside the `source-data` directory, and how to read data from files.

Remember: the `Converter` will also have to read data back from this directory, and we don't want to duplicate this logic.

So we can define a `SourceDataRepo` class and use it from both the `Downloader` and the `Converter`.

For example, let's say that the ABC provider exposes its datasets in a `catalog.json` file such as:

```json
[
  {"dataset_id": "BOP", "dataset_name": "Balance of payments"},
  {"dataset_id": "GDP", "dataset_name": "Gross domestic product"}
]
```

Those catalog items can be modeled by a `CatalogItem` dataclass:

```{code-block} python
:caption: source_data_model.py
from dataclasses import dataclass

__all__ = ["CatalogItem"]


@dataclass(frozen=True, kw_only=True)
class CatalogItem:
    dataset_id: str
    dataset_name: str
```

The `SourceDataRepo` class can define methods that parse source-data files and expose their content as model objects from the provider domain.

To validate Python dicts and load them into dataclass instances, we recommend using a data loading library such as [`typedload`](https://ltworf.codeberg.page/typedload/) or [`pydantic`](https://docs.pydantic.dev/).
The [`dbnomics_toolbox.json_utils.load_json_file`](dbnomics_toolbox.json_utils.load_json_file) function uses `typedload` under the hood.

Example of `SourceDataRepo`:

```{code-block} python
:caption: source_data_repo.py
from collections.abc import Iterator
from pathlib import Path

import daiquiri
from dbnomics_toolbox.json_utils import load_json_file

from abc_fetcher.source_data_model import CatalogItem

__all__ = ["SourceDataRepo"]


logger = daiquiri.getLogger(__name__)


class SourceDataRepo:
    """Load data from the `source-data` directory.

    Useful both to the Downloader (e.g. to know where to write the downloaded files)
    and to the Converter (e.g. to load those files).
    """

    def __init__(self, *, source_data_dir: Path) -> None:
        self._source_data_dir = source_data_dir

        self.catalog_file = source_data_dir / "catalog.json"

    def iter_catalog_items(self) -> Iterator[CatalogItem]:
        catalog_items = load_json_file(self.catalog_file, type_=list[CatalogItem])
        yield from catalog_items
```

The `iter_catalog_items` method will be called by the `Downloader` in order to download all the datasets, and by the `Converter` to convert them all.

## File resources

A resource represents some data distributed by the provider that will be stored as a single file.

The abstract method [`FileResource._download`](FileResource._download) is responsible for writing data to the target file.
Once the data is written, the MIME type of the target file is validated, and the file is reformatted according to its format (e.g. XML, JSON).
Both features are enabled by default with auto-detection, but can be disabled by passing arguments to `__init__`.

Most of the time you will use the [`HttpResource`](HttpResource), a child class of [`FileResource`](FileResource) which uses the [`requests`](https://requests.readthedocs.io/en/latest/) library to download files.

For example we can define the following resource to download a JSON file from `https://abc-provider.com/data/catalog.json`:

```{code-block} python
:caption: downloader.py
from dbnomics_toolbox.fetcher_utils import HttpResource

class Downloader(BaseDownloader):
    # [...]

    @override
    def _iter_resources(self) -> Iterator[BaseResource]:
        yield HttpResource(
            id="catalog",
            request="https://abc-provider.com/data/catalog.json",
            target_file=self._source_data_repo.catalog_file,
        )
```

The `target_file` argument references the `catalog_file` attribute from the `SourceDataRepo`.

The `request` argument can be a URL or a `requests.Request` object.

See also the [file resources](resources/file-resources.md) page.

## Simulate the provider web API

The example URLs starting with `https://abc-provider.com/` do not actually exist.

To make the downloader work with them, we're going to intercept those requests and respond with fake data by using the [`responses`](https://github.com/getsentry/responses) package.

Install the package:

```bash
uv add responses
```

Override the [`BaseDownloader.start`](BaseDownloader.start) method:

```{code-block} python
:caption: downloader.py
import responses

class Downloader(BaseDownloader):
    # [...]

    @override
    def start(self) -> None:
        with responses.RequestsMock(assert_all_requests_are_fired=False) as rsps:
            rsps.add(
                responses.GET,
                "https://abc-provider.com/data/catalog.json",
                json=[
                    {"dataset_id": "BOP", "dataset_name": "Balance of payments"},
                    {"dataset_id": "GDP", "dataset_name": "Gross domestic product"},
                ],
                status=200,
            )
            super().start()
```

## Run the download script

```bash
$ python download.py source-data
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created target directory: 'source-data'
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created cache directory: 'source-data/.cache'
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created debug directory: 'source-data/.debug'
DEBUG    dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 1 resource... -- ids: ['catalog']
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource HttpResource(id='catalog') (1/1)
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Received HTTP response: <Response [200]>
DEBUG    dbnomics_toolbox.fetcher_utils.file_utils.common: Started writing to temporary file 'source-data/catalog.part' (0 bytes)...
DEBUG    dbnomics_toolbox.fetcher_utils.file_utils.common: Wrote file 'source-data/catalog.json' (127 bytes) -- duration: 0 seconds
DEBUG    dbnomics_toolbox.fetcher_utils.file_utils.json_utils: Start reformatting JSON file 'source-data/catalog.json' (127 bytes) with command '/usr/bin/jq --indent 2 < source-data/catalog.json > source-data/catalog.tmp'
DEBUG    dbnomics_toolbox.fetcher_utils.file_utils.common: Reformatted JSON file 'source-data/catalog.json' (127 bytes) -- initial_file_size: 127 bytes duration: 0.01 seconds
INFO     dbnomics_toolbox.fetcher_utils.processors.base_downloader: Finished downloading resource 'catalog' successfully -- duration: 0.06 seconds
DEBUG    dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 1 resource -- duration: 0.06 seconds
INFO     dbnomics_toolbox.fetcher_utils.processors.base_processor: The report has been saved to 'download_report.json' (59 bytes)
INFO     dbnomics_toolbox.fetcher_utils.processors.base_processor: ProcessorStats(fail_count=0, skip_count=0, success_count=1)
DEBUG    dbnomics_toolbox.fetcher_utils.cli_utils.download_cli: Downloader state not saved to a file, but logged: {'dataset_updates': {}}
```

The log mainly shows that the resource with the ID `catalog` was downloaded successfully.
The other log entries are explained in the following sections.

We can see that the `catalog.json` file was downloaded:

```bash
$ tree source-data
source-data
└── catalog.json

1 directory, 1 file

$ cat source-data/catalog.json
[
  {
    "dataset_id": "BOP",
    "dataset_name": "Balance of payments"
  },
  {
    "dataset_id": "GDP",
    "dataset_name": "Gross domestic product"
  }
]
```

Note: as shown in the logs, the JSON file was reformatted.
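The logs show that the toolbox delegates this step to `jq --indent 2`. As a rough illustration only (this is not how the toolbox does it), the standard library should produce the same layout for this small catalog, which also explains why the file grows from the 127 bytes logged at download time to the 158 bytes reported later in resume mode:

```python
import json

catalog = [
    {"dataset_id": "BOP", "dataset_name": "Balance of payments"},
    {"dataset_id": "GDP", "dataset_name": "Gross domestic product"},
]

# Compact form, comparable to the body received from the simulated API.
compact = json.dumps(catalog)

# Indented form with a trailing newline, comparable to the output of jq.
pretty = json.dumps(catalog, indent=2) + "\n"

print(len(compact.encode("utf-8")), "bytes before reformatting")
print(len(pretty.encode("utf-8")), "bytes after reformatting")
print(pretty, end="")
```

Storing files in a stable, indented layout keeps diffs between two downloads readable when the source data is versioned.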

## Resume mode

When the file of a resource already exists, that resource is skipped.
This behavior is called the *resume mode*.

For example, if we run the download script again, we can read in the logs that the resource is skipped:

```{code-block} bash
:emphasize-lines: 5
$ python download.py source-data
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data' as the target directory
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data/.cache' as the cache directory
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data/.debug' as the debug directory
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Skipped resource HttpResource(id='catalog'): [Resume mode] Skipping resource 'catalog' because its file already exists: 'source-data/catalog.json' (158 bytes)
DEBUG    dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 0 resources... (1 found, 1 skipped) -- ids: []
DEBUG    dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 0 resources -- duration: 0 seconds
INFO     dbnomics_toolbox.fetcher_utils.processors.base_processor: The report has been saved to 'download_report.json' (59 bytes)
INFO     dbnomics_toolbox.fetcher_utils.processors.base_processor: ProcessorStats(fail_count=0, skip_count=1, success_count=0)
DEBUG    dbnomics_toolbox.fetcher_utils.cli_utils.download_cli: Downloader state not saved to a file, but logged: {'dataset_updates': {}}
```

To force downloading the resource again, the resume mode can be disabled by passing the `--no-resume` option:

```{code-block} bash
:emphasize-lines: 6-13
$ python download.py --no-resume source-data
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data' as the target directory
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data/.cache' as the cache directory
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data/.debug' as the debug directory
DEBUG    dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 1 resource... -- ids: ['catalog']
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource HttpResource(id='catalog') (1/1)
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Received HTTP response: <Response [200]>
DEBUG    dbnomics_toolbox.fetcher_utils.file_utils.common: Started writing to temporary file 'source-data/catalog.part' (0 bytes)...
DEBUG    dbnomics_toolbox.fetcher_utils.file_utils.common: Wrote file 'source-data/catalog.json' (127 bytes) -- duration: 0 seconds
DEBUG    dbnomics_toolbox.fetcher_utils.file_utils.json_utils: Start reformatting JSON file 'source-data/catalog.json' (127 bytes) with command '/usr/bin/jq --indent 2 < source-data/catalog.json > source-data/catalog.tmp'
DEBUG    dbnomics_toolbox.fetcher_utils.file_utils.common: Reformatted JSON file 'source-data/catalog.json' (127 bytes) -- duration: 0.01 seconds initial_file_size: 127 bytes
INFO     dbnomics_toolbox.fetcher_utils.processors.base_downloader: Finished downloading resource 'catalog' successfully -- duration: 0.02 seconds
DEBUG    dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 1 resource -- duration: 0.02 seconds
INFO     dbnomics_toolbox.fetcher_utils.processors.base_processor: The report has been saved to 'download_report.json' (59 bytes)
INFO     dbnomics_toolbox.fetcher_utils.processors.base_processor: ProcessorStats(fail_count=0, skip_count=0, success_count=1)
DEBUG    dbnomics_toolbox.fetcher_utils.cli_utils.download_cli: Downloader state not saved to a file, but logged: {'dataset_updates': {}}
```

When the resume mode is disabled, all resources are downloaded, whether or not their file already exists.

To re-download a particular resource only while keeping the resume mode enabled, just delete its file and re-execute the script.
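Conceptually, the resume mode boils down to a file existence check per resource. The sketch below illustrates that decision with the standard library; `split_by_resume_mode`, `demo` and the `BOP.json` file are hypothetical names for this example and do not reflect the internals of `BaseDownloader`.

```python
import tempfile
from pathlib import Path


def split_by_resume_mode(
    target_files: dict[str, Path], *, resume: bool = True
) -> tuple[list[str], list[str]]:
    """Return (IDs to download, IDs skipped) for resource IDs mapped to target files."""
    to_download: list[str] = []
    skipped: list[str] = []
    for resource_id, target_file in target_files.items():
        if resume and target_file.exists():
            # Resume mode: the file is already there, do not download it again.
            skipped.append(resource_id)
        else:
            to_download.append(resource_id)
    return to_download, skipped


def demo() -> dict[str, tuple[list[str], list[str]]]:
    with tempfile.TemporaryDirectory() as tmp_dir:
        source_data_dir = Path(tmp_dir)
        catalog_file = source_data_dir / "catalog.json"
        catalog_file.write_text("[]", encoding="utf-8")  # already downloaded
        target_files = {
            "catalog": catalog_file,
            "BOP": source_data_dir / "BOP.json",  # not downloaded yet
        }
        return {
            "resume": split_by_resume_mode(target_files),
            "no_resume": split_by_resume_mode(target_files, resume=False),
        }


print(demo())
```

With the resume mode enabled only `BOP` would be downloaded; with it disabled both resources are selected, which mirrors the effect of `--no-resume`.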

## Error handling

If an exception occurs while downloading a resource, the error is logged, the resource is skipped, and the script continues without crashing.

If the target file of the resource was written, even partially, the [`BaseDownloader`](BaseDownloader) will move the file to the debug directory for further inspection.

By default, the debug directory is a sub-directory of the source data directory named `.debug`.
Its path can be customized by passing the `--debug-dir` option.
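The behavior described above can be pictured with the following standard library sketch. It is an assumption-laden illustration, not the code of `BaseDownloader`: `download_all`, `write_ok`, `write_then_fail` and `broken.json` are hypothetical names.

```python
import logging
import shutil
import tempfile
from collections.abc import Callable
from pathlib import Path

logger = logging.getLogger(__name__)


def download_all(
    resources: dict[Path, Callable[[Path], None]], *, debug_dir: Path
) -> tuple[int, int]:
    """Run each download function and return (success_count, fail_count)."""
    success_count = fail_count = 0
    for target_file, download in resources.items():
        try:
            download(target_file)
        except Exception:
            # Log the error and keep going with the next resource.
            logger.exception("Error downloading %s", target_file.name)
            fail_count += 1
            if target_file.exists():
                # Keep the possibly partial file for inspection,
                # but out of the source data directory.
                debug_dir.mkdir(parents=True, exist_ok=True)
                shutil.move(target_file, debug_dir / target_file.name)
            continue
        success_count += 1
    return success_count, fail_count


def write_ok(target_file: Path) -> None:
    target_file.write_text("[]", encoding="utf-8")


def write_then_fail(target_file: Path) -> None:
    target_file.write_text("[{", encoding="utf-8")  # partial content
    raise ConnectionError("connection lost")


def demo() -> tuple[tuple[int, int], list[str], list[str]]:
    with tempfile.TemporaryDirectory() as tmp_dir:
        source_data_dir = Path(tmp_dir)
        debug_dir = source_data_dir / ".debug"
        counts = download_all(
            {
                source_data_dir / "catalog.json": write_ok,
                source_data_dir / "broken.json": write_then_fail,
            },
            debug_dir=debug_dir,
        )
        kept = sorted(path.name for path in source_data_dir.glob("*.json"))
        moved = sorted(path.name for path in debug_dir.iterdir())
        return counts, kept, moved


print(demo())
```

The point of moving the file rather than deleting it is that the evidence stays available for debugging while the source data directory only ever contains complete files.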

### Example: simulate a 404 page not found

To simulate an error with the resource `catalog`, modify the `Downloader.start` method that we created earlier to return a 404 HTML page for the URL of the catalog JSON file:

```{code-block} python
:caption: downloader.py
import responses

class Downloader(BaseDownloader):
    # [...]

    @override
    def start(self) -> None:
        with responses.RequestsMock(assert_all_requests_are_fired=False) as rsps:
            # rsps.add(
            #     responses.GET,
            #     "https://abc-provider.com/data/catalog.json",
            #     json=[
            #         {"dataset_id": "BOP", "dataset_name": "Balance of payments"},
            #         {"dataset_id": "GDP", "dataset_name": "Gross domestic product"},
            #     ],
            #     status=200,
            # )
            rsps.add(
                responses.GET,
                "https://abc-provider.com/data/catalog.json",
                body="Page not found!",
                content_type="text/html",
                status=404,
            )
            super().start()
```

Run the script:

```bash
$ python download.py source-data
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data' as the target directory
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data/.cache' as the cache directory
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data/.debug' as the debug directory
DEBUG    dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 1 resource... -- ids: ['catalog']
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource HttpResource(id='catalog') (1/1)
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Invalid HTTP response: <Response [404]>
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Dumped HTTP request and response to 'source-data/.debug/catalog.json.http_dump.attempt_1.txt' (298 bytes)
ERROR    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Error downloading resource 'catalog' -- duration: 0.01 seconds
Traceback (most recent call last):
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/processors/base_downloader.py", line 136, in _download_resource
    resource._start()  # type: ignore[reportPrivateUsage] # noqa: SLF001
    ~~~~~~~~~~~~~~~^^
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/file_resource.py", line 225, in _start
    run_retrying_attempts(
    ~~~~~~~~~~~~~~~~~~~~~^
        retrying=self._retrying,
        ^^^^^^^^^^^^^^^^^^^^^^^^
        run_attempt=run_attempt,
        ^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/run.py", line 26, in run_retrying_attempts
    for attempt in retrying:
                   ^^^^^^^^
  File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 445, in __iter__
    do = self.iter(retry_state=retry_state)
  File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 378, in iter
    result = action(retry_state)
  File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 400, in <lambda>
    self._add_action_func(lambda rs: rs.outcome.result())
                                     ~~~~~~~~~~~~~~~~~^^
  File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ~~~~~~~~~~~~~~~~~^^
  File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/run.py", line 30, in run_retrying_attempts
    result = run_attempt(retry_state=retry_state)
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/file_resource.py", line 222, in run_attempt
    self._download(retry_state=retry_state)
    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 108, in _download
    with self._fetch_response(retry_state=retry_state) as response:
         ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/contextlib.py", line 141, in __enter__
    return next(self.gen)
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 145, in _fetch_response
    self._validate_response(response)
    ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 208, in _validate_response
    response.raise_for_status()
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/requests/models.py", line 1026, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://abc-provider.com/data/catalog.json
DEBUG    dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 1 resource -- duration: 0.03 seconds
INFO     dbnomics_toolbox.fetcher_utils.processors.base_processor: The report has been saved to 'download_report.json' (59 bytes)
INFO     dbnomics_toolbox.fetcher_utils.processors.base_processor: ProcessorStats(fail_count=1, skip_count=0, success_count=0)
DEBUG    dbnomics_toolbox.fetcher_utils.cli_utils.download_cli: Downloader state not saved to a file, but logged: {'dataset_updates': {}}
```

The logs show the `HTTPError` exception that was raised because of the 404 response status code.

In this case the error was raised before starting to write the target file `catalog.json`, so we won't find it in `source-data`.
This is important because we don't want invalid files to be written to the `source-data` directory.

However, as shown in the logs, since the HTTP request failed, a textual dump of the request and the response was saved to the debug directory:

```bash
$ tree -a source-data
source-data
├── .cache
│   └── .gitignore
└── .debug
    ├── catalog.json.http_dump.attempt_1.txt
    └── .gitignore

3 directories, 3 files

$ cat source-data/.debug/catalog.json.http_dump.attempt_1.txt
< GET /data/catalog.json HTTP/1.1
< Host: abc-provider.com
< User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0
< Accept-Encoding: gzip, deflate
< Accept: */*
< Connection: keep-alive
<

> HTTP/? 404 Not Found
> Content-Type: text/html
>
Page not found!%
```

This allows us to quickly spot the problem in context, without having to reproduce that request with curl or in the browser.

## Retrying

It's not unusual for a server to be too busy to respond with something useful.
In this case we may receive a response that tells us to retry after a delay.

This retrying logic is implemented by the [`HttpResource`](HttpResource).

See also: [retrying downloads](resources/http-resources.md#retrying-downloads).

### Example: simulate a busy server

This time, let's simulate a server that responds with something like "Server busy, retry later" the first time the URL is requested, then responds with the JSON catalog as expected the second time.

We'll use the HTTP response code [429 Too Many Requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/429).

Modify the `Downloader.start` method again:

```{code-block} python
:caption: downloader.py
import responses

class Downloader(BaseDownloader):
    # [...]

    @override
    def start(self) -> None:
        with responses.RequestsMock(assert_all_requests_are_fired=False) as rsps:
            # When `rsps.add` is called several times for the same URL, the `RequestsMock`
            # consumes the registered responses in order, one per incoming request.
            rsps.add(
                responses.GET,
                "https://abc-provider.com/data/catalog.json",
                body="<p>Server busy, retry later<p>",
                content_type="text/html",
                status=429,
            )
            rsps.add(
                responses.GET,
                "https://abc-provider.com/data/catalog.json",
                json=[
                    {"dataset_id": "BOP", "dataset_name": "Balance of payments"},
                    {"dataset_id": "GDP", "dataset_name": "Gross domestic product"},
                ],
                status=200,
            )
            super().start()
```

Run the script:

```bash
$ python download.py source-data
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created target directory: 'source-data'
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created cache directory: 'source-data/.cache'
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created debug directory: 'source-data/.debug'
DEBUG    dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 1 resource... -- ids: ['catalog']
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource HttpResource(id='catalog') (1/1)
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Invalid HTTP response: <Response [429]>
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Dumped HTTP request and response to 'source-data/.debug/catalog.json.http_dump.attempt_1.txt' (321 bytes)
ERROR    dbnomics_toolbox.retry_utils.loggers: Error during attempt 1 after 0.01 seconds
Traceback (most recent call last):
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/loggers.py", line 41, in log_failed_attempt
    outcome.result()
    ~~~~~~~~~~~~~~^^
  File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ~~~~~~~~~~~~~~~~~^^
  File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/run.py", line 30, in run_retrying_attempts
    result = run_attempt(retry_state=retry_state)
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/file_resource.py", line 222, in run_attempt
    self._download(retry_state=retry_state)
    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 108, in _download
    with self._fetch_response(retry_state=retry_state) as response:
         ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/contextlib.py", line 141, in __enter__
    return next(self.gen)
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 145, in _fetch_response
    self._validate_response(response)
    ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 208, in _validate_response
    response.raise_for_status()
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/requests/models.py", line 1026, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://abc-provider.com/data/catalog.json
DEBUG    dbnomics_toolbox.retry_utils.loggers: Sleeping 1.5 seconds
DEBUG    dbnomics_toolbox.retry_utils.loggers: Starting attempt 2
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Received HTTP response: <Response [200]>
DEBUG    dbnomics_toolbox.fetcher_utils.file_utils.common: Started writing to temporary file 'source-data/catalog.part' (0 bytes)...
DEBUG    dbnomics_toolbox.fetcher_utils.file_utils.common: Wrote file 'source-data/catalog.json' (127 bytes) -- duration: 0 seconds
DEBUG    dbnomics_toolbox.fetcher_utils.file_utils.json_utils: Start reformatting JSON file 'source-data/catalog.json' (127 bytes) with command '/usr/bin/jq --indent 2 < source-data/catalog.json > source-data/catalog.tmp'
DEBUG    dbnomics_toolbox.fetcher_utils.file_utils.common: Reformatted JSON file 'source-data/catalog.json' (127 bytes) -- duration: 0.01 seconds initial_file_size: 127 bytes
INFO     dbnomics_toolbox.fetcher_utils.processors.base_downloader: Finished downloading resource 'catalog' successfully -- duration: 1.56 seconds
DEBUG    dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 1 resource -- duration: 1.56 seconds
INFO     dbnomics_toolbox.fetcher_utils.processors.base_processor: The report has been saved to 'download_report.json' (59 bytes)
INFO     dbnomics_toolbox.fetcher_utils.processors.base_processor: ProcessorStats(fail_count=0, skip_count=0, success_count=1)
DEBUG    dbnomics_toolbox.fetcher_utils.cli_utils.download_cli: Downloader state not saved to a file, but logged: {'dataset_updates': {}}
```

After the first failed attempt, we see that the download script slept for 1.5 seconds.
This is because the default wait strategy grows exponentially, starting with a low value.
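To give an idea of what an exponential wait strategy looks like, here is a small sketch. The initial value of 1.5 seconds merely echoes the log above; the multiplier and the maximum are illustrative assumptions, not the toolbox defaults, and `exponential_delays` is a hypothetical helper.

```python
def exponential_delays(
    attempts: int,
    *,
    initial: float = 1.5,  # echoes the log above
    multiplier: float = 2.0,  # assumption for illustration
    maximum: float = 60.0,  # assumption for illustration
) -> list[float]:
    """Return the delay in seconds to wait after each failed attempt."""
    # Each delay is the previous one times `multiplier`, capped at `maximum`.
    return [min(maximum, initial * multiplier**n) for n in range(attempts)]


print(exponential_delays(8))
```

Starting low keeps transient errors cheap to recover from, while the growth and the cap avoid hammering a server that stays busy for longer.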

We can now inspect the files corresponding to the first failed attempt, then the successful second one:

```bash
$ tree -a source-data
source-data
├── .cache
│   └── .gitignore
├── catalog.json
└── .debug
    ├── catalog.json.http_dump.attempt_1.txt
    └── .gitignore

3 directories, 4 files

$ cat source-data/.debug/catalog.json.http_dump.attempt_1.txt
< GET /data/catalog.json HTTP/1.1
< Host: abc-provider.com
< User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0
< Accept-Encoding: gzip, deflate
< Accept: */*
< Connection: keep-alive
<

> HTTP/? 429 Too Many Requests
> Content-Type: text/html
>
<p>Server busy, retry later<p>%

$ cat source-data/catalog.json
[
  {
    "dataset_id": "BOP",
    "dataset_name": "Balance of payments"
  },
  {
    "dataset_id": "GDP",
    "dataset_name": "Gross domestic product"
  }
]
```

### Example: simulate `Retry-After` response header

If the server responds with a `Retry-After` HTTP response header, its value is used as the waiting time before the next attempt.

Let's simulate such a response:

```{code-block} python
:caption: downloader.py
import responses

class Downloader(BaseDownloader):
    # [...]

    @override
    def start(self) -> None:
        with responses.RequestsMock(assert_all_requests_are_fired=False) as rsps:
            # When `rsps.add` is called several times for the same URL, the `RequestsMock`
            # consumes the registered responses in order, one per incoming request.
            rsps.add(
                responses.GET,
                "https://abc-provider.com/data/catalog.json",
                body="<p>Server busy, retry later<p>",
                content_type="text/html",
                headers={"Retry-After": "4"},
                status=429,
            )
            rsps.add(
                responses.GET,
                "https://abc-provider.com/data/catalog.json",
                json=[
                    {"dataset_id": "BOP", "dataset_name": "Balance of payments"},
                    {"dataset_id": "GDP", "dataset_name": "Gross domestic product"},
                ],
                status=200,
            )
            super().start()
```

This time the script waits for 4 seconds:

```bash
$ python download.py source-data
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data' as the target directory
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data/.cache' as the cache directory
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Using the existing directory 'source-data/.debug' as the debug directory
DEBUG    dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 1 resource... -- ids: ['catalog']
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource HttpResource(id='catalog') (1/1)
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Invalid HTTP response: <Response [429]>
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Dumped HTTP request and response to 'source-data/.debug/catalog.json.http_dump.attempt_1.txt' (339 bytes)
ERROR    dbnomics_toolbox.retry_utils.loggers: Error during attempt 1 after 0.01 seconds
Traceback (most recent call last):
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/loggers.py", line 41, in log_failed_attempt
    outcome.result()
    ~~~~~~~~~~~~~~^^
  File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ~~~~~~~~~~~~~~~~~^^
  File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/run.py", line 30, in run_retrying_attempts
    result = run_attempt(retry_state=retry_state)
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/file_resource.py", line 222, in run_attempt
    self._download(retry_state=retry_state)
    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 108, in _download
    with self._fetch_response(retry_state=retry_state) as response:
         ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/contextlib.py", line 141, in __enter__
    return next(self.gen)
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 145, in _fetch_response
    self._validate_response(response)
    ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 208, in _validate_response
    response.raise_for_status()
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/requests/models.py", line 1026, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://abc-provider.com/data/catalog.json
DEBUG    dbnomics_toolbox.fetcher_utils.http_utils.requests_utils.waiters: The HTTP response has a HTTP header Retry-After: 4
DEBUG    dbnomics_toolbox.retry_utils.loggers: Sleeping 4 seconds
DEBUG    dbnomics_toolbox.retry_utils.loggers: Starting attempt 2
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Received HTTP response: <Response [200]>
DEBUG    dbnomics_toolbox.fetcher_utils.file_utils.common: Started writing to temporary file 'source-data/catalog.part' (0 bytes)...
DEBUG    dbnomics_toolbox.fetcher_utils.file_utils.common: Wrote file 'source-data/catalog.json' (127 bytes) -- duration: 0 seconds
DEBUG    dbnomics_toolbox.fetcher_utils.file_utils.json_utils: Start reformatting JSON file 'source-data/catalog.json' (127 bytes) with command '/usr/bin/jq --indent 2 < source-data/catalog.json > source-data/catalog.tmp'
DEBUG    dbnomics_toolbox.fetcher_utils.file_utils.common: Reformatted JSON file 'source-data/catalog.json' (127 bytes) -- duration: 0.01 seconds initial_file_size: 127 bytes
INFO     dbnomics_toolbox.fetcher_utils.processors.base_downloader: Finished downloading resource 'catalog' successfully -- duration: 4.05 seconds
DEBUG    dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 1 resource -- duration: 4.05 seconds
INFO     dbnomics_toolbox.fetcher_utils.processors.base_processor: The report has been saved to 'download_report.json' (59 bytes)
INFO     dbnomics_toolbox.fetcher_utils.processors.base_processor: ProcessorStats(fail_count=0, skip_count=0, success_count=1)
DEBUG    dbnomics_toolbox.fetcher_utils.cli_utils.download_cli: Downloader state not saved to a file, but logged: {'dataset_updates': {}}
```

### Example: simulate a busy server with 200 status code

Sometimes the server responds with a message such as "Server busy, retry later" but still returns a [200 OK](https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/200) status code.

In this case, the default retry strategy of the [`HttpResource`](HttpResource) thinks that the response is successful and does not dig into its contents to determine whether or not to retry downloading the resource.

Let's first simulate the server responses:

```{code-block} python
:caption: downloader.py
import responses

class Downloader(BaseDownloader):
    # [...]

    @override
    def start(self) -> None:
        with responses.RequestsMock(assert_all_requests_are_fired=False) as rsps:
            # When calling `rsps.add` many times for the same URL, the `RequestsMock` will consume each one in order for each incoming request from the client.
            rsps.add(
                responses.GET,
                "https://abc-provider.com/data/catalog.json",
                body="<p>Server busy, retry later<p>",
                content_type="text/html",
                status=200,
            )
            rsps.add(
                responses.GET,
                "https://abc-provider.com/data/catalog.json",
                json=[
                    {"dataset_id": "BOP", "dataset_name": "Balance of payments"},
                    {"dataset_id": "GDP", "dataset_name": "Gross domestic product"},
                ],
                status=200,
            )
            super().start()
```

If we try to run the script at this point, we'll have an [`InvalidMimeType`](InvalidMimeType) exception because the HTML response does not match the extension of the `catalog.json` file name (cf [MIME type validation](resources/file-resources.md#mime-type-validation)), but no retry will be done.
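
To illustrate the idea behind that check, here is a minimal sketch that compares the `Content-Type` of a response with the MIME type guessed from the target file name. It relies only on the standard `mimetypes` module; the function name is hypothetical and the actual validation done by the toolbox may differ:

```python
import mimetypes


def content_type_matches_file_name(content_type: str, file_name: str) -> bool:
    """Check that a Content-Type header value is consistent with a file extension."""
    expected_mime_type, _ = mimetypes.guess_type(file_name)
    if expected_mime_type is None:
        # Unknown extension: nothing to compare against (assumption of this sketch).
        return True

    # Drop parameters such as "; charset=utf-8" before comparing.
    actual_mime_type = content_type.split(";", 1)[0].strip().lower()
    return actual_mime_type == expected_mime_type


print(content_type_matches_file_name("text/html", "catalog.json"))
print(content_type_matches_file_name("application/json", "catalog.json"))
```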

Although we could customize the retry strategy, it's better to customize the HTTP response validation directly to make it fail, by passing the `validate_response` kwarg to the constructor of [`HttpResource`](HttpResource):

```{code-block} python
:caption: downloader.py
import responses
from dbnomics_toolbox.retry_utils.requests.errors import RetryHttpRequest
from requests import Response

class Downloader(BaseDownloader):
    # [...]

    @override
    def _iter_resources(self) -> Iterator[BaseResource]:
        def validate_response(response: Response) -> None:
            response.raise_for_status()
            if "Server busy" in response.text:
                msg = "Server is busy"
                raise RetryHttpRequest(msg, response=response)

        yield HttpResource(
            id="catalog",
            request="https://abc-provider.com/data/catalog.json",
            target_file=self._source_data_repo.catalog_file,
            validate_response=validate_response,
        )
```

In `validate_response`, `raise_for_status` does not raise an exception as the response code is 200.
Raising [`RetryHttpRequest`](dbnomics_toolbox.retry_utils.requests.errors.RetryHttpRequest) (inherited from [`requests.HTTPError`](https://requests.readthedocs.io/en/latest/api/#requests.HTTPError)) makes the resource download fail and lets the request be retried.

In contrast, just raising a [`requests.HTTPError`](https://requests.readthedocs.io/en/latest/api/#requests.HTTPError) would make the resource download fail, but would not let the request be retried, as the response code is 200.
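
The difference between the two exceptions can be pictured with a small retry predicate. This is only a sketch: `HttpError` and `RetryRequested` are hypothetical stand-ins for `requests.HTTPError` and `RetryHttpRequest` (to keep the example dependency-free), and the set of retryable status codes is an assumption, not the toolbox's actual list:

```python
class HttpError(Exception):
    """Stand-in for requests.HTTPError (hypothetical)."""

    def __init__(self, message: str, *, status_code: int) -> None:
        super().__init__(message)
        self.status_code = status_code


class RetryRequested(HttpError):
    """Stand-in for RetryHttpRequest: the caller explicitly asks for a retry."""


# Assumption: status codes considered transient by this sketch.
RETRYABLE_STATUS_CODES = frozenset({429, 500, 502, 503, 504})


def should_retry(error: Exception) -> bool:
    """Decide whether a failed attempt is worth retrying."""
    if isinstance(error, RetryRequested):
        # An explicit retry request wins, whatever the status code is.
        return True
    if isinstance(error, HttpError):
        # A plain HTTP error is retried only for transient status codes.
        return error.status_code in RETRYABLE_STATUS_CODES
    return False


print(should_retry(RetryRequested("Server is busy", status_code=200)))
print(should_retry(HttpError("Server is busy", status_code=200)))
```

With such a predicate, a plain error carrying a 200 status code is never retried, which is why the dedicated exception is needed here.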

Let's execute the script:

```bash
$ python download.py source-data
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created target directory: 'source-data'
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created cache directory: 'source-data/.cache'
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created debug directory: 'source-data/.debug'
DEBUG    dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 1 resource... -- ids: ['catalog']
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource HttpResource(id='catalog') (1/1)
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Invalid HTTP response: <Response [200]>
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Dumped HTTP request and response to 'source-data/.debug/catalog.json.http_dump.attempt_1.txt' (306 bytes)
ERROR    dbnomics_toolbox.retry_utils.loggers: Error during attempt 1 after 0 seconds
Traceback (most recent call last):
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/loggers.py", line 41, in log_failed_attempt
    outcome.result()
    ~~~~~~~~~~~~~~^^
  File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ~~~~~~~~~~~~~~~~~^^
  File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/run.py", line 30, in run_retrying_attempts
    result = run_attempt(retry_state=retry_state)
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/file_resource.py", line 220, in run_attempt
    self._download(retry_state=retry_state)
    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 106, in _download
    with self._fetch_response(retry_state=retry_state) as response:
         ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/contextlib.py", line 141, in __enter__
    return next(self.gen)
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 143, in _fetch_response
    self._validate_response(response)
    ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 209, in _validate_response
    validate_response_callback(response)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
  File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/src/abc_fetcher/downloader.py", line 53, in validate_response
    raise RetryHttpRequest(msg, response=response)
dbnomics_toolbox.retry_utils.requests.errors.RetryHttpRequest: Server is busy
DEBUG    dbnomics_toolbox.retry_utils.loggers: Sleeping 1.5 seconds
DEBUG    dbnomics_toolbox.retry_utils.loggers: Starting attempt 2
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/data/catalog.json' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Received HTTP response: <Response [200]>
DEBUG    dbnomics_toolbox.fetcher_utils.file_utils.common: Started writing to temporary file 'source-data/catalog.part' (0 bytes)...
DEBUG    dbnomics_toolbox.fetcher_utils.file_utils.common: Wrote file 'source-data/catalog.json' (127 bytes) -- duration: 0 seconds
DEBUG    dbnomics_toolbox.fetcher_utils.file_utils.json_utils: Start reformatting JSON file 'source-data/catalog.json' (127 bytes) with command '/usr/bin/jq --indent 2 < source-data/catalog.json > source-data/catalog.tmp'
DEBUG    dbnomics_toolbox.fetcher_utils.file_utils.common: Reformatted JSON file 'source-data/catalog.json' (127 bytes) -- initial_file_size: 127 bytes duration: 0.01 seconds
INFO     dbnomics_toolbox.fetcher_utils.processors.base_downloader: Finished downloading resource 'catalog' successfully -- duration: 1.54 seconds
DEBUG    dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 1 resource -- duration: 1.54 seconds
INFO     dbnomics_toolbox.fetcher_utils.processors.base_processor: The report has been saved to 'download_report.json' (59 bytes)
INFO     dbnomics_toolbox.fetcher_utils.processors.base_processor: ProcessorStats(fail_count=0, skip_count=0, success_count=1)
DEBUG    dbnomics_toolbox.fetcher_utils.cli_utils.download_cli: Downloader state not saved to a file, but logged: {'dataset_updates': {}}
```

Now the custom response validation applies and 2 download attempts are made as expected: the first one fails because the server is busy, and the second one succeeds.

## Resource groups

Sometimes we want to download many files, but if any of them fails, we want none of them.
In other words, we want to keep the files if and only if all the downloads succeed.

The [`ResourceGroup`](ResourceGroup) provides such a mechanism.

By default, a resource group stores the files of its child resources in the `source-data` directory, but if the `target_dir` kwarg is passed to its constructor, they are stored under that base directory.

Let's demonstrate that by simulating downloading a dataset composed of 2 files: `data.xml` and `structure.xml`.

```{code-block} python
:caption: downloader.py
import responses

class Downloader(BaseDownloader):
    # [...]

    @override
    def start(self) -> None:
        with responses.RequestsMock(assert_all_requests_are_fired=False) as rsps:
            rsps.add(
                responses.GET,
                "https://abc-provider.com/dataset1/data.xml",
                body="<p>Page Not Found<p>",
                content_type="text/html",
                status=404,
            )
            rsps.add(
                responses.GET,
                "https://abc-provider.com/dataset1/structure.xml",
                body='<?xml version="1.0" encoding="UTF-8"?><structure />',
                content_type="application/xml",
            )
            super().start()

    @override
    def _iter_resources(self) -> Iterator[BaseResource]:
        yield ResourceGroup(
            id="dataset1",
            resources=[
                HttpResource(
                    id="data",
                    request="https://abc-provider.com/dataset1/data.xml",
                    target_file="data.xml",
                ),
                HttpResource(
                    id="structure",
                    request="https://abc-provider.com/dataset1/structure.xml",
                    target_file="structure.xml",
                ),
            ],
            target_dir="dataset1",
        )
```

Let's run the script:

```bash
$ python download.py source-data
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created target directory: 'source-data'
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created cache directory: 'source-data/.cache'
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Created debug directory: 'source-data/.debug'
DEBUG    dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 1 resource... -- ids: ['dataset1']
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource ResourceGroup(id='dataset1') (1/1)
DEBUG    dbnomics_toolbox.fetcher_utils.processors.logging_utils: About to download 2 resources... -- ids: ['data', 'structure']
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource HttpResource(id='data') (1/2)
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/dataset1/data.xml' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Invalid HTTP response: <Response [404]>
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Dumped HTTP request and response to 'source-data/.debug/dataset1/data.xml.http_dump.attempt_1.txt' (303 bytes)
ERROR    dbnomics_toolbox.fetcher_utils.resources.resource_group: Error downloading resource 'data' of group 'dataset1'
Traceback (most recent call last):
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/resource_group.py", line 65, in _download_resource
    self._downloader._download_resource(  # noqa: SLF001 # type: ignore[reportPrivateUsage]
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        resource,
        ^^^^^^^^^
        progression=progression,
        ^^^^^^^^^^^^^^^^^^^^^^^^
        reraise=True,
        ^^^^^^^^^^^^^
    )
    ^
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/processors/base_downloader.py", line 136, in _download_resource
    resource._start()  # type: ignore[reportPrivateUsage] # noqa: SLF001
    ~~~~~~~~~~~~~~~^^
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/file_resource.py", line 223, in _start
    run_retrying_attempts(
    ~~~~~~~~~~~~~~~~~~~~~^
        retrying=self._retrying,
        ^^^^^^^^^^^^^^^^^^^^^^^^
        run_attempt=run_attempt,
        ^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/run.py", line 26, in run_retrying_attempts
    for attempt in retrying:
                   ^^^^^^^^
  File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 445, in __iter__
    do = self.iter(retry_state=retry_state)
  File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 378, in iter
    result = action(retry_state)
  File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 400, in <lambda>
    self._add_action_func(lambda rs: rs.outcome.result())
                                     ~~~~~~~~~~~~~~~~~^^
  File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ~~~~~~~~~~~~~~~~~^^
  File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/retry_utils/run.py", line 30, in run_retrying_attempts
    result = run_attempt(retry_state=retry_state)
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/file_resource.py", line 220, in run_attempt
    self._download(retry_state=retry_state)
    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 108, in _download
    with self._fetch_response(retry_state=retry_state) as response:
         ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cbenz/.local/share/uv/python/cpython-3.13.1-linux-x86_64-gnu/lib/python3.13/contextlib.py", line 141, in __enter__
    return next(self.gen)
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 145, in _fetch_response
    self._validate_response(response)
    ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
  File "/home/cbenz/Dev/dbnomics/dbnomics-toolbox/src/dbnomics_toolbox/fetcher_utils/resources/http_resource.py", line 206, in _validate_response
    response.raise_for_status()
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/home/cbenz/Dev/dbnomics/dbnomics-fetchers/abc-fetcher/.venv/lib/python3.13/site-packages/requests/models.py", line 1026, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://abc-provider.com/dataset1/data.xml
DEBUG    dbnomics_toolbox.fetcher_utils.processors.base_downloader: Start downloading resource HttpResource(id='structure') (2/2)
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Fetching URL 'https://abc-provider.com/dataset1/structure.xml' (connect timeout: 1 minute, read timeout: 1 minute)...
DEBUG    dbnomics_toolbox.fetcher_utils.resources.http_resource: Received HTTP response: <Response [200]>
DEBUG    dbnomics_toolbox.fetcher_utils.file_utils.common: Started writing to temporary file 'source-data/.cache/dataset1/structure.part' (0 bytes)...
DEBUG    dbnomics_toolbox.fetcher_utils.file_utils.common: Wrote file 'source-data/.cache/dataset1/structure.xml' (51 bytes) -- duration: 0 seconds
DEBUG    dbnomics_toolbox.fetcher_utils.file_utils.xml_utils.reformatters: Start reformatting XML file 'source-data/.cache/dataset1/structure.xml' (51 bytes) with command '/usr/bin/xmlindent -i 2 source-data/.cache/dataset1/structure.xml -o source-data/.cache/dataset1/structure.tmp'
DEBUG    dbnomics_toolbox.fetcher_utils.file_utils.common: Reformatted XML file 'source-data/.cache/dataset1/structure.xml' (51 bytes) -- initial_file_size: 51 bytes duration: 0 seconds
INFO     dbnomics_toolbox.fetcher_utils.processors.base_downloader: Finished downloading resource 'structure' successfully -- duration: 0.02 seconds
DEBUG    dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 2 resources -- duration: 0.03 seconds
INFO     dbnomics_toolbox.fetcher_utils.processors.base_downloader: Finished downloading resource 'dataset1' successfully -- duration: 0.03 seconds
DEBUG    dbnomics_toolbox.fetcher_utils.processors.logging_utils: Finished attempting to download 1 resource -- duration: 0.03 seconds
INFO     dbnomics_toolbox.fetcher_utils.processors.base_processor: The report has been saved to 'download_report.json' (59 bytes)
INFO     dbnomics_toolbox.fetcher_utils.processors.base_processor: ProcessorStats(fail_count=1, skip_count=0, success_count=1)
DEBUG    dbnomics_toolbox.fetcher_utils.cli_utils.download_cli: Downloader state not saved to a file, but logged: {'dataset_updates': {}}
```

Let's look at the `source-data` directory:

```bash
$ tree -a source-data
source-data
├── .cache
│   ├── dataset1
│   │   └── structure.xml
│   └── .gitignore
└── .debug
    ├── dataset1
    │   └── data.xml.http_dump.attempt_1.txt
    └── .gitignore

5 directories, 4 files
```

The `source-data/dataset1` directory does not exist, which is what we want: one of the resources of the group failed, so we want none of them.

The `structure.xml` file is stored in the cache directory so that a subsequent download will take advantage of the resume mode to skip downloading the file again.

The response dump is stored in the debug directory as `data.xml.http_dump.attempt_1.txt` and allows us to inspect what's going on.

Let's fix the simulated response to make the `data` resource succeed:

```{code-block} python
:caption: downloader.py
import responses

class Downloader(BaseDownloader):
    # [...]

    @override
    def start(self) -> None:
        with responses.RequestsMock(assert_all_requests_are_fired=False) as rsps:
            rsps.add(
                responses.GET,
                "https://abc-provider.com/dataset1/data.xml",
                body='<?xml version="1.0" encoding="UTF-8"?><data />',
                content_type="application/xml",
            )
            rsps.add(
                responses.GET,
                "https://abc-provider.com/dataset1/structure.xml",
                body='<?xml version="1.0" encoding="UTF-8"?><structure />',
                content_type="application/xml",
            )
            super().start()
```

Let's run the script again:

```bash
python download.py source-data
```

Let's look at the `source-data` directory:

```bash
$ tree -a source-data
source-data
├── .cache
│   └── .gitignore
├── dataset1
│   ├── data.xml
│   └── structure.xml
└── .debug
    └── .gitignore
```

Now the `source-data/dataset1` directory exists, and contains all the files of the resource group.
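
The all-or-nothing behaviour observed in this section can be sketched with a staging directory: files are first written there, and the directory is moved to its final location only if every file succeeded. This is a simplified illustration with hypothetical names (`download_group`, `writers`), not the actual `ResourceGroup` implementation:

```python
import shutil
from collections.abc import Callable, Mapping
from pathlib import Path


def download_group(
    writers: Mapping[str, Callable[[Path], None]],
    *,
    staging_dir: Path,
    target_dir: Path,
) -> bool:
    """Write each file to staging_dir, then move everything to target_dir on success."""
    staging_dir.mkdir(parents=True, exist_ok=True)

    all_succeeded = True
    for file_name, write in writers.items():
        staged_file = staging_dir / file_name
        if staged_file.exists():
            # Resume mode: the file was fetched by a previous run.
            continue
        try:
            write(staged_file)
        except Exception:
            # Drop any partial file and keep going with the other files.
            staged_file.unlink(missing_ok=True)
            all_succeeded = False

    if not all_succeeded:
        # Keep the staged files for a later run, but expose nothing.
        return False

    target_dir.parent.mkdir(parents=True, exist_ok=True)
    if target_dir.exists():
        shutil.rmtree(target_dir)
    shutil.move(staging_dir, target_dir)
    return True
```

Each value of `writers` is a callable that writes one file to the given path and raises on failure. Files that were staged successfully survive a failed run, so a subsequent run only has to fetch the missing ones.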

## Scraping the provider website

When a provider does not distribute machine-parseable data, we can scrape its website to extract the missing data.

For example, the list of datasets may only be available as a list of links in an HTML page.

When using web scraping, fetchers should define a `website.py` module that exposes a `Website` class which encapsulates the details and knowledge about the website (URLs, data iterators, etc.) and makes high-level data available through methods:

```{code-block} python
:caption: constants.py
from typing import Final

from yarl import URL

WEBSITE_BASE_URL: Final = URL("https://abc-provider.com/data")
```

```{code-block} python
:caption: website.py
from yarl import URL

from abc_fetcher.constants import WEBSITE_BASE_URL

class Website:
    def __init__(self, *, base_url: URL | str | None = None) -> None:
        if base_url is None:
            base_url = WEBSITE_BASE_URL
        if isinstance(base_url, str):
            base_url = URL(base_url)

        self._base_url = base_url

    def build_series_url(self, series_id: str) -> URL:
        return self._base_url / f"series/{series_id}"
```

Real-world examples:

- [ons-fetcher](https://git.nomics.world/dbnomics-fetchers/ons-fetcher/-/blob/master/src/ons_fetcher/website.py) relies on web scraping to extract the category tree of datasets.

## SDMX

Providers that distribute SDMX data can be handled by sub-classing `BaseSdmxDownloader`.

This base class handles many things related to SDMX data:

- downloading global SDMX resources (e.g. dataflow, categorisation, categoryscheme)
- downloading datasets by iterating the dataflow
- extracting the last update date to avoid downloading non-updated datasets again and again
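
The last point can be pictured as a comparison between the update dates announced by the provider and those recorded during the previous run. The following is only a sketch under that assumption, with a hypothetical function name; it does not describe how `BaseSdmxDownloader` actually stores or compares dates:

```python
from collections.abc import Iterator, Mapping
from datetime import datetime


def iter_datasets_to_download(
    last_updates: Mapping[str, datetime],
    previous_updates: Mapping[str, datetime],
) -> Iterator[str]:
    """Yield the IDs of datasets that are new or updated since the previous run."""
    for dataset_id, updated_at in last_updates.items():
        previous = previous_updates.get(dataset_id)
        # Download when the dataset was never seen, or when it changed since.
        if previous is None or updated_at > previous:
            yield dataset_id


previous_run = {"BOP": datetime(2024, 1, 1), "GDP": datetime(2024, 1, 1)}
announced = {"BOP": datetime(2024, 1, 1), "GDP": datetime(2024, 2, 1)}
print(list(iter_datasets_to_download(announced, previous_run)))
```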

This base class also defines abstract methods that must be implemented in the fetcher.

TODO `SdmxApi`

Real-world examples:

- [oecd-fetcher](https://git.nomics.world/dbnomics-fetchers/oecd-fetcher/-/blob/main/src/oecd_fetcher/downloader.py)