# Fetcher design principles

## One fetcher per provider

Each fetcher should handle one specific provider, and no more. For example, there is one fetcher for [Eurostat](https://git.nomics.world/dbnomics-fetchers/eurostat-fetcher), another one for [IMF](https://git.nomics.world/dbnomics-fetchers/imf-fetcher), etc.

## Store provider data as-is

Fetchers download data from the provider infrastructure and write it to the file-system as-is.

Providers usually distribute data as:

* static files (sometimes called bulk download): XML, JSON, CSV, XLSX, sometimes archived in ZIP files
* web API, with responses being XML, JSON, etc.

File formats can be:

* machine-readable: XML, JSON, CSV
* human-readable: XLSX files using formatting, colors, etc.

## 2 stages: download and convert

A fetcher is a 2-stage process: download and convert. First it downloads data from the provider and stores it in its original format (called *source data*). Then it converts source data into the DBnomics data model and stores it as JSON, JSON-Lines or TSV files.

The goal of a fetcher is to write data in both formats. To achieve this, the fetcher must provide 2 scripts:

* `download.py`: downloads all the datasets or a subset from a provider infrastructure
* `convert.py`: converts the downloaded datasets to the DBnomics data model

## Aim for maximum data coverage

Fetchers should handle as much data as possible from their corresponding provider.

For example, if a provider ships source data through its web API in SDMX format, the fetcher should cover all the datasets by iterating over the list of datasets dynamically. As data is structured, processing one dataset costs the same as processing 10000 datasets.

That's not always easy: there are sometimes particular cases to handle, and sometimes providers ship manually-formatted Excel files that must be handled separately. In those cases, the fetcher can cover only a subset of the available datasets.

## Run frequently

Fetchers are designed to be run every day (or more).
As a consequence, the fetcher should not download all the data every time it runs, especially if the data is huge. To help target only the datasets that changed since the last fetcher execution, we can rely on release dates when providers make them available.

## Be standalone programs

Fetchers can run independently from any infrastructure: they just write data to the file-system. This allows anyone to run them without having to run the complete DBnomics infrastructure.

Fetchers are run by the DBnomics infrastructure, which takes the fetcher output and makes it available on the DBnomics website and web API.

## Don't write converted data directly

The final outcome of a fetcher is a set of datasets written to a directory.

However, we don't want the fetcher to write converted data to the file-system directly, for several reasons:

* this would be error-prone: letting fetchers write data files directly would make them responsible for producing valid data,
* this would be repetitive: the same data serialization logic would be repeated over and over, for each fetcher,
* this would be highly coupled: fetchers would only be able to write to the file-system, nothing else.

Instead, we want to give fetchers access to model classes that validate their inputs, and to a storage abstraction that writes those model classes to the file-system (or other targets). The `dbnomics-toolbox` library provides those: see the [data model](data-model.md) and [data storage](data-storage.md) pages.

## Keep past revisions

Most of the time, providers do not give access to the past revisions of data. However, it is often important to access them for [reproducibility](https://en.wikipedia.org/wiki/reproducibility), for example to run computations that were written in the past, with the data that was available at that time.

Fetchers rely on [Git](https://git-scm.com/) to handle revisions.

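
To make the idea of Git-backed revisions more concrete, the following is a minimal sketch of committing a data directory after each fetcher run, and skipping the commit when nothing changed. It assumes the `git` command-line tool is installed. The helpers `git` and `commit_revision`, as well as the committer name and email, are hypothetical and only serve as an illustration; this is not how the DBnomics infrastructure is actually implemented.

```python
import subprocess
from pathlib import Path


def git(repo_dir: Path, *args: str) -> str:
    """Run a git command inside repo_dir and return its standard output."""
    result = subprocess.run(
        # The identity below is a placeholder (assumption) so that commits
        # work even when no global Git identity is configured.
        ["git", "-c", "user.name=fetcher", "-c", "user.email=fetcher@example.org", *args],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def commit_revision(repo_dir: Path, message: str) -> bool:
    """Commit the current state of repo_dir.

    Return True if a new revision was created, False if nothing changed.
    """
    if not (repo_dir / ".git").exists():
        git(repo_dir, "init")
    git(repo_dir, "add", "--all")
    # `git status --porcelain` prints nothing when there is nothing to commit.
    if not git(repo_dir, "status", "--porcelain").strip():
        return False
    git(repo_dir, "commit", "--message", message)
    return True
```

With such a helper, each run that changes the data adds one commit, so an older state of the data can later be retrieved by checking out the corresponding commit.
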
## Avoid false revisions

Downloaded data sometimes differs slightly from one download to another, even if both downloads correspond to the same revision. For example, there can be a `prepared_at` date in an XML file, or a random URL to a CSS stylesheet in an HTML file used to bypass the browser cache.

Keeping those differences would create false revisions, so fetchers should remove them from downloaded data.

In the same spirit, source data should be reformatted in a standard way.

## Process resources

Providers distribute data in various ways. For example, here are some possible cases:

* a CSV file defining a whole dataset (1 to 1 relationship)
* an XLSX file defining many datasets (1 to many relationship)
* many XML files defining a dataset (many to 1 relationship)
* many files defining many datasets (many to many relationship)

In order to reason more easily about those different data granularities, the fetcher toolbox introduces the notion of *resource*. In the previous examples, the resource can be a file, or a group of files. Fetcher authors can choose the scope of resources, based on their understanding of provider data.

## Error handling

Errors may occur during the processing of resources. In such a case, the error should not break the entire script execution. The error should be logged and the script should move on to the next resource. The script should not fail immediately by raising an exception.

Data generated by a script is written to the target directory. In case of error, data is kept but could be corrupt or incomplete. In development, this allows the fetcher author to inspect the situation. In production, that corrupt data should be removed.

For example, a download script may fail to download a resource because the server is down or slow, or a convert script may fail to convert a resource because its data differs from what was expected.
The fetcher toolbox takes care of handling the error (logging it) and moves on to the next resource. This default behavior can be modified by using script options like `--fail-fast`, which makes the script fail by raising an exception.
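
The behavior described above can be illustrated with a minimal sketch: a loop that logs the error and continues with the next resource, unless a fail-fast flag is set. The function `process_resources`, its `fail_fast` parameter, and the `convert` callback are hypothetical names chosen for this example; they are not the actual API of the fetcher toolbox.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger("fetcher")


def process_resources(
    resource_ids: Iterable[str],
    process: Callable[[str], None],
    fail_fast: bool = False,
) -> List[str]:
    """Process each resource and return the ids of those that failed."""
    failed = []
    for resource_id in resource_ids:
        try:
            process(resource_id)
        except Exception:
            if fail_fast:
                # Mimics an option like --fail-fast: stop at the first error.
                raise
            # Default behavior: log the error with its traceback and go on.
            logger.exception("Error processing resource %r, skipping it", resource_id)
            failed.append(resource_id)
    return failed


def convert(resource_id: str) -> None:
    """Hypothetical conversion step that fails for one resource."""
    if resource_id == "B":
        raise ValueError("data is different from what was expected")


failed = process_resources(["A", "B", "C"], convert)
print(failed)  # ['B']
```

Returning the list of failed resources is one possible design choice: in production, it would let the caller remove the corrupt or incomplete output of those resources after the run.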