File-system storage adapter

The file-system adapter is a concrete implementation of the BaseStorage abstract class described in the data storage page.

The FileSystemStorage class defines concrete methods corresponding to the abstract methods of BaseStorage to load and save model classes to the file-system.

This adapter has options that are detailed below.

File-system storage session

TODO document transactions

The physical data model

Although the adapter uses human-readable file formats such as JSON, JSON Lines and TSV, its on-disk layout should be considered a black box: data should be read and written only through the storage adapter.

The data format of the files is designed to optimize reads and writes and can differ from the data model classes.

To represent the content of those files in memory, another set of model classes is defined in dbnomics_toolbox.storage.adapters.filesystem.model. These classes form the physical model, whereas the classes of dbnomics_toolbox.model form the domain model.

When its load and save methods are called, the FileSystemStorage class is responsible for both:

  • transforming objects of the domain model from and to objects of the physical model,

  • knowing the path of the files that are read and written.
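
The two responsibilities above can be sketched as follows, using the provider.json file described in the next section. This is only an illustration: ProviderMetadataSketch, ProviderJsonSketch and the helper functions are hypothetical stand-ins with a reduced set of fields, not the actual classes or methods of the toolbox.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class ProviderMetadataSketch:
    """Hypothetical stand-in for a domain model class."""

    code: str
    name: str


@dataclass(frozen=True)
class ProviderJsonSketch:
    """Hypothetical stand-in for a physical model class (the file content)."""

    code: str
    name: str

    @classmethod
    def from_domain(cls, metadata: ProviderMetadataSketch) -> "ProviderJsonSketch":
        return cls(code=metadata.code, name=metadata.name)

    def to_domain(self) -> ProviderMetadataSketch:
        return ProviderMetadataSketch(code=self.code, name=self.name)


def provider_json_path(provider_dir: Path) -> Path:
    # The adapter, not the caller, knows where the file lives.
    return provider_dir / "provider.json"


def save_provider_metadata(provider_dir: Path, metadata: ProviderMetadataSketch) -> None:
    # Domain object -> physical object -> file.
    physical = ProviderJsonSketch.from_domain(metadata)
    text = json.dumps(asdict(physical), ensure_ascii=False)
    provider_json_path(provider_dir).write_text(text, encoding="utf-8")


def load_provider_metadata(provider_dir: Path) -> ProviderMetadataSketch:
    # File -> physical object -> domain object.
    data = json.loads(provider_json_path(provider_dir).read_text(encoding="utf-8"))
    return ProviderJsonSketch(**data).to_domain()
```

Keeping the two sets of classes separate lets the file layout evolve without changing the classes that client code manipulates.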

Provider metadata

The file-system adapter stores provider metadata in a file named provider.json directly at the root of the provider directory.

The ProviderMetadata domain model class is mapped to the ProviderJson physical model class.

See also: BaseStorage.load_provider_metadata and BaseStorage.save_provider_metadata.

Category tree

The file-system adapter stores the category tree in a file named category_tree.json directly at the root of the provider directory.

The CategoryTree domain model class is mapped to the CategoryTreeJson physical model class.

See also: BaseStorage.load_category_tree and BaseStorage.save_category_tree.

Datasets and series

The file-system adapter stores each dataset in a dedicated directory named after the dataset code. For example, the dataset INSEE/IPC-2015 is stored in the directory IPC-2015 at the root level of the provider directory.

The way datasets and series are stored depends on the chosen storage variant. The variants are described in the next section.

Storage variants

The storage URI of the file-system adapter accepts a variant parameter that takes one of these values: jsonl or tsv (e.g. filesystem:insee-converted-data?variant=jsonl).

When a FileSystemStorage is instantiated, the variant is detected by looking at the files already written, if any. If detection is not possible, for example because no file has been written yet, the jsonl variant is used by default for future writes. This allows clients to open any directory with the file-system adapter without knowing in advance which variant is used.
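
The detection logic might look like the sketch below. The function name and the heuristic are assumptions made for illustration (a series.jsonl file is taken as a sign of the jsonl variant, any .tsv file as a sign of the tsv variant); the actual adapter may use different rules.

```python
from pathlib import Path

DEFAULT_VARIANT = "jsonl"


def detect_variant(storage_dir: Path) -> str:
    """Guess the variant from files already written, else use the default.

    Assumed heuristic, not the toolbox's own: look into each dataset
    directory for a series.jsonl file or for any *.tsv file.
    """
    dataset_dirs = sorted(p for p in storage_dir.iterdir() if p.is_dir())
    for dataset_dir in dataset_dirs:
        if (dataset_dir / "series.jsonl").is_file():
            return "jsonl"
        if any(dataset_dir.glob("*.tsv")):
            return "tsv"
    # Nothing written yet: fall back to the default for future writes.
    return DEFAULT_VARIANT
```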

TSV variant

The TSV variant mainly uses tab-separated values files to store series.

When using the TSV variant, the following files are created for each dataset, in a sub-directory named after the dataset code.

The file {dataset_code}/dataset.json contains dataset metadata coming from the DatasetMetadata model class, and series metadata coming from the Series model class under the series property. The TsvDatasetJson physical model class represents the contents of this file.

For each series, observations and their attributes are stored in a TSV file named after the series code. For example, the series INSEE/IPC-2015/A.IPC.SO.00.00.INDICE.ENSEMBLE.FE.SO.BRUT.2015.FALSE is stored in the file IPC-2015/A.IPC.SO.00.00.INDICE.ENSEMBLE.FE.SO.BRUT.2015.FALSE.tsv.

A simple TSV file looks like this:

PERIOD	VALUE
2000	19
2001	NA
2002	22

Observation attributes can be stored in additional columns:

PERIOD	VALUE	OBS_STATUS
2000	19
2001	NA
2002	22	E
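
For illustration only (such files should normally be read through the storage adapter), the layout above can be parsed with the standard csv module. The function name is hypothetical, and treating a missing trailing cell as an empty attribute is an assumption based on the sample.

```python
import csv
import io

# Same content as the sample above, with explicit tab characters.
TSV_SAMPLE = "PERIOD\tVALUE\tOBS_STATUS\n2000\t19\n2001\tNA\n2002\t22\tE\n"


def read_observations(tsv_text: str) -> list[dict[str, str]]:
    """Return one dict per observation, keyed by the header columns."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    observations = []
    for cells in reader:
        # Pad short rows so that missing trailing attributes become "".
        padded = cells + [""] * (len(header) - len(cells))
        observations.append(dict(zip(header, padded)))
    return observations


observations = read_observations(TSV_SAMPLE)
```

Values are kept as strings here, so the "NA" marker is preserved as-is rather than interpreted.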

Historically, the TSV variant was the only variant. Because it uses one TSV file per series, some datasets produce a huge number of files. Since any file occupies at least one file-system block (e.g. 4 KB), this is wasteful for datasets with a huge number of small series: the disk quickly filled up with millions of small files.

Also, because each file is named after its series code, file names can be too long for the runtime environment (file system or operating system).

JSON Lines variant

The JSON Lines variant mainly uses JSON Lines files to store series.

This variant was introduced to circumvent the issues and limitations of the TSV variant.

The file {dataset_code}/dataset.json contains dataset metadata coming from the DatasetMetadata model class. The JsonLinesDatasetJson physical model class represents the contents of this file.

The file {dataset_code}/series.jsonl contains all the series of the dataset, including metadata and observations coming from the Series model class. The JsonLinesSeriesItem physical model class represents each line of this file.

Example of series.jsonl (only a minimal sample is shown here):

{"code":"M.LB.B.TTP.SA","dimensions":{"frequency":"M","seasonally_adjusted":"SA","sex":"B","subject":"LB","unit":"TTP"},"observations":[["PERIOD","VALUE"],["1953-01",4122],["1953-02",4001],["1953-03",4008]]}
{"code":"M.NILF.F.TTP.NSA","dimensions":{"frequency":"M","seasonally_adjusted":"NSA","sex":"F","subject":"NILF","unit":"TTP"},"observations":[["PERIOD","VALUE"],["1972-01","NA"],["1972-02","NA"],["1972-03","NA"],["1972-04","NA"]]}
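
As an illustration of this layout (again, real code should go through the storage adapter), each line can be decoded independently with the standard json module. The sample is shortened and the function name is hypothetical; the sketch assumes, as in the example above, that the first row of observations is the header.

```python
import json

# Shortened sample following the structure shown above.
SERIES_JSONL_SAMPLE = (
    '{"code":"M.LB.B.TTP.SA","observations":'
    '[["PERIOD","VALUE"],["1953-01",4122],["1953-02",4001]]}\n'
    '{"code":"M.NILF.F.TTP.NSA","observations":'
    '[["PERIOD","VALUE"],["1972-01","NA"]]}\n'
)


def iter_series(jsonl_text: str):
    """Yield (series_code, observations) pairs, one per non-empty line."""
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        item = json.loads(line)
        # The first row is the header; the remaining rows are observations.
        header, *rows = item["observations"]
        yield item["code"], [dict(zip(header, row)) for row in rows]


series_by_code = dict(iter_series(SERIES_JSONL_SAMPLE))
```

Because every series is a single line, a whole dataset can be streamed from one file instead of opening one file per series.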

Single provider or multiple providers

By design, the BaseStorage interface is able to load and save data belonging to several providers:

series1 = storage.load_series("INSEE/IPC-2015/A.IPC.SO.00.00.INDICE.ENSEMBLE.FE.SO.BRUT.2015.FALSE")
series2 = storage.load_series("IMF/CPI/A.AF.PCPIHO_IX")

However, also by design, each fetcher is dedicated to a single provider.

The file-system adapter reads and writes data from/to a directory, which can be used in two different modes, according to the single_provider boolean parameter of the storage URI.

With single_provider=false (the default), the adapter works in multiple provider mode. The path of the storage URI will be used as a base directory containing one top-level directory per provider.

With single_provider=true, the adapter works in single provider mode. The path of the storage URI will be used as a directory containing data of a single provider only (the same as the top-level directories of the multiple provider mode).

Even in single provider mode, provider codes must still be given to the methods of BaseStorage in order to respect the common interface.
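
To make the role of the URI parameters concrete, here is a minimal sketch that splits such a URI with the standard library. It is not the toolbox's StorageUri parser: the function name, the returned dictionary and the exact handling of the boolean value are assumptions.

```python
from urllib.parse import parse_qs, urlsplit


def parse_filesystem_uri(uri: str) -> dict:
    """Split a filesystem storage URI into its path and parameters."""
    parts = urlsplit(uri)
    if parts.scheme != "filesystem":
        raise ValueError(f"Unsupported scheme: {parts.scheme!r}")
    # Keep the last value if a parameter is repeated.
    params = {key: values[-1] for key, values in parse_qs(parts.query).items()}
    return {
        "path": parts.path,
        # Multiple provider mode is the default.
        "single_provider": params.get("single_provider", "false") == "true",
        # None means "not given", leaving room for detection.
        "variant": params.get("variant"),
    }


single = parse_filesystem_uri("filesystem:single_provider_data?single_provider=true")
multiple = parse_filesystem_uri("filesystem:multiple_provider_data")
```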

Example for multiple provider mode:

from dbnomics_toolbox.model.provider_metadata import ProviderMetadata
from dbnomics_toolbox.storage.storage import BaseStorage
from dbnomics_toolbox.storage.storage_uri import StorageUri

multiple_provider_storage = BaseStorage.from_uri(StorageUri.parse("filesystem:multiple_provider_data"))
multiple_provider_storage.save_provider_metadata(
    ProviderMetadata.create(
        code="Eurostat",
        name="Eurostat",
        # Cf https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2
        region="EU",
        terms_of_use="https://ec.europa.eu/eurostat/web/main/help/copyright-notice",
        website="https://ec.europa.eu/eurostat",
    )
)
multiple_provider_storage.save_provider_metadata(
    ProviderMetadata.create(
        code="INSEE",
        name="Institut national de la statistique et des études économiques",
        # Cf https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2
        region="FR",
        terms_of_use="https://www.insee.fr/fr/information/2381863",
        website="https://www.insee.fr/",
    )
)

$ tree multiple_provider_data
multiple_provider_data
├── eurostat-json-data
│   └── provider.json
└── insee-json-data
    └── provider.json

Example for single provider mode:

from dbnomics_toolbox.model.provider_metadata import ProviderMetadata
from dbnomics_toolbox.storage.storage import BaseStorage
from dbnomics_toolbox.storage.storage_uri import StorageUri

single_provider_storage = BaseStorage.from_uri(StorageUri.parse("filesystem:single_provider_data?single_provider=true"))
single_provider_storage.save_provider_metadata(
    ProviderMetadata.create(
        code="INSEE",
        name="Institut national de la statistique et des études économiques",
        # Cf https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2
        region="FR",
        terms_of_use="https://www.insee.fr/fr/information/2381863",
        website="https://www.insee.fr/",
    )
)

$ tree single_provider_data
single_provider_data
└── provider.json

In a fetcher, the convert CLI expects a directory or a storage URI as its second command-line argument. If it’s a directory, it is turned into a storage URI like filesystem:{directory}?single_provider=true.
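
The conversion described above amounts to something like the following sketch. The function name is hypothetical, and recognizing a storage URI by its "filesystem:" prefix is an assumption; the CLI may distinguish the two cases differently.

```python
def to_storage_uri(target: str) -> str:
    """Return a storage URI for a CLI argument that is a directory or a URI."""
    # Assumption: a value with this prefix is already a storage URI.
    if target.startswith("filesystem:"):
        return target
    # A plain directory is opened in single provider mode.
    return f"filesystem:{target}?single_provider=true"


uri_from_directory = to_storage_uri("insee-json-data")
uri_unchanged = to_storage_uri("filesystem:multiple_provider_data")
```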