# File-system storage adapter

The file-system adapter is a concrete implementation of the [`BaseStorage`](BaseStorage) abstract class described in the [data storage](data-storage.md#base-storage-interface) page.

The [`FileSystemStorage`](FileSystemStorage) class defines concrete methods corresponding to the abstract methods of [`BaseStorage`](BaseStorage) to load and save model classes to the file-system.

This adapter has options that are detailed below.

## File-system storage session

TODO document transactions

## The physical data model

Although the adapter uses human-readable file formats like JSON, JSON Lines and TSV, its storage should be considered a black box: data should be read and written through the storage adapter only.

The data format of the files is designed to optimize reads and writes and can differ from the [data model classes](data-model.md).

To represent the content of those files in memory, another set of model classes is defined in [`dbnomics_toolbox.storage.adapters.filesystem.model`](dbnomics_toolbox.storage.adapters.filesystem.model).

These classes are referred to as the *physical model*, whereas the classes of [`dbnomics_toolbox.model`](dbnomics_toolbox.model) are referred to as the *domain model*.

When its load and save methods are called, the [`FileSystemStorage`](FileSystemStorage) class is responsible for both:

- transforming objects of the domain model from and to objects of the physical model,
- knowing the path of the files that are read and written.

### Provider metadata

The file-system adapter stores provider metadata in a file named `provider.json` directly at the root of the provider directory.

The [`ProviderMetadata`](ProviderMetadata) domain model class is mapped to the [`ProviderJson`](ProviderJson) physical model class.

See also: [`BaseStorage.load_provider_metadata`](BaseStorage.load_provider_metadata) and [`BaseStorage.save_provider_metadata`](BaseStorage.save_provider_metadata).

### Category tree

The file-system adapter stores the category tree in a file named `category_tree.json` directly at the root of the provider directory.

The [`CategoryTree`](CategoryTree) domain model class is mapped to the [`CategoryTreeJson`](CategoryTreeJson) physical model class.

See also: [`BaseStorage.load_category_tree`](BaseStorage.load_category_tree) and [`BaseStorage.save_category_tree`](BaseStorage.save_category_tree).

### Datasets and series

The file-system adapter stores each dataset in a dedicated directory named after the dataset code. For example, the dataset `INSEE/IPC-2015` is stored in the directory `IPC-2015` at the root level of the provider directory.

The way datasets and series are stored depends on the chosen storage variant. The variants are described in the next sections.

#### Storage variants

The storage URI of the file-system adapter accepts a `variant` parameter that takes one of these values: `jsonl` or `tsv` (e.g. `filesystem:insee-converted-data?variant=jsonl`).

When instantiating a [`FileSystemStorage`](FileSystemStorage), the variant is detected by looking at the already written files, if any. If detection is not possible, for example because no file has been written yet, the `jsonl` variant is used by default for future writes.

This allows clients to open any directory with the file-system adapter without knowing in advance which variant is used.

#### TSV variant

The TSV variant mainly uses [tab-separated values files](https://en.wikipedia.org/wiki/Tab-separated_values) to store series.

When using the TSV variant, the following files are created for each dataset, in a sub-directory named after the dataset code.

The file `{dataset_code}/dataset.json` contains dataset metadata coming from the [`DatasetMetadata`](DatasetMetadata) model class, and series metadata coming from the [`Series`](Series) model class under the `series` property.
The [`TsvDatasetJson`](TsvDatasetJson) physical model class represents the contents of this file.

For each series, observations and their attributes are stored in a TSV file named after the series code. For example, the series `INSEE/IPC-2015/A.IPC.SO.00.00.INDICE.ENSEMBLE.FE.SO.BRUT.2015.FALSE` is stored in the file `IPC-2015/A.IPC.SO.00.00.INDICE.ENSEMBLE.FE.SO.BRUT.2015.FALSE.tsv`.

A simple TSV file looks like this:

```tsv
PERIOD	VALUE
2000	19
2001	NA
2002	22
```

Observation attributes can be stored in additional columns:

```tsv
PERIOD	VALUE	OBS_STATUS
2000	19
2001	NA
2002	22	E
```

Historically the TSV variant was the only one. Because it uses one TSV file per series, it can produce a huge number of files for some datasets. On a file-system, every file occupies at least one block (e.g. 4 KB), which is wasteful for datasets made of a huge number of small series. In practice, the disk filled up quickly because of those millions of small files.

In addition, because each file is named after its series code, file names can be too long for the runtime environment (i.e. file-system, operating system).

#### JSON Lines variant

The JSON Lines variant mainly uses [JSON Lines files](https://jsonlines.org/) to store series.

This variant was introduced to work around the issues and limitations of the TSV variant.

The file `{dataset_code}/dataset.json` contains dataset metadata coming from the [`DatasetMetadata`](DatasetMetadata) model class. The [`JsonLinesDatasetJson`](JsonLinesDatasetJson) physical model class represents the contents of this file.

The file `{dataset_code}/series.jsonl` contains all the series of the dataset, including metadata and observations coming from the [`Series`](Series) model class. The [`JsonLinesSeriesItem`](JsonLinesSeriesItem) physical model class represents each line of this file.

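
To make the layout of one line concrete, here is a minimal sketch that parses a single line of `series.jsonl` with the standard `json` module. It is for illustration only: real code should go through the storage adapter, as stated earlier. The function name `parse_series_line` is hypothetical, the field names are taken from the sample shown in this section, and treating the string `"NA"` as a missing value is an assumption.

```python
import json

# First line of the sample `series.jsonl` shown in this section.
SAMPLE_LINE = (
    '{"code":"M.LB.B.TTP.SA",'
    '"dimensions":{"frequency":"M","seasonally_adjusted":"SA","sex":"B","subject":"LB","unit":"TTP"},'
    '"observations":[["PERIOD","VALUE"],["1953-01",4122],["1953-02",4001],["1953-03",4008]]}'
)


def parse_series_line(line: str) -> tuple[str, list[dict]]:
    """Return the series code and its observations as a list of dicts.

    Hypothetical helper, not part of dbnomics_toolbox.
    """
    item = json.loads(line)
    # The first row of "observations" is a header, the following rows are values.
    header, *rows = item["observations"]
    observations = [dict(zip(header, row)) for row in rows]
    for observation in observations:
        # Assumption: the string "NA" marks a missing value.
        if observation.get("VALUE") == "NA":
            observation["VALUE"] = None
    return item["code"], observations


code, observations = parse_series_line(SAMPLE_LINE)
print(code)  # M.LB.B.TTP.SA
print(observations[0])  # {'PERIOD': '1953-01', 'VALUE': 4122}
```
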
Example of `series.jsonl` (only a minimal sample is shown here):

```jsonl
{"code":"M.LB.B.TTP.SA","dimensions":{"frequency":"M","seasonally_adjusted":"SA","sex":"B","subject":"LB","unit":"TTP"},"observations":[["PERIOD","VALUE"],["1953-01",4122],["1953-02",4001],["1953-03",4008]]}
{"code":"M.NILF.F.TTP.NSA","dimensions":{"frequency":"M","seasonally_adjusted":"NSA","sex":"F","subject":"NILF","unit":"TTP"},"observations":[["PERIOD","VALUE"],["1972-01","NA"],["1972-02","NA"],["1972-03","NA"],["1972-04","NA"]]}
```

## Single provider or multiple providers

By design, the [`BaseStorage`](BaseStorage) interface is able to load and save data belonging to several providers:

```python
# `storage` is an existing BaseStorage instance.
series1 = storage.load_series("INSEE/IPC-2015/A.IPC.SO.00.00.INDICE.ENSEMBLE.FE.SO.BRUT.2015.FALSE")
series2 = storage.load_series("IMF/CPI/A.AF.PCPIHO_IX")
```

However, also by design, fetchers are dedicated to a single provider.

The file-system adapter reads and writes data from and to a directory, which can be used in two different modes, according to the `single_provider` boolean parameter of the storage URI.

With `single_provider=false` (the default), the adapter works in multiple provider mode. The path of the storage URI is used as a base directory containing one top-level directory per provider.

With `single_provider=true`, the adapter works in single provider mode. The path of the storage URI is used as a directory containing the data of a single provider only (the same content as one top-level directory of the multiple provider mode).

Even in single provider mode, provider codes must be given to the methods of [`BaseStorage`](BaseStorage) in order to respect the common interface.

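
The difference between the two modes can be sketched as a path computation. This is a simplified illustration, not the adapter's actual code: the function `resolve_provider_dir` and its parameters are hypothetical, and the name of the provider directory is passed in explicitly because this page does not specify how it is derived from the provider code.

```python
from pathlib import Path


def resolve_provider_dir(base_dir: Path, provider_dir_name: str, *, single_provider: bool = False) -> Path:
    """Return the directory holding the files of one provider.

    Hypothetical helper, not part of dbnomics_toolbox.
    """
    if single_provider:
        # Single provider mode: the base directory is the provider directory.
        return base_dir
    # Multiple provider mode: one top-level directory per provider.
    return base_dir / provider_dir_name


print(resolve_provider_dir(Path("multiple_provider_data"), "insee-json-data") / "provider.json")
print(resolve_provider_dir(Path("single_provider_data"), "insee-json-data", single_provider=True) / "provider.json")
```

In both calls the provider directory name is still supplied, which mirrors the requirement that provider codes are given even in single provider mode.
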
Example for multiple provider mode:

```python
from dbnomics_toolbox.model.provider_metadata import ProviderMetadata
from dbnomics_toolbox.storage.storage import BaseStorage
from dbnomics_toolbox.storage.storage_uri import StorageUri

multiple_provider_storage = BaseStorage.from_uri(StorageUri.parse("filesystem:multiple_provider_data"))

multiple_provider_storage.save_provider_metadata(
    ProviderMetadata.create(
        code="Eurostat",
        name="Eurostat",
        # Cf https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2
        region="EU",
        terms_of_use="https://ec.europa.eu/eurostat/web/main/help/copyright-notice",
        website="https://ec.europa.eu/eurostat",
    )
)

multiple_provider_storage.save_provider_metadata(
    ProviderMetadata.create(
        code="INSEE",
        name="Institut national de la statistique et des études économiques",
        # Cf https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2
        region="FR",
        terms_of_use="https://www.insee.fr/fr/information/2381863",
        website="https://www.insee.fr/",
    )
)
```

```bash
$ tree multiple_provider_data
multiple_provider_data
├── eurostat-json-data
│   └── provider.json
└── insee-json-data
    └── provider.json
```

Example for single provider mode:

```python
from dbnomics_toolbox.model.provider_metadata import ProviderMetadata
from dbnomics_toolbox.storage.storage import BaseStorage
from dbnomics_toolbox.storage.storage_uri import StorageUri

single_provider_storage = BaseStorage.from_uri(StorageUri.parse("filesystem:single_provider_data?single_provider=true"))

single_provider_storage.save_provider_metadata(
    ProviderMetadata.create(
        code="INSEE",
        name="Institut national de la statistique et des études économiques",
        # Cf https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2
        region="FR",
        terms_of_use="https://www.insee.fr/fr/information/2381863",
        website="https://www.insee.fr/",
    )
)
```

```bash
$ tree single_provider_data
single_provider_data
└── provider.json
```

In a fetcher, the convert CLI expects a directory or a storage URI as its second command-line argument.
If it's a directory, it is turned into a storage URI like `filesystem:{directory}?single_provider=true`.
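
The conversion of a directory into a storage URI can be sketched as follows. This is only an illustration of the rule above: the function `to_storage_uri` is hypothetical, and recognizing an existing storage URI by its `filesystem:` prefix is an assumption, since this page does not say how the CLI tells the two kinds of argument apart.

```python
def to_storage_uri(target: str) -> str:
    """Turn a CLI argument into a storage URI string.

    Hypothetical helper, not part of dbnomics_toolbox.
    """
    # Assumption: an argument starting with "filesystem:" is already a storage URI.
    if target.startswith("filesystem:"):
        return target
    # Otherwise treat the argument as a directory used in single provider mode.
    return f"filesystem:{target}?single_provider=true"


print(to_storage_uri("insee-json-data"))  # filesystem:insee-json-data?single_provider=true
```
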