# File-system storage adapter

The file-system adapter is a concrete implementation of the [`BaseStorage`](BaseStorage) abstract class described in the [data storage](data-storage.md#base-storage-interface) page.

The [`FileSystemStorage`](FileSystemStorage) class defines concrete methods corresponding to the abstract methods of [`BaseStorage`](BaseStorage) to load and save model classes to the file-system.

This adapter has options that are detailed below.

## File-system storage session

TODO document transactions

## The physical data model

Although the adapter uses human-readable file formats such as JSON, JSON Lines and TSV, the stored files should be treated as a black box: data should only be read and written through the storage adapter.

The data format of the files is designed to optimize reads and writes and can differ from the [data model classes](data-model.md).

To represent the content of those files in-memory, another set of model classes is defined in [`dbnomics_toolbox.storage.adapters.filesystem.model`](dbnomics_toolbox.storage.adapters.filesystem.model).
These classes form the *physical model*, whereas the classes of [`dbnomics_toolbox.model`](dbnomics_toolbox.model) form the *domain model*.

When calling its load and save methods, the [`FileSystemStorage`](FileSystemStorage) class is responsible for both:

- transforming objects of the domain model from and to objects of the physical model,
- knowing the path of the files that are read and written.

### Provider metadata

The file-system adapter stores provider metadata in a file named `provider.json` directly at the root of the provider directory.

The [`ProviderMetadata`](ProviderMetadata) domain model class is mapped to the [`ProviderJson`](ProviderJson) physical model class.

See also: [`BaseStorage.load_provider_metadata`](BaseStorage.load_provider_metadata) and [`BaseStorage.save_provider_metadata`](BaseStorage.save_provider_metadata).

### Category tree

The file-system adapter stores the category tree in a file named `category_tree.json` directly at the root of the provider directory.

The [`CategoryTree`](CategoryTree) domain model class is mapped to the [`CategoryTreeJson`](CategoryTreeJson) physical model class.

See also: [`BaseStorage.load_category_tree`](BaseStorage.load_category_tree) and [`BaseStorage.save_category_tree`](BaseStorage.save_category_tree).

### Datasets and series

The file-system adapter stores each dataset in a dedicated directory named after the dataset code.
For example, the dataset `INSEE/IPC-2015` is stored in the directory `IPC-2015` at the root level of the provider directory.

How datasets and series are stored depends on the chosen storage variant. The variants are described in the next sections.

#### Storage variants

The storage URI of the file-system adapter accepts a `variant` parameter that takes one of these values: `jsonl` or `tsv` (e.g. `filesystem:insee-converted-data?variant=jsonl`).
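
To illustrate the shape of such a URI, the example can be split into its parts with Python's standard `urllib.parse` module. This is only a sketch: `split_storage_uri` is a hypothetical helper, and it does not claim to reflect how `StorageUri.parse` is implemented.

```python
from urllib.parse import parse_qs, urlsplit


def split_storage_uri(uri: str) -> tuple[str, str, dict[str, str]]:
    """Split a storage URI into (scheme, path, parameters).

    Hypothetical helper for illustration, not part of the toolbox.
    """
    parts = urlsplit(uri)
    # parse_qs returns a list of values per key; keep the last one of each.
    params = {key: values[-1] for key, values in parse_qs(parts.query).items()}
    return parts.scheme, parts.path, params


scheme, path, params = split_storage_uri("filesystem:insee-converted-data?variant=jsonl")
print(scheme, path, params)
```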

When instantiating a [`FileSystemStorage`](FileSystemStorage), the variant is detected by looking at the files already written, if any.
If detection is not possible, for example because no file has been written yet, the `jsonl` variant is used by default for future writes.
This allows clients to open any directory with the file-system adapter without knowing in advance which variant is used.
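
The detection logic could look roughly like the sketch below. It is an assumption, not the adapter's actual code: `detect_variant` is a hypothetical function, and it guesses the variant from the file names described in the following sections (`series.jsonl` for the JSON Lines variant, `*.tsv` files for the TSV variant).

```python
from pathlib import Path


def detect_variant(provider_dir: Path, default: str = "jsonl") -> str:
    """Guess the storage variant from files already present.

    Hypothetical helper: the real detection rules may differ.
    """
    if not provider_dir.is_dir():
        # Nothing written yet: fall back to the default variant.
        return default
    for dataset_dir in sorted(p for p in provider_dir.iterdir() if p.is_dir()):
        if (dataset_dir / "series.jsonl").is_file():
            return "jsonl"
        if any(dataset_dir.glob("*.tsv")):
            return "tsv"
    # No recognizable series file was found in any dataset directory.
    return default
```

An explicit `variant` parameter in the storage URI would make this kind of guessing unnecessary for writes to an empty directory.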

#### TSV variant

The TSV variant mainly uses [tab-separated values files](https://en.wikipedia.org/wiki/Tab-separated_values) to store series.

When using the TSV variant, the following files are created for each dataset, in a sub-directory named after the dataset code.

The file `{dataset_code}/dataset.json` contains dataset metadata coming from the [`DatasetMetadata`](DatasetMetadata) model class, and series metadata coming from the [`Series`](Series) model class under the `series` property.
The [`TsvDatasetJson`](TsvDatasetJson) physical model class represents the contents of this file.

For each series, observations and their attributes are stored in a TSV file named after the series code.
For example, the series `INSEE/IPC-2015/A.IPC.SO.00.00.INDICE.ENSEMBLE.FE.SO.BRUT.2015.FALSE` is stored in the file `IPC-2015/A.IPC.SO.00.00.INDICE.ENSEMBLE.FE.SO.BRUT.2015.FALSE.tsv`.

A simple TSV file looks like this:

```tsv
PERIOD	VALUE
2000	19
2001	NA
2002	22
```

Observation attributes can be stored in additional columns:

```tsv
PERIOD	VALUE	OBS_STATUS
2000	19
2001	NA
2002	22	E
```

Historically, the TSV variant was the only variant. Because it uses one TSV file per series, some datasets produce a huge number of files.
Since every file occupies at least one file-system block (e.g. 4 KiB), this wastes space for datasets that have a huge number of small series.
In practice, disks quickly filled up with millions of small files.
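
The overhead can be estimated with a little arithmetic. The figures below (one million series, 200 bytes per file, 4096-byte blocks) are hypothetical and only illustrate the order of magnitude, under the assumption stated above that every file takes at least one whole block.

```python
import math


def allocated_bytes(file_size: int, block_size: int = 4096) -> int:
    """Disk space taken by one file, assuming at least one whole block per file."""
    return max(1, math.ceil(file_size / block_size)) * block_size


n_series = 1_000_000  # hypothetical dataset with one million series
file_size = 200  # hypothetical size in bytes of each small TSV file

total_payload = n_series * file_size
total_allocated = n_series * allocated_bytes(file_size)
print(total_payload, total_allocated, total_allocated / total_payload)
```

With these numbers, 200 MB of actual data occupies about 4.1 GB on disk, roughly twenty times more.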

In addition, because each file is named after the series code, file names can become too long for the runtime environment (i.e. file-system, operating system).

#### JSON Lines variant

The JSON Lines variant mainly uses [JSON Lines files](https://jsonlines.org/) to store series.

This variant was introduced to circumvent the issues and limitations of the TSV variant.

The file `{dataset_code}/dataset.json` contains dataset metadata coming from the [`DatasetMetadata`](DatasetMetadata) model class.
The [`JsonLinesDatasetJson`](JsonLinesDatasetJson) physical model class represents the contents of this file.

The file `{dataset_code}/series.jsonl` contains all the series of the dataset, including metadata and observations coming from the [`Series`](Series) model class.
The [`JsonLinesSeriesItem`](JsonLinesSeriesItem) physical model class represents each line of this file.

Example of `series.jsonl` (only a minimal sample is shown here):

```jsonl
{"code":"M.LB.B.TTP.SA","dimensions":{"frequency":"M","seasonally_adjusted":"SA","sex":"B","subject":"LB","unit":"TTP"},"observations":[["PERIOD","VALUE"],["1953-01",4122],["1953-02",4001],["1953-03",4008]]}
{"code":"M.NILF.F.TTP.NSA","dimensions":{"frequency":"M","seasonally_adjusted":"NSA","sex":"F","subject":"NILF","unit":"TTP"},"observations":[["PERIOD","VALUE"],["1972-01","NA"],["1972-02","NA"],["1972-03","NA"],["1972-04","NA"]]}
```
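
Each line of `series.jsonl` is an independent JSON document, so the file can be read line by line without loading the whole dataset. The sketch below uses only the standard `json` module on a shortened sample. `iter_series` and `observations_as_dicts` are hypothetical helpers, and the assumption that the first row of `observations` is a header comes from the sample above.

```python
import json
from collections.abc import Iterator

# Shortened sample with the same structure as the example above.
SERIES_JSONL = (
    '{"code":"M.LB.B.TTP.SA","observations":[["PERIOD","VALUE"],["1953-01",4122],["1953-02",4001]]}\n'
    '{"code":"M.NILF.F.TTP.NSA","observations":[["PERIOD","VALUE"],["1972-01","NA"]]}\n'
)


def iter_series(text: str) -> Iterator[dict]:
    """Yield one decoded series per non-empty line."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


def observations_as_dicts(series: dict) -> list[dict]:
    """Pair each observation row with the header row that precedes it."""
    header, *rows = series["observations"]
    return [dict(zip(header, row)) for row in rows]


all_series = list(iter_series(SERIES_JSONL))
for series in all_series:
    print(series["code"], observations_as_dicts(series))
```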

## Single provider or multiple providers

By design, the [`BaseStorage`](BaseStorage) interface is able to load and save data belonging to several providers:

```python
series1 = storage.load_series("INSEE/IPC-2015/A.IPC.SO.00.00.INDICE.ENSEMBLE.FE.SO.BRUT.2015.FALSE")
series2 = storage.load_series("IMF/CPI/A.AF.PCPIHO_IX")
```

However, also by design, each fetcher is dedicated to a single provider.

The file-system adapter reads and writes data from/to a directory, which can be used in two modes, depending on the `single_provider` boolean parameter of the storage URI.

With `single_provider=false` (the default), the adapter works in multiple provider mode.
The path of the storage URI will be used as a base directory containing one top-level directory per provider.

With `single_provider=true`, the adapter works in single provider mode.
The path of the storage URI will be used as a directory containing data of a single provider only (the same as the top-level directories of the multiple provider mode).

Even in single provider mode, provider codes must still be passed to the methods of [`BaseStorage`](BaseStorage) in order to respect the common interface.

Example for multiple provider mode:

```python
from dbnomics_toolbox.model.provider_metadata import ProviderMetadata
from dbnomics_toolbox.storage.storage import BaseStorage
from dbnomics_toolbox.storage.storage_uri import StorageUri

multiple_provider_storage = BaseStorage.from_uri(StorageUri.parse("filesystem:multiple_provider_data"))
multiple_provider_storage.save_provider_metadata(
    ProviderMetadata.create(
        code="Eurostat",
        name="Eurostat",
        # Cf https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2
        region="EU",
        terms_of_use="https://ec.europa.eu/eurostat/web/main/help/copyright-notice",
        website="https://ec.europa.eu/eurostat",
    )
)
multiple_provider_storage.save_provider_metadata(
    ProviderMetadata.create(
        code="INSEE",
        name="Institut national de la statistique et des études économiques",
        # Cf https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2
        region="FR",
        terms_of_use="https://www.insee.fr/fr/information/2381863",
        website="https://www.insee.fr/",
    )
)
```

```bash
$ tree multiple_provider_data
multiple_provider_data
├── eurostat-json-data
│   └── provider.json
└── insee-json-data
    └── provider.json
```

Example for single provider mode:

```python
from dbnomics_toolbox.model.provider_metadata import ProviderMetadata
from dbnomics_toolbox.storage.storage import BaseStorage
from dbnomics_toolbox.storage.storage_uri import StorageUri

single_provider_storage = BaseStorage.from_uri(StorageUri.parse("filesystem:single_provider_data?single_provider=true"))
single_provider_storage.save_provider_metadata(
    ProviderMetadata.create(
        code="INSEE",
        name="Institut national de la statistique et des études économiques",
        # Cf https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2
        region="FR",
        terms_of_use="https://www.insee.fr/fr/information/2381863",
        website="https://www.insee.fr/",
    )
)
```

```bash
$ tree single_provider_data
single_provider_data
└── provider.json
```
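
The difference between the two layouts can be summarized by the sketch below. It is not the adapter's code: `provider_dir` is a hypothetical function, and the `{code}-json-data` naming rule for multiple provider mode is only inferred from the `tree` output shown above.

```python
from pathlib import Path


def provider_dir(base_dir: Path, provider_code: str, *, single_provider: bool) -> Path:
    """Return the directory holding the files of one provider (hypothetical helper)."""
    if single_provider:
        # The base directory itself holds the provider's files; the code is
        # still part of the signature to mirror the common interface.
        return base_dir
    # Assumption: directory naming inferred from the tree listing above.
    return base_dir / f"{provider_code.lower()}-json-data"


print(provider_dir(Path("multiple_provider_data"), "INSEE", single_provider=False) / "provider.json")
print(provider_dir(Path("single_provider_data"), "INSEE", single_provider=True) / "provider.json")
```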

In a fetcher, the convert CLI expects a directory or a storage URI as its second command-line argument.
If it's a directory, it is turned into a storage URI like `filesystem:{directory}?single_provider=true`.
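
The conversion described above can be sketched as follows. `to_storage_uri` is a hypothetical helper: how the real CLI tells a directory from a storage URI is not documented here, so the `filesystem:` prefix check is an assumption.

```python
def to_storage_uri(arg: str) -> str:
    """Turn a CLI argument into a storage URI (hypothetical helper)."""
    # Assumption: an argument starting with "filesystem:" is already a URI.
    if arg.startswith("filesystem:"):
        return arg
    # Otherwise treat it as a directory holding the data of a single provider.
    return f"filesystem:{arg}?single_provider=true"


print(to_storage_uri("insee-json-data"))
print(to_storage_uri("filesystem:insee-json-data?variant=tsv"))
```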