File resources

FileResource is an abstract class that provides base features for downloading files, but does not define the _download method itself.

The HttpResource class, described in the HTTP resources section of the documentation, is a concrete implementation of the FileResource class.

Resource flow

The FileResource._start method first calls the FileResource._download method, retrying as many times as the retry strategy allows.

The FileResource._download method is responsible for writing the target file.

After the file is downloaded, the FileResource._start method calls the FileResource._post_process_target_file method, which in turn calls the FileResource._validate_mimetype and FileResource._reformat_file methods. See the following sections for more information about them.
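The flow described above can be sketched roughly as follows. This is an illustrative outline only, not the library's code: the SketchFileResource and FlakyResource classes, the max_attempts parameter, and the choice to retry on OSError are all assumptions standing in for the real retry strategy and signatures.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class SketchFileResource(ABC):
    """Hypothetical stand-in for FileResource; names and signatures are assumed."""

    def __init__(self, target_file: Path, max_attempts: int = 3) -> None:
        self.target_file = target_file
        self.max_attempts = max_attempts  # assumed, simplified retry strategy

    def _start(self) -> None:
        # Call _download, retrying on failure until the attempts run out.
        for attempt in range(1, self.max_attempts + 1):
            try:
                self._download()
                break
            except OSError:
                if attempt == self.max_attempts:
                    raise
        # Post-processing only runs once the target file has been written.
        self._post_process_target_file()

    @abstractmethod
    def _download(self) -> None:
        """Write the target file; concrete subclasses must implement this."""

    def _post_process_target_file(self) -> None:
        self._validate_mimetype()
        self._reformat_file()

    def _validate_mimetype(self) -> None:
        pass  # placeholder: see the MIME type validation section

    def _reformat_file(self) -> None:
        pass  # placeholder: see the Reformat files section


class FlakyResource(SketchFileResource):
    """Toy subclass whose first download attempt fails."""

    def __init__(self, target_file: Path) -> None:
        super().__init__(target_file)
        self.calls = 0

    def _download(self) -> None:
        self.calls += 1
        if self.calls == 1:
            raise OSError("simulated network error")
        self.target_file.write_text("{}")
```

In this sketch, a concrete subclass only has to implement _download; retrying and post-processing are inherited from the base class.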

MIME type validation

A common pitfall when downloading files is that the server responds with something other than the expected content, the most well-known example being the “404 not found” web page.

By default, after a file has been downloaded, the FileResource class validates that its actual MIME type matches the one expected from its file name.

For example, a file named catalog.json will be expected to have a MIME type of application/json, and a file named data.csv will be expected to have a MIME type of text/csv.

The FileResource._validate_mimetype method calls the validate_mimetype function, which makes use of the mimetypes.guess_type function of the Python standard library.
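To illustrate how the expected MIME type can be derived from a file name, here is a small example that uses mimetypes.guess_type directly. The expected_mimetype helper is a hypothetical name introduced for this example and is not part of the library.

```python
import mimetypes
from typing import Optional


def expected_mimetype(file_name: str) -> Optional[str]:
    """Return the MIME type guessed from the file name, or None if unknown."""
    # guess_type returns a (type, encoding) tuple; only the type matters here.
    mime_type, _encoding = mimetypes.guess_type(file_name)
    return mime_type


print(expected_mimetype("catalog.json"))
print(expected_mimetype("data.csv"))
print(expected_mimetype("README"))  # no extension, so nothing can be guessed
```

A file name without a recognized extension yields None, which corresponds to the situation where the MIME type cannot be guessed.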

If the MIME type could not be guessed based on the file name, the MimeTypeNotGuessed exception is raised. In that case it is still possible to pass the accept_mimetype kwarg to the constructor of FileResource, which skips guessing the MIME type from the file name.

The actual MIME type of the file is then detected from the file contents by using the python-magic package. If the detected MIME type does not match the expected one, the InvalidMimeType exception is raised. The BaseDownloader considers the resource as failed and logs the error.
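The validation steps above could look roughly like the following sketch. It is not the library's implementation: the check_mimetype and detect_with_magic functions are hypothetical, the two exception classes are local stand-ins for the library's exceptions of the same name, and the detect parameter exists only so the logic can be exercised without python-magic installed.

```python
import mimetypes
from typing import Callable, Optional


class MimeTypeNotGuessed(Exception):
    """Local stand-in for the library's exception of the same name."""


class InvalidMimeType(Exception):
    """Local stand-in for the library's exception of the same name."""


def detect_with_magic(path: str) -> str:
    """Detect the MIME type from the file contents using python-magic."""
    # Imported lazily: requires the python-magic package and libmagic.
    import magic

    return magic.from_file(path, mime=True)


def check_mimetype(
    path: str,
    accept_mimetype: Optional[str] = None,
    detect: Callable[[str], str] = detect_with_magic,
) -> str:
    """Return the detected MIME type, or raise if it is not the expected one."""
    expected = accept_mimetype
    if expected is None:
        # No explicit type given: guess the expected one from the file name.
        expected, _encoding = mimetypes.guess_type(path)
        if expected is None:
            raise MimeTypeNotGuessed(path)
    actual = detect(path)
    if actual != expected:
        raise InvalidMimeType(f"{path}: expected {expected}, got {actual}")
    return actual
```

For instance, if a server returned an HTML error page saved as catalog.json, the detected type would be text/html and this sketch would raise InvalidMimeType.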

MIME type validation can be disabled by passing the validate_mimetype=False kwarg to the constructor of FileResource.

Reformat files

When downloading a text-based file like JSON or XML, the server can send its contents formatted in different ways. For example, a JSON file can be returned completely unindented (as a single line), indented with 2 or 4 spaces, etc. The same goes for XML files.

To minimize variations between different versions of the same file, especially when using a version control system like Git, it is advised to reformat the file using settings that don’t vary over time.

By default, the FileResource class reformats the file after it has been downloaded, using a different method based on the file extension. As of now, the JSON (<file>.json) and XML (<file>.xml) formats are supported.

The FileResource._reformat_file method calls the reformat_file function, which can rely on external tools to actually reformat the files. If a tool is missing, or the reformatting fails, an exception is raised and the BaseDownloader considers the resource failed.
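As an illustration of what stable reformatting means for JSON, the sketch below rewrites a file with fixed settings using only the Python standard library. The reformat_json function and its settings (2-space indentation, trailing newline) are assumptions made for this example; the library's reformat_file function may work differently, for instance by calling external tools.

```python
import json
from pathlib import Path


def reformat_json(path: Path) -> None:
    """Rewrite a JSON file in place with fixed, predictable formatting."""
    data = json.loads(path.read_text(encoding="utf-8"))
    # Fixed indentation and no ASCII escaping keep the output stable over time.
    text = json.dumps(data, indent=2, ensure_ascii=False)
    path.write_text(text + "\n", encoding="utf-8")
```

With settings like these, a file served as a single line one day and indented with 4 spaces the next ends up identical on disk, so the version control diff only shows real content changes.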

Reformatting can be disabled by passing the reformat_file=False kwarg to the constructor of FileResource.