matchzoo.utils.get_file

Download file.

Module Contents

class matchzoo.utils.get_file.Progbar(target, width=30, verbose=1, interval=0.05)

Bases: object

Displays a progress bar.

Parameters:
  • target – Total number of steps expected, None if unknown.
  • width – Progress bar width on screen.
  • verbose – Verbosity mode, 0 (silent), 1 (verbose), 2 (semi-verbose)
  • stateful_metrics – Iterable of string names of metrics that should not be averaged over time. Metrics in this list will be displayed as-is. All others will be averaged by the progbar before display.
  • interval – Minimum visual progress update interval (in seconds).
update(self, current)

Updates the progress bar.
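
The parameters above suggest a text bar of fixed width that is redrawn no more often than every interval seconds. The following is a minimal sketch of that idea, not the matchzoo implementation: the class SimpleProgbar, its render method, and the exact bar characters are hypothetical and chosen only for illustration.

```python
import sys
import time


class SimpleProgbar:
    """Hypothetical stand-in for Progbar; not the matchzoo implementation."""

    def __init__(self, target, width=30, interval=0.05):
        self.target = target      # total number of steps, or None if unknown
        self.width = width        # number of characters inside the brackets
        self.interval = interval  # minimum seconds between two redraws
        self._last_draw = None    # time of the previous redraw, if any

    def render(self, current):
        """Return the bar text for `current` steps out of `target`."""
        if self.target is None:
            # Unknown total: only the step counter can be shown.
            return f"{current}/Unknown"
        filled = int(self.width * min(current, self.target) / self.target)
        bar = "=" * filled + "." * (self.width - filled)
        return f"{current}/{self.target} [{bar}]"

    def update(self, current):
        """Redraw the bar; skip the redraw if the last one was too recent."""
        now = time.monotonic()
        finished = self.target is not None and current >= self.target
        too_soon = (
            self._last_draw is not None
            and now - self._last_draw < self.interval
        )
        if too_soon and not finished:
            return False
        self._last_draw = now
        sys.stdout.write("\r" + self.render(current))
        sys.stdout.flush()
        return True
```

The final step is always drawn, even inside the throttling window, so the bar does not stop short of completion.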

matchzoo.utils.get_file._extract_archive(file_path, path='.', archive_format='auto')

Extracts an archive if it matches tar, tar.gz, tar.bz, or zip formats.

Parameters:
  • file_path – Path to the archive file.
  • path – Directory into which the archive is extracted.
  • archive_format – Archive format to try for extracting the file. Options are ‘auto’, ‘tar’, ‘zip’, and None. ‘tar’ includes tar, tar.gz, and tar.bz files. The default ‘auto’ is [‘tar’, ‘zip’]. None or an empty list will return no matches found.
Returns:

True if a match was found and an archive extraction was completed, False otherwise.
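
The format matching described above can be sketched with the standard library alone. This is an illustration under assumptions, not the matchzoo source: the name extract_archive_sketch is hypothetical, and the sketch assumes that each candidate format is tried in order and that the first one recognizing the file wins.

```python
import tarfile
import zipfile


def extract_archive_sketch(file_path, path=".", archive_format="auto"):
    """Try each candidate format in turn; return True once one extracts."""
    if archive_format is None:
        return False
    if archive_format == "auto":
        archive_format = ["tar", "zip"]
    if isinstance(archive_format, str):
        archive_format = [archive_format]

    for fmt in archive_format:
        if fmt == "tar":
            # tarfile handles plain, gzip- and bzip2-compressed tar files.
            open_fn, is_match_fn = tarfile.open, tarfile.is_tarfile
        elif fmt == "zip":
            open_fn, is_match_fn = zipfile.ZipFile, zipfile.is_zipfile
        else:
            continue  # unknown format name: nothing to try
        if is_match_fn(file_path):
            with open_fn(file_path) as archive:
                archive.extractall(path)
            return True
    return False
```

Note that extractall trusts the member paths stored in the archive, so a sketch like this should only be pointed at archives from a trusted source.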

matchzoo.utils.get_file.get_file(fname:str=None, origin:str=None, untar:bool=False, extract:bool=False, md5_hash:typing.Any=None, file_hash:typing.Any=None, hash_algorithm:str='auto', archive_format:str='auto', cache_subdir:typing.Union[Path, str]='data', cache_dir:typing.Union[Path, str]=matchzoo.USER_DATA_DIR, verbose:int=1) → str

Downloads a file from a URL if it is not already in the cache.

By default the file at the url origin is downloaded to the cache_dir ~/.matchzoo/datasets, placed in the cache_subdir data, and given the filename fname. The final location of a file example.txt would therefore be ~/.matchzoo/datasets/data/example.txt.

Files in tar, tar.gz, tar.bz, and zip formats can also be extracted. Passing a hash will verify the file after download. The command line programs shasum and sha256sum can compute the hash.

Parameters:
  • fname – Name of the file. If an absolute path /path/to/file.txt is specified the file will be saved at that location.
  • origin – Original URL of the file.
  • untar – Deprecated in favor of ‘extract’. Boolean, whether the file should be decompressed.
  • md5_hash – Deprecated in favor of ‘file_hash’. md5 hash of the file for verification.
  • file_hash – The expected hash string of the file after download. The sha256 and md5 hash algorithms are both supported.
  • cache_subdir – Subdirectory under the cache dir where the file is saved. If an absolute path /path/to/folder is specified the file will be saved at that location.
  • hash_algorithm – Hash algorithm used to verify the file. Options are ‘md5’, ‘sha256’, and ‘auto’. The default ‘auto’ detects the hash algorithm in use.
  • archive_format – Archive format to try for extracting the file. Options are ‘auto’, ‘tar’, ‘zip’, and None. ‘tar’ includes tar, tar.gz, and tar.bz files. The default ‘auto’ is [‘tar’, ‘zip’]. None or an empty list will return no matches found.
  • cache_dir – Location to store cached files. When None, it defaults to matchzoo.USER_DATA_DIR (~/.matchzoo/datasets).
  • verbose – Verbosity mode, 0 (silent), 1 (verbose), 2 (semi-verbose)
  • extract – If True, tries to extract the file as an archive, such as tar or zip.

Returns:

Path to the downloaded file.
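
How fname, cache_subdir and cache_dir combine into the final location can be illustrated with pathlib. This is a sketch of the documented behavior only, assuming POSIX paths; resolve_cache_path is a hypothetical helper, not part of matchzoo, and it performs no download.

```python
from pathlib import Path


def resolve_cache_path(fname, cache_subdir="data",
                       cache_dir="~/.matchzoo/datasets"):
    """Return where a file named `fname` would be stored in the cache."""
    # Joining onto an absolute component discards everything to its left,
    # so an absolute cache_subdir or fname overrides the earlier parts.
    return Path(cache_dir).expanduser() / cache_subdir / fname
```

With the defaults, resolve_cache_path("example.txt") yields the ~/.matchzoo/datasets/data/example.txt location described above, with ~ expanded to the user's home directory.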

matchzoo.utils.get_file.validate_file(fpath, file_hash, algorithm='auto', chunk_size=65535)

Validates a file against a sha256 or md5 hash.

Parameters:
  • fpath – path to the file being validated
  • file_hash – The expected hash string of the file. The sha256 and md5 hash algorithms are both supported.
  • algorithm – Hash algorithm, one of ‘auto’, ‘sha256’, or ‘md5’. The default ‘auto’ detects the hash algorithm in use.
  • chunk_size – Bytes to read at a time, important for large files.
Returns:

Whether the file is valid.

matchzoo.utils.get_file._hash_file(fpath, algorithm='sha256', chunk_size=65535)

Calculates the sha256 or md5 hash of a file.

Parameters:
  • fpath – Path to the file being hashed.
  • algorithm – Hash algorithm, one of ‘auto’, ‘sha256’, or ‘md5’. ‘auto’ detects the hash algorithm in use; per the signature above, the default is ‘sha256’.
  • chunk_size – Bytes to read at a time, important for large files.
Returns:

The file hash.
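
Chunked hashing and validation, as described for validate_file and _hash_file, can be sketched with hashlib. The function names below are hypothetical, and the sketch assumes that ‘auto’ tells the algorithms apart by digest length: a sha256 hex digest has 64 characters, an md5 hex digest has 32. The source does not state how matchzoo performs this detection.

```python
import hashlib


def hash_file_sketch(fpath, algorithm="sha256", chunk_size=65535):
    """Hash a file in chunks so large files never sit fully in memory."""
    hasher = hashlib.sha256() if algorithm == "sha256" else hashlib.md5()
    with open(fpath, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def validate_file_sketch(fpath, file_hash, algorithm="auto",
                         chunk_size=65535):
    """Return True if the file's hash equals the expected hex digest."""
    if algorithm == "auto":
        # Assumption: pick the algorithm from the expected digest's length.
        algorithm = "sha256" if len(file_hash) == 64 else "md5"
    return hash_file_sketch(fpath, algorithm, chunk_size) == str(file_hash)
```

The chunk size changes only how much is read per call, not the resulting digest, so it can be tuned for memory use without affecting validation.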