Cache data on the underlying machine
Your Meadowrun code will usually run in a container, which means that it normally won't be able to write files to the underlying machine's file system. Meadowrun provides `meadowrun.MACHINE_CACHE_FOLDER`, which points to `/var/meadowrun/machine_cache` when running on EC2 or Azure VMs. You can write data to this folder and it will be visible to any jobs that happen to run on the same machine.
The general pattern for using this folder should be something like:
```python
import os.path

import filelock
import meadowrun


def a_slow_computation():
    return "some sample data"


def get_cached_data():
    cached_data_filename = os.path.join(meadowrun.MACHINE_CACHE_FOLDER, "myfile")

    # the lock file ensures only one process computes and writes the data
    with filelock.FileLock(f"{cached_data_filename}.lock"):
        if not os.path.exists(cached_data_filename):
            # first job on this machine: compute the data and cache it
            data = a_slow_computation()
            with open(cached_data_filename, "w") as cached_data_file:
                cached_data_file.write(data)
            return data
        else:
            # the data has already been cached on this machine, just read it
            with open(cached_data_filename, "r") as cached_data_file:
                return cached_data_file.read()
```
`filelock` is a library that ensures only one process at a time is writing to the specified file. You're welcome to use whatever locking mechanism you like, but you should never assume that your job is the only process running on the machine.
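If you'd rather fail fast than block indefinitely while another process holds the lock, `filelock.FileLock` also accepts a timeout (this is standard `filelock` behavior, not specific to Meadowrun). A minimal sketch:

```python
import filelock

try:
    # wait at most 60 seconds for another process to release the lock
    with filelock.FileLock("/var/meadowrun/machine_cache/myfile.lock", timeout=60):
        ...  # read or write the cached file, as in get_cached_data above
except filelock.Timeout:
    ...  # couldn't get the lock in time; e.g. recompute the data without caching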
You should also never assume that any data you wrote will be available for a subsequent job. Meadowrun does not provide a way to guarantee that two jobs will run on the same machine.
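As an illustration only, here is a sketch of how `get_cached_data` (defined in the same module as above) might be used from a `run_map`, assuming the `run_map`, `AllocEC2Instance`, and `Resources` usage from Meadowrun's getting-started examples. Tasks that happen to land on the same instance reuse the cached file; tasks on other instances recompute and cache it themselves:

```python
import asyncio
import meadowrun


async def main():
    # each task calls get_cached_data; the slow computation runs at most once
    # per machine rather than once per task
    results = await meadowrun.run_map(
        lambda i: (i, get_cached_data()),
        [1, 2, 3, 4],
        host=meadowrun.AllocEC2Instance(),
        resources_per_task=meadowrun.Resources(logical_cpu=1, memory_gb=2),
    )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```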