Remote manager tutorial
This material is adapted from work by Luigi Genovese, Laura Ratcliff and others
This tutorial introduces the engine that BigDFT uses to run calculations on a remote supercomputer. After this tutorial you will be able to launch work on the remote nodes from your local workstation.
The remotemanager package will have to be installed on your machine (or in the colab session):
!pip install -U remotemanager
Collecting remotemanager
Using cached remotemanager-0.10.10-py3-none-any.whl.metadata (1.9 kB)
Requirement already satisfied: pyyaml in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from remotemanager) (6.0.1)
Requirement already satisfied: requests in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from remotemanager) (2.31.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from requests->remotemanager) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from requests->remotemanager) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from requests->remotemanager) (2.1.0)
Requirement already satisfied: certifi>=2017.4.17 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from requests->remotemanager) (2023.11.17)
Using cached remotemanager-0.10.10-py3-none-any.whl (85 kB)
Installing collected packages: remotemanager
Successfully installed remotemanager-0.10.10
Begin with the common imports, Dataset and URL. As these two are always used, they are available from the remotemanager root.
from remotemanager import Dataset, URL
Define a simple function to run:
def basic_function(inp):
    import time
    time.sleep(1)
    return inp*inp
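Before handing the function to a remote machine, it can help to confirm locally that it behaves as expected. A minimal check, reusing the tutorial's function unchanged (the variable name `check` is only illustrative):

```python
def basic_function(inp):
    # the import sits inside the function body, so the function carries
    # everything it needs with it
    import time
    time.sleep(1)
    return inp*inp


check = basic_function(7)
print(check)  # 49
```

The call takes about one second because of the sleep, which stands in for a longer calculation.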
For the moment, we are just running on our local machine:
url = URL()
#url = URL(host='vega') #switch on this after the first walkthrough of this notebook
The basic concept: Dataset
Here we see the main idea behind remotemanager: a function can be run multiple times on a remote machine, each time with different values of its arguments.
This will be useful for controlling the execution of many calculations in more complex data submissions.
With the basic setup done, let's create the Dataset:
ds = Dataset(function=basic_function,
             url=url,
             # script='module load Python',  # we need python to execute a python function; this script loads it
             )
The Dataset stores the function, the Runners store the arguments.
Right now all we have is a function, so we need to create the arguments:
values = [1, 3, 7, 50]
for val in values:
    ds.append_run(args={'inp': val})
appended run runner-0
appended run runner-1
appended run runner-2
appended run runner-3
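To make the split between function and arguments concrete, here is a toy, purely local model of the idea. It is not how remotemanager is implemented: `ToyDataset`, `ToyRunner` and their methods are hypothetical names invented for this sketch, and everything runs in the current process.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ToyRunner:
    """Hypothetical stand-in for a runner: holds one set of arguments."""
    args: Dict[str, Any]
    result: Any = None
    finished: bool = False


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a dataset: one function, many runners."""
    function: Callable[..., Any]
    runners: List[ToyRunner] = field(default_factory=list)

    def append_run(self, args: Dict[str, Any]) -> None:
        # each call stores one more set of arguments for the same function
        self.runners.append(ToyRunner(args=args))

    def run(self) -> None:
        # executed locally and in sequence here; a remote run would not block
        for runner in self.runners:
            runner.result = self.function(**runner.args)
            runner.finished = True

    @property
    def results(self) -> List[Any]:
        return [runner.result for runner in self.runners]


def square(inp):
    return inp * inp


toy = ToyDataset(function=square)
for val in [1, 3, 7, 50]:
    toy.append_run(args={'inp': val})
toy.run()
print(toy.results)  # [1, 9, 49, 2500]
```

The point of the design is that the function is stored once, while each runner only adds a small dictionary of arguments and, later, its own result.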
Now we have all the material required:
Function
Connection
Arguments
Time to run:
Here we run the dataset.
WARNING: If you have timeout problems in this part, try increasing the timeout value in the URL class attributes.
#ds.url.timeout = 120
#ds.url.max_timeouts = 4
ds.run(force=True)
assessing run for runner dataset-0324eba7-runner-0... force running
assessing run for runner dataset-0324eba7-runner-1... force running
assessing run for runner dataset-0324eba7-runner-2... force running
assessing run for runner dataset-0324eba7-runner-3... force running
The cell below is useful for waiting on a run. It has two parts:
print(ds.run_cmds)
This checks that the commands used to launch the runs were okay. If there were any errors, you will see them here.
while not ds.all_finished: ...
This block waits for the dataset to complete. Dataset.all_finished only returns True when all the runners are completed.
Note: You can also use Dataset.is_finished to see the state on a per-runner basis.
WARNING (for the exercise): If you have an error like ‘python not found’ here, define a script header in the dataset.
import time

# run_cmds is not available in every remotemanager release: version 0.10.10
# raises AttributeError for it, so look the attribute up defensively
print(getattr(ds, 'run_cmds', 'run_cmds is not available in this version'))

while not ds.all_finished:
    print('dataset not finished yet, sleeping for 1s')
    time.sleep(1)
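The waiting loop above never exits if a runner never completes. A hedged variation is to poll with an upper bound on the waiting time. The helper `wait_until` below is a hypothetical name and not part of remotemanager; it accepts any zero-argument callable, so with a dataset one could pass `lambda: ds.all_finished`. Here a fake predicate stands in for the dataset so the sketch runs on its own.

```python
import time


def wait_until(predicate, interval=1.0, timeout=60.0):
    """Poll predicate() until it returns True or timeout seconds elapse.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True


# Stand-in for a dataset: reports "finished" on the third check
calls = {'n': 0}


def fake_all_finished():
    calls['n'] += 1
    return calls['n'] >= 3


print(wait_until(fake_all_finished, interval=0.01, timeout=5.0))  # True
print(wait_until(lambda: False, interval=0.01, timeout=0.05))     # False
```

Using `time.monotonic()` for the deadline keeps the timeout unaffected by changes to the system clock.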
If we’ve made it through the wait block then we must have results, let’s fetch them:
ds.fetch_results()
Now they’re fetched, we can access them via the results property without having to talk to the remote again:
ds.results
[1, 9, 49, 2500]
ds.runners[0].history
{'2022-11-16 19:22:00/0': 'created',
'2022-11-16 19:22:02/0': 'submitted',
'2022-11-16 19:22:03/0': 'resultfile created remotely',
'2022-11-16 19:22:12/0': 'completed'}
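The history keys carry a timestamp, so the time spent between states can be recovered from them. A small sketch using the values printed above; it assumes the part after the slash is only a counter that keeps keys unique, which the tutorial does not state.

```python
from datetime import datetime

history = {'2022-11-16 19:22:00/0': 'created',
           '2022-11-16 19:22:02/0': 'submitted',
           '2022-11-16 19:22:03/0': 'resultfile created remotely',
           '2022-11-16 19:22:12/0': 'completed'}


def parse_key(key):
    # assumption: the suffix after '/' is an index, not part of the time
    stamp, _, _index = key.partition('/')
    return datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S')


events = [(parse_key(key), state) for key, state in history.items()]
events.sort(key=lambda pair: pair[0])

# seconds elapsed between consecutive states
steps = [(later[1], (later[0] - earlier[0]).total_seconds())
         for earlier, later in zip(events, events[1:])]
total = (events[-1][0] - events[0][0]).total_seconds()

for state, seconds in steps:
    print(f'{state}: +{seconds:.0f}s')
print(f'total: {total:.0f}s')  # total: 12s
```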
Exercise
Run this simple function on vega.
Change the URL to vega (it should be enough to pass to URL the same host you use for your ssh connection, e.g. ssh vega).
Pay attention to the various steps of the procedure.
Hint: Is python loaded by default when you ssh into vega? Perhaps module spider python can help with this.
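One way to explore the hint from the workstation is to ask the remote shell where python is. The sketch below only builds, and optionally runs, a plain ssh command through the standard library; it does not use remotemanager. The function names are illustrative, and it assumes that `ssh vega` already works without a password prompt. The answer reflects the non-interactive shell that ssh starts, which may differ from an interactive login.

```python
import shlex
import subprocess


def python_check_command(host, executable='python3'):
    # 'command -v' prints the path of the executable, or fails if it is not found
    return ['ssh', host, 'command -v ' + shlex.quote(executable)]


def remote_has_python(host, executable='python3', timeout=30):
    """Return True if the executable is found on the remote PATH.

    A failed connection also gives False, since ssh then exits non-zero.
    """
    proc = subprocess.run(python_check_command(host, executable),
                          capture_output=True, text=True, timeout=timeout)
    return proc.returncode == 0


print(python_check_command('vega'))
# To actually contact the machine: remote_has_python('vega')
```

If the check fails, a script header that loads python in the dataset, as mentioned in the earlier warning, is the likely remedy.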
Important: contact us if you are using a setup that would not work for the exercise. You can also run the following tutorials on Google Colab if you have difficulties installing a Jupyter notebook environment on your workstation.