General
This notebook scrapes the data used in this tutorial from the HPC wiki (docs.hpc.cineca.it).
Imports
from pathlib import Path
import re
import requests
from datetime import datetime
import os
Env config
t0 = datetime.now()
print(f"Last execution {t0}")
# Endpoints
BASE_URL = "https://docs.hpc.cineca.it/_sources"
# Paths
DUMP_POSITION = "../data/input"
Path(DUMP_POSITION).mkdir(exist_ok=True, parents=True)
Last execution 2025-07-22 11:19:25.577657
Scraping
# Get the content of the index of the wiki
res = requests.get(BASE_URL + "/index.rst.txt")
def extract_urls_from_page(text: str):
    # Toctree entries are preceded by 3 whitespace characters; each entry is made of
    # word characters, "/" or "-", followed by a newline (\n)
    urls = re.findall(r"(?:\.\.\stoctree::\n.*?)((\s{3}[\w\/\-]+\n)+)", text, flags=re.S)
    cleaned_pages = []
    # findall returns one tuple of captured groups per toctree block
    for pages in urls:
        for page in pages:
            for item in page.split("\n"):
                if len(item):
                    cleaned_pages.append(item.strip())
    # Deduplicate the entries (set() does not preserve their order)
    return list(set(cleaned_pages))
index_pages = extract_urls_from_page(res.text)
index_pages
['hpc/hpc_data_storage',
'cloud/systems/index_system_specifics',
'general/users_account',
'specific_users/specific_users',
'general/access',
'cloud/os_overview/index_openstack_overview',
'hpc/hpc_clusters',
'hpc/hpc_scheduler',
'cloud/tutorials/index_tutorials_and_repos',
'cloud/tenant_adm/index_tenants_administration',
'cloud/operative/index_operative_manual',
'faq',
'hpc/hpc_intro',
'general/getting_started',
'hpc/hpc_software',
'services/services_and_tools',
'hpc/hpc_enviroment',
'cloud/general/general_info',
'general/general_info']
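To make the extraction step easier to follow, here is a minimal offline sketch that runs the same regular expression on a small, made-up reStructuredText snippet. The function name `extract_toctree_entries` and the sample text are assumptions introduced for illustration; only the pattern itself is taken from the notebook. Because the pattern contains two capturing groups, `re.findall` returns one tuple per toctree block, and the second element repeats the last entry, which is why deduplication is needed.

```python
import re


def extract_toctree_entries(text: str) -> list[str]:
    """Hypothetical restatement of the notebook's extraction logic."""
    blocks = re.findall(
        r"(?:\.\.\stoctree::\n.*?)((\s{3}[\w\/\-]+\n)+)", text, flags=re.S
    )
    entries = set()
    for groups in blocks:  # one tuple per match: (all entries, last entry)
        for group in groups:
            entries.update(line.strip() for line in group.split("\n") if line)
    # Sorted so that the result is deterministic, unlike list(set(...))
    return sorted(entries)


# Made-up sample page, not real wiki content
sample = "\n".join(
    [
        "Welcome",
        "=======",
        "",
        ".. toctree::",
        "   :maxdepth: 2",
        "",
        "   general/getting_started",
        "   hpc/hpc_intro",
        "   faq",
        "",
        "Some closing text.",
        "",
    ]
)

print(extract_toctree_entries(sample))
# → ['faq', 'general/getting_started', 'hpc/hpc_intro']
```

The option line `:maxdepth: 2` is skipped because `:` and spaces are not part of the entry character class.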
Download data
Download the source of each page. While doing so, check whether the page links to other pages that were not in the index, and add those to the list of pages to scrape.
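The traversal described here is a worklist crawl: take a page from the list, collect its links, and enqueue only the ones not seen before. The sketch below shows that idea without any network access. The `FAKE_LINKS` dictionary, the `crawl` function, and the `get_links` callback are all hypothetical stand-ins for the real download and extraction steps. It also differs from the notebook in two ways: it tracks seen pages in a set rather than a list, and it visits pages in first-in, first-out order.

```python
from collections import deque

# Hypothetical in-memory "site": page name -> pages its toctree links to
FAKE_LINKS = {
    "index": ["general/access", "hpc/hpc_clusters"],
    "general/access": [],
    "hpc/hpc_clusters": ["hpc/leonardo", "hpc/galileo"],
    "hpc/leonardo": ["hpc/hpc_clusters"],  # back-link: loops without a seen set
    "hpc/galileo": [],
}


def crawl(start_pages, get_links):
    """Visit every page reachable from start_pages exactly once."""
    seen = set(start_pages)
    queue = deque(start_pages)
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for child in get_links(page):
            if child not in seen:  # skip known pages to avoid loops
                seen.add(child)
                queue.append(child)
    return order


visited = crawl(["index"], lambda page: FAKE_LINKS.get(page, []))
print(visited)
# → ['index', 'general/access', 'hpc/hpc_clusters', 'hpc/leonardo', 'hpc/galileo']
```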
pages_to_be_visited = index_pages.copy()
while pages_to_be_visited:
    page_url: str = pages_to_be_visited.pop()
    page_url = BASE_URL + "/" + page_url + ".rst.txt"
    print(f"Visiting {page_url}")
    page = requests.get(page_url)
    # Search for additional urls in the page
    additional_pages = extract_urls_from_page(page.text)
    # Toctree entries appear to be relative to the current page, so prepend its directory
    for item in additional_pages:
        item = "/".join(page.url.split("/")[4:-1]) + "/" + item
        # Skip pages that are already known: visiting them again would probably cause a loop
        if item not in index_pages:
            index_pages.append(item)
            pages_to_be_visited.append(item)
            print(f"Page {item} was not in index. Adding it to the list of pages to be dumped.")
    # Dump the page under its file name only; pages sharing a base name
    # (e.g. general/general_info and cloud/general/general_info) overwrite each other
    with open(os.path.join(DUMP_POSITION, page_url.split("/")[-1]), mode="w") as f:
        f.write(page.text)
Visiting https://docs.hpc.cineca.it/_sources/general/general_info.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/general/general_info.rst.txt
Page cloud/general/budget_accounting was not in index. Adding it to the list of pages to be dumped.
Page cloud/general/cineca_cloud_model was not in index. Adding it to the list of pages to be dumped.
Page cloud/general/what_is_cloud was not in index. Adding it to the list of pages to be dumped.
Visiting https://docs.hpc.cineca.it/_sources/cloud/general/what_is_cloud.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/general/cineca_cloud_model.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/general/budget_accounting.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/hpc/hpc_enviroment.rst.txt
Page hpc/hpc_cineca-ai-hpyc was not in index. Adding it to the list of pages to be dumped.
Visiting https://docs.hpc.cineca.it/_sources/hpc/hpc_cineca-ai-hpyc.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/services/services_and_tools.rst.txt
Page services/interactive_computing was not in index. Adding it to the list of pages to be dumped.
Page services/singularity was not in index. Adding it to the list of pages to be dumped.
Visiting https://docs.hpc.cineca.it/_sources/services/singularity.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/services/interactive_computing.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/hpc/hpc_software.rst.txt
Page hpc/software/qe was not in index. Adding it to the list of pages to be dumped.
Page hpc/software/matlab was not in index. Adding it to the list of pages to be dumped.
Visiting https://docs.hpc.cineca.it/_sources/hpc/software/matlab.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/hpc/software/qe.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/general/getting_started.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/hpc/hpc_intro.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/faq.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/operative/index_operative_manual.rst.txt
Page cloud/operative/db_ops/index_db_ops was not in index. Adding it to the list of pages to be dumped.
Page cloud/operative/network_ops/index_network_ops was not in index. Adding it to the list of pages to be dumped.
Page cloud/operative/shares_ops/index_shares_ops was not in index. Adding it to the list of pages to be dumped.
Page cloud/operative/compute_ops/index_compute_ops was not in index. Adding it to the list of pages to be dumped.
Page cloud/operative/storage_ops/index_storage_ops was not in index. Adding it to the list of pages to be dumped.
Page cloud/operative/lb_ops/index_lb_ops was not in index. Adding it to the list of pages to be dumped.
Visiting https://docs.hpc.cineca.it/_sources/cloud/operative/lb_ops/index_lb_ops.rst.txt
Page cloud/operative/lb_ops/lb_create was not in index. Adding it to the list of pages to be dumped.
Visiting https://docs.hpc.cineca.it/_sources/cloud/operative/lb_ops/lb_create.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/operative/storage_ops/index_storage_ops.rst.txt
Page cloud/operative/storage_ops/volume_create was not in index. Adding it to the list of pages to be dumped.
Visiting https://docs.hpc.cineca.it/_sources/cloud/operative/storage_ops/volume_create.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/operative/compute_ops/index_compute_ops.rst.txt
Page cloud/operative/compute_ops/instance_create was not in index. Adding it to the list of pages to be dumped.
Page cloud/operative/compute_ops/instance_snap_create was not in index. Adding it to the list of pages to be dumped.
Page cloud/operative/compute_ops/instance_deletion was not in index. Adding it to the list of pages to be dumped.
Page cloud/operative/compute_ops/instance_rescue was not in index. Adding it to the list of pages to be dumped.
Page cloud/operative/compute_ops/instance_resize was not in index. Adding it to the list of pages to be dumped.
Page cloud/operative/compute_ops/image_upload was not in index. Adding it to the list of pages to be dumped.
Page cloud/operative/compute_ops/instance_manage was not in index. Adding it to the list of pages to be dumped.
Page cloud/operative/compute_ops/instance_download was not in index. Adding it to the list of pages to be dumped.
Page cloud/operative/compute_ops/keypair_create was not in index. Adding it to the list of pages to be dumped.
Visiting https://docs.hpc.cineca.it/_sources/cloud/operative/compute_ops/keypair_create.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/operative/compute_ops/instance_download.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/operative/compute_ops/instance_manage.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/operative/compute_ops/image_upload.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/operative/compute_ops/instance_resize.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/operative/compute_ops/instance_rescue.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/operative/compute_ops/instance_deletion.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/operative/compute_ops/instance_snap_create.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/operative/compute_ops/instance_create.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/operative/shares_ops/index_shares_ops.rst.txt
Page cloud/operative/shares_ops/cephfs_share_create was not in index. Adding it to the list of pages to be dumped.
Page cloud/operative/shares_ops/generic_share_create was not in index. Adding it to the list of pages to be dumped.
Visiting https://docs.hpc.cineca.it/_sources/cloud/operative/shares_ops/generic_share_create.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/operative/shares_ops/cephfs_share_create.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/operative/network_ops/index_network_ops.rst.txt
Page cloud/operative/network_ops/secgroups_create was not in index. Adding it to the list of pages to be dumped.
Page cloud/operative/network_ops/network_create was not in index. Adding it to the list of pages to be dumped.
Visiting https://docs.hpc.cineca.it/_sources/cloud/operative/network_ops/network_create.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/operative/network_ops/secgroups_create.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/operative/db_ops/index_db_ops.rst.txt
Page cloud/operative/db_ops/db_create was not in index. Adding it to the list of pages to be dumped.
Page cloud/operative/db_ops/db_access was not in index. Adding it to the list of pages to be dumped.
Visiting https://docs.hpc.cineca.it/_sources/cloud/operative/db_ops/db_access.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/operative/db_ops/db_create.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/tenant_adm/index_tenants_administration.rst.txt
Page cloud/tenant_adm/store_sens_data was not in index. Adding it to the list of pages to be dumped.
Page cloud/tenant_adm/security_guidelines was not in index. Adding it to the list of pages to be dumped.
Page cloud/tenant_adm/dns_guidelines was not in index. Adding it to the list of pages to be dumped.
Visiting https://docs.hpc.cineca.it/_sources/cloud/tenant_adm/dns_guidelines.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/tenant_adm/security_guidelines.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/tenant_adm/store_sens_data.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/tutorials/index_tutorials_and_repos.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/hpc/hpc_scheduler.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/hpc/hpc_clusters.rst.txt
Page hpc/leonardo was not in index. Adding it to the list of pages to be dumped.
Page hpc/galileo was not in index. Adding it to the list of pages to be dumped.
Page hpc/pitagora was not in index. Adding it to the list of pages to be dumped.
Visiting https://docs.hpc.cineca.it/_sources/hpc/pitagora.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/hpc/galileo.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/hpc/leonardo.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/os_overview/index_openstack_overview.rst.txt
Page cloud/os_overview/os_components/load_balancers was not in index. Adding it to the list of pages to be dumped.
Page cloud/os_overview/os_components/compute was not in index. Adding it to the list of pages to be dumped.
Page cloud/os_overview/os_components/shares was not in index. Adding it to the list of pages to be dumped.
Page cloud/os_overview/os_components/network was not in index. Adding it to the list of pages to be dumped.
Page cloud/os_overview/os_components/database was not in index. Adding it to the list of pages to be dumped.
Page cloud/os_overview/os_components/storage was not in index. Adding it to the list of pages to be dumped.
Visiting https://docs.hpc.cineca.it/_sources/cloud/os_overview/os_components/storage.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/os_overview/os_components/database.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/os_overview/os_components/network.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/os_overview/os_components/shares.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/os_overview/os_components/compute.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/os_overview/os_components/load_balancers.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/general/access.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/specific_users/specific_users.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/general/users_account.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/systems/index_system_specifics.rst.txt
Page cloud/systems/gaia was not in index. Adding it to the list of pages to be dumped.
Page cloud/systems/ada was not in index. Adding it to the list of pages to be dumped.
Visiting https://docs.hpc.cineca.it/_sources/cloud/systems/ada.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/cloud/systems/gaia.rst.txt
Visiting https://docs.hpc.cineca.it/_sources/hpc/hpc_data_storage.rst.txt
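Two details of the download loop can be made more robust. First, the parent directory is obtained by slicing the URL, which yields a leading `/` for top-level pages such as `faq`. Second, files are named after the last URL segment only, so `general/general_info` and `cloud/general/general_info` are both written to `general_info.rst.txt`. The following is a minimal sketch of one possible alternative, not the notebook's method: the helper names `resolve_child` and `dump_path_for`, and the `__` separator, are assumptions made for illustration.

```python
import posixpath
from pathlib import Path


def resolve_child(parent_page: str, entry: str) -> str:
    """Resolve a toctree entry relative to the page that contains it."""
    parent_dir = posixpath.dirname(parent_page)  # "" for top-level pages
    joined = posixpath.join(parent_dir, entry)
    # normpath collapses ".." segments; lstrip covers entries starting with "/"
    return posixpath.normpath(joined).lstrip("/")


def dump_path_for(page: str, dump_dir: str) -> Path:
    """Build a flat file name that keeps the directory part of the page name."""
    return Path(dump_dir) / (page.replace("/", "__") + ".rst.txt")


print(resolve_child("cloud/general/general_info", "what_is_cloud"))
# → cloud/general/what_is_cloud
print(resolve_child("faq", "extra"))
# → extra
print(dump_path_for("general/general_info", "../data/input").name)
# → general__general_info.rst.txt
print(dump_path_for("cloud/general/general_info", "../data/input").name)
# → cloud__general__general_info.rst.txt
```

Adopting names like these would change the files written to `../data/input`, so any later step that reads them by name would need to be adjusted accordingly.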