This page explains in detail how resource usage costs are calculated on the HURCS cluster named Moriah.
The chargeable resources are:
Storage
CPU cores
Memory
GPU
Charging is done at the beginning of each month, for cluster usage during the previous month, by deducting the costs from the credits each PI has in their HURCS account.
Cluster usage costs are calculated based on storage and compute resource type. Please read the following descriptions carefully and make sure you understand your cost structure.
The cost of storage is charged for the disk quota, NOT the actual usage of storage. Disk quotas can only be allocated in increments of 1TB (for example, a quota of 3.5TB is not possible).
We provide two kinds of disk storage: Fast and Archive. Fast storage should be used for files that are currently in use. Archive storage is cheaper and may be used to store data that is not accessed often, such as old research results, etc. On top of the storage service itself, we provide two kinds of backup services: snapshots and full backup. The different storage and backup options are implemented by dividing the storage into several volumes, and accordingly, charging is done for the quota allocated in each volume, as described in the following table:
| Volume | Details | Path | Total cost (per 1TB / month) |
| --- | --- | --- | --- |
| Fast, no snapshots | Fast storage for temporary files (without any backup options) | /sci/nosnap/<PI login> | $4 per TB |
| Fast, with snapshots | Main working storage for computation work | /sci/labs/<PI login> | $5 per TB |
| Fast, with full backup | Fast storage with a full backup service | /sci/backup/<PI login> | $5 per TB + $2.5 per TB for the backup service (total: $7.5) |
| Archive | Long-term storage. Cannot be used for computation | /sci/archive/<PI login> | $2 per TB |
| Archive, with backup | Archive storage + full backup | /sci/backup/Archive/<PI login> | $2 per TB + $2.5 per TB for the backup service (total: $4.5) |
Each lab is assigned 1TB of fast (with snapshots) storage by default. PIs who do not need any lab storage can ask for this allocation to be revoked, and will not be charged for it.
Requests for changes in quota allocations can be made by filling out this form.
For example, a lab with the following quota allocation:
Fast, with snapshots = 10TB
Archive = 30TB
Fast, with full backup = 5TB
Monthly cost = 10 × $5 + 30 × $2 + 5 × $7.5 = $147.5
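For reference, here is a minimal sketch of how this monthly charge can be reproduced from the per-TB prices in the table above (variable and key names are illustrative only, not part of any HURCS tooling):

```python
# Monthly storage cost sketch, using the per-TB prices from the table above.
# Quota values (in TB) are the ones from the example; names are illustrative.
PRICE_PER_TB = {
    "fast_no_snapshots": 4.0,
    "fast_with_snapshots": 5.0,
    "fast_with_full_backup": 7.5,   # $5 storage + $2.5 backup
    "archive": 2.0,
    "archive_with_backup": 4.5,     # $2 storage + $2.5 backup
}

quotas_tb = {
    "fast_with_snapshots": 10,
    "archive": 30,
    "fast_with_full_backup": 5,
}

monthly_cost = sum(PRICE_PER_TB[volume] * tb for volume, tb in quotas_tb.items())
print(f"Monthly storage cost: ${monthly_cost}")  # 10*5 + 30*2 + 5*7.5 = $147.5
```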
See here for more details about the storage folder structure in HURCS.
The Moriah cluster is made up of 42 compute nodes: 38 CPU-only nodes, named glacier-01 through glacier-38; 2 NVIDIA DGX machines, named dogfish-01 and dogfish-02; and 2 NVIDIA A30 machines, named puffin-01 and puffin-02. Each compute node provides a certain amount of compute resources: CPU, memory and (in the case of the GPU nodes) GPU. Jobs submitted to the cluster are allocated a certain amount of resources, per the user's request. Jobs always get at least the amount of resources requested, and sometimes a little more. Jobs are charged for the amount of resources allocated to them. The cluster automatically spreads the requested resources over the required number of nodes. For example, the glacier nodes have 128 CPUs each, so a request for 200 CPUs will be allocated over 2 nodes, as sketched below.
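A minimal sketch of this node-count calculation, assuming the 128-CPU glacier node size stated above (the function name is illustrative):

```python
import math

GLACIER_CPUS_PER_NODE = 128  # CPUs per glacier node, as stated above

def glacier_nodes_needed(requested_cpus: int) -> int:
    """Minimum number of glacier nodes needed to satisfy a CPU request."""
    return math.ceil(requested_cpus / GLACIER_CPUS_PER_NODE)

print(glacier_nodes_needed(200))  # 2 nodes, as in the example above
```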
NVIDIA A100 - Two nodes, named dogfish-01 and dogfish-02, each containing 8 A100 cards.
On dogfish-01 the cards are partitioned into 7 units, each with 10GB of memory - a total of 56 units.
On dogfish-02 the cards are partitioned into 2 units, each with 40GB of memory - a total of 16 units.
NVIDIA A30 - Two nodes, named puffin-01 and puffin-02.
Each contains 8 NVIDIA A30 cards with 24GB of memory. The cards are not partitioned into smaller units.
To simplify the accounting of your compute resource usage costs, we define a unified accounting unit called a CRU (Compute Resource Unit). A CRU is defined as follows:
1 CRU = Using an entire node for 1 hour
The cost per CRU differs between the CPU-only (glacier) nodes and the GPU (dogfish) nodes: as used in the examples below, a CRU costs $1.024 on the glacier nodes and $11.2 on the dogfish nodes.
The amount of CRU that each job used equals the largest fraction of the node's resources (CPUs, memory or GPU units) allocated to the job, as illustrated by the following examples:
Example #1:
A job used 1 CPU and 6GB of memory on a glacier (CPU-only) node.
The glacier nodes have 128 CPUs and 1TB (1024GB) of memory per node.
In total, this job used 1/128 (~0.8%) of the node's CPUs and 6/1024 (~0.6%) of its memory.
Thus, the CRU amount equals the portion of the CPUs used (1/128 ≈ 0.008 CRU).
The job ran for 2 hours. Its total cost is therefore: 1/128 CRU × 2 hours × $1.024 = $0.016
Example #2:
A job used one 10GB GPU unit, 1 CPU and 32GB of memory on dogfish-01.
The dogfish-01 node has 56 10GB GPU units, 256 CPUs and 2TB (2048GB) of memory.
In total, this job used 1/256 of the node's CPUs, 1/56 of its GPU units and 32/2048 (1/64) of its memory.
Thus, the CRU amount equals the portion of the GPUs used (1/56 ≈ 0.018 CRU).
The job ran for 10 hours. Its total cost is therefore: 1/56 CRU × 10 hours × $11.2 = $2
Example #3:
A job used 2 CPUs and 20GB of memory on a glacier node.
This job used 2/128 (~1.6%) of the CPUs and 20/1024 (~2%) of the memory.
Thus, the CRU amount equals the portion of the memory used (20/1024 ≈ 0.02 CRU).
The job ran for 6 hours. Its total cost is therefore: 20/1024 CRU × 6 hours × $1.024 = $0.12
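The three examples above can be reproduced with the following sketch. It assumes the node sizes and CRU prices quoted on this page and the "largest allocated fraction" rule described above; it is an illustration only, not the actual billing code:

```python
# Sketch of the job cost calculation described above (illustrative, not the billing code).
# A job's CRU-per-hour equals the largest fraction of node resources allocated to it.

NODE_SPECS = {
    # node type: (CPUs, memory in GB, GPU units, price per CRU in $)
    "glacier":    (128, 1024, 0,  1.024),
    "dogfish-01": (256, 2048, 56, 11.2),
}

def job_cost(node_type, cpus, mem_gb, gpu_units, hours):
    node_cpus, node_mem, node_gpus, cru_price = NODE_SPECS[node_type]
    fractions = [cpus / node_cpus, mem_gb / node_mem]
    if gpu_units:
        fractions.append(gpu_units / node_gpus)
    cru_per_hour = max(fractions)  # largest allocated fraction of the node
    return cru_per_hour * hours * cru_price

print(round(job_cost("glacier",    1,  6, 0,  2), 3))   # Example #1: $0.016
print(round(job_cost("dogfish-01", 1, 32, 1, 10), 2))   # Example #2: $2.00
print(round(job_cost("glacier",    2, 20, 0,  6), 2))   # Example #3: $0.12
```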
On the glacier (CPU-only) hosts, usage costs increase linearly when adding CPUs or memory (an additional $0.008 for each additional CPU or additional 8GB of memory). So, when running your jobs, consider how much you will benefit from adding more CPUs or more memory; requesting more resources than required will needlessly increase the cost of your jobs.
Because of the relatively high CRU cost and the nature of GPU-based algorithms, jobs that use the DGX machines can become very costly. Users are therefore encouraged to carefully plan their jobs' resource requests before execution.
We provide a certain amount of resources on each of the Dogfish hosts for a fixed price as follows:
Above that, the cost increases with every additional CPU or additional memory ($0.04375 for each additional CPU or each additional 8GB of memory).
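As a quick check, the per-unit increments quoted above follow directly from the CRU prices and node sizes given earlier (a sketch; it assumes the resource being added is the one that determines the job's charged fraction):

```python
# How the per-CPU / per-8GB increments quoted above follow from the CRU prices.
GLACIER_CRU_PRICE, GLACIER_CPUS, GLACIER_MEM_GB = 1.024, 128, 1024
DOGFISH_CRU_PRICE, DOGFISH_CPUS, DOGFISH_MEM_GB = 11.2, 256, 2048

print(GLACIER_CRU_PRICE / GLACIER_CPUS)        # $0.008 per additional CPU on glacier
print(GLACIER_CRU_PRICE * 8 / GLACIER_MEM_GB)  # $0.008 per additional 8GB on glacier
print(DOGFISH_CRU_PRICE / DOGFISH_CPUS)        # $0.04375 per additional CPU on dogfish
print(DOGFISH_CRU_PRICE * 8 / DOGFISH_MEM_GB)  # $0.04375 per additional 8GB on dogfish
```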
Based on this cost behavior, we recommend starting with a small data set: run a scaled-down model on dogfish-01 with a single CPU and a minimal amount of memory, and then run it again with 4 CPUs and more memory (up to 36GB). Based on the results of these scaled-down runs, you can estimate how much GPU memory your models actually need and make an informed decision on the cost-performance trade-off.