BioHPC – Storage Cheat Sheet


Filesystems Overview

/home2 - Home directories only. Configuration, code etc. Not for active data analysis. Mirror backup twice-weekly on Mon and Wed.

/project - Large space, high-performance for large files. Not for working with large numbers of small files. Archive large collections of small files (<1MB each) and avoid working on very small files (<100KB). No backup by default. Incremental backup is available: PIs should email to request that any content on /project be backed up to /project1.

/work - Also a high-performance filesystem, intended for LIVE HOT data since our recent upgrade. On /work you do not need to stripe large single files for performance, as you do on /project. Each user has 5 TB of space. /work is mirrored to backup once per week (Friday/Saturday); old versions are not kept.

/archive - A place for users to store COLD data. Each lab has 5TB of space by default. Quota can be increased upon approval. Accounting usage will be 2/3 of actual usage. The /archive filesystem has a directory tree set up similarly to /project.

Overall, single-threaded writes to /work or /archive can reach up to 2.3 GB/s, slightly faster than /project, while metadata queries are slightly slower. For most applications, you will not notice a performance difference among /work, /archive and /project.

** For applications that need to read large files from multiple threads concurrently (e.g. sequencing applications reading a large reference database), /work or /archive are better choices than /project because the I/O throughput is served from more disk arrays.
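The /project advice above about archiving collections of small files can be sketched as follows. This is a minimal sketch, not a BioHPC-provided tool: the function names count_small_files and archive_dir are hypothetical, and it assumes standard find and tar are available.

```shell
# Hypothetical helpers; pass in whichever directory you want to inspect.

# Count regular files smaller than 1 MiB (1048576 bytes) under a directory.
count_small_files() {
    find "$1" -type f -size -1048576c | wc -l | tr -d ' '
}

# Pack a directory into one compressed tar file next to it, then list the
# archive to confirm it is readable. Nothing is deleted; remove the original
# directory by hand only after checking the archive.
archive_dir() {
    dir=${1%/}
    tar -czf "$dir.tar.gz" -C "$(dirname "$dir")" "$(basename "$dir")" &&
        tar -tzf "$dir.tar.gz" > /dev/null
}
```

For example, count_small_files /project/department/myuser/results reports how many small files a directory holds, and archive_dir on the same path would create results.tar.gz beside it.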



/home2 - 50GB per user
/project - 5TB+ for the lab as agreed with PI/department chair (default 5TB for each new lab)
/work - 5TB per user (but not for long term storage)
/archive - 5TB+ for the lab as agreed with PI/department chair (default 5TB for each new lab)

Quota stats show soft and hard limits. You need to keep within the soft limit. The hard limit only exists to give a margin of safety so that jobs generating more data than expected do not fail.


Checking Quotas / Usage

The biohpc_quota command shows your quota status on each filesystem:

$ biohpc_quota

Current BioHPC Storage Quotas for test (group: Test_lab):

  FILE   |         SPACE USAGE        |         NUMBER OF FILES        
 SYSTEM  |     USED     SOFT     HARD |       USED       SOFT       HARD
home2    |     334M   51200M   71680M |      12393          0          0
project  |   35.65T      80T      90T |   28305097          0          0
work     |    6.24T    5.00T    7.00T |   84370898  unlimited  unlimited
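To spot filesystems that are over their soft limit without reading the table by eye, the output can be filtered with awk. This is a rough sketch that assumes the column layout shown in the sample above; the function name over_soft_limit is hypothetical, and treating a soft limit of 0 or "unlimited" as "no limit" is an assumption.

```shell
# Read biohpc_quota-style output on stdin and print each filesystem whose
# space usage exceeds its soft limit.
over_soft_limit() {
    awk -F'|' '
        # Convert a size such as 334M or 5.00T to megabytes.
        function to_mb(s,    n, u) {
            n = s + 0
            u = substr(s, length(s))
            if (u == "K") return n / 1024
            if (u == "G") return n * 1024
            if (u == "T") return n * 1024 * 1024
            return n    # M, or no suffix
        }
        NF == 3 {
            split($2, space, " ")            # USED SOFT HARD
            if (space[1] !~ /^[0-9]/) next   # skip the two header rows
            soft = to_mb(space[2])
            # Assumption: a soft limit of 0 means no limit is set.
            if (soft > 0 && to_mb(space[1]) > soft) {
                gsub(/ /, "", $1)
                print $1
            }
        }
    '
}
```

Used as biohpc_quota | over_soft_limit, this would print work for the sample output above, since 6.24T used exceeds the 5.00T soft limit.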


To see individual usage for a user on /project, use the lfs quota command:

lfs quota -u <username> -h /project

$ lfs quota -u test -h /project
Disk quotas for user test (uid 123456):
     Filesystem    used   quota   limit   grace   files   quota   limit   grace
       /project    9.3T      0k      0k       - 37321317       0       0       -


Faster Performance for Very Large Files on /project

The /project filesystem is a parallel filesystem consisting of 40 storage targets. By default each file is stored on a single target. Each target can provide read speeds of up to 1GB/s depending on use.

Faster speeds can be achieved for very large files by striping the file across multiple targets. Most software can’t read files fast enough to benefit from striping – but some can.  If you have many processes all reading from a single file then striping can also help improve the aggregate speed.

Some important rules:

            NEVER use a stripe count of more than 8 – usually no benefit, and it slows things down for others.
            ONLY stripe large files. Striping files <1GB will increase the load on the system with no real benefit.
            ONLY use the -c stripe count option for setstripe. Never change stripe index or stripe size!
            Try to set striping on directories – and keep large and small files separate so you can do this.
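To apply the "only stripe large files" and "keep large and small files separate" rules, it helps to first list which files are actually large. The sketch below uses find for that; the function name list_large_files is hypothetical, and the 1 GiB default is simply the 1GB threshold from the rules above.

```shell
# List regular files larger than a byte threshold (default 1 GiB) under a
# directory tree, as candidates for moving into a striped directory.
list_large_files() {
    dir=$1
    min_bytes=${2:-1073741824}
    find "$dir" -type f -size +"${min_bytes}c"
}
```

For example, list_large_files /project/department/myuser prints every file over 1 GiB below that directory; files not listed are better left unstriped.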

When you set striping on a directory it only applies to new files in that directory. To apply striping to old files you must copy (not move) them inside the directory that has striping set.

To set striping for a directory:

# The -c option specifies the number of stripes, 4 in this case
lfs setstripe -c 4 /project/department/myuser/bigfiles


To see striping settings for a directory or file:

lfs getstripe /project/department/myuser/bigfiles


To apply striping to an existing file in a directory:

# Set striping on the directory
lfs setstripe -c 4 /project/department/myuser/bigfiles
cd /project/department/myuser/bigfiles
# Copy the existing file to create a striped version
cp myfile myfile.striped
# Replace the original file with its striped copy
mv myfile.striped myfile

BE CAREFUL – make sure you are certain you don’t overwrite the wrong thing. It can be safer to create a new directory and copy files into it.
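The safer approach mentioned above, copying into a new striped directory and leaving the original alone, might look like the sketch below. The function name copy_to_striped_dir and the LFS override variable are hypothetical, added only so the steps can be dry-run off the cluster; the only lfs option used is the -c stripe count shown earlier.

```shell
# LFS can be overridden for a dry run; on the cluster it is the real lfs client.
LFS=${LFS:-lfs}

# Copy a file into a new directory striped across $3 targets (default 4) and
# confirm the copy is byte-identical. The original file is left untouched.
copy_to_striped_dir() {
    src=$1
    dest_dir=$2
    count=${3:-4}
    # Enforce the rule above: never use a stripe count of more than 8.
    if [ "$count" -gt 8 ]; then
        echo "refusing stripe count above 8" >&2
        return 1
    fi
    mkdir -p "$dest_dir" &&
        "$LFS" setstripe -c "$count" "$dest_dir" &&
        cp "$src" "$dest_dir/" &&
        cmp -s "$src" "$dest_dir/$(basename "$src")"
}
```

A call such as copy_to_striped_dir myfile /project/department/myuser/bigfiles_striped 4 returns success only if the copy matches the original, after which the original can be removed by hand.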


How many Stripes?

The following general rules are appropriate for our storage system:


Default (no striping, each file on a single target) – Any file that doesn’t fit the criteria below.


Moderate size files 2-10GB that are read by 1-2 concurrent processes


Moderate size files 2-10GB that are read by 3+ concurrent processes regularly

Large files 10GB+ that are read by 1-2 concurrent processes


Large files 10GB+ that are read by 3+ concurrent processes regularly

Any very large files 200GB+ (to balance storage target usage)


Remember – performance is very good even without striping. You only have to worry about striping at all if you have a real need to increase performance, or are storing files that are 100s of GBs in size.