Known MATLAB issues
MATLAB 2014b and above can experience the following issues when run on BioHPC systems. BioHPC has reported these issues to the MATLAB developers, but no fix is currently available.
* Implicit parallelism, handled automatically by MATLAB: use of vector operations in place of for-loops (see the sketch after this list).
* Explicit parallelism with the Parallel Computing Toolbox:
- Parallel for loop (parfor)
- spmd (Single Program Multiple Data)
- pmode (interactive environment for spmd development)
- Parallel computing with GPU
* Direct submission of MATLAB jobs to the BioHPC cluster:
- MATLAB job scheduler integrated with SLURM
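For instance, a brief sketch of the implicit parallelism named in the first bullet (illustrative values, not from the original slides): element-wise vector operations let MATLAB use its built-in multithreading, with no toolbox required.

x = rand(1, 1e7);
% explicit loop: elements are processed one at a time
y = zeros(size(x));
for i = 1 : numel(x)
    y(i) = sin(x(i))^2;
end
% vectorized equivalent: MATLAB multithreads the element-wise operations automatically
y2 = sin(x).^2;   % y2 matches y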
• The only syntactic difference in a parfor loop is the keyword parfor in place of for. When the loop begins, MATLAB opens a parallel pool of sessions called workers and executes the iterations in parallel across them.
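A minimal sketch of the pattern (hypothetical values):

parpool('local');  % open a pool of workers; parfor also opens one automatically if needed
s = 0;
parfor i = 1 : 100
    s = s + i;     % s is a reduction variable: partial sums are combined across workers
end
delete(gcp);       % close the pool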
Example 1: Estimating an Integration
quad_fun.m (serial version)
function q = quad_fun(n, a, b)
    q = 0.0;
    w = (b - a) / n;
    for i = 1 : n
        x = ((n - i) * a + (i - 1) * b) / (n - 1);
        fx = 0.2 * sin(2 * x);
        q = q + w * fx;
    end
    return
end

Run from the command line:

>> tic; q = quad_fun(120000000, 0.13, 1.53); t1 = toc
t1 = 5.7881
quad_fun_parfor.m (parfor version)
function q = quad_fun_parfor(n, a, b)
    q = 0.0;
    w = (b - a) / n;
    parfor i = 1 : n
        x = ((n - i) * a + (i - 1) * b) / (n - 1);
        fx = 0.2 * sin(2 * x);
        q = q + w * fx;
    end
    return
end

Run from the command line:

>> parpool('local');
Starting parallel pool (parpool) using the 'local' profile ... connected to 12 workers.
>> tic; q = quad_fun_parfor(120000000, 0.13, 1.53); t2 = toc
t2 = 0.9429
12 workers, Speedup = t1/t2 = 6.14x
Limitations of parfor
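One well-known limitation, shown here as an illustrative sketch (not from the original example): parfor iterations must be independent of each other, so loops with loop-carried dependencies are rejected.

a = zeros(1, 100);
a(1) = 1;
parfor i = 2 : 100
    a(i) = a(i-1) + 1;  % reads the result of the previous iteration;
end                     % MATLAB rejects this loop because a cannot be classified
                        % as a valid parfor variable: iterations are not independent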
spmd (Single Program Multiple Data)

spmd
    labdata = load(['datafile_' num2str(labindex) '.ascii'])  % multiple data
    result = MyFunction(labdata);                             % single program
end
Example 1: Estimating an Integration
quad_fun_spmd.m (spmd version)
a = 0.13; b = 1.53; n = 120000000;
spmd
    aa = (labindex - 1) / numlabs * (b - a) + a;
    bb = labindex / numlabs * (b - a) + a;
    nn = round(n / numlabs);
    q_part = quad_fun(nn, aa, bb);
end
q = sum([q_part{:}]);

Run from the command line:

>> tic; quad_fun_spmd; t3 = toc
t3 = 0.7006
12 workers, Speedup = t1 / t3 = 8.26x
Q: Why is the performance better than the parfor version?
A: The spmd version needs only 12 communications between the client and the workers to update q, whereas the parfor version needs 120,000,000 communications between the client and the workers.
Distributed Array
Matrix Multiply (A*B) by Distributed Array
parpool('local');
A = rand(3000);
B = rand(3000);
a = distributed(A);
b = distributed(B);
c = a * b;   % runs on the workers automatically as long as c is of type distributed
delete(gcp);
Matrix Multiply (A*B) by Codistributed Array
parpool('local');
A = rand(3000);
B = rand(3000);
spmd
    u = codistributed(A, codistributor1d(1));  % distribute by rows
    v = codistributed(B, codistributor1d(2));  % distribute by columns
    w = u * v;
end
delete(gcp);
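Note the design difference between the two versions: distributed(A) is called from the client session and MATLAB chooses the partitioning for you, while codistributed arrays are created inside an spmd block, where codistributor1d lets you control explicitly which dimension is split across the workers.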
Example 2: Linear Solver (Ax = b)
linearSolver.m (serial version)
n = 10000;
M = rand(n);
X = ones(n, 1);
A = M + M';
b = A * X;
u = A \ b;

Run from the command line:

>> tic; linearSolver; t1 = toc
t1 = 8.4054
linearSolverSpmd.m (spmd version)
n = 10000;
M = rand(n);
X = ones(n, 1);
spmd
    m = codistributed(M, codistributor('1d', 2));  % distribute one portion of M to each MATLAB worker
    x = codistributed(X, codistributor('1d', 1));  % distribute one portion of X to each MATLAB worker
    A = m + m';
    b = A * x;
    utmp = A \ b;
end
u1 = gather(utmp);  % gather u1 from all MATLAB workers back to the MATLAB client

Run from the command line:

>> parpool('local');
Starting parallel pool (parpool) using the 'local' profile ... connected to 12 workers.
>> tic; linearSolverSpmd; t2 = toc
t2 = 16.5482
Q: Why does the spmd version need more time to complete the job?
A: It needs extra time to distribute and consolidate the large arrays and to transfer them to and from the workers.
Factors that reduce the speedup of parfor and spmd
Example 3: Contrast Enhancement (by John Burkardt @ FSU)

contrast_enhance.m

function y = contrast_enhance(x)
    x = double(x);
    n = size(x, 1);
    x_average = sum(sum(x(:,:))) / n / n;
    s = 3.0;  % the contrast factor s should be greater than 1
    y = (1.0 - s) * x_average + s * x((n+1)/2, (n+1)/2);
    return
end

contrast_serial.m (serial version)

x = imread('surfsup.tif');
y = nlfilter(x, [3,3], @contrast_enhance);
y = uint8(y);
contrast_spmd.m (spmd version)
x = imread('surfsup.tif');
xd = distributed(x);
spmd
    xl = getLocalPart(xd);
    xl = nlfilter(xl, [3,3], @contrast_enhance);
end
y = [xl{:}];
12 workers, running time is 2.01 s, Speedup = 7.19x
Problem: when the image is divided by columns among the workers, artificial internal boundaries are created!
Reason: zero padding is applied at the edges of each local block when the sliding-neighborhood operation runs.
Solution: set up communication between the workers to exchange boundary columns.
contrast_MPI.m
x = imread('surfsup.tif');
xd = distributed(x);
spmd
    xl = getLocalPart(xd);
    % find the previous & next workers by labindex
    if (labindex ~= 1)
        previous = labindex - 1;
    else
        previous = numlabs;
    end
    if (labindex ~= numlabs)
        next = labindex + 1;
    else
        next = 1;
    end
    % attach ghost boundary columns
    column = labSendReceive(previous, next, xl(:,1));
    if (labindex < numlabs)
        xl = [xl, column];
    end
    column = labSendReceive(next, previous, xl(:,end));
    if (1 < labindex)
        xl = [column, xl];
    end
    xl = nlfilter(xl, [3,3], @contrast_enhance);
    % remove ghost boundary columns
    if (1 < labindex)
        xl = xl(:, 2:end);
    end
    if (labindex < numlabs)
        xl = xl(:, 1:end-1);
    end
    xl = uint8(xl);
end
y = [xl{:}];
12 workers, running time is 2.01 s, Speedup = 7.23x
* pmode allows the interactive parallel execution of MATLAB® commands. pmode achieves this by defining and submitting a communicating job, and opening a Parallel Command Window connected to the workers running the job. The workers then receive commands entered in the Parallel Command Window, process them, and send the command output back to the Parallel Command Window. Variables can be transferred between the MATLAB client and the workers.
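For example, a minimal interactive session (the exact startup output varies by MATLAB release):

>> pmode start local 4   % open a Parallel Command Window with 4 workers
P>> labindex             % each worker prints its own index, 1 through 4
P>> x = labindex * 10;   % the same command runs on every worker, on its own data
P>> pmode exit           % close the Parallel Command Window and the workers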
Q: How many workers can I use for a MATLAB parallel pool?
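As a practical check, the 'local' profile is limited by default to the number of physical cores MATLAB detects on the machine; you can query the limit yourself with standard Parallel Computing Toolbox calls:

c = parcluster('local');  % the 'local' cluster profile
c.NumWorkers              % maximum number of workers this profile allows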
Enable GPU computing of MATLAB on the BioHPC cluster:
Step 1: reserve a GPU node via remoteGPU or webGPU.
Step 2: inside a terminal, type export CUDA_VISIBLE_DEVICES="0" to enable hardware rendering.
You can launch MATLAB directly if you have a GPU card in your workstation.
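Once MATLAB is running, you can confirm that the GPU is visible using the standard Parallel Computing Toolbox queries:

gpuDeviceCount   % number of CUDA devices MATLAB can see
d = gpuDevice    % select and display the default GPU device
% d.Name and d.TotalMemory describe the selected card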
Example 2: Linear Solver (Ax = b)
linearSolverGPU.m (GPU version)
n = 10000;
M = rand(n);
X = ones(n, 1);
% copy data from RAM to GPU (~0.3 s)
Mgpu = gpuArray(M);
Xgpu = gpuArray(X);
Agpu = Mgpu + Mgpu';
bgpu = Agpu * Xgpu;
ugpu = Agpu \ bgpu;
% copy data from GPU to RAM (~0.0002 s)
u = gather(ugpu);

Run from the command line:

>> tic; linearSolverGPU; t3 = toc
t3 = 3.3101
Speedup = t1 / t3 = 2.54x
Setup MATLAB environment
Home -> ENVIRONMENT -> Parallel -> Manage Cluster Profiles
Add -> Import Cluster Profiles from file -> select the profile file matching your MATLAB version:
/project/apps/MATLAB/profile/nucleus_r2016a.settings
/project/apps/MATLAB/profile/nucleus_r2015b.settings
/project/apps/MATLAB/profile/nucleus_r2015a.settings
/project/apps/MATLAB/profile/nucleus_r2014b.settings
/project/apps/MATLAB/profile/nucleus_r2014a.settings
/project/apps/MATLAB/profile/nucleus_r2013b.settings
/project/apps/MATLAB/profile/nucleus_r2013a.settings
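Alternatively, assuming the same profile files, the import can be scripted from the MATLAB command line with the standard parallel.importProfile function:

profileName = parallel.importProfile('/project/apps/MATLAB/profile/nucleus_r2016a.settings');
parallel.defaultClusterProfile(profileName);  % optionally make it the default profile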
Test MATLAB environment (optional)
After the initial setup, you should have two cluster profiles ready for MATLAB parallel computing:
"local"
– for running jobs on your workstation, thin client, or any single compute node of the BioHPC cluster (uses the Parallel Computing Toolbox)
"nucleus_r<version No.>"
– for running jobs on the BioHPC cluster across multiple nodes (uses MDCS, the MATLAB Distributed Computing Server)
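A minimal sanity test of the "local" profile (a sketch; the pool size of 4 is chosen arbitrarily):

parpool('local', 4);   % open 4 workers on the local machine
spmd
    fprintf('worker %d of %d is alive\n', labindex, numlabs);
end
delete(gcp);           % close the pool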
Setup Slurm Environment from Matlab Command line
ClusterInfo.setQueueName(‘128GB’); % use 128 GB partition ClusterInfo.setNNode(2); % request 2 nodes ClusterInfo.setWallTime(‘1:00:00’); % setup time limit to 1 hour ClusterInfo.setEmailAddress(‘yi.du@utsouthwestern.edu’); % email notification
Check Slurm Environment from MATLAB Command Line

ClusterInfo.getQueueName();
ClusterInfo.getNNode();
ClusterInfo.getWallTime();
ClusterInfo.getEmailAddress();
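The job-control commands below operate on a job object returned by batch; for context, a minimal hypothetical submission (the full Example 1 submission script appears later in this section):

job = batch('linearSolver', 'Pool', 15, 'profile', 'nucleus_r2015a');  % hypothetical script and pool size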
wait(job, 'finished');        % block the session until the job has finished
get(job, 'State');            % occasionally check the state of the job
results = fetchOutputs(job);  % load job results
results = load(job);          % load job results; only valid when the job was submitted as a script
delete(job);                  % delete job-related data
You can also open the job monitor from Home -> Parallel -> Monitor Jobs.
Example 1: Estimating an Integration
quad_submission.m
% set up the Slurm environment
ClusterInfo.setQueueName('super');
ClusterInfo.setNNode(2);
ClusterInfo.setWallTime('00:30:00');
% submit the job to the BioHPC cluster
job = batch(@quad_fun, 1, {120000000, 0.13, 1.53}, 'Pool', 63, 'profile', 'nucleus_r2015a');
% wait for completion
wait(job, 'finished');
% load the results after job completion
Results = fetchOutputs(job);
- Check the Slurm settings from the MATLAB command window.
- Check the job status from a terminal (squeue -u <username>).
- Load the results.
16 workers across 2 nodes, running time is 24 s.
Recall that the serial job needs only 5.7881 seconds; MDCS needs more time for Example 1 because it is not designed for small jobs.
Example 4: Vessel Extraction

parfor i = 1 : numSubImages
    I = im2double(Bwall(:,:,i));  % load the i-th binary image
    allS{i} = skeleton(I);        % extract vessels using the fast marching method
end
Performance evaluation (running time vs. number of workers/nodes):
The job running on 4 nodes (96 workers) requires the minimum computational time; Speedup = 28.3x at 4 nodes.
Q: Why is linear speedup impossible to obtain?
A: The time needed for each single image varies, so the workload is not evenly balanced across the workers.