\documentclass[twoside]{article}
\pagestyle{myheadings}
\markboth{\hfil Implementing parallel elliptic solver \hfil}%
{\hfil Marcin Paprzycki, Svetozara Petrova, \& Julian Sanchez \hfil}
\begin{document}
\setcounter{page}{75}
\title{\vspace{-1in}\parbox{\linewidth}{\footnotesize\noindent
{\sc 15th Annual Conference of Applied Mathematics, Univ. of Central Oklahoma}, \newline
Electronic Journal of Differential Equations, Conference~02, 1999, pp. 75--85. \newline
ISSN: 1072-6691. URL: http://ejde.math.swt.edu or http://ejde.math.unt.edu \newline
ftp ejde.math.swt.edu (login: ftp)}
\vspace{\bigskipamount} \\
Implementing parallel elliptic solver \\
on a Beowulf cluster
\thanks{ {\em 1991 Mathematics Subject Classifications:} 65F10, 65N20, 65N30.
\hfil\break\indent
{\em Key words and phrases:} elliptic solvers, separation of variables, parallel computing,
\hfil\break\indent
problems with sparse right-hand side.
\hfil\break\indent
\copyright 1999 Southwest Texas State University and University of North Texas.
\hfil\break\indent
Published November 29, 1999.
\hfil\break\indent
The second author was partially supported by the Alexander von Humboldt Foundation and
\break\indent
the Bulgarian Ministry for Education, Science and Technology under Grant MM--98\#801.} }
\date{}
\author{Marcin Paprzycki, Svetozara Petrova, \& Julian Sanchez}
\maketitle

\begin{abstract}
In a recent paper \cite{zara} a parallel direct solver for the linear systems arising from elliptic partial differential equations was proposed. The aim of this note is to present an initial evaluation of the performance characteristics of this algorithm on a Beowulf-type cluster. In this context the performance of PVM- and MPI-based implementations is compared.
\end{abstract}

\newcommand{\x}{{\underline {x}}}
\newcommand{\y}{{\underline {y}}}
\newcommand{\f}{{\underline {f}}}
\newcommand{\q}{{\underline {q}}}
\newcommand{\et}{{\underline {\eta }}}
\newcommand{\bet}{{\underline {\beta }}}
\newcommand{\0}{{\underline {0}}}
\renewcommand{\theequation}{\thesection.\arabic{equation}}

\section{Introduction}\label {sec1}
Recently a new parallel algorithm for the solution of separable second-order elliptic PDEs on rectangular domains has been presented by Petrova \cite{zara}. The method is based on the sequential algorithm proposed by Vassilevski \cite{V1,V2}. The algorithm consists of odd-even block elimination combined with discrete separation of variables. It was established that the proposed solver has good numerical properties for both 2D and 3D problems \cite{zara,V1,V2,VP}.

The parallel algorithm by Petrova was implemented using PVM to facilitate parallelism, and the initial performance evaluation was reported in \cite{HMPzara} (for brief descriptions of the PVM and MPI programming environments, see below). The performance study ran into technical problems. For obvious reasons, one should use parallel computers to solve large problems. On the computers we had access to (Silicon Graphics machines at NCSA in Urbana), this meant that using the batch queues was required (there is an imposed limit on the size and time allowed for interactive jobs). We have found that, for some reason, the PVM environment, when invoked from the NQS (the NCSA batch job submission system), was relatively unstable (approximately two of every three jobs hung when the PVM daemons died). At the same time, while running MPI-based jobs in the same environment we have not encountered such problems (see also \cite{LiMaPa}).
We have therefore re-implemented the algorithm using the MPI environment to facilitate parallelism and experimented with it on a Beowulf cluster. Since we had two versions of the code, we were able to run both of them on problems of the same size and compare their performance. The aim of this note is to report on the results of our experiments.

Parallel Virtual Machine (PVM) is a programming environment (developed at Oak Ridge National Laboratory) which allows a heterogeneous collection of workstations and/or supercomputers to function as a single high-performance parallel machine. PVM was designed to link computing resources and provide users with a parallel platform for running their computing applications, irrespective of the number of different computers they use and where these computers are located. It was created primarily as an interactive-type tool, where a console is started on one of the machines and the interactions with the other computers are handled from this console. Over the last two years its role has been slowly decreasing. In the meantime, the popularity of the Message Passing Interface (MPI) has been increasing. MPI is a message-passing library facilitating parallel programming (primarily for codes written in C and Fortran). It was designed at Argonne National Laboratory and released in 1994. In contrast to PVM, MPI is just a specification of a library, without an attempt to build an interactive parallel environment. Both PVM and MPI have become standards and are supported by most vendors of parallel computers.

We proceed as follows. In Section~2 the method of discrete separation of variables is presented. Section~3 contains a brief summary of the Fast Algorithm for Separation of Variables (FASV) (for more details on both phases of the algorithm, see \cite {zara,V1}). In Section~4 we discuss a number of issues related to the parallel implementation and execution. Finally, Section~5 presents the results of experiments performed on a Beowulf cluster. We conclude with a description of future research.

\section{Discrete separation of variables} \label {sec2}
\setcounter{equation}{0}
Consider a separable second-order elliptic equation of the form
$$ -\sum _{s=1}^d \frac {\partial} {\partial x_s} a_s(x_s) \frac {\partial u} {\partial x_s} = f(x), \ x \in \Omega ,~~ d = 2~ \mbox { or} ~3 $$
$$ u_{| \partial \Omega} = 0. $$
For the purpose of this paper we assume that $\Omega $ is a rectangle $(d=2)$, but the method described below can be generalized to 3D problems. Using, for example, a finite difference method to discretize the equation, we obtain the linear system of equations
\begin{equation} \label {11}
A \x = \f,
\end{equation}
where
$$A = I_m \otimes T + B \otimes I_n.$$
\noindent Here $I_n$ is the $n \times n$ identity matrix and $\otimes$ is the {\it Kronecker product}. The matrices $T=(t_{ij})_{i,j=1}^n $ and $B=(b_{kl})_{k,l=1}^m $ are tridiagonal s.p.d.\ matrices arising from the finite difference approximation of the one-dimensional operators $-\frac {\partial} {\partial x_s} a_s(x_s) \frac {\partial } {\partial x_s}(.),$ for $s=1,2,$ respectively. The vector $\x$ and the right-hand side $\f$ of (\ref {11}) are composed using lexicographic ordering along horizontal lines. To describe the method of separation of variables we will use the vectors $\x'$ and $\f'$, reordered using vertical lexicographic ordering.
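To make the Kronecker structure explicit, the following minimal NumPy sketch assembles $A = I_m \otimes T + B \otimes I_n$ for the constant-coefficient model case $a_s \equiv 1$ (the standard three-point stencil); the helper name \texttt{laplace\_1d} and the mesh parameters are illustrative only and are not part of the code discussed later.
\begin{verbatim}
import numpy as np

def laplace_1d(n, h):
    """Three-point stencil for -(d/dx)(a du/dx) with a = 1 (model case)."""
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

n, m = 7, 7                         # interior mesh lines in each direction
T = laplace_1d(n, 1.0 / (n + 1))    # acts within each horizontal line
B = laplace_1d(m, 1.0 / (m + 1))    # couples the horizontal lines
A = np.kron(np.eye(m), T) + np.kron(B, np.eye(n))  # A = I_m (x) T + B (x) I_n
\end{verbatim}
If the mesh values are stored in an $m \times n$ array whose $j$th row is the block $\x_j$ (horizontal ordering), then flattening this array row by row produces exactly the vector $\x$ acted upon by the matrix $A$ above.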
Consider the following eigenvalue problem
\begin{equation} \label {12}
B \q_k = \lambda _k \q_k,
\end{equation}
where $\left \{\lambda _k,\,\q_k \right \}_{k=1}^m$ are the eigenpairs of $B_ {m \times m}$. The matrix $B$ is assumed s.p.d.\ and hence the eigenvalues satisfy $\lambda _k>0$ and the eigenvectors satisfy $\q_k^T\,\q_r = \delta _{kr}$ ($\delta _{kr}$ is the Kronecker symbol) for $k,r=1, 2, \ldots ,m$. Using the basis $\{\q_k \}_{k=1}^m$ the vectors $ \x_i'$ and $ \f_i'$ can be expanded as follows:
\begin{equation} \label {13}
\x_i' = \sum _{k=1}^m \eta _{ki}\q_k, \qquad \f_i' = \sum _{k=1}^m \beta _{ki}\q_k, ~~~ i=1,2,\ldots ,n,
\end{equation}
where the Fourier coefficients of $\f_i'$ are computed by
\begin{equation} \label{15}
\beta_{ki} = \q_k^T \f_i'.
\end{equation}
Consider the column vectors $\et_k = \left (\eta_{ki} \right )_{i=1}^n$ and $\bet_k = \left (\beta_{ki}\right )_{i=1}^n, \ k=1,2, \ldots, m.$ Substituting expressions (\ref {13}) in (\ref {11}) results in the following system of equations for the discrete Fourier coefficients $\et_k$ of $\x_i', \ i=1,2, \ldots ,n$:
\begin{equation} \label{16}
(\lambda_kI+T) \et_k = \bet_k, ~~ k=1,2, \ldots ,m.
\end{equation}
Equations (\ref {16}) represent $m$ systems of linear equations with $n \times n$ matrices. They can be solved independently of each other. The algorithm for the separation of variables (SV) thus has the following form:
\bigskip

\noindent \underline {\large {Algorithm SV}}
\medskip

\noindent (1) {\bf determine} the eigenpairs $\left \{ \lambda_k, \,\q_k \right \}_{k=1}^m$ of the tridiagonal matrix $B$ from (\ref {12});

\noindent (2) {\bf compute} the Fourier coefficients $\beta_{ki}$ of $\f_i'$ using (\ref {15});

\noindent (3) {\bf solve} $m$ $n \times n$ tridiagonal systems of equations of the form (\ref {16}) to determine $\{\eta_{ki}\}$ -- the Fourier coefficients of $\x_i', \ i=1,2, \ldots, n$;

\noindent (4) {\bf recover} the solution of our original system (\ref {11}) on the basis of the Fourier coefficients $\{\eta_{ki}\}$ of $\x_i'$. There are two possibilities:
\begin{eqnarray} \label {17}
{\mbox {vertical recovery: }} ~~&\x_i' = &\sum _{k=1}^m \eta_{ki}\q_k, ~~ i=1,2,\ldots ,n \nonumber \\
\vspace{6pt}
{\mbox {horizontal recovery: }}~~&\x_j = &\sum _{k=1}^m q_{jk}\et_k, ~~ j=1,2,\ldots ,m.
\end{eqnarray}
\medskip

Let us now assume that the system of the form (\ref {11}) has a sparse right-hand side (SRHS) (for more details on the origins of such problems see, for instance, Banegas \cite{B}, Proskurowski \cite{P} and Kuznetsov \cite{K}). More precisely, assume that the right-hand side $\f$ has only $d \ll m$ nonzero block components
$$ \f = \left [ \0, \ldots, \f_{j_1}, \0, \ldots, \f_{j_d}, \0, \ldots, \0\right]^T, ~~ {\mbox {where }} \f_{j_s} \in {\cal R} ^n, \ s=1,2 \ldots , d.$$
Then, for the reordered right-hand side, each vector $\f_i' \ (i=1,2,\ldots ,n)$ has only $d$ nonzero scalar components $f_{ij_s}, \ s = 1,2 \ldots ,d$, i.e.,
$$ \f_i' = \left [ 0, \ldots, f_{ij_1}, 0, \ldots, f_{ij_d}, 0, \ldots, 0 \right]^T, ~~ i=1,2,\ldots ,n.$$
Assume that only $r \ll m$ block components of the solution are needed. Denote by $\x_{j_1'},\x_{j_2'}, \ldots ,\x_{j_r'}$ the sought vectors.
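To make the four steps concrete, here is a minimal NumPy sketch of Algorithm SV; the right-hand side is stored as an $m \times n$ array $F$ whose $j$th row is the block $\f_j$ and whose $i$th column is $\f_i'$, the dense solve in step (3) stands in for a tridiagonal solver, and the function name is ours, not part of the original code.
\begin{verbatim}
import numpy as np

def algorithm_sv(T, B, F):
    """Solve (I_m (x) T + B (x) I_n) x = f with F[j, :] = f_j."""
    m, n = F.shape
    lam, Q = np.linalg.eigh(B)      # (1) eigenpairs of B:  B q_k = lam_k q_k
    beta = Q.T @ F                  # (2) beta[k, i] = q_k^T f'_i
    eta = np.empty_like(beta)
    for k in range(m):              # (3) solve (lam_k I + T) eta_k = beta_k
        eta[k] = np.linalg.solve(T + lam[k] * np.eye(n), beta[k])
    return Q @ eta                  # (4) horizontal recovery: x_j = sum_k q_jk eta_k
\end{verbatim}
In the sparse right-hand-side setting considered next, only the $d$ nonzero rows of $F$ (and the corresponding rows of $Q$) contribute to step (2), and only the $r$ requested rows of $Q$ are needed in step (4); this is precisely the saving exploited by Algorithm SRHS.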
To find the solution of such a problem with a sparse right-hand side we apply Algorithm~SV as follows:
\bigskip

\noindent \underline {\large {Algorithm SRHS}}
\medskip

\noindent (1) {\bf compute} the Fourier coefficients $\beta_{ki}$ of $\f_i'$ from (\ref {15}),
$$\beta_{ki}= \q_k^T \f_i' = \sum _{s=1}^d q_{kj_s}f_{ij_s} , ~~ k=1,2, \ldots,m, ~~ i=1,2,\ldots,n;$$

\noindent (2) {\bf solve} systems of the form (\ref{16});

\noindent (3) {\bf recover} the solution per lines using (\ref {17}). We need only $ \x_j = \sum _{k=1}^m q_{jk} \et_k$, for $j=j_1', j_2', \ldots, j_r'$.

\section {Fast algorithm for separation of variables} \label {sec3}
\setcounter{equation}{0}
Consider now the algorithm FASV proposed by Vassilevski \cite {V1,V2} for solving problems of type (\ref {11}). For simplicity we assume that $m=2^l-1$. The algorithm consists of two steps: a forward and a backward recurrence. For the forward step we need the matrix $A$ in the following block form:
\begin{equation} \label{32}
A = \left [ \begin {array} {ccccc}
A^{(k,1)} & A_{12} & & & 0 \\
A_{21} & T+b_{2^k2^k}I_n & A_{23} & & \\
& A_{32} & A^{(k,2)} & A_{34} & \\
& & \ddots & \ddots & \ddots \\
0 & & & A_{2^{l-k+1}-1,2^{l-k+1}-2} & A^{(k,2^{l-k})}
\end{array} \right ] ,
\end{equation}
where
$$A^{(k,s)} = I_{2^k-1} \otimes T+B_k^{(s)} \otimes I_n, ~~s=1,2,\ldots ,2^{l-k},$$
i.e., each matrix $ A^{(k,s)},\ 1\le k\le l,\ 1\le s \le 2^{l-k}$ has $2^k-1$ blocks of order $n.$ Here $B_k^{(s)}$ of order $2^k-1$ is a principal submatrix of $B$ and hence it is also a tridiagonal s.p.d.\ matrix, of the form $B_k^{(s)}=(b_{ij}),\ i,j=s_k+1,\ldots ,s_k+2^k-1,$ where $s_k=(s-1)2^k.$
\bigskip

\noindent (1) {\bf Forward step} of FASV
\medskip

For $k=1,2, \ldots , l-1$ solve the problem:
\begin{equation} \label{34}
A^{(k,s)} \x^{(k,s)} = \f^{(k,s)} , ~~ s=1,2, \ldots , 2^{l-k},
\end{equation}
where
$$\x^{(k,s)} = \left [ \begin {array} {c} \x_{s_k+1}^{(k)} \\ \vspace {4pt} \x_{s_k+2}^{(k)} \\ \vdots \\ \x_{s_k+2^k-1}^{(k)} \end {array} \right ] $$
and $\f^{(k,s)}$ will be defined below. After solving these problems and setting $\x_{s.2^k}^{(k)}=\0$ for $s=1,2, \ldots , 2^{l-k}-1$, we denote by $\x^{(k)}$ and $\f^{(k)}$ the following vectors
\begin{equation} \label{*}
\x^{(k)} = \left [ \begin {array} {c} \x^{(k,1)} \\ \0 \\ \vspace {7pt} \x^{(k,2)} \\ \0 \\ \vdots \\ \x^{(k,2^{l-k})} \end {array} \right ]
\begin {array} {l} \} 2^k-1 {\mbox { blocks }} \\ \vspace {4pt} \} 1 {\mbox { block }} \\ \vspace {4pt} \} 2^k-1 {\mbox { blocks }} \\ \vspace {4pt} \} 1 {\mbox { block }} \\ \vspace {4pt} \\ \} 2^k-1 {\mbox { blocks }} \end {array}
~~~{\mbox { and }}
\f^{(k)} = \left [ \begin {array} {c} \f^{(k,1)} \\ \vspace {5pt} \f_{2^k} \\ \vspace {5pt} \f^{(k,2)} \\ \vspace {5pt} \vdots \\ \f^{(k,2^{l-k})} \end {array} \right ] .
\end{equation}
Let the residual vector be the right-hand side for the next, $(k+1)$st, step:
\begin{equation} \label {35}
\f^{(k+1)} = \f^{(k)}-A\x^{(k)}= \left [ \begin {array} {c} \f^{(k+1,1)} \\ \f_{2^{k+1}}\\ \vspace {4pt} \f^{(k+1,2)} \\ \vdots \\ \f^{(k+1,2^{l-k-1})} \end {array} \right ] ,
\end{equation}
where
$$\f^{(k+1,s')} = \left [ \begin {array} {c} \0 \vspace {3pt} \\ \vspace {6pt} \f_{s'.2^{k+1}}^{(k+1)}\\ \vspace {6pt} \0 \end {array} \right ]
\begin {array} {l} \} 2^k-1 {\mbox { blocks }} \\ \vspace {4pt} \} 1 {\mbox { block }} \\ \vspace {4pt} \} 2^k-1 {\mbox { blocks }} \vspace {4pt} \end {array}, ~~ s'=1,2, \ldots , 2^{l-k-1}.$$
From (\ref {35}), with $s=2s'-1$, we have
\begin{eqnarray} \label {36}
\f_{s'.2^{k+1}}^{(k+1)} & =& \f_{s.2^k}^{(k)}-A_{2s,2s-1}\,\x^{(k,s)}-A_{2s,2s+1}\,\x^{(k,s+1)} \nonumber \\
& =&\f_{s.2^k}^{(k)}-b_{s.2^k,s.2^k-1}\,\x_{2^k-1}^{(k,s)}- b_{s.2^k,s.2^k+1}\,\x_1^{(k,s+1)}.
\end{eqnarray}
The new right-hand side $\f^{(k+1,s')}$ has only one nonzero block component and hence, by induction, the right-hand sides $\f^{(k,s)}$ have the following sparsity pattern
$$\f^{(k,s)} = \left [ \begin {array} {c} \0 \vspace {3pt}\\ \vspace {3pt} \ast \vspace {3pt}\\ \vspace {3pt} \0 \end {array} \right ]
\begin {array} {l} \} 2^{k-1}-1 {\mbox { blocks }} \vspace {3pt} \\ \vspace {3pt} \} 1 {\mbox { block }} \vspace {3pt} \\ \vspace {3pt} \} 2^{k-1}-1 {\mbox { blocks }} \vspace {3pt} \end {array}, ~~ s=1,2, \ldots , 2^{l-k} .$$
Therefore, the problems (\ref {34}) have sparse right-hand sides. The matrices $A^{(k,s)}$, being submatrices of $A$, allow separation of variables and hence Algorithm SRHS from Section \ref {sec2} can be applied with $d=1$ (the number of nonzero block components of the right-hand side) and $r=3$ (the number of sought block components of the solution: $\x_1^{(k,s)}, \ \x_{2^{k-1}}^{(k,s)}$ and $\x_{2^k-1}^{(k,s)}).$
\bigskip

\noindent (2) {\bf Backward step} of FASV
\medskip

For $k=l,l-1,\ldots ,1$, our purpose is to determine $\x_{(2s-1)2^{k-1}}$ for $s=1,2, \ldots ,2^{l-k}$. First, when $k=l,$ we solve
\begin{equation} \label {38}
A\x^{(l)} = \f^{(l)},
\end{equation}
where $A^{(l,s)}\equiv A, \ \x^{(l)} = \x^{(l,1)} $ and $\f^{(l)} = \f^{(l,1)}$. The right-hand side $\f^{(l)}$ is formed at the last step of the forward recurrence and, by (\ref {36}), it is sparse. The problem (\ref {38}) is solved incompletely, finding only one block component $ \x_{2^{l-1}} ^{(l)}.$ We also have $\x_{2^{l-1}}=\x_{2^{l-1}} ^{(l)}$, which corresponds to the middle block component of the solution $\x$ of (\ref {11}). The remaining block components of $\x$ are recovered by induction as follows. Starting with $k=l$ we have $\x_{2^{l-1}}=\x_{2^{l-1}} ^{(l)}$. Assume that for some given $k,\ 1 \le k \le l-1,$ we have found the vectors $\x_{s.2^k}$ for $s=1,2, \ldots , 2^{l-k}-1$ in the previous steps $k+1, \ldots , l.$ Then at the $k$th step we can find $\x_{s.2^{k-1}}$ for $s=1,2, \ldots , 2^{l-k+1}-1$. More precisely, by construction we have $\f^{(k'+1)} = \f^{(k')} - A\x^{(k')}$ and, after summation of both sides of these equalities for $k'=k, k+1, \ldots, l,$ we get
\begin{equation} \label {39}
\f^{(k)} = A \left ( \sum _{k'=k}^l \x^{(k')} \right ).
\end{equation}
Let
\begin{equation} \label {310}
\y = \sum _{k'=k}^l \x^{(k')}
\end{equation}
and, using $A$ from (\ref {32}), we have the following equivalent form of (\ref {39}):
$$ \displaylines{ \left [ \begin {array} {ccccc}
A^{(k,1)} & A_{12} & & & 0 \\
A_{21} & T+b_{2^k2^k}I_n & A_{23} & & \\
& A_{32} & A^{(k,2)} & A_{34} & \\
& & \ddots & \ddots & \ddots \\
0 & & & A_{2^{l-k+1}-1,2^{l-k+1}-2} & A^{(k,2^{l-k})}
\end{array} \right ]
\left [ \begin {array} {c} \y^{(k,1)}\\ \y_{2^k} \\ \vspace {3pt} \y^{(k,2)}\\ \vdots \\ \vspace {3pt} \y^{(k,2^{l-k})} \end{array} \right ] \cr
=\left [ \begin{array} {c} \f^{(k,1)}\\ \f_{2^k} \\ \vspace {3pt} \f^{(k,2)}\\ \vdots \\ \vspace {3pt} \f^{(k,2^{l-k})} \end{array} \right ] }$$
where each block $\y^{(k,s)}, \ s=1,2, \ldots , 2^{l-k}$, has $2^k-1$ blocks of order $n$. From this system we obtain the following equation for $\y^{(k,s)}$:
\begin{equation} \label {311}
A_{2s-1,2s-2} \y_{(s-1).2^k} + A^{(k,s)} \y^{(k,s)} + A_{2s-1,2s} \y_{s.2^k} = \f^{(k,s)}.
\end{equation}
Hence, with $s_k=(s-1)2^k$,
\begin{equation} \label {312}
A^{(k,s)} \y^{(k,s)} = \f^{(k,s)}-A_{2s-1,2s-2} \y_{s_k} - A_{2s-1,2s} \y_{s.2^k}.
\end{equation}
Using (\ref {310}) we have $\y_{s_k} = \x_{s_k}$ and $\y_{s.2^k} = \x_{s.2^k}.$ Recall that, from the induction described above, the vectors $\x_{s_k} = \x_{(s-1).2^k}$ and $\x_{s.2^k}$ are already found. Thus, from (\ref {312}) we get the following system
\begin{equation} \label {314}
A^{(k,s)} \y^{(k,s)} = \left [ \begin{array} {c} -b_{s_k+1,s_k} \x_{s_k} \\ \0 \\ \f_{2^{k-1}}^{(k,s)} \\ \0 \\ -b_{s.2^k-1,s.2^k} \x_{s.2^k} \end {array} \right ]
\begin{array} {l} \} {\mbox {1st block}} \\ \\ \} {\mbox {$2^{k-1}$th block}} \\ \\ \} {\mbox {$(2^k-1)$st block}} \end {array} .
\end{equation}
The nonzero block components of the right-hand side of (\ref {314}) can be found using the solution computed at the $(k+1)$st step. It is a problem with a sparse right-hand side, and only one $(r=1)$ block component of the solution, namely $\y_{2^{k-1}}^{(k,s)}$, is needed. By (\ref {310}) one finds
$$\y_{2^{k-1}}^{(k,s)} = \y_{(2s-1).2^{k-1}} = \sum _{k'=k}^l \x_{(2s-1).2^{k-1}}^{(k')} = \x_{(2s-1).2^{k-1}}.$$
Therefore, at the $k$th step of the backward recurrence $(k=l,\, l-1, \ldots, 1)$ we can determine $\x_{(2s-1)2^{k-1}}, \ s=1,2, \ldots, 2^{l-k}.$

\section{Parallel implementation}
When a 2D problem is solved, the rectangular domain is decomposed into horizontal strips and a processor is assigned to each strip. Each forward step of FASV involves a number of solves with the matrices $A^{(k,s)}$. These are problems with sparse right-hand sides, and Algorithm SRHS is applied to find the required components of the solution. In general, the forward sweep of FASV requires $O(\log m)$ steps. The backward sweep of FASV reverses the forward sweep and results in the solution of the original system (\ref {11}). In \cite{zara} it was shown that the total arithmetical complexity of the algorithm is $(28n^2 - 9n^2/(l-1))(l-3-\log P +2P)/P$ and the speed-up is $S= (l-1)P/(l-3-\log P +2P)$, where $P$ denotes the number of processors. Therefore, in the two limiting cases: for large $l$ the optimal speed-up $P$ is obtained, while for fixed $l$ and large $P$ the speed-up is limited by approximately $l/2$.

As mentioned above, we have used two versions of the code. The only difference between them was that in one the PVM environment was used to facilitate interprocessor communication, while the other was based on the MPI library. While re-implementing the code we have also changed the way in which execution time was measured. In the original code an average of the processor times was reported. We have decided that this may be slightly misleading, as it hides possible workload imbalances. In the new version of the code we measured the time spent by each individual processor. Since the workload differs from processor to processor, in each run we recorded the longest time (all remaining processors have to wait for the slowest one to complete its job before the problem is solved). After performing several runs, we kept and reported the shortest of these (longest) times. We used the function \texttt{mclock()} to measure time in the PVM version, while in the MPI version we used the \texttt{MPI\_Wtime()} function.
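The timing scheme just described can be sketched as follows; this is a minimal example using the \texttt{mpi4py} bindings rather than the Fortran interface actually used in our code, and \texttt{solve\_local\_strip} is a hypothetical placeholder for the work done by one processor on its strip.
\begin{verbatim}
from mpi4py import MPI

def solve_local_strip():
    """Placeholder for the part of the solver executed by this processor."""
    pass

comm = MPI.COMM_WORLD
t0 = MPI.Wtime()
solve_local_strip()
elapsed = MPI.Wtime() - t0

# Each processor measures its own wall-clock time; the run is finished only
# when the slowest processor is done, so the maximum over all ranks is the
# figure recorded for this run.
t_run = comm.allreduce(elapsed, op=MPI.MAX)
if comm.Get_rank() == 0:
    print("time of the slowest processor: %.2f s" % t_run)
\end{verbatim}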
Our experiments have been performed on a Beowulf cluster. The Beowulf cluster architecture was designed in 1994 by Thomas Sterling and Donald Becker at the Center of Excellence in Space Data and Information Sciences (CESDIS). The purpose of the design was to combine COTS (commodity off-the-shelf) systems in a network and emulate a costly high-performance supercomputer. The Beowulf cluster at the University of Southern Mississippi consists of 16 PCs connected with a Fast Ethernet switch. Each PC is a 233 MHz Pentium II with 256 Mbytes of RAM. The proposed algorithm has been implemented in Fortran~77. We used the Portland Group compiler and invoked maximum compiler optimization (\texttt{-O2 -tp p6 -Mvect -Munroll=n:16}). Most runs were performed on an otherwise empty machine.

\section{Experimental results} \label {sec4}
To study the performance of the codes we have used a standard model problem. We have solved the Poisson equation on the unit square with Dirichlet boundary conditions. The mesh size was fixed at $h = 1/(m+1)$, where $m = 2^l$. In Table~1 we present the times obtained for both the PVM and the MPI codes for $l = 9, 10, 11$ and for $P = 1, 2, \dots, 11$ processors (unfortunately, due to technical difficulties, 5 nodes were down for hardware maintenance). The problem of size $l = 11$ was the largest that we were able to run on a single-processor system (it required more than 90 Mbytes of memory).
\begin{table}
\begin{center}
\begin {tabular}{| c || c | c || c | c || c | c ||}
\hline
P & \multicolumn{2}{c||}{$l=9$} & \multicolumn{2}{c||}{$l=10$} & \multicolumn{2}{c||}{$l=11$} \\ \hline
 & MPI & PVM & MPI & PVM & MPI & PVM \\ \hline
1 & 1.16 & 1.30 & 6.32 & 6.71 &32.19 &34.42 \\
2 & 0.69 & 0.75 & 3.50 & 3.91 &17.50 &19.20 \\
3 & 0.52 & 0.58 & 2.54 & 2.78 &13.39 &13.50 \\
4 & 0.42 & 0.46 & 2.01 & 2.30 & 9.68 &10.92 \\
5 & 0.40 & 0.40 & 1.81 & 1.95 & 8.63 & 9.23 \\
6 & 0.38 & 0.41 & 1.68 & 1.91 & 7.77 & 8.84 \\
7 & 0.36 & 0.33 & 1.53 & 1.72 & 6.94 & 7.88 \\
8 & 0.32 & 0.32 & 1.36 & 1.54 & 6.15 & 7.00 \\
9 & 0.34 & 0.30 & 1.35 & 1.43 & 6.02 & 6.05 \\
10& 0.35 & 0.32 & 1.32 & 1.48 & 5.78 & 5.53 \\
11& 0.34 & 0.30 & 1.28 & 1.38 & 5.48 & 5.77 \\ \hline
\end {tabular}
\end{center}
\caption{Comparison of execution times}
\end{table}
It can be observed that for the larger problems the MPI-based code slightly outperformed the PVM-based implementation. This result is consistent with our expectations. One needs to remember that MPI is a more recent development than PVM and that performance was more of a goal in its creation (in the case of PVM, the main goal was rather the ability to interactively create a heterogeneous parallel computing system). Additionally, the Beowulf implementation of MPI, which uses the LAM communication fabric, supports sending and receiving messages directly from one process to another, bypassing the communication daemons.
LAM is a parallel processing environment and development system for a network of independent computers, created and maintained by the Ohio Supercomputer Center. It features the MPI programming standard, supported by extensive monitoring and debugging tools.

In Table~2 we compare the speed-up values of both implementations for the largest problem size. In addition, we present the theoretical speed-up values calculated using the speed-up formula given above.
\begin{table}
\begin{center}
\begin {tabular}{| c || c | c | c ||}
\hline
P & \multicolumn{3}{c||}{$l=11$} \\ \hline
 & MPI & PVM & THEORY \\ \hline
1 & 1.00 & 1.00 & 1.00 \\
2 & 1.84 & 1.79 & 1.67 \\
3 & 2.40 & 2.55 & 2.14 \\
4 & 3.33 & 3.15 & 2.50 \\
5 & 3.73 & 3.73 & 2.78 \\
6 & 4.14 & 3.89 & 3.00 \\
7 & 4.64 & 4.37 & 3.18 \\
8 & 5.23 & 4.92 & 3.33 \\
9 & 5.35 & 5.69 & 3.46 \\
10& 5.57 & 6.22 & 3.57 \\
11& 5.87 & 5.96 & 3.67 \\ \hline
\end {tabular}
\end {center}
\caption{Speed-up comparison}
\end{table}
Interestingly, both the MPI and the PVM speed-up values are far superior to the theoretical ones. This shows a slight weakness of the theoretical arithmetical complexity analysis, which typically takes into account only the size of the problem (sometimes supplemented by a rudimentary data communication analysis). In a real experimental environment one deals with machines with a hierarchical memory structure (and its management), as well as with networks and network protocols, all of which influence the performance of the program. For instance, as we increase the number of processors for a fixed problem size, the amount of memory used per processor decreases (since the problem is divided into smaller pieces) and it becomes much easier to manage the data movement in the memory hierarchy, making the multiprocessor runs faster than predicted. This can also be viewed from the other side: for very large problems the memory management becomes rather complicated, and it is well known that most of the existing cache management algorithms are far from optimal. This increases the single-processor execution time and thus increases the obtainable speed-up. In this context we can also observe that in many cases the PVM-based code shows slightly higher speed-up values than the MPI-based one. The reason for this behavior is that the speed-up calculation involves the single-processor run time. Inspecting the values in Table~1, we can see that the single-processor PVM version is slower than the MPI version; thus the PVM speed-up ratios have a larger numerator and therefore larger values.

\section{Concluding remarks}\label {sec5}
In this note we have presented experimental results illustrating the parallel performance characteristics of a fast direct elliptic solver on a Beowulf cluster. The algorithm and both its implementations turned out to be relatively efficient. We have also found that, for large problems, memory management is an important factor influencing performance. In the near future we plan, first, to investigate whether the current implementation admits any additional optimization. Second, we plan to complete the performance study across a number of modern parallel architectures. We will also look into extending the proposed approach to parallel three-dimensional solvers.

\begin{thebibliography}{10}
\bibitem {B} A. Banegas, Fast Poisson solvers for problems with sparsity, {\em Math. Comp.}, {\bf 32} (1978), 441--446.
\bibitem{HMPzara} H. Hope, M. Paprzycki and S.
Petrova, Parallel performance of a direct elliptic solver, in: M. Griebel et al. (eds.), {\em Large Scale Scientific Computations of Engineering and Environmental Problems}, Vieweg, Wiesbaden, 1998, 310--318.
\bibitem {K} Y. Kuznetsov, Block relaxation methods in subspaces, their optimization and application, {\em Soviet J. Numer. Anal. and Math. Modelling}, {\bf 4} (1989), 433--452.
\bibitem{LiMaPa} I. Lirkov, S. Margenov and M. Paprzycki, Benchmarking performance of parallel computers using a 2D elliptic solver, {\em Proceedings of the $4^{th}$ International Conference on Numerical Methods and Applications}, Sofia, Bulgaria, August 1998, to appear.
\bibitem {zara} S. Petrova, Parallel implementation of fast elliptic solver, {\em Parallel Computing}, {\bf 23} (1997), 1113--1128.
\bibitem {P} W. Proskurowski, Numerical solution of Helmholtz equation by implicit capacitance matrix method, {\em ACM Trans. Math. Software}, {\bf 5} (1979), 36--49.
\bibitem {V1} P. Vassilevski, Fast algorithm for solving a linear algebraic problem with separation of variables, {\em Comptes Rendus de l'Academie Bulgare des Sciences}, {\bf 37} (1984), No.~3, 305--308.
\bibitem {V2} P. Vassilevski, Fast algorithm for solving discrete Poisson equation in a rectangle, {\em Comptes Rendus de l'Academie Bulgare des Sciences}, {\bf 38} (1985), No.~10, 1311--1314.
\bibitem {VP} P. Vassilevski and S. Petrova, A note on construction of preconditioners in solving 3D elliptic problems by substructuring, {\em Comptes Rendus de l'Academie Bulgare des Sciences}, {\bf 41} (1988), No.~7, 33--36.
\end{thebibliography}

\medskip
\noindent{\sc Marcin Paprzycki} (e-mail: marcin.paprzycki@usm.edu)\\
{\sc Julian Sanchez} (e-mail: julian.sanchez@ieee.org) \\
Department of Computer Science and Statistics \\
University of Southern Mississippi \\
Hattiesburg, MS 39406, USA

\medskip
\noindent{\sc Svetozara Petrova} \\
Central Laboratory of Parallel Processing \\
Bulgarian Academy of Sciences, Acad. G. Bonchev Str.\\
Block 25A, 1113 Sofia, Bulgaria \\
e-mail: zara@cantor.bas.bg
\end{document}