Providing adequate computational resources to handle the ever increasing size of aerospace structural and fluid dynamic computational simulation models is rapidly becoming a pressing need. One very attractive way of satisfying this need is that of developing an innovative real time parallelization "Dynamic Load Balancing" technology that will allow the use of a heterogeneous network of computers operating as an array of CPUs in a parallel processing mode. In order to apply this technology as a design tool and enable fast, inexpensive, and accurate computational simulations, optimized techniques are required for real time dynamic monitoring, controlling, and redistributing the computer resources available for the computations. Alpha STAR, under NASA funding, has devised such techniques for structural computational simulations employing modest parallel processing arrays of up to 128 CPUs. These techniques are amenable to scale-up for use with arrays of over 100,000 CPUs
The advent of parallel computing technologies offers the very attractive possibility of drastically reducing the time required to carry out large complex structural mechanics and fluid dynamics computational simulations. This has been convincingly demonstrated by using message passing paradigm such as Parallel Virtual Machine (PVM), and Message Passing Interface (MPI), to facilitate solving large problems on a variety of modern computer architectures including a cluster of networked computers. The major goal of next generation parallel computing is achieving a balanced synchronization of the analysis demands relative to computer resources recognizing that a multidisciplinary environment requires a synchronized balance of contributing analyses (CA) such as Computational Fluid Dynamics CFD, Computational Structural Dynamics CSD and adequate computing resources.
Alpha STAR Corporation’s (ASC’s) experience with the large and complex CSD applications, has led to development techniques under NASA contracts (NASA/LeRC SBIR and NASA/LARC-CSB) for structural mechanics that decrease the simulation processing time by developing a dynamic load balancing algorithm for large scale Finite Element Model (FEM) simulations. In contrast to static load balancing, dynamic load balancing for large scale computative simulations is formulated as an optimization system in which minimization of CPU time is to be achieved subject to parallelism constraints. ASC’s newly developed parallel processing technologies have the potential to greatly reduce design cycle time to meet the current stringent need to reduce elapsed times to solve CFD analyses (e.g., of a Reynolds- averaged Navier-Stokes level problem) for aircraft configurations with multi-million grid point FEM meshes. The current elapsed times (measured in hours) are excessive for these types of analyses to support designers engaged in multidisciplinary analyses and optimization of areoelastic performance.
A key feature of ASC’s parallelization technique is a dynamic load balancing system with a real time parallelization algorithm, GENOA-MAESTRO, that controls the parallel processing. GENOA-MAESTRO minimizes CPU and wall clock time based on the optimization of computer resources (CPUs, memory, disk space, communication rate, and floating point operations (real, logical, and integer). ASC is confident that MAESTRO’s current capability of controlling 128 CPUs in parallel mode can be readily extended to the ability to control over 100,000 CPUs in a parallel mode.
The massively parallel processing has been implemented for cross engineering disciplines such as CSD, CFD and Computational Electromagnetics (CEM) simulations for both structured grid and unstructured grid arrangements. The structured grid and the unstructured grid-based finite-element/volume codes were originally developed and optimized for vector computer architectures. The implementation of these codes in a distributed memory parallel architecture was accomplished by re-using much of the original vector code. Additional coding was added for handling inter-processor communication and other functions unique to the parallel implementation using PVM/MPI libraries which allows one to execute the code on a number of platforms such as the Intel Paragon, IBM SP2, Cray T3D and T3E, and the nCUBE, as well as on workstation clusters.
Figure 1 shows schematic of the GENOA-MAESTRO system for minimization of CPU and wall clock time. Optimization is achieved by employing 1) a computer interrogation facility usage (Testing facility usage: TFU), 2) a multifactor optimization scheme, and 3) domain decomposition ( partitioning) technologies, and 4) linkage of operations with the external model solvers (e.g., CFD, CSD, singly or combined).
design optimization (MDO) technique. The discrete integer optimization techniques employed by GENOA-MAESTRO minimize CPU time. The parameters involved are the: 1) number of processors, 2) amount of memory, 3) interprocessor communication time , 4) number of domains, 5) task allocation table ( task distribution of processors), and 6) processor sleeping time. GENOA-MAESTRO seeks to minimize the maximum execution time found in the set of available processors by redistributing the work loads. Specific activities of GENOA-MAESTRO are as follows:
Domain decomposition operations are controlled and managed to provide the best combination of available resources (such as machine type, network level, number of processors and available memory) to maximize the support of the operations of the decomposition algorithm.
CPU Testing. This segment measures
the CPU time and elapsed time usage of a selected CPUs for a certain amount
of arithmetic operation. The PVM libraries are used to spawn the process
on the selected processor. The availability of a processor is expressed
by the ratio of wall clock to CPU time. This function can assist the GENOA-MAESTRO
dynamic load balancing system to redistribute the tasks among the processors.
Domain Decomposer. The domain decomposer is responsible for partitioning a finite element model, or CFD model into domains. Presently the domain decomposition methods favored for computational efficiency are: 1) Recursive Inertia Partitioning (RIP); 2) Recursive Spectral Bisection (RSB); 3) Recursive Coordinate Bisection; 4) Recursive Graph Bisection (RGB). GENOA-MAESTRO receives these domains and dynamically distributes them as tasks to assigned processors. In a multidisciplinary application a dynamic load balancer must be able to utilize one or two of these methods e.g., RSB for CFD, and RIP for CSD.
Validation Results of Parallel Analysis. An HSCT model was used to demonstrate large scale computing simulation. The results with and without GENOA-MAESTRO optimization (Figures 2 and 3) show that (especially for simulations with large numbers of domains) optimization with GENOA-MAESTRO reduces CPU time.
The results of simulations using up to 64 domains and 64 CPUs are shown in Figures 4 and 5. For a fixed number of CPUs, an increase in the number of domains leads to smaller super elements and an increase in the speed of parallel computations.
Figure 6 shows that the CPU times in the HSCT parallel processing simulation are better balanced and that run times are reduced with optimization. Run time reductions with GENOA-MAESTRO optimization progressively increase as the number of domains and CPUs are increased.

Parallelization Strategy. For the structured
formulation of the finite-volume code, the computational domain surrounding
the target geometry is composed of 3-dimensional 6-sided volumes of grid
points called zones or
blocks.
Each side or face of a zone either connects to another zone or has a boundary
condition defined on that face (perfect conducting surface, outer boundary,
etc.). The parallel algorithm takes advantage of this multi-zonal gridding
capability in order to divide work among processors. The various zones
are grouped onto processors with each processor providing a solution for
the cells within its own local set of zones. An additional technique has
been gaining acceptance in FEM is available as a multifrontal algorithm
for superelement creation.
Communication Requirements. The solution
procedure does not allow for processors to proceed completely asynchronously.
Solving for cells on zone faces that are connected to other zones requires
information from within the adjacent zone. This information may be available
locally if the adjacent zone resides on the same processor, or message
passing may be required if the adjacent zone resides on another processor.
A feature of GENOA-MAESTRO minimizes the amount of communication (message
passing) among the processors by moving the interconnected zones into selected
processors to the extent possible.

Load Balancing. Static load balancing
is achieved by mapping zones onto processors. Perfect load
balancing requires that each processor have the same number of zones, each
containing the same number of gird points and equal numbers and types of
boundary condition cells. Simple geometries may usually be zoned in such
a manner as to obtain perfect load balancing. For complex geometries perfect
load balancing is much more difficult, but adequate load balancing may
usually be obtained by mapping a close to equal number of grid points onto
each processor. GENOA-MAESTRO receives information from a static load balancing
domain decomposer and re-configures assignment of domains and tasks to
parallel processors to achieve a balanced load with respect to minimizing
processing time.
Data Warehousing. A data warehouse
is a system for assembling and managing data that results from various
processes and sources to support analysis and decision making. It is an
important item for consideration in developing parallel processing systems
and is comprised of several components, namely 1) tools to extract relevant
data from operational data stores, 2) means to clean and transform data
into a usable form for an application, 3) means to load data into the data
warehouse, 4) a system to manage the stored data, 5) a catalog of information
pertinent to the query environment, and 6) a means of creating and distributing
reports involving the warehoused data It is tailored for use as an instrument
for decision making, probing new ways to look at data and as a tool to
achieve collaborative work.
Scalability. For computer software and architec-tures to be called "scaleable", they must efficiently make use of parallelization in two respects. In one respect, called scale-up (or scaled speed-up), a constant amount of computational work is given to each processor while the number of processors being utilized is increased. The other aspect is called speed-up where in the total size of the simulation is kept constant while the work load is spread over increasing numbers of processors.
Scalability Results. Validation and timing studies have been performed on a 512-node nCUBE and a 208-node Intel Paragon. Currently the code shows good scalability on evenly balanced test cases. These cases typically had simple gridding requirements and a straight forward domain decomposition. The results show that inter-processor communication due to flux transfer never becomes a dominant time factor even on problems with large numbers of grid points run on many processors. The sphere test case illustrated in Figure 7 shows how problem size and number of processors can be increased while solution time remains level. Perfectly conducting sphere grids were run on 6, 24, and 96 processors of the Intel Paragon. The number of gird points per processor remained constant at approximately 60,000 resulting in total grid sizes of approximately 0.35, 1.4, and 5.7 million grid points for the three cases. Since increasing the number of processors results in an increased number of zonal interfaces, flux message passing requirements increase throughout the system. Despite this increase in required message passing, communication times did not change appreciably
Complex problems such as full-scale fighter geometries also show encouraging results. Figure 8 shows timing results and zoning for the VFY218 fighter gridded for a frequency of 500 MHz with a 10-point per wavelength resolution. A total of 58 zones and 2.2 million grid points
were required. The grid was run on an Intel Paragon using 28, 61, and 128 processors. Preliminary timing data reveals that communication overhead remains at between 1 and 2.5 percent of the total solve time and that solution speedup occurs as the problem is distributed over more processors. Speedup may be improved by addressing load balancing issues arising from complex zoning arrangements.
The domain decomposition is performed by the RIP algorithm which exploits the discretization of the FEM and performs balanced partitioning and natural subdivision of the grid topology to minimize the computational bottleneck, as shown in Figure 9.
Figure 9 illustrates the partitioning of a 2D finite element model. At the first level, RIP partitions the model into two separate structures by cutting it at modes 37 through 45. At the second level, RIP partitions each of the previously partitioned structure into a new pair


of substructures by cutting the model further at nodes (5, 14, 23, 32) and (50, 59, 68, 77). In Figure 9 the partitioning is continued to four levels and a total of 16 substructures (super elements) are created. At each level the list of cut nodes (separator nodes) are listed in a separator tree. The interior nodes of final super elements are listed in a new level added to the last level. The separator tree contains all of the information on the list and order of nodes to be eliminated from the stiffness equations.
This technique is designed to reduce the communication overhead in sparse matrix
decomposition and thus permit efficient runs on message-passing multiprocessors. The number of potential messages is minimized by assigning pivots to processors on the basis of spatial locality in the dissected problems. This is accomplished using a nested dissection of the underlying problem grid. At each stage of the dissection process, a separator is found that divides the adjacent graph into two disjointed blocks. This procedure is recursively applied to each block until a block is isolated for every processor. The resulting blocks of equations are linearly independent of one another. Nested dissection allows exploitation of large grain parallelism in the sparse matrix decomposition by allocating branches of the elimination tree to multiple problems.
The RIP algorithm first identifies the smallest moment of inertia in the long direction, minimizes the boundary region by rotation, and finally balances processor computational loading. The RIP algorithm follows these general steps to partition a finite element model into substructures, Si. Each Si is a subset of nodes and elements from the original model. The partitioning algorithm further cuts each substructure assigned to a single processor to determine the order of eliminating unknowns within the substructure. The AMF will reduce the

global memory and the global finite element equations to equations with unknowns only on the boundaries between substructures. Once AMF solves for the unknowns on these interfaces, each processor can then work independently to find the solution within its substructure. The RIP algorithm may be implemented repeatedly until further partitioning of the substructures is not possible. This limit occurs when RIP cannot find moment of inertia axes. Each cut will produce two new substructures and a list of interface nodes, Omk, where m is the level of the cut and k indexes the separator group which contains that set of interface nodes. The interface nodes are the nodes common between the two substructures. As an example, the mesh of Figure 10 is partitioned at three levels creating eight super elements. Figure 11 shows the corresponding binary tree with the list of separator nodes. Each separator group Omk has at most two children and at most one parent in the binary tree. After RIP dismantles the original model, substructures are grouped together to make new aggregates. The set of Omk nodes within each aggregate specify an order for eliminating the degrees of freedom within the aggregate. The set of Omk nodes on the boundary of the aggregates specify the shared unknowns between substructures in neighboring processors.

A distributed multifrontal (DMF) algorithm for sparse Gaussian elimination on parallel computer was presented by Nour-omid, Hadian. This method uses the nested dissection reordering heuristic to extract separator from the graph of the matrix, thereby partitioning the matrix into disjoint blocks that can be allocated to the processors. Symbolic decomposition of the result is shown to be completely independent. The number of messages exchanged during the sparse matrix factorization is limited to a function of the length of the separators. The DMF sparse solver achieves parallel efficiencies of over 70 percent. AMF was developed to remove the limitations of DMF. AMF can operate on all types of elements, mixture of elements, and even unsymmetric binary trees. Both DMF and AMF are based on multifrontal algorithm.
The multifrontal algorithm is very powerful. It can be easily described by considering the partitioned structure (see Figure 10). At the last level each super element is assembled to create eight systems of equations. The assembly can be performed simultaneously by different processors. There is no need for inter processor communications at this stage. This would speed up the parallel assembly tasks, especially for large finite element models. Each of the assembled super elements is then condensed by removing the fully summed nodes (interior nodes). In the process of condensing a super element, the coefficient matrix for the super element is partitioned so that the interior and exterior degrees of freedom are separated. Then, the interior coefficient matrix is decomposed into triangular matrices and processed further. This decomposition step is computationally the most expensive task performed on a super element. This is mathematically similar to calculating inverse of the interior matrix. The interior portion of coefficient matrix of all super elements are decomposed and stored in the climbing operation of AMF. According to the binary tree, the condensed systems of equations for every two neighboring super elements are combined (assembled) to form the super elements of the next level. The levels in binary tree are climbed up by repeating the assembly and condensation process for each pair of super element. At each level a new parent super element is created by assembling each pair of child super elements. At the top of the binary tree (Level 1) only one super element is created. All the nodes at the top are interior and fully summed. Therefore, all the unknowns at top level can be calculated at this point. Once the top of binary tree is reached, all the unknowns are calculated consecutively group by group by back substitution.
This tool-kit decouples the equations governing the perturbed substructures from other equations, cuts down the problem size in each step of the iterative design procedure, distribute the tasks among several processors, and further speeds up the solutions.
One obvious advantage of multifrontal algorithm is that all operations can be fully parallelized. Another advantage of the method is that when parameters in a finite element subdomain are perturbed, only the condensed coefficient matrix for that super element and any other higher level super elements in that branch of binary tree are changed. Therefore, in a new solution the condensed matrices of untouched super elements of the binary tree are the same as in the first unperturbed solution. With careful domain partitioning, one can manage to collect all the perturbed elements in one super element so that only one branch of the binary tree be reevaluated for solution to the perturbed problem. As an example, the panel in Figure 12 is partitioned into substructures A and B. Each of these substructures are force partitioned further, and total of four new substructures, are created, as shown in Figure 13. The corresponding binary tree has three levels of interior nodes as shown in Figure 14 where Si’s are the list of interior nodes of the separators. S1, S2, and S3 are the set of nodes shared by substructures A and B, I and III, and II and IV, respectively. S4, S5, S6, and S7 are the interior nodes of the four substructures (Level 3).


According to AMF program, first the elements
of each substructure are assembled and condensed at the third level of
binary tree, followed by assembly and condensation procedure at higher
levels. Now suppose that some geometric and/or material properties of substructure
I is perturbed and another run is needed. In this case, the element stiffness
(and the assembled matrix) in only substructure I is changed. Therefore,
the assembled stiffness matrices, and more important of all, the condensed
matrices of the remaining three substructures remain unchanged. However,
the changes in substructure I leads to changes in super element 2 Level
2 which was created by combining exterior nodes of substructures I and
III. Consequently, the condensation procedure for super element 2 must
be repeated. Fortunately, at this level, the stiffness system to be condensed
contains only the degrees of freedom associated with the nodes on the intersection
of substructures I and III which is relatively small compared to the whole
substructure A . Similarly, the super element 1 at Level 1 is changed and
only the degrees of freedom shared by substructures A and B are recondensed.
Obviously, tremendous amount of computation time is saved in the second
run. Here, only three sub systems of equations are condensed. It is possible
to reduce recomputations by using a combination of unsymmetric binary tree
and force partitioning which actually shifts the separator group S4
to a higher level in the binary tree. An example of the binary tree for
this case is shown in Figure 15 where Si’s are the same as the
previous case The computation saving with the algorithm becomes
even more apparent in large scale models. The larger the model is, the
greater saving in computational time is achieved. For example, a perturbation
in a panel of an aircraft requires solving only a system of equation as
small as the degrees of freedom on the boundaries of that panel. That is
reducing the size of the system to be solved from millions to hundreds
of replacements. Figure 16 illustrates forced natural, and mathematical
partitioning. As shown, a trade study can be performed using super elements
(mathematical representation of a component) and material data in parallel
processing environment. Changes in one substructure are automatically propagated
into other substructures and the global panel by the management of forced
partitioning. Therefore, a tremendous amount of computation time is saved
when sensitivity analysis is performed.

Further speed up in design procedure is achieved by performing sensitivity analysis. The sensitivity analysis aids the designer to find a search direction in trail and error procedure. Once AMF is implemented to solve a system, it can be used again to calculate the sensitivity derivative of the response with respect to any geometric or material parameter. The equations governing the sensitivity derivative of the system response with respect to an input parameter are similar to unperturbed original equations. Their coefficient matrices are the same. Since the major computing efforts is in operations performed on these coefficient matrices, one should save the results of these operations when solving unperturbed problem, and then use them in solution to equations governing the sensitivity of the system. Therefore, all of the saved condensed stiffness matrices in system response calculations can be utilized to calculate sensitivity of the solutions. This is another major computation time saving feature of AMF in trade studies.
