Technical Briefs

Parallelized Direct Search of a Binary Objective PUBLIC ACCESS

[+] Author and Article Information
P. Grant

Department of Electrical Engineering, Vanderbilt University, Nashville, TN 37235

D. G. Walker

Thermal Engineering Laboratory, Department of Mechanical Engineering, Vanderbilt University, Nashville, TN 37235greg.walker@vanderbilt.edu

J. Comput. Inf. Sci. Eng 10(4), 044502 (Nov 23, 2010) (5 pages) doi:10.1115/1.3510589 History: Received June 09, 2009; Revised May 07, 2010; Published November 23, 2010; Online November 23, 2010

Many scientific and engineering problems are solved by utilizing simulations of computationally intensive mathematical models within massive design spaces. As a result, parametric studies of these models are cost prohibitive in terms of computational time. The focus of this work is a particular kind of parameter sweep problem where the simulation of the model, given real parameters, results in a binary value. The goal of this work is to design, develop, and implement a parallel algorithm for bounding a binary objective when the simulation is a computationally intensive mathematical model. A fully functioning implementation is provided for a two-dimensional example using client-server architecture. Results show that a straight bisection search is approximately 50% faster than a full parametric sweep for most continuous functions. With parallelization and load balancing, the simulation is remarkably faster, exhibiting near-linear speedup up to 16 processors for most functions.


Engineering analysis is often concerned with identifying conditions responsible for catastrophic failure. For example, consider the load required to buckle a column. Using simple material relations and static analysis, an estimate can be obtained for a range of column size and load. However, for complex systems, the prediction of failure can be difficult due to the interaction of many forces and the confounding interaction of many parameters. In these cases, detailed models that govern the fundamental behavior of systems are designed to predict the performance of systems immediately prior to failure. Depending on the application, models can be computationally expensive, and exploration of large design spaces can be cost prohibitive, yet engineers often want to know where in the design space a system will fail.

The calculation of microelectronic device failure due to a single energetic particle strike (1-2) is an example of an application that requires significant computational time for a single design point. Failure of a device is determined by the particle energy, biasing conditions, and strike location and direction, among other parameters. Simulations for a single design point require hours of computer time. As a result, to locate all combinations of bias, particle energy and direction that result in failure would require years of computer time.

Another example that does not deal directly with failure but is a similar class of engineering problem is ignition prediction. In diesel engines, ignition is dependent on a variety of parameters such as temperature, pressure, crank angle, cylinder charge, and so forth. To predict whether ignition will occur, a detailed chemistry and thermodynamic simulation must be performed, allowing the states to evolve in time as the gas is compressed by the cylinder (3). Failure in this context would actually be ignition, but the results of the simulation indicate whether the fuel mixture burns or not. These models require expensive computations.

These examples fall into a class of problems that are characterized by a binary objective: The diesel fuel ignited or it did not; the ion destroyed the device or it did not; the column buckled or it did not. Furthermore, simulation results on the “unfailed” side are not terribly useful for determining how close to failure we might be; we simply know the system did not fail. Results obtained from design points of the failed side are often unavailable because the simulation entered an error condition or the quantities are nonphysical (due to failure). Therefore, we cannot depend on the simulation results to provide any more information other than failure or not-failure. In essence, we are attempting to find the location of the discontinuity (xd) in a Heaviside step functionDisplay Formula


Contrast the foregoing problem to the discrete optimization problems where the design variable is discrete. In general, discrete optimization problems do not demand a discrete objective. One notable example is the knapsack problem (4)—a simple example of the discrete optimization problem. In this case, the objective is the number of items in the knapsack, which is discrete. Nevertheless, discrete optimization methods, in general, focus on the fact that the design variables are discrete and are less concerned about the continuity of the objective. In our case, the design variables are continuous and the objective is discrete.

Not only is our objective discrete, but it is also binary. However, our problem is not a pseudo-Boolean optimization (also related to binary optimization or zero-one integer programming) either (5). These techniques are special cases of discrete optimization when the design variable is binary. Again, in general, the objective remains continuous in these classes of problems.

Due to the lack of information from a function evaluation, binary objective problems are amenable only to direct search methods much like those problems without numerical objectives (6). Problems without numerical objectives still usually have a “better” or “worse” designation between any two points even though the objective is not quantified. This qualitative evaluation still gives a search direction. Our binary objective is similar except that we do not have enough information to determine a direction to proceed. We assume that gradients are nonexistent or meaningless except across the interface where the objective changes by a unit step. Nevertheless, direct search methods can be applied to these problems.

The binary objective problem somewhat resembles that of an edge detection problem in image processing (7). However, traditional approaches still rely on higher order derivatives to smooth the image and locate where in an image the intensity changes dramatically. The Canny approach (8) assumes that noise is inherent to the data and that all function evaluations have already been acquired (pixel color). These two assumptions are contrary to our problem and make the standard edge detection algorithms impractical for our approach. Moreover, the type of edge being detected is usually quite different from our design space surface. Nevertheless, the interpolation line segment approach used by some edge detection routines (9) is partially responsible for the interpolation scheme used in our approach to speed up serial analysis. Finally, edge detection problems do not scale to multiple dimensions, making these approaches only peripherally related and of limited importance to the present work.

In general, optimization problems are amenable to parallelization. Byrd et al. (10) suggested that either the function evaluations can be performed concurrently or that a sequential optimization can be used with a parallel forward model. Eldred et al. (11) further developed a taxonomy for levels of parallelization that, when used properly, provides maximum use of available resources. In fact, the DAKOTA project at Sandia National Laboratories was developed to exploit the multilevel parallelism of most optimization problems (12). In the present context, we assume that the parallelization of the forward model is limited and that more resources exist than can be used by the forward model. Therefore, we are interested solely in the parallelization of the optimization algorithm.

The vast majority of the research on parallelization of optimization algorithms is in the area of smooth nonlinear functions (13). In these cases, gradients are usually used to provide a new search direction. If analytic gradients are not available, they can be approximated with finite differences. In terms of parallelization, each gradient (or Hessian) requires additional function evaluations that can be performed concurrently. Therefore, the optimal speedup is equivalent to the number of function evaluations required for each iteration (14). This rule of thumb can be improved slightly with more modern quasi-Newton methods (15-16) and using additional information from otherwise unutilized processors to augment the optimization (14). Nevertheless, the number of processors that can be used is limited by the dimensionality of the design space. Consequently, for the most practical optimization problems, the amount of parallelization for practical gradient-based optimization algorithms is not massive.

Nongradient-based optimizations, however, stand to see greater gains than their gradient-based counterparts. Chazan and Miranker (17) showed that an order advantage exists between sequential search algorithms and parallelized search algorithms for quadratic functions. Sutti (18) later extended the methods by performing asynchronous evaluations, which simply means that the order of the linearly independent search direction s is immaterial. Nevertheless, these methods depend on a smooth convex function, which is not representative of our binary function.

The binary objective problems described here can be treated as trivially parallel in that the results of each design point are independent of the results from other design points. Therefore, high-performance solutions can greatly reduce the total amount of time an engineer might have to wait for the results. The simplest approach would be to discretize the design space and evaluate each point on a different processor. However, this approach does not limit the search in any way and can also become computationally prohibitive for multidimensional searches or when a limited number of processors is available. In general, the brute force approach scales such as Np where N is the number of parameters values to be tested for a single parameter and p is the number of parameters to be tested.

The present work focuses on a parallelized and load balanced bisection of the design space to locate a sudden transition at the discontinuity in the objective. We define the mathematical formulation of the problem and solution approach, explore how the solution can be improved serially, parallelize a basic bisection algorithm, and add load balancing. Performance of the serial solution is measured by the number of function evaluations required to obtain the threshold in the objective. Performance of the parallel algorithms is indicated by the speedup.

The mathematical model of the problem is described in terms of adjustable parameters that govern the physics given by real variables represented by zkk[1,2,,n]. The parameter space is then the set of all possible design points P where f(z1,z2,,zn)[0,1] is a mapping from P to a binary variable. The problem can be defined as the following: Find sets V0Pf(z1,z2,,zn)0 and V1Pf(z1,z2,,zn)1.

To make the problem more tractable, it is assumed that the boundary between the sets can be represented as a continuous (n1) dimensional surface S in P. Because of this simplifying assumption, the solution may be represented by this surface and a binary value that indicates whether the value 1 lies above or below the surface.

In terms of the computational complexity theory (19), the problem is stated as finding the region of the design space that results in failure (V0) or success (V1) to within a particular tolerance such that the surface S is nowhere further from the closest points in either set V0 or V1 than a prescribed tolerance. If the continuous design space is discretized using the tolerance, then a naive algorithm would be to evaluate the function f for each discretized design point in the domain P. This algorithm is O(nd), where n is the number of design points to test in each direction d (assuming the tolerance in each direction is specified to yield the same number of discretized points in each direction). Although we expect to be able to do better, we are not trying to establish a lower bound to the complexity of the problem. Instead, we simply wish to improve on the O(nd) scaling.

To illustrate the solution process and provide metrics for evaluating the efficacy of the algorithm, a two-dimensional design space (d=2) search is considered. Assume that the boundary S between success and failure can be represented by a continuous curve in the xy domain and that x[x0,x1] and y[y0,y1] define the parameter space P. The goal is to find the boundary S within xtol along the x-direction and ytol along the y-direction.

The naive solution (full parameter sweep) proceeds by discretizing the design space using the tolerance for each direction. Each design point can be evaluated to determine whether that point results in failure or success. The boundary S then lies between neighboring design points of opposite results and is never farther from the actual location than the specified tolerance. This algorithm is O(nxny), where nx and ny are the numbers of discretized points in each direction that can be computed asDisplay Formula

Assuming the tolerances in each direction are commensurate, the algorithm can be said to be O(nd), where d is the dimensionality or number of design points. An illustration of this strategy is shown in Fig. 1 where the distance between points isDisplay Formula

A better solution proceeds by allowing xi=x0+ixtolixi<x1. For each xi, perform a bisection along the y-direction to bracket the boundary. The number of function evaluations can be estimated based on the tolerance required for each direction without having any knowledge of the shape of the surface S. The bisection algorithm givesDisplay Formula

number of function evaluations. This is a significant improvement over a parametric sweep, which requiresDisplay Formula
function evaluations. Assuming the boundaries and tolerances are the same in each direction, the total number of function evaluations required for finding the surface S is O(nlogn) for the bisection instead of O(n2) for the full parametric sweep. As the number of dimensions increases, the advantage is not quite as dramatic because the complexity increases as O(nd1logn). Nevertheless, a bisection should improve the performance, and alternative algorithms could be used such as a Brent algorithm (20).

In some instances, the bisection could fail to provide an advantage. For example, consider the right most portion of the example provided in Fig. 1. For the two right most columns, the bisection would not bracket the solution because the solution is not there to bracket. Only after the requisite number of evaluations (O[logN]) would we be able to conclude that the solution S does not extend into that region. If we could leverage existing knowledge about the locus of the solution, perhaps we could reduce or eliminate evaluating these points. This is the focus of the approximation scheme described below.

To reduce the number of function evaluations even further, an approximation scheme can be used, which will leverage information already gathered about the location of the surface S. These approximations serve as a guide for the remaining bisections. Types of approximation used include all of the interpolation types supported by the GNU Scientific Library (21): linear, polynomial, cubic spline, periodic cubic spline, akima spline, and periodic akima spline interpolations. Each type of interpolation was tested on a boundary represented by a line, sine, parabola, and polynomial. The results in terms of the number of function evaluations are presented in Table 1. The results suggest that simple interpolation (linear) is adequate for the vast majority of function types.

These functions examined are all smooth, and so the results are quite impressive. Cubic splines perform very well, but performance is worse than the regular bisection method in the case of the ninth degree polynomial, which wildly oscillates outside of the parameter space.

This problem can be parallelized easily because each function evaluation is assumed to be independent of all other design points. Therefore, little extra work is required to parallelize the algorithm: One must only assign a single x-value to each node and allow the bisection for that x-value to take place on that node. The number of function evaluations does not change when the algorithm is parallelized in this way.

A parallel version of the search algorithm was developed using MPI (message passing interface) with ethernet network connectivity. No attempt to balance the load was made in this initial trial. Because the number of design points in the x-direction was fixed (based on the tolerance), each processor was assigned a fixed number of evaluations in that direction. For each x design point, bisection was performed in the y-direction until the required tolerance was reached. The test problem was to find the threshold given by y=0.75x+0.13 on the domains x0=0, x1=1, y0=0, and y1=1. The model returned a 1 if the design point was above the line and 0 if the design point was below the line. The required tolerances for each direction were xtol=102 and ytol=103. No interpolation was performed to reduce the number of iterations.

There are three possible sources of task distribution that could prohibit the parallelization from achieving a linear speedup.
  • The number of tasks does not divide easily among the available processors. For example, four function evaluations do not divide equally across three processors. While one processor finishes the final function evaluation, the other three processors sit idle.
  • Each processor may be assigned a different number of function evaluations due to predictive algorithms. For example, if one can provide a good guess where the interface is based on previous results, fewer iterations may be required to achieve the specified tolerance. Therefore, the processors where a predictive algorithm is effective may finish their tasks earlier than other processors.
  • Each function evaluation may require a different amount of time. For example, in the case of the ion-strike simulations described previously, the average time to determine whether a device failed given a single design point is 4 h, yet some failure conditions can be estimated early in the simulation (of the order of 20 min of simulation time), and some simulations that come close to failure but ultimately survive can require 10–20 h of CPU time. Therefore, depending on the number of processors, it is conceivable that several processors may sit idle while the last few simulations finish the calculations.

In the foregoing tests, the number of function evaluations (1200) is much greater than the number of processors (up to 17); therefore, the load based on the number of function evaluations (point 1) should be similar. Concerning point 2, we are not using any predictive algorithm, so the number of function evaluations will not change based on previous results. Finally, if each function evaluation requires the same amount of time, then we expect linear speedup, as shown in Fig. 2, for the constant function evaluation time case. These results were obtained on a homogeneous cluster.

In real engineering systems, however, we cannot depend on each function evaluation to be equivalent in terms of computational requirements (point 3). This wait time at the end of the simulation will reduce the speedup of the overall analysis, as shown in Fig. 2. In the random function evaluation time case, a random amount of sleep time was executed before the function returned with a 1 or a 0. This was implemented to simulate varying function evaluation times. This lack of linear speedup can be addressed through a load balancing scheme.

The strategy used to accomplish load balancing is that of a simple client-server architecture. The server maintains a queue of tasks to be completed. Initially, the queue contains, in order, the low and high y-values for each x-value. The server doles out tasks from the front of the queue. When a task completes, a new task is generated and is placed at the end of the queue. A new task can be assigned in constant time. This form of load balancing addresses all three sources of degradation of linear speedup. However, it requires a dedicated server to dole out design points and requires a bit more communication overhead. Nevertheless, the results are dramatically improved. Figure 3 shows the maximum wait time of any processor in the group and the average wait time for all the processors in the group as a function of the number of processors in the group. The amount of wait time increases with the number of processors when no load balancing is performed. With load balancing, however, the wait time is decreased significantly. Therefore, the efficiency of this strategy is nearly perfect, leading to a scalable speedup up to 17 processors.

The performance of the parallel algorithm can be improved by including the approximation scheme used in the serial algorithm. However, this effort introduces additional complexity. As processors are added to the simulation, the accuracy of the approximation decreases, as illustrated in the following example. If all execution occurs on a single processor, then when the evaluation k is assigned (k1), other results may be used in the approximation. On the other hand, if ten clients are used, then when the evaluation k is assigned at most (k10), other results may be used. Furthermore, the approximation in this case is an interpolator, meaning that the approximation must pass through each point. For these reasons, it is extremely difficult to place bounds on the number of evaluations required for a particular number of processors and type of approximation. As illustrated in the case of the ninth degree polynomial, the number of function evaluations does not always decrease when using approximation.

Although the example was cast in a two-dimensional space, the concepts here can be extended to multiple dimensions with a multidimensional bisection (22) algorithm. In fact, the two-dimensional problem may also be improved with a multidimensional bisection approach. However, the scalability is no longer a simple function of the problems size, and the parallelization is no longer trivial.

Improvements can be made to the load balancing strategy by introducing a list ofDisplay Formula

queues qi such that qi contains only tasks corresponding to an x-value that has i tasks executed already. This strategy is commonly used in CPU schedulers and runs in constant time.

Other improvements could include mixing an approximation strategy with a load balancing strategy. This is difficult because the approximation and thus the scalability worsen as the number of processors increases. Also, the interpolators used should be replaced with better approximations since the partial results are inexact at best.

Binary objectives occur naturally in many areas of science and engineering. Therefore, understanding these types of problems is important to the efficient analysis and designs of engineering systems. Examining the problems by forming a mesh of the parameter space seems to be the accepted solution, which is certainly easy to extend to n dimensions. However, complicated physical models often require vast computing resources, and an algorithm that scales better and reduces the computational cost is essential. We have shown that a simple bisection algorithm and simple parallelization with load balancing can reduce the computational cost by an order of magnitude. These results are limited to a boundary that is continuous in the parameter space. Moreover, we did not test the scheme for higher dimensions. Nevertheless, these results lay the framework for a continued study in optimization problems with binary objectives.

Copyright © 2010 by American Society of Mechanical Engineers
View article in PDF format.



Grahic Jump Location
Figure 1

Illustration of the solution S on the domain P and the value of the function evaluation f for discretized points whose distances (δx and δy) are determined by a prescribed tolerance

Grahic Jump Location
Figure 2

Speedup measured for a constant function evaluation time and a random function evaluation time (from 0 s to 5 s of sleep)

Grahic Jump Location
Figure 3

The maximum and average wait times for a nonbalanced case and when load balancing is employed


Table Grahic Jump Location
Table 1
Number of function evaluations required to locate the surface S identified above to a tolerance of 104. The row labels identify the type of interpolating scheme and the column labels identify the type of interface function.


Some tools below are only available to our subscribers or users with an online account.

Related Content

Customize your page view by dragging and repositioning the boxes below.

Related Journal Articles
Related eBook Content
Topic Collections

Sorry! You do not have access to this content. For assistance or to subscribe, please contact us:

  • TELEPHONE: 1-800-843-2763 (Toll-free in the USA)
  • EMAIL: asmedigitalcollection@asme.org
Sign In