The role of a profiling tool is to associate computational bottlenecks that arise during program execution with easily identifiable segments of the underlying source code. The usefulness of a profiling tool depends upon the ease with which users can employ this information to alleviate the identified bottlenecks within their programs.
The success of profiling tools for sequential languages has been based predominantly on three criteria. The first criterion is `what' is measured; typically this might be the percentage of execution time spent in each part of the program. The second criterion is `where' in the code these costs should be attributed; costs may be associated with functions or libraries, for example. The third criterion is `how-to-use' the profiling information to optimise programs in a quantifiable, portable and universal way; for example, problematic portions of code may be rewritten using an algorithm with improved asymptotic complexity.
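To make the first two criteria concrete, consider the fragment below (a hand-written illustration, not drawn from any particular tool's documentation). A conventional sequential profiler such as gprof would report the percentage of execution time (`what') attributed to each function (`where'), assigning almost all of it to expensive():

\begin{verbatim}
#include <stdio.h>

/* Two functions of deliberately unequal cost: a function-level
   profile attributes almost all execution time to expensive(). */
static double cheap(void)
{
    double s = 0.0;
    for (long i = 0; i < 1000L; i++) s += (double) i;
    return s;
}

static double expensive(void)
{
    double s = 0.0;
    for (long i = 0; i < 100000000L; i++) s += (double) i;
    return s;
}

int main(void)
{
    printf("%f %f\n", cheap(), expensive());
    return 0;
}
\end{verbatim}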
Profiling a parallel program differs from profiling a sequential one in that the program is executed on a number of processors. Consequently, each part of the code may be associated with up to p costs, where p is the number of processors. The major challenge for the developers of profiling tools for parallel languages is to identify and expose the relationship (imbalance) amongst the computational costs incurred by the processors, and subsequently to express this relationship in terms of the three criteria outlined above. Unfortunately, within a parallel framework, a multiplicity of interacting issues makes these criteria significantly more obscure and complex.
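One simple way of quantifying this relationship (our illustration; the function name and the example costs are hypothetical) is the ratio of the maximum per-processor cost of a code segment to the mean cost across all p processors, where a ratio of 1.0 indicates perfect balance:

\begin{verbatim}
#include <stdio.h>

/* Hypothetical imbalance metric for one code segment: costs[i]
   is the cost incurred on processor i. A perfectly balanced
   segment yields 1.0; larger values indicate greater skew. */
double imbalance(const double *costs, int p)
{
    double max = costs[0], sum = 0.0;
    for (int i = 0; i < p; i++) {
        if (costs[i] > max) max = costs[i];
        sum += costs[i];
    }
    return max / (sum / p);
}

int main(void)
{
    double costs[4] = { 1.0, 1.0, 1.0, 5.0 }; /* p = 4, skewed */
    printf("imbalance = %.2f\n", imbalance(costs, 4)); /* 2.50 */
    return 0;
}
\end{verbatim}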
In this paper it is demonstrated that parallel programs written using the disciplined approach of the BSP model are amenable to the three profiling criteria stated above, and the development of a BSP profiling tool is documented. The work motivates the notion of computation and communication balance as the metric by which programs are optimised. It is shown that minimising imbalance usually yields significant improvements in the algorithmic complexity of parallel algorithms. This approach provides the foundation upon which portable, architecture-independent optimisation can be achieved.
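As a sketch of how such per-processor costs can be gathered under BSP (assuming the standard BSPlib primitives bsp_begin, bsp_pid, bsp_nprocs, bsp_time and bsp_sync are available; the workload itself is illustrative), each process can time the local computation of a superstep separately from the barrier synchronisation that ends it:

\begin{verbatim}
#include <stdio.h>
#include "bsp.h" /* BSPlib interface, assumed available */

int main(void)
{
    bsp_begin(bsp_nprocs());
    int p   = bsp_nprocs();
    int pid = bsp_pid();

    /* Local computation phase of one superstep: each process
       sums its cyclic share of [0, 1000000). */
    double t0 = bsp_time();
    double local = 0.0;
    for (long i = pid; i < 1000000L; i += p) local += (double) i;
    double comp = bsp_time() - t0;

    bsp_sync(); /* barrier ending the superstep */
    double step = bsp_time() - t0; /* computation + wait at barrier */

    /* The gap between comp and step on each process exposes the
       imbalance attributable to this superstep. */
    printf("pid %d: comp %.6fs, superstep %.6fs (sum %.0f)\n",
           pid, comp, step, local);
    bsp_end();
    return 0;
}
\end{verbatim}

The difference between the longest and shortest computation phases, made visible at the barrier, is the kind of per-superstep imbalance that a BSP profiler can attribute back to the source code.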