AcuFieldView Parallel Performance Results

The degree to which operations in AcuFieldView Parallel scale depends on several characteristics of the dataset, and may also depend on the hardware configuration.

As such, it is difficult to make generally applicable statements about the performance of AcuFieldView Parallel; some test results are included here instead. AcuFieldView Parallel testing was carried out on a set of cases run with AcuSolve 1.7e. Details for two benchmark cases, each containing 38 grids and made up entirely of tetrahedral cells, are summarized in the table below:
Table 1.
                       Tet Benchmark (small)   Tet Benchmark (large)
  File size (bytes)    420,111,352             811,480,068
  Nodes                2,833,166               5,526,084
  Elements             15,540,893              30,374,480
Timing tests were performed on an HP xw8600 system with four dual-core CPUs (Intel X5460, 3.166 GHz) and 32 GB of random access memory. Elapsed timings were obtained running in Parallel Shared Memory (shmem) mode. Averaged scale-up factors recorded for each of the benchmarks are presented below.


Figure 1.

Benchmark timings were obtained for the tasks of reading data, creating coordinate surfaces and iso-surfaces, and sweeping those same surfaces. Overall time savings are reported as a percentage relative to the time required to complete these tasks with AcuFieldView in Client-Server mode. Using AcuFieldView Parallel with three server processes (np values to the left of the dotted line), you should expect a scaling improvement of 2x or better.
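As a minimal illustration of how the reported quantities relate, the sketch below converts elapsed timings into a scale-up factor and a percentage time saving. The timing values used are hypothetical placeholders, not measured results from these benchmarks.

    # Sketch: relate elapsed timings to scale-up factor and time savings.
    # The timing values below are hypothetical placeholders, not measured results.

    def scale_up(t_client_server, t_parallel):
        """Scale-up factor: how many times faster the parallel run completed."""
        return t_client_server / t_parallel

    def time_savings_percent(t_client_server, t_parallel):
        """Time saved, as a percentage of the Client-Server elapsed time."""
        return 100.0 * (t_client_server - t_parallel) / t_client_server

    t_cs = 120.0   # hypothetical Client-Server elapsed time (seconds)
    t_par = 55.0   # hypothetical AcuFieldView Parallel elapsed time (seconds)

    print(f"scale-up: {scale_up(t_cs, t_par):.2f}x")                   # ~2.18x
    print(f"time savings: {time_savings_percent(t_cs, t_par):.1f}%")   # ~54.2%

Note that a 2x scale-up corresponds to a 50 percent reduction in elapsed time relative to Client-Server mode.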

Benchmark timings were also carried out on a partitioned FV Unstructured dataset composed of 53 grid and results files with a total of 153 million nodes. The test system was an HP 740C with eight blades, or compute nodes. Each compute node was configured with four Intel Xeon cores (two dual-core X5260 processors, 3.33 GHz), a local file system, and 16 GB of random access memory. The local file systems were connected through a GigE network switch.


Figure 2.

Timing tests were performed as follows. For the case of one compute node, all 53 partitions were read from the local file system on that node. For two compute nodes, the 53 partitions were spread across the file systems of both nodes and the dataset was read again; this procedure was repeated for larger node counts. With all eight compute nodes in use, the time required to read the data was reduced by a factor of 7.
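As a rough sketch of this setup, the example below spreads the 53 partition files across N compute-node file systems in round-robin fashion (the actual placement scheme is not documented here, so round-robin is an assumption) and computes the parallel read efficiency implied by the reported 7x reduction on eight nodes.

    # Sketch: spread 53 partition files across N compute-node file systems
    # (round-robin placement is an assumption, not the documented scheme),
    # and compute parallel read efficiency from the reported speed-up.

    NUM_PARTITIONS = 53

    def distribute_partitions(num_partitions, num_nodes):
        """Assign partition indices to nodes in round-robin order."""
        placement = {node: [] for node in range(num_nodes)}
        for part in range(num_partitions):
            placement[part % num_nodes].append(part)
        return placement

    def parallel_efficiency(speedup, num_nodes):
        """Fraction of ideal (linear) scaling actually achieved."""
        return speedup / num_nodes

    placement = distribute_partitions(NUM_PARTITIONS, 8)
    print({node: len(parts) for node, parts in placement.items()})
    # {0: 7, 1: 7, 2: 7, 3: 7, 4: 7, 5: 6, 6: 6, 7: 6}

    # Reported result: reading with eight nodes was about 7x faster than with one.
    print(f"efficiency: {parallel_efficiency(7.0, 8):.1%}")   # 87.5%

The 7x reduction on eight nodes corresponds to a parallel read efficiency of about 87.5 percent relative to ideal linear scaling.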