Did nVidia sabotage its PhysX library for the CPU in order to boost the GPU?

By Tim Quax on 09 July 2010

David Kanter of RealWorldTech investigated whether nVidia has crippled the CPU path of its cross-platform physics acceleration library, PhysX, to make its GPUs look better than those inferior CPUs.

PhysX is a cross-platform physics acceleration library that developers use to add high-quality physics simulation to their games. If no nVidia GPU is present, PhysX falls back to the CPU, where, rumour has it, it does not perform very well.

CPU and physics simulation

One might argue that the CPU's poor performance simply reflects the fact that GPUs are, by their nature, better at physics simulation. The CPU's poor PhysX results would then be entirely logical, and would serve as further evidence that the GPU is indeed the best tool for the job.

Early investigations, however, showed that PhysX uses only a single thread when it runs on a CPU. That is remarkable, since the workload is designed to be highly parallel, and it is precisely this parallelism, combined with the GPU, that delivers the main performance boost. PhysX drives the GPU with hundreds of threads, while the CPU is left running a single one.

While this could point to severe neglect on nVidia's part, Kanter's findings are more shocking: on a CPU, PhysX appears to use x87 floating-point instructions exclusively, instead of the newer SSE instructions. A note for the compiling-uninitiated: switching from x87 to SSE is as simple as changing a compiler flag.

x87 and its successor

The x87 floating-point instructions are terrible and have been considered legacy for about five years now, for good reason. Because x87 is stack-based, it needs more instructions and more memory accesses than SSE to accomplish the same task. The x87 instructions are also hard to optimize. Intel's fix was to introduce scalar floating-point instructions in SSE (single precision) and later SSE2 (double precision) that can completely replace x87. They give developers a flat set of registers instead of the x87 register stack, and more of them in 64-bit mode (sixteen XMM registers versus eight x87 stack slots), plus packed vector formats that operate on several floating-point values with a single instruction.

Conclusion

But oh well, nVidia says it "has been good enough" and that "the bottlenecks are elsewhere". A very good point indeed.