JIT inlining seems to be disabled, so results are misleading
AdamMil
Posts: 22
When profiling the included Mandelbrot applications running in "Slow" (complex number) mode, the "slowest functions" are always Complex.get_X() and Complex.get_Y(), which have the code "return m_X;" and "return m_Y;", respectively.
I tried building the Mandelbrot application myself, in a release build with no debug information, and profiling that.
The slowest functions were still listed as Complex.get_X() and Complex.get_Y().
It seems that JIT inlining is being disabled.
In an actual build, those function would likely be inlined and take almost no time at all.
But whatever the actual cause, the fact that those functions are the slowest is almost certainly skewing more results than just those two lines, because the times of the functions that call them would also be artificially inflated, leading to me optimizing the wrong code. The only way to "optimize" Complex.get_X() and Complex.get_Y() as far as the profiler would be concerned is to manually inline them, which would require breaking encapsulation, etc. And all that work and ugliness wouldn't really speed up the application because in a real build they would have been inlined by the JIT compiler anyway.
I could be wrong, but this seems like a big problem to me...
I tried building the Mandelbrot application myself, in a release build with no debug information, and profiling that.
The slowest functions were still listed as Complex.get_X() and Complex.get_Y().
It seems that JIT inlining is being disabled.
In an actual build, those function would likely be inlined and take almost no time at all.
But whatever the actual cause, the fact that those functions are the slowest is almost certainly skewing more results than just those two lines, because the times of the functions that call them would also be artificially inflated, leading to me optimizing the wrong code. The only way to "optimize" Complex.get_X() and Complex.get_Y() as far as the profiler would be concerned is to manually inline them, which would require breaking encapsulation, etc. And all that work and ugliness wouldn't really speed up the application because in a real build they would have been inlined by the JIT compiler anyway.
I could be wrong, but this seems like a big problem to me...
This discussion has been closed.
Comments
Thanks for the very thorough research and comments! Yes, profiling applications must disable inlining. This directive is forced on us, as far as I know, by Microsoft. Once profiling is started, the runtime's Just-In-Time compiler disables inlining.
you are correct that the profiler disables inlining. This is done so that all the methods do appear in the results. You are also correct to point out that the time taken by these inlined methods would normally be negligible. The real reason that the complex example is slower is because we instaniate a whole load of object number objects.
Thanks,
Tom Harris
Red Gate Software Ltd.
From what I know of the profiling API, the JIT will call the JITInlining method when it's about to inline a function, giving the function IDs of the two functions involved, and allowing you to instruct the JIT to not inline them. (Inlining can also be disabled with a global flag.)
It would seem that inlining can be controlled. However, I'm sure you instrument the IL or something, and instrumenting around inlined functions to properly track when they were called (in order to get correct hit counting) would be difficult or impossible.
But that difficult task doesn't seem necessary. It would still be a useful feature of the profiler if it allowed the JIT to inline functions without instrumentation. The function hit counts wouldn't be what people might expect, but the profiling results would do a better job of leading people to things that need to be optimized.
With JIT inlining disabled, people can be led to optimize the wrong things, which would show an improvement in the profiler, but no actual improvement in the release build of their application.
It's a tradeoff between correctness in timing and correctness in hit counting. The current implementation gives correct hit counting but incorrect timing. I personally would prefer to sacrifice hit counting for better timing, because timing's what's important for profiling in most applications.
This could perhaps be an option inside the profiler. I think it's better to have the option than to be stuck with one way of doing things... or if I had to be stuck, I'd prefer to be stuck with more correct timings and less correct hit counts.
Just my opinion as an evaluator of the profiler...
Actually I'm pretty amazed that nobody has noticed this, as inlining can have a fairly significant effect on performance. It's definitely worth a discussion!
After doing a bit of research, I've found this article from MSDN:
http://msdn.microsoft.com/msdnmag/issues/01/12/hood/
It does mention briefly about profiling applications and the effect that inlining would have:
I take this to mean that the profiling wouldn't work at all? I could be wrong on that point.
But I suspect you guys don't need/use the callbacks so much anyway -- instead, you probably use a method called instrumentation where extra bits of code are inserted into the program before it's compiled by the JIT compiler.
These bits of code are what enable you to get profiling down to a per-line level. Instrumentation can be difficult to do properly, but once you've got the system in place, you don't need the function-call callbacks anymore -- you can do everything through instrumentation.
I don't know how your profiler works internally, but if I had to guess, I would say that in performance profiling mode, it doesn't need the callbacks and instead uses instrumentation for everything. And if it does use the callbacks, not having them wouldn't have a large negative effect on the profiler.
Again, I don't know how your profiler works internally, but I'm assuming that it adds to each function through instrumentation a prologue that increments the function hit counter and initializes a timing variable, an epilogue that calculates the elapsed time, and a bit of code before each line that increments that line's counter and calculates the elapsed time since the previous line.
Extrapolating from the numbers given in your advertisements, the instrumentation overhead for a simple one-line function is 6 times as much as the instrumentation overhead for one line of code. In addition, the function call itself takes time, and raises callbacks in the profiling API, which takes even more time.
Inlining that small function could eliminate all that extra time. The negative effect on the profiler really depends on how your profiler is implemented. I assume that it would get the wrong hit count for the function. As I said in a previous message, it's a tradeoff between correct timing and correct hit counting.
There is one possible complication. The JIT compiler only inlines very short and simple functions, and if you emit a prologue and epilogue for hit counting, and additional code for each line of a function, the instrumentation itself could cause the JIT compiler to not inline a function that it otherwise would have, because now it's too long or too complicated.
One way around that would be to 1) not use a prologue/epilogue for function timing and hit counting, and instead use the function-call callbacks (maybe you do this already), 2) not add the per-line instrumentation for one-line functions (one-line functions could be treated specially, with all the timing/hit counting done with the callbacks), and 3) enable JIT inlining. (I'll refer to this as idea #1.)
The effect would be that simple one-line functions would not be instrumented at all, and the JIT would be sure to inline them, and other short functions might also be inlined because you don't have the complicated prologue/epilogue. Hit counting would be affected by enabling JIT inlining because, as you mentioned, the callbacks wouldn't be called, but the timings would be closer to correct.
There are other things you could do. You could have an option to not add any instrumentation at all, using the callbacks only, with JIT inlining enabled. That would give a very accurate picture of the slowest functions, but wouldn't provide per-line profiling. You could then instruct the profiler to run again, and only instrument functions that are either large and clearly not inlineable (Microsoft has published some criteria that the JIT uses to determine whether something should be inlined), or have timings that were not close to zero on the last run. That would allow the JIT compiler to inline exactly (or almost exactly, depending on the internal workings of the JIT compiler) the same set of functions that it normally would, giving nearly-perfect timings and not misleading anyone as to what they need to optimize. (I'll refer to this as idea #2.)
These are all advanced options, of course, and might be too confusing for the average user. (Perhaps you have a good tech writer who could make it accessible.) For the average user, you might have two modes -- the standard mode which you have now, and something like idea #1. The mode you have now gives perfect hit counting, but inaccurate (and in some cases very inaccurate) timing. Idea #1 gives inaccurate hit counting and somewhat better timing (in some cases much better timing). Idea #2 gives inaccurate hit counting and is a 2-step process, but the timings could be nearly perfect.
And in my mind, as I said, timing is the most important thing. Hit counting is second.
Anyway, this is turning into a long ramble. But I think this is something you guys should think about and work on for the next release. Let me know if you get it right. I still need to find a good profiler.
And that's good. Using the callbacks for function timing rather than an epilogue/prologue enables that 2-step scenario I described where the timings could nearly perfect.
I'm not saying that's the best way... I'm not a profiler writer... just a regular programmer, but it might inspire some ideas.
I wish you guys would respond to my other thread, though:
http://www.red-gate.com/MessageBoard/viewtopic.php?t=1826
It's not resolved yet.
Thanks,
-- Adam