gh-146073: Add fitness/exit quality mechanism for JIT trace frontend#147966
gh-146073: Add fitness/exit quality mechanism for JIT trace frontend#147966cocolato wants to merge 3 commits intopython:mainfrom
Conversation
| /* Default fitness configuration values for trace quality control. | ||
| * These can be overridden via PYTHON_JIT_FITNESS_* environment variables. */ | ||
| #define FITNESS_INITIAL 1000 | ||
| #define FITNESS_INITIAL_SIDE 800 | ||
| #define FITNESS_PER_INSTRUCTION 2 | ||
| #define FITNESS_BRANCH_BIASED 5 | ||
| #define FITNESS_BRANCH_UNBIASED 25 | ||
| #define FITNESS_BACKWARD_EDGE 80 | ||
| #define FITNESS_FRAME_ENTRY 10 | ||
|
|
||
| /* Default exit quality constants for fitness-based trace termination. | ||
| * Higher values mean better places to stop the trace. | ||
| * These can be overridden via PYTHON_JIT_EXIT_QUALITY_* environment variables. */ | ||
| #define EXIT_QUALITY_ENTER_EXECUTOR 500 | ||
| #define EXIT_QUALITY_DEFAULT 200 | ||
| #define EXIT_QUALITY_SPECIALIZABLE 50 |
There was a problem hiding this comment.
I ran some benchmarks with the current configuration:
Performance version: 1.14.0
Python version: 3.15.0a7+ (64-bit) revision 2f9438a25f
Report on Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.41
Number of logical CPUs: 12
Start date: 2026-04-01 21:04:33.548904
End date: 2026-04-01 21:09:52.782976
+----------------------+---------------+--------------+--------------+------------------------+
| Benchmark | baseline.json | fitness.json | Change | Significance |
+======================+===============+==============+==============+========================+
| chaos | 44.2 ms | 45.1 ms | 1.02x slower | Not significant |
+----------------------+---------------+--------------+--------------+------------------------+
| deltablue | 2.14 ms | 2.23 ms | 1.04x slower | Significant (t=-9.76) |
+----------------------+---------------+--------------+--------------+------------------------+
| fannkuch | 246 ms | 256 ms | 1.04x slower | Significant (t=-7.84) |
+----------------------+---------------+--------------+--------------+------------------------+
| float | 46.1 ms | 49.6 ms | 1.07x slower | Significant (t=-6.55) |
+----------------------+---------------+--------------+--------------+------------------------+
| go | 82.8 ms | 91.2 ms | 1.10x slower | Significant (t=-11.45) |
+----------------------+---------------+--------------+--------------+------------------------+
| json_dumps | 7.59 ms | 8.32 ms | 1.10x slower | Significant (t=-19.49) |
+----------------------+---------------+--------------+--------------+------------------------+
| json_loads | 18.4 us | 18.9 us | 1.03x slower | Significant (t=-6.08) |
+----------------------+---------------+--------------+--------------+------------------------+
| nbody | 50.2 ms | 50.7 ms | 1.01x slower | Not significant |
+----------------------+---------------+--------------+--------------+------------------------+
| pickle_pure_python | 279 us | 262 us | 1.07x faster | Significant (t=13.59) |
+----------------------+---------------+--------------+--------------+------------------------+
| pidigits | 137 ms | 139 ms | 1.02x slower | Not significant |
+----------------------+---------------+--------------+--------------+------------------------+
| pyflate | 264 ms | 274 ms | 1.04x slower | Significant (t=-8.33) |
+----------------------+---------------+--------------+--------------+------------------------+
| raytrace | 252 ms | 211 ms | 1.19x faster | Significant (t=40.18) |
+----------------------+---------------+--------------+--------------+------------------------+
| regex_compile | 102 ms | 103 ms | 1.01x slower | Not significant |
+----------------------+---------------+--------------+--------------+------------------------+
| regex_effbot | 2.16 ms | 2.17 ms | 1.01x slower | Not significant |
+----------------------+---------------+--------------+--------------+------------------------+
| richards | 16.0 ms | 16.0 ms | 1.00x faster | Not significant |
+----------------------+---------------+--------------+--------------+------------------------+
| spectral_norm | 60.8 ms | 57.3 ms | 1.06x faster | Significant (t=2.43) |
+----------------------+---------------+--------------+--------------+------------------------+
| telco | 6.02 ms | 5.80 ms | 1.04x faster | Significant (t=5.54) |
+----------------------+---------------+--------------+--------------+------------------------+
| unpickle_pure_python | 177 us | 170 us | 1.04x faster | Significant (t=5.28) |
+----------------------+---------------+--------------+--------------+------------------------+
| xml_etree_generate | 78.5 ms | 74.4 ms | 1.06x faster | Significant (t=5.83) |
+----------------------+---------------+--------------+--------------+------------------------+
| xml_etree_iterparse | 75.6 ms | 71.5 ms | 1.06x faster | Significant (t=4.14) |
+----------------------+---------------+--------------+--------------+------------------------+
| xml_etree_parse | 134 ms | 120 ms | 1.12x faster | Significant (t=6.42) |
+----------------------+---------------+--------------+--------------+------------------------+
| xml_etree_process | 48.7 ms | 49.4 ms | 1.02x slower | Not significant |
+----------------------+---------------+--------------+--------------+------------------------+
Fidget-Spinner
left a comment
There was a problem hiding this comment.
Great work. Thanks for doing this!
|
I also forgot to mention: but we should adjust for the following:
|
markshannon
left a comment
There was a problem hiding this comment.
This looks good, and thanks for doing this.
Overall, the approach looks good.
I originally had in a mind that the fitness would reduce geometrically, not arithmetically, but your arithmetic approach looks easier to reason about and at least as good in terms of trace quality.
We need to reduce the number of configurable parameters to just one or two.
Then we can make sure that fitness and penalties are set such that we are guaranteed not to overflow the trace or optimizer buffers.
I'm not sure that rewinding is worth it. As long as "good" exits have a much higher score than "bad" exits, then we should (almost) always end up at a good exit.
| uint16_t side_exit_initial_value; | ||
| uint16_t side_exit_initial_backoff; | ||
|
|
||
| // Trace fitness thresholds |
There was a problem hiding this comment.
I don't think we want this many tunable parameters.
It will impossible to test them all, and most aren't useful to tune.
I'd suggest having just the initial, and initial_side values tunable.
Some of these values must be determined from other values, so cannot be tuned at all.
For example, to avoid overflowing the optimizer frame stack, the frame entry penalty must exceed the initial value / 5.
| /* Default fitness configuration values for trace quality control. | ||
| * These can be overridden via PYTHON_JIT_FITNESS_* environment variables. */ | ||
| #define FITNESS_PER_INSTRUCTION 2 | ||
| #define FITNESS_INITIAL (UOP_MAX_TRACE_LENGTH * FITNESS_PER_INSTRUCTION) |
There was a problem hiding this comment.
I'd stick with 2000. We really don't need traces of over 1000 uops; it is far more important to find a good stopping point than to have very long traces
|
|
||
| /* Compute branch bias from the 16-bit branch history register. | ||
| * Returns 0 (completely unpredictable, 50/50) to 8 (fully biased). */ | ||
| static inline int |
There was a problem hiding this comment.
It isn't the bias of the branch alone that matters, but how likely it is to stay on trace.
A branch that has gone left the previous 15 times, but right during tracing should reduce fitness a lot.
| /* Compute exit quality for the current trace position. | ||
| * Higher values mean it's a better place to stop the trace. */ | ||
| static inline int32_t | ||
| compute_exit_quality(_Py_CODEUNIT *target_instr, int opcode, |
There was a problem hiding this comment.
The most important thing to look for is loops.
If target_instr == tracer->start_instr then the exit quality should be very high as we can then close the loop.
| * Returns 1 if successful, 0 if no valid best exit exists. | ||
| * Enforces progress constraints: the fallback position must satisfy | ||
| * the minimum trace length requirements. */ | ||
| static inline int |
There was a problem hiding this comment.
Is this necessary?
Given that "good" exits have much higher scores than "bad" ones, we should end up on a good exit almost always.
Rewinding is more complex and bug prone, so unless it makes a real difference, I'd drop this.
There was a problem hiding this comment.
I’ll run more benchmarks to verify this.
If there’s no noticeable improvement, I’ll remove it.
| // Check if fitness is depleted — should we stop the trace? | ||
| if (ts->fitness < eq && | ||
| !(progress_needed && uop_buffer_length(trace) < CODE_SIZE_NO_PROGRESS)) { | ||
| // Prefer stopping at the best recorded exit point |
There was a problem hiding this comment.
Just stop here. Backing up could give us a better exit, but it might give us a worse trace. And it is more complex to implement.
| else { | ||
| // No valid best exit — stop at current position | ||
| ADD_TO_TRACE(_EXIT_TRACE, 0, 0, target); | ||
| uop_buffer_last(trace)->operand1 = true; // is_control_flow |
There was a problem hiding this comment.
This doesn't count as control flow. It is a terminator, not a branch.
| trace->end -= needs_guard_ip; | ||
|
|
||
| int space_needed = expansion->nuops + needs_guard_ip + 2 + (!OPCODE_HAS_NO_SAVE_IP(opcode)); | ||
| if (uop_buffer_remaining_space(trace) < space_needed) { |
There was a problem hiding this comment.
| assert (uop_buffer_remaining_space(trace) > space_needed); |
If we choose the fitness and exit values correctly, we can't run out of space.
| _PyJitTracerTranslatorState *ts_depth = &tracer->translator_state; | ||
| if (ts_depth->frame_depth <= 0) { | ||
| // Underflow | ||
| ts_depth->fitness -= (int32_t)tstate->interp->opt_config.fitness_frame_entry * 2; |
There was a problem hiding this comment.
I think this is fundamentally different to making a call, so should have its own distinct (and probably larger) penalty.
| // Underflow | ||
| ts_depth->fitness -= (int32_t)tstate->interp->opt_config.fitness_frame_entry * 2; | ||
| } | ||
| ts_depth->frame_depth = ts_depth->frame_depth <= 0 ? 0 : ts_depth->frame_depth - 1; |
There was a problem hiding this comment.
| ts_depth->frame_depth = ts_depth->frame_depth <= 0 ? 0 : ts_depth->frame_depth - 1; | |
| ts_depth->frame_depth = ts_depth->frame_depth <= 0 ? 0 : ts_depth->frame_depth - 1; | |
| ts_depth->fitness -= penalty; |
We should increase the fitness on returning, we want to inline calls to small functions, so we shouldn't penalize it.
|
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
Background
See: #146073
Introduced a preliminary fitness/exit quality mechanism for JIT trace frontend has, enabling the tracer to:
ENTER_EXECUTOR) to avoid stopping on instructions that can be optimized