Skip to content

gh-146073: Add fitness/exit quality mechanism for JIT trace frontend#147966

Open
cocolato wants to merge 3 commits intopython:mainfrom
cocolato:jit-tracer-fitness
Open

gh-146073: Add fitness/exit quality mechanism for JIT trace frontend#147966
cocolato wants to merge 3 commits intopython:mainfrom
cocolato:jit-tracer-fitness

Conversation

@cocolato
Copy link
Copy Markdown
Member

@cocolato cocolato commented Apr 1, 2026

Background

See: #146073

Introduced a preliminary fitness/exit quality mechanism for JIT trace frontend has, enabling the tracer to:

  • Stop proactively: Stop at the appropriate time without waiting for the buffer to fill up or hitting a dead end
  • Stop at good locations: Prioritize stopping at the entry points of existing executors (ENTER_EXECUTOR) to avoid stopping on instructions that can be optimized
  • Control frame depth: Apply a penalty that increases with depth for function call inlining

Comment on lines +18 to +33
/* Default fitness configuration values for trace quality control.
* These can be overridden via PYTHON_JIT_FITNESS_* environment variables. */
#define FITNESS_INITIAL 1000
#define FITNESS_INITIAL_SIDE 800
#define FITNESS_PER_INSTRUCTION 2
#define FITNESS_BRANCH_BIASED 5
#define FITNESS_BRANCH_UNBIASED 25
#define FITNESS_BACKWARD_EDGE 80
#define FITNESS_FRAME_ENTRY 10

/* Default exit quality constants for fitness-based trace termination.
* Higher values mean better places to stop the trace.
* These can be overridden via PYTHON_JIT_EXIT_QUALITY_* environment variables. */
#define EXIT_QUALITY_ENTER_EXECUTOR 500
#define EXIT_QUALITY_DEFAULT 200
#define EXIT_QUALITY_SPECIALIZABLE 50
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran some benchmarks with the current configuration:

Performance version: 1.14.0
Python version: 3.15.0a7+ (64-bit) revision 2f9438a25f
Report on Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.41
Number of logical CPUs: 12
Start date: 2026-04-01 21:04:33.548904
End date: 2026-04-01 21:09:52.782976

+----------------------+---------------+--------------+--------------+------------------------+
| Benchmark            | baseline.json | fitness.json | Change       | Significance           |
+======================+===============+==============+==============+========================+
| chaos                | 44.2 ms       | 45.1 ms      | 1.02x slower | Not significant        |
+----------------------+---------------+--------------+--------------+------------------------+
| deltablue            | 2.14 ms       | 2.23 ms      | 1.04x slower | Significant (t=-9.76)  |
+----------------------+---------------+--------------+--------------+------------------------+
| fannkuch             | 246 ms        | 256 ms       | 1.04x slower | Significant (t=-7.84)  |
+----------------------+---------------+--------------+--------------+------------------------+
| float                | 46.1 ms       | 49.6 ms      | 1.07x slower | Significant (t=-6.55)  |
+----------------------+---------------+--------------+--------------+------------------------+
| go                   | 82.8 ms       | 91.2 ms      | 1.10x slower | Significant (t=-11.45) |
+----------------------+---------------+--------------+--------------+------------------------+
| json_dumps           | 7.59 ms       | 8.32 ms      | 1.10x slower | Significant (t=-19.49) |
+----------------------+---------------+--------------+--------------+------------------------+
| json_loads           | 18.4 us       | 18.9 us      | 1.03x slower | Significant (t=-6.08)  |
+----------------------+---------------+--------------+--------------+------------------------+
| nbody                | 50.2 ms       | 50.7 ms      | 1.01x slower | Not significant        |
+----------------------+---------------+--------------+--------------+------------------------+
| pickle_pure_python   | 279 us        | 262 us       | 1.07x faster | Significant (t=13.59)  |
+----------------------+---------------+--------------+--------------+------------------------+
| pidigits             | 137 ms        | 139 ms       | 1.02x slower | Not significant        |
+----------------------+---------------+--------------+--------------+------------------------+
| pyflate              | 264 ms        | 274 ms       | 1.04x slower | Significant (t=-8.33)  |
+----------------------+---------------+--------------+--------------+------------------------+
| raytrace             | 252 ms        | 211 ms       | 1.19x faster | Significant (t=40.18)  |
+----------------------+---------------+--------------+--------------+------------------------+
| regex_compile        | 102 ms        | 103 ms       | 1.01x slower | Not significant        |
+----------------------+---------------+--------------+--------------+------------------------+
| regex_effbot         | 2.16 ms       | 2.17 ms      | 1.01x slower | Not significant        |
+----------------------+---------------+--------------+--------------+------------------------+
| richards             | 16.0 ms       | 16.0 ms      | 1.00x faster | Not significant        |
+----------------------+---------------+--------------+--------------+------------------------+
| spectral_norm        | 60.8 ms       | 57.3 ms      | 1.06x faster | Significant (t=2.43)   |
+----------------------+---------------+--------------+--------------+------------------------+
| telco                | 6.02 ms       | 5.80 ms      | 1.04x faster | Significant (t=5.54)   |
+----------------------+---------------+--------------+--------------+------------------------+
| unpickle_pure_python | 177 us        | 170 us       | 1.04x faster | Significant (t=5.28)   |
+----------------------+---------------+--------------+--------------+------------------------+
| xml_etree_generate   | 78.5 ms       | 74.4 ms      | 1.06x faster | Significant (t=5.83)   |
+----------------------+---------------+--------------+--------------+------------------------+
| xml_etree_iterparse  | 75.6 ms       | 71.5 ms      | 1.06x faster | Significant (t=4.14)   |
+----------------------+---------------+--------------+--------------+------------------------+
| xml_etree_parse      | 134 ms        | 120 ms       | 1.12x faster | Significant (t=6.42)   |
+----------------------+---------------+--------------+--------------+------------------------+
| xml_etree_process    | 48.7 ms       | 49.4 ms      | 1.02x slower | Not significant        |
+----------------------+---------------+--------------+--------------+------------------------+

Copy link
Copy Markdown
Member

@Fidget-Spinner Fidget-Spinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. Thanks for doing this!

@Fidget-Spinner
Copy link
Copy Markdown
Member

I also forgot to mention: but we should adjust for the following:

  1. Penalize frame depth underflow more so than normal frame push/pop.
  2. Stop tracing when we hit MAX_ABSTRACT_FRAME_DEPTH, as we can't optimize it anyways.

Copy link
Copy Markdown
Member

@markshannon markshannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, and thanks for doing this.

Overall, the approach looks good.

I originally had in a mind that the fitness would reduce geometrically, not arithmetically, but your arithmetic approach looks easier to reason about and at least as good in terms of trace quality.

We need to reduce the number of configurable parameters to just one or two.
Then we can make sure that fitness and penalties are set such that we are guaranteed not to overflow the trace or optimizer buffers.

I'm not sure that rewinding is worth it. As long as "good" exits have a much higher score than "bad" exits, then we should (almost) always end up at a good exit.

uint16_t side_exit_initial_value;
uint16_t side_exit_initial_backoff;

// Trace fitness thresholds
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want this many tunable parameters.
It will impossible to test them all, and most aren't useful to tune.
I'd suggest having just the initial, and initial_side values tunable.

Some of these values must be determined from other values, so cannot be tuned at all.
For example, to avoid overflowing the optimizer frame stack, the frame entry penalty must exceed the initial value / 5.

/* Default fitness configuration values for trace quality control.
* These can be overridden via PYTHON_JIT_FITNESS_* environment variables. */
#define FITNESS_PER_INSTRUCTION 2
#define FITNESS_INITIAL (UOP_MAX_TRACE_LENGTH * FITNESS_PER_INSTRUCTION)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd stick with 2000. We really don't need traces of over 1000 uops; it is far more important to find a good stopping point than to have very long traces


/* Compute branch bias from the 16-bit branch history register.
* Returns 0 (completely unpredictable, 50/50) to 8 (fully biased). */
static inline int
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't the bias of the branch alone that matters, but how likely it is to stay on trace.
A branch that has gone left the previous 15 times, but right during tracing should reduce fitness a lot.

/* Compute exit quality for the current trace position.
* Higher values mean it's a better place to stop the trace. */
static inline int32_t
compute_exit_quality(_Py_CODEUNIT *target_instr, int opcode,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The most important thing to look for is loops.
If target_instr == tracer->start_instr then the exit quality should be very high as we can then close the loop.

* Returns 1 if successful, 0 if no valid best exit exists.
* Enforces progress constraints: the fallback position must satisfy
* the minimum trace length requirements. */
static inline int
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary?
Given that "good" exits have much higher scores than "bad" ones, we should end up on a good exit almost always.
Rewinding is more complex and bug prone, so unless it makes a real difference, I'd drop this.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll run more benchmarks to verify this.
If there’s no noticeable improvement, I’ll remove it.

// Check if fitness is depleted — should we stop the trace?
if (ts->fitness < eq &&
!(progress_needed && uop_buffer_length(trace) < CODE_SIZE_NO_PROGRESS)) {
// Prefer stopping at the best recorded exit point
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just stop here. Backing up could give us a better exit, but it might give us a worse trace. And it is more complex to implement.

else {
// No valid best exit — stop at current position
ADD_TO_TRACE(_EXIT_TRACE, 0, 0, target);
uop_buffer_last(trace)->operand1 = true; // is_control_flow
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't count as control flow. It is a terminator, not a branch.

trace->end -= needs_guard_ip;

int space_needed = expansion->nuops + needs_guard_ip + 2 + (!OPCODE_HAS_NO_SAVE_IP(opcode));
if (uop_buffer_remaining_space(trace) < space_needed) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert (uop_buffer_remaining_space(trace) > space_needed);

If we choose the fitness and exit values correctly, we can't run out of space.

_PyJitTracerTranslatorState *ts_depth = &tracer->translator_state;
if (ts_depth->frame_depth <= 0) {
// Underflow
ts_depth->fitness -= (int32_t)tstate->interp->opt_config.fitness_frame_entry * 2;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fundamentally different to making a call, so should have its own distinct (and probably larger) penalty.

// Underflow
ts_depth->fitness -= (int32_t)tstate->interp->opt_config.fitness_frame_entry * 2;
}
ts_depth->frame_depth = ts_depth->frame_depth <= 0 ? 0 : ts_depth->frame_depth - 1;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ts_depth->frame_depth = ts_depth->frame_depth <= 0 ? 0 : ts_depth->frame_depth - 1;
ts_depth->frame_depth = ts_depth->frame_depth <= 0 ? 0 : ts_depth->frame_depth - 1;
ts_depth->fitness -= penalty;

We should increase the fitness on returning, we want to inline calls to small functions, so we shouldn't penalize it.

@bedevere-app
Copy link
Copy Markdown

bedevere-app bot commented Apr 1, 2026

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants