Add OTEP #4947 OTEL-compatible context storage#347
Conversation
ivoanjo
left a comment
There was a problem hiding this comment.
Gave it a pass! There were a few of "uuuuh what? 👀 moments". I guess AI?
ddprof-lib/src/main/java/com/datadoghq/profiler/OTelContext.java
Outdated
Show resolved
Hide resolved
ddprof-test/src/test/java/com/datadoghq/profiler/context/OtelContextStorageModeTest.java
Outdated
Show resolved
Hide resolved
ddprof-test/src/test/java/com/datadoghq/profiler/context/ProcessContextTest.java
Outdated
Show resolved
Hide resolved
Yes. Ah, you shouldn't have spent a lot of time on this ... this is just a dirty AI generated scaffolding, hence still draft .. |
…of reference library **What does this PR do?** This PR updates the vendored version of the OTel process context support with the latest code from open-telemetry/sig-profiling#79 (which will soon be merged upstream). **Motivation:** This latest version of the library fixes a number of small issues, but more importantly includes support for the thread context configuration keys which we need for #347 (so we will be able to shed a bunch of code once we rebase on top of this one). **Additional Notes:** From the Java API consumer side, not a lot changes with this version. **How to test the change?** Validate tests still pass!
CI Test ResultsRun: #24146663568 | Commit:
Status Overview
Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled Summary: Total: 32 | Passed: 32 | Failed: 0 Updated: 2026-04-08 16:56:50 UTC |
8849603 to
9bbf480
Compare
ddprof-lib/src/main/cpp/arguments.h
Outdated
| * PROFILER: Use existing TLS-based storage (default, proven async-signal safe) | ||
| * OTEL: Use OTEP #4947 TLS pointer storage (discoverable by external profilers) |
There was a problem hiding this comment.
Ack! As I said above, I think having a "press button in case of emergency" makes sense, but overall I really think it's important to get this to a situation where OTel is the default in java-profiling and carries no downsides (even if that means changing the spec, as many times as needed).
88a03d5 to
73f90c9
Compare
ivoanjo
left a comment
There was a problem hiding this comment.
Here's my first pass of comments before lunch, be back soon!
ivoanjo
left a comment
There was a problem hiding this comment.
Another round of comments, haven't gone through everything 😅 C++ is cool but hard :)
| #ifndef OTEL_PROCESS_CTX_NO_READ | ||
| // Re-publish the process context with thread_ctx_config. | ||
| // Guard: otel_process_ctx_read() is only available when the read API is compiled in. | ||
| // We need to read the current context and re-publish with the config | ||
| otel_process_ctx_read_result read_result = otel_process_ctx_read(); | ||
| if (read_result.success) { | ||
| otel_process_ctx_data data = { | ||
| .deployment_environment_name = read_result.data.deployment_environment_name, | ||
| .service_instance_id = read_result.data.service_instance_id, | ||
| .service_name = read_result.data.service_name, | ||
| .service_version = read_result.data.service_version, | ||
| .telemetry_sdk_language = read_result.data.telemetry_sdk_language, | ||
| .telemetry_sdk_version = read_result.data.telemetry_sdk_version, | ||
| .telemetry_sdk_name = read_result.data.telemetry_sdk_name, | ||
| .resource_attributes = read_result.data.resource_attributes, | ||
| .extra_attributes = read_result.data.extra_attributes, | ||
| .thread_ctx_config = &config, | ||
| }; | ||
|
|
||
| otel_process_ctx_publish(&data); | ||
| otel_process_ctx_read_drop(&read_result); | ||
| } | ||
| #endif |
There was a problem hiding this comment.
Oh interesting, I'll admit I hadn't considered that it would be nice for the process context impl to have a "append thread context config to an existing process context".
So I guess this is the homegrown version of that -- read existing, append, write again.
Would it be useful for me to add a otel_process_ctx_update_thread_ctx to do this? In particular can be done without requiring the cost of the reader (e.g. we could compile with the reader disabled).
ivoanjo
left a comment
There was a problem hiding this comment.
Gave it a full pass! Really excited to see how this is evolving! Overall I think there's a bit of code that we can rip out/simplify/improve :D
ddprof-test/src/test/java/com/datadoghq/profiler/context/OtelContextStorageModeTest.java
Outdated
Show resolved
Hide resolved
ddprof-test/src/test/java/com/datadoghq/profiler/context/OtelContextStorageModeTest.java
Outdated
Show resolved
Hide resolved
ddprof-test/src/test/java/com/datadoghq/profiler/ContextSanityTest.java
Outdated
Show resolved
Hide resolved
ddprof-lib/src/main/java/com/datadoghq/profiler/BufferWriter.java
Outdated
Show resolved
Hide resolved
ddprof-lib/src/main/java/com/datadoghq/profiler/ThreadContext.java
Outdated
Show resolved
Hide resolved
ddprof-test/src/test/java/com/datadoghq/profiler/context/OtelContextStorageModeTest.java
Show resolved
Hide resolved
|
Reply to #347 (comment): Done, added |
- Remove OtelContexts::u64ToBytes (never called) - Remove ProfiledThread::setOtelTagEncoding, setOtelLocalRootSpanId (Java writes sidecar directly via ByteBuffer) - Eliminate ContextApi::_attribute_keys member array, freeAttributeKeys, and shutdown — otel_process_ctx_publish copies strings, no keepalive needed - registerAttributeKeys0: always publish even with count==0 so datadog.local_root_span_id (index 0) is always in the process context - Remove stale tlsPtrBuffer comment (PR-evolution artifact) - Add ThreadContext.readTraceId() for test-side trace ID verification Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- setContextDirect non-clear path now zeros sidecar tag encodings and trims attrs_data to LRS_ENTRY_SIZE so span A's attributes don't leak into span B when setContext is called without an intervening clear - Add testSpanTransitionClearsAttributes to cover this exact path - Use readTraceId() in testOtelStorageModeContext for trace ID round-trip - Fix TLSContext.md write order (sidecar write was shown before trace ID writes; code does it after; also add the new attr-reset step) Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Remove stale lrsCache reference (LRS uses inline hex encoding, no cache) - Fix LRS encoding description: hex (16-char zero-padded), not decimal; clarify its fixed position at attrs_data[0..17] - Expand step 3 description to mention the fixed LRS entry is retained when resetting attrs_data_size to LRS_ENTRY_SIZE Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…st comment Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ebug Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
HEX_DIGITS is byte[]; sb.append(byte) widens to append(int) and appends the decimal ASCII code (e.g. '5'=53 -> "53") instead of the character. The JIT happens to mask this in release builds but the interpreter exposes it in debug builds. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- clearContextAttribute: plain putInt inside detach/attach window - setContextAttributeDirect: same; dict-full path reuses clearContextAttribute - Remove redundant testRepeatedContextWrites and testNestedContextUpdates - TLSContext.md: fix CPU store buffer claim (compiler reordering only for same-thread); update code snippet and test list Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- thread.h/wallClock.cpp: guard _otel_local_root_span_id update behind context_valid; ContextApi::get returns valid=0 between detach/attach causing the sidecar to be clobbered with 0 before attach completes - JavaProfiler.java/javaApi.cpp: remove dead registerConstant() and its JNI function (no callers since ContextSetter was removed) - otel_process_ctx.cpp: remove duplicate #include <sys/syscall.h> - ThreadContext.java: fix detach() javadoc and static cache comment - TLSContext.md: clarify ~35 ns = spanLifecycle + clearContext Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
valid stays 0 after clear (no attach()), so no reader can observe attrs_data. The next setContext call resets attrsDataSize before attach() anyway. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Static cache arrays allowed a thread to read a stale encoding from a slot evicted by another thread. Per-thread (instance) arrays mean only the owning thread ever accesses them — no races, no barriers needed. Drop fullFence from the hit path and storeFence from the write path as a result. Each thread pays one JNI registerConstant0() call per unique value on first encounter; encodings are stable for the JVM lifetime. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
"FB" and "Ea" share hashCode 2236 (slot 188). Test verifies that two threads writing those values simultaneously each read back their own correct value — impossible with the old static cache. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Rename initializeOtelTls0/isOtelContextInitialized/markOtelContextInitialized to drop Otel prefix (only one impl exists now) - Factor clearContextDirect() out of setContextDirect() - Use Integer.BYTES instead of magic 4 for sidecar byte offsets - Drop dead writeAndReleaseLong/writeAndReleaseInt from BufferWriter - Add LOG_WARN when registerAttributeKeys keys exceed capacity - Clarify comments: musl TLS deadlock, contextValid vs zero spans, recycling vs valid-flag, plain put safety in attach() - Fix javadocs: clearContext, setContext, readContextAttribute, readTraceId Rationale: PR review by @jbachorik
…lFence - Rename custom_labels_current_set_v2 -> otel_thread_ctx_v1 per OTEP #4947 final naming - Remove fullFence() (storeFence is sufficient; no load ordering needed) - Make bytesToU64 file-local static; drop BytesToU64 from header - Use plain putLong for LRS sidecar (no cross-thread ordering needed) - Consolidate testMaxValueContext and testOtelModeClearContext into testSequentialContextUpdates
- Inline getSpanId into ContextApi::get(); remove OtelContexts class - Remove bytesToU64 and cassert (now unused in otel_context.cpp) - Fix TLSContext.md: scheduler rq_lock barrier makes stores visible, not "thread paused"; DMB ISHST is required on ARM, not a side effect - Fix otel_context.h: pointer is nulled on thread exit, not "never nulled" - Fix TLSContext.md: pointer is nulled on thread exit to prevent use-after-recycle; external profilers must null-check before deref
- Replace custom_labels_current_set_v2 with otel_thread_ctx_v1 in diagram - Remove deleted OtelContexts/getSpanId/bytesToU64 references - Update writeOrderedLong -> putLong in diagrams and protocol timeline - Fix process-wide -> per-thread cache label - Remove phantom storeFence from attr-cache pseudo-code (cache is per-thread) - Remove phantom putLongVolatile/setVolatile from BufferWriter diagram - Update ContextApi::get() snippet to match actual inlined implementation - Add missing numAttrs declaration in writeCurrentContext snippet
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…view Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add @param context_valid to lookupWallclockCallTraceId doxygen - Add static_assert linking OtelThreadContextRecord size to OTEL_MAX_RECORD_SIZE - Deprecate 2-arg JavaProfiler.setContext; clarify rootSpanId→localRootSpanId mapping - Explain return 0 in deprecated put(long,long) is for ABI compatibility - Fix "per-process" → "per-thread" encoding cache in TLSContext.md - Clarify x86 storeFence applies to Java 9+ VarHandle path in TLSContext.md - Suppress deprecation warnings at 3 legacy setContext call sites Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- wallClock.h: replace assert with null guard for ProfiledThread race - thread.cpp: copy tag encodings only when ContextApi::get() returns true - ThreadContext: read spanId/rootSpanId directly from buffer, drop getContext0() JNI - jfrMetadata.cpp: document signal-safety precondition on reset() - ThreadContext: clarify 2-arg put() param mapping, note otepKeyIndex != 0 invariant - OtelContextStorageModeTest: add 1000-iteration stress test
- ThreadContextBenchmark: replace removed getContext() with getSpanId() - thread.cpp: use __atomic_store_n(RELEASE) for _otel_ctx_initialized - thread.h: document naturally-aligned u64 atomicity assumption - ThreadContext: document clearContextDirect valid=0 OTEP reader trade-off - context_api.h: move LOCAL_ROOT_SPAN_ATTR_INDEX to context_api.cpp (file-local) - perf report: remove stale getContext results, note replacement by direct buffer reads
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
@ivoanjo Thanks for the thorough review! |
What does this PR do?:
Exposes trace context via an OTEP #4947-compliant TLS pointer (
otel_thread_ctx_v1),discoverable by external profilers through the ELF dynsym table. OTEL context storage
is the sole context storage mode; there is no legacy profiler-only mode.
Each thread gets a 640-byte
OtelThreadContextRecordcontaining:attrs_data[612]A
ContextApiclass provides signal-safe read and snapshot operations on the OTEP record.Context data flow
flowchart LR T["Java tracer"] subgraph thread ["Per-thread (ProfiledThread)"] REC["OtelThreadContextRecord\nvalid · trace_id · span_id · attrs_data"] SC["JFR sidecar\nu32 encodings · LRS"] end T -- "put(): valid=0 → write → valid=1" --> REC T -- "put(): LRS\nsetContextAttribute(): u32 encoding" --> SC T -- "setContextAttribute(): UTF-8 value" --> REC REC -- "otel_thread_ctx_v1 (dynsym)" --> EP["External profiler\n(OTEP #4947)"] SC -- "O(1), signal-safe" --> SH["Signal handler\n→ JFR writer"] REC -- "valid + span_id" --> SHPublication protocol (detach/attach)
sequenceDiagram participant W as Java tracer (writer) participant rec as OtelThreadContextRecord participant R as Reader (signal handler / external profiler) Note over W,R: Thread init: otel_thread_ctx_v1 permanently points to record W->>rec: valid = 0 W->>rec: storeFence() Note over R: load valid → 0, skip record W->>rec: write trace_id, span_id, attrs_data, LRS W->>rec: storeFence() W->>rec: valid = 1 Note over R: load valid → 1, read record safely R->>rec: read trace_id, span_id, attrs_dataContext setting API:
ThreadContext.put(localRootSpanId, spanId, traceIdHigh, traceIdLow)— writestrace_id + span_id to the OTEP record, stores localRootSpanId in
ProfiledThreadforO(1) signal-safe reads (used for endpoint correlation in JFR).
ThreadContext.put(spanId, rootSpanId)— delegates to the4-arg overload with
traceIdHigh=0andtraceIdLow=spanId. Marked@Deprecated; callersneeding correct OTEP interop must switch to the 4-arg API.
ThreadContext.setContextAttribute(keyIndex, value)— writes custom attributes.Encodes the string value into both the sidecar (u32 Dictionary encoding for JFR) and
attrs_data(UTF-8 for external profilers) via ByteBuffer. JNI is called only on cachemiss to register a new constant.
JFR tag encoding sidecar:
ProfiledThreadcaches pre-computed Dictionary encodings (_otel_tag_encodings[]) and_otel_local_root_span_idalongside the OTEP record. The ByteBuffer write path populatesboth the OTEP
attrs_data(raw strings for external profilers) and the sidecar (u32encodings for our JFR writer). Signal handlers read only the sidecar — O(1) array index,
zero string lookups or hash-table probes per sample.
Motivation:
Enable external profilers (e.g. OTel eBPF profiler) to discover and read the Java
profiler's trace context using the standardized OTEP #4947 protocol, while preserving
localRootSpanId for Datadog endpoint correlation.
Additional Notes:
JNI thread; the signal handler reads only the sidecar (no Dictionary access at all)
trace/span/attrs data, storeFence + set valid=1 — ensuring signal handlers never see
partially-written records. The TLS pointer
otel_thread_ctx_v1is set permanently onthread init and never modified during context writes.
OtelThreadContextRecordis zeroed and the TLS pointer nulled inreleaseFromBuffer()to prevent external profilers from dereferencing a recycled recordreleaseFromBuffer()before the OTEL record is cleareddoc/architecture/TLSContext.mdHow to test the change?:
Automated tests (7 tests):
./gradlew :ddprof-test:testRelease --tests "com.datadoghq.profiler.context.OtelContextStorageModeTest"Covers: OTEL context round-trip with 128-bit trace ID + localRootSpanId, custom
attributes, attribute overflow, sequential context updates, thread isolation,
span transition clearing attributes, attribute cache isolation.
Symbol verification:
nm -D libjavaProfiler.so | grep otel_thread_ctx_v1For Datadog employees:
credentials of any kind, I've requested a review from
@DataDog/security-design-and-guidance.