feat: add optional US retail sentiment dataset support #2179

Open
alexander-schneider wants to merge 2 commits into microsoft:main from alexander-schneider:codex/adanos-retail-sentiment-qlib

@alexander-schneider

Summary

This PR adds an optional alternative-data path for US equities based on daily retail sentiment factors.

It includes:

  • an adanos data collector for daily US retail sentiment snapshots
  • a merge step for combining sentiment CSVs with existing normalized US price CSVs
  • an Alpha158AdanosUS handler that augments Alpha158 with lagged sentiment factors
  • a LightGBM benchmark config for US daily data
  • targeted tests for payload mapping, merge behavior, lookback capping, and feature exposure

Why

Qlib already supports custom collectors and custom datasets well. This patch keeps the integration optional and focuses on a reproducible daily factor workflow rather than a sentiment-only demo.

The goal is to make retail sentiment usable as structured alternative data for factor research on top of existing US OHLCV datasets.

What was added

Collector

New files under scripts/data_collector/adanos/:

  • collector.py
  • README.md
  • requirements.txt

The collector pulls daily rows from the Adanos stock detail endpoints for:

  • Reddit
  • X
  • News
  • Polymarket

It builds per-symbol daily CSVs with source-specific columns such as:

  • reddit_buzz, reddit_sentiment, reddit_mentions
  • x_buzz, x_sentiment, x_mentions, x_avg_rank
  • news_buzz, news_sentiment, news_mentions
  • polymarket_buzz, polymarket_sentiment, polymarket_trade_count

It also writes aggregate daily fields:

  • retail_buzz_avg
  • retail_sentiment_avg
  • retail_coverage
  • retail_alignment_score

A merge_with_price_data command is included so users can append sentiment fields to existing normalized US daily price CSVs before running dump_bin.py.
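
For reviewers who want a feel for what the merge step does, here is a minimal sketch of a left join of sentiment rows onto price rows. It is not the code in collector.py: the function name `merge_sentiment`, the `symbol` and `date` key columns, and the sample values are assumptions made for illustration.

```python
import pandas as pd


def merge_sentiment(price: pd.DataFrame, sentiment: pd.DataFrame) -> pd.DataFrame:
    """Left-join sentiment columns onto price rows by (symbol, date).

    Hypothetical helper: assumes both frames carry `symbol` and `date` columns.
    """
    keys = ["symbol", "date"]
    price = price.copy()
    sentiment = sentiment.copy()
    # Normalize the date key so string and datetime inputs join correctly.
    price["date"] = pd.to_datetime(price["date"])
    sentiment["date"] = pd.to_datetime(sentiment["date"])
    # Keep one sentiment row per key so the join cannot multiply price rows.
    sentiment = sentiment.drop_duplicates(subset=keys, keep="last")
    merged = price.merge(sentiment, on=keys, how="left", validate="many_to_one")
    return merged.sort_values(keys).reset_index(drop=True)


# Made-up sample data, for illustration only.
price = pd.DataFrame(
    {
        "symbol": ["AAPL", "AAPL"],
        "date": ["2024-01-02", "2024-01-03"],
        "close": [185.0, 184.0],
    }
)
sentiment = pd.DataFrame(
    {"symbol": ["AAPL"], "date": ["2024-01-03"], "retail_buzz_avg": [0.7]}
)
merged = merge_sentiment(price, sentiment)
```

A left join keeps every price row, so days without sentiment coverage end up with missing values in the sentiment columns rather than being dropped from the price history.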

Dataset handler

New files under qlib/contrib/data/:

  • adanos_features.py
  • handler_adanos.py

Alpha158AdanosUS extends Alpha158 with lagged sentiment features such as:

  • lagged retail buzz / sentiment / coverage / alignment
  • source-level lagged buzz and activity fields
  • short-horizon disagreement and ratio features

All sentiment features are lagged to avoid same-day leakage.
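
As a rough illustration of the lagging, the sketch below builds Qlib expression strings that wrap each sentiment column in `Ref(..., lag)`, which in Qlib's expression engine refers to the value `lag` periods earlier. The helper name `lagged_fields` and the output feature names are hypothetical and may not match what adanos_features.py actually emits.

```python
from typing import List, Sequence, Tuple


def lagged_fields(columns: Sequence[str], lag: int = 1) -> Tuple[List[str], List[str]]:
    """Return (expressions, names) for sentiment columns shifted back by `lag` days.

    Hypothetical helper; the naming scheme is an assumption.
    """
    if lag < 1:
        # A lag of 0 would expose same-day sentiment to the model.
        raise ValueError("lag must be at least 1 to avoid same-day leakage")
    fields = [f"Ref(${col}, {lag})" for col in columns]
    names = [f"{col.upper()}_LAG{lag}" for col in columns]
    return fields, names


fields, names = lagged_fields(["retail_buzz_avg", "retail_sentiment_avg"])
```

Returning expressions and names as two parallel lists mirrors the shape of the feature config that Alpha158 produces, so a handler could append them to the existing field lists.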

Benchmark example

New example config:

  • examples/benchmarks/LightGBM/workflow_config_lightgbm_Alpha158Adanos_US.yaml

This shows how to run a US daily LightGBM workflow on Qlib data built from the merged price and sentiment CSVs.

Notes

  • The integration is fully optional and does not affect existing collectors or datasets.
  • It is designed for users who already maintain US daily price data and want to add retail sentiment as alternative data.
  • The Adanos stock detail endpoints expose bounded historical lookback, so the collector is best used for recent backfill plus ongoing daily accumulation.
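
The lookback bound mentioned in the last note can be handled by clamping the requested start date, roughly as sketched here. The function name `cap_start_date` and the 90-day limit are placeholders; the actual limit of the endpoints is not stated in this PR.

```python
from datetime import date, timedelta


def cap_start_date(requested_start: date, today: date, max_lookback_days: int) -> date:
    """Clamp a requested start date to the earliest day the API can serve.

    Hypothetical helper; `max_lookback_days` stands in for the real endpoint limit.
    """
    earliest = today - timedelta(days=max_lookback_days)
    # Never ask for data older than the lookback window allows.
    return max(requested_start, earliest)


# A request reaching back to 2020 is capped to the window; a recent one is unchanged.
capped = cap_start_date(date(2020, 1, 1), date(2024, 6, 30), max_lookback_days=90)
recent = cap_start_date(date(2024, 6, 1), date(2024, 6, 30), max_lookback_days=90)
```

Because older history cannot be re-fetched later, running the collector daily and keeping the accumulated CSVs is what extends coverage beyond the window over time.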

Validation

Targeted checks run locally:

  • python3 -m pytest tests/test_adanos_collector.py tests/data_mid_layer_tests/test_adanos_handler.py -q
  • python3 -m compileall qlib/contrib/data scripts/data_collector/adanos tests/test_adanos_collector.py tests/data_mid_layer_tests/test_adanos_handler.py
  • python3 -m pytest tests/ -q (fails in this checkout because of existing environment/setup issues unrelated to this patch, including missing compiled qlib extensions and optional test dependencies such as mlflow, gym, tianshou, dill, and fire)
  • git diff --check

@alexander-schneider

@microsoft-github-policy-service agree
