ML for Trading Insights

The Coding-Agent Toolkit: A Workflow Loop You Can Install

Stefan Jansen — Tue, 07 Jul 2026 17:57:24 GMT

The previous issue was a reading list on coding agents. It separated three nested loops:

the agent loop that writes or edits code,
the harness loop that decides whether the work is acceptable, and
the workflow loop that decides what work should exist, who reviews it, and what state survives into the next run.

The argument was that the workflow loop is where the engineer stays. The issue closed on a design rule — decide what the agent may change, define how the harness checks it, preserve the evidence needed for review, and keep the research claim outside the automation boundary — and then stopped, because a reading list describes a loop without handing you one.

The three loops from the last issue. The agent loop writes code; the harness loop checks whether the work is acceptable; the workflow loop decides whether the work should exist, who reviews it, and how state persists. This release is one implementation of that outer loop.

coding-agent-toolkit is now public: seven host-neutral steps that take a piece of work from a rough idea to a merged pull request, on both Claude Code and OpenAI Codex. It is the specific workflow loop I run daily, released now because the shape has stopped moving. The agent does the work; the toolkit keeps the state.

A concrete case for what “keeps the state” buys you. An agent starts a data-loader fix, patches the imports, and gets three tests green before it runs out of context. The next morning, a fresh session — maybe on the other host — picks up the branch. With nothing durable on disk, it re-reads the repo, guesses at what was in flight, and quietly widens a loader fix into a refactor of the ingestion layer. With a spec, a plan, an open issue, and a handoff snapshot to read first, it resumes the same scoped task: finish the loader fix, keep the diff reviewable, and stop. The workflow loop is what makes the second outcome the default.

What The Seven Steps Do

Five steps carry a piece of work forward — align, plan, plan-issues, next-issue, ship — and two, handoff and continue, carry state across the gaps between sessions. Each writes one file the next step reads, so state lives on disk rather than in a conversation, and each exists to head off a specific way delegated work goes wrong.

The five delivery steps, the file or GitHub object each writes, and the two transition steps that carry state across boundaries. The file on disk is the contract that the next step reads.

align → spec.md. Interrogates a rough request into a verifiable end-state, one question at a time, before any code is written — so the agent is not optimizing toward an implicit, private definition of done.
plan → plan.md. Breaks the spec into issue-sized chunks with dependencies, so one request does not arrive as a single sprawling change that no one can review.
plan-issues → a GitHub milestone and one issue per chunk. Dry-run by default; --apply writes. The work becomes reviewable units under version control, not lines in a chat transcript.
next-issue → a branch, an implementation, tests, and a PR. Takes the lowest-numbered open issue in the active milestone, so the agent works the next scoped unit instead of whatever is convenient.
ship → a squash-merge and a closed milestone. Runs only after verifying that every issue has a commit with a closing footer, so nothing merges while leaving the work graph half-open.
handoff / continue → a durable session transition. handoff writes a snapshot; continue replays it and flags drift — so the next session does not trust stale prose, and drift does not stay invisible until it corrupts the next task.

Six of the seven ship as a Claude skill and a mirrored Codex prompt. plan is the exception: it delegates to each host’s native plan mode and a capture hook, because re-implementing plan mode as a shell command would discard the structured output the host already produces.

Closes #N in a PR body is the signal that bubbles state up: a merged PR closes its issue, and ship closes the milestone once every issue is closed by a commit footer. GitHub is the projection; the file each step writes to disk is the contract the next step — or the next agent — reads, rather than re-deriving where things stand.
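The closing-footer check that ship depends on can be sketched in a few lines. This is an illustration, not the toolkit's code: the function names, the input shapes, and the regular expression are assumptions, and the pattern covers only the closing keywords GitHub documents (close, fix, resolve, and their inflections).

```python
import re

# Assumed pattern: one of GitHub's closing keywords followed by "#<number>".
CLOSING_FOOTER = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)\b",
    re.IGNORECASE,
)


def unclosed_issues(milestone_issues, commit_messages):
    """Return the milestone issue numbers that no commit message closes."""
    closed = set()
    for message in commit_messages:
        closed.update(int(number) for number in CLOSING_FOOTER.findall(message))
    return sorted(set(milestone_issues) - closed)


def ready_to_ship(milestone_issues, commit_messages):
    """True only when every issue in the milestone has a closing footer."""
    return not unclosed_issues(milestone_issues, commit_messages)


# Hypothetical milestone with three issues and two merged commits.
messages = ["Fix loader imports\n\nCloses #12", "Add loader tests\n\nfixes #13"]
print(unclosed_issues([12, 13, 14], messages))
```

A gate of this shape refuses to merge while issue 14 is still open, which is the half-open work graph the step is meant to prevent.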

The same steps apply to non-code work when the deliverable consists of reviewable files. Closes #N closes an issue about a chapter draft or a research report as cleanly as one about a bug fix.

How It Answers The Design Rule

The last issue’s rule had four clauses. The toolkit maps onto three directly; the harness clause it leaves to your existing tests and review.

Decide what the agent may change. align produces a spec with acceptance criteria before the agent touches a file. The scope is written down and reviewable in advance, so “the agent wrote something” and “the agent did the agreed work” become separate, checkable claims.
Preserve the evidence needed for review. The chain produces a milestone, issues, branches, and a merged PR — a trail a human reads, rather than an agent transcript to reconstruct afterward. The diff arrives scoped to one issue, tied to a stated spec, with a closing footer.
Keep the research claim outside the automation boundary. The last issue’s sharpest example was a data-loader test made to pass by dropping the tickers that had stopped returning history: the suite goes green, and the backtest quietly runs on survivors only. The code is correct; the research claim has changed. A host-neutral workflow layer does not prevent that edit by itself — domain checks do that: a frozen universe, a survivorship guard, an invariant test.

What the workflow changes is visibility. The edit arrives as a scoped PR against a spec that stated what the loader was for, so a reviewer sees a change to the ticker universe instead of a green checkmark, and the domain checks have somewhere to run. The same reading list noted that many agent failures are workflow failures (abandoned reviews, duplicate work, task mismatches) rather than syntax errors. That is the layer the toolkit works on.

The toolkit does not make the model write better code, nor does it replace tests, domain invariants, data checks, cost models, or review. It makes the agent work legible enough that those checks — and a human — can still own the result.

Cross-Host State

The same workflow runs on Claude Code and OpenAI Codex because there is no orchestrator between them. The cross-host primitive is a shared durable state under .workspace/, read natively by both.

Swapping hosts mid-feature works because the next session — whichever host it runs on — picks up where the last one left off by reading what the last one wrote. There is deliberately no “execute as the other host” command; the file is the swap.

Session continuity has one refinement worth calling out. A prose handoff goes stale silently: the next session cannot tell whether what it describes is still true. handoff pairs the summary with a fenced block of read-only commands, each carrying an inline # expect: comment that records the expected output. continue runs each command, compares the output to the expected value, and flags every divergence. Drift is information: a repo one commit ahead is fine; a milestone that has since closed may matter. continue surfaces suggested next steps and does not act on them.
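The replay step is easy to picture as code. The sketch below is not the toolkit's implementation: the snapshot layout (one command per line with a trailing # expect: comment), the function names, and the example commands are all assumptions made for illustration.

```python
import subprocess


def parse_expectations(block):
    """Extract (command, expected_output) pairs from an assumed snapshot block."""
    pairs = []
    for line in block.splitlines():
        if "# expect:" not in line:
            continue  # skip lines that carry no expectation
        command, expected = line.split("# expect:", 1)
        pairs.append((command.strip(), expected.strip()))
    return pairs


def run_shell(command):
    """Run one command and return its trimmed stdout. Only use with read-only commands."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=30
    )
    return result.stdout.strip()


def find_drift(block, run=run_shell):
    """Re-run every command and report each one whose output no longer matches."""
    drift = []
    for command, expected in parse_expectations(block):
        actual = run(command)
        if actual != expected:
            drift.append({"command": command, "expected": expected, "actual": actual})
    return drift
```

The function only reports divergences and leaves the decision to the reader, matching the description above: drift is surfaced, not acted on. Passing the runner as a parameter also lets the check be exercised without touching a real repository.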

This release supersedes claude-code-toolkit, the broader, Claude-only predecessor, which was a wide collection of patterns and plugins. This one is narrower: seven steps, two hosts, one job. It is a sibling, not a competitor, to code review: if you run roborev, open reviews on the current branch surface when a session starts; if roborev is not installed, the toolkit stays silent. One reviews code, one drives work, and they compose.

Installing It

Claude Code. Copy or symlink skills/ into your project’s .claude/skills/, or install the workflow plugin from the coding-agent-plugins marketplace, which mirrors the same steps. The marketplace is Claude-side; the toolkit itself is cross-host.

OpenAI Codex. Point your prompts directory at codex/prompts/, or copy the files in.

Then invoke the steps in order on a new piece of work, and handoff / continue at the boundaries of anything long-running. The SKILL.md in each step directory is that step’s authoritative documentation; there is no separate runtime to install.

The toolkit is pre-1.0 and MIT-licensed. The step contracts are stable enough for daily use, but their shape may still shift in response to friction encountered in real work. Issues and PRs are welcome. Repository: github.com/stefan-jansen/coding-agent-toolkit.

Where This Sits In The ML4T Work

The near-term value of coding agents for ML4T readers is the bounded, checkable work: environment repair, notebook maintenance, data-loader fixes, test scaffolding, CI triage, reproducible experiment plumbing. The workflow loop is what keeps that work reviewable as you delegate more of it, and what draws the line the last issue insisted on — between the changes you hand to the loop and the ones that decide whether a result is real.

Today I am teaching a free Maven Lightning Lesson, Getting stuff done with coding agents (4:00 PM UTC), which walks through this chain end-to-end: turning an ambiguous goal into a scoped work unit, keeping state across sessions, and using review gates so a human still owns the result.

For the hands-on version, I am running a workshop, Loop Engineering: Reliable Work From Coding Agents, on July 18 (10 am–2 pm EDT). It builds what this issue describes, on your own keyboard: turn a vague goal into a verifiable spec, project the plan onto GitHub issues, and write a handoff that survives the session boundary. It also puts you in front of the green-but-wrong failure — the suite passes while the result is corrupted — and has you write a verification skill the agent cannot satisfy by quietly dropping the tickers that stopped reporting.

The same division of labor — what you hand to the loop, what stays with you because it decides whether the result is real — runs through the Machine Learning for Trading: From Research to Production cohort across nine case studies; the current summer cohort is full, with a waitlist for the next.

A Coding-Agent Reading List: Behind the Loops

Stefan Jansen — Thu, 02 Jul 2026 13:45:26 GMT

This is a sourced reading list on coding agents, split into six kinds of source — practitioner essays, control-theory foundations, LLM-agent primitives, systems and harness papers, field studies, and benchmarks — so you can match a question to the evidence that can settle it. The through-line: “loop engineering” is the visible surface; the problems underneath — bounded search, partial observability, delegation, verification — are older and better studied.

“Loop engineering” names a real shift in practice: developers spend less time on single prompts and more time designing the system around the agent — context, tools, state, branches, tests, permissions, review, and stop rules. That is worth naming.

But a serious reading list cannot stop with the current discourse. The current labels give new names to problems that control theory, planning, and human-automation research have studied for decades: bounded search, partial observability, shared state, reusable actions, delegation, verification, and supervisory control. The newer work matters because LLMs made these problems operational in software repositories, where the agent can change files and the environment can push back through tests, logs, compilers, benchmarks, diffs, and code review.

This list separates six kinds of sources:

practitioner pieces that explain why “loops” became the current label;
older control and agent foundations that give the label depth;
LLM-agent primitives that explain the mechanism;
coding-agent systems and harness papers;
empirical studies of adoption, review, refactoring, context, and test quality;
benchmarks and safety work that measure task completion or risk.

That separation matters. A benchmark is not the same as a field study. A product essay is not the same as evidence. A context-file study is not the same as a repository-level benchmark. Mixing them into one bucket makes the field look thinner than it is.

Match your question to the bucket, and note what that kind of source can and cannot settle.

Prompt quality still matters, but the engineering surface expands as agents gain context, tools, state, permissions, verification, and persistence.

From Prompting To Loop Engineering: The Current Discourse

Read these first to understand the vocabulary. Then move on.

Addy Osmani, “Loop Engineering”. The term’s current anchor. Osmani frames loop engineering as designing the system that prompts, monitors, and resumes the agent. The useful primitives are automations, worktrees, skills, plugins or connectors, subagents, and external memory. Read it for language, not for empirical proof.
Andrew Ng on three product-building loops (The Batch #359, June 2026). The highest-profile statement of the frame, published as the naming wave crested. Ng separates three loops by timescale and who acts: an agentic coding loop (minutes), a developer-feedback loop (tens of minutes to hours), and an external-feedback loop (days to weeks). That is a different cut than Figure 2 below, which separates loops by what each one validates — evidence, not contradiction, that “loops” already names several things. Ng also traces the phrase’s spread to Boris Cherny (Claude Code) and Peter Steinberger (OpenClaw).
Armin Ronacher, “The Coming Loop”. The skeptical counterpart. Ronacher separates the model’s tool-result cycle from the external harness that restarts, redirects, or escalates work. Read it for comprehension debt: the risk that code becomes easier for machines to change than for humans to understand.
Simon Willison, “Designing agentic loops”. A cleaner earlier statement of the loop idea: agents run tools in a loop to achieve a goal. The value is that it predates the June 2026 naming wave.
Birgitta Böckeler, “Harness engineering for coding agent users”. Published on martinfowler.com. Note the pieces around the model: guides, sensors, tests, feedback, and the boundaries that make delegated work reviewable.
Geoffrey Huntley, “everything is a ralph loop”. A reminder that the external while-loop around a coding agent existed before the current name. The practice predates the label.

The agent loop writes code; the harness loop checks whether the work is acceptable; the workflow loop decides whether the work should exist, who reviews it, and how state persists.

Older Agent And Control Foundations

This is the layer the previous draft underdeveloped. These sources are not ornaments. They are the reason the current vocabulary makes sense.

Newell and Simon, Human Problem Solving. The starting point for problem solving as search through a structured problem space under bounded computation. Coding agents explore repo states, command outputs, and patch candidates under similar limits.
Hart, Nilsson, and Raphael, “A Formal Basis for the Heuristic Determination of Minimum Cost Paths”. The A* paper. It belongs here because loop design is partly about spending search budget intelligently: which branch, which tool call, which retry, which stop condition.
Fikes and Nilsson, “STRIPS”. Preconditions, operators, and effects. Modern tool calls look different, but safe agent design still asks what must be true before an action runs and what the action changes.
Ghallab, Nau, and Traverso, Automated Planning: Theory and Practice. A broad synthesis of planning, decomposition, actions, goals, temporal structure, and uncertainty. It gives a broader frame than “prompt a model to make a plan.”
Kaelbling, Littman, and Cassandra, “Planning and Acting in Partially Observable Stochastic Domains”. A coding agent never sees the full system. Files, logs, tests, prior runs, and issue text are observations. Context engineering is better belief-state maintenance, not larger prompts.
Erman, Hayes-Roth, Lesser, and Reddy, “The Hearsay-II Speech-Understanding System”. The canonical blackboard architecture. Multiple specialized knowledge sources coordinate through shared state. Modern versions include issue trackers, state files, memory logs, CI summaries, and review artifacts.
Rao and Georgeff, “BDI Agents: From Theory to Practice”. The belief-desire-intention model is still useful for separating what an agent knows, what it is trying to achieve, and what course of action it has committed to.
Sutton, Precup, and Singh, “Between MDPs and Semi-MDPs”. The options framework gives the clean lineage for skills: reusable, temporally extended actions that can be selected like primitives.
Horvitz, “Principles of Mixed-Initiative User Interfaces”. Good agent systems need rules for initiative: when the system proceeds, asks, pauses, escalates, or hands control back.
Parasuraman and Riley, “Humans and Automation: Use, Misuse, Disuse, Abuse”. The automation-trust paper to read before giving agents more authority. Coding-agent failures often come from overtrust, under-specification, or weak supervision rather than local syntax errors.
Sheridan, Telerobotics, Automation, and Supervisory Control. A durable treatment of human supervision over semi-autonomous systems. It is directly relevant to agent loops that operate while a human reviews only the artifacts.

LLM-Agent Primitives

This layer is closer to the earlier agent canon, but here the emphasis is on what each source adds to coding-agent loops.

WebGPT. Browse, gather evidence, cite. It is an early template for language models acting through tools rather than answering from memory.
MRKL. Modular routing across tools and symbolic systems. It foreshadows the connector/plugin layer.
ReAct. The atom of the modern agent loop: reason, act, observe.
Toolformer. Tool use as a learnable behavior, not only an inference-time wrapper.
Reflexion. Actor, evaluator, self-reflection, and memory. Coding agents need the same separation between making a change and judging it.
Tree of Thoughts. Branch, score, and backtrack. Maps onto parallel worktrees, alternate patches, and explicit search over candidate plans.
Voyager. Executable skills that accumulate over time. It is the clearest ancestor for practical skill libraries.
CodeAct. Executable code as the action language. This is central to why coding agents feel different from generic chat agents.
SWE-agent. The agent-computer interface becomes a first-class variable. The environment and harness shape the observed capability.
AI Agents That Matter. Cost, holdout validity, benchmark shortcuts, and complexity discipline. Read this before believing any agent leaderboard.

Coding-Agent Systems, Harnesses, And Configuration

This group is about the engineering layer between the model and the repo.

SWE-bench. The most widely cited benchmark for repository-level issue resolution. It belongs in this section as the benchmark that shaped coding-agent systems, not as a field study about adoption.
SWE-agent. Also belongs here because it showed that the agent-computer interface changes performance. The same model can behave differently under a different scaffold.
CodeAgent. Early repo-level, tool-integrated code generation. Useful for seeing how tool use entered repository work before the current loop vocabulary.
SICA: A Self-Improving Coding Agent. A coding agent modifies its own codebase and improves from 17 to 53 percent on a random SWE-bench Verified subset. Read it as a harness/self-improvement result, not as proof of open-ended autonomy.
Code as Agent Harness. A synthesis of the idea that code is not only the agent output; it is also part of the harness that structures reasoning, action, and verification.
Configuring Agentic AI Coding Tools. A practical taxonomy of configuration mechanisms across coding tools.
A Dataset of Agentic AI Coding Tool Configurations. Shows what configuration patterns appear in real repositories, rather than only in vendor docs.
Agent READMEs. Descriptive evidence on context files in the wild: what teams put in them, how they vary, and why they behave more like configuration than prose.
Evaluating AGENTS.md. The corrective to context-file enthusiasm. Repository instructions can reduce success and increase cost if they add irrelevant or distracting guidance.

Empirical Studies Of Use, Review, And Maintenance

This section is distinct from benchmarks. These studies look at adoption, pull requests, review, refactoring, context use, and test quality in practice.

Agentic Much?. A GitHub-scale adoption study. The current arXiv version reports 22.20 to 28.66 percent adoption across 128,018 GitHub projects. It measures traces of use, not net value.
AIDev. A dataset of agent-authored pull requests — 932,791 PRs across 116,211 repositories, produced by five agents. Keep it separate from companion studies that analyze acceptance or failure.
The Rise of AI Teammates in SE 3.0. Uses an earlier AIDev slice to study agentic PR behavior. Read for the acceptance gap and the difference between speed and accepted contribution.
Where Do AI Coding Agents Fail?. A failure study of agent-authored PRs. It is valuable because many failures are workflow failures: abandoned review, duplicate work, CI/test failures, or task mismatch.
Agentic Refactoring. Refactoring is not one thing. Agents can make local structural changes while leaving deeper design issues intact.
Are Coding Agents Generating Over-Mocked Tests?. Test quality matters because shallow verification is easy to satisfy. Agent commits change tests more often than non-agent commits (23 versus 13 percent) and add mocks more often (36 versus 26 percent).
ContextBench. Process-oriented evidence that finding relevant code and using it correctly are different skills.
SWE-ContextBench. Prior experience can help, but only when retrieved and summarized correctly. This is the right paper to read before building memory-heavy loops.
Coding Agents Are Effective Long-Context Processors. Supports the idea that agents can externalize long-context processing through filesystems and tools. Do not generalize it to all long-horizon maintenance.

Benchmarks, Stress Tests, And Safety

Benchmarks are not field studies. They measure task performance under a protocol. They are useful when the protocol is visible.

Read benchmark scores as protocol results. A benchmark can say what an agent did under a task set, scaffold, tool policy, and evaluation rule. It does not by itself establish production value, review cost, or research validity.

SWE-bench. Real GitHub issues as repository-level tasks. Essential, but easy to overread once leaderboards saturate or scaffolds diverge.
SWE-PolyBench. Multi-language repository-level evaluation. It helps counter Python-only benchmark bias.
RExBench. Research-extension tasks in real repositories. A useful ceiling check for hard, novel work.
SaaSBench. Long-horizon enterprise SaaS engineering. Its main lesson is integration: failures often happen before agents reach deep business logic.
RedCode. Risky code execution and generation. This belongs in the core path because coding agents operate tools, shells, files, and sometimes networks.
OWASP Top 10 for LLM Applications. The security frame for prompt injection, supply chain risk, sensitive data, and excessive agency.

When reading a benchmark score, ask: which task set, which date, which model, which scaffold, which tools, which permissions, and which evaluation protocol? A model score without the harness is not a product capability claim.

What This Means For ML4T Work

For ML4T readers, the near-term value of coding agents is practical: environment repair, notebook maintenance, data-loader fixes, test scaffolding, documentation updates, CI triage, and reproducible experiment plumbing. These tasks are useful because they are bounded and checkable. To make the work more reliable, we published 60+ ML4T agent skills.

The risky step is letting an agent mutate the research claim. In trading and ML, the dangerous errors are often leakage, wrong timestamps, stale data assumptions, broken evaluation windows, accidental survivorship, weak baselines, or transaction-cost assumptions that no longer match the text.

The failure is rarely a syntax error. A more typical case: an agent makes a failing data-loader test pass by dropping the tickers that stopped returning history. The suite goes green, and the backtest quietly runs on survivors only. The code is correct; the research claim has changed.
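A domain check for this failure can be very small. The sketch below assumes a frozen list of tickers stored alongside the test and a loader that reports which tickers it returned; the function names and the placeholder tickers are hypothetical.

```python
def missing_from_universe(frozen_universe, loaded_tickers):
    """Return tickers that the frozen universe requires but the loader did not return."""
    return sorted(set(frozen_universe) - set(loaded_tickers))


def assert_universe_intact(frozen_universe, loaded_tickers):
    """Fail loudly when a loader change silently drops part of the universe."""
    missing = missing_from_universe(frozen_universe, loaded_tickers)
    if missing:
        raise AssertionError(
            f"loader dropped {len(missing)} ticker(s) from the frozen universe: {missing}"
        )


# Placeholder universe: CCC stopped returning history and was dropped by the patch.
frozen = ["AAA", "BBB", "CCC"]
loaded_after_patch = ["AAA", "BBB"]
print(missing_from_universe(frozen, loaded_after_patch))
```

With a guard like this in the suite, the patch that drops delisted tickers no longer goes green, so the change to the research claim has to be made explicitly by editing the frozen universe.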

For an ML4T notebook, loader, or backtest change, an agent-generated patch should record:

data source and snapshot date;
timestamp convention and decision-time availability;
label definition;
train/test split or walk-forward window;
baseline comparison;
transaction-cost assumption where relevant;
exact command to reproduce the result;
files changed and why;
tests added or modified;
reviewer note on whether the research claim changed.
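One way to make the checklist above enforceable is to treat it as a record that a review gate can inspect. The field names and the choice of which field is optional are assumptions for illustration; the list itself is the specification.

```python
from dataclasses import dataclass, field, fields
from typing import Optional

# Assumption: only the cost assumption is optional ("where relevant").
OPTIONAL_FIELDS = {"transaction_cost_assumption"}


@dataclass
class PatchRecord:
    data_source: str = ""
    snapshot_date: str = ""
    timestamp_convention: str = ""
    label_definition: str = ""
    split_or_window: str = ""
    baseline_comparison: str = ""
    repro_command: str = ""
    files_changed: dict = field(default_factory=dict)  # path -> reason for the change
    tests_touched: Optional[list] = None  # an empty list means "explicitly none"
    research_claim_changed: Optional[bool] = None  # the reviewer must set this
    transaction_cost_assumption: Optional[str] = None


def missing_fields(record):
    """Return the names of required fields that were left unset."""
    missing = []
    for spec in fields(record):
        if spec.name in OPTIONAL_FIELDS:
            continue
        value = getattr(record, spec.name)
        if value is None or value == "" or value == {}:
            missing.append(spec.name)
    return missing
```

Keeping research_claim_changed as a three-state value (unset, False, True) means a reviewer has to answer the question rather than inherit a default.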

Do not delegate changes that redefine the target variable, change the backtest window, select the best run after repeated iteration, add data that would not have been known at decision time, or rewrite the published claim to fit the new output. Those choices belong to the research process, not to the coding loop.

On Tuesday, July 7, I am teaching a free Maven Lightning Lesson, Getting stuff done with coding agents. The lesson is the practical companion to this reading list: how to turn an ambiguous goal into a scoped work unit, keep state across sessions, use review gates, and make agent work legible enough that a human can still own the result.

The same boundary sits at the center of Machine Learning for Trading: From Research to Production, the cohort course I am running this summer: it works nine case studies through that exact division of labor — what you can hand to the loop, and what stays with you because it decides whether the result is real. Watch the video overview.

Compressed Reading Path

If the full list is too much, start here:

Osmani, Ronacher, Willison, Böckeler, and Huntley for the current loop and harness debate.
Newell/Simon, POMDPs, Hearsay-II, options, mixed initiative, automation misuse, and supervisory control for the older design problems.
ReAct, Reflexion, Voyager, CodeAct, SWE-agent, and AI Agents That Matter for the LLM-agent mechanism.
Evaluating AGENTS.md, Agentic Much?, AIDev, Where Do AI Coding Agents Fail?, Agentic Refactoring, Over-mocked Tests, and ContextBench for field evidence.
SWE-bench, RExBench, SaaSBench, SWE-PolyBench, and RedCode for benchmark and safety constraints.

The practical result is a design rule: decide what the agent may change, define how the harness checks it, preserve the evidence needed for review, and keep the research claim outside the automation boundary.

From Research to Production with Nine Case Studies

Stefan Jansen — Wed, 01 Jul 2026 12:42:07 GMT

This summer we’re opening the first cohort of Machine Learning for Trading: From Research to Production, a hands-on course built around nine case studies, where you take one trading strategy from a first feasibility check all the way through deployment. See the course page and overview video for more details.

Here is the kind of problem that workflow has to handle.

One of those nine case studies runs the ML4T workflow on thirty CME futures, daily, from 2011 through 2025. The model is a gradient-boosted ranker on a five-day forward return, traded as a long-short book — long the top-ranked contracts, short the bottom, inverse-volatility weighted across both sides. The label that feeds it is built on a ratio-adjusted continuous price series: back-adjusting the price level at each contract roll keeps the roll gap from registering as a price move, so it never lands in the forward returns the model trains on. Leave the series unadjusted and every roll embeds that artifact. Both decisions matter, and so does the validation around them: the result is measured on a holdout never touched during model selection, and against a long-only equal-weight basket of the same contracts.
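The roll adjustment described above can be illustrated with a short sketch. It assumes the continuous series is assembled from per-contract segments and that, on each roll day, the close of the incoming contract is also available; the function name, the input layout, and the prices are hypothetical and not taken from the case study.

```python
def ratio_back_adjust(segments, roll_closes):
    """Build a ratio-adjusted continuous series.

    segments[i]    closes of contract i on the days it is the front contract
    roll_closes[i] close of contract i + 1 on the last day of segment i
    """
    if len(roll_closes) != len(segments) - 1:
        raise ValueError("need one roll close per boundary between segments")
    adjusted = []
    factor = 1.0
    # Walk backward so the most recent contract keeps its traded prices.
    for i in range(len(segments) - 1, -1, -1):
        if i < len(segments) - 1:
            # Scale older history by the price ratio observed on the roll day.
            factor *= roll_closes[i] / segments[i][-1]
        adjusted = [price * factor for price in segments[i]] + adjusted
    return adjusted


# Made-up prices: the old contract ends at 102 while the new one closes at 108.
raw = [[100.0, 102.0], [110.0, 111.0]]
series = ratio_back_adjust(raw, roll_closes=[108.0])
roll_return = series[2] / series[1] - 1          # return of the contract actually held
naive_return = raw[1][0] / raw[0][-1] - 1        # unadjusted: includes the roll gap
print(round(roll_return, 4), round(naive_return, 4))
```

On the adjusted series the return across the roll is that of the incoming contract alone, while the unadjusted series books the gap between the two contracts as if it were a price move.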

Put together, they hold up. The validation Sharpe — the inverse-vol book under a 5% trailing-stop overlay — is 1.36, and on the 2024–2025 holdout it comes in at 1.11 — close enough that the profile carries, though the window is short enough that the gap between the two is not statistically resolved. The robust part is the signal itself: a daily information coefficient near 0.03 that clears the 5% significance threshold on both windows (on autocorrelation-robust standard errors, since the five-day labels overlap), and a result that survives the cost model — a Sharpe near 1.0 net of a tick-implied trading cost of about 10 basis points of notional per leg (modeled, not realized fills), against a breakeven near 30 basis points per leg, a roughly threefold cushion. It sits above the long-only basket too (about 0.6 in validation, 0.8 on the holdout), though a long-short book against a long-only directional basket is a floor rather than a risk-matched comparison. In this market the data construction shapes the result just as much as the model family does: get the roll handling wrong and even the cleanest model trains on a price path you could never have held. The case works because the data, the model, the cost model, and the holdout each carry their weight, and the workflow is built to show which stage carries how much.
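To see why overlapping labels call for autocorrelation-robust standard errors, here is a minimal sketch of a Newey-West standard error for the mean of a daily series. The text does not say which estimator the case study uses, so the Bartlett-kernel form, the lag choice, and the toy data are assumptions.

```python
import math


def newey_west_se(values, lags):
    """Standard error of the mean using Bartlett-weighted autocovariances."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]

    def autocov(k):
        return sum(dev[t] * dev[t - k] for t in range(k, n)) / n

    long_run_var = autocov(0)
    for k in range(1, lags + 1):
        weight = 1.0 - k / (lags + 1)  # Bartlett kernel
        long_run_var += 2.0 * weight * autocov(k)
    return math.sqrt(long_run_var / n)


def mean_t_stat(values, lags):
    """t-statistic of the sample mean against zero."""
    return (sum(values) / len(values)) / newey_west_se(values, lags)


# Toy persistent series: runs of four identical values mimic overlapping windows.
toy = [1, 1, 1, 1, 0, 0, 0, 0] * 2
print(mean_t_stat(toy, lags=0), mean_t_stat(toy, lags=4))
```

With lags set to zero the estimator reduces to the usual standard error; allowing four lags (one fewer than an assumed five-day horizon) widens the error and lowers the t-statistic, which is the correction the overlap requires.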

Validation (left) and a holdout the model never saw (right), each against a long-only equal-weight basket of the same thirty contracts. Validation Sharpe 1.36; holdout 1.11 — the profile carries across the split.

None of those decisions is glamorous on its own: most of a strategy’s fate is settled in the handoffs between stages, not in the choice of model. We ran this loop nine times, and the new cohort is built so you run it too: you walk away with the workflow and one case you took all the way through yourself.

That workflow is the same on every case: a feasibility check, labels and features, a ladder of models, backtesting, portfolio construction, cost and risk analysis, an out-of-sample holdout, the deployment path, and the monitoring after. Running it nine times is what makes the handoffs visible: the decision that carries each case moves from one market to the next, so you learn the pattern instead of one example. Every run is logged, and you keep the notebooks, the run registry, and the deployment scaffold.

The comparative lab

The pipeline is fixed; what it has to handle changes with the market. ETFs anchor the set as a shared reference — daily frequency, an inspectable universe, a cost model you can reason about, an allocator you can deploy — and the other eight show how the same workflow adapts when the market does.

The two S&P rows describe different problems on the same underlying: one uses options data as side information to forecast equities; the other trades the options directly — the same instrument posed as two different trading problems. Each setting shows where an edge comes from, and how much of it survives once costs and capacity are real.

What the workflow forces you to confront

A working pipeline, stage by stage, in the same numbered sequence on every case — one notebook per stage, mapped to the chapter that teaches it:

Feasibility — confirm the universe and its cost structure can support a strategy before you model anything; the straddle case is a feasibility problem before it is a modeling one.
Labels and features — forward returns at the horizon you actually trade, walk-forward splits, and features from momentum and carry to model-based ones: walk-forward GARCH, HMM regimes, particle-filtered stochastic volatility.
A model ladder — linear baselines, Optuna-tuned gradient boosting, neural tabular and sequence models, latent-factor methods, and double machine learning for the causal question.
Strategy construction — backtest, covariance cleaning with random-matrix methods (following Paleologo’s treatment), an instrument-appropriate cost model, and risk overlays.
Validation — an out-of-sample holdout scored with deflated performance measures and multiple-testing corrections.
Production — a deployment loop, then drift monitoring once it is running.

The work that decides the outcome lives in the handoffs between these stages, and one principle keeps reappearing: use only data you would have known at decision time. In the CME opener it is building the label on a roll-adjusted series, so the model never learns from a continuation no one could have traded. In NASDAQ microstructure it is bar-close timestamps: whether an order-book feature was observable when the 15-minute bar closed. In the firm-characteristics panel it is reporting lags and restatements, aligning each fundamental field to the date it was actually known. Different markets, same decision. It separates a tradable result from a notebook artifact, and you only learn that by working the case.

The record, and the right to deploy

Model work is where people expect the course to get technical. It does, and the discipline that makes it pay off is being able to prove, six weeks later, exactly what you ran.

Every training run, prediction set, causal estimate, and backtest is written to a run log: a SQLite registry where each entry is content-addressed by a hash of its full configuration, with its artifacts and results recorded against that hash. A hyperparameter sweep becomes an auditable archive you can query rather than a folder of guesses. That is what makes a model comparison honest: the synthesis step reads the registry to recover how many configurations you effectively tried, then scores each candidate against that count — deflated and probabilistic Sharpe ratios, multiple-testing corrections, the probability of backtest overfitting, and a block-bootstrap comparison against an equal-weight benchmark. (The deflated-Sharpe and overfitting measures are López de Prado’s.) Your best result is judged against the number of attempts it took to find it, and then it meets an out-of-sample holdout it has never touched.
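To make the content-addressing idea concrete, here is a minimal sketch using only the Python standard library. The table layout, the function names, and the choice of SHA-256 over canonical JSON are assumptions for illustration, not the course's actual registry schema.

```python
import hashlib
import json
import sqlite3


def config_hash(config: dict) -> str:
    """Hash a configuration deterministically (sorted keys, compact JSON)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def open_registry(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a hypothetical run registry with one table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        "run_hash TEXT PRIMARY KEY, config TEXT NOT NULL, sharpe REAL)"
    )
    return conn


def log_run(conn: sqlite3.Connection, config: dict, sharpe: float) -> str:
    """Record a run under its config hash; logging the same config twice is a no-op."""
    run_hash = config_hash(config)
    conn.execute(
        "INSERT OR IGNORE INTO runs VALUES (?, ?, ?)",
        (run_hash, json.dumps(config, sort_keys=True), sharpe),
    )
    conn.commit()
    return run_hash


def trial_count(conn: sqlite3.Connection) -> int:
    """Number of distinct configurations recorded so far."""
    return conn.execute("SELECT COUNT(*) FROM runs").fetchone()[0]


conn = open_registry()
h1 = log_run(conn, {"model": "gbm", "depth": 4}, 1.1)
h2 = log_run(conn, {"depth": 4, "model": "gbm"}, 1.1)  # same config, different key order
h3 = log_run(conn, {"model": "gbm", "depth": 6}, 1.4)
n_trials = trial_count(conn)
```

The sketch counts distinct hashes only. The effective number of trials used for deflation can be smaller when configurations are highly correlated, and that adjustment is not shown here.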

When validation and the holdout agree, you have evidence to carry the strategy forward: paper trading, limited deployment, or further monitoring. The firm-characteristics case is the cleanest version of that agreement: a monthly cross-sectional ranker over roughly 2,500 US stocks whose validation Sharpe of 2.75 carries to 1.77 on a holdout the model never saw, with its drawdown profile holding intact alongside the Sharpe — the cleanest such transfer of the nine. When validation and the holdout disagree instead, the course teaches you to read why — whether the prediction degraded, the portfolio translation lost it, or the regime changed. Where a causal question is well defined, a double-ML treatment-effect estimate with a refutation test asks whether the relationship survives a different specification — a separate question from whether the strategy is tradable after costs. That diagnostic ability is what lets you tell the strategies worth more of your time from the ones that are not.

Through deployment, and after

A research result is not the finish line. The deployment chapters run the strategy as a live loop: a unified framework that shares feature and signal code between research and live paths so they cannot silently diverge, a verification step that checks that parity, an order state machine, broker integrations for paper and live trading, and runtime safety controls. In the cohort, deployment means standing up that loop and its controls; committing capital stays your decision. The monitoring chapters cover what happens once it is running: drift detection on features and predictions, circuit breakers, safe model rollout, and a feature store so the live system reads the same features it was trained on. This is running code — the ETF, crypto-funding, and FX cases each have a working deployment loop.

In the cohort

You take one case study — or your own data, if it fits the scope — and run the arc end to end: a clean baseline, a hypothesis about its binding constraint, two or three targeted cycles, each change tested against an out-of-sample holdout under the same protocol. Some changes hold up; some do not, and the cohort is built around learning to tell them apart. What you produce is a research package and the verdict it earns — advance, revise, or retire — scored against a rubric you see on day one.

Coding agents work inside that same protocol. The third edition ships an ml4t-skills companion — 61 skills mapping specific failure modes (leakage, lookahead, purging, cost models, capacity, deployment) to checks an agent runs in your editor, browsable now with a free account at https://www.ml4trading.io/skills/. Agents speed up loaders, tests, and ablations; the skills keep that speed inside the research protocol so the code stays correct as it gets faster.

Logistics

The first cohort starts Monday, July 6, 2026, and the first live session is Thursday, July 9, at 4 PM UTC. The format is eight weekly 90-minute cohort sessions on Thursdays — twelve hours of live class — plus four biweekly 60-minute office-hours sessions, and a final 30-minute one-on-one discussion with me about your project. Every session is recorded and posted within a day, so nothing is lost if you miss a live hour. Between sessions, support runs through a dedicated Discord server, so questions on your case do not have to wait for the next session.

The book publishes in the coming weeks: chapters one through ten are already public, with the case-study code rolling out after, and cohort members get everything they need from day one. Several cases run on free, open data; the intraday microstructure work uses institutional AlgoSeek data, provided to every cohort member along with the rest of the course material.

The best fit is a reader who can already build a complete strategy and wants to get better at the research itself. By the end you have a complete case-study research package — the workflow, the diagnosis and experiments behind your result, your out-of-sample evidence, and the means to repeat all of it on your own data.

See the course page and the overview video for more details.

New Release: From Data to Model-Ready Evidence

Stefan Jansen — Mon, 29 Jun 2026 12:18:05 GMT

This ML4T code release covers Chapters 6-10 of the third edition. These chapters sit between the data layer and the model chapters.

The first release covered the data side: market data, microstructure, alternative data, synthetic data, and the data/ package. This release adds the next layer: the trading setup, labels, features, diagnostics, validation checks, and point-in-time inputs.

Here, model-ready evidence means something practical: the target, feature timing, validation split, and search history are explicit enough to inspect before model selection starts. A lot can go wrong before an estimator enters the workflow: the label can use the wrong horizon, the feature can belong to the wrong decision time, the cross-validation split can mismatch the trading cadence, or the result can come from too many uncounted trials.

Across the five chapters, the release adds 43 notebooks:

Chapter 6: Strategy Research Framework — 2 notebooks
Chapter 7: Defining the Learning Task — 9 notebooks
Chapter 8: Financial Feature Engineering — 8 notebooks
Chapter 9: Model-Based Feature Extraction — 15 notebooks
Chapter 10: Text Feature Engineering — 9 notebooks

The release also updates the case-study directory with shared utilities and each case study’s config/setup.yaml: universe, costs, walk-forward CV, and labels.

We will work through the same research-to-production workflow as in the live Machine Learning for Trading: From Research to Production course, which starts Monday, July 6.

Part II connects the data chapters to the model chapters: research setup, labels, features, validation, and model-ready inputs.

Where to start

If you want the shortest path through the release:

Start with 06_strategy_definition/01_cv_foundations for walk-forward validation, purging, embargo, and nested CV.
Start with 07_defining_the_learning_task/03_label_methods if the prediction target is still unclear.
Start with 08_financial_features/05_feature_selection if the feature set is too broad.
Start with 09_model_based_features/11_hmm_regimes if the regime state is part of the signal story.
Start with 10_text_feature_engineering/09_filing_text_signals if you need point-in-time NLP alignment.

Chapter 6: Strategy Research Framework

Chapter 6 defines a trading strategy as an executable decision process, not just a signal or model. It specifies the market, universe, decision schedule, tradability rules, score-to-position mapping, costs, constraints, evaluation metrics, and run logging.

Chapter 6 separates the live trading loop from the research loop: one runs the strategy on a fixed cadence, the other improves the pipeline under a fixed setup and evaluation protocol.

The chapter also introduces the nine case studies used throughout the book, enabling strategy research to be compared across asset classes, trading cadences, and market structures.

Notebook highlights:

01_cv_foundations builds walk-forward cross-validation from first principles, including decision-time admissibility, purging, embargo, calendar-aware splits, nested walk-forward validation, and combinatorial purged CV.
02_case_study_overview summarizes the nine case studies, including datasets, asset classes, constraints, evaluation protocols, and prediction coverage across time.
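A minimal sketch of the purging idea in a walk-forward split, under stated assumptions: observations are indexed 0..n-1, each label looks `label_horizon` steps ahead, and the training window expands. The function name and parameters are hypothetical and do not mirror the notebook's API. Embargo is omitted because it applies when training data follows a test block, as in combinatorial purged CV.

```python
from typing import Iterator, Tuple


def purged_walk_forward(
    n_obs: int, n_splits: int, test_size: int, label_horizon: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges with an expanding training window.

    A training row i is kept only if its label window, which ends at
    i + label_horizon, finishes strictly before the test block starts.
    """
    if n_splits * test_size >= n_obs:
        raise ValueError("not enough observations for the requested splits")
    for k in range(n_splits):
        test_start = n_obs - (n_splits - k) * test_size
        test_end = test_start + test_size
        train_end = max(0, test_start - label_horizon)  # purge overlapping labels
        yield range(0, train_end), range(test_start, test_end)


folds = [(list(tr), list(te)) for tr, te in purged_walk_forward(20, 3, 4, 2)]
for train_idx, test_idx in folds:
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```

With 20 observations, three folds, four test rows per fold, and a two-step label, the first fold trains on rows 0 to 5 and tests on rows 8 to 11. Rows 6 and 7 are purged because their labels reach into the test block.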

Chapter 7: Defining the Learning Task

Chapter 7 turns raw but validated data into a learning problem. It builds split-aware preprocessing, label engineering, feature-label evaluation, multiple-testing checks, and lightweight causal falsification.

Chapter 7 uses the triple-barrier method to label an event by whichever barrier is hit first: the profit target, the stop-loss, or the maximum holding period.

The useful move is that labels are treated as trading decisions, not just target columns. Horizon, overlap, resolution time, and implied trading intensity all contribute to the target definition.
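As an illustration of "whichever barrier is hit first", here is a small sketch of a triple-barrier label for a single entry. It assumes a long position, symmetric handling of close prices only, and barriers expressed as simple returns; the function name and signature are hypothetical and simpler than the notebook's implementation.

```python
from typing import Sequence, Tuple


def triple_barrier_label(
    prices: Sequence[float],
    start: int,
    profit_target: float,
    stop_loss: float,
    max_holding: int,
) -> Tuple[int, int]:
    """Return (label, exit_index) for a long entry at prices[start].

    label is +1 if the profit target is hit first, -1 if the stop-loss is
    hit first, and 0 if the holding period (or the data) runs out first.
    """
    entry = prices[start]
    last = min(start + max_holding, len(prices) - 1)
    for t in range(start + 1, last + 1):
        ret = prices[t] / entry - 1.0
        if ret >= profit_target:
            return 1, t
        if ret <= -stop_loss:
            return -1, t
    return 0, last


prices = [100.0, 101.0, 103.0, 99.0, 96.0, 100.0]
hit_target = triple_barrier_label(prices, 0, 0.025, 0.03, 4)  # (1, 2)
hit_stop = triple_barrier_label(prices, 2, 0.025, 0.03, 3)    # (-1, 3)
timed_out = triple_barrier_label(prices, 0, 0.05, 0.05, 3)    # (0, 3)
```

The exit index is the label's resolution time. It determines how long the label overlaps with its neighbors, which is why the barrier widths and the maximum holding period are trading decisions rather than preprocessing details.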

Notebook highlights:

01_data_quality_diagnostics checks the ML4T datasets for index integrity, missing values, duplicates, outliers, and temporal gaps.
02_preprocessing_pipeline implements split-aware preprocessing, including a SplitAwarePreprocessor that fits scalers and encoders on training data only.
03_label_methods implements fixed-horizon, percentile, triple-barrier, ATR-based, trend-scanning, meta-label, and sequential-bootstrap labels on ETF data.
05_signal_evaluation and 06_ic_inference use Information Coefficients, quantile returns, signal decay, HAC standard errors, and block bootstrap to evaluate single-factor signals.
07_multiple_testing covers factor-zoo selection bias, Benjamini-Hochberg FDR control, complexity-aware corrections, the Deflated Sharpe Ratio, and the Probability of Backtest Overfitting.
08_causal_sanity_checks uses timing placebos, shared-driver checks, and VIX regime heterogeneity tests to ask whether a signal story survives basic falsification.
10_ml4t_library_ecosystem shows the ml4t-data, ml4t-engineer, and ml4t-diagnostic libraries used across Chapters 7-12.
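For the multiple-testing step, a short sketch of the Benjamini-Hochberg procedure shows how the rejection threshold tightens as the number of tested signals grows. The p-values below are made up for illustration, and the function is a plain-Python stand-in rather than the notebook's code.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return a reject/keep flag per hypothesis, controlling FDR at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Toy p-values for ten candidate signals (illustrative only).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values, alpha=0.05)
naive_hits = sum(p < 0.05 for p in p_values)
```

On these toy numbers a naive 5% threshold flags five signals, while the procedure keeps two.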

Chapter 8: Financial Feature Engineering

Chapter 8 moves from a trading idea to a documented feature specification: what the feature is supposed to capture, which horizon it belongs to, which reference frame it uses, and how it should be lagged.

Chapter 8 organizes feature families by data requirement and role: direct signals, state variables, feasibility inputs, and context.

The chapter builds direct financial features from prices, volume, liquidity, microstructure, term structure, cross-asset relationships, options markets, fundamentals, macro data, calendars, and events. It also includes feature selection and sensitivity checks, so it is not just a catalog of indicators.

Notebook highlights:

01_price_volume_features builds multi-horizon returns, trend features, volatility estimators, volume-liquidity features, ranks, and z-scores from ETF data.
02_microstructure_features computes Kyle lambda, Amihud illiquidity, Roll spread, and order-flow imbalance from NASDAQ ITCH data.
03_structural_cross_instrument_features builds carry and roll-yield features from CME futures, rolling beta and lead-lag features from ETFs, and options-implied features from S&P 500 options.
04_fundamentals_macro_calendar builds value, quality, and growth features from point-in-time financial statements, plus yield-curve, credit-spread, VIX, and calendar features.
05_feature_selection reduces a large ETF feature set using IC ranking, correlation deduplication, hierarchical clustering, BH-FDR filtering, bootstrap stability, and LightGBM importance.
06_robustness_sensitivity sweeps momentum parameters, checks IC by VIX regime, and builds signal-times-state interaction features.
07_event_studies implements event studies around ETF momentum breakouts, including normal-return estimation and CAAR confidence bands.
case_study_feature_summary compares feature results across case studies, including IC distributions, FDR survival rates, and feature-family effectiveness by asset class.
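Two of the microstructure features named above have compact textbook definitions, sketched here in plain Python. The inputs are toy series rather than ITCH data, and the function names are hypothetical.

```python
import math
from typing import Optional, Sequence


def amihud_illiquidity(returns: Sequence[float], dollar_volumes: Sequence[float]) -> float:
    """Average absolute return per unit of dollar volume (zero-volume periods skipped)."""
    ratios = [abs(r) / v for r, v in zip(returns, dollar_volumes) if v > 0]
    return sum(ratios) / len(ratios)


def roll_spread(prices: Sequence[float]) -> Optional[float]:
    """Roll estimator: 2 * sqrt(-cov(dp_t, dp_{t-1})).

    Returns None when the serial covariance of price changes is not negative,
    in which case the estimator is undefined.
    """
    changes = [b - a for a, b in zip(prices, prices[1:])]
    x, y = changes[1:], changes[:-1]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)
    if cov >= 0:
        return None
    return 2.0 * math.sqrt(-cov)


illiq = amihud_illiquidity([0.01, -0.02], [1_000_000.0, 2_000_000.0])

# A price that bounces between two levels has negatively correlated changes.
# Deterministic alternation is not the random order flow Roll's model assumes,
# so the resulting number is illustrative only.
bouncing = [100.05, 99.95, 100.05, 99.95, 100.05, 99.95, 100.05]
trending = [100.0, 101.0, 102.0, 103.0, 104.0]
spread_bouncing = roll_spread(bouncing)
spread_trending = roll_spread(trending)  # None: no negative serial covariance
```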

Chapter 9: Model-Based Feature Extraction

Chapter 9 turns fitted models into features. These are not the final supervised models being compared later. They are inputs: filtered states, residuals, volatility estimates, uncertainty summaries, regime probabilities, and panel-level relationships.

Chapter 9 turns fitted models into features, including filtered regime probabilities from Hidden Markov Models without using future observations.

A fitted model can compress structure that raw rolling features miss: latent state, changing volatility, structural breaks, uncertainty, cycles, and pairwise relationships. The point-in-time distinction matters here: filtered regime probabilities use only information available up to the decision time, whereas smoothed probabilities incorporate future observations and would leak.
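The filtered-versus-smoothed distinction is easiest to see in the forward recursion itself. Below is a minimal sketch of a forward filter for a two-state Gaussian HMM with known parameters. The parameter values are made up, and in practice they would be estimated on training data only. This is an illustration of the recursion, not the notebook's implementation.

```python
import math
from typing import List, Sequence


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def filtered_probabilities(
    obs: Sequence[float],
    init: Sequence[float],
    trans: Sequence[Sequence[float]],
    means: Sequence[float],
    stds: Sequence[float],
) -> List[List[float]]:
    """Forward filter: P(state_t | obs_1..obs_t), using no later observations."""
    n_states = len(init)
    belief = list(init)
    out = []
    for t, x in enumerate(obs):
        if t > 0:
            # Predict: push yesterday's belief through the transition matrix.
            belief = [
                sum(belief[i] * trans[i][j] for i in range(n_states))
                for j in range(n_states)
            ]
        # Update: reweight by the likelihood of today's observation.
        weighted = [belief[j] * gaussian_pdf(x, means[j], stds[j]) for j in range(n_states)]
        total = sum(weighted)
        belief = [w / total for w in weighted]
        out.append(belief)
    return out


# Hypothetical parameters: state 0 is calm, state 1 is volatile.
params = dict(
    init=[0.5, 0.5],
    trans=[[0.95, 0.05], [0.10, 0.90]],
    means=[0.0, 0.0],
    stds=[0.01, 0.04],
)
returns_a = [0.01, 0.00, -0.04, 0.05]
returns_b = [0.01, 0.00, -0.04, -0.10]  # same history, different final observation
probs_a = filtered_probabilities(returns_a, **params)
probs_b = filtered_probabilities(returns_b, **params)
```

The first three rows of `probs_a` and `probs_b` are identical because the filter never looks ahead. A smoother would revise those earlier rows once the last observation arrives, which is exactly the leak the paragraph above describes.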

Notebook highlights:

01_visual_diagnostics, 02_structural_breaks, and 03_fractional_differencing cover stationarity tests, ACF/PACF analysis, rolling diagnostics, break detection, and fixed-width fractional differencing.
04_kalman_filter extracts level, slope, innovation, and uncertainty features, and estimates time-varying hedge ratios for pairs trading.
05_spectral_features uses wavelets, rolling FFT features, Welch power spectral density, and spectral heatmaps.
07_arima_features, 08_garch_volatility, and 09_har_rough_volatility turn time-series models into residual, forecast, uncertainty, conditional volatility, persistence, leverage, and rough volatility features.
10_uncertainty_features uses PyMC stochastic-volatility models and ARIMA forecast uncertainty to build uncertainty inputs.
11_hmm_regimes, 12_wasserstein_regimes, and 13_regime_as_feature compare parametric and distribution-based regime features, including the difference between filtered and smoothed probabilities.
14_panel_features builds cointegration, Kalman hedge ratio, Ornstein-Uhlenbeck half-life, ranking, sector-relative, and universe aggregation features.
case_study_temporal_summary compares model-based feature inventories and incremental IC contributions across case studies.
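Fixed-width fractional differencing, mentioned in the first highlight, comes down to a short weight recursion applied over a trailing window. The sketch below uses the standard binomial-series weights; the function names and the plain-list interface are assumptions for illustration.

```python
from typing import List, Sequence


def fracdiff_weights(d: float, window: int) -> List[float]:
    """First `window` weights of (1 - B)^d via w_k = -w_{k-1} * (d - k + 1) / k."""
    weights = [1.0]
    for k in range(1, window):
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights


def fracdiff(series: Sequence[float], d: float, window: int) -> List[float]:
    """Apply the weights over a fixed trailing window; uses past values only."""
    w = fracdiff_weights(d, window)
    out = []
    for t in range(window - 1, len(series)):
        out.append(sum(w[k] * series[t - k] for k in range(window)))
    return out


half_weights = fracdiff_weights(0.5, 4)                 # [1.0, -0.5, -0.125, -0.0625]
first_diff = fracdiff([1.0, 3.0, 6.0, 10.0], 1.0, 2)    # [2.0, 3.0, 4.0]
```

With d = 1 and a window of two, the recursion reduces to the ordinary first difference. Fractional values of d keep a slowly decaying memory of earlier levels.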

Chapter 10: Text Feature Engineering

Chapter 10 treats financial NLP as feature engineering. It starts with lexicons, bag-of-words, TF-IDF, and static embeddings, then moves through sequential models, Transformers, domain adaptation, fine-tuning, and text-derived trading signals.

Chapter 10 moves from dictionaries and bag-of-words to contextual Transformer representations such as BERT and FinBERT.

The practical focus is timestamp-safe text features: when the text was available, which entity or asset it maps to, how the signal is aggregated, and how it is evaluated against future returns.

Notebook highlights:

01_word2vec_training trains Word2Vec on Financial PhraseBank text and evaluates embeddings with similarity, analogies, and t-SNE.
02_asset_embeddings applies Word2Vec to SEC 13F holdings, treating portfolios as sentences and stocks as words.
03_sentiment_evolution compares TF-IDF, GloVe, and FinBERT on Financial PhraseBank and shows why distribution shift matters.
04_bert_finetuning fine-tunes FinBERT, DeBERTa-v3, and ModernBERT for financial sentiment classification.
05_financial_ner_finetuning fine-tunes a FinBERT-based named-entity recognizer for companies, amounts, and dates.
07_news_return_signals builds news surprise and sentiment factors from FNSPID news and evaluates them against S&P 500 forward returns.
08_text_feature_evaluation evaluates text-derived alpha signals with IC, ICIR, t-statistics, and quintile spreads across 1-day, 5-day, and 20-day horizons.
09_filing_text_signals builds 10-Q MD&A sentiment and narrative-change signals and joins them to S&P 500 returns with point-in-time alignment.

These notebooks prepare research inputs and diagnostics. They do not claim that a feature survives full strategy simulation, transaction costs, portfolio construction, risk controls, or live deployment; those checks come in later releases.

Learn the workflow live

The full live-course and workshop schedule is on the ML4T courses page.

The Machine Learning for Trading: From Research to Production course starts Monday, July 6. The course uses the same workflow in live work: start with a strategy idea, build a baseline, diagnose what failed, and decide what to try next without quietly reusing the holdout.

The repository gives you the code. The course adds review and pace.

You can watch or star the ML4T repository to follow the staged release. The next releases move into model comparison, simulation, portfolio construction, transaction costs, risk, and live deployment.

How to build a Multi-Agent Forecasting System

Stefan Jansen — Thu, 25 Jun 2026 16:09:43 GMT

Most forecasting work in ML4T starts with structured data.

A panel of returns. A table of fundamentals. A calendar of releases. A cross-section of firm characteristics. A feature matrix that tries to preserve what the model could have known at the decision time.

That remains the center of gravity. The third edition still treats data design, labels, baselines, backtests, costs, and monitoring as the hard parts of the workflow. Forecasting agents add an evidence-processing layer to that workflow.

The layer is specific: it can process current unstructured evidence on the fly and turn the result into a structured forecast artifact.

For finance, that evidence might be a central bank statement, company guidance, a filing, a research note, macro commentary, or market chatter that has not yet become a stable feature history. For sports, it might be squad news, injuries, lineup changes, travel, form, or match reports. The domains differ. The pattern is the same: some relevant information arrives as text, search results, or messy context before it arrives as a clean panel.

This Saturday, June 27, we are building this pattern in a live Maven workshop: Building Multi-Agent Forecasting Systems. The workshop uses an AIA-style multi-agent forecaster inspired by Bridgewater AIA Labs: research agents gather evidence; an aggregation layer combines their probabilities; a supervisor checks the result; calibration adjusts the final number; and every run is stored so it can be scored later.

The system should read evidence that a conventional model may miss, produce a probability, and then have that probability compete with a baseline.

The missing input is often not another price series

In a clean modeling setup, the training panel defines the world. If the feature is not in the panel, the model cannot use it. That discipline is important because it prevents hindsight from leaking into the experiment.

But real decisions often happen before the relevant information has been converted into a feature. A company may update guidance, a regulator may change tone, a central bank may revise its language, or a prediction market may move before the event has resolved. Some of that evidence would be hard to reconstruct across twenty years of history.

The same issue appears outside finance. In the World Cup forecaster, a statistical model can estimate the probabilities of a win, draw, or loss from historical match data. Match-specific information is a different kind of input: squad updates, recent form narratives, injury reports, expected lineups, conditions, and tournament context.

An agent can put that evidence into the workflow without pretending it has already become a clean historical feature.

The agent can search, read, summarize, compare sources, reason over contradictions, and emit a probability with a trace of what it used. That output is not automatically correct. It is a candidate input.

The test is whether the candidate input improves the forecast record, the diagnosis, or the decision process when compared with a baseline.

Web search is the easy version

The simplest forecasting agent uses a web search.

Give the agent a binary question, a cutoff date, a search tool, and a prompt that asks for evidence, uncertainty, and a probability. Let several agents search independently. Aggregate their probabilities. Ask a supervisor to check the disagreement. Store the trace. Score the result when the event resolves.

That already teaches most of the system design: resolvable questions, cutoff-aware retrieval, source metadata, schema-bound outputs, aggregation policy, calibration, and baseline scoring.
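The aggregation step can be as small as the sketch below. The three rules and the disagreement threshold are illustrative assumptions; the text does not specify which rule the workshop system uses.

```python
import math
import statistics
from typing import Sequence


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def clip(p: float, eps: float = 1e-6) -> float:
    """Keep probabilities away from 0 and 1 so the logit stays finite."""
    return min(max(p, eps), 1.0 - eps)


def aggregate(probs: Sequence[float], rule: str = "mean") -> float:
    """Combine independent agent probabilities under an explicit, named rule."""
    if rule == "mean":
        return statistics.fmean(probs)
    if rule == "median":
        return statistics.median(probs)
    if rule == "logodds":
        return sigmoid(statistics.fmean([logit(clip(p)) for p in probs]))
    raise ValueError(f"unknown aggregation rule: {rule}")


def needs_review(probs: Sequence[float], max_spread: float = 0.2) -> bool:
    """Flag a question for supervisor review when agents disagree widely."""
    return max(probs) - min(probs) > max_spread


agent_probs = [0.6, 0.7, 0.9]  # toy outputs from three hypothetical agents
p_mean = aggregate(agent_probs, "mean")
p_median = aggregate(agent_probs, "median")
p_logodds = aggregate(agent_probs, "logodds")
flagged = needs_review(agent_probs)
```

Naming the rule in the stored configuration matters more than which rule it is: the ablation later needs to know which aggregate produced each stored probability.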

Live web search is convenient but not replayable. Search rankings change. Pages move. Articles update. Snippets may reflect information unavailable as of the forecast date. Date filters help, but they do not create a clean historical archive. If you want to tune the system rather than merely demo it, a point-in-time evidence archive becomes much more attractive.

For finance, that might mean timestamped news, filings, transcripts, macro releases, prediction-market snapshots, and internal research notes. For sports, it might mean dated match previews, squad announcements, injury updates, odds, and official lineups. The important feature is not that all of this is text. It is that the system can know what was available at the time.

Once the evidence is archived point-in-time, the agent becomes easier to improve:

You can backtest retrieval policies without letting the agent read the future;
You can compare prompts, tools, and agent counts on the same historical evidence;
You can run prompt or program optimization against resolved outcomes;
You can measure whether a supervisor helps or merely adds cost;
You can decide whether debate improves calibration or only produces longer traces.

A demo can run on live search. A system that improves over time needs evidence that it can replay.

Two practical ways to use the forecast

Two deployment patterns cover most practical use cases.

The first is to build the agent in isolation as an additional forecast. It reads evidence, produces a probability, and gets compared with a market price, a statistical baseline, or a human forecast. If it provides incremental information, the output can be ensembled with the baseline or used as a feature in a downstream model.

That is the restrained interpretation of the AIA Forecaster result. On ForecastBench, the report finds that performance is statistically indistinguishable from that of human superforecasters. On MarketLiquid, the paper evaluates 322 liquid prediction-market questions at five forecast dates each, producing 1,610 forecast instances from April 2 to May 23, 2025. The agent alone trails the market consensus there. The practical result is that combining the agent forecast with market consensus beats either source alone on that benchmark.

That is a realistic role: independent, inspectable evidence processing that can add information when combined with a strong baseline.

The second pattern is to enrich the agent with structured inputs from the start. It receives a model prior, market odds, recent structured data, and the current evidence it should inspect.

That is what we are doing in the World Cup agent: model the prior first, current evidence second, and a separate forecast surface throughout.

The World Cup app illustrates this second pattern. A Poisson-style team-strength model provides the prior; the agent reads the current football context and publishes its own win/draw/loss forecast alongside the model. In one pre-match validation run, changing the prompt increased the agent’s mean maximum swing from the model by 3.5 percentage points across eight group-opener fixtures. The goal was to keep the model prior visible while allowing the agent to move when supplied evidence justified it.

If the agent simply rubber-stamps the model, it adds little. If it makes large moves without evidence, it adds noise. Evidence-driven deviations are the cases to track and evaluate.

The AIA-style architecture

Chapter 24 of Machine Learning for Trading uses forecasting because the output can be scored. A forecast has a timestamp, a probability, a resolution rule, and an eventual outcome.

For prediction-market questions, the target is usually a binary event. For match forecasts, the outcomes are mutually exclusive. The architecture is similar, but the scoring layer changes.

The architecture we use in the workshop follows the same practical sequence:

An AIA-style forecasting system turns a question into research-agent traces, an aggregate forecast, supervisor review, calibration, and a persisted probability that can be scored later.

Define the forecast target and cutoff policy.
Run several research agents that search and reason independently under the same cutoff policy.
Aggregate their probabilities under an explicit rule.
Optionally run a debate or specialist review stage when the question set warrants the cost.
Let a supervisor inspect disagreements and perform clarifying searches when needed.
Calibrate the final probability.
Persist the run: question, inputs, traces, configuration, token usage, probabilities, and later scores.

This is a read-only forecasting system. It can pull questions, prices, search results, and evidence. It does not place trades or bets.

That boundary makes the system easier to evaluate. A read-only agent can be wrong without also creating an execution problem. The first product of the system is a probability and a record. Action comes later, if the probability survives comparison with the rest of the workflow.

What has to be measured

Forecasting agents invite a familiar mistake: treating a good explanation as evidence of a good forecast.

The explanation may be useful. It may help diagnose what the system saw. It may reveal a stale source, a missing counterargument, or a brittle prompt. But the forecast has to be judged by scoring rules and baselines.

Calibration remaps probabilities based on past forecast errors. It cannot add evidence the agent missed, and it should be evaluated on a disjoint set of resolved forecasts.

At a minimum, the evaluation should ask:

Does the agent improve Brier score, log score, calibration, or sharpness relative to the baseline?
Does the improvement survive time-disjoint evaluation?
Does the agent add information when combined with market consensus or a statistical model?
Which component actually helps: search, more agents, debate, supervisor, calibration, or a better evidence archive?
Does the benefit justify the cost and latency?
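The first and third questions reduce to a few lines of scoring code. The sketch below computes the Brier score and the mean log loss for an agent, a baseline, and an equal-weight blend. All probabilities and outcomes are toy numbers chosen for illustration, not results from any system described here.

```python
import math
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between probabilities and 0/1 outcomes; lower is better."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_loss(probs: Sequence[float], outcomes: Sequence[int], eps: float = 1e-12) -> float:
    """Mean negative log-likelihood of the outcomes; lower is better."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(probs)


# Toy resolved questions (illustrative only).
outcomes = [1, 0, 1, 0]
agent = [0.8, 0.3, 0.6, 0.1]
market = [0.7, 0.4, 0.7, 0.2]
blend = [(a + m) / 2.0 for a, m in zip(agent, market)]

scores = {
    name: (brier_score(p, outcomes), log_loss(p, outcomes))
    for name, p in [("agent", agent), ("market", market), ("blend", blend)]
}
```

The same functions should be run on a time-disjoint set of resolved questions before any conclusion is drawn about which source adds information.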

A negative result still tells you which component not to keep. An agent may fail to beat a baseline on its own and still improve an ensemble or help diagnose model blind spots. If it only produces plausible prose, it should not survive the ablation.

This is also why point-in-time evidence matters. Without it, a backtest can become a story about leakage. With it, the same architecture can become a repeatable experiment: the same question, the same cutoff, the same archive, different prompt or tool policy, scored after resolution.

What we will build on Saturday

The workshop is built around a working AIA-style forecaster.

Students run a CLI and dashboard, inspect individual agent traces, change profiles, compare single-agent and multi-agent runs, look at aggregation and calibration, and see how forecast runs are stored for later scoring. The deterministic replay path works without API keys. Live model and search providers are optional upgrades.

The day is deliberately practical:

start from a resolvable question;
run one research agent and inspect its evidence trail;
move from one agent to several;
aggregate and calibrate the probabilities;
store the run so it can be evaluated later;
discuss how the same pattern generalizes from prediction markets to finance, sports, and other decision workflows.

The ML4T link is the evaluation habit: preserve the timestamp, compare with a baseline, and remove components that do not improve the record. Forecasting agents make a new class of inputs available to that workflow.

The live workshop is this Saturday: Building Multi-Agent Forecasting Systems.

The free Lightning Lesson page is here: Build Multi-Agent Systems You Can Audit.

In the workshop, a component earns its place only if the stored runs show that it improves the forecast record, the diagnosis, or the evaluation against the baseline.

New Release: The ML4T Data Layer Is Now Public

Stefan Jansen — Mon, 22 Jun 2026 12:30:49 GMT

Today, we are releasing the code for the first five chapters of the third edition of Machine Learning for Trading.

Chapter 1, The Process Is Your Edge, sets up the workflow: define the research problem, keep exploration separate from confirmation, and treat live degradation as something the process has to handle. Chapters 2-5 then make the data foundation inspectable.

Data infrastructure supports strategy research and live deployment. This release provides access to the book's coverage of the infrastructure layer: the data and chapter notebooks that later case studies build on.

This release also arrives as we start teaching the ML4T workflow live. The free From trading idea to validated strategy Lightning Lesson runs Wednesday, June 24, at 16:00 UTC / noon ET; the full Machine Learning for Trading: From Research to Production course starts July 6. The course is organized around this loop: start with a research question, build the data contract, cross the evidence boundary only when the experiment deserves it, and carefully track the live feedback loop.

In trading research, few things matter more than the information contained in the data and how it is handled: clocks, sessions, adjustments, revisions, identifiers, contract rolls, venue rules, licensing limits, and the question of what a strategy could have known at the time of the decision. A later model result only means something once those choices are visible.

A point-in-time accounting example from Chapter 4. The final database view shows the same $100M revenue value across the year; the lower track shows what was actually available after the original filing and later restatements.
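The same idea in code: a minimal as-of lookup over a filing history. Only the final $100M figure comes from the example above. The intermediate values, the dates, and the function name are hypothetical, and the sketch assumes a value is usable on the day it becomes public.

```python
from bisect import bisect_right
from datetime import date
from typing import List, Optional, Tuple

# Hypothetical history for one fiscal period:
# (date the figure became public, reported revenue in $M).
history: List[Tuple[date, float]] = [
    (date(2024, 2, 15), 92.0),    # original filing (made-up value)
    (date(2024, 6, 10), 97.0),    # first restatement (made-up value)
    (date(2024, 11, 5), 100.0),   # later restatement; what the final database shows
]


def value_as_of(records: List[Tuple[date, float]], decision_date: date) -> Optional[float]:
    """Return the latest value public on or before decision_date, else None."""
    dates = [d for d, _ in records]  # records must be sorted by availability date
    pos = bisect_right(dates, decision_date)
    if pos == 0:
        return None
    return records[pos - 1][1]


before_filing = value_as_of(history, date(2024, 1, 31))     # None
after_original = value_as_of(history, date(2024, 3, 1))     # 92.0
on_restatement = value_as_of(history, date(2024, 6, 10))    # 97.0
year_end = value_as_of(history, date(2024, 12, 31))         # 100.0
```

A feature built from the final database view would use 100.0 for every decision date in the year, including dates on which that number did not yet exist.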

This release adds the first five chapter directories and the central data/ package that later case-study work builds on:

Chapter 1 adds the workflow frame. The four data chapters contain 62 notebooks. The data/ directory adds the catalog, download scripts, loaders, configs, and access notes that turn the chapter examples into something a reader can inspect and run. The implementation detail worth checking first: loaders return Polars DataFrames through a consistent API, and missing data raises setup instructions with explicit download guidance.

What the data layer contains

The top-level data catalog currently lists 31 dataset entries. Financial data does not fit cleanly into one giant downloadable bundle. Some sources can be fetched without an API key. Some need a free key. Some are paid. Some require a manual provider download. Some reduced licensed packages are still being prepared for hosting.

The catalog groups entries around the constraints a research workflow inherits: daily market data, microstructure feeds, options data, fundamentals, positioning, macro series, prediction markets, on-chain metrics, news, and text. That structure makes setup expectations and redistribution limits visible before a reader spends time trying to reproduce a notebook.

The synthetic-data chapter takes a different angle. It asks how to create alternative histories for robustness analysis when one realized market path is not enough. The notebooks start with classical simulation and move through TimeGAN, Tail-GAN, Sig-CWGAN, GT-GAN, Diffusion-TS, LLM-based tabular generation, and differentially private GAN training. The diagnostics focus on stylized facts, dependence structure, downstream task utility, and privacy constraints.

Chapters 2-5 in one pass

Chapter 2 gives the map. It covers the financial data universe by asset class and source type, then moves into due diligence and storage. The practical thread is that provider data already contains research choices: corporate actions, survivorship, contract construction, point-in-time availability, storage format, and update policy.

Chapter 3 gets closer to the tape. Market data is the output of market design: sessions, order types, venue rules, visible liquidity, and timestamp conventions. The notebooks parse NASDAQ TotalView-ITCH, reconstruct limit order books, validate trade-direction classification, compare bar-sampling rules, and show why the sampling clock changes the return distribution a model sees.

A Chapter 3 notebook turns the same AAPL trading day into one-minute bars, 10,000-share volume bars, and tick imbalance bars. The choice of sampling clock changes the observations a model receives before feature engineering begins.
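A minimal sketch of the volume-bar idea, assuming trades arrive as (price, size) pairs in time order. The function name, the Bar fields, and the trade values are hypothetical, and the sketch simplifies by never splitting a trade across two bars, so a bar can overshoot the threshold.

```python
from dataclasses import dataclass


@dataclass
class Bar:
    open: float
    high: float
    low: float
    close: float
    volume: int
    n_trades: int


def volume_bars(trades, threshold):
    """Group (price, size) trades into bars of at least `threshold` shares."""
    bars, prices, volume = [], [], 0
    for price, size in trades:
        prices.append(price)
        volume += size
        if volume >= threshold:
            # Close the bar as soon as cumulative volume reaches the threshold.
            bars.append(
                Bar(prices[0], max(prices), min(prices), prices[-1],
                    volume, len(prices))
            )
            prices, volume = [], 0
    return bars  # a trailing partial bucket is dropped


# Hypothetical trades: (price, shares)
trades = [(100.0, 4000), (100.2, 3000), (100.1, 5000),
          (99.9, 10000), (100.3, 2000)]
bars = volume_bars(trades, threshold=10_000)
for bar in bars:
    print(bar)
```

Here the first bar needs three trades and the second needs one, which is the point of the sampling clock: a bar closes when activity accumulates, not when a minute ends.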

Chapter 4 turns point-in-time correctness into implementation work. It covers SEC EDGAR and XBRL data, Form 4 insider transactions, 13F holdings, entity resolution, macro release alignment, CFTC futures positioning, on-chain fundamentals, prediction markets, and the extraction of filing text for later NLP work. Financial information must be aligned with the time it became available, as well as the fiscal period or event it describes.
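The alignment rule can be sketched as an as-of lookup. The filing dates and values below are hypothetical, and as_of is an illustrative helper, not part of the repo: it returns the latest value that had been published on or before the decision date.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical publication history for one fiscal year's revenue, in $M.
# Each entry: (date the value became public, reported value), sorted by date.
filings = [
    (date(2025, 2, 15), 92.0),   # original filing
    (date(2025, 8, 1), 97.0),    # first restatement
    (date(2026, 1, 20), 100.0),  # later restatement, the value a final view shows
]


def as_of(history, decision_date):
    """Latest value published on or before `decision_date`, else None."""
    publication_dates = [published for published, _ in history]
    i = bisect_right(publication_dates, decision_date)
    return history[i - 1][1] if i else None


print(as_of(filings, date(2025, 1, 31)))  # nothing published yet
print(as_of(filings, date(2025, 3, 1)))   # original filing only
print(as_of(filings, date(2026, 6, 1)))   # after all restatements
```

A backtest that reads the final value for every decision date would treat the restated figure as if it had been known from the start.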

Chapter 5 handles synthetic data with a practical standard: fidelity, utility, and privacy. Generated histories are useful when they stress robustness claims and privacy constraints; extra sample count alone cannot make a backtest more convincing.

From dataset to case study

The third-edition case studies begin with a market, a universe, a decision schedule, and a data definition. The case-study table in the main README is already public; the detailed case-study notebooks will be included in later release batches.

The futures path shows the point. A CME futures example starts with contracts, sessions, UTC timestamps, roll rules, continuous-series construction, and positioning data that arrives on its own schedule. A term-structure signal built on that data depends on each of those choices before any model sees a feature matrix.

The firm-characteristics case has a different failure mode. A monthly cross-sectional result only means something if the accounting data is lagged and aligned point-in-time before labels and features are built. The README lists all nine case studies, including the distinct S&P 500 equity-plus-options and options-only tracks.

That is why the data release comes before the model chapters. Later chapters can ask whether a model improves a score. This release sets up the prior check: whether the score was computed on a dataset whose timing, universe, costs, and availability rules would have held up against the actual research problem.

How to inspect the release

A narrow first pass works better than browsing every notebook.

Start with data/README.md. It explains the dataset catalog, access categories, loader names, storage tiers, and the quick-start commands. The repo expects the data path in a root .env file:

# In the repository root .env file
ML4T_DATA_PATH=/path/to/your/data

# Then download free datasets
uv run python data/download_all.py --free-only

Then pick one path.

If you want the lightest local setup, start with ETFs, factors, and crypto. The minimum tier is about 70 MB. Adding macro and FX data still leaves the standard tier around 75 MB, assuming the free API keys are configured. Adding the broad US-equities dataset takes the footprint to roughly 740 MB. Adding CME futures brings it to roughly 825 MB. The full setup, including the larger microstructure and reduced licensed packages, is closer to 7 GB.

That separation matters because a reproducible notebook is only reproducible within the data rights and access assumptions it declares.

A practical first pass is:

open the data/ README;
run one free dataset download;
open the matching chapter notebook;
inspect the loader output and canonical timestamp/entity columns;
trace that dataset into the case-study table in the main README;
then move to the Chapter 6 feasibility notebook when that batch lands.

The next ML4T Insights issues will follow the same sequence: data first, then strategy definition, labels, features, model comparison, backtesting, costs, risk, and deployment.

For the walkthrough

The June 24 Lightning Lesson starts from this data contract: choose a market, define the decision problem, build a first baseline, diagnose what failed, and make the next iteration measurable without quietly reusing the holdout.

The July 6 cohort uses the same case-study structure as the book and repo, with sequencing, feedback, and live review.

The repo gives readers the artifacts. The course adds the working cadence: turn the artifacts into a baseline, read what the baseline says, and decide what deserves the next experiment.

You can watch or star the repo to follow the staged release. The lesson is a short live walkthrough of why the release starts with the data contract before moving to models.

The ML4T Third-Edition Code Rollout Starts Today

Stefan Jansen — Fri, 19 Jun 2026 15:12:24 GMT

The Machine Learning for Trading repository has carried much of the practical load for ML4T. Some readers used it to run notebooks alongside the chapters. Others found the book through the code first. With more than 19,000 stars and 5,300 forks, it has become the public memory of the first two editions. It also showed up again recently on GitHub Trending for Jupyter Notebook repositories, which is a useful reminder that the old repo is still being discovered while the new edition is being prepared.

Star history for the public ML4T repository, showing steady growth from 2019 through 2026. The third-edition rebuild starts from an existing reader base that has accumulated over several years.

The third-edition README is now live on the main branch. The older material is preserved on the first-edition and second-edition branches. The new code will be rolled out in stages over the next few weeks leading up to the July launch.

That staged release has a practical reason. A large trading-ML repo can easily turn into a larger notebook archive: more models, more datasets, more examples, more ways to get lost. The third-edition rebuild is organized around a different question:

When a forecast looks promising, what has to be checked before it becomes a trading decision?

The third-edition workflow separates strategy research from evaluation with an evidence boundary, then closes the loop through deployment, monitoring, and the decision to retrain, pause, or retire a strategy.

That is also the question behind next Wednesday’s free Lightning Lesson “From trading idea to validated strategy”. The repo is the public resource; the lesson is the short guided walkthrough.

The work after the forecast

A familiar research sequence goes like this. A model produces a plausible return forecast. The validation IC improves. A backtest looks better than the baseline. Then the harder checks begin.

Was the asset universe fixed before the test? Did the label line up with what was knowable at the time of the decision? Were overlapping labels handled correctly? How many variants were tried before this one survived? Did the result clear the cost model? Did turnover make the signal unusable? Does the portfolio rule amplify a weak forecast or diversify it? What happens after position limits, drawdown rules, and execution assumptions are applied?

The book calls one part of this the evidence boundary: the line between exploration and confirmation. On one side, you search, tune, compare labels, test feature families, and improve the baseline. On the other side, you spend a holdout on a result that was already specified well enough to deserve the test.

That boundary changes the decision a result can support. A strategy can meet a naive Sharpe threshold and still fail once the number of variants tested is taken into account. A feature can look predictive until its timestamp is checked. A model can improve rank correlation and still lose once costs and turnover are factored in.

At that point, the right response is to return to exploration, narrow the claim, or stop. The holdout has already answered the question it was allowed to answer.

That is the practical center of the third-edition workflow.
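The point about the number of variants tested can be made concrete with a small simulation. It is a sketch under stated assumptions: every variant has zero true edge, daily returns are independent Gaussian draws with 1% volatility, and the function names are illustrative.

```python
import random
import statistics


def annualized_sharpe(daily_returns):
    """Sample Sharpe ratio scaled to a 252-day year."""
    mean = statistics.fmean(daily_returns)
    std = statistics.stdev(daily_returns)
    return mean / std * 252 ** 0.5


def best_of_n_null_sharpe(n_variants, n_days=252, seed=0):
    """Best annualized Sharpe among variants that have no true edge."""
    rng = random.Random(seed)
    sharpes = [
        annualized_sharpe([rng.gauss(0.0, 0.01) for _ in range(n_days)])
        for _ in range(n_variants)
    ]
    return max(sharpes)


single = best_of_n_null_sharpe(1)
best_of_200 = best_of_n_null_sharpe(200)
print(f"one variant: {single:.2f}, best of 200: {best_of_200:.2f}")
```

The best of many worthless variants looks like a strategy. That is why a naive Sharpe threshold has to be read against how many variants were tried before the survivor was picked.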

Three ways into the repo

The new README is the route map for the staged code release. It shows three useful ways to enter the material as the directories fill in.

First, enter by workflow stage. If the problem is data reliability, start with the financial data layer. If the problem is label design, feature triage, or leakage, start with research design and feature engineering. If the problem is model comparison, start with the model chapters. If the backtest looks too clean to believe, start with portfolio construction, costs, risk, and strategy synthesis. If you are working with RAG, knowledge graphs, or agents, start with the advanced-AI chapters. If the issue is live operation, start with deployment, monitoring, and the retrain-pause-retire loop.

Second, enter by case study. The case-study page lists nine studies across seven asset classes, 168 notebooks, and six pipeline stages. Each one puts pressure on a different part of the process:

ETFs are a cleaner daily cross-asset setting for following the full loop.
Crypto perpetuals introduce funding rates, shorter decision intervals, and a market structure that differs sharply from equities.
NASDAQ-100 microstructure makes bar construction, order flow, and intraday costs part of the modeling problem.
S&P 500 equity plus options uses implied-volatility information to improve equity selection.
US firm characteristics revisit the canonical monthly cross-sectional factor problem, where point-in-time data discipline does much of the work.
FX pairs expose the limits of a small cross-section when shared macro and dollar factors are present.
CME futures force rolls, term structure, and sector structure into the foreground.
S&P 500 options test whether an options-only strategy can survive labels, hedging, and instrument-specific costs.
The broad US equities panel asks whether weak individual signals become useful when scaled across thousands of names.

Third, enter by library layer. The current library page lists six workflow-aligned packages: ml4t-data, ml4t-engineer, ml4t-models, ml4t-diagnostic, ml4t-backtest, and ml4t-live. A reader can use a single layer without adopting the entire stack. A data team may only need acquisition and storage. A researcher may only need labels, features, and diagnostics. A course participant may start with a baseline and then decide whether the next change belongs in data, labels, validation, costs, allocation, or execution.

What to inspect first

The first release wave is the data and research foundation. That is less fashionable than agents or deep learning, but it is where many strategy problems start.

The data chapter spans market, fundamental, and alternative data; corporate actions; futures and continuous contracts; options; crypto; FX; point-in-time validation; survivorship; provider comparison; incremental updates; and storage benchmarks.

The microstructure chapter moves from raw exchange messages to feature-ready bars: NASDAQ ITCH parsing, limit-order-book reconstruction, order lifecycle analysis, TAQ-style examples, Lee-Ready validation, alternative bar sampling, and intraday jump detection.

The synthetic-data chapter is another natural deep dive. Since the second edition, the topic has moved well past “can a GAN produce realistic-looking prices?” The current problem is more practical: whether a generator preserves the structure a downstream research task needs, without leaking the original data or creating a world that is easier than the real one. The chapter covers classical simulation, TimeGAN, Tail-GAN, Sig-CWGAN, diffusion-based generators, LLM-based tabular generation, and privacy-aware variants.

If you open the repo as the release unfolds, a reasonable first inspection path is:

pick one case study instead of browsing every notebook;
read the data construction before the model notebook;
locate the label definition, horizon, and decision timestamp;
check the validation split, purge, embargo, and holdout rule;
inspect cost assumptions before performance statistics;
treat the first backtest as a diagnostic, not evidence;
look for the handoff from forecast output to portfolio rule, risk limit, and deployment decision.
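For the validation step in that list, a purge and embargo can be sketched on integer sample positions. The assumptions are loud: the label of sample i uses information from i through i + horizon, the test block is contiguous, and the function name and example sizes are hypothetical.

```python
def purged_train_indices(n_samples, test_start, test_end, horizon, embargo):
    """Training indices around a test block [test_start, test_end).

    Purge: drop samples whose label window [i, i + horizon] overlaps the
    label windows of the test block, which span test_start through
    test_end - 1 + horizon.
    Embargo: drop a further `embargo` samples after that span.
    """
    first_allowed_after = test_end + horizon + embargo
    return [
        i for i in range(n_samples)
        if i + horizon < test_start or i >= first_allowed_after
    ]


train = purged_train_indices(
    n_samples=20, test_start=8, test_end=12, horizon=2, embargo=1
)
print(train)
```

With a two-step label horizon, samples 6 and 7 are purged before the test block because their labels reach into it, and samples 12 through 14 are removed after it by the purge and the embargo together.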

That is also the editorial path for the next set of ML4T Insights issues. The useful issues will turn the release into resources readers can keep: a financial-data failure-mode map, a guide to when tick data changes the research problem, an update on synthetic financial data after GANs and diffusion, a case-study access map, and a checklist for when a backtest should count as evidence.

For the first time: live course and workshop

The repo is the public resource. The course and workshop are the guided version of the same workflow. They aim to help you improve your ML4T research and trading practice, not just repeat what’s in the book.

On Wednesday, June 24, there are two free Maven Lightning Lessons:

From trading idea to validated strategy at 16:00 UTC
Build Multi-Agent Systems You Can Audit at 15:00 UTC

The first lesson is the closest companion to this release. It walks through the same transition from idea to baseline, from baseline to diagnosis, and from diagnosis to a cleaner second version without quietly reusing the holdout.

The second lesson connects to the agent material in the later chapters and to the forecasting-agent workshop. It uses the same discipline in a narrower setting: define the task, preserve the trace, compare against a baseline, and score the output.

The full Machine Learning for Trading: From Research to Production course runs July 6 to August 29. It is the live cohort version of the workflow: choose a case-study path, build a baseline, diagnose failure modes, iterate honestly, and defend the result against a published rubric.

The Building Multi-Agent Forecasting Systems workshop runs June 27. It is the hands-on route through the agent side: multi-agent forecasting, trace capture, ablations, and scoring.

Neither course is required to use the public repo. The reason to take one is sequencing, feedback, and time spent applying the workflow with other people.

Start with the map

The practical next step is to watch or star the repo if you want to follow the code release as it unfolds.

The useful test for this release is simple: can a reader open the repo, pick a market, and understand what has to be checked before a model result becomes strategy evidence?

Machine Learning for Trading, 3rd Edition on GitHub

Watching the World Cup With a Forecasting Agent

Stefan Jansen — Tue, 16 Jun 2026 14:36:30 GMT

The World Cup is one of the few forecasting problems people argue about for pleasure, and it rarely lets a confident prior off the hook.

Morocco reached the semifinal in 2022. Costa Rica topped a 2014 group that included Uruguay, Italy, and England. Germany won the tournament in 2014, then went out in the group stage in 2018 and again in 2022. The tournament has a long record of embarrassing anyone too sure about it.

That is exactly what makes it a fun public forecasting lab. The questions are the ones everyone already has an opinion on, the answers come back within hours, and once the match is over, a forecast has very little room to talk its way out of a bad call.

By the morning of June 16, the World Cup Forecast Agent had already built a record worth picking apart. Spain was a 75.6% favorite in the model and an 80.0% favorite in the agent’s call; Cape Verde held Spain to 0-0. Saudi Arabia-Uruguay and Iran-New Zealand ran the same way: the agent nudged the favorite higher, and both matches finished level.

That is the reason for publishing the lab while the tournament is still underway. Every pre-match number is locked, the trace behind the agent’s adjustment is stored, and the call is scored once the result is in.

Before kickoff, the traces can read sensibly. A few hours later, the scoreboard may have a different opinion.

A June 16 fixture strip from the live app. Each row shows the model call and the agent call before kickoff, so readers can see the adjustment before the result arrives.

A free Lightning Lesson on building multi-agent forecasting systems is linked at the end of this issue.

The model gives the prior

The baseline starts with a familiar statistical idea: estimate each team’s attacking and defensive strength from international results, weight those results by competition and recency, then simulate the tournament bracket.
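One common way to turn strength estimates into match probabilities is an independent-Poisson scoreline model. The issue does not say that the app uses exactly this form, so treat it as an illustrative assumption; the goal rates and function names below are hypothetical.

```python
from math import exp, factorial


def poisson_pmf(k, rate):
    """Probability of exactly k goals at the given expected-goals rate."""
    return exp(-rate) * rate ** k / factorial(k)


def match_probabilities(rate_a, rate_b, max_goals=10):
    """Win/draw/loss probabilities for team A from two independent goal rates."""
    win = draw = loss = 0.0
    for goals_a in range(max_goals + 1):
        for goals_b in range(max_goals + 1):
            p = poisson_pmf(goals_a, rate_a) * poisson_pmf(goals_b, rate_b)
            if goals_a > goals_b:
                win += p
            elif goals_a == goals_b:
                draw += p
            else:
                loss += p
    total = win + draw + loss  # renormalize the truncated scoreline grid
    return win / total, draw / total, loss / total


# Hypothetical expected goals: a favorite at 1.8 against an underdog at 0.7.
win, draw, loss = match_probabilities(1.8, 0.7)
print(f"win {win:.3f}, draw {draw:.3f}, loss {loss:.3f}")
```

Simulating a bracket then amounts to sampling match outcomes from these probabilities round by round and counting how often each team wins the final.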

As of the current public run, the model’s leading champion probabilities are:

Spain: 14.2%
Brazil: 12.9%
Argentina: 11.8%
England: 9.3%
France: 6.5%
Portugal: 6.5%
Germany: 5.0%
Colombia: 4.6%

Treat those as a dated snapshot. They are where the forecast starts, not where it ends.

The live forecast table keeps model champion probabilities alongside Polymarket and Kalshi prices, rather than blending them into a single consensus number.

The baseline is useful because it is systematic. It is fit on more than 47,000 international results, checked in walk-forward tests, and straightforward to score after the fact. The forecast table sets the model’s champion odds next to Polymarket and Kalshi prices without averaging the three, which keeps the disagreement on the page rather than dissolving it into a single tidy consensus figure.

The baseline is also restricted to variables we can reconstruct over time. Injuries, lineup news, weather, travel, squad value, current form: a model could use all of it if each input came with a clean, point-in-time history. Most do not. Some arrives as prose; some is patchy; and for many variables, there is no decades-long archive of comparable snapshots from the past. Pour today’s headline into a model that was never trained on yesterday’s headlines, and you have introduced a new bias, not improved the prior.

That gap is the agent’s job. It can work with current, partly structured information without demanding a 50-year feature history. In production, the next step would be to rebuild richer point-in-time datasets and tune prompts, tools, and aggregation rules against historical cases. This app runs the lighter version: hold the quantitative prior steady, let the agent make a separate context adjustment, and score that adjustment on its own.

The agent calls the matchup

Before each match, the agent receives the model’s win/draw/loss probabilities, expected goals, recent form, head-to-head history, availability signals, key players, match conditions, a relative squad-value signal, and, when available, recent search results.

The part closest to Bridgewater AIA Labs’ AIA Forecaster is the research step. The agent first decides what is actually worth checking for a fixture: a late fitness scare, a suspension, a venue condition, a question about one specific player. It then runs a brief Tavily search, retains the sources relevant to the question, and files the query trace alongside the forecast.

The squad-value signal is kept relative by design: rank within the 48-team field, percentile, value ratio, and marquee names rather than raw euro totals. That gives the agent a current-talent check against a prior built from results, without pretending a transfer-market price tag is a starting eleven. It is a serviceable proxy that still carries the usual biases around age, league, club brand, and the transfer market itself.

The trace is what turns “the agent looked into the match” into something a reader can examine. For the June 16 France-Senegal call, the agent ran three searches, kept nine sources, and moved France from 50.3% to 61.7%. The source list included a piece flagging William Saliba as doubtful and a later one that walked some of that worry back. The agent is reading a noisy evidence stream, not the truth; the trace lets a reader see what it leaned on and judge whether the move was earned.

The forecast itself comes from a small ensemble of language model calls. Each one returns win/draw/loss probabilities, a scoreline distribution, and a factor-by-factor account of its reasoning. The system aggregates those passes into a single forecast and publishes it as the agent’s call, set beside the statistical model rather than folded into it.

A single agent call card from the live app. The card exposes the model-to-agent shift, likely score, short rationale, source count, and grading status.

What the live record adds

A tournament forecast table is easy to publish and easy to forget by the weekend. The World Cup Forecast Agent pulls apart the pieces such tables usually blend:

the model prior
the market’s outside view
the match’s current context
the agent’s reading of it
the eventual score

A market disagreement is not an injury note. A model probability, a research summary, and a language model’s explanation are three different objects. Holding them apart makes the forecast inspectable and scoreable.

That turns the World Cup into a gentle version of a harder ML4T problem. Many forecasting workflows begin with a quantitative prior and then have to wrap it in messy, recent, and largely unstructured information. The question that matters is whether the agent contributes a distinct, reviewable adjustment around the model or just restates it in livelier language.

Football makes that concrete. The match is played. The call can be checked. The scoreboard does the rest.

The live app is already showing its shape. As of the June 16 morning snapshot, 16 calls had been graded, with agent and model tied at 37.5% top-pick accuracy. The model’s live Brier score was 0.7271 against 0.704 for the naive-prior floor. Lower is better, so the model had not yet pulled clear of the floor, which itself was hitting 43.8% on the same 16 matches. Sixteen games is far too few to render a verdict, but the machinery is already doing its work: publish the call, score the result, let the record stack up.

The scorecard uses multiclass Brier for match outcomes, with signed per-call deltas showing whether each agent adjustment helped or hurt the model’s score. By the end of the tournament, the bar should be higher than “the agent sounded informed.” The adjustment has to improve the scoring record, sharpen the diagnosis, or both.
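A minimal sketch of that scoring, assuming the sum-over-classes form of the multiclass Brier score (range 0 to 2, lower is better). The favorite probabilities are the Spain numbers quoted earlier in this issue; the split of the remaining probability between draw and loss is hypothetical, as are the function names.

```python
OUTCOMES = ("win", "draw", "loss")


def multiclass_brier(probs, outcome):
    """Sum of squared errors between the forecast and the one-hot outcome."""
    return sum(
        (probs[k] - (1.0 if k == outcome else 0.0)) ** 2 for k in OUTCOMES
    )


def agent_delta(model_probs, agent_probs, outcome):
    """Signed change in Brier score; positive means the adjustment hurt."""
    return (multiclass_brier(agent_probs, outcome)
            - multiclass_brier(model_probs, outcome))


# Favorite at 75.6% (model) and 80.0% (agent); draw/loss split is hypothetical.
model = {"win": 0.756, "draw": 0.160, "loss": 0.084}
agent = {"win": 0.800, "draw": 0.130, "loss": 0.070}

model_score = multiclass_brier(model, "draw")
agent_score = multiclass_brier(agent, "draw")
delta = agent_delta(model, agent, "draw")
print(f"model {model_score:.4f}, agent {agent_score:.4f}, delta {delta:+.4f}")
```

When the match finishes level, nudging the favorite higher produces a positive delta, so the adjustment counts against the agent on that call.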

Why this belongs with the forecasting-agent workshop

The World Cup build is intentionally lightweight. It is the playful cousin of Bridgewater AIA Labs’ AIA Forecaster: start from a prior, pull current evidence, run independent forecast passes, aggregate them, store the trace, and score the call once it resolves.

The differences are worth being explicit about. The AIA technical report includes supervisor reconciliation and statistical calibration that this app does not claim to reproduce. AIA’s result is also more interesting than the usual “agent beats the market” story: on a liquid-markets benchmark, the agent underperformed market consensus on its own, while an ensemble of AIA forecasts and that same consensus beat the consensus alone. That is why this app keeps model, market, and agent in view at the same time.

Inside ML4T, this is the second live forecasting-agent surface. The ML4T Agent Lab tracks AIA-style questions across Polymarket and Kalshi; the World Cup app brings the same discipline (logged priors, sourced adjustments, and scored outcomes) to a tournament anyone can follow.

The workshop pushes these choices in a less-forgiving setting: independent research, reconciliation, calibration, ablations, and replayable traces. The World Cup app is the easy way in. You can pull up a match, disagree with the number, and look at exactly what the agent checked before kickoff.

How to read the live app

There are a few good places to start. The model view carries champion odds, match probabilities, team ratings, and market comparisons. The agent view holds the upcoming and resolved calls: the model-to-agent shift, the factor trace, the searches and sources, and the date each call was generated. The scorecard is where the misses surface. Once a result lands, every pre-match call is graded against both the outcome and the model.

It will be wrong, and often. Football is low-scoring, noisy, and decided by small events that can swamp reasonable priors. That is most of the reason to publish the record at all.

Watch the matches, argue with the numbers, and check back as the scoreboard fills in.

World Cup Forecast Agent live app

Learn how to build multi-agent forecasting systems

If you want to see how the architecture travels beyond football, next week’s free Maven Lightning Lesson walks through the design of auditable multi-agent systems: Build Multi-Agent Systems You Can Audit.

The full hands-on workshop is here: Building Multi-Agent Forecasting Systems.

When multi-agent systems are worth the overhead

Stefan Jansen — Wed, 10 Jun 2026 16:19:25 GMT

Chapter 24 of Machine Learning for Trading treats agents as engineering systems rather than chat interfaces. One of its capstones is a multi-agent forecasting system: several research agents gather evidence, an aggregation rule combines their probabilities, a supervisor reconciles disagreement, and the final forecast can later be scored.

The accompanying notebooks test where this architecture helps, where it adds cost, and where a simpler baseline would have been enough.

One example is the adversarial-debate notebook. A bull-bear debate ran for three rounds on a Fed-rate question. The bull ended at 0.78, the bear at 0.35, and the gap stayed fixed at 0.43 across all three rounds. The debate added more than 6,000 tokens and moved the blended forecast by one percentage point.

The result provides a practical lesson for multi-agent systems: every extra agent stage has to justify its cost.

Multi-agent systems are easy to make look sophisticated. Add several model calls, assign roles to them, ask them to debate, and place a supervisor at the end. The harder question is whether the added structure produces independent evidence, actionable disagreement, better reconciliation, or a better-scored output than a simpler baseline.

Forecasting is a good lab for that question because the output is hard to hide. A probability can be timestamped, compared with a market price or base rate, calibrated, ablated, and scored after the event resolves.

Chapter 24’s full multi-agent forecasting pipeline: specialist research agents feed aggregation, optional debate, supervisor reconciliation, a forecast artifact, and audit artifacts. The question for this issue is which stages earn their cost.

The realistic standard

Bridgewater AIA Labs’ AIA Forecaster technical report is one of the clearest public blueprints for this pattern.

The report describes a multi-agent forecasting system with agentic search, supervisor reconciliation, and statistical calibration. It reports performance that is statistically indistinguishable from human superforecasters on ForecastBench, but it also reports a more sobering result on liquid prediction-market questions: AIA Forecaster underperforms the market consensus on its own. The interesting result is the ensemble. On that harder benchmark, combining market prices with AIA forecasts performs better than either source alone.

The standard for agent systems in this setting is narrower than market replacement. The agent has to add information that survives comparison with a strong baseline.

The Chapter 24 implementation starts with a single ReAct research loop, then adds tool contracts, explicit state, multi-agent forecasting, evaluation, governance, and a separate ML4T research operator. This issue focuses on the forecasting system because it most clearly exposes the multi-agent design.

The pipeline

The forecasting-agent example starts with a question that can be resolved later, for instance on a prediction market.

“Will inflation be high?” is not enough. A resolved example from the Lightning Lesson is precise enough to score: “Will US core PCE for April 2026 exceed 0.3% month over month in the BEA Personal Income and Outlays release scheduled for May 28, 2026?” The BEA release reported core PCE at 0.2% month over month, so the event resolved “no.”

After the question is fixed, several research agents work independently. Each agent searches for evidence, records what it finds, reasons about the result, and returns a probability with supporting evidence. In the AIA design, these agents are parallel evidence-gathering runs rather than role-playing characters.

The system then combines the probabilities. The simple mean is the reference point. The median and trimmed mean reduce the influence of outliers. Extremization pushes an aggregate away from 0.5 when forecasters are treated as sufficiently independent.
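The four rules can be sketched in a few lines. The forecasts are hypothetical, and the extremization shown here scales the log-odds by a constant factor, which is one common form and not necessarily the one used in the Chapter 24 notebooks.

```python
from math import exp, log
from statistics import fmean, median


def trimmed_mean(probs, trim=1):
    """Mean after dropping the `trim` lowest and `trim` highest forecasts."""
    if len(probs) <= 2 * trim:
        raise ValueError("not enough forecasts to trim")
    return fmean(sorted(probs)[trim:len(probs) - trim])


def extremize(p, factor):
    """Scale the log-odds of p; factor > 1 pushes p away from 0.5."""
    logit = log(p / (1 - p))
    return 1 / (1 + exp(-factor * logit))


# Hypothetical probabilities from five research agents, one of them an outlier.
forecasts = [0.62, 0.70, 0.66, 0.95, 0.68]

mean_p = fmean(forecasts)
median_p = median(forecasts)
trimmed_p = trimmed_mean(forecasts)
extremized_p = extremize(trimmed_p, factor=1.5)
print(mean_p, median_p, trimmed_p, extremized_p)
```

The outlier pulls the simple mean up, while the median and trimmed mean stay near the cluster. Extremizing is only defensible when the forecasters carry independent information.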

One diversity adjustment used in the Chapter 24 notebooks is:

$$D = \sqrt{\frac{n}{1 + (n - 1)\rho}}$$

Here, n is the number of forecasters, ρ is their average pairwise correlation, and D quantifies how far the aggregate can deviate from 0.5 after accounting for dependence among forecasters. With three independent forecasters, D = 1.73. At pairwise correlation 0.7, it falls to 1.12. Three correlated agents are only marginally more informative than a single agent.
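The two values quoted above are consistent with D = sqrt(n / (1 + (n - 1) * rho)), where n is the number of forecasters and rho is their average pairwise correlation. The sketch below assumes that form; the function name is illustrative.

```python
from math import sqrt


def diversity_factor(n_forecasters, pairwise_correlation):
    """Effective-independence factor for n equally correlated forecasters."""
    return sqrt(
        n_forecasters / (1 + (n_forecasters - 1) * pairwise_correlation)
    )


for rho in (0.0, 0.7, 1.0):
    print(f"n=3, rho={rho}: D={diversity_factor(3, rho):.2f}")
```

At a correlation of 1.0 the factor is exactly 1: three agents that always agree carry the information of one.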

A supervisor can inspect disagreement after aggregation. The supervisor policy is narrow: identify the source of disagreement, run clarifying searches when evidence is missing, and override the aggregate only when the new evidence clears a stated confidence threshold. Otherwise the aggregate remains the forecast.

Calibration comes last. It can correct systematic probability bias only when estimated on resolved forecasts outside the run being scored. It cannot repair a contaminated question, a weak evidence process, or three agents that all missed the same fact.

The stored forecast record should contain the question, resolution source, cutoff, evidence references, agent forecasts, aggregate, supervisor action, calibration setting, model configuration, market or base-rate baseline, final probability, outcome, and score.
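That field list maps naturally onto a small record type. The sketch below is one possible layout, not the notebooks' schema: the class name, field types, and every example value other than the question text and its resolution are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ForecastRecord:
    question: str
    resolution_source: str
    cutoff: str                     # ISO date; later evidence is excluded
    evidence_refs: list
    agent_forecasts: list
    aggregate: float
    supervisor_action: str          # e.g. "kept_aggregate" or "override"
    calibration_setting: str
    model_config: dict
    baseline: float                 # market price or base rate
    final_probability: float
    outcome: Optional[int] = None   # 1 = yes, 0 = no, None until resolved
    score: Optional[float] = None   # binary Brier score once resolved

    def resolve(self, outcome: int) -> None:
        """Record the outcome and score the final probability."""
        self.outcome = outcome
        self.score = (self.final_probability - outcome) ** 2


record = ForecastRecord(
    question="Will US core PCE for April 2026 exceed 0.3% month over month?",
    resolution_source="BEA Personal Income and Outlays release, May 28, 2026",
    cutoff="2026-05-27",                      # hypothetical
    evidence_refs=["source-1", "source-2"],   # hypothetical placeholders
    agent_forecasts=[0.35, 0.28, 0.27],       # hypothetical
    aggregate=0.30,
    supervisor_action="kept_aggregate",
    calibration_setting="none",
    model_config={"n_agents": 3},
    baseline=0.25,                            # hypothetical market price
    final_probability=0.30,
)
record.resolve(outcome=0)  # the event resolved "no"
print(json.dumps(asdict(record), indent=2))
```

Keeping the baseline and the supervisor action in the same record is what makes a later ablation possible without rerunning the agents.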

What the notebooks show

The Chapter 24 notebooks include weak results alongside the clean demonstrations, because that is what agent-system development looks like. The notebooks will be released closer to publication, in late June or early July.

In the notebook research_agent, two identical research agents working on the same Fed-rate question returned 0.75 and 0.72. In the notebook multi_agent_research, three agents on a pinned Fed-hike question all returned 0.25, even though their search paths differed slightly. Temperature alone did not create an informative panel.

In the notebook aggregation_math, the diversity factor makes the cost of correlation explicit. Adding agents helps only if they bring different information. If they share the same model, tools, retrieval sources, and priors, the panel can appear broad while behaving like a single repeated forecast.

In adversarial_debate, debate lengthened the trace without narrowing the gap. The notebook shows why debate needs an ablation rather than just a transcript. If the debate does not move beliefs, reveal missing evidence, or improve resolved-outcome scores, it is theater with a token bill.

In evaluation_and_governance, the evaluation panel compares the pipeline with simpler variants using Brier score, log score, expected calibration error, sharpness, ablations, and a market baseline. The panel is small and carries contamination caveats, so it illustrates the scoring workflow rather than a performance claim.
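Three of those scores can be sketched in a few lines. This is a generic implementation with equal-width bins for expected calibration error; the function names and the bin count are assumptions, not the notebook's code.

```python
import numpy as np

def brier_score(p, y):
    """Mean squared error between probabilities and 0/1 outcomes."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    return float(np.mean((p - y) ** 2))

def log_score(p, y, eps=1e-12):
    """Mean negative log-likelihood; lower is better."""
    p = np.clip(np.asarray(p, float), eps, 1 - eps)
    y = np.asarray(y, float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def expected_calibration_error(p, y, n_bins=10):
    """Bin-weighted gap between mean forecast and realized frequency."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bins == b
        if in_bin.any():
            ece += in_bin.mean() * abs(p[in_bin].mean() - y[in_bin].mean())
    return float(ece)

# A coin-flip forecaster on one hit and one miss.
p, y = [0.5, 0.5], [1, 0]
scores = (brier_score(p, y), log_score(p, y), expected_calibration_error(p, y))
```

On a small panel, each bin holds only a handful of forecasts, so the calibration number is noisy; that is one reason to read it next to the Brier and log scores rather than alone.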

The design rule is simple: keep a component only when it improves the forecast record, the diagnosis, or the evaluation.

What adjacent work adds

The relevant papers add design lessons rather than slogans.

Bridgewater’s AIA Forecaster supplies the forecasting blueprint: agentic search, multiple independent forecasts, supervisor reconciliation, calibration, and comparison with market consensus.

AlphaAgents matters here less for its reported portfolio results than for its role-structured design: fundamental, sentiment, and valuation agents with debate as the reconciliation mechanism.

Recent work on scaling agent systems makes the general point explicit: more agents can help on decomposable tasks and hurt on sequential ones. Coordination overhead, redundancy, and error amplification are design variables.

Where the pattern fits

The workshop uses forecasting because the work is measurable. That does not make forecasting a universal test case for agents.

Sequential execution tasks usually need a workflow, not a panel of agents. A deterministic data-cleaning step, for example, should not become a debate among models. Evidence-gathering tasks may require one strong research agent at first, but several later. Multi-agent design becomes more plausible when the task has competing interpretations, missing evidence, and a decision artifact that can be compared with a baseline.

Forecasting gives that problem a disciplined form. The artifact is a probability. The baseline may be a market price, base rate, consensus forecast, or simpler model. The outcome eventually resolves. The system can be wrong in a measurable way.

That is why this topic belongs in ML4T. It is close enough to trading and research to matter, but constrained enough to evaluate.

What we are building

The companion forecasting-agent codebase turns the AIA pattern into a runnable system, including a CLI, configuration profiles, read-only prediction-market connectors, SQLite persistence, a Streamlit dashboard, deterministic replay mode, local and hosted model profiles, and ablation settings.

The free Lightning Lesson on Thursday, June 18, 2026, walks through this architecture: independent research agents, aggregation, supervisor reconciliation, calibration, and scoring. The practical goal is to inspect, test, ablate, and score the system rather than trust the final paragraph.

The full hands-on workshop, Building Multi-Agent Forecasting Systems, runs Saturday, June 27, 2026, 10am-5pm ET. Participants work with config profiles, replay traces, aggregation settings, supervisor policy, calibration diagnostics, and resolved-outcome scores. The guaranteed path uses deterministic replay, so the exercises do not depend on API keys or live network access. The connectors are read-only.

Use five questions to judge the result:

Did independent agents produce independent evidence?
Did aggregation improve on the simple mean?
Did the supervisor find missing evidence or merely override?
Did calibration improve reliability without hiding a weak evidence process?
Did the system improve on a market price, base rate, or simpler model?

Tomorrow’s issue uses the World Cup Forecast Lab as a public forecasting surface. Later issues go deeper into AIA Forecaster, multi-agent diversity, and the ML4T research operator.

Deep Learning for End-to-End Portfolio Construction

Stefan Jansen — Tue, 26 May 2026 15:42:27 GMT

Most machine learning workflows for trading stop before the trade.

A model estimates returns, ranks assets, or emits a signal. A risk model estimates covariance or volatility. An allocator turns those objects into weights. A backtest then decides whether the chain produced a portfolio worth trading.

That separation is valuable because it makes errors easier to diagnose. If the signal has no information coefficient, the allocator is not the first suspect. If the signal has a positive IC but the strategy loses money, the next checks are sizing, turnover, costs, concentration, exposure, and timing.

End-to-end portfolio learning changes the object being learned. The network maps market features directly to positions and trains on a portfolio-level objective, usually a differentiable Sharpe-style loss computed after volatility scaling and turnover costs. The appeal is alignment: the gradient flows through a portfolio return stream closer to the object used in evaluation, rather than stopping at one-step forecast error. The risk is that forecasting, sizing, turnover, and exposure control become entangled in one loss surface.

The practical rule is narrow: use end-to-end allocators when the objective cannot be cleanly decomposed into forecasts plus an optimizer, and evaluate them as full trading systems rather than allocator modules. Objective alignment is not evidence. It is a reason to run stricter evidence checks.

Chapter 17 of Machine Learning for Trading uses this tension to place learned allocators inside a broader portfolio-construction workflow. The chapter does not argue that neural allocators replace classical allocation. It asks a narrower question: when is it useful to train the allocation decision itself?

Three recent lines of work make the progression visible. The first shows that portfolio weights can be trained directly through a Sharpe-style objective. The second asks which sequence architectures survive under a common volatility-targeted portfolio loss. The third adds structure: cost-aware training, cross-market filtration discipline, graph-constrained attention, and a robust regime objective.

Chapter 17’s ETF examples use those papers to ask a practical question: what has to be checked when the model owns the path from features to weights?

Figure 1. The end-to-end portfolio-learning pipeline in Chapter 17. Per-asset features pass through a shared sequence encoder, a bounded signal head, a volatility-targeted position layer, and a portfolio-return aggregation step. The loss is a risk-adjusted statistic of realized portfolio returns, so gradients flow through the allocation decision rather than stopping at a forecast.

What changes when the model learns weights

Chapter 17 starts with a portfolio-construction term sheet: objective, inputs, constraints, rebalancing protocol, cost treatment, and evaluation plan. That framing matters because allocation is otherwise easy to turn into an unlogged search layer once model selection is complete.

The early sections cover the standard allocator workflow. Expected returns may come from a model. Covariance may come from a shrinkage estimator, a factor model, or a realized window. Constraints and turnover penalties shape the final weight vector. Evaluation then asks whether the allocation improved the portfolio relative to a benchmark allocator, using the same signal and backtesting protocol.

The learned allocators in Section 17.8 are different. They do not consume the same forecast stream as the allocator comparisons earlier in the chapter. They learn from raw or engineered price features and output positions directly. A head-to-head table is still useful, but it compares systems rather than allocators fed identical predictions.

That makes learned allocators a separate evidence track. They are not allocator modules fed identical forecasts; they are full trading systems. They must still pass simple heuristics, but the comparison must include leakage checks, costs, turnover, drawdown, regime slices, seed variation, and component ablations. A single test Sharpe is not enough evidence when the model owns the entire path from features to weights.

The tables below should be read as experiment-specific evidence, not as a single consolidated leaderboard across different data masks, model protocols, and portfolio-return calculations.

The common computation graph

The three implementations share a recognizable core graph, although their constraints, cost treatment, and details of the robust objective differ.

For each asset and decision time, the model receives a fixed-length lookback window. A sequence encoder processes the window and produces a hidden state. A small head projects the hidden state into a signal. A position layer converts the signal into a tradeable weight, usually after volatility scaling. The portfolio-return layer combines those positions with next-period realized returns and subtracts turnover costs. The training loss is computed on the resulting portfolio-return stream.
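The last two layers of that graph can be sketched in NumPy, leaving out the encoder and head. The function name, the 10% volatility target, the 5 bps cost, and charging costs on the sum of absolute position changes are assumptions for illustration; the papers and examples differ on each of them.

```python
import numpy as np

def net_portfolio_returns(signal, realized_vol, next_returns,
                          target_vol=0.10, cost_bps=5.0):
    """Turn bounded signals into a net portfolio return stream.

    signal, realized_vol, next_returns: arrays of shape (T, N).
    realized_vol and target_vol must use the same (e.g. annualized) units.
    """
    positions = signal * target_vol / realized_vol            # volatility-scaled exposure
    prev = np.vstack([np.zeros((1, positions.shape[1])), positions[:-1]])
    turnover = np.abs(positions - prev).sum(axis=1)           # traded notional per period
    gross = (positions * next_returns).sum(axis=1)            # next-period realized P&L
    return gross - turnover * cost_bps / 1e4                  # subtract one-way costs

# One asset, two periods: the entry trade is charged once, the hold is free.
signal = np.array([[1.0], [1.0]])
vol = np.array([[0.10], [0.10]])
rets = np.array([[0.01], [0.02]])
net = net_portfolio_returns(signal, vol, rets)
```

In a trainable version, every step here would be a differentiable tensor operation so the loss gradient can reach the encoder through the positions.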

The base loss is usually a negative annualized Sharpe-style objective computed over the portfolio return stream. It rewards average portfolio return relative to portfolio volatility rather than one-step forecast accuracy. Later versions add cost terms and robust subperiod penalties. The notation is schematic: the papers and examples differ in output constraints, volatility scaling, transaction-cost treatment, and robust-window construction. Figure 2 puts the objective where it belongs in the workflow: weights, returns, masks, costs, and previous positions first form a net return stream; the loss then rewards pooled Sharpe while adding pressure on weak subperiods.

The pooled Sharpe term asks whether net portfolio returns compensate for realized volatility after sizing and costs. The SoftMin term is a smoothed worst-window Sharpe ratio; maximizing it rewards policies whose weak windows improve rather than those that rely on a few favorable windows.

Figure 2. The portfolio objective is built from current weights, next-period returns, availability masks, previous positions, and cost inputs. The formula is shown as a schematic rather than an implementation identity; the symbol key defines the notation used in the image.

Two details are easy to understate.

First, the pooled Sharpe is not a separable loss. Its gradient depends on the mean and variance of the return stream. If a training loop computes Sharpe in small mini-batches and averages the resulting gradients, it optimizes the average mini-batch Sharpe, not the pooled Sharpe across the full panel. Those objectives can prefer different policies because the denominator is computed on different return distributions.
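A small numerical sketch makes the gap concrete. The two mini-batches below are illustrative numbers, chosen so both have the same mean return but different volatility. The last lines show that simple sums accumulated across mini-batches recover the pooled statistic.

```python
import numpy as np

def sharpe(r):
    """Unannualized Sharpe ratio of a return array (population std)."""
    r = np.asarray(r, float)
    return r.mean() / r.std()

batch_a = np.array([0.01, 0.03])   # mean 0.02, std 0.01
batch_b = np.array([0.00, 0.04])   # mean 0.02, std 0.02

avg_minibatch = np.mean([sharpe(batch_a), sharpe(batch_b)])   # 1.5
pooled = sharpe(np.concatenate([batch_a, batch_b]))           # about 1.26

# Sufficient statistics accumulated batch by batch give the pooled value.
n = s1 = s2 = 0.0
for b in (batch_a, batch_b):
    n += len(b)
    s1 += b.sum()
    s2 += (b ** 2).sum()
mean = s1 / n
std = np.sqrt(s2 / n - mean ** 2)
pooled_from_stats = mean / std
```

The sketch covers only the forward value. Matching the gradient as well requires using these full-batch statistics when differentiating each mini-batch, which is not shown here.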

DeePM treats this as an implementation issue, not a footnote. The paper introduces an exact two-pass microbatching procedure for large effective batches:

The procedure first accumulates sufficient statistics for the full logical batch, then replays the forward pass with the corrected normalization so the gradient matches the pooled objective.

Second, cost-aware training differs from post hoc cost reporting. If turnover costs are charged only after training, the model can learn a policy that works on gross returns and fails when traded. When costs are inside the return stream being optimized, the model encounters implementation friction during training.

Direct Sharpe training is only the start

Zhang, Zohren, and Roberts (2020) provide the clean starting point. The paper bypasses the expected-return forecast and trains a neural network to output long-only portfolio weights directly. A softmax layer keeps weights positive and ensures they sum to 1. The objective is portfolio Sharpe computed from realized portfolio returns, and gradient ascent updates the model parameters.
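The output layer and objective can be sketched in NumPy. The scores stand in for the network's raw outputs, and the function names, example numbers, and sqrt(252) annualization are assumptions rather than the paper's code.

```python
import numpy as np

def softmax_weights(scores):
    """Map raw scores of shape (T, N) to long-only weights that sum to 1 per row."""
    z = scores - scores.max(axis=-1, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def neg_sharpe_loss(weights, next_returns, periods_per_year=252):
    """Negative annualized Sharpe of realized portfolio returns (minimize this)."""
    port = (weights * next_returns).sum(axis=1)
    return float(-np.sqrt(periods_per_year) * port.mean() / port.std())

# Placeholder scores and returns for three dates and four assets.
scores = np.array([[0.0, 0.0, 0.0, 0.0],
                   [1.0, 0.0, -1.0, 0.5],
                   [2.0, 1.0, 0.0, -1.0]])
returns = np.array([[0.01, 0.00, -0.01, 0.02],
                    [0.00, 0.01, 0.01, -0.01],
                    [0.02, -0.01, 0.00, 0.01]])
weights = softmax_weights(scores)
loss = neg_sharpe_loss(weights, returns)
```

Minimizing the negative Sharpe is equivalent to the gradient ascent on Sharpe that the paper describes.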

The paper deliberately keeps the architecture simple: a single-layer LSTM with 64 units, a 50-day lookback, close prices and daily returns as inputs, Adam optimization, and a validation split for hyperparameter control. The empirical setup uses four ETFs or index proxies: VTI, AGG, DBC, and a VIX-tracking proxy. Reported test results include volatility scaling and transaction costs. In that four-asset setting, the deep learning strategy performs well relative to the paper’s baselines and moves substantially toward bonds during the COVID-19 crash, which falls inside the paper’s test period.

The novelty lies in the objective, not in architectural complexity. The model demonstrates that portfolio weights can be learned via a differentiable, risk-adjusted objective without first estimating expected returns.

The Chapter 17 ETF example is useful precisely because it does not flatter the method. It puts the same idea into a broader 29-ETF setting, using daily prices from 2006-01-03 to 2025-12-31, a chronological 60/20/20 train/validation/test split, 63-day sequences, and a long-only softmax LSTM trained against a differentiable Sharpe objective.

The result is a useful negative control:

The learned LSTM reduces volatility and drawdown, but the return side does not compensate. Equal weight and inverse volatility are not straw-man baselines here; they are low-turnover controls that expose whether the learned policy earns its extra complexity. The best validation Sharpe is 1.609; the test Sharpe is 0.48. Directly optimizing a portfolio loss does not remove overfitting. It moves the overfitting target from prediction error to the portfolio object itself.

This is why the baseline belongs in the issue. It prevents the argument from becoming “train the Sharpe and win.” The paper establishes the training principle. The Chapter 17 replication shows why the principle needs stronger architecture, costs, and validation discipline before it becomes competitive.

Better sequence models still have to pay costs

Saly-Kaufmann et al. (2026) ask the next question: if every model is evaluated under the same portfolio objective, which temporal architecture earns its complexity?

Their benchmark uses roughly 15 years of futures and currency data across bonds, commodities, energy, foreign exchange, and equity indices. Each model maps a lookback window to a bounded signal in [-1, 1]. A volatility-targeted position layer converts the signal to risk-scaled exposure. The paper uses pooled Sharpe as the optimization objective and reports a broad set of evaluation metrics: annualized return, Sharpe ratio, HAC statistics, hit rate, turnover, passive-relative information ratio, downside risk, seed robustness, and breakeven transaction costs.

The paper's evidence is best read as an architecture ranking under a shared protocol, not as a claim that those absolute Sharpe ratios transfer to the Chapter 17 ETF examples. The benchmark uses a different universe, futures-style instruments, a 10% volatility target, seed averaging, and a different implementation stack.

Within that benchmark, the main lesson is not that one architecture universally wins, but that inductive bias matters more than raw capacity. VLSTM, a variable-selection network in front of an LSTM encoder, reports the strongest aggregate Sharpe in the main table: 2.39, with a 23.9% annualized return. The hybrid LPatchTST and TFT are close behind, at 2.32 and 2.20, respectively. xLSTM has a lower average Sharpe ratio of 1.80 but a more favorable turnover profile than the classical LSTM, which matters for implementation. iTransformer has very low turnover but weak economic performance, with a Sharpe of 0.35 in the reported benchmark.

The discussion is more important than the ranking. Recurrent or recurrent-hybrid models do well because the architecture builds in a temporal axis rather than forcing the model to infer it from noisier token structure. Variable selection helps because most financial features are weak, unstable, or regime-dependent. But the “best” architecture depends on the metric. VLSTM leads on average Sharpe; LPatchTST and VxLSTM, the variable-selection plus xLSTM hybrid, look attractive on some downside and tail-risk measures; xLSTM has a stronger cost buffer in the paper’s breakeven analysis.

The paper also checks seed sensitivity. Under a smaller experimental budget, VLSTM still reports a Sharpe near the full-budget estimate: 2.40 in the reduced-seed table versus 2.39 in the main table. That does not make the result universal, but it reduces the risk that the ordering is only a lucky initialization artifact within this benchmark.

The Chapter 17 VLSTM example applies that idea to the same ETF universe. It keeps the same 29 ETFs, the 2006-2025 price panel, the 63-day sequence length, and the chronological split as in the first ETF example. The model changes the allocator in two ways:

It adds a TFT-style gated residual network and variable-selection network before the LSTM;
It replaces the softmax long-only output with a volatility-targeted long-short position layer trained with a 5 bps cost-aware pooled-Sharpe loss.

The cost table below is an out-of-sample stress test on one trained model. The model is trained with a 5 bps one-way cost inside the loss; the table then revalues the same held-out weights at 0, 5, 10, 20, and 50 bps one-way cost per dollar of turnover. It is not retrained at each cost level.
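The revaluation described above can be sketched as a loop over the cost grid with the weights held fixed. The function name, the starting position of zero, and the toy weights and returns are illustrative assumptions; only the cost grid comes from the text.

```python
import numpy as np

def cost_stress(weights, next_returns, cost_grid_bps=(0, 5, 10, 20, 50),
                periods_per_year=252):
    """Revalue fixed held-out weights at several one-way cost levels.

    weights, next_returns: arrays of shape (T, N). No retraining happens here.
    Returns {cost in bps: annualized net Sharpe}.
    """
    prev = np.vstack([np.zeros((1, weights.shape[1])), weights[:-1]])
    turnover = np.abs(weights - prev).sum(axis=1)    # dollars traded per dollar of capital
    gross = (weights * next_returns).sum(axis=1)
    result = {}
    for bps in cost_grid_bps:
        net = gross - turnover * bps / 1e4
        result[bps] = float(np.sqrt(periods_per_year) * net.mean() / net.std())
    return result

# A deliberately high-turnover toy policy that flips between two assets.
weights = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
returns = np.array([[0.01, 0.0], [0.0, 0.02], [0.03, 0.0], [0.0, 0.01]])
stress = cost_stress(weights, returns)
```

Because the weights are fixed, the gross return series is identical at every cost level and turnover alone drives the decay.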

At zero cost, VLSTM is effectively tied with equal weight. The cost profile is the result that matters:

The VLSTM variant recovers most of the gap between the softmax LSTM and the heuristic allocators at zero cost. It does not survive the 5 bps cost assumption used during training. That is a substantive result, not a caveat. An architecture can look competitive on gross returns and still fail the deployability test.

Robust portfolio learning needs structure

Wood, Roberts, and Zohren (2026) advance the same line of work. DeePM is built for systematic macro portfolios, where the model must learn from noisy non-stationary data, trade across asynchronous global markets, and survive transaction costs.

The paper identifies three design problems.

The first is the ragged filtration problem. Global markets do not close at the same time. A naive cross-sectional attention layer can inadvertently allow an earlier-closing market to see information from a later-closing one. In the paper’s implementation, Directed Delay lags cross-sectional conditioning so that cross-market representations are measured with respect to a common information set, preferring filtration discipline over maximum same-day freshness.
The second is low signal-to-noise cross-asset learning. Free attention can form economically implausible links and overfit unstable correlations. DeePM uses a macro graph prior: an ex ante economic topology that constrains or biases cross-asset attention toward admissible relationships. The paper gives examples such as intra-group cliques, risk-on links across equities and cyclical assets, and inflation-sensitive links among energy, rates, and precious metals. The graph is a structural regularizer, not a claim that the specified edges are ground-truth causality.
The third is regime fragility. A pooled Sharpe objective can be lifted by favorable windows while hiding weak periods. DeePM augments pooled Sharpe with a SoftMin penalty over subperiod Sharpe ratios. As the temperature approaches zero, the penalty approaches the worst window. At intermediate temperature, it emphasizes weak windows without collapsing onto a single episode. The paper connects this to a KL-penalized distributionally robust objective and to Entropic Value-at-Risk.
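The temperature behavior in the third item can be checked with a small sketch. The log-mean-exp form of SoftMin, the weight lam, and the window Sharpe values are assumptions for illustration; the paper's exact normalization and weighting may differ.

```python
import numpy as np

def softmin(values, tau):
    """Smoothed minimum: -tau * log(mean(exp(-v / tau))), computed stably."""
    v = np.asarray(values, float)
    m = v.min()
    return float(m - tau * np.log(np.mean(np.exp(-(v - m) / tau))))

def robust_objective(pooled_sharpe, window_sharpes, lam=0.5, tau=0.5):
    """Pooled Sharpe plus a weighted SoftMin over subperiod Sharpes (to be maximized)."""
    return pooled_sharpe + lam * softmin(window_sharpes, tau)

window_sharpes = [1.2, 0.9, -0.3, 0.8]    # illustrative subperiod Sharpe ratios
near_worst = softmin(window_sharpes, tau=1e-3)   # close to the worst window, -0.3
balanced = softmin(window_sharpes, tau=0.5)      # between the worst window and the mean
near_mean = softmin(window_sharpes, tau=1e3)     # close to the plain average, 0.65
```

With this normalization, the value always lies between the worst window and the simple average, so the temperature sets how much weight the weakest periods receive.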

The architecture maps those problems into explicit modeling choices:

a vectorized variable-selection network with FiLM-style static conditioning, where static asset context modulates features through feature-wise affine transforms;
an LSTM temporal backbone plus temporal attention;
lagged cross-sectional attention for filtration discipline;
macro-graph attention for economic structure;
a cost-aware net-return objective;
a SoftMin-augmented robust Sharpe loss.

In the DeePM paper’s 2010-2025 macro futures test, with out-of-sample returns rescaled to a 10% annualized volatility, the full model reports a gross Sharpe ratio of 1.29 and a net Sharpe ratio of 0.93 after transaction costs. Passive equal risk reports a 0.50 net Sharpe ratio, TSMOM 0.45, and the Momentum Transformer baseline, trained with the same transaction-cost regularization, reports 0.66. The paper-level ablations matter:

Those are paper results on a 50-contract futures and FX universe, not results from the Chapter 17 ETF examples. The distinction matters.

The local ETF example then asks a narrower practical question: what happens when a DeePM-style structure is adapted to the same setting? The example uses 29 ETFs, five asset-class groups, daily prices from 2006 to 2025, an 84-day sequence length, and a chronological 60/20/20 split. It includes FiLM conditioning, variable selection, an LSTM backbone, cross-sectional attention, a small ETF macro graph prior, transaction costs, and the SoftMin objective. The key contrast is full DeePM versus a no-SoftMin ablation.

One protocol detail prevents a bad cross-table comparison. The heuristic baselines are local controls within each example, not a single shared benchmark series. The LSTM and VLSTM examples evaluate final-step returns from sliding windows; the DeePM-style example evaluates the full post-validation test-date mask and normalizes model risk weights before computing returns. The equal-weight rule is not changing; the sampling and return-alignment protocol is. Read each table within its local comparison, not as a claim that the same equal-weight series has two different Sharpe ratios.

On the test window:

The regime slice explains the source of the improvement. The split is mechanical: test days above the median 21-day annualized realized volatility of SPY are labeled “crisis,” and the remaining days are labeled “calm.” In this run, that median threshold is 14.0%, which yields 504 calm days and 503 crisis days.
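The split described above can be sketched with pandas. The function name and the simulated return series are hypothetical; the sketch also drops the first 20 days that lack a full rolling window, whereas a real run would warm up the window with pre-test data.

```python
import numpy as np
import pandas as pd

def label_regimes(spy_returns, window=21, periods_per_year=252):
    """Label each day 'crisis' or 'calm' by the median of rolling realized vol."""
    vol = (spy_returns.rolling(window).std() * np.sqrt(periods_per_year)).dropna()
    threshold = vol.median()
    labels = pd.Series(np.where(vol > threshold, "crisis", "calm"), index=vol.index)
    return labels, float(threshold)

# Simulated daily returns as a stand-in for the SPY test window.
rng = np.random.default_rng(0)
spy = pd.Series(rng.normal(0.0004, 0.01, size=100),
                index=pd.bdate_range("2024-01-01", periods=100))
labels, threshold = label_regimes(spy)
counts = labels.value_counts()
```

The rule is fixed before any strategy results are inspected, so the slice cannot be tuned to flatter one model.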

SoftMin does not help by making calm periods better. Calm Sharpe falls slightly from 1.26 to 1.21. The improvement comes from the crisis side, where Sharpe rises from 0.57 to 1.00. In this ETF window, the robust objective improves the shape of losses and the average statistic simultaneously.

Figure 3. DeePM drawdown comparison from Chapter 17. The full SoftMin version keeps drawdowns shallower during high-volatility periods than the equal-weight baseline and the no-SoftMin ablation. The point of the figure is the path of losses, not only the full-window Sharpe ratio.

The ETF example does not isolate every DeePM component. It does not separately quantify FiLM, V-VSN, Directed Delay, macro graph, and temporal attention as the paper’s broader ablation table does. Its measured local contrast is full DeePM versus no SoftMin, with the architecture held otherwise fixed. The example also does not run a DeePM cost-stress grid, unlike the VLSTM table, so the result should not be taken as proof that the ETF DeePM policy survives arbitrary cost assumptions. The ETF Sharpe ratio should also not be read as evidence that the reduced example system is better than the paper system; the asset universe, volatility target, training protocol, ensemble design, and cost model differ.

Figure 4. Papers, Chapter 17 examples, and implementation patterns answer different questions. Paper benchmarks support research claims; local ETF examples illustrate evaluation logic; implementation patterns preserve the contract and are not evidence of alpha.

The implementation contract

The evidence boundary is now the main point: papers, Chapter 17 examples, and implementation patterns answer different questions.

The reusable lesson is not a list of class names. It is the contract an end-to-end allocator has to preserve: what the model needs to see, what it produces, and where downstream constraints belong.

For a learned allocator, the input contract must carry more than just features. It needs feature windows, forward returns, availability masks, volatility-scaling terms, cost inputs, previous weights, and, when used, economic graph structure. The output is not a forecast to be handed to a separate optimizer; it is a target weight vector that must still pass through exposure, leverage, turnover, and backtest checks.
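One way to make that input contract explicit is a container that validates shapes before anything reaches the model. The class name, field names, and array layouts below are hypothetical, not the chapter's implementation.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class AllocatorBatch:
    features: np.ndarray          # (T, N, L, F): lookback windows per date and asset
    forward_returns: np.ndarray   # (T, N): next-period realized returns
    mask: np.ndarray              # (T, N): True where the asset is tradeable
    vol_scale: np.ndarray         # (T, N): volatility-scaling terms
    cost_bps: np.ndarray          # (N,): one-way cost per asset
    prev_weights: np.ndarray      # (N,): positions held before the first date
    graph: Optional[np.ndarray] = None   # (N, N): optional economic adjacency

    def validate(self):
        """Raise ValueError if any field disagrees with the (T, N) panel shape."""
        t, n = self.forward_returns.shape
        checks = {
            "features": self.features.shape[:2] == (t, n),
            "mask": self.mask.shape == (t, n),
            "vol_scale": self.vol_scale.shape == (t, n),
            "cost_bps": self.cost_bps.shape == (n,),
            "prev_weights": self.prev_weights.shape == (n,),
            "graph": self.graph is None or self.graph.shape == (n, n),
        }
        bad = [name for name, ok in checks.items() if not ok]
        if bad:
            raise ValueError(f"shape mismatch in: {', '.join(bad)}")
        return self

T, N, L, F = 3, 2, 4, 5
batch = AllocatorBatch(
    features=np.zeros((T, N, L, F)),
    forward_returns=np.zeros((T, N)),
    mask=np.ones((T, N), dtype=bool),
    vol_scale=np.ones((T, N)),
    cost_bps=np.full(N, 5.0),
    prev_weights=np.zeros(N),
).validate()
```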

Each part of that contract corresponds to a failure mode from the papers and examples:

This is the implementation boundary that matters. A DeePM-style allocator is not just a generic sequence model with a Sharpe loss. It needs a net-return layer, cost-aware turnover, pooled-objective semantics, optional graph constraints, and hooks for portfolio constraints after the model emits weights.

At the same time, packaging those mechanisms in software is not evidence of alpha. The evidence still comes from matched protocols, out-of-sample tests, cost stress, regime slices, seed checks, and ablations.

Alignment raises the burden of proof

The evidence across the Chapter 17 examples is mixed.

The softmax LSTM shows that direct Sharpe training can underperform simple heuristics. VLSTM shows that a better architecture can improve gross performance and still lose after costs. The reduced DeePM-style ETF example shows that, in this test window, a structured allocator with the SoftMin objective improves Sharpe, drawdown, and crisis-window performance relative to its local controls.

End-to-end training is not a shortcut around portfolio-construction discipline. It is a way to move more of the trading problem into the learning objective. Once that happens, the validation burden grows. The 0.98 versus 0.69 DeePM comparison is evidence from one ETF test window; it still needs seed, cost, period, and ablation checks before it can support a deployment claim.

A learned allocator should be judged by a protocol tied to the failure modes above:

Start with heuristic baselines. The softmax LSTM does not beat equal weight or inverse volatility on the ETF test window.
Report gross and net performance. VLSTM’s zero-cost tie disappears at 5 bps one-way cost.
Show turnover and cost-stress curves. A cost-aware loss can still learn weights that are too expensive out-of-sample.
Slice regimes with a rule defined before looking at results. The DeePM example uses median SPY realized volatility.
Report seed sensitivity where the model class makes it material. Saly-Kaufmann’s reduced-seed check is part of the evidence, not a footnote.
Use ablations when the claim is architectural. DeePM’s paper-level result is more credible because the no-SoftMin, no-graph, and graph-only variants are visible.

This is where Chapter 17’s workflow matters. Classical allocators remain the right default when signals, risk estimates, and constraints can be diagnosed cleanly. End-to-end allocators become more compelling when the objective is hard to factor into a forecast plus an optimizer: cost-aware sizing, regime robustness, or structured cross-asset interaction.

The practical takeaway from this line of work is not that deep learning beats portfolio heuristics. It is that portfolio-objective training can align the model with the economic target, but only if the architecture, cost model, and evaluation protocol are strong enough to carry that alignment out-of-sample.

The Chapter 17 examples show the experimental progression, not trading instructions. The implementation lesson is the same throughout: sequence information in, target weights out, with costs, volatility scaling, graph structure, and robust Sharpe treated as first-class parts of the model rather than afterthoughts. Direct portfolio learning is not a replacement for portfolio construction. It is portfolio construction moved into the model, which makes the modeling problem more aligned and harder to audit.

AI Agents in Finance: A Reading List

Stefan Jansen — Fri, 22 May 2026 13:11:05 GMT

The well-known Sutskever/Carmack reading list worked because it did not try to be an encyclopedia. It gave readers a path: learn these ideas, and a large part of modern deep learning becomes easier to understand.

This issue uses the same format for a narrower question: what should you read before building AI agents for financial research and forecasting?

The answer is not a list of agent frameworks or orchestration libraries. It starts earlier. Core agent design still involves search under limited computation, action under partial observation, tool validity, state, delegation, evaluation, and human supervision. Language models changed the substrate, but not the underlying control problem.

Chapter 24 of Machine Learning for Trading implements that view. It builds read-only research and forecasting agents that gather evidence, call tools, maintain state, produce probabilities, and leave artifacts that can be replayed, scored, and audited. More specifically, the chapter shows how to build the Bridgewater AIA Forecasting Agent through to live deployment.

You can also learn how to build this in our new workshop on Building Multi-Agent Forecasting Systems, which teaches how to engineer this particular agent harness and loop in one day.

In the chapter, an agent does not place trades. It calls market data APIs, searches for filings and news, retrieves documents, runs calculations, writes structured forecasts, and records what happened. The question is not whether a chatbot can sound like an analyst. The question is whether a workflow can produce useful decision-support artifacts that can be replayed, scored, and governed.

Figure 1. Chapter 24 treats financial agents as read-only workflows over evidence, tools, memory, and audit artifacts. Order generation and execution require a separate layer of permissions, risk, and controls.

What finance changes

Finance turns a generic agent problem into a time-sensitive evidence problem. The system has to know what was knowable when, where a number came from, which tool produced it, and whether the evidence was available before the forecast or backtest decision.

Five constraints shape the reading path:

Time. Filings, prices, macro releases, transcripts, and news have timestamps. An agent must preserve cutoffs rather than mix past and future evidence.
Provenance. Financial evidence is not interchangeable text. A model summary, an SEC filing, an exchange quote, and a scraped article carry different reliability and permission properties.
Leakage. Evaluation can be contaminated by training data, revised data, benchmark overfitting, or hidden access to answers. Finance makes this more dangerous because small information leaks can appear to be forecasting skill.
Calibration. Many outputs are probabilities, not prose. The evaluation question is not only whether the explanation sounds plausible, but whether the forecast is calibrated after resolution.
Capital-at-risk boundaries. Research support, portfolio recommendation, order generation, and execution are different system classes. Chapter 24 stays on the research-support side of that boundary.

A core path for the long weekend

Start with these core entries:

Kaelbling, Littman, and Cassandra on POMDPs: financial agents act from belief states, not complete market state.
Rao and Georgeff on BDI agents: beliefs, goals, intentions, and resource-bounded deliberation predate LLM wrappers.
Horvitz on mixed-initiative interfaces: agents need rules for proceeding, asking, abstaining, and escalating.
Lewis et al. on retrieval-augmented generation: finance needs external, updateable, provenance-bearing knowledge.
WebGPT: an early template for search, evidence collection, citation, and answer generation in one loop.
MRKL Systems: language models can route to tools, calculators, retrieval systems, and symbolic modules instead of internalizing every operation.
ReAct: reason-act-observe is a canonical starting pattern for evidence-grounded agents.
CodeAct: executable actions fit technical research workflows better than unconstrained prose.
AI Agents That Matter: agent evaluation has to include cost, reproducibility, holdouts, and benchmark overfitting.
AgentDojo and the OWASP Top 10 for LLM Applications 2025: retrieved content and tool access create security problems that ordinary model benchmarks miss.
ForecastBench and Halawi et al. on language-model forecasting: financial agents need leakage-aware forecasting evaluation, not just impressive rationales.
AIA Forecaster, Finance Agent Benchmark, and FinToolBench: together they show where agentic financial research works now and where the evidence remains thin.

The rest of the issue gives the broader route. It is organized by design problem rather than by publication date.

The list deliberately excludes most framework documentation, product announcements, and “autonomous trading bot” papers. Frameworks matter in implementation, but they age quickly. The reading path below focuses on more durable design problems: state, tools, retrieval, partial observation, supervision, evaluation, security, forecasting, and governance. Execution agents and order-routing systems are also out of scope, as Chapter 24 remains on the research-support side of the capital-at-risk boundary.

The old problems are still the hard problems

Newell and Simon, Human Problem Solving. Newell and Simon frame intelligence as search through a structured problem space under bounded computation. That framing keeps the central object in view: not a fluent answer, but a process that moves through possible states, operators, and goals.

Hart, Nilsson, and Raphael, “A Formal Basis for the Heuristic Determination of Minimum Cost Paths”. The A-star paper makes a point that still holds: search quality depends on how the system allocates its limited computational budget. LLM agents do not escape that constraint. They move it into prompt length, tool calls, branching, reranking, and supervisor passes.

Rao and Georgeff, “BDI Agents: From Theory to Practice”. BDI gives the literal pre-LLM agent vocabulary: beliefs, desires, and intentions. The vocabulary is older; the design problem remains current. A financial research agent needs a representation of what it believes, what it is trying to answer, which plan it is executing, when to reconsider, and what state must survive between steps.

Kaelbling, Littman, and Cassandra, “Planning and Acting in Partially Observable Stochastic Domains”. Partial observability is central in finance. The agent never sees the full state of the market, the company, or the policy process. It sees filings, quotes, transcripts, news, and partial indicators. The paper makes explicit that agents act from belief states, not from truth.

Sutton, Precup, and Singh, “Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning”. The options framework gives a theory of temporally extended actions. Modern systems call them tools, skills, routines, subagents, or workflows. The abstraction problem is the same: when should a multi-step behavior be treated as a single action, and which state must be preserved before and after it runs?

Erman, Hayes-Roth, Lesser, and Reddy, “The Hearsay-II Speech-Understanding System”. Hearsay-II is the classic blackboard architecture: specialized components coordinate through a shared workspace. That pattern keeps returning in planner-executor-reviewer loops, multi-agent debate, research-agent ensembles, and supervisor reconciliation. The same architecture helps explain Chapter 24’s forecasting pipeline.

Sheridan, Telerobotics, Automation, and Supervisory Control. Sheridan treats autonomy as a control relationship rather than a marketing label. The practical questions are who monitors execution, when control is handed back, and what the human is expected to approve. Those questions apply directly when an agent’s output can influence capital allocation, research priorities, or a published forecast.

Horvitz, “Principles of Mixed-Initiative User Interfaces”. Mixed initiative gives a concrete frame for human-agent work. A financial agent needs rules for when to proceed, when to ask for clarification, when to abstain, and when to escalate. This is not only a user-interface problem. It is part of the risk-control surface.

The LLM-era primitives

Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”. Chain-of-thought showed that eliciting intermediate reasoning can improve multi-step performance. Chapter 24 treats this as a control surface rather than an audit record. In financial agents, the auditable objects are tool calls, observations, state transitions, evidence records, prompts, model versions, policies, and scored outputs. Free-form reasoning text is not enough.

Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”. RAG is not an agent paper, but finance agents are retrieval-bound. The distinction between parametric memory and external, updateable, provenance-bearing knowledge matters for filings, transcripts, news, research notes, macro releases, and market data.

Nakano et al., “WebGPT: Browser-Assisted Question-Answering with Human Feedback”. WebGPT is an early modern template for a language model that searches, collects evidence, cites sources, and answers with the browser in the loop. For finance, this is the move from static model output to evidence acquisition. The model is no longer only producing text. It is choosing what evidence to retrieve before it answers.

Karpas et al., “MRKL Systems”. MRKL made modularity explicit. The language model routes among tools, symbolic modules, knowledge sources, and external calculators. Chapter 24 uses the same principle in a finance setting: deterministic calculations should be tools, retrieval should carry provenance, and the LLM should not pretend to internalize every operation.

Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models”. ReAct is a canonical starting pattern for evidence-grounded agents: reason, act, observe, repeat. In Chapter 24, the first notebook builds this loop with structured JSON decisions and trace capture. The trace ties stated reasoning to tool calls and observations. The hard audit evidence still consists of the tool invocation, observation, state transition, and stored source.
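
The loop itself is small. The sketch below is an assumption-laden illustration, not the chapter's notebook: the tool registry, the scripted `policy` that stands in for a model call, the ticker, and the revenue figure are all made up. It shows the shape of structured JSON decisions with a trace that records each decision next to its observation.

```python
import json

# Hypothetical tool registry: deterministic functions the agent may call.
TOOLS = {
    "lookup_revenue": lambda ticker: {"ticker": ticker, "revenue_musd": 120.0},
}

# Scripted stand-in for the model. A real system would call an LLM here.
SCRIPT = [
    '{"thought": "Need revenue first", "action": "lookup_revenue", "args": {"ticker": "XYZ"}}',
    '{"thought": "Enough evidence", "action": "finish", "args": {"answer": "Revenue is 120.0 MUSD"}}',
]


def policy(step, observations):
    return SCRIPT[step]


def run(max_steps=5):
    trace, observations = [], []
    for step in range(max_steps):
        decision = json.loads(policy(step, observations))    # reason
        if decision["action"] == "finish":
            trace.append({"step": step, "decision": decision, "observation": None})
            return decision["args"]["answer"], trace
        observation = TOOLS[decision["action"]](**decision["args"])  # act
        observations.append(observation)                      # observe
        trace.append({"step": step, "decision": decision, "observation": observation})
    raise RuntimeError("step budget exhausted without a final answer")


answer, trace = run()
```

The `thought` field is kept in the trace, but the auditable part is the pair of tool call and observation stored beside it.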

Yao et al., “Tree of Thoughts: Deliberate Problem Solving with Large Language Models” and Shinn et al., “Reflexion: Language Agents with Verbal Reinforcement Learning”. Tree of Thoughts adds branching and scoring at decision points where premature commitment is costly. Reflexion records post-run lessons that can persist without updating model weights. In finance, both mechanisms need controls. A branch can help compare market hypotheses, and a lesson can improve future behavior. But both need validity horizons, provenance, and pruning rules. Otherwise, a temporary market condition becomes a persistent bias.

Wang et al., “Voyager: An Open-Ended Embodied Agent with Large Language Models”. Voyager is not a finance paper, but its skill-library idea maps well to research agents. A financial operator should not have to rediscover the same data-loading, feature-inspection, or backtest-diagnostic procedures every time. It needs a bounded skill corpus whose behavior can be inspected.

Wang et al., “Executable Code Actions Elicit Better LLM Agents”. CodeAct reframes action as executable code rather than fixed text or JSON. Technical research workflows involve many computational actions: querying a registry, reading a parquet file, computing an IC, running a backtest, or inspecting a result table. Chapter 24’s research operator follows this direction by providing the model with general tools and a skill corpus.

Yang et al., “SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering”. SWE-agent made the agent-computer interface a first-class variable. That lesson generalizes beyond software engineering. If the environment is hard to inspect, the state is hidden, the tools are poorly named, or errors are hard to recover from, the agent will fail due to interface issues even when the base model is strong.

Evaluation and security are part of the system

Kapoor et al., “AI Agents That Matter”. This paper anchors the evaluation section. It argues that accuracy alone is the wrong target because cost, reproducibility, holdout design, benchmark overfitting, and the needs of downstream developers decide whether an agent works in practice. Finance needs that discipline.

AgentBench, WebArena, OSWorld, and SWE-bench. These benchmarks shifted evaluation from “does the model produce the right text?” to “can the system change an environment into the target state?” That shift fits agent evaluation, but it also creates new validity problems. An agent can satisfy a checker without doing the intended work, for example by reading hidden state or by exploiting the evaluation harness itself.

AgentDojo. AgentDojo turns indirect prompt injection into an environment problem. The agent must complete assigned work while treating retrieved content as untrusted. That model fits finance, where retrieved documents can contain adversarial instructions, speculative narratives, stale facts, or conflicting claims.

OWASP Top 10 for LLM Applications 2025. OWASP is not an agent paper, but tool-connected LLM systems create security failures that ordinary model evaluation misses. Chapter 24 turns this into engineering controls: least privilege, source allowlists, prompt-injection filters, policy proxies, and logged allow/deny decisions.
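
A policy proxy with a source allowlist and logged allow/deny decisions can be sketched in a few lines. Everything here is illustrative: the tool names, the allowlisted hosts, and the `authorize` function are assumptions, not the chapter's implementation.

```python
from urllib.parse import urlparse

ALLOWED_HOSTS = {"www.sec.gov", "data.sec.gov"}  # illustrative source allowlist
ALLOWED_TOOLS = {"fetch_url", "compute_ratio"}   # least privilege: no order tools

decision_log = []


def authorize(tool, args):
    """Return True if the call may proceed. Every decision is logged."""
    if tool not in ALLOWED_TOOLS:
        verdict, reason = "deny", "tool not permitted"
    elif tool == "fetch_url" and urlparse(args.get("url", "")).hostname not in ALLOWED_HOSTS:
        verdict, reason = "deny", "host not on allowlist"
    else:
        verdict, reason = "allow", "ok"
    decision_log.append({"tool": tool, "args": args, "verdict": verdict, "reason": reason})
    return verdict == "allow"


ok = authorize("fetch_url", {"url": "https://www.sec.gov/cgi-bin/browse-edgar"})
blocked = authorize("fetch_url", {"url": "https://example.com/rumor"})
```

Denials are logged with the same detail as approvals, so a reviewer can see what the agent attempted as well as what it did.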

Two papers from May 2026 are recent stress tests, not settled references. Evaluating Deep Research Agents on Expert Consulting Work assesses deep-research agents on structured analytical deliverables, using verifiers, rubrics, and cognitive traps. Reported acceptance rates are low across frontier systems. SaaSBench tests long-horizon work in multi-component enterprise software and finds that many failures occur during setup, configuration, and integration before deep business logic is reached. The finance lesson is direct: agent failures are often system failures, not only reasoning failures.

The finance branch

The finance literature below falls into four groups: broad LLM-in-finance maps, financial-agent benchmarks, forecasting-agent systems, and portfolio or trading-agent architectures. Those are not the same system class. A research benchmark, a forecasting assistant, a portfolio-construction committee, and an execution agent require different evidence and controls.

Kong et al., “Large Language Models for Financial and Investment Management”. Kong et al. provide a broad investment-management map: retrieval, domain-specific data, task decomposition, evaluation, and deployment constraints. The paper does not reduce finance to sentiment analysis or trading signals. It treats LLMs as part of a workflow that has to respect evidence, timing, and institutional constraints.

Finance Agent Benchmark. This benchmark adds a concrete constraint. It uses expert-authored financial research tasks that require recent SEC filings and an agentic harness with search and EDGAR access. The dataset has 537 questions across nine task categories. The best-reported model achieved 46.8 percent accuracy at an average cost of $3.79 per query. The evidence is concrete, costly, tool-based, and still limited.

Lu et al., “FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use”. Finance Agent Benchmark tests financial research tasks over filings and search. FinToolBench tests tool-using financial agents under runnable execution conditions. It pairs 760 executable financial tools with 295 tool-required queries and evaluates not only success, but also timeliness, intent restraint, and regulatory-domain alignment. That maps directly to Chapter 24’s view of agents as auditable workflows over tools, traces, and policy constraints.

Xie et al., “FinBen: A Holistic Financial Benchmark for Large Language Models”. FinBen is a checkpoint before discussing agents because it separates financial language tasks from numerical reasoning, forecasting, risk, and decision-making. Static benchmarks do not evaluate full agent behavior, but they reveal where base-model capabilities are thin before an agent loop adds tools, retrieval, and state.

Schoenegger et al., “AI-Augmented Predictions”. This paper serves as a bridge between general agents and forecasting. LLM assistants can improve human forecasting accuracy, but the improvement comes through a decision-support relationship, not full replacement. That is close to Chapter 24’s stance: agents gather and organize evidence, but the output still needs scoring, calibration, and supervision.

Karger et al., “ForecastBench”. ForecastBench evaluates future events whose answers are not known at submission time. That design directly targets leakage. It also keeps the evaluation unit clear: a probability on a resolvable question, not a compelling narrative about what may happen.
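
Scoring a probability after resolution is simple arithmetic. The sketch below uses the Brier score and a coarse calibration table on synthetic forecasts. It illustrates the evaluation unit and is not ForecastBench's scoring code.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def calibration_table(probs, outcomes, n_bins=5):
    """Rows of (mean forecast, observed frequency, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    rows = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            freq = sum(y for _, y in b) / len(b)
            rows.append((round(mean_p, 3), round(freq, 3), len(b)))
    return rows


# Synthetic forecasts and resolved outcomes, for illustration only.
probs = [0.9, 0.8, 0.3, 0.1]
outcomes = [1, 1, 0, 0]
score = brier_score(probs, outcomes)
table = calibration_table(probs, outcomes)
```

A forecaster who always says 0.5 scores 0.25 on any set of binary questions, which makes a useful floor for comparison. Lower is better.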

Halawi et al., “Approaching Human-Level Forecasting with Language Models”. Halawi et al. provide the methodological bridge to AIA Forecaster. The system searches for relevant information, generates forecasts, and aggregates predictions, and it is evaluated against human forecaster baselines. It treats forecasting agents as workflows for retrieval, aggregation, and evaluation.

Alur et al., “AIA Forecaster: Technical Report”. AIA Forecaster is the chapter’s central reference for the forecasting-agent implementation. It combines agentic search, independent forecasts, supervisor reconciliation, and statistical calibration. Its reported results cut both ways: expert-level performance on ForecastBench, weaker performance than market consensus on a harder prediction-market benchmark, and better results when combined with market consensus. That supports decision assistance, not a claim that an LLM forecasts on its own.

Figure 2. Chapter 24 implements forecasting as a supervised evidence workflow: specialists produce independent views, aggregation combines probabilities, debate surfaces contradictions, and the final artifact preserves probability, confidence, caveats, and audit evidence.
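
One generic way to combine independent probability forecasts is to average them in log-odds space and optionally rescale the result. The sketch below shows that pooling rule only. It is an assumption chosen for illustration and should not be read as the aggregation or calibration method used by AIA Forecaster or by Chapter 24.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


def pool(probs, scale=1.0, eps=1e-6):
    """Average forecasts in log-odds space. scale > 1 pushes the pooled
    probability away from 0.5, scale < 1 pulls it toward 0.5."""
    clipped = [min(max(p, eps), 1 - eps) for p in probs]
    mean_logit = sum(logit(p) for p in clipped) / len(clipped)
    return sigmoid(scale * mean_logit)


# Synthetic specialist forecasts for one resolvable question.
specialists = [0.6, 0.7, 0.8]
pooled = pool(specialists)
sharpened = pool(specialists, scale=1.5)
```

Any `scale` other than 1 is a calibration parameter and should be fitted on resolved questions that are held out from evaluation. Otherwise the rescaling becomes one more place where leakage can enter.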

Yu et al., “FinMem: A Performance-Enhanced LLM Trading Agent with Layered Memory and Character Design” and Yu et al., “FinCon: A Synthesized LLM Multi-Agent System with Conceptual Verbal Reinforcement”. These papers belong on the list as design references for memory and multi-agent financial decision systems. Treat them as architecture references, not as proof of deployable trading edge. They expose a design problem: when an agent stores lessons from prior decisions, which lessons are valid enough to persist, and which should be pruned before they become bias?

Zhao et al., “AlphaAgents: Large Language Model based Multi-Agents for Equity Portfolio Constructions” and Ang, Azimbayev, and Kim, “The Self Driving Portfolio”. These papers move from research assistance toward portfolio construction. Chapter 24 reads them cautiously. Role-based analysts, peer critique, investment policy constraints, and supervisor combinations are useful architectural patterns. They do not remove the need for statistical evaluation, transaction-cost modeling, permissions, and operational controls.

Xia et al., “Agentic Trading: When LLM Agents Meet Financial Markets”. This May 2026 survey is best read as a methodological audit, not as a settled taxonomy. It maps 77 LLM-based trading-agent studies and finds that comparable evaluation remains weak: time-consistent splits, transaction-cost assumptions, universe construction, execution semantics, and reproducible artifacts are often missing. That supports Chapter 24’s conservative boundary: before financial agents influence capital, their evidence, timing, costs, and execution assumptions must be inspectable.

Fabozzi and Lopez de Prado, “Implementing AI Foundation Models in Asset Management”. This paper anchors the governance thread. Prompts, retrieval corpora, model versions, and outputs become controlled artifacts once they affect asset management decisions. That is why Chapter 24 treats traces and replay as model-risk infrastructure, not just an engineering convenience.

Lopez-Lira, Tang, and Zhu, “The Memorization Problem”. Economic forecasting with LLMs has a contamination problem: a model may appear to forecast the past because it has absorbed realized outcomes during training. That makes pre-cutoff evaluation hard to interpret. Chapter 24’s answer is not to trust narrative claims of forecasting skill. It uses cutoff dates, time-shift tests, event windows, baselines, and post-resolution scoring.

Lee et al., “Your AI, Not Your View”. Lee et al. show that LLMs can carry systematic investment preferences and confirmation bias. Retrieval and tool use do not automatically remove latent model preferences. A financial agent needs stress tests that present the same evidence under different framings and check whether the conclusion changes for the wrong reason.

How ML4T uses case studies to test strategies across markets

Stefan Jansen — Tue, 19 May 2026 13:18:16 GMT

The third edition of Machine Learning for Trading carries nine case studies: ETFs, crypto perpetuals, NASDAQ-100 microstructure, S&P 500 equity and option analytics, US firm characteristics, FX pairs, CME futures, direct S&P 500 options, and a broad US equities panel.

That list matters because the studies are not decorative applications at the end of the book. They show how the same research process performs across different datasets and in very different markets and trading environments, and each includes around 20 notebooks from data sourcing to detailed performance analysis.

They cover different asset classes, frequencies, breadths, cost regimes, and execution problems. Some are monthly. Some are daily. One is intraday. Some are long-only ranking problems. Some are long-short cross-sectional problems. One is a delta-hedged options strategy.

Model choice rarely decides whether a strategy survives. Label design, cost regime, data construction, breadth, position sizing, and search discipline usually decide it.

That is what the case studies are built to show.

That comparison space is intentionally wide. A monthly ETF rotation process, an 8-hour crypto funding trade, a 15-minute NASDAQ-100 signal, a weekly futures ranking problem, and a delta-hedged options strategy do not put pressure on the same parts of the workflow. That is why the set is useful: each market makes a different failure mode visible.

Figure 1. The point of the set is not just breadth. It is that the same research process is exposed to very different markets, cadences, and execution constraints.

What becomes comparable across nine very different markets

Because the case studies are built on the same discipline, they make more than headline Sharpe ratios comparable. A 21-day ETF label, an 8-hour crypto label, a 5-day futures label, a 15-minute intraday equity label, and a return-to-expiry options label are not interchangeable prediction problems. They encode different holding periods, execution assumptions, and cost burdens.

The same goes for features and models. Some studies rely on traditional financial features such as momentum, carry, volatility, and the term structure. Others add model-based features such as HMM regimes, GARCH volatility, or forecast model outputs. The model set is also deliberately broad: linear baselines, gradient boosting, tabular deep learning, sequence models, latent-factor models, and, where the setup supports it, causal estimators.

Just as important, the book forces the post-prediction steps into view. Signals are converted into positions, run through explicit backtests, stress-tested under costs, modified by allocators and risk overlays, and then evaluated on a holdout set. Feature ICs are measured with heteroscedasticity- and autocorrelation-corrected standard errors. Screening uses false-discovery control. Backtests are read with probabilistic and deflated Sharpe analysis, bootstrap intervals, and search-accounting adjustments. The point is not just to report a number, but to say what that number does and does not justify.
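
To make the HAC step concrete, here is a minimal Newey-West sketch for the standard error of a mean daily IC, using Bartlett weights. The IC series is synthetic and the function name is hypothetical. The book's own estimator and lag choice may differ.

```python
import math


def hac_se_of_mean(x, lags):
    """Newey-West standard error of the sample mean with Bartlett weights."""
    n = len(x)
    mean = sum(x) / n
    u = [v - mean for v in x]

    def gamma(k):
        # Sample autocovariance at lag k.
        return sum(u[t] * u[t - k] for t in range(k, n)) / n

    long_run_var = gamma(0)
    for k in range(1, lags + 1):
        weight = 1 - k / (lags + 1)
        long_run_var += 2 * weight * gamma(k)
    return math.sqrt(long_run_var / n)


# Synthetic, positively autocorrelated daily ICs: runs of good and bad days.
daily_ic = ([0.03] * 5 + [-0.01] * 5) * 2
naive_se = hac_se_of_mean(daily_ic, lags=0)  # ignores autocorrelation
hac_se = hac_se_of_mean(daily_ic, lags=2)
```

With persistent ICs, the HAC standard error is wider than the naive one, so an interval that cleared zero under independence may straddle it after the correction.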

That stack is why the case studies read as research rather than as examples.

What changes once the workflow meets real markets

In the firm-characteristics study, label treatment materially changes the result. The raw 1-month return label leaves the linear baseline at an IC of about -0.005, with a HAC interval that straddles zero. On the winsorized label, the same linear family moves to about +0.023 and clears zero by a thin margin. GBM is strong in both cases, around +0.080, which is the point: the large change came from label treatment, not from swapping ridge for a more expressive architecture.
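
One plausible mechanism behind a shift like this is that a few extreme returns dominate a least-squares fit. The toy example below uses synthetic data and hypothetical helper names. It is not the case study's pipeline, and it does not reproduce the ICs quoted above.

```python
def winsorize(values, k=1):
    """Clip to the k-th smallest and k-th largest observed values."""
    ordered = sorted(values)
    lo, hi = ordered[k], ordered[-1 - k]
    return [min(max(v, lo), hi) for v in values]


def ols_slope(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var


# Synthetic cross-section: the label rises with the feature,
# except for one extreme return on the lowest-feature name.
feature = list(range(1, 11))
raw_label = [0.001 * f for f in feature]
raw_label[0] = 2.0

slope_raw = ols_slope(feature, raw_label)
slope_winsorized = ols_slope(feature, winsorize(raw_label))
```

The single outlier flips the sign of the fitted slope. After winsorizing, the linear model recovers the positive relation that holds for the other nine names.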

In crypto perpetuals, the problem is different. On the primary 8-hr forward return regression label, only one family leader is clearly credible: NLinear at +0.0293 daily IC with a HAC 95% interval of [+0.0168, +0.0419]. GBM, linear, and TabM all straddle zero on that same primary label. But directional reframings recover the signal for other families. That is not a “deep learning wins” story. It is a label-and-market-structure story in a small sample with only 19 instruments and two folds.

In ETFs, the comparison is broad enough to separate prediction quality from strategy quality. The study compares all major model families across a large, liquid cross-asset panel. The editorial point here is not to crown a universal winner. It is to show that the family leader on validation IC need not be the leader once the signal is turned into a strategy. The prediction problem and the portfolio problem are related but not the same.

In the NASDAQ-100 microstructure study, the signal can be statistically real yet economically fragile. The current rank-1 prediction has a daily IC of around +0.0054, with a HAC interval of [+0.0022, +0.0086], indicating positive directional alignment. But the holdout strategy Sharpe still flips negative, around -1.69, and the strategy trails the equal-weight holdout benchmark. That is a clean example of why “detected signal” and “deployable strategy” are not synonyms at intraday cadence.

S&P 500 equity-plus-options analytics make a different point: options can be a useful source of information for stock prediction, but credible validation results still do not settle the execution and holdout questions.

And in direct S&P 500 options, the cost problem becomes the case study. The workflow uses a dedicated hold-to-maturity cost cascade because standard basis point grids are the wrong abstraction for the instrument. Even there, the strategy analysis says no statistically resolved edge has been earned yet. That is a useful result. It shows what it looks like when the instrument pushes back hard enough that careful modeling is still not enough.

Figure 2. The same research stack repeats across all nine studies. That is what makes the differences interpretable rather than anecdotal.

Why the case studies matter

This is the real reason to read that part of the book closely.

The case studies do not just show that ML can be applied to many markets. They show how differently the same research stack behaves when the market changes.

Sometimes label engineering matters more than architecture. Sometimes costs decide the result. Sometimes breadth rescues a weak signal. Sometimes the prediction is credible, but the strategy is not. Sometimes the right conclusion is not to deploy, but to narrow the claim, change the horizon, or stop.

Because the trade definition, label design, model comparison, backtest, cost accounting, and holdout discipline are kept explicit, you can ask better questions when a strategy fails. Was the label wrong? Was the cross-section too narrow? Was the cost regime too severe? Did the signal survive validation but die in holdout? Did the apparent edge disappear once uncertainty and search adjustments were counted?

And those questions are not asked loosely. The workflow forces them through HAC IC, false-discovery control, probabilistic and deflated Sharpe, bootstrap uncertainty, and search-accounting discipline before a result is allowed to sound stronger than it is.
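
The probabilistic Sharpe ratio is the most compact of these checks. The sketch below follows the Bailey and Lopez de Prado formula on a synthetic per-period return series. The function names are hypothetical, and the deflated variant, which replaces the benchmark with the expected maximum Sharpe across trials, is not shown.

```python
import math
from statistics import NormalDist, mean, pstdev


def sharpe(returns):
    """Per-period (non-annualized) Sharpe ratio."""
    return mean(returns) / pstdev(returns)


def probabilistic_sharpe(returns, benchmark_sr=0.0):
    """Probability that the true Sharpe exceeds benchmark_sr, adjusting for
    track-record length, skewness, and kurtosis."""
    n = len(returns)
    mu, sigma = mean(returns), pstdev(returns)
    sr = mu / sigma
    z = [(r - mu) / sigma for r in returns]
    skew = sum(v ** 3 for v in z) / n
    kurt = sum(v ** 4 for v in z) / n
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return NormalDist().cdf((sr - benchmark_sr) * math.sqrt(n - 1) / denom)


# Synthetic returns, for illustration only.
short_track = [0.01, -0.005, 0.007, 0.002, -0.003, 0.004, 0.006, -0.001]
long_track = short_track * 4  # same distribution, four times the history

psr_short = probabilistic_sharpe(short_track)
psr_long = probabilistic_sharpe(long_track)
```

Both series have the same Sharpe ratio, but the longer track record earns the higher probability. The headline number alone does not carry that information.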

That is serious empirical work. And it is much closer to how real quant research feels than a neat parade of winning backtests.

Read that way, the nine case studies are not a tour of examples. They are the part of the book where the whole argument is exposed to the market and forced to earn its claims.

The ML4T library ecosystem, built to support the workflow

The software stack behind that process is documented on our website and is live on PyPI:

ML4T Data handles multi-provider acquisition and point-in-time storage;
ML4T Engineer builds features, labels, and alternative bars;
ML4T Models adds finance-native latent-factor, SDF, direct-prediction, and portfolio-learning models;
ML4T Diagnostic covers IC analysis, false-discovery control, Deflated Sharpe, Rademacher, PBO, CPCV, and tearsheets; and
ML4T Backtest turns signals into event-driven strategy results with explicit execution, risk, and account rules.

Together, they make the case studies reproducible rather than descriptive.

Six libraries, one workflow

Stefan Jansen — Tue, 12 May 2026 16:03:59 GMT

A notebook can demonstrate a workflow. It rarely makes the workflow reusable. The harder problem is preserving the assumptions that make a result interpretable: data provenance, label construction, validation design, execution semantics, and deployment controls.

That is the change this spring. The ML4T workflow now has a public software layer that readers can inspect and pressure-test, rather than reconstructing everything from scattered notebooks and chapter code.

Six libraries now carry the main parts of that loop:

ml4t-data
ml4t-engineer
ml4t-models
ml4t-diagnostic
ml4t-backtest
ml4t-live

The libraries are not equally mature: most are in public beta, ml4t-live is still alpha, and ml4t-models is the most recent addition. Even so, the reusable layer is now concrete enough for readers to run, inspect, and break.

The ML4T library ecosystem is a six-step research-to-production workflow: ml4t-data to ml4t-engineer to ml4t-models on the build path, then ml4t-diagnostic to ml4t-backtest to ml4t-live on the prove-and-deploy path, with an iterate-and-redeploy loop back to data.

From teaching material to professional workflow

The six libraries line up with the actual research and deployment sequence:

ml4t-data acquires, stores, and refreshes data
ml4t-engineer builds features, labels, and leakage-safe training inputs
ml4t-models packages finance-native model families and hands predictions downstream
ml4t-diagnostic asks whether the signal survives statistical scrutiny
ml4t-backtest simulates execution under explicit behavioral assumptions
ml4t-live carries the same strategy surface into shadow, paper, and live operation

That means a reader can now do something much more concrete than “learn the workflow.” They can pull a futures or equities panel with ml4t-data, construct features and targets with ml4t-engineer, train or score them with ml4t-models, test IC stability and multiple-testing risk with ml4t-diagnostic, simulate next-bar or quote-aware execution in ml4t-backtest, and then carry the same strategy surface into ml4t-live shadow mode. That is a more concrete workflow than another abstract essay about process.

Build

`ml4t-data`

Every quant workflow begins with data, and data engineering failures often stay invisible until they become expensive. ml4t-data is the acquisition, storage, and refresh layer for the rest of the workflow.

Its core abstraction is a DataManager that provides a single interface for fetching, storing, updating, and loading data across providers. Breadth matters: the package covers about 20 provider adapters spanning equities, crypto, futures, FX, macro series, prediction markets, and factor data.

The more important part is that the package treats data as an ongoing research asset rather than a one-off notebook download, with local Parquet storage, metadata-backed refresh workflows, gap detection, backfills, and validation in the same layer. That is why the futures and the commitment-of-traders (COT) modules matter. They solve recurring workflow problems that simple wrappers usually ignore: bulk futures ingestion, continuous contract construction, and a point-in-time combination of weekly positioning data with market series.
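
The point-in-time combination is worth spelling out, because the weekly report describes positions on one date and becomes public on a later one. The sketch below joins on the release date using only the standard library. The numbers are synthetic, the Tuesday-positions and Friday-release timing is an assumption about the typical COT schedule, and none of this is the ml4t-data API.

```python
from bisect import bisect_right
from datetime import date

# (released, as_of, net_position), sorted by release date. Synthetic values.
reports = [
    (date(2024, 1, 5), date(2024, 1, 2), 1000),
    (date(2024, 1, 12), date(2024, 1, 9), 1500),
]
release_dates = [r[0] for r in reports]


def latest_report(on):
    """Most recent report already released on the given date, or None.
    Counts the release day itself as available, which assumes the feature
    is used after the publication time on that day."""
    i = bisect_right(release_dates, on)
    return reports[i - 1] if i else None


price_dates = [date(2024, 1, 4), date(2024, 1, 10), date(2024, 1, 12)]
joined = [(d, latest_report(d)) for d in price_dates]
```

On 10 January the join still returns the report released on 5 January, even though positions as of 9 January already exist. Joining on the as-of date instead would leak three days of future information into every row.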

`ml4t-engineer`

ml4t-engineer is where raw market data starts becoming something a model can learn from.

It includes 120 features across 11 categories, as well as triple-barrier labeling, alternative bars, feature discovery, fractional differencing, preprocessing, and leakage-safe dataset-building utilities. The important design choice is that feature construction, label construction, and ML-ready dataset preparation are all in one package rather than scattered across custom scripts.

These steps are not independent. Triple-barrier labels, ATR-scaled barriers, volume and dollar bars, tick-imbalance bars, fractional differencing, registry-driven discovery, and train-only preprocessing all change the shape of the learning problem. Treating them as one layer is a workflow choice, not just an API choice.
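
To see why these choices reshape the learning problem, consider a bare-bones triple-barrier label. The function below is a sketch with hypothetical names and fixed percentage barriers. It is not the ml4t-engineer implementation, which also supports variants such as ATR-scaled barriers.

```python
def triple_barrier_label(prices, start, up, down, max_holding):
    """Return (label, exit_index): +1 if the upper barrier is hit first,
    -1 if the lower barrier is hit first, 0 if the holding period expires."""
    entry = prices[start]
    end = min(start + max_holding, len(prices) - 1)
    for t in range(start + 1, end + 1):
        ret = prices[t] / entry - 1
        if ret >= up:
            return 1, t
        if ret <= -down:
            return -1, t
    return 0, end


# Synthetic price path, for illustration only.
prices = [100.0, 101.0, 103.0, 99.0, 95.0]
label_a = triple_barrier_label(prices, start=0, up=0.02, down=0.02, max_holding=3)
label_b = triple_barrier_label(prices, start=2, up=0.02, down=0.02, max_holding=2)
label_c = triple_barrier_label(prices, start=0, up=0.10, down=0.10, max_holding=2)
```

The same price path yields a win, a loss, or a timeout depending on barrier width and holding period, so the label already encodes a trading rule. Some implementations assign the sign of the return at the time barrier instead of 0. That is a further design choice the model inherits.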

The validation posture is also concrete. The library shows explicit validation against TA-Lib-compatible features and AFML-style labeling methods, which is the right kind of proof for software that sits directly between market data and model training.

`ml4t-models`

ml4t-models is the newest and narrowest of the six, but it has a clear modeling point of view.

It starts from finance-native contracts: persistent panels, ragged cross-sections, portfolio sequences, structural factor extraction, stochastic discount factor learning, direct asset prediction, and end-to-end portfolio allocation.

The public surface is correspondingly specific. The library includes latent factor estimators such as PCA, Risk-Premium PCA (RPPCA), Instrumented PCA (IPCA), and Conditional Autoencoder (CAE) variants; a stochastic discount factor model; a supervised autoencoder for direct asset prediction; and portfolio-learning models for linear, LSTM, and deeper allocation settings. It also includes helpers that pass predictions and weight frames to the backtest and diagnostic layers, rather than treating modeling as an isolated exercise.

Prove and deploy

`ml4t-diagnostic`

ml4t-diagnostic is the part of the ML4T stack that asks the hard question last-mile research too often postpones: is there a real signal here, or just activity that looked convincing in the sample?

Its public surface leans into HAC-adjusted information coefficients, purged and combinatorial cross-validation, deflated Sharpe, false-discovery control, PBO, feature selection, structured backtest reporting, and template-based tearsheets.

Signal validation, statistical corrections, feature diagnostics, and backtest reporting live in one place. That makes it easier to separate prediction-quality problems from portfolio-translation problems and to ask whether an apparent result remains credible once multiple testing, autocorrelation, and leakage risks are accounted for honestly.
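
False-discovery control is a good example of a correction that is easy to state and easy to skip. The following is a plain Benjamini-Hochberg sketch on synthetic p-values. It illustrates the procedure and is not the ml4t-diagnostic API.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected
    under the Benjamini-Hochberg step-up procedure at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k_max = rank  # largest rank that passes its threshold
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected


# Synthetic p-values for six candidate features.
pvalues = [0.001, 0.008, 0.039, 0.041, 0.20, 0.60]
kept = benjamini_hochberg(pvalues)
naive_count = sum(p < 0.05 for p in pvalues)
```

Four features pass a naive 5 percent screen. Only two survive once the screen accounts for having tested six.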

`ml4t-backtest`

The backtesting space is crowded, which is one reason ml4t-backtest needs a clearer claim than “another framework.”

It is an event-driven simulator with explicit execution semantics and parity profiles that make comparisons meaningful rather than vague. The package emphasizes same-bar and next-bar execution modes, quote-aware fills, position-level and portfolio-level risk rules, and profiles spanning common frameworks, plus a conservative, realistic mode.
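
The difference between same-bar and next-bar execution fits in a few lines. The bars, mode names, and helper functions below are made up for illustration and do not reflect ml4t-backtest's configuration surface.

```python
# Synthetic bars, for illustration only.
bars = [
    {"open": 100.0, "close": 101.0},  # signal computed from this bar's close
    {"open": 102.0, "close": 103.0},
    {"open": 103.5, "close": 102.5},  # position closed at this bar's close
]


def entry_price(bars, signal_bar, mode):
    if mode == "same_bar":
        # Assumes the order fills at the very close that produced the signal.
        return bars[signal_bar]["close"]
    if mode == "next_bar":
        # Fills at the first price observable after the signal exists.
        return bars[signal_bar + 1]["open"]
    raise ValueError(f"unknown mode: {mode}")


def pnl_per_share(bars, signal_bar, exit_bar, mode):
    return bars[exit_bar]["close"] - entry_price(bars, signal_bar, mode)


same_bar_pnl = pnl_per_share(bars, 0, 2, "same_bar")
next_bar_pnl = pnl_per_share(bars, 0, 2, "next_bar")
```

The signal, the exit, and the prices are identical. Only the fill assumption differs, and it accounts for two thirds of the profit in this toy case. Two backtests can only be compared once that assumption is stated.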

It also preserves inspectable artifacts after the run: fills, trades, portfolio state, predictions, and resolved config snapshots. That is what makes the bridge into ml4t-diagnostic reliable.

`ml4t-live`

Last but not least, ml4t-live is one of the more recent additions to the group.

It extends the workflow into a staged operation. The same Strategy interface used in ml4t-backtest carries into live or shadow trading with broker adapters, feed adapters, safety controls, reconciliation, preflight checks, and execution journaling.

ml4t-live is built around staged deployment: shadow mode first, then paper trading, then live operation with explicit controls around stale data, position limits, order limits, drawdown limits, and kill-switch persistence. That is a much more honest view of production than pretending a strategy is “deployed” once an API key works.

Where to start

The best starting point depends on the problem you already have.

If you need repeatable acquisition and updates:

ml4t-data quickstart
ml4t-data providers
ml4t-data incremental updates

If you already have data and need ML-ready features and labels:

ml4t-engineer quickstart
ml4t-engineer labeling guide
ml4t-engineer dataset builder

If you have signals and want to test credibility:

ml4t-diagnostic quickstart
ml4t-diagnostic workflows
ml4t-diagnostic statistical tests
ml4t-diagnostic backtest tearsheets

If you want to compare execution assumptions:

ml4t-backtest quickstart
ml4t-backtest profiles
ml4t-backtest execution semantics

If you want the safest path from backtest to production:

ml4t-live quickstart
ml4t-live risk guide
ml4t-live examples guide
ml4t-live operator guide

If you want to inspect the finance-native model layer:

ml4t-models docs
ml4t-models quickstart
ml4t/models repo

I wanted to start this newsletter run with the libraries because they are the parts that readers can use immediately.

People in this field already know they need cleaner data pipelines, leak-aware feature work, honest validation, realistic execution, and safer production handoff. The question is whether those principles have been turned into reusable software with sufficient structure to improve how people actually work.

This issue maps the public software layer. Each library is large enough to deserve its own treatment later. For now, the job is simply to make that layer visible.

Useful feedback starts where these abstractions break against real workflows: provider gaps, labeling edge cases, diagnostics that need different assumptions, execution profiles that do not match a venue, or model contracts that fail on ragged panels.

Nine case studies, one end-to-end workflow

Stefan Jansen — Tue, 05 May 2026 16:59:44 GMT

The early chapters of Machine Learning for Trading, 3rd edition, introduce a research workflow for systematic strategy development. Chapters 6 through 20 then apply it to nine case studies that span seven asset classes, five forecasting horizons, and frequencies from 8-hourly to monthly. The case studies are concrete, worked examples a reader can pick from — the one closest to your data, cadence, or asset class. This issue walks through what the workflow does at each stage, with pointers to where the libraries that support it are located.

Each case has its own page at ml4trading.io/case-studies with pipeline details, related chapters, and links to the GitHub code.

The ML for Trading workflow that organizes the book — research loop above the evidence boundary, deployment below, and feedback closing the cycle.

What the workflow does at each stage

Setup (Chapter 6). Each case begins with an explicit specification: the asset universe, the rebalance cadence, the train/validation/holdout split, the baseline checkpoints the case will measure itself against, and a search-accounting log that records every model trained — feature inputs, hyperparameters, fold-level metrics, and runtime artifacts. Chapter 6 argues for explicit search accounting as a guard against backtest overfitting: reproducibility is hard to recover once prior runs have disappeared from memory.

Labels (Chapter 7). The label is the quantity the model is trained to predict; defining it is a modeling decision, not a downstream encoding of a separate target. Chapter 7 organizes labels into fixed-horizon and variable-horizon families. Fixed-horizon labels are evaluated at a predetermined offset: continuous forward returns over the trading horizon for regression, or discrete state codes — the sign of the return, a quantile bucket, or exceedance of a volatility-scaled threshold — for classification. Variable-horizon labels let the realized horizon depend on the path: trend-scanning labels expand the look-forward window until a trend test rejects, and triple-barrier labels resolve when one of a profit target, stop loss, or time limit binds first. Horizon and the instrument’s cost regime at that horizon constrain the choice — a 21-day forward return on monthly ETFs is a different estimation problem than an 8-hour funding-period return on crypto perpetuals. Label construction primitives, alternative bar samplers, and the overlap-aware sample-weighting they require are provided by ml4t-engineer.
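
The triple-barrier rule described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not the ml4t-engineer implementation: the function name triple_barrier_label is hypothetical, barriers are symmetric fractional returns checked on closing prices only, and intrabar highs and lows are ignored.

```python
def triple_barrier_label(prices, entry, profit_target, stop_loss, max_holding):
    """Label one entry by whichever barrier binds first.

    prices: sequence of closing prices; entry: index of the entry bar.
    profit_target, stop_loss: positive fractional returns (0.02 means 2%).
    max_holding: time limit in bars.
    Returns (label, exit_index) with label +1 (profit target), -1 (stop loss),
    or 0 (time limit reached without touching either price barrier).
    """
    p0 = prices[entry]
    last = min(entry + max_holding, len(prices) - 1)
    for t in range(entry + 1, last + 1):
        ret = prices[t] / p0 - 1.0
        if ret >= profit_target:
            return 1, t
        if ret <= -stop_loss:
            return -1, t
    return 0, last


# Hypothetical price path, for illustration only.
path = [100.0, 101.0, 99.5, 102.5, 97.0, 103.0]
hit_target = triple_barrier_label(path, entry=0, profit_target=0.02, stop_loss=0.02, max_holding=5)
hit_stop = triple_barrier_label(path, entry=3, profit_target=0.02, stop_loss=0.02, max_holding=2)
timed_out = triple_barrier_label(path, entry=0, profit_target=0.05, stop_loss=0.05, max_holding=2)
```

The realized horizon differs across the three calls (three bars, one bar, two bars), which is why variable-horizon labels overlap in time and need the overlap-aware sample weighting mentioned above.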

Features (Chapter 8). Engineered features come from ml4t-engineer: roughly 120 technical indicators across 11 categories — momentum, volatility, trend, volume, microstructure, and others — Polars-native and JIT-compiled, with around 60 cross-validated against TA-Lib. Where the asset structure supports them, alternative bar samplers (volume bars, dollar bars, tick-imbalance bars) replace fixed-time bars; microstructure features appear when intrabar data are available.

Model-based features (Chapter 9). Features extracted from auxiliary statistical models fit per series, used to encode dynamics that engineered indicators capture only loosely:

Kalman-filtered states and innovations,
spectral and path-signature coefficients,
ARIMA residuals,
GARCH and HAR/rough-volatility estimates,
HMM and Wasserstein regime posteriors,
fractional-differencing transforms for stationarity, and
uncertainty-aware variants of each.

The distinction is mechanical rather than thematic — the feature is an estimated quantity from a fitted model, so its training-time and inference-time computation has to respect the same purged-walk-forward discipline as the prediction model that consumes it.
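
A minimal sketch of what a purged walk-forward split looks like, assuming an expanding training window and a fixed-size test block. The function name and parameters are hypothetical, not a library API. Any auxiliary model used to produce a feature would be fit on the training indices of a fold and applied to that fold's test indices only.

```python
def purged_walk_forward(n_samples, n_folds, test_size, purge):
    """Yield (train_indices, test_indices) for expanding-window walk-forward folds.

    The last `purge` observations before each test block are dropped from
    training, so labels whose horizon reaches into the test period cannot
    leak into the fit.
    """
    first_test_start = n_samples - n_folds * test_size
    if first_test_start - purge <= 0:
        raise ValueError("not enough samples for the requested folds and purge")
    for k in range(n_folds):
        test_start = first_test_start + k * test_size
        test_end = test_start + test_size
        train_end = test_start - purge  # exclusive upper bound of the training window
        yield list(range(0, train_end)), list(range(test_start, test_end))


folds = list(purged_walk_forward(n_samples=100, n_folds=3, test_size=10, purge=5))
for train_idx, test_idx in folds:
    # Training always ends at least `purge` observations before testing starts.
    print(train_idx[-1], test_idx[0], test_idx[-1])
```

The purge length would normally be set to at least the label horizon; the value 5 here is arbitrary.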

Feature evaluation (Chapter 7, second pass). Before any model is trained, every feature is screened individually — daily cross-sectional information coefficient, ICIR, HAC-robust standard errors, walk-forward folds. Features that fail the triage screen do not silently carry forward into model training. The diagnostic machinery here lives in ml4t-diagnostic, which also provides the deflated Sharpe ratio, combinatorial purged cross-validation, and the multiple-testing corrections used downstream.

Model families (Chapters 11–15). Each case runs the families that fit its data:

regularized linear models (Chapter 11),
gradient-boosted trees and tabular deep learning (Chapter 12),
sequence deep learning, including LSTMs, TCNs, and transformers (Chapter 13),
latent-factor models, including IPCA and a stochastic-discount-factor specification (Chapter 14), and
double machine learning for the cases where confounding is the open question (Chapter 15).

The latent-factor and SDF estimators, together with a conditional-autoencoder model and several end-to-end portfolio-learning architectures, are packaged in ml4t-models — the most recent of the six libraries. Hyperparameter search and fold-level evaluation are uniform across families; comparability across cases comes from running the same protocol everywhere, not from picking a per-case favorite.

Signal-stage backtest (Chapter 16). Predictions become positions through the backtester provided by ml4t-backtest — event-driven with point-in-time correctness, exit-first order processing matching real broker behavior, configurable same-bar or next-bar fills, and quote-aware execution that distinguishes bid, ask, and midpoint sources. The same code path runs the validation backtest and the frozen-holdout backtest, with no leakage between them.
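
The difference between same-bar and next-bar fills is easy to show on a toy series. The sketch below is an illustration only: Bar and fill_price are hypothetical names, not ml4t-backtest classes, and it assumes same-bar fills use the signal bar's close while next-bar fills use the following bar's open.

```python
from dataclasses import dataclass


@dataclass
class Bar:
    open: float
    close: float


def fill_price(bars, signal_index, mode):
    """Price at which a signal generated on bar `signal_index` is assumed to fill.

    mode="same_bar": fill at the close of the bar that produced the signal.
    mode="next_bar": fill at the open of the following bar.
    Returns None when there is no later bar to fill against.
    """
    if mode == "same_bar":
        return bars[signal_index].close
    if mode == "next_bar":
        nxt = signal_index + 1
        return bars[nxt].open if nxt < len(bars) else None
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical bars, for illustration only.
bars = [Bar(100.0, 101.0), Bar(101.5, 103.0), Bar(102.0, 102.5)]
same_bar_fill = fill_price(bars, 0, "same_bar")
next_bar_fill = fill_price(bars, 0, "next_bar")
slippage_from_delay = next_bar_fill - same_bar_fill
```

The same signal fills at two different prices depending on the convention, so two frameworks can disagree on PnL without either one having a bug.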

Portfolio construction (Chapter 17). Allocator choice is part of the experiment. Equal-weight long-short top-N is the baseline; risk parity, mean-variance with shrinkage, and robust-optimization variants run alongside where the universe supports them. End-to-end portfolio-learning models — where the allocator is itself learned rather than rule-based — are part of ml4t-models. The point is to isolate how much of any net result comes from the signal and how much comes from the allocator.

Transaction costs (Chapter 18). Costs are modeled instrument-by-instrument and calibrated to the level a participant trading the case-study universe would incur, rather than a flat basis-point placeholder applied uniformly. Equity bid-ask half-spreads are derived from quote data with a bottom-quintile discipline; futures roll costs and continuous-contract artifacts are handled at the bar-construction level; FX uses interbank-spread approximations; option strategies use premium-scaled bid-asks sized to the round-trip; per-share commissions enter where the cadence makes them binding. Each case reports a sensitivity analysis at multiple cost levels — a single static cost assumption hides where the strategy actually breaks down. The cost machinery sits on the same execution layer as the backtester.
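
A cost-sensitivity table of the kind described above can be sketched as follows. The example assumes a deliberately simple cost model (one-way turnover times a per-unit cost in basis points) and uses made-up return and turnover series. The function names are hypothetical, and the instrument-specific calibration discussed in the text is not represented.

```python
def net_returns(gross_returns, turnover, cost_bps):
    """Subtract trading costs: turnover times the cost per unit traded, in basis points."""
    rate = cost_bps / 10_000
    return [g - t * rate for g, t in zip(gross_returns, turnover)]


def cost_sensitivity(gross_returns, turnover, cost_levels_bps):
    """Map each cost level to the compounded net return over the whole sample."""
    table = {}
    for bps in cost_levels_bps:
        wealth = 1.0
        for r in net_returns(gross_returns, turnover, bps):
            wealth *= 1.0 + r
        table[bps] = wealth - 1.0
    return table


# Made-up per-period gross returns and turnover, for illustration only.
gross = [0.004, -0.002, 0.003, 0.001]
turnover = [1.0, 0.5, 0.8, 0.6]
table = cost_sensitivity(gross, turnover, cost_levels_bps=[0, 5, 10, 20, 40])
for bps, total in table.items():
    print(f"{bps:>3} bps -> {total:+.4%}")
```

Reading the table from low to high cost shows where the sign flips, which is the information a single static cost assumption hides.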

Risk overlays (Chapter 19). Daily-loss caps, drawdown-triggered position cuts, position-size limits, and regime-aware sizing where the data supports an explicit regime layer. The overlays are applied as a separate pass over the cost-aware backtest output rather than fused into the signal stage. Keeping them separate preserves a useful diagnostic distinction: when a result disappoints, you can ask whether the signal had alpha that the cost regime erased, whether the risk overlay was the binding constraint, or whether the overlay never engaged at all.

Cross-case analysis (Chapter 20). The synthesis appears in Chapter 20, which treats the nine cases as a single experiment rather than nine independent reports. It examines how well upstream feature-triage diagnostics predict downstream strategy survival, identifies, case by case, where prediction quality, portfolio translation, or execution friction is the binding constraint, and points to the lever each case suggests for the next research iteration. Detailed per-case numbers are the subject of upcoming issues.

What’s coming

Coming issues will move case-by-case — each case’s binding constraint, the iteration step the evidence suggests, what changed between research cycles, and the open questions left on the table. The cross-case synthesis from Chapter 20, individual stages worth a deep dive (feature triage, instrument-specific cost modeling, allocator comparison), and the methods themselves all have their own future issues queued up.

The research loop from Chapter 6, alongside the live-trading loop. The new case-study iteration agent runs inside the right-hand loop.

A new agent has also joined the research loop. It runs its own iterations on each case study — re-running setup decisions, refining the feature panel, adjusting cost assumptions and risk parameters, and proposing the next experiment to try. Whatever it surfaces worth reporting will land here.

Per-case detail lives at ml4trading.io/case-studies. The GitHub repo goes live close to launch.

Inside the Agent Lab

Stefan Jansen — Fri, 01 May 2026 15:40:31 GMT

The Agent Lab currently assigns an 82% probability to the upper bound of the federal funds rate being above 3.00% after the April 2027 FOMC meeting. Kalshi trades the same question at 47%. On a different question — whether month-over-month core PCE inflation will print above 0.3% in April 2026 — the Lab is at 30%, and the market is at 51%, the disagreement running the other way.

The Agent Lab is our implementation of the AIA Forecaster (Alur et al. 2025) from Bridgewater AIA Labs, the multi-agent research pipeline that Chapter 24 of the third edition teaches end-to-end. It runs on live prediction-market questions, publishes a probability against each one, and persists every search result, agent trace, and intermediate aggregate to a database so that any run can be replayed against its original evidence.

This issue walks through what the Lab actually does. We are running it as a research experiment alongside the book, not as an institutional product, and the simplifications matter; we will name them as they come up.

What you see on a question page

The landing page lists featured questions across the US macro calendar — federal funds rate, core PCE, payrolls, and GDP. Each card shows the market price, the Lab’s forecast, a one-line agent-derived rationale, and a link to the full dossier.

The dossier is the part worth seeing. For a given question, it shows:

The distribution of individual agent probabilities for that run.
The probability trajectory as new daily forecasts accumulate.
The pipeline arithmetic — agents in, mean probability out.
The supervisor’s synthesis of the evidence with citations to the sources that the agents actually retrieved.
Run ID, generation timestamp, the model used for that run, and the number of search calls.

Every number is timestamped and attributable. This is the chapter’s position made operational: agents as engineering systems for evidence-rich decision support, with replayable traces rather than chat interfaces.

From Chapter 24 to running code

Chapter 24 walks the design space through ten notebooks: a ReAct loop, tool contracts and explicit state, a research agent, aggregation arithmetic, multi-agent research, adversarial debate, the full forecasting pipeline, and an evaluation/governance pass. The notebooks run deterministically in mock mode for teaching and CI, and switch to live providers and live search when the reader is ready.

The aia-forecaster repository (available at publication in June) takes the same architecture and runs it against live Kalshi and Polymarket questions. What the repo adds beyond the notebooks is the operational infrastructure: a SQLite run log with token-cost telemetry, market connectors with retry and filtering, an evaluation harness against historical resolutions, configuration profiles, and a scheduler for daily pull-and-forecast jobs.

How a forecast is built

The AIA Forecaster pipeline — from market question to published probability.

For each market on each daily run, the Lab does the following:

Reword the question. Kalshi market titles are written for traders. The pipeline rewrites each one into a single explicit yes/no question that names the date, threshold, units, and resolution source. “Will the rate of core PCE inflation be above 0.3% in April 2026?” becomes the longer “Will the month-over-month percent change in core PCE be above 0.3% in April 2026, according to the Bureau of Economic Analysis?” The paper publishes this prompt verbatim in Appendix F. It does most of the work of preventing the units-and-time-frame mistakes a fast reader would make.
Run the research agents. Three agents run a ReAct loop in parallel over a search tool. They are identical: same prompt, same temperature, no role specialization. Diversity comes from stochastic sampling, not from prescribed roles. Each agent returns a probability and the evidence chain that produced it. (The paper’s production configuration uses ten agents on a frontier model; we run three on an open-source model. More on that below.)
Aggregate. The mean of the three agent probabilities is the ensemble forecast.
Supervisor pass. A separate agent reviews the ensemble against the market price and the agents’ rationales. It can override the ensemble — but only when its confidence in the override is explicitly high. Most of the time, it confirms.
Persist. The rewritten question, the search results, every agent trace, the aggregate, and the supervisor’s synthesis all land in SQLite, keyed by a run ID. Any run is replayable with its evidence held constant.

The point of step 5 is that nothing the Lab publishes is a black box. If a forecast looks wrong, the question is “which step produced it” — and the answer is in the database.
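
The aggregate and persist steps can be sketched with the standard library alone. The schema, function names, and agent payloads below are hypothetical and much simpler than a real run log. The sketch only shows the idea of keying everything by a run ID so the ensemble can be recomputed from stored traces.

```python
import json
import sqlite3
from statistics import mean


def persist_run(conn, run_id, question, agent_outputs):
    """Store one run: the ensemble mean plus every agent's probability and evidence."""
    ensemble = mean(a["probability"] for a in agent_outputs)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        "run_id TEXT PRIMARY KEY, question TEXT, ensemble REAL, agents_json TEXT)"
    )
    conn.execute(
        "INSERT INTO runs VALUES (?, ?, ?, ?)",
        (run_id, question, ensemble, json.dumps(agent_outputs)),
    )
    conn.commit()
    return ensemble


def replay_run(conn, run_id):
    """Return (stored ensemble, ensemble recomputed from the stored agent traces)."""
    row = conn.execute(
        "SELECT ensemble, agents_json FROM runs WHERE run_id = ?", (run_id,)
    ).fetchone()
    stored, agents = row[0], json.loads(row[1])
    return stored, mean(a["probability"] for a in agents)


# In-memory database and made-up agent outputs, for illustration only.
conn = sqlite3.connect(":memory:")
agents = [
    {"probability": 0.25, "evidence": ["hypothetical source A"]},
    {"probability": 0.30, "evidence": ["hypothetical source B"]},
    {"probability": 0.35, "evidence": ["hypothetical source C"]},
]
ensemble = persist_run(conn, "run-001", "Hypothetical yes/no question?", agents)
stored, recomputed = replay_run(conn, "run-001")
```

Because the agent traces are stored alongside the aggregate, a forecast that looks wrong can be traced to the step that produced it without re-running any search.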

A note on two stages that look ordinary on the diagram but carry most of the system.

Search matters. The paper evaluates the same pipeline on a batch of 64 live markets with and without search: the Brier score is 0.1002 with search and 0.3609 without. The no-search score is worse than always predicting 50%, which mechanically scores 0.25. Each agent’s iterated search-and-reason loop is what produces the headline result.
The supervisor is not a judge. The paper tested a simpler “best of M” supervisor that reads the agents’ forecasts and picks the one it considers best, and it lost to the simple mean — selecting the worst of the M forecasts 7.2% of the time. The agentic supervisor’s gain comes specifically from running new searches to resolve disagreements, not from re-judging the existing answers. Naive verification underperforms averaging; only verification plus additional evidence beats it.
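
For readers who want the arithmetic behind the 0.25 reference point: the Brier score is the mean squared difference between the forecast probability and the 0/1 outcome. The sketch below uses made-up outcomes. Only the 0.25 value for a constant 50% forecast is a general fact.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes (0 or 1)."""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have equal length")
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)


# Made-up resolutions, for illustration only.
outcomes = [1, 0, 0, 1, 1]

# A constant 50% forecast scores (0.5)^2 = 0.25 on every question, whatever the outcome.
coin_flip = brier_score([0.5] * len(outcomes), outcomes)

# Confident forecasts on the wrong side score worse than the coin flip.
confidently_wrong = brier_score([0.1, 0.9, 0.9, 0.1, 0.1], outcomes)
```

A score above 0.25 therefore means the forecasts were, on average, confidently on the wrong side, which is how a pipeline without search can do worse than not forecasting at all.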

What the system is and is not

A few honest caveats about the version visitors see today.

Open-source model by default; three agents, not ten. The paper’s headline configuration runs ten agents per question on a frontier model. We run three agents on Qwen 3 (32 billion parameters), which is open-source, locally hosted, and free at the margin, because a daily sweep on a frontier model with the full number of research-agent iterations would cost on the order of $5/market/day. The price is paid in quality: an open-source model produces forecasts that illustrate how the pipeline works, not the numbers the paper produced, because both web search and model quality matter materially. The same pipeline can run on Anthropic’s Sonnet, and we do occasionally use it for comparison sweeps, but the daily schedule uses Qwen.

It is not built to beat the market. The paper’s headline result is that AIA Forecaster matches a human superforecaster panel on ForecastBench — Brier score 0.075, statistically indistinguishable. On liquid prediction markets, the system underperforms the market consensus on its own. Instead, it produces independent information that improves on the consensus when combined with it. The paper formalizes this with a regression of resolution outcome on (market price, AIA forecast): even on the harder benchmark where AIA loses to consensus on its own, the optimal ensemble assigns roughly a third of its weight to the AIA forecasts, and the combined estimator beats either input alone. The Lab’s position is the same: a calibrated probability against each question, alongside the market price, not in place of it.

Calibration is currently off. Language models trained with RLHF tend to hedge toward the middle of the probability scale: even when the evidence supports an extreme forecast, the raw probability tends to be timid. The paper’s correction is Platt scaling — a logistic transform applied to each forecast as it is produced. The transform has one coefficient, which the paper sets a priori to √3, a value drawn from the calibration literature (Neyman and Roughgarden, 2022) rather than fitting it against the authors’ own benchmarks — a choice they explicitly make to avoid overfitting. With only three agents and open-weight models, the risk for us cuts the other way: extremization can amplify a wrong-side-of-50% forecast into a confidently-wrong forecast. The recent prompt rebuild already produces appropriately confident outputs, so we run with calibration disabled for now.
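
To see why extremization cuts both ways, here is a minimal sketch. It assumes the one-coefficient transform takes the common form that multiplies the forecast's log-odds by the coefficient and maps back through the logistic function. The paper's exact parameterization may differ, and the function name extremize is hypothetical.

```python
import math


def extremize(p, alpha=math.sqrt(3)):
    """Scale the log-odds of p by alpha and map the result back to a probability."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    log_odds = math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-alpha * log_odds))


# The transform is symmetric around 50%: it pushes forecasts away from the middle
# in whichever direction they already lean.
right_side = extremize(0.60)   # moves up, further above 0.60
wrong_side = extremize(0.40)   # moves down, further below 0.40
unchanged = extremize(0.50)    # 50% has zero log-odds and stays put
```

Under this assumed form, a 60% forecast moves to roughly 67% and a 40% forecast to roughly 33%. If the true answer is yes, the first adjustment helps and the second makes a wrong-side forecast more confidently wrong, which is the risk described above for a small, noisy ensemble.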

The forecasts move in both directions. On the Fed-rate question in the opener, the Lab is well above the market. On core PCE, it is well below. The system is not built to take a contrarian view; it is built to produce an evidence-backed probability and let disagreement, in either direction, stand or fall at resolution.

Read a dossier

Open the Agent Lab, pick a question where the forecast and the market visibly disagree, and read the dossier end-to-end. The agent distribution, the trajectory, the supervisor’s synthesis, and the search citations are all there. The next Insights issue lands Tuesday.

More than 27 chapters

Stefan Jansen — Tue, 28 Apr 2026 14:06:11 GMT

The third edition of Machine Learning for Trading ships with 27 chapters (and 400+ notebooks). It also ships with five open-source Python libraries, 112 primer articles, and 56 agent skills across nine categories. Issue 1 promised to unpack what the companion material does that the chapters alone cannot. This issue is part of that unpacking — two of the libraries in detail, plus a tour of the primer set. Skills and the agent layer that consumes them get their own treatment in forthcoming issues.

Five libraries, one workflow

The five libraries trace the research-to-production path the book teaches: data acquisition, feature engineering, signal validation, strategy simulation and evaluation, and live deployment. Each library is finance-native — APIs, semantics, and data contracts tailored to the domain rather than borrowed from a generic ML stack. The chapters establish the methods. The library, its tests, and its validation harness ensure the methods are implemented correctly at scale.

This issue goes deeper into two areas: data management and the backtest-to-live pair.

`ml4t-data`: twenty providers, one interface

Most ML-for-trading projects start by writing the same data layer. Ten different vendor SDKs with ten different schemas, ad-hoc CSVs that drift out of date, and a notebook full of one-off fetches that may or may not reproduce. ml4t-data provides the data layer that the project should not have to write.

A single DataManager unifies 20+ provider adapters behind one interface — the same fetch, load, and update calls regardless of source. Coverage spans 850,000 FRED economic series, 70+ global exchanges via Finnhub, 10,000+ cryptocurrencies via CoinGecko, prediction-market history from Kalshi and Polymarket, academic factor data (Fama-French, AQR), and Databento-backed futures for CME and ICE. Data is stored locally in Hive-partitioned Parquet with metadata tracking and is queryable directly with DuckDB or Polars. CLI commands handle incremental updates, gap detection, and validation against OHLC invariants and anomaly detectors.

The two modules where lookahead bias is most likely to enter quietly get dedicated treatment. The futures module builds continuous contracts with configurable roll logic for CME and ICE products. The COT (Commitment of Traders) module joins weekly CFTC positioning data to OHLCV under explicit point-in-time semantics — the join key is the release date, not the date the positions describe — so the model never sees a Tuesday’s positions before the Friday they were published.
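
The point-in-time join described above amounts to an as-of lookup keyed on the release date. The sketch below illustrates that idea with the standard library. It is not the ml4t-data implementation: the function name, tuple layout, and position values are hypothetical, and it treats a report as visible on its release date, whereas a stricter version would compare timestamps.

```python
from bisect import bisect_right
from datetime import date


def as_of_join(bar_dates, reports):
    """For each bar date, return the latest report released on or before that date.

    reports: list of (release_date, as_of_date, value) tuples sorted by release_date.
    The lookup key is release_date, never as_of_date.
    """
    release_dates = [r[0] for r in reports]
    joined = []
    for d in bar_dates:
        i = bisect_right(release_dates, d)
        joined.append(reports[i - 1] if i > 0 else None)
    return joined


# Positions as of Tuesday, published the following Friday (values are made up).
reports = [
    (date(2024, 1, 5), date(2024, 1, 2), 1200),
    (date(2024, 1, 12), date(2024, 1, 9), 1500),
]
bar_dates = [date(2024, 1, 3), date(2024, 1, 4), date(2024, 1, 5),
             date(2024, 1, 8), date(2024, 1, 12)]
joined = as_of_join(bar_dates, reports)
```

Joining on the as-of date instead would attach the Tuesday positions to the Wednesday and Thursday bars, two or three days before anyone could have seen them.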

`ml4t-backtest` → `ml4t-live`: the same strategy, twice

The hardest move in the workflow is the move from a backtested strategy to a live one. The two environments share almost nothing by default — different data feeds, different fill semantics, different failure modes — and the bugs that survive the transition are the ones that erase paper-PnL the fastest. The backtest and live libraries are designed as a single system that cleanly crosses that boundary.

ml4t-backtest is the simulation engine, and its headline claim is cross-framework parity. Behavioral profiles for Zipline, Backtrader, VectorBT, and LEAN set dozens of knobs to match each target framework exactly, so that you can reproduce either the backtester you know or the behavior of the broker you use. Validated on 250 assets over 20 years, the Zipline profile reproduces 226,723 trades with zero gap and a $10.30 value discrepancy on the final portfolio (0.0014%), running 8× faster than the reference. The point of the parity work is to establish that the engine does what the standard implementations do, so the strategy author can stop worrying about the engine.

ml4t-live takes the same SignalStrategy class, unmodified, into production. SafeBroker enforces 16 risk parameters — position limits, order limits, daily-loss caps, fat-finger rejection at ±5% from market, asset whitelisting. Shadow mode runs the strategy logic on live data without placing real orders, closing the gap between a passing backtest and a safe live deployment. The kill switch persists atomic JSON state, so a mid-run crash does not leave orphaned positions. The design constraint is zero-rewrite migration: the code that passed the backtest is the code that runs the broker.
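
Two of the controls mentioned above can be illustrated with standard-library code: a fat-finger band check and an atomically persisted kill-switch state. This is a sketch of the general technique (write to a temporary file, then rename), not the SafeBroker implementation. All function names and the state layout are hypothetical.

```python
import json
import os
import tempfile


def within_fat_finger_band(order_price, market_price, band=0.05):
    """True if the order price deviates from the market price by at most `band`."""
    return abs(order_price / market_price - 1.0) <= band


def save_state_atomic(path, state):
    """Write JSON to a temp file in the target directory, then rename it over the target.

    os.replace is atomic on the same filesystem, so a crash leaves either the
    old file or the new one on disk, never a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_state(path):
    """Read the persisted state; default to an untripped switch if no file exists."""
    if not os.path.exists(path):
        return {"kill_switch": False}
    with open(path) as f:
        return json.load(f)


# Hypothetical prices: an order 4% away passes, one 6% away is rejected.
accepted = within_fat_finger_band(104.0, 100.0)
rejected = not within_fat_finger_band(106.0, 100.0)
```

On restart, a process would call load_state before sending any order, so a switch tripped before a crash stays tripped afterward.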

112 primers: a menu of choice

The second edition carried its own prerequisites. Chapters had to stop and introduce hypothesis testing, linear regression, and basic time-series mechanics. That crowded out pages for the trading applications the book was actually about. The third edition moves that material into primers and gives the chapters back to trading.

The result is a menu for you to pick and choose from along two axes:

Level of preparation: 20 foundational primers, 66 intermediate, 26 advanced.
Topic: 24 primers are cross-chapter, the other 88 provide background on or further expand on individual chapters: Ch 9 on model-based features has 11 primers, Ch 17 on portfolio construction has 8, Ch 14 on latent factors has 7.

264 unique research citations back the set. A few primers to show the range:

“Volatility: Realized, Implied, and Why It Clusters” (Cross-chapter, foundational). Three distinct volatility objects — realized, implied, and conditional — and why they are not interchangeable. Square-root annualization explained. Volatility clustering as a potential regime marker, not an artifact of noisy estimation.
SPY daily log returns and 20-day rolling annualized realized volatility, 2018–2024. Long calm stretches near 10–15% punctuated by short clusters: late 2018, the COVID spike to ~94%, the 2022 sell-off, and the 2023 banking episodes. Roughly 5% of days exceed 30% annualized volatility. From the primer “Volatility: Realized, Implied, and Why It Clusters”.
“Random Matrix Theory for PCA in Finance” (Ch 14, advanced). The Marchenko–Pastur law provides a null benchmark for the eigenvalue distribution of a sample covariance or correlation matrix in the absence of structure. In finance, eigenvalues that exceed the upper edge of the Marchenko–Pastur bulk (the range of eigenvalues that finite-sample estimation noise alone can generate) are often interpreted as candidate signal components, while eigenvalues inside the bulk are treated as noise-dominated. This gives a practical, though assumption-dependent, way to decide how many principal components to retain and motivates covariance cleaning and eigenvalue shrinkage methods.
“Temporal-Difference Learning and Bellman Equations” (Ch 21, intermediate). The foundation for value-based RL methods used in applications such as execution and hedging includes the Bellman equations, value iteration, TD(0), the bias–variance trade-off relative to Monte Carlo estimation, and the transition from tabular control methods such as Q-learning and SARSA to neural variants such as DQN and Double DQN.
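
As a small illustration of the random-matrix primer's idea, the sketch below computes the Marchenko-Pastur bulk edges and counts eigenvalues above the upper edge. It assumes i.i.d. noise with a known variance and fewer assets than observations. The function names and the eigenvalues are hypothetical.

```python
import math


def marchenko_pastur_edges(n_assets, n_obs, variance=1.0):
    """Lower and upper edges of the Marchenko-Pastur bulk for q = n_assets / n_obs <= 1."""
    q = n_assets / n_obs
    lower = variance * (1.0 - math.sqrt(q)) ** 2
    upper = variance * (1.0 + math.sqrt(q)) ** 2
    return lower, upper


def count_signal_components(eigenvalues, n_assets, n_obs, variance=1.0):
    """Number of eigenvalues that exceed the upper edge of the noise bulk."""
    _, upper = marchenko_pastur_edges(n_assets, n_obs, variance)
    return sum(1 for ev in eigenvalues if ev > upper)


# 100 assets and 400 observations of a correlation matrix (variance = 1).
lower_edge, upper_edge = marchenko_pastur_edges(n_assets=100, n_obs=400)

# Made-up leading eigenvalues, for illustration only.
eigenvalues = [12.0, 3.1, 2.4, 2.0, 1.1, 0.6]
n_signal = count_signal_components(eigenvalues, n_assets=100, n_obs=400)
```

With these inputs the bulk spans 0.25 to 2.25, so three of the made-up eigenvalues would be read as candidate signal components. As the primer notes, that reading depends on the noise assumptions holding.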

What’s next

Friday’s issue opens the Agent Lab — our implementation of the multi-agent research pipeline based on Bridgewater’s AIA Forecaster that Chapter 24 teaches — running daily on live questions from Kalshi and Polymarket. The 56 agent skills and the agent layer that orchestrates the full research-to-production workflow will be included in a later issue.

What changed in six years, and what didn't

Stefan Jansen — Fri, 24 Apr 2026 12:05:49 GMT

The second edition of Machine Learning for Trading shipped on 31 July 2020. Two months earlier, OpenAI posted the GPT-3 paper to arXiv. The third edition ships in June 2026. The second edition added a few early deep-learning applications to the first (December 2018). The second-to-third gap is much larger because it covers six of the most consequential years AI and ML have ever seen.

The field moved. Two questions the book tries to answer did not:

How to develop a trading strategy end-to-end; the book now includes nine case studies from equities and futures to ETFs, FX, and crypto, with holding periods from minutes to months.
How to evaluate a strategy without fooling yourself with a plausible-looking backtest; the new ml4t-diagnostic library ships state-of-the-art overfitting guards — from the Deflated Sharpe ratio to the Rademacher Anti-Serum — and the chapters cover process discipline and the relevant tests in detail.

How the landscape changed

Generative AI and autonomous agents are rapidly becoming part of the research workflow. Three new chapters respond directly:

Retrieval-augmented generation for financial research (Ch 22),
Knowledge graphs (Ch 23), and
Autonomous agents (Ch 24).

Alongside these three, Chapter 10 compresses the second edition’s three NLP chapters — sentiment, topic modeling, word embeddings — into a single chapter organized around transformer-based embeddings as a pipeline stage. Topic modeling and word2vec mostly drop out.

Deep learning diversified, then dispersed. The second edition had a dedicated six-chapter deep-learning part; the third has none. The deeper shift is that finance has begun to develop its own domain-specific architectures, from latent-factor models to end-to-end portfolio learning, rather than importing deep learning from other domains unchanged. The material now travels with the application it serves.

GANs and diffusion models are used for synthetic data (Ch 5).
Transformers support the text feature pipeline (Ch 10).
Tabular DL sits alongside gradient boosting (Ch 12).
Sequence models land in Chapter 13.
Gu, Kelly, and Xiu’s 2019 conditional autoencoder and Chen, Pelger, and Zhu’s 2021 deep-learning stochastic discount factor anchor the latent-factor chapter (Ch 14).
End-to-end portfolio learning sits in Chapter 17.
Deep reinforcement learning, with three concrete applications — optimal execution, market making, and deep hedging — stays in Chapter 21.

Chapter 13 takes a deliberately skeptical view of deep learning for time series. Foundation models are harder to extract value from off the shelf on financial data than in other domains. The chapter’s cross-dataset rollup asks where deep learning actually lands on the curve when LSTMs, TCNs, attention variants, and a foundation model are run across the case-study datasets. Deep learning is a tool with specific strengths, not a blanket replacement.

Three additions at the chapter and section levels. Causal analysis and conformal prediction have continued to gain importance:

Causal machine learning (Ch 15) is a new chapter: Pearl-style identification, double ML for isolating factor effects, Bayesian structural time series, time-series causal discovery.
Conformal prediction is now a standard pipeline stage in Chapter 11, not an advanced topic.

Both matter more now than at any earlier point because LLMs and agents have collapsed the cost of generating plausible-looking hypotheses, and the counterweight is formal robustness.

Chapter 9 adds a new perspective: ARIMA, GARCH, spectral, regime-switching, and Bayesian time-series models are treated as feature extractors for a downstream predictor rather than as standalone forecasters.

Operational reality moved from the edges of the book to the center. From strategy implementation to deployment, details matter in practice:

Chapter 18 is a dedicated chapter on transaction costs — taxonomy, microstructure-regime link, Almgren–Chriss as the unifying framework, and the guardrails for when costs kill a strategy.
Chapter 19 is dedicated to risk management — VaR and CVaR, path risk, stress testing, adaptive controls without leakage, and kill switches.
Chapter 25 covers live trading through Interactive Brokers, Alpaca, and QuantConnect.
Chapter 26 covers MLOps and governance.

None of the four had a counterpart in the second edition. The backtrader and zipline backtesters have been replaced by ml4t-backtest, and we also demonstrate vectorized alternatives like vectorBT.

Two foundation-level additions.

Market microstructure gets its own chapter (Ch 3): tick, volume, and dollar bars as information-driven sampling, limit-order-book reconstruction, continuous-futures construction.
Synthetic financial data moved from an advanced topic in 2E Chapter 21 to a foundation chapter (Ch 5), and broadened well beyond GANs to include Monte Carlo baselines, diffusion models, LLM-based structured-data synthesis, and an explicit fidelity–utility–privacy evaluation framework.
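As a quick illustration of information-driven sampling, the following sketch builds dollar bars from a list of ticks: a bar closes once the traded dollar value reaches a threshold, rather than after a fixed amount of clock time. This is a simplified stand-in, not the ml4t-data API. The function name, the tick format, and the threshold are assumptions.

```python
def dollar_bars(ticks, threshold):
    """Group (price, size) ticks into bars that close once the cumulative
    traded dollar value reaches `threshold`. Returns a list of OHLCV dicts."""
    bars, current, dollars = [], [], 0.0
    for price, size in ticks:
        current.append((price, size))
        dollars += price * size
        if dollars >= threshold:
            prices = [p for p, _ in current]
            bars.append({
                "open": prices[0],
                "high": max(prices),
                "low": min(prices),
                "close": prices[-1],
                "volume": sum(s for _, s in current),
                "dollar_value": dollars,
            })
            current, dollars = [], 0.0
    return bars  # a trailing partial bar is dropped

# Hypothetical ticks as (price, size) pairs
ticks = [(100.0, 10), (101.0, 10), (99.0, 5), (100.0, 20), (102.0, 1)]
bars = dollar_bars(ticks, threshold=2000.0)
```

Tick bars and volume bars follow the same pattern with a different accumulator: count ticks, or sum sizes, instead of summing price times size.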

Data and infrastructure caught up. Polars replaces pandas across notebooks where the migration is worthwhile. Commercial data sources sit alongside free ones because free data has become increasingly rare over the past six years and has serious limitations. Crypto is more central, and platforms like Alpaca make it materially easier to move from research prototype to paper trading and then to small-scale live execution than it was in 2020. Prediction markets (Kalshi, Polymarket) appear to be a new research frontier.

What didn’t change

The constant is process discipline. If anything, the third edition gives it more weight than the second.

Backtesting is one stage in a research pipeline, not the finish line. The book breaks the research-to-deployment arc into dedicated chapters rather than a single chapter on simulation. Chapter 16 is the simulation stage: the ml4t-backtest library, event-driven and vectorized modes, walk-forward with purging and embargo. Chapter 17 is portfolio construction: equal-weight and risk parity as hard benchmarks, the Markowitz curse, hierarchical risk parity, regime-adaptive allocation without discrete switching, and end-to-end portfolio learning. Chapter 18 handles costs, Chapter 19 handles risk, and Chapter 20 synthesizes across the nine case studies — reporting what generalized, what didn’t, and what was deliberately left on the table.
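The phrase "walk-forward with purging and embargo" can be sketched in a few lines. The generator below is a minimal illustration, not the ml4t-backtest API: the function name and parameters are hypothetical, and it makes one simplifying assumption, treating the embargo as extra gap added to the label horizon between the end of training and the start of each test block.

```python
def purged_walk_forward(n_samples, n_splits, test_size, label_horizon, embargo=0):
    """Yield (train_idx, test_idx) pairs with an expanding training window.

    The last `label_horizon + embargo` samples before each test block are
    dropped from training, so no training label overlaps the test period."""
    gap = label_horizon + embargo
    first_test_start = n_samples - n_splits * test_size
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_end = test_start - gap  # exclusive upper bound of training
        if train_end <= 0:
            raise ValueError("not enough data before the first test block")
        train_idx = list(range(train_end))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx

# Example: 100 samples, 3 test blocks of 10, labels looking 5 steps ahead
splits = list(purged_walk_forward(100, n_splits=3, test_size=10,
                                  label_horizon=5, embargo=2))
```

Without the gap, a label computed from the next five bars would let the final training rows see returns that fall inside the test block, which is the leakage that purging is meant to remove.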

Statistical discipline is threaded through the chapters. The following anchor references organize the content of Chapters 7, 11, 16, and 20:

Deflated Sharpe ratio — Bailey and López de Prado (2014)
Rademacher Anti-Serum — Paleologo (2025), Elements of Quantitative Investing, §8.3
Purged, embargoed, and combinatorial cross-validation — López de Prado (2018), Advances in Financial Machine Learning
Probability of Backtest Overfitting — Bailey, Borwein, López de Prado, and Zhu (2015)
Multiple-testing corrections in factor research — Harvey, Liu, and Zhu (2016)
Conformal prediction — the Vovk, Gammerman, and Shafer lineage
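To show what the first item on this list does in practice, here is a minimal sketch of the deflated Sharpe ratio, following the formula in Bailey and López de Prado (2014) as I read it. The function names are hypothetical and this is not the ml4t-diagnostic API. Inputs are assumptions to check against your own setup: the Sharpe ratio is per period rather than annualized, kurtosis is raw (3 for a normal distribution), and the trials are treated as independent.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def expected_max_sharpe(sr_variance, n_trials):
    """Approximate expected maximum Sharpe ratio across `n_trials` (>= 2)
    independent strategies whose true Sharpe ratio is zero."""
    z = NormalDist()
    return math.sqrt(sr_variance) * (
        (1 - EULER_GAMMA) * z.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * z.inv_cdf(1 - 1 / (n_trials * math.e))
    )

def deflated_sharpe_ratio(sr, n_obs, skew, kurt, sr_variance, n_trials):
    """Probability that the true Sharpe ratio exceeds the benchmark implied
    by having selected the best of `n_trials` backtests."""
    sr0 = expected_max_sharpe(sr_variance, n_trials)
    # Standard error term accounts for non-normal returns
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return NormalDist().cdf((sr - sr0) * math.sqrt(n_obs - 1) / denom)

# Same observed Sharpe ratio, judged against 10 versus 1,000 trials
dsr_few = deflated_sharpe_ratio(0.1, 1250, 0.0, 3.0, 0.001, n_trials=10)
dsr_many = deflated_sharpe_ratio(0.1, 1250, 0.0, 3.0, 0.001, n_trials=1000)
```

The comparison at the end is the point of the statistic: the same backtest looks less convincing the more alternatives were tried before it was selected.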

Hands-on implementation remains front and center, growing substantially in scope and scale. The third edition is built around nine case studies across asset classes and frequencies:

ETFs
Broad US equities
US firm characteristics
NASDAQ-100 microstructure on minute-bar TAQ data
S&P 500 equities joined with options analytics
S&P 500 options as a volatility strategy
CME futures
FX majors
Crypto perpetuals, with funding as a structural signal

Roughly 170 case-study notebooks carry each case through the same pipeline stages — setup, labels, features, model-based features, evaluation, linear, GBM, tabular DL, sequence DL, latent factors, causal, backtest, portfolio construction, costs, risk, synthesis. Cross-case rollups appear at the ends of the model chapters and in a dedicated synthesis chapter.

The second edition taught one model at a time, on different datasets. The third edition teaches one pipeline across nine datasets, with explicit protocols for reporting across cases. The cross-case grid is the clearest pedagogical difference between the editions.

More than ‘just a book’: 450+ notebooks, 100+ primers, 56 agent skills, and five libraries

The third edition ships with roughly 450 notebooks, over one hundred primer articles, 56 agent skills across nine categories (concepts, data, features, validation, backtest, portfolio, production, advanced AI, workflows), and five open-source Python libraries:

ml4t-data — sourcing, validation, and point-in-time data pipelines.
ml4t-engineer — feature and label engineering with 120+ financial indicators.
ml4t-diagnostic — model evaluation, overfitting guards, and uncertainty quantification.
ml4t-backtest — event-driven and vectorized strategy simulation with walk-forward controls.
ml4t-live — broker adapters for live execution (Interactive Brokers, Alpaca, QuantConnect).

The agent skills exist because coding agents increasingly participate in implementation. A skill encodes the canonical approach to a specific task — a walk-forward split with purging and embargo, a deflated Sharpe check on a set of backtests, a cost-sensitivity sweep — in a form a reader’s agent can consume without reinventing it. The book carries the argument; the skill shortens the distance between the argument and a correct implementation when an agent does the typing.

The Agent Lab on ml4trading.io is our implementation of Bridgewater’s AIA Forecaster — the multi-agent research pipeline that Chapter 24 shows how to build. It publishes Platt-calibrated probabilities on live Kalshi and Polymarket questions.

About Insights

This is Issue 1 of Insights, a twice-weekly letter running through the June launch and after. Each issue takes one claim from the book — a library, a primer article, an agent skill, a case study, or a new paper — and goes deeper than the book alone has room for. The next issue covers what the five libraries, the primer set, and the 56 skills do that 27 chapters alone cannot. The one after opens the Agent Lab.

If you subscribed to the second-edition list: welcome back.