<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[ML for Trading Insights]]></title><description><![CDATA[Notes on ML for trading — and applied AI more broadly — from the author of the third edition.]]></description><link>https://insights.ml4trading.io</link><image><url>https://substackcdn.com/image/fetch/$s_!5KbB!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7c2e3b8-b91b-41ba-a42d-a371f6359359_800x800.png</url><title>ML for Trading Insights</title><link>https://insights.ml4trading.io</link></image><generator>Substack</generator><lastBuildDate>Fri, 05 Jun 2026 11:08:26 GMT</lastBuildDate><atom:link href="https://insights.ml4trading.io/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Stefan Jansen]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[ml4t@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[ml4t@substack.com]]></itunes:email><itunes:name><![CDATA[Stefan Jansen]]></itunes:name></itunes:owner><itunes:author><![CDATA[Stefan Jansen]]></itunes:author><googleplay:owner><![CDATA[ml4t@substack.com]]></googleplay:owner><googleplay:email><![CDATA[ml4t@substack.com]]></googleplay:email><googleplay:author><![CDATA[Stefan Jansen]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Deep Learning for End-to-End Portfolio Construction]]></title><description><![CDATA[Neural allocators can train directly on portfolio objectives such as Sharpe, costs, and drawdowns. 
That makes the model more aligned with trading performance, and much harder to audit.]]></description><link>https://insights.ml4trading.io/p/deep-learning-for-end-to-end-portfolio</link><guid isPermaLink="false">https://insights.ml4trading.io/p/deep-learning-for-end-to-end-portfolio</guid><dc:creator><![CDATA[Stefan Jansen]]></dc:creator><pubDate>Tue, 26 May 2026 15:42:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!MGze!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaba2fbc-1cb9-4256-a52c-29d47b0fa37d_2752x1351.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most machine learning workflows for trading stop before the trade.</p><p>A model estimates returns, ranks assets, or emits a signal. A risk model estimates covariance or volatility. An allocator turns those objects into weights. A backtest then decides whether the chain produced a portfolio worth trading.</p><p>That separation is valuable because it makes errors easier to diagnose. If the signal has no information coefficient, the allocator is not the first suspect. If the signal has a positive IC but the strategy loses money, the next checks are sizing, turnover, costs, concentration, exposure, and timing.</p><p>End-to-end portfolio learning changes the object being learned. The network maps market features directly to positions and trains on a portfolio-level objective, usually a differentiable Sharpe-style loss computed after volatility scaling and turnover costs. The appeal is alignment: the gradient flows through a portfolio return stream closer to the object used in evaluation, rather than stopping at one-step forecast error. 
The risk is that forecasting, sizing, turnover, and exposure control become entangled in one loss surface.</p><p>The practical rule is narrow: use end-to-end allocators when the objective cannot be cleanly decomposed into forecasts plus an optimizer, and evaluate them as full trading systems rather than allocator modules. Objective alignment is not evidence. It is a reason to run stricter evidence checks.</p><p><a href="https://ml4trading.io/third-edition/chapters/17_portfolio_construction/">Chapter 17 of </a><em><a href="https://ml4trading.io/third-edition/chapters/17_portfolio_construction/">Machine Learning for Trading</a></em> uses this tension to place learned allocators inside a broader portfolio-construction workflow. The chapter does not argue that neural allocators replace classical allocation. It asks a narrower question: when is it useful to train the allocation decision itself?</p><p>Three recent lines of work make the progression visible. The first shows that portfolio weights can be trained directly through a Sharpe-style objective. The second asks which sequence architectures survive under a common volatility-targeted portfolio loss. 
The third adds structure: cost-aware training, cross-market filtration discipline, graph-constrained attention, and a robust regime objective.</p><p>Chapter 17&#8217;s ETF examples use those papers to ask a practical question: what has to be checked when the model owns the path from features to weights?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MGze!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaba2fbc-1cb9-4256-a52c-29d47b0fa37d_2752x1351.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MGze!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaba2fbc-1cb9-4256-a52c-29d47b0fa37d_2752x1351.jpeg 424w, https://substackcdn.com/image/fetch/$s_!MGze!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaba2fbc-1cb9-4256-a52c-29d47b0fa37d_2752x1351.jpeg 848w, https://substackcdn.com/image/fetch/$s_!MGze!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaba2fbc-1cb9-4256-a52c-29d47b0fa37d_2752x1351.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!MGze!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaba2fbc-1cb9-4256-a52c-29d47b0fa37d_2752x1351.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MGze!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaba2fbc-1cb9-4256-a52c-29d47b0fa37d_2752x1351.jpeg" width="1456" height="715" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/faba2fbc-1cb9-4256-a52c-29d47b0fa37d_2752x1351.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:715,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:441771,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://insights.ml4trading.io/i/199282494?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaba2fbc-1cb9-4256-a52c-29d47b0fa37d_2752x1351.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MGze!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaba2fbc-1cb9-4256-a52c-29d47b0fa37d_2752x1351.jpeg 424w, https://substackcdn.com/image/fetch/$s_!MGze!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaba2fbc-1cb9-4256-a52c-29d47b0fa37d_2752x1351.jpeg 848w, https://substackcdn.com/image/fetch/$s_!MGze!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaba2fbc-1cb9-4256-a52c-29d47b0fa37d_2752x1351.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!MGze!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaba2fbc-1cb9-4256-a52c-29d47b0fa37d_2752x1351.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><p><em>Figure 1. The end-to-end portfolio-learning pipeline in Chapter 17. Per-asset features pass through a shared sequence encoder, a bounded signal head, a volatility-targeted position layer, and a portfolio-return aggregation step. The loss is a risk-adjusted statistic of realized portfolio returns, so gradients flow through the allocation decision rather than stopping at a forecast.</em></p><h2>What changes when the model learns weights</h2><p>Chapter 17 starts with a portfolio-construction term sheet: objective, inputs, constraints, rebalancing protocol, cost treatment, and evaluation plan. That framing matters because allocation is otherwise easy to turn into an unlogged search layer once model selection is complete.</p><p>The early sections cover the standard allocator workflow. 
Expected returns may come from a model. Covariance may come from a shrinkage estimator, a factor model, or a realized window. Constraints and turnover penalties shape the final weight vector. Evaluation then asks whether the allocation improved the portfolio relative to a benchmark allocator, using the same signal and backtesting protocol.</p><p>The learned allocators in Section 17.8 are different. They do not consume the same forecast stream as the allocator comparisons earlier in the chapter. They learn from raw or engineered price features and output positions directly. A head-to-head table is still useful, but it compares systems rather than allocators fed identical predictions.</p><p>That makes learned allocators a separate evidence track: they are full trading systems, not allocator modules. They must still beat simple heuristic baselines, and the comparison must also include leakage checks, costs, turnover, drawdown, regime slices, seed variation, and component ablations. A single test Sharpe is not enough evidence when the model owns the entire path from features to weights.</p><p>The tables below should be read as experiment-specific evidence, not as a single consolidated leaderboard across different data masks, model protocols, and portfolio-return calculations.</p><h2>The common computation graph</h2><p>The three implementations share a recognizable core graph, although their constraints, cost treatment, and details of the robust objective differ.</p><p>For each asset and decision time, the model receives a fixed-length lookback window. A sequence encoder processes the window and produces a hidden state. A small head projects the hidden state into a signal. A position layer converts the signal into a tradeable weight, usually after volatility scaling. The portfolio-return layer combines those positions with next-period realized returns and subtracts turnover costs. 
The training loss is computed on the resulting portfolio-return stream.</p><p>The base loss is usually a negative annualized Sharpe-style objective computed over the portfolio return stream. It rewards average portfolio return relative to portfolio volatility rather than one-step forecast accuracy. Later versions add cost terms and robust subperiod penalties. The notation is schematic: the papers and examples differ in output constraints, volatility scaling, transaction-cost treatment, and robust-window construction. Figure 2 puts the objective where it belongs in the workflow: weights, returns, masks, costs, and previous positions first form a net return stream; the loss then rewards pooled Sharpe while adding pressure on weak subperiods.</p><p>The pooled Sharpe term asks whether net portfolio returns compensate for realized volatility after sizing and costs. The SoftMin term is a smoothed worst-window Sharpe ratio; maximizing it rewards policies whose weak windows improve rather than those that rely on a few favorable windows.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!w0iz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5d2c13-25d1-4e75-a24f-73f117288269_2499x1402.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!w0iz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5d2c13-25d1-4e75-a24f-73f117288269_2499x1402.png 424w, https://substackcdn.com/image/fetch/$s_!w0iz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5d2c13-25d1-4e75-a24f-73f117288269_2499x1402.png 848w, 
https://substackcdn.com/image/fetch/$s_!w0iz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5d2c13-25d1-4e75-a24f-73f117288269_2499x1402.png 1272w, https://substackcdn.com/image/fetch/$s_!w0iz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5d2c13-25d1-4e75-a24f-73f117288269_2499x1402.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!w0iz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5d2c13-25d1-4e75-a24f-73f117288269_2499x1402.png" width="1456" height="817" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f5d2c13-25d1-4e75-a24f-73f117288269_2499x1402.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:817,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:227546,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://insights.ml4trading.io/i/199282494?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5d2c13-25d1-4e75-a24f-73f117288269_2499x1402.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!w0iz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5d2c13-25d1-4e75-a24f-73f117288269_2499x1402.png 424w, 
https://substackcdn.com/image/fetch/$s_!w0iz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5d2c13-25d1-4e75-a24f-73f117288269_2499x1402.png 848w, https://substackcdn.com/image/fetch/$s_!w0iz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5d2c13-25d1-4e75-a24f-73f117288269_2499x1402.png 1272w, https://substackcdn.com/image/fetch/$s_!w0iz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5d2c13-25d1-4e75-a24f-73f117288269_2499x1402.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><em>Figure 2. The portfolio objective is built from current weights, next-period returns, availability masks, previous positions, and cost inputs. The formula is shown as a schematic rather than an implementation identity; the symbol key defines the notation used in the image.</em></p><p>Two details are easy to understate.</p><p>First, the pooled Sharpe is not a separable loss. Its gradient depends on the mean and variance of the whole return stream, so it does not decompose into per-sample terms. If a training loop computes Sharpe in small mini-batches and averages the resulting gradients, it optimizes the average mini-batch Sharpe, not the pooled Sharpe across the full panel. Those objectives can prefer different policies because the denominator is computed on different return distributions.</p><p>DeePM treats this as an implementation issue, not a footnote. The paper introduces an exact two-pass microbatching procedure for large effective batches: first accumulate sufficient statistics for the full logical batch, then replay the forward pass with the corrected normalization so the gradient matches the pooled objective.</p><p>Second, cost-aware training differs from post hoc cost reporting. If turnover costs are charged only after training, the model can learn a policy that works on gross returns and fails when traded. When costs are charged inside the return stream being optimized, the model encounters implementation friction during training.</p><h2>Direct Sharpe training is only the start</h2><p><a href="https://arxiv.org/abs/2005.13665">Zhang, Zohren, and Roberts (2020)</a> provide the clean starting point. The paper bypasses the expected-return forecast and trains a neural network to output long-only portfolio weights directly. A softmax layer keeps weights positive and ensures they sum to 1. 
The objective is portfolio Sharpe computed from realized portfolio returns, and gradient ascent updates the model parameters.</p><p>The paper deliberately keeps the architecture simple: a single-layer LSTM with 64 units, a 50-day lookback, close prices and daily returns as inputs, Adam optimization, and a validation split for hyperparameter control. The empirical setup uses four ETFs or index proxies: VTI, AGG, DBC, and a VIX-tracking proxy. Reported test results include volatility scaling and transaction costs. In that four-asset setting, the deep learning strategy performs well relative to the paper&#8217;s baselines and moves substantially toward bonds during the COVID-19 crash, which falls inside the paper&#8217;s test period.</p><p>The novelty lies in the objective, not in architectural complexity. The model demonstrates that portfolio weights can be learned via a differentiable, risk-adjusted objective without first estimating expected returns.</p><p>The Chapter 17 ETF example is useful precisely because it does not flatter the method. 
It puts the same idea into a broader 29-ETF setting, using daily prices from 2006-01-03 to 2025-12-31, a chronological 60/20/20 train/validation/test split, 63-day sequences, and a long-only softmax LSTM trained against a differentiable Sharpe objective.</p><p>The result is a useful negative control:</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/rsTRB/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dcbe20d4-943a-42bc-903d-540fa8f43683_1220x680.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6329048a-b528-4592-92b7-950bd8df505b_1220x838.png&quot;,&quot;height&quot;:400,&quot;title&quot;:&quot;Implementation contract&quot;,&quot;description&quot;:&quot;Design choices an end-to-end allocator has to preserve so portfolio-learning evidence remains auditable.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/rsTRB/1/" width="730" height="400" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>The learned LSTM reduces volatility and drawdown, but the return side does not compensate. Equal weight and inverse volatility are not straw-man baselines here; they are low-turnover controls that expose whether the learned policy earns its extra complexity. The best validation Sharpe is 1.609; the test Sharpe is 0.48. Directly optimizing a portfolio loss does not remove overfitting. 
It moves the overfitting target from prediction error to the portfolio object itself.</p><p>This is why the baseline belongs in this issue. It prevents the argument from becoming &#8220;train the Sharpe and win.&#8221; The paper establishes the training principle. The Chapter 17 replication shows why the principle needs a stronger architecture, cost-aware training, and validation discipline before it becomes competitive.</p><h2>Better sequence models still have to pay costs</h2><p><a href="https://arxiv.org/abs/2603.01820">Saly-Kaufmann et al. (2026)</a> ask the next question: if every model is evaluated under the same portfolio objective, which temporal architecture earns its complexity?</p><p>Their benchmark uses roughly 15 years of futures and currency data across bonds, commodities, energy, foreign exchange, and equity indices. Each model maps a lookback window to a bounded signal in <code>[-1, 1]</code>. A volatility-targeted position layer converts the signal to risk-scaled exposure. The paper uses pooled Sharpe as the optimization objective and reports a broad set of evaluation metrics: annualized return, Sharpe ratio, HAC statistics, hit rate, turnover, passive-relative information ratio, downside risk, seed robustness, and breakeven transaction costs.</p><p>The paper&#8217;s evidence is best read as an architecture ranking under a shared protocol, not as a claim that those absolute Sharpe ratios transfer to the Chapter 17 ETF examples. The benchmark uses a different universe, futures-style instruments, a 10% volatility target, seed averaging, and a different implementation stack.</p><p>Within that benchmark, the main lesson is not that one architecture universally wins, but that inductive bias matters more than raw capacity. VLSTM, a variable-selection network in front of an LSTM encoder, reports the strongest aggregate Sharpe in the main table: 2.39, with a 23.9% annualized return. The hybrid LPatchTST and TFT are close behind, at 2.32 and 2.20, respectively. 
xLSTM has a lower average Sharpe ratio of 1.80 but a more favorable turnover profile than the classical LSTM, which matters for implementation. iTransformer has very low turnover but weak economic performance, with a Sharpe of 0.35 in the reported benchmark.</p><p>The discussion is more important than the ranking. Recurrent or recurrent-hybrid models do well because the architecture builds in a temporal axis rather than forcing the model to infer it from noisier token structure. Variable selection helps because most financial features are weak, unstable, or regime-dependent. But the &#8220;best&#8221; architecture depends on the metric. VLSTM leads on average Sharpe; LPatchTST and VxLSTM, the variable-selection plus xLSTM hybrid, look attractive on some downside and tail-risk measures; xLSTM has a stronger cost buffer in the paper&#8217;s breakeven analysis.</p><p>The paper also checks seed sensitivity. Under a smaller experimental budget, VLSTM still reports a Sharpe near the full-budget estimate: 2.40 in the reduced-seed table versus 2.39 in the main table. That does not make the result universal, but it reduces the risk that the ordering is only a lucky initialization artifact within this benchmark.</p><p>The Chapter 17 VLSTM example applies that idea to the same ETF universe. It keeps the same 29 ETFs, the 2006-2025 price panel, the 63-day sequence length, and the chronological split as in the first ETF example. The model changes the allocator in two ways:</p><ul><li><p>It adds a TFT-style gated residual network and variable-selection network before the LSTM;</p></li><li><p>It replaces the softmax long-only output with a volatility-targeted long-short position layer trained with a 5 bps cost-aware pooled-Sharpe loss.</p></li></ul><p>The cost table below is an out-of-sample stress test on one trained model. 
The model is trained with a 5 bps one-way cost inside the loss; the table then revalues the same held-out weights at 0, 5, 10, 20, and 50 bps one-way cost per dollar of turnover. It is not retrained at each cost level.</p><p>At zero cost, VLSTM is effectively tied with equal weight. The cost profile is the result that matters:</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/j9Dkl/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/000b8696-25f3-4b5a-9278-c04081f3c1a0_1220x542.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6e06ffcb-567e-4f40-87cb-d61f8c18b5f9_1220x700.png&quot;,&quot;height&quot;:400,&quot;title&quot;:&quot;VLSTM cost stress&quot;,&quot;description&quot;:&quot;Held-out Sharpe for one trained Chapter 17 ETF VLSTM model, revalued at different one-way turnover costs.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/j9Dkl/1/" width="730" height="400" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>The VLSTM variant recovers most of the gap between the softmax LSTM and the heuristic allocators at zero cost. It does not survive the 5 bps cost assumption used during training. That is a substantive result, not a caveat. 
An architecture can look competitive gross of costs, yet still fail the deployability test.</p><h2>Robust portfolio learning needs structure</h2><p><a href="https://arxiv.org/abs/2601.05975">Wood, Roberts, and Zohren (2026)</a> advance the same line of work. DeePM is built for systematic macro portfolios, where the model must learn from noisy, non-stationary data, trade across asynchronous global markets, and survive transaction costs.</p><p>The paper identifies three design problems.</p><ul><li><p>The first is the ragged filtration problem. Global markets do not close at the same time. A naive cross-sectional attention layer can inadvertently allow an earlier-closing market to see information from a later-closing one. In the paper&#8217;s implementation, Directed Delay lags cross-sectional conditioning so that cross-market representations are measured with respect to a common information set, preferring filtration discipline over maximum same-day freshness.</p></li><li><p>The second is low signal-to-noise cross-asset learning. Free attention can form economically implausible links and overfit unstable correlations. DeePM uses a macro graph prior: an ex ante economic topology that constrains or biases cross-asset attention toward admissible relationships. The paper gives examples such as intra-group cliques, risk-on links across equities and cyclical assets, and inflation-sensitive links among energy, rates, and precious metals. The graph is a structural regularizer, not a claim that the specified edges are ground-truth causality.</p></li><li><p>The third is regime fragility. A pooled Sharpe objective can be lifted by favorable windows while hiding weak periods. DeePM augments pooled Sharpe with a SoftMin penalty over subperiod Sharpe ratios. As the temperature approaches zero, the penalty approaches the worst-window Sharpe ratio. At intermediate temperatures, it emphasizes weak windows without collapsing onto a single episode. 
The paper connects this to a KL-penalized distributionally robust objective and to Entropic Value-at-Risk.</p></li></ul><p>The architecture maps those problems into explicit modeling choices:</p><ul><li><p>a vectorized variable-selection network with FiLM-style static conditioning, where static asset context modulates features through feature-wise affine transforms;</p></li><li><p>an LSTM temporal backbone plus temporal attention;</p></li><li><p>lagged cross-sectional attention for filtration discipline;</p></li><li><p>macro-graph attention for economic structure;</p></li><li><p>a cost-aware net-return objective;</p></li><li><p>a SoftMin-augmented robust Sharpe loss.</p></li></ul><p>In the DeePM paper&#8217;s 2010-2025 macro futures test, with out-of-sample returns rescaled to a 10% annualized volatility, the full model reports a gross Sharpe ratio of 1.29 and a net Sharpe ratio of 0.93 after transaction costs. Passive equal risk reports a 0.50 net Sharpe ratio, TSMOM 0.45, and the Momentum Transformer baseline, trained with the same transaction-cost regularization, reports 0.66. 
The paper-level ablations matter:</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/4vKe8/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dcaecb66-e65b-4d21-a0fc-efae40e97549_1220x616.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92f9950d-2c45-41d0-91b6-98b653236056_1220x774.png&quot;,&quot;height&quot;:400,&quot;title&quot;:&quot;DeePM paper ablations&quot;,&quot;description&quot;:&quot;Net Sharpe results reported in Wood, Roberts, and Zohren (2026) for the macro futures and FX universe.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/4vKe8/1/" width="730" height="400" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>Those are paper results on a 50-contract futures and FX universe, not results from the Chapter 17 ETF examples. The distinction matters.</p><p>The local ETF example then asks a narrower practical question: what happens when a DeePM-style structure is adapted to the same setting? The example uses 29 ETFs, five asset-class groups, daily prices from 2006 to 2025, an 84-day sequence length, and a chronological 60/20/20 split. It includes FiLM conditioning, variable selection, an LSTM backbone, cross-sectional attention, a small ETF macro graph prior, transaction costs, and the SoftMin objective. The key contrast is full DeePM versus a no-SoftMin ablation.</p><p>One protocol detail prevents a bad cross-table comparison. 
The heuristic baselines are local controls within each example, not a single shared benchmark series. The LSTM and VLSTM examples evaluate final-step returns from sliding windows; the DeePM-style example evaluates the full post-validation test-date mask and normalizes model risk weights before computing returns. The equal-weight rule is not changing; the sampling and return-alignment protocol is. Read each table within its local comparison, not as a claim that the same equal-weight series has two different Sharpe ratios.</p><p>On the test window:</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/gKFau/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/38393a1d-fa53-4f57-be11-436d3054a2c5_1220x532.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/04e61ffe-1e24-4be5-9af5-a860c476f9ab_1220x690.png&quot;,&quot;height&quot;:400,&quot;title&quot;:&quot;DeePM-style ETF test window&quot;,&quot;description&quot;:&quot;Chapter 17 ETF example comparing the full DeePM-style model with local controls on the test window.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/gKFau/1/" width="730" height="400" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>The regime slice explains the source of the improvement. 
The split is mechanical: test days above the median 21-day annualized realized volatility of SPY are labeled &#8220;crisis,&#8221; and the remaining days are labeled &#8220;calm.&#8221; In this run, the threshold is set to 14.0%, resulting in 504 calm days and 503 crisis days.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/YnbHz/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d7b10498-f594-4946-92c3-5859aa9be84a_1220x468.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df8c7ab5-5f40-47e9-9ccc-a7666686c8f2_1220x592.png&quot;,&quot;height&quot;:400,&quot;title&quot;:&quot;DeePM ETF regime slice&quot;,&quot;description&quot;:&quot;Sharpe by median SPY 21-day realized-volatility split: 504 calm days and 503 crisis days.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/YnbHz/1/" width="730" height="400" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>SoftMin does not help by making calm periods better. Calm Sharpe falls slightly from 1.26 to 1.21. The improvement comes from the crisis side, where Sharpe rises from 0.57 to 1.00. 
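</p><p>The median-volatility split described above is simple enough to sketch. The snippet below is a minimal illustration, not the Chapter 17 code: it assumes a pandas Series of daily SPY returns, annualizes 21-day realized volatility with a 252-day convention, and labels days strictly above the median as crisis. The function name <code>label_regimes</code>, the 252-day factor, and the synthetic returns are assumptions made for the example.</p><pre><code class="language-python">import numpy as np
import pandas as pd


def label_regimes(spy_returns, window=21, periods_per_year=252):
    """Label each day 'crisis' or 'calm' by a median realized-volatility split."""
    # Rolling standard deviation of daily returns, annualized.
    vol = spy_returns.rolling(window).std() * np.sqrt(periods_per_year)
    vol = vol.dropna()
    threshold = vol.median()
    # Days strictly above the median are 'crisis'; the rest are 'calm'.
    labels = pd.Series(np.where(vol.gt(threshold), "crisis", "calm"), index=vol.index)
    return labels, threshold


# Synthetic daily returns stand in for SPY test-window data (assumption).
rng = np.random.default_rng(0)
dates = pd.bdate_range("2022-01-03", periods=300)
spy = pd.Series(rng.normal(0.0003, 0.01, len(dates)), index=dates)

labels, threshold = label_regimes(spy)
print(labels.value_counts().to_dict(), round(float(threshold), 4))
</code></pre><p>Because the comparison is strict, a window with an odd number of days places the median day in the calm bucket, which would be consistent with a 504/503 split. In this sketch the threshold is computed from the evaluated window itself, so the labels are a descriptive slice of results rather than a signal available in real time.</p><p>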
In this ETF window, the robust objective improves the shape of losses and the average statistic simultaneously.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tYsa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce16b24-9105-49a5-b690-c694ea5101d8_1754x1504.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tYsa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce16b24-9105-49a5-b690-c694ea5101d8_1754x1504.png 424w, https://substackcdn.com/image/fetch/$s_!tYsa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce16b24-9105-49a5-b690-c694ea5101d8_1754x1504.png 848w, https://substackcdn.com/image/fetch/$s_!tYsa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce16b24-9105-49a5-b690-c694ea5101d8_1754x1504.png 1272w, https://substackcdn.com/image/fetch/$s_!tYsa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce16b24-9105-49a5-b690-c694ea5101d8_1754x1504.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tYsa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce16b24-9105-49a5-b690-c694ea5101d8_1754x1504.png" width="1456" height="1248" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6ce16b24-9105-49a5-b690-c694ea5101d8_1754x1504.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1248,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:319903,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://insights.ml4trading.io/i/199282494?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce16b24-9105-49a5-b690-c694ea5101d8_1754x1504.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tYsa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce16b24-9105-49a5-b690-c694ea5101d8_1754x1504.png 424w, https://substackcdn.com/image/fetch/$s_!tYsa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce16b24-9105-49a5-b690-c694ea5101d8_1754x1504.png 848w, https://substackcdn.com/image/fetch/$s_!tYsa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce16b24-9105-49a5-b690-c694ea5101d8_1754x1504.png 1272w, https://substackcdn.com/image/fetch/$s_!tYsa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce16b24-9105-49a5-b690-c694ea5101d8_1754x1504.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Figure 3. DeePM drawdown comparison from Chapter 17. The full SoftMin version keeps drawdowns shallower during high-volatility periods than the equal-weight baseline and the no-SoftMin ablation. The point of the figure is the path of losses, not only the full-window Sharpe ratio.</em></p><p>The ETF example does not isolate every DeePM component. It does not separately quantify FiLM, V-VSN, Directed Delay, macro graph, and temporal attention as the paper&#8217;s broader ablation table does. Its measured local contrast is full DeePM versus no SoftMin, with the architecture held otherwise fixed. The example also does not run a DeePM cost-stress grid, unlike the VLSTM table, so the result should not be taken as proof that the ETF DeePM policy survives arbitrary cost assumptions. 
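</p><p>A cost-stress grid of the kind mentioned above is cheap to compute once target weights and asset returns are available. The sketch below is an illustration under stated assumptions, not the Chapter 17 implementation: it charges a proportional one-way cost on absolute weight changes, starts from an all-cash portfolio, ignores weight drift between rebalances, and annualizes Sharpe with 252 periods. The function names and the random inputs are hypothetical.</p><pre><code class="language-python">import numpy as np


def annualized_sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of periodic returns, with a zero risk-free rate."""
    return float(np.mean(returns) / np.std(returns, ddof=1) * np.sqrt(periods_per_year))


def cost_stress(weights, asset_returns, cost_bps_grid):
    """Net Sharpe for each one-way cost level, given in basis points.

    weights[t] is held over period t and earns asset_returns[t].
    turnover[t] is the absolute weight change needed to reach weights[t],
    starting from an all-cash portfolio before the first period.
    """
    gross = np.sum(weights * asset_returns, axis=1)
    previous = np.vstack([np.zeros_like(weights[:1]), weights[:-1]])
    turnover = np.abs(weights - previous).sum(axis=1)
    return {bps: annualized_sharpe(gross - turnover * bps / 1e4) for bps in cost_bps_grid}


# Hypothetical inputs: 500 days, 5 assets, noisy long-only weights.
rng = np.random.default_rng(1)
asset_returns = rng.normal(0.0004, 0.01, size=(500, 5))
raw = rng.random((500, 5))
weights = raw / raw.sum(axis=1, keepdims=True)

for bps, sharpe in cost_stress(weights, asset_returns, [0, 1, 5, 10]).items():
    print(f"{bps:2d} bps one-way: net Sharpe {sharpe:.2f}")
</code></pre><p>Reading the whole grid rather than a single cost level shows how quickly a policy's edge erodes as the cost assumption rises.</p><p>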
The ETF Sharpe ratio should also not be read as evidence that the reduced example system is better than the paper system; the asset universe, volatility target, training protocol, ensemble design, and cost model differ.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FTlc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d1d3397-f17f-4dee-ae43-fe37263096f6_2203x1253.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FTlc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d1d3397-f17f-4dee-ae43-fe37263096f6_2203x1253.png 424w, https://substackcdn.com/image/fetch/$s_!FTlc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d1d3397-f17f-4dee-ae43-fe37263096f6_2203x1253.png 848w, https://substackcdn.com/image/fetch/$s_!FTlc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d1d3397-f17f-4dee-ae43-fe37263096f6_2203x1253.png 1272w, https://substackcdn.com/image/fetch/$s_!FTlc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d1d3397-f17f-4dee-ae43-fe37263096f6_2203x1253.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FTlc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d1d3397-f17f-4dee-ae43-fe37263096f6_2203x1253.png" width="1456" height="828" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1d1d3397-f17f-4dee-ae43-fe37263096f6_2203x1253.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:828,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:184491,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://insights.ml4trading.io/i/199282494?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d1d3397-f17f-4dee-ae43-fe37263096f6_2203x1253.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FTlc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d1d3397-f17f-4dee-ae43-fe37263096f6_2203x1253.png 424w, https://substackcdn.com/image/fetch/$s_!FTlc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d1d3397-f17f-4dee-ae43-fe37263096f6_2203x1253.png 848w, https://substackcdn.com/image/fetch/$s_!FTlc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d1d3397-f17f-4dee-ae43-fe37263096f6_2203x1253.png 1272w, https://substackcdn.com/image/fetch/$s_!FTlc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d1d3397-f17f-4dee-ae43-fe37263096f6_2203x1253.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Figure 4. Papers, Chapter 17 examples, and implementation patterns answer different questions. Paper benchmarks support research claims; local ETF examples illustrate evaluation logic; implementation patterns preserve the contract but are not evidence of alpha.</em></p><h2>The implementation contract</h2><p>The evidence boundary is now the main point: papers, Chapter 17 examples, and implementation patterns answer different questions.</p><p>The reusable lesson is not a list of class names. It is the contract an end-to-end allocator has to preserve: what the model needs to see, what it produces, and where downstream constraints belong.</p><p>For a learned allocator, the input contract must carry more than just features. 
It needs feature windows, forward returns, availability masks, volatility-scaling terms, cost inputs, previous weights, and, when used, economic graph structure. The output is not a forecast to be handed to a separate optimizer; it is a target weight vector that must still pass through exposure, leverage, turnover, and backtest checks.</p><p>Each part of that contract corresponds to a failure mode from the papers and examples:</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/rsTRB/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/622b3453-b137-4ec6-a2d2-0ca196997f39_1220x680.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ab90035e-ae60-49b2-b544-bd785ce890b9_1220x838.png&quot;,&quot;height&quot;:400,&quot;title&quot;:&quot;Implementation contract&quot;,&quot;description&quot;:&quot;Design choices an end-to-end allocator has to preserve so portfolio-learning evidence remains auditable.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/rsTRB/1/" width="730" height="400" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>This is the implementation boundary that matters. A DeePM-style allocator is not just a generic sequence model with a Sharpe loss. 
It needs a net-return layer, cost-aware turnover, pooled-objective semantics, optional graph constraints, and hooks for portfolio constraints after the model emits weights.</p><p>At the same time, packaging those mechanisms in software is not evidence of alpha. The evidence still comes from matched protocols, out-of-sample tests, cost stress, regime slices, seed checks, and ablations.</p><h2>Alignment raises the burden of proof</h2><p>The evidence across the Chapter 17 examples is mixed.</p><p>The softmax LSTM shows that direct Sharpe training can underperform simple heuristics. VLSTM shows that a better architecture can improve gross performance and still lose after costs. The reduced DeePM-style ETF example shows that, in this test window, a structured allocator with the SoftMin objective improves Sharpe, drawdown, and crisis-window performance relative to its local controls.</p><p>End-to-end training is not a shortcut around portfolio-construction discipline. It is a way to move more of the trading problem into the learning objective. Once that happens, the validation burden grows. The 0.98 versus 0.69 DeePM comparison is evidence from one ETF test window; it still needs seed, cost, period, and ablation checks before it can support a deployment claim.</p><p>A learned allocator should be judged by a protocol tied to the failure modes above:</p><ul><li><p>Start with heuristic baselines. The softmax LSTM does not beat equal weight or inverse volatility on the ETF window.</p></li><li><p>Report gross and net performance. VLSTM&#8217;s zero-cost tie disappears at 5 bps one-way cost.</p></li><li><p>Show turnover and cost-stress curves. A cost-aware loss can still learn weights that are too expensive out of sample.</p></li><li><p>Slice regimes with a rule defined before looking at results. The DeePM example uses median SPY realized volatility.</p></li><li><p>Report seed sensitivity where the model class makes it material. 
Saly-Kaufmann&#8217;s reduced-seed check is part of the evidence, not a footnote.</p></li><li><p>Use ablations when the claim is architectural. DeePM&#8217;s paper-level result is more credible because the no-SoftMin, no-graph, and graph-only variants are visible.</p></li></ul><p>This is where Chapter 17&#8217;s workflow matters. Classical allocators remain the right default when signals, risk estimates, and constraints can be diagnosed cleanly. End-to-end allocators become more compelling when the objective is hard to factor into a forecast plus an optimizer: cost-aware sizing, regime robustness, or structured cross-asset interaction.</p><p>The practical takeaway from this line of work is not that deep learning beats portfolio heuristics. It is that portfolio-objective training can align the model with the economic target, but only if the architecture, cost model, and evaluation protocol are strong enough to carry that alignment out-of-sample.</p><p>The Chapter 17 examples show the experimental progression, not trading instructions. The implementation lesson is the same throughout: sequence information in, target weights out, with costs, volatility scaling, graph structure, and robust Sharpe treated as first-class parts of the model rather than afterthoughts. Direct portfolio learning is not a replacement for portfolio construction. 
It is portfolio construction moved into the model, which makes the modeling problem more aligned and harder to audit.</p>]]></content:encoded></item><item><title><![CDATA[AI Agents in Finance: A Reading List]]></title><description><![CDATA[What to read before building read-only financial agents that retrieve evidence, use tools, make forecasts, and leave audit trails.]]></description><link>https://insights.ml4trading.io/p/ai-agents-in-finance-a-reading-list</link><guid isPermaLink="false">https://insights.ml4trading.io/p/ai-agents-in-finance-a-reading-list</guid><dc:creator><![CDATA[Stefan Jansen]]></dc:creator><pubDate>Fri, 22 May 2026 13:11:05 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/18d3d935-4c80-43e3-a7d6-95b249af5011_1200x630.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The well-known <a href="https://github.com/dzyim/ilya-sutskever-recommended-reading">Sutskever/Carmack reading list</a> worked because it did not try to be an encyclopedia. It gave readers a path: learn these ideas, and a large part of modern deep learning becomes easier to understand.</p><p>This issue uses the same format for a narrower question: what should you read before building AI agents for financial research and forecasting?</p><p>The answer is not a list of agent frameworks or orchestration libraries. It starts earlier. Core agent design still involves search under limited computation, action under partial observation, tool validity, state, delegation, evaluation, and human supervision. Language models changed the substrate, but not the underlying control problem.</p><p><a href="https://ml4trading.io/third-edition/chapters/24_autonomous_agents/">Chapter 24 of </a><em><a href="https://ml4trading.io/third-edition/chapters/24_autonomous_agents/">Machine Learning for Trading</a></em> implements that view. 
It builds read-only research and forecasting agents that gather evidence, call tools, maintain state, produce probabilities, and leave artifacts that can be replayed, scored, and audited. More specifically, the chapter shows how to build the <a href="https://arxiv.org/abs/2511.07678">Bridgewater AIA Forecasting Agent</a> all the way to <a href="https://ml4trading.io/agent-lab/">live deployment</a>.</p><p>In the chapter, an agent does not place trades. It calls market data APIs, searches for filings and news, retrieves documents, runs calculations, writes structured forecasts, and records what happened. The question is not whether a chatbot can sound like an analyst. The question is whether a workflow can produce useful decision-support artifacts that can be replayed, scored, and governed.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!czP7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8b0faa0-6536-408d-814a-a4a78a325541_2752x1536.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!czP7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8b0faa0-6536-408d-814a-a4a78a325541_2752x1536.jpeg 424w, https://substackcdn.com/image/fetch/$s_!czP7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8b0faa0-6536-408d-814a-a4a78a325541_2752x1536.jpeg 848w, https://substackcdn.com/image/fetch/$s_!czP7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8b0faa0-6536-408d-814a-a4a78a325541_2752x1536.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!czP7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8b0faa0-6536-408d-814a-a4a78a325541_2752x1536.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!czP7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8b0faa0-6536-408d-814a-a4a78a325541_2752x1536.jpeg" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a8b0faa0-6536-408d-814a-a4a78a325541_2752x1536.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:740968,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://insights.ml4trading.io/i/198839659?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8b0faa0-6536-408d-814a-a4a78a325541_2752x1536.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!czP7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8b0faa0-6536-408d-814a-a4a78a325541_2752x1536.jpeg 424w, https://substackcdn.com/image/fetch/$s_!czP7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8b0faa0-6536-408d-814a-a4a78a325541_2752x1536.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!czP7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8b0faa0-6536-408d-814a-a4a78a325541_2752x1536.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!czP7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8b0faa0-6536-408d-814a-a4a78a325541_2752x1536.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Figure 1. Chapter 24 treats financial agents as read-only workflows over evidence, tools, memory, and audit artifacts. 
Order generation and execution require a separate layer of permissions, risk, and controls.</em></p><h2>What finance changes</h2><p>Finance turns a generic agent problem into a time-sensitive evidence problem. The system has to know what was knowable when, where a number came from, which tool produced it, and whether the evidence was available before the forecast or backtest decision.</p><p>Five constraints shape the reading path:</p><ul><li><p><strong>Time</strong>. Filings, prices, macro releases, transcripts, and news have timestamps. An agent must preserve cutoffs rather than mix past and future evidence.</p></li><li><p><strong>Provenance</strong>. Financial evidence is not interchangeable text. A model summary, an SEC filing, an exchange quote, and a scraped article carry different reliability and permission properties.</p></li><li><p><strong>Leakage</strong>. Evaluation can be contaminated by training data, revised data, benchmark overfitting, or hidden access to answers. Finance makes this more dangerous because small information leaks can appear to be forecasting skill.</p></li><li><p><strong>Calibration</strong>. Many outputs are probabilities, not prose. The evaluation question is not only whether the explanation sounds plausible, but whether the forecast is calibrated after resolution.</p></li><li><p><strong>Capital-at-risk boundaries</strong>. Research support, portfolio recommendation, order generation, and execution are different system classes. 
Chapter 24 stays on the research-support side of that boundary.</p></li></ul><h2>A core path for the long weekend</h2><p>Start with these core entries:</p><ol><li><p><a href="https://people.csail.mit.edu/lpk/papers/aij98-pomdp.pdf">Kaelbling, Littman, and Cassandra on POMDPs</a>: financial agents act from belief states, not complete market state.</p></li><li><p><a href="https://cdn.aaai.org/ICMAS/1995/ICMAS95-042.pdf">Rao and Georgeff on BDI agents</a>: beliefs, goals, intentions, and resource-bounded deliberation predate LLM wrappers.</p></li><li><p><a href="https://erichorvitz.com/chi99horvitz.pdf">Horvitz on mixed-initiative interfaces</a>: agents need rules for proceeding, asking, abstaining, and escalating.</p></li><li><p><a href="https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html">Lewis et al. on retrieval-augmented generation</a>: finance needs external, updateable, provenance-bearing knowledge.</p></li><li><p><a href="https://arxiv.org/abs/2112.09332">WebGPT</a>: an early template for search, evidence collection, citation, and answer generation in one loop.</p></li><li><p><a href="https://arxiv.org/abs/2205.00445">MRKL Systems</a>: language models can route to tools, calculators, retrieval systems, and symbolic modules instead of internalizing every operation.</p></li><li><p><a href="https://arxiv.org/abs/2210.03629">ReAct</a>: reason-act-observe is a canonical starting pattern for evidence-grounded agents.</p></li><li><p><a href="https://arxiv.org/abs/2402.01030">CodeAct</a>: executable actions fit technical research workflows better than unconstrained prose.</p></li><li><p><a href="https://arxiv.org/abs/2407.01502">AI Agents That Matter</a>: agent evaluation has to include cost, reproducibility, holdouts, and benchmark overfitting.</p></li><li><p><a href="https://arxiv.org/abs/2406.13352">AgentDojo</a> and the <a href="https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025/">OWASP Top 10 for LLM 
Applications 2025</a>: retrieved content and tool access create security problems that ordinary model benchmarks miss.</p></li><li><p><a href="https://arxiv.org/abs/2409.19839">ForecastBench</a> and <a href="https://arxiv.org/abs/2402.18563">Halawi et al. on language-model forecasting</a>: financial agents need leakage-aware forecasting evaluation, not just impressive rationales.</p></li><li><p><a href="https://arxiv.org/abs/2511.07678">AIA Forecaster</a>, <a href="https://arxiv.org/abs/2508.00828">Finance Agent Benchmark</a>, and <a href="https://arxiv.org/abs/2603.08262">FinToolBench</a>: together they show where agentic financial research works now and where the evidence remains thin.</p></li></ol><p>The rest of the issue gives the broader route. It is organized by design problem rather than by publication date.</p><p>The list deliberately excludes most framework documentation, product announcements, and &#8220;autonomous trading bot&#8221; papers. Frameworks matter in implementation, but they age quickly. The reading path below focuses on more durable design problems: state, tools, retrieval, partial observation, supervision, evaluation, security, forecasting, and governance. Execution agents and order-routing systems are also out of scope, as Chapter 24 remains on the research-support side of the capital-at-risk boundary.</p><h2>The old problems are still the hard problems</h2><p><strong><a href="https://archive.org/details/humanproblemsolv0000newe">Newell and Simon, </a></strong><em><strong><a href="https://archive.org/details/humanproblemsolv0000newe">Human Problem Solving</a></strong></em>. Newell and Simon frame intelligence as search through a structured problem space under bounded computation. 
That framing keeps the central object in view: not a fluent answer, but a process that moves through possible states, operators, and goals.</p><p><strong><a href="https://doi.org/10.1109/TSSC.1968.300136">Hart, Nilsson, and Raphael, &#8220;A Formal Basis for the Heuristic Determination of Minimum Cost Paths&#8221;</a></strong>. The A-star paper makes a point that still holds: search quality depends on how the system allocates its limited computational budget. LLM agents do not escape that constraint. They move it into prompt length, tool calls, branching, reranking, and supervisor passes.</p><p><strong><a href="https://cdn.aaai.org/ICMAS/1995/ICMAS95-042.pdf">Rao and Georgeff, &#8220;BDI Agents: From Theory to Practice&#8221;</a></strong>. BDI gives the literal pre-LLM agent vocabulary: beliefs, desires, and intentions. The vocabulary is older; the design problem remains current. A financial research agent needs a representation of what it believes, what it is trying to answer, which plan it is executing, when to reconsider, and what state must survive between steps.</p><p><strong><a href="https://people.csail.mit.edu/lpk/papers/aij98-pomdp.pdf">Kaelbling, Littman, and Cassandra, &#8220;Planning and Acting in Partially Observable Stochastic Domains&#8221;</a></strong>. Partial observability is central in finance. The agent never sees the full state of the market, the company, or the policy process. It sees filings, quotes, transcripts, news, and partial indicators. The paper makes explicit that agents act from belief states, not from truth.</p><p><strong><a href="https://www-anw.cs.umass.edu/~barto/courses/cs687/Sutton-Precup-Singh-AIJ99.pdf">Sutton, Precup, and Singh, &#8220;Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning&#8221;</a></strong>. The options framework gives a theory of temporally extended actions. Modern systems call them tools, skills, routines, subagents, or workflows. 
The abstraction problem is the same: when should a multi-step behavior be treated as a single action, and which state must be preserved before and after it runs?</p><p><strong><a href="https://websites.nku.edu/~foxr/CSC425/hearsay2.pdf">Erman, Hayes-Roth, Lesser, and Reddy, &#8220;The Hearsay-II Speech-Understanding System&#8221;</a></strong>. Hearsay-II is the classic blackboard architecture: specialized components coordinate through a shared workspace. That pattern keeps returning in planner-executor-reviewer loops, multi-agent debate, research-agent ensembles, and supervisor reconciliation. The same architecture helps explain Chapter 24&#8217;s forecasting pipeline.</p><p><strong><a href="https://mitpress.mit.edu/9780262193160/telerobotics-automation-and-supervisory-control/">Sheridan, </a></strong><em><strong><a href="https://mitpress.mit.edu/9780262193160/telerobotics-automation-and-supervisory-control/">Telerobotics, Automation, and Supervisory Control</a></strong></em>. Sheridan treats autonomy as a control relationship rather than a marketing label. The practical questions are who monitors execution, when control is handed back, and what the human is expected to approve. Those questions apply directly when an agent&#8217;s output can influence capital allocation, research priorities, or a published forecast.</p><p><strong><a href="https://erichorvitz.com/chi99horvitz.pdf">Horvitz, &#8220;Principles of Mixed-Initiative User Interfaces&#8221;</a></strong>. Mixed initiative gives a concrete frame for human-agent work. A financial agent needs rules for when to proceed, when to ask for clarification, when to abstain, and when to escalate. This is not only a user-interface problem. It is part of the risk-control surface.</p><h2>The LLM-era primitives</h2><p><strong><a href="https://arxiv.org/abs/2201.11903">Wei et al., &#8220;Chain-of-Thought Prompting Elicits Reasoning in Large Language Models&#8221;</a></strong>. 
Chain-of-thought showed that eliciting intermediate reasoning can improve multi-step performance. Chapter 24 treats this as a control surface rather than an audit record. In financial agents, the auditable objects are tool calls, observations, state transitions, evidence records, prompts, model versions, policies, and scored outputs. Free-form reasoning text is not enough.</p><p><strong><a href="https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html">Lewis et al., &#8220;Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks&#8221;</a></strong>. RAG is not an agent paper, but finance agents are retrieval-bound. The distinction between parametric memory and external, updateable, provenance-bearing knowledge matters for filings, transcripts, news, research notes, macro releases, and market data.</p><p><strong><a href="https://arxiv.org/abs/2112.09332">Nakano et al., &#8220;WebGPT: Browser-Assisted Question-Answering with Human Feedback&#8221;</a></strong>. WebGPT is an early modern template for a language model that searches, collects evidence, cites sources, and answers with the browser in the loop. For finance, this is the move from static model output to evidence acquisition. The model is no longer only producing text. It is choosing what evidence to retrieve before it answers.</p><p><strong><a href="https://arxiv.org/abs/2205.00445">Karpas et al., &#8220;MRKL Systems&#8221;</a></strong>. MRKL made modularity explicit. The language model routes among tools, symbolic modules, knowledge sources, and external calculators. Chapter 24 uses the same principle in a finance setting: deterministic calculations should be tools, retrieval should carry provenance, and the LLM should not pretend to internalize every operation.</p><p><strong><a href="https://arxiv.org/abs/2210.03629">Yao et al., &#8220;ReAct: Synergizing Reasoning and Acting in Language Models&#8221;</a></strong>. 
ReAct is a canonical starting pattern for evidence-grounded agents: reason, act, observe, repeat. In Chapter 24, the first notebook builds this loop with structured JSON decisions and trace capture. The trace ties stated reasoning to tool calls and observations. The hard audit evidence still consists of the tool invocation, observation, state transition, and stored source.</p><p><strong><a href="https://arxiv.org/abs/2305.10601">Yao et al., &#8220;Tree of Thoughts: Deliberate Problem Solving with Large Language Models&#8221;</a></strong> and <strong><a href="https://arxiv.org/abs/2303.11366">Shinn et al., &#8220;Reflexion: Language Agents with Verbal Reinforcement Learning&#8221;</a></strong>. Tree of Thoughts adds branching and scoring at decision points where premature commitment is costly. Reflexion records post-run lessons that can persist without updating model weights. In finance, both mechanisms need controls. A branch can help compare market hypotheses, and a lesson can improve future behavior. But both need validity horizons, provenance, and pruning rules. Otherwise, a temporary market condition becomes a persistent bias.</p><p><strong><a href="https://arxiv.org/abs/2305.16291">Wang et al., &#8220;Voyager: An Open-Ended Embodied Agent with Large Language Models&#8221;</a></strong>. Voyager is not a finance paper, but its skill-library idea maps well to research agents. A financial operator should not have to rediscover the same data-loading, feature-inspection, or backtest-diagnostic procedures every time. It needs a bounded skill corpus whose behavior can be inspected.</p><p><strong><a href="https://arxiv.org/abs/2402.01030">Wang et al., &#8220;Executable Code Actions Elicit Better LLM Agents&#8221;</a></strong>. CodeAct reframes action as executable code rather than fixed text or JSON. 
Technical research workflows involve many computational actions: querying a registry, reading a Parquet file, computing an IC, running a backtest, or inspecting a result table. Chapter 24&#8217;s research operator follows this direction by providing the model with general tools and a skill corpus.</p><p><strong><a href="https://arxiv.org/abs/2405.15793">Yang et al., &#8220;SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering&#8221;</a></strong>. SWE-agent made the agent-computer interface a first-class variable. That lesson generalizes beyond software engineering. If the environment is hard to inspect, the state is hidden, the tools are poorly named, or errors are hard to recover from, the agent will fail due to interface issues even when the base model is strong.</p><h2>Evaluation and security are part of the system</h2><p><strong><a href="https://arxiv.org/abs/2407.01502">Kapoor et al., &#8220;AI Agents That Matter&#8221;</a></strong>. This paper anchors the evaluation section. It argues that accuracy alone is the wrong target because cost, reproducibility, holdout design, benchmark overfitting, and the needs of downstream developers decide whether an agent works in practice. Finance needs that discipline.</p><p><strong><a href="https://arxiv.org/abs/2308.03688">AgentBench</a>, <a href="https://arxiv.org/abs/2307.13854">WebArena</a>, <a href="https://arxiv.org/abs/2404.07972">OSWorld</a>, and <a href="https://arxiv.org/abs/2310.06770">SWE-bench</a></strong>. These benchmarks shifted evaluation from &#8220;does the model produce the right text?&#8221; to &#8220;can the system change an environment into the target state?&#8221; That shift fits agent evaluation, but it also creates new validity problems. An agent can satisfy a checker without doing the intended work, for example by reading hidden state or exploiting the evaluation harness itself.</p><p><strong><a href="https://arxiv.org/abs/2406.13352">AgentDojo</a></strong>. 
AgentDojo turns indirect prompt injection into an environment problem. The agent must complete assigned work while treating retrieved content as untrusted. That model fits finance, where retrieved documents can contain adversarial instructions, speculative narratives, stale facts, or conflicting claims.</p><p><strong><a href="https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025/">OWASP Top 10 for LLM Applications 2025</a></strong>. OWASP is not an agent paper, but tool-connected LLM systems create security failures that ordinary model evaluation misses. Chapter 24 turns this into engineering controls: least privilege, source allowlists, prompt-injection filters, policy proxies, and logged allow/deny decisions.</p><p>Two papers from May 2026 are recent stress tests, not settled references. <strong><a href="https://arxiv.org/abs/2605.17554">Evaluating Deep Research Agents on Expert Consulting Work</a></strong> assesses deep-research agents on structured analytical deliverables, using verifiers, rubrics, and cognitive traps. Reported acceptance rates are low across frontier systems. <strong><a href="https://arxiv.org/abs/2605.17526">SaaSBench</a></strong> tests long-horizon work in multi-component enterprise software and finds that many failures occur during setup, configuration, and integration before deep business logic is reached. The finance lesson is direct: agent failures are often system failures, not only reasoning failures.</p><h2>The finance branch</h2><p>The finance literature below falls into four groups: broad LLM-in-finance maps, financial-agent benchmarks, forecasting-agent systems, and portfolio or trading-agent architectures. Those are not the same system class. 
A research benchmark, a forecasting assistant, a portfolio-construction committee, and an execution agent require different evidence and controls.</p><p><strong><a href="https://doi.org/10.3905/jpm.2024.1.646">Kong et al., &#8220;Large Language Models for Financial and Investment Management&#8221;</a></strong>. Kong et al. provide a broad investment-management map: retrieval, domain-specific data, task decomposition, evaluation, and deployment constraints. The paper does not reduce finance to sentiment analysis or trading signals. It treats LLMs as part of a workflow that has to respect evidence, timing, and institutional constraints.</p><p><strong><a href="https://arxiv.org/abs/2508.00828">Finance Agent Benchmark</a></strong>. This benchmark adds a concrete constraint. It uses expert-authored financial research tasks that require recent SEC filings and an agentic harness with search and EDGAR access. The dataset has 537 questions across nine task categories. The best-performing model reported achieved 46.8 percent accuracy at an average cost of $3.79 per query. The evidence is concrete, costly, tool-based, and still limited.</p><p><strong><a href="https://arxiv.org/abs/2603.08262">Lu et al., &#8220;FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use&#8221;</a></strong>. Where Finance Agent Benchmark tests financial research tasks over filings and search, FinToolBench tests tool-using financial agents under runnable execution conditions. It pairs 760 executable financial tools with 295 tool-required queries and evaluates not only success, but also timeliness, intent restraint, and regulatory-domain alignment. That maps directly to Chapter 24&#8217;s view of agents as auditable workflows over tools, traces, and policy constraints.</p><p><strong><a href="https://arxiv.org/abs/2402.12659">Xie et al., &#8220;FinBen: A Holistic Financial Benchmark for Large Language Models&#8221;</a></strong>. 
FinBen is a useful checkpoint before discussing agents because it separates financial language tasks from numerical reasoning, forecasting, risk, and decision-making. Static benchmarks do not evaluate full agent behavior, but they reveal where base-model capabilities are thin before an agent loop adds tools, retrieval, and state.</p><p><strong><a href="https://arxiv.org/abs/2402.07862">Schoenegger et al., &#8220;AI-Augmented Predictions&#8221;</a></strong>. This paper serves as a bridge between general agents and forecasting. LLM assistants can improve human forecasting accuracy, but the improvement comes through a decision-support relationship, not through full replacement. That is close to Chapter 24&#8217;s stance: agents gather and organize evidence, but the output still needs scoring, calibration, and supervision.</p><p><strong><a href="https://arxiv.org/abs/2409.19839">Karger et al., &#8220;ForecastBench&#8221;</a></strong>. ForecastBench evaluates forecasts of future events whose answers are not known at submission time. That design directly targets leakage. It also keeps the evaluation unit clear: a probability on a resolvable question, not a compelling narrative about what may happen.</p><p><strong><a href="https://arxiv.org/abs/2402.18563">Halawi et al., &#8220;Approaching Human-Level Forecasting with Language Models&#8221;</a></strong>. Halawi et al. provide the methodological bridge to AIA Forecaster. The system searches for relevant information, generates forecasts, aggregates predictions, and is evaluated against human forecaster baselines. It treats forecasting agents as workflows for retrieval, aggregation, and evaluation.</p><p><strong><a href="https://arxiv.org/abs/2511.07678">Alur et al., &#8220;AIA Forecaster: Technical Report&#8221;</a></strong>. AIA Forecaster is the chapter&#8217;s central reference for the forecasting-agent implementation. It combines agentic search, independent forecasts, supervisor reconciliation, and statistical calibration. 
Its reported results cut both ways: expert-level performance on ForecastBench, weaker performance than market consensus on a harder prediction-market benchmark, and better results when combined with market consensus. That supports decision assistance, not a claim that an LLM forecasts on its own.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AGw2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf35986b-b9dd-455f-9e35-0af7299dd0db_2752x1536.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AGw2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf35986b-b9dd-455f-9e35-0af7299dd0db_2752x1536.jpeg 424w, https://substackcdn.com/image/fetch/$s_!AGw2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf35986b-b9dd-455f-9e35-0af7299dd0db_2752x1536.jpeg 848w, https://substackcdn.com/image/fetch/$s_!AGw2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf35986b-b9dd-455f-9e35-0af7299dd0db_2752x1536.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!AGw2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf35986b-b9dd-455f-9e35-0af7299dd0db_2752x1536.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AGw2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf35986b-b9dd-455f-9e35-0af7299dd0db_2752x1536.jpeg" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/af35986b-b9dd-455f-9e35-0af7299dd0db_2752x1536.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1784554,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://insights.ml4trading.io/i/198839659?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf35986b-b9dd-455f-9e35-0af7299dd0db_2752x1536.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AGw2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf35986b-b9dd-455f-9e35-0af7299dd0db_2752x1536.jpeg 424w, https://substackcdn.com/image/fetch/$s_!AGw2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf35986b-b9dd-455f-9e35-0af7299dd0db_2752x1536.jpeg 848w, https://substackcdn.com/image/fetch/$s_!AGw2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf35986b-b9dd-455f-9e35-0af7299dd0db_2752x1536.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!AGw2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf35986b-b9dd-455f-9e35-0af7299dd0db_2752x1536.jpeg 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><em>Figure 2. Chapter 24 implements forecasting as a supervised evidence workflow: specialists produce independent views, aggregation combines probabilities, debate surfaces contradictions, and the final artifact preserves probability, confidence, caveats, and audit evidence.</em></p><p><strong><a href="https://arxiv.org/abs/2311.13743">Yu et al., &#8220;FinMem: A Performance-Enhanced LLM Trading Agent with Layered Memory and Character Design&#8221;</a></strong> and <strong><a href="https://arxiv.org/abs/2407.06567">Yu et al., &#8220;FinCon: A Synthesized LLM Multi-Agent System with Conceptual Verbal Reinforcement&#8221;</a></strong>. These papers belong on the list as design references for memory and multi-agent financial decision systems. Treat them as architecture references, not as proof of deployable trading edge. 
They expose a design problem: when an agent stores lessons from prior decisions, which lessons are valid enough to persist, and which should be pruned before they become bias?</p><p><strong><a href="https://arxiv.org/abs/2508.11152">Zhao et al., &#8220;AlphaAgents: Large Language Model based Multi-Agents for Equity Portfolio Constructions&#8221;</a></strong> and <strong><a href="https://arxiv.org/abs/2604.02279">Ang, Azimbayev, and Kim, &#8220;The Self Driving Portfolio&#8221;</a></strong>. These papers move from research assistance toward portfolio construction. Chapter 24 reads them cautiously. Role-based analysts, peer critique, investment policy constraints, and supervisor combinations are useful architectural patterns. They do not remove the need for statistical evaluation, transaction-cost modeling, permissions, and operational controls.</p><p><strong><a href="https://arxiv.org/abs/2605.19337">Xia et al., &#8220;Agentic Trading: When LLM Agents Meet Financial Markets&#8221;</a></strong>. This May 2026 survey is best read as a methodological audit, not as a settled taxonomy. It maps 77 LLM-based trading-agent studies and finds that comparable evaluation remains weak: time-consistent splits, transaction-cost assumptions, universe construction, execution semantics, and reproducible artifacts are often missing. That supports Chapter 24&#8217;s conservative boundary: before financial agents influence capital, their evidence, timing, costs, and execution assumptions must be inspectable.</p><p><strong><a href="https://doi.org/10.3905/jpm.2025.1.778">Fabozzi and Lopez de Prado, &#8220;Implementing AI Foundation Models in Asset Management&#8221;</a></strong>. This paper anchors the governance thread. Prompts, retrieval corpora, model versions, and outputs become controlled artifacts once they affect asset management decisions. 
That is why Chapter 24 treats traces and replay as model-risk infrastructure, not just an engineering convenience.</p><p><strong><a href="https://papers.ssrn.com/abstract=5217505">Lopez-Lira, Tang, and Zhu, &#8220;The Memorization Problem&#8221;</a></strong>. Economic forecasting with LLMs has a contamination problem: a model may appear to forecast the past because it has absorbed realized outcomes during training. That makes pre-cutoff evaluation hard to interpret. Chapter 24&#8217;s answer is not to trust narrative claims of forecasting skill. It uses cutoff dates, time-shift tests, event windows, baselines, and post-resolution scoring.</p><p><strong><a href="https://arxiv.org/abs/2507.20957">Lee et al., &#8220;Your AI, Not Your View&#8221;</a></strong>. Lee et al. show that LLMs can carry systematic investment preferences and confirmation bias. Retrieval and tool use do not automatically remove latent model preferences. A financial agent needs stress tests that present the same evidence under different framings and check whether the conclusion changes for the wrong reason.</p>]]></content:encoded></item><item><title><![CDATA[How ML4T uses case studies to test strategies across markets]]></title><description><![CDATA[Nine case studies across seven asset classes, run under one research protocol.]]></description><link>https://insights.ml4trading.io/p/how-ml4t-uses-case-studies-to-test</link><guid isPermaLink="false">https://insights.ml4trading.io/p/how-ml4t-uses-case-studies-to-test</guid><dc:creator><![CDATA[Stefan Jansen]]></dc:creator><pubDate>Tue, 19 May 2026 13:18:16 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8f062fe3-94bd-4727-b2c5-7854d3fc6185_1200x630.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The third edition of <em>Machine Learning for Trading</em> carries <a href="https://www.ml4trading.io/case-studies/">nine case studies:</a> ETFs, crypto perpetuals, NASDAQ-100 microstructure, S&amp;P 
500 equity and option analytics, US firm characteristics, FX pairs, CME futures, direct S&amp;P 500 options, and a broad US equities panel.</p><p>That list matters because the studies are not decorative applications at the end of the book. They show how the same research process performs across different datasets, markets, and trading environments. Each includes around 20 notebooks, from data sourcing to detailed performance analysis.</p><p>They cover different asset classes, frequencies, breadths, cost regimes, and execution problems. Some are monthly. Some are daily. One is intraday. Some are long-only ranking problems. Some are long-short cross-sectional problems. One is a delta-hedged options strategy.</p><p>Model choice rarely decides whether a strategy survives. Label design, cost regime, data construction, breadth, position sizing, and search discipline usually decide it.</p><p>That is what the case studies are built to show.</p><p>The comparison space is intentionally wide. A monthly ETF rotation process, an 8-hour crypto funding trade, a 15-minute NASDAQ-100 signal, a weekly futures ranking problem, and a delta-hedged options strategy do not put pressure on the same parts of the workflow. 
That is why the set is useful: each market makes a different failure mode visible.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xk3Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F979f8369-04c9-4a48-8294-991f5ac9aaa5_2912x1440.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xk3Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F979f8369-04c9-4a48-8294-991f5ac9aaa5_2912x1440.png 424w, https://substackcdn.com/image/fetch/$s_!Xk3Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F979f8369-04c9-4a48-8294-991f5ac9aaa5_2912x1440.png 848w, https://substackcdn.com/image/fetch/$s_!Xk3Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F979f8369-04c9-4a48-8294-991f5ac9aaa5_2912x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!Xk3Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F979f8369-04c9-4a48-8294-991f5ac9aaa5_2912x1440.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xk3Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F979f8369-04c9-4a48-8294-991f5ac9aaa5_2912x1440.png" width="1456" height="720" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/979f8369-04c9-4a48-8294-991f5ac9aaa5_2912x1440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:218125,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://insights.ml4trading.io/i/198401480?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F979f8369-04c9-4a48-8294-991f5ac9aaa5_2912x1440.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Xk3Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F979f8369-04c9-4a48-8294-991f5ac9aaa5_2912x1440.png 424w, https://substackcdn.com/image/fetch/$s_!Xk3Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F979f8369-04c9-4a48-8294-991f5ac9aaa5_2912x1440.png 848w, https://substackcdn.com/image/fetch/$s_!Xk3Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F979f8369-04c9-4a48-8294-991f5ac9aaa5_2912x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!Xk3Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F979f8369-04c9-4a48-8294-991f5ac9aaa5_2912x1440.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><p><em>Figure 1. The point of the set is not just breadth. It is that the same research process is exposed to very different markets, cadences, and execution constraints.</em></p><h2>What becomes comparable across nine very different markets</h2><p>Because the case studies are built on the same discipline, they make more than headline Sharpe ratios comparable. A 21-day ETF label, an 8-hour crypto label, a 5-day futures label, a 15-minute intraday equity label, and a return-to-expiry options label are not interchangeable prediction problems. They encode different holding periods, execution assumptions, and cost burdens.</p><p>The same goes for features and models. Some studies rely on traditional financial features such as momentum, carry, volatility, and the term structure. 
Others add model-based features such as HMM regimes, GARCH volatility, or forecast-model outputs. The model set is also deliberately broad: linear baselines, gradient boosting, tabular deep learning, sequence models, latent-factor models, and, where the setup supports it, causal estimators.</p><p>Just as important, the book forces the post-prediction steps into view. Signals are converted into positions, run through explicit backtests, stress-tested under costs, modified by allocators and risk overlays, and then evaluated on a holdout set. Feature information coefficients (ICs) are measured with heteroskedasticity- and autocorrelation-consistent (HAC) standard errors. Screening uses false-discovery control. Backtests are read with probabilistic and deflated Sharpe analysis, bootstrap intervals, and search-accounting adjustments. The point is not just to report a number, but to say what that number does and does not justify.</p><p>That stack is why the case studies read as research rather than as examples.</p><h2>What changes once the workflow meets real markets</h2><p>In the firm-characteristics study, label treatment materially changes the result. The raw 1-month return label leaves the linear baseline at an IC of about -0.005, with a HAC interval that straddles zero. On the winsorized label, the same linear family moves to about +0.023 and clears zero by a thin margin. GBM is strong in both cases, around +0.080, which is the point: the large change in the linear result came from label treatment, not from swapping ridge for a more expressive architecture.</p><p>In crypto perpetuals, the problem is different. On the primary 8-hour forward-return regression label, only one family leader is clearly credible: NLinear at +0.0293 daily IC with a HAC 95% interval of [+0.0168, +0.0419]. GBM, linear, and TabM all straddle zero on that same primary label. But directional reframings recover the signal for other families. That is not a &#8220;deep learning wins&#8221; story. 
It is a label-and-market-structure story in a small sample with only 19 instruments and two folds.</p><p>In ETFs, the comparison is broad enough to separate prediction quality from strategy quality. The study compares all major model families across a large, liquid cross-asset panel. The editorial point here is not to crown a universal winner. It is to show that the family leader on validation IC need not be the leader once the signal is turned into a strategy. The prediction problem and the portfolio problem are related but not the same.</p><p>In the NASDAQ-100 microstructure study, the signal can be statistically real yet economically fragile. The current rank-1 prediction has a daily IC of around +0.0054, with a HAC interval of [+0.0022, +0.0086], indicating positive directional alignment. But the holdout strategy Sharpe still flips negative, around -1.69, and the strategy trails the equal-weight holdout benchmark. That is a clean example of why &#8220;detected signal&#8221; and &#8220;deployable strategy&#8221; are not synonyms at intraday cadence.</p><p>S&amp;P 500 equity-plus-options analytics make a different point: options can be a useful source of information for stock prediction, but credible validation results still do not settle the execution and holdout questions.</p><p>And in direct S&amp;P 500 options, the cost problem becomes the case study. The workflow uses a dedicated hold-to-maturity cost cascade because standard basis-point grids are the wrong abstraction for the instrument. Even there, the strategy analysis says no statistically resolved edge has been earned yet. That is a useful result. 
It shows what it looks like when the instrument pushes back hard enough that careful modeling is still not enough.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aUYT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecca1e4a-2540-4bf0-936d-e5e3a8d461e3_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aUYT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecca1e4a-2540-4bf0-936d-e5e3a8d461e3_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!aUYT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecca1e4a-2540-4bf0-936d-e5e3a8d461e3_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!aUYT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecca1e4a-2540-4bf0-936d-e5e3a8d461e3_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!aUYT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecca1e4a-2540-4bf0-936d-e5e3a8d461e3_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aUYT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecca1e4a-2540-4bf0-936d-e5e3a8d461e3_2752x1536.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ecca1e4a-2540-4bf0-936d-e5e3a8d461e3_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3319827,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://insights.ml4trading.io/i/198401480?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecca1e4a-2540-4bf0-936d-e5e3a8d461e3_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aUYT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecca1e4a-2540-4bf0-936d-e5e3a8d461e3_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!aUYT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecca1e4a-2540-4bf0-936d-e5e3a8d461e3_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!aUYT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecca1e4a-2540-4bf0-936d-e5e3a8d461e3_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!aUYT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecca1e4a-2540-4bf0-936d-e5e3a8d461e3_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Figure 2. The same research stack repeats across all nine studies. That is what makes the differences interpretable rather than anecdotal.</em></p><h2>Why the case studies matter</h2><p>This is the real reason to read that part of the book closely.</p><p>The case studies do not just show that ML can be applied to many markets. They show how differently the same research stack behaves when the market changes.</p><p>Sometimes label engineering matters more than architecture. Sometimes costs decide the result. Sometimes breadth rescues a weak signal. Sometimes the prediction is credible, but the strategy is not. 
Sometimes the right conclusion is not to deploy, but to narrow the claim, change the horizon, or stop.</p><p>Because the trade definition, label design, model comparison, backtest, cost accounting, and holdout discipline are kept explicit, you can ask better questions when a strategy fails. Was the label wrong? Was the cross-section too narrow? Was the cost regime too severe? Did the signal survive validation but die in holdout? Did the apparent edge disappear once uncertainty and search adjustments were counted?</p><p>And those questions are not asked loosely. The workflow forces them through HAC IC, false-discovery control, probabilistic and deflated Sharpe, bootstrap uncertainty, and search-accounting discipline before a result is allowed to sound stronger than it is.</p><p>That is serious empirical work. And it is much closer to how real quant research feels than a neat parade of winning backtests.</p><p>Read that way, the nine case studies are not a tour of examples. They are the part of the book where the whole argument is exposed to the market and forced to earn its claims.</p><h2>The ML4T library ecosystem, built to support the workflow</h2><p>The software stack behind that process is <a href="https://www.ml4trading.io/libraries/">documented</a> on our website and is live on <a href="https://pypi.org/user/ml4t/">PyPI</a>:</p><ul><li><p><a href="https://github.com/ml4t/data">ML4T Data</a> handles multi-provider acquisition and point-in-time storage; </p></li><li><p><a href="https://github.com/ml4t/engineer">ML4T Engineer</a> builds features, labels, and alternative bars; </p></li><li><p><a href="https://github.com/ml4t/models">ML4T Models</a> adds finance-native latent-factor, SDF, direct-prediction, and portfolio-learning models; </p></li><li><p><a href="https://github.com/ml4t/diagnostic">ML4T Diagnostic</a> covers IC analysis, false-discovery control, Deflated Sharpe, Rademacher, PBO, CPCV, and tearsheets; and </p></li><li><p><a 
href="https://github.com/ml4t/backtest">ML4T Backtest</a> turns signals into event-driven strategy results with explicit execution, risk, and account rules. </p></li></ul><p>Together, they make the case studies reproducible rather than descriptive.</p>]]></content:encoded></item><item><title><![CDATA[Six libraries, one workflow]]></title><description><![CDATA[Six public libraries now cover the reusable parts of the ML4T workflow: data, features, domain-specific models, diagnostics, backtesting, and live trading.]]></description><link>https://insights.ml4trading.io/p/six-libraries-one-workflow</link><guid isPermaLink="false">https://insights.ml4trading.io/p/six-libraries-one-workflow</guid><dc:creator><![CDATA[Stefan Jansen]]></dc:creator><pubDate>Tue, 12 May 2026 16:03:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!izNK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fb7b029-0b9f-4509-aae6-b687f2a4f956_3200x1840.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A notebook can demonstrate a workflow. It rarely makes the workflow reusable. The harder problem is preserving the assumptions that make a result interpretable: data provenance, label construction, validation design, execution semantics, and deployment controls.</p><p>That is the change this spring. 
The ML4T workflow now has a <a href="http://ml4trading.io/libraries">public software layer </a>that readers can inspect and pressure-test, rather than reconstructing everything from scattered notebooks and chapter code.</p><p>Six libraries now carry the main parts of that loop:</p><ul><li><p><code>ml4t-data</code></p></li><li><p><code>ml4t-engineer</code></p></li><li><p><code>ml4t-models</code></p></li><li><p><code>ml4t-diagnostic</code></p></li><li><p><code>ml4t-backtest</code></p></li><li><p><code>ml4t-live</code></p></li></ul><p>The stack is not equally mature; most are in public beta, <code>ml4t-live</code> is still alpha, and <code>ml4t-models</code> is the most recent addition. Even so, the reusable layer is now concrete enough for readers to run, inspect, and break.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!izNK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fb7b029-0b9f-4509-aae6-b687f2a4f956_3200x1840.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!izNK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fb7b029-0b9f-4509-aae6-b687f2a4f956_3200x1840.png 424w, https://substackcdn.com/image/fetch/$s_!izNK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fb7b029-0b9f-4509-aae6-b687f2a4f956_3200x1840.png 848w, https://substackcdn.com/image/fetch/$s_!izNK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fb7b029-0b9f-4509-aae6-b687f2a4f956_3200x1840.png 1272w, 
https://substackcdn.com/image/fetch/$s_!izNK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fb7b029-0b9f-4509-aae6-b687f2a4f956_3200x1840.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!izNK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fb7b029-0b9f-4509-aae6-b687f2a4f956_3200x1840.png" width="1456" height="837" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4fb7b029-0b9f-4509-aae6-b687f2a4f956_3200x1840.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:837,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:182710,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://insights.ml4trading.io/i/197367763?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fb7b029-0b9f-4509-aae6-b687f2a4f956_3200x1840.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!izNK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fb7b029-0b9f-4509-aae6-b687f2a4f956_3200x1840.png 424w, https://substackcdn.com/image/fetch/$s_!izNK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fb7b029-0b9f-4509-aae6-b687f2a4f956_3200x1840.png 848w, 
https://substackcdn.com/image/fetch/$s_!izNK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fb7b029-0b9f-4509-aae6-b687f2a4f956_3200x1840.png 1272w, https://substackcdn.com/image/fetch/$s_!izNK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fb7b029-0b9f-4509-aae6-b687f2a4f956_3200x1840.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>The ML4T library ecosystem is a six-step research-to-production workflow: ml4t-data to ml4t-engineer to ml4t-models on the 
build path, then ml4t-diagnostic to ml4t-backtest to ml4t-live on the prove-and-deploy path, with an iterate-and-redeploy loop back to data.</em></p><h2>From teaching material to professional workflow</h2><p>The six libraries line up with the actual research and deployment sequence:</p><ul><li><p><code>ml4t-data</code> acquires, stores, and refreshes data</p></li><li><p><code>ml4t-engineer</code> builds features, labels, and leakage-safe training inputs</p></li><li><p><code>ml4t-models</code> packages finance-native model families and hands predictions downstream</p></li><li><p><code>ml4t-diagnostic</code> asks whether the signal survives statistical scrutiny</p></li><li><p><code>ml4t-backtest</code> simulates execution under explicit behavioral assumptions</p></li><li><p><code>ml4t-live</code> carries the same strategy surface into shadow, paper, and live operation</p></li></ul><p>That means a reader can now do something much more concrete than &#8220;learn the workflow.&#8221; They can pull a futures or equities panel with <code>ml4t-data</code>, construct features and targets with <code>ml4t-engineer</code>, train or score them with <code>ml4t-models</code>, test IC stability and multiple-testing risk with <code>ml4t-diagnostic</code>, simulate next-bar or quote-aware execution in <code>ml4t-backtest</code>, and then carry the same strategy surface into <code>ml4t-live</code> shadow mode. That is a more concrete workflow than another abstract essay about process.</p><h2>Build</h2><h3><code>ml4t-data</code></h3><p>Every quant workflow begins with data, and data engineering failures often stay invisible until they become expensive. <code>ml4t-data</code> is the acquisition, storage, and refresh layer for the rest of the workflow.</p><p>Its core abstraction is a <code>DataManager</code> that provides a single interface for fetching, storing, updating, and loading data across providers. 
Breadth matters: about 20 provider adapters spanning equities, crypto, futures, FX, macro series, prediction markets, and factor data.</p><p>The more important part is that the package treats data as an ongoing research asset rather than a one-off notebook download, with local Parquet storage, metadata-backed refresh workflows, gap detection, backfills, and validation in the same layer. That is why the futures and the commitment-of-traders (COT) modules matter. They solve recurring workflow problems that simple wrappers usually ignore: bulk futures ingestion, continuous contract construction, and a point-in-time combination of weekly positioning data with market series.</p><h3><code>ml4t-engineer</code></h3><p><code>ml4t-engineer</code> is where raw market data starts becoming something a model can learn from.</p><p>It includes <strong>120 features across 11 categories</strong>, as well as triple-barrier labeling, alternative bars, feature discovery, fractional differencing, preprocessing, and leakage-safe dataset-building utilities. The important design choice is that feature construction, label construction, and ML-ready dataset preparation are all in one package rather than scattered across custom scripts.</p><p>These steps are not independent. Triple-barrier labels, ATR-scaled barriers, volume and dollar bars, tick-imbalance bars, fractional differencing, registry-driven discovery, and train-only preprocessing all change the shape of the learning problem. Treating them as one layer is a workflow choice, not just an API choice.</p><p>The validation posture is also concrete. 
The library shows explicit validation against TA-Lib-compatible features and AFML-style labeling methods, which is the right kind of proof for software that sits directly between market data and model training.</p><h3><code>ml4t-models</code></h3><p><code>ml4t-models</code> is the newest and narrowest of the six, but it has a clear modeling point of view.</p><p>It starts from finance-native contracts: persistent panels, ragged cross-sections, portfolio sequences, structural factor extraction, stochastic discount factor learning, direct asset prediction, and end-to-end portfolio allocation.</p><p>The public surface is correspondingly specific. The library includes latent factor estimators such as PCA, <a href="https://www.sciencedirect.com/science/article/abs/pii/S1062940822001978">Risk-Premium PCA</a> (RPPCA), <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2983919">Instrumented PCA</a> (IPCA), and <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3335536">Conditional Autoencoder</a> (CAE) variants; a <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3350138">stochastic discount factor model</a>; a supervised autoencoder for direct asset prediction; and portfolio-learning models for linear, LSTM, and deeper allocation settings. 
It also includes helpers that pass predictions and weight frames to the backtest and diagnostic layers, rather than treating modeling as an isolated exercise.</p><h2>Prove and deploy</h2><h3><code>ml4t-diagnostic</code></h3><p><code>ml4t-diagnostic</code> is the part of the ML4T stack that asks the hardest question, the one that last-mile research too often postpones: is there a real signal here, or just activity that looked convincing in the sample?</p><p>Its public surface leans into HAC-adjusted information coefficients, purged and combinatorial cross-validation, deflated Sharpe, false-discovery control, PBO, feature selection, structured backtest reporting, and template-based tearsheets.</p><p>Signal validation, statistical corrections, feature diagnostics, and backtest reporting live in one place. That makes it easier to separate prediction-quality problems from portfolio-translation problems and to ask whether an apparent result remains credible once multiple testing, autocorrelation, and leakage risks are accounted for honestly.</p><h3><code>ml4t-backtest</code></h3><p>Backtesting is a crowded field, which is one reason <code>ml4t-backtest</code> needs a clearer claim than &#8220;another framework.&#8221;</p><p>It is an event-driven simulator with explicit execution semantics and parity profiles that make comparisons meaningful rather than vague. The package emphasizes same-bar and next-bar execution modes, quote-aware fills, position-level and portfolio-level risk rules, and profiles spanning common frameworks, plus a conservative, realistic mode.</p><p>It also preserves inspectable artifacts after the run: fills, trades, portfolio state, predictions, and resolved config snapshots. That is what makes the bridge into <code>ml4t-diagnostic</code> reliable.</p><h3><code>ml4t-live</code></h3><p>Last but not least, <code>ml4t-live</code> is one of the more recent additions to the group.</p><p>It extends the workflow into staged operation. 
The same <code>Strategy</code> interface used in <code>ml4t-backtest</code> carries into live or shadow trading with broker adapters, feed adapters, safety controls, reconciliation, preflight checks, and execution journaling.</p><p><code>ml4t-live</code> is built around staged deployment: shadow mode first, then paper trading, then live operation with explicit controls around stale data, position limits, order limits, drawdown limits, and kill-switch persistence. That is a much more honest view of production than pretending a strategy is &#8220;deployed&#8221; once an API key works.</p><h2>Where to start</h2><p>The best starting point depends on the problem you already have.</p><p>If you need repeatable acquisition and updates:</p><ul><li><p><code>ml4t-data</code><a href="https://www.ml4trading.io/docs/data/getting-started/quickstart/"> quickstart</a></p></li><li><p><code>ml4t-data</code><a href="https://www.ml4trading.io/docs/data/providers/"> providers</a></p></li><li><p><code>ml4t-data</code><a href="https://www.ml4trading.io/docs/data/user-guide/incremental-updates/"> incremental updates</a></p></li></ul><p>If you already have data and need ML-ready features and labels:</p><ul><li><p><code>ml4t-engineer</code><a href="https://www.ml4trading.io/docs/engineer/getting-started/quickstart/"> quickstart</a></p></li><li><p><code>ml4t-engineer</code><a href="https://www.ml4trading.io/docs/engineer/user-guide/labeling/"> labeling guide</a></p></li><li><p><code>ml4t-engineer</code><a href="https://www.ml4trading.io/docs/engineer/user-guide/dataset-builder/"> dataset builder</a></p></li></ul><p>If you have signals and want to test credibility:</p><ul><li><p><code>ml4t-diagnostic</code><a href="https://www.ml4trading.io/docs/diagnostic/getting-started/quickstart/"> quickstart</a></p></li><li><p><code>ml4t-diagnostic</code><a href="https://www.ml4trading.io/docs/diagnostic/user-guide/workflows/"> workflows</a></p></li><li><p><code>ml4t-diagnostic</code><a 
href="https://www.ml4trading.io/docs/diagnostic/user-guide/statistical-tests/"> statistical tests</a></p></li><li><p><code>ml4t-diagnostic</code><a href="https://www.ml4trading.io/docs/diagnostic/user-guide/backtest-tearsheets/"> backtest tearsheets</a></p></li></ul><p>If you want to compare execution assumptions:</p><ul><li><p><code>ml4t-backtest</code><a href="https://www.ml4trading.io/docs/backtest/getting-started/quickstart/"> quickstart</a></p></li><li><p><code>ml4t-backtest</code><a href="https://www.ml4trading.io/docs/backtest/user-guide/profiles/"> profiles</a></p></li><li><p><code>ml4t-backtest</code><a href="https://www.ml4trading.io/docs/backtest/user-guide/execution-semantics/"> execution semantics</a></p></li></ul><p>If you want the safest path from backtest to production:</p><ul><li><p><code>ml4t-live</code><a href="https://www.ml4trading.io/docs/live/getting-started/quickstart/"> quickstart</a></p></li><li><p><code>ml4t-live</code><a href="https://www.ml4trading.io/docs/live/user-guide/risk/"> risk guide</a></p></li><li><p><code>ml4t-live</code><a href="https://www.ml4trading.io/docs/live/user-guide/examples/"> examples guide</a></p></li><li><p><code>ml4t-live</code><a href="https://www.ml4trading.io/docs/live/user-guide/operator-guide/"> operator guide</a></p></li></ul><p>If you want to inspect the finance-native model layer:</p><ul><li><p><code>ml4t-models</code><a href="https://www.ml4trading.io/docs/models/"> docs</a></p></li><li><p><code>ml4t-models</code><a href="https://www.ml4trading.io/docs/models/getting-started/quickstart/"> quickstart</a></p></li><li><p><code>ml4t/models</code><a href="https://github.com/ml4t/models"> repo</a></p></li></ul><p>I wanted to start this newsletter run with the libraries because they are the parts that readers can use immediately.</p><p>People in this field already know they need cleaner data pipelines, leak-aware feature work, honest validation, realistic execution, and safer production handoff. 
The question is whether those principles have been turned into reusable software with sufficient structure to improve how people actually work.</p><p>This issue maps the public software layer. Each library is large enough to deserve its own treatment later. For now, the job is simply to make that layer visible.</p><p>Useful feedback starts where these abstractions break against real workflows: provider gaps, labeling edge cases, diagnostics that need different assumptions, execution profiles that do not match a venue, or model contracts that fail on ragged panels.</p>]]></content:encoded></item><item><title><![CDATA[Nine case studies, one end-to-end workflow]]></title><description><![CDATA[Setup, labels, features, models, costs, risk &#8212; what the ML4T strategy research workflow does at every stage.]]></description><link>https://insights.ml4trading.io/p/nine-case-studies-one-end-to-end</link><guid isPermaLink="false">https://insights.ml4trading.io/p/nine-case-studies-one-end-to-end</guid><dc:creator><![CDATA[Stefan Jansen]]></dc:creator><pubDate>Tue, 05 May 2026 16:59:44 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/eb7c58b2-f698-4c1d-9fba-7b4982064af9_1200x630.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The early chapters of <em>Machine Learning for Trading</em>, 3rd edition, introduce a research workflow for systematic strategy development. Chapters 6 through 20 then apply it to nine case studies that span seven asset classes, five forecasting horizons, and frequencies from 8-hourly to monthly. The case studies are concrete, worked examples a reader can pick from &#8212; the one closest to your data, cadence, or asset class. 
This issue walks through what the workflow does at each stage, with pointers to where the libraries that support it are located.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/FCKRK/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f407c497-db6d-41a9-9caa-bd5682e630d8_1220x1166.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f0d75bc-633a-43b2-ba49-e3a9770928ab_1220x1290.png&quot;,&quot;height&quot;:658,&quot;title&quot;:&quot;Case studies at a glance&quot;,&quot;description&quot;:&quot;Nine case studies in chapters 6 to 20 of Machine Learning for Trading, 3rd edition.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/FCKRK/1/" width="730" height="658" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>Each case has its own page at <a href="https://www.ml4trading.io/case-studies/">ml4trading.io/case-studies</a> with pipeline details, related chapters, and links to the GitHub code.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gQjh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadc7a327-61c5-4097-8144-4a12ed732caf_2720x1568.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!gQjh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadc7a327-61c5-4097-8144-4a12ed732caf_2720x1568.jpeg 424w, https://substackcdn.com/image/fetch/$s_!gQjh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadc7a327-61c5-4097-8144-4a12ed732caf_2720x1568.jpeg 848w, https://substackcdn.com/image/fetch/$s_!gQjh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadc7a327-61c5-4097-8144-4a12ed732caf_2720x1568.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!gQjh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadc7a327-61c5-4097-8144-4a12ed732caf_2720x1568.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gQjh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadc7a327-61c5-4097-8144-4a12ed732caf_2720x1568.jpeg" width="1456" height="839" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/adc7a327-61c5-4097-8144-4a12ed732caf_2720x1568.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:839,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2439157,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://insights.ml4trading.io/i/196556682?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadc7a327-61c5-4097-8144-4a12ed732caf_2720x1568.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gQjh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadc7a327-61c5-4097-8144-4a12ed732caf_2720x1568.jpeg 424w, https://substackcdn.com/image/fetch/$s_!gQjh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadc7a327-61c5-4097-8144-4a12ed732caf_2720x1568.jpeg 848w, https://substackcdn.com/image/fetch/$s_!gQjh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadc7a327-61c5-4097-8144-4a12ed732caf_2720x1568.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!gQjh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadc7a327-61c5-4097-8144-4a12ed732caf_2720x1568.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>The ML for Trading workflow that organizes the book &#8212; research loop above the evidence boundary, deployment below, and feedback closing the cycle.</em></p><h2>What the workflow does at each stage</h2><p><strong>Setup</strong> <em>(Chapter 6).</em> Each case begins with an explicit specification: the asset universe, the rebalance cadence, the train/validation/holdout split, the baseline checkpoints the case will measure itself against, and a search-accounting log that records every model trained &#8212; feature inputs, hyperparameters, fold-level metrics, and runtime artifacts. Chapter 6 argues for explicit search accounting as a guard against backtest overfitting: reproducibility is hard to recover once prior runs have disappeared from memory.</p><p><strong>Labels</strong> <em>(Chapter 7).</em> The label is the quantity the model is trained to predict; defining it is a modeling decision, not a downstream encoding of a separate target. Chapter 7 organizes labels into fixed-horizon and variable-horizon families. Fixed-horizon labels are evaluated at a predetermined offset: continuous forward returns over the trading horizon for regression, or discrete state codes &#8212; the sign of the return, a quantile bucket, or exceedance of a volatility-scaled threshold &#8212; for classification. Variable-horizon labels let the realized horizon depend on the path: trend-scanning labels expand the look-forward window until a trend test rejects, and triple-barrier labels resolve when one of a profit target, stop loss, or time limit binds first. 
Horizon and the instrument&#8217;s cost regime at that horizon constrain the choice &#8212; a 21-day forward return on monthly ETFs is a different estimation problem than an 8-hour funding-period return on crypto perpetuals. Label construction primitives, alternative bar samplers, and the overlap-aware sample-weighting they require are provided by <a href="https://github.com/ml4t/engineer">ml4t-engineer</a>.</p><p><strong>Features</strong> <em>(Chapter 8).</em> Engineered features come from <a href="https://github.com/ml4t/engineer">ml4t-engineer</a>: roughly 120 technical indicators across 11 categories &#8212; momentum, volatility, trend, volume, microstructure, and others &#8212; Polars-native and JIT-compiled, with around 60 cross-validated against TA-Lib. Where the asset structure supports them, alternative bar samplers (volume bars, dollar bars, tick-imbalance bars) replace fixed-time bars; microstructure features appear when intrabar data are available.</p><p><strong>Model-based features</strong> <em>(Chapter 9).</em> Features extracted from auxiliary statistical models fit per series, used to encode dynamics that engineered indicators capture only loosely: </p><ul><li><p>Kalman-filtered states and innovations, </p></li><li><p>spectral and path-signature coefficients, </p></li><li><p>ARIMA residuals, </p></li><li><p>GARCH and HAR/rough-volatility estimates, </p></li><li><p>HMM and Wasserstein regime posteriors, </p></li><li><p>fractional-differencing transforms for stationarity, and </p></li><li><p>uncertainty-aware variants of each. 
</p></li></ul><p>The distinction is mechanical rather than thematic &#8212; the feature is an estimated quantity from a fitted model, so its training-time and inference-time computation has to respect the same purged-walk-forward discipline as the prediction model that consumes it.</p><p><strong>Feature evaluation</strong> <em>(Chapter 7, second pass).</em> Before any model is trained, every feature is screened individually &#8212; daily cross-sectional information coefficient, ICIR, HAC-robust standard errors, walk-forward folds. Features that fail the triage screen do not silently carry forward into model training. The diagnostic machinery here lives in <a href="https://github.com/ml4t/diagnostic">ml4t-diagnostic</a>, which also provides the deflated Sharpe ratio, combinatorial purged cross-validation, and the multiple-testing corrections used downstream.</p><p><strong>Model families</strong> <em>(Chapters 11&#8211;15).</em> Each case runs the families that fit its data:</p><ul><li><p>regularized linear models (Chapter 11), </p></li><li><p>gradient-boosted trees and tabular deep learning (Chapter 12), </p></li><li><p>sequence deep learning, including LSTMs, TCNs, and transformers (Chapter 13),</p></li><li><p>latent-factor models, including IPCA and a stochastic-discount-factor specification (Chapter 14), and </p></li><li><p>double machine learning for the cases where confounding is the open question (Chapter 15). </p></li></ul><p>The latent-factor and SDF estimators, together with a conditional-autoencoder model and several end-to-end portfolio-learning architectures, are packaged in <a href="https://github.com/ml4t/models">ml4t-models</a> &#8212; the most recent of the six libraries. 
Hyperparameter search and fold-level evaluation are uniform across families; comparability across cases comes from running the same protocol everywhere, not from picking a per-case favorite.</p><p><strong>Signal-stage backtest</strong> <em>(Chapter 16).</em> Predictions become positions through the backtester provided by <a href="https://github.com/ml4t/backtest">ml4t-backtest</a> &#8212; event-driven with point-in-time correctness, exit-first order processing matching real broker behavior, configurable same-bar or next-bar fills, and quote-aware execution that distinguishes bid, ask, and midpoint sources. The same code path runs the validation backtest and the frozen-holdout backtest, with no leakage between them.</p><p><strong>Portfolio construction</strong> <em>(Chapter 17).</em> Allocator choice is part of the experiment. Equal-weight long-short top-N is the baseline; risk parity, mean-variance with shrinkage, and robust-optimization variants run alongside where the universe supports them. End-to-end portfolio-learning models &#8212; where the allocator is itself learned rather than rule-based &#8212; are part of <a href="https://github.com/ml4t/models">ml4t-models</a>. The point is to isolate how much of any net result comes from the signal and how much comes from the allocator.</p><p><strong>Transaction costs</strong> <em>(Chapter 18).</em> Costs are modeled instrument-by-instrument and calibrated to the level a participant trading the case-study universe would incur, rather than a flat basis-point placeholder applied uniformly. Equity bid-ask half-spreads are derived from quote data with a bottom-quintile discipline; futures roll costs and continuous-contract artifacts are handled at the bar-construction level; FX uses interbank-spread approximations; option strategies use premium-scaled bid-asks sized to the round-trip; per-share commissions enter where the cadence makes them binding. 
Each case reports a sensitivity analysis at multiple cost levels &#8212; a single static cost assumption hides where the strategy actually breaks down. The cost machinery sits on the same execution layer as the backtester.</p><p><strong>Risk overlays</strong> <em>(Chapter 19).</em> Daily-loss caps, drawdown-triggered position cuts, position-size limits, and regime-aware sizing where the data supports an explicit regime layer. The overlays are applied as a separate pass over the cost-aware backtest output rather than fused into the signal stage. Keeping them separate preserves a useful diagnostic distinction: when a result disappoints, you can ask whether the signal had alpha that the cost regime erased, whether the risk overlay was the binding constraint, or whether the overlay never engaged at all.</p><p><strong>Cross-case analysis</strong> <em>(Chapter 20).</em> The synthesis appears in Chapter 20, which treats the nine cases as a single experiment rather than nine independent reports. It examines how well upstream feature-triage diagnostics predict downstream strategy survival, identifies, case by case, where prediction quality, portfolio translation, or execution friction is the binding constraint, and points to the lever each case suggests for the next research iteration. Detailed per-case numbers are the subject of upcoming issues.</p><h2>What&#8217;s coming</h2><p>Coming issues will move case-by-case &#8212; each case&#8217;s binding constraint, the iteration step the evidence suggests, what changed between research cycles, and the open questions left on the table. 
The cross-case synthesis from Chapter 20, individual stages worth a deep dive (feature triage, instrument-specific cost modeling, allocator comparison), and the methods themselves all have their own future issues queued up.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Vtgj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe84c1941-18a3-4248-9192-3f89d50b595a_2944x1440.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Vtgj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe84c1941-18a3-4248-9192-3f89d50b595a_2944x1440.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Vtgj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe84c1941-18a3-4248-9192-3f89d50b595a_2944x1440.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Vtgj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe84c1941-18a3-4248-9192-3f89d50b595a_2944x1440.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Vtgj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe84c1941-18a3-4248-9192-3f89d50b595a_2944x1440.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Vtgj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe84c1941-18a3-4248-9192-3f89d50b595a_2944x1440.jpeg" width="1456" height="712" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e84c1941-18a3-4248-9192-3f89d50b595a_2944x1440.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:712,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2651436,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://insights.ml4trading.io/i/196556682?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe84c1941-18a3-4248-9192-3f89d50b595a_2944x1440.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Vtgj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe84c1941-18a3-4248-9192-3f89d50b595a_2944x1440.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Vtgj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe84c1941-18a3-4248-9192-3f89d50b595a_2944x1440.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Vtgj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe84c1941-18a3-4248-9192-3f89d50b595a_2944x1440.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Vtgj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe84c1941-18a3-4248-9192-3f89d50b595a_2944x1440.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>The research loop from Chapter 6, alongside the live-trading loop. The new case-study iteration agent runs inside the right-hand loop.</em></p><p>A <strong>new agent has also joined the research loop.</strong> It runs its own iterations on each case study &#8212; re-running setup decisions, refining the feature panel, adjusting cost assumptions and risk parameters, and proposing the next experiment to try. Whatever it surfaces worth reporting will land here.</p><p>Per-case detail lives at <a href="https://www.ml4trading.io/case-studies/">ml4trading.io/case-studies</a>. 
The GitHub repo goes live close to launch.</p>]]></content:encoded></item><item><title><![CDATA[Inside the Agent Lab]]></title><description><![CDATA[A live implementation of the AIA Forecaster paper, and what each pipeline stage actually does.]]></description><link>https://insights.ml4trading.io/p/inside-the-agent-lab</link><guid isPermaLink="false">https://insights.ml4trading.io/p/inside-the-agent-lab</guid><dc:creator><![CDATA[Stefan Jansen]]></dc:creator><pubDate>Fri, 01 May 2026 15:40:31 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!g4Sb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb7df520-1c9a-485a-80f6-e627f5435d63_2752x1536.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The <a href="https://ml4trading.io/agent-lab/">Agent Lab</a> currently assigns an 82% probability to the upper bound of the federal funds rate being above 3.00% after the April 2027 FOMC meeting. <a href="https://kalshi.com">Kalshi</a> trades the same question at 47%. On a different question &#8212; whether month-over-month core PCE inflation will print above 0.3% in April 2026 &#8212; the Lab is at 30%, and the market is at 51%, the disagreement running the other way.</p><p>The Agent Lab is our implementation of the <a href="https://doi.org/10.48550/arXiv.2511.07678">AIA Forecaster (Alur et al. 2025)</a> from Bridgewater AIA Labs &#8212; the multi-agent research pipeline that Chapter 24 of the third edition teaches end-to-end. It runs on live prediction-market questions, publishes a probability against each one, and persists every search result, agent trace, and intermediate aggregate to a database so that any run can be replayed against its original evidence.</p><p>This issue walks through what the Lab actually does. 
We are running it as a research experiment alongside the book, not as an institutional product, and the simplifications matter; we will name them as they come up.</p><h2>What you see on a question page</h2><p>The landing page lists featured questions across the US macro calendar &#8212; federal funds rate, core PCE, payrolls, and GDP. Each card shows the market price, the Lab&#8217;s forecast, a one-line agent-derived rationale, and a link to the full dossier.</p><p>The dossier is the part worth seeing. For a given question, it shows:</p><ul><li><p>The distribution of individual agent probabilities for that run.</p></li><li><p>The probability trajectory as new daily forecasts accumulate.</p></li><li><p>The pipeline arithmetic &#8212; agents in, mean probability out.</p></li><li><p>The supervisor&#8217;s synthesis of the evidence with citations to the sources that the agents actually retrieved.</p></li><li><p>Run ID, generation timestamp, the model used for that run, and the number of search calls.</p></li></ul><p>Every number is timestamped and attributable. This is the chapter&#8217;s position made operational: agents as engineering systems for evidence-rich decision support, with replayable traces rather than chat interfaces.</p><h2>From Chapter 24 to running code</h2><p>Chapter 24 walks the design space through ten notebooks: a ReAct loop, tool contracts and explicit state, a research agent, aggregation arithmetic, multi-agent research, adversarial debate, the full forecasting pipeline, and an evaluation/governance pass. The notebooks run deterministically in mock mode for teaching and CI, and switch to live providers and live search when the reader is ready.</p><p>The <code>aia-forecaster</code> repository (available at publication in June) takes the same architecture and runs it against live Kalshi and Polymarket questions. 
What the repo adds beyond the notebooks is the operational infrastructure: a SQLite run log with token-cost telemetry, market connectors with retry and filtering, an evaluation harness against historical resolutions, configuration profiles, and a scheduler for daily pull-and-forecast jobs.</p><h2>How a forecast is built</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g4Sb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb7df520-1c9a-485a-80f6-e627f5435d63_2752x1536.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g4Sb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb7df520-1c9a-485a-80f6-e627f5435d63_2752x1536.jpeg 424w, https://substackcdn.com/image/fetch/$s_!g4Sb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb7df520-1c9a-485a-80f6-e627f5435d63_2752x1536.jpeg 848w, https://substackcdn.com/image/fetch/$s_!g4Sb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb7df520-1c9a-485a-80f6-e627f5435d63_2752x1536.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!g4Sb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb7df520-1c9a-485a-80f6-e627f5435d63_2752x1536.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g4Sb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb7df520-1c9a-485a-80f6-e627f5435d63_2752x1536.jpeg" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db7df520-1c9a-485a-80f6-e627f5435d63_2752x1536.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1715541,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://insights.ml4trading.io/i/196124984?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb7df520-1c9a-485a-80f6-e627f5435d63_2752x1536.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!g4Sb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb7df520-1c9a-485a-80f6-e627f5435d63_2752x1536.jpeg 424w, https://substackcdn.com/image/fetch/$s_!g4Sb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb7df520-1c9a-485a-80f6-e627f5435d63_2752x1536.jpeg 848w, https://substackcdn.com/image/fetch/$s_!g4Sb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb7df520-1c9a-485a-80f6-e627f5435d63_2752x1536.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!g4Sb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb7df520-1c9a-485a-80f6-e627f5435d63_2752x1536.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>The AIA Forecaster pipeline &#8212; from market question to published probability.</em></p><p>For each market on each daily run, the Lab does the following:</p><ol><li><p><strong>Reword the question.</strong> Kalshi market titles are written for traders. The pipeline rewrites each one into a single explicit yes/no question that names the date, threshold, units, and resolution source. &#8220;Will the rate of core PCE inflation be above 0.3% in April 2026?&#8221; becomes the longer &#8220;Will the month-over-month percent change in core PCE be above 0.3% in April 2026, according to the Bureau of Economic Analysis?&#8221; The paper publishes this prompt verbatim in Appendix F. 
It does most of the work of preventing the units-and-time-frame mistakes a fast reader would make.</p></li><li><p><strong>Run the research agents.</strong> Three agents run a ReAct loop in parallel over a search tool. They are <em>identical</em>: same prompt, same temperature, no role specialization. Diversity comes from stochastic sampling, not from prescribed roles. Each agent returns a probability and the evidence chain that produced it. (The paper&#8217;s production configuration uses ten agents on a frontier model; we run three on an open-source model. More on that below.)</p></li><li><p><strong>Aggregate.</strong> The mean of the three agent probabilities is the ensemble forecast.</p></li><li><p><strong>Supervisor pass.</strong> A separate agent reviews the ensemble against the market price and the agents&#8217; rationales. It can override the ensemble &#8212; but only when its confidence in the override is explicitly high. Most of the time, it confirms.</p></li><li><p><strong>Persist.</strong> The rewritten question, the search results, every agent trace, the aggregate, and the supervisor&#8217;s synthesis all land in SQLite, keyed by a run ID. Any run is replayable with its evidence held constant.</p></li></ol><p>The point of step 5 is that nothing the Lab publishes is a black box. If a forecast looks wrong, the question is &#8220;which step produced it&#8221; &#8212; and the answer is in the database.</p><p>A note on two stages that look ordinary on the diagram but carry most of the system:</p><ol><li><p><strong>Search matters.</strong> On a batch of 64 live markets, the paper evaluates the same pipeline with and without search: the Brier score is 0.1002 with search and 0.3609 without. The no-search score is worse than always predicting 50%, which mechanically scores 0.25 (a forecast of 0.5 misses by 0.5 whether the outcome is 0 or 1, and 0.5 squared is 0.25). Each agent&#8217;s iterated search-and-reason loop is what produces the headline result. 
</p></li><li><p><strong>The supervisor is not a judge.</strong> The paper tested a simpler &#8220;best of M&#8221; supervisor that reads the agents&#8217; forecasts and picks the one it considers best, and it lost to the simple mean &#8212; selecting the worst of the M forecasts 7.2% of the time. The agentic supervisor&#8217;s gain comes specifically from running <em>new</em> searches to resolve disagreements, not from re-judging the existing answers. Naive verification underperforms averaging; only verification plus additional evidence beats it.</p></li></ol><h2>What the system is and is not</h2><p>A few honest caveats about the version visitors see today.</p><p><strong>Open-source model by default; three agents, not ten.</strong> The paper&#8217;s headline configuration runs ten agents per question on a frontier model. We run three agents on Qwen 3 (32 billion parameters) &#8212; open-source, locally hosted, free at the margin &#8212; because a daily sweep on a frontier model with the full number of research-agent iterations would cost on the order of $5/market/day. The price is paid in quality: an open-source model produces forecasts that illustrate how the pipeline works, not the numbers the paper produced. Both web search and model quality matter materially. The same pipeline can run on Anthropic&#8217;s Sonnet &#8212; and we do occasionally use it for comparison sweeps &#8212; but the daily schedule uses Qwen.</p><p><strong>It is not built to beat the market.</strong> The paper&#8217;s headline result is that AIA Forecaster matches a human superforecaster panel on <a href="https://arxiv.org/abs/2409.19839">ForecastBench</a> &#8212; Brier score 0.075, statistically indistinguishable. On liquid prediction markets, the system <em>underperforms</em> the market consensus on its own. Instead, it produces independent information that improves on the consensus when combined with it. 
The paper formalizes this with a regression of resolution outcome on (market price, AIA forecast): even on the harder benchmark where AIA loses to consensus on its own, the optimal ensemble assigns roughly a third of its weight to the AIA forecasts, and the combined estimator beats either input alone. The Lab&#8217;s position is the same: a calibrated probability against each question, alongside the market price, not in place of it.</p><p><strong>Calibration is currently switched off.</strong> Language models trained with RLHF tend to hedge toward the middle of the probability scale: even when the evidence supports an extreme forecast, the raw probability tends to be timid. The paper&#8217;s correction is Platt scaling &#8212; a logistic transform applied to each forecast as it is produced. The transform has one coefficient, which the paper sets a priori to &#8730;3, a value drawn from the calibration literature (<a href="https://arxiv.org/abs/2111.03153">Neyman and Roughgarden, 2022</a>) rather than fitted against the authors&#8217; own benchmarks &#8212; a choice they explicitly make to avoid overfitting. With only three agents and open-weight models, the risk for us cuts the other way: extremization can amplify a wrong-side-of-50% forecast into a confidently wrong one. The recent prompt rebuild already produces appropriately confident outputs, so we run with calibration disabled for now.</p><p><strong>The forecasts move in both directions.</strong> On the Fed-rate question in the opener, the Lab is well above the market. On core PCE, it is well below. The system is not built to take a contrarian view; it is built to produce an evidence-backed probability and let disagreement, in either direction, stand or fall at resolution.</p><h2>Read a dossier</h2><p><a href="https://ml4trading.io/agent-lab/">Open the Agent Lab</a>, pick a question where the forecast and the market visibly disagree, and read the dossier end-to-end. 
The agent distribution, the trajectory, the supervisor&#8217;s synthesis, and the search citations are all there. The next Insights issue lands Tuesday.</p>]]></content:encoded></item><item><title><![CDATA[More than 27 chapters]]></title><description><![CDATA[How the five libraries, 112 primers, and 56 agent skills complement the 27 chapters.]]></description><link>https://insights.ml4trading.io/p/more-than-27-chapters</link><guid isPermaLink="false">https://insights.ml4trading.io/p/more-than-27-chapters</guid><dc:creator><![CDATA[Stefan Jansen]]></dc:creator><pubDate>Tue, 28 Apr 2026 14:06:11 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7b2f07b7-11e8-4a55-b19b-8bd1415e62af_1200x630.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The third edition of <em>Machine Learning for Trading</em> ships with 27 chapters (and 400+ notebooks). It also ships with five open-source Python libraries, <a href="https://ml4trading.io/primer/">112 primer articles</a>, and <a href="https://ml4trading.io/skills/">56 agent skills</a> across nine categories. Issue 1 promised to unpack what the companion material does that the chapters alone cannot. This issue is part of that unpacking &#8212; two of the libraries in detail, plus a tour of the primer set. 
Skills and the agent layer that consumes them get their own treatment in forthcoming issues.</p><h2>Five libraries, one workflow</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jgfG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66b0bc17-083d-4439-b113-975ac3337150_3200x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jgfG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66b0bc17-083d-4439-b113-975ac3337150_3200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!jgfG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66b0bc17-083d-4439-b113-975ac3337150_3200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!jgfG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66b0bc17-083d-4439-b113-975ac3337150_3200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!jgfG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66b0bc17-083d-4439-b113-975ac3337150_3200x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jgfG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66b0bc17-083d-4439-b113-975ac3337150_3200x1200.png" width="1456" height="546" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/66b0bc17-083d-4439-b113-975ac3337150_3200x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:546,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:122137,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://insights.ml4trading.io/i/195746689?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66b0bc17-083d-4439-b113-975ac3337150_3200x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jgfG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66b0bc17-083d-4439-b113-975ac3337150_3200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!jgfG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66b0bc17-083d-4439-b113-975ac3337150_3200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!jgfG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66b0bc17-083d-4439-b113-975ac3337150_3200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!jgfG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66b0bc17-083d-4439-b113-975ac3337150_3200x1200.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>The five libraries trace the research-to-production path the book teaches: data acquisition, feature engineering, signal validation, strategy simulation and evaluation, and live deployment. Each library is finance-native &#8212; APIs, semantics, and data contracts tailored to the domain rather than borrowed from a generic ML stack. The chapters establish the methods. The library, its tests, and its validation harness ensure the methods are implemented correctly at scale.</p><p>This issue goes deeper into two areas: data management and the backtest-to-live pair.</p><h2><code>ml4t-data</code>: twenty providers, one interface</h2><p>Most ML-for-trading projects start by writing the same data layer. 
Ten different vendor SDKs with ten different schemas, ad-hoc CSVs that drift out of date, and a notebook full of one-off fetches that may or may not reproduce. <code>ml4t-data</code> provides the data layer that the project should not have to write.</p><p>A single <code>DataManager</code> unifies 20+ provider adapters behind one interface &#8212; the same <code>fetch</code>, <code>load</code>, and <code>update</code> calls regardless of source. Coverage spans 850,000 FRED economic series, 70+ global exchanges via Finnhub, 10,000+ cryptocurrencies via CoinGecko, prediction-market history from Kalshi and Polymarket, academic factor data (Fama-French, AQR), and Databento-backed futures for CME and ICE. Data is stored locally in Hive-partitioned Parquet with metadata tracking and is queryable directly with DuckDB or Polars. CLI commands handle incremental updates, gap detection, and validation against OHLC invariants and anomaly detectors.</p><p>The two modules where lookahead bias is most likely to enter quietly get dedicated treatment. The futures module builds continuous contracts with configurable roll logic for CME and ICE products. The COT (Commitment of Traders) module joins weekly CFTC positioning data to OHLCV under explicit point-in-time semantics &#8212; the join key is the release date, not the date the positions describe &#8212; so the model never sees a Tuesday&#8217;s positions before the Friday they were published.</p><h2><code>ml4t-backtest</code> &#8594; <code>ml4t-live</code>: the same strategy, twice</h2><p>The hardest move in the workflow is the move from a backtested strategy to a live one. The two environments share almost nothing by default &#8212; different data feeds, different fill semantics, different failure modes &#8212; and the bugs that survive the transition are the ones that erase paper-PnL the fastest. 
The backtest and live libraries are designed as a single system that cleanly crosses that boundary.</p><p><code>ml4t-backtest</code> is the simulation engine, and its headline claim is cross-framework parity. Behavioral profiles for Zipline, Backtrader, VectorBT, and LEAN set dozens of knobs to match each target framework exactly, so that you can reproduce either the backtester you know or the behavior of the broker you use. Validated on 250 assets over 20 years, the Zipline profile reproduces 226,723 trades with zero gap and a $10.30 value discrepancy on the final portfolio (0.0014%), running 8&#215; faster than the reference. The point of the parity work is to establish that the engine does what the standard implementations do, so the strategy author can stop worrying about the engine.</p><p><code>ml4t-live</code> takes the same <code>SignalStrategy</code> class, unmodified, into production. <code>SafeBroker</code> enforces 16 risk parameters, including position limits, order limits, daily-loss caps, fat-finger rejection at &#177;5% from market, and asset whitelisting. Shadow mode runs the strategy logic on live data without placing real orders, closing the gap between a passing backtest and a safe live deployment. The kill switch persists its state as atomically written JSON, so a mid-run crash does not leave orphaned positions. The design constraint is zero-rewrite migration: the code that passed the backtest is the code that runs the broker.</p><h2>112 primers: a menu of choice</h2><p>The second edition carried its own prerequisites. Chapters had to stop and introduce hypothesis testing, linear regression, and basic time-series mechanics. That crowded out pages for the trading applications the book was actually about. 
The third edition moves that material into primers and gives the chapters back to trading.</p><p>The result is a menu you can choose from along two axes:</p><ol><li><p><strong>Level of preparation:</strong> 20 foundational primers, 66 intermediate, 26 advanced.</p></li><li><p><strong>Topic:</strong> 24 primers are cross-chapter; the other 88 provide background on, or expand on, individual chapters. For example, Ch 9 on model-based features has 11 primers, Ch 17 on portfolio construction has 8, and Ch 14 on latent factors has 7.</p></li></ol><p>264 unique research citations back the set. A few primers to show the range:</p><ul><li><p><strong>&#8220;Volatility: Realized, Implied, and Why It Clusters&#8221;</strong> (Cross-chapter, foundational). Three distinct volatility objects &#8212; realized, implied, and conditional &#8212; and why they are not interchangeable. Square-root annualization explained. Volatility clustering as a potential regime marker, not an artifact of noisy estimation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KNMT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8600bd-56c2-46b3-bb05-95f8da5a76ae_1784x881.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KNMT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8600bd-56c2-46b3-bb05-95f8da5a76ae_1784x881.png 424w, https://substackcdn.com/image/fetch/$s_!KNMT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8600bd-56c2-46b3-bb05-95f8da5a76ae_1784x881.png 848w, 
https://substackcdn.com/image/fetch/$s_!KNMT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8600bd-56c2-46b3-bb05-95f8da5a76ae_1784x881.png 1272w, https://substackcdn.com/image/fetch/$s_!KNMT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8600bd-56c2-46b3-bb05-95f8da5a76ae_1784x881.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KNMT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8600bd-56c2-46b3-bb05-95f8da5a76ae_1784x881.png" width="1456" height="719" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f8600bd-56c2-46b3-bb05-95f8da5a76ae_1784x881.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:719,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:98211,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://insights.ml4trading.io/i/195746689?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8600bd-56c2-46b3-bb05-95f8da5a76ae_1784x881.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KNMT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8600bd-56c2-46b3-bb05-95f8da5a76ae_1784x881.png 424w, 
https://substackcdn.com/image/fetch/$s_!KNMT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8600bd-56c2-46b3-bb05-95f8da5a76ae_1784x881.png 848w, https://substackcdn.com/image/fetch/$s_!KNMT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8600bd-56c2-46b3-bb05-95f8da5a76ae_1784x881.png 1272w, https://substackcdn.com/image/fetch/$s_!KNMT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8600bd-56c2-46b3-bb05-95f8da5a76ae_1784x881.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>SPY daily log returns and 20-day rolling annualized realized volatility, 2018&#8211;2024. Long calm stretches near 10&#8211;15% punctuated by short clusters &#8212; late 2018, the COVID spike to ~94%, the 2022 sell-off, and the 2023 banking episodes. Roughly 5% of days exceed 30% annualised volatility. From the primer <a href="https://ml4trading.io/primer/00/03-volatility/">&#8220;Volatility: Realized, Implied, and Why It Clusters&#8221;</a>.</em></p></li><li><p><strong>&#8220;Random Matrix Theory for PCA in Finance&#8221;</strong> (Ch 14, advanced). The Marchenko&#8211;Pastur law provides a null benchmark for the eigenvalue distribution of a sample covariance or correlation matrix in the absence of structure. In finance, eigenvalues that exceed the upper edge of the Marchenko&#8211;Pastur bulk (the range of eigenvalues that finite-sample estimation noise alone can generate) are often interpreted as candidate signal components, while eigenvalues inside the bulk are treated as noise-dominated. This gives a practical, though assumption-dependent, way to decide how many principal components to retain and motivates covariance cleaning and eigenvalue shrinkage methods.</p></li><li><p><strong>&#8220;Temporal-Difference Learning and Bellman Equations&#8221;</strong> (Ch 21, intermediate). 
The foundation for value-based RL methods used in applications such as execution and hedging includes the Bellman equations, value iteration, TD(0), the bias&#8211;variance trade-off relative to Monte Carlo estimation, and the transition from tabular control methods such as Q-learning and SARSA to neural variants such as DQN and Double DQN.</p></li></ul><h2>What&#8217;s next</h2><p>Friday&#8217;s issue opens the <a href="https://www.ml4trading.io/agent-lab/">Agent Lab</a> &#8212; our implementation of the multi-agent research pipeline based on Bridgewater&#8217;s <a href="https://arxiv.org/html/2511.07678v1">AIA Forecaster</a> that Chapter 24 teaches &#8212; running daily on live questions from Kalshi and Polymarket. The 56 agent skills and the agent layer that orchestrates the full research-to-production workflow will be included in a later issue.</p>]]></content:encoded></item><item><title><![CDATA[What changed in six years, and what didn't]]></title><description><![CDATA[A survey of what's new in the 3rd ed of ML for Trading &#8212; generative AI, autonomous agents, causal ML, nine case studies, five libraries &#8212; and what didn't change.]]></description><link>https://insights.ml4trading.io/p/what-changed-in-six-years</link><guid isPermaLink="false">https://insights.ml4trading.io/p/what-changed-in-six-years</guid><dc:creator><![CDATA[Stefan Jansen]]></dc:creator><pubDate>Fri, 24 Apr 2026 12:05:49 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a3816396-52ae-4dee-8d21-4a84d0f87e84_1200x630.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The second edition of <em>Machine Learning for Trading</em> shipped on 31 July 2020. Two months earlier, OpenAI posted the GPT-3 paper to arXiv. The third edition ships in June 2026. The second edition added a few early deep-learning applications to the first (December 2018). 
The second-to-third gap is much larger because it spans six of the most consequential years AI and ML have ever seen.</p><p>The field moved. Two questions the book tries to answer did not:</p><ol><li><p>How to develop a trading strategy end-to-end; the book now includes nine case studies from equities and futures to ETFs, FX, and crypto, with holding periods from minutes to months.</p></li><li><p>How to evaluate a strategy without fooling yourself with a plausible-looking backtest; the new <a href="https://ml4trading.io/libraries/ml4t-diagnostic/">ml4t-diagnostic</a> library ships state-of-the-art overfitting guards &#8212; from the Deflated Sharpe ratio to the Rademacher Anti-Serum &#8212; and the chapters cover process discipline and the relevant tests in detail.</p></li></ol><h2>How the landscape changed</h2><p><strong>Generative AI and autonomous agents are rapidly becoming part of the research workflow.</strong> Three new chapters respond directly:</p><ol><li><p>Retrieval-augmented generation for financial research (Ch 22),</p></li><li><p>Knowledge graphs (Ch 23), and</p></li><li><p>Autonomous agents (Ch 24).</p></li></ol><p>Alongside these three, Chapter 10 compresses the second 
edition&#8217;s three NLP chapters &#8212; sentiment, topic modeling, word embeddings &#8212; into a single chapter organized around transformer-based embeddings as a pipeline stage. Topic modeling and word2vec mostly drop out.</p><p><strong>Deep learning diversified, then dispersed.</strong> The second edition had a dedicated six-chapter deep-learning part; the third has none. The deeper shift is that finance has begun to develop its own domain-specific architectures, from latent-factor models to end-to-end portfolio learning, rather than importing deep learning from other domains unchanged. The material now travels with the application it serves.</p><ul><li><p>GANs and diffusion models are used for synthetic data (Ch 5).</p></li><li><p>Transformers support the text feature pipeline (Ch 10).</p></li><li><p>Tabular DL sits alongside gradient boosting (Ch 12).</p></li><li><p>Sequence models land in Chapter 13.</p></li><li><p><a href="https://doi.org/10.1093/rfs/hhaa009">Gu, Kelly, and Xiu&#8217;s 2019 conditional autoencoder</a> and <a href="https://doi.org/10.48550/arXiv.1904.00745">Chen, Pelger, and Zhu&#8217;s 2021 deep-learning stochastic discount factor</a> anchor the latent-factor chapter (Ch 14).</p></li><li><p>End-to-end portfolio learning sits in Chapter 17.</p></li><li><p>Deep reinforcement learning, with three concrete applications &#8212; optimal execution, market making, and deep hedging &#8212; stays in Chapter 21.</p></li></ul><p><strong>Chapter 13 takes a deliberately skeptical view of deep learning for time series.</strong> It is harder to extract value from off-the-shelf foundation models on financial data than in other domains. The chapter&#8217;s cross-dataset rollup asks where deep learning actually lands on the curve when LSTMs, TCNs, attention variants, and a foundation model are run across the case-study datasets. 
Deep learning is a tool with specific strengths, not a blanket replacement.</p><p><strong>Three additions at the chapter and section levels.</strong> Causal analysis and conformal prediction have continued to gain importance:</p><ul><li><p>Causal machine learning (Ch 15) is a new chapter: Pearl-style identification, double ML for isolating factor effects, Bayesian structural time series, and time-series causal discovery.</p></li><li><p>Conformal prediction is now a standard pipeline stage in Chapter 11, not an advanced topic.</p></li></ul><p>Both matter more now than at any earlier point because LLMs and agents have collapsed the cost of generating plausible-looking hypotheses; formal robustness checks are the counterweight.</p><p>Chapter 9 adds a new perspective: ARIMA, GARCH, spectral, regime-switching, and Bayesian time-series models are treated as feature extractors for a downstream predictor rather than as standalone forecasters.</p><p><strong>Operational reality moved from the edges of the book to the center.</strong> From strategy implementation to deployment, details matter in practice:</p><ul><li><p>Chapter 18 is a dedicated chapter on transaction costs &#8212; taxonomy, microstructure-regime link, Almgren&#8211;Chriss as the unifying framework, and the guardrails for when costs kill a strategy.</p></li><li><p>Chapter 19 is dedicated to risk management &#8212; VaR and CVaR, path risk, stress testing, adaptive controls without leakage, and kill switches.</p></li><li><p>Chapter 25 covers live trading through Interactive Brokers, Alpaca, and QuantConnect.</p></li><li><p>Chapter 26 covers MLOps and governance.</p></li></ul><p>None of the four had a counterpart in the second edition. 
The backtrader and zipline backtesters have been replaced by <a href="https://ml4trading.io/libraries/ml4t-backtest/">ml4t-backtest</a>, and we also demonstrate vectorized alternatives like <a href="https://vectorbt.dev/">vectorBT</a>.</p><p><strong>Two foundation-level additions.</strong></p><ol><li><p>Market microstructure gets its own chapter (Ch 3): tick, volume, and dollar bars as information-driven sampling, limit-order-book reconstruction, continuous-futures construction.</p></li><li><p>Synthetic financial data moved from an advanced topic in 2E Chapter 21 to a foundation chapter (Ch 5), and broadened well beyond GANs to include Monte Carlo baselines, diffusion models, LLM-based structured-data synthesis, and an explicit fidelity&#8211;utility&#8211;privacy evaluation framework.</p></li></ol><p><strong>Data and infrastructure caught up.</strong> Polars replaces pandas across notebooks where the migration is worthwhile. Commercial data sources sit alongside free ones, because free data has become increasingly rare over the past six years, and has serious limitations. Crypto is more central, and platforms like Alpaca make it materially easier to move from research prototype to paper trading and then to small-scale live execution than it was in 2020. Prediction markets &#8212; Kalshi, Polymarket &#8212; appear to be a new research frontier.</p><h2>What didn&#8217;t change</h2><p>The constant is process discipline. If anything, the third edition gives it more weight than the second.</p><p><strong>Backtesting is one stage in a research pipeline, not the finish line.</strong> The book breaks the research-to-deployment arc into dedicated chapters rather than a single chapter on simulation. Chapter 16 is the simulation stage: the <a href="https://ml4trading.io/libraries/ml4t-backtest/">ml4t-backtest</a> library, event-driven and vectorized modes, walk-forward with purging and embargo. 
Chapter 17 is portfolio construction: equal-weight and risk parity as hard benchmarks, the Markowitz curse, hierarchical risk parity, regime-adaptive allocation without discrete switching, and end-to-end portfolio learning. Chapter 18 handles costs, Chapter 19 handles risk, and Chapter 20 synthesizes across the nine case studies &#8212; reporting what generalized, what didn&#8217;t, and what was deliberately left on the table.</p><p><strong>Statistical discipline is threaded through the chapters.</strong> The anchor references organize the content of Chapters 7, 11, 16, and 20:</p><ul><li><p>Deflated Sharpe ratio &#8212; <a href="https://doi.org/10.2139/ssrn.2460551">Bailey and L&#243;pez de Prado (2014)</a></p></li><li><p>Rademacher Anti-Serum &#8212; Paleologo (2025), <em>Elements of Quantitative Investing</em>, &#167;8.3</p></li><li><p>Purged, embargoed, and combinatorial cross-validation &#8212; L&#243;pez de Prado (2018), <em>Advances in Financial Machine Learning</em></p></li><li><p>Probability of Backtest Overfitting &#8212; <a href="https://doi.org/10.2139/ssrn.2326253">Bailey, Borwein, L&#243;pez de Prado, and Zhu (2015)</a></p></li><li><p>Multiple-testing corrections in factor research &#8212; <a href="https://doi.org/10.1093/rfs/hhv059">Harvey, Liu, and Zhu (2016)</a></p></li><li><p>Conformal prediction &#8212; the Vovk, Gammerman, and Shafer lineage</p></li></ul><p><strong>Hands-on implementation remains front and center and has grown substantially in scope and scale.</strong> The third edition is built around <strong>nine case studies</strong> across asset classes and frequencies:</p><ul><li><p>ETFs</p></li><li><p>Broad US equities</p></li><li><p>US firm characteristics</p></li><li><p>NASDAQ-100 microstructure on minute-bar TAQ data</p></li><li><p>S&amp;P 500 equities joined with options analytics</p></li><li><p>S&amp;P 500 options as a volatility strategy</p></li><li><p>CME futures</p></li><li><p>FX majors</p></li><li><p>Crypto perpetuals, with funding as a 
structural signal</p></li></ul><p>Roughly 170 case-study notebooks carry each case through the same pipeline stages &#8212; setup, labels, features, model-based features, evaluation, linear, GBM, tabular DL, sequence DL, latent factors, causal, backtest, portfolio construction, costs, risk, synthesis. Cross-case rollups appear at the ends of the model chapters and in a dedicated synthesis chapter.</p><p>The second edition taught, model by model, on different datasets. The third edition teaches one pipeline across nine datasets, with explicit protocols for reporting across cases. The cross-case grid is the clearest pedagogical difference between the editions.</p><h2>More than &#8216;just a book&#8217;: 450+ notebooks, 100+ primers, 56 agent skills, and five libraries</h2><p>The third edition ships with roughly 450 notebooks, <a href="https://ml4trading.io/primer/">over one hundred primer articles</a>, <a href="https://ml4trading.io/skills/">56 agent skills</a> across nine categories (concepts, data, features, validation, backtest, portfolio, production, advanced AI, workflows), and five open-source Python libraries:</p><ol><li><p><a href="https://ml4trading.io/libraries/ml4t-data/">ml4t-data</a> &#8212; sourcing, validation, and point-in-time data pipelines.</p></li><li><p><a href="https://ml4trading.io/libraries/ml4t-engineer/">ml4t-engineer</a> &#8212; feature and label engineering with 120+ financial indicators.</p></li><li><p><a href="https://ml4trading.io/libraries/ml4t-diagnostic/">ml4t-diagnostic</a> &#8212; model evaluation, overfitting guards, and uncertainty quantification.</p></li><li><p><a href="https://ml4trading.io/libraries/ml4t-backtest/">ml4t-backtest</a> &#8212; event-driven and vectorized strategy simulation with walk-forward controls.</p></li><li><p><a href="https://ml4trading.io/libraries/ml4t-live/">ml4t-live</a> &#8212; broker adapters for live execution (Interactive Brokers, Alpaca, QuantConnect).</p></li></ol><p>The agent skills exist 
because coding agents increasingly participate in implementation. A skill encodes the canonical approach to a specific task &#8212; a walk-forward split with purging and embargo, a deflated Sharpe check on a set of backtests, a cost-sensitivity sweep &#8212; in a form a reader&#8217;s agent can consume without reinventing it. The book carries the argument; the skill shortens the distance between the argument and a correct implementation when an agent does the typing.</p><p>The <a href="https://ml4trading.io/agent-lab/">Agent Lab</a> on <a href="https://ml4trading.io">ml4trading.io</a> is our implementation of Bridgewater&#8217;s <a href="https://doi.org/10.48550/arXiv.2511.07678">AIA Forecaster</a> &#8212; the multi-agent research pipeline that Chapter 24 shows how to build. It publishes Platt-calibrated probabilities on live Kalshi and Polymarket questions.</p><h2>About <em>Insights</em></h2><p>This is Issue 1 of <em>Insights</em>, a twice-weekly letter running through the June launch and after. Each issue takes one claim from the book &#8212; a library, a primer article, an agent skill, a case study, or a new paper &#8212; and goes deeper than the book alone has room for. The next issue covers what the five libraries, the primer set, and the 56 skills do that 27 chapters alone cannot. The one after opens the Agent Lab.</p><p>If you subscribed to the second-edition list: welcome back.</p>]]></content:encoded></item></channel></rss>