Trading Volume Alpha in Portfolio Optimization
Trading Volume Alpha in Portfolio Optimization
Ruslan Goyenko
Bryan T. Kelly
Tobias J. Moskowitz
Yinan Su
Chao Zhang
We are grateful to the Columbia & RFS AI in Finance Conference participants and discussant
Dmitriy Muravyev; seminar participants at Cornell, Syracuse, CityU Hong Kong, and George
Mason; as well as Martin Lettau, Lu Lu, Andrew Patton, and Annette Vissing-Jorgensen for
valuable comments and suggestions. We thank Zhongji Wei and Andy Yang for their excellent
research assistance. AQR Capital Management is a global investment management firm, which
may or may not apply similar investment techniques or methods of analysis as described herein.
Moskowitz is a member of the NBER, has an academic consulting relationship with AQR Capital,
and sits on the board of Commonfund. The views expressed here are those of the authors and not
necessarily those of AQR or the National Bureau of Economic Research. Send correspondence to
Bryan Kelly, [Link]@[Link].
NBER working papers are circulated for discussion and comment purposes. They have not been
peer-reviewed or been subject to the review by the NBER Board of Directors that accompanies
official NBER publications.
© 2024 by Ruslan Goyenko, Bryan T. Kelly, Tobias J. Moskowitz, Yinan Su, and Chao Zhang. All
rights reserved. Short sections of text, not to exceed two paragraphs, may be quoted without explicit
permission provided that full credit, including © notice, is given to the source.
Trading Volume Alpha
Ruslan Goyenko, Bryan T. Kelly, Tobias J. Moskowitz, Yinan Su, and Chao Zhang
NBER Working Paper No. 33037
October 2024
JEL No. C45, C53, C55, G00, G11, G12, G17
ABSTRACT
Portfolio optimization focuses on risk and return prediction, yet implementation costs critically
matter. Predicting trading costs is challenging because costs depend on trade size and trader
identity, thus impeding a generic solution. We focus on a component of trading costs that applies
universally – trading volume. Individual stock trading volume is highly predictable, especially with
machine learning. We model the economic benefits of predicting volume through a portfolio
framework that trades off tracking error versus net-of-cost performance – translating volume
prediction into net-of-cost alpha. The economic benefits of predicting individual stock volume are
as large as those from stock return predictability.
Tobias J. Moskowitz
Yale School of Management
165 Whitney Avenue
P.O. Box 208200
New Haven, CT 06520-8200
and NBER
[Link]@[Link]
1 Introduction
Research on portfolio optimization chiefly focuses on mean return prediction, and to a lesser extent,
variance-covariance prediction.1 However, real-world implementation costs also play a critical role
in the efficacy of portfolios. While the benefits and pitfalls of mean return and volatility forecasts are
well-studied in the literature, trading costs have received relatively little attention, and forecasting
Predicting trading costs, particularly at the individual stock level, is challenging since the
biggest component of these costs for a large investor is price impact, which depends on the size of
the trade and the amount traded by other traders in that security, as well as the identity of the
trader (Frazzini, Israel, and Moskowitz, 2018).3 Since each trader may face their own cost function,
finding a generic solution to this piece of the portfolio problem is challenging. Moreover, the size
of the trade is endogenously a function of expected trading costs, finding an estimate of expected
trading costs by focusing on the component of costs that is neither trader-specific nor endogenous
– the level of total trading volume in the stock. Trading volume is driven by other traders in the
market for the same security, which is generic to all traders. As Frazzini, Israel, and Moskowitz
(2018) show, trade size divided by daily trading volume – termed the market participation rate in a
stock – is the key driver of price impact costs. Following Kyle (1985), price impact is an increasing
function of participation rate. Holding trade size constant, the less trading volume in the stock the
greater the trader’s price impact will be.4 Our empirical strategy is to forecast trading volume for
1
A very long literature in asset pricing focuses on return prediction, with summaries on the state of this literature,
including some of its criticisms, found in Harvey, Liu, and Zhu (2016); McLean and Pontiff (2016); Jensen, Kelly, and
Pedersen (2022). A good summary of the literature on volatility prediction can be found in Engle (2004).
2
Several papers focus on trading costs from a theoretical perspective: Kyle (1985); Gârleanu and Pedersen (2016).
Even less work has provided empirical estimates of trading costs for use in a portfolio optimization context: Frazzini,
Israel, and Moskowitz (2012, 2018).
3
While many different trading cost models exist, they universally contain these three elements: trade size, market
size, and trader specifics (identify, information, motive, patience, etc.).
4
Theoretical foundations of trading costs in the classic market microstructure literature include asymmetric infor-
mation (Kyle, 1985; Glosten and Milgrom, 1985) and inventory costs (Stoll, 1978; Ho and Stoll, 1983; Grossman and
Miller, 1988). In either case, volume is the major determinant of liquidity (Benston and Hagerman, 1974; Glosten
and Harris, 1988; Brennan and Subrahmanyam, 1995) and is used to proxy for trading costs or liquidity (Campbell,
Grossman, and Wang, 1993; Datar, Naik, and Radcliffe, 1998; Amihud, 2002).
2
each security as a proxy for expected trading costs, and then use this forecast to optimize portfolios
Importantly, our aim is to provide a general forecast of costs for an individual stock that
abstracts from the need to specify trade size and applies universally across all traders. Our goal
is not to provide the “best” or most reliable trading cost model or forecast. Rather, our simple
objective is to provide a forecast of trading costs that any trader could use. This aim necessitates
a simple, rather than sophisticated, trading cost model. The goal is to focus on one component of
a trading cost model that is generic and see how valuable that can be.
into a portfolio theory problem. We model a portfolio framework that seeks to maximize the
net-of-cost performance of the portfolio using a mean-variance utility function, where the cost of
transacting scales linearly with participation rate (motivated by theory and empirical work from
the literature). The optimization trades off the cost of trading versus the (opportunity) cost of
not trading – minimizing trading costs versus minimizing tracking error to the before-cost optimal
portfolio – where trading costs and tracking error are endogenously negatively related. In the
model, we take the first and second moments of security returns as given and focus solely on the
tradeoff between trading costs and tracking error to the pre-cost optimal portfolio.
Using this framework, we can couch the volume prediction problem into a portfolio problem
and quantify the economic value of volume prediction in terms of its benefit to after-cost returns
or Sharpe ratio.5 In essence, we translate trading volume predictability into portfolio alpha, which
we term “trading volume alpha.” This translation opens up after-cost portfolio modeling, which
has been restricted due to limited access to trading cost data, to the widely available volume data.
This paper demonstrates the economic value of volume prediction with one set of predictors and
standard neural network implementation. Yet the framework laid out in the paper illustrates the
importance of and enables more work on constructing volume prediction signals and models.6
We impose a functional form for trading costs due to price impact motivated by theory (Kyle,
5
Related, Balduzzi and Lynch (1999) and Çetin, Jarrow, and Protter (2004) model transaction costs in portfolio
settings and study the economic value in various transaction cost models.
6
This research answers the call to integrate market microstructure frictions into asset pricing studies using machine
learning tools (Goldstein, Spatt, and Ye, 2021).
3
1985) and empirical evidence (Frazzini, Israel, and Moskowitz, 2018), where price impact (Kyle’s
Lambda) is an increasing linear function of the trader’s participation rate. All else equal, a higher
predicted volume allows the trader to trade more aggressively (larger size) because price impact per
dollar traded will be lower. Conversely, a lower predicted volume causes the trader to scale back
the trade (even perhaps to zero and not trade) because the price impact per dollar will be higher.
By modeling trading costs and benefits as a tracking error problem, we abstract from return or
variance prediction and focus exclusively on volume prediction and its economic impact through
An interesting feature emerges from the model that generates an asymmetry in the economic
value of volume prediction. Price impact costs are linear in participation rate, but non-linear
in trading volume. Very low trading volume implies exponentially high impact costs, whereas
very high volume implies negligible costs. As volume tends to zero, price impact costs approach
infinity, whereas when volume becomes large, impact costs are bounded by zero. Hence, predicted
changes in volume have much more economic impact when volume is low versus high, thus creating
asymmetric costs of volume forecast errors. Conversely, tracking error, or the opportunity cost of
not trading, is independent of volume. The combination of these two effects implies the optimization
will penalize overestimating volume more than underestimating volume. Trading too much when
you overestimate volume is more costly than trading too little when you underestimate it. At lower
volume, the cost of trading with respect to volume is very steep, and at high volume the cost-volume
relationship is flat. Intuitively, an illiquid stock’s price impact is very sensitive to small changes
in volume and in a highly non-linear way – participation rates move by orders of magnitude for
small changes in volume – resulting in trading costs increasing at a much faster rate when trading
too aggressively. Conversely, a very liquid security’s price is fairly inelastic to changes in volume,
because having more or less liquidity at that point has little impact on costs. As a result, the
Since participation rate drives trading costs, it is not only trading volume that matters, but also
7
Our framework, and its implication that portfolio optimization will seek to trade conservatively rather than
aggressively because of the asymmetric cost of volume forecast errors, implies that arbitrage activity may be limited
as a consequence. This implication provides a novel and additional source of limits to arbitrage activity in the spirit
of Shleifer and Vishny (1997).
4
the size of the trade. Trade size is determined endogenously and is a function of expected trading
volume and aversion to tracking error. In the model, the size (assets under management, AUM) and
volatility of the fund (and risk aversion of the investor) also affect the costs and benefits of trading.
Because price impact is an increasing function of participation rate, trading costs increase with
AUM endogenously, and the relative penalty for tracking error decreases with AUM. The optimal
tradeoff between trading costs and tracking error will therefore vary with the size of the portfolio,
and so will the economic impact of volume prediction. For small AUM, tracking error considerations
likely dominate trading cost considerations, hence the economic benefit to predicting volume may
be relatively less valuable. For large AUM, trading cost considerations likely dominate.8
Applying the model framework to data, we run a series of trading experiments for optimally
designed portfolios that take into account trading costs using stock-level volume prediction as the
sole input. Since liquidity is an unknown quantity to the portfolio manager, she uses volume
prediction as an input to alter the expected cost and benefit of trading, endogenously responding
to her forecast of volume by altering her portfolio. We assess the out-of-sample performance of this
portfolio, net of trading costs, to evaluate the economic impact of predicting volume. More accurate
volume predictions provide more efficient implementation and hence more efficient portfolios net of
costs.
We experiment with various sets of target positions to mimic realistic trading tasks. We start by
simulating an extremely profitable (before-cost) daily quantitative trading strategy, which represents
an unachievable target. This portfolio, which requires high turnover and aggressive trading, is much
less profitable after accounting for trading costs. We then consider expected trading costs in the
optimization to maximize the after-cost performance of the fund (at various fund sizes). We also
target a host of factor portfolios from Jensen, Kelly, and Pedersen (2022) that are based on ex-ante
return-predicting characteristics to examine the effectiveness of trading volume alpha across the
spectrum of factors. We examine the net of trading cost performance of an investment strategy
that seeks to target each of these factor portfolios while taking expected trading costs into account,
8
From this analysis, it is also possible to assess the optimal dollar size of the portfolio, including the break-even
fund size where trading costs exactly offset portfolio returns or the fund size that maximizes after-cost dollars rather
than returns (Frazzini, Israel, and Moskowitz, 2018).
5
using predicted market trading volume as the only input into forecasting expected trading costs.
Expected trading costs dictate how frequently to trade, which stocks to trade, and how much to
trade, with each of these choices simultaneously affecting tracking error to the portfolio.
We find that volume prediction has a measurable and significant effect on after-cost portfolio
performance. To predict volume, we use technical signals, such as lagged returns and lagged trading
volume, as well as firm characteristics that the literature finds capture return anomalies, but not
necessarily trading volume. We then add indicators for various market-wide or firm-level events
associated with volume fluctuation, including upcoming and past earnings releases. We analyze
both linear and non-linear prediction methods using various neural networks designed to maximize
out-of-sample predictability. Finally, we alter the objective/loss function of the neural network to
take into account the portfolio problem’s economic objective when predicting volume. Our aim,
again, is not to find the best prediction model for stock trading volume, but rather to couch the
prediction problem into economic outcomes and measure its costs and benefits accordingly. To
that end, we assess a number of different models and variables for predicting volume in order to
We find that volume prediction improves significantly over moving averages of lagged trading
volume when using technical signals. Adding firm characteristics (such as BE/ME) further im-
proves volume predictability, even though these variables are primarily used for return forecasting
and are not necessarily related to trading volume. Information on events such as earnings releases
further enhances volume predictability. Non-linear functions from neural network searches provide
better predictability over simple linear models, controlling for the same set of variables/information.
Recurrent neural networks, which learn dynamic predictive relationships, yield additional improve-
ments.
Finally, we find that imposing an economic loss function consistent with the portfolio prob-
lem greatly improves the value of volume predictability from a portfolio performance perspective.
Specifically, we fine-tune the neural networks on the economic objective (derived from the portfolio
problem) rather than on a statistical objective, such as mean squared errors used to pre-train the
9
For example, we leave out many potential variables that relate to trading volume, such as other microstructure
variables, cross-security lead-lag effects, etc.
6
volume prediction model. Fine-tuning yields significant gains in out-of-sample portfolio perfor-
mance compared to the statistically pre-trained model. This result obtains because the portfolio
problem recognizes that trading costs are not a linear function of trading volume. The neural
network places greater weight on observations that impact trading costs more and worries less
about predicting volume where trading costs are less affected, such as recognizing the asymmetric
costs of over- versus underestimating volume. While an MSE objective criterion may maximize
the out-of-sample R2 of volume prediction, an economic loss function directly tied to the portfolio
problem provides more valuable volume predictability that results in better out-of-sample after-cost
portfolio performance.
Since portfolio size affects the tradeoff between the cost and benefit of trading, the resulting
portfolio solution also changes with different levels of AUM. In addition, while volume prediction
has benefits across all factors, some factors have a greater trading volume alpha than others due to
the varying tradeoff between the cost of trading and the opportunity cost of not trading. Intuitively,
we find that factors with a higher turnover (e.g., momentum, short-term reversals) benefit more
from portfolio optimization that accounts for expected trading costs based on volume forecasts.
In general, we find that trading volume alpha is substantial. The marginal improvement on a
portfolio from trading volume alpha is as large as finding return alpha. For example, for a $1 billion
fund, the after-cost improvement in portfolio performance due solely to trading volume prediction
beyond using lagged volume measures, can be as much as double in terms of expected returns or
Sharpe ratio after trading costs. Among popular asset pricing factors, the improvement in after-cost
returns ranges from 20 bps to 100 bps above using a moving average of lagged volume to predict
future volume. Refining the prediction methods and deepening the prediction models could add
The rest of the paper is organized as follows. Section 2 covers some preliminaries to the analysis:
a motivation for predicting trading volume and a description of our data. Section 3 examines volume
4 presents a theoretical optimal portfolio framework for quantifying the economic value of volume
prediction in terms of portfolio performance, which we call “trading volume alpha.” Section 5
7
discusses the empirical results of predicting volume through the lens of our theoretical framework
using a variety of machine learning methods. Section 6 applies these insights and methods to
trading experiments that characterize the net of cost performance improvement of simulated real-
We first motivate why predicting volume is interesting and useful and then describe the data.
2.1 Motivation
Trading costs are critical to realizing investment performance, yet insufficient empirical attention
has been paid to them regarding optimal portfolio construction. This lack of research is largely
due to the challenges imposed by modeling trading costs and finding a generic portfolio solution.
As a consequence, and despite its importance, portfolio theory mainly focuses on modeling and
One of the main challenges in modeling and forecasting trading costs is that the largest com-
ponent of these costs for a large investor is price impact, which depends on the size of the trade,
the amount traded by other traders (in the same and in opposite directions), and the identity of
the trader – different traders may face different price impacts. These features frustrate a generic
portfolio solution that incorporates trading costs. In particular, since optimal portfolio weights
should reflect the expected net of cost return, a solution requires an expected cost function, which
Following Kyle (1985) and subsequent empirical work (Frazzini, Israel, and Moskowitz, 2018),
price impact depends on the trader’s participation rate, defined as the dollar amount traded by
investor n in security i relative to total dollar volume in stock i (the amount traded in the market
$T radedn,i,t
P articipationRaten,i,t = .
$V olumei,t
8
Price impact is an increasing function of participation rate (modeled linearly in Kyle, 1985 and
the positive relation empirically verified in Frazzini, Israel, and Moskowitz, 2018), whose elasticity
varies by investor n. The numerator is also endogenous to expected price impact, which itself is
a function of the participation rate. The circular nature of trade size, and its variation across
investors, makes modeling trading costs particularly challenging. However, the denominator of the
participation rate is independent of n and exogenous to the trader’s desired trade size (assuming a
single investor’s trade is too small to materially affect total dollar volume). Thus, trading volume is
generic and exogenous to each trader and provides a variable universal to all investors for modeling
costs.
From a prediction standpoint, total dollar volume is also easier to forecast than a specific
investor’s trades. Predicting market-level trading activity (total buys and sells) is an easier task
and high frequency data on total trading volume is readily available, while data on individual
traders is not. Moreover, as we will find, machine learning techniques can significantly improve our
ability to forecast volume, due in part to the non-linear nature of volume and its relation to trading
costs. With these insights, we model expected trading costs solely using forecasted total dollar
volume for a stock. Although this exercise is only a partial solution to the trading cost problem, it
is a general one, and it allows us to showcase the economic value of predicting trading volume in
2.2 Data
We compile a data panel of daily stock-level dollar trading volume (Vei,t ) and 175 predictors (Xi,t ).
The unit of observation is stock-day (i, t). We adopt the convention that Xi,t is observed by day
t − 1, whereas the associated prediction target Vei,t is observed until the end of day t. We use a tilde
The sample period is 2018 to 2022, or 1,258 days. The cross-section covers around 4,700 stocks,
with an average of 3,500 stocks per day, or 4,400,000 observations in total. We split the data into
10
Korajczyk and Sadka (2004), Novy-Marx and Velikov (2016), and DeMiguel et al. (2020) evaluate the after-cost
profitability of factor portfolios, rather than actively seeking cost mitigation in portfolio optimization. Jensen et al.
(2022) take price impact transaction costs as given in portfolio optimization. We tackle the forecasting problem given
uncertain transaction costs and apply it within a portfolio optimization framework.
9
a 3-year training sample and a 2-year testing sample. All models are trained once in the training
sample and evaluated out of the sample. We avoid re-sampling methods such as cross-validation
Our analysis focuses on predicting out-of-sample trading volume. Reasonable in-sample fit,
often evaluated in the literature (Chordia, Huh, and Subrahmanyam, 2007; Chordia, Roll, and
Subrahmanyam, 2011), does not often lead to good out-of-sample (OOS) performance, which is of
primary interest to evaluate the robustness and economic impact of volume predictability.
While it is common in the stock return predictability literature to use large sets of variables
(e.g., “factor zoo”) to identify the best predictors, it is less common to use large data sets to predict
market microstructure variables such as trading volume. We show that using big data improves
The main variable we aim to predict is dollar trading volume, which we measure as the natural
logarithm of end-of-day transacted total dollar trading volume for each stock. This variable is
highly persistent. We focus on predicting innovations in trading volume as well, which we show
has significant impact on trading costs. When volume is suddenly much lower than expected,
an investor will “overpay” in transaction costs. If volume is higher than expected, then there is
more liquidity, and an investor incurs opportunity costs from not trading aggressively enough. We
examine the predictive content of several sets of predictors, including technical, fundamental, and
Daily dollar trading volume Vei,t ranges widely (from thousands to billions of dollars) across stocks
and is highly skewed. We take the log of dollar volume, ve = log Ve , whose distribution, shown in
Log dollar volume is highly persistent in the time series and cross-section of stocks, and can
be easily predicted by lagged moving averages of various frequencies. The five-day moving average
predicts log dollar volume with an R2 of 93.68%, higher than the one-day lag (92.53%), moving
10
Figure 1: Distributions of daily stock-level log dollar volume (e
v ) and its shock (e
η ), panel pooled
Histograms of vei,t and ηei,t in the full sample of around 4,400,000 stock-day observations. The
second horizontal axis in the left panel is dollar volume Vei,t in the log scale. Log dollar
volume shock ηe is daily log dollar volume minus the moving average in the past five days:
ηei,t := vei,t − 15 (e
vi,t−1 + · · · + vei,t−5 ).
We focus on predicting the log dollar volume shock defined as daily log dollar volume minus
1
the moving average in the past five days, ηei,t := vei,t − 5 vi,t−1 + · · · + vei,t−5 ) := vei,t − [ma5 ]i,t .
(e
This exercise is comparable to predicting asset returns (change in log price) instead of price levels.
Figure 1 (right panel) shows the pooled distribution of ηe, which is relatively symmetric, centered
2.4 Predictors
We use a total of 175 predictors from various sources, including technical signals, firm fundamentals,
and market and corporate events. We show that the virtue of complexity approach in return
prediction (Kelly, Malamud, and Zhou, 2024) is also useful for volume prediction. We find that
each subset of variables provides incremental improvement to predicting volume, while using all
variables has the greatest OOS predictability. We list the sets of predictors that are cumulatively
1. Technical signals (“tech”): lagged moving averages of returns and log dollar volume over the
2. A small set of commonly used fundamental firm characteristics (“fund-1”): market equity,
standardized earnings surprise, book leverage, book-to-market equity, Dimson beta, and firm
11
age. (6 predictors.)11
3. The remaining firm characteristics from the JKP dataset (“fund-2”), which are merged and
4. Calendar dates with large effects on trading volume (“calendar”). We hard code four binary
- Early closing days for the exchanges (July 3rd, Black Friday, Christmas Eve, and New
Year’s Eve).
- Triple witching days (four times a year when the index futures contract expires at the
same time as the index option contract and single stocks options).
- Double witching days (eight times a year when two of the three events above, the single
Early closing days have substantially less trading volume, while the other three are associated
hot encoding) indicating whether the firm has an upcoming earnings release or just had one
in the past few days. We first construct the number of days until the next known scheduled
earnings release event. For example, a value of zero implies the current day is previously
known to have a scheduled release. A negative value means there are no known scheduled
events in the future and indicates how many days since the last event. We convert this
number into 10 dummy variables of categorical bins: ≤ −4, −3, −2, −1, 0, 1, 2, 3, 4, ≥ 5. The
This list of variables for predicting volume is not exhaustive. Other variables that could add
11
These predictors are from the JKP dataset (Jensen, Kelly, and Pedersen, 2022). We forward fill the monthly firm
characteristics in time when merging to the daily panel. Hence, the characteristics are still always ex-ante available.
On each day, we rank standardize in the cross-section each characteristic to a uniform distribution from -1 to +1.
12
Triple witching happens on the third Friday in March, June, September, and December. Double witching is on
the third Friday in the other eight months. Russell index re-balancing is on the fourth Friday of June when the Russell
1000, Russell 2000, Russell 3000, and other Russell indexes are reconstituted, and has “often been one of the highest-
volume trading days of the year” for the exchange, due to indexes tracking funds adjusting their holdings to reflect
the updates. Rerferenes: [Link] and
[Link]
12
predictive power are microstructure variables, intraday observations, and lead-lag relations across
stocks in terms of trading (e.g., large to small stocks, within industry, etc.). As stated previously,
we do not attempt to provide the best volume prediction model. Rather, we translate the volume
prediction problem into an economic problem whose objective is after-cost portolio performance.
Using our framework, future work can add further predictors for volume that may provide even
larger economic benefits than we show here. However, our framework provides a way to assess
We start with a statistical prediction of daily trading volume using various subsets of predictors
in the training sample panel to estimate the models.13 We compare linear models (ols), with
neural networks (nn) that allow for non-linear transformations and complex interactions, as well
as recurrent neural networks (rnn) that, in addition, allow for state variables to incorporate time
series dynamics. The simplest baseline is predicting vbi,t = [ma5 ]i,t , or in other words, ηbi,t = 0. (The
“hat” denotes predicted values.) Linear regression is also a simple benchmark comparison.
The neural network implementation is kept simple, standard, and fixed throughout the paper in
order to facilitate transparency. The network architecture has three fully-connected hidden layers
of 32, 16, and 8 ReLU nodes, respectively, and one linear output node. The size of the input layer
designed to capture time-series dynamics. An rnn is analogous to state space models like GARCH,
in which the forecast ηbi,t is not only a (non-linear) function of the concurrent predictors Xi,t , but
13
As explained above, predicting ηe or ve are essentially the same: predicting ηe as nn(X) is just predicting ve as
nn(X) + ma5 , where ma5 is one of the predictors. From the machine learning perspective, this is implementing a
simple residual connection (ResNet, He et al. 2016) as illustrated in Figure 4.
13
also of a set of state variables that is the output of the network applied to the previous data
network function and state are the state variables. The recurrent neural network processes data
sequentially, and recursively passes the state variables to the next time period. Essentially, rnn
extracts predictive information from concurrent and lagged predictors Xi,t , Xi,t−1 , Xi,t−2 , . . . , in
contrast to a nn that uses only the concurrent predictors but nothing from the past. Although Xi,t
contains moving averages of vei,t−1 , vei,t−2 , vei,t−3 . . . , for example, the way such lagged information
enters the model without an rnn architecture is highly restrictive. With rnn, the model can “learn”
flexible dynamics, where time-series dependencies are parameterized by trainable network weights.
We implement the rnn with the popular and standard lstm (long short-term memory) architecture.
The number of layers and neurons are kept the same as nn, but the total number of parameters
Appendix A.1 contains other details on implementing the machine learning methods, includ-
ing the optimizer, training scheme, and infrastructure. We do not tune or optimize the hyper-
Table 1 reports the OOS prediction accuracy of each method. We cumulatively add sets of predictors
in the columns from left to right to highlight the prediction improvement from using larger data
sets. The rows correspond to different prediction methods. We express the accuracy in two R2 ’s.
Panel A reports the explained percentage of the total sum of squared errors of log dollar volume
relative to ηe, the residual after controlling for the five-day moving average. Panel B converts that
the number of parameters estimated in each model with each set of predictors.
Volume is highly predictable. The most sophisticated model using all predictors can predict
14
Specifically, the bottom hidden layer in the aforementioned 3-layer network is upgraded to an lstm layer with
32 hidden states and cell states, with the rest of the two layers unchanged. Lstm is a standard and popular type of
rnn with four specific internal mechanisms, or gates, that control the flow of information from both the short- and
long-term past (Hochreiter and Schmidhuber, 1997). See Kelly and Xiu (2023) for a general reference and Appendix
A.1 for our specifications.
14
Table 1: Prediction accuracy
Each row represents a prediction model, and each column cumulatively adds to the set of predictors.
Panels A and B respectively express the OOS prediction accuracy in two different R2 ’s. The R2
relative to ηe is 1 − MSE/avg(e v − ma5 )2 ; R2 relative
P to ve is 1 − MSE/avg(e v ))2 , where
v − avg(e
1
MSE := avg(e v − vb)2 = avg(e
η − ηb)2 and avg := |OOS| i,t∈OOS is the OOS average. Each reported
R2 value is the average across five independent runs initialized with different random seeds to ensure
the results’ robustness and reproducibility. Panel C reports the number of parameters, for which
rnn is about four times of nn due to the four gates in lstm, see exact formulas in Appendix A.1.
nearly 20% of future variation in daily trading volume changes. In comparison, daily stock returns
are hardly predictable with a positive OOS R2 , even with state of the art models and predictors,
and monthly returns are only slightly predictable (Gu, Kelly, and Xiu, 2020).
Even simple ols models can deliver double-digit R2 , especially when using the largest set of
predictors, but the nn and rnn add an additional 3-4% of OOS R2 predictability.15
Adding more predictors improves accuracy in general. All 175 predictors can increase the R2
by more than three percentage points compared to just the eight technical signals. The exception
is that adding the large set of fundamental signals (fund-2) makes the methods (even ML methods)
15
We experimented with regularization on the linear model (lasso and ridge regressions), and did not find significant
improvements.
15
perform a little worse. This may be due to overfitting when the number of features increases and
where we do not use regularization techniques to try to correct for that. The predictors associated
with expected returns do not necessarily work for predicting volume. Market-wide calendar events
are quite effective in capturing volume changes, however. Scheduled earning announcements also
The results show that machine learning is critical, and that complexity has its virtue in the
context of predicting volume. The prediction accuracy of the rnn is better than the nn, which
in turn is better than ols, uniformly across each configuration of included predictors. Panel C
shows the improved prediction accuracy is achieved through a significant increase in the number of
parameters, a measure of model complexity. Appendix A.1 shows the computational costs of the
Panel B reports an alternative R2 measured as the explained percentage of the total variation of
ve. Under this metric, the R2 is always high since the trailing moving average explains ve to a large
extent already, and the gain is at most around 1.2 percentage points on top of ma5 , which does
not look impressive. This begs the question: which one is the right metric to gauge the economic
value of this statistical task? Similarly, why take the log of volume and not predict dollar volume
Ve or some other transformation of it? In the next section, we will formulate an economic objective
as a function of volume by modeling a portfolio problem and show the economic value from the
Appendix C.1 reports that larger firms have higher prediction accuracy than smaller firms,
while the overall patterns of predictability across methods are robust in each firm size sub-sample.
The R2 ’s evaluated in the mega firms are roughly twice those of the nano firms. Smaller firms
have a greater magnitude of unexpected trading volume shocks that are hardest to predict. This
result makes sense since small firms are volatile and have low trading volume, hence unexpected
events that give rise to volume spikes are more likely for these firms. This finding indicates that
in addition to small firms being less liquid on average, their liquidity is also less predictable and
more volatile. In other words, their trading costs are less predictable. We examine whether firms
of different size groups should be modeled differently by implementing a mixture of experts (moe)
16
method, which is shown to be beneficial for the linear model but not for the neural networks.
To quantify the economic value of predicting volume, we set up a portfolio problem that features
a tradeoff between tracking a target portfolio versus minimizing trading costs. The key choice is
whether to trade aggressively toward the target or passively to avoid trading costs. The optimal
balancing point depends on the volume forecast and the trading costs associated with it. We
evaluate to what extent more accurate volume forecasts translate to better trading execution.
The incentive to trade is modeled with an objective that penalizes the tracking error toward
a target portfolio. The tracking target, potentially informed by the various return forecasting
signals, is taken as given since the target itself plays a tangential role in the core tradeoff analysis.
Thus, we fix and set aside the return prediction problem to focus on the improvement afforded
by the volume prediction problem. Once this problem is solved, we experiment with a range of
pre-specified tracking targets to evaluate the economic benefit achieved in different tracking tasks,
The tracking error objective is modeled with a simple mean-variance portfolio optimization frame-
work. In this framework, the tracking target is rooted in return forecasts. Regardless of the
mean-variance microfoundation, the tracking error objective is also relevant for circumstances such
Consider a portfolio manager who chooses dollar portfolio positions xi,t in stock i at the start
X γ X X
2
A(1 + rf,t ) + xi,t mi,t − xi,t xj,t σij,t − T radingCosti,t , (1)
2A
i i,j i
where A is the manager’s assets under management (AUM) or fund size and γ is her risk aversion
17
coefficient.16 To improve the investment outcome, much empirical work has been devoted to pre-
2 ). Instead, we assume a simple
dicting mean excess returns (mi,t ) and the variance-covariances (σij,t
form for the return moments and take them as given in order to illustrate the economic value solely
of the trading cost term. Assuming Varri = σ 2 , with zero covariances, the objective function is
2 X
γσ 2 X A A X 2
− xi,t − mi,t − T radingCosti,t + A(1 + rf,t ) + mi,t . (2)
2A γσ 2 2γσ 2
i i i
The task is to balance the tradeoff between the first term, which is the tracking error penalty as the
result of the mean-variance optimization, and the second term, the transaction cost, to be detailed
below. The third term can be ignored in the optimization since it is irrelevant to the x choices.
1
T rackingErrori,t := µ(xi,t − x∗i,t )2 , (3)
2
where the target x∗ is the before-cost mean-variance efficient portfolio position, which increases in
the asset’s return expectations as well as the total portfolio size A. In implementation, the trading
target is formed in a separate process without immediate trading cost consideration. We analyze
a general strategy that optimizes the trading rate toward the target based on volume predictions.
γσ 2
Parameter µ := A controls the strength of the tracking error penalty. In later empirical analyses,
we do not calibrate µ from risk coefficients but instead treat it as a hyperparameter and tune the
optimal µ under various AUM levels according to investment performance. Still, the qualitative
relationship is preserved – a larger investor penalizes tracking error (measured in dollars) less,
1e 0 2
T radingCosti,t := λ i,t (xi,t − xi,t ) , (4)
2
16
We consider a simple situation where the AUM is constant due to an immediate payout program, see Eq. 13 for
detail. In the second term, the risk aversion coefficient is explicitly adjusted by A such that the before-cost Markowitz
optimal dollar position, x∗ , scales up with A.
18
e as a function of volume: λ
where x0 is the starting position. We specify λ e = 0.2/Ve = 0.2 exp(−e
v ),
following Frazzini, Israel, and Moskowitz (2018). Underlying the quadratic functional form, it is
assumed that the price impact is linear in the trade’s size relative to the volume of the day (a.k.a.
0
participation rate): P riceImpact = 0.1 x−x (Kyle, 1985). For example, buying (or selling) 10% of
e V
the daily volume would move the price by 1% (or −1%). And the trading cost is the price impact
X
min (T rackingErrori,t + T radingCosti,t ) . (5)
{xi,t }
i,t
the tilde). One must predict them based on available conditioning information represented by
predictors Xi,t . Taking x0i,t and x∗i,t as given, the problem becomes {i, t}-separable.18 Then, the
problem is
1e 0 2 1 ∗ 2 19
min E λ(x − x ) + µ(x − x ) .
x∈σ(X ) 2 2
To implement this problem empirically, we first normalize the problem by the target trade size
x∗ − x0 so that we analyze the loss for a one-dollar target trade. The problem scales quadratically:
a $1,000-dollar target trade will incur 106 times the loss of a $1-trade. In detail, let the choice
17
For simplicity, we assume away cross-impact on λ e from related stocks as well as other determinants of λ. e Also,
0.2
other functional forms for λ, e= √
e such as the quadratic form often shown empirically, λ = 0.2 exp(− 12 ve), (Frazzini,
V
e
Israel, and Moskowitz, 2018) can also be used. In this case, all the analyses carries through but ve will be twice
as large. We stick with the linear specification consistent with theory (Kyle, 1985) and empirical evidence on the
unexpected component of trading volume (see Frazzini, Israel, and Moskowitz 2018).
18
Ideally, x0i,t should not be taken as given, as it is affected by the choice on the previous day, but we are not
considering the dynamics of the problem here. By taking x0i,t and x∗i,t as given, the problem is {i, t}-separable
and easier to solve. In the trading experiments (Section 6), however, we consider the dynamics by evaluating the
recursively traded portfolios.
19
Here “x ∈ σ(X )” restricts x as a random variable that depends only on predictive information available at the
time of the choice, withh Xi,t := [Xi,t , Xi,t−1 , Xi,t−2i, . . . ]. This unconditional expectation minimization is equivalent
to solving minx∈R E λ(x − x0 )2 + 1 µ(x − x∗ )2 X for each X realization.
2
1e
2
19
x−x0
variable be trading rate z := x∗ −x0
, then the minimization objective becomes
1e 1 1 2
λ(x − x0 )2 + µ(x∗ − x)2 = x∗ − x0 λze 2 + µ(1 − z)2 . (6)
2 2 2
2
Since the factor, x∗ − x0 , does not matter for the choice of z, define the economic loss as
min E [lossecon (e
v , z; µ)] , (8)
z∈σ(X )
which is the main focus of economic machine learning. Being able to separate x∗ , x0 affords many
conveniences. It means the core problem is independent of the target strategy or fund size. It
allows us to look at each i, t observation independently in a volume prediction setting. After the
prediction task is done, we evaluate the investment performance under various pre-specified target
Suppose ve = v and we ignore the inaccuracy in the prediction, the optimal policy is then,
µ 1
s(v; µ) := arg min lossecon (v, z; µ) = = . (9)
z µ+λ 1 + exp (−v + log 0.2 − log µ)
We plot this function in Figure 2. It is a sigmoid function with a horizontal offset determined by
µ. The optimal trading rate z ranges from 0 to 1 as v increases. Then, the optimal dollar position
choice is x = x0 + s(v; µ)(x∗ − x0 ). This is the finding of Gârleanu and Pedersen (2013) that the
optimal strategy should “trade partially toward the aim.”20 In their setting, the trading rate is a
20
The other finding in Gârleanu and Pedersen (2013) is that the optimal strategy should also “aim in front of
the target.” This dynamic effect is abstracted away in our problem since we are considering the {i, t}-separable
optimization. One interpretation is that x∗ is already the aim that is in front of the target that implicitly embeds
the dynamic effect.
20
Figure 2: The policy function (s) that maps log volume (v) to trading rate (z)
1.0
s(v; µ)
s(v; µ) with a larger µ
0.8
0.4
0.2
0.0
0 4 8 12 16 20 24
v
The solid curve uses a µ value relevant for $1b AUM. The dashed curve changes to a greater µ
(AUM = $100m), increasing z across the spectrum. The background is the histogram of ve data
repeated from Figure 1, to show that a typical ve corresponds to a ze somewhere in the middle
between 0 and 1 given the chosen µ’s.
fixed constant. Here, it is still irrelevant to either the trading target or the starting point (x∗ , x0 ),
but importantly, it depends on the volume prediction, v. Instead of assuming liquidity as a constant
known by the agent, the innovation and emphasis of this paper is that volume prediction alters the
tradeoff between the cost and benefit of trading, and hence more accurate predictions lead to more
efficient portfolio implementation with a z that varies with forecasted volume. Additionally, the
optimal z also does not explicitly depend on the scale of the fund. If the AUM doubles, and both x∗
and x0 double, the optimal z remains the same, while the dollar position choice x doubles. However,
a smaller fund will find a larger µ more relevant for their investment performance optimization (µ-
tuning detailed further below). In that case, they will trade more aggressively uniformly across v,
21
5.1 The statistical and economic tasks of volume prediction
We consider two ways of approaching the portfolio optimization problem. The first conducts
a statistical prediction of volume and then plugs the volume forecasts into the optimal trading
policy s(v; µ) to form a trading plan. This indirect approach we call “statistical” learning. The
second is an economic learning approach, which instead learns trading rate z as a function of
conditioning information directly to minimize the economic loss. We argue directly choosing z is
also “predicting volume,” but with a different optimization objective rather than the least squares
loss commonly used in statistical predictions. We term this approach “economic” learning. We
show their theoretical differences: the economic loss penalizes inaccuracies in volume forecasting
asymmetrically, where overestimating volume is more costly. Therefore the model optimized for the
economic goal should be more conservative, at the expense of compromising on accuracy measured
Step 1: run machine learning regressions of ve onto X in the training sample as in Section 3
X
v ∗ (·) = arg min vi,t − v(Xi,t ))2 .
(e (10)
v(·)
i,t∈train
Step 2: plug the OOS predictions vbi,t := v ∗ (Xi,t ) into policy equation (9) to trade zbi,t =
s(b
vi,t ; µ).
parameterize z as a neural network, optimize the economic objective in the training sample
X
z ∗ (·) = arg min lossecon (e
vi,t , z(Xi,t ); µ) (11)
z(·)
i,t∈train
The difference between the two approaches boils down to the different loss functions deployed in
penalizing volume prediction errors. For example, a trading action zb = z(X ) implies an underlying
21
For both approaches, the machine learning framework can be rnn or nn, in which Xi,t includes concurrent and
lagged predictors (X) or only the concurrent.
22
volume forecast vb = s−1 (b
z ; µ), and equivalently the economic loss can be written as a function of
X
min lossecon
vv (evi,t , v(Xi,t ); µ) (12)
v(·)
i,t∈train
Given that the two approaches are only different in the loss functions, we compare lossecon
vv
1
with the least squares loss function, lossls (e
v , v) := 2 (v − ve)2 , used in statistical predictions. To
understand how the two approaches behave differently, first note that the two functions are the
same for the smallest possible loss being attained, which is when the forecast, v exactly equals
the target, ve. In empirical experiments, we label the strategy made with perfect foresight “oracle”
vi,t = vei,t ), which yields the smallest mean squared error (MSE) and
as the unattainable ideal (b
Second, both loss functions monotonically increase as the forecast v deviates away from the true
ve. Therefore, it is intuitive to think that making forecasts that are close to ve in the least squares
sense, will translate to better portfolio implementation as evaluated by the economic loss.22 How-
ever, forecast errors will not guarantee this outcome because of the differences between the two
loss functions. From a theoretical perspective, it is well known that the conditional expectation,
v |X ], is the minimizer of the problem minv∈σ(X ) E lossls (e
E[e v , v) , so that the statistical learn-
ing method recovers the conditional expectation with the neural network tools. However given a
v |X ] = arg minv∈σ(X ) E lossls (e
Proposition 1. The least squares minimizer, E[e v , v) , does not
Even with unlimited data, the “perfect” statistical learning would not optimize the economic prob-
22
The above mentioned properties expressed mathematically are arg minv loss(v, ve) = ve, ∀e v ; and loss(v, ve) is in-
creasing in |v − ve| , ∀v, ve. Both lossecon
vv and lossls
satisfy these properties. In Figure 3, the dots mark the minimums.
In the right panel, the dips around the minimums are too shallow to be noticeable, though analytically they are
indeed the minimums. See Appendix B.4 for more details on the local curvature around the minimum points.
23
Figure 3: The statistical (least squares) and economic loss functions
30
lossls (e
v , v) lossecon
vv (ev , v; µ)
10−3
25 minimum point minimum point
10−4 greater µ
20
10−5
15 10−6
10 10−7
10−8
5
10−9
0
0 4 8 12 16 20 24 0 4 8 12 16 20 24
v v
if the generic loss function ϕ is in the Bregman class (Banerjee, Guo, and Wang, 2005; Patton,
2020). As is well known, the least squares loss belongs to the Bregman class, the economic loss
More intuitively, we plot and compare the two loss functions in Figure 3. We pick five different
true values for ve = 4, 8, 12, 16, 20, and respectively plot the loss curves for a range of forecasts v ∈
[0, 24]. The distinguishing feature of the economic loss is its asymmetric penalty for overestimating
volume when actual volume is low. For example, the blue loss curve (actual ve = 4) is very high if
the forecast is mistakenly large (v > 12). The economic intuition for why this particular forecasting
error is so costly is that it implies trading aggressively (z close to 1) when actual liquidity is low.
In contrast, errors in the other direction are much more forgiving (e.g., the purple curve
with ve = 20).23 Trading too little when actual liquidity turns out to be ample delivers a loss
23
Notice the vertical axis is in log scale, meaning the purple is much more flat compared to blue in terms of the
difference of its right and left ends. The same plot in linear scale is in Appendix Figure B.2.
24
equal to the opportunity cost of not better tracking the target portfolio, which is fixed at µ
(limz↓0 lossecon (e
v , z; µ) = µ, ∀e
v ). In summary, for low levels of true volume, overestimating vol-
ume is very costly, but for high levels of volume, the economic cost of volume forecast error is
relatively small.
We state the asymmetric property formally in Proposition 4 in Appendix B.4, and provide
further analysis. The analytical results rely on the quadratic functional forms assumed in equations
(3) and (4). However, the qualitative points carry over to more general settings.
These findings have important implications. Off-the-shelf machine learning tools are not the
most suitable for specific portfolio problems because they minimize statistical error rather than
economic error. Hence, an altered financial machine learning design that accounts for economic loss
can be more effective. The ranking of forecasts evaluated by the two loss functions can be reversed
– a set of forecasts with a smaller sample MSE might have a greater economic loss – something we
will see empirically in Table 2. The asymmetry in the economic loss has important implications for
implementing a model optimized for the statistical criteria to the trading task. The model should
“learn” to be more careful about the potential of a liquidity dry-up and be conservative in avoiding
overestimating volume. When necessary, it should compromise least squares accuracy in favor of
Both the statistical and economic learning approaches are implemented with neural networks,
given the many benefits of deep learning such as the ability to handle high-dimensional data and
non-linear relations. Next, we emphasize one aspect of the implementation that is particularly
relevant for transferring statistical learning results to finance applications such as portfolio opti-
mization.
We implement a transfer learning paradigm in which the statistical and economic learning models
The statistical volume prediction is an upstream task. It learns valuable but generic information
on the predictive relationship and serves as the foundation model (center node in Figure 4 right
25
panel). It enjoys off-the-shelf machine learning programs optimized for this typical task. The
background knowledge can be transferred to more specific downstream tasks such as portfolio
optimization in our case. The finance-motivated tasks benefit from a good “foundation”. Fine-
tuning the pre-trained foundation model per the economic loss objective further improves the
economic performance. The different specific downstream tasks require separate economic training
routines. (The tasks are different because the economic loss function is modulated by µ.) The
same pre-trained foundation model serves as the common starting point for different downstream
fine-tuning routines.24
Pre-training optimizes a nn or rnn from X to ηb (the blue part in Figure 4 left panel) as described
in the previous section. Using the pre-trained network as is, flowing the output ηb through additional
transformations, “+ma5 ” and “sigmoid s( · ; µ)” shown in black, the resulting zb implements
the plug-in step. Taking this foundation model as the initialization, fine-tuning conducts further
stochastic gradient descent in the economic loss function evaluated at the same training sample. The
resulting fine-tuned network implements the economic learning approach. Only the trainable part
(in blue) is updated in fine-tuning. The economic approach only innovates on the loss function,
not the network architecture or data. Neural networks tackle the non-linearity not only in the
predictive relationship but also in the (marginal) economic loss function. Experiments show fine-
tuning requires only a small number of epochs of training to significantly improve OOS economic
Many other finance problems share a similar structure, in which statistical results are transferred
into actionable strategies that are applied towards an economically motivated problem. Markowitz
portfolio optimization is a classic example. Another example is financial risk management, which
relies on volatility forecasts. The transfer learning procedure adopted here provides a unified
framework for applying machine learning techniques in these scenarios.26 The procedure is also
24
An even lower-level downstream task is to make decisions given target position x∗ . We do not directly train for
that, but do evaluate the performance in such trading experiments further below.
25
Alternatively, side-stepping pre-training but directly training for the economic loss from scratch is generally less
robust and takes more time (epochs over the sample) to train, probably because the machine learning program is not
optimized for such a loss. Not to mention that each µ would require a separate training routine as they do not share
the common pre-trained baseline.
26
Some financial machine learning studies, including Jensen et al. (2022), Chen et al. (2023), Cong et al. (2021)
and Chen, Pelger, and Zhu (2023) involve directly training for the economic target.
26
Figure 4: Network architecture and transfer learning procedure
training start
statistical learning
sigmoid:
(pre-training for upstream task)
similar to how GPT models are “P”re-trained on language representations and applications like
We present a systematic comparison of how different predictor sets and models perform in terms of
both statistical and economic performance. The results show the economic performance is indeed
optimized by fine-tuning the training process on the same objective, and that more predictors and
Table 2 Panel A evaluates the OOS mean economic loss (MEL), and Panel B the mean squared
error (MSE). For ease of comparison, Panels A′ and B′ normalize these metrics as a percentage
loss reduction such that the “oracle” (which represents the perfect prediction vb = ve) is at 100%
accuracy and the baseline ma5 at 0%. (The percentage reduction in MSE is then the R2 predicting
ηe as in Table 1 Panel A.) The four columns correspond to different µ values that modulate the
economic loss function, representing four different downstream economic tasks relevant to different
27
AUM magnitudes. As the AUM decreases, the relevant economic objective is parameterized by a
greater µ, such that overall trading becomes more aggressive. The average trading rate ranges from
trading conservatively near the starting position (13% for AUM = $10b) to trading almost all the
The rows include the statistical prediction methods described in Section 3 (ma5 , ols, nn, and
rnn). Additionally, the economic learning approaches (those ending with “.econ”) conduct fine-
tuning on top of the statistical forecasts to optimize the economic loss function under the four
different µ’s. The lines spanning the columns indicate that the statistical forecasts do not vary
with µ, whereas economic learning generates different (z-implied) volume forecasts as µ varies.
The predictor sets include the 8 “tech” signals or “all” of the 175 signals.
Looking first at the statistical performances of the forecasts, model complexity and feature rich-
ness help reduce MSE. Configuration rnnall yields the highest R2 ≈ 20%. However, smaller MSEs
do not necessarily lead to better economic performance. For example, nntech accrues greater MELs
than olstech , albeit nn is more accurate in terms of R2 (comparing Panels A and B). For the low
AUM task in particular, the various volume forecasts are even worse than the baseline ma5 , likely
due to the need to trade aggressively in this task and the disproportional penalty when aggressive
trades are the result of overestimating volume. This result is intuitive given the theoretical analysis,
and argues why economic fine-tuning produces better OOS portfolio performance.
The economic training methods lead to better economic outcomes. The fine-tuned networks
with all the predictors yield the best performance, reaching an OOS economic performance that
is about 43%∼70% of the unattainable oracle benchmark at various AUM scales. These improved
OOS outcomes are not mechanically guaranteed since the fine-tuning is to minimize in-sample
loss. The empirical results demonstrate the validity of the economic learning design. Looking at
Panel B′ , the statistical accuracy retreats after fine-tuning, often to levels even worse than the ma5
baseline resulting in negative R2 . This means to achieve improvements in economic outcomes, the
models compromise MSE. Optimal trading actions are conservative against volume overestimation
Furthermore, for a smaller AUM or, equivalently, a higher µ value, the economic models tend
28
Table 2: Economic and statistical performance of different methods
1 2 3 4 1 2 3 4
µ 1.2e-9 6.3e-8 4.7e-7 9.4e-6 1.2e-9 6.3e-8 4.7e-7 9.4e-6
avg ze 0.13 0.57 0.78 0.95 0.13 0.57 0.78 0.95
relevant AUM $10b $1b $100m $10m $10b $1b $100m $10m
A. Mean economic loss (MEL) (×10−8 ) A′ . % reduction in mean economic loss
ma5 0.1046 3.163 15.41 93.0 0.0 0.0 0.0 0.0
olstech 0.1041 3.011 14.81 93.2 27.9 29.6 11.1 -0.3
nntech 0.1043 3.100 14.97 99.7 19.2 12.3 8.0 -12.8
rnntech 0.1040 2.955 14.37 102.2 33.6 40.8 19.2 -17.7
[Link] 0.1041 2.991 12.35 66.8 31.3 33.6 56.5 50.2
[Link] 0.1039 2.855 11.78 64.2 39.7 60.4 67.0 55.3
olsall 0.1040 3.024 14.97 94.7 32.4 27.2 8.1 -3.3
nnall 0.1040 3.019 15.07 106.5 33.3 28.2 6.2 -25.9
rnnall 0.1040 3.012 14.78 109.8 34.8 29.5 11.7 -32.1
[Link] 0.1039 2.810 11.56 61.9 39.6 69.2 70.9 59.6
[Link] 0.1038 2.812 11.60 66.4 43.7 68.8 70.3 51.0
oracle 0.1029 2.653 9.99 40.8 100 100 100 100
29
introduce more statistical bias, as indicated by the increasingly negative R2 values. This can be
explained by the changes in the economic loss functions. With a smaller µ, trading is more intensive
in general (s curve in Figure 2 shift to the left). That means the penalty for overestimating low
volume is more stringent and takes effect earlier (dashed curves in Figure 3 shift to the left). Thus,
for smaller AUM, there is a greater difference between the least squares loss and the economic loss.
We now apply the analysis to real-world investment portfolios in a set of trading experiments.
The trading experiment forms a set of trades xi,t by applying the various trading strategies detailed
in the previous section to dynamically track a set of given target positions x∗i,t . The target positions
are not optimized on trading cost considerations, but formed in a separate process focused solely
on return prediction. We evaluate the outcome of the implemented trades xi,t , including the
mean return, Sharpe ratio, and turnover, in the OOS period. We examine whether a volume
prediction method brings improved investment performance net of trading costs and tracking error
considerations.
Various sets of target positions {x∗i,t } are exogenously supplied to mimic realistic trading tasks.
The first set of experiments mimic a quantitative strategy. We simulate an extremely profitable
before-cost trading strategy assuming the agent can, with some probability, forecast the realized
direction of stock price change. We experiment with different AUM levels, in which the dollar
positions scale linearly while the trading costs increase quadratically. As a result, the optimal
µ, which controls the overall trading aggressiveness, varies. The second set of experiments tracks
monthly-rebalanced factor portfolios sorted on firm characteristics from the literature as the trading
targets. These experiments reveal the effectiveness of volume prediction across the spectrum of
investment styles.
Given the target {x∗i,t }, the implemented trading outcome {xi,t } is constructed following the
30
trading rate strategy, such as zbi,t = [Link](Xi,t ; µ), which is formed in the training sample.
In particular, xi,t = x0i,t + zbi,t (x∗i,t − x0i,t ), where the starting position is formed recursively as
Note the dynamic effect here: the portfolio choice matters for the starting position on the next day,
although the optimization does not explicitly consider it. Additionally, the target amount to trade
x∗ − x0 , varies across i, t, which is another aspect abstracted from in the theoretical analysis.
Given the resulting trades {xi,t }, the accounting of the investment outcome is standard. The
daily dollar payoff is such that the AUM is fixed over time:
X X
payofft = A(1 + rf,t ) + xi,t rei,t − T radingCosti,t − A. (13)
i i
payofft X Xλ
ei,t
0 2
reimplemented,t := − rf,t = wi,t rei,t − A wi,t − wi,t , (14)
A 2
i i
xi,t 0 := x0i,t
where wi,t := A and wi,t A are portfolio weights as a ratio of AUM. We have the familiar
result that the before-cost return (the first term) is scale-invariant while the percentage trading
cost due to price impacts (the second term) scales with the AUM linearly. We report the mean and
1 X xi,t − x0i,t 1 X 0
Turnover := × 252 = wi,t − wi,t × 252. (15)
T 2A 2T
i,t i,t
We consider a set of trading targets {x∗i,t } that simulate a quantitative investment strategy. We
simulate a trading signal that, with 1% chance, perfectly forecasts whether a stock goes up or down
over the next five days. The signal is independent across i, t. Following the signal, all stocks are
27
To initiate the recursive calculation, let x0i,t = 0 on the first day a stock appears in the sample.
31
allocated to either the long or the short group. When no signal is received, the stock position stays
the same. We let, x∗i,t be the equal-weighted long-short strategy in which each leg sums up to 50%
of the AUM.
This portfolio has an unrealistically high before-cost OOS Sharpe ratio of around 7, which is
brought down substantially (to more reasonable levels) after trading costs. We experiment with a
sequence of AUM magnitudes up to $10 billion which command varying levels of trading costs. We
first evaluate across a grid of µ values that vary the aggressiveness of trades, and then perform the
µ-tuning method that allows for the comparison of the highest investment performance attained
We run a trading experiment for every method and each µ value for various AUM levels. Each
experiment plots one dot in Figure 5, with OOS turnover versus mean return (or Sharpe ratio) as
the coordinates. Each curve corresponds to one method, connecting the dots with varying µ values.
To understand the general shape of the curves, first, note that under different µ values, the
strategies vary from always passively holding (z = 0) to trading all the way to the target (z = 1).
As the implemented portfolio becomes more aggressive (µ increase), the turnover increases. On the
vertical axis, the mean return (or Sharpe ratio) first increases as a result of active trading for profit
and then bends downward due to trading costs, whose effect shows up with high turnover. The two
opposing forces result in the inverted U-shaped curves. For higher AUM, the trading costs’ effect is
stronger, resulting in lower inverted U’s that peak at smaller turnover levels (i.e., lower µ values).
In the next subsection, We implement the highest attainable investment outcome by selecting the
The investment gain of a better trading rate strategy z(X ) is shown in the vertical displacement
of the curves. The improvement comes from two aspects. First, more closely tracking the target
portfolio delivers a higher return. Second, reducing trading costs when liquidity is expected to
be low, for a given amount of turnover, lowers the trading cost drag on the portfolio. These two
objectives are contradictory in our framework, hence the value of predicting volume more accurately
is to strike a better balance in the tradeoff between trading costs and tracking error. At the two
extremes of the curves (µ = 0 or +∞), there is no room for tradeoff, and hence predicting volume
32
Figure 5: Trading experiment performance, with only tech predictors
Mean return after tcost (%, annualized) AUM = $10b AUM = $1b AUM = $100m
A B C
10
6 D E F method:
oracle
ma5
Sharpe ratio (annualized)
olstech
nntech
rnntech
3 [Link]
[Link]
A B C
10
6 D E F method:
oracle
ma5
Sharpe ratio (annualized)
olsall
nnall
rnnall
3 [Link]
[Link]
Each dot plots the outcome of one trading experiment: the horizontal coordinate is turnover,
vertical is the mean return or Sharpe ratio. Each curve is for a method with varying µ values.
Figures 5 and 6 differ in using only the tech or all sets of predictors, while the benchmarks curves
“ma5 ” and “oracle” are the same.
33
Table 3: Investment performance in trading experiments
Comparing the methods, we can analyze the sources of improvement. First, using the full set of
predictors is better than using only the set of technical predictors, which is still much better than
no additional information besides the five-day moving average. Second, holding the information
set constant, the neural network model predicts volume better than linear regression. Lastly, given
the pre-trained neural network, further fine-tuning the trading strategy by directly optimizing the
Finally, we tune the hyperparameter µ to report each method’s highest attainable investment
performance. We choose the µ value that maximizes the in-sample expected return (or Sharpe
ratio) and report the out-of-sample performance at the tuned µ. Effectively, µ is selected from the
peaks of the in-sample version of the curves (not plotted) and then applied to the OOS curves (as
plotted). We expect the in-sample and OOS curves to peak at relatively close µ ranges so that the
method attains an OOS performance close to the peak of the OOS curves.
28
The “relevant AUM” row in Table 2 is calculated according to the µ-tuning result under each AUM level under
method ma5 .
34
The results are reported in Table 3. Applying better prediction methods and using a larger
set of predictors improves investment performance uniformly across AUM levels. The economic
magnitude of improvement is significant, comparable to if not more than the marginal improvement
from innovating on return prediction signals. For $10b of AUM, the average annual return increases
from 3.88% when using the baseline prediction method of only a 5-day moving average volume, to
4.68% when making volume predictions with an economic objective optimization imbedded within
a rnn. For $1b AUM, the magnitude of improvement is even greater, going from 6.47% to 8.98%,
and more than doubling the Sharpe ratio from 2.21 to 4.53.
For smaller AUM ($10m), the improvement is still noticeable but smaller, because price impact
shrinks and all methods therefore prescribe trading very aggressively. The investment performance
converges to the high before-cost level regardless of the prediction method. For more realistic
considerations, future research could consider per-unit trading costs such as bid-ask spread in
addition to price impacts, which tend to show up as dollar trade sizes shrink.
As another set of trading experiments, we use as trading targets the portfolios sorted on character-
istics in the JKP dataset that come from the asset pricing literature. The goal is to examine the
For each of the 153 characteristics, the target {x∗i,t } is formed in a standard fashion without
considering trading costs: at the start of every month, the cross section of stocks is split at the 50th
quantile into equal-weighted long-short portfolios based on each characteristic. We fix the AUM at
$10 billion ($5 billion for each of long and short legs).
We fix µ across the 153 factors for consistent comparison, and because factor-specific µ tuning
is likely to be unstable. For example, consider factors that happen to earn negative realized returns
during the training sample period, the factor-specific optimal µ would be zero – no trading at all.
By evaluating at a fixed positive µ, we can address whether this factor that happens to lose money
in the sample, would lose less with a better implementation. We pick the µ that optimizes the
35
Figure 7: Mean return improvements in implementing each factor portfolio
1.0 zero_trades_21d
iskew_ff3_21d
1.0 Short-Term Reversal
rskew_21d
iskew_hxz4_21d
rmax5_rvol_21d iskew_capm_21d
turnover_var_126d
dolvol_var_126d
lti_gr1a
seas_1_1an
0.8 0.8
inv_gr1a
ret_3_1
ivol_ff3_21d rmax1_21d coskew_21d
taccruals_ni ivol_hxz4_21d seas_2_5an
rvol_21d rmax5_21d
ivol_capm_21d
0.6 age
cash_atturnover_126d
aliq_mat
zero_trades_126d
taccruals_at niq_be_chg1 resff3_6_1 beta_dimson_21d 0.6
at_be
oaccruals_ni
z_score oaccruals_at
op_at dsale_dsga
niq_at_chg1
ret_12_7 Profitability Low Risk
ope_bel1
qmj_prof
op_atl1
ebit_sale
ni_ar1
mispricing_perf
eqnetis_at
qmj_safety prc_highprc_252d
bidaskhl_21d seas_16_20an
seas_11_15an Accruals
Low Leverage Profit Growth Seasonality
ebit_bev
netdebt_me
ivol_capm_252d netis_at
dsale_drec
niq_at betadown_252d
qmj_growth
cop_at niq_be nfna_gr1a
ni_me be_me f_score
ocfq_saleq_std resff3_12_1
ocf_at
ni_be at_me cowc_gr1a
chcsho_12mocf_at_chg1
cop_atl1
zero_trades_252d
qmj
beta_60m
o_score eqnpo_12m
dgp_dsale seas_6_10an Quality Momentum
0.4 eqnpo_me
ope_be bev_mev
sale_emp_gr1
ebitda_mev
gp_atl1 eqpo_me
earnings_variability
capx_gr2 seas_16_20na
fnl_gr1a
mispricing_mgmt dbnetis_at 0.4 Value
saleq_gr1 niq_su
debt_gr3 coa_gr1a
at_turnover
debt_me
opex_ateq_dur
ocf_me
fcf_me
noa_at
col_gr1a
sale_gr1
dsale_dinv ret_12_1 Debt Issuance
gp_at
kz_index
sale_bev
tangibilityppeinv_gr1a
noa_gr1a
ival_me
capex_abn
emp_gr1
be_gr1a
lnoa_gr1a
saleq_su
seas_11_15na ret_6_1 Investment
dolvol_126d inv_gr1
capx_gr3 capx_gr1
prcaliq_at
rd_saleni_ivol
pi_nix
ami_126d
sale_gr3
market_equity at_gr1
sti_gr1a seas_2_5na
seas_6_10na
ret_9_1
Size
0.2 sale_me
rd5_atrd_me ret_60_12 nncoa_gr1a
seas_1_1na 0.2
betabab_1260d
ni_inc8q
ncoa_gr1atax_gr1a
corr_1260d
ncol_gr1a
0.0 div12m_me 0.0
0 1 2 3 4 5 6 0 1 2 3 4 5 6
Turnover (annualized) Turnover (annualized)
Each dot implements one JKP factor portfolio. The y-axis is the difference in after-cost mean
excess return between implementing with the [Link] and the ma5 . The x-axis is the turnover
of the factor portfolio target (i.e., Eq. 15 with xi,t = x∗i,t , x0i,t = x∗i,t−1 ). Panel B averages the points
in A by style clusters (from JKP).
Figure 7 plots the gain in mean return when implementing the factor portfolios with the
[Link] volume prediction compared with using the ma5 volume prediction. The horizontal
axis is the turnover of the target factor portfolio. Panel B averages the points by investment style
clusters (from JKP). The plots show volume prediction from the [Link] benefits portfolio imple-
mentation across the factors. The average gain in mean after-cost return across factors is 0.44% per
year from volume prediction alone using the [Link] model versus the simple moving average.
Almost all of the 153 factors have positive gains. With the $10b AUM scale, this translates into
an additional $44m per year in implementation cost-saving from improved volume prediction.
Across factors, the gain is larger for those factors with higher turnover. In the right region of
Figure 7, some raw factors have a turnover approaching six (600% per year or roughly turning over
half of the AUM every month). The gains for these factors, including various short-term reversal
strategies, are around 0.5% to 1.0%. These factors are constructed with technical signals over a
shorter window.29 In the left part of the figure, even factors with low turnover (those relying on
29
Examples: ret 1 0 short term reversal; iskew capm 21d idiosyncratic skewness from the CAPM; iskew ff3 21d
idiosyncratic skewness from the FF3F; rmax5 rvol 21d highest 5 days of return; rskew 21d return skewness 21d;
seas 1 1an 1 Year Annual Seasonality; and coskew 21d coskewness.
36
quarterly fundamental signals and signals with greater persistency) show gains that range from
Appendix Figure C.3 plots the same gains in the vertical axis but changes the horizontal axis to
the mean return attained by ma5 , i.e., the baseline level in the gain calculation. The figure shows
that, regardless of the baseline, the gain is independently distributed around a positive center.
That is, a better volume prediction is uniformly effective, and the improvement is not concentrated
on factors that have positive (or negative) realized returns. Appendix Figure C.4 reports similar
plots with gains measured in Sharpe ratio space instead of mean returns.
7 Conclusion
trading costs – a term we call “trading volume alpha.” Volume is highly predictable, especially
when using machine learning techniques, large data signals, and exploiting the virtue of complexity
in prediction. We find that volume prediction can be as valuable as return prediction in achieving
We find that incorporating an economic objective function directly into machine learning is
even more effective for obtaining useful predictions. This feature may be general to many finance
applications of machine learning, where incorporating the economic objective directly may dominate
a two-step process that first satisfies some statistical objective and then incorporates that statistical
object into an economic framework. For volume prediction, the asymmetric cost of overestimating
While we find substantial economic benefits from volume predcition using our framework and
methods, there is much room for improvement. Our goal is not to develop the best trading cost
model or even the best volume prediction model, but rather to translate the prediction problem into
economic consequences, which yield interesting insights. A more exhaustive search for prediction
variables and models that forecast volume more accurately could translate into even larger economic
37
benefits than we show here. Some promising candidates for additional features and methods are
lead-lag volume relations across stocks, more seasonal indicators, other market microstructure
Our simple framework for predicting and characterizing volume alpha also has limitations. For
one, we study a very simple functional form for trading costs that maps volume prediction directly
into costs. Other functional forms and other determinants of costs beyond volume may lead to
novel results. In addition, we separate the volume prediction problem from the expected return
and variance/covariance modeling problem. Combining all three could generate further portfolio
improvements and the interaction between these three prediction problems could be enlightening.
Lastly, our trading experiments are merely a tool to illustrate some possible applications of our
insights, but are not designed to optimize any performance outcome. Specifically, two things not
considered in our design are: dynamic effects from trading and heterogeneous trade tasks. Adding
these more complex features would be an interesting area for future research.
Trading volume prediction in general is an interesting research area worthy of further explo-
ration. While we have couched the volume prediction problem into a portfolio context to translate
the problem into economic consequences, understanding the role of volume more generally – its
causes and consequences – is interesting. Using some of our techniques may shed light on this
question and may help pinpoint what components of volume are most valuable to trading costs
and, as a byproduct, portfolio construction. For example, informed versus uninformed volume,
volume with temporary versus permanent price impact, and short- versus long-term volume may
be interesting research pursuits. Examining various aspects of volume could be very useful for
improving portfolio optimization and understanding trading activity more broadly. We leave these
38
References
Amihud, Y. 2002. Illiquidity and stock returns: Cross-section and time-series effects. Journal of
Financial Markets 5:31–56.
Balduzzi, P., and A. W. Lynch. 1999. Transaction costs and predictability: Some utility cost
calculations. Journal of Financial Economics 52:47–78.
Banerjee, A., X. Guo, and H. Wang. 2005. On the optimality of conditional expectation as a
Bregman predictor. IEEE Transactions on Information Theory 51:2664–9. Conference Name:
IEEE Transactions on Information Theory.
Benston, G. J., and R. L. Hagerman. 1974. Determinants of bid-asked spreads in the over-the-
counter market. Journal of Financial Economics 1:353–64.
Brennan, M. J., and A. Subrahmanyam. 1995. Investment analysis and price formation in securities
markets. Journal of Financial Economics 38:361–81.
Campbell, J. Y., S. J. Grossman, and J. Wang. 1993. Trading volume and serial correlation in
stock returns. The Quarterly Journal of Economics 108:905–39.
Çetin, U., R. A. Jarrow, and P. Protter. 2004. Liquidity risk and arbitrage pricing theory. Finance
and Stochastics 8:311–41.
Chen, H., Y. Cheng, Y. Liu, and K. Tang. 2023. Teaching economics to the machines. Available
at SSRN 4642167 .
Chen, L., M. Pelger, and J. Zhu. 2023. Deep learning in asset pricing. Management Science .
Chordia, T., S.-W. Huh, and A. Subrahmanyam. 2007. The cross-section of expected trading
activity. The Review of Financial Studies 20:709–40.
Chordia, T., R. Roll, and A. Subrahmanyam. 2011. Recent trends in trading activity and market
quality. Journal of Financial Economics 101:243–63.
Cong, L. W., K. Tang, J. Wang, and Y. Zhang. 2021. AlphaPortfolio: Direct construction through
deep reinforcement learning and interpretable AI. Available at SSRN 3554486 .
Datar, V. T., N. Y. Naik, and R. Radcliffe. 1998. Liquidity and stock returns: An alternative test.
Journal of Financial Markets 1:203–19.
Engle, R. 2004. Risk and volatility: Econometric models and financial practice. American Economic
Review 94:405–20.
Frazzini, A., R. Israel, and T. J. Moskowitz. 2012. Trading costs of asset pricing anomalies.
Fama-Miller Working Paper, Chicago Booth Research Paper .
39
Glosten, L. R., and L. E. Harris. 1988. Estimating the components of the bid/ask spread. Journal
of Financial Economics 21:123–42.
Glosten, L. R., and P. R. Milgrom. 1985. Bid, ask and transaction prices in a specialist market
with heterogeneously informed traders. Journal of financial economics 14:71–100.
Goldstein, I., C. S. Spatt, and M. Ye. 2021. Big data in finance. The Review of Financial Studies
34:3213–25.
Grossman, S. J., and M. H. Miller. 1988. Liquidity and market structure. Journal of Finance
43:617–33.
Gu, S., B. Kelly, and D. Xiu. 2020. Empirical asset pricing via machine learning. The Review of
Financial Studies 33:2223–73.
———. 2016. Dynamic portfolio choice with frictions. Journal of Economic Theory 165:487–516.
Harvey, C. R., Y. Liu, and H. Zhu. 2016. . . . and the cross-section of expected returns. The Review
of Financial Studies 29:5–68.
He, K., X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In
Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 770–8.
Ho, T. S., and H. R. Stoll. 1983. The dynamics of dealer markets under competition. Journal of
Finance 38:1053–74.
Hochreiter, S., and J. Schmidhuber. 1997. Long short-term memory. Neural Computation 9:1735–
80.
Jensen, T. I., B. Kelly, and L. H. Pedersen. 2022. Is there a replication crisis in finance? Journal
of Finance .
Jensen, T. I., B. T. Kelly, S. Malamud, and L. H. Pedersen. 2022. Machine learning and the
implementable efficient frontier. doi:10.2139/ssrn.4187217.
Kelly, B., S. Malamud, and K. Zhou. 2024. The virtue of complexity in return prediction. Journal
of Finance 79:459–503.
Kelly, B., and D. Xiu. 2023. Financial machine learning. Foundations and Trends in Finance
13:205–363.
Korajczyk, R. A., and R. Sadka. 2004. Are momentum profits robust to trading costs? Journal of
Finance 59:1039–82.
Kyle, A. S. 1985. Continuous auctions and insider trading. Econometrica 53:1315–35. Publisher:
[Wiley, Econometric Society].
40
McLean, R. D., and J. Pontiff. 2016. Does academic research destroy stock return predictability?
Journal of Finance 71:5–32.
Novy-Marx, R., and M. Velikov. 2016. A taxonomy of anomalies and their trading costs. The
Review of Financial Studies 29:104–47.
Shleifer, A., and R. W. Vishny. 1997. The limits of arbitrage. Journal of Finance 52:35–55.
Stoll, H. R. 1978. The supply of dealer services in securities markets. Journal of Finance 33:1133–51.
41
Internet Appendix
A Technical details
The nn architecture consists of three fully-connected hidden layers with 32, 16, and 8 neurons,
respectively.
The rnn architecture is similar to that of the nn. The first (bottom) hidden layer in the 3-
layer network is upgraded to an lstm layer with 32 hidden states and cell states, respectively. The
remaining two layers are unchanged: fully connected with 16 and 8 neurons, respectively.
The formulas for the number of parameters in nn and rnn, as reported in Table 1 Panel C, are
stated below. For nn (three fully-connected hidden layers with 32-16-8 neurons), the formula is
has 32 hidden states and 32 cell states with four gates, changing the formula to (# of predictors +
In training the rnn, we implement the “many-to-one” type data pipeline, where the model
recursively processes a sequence of 10 inputs, Xi,t−9 , . . . , Xi,t , and produces a single output ηbi,t to
calculate the training loss at each data point {i, t}. (When increasing the sequence length from
10 to 50, the results had minimum improvements but required much larger GPU memory and
longer training time.) For data points at the beginning of a stock’s observed period where lagged
predictors (e.g., Xi,t−9 ) are not available, we fill in with zero vectors.
42
In training both nn and rnn models, we use the Adam optimizer, with default learning rates and
other parameters. The batch size is 1024. The ηe prediction models are trained with 50 epochs. For
the sake of clear benchmarking, we do not adopt early stopping, weights dropout, or hyperparameter
tuning with cross-validation, though these techniques could further boost the prediction accuracy.
A. nnall B. rnnall
22 In-Sample In-Sample
Out-of-Sample 25 Out-of-Sample
21 24
20 23
19 22
R2
R2
18 21
20
17
19
16
0 10 20 30 40 50 0 10 20 30 40 50
Epoch Epoch
The learning curves show the gap between the IS and OOS R2 is relatively small, and does not
widen with continued training. This indicates limited in-sample overfitting at this neural network
configuration. The rnn learning curves show slightly more severe overfitting, though the OOS
learning curve is still relatively stable as training continues. The learning curves display fluctuation
due to the randomness of the stochastic gradient descent, though the extent to which is not severe.
We also note the exact results depend on the inherent randomness of the training program. We find
the quantitative results are insensitive to random seeds and report the average of five independent
Table A.1 summarizes computational costs in terms of the training times and memory usage
for nn and rnn with all predictors for a single random seed. These experiments were conducted on
a system equipped with an Nvidia A100 GPU with 40 GB of GPU memory, an AMD EPYC 7713
64-Core Processor @ 1.80GHz with 128 cores, and 1.0TB of RAM, running Ubuntu 20.04.4 LTS.
43
Table A.1: Training time and memory usage
Training time (hours) CPU memory usage (GB) GPU memory usage (GB)
The track error penalty term in the portfolio optimization objective function can be economically
founded. We show the quadratic tracking error penalty can be derived from a mean-variance utility
function. This analysis connects the target positions x∗i to the before-cost mean-variance efficient
portfolio weight, which increases in the asset’s return expectations as well as the total portfolio size
(AUM). It also provides a microfoundation of the hyperparameter µ and the quadratic penalty: as
the position deviates away from the target, the mean-variance loss increases in a quadratic fashion.
γ
Assume the agent has a mean-variance utility function, U = EA′ − 2A VarA
′ − T Cost, where
A′ is the before-cost investment outcome, A is the initial wealth (AUM), and γ is the risk aversion
coefficient, T Cost is the term for the objective of minimizing transaction costs (not the focus
here as we are justifying the tracking error part). A′ is the outcome of the portfolio strategy:
P
A′ = A(1 + rf ) + i xi ri , where xi is the dollar position in risky asset i with excess return ri .
Assume Eri = mi , Varri = σ 2 , and zero covariances. The agent’s portfolio optimization problem
X γ X 2 2
U = A(1 + rf ) + xi mi − xi σ − T Cost (16)
2A
i i
X γσ 2
2
= − x + mi xi + A(1 + rf ) − T Cost (17)
2A i
i
γσ 2 X 2 2A
=− xi − mi xi + A(1 + rf ) − T Cost (18)
2A γσ 2
i
44
2 !
γσ 2 X A A X 2
=− xi − mi + mi + A(1 + rf ) − T Cost (19)
2A γσ 2 2γσ 2
i i
The first term matches the tracking error term modeled in Eq. 3. The second term is the before
TCost utility at the zero tracking error portfolio (x = x∗ ). This term is constant of the x choice,
Comparing the first term with the tracking error modeled in Eq. 3, we see the target portfolio
is x∗i = A
m.
γσ 2 i
As expected, the target positions are the result of Markowitz mean-variance opti-
mization. They are proportional to the return expectations in the cross section. They scale linearly
with the AUM (A) and inversely with the risk aversion coefficient and volatility. Additionally, the
γσ 2
overall tracking error penalizing coefficient µ = A , which is decreasing in AUM. The rationale is
that the quadratic penalty stems from the quadratic risk change as the position deviates away from
the target, and that the absolute risk aversion coefficient decreases with the wealth level. Although
the main analysis takes µ as a hyperparameter ignoring its microfoundation, we still observe the
negative relationship between the tuned µ and the economically relevant AUM (e.g., in Section
5.3).
lossecon
vv (evi,t , v(Xi,t ); µ). In this appendix, we provide the equivalent representation as a prediction
Define lossecon
zz (ez , z; µ) := lossecon (s−1 (e
z ; µ), z; µ), where s−1 ( · ; µ) is inverse of s( · ; µ) function.
Under this definition, the economic task can be viewed as the problem of looking for a function
z(·) that maps X into z to minimize the training sample average loss:
X
min lossecon
zz (ezi,t , z(Xi,t ); µ) (20)
z(·)
i,t∈train
45
According to the definition, the analytical expression of lossecon
zz (ez , z; µ) is
µ
lossecon
zz (ez , z; µ) = (z − ze)2 + µ(1 − ze). (21)
ze
µ
In this expression, the loss can be seen as the squared z prediction error weighted by ze . The
µ e= µ
ze = =⇒ λ −µ (22)
e
µ+λ ze
We show lossecon
zz (ez , z; µ) is still meaningfully different from the standard squared error loss
Without loss of generality, any loss function F (p, q) can be normalized as F̄ (p, q) := F (p, q) −
F (p, p), such that that the function acquires the convenient property that F̄ (p, p) = 0, and that
the solution to the optimization problem minq∈σ{X } E [F (p, q)] does not change. Therefore, in the
econ
v , v; µ) := lossecon
following propositions, we normalize accordingly and consider lossvv (e vv (ev , v; µ) −
econ
lossecon
vv (e z , z; µ) := lossecon
v , ve; µ) and losszz (e zz (ez , z; µ) − lossecon
zz (ez , ze; µ).
The definition of the Bregman loss function is (Banerjee, Guo, and Wang, 2005):
46
Definition 1. Let ϕ : Rd → R be a strictly convex differentiable function, then, the Bregman loss
We consider the simpler case where p, q are scalars. In this case, a Bregman function has
the property that its partial second derivative in the first argument is independent of the second
argument.
∂ 2 Dϕ (p, q)
= ϕ′′ (p) (24)
∂p2
econ
Proposition 2. Function losszz (e
z , z; µ) is not in the Bregman class, for all µ.
econ
Proof. We verify that losszz (e
z , z; µ) violates the property in Eq. 24.
econ µ
losszz (e
z , z; µ) = (z − ze)2 (25)
ze
econ
∂ 2 losszz (e z , z; µ) 2µz 2
2
= 3 (26)
∂e
z ze
econ
Proposition 3. Function lossvv (e
v , v; µ) is not in the Bregman class, for all µ.
Proof. Given the property in Eq. 24, a Bregman loss function must be unbounded as p → +∞.
This is because for any fixed q, when p > q, Dϕ (p, q) is increasing and convex in p.
as ve → +∞.
econ 0.2 exp (−ev ) + µ (exp (−v + log 0.2 − log µ))2
lossvv (e
v , v; µ) :=
(1 + exp (−v + log 0.2 − log µ))2
0.2 exp (−e v + log 0.2 − log µ))2
v ) + µ (exp (−e
−
(1 + exp (−e v + log 0.2 − log µ))2
47
(0.2µ)2 exp (−e v ) − exp (v))2
v ) (exp (e
= (27)
v ) + 0.2) (µ exp (v) + 0.2)2
(µ exp (e
The following is a visualization of the economic loss function in addition to the one in Figure 3
Panel B. The vertical axis is changed to the linear scale from the log scale. For large ve (12, 16, 20,
in green, red, purple), the curves are indistinguishable from a flat line because the blue curve is at
a much greater magnitude, which is for the loss of overestimating low actual volume (e
v = 4).
lossecon
vv (ev , v; µ)
0.0035 minimum point
greater µ
0.0030
0.0025
0.0020
0.0015
0.0010
0.0005
0.0000
0 4 8 12 16 20 24
v
Note: same as Figure 3 Panel B but with the vertical axis in linear scale.
The following proposition formally states the asymmetric property of the loss function for
over/under-estimating volume.
48
Proposition 4. Consider two symmetrical cases with low and high liquidity ve1 and ve2 such that
ze1 = 1 − ze2 < 0.5. Suppose one makes an overestimation in the low-volume case vb1 = ve1 + ε,
comparing with an equal amount of underestimation in the high-volume case vb2 = ve2 − ε, the
additional loss incurred in the first case is greater than the second:
lossecon
vv (ev1 , vb1 ; µ) − lossecon
vv (ev1 , ve1 ; µ) > lossecon
vv (ev2 , vb2 ; µ) − lossecon
vv (ev2 , ve2 ; µ), ∀µ > 0. (29)
Proof. We first show that, given ze2 = 1 − ze1 , as well as vb1 − ve1 = ve2 − vb2 = ε, we have zb2 = 1 − zb1 ,
1 1
=1−
1 + exp(−e
v2 + log 0.2 − log µ) 1 + exp(−e
v1 + log 0.2 − log µ)
Then we have:
1 1
zb1 + zb2 = +
1 + exp(−e
v1 − ε + log 0.2 − log µ) 1 + exp(−ev2 + ε + log 0.2 − log µ)
1 1
= + =1
1 + exp(−e
v1 − ε + log 0.2 − log µ) 1 + exp(e
v1 + ε − log 0.2 + log µ)
1 1
The last equation comes from the fact that 1+exp(x) + 1+exp(−x) = 1. Then we make use of the loss
µ
lossecon
zz (ez , z; µ) = (z − ze)2 + µ(1 − ze) (31)
ze
µ µ
Since ze1 − zb1 = zb2 − ze2 and ze1 < ze2 , that is ze1 > ze2 , we have the required result:
lossecon
vv (ev1 , vb1 ; µ) − lossecon
vv (ev1 , ve1 ; µ) > lossecon
vv (ev2 , vb2 ; µ) − lossecon
vv (ev2 , ve2 ; µ), ∀µ > 0. (32)
49
C Additional empirical results
C.1 Prediction results in firm size groups and “mixture of experts” forecasts
Table C.2 Panel A provides additional assessments of prediction accuracy by evaluating volume
forecasts in different size groups. We use the five groups from the JKP data sorted on the firms’
market capitalization.30
Table C.2: Prediction accuracy (R2 in %) in different size groups and “mixture of experts”
A: pooled training evaluated in size groups and jointly (same models as in Table 1)
olsall 15.99 13.32 12.60 20.90 25.49 26.16
nnall 18.45 15.80 14.86 23.71 27.76 29.12
rnnall 19.86 16.63 16.14 26.00 30.50 32.02
B: size group training evaluated in size groups and jointly (mixture of experts)
ols+moeall 16.34 13.68 12.73 21.43 25.93 27.47
nn+moeall 17.78 15.29 14.43 22.69 26.57 27.71
rnn+moeall 18.26 15.24 14.71 24.76 29.02 30.99
Panel A evaluates the benchmark models (pooled training) in the five size groups, respectively, in
the OOS period. Each model uses “all” 175 predictors. Column “jointly” repeats Table 1, Panel
A, last column. Panel B trains “expert” models for each size group separately and evaluates them
in their corresponding size groups in the OOS period. Column “jointly” evaluates the mixture of
experts (moe) model, which predicts with the corresponding expert model trained on the same size
group for each OOS data point. Each R2 value is the average of five runs, same as Table 1.
The prediction accuracy increases as firm size increases, regardless of the prediction method.
The R2 ’s evaluated in the mega firms are roughly twice those of the nano firms. As explained in
the main text, smaller firms have a greater magnitude of unexpected trading volume shocks that
are hardest to predict. This result makes sense since small firms are volatile and have low trading
30
The five groups are defined according to the market capitalization breakpoints of NYSE stock percentiles: mega
stocks, greater than the 80th percentile; large, 50–80; small, 20–50; micro, 1-20; and nano, below the 1st percentile.
50
volume, hence unexpected events that give rise to volume spikes are more likely for these firms.
This finding also indicates that in addition to small firms being less liquid on average, their liquidity
is also less predictable and more volatile. Hence, tiny firms are not only costly to trade in general,
but their costs are less predictable. These results are intuitive and suggest that our prediction
models are capturing true variation in volume and not simply noise.
We also examine whether firms of different size groups should be modeled differently. We train a
model on each size group separately to attempt to better capture the heterogeneity across the firm
size dimension rather than pooling all firms in the same model. We implement a simple mixture
of experts (moe) method, where each size group is trained separately to form “expert” models and
The moe improves performance of the ols method. Comparing the first lines in Panels A and
B, linear models catered to different size groups are more accurate than the pooled ols. For nn
and rnn, however, the sample size reduction outweighs the potential benefits of separate training,
making the mixture of expert models less accurate (either conditioning on size groups or jointly).
Since the potential non-linear effects of firm size are already allowed in the neural networks, forcing
size groups into different models is less effective. For this reason, we stick with pooled training
samples.
We provide results on the trading experiments that implement the factor zoo portfolios in addition
to Subsection 6.3.
Figure C.3 has the same vertical axis as Figure 7, which is the improvement in after-cost mean
excess return from ma5 to [Link] . The horizontal axis changes to the mean excess return
achieved with the ma5 method, i.e., the baseline level of the improvement. The plot shows the gain
is distributed around a positive center uncorrelated with the baseline. That is, a better volume
prediction is uniformly effective, and the improvement is not concentrated on factors that have
Figure C.4 is the Sharpe ratio version of Figure 7 by showing the gain in Sharpe ratio instead
51
of the mean return. The plot shows a similar pattern to the one reported in the main text. The
gains in Sharpe ratios are larger for those factors with higher turnover, reaching around 0.3 to 0.4
per year.
1.0
Mean return gain, [Link] vs. ma5 (%, annualized)
1.0 iskew_ff3_21d
zero_trades_21d Short-Term Reversal
rskew_21d
iskew_hxz4_21d
rmax5_rvol_21d
iskew_capm_21d
turnover_var_126d
dolvol_var_126d
lti_gr1a
0.8
seas_1_1an
0.8
inv_gr1a
ret_3_1
coskew_21d rmax1_21d ivol_ff3_21d
taccruals_ni ivol_hxz4_21d
seas_2_5an
rvol_21d
0.6 cash_at aliq_mat
niq_be_chg1
turnover_126d
zero_trades_126d
rmax5_21d
ivol_capm_21d 0.6
age niq_at_chg1taccruals_at
resff3_6_1
at_be
dsale_dsga
z_score oaccruals_at
ret_12_7
beta_dimson_21d
oaccruals_ni
op_at
Accruals Low Risk Profitability
bidaskhl_21d ni_ar1
seas_16_20an ope_bel1 mispricing_perf
op_atl1 ebit_sale qmj_prof eqnetis_at
seas_11_15an
netdebt_me
qmj_growth
dsale_drec qmj_safety
nfna_gr1abetadown_252d cop_at
prc_highprc_252d
ebit_bev
ivol_capm_252d
niq_at netis_at Profit Growth
Low Leverage Seasonality
seas_6_10an beta_60m
ocf_at_chg1
ocfq_saleq_std f_score
qmj resff3_12_1 cowc_gr1a
be_me niq_be
cop_atl1 zero_trades_252dni_be
ni_me
ocf_at
at_me Momentum
0.4 seas_16_20na
dgp_dsale
sale_emp_gr1
earnings_variability capx_gr2 fnl_gr1a
chcsho_12m
eqnpo_12m o_score
bev_mev
mispricing_mgmt
ope_be
dbnetis_at eqpo_me
eqnpo_me
ebitda_mev 0.4 Quality Value
dsale_dinv
debt_gr3 niq_su
opex_at
gp_atl1
saleq_gr1col_gr1acoa_gr1a
ret_12_1
eq_dur
sale_gr1 at_turnover debt_me fcf_me ocf_me Debt Issuance
seas_11_15na
noa_at
saleq_su
capex_abn ival_me tangibility sale_bev
kz_index
ppeinv_gr1a
emp_gr1
noa_gr1a
gp_at ret_6_1
be_gr1a
Investment
inv_gr1 dolvol_126d lnoa_gr1a
ni_ivol prc
pi_nix
capx_gr3
rd_sale
ami_126d market_equity
sale_gr3
seas_2_5na sti_gr1a
capx_gr1
ret_9_1 at_gr1
aliq_at
Size
0.2
seas_6_10na
rd5_at
ret_60_12
seas_1_1na
rd_me nncoa_gr1a
sale_me 0.2
betabab_1260d
ni_inc8q tax_gr1a ncoa_gr1a
corr_1260d
ncol_gr1a
0.0 div12m_me 0.0
15 10 5 0 5 10 15 15 10 5 0 5 10 15
Mean return implemented with ma5 (%, annualized) Mean return implemented with ma5 (%, annualized)
The same plot as Figure 7, but changing the x-axis to mean return achieved with the ma5 method,
i.e., the baseline of the gain.
0.4 0.4
Sharpe ratio gain, [Link] vs. ma5 (annualized)
iskew_capm_21d
rskew_21d
Short-Term Reversal
0.3 0.3
inv_gr1a
nfna_gr1a seas_1_1an
rmax5_rvol_21d
taccruals_ni
coskew_21d
lti_gr1a dsale_drec ret_1_0
0.2 cowc_gr1a
niq_be_chg1
resff3_6_1 0.2
niq_su
niq_at_chg1
capx_gr2 fnl_gr1a
oaccruals_ni turnover_var_126d
dolvol_var_126d
taccruals_at seas_2_5an Accruals
cop_ataliq_mat
sale_bev
ocfq_saleq_std
z_score
kz_index
debt_gr3
cop_atl1earnings_variability
be_gr1a
lnoa_gr1a resff3_12_1
ret_3_1 Profit Growth
oaccruals_at
dsale_dsga
ni_ar1
capx_gr1
sale_emp_gr1
ppeinv_gr1a ocf_at_chg1 dbnetis_at
ret_12_7
Quality Debt Issuance Seasonality
0.1
op_at
op_atl1
opex_atdgp_dsale
qmj
qmj_growth
qmj_safety
at_turnover
sti_gr1a
f_score
coa_gr1a
mispricing_mgmt
col_gr1a
noa_gr1a
saleq_gr1
sale_gr1mispricing_perf
emp_gr1
zero_trades_21d
beta_dimson_21d 0.1 Investment Momentum
gp_atl1 qmj_prof
beta_60m
ope_bel1
tangibility
gp_at
at_be
ebit_bev
netdebt_me
noa_at
eq_dur
inv_gr1
capx_gr3
ocf_atival_me
ebit_sale at_gr1
be_me
saleq_su
netis_at
dsale_dinv
eqnetis_at
capex_abnniq_at
ivol_ff3_21d
ivol_hxz4_21d
ivol_capm_21d rmax5_21d
rvol_21d
rmax1_21d seas_16_20an
seas_11_15an
seas_6_10an Profitability
Low Leverage Low Risk
niq_be
age cash_atturnover_126d prc_highprc_252d ret_6_1
ret_12_1 ret_9_1
rd_sale o_score
ni_be
ope_beeqpo_me
eqnpo_me
rd_me
zero_trades_126d
bev_mev
ivol_capm_252d
ni_me chcsho_12m
debt_me
fcf_me
sale_gr3 nncoa_gr1a
eqnpo_12m
zero_trades_252d
ebitda_mev
dolvol_126d
ocf_me
at_me
seas_16_20na
betadown_252d
tax_gr1a seas_2_5na
seas_11_15na
Value
aliq_at
pi_nix
rd5_at
sale_me
market_equity
ncoa_gr1a
ni_inc8q seas_1_1na bidaskhl_21d Size
ami_126d
ni_ivolbetabab_1260d seas_6_10na
prc ret_60_12
0.0 div12m_me
corr_1260d
ncol_gr1a
0.0
0 1 2 3 4 5 6 0 1 2 3 4 5 6
Turnover (annualized) Turnover (annualized)
The same plot as Figure 7, but showing the gain in Sharpe ratio instead of mean return.
52
The economic loss function penalizes inaccuracies in volume forecasting asymmetrically, with overestimating volume being particularly costly, especially when actual liquidity is low . In contrast, the statistical loss function (least squares) is symmetric and penalizes deviations without considering the economic implications of these errors . This leads to the economic model being more conservative to avoid costly overestimations .
Trading volume alpha translates trading volume predictability into portfolio alpha by incorporating volume prediction signals into after-cost portfolio modeling, thus enhancing the economic value of volume prediction . This results in after-cost returns or Sharpe ratio improvements, effectively enhancing portfolio performance beyond traditional volume measures .
Incorporating trading volume predictions into portfolio management can significantly enhance after-cost performance by improving trade timing, selection, and sizing, thus generating volume-based alpha . This approach allows for more precise liquidity management and improved performance across asset pricing factors, especially those with high turnover .
Transfer learning enhances machine learning models by pre-training on a statistical objective and then fine-tuning for specific economic tasks . This approach leverages the general predictive power of pre-trained models and adapts them specifically for trading scenarios by adjusting for economic losses, improving both statistical and economic performance of volume prediction models .
The asymmetry in economic loss implies that machine learning models optimized for economic loss rather than statistical accuracy are better suited for portfolio optimization tasks . These models need to account for the potential of liquidity dry-ups conservatively, often requiring a compromise in least squares accuracy for minimizing economic errors, which is particularly vital in trading environments sensitive to liquidity changes .
Economic learning focuses directly on minimizing economic loss by selecting trading rates as a function of conditioning information, adjusting for the economic impact of forecast errors . In contrast, statistical learning aims to minimize statistical errors like mean squared error without considering economic implications, potentially optimizing for factors irrelevant to economic efficiency in trading .
Volume is the major determinant of liquidity and is used to proxy for trading costs or liquidity . The research utilizes this by integrating volume prediction into a portfolio theory problem to optimize portfolios net of trading costs . This approach highlights the economic value of volume prediction by maximizing the net-of-cost performance using a mean-variance utility function .
For smaller AUM, tracking error considerations generally dominate due to less significant trading costs, hence minimizing tracking error is prioritized . For larger AUM, trading costs are more significant due to price impact being a function of participation rate, leading to a preference for minimizing trading costs . Thus, the optimal tradeoff varies with portfolio size and influences the economic impact of volume prediction .
Portfolio size influences trading strategies as larger portfolios face higher trading costs due to increased price impact and participation rate . This necessitates more cautious trading to minimize costs, whereas smaller portfolios can prioritize minimizing tracking error over trading costs due to less impactful price effects . The result is disparate strategies and executions dependent on asset size .
A simple trading cost model is preferred because it provides a forecast of trading costs that any trader could use without the need for specifying trade size . The simplicity allows for broad applicability across different trade scenarios, aiming to integrate volume prediction into portfolio optimization to improve after-cost performance .