Methodology · 2026-06-07 · 10 min read

Sustainable crypto trading strategy vs a lucky streak — 6 statistical filters

A 12-month +180% return is impressive. It is also exactly the return profile of a strategy that got lucky for 12 months. The honest truth about most published trading track records is that you cannot tell them apart from random outcomes without running statistical filters that the marketer rarely shows you. Here are the six tests professional researchers apply, with the math behind each.

Why this matters

Two strategies enter the public market with the same 12-month return. One is genuinely repeatable; the other is a fortunate sample. The customer who funds the lucky strategy with serious capital tends to lose money over the next 12 months as its returns revert to the mean. The customer who funds the genuinely repeatable strategy compounds.

The visible track record is identical. The statistical fingerprint, when you look for it, is not. Distinguishing the two takes about 30 minutes of analysis, but most retail traders never do it.

Filter 1: Sample size relative to win rate

A 60%-win-rate strategy needs roughly 400 trades before you can statistically distinguish it from a 50%-win-rate strategy with reasonable confidence. A 70%-win-rate strategy needs about 100 trades. A 90%-win-rate strategy can be statistically identified in 25 trades.

The rule of thumb: 4 / (p − 0.5)² trades, where p is the claimed win rate. So:

55% claim → 1,600 trades needed
60% claim → 400 trades needed
70% claim → 100 trades needed
80% claim → 44 trades needed

If someone shows you a 65% win rate from 50 trades, the same rule calls for about 178 trades, so by this standard you cannot yet distinguish the result from a 50% win rate. The sample is too small. They might have a real edge, or they might be in a lucky window. The track record alone cannot tell you.

What to do: ask for the strategy's run-time and trade count. If the count is below the threshold for the claimed win rate, the result is inconclusive regardless of how impressive it looks.
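The rule of thumb for Filter 1 is easy to check in a few lines. The sketch below is a minimal illustration; the function names `required_trades` and `is_conclusive` are hypothetical, and the only formula used is the 4 / (p − 0.5)² rule stated above.

```python
def required_trades(win_rate: float) -> int:
    """Rule-of-thumb trade count needed to separate win_rate from a coin flip."""
    edge = win_rate - 0.5
    if edge <= 0:
        raise ValueError("win_rate must be above 0.5")
    # round() instead of ceil() so floating-point noise does not add a trade
    return round(4 / edge ** 2)


def is_conclusive(win_rate: float, n_trades: int) -> bool:
    """True if the sample meets the rule-of-thumb threshold for the claimed win rate."""
    return n_trades >= required_trades(win_rate)


for claim in (0.55, 0.60, 0.70, 0.80):
    print(f"{claim:.0%} claim -> {required_trades(claim)} trades")

# The 65% win rate from 50 trades discussed above
print(required_trades(0.65), is_conclusive(0.65, 50))
```

The threshold is deliberately conservative: a sample that fails it is inconclusive under this rule, not proven worthless.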

Filter 2: Sharpe ratio AFTER realistic friction

The Sharpe ratio is the cleanest summary metric: average excess return divided by the standard deviation of returns, so it accounts for risk, not just return. A Sharpe of 1.5+ on out-of-sample crypto trading is genuinely impressive. A Sharpe above 3.0 is rare and almost always indicates either fraud or unrealistic friction modeling.

The trap: most published Sharpe numbers ignore real friction. Slippage of 0.2%, fees of 0.08%, and funding of 0.05% per period add up to a real drag. A backtest claiming Sharpe 2.5 with zero friction often becomes Sharpe 1.0 with realistic friction — still good, but a different sales story.

What to do: ask "what friction assumption are you using?" If the answer is "we use realistic slippage" without specifics, the friction is probably underestimated. Push for the per-trade cost number.
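To see how much friction moves the number, you can recompute Sharpe on the same returns before and after costs. The sketch below is illustrative only: the trade returns and the trading frequency are made up, the risk-free rate is assumed to be zero, and charging slippage, fees, and funding once per trade as a flat sum is a simplifying assumption rather than a statement of how any particular venue bills.

```python
import math
import statistics

# Friction figures quoted in the text; summing them per trade is an assumption.
SLIPPAGE = 0.0020
FEES = 0.0008
FUNDING = 0.0005
COST_PER_TRADE = SLIPPAGE + FEES + FUNDING  # 0.33% per trade


def annualized_sharpe(returns, periods_per_year):
    """Mean / sample std of per-period returns, annualized; risk-free rate assumed 0."""
    mean = statistics.mean(returns)
    std = statistics.stdev(returns)
    return mean / std * math.sqrt(periods_per_year)


def net_of_friction(returns, cost=COST_PER_TRADE):
    """Subtract a flat cost from every trade return."""
    return [r - cost for r in returns]


# Hypothetical per-trade returns, for illustration only.
gross = [0.012, -0.008, 0.015, -0.005, 0.009, 0.011, -0.010, 0.014, 0.004, -0.003]
net = net_of_friction(gross)
trades_per_year = 250  # assumed trading frequency

print("gross Sharpe:", round(annualized_sharpe(gross, trades_per_year), 2))
print("net Sharpe:  ", round(annualized_sharpe(net, trades_per_year), 2))
```

A flat cost leaves the standard deviation unchanged and lowers the mean, so the net Sharpe is always below the gross one. The smaller the average gross return per trade, the larger the relative hit.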

Filter 3: Maximum drawdown vs expected drawdown

For any strategy with a given Sharpe ratio, statistical theory predicts an expected maximum drawdown. A Sharpe-1.5 strategy with 12 months of history should expect a max drawdown of roughly 25–35% over that period. The exact figure also depends on the strategy's volatility and the length of the record. If a published track record shows max drawdown of only 5%, either (a) the strategy is genuinely better than Sharpe-1.5 implies, (b) the period was unusually benign, or (c) the data has been smoothed/curated.

The diagnostic: estimate the max drawdown you would expect from the claimed Sharpe, volatility, and track-record length, then compare it against the claimed max DD. If the claimed DD is dramatically below expectation, dig deeper before trusting it.
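One way to get an expected drawdown without a closed-form formula is to simulate it. The sketch below is a rough Monte Carlo estimate under loudly stated assumptions: daily returns are independent and normally distributed, the market trades 365 days a year, and the risk-free rate is zero. The function names and the default path count are hypothetical choices for illustration.

```python
import math
import random


def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve, as a fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def expected_max_drawdown(sharpe, annual_vol, days=365, n_paths=300, seed=0):
    """Average max drawdown over simulated paths with the given annual Sharpe and volatility."""
    rng = random.Random(seed)
    daily_mean = sharpe * annual_vol / days      # annual mean = Sharpe * annual vol
    daily_std = annual_vol / math.sqrt(days)
    total = 0.0
    for _ in range(n_paths):
        path = [rng.gauss(daily_mean, daily_std) for _ in range(days)]
        total += max_drawdown(path)
    return total / n_paths


# Example: claimed Sharpe 1.5, assumed 40% annualized volatility, 12 months
print(round(expected_max_drawdown(1.5, 0.40), 3))
```

Real returns have fat tails and volatility clustering, so a simulation like this is more likely to understate typical drawdowns than to overstate them. A claimed max DD far below even this estimate deserves a closer look.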

Filter 4: Walk-forward stability

Take the full backtest period and split it into 5 equal windows. Compute the strategy return in each window separately. Five wins is a "stable" backtest. Three wins and two losses is "regime-sensitive." One enormous winning window and four marginal ones is "lucky-window-dependent."

A surprising number of published track records turn out to be "one big winning quarter, three flat quarters" when you decompose them. The overall return looks impressive; the per-window distribution reveals it's a single fortuitous event.

What to do: ask for the walk-forward decomposition. If the strategy is real, every (or almost every) window should be positive. If only one is positive, you are looking at a sampling artifact, not a strategy.
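If you have the per-period returns, the decomposition takes a few lines. The sketch below splits a return series into consecutive windows and compounds each one; `window_returns` and `count_positive` are hypothetical helper names, and the sample series is invented to show the "one window carries everything" pattern.

```python
def window_returns(returns, n_windows=5):
    """Compounded return of each consecutive window; the last window absorbs any remainder."""
    if len(returns) < n_windows:
        raise ValueError("need at least one return per window")
    size = len(returns) // n_windows
    results = []
    for i in range(n_windows):
        start = i * size
        end = (i + 1) * size if i < n_windows - 1 else len(returns)
        growth = 1.0
        for r in returns[start:end]:
            growth *= 1.0 + r
        results.append(growth - 1.0)
    return results


def count_positive(windows):
    """Number of windows with a positive compounded return."""
    return sum(1 for w in windows if w > 0)


# Hypothetical series: four marginal windows followed by one large one
sample = [0.001] * 40 + [0.02] * 10
windows = window_returns(sample)
print([round(w, 4) for w in windows], count_positive(windows))
```

Counting positive windows is not enough on its own. In the sample every window is positive, yet almost all of the total return comes from the last one, so compare the window sizes as well as their signs.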

Filter 5: Bootstrap confidence interval on the return

The statistical bootstrap takes the actual trade-by-trade returns and resamples them with replacement many times (typically 2,000–5,000) to estimate the distribution of returns the same edge could plausibly have produced. The 95% confidence interval answers the question: given this trade record, what is the plausible range of true returns?

Example: a backtest produces a +180% return. A bootstrap CI of [+45%, +320%] tells you the strategy is likely positive but the true expected return is much less certain than the point estimate suggests. A bootstrap CI with a lower bound of −25% tells you the strategy might not even be positive after accounting for variance.

The headline +180% is the same in both cases. The honest message is very different.
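A basic version of this test needs only the list of per-trade returns. The sketch below uses the simple percentile bootstrap, which is one common variant and not necessarily the one any given provider uses. It assumes trades are independent and compounds them in sequence; the function names and the sample trades are hypothetical.

```python
import random


def total_return(trade_returns):
    """Compounded return of a sequence of per-trade returns."""
    growth = 1.0
    for r in trade_returns:
        growth *= 1.0 + r
    return growth - 1.0


def bootstrap_ci(trade_returns, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the compounded total return."""
    rng = random.Random(seed)
    n = len(trade_returns)
    # Resample the trades with replacement and recompute the total each time
    totals = sorted(
        total_return(rng.choices(trade_returns, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    lo = totals[int(tail * n_resamples)]
    hi = totals[int((1.0 - tail) * n_resamples) - 1]
    return lo, hi


# Hypothetical trade record, for illustration only
trades = [0.05, -0.02, 0.03, 0.01, -0.01, 0.04, -0.03, 0.06, 0.02, -0.02]
print("point estimate:", round(total_return(trades), 4))
print("95% CI:", tuple(round(x, 4) for x in bootstrap_ci(trades)))
```

Because resampling treats trades as interchangeable, it ignores any ordering effects such as streaks or regime changes. Read the interval as a statement about sampling luck only.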

Filter 6: Strategy decay across out-of-sample period

A strategy that returned +20%/month in 2024 but only +5%/month in 2025 is decaying. Decay usually means the edge is being competed away: other traders have found the same inefficiency and are exploiting it. Eventually the edge approaches zero.

For a serious researcher, decay is the most important signal. A strategy with stable edge over 12 months is much more valuable than a strategy with crushing edge in month 1 and minimal edge in month 12 — even if the second strategy's total return is higher.

What to do: look at monthly returns over the strategy's full history and plot them. Is the trend declining, stable, or increasing? Each pattern has different implications for whether the future will look like the past.
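Alongside the plot, a fitted trend line gives a single number for the direction of travel. The sketch below computes the ordinary least-squares slope of monthly return against month index; `trend_slope` is a hypothetical helper, and a straight line is only a first approximation of how an edge decays.

```python
def trend_slope(monthly_returns):
    """Least-squares slope of monthly return against month index (change per month)."""
    n = len(monthly_returns)
    if n < 2:
        raise ValueError("need at least two months")
    x_mean = (n - 1) / 2
    y_mean = sum(monthly_returns) / n
    cov = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(monthly_returns))
    var = sum((i - x_mean) ** 2 for i in range(n))
    return cov / var


# Hypothetical monthly returns sliding from +20% to +5%
decaying = [0.20, 0.15, 0.10, 0.05]
print(round(trend_slope(decaying), 4))
```

A clearly negative slope over many months suggests decay. With only a handful of months the slope is noisy, so treat it as a prompt to look closer instead of a verdict.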

The combined diagnostic

A strategy that passes all six filters is rare. Most claimed strategies fail at least one. The ones that pass are worth serious attention; the ones that fail are usually not.

In our own research, every strategy we publish — NEVA, CATALYST, VENUE, PHOENIX — has been through this exact battery. We have killed more strategies than we have published. Some examples we have written about openly: BURST looked good for months before walk-forward decomposition revealed regime-dependent edge that was already disappearing. ORACLE mirror-trading had a beautiful backtest that collapsed under bootstrap CI. Both got killed at the research stage. Those that survive are what subscribers get.

Reading our own track record honestly

For traders evaluating our service: we publish equity curves for both backtest months and live months. The methodology is path-dependent (real intra-trade dynamics), friction is realistic (slippage + fees + funding), and the live data is what actually happened on a real Binance account. No backfilled trades, no smoothing, no removed losers.

The honest assessment of our own track record: NEVA is currently passing all six filters across a multi-month window, with the largest single-trade contribution we have published being a memecoin pump catch that produced asymmetric upside. Without that single trade, the strategy is still positive but the magnitude looks different. We disclose this openly in our public materials because it is the honest read.

You should apply the same six filters to any track record before funding a strategy with real money. If the provider cannot answer the six diagnostic questions in clear language, the track record is not as good as it looks.

Try the strategies we keep

Hedonist Pro delivers signals from four uncorrelated systematic strategies, all passing the six filters above. Trial is free. We publish honest equity curves on real money — backtest where labeled, live where labeled.
