Quant · 2026-06-11 · 11 min read

How to spot a fake crypto backtest track record — 7 red flags

A backtest is the easiest thing in finance to fake — not through fraud, but through self-deception. You don't need to lie about a single number. You can run an honest simulation on real data, report the output faithfully, and still produce a track record that is pure fantasy. The chart shows +900% with a smooth equity curve; the strategy loses money the moment it touches a real exchange. Below are the seven red flags that separate a tradeable edge from a statistical mirage, and the exact questions that expose a fabricated track record in under a minute.

Why backtests lie even when nobody is lying

A trading strategy has a handful of free parameters: which signal, how strong, how long to hold, when to exit. A historical dataset has a fixed amount of genuine, repeatable structure and a much larger amount of noise that will never repeat. When you search across parameters looking for the combination that produced the best historical return, you are — by construction — fitting to that noise. The more combinations you try, the better your best result looks, and the less of it is real.

This is why a backtest can be both honest and worthless. The numbers are real outputs of a real computation. They just describe a pattern that existed once, by chance, and will not exist again. Spotting a fake track record is mostly about detecting whether the number describes structure or noise.
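The effect is easy to reproduce. The sketch below is a toy illustration, not anyone's production code: every "strategy" is pure zero-mean noise, and the function name `best_sharpe_on_noise` and all parameters are made up for the example. The only thing that changes is how many configurations are searched.

```python
import random
import statistics

def sharpe(returns):
    """Per-period Sharpe: mean over sample standard deviation (not annualised)."""
    return statistics.mean(returns) / statistics.stdev(returns)

def best_sharpe_on_noise(n_configs, n_periods=250, seed=0):
    """Best Sharpe found among n_configs 'strategies' whose returns are pure noise."""
    rng = random.Random(seed)
    best = float("-inf")
    for _ in range(n_configs):
        # Zero-mean Gaussian returns: by construction there is no edge to find.
        returns = [rng.gauss(0.0, 0.01) for _ in range(n_periods)]
        best = max(best, sharpe(returns))
    return best

# The best result can only improve as more configurations are tried,
# even though none of them contains any real structure.
for n in (1, 10, 100):
    print(n, round(best_sharpe_on_noise(n), 3))
```

Because the seed is fixed, each larger search includes the smaller one, so the best score never gets worse as the search widens. Any improvement is selection, since there is no signal in the data to find.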

Red flag 1: No transaction costs (or unrealistically low ones)

This is the single most common killer. A strategy that rebalances frequently can show a beautiful gross return and be deeply negative after costs. On crypto perpetuals, the realistic round-trip cost on a liquid pair is roughly 0.1% in fees plus spread and slippage — call it 0.2–0.5% per round-trip on anything outside the top names, and worse on small caps.

The test: ask what friction was assumed per trade, then ask how the result changes at double that friction. A real edge degrades gracefully — it might go from +40% to +25%. A mirage collapses — it goes from +28% to −51% to −81% as you walk friction up. We have killed several internally promising signals on exactly this test: the gross edge per rebalance was smaller than the cost of executing it, which means only a market-maker who earns the spread could harvest it, not a taker who pays it.
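A friction sweep of this kind can be sketched in a few lines. This is a minimal version under simplifying assumptions: a flat cost per round-trip, one round-trip per trade, and compounding at full size. The function names and the example numbers (200 trades, 0.15% gross edge, 0.2% base cost) are hypothetical.

```python
def net_return(gross_per_trade, cost_per_round_trip):
    """Compound per-trade gross returns after subtracting a flat round-trip cost."""
    equity = 1.0
    for g in gross_per_trade:
        equity *= 1.0 + g - cost_per_round_trip
    return equity - 1.0

def friction_sweep(gross_per_trade, base_cost, multipliers=(0.0, 1.0, 2.0, 4.0)):
    """Total net return at each multiple of the assumed base cost."""
    return {m: net_return(gross_per_trade, base_cost * m) for m in multipliers}

# Hypothetical strategy: 200 trades, each with a 0.15% gross edge,
# against an assumed 0.2% round-trip cost.
trades = [0.0015] * 200
for mult, total in friction_sweep(trades, base_cost=0.002).items():
    print(f"{mult:.0f}x friction: {total:+.1%}")
```

In this example the gross edge per trade is smaller than the assumed cost, so the result flips from positive to negative as soon as any realistic friction is applied. That is the pattern described above.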

Red flag 2: A Sharpe ratio that is too good

The Sharpe ratio measures excess return per unit of volatility. Warren Buffett's lifetime Sharpe is around 0.8. The best quant funds in the world run sustained Sharpes of 2–3. If a crypto backtest claims a Sharpe of 8, 10, or 13, it is not a discovery; it is a warning light.

Extremely high Sharpes almost always come from one of two artifacts: very high rebalancing frequency (which inflates the annualised number before costs are subtracted), or look-ahead bias leaking future information into the entry. When you see a Sharpe above ~4 on a retail-accessible strategy, assume it is a measurement error until proven otherwise. The honest reaction to a Sharpe of 13 is not excitement — it is "what did I do wrong?"
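To see how rebalancing frequency inflates the number, here is a rough sketch of the usual annualisation, which multiplies the per-period Sharpe by the square root of the number of periods per year. It assumes independent returns, a zero risk-free rate, and a flat cost charged every period. The function name and the example series are illustrative only.

```python
import math
import statistics

def annualised_sharpe(period_returns, periods_per_year, cost_per_period=0.0):
    """Annualised Sharpe under the square-root-of-time rule (assumes i.i.d. returns,
    zero risk-free rate, and a flat cost deducted from every period)."""
    net = [r - cost_per_period for r in period_returns]
    return statistics.mean(net) / statistics.stdev(net) * math.sqrt(periods_per_year)

# Hypothetical hourly strategy with a tiny gross edge of 0.1% per rebalance.
hourly = [0.003, -0.001] * 50
hours_per_year = 24 * 365  # crypto trades around the clock

gross = annualised_sharpe(hourly, hours_per_year)
net = annualised_sharpe(hourly, hours_per_year, cost_per_period=0.002)
print(f"gross Sharpe: {gross:.1f}, net Sharpe: {net:.1f}")
```

With thousands of periods per year, even a tiny per-period edge annualises to an enormous gross Sharpe. Subtract a cost larger than that edge and the sign flips.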

Red flag 3: Look-ahead bias and contaminated fields

Look-ahead bias means the backtest used information that would not have been available at the moment of the simulated trade. It is insidious because it is usually invisible in the code. A dataset column labelled "max price over the next 4 hours" is fine for measuring outcomes — but if your entry logic accidentally references the window that includes the move you are trying to predict, every trade looks like a winner.

We caught exactly this in our own dump-recovery research: a precomputed "maximum favourable excursion" field was measured from a window that started before the entry point, so it was contaminated on 100% of events. The contaminated version of the strategy showed an 81% win rate; the honest re-test, refetching real candles from the entry point forward, showed 53%. Same strategy, same data, one bug, and a win rate inflated by 28 percentage points. If a track record cannot tell you precisely what was known at entry versus measured afterward, treat it as contaminated.
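The following is a hypothetical reconstruction of this class of bug, not the code from the research described above. It assumes a long entry, a list of candle highs, and an outcome window counted in candles. The function names and prices are invented for the illustration.

```python
def max_favourable_excursion(highs, entry_index, entry_price, horizon):
    """MFE for a long entry, using only candles strictly after the entry candle."""
    window = highs[entry_index + 1 : entry_index + 1 + horizon]
    if not window:
        return 0.0
    return max(window) / entry_price - 1.0

def contaminated_mfe(highs, entry_index, entry_price, horizon, lookback):
    """The bug: the window starts before the entry, so pre-entry highs leak in."""
    start = max(0, entry_index - lookback)
    window = highs[start : entry_index + 1 + horizon]
    return max(window) / entry_price - 1.0

# Hypothetical dump: price falls from 100 to 80, entry at 80, weak bounce after.
highs = [100, 98, 90, 80, 81, 82, 79]
print(round(max_favourable_excursion(highs, 3, 80, horizon=3), 4))   # 0.025
print(round(contaminated_mfe(highs, 3, 80, horizon=3, lookback=3), 4))  # 0.25
```

After a dump, the pre-entry highs are always far above the entry price, so a window that reaches backward makes nearly every event look like a large recovery.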

Red flag 4: Overlapping windows and a single lucky period

If a strategy is tested by sliding a window forward one hour at a time and counting each as an independent result, the "sample size" is fake. Adjacent windows share almost all their data, so a thousand overlapping observations might contain only a few dozen genuinely independent ones. A strategy that looks statistically bulletproof on overlapping windows often falls apart the moment you re-test it on strict, non-overlapping periods.
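A quick count makes the gap concrete. The sketch assumes hourly bars and a 24-hour evaluation window; both numbers and the helper names are chosen only for illustration.

```python
def window_starts(n_bars, window, step):
    """Start indices of every full window of length `window`, advancing by `step`."""
    return list(range(0, n_bars - window + 1, step))

def non_overlapping_starts(n_bars, window):
    """Start indices when each window begins where the previous one ended."""
    return window_starts(n_bars, window, step=window)

n_bars, window = 1023, 24  # hypothetical: 1023 hourly bars, 24-hour window
print(len(window_starts(n_bars, window, step=1)))   # 1000
print(len(non_overlapping_starts(n_bars, window)))  # 42
```

Sliding one hour at a time, adjacent 24-hour windows share 23 of their 24 bars. Counting all 1000 as independent overstates the sample by more than twentyfold in this example.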

The related trap is the single lucky period. A strategy that earned its entire return in one three-week window in one market regime has not been validated — it has been curve-fit to a moment. Ask to see the equity curve broken into thirds. If two of the three thirds are flat or negative and one is a rocket, the rocket is the regime, not the edge.
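Splitting an equity curve into thirds takes only a few lines. The sketch assumes the curve is a plain list of account values with at least four points. The helper name and the sample curve are hypothetical.

```python
def returns_by_thirds(equity):
    """Return of each of three consecutive segments of an equity curve."""
    n = len(equity) - 1  # number of periods between the first and last point
    cuts = [0, n // 3, 2 * n // 3, n]
    return [equity[b] / equity[a] - 1.0 for a, b in zip(cuts, cuts[1:])]

# Hypothetical curve: flat, then one rocket, then flat again.
equity = [100, 101, 99, 100, 100, 180, 200, 199, 201, 200]
print(returns_by_thirds(equity))  # [0.0, 1.0, 0.0]
```

The headline here would read "+100%", but all of it came from the middle third.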

Red flag 5: No out-of-sample or walk-forward test

The minimum credible validation is a train/test split: tune the strategy on one slice of history, then test it — untouched — on a slice it never saw. A strategy that is profitable in-sample and falls apart out-of-sample is overfit, full stop. A real edge survives the holdout.

Walk-forward is the stronger version: repeatedly tune on the past, test on the immediate future, roll forward. It mimics how the strategy would actually have been deployed. If a track record was produced by optimising over the entire history at once with no holdout, it tells you nothing about future performance — it only tells you the author found the best-fitting parameters for the past.
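The rolling split itself can be sketched as a small generator. This is one common variant (fixed-length training window, rolled forward by the test length), and the name `walk_forward_splits` and the sizes are assumptions for the example. Tuning and evaluation are left to the caller.

```python
def walk_forward_splits(n_bars, train_size, test_size):
    """Yield (train, test) index ranges; each test block directly follows its
    training block, and the whole pair rolls forward by test_size."""
    start = 0
    while start + train_size + test_size <= n_bars:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size

for train, test in walk_forward_splits(n_bars=100, train_size=60, test_size=10):
    # Tune parameters on `train` only, then score them once on `test`.
    print(f"train {train.start}-{train.stop - 1}, test {test.start}-{test.stop - 1}")
```

Every test block lies strictly after the data used to tune for it, and the test blocks never overlap one another, so stitching them together gives an equity curve made only of unseen data.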

Red flag 6: Survivorship and selection in the universe

If a backtest only trades coins that still exist today, it has quietly excluded every project that delisted, collapsed, or went to zero. In crypto, where the failure rate of tokens is enormous, this survivorship bias systematically flatters any long-biased strategy. The losers were deleted from the dataset before the test ran.

The mirror image is universe selection: testing on the 1,000 most liquid pairs makes friction assumptions plausible; testing on illiquid micro-caps makes returns look huge while hiding the fact that you could never have filled the orders. A credible track record states its universe explicitly and tests on instruments you could actually have traded at the size claimed.
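One way to avoid survivorship bias is to rebuild the universe as of each simulated date instead of using today's listings. The sketch below assumes you have listing and delisting dates for every symbol that ever traded. The symbols, dates, and function name are invented.

```python
from datetime import date

def tradable_universe(listings, as_of):
    """Symbols that were listed and not yet delisted on the given date."""
    return sorted(
        sym
        for sym, (listed, delisted) in listings.items()
        if listed <= as_of and (delisted is None or as_of < delisted)
    )

# Hypothetical listings: symbol -> (listing date, delisting date or None).
listings = {
    "AAA": (date(2020, 1, 1), None),
    "BBB": (date(2020, 1, 1), date(2022, 5, 1)),  # later went to zero
    "CCC": (date(2023, 1, 1), None),
}
print(tradable_universe(listings, date(2021, 6, 1)))  # ['AAA', 'BBB']
print(tradable_universe(listings, date(2024, 1, 1)))  # ['AAA', 'CCC']
```

A backtest that applied today's list to 2021 would have dropped BBB, the loser, and traded CCC before it existed.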

Red flag 7: Bootstrap confidence intervals are missing

A single point estimate such as "+113% over 90 days" tells you nothing about uncertainty. Resample the trade sequence thousands of times (a bootstrap) and you get a distribution: maybe the strategy is positive in 96% of resamples, maybe only 60%. A track record that quotes one number and no confidence interval is hiding its own fragility. The honest version says "+36% median, with a 96% bootstrap probability of a positive result and a wide interval" and lets you judge.
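A minimal bootstrap over per-trade returns might look like the sketch below. It resamples trades independently with replacement, which ignores any serial dependence between trades; a block bootstrap would be the usual alternative when that matters. The function names, the 90% interval, and the sample trades are assumptions for illustration.

```python
import random

def compound(returns):
    """Total return from compounding a sequence of per-trade returns."""
    equity = 1.0
    for r in returns:
        equity *= 1.0 + r
    return equity - 1.0

def bootstrap_total_return(trade_returns, n_resamples=2000, seed=0):
    """Resample trades with replacement and summarise the total-return distribution."""
    rng = random.Random(seed)
    n = len(trade_returns)
    totals = sorted(
        compound(rng.choices(trade_returns, k=n)) for _ in range(n_resamples)
    )
    return {
        "p_positive": sum(t > 0 for t in totals) / n_resamples,
        "median": totals[n_resamples // 2],
        # Empirical 5th and 95th percentiles of the resampled totals.
        "ci_90": (totals[n_resamples * 5 // 100], totals[n_resamples * 95 // 100 - 1]),
    }

# Hypothetical trade list: a few large winners carrying many small losers.
trades = [0.25, 0.18, -0.04, -0.03, -0.05, 0.02, -0.04, 0.30, -0.06, -0.03]
print(bootstrap_total_return(trades))
```

The width of `ci_90` and the value of `p_positive` say how much of the headline depends on a handful of trades happening to land in the sample.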

The 30-second test

You will not always be able to audit someone's code. But you can ask five questions, and the answers — or the discomfort they produce — tell you almost everything:

  1. What friction did you assume, and what happens at double it?
  2. What is the out-of-sample result, on data the strategy never saw during tuning?
  3. Show me the equity curve in thirds — is the return spread across the period or concentrated in one window?
  4. What was the universe, and does it include coins that have since delisted?
  5. What is the bootstrap probability that the result is positive, not just the headline number?

An operator with a real edge answers these immediately and without hedging, because they have already asked themselves the same questions and killed the versions that failed. An operator selling a mirage gets defensive, changes the subject, or quotes the headline number louder. The questions are the same whether you are evaluating a signal service, a fund, or your own research.

Why we publish the honest version

Most of our internal research dies. We have run hundreds of factor configurations and watched the best-looking candidates collapse under friction, look-ahead correction, and non-overlapping re-tests. That is not a failure of the process — it is the process. The handful of strategies that survive every one of these tests are the only ones we put real capital behind, and the only ones our members receive. Our prediction-market track record and our trading signals are reported with the same discipline we use to kill our own ideas: honest friction, out-of-sample validation, and confidence intervals rather than headline numbers.

If you want signals from a desk that treats its own backtests as guilty until proven innocent, and tells you when a result is uncertain instead of dressing it up, get access here. A track record is only worth what survives the seven red flags above.
