Find the historical distribution of retail flow.

See here for a similar issue.

It's not that hard to find historical Uniswap trade sizes via a Dune query. What's harder is determining which of those trades originated from retail users.

Some possible approaches:

- Ask Uniswap folks if they've seen research on this, and see if we can get data for orderflow that passes through the Uniswap interface. I can also ask MMs / DeFi traders.
- Run a query to find which orders passed through the Uniswap router contract. All of these are almost certainly uninformed.
- For each swap o, calculate the average execution price and compare it against some weighted arithmetic mean of the prices 1, 2, and 5 minutes later. If the markout is positive (i.e. swapper lost, pool won), count the swap as retail. This is basically saying "swapper loses => swapper is retail", but we're not saying "swapper wins => swapper is not retail", since retail wins 50% of the time (modulo fees). Perhaps you could then use the size distribution of the swaps that lost to make some claim about the swaps that won.
```
Pr(order is retail | order wins) 
    = (Pr(order wins | order is retail) * Pr(order is retail)) / Pr(order wins)
    = (0.5 * p) / ((1-p)*1 + p*0.5) # where p = proportion of orders that are retail, and the 1 is Pr(order wins | order is informed)
```
Now define s to be the fraction of all observed swaps that fall in the "swapper loses" scenario. If we assume that retail loses in half of the cases and wins in the other half, and that informed flow never loses, then every losing swap is retail and s = 0.5 * p, so you can estimate the p parameter as p = 2s. Then we'd get the following expression.
```
Pr(order is retail | order wins) 
    = (0.5 * (2s)) / ((1-2s) + (2s)*0.5)
    = s / (1-s)
```
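As a rough sketch of how the labeling and the s / (1-s) estimate could be computed from swap data: the `Swap` record, the `HORIZON_WEIGHTS` values, and both function names below are hypothetical, fees are ignored, and prices are assumed to be quoted as quote token per base token.

```python
from dataclasses import dataclass

# Hypothetical weights on the 1, 2 and 5 minute later prices (an assumption;
# the text only asks for "some weighted arithmetic mean").
HORIZON_WEIGHTS = (0.5, 0.3, 0.2)


@dataclass
class Swap:
    size: float          # notional size of the swap
    is_buy: bool         # True if the swapper bought the base token
    exec_price: float    # average execution price (quote per base)
    later_prices: tuple  # prices 1, 2 and 5 minutes after the swap


def swapper_lost(swap: Swap, weights=HORIZON_WEIGHTS) -> bool:
    """True when the pool's markout is positive, i.e. the swapper lost."""
    future = sum(w * p for w, p in zip(weights, swap.later_prices)) / sum(weights)
    # A buyer loses if the price falls afterwards; a seller loses if it rises.
    if swap.is_buy:
        swapper_pnl = future - swap.exec_price
    else:
        swapper_pnl = swap.exec_price - future
    return swapper_pnl < 0


def retail_share_of_winners(swaps) -> float:
    """Estimate Pr(order is retail | order wins) = s / (1 - s)."""
    s = sum(swapper_lost(x) for x in swaps) / len(swaps)
    if s > 0.5:
        # p = 2s would exceed 1, so the 50/50 assumption cannot hold.
        raise ValueError("s > 0.5 implies p = 2s > 1")
    return s / (1 - s)


swaps = [
    Swap(100.0, True, 10.0, (9.9, 9.8, 9.7)),       # buyer, price fell: lost
    Swap(5_000.0, True, 10.0, (10.1, 10.2, 10.3)),  # buyer, price rose: won
    Swap(250.0, False, 10.0, (9.9, 9.9, 9.8)),      # seller, price fell: won
    Swap(8_000.0, False, 10.0, (9.5, 9.6, 9.4)),    # seller, price fell: won
]
print(retail_share_of_winners(swaps))
```

Here one of the four swaps lost, so s = 0.25 and the estimate is 0.25 / 0.75, i.e. roughly a third of the winners would be treated as retail.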
We could then say that, among the "swapper wins" trades, we find the subset making up a fraction s / (1-s) of them whose sizes best fit the size distribution of the "swapper loses" orders, and treat that subset as retail.
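One possible reading of "best fits" is plain histogram matching, sketched below. The order-of-magnitude bucketing and the function names are assumptions, not something the approach prescribes. The sketch relies on the fact that a fraction s / (1-s) of the winners contains exactly as many swaps as there are losers, so each size bucket can simply aim for the losers' count in that bucket.

```python
import math
import random
from collections import defaultdict


def size_bucket(size: float) -> int:
    """Bucket positive sizes by order of magnitude (a hypothetical binning choice)."""
    return math.floor(math.log10(size))


def pick_retail_winners(loser_sizes, winner_sizes, seed=0):
    """Pick winners whose size histogram mirrors the losers' histogram."""
    rng = random.Random(seed)

    # Target count per bucket = number of losing swaps in that bucket.
    target = defaultdict(int)
    for size in loser_sizes:
        target[size_bucket(size)] += 1

    winners_by_bucket = defaultdict(list)
    for size in winner_sizes:
        winners_by_bucket[size_bucket(size)].append(size)

    picked = []
    for bucket, count in sorted(target.items()):
        candidates = winners_by_bucket[bucket]
        # If a bucket has too few winners, take them all; the shortfall hints
        # that the assumptions do not hold for that size range.
        picked.extend(rng.sample(candidates, min(count, len(candidates))))
    return picked


losers = [120.0, 300.0, 450.0, 2_000.0]
winners = [150.0, 800.0, 900.0, 950.0, 3_000.0, 40_000.0, 75_000.0]
retail_like = pick_retail_winners(losers, winners)
print(sorted(retail_like))
```

Which winners get picked within a bucket is arbitrary here (a seeded random draw), so the output is a size distribution estimate rather than a per-swap classification.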

I feel like overall this might be an egregiously criminal way of doing things. (1) We assume that informed flow never has positive markout on a 1-5 minute time horizon. (2) We assume that retail loses in half of the cases, but this is likely not true because retail executes on average at worse prices than informed flow (if the prices were good at the beginning of the block, arbs would frontrun; if prices are bad at the beginning of the block, arbs will let retail trade on the bad prices); this implies Pr(order wins | order is retail) < 0.5. (3) We assume that the distribution of "retail swapper loses" sizes is similarly shaped to the "retail swapper wins" sizes, which is likely not the case because large retail swaps are more likely to lose. Nevertheless, if we can overcome these issues then it could make for a pretty sick general purpose methodology.
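To get a feel for how much issue (2) matters, the derivation can be redone with q = Pr(order wins | order is retail) left as a free parameter instead of fixed at 0.5. This is only a sketch: the function name is hypothetical, and it still leans on assumption (1), that informed flow always wins.

```python
def posterior_retail_given_win(s: float, q: float = 0.5) -> float:
    """Pr(order is retail | order wins) for loser fraction s and q = Pr(wins | retail).

    Still assumes informed flow always wins, so every loser is retail:
        s = p * (1 - q)   =>   p = s / (1 - q)
        Pr(retail | wins) = q * p / ((1 - p) + q * p) = q * p / (1 - s)
    """
    if not 0 <= q < 1:
        raise ValueError("q must lie in [0, 1)")
    p = s / (1 - q)
    if not 0 <= p <= 1:
        raise ValueError("implied retail share p falls outside [0, 1]")
    return q * p / (1 - s)


# With q = 0.5 this reduces to s / (1 - s); lower q shrinks the estimate.
for q in (0.5, 0.45, 0.4):
    print(q, round(posterior_retail_given_win(0.3, q), 3))
```

For a fixed s, a lower q means a smaller implied retail share p and a smaller share of winners attributed to retail, so plugging in q = 0.5 when the true value is lower would overstate retail among the winning swaps.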
