I think it might make sense to separate how pixels are encoded (-> thermometer encoding) from the wiring of the network (-> random permutation).
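To make that separation concrete, here is a minimal sketch in which the two steps are independent functions (assuming an MNIST-style 8-bit grayscale image and WiSARD-style RAM nodes; all names and parameters are my own):

```python
import numpy as np

def thermometer_encode(pixels, levels=8):
    # Each pixel in [0, 255] becomes `levels` bits; bit i is set iff the
    # pixel clears the i-th evenly spaced threshold (unary/thermometer code).
    thresholds = (np.arange(1, levels + 1) * 256) // (levels + 1)
    bits = (pixels[:, None] >= thresholds[None, :]).astype(np.uint8)
    return bits.reshape(-1)  # bits of pixel p occupy indices p*levels .. p*levels+levels-1

def random_wiring(n_bits, tuple_size, rng):
    # Independent step: a random permutation of bit positions, chopped into
    # `tuple_size`-bit addresses, one row per RAM node.
    perm = rng.permutation(n_bits)
    return perm[: n_bits - n_bits % tuple_size].reshape(-1, tuple_size)

rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=28 * 28)          # stand-in for one image
bits = thermometer_encode(pixels, levels=8)
addresses = bits[random_wiring(bits.size, 16, rng)]  # each row feeds one RAM node
```

Swapping out either function (a different encoding, a different wiring) then leaves the other one untouched.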
A few thoughts:
Pixel encoding:
I think "feature engineering" is the right key word here. It was done heavily before DNNs became the go-to way to do image processing (they consume the (normalized) pixel values directly).
For example, the Sobel filter could be used to preprocess inputs and detect edges; a sketch follows below.
Perhaps there are also more global features, i.e., features that are not computed at the pixel level?
Note that any feature engineering step would also have to be proven by the circuit.
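As a sketch of such a preprocessing step (hedged: whether it helps a WNN is an open question, and as noted above the two convolutions would also have to be proven in the circuit):

```python
import numpy as np
from scipy.signal import convolve2d

# Sobel kernels for horizontal and vertical intensity gradients.
KX = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]])
KY = KX.T

def sobel_magnitude(image):
    # image: 2-D grayscale array. Returns the gradient magnitude rescaled
    # to [0, 255], so the result can feed straight into the thermometer encoder.
    gx = convolve2d(image, KX, mode="same", boundary="symm")
    gy = convolve2d(image, KY, mode="same", boundary="symm")
    mag = np.hypot(gx, gy)
    return (255 * mag / max(mag.max(), 1e-12)).astype(np.uint8)
```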
Wiring:
I guess there could be some kind of windowing, similar to the local receptive fields of CNNs?
But then I'd be worried about the WNN still only looking for exact matches. Perhaps the exact-match lookup should be replaced with something like: a discriminator is active if the Hamming distance to the closest training example of that class is small (but not necessarily 0). A sketch of both ideas follows below.
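A rough sketch of both ideas, reusing the bit layout from the encoding sketch above (the window geometry and the linear scan over stored addresses are purely illustrative; a real implementation would need something smarter):

```python
import numpy as np

def windowed_wiring(height, width, levels, window=4):
    # CNN-style alternative to a random permutation: each RAM node reads all
    # thermometer bits of one non-overlapping `window` x `window` patch.
    tuples = []
    for r in range(0, height - window + 1, window):
        for c in range(0, width - window + 1, window):
            idx = []
            for dr in range(window):
                for dc in range(window):
                    base = ((r + dr) * width + (c + dc)) * levels
                    idx.extend(range(base, base + levels))
            tuples.append(idx)
    return np.array(tuples)

def min_hamming(address, stored):
    # Relaxed RAM node: distance to the closest address seen in training,
    # instead of an exact-match lookup. O(n) scan, illustration only.
    return int(np.count_nonzero(stored != address, axis=1).min())

# A discriminator could then fire when the summed (or maximal) distance
# across its RAM nodes is below a threshold, rather than requiring exact
# matches everywhere.
```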
How else could one preprocess the data, i.e., what are alternatives to thermometer encoding / random permutation? They could improve accuracy; one candidate is sketched below.
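One alternative encoding that could be worth testing is Gray code (whether it actually improves accuracy I don't know, and `gray_encode` is a made-up name):

```python
import numpy as np

def gray_encode(pixels, bits=8):
    # Reflected binary (Gray) code: neighbouring intensities differ in
    # exactly one bit, and the encoding is denser than a thermometer code
    # (8 bits cover all 256 levels instead of one bit per threshold).
    gray = pixels ^ (pixels >> 1)
    shifts = np.arange(bits - 1, -1, -1)
    return ((gray[:, None] >> shifts[None, :]) & 1).astype(np.uint8).reshape(-1)
```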