Open lilltroll77 opened 13 years ago
the fir module uses 9+3FNOPs in the MAC loop, resulting in 12N+M1 instructions in total for a sample.
I claim that for I can do it in 5.5N+M2 for an even length filter with double buffering and (7 or 8)N +M3 for arbitrary lenght filter with a single circ buffer using XC.
Shall I push a proposal ?
Thanks - fixed this. The repo needed a bin directory to dump the response.csv file into.
Yes please push a proposal.
The absolute minimum for a FIR that does not fit in registers is going to be something like::
LDW LDW [FNOP] MACCS
The [FNOP] slot can be filled with a useful instruction if placed between the two loads. But there will be more instructions for increasing indices etc. Unrolling will help, except that a circular buffer requires wrap around checks, which you can get around by specialising the function as many times as you unroll. Not sure if that made sense....
Exactly, without unroll it can be done with,
LDW SUBI LDW SUBI MACS BRif!0
where 2*SUBI updates the two array index and make a fetch as well using the double data or double coef method that unfolds the circ buffer to a double length linear buffer (avoiding the need of wrap around check). Eg (5+1/s)N+M with s unrolls
Whit-out double length: Reminder is not good for checking the circ. buffer due to the computation time effecting other threads!? The alternative with an extra branch instead creates 2 branches in total, making it hard to fetch new instruction to fill the instruction buffer. Can it be done with 7N on the xs1 arch. using one circ buffer for filter-lengths other than 2^k?
Looks good. You can get rid of the second SUBI by presetting the pointer right I think?
bu entryPoint
.align 4
loop:
MACCS r5, r6, r0, r3
entrypoint:
LDW r0,r1[r2]
SUBI r2, r2, 1
LDW r3, r4[r2]
BT r2, loop
MACCS r5, r6, r0, r3
Making it a linear buffer gives an overhead of O(N) every N filters I presume - so that should be negligible?
Just a thought, if you want to do a FIR of up to something like 32 taps very quickly, you can use 8 threads and all registers. It would revert to a sequence:
IN
IN
IN
MACCS
MACCS
MACCS
MACCS
OUT
OUT
OUT
Where you IN a sample and a low/high accumulator, do 4 MACC operation on pre stored coefficients and 3 old values, and OUT the sample and the low/high accumulator. This comes down to 4 MACCS in 10 instructions, and all FIRs are pipelined.
I moved improvements of the actual code to another issue.
Maybe the Makefile in app_example_biquad should test if GCC is present and if not write an error message like: The GNU C Compiler is missing. Trying to build on a Windows machine will report:
make -f ../../sc_dsp_filters/build_biquad_coefficients/Makefile \ FILTER='-min -20 -max 20 -step 1 -bits 27 -low 250 -high 4000 -h src/coeffs.h -xc src/coeffs.xc -csv bin/response.csv' process_begin: CreateProcess(NULL, make -f ../../sc_dsp_filters/build_biquad_coefficients/Makefile "FILTER=-min -20 -max 20 -step 1 -bits 27 -low 250 -high 4000 -h src/coeffs.h -xc src/coeffs.xc -csv bin/response.csv", ...) failed.
A linux hacker will see that it begins with make not xmake, but that is not so easy to see for the common Windows user.
yes - I agree that it should - but not sure how to do that in a portable manner.
Another solution would be to use xcc and xsim - if only xsim could cope with arguments. But they could be read over stdin I guess. That would mean that you just use xcc/xmake/xsim, and no other toolchains required?
That would be nice!
Tested in Ubuntu, but the linker? is not happy with the .o files.
Not sure how it every compiled without the -lm option. That should resolve it. I will try and change it to not require gcc today.
Hi
No hurry, but I'm curious if it was possible to use xcc instead of gcc.
I had a crack at it, and it is possible, but I stumbled across a few interesting issues; the software was written with "memory is no limit" in mind.
I will have another crack a bit later this week; it won't be fast, since it will run on a simulator, and double precision floating point is implemented in software... I don't want to translate it to short fixed point at this stage (although it is something I have been wondering about - one of hte reasons to write lib_fixed_point is to see if coefficients can be computed on the fly rather than store them in big tables).
Made some more progress on this - all parameters are now passed in in a way that is xcc compatibly - not using argc/argv. Next I will set it up with a large memory and see whether it is prohibitively slow.
xcc/xsim is too slow. Porting it to java (which must be installed anyway) seems a better path.
Response to "You can get rid of the second SUBI by presetting the pointer right I think?"
Yes and here is an working and testes example (Q2.30) but based on channel input https://gist.github.com/1216283
It runs at almost 20 Mtaps/s on a 100MIPS thread for longer filters, e.g. little time outside tme MAC loop. So the burden is 5N+M for any lenght FIR filter
That is good - have you got a complete fork that has a fast FIR in it that I can merge?
Since program data and memory shares the same SRAM, it is stupid to use the double data, since a typicall arraysize >> codesize.
I'm writing a longer code that uses the same size of X and H if y=firfilt(h,x); Also it will only use channels, meaning that with help of a control token you can change the pointer to h, since I really need to increase the speed of my adaptive lossy NLMS filters. By using a an even number of fir taps, or a quad number of fir taps the burden can still be 5N+M or less. I'm emigrating my code to 11.2.2 so my old ugly trick with const struct doesn't work anymore. So I can try to optimize slow sections of XC code as well.
What about "splitting" the standard filter toolbox into an other new with only adative filters. e.g, just add a 8.24 fir asm version @ 5N+M to the existing one Typically users use standard IIR or FIR filters, but when everything get's adaptive you need more knowledge.
I'm afraid of mixing up the standard toolbox with "strange" things that the common user doesn't understand, since many threads should be running in parallel in adaptive mode. Also, adative IIR filters is typically made as lattice filters due to stability and limited cycles.
Yes - I prefer to have a choice of a number of different options (with an explanation as to which one to pick), rather than a "one size fits all" magic one that does everything based on how you call it. So - a separate module would be a good way to do it.
I will send an merge that is 100% comp. with the existing XC code. But I guess it isn't a heavy download on this toolbox yet, so I will get write some more adaptive filter code, and learn from my mistakes - before I push this code. (If you wish you can add in the todo list, meanwhile) that there is an double data asm prototype on gist that is working if the user understand the concept of fir_dd.
Sounds good - the biquad code is being used here and there, but the FIR code not yet so there is no pressure. I will put it on the todo list.
app_example_biquad doesn't build due to a missing file. (At least it is missing after clean)