app_example_biquad doesn't build

lilltroll77 commented 13 years ago

app_example_biquad doesn't build due to a missing file. (At least it is missing after clean)

lilltroll77 commented 13 years ago

the fir module uses 9+3FNOPs in the MAC loop, resulting in 12N+M1 instructions in total for a sample.

I claim that for I can do it in 5.5N+M2 for an even length filter with double buffering and (7 or 8)N +M3 for arbitrary lenght filter with a single circ buffer using XC.

Shall I push a proposal ?

henkmuller commented 13 years ago

Thanks - fixed this. The repo needed a bin directory to dump the response.csv file into.

Yes please push a proposal.

The absolute minimum for a FIR that does not fit in registers is going to be something like::

LDW LDW [FNOP] MACCS

The [FNOP] slot can be filled with a useful instruction if placed between the two loads. But there will be more instructions for increasing indices etc. Unrolling will help, except that a circular buffer requires wrap around checks, which you can get around by specialising the function as many times as you unroll. Not sure if that made sense....

lilltroll77 commented 13 years ago

Exactly, without unroll it can be done with,

LDW SUBI LDW SUBI MACS BRif!0

where 2*SUBI updates the two array index and make a fetch as well using the double data or double coef method that unfolds the circ buffer to a double length linear buffer (avoiding the need of wrap around check). Eg (5+1/s)N+M with s unrolls

Whit-out double length: Reminder is not good for checking the circ. buffer due to the computation time effecting other threads!? The alternative with an extra branch instead creates 2 branches in total, making it hard to fetch new instruction to fill the instruction buffer. Can it be done with 7N on the xs1 arch. using one circ buffer for filter-lengths other than 2^k?

henkmuller commented 13 years ago

Looks good. You can get rid of the second SUBI by presetting the pointer right I think?

    bu entryPoint
    .align 4
loop:
    MACCS  r5, r6, r0, r3
entrypoint:
    LDW    r0,r1[r2]
    SUBI   r2, r2, 1
    LDW    r3, r4[r2]
    BT     r2, loop
    MACCS  r5, r6, r0, r3

Making it a linear buffer gives an overhead of O(N) every N filters I presume - so that should be negligible?

henkmuller commented 13 years ago

Just a thought, if you want to do a FIR of up to something like 32 taps very quickly, you can use 8 threads and all registers. It would revert to a sequence:

IN
IN
IN
MACCS
MACCS
MACCS
MACCS
OUT
OUT
OUT

Where you IN a sample and a low/high accumulator, do 4 MACC operation on pre stored coefficients and 3 old values, and OUT the sample and the low/high accumulator. This comes down to 4 MACCS in 10 instructions, and all FIRs are pipelined.

lilltroll77 commented 13 years ago

I moved improvements of the actual code to another issue.

Maybe the Makefile in app_example_biquad should test if GCC is present and if not write an error message like: The GNU C Compiler is missing. Trying to build on a Windows machine will report:

make -f ../../sc_dsp_filters/build_biquad_coefficients/Makefile \ FILTER='-min -20 -max 20 -step 1 -bits 27 -low 250 -high 4000 -h src/coeffs.h -xc src/coeffs.xc -csv bin/response.csv' process_begin: CreateProcess(NULL, make -f ../../sc_dsp_filters/build_biquad_coefficients/Makefile "FILTER=-min -20 -max 20 -step 1 -bits 27 -low 250 -high 4000 -h src/coeffs.h -xc src/coeffs.xc -csv bin/response.csv", ...) failed.

A linux hacker will see that it begins with make not xmake, but that is not so easy to see for the common Windows user.

henkmuller commented 13 years ago

yes - I agree that it should - but not sure how to do that in a portable manner.

Another solution would be to use xcc and xsim - if only xsim could cope with arguments. But they could be read over stdin I guess. That would mean that you just use xcc/xmake/xsim, and no other toolchains required?

lilltroll77 commented 13 years ago

That would be nice!

Tested in Ubuntu, but the linker? is not happy with the .o files.

https://gist.github.com/919842

henkmuller commented 13 years ago

Not sure how it every compiled without the -lm option. That should resolve it. I will try and change it to not require gcc today.

lilltroll77 commented 13 years ago

Hi

No hurry, but I'm curious if it was possible to use xcc instead of gcc.

henkmuller commented 13 years ago

I had a crack at it, and it is possible, but I stumbled across a few interesting issues; the software was written with "memory is no limit" in mind.

I will have another crack a bit later this week; it won't be fast, since it will run on a simulator, and double precision floating point is implemented in software... I don't want to translate it to short fixed point at this stage (although it is something I have been wondering about - one of hte reasons to write lib_fixed_point is to see if coefficients can be computed on the fly rather than store them in big tables).

henkmuller commented 13 years ago

Made some more progress on this - all parameters are now passed in in a way that is xcc compatibly - not using argc/argv. Next I will set it up with a large memory and see whether it is prohibitively slow.

henkmuller commented 13 years ago

xcc/xsim is too slow. Porting it to java (which must be installed anyway) seems a better path.

lilltroll77 commented 13 years ago

Response to "You can get rid of the second SUBI by presetting the pointer right I think?"

Yes and here is an working and testes example (Q2.30) but based on channel input https://gist.github.com/1216283

It runs at almost 20 Mtaps/s on a 100MIPS thread for longer filters, e.g. little time outside tme MAC loop. So the burden is 5N+M for any lenght FIR filter

henkmuller commented 13 years ago

That is good - have you got a complete fork that has a fast FIR in it that I can merge?

lilltroll77 commented 13 years ago

Since program data and memory shares the same SRAM, it is stupid to use the double data, since a typicall arraysize >> codesize.

I'm writing a longer code that uses the same size of X and H if y=firfilt(h,x); Also it will only use channels, meaning that with help of a control token you can change the pointer to h, since I really need to increase the speed of my adaptive lossy NLMS filters. By using a an even number of fir taps, or a quad number of fir taps the burden can still be 5N+M or less. I'm emigrating my code to 11.2.2 so my old ugly trick with const struct doesn't work anymore. So I can try to optimize slow sections of XC code as well.

lilltroll77 commented 13 years ago

What about "splitting" the standard filter toolbox into an other new with only adative filters. e.g, just add a 8.24 fir asm version @ 5N+M to the existing one Typically users use standard IIR or FIR filters, but when everything get's adaptive you need more knowledge.

I'm afraid of mixing up the standard toolbox with "strange" things that the common user doesn't understand, since many threads should be running in parallel in adaptive mode. Also, adative IIR filters is typically made as lattice filters due to stability and limited cycles.

henkmuller commented 13 years ago

Yes - I prefer to have a choice of a number of different options (with an explanation as to which one to pick), rather than a "one size fits all" magic one that does everything based on how you call it. So - a separate module would be a good way to do it.

lilltroll77 commented 13 years ago

I will send an merge that is 100% comp. with the existing XC code. But I guess it isn't a heavy download on this toolbox yet, so I will get write some more adaptive filter code, and learn from my mistakes - before I push this code. (If you wish you can add in the todo list, meanwhile) that there is an double data asm prototype on gist that is working if the user understand the concept of fir_dd.

henkmuller commented 13 years ago

Sounds good - the biquad code is being used here and there, but the FIR code not yet so there is no pressure. I will put it on the todo list.

xcore / sc_dsp_filters

app_example_biquad doesn't build #2