xcore / sc_dsp_filters

Code to perform standard DSP functions, such as Biquads, FIRs, sample rate conversion
http://github.xcore.com/sc_dsp_filters/

Distributed FIR #9

Open lilltroll77 opened 13 years ago

lilltroll77 commented 13 years ago

Hi

There was a request on xcore about running 3000 taps at 96 kHz on XMOS. Since that could fit into an XMOS G4 with about 90% load plus overhead, I had to test a little.

Depending on how it is written, 1 sec of data takes 913-928 ms to calculate on 16 threads. I know a distributed FIR was on a to-do list somewhere. An example with a distributed FIR on 1 core and another example with several cores would be great, since people are afraid of distributed programming. (You need some ASM or C.)

Would it be of interest to put it in this dsp_filters repository, or in another one? This one "must" run with channels for the data, but I use unified memory on each core for the taps and the old data states.

I can test it on L1 and G4, but I haven't any L2 ... yet.

henkmuller commented 13 years ago

Hi Mikael,

Sorry for the delay - I was busy doing USB stuff. Very pleased to get a proof-of-concept WWW over USB working.

I think it would be good to add both a parallel and a simple version of the FIR - I guess they would be in one module since they may share some common code (to do an effective FIR on a single short piece of code).

A very simple way to parallelise it is to run 16 subsequent samples through 16 different threads - that means that each sample will have the full latency of a single 3000 point FIR, but between them they get the right throughput. With a wee bit of shared memory this can run very nicely, i.e. add 16 samples to the buffer, set each thread going on computing one answer, collect 16 answers, and repeat.
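
The batch scheme described above can be sketched in plain C. This is only an illustration, not the code from the repository: the names `fir_one`, `fir_batch` and `BATCH` are made up here, the 16 workers are simulated by an ordinary loop rather than real XMOS threads, and 64-bit accumulation is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH 16  /* hypothetical: one output sample per worker thread */

/* Compute one FIR output. x points at the newest input sample for this
 * output; the (ntaps - 1) older samples sit directly before it in memory. */
static int64_t fir_one(const int32_t *x, const int32_t *h, size_t ntaps)
{
    int64_t acc = 0;
    for (size_t k = 0; k < ntaps; k++)
        acc += (int64_t)h[k] * x[-(ptrdiff_t)k];
    return acc;
}

/* buf holds (ntaps - 1) history samples followed by BATCH new samples.
 * Every iteration only reads the shared buffer and writes its own out[w],
 * so each one could be handed to a separate thread. */
void fir_batch(const int32_t *buf, const int32_t *h, size_t ntaps,
               int64_t out[BATCH])
{
    assert(ntaps > 0);
    const int32_t *first_new = buf + (ntaps - 1);
    for (size_t w = 0; w < BATCH; w++)
        out[w] = fir_one(first_new + w, h, ntaps);
}
```

Every output still walks the full tap array, which is why the latency stays that of a complete FIR while only the throughput scales with the number of workers.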

Better latency is obtained by parallelising the FIR itself - which would be better of course, but not as trivial to write :) It would also use less memory because each core would only need to keep one quarter of the coefficients.
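
Parallelising the FIR itself amounts to splitting the tap range into partial sums that are added at the end. A minimal sketch in plain C, under the assumption that the tap count divides evenly by the number of workers; `fir_partial` and `fir_split` are hypothetical names, and the workers are again simulated by a loop:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Partial sum over taps [start, start + len). x points at the newest
 * sample, older samples sit before it in memory. */
int64_t fir_partial(const int32_t *x, const int32_t *h,
                    size_t start, size_t len)
{
    int64_t acc = 0;
    for (size_t k = start; k < start + len; k++)
        acc += (int64_t)h[k] * x[-(ptrdiff_t)k];
    return acc;
}

/* One output sample computed as `parts` independent partial sums.
 * Each call to fir_partial touches only its own slice of h and x. */
int64_t fir_split(const int32_t *x, const int32_t *h,
                  size_t ntaps, size_t parts)
{
    assert(parts > 0 && ntaps % parts == 0);
    size_t len = ntaps / parts;
    int64_t acc = 0;
    for (size_t p = 0; p < parts; p++)
        acc += fir_partial(x, h, p * len, len);
    return acc;
}
```

Because worker `p` reads only taps `p * len` to `p * len + len - 1` and the matching samples, a core running that worker needs to store just its own slice, which is where the memory saving comes from.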

We are a bit short of good L2 development boards, we only have the large audio board, but not something small XC-1 style. Will give that some thought!

Henk

lilltroll77 commented 13 years ago

I made a hybrid. Each core uses a shared memory space. Each core calculates its own sample, but within a core the FIR is calculated in parallel by 4 threads working on the same sample. So on a G4 the delay would be 4 samples + CODEC. I found it to be a good mix between memory usage, inter-core channel bandwidth and latency. I also started to write a minimum-latency variant where each sample is distributed to all threads, but that creates some more overhead.

henkmuller commented 13 years ago

Yes - that is a good mix. It can be made a bit faster by running each sample through all 4 threads on the core, but not enormously.

How are you planning to do this - fork and do a merge request? Anything I can contribute or are you happily hacking away?

lilltroll77 commented 13 years ago

I created a new directory in the dsp_filter tree so as not to interfere with the existing stuff. I will push it on Monday. You should maybe not pull the main makefile, since I changed it to just compile my stuff.

It runs like this:

Testing performance, Running FIR-filter for 1 sec on a single thread with 3000 filter taps
Filtered 6660 samples during 1 second
19980 kTaps per sec.
CRC32 checksum for all filtered samples was: 0xED9B0990
Calculating the CRC32 checksum from the XC implementation, this might take some time
Correct Checksum for filtered datasequence is: 0xED9B0990

I think you get my idea. It runs for one sec with different Asm implementations, and you get the kTaps per sec and the CRC32 checksum of the sequence, which is finally compared to the XC implementation of the FIR filter to guarantee that it works flawlessly.
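
The checksum comparison can be sketched as follows. This is an assumption-laden illustration: the thread does not say which CRC-32 convention the test uses, so the common reflected polynomial 0xEDB88320 with initial value and final XOR of 0xFFFFFFFF is used here, and the checksum values quoted above will not necessarily be reproduced. The function names are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Feed one byte into a reflected CRC-32 (polynomial 0xEDB88320). */
static uint32_t crc32_step(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int bit = 0; bit < 8; bit++)
        crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
    return crc;
}

/* CRC-32 over a byte buffer. */
uint32_t crc32_bytes(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = crc32_step(crc, data[i]);
    return crc ^ 0xFFFFFFFFu;
}

/* CRC-32 over 32-bit samples, least significant byte first, so the
 * result does not depend on the byte order of the host. */
uint32_t crc32_samples(const int32_t *s, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++) {
        uint32_t v = (uint32_t)s[i];
        for (int b = 0; b < 4; b++)
            crc = crc32_step(crc, (uint8_t)(v >> (8 * b)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Nonzero when the optimised output has the same checksum as the
 * reference output. */
int outputs_match(const int32_t *fast, const int32_t *ref, size_t n)
{
    return crc32_samples(fast, n) == crc32_samples(ref, n);
}
```

Comparing one checksum per run means the optimised version never has to keep its whole output sequence around, only a running 32-bit value.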

lilltroll77 commented 13 years ago

I played around with different distributed solutions for one core. This one is "rather" simple to understand. It doesn't use chanends that arise "hidden" in asm etc.

Testing performance, Running FIR-filter for 1 sec on quad threads with 3000 filter taps
Filtered 26511 samples during 1 second
79533 kTaps per sec.
CRC32 checksum for all filtered samples was: 0xEB9762A1
Calculating the CRC32 checksum from the XC implementation, this might take some time
Correct Checksum for filtered datasequence is: 0xEB9762A1

This runs at 99.52% of the speed of 4 single FIR threads, which must be very good.
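
The figures in the two test runs can be checked with a few lines of integer arithmetic. The helper names below are made up for this sketch; the inputs are the numbers reported in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Throughput in kTaps/s: samples filtered per second times the number
 * of taps, divided by 1000. */
int64_t ktaps_per_sec(int64_t samples_per_sec, int64_t ntaps)
{
    assert(samples_per_sec >= 0 && ntaps > 0);
    return samples_per_sec * ntaps / 1000;
}

/* Ratio of a throughput to `threads` times the single-thread
 * throughput, in hundredths of a percent, rounded to nearest. */
int64_t ratio_bp(int64_t ktaps, int64_t single_ktaps, int64_t threads)
{
    assert(single_ktaps > 0 && threads > 0);
    int64_t ideal = single_ktaps * threads;
    return (ktaps * 10000 + ideal / 2) / ideal;
}
```

With these, `ktaps_per_sec(6660, 3000)` gives 19980 and `ktaps_per_sec(26511, 3000)` gives 79533, and `ratio_bp(79533, 19980, 4)` gives 9952, i.e. 99.52%. The original target of 3000 taps at 96 kHz needs `ktaps_per_sec(96000, 3000)` = 288000 kTaps/s, and `ratio_bp(288000, 19980, 16)` gives 9009, i.e. about 90% of 16 ideal single threads, in line with the 90% load mentioned at the start.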

lilltroll77 commented 13 years ago

I tried to push one of the example apps from EGIT (commit). Did you receive anything? If so I have more to push!

I am confused between EGIT and GIT.

henkmuller commented 13 years ago

Spent most of the weekend on trains without quality mail/web access. I will let you know tomorrow morning whether I got something.

Cheers Henk

On 16 Oct 2011, at 18:36, Mikael <reply@reply.github.com> wrote:

I tried to push one of the example apps from EGIT. Did you receive anything? If so I have more to push!

lilltroll77 commented 13 years ago

I know how to do it with GIT, but I want to do it with EGIT, and I do not really know what happened on your side. I have requested that some more docs be added about how to push with EGIT.

lilltroll77 commented 13 years ago

Also, before I push more code, I would like to contribute to the documentation of this one. Can you point to the place where it should go, or create a new empty file if it is missing?

henkmuller commented 13 years ago

Hi Mikael,

I haven't seen anything yet.

As far as I know there are two ways in github to do this. One is for you to "fork" the repository, work on it as normal, and then issue a "pull request". The other is for me to give you "push-permission". I have added you to the team, so you should be able to push your committed changes. Don't worry about the Makefile - I will revert that one. If you just want to make one app, you can just cd into the app directory and make (that is what I usually do in repos with many apps).

About the documentation, I have rewritten the biquad documentation, using the doxygen style comments in the header files. I will push this in a minute, it is in doc/biquad.rst. I suggest to use a similar format for the FIR, have a look and feel free to ask questions.

I will have a chat with Dave Lacey about egit - I have no experience with it!

lilltroll77 commented 13 years ago

I just tried to push again in EGIT but it says: An internal Exception occurred during push: ssh://github.com:29418/xcore/sc_dsp_filters.git: Connection refused: connect

I made a fork with the standard GIT tools. I will try to use that the traditional way until Dave explains the EGIT way.

lilltroll77 commented 13 years ago

I also tried this with help of the GUI to set it up.

ssh://lilltroll77@github.com:29418/xcore/sc_dsp_filters.git

including my password, but I get rejected. Is ssh the way to push?

henkmuller commented 13 years ago

Hmmm, are your ssh keys set up properly? http://help.github.com/set-up-git-redirect The port number in the URL seems interesting.

If so, could you just try a quick thing from the command line to check whether it is an egit problem or a github problem?

Make some tmp directory, go into it and try:

    git clone git@github.com:xcore/sc_dsp_filters.git
    cd sc_dsp_filters
    touch blah
    git add blah
    git commit . -m "Test"
    git push

On 17 Oct 2011, at 09:03, Mikael wrote:

I just tried to push again in EGIT but it says: An internal Exception occurred during push: ssh://github.com:29418/xcore/sc_dsp_filters.git: Connection refused: connect

I made a fork with the standard GIT tools. I will try to use that the traditional way until Dave explains the EGIT way.

lilltroll77 commented 13 years ago

That seems to work. Somehow I have to set up EGIT in such a way that it uses the same RSA keys. Maybe I will find something in the docs, but I added an issue in the documentation about that, since I guess more people than me use EGIT under Windows. PS. I sent you a Skype req. DS

lilltroll77 commented 13 years ago

Oki, I tried to push with standard GIT, did you get it? Don't forget to add the correct licence text in the files. I have no clue what it should be.

henkmuller commented 13 years ago

No - github thinks that the last change was 5 days ago. Can you copy-and-paste the output of the sequence into an email please?

Cheers, Henk

On 17 Oct 2011, at 09:56, Mikael wrote:

Oki, I tried to push with standard GIT, did you get it? Don't forget to add the correct licence text in the files. I have no clue what it should be.

henkmuller commented 13 years ago

PS - I just pushed the new documents so you will need to do a pull before you can push it back...

lilltroll77 commented 13 years ago

Check if the distributed version also works on your side on a multi-core XMOS chip (L2 or G4) without errors. If so, add it to the main makefile. I might wait until the 2 issues are fixed in the compiler before I write a multi-core version with arrays of channels.

lilltroll77 commented 13 years ago

I made several changes to the multithreaded example to be able to support a multi-core solution with a different number of distributed threads per core.

At the moment I have:

Testing performance, Running FIR-filter for 1 sec on 3 cores with 4 threads/core with 3000 filter taps
77221 samples during 1 second
231663 kTaps per sec.

The goal is to use 15 threads for the FIR and 1 thread for something else, like I2S.

I do not understand how to make it easy for the user to split the filter taps across different cores. One way is not to split them and just copy them to all cores, addressing only the relevant part, but that eats memory. Also, I would like to avoid malloc. Any ideas?
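
One possible approach, sketched in plain C and not taken from the repository: compute each worker's slice of the tap array from its index, and size a static per-worker buffer with a compile-time upper bound so that no malloc is needed. All names here (`tap_slice`, `slice_for`, `copy_slice`, `MAX_SLICE`, `NTAPS`, `PARTS`, `local_taps`) are hypothetical, and how the slice actually reaches another core (for example over a channel) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { size_t start; size_t len; } tap_slice;

/* Slice i out of `parts` nearly equal slices of ntaps taps. The first
 * (ntaps % parts) slices get one extra tap. */
tap_slice slice_for(size_t ntaps, size_t parts, size_t i)
{
    assert(parts > 0 && i < parts);
    size_t base = ntaps / parts;
    size_t extra = ntaps % parts;
    tap_slice s;
    s.start = i * base + (i < extra ? i : extra);
    s.len = base + (i < extra ? 1 : 0);
    return s;
}

/* Compile-time upper bound on the slice length, usable as an array size. */
#define MAX_SLICE(ntaps, parts) (((ntaps) + (parts) - 1) / (parts))

#define NTAPS 3000  /* example values from the thread */
#define PARTS 12
int32_t local_taps[MAX_SLICE(NTAPS, PARTS)];  /* static storage, no malloc */

/* Copy slice i of all_taps into dst and return its length. */
size_t copy_slice(int32_t *dst, const int32_t *all_taps,
                  size_t ntaps, size_t parts, size_t i)
{
    tap_slice s = slice_for(ntaps, parts, i);
    memcpy(dst, all_taps + s.start, s.len * sizeof *dst);
    return s.len;
}
```

With 3000 taps and 12 workers every slice is 250 taps long. With 16 workers the first 8 slices get 188 taps and the remaining 8 get 187, so `MAX_SLICE(3000, 16)` is 188. The user would only provide the full tap array and the number of workers; the offsets follow from the worker index.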

Is it possible to push a branch that does not belong to origin?

Anyway, I need to follow the memory reads in the simulator. There are many memory pointers with 12 threads or more, and they all need to be at the correct position.