Feedback from Tim Mace - GitHub issues

[x] Page 5 - No 'Build Tool' entry in the table for Linux and the ' section has 'TBD'
[x] Page 7 - 'This purpose this tutorial is not to demonstrate' should be 'The purpose of this tutorial is not to demonstrate'
[x] Page 8 - should '3. Waits for a frame of processed audio to be transferred back from Tile 0' be either '3. Waits for a frame of processed audio to be transferred back from Tile 1' or '3. Waits for a frame of processed audio to be transferred back to Tile 0'
[x] Page 9 - In the table, the text in the second column extends into the third - need to adjust the widths of columns 2 and 3
[ ] Page 10 - First paragraph - possibly not too important but if we are going to compare the accuracy between the BFP and FP at any stage it would be better to have taps with more of a dynamic range between them
[ ] Page 10 - It doesn't actually make a difference to this example because of the simple filter transfer function but putting the scaling into the coefficients increases the noise floor proportionately. Ideally we should scale after the dot product, particularly since the xcore VPU can capture the headroom. Is this just used in this example or do we scale the coefficients for all filters with this library?
[x] Page 22 - 'Note that to demonstrte a fully floating-point' should be 'Note that to demonstrate a fully floating-point'
[ ] There seem to be a number of optimisations that could be made to the code although they all have minor performance implications
[ ] Page 25 - 'The main reason for this speed-up is that the C compiler does not, by default, emit dual-issue implementations of C functions' - Why not? Is there some fundamental reason why the compiler can't do this or is it just on our 'to do' list?
[ ] Page 23 (and others) - not clear what the scale is for the output (guessable) and the sample diff - it would be good to indicate the SNR for comparison
[x] Page 28 - 'That is a 32-bit value (1+31) with 1 integer bit and 31 fractional bits. Likewise, q8_24 indicates the Q8.24 format with 24 fractional bits, and so on' - should this be 'That is a 32-bit value (1+31) with 1 integer bit and 31 fractional bits. Likewise, q8_24 indicates 8 integer bits and 24 fractional bits, and so on'
[ ] Page 34 - There are two formats for the 'Part ##' terms; black bold and blue bold italic - are these supposed to be the same or is there supposed to be a differentiation?
[x] Page 58 - 'INT32_SQUARE_MAX_LOG2 = log2(INT32_MIN) = log2((−2^31)^2) = log2(2^62) = 62' - shouldn't the 'log2(INT32_MIN)' be 'log2(INT32_MIN^2)'?
[x] Page 65 - 'Because the VPU’s accumulators in 32-bit more are 40 bits, we know that to be safe we need to shift the result an extra 8 bits.' should be 'Because the VPU’s accumulators in 32-bit mode are 40 bits, we know that to be safe we need to shift the result an extra 8 bits.'
[x] Page 89 - different font used for the first instruction in '(VLADSB, VFTFB, VFTFF, VFTTB, VFTTF'
[x] Page 98 - 'Because the code paths for the floating-point logic involved is sensitive do the data values, random filter weights and input data were used.' should be 'Because the code paths for the floating-point logic involved is sensitive to the data values, random filter weights and input data were used.'

Page 9 - In the table, the text in the second column extends into the third - need to adjust the widths of columns 2 and 3

This was an absolute nightmare to fix! In the end the only solution I could find was to shorten the descriptions. I spent about 2 hours trying to fix this.

Page 10 - First paragraph - possibly not too important but if we are going to compare the accuracy between the BFP and FP at any stage it would be better to have taps with more of a dynamic range between them

I agree with this, and had originally intended to make it do something more interesting after I basically had everything written and working. But that took a lot longer than anticipated and unfortunately it will be a good bit of work to make this change now.

If we consider this necessary I can go back and do it. Otherwise, if there's ever occasion to release a "version 2" (or whatever), this is one of the updates I would suggest in that.

Page 10 - It doesn't actually make a difference to this example because of the simple filter transfer function but putting the scaling into the coefficients increases the noise floor proportionately. Ideally we should scale after the dot product, particularly since the xcore VPU can capture the headroom. Is this just used in this example or do we scale the coefficients for all filters with this library?

The library doesn't require this to be one way or the other. The filter_fir_s32() implementation uses the (Q2.30) coefficients provided by the user as-is, and the final step to produce the output sample is to apply an arithmetic right-shift to the accumulators. The user is only required to ensure that the coefficients are scaled such that the accumulators won't overflow.
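To make the accumulate-then-shift arrangement concrete, here is a minimal portable sketch. It is not the library's implementation: fir_sketch_s32, its parameter names, and the 64-bit accumulator are assumptions made for illustration (the VPU accumulators are narrower), and the coefficients are simply used as given.

```c
#include <stdint.h>

/* Illustrative model only, NOT the filter_fir_s32() implementation.
 * The function name, parameters and 64-bit accumulator are assumptions. */
int32_t fir_sketch_s32(const int32_t *coef,   /* fixed-point taps, used as-is */
                       const int32_t *state,  /* most recent input samples    */
                       unsigned taps,
                       unsigned shift)        /* final output right-shift     */
{
    int64_t acc = 0;

    /* Dot product at full precision. The caller must scale the taps so
     * that this sum cannot overflow the accumulator. */
    for (unsigned k = 0; k < taps; k++) {
        acc += (int64_t)coef[k] * state[k];
    }

    /* Scale once, after the dot product. Right-shifting a negative value
     * is implementation-defined in C; common compilers shift arithmetically. */
    return (int32_t)(acc >> shift);
}
```

For example, with taps {0.25, 0.5, 0.25} stored as Q2.30 values (1 << 28, 1 << 29, 1 << 28), samples {1000, 2000, 3000} and a shift of 30, the sketch returns 2000.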

I believe the floating-point to fixed-point filter conversion Python script provided with the library chooses a coefficient scaling that leaves minimal headroom in the filter coefficients (though it will only scale them by powers of 2).
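As a rough illustration of what power-of-2 scaling with minimal headroom could look like, the sketch below picks the largest exponent for which the biggest coefficient still fits in a signed 32-bit word. This is an assumption about the general idea, not the conversion script's actual algorithm; pow2_scale_exp and quantize_pow2 are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical helper: largest exponent e such that max|coef| * 2^e
 * is still below 2^31. Not taken from the library's script. */
int pow2_scale_exp(const double *coef, unsigned n)
{
    const double limit = 2147483648.0; /* 2^31 */
    double peak = 0.0;

    for (unsigned k = 0; k < n; k++) {
        double mag = coef[k] < 0.0 ? -coef[k] : coef[k];
        if (mag > peak) peak = mag;
    }
    if (peak == 0.0) return 0; /* all-zero filter: any exponent works */

    int exp = 0;
    while (peak * 2.0 < limit) { peak *= 2.0; exp++; } /* grow while it fits  */
    while (peak >= limit)      { peak *= 0.5; exp--; } /* shrink if too large */
    return exp;
}

/* Hypothetical helper: scale by 2^exp and truncate toward zero. */
void quantize_pow2(const double *coef, int32_t *out, unsigned n, int exp)
{
    double scale = 1.0;
    for (int i = 0; i < exp; i++) scale *= 2.0;
    for (int i = 0; i > exp; i--) scale *= 0.5;

    for (unsigned k = 0; k < n; k++) {
        out[k] = (int32_t)(coef[k] * scale);
    }
}
```

In a scheme like this, the chosen exponent is also what the output shift would need to undo.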

But otherwise, it is left up to the user to make sure the coefficient scaling and output shift meet their needs.

There seem to be a number of optimizations that could be made to the code although they all have minor performance implications

I agree there are likely many such optimizations in various places. In some cases I have consciously chosen to prefer clarity of purpose over potentially minor optimizations. For example, none of the loops are unrolled anywhere.

I'm sure in many other cases I just didn't notice that an optimization was possible. If there are any specific optimizations that you think would make an appreciable difference, I'm happy to try them out.

Page 25 - 'The main reason for this speed-up is that the C compiler does not, by default, emit dual-issue implementations of C functions' - Why not? Is there some fundamental reason why the compiler can't do this or is it just on our 'to do' list?

I'm not sure. I believe there are compiler flags or #pragmas that will have the compiler generate dual-issue code. There may also be some cases I'm not aware of where it actually does emit dual-issue code automatically.

My guess (speculation) for why this doesn't generally happen by default is that we're typically more constrained by memory than speed, and emitting dual-issue code will always make the program size larger, particularly if it isn't well optimized.

Page 23 (and others) - not clear what the scale is for the output (guessable) and the sample diff - it would be good to indicate the SNR for comparison

I'm not sure what you mean with respect to the scale not being clear (the Y-axes all show a scale).
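On the SNR part of the comment, one possible way to get a single comparison figure is to treat the floating-point output as the reference and the sample diff as noise. The sketch below is only an assumption about how that could be done; snr_power_ratio is a hypothetical helper, not something in the tutorial code.

```c
#include <stddef.h>

/* Hypothetical helper: ratio of reference-signal power to error power.
 * Returns -1.0 as a flag when the two signals are identical (no error). */
double snr_power_ratio(const double *ref, const double *test, size_t n)
{
    double sig = 0.0; /* sum of squared reference samples */
    double err = 0.0; /* sum of squared sample differences */

    for (size_t i = 0; i < n; i++) {
        double d = test[i] - ref[i];
        sig += ref[i] * ref[i];
        err += d * d;
    }
    return err > 0.0 ? sig / err : -1.0;
}
```

Taking 10 * log10() of the returned ratio gives the SNR in dB.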

Page 34 - There are two formats for the 'Part ##' terms; black bold and blue bold italic - are these supposed to be the same or is there supposed to be a differentiation?

The ones in blue italics are links to those sections in the document. The others are not links because it didn't seem appropriate to make every single one a link.

My operating convention was:

Make references to the stages (e.g. "Part 2C") bold (and capitalized) so that it is clear something specific is being referred to.
Within each stage's section of the document, a reference to another stage is a link the first time it appears in that section, and a non-link afterwards.

xmos / xmath_walkthrough

Feedback from Tim Mace #1