xmos / lib_spi

SPI peripheral library
Other
3 stars 10 forks source link

[WIP] Refactor in quad spi #22

Closed cdoak closed 6 years ago

cdoak commented 6 years ago

I have started mocking up some code that may be the replacement API for lib_spi where it now also incorporates the bus functions for lquad_spi.

In this first attempt, I have focused on the ability to maintain a single API interface, but pass different port structures based on SPI/QSPI requirements. I have achieved this through the use of a union. I'm interested to hear your thoughts on this vs the obvious route of having two API headers (xmos_spi.h and xmos_qspi.h)

Colin.

ghost commented 6 years ago

It looks like an implementation detail to me (internal header file), rather than an API. Thus the use of union is an optimisation detail of opaque state which does not affect the API contact. Q: does this implementation detail leak out into the API? Q: do you need both features to be supported in the same binary build? Q: does the binary size increase at the expense of generic source code?

cdoak commented 6 years ago

I've taken Roberts advice on board and buried the implementation details from the client behind the opaque handle (via the CREATE_XXX_HANDLE) macros.

I've also implemented function pointers that are set when CREATE_XXX_HANDLE is called. This negates the need to if-the-else and switch statements in the calls, therefore reducing code size and improving speed.

Still alot of work to do, but interested to hear your thoughts now @robert-lytton , @samchesney, @pthedinger, @chrisc-xmos .

cdoak commented 6 years ago

I'm debating with myself who controls the device_select (CS) port. Im leaning on the side of its NOT lib_spi.

Take lib_flash for example, it will want to output and input data all within the same CS low period. If lib_spi controlled the port there would be two toggles on CS, one for the output and another for the input. Whereas if lib_flash controlled it, it could lower the cs, perform its output followed by input and then raise it when done.

I could leave the CS port owned by lib_spi but expose an API function to raise/lower CS on its handle. So the caller has control over it.

What are your thoughts @samchesney @robert-lytton @pthedinger @chrisc-xmos ?

ghost commented 6 years ago

There may not be a CS (if the board is laid out as a daisy-chain of 1 or more devices). But I have no idea if this is something a customer is allowed to do... If you want to parameterise this behaviour use a strategy pattern viz: the caller passes in a function pointer or 'details' that will do any necessary work (or a null).

"All problems in computer science can be solved by another level of indirection" – David Wheeler "…except for the problem of too many layers of indirection" – Kevlin Henney's corollary

samchesney commented 6 years ago

Hi @cdoak, I like the look of the API with the opaque handle 👍

I guess the complication of supplying shared multi-bit ports in as CS probably means passing in a function pointer as Robert suggests would be best. However as this still complicates the lib_flash requirement you highlighted it may just be best to just keep it external to the library.

samchesney commented 6 years ago

Hi @cdoak, one other thing came to mind last night, I was wondering whether this API could also be used for a SPI slave implementation by passing an extra parameter to the CREATE_XXX_HANDLE() macros?

cdoak commented 6 years ago

In lib_flash, when transmitting data, I format the data first to the correct pins and then I transmit it.

With lib_spi, I'd like to wrap this formatting up, so that the client doesn't need to concern themselves with it. It will be auto-formatted based on what handle they have.

The following code shows this formatting for QSPI write bytes in SPI mode...

// Copyright (c) 2016-2018, XMOS Ltd, All rights reserved

//Write bytes for QSPI SPI mode.  That is all data is output from the SIO port one
//word at a time.  As all the bits are doubled up so it takes two words to make
//up the required byte.  The data must be transmitted from pin IO0 (SI) of the
//4 pin port.

#if defined(__XS2A__)

#define STACKWORDS          8 //TODO: how to calculate?

#define EIGHT_CLOCK_EDGES   16 //doubled up

//Registers
#define QSPI_PORTS          r0
#define TX_DATA             r1
#define NUM_DATA            r2
#define SCLK                r3
#define SIO                 r4
#define CLK_EDGES           r5
#define WORD_OUT_ONE        r6
#define WORD_OUT_TWO        r7
#define ZIP_OUT            r8
#define INDEX               r9
#define CLK_START_END       r10

#define PORT_DELAY          r8
#define PORT_TIME           r11

//Data Pool for clock edges
.section  .dp.data,"awd",@progbits
.align 4
.cc_top clock_edge_data.datum
clock_edges_cpol_zero:
  .word 0xAAAA //8 edges (doubled up to 16)
clock_edges_cpol_one:
  .word 0x5555 //8 edges (doubled up to 16)
.cc_bottom clock_edge_data.datum
.text

////////////////////////////////////////////////////////////////////////////////
//qspi_port_spi_mode_zero_write_bytes_asm
//
//Writes bytes from memory up to and including speeds of 50MHz
//
//in      - internal_handle : spi_handle_t
//out     - tx_bytes : char*
//in      - num_bytes : unsigned
//return  - void
////////////////////////////////////////////////////////////////////////////////

.cc_top qspi_port_spi_mode_zero_write_bytes_asm.function
.globl  qspi_port_spi_mode_zero_write_bytes_asm
.align  4
.type qspi_port_spi_mode_zero_write_bytes_asm,@function

qspi_port_spi_mode_zero_write_bytes_asm:
.issue_mode dual
  DUALENTSP_lu6 STACKWORDS

  //Callee save registers: TODO: why SP[1]..SP[3]
  std r4, r5, sp[1]
  std r6, r7, sp[2]
  std r8, r9, sp[3]

  //Deducting one from NUM_DATA from the get-go to enable 1 byte reads below...
  {ldw SIO, QSPI_PORTS[1]; sub NUM_DATA, NUM_DATA, 1}
  //Initialising the INDEX to -1
  {ldw SCLK, QSPI_PORTS[2]; mkmsk INDEX, 32}
  //Mode 0 requires cpol 0 clock edges
  //cpol 0 starts/finish low
  {ldw CLK_EDGES, dp[clock_edges_cpol_zero]; ldc CLK_START_END, 0}
  outpw res[SCLK], CLK_START_END, 1

  bl qspi_port_write_bytes_asm

  //Callee restore registers
  ldd r4, r5, sp[1]
  ldd r6, r7, sp[2]
  ldd r8, r9, sp[3]

  retsp STACKWORDS

.qspi_port_spi_mode_zero_write_bytes_asm_tmp:
  .size qspi_port_spi_mode_zero_write_bytes_asm, .qspi_port_spi_mode_zero_write_bytes_asm_tmp - qspi_port_spi_mode_zero_write_bytes_asm
  .align  4
  .cc_bottom qspi_port_spi_mode_zero_write_bytes_asm.function

////////////////////////////////////////////////////////////////////////////////
//qspi_port_spi_mode_one_write_bytes_asm
//
//Writes bytes from memory up to and including speeds of 50MHz
//
//in      - internal_handle : spi_handle_t
//out     - tx_bytes : char*
//in      - num_bytes : unsigned
//return  - void
////////////////////////////////////////////////////////////////////////////////

.cc_top qspi_port_spi_mode_one_write_bytes_asm.function
.globl  qspi_port_spi_mode_one_write_bytes_asm
.align  4
.type qspi_port_spi_mode_one_write_bytes_asm,@function

qspi_port_spi_mode_one_write_bytes_asm:

.issue_mode dual
  DUALENTSP_lu6 STACKWORDS

  //Callee save registers: TODO: why SP[1]..SP[3]
  std r4, r5, sp[1]
  std r6, r7, sp[2]
  std r8, r9, sp[3]

  //TODO:

  //Callee restore registers
  ldd r4, r5, sp[1]
  ldd r6, r7, sp[2]
  ldd r8, r9, sp[3]

retsp STACKWORDS

.qspi_port_spi_mode_one_write_bytes_asm_tmp:
  .size qspi_port_spi_mode_one_write_bytes_asm, .qspi_port_spi_mode_one_write_bytes_asm_tmp - qspi_port_spi_mode_one_write_bytes_asm
  .align  4
  .cc_bottom qspi_port_spi_mode_one_write_bytes_asm.function

////////////////////////////////////////////////////////////////////////////////
//qspi_port_spi_mode_two_write_bytes_asm
//
//Writes bytes from memory up to and including speeds of 50MHz
//
//in      - internal_handle : spi_handle_t
//out     - tx_bytes : char*
//in      - num_bytes : unsigned
//return  - void
////////////////////////////////////////////////////////////////////////////////

.cc_top qspi_port_spi_mode_two_write_bytes_asm.function
.globl  qspi_port_spi_mode_two_write_bytes_asm
.align  4
.type qspi_port_spi_mode_two_write_bytes_asm,@function

qspi_port_spi_mode_two_write_bytes_asm:

.issue_mode dual
  DUALENTSP_lu6 STACKWORDS

  //Callee save registers: TODO: why SP[1]..SP[3]
  std r4, r5, sp[1]
  std r6, r7, sp[2]
  std r8, r9, sp[3]

  //TODO:

  //Callee restore registers
  ldd r4, r5, sp[1]
  ldd r6, r7, sp[2]
  ldd r8, r9, sp[3]

retsp STACKWORDS

.qspi_port_spi_mode_two_write_bytes_asm_tmp:
  .size qspi_port_spi_mode_two_write_bytes_asm, .qspi_port_spi_mode_two_write_bytes_asm_tmp - qspi_port_spi_mode_two_write_bytes_asm
  .align  4
  .cc_bottom qspi_port_spi_mode_two_write_bytes_asm.function

////////////////////////////////////////////////////////////////////////////////
//qspi_port_spi_mode_three_write_bytes_asm
//
//Writes bytes from memory up to and including speeds of 50MHz
//
//in      - internal_handle : spi_handle_t
//out     - tx_bytes : char*
//in      - num_bytes : unsigned
//return  - void
////////////////////////////////////////////////////////////////////////////////

.cc_top qspi_port_spi_mode_three_write_bytes_asm.function
.globl  qspi_port_spi_mode_three_write_bytes_asm
.align  4
.type qspi_port_spi_mode_three_write_bytes_asm,@function

qspi_port_spi_mode_three_write_bytes_asm:

.issue_mode dual
  DUALENTSP_lu6 STACKWORDS

  //Callee save registers: TODO: why SP[1]..SP[3]
  std r4, r5, sp[1]
  std r6, r7, sp[2]
  std r8, r9, sp[3]

  //TODO:

  bl qspi_port_write_bytes_asm

  //Callee restore registers
  ldd r4, r5, sp[1]
  ldd r6, r7, sp[2]
  ldd r8, r9, sp[3]

retsp STACKWORDS

.qspi_port_spi_mode_three_write_bytes_asm_tmp:
  .size qspi_port_spi_mode_three_write_bytes_asm, .qspi_port_spi_mode_three_write_bytes_asm_tmp - qspi_port_spi_mode_three_write_bytes_asm
  .align  4
  .cc_bottom qspi_port_spi_mode_three_write_bytes_asm.function

////////////////////////////////////////////////////////////////////////////////
//qspi_port_write_bytes_asm
////////////////////////////////////////////////////////////////////////////////

.cc_top qspi_port_write_bytes_asm.function
.globl  qspi_port_write_bytes_asm
.align  4
.type qspi_port_write_bytes_asm,@function

qspi_port_write_bytes_asm:

.issue_mode dual
  DUALENTSP_lu6 STACKWORDS

  //Format the first byte so that it is all lined up on the SI pin
  add INDEX, INDEX, 1
  {ld8u WORD_OUT_ONE, TX_DATA[INDEX]; ldc ZIP_OUT, 0x00}
  add WORD_OUT_TWO, WORD_OUT_ONE, 0     //Duplicate
  zip WORD_OUT_ONE, WORD_OUT_TWO, 0     //Double up all the bits
  unzip WORD_OUT_ONE, WORD_OUT_TWO, 3   //Unzip into two words
  zip ZIP_OUT, WORD_OUT_ONE, 0          //Align out to SI...
  zip ZIP_OUT, WORD_OUT_TWO, 0          //...
  zip ZIP_OUT, WORD_OUT_ONE, 0          //...
  zip ZIP_OUT, WORD_OUT_TWO, 0          //...Align out to SI

  //Line the QSPI_PORTS up to output the clock edges and start outputting
  //at the required times
  setc res[SIO], 0x1   //TODO: find docs on what 0x1 means for port resource type
  in PORT_TIME, res[SIO]
  getts PORT_TIME, res[SIO]
  ldc PORT_DELAY, 0x96 //TODO: configurable port setup delay
  add PORT_DELAY, PORT_TIME, PORT_DELAY
  setpt res[SCLK], PORT_DELAY
  setpt res[SIO], PORT_DELAY
  outpw res[SCLK], CLK_EDGES, EIGHT_CLOCK_EDGES

  //If there is only one byte of data to send:
  //skip on down to read the final(only) byte...
  bf NUM_DATA, qspi_port_spi_mode_zero_write_bytes_asm_final_byte

qspi_port_spi_mode_zero_write_bytes_asm_loop:

  {out res[SIO], WORD_OUT_ONE; add INDEX, INDEX, 1}
  {out res[SIO], WORD_OUT_TWO; sub NUM_DATA, NUM_DATA, 1}

  //Format the next word...
  {ld8u WORD_OUT_ONE, TX_DATA[INDEX]; ldc ZIP_OUT, 0x00}
  add WORD_OUT_TWO, WORD_OUT_ONE, 0     //Duplicate
  zip WORD_OUT_ONE, WORD_OUT_TWO, 0     //Double up all the bits
  unzip WORD_OUT_ONE, WORD_OUT_TWO, 3   //Unzip into two words
  zip ZIP_OUT, WORD_OUT_ONE, 0          //Align out to SI...
  zip ZIP_OUT, WORD_OUT_TWO, 0          //...
  zip ZIP_OUT, WORD_OUT_ONE, 0          //...
  zip ZIP_OUT, WORD_OUT_TWO, 0          //...Align out to SI

  //Output clock edges to carry data that will be input on the next iteration
  outpw res[SCLK], CLK_EDGES, EIGHT_CLOCK_EDGES

  //Go back around the loop if more data to read...
  bt NUM_DATA, qspi_port_spi_mode_zero_write_bytes_asm_loop

qspi_port_spi_mode_zero_write_bytes_asm_final_byte:

  //Output the final byte
  out res[SIO], WORD_OUT_ONE
  out res[SIO], WORD_OUT_TWO

  //Wait for SCLK to complete
  outpw res[SCLK], CLK_START_END, 1
  syncr res[SCLK]

  retsp STACKWORDS

.qspi_port_write_bytes_asm_tmp:
  .size qspi_port_write_bytes_asm, .qspi_port_write_bytes_asm_tmp - qspi_port_write_bytes_asm
  .align  4
  .cc_bottom qspi_port_write_bytes_asm.function

#endif //defined(__XS2A__)

Unfortunately it is taking too long to get around the qspi_port_spi_mode_zero_write_bytes_asm_loop, such that when I output the next word combination for the byte, I end up with some extra zeros between these words and the last words.

Can anyone see any way to optimise this loop any more than it already is? Unfortunately the zips/unzips don't appear to be dual issuable.

My alternative is to format all the bytes into word pairs before I perform the send, however this will come at a cost of memory. That is I'd need enough space to store each byte to be sent as a word pair.

Would appreciate your thoughts @samchesney @robert-lytton @pthedinger @andrewstanfordjason

pthedinger commented 6 years ago

As discussed - a 16-entry lookup table to convert each nibble to a word should do the trick.

cdoak commented 6 years ago

I modified the code to use the 16-entry lookup table as per @pthedinger suggestion as follows, but unfortunately it's still too slow, resulting in unwanted 0 values between each byte sent...

.cc_top qspi_port_write_bytes_asm.function
.globl  qspi_port_write_bytes_asm
.align  4
.type qspi_port_write_bytes_asm,@function

lut_align_byte_to_si_pin:

  .word 0x00000000 //0x0
  .word 0x00000011 //0x1
  .word 0x00001100 //0x2
  .word 0x00001111 //0x3
  .word 0x00110000 //0x4
  .word 0x00110011 //0x5
  .word 0x00111100 //0x6
  .word 0x00111111 //0x7
  .word 0x11000000 //0x8
  .word 0x11000011 //0x9
  .word 0x11001100 //0xA
  .word 0x11001111 //0xB
  .word 0x11110000 //0xC
  .word 0x11110011 //0xD
  .word 0x11111100 //0xE
  .word 0x11111111 //0xF

qspi_port_write_bytes_asm:

.issue_mode dual
  DUALENTSP_lu6 STACKWORDS

  //Format the first byte so that it is all lined up on the SI pin
  {add INDEX, INDEX, 1; mkmsk BYTE_MASK, 0x4}
  {ld8u BYTE_OUT, TX_DATA[INDEX]; ldap LUT_SI_PIN, lut_align_byte_to_si_pin}
  {shr WORD_OUT_ONE, BYTE_OUT, 0x4; and WORD_OUT_TWO, BYTE_OUT, BYTE_MASK}
  ldw WORD_OUT_ONE, LUT_SI_PIN[WORD_OUT_ONE]
  ldw WORD_OUT_TWO, LUT_SI_PIN[WORD_OUT_TWO]

  //Line the QSPI_PORTS up to output the clock edges and start outputting
  //at the required times
  setc res[SIO], 0x1   //TODO: find docs on what 0x1 means for port resource type
  in PORT_TIME, res[SIO]
  getts PORT_TIME, res[SIO]
  ldc PORT_DELAY, 0x96 //TODO: configurable port setup delay
  add PORT_DELAY, PORT_TIME, PORT_DELAY
  setpt res[SCLK], PORT_DELAY
  setpt res[SIO], PORT_DELAY
  outpw res[SCLK], CLK_EDGES, EIGHT_CLOCK_EDGES

  //If there is only one byte of data to send:
  //skip on down to read the final(only) byte...
  bf NUM_DATA, qspi_port_spi_mode_zero_write_bytes_asm_final_byte

qspi_port_spi_mode_zero_write_bytes_asm_loop:

  {out res[SIO], WORD_OUT_ONE; add INDEX, INDEX, 1}
  {out res[SIO], WORD_OUT_TWO; sub NUM_DATA, NUM_DATA, 1}

  //Format the next byte so that it is all lined up on the SI pin
  {ld8u BYTE_OUT, TX_DATA[INDEX]; mkmsk BYTE_MASK, 0x4}
  {shr WORD_OUT_ONE, BYTE_OUT, 0x4; and WORD_OUT_TWO, BYTE_OUT, BYTE_MASK}
  ldw WORD_OUT_ONE, LUT_SI_PIN[WORD_OUT_ONE]
  ldw WORD_OUT_TWO, LUT_SI_PIN[WORD_OUT_TWO]

  //Output clock edges to carry data that will be input on the next iteration
  outpw res[SCLK], CLK_EDGES, EIGHT_CLOCK_EDGES

  //Go back around the loop if more data to read...
  bt NUM_DATA, qspi_port_spi_mode_zero_write_bytes_asm_loop

qspi_port_spi_mode_zero_write_bytes_asm_final_byte:

  //Output the final byte
  out res[SIO], WORD_OUT_ONE
  {out res[SIO], WORD_OUT_TWO; ldw CLK_START_END, sp[4]}

  //Wait for SCLK to complete
  outpw res[SCLK], CLK_START_END, 1
  syncr res[SCLK]

  retsp STACKWORDS

.qspi_port_write_bytes_asm_tmp:
  .size qspi_port_write_bytes_asm, .qspi_port_write_bytes_asm_tmp - qspi_port_write_bytes_asm
  .align  4
  .cc_bottom qspi_port_write_bytes_asm.function

I'll try @pthedinger other suggestion of using an additional thread to perform the formatting next.

cdoak commented 6 years ago

The lookup tables was fast enough. It was my mistake not realising that the bits were reversed that was causing the issue.

cdoak commented 6 years ago

The code for the function...

void flash_read_jedec_id(const flash_handle_t * const flash_handle,
                         char jedec[num_bytes],
                         const size_t num_bytes)
{
  unsafe
  {
    flash_handle_impl_t *unsafe internal_handle = (flash_handle_impl_t *unsafe)flash_handle;

    #if defined(XASSERT_ENABLE_ASSERTIONS)
    validate_divides(func_flash_read_jedec_id, flash_handle);
    #endif //defined(XASSERT_ENABLE_ASSERTIONS)

    //Build the flash_command_read_jedec
    unsigned command_out[2];
    {command_out[0], command_out[1]} = build_sio_pin0_data(flash_command_read_jedec);

    //Send the command out on a slower clkblk
    switch_sio_to_output(internal_handle);

    //Drive the CS low.  CS is clocked off the default reference clock and must be
    //lowered now to meet setup timings.
    drive_chip_select(internal_handle, chip_select_drive_low);

    //input time from the port;
    unsigned time_stamp;
    internal_handle->flash_ports.flash_sio :> void @ time_stamp;
    time_stamp += PORT_SETUP_DELAY;

    //Line everything up so that we output the command and the clock edges
    //at the same time.  8 clock edges are required to complete this.
    internal_handle->flash_ports.flash_sclk @ time_stamp <: CLOCK_EDGES_8;
    internal_handle->flash_ports.flash_sio @ time_stamp <: command_out[0]; //Command
    internal_handle->flash_ports.flash_sio <: command_out[1]; //Command
    partout(internal_handle->flash_ports.flash_sclk, 1, 0); //finish sclk low
    sync(internal_handle->flash_ports.flash_sclk); // ensure all output completes

    //Now turn the SIO port around.
    switch_sio_to_input(internal_handle);

    //Input the current port time again
    internal_handle->flash_ports.flash_sio :> void @ time_stamp;
    time_stamp += PORT_SETUP_DELAY;

    //Output our flash clock edges - 8 cycles for the first jedec byte
    partout_timed(internal_handle->flash_ports.flash_sclk, 16, CLOCK_EDGES_8, time_stamp);
    asm("setpt res[%0], %1"::"r"(internal_handle->flash_ports.flash_sio),"r"(time_stamp + internal_handle->flash_clock_delays.flash_byte_input_delay));

    if(num_bytes == 1)
    {
      jedec[0] = read_byte_asm(internal_handle->flash_ports);
    }
    else
    {
      read_bytes_asm(internal_handle->flash_ports, jedec, num_bytes);
    }

    //Drive the CS back hi now that everything is complete
    drive_chip_select(internal_handle, chip_select_drive_high);
  }
}

Will reduce to something like follows when it switches over to use lib_spi in this refactors...

reconfigure_sio_port();
  drive_cs_lo();
  spi_tx_bytes(&spi_handle, transaction_type_spi, jedec_command, 1);
  reconfigure_sio_port();
  spi_rx_bytes(&spi_handle, transaction_type_spi, jedec_id, 3);
cdoak commented 6 years ago

Hi @robert-lytton. Can I just ask you to check that I have understood the fptgroup attribute and therefore used it correctly in my last commit (https://github.com/xmos/lib_spi/pull/22/commits/a8a97525ecf78a420358346878364f690a4943cd)

Many Thanks

cdoak commented 6 years ago

Merging my pull request thus far so my work isn't lost when my account is deleted.