y-256 / libdivsufsort

A lightweight suffix-sorting library
MIT License
364 stars 83 forks source link

Minimum separator character #28

Open dhowe opened 3 months ago

dhowe commented 3 months ago

Not sure if this question makes sense and it may also be machine-dependent, but I'm wondering if there is a delimiter char I can use to separate words in the input to libdivsufsort so that this char will sort less than any of the characters in my text (the usual ascii set) and thus not otherwise affect the order of the suffix array ?

akamiru commented 3 months ago

What you want is called a generalized suffix array in literature. Libdivsufsort is by now fairly old, unmaintained and not state of the art. Pretty sure you will find software better suited for your job.

That said libdivsufsort is machine independent and will simply sort an array of unsigned 8 bit integers. So can simply use the 0 value as a seperator if it does not appear anywhere in your strings. (Haven't looked at the code in quite some years. Might be the case that you have to use 1).

dhowe commented 3 months ago

I assume you mean ascii 0 (NUL) rather than ascii 48 (0)... That was the first thing I tried, but got some strange results. I'll try next with ascii 1. Any pointers to a library for a generalized suffix array would be great too. Thanks.

danielsaad commented 3 months ago

Hi Daniel. You could try gsufsort: https://github.com/felipelouza/gsufsort

Best regards.

On Thu, Jul 25, 2024 at 11:36 AM Daniel Howe @.***> wrote:

I assume you mean ascii 0 (NUL) rather than ascii 48 (0)... That was the first thing I tried, but got some strange results. I'll try next with ascii

  1. Any pointers to a library for a generalized suffix array would be great too. Thanks.

— Reply to this email directly, view it on GitHub https://github.com/y-256/libdivsufsort/issues/28#issuecomment-2250490072, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAPS7ZGSKMZJE5YYTQK6BXDZOEEO7AVCNFSM6AAAAABLNZV2YSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENJQGQ4TAMBXGI . You are receiving this because you are subscribed to this thread.Message ID: @.***>

-- Daniel Saad Nogueira Nunes

akamiru commented 3 months ago

libdivsufsort does't know about ascii it sorts bytes independent of encoding that's why I wrote "value 0".

If you had issues with using zero then it might be that libdivsufsort treats them as string terminator. I haven't looked into the code since 2016 so ... no idea.