Fix ASCII/binary detection issues

hakostra commented 3 years ago

This is an attempt to fix #156

STL files starting with a whitespace (first line indented) are now properly recognized as ASCII
If the file mode is ASCII go straight to ASCII reader, and if file mode is binary, go straight to binary reader
In automatic mode, if the file/stream start with word 'solid', possibly prepended by whitespace, try ASCII mode first and fallback to binary
If header contain non-ASCII characters, assume binary by default in automatic mode

I added three STL's in the tests-folder, but did not figure out how to include these in the testing framework, sorry. All of the STL's should read and parse.

The Cube.stl is a square cube from (x, y, z) = (0, 0, 0) to (2, 2, 2) in ASCII and binary format. The Cube.stl in ASCII format does not parse correctly with the current develop-branch, and this PR fixes this issue.

The CubeInvalidUnicode.stl is the same cube, but where a non-ASCII/non-Unicode character (0xFFFF) is inserted in the end of the 80-byte header. Reading in this file should cover the except UnicodeError part that is not covered by the existing tests.

In the end I want to mention that I was not completely satisfied with the changes I came up with in stl.py, because the load-function became less understandable. However, I think that if a file is opened in Mode.BINARY, it should go straight to the binary reader without checking particularly much about the header and trying ASCII first, and the same for Mode.ASCII. The automatic mode is complicated by needing to support autodetection in non-seekable streams, so this part is maybe best left as-is.

I am also slightly unsure about Python 2.7 support in this PR. I no longer have this available on my development machine and cannot test. I do however think that header.decode("ascii").lstrip().lower() is valid Python 2.7...

Thanks.

coveralls commented 3 years ago

Coverage decreased (-100.0%) to 0.0% when pulling 22f89d6ddfe739fdeafabb1d5ffdac4bd6d22520 on hakostra:feature/ascii-binary-detection-issue into 93f194684f1f208fd333e430b430a7f22d657816 on WoLpH:develop.

hakostra commented 3 years ago

I just realized the following: A perfectly valid binary STL can have a header with 80 bytes of NULL (0x00). I think this decodes to an empty string in Python. Also, there can be cases where the binary header contain less than 5 ASCII characters (e.g. the word 'STL') and then the rest null-bytes.

In these cases the check

headerstr = header.decode("ascii").lstrip().lower()
possibly_ascii = True if headerstr[:5] == 'solid' else False

is slightly ugly ([:5] on a str with length < 5). Maybe a better syntax would have been

possibly_ascii = headerstr.startswith('solid')

However, i just checked, and it appears that

a = ""
test = a[:5]

is valid and raises no errors...

I do not have my development PC now, but I will construct such an STL tomorrow and check if it passes through in automatic mode.

wolph commented 3 years ago

In these cases the check
headerstr = header.decode("ascii").lstrip().lower()
possibly_ascii = True if headerstr[:5] == 'solid' else False
is slightly ugly ([:5] on a str with length < 5). Maybe a better syntax would have been
possibly_ascii = headerstr.startswith('solid')
However, i just checked, and it appears that
a = ""
test = a[:5]
is valid and raises no errors...

It's not the prettiest, I'm trying to remember the reason for this syntax but both wil work. Actually, it could have been reduced to:

possibly_ascii = header.decode("ascii").lstrip().lower()[:5] = 'solid'

The True/False thing is silly because a == should always return a boolean

hakostra commented 3 years ago

You mean to use == don't you:

possibly_ascii = header.decode("ascii").lstrip().lower()[:5] == 'solid'

The reason for using [:5] originally might be because that was the only way to do this with a binary sequence, as it was previously. For this PR I had to convert the header to a string, because otherwise I cannot do "lstrip" on it.

I also created an STL variant with a header consisting only of null-bytes, and it passes the automatic ASCII/binary detection without any problems. The 80-byte sequence of null's decode as I anticipated to an empty string.

wolph commented 3 years ago

You mean to use == don't you:

Indeed, that was a typo :)

The reason for using [:5] originally might be because that was the only way to do this with a binary sequence, as it was previously. For this PR I had to convert the header to a string, because otherwise I cannot do "lstrip" on it.

Ah... that explains a lot. It shouldn't be needed for bytes though:

In [1]: x = b' some binary string with spaces '

In [2]: x.strip()
Out[2]: b'some binary string with spaces'

In [3]: x.lstrip()
Out[3]: b'some binary string with spaces '

I think the slicing method was done mostly due to Python 2/3 incompatibility issues. But that's no longer relevant so it can be replaced :)

I also created an STL variant with a header consisting only of null-bytes, and it passes the automatic ASCII/binary detection without any problems. The 80-byte sequence of null's decode as I anticipated to an empty string.

Nice :)

hakostra commented 3 years ago

Ah. So you can lstrip on bytes objects as well... I thought that was valid for 'str' only. Sorry, still get confused on this.

Anyhow, I think the ASCII decode is sensible, because if it contain binary content that cannot be decoded to ASCII in the header, it is at least not an ASCII STL file (but maybe someone tries an "unicode STL"?).

Also, let me know if you want me to do more on this PR.

wolph commented 3 years ago

I've made a few small changes but everything is merged beyond that :)

Thank you for the help!

wolph / numpy-stl

Fix ASCII/binary detection issues #157