wiseio / paratext

A library for reading text files over multiple cores.
Apache License 2.0
1.06k stars 103 forks source link

Type Detection #31

Closed gridseth closed 7 years ago

gridseth commented 8 years ago

Summary:

When parsing files I ran into problems with how paratext parses some of the elements in a file. I believe these may be related to the process_token() function in colbased_worker.hpp that tries to detect whether a token is an integer, float, exponential/scientific number or otherwise.

Examples:

It detects a string such as A.1 as a number and I get 0.000000 It detects a string such as 3ABC as a number and I get 3.000000

Details:

Let A.1 be the input. When reading the token and checking if the token is an integer (line 270) it checks if token_[i] is a digit. If not, we move on to see if we are dealing with a float instead. However we advance the index, i, regardless of whether the integer check passes or fails. Therefore, when we get to the float check on line 279 we are looking at the . character instead of A. Then the check for a float passes since we see . and only digits after it. Finally the result is 0.000000 since A.1 gets converted to a float before it is passed to process_float.

Numbers like 12.345 are not picked up as floats (because the integer check fails on . and we check for float on the next character, here 3), but instead as exponentials. They pass as exponentials not because they pass the exponential check, but because exp_possible is set as true at the beginning (on line 272 after the integer check passes on line 270) and does not become false. Both exponentials and floats are passed to process_float in the end. For the same reason 3ABC is detected as an exponential instead of a string and we get 3.000000. (Numbers like .123 are not detected as floats or exponentials because exp_possible is set as false after the integer check .)

Note:

When making updates to the process_token() function I did some simplifications, but did not change the behaviour of the function other than for the issues found.

In the code change suggestion I have let a number on the form 14e-3 be a valid exponential (compared to 14.0e-3). This can be changed if not desired.

wesm commented 8 years ago

This patch will need unit tests to demonstrate the problem and corrected output. Where are we at on unit test framework? I can potentially help get a googletest unit test setup going.

alexeygrigorev commented 7 years ago

Why the source repository is unknown? I wanted to quickly apply the patch, but had to hand-code the change...

deads commented 7 years ago

Thank you for the bug fix. Can you please pull from master and write a unit test for the fix and run the tests via Travis? Thank you!

deads commented 7 years ago

Can you pull from wiseio/paratext into your local master so that our regression test suite can test your changes?

  git pull https://github.com/wiseio/paratext.git master

Thank you!

deads commented 7 years ago

I do not know how your PR lost track of its original repo+branch. I have monkeypatched your PR into https://github.com/wiseio/paratext/pull/53 and added some tests.

I will go ahead and close this PR as it is now superseded by #53 .

Thanks for your bug fix! Damian