yakra / tmtools

Tools to aid in development of the TravelMapping project

DBFtrim: trailing 0s in type F fields #6

Closed yakra closed 6 years ago

yakra commented 6 years ago

If a type F field never uses scientific notation, i.e., never contains 'e' (or 'E'?), look into trimming trailing 0s as is done for type N fields. (Or sometimes even if it does contain 'e'?)
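A minimal sketch of that check, assuming each field value is available as an unpadded string. `hasExponent` and `trailingZeros` are hypothetical names for illustration, not functions from DBFtrim.

```cpp
#include <cassert>
#include <string>

// True if the value uses scientific notation (contains 'e' or 'E').
bool hasExponent(const std::string& val) {
    return val.find_first_of("eE") != std::string::npos;
}

// Number of trailing '0' characters to the right of the decimal point.
// Returns 0 for values with an exponent or without a decimal point.
std::size_t trailingZeros(const std::string& val) {
    if (hasExponent(val)) return 0;
    if (val.find('.') == std::string::npos) return 0;
    // The '.' itself is not a '0', so this search always finds something.
    std::size_t lastNon0 = val.find_last_not_of('0');
    return val.size() - 1 - lastNon0;
}
```

A field could then be trimmed by the smallest `trailingZeros` result across all of its records.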

yakra commented 6 years ago

ROADS_ACF.yTrim4.dbf
unique_id: trim ".000000000000000"; save 16 B
ah_blm: trim "0000000"; save 7 B
ah_elm: trim "0000000"; save 7 B
ah_length: trim "0000000"; save 7 B
ah_seg_num: trim ".000000000000000"; save 16 B
TOTAL: save 53 B / record x 439026 records = 23,268,378 B

2013-03-05: Some fields are missing. Was this list compiled from a culled file? A more relevant comment will be posted below...

yakra commented 6 years ago

7/9 test files are identical to previous DBFtrim version. Exceptions:

txdot-2015-roadways_48113.dbf (Dallas)
Shape_STLe F 12 2 0. <- 0.0000000000
Why is the extraneous decimal point not trimmed? (It can't be left-justified; 12 chars take up entire field width. DecCount?)
• Using DBFcull, I see the DIFF is limited to this field. Good...
• Per DBFmine, 0.0000000000 is the only value for all records

ROADS_ACF.yTrim.dbf
Binary files testfiles.old/ROADS_ACF.yTrim.dbf and testfiles.new/ROADS_ACF.yTrim.dbf differ
File sizes are identical. No indication in field info display of any extraneous 0s being trimmed. (Why?) In theory then, nothing should be different...
Either:
• load a 610.7 MB file into hex editor,
• fix the bug affecting TX first and hope the problem goes away, or
• something else

yakra commented 6 years ago

TX

txdot-2015-roadways_48113.dbf (Dallas)
Shape_STLe F 12 2 0. <- 0.0000000000
Why is the extraneous decimal point not trimmed? (It can't be left-justified; 12 chars take up entire field width. DecCount?)
• Using DBFcull, I see the DIFF is limited to this field. Good...
• Per DBFmine, 0.0000000000 is the only value for all records

Yes, the problem is in DecCount. It is set to 0 in the original file. Thus
tDBF.fArr[fNum].DecCount = DecCount-MinEx0;
-> 0-10; wraps around & becomes 246. Thus
if (MinEx0 == DecCount) MinEx0++;
never happens.
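The wrap-around can be reproduced in isolation. This is only a sketch, assuming DecCount is stored as a single unsigned byte (as it is in a DBF field descriptor); `naiveDecCount` and `clampedDecCount` are hypothetical names, and the clamp is just one possible workaround.

```cpp
#include <cassert>

// Unchecked subtraction: with decCount == 0 and minEx0 == 10 the
// result is -10, which wraps to 246 when stored in an unsigned byte.
unsigned char naiveDecCount(unsigned char decCount, unsigned char minEx0) {
    return static_cast<unsigned char>(decCount - minEx0);
}

// Guarded version: never trim more decimal places than the field declares.
unsigned char clampedDecCount(unsigned char decCount, unsigned char minEx0) {
    if (minEx0 >= decCount) return 0;
    return static_cast<unsigned char>(decCount - minEx0);
}
```

With the values from the TX file, `naiveDecCount(0, 10)` yields 246 while `clampedDecCount(0, 10)` yields 0.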

A couple workarounds for cases like this should be pretty easy.

First, look thru other files with type F fields, and see if having a Decimal Count of 0 is a common thing, or unique to the TXDOT files.

FWIW, the Decimal Count in the next field (Shape_Leng) is screwy too: Length of 12, Decimal Count of 11. 12 minus the leading zero & the decimal point itself leaves room for only 10 decimal places. This may just be a TXDOT thing...

yakra commented 6 years ago

First, look thru other files with type F fields, and see if having a Decimal Count of 0 is a common thing, or unique to the TXDOT files.

~/gis/data/pe/nrn_rrn_pe_12.0_shp_en/NRN_PE_12_0_ROADSEG.dbf
No type F fields.

~/gis/data/md/SHA_Routes/SHA_LINE_ROUTES_MD_2015
SHAPELENGT, DecCount 0x0B / 11, scientific notation, makes sense
SHAPE_Leng, DecCount 0x0B / 11, scientific notation, makes sense
All good!

~/gis/data/me/medotpubrdss/2016-04-08/medotpubrdss
Shape_len, DecCount 0x0B / 11, scientific notation, makes sense
All good!

~/gis/data/me/e911rdss/e911rds
SHAPE_len, DecCount 0x0B / 11, scientific notation, makes sense
All good!

~/gis/data/me/medotpubrdss/2017-08-24/medotpubrds
PRIM_BMP, PRIM_EMP, SEG_LEN__M, Shape_len
All DecCount 0x0B / 11, scientific notation, make sense
All good!

~/gis/data/nh/roads_dot_2016/Roads_DOT
MP_START, MP_END, SECT_LENGT, SHAPE_Leng
All DecCount 0x0B / 11, scientific notation, make sense
All good!

~/gis/data/ma/RoadInv2017/Road_Inventory.yOrig.dbf
No type F fields.

~/gis/data/ar/ROADS_ACF/ROADS_ACF
unique_id, DecCount 0x0F / 15, only 0s to right of decimal. Blank value exists.
src_code, DecCount 0x0F / 15, only 0s to right of decimal. Blank value exists.
nssda_val, DecCount 0x0F / 15, only 0s to right of decimal. Blank value exists.
ah_blm, DecCount 0x0F / 15, as many as 8 non-0 pl. Blank value. Trimmed DecCount == 3. HOW?
ah_elm, DecCount 0x0F / 15, as many as 8 non-0 pl. Blank value. Trimmed DecCount == 3. HOW?
ah_length, DecCount 0x0F / 15, as many as 8 non-0 pl. Blank value. Trimmed DecCount == 3. HOW?
ah_seg_num, DecCount 0x0F / 15, only 0s to right of decimal. Blank value exists.
Shape_STLe, DecCount 0x0F / 15, all sig figs used & no potential for trimming. No blank value.

yakra commented 6 years ago

AR

Why are no extraneous zeros trimmed? How does DecCount get trimmed from 15 to 3? It's those bloody blank-space values!

Solution:

What happens to a blank-space value?
15 for (pad = 0; (fVal[pad] <= ' ') && pad < len; pad++);
   pad gets set to strlen
16 NewVal = new char[strlen(fVal+pad)+1];
   --> NewVal = new char[strlen("\0")+1];
   --> NewVal = new char[0+1];
17 strcpy(NewVal, fVal+pad);
   --> strcpy(NewVal, "\0");
19 fVal = NewVal;
   fVal == "\0"
   strlen(fVal) == 0
(...Right?)
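The trace above can be checked with a small stand-alone sketch. It assumes the field value is a NUL-terminated string and uses a hypothetical helper, `stripLeadingPad`, that mirrors the padding loop but returns a `std::string` instead of a `new char[]` buffer.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Skip leading padding (any character <= ' ') and return what is left.
// For an all-blank value, pad ends up equal to the string length, so the
// result is the empty string.
std::string stripLeadingPad(const char* fVal) {
    std::size_t len = std::strlen(fVal);
    std::size_t pad = 0;
    while (pad < len && fVal[pad] <= ' ') pad++;
    return std::string(fVal + pad);
}
```

Under these assumptions a blank value does come out with length 0, so it contributes no digits and no trailing zeros of its own.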

yakra commented 6 years ago
  • Reset DecCount along with MinEx0 in the else statement.
  • Create an exception to this for blank values [ if (strlen(fVal)) ? ] so I can still trim stuff. How would this get handled when saving trimmed records to disk?

ah_length gets trimmed 1 byte too many
10.10300000 datum disappears completely
all other values have 1 digit left of decimal

unsigned int RecNum = ((unsigned int)DBFf.tellg()-dbf.HeaLen)/dbf.RecLen+1;

testfiles.1/ROADS_ACF.ah_length-only.dbf
RecNum = (140551-65)/19+1;
RecNum = 140486/19+1;
RecNum = 7394+1;

testfiles.2/ROADS_ACF.ah_length-only.dbf
RecNum = 7394+1; then multiply by RecLen
RecNum = 81334/11+1; then add HeaLen
RecNum = (81399-65)/11+1
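The same arithmetic as a sketch, going from file offset to record number and back. `recNum` and `recOffset` are hypothetical helpers; the header length of 65 and the record lengths of 19 and 11 are the values quoted above.

```cpp
#include <cassert>

// 1-based record number for the record starting at file offset pos.
unsigned int recNum(unsigned int pos, unsigned int heaLen, unsigned int recLen) {
    return (pos - heaLen) / recLen + 1;
}

// Inverse: file offset at which 1-based record rec starts.
unsigned int recOffset(unsigned int rec, unsigned int heaLen, unsigned int recLen) {
    return (rec - 1) * recLen + heaLen;
}
```

Offset 140551 in the 19-byte-record file and offset 81399 in the 11-byte-record file both map to record 7395.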

The datum there is 0.10300000, i.e., the '1' at the beginning got trimmed... All data before that only goes to 3 decimal places.

10.103000000000000 stays the same as established from Record 1:
MinEx0 = 12
DecCount = 15-12 = 3
len = strlen(fVal)-MinEx0; len = 18-12; len = 6
Later, we reach 0.032541810000000 @ record 9237
MinEx0 = 7
DecCount = 15-7 = 8
if (strlen(fVal) > tDBF.fArr[fNum].len+MinEx0)
if (17 > 6+7)
if (17 > 13) (Yes.)
len = strlen(fVal)-MinEx0; len = 17-7; len = 10

Long story short, the solution is to track how many digits are to the left of the decimal.
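One way to picture that fix is the sketch below. It is not the DBFtrim code: `FieldWidth` and `trimmedWidth` are hypothetical, and the values are assumed to be unpadded, non-blank, and free of exponents. The idea is to track the widest integer part and the smallest count of trailing zeros independently, then derive the field length from both.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct FieldWidth {
    std::size_t maxIntD;   // most characters seen left of the decimal point
    std::size_t decCount;  // decimal places that must be kept
    std::size_t len;       // resulting field length
};

FieldWidth trimmedWidth(const std::vector<std::string>& vals,
                        std::size_t origDecCount) {
    std::size_t maxIntD = 0;
    std::size_t minEx0 = origDecCount;  // fewest trailing zeros seen so far
    for (const std::string& v : vals) {
        std::size_t dot = v.find('.');
        if (dot == std::string::npos) continue;
        // A leading '-' would count as an integer-side character here.
        maxIntD = std::max(maxIntD, dot);
        std::size_t lastNon0 = v.find_last_not_of('0');
        minEx0 = std::min(minEx0, v.size() - 1 - lastNon0);
    }
    std::size_t decCount = origDecCount - minEx0;
    // Drop the decimal point as well when no decimal places remain.
    std::size_t len = maxIntD + (decCount ? 1 + decCount : 0);
    return {maxIntD, decCount, len};
}
```

For the two ah_length values traced above, this gives 2 integer digits, 8 decimal places, and a length of 11, which is enough to hold 10.10300000 without losing its leading '1'.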

yakra commented 6 years ago

MaxIntD implemented...

TX: v1 & v3 no diff. Good.
PE: v0 & v3 no diff. Good.
MD: v0 & v3 no diff. Good.
ME e911rds: v0 & v3 no diff. Good.
NH: v0 & v3 no diff. Good.
ME DOT16: v0 & v3 no diff. Good.
ME DOT17: v0 & v3 no diff. Good.
MA: v0 & v3 no diff. Good.

And finally, AR...
unique_id: trim ".000000000000000"; save 16 B
src_code: trim ".000000000000000"; save 16 B
nssda_val: trim ".000000000000000"; save 16 B
ah_blm: trim "0000000"; save 7 B
ah_elm: trim "0000000"; save 7 B
ah_length: trim "0000000"; save 7 B
ah_seg_num: trim ".000000000000000"; save 16 B
Shape_STLe cannot be trimmed
TOTAL: save 85 B / record x 439026 records = 37317210 B
610687600 (v1) - 37317210 = 573370390

And the verdict is... 573370390. YES!

yakra commented 6 years ago

"The TX exception" to trim extraneous decimal point

if (MinEx0 >= DecCount) tDBF.fArr[fNum].DecCount = 0;

...inter alia.

TX: v4 filesize 27610 B greater than v1 thru v3. Good! All other files identical to v3.