yakra / tmtools

Tools to aid in development of the TravelMapping project

DBFtrim: leading whitespace in type C fields #13

Closed yakra closed 6 years ago

yakra commented 6 years ago

For example, ROADS_ACF.yTrim4.dbf meta_id contains a datum of " 29159". All other records are a maximum of 5 characters. This could save 439026 bytes.

Also test Road_Inventory.yOrig.dbf:

FieldName   Type    Length  Max
Station     C       80      0

to make sure nothing goes haywire. Although, it should be pretty cut & dry.

Best to tackle this concurrently with #7

yakra commented 6 years ago

In the ROADS_ACF example, as other records are still left-justified, leading whitespace will not be trimmed if I assume a constant "margin".

Maybe look into more intelligent solutions, during file copy process
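A minimal sketch of why a constant margin does not help here, assuming field values are held as fixed-width strings. The helper names and the sample values are hypothetical and are not taken from DBFtrim; the sketch only compares the field width needed under the two approaches.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: width needed if every record loses the SAME number of
// leading spaces (the constant "margin" approach).
std::size_t widthWithConstantMargin(const std::vector<std::string>& values,
                                    std::size_t fieldLen)
{
    std::size_t margin = fieldLen; // smallest leading-space count seen so far
    std::size_t maxEnd = 0;        // one past the rightmost non-space character
    for (const std::string& v : values)
    {
        std::size_t first = v.find_first_not_of(' ');
        if (first == std::string::npos) continue; // blank value: no constraint
        margin = std::min(margin, first);
        maxEnd = std::max(maxEnd, v.find_last_not_of(' ') + 1);
    }
    return maxEnd > margin ? maxEnd - margin : 0;
}

// Hypothetical helper: width needed if each value is trimmed on both sides
// independently of the other records.
std::size_t widthWithPerValueTrim(const std::vector<std::string>& values)
{
    std::size_t width = 0;
    for (const std::string& v : values)
    {
        std::size_t first = v.find_first_not_of(' ');
        if (first == std::string::npos) continue; // blank value
        width = std::max(width, v.find_last_not_of(' ') - first + 1);
    }
    return width;
}
```

With one value like " 29159" among left-justified neighbours, the constant margin is 0, so the first helper still needs 6 characters while the second needs only 5.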

yakra commented 6 years ago

Pennsylvania_Local_Roads.dbf (3.1 GB) exhibits the mirror image of the usual convention. Its...
• Type C fields are right-justified.
• Type N fields are left-justified.

Thus it cannot be trimmed at all, and the output file is identical to the input file.

yakra commented 6 years ago

~/gis/data/ma/RoadInv2017/Road_Inventory.yOrig.dbf
1 byte per record smaller
Route_Dire 3 -> 2

./DBFmine /home/yakra/gis/data/ma/RoadInv2017/Road_Inventory.yOrig.dbf Route_Dire 

 NB                                                                             
EB                                                                              
NB                                                                              
SB                                                                              
SN                                                                              
WB                                                                              
eb                                                                              
nb                                                                              

BOOM!

yakra commented 6 years ago

~/gis/data/ar/ROADS_ACF/ROADS_ACF.yOrig.dbf
HOSED -- only ~40k written. No error message or crash.
filesize varies depending on number of threads: 1 thread = 36184 B, +1305 for each additional thread (tested up to 9).
RecLen == 1305

Starting from the position of the terminal 0x1A in the file written with one fewer thread, there are 180 0x00s, then more record info...

ROADS_ACF.yTrim--thr_1.dbf: terminal 0x1A @ 36183, in record 26, which begins @ 35058 -- 1125 bytes into it. 1125 B is the offset from the beginning of the record to ah_blm

ROADS_ACF.yTrim.dbf (v4): ah_blm in record 26, @ 36209, is blank. Yes, I was just starting to suspect that...

field.cpp:
19  if (strchr(fVal, '.') && !strchr(fVal, 'E') && !strchr(fVal, 'e'))
Nope

else    if (vlen)
    {   MinEx0 = 0;
        tDBF.fArr[fNum].DecCount = DecCount;
    }

vlen == 0, so this never happens.
MinEx0 becomes 7
DecCount becomes 8

DBFtrim.cpp:
26  for (fI = vlen-oDBF.fArr[fNum].MinEx0; fI < tDBF.fArr[fNum].len; fI++) outDBF.put(' ');
->  for (fI = 0-7; fI < 12; fI++) outDBF.put(' ');
->  for (fI = -7; fI < 12; fI++) outDBF.put(' ');
fI is an unsigned char, so that makes this
->  for (fI = 249; fI < 12; fI++) outDBF.put(' ');
...and the loop never executes, because 249 > 12.

If fI were simply made an int, this would not work, of course. This would write 19 spaces to the field, more than the 12 it's meant to hold. MinEx0 > vlen because this is a blank value in a field otherwise containing decimal values, with a MinEx0. This is the only case where this should ever occur.
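The two cases above can be reproduced in isolation. This is a small sketch, not DBFtrim code: the function names are hypothetical, the loops only count iterations instead of writing spaces, and an 8-bit unsigned char is assumed.

```cpp
// Counts how many padding spaces the loop would emit when the counter is an
// unsigned char, as in the issue. Assumes CHAR_BIT == 8, so -7 wraps to 249.
int paddingWithUnsignedChar(int vlen, int minEx0, int len)
{
    int written = 0;
    for (unsigned char fI = static_cast<unsigned char>(vlen - minEx0); fI < len; fI++)
        written++; // stands in for outDBF.put(' ')
    return written;
}

// Same loop with a plain int counter: a negative start value is kept as-is,
// so the loop runs from -7 up to len-1.
int paddingWithInt(int vlen, int minEx0, int len)
{
    int written = 0;
    for (int fI = vlen - minEx0; fI < len; fI++)
        written++; // stands in for outDBF.put(' ')
    return written;
}
```

For vlen = 0, MinEx0 = 7 and len = 12, the first function counts 0 spaces and the second counts 19. For a non-blank value with vlen >= MinEx0, both agree.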

27  outDBF.write(fVal, tDBF.fArr[fNum].len-fI+vlen-oDBF.fArr[fNum].MinEx0);
->  outDBF.write(<31 null 0s>, 12-fI+0-7);
If fI == 249, then 12-fI+0-7 == 12-249+0-7 == -244.
If fstream::write gets passed a negative length to write, does this just treat it as EOF? (Learn more about fstream, and its error codes etc., and see how exactly this would play out...)
This appears to be what causes RecWrite in each thread to just stop writing.
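What a negative count does inside write() is not established in this thread and may differ between standard library implementations. What the C++ standard does guarantee is that once a stream has an error flag set, later put() and write() calls do nothing and report nothing unless exceptions were enabled. The sketch below shows only that consequence; it uses std::ostringstream and an explicit setstate() as a stand-in for the failing call, and the function name is hypothetical.

```cpp
#include <ios>
#include <sstream>
#include <string>

// Returns what actually ends up in the stream when an error flag is set
// part-way through a sequence of writes.
std::string writeAroundFailure()
{
    std::ostringstream out;
    out.write("good", 4);           // succeeds
    out.setstate(std::ios::badbit); // stand-in for the write() that went wrong
    out.write("lost", 4);           // silently skipped: no exception, no crash
    out.put('!');                   // likewise skipped
    return out.str();               // only the bytes written before the failure
}
```

Checking the stream state after each record (for example `if (!outDBF) ...`) would turn this kind of silent truncation into a visible error.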

180 bytes revisited
180 = 1305 (Record Length) - 1125 (the point at which things go nuts). The next thread's seekp sets the output position ahead to the next record, 180 B ahead of the last good bytes written. Hence 180 00s.
After the initial blank datum is encountered, I think I'm only getting as many partial records as there are threads running. The formula (NumThreads+24)*1305+2433+1125+1 predicts the sizes of my 9 test files.
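The formula can be checked against the figures reported earlier in the thread. In this sketch the constant names are hypothetical labels, and reading 2433 as the offset of the first record is an inference from record 26 beginning at 35058 (35058 - 25*1305 = 2433), not something stated outright.

```cpp
// Predicted size in bytes of the truncated output file for a given thread count,
// following (NumThreads+24)*1305+2433+1125+1 from the comment above.
long predictedSize(int numThreads)
{
    const long recLen      = 1305; // RecLen reported in the thread
    const long dataOffset  = 2433; // assumed: offset of record 1 (inferred, see lead-in)
    const long partialRec  = 1125; // bytes written of the record that hits the blank ah_blm
    const long terminator  = 1;    // the terminal 0x1A byte
    return (numThreads + 24) * recLen + dataOffset + partialRec + terminator;
}
```

predictedSize(1) gives 36184, matching the single-thread file, and each additional thread adds 1305 bytes.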

yakra commented 6 years ago

Solution:

switch (oDBF.fArr[fNum].type)
{   case 'N':   case 'F':
    fTrim(fVal, vlen);
    for (fI = vlen-oDBF.fArr[fNum].MinEx0; fI < tDBF.fArr[fNum].len; fI++) outDBF.put(' ');
    outDBF.write(fVal, tDBF.fArr[fNum].len-fI+vlen-oDBF.fArr[fNum].MinEx0);
    break;
// etc...

->

switch (oDBF.fArr[fNum].type)
{   case 'N':   case 'F':
    fTrim(fVal, vlen);
    if (vlen)
    {   for (fI = vlen-oDBF.fArr[fNum].MinEx0; fI < tDBF.fArr[fNum].len; fI++) outDBF.put(' ');
        outDBF.write(fVal, tDBF.fArr[fNum].len-fI+vlen-oDBF.fArr[fNum].MinEx0);
    }
    else    for (fI = 0; fI < tDBF.fArr[fNum].len; fI++) outDBF.put(' ');
    break;
// etc...

Everywhere except AR, there's no DIFF from v5 files. For ~/gis/data/ar/ROADS_ACF/ROADS_ACF.yOrig.dbf, the v6 file is 439026 B smaller than v5, as predicted in the OP -- w0ot!
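The fix can also be expressed without the shared loop counter. This is a sketch rather than the committed code: formatNumericField is a hypothetical stand-in that returns the bytes for one type N/F field, and it assumes, from reading the snippet above, that MinEx0 is the number of trailing characters dropped from every non-blank value.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the 'N'/'F' case: builds a right-justified field of
// exactly outLen bytes from an already-trimmed value.
std::string formatNumericField(const std::string& trimmed,
                               std::size_t minEx0,
                               std::size_t outLen)
{
    // Blank datum: the whole field is spaces, and minEx0 is not applied.
    if (trimmed.empty()) return std::string(outLen, ' ');

    // Drop the trailing minEx0 characters, never going below zero.
    std::size_t keep = trimmed.size() > minEx0 ? trimmed.size() - minEx0 : 0;

    // Defensive clamp (an addition of this sketch): should not trigger if
    // outLen was computed from the same data.
    if (keep > outLen) keep = outLen;

    return std::string(outLen - keep, ' ') + trimmed.substr(0, keep);
}
```

Because every branch returns exactly outLen bytes, a blank value can no longer shorten or overrun the record.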

yakra commented 6 years ago

> Pennsylvania_Local_Roads.dbf (3.1 GB) exhibits the mirror image of the usual convention. Its...
> • Type C fields are right-justified.
> • Type N fields are left-justified.
>
> Thus it cannot be trimmed at all, and the output file is identical to the input file.

I looked too quickly; the bit about the justification is incorrect. This file cannot be trimmed because the field data isn't being read correctly at all. This isn't new; it shows up in one form or another since my earliest saved DBFtrim executables. By all other indications, the updates to trim whitespace from both L & R have otherwise been successful. I'm going to commit the changes and track the Pennsylvania issue separately in #26.