yakra / DataProcessing

Data Processing Scripts and Programs for Travel Mapping Project
0 stars 0 forks source link

TMBitset leftovers #252

Open yakra opened 12 months ago

yakra commented 12 months ago

Remaining items from #245 after the merge of TravelMapping#592.

Higher Priority:

Medium priority:

Low Priority:

yakra commented 11 months ago

If units were a static member or function, what would happen if I cast a pointer to a type with a different unit? Gotta be careful to not go past the end of the array though.

WRT units as a static member, it only makes sense to do this if len is as well. Don't wanna do this prematurely. It's possible that in the future I'd want different TMBitset objects of the same instantiation referring to different ranges of items, and could shoot myself in the foot.

That said, static or not, such punning may be useful for iteration, which doesn't use units. We'd be able to minimize allocated space and benefit from optimal iteration, in cases where |= etc. are unused or we're willing to take the performance hit.

yakra commented 3 months ago

traveler_num solutions

yakra commented 3 months ago

e586 get augment index once for siteupdateST (TMBEpsilon) ecef from TMBPreAlpha

What made it into master: https://github.com/yakra/DataProcessing/blob/387c99b0c260938c870457bb9793d4f78b09987f/siteupdate/cplusplus/tasks/threaded/ConcAug.cpp#L11-L19

A micro-optimization:

size_t index = 0;
for (TravelerList *t = TravelerList::allusers.data, *end = TravelerList::allusers.end(); t != end; t++)
{   cout << '.' << flush;
    for (HighwaySegment *s : t->clinched_segments)
      if (s->concurrent)
        for (HighwaySegment *hs : *(s->concurrent))
          if (hs != s && hs->route->system->active_or_preview() && hs->add_clinched_by(index))
        concurrencyfile << "Concurrency augment for traveler " << t->traveler_name << ": [" << hs->str() << "] based on [" << s->str() << "]\n";
    index++;
}
yakra commented 3 months ago

f602 Index calculated once per chopped route. (TMBEpsilon) 48c8 "partway there" from add_index, ElStacko & TMBPreAlpha

Index calculated once per chopped route. Will calculating it once in TravelerList ctor & passing it into store_traveled_segments help, or be a wash due to overhead of pushing it onto the stack?

Ah yes, that's the one. Function arguments use registers, not the stack.

Relevant bits that made it into master: https://github.com/yakra/DataProcessing/blob/d9f02319b36a2b40eba94fdd8fa65a7ea06e89a9/siteupdate/cplusplus/classes/Route/store_traveled_segments.cpp#L9-L12

yakra commented 3 months ago

meta-iteration counts

children of trav (nixtl) @ 7f7d61e, nixtl branch

2 files have identical diffs in both stashes:

TMBitset.cpp, right after the #includes

extern bool g_add;
extern unsigned long g_elses;
extern unsigned long g_ifs;
extern unsigned long g_its;
extern unsigned long g_loops;

Older iterator format, but easy enough to figure out

void operator ++ ()
{ do {  if (!(bits >>= 1))
         {  p += 8*sizeof(unit)-bit;
        bit=0;
        bits = b.data[++u];
        g_ifs += g_add;
         }
    else {  ++bit;
        ++p;
        g_elses += g_add;
         }
    g_loops += g_add;
     }  while (!(bits&1));
  g_its += g_add;
}

siteupdate.cpp, right up top

bool g_add = 0;
unsigned long g_elses = 0;
unsigned long g_ifs = 0;
unsigned long g_its = 0;
unsigned long g_loops = 0;

Comment out SQL file after DATA CHECK SUCCESSFUL, before timestamp = time(0);

cout << g_loops << '\t' << g_ifs << '\t' << g_elses << endl;
cout << g_its << endl;

61839d9, meta-iteration count HighwayGraph::write_subgraphs_tmg

// traveler names
g_add = 1;
for (TravelerList *t : traveler_set)
    travelfile << t->traveler_name << ' ';
g_add = 0;

660c060, meta-iteration count (CompStats) Region::compute_stats.cpp g_add=1; this block; g_add=0;

yakra commented 3 months ago

64-16-bit punning

void operator |= (TMBitset<item,unit>& other)
{   long  *e64 = (long*)(data+units)-1;
    long  *d64 = (long*)data, *o64 = (long*)other.data;
    while (d64 <= e64)    *d64++   |=   *o64++;
    unsigned short *e16 = (unsigned short*)(data+units);
    unsigned short *d16 = (unsigned short*)(e64)+1, *o16 = d16-data+other.data;
    while (d16 < e16)     *d16++   |=   *o16++;
}

64-32-16-bit punning

void operator |= (TMBitset<item,unit>& other)
{   long  *end = (long*)(data+units)-1;

    long  *d64 = (long*)data, *o64 = (long*)other.data;
    while (d64 <= end)    *d64++   |=   *o64++;

    int   *d32 = (int*)  d64, *o32 = (int*)  o64;
    if    (units & 2)     *d32++   |=   *o32++;

    if    (units & 1) *(short*)d32 |= *(short*)o32;
}

punning SIO2

void operator |= (TMBitset<item,unit>& other)
{   long  *end = (long*)(data+units)-1;
    long  *d64 = (long*)data, *o64 = (long*)other.data;
    while (d64 <= end)    *d64++   |=   *o64++;
    int   *d32 = (int*)  d64, *o32 = (int*)  o64;
    if    (units & 4)     *d32++   |=   *o32++;
    short *d16 = (short*)d32, *o16 = (short*)o32;
    if    (units & 2)     *d16++   |=   *o16++;
    if    (units & 1) *(char*)d16 |= *(char*)o16;
}
yakra commented 3 months ago

Complex Iterator Optimization

This was based on removing the unit parameter (953a) from TMBitset entirely. I won't bother reprinting the full details.


Then (03ec), same punning as SIO2 above, except no unit param, and units -> bytes.


8-bit (f749) After declaring bits & before friend iterator

static constexpr unsigned char skip_vals[] =
{   8,1,2,1,3,1,2,1,4,1,2,1,3,1,2,1,5,1,2,1,3,1,2,1,4,1,2,1,3,1,2,1,
    6,1,2,1,3,1,2,1,4,1,2,1,3,1,2,1,5,1,2,1,3,1,2,1,4,1,2,1,3,1,2,1,
    7,1,2,1,3,1,2,1,4,1,2,1,3,1,2,1,5,1,2,1,3,1,2,1,4,1,2,1,3,1,2,1,
    6,1,2,1,3,1,2,1,4,1,2,1,3,1,2,1,5,1,2,1,3,1,2,1,4,1,2,1,3,1,2,1
};

At the very end of the file template <class item> constexpr unsigned char TMBitset<item>::iterator::skip_vals[];

And finally, iteration itself

void operator ++ ()
{ do {  const unsigned char skip_ahead = skip_vals[bits >>= 1];
    if (skip_ahead >= 8-bit)
         {  bits = b.data[++u];
        index += 8-bit;
        bit=0;
         }
    else {  bits >>= skip_ahead-1;
        index += skip_ahead;
        bit += skip_ahead;
         }
     }  while (!(bits&1));
}

Later (next commit after 16-bit) changed to TM::skip_vals[bits >> 1] and bits >>= skip_ahead; -- cleaner code, and should make no difference performance-wise, using one more register to do either subtraction or a shift.

Not the best solution.


16-bit (ec4e) ...is a godawful monstrosity. ;)

TMBitset.cpp: char->short, bytes->words, 8->16, 1->2 for calloc & memcmp. skip_vals array deleted; see TM::skip_vals.

TMGlobal.h

namespace TM
{   extern unsigned char skip_vals[32768];
    unsigned char get_skip(unsigned short);
    void set_skip_vals();
}

TMGlobal.cpp (because I was unsuccessful in trying to constexpr it)

namespace TM
{   unsigned char get_skip(unsigned short val)
    {   unsigned char skip;
        for (skip=1; skip<16; ++skip)
          if (val & 1) return skip;
          else val >>= 1;
        return skip;
    }

    unsigned char skip_vals[32768];

    void set_skip_vals()
    {   for (unsigned short v=0; v<32768; ++v)
          skip_vals[v] = get_skip(v);
    }
}

Makefile classes/TMGlobal/TMGlobal.o in CommonObjects

siteupdate.cpp #include "classes/TMGlobal/TMGlobal.h" TM::set_skip_vals(); afer the first 4 variables (3 lines) on main, right before argument parsing.


ternaries (2d94) I doubt all of these would be compiled as CMOVs, but here it is for historical interest's sake. Using __builtin_ctz within the current if/else framework will surely perform better.

void operator ++ ()
{ do {  const unsigned char skip_ahead = TM::skip_vals[bits >> 1];
    const bool same_u = skip_ahead < 16-bit;
    bits   = same_u ? bits >> skip_ahead : b.data[++u];
    index += same_u ?     skip_ahead : 16-bit;
    bit    = same_u ? bit  +  skip_ahead : 0;
     }  while (!(bits&1));
}

And even if I did use __builtin_ctz in this context... While I do see a lot of potential for pipeline stalls in TMB iteration, preventing them may be outweighed by all the crap I have to put into the pipeline to get these to compile as CMOVs. Yeah no. Let's keep the code more readable & intuitive.

yakra commented 3 months ago

LOL, here it is for posterity's sake; the very first TMBitset before any optimizations or tweaks.

#include <cmath>
#include <cstdlib>
#include <cstring>

template <class item, class unit> class TMBitset
{   item* const array;
    unit* data;
    const size_t size;
    const size_t units;

    static constexpr size_t maxbits = 8*sizeof(unit)-1;

    public:
    TMBitset(const size_t s, item* const a):
      array( a ),
      size ( s ),
      units( ceil(double(size+1)/8/sizeof(unit)) )
    {   data = (unit*const)calloc(units, sizeof(unit));
        //create dummy end() item
        data[size/8/sizeof(unit)] |= 1 << size%(8*sizeof(unit));
    }

    ~TMBitset() {free(data);}

    bool insert(item* const v)
    {   const size_t index = v-array;
        const unit u = data[index/8/sizeof(unit)];
        data[index/8/sizeof(unit)] |= 1 << index%(8*sizeof(unit));
        return u != data[index/8/sizeof(unit)];
    }

    bool operator () (item* const v)
    {   const size_t index = v-array;
        return data[index/8/sizeof(unit)] & 1 << index%(8*sizeof(unit));
    }

    bool operator != (TMBitset<item,unit>& other) {return memcmp(data, other.data, units*sizeof(unit));}

    void operator |= (TMBitset<item,unit>& other)
    {   for (size_t u = 0; u < units; ++u)
          data[u] |= other.data[u];
    }

    class iterator
    {   TMBitset& b;
        size_t bit, u, index;
        unit bits;

        friend iterator TMBitset::begin();

        public:
        iterator(TMBitset& bs, size_t i): b(bs), u(i/8/sizeof(unit)), bit(i%(8*sizeof(unit))), index(i), bits(b.data[u]) {}

        item* const operator * () const {return b.array+index;}

        void operator ++ ()
        { do {  ++index;
            if (bit == maxbits)
                 {  bit=0;
                ++u;
                bits = b.data[u];
                 }
            else {  bits >>= 1;
                ++bit;
                 }
             }  while (!(bits&1));
        }

        bool operator != (iterator& other) const {return index != other.index;}
    };

    iterator begin()
    {   iterator it(*this, 0);
        if (!(it.bits&1)) ++it;
        return it;
    }

    iterator end() {return iterator(*this, size);}
};

TMBitset 1st draft, ecff52f, first TMBitset branch, 2nd commit therein, right after store TravelerLists in contiguous memory.

Implemented:

  • add_clinched_by (ReadList, ConcAug)
  • Route::clinched_by_traveler (UserLog)
  • HGEdge compression

ToDo:

  • traveled graph traveler list compilation
  • other minor traveled graph improvements
  • improve iteration efficiency