Open yakra opened 12 months ago
WRT units
as a static member, it only makes sense to do this if len
is as well.
Don't wanna do this prematurely.
It's possible that in the future I'd want different TMBitset
objects of the same instantiation referring to different ranges of item
s, and could shoot myself in the foot.
That said, static or not, such punning may be useful for iteration, which doesn't use units
.
We'd be able to minimize allocated space and benefit from optimal iteration, in cases where |=
etc. are unused or we're willing to take the performance hit.
traveler_num
solutionseaa6
,
the first attempt, has the simplest diff but performed worse than no-build.
Only including this for historical curiosity's sake.
traveler_num
directly.
Other than that...for (size_t index = 0; index < traveler_lists.size(); ++index)
if (clinched_by(traveler_lists[index]))
code[index/4] += 1 << index%4;
Also, traveler_lists
is no longer even passed to clinchedby_code
-- master & subgraph routines no longer use compatible data types.
8be1
& 83ed
are the same idea, adapted for std::vector
and TMBitset
respectively.
static thread_local unsigned int* nums;
while
loop; delete[]
aftert->traveler_num[threadnum] = travnum++;
-> TravelerList::nums[t-TravelerList::allusers.data] = travnum++;
traveler_lists
and do it master graph style.write_master_graphs_tmg()
// clinchedby_code setup: 1st the buffer...
size_t nibbles = ceil(double(TravelerList::allusers.size)/4);
char* cbycode = new char[nibbles+1];
// deleted after writing edges
cbycode[nibbles] = 0;
// ...and then traveler numbers
TravelerList::nums = new unsigned int[TravelerList::allusers.size];
// deleted after writing edges
for (unsigned int t = 0; t < TravelerList::allusers.size; ++t) TravelerList::nums[t] = t;
// now edges, only write if not already written
... and then delete[] TravelerList::nums;
after cbycode
.
8be1
allocated & deleted for MT only, while 83ed
did so unconditionally.
May as well keep it unconditional; cleaner code. One extra allocation for siteupdateST won't break us, and ST isn't meant to be the efficient option anyway.
Speaking of which...8be1
allocated & deleted for ST, both master & sub.83ed
had no changes, meaning siteupdateST didn't allocate for subgraphs. Oops.#ifdef
s or #ifndefs
.code[TravelerList::nums[t-TravelerList::allusers.data]/4] +=
1 << TravelerList::nums[t-TravelerList::allusers.data]%4;
Optimizations:
for (auto t = clinched_by.ibegin(), end = clinched_by.iend(); t != end; ++t)
New TMBitset iterators returning array indices instead of pointers.
The tricky part will be dealing with different return types (item
vs size_t
) & getting it right in TMBitset without duplicating lots of code.
Probably make TMBitset::iterator
a template class...{ auto num = TravelerList::nums[*t];
code[num/4] += 1 << num%4;
}
...although the compiler may optimize it away, as it did with the equivalent change to 387c99b. Still though -- more legible code?
threadnum
no longer needed as a function argument9fdc
is pretty much the same deal, except it doesn't use a thread-local variable.
e586
get augment index once for siteupdateST (TMBEpsilon)
ecef
from TMBPreAlpha
What made it into master: https://github.com/yakra/DataProcessing/blob/387c99b0c260938c870457bb9793d4f78b09987f/siteupdate/cplusplus/tasks/threaded/ConcAug.cpp#L11-L19
A micro-optimization:
size_t index = 0;
for (TravelerList *t = TravelerList::allusers.data, *end = TravelerList::allusers.end(); t != end; t++)
{ cout << '.' << flush;
for (HighwaySegment *s : t->clinched_segments)
if (s->concurrent)
for (HighwaySegment *hs : *(s->concurrent))
if (hs != s && hs->route->system->active_or_preview() && hs->add_clinched_by(index))
concurrencyfile << "Concurrency augment for traveler " << t->traveler_name << ": [" << hs->str() << "] based on [" << s->str() << "]\n";
index++;
}
f602
Index calculated once per chopped route. (TMBEpsilon)
48c8
"partway there" from add_index, ElStacko & TMBPreAlpha
Index calculated once per chopped route. Will calculating it once in TravelerList ctor & passing it into store_traveled_segments help, or be a wash due to overhead of pushing it onto the stack?
Ah yes, that's the one. Function arguments use registers, not the stack.
Relevant bits that made it into master: https://github.com/yakra/DataProcessing/blob/d9f02319b36a2b40eba94fdd8fa65a7ea06e89a9/siteupdate/cplusplus/classes/Route/store_traveled_segments.cpp#L9-L12
28af
/add_index
: local_indexadd_clinched_by
takes no args28af
used local_index
and shared_index
.
Instead, probably use tl_index
and existing tl_it
.
WRT efficiency, it's a wash. We're either adding to the index to get a pointer, or subtracting from a pointer to get the index.
28af
's approach has uglier, more verbose code, unless I try a using-declaration.
The tl_index
approach has cleaner code and smaller diffs, especially for the threaded tasks.
Probably use variable assignment from adcf
.adcf
/ElStacko
: calculate index once per travelersize_t index = this - allusers.data();
right after // variables used in construction
Route::store_traveled_segments
takes size_t index
as 2nd argument
Nix the line where it's calculatedstatic thread_local size_t TravelerList::local_index
?
See https://github.com/TravelMapping/DataProcessing/pull/592/#issuecomment-1627784490children of trav (nixtl) @ 7f7d61e
, nixtl branch
2 files have identical diffs in both stashes:
TMBitset.cpp, right after the #include
s
extern bool g_add;
extern unsigned long g_elses;
extern unsigned long g_ifs;
extern unsigned long g_its;
extern unsigned long g_loops;
Older iterator format, but easy enough to figure out
void operator ++ ()
{ do { if (!(bits >>= 1))
{ p += 8*sizeof(unit)-bit;
bit=0;
bits = b.data[++u];
g_ifs += g_add;
}
else { ++bit;
++p;
g_elses += g_add;
}
g_loops += g_add;
} while (!(bits&1));
g_its += g_add;
}
siteupdate.cpp, right up top
bool g_add = 0;
unsigned long g_elses = 0;
unsigned long g_ifs = 0;
unsigned long g_its = 0;
unsigned long g_loops = 0;
Comment out SQL file
after DATA CHECK SUCCESSFUL
, before timestamp = time(0);
cout << g_loops << '\t' << g_ifs << '\t' << g_elses << endl;
cout << g_its << endl;
61839d9
, meta-iteration count
HighwayGraph::write_subgraphs_tmg
// traveler names
g_add = 1;
for (TravelerList *t : traveler_set)
travelfile << t->traveler_name << ' ';
g_add = 0;
660c060
, meta-iteration count (CompStats)
Region::compute_stats.cpp
g_add=1;
this block; g_add=0;
64-16-bit punning
void operator |= (TMBitset<item,unit>& other)
{ long *e64 = (long*)(data+units)-1;
long *d64 = (long*)data, *o64 = (long*)other.data;
while (d64 <= e64) *d64++ |= *o64++;
unsigned short *e16 = (unsigned short*)(data+units);
unsigned short *d16 = (unsigned short*)(e64)+1, *o16 = d16-data+other.data;
while (d16 < e16) *d16++ |= *o16++;
}
64-32-16-bit punning
void operator |= (TMBitset<item,unit>& other)
{ long *end = (long*)(data+units)-1;
long *d64 = (long*)data, *o64 = (long*)other.data;
while (d64 <= end) *d64++ |= *o64++;
int *d32 = (int*) d64, *o32 = (int*) o64;
if (units & 2) *d32++ |= *o32++;
if (units & 1) *(short*)d32 |= *(short*)o32;
}
punning SIO2
void operator |= (TMBitset<item,unit>& other)
{ long *end = (long*)(data+units)-1;
long *d64 = (long*)data, *o64 = (long*)other.data;
while (d64 <= end) *d64++ |= *o64++;
int *d32 = (int*) d64, *o32 = (int*) o64;
if (units & 4) *d32++ |= *o32++;
short *d16 = (short*)d32, *o16 = (short*)o32;
if (units & 2) *d16++ |= *o16++;
if (units & 1) *(char*)d16 |= *(char*)o16;
}
This was based on removing the unit
parameter (953a
) from TMBitset entirely. I won't bother reprinting the full details.
units
-> bytes
or words
(8*sizeof(unit))
-> 8
, etc.maxbits
removed (this was before ubits
)Then (03ec
), same punning as SIO2 above, except no unit
param, and units
-> bytes
.
8-bit (f749
)
After declaring bits
& before friend iterator
static constexpr unsigned char skip_vals[] =
{ 8,1,2,1,3,1,2,1,4,1,2,1,3,1,2,1,5,1,2,1,3,1,2,1,4,1,2,1,3,1,2,1,
6,1,2,1,3,1,2,1,4,1,2,1,3,1,2,1,5,1,2,1,3,1,2,1,4,1,2,1,3,1,2,1,
7,1,2,1,3,1,2,1,4,1,2,1,3,1,2,1,5,1,2,1,3,1,2,1,4,1,2,1,3,1,2,1,
6,1,2,1,3,1,2,1,4,1,2,1,3,1,2,1,5,1,2,1,3,1,2,1,4,1,2,1,3,1,2,1
};
At the very end of the file
template <class item> constexpr unsigned char TMBitset<item>::iterator::skip_vals[];
And finally, iteration itself
void operator ++ ()
{ do { const unsigned char skip_ahead = skip_vals[bits >>= 1];
if (skip_ahead >= 8-bit)
{ bits = b.data[++u];
index += 8-bit;
bit=0;
}
else { bits >>= skip_ahead-1;
index += skip_ahead;
bit += skip_ahead;
}
} while (!(bits&1));
}
Later (next commit after 16-bit) changed to TM::skip_vals[bits >> 1]
and bits >>= skip_ahead;
-- cleaner code, and should make no difference performance-wise, using one more register to do either subtraction or a shift.
Not the best solution.
(skip_ahead >= 8-bit)
with (0 == (bits &= ~(unit)1))
may be slightly more efficient. It looks like about the same number of ops on the surface but...
bits &= ~(unit)1
stores its result in the original register whereas 8-bit
needs a new register for its result.skip_ahead >= rhs
means doing a comparison to get our bool, whereas with 0 == rhs
, the bool is rhs
.skip_ahead
only some of the time, within the else
block.__builtin_ctz
.16-bit (ec4e
)
...is a godawful monstrosity. ;)
TMBitset.cpp: char->short, bytes->words, 8->16, 1->2 for calloc & memcmp. skip_vals array deleted; see TM::skip_vals.
TMGlobal.h
namespace TM
{ extern unsigned char skip_vals[32768];
unsigned char get_skip(unsigned short);
void set_skip_vals();
}
TMGlobal.cpp (because I was unsuccessful in trying to constexpr it)
namespace TM
{ unsigned char get_skip(unsigned short val)
{ unsigned char skip;
for (skip=1; skip<16; ++skip)
if (val & 1) return skip;
else val >>= 1;
return skip;
}
unsigned char skip_vals[32768];
void set_skip_vals()
{ for (unsigned short v=0; v<32768; ++v)
skip_vals[v] = get_skip(v);
}
}
Makefile
classes/TMGlobal/TMGlobal.o
in CommonObjects
siteupdate.cpp
#include "classes/TMGlobal/TMGlobal.h"
TM::set_skip_vals();
afer the first 4 variables (3 lines) on main
, right before argument parsing.
ternaries (2d94
)
I doubt all of these would be compiled as CMOVs, but here it is for historical interest's sake.
Using __builtin_ctz
within the current if/else framework will surely perform better.
void operator ++ ()
{ do { const unsigned char skip_ahead = TM::skip_vals[bits >> 1];
const bool same_u = skip_ahead < 16-bit;
bits = same_u ? bits >> skip_ahead : b.data[++u];
index += same_u ? skip_ahead : 16-bit;
bit = same_u ? bit + skip_ahead : 0;
} while (!(bits&1));
}
And even if I did use __builtin_ctz
in this context...
While I do see a lot of potential for pipeline stalls in TMB iteration, preventing them may be outweighed by all the crap I have to put into the pipeline to get these to compile as CMOVs.
Yeah no. Let's keep the code more readable & intuitive.
LOL, here it is for posterity's sake; the very first TMBitset before any optimizations or tweaks.
#include <cmath>
#include <cstdlib>
#include <cstring>
template <class item, class unit> class TMBitset
{ item* const array;
unit* data;
const size_t size;
const size_t units;
static constexpr size_t maxbits = 8*sizeof(unit)-1;
public:
TMBitset(const size_t s, item* const a):
array( a ),
size ( s ),
units( ceil(double(size+1)/8/sizeof(unit)) )
{ data = (unit*const)calloc(units, sizeof(unit));
//create dummy end() item
data[size/8/sizeof(unit)] |= 1 << size%(8*sizeof(unit));
}
~TMBitset() {free(data);}
bool insert(item* const v)
{ const size_t index = v-array;
const unit u = data[index/8/sizeof(unit)];
data[index/8/sizeof(unit)] |= 1 << index%(8*sizeof(unit));
return u != data[index/8/sizeof(unit)];
}
bool operator () (item* const v)
{ const size_t index = v-array;
return data[index/8/sizeof(unit)] & 1 << index%(8*sizeof(unit));
}
bool operator != (TMBitset<item,unit>& other) {return memcmp(data, other.data, units*sizeof(unit));}
void operator |= (TMBitset<item,unit>& other)
{ for (size_t u = 0; u < units; ++u)
data[u] |= other.data[u];
}
class iterator
{ TMBitset& b;
size_t bit, u, index;
unit bits;
friend iterator TMBitset::begin();
public:
iterator(TMBitset& bs, size_t i): b(bs), u(i/8/sizeof(unit)), bit(i%(8*sizeof(unit))), index(i), bits(b.data[u]) {}
item* const operator * () const {return b.array+index;}
void operator ++ ()
{ do { ++index;
if (bit == maxbits)
{ bit=0;
++u;
bits = b.data[u];
}
else { bits >>= 1;
++bit;
}
} while (!(bits&1));
}
bool operator != (iterator& other) const {return index != other.index;}
};
iterator begin()
{ iterator it(*this, 0);
if (!(it.bits&1)) ++it;
return it;
}
iterator end() {return iterator(*this, size);}
};
TMBitset 1st draft, ecff52f
, first TMBitset
branch, 2nd commit therein, right after store TravelerLists in contiguous memory
.
Implemented:
- add_clinched_by (ReadList, ConcAug)
- Route::clinched_by_traveler (UserLog)
- HGEdge compression
ToDo:
- traveled graph traveler list compilation
- other minor traveled graph improvements
- improve iteration efficiency
Remaining items from #245 after the merge of TravelMapping#592.
Higher Priority:
HighwaySegment
storage and/or other bandwidth optimizations There may be some redundancy here; one thing = another thing rebased/cherrypicked etc.eaa6
nix tl::tn9fdc
new unsigned int[num_travelers] per thread8be1
thread_local7f7d
et al)Medium priority:
Low Priority:
segments
SQL table: only iterateclinched_by
for active/preview systems~ Little to no improvement. Near/within margin of error. And hell -- a mature project where everything is added will have only a/p systems anyway! :grin: Edit: Even after 0ff0af1, close enough to margin of error that no-build is slower on 5 of 8 machines. KISS; leave as-is.