tuples for augment_lists

yakra commented 1 year ago

https://github.com/yakra/DataProcessing/blob/616425bc2fb5df01778f43c2d06ccbe15ae94816/siteupdate/cplusplus/threads/ConcAugThread.cpp#L20

All those allocations and appends probably get a bit expensive.

- [ ] Instead, make a std::tuple<TravelerList,HighwaySegment,HighwaySegment> and insert the individual components into the ofstream, same as in single-threaded operation.
Are the calls to HighwaySegment::str() too expensive now that it's single-threaded?
- [ ] Try a std::tuple<TravelerList,std::string,std::string> and construct the strings in the multi-threaded bit, or
- [ ] Insert the constituent parts of HighwaySegment::str() into the ofstream

yakra commented 1 year ago

3 different flavors for each of the checkboxes above: Pointer, String, Elements. For each, 2 variations, keeping augment_list as a std::list<whatever>, or converting to a std::vector: List, Vector. Yielding 6 build alternatives to evaluate: pl, pv, sl, sv, el, ev.

The vector versions performed better across the board, so I added in one more alternative to evaluate, v, which is the same as no-build except for the list->vector conversion.

The vector options performed better than their list counterparts especially in the directly affected tasks, which makes sense:
- ConcAugThread: better on all Linux machines, with only 1 exception (1 pair of executables @ 1 specific thread count) each on BiggaTomato, lab5 & lab2, and 2 exceptions on lab3. The pattern starts to break down a bit on lab4, with vectors faster 87/96 times. Only on bsdlab does the pattern not hold -- vectors faster 56/96 times, much closer to the 50% mark. May not be statistically significant.
- concurrencies.log: An even clearer win. 1 exception on lab4; 13 exceptions on bsdlab.
They even performed better in the tasks with unchanged code, the ones not directly affected. This also makes sense:
• With lists, small chunks of data are scattered across the heap. We have to pull a lot of separate locations into cache, with a lot of wasted space.
• With vectors, our data are arranged contiguously. With each cache line we pull in from main memory, we get a lot more data we need, and fewer locations need to be transferred. Fewer memory locations overwritten in cache = fewer cache misses once we get around to computing stats.

- Vectors outperformed lists a smaller percentage of the time. No big surprise here; these are the indirectly affected tasks after all.
- A loss in one task is usually offset by a win in the other. When viewing the sum of CompStatsRThread & CompStatsTThread together, we see a greater percentage of wins for vectors. 2 losses on lab2, 3 on lab4.
- Again, only on bsdlab do we not see much list/vector difference. In fact, in CompStatsTThread, lists performed better 51/96 times. In both CompStats threads combined, vectors did better 53/96 times.
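
The layout argument above can be checked directly. This is a small sketch, not code from siteupdate; `is_contiguous` is a hypothetical helper that tests whether each element object sits immediately after the previous one in memory.

```cpp
#include <list>
#include <string>
#include <vector>

// True if every element object directly follows its predecessor in memory.
// A std::vector guarantees this; std::list nodes are allocated one by one,
// each with its own prev/next pointers, so its elements cannot be adjacent.
template <class Container>
bool is_contiguous(const Container& c) {
    const typename Container::value_type* prev = nullptr;
    for (const auto& e : c) {
        if (prev && &e != prev + 1) return false;
        prev = &e;
    }
    return true;
}
```

Note this only covers the std::string objects themselves. The character data of a longer string still lives in its own heap allocation either way, so a vector reduces the scattering rather than eliminating it.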

For all 4 tasks combined, v is a clear winner at all but the lowest thread counts (and of course we want to optimize for more threads) on every machine except BiggaTomato, which has the least cache & slowest RAM. There, it still outperforms no-build across the board. At 4 threads, it's in 3rd place, only 0.05 s behind the winner, sv. Wherever v is not first place, it still beats no-build, no exceptions.

TLDR, v is our winner.

But wait. Can we still improve on things? Let's eliminate the list options from consideration & take a deeper dive. Considering v, pv, sv, & ev...

ConcAugThread:
- pv & ev rapidly trade 1st & 2nd place, occasionally tying. Makes sense as this 1st task is identical between the two.
- sv in 3rd place, no exceptions. Makes sense as it allocates & constructs loads of HighwaySegment::str() strings.
- v in 4th place, no exceptions. Makes sense as it constructs all these same strings & makes even bigger strings out of them.
- Ideally, we want ev or pv. Failing that, something relatively light on string allocation.
concurrencies.log: With only 2 exceptions across all machines & thread counts (3rd & 4th traded places @ 14 & 16 threads on lab4), this pattern holds:
- v in 1st place. Iterate thru augment_lists[t] & print one string. Boom. Done.
- sv in 2nd place. Iterate thru augment_lists[t] & print 7 strings, dereferencing 1 pointer along the way.
- ev in 3rd place. Print 15 strings, dereference 13 pointers, construct 2 short-ish Route::readable_name() strings.
- pv in 4th place. Print 7 strings, dereference 3 pointers, construct 2 long-ish HighwaySegment::str() strings.
- Ideally, we want to minimize ofstream insertions, dereferencing & string construction, as well as find a solution that performs well in ConcAugThread and minimizes cache misses when computing stats. Can't have all of these things at once though.
CompStatsRThread and CompStatsTThread: Each machine follows its own pattern, many different from one to the next, and each counterintuitive in its own way. None follow the ev -> pv -> sv -> v hierarchy I'd expect. I won't try to make sense out of this; I'll just try out a few more solutions & let the chips fall where they may.

3 more alternatives:

v2 improves on v, easing up on the string allocation by not including "Concurrency augment for traveler " in each augment_list entry, instead inserting it as a const char* into the ofstream.
Improve on ev with:
- r1: Take some references to avoid pointer-chasing.
- r2: Ditch route->readable_name() in favor of its constituent parts, taking a couple more references along the way.

yakra commented 1 year ago

v2 handily beats v during ConcAugThread, everywhere but bsdlab @ 5 threads. As expected.
v beats v2 writing concurrencies.log, as expected. By a wider margin than I'd have thought, even if it's only a couplefew hundredths of a second.
Computing stats results continue to be counterintuitive... v outperforms v2 in CompStatsRThread on every machine but BT & bsdlab, and in CompStatsTThread on every machine but bsdlab. This makes no sense; it's extraneously allocating loads more memory. But the results are pretty consistent. On lab3 it's even the top performer.
Overall, for all 4 tasks combined, v2 has a narrow lead on 4/7 machines, and a wider, more consistent lead on bsdlab. v counterintuitively takes the lead on lab1, and r2 takes the lead on BiggaTomato, which makes sense with its lower RAM bandwidth & smaller cache.

Narrow lead of v2 notwithstanding, I'll just implement v for now, which does need to happen at a minimum, and take another look at this after the region.php rankings bugfix (and maybe sequential TravelerList objects & TMBitset<TravelerList> clinched_by) change ConcAugThread operation & CompStatsThread iteration.

yakra commented 1 year ago

branches on BiggaTomato

branch	commit
y238r2	abd9d238800126af8c53612bc574ccb942c90349
y238r1	abd9d238800126af8c53612bc574ccb942c90349
y238v2	fa2918559cbd24d722ce42f862a03387784a7423
y238v	525c2b6e2a29c6e1b34e500834164b589ce9e5cb
y238ev	83b06cf2ab00fa2114295178bfcfa4c4c59ec732
y238el	1aabe3c0df3ef9109be34c35c3e8bf6387eb94e7
y238sv	1555dce2d38a21510cc01a547a5c7101d672745a
y238sl	52ecc48c34880123b30681d8cd7cb3b7df756e18
y238pv	0244cea61d2e9331484b0419439e9d94f6a7e319
y238pl	0abfc10fb0e62fadfbd0c8575f652e16a8c12613

tuples for augment_lists #238