Memory bandwidth alternatives

TMGf: 950400c (based on fc400d9). Status: OK.
    reduce string copies when writing graphs (inline): components written directly to file by HighwayGraph::write_*graphs_tmg instead of concatenating into a long std::string; *_tmg_line functions eliminated completely.
TMGf2: 54e5a7d (based on TMGf). Status: OK.
    reduce string copies when writing graphs (function): components written directly to file by HighwayGraph::write_*graphs_tmg instead of concatenating into a long std::string; *_tmg_line functions retained.
TMGf3: f93ed18 (based on TMGf2). Status: fixed.
    allocate fstr once and pass it in, rather than reallocating thousands of times.
edge labels: instead of std::string construction & copying, just insert its components into the ofstream (see the sketch after this list).
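The sketch below illustrates the general idea behind these write-path alternatives; it is not the actual HighwayGraph code, and the Edge struct, function names, and buffer handling are made-up stand-ins. The first function builds one long std::string (the behavior being replaced), the second inserts each component directly into the ofstream (TMGf/TMGf2), and the third reuses a single caller-allocated fstr buffer for formatted output (TMGf3).

#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

struct Edge { int v1, v2; std::string label; };  // stand-in for the real edge type

// Old approach: concatenate everything into one long std::string, then write it.
// Each += can reallocate and copy the entire accumulated buffer.
void write_edges_concat(std::ofstream& tmg, const std::vector<Edge>& edges)
{
    std::string all;
    for (const Edge& e : edges)
        all += std::to_string(e.v1) + ' ' + std::to_string(e.v2) + ' ' + e.label + '\n';
    tmg << all;
}

// TMGf/TMGf2-style approach: insert each component directly into the ofstream,
// letting its internal buffer absorb the data with no intermediate string.
void write_edges_direct(std::ofstream& tmg, const std::vector<Edge>& edges)
{
    for (const Edge& e : edges)
        tmg << e.v1 << ' ' << e.v2 << ' ' << e.label << '\n';
}

// TMGf3-style refinement: when formatted text is still needed (e.g. coordinates),
// the caller allocates one scratch buffer (fstr) and passes it in for every line,
// instead of constructing a fresh buffer or string thousands of times.
void write_vertex(std::ofstream& tmg, char* fstr, const char* label, double lat, double lng)
{
    sprintf(fstr, "%s %.15g %.15g\n", label, lat, lng);
    tmg << fstr;
}

In this sketch the caller would declare something like char fstr[100]; once and hand it to every write_vertex call.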
CPU time alternatives
    (shortname formerly hge)
    rebase?
    (shortname formerly mv)
    (shortname formerly mt)

Rubbish bin (epoch unless otherwise noted)
old TMGi was 9cdc37d
old TMGf3 was 9bcee48
old TMGr was 48797e1e8879d54bcddb0c0be9e60788f67c2bf7: old TMGf3 (9bcee48) rebased onto vet (77af90e1bee0efcaa8f1041cfea3a6368f8b7d78)
TMGi bugfix (0993a1b on epoch) squashed
TMGf3 bugfix (e73a1bf on epoch) squashed
TMGr bugfix (c1d7eaf on epoch) squashed
etF: cherry-pick e before rebasing onto TMGf.squashed: df4567a
etF before interactive rebase (2nd cherry-picked copy of unused vertex vars removal): 6038cf3def06684a68a72b92bb42f2a2916dd3c3
Python attempt at 90edd7a on BiggaTomato. Very little difference (~2.3s?), probably just noise. Going no-build.
Mix & Match
Rows are memory bandwidth alternatives (i = TMGi, i3 = TMGi3, f1/f2/f3 = TMGf/TMGf2/TMGf3); columns are CPU time alternatives; each cell is the commit for that combination.

           no-build   v          e          t          ve         vt         et         vet
no-build   fc4009d    c200686    bb4e450    7a67ce4    22cf13b    e3e708e    a08dd6c    77af90e
i          bfe86f0    x          x          x          x          x          x          x
i3         cd244c5
f1         950400c    8c10540    661a59b    a8b269a    584c37b    27f909f    3e41a76    824e2b8
f2         54e5a7d    1afe9a8    758f082    dcfb20d    aedc633    d2b66ce    c87a8e2    ab9c7a5
f3         f93ed18    3b94f38    34aecf6    f5fb4c1    f38426d    94130de    9f33c1a    11a9f42
cd ytm/DataProcessing/siteupdate/cplusplus/
git fetch
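# one siteupdate binary per Mix & Match cell under test; suffixes match the shortnames above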
git checkout c200686; make siteupdate; mv siteupdate siteupdate_v
git checkout 7a67ce4; make siteupdate; mv siteupdate siteupdate_t
git checkout e3e708e; make siteupdate; mv siteupdate siteupdate_vt
git checkout a08dd6c; make siteupdate; mv siteupdate siteupdate_et
git checkout 3b94f38; make siteupdate; mv siteupdate siteupdate_vF3
git checkout f5fb4c1; make siteupdate; mv siteupdate siteupdate_tF3
git checkout 94130de; make siteupdate; mv siteupdate siteupdate_vtF3
git checkout 9f33c1a; make siteupdate; mv siteupdate siteupdate_etF3
# _v _t _vt _et _vF3 _tF3 _vtF3 _etF3
RAM
TMGi is out. On lab1, efficiency drops off sharply above 1 thread. At 3+ threads, it performs worse than no-build. A little surprising considering the memory bandwidth improvement, but I've seen this kind of behavior before when inlining code vs keeping it in a separate function, strange as it may be.
TMGi3 is out. On lab1, makes up for the losses of TMGi and performs comparably to TMGf* until efficiency drops off >5 threads. At 7+ threads, performs worse than no-build. Makes sense to not include anything based off TMGi.
No-build is out. TMGf* perform better without exception.
TMGf3 preferred. TMGf* perform comparably across the board, with no clear winner. With the reduction in string construction, concatenation & memory copies, it makes sense that TMGf3 would make better use of memory bandwidth, even if the effect is small & often lost in the noise. TMGf3's performance was often, if not always, at or near the top of the pack. Call it an unclear winner.
I may revisit this after selecting a CPU alternative (or narrowing the field at least), and see if differences become more apparent when memory bandwidth is at more of a premium.
CPU
No-build is out. It's only faster in 2 instances: on BiggaTomato, vet vs vetF3 @ 2 threads (23.13 vs 23.15 s) & 3 threads (17.03 vs 17.53 s). This appears to just be noise; we probably just don't have enough threads for the bandwidth savings to really kick in. On all machines but lab3, all build alternatives are faster. On lab3, t performs better even with RAM left at no-build; otherwise, all build alternatives perform better across the board except for vetF3 and veF3.
vF3 is out. eF3 is always faster on all machines except lab3, where...
eF3 is faster than vF3 @ 1-14 threads and again @ 24 threads.
tF3 is faster than vF3 @ 5-23 threads, except @ 15.
vF3 is indeed fastest @ 15 threads, but only beats tF3 by 0.11 s. (Noise?) But, why even run @ 15 threads? Sure, the fastest recorded single total run time was @ 15 threads, but with veF3, and it was a bit of an outlier. We're better off looking at the average times across all 10 passes. Wanna optimize for fastest average subgraph speed? vtF3 @ 8 threads; 11.92 s. Wanna optimize for average total run time? eF3 @ 24 threads; 51.35 s. Don't wanna use all 24 threads? 2nd best is eF3 @ 12 threads; 51.52 s. Let's call eF3 superior to vF3.
tF3 is out, even if it takes a clear lead on lab3 from 11-22 (except @ 15) threads. It looks good, but there's no reason to bother.
As noted above, whether optimizing for subgraph or overall speed, we'd want a different alternative @ a different # threads.
Yes, tF3 did measure faster than eF3 at 9-12 threads, but this is noise: eF3 performed better than tF3 when we look at total run time, not just subgraphs.
Up to 8 threads, the advantages of other alternatives are too big to ignore.
On lab2 @ 11 threads, tF3 leads eF3 by a whole 0.06 s. Again, doesn't matter; there's more efficiency to be had at a different # threads; eF3, veF3, vtF3 & etF3 are all faster whether optimizing for subgraph or total run time.
On other machines, only on lab1 @ 8 threads is tF3 not dead last; it leads vetF3 by 0.16 s. Aside from my suspicions that OS more so than chipset affects scaling to more threads (how would FreeBSD do?), vetF3 may actually be less of a priority anyway. More on this below...
veF3 is out. Peak efficiency (whether subgraph or total) never happens here. If considering using this, may as well use vetF3.
Low # threads: inferior to vetF3.
Mid-high # threads on lab3: comparable to vetF3, inferior to all remaining partial-build alternatives.
lab2: at mid thread counts, comparable to vetF3 / inferior to the remaining alternatives; at high thread counts, comparable to both.
lab1.5: everything is low # threads. :)
vetF3 & vtF3 deferred. Changing vertex sets to lists would require changing set intersections for full custom graphs. Build sets as needed, iterate through lists & check membership, whatever... (a rough sketch of one option follows below). Kicking this can down the road lets me more easily make a decision now.
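A minimal sketch of one of those options, assuming the per-region vertex collections become plain vectors; HGVertex and the intersect helper are illustrative names, not the real siteupdate types. It builds a membership set from the smaller side on demand, then keeps the matching elements of the larger side.

#include <unordered_set>
#include <vector>

struct HGVertex;  // opaque stand-in for the real vertex type

// Intersection of two vertex lists without storing them as std::set:
// build a hash set from the smaller side as needed, then keep the
// elements of the larger side that are members.
std::vector<HGVertex*> intersect(const std::vector<HGVertex*>& a,
                                 const std::vector<HGVertex*>& b)
{
    const std::vector<HGVertex*>& small = a.size() < b.size() ? a : b;
    const std::vector<HGVertex*>& large = a.size() < b.size() ? b : a;
    std::unordered_set<HGVertex*> members(small.begin(), small.end());
    std::vector<HGVertex*> result;
    for (HGVertex* v : large)
        if (members.count(v)) result.push_back(v);
    return result;
}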
OK, so vtF3 takes lab3's peak efficiency by 0.61 s, and outperforms eF3 and even etF3 on BiggaTomato. But neither of these are production environments; lab2 (and almost certainly noreaster) don't suffer from these bandwidth limitations. On lab2, etF3 looks to have a pretty clear lead in the "sweet spot", though by the time we take all tasks into account, it gets lost in the noise.
vetF3 takes 1st place on lab3 from 1-6 threads, but lags eF3, vtF3 & etF3 @ 7+. On lab2, vetF3 ramps down the fastest thru the transition zone, trailing all partial-build alternatives except vF3 & tF3 @ 8 threads. More comparable to the rest of the pack @ 10-11 threads+. On every other machine, vetF3 is the subgraph efficiency champ. Makes me a little sad to not select this alternative, but I may revisit it again later.
etF3 selected. With only eF3 otherwise remaining, etF3 keeps a clear lead at lower # of cores, and on lab2, edges out eF3 for the overall efficiency sweet spot (whether subgraph or total).
A new test environment exists now: same hardware as lab3, running Ubuntu instead of CentOS.
Overall, everything scales to more threads much more gracefully.
Does this change anything?
RAM
Same conclusion.
CPU
No-build is still out. It performs better than it did under CentOS and closes the gap to the build options, but is still pretty clearly inferior. This seems to be in line with Ubuntu making better use of memory bandwidth overall, and scaling to higher thread counts.
vF3 is still out. Aside from a few tiny blips @ 20+ threads, this is the worst-performing build alternative.
tF3 is still out. In contrast to lab3, this is the worst-performing remaining build alternative, except for vetF3 @ 16 & 21 threads.
eF3 is out now. While it was the last alternative eliminated on lab3, here it's clearly slower than etF3 except @ 19 (by the tiniest of margins) & 23 threads.
vtF3 is out. Looking at the Relative Efficiency chart in LibreOffice, it's essentially interchangeable with etF3. That one takes the lead overall by 0.21 s; this one takes the lead for subgraphs by 0.04 s. Between the two, etF3 won't disrupt full custom graphs, and performs markedly better with a low number of threads in CentOS.
veF3? There's more of an argument to keep it now. It's the peak efficiency winner for both subgraph (10.66 s at 9 threads) and overall time (42.86 s at 20 threads), though the latter may be just noise. Other than @ 20 threads, it trails etF3 @ 10+ threads.
vetF3 remains competitive thru the sweet spot and on to the end. Takes 2nd place for both subgraph (10.84 s at 9 threads) and overall time (43.61 s at 19 threads). This may be the best choice for noreaster, as I expect overall efficiency to top out at or below 8 threads, due to the number of other processes competing for RAM bandwidth and L3 cache.
etF3 remains a strong contender for "sweet spot" and overall crown on CentOS. Going no-build on vertices avoids breaking full custom graph compatibility, and I won't have to lock myself into or out of any particular approach. SELECTED. I can always revisit this in the future, after implementing full custom graphs.