Memory bandwidth alternatives

TMGf: 950400c (based on fc400d9). Status: OK.
    reduce string copies when writing graphs (inline): components written directly to file by HighwayGraph::write_*graphs_tmg instead of concatenating into a long std::string; *_tmg_line functions eliminated completely.
TMGf2: 54e5a7d (based on TMGf). Status: OK.
    reduce string copies when writing graphs (function): components written directly to file by HighwayGraph::write_*graphs_tmg instead of concatenating into a long std::string; *_tmg_line functions retained.
TMGf3: f93ed18 (based on TMGf2). Status: fixed.
    allocate fstr once and pass it in, rather than reallocating thousands of times.
edge labels: instead of std::string construction & copying, just insert its components into the ofstream (see the sketch after this list).
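The sketch below illustrates the general idea behind these write-path alternatives; it is not the actual HighwayGraph code, and the Edge struct, function names, and buffer handling are made-up stand-ins. The first function builds one long std::string (the behavior being replaced), the second inserts each component directly into the ofstream (TMGf/TMGf2), and the third reuses a single caller-allocated fstr buffer for formatted output (TMGf3).

#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

struct Edge { int v1, v2; std::string label; };  // stand-in for the real edge type

// Old approach: concatenate everything into one long std::string, then write it.
// Each += can reallocate and copy the entire accumulated buffer.
void write_edges_concat(std::ofstream& tmg, const std::vector<Edge>& edges)
{
    std::string all;
    for (const Edge& e : edges)
        all += std::to_string(e.v1) + ' ' + std::to_string(e.v2) + ' ' + e.label + '\n';
    tmg << all;
}

// TMGf/TMGf2-style approach: insert each component directly into the ofstream,
// letting its internal buffer absorb the data with no intermediate string.
void write_edges_direct(std::ofstream& tmg, const std::vector<Edge>& edges)
{
    for (const Edge& e : edges)
        tmg << e.v1 << ' ' << e.v2 << ' ' << e.label << '\n';
}

// TMGf3-style refinement: when formatted text is still needed (e.g. coordinates),
// the caller allocates one scratch buffer (fstr) and passes it in for every line,
// instead of constructing a fresh buffer or string thousands of times.
void write_vertex(std::ofstream& tmg, char* fstr, const char* label, double lat, double lng)
{
    sprintf(fstr, "%s %.15g %.15g\n", label, lat, lng);
    tmg << fstr;
}

In this sketch the caller would declare something like char fstr[100]; once and hand it to every write_vertex call.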
CPU time alternatives
    (shortname formerly hge)
    rebase?
    (shortname formerly mv)
    (shortname formerly mt)

Rubbish bin (epoch unless otherwise noted)
old TMGi was 9cdc37d
old TMGf3 was 9bcee48
old TMGr was 48797e1e8879d54bcddb0c0be9e60788f67c2bf7: old TMGf3 (9bcee48) rebased onto vet (77af90e1bee0efcaa8f1041cfea3a6368f8b7d78)
TMGi bugfix (0993a1b on epoch) squashed
TMGf3 bugfix (e73a1bf on epoch) squashed
TMGr bugfix (c1d7eaf on epoch) squashed
etF: cherry-pick e before rebasing onto TMGf.squashed: df4567a
etF before interactive rebase (2nd cherry-picked copy of unused vertex vars removal): 6038cf3def06684a68a72b92bb42f2a2916dd3c3
Python attempt at 90edd7a on BiggaTomato. Very little difference (~2.3s?), probably just noise. Going no-build.
Mix & Match
Rows are memory bandwidth alternatives (i = TMGi, i3 = TMGi3, f1/f2/f3 = TMGf/TMGf2/TMGf3); columns are CPU time alternatives; each cell is the commit for that combination.

           no-build   v          e          t          ve         vt         et         vet
no-build   fc4009d    c200686    bb4e450    7a67ce4    22cf13b    e3e708e    a08dd6c    77af90e
i          bfe86f0    x          x          x          x          x          x          x
i3         cd244c5
f1         950400c    8c10540    661a59b    a8b269a    584c37b    27f909f    3e41a76    824e2b8
f2         54e5a7d    1afe9a8    758f082    dcfb20d    aedc633    d2b66ce    c87a8e2    ab9c7a5
f3         f93ed18    3b94f38    34aecf6    f5fb4c1    f38426d    94130de    9f33c1a    11a9f42
cd ytm/DataProcessing/siteupdate/cplusplus/
git fetch
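# one siteupdate binary per Mix & Match cell under test; suffixes match the shortnames above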
git checkout c200686; make siteupdate; mv siteupdate siteupdate_v
git checkout 7a67ce4; make siteupdate; mv siteupdate siteupdate_t
git checkout e3e708e; make siteupdate; mv siteupdate siteupdate_vt
git checkout a08dd6c; make siteupdate; mv siteupdate siteupdate_et
git checkout 3b94f38; make siteupdate; mv siteupdate siteupdate_vF3
git checkout f5fb4c1; make siteupdate; mv siteupdate siteupdate_tF3
git checkout 94130de; make siteupdate; mv siteupdate siteupdate_vtF3
git checkout 9f33c1a; make siteupdate; mv siteupdate siteupdate_etF3
# _v _t _vt _et _vF3 _tF3 _vtF3 _etF3
RAM
TMGi is out. On lab1, efficiency drops off sharply above 1 thread. At 3+ threads, it performs worse than no-build. A little surprising considering the memory bandwidth improvement, but I've seen this kind of behavior before when inlining code vs keeping it in a separate function, strange as it may be.
TMGi3 is out. On lab1, makes up for the losses of TMGi and performs comparably to TMGf* until efficiency drops off >5 threads. At 7+ threads, performs worse than no-build. Makes sense to not include anything based off TMGi.
No-build is out. TMGf* perform better without exception.
TMGf3 preferred. TMGf* perform comparably across the board, with no clear winner. With the reduction in string construction, concatenation & memory copies, it makes sense that TMGf3 would make better use of memory bandwidth, even if the effect is small & often lost in the noise. TMGf3's performance was often, if not always, at or near the top of the pack. Call it an unclear winner.
I may revisit this after selecting a CPU alternative (or narrowing the field at least), and see if differences become more apparent when memory bandwidth is at more of a premium.
CPU
No-build is out. It's only faster in 2 instances: on BiggaTomato, vet vs vetF3 @ 2 threads (23.13 vs 23.15 s) & 3 threads (17.03 vs 17.53 s). This appears to just be noise; we probably just don't have enough threads for the bandwidth savings to really kick in. On all machines but lab3, all build alternatives are faster. On lab3, t performs better even with RAM left at no-build; otherwise, all build alternatives perform better across the board except for vetF3 and veF3.
vF3 is out. eF3 is always faster on all machines except lab3, where...
eF3 is faster than vF3 @ 1-14 threads and again @ 24 threads.
tF3 is faster than vF3 @ 5-23 threads, except @ 15.
vF3 is indeed fastest @ 15 threads, but only beats tF3 by 0.11 s. (Noise?) But, why even run @ 15 threads? Sure, the fastest recorded single total run time was @ 15 threads, but with veF3, and it was a bit of an outlier. We're better off looking at the average times across all 10 passes. Wanna optimize for fastest average subgraph speed? vtF3 @ 8 threads; 11.92 s. Wanna optimize for average total run time? eF3 @ 24 threads; 51.35 s. Don't wanna use all 24 threads? 2nd best is eF3 @ 12 threads; 51.52 s. Let's call eF3 superior to vF3.
tF3 is out, even if it takes a clear lead on lab3 from 11-22 (except @ 15) threads. It looks good, but there's no reason to bother.
As noted above, whether optimizing for subgraph or overall speed, we'd want a different alternative @ a different # threads.
Yes, tF3 did measure faster than eF3 at 9-12 threads, but this is noise: eF3 performed better than tF3 when we look at total run time, not just subgraphs.
Up to 8 threads, the advantages of other alternatives are too big to ignore.
On lab2 @ 11 threads, tF3 leads eF3 by a whole 0.06 s. Again, doesn't matter; there's more efficiency to be had at a different # threads; eF3, veF3, vtF3 & etF3 are all faster whether optimizing for subgraph or total run time.
On other machines, only on lab1 @ 8 threads is tF3 not dead last; it leads vetF3 by 0.16 s. Aside from my suspicions that OS more so than chipset affects scaling to more threads (how would FreeBSD do?), vetF3 may actually be less of a priority anyway. More on this below...
veF3 is out. Peak efficiency (whether subgraph or total) never happens here. If considering using this, may as well use vetF3.
Low # threads: inferior to vetF3.
Mid-high # threads on lab3: comparable to vetF3, inferior to all remaining partial-build alternatives.
lab2: at mid thread counts, comparable to vetF3 / inferior to the remaining alternatives; at high thread counts, comparable to both.
lab1.5: everything is low # threads. :)
vetF3 & vtF3 deferred. Changing vertex sets to lists would require changing set intersections for full custom graphs. Build sets as needed, iterate through lists & check membership, whatever... (a rough sketch of one option follows below). Kicking this can down the road lets me more easily make a decision now.
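A minimal sketch of one of those options, assuming the per-region vertex collections become plain vectors; HGVertex and the intersect helper are illustrative names, not the real siteupdate types. It builds a membership set from the smaller side on demand, then keeps the matching elements of the larger side.

#include <unordered_set>
#include <vector>

struct HGVertex;  // opaque stand-in for the real vertex type

// Intersection of two vertex lists without storing them as std::set:
// build a hash set from the smaller side as needed, then keep the
// elements of the larger side that are members.
std::vector<HGVertex*> intersect(const std::vector<HGVertex*>& a,
                                 const std::vector<HGVertex*>& b)
{
    const std::vector<HGVertex*>& small = a.size() < b.size() ? a : b;
    const std::vector<HGVertex*>& large = a.size() < b.size() ? b : a;
    std::unordered_set<HGVertex*> members(small.begin(), small.end());
    std::vector<HGVertex*> result;
    for (HGVertex* v : large)
        if (members.count(v)) result.push_back(v);
    return result;
}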
OK, so vtF3 takes lab3's peak efficiency by 0.61 s, and outperforms eF3 and even etF3 on BiggaTomato. But neither of these are production environments; lab2 (and almost certainly noreaster) don't suffer from these bandwidth limitations. On lab2, etF3 looks to have a pretty clear lead in the "sweet spot", though by the time we take all tasks into account, it gets lost in the noise.
vetF3 takes 1st place on lab3 from 1-6 threads, but lags eF3, vtF3 & etF3 @ 7+. On lab2, vetF3 ramps down the fastest thru the transition zone, trailing all partial-build alternatives except vF3 & tF3 @ 8 threads. More comparable to the rest of the pack @ 10-11 threads+. On every other machine, vetF3 is the subgraph efficiency champ. Makes me a little sad to not select this alternative, but I may revisit it again later.
etF3 selected. With only eF3 otherwise remaining, etF3 keeps a clear lead at lower # of cores, and on lab2, edges out eF3 for the overall efficiency sweet spot (whether subgraph or total).
A new test environment exists now: same hardware as lab3, running Ubuntu instead of CentOS.
Overall, everything scales to more threads much more gracefully.
Does this change anything?
RAM
Same conclusion.
CPU
No-build is still out. It performs better than it did under CentOS and closes the gap to the build options, but is still pretty clearly inferior. This seems to be in line with Ubuntu making better use of memory bandwidth overall, and scaling to higher thread counts.
vF3 is still out. Aside from a few tiny blips @ 20+ threads, this is the worst-performing build alternative.
tF3 is still out. In contrast to lab3, this is the worst-performing remaining build alternative, except for vetF3 @ 16 & 21 threads.
eF3 is out now. While it was the last alternative eliminated on lab3, here it's clearly slower than etF3 except @ 19 (by the tiniest of margins) & 23 threads.
vtF3 is out. Looking at the Relative Efficiency chart in LibreOffice, it's essentially interchangeable with etF3. That one takes the lead overall by 0.21 s; this one takes the lead for subgraphs by 0.04 s. Between the two, etF3 won't disrupt full custom graphs, and performs markedly better with a low number of threads in CentOS.
veF3? There's more of an argument to keep it now. It's the peak efficiency winner for both subgraph (10.66 s at 9 threads) and overall time (42.86 s at 20 threads), though the latter may be just noise. Other than @ 20 threads, it trails etF3 @ 10+ threads.
vetF3 remains competitive thru the sweet spot and on to the end. Takes 2nd place for both subgraph (10.84 s at 9 threads) and overall time (43.61 s at 19 threads). This may be the best choice for noreaster, as I expect overall efficiency to top out at or below 8 threads, due to the number of other processes competing for RAM bandwidth and L3 cache.
etF3 remains a strong contender for "sweet spot" and overall crown on CentOS. Going no-build on vertices avoids breaking full custom graph compatibility, and I won't have to lock myself into or out of any particular approach. SELECTED. I can always revisit this in the future, after implementing full custom graphs.