yakra commented 1 year ago

Benchmark:

[x] Route::clinched_by_traveler: branching vs indirection Branching is faster.
[x] Graph generation with different units Bigger is faster.
[x] Explicitly inline matching_vertices_and_edges (Change name? It finds travelers too) Slightly faster on all machines except lab2. (Noise?)

[x] Try branchless variant of eaa6:

for (size_t index = 0; index < traveler_lists.size(); ++index)
code[index/4] += clinched_by(traveler_lists[index]) << index%4;

Branching is faster.

[x] Revert eaa6; apply 4e05 directly to prev commit For Computing Stats, 96e9... • is THE top performer so far for lab1, lab3, lab4. • outperforms 4e05 on BiggaTomato. • on lab2 lags behind top performer 4e05 by only 3.8 ms. Just noise? A very tight race here.
[x] 5c76 inconclusive, but does appear slightly slower for userlogs, very consistently. Try:
- [x] comparing against 6922 for a more apples-to-apples comparison.
- [x] comparing against the final selected commit once all the dust settles (see 4a64).
- [x] Out of curiosity, is there a diff if I build with GCC instead of clang? Yes.
[x] What if traveler_lists is nixed altogether? As we iterate thru traveler_set, just keep a count instead of creating a vector. The trade-off is doing one more iteration thru traveler_set at the end of the traveled graph for traveler names. Is this outweighed by not having to construct the vector and do a modest number of allocations/reallocations? Preliminary results: helps on BiggaTomato; hurts on lab1. Be interesting to see results on bsdlab, at higher thread counts, and after doing more work on the RAM bandwidth bottleneck. Final: No speed advantage. Leaving as-is. Maybe re-examine after more RAM bandwidth improvements are implemented.

Try out:

[x] What happens to traveled graphs if a devel system comes before concurrent a/p in systems.csv?
[x] How much does initial HGEdge construction slow down if I force an active/preview canonical HighwaySegment? \~ 0.01 - 0.02 s on BiggaTomato -- 0.9 - 1.9 % more time.
[x] Simple iteration optimization (with different units) The trade-off: Larger units = skip back farther but less often.
- 64-bit is clearly suboptimal.
- 8, 16 & 32-bit all very close to margin of error. Maybe a slight edge for 32-bit; makes sense to use it anyway because good |= performance.
[x] Complex iteration optimization
- 8-bit f749 underperforms 8-32-bit simple iteration on lab{1..4}; slight lead on BiggaTomato.
- 16-bit ec4e is 1st place for BiggaTomato & lab1; underperforms f749 & even 64-bit f8cf on lab2. Other machines TBD.
- [x] [bits >> 1] solution
- [x] ternaries Both underperform ec4e on BiggaTomato & lab1. Successively better on lab2 though, with ternaries 1st place overall & [bits >> 1] falling between 16 & 32-bit SIO2. Other machines TBD.
- [x] How much time does the 15-bit lookup table take to construct? Does this outweigh the advantage of using 16-bit ComItOpt?
- About 0.3 ms on BiggaTomato.
- N/A -- there isn't an advantage to using 16-bit ComItOpt.
- ~LOL what if I constexpr the damn thing by brute force?~ Never mind. Nothing to be gained by doing this.
- [x] 32k is 1/16 of 512k. Will ec4e perform poorly on Epoch? No. Performs well; outperforms SIO2. Similar to BiggaTomato.
[x] Pointer punning |= Try simplified versions for larger units:
- [x] 64-32-16-bit
- [x] 64-16-bit
- [x] 64-32-bit
[x] Replace TravelerList::traveler_num with one unsigned int* traveler_num = new unsigned int[TravelerList::allusers.size()]; per thread. Index via for (TravelerList *t : traveler_lists) traveler_nums[t-TravelerList::allusers.data()] = travnum++;
[x] ~A variant of eaa6 with TMBitset [] operator?~ LOLNOPE. Different vectors (everything vs subset); indices don't match up.

Kinda both:

~Init TravelerLists at beginning; init segments with size not capacity~
[x] ~Or better yet, init segments with TravelerList::ids.size()~ Segments set via TMArray::size, set via TravelerList::ids.size()
[x] Timestamp for first task, with & without. Python Too?
[x] TMArray<TravelerList> means threaded construction in place (via placement new) without separate read_list function.
251

Clean up:

[x] fix comment
[x] Constexpr the static edge format constants
[x] Template specialization
[x] As of ecff, cmath no longer explicitly needed in HighwayGraph.cpp
[x] Extraneous else after continue; parens can be clarified https://github.com/yakra/DataProcessing/blob/80f27243bf26579c53c6c7a215a5aeb22347317a/siteupdate/cplusplus/classes/GraphGeneration/HighwayGraph.cpp#L112-L116
[x] extraneous visibility == 1 check just before that
[x] (4db4) maxbits less useful in SimItOpt2. Delete it; replace with 8*sizeof(unit); let -1 and +1 cancel out.
~bits should be unsigned char in pun branch. Dumb luck that it ran without errors. Fixed for ComItOpt.~ Never mind. Using a branch that retains unit instead.
[x] uint8_tetc.
[ ] Explicit instantiation?
[x] Inlining MV&E (2d1b) means #include "../../templates/TMBitset.cpp" not needed in HighwayGraph.h. Can lose the include guard too LOL. :cowboy_hat_face:
[ ] segments SQL table: only iterate clinched_by for active/preview systems
[x] (c8b9) Check for diffs due to constant folding: 8/sizeof(unit)
[x] Comment for != and |= operators
[x] Review TMBitset variable names
[x] Cast (unit)1 before <<, lest the unsigned long bug make a reappearance. See f8cf on SimItOpt2 branch.
[x] (4db4) Switch: are additions in for loop reordered? Make it look pretty!
[x] nix () and add_value until needed

yakra commented 1 year ago

find crashes via

for file in sulogs/2023-04-06/siteupdate_{a423,4448,8fba,8d55,7828,96e9,2d1b,5c76}*; do
  echo -e "$file\t" `tail -n 1 $file \
  | sed 's~\[[0-9.]*\]~~' \
  | tr -d .`; done \
| grep -v 'Total run time:' \
| sed -r 's~.*siteupdate_(....)-.*~\1~' \
| uniq

yakra commented 1 year ago

dead branches

SimItOpt2 with sign bit bug before reset:
- dd1b6e0 long
- a8a3db9 int
- 9280ec4 short OK, not dead, but deprecated binary & logs. Hard reset here; bugfix commit added
tlist_canon @ d37dfe0979bc5a470dad5851147850503d096bdc

yakra commented 1 year ago

commits

c	incl?	descr	comments
a5a7	yes	contig RAM	Prerequisite for everything else
ecff	yes	TMB 1st draft	The main point of all this :)
d704	yes	t_l via TMB	Small speedup. Saves RAM. Potential for improvements via iteration and \|= optimizations.
a009	yes	t_l list->vec	Not slower. The Right Thing To Do. Saves RAM. Locality. Is index used in future commits? Only in `eaa6`, which won't be used as it hurts performance. What if `traveler_lists` is nixed altogether, and we just iterate `traveler_set` one more time at the end? Or, what if, instead of doing reallocations/copies, we just `reserve(TravelerList::allusers.size())` after declaring? Re-examine nixing in future.
91a0	yes	!extr include	Housekeeping. Rebase earlier?
7ce7	yes	userlog warn	Unrelated, and doesn't appear to affect performance any, fnord. Rebase earlier?
f977	no	t_l as ref	Pass a bool instead; see `4504`. Nix extraneous traveler_lists creation for master graph.
eaa6	no	nix tl::tn	Hurts performance. See `9fdc` et al instead.
4e05	no	avoid iterate	Child of `eaa6`. See `96e9` even though it's the same diff. :P
6922	no	ctor / init	Any changes to performance? See next commit. No: See `7415`.
5c76	no	integral type	Userlogs slightly slower? Test again once dust settles? No: See `4a64`.
b4ab	no	names & order	Arg order just makes sense. Name changes are more related to prev commit. No: See `c896`.
e61a	no	++u	No change to binary. Done on its own in some future commits, including `47a7`.
a423	yes	branching CBT	Faster userlogs. Rebase earlier? Maybe try again once `HighwaySegments` are stored sequentially?
4448	no	short	`unit` branch; affects `\\|=` only. Prefer flavors with iterator optimization.
8fba	no	int	`unit` branch; affects `\\|=` only. Prefer flavors with iterator optimization.
8d55	no	long	`unit` branch; affects `\\|=` only. Prefer flavors with iterator optimization.
7828	no	no-branch CBC	Hurts performance
96e9	yes	avoid iterate	`4e05` cherrypicked. Reorder additions in `for` loop; make it look pretty.
2d1b	yes	inline mv&e	Slight speedup, simpler prototypes & includes
f28c	no	long bugfix	`unit` branch; affects `\\|=` only. Prefer flavors with iterator optimization.
953a	no	! unit param	Not strictly necessary for pointer-punning `\\|=`, but cleans things up. No: Different units may be useful for scenarios other than `clinched_by`. E.g., `uint64_t` may work well for `HGVertex` subgraph membership.
03ec	no	punning `\\|=`	Evaluate WRT straight-up unit `\\|=`. Prefer flavors with iterator optimization; 3 are simplified for 16/32-bit units; see below.
7958	no	SimItOpt1	SimItOpt2 outperforms this.
3910	no	SimItOpt2	32 > 16 > 8-bit for iterating `clinched_by`. Additionally, allows for better `\\|=` optimization.
1be1	no	SIO1 short	SimItOpt2 outperforms this.
87e4	no	SIO1 int	"
2cbb	no	SIO1 long	"
9280	no	SIO2 short	32 > 16 > 8-bit for iterating `clinched_by`. Additionally, allows for better `\\|=` optimization.
9347	no	sign bit fix	"
acc9	yes	unsigned int	"
8269	no	unsigned long	Performs worse than 8/16/32-bit everywhere except epoch.
f8cf	no	ulong bugfix	"
f749	no	ComItOpt	Evaluate against SimItOpt. All tasks. ~Try with per-unit `\\|=`.~ <--Never mind. No: Underperforms SIO2 on all lab machines (but not BiggaTomato & epoch though those aren't the target). KISS principle.
9fdc	no	trav_nums	May not do much on its own. Try again after other speedups are in place. See `8be1`.
bc9d	no	TMB it idx	Pro: No arithmetic for clinchedby_code. Con: Still adding b.array+index everywhere else (deref).
0b69	no	iterator ptr	Pro: No arithmetic for dereferencing. Con: CBC subtracts out. Use `4db4` instead.
^ ^^		^ ^^	Both are superior to `9fdc`, which adds for everything, then subtracts out for CBC. Test both sans `9fdc`. Which gets called more? • 69% `clinched_by` in `clinchedby_code` @ 36,193,165, which wants indices. • 31% in everything else @ 16,024,183, which want pointers. (8,000,360 `clinched_by` in stats & SQL + 23,463 `traveler_set` in MV&E) So on paper, `bc9d` wins. But in practice we're within margin of error, with `0b69` faster on the 4 non-Ubuntu (coincidence?) machines. If foregoing or at least deferring `9fdc` because no clear benefit, it makes sense to use `0b69`.
ec4e	no	16-bit it opt	Wins on epoch, BT & lab1; lags behind SimItOpt2 everywhere else, even before accounting for lookup array setup. Constexpr would be expensive for program & code size. Ugly changes to code, a namespace I wouldn't otherwise need.
16bb	no	`[bits >> 1]`	Only needed for ComItOpt, which looks like it's not going to pan out. Only performs better on lab2, so within margin of error at best.
2d94	no	ternaries	1st place on lab2 by 0.5 ms & lab3 by 0.7; lags behind SimItOpt2 everywhere else except BT, where it's 2nd place behind `ec4e`, and more arguably epoch, where it's slower than 8 & 16-bit, but faster than 32-bit (which looks like it'll be the selected alternative). But again, BT & epoch aren't target machines. Keep it simple.
fc17	no	punning SIO2	Bigger is better.
72e2	no	64-32-16 pun	"
c310	no	64-16-bit pun	"
47a7	yes	64-32-bit pun	Bigger is better. Fewest branches & loops. 32-bit is fastest for iteration too.
4db4	yes	iterator ptr	Faster.
4504	yes	"0"	Rebase earlier
7f87	yes	trav	Better on paper; != op replaced with just sending an existing variable. Rename variable to `is_traveled` for readability.
05d7	no	reserve	No definitive diff; no-build may be faster.
8be1	...	TL::nums	No apparent diffs. Revisit when more RAM bandwidth optimizations are in place?
c896	yes	arg order	Measures faster, but probably just noise.
7415	yes	ctor / init	Keep the initialization part; no diffs to binary from that. Array start & data pointer go from 8 to 24 B apart, increasing probability of being on different cache lines. That's got to explain the drop in UserLog performance. Therefore, reorder to get them consecutive again; see `01d6`.
4a64	yes	integral type	This time, no change to binary from `7415`.
6da9	no	alignas(32)	Increases `sizeof(HighwaySegment)` from 112 to 128. Does help UserLog performance on all machines, but hurts most other tasks on most machines. Combined, still a slight advantage overall but future commits do better...
01d6	yes	reorder membr	`start` and `data` adjacent again, restoring old same/different cache line probability. Data occasionally on different cache lines appears to be outweighed by not increasing `sizeof(HighwaySegment)`, except for bandwidth-constrained machines BiggaTomato, lab1 & bsdlab. `01e7` makes this a non-issue...
01e7	yes	cbt_index	Not only avoids reading `start` (potentially from another cache line), avoids redundant recalculation of the index. A clear advantage over all prev commits on all Linux boxes.
48c8	yes	part add_idx	`add_clinched_by` takes index, calculated once per chopped route, in `store_traveled_segments`.
28af	no	local_index	No clear benefit; more diffs. Keep it simple.
c8b9	yes	const fold	readability
adcf	no	1 index/trav	No clear benefit; more diffs. Keep it simple.
6634	yes	whitespace
9536	yes	constexpr	no diff to binary on BiggaTomato
ecef	yes	ST aug index

yakra commented 1 year ago

Functions, operators, etc.

insert No real changes, because constant folding all via HighwaySegment::add_clinched_by:
- ConcAug
- ReadList via store_traveled_segments
operator () No real changes, because constant folding
- UserLog: Route::clinched_by_traveler
operator != Only changes in 953a (nix unit parameter) & children (ComItOpt et al): multiply by sizeof(unit) before memcmp
- Graphs: edge compression
operator |= Type-punning variants
- Graphs: compiling traveler lists
iteration Simple, 8 & 16-bit complex (and variants), unit choice, store index or pointer
- CompStats
- Graphs:
- creating traveler_lists vector, or getting item count & writing traveler names to last .tmg line
- HighwaySegment::clinchedby_code
- SQL: segments table

yakra commented 1 year ago

HTML (resolved)

TMB_Total.html Original commits; not much to see here. Maybe do another one with the final selected commit for kicks.
TMB_ReadList.html Original commits. No real changes to insert. Maybe do another one with the final selected commit for kicks, considering the weird minor-but-consistent diffs.
TMB_ConcAug.html Original commits. No real changes to insert. Maybe do another one with the final selected commit for kicks, considering the weird minor-but-consistent diffs.
TMB_CompStatsO.html Original commits, 47a7 iterator optimization, 558e iterate active/preview only, 4db4 store & return pointer. Using 4db4.
TMB_iter.html No-build, contig TLs, t_l as ref, 4 SIO1, 4 SIO2, 8/16-bit CIO. Using acc9, 32-bit SIO2.
TMB_iter2.html No-build, contig TLs, t_l as ref, 8/16/32-bit SIO2, 8/16-bit CIO, [bits >> 1], ternaries. Using acc9, 32-bit SIO2.
TMB_UserLog.html Original commits, a423 branching CBT, bf94 merge. Using bf94.
TMB_Compress.html Original commits; no changes to operator != here. Maybe do another one with the final selected commit for kicks.
TMB_Graphs.html Original commits. Refer to pages targeting the specifics: |=, iteration, clinchedby_code, trav_nums, inline mv&e, unit/punning. Using f977 though that commit may be overwritten.
TMB_unit.html f977 + 3 more unit alternatives, 03ec punning |=, 4 SIO2, 4 with 8/16/32-bit punning Using 47a7, 32-bit SIO2 + 64/32-bit punning.

yakra commented 1 year ago

HTML

TMB_CBC.html 4db4 store pointer, 2effinline, 4504 "0", 7f87 trav. Yes.
TMB_TravNums.html No-build, contig TLs, t_l as ref, 9fdc trav_nums, bc9d TMB it idx, 0b69 iterator ptr. 0b69 cherrypicked -> 4db4. TBD: Unsure what to do WRT trav_nums yet.
TMB_inline.html f977 t_l as ref, 2d1b inline mv&e, 4db4 store pointer, 2eff merge, d0fanixtl branch, 13b9 merge. Yes.
TMB_nixtl.html 4db4 store pointer, edb4 nix t_l, aecc "0", d0fa notrav, 13b9 inline Yes? Contradicts nixtl2.
TMB_nixtl2.html 4504 "0", 13b9 nixtl, 7f87 trav (PreAlpha), 7f7d trav(nixtl) No (Contradicts nixtl). May re-examine in future.

yakra commented 1 year ago

hidden/deleted branches

comm	branch	DoWhat	comment
ecef	ConstFold	delete	TMBPreAlpha FFWDed
adcf	ElStacko	hidden
28af	add_index	hidden
89bd	cbt_index	delete	TMBPreAlpha FFWDed
01d6	reorder	delete	merged
6da9	alignas	hidden
8be1	thread_local	hidden	re-examine in future
05d7	reserve	hidden
660c	[stash]	hidden	meta-iteration (CompStats)
6a83	[stash]	hidden	meta-iteration (traveler names)
7f7d	nixtl	hidden	re-examine in future
47a7	64-32	delete	parent of TMBPreAlpha & nixtl
c310	64-16	hidden
72e2	64-32-16	delete	parent of 64-16
fc17	SIO2pun	hidden
2d94	ternaries	hidden
16bb	bit1	delete	parent of ternaries
ec4e	ComItOpt16	delete	parent of ternaries
0b69	return_p	hidden
bc9d	idx	hidden
9fdc	nixon	delete	parent of return_p
f749	ComItOpt	delete	parent of ternaries
f8cf	SimItOpt2	hidden
2cbb	SimItOpt	hidden
03ec	pun	delete	parent of ternaries
54fb	issue588	hidden	use this
f28c	unit	hidden
2d1b	inline	delete	merged
96e9	switch	delete	merged
7828	cbycode	hidden
a423	cbt_br	delete	merged
e61a	TMBitset	hidden	Nix TL::tn dropped; the rest redone manually downstream

yakra commented 1 year ago

Alpha ToDo list

Solo items

C	Descr	Comments
`54fb`	issue 588
`91a0`	!extr include
`7ce7`	userlog warn
`96e9`	avoid iterate
ToDo	Timestamp for first task (Get list of travelers in the system). Python Too.
ToDo	Constexpr the static edge format constants.
v	Extraneous else after continue; parens can be clarified
^	extraneous visibility == 1 check just before that

traveled_tmg_line & traveler_lists cleanup

C	Descr	Comments
`4504`	"0"	HGEdge, HighwaySegment. HG.cpp: 2 tmg_line, 3 size->travnum
`7f87`	trav

contig RAM group

C	Descr	Comments
ToDo	Get TMArray sorted.	TMArray means threaded construction in place (via placement new) without separate read_list function.
`a5a7`	contig RAM
ToDo	fix comment
ToDo	TravelerList dtor, to delete traveler_num

TMB group

C	Descr	Comments
`ecff`	TMB 1st draft	Does ctor init vars in order? Don't include siteupdate.cpp changes; see below.
Nope	~init segments with TravelerList::ids.size()~	Never mind. `TMArray<TravelerList>::size`.
`a423`	branching CBT
`acc9`	unsigned int
`47a7`	64-32-bit pun
`4db4`	iterator ptr
`c896`	arg order
`7415`	ctor / init
`4a64`	integral type
`01d6`	reorder membr
`01e7`	cbt_index
`48c8`	part add_idx
`c8b9`	const fold
`6634`	whitespace
`9536`	constexpr
`ecef`	ST aug index
done	~nix () and add_value until needed~	Done as part of c8b9
ToDo	Cast (unit)1 before <<	...lest the unsigned long bug make a reappearance. See f8cf on SimItOpt2 branch.
ToDo	Review TMBitset variable names	Also reduce arithmetic in iterator construction.
ToDo	Comment for != and \|= operators
ToDo	segments SQL table: only iterate clinched_by for active/preview systems
ToDo	uint8_tetc.
ToDo	As of ecff, cmath no longer explicitly needed in HighwayGraph.cpp
ToDo	Template specialization
ToDo	Explicit instantiation?

t_l via TMB group

C	Descr	Comments
`d704`	t_l via TMB
`2d1b`	inline mv&e	Delete that extra line of whitespace from HighwayGraph.h as in `4504`
ToDo	Don't include TMBitset
`4504`	"0"	GetSubData. HG.cpp: assign traveler numbers
`a009`	t_l list->vec

yakra commented 1 year ago

partial specialization experiments deleted stash @ 484be5074e68f31db790cc53b34cf2494f6ff114

yakra commented 1 year ago

Benchmark commits:

WIP: HD @ 59afa75 UD @ 2126e76
Final: HD @ 40c2356 UD @ 5b071e5

yakra commented 1 year ago

List

Conc Stat User

Comp Graph Graph_Linux Total ErCh ErCh_zoom

yakra commented 6 months ago

deleted branches (BiggaTomato)

g_db

g_db comments out DB here and here.
g_trav comments out everything specific to traveled graphs in mv&e, write_master and write_sub, except for assigning traveler numbers. HighwayGraph.cpp only.
g_cb comments out traveler_lists population
990a288 adds {}// at the beginning of this line

yakra / DataProcessing

`TMBitset` checklist #245

251

dead branches

commits

Functions, operators, etc.

HTML (resolved)

HTML

hidden/deleted branches

Alpha ToDo list

Solo items

traveled_tmg_line & traveler_lists cleanup

contig RAM group

TMB group

t_l via TMB group

deleted branches (BiggaTomato)