Closed yakra closed 1 year ago
find crashes via
for file in sulogs/2023-04-06/siteupdate_{a423,4448,8fba,8d55,7828,96e9,2d1b,5c76}*; do
echo -e "$file\t" `tail -n 1 $file \
| sed 's~\[[0-9.]*\]~~' \
| tr -d .`; done \
| grep -v 'Total run time:' \
| sed -r 's~.*siteupdate_(....)-.*~\1~' \
| uniq
SimItOpt2
with sign bit bug before reset:
dd1b6e0
longa8a3db9
int9280ec4
short
OK, not dead, but deprecated binary & logs. Hard reset here; bugfix commit addedtlist_canon
@ d37dfe0979bc5a470dad5851147850503d096bdc
c | incl? | descr | comments |
---|---|---|---|
a5a7 | yes | contig RAM | Prerequisite for everything else |
ecff | yes | TMB 1st draft | The main point of all this :) |
d704 | yes | t_l via TMB | Small speedup. Saves RAM. Potential for improvements via iteration and |= optimizations. |
a009 | yes | t_l list->vec | Not slower. The Right Thing To Do. Saves RAM. Locality. Is index used in future commits? Only in eaa6 , which won't be used as it hurts performance.What if traveler_lists is nixed altogether, and we just iterate traveler_set one more time at the end? Or, what if, instead of doing reallocations/copies, we just reserve(TravelerList::allusers.size()) after declaring?Re-examine nixing in future. |
91a0 | yes | !extr include | Housekeeping. Rebase earlier? |
7ce7 | yes | userlog warn | Unrelated, and doesn't appear to affect performance any, fnord. Rebase earlier? |
f977 | no | t_l as ref | Pass a bool instead; see 4504 .Nix extraneous traveler_lists creation for master graph. |
eaa6 | no | nix tl::tn | Hurts performance. See 9fdc et al instead. |
4e05 | no | avoid iterate | Child of eaa6 . See 96e9 even though it's the same diff. :P |
6922 | no | ctor / init | Any changes to performance? See next commit. No: See 7415 . |
5c76 | no | integral type | Userlogs slightly slower? Test again once dust settles? No: See 4a64 . |
b4ab | no | names & order | Arg order just makes sense. Name changes are more related to prev commit. No: See c896 . |
e61a | no | ++u | No change to binary. Done on its own in some future commits, including 47a7 . |
a423 | yes | branching CBT | Faster userlogs. Rebase earlier? Maybe try again once HighwaySegments are stored sequentially? |
4448 | no | short | unit branch; affects \|= only. Prefer flavors with iterator optimization. |
8fba | no | int | unit branch; affects \|= only. Prefer flavors with iterator optimization. |
8d55 | no | long | unit branch; affects \|= only. Prefer flavors with iterator optimization. |
7828 | no | no-branch CBC | Hurts performance |
96e9 | yes | avoid iterate | 4e05 cherrypicked. Reorder additions in for loop; make it look pretty. |
2d1b | yes | inline mv&e | Slight speedup, simpler prototypes & includes |
f28c | no | long bugfix | unit branch; affects \|= only. Prefer flavors with iterator optimization. |
953a | no | ! unit param | Not strictly necessary for pointer-punning \|= , but cleans things up.No: Different units may be useful for scenarios other than clinched_by . E.g., uint64_t may work well for HGVertex subgraph membership. |
03ec | no | punning \|= |
Evaluate WRT straight-up unit \|= . Prefer flavors with iterator optimization; 3 are simplified for 16/32-bit units; see below. |
7958 | no | SimItOpt1 | SimItOpt2 outperforms this. |
3910 | no | SimItOpt2 | 32 > 16 > 8-bit for iterating clinched_by . Additionally, allows for better \|= optimization. |
1be1 | no | SIO1 short | SimItOpt2 outperforms this. |
87e4 | no | SIO1 int | " |
2cbb | no | SIO1 long | " |
9280 | no | SIO2 short | 32 > 16 > 8-bit for iterating clinched_by . Additionally, allows for better \|= optimization. |
9347 | no | sign bit fix | " |
acc9 | yes | unsigned int | " |
8269 | no | unsigned long | Performs worse than 8/16/32-bit everywhere except epoch. |
f8cf | no | ulong bugfix | " |
f749 | no | ComItOpt | Evaluate against SimItOpt. All tasks. ~Try with per-unit \|= .~ <--Never mind.No: Underperforms SIO2 on all lab machines (but not BiggaTomato & epoch though those aren't the target). KISS principle. |
9fdc | no | trav_nums | May not do much on its own. Try again after other speedups are in place. See 8be1 . |
bc9d | no | TMB it idx | Pro: No arithmetic for clinchedby_code. Con: Still adding b.array+index everywhere else (deref). |
0b69 | no | iterator ptr | Pro: No arithmetic for dereferencing. Con: CBC subtracts out. Use 4db4 instead. |
^ ^^ | ^ ^^ | Both are superior to 9fdc , which adds for everything, then subtracts out for CBC.Test both sans 9fdc . Which gets called more?• 69% clinched_by in clinchedby_code @ 36,193,165, which wants indices.• 31% in everything else @ 16,024,183, which want pointers. (8,000,360 clinched_by in stats & SQL + 23,463 traveler_set in MV&E)So on paper, bc9d wins. But in practice we're within margin of error, with 0b69 faster on the 4 non-Ubuntu (coincidence?) machines.If foregoing or at least deferring 9fdc because no clear benefit, it makes sense to use 0b69 . |
|
ec4e | no | 16-bit it opt | Wins on epoch, BT & lab1; lags behind SimItOpt2 everywhere else, even before accounting for lookup array setup. Constexpr would be expensive for program & code size. Ugly changes to code, a namespace I wouldn't otherwise need. |
16bb | no | [bits >> 1] |
Only needed for ComItOpt, which looks like it's not going to pan out. Only performs better on lab2, so within margin of error at best. |
2d94 | no | ternaries | 1st place on lab2 by 0.5 ms & lab3 by 0.7; lags behind SimItOpt2 everywhere else except BT, where it's 2nd place behind ec4e , and more arguably epoch, where it's slower than 8 & 16-bit, but faster than 32-bit (which looks like it'll be the selected alternative). But again, BT & epoch aren't target machines. Keep it simple. |
fc17 | no | punning SIO2 | Bigger is better. |
72e2 | no | 64-32-16 pun | " |
c310 | no | 64-16-bit pun | " |
47a7 | yes | 64-32-bit pun | Bigger is better. Fewest branches & loops. 32-bit is fastest for iteration too. |
4db4 | yes | iterator ptr | Faster. |
4504 | yes | "0" | Rebase earlier |
7f87 | yes | trav | Better on paper; != op replaced with just sending an existing variable. Rename variable to is_traveled for readability. |
05d7 | no | reserve | No definitive diff; no-build may be faster. |
8be1 | ... | TL::nums | No apparent diffs. Revisit when more RAM bandwidth optimizations are in place? |
c896 | yes | arg order | Measures faster, but probably just noise. |
7415 | yes | ctor / init | Keep the initialization part; no diffs to binary from that. Array start & data pointer go from 8 to 24 B apart, increasing probability of being on different cache lines. That's got to explain the drop in UserLog performance. Therefore, reorder to get them consecutive again; see 01d6 . |
4a64 | yes | integral type | This time, no change to binary from 7415 . |
6da9 | no | alignas(32) | Increases sizeof(HighwaySegment) from 112 to 128. Does help UserLog performance on all machines, but hurts most other tasks on most machines. Combined, still a slight advantage overall but future commits do better... |
01d6 | yes | reorder membr | start and data adjacent again, restoring old same/different cache line probability. Data occasionally on different cache lines appears to be outweighed by not increasing sizeof(HighwaySegment) , except for bandwidth-constrained machines BiggaTomato, lab1 & bsdlab. 01e7 makes this a non-issue... |
01e7 | yes | cbt_index | Not only avoids reading start (potentially from another cache line), avoids redundant recalculation of the index. A clear advantage over all prev commits on all Linux boxes. |
48c8 | yes | part add_idx | add_clinched_by takes index, calculated once per chopped route, in store_traveled_segments . |
28af | no | local_index | No clear benefit; more diffs. Keep it simple. |
c8b9 | yes | const fold | readability |
adcf | no | 1 index/trav | No clear benefit; more diffs. Keep it simple. |
6634 | yes | whitespace | |
9536 | yes | constexpr | no diff to binary on BiggaTomato |
ecef | yes | ST aug index |
insert
No real changes, because constant folding
all via HighwaySegment::add_clinched_by
:
store_traveled_segments
operator ()
No real changes, because constant folding
Route::clinched_by_traveler
operator !=
Only changes in 953a
(nix unit parameter) & children (ComItOpt et al): multiply by sizeof(unit)
before memcmp
operator |=
Type-punning variants
unit
choice, store index or pointer
traveler_lists
vector, or getting item count & writing traveler names to last .tmg lineHighwaySegment::clinchedby_code
segments
tableinsert
.
Maybe do another one with the final selected commit for kicks, considering the weird minor-but-consistent diffs.insert
.
Maybe do another one with the final selected commit for kicks, considering the weird minor-but-consistent diffs.47a7
iterator optimization, 558e
iterate active/preview only, 4db4
store & return pointer.
Using 4db4
.acc9
, 32-bit SIO2.acc9
, 32-bit SIO2.a423
branching CBT, bf94
merge.
Using bf94
.operator !=
here.
Maybe do another one with the final selected commit for kicks.|=
, iteration, clinchedby_code, trav_nums, inline mv&e, unit/punning.
Using f977
though that commit may be overwritten.f977
+ 3 more unit
alternatives, 03ec
punning |=
, 4 SIO2, 4 with 8/16/32-bit punning
Using 47a7
, 32-bit SIO2 + 64/32-bit punning.4db4
store pointer, 2eff
inline, 4504
"0", 7f87
trav.
Yes.9fdc
trav_nums, bc9d
TMB it idx, 0b69
iterator ptr.
0b69
cherrypicked -> 4db4
.
TBD: Unsure what to do WRT trav_nums yet.f977
t_l as ref, 2d1b
inline mv&e, 4db4
store pointer, 2eff
merge, d0fa
nixtl branch, 13b9
merge.
Yes.4db4
store pointer, edb4
nix t_l, aecc
"0", d0fa
notrav, 13b9
inline
Yes? Contradicts nixtl2.4504
"0", 13b9
nixtl, 7f87
trav (PreAlpha), 7f7d
trav(nixtl)
No (Contradicts nixtl). May re-examine in future.comm | branch | DoWhat | comment |
---|---|---|---|
ecef | ConstFold | delete | TMBPreAlpha FFWDed |
adcf | ElStacko | hidden | |
28af | add_index | hidden | |
89bd | cbt_index | delete | TMBPreAlpha FFWDed |
01d6 | reorder | delete | merged |
6da9 | alignas | hidden | |
8be1 | thread_local | hidden | re-examine in future |
05d7 | reserve | hidden | |
660c | [stash] | hidden | meta-iteration (CompStats) |
6a83 | [stash] | hidden | meta-iteration (traveler names) |
7f7d | nixtl | hidden | re-examine in future |
47a7 | 64-32 | delete | parent of TMBPreAlpha & nixtl |
c310 | 64-16 | hidden | |
72e2 | 64-32-16 | delete | parent of 64-16 |
fc17 | SIO2pun | hidden | |
2d94 | ternaries | hidden | |
16bb | bit1 | delete | parent of ternaries |
ec4e | ComItOpt16 | delete | parent of ternaries |
0b69 | return_p | hidden | |
bc9d | idx | hidden | |
9fdc | nixon | delete | parent of return_p |
f749 | ComItOpt | delete | parent of ternaries |
f8cf | SimItOpt2 | hidden | |
2cbb | SimItOpt | hidden | |
03ec | pun | delete | parent of ternaries |
54fb | issue588 | hidden | use this |
f28c | unit | hidden | |
2d1b | inline | delete | merged |
96e9 | switch | delete | merged |
7828 | cbycode | hidden | |
a423 | cbt_br | delete | merged |
e61a | TMBitset | hidden | Nix TL::tn dropped; the rest redone manually downstream |
C | Descr | Comments |
---|---|---|
54fb |
issue 588 | |
91a0 |
!extr include | |
7ce7 |
userlog warn | |
96e9 |
avoid iterate | |
ToDo | Timestamp for first task (Get list of travelers in the system). Python Too. | |
ToDo | Constexpr the static edge format constants. | |
v | Extraneous else after continue; parens can be clarified | |
^ | extraneous visibility == 1 check just before that |
C | Descr | Comments |
---|---|---|
4504 |
"0" | HGEdge, HighwaySegment. HG.cpp: 2 tmg_line, 3 size->travnum |
7f87 |
trav |
C | Descr | Comments |
---|---|---|
ToDo | Get TMArray sorted. | TMArray |
a5a7 |
contig RAM | |
ToDo | fix comment | |
ToDo | TravelerList dtor, to delete traveler_num |
C | Descr | Comments |
---|---|---|
ecff |
TMB 1st draft | Does ctor init vars in order? Don't include siteupdate.cpp changes; see below. |
Nope | ~init segments with TravelerList::ids.size()~ | Never mind. TMArray<TravelerList>::size . |
a423 |
branching CBT | |
acc9 |
unsigned int | |
47a7 |
64-32-bit pun | |
4db4 |
iterator ptr | |
c896 |
arg order | |
7415 |
ctor / init | |
4a64 |
integral type | |
01d6 |
reorder membr | |
01e7 |
cbt_index | |
48c8 |
part add_idx | |
c8b9 |
const fold | |
6634 |
whitespace | |
9536 |
constexpr | |
ecef |
ST aug index | |
done | ~nix () and add_value until needed~ | Done as part of c8b9 |
ToDo | Cast (unit)1 before << | ...lest the unsigned long bug make a reappearance. See f8cf on SimItOpt2 branch. |
ToDo | Review TMBitset variable names | Also reduce arithmetic in iterator construction. |
ToDo | Comment for != and |= operators | |
ToDo | segments SQL table: only iterate clinched_by for active/preview systems | |
ToDo | uint8_tetc. | |
ToDo | As of ecff, cmath no longer explicitly needed in HighwayGraph.cpp | |
ToDo | Template specialization |
|
ToDo | Explicit instantiation? |
C | Descr | Comments |
---|---|---|
d704 |
t_l via TMB | |
2d1b |
inline mv&e | Delete that extra line of whitespace from HighwayGraph.h as in 4504 |
ToDo | Don't include TMBitset | |
4504 |
"0" | GetSubData. HG.cpp: assign traveler numbers |
a009 |
t_l list->vec |
partial specialization experiments deleted stash @ 484be5074e68f31db790cc53b34cf2494f6ff114
Benchmark commits:
HD @ 59afa75
UD @ 2126e76
HD @ 40c2356
UD @ 5b071e5
traveler_lists
population990a288
adds {}//
at the beginning of this line
Benchmark:
Route::clinched_by_traveler
: branching vs indirection Branching is faster.unit
s Bigger is faster.matching_vertices_and_edges
(Change name? It finds travelers too) Slightly faster on all machines except lab2. (Noise?)eaa6
:Branching is faster.
eaa6
; apply4e05
directly to prev commit For Computing Stats,96e9
... • is THE top performer so far for lab1, lab3, lab4. • outperforms4e05
on BiggaTomato. • on lab2 lags behind top performer4e05
by only 3.8 ms. Just noise? A very tight race here.5c76
inconclusive, but does appear slightly slower for userlogs, very consistently. Try:6922
for a more apples-to-apples comparison.4a64
).traveler_lists
is nixed altogether? As we iterate thrutraveler_set
, just keep a count instead of creating a vector. The trade-off is doing one more iteration thrutraveler_set
at the end of the traveled graph for traveler names. Is this outweighed by not having to construct the vector and do a modest number of allocations/reallocations? Preliminary results: helps on BiggaTomato; hurts on lab1. Be interesting to see results on bsdlab, at higher thread counts, and after doing more work on the RAM bandwidth bottleneck. Final: No speed advantage. Leaving as-is. Maybe re-examine after more RAM bandwidth improvements are implemented.Try out:
HGEdge
construction slow down if I force an active/preview canonicalHighwaySegment
? \~ 0.01 - 0.02 s on BiggaTomato -- 0.9 - 1.9 % more time.unit
s) The trade-off: Larger units = skip back farther but less often.|=
performance.f749
underperforms 8-32-bit simple iteration on lab{1..4}; slight lead on BiggaTomato.ec4e
is 1st place for BiggaTomato & lab1; underperformsf749
& even 64-bitf8cf
on lab2. Other machines TBD.[bits >> 1]
solutionec4e
on BiggaTomato & lab1. Successively better on lab2 though, with ternaries 1st place overall &[bits >> 1]
falling between 16 & 32-bit SIO2. Other machines TBD.constexpr
the damn thing by brute force?~ Never mind. Nothing to be gained by doing this.ec4e
perform poorly on Epoch? No. Performs well; outperforms SIO2. Similar to BiggaTomato.|=
Try simplified versions for larger units:TravelerList::traveler_num
with oneunsigned int* traveler_num = new unsigned int[TravelerList::allusers.size()];
per thread. Index viafor (TravelerList *t : traveler_lists) traveler_nums[t-TravelerList::allusers.data()] = travnum++;
eaa6
with TMBitset[]
operator?~ LOLNOPE. Different vectors (everything vs subset); indices don't match up.Kinda both:
TravelerList::ids.size()
~ Segments set via TMArray::size, set via TravelerList::ids.size()TMArray<TravelerList>
means threaded construction in place (via placement new) without separate read_list function.251
Clean up:
ecff
, cmath no longer explicitly needed in HighwayGraph.cppelse
aftercontinue
; parens can be clarified https://github.com/yakra/DataProcessing/blob/80f27243bf26579c53c6c7a215a5aeb22347317a/siteupdate/cplusplus/classes/GraphGeneration/HighwayGraph.cpp#L112-L116visibility == 1
check just before that4db4
)maxbits
less useful inSimItOpt2
. Delete it; replace with8*sizeof(unit)
; let-1
and+1
cancel out.bits
should beunsigned char
inpun
branch. Dumb luck that it ran without errors. Fixed forComItOpt
.~ Never mind. Using a branch that retainsunit
instead.uint8_t
etc.2d1b
) means#include "../../templates/TMBitset.cpp"
not needed in HighwayGraph.h. Can lose the include guard too LOL. :cowboy_hat_face:segments
SQL table: only iterateclinched_by
for active/preview systemsc8b9
) Check for diffs due to constant folding:8/sizeof(unit)
!=
and|=
operators(unit)1
before<<
, lest the unsigned long bug make a reappearance. Seef8cf
onSimItOpt2
branch.4db4
) Switch: are additions infor
loop reordered? Make it look pretty!()
andadd_value
until needed