yakra / DataProcessing

Data Processing Scripts and Programs for Travel Mapping Project

Opportunities for multithreading #28

Open yakra opened 5 years ago

yakra commented 5 years ago
| Task | Priority | Comments |
|------|----------|----------|
| Sorting waypoints in Quadtree | Done | 250a2fa6e17d82e9dce74e9f79537af1b2b98aea<br>Combine sorting colocated point lists. See #89. |
| Writing near-miss point merged wpt files | Done | |
| Processing traveler list files | Done | |
| Concurrency detection | (bottom) 1.1s | Potential for data races makes this impracticable. |
| Processing waypoint labels and checking for unconnected chopped routes | Done | b07f73b1a6531aad7ca8e0f9cc5ce81d2be0a45e<br>https://github.com/TravelMapping/DataProcessing/issues/397 |
| Concurrency augments | Done | |
| Computing stats | Done | Had less luck the first time round: spent all its time hung up inside one mutex, with one map constantly creating and throwing std::out_of_range. It was written a little too much like the original Python. :) |
| Per-traveler stats log entries | Done | |
| Writing traveler list logs | Done | ~~Maybe fold the writing into the calculation... only 0.2 s now.~~ Done. |
| Writing stats csv files | Done | 2dd3b28b54165ab2f1745ad2befed0c612c3b8b2<br>~1/3 all active, 1/3 active+preview, 1/3 individual systems |
| Unique names and vertices | Done | 4d988b326b2c2069a71b43177c4c1721859ac0dd<br>https://github.com/TravelMapping/DataProcessing/issues/285 |
| Creating edges | (medium) 3.3s | Do #175 first! Need a mutex to:<br>• check for duplicate edges (reorder code in 1st ctor)<br>• add to adjacency lists<br>Just one should be good, but see below... |
| Compressing collapsed edges | (low) ~1.3s | Need a mutex for adding to and detaching from adjacency lists. Options:<br>• 1 for every vertex = 22,686,480 B<br>• 1 for every quadtree node (faster) = 4,916,336 B<br>• 1 for every quadtree node (slower) = 1,526,120 B<br>• 1 for everything = 40 B. I can get away with this, right? |
| Writing master graphs | Done | 96be2a38300bf8ea63c23ddd17cf80441669fd9f<br>Combined into one function & written alongside subgraphs. |
| Subgraphs | Done | 4879245fdfa727e57f90cd33b110a7ddd14e9bd6 |
| Performing data checks | Done | Folded into Route::read_wpt to take advantage of idle CPU time while waiting for I/O. |
| Marking datacheck false positives | (high) 9.7s | Iterator validity, etc. Don't think parallel, just think concurrent; see https://github.com/TravelMapping/DataProcessing/issues/379 |
| Database file | Done | bc964258e733abafe6ff2550f16fa12788a51439<br>Start one thread writing everything except the graph info; generate graphs; join() the file-writing thread; append the graph data to the file. |
yakra commented 5 years ago

Multi-threaded subgraph writing is implemented, but there's room to improve parallelism.

There's a lot of "dead space" in how processing is split up between graph categories. Some cores sit idle while we wait for the last graphs in each category to finish before prepping the graphs in the next category. With only 2 multisystem graph sets, 2 cores are idle for the entire 3.8 s, and EUR-freeways probably finishes quite a bit before USA-national.

A better way is to prep all categories' graphs up front, loading the entire graph_vector before any writing starts. Then we can plow through the whole thing in one go: no gaps, no stalls. Add graphs to graph_vector from slowest to fastest. Fill the bathtub up with baseballs before filling it up with marbles, n'at.

| Task | Time (s) | Sets | Seconds per set |
|------|----------|------|-----------------|
| Writing continent graphs | 17.4 | 6 | 2.90 |
| Writing multisystem graphs | 4.2 | 2 | 2.10 |
| Writing system data graphs | 4.6 | 4 | 1.15 |
| Writing country graphs | 10.8 | 14 | 0.77 |
| Writing area data graphs | 8.5 | 25 | 0.34 |
| Writing multiregion graphs | 1.1 | 4 | 0.27 |
| Writing regional data graphs | 10.9 | 342 | 0.03 |
yakra commented 5 years ago

^ Tackle https://github.com/TravelMapping/DataProcessing/issues/223 first. This may alter where the area graphs sit in the speed hierarchy.

yakra commented 5 years ago
| Task | Time (s) | Sets | Seconds per set |
|------|----------|------|-----------------|
| Creating continent graphs | 17.3 | 6 | 2.88 |
| Creating multisystem graphs | 4.3 | 2 | 2.15 |
| Creating system data graphs | 4.5 | 4 | 1.1 |
| Creating country graphs | 10.6 | 14 | 0.76 |
| Creating multiregion graphs | 1.1 | 4 | 0.3 |
| Creating area data graphs | 1.2 | 26 | 0.05 |
| Creating regional data graphs | 11 | 329 | 0.03 |
yakra commented 4 years ago

Lab2 is able to take advantage of its greater number of cores, and outperforms lab1 in most multi-threaded tasks.

```shell
yakra@BiggaTomato:~/TravelMapping/devel/cpuscale/20200318$
sum=0
deltas=$( \
  for task in ReadWpt NmpSearch NmpMerged ReadList \
              ConcAug CompStats UserLog Subgraph; do
    vals=$( \
      for BOX in lab1 lab2; do
        ./taskscale.sh $BOX _insert $task | grep 'peak'
      done | sed 's~.*is \(.*\)s at.*~\1~'
    )
    eq=$(echo $vals | tr ' ' '-')
    echo -n "$eq="
    echo "scale=3; $eq" | bc
  done | cut -f2 -d'=' \
)
for d in $deltas; do sum=$sum+$d; done
echo "scale=3; $sum" | bc
```

All combined, lab2 is faster by 1.89 s. Lab1 wins overall because it's faster in the single-threaded bits.


TBD: How much total time is taken up by each machine:


To find the average time for a given single-threaded task on a given machine:

```shell
yakra@BiggaTomato:~/TravelMapping/devel/cpuscale/20200318/sulogs/cpp$
count=$(ls siteupdate_insert-lab2* | wc -l)
sum=0
for file in $(ls siteupdate_insert-lab2*); do
  sum=$sum+$( \
    delta=$( \
      tac $file | grep '^\[[0-9]\+\.[0-9]\]' \
      | grep -B 1 unique \
      | sed 's~^\[\([0-9]\+\.[0-9]\).*~\1~'
    )
    echo "scale=1; $(echo $delta | tr ' ' '-')" | bc \
  )
done
echo "scale=8; ($sum) / $count" | bc
```