yakra / DataProcessing

Data Processing Scripts and Programs for Travel Mapping Project

Opportunities for multithreading #28

Open yakra opened 5 years ago

yakra commented 5 years ago
| Task | Priority | Comments |
|------|----------|----------|
| Sorting waypoints in Quadtree | Done | 250a2fa6e17d82e9dce74e9f79537af1b2b98aea<br>Combine sorting colocated point lists. See #89. |
| Writing near-miss point merged wpt files | Done | |
| Processing traveler list files | Done | |
| Concurrency detection | (bottom) 1.1s | Potential for data races makes this impracticable. |
| Processing waypoint labels and checking for unconnected chopped routes | Done | b07f73b1a6531aad7ca8e0f9cc5ce81d2be0a45e<br>https://github.com/TravelMapping/DataProcessing/issues/397 |
| Concurrency augments | Done | |
| Computing stats | Done | Had less luck the first time round: spent all its time hung up inside one mutex, with one map constantly creating and throwing std::out_of_range. It was written a little too much like the original Python. :) |
| Per-traveler stats log entries | Done | |
| Writing traveler list logs | Done | ~~Maybe fold the writing into the calculation... only 0.2 s now.~~ Done. |
| Writing stats csv files | Done | 2dd3b28b54165ab2f1745ad2befed0c612c3b8b2<br>~1/3 all active, 1/3 active+preview, 1/3 individual systems |
| Unique names and vertices | Done | 4d988b326b2c2069a71b43177c4c1721859ac0dd<br>https://github.com/TravelMapping/DataProcessing/issues/285 |
| Creating edges | (medium) 3.3s | Do #175 first! Need a mutex to:<br>• check for duplicate edges (reorder code in 1st ctor)<br>• add to adjacency lists<br>Just one should be good, but see below... |
| Compressing collapsed edges | (low) ~1.3s | Need a mutex for adding to and detaching from adjacency lists. Options:<br>• 1 for every vertex = 22,686,480 B<br>• 1 for every quadtree node (faster) = 4,916,336 B<br>• 1 for every quadtree node (slower) = 1,526,120 B<br>• 1 for everything = 40 B. I can get away with this, right? |
| Writing master graphs | Done | 96be2a38300bf8ea63c23ddd17cf80441669fd9f<br>Combined into one function & written alongside subgraphs. |
| Subgraphs | Done | 4879245fdfa727e57f90cd33b110a7ddd14e9bd6 |
| Performing data checks | Done | Folded into Route::read_wpt to take advantage of idle CPU time while waiting for I/O. |
| Marking datacheck false positives | (high) 9.7s | Iterator validity, etc. Don't think parallel, just think concurrent; see https://github.com/TravelMapping/DataProcessing/issues/379 |
| Database file | Done | bc964258e733abafe6ff2550f16fa12788a51439<br>Start one thread writing everything except the graph info; generate graphs; join() the file-writing thread; append the graph data to the file. |
yakra commented 5 years ago

Multi-threaded subgraph writing is implemented, but there's room to improve parallelism.

There's a lot of "dead space" in how processing is split up between graph categories. Some cores sit idle while we wait for the last graphs in each category to finish before prepping the graphs in the next category. With only 2 multisystem graph sets, 2 cores are idle for the entire 3.8 s, and EUR-freeways probably finishes quite a bit before USA-national.

A better way is to prep all categories' graphs up front, loading the entire graph_vector before any writing starts. Then we can plow through the whole thing in one go: no gaps, no stalls. Add graphs to graph_vector from slowest to fastest. Fill the bathtub up with baseballs before filling it up with marbles, n'at.

| Task | Time (s) | Sets | Seconds per set |
|------|----------|------|-----------------|
| Writing continent graphs | 17.4 | 6 | 2.90 |
| Writing multisystem graphs | 4.2 | 2 | 2.10 |
| Writing system data graphs | 4.6 | 4 | 1.15 |
| Writing country graphs | 10.8 | 14 | 0.77 |
| Writing area data graphs | 8.5 | 25 | 0.34 |
| Writing multiregion graphs | 1.1 | 4 | 0.27 |
| Writing regional data graphs | 10.9 | 342 | 0.03 |
yakra commented 5 years ago

^ Tackle https://github.com/TravelMapping/DataProcessing/issues/223 first. This may alter where the area graphs sit in the speed hierarchy.

yakra commented 5 years ago
| Task | Time (s) | Sets | Seconds per set |
|------|----------|------|-----------------|
| Creating continent graphs | 17.3 | 6 | 2.88 |
| Creating multisystem graphs | 4.3 | 2 | 2.15 |
| Creating system data graphs | 4.5 | 4 | 1.1 |
| Creating country graphs | 10.6 | 14 | 0.76 |
| Creating multiregion graphs | 1.1 | 4 | 0.3 |
| Creating area data graphs | 1.2 | 26 | 0.05 |
| Creating regional data graphs | 11 | 329 | 0.03 |
yakra commented 4 years ago

Lab2 is able to take advantage of its greater number of cores, and outperforms lab1 in most multi-threaded tasks.

```shell
yakra@BiggaTomato:~/TravelMapping/devel/cpuscale/20200318$
sum=0
deltas=$( \
  for task in ReadWpt NmpSearch NmpMerged ReadList \
              ConcAug CompStats UserLog Subgraph; do
    vals=$( \
      for BOX in lab1 lab2; do
        ./taskscale.sh $BOX _insert $task | grep 'peak'
      done | sed 's~.*is \(.*\)s at.*~\1~'
    )
    eq=$(echo $vals | tr ' ' '-')
    echo -n "$eq="
    echo "scale=3; $eq" | bc
  done | cut -f2 -d'=' \
)
for d in $deltas; do sum=$sum+$d; done
echo "scale=3; $sum" | bc
```

All combined, lab2 is faster by 1.89 s. Lab1 wins overall because it's faster in the single-threaded bits.


TBD: How much total time is taken up by each machine:


To find the average time for a given single-threaded task on a given machine:

```shell
yakra@BiggaTomato:~/TravelMapping/devel/cpuscale/20200318/sulogs/cpp$
count=$(ls siteupdate_insert-lab2* | wc -l)
sum=0
for file in $(ls siteupdate_insert-lab2*); do
  sum=$sum+$( \
    delta=$( \
      tac $file | grep '^\[[0-9]\+\.[0-9]\]' \
      | grep -B 1 unique \
      | sed 's~^\[\([0-9]\+\.[0-9]\).*~\1~'
    )
    echo "scale=1; $(echo $delta | tr ' ' '-')" | bc \
  )
done
echo "scale=8; ($sum) / $count" | bc
```