VISIBLE_HIDDEN_COLOC and graph_points

yakra commented 1 year ago

[x] Moving the VISIBLE_HIDDEN_COLOC datacheck from WaypointQuadtree::graph_points into HighwaySystem::route_integrity has a few advantages:
- Organization: move into a function already focused mainly on datachecks.
- [x] Move from a single-threaded to multi-threaded part of the program.
- [x] ~Once Waypoints are stored in contiguous memory, iterating via Route rather than quadtree node will be better for cache locality.~ This doesn't matter though. Iteration speed isn't affected:
- New route_integrity method piggybacks onto existing Route iteration.
- Old graph_points iteration is still intact, based on the quadtree.
- This makes this essentially a special case of the next bullet point.
With Waypoints stored contiguously, graph point setup iterating via system->route (benchmarked slower when WaypointQuadtree::graph_points was written) rather than via quadtree may become faster. Low impact, though. Low priority. Potentially higher impact: Vertices created in this order may yield a more favorable compressed edge order. This item gets its own issue.

yakra commented 8 months ago

Once HIDDEN_JUNCTION is also moved out of graph generation, graph generation can be removed from --errorcheck mode entirely. This would save about 20% of total errorcheck time on BiggaTomato.

Background:

Way back before I began changes to VISIBLE_HIDDEN_COLOC and HIDDEN_JUNCTION:

- --errorcheck mode would skip graph generation entirely. So, errorcheck mode wouldn't actually check all the errors, namely HIDDEN_JUNCTION.
- Datacheck FPs were processed before graph generation, and thus before the HIDDEN_JUNCTION checks were performed, leading to wacky hijinks.

@jteresco wrote:

> The root of the problem here is that we have a group of datacheck errors that are only going to be detected when graph generation happens. Maybe the answer is that the master graph needs to be generated every time to allow these datachecks to occur, but only generate all of the subgraphs (the time consuming part) when there's no -k flag. Then move the FP/unmarked FP stuff to after that graph is generated.

So that's what was done.

At the same time, both datachecks would produce non-deterministic results due to non-deterministic ordering of waypoint colocation lists caused by multi-threaded reading of WPT files.

So, I moved VISIBLE_HIDDEN_COLOC detection into graph generation too, needing to do it some time after all waypoints are read into the system and colocation lists are fully populated. In retrospect, doing this as part of graph generation wasn't the best idea, but I was still newer to understanding the site update process then. :) I forget the exact rationale if there was anything other than "One possibility is iterating thru all_waypoints.point_list() after the WaypointQuadtree is constructed. But this seems wasteful if we're already iterating thru the waypoints as they're read in from disk."

Things have changed since then; there's now the HighwaySystem::route_integrity function that iterates through routes and waypoints anyway. As the OP notes, it focuses on performing datachecks, and is more efficient: it's multi-threaded, with better memory access patterns.
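To illustrate the idea of doing this check while iterating by route, here is a minimal sketch. It is not the actual siteupdate code: the `Waypoint` and `Route` structs, their members, and the `visible_hidden_coloc` function are simplified, hypothetical stand-ins. It also assumes each colocation list is already fully populated and in a deterministic order before the check runs.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the real siteupdate classes.
struct Waypoint {
    std::string route;   // root of the owning route
    std::string label;
    bool is_hidden;
    const std::vector<Waypoint*>* colocated;  // nullptr when not colocated
};

struct Route {
    std::string root;
    std::vector<Waypoint> points;
};

// Returns one message per colocation group that mixes visible and hidden
// points. A group is reported only when iteration reaches its first member,
// so it is flagged once no matter how many routes pass through it.
std::vector<std::string> visible_hidden_coloc(const std::vector<Route>& routes) {
    std::vector<std::string> flagged;
    for (const Route& r : routes) {
        for (const Waypoint& w : r.points) {
            if (!w.colocated) continue;               // not colocated
            assert(!w.colocated->empty());            // lists are never empty
            if (w.colocated->front() != &w) continue; // not the group's first point

            // Find the first visible and the first hidden point in the group.
            const Waypoint* visible = nullptr;
            const Waypoint* hidden = nullptr;
            for (const Waypoint* p : *w.colocated) {
                if (p->is_hidden) { if (!hidden) hidden = p; }
                else if (!visible) visible = p;
            }
            if (visible && hidden)
                flagged.push_back(visible->route + " " + visible->label +
                                  " <-> " + hidden->route + " " + hidden->label);
        }
    }
    return flagged;
}
```

Because the check only reads data that is complete before it starts, a loop like this could run per system on separate threads, provided each thread collects its own results.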

Present day:

Efforts to include data from devel systems in flagging HIDDEN_JUNCTION errors will remove that datacheck from the graph generation process, and relocate it into HighwaySystem::route_integrity. I have a working prototype here on my desktop that's mostly ready to go.
And then there's VISIBLE_HIDDEN_COLOC, which this issue is all about. In addition to the aforementioned benefits, fitting it into the framework developed to get HIDDEN_JUNCTION working will provide some slight efficiency improvements of its own.

Once both these datachecks are removed from the graph generation process, building the master graph can be removed from --errorcheck mode again, cutting out a considerable amount of time. About 20% of total errorcheck time on BiggaTomato, and even more on machines with more cores, as the relevant graph routines are single-threaded.
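The "even more on machines with more cores" point follows from simple arithmetic: a single-threaded step takes the same wall time however many threads are available, while the parallel work shrinks. A rough sketch, with a hypothetical `graph_share` helper and made-up timings (not measurements from BiggaTomato or any other machine):

```cpp
#include <cassert>

// Fraction of total wall time spent in a single-threaded graph step.
// Assumes the parallel work scales perfectly with the thread count,
// which real workloads only approximate.
double graph_share(double serial_graph, double serial_other,
                   double parallel_work, int threads) {
    assert(threads > 0);
    double total = serial_graph + serial_other + parallel_work / threads;
    return serial_graph / total;
}
```

With illustrative inputs of 20 units of graph time, 0 units of other serial time, and 80 units of parallel work, the graph step's share is 20% on 1 thread and rises to 50% on 4 threads, since the parallel portion drops from 80 to 20 units while the graph step stays at 20.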

Moving forward (ping @jteresco):

--errorcheck mode, or /fast/tm/datacheck if you prefer: Aside from reporting 15ish new HIDDEN_JUNCTION errors, the only real difference here will be that siteupdate --errorcheck will no longer produce waypointsimplification.log. I doubt anybody cares about this. :) FWIW, it's not even linked from the Developer Logs page. @michihdeu, you're probably the most observant & adept at using errorcheck -- any opinions?

--skipgraphs mode: This is probably rarely if ever used anymore due to the improvements in graph generation speed over the years. But if we do use it, it should be done right. There will be no graphdata directory, and no waypointsimplification.log. Looking into this though, it's probably no big deal. There's no link to either from the devel pages (so nothing to get broken), only to logs/ itself. We just won't see a link to waypointsimplification.log indexed in that directory, and life goes on. Any opinions or thoughts?

michihdeu commented 8 months ago

I don't use waypointsimplification.log at all. Nor do I use datacheck.log after running "errorcheck mode"; instead, I copy the FP entries from the frontend after the site update. I didn't read the whole novel but rely on your expertise. Just go for it!

jteresco commented 8 months ago

I agree that no one cares about the lack of a waypointsimplification.log at this point. It was of interest when developing the initial simplification rules and code, and can be if talking about them academically (I've had this idea that this is an interesting example but have never gone ahead and used it). Definitely doesn't matter if it's not generated in datacheck mode.

--skipgraphs has had two main purposes: running extra updates more quickly when we didn't want to bother, and when I was using the graph data in classes and wanted to suspend graph generation for a bit. The former is still potentially helpful to save file space but efficiency is so much better it's not a meaningful time saver. The latter is now totally irrelevant with graph archive sets that can be loaded up in HDX. Bottom line, I'd like to keep it available but I also don't mind if it means the log is not generated.

yakra commented 8 months ago

OK, good to know.

> It was of interest when developing the initial simplification rules and code

I find it useful when doing the same. Though I certainly know how to make one when I need one. ;) For general purposes & most users though, yeah, it's not very relevant.

I'll forge ahead with the changes here.

yakra / DataProcessing