yakra / DataProcessing

Data Processing Scripts and Programs for Travel Mapping Project

VISIBLE_HIDDEN_COLOC and graph_points #242

Closed yakra closed 3 months ago

yakra commented 1 year ago
yakra commented 8 months ago

Once HIDDEN_JUNCTION is also moved out of graph generation, graph generation can be removed from --errorcheck mode entirely. This would save about 20% of total errorcheck time on BiggaTomato.

Background:

Way back before I began changes to VISIBLE_HIDDEN_COLOC and HIDDEN_JUNCTION,

@jteresco wrote:

The root of the problem here is that we have a group of datacheck errors that are only going to be detected when graph generation happens. Maybe the answer is that the master graph needs to be generated every time to allow these datachecks to occur, but only generate all of the subgraphs (the time consuming part) when there's no -k flag. Then move the FP/unmarked FP stuff to after that graph is generated.

So that's what was done.

At the same time, both datachecks would produce non-deterministic results due to non-deterministic ordering of waypoint colocation lists caused by multi-threaded reading of WPT files.
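To illustrate the ordering problem, here is a minimal Python sketch, not the actual siteupdate code. It assumes a check that reports against whichever visible point happens to come first in a colocation list, and that hidden points are the ones whose labels start with `+`. The `Waypoint` class, the route names, and both check functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Waypoint:
    route: str  # hypothetical route identifier
    label: str  # assumed convention: a leading '+' marks a hidden point

    @property
    def hidden(self):
        return self.label.startswith("+")


def naive_check(colocated):
    """Report against whichever visible point comes first in list order."""
    visible = [w for w in colocated if not w.hidden]
    hidden = [w for w in colocated if w.hidden]
    if visible and hidden:
        return (visible[0].route, visible[0].label,
                hidden[0].route, hidden[0].label)
    return None


def stable_check(colocated):
    """Same check, run on a canonically sorted copy of the list."""
    return naive_check(sorted(colocated, key=lambda w: (w.route, w.label)))


a = Waypoint("rteA", "12")
b = Waypoint("rteB", "Jct")
c = Waypoint("rteC", "+X123")

# Two thread schedules could append the same points in different orders.
order1 = [a, b, c]
order2 = [b, a, c]
```

With `naive_check`, `order1` and `order2` name different visible points in the report, while `stable_check` gives the same result for both. Sorting is only one way to get a canonical order; the point is that the check must not depend on the order in which threads happened to append.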

So, I moved VISIBLE_HIDDEN_COLOC detection into graph generation too, needing to do it some time after all waypoints are read into the system and colocation lists are fully populated. In retrospect, doing this as part of graph generation wasn't the best idea, but I was still newer to understanding the site update process then. :) I forget the exact rationale if there was anything other than "One possibility is iterating thru all_waypoints.point_list() after the WaypointQuadtree is constructed. But this seems wasteful if we're already iterating thru the waypoints as they're read in from disk."

Things have changed since then; there's now the HighwaySystem::route_integrity function that iterates through routes and waypoints anyway. As the OP notes, it focuses on performing datachecks, and is more efficient: multi-threaded, with better memory access patterns.
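A rough Python sketch of the idea of running the check in a per-system pass once every colocation list is complete. This is not the real HighwaySystem::route_integrity; the data layout, `check_system`, `run_datachecks`, and the leading-`+` rule for hidden labels are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def is_hidden(label):
    # Assumed convention: hidden waypoint labels start with '+'.
    return label.startswith("+")


def check_system(system, colocated_at):
    """Flag visible points that share coordinates with a hidden point
    on a different route. Only reads shared data, so it can run per
    system in parallel."""
    errors = []
    for route, points in system["routes"].items():
        for label, coords in points:
            if is_hidden(label):
                continue
            for other_route, other_label in colocated_at[coords]:
                if other_route != route and is_hidden(other_label):
                    errors.append((route, label, "VISIBLE_HIDDEN_COLOC",
                                   f"{other_route} {other_label}"))
                    break
    return errors


def run_datachecks(systems, threads=4):
    # Step 1: build complete colocation lists before any checking starts.
    colocated_at = {}
    for system in systems:
        for route, points in system["routes"].items():
            for label, coords in points:
                colocated_at.setdefault(coords, []).append((route, label))
    for group in colocated_at.values():
        group.sort()  # canonical order, independent of read order

    # Step 2: check each system in its own task; sort the merged results.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        per_system = pool.map(lambda s: check_system(s, colocated_at), systems)
        return sorted(e for errs in per_system for e in errs)


systems = [
    {"name": "sysA", "routes": {"rteA": [("1", (0.0, 0.0)), ("2", (1.0, 1.0))]}},
    {"name": "sysB", "routes": {"rteB": [("+X1", (1.0, 1.0)), ("3", (2.0, 2.0))]}},
]
errors = run_datachecks(systems)
```

Because the colocation lists are finished and sorted before any worker starts, and the merged results are sorted at the end, the output does not depend on thread scheduling.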

Present day:

Once both these datachecks are removed from the graph generation process, building the master graph can be removed from --errorcheck mode again, cutting out a considerable amount of time. About 20% of total errorcheck time on BiggaTomato, and even more on machines with more cores, as the relevant graph routines are single-threaded.
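For a sense of the arithmetic, here is a small Python sketch. The 20% figure is from above; the timings passed to `serial_share` are made-up numbers chosen only to reproduce that 20% at 4 cores, and the model assumes everything except the graph step scales perfectly with core count, which real runs won't.

```python
def speedup_from_removing(fraction_removed):
    """Overall speedup when a given fraction of total run time is eliminated."""
    return 1.0 / (1.0 - fraction_removed)


def serial_share(serial_time, parallel_work, cores):
    """Fraction of total time spent in a single-threaded step, assuming the
    remaining work divides evenly across the available cores."""
    total = serial_time + parallel_work / cores
    return serial_time / total


# Removing 20% of the run time makes the whole run 1.25x faster.
gain = speedup_from_removing(0.20)

# Hypothetical: 2 s of single-threaded graph work, 32 core-seconds of the rest.
share_4_cores = serial_share(2.0, 32.0, 4)    # 2 / (2 + 8) = 0.2
share_16_cores = serial_share(2.0, 32.0, 16)  # 2 / (2 + 2) = 0.5
```

Under those assumptions the single-threaded step grows from a fifth of the run at 4 cores to half of it at 16, which is why the savings should be larger on machines with more cores.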

Moving forward (ping @jteresco):

--errorcheck mode, or /fast/tm/datacheck if you prefer

Aside from reporting 15ish new HIDDEN_JUNCTION errors, the only real difference here will be that siteupdate --errorcheck will no longer produce waypointsimplification.log. I doubt anybody cares about this. :) FWIW, it's not even linked from the Developer Logs page. @michihdeu, you're probably the most observant & adept at using errorcheck -- any opinions?

--skipgraphs mode

This is probably rarely if ever used anymore due to the improvements in graph generation speed over the years. But if we do use it, it should be done right. There will be no graphdata directory, and no waypointsimplification.log. Looking into this though, it's probably no big deal. There's no link to either from the devel pages (so nothing to get broken), only to logs/ itself. We just won't see a link to waypointsimplification.log indexed in that directory, and life goes on. Any opinions or thoughts?

michihdeu commented 8 months ago

I don't use waypointsimplification.log at all. Nor do I use datacheck.log after running "errorcheck mode"; instead I copy the FP entry from the frontend after the site update. I didn't read the whole novel but rely on your expertise. Just go for it!

jteresco commented 8 months ago

I agree that no one cares about the lack of a waypointsimplification.log at this point. It was of interest when developing the initial simplification rules and code, and can be if talking about them academically (I've had this idea that this is an interesting example but have never gone ahead and used it). Definitely doesn't matter if it's not generated in datacheck mode.

--skipgraphs has had two main purposes: running extra updates more quickly when we didn't want to bother, and when I was using the graph data in classes and wanted to suspend graph generation for a bit. The former is still potentially helpful to save file space but efficiency is so much better it's not a meaningful time saver. The latter is now totally irrelevant with graph archive sets that can be loaded up in HDX. Bottom line, I'd like to keep it available but I also don't mind if it means the log is not generated.

yakra commented 8 months ago

OK, good to know.

It was of interest when developing the initial simplification rules and code

I find it useful when doing the same. Though I certainly know how to make one when I need one. ;) For general purposes & most users though, yeah, it's not very relevant.

I'll forge ahead with the changes here.