Open wkalt opened 8 months ago
Spent some time hacking on meaningful names in storage, namely paths that include the topic and producer name. It makes the API that merges messages from a list of tree roots inconvenient, since a list of prefixes must also be specified. I think we should stash that idea and maybe solve the problem with better introspection APIs in the database. Ideally users don't care about the data file layout. https://github.com/wkalt/dp3/commit/20c80fef9b236cc48312c729a0fa4b7fee29b904
bugs
- bug: versionstore is not hooked up properly. See nodestore.nextVersion or something - needs to come from versionstore.
- queries for topics/producers that are not stored produce 500 errors, including multi-topic requests when some topics are present but not others
- duplicate uploads don't duplicate. If I upload the same file twice, I don't get duplicate messages in GetMessages. On one hand this is desired, on the other hand I don't remember implementing it... https://github.com/wkalt/dp3/pull/5
- output MCAP writer is misassociating schemas when merging across different files, resulting in nil-valued schemas for one of the files https://github.com/wkalt/dp3/pull/6
- Tree merge is erroneously merging areas of the tree that did not change https://github.com/wkalt/dp3/commit/3e16a2eee064169e5f73b68099e54c422f0108b0
- Statistics mishandles NaN and probably infinite valued floats. Probably the thing to do is skip these for sum/min/max/mean accounting, and maintain nan/inf counts as a separate statistic. https://github.com/wkalt/dp3/commit/1883231721d1c6bc825b55c1396ae23b1b1c7d6d
- Tree iterator is building a full list of leaves up front, which is extremely slow on huge resultsets. Needs to be incremental. https://github.com/wkalt/dp3/commit/60685f5d634591e888912a9629fa2db54d762aac
- nondeterministic import failure: https://github.com/wkalt/dp3/issues/11 https://github.com/wkalt/dp3/pull/18
- service does not currently crash on a port conflict https://github.com/wkalt/dp3/commit/c0440dbc738dd2e41136e7df1ed309d403c270d3
- semicolon termination should be in the grammar not enforced in client - to enable batched queries. https://github.com/wkalt/dp3/commit/cfc95d8f9063dca732681adebdee8d92aa762432
- local disk storage implementation should write to a tmpfile + rename https://github.com/wkalt/dp3/pull/14
- the executor defers initialization of the output writer until a message is successfully pulled, to allow schema conflicts to be surfaced as 400 errors. It needs to write an empty file if the resultset is simply empty. https://github.com/wkalt/dp3/commit/82dbf9c6b713dd3254445329089f60ead23d4651

tree methods
- delete
- return message diff between versions

testing
- tests for concurrent inserts into tree
- duplicate data must be deduplicated (on timestamp and message byte)
- testing for storage with minio

performance analysis
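
The deduplication item in the testing list above keys duplicates on timestamp and message bytes. A minimal sketch of that rule in Go, which a test could use to compute the expected result of uploading the same file twice. The `Message` type and its `LogTime` and `Data` fields are hypothetical stand-ins, not dp3's actual types.

```go
package main

import (
	"bytes"
	"sort"
)

// Message is a hypothetical stand-in for a stored message.
type Message struct {
	LogTime uint64 // timestamp, e.g. nanoseconds
	Data    []byte // serialized message bytes
}

// Dedup returns the messages ordered by (timestamp, bytes) with exact
// duplicates removed. Two messages are duplicates only if both the
// timestamp and the bytes match. The input slice is not modified.
func Dedup(msgs []Message) []Message {
	sorted := make([]Message, len(msgs))
	copy(sorted, msgs)
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].LogTime != sorted[j].LogTime {
			return sorted[i].LogTime < sorted[j].LogTime
		}
		return bytes.Compare(sorted[i].Data, sorted[j].Data) < 0
	})

	// After sorting, duplicates are adjacent, so compare with the last kept item.
	out := sorted[:0]
	for i, m := range sorted {
		if i > 0 {
			last := out[len(out)-1]
			if m.LogTime == last.LogTime && bytes.Equal(m.Data, last.Data) {
				continue
			}
		}
		out = append(out, m)
	}
	return out
}
```

Messages that share a timestamp but differ in content are both kept, which matters for topics where several messages can legitimately land on the same timestamp.
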
design questions
Multiple schemas may be used for a single topic name, particularly over long periods of time as schemas evolve. Is it OK to have multiple schemas in one tree, or do we really need to make trees unique by schema? Nothing in playback breaks due to multiple schemas, but search/statistics features could be complicated (they will be complicated whether there are multiple trees or one). https://github.com/wkalt/dp3/commit/70905e589eed69da95ca408d85bdaa465420854f

features needed
- Currently we flush WAL synchronously with inserts. WAL flushing needs to get moved to a background thread that intelligently flushes after periods of inactivity on a topic or when size limits get reached. https://github.com/wkalt/dp3/pull/5
- can we ditch the nodestore staging map if inserts flush to WAL? https://github.com/wkalt/dp3/pull/5
- data files should be segregated by tree in storage, with a meaningful name https://github.com/wkalt/dp3/commit/20c80fef9b236cc48312c729a0fa4b7fee29b904
- statrange command should not require start/end https://github.com/wkalt/dp3/pull/7
- export command should not require topics. When no topics are supplied it should return all topics. https://github.com/wkalt/dp3/commit/be61a9b8d672763f817804dc2d7ceea72ffddf02
- Switch to 64-bit offsets and lengths in IDs. This costs 8 bytes per ID but will insulate us against gigantic messages. https://github.com/wkalt/dp3/commit/d6ad7189338ea0fe8fba5874be9a538ec49c39a2
- WAL doesn't garbage collect yet https://github.com/wkalt/dp3/pull/8
- multiple database support. It should be possible to have sim and real-world data segregated on one instance. https://github.com/wkalt/dp3/commit/54707eff915992e62110b2e379b8964163ccf856
- Die immediately on second sigint https://github.com/wkalt/dp3/commit/3e35173537d99e6bee9ebe51ced8c592fd1a49a3

catalog introspection
from within the client,
- what producers do I have?
- what topics exist for a producer?
- what are the message-level stats for each table's root nodes?
- what previous versions of a table do I have, dated and numbered?
- what schema(s) are associated with a topic?
- what fields on a topic can be queried?
- eventually - what databases do I have?

community
- present @ foxglove community meetup
- present @ foxglove community meetup 1 mo followup

performance evaluation
client
- fun CLI features. Like psql "session" interface, plotting of statistical ranges, displaying images? playing video? https://github.com/wkalt/dp3/commit/f688894ea029fc398442a1eed71fa75d185593b3

clustering
- versionstore, wal, rootmap are currently sqlite-based. Both versionstore and rootmap need to move out of sqlite because multiple nodes need to hit them. WAL can stay sqlite for now. Let's go with postgres for now.
- storage needs an S3-compatible implementation. Use minio libraries.
- inserts need to shard across replicas based on producer + topic. What manages the shards? Probably goes in postgres.
- on the read side, it would be best if we could merge reads with WAL. The "problem" is this would require distributed WAL storage IF we also want any node to be able to serve reads. We can solve this with distributed WAL storage but that's more complicated and slower.

monitoring
- pprof debugging endpoint

deployment
retention policies
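
On the sharding question under clustering (inserts sharded across replicas by producer + topic): one option is a stateless hash of the pair. The sketch below is an assumption, not a decision. `ShardFor` is a hypothetical helper, and FNV-1a is used only because it is in the Go standard library.

```go
package main

import "hash/fnv"

// ShardFor maps a producer/topic pair to a shard index in [0, n).
// n must be greater than zero.
func ShardFor(producer, topic string, n uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(producer))
	// Separator byte so that ("ab", "c") and ("a", "bc") feed
	// different byte sequences to the hash.
	h.Write([]byte{0})
	h.Write([]byte(topic))
	return h.Sum32() % n
}
```

Plain modulo hashing remaps most keys whenever the shard count changes, which is one argument for keeping an explicit shard assignment in postgres as floated above.
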
search & query language
- statistics: field-level
- SQL or not SQL?
  - SQL: better 3rd party compatibility, maybe chatgpt can answer queries for us
  - Not SQL: SQL is crappy for expressing complex as-of joins, which are a common kind of query. Maybe we can do a lot better. Ideally end users would be able to express queries themselves. Queries might be something like "show me all times in last 6 months when it was raining and we were taking an unprotected left and there were dogs in the intersection". That is hard to write in SQL if you aren't a SQL expert. We don't want customers to need to hire teams of SQL experts to translate. Also chatgpt is far from writing good english to SQL for arbitrary business contexts - not clear it will ever work.
  - Expanded in https://github.com/wkalt/dp3/issues/9.
- statistics acceleration for scans https://github.com/wkalt/dp3/pull/25

maintenance
weirdnesses
- versions are assigned unnecessarily while staging writes to WAL. Each write to WAL gets a version, then we merge them and create one big commit with a final version. I think the version assignment can just be deferred until the big commit. https://github.com/wkalt/dp3/pull/5
- tree insert over existing data currently clones all nodes down to the leaf. Pretty sure it only needs to clone the root for tree dimensions, and then all the other copying happens at time of merge from WAL. No indication so far that this is a bottleneck but it probably will be if it isn't yet. https://github.com/wkalt/dp3/pull/5
- cgo sqlite stuff is hard to inspect with pprof. Need a solution or perhaps switch to golang embedded db. https://github.com/wkalt/dp3/pull/5

beta release blockers