Format discussion - Githubissues

Marlamin commented 6 years ago

Discuss! Current proposal can be found in the README.

Marlamin commented 6 years ago

Something that still needs working out is the way the version for each structure will be specified.

I'm against storing definitions for the same file in different directories (e.g. 7.3.5.25807/Map.dbd), which would cause a lot of duplicated data and make it much harder to maintain column names across versions. A solution to that would be to only have a single file per DBC with multiple definitions inside.

@mmosimca suggested something like this for a single-file format in the past: LAYOUT 01234567, 89ABCDEF; which would go above the first column definition for that build.

This works by referring to the layout_hash in the DBC file (WDB5 and up), which would save us from redefining structures when there have only been minor or not noteworthy changes (hence the support for multiple layout_hash values).

Pre-WDB5 files would just have a BUILD 1234; header instead, where 1234 would be the first build that structure was seen in. If you are loading DBCs for build 1245 and the structure would still be the same as 1234, the 1234 definition would be used. If the structure in 1245 has changed a new definition for that build should be added, to which future builds with the same structure can fall back on.
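
A minimal sketch of that fallback rule, assuming each definition is keyed by the first build it was seen in. The function name and the sample data are made up for illustration and are not part of the proposal.

```python
from bisect import bisect_right

def pick_definition(definitions, build):
    """Return the definition whose first-seen build is the largest one <= build.

    `definitions` maps a first-seen build number to a definition (any object).
    Returns None when `build` predates every known definition.
    """
    first_seen = sorted(definitions)
    index = bisect_right(first_seen, build)
    if index == 0:
        return None
    return definitions[first_seen[index - 1]]

# Hypothetical data: structure first seen in 1234, changed again in 1250.
defs = {1234: "layout A", 1250: "layout B"}
print(pick_definition(defs, 1245))  # falls back to the 1234 definition
```

This only works while build numbers grow monotonically within one branch, which is exactly the limitation discussed next.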

Note that the above implementation would have an issue during periods where two branches of the game (such as 8.0 Beta and 7.3.x PTR) have similar build numbers, causing the wrong definitions to be loaded for that build. This could be solved by something that @bloerwald suggested (correct me if wrong) at one point, which is to have a comma-separated list of builds (like LAYOUTs) for which the structure was valid, but this list would get very long. This solution would only have issues in cases where two separate game branches have the same build number, which has happened in the past. In that case, we could always assume the build number is for the lowest level client (Beta > PTR > Live).

Thoughts?

mdX7 commented 6 years ago

BUILD 1000;
<uint32 a>
string[4] b
string c
string d

Let's say we have this as a base and imagine string c gets dropped in build 1001. That could result in the following, where we still have redundant data, but it's human readable. Basically the same as splitting into directories, just in one file.

BUILD 1000;
<uint32 a>
string[4] b
string c
string d

BUILD 1001;
<uint32 a>
string[4] b
# string c got removed
string d

Another approach would be to have builds/layouthashes defined per column, but that will get way too confusing for human reading. Imagine:

<uint32 a> {BUILD 1000, 1001}
string[4] b {BUILD 1000, 1001}
string[6] b {BUILD 1002} # array resize
int32 e {BUILD 1002} # added
string c {BUILD 1000}
string d {BUILD 1000}

For that PTR/Live issue, we have mostly had layouthashes since this stuff happened, afaik, so using layouthashes for it should be fine. A probably better approach would be to list ranges and comma-separate builds/hashes, e.g.: BUILD 1000-1100, 1400, 1404-1500. That's going to be a bit funny to parse though.
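
Parsing such a line is not too bad in practice. The following is a rough sketch, assuming plain integer build numbers, an optional trailing semicolon, and no hashes mixed into the list; `parse_build_list` and `build_matches` are hypothetical helper names.

```python
def parse_build_list(text):
    """Parse 'BUILD 1000-1100, 1400, 1404-1500' into inclusive (low, high) pairs."""
    keyword, _, rest = text.strip().rstrip(";").partition(" ")
    if keyword != "BUILD":
        raise ValueError(f"expected a BUILD line, got {text!r}")
    ranges = []
    for item in rest.split(","):
        low, dash, high = item.strip().partition("-")
        # A single build is stored as a range of length one.
        ranges.append((int(low), int(high) if dash else int(low)))
    return ranges

def build_matches(ranges, build):
    """True when `build` falls inside any of the inclusive ranges."""
    return any(low <= build <= high for low, high in ranges)

ranges = parse_build_list("BUILD 1000-1100, 1400, 1404-1500")
print(ranges)
print(build_matches(ranges, 1050), build_matches(ranges, 1401))
```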

Marlamin commented 6 years ago

I think the middle one with duplicated data and no build/layout hash for every line is the best. Using # for comments sounds good to me.

mdX7 commented 6 years ago

By the way - not sure if this happens anywhere, but just in case. When we have an array of foreign keys I'd suggest to format it this way: uint16<AreaTable>[5] AreaTableID so that we have this syntax in general for each field: DataType<referenced Table>[ArraySize] FieldName
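
As a sketch of how a parser could read that general field syntax, assuming names and types are plain word characters. The regular expression and the `parse_field` helper are illustrative only, and the `<uint32 a>` key syntax from earlier in the thread is deliberately not handled here.

```python
import re

FIELD_RE = re.compile(
    r"^(?P<type>\w+)"             # data type, e.g. uint16
    r"(?:<(?P<table>\w+)>)?"      # optional referenced table
    r"(?:\[(?P<size>\d+)\])?"     # optional array size
    r"\s+(?P<name>\w+)$"          # field name
)

def parse_field(line):
    """Split 'DataType<Table>[ArraySize] FieldName' into its parts."""
    match = FIELD_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a field line: {line!r}")
    size = match.group("size")
    return {
        "type": match.group("type"),
        "table": match.group("table"),     # None when there is no foreign key
        "size": int(size) if size else 1,  # non-arrays are treated as size 1
        "name": match.group("name"),
    }

print(parse_field("uint16<AreaTable>[5] AreaTableID"))
print(parse_field("string[4] b"))
```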

Marlamin commented 6 years ago

Sounds good!

Here's a bit of conversation regarding layout hashes and some adaptations to the proposal to deal with array resizes from #modcraft on QuakeNet.


17:32 <Simca>  everything except array resize
17:32 <Alram>  How often does that happen
17:33 <Simca>  there have 2 instances in the history of layouthashes that i'm aware of
17:33 <Simca>  have been*
17:33 <Alram>  How does one account for that
17:33 <Alram>  Go by build?
17:34 <Alram>  Or just yolo and fuck that version
17:35 <Simca>  i proposed a solution: LAYOUT hash1-recordsize1
17:36 <Simca>  that would have solved both instances
17:36 <Alram>  But is it worth the added complexity
17:36 <Simca>  in THEORY, it's not enough.
17:36 <Simca>  because if they resized two arrays of the same size at the same time
17:37 <Simca>  or if the resized array was so small that padding compensated for the change that would have occurred in record size
17:37 <Simca>  then even that solution would fail
17:37 <Simca>  but in practice record size would be far more than ever needed. the only case where a resize occurs (both cases so far) have been on Flags fields
17:39 <@schlumpf>  Can we kind of do the inverse and go from theoretical sql format to dbc format? Like, automate the column shrinking?
17:41 <Simca>  not really. we could imitate it poorly, but we'd be left with gaps and problems - we just don't have enough information, especially about newer fields, after they did all the sorting
17:42 <Simca>  the advantage of the format you purposed that has layout and build always side by side in the header is that it would solve array resizing
17:44 <Simca>  anyway, two different configs could then share a layouthash but they would never share builds at the same time
17:44 <Simca>  that assures we'd account for all cases
17:44 <Simca>  if a person requests a layouthash without providing a build, we just give them the version of the hash for the latest build
17:47 <Alram>  Hopefully this isn't too much complexity
17:47 <Alram>  Otherwise nobody will implement it and stick with their own definitions
17:47 <Simca>  no, it's actually perfectly simple. layouts, builds, config
17:48 <Alram>  Can help write converters/implementations in worst case
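
The lookup described at the end of that log could look roughly like the sketch below. It assumes every config carries a set of layout hashes and a set of builds; the dict layout, the names, and the build numbers are all made up for illustration.

```python
def find_config(configs, layout_hash, build=None):
    """Pick a config by layout hash, using the build to disambiguate shared hashes.

    Without a build, the config covering the latest build wins, as suggested
    in the log above.
    """
    candidates = [c for c in configs if layout_hash in c["layouts"]]
    if not candidates:
        return None
    if build is not None:
        for config in candidates:
            if build in config["builds"]:
                return config
        return None
    return max(candidates, key=lambda c: max(c["builds"]))

# Hypothetical array resize: same hash, different structure, disjoint builds.
configs = [
    {"name": "before resize", "layouts": {"01234567"}, "builds": {25000, 25100}},
    {"name": "after resize", "layouts": {"01234567", "89ABCDEF"}, "builds": {25200}},
]
print(find_config(configs, "01234567", 25100)["name"])
print(find_config(configs, "01234567")["name"])
```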

Marlamin commented 6 years ago

Other points (besides those mentioned in the above comment) to discuss/agree upon:

Unknown column naming (Unk0, Unk1 etc or just empty and let parsers handle rest)
Additional specialized types other than locstring, do we handle these? (Vector3 being float[3], flags)
<> for inline types is kinda weird. Do parsers need to know about them? Can we comment them?
Do we have comments? Do they get put on new lines or not? Are they parsed (wiki?) or human only?
Do we mention map name in file? @miceiken argues filenames are metadata and not contents
(via @bloerwald) How do we make sure column names are the same across definitions and won't diverge?

Marlamin commented 6 years ago

Updated the format in README after some feedback from IRC. The addition of the locstring type is the biggest change. Most parsers already handle this in some way (some call it loc, others langstringref etc) so it shouldn't be a big bother to implement.

Other stuff in above comment is still up for discussion.

mdX7 commented 6 years ago

Just talked to @Marlamin on Discord; we noticed that we definitely need ranges for builds. Imagine:

BUILD 21846, 21863, 21874, 21911, 21916, 21935, 21952, 21953, 21963, 21973, 21989, 21992, 21996, 22017, 22018, 22019, 22045, 22053, 22077, 22083, 22101, 22124, 22133, 22143, 22158, 22171, 22201, 22210, 22231, 22248, 22260, 22280, 22289, 22293, 22306, 22324

Idea is to split ranges/single builds by line, e.g.

BUILD a-b
BUILD c-d
BUILD e
BUILD f-g
BUILD h-i
BUILD j-k
BUILD l
BUILD m-n
BUILD o-p
<attributes here>

Another approach is comma separated, but that is a bit harder to read as a human:

BUILD a-b, c-d, e, f-g, h-i, j-k, l, m-n, o-p
<attributes here>

justMaku commented 6 years ago

Some feedback from my side:

Instead of <uint32 id> for keys, I'd suggest doing key<uint32> id
Instead of uint32<ForeignDbName> fieldName do foreign<uint32, ForeignDbName, ForeignFieldName> fieldName

bloerwald commented 6 years ago

Not sure against directory per build/version/layout, since those actually are able to do symlinks nicely. Also, diff v1/map v2/map. Yet, I agree it isn't best UX, but there are reasons for it. I want to throw in a third option: abuse git. One commit per build. Branches per major. git rebase -i && git push -f if you actually do a change. This would force you to resolve conflicts, meaning that if you add a name, you have to ensure it is matching all versions after that. It would not force you to verify to the earliest version, but at least for the past. → One file, blocks for now.

Layout hash imho is a bad idea. While it seems quite fine, it isn't a hash of the actual file content as shown with it changing even though file does not, and inverse the contents changing (array length) while the hash does not. I don't like it, but see the appeal. We can use it to deduplicate structs during annotation though. → I wouldn't rely on it as first hand id.

Build IDs are not enough, we need versions due to branches. I would not do ranges implicitly but only verified versions. There are cases where we know that it has literally never changed structure ever since Vanilla. Wiki currently says ranges, but explicitly mentions verified ones (e.g. 0.5.3.3368-1.12.1.5875-3.0.2.8905-3.3.5.12340-6.0.1.18179 for https://wowdev.wiki/DB/CinematicSequences) → I want a distinction between "assumed matching" and "verified matching" versions. Versions are major.minor.patch.build, as Blizzard does.

What's missing so far in the proposed formats is descriptions, comments. These are the first that get out of sync between versions and the biggest reason for a uint32_t m_ID; <build 12340, 20505>. It contrasts with reorders, and the late-time bitpacking shit changing integer width. I do not have a solution for this. Maybe a hybrid approach where possible columns are defined on top and then referenced? Here are some examples of a possible implementation:

CinematicSequences.dbd


COLUMNS
uint$key                               m_ID
uint$foreign_key$SoundEntries$m_ID     m_soundID
uint$foreign_key$CinematicCamera$m_ID  m_camera   While there is an array, only one is ever used in live data. If multiple are given, they are played in sequence, one after the other.

BUILD 0.5.3.3368,1.12.1.5875,3.0.2.8905,3.3.5.12340,6.0.1.18179
m_ID<32>
m_soundID<32>
m_camera<32>[8]

BUILD 7.0.1.probably
m_ID<32>
m_soundID<16>
m_camera<8>[8]


Cfg_Categories.dbd

COLUMNS
uint$key               m_ID
uint                   region
locstring              m_name_lang
bitfield$LocaleMask    m_localeMask
bitfield$CharsetMask   m_create_charsetMask
bitfield$CharsetMask   m_existing_charsetMask
bitfield$F             m_flags

ENUM
1 needs tournaments enabled on account

ENUM binary
00000 development
00001 eu_or_us
00100 russia
01010 korea
10001 taiwan_or_china   note how this is bogus since it overlaps with eu_or_us, these are obviously something else

BUILD 1.12.1.5875
m_ID<32>
region<32>
m_name_lang

BUILD 6.0.1.18179
m_ID<32>
m_localeMask<32>
m_create_charsetMask<32>
m_existing_charsetMask<32>
m_flags<32>
m_name_lang


LocaleMask.enum

ENUM binary
000000000001 enUS also enGB
000000000010 koKR
000000000100 frFR
000000001000 deDE
000000010000 enCN also zhCN
000000100000 enTW also zhTW
000001000000 esES
000010000000 esMX
000100000000 ruRU
010000000000 ptPT also ptBR
100000000000 itIT
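
To illustrate what a typed bitfield buys a consumer, here is a small sketch that decodes a mask using a subset of the LocaleMask values listed above. The dict and function names are invented for this example, and treating every entry as an independent bit is an assumption that would not hold for the overlapping CharsetMask values.

```python
# Subset of the LocaleMask values from the listing; the names here are made up.
LOCALE_MASK = {
    0b000000000001: "enUS",
    0b000000000010: "koKR",
    0b000000000100: "frFR",
    0b000000001000: "deDE",
    0b000100000000: "ruRU",
}

def decode_bitfield(value, names):
    """Return (names of the set bits, leftover bits with no known name)."""
    known = [name for bit, name in sorted(names.items()) if value & bit]
    leftover = value
    for bit in names:
        leftover &= ~bit
    return known, leftover

print(decode_bitfield(0b000000000101, LOCALE_MASK))
```

A plain uint32 column would force every consumer to rebuild this table by hand.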



**→ I really want a hybrid approach to minimise duplication.**

Also in the suggested code above, more types. I don't think we need `C3Vector` as intrinsic type, but I wouldn't be against `C3Vector.type` describing it to be reused. **→ Bitfields with typed values for example. Also, Foreign keys need to reference what they reference in the other table.**

While I prefer unknown columns to just not be named, I guess we need to give them bogus names to track them over versions. **→ I suggest a marker for verified names, or inverse.**

I have no opinion on how to mark out-of-line columns. Is that maybe just as much of a layout thing as in-row strings?

I don't understand what "mentioning map name" refers to.

Marlamin commented 6 years ago

I don't understand what "mentioning map name" refers to.

Typo, my bad. Should be "mentioning dbc name".

MMOSimca commented 6 years ago

@justMaku I would prefer to leave the type ('uint32') at the front of each line if possible. For parsers, that bit is by far the most important thing, and moving it around (behind key< or foreign<) makes that more complex to grab.

@bloerwald Let's not waste space and everybody's time by prefixing every variable with 'm_' in the actual format. If you want to do that in the script for the wiki, that's fine. It is your wiki. But in the actual dbd files, let's not waste space to adhere to a C standard that even people who use C generally think is stupid. This is not C.

As for the rest of it, I'd personally prefer to leave enum definitions and bitfield definitions out of it. It's a valid point to make and a valid argument to have though. To me this is more like 'what is the structure of rows', whereas enums, bitfield definitions, and to a lesser extent foreign key references veer off into.

The problem with relying exclusively on BuildIDs is that later versions do not have that information. If you give me a db2 and a config that only lists BuildIDs, I have nothing. That means all parsers from now on will require user input to read files (build ID). Currently, they don't. This is a pretty massive issue. At the very least, for this format to be -useful- for parsers, it must include a list of layouthashes with each entry.

I would rather not introduce more types. Bitfield and enum may be acceptable, depending on how they're handled, but definitely not things like C3Vector. That could be a comment or description, but adding in additional types as a mandate to the format complicates things further.

We want the format to be as simple as possible. It's important that it convey a lot of information, but I would rather it conveyed less if adding more wildly increased complexity.

justMaku commented 6 years ago

I would prefer to leave the type ('uint32') at the front of each line if possible. For parsers, that bit is by far the most important thing, and moving it around (behind key< or foreign<) makes that more complex to grab.

@MMOSimca isn't key or foreign_key a type though and should be implemented as such by parser? Having it before the underlying type allows the parser to know about it earlier.

Marlamin commented 6 years ago

First and foremost we should be careful to not make the format too complex. People have to implement this (including relatively dumb people like me) for it to actually get traction and solve the issues we're trying to solve (different docs spread everywhere across the internet).

On that note, I haven't been able to reach @barncastle yet who will be pretty vital in getting (public) adoption going. I have also not received any feedback yet from @tomrus88 and @Warpten who also have DBx implementations that deal with definitions in some way or form.

@MMOSimca isn't key or foreign_key a type though and should be implemented as such by parser? Having it before the underlying type allows the parser to know about it earlier.

I don't think there's a difference between something being a foreign key and it not being a foreign key before WDC1. There's a difference now though but I'm unsure how that affects parsing.

I wouldn't rely on it as first hand id.

I think I'm with @MMOSimca on this, parsers generally don't know about builds when looking at files so layouthash should always be listed if available. I am for listing more specific builds as well, though. This should solve any branch issues we're having. We still need a way of Schlumpf also mentioned (in private) that it might be an interesting idea to list tablehash as well.

MMOSimca commented 6 years ago

If we wish to list Tablehash in the file, it should just be the first line of the file.

TABLEHASH XXXXXXXX

For cases where the file existed and was removed before db2s were introduced (or before the file was converted to a db2), we would call it TABLEHASH 00000000.

Marlamin commented 6 years ago

Do we know how TABLEHASH is generated? Could just generate correct hashes for older files.

MMOSimca commented 6 years ago

Unfortunately, no. Given the fact that we're dealing with Blizzard here, it could be literally anything. Somebody could have taken the MD5 of a jpeg of a cat and added 12345678 to it, and every db2 was just based on a different cat picture.

The popular theory I'd heard was that it was probably just the name of the table, hashed. The problem then happened when 'item-sparse' was renamed to 'ItemSparse' and its TableHash did not change.

justMaku commented 6 years ago

Unfortunately, no. Given the fact that we're dealing with Blizzard here, it could be literally anything. Somebody could have taken the MD5 of a jpeg of a cat and added 12345678 to it, and every db2 was just based on a different cat picture.

Seems like a solid plan, all in favour, vote with your favourite cat gif.

bloerwald commented 6 years ago

While keys don’t have to be in a type system, I advocate for a type system as strong as possible, giving as much information as possible. Having the information about which columns are keys and foreign keys helps with every single version of the file format.

bloerwald commented 6 years ago

I would personally avoid a magic value for unknown table hashes but just not have it (optional<uint32_t>)

bloerwald commented 6 years ago

I wouldn’t mention dbc Name in file. Then again, mentioning table hash is just the same. It is probably fine having either.

bloerwald commented 6 years ago

The m_ prefix and _lang suffix comes from blizzard, not c or my drunk brain. I generally prefer having stuff as close as possible to their stuff, so I decided to keep it on wiki.

bloerwald commented 6 years ago

Build/version, filename, tablehash and layouthash, column count, all have one common property: they can merely act as a heuristic for the parsers. A viewer can try to infer a definition to use based on them, but in the end none of those are guaranteed to result in the right definition unless the user picks it manually. I’m all in for having all of the information available, so add tablehash and layouthash if available. I just wanted to state that I don’t think that one of them alone should be the primary identifier of a definition.

bloerwald commented 6 years ago

I agree on not adding complex types like c3vector, yet bitfields and enums carry a lot of information beyond the bit count that plain integers would have. Feel free to use them as an alias for integers in your implementation, but I strongly suggest not throwing away that information, seeing that we have it.

justMaku commented 6 years ago

I want to throw in a third option: abuse git. One commit per build. Branches per major. git rebase -i && git push -f if you actually do a change. This would force you to resolve conflicts, meaning that if you add a name, you have to ensure it is matching all versions after that. It would not force you to verify to the earliest version, but at least for the past.

This would make collaboration super complicated in case we want to go back: for example adding names to previously unknown columns. Rewriting history is never a good idea in a distributed system like git.

bloerwald commented 6 years ago

Rewriting history is never a good idea in a distributed system like git.

I agree and wanted to throw it in just for sake of completeness in discussion.

MMOSimca commented 6 years ago

I do agree with this. I really like LayoutHash, and I wish we could use it as the primary definition, but we can't. For starters, DBCache.bin is just as important to support, and it only lists build number in its header. Beyond that, there is the layout hash non-recalculation issue with pure array resizes that was mentioned earlier.

I just think that everywhere we mention BuildID, we should include LayoutHash if at all possible as it is relatively critical to modern DB2 parsing.

On Tue, Jan 9, 2018 at 7:11 AM, bloerwald notifications@github.com wrote:

Build/version, filename, tablehash and layouthash, column count, all have one common property: they can merely act as a heuristic for the parsers. A viewer can try to infer a definition to use based on them, but in the end none of those are guaranteed to result in the right definition unless the user picks it manually. I’m all in for having all of the information available, so add tablehash and layouthash if available. I just wanted to state that I don’t think that one of them alone should be the primary identifier of a definition.


bloerwald commented 6 years ago

Because it popped up in private discussion: enum thing needs versioning too, which my current approach doesn't cover.

Warpten commented 6 years ago

Private discussions. Yuck

Marlamin commented 6 years ago

From IRC:

We're seeing about adopting @bloerwald's above mentioned format (with some changes), with more examples soon.
We're doing build ranges, already shipped to sample definition. Build range should be per x.x.x patch (new BUILD line for new range) and should never go into the future or span too many builds in the present.
Verified builds will be added through a VERIFIED_BUILD (or something similar). Same structure as BUILD. Verified definitions are verified to be correct for that build, while the BUILD header is a more generic indication of support (likely correct, but not guaranteed).
Additional people that work on DBC related things have been poked. Adoption will be key and as such we are still looking for more feedback.
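
A rough sketch of how a consumer might tell "verified" from "assumed" matches, assuming versions are written major.minor.patch.build and compared as integer tuples. The data layout, the helper names, and the build range are hypothetical and only loosely follow the BUILD / VERIFIED_BUILD idea above.

```python
def parse_version(text):
    """Turn 'major.minor.patch.build' into a tuple of four ints for comparison."""
    parts = tuple(int(part) for part in text.split("."))
    if len(parts) != 4:
        raise ValueError(f"expected major.minor.patch.build, got {text!r}")
    return parts

def match_status(definition, version):
    """Return 'verified', 'assumed', or None for one definition."""
    wanted = parse_version(version)
    if wanted in definition["verified"]:
        return "verified"
    for low, high in definition["ranges"]:
        if low <= wanted <= high:
            return "assumed"
    return None

# Hypothetical definition: one verified build and one made-up build range.
definition = {
    "verified": {parse_version("3.3.5.12340")},
    "ranges": [(parse_version("3.3.5.12213"), parse_version("3.3.5.12340"))],
}
print(match_status(definition, "3.3.5.12340"))
print(match_status(definition, "3.3.5.12300"))
```

Comparing full version tuples instead of bare build numbers also sidesteps the branch overlap problem raised earlier.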

Marlamin commented 6 years ago

There are now some WIP sample definitions by @bloerwald in a separate folder. It's looking like a great start but we might have to make the column stuff at the beginning a bit more readable so we can maintain it a bit easier in the future. The changes proposed by his format cover deduplication, comments and make sure that column names are enforced to be the same. Feedback on those samples is welcome!

justMaku commented 6 years ago

ENUM<WeaponFlags>
0 untouched
4 sheathe after animation
16 sheate after aniation
32 pull

This doesn't look really that human readable to me, in my opinion anyone reading this manually would probably be more accustomed to the enum style found in many programming languages:

enum WeaponFlags {
   untouched = 0
   sheathe after animation = 4
   pull = 32
}

justMaku commented 6 years ago

uint$foreign_key$CinematicCamera$m_ID

Same issue here, the format is very hard to read as a human. Most important information (to me as a reader) is at the very end of the line. I believe it would look much better as:

id key<uint32> // Name, Type, Backing Type
name string // Name, Type
map foreign<Map, uint32> // Name, Type, Location, Backing Type

This way the bits of information are ordered in the magnitude of relevance from left to right. This matters for us, mere human beings, and computers don't care.

bloerwald commented 6 years ago

enums

I like how you missed 16. The remainder is just format though. The two are equivalent. I still claim that parsers are irrelevant as long as they are the same complexity. Both "enum" Name\n(value name\n)+ and "enum" Name "{" (name "=" value ","?)+ "};" are the same complexity and content and thus equivalent. Debating this is not relevant as of now. What is relevant is whether we want it at all, and if/how to do versions.
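
The equivalence claim can be sketched with two tiny parsers that produce the same result from either notation. Both functions are illustrative only and assume well-formed input with one value per line.

```python
LINE_STYLE = """ENUM<WeaponFlags>
0 untouched
4 sheathe after animation
32 pull"""

BRACE_STYLE = """enum WeaponFlags {
   untouched = 0
   sheathe after animation = 4
   pull = 32
}"""

def parse_line_style(text):
    """Parse the 'ENUM<Name>' header followed by 'value label' lines."""
    lines = [line.strip() for line in text.strip().splitlines()]
    name = lines[0][len("ENUM<"):-1]
    values = {}
    for line in lines[1:]:
        value, _, label = line.partition(" ")
        values[int(value)] = label
    return name, values

def parse_brace_style(text):
    """Parse the 'enum Name { label = value ... }' notation."""
    lines = [line.strip() for line in text.strip().splitlines()]
    name = lines[0].split()[1]
    values = {}
    for line in lines[1:-1]:  # skip the header and the closing brace
        label, _, value = line.rpartition("=")
        values[int(value)] = label.strip()
    return name, values

print(parse_line_style(LINE_STYLE) == parse_brace_style(BRACE_STYLE))
```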

justMaku commented 6 years ago

I like how you missed 16.

Couldn't be arsed to copy-paste typos.

The remainder is just format though.

Yeah, this is format feedback, as asked by @Marlamin

I still claim that parsers are irrelevant as long as they are the same complexity. Both "enum" Name\n(value name\n)+ and "enum" Name "{" (name "=" value ","?)+ "};" are the same complexity and content and thus equivalent.

I wholeheartedly agree and that's why I believe we should always be thinking about making it as human readable as possible in the first place (unless we decide that human-readability is not a major feature anymore).

bloerwald commented 6 years ago

not a typo
While I agree we should end up with something as human readable as possible, it doesn’t make a difference during discussion whether I write uint$key, key<uint>, or key uint: ’tis an unsigned integer being a key. They are the same complexity and information. That’s what should matter.

Just as we don’t give a fuck if it is version or build. It just doesn’t matter. The question is what information is there and which structure it has, how it is referenced.

Warpten commented 6 years ago

Completely disregard the mess WDC1 introduces; it should be up to parsing implementations to deal with it. For bitpacked fields:

a) specify bit width in a comment somehow
b) uintN, floatN (I don't think floatN is a thing yet, but fuck)
c) don't include it. It's in the file anyway.

IMHO it is important to not cram the format too much. Keeping types, names, and FKs is plenty enough.

Ref human readability, just write a parser for the human readable stuff that generates something better suited for machines/code. Make everyone happy. We write stuff we can read and process it for a machine.

pod$fk_type is disgusting. I think spaces everywhere are simpler, with a defined line format that every single line respects. Not an array? I don't care, declare a size of 1.

Regarding versioning. A solution at least for machines is order maps. Fields are indexed 0-n and after they are declared, we have something like

(build0, build1, ...) = { 0, 2, 5, 1, 3, ...} // shared structure
(build2) = { you get it }
(build3) = (build0) // reference other structure

Sure there is duplication when we are facing a dumb case where two columns exchange place between builds but I think duplication is not a bad thing as it makes data more readable, and the overall size of a definition file is something no one gives a flying fuck about. this is IMHO not even too bad for human format. Just make indexing explicit on file, no one wants to count fields in ItemSparse.
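
A small sketch of how such order maps might be resolved, assuming fields are declared once and each group of builds lists field indexes in file order. The field names, build labels, and the `layout_for` helper are placeholders, not part of any agreed format.

```python
# Fields declared once, indexed 0..n (hypothetical names).
fields = ["Id", "Position", "MapId", "Name"]

# Each group of builds maps to the field indexes in on-disk order.
order_maps = {
    ("build0", "build1"): [0, 2, 1],   # shared structure
    ("build2",): [0, 1, 2, 3],
}

# A build can simply reference another build's structure.
aliases = {"build3": "build0"}

def layout_for(build):
    """Return the ordered field names for one build."""
    build = aliases.get(build, build)
    for builds, order in order_maps.items():
        if build in builds:
            return [fields[index] for index in order]
    raise KeyError(build)

print(layout_for("build1"))
print(layout_for("build3"))
```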

I.E.:

0 1 int32 Id;
1 3 float; // unk name float[3]
2 1 int map id; # Map Id // foreign key to Map.Id

Delimiter for shit md parser

^(N:[0-9]+) +(cardinality:[0-9]+) +(type:[a-z]+)(optionalBitSize:[0-9]+) +(humanName:[a-z0-9_-]+); +(?:\# (fk_ref:.+))?$
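
The pattern above is pseudo-notation. A runnable Python adaptation might look like the sketch below. It deviates from the original on purpose: the bit size and the name are optional, uppercase is allowed in names, and trailing text after the semicolon is only accepted as a `# reference`. The third sample line uses a made-up one-word name because neither pattern accepts a name containing a space.

```python
import re

LINE_RE = re.compile(
    r"^(?P<index>[0-9]+) +"                  # explicit field index
    r"(?P<cardinality>[0-9]+) +"             # array size, 1 for scalars
    r"(?P<type>[a-z]+)(?P<bits>[0-9]+)?"     # type with optional bit size
    r"(?: +(?P<name>[A-Za-z0-9_-]+))?;"      # optional human name
    r"(?: +# (?P<fk_ref>.+))?$"              # optional foreign key reference
)

def parse_indexed_line(line):
    """Parse one 'index cardinality type[bits] [name]; [# fk]' line."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"unparseable line: {line!r}")
    parts = match.groupdict()
    for key in ("index", "cardinality", "bits"):
        if parts[key] is not None:
            parts[key] = int(parts[key])
    return parts

for sample in ("0 1 int32 Id;", "1 3 float;", "2 1 int MapId; # Map.Id"):
    print(parse_indexed_line(sample))
```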

to fix the issue with build/tablehash, just map those as well. Tedious, but still at least that can be automated.

Warpten commented 6 years ago

Enums can be crammed in there too I'm sure, probably smth like int32!wep_flags. Looks more readable than $ to me. Humans are used to numbers around dollar sign.

Also for enums definition provide both bitset value and shift value.

Looks like a table and humans are better at reading tables than text.
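
For instance, a tool could derive the shift from each bitset value and print both side by side. This is only a sketch, assuming every flag is a single bit; the flag values are borrowed from the WeaponFlags example earlier in the thread and the function name is made up.

```python
def flag_table(flags):
    """Return (shift, value, name) rows for a dict of single-bit flags."""
    rows = []
    for name, value in sorted(flags.items(), key=lambda item: item[1]):
        if value <= 0 or value & (value - 1):
            raise ValueError(f"{name!r} is not a single bit: {value}")
        rows.append((value.bit_length() - 1, value, name))
    return rows

flags = {"sheathe after animation": 4, "pull": 32}
for shift, value, name in flag_table(flags):
    print(f"{shift:>5}  0x{value:08X}  {name}")
```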

MMOSimca commented 6 years ago

c) don't include it. It's in the file anyway.

This is the correct way to handle WDC1 bitpacked fields. Bitpacked fields are an implementation detail. What we want is table layout the way the game holds it in memory, basically, not the way it is in the files (yes, I understand the irony of this considering that the entire point of this is for a guideline on how to read the files, but I stand by my comment and can explain further if required). That's part of the reason my original format did not care about localization - the game doesn't either. (And I never cared about pre-WoD files.)

I'm also in the anti-$ crowd, for whatever that's worth. It makes me think of PHP and I have PHP-PTSD.

Marlamin commented 6 years ago

FWIW, I've already changed the $ stuff in the sample definition file for Map.

bloerwald commented 6 years ago

Format, spaces, $, …: this doesn’t change the basics and can thus be interchanged freely and only concerns the final spec.

Regarding “it is in the file anyway”: we just need to make sure it actually is possible to parse all versions of the file.

Regarding localization, while you can parse without that, you can also parse without column names. The entire thing is done to give additional information. Localized columns to me are a relevant semantic.

Warpten commented 6 years ago

Also idk if it's been settled but I'm fine with locstring being a pseudo pod type

barncastle commented 6 years ago

Just throwing in my two cents - ignore my ignorance if I'm off the mark!

I agree with the current structure however I have a couple of queries/remarks about the Columns block.

1. I appreciate that some of the comments will be helpful when developing but they're not vital especially since the whole point of this is to be a generic definition for parsers. Would it not be better to store that information separately on the wiki (which also promotes wiki contributions) and just link the article in the meta? This also avoids formatting issues/additional space usage and fights over if a comment is too long/too short/valid/relevant etc.
2. Do the column types default to the largest size? E.g. in Alpha if it was an int but in WoD it's now a byte and treated as such (say max value check), it gets set as an int but has the bitwidth next to it?

Marlamin commented 6 years ago

I mostly agree with the points made in 1, should we drop comments?

As for 2, I'm unsure. Maybe @bloerwald has kept that in mind somehow?

bloerwald commented 6 years ago

I proposed comments to be part of this description since it is also something heavily duplicated on the wiki currently. I see that they are not vital for pure parsers. Just ignore them there. For the wiki, it would be hugely useful since it would take care of the issue of having different versions of the comments per version, as it is currently.

In my format suggestion, column types do not default. A version definition specifies the bits for the column if it is a dynamic-bit-possible (i.e. int) column. One can add the (sane) default to use int32 if nothing is specified, i.e. let

BUILD 0.5.3.3368,1.12.1.5875,3.0.2.8905,3.3.5.12340,6.0.1.18179
m_ID
m_soundID
m_camera[8]

be equivalent to

BUILD 0.5.3.3368,1.12.1.5875,3.0.2.8905,3.3.5.12340,6.0.1.18179
m_ID<32>
m_soundID<32>
m_camera<32>[8]

if that's really worth simplifying.
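
That default would be cheap to implement. A minimal sketch, assuming a column reference looks like name<bits>[count] with both parts optional; the helper name is hypothetical, and the sketch ignores the column type, so it would also hand a width to non-integer columns such as locstring.

```python
import re

COLUMN_REF_RE = re.compile(
    r"^(?P<name>\w+)(?:<(?P<bits>\d+)>)?(?:\[(?P<count>\d+)\])?$"
)

def parse_column_ref(text, default_bits=32):
    """Return (name, bits, count), filling in the defaults when omitted."""
    match = COLUMN_REF_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a column reference: {text!r}")
    bits = int(match.group("bits")) if match.group("bits") else default_bits
    count = int(match.group("count")) if match.group("count") else 1
    return match.group("name"), bits, count

print(parse_column_ref("m_camera[8]"))
print(parse_column_ref("m_camera<32>[8]"))
```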

bloerwald commented 6 years ago

Tablehashes for pre-tablehash dbs (via furl(?), simca): https://repl.it/repls/PlasticLankyTrogon / https://github.com/Blizzard/heroprotocol/blob/master/mpyq/mpyq.py _hash, with some rules like stripping characters and all upper I don’t remember

Marlamin commented 6 years ago

So I think we're pretty close to locking in on something we can actually start working with.

Can we now start working on the following?

Validator/reference parsers (I'll see if I can get some awful C# going)
Generators generating dbdefs from currently used definition formats (just for starting with/working off)

Also, we need to agree on how the files look (which symbols are used for what etc). I know it keeps being dismissed as "just being format" but it'd be nice to get this down.

Warpten commented 6 years ago

To bounce back in @barncastle 's 2, I personally treat integers as either 8, 16, or 32 (or N bits more recently) bit values internally and just cast up when loading into the structure (at least for WDB5, as you could accurately guess the size of every field but the last one if there were mismatched element sizes). Doing that for virtually every integer-y field shouldn't be too difficult. But that's a debate point, since it also makes more sense to just assume everything is int, and cast down when serializing... It also kind of sounds like an implementation detail, which has the nice side effect of dumbing down the process of creating a structure for the user: "when in doubt, int"
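
The "cast up when loading" idea can be sketched with Python's struct module. It assumes unsigned little-endian fields of 8, 16, or 32 bits; arbitrary N-bit packed fields are out of scope, and the record bytes below are made up.

```python
import struct

# struct format codes for the supported unsigned little-endian widths.
FORMATS = {8: "<B", 16: "<H", 32: "<I"}

def read_uint(buffer, offset, bits):
    """Read an unsigned integer of the given width and return a plain int."""
    (value,) = struct.unpack_from(FORMATS[bits], buffer, offset)
    return value

# Hypothetical record: one uint8, one uint16, one uint32, tightly packed.
record = bytes([0x2A, 0x34, 0x12, 0x78, 0x56, 0x34, 0x12])
print(read_uint(record, 0, 8), read_uint(record, 1, 16), read_uint(record, 3, 32))
```

Whatever width is stored on disk, the caller only ever sees an int, which is the "when in doubt, int" behaviour described above.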

Marlamin commented 6 years ago

I've updated the sample code, sample definition and format proposal with the changes that I think were agreed upon. Stuff in the above few comments (default int size/comments) is still left open.

Marlamin commented 6 years ago

The first set of files based on 6.0.1 DB structures has been generated which closes the discussion for the initial format spec. We're not going to do default int sizes as of right now, but this can still be discussed as it's a pretty minor change. Comments are still in but should be kept somewhat small. Things that require more explanation should go on wiki and on wiki alone.

Currently in the process of adding proper foreign keys to things after which we'll start adding more versions. I'll also extend the validator at one point to check whether or not foreign keys go to valid DBs/columns.

Thanks for contributing up to this point, everyone! Next up: multiple definitions spanning multiple versions! Map is still the only one currently doing so.

I'll leave this open for a while in case anyone still has comments they want to make.

wowdev / WoWDBDefs

Format discussion #1