wrye-bash / wrye-bash

A swiss army knife for modding Bethesda games.
https://wrye-bash.github.io
GNU General Public License v3.0

Crashes in Oblivion with Irresponsible Creatures tweak #630

Closed. Infernio closed this issue 2 years ago.

Infernio commented 2 years ago

Report by Vrugdush (originally posted on Discord):

Crosspost from the xOBSE server: bug report for an Oblivion Bashed Patch tweak that causes a CTD.

I've found quite an odd interaction between a custom creature from a mod, a Bashed Patch tweak, and the MessageLogger OBSE plugin. Using the Tweak Actors: Irresponsible Creatures [All Creatures] Bashed Patch tweak with Unique Landscapes: Aspen Woods in the load order causes a reproducible CTD when encountering the rooster creature from that mod, but only if the MessageLogger OBSE plugin is also loaded, and only if its "bHookAllEditorID" setting is on. For (a lot) more details, see the Oblivion Community Server support channel.

I've reproduced this bug on two different systems, with OBSE 21 as well as xOBSE 22.4.1. It seems like the Bashed Patch tweak writes broken data somehow, because the crash doesn't happen if you change the rooster's responsibility manually with xEdit. It happens with both version 309.1 and the latest WIP version of Wrye Bash. It could occur with other creatures too, but so far the Aspen Woods rooster is the only one I've found.

To reproduce, start a new character with Oblivion.esm, xulAspenwood.esp, Unique Landscapes.esp, and Bashed Patch, 0.esp with the Tweak Actors: Irresponsible Creatures [All Creatures] tweak active. Make sure you have OBSE and the MessageLogger plugin, then move to an area in the Aspen Woods near where the rooster spawns, e.g. with "cow tamriel 30 0". There is only one rooster in the mod. Most of the time the game just crashes on the loading screen with no message, but a few times a Windows exception error message popped up. The crash doesn't happen every time, but almost.

Report by jonka (originally posted in two Discord messages):

So, I have an interesting bug in a rather complicated setup. After days of debugging with help from the amazing Oblivion Reloaded team, I found a CTD that seems to be caused by two WB tweaks (which might be two separate bugs). Oblivion Reloaded only makes the crash appear earlier; it also happens without OR.

The first tweak is the Bashed Patch setting "Tweak Actors: Irresponsible Creatures". When it is enabled, I get CTDs in certain locations. With a heap replacer plugin, the CTD already happens before the main menu. When the tweak is disabled, all is fine. Additional observations:

* MessageLog shows model load errors with filenames that are definitely invalid, like Creatures\Horse. containing two control characters. They vary a bit, but they also go away when the tweak is disabled.
* In at least one backtrace, the crash is inside strcpy(), called by TESDataHandler_LoadFormRecord => TESDataHandler_LoadForm => TESCreature_LoadForm (thanks to llde from the OR team for that analysis).
* All of this only starts to matter in a heavily modded game. With fewer mods, the crash doesn't appear. I could not single out any specific mod causing it; rather, it seems related to the size/number of mods. When it first appeared, I could disable any single mod and it went away.

So much for the first tweak, which I have analysed in much more detail than the second one. The second one also causes CTDs in certain locations when a heap replacer is active. It is caused by several (but not all) of the "Tweak Names" settings. Again: enable => reproducible crash, disable => all is well. Unfortunately, the crash log is empty for these, as if there had been no crash.

Oh, and the CTD also happens when removing all OBSE plugins (except OBSE itself).

I can play without these two options, so I only mildly care... but it seems to be a genuine bug somewhere, something that triggers a buffer overflow or invalid pointer in Oblivion.

What you need:

* Oblivion (obviously) with SI, i.e. Oblivion.esm (CRC 2ff840c5) and DLCShiveringIsles.esp (CRC 2bb84977)
* Unofficial Oblivion Patch.esp (CRC F09257d1) https://www.nexusmods.com/oblivion/mods/5296?tab=files&file_id=1000029461
* Unofficial Shivering Isles Patch.esp (CRC 5fccf9d3) https://www.nexusmods.com/oblivion/mods/10739?tab=files&file_id=1000029462
* Oscuro's_Oblivion_Overhaul.esp (CRC 2903fc90) https://www.nexusmods.com/oblivion/mods/46199?tab=files&file_id=1000029842
* OOOExtended.esp (CRC b93f3299) https://www.nexusmods.com/oblivion/mods/47177?tab=files&file_id=1000014973
* Oblivion Reloaded 9.1.1 (default options) https://www.tesreloaded.com/Public/Mod/Mod.aspx?IDMod=1
* A Bashed Patch with everything disabled except "Tweak Actors: Irresponsible Creatures [all creatures]"
* Optionally, CobbCrashLogger and MessageLogger

Expected behaviour: CTD before the main menu shows.

I can also supply my Bashed Patch, which I reduced in TES4Edit until a single crashing record was left: [Bashed_Patch_0.esp.zip](https://github.com/wrye-bash/wrye-bash/files/8101873/Bashed_Patch_0.esp.zip)

Steps for reproducing:

jwalt commented 2 years ago

So, further analysis revealed the following:

Using TES4Edit, I recreated the single edit that the minimized, crash-inducing Bashed Patch makes. The two versions differed in length by one byte, and the difference came from the NIFZ subrecord (the creature's list of model filenames): the TES4Edit version had an extra null byte there.

Looking at MelStrings.packSub(), its docstring says it adds a null terminator to every string in the list, but the actual code uses null1.join(), which only inserts null bytes between elements, not after the last one.
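A minimal sketch of that mismatch, assuming the strings are plain `bytes` (the names `NULL1`, `terminate_each` and `join_only` are hypothetical stand-ins, not Wrye Bash's API): `join()` treats the null as a separator, so the last string never gets a terminator.

```python
from __future__ import annotations

NULL1 = b"\x00"  # hypothetical stand-in for Wrye Bash's null1 constant

def terminate_each(strings: list[bytes]) -> bytes:
    """What the docstring promises: every string gets its own terminator."""
    return b"".join(s + NULL1 for s in strings)

def join_only(strings: list[bytes]) -> bytes:
    """What null1.join() does: nulls only between strings, none after the last."""
    return NULL1.join(strings)

models = [b"a.nif", b"b.nif"]  # hypothetical model filenames
print(terminate_each(models))  # b'a.nif\x00b.nif\x00'
print(join_only(models))       # b'a.nif\x00b.nif'
```

For a single-element list the difference is starker: `join_only([b"a.nif"])` returns the string with no null at all.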

That's exactly the one null byte difference between the bash/tes4edit variants.

Since a second null byte (from some other source) follows, this happens to work in most cases, but it probably leads to an out-of-bounds read at some unpredictable later moment. I presume the specific circumstances listed above just happen to turn that into a crash.
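To illustrate how a missing terminator can become an out-of-bounds read, here is a sketch of a reader that, like many C-style consumers, treats the list as ending at the first empty string (two consecutive nulls). That the engine parses NIFZ this way is an assumption, not something confirmed in this thread, and `read_string_list` is hypothetical.

```python
from __future__ import annotations

NULL1 = b"\x00"

def read_string_list(buf: bytes, start: int = 0) -> tuple[list[bytes], int]:
    """Read null-terminated strings from buf until an empty one (a double null).

    Returns the strings and the offset just past the final null. Like a C
    reader, it does not know where the subrecord ends, so it keeps scanning
    into whatever bytes follow.
    """
    strings: list[bytes] = []
    pos = start
    while True:
        end = buf.index(NULL1, pos)  # ValueError if no null before the buffer ends
        if end == pos:  # empty string: end of the list
            return strings, end + 1
        strings.append(buf[pos:end])
        pos = end + 1

short = b"a.nif\x00b.nif\x00"  # only one trailing null, as in the regressed output
# Lucky case: the next byte happens to be a null, so the list ends in time.
print(read_string_list(short + b"\x00"))         # ([b'a.nif', b'b.nif'], 13)
# Unlucky case: unrelated bytes follow and are read as a third "filename".
print(read_string_list(short + b"ZZ\x00\x00"))   # ([b'a.nif', b'b.nif', b'ZZ'], 16)
```

Whether the read stays in bounds depends entirely on what happens to sit after the subrecord, which fits the "works most of the time" behaviour described above.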

Infernio commented 2 years ago

That should be handled by https://github.com/wrye-bash/wrye-bash/blob/0277bf2f5293a48304384c2eaa5316849dd60c4f/Mopy/bash/brec/basic_elements.py#L529-L530

That should call Subrecord.packSub, which then calls MelString._dump_bytes, which adds the final null terminator. It's possible that broke during refactoring though (@Utumno?)

jwalt commented 2 years ago

I see, but TES4Edit places two null bytes at the end of that list. Maybe Oblivion needs that?

Infernio commented 2 years ago

That's possible. I'm not sure whether it applies to all MelStrings or only to some of them, like this one and KFFZ (the animations subrecord).

jwalt commented 2 years ago

I had a second similar crash caused by the "Tweak Names" collection (several of them). That went away with this fix, too.

Infernio commented 2 years ago

Can you try this build: https://github.com/wrye-bash/wrye-bash/suites/5372587249/artifacts/168603737 and report back if that fixes it?

jwalt commented 2 years ago

Yes, this build works for the two test cases I have, the initial CTD as described above, and a second setting in a heavily modded game. It also fixes my "Tweak Names" problem.

jwalt commented 2 years ago

Out of curiosity, I tried version 304.4. That version does not have the bug, so it is a regression.

Infernio commented 2 years ago

It's still a C-bug: the C-regression label is for things that broke between the last release (309.1) and dev.

jwalt commented 2 years ago

Commit 8004d365abd69e24df7e8d2b917642cdfc80390b might be the culprit. Before it, the code used null1.join(...) + null1, wrapped in a packSub0 call that appended another null.
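The two code paths described above can be compared byte for byte. This is a sketch under stated assumptions: `pack_sub0` is a hypothetical stand-in for the old packSub0 (appending exactly one null), and the regressed path is assumed to add exactly one final terminator via MelString._dump_bytes, as Infernio describes. It is not Wrye Bash's actual code.

```python
from __future__ import annotations

NULL1 = b"\x00"

def pack_sub0(payload: bytes) -> bytes:
    """Hypothetical stand-in for the old packSub0: appends one null."""
    return payload + NULL1

def old_nifz_payload(strings: list[bytes]) -> bytes:
    # Pre-8004d365 path as described: join, add null1, then packSub0 adds another.
    return pack_sub0(NULL1.join(strings) + NULL1)

def regressed_nifz_payload(strings: list[bytes]) -> bytes:
    # Post-refactor path as described: join, then a single final terminator
    # (assumed to come from MelString._dump_bytes).
    return NULL1.join(strings) + NULL1

models = [b"a.nif", b"b.nif"]  # hypothetical model filenames
print(old_nifz_payload(models))        # b'a.nif\x00b.nif\x00\x00'
print(regressed_nifz_payload(models))  # b'a.nif\x00b.nif\x00'
```

The old path ends in two nulls, matching what TES4Edit writes; the regressed one is exactly one byte shorter, which is the difference found in the minimized patch.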

Utumno commented 2 years ago

Thanks for the detailed report, @jwalt, and Inf for the TTT fix :P As noted in that commit, there is some work in 480-records-refactoring-pt3 on handling record string elements, but it's currently blocked by #543. There are a few open questions, so it's nice that this is being investigated now :)

lojack5 commented 2 years ago

For anyone curious why: they're most likely using a Windows API for loading/writing string lists, something like this, which shows up all over the place in API calls. I know we (well, you guys) fixed it, and that it was right before; it's just another example of why you need to be careful with data structures involving null-terminated strings.
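Assuming the format meant here is the double-null-terminated string list used by Windows APIs such as `REG_MULTI_SZ` registry values (each string null-terminated, with one extra null, i.e. an empty string, closing the list), a writer/reader pair for it might look like the sketch below. Both function names are hypothetical, and this is a generic illustration of the convention, not Wrye Bash's implementation.

```python
from __future__ import annotations

NULL1 = b"\x00"

def pack_string_list(strings: list[bytes]) -> bytes:
    """Encode strings as a double-null-terminated list: each string ends in a
    null, and one extra null (an empty string) marks the end of the list."""
    if not strings:
        raise ValueError("this sketch does not encode empty lists")
    for s in strings:
        # An empty string or embedded null would end the list early for a reader.
        if not s or NULL1 in s:
            raise ValueError(f"cannot encode {s!r}")
    return b"".join(s + NULL1 for s in strings) + NULL1

def unpack_string_list(payload: bytes) -> list[bytes]:
    """Decode a payload produced by pack_string_list, rejecting one that lacks
    the final empty-string terminator."""
    if not payload.endswith(NULL1 * 2):
        raise ValueError("missing double-null terminator")
    return payload[:-2].split(NULL1)

payload = pack_string_list([b"a.nif", b"b.nif"])
print(payload)                      # b'a.nif\x00b.nif\x00\x00'
print(unpack_string_list(payload))  # [b'a.nif', b'b.nif']
```

Validating the terminator on the reading side, as `unpack_string_list` does, is one way a round-trip test could catch a regression like this one before it reaches the game.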