address spelling errors in a disciplined manner

andrewrk commented 1 month ago

We get a lot of spelling error corrections, especially from people looking to do their first contribution to the project.

This wastes computer and human resources because it takes a lot less time to solve this in bulk. I don't want to merge 1,000 PRs each fixing one word even if those 1,000 PRs are spread among many years.

However we cannot ignore this issue when the spelling errors are in doc comments because it can cause problems with grep and searching in autodocs.

This issue is to implement some kind of automated spelling checker that has no false positives, yet catches everything that would otherwise be caught by a contributor, and add it to our CI so that we can stop this problem once and for all.

In the meantime please pick something substantial to work on. If you want to fix spelling errors then fix all of them, including the ones from the future.

P.S. Please don't suggest some "skip CI" mechanism.

nektro commented 1 month ago

https://github.com/codespell-project/codespell looks like a popular and actively maintained option and has a -w flag to write the changes

stefanzweifel/git-auto-commit-action@v5 is a handy github action to make commits after making changes in CI

paperdave commented 1 month ago

another existing tool https://github.com/crate-ci/typos. i've used it on a personal project as a ci step, and additionally in my editor at all times. the only false positives i had was E.ACCES and strat (configurable).

is your vision for this to be a spell-checker written in Zig so that it can be compiled and run without additional dependencies? if so, maybe porting one of these options would be reasonable start

andrewrk commented 1 month ago

bad ideas: :x: dependency on github :x: inability to test locally with zig build

Rexicon226 commented 1 month ago

Are we looking to check the spelling of decls? That seems a lot harder, but just curious. To add to that, are we just looking to check .zig files, or all?

sfiedler0 commented 1 month ago

What about pinning this issue (so new contributors that want to report spelling errors see this issue)?

lsculv commented 1 month ago

I used the suggestion of @paperdave above and tried out using the codespell tool on the zig source tree just to do a preliminary check on what things look like. I ended up with the following rough code. I just extract all of the comments in all of the *.zig files in the repo and then pass them through the codespell tool, then print out a usable report (the exact specifics here are not that important but I'm including this just for reproducibility):

Code

```zig const std = @import("std"); const mem = std.mem; const process = std.process; const heap = std.heap; const fmt = std.fmt; const fs = std.fs; const io = std.io; pub const Comment = struct { file_name: []const u8, line_number: u32, body: []const u8, }; pub fn main() !void { var gpa = heap.GeneralPurposeAllocator(.{}){}; const alloc = gpa.allocator(); // Extract every comment that is in a Zig file in the main Zig repo. // The output will have the form: $file:$line:$match // // These are the names of files or directories that should be excluded from // the searching process. This probably should include generated files with // weird comments that produce a lot of false positives. const ignored_files = [_][]const u8{ "lib/compiler/aro/aro/Builtins/Builtin.zig", }; const rg_args = comptime rga: { const add_len = ignored_files.len * 2; var args: [8 + add_len][]const u8 = undefined; @memcpy(args[0..8], &[_][]const u8{ "rg", "//(/|!)?\\s*(.*)", "--replace", "$2", "--only-matching", "--type", "zig", "--line-number", }); var i = 8; for (ignored_files) |file| { args[i] = "--glob"; args[i + 1] = "!" ++ file; i += 2; } const final = args; break :rga final; }; const rg = try process.Child.run(.{ .argv = &rg_args, .allocator = alloc, .max_output_bytes = 1 << 63 }); defer alloc.free(rg.stdout); defer alloc.free(rg.stderr); if (rg.term.Exited != 0) { process.exit(rg.term.Exited); } // Parse each of the lines of the above output into a comment structure var comments = std.ArrayList(Comment).init(alloc); defer comments.deinit(); var rg_lines = mem.splitSequence(u8, rg.stdout, "\n"); while (rg_lines.next()) |line| { if (line.len == 0) { break; } var sections = mem.splitSequence(u8, line, ":"); const file_name = sections.next().?; const line_number = try fmt.parseInt(u32, sections.next().?, 10); const body = sections.next().?; const comment = Comment{ .file_name = file_name, .line_number = line_number, .body = body, }; try comments.append(comment); } // Construct a file to pass into the `codespell` utility. // `codespell` needs to take input from a file as far as I can tell. const tempdir_name = "/tmp/tmp.spellcheck-zig"; try fs.makeDirAbsolute(tempdir_name); var tempdir = try fs.openDirAbsolute(tempdir_name, .{}); defer { tempdir.close(); fs.deleteTreeAbsolute(tempdir_name) catch unreachable; } const fname = "comment_bodies.txt"; const file = try tempdir.createFile(fname, .{ .read = true }); defer file.close(); var bw = io.bufferedWriter(file.writer()); for (comments.items) |comment| { _ = try bw.write(comment.body); _ = try bw.write("\n"); } try bw.flush(); // Run codespell on our newly created file // The output looks like: $file:$line: $word ==> $correction const ignored_words = [_][]const u8{ "flate", "dependee", "dependees", "re-use", "runned", "reenable", "ECT", "strat", "re-declare", "re-use", "HSA", "Synopsys", "AKS", "Numer", "crate", "Arithmetics", "fo", "AtLeast", }; const csa_cap = if (ignored_words.len > 0) 4 else 2; var codespell_args = try std.ArrayList([]const u8).initCapacity(alloc, csa_cap); defer codespell_args.deinit(); try codespell_args.append("codespell"); try codespell_args.append(try fs.path.join(alloc, &[_][]const u8{ tempdir_name, fname })); const iwl_cap = comptime blk: { var i = 0; for (ignored_words) |word| { i += word.len; i += 1; } break :blk i; }; var ignore_words_list = try std.ArrayList(u8).initCapacity(alloc, iwl_cap); defer ignore_words_list.deinit(); if (ignored_words.len > 0) { for (ignored_words) |word| { for (word) |char| { try ignore_words_list.append(char); } try ignore_words_list.append(','); } try codespell_args.append("--ignore-words-list"); try codespell_args.append(ignore_words_list.items); } const codespell = try process.Child.run(.{ .argv = codespell_args.items, .allocator = alloc }); defer alloc.free(codespell.stdout); defer alloc.free(codespell.stderr); if (codespell.term.Exited != 65) { std.debug.print("codespell did not detect any errors", .{}); process.exit(codespell.term.Exited); } const handle = io.getStdOut(); var stdout_buf = io.bufferedWriter(handle.writer()); var stdout = stdout_buf.writer(); var codespell_lines = mem.splitSequence(u8, codespell.stdout, "\n"); while (codespell_lines.next()) |line| { if (line.len == 0) { break; } var sections = mem.splitSequence(u8, line, ":"); _ = sections.next().?; // This would be the file name, but its irrelevant const line_number = try fmt.parseInt(usize, sections.next().?, 10); const correction = sections.next().?; const original_word = ow: { const word_end = mem.indexOf(u8, correction, " ==>").?; break :ow correction[1..word_end]; // 1 skips the leading space }; const new_word = nw: { const correction_begin = mem.indexOf(u8, correction, " ==>").? + 5; break :nw correction[correction_begin..]; }; const comment = comments.items[line_number - 1]; try stdout.print("{s}:{d}: {s} -> {s} ({s})\n", .{ comment.file_name, comment.line_number, original_word, new_word, comment.body, }); try stdout_buf.flush(); } } ```

After using this I ended up with ~150 "typos" which I pared down to 116 by expanding the ignore list to include a generated file and adding some words to the allow list. There will (and should) probably be some quibbling over a couple of the included "typos", but the vast majority are actual mistakes and not just preference.

Here is the output I got just for reference:

Output

``` src/crash_report.zig:360: witout -> without (it can't even do that witout panicing, it will recover without logging) src/crash_report.zig:360: panicing -> panicking (it can't even do that witout panicing, it will recover without logging) tools/gen_spirv_spec.zig:404: wont -> won't (Just assume that these wont clobber zig builtin types.) tools/update_spirv_features.zig:242: thats -> that's (we need just the name, but to avoid duplicates here we will just skip anything thats not asciidoc.) src/link/MachO.zig:986: overriden -> overridden (image unless overriden by -no_implicit_dylibs.) src/translate_c.zig:4924: childern -> children (check all childern for opaque types.) src/Compilation.zig:4112: occured -> occurred (observed to cause Error occured during wast conversion ) src/Compilation.zig:4112: wast -> waste (observed to cause Error occured during wast conversion ) src/link/Elf.zig:3209: destuctors -> destructors (We need to sort constructors/destuctors in the following sections) src/link/Elf.zig:3215: prority -> priority (The prority of inclusion is defined as part of the input section's name. For example, .init_array.10000.) src/link/Elf.zig:3222: prority -> priority (Ties are broken by the file prority which corresponds to the inclusion of input sections in this output section) src/link/Elf.zig:3716: progam -> program ((This is an upper bound so that we can reserve enough space for the header and progam header) src/link/Elf.zig:3806: preceeding -> preceding, proceeding (If a section matches segment flags with the preceeding section,) src/link/Elf.zig:3834: segement -> segment (the start address of the segement (and first section).) src/link/Elf.zig:3922: commited -> committed (Get size actually commited to the output file.) lib/std/os/linux.zig:688: familiy -> family (futex2 familiy of calls.) lib/std/os/linux.zig:5680: cancelation -> cancellation (Key off 'fd' for cancelation rather than the request 'user_data'.) lib/std/os/linux.zig:5894: cancelation -> cancellation (sync cancelation API) src/Zcu.zig:1731: hte -> the (`arg_index` is the index of the argument which hte source location refers to.) lib/compiler_rt/os_version_check.zig:27: compatability -> compatibility (our implementation differs by simply removing that backwards compatability support. We only use) lib/std/tar.zig:402: te -> the, be, we, to (size rounded to te block boundary) src/link/Wasm.zig:282: contigious -> contiguous (The strings are stored as a contigious array where each string is zero-terminated.) src/link/Coff.zig:1569: abstact -> abstract (pass. This will go away once we abstact away Zig's incremental compilation into) src/link/SpirV.zig:234: preseves -> preserves (them somehow. Easiest here is to use some established scheme, one which also preseves the) src/link/Plan9.zig:52: preceeding -> preceding, proceeding (The debugger looks for the first file (aout.Sym.Type.z) preceeding the text symbol) src/link/Plan9.zig:142: convience -> convince, convenience (is just a pointer for convience.) src/link/Plan9.zig:237: thats -> that's (this might have to be modified in the linker, so thats why its mutable) src/link/Plan9.zig:407: compatable -> compatible (we have already checked the target in the linker to make sure it is compatable) src/InternPool.zig:5575: offets -> offsets, offers (field offets are populated.) src/InternPool.zig:9284: automatially -> automatically (If the enum is automatially numbered, `value` must be `.none`.) lib/compiler/objcopy.zig:824: possibile -> possible (It doesn't support all possibile elf files) lib/compiler/objcopy.zig:825: sperate -> separate (It was written for a specific use case (strip debug info to a sperate file, for linux 64-bits executables built with `zig` or `zig c++` )) lib/compiler/objcopy.zig:826: reoders -> reorders (It moves and reoders the sections as little as possible to avoid having to do fixups.) lib/compiler/objcopy.zig:998: indexs -> indexes, indices (remap the indexs after omitting the filtered sections) lib/compiler/objcopy.zig:1163: modifing -> modifying (add padding to avoid modifing the program segments) lib/std/io.zig:669: docuementation -> documentation (The `ReadFile` docuementation states that `lpNumberOfBytesRead` does not have a meaningful) src/link/Wasm/Symbol.zig:7: containings -> containing (Bitfield containings flags for a symbol) lib/compiler/aro_translate_c.zig:1421: neccessity -> necessity (within the struct itself is [variable_name] by neccessity since it's an) test/link/static_libs_from_object_files/build.zig:76: librairies -> libraries (using static librairies) test/link/static_libs_from_object_files/build.zig:110: librairies -> libraries (using static librairies and object files) lib/compiler/resinator/source_mapping.zig:189: useable -> usable (that is useable in the same way that a non-empty mapping would be.) lib/compiler/resinator/res.zig:345: interpretted -> interpreted (Anything that resolves to zero is not interpretted as a number) lib/compiler/resinator/res.zig:386: interpretted -> interpreted (Anything that resolves to zero is not interpretted as a number) lib/compiler/resinator/res.zig:655: useable -> usable (to anything useable, so there's no point in emulating that behavior--erroring for) lib/compiler_rt/arm.zig:88: dentical -> identical (This is dentical to the standard `memset` definition but with the last) lib/compiler/resinator/parse.zig:909: paramter -> parameter (If there is no comma after the style paramter, the Win32 RC compiler) src/Sema.zig:928: guranteed -> guaranteed (Semantically analyze a ZIR function body. It is guranteed by AstGen that such a body cannot) src/Sema.zig:12140: resposible -> responsible (be resposible for recursively resolving different prongs as needed.) src/Sema.zig:16679: occured -> occurred (If the rhs is zero, then the result is lhs and no overflow occured.) src/Sema.zig:16697: occured -> occurred (If either of the arguments is zero, the result is zero and no overflow occured.) src/Sema.zig:16698: occured -> occurred (If either of the arguments is one, the result is the other and no overflow occured.) src/Sema.zig:20269: swich -> switch, swish (extend this swich as additional operators are implemented) src/Sema.zig:31700: underyling -> underlying (Traverse an arbitrary number of bitcasted pointers and return the underyling vector) src/Sema.zig:33662: ths -> the, this (If the LHS is const, check if there is a guaranteed result which does not depend on ths RHS value.) src/Sema.zig:33683: ths -> the, this (If the RHS is const, check if there is a guaranteed result which does not depend on ths LHS value.) lib/compiler/resinator/comments.zig:243: som -> some (som\"/*ething*/"BLAH) lib/compiler/resinator/comments.zig:245: som -> some (som\"/*ething*/"BLAH) lib/std/crypto/pcurves/common.zig:126: Conditonally -> Conditionally (Conditonally replace a field element with `a` if `c` is positive.) src/link/MachO/ZigObject.zig:1024: appopriate -> appropriate (1. first we lower the initializer into appopriate section (__thread_data or __thread_bss)) lib/compiler/resinator/compile.zig:2169: compability -> compatibility (This is the 'alternate compability form' of the separator, see) lib/std/Build.zig:632: dependant -> dependent (dependant packages require a standard prefix, such as include directories for C headers.) lib/compiler/resinator/code_pages.zig:32: interpretted -> interpreted (^ second byte of UTF-8 representation (0x80), but interpretted as) lib/compiler/resinator/code_pages.zig:59: continutation -> continuation (`. `.~<0xC2>a is not well-formed UTF-8 (0xC2 expects a continutation byte after it).) src/link/MachO/DebugSymbols.zig:237: preceed -> precede, proceed (however at the cost of having LINKEDIT preceed DWARF in dSYM binary which we) lib/compiler_rt/rem_pio2_large.zig:211: cancelation -> cancellation (the fraction part may be lost to cancelation before we) lib/std/treap.zig:148: undefind -> undefined (can have `undefind` content because the value will be initialized internally.) lib/std/time.zig:181: comparible -> compatible, comparable (windows and wasi timestamps are in u64 which is easily comparible) lib/compiler/aro/aro/Parser.zig:5440: pointes -> points (comparisons between floats and pointes not allowed) lib/compiler/aro/aro/Compilation.zig:21: ocurred -> occurred (A fatal error has ocurred and compilation has stopped.) lib/std/c/solaris.zig:283: chainging -> changing, chaining (Here we alias lifreq to ifreq to avoid chainging existing code in os and x.os.IPv6.) lib/std/hash/xxhash.zig:756: onthe -> on the (Digest the last block onthe Accumulator copy.) src/codegen/llvm.zig:4966: conatins -> contains (`true` if `table` conatins a reference to the `else` block.) src/codegen/llvm.zig:10914: prefetchs -> prefetches (LLVM fails during codegen of instruction cache prefetchs for these architectures.) src/codegen/llvm.zig:11684: treates -> treats (generic address space and treates as regular pointers. This is the way that HIP also does it.) lib/std/hash/auto_hash.zig:135: dependant -> dependent (next one so that they're dependant.) lib/std/os/windows/ntstatus.zig:2679: unble -> unable, uncle (Windows was unble to process the application binding information.) lib/compiler/aro/aro/Driver/Filesystem.zig:199: retun -> return (Otherwise retun null) lib/std/crypto/25519/field.zig:209: Conditonally -> Conditionally (Conditonally replace a field element with `a` if `c` is positive) src/codegen/spirv.zig:55: hte -> the (Note that hte incoming block from the `else` label is) src/codegen/spirv/Section.zig:283: insturction -> instruction (describe the entire insturction in which it is used.) lib/std/os/linux/seccomp.zig:29: dependant -> dependent (which is dependant on the ABI. Since BPF programs execute in a 32-bit) lib/std/tar/writer.zig:8: premission -> permission (be used for long names. Options enables setting file premission mode and) lib/std/tar/writer.zig:385: cant -> can't (cant fit into 255 + sep) src/codegen/spirv/Assembler.zig:99: occured -> occurred (The offset in bytes from the start of `src` that this error occured.) src/codegen/spirv/Assembler.zig:150: occured -> occurred (A list of errors that occured during processing the assembly.) lib/std/Thread.zig:968: atleast -> at least (start with atleast a single page, which is used as a guard to prevent) lib/std/crypto.zig:209: environements -> environments (making most attacks unpractical even on shared/low latency environements.) lib/std/debug/SelfInfo.zig:887: statics -> statistics (TODO handle globals N_GSYM, and statics N_STSYM) src/Package/Fetch.zig:1904: Emmit -> Emit (Emmit errors to an `ErrorBundle`.) src/Package/Fetch.zig:2055: sensitve -> sensitive (This tarball has duplicate path 'dir1/file1' to simulate case sensitve) src/Package/Fetch.zig:2092: wich -> which (Same as previous tarball but has build.zig.zon wich excludes 'dir1'.) src/Package/Fetch.zig:2307: constains -> constrains, contains (Test helper, asserts thet package dir constains expected_files.) lib/std/debug/Dwarf.zig:2267: opportunisticly -> opportunistically (This only opportunisticly tries to load from the debuginfod cache, but doesn't try to populate it.) src/arch/wasm/CodeGen.zig:716: seperate -> separate (For each individual Wasm valtype we store a seperate free list which) src/arch/wasm/CodeGen.zig:3632: incase -> in case (incase of an actual integer, we emit the correct signedness) src/arch/wasm/CodeGen.zig:4131: seperated -> separated (When the highest and lowest values are seperated by '50',) src/arch/wasm/CodeGen.zig:4153: atleast -> at least (we put inside, are atleast 0.) src/arch/wasm/CodeGen.zig:4157: substracting -> subtracting (make the index start from 0 by substracting the lowest value) src/arch/sparc64/Emit.zig:658: preferrably -> preferably (the address of the original target, preferrably in) src/arch/riscv64/CodeGen.zig:2224: instruciton -> instruction (spilled reg not tracked with spilled instruciton) src/arch/riscv64/CodeGen.zig:7815: guarnetee -> guarantee (Don't guarnetee other memory operations to be ordered after the load.) src/arch/riscv64/CodeGen.zig:7818: accurs -> accurse, occurs (Make sure all previous reads happen before any reading or writing accurs.) test/behavior/cast_int.zig:186: unitialised -> uninitialised (using ptrCast not to depend on unitialised memory state) test/behavior/cast_int.zig:222: unitialised -> uninitialised (using ptrCast not to depend on unitialised memory state) src/arch/riscv64/encoding.zig:74: properities -> properties (the mnemonic has some special properities that can't be handled in a generic fashion) src/arch/riscv64/bits.zig:164: calller -> caller (foat temporaries. calller saved.) src/arch/riscv64/bits.zig:187: infering -> inferring (infering which register is being talked about given the instruction it's in.) src/arch/riscv64/bits.zig:190: seperate -> separate (seperate IDs for `x0` and `f0`. We will assume that each register set has 32 registers) src/arch/riscv64/bits.zig:245: referes -> refers, referees (This index referes to where in the stack frame the args are spilled to.) src/arch/riscv64/bits.zig:247: referes -> refers, referees (This index referes to a frame dedicated to setting up args for function called) src/arch/riscv64/bits.zig:250: referes -> refers, referees (This index referes to the frame where callee saved registers are spilled and restored from.) src/arch/riscv64/abi.zig:200: seperating -> separating (but we haven't implemented seperating vector registers into register_pairs) test/behavior/switch.zig:891: remaning -> remaining, renaming, remaking (will not necessarily zero remaning payload memory) src/arch/arm/CodeGen.zig:79: offsest -> offsets, offset (calculate the right stack offsest with respect to the `.fp` register.) src/arch/arm/bits.zig:24: higer -> higher (unsigned higer) src/arch/aarch64/CodeGen.zig:79: offsest -> offsets, offset (calculate the right stack offsest with respect to the `.fp` register.) ```

Thats all using the following allow list:

flate, dependee, dependees, re-use, runned, reenable, ECT, strat, re-declare, re-use, HSA, Synopsys, AKS, Numer, crate, Arithmetics, fo, AtLeast

This might be a good starting place for anybody making a more refined tool.

It would be possible (probably not all that hard) to rewrite the core of the codespell tool in a simplier manner, as it has quite a few knobs and features that I don't think are necessary. Ultimately, spell checking will always require a human at wheel to make sure puns, jargon, and the like are not caught, but it can be done with minimal intervention.

Just as a footnote to an already-big-enough comment: here is a discussion of spellchecking comments in the Linux kernel that I think might be of interest to anyone reading this thread: http://www.kegel.com/kerspell/

Rexicon226 commented 1 month ago

Great results! Would you mind placing the larger blocks into collapsable regions, please?

lsculv commented 1 month ago

Great results! Would you mind placing the larger blocks into collapsable regions, please?

Fixed (I thought it would do that automatically, oops)

ziglang / zig

address spelling errors in a disciplined manner #21660