wokket / rust-hl7

Learning rust by playing with a HL7 parser. Use for real at your own risk!
18 stars 8 forks source link

Major refactor: flatten code and object structures #34

Closed sempervictus closed 3 years ago

sempervictus commented 3 years ago

Rebuild of PR #31

sempervictus commented 3 years ago

I have no idea what the deuce is up with github these last two days - it somehow managed to offset the master reference. Fixed

wokket commented 3 years ago

OK, I have a feeling that one of us isn't quite on the same page with the repeats thing. There's pseudocode in the spec for this, but here's my take on it:

A field has one or more 'occurrence ', separated by the repeat delimiter (default ~). An occurrence has one or more components, separated by the component delimiter (default ^). A component has one or more sub-components, separated by the sub component delimiter (default &),

So a field of "aaa^bbb" as 1 occurrence (repeat), and two components, each with a single sub component. You could retrieve "aaa" using a query of "R1.C1", or just "C1" as the R1 is kind of redundant.

A field of "aaa^bbb~ccc^ddd" could also retrieve "aaa" using "R1.C1" or "C1" per above, but would use "R2.C1" to get "ccc".

If we had "aaa^bbb&ccc" then we have one occurrence, 2 components, and the second component has two subcomponents. We could retrieve "ccc" using "C2.SC2".

Do you know something I don't?

sempervictus commented 3 years ago

Ah! Thank you! So the logic really should be more along the lines of having 3 vectors per field:

  1. Repeats
  2. Components of repeats (Vec)
  3. Subcomponents of components? (Vec<Vec>) I've never seen notation accessing the subcomponent by string query, so i think i screwed up the implementation of field actually. Is the logic above correct? If so, gimme a few and i'll get it implemented for the PR.
wokket commented 3 years ago

That's generally correct, but is also where I headed in the first iteration of this library. When the majority of fields are only a "simple" field with a single value ("blah") the cost of maintaing multiple vectors (in terms of brain power as much as any runtime cost) gets really high.

That's why I'd started down the enum path of having a "SimpleField", a "RepeatingField" etc etc, but hadn't got as far as a good working model when you came along with the path based stuff which is great, means you don't need to build and then deconstruct a bunch of nested vecs all the time, you can just query in and split as needed based on the query.

As an example, if I have a field with a simple value "blah" (ie one occurrance, one component, one sub-component), do you have a vec[0][0][0] == "blah" ? or just vec[0] = "blah" ? How do you then index your field? does field[0] == field[0,0,0] == field[0][0][0], or are some of those considered out of bounds? How do you make those behave intuitively the same as a simple set of nested vecs if the are all the same value?

This is why I've been pushing down the external Selecter::query() line of thinking...

sempervictus commented 3 years ago

All makes sense, thank you. Impl mostly done, getting test in line with the new structure. In terms of cost... i was thinking maybe there's some way to get the compiler to agree to let us replace Vec<'a str> types with &[&'a str] slices which should be generally cheaper. Alternatively, instead of enums, we might be able to do this by having a "simple field" initializer which skips all that work and just assigns empty Vec's to the struct members. Lemme wrap up the refactor for field and push to review, and then i'll see what i can do about optimizing away some of the added cost.

wokket commented 3 years ago

Yeah let's not make this change any larger than it already is. The cost for me was mainly brain tax.

Do I have to specify a repeat[0] if I want a component?
If a component has subcomponents but I don't specify which one do I get the raw string value? or an error? or ?? A general rule that you have to include all 'higher' levels if you want a 'lower' value makes sense for indexing, but gets in the way if you want (say) the second subcomponent of the only component, in the only repeat.

When you're doing simple indexing into vecs (or arrays, whatever) those questions are really hard, but the path selection stuff you wrote is a much nicer direction, as we get to describe the semantics of the query. "SC2" vs field[0][0][1] is much easier on the neurons.

sempervictus commented 3 years ago

you do not have to specify repeat[0] to get a component, that would just yield the entire field in most cases where repeats are not implemented. Components are now where subcomponents were previously - so component[0] would get you the first repeat's worth of components:

Field {
    source: "x&x^y&y~a&a^b&b",
    delims: Separators {
        segment: '\r',
        field: '|',
        repeat: '~',
        component: '^',
        subcomponent: '&',
        escape_char: '\\',
    },
    repeats: [
        "x&x^y&y",
        "a&a^b&b",
    ],
    components: [
        [
            "x&x",
            "y&y",
        ],
        [
            "a&a",
            "b&b",
        ],
    ],
    subcomponents: [
        [
            [
                "x",
                "x",
            ],
            [
                "y",
                "y",
            ],
        ],
        [
            [
                "a",
                "a",
            ],
            [
                "b",
                "b",
            ],
        ],
    ],
}
wokket commented 3 years ago

OK I'm AFK for a bit. Ping me when you've worked through whatever you want to do, had a look at the other points I raised in the PR and run it through a clippy pass, and I'll double check it later, but I'd expect it to be pretty close by then.

sempervictus commented 3 years ago

Thank you. Grokking clippy and going over the comments again.

sempervictus commented 3 years ago

That was educational :), appreciate the references. I think i broke the benchmark somehow - code looks sane to me, but GH seems to be hanging on the changed bench. Even i'm not so unlucky as to have CI coincidentally hang on the thing i just changed. Probably something i dont get about how benchmarking works. When you're back, could you please peek at that and kick me in the appropriate direction if you know it?

sempervictus commented 3 years ago

On the depth of Field structure - trying to make it use borrowed slices a la

diff --git i/src/fields.rs w/src/fields.rs
index 71e7f7f..441cb10 100644
--- i/src/fields.rs
+++ w/src/fields.rs
@@ -9,7 +9,7 @@ use std::ops::Index;
 pub struct Field<'a> {
     pub source: &'a str,
     pub delims: Separators,
-    pub repeats: Vec<&'a str>,
+    pub repeats: &[&'a str],
     pub components: Vec<Vec<&'a str>>,
     pub subcomponents: Vec<Vec<Vec<&'a str>>>,
 }
@@ -40,7 +40,7 @@ impl<'a> Field<'a> {
         let field = Field {
             source: input,
             delims: *delims,
-            repeats,
+            repeats: &repeats[..]
             components,
             subcomponents,

results in some nay-saying by the borrow-checker:

error[E0515]: cannot return value referencing local variable `repeats`
  --> src/fields.rs:47:9
   |
43 |             repeats: &repeats[..],
   |                       ------- `repeats` is borrowed here
...
47 |         Ok(field)
   |         ^^^^^^^^^ returns a value referencing data owned by the current function

error: aborting due to previous error

For more information about this error, try `rustc --explain E0515`.
error: could not compile `rust-hl7`

without the borrow, it has an unknown size issue with this:

error[E0277]: the size for values of type `[&'a str]` cannot be known at compilation time
  --> src/fields.rs:12:18
   |
12 |     pub repeats: [&'a str],
   |                  ^^^^^^^^^ doesn't have a size known at compile-time
   |
   = help: the trait `Sized` is not implemented for `[&'a str]`
   = note: only the last field of a struct may have a dynamically sized type
   = help: change the field's type to have a statically known size

so i think i'll let sleeping dogs lie for the time being on this.

sempervictus commented 3 years ago

Fun note from clippy regarding efficiency: the distinct MshSegment type is ~40x larger than a generic Segment:

warning: large size difference between variants
 --> src/segments.rs:8:5
  |
8 |     MSH(MshSegment<'a>),
  |     ^^^^^^^^^^^^^^^^^^^ this variant is 1952 bytes
  |
  = note: `#[warn(clippy::large_enum_variant)]` on by default
note: and the second-largest variant is 48 bytes:
 --> src/segments.rs:7:5
  |
7 |     Generic(Segment<'a>),
  |     ^^^^^^^^^^^^^^^^^^^^
  = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#large_enum_variant
help: consider boxing the large fields to reduce the total size of the enum
  |
8 |     MSH(Box<MshSegment<'a>>),
  |         ^^^^^^^^^^^^^^^^^^^

warning: 1 warning emitted

I definitely broke the benchmark :( - even when borrowing msh there, its still stuck showing:


Benchmarking Read Field from MSH (variable)
Benchmarking Read Field from MSH (variable): Warming up for 3.0000 s

or just a blank output if you look at it while its running and stuck there.

sempervictus commented 3 years ago

Either that benchmark method won the lottery or something weird is going on inside the iterator:

Benchmarking Read Field from MSH (variable)
Benchmarking Read Field from MSH (variable): Warming up for 3.0000 s
Benchmarking Read Field from MSH (variable): Collecting 100 samples in estimated 5.0000 s (18446744074B iterations)
Benchmarking Read Field from MSH (variable): Analyzing
Criterion.rs ERROR: At least one measurement of benchmark Read Field from MSH (variable) took zero time per iteration. This should not be possible. If using iter_custom, please verify that your routine is correctly measured.

while it appears (during benchmarking) to take a rather long time.

wokket commented 3 years ago

Benchmark is running fine for me locally, so I'm inclined to merge it and see if the zombie army that powers github actions are back on duty.

Anything else before I merge?