wokket / rust-hl7

Learning rust by playing with a HL7 parser. Use for real at your own risk!
18 stars 8 forks source link

Extract nested string vectors from message by dot-delimited index #11

Closed sempervictus closed 3 years ago

sempervictus commented 3 years ago

Extract nested string vectors from message by idx

HL7 content is commonly accessed via multidimensional index in the form of PV1.F2.R1.C1 which breaks down tosegment_name .field_index.component_index.subcomponent_index at various depths from name to nested subcomponents.

In order to permit convenient access to the message &str data, a set of named segments must be evaluated and processed out down to the subcomponent field level for the requested cartesian product, all levels of possible processing returning a Vec<Vec<&'a str>>.

This commit builds on the sub-field parsing work in the original branch, providing an impl<'a> Index<String> for Message<'a> intended to provide the functionality described above.

Unfortunately the branch is currently failing to build on Rust stable 1.54 (current), apparently angry at lifetime elision constraints in the implementation.

Ping @wokket: this rabbit hole seems to go on for a while:

error[E0495]: cannot infer an appropriate lifetime for autoref due to conflicting requirements
   --> src/message.rs:124:25
    |
124 |         let segs = self.generic_segments_by_name(seg_name).unwrap();
    |                         ^^^^^^^^^^^^^^^^^^^^^^^^
    |
note: first, the lifetime cannot outlive the anonymous lifetime defined on the method body at 112:14...
   --> src/message.rs:112:14
    |
112 |     fn index(&self, idx: String) -> &Self::Output {
    |              ^^^^^
note: ...so that reference does not outlive borrowed content
   --> src/message.rs:124:20
    |
124 |         let segs = self.generic_segments_by_name(seg_name).unwrap();
    |                    ^^^^
note: but, the lifetime must be valid for the lifetime `'a` as defined on the impl at 108:6...
   --> src/message.rs:108:6
    |
108 | impl<'a> Index<String> for Message<'a> {
    |      ^^
note: ...so that the types are compatible
   --> src/message.rs:112:51
    |
112 |       fn index(&self, idx: String) -> &Self::Output {
    |  ___________________________________________________^
113 | |         let delims = self.msh().unwrap().msh_2_encoding_characters;
114 | |         // Parse index elements
115 | |         let mut slices: VecDeque<&str> = idx.split(".").collect();
...   |
160 | |         &vecs
161 | |     }
    | |_____^
    = note: expected `Index<String>`
               found `Index<String>`

error: aborting due to previous error

For more information about this error, try `rustc --explain E0495`.
error: could not compile `rust-hl7`
sempervictus commented 3 years ago

Feeding it another lifetime parameter within the function call produces roughly the same output:

diff --git i/src/message.rs w/src/message.rs
index 6f5d58b..783da66 100644
--- i/src/message.rs
+++ w/src/message.rs
@@ -109,7 +109,7 @@ impl<'a> Index<String> for Message<'a> {
     type Output = Vec<Vec<&'a str>>;

     /// Handle indexing into a Message a la PV1.F2.R1.C1
-    fn index(&self, idx: String) -> &Self::Output {
+    fn index<'b>(&'b self, idx: String) -> &'b Self::Output {
         let delims = self.msh().unwrap().msh_2_encoding_characters;
         // Parse index elements
         let mut slices: VecDeque<&str> = idx.split(".").collect();
error[E0495]: cannot infer an appropriate lifetime for autoref due to conflicting requirements
   --> src/message.rs:124:25
    |
124 |         let segs = self.generic_segments_by_name(seg_name).unwrap();
    |                         ^^^^^^^^^^^^^^^^^^^^^^^^
    |
note: first, the lifetime cannot outlive the lifetime `'b` as defined on the method body at 112:14...
   --> src/message.rs:112:14
    |
112 |     fn index<'b>(&'b self, idx: String) -> &'b Self::Output {
    |              ^^
note: ...so that reference does not outlive borrowed content
   --> src/message.rs:124:20
    |
124 |         let segs = self.generic_segments_by_name(seg_name).unwrap();
    |                    ^^^^
note: but, the lifetime must be valid for the lifetime `'a` as defined on the impl at 108:6...
   --> src/message.rs:108:6
    |
108 | impl<'a> Index<String> for Message<'a> {
    |      ^^
note: ...so that the types are compatible
   --> src/message.rs:112:61
    |
112 |       fn index<'b>(&'b self, idx: String) -> &'b Self::Output {
    |  _____________________________________________________________^
113 | |         let delims = self.msh().unwrap().msh_2_encoding_characters;
114 | |         // Parse index elements
115 | |         let mut slices: VecDeque<&str> = idx.split(".").collect();
...   |
160 | |         &vecs
161 | |     }
    | |_____^
    = note: expected `Index<String>`
               found `Index<String>`

error: aborting due to previous error

For more information about this error, try `rustc --explain E0495`.
error: could not compile `rust-hl7`
sempervictus commented 3 years ago

I've rebased this atop the code permitting direct access to source fields in the message substructures - much easier on the borrow checker, but still not quite there. Somehow Rust claims that i'm returning owned data when i pull field value (even by destructuring) or anything below - the source content of Segments seems to go over just fine, using the same exact pattern i'm using for Field reference extraction.

sempervictus commented 3 years ago

Thinking we need Field to carry its own delimiters as a Struct... otherwise encapsulated functions don't work very well. Its 2 bytes, so going to try it and see if can rewire Field to work a bit more dynamically

EDIT: now #13

sempervictus commented 3 years ago

@wokket - could you take a look? This is a pretty serious change which i think i've encapsulated correctly/safely, but would really appreciate a 2nd set of eyes on the code (and possibly some test case suggestions). My current one seems to work:

    fn ensure_index() -> Result<(), Hl7ParseError> {
        let hl7 = "MSH|^~\\&|GHH LAB|ELAB-3|GHH OE|BLDG4|200202150930||ORU^R01|CNTRL-3456|P|2.4\rOBR|segment^sub&segment";
        let msg = Message::from_str(hl7)?;
        assert_eq!(msg[String::from("OBR.F1.R1.C1")],"segment");
        Ok(())
    }

but i'm shoulder-deep in this badger's nest and probably lacking an objective perspective :).

Here's what i've learned from this process: if you're sharing string references and intend to have access to them down the line, you need to expose them as struct elements vs via function because the borrow checker infers side effects from almost any function call like collect() or even the old value() call for a Field.

sempervictus commented 3 years ago

@wokket : design decision time. HL7 spec starts counting fields, components, etc at 1 - computers at 0. I'm leveraging the fact that field 0 in a generic segment is the segment name to avoid an off-by-one problem, but in the field's subsections, i'm currently adjusting by one to make all of this work. I'm not a fan of such hacks, so thinking that i expand on this to have Index<usize> work by computer rules, while Index<String> would infer the off-by-one bit for the currently adjusted elements. No indexing on MshSegment yet, not sure we need it, and its a bit of a PITA - can make a grungy large matcher to extract fields if we really must.

sempervictus commented 3 years ago

That should do it - now the medically-minded can pass string indices to this thing and get their expected off-by-one fields correctly, while digital consumers can utilize the appropriate accessors as required.

wokket commented 3 years ago

I'm well aware off the 0/1 based pain in HL7, indexing by 0 numerically and 1 with dot notation is perfect for me providing it's called out really well in the docs 👍 In general rust has brilliant docs standards and I'm trying to live up to those expectations with this lib (Happy to take in a subsequent PR if needed).

The benchmark is failing to compile so it'd be nice to get that cleaned up.

I'm also keen to get that Display impl in rather than a hardwired to_string() to play nicer with ecosystem but am happy to take that in a subsequent PR.

I'll have a more in-depth play shortly... re: tests: Samples showing how it handles error scenarios (fields that don't exist etc) would be nice

wokket commented 3 years ago

Ok I had a bit more of a look. I like the idea of numeric indexing but I'm not sure the index by string is the right way to do the dot notation stuff as you're not really indexing... Renaming that to query or get_by_query or something (naming is hard) will make a bit more sense, hopefully be more discoverable for consumers and allow better intellisense in editors as it's an actual function call rather than just indexing syntax.

The last point I think will be important as there's going to be some complexity/subtlety to explain with how the dot stuff works.

sempervictus commented 3 years ago

Thanks for the feedback - and agree that "i've got some 'splainin' to do" in the comments. That said, i think that this lends flexibility to the implementation to allow various consumers to reason about the data in whatever manner they prefer, and the types of outputs you can get vary as well - these methods provide indexing to return string refs which you can then again parse into structs and so on. The other struct member accessors provide primitive data structures for consumers to easily use (vecs, mostly), and since you made the whole thing work on refs, its near-free in terms of allocation to do all of this :).

wokket commented 3 years ago

Yep, I'm keen to keep both numeric indexing and dot notation, I just don't think some_field[String::from("R0.C1")] is as clear or discoverable as some_field.query_by_path("R0.C1") (channeling xpath etc here).

Actual string indexing into Message (to get segments by name) would make sense if we had a way of dealing with multiple segments of the same type, but I'm not keen on querying using the indexer.

sempervictus commented 3 years ago

It takes str now too, so its just msg["SEG.F1.R1.C1"] for brevity/ease of access. Also, you'll fail (return &"")on the R0 bit since the string notation starts counting at 1

sempervictus commented 3 years ago

Regarding the to_string() and as_str() bits - are you just looking to have them moved to their own implementations? The traits are pretty open - not sure what we'd win with the added LOC (likely just my ignorance).

wokket commented 3 years ago

I don't know if there's a standard trait for as_str or not so I'm not stressed there. I was hoping the change would be as simple as replacing the current to_string function with

impl Display for Blah {
  fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "{}", self.source)
    }
}

or similar...

sempervictus commented 3 years ago

Roger, thanks - lemme try that. Meantime, here's a usability example from evcxr working as a quick REPL on HL7 data:

>> hl7
"MSH|^~\\&|LAB|MYFAC|LAB||201411130917||ORU^R01|3216598|D|2.3|||AL|NE|\rPID|||AND234DA_PID3^1|PID_4_ALTID|Patlast^Patfirst^Mid||19670202|F|||4505 21 st^^LAKE COUNTRY^BC^V4V 2S7||222-555-8484|||||MF0050356/15|\rPV1|1|O|MYFACSOMPL||||^Xavarie^Sonna^^^^^XAVS|||||||||||REF||SELF|||||||||||||||||||MYFAC||REG|||201411071440||||||||23390^PV1_52Surname^PV1_52Given^H^^Dr^^PV1_52Mnemonic|\rORC|RE|PT103933301.0100|||CM|N|||201411130917|^Kyle^Andra^J.^^^^KYLA||^Xavarie^Sonna^^^^^XAVS|MYFAC|\rOBR|1|PT1311:H00001R301.0100|PT1311:H00001R|301.0100^Complete Blood Count (CBC)^00065227^57021-8^CBC \\T\\ Auto Differential^pCLOCD|R||201411130914|||KYLA||||201411130914||^Xavarie^Sonna^^^^^XAVS||00065227||||201411130915||LAB|F||^^^^^R|^Xavarie^Sonna^^^^^XAVS|\rOBX|1|NM|301.0500^White Blood Count (WBC)^00065227^6690-2^Leukocytes^pCLOCD|1|10.1|10\\S\\9/L|3.1-9.7|H||A~S|F|||201411130916|MYFAC^MyFake Hospital^L|\rOBX|2|NM|301.0600^Red Blood Count (RBC)^00065227^789-8^Erythrocytes^pCLOCD|1|3.2|10\\S\\12/L|3.7-5.0|L||A~S|F|||201411130916|MYFAC^MyFake Hospital^L|\rZDR||^Xavarie^Sonna^^^^^XAVS^^^^^XX^^ATP|\rZPR||"
>> {let msg = Message::from_str(&hl7).unwrap(); msg["PV1.F7.R2"]}
"Xavarie"
sempervictus commented 3 years ago

Thanks for the example - trivial, implemented, should be set for review. We might want to take off the "not for production use" tag after this gets merged - it's starting to shape up into something pretty battle-worthy :).

sempervictus commented 3 years ago

I left value() in there for posterity, with an eye toward removal at a later time (i changed core API in terms of field being a struct now)

wokket commented 3 years ago

OK, let's get this merged into master and we can keep fixing bugs/nits/docco etc as separate PR's. Do you have anything else to update it with before I hit the green button?

sempervictus commented 3 years ago

Im beating up on a real use case of this right now - MLLP/HL7 multiplexer and parser to multiple backends, so should be able to put it through its paces enough to find more bugs. Agree that we ought to get this in - i'm clearly starting my bad habit process of belt-fed PR stuffing so probably best to put some logical gating on that :-D A sense of performance numbers for messages sent by from a bin using this lib and the mllp one to a listener wired up the same way and throwing messages out to an MQ after parsing them out to validate MSH and some fields:

[root@ccm-lis00 ~]# time ./hl7test --mllp_host localhost
Sending test message_id=3216598 message_time=201411130917 patient_class=MYFACSOMPL patient_location= patient_id=12345678
Sending 10000 messages to localhost:20000

real    0m13.963s
user    0m0.018s
sys 0m0.042s

So one of these pairs should be able to run ~66m messages per day...

wokket commented 3 years ago

Very Nice! That's doing actual synchronous MLLP? If I'm seeing that right it looks I/O dominated (network/MQ)?

sempervictus commented 3 years ago

Thats the asynchronous stuff, and yeah, the MQ commit times are much longer than our processing phase. I'll eventually have the work using this code instrumented out for Prometheus so we'll have some purdy pictures to show off what we've done here.