y-crdt / y-crdt

Rust port of Yjs
https://docs.rs/yrs/

JSON serialization #312

Open TeemuKoivisto opened 11 months ago

TeemuKoivisto commented 11 months ago

Hey, so I have a question about the JSON serialization of docs. As of now, I believe this is how you do it:

fn to_json(doc: lib0::any::Any) -> String {
    let mut buf = String::new();
    doc.to_json(&mut buf);
    buf
}

Yet this produces a non-JSON string: <paragraph>first paragraph</paragraph><paragraph>second</paragraph>

Yes it is more compact but how can I get an actual JSON string? Do I need to write my own custom serializer? I am working with ProseMirror documents so it'd have to conform to their structure.

Also and this is kinda related, you can not serialize the top level yrs::doc::Doc right? Or at least for me it panics with 'Cannot convert to value - unsupported type ref: 15', /Users/teemu/.cargo/registry/src/index.crates.io-6f17d22bba15001f/yrs-0.16.10/src/types/mod.rs:276:22

Instead I have to do this:

pub fn to_json(&self) -> Any {
    let txn = self.doc.transact();
    let xml = txn.get_xml_element("prosemirror").unwrap();
    let yxml = Value::YXmlFragment(xml.into());
    let json = yxml.to_json(&txn);
    json
}

paulgb commented 11 months ago

Here’s some code that will emit a JSON array that looks like this:

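[{"text":"the "},{"attrs":{"bold":true},"text":"quick "},{"text":"brown fox "},{"attrs":{"italic":true},"text":"jumps"},{"text":" over the lazy dog"}]

// tx is a transaction. Assumes `std::collections::HashMap` and `serde_json::json`
// are imported, along with `Value`, `Any` and `YChange` from yrs/lib0.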
let text = tx.get_xml_text("prosemirror").unwrap();
let mut segments = Vec::new();
for t in text.diff(tx, YChange::identity) {
    let Value::Any(Any::String(str)) = t.insert else {
        continue;
    };
    let attrs: HashMap<String, serde_json::Value> = t.attributes.unwrap_or_default().iter().map(|(k, v)| {
        match v {
            // convert an "Any" into a serde_json::Value. This is not exhaustive.
            Any::String(s) => {
                (k.to_string(), serde_json::Value::String(s.to_string()))
            }
            Any::Bool(b) => {
                (k.to_string(), serde_json::Value::Bool(*b))
            }
            _ => {
                panic!("type not handled {:?}", v)
            }
        }
    }).collect();

    // Only emit "attrs" when the segment actually has formatting attributes.
    if !attrs.is_empty() {
        segments.push(json!({
            "text": str,
            "attrs": attrs,
        }))
    } else {
        segments.push(json!({
            "text": str,
        }))
    }
}

println!("{}", serde_json::to_string(&segments).unwrap());

Also and this is kinda related, you can not serialize the top level yrs::doc::Doc right?

Until you have called get_xml_element on the Doc, the doc doesn't really know that the value at key prosemirror is an XML element instead of an array. Internally, yrs uses the value 15 to represent an unknown type.

It is sometimes possible to use a heuristic to "guess" the type at a certain key, but there is currently no way to iterate over keys without causing the panic when an unknown key is encountered, so it's impossible to serialize the top-level doc unless you know all of the keys in advance. This is fragile in production because any client can add a key and cause a panic in your program. My PR #307 exposes a way to iterate over the root keys without the possible panic.

Horusiath commented 11 months ago

@TeemuKoivisto your example is missing details: Any::to_json adds quotation marks in the string case. Your error message is not related to it.

The error message is related to the unknown type of the root-level collection, which usually happens when you create an empty document and apply a remote update to it without initializing its root level collections first, which is generally an antipattern.

Since the update doesn't contain information about the specific type of the root level collections (as they are dynamic and should be initialized on all clients to begin with), their type starts as unknown. It will be overridden when the root level type is properly initialized (i.e. by using Doc::get_or_insert_xml_element). However, it cannot be overridden by a read transaction operation, which is what you're using.

Another matter: since the key "prosemirror" suggests you are using Yjs with the ProseMirror plugin, IIRC you probably want to initialize it via doc.get_or_insert_xml_fragment("prosemirror"). 90% of the time, using XmlElementRef and XmlTextRef as root level collections is a sign of a mistake.
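
To make the unknown-type behaviour described above concrete, here is a toy model in plain Rust. It is only a sketch of the idea and not how yrs is implemented: ToyDoc, RootType, apply_remote_update and peek are hypothetical names made up for illustration. The one thing it mirrors from the discussion is that a remote update leaves the root's type unknown, a write-style get_or_insert call overrides it, and a read-only lookup cannot.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a root-level type tag; not the yrs type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RootType {
    Unknown,
    XmlFragment,
}

/// Toy document that only tracks which type each root key has.
struct ToyDoc {
    roots: HashMap<String, RootType>,
}

impl ToyDoc {
    fn new() -> Self {
        ToyDoc { roots: HashMap::new() }
    }

    /// A remote update names the key but carries no type information,
    /// so a root that was never initialized locally starts as Unknown.
    fn apply_remote_update(&mut self, key: &str) {
        self.roots.entry(key.to_string()).or_insert(RootType::Unknown);
    }

    /// Write access: creates the root or overrides an Unknown type.
    fn get_or_insert_xml_fragment(&mut self, key: &str) -> RootType {
        self.roots.insert(key.to_string(), RootType::XmlFragment);
        RootType::XmlFragment
    }

    /// Read-only access: reports what is stored and cannot fix the type.
    fn peek(&self, key: &str) -> Option<RootType> {
        self.roots.get(key).copied()
    }
}
```

In this model, calling apply_remote_update("prosemirror") on a fresh ToyDoc and then peek("prosemirror") yields Some(RootType::Unknown); only after get_or_insert_xml_fragment("prosemirror") does the lookup report an XmlFragment.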

TeemuKoivisto commented 11 months ago

Thank you @paulgb @Horusiath for the helpful advice!

I did immediately rewrite my function to this:

    pub fn to_json(&self) -> Any {
        let txn = self.doc.transact();
        if let Some(xml) = txn.get_xml_element("prosemirror") {
            let yxml = Value::YXmlFragment(xml.into());
            yxml.to_json(&txn)
        } else {
            Value::default().to_json(&txn)
        }
    }

This was because .get_xml_element kept panicking.

The way I see it, with doc.to_json it's perfectly valid to create a document without content. Might be an anti-pattern but so it goes. Also, as I initialize the "prosemirror" fragment on the client side, there's an in-between state where the doc stays empty until the changes from the client come in. Thus any call to .to_json will lead to panicking.

IMO it'd be nicer to return an empty value on such doc instead since the doc is valid - just empty. Or if the top doc is not meant to be serialized directly, just remove the to_json to not even allow making such mistakes.

As for the ProseMirror integration: with NodeJS backends I haven't had to initialize the XmlFragment myself on the server side before, which is why this came as a bit of a surprise. I think they also have an intermediary state without content, but it is quickly updated by the clients.

I wouldn't necessarily want to enforce it manually on the server, since if the root collection were dynamic it'd just cause headaches. But I suppose the bigger problem is panicking while it's empty.

And I see about the JSON representation. String is valid JSON, sure, but it's quite tricky to serialize every bit of metadata into XML and then transform it back. Immediate problem I see is that the string doesn't follow the original HTML form - you can't parse it directly with ProseMirror anymore. Also the top node type is lost (as was with the JS version). So I believe the answer to my question is that yes, you have to write your own serializer to output ProseMirror compatible JSON?

Horusiath commented 11 months ago

IIRC the y-prosemirror plugin uses a combination of XmlFragment.toArray and then manually serialising XmlElement's children and attributes, while using XmlText.toDelta for text. In yrs, the equivalent of that method is Text::diff (it works on XmlTextRefs as well).

TeemuKoivisto commented 11 months ago

Yeah I did come up with this:

pub fn doc_to_pm_json(doc: &Doc, xml: XmlFragmentRef) -> serde_json::Value {
    let txn = doc.transact();
    let mut content = Vec::new();
    fragment_to_json(&mut content, &txn, xml);
    json!({
        "type": "doc",
        "content": content
    })
}

fn fragment_to_json(
    content: &mut Vec<serde_json::Value>,
    txn: &Transaction<'_>,
    xml: yrs::XmlFragmentRef,
) {
    for idx in 0..xml.len(txn) {
        node_to_json(content, txn, xml.get(txn, idx));
    }
}

fn node_to_json(
    content: &mut Vec<serde_json::Value>,
    txn: &Transaction<'_>,
    node: Option<XmlNode>,
) {
    if let Some(xml) = node {
        match xml {
            yrs::XmlNode::Element(el) => {
                let mut children = Vec::new();
                for idx in 0..el.len(txn) {
                    node_to_json(&mut children, txn, el.get(txn, idx));
                }
                content.push(json!({
                  "type": el.tag(),
                  "content": children
                }));
            }
            yrs::XmlNode::Fragment(fr) => {
                fragment_to_json(content, txn, fr);
            }
            yrs::XmlNode::Text(text) => {
                text_to_json(content, txn, text);
            }
        }
    }
}

fn map_to_json<T>(map: &Box<HashMap<T, Any>>) -> serde_json::Map<String, serde_json::Value>
where
    T: ToString,
{
    map.iter()
        .map(|(k, v)| match v {
            Any::String(s) => (k.to_string(), serde_json::Value::String(s.to_string())),
            Any::Bool(b) => (k.to_string(), serde_json::Value::Bool(*b)),
            Any::Null => todo!(),
            Any::Undefined => todo!(),
            Any::Number(_) => todo!(),
            Any::BigInt(_) => todo!(),
            Any::Buffer(_) => todo!(),
            Any::Array(_) => todo!(),
            Any::Map(m) => (k.to_string(), serde_json::Value::Object(map_to_json(m))),
        })
        .collect()
}

fn text_to_json(content: &mut Vec<serde_json::Value>, txn: &Transaction<'_>, text: XmlTextRef) {
    for t in text.diff(txn, YChange::identity) {
        let Value::Any(Any::String(str)) = t.insert else {
            continue;
        };
        let attrs = map_to_json(&t.attributes.unwrap_or_default());
        let marks: Vec<serde_json::Value> = attrs
            .iter()
            .map(|(k, v)| {
                let empty = match v {
                    serde_json::Value::Null => true,
                    serde_json::Value::Bool(_) => false,
                    serde_json::Value::Number(_) => false,
                    serde_json::Value::String(v) => v.len() == 0,
                    serde_json::Value::Array(v) => v.len() == 0,
                    serde_json::Value::Object(v) => v.len() == 0,
                };
                if empty {
                    json!({
                        "type": k,
                    })
                } else {
                    json!({
                        "type": k,
                        "attrs": v
                    })
                }
            })
            .collect();
        if attrs.len() > 0 {
            content.push(json!({
                "type": "text",
                "text": str,
                "marks": marks
            }))
        } else {
            content.push(json!({
                "type": "text",
                "text": str,
            }))
        };
    }
}

I'll have to check whether those todo! branches will panic. The result turns into a string with serde_json::to_value(&doc).unwrap() and it looks correct so far.
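
On the todo! question: todo!() panics whenever the branch is actually reached, so any attribute holding a number, array, null and so on would bring the conversion down. Below is a minimal sketch of a total conversion that covers every variant. To keep it self-contained it uses a local stand-in enum that mirrors the variants matched in map_to_json above, not the real lib0::any::Any, and it writes the JSON text by hand instead of going through serde_json. The mapping choices are assumptions made for illustration: Undefined and non-finite numbers become null, a Buffer becomes an array of byte values, and a BTreeMap is used so key order is deterministic.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in mirroring the variants matched in map_to_json;
/// the real type lives in lib0 and uses boxed payloads.
#[derive(Debug, Clone, PartialEq)]
enum Any {
    Null,
    Undefined,
    Bool(bool),
    Number(f64),
    BigInt(i64),
    String(String),
    Buffer(Vec<u8>),
    Array(Vec<Any>),
    Map(BTreeMap<String, Any>),
}

/// Writes `s` as a quoted JSON string, escaping quotes, backslashes
/// and control characters.
fn escape_into(out: &mut String, s: &str) {
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
}

/// Appends the JSON form of `value` to `out`; every variant is handled,
/// so there is no branch left that can panic.
fn write_json(out: &mut String, value: &Any) {
    match value {
        // Assumption: Undefined has no JSON form, so it is written as null.
        Any::Null | Any::Undefined => out.push_str("null"),
        Any::Bool(b) => out.push_str(if *b { "true" } else { "false" }),
        Any::Number(n) if n.is_finite() => out.push_str(&n.to_string()),
        // NaN and infinities are not valid JSON numbers.
        Any::Number(_) => out.push_str("null"),
        Any::BigInt(i) => out.push_str(&i.to_string()),
        Any::String(s) => escape_into(out, s),
        // Assumption: binary data is emitted as an array of byte values.
        Any::Buffer(bytes) => {
            out.push('[');
            for (i, b) in bytes.iter().enumerate() {
                if i > 0 {
                    out.push(',');
                }
                out.push_str(&b.to_string());
            }
            out.push(']');
        }
        Any::Array(items) => {
            out.push('[');
            for (i, item) in items.iter().enumerate() {
                if i > 0 {
                    out.push(',');
                }
                write_json(out, item);
            }
            out.push(']');
        }
        Any::Map(map) => {
            out.push('{');
            for (i, (k, v)) in map.iter().enumerate() {
                if i > 0 {
                    out.push(',');
                }
                escape_into(out, k);
                out.push(':');
                write_json(out, v);
            }
            out.push('}');
        }
    }
}

/// Convenience wrapper returning the JSON text as a new String.
fn to_json_string(value: &Any) -> String {
    let mut out = String::new();
    write_json(&mut out, value);
    out
}
```

The same shape should carry over to the real Any by swapping the stand-in enum for the lib0 type and adjusting for its boxed payloads, though I haven't checked that against the crate.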

Is this something that could be usable as part of some crate? I'm probably not going to be the only one who needs this.

Horusiath commented 11 months ago

@TeemuKoivisto maybe this should be the first question: what do you need this for?

TeemuKoivisto commented 11 months ago

The same reason people use yXmlFragmentToProsemirrorJSON. I was thinking I could store the document in this form as the final materialized version without having to worry about parsing it. I'd keep the editing history as a separate binary blob elsewhere. We'll see.

But sure, we can see and wait whether people need it or not. It's not much code, I'll probably refactor it to use raw strings directly.