weltling / parle

Parser and lexer for PHP

Idea: support for serialization ? #5

Open remicollet opened 6 years ago

remicollet commented 6 years ago

For now

<?php
// create $lex ... which may take tons of lines
$ser = serialize($lex);

$lex = unserialize($ser);
$lex->consume($in);

Results in PHP Fatal error: Uncaught Parle\LexerException: Lexer state machine is not ready in /tmp/foo.php:41

Indeed, serialization results in something obviously wrong: "O:11:"Parle\Lexer":0:{}"

If it is too much work to support, perhaps it would be better to declare the classes as not serializable? Perhaps "saveToJson" and "loadFromJson" methods could be better (you know... serialization...)

Perhaps, just a bad idea ;)

weltling commented 6 years ago

There is serialization support for lexertl, see https://github.com/BenHanson/lexertl14/blob/master/lexertl/serialise.hpp, however it depends on Boost and likely has its own specific format. For the parser there's no such code ATM, as far as I can see. I think parsertl could support serialization with Boost, too. This method would require using Boost for serializing the PHP internal stuff as well, which might not be that suitable for PHP users.

I'm currently working on the documentation, so let's see how well the current PHP API suffices. Therefore I was rather deferring further internal integration until the API is stable. I think JSON serialization could be made possible once the PHP API is established and at least in beta. Perhaps I should mark both as not serializable for now and keep the issue open. Of course, the goal of having things serialized is a worthwhile one.

Thanks.

BenHanson commented 3 years ago

If I can get a C++20/constexpr version of the libraries going, then a parser to read the serialisation format can be built at compile time, which would then justify loading the saved data instead of just building it again.

weltling commented 1 year ago

@BenHanson I think there's a need to clarify the matter a bit. PHP uses an ASCII based serialization format. The point here is of course about PHP's own items (variables, objects, constants, etc.) versus the lexertl/parsertl objects that Parle\{Parser,Lexer} instances carry inside.

I guess the question with regard to lexertl/parsertl would not be about raising the C++ version requirement, but whether serialization can be a feature of these libraries themselves. Perhaps it could be possible to serialize/unserialize using some other non-binary format? Perhaps there could be some other option? A portable binary format could be an option, too. Given that PHP's own serialization format is not binary, it is portable, so that's a good point, too.

With serialization supported, the opportunities to save/share/exchange parsers and lexers will of course improve. This would seem an advantage to lexertl/parsertl as well as to any consumers. On the PHP side, even with a binary parser dump, embedding, say, a base64 encoded blob into the PHP serialization format is thinkable. Without having to define any C++/PHP code, one would just unserialize and be able to operate on that.

Thanks

BenHanson commented 1 year ago

@weltling In fact I could just output the table numbers as ASCII and use C++ streams to stream them back in etc.

In the past it always seemed kind of pointless to me, but it's not hard to do so why not if people want it?
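The approach of writing the table numbers as ASCII and streaming them back in could look roughly like the sketch below. The `table` alias and the `save_table`/`load_table` helpers are hypothetical names chosen for illustration; they do not reflect the actual lexertl/parsertl data layout or API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a state machine table; not the real lexertl layout.
using table = std::vector<std::uint32_t>;

// Writes the element count followed by each number, separated by spaces.
void save_table(std::ostream& os, const table& t)
{
    os << t.size();
    for (std::uint32_t n : t) os << ' ' << n;
    os << '\n';
}

// Reads back what save_table() wrote; throws if the text is truncated or malformed.
table load_table(std::istream& is)
{
    std::size_t count = 0;
    if (!(is >> count)) throw std::runtime_error("missing element count");

    table t;
    // No reserve(count): a bogus count must not trigger a huge allocation.
    for (std::size_t i = 0; i < count; ++i)
    {
        std::uint32_t n = 0;
        if (!(is >> n)) throw std::runtime_error("truncated or malformed table");
        t.push_back(n);
    }
    return t;
}

// Convenience wrappers for going through a plain string.
std::string to_text(const table& t)
{
    std::ostringstream os;
    save_table(os, t);
    return os.str();
}

table from_text(const std::string& text)
{
    std::istringstream is(text);
    return load_table(is);
}

// Usage sketch: a round trip should reproduce the original numbers.
void round_trip_demo()
{
    const table original{3, 0, 42, 7};
    assert(from_text(to_text(original)) == original);
}
```

Since the result is plain text, it could be placed into a PHP serialized string as is, without base64 encoding.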

weltling commented 1 year ago

Yep, that could be an approach. Something to keep in mind is also the security component, as the serialized string can potentially be manipulated.

Thanks

BenHanson commented 1 year ago

I have written the C++ for serialisation and will publish it soon. If security is a concern it may be better to store the serialised text directly in the application rather than as an external file?

weltling commented 1 year ago

The idea with the serialization is exactly about having the data saved outside the app. For example:


$f = "/path/to/dump.txt";
if (file_exists($f)) {
    $parser = Parser::from(file_get_contents($f));
} else {
    $parser = new Parser;
    .........
}
file_put_contents($f, serialize($parser));

There's a big red warning block about untrusted sources here:

https://www.php.net/manual/en/function.unserialize.php

but there's no real way to control where the data travels. Say, the same data can be passed over the network, saved in a DB, whatever. So I'm just mentioning it, as it's usual practice to make sure the data read in is actually valid and won't, say, crash the app. That would then concern both the C++ lib and the PHP side, too.

Thanks

BenHanson commented 1 year ago

As the state machine is just a bunch of numbers, JSON won't really help here.

As mentioned in the link you provided, adding a hash may help. I can probably use a hash function from the standard library, although that will lock a serialisation to a particular compiler.
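To illustrate the compiler lock-in concern: the values produced by `std::hash` are implementation-defined, so a checksum computed with it may not match across standard libraries. One way around that, shown here only as a sketch, is a fixed algorithm such as 64-bit FNV-1a, whose constants are part of the algorithm itself. FNV-1a is a substitute picked for illustration, not something proposed in this thread, and the function names are hypothetical. It is not cryptographic: it can catch accidental corruption, but not deliberate tampering.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// 64-bit FNV-1a over the bytes of the serialised text.
// The constants are fixed by the algorithm, so the result does not depend
// on the compiler or standard library, unlike std::hash.
std::uint64_t fnv1a_64(const std::string& text)
{
    std::uint64_t hash = 14695981039346656037ULL; // FNV offset basis
    for (unsigned char byte : text)
    {
        hash ^= byte;
        hash *= 1099511628211ULL; // FNV prime; unsigned overflow wraps modulo 2^64
    }
    return hash;
}

// True if the checksum stored next to the text matches the text itself.
bool checksum_matches(const std::string& text, std::uint64_t stored)
{
    return fnv1a_64(text) == stored;
}

// Usage sketch: an empty input hashes to the offset basis.
void checksum_demo()
{
    assert(fnv1a_64("") == 14695981039346656037ULL);
}
```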

weltling commented 1 year ago

Yep, JSON is of no use, but I guess a hash would not be of much use either. Once the list of numbers is outside, the hash can be manipulated the same way the numbers are. The number list itself is a good enough approach in the first place, IMO; it allows for plain operation on the underlying data.

Perhaps some sanity check would do the job of ensuring the numbers are valid? At this point I'm not familiar enough with the parsertl and lexertl internals to suggest something more concrete :/ For example, checking whether an imported ID refers to an existing item, etc.?
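A sanity check of the kind described, verifying that every imported ID refers to an existing item, might be sketched as follows. The `dfa_table` layout (a flat row-major transition table with a `no_transition` sentinel) is purely an assumption for illustration; the real lexertl/parsertl internals may be organised differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical layout: state_count rows of symbol_count cells, stored row-major.
// Each cell holds the index of the next state, or no_transition.
struct dfa_table
{
    std::size_t state_count;
    std::size_t symbol_count;
    std::vector<std::uint32_t> next;
};

constexpr std::uint32_t no_transition = std::numeric_limits<std::uint32_t>::max();

// Returns true only if the dimensions are consistent and every cell
// refers to a state that actually exists.
bool is_sane(const dfa_table& t)
{
    if (t.state_count == 0 || t.symbol_count == 0) return false;

    // Compare via division so that state_count * symbol_count cannot overflow.
    if (t.next.size() % t.symbol_count != 0 ||
        t.next.size() / t.symbol_count != t.state_count)
    {
        return false;
    }

    for (std::uint32_t target : t.next)
    {
        if (target != no_transition && target >= t.state_count) return false;
    }
    return true;
}

// Usage sketch: a cell pointing at state 5 in a two-state table is rejected.
void sanity_demo()
{
    const dfa_table tampered{2, 2, {1, 5, 0, 1}};
    assert(!is_sane(tampered));
}
```

Running such a check right after loading, and before the table is used, would turn a manipulated dump into a clean error rather than an out-of-range access.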

Thanks