yegor256 / factbase

In-memory database of facts (records with attributes) with a predicative searching facility
https://rubygems.org/gems/factbase
MIT License
5 stars 2 forks source link

Security concern: It's very dangerous to use `Marshal.load` with untrusted data. #106

Open tank-bohr opened 1 month ago

tank-bohr commented 1 month ago

See security recommendations for details.

It's used in baza. As far as I understand the code, the data was received from an HTTP request, which makes the data untrusted.

https://github.com/yegor256/factbase/blob/51d357c5d3c7411191a43c0184e0f43adba53e78/lib/factbase.rb#L218

yegor256 commented 1 month ago

@tank-bohr that's a reasonable concern. Let's use some other format for serialization. Definitely not JSON or YAML since they are very verbose.

tank-bohr commented 1 month ago

From the top of my head

But I would consider JSON, after all. All binary formats require libraries with C extensions, and they often work unreliably. I had experience with protobufs in Ruby, and we observed random crashes on the C extension level.

yegor256 commented 1 month ago

@tank-bohr JSON is very verbose, if you ask me. Our files will be much larger than now. Maybe it's not a big deal, since we are planning to ZIP them, but still...

0crat commented 1 month ago

@tank-bohr Thanks for reporting a new bug! You've earned +15 points. By reporting bugs, you help our project improve its quality. If you find anything else in the repository that doesn't look as good as you might expect, do not hesitate to report it.

tank-bohr commented 4 weeks ago

one more candidate: parquet

Ruby library: https://github.com/apache/arrow/tree/main/ruby/red-parquet

yegor256 commented 4 weeks ago

@tank-bohr maybe we can invent our own format? The structure of the data that we save/load is very primitive. Shouldn't be hard to serialize/deserialize, without any C-level code.

tank-bohr commented 4 weeks ago

@yegor256 I'm not a big fan of inventing new formats. The format is usually closely related to speed (see #6). Parquet, for example, is optimized for data access and search.

I have a crazy idea. What if we migrate to the SQLite format? It will solve two problems with one shot:

yegor256 commented 4 weeks ago

@tank-bohr it's not so crazy) We can use SQLite, but how will we convert our query language to SQL? What will be the design of tables? How we will organize indices?

tank-bohr commented 4 weeks ago

@yegor256 I mean using SQLite as storage, not changing the queries. Then, the tables could be

string facts

id key value metadata
84b34a87-42c2-444f-837d-7bc0ad58dab0 repo factbase
84b34a87-42c2-444f-837d-7bc0ad58dab0 owner yegor256
84b34a87-42c2-444f-837d-7bc0ad58dab0 type issue-was-created
... ... ... ...

integer facts

id key value metadata
84b34a87-42c2-444f-837d-7bc0ad58dab0 number 42
... ... ... ...

For each supported datatype, we have to have

Indices could be

Then (eq repo factbase) will be transformed under the hood to the following SQL-query

SELECT id
  FROM fact_strings
 WHERE key   = 'repo'
   AND value = 'factbase'
yegor256 commented 4 weeks ago

@tank-bohr I would go with this structure:

fact property string integer float
10 foo NULL 42 NULL
10 bar "hello" NULL NULL
10 bar "test" NULL NULL
11 xyz NULL NULL 3.14

This would mean this factbase:

[
  { foo: 42, bar: ["hello", "test"] }
  { xyz: 3.14 }
]

WDYT?

tank-bohr commented 4 weeks ago

I have two concerns:

But it could work for sure.