Security concern: It's very dangerous to use `Marshal.load` with untrusted data.

yegor256 / factbase

In-memory database of facts (records with attributes) with a predicative searching facility

https://rubygems.org/gems/factbase

MIT License

5 stars 2 forks source link

Security concern: It's very dangerous to use `Marshal.load` with untrusted data. #106

Open tank-bohr opened 1 month ago

tank-bohr commented 1 month ago

See security recommendations for details.

It's used in baza. As far as I understand the code, the data was received from an HTTP request, which makes the data untrusted.

https://github.com/yegor256/factbase/blob/51d357c5d3c7411191a43c0184e0f43adba53e78/lib/factbase.rb#L218

yegor256 commented 1 month ago

@tank-bohr that's a reasonable concern. Let's use some other format for serialization. Definitely not JSON or YAML since they are very verbose.

tank-bohr commented 1 month ago

From the top of my head

But I would consider JSON, after all. All binary formats require libraries with C extensions, and they often work unreliably. I had experience with protobufs in Ruby, and we observed random crashes on the C extension level.

yegor256 commented 1 month ago

@tank-bohr JSON is very verbose, if you ask me. Our files will be much larger than now. Maybe it's not a big deal, since we are planning to ZIP them, but still...

0crat commented 1 month ago

@tank-bohr Thanks for reporting a new bug! You've earned +15 points. By reporting bugs, you help our project improve its quality. If you find anything else in the repository that doesn't look as good as you might expect, do not hesitate to report it.

tank-bohr commented 4 weeks ago

one more candidate: parquet

Ruby library: https://github.com/apache/arrow/tree/main/ruby/red-parquet

yegor256 commented 4 weeks ago

@tank-bohr maybe we can invent our own format? The structure of the data that we save/load is very primitive. Shouldn't be hard to serialize/deserialize, without any C-level code.

tank-bohr commented 4 weeks ago

@yegor256 I'm not a big fan of inventing new formats. The format is usually closely related to speed (see #6). Parquet, for example, is optimized for data access and search.

I have a crazy idea. What if we migrate to the SQLite format? It will solve two problems with one shot:

safe file format
fast data access because of indices

yegor256 commented 4 weeks ago

@tank-bohr it's not so crazy) We can use SQLite, but how will we convert our query language to SQL? What will be the design of tables? How we will organize indices?

tank-bohr commented 4 weeks ago

@yegor256 I mean using SQLite as storage, not changing the queries. Then, the tables could be

string facts

id	key	value	metadata
84b34a87-42c2-444f-837d-7bc0ad58dab0	repo	factbase
84b34a87-42c2-444f-837d-7bc0ad58dab0	owner	yegor256
84b34a87-42c2-444f-837d-7bc0ad58dab0	type	issue-was-created
...	...	...	...

integer facts

id	key	value	metadata
84b34a87-42c2-444f-837d-7bc0ad58dab0	number	42
...	...	...	...

For each supported datatype, we have to have

a dedicated table (the most compact format, but requires multiple queries to get the map)
or a column in one wide, sparse table (value_int, value_string ...)
or serialize each value to the string and keep it's type

Indices could be

id to be able to fetch a complete map
key + value for requests like (eq repo factbase)

Then (eq repo factbase) will be transformed under the hood to the following SQL-query

SELECT id
  FROM fact_strings
 WHERE key   = 'repo'
   AND value = 'factbase'

yegor256 commented 4 weeks ago

@tank-bohr I would go with this structure:

fact	property	string	integer	float
10	foo	NULL	42	NULL
10	bar	"hello"	NULL	NULL
10	bar	"test"	NULL	NULL
11	xyz	NULL	NULL	3.14

This would mean this factbase:

[
  { foo: 42, bar: ["hello", "test"] }
  { xyz: 3.14 }
]

WDYT?

tank-bohr commented 4 weeks ago

I have two concerns:

This model allows inconsistency. For example, have two values (e.g., string and int) for one property. But we can give the guarantee that one and only one column will have a non-null value since all the APIs under our control
The table bloating. Most of the values in a record will be NULLs. Need to check how SQLite handles NULLs

But it could work for sure.