Insecurity and Python pickles


Posted Mar 15, 2024 8:43 UTC (Fri) by aragilar (subscriber, #122569)
In reply to: Insecurity and Python pickles by Wol
Parent article: Insecurity and Python pickles

I think we're using the same words to mean different things. The data I've worked with has come in two different forms:
* arrays of records (and collections of these arrays): generally having a db makes it easier and faster to do more complex queries over these vs multiple files (or a single file with multiple arrays), and formats designed for efficient use of "tabular" data (e.g. parquet) are better than random CSV/TSV.
* n-dimensional arrays: these represent images/cubes/higher moments of physical data (vs metadata), and so are different in kind from the arrays of records. This is where HDF5, netCDF, FITS (if you're doing observational astronomy) come in.
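(To make the first bullet concrete — a minimal, hypothetical sketch using Python's stdlib sqlite3; the table and column names are invented. The point is that once the records are in a db, an aggregation is one declarative query rather than a hand-rolled loop over multiple CSV/TSV files:)

```python
import sqlite3

# Hypothetical "array of records" loaded into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (run_id INTEGER, detector TEXT, flux REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?, ?)",
    [(1, "A", 0.5), (1, "B", 0.7), (2, "A", 0.6), (2, "B", 0.9)],
)

# One declarative query replaces per-file parsing and manual aggregation.
rows = conn.execute(
    "SELECT detector, AVG(flux) FROM obs GROUP BY detector ORDER BY detector"
).fetchall()
for detector, mean_flux in rows:
    print(detector, mean_flux)
conn.close()
```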

I think the data you're talking about is more graph-like, right (and feels like the kind of thing where you want to talk about the structure of how data is related)? That feels different in kind from both of the above, and so naturally tools designed for other types of data don't match?

My understanding of ML/AI data is that it generally gets pushed into one of the two bins above, but that may be a bias based on the data I encounter.



Insecurity and Python pickles

Posted Mar 15, 2024 9:52 UTC (Fri) by Wol (subscriber, #4433)

> I think we're using the same words to mean different things. The data I've worked with has come in two different forms:

No surprise ...

> * arrays of records (and collections of these arrays): generally having a db makes it easier and faster to do more complex queries over these vs multiple files (or a single file with multiple arrays), and formats designed for efficient use of "tabular" data (e.g. parquet) are better than random CSV/TSV.

So are your records one-dimensional? That makes your "arrays of records" two-dimensional - what I think of as your typical relational database table.

And what do you mean by "a complex query"? In MV that doesn't make sense. Everything that makes SQL complicated belongs in an MV Schema - joins, case, calculations, etc etc. Pretty much ALL MV queries boil down to the equivalent of "select * from table".

> * n-dimensional arrays: these represent images/cubes/higher moments of physical data (vs metadata), and so are different in kind from the arrays of records. This is where HDF5, netCDF, FITS (if you're doing observational astronomy) come in.

And if n=2? That's just your standard relational database aiui.

It's strange you mention astronomy. Ages back there was a shoot-out between Oracle and Cache (not sure whether it was Cache/MV). The acceptance criterion was to hit 100K inserts/hr or whatever - I don't have a feel for these speeds, I'm generally limited by the speed people can type. Oracle had to struggle to hit the target - all sorts of optimisations like disabling indices on insert and running an update later etc etc. Cache won, went into production, and breezed through 250K within weeks ...

> I think the data you're talking about is more graph-like right (and feels like the kind of thing where you want to talk about the structure of how data is related)? That feels different in kind to both the above, and so naturally tools designed for other types of data don't match?

Graph-like? I'm not a visual person so I don't understand what you mean (and my degree is Chemistry/Medicine).

To me, I have RECORDs - which are the n-dimensional 4NF representation of an object, and the equivalent of a relational row!

I then have FILEs, which are a set of RECORDs, and the equivalent of a relational table.

All the metadata your relational business analyst shoves in the data, I shove in the schema.

With the result that all the complexities of a SQL query, and all the multiple repetitions across multiple queries, just disappear because they're in the schema! (And with a simple translation layer defined in the schema, I can run SQL over my FILEs.)
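(A very rough, hypothetical sketch of that idea in Python - not any real MV product's API, and all field names are invented. The "schema" carries the derivations, including a computed field akin to an MV dictionary item, so the query itself stays trivial:)

```python
# Hypothetical sketch: derivations live in the schema, not in each query.
SCHEMA = {
    "NAME":  lambda rec: rec["raw"][0],
    "QTY":   lambda rec: rec["raw"][1],
    "PRICE": lambda rec: rec["raw"][2],
    # A computed ("virtual") field, defined once in the schema:
    "TOTAL": lambda rec: rec["raw"][1] * rec["raw"][2],
}

FILE = [  # the FILE: a set of RECORDs
    {"raw": ["widget", 3, 2.5]},
    {"raw": ["gadget", 2, 4.0]},
]

def select_all(file, schema):
    """The moral equivalent of 'select * from table': no joins or
    calculations appear in the query -- the schema supplies them."""
    return [{name: fn(rec) for name, fn in schema.items()} for rec in file]

for row in select_all(FILE, SCHEMA):
    print(row)
```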

I had cause to look up the "definition" of NoSQL recently. Version 1 was the name of a particular database. Version 2 was the use I make of it - defined by the MV crowd, "Not only SQL" - databases that can be queried by SQL but it's not their native language (in MV's case because it predates relational). Version 3 is the common one now, web stuff like JSON that doesn't really have a schema, and what there is is embedded with the data.

So I understand talking past each other with the same language is easy.

But all I have to do is define N as two (if my records just naturally happen to be 1NF), and I've got all the speed and power of your HDF5-whatever, operating like a relational database. But I don't have the complexity, because 90% of SQL has been moved into the database schema.

Cheers,
Wol


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds