[go: up one dir, main page]

  1. 55
    All you need is PostgreSQL databases performance programming ebellani.github.io
  1.  

    1. 11

      …and ClickHouse.

      1. 4

        Yeah. ClickHouse has turned into the thing I reach for when I need OLAP.

        1. 2

          only if you need ad hoc analytical queries. And even then, do try duckdb first.

          1. 1

            Makes sense. Why do you prefer DuckDB over ClickHouse?

            1. 2

              DuckDB is based on top notch research from the folks behind cedardb (and others). See https://duckdb.org/why_duckdb#peer-reviewed-papers-and-thesis-works

              clickhouse seems basically to have grown out of the performance needs of a company, and it shows.

              1. 1

                Fair point.

                and it shows

                Do you know concrete points I should be aware of before making my bet?

                1. 1

                  clickhouse was not acid for a long while, basically returning control before ensuring durability. Transactions are still experimental. I'd not trust it for anything in production

                  https://medium.com/@pranavmehta94/clickhouse-fsync-what-the-history-tells-us-61a370ec8a2d https://clickhouse.com/docs/guides/developer/transactional

                  1. 1

                    anything in production

                    What counts as production seems to vary a lot in this thread.

                    This is enough production for me: https://www.google.com/search?q=clickhouse+site%3Ahttps%3A%2F%2Fblog.cloudflare.com%2F&ie=UTF-8&oe=UTF-8&hl=en-us&client=safari

                    Thanks for sharing the fsync blog. I might be reading it wrong though.

                    In other words, fsync was never intended to provide comprehensive durability guarantees—it addresses only a narrow class of failures. More importantly, ClickHouse successfully operated for over a decade without it. Its durability model has always relied on replication and distributed architecture, not synchronous disk flushes.

                    1. 1

                      just look at CH transaction page and count the caveats. If you care about consistency, I'd advise to stay away from that. YMMV

          2. 2

            Why so? I know nothing about it.

            1. 3

              I use Clickhouse at work. It's very nice for handling large amount of data (i have hundred of billions of lines and i can query them in minutes to find patterns on hardware from 2012), but what i really like are :

              • TTLs (you can auto expire rows, columns, in a dynamic way)
              • Materialized view, both refreshable and classic  (you can transform data automatically, and deserialize json / do join when some data is inserted in a given table, and the data is appended to another table)
              • The fact you can query URLs (e.g. csv files) and populate a table on a schedule
              • Compression you can change, and choose the best compression for the task at hand
              1. 1

                Very true. I'd also add native support for clustering and/or sharding on all levels: SQL keywords, built-in efficient consensus, table layout.

            2. 2

              Technically you also have TimescaleDB?

            3. 5

              stored_procedures_everywhere.gif

              On the serious note, I expected to learn how Postgres' rich type set allows to store almost anything, or how its locking primitives remove the need in external lock system coordination, or how different isolation levels cover for a spectrum of concurrency heavy apps, or how to design and tune a table to allow cheap write heavy storage, or how Postgres CDC plugins enable transparent heterogeneous replication, or how embedded scripting allows to pull the client-server app inside out where latency is critical, or how… stop-token.

              1. 3

                There's only one correct isolation level.

                1. 2

                  I feel you. In the everyday apps where sound changes are top priority this is true. But not every problem is a money transfer. A simple example would be multiplayer games with single row stat updates without blocking range slightly off analytics.

              2. 1

                That OLAP approach is pretty interesting. Basically you materialize the ledger sums the application needs ahead of time with triggers. One nice thing about an OLAP database (or something like DuckDB querying parquets) is you can run whatever stats you want on your data columns, relatively quickly, without needing to anticipate what intermediate data you'll need to compute it. This flexibility is valuable for business reporting and other analytical purposes.

                1. 3

                  ad hoc analytical queries is one of the few places where I advise people to reach for something else. And that would be DuckDB.

                  1. 2

                    Yep. OLAP is also giving better compression because of the on-disk table layout through per column compression codecs optimized for similar data being packed in long continuous vectors.