
Show HN: Dassana. JSON-native, schema-less logging solution built atop ClickHouse
25 points by gauravphoenix on April 21, 2022 | 15 comments
Hello HN, I’m Gaurav, founder & CEO of Dassana. We are coming out of stealth today and would like to invite the community to give us a try: https://lake.dassana.io/

First, a bit of backstory. I grew up with grep to search log files; I'm the kind of person whose grep was aliased to grep -i. Then Splunk came along. It was a game-changer. For every single start-up I founded (there are a few) I used Splunk, and quite often we would run out of our ingestion quota. SumoLogic wasn’t cheaper either, so we looked into DataDog. It was good until we started running into issues with aggregate queries (facets etc.), rehydration took forever, and the overall query experience was not fun (it wasn’t fun with Splunk or SumoLogic either).

All these experiences over the last two decades led me to wish for a simple solution where I could just throw a bunch of JSON/CSV data at it and query it with simple SQL. These days most logs are structured to begin with, and the complexity of parsing logs to extract fields has moved to log shippers such as fluentd, logstash, etc.

Enter HackerNews and ClickHouse.

I first learned about ClickHouse from HackerNews and was completely floored by its performance. Given its speed and the storage savings of columnar storage, it was an obvious choice to build a logging solution on top of. As we started doing a POC with it, it became obvious that it would be a perfect fit for us if we could solve the problem of schema management. That’s what we have been working on over the last six months or so. We designed a storage scheme that flattens JSON objects, and we expose an SQL interface that takes a SQL query and converts it into a query against our schemaless table.

Being JSON native, we allow querying specific JSON objects inside arrays. This is something that is not possible with many logging vendors, and if you use something like Athena, good luck figuring out the query: it is possible but quite complicated. Here is a sample query: select count(distinct eventName) from aws_cloudtrail where awsRegion = 'us-east-1'
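For contrast, here is roughly what this looks like in Athena/Presto when events arrive wrapped in a JSON array. The table and column here are hypothetical (this assumes a records column of type array(json)), but the amount of ceremony is the point:

  -- Presto/Athena: unnest the array, then pull fields out of each
  -- element with JSON-path functions.
  SELECT count(DISTINCT json_extract_scalar(r, '$.eventName'))
  FROM cloudtrail_logs
  CROSS JOIN UNNEST(records) AS t(r)
  WHERE json_extract_scalar(r, '$.awsRegion') = 'us-east-1';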

Also, there are no indices, fields, facets, etc. in Dassana. You just send JSON/CSV logs and query them with zero latency. And yes, we do support distributed joins among different data sources (we call them apps); see the sketch below. Like any other distributed system it has limitations, but it generally works great for almost all log-related use cases.
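As an illustration, a cross-app join might look like this (all table and field names here are hypothetical, not our actual schema):

  select c.userIdentity, count(*) as rejected_flows
  from aws_cloudtrail c
  join vpc_flow_logs v on c.sourceIPAddress = v.srcAddr
  where v.action = 'REJECT'
  group by c.userIdentity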

One amazing side effect of what we built is that we can offer a unique pricing model that is a perfect match for logging data. Generally speaking, log queries tend to be specific: there is always some sort of predicate, such as a user name, hostname, or IP address. But these queries run over large volumes of data. As such, they run insanely fast on our system, and we are able to charge separately for queries and reduce the cost of ingestion dramatically. In general, we expect our solution to be about 10x cheaper (and 10x faster) than other logging systems.

When not to use Dassana? It is not suitable for unstructured data. We don’t offer full-text search (FTS) yet; we are more like a database for logs than a Lucene index for text files. With more and more people starting to use structured logs, this problem may go away on its own, but as I said, we do plan to offer FTS in the future. Note that you can already use log shippers such as fluentd, vector, logstash, etc. to give structure to your logs.

What’s next? 1. Grafana plugin. Here is a sneak preview: https://drive.google.com/file/d/1JKnX5Aa6cp_pYnMiFzAojA24bjUn28WM/view?usp=sharing

2. Alerting/Slack notifications. You will be able to save queries and get Slack notifications when results match.

3. JDBC driver.

4. TBD. You tell us what to build. Email me and I will personally follow up with you: gk @ dassana dot input/output

I will be online all day today, happy to answer any questions. Feel free to reach out by email too.



* Who do you see as your competition? AWS's CloudWatch / Centralized Logging? Splunk? GCP's Logging? Logstash? Graylog?

* What kind of query language are you thinking? I imagine SQL-like, as that's Clickhouse's native language.

* Business-wise, how are you gonna integrate with the cloud providers, AWS / GCP / Azure? Most people who use those services just use the built-ins.

* More than Grafana, I think you need something like Metabase integrated OOTB. That might be a killer feature.

* IMHO, FTS is a must-have from day 1. Most software that folks run produces non-structured logs OOTB (sad, I know), so folks won't even be able to try your service without changing their software. And getting a lot of software, even popular ones like Python/Flask, Ruby/Rails, Java/Spring, to produce structured logs is not a simple task.

Best of luck!!


>Who do you see as your competition? AWS's CloudWatch / Centralized Logging? Splunk? GCP's Logging? Logstash? Graylog?

More like Athena/Presto/Snowflake. Simply put, anyone offering DB-like systems for querying structured logs.

>What kind of query language are you thinking? I imagine SQL-like, as that's Clickhouse's native language.

Pretty much CH-like SQL with some syntactic sugar for JSON. https://docs.dassana.cloud/docs/query/sample-queries#filter-...

>Business-wise, how are you gonna integrate with the cloud providers, AWS / GCP / Azure? Most people who use those services just use the built-ins.

Cheaper, faster, and easier. Let me openly challenge anyone: take a nested JSON document, send it to a cloud logging service, and query it. Now do the same with Dassana. You will find a night-and-day difference. I agree that most folks start with the built-in services, but they soon grow out of them. Here is an example: if you are using GCP, try getting a count of failed HTTP requests grouped by host or IP. It turns out there is no support for aggregate queries, and you will have to create a bunch of complicated metric filters to achieve it.
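With SQL over structured logs, that question is a single statement (hypothetical table and field names for illustration):

  select host, count(*) as failed
  from http_logs
  where status >= 500
  group by host
  order by failed desc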

>More than Grafana, I think you need something like Metabase integrated OOTB. That might be a killer feature.

That's an awesome suggestion; we are going to look into it.

>IMHO, FTS is a must-have from day 1. Most software that folks run produces non-structured logs OOTB (sad, I know), so folks won't even be able to try your service without changing their software. And getting a lot of software, even popular ones like Python/Flask, Ruby/Rails, Java/Spring, to produce structured logs is not a simple task.

Agree with your sentiment; for now we are focusing on use cases where you have structured logs. SecOps teams have such use cases: these teams mostly deal with data like CloudTrail, VPC Flow Logs, ALB logs, etc.


> * More than Grafana, I think you need something like Metabase integrated OOTB. That might be a killer feature.

Nowadays you can connect directly to CH from many BI tools, and the right choice depends on report types and personal preferences. For example, our SeekTable has a built-in connector for ClickHouse and supports two different drivers: one for the binary TCP interface, another for the HTTP(S) interface.


Are you using the new JSON column type released in ClickHouse 22.3?

https://clickhouse.com/blog/clickhouse-22-3-lts-released/


Not yet. It is quite rough around the edges and far from production use. Besides, it would require creating multiple tables/columns for each schema type (i.e. a GitHub schema might conflict with a GitLab schema). As such, we decided to flatten the JSON, use the map data type, and build our own SQL layer that translates queries into the underlying ClickHouse queries. This allows us to add a lot of syntactic sugar, e.g. https://docs.dassana.cloud/docs/query/intro
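For illustration, here is a minimal sketch of a map-backed layout of this kind. This shows the general technique, not our actual schema:

  CREATE TABLE logs
  (
      app   LowCardinality(String),
      ts    DateTime,
      attrs Map(String, String)
  )
  ENGINE = MergeTree
  ORDER BY (app, ts);

  -- Flattened JSON fields land as map entries.
  INSERT INTO logs VALUES
      ('aws_cloudtrail', now(),
       map('eventName', 'AssumeRole', 'awsRegion', 'us-east-1'));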

We might start using that feature in the future.


What do you mean by flatten the JSON? How is it stored in ClickHouse?


We flatten it to a map and store the map. When you query, we dynamically generate a ClickHouse query that queries the map.
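Roughly along these lines (illustrative, assuming the map-backed layout sketched above, not our exact translation):

  -- User-facing query:
  --   select count(distinct eventName) from aws_cloudtrail
  --   where awsRegion = 'us-east-1'
  -- Generated ClickHouse query over the map column:
  SELECT countDistinct(attrs['eventName'])
  FROM logs
  WHERE app = 'aws_cloudtrail'
    AND attrs['awsRegion'] = 'us-east-1';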


If the JSON is nested, how do you put it in the map?

Since map key is one type, how do you handle multiple types of values?


We flatten nested JSONs too; see the illustration below. Not sure what you mean by the second question, can you rephrase or provide an example? Also, feel free to drop me an email or join Slack [1]; it will be easier to discuss tech details there.

[1] https://dassanacommunity.slack.com/join/shared_invite/zt-teo...
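To illustrate the flattening (the key scheme shown here is an assumption for illustration, not our documented format), a nested object like

  {"user": {"name": "alice", "roles": ["admin", "dev"]}}

becomes map entries along the lines of

  user.name     -> alice
  user.roles[0] -> admin
  user.roles[1] -> dev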


How do you compare to https://betterstack.com/logtail which also seems to be built on Clickhouse?


A few differences:

- Even though Logtail is quite cheap, we are an even cheaper solution.

- Our pricing model separates ingestion from query. If you don't query, you don't pay for query, just ingestion.

- We are JSON native. Our SQL allows querying JSON fields that are nested under JSON arrays.

- Performance. We believe our solution is much faster for selective queries, though like most performance claims, it all depends on the data shape, volume, and what you are querying.

Similarities:

- We both use ClickHouse as the underlying DB.


Cool product and pricing model

> Cloud Log Lake

That's the first time I'm hearing a Clickhouse backend described as a lake. Care to explain?


Generally speaking, tech like ClickHouse is considered data warehouse tech. But we are using it as data lake tech: there are no schemas involved in Dassana. This means you can send free-form JSON objects to Dassana and query them using SQL.


I'm wondering if all these logging solutions that don't offer traces have any kind of future.


It depends; there is plenty of market for purely structured log queries. I come from a security background, and these days in cloud security all logs are structured: CloudTrail, VPC Flow Logs, ALB logs, etc. We are going to be focusing on such use cases.





