
phiresky

lemmy performance connoisseur.

check my github: https://github.com/phiresky

this is what i like to see:

  • 11 Posts
  • 40 Comments
Joined 3 years ago
Cake day: June 12th, 2023


  • compressed size is more important than speed of compression

    Yes, but decompression speed is even more important, no? My internet connection gets 40 MB/s and my SSD 500+ MB/s, so if my decompressor runs at <40 MB/s it’s slowing down my updates / boot time, and it would be better to use a compressor with a worse ratio but faster decompression.

    Arch has used it since 2021 for kernel images https://archlinux.org/news/moving-to-zstandard-images-by-default-on-mkinitcpio/ and since 2019 for packages https://lists.archlinux.org/pipermail/arch-dev-public/2019-December/029739.html

    brotli is mainly good because it basically has a huge built-in dictionary that includes common HTTP headers and HTML structures, so those don’t need to be part of the compressed file. I would assume (without testing) that zstd would more clearly win against brotli if you trained a similar dictionary for it, or just passed a random WARC file to --patch-from (a rough sketch of training such a dictionary is at the end of this comment).

    Cloudflare started supporting zstd and is using it as the default since 2024 https://blog.cloudflare.com/new-standards/ citing compression speed as the main reason (since it does this on the fly). It’s been in chrome since 2021 https://chromestatus.com/feature/6186023867908096

    The RFC mentions dictionaries but they are not currently used:

    Actually this is already considered in RFC-8878 [0]. The RFC reserves zstd frame dictionary ids in the ranges: <= 32767 and >= (1 << 31) for a public IANA dictionary registry, but there are no such dictionaries published for public use yet. [0]: https://datatracker.ietf.org/doc/html/rfc8878#iana_dict

    And there is a proposed standard for how zstd dictionaries could be served from a domain https://datatracker.ietf.org/doc/rfc9842/

    it’s better in every metric

    Let me revise that statement: it’s better in every metric (compression speed, compressed size, feature set, and most importantly decompression speed) compared to all other compressors I’m aware of, apart from xz, bz2, and potentially other non-LZ compressors when it comes to the best achievable compression ratio. And I’m not sure whether it beats lzo/lz4 at the very fast levels (negative level numbers on zstd).

    that struck me as weird about what you were saying

    What struck me as weird was that you were kind of calling it AI hype crap, when they are developing this for their own use and publishing it (not to make money). I’m kind of assuming this based on how much work they put into open sourcing the zstd format and how deeply it is now used in FOSS that does not care at all about Facebook. The format they are introducing uses explicitly structured data formats to guide the compressor - a structure which can be generated from a struct or class definition, and yes, potentially much more easily by an LLM, but I don’t think that is hooey. So I assumed you had no idea what you were talking about.
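
    As a concrete illustration of the dictionary point earlier in this comment, here is a minimal sketch using the Python zstandard bindings; the directory name, dictionary size, and level are made-up placeholders, and the same dictionary has to be supplied again for decompression.

        # Sketch: train a zstd dictionary on a set of small, similar samples
        # (e.g. saved HTML pages) and compare against dictionary-less compression.
        # "html_samples" and the sizes below are placeholder values.
        from pathlib import Path
        import zstandard as zstd

        samples = [p.read_bytes() for p in Path("html_samples").glob("*.html")]

        dictionary = zstd.train_dictionary(100 * 1024, samples)  # 100 KB dictionary

        plain = zstd.ZstdCompressor(level=19)
        with_dict = zstd.ZstdCompressor(level=19, dict_data=dictionary)

        page = samples[0]
        print("without dictionary:", len(plain.compress(page)), "bytes")
        print("with dictionary:   ", len(with_dict.compress(page)), "bytes")

        # Decompression needs the same dictionary:
        # zstd.ZstdDecompressor(dict_data=dictionary).decompress(...)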


  • I have literally never heard of someone claiming zstd was the best overall general purpose compression. Where are you getting this?

    You must be living in a different bubble than me then, because I see zstd used everywhere: my Linux package manager, my Linux kernel boot image, my browser getting served zstd content-encoding by default, large dataset compression (100GB+)… basically everything. On the other hand, it’s been a long time since I’ve seen bz2 anywhere, I guess because of its terrible decompression speed - it decompresses slower than an average internet connection, making it the bottleneck and a bad idea for anything sent (multiple times) over the internet (a rough way to check this on your own data is sketched at the end of this comment).

    That might also be why I rarely see it included in compression benchmarks.

    I stand corrected on compression ratio vs. compression speed; I was probably thinking of decompression speed, as you said, which zstd optimizes heavily for and which I do think is more important for most use cases. Also, try -22 --ultra as well as --long=31 (for data > 128MB). I was making an assumption in my previous comment based on comparisons I do often, but I guess I never use bz2.

    Random sources showing zstd performance on different datasets

    https://linuxreviews.org/Comparison_of_Compression_Algorithms

    https://www.redpill-linpro.com/techblog/2024/12/18/compression-tool-test.html

    https://insanity.industries/post/pareto-optimal-compression/
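
    One way to sanity-check the “decompresses slower than the internet connection” point on your own data is to time the decompressor and compare MB/s of output with your download speed. A rough sketch with Python’s stdlib bz2 and the zstandard package follows; the archive file names are placeholders.

        # Rough decompression-throughput check: if output MB/s is below your
        # download speed, the decompressor is the bottleneck, not the network.
        # "archive.tar.zst" / "archive.tar.bz2" are placeholder file names.
        import bz2
        import time
        import zstandard as zstd

        def report(name, decompress, compressed):
            start = time.perf_counter()
            output = decompress(compressed)
            secs = time.perf_counter() - start
            print(f"{name}: {len(output) / secs / 1e6:.0f} MB/s of decompressed data")

        report("zstd", zstd.ZstdDecompressor().decompressobj().decompress,
               open("archive.tar.zst", "rb").read())
        report("bz2 ", bz2.decompress,
               open("archive.tar.bz2", "rb").read())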


  • My point is you are comparing the wrong thing: if you make zstd as slow as bz2 by increasing the level, you will get the same or a better compression ratio on most content. You’re just comparing whose defaults you like more. Zstd is on the Pareto front almost everywhere: you can tune it to be (almost) the fastest and you can tune it to be almost the highest compression ratio with a single level number, all while having decompression speed that tops the alternatives (see the small level sweep below).

    Additionally it has features nothing else has, like --adapt mode and dictionary compression.
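
    To make the single-number tuning concrete, here is a small sketch (Python zstandard bindings, placeholder input file) that sweeps compression levels on the same data; the CLI’s --fast levels (negative numbers) extend the fast end even further.

        # Sweep zstd compression levels on the same input to see the
        # speed / ratio tradeoff move along the Pareto front.
        # "corpus.tar" is a placeholder input file.
        import time
        import zstandard as zstd

        data = open("corpus.tar", "rb").read()

        for level in (1, 3, 9, 19, 22):
            cctx = zstd.ZstdCompressor(level=level)
            start = time.perf_counter()
            compressed = cctx.compress(data)
            secs = time.perf_counter() - start
            ratio = len(data) / len(compressed)
            speed = len(data) / secs / 1e6
            print(f"level {level:>2}: ratio {ratio:5.2f}, {speed:7.1f} MB/s")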






  • The ActivityPub protocol Lemmy uses is (in my opinion) really bad with regard to scalability. For example, if you press one upvote, your instance has to make 3000 HTTP requests (one to every instance that cares); some back-of-envelope numbers are at the end of this comment.

    But on the other hand, I recently rewrote the federation queue. Looking at Reddit, it has around 100 actions per second. The new queue should be able to handle that volume of requests, and PostgreSQL can handle it (the incoming side) as well.

    The problem right now is more that people running instances don’t have infinite money, so even if you could in theory host hundreds of millions of users, most instances are limited by having a budget of $10-100 per month.
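
    Putting the two figures from this comment together as a quick back-of-envelope calculation (toy numbers, not measurements):

        # Every action gets delivered once per subscribed instance, so the
        # outgoing side multiplies quickly even when the incoming side is small.
        actions_per_second = 100        # roughly Reddit-scale, per the comment above
        subscribed_instances = 3000     # instances that "care" about the activity

        outgoing_deliveries_per_second = actions_per_second * subscribed_instances
        print(outgoing_deliveries_per_second)  # 300000 HTTP requests/s, worst case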



  • I agree that it’s not ideal to be hosted on a platform controlled by Microsoft, but it’s just a fact that you lose 90+% of contributors if you are anywhere else (there’s an article where someone compared this, can’t find it right now). It’s not great that that’s how it is, but you need to choose your battles.

    I’m not really very concerned, since git itself is decentralized, and if GitHub starts causing visible problems, moving somewhere else is not a huge problem. Also, VPNs exist.


  • I’m not aware of philosophical disagreements with that feature; I can just think of logical and technical issues, like how moderation would federate, etc. If all the mods come to an agreement, then the mods on one instance could lock their community and link to the other one. If the mods disagree, then moderation is going to be chaos in any case, no?

    I think multi-community views would be a great idea.


  • Interoperability is great, but sadly there isn’t really any organized group effort to standardize more aspects / extensions of ActivityPub. AP is really “thin” in that it barely prescribes anything. There’s not even a test suite to check whether software complies with the AP spec.

    So everyone kind of does their own thing and fixes interoperability on a case-by-case basis. This makes it kind of frustrating to spend time on - Lemmy already has special cases for many different pieces of software (PeerTube, Mastodon, …) and every one of them increases the complexity.


  • Lemmy is somewhat protected by being an AGPL-licensed project, preventing proprietarization. If there’s ever a relicensing effort, be fearful.

    I’m not sure what exactly becoming an organization would entail, but so far I’d say the development part is not really large enough? Personally, I would start being suspicious when a significant amount of dev power comes from a company (or companies), but so far no company has shown any interest afaik.

    There’s already been a few forks, for example lemmynsfw has made some changes on their side, which nutomic is now looking to integrate back into lemmy.


  • There are a ton of decentralized projects that no one has ever really heard of; new ones pop up all the time (I was following multiple of them in the past). Sadly, in most cases the authors stop working on their projects after a while.

    The same ideas have existed for a long time, in both decade-old projects (ever heard of Freenet? Probably not) and new ones. Many of them are very ambitious and try to replace huge swaths of things (not just file storage but also social aspects, web of trust, etc.) but then collapse under the complexity. IPFS is the most well-known newer project and has (in my opinion, sensibly) limited its scope, but it sadly (still) suffers from huge scalability issues, some of which are deep in the design.

    I think it’s really hard to align incentives there - the nicer it is the harder it is to make money with it. So either these projects tend towards control by one entity or they tend towards death.

    Really the only one that seems to have a long lasting life so far is torrents. Which are amazing. And Email if you want to count that.




  • I don’t think it’s that large. Text is very small and compressible compared to images. Well, it depends on whether you mean the actual database storage (uncompressed, with indexes) or a compressed copy of all the posts. You can see the post number in the URL, which on lemmy.world for this post is 11169622. That means there are around 11 million posts total in lemmy.world’s database. If you assume each of them takes 0.5kB of storage, that would be only ~ 5 GB of posts.
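
    The arithmetic spelled out (the 0.5 kB per post figure is the assumption from the comment above):

        # Highest post id seen on lemmy.world, times an assumed 0.5 kB of text per post.
        posts = 11_169_622
        avg_post_size_bytes = 500  # assumption

        total_gb = posts * avg_post_size_bytes / 1e9
        print(f"{total_gb:.1f} GB")  # ≈ 5.6 GB of raw post text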



  • For migration, we recently added a feature to export your user data. But “real” account migration is something I put on our “todo” list, though it probably also first needs a proposal to define how exactly it should work (should it still work when the original instance is down?). As soon as we start giving users more control over their private key, issues start appearing, like not having any infrastructure for key rotation / revocation. Without that, it will only work while the original instance still exists.

    I’m not sure if by tagging users you mean linking / mentioning them, or adding tags to them like you can tag posts / users on other platforms. For tagging in general there’s a pending proposal https://github.com/LemmyNet/rfcs/pull/4 . So far it focuses on post tagging, though, to reduce the scope. I think the goal is to start with one kind of tagging and add more kinds later.

    For improving cross-instance linking (of communities, posts, and users) we also have an open milestone. There are a few spitballing issues about it, but no real concrete proposal on how to build it yet.


  • I don’t think we found any specific groups of people attacking Lemmy. I personally just saw one or two cases of what looked like individuals trying (and succeeding) to take Lemmy down with a few very simple requests that forced Lemmy to do lots of compute (something like fetching the next million posts from page 10000). The fixes for those were simple because the code was just missing limit checks (roughly the kind of clamping sketched at the end of this comment).

    I’m not sure if there actually was a larger organized attack. Lots of performance issues in Lemmy simply appeared simultaneously and compounded each other as the number of active users and posts grew rapidly.
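
    For illustration only (this is not the actual Lemmy code, just a sketch of the kind of limit checking meant above, with made-up constants): clamp client-supplied paging parameters before they turn into an expensive database query.

        # Toy sketch: reject / clamp absurd paging values such as
        # "a million posts starting from page 10000".
        MAX_LIMIT = 50
        MAX_PAGE = 1000

        def clamp_paging(limit: int, page: int) -> tuple[int, int]:
            limit = max(1, min(limit, MAX_LIMIT))
            page = max(1, min(page, MAX_PAGE))
            return limit, page

        print(clamp_paging(limit=1_000_000, page=10_000))  # -> (50, 1000)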









