lakeFS

Software Development

Git for Data - Scalable Data Version Control

About us

lakeFS by Treeverse is a data version control system that manages data the way one manages code. Using a Git-like model, lakeFS brings software engineering best practices to data, AI, and ML projects - enabling safe development and testing, early error detection, and reproducible results. This approach streamlines collaboration, reduces operational overhead, and supports compliance, auditing, and consistent standards across data silos - empowering data practitioners to deliver projects faster, with greater confidence.

Website
https://lakefs.io/
Industry
Software Development
Company size
11-50 employees
Headquarters
Santa Monica, California
Type
Privately Held
Founded
2020

Updates

  • Who owns your AI? And does the infrastructure beneath it actually support that ownership? These are the questions at the center of Matthew Miller's session at the AI-Ready Data Summit on March 31st. As Sr. Principal Chief Architect in the Field CTO Office at Red Hat, Matthew brings 25+ years of experience across defense, intelligence, financial services, manufacturing, and more — working with federal institutions, state organizations, and Fortune 500 companies alike. His session, "AI Sovereignty and the Infrastructure Beneath," will explore how AI platforms and data infrastructure are evolving to support enterprise AI at scale — and what sovereignty really means when the stakes are high. If you're making decisions about your AI platform, data ecosystem, or enterprise AI strategy, this is a session worth showing up for. 📅 March 31, 2026 | 10 AM–2 PM EDT | Free Virtual Event Reserve your spot: https://hubs.la/Q044Tt590 #aireadydata #dataversioncontrol #datainfrastructure

  • MLflow points to a location. lakeFS points to a snapshot. The difference is everything when a regulator asks you to reproduce a model. MLflow knows you trained on a dataset. It doesn't know which version. If the underlying data has changed since training — new samples added, bad samples removed, preprocessing updated — there's no way to get back to the exact state of the data at training time. The experiment record points to a location, not a snapshot. Git doesn't solve this either. Training datasets are hundreds of gigabytes or terabytes of binary objects in cloud storage. You can't check a million retinal scans into a Git repo. In his latest post, Joe Pringle walks through exactly how to close this gap using a computer vision model for diabetic retinopathy detection as a hands-on example. The core idea: link every MLflow run to a specific lakeFS commit. A few extra tags in your logging code, and suddenly you have complete bidirectional traceability between models and the exact training data used to build them. The post covers: → Zero-copy branching for safe data experimentation at scale → Immutable dataset snapshots via commits and tags → Closing the MLflow loop with lakeFS commit IDs → Logging compliance metadata as a byproduct of normal workflow → What a full audit trail looks like in practice MLflow and Git are already in your stack. lakeFS is what turns that stack into something a regulator, or your future self, can actually trust. Full write-up with code linked below. 👇 https://hubs.la/Q043Tf5W0 #dataversioncontrol #datainfrastructure #medicalimaging

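The "few extra tags in your logging code" step described above can be sketched as follows. This is a minimal illustration, not the code from Joe Pringle's post: the tag names (`lakefs_repo`, `lakefs_commit_id`, etc.) and the helper function are hypothetical, and the MLflow calls are shown in comments only.

```python
# Hypothetical helper: build MLflow run tags that pin a run to a lakeFS commit.
# Tag names and the URI layout below are illustrative, not an official schema.

def lakefs_run_tags(repo: str, branch: str, commit_id: str) -> dict:
    """Tags that point an MLflow run at an immutable data snapshot."""
    return {
        "lakefs_repo": repo,
        "lakefs_branch": branch,
        # The commit ID is immutable, so it identifies the exact data state
        # at training time - the "snapshot, not a location" from the post.
        "lakefs_commit_id": commit_id,
        "training_data_uri": f"lakefs://{repo}/{commit_id}/",
    }

tags = lakefs_run_tags("retina-scans", "main", "a1b2c3d4")
# In a real training script you would attach these inside an active run:
#   import mlflow
#   with mlflow.start_run():
#       mlflow.set_tags(tags)
print(tags["training_data_uri"])
```

With tags like these on every run, you can go from a model in the MLflow registry back to the exact lakeFS commit it was trained on, and vice versa.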
  • Registration is now open for the AI-Ready Data Summit — a free, live virtual event on March 31, 2026! 80% of AI projects never make it to production. Data issues are the #1 reason why. This summit brings together the enterprise AI and data leaders who are solving that problem at scale — from organizations including Dell Technologies, Lockheed Martin, and Red Hat. In 4 hours and 10 sessions, you'll get: • Real-world case studies from enterprises in regulated industries and government • Actionable frameworks for data infrastructure, governance, and production ML • Proven approaches to reproducibility, scalability, and AI readiness Who should attend: Data & AI leaders, data and ML engineers, AI/ML platform teams, and Centers of Excellence driving enterprise AI strategy. Reserve your free spot today 👇 🔗https://lnkd.in/e463YrfK


  • Most teams obsess over their data. Far fewer track the context around it. That's a problem because without metadata tracking, you lose visibility into where data came from, how it changed, and who's using it. In our latest article, we break down: → What metadata tracking actually is (and how it differs from data lineage, cataloging, and management) → The 6 types of metadata every team should be capturing → Real-world use cases: pipeline debugging, ML reproducibility, schema evolution, compliance → The biggest challenges teams face → The best practices to overcome those challenges As data volumes grow and pipelines get more complex, metadata isn't passive documentation anymore. It's an active driver of trust, efficiency, and governance. If your team is flying blind on data context, this one's for you. 👇 https://hubs.la/Q043Sdlz0 #dataversioncontrol #datagovernance #metadata

  • "Garbage in, lawsuit out" isn't just a catchy phrase. It's a very real threat. Picture this: Your ML model is performing as planned. Revenue is up 15%. Then you get a subpoena demanding proof that your November training data wasn't biased. But your S3 bucket has been overwritten by January's ETL job. The data is gone. You can't prove your innocence. This nightmare, as presented by Itai Gilo at PyData Global 2025, is preventable. The same version control principles that transformed software engineering 20 years ago can solve ML compliance today. In our latest article recapping the talk, Itai breaks down three critical compliance failures that plague ML teams: ▪️PII leakage that forces you to scrap entire projects ▪️Reproducibility traps that make audits impossible ▪️Traceability gaps that leave you defenseless in legal disputes The solution? Treat your data like code. With data version control, you can: ✅ Block sensitive data with automated pre-merge hooks ✅ Link every model to an immutable commit ID for perfect reproducibility ✅ Generate audit trails automatically. No more tracking in spreadsheets. It’s 2026. GDPR fines can reach €20M or 4% of global revenue. But that’s not the biggest risk. The bigger risk is losing B2B deals because you can't prove data lineage while your competitors can. Build compliance into your infrastructure from day one, and your data lake becomes an asset, not a liability. 🚀 Read the full article to see how data version control transforms Alice’s compliance nightmare into an automated success story 👇 [Link to article in comments] #DataEngineering #DataGovernance #MLCompliance

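The "block sensitive data with automated pre-merge hooks" idea above can be sketched as a simple check. This is an illustrative stand-in, not lakeFS's actual hook API: real lakeFS hooks are configured declaratively (e.g. in actions files), while this standalone function only shows the kind of PII screen such a hook might run. The pattern list and function name are hypothetical.

```python
# Illustrative sketch of a pre-merge "block PII" check.
# A real hook would run server-side before a merge into a protected branch;
# here we just show the screening logic on a list of column names.
import re

# Hypothetical patterns for column names that suggest PII.
PII_PATTERNS = [r"ssn", r"email", r"phone", r"date_of_birth", r"passport"]

def find_pii_columns(columns: list) -> list:
    """Return the column names that look like PII and should block the merge."""
    hits = []
    for col in columns:
        if any(re.search(p, col.lower()) for p in PII_PATTERNS):
            hits.append(col)
    return hits

cols = ["patient_id", "scan_path", "email_address", "label"]
blocked = find_pii_columns(cols)
if blocked:
    # A real pre-merge hook would fail here, rejecting the merge.
    print(f"Merge blocked: possible PII columns {blocked}")
```

The point of wiring a check like this into a hook is that the audit trail comes for free: every rejected merge is recorded, so compliance is a byproduct of the workflow rather than a spreadsheet.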
  • lakeFS reposted this

    What do on-prem applied AI data platforms look like? This is aimed at institutions that want one but don't yet know how to build it. When people think of data engineering and AI, they often think of training models. But what most people don't realize is that using pre-trained AI models to refine and enrich existing datasets is a far more common use case. What does that look like in practice? What can a data engineering pipeline with AI capabilities look like when running on-prem? Because people often ask me what this can look like in practice when your AI is on-prem, I decided to set up a demo. If you are technical, you can check out the repo and try it out yourself - link in comment. If you just want to understand the concepts, you can watch the video - link also in comment. It is half an hour long, but it covers a lot of ground about on-prem AI and data engineering without expecting too much prior knowledge. There's a reason the entire AI conversation revolves around chat interfaces. They're easy to demo. Type a question, get an answer, everyone claps. But a chatbot fits poorly into a data-driven organization with sophisticated data products that support decision making on a daily basis. They need AI that fits into their existing data workflows. Classify thousands of documents. Enrich datasets at scale. Generate synthetic versions of sensitive data so teams can collaborate without compliance nightmares. That's what I built. A demo of a complete data platform, with data pipelines, versioned storage, local LLM inference, notebooks, and even synthetic data generation. No cloud API calls. No data leaving the machine. Most of the stack I am using is either open source or included with many AI factory setups. And of course, given my day-to-day job, the test dataset I use for this demonstration is 7,700 central bank speeches from 1997 until 2022. Some technologies demoed in this stack: Dagster Labs, lakeFS, NVIDIA NIMs, NeMo Safe Synthesizer and KAI scheduler, marimo, ArgoCD. Credit to the crew at Sveriges riksbank for all the technical experimentation with data platform stacks that inspired this demo, including Johan Carlin, Arian Javdan, Jon Söråker, Sidi Kasmi, Björn Annergren and 🏗️ Jon Erik K. (PS: At the end of the demo there is a graph. I completely forgot to explain why the AI model tracked that central bank communication turned dovish around 2020. Of course, the answer is Covid and the efforts made to stimulate economies around the world.)

  • Healthcare AI teams face a new reality: regulators now require complete reproducibility of model behavior and data states. Can't reproduce a training run from six months ago? That's not just an inconvenience. It means delayed product launches, failed audits, and compliance costs that compound with every iteration. This post breaks down why reproducibility has shifted from best practice to regulatory standard, and how to provide the immutable snapshots and lineage tracking that auditors demand. Read more: https://lnkd.in/emyhgJwQ #AI #HealthcareAI #MLOps #DataVersionControl #Reproducibility #DataOps

  • Your ML team needs to find all images labeled "defective" from Q3 production runs tagged by a specific annotation workflow. Your data lake has 10 billion objects. How do you find them? For most teams, the honest answer is: you can't. At least not without building custom infrastructure or scanning through files for hours. This is the reality for data teams operating at scale. Finding the right data becomes an archaeological expedition. You build custom indexing systems and maintain separate metadata catalogs that inevitably drift out of sync. When your queries finally run, they're slow and return partial results. lakeFS #MetadataSearch changes this. Instead of building custom systems, you query metadata directly with SQL. Every object's metadata – system properties like size and creation time, plus user-defined data like workflow IDs, annotation labels, or PII flags – gets indexed into an Iceberg table that you can query with any compatible tool. Want to find all Parquet files over 1GB tagged with sensitive data under a specific path? It's a straightforward SQL query. Need files produced by a particular workflow last week? Same thing. But here's what makes this truly powerful: reproducibility. When you query a constantly changing data lake, your results change too. This breaks AI governance, regulatory compliance, and team collaboration. With lakeFS, you query immutable versions through specific commits or tags. The same query against the same commit always returns identical results. This matters for real work. Data scientists need to share curation results reliably. ML engineers must validate experiments months later. Compliance teams need audit trails that don't change. The implementation leverages lakeFS's Git-like version control for data. By building on immutable commits, metadata queries inherit the same reproducibility guarantees as the data itself. What changes? You stop maintaining parallel metadata systems and writing scripts to scan objects. Instead, you write SQL queries that run fast, return complete results, and work the same way six months from now. This shifts metadata from scattered documentation to a first-class interface for your data lake. For teams managing billions of objects, that's the difference between drowning in your data and actually using it. #datacompliance #datagovernance #dataengineering #datascience #datacollaboration lakeFS
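The query pattern described above can be tried with any SQL engine. The toy sketch below uses sqlite3 as a stand-in for the Iceberg table that lakeFS Metadata Search actually produces; the table name, column names, and sample paths are illustrative, not the real schema.

```python
# Simulating the "find all Parquet files over 1GB, flagged as sensitive,
# under a specific path" query against an object-metadata table.
# sqlite3 stands in for the Iceberg table and whichever SQL engine you use.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE object_metadata (
        path TEXT, size_bytes INTEGER, content_type TEXT, pii_flag INTEGER
    )
""")
conn.executemany(
    "INSERT INTO object_metadata VALUES (?, ?, ?, ?)",
    [
        ("datasets/q3/scan_001.parquet", 2_500_000_000, "parquet", 1),
        ("datasets/q3/scan_002.parquet",   800_000_000, "parquet", 1),
        ("datasets/q3/notes.txt",               12_000, "text",    0),
    ],
)

# All Parquet files over 1 GB flagged sensitive under datasets/q3/
rows = conn.execute("""
    SELECT path FROM object_metadata
    WHERE content_type = 'parquet'
      AND size_bytes > 1000000000
      AND pii_flag = 1
      AND path LIKE 'datasets/q3/%'
""").fetchall()
print(rows)  # only scan_001.parquet matches
```

Against lakeFS, the same query would be issued at a specific commit or tag, which is what makes the result set reproducible: rerunning it against that commit months later returns the identical rows.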


Funding

lakeFS: 2 total rounds

Last Round

Series B

US$ 20.0M
