Open-source AI at FOSDEM
At FOSDEM 2024 in Brussels, the AI and Machine Learning devroom hosted several talks about open-source AI models. With talks about a definition of open-source AI, "ethical" restrictions in licenses, and the importance of open data sets, in particular for non-English languages, the devroom provided an overview of the current state of the domain.
An AI model is a program that has been trained on a data set to recognize patterns, mimic the learned data in its output, or make certain kinds of decisions autonomously. Most notably, large language models (LLMs), which are extensive neural networks capable of generating human-like text, were a recurring subject at FOSDEM. This report is based on the live streams of the talks, as the flu unfortunately prevented me from attending FOSDEM in person this year.
Characteristically, an LLM incorporates up to several hundred billion "weights", which are floating-point numbers that are also referred to as "parameters". Companies developing large language models are not inclined to release their models and the code to run them as open source, since training the models requires significant computing power and financial investment. However, that doesn't stop various organizations from developing open-source LLMs. Last year, LWN looked at open-source language models.
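To get a sense of the scale involved, a quick back-of-envelope sketch (the 7B, 13B, and 70B sizes below are illustrative, matching Llama 2's published variants) shows how much memory merely storing the weights requires:

```python
# Rough memory footprint of an LLM's weights alone (ignoring activations
# and optimizer state): parameter count times bytes per parameter.

def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    """Memory in GiB needed to hold the raw weights."""
    return n_params * bytes_per_param / 2**30

# Llama 2 is released in 7B, 13B, and 70B parameter variants.
for n in (7e9, 13e9, 70e9):
    fp32 = weight_memory_gib(n, 4)   # 32-bit floats
    fp16 = weight_memory_gib(n, 2)   # 16-bit floats, common for inference
    print(f"{n/1e9:>4.0f}B params: {fp32:7.1f} GiB (fp32), {fp16:7.1f} GiB (fp16)")
```

Even the smallest variant needs on the order of 13GiB in 16-bit precision just to hold its weights, before any of the compute needed for training is considered.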
License restrictions
Niharika Singhal, project manager at the Free Software Foundation Europe (FSFE), talked about the trend of imposing ethical restrictions on AI models through licensing. Singhal provided several examples of added restrictions of that sort, related to field of endeavor, behavior, or commercial practices. One is the Hippocratic License, which forbids the licensee from engaging in numerous actions deemed harmful based on various "international agreements and authorities on fundamental human rights norms". There's also the Llama 2 use policy, which prohibits use of the LLM for violent or terrorist activities, as well as "any other criminal activity". Similarly, BigScience's OpenRAIL-M License imposes restrictions on the use of models for various harmful activities.
According to Singhal, these additional restrictions have serious implications: "They create barriers against the use and reuse of the models, which also makes it more difficult to adapt and improve the models." She believes that to preserve "openness" in AI, the licenses of AI models must be interoperable with free-software licenses, which isn't the case with these restrictions. She concludes that licenses can't be a substitute for regulation: "Restrictive practices to comply with ethical rules shouldn't be in licenses: these belong to the domain of regulations."
A definition of open-source AI
Stefano Maffulli, executive director of the Open Source Initiative (OSI), described OSI's efforts to define open-source AI. In 2022, the OSI started contacting researchers, other "open" organizations, technology companies, and civil-rights organizations, to ask them about their ideas for an open-source AI system.
As a general principle, Maffulli maintains that the GNU Manifesto's Golden Rule should be applicable to AI: "If I like an AI system, I must be free to share it with other people." For an AI system to be categorized as open-source, it needs to grant us adaptations of the four basic freedoms applicable to open-source software: to use, study, modify, and share.
We need to be able to use the system for any purpose and without having to ask for permission. We need to be able to study how the system works and inspect its components. We need to be able to modify the system to change its recommendations, predictions, or decisions to adapt to our needs. And we need to be able to share the system with or without modifications, for any purpose.
According to Maffulli, a pertinent question to pose in this context is: "What is the preferred form to make modifications to an AI system?" To get an answer to this question, OSI has created small working groups to analyze some popular AI systems. "We're starting with Llama 2 and Pythia, two LLMs. After this, we'll repeat the same exercise with BLOOM, OpenCV, Mistral, Phi-2, and OLMo." For each of these AI systems, the working group will identify the requirements to guarantee the four basic freedoms. For example, understanding why, given an input, you get a particular output, is necessary to being able to study an AI system.
In 2024, the OSI will release a new draft of the open-source AI definition monthly, based on bi-weekly virtual public town halls. "Our goal is to have a 1.0 release by the end of October", Maffulli said. Everyone is welcome to partake in the discussions regarding the drafts in OSI's public forum.
According to Maffulli, there can't be a spectrum when it comes to open-source AI: either an AI system is open source, or it isn't. Nevertheless, many players within the domain of large language models misuse the term "open source". For example, one of the most popular "open" LLMs is Meta's Llama 2. When Meta's Yann LeCun announced this model on Twitter last year, he wrote: "This is huge: Llama-v2 is open source, with a license that authorizes commercial use!". However, the Llama 2 license limits commercial use based on the number of active users, and it also forbids using Meta's model to improve other LLMs. Both limitations are at odds with the OSI's Open Source Definition.
Open data sets
Julie Hunter, a research engineer at the French software company Linagora, discussed building open-source language models. According to Hunter, the LLMs developed by Meta, as well as those by MosaicML and the Technology Innovation Institute's Falcon models, are so-called "open-weight models": the weights of the neural networks are published. This allows a choice of how the model is run and the model can be fine-tuned by adapting the weights with additional training. However, the weights don't explain why something works or doesn't work. "Without access to the data the model is trained on, it leaves a lot open to guesswork", Hunter said.
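The mechanics behind "adapting the weights with additional training" can be illustrated with a deliberately tiny sketch (pure Python, unrelated to the actual Falcon or Llama code): published weights are just numbers that anyone can load and then nudge further with gradient steps on new data, which is fine-tuning in miniature:

```python
# Toy illustration of fine-tuning: start from a published ("open") weight
# of a one-parameter linear model y = w * x, then adapt w with a few
# gradient-descent steps on new (x, y) data.

def fine_tune(w: float, data: list[tuple[float, float]],
              lr: float = 0.01, epochs: int = 200) -> float:
    """Adapt weight w to new (x, y) pairs by minimizing squared error."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x   # d/dw of (w*x - y)^2
            w -= lr * grad
    return w

pretrained_w = 1.0                   # the released "open" weight
new_data = [(1.0, 3.0), (2.0, 6.0)]  # new task: y = 3x
tuned_w = fine_tune(pretrained_w, new_data)
print(f"weight moved from {pretrained_w} to {tuned_w:.3f}")
```

A real LLM does the same thing with billions of weights, a loss over token predictions, and far more sophisticated optimizers, but the principle of adapting released weights with additional training is identical; what the weights alone never reveal is the data that produced them, which is Hunter's point.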
There has been a push for open training data and, as a result, a lot of data sets have been added to web sites like the one run by Hugging Face. "Anyone can train their new LLM on these data sets", Hunter said. "However, there are several problems with many of these data sets. They are often crawled from the web, packed with personal information, toxic language, and low-quality sentences. Furthermore, they are predominantly English."
The OpenLLM France consortium aims to build open-source AI models and technologies for the French language. For its first model, Claire, the main goal was to create a French data set with traceable licenses. The result, the Claire French Dialogue Dataset (Claire-Dialogue-French-0.1), is a corpus of 140 million words from transcripts and stage plays in French, as well as from parliamentary discussions. Most of the data set is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license, though some parts have other (traceable) licenses.
The data set was used to fine-tune an open-weight model, Falcon-7B. "The main purpose of this approach was to evaluate the impact of a good data set on the performance of the model", Hunter said. Michel-Marie Maudet, Linagora's general manager, added that the company's idea of developing a language model based on a small and high-quality corpus of data was inspired by Microsoft Research's paper "Textbooks Are All You Need". He continued:
The quality of a data set is more important than its quantity. A small and high-quality corpus results in a compact, specialized model with superior control over its responses in terms of interpretability and reliability. It also makes training faster, which allows us to continuously update it.
In October 2023, the model Claire-7B-0.1 was published on Hugging Face. The code to train the model was also made public, under the AGPLv3.
Beyond English
OpenLLM France is now working on a 100% open-source language model, Lucie, slated for release in April 2024. Maudet explained: "This model is trained with 100% open-source data sets of French, English, German, Spanish, and Italian texts, as well as some computer code." The data sets include the archives of the French national library and academic publications with open access.
Maudet's talk presented some details about OpenLLM France and its mission. The community, which started in July 2023, boasts over 450 active members, ranging from academic institutions to companies. Why is a France-focused LLM consortium needed? Maudet explained that an analysis of the geographical distribution of LLMs with more than a billion parameters, released since 2018, reveals that nearly 70% of them were created in North America and only 7.5% in Europe. The language distribution in Llama 2's training data looks even more dismal: "While English comprises almost 90% of the data, European languages such as German and French account for just 0.17% and 0.16% of the data, respectively." Because European languages are underrepresented in their data sets, models like Llama 2 exhibit subpar performance in these languages.
There have been similar initiatives in other parts of Europe to build a European-language open-source LLM, such as LAION and openGPT-X in Germany and Fauno in Italy. At FOSDEM, Maudet announced that OpenLLM France is renaming itself to OpenLLM Europe (though the web site is not available yet). "Our mission is to develop an open-source LLM for each European language."
Conclusion
The fact that organizations call their AI systems "open source" even if their license is at odds with the four basic freedoms is a sign that we really need to have a clear definition of open-source AI. Hopefully, OSI's definition—expected by the end of 2024—will also help stop the proliferation of licenses with various well-meant but detrimental ethical restrictions. Beyond that, it would be beneficial for a consortium such as OpenLLM Europe to attract enough members to build powerful open-source LLMs beyond English.
| Index entries for this article | |
|---|---|
| GuestArticles | Vervloesem, Koen |
| Conference | FOSDEM/2024 |