The fight between open source software and proprietary software is well known. But the tensions that have permeated software circles for decades have carried over into the burgeoning artificial intelligence space, with controversy following close behind.
The New York Times recently published a glowing profile of Meta CEO Mark Zuckerberg, noting how his embrace of “open source AI” had made him popular in Silicon Valley once again. The problem, however, is that Meta’s Llama large language models aren’t really open source.

Or are they?
By most estimates, they are not. But it highlights how the notion of “open source AI” will only generate more debate in the years to come. This is something the Open Source Initiative (OSI) is trying to address, spearheaded by executive director Stefano Maffulli (pictured above), who has been working on the problem for more than two years through a global effort spanning conferences, workshops, panels, webinars, reports and more.
AI is not software code
The OSI has been the steward of the Open Source Definition (OSD) for more than a quarter of a century, establishing how the term “open source” can or should be applied to software. A license that meets this definition can legitimately be considered “open source,” although it recognizes a spectrum of licenses, from extremely permissive to not quite so permissive.
But transposing legacy naming and licensing conventions from software onto AI is problematic. Joseph Jacks, open source evangelist and founder of VC firm OSS Capital, goes so far as to say that there is “no such thing as open source AI,” noting that “open source was explicitly invented for software source code.”
In contrast, “neural network weights” (NNWs), a term used in the artificial intelligence world for the parameters or coefficients a network learns during training, are not comparable to software in any meaningful way.

“Neural net weights are not software source code; humans can’t read or debug them,” says Jacks. “Furthermore, the fundamental rights of open source also don’t translate over to NNWs in any congruent way.”
This led Jacks and his OSS Capital colleague Heather Meeker to come up with a definition of sorts of their own, around the concept of “open weights.”
So before we’ve even come to a meaningful definition of “open source AI,” we can already see some of the tensions inherent in trying to get there. How can we agree on a definition if we cannot agree that the “thing” we are defining exists?
Maffulli, for what it’s worth, agrees.
“The point is correct,” he told TechCrunch. “One of the initial debates we had was whether to call it open source AI, but everyone was already using the term.”
This reflects some of the challenges in the broader sphere of AI, where debates abound over whether what we call “AI” today really is AI, or simply powerful systems trained to detect patterns across vast swaths of data. But detractors are mostly resigned to the fact that the “AI” nomenclature is here to stay, and there is no point in fighting it.
Founded in 1998, OSI is a nonprofit public benefit corporation working on a wide variety of open source activities around advocacy, education, and its core purpose: defining open source. Today, the organization relies on sponsorships for funding, with such esteemed members as Amazon, Google, Microsoft, Cisco, Intel, Salesforce, and Meta.
Meta’s involvement with the OSI is particularly notable right now when it comes to the notion of “open source AI.” Even though Meta hangs its AI hat on the open source peg, the company has notable restrictions on how its Llama models can be used: sure, they can be used freely for research and commercial use cases, but app developers with more than 700 million monthly users must request a special license from Meta, which it will grant entirely at its own discretion.

Simply put, Meta’s Big Tech brethren can go whistle if they want it.
Meta’s language around its LLMs is somewhat malleable. Although the company called its Llama 2 model open source, with the arrival of Llama 3 in April it moved away from the terminology a little, using phrases such as “openly available” and “openly accessible.” But in some places, it still refers to the model as “open source.”
“Everyone involved in the conversation agrees that Llama itself cannot be considered open source,” Maffulli said. “People I’ve talked to who work at Meta know it’s a bit of a stretch.”
On top of that, some might argue that there is a conflict of interest here: should a company that has shown a desire to leverage the open source brand also be providing funding to the stewards of “the definition”?
This is one of the reasons why the OSI is trying to diversify its funding, recently securing a grant from the Sloan Foundation, which is helping fund its global multi-stakeholder push to reach a definition of open source AI. TechCrunch can reveal that this grant amounts to around $250,000, and Maffulli is hopeful that it may alter the optics around the organization’s reliance on corporate funding.
“That’s one of the things the Sloan grant makes even clearer: We could say goodbye to Meta’s money at any time,” Maffulli said. “We could have done that even before this Sloan grant, because I know we will receive donations from others. And Meta knows that very well. They are not interfering with any of this [process], and neither is Microsoft, nor GitHub, nor Amazon, nor Google. They absolutely know that they cannot interfere, because the structure of the organization does not allow it.”
Working Definition of Open Source AI
The current draft of the open source AI definition sits at version 0.0.8 and comprises three core parts: the “preamble,” which lays out the document’s mandate; the open source AI definition itself; and a checklist that runs through the components required for an open-source-compliant AI system.
According to the current draft, an open source AI system should provide freedom to use the system for any purpose without asking permission; allow others to study how the system works and inspect its components; and modify and share the system for any purpose.
But one of the biggest challenges has been around data: that is, can an AI system be classified as “open source” if the company hasn’t made the training dataset available for others to inspect? According to Maffulli, it is more important to know where the data comes from and how a developer labeled, deduplicated and filtered it, as well as having access to the code that was used to assemble the dataset from its various sources.
“It’s much better to know that information than to have the data set without the rest,” Maffulli said.
While it would be nice to have access to the full dataset (the OSI makes this an “optional” component), Maffulli says that in many cases it is neither possible nor practical. This could be because the dataset contains confidential or copyrighted information that the developer does not have permission to redistribute. Moreover, there are ways of training machine learning models in which the data itself is never actually shared with the system, through techniques such as federated learning, differential privacy, and homomorphic encryption.
And this perfectly highlights the fundamental differences between “open source software” and “open source AI”: the intentions may be similar, but they are not comparable, and this disparity is what the OSI is trying to capture in its definition.
In software, source code and binary code are two views of the same artifact: they reflect the same program in different forms. But training datasets and the trained models that result from them are different things: you can take the same dataset and won’t necessarily be able to recreate the same model consistently.
“There is a variety of statistical and random logic that occurs during training, which means you can’t make it replicable in the same way that software can,” Maffulli added.
Therefore, an open source AI system should make itself easy to recreate, with clear instructions. And this is where the checklist facet of the open source AI definition comes into play, which draws on a recently published academic paper titled “The Model Openness Framework: Promoting Completeness and Openness for Reproducibility, Transparency, and Usability in Artificial Intelligence.”
This paper proposes the Model Openness Framework (MOF), a classification system that rates machine learning models “based on their completeness and openness.” The MOF demands that specific components of the AI model development be “included and released under appropriate open licenses,” including training methodologies and details of model parameters.
Stable state
The OSI calls the official release of the definition the “stable release,” much like a company would do with an app that has undergone extensive testing and debugging before prime time. The OSI purposely does not call it a “final version” because some parts are likely to evolve.
“We really can’t expect this definition to last 26 years like the open source definition has,” Maffulli said. “I don’t expect the top part of the definition, like ‘what is an AI system?’, to change much. But the parts that we refer to in the checklist, those lists of components, are technology-dependent. Tomorrow, who knows what the technology will look like.”
The board is expected to ratify the stable version of the open source AI definition at the All Things Open conference in late October, with the OSI embarking on a world tour spanning five continents in the intervening months, seeking more “diverse input” on how “open source AI” will be defined going forward. But any final changes are likely to be little more than “small tweaks” here and there.
“This is the home stretch,” Maffulli said. “We have reached a feature-complete version of the definition; we have all the elements we need. Now we have a checklist, so we’re checking that there are no surprises in there; there are no systems that should be included or excluded.”