The Open Source Initiative (OSI) recently unveiled its latest draft definition of “open source AI” to clarify the ambiguous use of the term in this rapidly evolving field. This move comes as some companies like Meta release trained AI language model weights and code with usage restrictions while using the “open source” label. This has sparked fierce debates among free software advocates about what truly constitutes “open source” in the context of AI.
For example, Meta's Llama 3 model, while free, does not meet the traditional open source criteria defined by the OSI for software because it imposes licensing restrictions on use based on company size or the type of content created with the model. The AI image generator Flux is another “open” model that is not truly open source. Because of this kind of ambiguity, we have typically described AI models that contain code or weights with constraints, or that lack accompanying training data, using alternative terms such as “open weights” or “source available.”
To formally address the issue, the OSI, known for its commitment to open software standards, has assembled a group of about 70 participants, including researchers, lawyers, politicians, and activists. Representatives of major technology companies such as Meta, Google, and Amazon have also joined the initiative. The group's current draft (version 0.0.9) of the definition of open source AI emphasizes “four fundamental freedoms” reminiscent of those that define free software: giving users of the AI system the freedom to use it for any purpose without having to ask for permission, to study how it works, to modify it for any purpose, and to redistribute it with or without modifications.
By establishing clear criteria for open source AI, the organization hopes to create a benchmark against which AI systems can be evaluated. This will likely help developers, researchers and users make more informed decisions about the AI tools they build, study or use.
Truly open-source AI can also shed light on potential software vulnerabilities of AI systems, as researchers can see how the AI models work behind the scenes. Contrast this approach with an opaque system like OpenAI's ChatGPT, which is more than just a big GPT-4o language model with a fancy interface—it's a proprietary system of interlocking models and filters, and its exact architecture is a closely guarded secret.
According to the OSI's project schedule, a stable version of the definition of “open source AI” is expected to be announced in October at the All Things Open 2024 event in Raleigh, North Carolina.
“Permission-free innovation”
In a May press release, the OSI stressed the importance of defining what open source AI really means. “AI is different from regular software, forcing all stakeholders to consider how open source principles apply to this space,” said Stefano Maffulli, OSI's executive director. “OSI believes that everyone has the right to maintain control over the technology. We also recognize that markets thrive when clear definitions encourage transparency, collaboration and permissionless innovation.”
The organization's latest draft definition goes beyond the AI model and its weights to include the entire system and its components.
For an AI system to be considered open source, it must provide access to what the OSI calls a “preferred form for making changes.” This includes detailed information about the training data, the complete source code used to train and run the system, and the model's weights and parameters. All of these items must be available under licenses or terms approved by the OSI.
Notably, the draft does not require the release of raw training data. Instead, it requires “data information” – detailed metadata about the training data and methods. This includes information about data sources, selection criteria, preprocessing techniques, and other relevant details that would enable a professional to recreate a similar system.
The “data information” approach aims to provide transparency and reproducibility without necessarily disclosing the actual dataset. This is ostensibly intended to address potential privacy or copyright concerns while respecting open source principles, although this particular point may require further discussion.
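To make the draft's requirements concrete, here is a rough sketch of how the checklist described above might be expressed in code. This is an illustration only, not an OSI tool or an official evaluation method: the component names, the `Component` class, and the `missing_requirements` function are all hypothetical, and the real draft contains more nuance than three yes-or-no checks. Note that raw training data is deliberately absent from the list, since the draft asks only for “data information.”

```python
from dataclasses import dataclass

# Hypothetical names for the three items the draft asks for, as described
# in the article. Raw training data is intentionally not on the list.
REQUIRED_COMPONENTS = (
    "data_information",
    "source_code",
    "weights_and_parameters",
)


@dataclass(frozen=True)
class Component:
    name: str
    available: bool            # has the component been released at all?
    osi_approved_terms: bool   # is it under OSI-approved licenses or terms?


def missing_requirements(components):
    """Return a list of reasons a system falls short of this simplified checklist."""
    by_name = {c.name: c for c in components}
    problems = []
    for name in REQUIRED_COMPONENTS:
        component = by_name.get(name)
        if component is None or not component.available:
            problems.append(f"{name}: not provided")
        elif not component.osi_approved_terms:
            problems.append(f"{name}: not under OSI-approved terms")
    return problems


# An imaginary "open weights" release: code is open, weights carry usage
# restrictions, and no information about the training data is published.
open_weights_model = [
    Component("source_code", available=True, osi_approved_terms=True),
    Component("weights_and_parameters", available=True, osi_approved_terms=False),
]

for problem in missing_requirements(open_weights_model):
    print(problem)
```

Under this simplified reading, a system counts as open source AI only when the returned list is empty, which is why a release with restricted weights or no published data information would fall short.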
“The most interesting thing about [the definition] is that they allow training data NOT to be published,” said independent AI researcher Simon Willison in a short Ars interview about the OSI proposal. “This is an extremely pragmatic approach. If this were not possible, there would be hardly any powerful 'open source' models.”