
Part of a Solution

The thing I dislike most about the current AI zeitgeist is how much data laundering happens. I think a neat (and hopefully possible) solution would be the following:

  1. All statistical models (aka AI) over a certain size (to be determined) MUST be open sourced.
  2. Open sourced in this sense means:
     a. The code used to create the model (and all dependent code) MUST be open source.
     b. The training data used to create the model MUST be hosted by the model creators.
     c. Any randomization used in training a model MUST use a seed, and that seed MUST be published (see the sketch after this list).
  3. To solve the business issue of open source, more stringent IP protection must be introduced such that:
     a. Competitors cannot copy a novel, protected training method (IP protection of the training code).
     b. Competitors cannot copy a training corpus composed of copyrighted material (IP protection of the training data). For instance, if Disney trained a model on its corpus of IP, DreamWorks could not use that same training data without infringing.
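
To make requirement 2c concrete, here's a minimal sketch of what "publish the seed" could look like, assuming a PyTorch/NumPy training setup. The SEED value and the manifest file are illustrative placeholders, not part of any real specification:

```python
# Sketch of requirement 2c: pin and record every RNG seed so a training run
# can be replayed. Assumes PyTorch and NumPy; SEED and the manifest file
# name are hypothetical examples.
import json
import random

import numpy as np
import torch

SEED = 20240101  # hypothetical published seed

def seed_everything(seed: int) -> None:
    """Seed every RNG a typical training loop touches."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Fail loudly if an op has no deterministic implementation.
    torch.use_deterministic_algorithms(True)

seed_everything(SEED)

# The seed (plus anything else needed to replay the run) would be published
# alongside the model weights and the hosted training corpus.
with open("training_manifest.json", "w") as f:
    json.dump({"seed": SEED, "framework": f"torch {torch.__version__}"}, f)
```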

This achieves a few goals:

  1. We gain transparency on what data is used to generate the models.
     a. A nice secondary effect: by forcing model-generating companies to also host the training corpus, any legitimate copyright complaint against the hosted corpus would have a domino effect and force the model to be taken down as well.
     b. Any personal data that was used could similarly be revoked via legal action under laws such as the CCPA.
  2. The models can be retrained to verify that their weights match the training corpus (see the sketch after this list).
     a. Most people won't be able to retrain a model from the same inputs at home, but competing corporations, governments, and special interest groups like the EFF can do so on a regular basis to confirm that the aforementioned transparency is not falsified.
  3. Companies still have a financial incentive to create better ways to train models and to curate better training data.
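
Here's a rough sketch of how a third-party verifier (a regulator, a competitor, the EFF) might check goal 2. It assumes the creator publishes a checkpoint and the verifier reruns training from the published code, corpus, and seed; the file names and state-dict format are placeholders:

```python
# Hedged sketch of goal 2: compare a published checkpoint against an
# independently retrained one. Assumes both files are PyTorch state dicts;
# the file names are hypothetical. Exact equality only holds if training
# was fully deterministic (seeded, deterministic ops, same hardware/stack).
import hashlib

import torch

def weights_fingerprint(state_dict: dict) -> str:
    """Hash every tensor in a state dict into a single hex digest."""
    digest = hashlib.sha256()
    for name in sorted(state_dict):
        digest.update(name.encode())
        digest.update(state_dict[name].cpu().numpy().tobytes())
    return digest.hexdigest()

published = torch.load("published_checkpoint.pt")     # creator's weights
retrained = torch.load("independently_retrained.pt")  # verifier's rerun

if weights_fingerprint(published) == weights_fingerprint(retrained):
    print("Weights match: the published model is consistent with its published inputs.")
else:
    print("Mismatch: the published training inputs do not reproduce the model.")
```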

There's an existing legal framework that works in similar ways to achieve similar goals: the patent system. It's horrendously outdated, and its loopholes often let it cause more harm than it protects against. However, faced with the alternative of rampant AI-nonsense-machines, we need some kind of legal framework to disincentivize a large portion of the harm that the current set of AI tools can cause.