GitHub’s commercial AI tool was built from open source code

“I’m generally happy to see expansions in free use, but I’m a little bitter when they end up benefiting large companies that extract value en masse from the work of smaller authors,” Woods said.

One thing that is clear about neural networks is that they can memorize their training data and reproduce copies of it. That risk exists whether the data involves personal information, medical secrets, or copyrighted code, says Colin Raffel, a professor of computer science at the University of North Carolina who co-authored a forthcoming paper (currently available as a non-peer-reviewed preprint) examining similar copying in OpenAI’s GPT-2. They found that getting the model, which is trained on a large body of text, to spit out training data was fairly trivial. But it is difficult to predict what a model will memorize and copy. “You only really find out when you throw it out into the world and people use and abuse it,” Raffel explains. Given this, he was surprised to see that GitHub and OpenAI had chosen to train their model on code subject to copyright restrictions.
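For a sense of what such extraction looks like in practice, here is a minimal sketch, not a reproduction of the paper’s method, using the Hugging Face transformers library: prompt GPT-2 with a prefix it almost certainly saw many times in its web-scraped training text, and check whether greedy decoding reproduces the source verbatim. The prompt here is illustrative.

```python
# Minimal memorization probe: feed GPT-2 a well-known prefix and see
# whether it completes the passage verbatim. Illustrative sketch only;
# assumes the `transformers` library (and its torch backend) is installed.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# A prefix likely to appear many times in web-scraped training data.
prompt = "We hold these truths to be self-evident, that all men"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding (no sampling): memorized passages tend to surface when
# the verbatim continuation is the single highest-probability one.
output = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```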

According to GitHub’s internal testing, direct copying occurs in roughly 0.1% of Copilot’s outputs, a surmountable error, according to the company, and not an inherent flaw in the AI model. That is enough to give pause to the legal department of any for-profit entity (“non-zero risk” is just “risk” to a lawyer), but Raffel notes it may not be that different from an employee copying restricted code. Humans break the rules regardless of automation. Ronacher, the open source developer, adds that most of Copilot’s copying appears to be relatively harmless: cases where simple solutions to problems come up over and over again, or oddities like the infamous Quake code, which has been (improperly) copied by people into many different code bases. “You can make Copilot trigger some hilarious things,” he says. “If used as intended, I think it will be less of a problem.”
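The snippet in question is the fast inverse square root from Quake III Arena, instantly recognizable for its “magic” constant, which is part of why verbatim copies of it are so easy to spot. A rough Python re-creation, using struct to stand in for the C original’s pointer-cast bit reinterpretation, gives a sense of it:

```python
import struct

def q_rsqrt(number: float) -> float:
    """Approximate 1/sqrt(x) with the bit-level trick from Quake III Arena.

    The C original reinterprets the float's bits as an integer, applies the
    famous magic constant, then refines with one Newton-Raphson step.
    struct.pack/unpack mimics the pointer cast (32-bit float assumed).
    """
    x2 = number * 0.5
    # Reinterpret the float's bits as a 32-bit unsigned integer.
    i = struct.unpack("<I", struct.pack("<f", number))[0]
    i = 0x5F3759DF - (i >> 1)        # the magic constant
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    y = y * (1.5 - (x2 * y * y))     # one Newton-Raphson iteration
    return y

print(q_rsqrt(4.0))  # ~0.499, versus the exact value 0.5
```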

GitHub also indicated that it has a possible solution in the works: a way to flag verbatim outputs when they occur so that programmers and their lawyers know not to reuse them commercially. But building such a system isn’t as straightforward as it sounds, Raffel notes, and it gets at the larger problem: what if the output is not verbatim, but a near copy of the training data? What if only the variable names have been changed, or a single line has been expressed a different way? In other words, how much modification is required before the system is no longer copying? With code-generation software in its infancy, the legal and ethical boundaries are not yet clear.
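A toy sketch shows why renamed variables defeat naive flagging: exact string comparison misses the copy, while a comparison that normalizes identifiers still catches it. The function names here are hypothetical, and real clone detectors are far more sophisticated than this.

```python
# Sketch: a "copy" with renamed variables evades exact matching but
# survives identifier-normalized comparison. Uses only the standard library.
import io
import token
import tokenize

def normalized_tokens(source: str) -> list[str]:
    """Tokenize Python source, replacing identifiers with 'ID' and
    literals with 'LIT', so that renaming variables changes nothing."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == token.NAME:
            out.append("ID")
        elif tok.type in (token.NUMBER, token.STRING):
            out.append("LIT")
        elif tok.type == token.OP:
            out.append(tok.string)
    return out

original = "def mean(xs):\n    return sum(xs) / len(xs)\n"
renamed  = "def avg(values):\n    return sum(values) / len(values)\n"

print(original == renamed)                                         # False
print(normalized_tokens(original) == normalized_tokens(renamed))   # True
```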

Many legal scholars believe AI developers have fairly wide latitude when selecting training data, says Andy Sellars, director of the Technology Law Clinic at Boston University. “Fair use” of copyrighted material largely comes down to whether it is “transformative” when reused. There are many ways to transform a work, such as using it for parody or criticism or summarizing it, or, as courts have repeatedly found, using it as fuel for algorithms. In one landmark case, a federal court dismissed a lawsuit brought by a publishing group against Google Books, finding that its process of digitizing books and displaying snippets of text to let users search through them was an example of fair use. But how this translates to AI training data is not firmly settled, Sellars adds.

Putting code on the same footing as books and artwork is a little strange, he notes. “We treat source code as a literary work even though it bears little resemblance to literature,” he says. We might think of code as comparatively utilitarian; the task it performs matters more than how it is written. But in copyright law, the key is how an idea is expressed. “If Copilot spits out an output that does the same thing as one of its training inputs, with similar parameters and a similar result, but spits out different code, that probably won’t implicate copyright law,” he says.

The ethics of the situation are another matter. “There is no guarantee that GitHub has the best interests of independent coders at heart,” Sellars says. Copilot depends on the work of its users, including those who have explicitly tried to prevent their work from being reused for profit, and it may also reduce demand for those same coders by further automating programming, he notes. “We must never forget that there is no cognition in the model,” he says. It is statistical pattern matching. The ideas and creativity extracted from the data are all human. Some researchers have said that Copilot underscores the need for new mechanisms to ensure that those who produce the data for AI are fairly compensated.

