Why is Copilot so bad?

Posted Jul 2, 2022 23:15 UTC (Sat) by anselm (subscriber, #2796)
In reply to: Why is Copilot so bad? by SLi
Parent article: Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

> Copyright does not, in general, work so that everything that uses a work as input and produces an output necessarily would 1) either need a specific license to do so, or 2) produce outputs that are legally derived works or require a license of the input, even if the outputs are both complex and useful.

No. But if the device produces output that can be identified as a nontrivial part of a copyrighted work (e.g., a function definition), then the fact that it used an “AI model” does not mean it is somehow magically exempt from infringing on the copyright of that work.

In other words, if I produced that output myself by cutting and pasting the part in question from the original copyrighted work, I would obviously be infringing on its copyright. If Copilot produced the same output by passing the original copyrighted work through an AI model, why should that not be a copyright issue?



Why is Copilot so bad?

Posted Jul 4, 2022 12:06 UTC (Mon) by nye (subscriber, #51576) [Link] (11 responses)

> In other words, if I produced that output myself by cutting and pasting the part in question from the original copyrighted work, I would obviously be infringing on its copyright. If Copilot produced the same output by passing the original copyrighted work through an AI model, why should that not be a copyright issue?

Did anyone claim that it wouldn't?

Why is Copilot so bad?

Posted Jul 4, 2022 21:24 UTC (Mon) by anselm (subscriber, #2796) [Link] (7 responses)

Microsoft seems to think so. (They also claim that it doesn't happen very often, as if that was a valid excuse.)

Why is Copilot so bad?

Posted Jul 5, 2022 15:12 UTC (Tue) by nye (subscriber, #51576) [Link] (6 responses)

> Microsoft seems to think so

No they do not. They have not claimed that. They will not claim that. This straw man is *ridiculous* and seeing it repeated so often makes me want to scream.

The fundamental assertion that they're implicitly making by publishing copilot is that output from copilot is not automatically, ipso facto, an infringement of the license on its training data.

You[0] seem to be claiming that this further implies an assertion that the output from copilot is automatically, ipso facto, not an infringement of the license on its training data. Rather like claiming that "not all people are men" implies "all people are not men".

But not only are Github/MS not saying that, they are saying the opposite. In fact, what they *actually* say is this:

> You should take the same precautions as you would with any code you write that uses material you did not independently originate.
> These include rigorous testing, *IP scanning*, and checking for security vulnerabilities
(emphasis mine)

[0] In the plural sense. I imagine you *personally* have just been misled by the people making up this straw man, since it's so common.

Why is Copilot so bad?

Posted Jul 5, 2022 15:58 UTC (Tue) by anselm (subscriber, #2796) [Link] (5 responses)

> These include rigorous testing, *IP scanning*, and checking for security vulnerabilities (emphasis mine)

In other words, they want us to perform the due diligence that they're not prepared to do themselves. This does not detract from the fact that they're misleading Copilot users about the copyright status of the code that Copilot emits, and so they're potentially violating licenses such as the GPL or the BSD license, which stipulate that code covered by them may only be passed on if the license grant is passed on with it.

Why is Copilot so bad?

Posted Jul 6, 2022 11:17 UTC (Wed) by nye (subscriber, #51576) [Link] (4 responses)

> they're misleading Copilot users about the copyright status of the code that Copilot emits

That directly contradicts the part of my comment that you quoted! Where are you getting this? Why do you think that you can tell such blatant lies and not get called out? I'm... well actually I'm just speechless at this point. I guess there's not much point continuing any further.

Why is Copilot so bad?

Posted Jul 6, 2022 13:12 UTC (Wed) by anselm (subscriber, #2796) [Link] (3 responses)

From what I've seen, Copilot does not annotate its suggestions with information about the status of the material it derives these suggestions from. That is, Copilot is “misleading” recipients of code snippets about their copyright status by not saying anything about their copyright status at all, instead requiring the recipients to figure out for themselves whether the snippets are copyrighted (and if so, under which license, if any, they may be used).

This may be justified from Github's/Microsoft's POV because many of the suggestions Copilot makes may be too trivial, or too much like very obvious boilerplate, to qualify for copyright protection in the first place, but there is no guarantee of that. Accidentally including, e.g., GPL material from Copilot output in their own non-GPL projects is a risk that Copilot users need to deal with somehow.

Nobody would have a problem with Copilot if Copilot said, where appropriate, “This code snippet derives from code licensed under the GPL”, because anyone receiving such a code snippet could then decide for themselves whether they wanted to accept it on those terms and act accordingly. (It would depend on the nature of the snippet in question whether this is an actual problem; e.g., three lines of schematic boilerplate from a GPL project are probably fairly innocuous to take over even for non-GPL code, but a nontrivial piece of nonobvious code might be more of an issue.) It would certainly suggest more effectively that the Copilot project is acting in good faith than simply sticking one's head in the sand.

Why is Copilot so bad?

Posted Jul 6, 2022 20:00 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (2 responses)

Unfortunately, that's just not how models of this sort work. It is not a search engine. It is a GAN, or at least something similar to a GAN. The generative side of the model never even "sees" the inputs in the first place, it just gets feedback on how well it can fool the other (discriminator) side of the model. It has no idea where its suggestions come from or how similar they are to its inputs, it just knows that "when I suggest code that looks like this, I get positive feedback."
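Whatever the exact architecture, the dynamic described here, where the generative side only ever receives a scalar score and never sees the training inputs themselves, can be sketched with a toy adversarial loop (purely illustrative; all names and the one-parameter "generator" are made up for this sketch):

```python
import random

random.seed(42)  # make the sketch deterministic

# Toy "real data": samples clustered around 4.0.
def real_sample():
    return 4.0 + random.uniform(-0.5, 0.5)

class Generator:
    """Holds a single parameter; it never sees real data, only feedback."""
    def __init__(self):
        self.mu = 0.0

    def sample(self):
        return self.mu + random.uniform(-0.5, 0.5)

    def update(self, feedback, lr=0.1):
        # The generator's entire world is this scalar score.
        self.mu += lr * feedback

class Discriminator:
    """Scores how "real" a value looks: here, closeness to the running
    mean of the real samples it has observed."""
    def __init__(self):
        self.mean = 0.0
        self.n = 0

    def observe_real(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n

    def score(self, fake):
        # +1 if the generator should move up to look more real, else -1.
        return 1.0 if fake < self.mean else -1.0

g, d = Generator(), Discriminator()
for _ in range(500):
    d.observe_real(real_sample())
    g.update(d.score(g.sample()))
# g.mu drifts toward ~4.0 without the generator ever seeing a real sample.
```

The point of the sketch is the information flow: the generator ends up imitating the training distribution while having no record of which inputs shaped the feedback, which is why provenance cannot simply be read back out of the model.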

The whole "we'll tell you if your code looks similar to input data" thing is a search engine layered on top of Copilot, but that's really only going to be useful for very close matches. It doesn't have the smarts to say "well, this actually came from codebase X, even though it looks completely different to X."
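A minimal sketch of the kind of near-duplicate detection such a layer might use is Jaccard similarity over character n-grams ("shingles"); the code and example strings below are hypothetical, chosen to show why this catches verbatim copies but not restructured ones:

```python
def shingles(text, n=5):
    """Set of character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def similarity(a, b, n=5):
    """Jaccard overlap of shingle sets: 1.0 for verbatim copies,
    dropping toward 0.0 once the code is renamed or restructured."""
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

original = "def gcd(a, b):\n    while b: a, b = b, a % b\n    return a"
verbatim = original
renamed = ("def greatest_common_divisor(x, y):\n"
           "    while y: x, y = y, x % y\n    return x")
```

A verbatim copy scores 1.0, but simply renaming the variables already drives the score down sharply, which is the "very close matches only" limitation described above.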

Why is Copilot so bad?

Posted Jul 7, 2022 7:26 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (1 responses)

Unless you make one model for GPL code, another for MIT code, etc… Then you know the legal color of your suggestions.

Why is Copilot so bad?

Posted Jul 7, 2022 17:34 UTC (Thu) by NYKevin (subscriber, #129325) [Link]

You cannot comply with even the MIT license unless you know exactly whom to attribute. You can't just say "Uh, it's MIT licensed, but I don't know where it came from." The same goes for the GPL.

Why is Copilot so bad?

Posted Jul 4, 2022 22:42 UTC (Mon) by sfeam (subscriber, #2841) [Link] (2 responses)

Can text generated by random generation of sequential characters constitute copyright infringement? An unequivocal "yes" answer seems to sound the death knell for clean-room implementations. An unequivocal "no" answer lets Copilot off the hook.

This starts to sound very close to the classic "infinite number of monkeys typing at random" scenario. Are the monkeys inevitably guilty of copyright violation?

The stereotypical madly-typing monkeys generate text strings where each character c is generated with uniform probability P(c) and accepted into the output text independent of that monkey's previous typing history. What if we bias P(c) to favor more readable text (give the monkeys Dvorak keyboards?). What if we filter acceptance by previous history (Markov filters? trained monkeys?). What if we house the monkeys in a black box and label it "Copilot"? What if we replace the monkeys with a neural net? Where in this process of refining the scenario does the possibility of copyright violation creep in, if anywhere?
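The escalation from uniform typing to history-filtered typing can be sketched in a few lines (purely illustrative; the corpus and function names are made up):

```python
import random

CORPUS = "to be or not to be that is the question"

def uniform_monkey(n, alphabet="abcdefghijklmnopqrstuvwxyz "):
    """Classic monkey: each character drawn independently and uniformly."""
    return "".join(random.choice(alphabet) for _ in range(n))

def markov_monkey(n, corpus=CORPUS):
    """'Trained' monkey: each character drawn from the distribution of
    characters that followed the previous character in the corpus."""
    # First-order transition table built from the corpus.
    table = {}
    for prev, nxt in zip(corpus, corpus[1:]):
        table.setdefault(prev, []).append(nxt)
    out = [random.choice(corpus)]
    for _ in range(n - 1):
        candidates = table.get(out[-1])
        out.append(random.choice(candidates) if candidates
                   else random.choice(corpus))
    return "".join(out)
```

The Markov monkey can only ever emit characters that appear in its training corpus, so the more the acceptance filter is conditioned on the source material, the less surprising it is when recognizable fragments of that material come back out.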

Why is Copilot so bad?

Posted Jul 5, 2022 0:57 UTC (Tue) by anselm (subscriber, #2796) [Link]

> Where in this process of refining the scenario does the possibility of copyright violation creep in, if anywhere?

It doesn't really matter exactly how the monkeys came up with the copy. Your copyright problem starts where you take the monkeys' output, which is demonstrably identical to a preexisting copyrighted work, and pass it off as something you're entitled to dispose of as you please, because the original copyright holder's claim that – never mind those monkeys – you just ripped off their stuff will be difficult for you to refute. (In the case of Copilot, this is, if anything, more difficult, because you effectively showed the monkeys the original copyrighted work first, so their coming up with a verbatim copy eventually will surprise nobody.)

Why is Copilot so bad?

Posted Jul 5, 2022 22:19 UTC (Tue) by hummassa (guest, #307) [Link]

1. IANAL (but in another point in my life I was a paralegal in DA office and I participated in legal research, about prosecuting copyright violators)

2. the answer to
> Are the monkeys inevitably guilty of copyright violation?

is: the monkeys are never guilty (monkeys are not people, only people can be guilty)... but if you copy, distribute, or perform the work received from the monkeys in public, then you are.

3. clarifying the last part: if the monkeys (or the ML model) produces a copyrightable piece of a copyrighted work, then the monkeys are nothing else than another medium where the copyrightable work is fixed. The monkeys, or the ML model, are just like an HD or a DVD-RW or a physical, printed book.

4. so: if a random number generator generates the number that when represented in binary is the same number as the image of a Blu-Ray of "Avengers: Endgame", this does ABSOLUTELY NOTHING over the copyright of the work. You can't burn it to a BR and play it in a public setting. You can't even copy it. Only Sony/Marvel and their licensees can.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds