
Why is Copilot so bad?


Posted Jul 5, 2022 9:47 UTC (Tue) by farnz (subscriber, #17727)
In reply to: Why is Copilot so bad? by bluca
Parent article: Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Building the model isn't subject to copyright restriction (which I agree is right and proper - we don't place copyright restrictions on people picking up information from code they read), but using it might be, just as I might be infringing copyright if I accidentally type in a byte-for-byte identical copy of something I read during code review at a past job.

There's precedent for this in human creativity - former Beatle George Harrison lost a case for "subconscious plagiarism" (hat tip to rsidd) because he had listened to a song several years before writing a song with almost exactly the same melody. No copyright restrictions applied to George Harrison listening to the song he later infringed copyright on, but they did come into play once he created a "new" work that happened to be too similar to an existing work he knew about.

The same could well apply to Copilot - creating the model is OK (human analogy is consuming media), holding the model itself is OK (human analogy is having a memory of past work), but using the output of the model is infringement if it's regurgitated copyrightable code from its input ("subconscious plagiarism" in the Harrison case).



Why is Copilot so bad?

Posted Jul 5, 2022 10:54 UTC (Tue) by SLi (subscriber, #53131) [Link] (5 responses)

Code tends to be much more functional than "pure arts" like music. I doubt what you describe is possible for a human writing code (coming up with identical code might be, but in that case it's unlikely to be significant enough to be a copyright violation - and, you know, you are actually allowed to apply the generally useful things you have learned in previous jobs).

The copyright violation would have to be in the parts that remain to be filled in once you have copied the parts not protected by copyright—for example:

- the purpose ("what the code does", for example "reciprocal square root")
- how it does it, especially if it is the best or one of a limited number of good ways to do it (so, yes, perhaps counterintuitively the expression in a particularly clever code snippet might enjoy less protection)
- whatever is dictated by external factors (the magic numbers in the reciprocal square root code, say: there are probably other reasons why they are not protected, but they also aren't protected because they need to be exactly those numbers to work, as dictated by mathematical law); this also applies to whatever the coding style dictates
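As an illustration of that last point, the "reciprocal square root" presumably alludes to the well-known fast inverse square root routine popularized by Quake III Arena. A minimal C sketch (names are mine; the routine itself is folklore): the magic constant 0x5f3759df is fixed by the IEEE-754 float bit layout and the mathematics of the approximation, not by any authorial choice, which is exactly why such numbers get little or no protection.

```c
#include <stdint.h>
#include <string.h>

/* Fast inverse square root (the Quake III "magic number" routine).
   0x5f3759df is dictated by the IEEE-754 binary32 format: it must be
   exactly this value (or very close) for the approximation to work. */
static float q_rsqrt(float number)
{
    float x2 = number * 0.5f;
    float y = number;
    uint32_t i;

    memcpy(&i, &y, sizeof i);       /* reinterpret float bits, no UB */
    i = 0x5f3759df - (i >> 1);      /* initial approximation */
    memcpy(&y, &i, sizeof y);
    y = y * (1.5f - (x2 * y * y));  /* one Newton-Raphson refinement */
    return y;
}
```

After one Newton step the result is within roughly 0.2% of the true value, e.g. q_rsqrt(4.0f) is about 0.499 rather than exactly 0.5.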

So, in practice, for a small enough snippet that such an accident is plausible, what might remain - and what must pass the originality threshold to attain copyright protection - are things like:

- variable names—but if they are "normal" and not very creative (using "i" for a loop counter or "number" for a number), it doesn't contribute a whole lot
- Stylistic things that do not come directly from coding style or the way things are commonly done. How you group your code. Perhaps the order of some lines of code, where you insert blank lines (in cases where it would be unlikely for two coders to do it the same way), etc.
- Comments. Short, purely technically descriptive snippets are probably unlikely on their own to meet the originality threshold, but if you reproduce enough similar technical prose, even in the form of multiple short comments that alone aren't original enough, I think this might be your best bet for violating copyright.

The threshold for originality (in the US) is "low", but not nonexistent. Some things that have been deemed to not meet the threshold are (and remember that with code you need to meet it with what is left once you remove the substantial unprotected elements):

- Simple enough logos, even when there clearly is *some* creativity involved: https://en.wikipedia.org/wiki/Threshold_of_originality#/m...
- Blank forms
- Typefaces
- This vodka bottle: https://en.wikipedia.org/wiki/Threshold_of_originality#/m...

Why is Copilot so bad?

Posted Jul 5, 2022 11:08 UTC (Tue) by farnz (subscriber, #17727) [Link] (4 responses)

It's unlikely to happen with a human coding, simply because I'm not going to copy any copyright-significant decisions from a colleague - I may have a very similar snippet, but the details will change, because that's the nature of a human copying out code from memory. It's more likely to happen with Copilot, since it sometimes regurgitates complete snippets of its input, unchanged, and in a very literal manner.

This is why I suspect the legality of Copilot is currently a lot greyer than either side would like us to think; where it copies code that's not eligible for copyright protection, it may be obvious that it's copied something, but not an infringement because there's no protection to infringe (just as me copying #define U8_MAX ((u8)~0U) from the Linux kernel is not infringing, because there's nothing in there to protect). The risk, however, comes in when the snippet is something that's eligible for copyright protection; I note, for example, that Copilot sometimes outputs the comments that go with a code snippet from its input, which are more likely to be protected than the code itself.

My guess is that if it comes to court, the training process and model will be non-infringing definitionally, because the law says so in the first case, and because in the second case, it's not a reproduction of the copyrighted inputs. The output, however, will face the tests for whether it meets the bar for protection, and if it does and is a reproduction of someone else's work, then it could be deemed infringing; the fact that the model and its training process are not infringements does not guarantee that the output of Copilot is also non-infringing.

So on the GitHub side, the thing they're skating over is that the training process and the tool can be non-infringing without guaranteeing that the output is also non-infringing. On the SFC side, they're skating over the fact that a direct copy does not guarantee infringement, since not all code is eligible for protection. The truth all depends on what a judge says if such a case comes before them - and I'd expect to see that appealed to the highest legal authorities (Supreme Court in the USA).

Why is Copilot so bad?

Posted Jul 5, 2022 11:52 UTC (Tue) by SLi (subscriber, #53131) [Link] (2 responses)

Yeah, I'm not sure there are lots of people who both understand something about the law and would be willing to declare that it's clear cut either way. I definitely am not. My gut feeling is that it will be deemed legal, possibly with some minor changes or post-processing, but I wouldn't bet my life on it.

My more important point is that it *should* be legal as a matter of sane policy; that would also be the result that benefits free software, just as most pushback against copyright maximalism does.

Why is Copilot so bad?

Posted Jul 5, 2022 15:26 UTC (Tue) by farnz (subscriber, #17727) [Link] (1 responses)

I disagree that it should be legal - taking that position to an absurd extreme, if I train an ML model on Linux kernel versions alone, I could have an ML model that's cost me a few million dollars but that outputs proprietary kernels that are Linux-compatible and work on the hardware I care about. Effectively, copyright becomes non-existent for big companies who can afford to do this.

My position therefore depends strongly on what the tool actually outputs; if the snippets are such that they are not protected by copyright in their own right, and the tool only outputs unprotected snippets, then I'm OK with it; this probably needs some filtering on the output of the tool to remove known infringing snippets, which I'm also fine with ensuring is legal (it should not be infringement to include content purely for the purpose of ensuring that that content is not output by the tool - fair use sort of argument).

I also very strongly believe that the model itself should not be copyright infringement in and of itself - it's the output that may or may not be infringing, depending on how you use it, and it's the user of the model who infringes if they use infringing output from the model. That may sound like splitting hairs, but it means that Copilot and similar systems are fine, legally speaking, as are any other models trained from publicly available data. It's only the use you put them to that needs care - you could end up infringing by using a tool that is capable of outputting protected material, and it's on the tool user to watch for that and not accept infringing outputs from their tools.

Why is Copilot so bad?

Posted Jul 5, 2022 17:37 UTC (Tue) by nim-nim (subscriber, #34454) [Link]

> I also very strongly believe that the model itself should not be copyright infringement in and of itself - it's the output that may or may not be infringing, depending on how you use it, and it's the user of the model who infringes if they use infringing output from the model. That may sound like splitting hairs, but it means that Copilot and similar systems are fine, legally speaking, as are any other models trained from publicly available data.

I suspect that would very much depend on whether someone manages to find a business model where a model, trained on someone else's copyrighted production, makes a lot of money on its own (not via the output of original-work copycats). People and lawmakers tend to take a dim view of someone making a lot of money from other people's belongings without those people getting a cut.

I doubt, for example, that the pharmaceutical companies will manage to escape forever paying back the countries whose fauna/flora they sampled to create medicines. The pressure will only grow with climate change and such natural products becoming harder to preserve.

Why is Copilot so bad?

Posted Jul 5, 2022 15:35 UTC (Tue) by nye (subscriber, #51576) [Link]

> My guess is that if it comes to court, the training process and model will be non-infringing definitionally, because the law says so in the first case, and because in the second case, it's not a reproduction of the copyrighted inputs. The output, however, will face the tests for whether it meets the bar for protection, and if it does and is a reproduction of someone else's work, then it could be deemed infringing; the fact that the model and its training process are not infringements does not guarantee that the output of Copilot is also non-infringing.

This seems eminently reasonable, and appears (from the outside, of course) to be the same conclusion that Microsoft's lawyers have reached. As far as I'm aware they haven't made an explicit statement on the matter, but I think it's reasonable to infer the first part (training process and model) from the fact that they approved the release of the software, and the second part (the status of the output) from the fact that they recommend "IP scanning" of any output that you use.

At least in the EU the law is clear enough that it's hard to see how there could really be any other interpretation. I'm not sure whether we have the same laws regarding ML data collection here in Brexit Britain, or whether that came too late.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds