Why is Copilot so bad?

Posted Jul 2, 2022 22:57 UTC (Sat) by Wol (subscriber, #4433)
In reply to: Why is Copilot so bad? by SLi
Parent article: Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

But that's what the word "derivative" means!

The output of Copilot is derived from its inputs. Therefore, by the definition of the word "derive", any and all output is a derivative of the input that was used to create it.

The only question is whether copyright law considers it a legal derivative work, and hence subject to licence, or trivial, and hence not subject to licence.

Any attempt to argue otherwise is basically playing Humpty Dumpty. The law does not define the word "derivative" as far as I know, so it means (approximately) what it means in common English. To argue that the output is not a derivative work is to argue that the English language is meaningless ...

(Oh, and while I don't know what the legal implications are, remember that the EU treats "works" and "data" separately. Saying that it's perfectly acceptable to treat works in public view as data fits nicely into the EU directive saying you can *train* an AI on public "works" by treating them as data. But if you then treat the output as a work, you are promptly putting it back under copyright rules ...)

Cheers,
Wol


Why is Copilot so bad?

Posted Jul 3, 2022 6:38 UTC (Sun) by NYKevin (subscriber, #129325) (5 responses)

US law does indeed define "derivative work" in 17 USC 101 as follows:

> A “derivative work” is a work based upon one or more preexisting works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted. A work consisting of editorial revisions, annotations, elaborations, or other modifications which, as a whole, represent an original work of authorship, is a “derivative work”.

Do not ask why they offer two completely different definitions back-to-back, because I have no idea. That's how it is in the statute book, and (presumably) how Congress wrote it.

Unfortunately, they do not define the word "work," and that's really the sticking point here. If I count the number of "E"s in a novel, and publish that number on a website, surely the number is not a derivative work of the novel, despite the fact that it has been "transformed" from the novel. A number is not a (creative) work, so it can't be a derivative work. But where do you draw the line? This definition does not tell us.
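To make the letter-counting example concrete, here is a minimal sketch in Python. The function name `count_letter` and the sample strings are hypothetical, chosen only for illustration. It shows that the count is a many-to-one summary: two different texts can yield the same number, so nothing of the original expression can be recovered from it.

```python
def count_letter(text, letter="e"):
    """Count occurrences of a letter, ignoring case."""
    return text.lower().count(letter.lower())

# Two different (hypothetical) texts that produce the same count.
text_a = "Eve"
text_b = "eel"

same_count = count_letter(text_a) == count_letter(text_b)
print(count_letter(text_a), count_letter(text_b), same_count)
```

Whether such a number counts as a "work" is still a legal question; the sketch only illustrates how little of the input survives the transformation.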

Why is Copilot so bad?

Posted Jul 4, 2022 9:06 UTC (Mon) by nim-nim (subscriber, #34454) (1 responses)

What you produce with the ML suggestions is definitely a work in the eyes of the law.

You’re trying to nitpick by claiming that if you add a sufficient number of indirections and hair-splitting in the derivation steps, it’s not (legally) a derivation.

But even if a judge agreed to follow this kind of reasoning (most would reject it out of hand; that’s basically muddying the waters, and a judge’s core job is to un-muddle what the parties present in order to reach a verdict), it also works the other way:

“if splitting steps till a suggestion is too small to be considered a derivation in law works, how many of those tidbits can you combine the other way till you reach the critical mass and the end result is definitely protected?”

You can’t exempt yourself from legal obligations via technical foolery. It does not work that way law-side.

Why is Copilot so bad?

Posted Jul 6, 2022 19:48 UTC (Wed) by NYKevin (subscriber, #129325)

No, I'm really not. I'm talking about the model parameters, which are a bunch of opaque numbers (the weights on the NN). I agree that you have to assess the final output separately, and that the intermediate steps are irrelevant. My point is merely that the model parameters are arguably not a "work" within the meaning of copyright law, and so you might not be able to sue GitHub.

Of course, even if the model *is* a derivative work, you still can't sue GitHub for creating the model, because the GPL places no restrictions on creating derivative works unless you distribute them to other people. You could allege an AGPL violation, if you can find a specific AGPL codebase and prove that GitHub actually used it as an input, but that's much harder. Or you can allege that the output of the model is a derivative work, at which point you don't have to worry about whether the model itself is a derivative work (copyright law doesn't care), but then you really do need to identify specific inputs that are "substantially similar" to one particular output, and that might be difficult as well.

Why is Copilot so bad?

Posted Jul 20, 2022 14:49 UTC (Wed) by ghane (guest, #1805) (2 responses)

> If I count the number of "E"s in a novel, and publish that number on a website, surely the number is not a derivative work of the novel, despite the fact that it has been "transformed" from the novel.

I have a question, sparked by your comment above.

1. Would an index of a work be a Derivative?

2. Would a concordance of a work be a Derivative?

In my mind, this is a single question :-)

Note that both of these can be done by a human or a program, with the exact same output.

Surely this must have been litigated somewhere already.

ISTR that because of the delays in publishing the Dead Sea Scrolls, a group in the 1990s published a complete concordance, thus making the texts substantially available to researchers.

Why is Copilot so bad?

Posted Jul 20, 2022 15:13 UTC (Wed) by Wol (subscriber, #4433) (1 responses)

I suspect both the index and the concordance would be classed as a database, and thus not subject to copyright (but subject to other laws, certainly as far as the EU is concerned).

Failure to publish is another major problem, because I believe most copyright laws apply to *legally* *published* work, from *the date of publication*.

There are plenty of cases of unpublished works being kept out of the public eye, and I can think of at least one where somebody published excerpts of a 200-yr-old work. But because he owned the original, nobody could get their hands on the complete work.

Another famous example of this sort of thing is Queen Victoria's diaries. She published the early ones unexpurgated. But when she died, her daughter and literary executor published "sanitised" and heavily edited versions, destroying the originals. Her nephew, George V, was horrified at such vandalism but was powerless. So the later diaries are missing roughly 2/3rds of their original content :-(

But certainly as far as Co-Pilot is concerned, I think your references to indices and concordances miss the point. They may be new works of scholarship, but they are intended to direct you back to the original. Co-Pilot, it seems, hides the original from you, so you are "quoting blind" if you use its output.

Cheers,
Wol

Why is Copilot so bad?

Posted Jul 20, 2022 16:10 UTC (Wed) by ghane (guest, #1805)

Thank you for your inputs.

I have found a reference to what I remembered: https://www.nytimes.com/1991/09/05/world/computer-breaks-...

I was specifically not referring to Copilot, but asking in general. However,

> But certainly as far as Co-Pilot is concerned, I think your references to indices and concordances miss the point. They may be new works of scholarship, but they are intended to direct you back to the original. Co-Pilot, it seems, hides the original from you, so you are "quoting blind" if you use its output.

Note that the text reconstructed by mining the concordance and indexes was useful precisely because the original was not available. If it had been, the reconstructed text would have been useless. They specifically claimed that the reconstructed text was not new in any way; it had been done by a computer(!). The original text guys, paradoxically, while calling them "pirates", claimed the reconstructed text was not the same as the original, and hence had no value.
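As a rough illustration of why a concordance can make a text "substantially available", here is a minimal Python sketch. It assumes a simplified concordance that records the position of every word occurrence; this is an assumption for illustration only, not a description of how the Dead Sea Scrolls concordance was organised or how the 1991 reconstruction was actually done. The function names and the sample sentence are hypothetical.

```python
from collections import defaultdict


def build_concordance(text):
    """Map each word to the list of word positions where it occurs."""
    concordance = defaultdict(list)
    for position, word in enumerate(text.split()):
        concordance[word].append(position)
    return dict(concordance)


def reconstruct(concordance):
    """Rebuild the word sequence from a positional concordance."""
    placed = {}
    for word, positions in concordance.items():
        for position in positions:
            placed[position] = word
    # Positions are assumed to be contiguous, starting at 0.
    return " ".join(placed[i] for i in range(len(placed)))


# Hypothetical sample text standing in for the unavailable original.
original = "in the beginning was the word and the word was written"
concordance = build_concordance(original)
rebuilt = reconstruct(concordance)

print(concordance["the"])
print(rebuilt == original)
```

Under that assumption the mapping is fully reversible, which is the opposite of the letter-count case: the same mechanical process that produces the concordance can be run backwards to recover the word sequence.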