Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 10:12 UTC (Fri) by bluca (subscriber, #118303)
In reply to: Software Freedom Conservancy: Give Up GitHub: The Time Has Come! by Karellen
Parent article: Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

I am not a judge nor a lawyer - obligatory premise.

To me it seems pretty obvious: the work is not consumed under the terms of the license, whatever it might be, so the license doesn't apply to anything that is produced from it. If there's a dual-licensed project, GPL+commercial (as is quite common), and I buy the commercial license, anything I do with it is not affected by the terms of the GPL, because that's not how I got the project. In the same way, TDM copyright exceptions are what allow me to train a model on anything publicly accessible, which means I do not see how any claims about the output of the model being subject to the original licenses of the input hold water. The original license is irrelevant, because the law gives me an exception. That is a good thing by the way, we need more exceptions to our ever-more-draconian copyright laws.

Now on the question of whether the output of the model is a derived work - under copyright law, and not under the terms of whatever the original license was - that sounds complicated, but it definitely does not seem as clear-cut as "Infringement!" as some maximalist takes make it sound. When Copilot was first announced, Felix Reda (who was actually an MEP when these laws were written) wrote an excellent article that touched on that, and it still applies today:

https://felixreda.eu/2021/07/github-copilot-is-not-infrin...


Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 11:06 UTC (Fri) by Karellen (subscriber, #67644) [Link] (2 responses)

Thanks for the link, it's a very interesting read.

Going to the "Machine-generated code is not a derivative work" section, I don't think it's as clear-cut as the author of that piece makes out.

Firstly:

copyright conflicts would constantly arise when two authors use the same trivial statement independently of each other, such as “Bucks beats Hawks and advance to the NBA finals”, or “i = i+1”. The short code snippets that Copilot reproduces from training data are unlikely to reach the threshold of originality.

I think there's an important difference here. Obviously, it's possible for two small snippets of work to be identical, and still have been generated independently, with neither snippet having its origins in the other. But the output of CoPilot does have its origins in the code it is trained on. It has colour.

From the classic 2004 essay (written by a lawyer) What Colour are Your Bits?:

I think Colour is what the designers of Monolith are trying to challenge, although I'm afraid I think their understanding of the issues is superficial on both the legal and computer-science sides. The idea of Monolith is that it will mathematically combine two files with the exclusive-or operation. You take a file to which someone claims copyright, mix it up with a public file, and then the result, which is mixed-up garbage supposedly containing no information, is supposedly free of copyright claims even though someone else can later undo the mixing operation and produce a copy of the copyright-encumbered file you started with. Oh, happy day! The lawyers will just have to all go away now, because we've demonstrated the absurdity of intellectual property!

The fallacy of Monolith is that it's playing fast and loose with Colour, attempting to use legal rules one moment and math rules another moment as convenient. When you have a copyrighted file at the start, that file clearly has the "covered by copyright" Colour, and you're not cleared for it, Citizen. When it's scrambled by Monolith, the claim is that the resulting file has no Colour - how could it have the copyright Colour? It's just random bits! Then when it's descrambled, it still can't have the copyright Colour because it came from public inputs. The problem is that there are two conflicting sets of rules there. Under the lawyer's rules, Colour is not a mathematical function of the bits that you can determine by examining the bits. It matters where the bits came from. The scrambled file still has the copyright Colour because it came from the copyrighted input file. It doesn't matter that it looks like, or maybe even is bit-for-bit identical with, some other file that you could get from a random number generator. It happens that you didn't get it from a random number generator. You got it from copyrighted material; it is copyrighted. The randomly-generated file, even if bit-for-bit identical, would have a different Colour. The Colour inherits through all scrambling and descrambling operations and you're distributing a copyrighted work, you Commie Mutant Traitor.

Emphasis in original - it matters where the bits came from. But the whole thing is worth reading, if you've not seen it already.
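
For anyone who hasn't seen the XOR trick the essay describes, here is a minimal sketch of the mixing and un-mixing step. It is not Monolith's actual code; the function name and the two sample inputs are made up for illustration, and the "public file" is just random bytes of the same length.

```python
import os


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings together."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


# Hypothetical stand-ins: 'covered' plays the copyright-encumbered file,
# 'public' plays the freely available file it gets mixed with.
covered = b"some copyrighted text"
public = os.urandom(len(covered))

mixed = xor_bytes(covered, public)     # looks like noise on its own
recovered = xor_bytes(mixed, public)   # applying the same XOR undoes the mixing

print(recovered == covered)
```

The maths is perfectly reversible, which is exactly the essay's point: nothing about the bits of `mixed` tells you where they came from, yet they can be turned straight back into the original.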

Secondly:

On the other hand, the argument that the outputs of GitHub Copilot are derivative works of the training data is based on the assumption that a machine can produce works. This assumption is wrong and counterproductive. Copyright law has only ever applied to intellectual creations – where there is no creator, there is no work. This means that machine-generated code like that of GitHub Copilot is not a work under copyright law at all, so it is not a derivative work either. The output of a machine simply does not qualify for copyright protection – it is in the public domain. That is good news for the open movement and not something that needs fixing.

(Emphasis mine.) Going back to the compiler analogy, this paragraph seems to imply that the output of compilers does not qualify for copyright protection - which is clearly absurd. And just because CoPilot doesn't produce output which corresponds to all of its input, that shouldn't matter either. Compilers throw away comments. And dead code. And redundant instructions. (Given suitably clever optimisation passes.) But that machine-generated output still qualifies for copyright protection.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 11:23 UTC (Fri) by Karellen (subscriber, #67644) [Link]

Correction: The author of "What Colour are Your Bits?" is not a lawyer. I got confused where they said "We lawyers think about colour..." in a place that was not a (hypothetical) quote. My bad. I think the essay still stands though.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 11:41 UTC (Fri) by Wol (subscriber, #4433) [Link]

> Thanks for the link, it's a very interesting read.

> Going to the "Machine-generated code is not a derivative work" section, I don't think it's as clear-cut as the author of that piece makes out.

Assuming the law were applied correctly (which it usually isn't :-( ), machine-generated code is a derivative work (it's a translation) of the original input. The transformation applied by the machine does not create or destroy copyright. So the machine-generated output is, FOR COPYRIGHT PURPOSES, IDENTICAL to the original input.

Cheers,
Wol

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 11:37 UTC (Fri) by Wol (subscriber, #4433) [Link] (21 responses)

> To me it seems pretty obvious: the work is not consumed under the terms of the license, whatever it might be, so the license doesn't apply to anything that is produced from it.

But as I keep hearing repeated, to consume the work outwith the licence IT HAS TO BE FOR ACADEMIC / LEARNING PURPOSES.

So if Copilot is used to produce academic research papers or teaching material (ie books etc), then that's fine.

But if it's used to provide programming prompts and snippets of code to copy and use, THEN THE EXCEPTION DOES NOT APPLY, AND COPYRIGHT DOES APPLY.

So ANY AND ALL code supplied to the general populace is suspect. In other words, if I work for any programming shop, be it software house or end user, and I incorporate code from Copilot into my work, that code is copyright the original author. And that author's copyright applies! Which means I damn well better know where Copilot got it from!!! Okay, many snippets may be too small for copyright to apply, but that's a completely different argument.

tldr; if you're using Copilot to help you WRITE code (as opposed to providing you with study material), you are almost certainly breaking Copyright Law.

And if you're using Copilot to provide study material you're an idiot. It's teaching you the consensus method, not the correct method.

So just don't use copilot :-)

Cheers,
Wol

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 12:37 UTC (Fri) by bluca (subscriber, #118303) [Link] (20 responses)

> But as I keep hearing repeated, to consume the work outwith the licence IT HAS TO BE FOR ACADEMIC / LEARNING PURPOSES.

That is factually wrong, and using all caps doesn't make it right. Consuming legally accessible public corpora for TDM is allowed for any purpose under the EU directive. The only difference is that academic institutions are allowed to ignore generic opt-outs.

There is currently no mechanism to express such opt-out, like you can for scrapers with a robots.txt. The W3C is working on a common spec for that: https://www.w3.org/2022/tdmrep/
Of course it's a generic opt-out, you can't pick and choose the parsers you don't like.
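
To illustrate the kind of machine-readable opt-out that already exists for scrapers, here is a small sketch of a robots.txt check using Python's standard urllib.robotparser. This is only the robots.txt analogue, not the TDMRep draft, and it makes no claim about what counts as a valid reservation under Article 4(3). The robots.txt content, the agent names and the URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site: one named crawler is
# excluded entirely, everyone else is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleMiner
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt allows 'agent' to fetch 'url'."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(may_fetch("ExampleMiner", "https://example.org/src/main.c"))
print(may_fetch("OtherBot", "https://example.org/src/main.c"))
```

Note that robots.txt lets a site single out individual crawlers by name, which is the pick-and-choose behaviour a generic opt-out would not offer.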

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 14:50 UTC (Fri) by bluca (subscriber, #118303) [Link] (19 responses)

Let's quote verbatim to nip this in the bud once and for all:

> TITLE II
>
> MEASURES TO ADAPT EXCEPTIONS AND LIMITATIONS TO THE DIGITAL AND CROSS-BORDER ENVIRONMENT
>
> <...>
>
> Article 4
>
> Exception or limitation for text and data mining
>
> 1. Member States shall provide for an exception or limitation to the rights provided for in Article 5(a) and Article 7(1) of Directive 96/9/EC, Article 2 of Directive 2001/29/EC, Article 4(1)(a) and (b) of Directive 2009/24/EC and Article 15(1) of this Directive for reproductions and extractions of lawfully accessible works and other subject matter for the purposes of text and data mining.
>
> 2. Reproductions and extractions made pursuant to paragraph 1 may be retained for as long as is necessary for the purposes of text and data mining.
>
> 3. The exception or limitation provided for in paragraph 1 shall apply on condition that the use of works and other subject matter referred to in that paragraph has not been expressly reserved by their rightholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online.
>
> 4. This Article shall not affect the application of Article 3 of this Directive.

https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32019L0790&from=EN

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 15:05 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)

In that case, does that mean copilot is over-reaching itself? Okay, in the EU data is covered differently from text, but the exception appears to be for DOING the mining.

Which search engines like Google will love. It (and I'm quite happy with this) makes it legal for them to have huge search databases.

But there's a very big difference between using that mined data to direct people back to the original source document, and outputting something based on that source (essentially creating a derived document) to be passed on to a third party without the first party knowing anything about it.

Maybe the grounds for feeling that way have changed, but I still feel that actually *using* the output from Copilot for pretty much anything other than study is a very dangerous occupation, and maybe even using it for study ...

Cheers,
Wol

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 15:10 UTC (Fri) by bluca (subscriber, #118303) [Link]

During the proceedings legislators explicitly talked about AI/ML applications and development benefiting from the changes and clarity brought forward by this directive. So no, it's definitely not "over-reaching" to use this beyond indexing purposes.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 15:18 UTC (Fri) by ldearquer (guest, #137451) [Link] (16 responses)

Not an expert, but I think you may be confusing training and real world operation.
For training it seems OK to use whatever is lawfully accessible, and retain copies for as long as training lasts for this purpose.
But I don't see how that adds any exception on copyright for real world usage of your trained system.

As an example, if you have a neural network that identifies the music style of an input song, I understand you may use copyrighted stuff for training your system. In real world usage, a user may input some song, and your system may respond "that's likely country music". But the inverse, where user input is "country music" and your system starts giving away excerpts of copyright protected songs...

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 1, 2022 22:58 UTC (Fri) by bluca (subscriber, #118303) [Link] (15 responses)

'real world operation' doesn't really mean much. The output of Copilot is quite clearly either transformative (and as such not a copyright issue) or so small that it most likely doesn't qualify for originality. In either case, the original license is irrelevant, so the outrage about supposed 'GPL infringement' is moot.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 2, 2022 9:58 UTC (Sat) by ballombe (subscriber, #9523) [Link] (14 responses)

This is not established. Nothing prevents Copilot from returning verbatim copies of files it learned. And renaming variables is not transformative.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 2, 2022 13:12 UTC (Sat) by bluca (subscriber, #118303) [Link] (13 responses)

Verbatim copies of files? Never seen that? Before you paste the link to the usual inverse square root gif, that's not a file, and see another reply above about it.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 2, 2022 17:09 UTC (Sat) by ballombe (subscriber, #9523) [Link] (2 responses)

You are reversing the duty of proof. It is up to Microsoft to prove this cannot happen, not us.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 2, 2022 21:45 UTC (Sat) by kleptog (subscriber, #1183) [Link] (1 responses)

> You are reversing the duty of proof. It is Microsoft to prove this cannot happen, not us.

That's ridiculous. You can't prove a negative. It's the same as asking someone to prove they're not beating their spouse.

If Microsoft came out with a statement that as far as they can tell it doesn't happen, people will just claim they're lying. The only relevant evidence is if someone comes up with actual examples.

This is even leaving aside code formatters like Black, which are so opinionated that it's almost at the point where for any piece of code there is only one way it can be formatted, so you couldn't even tell the difference between an actual copy and an accidental one if you wanted to.
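
As a rough illustration of that formatter effect, here is a sketch that uses the standard library's ast.parse and ast.unparse (Python 3.9 or later) as a stand-in for an opinionated formatter. It is not Black, and it does not reproduce Black's rules; the two sample snippets are made up.

```python
import ast


def normalise(source: str) -> str:
    """Parse and re-emit source, discarding the original layout and comments."""
    return ast.unparse(ast.parse(source))


# Two hypothetical snippets, written independently with different spacing,
# indentation, comments and redundant parentheses.
a = "def add(x,y):\n        return x+y   # sum\n"
b = "def add( x , y ):\n  return (x + y)\n"

print(normalise(a) == normalise(b))
print(normalise(a))
```

Once both snippets have been through the same normalisation, the text alone no longer tells you whether one was copied from the other or both were written from scratch.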

If you take a step back and think about what it would take to build such an AI model, if the model has any understanding of the structure of code, there's no reason at all to think that it will randomly copy entire blocks of text literally from the input. It's going to be working at a completely different level.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 3, 2022 15:13 UTC (Sun) by ballombe (subscriber, #9523) [Link]

> If you take a step back and think about what it would take to build such an AI model, if the model has any understanding of the structure of code, there's no reason at all to think that it will randomly copy entire blocks of text literally from the input. It's going to be working at a completely different level.

Nobody outside MS really knows how Copilot actually works, so you cannot make any claim about it. 'AI model' is just a buzzword.

I do not see how the Math.isPrime example can occur outside literal copying.
Copilot seems to be less transformative than a C compiler generating machine code, and so far binaries have always been considered derivative of the source.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 4, 2022 12:38 UTC (Mon) by nye (subscriber, #51576) [Link] (9 responses)

> Verbatim copies of files? Never seen that?

This back and forth kind of misses the point IMO. If copilot outputs something which is a verbatim copy of a substantial piece of code, then *of course* it shouldn't magically have had its copyright removed. Similarly, if a person with an exceptionally accurate memory writes down some copyrighted code that they memorised last year, the fact that they didn't literally copy/paste it has no real bearing on its copyright status. It feels like this shouldn't be controversial.

It seems you assert that it isn't or shouldn't be possible for Copilot to do this, but however accurate that is, I don't think it's particularly important - partly because it's hard to prove and partly because it could be subject to change.

All of the talk about verbatim outputs seems like a largely pointless distraction from the important part: the infinite set of outputs which are *not* a verbatim copy of a substantial piece of code, and which the copyright maximalists argue must be considered a derivative of all of its training inputs.

Here is what it boils down to: if I, as a programmer, either A) perform a sequence of steps, or B) write a program to perform a sequence of steps, then assuming that all inputs and outputs are the same, does the choice of A vs B affect the legality of the outcome? I don't believe that there's a logically coherent argument for the answer being "yes".

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 4, 2022 12:48 UTC (Mon) by bluca (subscriber, #118303) [Link] (5 responses)

You are assuming one-liners or boilerplate that everybody else is also using in the exact same way pass the threshold of originality (or whatever it is called in legalese). That is one big assumption to make.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 4, 2022 13:35 UTC (Mon) by nye (subscriber, #51576) [Link]

> You are assuming one-liners or boilerplate that everybody else is also using in the exact same way pass the threshold of originality (or whatever it is called in legalese). That is one big assumption to make.

I am definitely not assuming that. If we're talking about "a verbatim copy of a substantial piece of code", then that's essentially my definition of "substantial", but I specifically said "if" in that section, and my point was that IMO it's not at all the important part of the discussion; it's just a distraction (this is why I considered it unimportant to define "substantial" in that context).

FWIW, while we're further entertaining the distraction anyway, I'm not even convinced that the repeatedly-cited fast inverse square root should be eligible for copyright protection - on the grounds that the only bit of creative work in it is the choice of a magic constant, which isn't typically something that would be considered copyrightable. It would be interesting to see if a court is ever asked to rule on this specific piece of code (although I think it's basically always a sad day when we get to the point that a court is required to rule on anything, so "interesting" should not be construed as "good").
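
For reference, here is a sketch of the technique in Python, to show how little there is besides the constant. This is a re-expression of the widely documented algorithm, not the original C source; the function name is my own and the struct round-trips stand in for the pointer cast between float and integer.

```python
import struct

MAGIC = 0x5F3759DF  # the widely documented magic constant


def fast_inv_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) for positive x using the bit-level trick."""
    # Reinterpret the 32-bit float's bits as an unsigned 32-bit integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # Initial guess: halve the bit pattern and subtract it from the constant.
    i = (MAGIC - (i >> 1)) & 0xFFFFFFFF
    # Reinterpret the integer bits as a float again.
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step to refine the guess.
    return y * (1.5 - 0.5 * x * y * y)


print(fast_inv_sqrt(4.0))  # close to 0.5
```

Everything apart from `MAGIC` is a bit reinterpretation, a shift, a subtraction and one standard Newton step.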

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 5, 2022 11:27 UTC (Tue) by nim-nim (subscriber, #34454) [Link] (3 responses)

The examples so far are not one liners.

But even if they were, because the model is asked to predict what someone else would have written given the same program structure, they are far less independent from this very same structure than random snippets found on the web.

And, general structure is one of the things that distinguish fair use from plagiarism.

You cannot have it both ways: mimic accurately what others would have done, and pretend you are not deriving from their work. (This is especially striking where people have used ML to complete damaged works of art more accurately than the best forger. Who cares that the forgery was done one stroke at a time?)

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 5, 2022 13:37 UTC (Tue) by bluca (subscriber, #118303) [Link] (2 responses)

Except in real world usage the similarity is with either the same project in which it is being used (so it's moot), or with something obvious and standard like boilerplate used in the same way by every user of a given library or api, which means the test of originality would not be met.

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 5, 2022 16:47 UTC (Tue) by nim-nim (subscriber, #34454) [Link] (1 responses)

If you’ve found a reliable legal way to determine originality automatically, why are you even posting here? It would be worth its weight in gold (printed in extra large font on lead plates) to every single legal department of the whole Fortune 500.

And if you did not, how can you claim the tool never outputs anything original?

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 5, 2022 17:55 UTC (Tue) by bluca (subscriber, #118303) [Link]

It is a very simple trick: actually use the tool you are talking about, to see what it does outside demos and funny gifs

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 4, 2022 12:56 UTC (Mon) by Wol (subscriber, #4433) [Link] (2 responses)

> All of the talk about verbatim outputs seems like a largely pointless distraction from the important part: the infinite set of outputs which are *not* a verbatim copy of a substantial piece of code, and which the copyright maximalists argue must be considered a derivative of all of its training inputs.

Not just the maximalists. Taking the word "derivative" at face value, all the output is derivative of the training data.

The question isn't whether it's derivative, the question is whether it's sufficiently *trivial* not to be copyright, or sufficiently complex and derived from just one or two training items to be a blatant copyright violation. And that will probably have to be determined on a case-by-case basis.

tldr; don't assume because it comes from Copilot that it's copyright-free... (don't assume that it isn't, either).

Cheers,
Wol

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 4, 2022 13:42 UTC (Mon) by nye (subscriber, #51576) [Link] (1 responses)

> Not just the maximalists. Taking the word "derivative" at face value, all the output is derivative of the training data.

That is the maximal possible interpretation, so yes, just the maximalists, by definition. You haven't even added so much as any vague handwaving about transformative use!

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 4, 2022 14:19 UTC (Mon) by Wol (subscriber, #4433) [Link]

> just the maximalists, by definition

Except the quote I was replying to said COPYRIGHT maximalists.

And I certainly didn't claim that the output was - or even should be - copyright. I just said that it was - BY DEFINITION OF THE WORD - derivative.

If I openly said that *some* output is too trivial to copyright, how does that make me a copyright maximalist? And again, isn't "transformative use" - by definition - derivative? FFS, it's a *transformation* - it's the same thing but altered ...

Cheers,
Wol

Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

Posted Jul 2, 2022 16:06 UTC (Sat) by NAR (subscriber, #1313) [Link]

What if we replace (in this argument) the artificial intelligence with a natural one? Humans learn differently than computers, but we still see a lot of code and that influences our behaviour. If I see only GPL'd code, if I learn coding only using GPL'd examples, shall my code also be under the GPL?