
Why is Copilot so bad?

Posted Jul 2, 2022 13:16 UTC (Sat) by bluca (subscriber, #118303)
In reply to: Why is Copilot so bad? by pabs
Parent article: Software Freedom Conservancy: Give Up GitHub: The Time Has Come!

A world in which it is not legal to train an AI on public corpora is a world in which only giant corporations with huge caches of proprietary code (or text or whatever) can build AIs. That's not a good outcome for free software.



Why is Copilot so bad?

Posted Jul 2, 2022 14:03 UTC (Sat) by mpr22 (subscriber, #60784) [Link] (34 responses)

Admirable!

But the world we have right now is one where a giant corporation is using public domain (fine), permissively licensed (fine-ish) and copyleft (not so fine) code, rather than its own proprietary code, to train its AI model.

Why is Copilot so bad?

Posted Jul 2, 2022 14:22 UTC (Sat) by SLi (subscriber, #53131) [Link] (23 responses)

Yes. And I believe a good world is one where anyone can use any publicly available data to train an AI and use it freely, or at least without the copyright on the training material preventing it (ethical reasons may be a good reason to regulate some uses of AI).

Now it sounds like some want to throw the baby out with the bathwater and prevent all code AI models, leaving only giants like Google or Microsoft to train their own models on their own code (possibly for in-house use only, if they are scared of information leaks).

Society shouldn't adopt a copyright-maximalist stance and stifle uses of AI merely because the first available models were proprietary.

The idea that "any use" of copyrighted works should require a license is a typical maximalist idea, and I expect to hear it more from the entertainment industry than from free software proponents. Training an AI and using it to produce code is, rather clearly to me, one of those things that are only extremely tangential to any traditional purpose of the copyright system. It's fundamentally not at all different from a human reading publicly available code and using the memories formed that way to write more. I don't think even the craziest copyright maximalists claim that the products of that are typically derivative works.

Why is Copilot so bad?

Posted Jul 2, 2022 16:02 UTC (Sat) by mpr22 (subscriber, #60784) [Link] (2 responses)

> It's fundamentally not at all different from a human reading publically available code and using the memories formed that way to write more.

If a corporation is allowed to bleach the copyleft off of your code by using it as feedstock for an incomprehensibly complex computer algorithm and then asking the algorithm to solve that problem, copyleft is gravely wounded.

Why is Copilot so bad?

Posted Jul 2, 2022 22:12 UTC (Sat) by kleptog (subscriber, #1183) [Link] (1 responses)

> If a corporation is allowed to bleach the copyleft off of your code by using it as feedstock for an incomprehensibly complex computer algorithm and then asking the algorithm to solve that problem, copyleft is gravely wounded.

How is this different to anyone looking at copylefted code in Github for inspiration to solve a problem they're having, and then using that idea, written in their own way, in their own program? Copyright is focussed on the copying of expression, not the copying of ideas. As long as you can argue the model is copying the idea, not the expression, copyright is completely irrelevant.

The whole issue comes down to the distinction we've made in copyright law between what compilers do (which is considered pure manipulation having no effect on copyright), and what people do (which is looking at pieces of source code to learn and use that to make more source code). Isn't the rule of thumb: if you're copying from one source it's plagiarism, if you're copying from two it's research?

I don't really see how a model that produces code after examining lots of source, some of it copylefted, reduces the value of the input code. If a computer model can actually come up with code that does something you've typed, perhaps it wasn't so original, and it's the kind of thing we want to automate away anyway.

TBH, the idea of a model writing code for you to solve a problem sounds nice. But what would be really valuable is something that could see where many programs are solving a similar problem, make a library for that, and refactor all the other programs to use it.

Why is Copilot so bad?

Posted Jul 5, 2022 9:38 UTC (Tue) by farnz (subscriber, #17727) [Link]

Treating it as comparable to a human is what I suspect the courts will do, and rsidd has pointed out that music precedent in the case of George Harrison's "My Sweet Lord" suggests that if Copilot does output snippets of its training data unchanged, then unless that snippet is "purely functional", it'll be found to be a copyright infringement by the user of Copilot.

That's a risk for any user of Copilot to assess - are they OK about a possible infringement suit caused by the fact that Copilot has access to code owned by Alphabet, Meta, Microsoft and other entities whose code is on GitHub?

Why is Copilot so bad?

Posted Jul 4, 2022 8:48 UTC (Mon) by nim-nim (subscriber, #34454) [Link] (19 responses)

It would be utterly trivial to use machine learning on FLOSS code *legally*.

There are not a thousand different FLOSS licenses.

There are not a thousand different combination rules.

Compared to the number of files the model ingests to output suggestions, determining what the license of a project is, what other licenses it can be combined with, and what the possible licensing of the result is, is *TRIVIAL*. No need for special magic research exemptions, no need to anger people: just apply the original licensing, legally safe by design in all jurisdictions.
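To make the claim concrete: once a project's license is known (say, as an SPDX identifier), corpus filtering is a lookup, not research. A minimal sketch in Python; the file names and the deliberately incomplete compatibility table below are illustrative assumptions, not an authoritative matrix:

```python
# Hypothetical sketch: partition a training corpus by declared license.
# Whether code under each license may be combined into GPL-3.0 output;
# the table is illustrative and far from complete.
GPL3_COMPATIBLE = {
    "MIT": True,
    "BSD-2-Clause": True,
    "Apache-2.0": True,
    "GPL-3.0-or-later": True,
    "GPL-2.0-only": False,  # GPLv2-only code cannot be relicensed to v3
    "Proprietary": False,
}

def usable_for_gpl3_output(declared: str) -> bool:
    """Unknown or unlabeled licenses are excluded by default."""
    return GPL3_COMPATIBLE.get(declared, False)

corpus = [("vec.c", "MIT"), ("driver.c", "GPL-2.0-only"), ("util.py", "Apache-2.0")]
training_set = [name for name, lic in corpus if usable_for_gpl3_output(lic)]
print(training_set)  # ['vec.c', 'util.py']
```

The per-file cost is one dictionary lookup, which is the point: negligible next to the cost of training.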

Pretending everything is public domain is not just laziness, it’s *opinionated* laziness that tries to blur the lines so everything not “protected” by bigcorp lawyers is free to pillage, and everything produced by this pillaging can be safely put out of bounds.

Why is Copilot so bad?

Posted Jul 4, 2022 10:27 UTC (Mon) by SLi (subscriber, #53131) [Link] (17 responses)

Only if all the licensing metadata is correct and the authors understood what they were doing and had the right to do it. Hint: it isn't, and they didn't... I have followed Debian actively enough to know that a team of a few interested people can rather easily stumble on copyright violations, even if the majority of people might prefer those not to be discovered.

Why is Copilot so bad?

Posted Jul 4, 2022 14:10 UTC (Mon) by nim-nim (subscriber, #34454) [Link] (16 responses)

Well, guess what, the real world is imperfect.

“Mister judge, some goods in that store are probably mislabeled, therefore I decided that paying for what I picked up was unnecessary.” How do you think that would work out?

The law does not let you off the hook because others may have made mistakes. Everyone makes mistakes. There’s a difference between making an honest mistake (trying and failing to achieve perfection) and not trying at all.

Why is Copilot so bad?

Posted Jul 4, 2022 16:17 UTC (Mon) by SLi (subscriber, #53131) [Link] (15 responses)

My point exactly. But would you have the world be such that, every time you discover some small infringing piece of code in the ton of code you trained the model on, you have to retrain the model at a cost of a few million?

As a practical matter, no such large corpus of code without any copyright violations to be discovered exists. I suspect the large corporations come closest. For the free software world, this idea would kill the last hope of training such models.

I believe Microsoft's motivation for not training it on their internal code is not about copyright violations, but being careful to not divulge trade secrets—which is obviously a non-issue with any code that is freely accessible.

Why is Copilot so bad?

Posted Jul 4, 2022 16:37 UTC (Mon) by bluca (subscriber, #118303) [Link] (1 responses)

> I believe Microsoft's motivation for not training it on their internal code is not about copyright violations, but being careful to not divulge trade secrets—which is obviously a non-issue with any code that is freely accessible.

I'm not in GH so I don't know, but if I had to take a wild guess I'd say it's much simpler than that. The non-GH internal SCM systems are such a horrendous pain in the back to use, and even to get access to, that I'm willing to bet the team working on the model, even if given permission to use those sources, would "nope" the heck out very, very fast and never look back.

Why is Copilot so bad?

Posted Jul 4, 2022 17:53 UTC (Mon) by Wol (subscriber, #4433) [Link]

Given that I've worked with SourceSafe, I'm inclined to agree with you ... :-)

Cheers,
Wol

Why is Copilot so bad?

Posted Jul 5, 2022 6:39 UTC (Tue) by nim-nim (subscriber, #34454) [Link] (12 responses)

> For the free software world, this idea would kill the last hope of training such models.

Not at all.

JUST APPLY THE ORIGINAL LICENSING

We spent decades streamlining FLOSS licensing to make sure the number of actual licenses in play is small and their effects are clearly understood. The kind of legal effort proprietary companies copped out on, as shown every single time some software giant tries to relicense its own (?) code and spends years clearing up the effects of not having done due diligence beforehand.

THERE IS NO VALID EXCUSE TO IGNORE FLOSS LICENSES.

Easily 60% of GitHub content is governed by a small dozen FLOSS licenses. This is more than enough to train a model on. Distinguishing between a dozen different sets of terms is not hard.
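Even license detection itself is often mechanical, since many projects carry SPDX tags in file headers. A minimal sketch, assuming the conventional "SPDX-License-Identifier:" comment form; the sample input is made up:

```python
import re
from typing import Optional

# Matches the conventional SPDX header, e.g. "SPDX-License-Identifier: MIT".
SPDX_RE = re.compile(r"SPDX-License-Identifier:\s*([A-Za-z0-9.+-]+)")

def declared_license(source: str) -> Optional[str]:
    """Return the declared SPDX identifier, or None if no tag is present."""
    m = SPDX_RE.search(source)
    return m.group(1) if m else None

print(declared_license("// SPDX-License-Identifier: GPL-2.0-only\nint x;"))
# -> GPL-2.0-only
```

Files without a tag fall back to whatever the project-level license metadata says, or get excluded.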

This is especially galling from an extra-wealthy company that had the means for years to clear the legal status of its own code (but did not), spent years mocking people who “wasted” time arguing about the exact effect of FLOSS license terms, and then starts pillaging this very code without even trying to comply with the hard-won simple licensing state.

This is especially galling from a division (GitHub) that was asked for years to help committers navigate legalities, made half-assed efforts, and then proceeds to ignore the result of those efforts.

Stop finding ridiculous excuses. FLOSS is about the only software trove where ML can work legally *because* of its licensing simplicity (which took a lot of effort to achieve), ASSUMING YOU APPLY THIS LICENSING. Otherwise it is no better than proprietary software, and Microsoft has plenty of its own to play with; it’s not welcome to play with other people’s software while not abiding by the legal conditions.

No better than the people who ignore Creative Commons terms because their own legal status is an utter mess and they expect others to be just as bad. Not an honest mistake once they’ve been told repeatedly that this is not the case. They can stomp on their own licensing, not on other people’s.

Why is Copilot so bad?

Posted Jul 5, 2022 6:48 UTC (Tue) by nim-nim (subscriber, #34454) [Link]

(Also, we *do* remember that Microsoft filed amicus briefs on Oracle’s side, when Oracle spent years suing Google over 9 lines of rangeCheck implementation, and now wants us to accept that copying FLOSS code on an industrial scale is not protected provided it’s mediated by a black-box ML model.)

Why is Copilot so bad?

Posted Jul 5, 2022 10:14 UTC (Tue) by SLi (subscriber, #53131) [Link] (10 responses)

So are you seriously telling me distributions like Debian do not find out regularly that they have been distributing something which is a copyright violation? Because if you are, you clearly just do not know.

There is no such thing as a truly massive corpus of code with a known license and guaranteed freedom from copyright issues. There just isn't. It's not an excuse.

Why is Copilot so bad?

Posted Jul 5, 2022 10:28 UTC (Tue) by amacater (subscriber, #790) [Link]

Debian and licenses - yes, licensing and copyright checking is one of the things that Debian maintainers do. If software is found whose licence has been changed, it's removed. It's also one of the things that goes into Debian packaging checks, SPDX, reproducible builds... there's a good-faith effort to do this for every Debian package. Jokingly, I refer to Debian licence "fascism" as one of the saving graces of Debian, because you _can_ be as sure as feasible that someone has checked.

This is not necessarily the case for other distributions - which may have other priorities / commercial pressures or whatever - but that's their world. Disclaimer: I have been a Debian developer since about 1998 but don't currently package software, though I do keep track of the tools and processes that do.

Why is Copilot so bad?

Posted Jul 5, 2022 10:51 UTC (Tue) by nim-nim (subscriber, #34454) [Link] (7 responses)

Irrelevant

The law does not require perfection, it deals with the real world.

The law requires good-faith efforts, i.e. you do not get a free pass to appropriate stuff clearly labeled as being under someone else’s license, and you make efforts to fix things once you’re notified the labeling was in error.

Nothing more AND NOTHING LESS.

Why is Copilot so bad?

Posted Jul 5, 2022 10:58 UTC (Tue) by SLi (subscriber, #53131) [Link] (5 responses)

So are you saying that, if using models like Copilot is a copyright violation, the law would still not require you to stop using a model trained on Debian's source code once you have realized you trained it with unlicensed material? Because a good-enough effort was made? Even if you could, at a significant cost, retrain it?

Why is Copilot so bad?

Posted Jul 5, 2022 11:58 UTC (Tue) by nim-nim (subscriber, #34454) [Link] (4 responses)

It is definitely NOT a copyright violation if you apply the licensing terms of the code you are copying.

If you ignore those terms, it MAY be a copyright violation, depending on the extent and originality of the copying and depending on how much it is linked to overall program structure (ie the more accurate the model will be, the more likely it will be to infringe).

The instrument you use for this copying (Ctrl+C or fancy ML) is pretty much irrelevant in the eyes of the law. The law cares about effects (you killed your neighbor), not the instrument used (a knife like your distant ancestors used, a printed gun, a fancy sci-fi laser, or Harry Potter’s magic wand). But tech people keep thinking they will fool a judge just by using an instrument never encountered before.

Also, the law deals with the real world, not absolutes, so infringing accidentally in good faith (the code was mislabelled) is not the same thing as deliberately ignoring the code license prominently displayed on the GitHub project landing page. In one case you get condemned to pay one symbolic dollar (provided you did due diligence to fix your mistake); in the other it can reach billions.

As for the “significant cost of retraining”, just try that in front of a judge and the peanut gallery; we all know those models are periodically retrained for lots of different reasons, including mistakes in the data set, and licensing mistakes are no less worthy than other mistakes.

Notwithstanding the fact that Microsoft is the operator of one of the world’s biggest clouds, which the judge will find hard to ignore.

Why is Copilot so bad?

Posted Jul 5, 2022 12:20 UTC (Tue) by SLi (subscriber, #53131) [Link] (3 responses)

Ok, but that's my point exactly: There's not much hope for a free model in a world where you have to retrain it every time you discover it was tainted by freely available code which a human could read on the net but could not legally copy.

It may be, barely, possible for a large corporation like Google or Microsoft with their internal code bases which tend to be better curated (but still it will be hard).

You do realize that training a model at the scale of Copilot costs a few million dollars every time you do it?

Good luck getting funding for retraining the free model every time Debian finds a copyright violation. I could see public or donated funding for a single training, but not for that.

So, if the law is what you claim it is, we can possibly still have proprietary models, but it's quite unlikely to have significant models trained on free software.

I think your rhetoric about tech people trying to fool judges is a bit misplaced and incendiary. I think it's safe to guess that Microsoft lawyers have found Copilot to be legally safe enough. And it's not like this is some device designed purely to try to circumvent law.

Why is Copilot so bad?

Posted Jul 5, 2022 12:50 UTC (Tue) by nim-nim (subscriber, #34454) [Link] (2 responses)

> Ok, but that's my point exactly: There's not much hope for a free model in a world where you have to retrain it every time you discover it was tainted by freely available code which a human could read on the net but could not legally copy.

First, computing power is dirt cheap; what was prohibitively expensive yesterday is wasted on ad processing and crypto mining today.

Second, the law does not deal with absolutes; it deals with the real world and proportionality.

It does not require instantaneous systematic compliance. That would be pretty much impossible to achieve in the material world. It requires speedy, realistic compliance (as soon as you can, not as soon as it is convenient or cheap for you).

Periodic retraining would be fine, as long as you do not delay it unduly to avoid any consequence. And you *will* retrain periodically if only because computing languages keep evolving and you will need to make the model aware of new variants.

In the meantime, it is computationally cheap to filter the output to drop suggestions found in code you’ve been informed is tainted.
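A filter of that sort can be as simple as an index of normalized snippet hashes consulted before a suggestion is shown. A minimal sketch; the "tainted" snippet is a made-up placeholder, and a real system would want fuzzier matching than exact hashes:

```python
import hashlib

def normalize(code: str) -> str:
    # Collapse whitespace so trivial reformatting doesn't evade the filter.
    return " ".join(code.split())

def fingerprint(code: str) -> str:
    return hashlib.sha256(normalize(code).encode()).hexdigest()

# Index of snippets reported as tainted (illustrative placeholder data).
tainted = {fingerprint("int isqrt_fast(int x) { return x >> 1; }")}

def allow_suggestion(suggestion: str) -> bool:
    """Suppress any suggestion matching a known-tainted snippet."""
    return fingerprint(suggestion) not in tainted

print(allow_suggestion("int isqrt_fast(int x)  {  return x >> 1; }"))  # False
print(allow_suggestion("int clamp0(int x) { return x < 0 ? 0 : x; }"))  # True
```

Even with shingling or winnowing instead of whole-snippet hashes, the per-suggestion cost stays trivial next to running the model itself.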

And if you are convinced the amount of tainted code will largely exceed your capacity to filter, and you proceed with your ML project anyway, it will be hard to take it as anything but willful copyright infringement.

And it is all terribly inconvenient, I know. The law is not about your individual convenience.

> I think it's safe to guess that Microsoft lawyers have found Copilot to be legally safe enough.

”Even for copyrightable platforms and software packages, the determination whether infringement has occurred must take into account doctrines like fair use that protect the legitimate interests of follow-on users to innovate. But the promise of some threshold copyright protection for […] elements of computer software generally is a critically important driver of research and investment by companies like amici and rescinding that promise would have sweeping and harmful effects throughout the software industry”

Gregory G. Garre, Counsel for Microsoft Corporation, BRIEF FOR AMICI CURIAE MICROSOFT CORPORATION […] IN SUPPORT OF APPELLANT

That’s what Microsoft thinks when the code in question is not produced by Joe Nobody on GitHub.

Why is Copilot so bad?

Posted Jul 5, 2022 14:16 UTC (Tue) by SLi (subscriber, #53131) [Link] (1 responses)

Computing power dirt cheap? You clearly haven't moved into the world of AI yet. Seriously, training those models costs millions in electricity and computer time alone, per training run.

In the future it may be possible to train, for less, the models that people today train for millions, but even that is a bit speculative (I think the biggest advancements are likely to come from algorithmic development, though it's probably still possible to squeeze out some more computation per watt). You still won't be able to train that future's better models for dirt cheap.

Why is Copilot so bad?

Posted Jul 5, 2022 14:32 UTC (Tue) by nim-nim (subscriber, #34454) [Link]

Then it was utterly foolish to spend those millions before writing the little amount of code necessary to check the legal metadata. Behaving foolishly is a general consequence of thinking rules apply to others, not to you.

Why is Copilot so bad?

Posted Jul 5, 2022 11:01 UTC (Tue) by nim-nim (subscriber, #34454) [Link]

Also, the default state of something you find on the street or on the web is not “free to use”; it’s protected. You are not allowed to steal the pile of furniture lying in the street during a relocation just because every single table isn’t tagged as off limits.

Why is Copilot so bad?

Posted Jul 5, 2022 12:15 UTC (Tue) by pabs (subscriber, #43278) [Link]

You are correct that Debian does have to remove code fairly regularly that was found to be non-free or even non-redistributable. Most instances are caught by maintainers before they enter Debian, but sometimes mistakes are made.

https://bugs.debian.org/cgi-bin/pkgreport.cgi?pkg=ftp.deb...
https://snapshot.debian.org/removal/

Why is Copilot so bad?

Posted Jul 4, 2022 12:08 UTC (Mon) by nye (subscriber, #51576) [Link]

> Pretending everything is public domain is not just laziness, it’s *opinionated* laziness, that tries to blur the lines so everything not “protected” by bigcorp lawyers is free to pillage, and everything produced by this pillaging can be safely put out of bounds.

Nobody ever claimed that.

Why is Copilot so bad?

Posted Jul 2, 2022 14:23 UTC (Sat) by bluca (subscriber, #118303) [Link] (9 responses)

No, it's one where _anyone_ can do that, because building a model from public repos is not subject to copyright restrictions (and thus the license is irrelevant). That's a good thing, and it levels the playing field.

Why is Copilot so bad?

Posted Jul 2, 2022 15:59 UTC (Sat) by mpr22 (subscriber, #60784) [Link] (1 responses)

No, it doesn't level the playing field.

Because the big capitalist proprietors still have more money and smart people to throw at the exercise than any other non-state actor.

Why is Copilot so bad?

Posted Jul 2, 2022 16:41 UTC (Sat) by bluca (subscriber, #118303) [Link]

Making it de-facto impossible for anyone not in a gigantic corporation to build an AI model doesn't change any of that, it makes things strictly worse. Fortunately it is not illegal, at least in Europe.

Why is Copilot so bad?

Posted Jul 5, 2022 9:47 UTC (Tue) by farnz (subscriber, #17727) [Link] (6 responses)

Building the model isn't subject to copyright restriction (which I agree is right and proper - we don't place copyright restrictions on people picking up information from code they read), but using it might be, just as I might be infringing copyright if I accidentally type in a byte-for-byte identical copy of something I read during code review at a past job.

There's precedent for this in human creativity - former Beatle George Harrison lost a case for "subconscious plagiarism" (hat tip to rsidd) because he listened to a song several years before writing a song that happened to be almost exactly the same melody. No copyright restrictions applied to George Harrison listening to the song he later infringed copyright on, but they did come into play once he created a "new" work that happened to be too similar to an existing work he knew about.

The same could well apply to Copilot - creating the model is OK (human analogy is consuming media), holding the model itself is OK (human analogy is having a memory of past work), but using the output of the model is infringement if it's regurgitated copyrightable code from its input ("subconscious plagiarism" in the Harrison case).

Why is Copilot so bad?

Posted Jul 5, 2022 10:54 UTC (Tue) by SLi (subscriber, #53131) [Link] (5 responses)

Code tends to be much more functional than "pure arts" like music. I doubt what you describe is possible for a human with code (coming up with identical code might be, but in that case it's unlikely to be significant enough to be a copyright violation, and, you know, you are actually allowed to apply the generally useful stuff you have learned in previous jobs).

The copyright violation would have to be in whatever remains once you have set aside the parts not protected by copyright—which include, for example:

- the purpose ("what the code does", for example "reciprocal square root")
- how it does it, especially if it is the best or one of a limited number of good ways to do it (so, yes, perhaps counterintuitively the expression in a particularly clever code snippet might enjoy less protection)
- whatever is dictated by external factors (the magic numbers in the reciprocal square root code? There are probably other reasons why they are not protected, but they also aren't protected because they need to be exactly those numbers to work, as dictated by mathematical law—see the sketch below); this also applies to whatever the coding style dictates
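For concreteness, the reciprocal square root example refers to the well-known fast inverse square root trick popularized by Quake III Arena; here it is transcribed into Python purely as an illustration. The magic constant 0x5f3759df is dictated by the IEEE-754 bit layout and the underlying mathematics, not by any creative choice:

```python
import struct

def fast_inv_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) via the classic bit-level trick."""
    # Reinterpret the 32-bit float's bits as an unsigned integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # The magic constant exploits the IEEE-754 exponent/mantissa layout;
    # it has to be (very nearly) this exact value to work at all.
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step refines the estimate.
    return y * (1.5 - 0.5 * x * y * y)

print(fast_inv_sqrt(4.0))  # close to 0.5
```

Change the constant meaningfully and the code simply stops working, which is the sense in which it is dictated by external factors.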

So, in practice, for a small enough snippet that such an accident is plausible, what might remain, and what must pass the originality threshold to attain copyright protection, is things like:

- variable names—but if they are "normal" and not very creative (using "i" for a loop counter or "number" for a number), they don't contribute a whole lot
- Stylistic things that do not come directly from coding style or the way things are commonly done. How you group your code. Perhaps the order of some lines of code, where you insert blank lines (in cases where it would be unlikely for two coders to do it the same way), etc.
- Comments. Short, purely technically descriptive snippets are probably unlikely alone to meet the originality threshold, but if you come up with enough similar technical prose, even in the form of multiple short comments that alone aren't original enough, I think this might be your best bet for violating copyright.

The threshold for originality (in the US) is "low", but not nonexistent. Some things that have been deemed to not meet the threshold are (and remember that with code you need to meet it with what is left once you remove the substantial unprotected elements):

- Simple enough logos, even when there clearly is *some* creativity involved: https://en.wikipedia.org/wiki/Threshold_of_originality#/m...
- Blank forms
- Typefaces
- This vodka bottle: https://en.wikipedia.org/wiki/Threshold_of_originality#/m...

Why is Copilot so bad?

Posted Jul 5, 2022 11:08 UTC (Tue) by farnz (subscriber, #17727) [Link] (4 responses)

It's unlikely to happen with a human coding, simply because I'm not going to copy any copyright-significant decisions from a colleague - I may have a very similar snippet, but the details will change, because that's the nature of a human copying out code from memory. It's more likely to happen with Copilot, since it sometimes regurgitates complete snippets of its input, unchanged, and in a very literal manner.

This is why I suspect the legality of Copilot is currently a lot greyer than either side would like us to think; where it copies code that's not eligible for copyright protection, it may be obvious that it's copied something, but not an infringement because there's no protection to infringe (just as me copying #define U8_MAX ((u8)~0U) from the Linux kernel is not infringing, because there's nothing in there to protect). The risk, however, comes in when the snippet is something that's eligible for copyright protection; I note, for example, that Copilot sometimes outputs the comments that go with a code snippet from its input, which are more likely to be protected than the code itself.

My guess is that if it comes to court, the training process and model will be non-infringing definitionally, because the law says so in the first case, and because in the second case, it's not a reproduction of the copyrighted inputs. The output, however, will face the tests for whether it meets the bar for protection, and if it does and is a reproduction of someone else's work, then it could be deemed infringing; the fact that the model and its training process are not infringements does not guarantee that the output of Copilot is also non-infringing.

So on the GitHub side, the thing they're skating over is that the training process and the tool can be non-infringing without guaranteeing that the output is also non-infringing. On the SFC side, they're skating over the fact that a direct copy does not guarantee infringement, since not all code is eligible for protection. The truth all depends on what a judge says if such a case comes before them - and I'd expect to see that appealed to the highest legal authorities (Supreme Court in the USA).

Why is Copilot so bad?

Posted Jul 5, 2022 11:52 UTC (Tue) by SLi (subscriber, #53131) [Link] (2 responses)

Yeah, I'm not sure there are lots of people who both understand something about the law and would be willing to declare that it's clear cut either way. I definitely am not. My gut feeling is that it will be deemed legal, possibly with some minor changes or post-processing, but I wouldn't bet my life on it.

My more important point is that it *should* be legal, as a matter of sane policy that also would be the result that benefits free software, just like most pushback against copyright maximalism.

Why is Copilot so bad?

Posted Jul 5, 2022 15:26 UTC (Tue) by farnz (subscriber, #17727) [Link] (1 responses)

I disagree that it should be legal - taking that position to an absurd extreme, if I train an ML model on Linux kernel versions alone, I could have an ML model that's cost me a few million dollars but that outputs proprietary kernels that are Linux-compatible and work on the hardware I care about. Effectively, copyright becomes non-existent for big companies who can afford to do this.

My position therefore depends strongly on what the tool actually outputs; if the snippets are such that they are not protected by copyright in their own right, and the tool only outputs unprotected snippets, then I'm OK with it; this probably needs some filtering on the output of the tool to remove known infringing snippets, which I'm also fine with ensuring is legal (it should not be infringement to include content purely for the purpose of ensuring that that content is not output by the tool - fair use sort of argument).

I also very strongly believe that the model itself should not be copyright infringement in and of itself - it's the output that may or may not be infringing, depending on how you use it, and it's the user of the model who infringes if they use infringing output from the model. That may sound like splitting hairs, but it means that Copilot and similar systems are fine, legally speaking, as are any other models trained from publicly available data. It's only the use you put them to that needs care - you could end up infringing by using a tool that is capable of outputting protected material, and it's on the tool user to watch for that and not accept infringing outputs from their tools.

Why is Copilot so bad?

Posted Jul 5, 2022 17:37 UTC (Tue) by nim-nim (subscriber, #34454) [Link]

> I also very strongly believe that the model itself should not be copyright infringement in and of itself - it's the output that may or may not be infringing, depending on how you use it, and it's the user of the model who infringes if they use infringing output from the model. That may sound like splitting hairs, but it means that Copilot and similar systems are fine, legally speaking, as are any other models trained from publicly available data.

I suspect that would very much depend on whether someone manages to find a business model where a model, trained on someone else’s copyrighted production, makes a lot of money on its own (not via the output of original-work copycats). People and lawmakers tend to take a dim view of someone making a lot of money from other people’s belongings without those people getting a cut.

I doubt, for example, that the pharmaceutical companies will manage to escape forever paying back the countries whose fauna/flora they sampled to create medicines. The pressure will only grow with climate change and such natural products becoming harder to preserve.

Why is Copilot so bad?

Posted Jul 5, 2022 15:35 UTC (Tue) by nye (subscriber, #51576) [Link]

> My guess is that if it comes to court, the training process and model will be non-infringing definitionally, because the law says so in the first case, and because in the second case, it's not a reproduction of the copyrighted inputs. The output, however, will face the tests for whether it meets the bar for protection, and if it does and is a reproduction of someone else's work, then it could be deemed infringing; the fact that the model and its training process are not infringements does not guarantee that the output of Copilot is also non-infringing.

This seems eminently reasonable and appears (from the outside, of course) to be the same conclusion that Microsoft's lawyers have made. So far as I'm aware they haven't made an explicit statement on the matter, but I think it's reasonable to infer the first part (training process and model) from the fact that they approved the release of the software, and the second part (the status of the output) from the fact that they recommend "IP scanning" of any output that you use.

At least in the EU the law is clear enough that it's hard to see how there could really be any other possible interpretation. I'm not sure whether we have the same laws regarding ML data collection here in Brexit Britain, or if that came too late.

