I was working in Libya on voter registration tools with the UN and the High National Election Commission. The government decided not to implement a planned TZ change, and didn't inform the public until the day of. That wasn't the hardest thing we dealt with, though; that was a full-country internet shutoff by a mob outside our data centre (https://www.bbc.com/news/world-africa-25481794). Sometimes the politics of a project are more complicated than the technology...
We did implement an all-SMS voter registration system, which was pretty cool. Hasn't been used much since, but it's all open source. https://github.com/hnec-vr
Complete tangent, but I don't think many Americans know this (I'm assuming you're an American)
If you aren't American, you are now ineligible to go to America as a tourist without an expensive hasslesome visit to a US embassy. (No online ESTA)
I have friends that have gone to countries like Libya and Syria to do similar international work. A British engineer I know recently went to Syria for a few days.
I pointed out that he is no longer allowed to go to the US without going to the embassy for a visa. He's ineligible for an ESTA.
He said "fine, work will have to pay for it".
I then pointed out this is for the rest of his life. He regularly holidays in Florida. He might leave the media or change jobs so they no longer pay for a visa.
I've been asked to go to Iraq in the past, but I've said no because of this. Was a very expensive weekend for my friend.
Another friend is in the British Army, he's gone to various places as part of both British and NATO deployment, not using his personal passport - but using travel orders. He managed to avoid going to Iraq which is lucky for him, means he can still get an ESTA.
Many Americans (and non-Americans) also don't know that not all countries are supported with ESTA. So, for me, being a citizen of a '3rd world country' in Europe, I have to visit the embassy.
Although I am resident in EU and haven't been to any of those 'flagged' countries.
I am an American, and most of the other folks on the project were as well. I did go through an interview with Customs officers when I returned to the US, but it didn't affect my ability to use GlobalEntry. I don't have clearance, but it might cause questions if I did apply for that in the future.
I've also been to Syria as a tourist several times, and at one point had to maintain a second passport for visiting Israel or the West Bank. You can't travel to most of the Arab world if you have Israeli stamps, but you can get another book from the US to keep them separate.
Second passport is fine, but they stopped stamping at Tel Aviv many years ago. I still got a stamp last time I went to Gaza, but that was about a decade ago and pre current passport. (Obviously not a concern now.)
I did something similar for my wedding website back in 2013. We used a mail-in service that produced a decent TTF, and then I converted it to a WOFF. Still online at https://ruthandjosh.net/story/ (warning, millennial cringe ahead)
Aaron Swartz, cofounder of Reddit and inventor of RSS and Markdown, was hounded to death by an overzealous prosecutor for downloading articles from JSTOR, with the intent to learn from them. He was charged with over a million dollars in fines and could have faced 35 years in prison.
He and Sam Altman were in the same YC class. OpenAI is doing the same thing at a larger scale, and their technology actually reproduces and distributes copyrighted material. It's shameful that they are making claims that they aren't infringing creator's rights when they have scraped the entire internet.
I'm familiar with Aaron Swartz's case, and that is actually why I phrased it as "books". In any case, while tragic, Swartz wasn't prosecuted for copyright infringement, but rather for wire fraud and computer fraud due to the manner in which he bypassed protections in MIT's network and the JSTOR API. This wouldn't have been an issue if he downloaded the articles from a source that freely shared them, like sci-hub.
It would be incredibly naive to assume that the scraping done for these models did not at any point circumvent protections.
The fundamental contention is that both accessed, saved and distributed material that they didn't have a "right" to access, save, and distribute. One was made a billionaire for it and another was driven to suicide. It's not tragic, it's societal malpractice.
> It's shameful that they are making claims that they aren't infringing creator's rights when they have scraped the entire internet.
Scraping the Internet is generally very different from piracy. You are given a limited right to that data when you access it, and you can make local copies. If further use does something sufficiently non-copying, then creator rights aren't being infringed.
> Can you compress the internet including copyrighted material and then sell access to it?
Define access?
If you mean sending out the compressed copy, generally no, at least for things people normally call compression.
If you want to run a search engine, then you should be fine.
> At what percentage of lossy compression it becomes infringement?
It would have to be very very lossy.
But some AI stuff is. For example there are image models with fewer parameters than source images. Those are, by and large, not able to store enough data to infringe with. (Copying can creep in with images that have multiple versions, but that's a small sliver of the data.)
Commercial audio generation models were caught reproducing parts of copyrighted music in a distorted and low-quality form. This is not "learning", just "imitating".
Also, as I understand it, they didn't even buy the CDs with music for training; they got it somewhere else. Why do organizations that prosecute people for downloading a movie not want to look at whether it is OK to build a business on illegal copies of copyrighted works?
When you identify where the infringing party has stored the source material in their artifact.{zip,pdf,safetensor,connectome,etc}. In ML, this discovery stage is called "mechanistic interpretability", and in humans it's called "illegal."
It's not that clear cut. Since they're talking about taking lossy compression to the limit, there are ways to go so lossy that you're no longer infringing even if you can point exactly at where it's stored.
It was overzealous prosecution of the breaking into a closet to wire up some ethernet cables to gain access to the materials.
Not of the downloading with intent.
And apparently the most controversial take on this community is the observation that many people would have done the trial, plea and time, regardless of how overzealous the prosecution was
I'm glad you still have that much faith in the system. That's much more faith than I have in the system (and more faith than I had in the system back then, too).
35 years is a press release sentence. The way DOJ calculates sentences when they write press releases ignores the alleged facts of the particular case and just uses for each charge the theoretically maximum possible sentence that someone could get for that charge.
To actually get that maximum typically requires things like the person is a repeat offender, drug dealing was involved, people were physically harmed, it involved organized crime, it involved terrorism, a large amount of money was involved, or other things that make it an unusually big and serious crime.
The DOJ knows exactly what they are alleging the defendant did. They could easily look at the various factors that affect sentencing for the charge, see which apply to that case, and come up with a realistic number, but that doesn't sound as impressive in the press release.
Another thing that inflates the numbers in the press releases is that defendants are often charged with several related charges. For many crimes there are groups of related charges that for sentencing get merged. If you are charged with say 3 charges from the same group and convicted on all you are only sentenced for whichever one of them has the longest sentence.
If you've got 3 charges from such a group in the press release the DOJ might just take the completely bogus maximum for each as described above and just add those 3 together.
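To make the arithmetic concrete, here is a rough sketch of the two ways of counting. The charge names, the group label, and the year figures are all made up for illustration and are not taken from any real case; the grouping rule is just the simplified "longest one in the group" version described above.

```python
from collections import defaultdict

def press_release_total(charges):
    """Add up the maximum for every charge, ignoring any grouping."""
    return sum(years for _name, _group, years in charges)

def grouped_total(charges):
    """Within each group of related charges, count only the longest one."""
    longest = defaultdict(int)
    for _name, group, years in charges:
        longest[group] = max(longest[group], years)
    return sum(longest.values())

# Hypothetical charges: (name, group, maximum years). Illustrative numbers only.
charges = [
    ("charge A", "related-group", 10),
    ("charge B", "related-group", 5),
    ("charge C", "related-group", 5),
]

print(press_release_total(charges))  # 20
print(grouped_total(charges))        # 10
```

Even the grouped number here still uses the theoretical maximum for the longest charge, so a realistic sentence would usually be lower again once the case-specific factors are applied.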
Here's a good article on DOJ's ridiculous sentence numbers [1].
Here's a couple of articles from an expert in this area of law that look specifically at what Swartz was charged with and what kind of sentence he was actually looking at [2][3].
Why do you think Swartz was downloading the articles to learn from them? As far as I've seen, no one knows for sure what he was intending.
If he wanted to learn from JSTOR articles he could have downloaded them using the JSTOR account he had through his research fellowship at Harvard. Why go to MIT and use their public JSTOR WiFi access, and then when that was cut off hide a computer in a wiring closet hooked into their ethernet?
I've seen claims that what he wanted to do was meta research about scientific publishing as a whole, which could explain why he needed to download more than he could download with his normal JSTOR account from Harvard, but again why do that using MIT's public WiFi access? JSTOR has granted more direct access to large amounts of data for such research. Did he talk to them first to try to get access that way?
He might have wanted other people to have access to the knowledge, and for free. In comparison, AI companies want to sell access to the knowledge they got by scraping copyrighted works.
I used to work at a company that used Sentinel-2 data and a large scale AI model to detect changes in land use and land cover anywhere in the world. They provide free global data at 10m resolution on an annual basis, or paid versions at 3m resolution over a custom timeframe.
I did a project at the MIT Center for Future Civic Media on building a hyper-local radio platform, and we'd always use Car Talk as an example of the kind of interactions we wanted to enable. Like, what if you were a vet in rural Uganda, and you wanted to have a show for farmers to describe their problems and get expert help. Call it Goat Talk, and we'd descend into bleats and baahs.
This was an informal thing for a long time, but I didn't know that there's now an actual certificate. I may have to go back and request mine retroactively.