BlueSky’s “user intents” is a good proposal, and it’s weird to see some people flaming them for it as though this is equivalent to them welcoming in AI scraping (rather than trying to add a consent signal to allow users to communicate preferences for the scraping that is already happening).
github.com/bluesky-social/prop…
proposals/0008-user-intents at main · bluesky-social/proposals
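To make "consent signal" concrete, here is a minimal sketch, assuming a purely hypothetical record shape; the field names (`allow_ai_training`, `allow_bulk_export`) and the check function are illustrative assumptions only and are NOT the schema in the linked 0008-user-intents proposal. It only shows the idea of a machine-readable preference that a cooperating scraper can consult before using someone's posts.

```python
# Purely illustrative sketch of a machine-readable "user intent" consent check.
# The record shape and field names are assumptions for illustration; they are
# NOT the schema from bluesky-social/proposals 0008-user-intents.

from dataclasses import dataclass
from typing import Optional


@dataclass
class UserIntents:
    """Hypothetical per-account preferences a scraper could consult."""
    allow_ai_training: bool = True   # allow posts in generative-AI training sets
    allow_bulk_export: bool = True   # allow posts in public bulk datasets


def may_train_on(intents: Optional[UserIntents]) -> bool:
    """A cooperating scraper's check before adding posts to a training set.

    Nothing forces a scraper to call this; the signal only works for
    actors who choose to respect it.
    """
    if intents is None:
        return True  # no preference declared; the right default is itself debatable
    return intents.allow_ai_training


# Example: a user who has opted out of AI training.
print(may_train_on(UserIntents(allow_ai_training=False)))  # -> False
```

The rest of this thread is about exactly the gap in that last comment: the check is voluntary.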
Molly White
in reply to Molly White • • •I think the weakness with this and Creative Commons’ similar proposal for “preference signals” is that they rely on scrapers to respect these signals out of some desire to be good actors. We’ve already seen some of these companies blow right past robots.txt or pirate material to scrape.
ietf.org/slides/slides-aicontr…
#BlueSky #AI
Molly White
in reply to Molly White • • •I do think that they are good technical foundations, and there is the potential for enforcement to be layered atop them.
Technology alone won’t solve this issue, nor will it provide the levers for enforcement, so it’s somewhat reasonable that they don’t attempt to.
But it would be nice to see some more proactive recognition from groups proposing these signals that enforcement is going to be needed, and perhaps some ideas for how their signals could be incorporated into such a regime.
#BlueSky #AI
flaeky pancako
in reply to Molly White • • •
Molly White
in reply to flaeky pancako • • •
flaeky pancako
in reply to Molly White • • •i ran an email server for like 10-15 years and with spamassasin i just never had spam in my personal email..
spamassassin.apache.org/
Maybe it wouldn't be foolproof, but every time they get through you plug the hole. It's a game of whack-a-mole, but it costs them money.
flaeky pancako
in reply to flaeky pancako • • •side note , i enjoyed this random reddit user's idea :
*random reddit user*
The best technique I've seen to combat this is:
Put a random, bad link in robots.txt. No human will ever read this.
Monitor your logs for hits to that URL. All those IPs are LLM scraping bots.
Take that IP and tarpit it.
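A minimal sketch of what that technique looks like in practice, assuming an nginx/Apache-style "combined" access log; the honeypot path, the log location, and the regex here are illustrative assumptions, and the actual banning or tarpitting step is left to whatever firewall or middleware you already use (fail2ban, nftables, a slow-response handler, etc.):

```python
# Sketch: a honeypot entry in robots.txt plus a small log scanner that
# collects the IPs of clients that fetch the disallowed URL anyway.
# Path name, log location, and log format are assumptions; adapt to your setup.

import re
from pathlib import Path

# Hypothetical honeypot path, also listed in robots.txt as:
#   User-agent: *
#   Disallow: /trap-9f3a1c/
HONEYPOT_PATH = "/trap-9f3a1c/"

# Assumed "combined" log format: client IP is the first field,
# the request line is the first quoted section.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)')


def offending_ips(access_log):
    """Return the set of client IPs that requested the honeypot URL."""
    ips = set()
    for line in Path(access_log).read_text(errors="replace").splitlines():
        match = LOG_LINE.match(line)
        if match and match.group(2).startswith(HONEYPOT_PATH):
            ips.add(match.group(1))
    return ips


if __name__ == "__main__":
    for ip in sorted(offending_ips("/var/log/nginx/access.log")):
        # Feed these into whatever ban/tarpit mechanism you use.
        print(ip)
```

Because the path is never linked anywhere except robots.txt, the only clients that should ever request it are ones that read the disallow rule and ignored it (or that brute-force URLs), which keeps false positives rare.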
Molly White
in reply to flaeky pancako • • •phi1997
in reply to Molly White • • •@fleeky
Jan Ainali
in reply to Molly White • • •
Molly White
in reply to Jan Ainali • • •
Jan Ainali
in reply to Molly White • • •
ShadSterling
in reply to Molly White • • •
Peter Makholm (Dansk)
in reply to Molly White • • •
katzenberger 🇺🇦
in reply to Molly White • • •The disregard for enforcement issues is mostly a consequence of such proposals never starting from a data protection perspective, because that would lead to entirely different conceptualizations.
Instead, concepts are based on lists of "who wants to ingest, and for what", because the goal is to make data "available" to the respective entities.
This takes away the legitimization burden from those entities, and puts the consequences onto the shoulders of individuals: we're expected to familiarize ourselves with a growing typology of data krakens, and what stance we should take towards each of them.
Proposals like the ones you've mentioned also normalize this work, as if said entities were absolutely entitled to us doing it.
Molly White
in reply to katzenberger 🇺🇦 • • •
katzenberger 🇺🇦
in reply to Molly White • • •Interesting. I'd call individuals and their communities on platforms "closed" with respect to their reason for being there.
IMHO what matters isn't the "open" technicalities of the platform that hosts them. Open protocols that facilitate data exchange between servers are not per se a kind of permission to tap into the exchange.
In that respect, communities and their platforms cannot be considered an "ecosystem" with the same "openness" rules applying to both components.
E.g., recently I've started to become very suspicious of "APIs" that are defined with only one purpose from a developer's perspective: extracting content and helping with ingestion.
This is in total disregard for what brings individuals and communities together (informal, psychologically safe, and free conversation among like-minded people).
It subjugates communities to technical considerations, refusing to abide by conventions (or even laws): You don't want me to get hold of your "content"? Find a technically watertight way to prevent me from doing it, or be ready to be ridiculed for your naiveté.
Matt Terenzio
in reply to Molly White • • •
IEEE Standards Association
Molly White
Unknown parent • • •@hellpie I agree that hoping crawlers will respect these signals out of the pureness of their hearts is not sufficient.
I disagree that copyright is “the legally binding version of user preferences”, though. There is no good legal framework for consent when it comes to AI training, and so a lot of people are falling back to copyright, since there are laws on the books. But I don’t think copyright is the weapon with which to fight this battle — in my view, AI training pretty clearly falls under fair use (and should).
Kuba Suder • @mackuba.eu on 🦋
in reply to Molly White • • •