BlueSky’s “user intents” is a good proposal, and it’s weird to see some people flaming them for it as though this is equivalent to them welcoming in AI scraping (rather than trying to add a consent signal to allow users to communicate preferences for the scraping that is already happening).

github.com/bluesky-social/prop…

#BlueSky #AI

in reply to Molly White

I think the weakness with this and Creative Commons’ similar proposal for “preference signals” is that they rely on scrapers to respect these signals out of some desire to be good actors. We’ve already seen some of these companies blow right past robots.txt or pirate material to scrape.

ietf.org/slides/slides-aicontr…

#BlueSky #AI
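
(For a concrete sense of the kind of signal at issue: the closest thing deployed today is a robots.txt directive addressed to specific AI crawlers, which compliant bots honor and bad actors simply ignore. GPTBot and CCBot are the published user agents for OpenAI's and Common Crawl's crawlers.)

    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /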

in reply to Molly White

I do think that they are good technical foundations, and there is the potential for enforcement to be layered atop them.

Technology alone won’t solve this issue, nor will it provide the levers for enforcement, so it’s somewhat reasonable that they don’t attempt to.

But it would be nice to see some more proactive recognition from groups proposing these signals that enforcement is going to be needed, and perhaps some ideas for how their signals could be incorporated into such a regime.

#BlueSky #AI

in reply to Molly White

is there a reason why it would be a bad idea to just use IP block lists for known bad actors?
in reply to flaeky pancako

@fleeky that could be part of it, but it’s a game of whack-a-mole and AI companies are used to circumventing IP blocks. plus with federated protocols, maintaining and enforcing a blocklist is considerably more challenging.
in reply to Molly White

i ran an email server for like 10-15 years and with SpamAssassin i just never had spam in my personal email..

spamassassin.apache.org/

maybe it wouldn't be foolproof, but every time they get through you plug the hole. it's a game of whack-a-mole, but it costs them money..

in reply to flaeky pancako

side note, i enjoyed this random reddit user's idea:
*random reddit user*

The best technique I've seen to combat this is:

Put a random, bad link in robots.txt. No human will ever read this.

Monitor your logs for hits to that URL. All those IPs are LLM scraping bots.

Take that IP and tarpit it.
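
(A minimal sketch of that recipe, assuming an nginx-style combined access log; the decoy path, log location, and filename are all hypothetical.)

    # scan_honeypot.py - collect the IPs that fetched the robots.txt decoy.
    #
    # The matching robots.txt entries would be:
    #   User-agent: *
    #   Disallow: /honeypot-do-not-crawl/
    #
    # The decoy path is listed only in robots.txt and linked from nowhere
    # else, so anything requesting it read robots.txt and ignored the
    # Disallow line.
    import re

    LOG_PATH = "/var/log/nginx/access.log"
    DECOY = "/honeypot-do-not-crawl/"

    # In combined log format, the first field is the client IP and the
    # request line is the first quoted field: "GET /path HTTP/1.1".
    line_re = re.compile(r'^(\S+) .*?"(?:GET|HEAD) (\S+)')

    offenders = set()
    with open(LOG_PATH) as log:
        for line in log:
            m = line_re.match(line)
            if m and m.group(2).startswith(DECOY):
                offenders.add(m.group(1))

    # Hand these off to your firewall or tarpit of choice.
    for ip in sorted(offenders):
        print(ip)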

in reply to flaeky pancako

@fleeky this assumes that scrapers are a) looking at robots.txt at all and b) forming their crawl strategy based on disallowed paths, which seems unlikely
in reply to Molly White

If the only thing you care about is sucking up as much data as possible, any path you know about is something to raid
@fleeky
in reply to Molly White

what if there isn't any legal lever for enforcement? Or do you see some other potential?
in reply to Molly White

I think you are right. I am not holding my breath, though; the state of the world makes that seem far away.
in reply to Molly White

I keep thinking the technical part should be for the server to enforce acceptance of the terms: require something like a cookie tied to the client's TLS cert, keep a server-side record that that cert has accepted the terms, and redirect anyone without such a record to the page that offers the terms. That way there's no possibility of a reasonable belief that something being openly accessible implies any broader terms of use than were explicitly agreed to.
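
(Roughly, as a sketch: assume the TLS-terminating proxy forwards the client certificate's fingerprint in a request header; the header name, the acceptance store, and the paths here are all hypothetical.)

    # terms_gate.py - serve nothing to a client whose TLS cert has no
    # recorded acceptance of the terms. WSGI middleware; assumes the
    # reverse proxy passes the cert fingerprint in a header.

    accepted_certs = set()  # in practice, a persistent record

    def terms_gate(app):
        def middleware(environ, start_response):
            fp = environ.get("HTTP_X_CLIENT_CERT_FINGERPRINT")
            if environ.get("PATH_INFO", "") == "/terms":
                # The terms page is always reachable; a POST here records
                # acceptance for this cert.
                if environ.get("REQUEST_METHOD") == "POST" and fp:
                    accepted_certs.add(fp)
                return app(environ, start_response)
            if fp is None or fp not in accepted_certs:
                # No recorded acceptance: redirect to the terms page.
                start_response("302 Found", [("Location", "/terms")])
                return [b""]
            return app(environ, start_response)
        return middleware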
in reply to Molly White

At least in the EU this could probably be constructed to have legal recognition under the DSM Text and Data Mining exemption.
in reply to Molly White

The disregard for enforcement issues is mostly a consequence of such proposals never starting from a data protection perspective, because that would lead to entirely different conceptualizations.

Instead, concepts are based on lists of "who wants to ingest, and for what", because the goal is to make data "available" to the respective entities.

This takes away the legitimization burden from those entities, and puts the consequences onto the shoulders of individuals: we're expected to familiarize ourselves with a growing typology of data krakens, and what stance we should take towards each of them.

Proposals like the ones you've mentioned also normalize this work, as if said entities were absolutely entitled to us doing it.

in reply to katzenberger 🇺🇦

@katzenberger I think it makes sense for projects like CC and Bluesky/ATproto to start from this perspective because they are fundamentally open. I agree that closed ecosystems should take a very different approach.
in reply to Molly White

Interesting. I'd call individuals and their communities on platforms "closed" with respect to their reason for being there.

IMHO what matters isn't the "open" technicalities of the platform that hosts them. Open protocols that facilitate data exchange between servers are not per se a kind of permission to tap into the exchange.

In that respect, communities and their platforms cannot be considered an "ecosystem" with the same "openness" rules applying to both components.

E.g., recently, I've started to become very suspicious of "APIs" being defined that have but one purpose, from a developer's perspective: extracting content, and helping with ingestion.

This is in total disregard for what brings individuals and communities together (informal, psychologically safe, and free conversation among like-minded people).

It subjugates communities to technical considerations, refusing to abide by conventions (or even laws): You don't want me to get hold of your "content"? Find a technically watertight way to prevent me from doing it, or be ready to be ridiculed for your naivete.

in reply to Molly White

We'd have to convince lots of people and companies to use something like Solid pods, or adopt standards.ieee.org/ieee/7012/7…
in reply to an unknown parent

Molly White

@hellpie I agree that hoping crawlers will respect these signals out of the pureness of their hearts is not sufficient.

I disagree that copyright is “the legally binding version of user preferences”, though. There is no good legal framework for consent when it comes to AI training, and so a lot of people are falling back to copyright, since there are laws on the books. But I don’t think copyright is the weapon with which to fight this battle — in my view, AI training pretty clearly falls under fair use (and should).
