stux⚡️

Saturday, August 9, 2025, 7:00 PM

So according to this article: dropsitenews.com/p/meta-facebo…

#Meta is scraping the media proxies of mstdn, masto and .coffee..

If this is true, it is very worrying and pisses me off very much

No wonder our media loads so crappy if they are constantly tapping in..

Fuck #Meta to hell

LEAKED: A New List Reveals Top Websites Meta Is Scraping of Copyrighted Content to Train Its AI

The tech giant is sidestepping guardrails that websites use to prevent being scraped, data show, in a move whistleblowers say is unethical and potentially illegal.

Murtaza Hussain (Drop Site News)

in reply to stux⚡️

stux⚡️

in reply to stux⚡️ Saturday, August 9, 2025, 7:02 PM

Does anyone know a good way to block them?

I do not want anything to do with their shady business and they should stay away from ours

in reply to stux⚡️

👽🐦🦇🐉💻

in reply to stux⚡️ Saturday, August 9, 2025, 7:06 PM

Maybe some kind of rate limiter?

That would work if their requests all originate from the same IP address (which is likely, considering they are a large company).
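
For anyone who wants to try the rate-limiter idea on nginx, a minimal sketch with the stock `limit_req` module might look like the following. The zone name `perip`, the rate, the burst size and the hostname `example.social` are placeholder assumptions, not values anyone in this thread has tested. As noted above, it only helps if the scraper's requests arrive from a small number of addresses.

```nginx
# http {} context: track clients by IP in a 10 MB shared zone
# (zone name and rate are placeholders)
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

server {
    server_name example.social;  # hypothetical hostname

    location / {
        # tolerate short bursts of up to 20 extra requests, reject the rest
        limit_req zone=perip burst=20 nodelay;
        # answer with 429 instead of the default 503 when the limit is hit
        limit_req_status 429;
        # ... existing proxy_pass / try_files configuration goes here ...
    }
}
```

A crawler that rotates addresses will stay under a per-IP limit, so this is probably best treated as a complement to user-agent blocking rather than a replacement.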

in reply to stux⚡️

Jorijn Schrijvershof

in reply to stux⚡️ Saturday, August 9, 2025, 7:11 PM

yes. Block their user agent. We did: github.com/toot-community/plat…

platform/manifests/applications/ingress-nginx/helm-values.yaml at main · toot-community/platform

Platform code for the toot.community Mastodon instance - toot-community/platform

GitHub

in reply to Jorijn Schrijvershof

stux⚡️

in reply to Jorijn Schrijvershof Saturday, August 9, 2025, 7:12 PM

@jorijn oh! is that enough? 😀

in reply to stux⚡️

xyhhx 🔻

in reply to stux⚡️ Saturday, August 9, 2025, 7:18 PM

what about iocaine, quixotic, etc?

in reply to stux⚡️

Uckermark MacGyver

in reply to stux⚡️ Saturday, August 9, 2025, 7:20 PM

maybe have a look at my setup: repos.mxhdr.net/maxheadroom/Tr…

Traefik-Bot-Blocking

This is my Traefik Middleware to block bots

Forgejo: Beyond coding. We Forge.

in reply to stux⚡️

stux⚡️

in reply to stux⚡️ Saturday, August 9, 2025, 7:21 PM

Just edited our #nginx configs and added

if ($http_user_agent ~* "Meta-ExternalAgent") {
    return 403;
}

to the server block

#AI #NoAI #Meta

This entry was edited (Saturday, August 9, 2025, 7:24 PM)

in reply to stux⚡️

👽🐦🦇🐉💻

in reply to stux⚡️ Saturday, August 9, 2025, 7:24 PM

😀

http.cat/403

in reply to stux⚡️

Kevin Russell

in reply to stux⚡️ Saturday, August 9, 2025, 7:24 PM

Needs hashtags. (lol)

in reply to stux⚡️

食 Shoku the Umbreon

in reply to stux⚡️ Saturday, August 9, 2025, 7:27 PM

I'm very curious to know what effect this has on the traffic. From the articles I read, they usually aren't that forthcoming about their scraping attempts.

in reply to stux⚡️

stux⚡️

in reply to stux⚡️ Saturday, August 9, 2025, 7:30 PM

In addition, I've also enabled Cloudflare's AI scraper blocking for masto.ai and .coffee

mstdn.social DNS is at Hetzner

in reply to stux⚡️

Kevin Russell

in reply to stux⚡️ Saturday, August 9, 2025, 7:34 PM

Is there an Instance Runner newsletter for all nodes to learn from your efforts?

in reply to Kevin Russell

stux⚡️

in reply to Kevin Russell Saturday, August 9, 2025, 7:36 PM

@kevinrns When Ghost's new federation works fully I'll set up a small blog on a sub of mstdn with info, posts and also financial data etc 💪

in reply to stux⚡️

Kevin Russell

in reply to stux⚡️ Saturday, August 9, 2025, 8:20 PM

👏 👏

in reply to stux⚡️

Cybarbie

in reply to stux⚡️ Saturday, August 9, 2025, 9:02 PM

this is why I think instances should be authenticated only. There is an optional authorized-fetch/secure-mode. I had assumed my instance had it enabled but no. 🙁

Cloudflare recently called out Perplexity for bypassing their AI Labyrinth.

in reply to stux⚡️

WideEyedCurious 🇺🇸 💙 🇺🇦 & 🇨🇦

in reply to stux⚡️ Saturday, August 9, 2025, 11:02 PM

I use Cloudflare at work and have enabled every AI-scraping bot feature available in the free plan since we all know they’re ignoring anything we put in the robots.txt file.

in reply to stux⚡️

Viss

in reply to stux⚡️ Saturday, August 9, 2025, 7:30 PM

I wonder if adding another line to that stanza, specifically so that the logs of everything hitting that rule go to a separate log file (so you can harvest all the source IPs), would be handy?

Because I'm sure they will do the Anthropic thing soon, and will start moving their scrapers into various other clouds to get around people blocking their ASN and user agents.

If you can fingerprint their patterns, block them by pattern >😁

Reverse-Cambridge-Analytica them.
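
A rough sketch of the separate-log idea, assuming nginx 1.7.0 or later (needed for the `if=` parameter of `access_log`). The variable name `$is_meta_bot` and both log paths are placeholders made up for illustration.

```nginx
# http {} context: $is_meta_bot becomes 1 when the user agent matches
# (case-insensitive regex), otherwise 0
map $http_user_agent $is_meta_bot {
    default                 0;
    "~*Meta-ExternalAgent"  1;
}

server {
    # declaring any access_log at this level replaces the inherited one,
    # so the regular log is restated here
    access_log /var/log/nginx/access.log combined;
    # extra log written only for matching requests (placeholder path)
    access_log /var/log/nginx/meta-bot.log combined if=$is_meta_bot;

    if ($is_meta_bot) {
        return 403;
    }
}
```

With the `combined` format the client address is the first field of each line, so something like `awk '{print $1}' /var/log/nginx/meta-bot.log | sort -u` should list the distinct source IPs.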

in reply to stux⚡️

Kevin Russell

in reply to stux⚡️ Saturday, August 9, 2025, 7:37 PM

Thanks so much for your constant efforts on our behalf, you're a hero and general inspiration.

#mastodon

in reply to stux⚡️

Paul Chambers🚧

in reply to stux⚡️ Saturday, August 9, 2025, 7:37 PM

I've been looking at The Ultimate Nginx Bad Bot Blocker, I just want to make sure it doesn't include Mastodon due to the "DDOS" link preview issue.

It claims, "The Ultimate Nginx Bad Bot, User-Agent, Spam Referrer Blocker, Adware, Malware and Ransomware Blocker, Clickjacking Blocker, Click Re-Directing Blocker, SEO Companies and Bad IP Blocker with Anti DDOS System, Nginx Rate Limiting and Wordpress Theme Detector Blocking. Stop and Block all kinds of bad internet traffic even Fake Googlebots from ever reaching your web sites. "

github.com/mitchellkrogza/ngin…

GitHub - mitchellkrogza/nginx-ultimate-bad-bot-blocker: Nginx Block Bad Bots, Spam Referrer Blocker, Vulnerability Scanners, User-Agents, Malware, Adware, Ransomware, Malicious Sites, with anti-DDOS, Wordpress Theme Detector Blocking and Fail2Ban Jail for

GitHub

This entry was edited (Saturday, August 9, 2025, 7:43 PM)

in reply to stux⚡️

nvsr・ニーク [PREMIUM]

in reply to stux⚡️ Saturday, August 9, 2025, 7:47 PM

Returning the nginx-specific code 444 just closes the connection without sending a response, freeing up resources faster
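
As a sketch, that would mean swapping the status code in the rule posted earlier in the thread; nothing else changes:

```nginx
# inside the server {} block
if ($http_user_agent ~* "Meta-ExternalAgent") {
    # 444 is a non-standard nginx code: close the connection
    # without sending any response to the client
    return 444;
}
```

The trade-off is that the client gets no status code at all, so it cannot tell a deliberate block from a network failure.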

in reply to stux⚡️

Steve's Place

in reply to stux⚡️ Saturday, August 9, 2025, 9:15 PM

I did the same.

in reply to stux⚡️

Eoin O'Neill

in reply to stux⚡️ Saturday, August 9, 2025, 11:07 PM

Wouldn't it be more effective to fight fire with fire?

Create a mass amount of fake-ish looking content and then serve that up as real content to the scraper, effectively poisoning the AI? So impersonate a fake user post, a fake image with improper alt text.

This way, they might not catch onto it right away. A 403 means they'll instantly change their methodology because they know they're blocked.

in reply to stux⚡️

Erwin Wessels

in reply to stux⚡️ Sunday, August 10, 2025, 5:41 PM

wonder if this is one of those rare cases one can use a 451…

in reply to stux⚡️

Owen G. Richards - ANTIFAscist

in reply to stux⚡️ Saturday, August 9, 2025, 10:59 PM

I doubt this will be acceptable to many, but I use a captcha to keep the bots at bay... (mine's homegrown, not a 'free'/'purchase' option).

in reply to stux⚡️

Nitin Khanna

in reply to stux⚡️ Saturday, August 9, 2025, 11:02 PM

wonder if Anubis or Fail2ban can do more for you…

in reply to stux⚡️

Fruity Mercury

in reply to stux⚡️ Sunday, August 10, 2025, 4:23 AM

block? We should be suing

in reply to stux⚡️

Distante

in reply to stux⚡️ Sunday, August 10, 2025, 4:39 AM

Does this thing work?

404media.co/the-open-source-so…

The Open-Source Software Saving the Internet From AI Bot Scrapers

Anubis, which block AI scrapers from scraping websites to death, has been downloaded almost 200,000 times.

Emanuel Maiberg (404 Media)

Unknown parent

stux⚡️

Unknown parent Saturday, August 9, 2025, 7:04 PM

nuclear option - i am not entirely serious

in reply to stux⚡️

Extreme Electronics

in reply to stux⚡️ Saturday, August 9, 2025, 7:06 PM

Not just Meta; none of them respect any content anymore. No point in robots.txt. They will not rate limit whatsoever. They all chop and change IPs and user agents to evade limiting.

They regularly kill smaller servers dead, then move on (if you're lucky). They are trashing the Web for small hosts.

in reply to stux⚡️

grob (teeth era) 🇺🇦🏳️‍🌈🏳️‍⚧️

in reply to stux⚡️ Saturday, August 9, 2025, 7:07 PM

Does that mean Fedi people can take part in the class action lawsuit that could do "immense harm not only to a single AI company, but to the entire fledgling AI industry and to America’s global technological competitiveness." as stated by said industry reps?

arstechnica.com/tech-policy/20…

via @Lazarou
mastodon.social/@Lazarou/11499…

AI industry horrified to face largest copyright class action ever certified

Copyright class actions could financially ruin AI industry, trade groups say.

Ashley Belanger (Ars Technica)

in reply to stux⚡️

DJ Sexy Fresh

in reply to stux⚡️ Saturday, August 9, 2025, 7:11 PM

lol. They prohibit, shadowban, or actual-ban any acknowledgement of the existence of the fediverse on their own platforms, while also scraping it for content illegally.

Duplicitous and vile.

Unknown parent

stux⚡️

Unknown parent Saturday, August 9, 2025, 7:14 PM

@alex yes, im gonna test dis:

if ($http_user_agent ~* "Meta-ExternalAgent") {
    return 403;
}

in reply to stux⚡️

SpaceLifeForm

in reply to stux⚡️ Saturday, August 9, 2025, 9:32 PM

The robots.txt file is just an 'ask' but means nothing.

See Cloudflare and Perplexity pointing at each other.

in reply to stux⚡️

Exostence

in reply to stux⚡️ Saturday, August 9, 2025, 9:41 PM

techcrunch.com/2025/08/04/perp…

Once you put information out there, it appears corporations and other unscrupulous actors take that as an invitation to use the "free data" for whatever they want. They have weaponized it more than once (clearview, et al.) and I see nothing that prevents them from doing this again. "Public domain", they'll cry.

I've minimized my web presence, but there's nothing to suggest they will stop at what they consider "public domain".

Perplexity accused of scraping websites that explicitly blocked AI scraping | TechCrunch

Internet giant Cloudflare says it detected Perplexity crawling and scraping websites, even after customers had added technical blocks telling Perplexity not to scrape their pages.

Lorenzo Franceschi-Bicchierai (TechCrunch)

in reply to stux⚡️

VulcanTourist

in reply to stux⚡️ Saturday, August 9, 2025, 11:20 PM

Can't Meta's IP blocks be rejected from accessing Fedi domains? It would probably turn into a game of whack-a-mole, but it's better than doing nothing.
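
In nginx terms, rejecting address ranges could be sketched with the standard `deny` / `allow` directives. The two ranges below are reserved documentation prefixes used purely as placeholders; they are not Meta's networks, and the real ranges would have to be looked up and kept current.

```nginx
server {
    # PLACEHOLDER ranges (reserved for documentation), NOT Meta's real networks
    deny 192.0.2.0/24;
    deny 2001:db8::/32;
    # everyone else is let through
    allow all;
}
```

If nginx sits behind a proxy or CDN, the address it sees is the proxy's unless something like `ngx_http_realip_module` is configured, so the rule may need to live at the edge instead.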

in reply to stux⚡️

nicdex 🇨🇦 🇺🇦

in reply to stux⚡️ Saturday, August 9, 2025, 11:40 PM

Damn it, looks like TechHub is on there too >🙁

Unknown parent

Jonathan Lamothe

Unknown parent Sunday, August 10, 2025, 12:26 AM

@hellosilverpatriot If they're ignoring robots.txt (which appears to be the case) they should have their training data poisoned.

in reply to stux⚡️

Crovanian

in reply to stux⚡️ Sunday, August 10, 2025, 12:29 AM

huh, neat. Can zuck fuck off please, thank you 😀

in reply to stux⚡️

Ferdi Groenewald

in reply to stux⚡️ Sunday, August 10, 2025, 5:31 AM

@thomas You know about this?
