So according to this article: dropsitenews.com/p/meta-facebo…
#Meta is scraping the media proxies of mstdn, masto and .coffee..
If this is true, this is very worrying and pisses me off very much
No wonder our media loads so crappy if they are constantly tapping in..
Fuck #Meta to hell
LEAKED: A New List Reveals Top Websites Meta Is Scraping of Copyrighted Content to Train Its AI
The tech giant is sidestepping guardrails that websites use to prevent being scraped, data show, in a move whistleblowers say is unethical and potentially illegal.
Murtaza Hussain (Drop Site News)
stux⚡️
in reply to stux⚡️ • • •Does anyone know a good way to block them?
I do not want anything to do with their shady business and they should stay away from ours
👽🐦🦇🐉💻
in reply to stux⚡️ • • •maybe some kind of rate limiter?
would work if their requests all originate from the same IP address (which is likely considering they are large companies).
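A rough sketch of what a per-IP rate limit could look like in nginx, assuming the instance sits behind nginx and the scraper really does come from a small set of addresses. The zone name `perip`, the zone size, and the rate and burst values are made-up illustration values, not anything taken from a real instance config.

```nginx
# Assumed to live in a file included from the http {} context.
# Zone name "perip", its 10m size and the 5r/s rate are illustrative values.
limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

server {
    # ... existing listen / server_name settings ...

    location / {
        # Allow short bursts of up to 20 extra requests, reject the excess.
        limit_req zone=perip burst=20 nodelay;
        # Answer rejected requests with 429 instead of the default 503.
        limit_req_status 429;

        # ... existing proxy_pass etc. ...
    }
}
```

This only helps while the requests share an address; a crawler spread across many IPs would each get its own budget.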
Jorijn Schrijvershof
in reply to stux⚡️ • • •platform/manifests/applications/ingress-nginx/helm-values.yaml at main · toot-community/platform
GitHub
stux⚡️
in reply to Jorijn Schrijvershof • • •xyhhx 🔻
in reply to stux⚡️ • • •Uckermark MacGyver
in reply to stux⚡️ • • •Traefik-Bot-Blocking
Forgejo: Beyond coding. We Forge.
stux⚡️
in reply to stux⚡️ • • •Just edited our #nginx configs and added
if ($http_user_agent ~* "Meta-ExternalAgent") {
return 403;
}
to the server block
#AI #NoAI #Meta
👽🐦🦇🐉💻
in reply to stux⚡️ • • •😀
http.cat/403
Kevin Russell
in reply to stux⚡️ • • •食 Shoku the Umbreon
in reply to stux⚡️ • • •stux⚡️
in reply to stux⚡️ • • •In addition, I've also enabled Cloudflare's AI scrape block for masto.ai and .coffee
mstdn.social DNS is at Hetzner
Kevin Russell
in reply to stux⚡️ • • •stux⚡️
in reply to Kevin Russell • • •Kevin Russell
in reply to stux⚡️ • • •Cybarbie
in reply to stux⚡️ • • •this is why I think instances should be authenticated only. There is an optional authorized-fetch/secure-mode. I had assumed my instance had it enabled but no. 🙁
Cloudflare recently called out Perplexity for bypassing their AI Labyrinth.
WideEyedCurious 🇺🇸 💙 🇺🇦 & 🇨🇦
in reply to stux⚡️ • • •Viss
in reply to stux⚡️ • • •i wonder if adding another line to that stanza, specifically so that everything hitting that rule gets logged to a separate log file (so you can harvest all the source IPs out), would be handy?
because i'm sure they will do the anthropic thing soon, and will start moving their scrapers into various other clouds to get around people blocking their ASN and user agents.
if you can fingerprint their patterns, block them by pattern 😁
reverse-cambridge-analytica them.
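A minimal sketch of that separate-log idea, assuming a reasonably recent nginx (the `if=` parameter of `access_log` needs 1.7.0 or later). The variable name `$is_meta_scraper` and both log paths are hypothetical, pick whatever fits your setup.

```nginx
# Assumed to live in a file included from the http {} context.
# $is_meta_scraper and the log paths are hypothetical names.
map $http_user_agent $is_meta_scraper {
    default                   0;
    "~*Meta-ExternalAgent"    1;
}

server {
    # ... existing listen / server_name settings ...

    # Keep the regular access log (path assumed); declaring any access_log
    # here replaces the ones inherited from the http level.
    access_log /var/log/nginx/access.log combined;
    # Additionally log only the matched requests to their own file.
    access_log /var/log/nginx/meta-scraper.log combined if=$is_meta_scraper;

    # Same block as before, now driven by the mapped variable.
    if ($is_meta_scraper) {
        return 403;
    }
}
```

With the `combined` format the client address is the first field of each line, so something like `awk '{print $1}' /var/log/nginx/meta-scraper.log | sort | uniq -c` should give a per-IP hit count to work from.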
Kevin Russell
in reply to stux⚡️ • • •Thanks so much for your constant efforts on our behalf, you're a hero and general inspiration.
#mastodon
Paul Chambers🚧
in reply to stux⚡️ • • •I've been looking at The Ultimate Nginx Bad Bot Blocker, I just want to make sure it doesn't include Mastodon due to the "DDOS" link preview issue.
It claims, "The Ultimate Nginx Bad Bot, User-Agent, Spam Referrer Blocker, Adware, Malware and Ransomware Blocker, Clickjacking Blocker, Click Re-Directing Blocker, SEO Companies and Bad IP Blocker with Anti DDOS System, Nginx Rate Limiting and Wordpress Theme Detector Blocking. Stop and Block all kinds of bad internet traffic even Fake Googlebots from ever reaching your web sites. "
github.com/mitchellkrogza/ngin…
GitHub - mitchellkrogza/nginx-ultimate-bad-bot-blocker: Nginx Block Bad Bots, Spam Referrer Blocker, Vulnerability Scanners, User-Agents, Malware, Adware, Ransomware, Malicious Sites, with anti-DDOS, Wordpress Theme Detector Blocking and Fail2Ban Jail for
GitHub
nvsr・ニーク [PREMIUM]
in reply to stux⚡️ • • •Steve's Place
in reply to stux⚡️ • • •Eoin O'Neill
in reply to stux⚡️ • • •Wouldn't it be more effective to fight fire with fire?
Create a mass amount of fake-ish looking content and then serve that up as real content to the scraper, effectively poisoning the AI? So impersonate a fake user post, a fake image with improper alt text.
This way, they might not catch onto it right away. A 403 means they'll instantly change their methodology because they know they're blocked.
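For illustration only, one way the "serve them something else" idea could be wired up in nginx, assuming a static decoy page has already been written by hand. The variable `$is_meta_scraper`, the `/decoy.html` URI and the `/var/www/decoy` directory are all hypothetical, and this only covers the routing, not generating convincing fake content.

```nginx
# Assumed to live in a file included from the http {} context.
# $is_meta_scraper, /decoy.html and /var/www/decoy are hypothetical names.
map $http_user_agent $is_meta_scraper {
    default                   0;
    "~*Meta-ExternalAgent"    1;
}

server {
    # ... existing listen / server_name settings ...

    # Instead of "return 403", quietly send matched requests to the decoy.
    if ($is_meta_scraper) {
        rewrite ^ /decoy.html last;
    }

    # Serves /var/www/decoy/decoy.html with a normal 200 response.
    location = /decoy.html {
        root /var/www/decoy;
    }

    # ... existing locations for real traffic ...
}
```

The scraper gets a 200 for every URL, so nothing on its side signals a block. It still matches on user agent though, so it stops working the moment they change it.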
Erwin Wessels
in reply to stux⚡️ • • •Owen G. Richards - ANTIFAscist
in reply to stux⚡️ • • •Nitin Khanna
in reply to stux⚡️ • • •Fruity Mercury
in reply to stux⚡️ • • •Distante
in reply to stux⚡️ • • •Does this thing work?
404media.co/the-open-source-so…
The Open-Source Software Saving the Internet From AI Bot Scrapers
Emanuel Maiberg (404 Media)
stux⚡️
Unknown parent • • •Sensitive content
@alex Nop unfort
Let's nuke FB HQ
Extreme Electronics
in reply to stux⚡️ • • •not just Meta, none of them respect any content any more. No point in robots.txt. They will not rate limit themselves whatsoever. They all chop and change IPs and user agents to evade limiting.
They regularly kill smaller servers dead, then move on (if you're lucky). They are trashing the Web for small hosts.
grob (teeth era) 🇺🇦🏳️🌈🏳️⚧️
in reply to stux⚡️ • • •does that mean Fedi people can take part in the class action lawsuit that could do "immense harm not only to a single AI company, but to the entire fledgling AI industry and to America’s global technological competitiveness." as stated by said industry reps?
arstechnica.com/tech-policy/20…
via @Lazarou
mastodon.social/@Lazarou/11499…
AI industry horrified to face largest copyright class action ever certified
Ashley Belanger (Ars Technica)
DJ Sexy Fresh
in reply to stux⚡️ • • •lol. They prohibit, shadowban, or actual-ban any acknowledgement of the existence of the fediverse on their own platforms, while also scraping it for content illegally.
Duplicitous and vile.
stux⚡️
Unknown parent • • •@alex yes, im gonna test dis:
if ($http_user_agent ~* "Meta-ExternalAgent") {
return 403;
}
SpaceLifeForm
in reply to stux⚡️ • • •The robots.txt file is just an 'ask' but means nothing.
See Cloudflare and Perplexity pointing at each other.
Exostence
in reply to stux⚡️ • • •techcrunch.com/2025/08/04/perp…
Once you put information out there, it appears corporations and other unscrupulous actors take that as an invitation to use the "free data" for whatever they want. They have weaponized it more than once (clearview, et al.) and I see nothing that prevents them from doing this again. "Public domain", they'll cry.
I've minimized my web presence, but there's nothing to suggest they will stop at what they consider "public domain".
Perplexity accused of scraping websites that explicitly blocked AI scraping | TechCrunch
Lorenzo Franceschi-Bicchierai (TechCrunch)
VulcanTourist
in reply to stux⚡️ • • •nicdex 🇨🇦 🇺🇦
in reply to stux⚡️ • • •Jonathan Lamothe
Unknown parent • •Crovanian
in reply to stux⚡️ • • •Ferdi Groenewald
in reply to stux⚡️ • • •