

Sabot in the Age of AI

Here is a curated list of strategies, offensive methods, and tactics for (algorithmic) sabotage, disruption, and deliberate poisoning.

🔻 iocaine
The deadliest AI poison—iocaine generates garbage rather than slowing crawlers.
🔗 git.madhouse-project.org/alger…

🔻 Nepenthes
A tarpit designed to catch web crawlers, especially those scraping for LLMs. It devours anything that gets too close. @aaron
🔗 zadzmo.org/code/nepenthes/

🔻 Quixotic
Feeds fake content to bots and robots.txt-ignoring #LLM scrapers. @marcusb
🔗 marcusb.org/hacks/quixotic.htm…

🔻 Poison the WeLLMs
A reverse-proxy that serves dissociated-press style reimaginings of your upstream pages, poisoning any LLMs that scrape your content (a minimal sketch of this style of text mangling follows the list). @mike
🔗 codeberg.org/MikeCoats/poison-…

🔻 Django-llm-poison
A django app that poisons content when served to #AI bots. @Fingel
🔗 github.com/Fingel/django-llm-p…

🔻 KonterfAI
A model poisoner that generates nonsense content to degenerate LLMs.
🔗 codeberg.org/konterfai/konterf…
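
Most of these tools share the same core trick regardless of how they are deployed (tarpit, reverse proxy, Django app): statistically plausible nonsense generated from real text, usually via a Markov chain. A minimal sketch of that idea in Python, not taken from any of the projects above; the corpus filename is just an example:

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each `order`-word prefix to the words observed to follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        prefix = tuple(words[i:i + order])
        chain[prefix].append(words[i + order])
    return chain

def generate(chain, length=200):
    """Walk the chain to emit plausible-looking but meaningless text."""
    prefix = random.choice(list(chain.keys()))
    out = list(prefix)
    for _ in range(length):
        followers = chain.get(prefix)
        if not followers:                      # dead end: restart somewhere else
            prefix = random.choice(list(chain.keys()))
            followers = chain[prefix]
        word = random.choice(followers)
        out.append(word)
        prefix = (*prefix[1:], word)
    return " ".join(out)

if __name__ == "__main__":
    corpus = open("my_articles.txt").read()    # any text you already publish
    print(generate(build_chain(corpus)))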

in reply to Nicole Parsons

Private-sector AI infrastructure is the tech industry building the pervasive surveillance state.

All the lunatic babble from the right about the deep state, but as usual, every accusation was a confession. These guys own the state, and everybody in it is, or is going to be, working for them.

Uncountable government bureaucrats all working for people like Larry Ellison.

in reply to ASRG

@Fingel This very much reminds me of the infinitely crawlable nonsense first designed (by me) to give Microsoft Recall a headache.
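
Roughly, "infinitely crawlable nonsense" means every URL deterministically produces junk plus links to more junk, so a crawl never terminates. A toy sketch of that pattern using only the Python standard library (the word list and link scheme are made up, not how the original worked):

```python
import hashlib
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

WORDS = ["lorem", "ipsum", "recall", "synergy", "vector", "quantum", "latent"]

class MazeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Seed the RNG from the path so every URL is stable but unique.
        seed = hashlib.sha256(self.path.encode()).hexdigest()
        rng = random.Random(seed)
        text = " ".join(rng.choice(WORDS) for _ in range(300))
        # Each page links to more generated pages, so the crawl never ends.
        links = "".join(
            f'<a href="/{seed[:8]}/{i}">more</a> ' for i in range(rng.randint(3, 8))
        )
        body = f"<html><body><p>{text}</p>{links}</body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), MazeHandler).serve_forever()
```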
in reply to ASRG

@Fingel another take that I hope I have time to write:

An app that feeds either static text or a poisoned Markov chain, but writes it back one byte at a time, trying to delay the client as much as possible. It would probably have to start with a big delay, and every time the client disconnects it would register the IP and the delay in a DB, so next time it tries a lower delay until it finds the best delay for each client.
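
A very rough sketch of that scheme, assuming the Python standard library plus SQLite for the per-IP delay table (the constants and the behaviour when the bait is fully delivered are guesses, not a finished design):

```python
import sqlite3
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

DB = sqlite3.connect("tarpit.db", check_same_thread=False)
DB.execute("CREATE TABLE IF NOT EXISTS delays (ip TEXT PRIMARY KEY, delay REAL)")
LOCK = threading.Lock()

START_DELAY = 5.0   # seconds between bytes for a client we have never seen
STEP = 0.5          # how much to lower the delay after each disconnect
BAIT = b"<html><body>" + b"nonsense " * 1000 + b"</body></html>"

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ip = self.client_address[0]
        with LOCK:
            row = DB.execute("SELECT delay FROM delays WHERE ip = ?", (ip,)).fetchone()
        delay = row[0] if row else START_DELAY
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        try:
            for byte in BAIT:               # drip the bait out one byte at a time
                self.wfile.write(bytes([byte]))
                self.wfile.flush()
                time.sleep(delay)
        except (BrokenPipeError, ConnectionResetError):
            # Client gave up: try a slightly lower delay next time.
            delay = max(0.0, delay - STEP)
        with LOCK:
            DB.execute("INSERT OR REPLACE INTO delays VALUES (?, ?)", (ip, delay))
            DB.commit()

if __name__ == "__main__":
    ThreadingHTTPServer(("", 8080), TarpitHandler).serve_forever()
```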

in reply to Marcos Dione

@Fingel is there a site where some of the craziest delusions from the original LLMs are recorded? We should feed them that back.
in reply to ASRG

@Fingel I have been doing something primitive with fail2ban and a "trigger" URL. But what I see is that the latest in scraping is to use a rotating set of IPs or proxies, so requests never seem to come from the same IP address, and with plausible user agents. I'm struggling with this: although I can see the overall behaviour, it only becomes clear after the fact that a given request was part of a scrape session, and blocking that IP won't stop the remaining scrapes. Firms are offering this kind of service commercially, and there are plenty of writeups on how to do it.
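
For context, the "trigger URL" approach usually amounts to a path that robots.txt disallows and that no human-visible link points to, so any hit on it is by definition a misbehaving bot; the hit is logged in a form fail2ban can match. An illustrative WSGI sketch (the paths and log format are invented):

```python
import logging

# Any request to this path is, by construction, a robots.txt violation:
# robots.txt says "Disallow: /trap/" and no human-visible link points there.
TRAP_PREFIX = "/trap/"

logging.basicConfig(filename="bot-trap.log",
                    format="%(asctime)s %(message)s", level=logging.INFO)

def application(environ, start_response):
    """Minimal WSGI app: log trap hits so fail2ban can ban the source IP."""
    path = environ.get("PATH_INFO", "")
    ip = environ.get("HTTP_X_FORWARDED_FOR", environ.get("REMOTE_ADDR", "-"))
    if path.startswith(TRAP_PREFIX):
        logging.info("TRAP %s %s %s", ip, path, environ.get("HTTP_USER_AGENT", "-"))
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"nothing to see here\n"]
```

A fail2ban filter matching the "TRAP <HOST>" lines then does the banning; as noted above, this still falls over once the scraper rotates IPs.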
in reply to Mr Salteena is not quite a gentleman

A medium-term plan for Nepenthes is to coordinate data amongst instances to conclusively identify crawlers, and hopefully allow people to ban them preemptively.

Still thinking through it. No ETA.
@asrg
in reply to ASRG

Thank you for this comprehensive list, this is great! I believe I will need one of these defenses sooner or later.

I worry a bit that looking at IP ranges and User-Agent strings might not be enough long-term. User-Agent strings can be faked easily, and requests can come from many different ranges.

Are there other reliable ways employed to identify bots, without giving away any important content (e.g. beyond landing page)? Maybe looking at behavioural data (e.g. rate of accessing URLs)?

@aaron @marcusb @mike @Fingel
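
One behavioural signal that is cheap to compute is sustained request rate per client: keep a short sliding window of timestamps and flag anything no human would produce. A minimal, illustrative sketch (the thresholds are made up, and against rotating proxies the key has to be coarser than a single IP):

```python
import time
from collections import defaultdict, deque

WINDOW = 60          # seconds of history to keep per client
MAX_HITS = 120       # sustaining more than ~2 requests/second looks like a bot

history = defaultdict(deque)   # client key -> recent request timestamps

def looks_like_a_bot(client_key: str, now: float | None = None) -> bool:
    """Record one request and report whether this client exceeds the rate limit.

    `client_key` can be an IP, but against rotating proxies it is better to
    key on something coarser (an IP range, a TLS fingerprint, a cookie...).
    """
    now = time.time() if now is None else now
    hits = history[client_key]
    hits.append(now)
    while hits and now - hits[0] > WINDOW:   # drop entries outside the window
        hits.popleft()
    return len(hits) > MAX_HITS
```

Behind a reverse proxy this would be fed from the access-log stream, with flagged clients getting a 403 or, in the spirit of this thread, poisoned content instead of the real pages.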
