Has anyone played around with the code required to deny AI bots from scraping content to train their machine learning models? I’m concerned about this unilateral use of bots, which potentially infringes on copyright protections, and I’d be curious to see what others have done to protect their work.
Very interesting. I’ll be waiting for someone to reply.
You do this with robots.txt
You have to create entries for each bot, but there are a ton of online generators.
E.g. OpenAI documents this for their GPTBot crawler.
The whole search industry is changing, I wouldn’t be too reactive just yet, as we don’t know where the line gets drawn. Early days.
For ChatGPT add this to your robots.txt file
User-agent: GPTBot
Disallow: /
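If you want to sanity-check a rule like this, Python’s standard library ships a robots.txt parser that shows how a compliant crawler would interpret it. A minimal sketch, assuming the two-line rule above and a placeholder URL on example.com:

```python
from urllib.robotparser import RobotFileParser

# The same rule as above, as a compliant crawler would read it.
rules = """\
User-agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.modified()                 # mark the rules as loaded before querying
rp.parse(rules.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/post"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/post"))  # True
```

GPTBot is denied everywhere, while agents without a matching entry (like Googlebot here) fall through to the default allow.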
The problem is that there are so many AI bots appearing that I’m not sure how effective this would be. For example, Bard has already practically indexed the entire internet through Google.
Which brings up the question about how search and indexing is used now and how it will work in the future.
A lot of data is collected by Common Crawl already and is used by language models.
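So if you want to cover the Common Crawl route as well, your robots.txt needs entries for several user-agent tokens, not just GPTBot. A sketch with the tokens these operators have published (verify each one against the crawler’s own documentation, as they can change):

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

CCBot is Common Crawl’s crawler, and Google-Extended is Google’s opt-out token for AI training (separate from regular Googlebot indexing).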
The problem is also that not every bot will adhere to the instructions of the robots.txt file.
So then there’s the option to block bots based upon a user-agent in your .htaccess file (on Apache based webservers).
# BLOCK BAD USER-AGENTS EXAMPLE
SetEnvIfNoCase User-Agent "USERAGENTGOESHERE" bad_user
Deny from env=bad_user
# END BLOCK BAD USER-AGENTS
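Note that `Deny from env=` is the older Apache 2.2 syntax (on 2.4 it needs mod_access_compat). A 2.4-native sketch of the same idea, with GPTBot and CCBot as example agents to substitute for your own list:

```apache
# BLOCK BAD USER-AGENTS (Apache 2.4+ syntax, assumes mod_authz_core)
SetEnvIfNoCase User-Agent "(GPTBot|CCBot)" bad_user
<RequireAll>
    Require all granted
    Require not env bad_user
</RequireAll>
# END BLOCK BAD USER-AGENTS
```

Unlike robots.txt, this is enforced by the server itself, so it also stops bots that ignore robots.txt (as long as they send an honest User-Agent header).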
This is also referenced in my Htaccess Booklet for Blocs and Blocs Plus.
I would think you wouldn’t want your content indexed at all then, so that it can only be consulted via your site itself?
Besides that, most of you know I work at Combell.com (Belgium’s largest hosting company), where I’m one of the folks monitoring a shared hosting environment of over 1,000 servers. I’ve seen a lot of extra load being generated on those servers by OpenAI’s crawler. With too much extra load you might start to see impact (slowdowns) for your customers. So from a hosting perspective I prefer not to have too much crawling going on, beyond legitimate crawlers such as Googlebot and the like.
But that’s my personal opinion, not speaking for the company.
Not really. Depends on what your content is about. Blocking bots isn’t new.
I read an article the other day suggesting that AI will generate answers to all questions as a means of keeping people on their platform, so they rarely, if ever, need to visit a website. Following on from that, something like Bard serves up replies, then only offers links to paid ads.
These big tech companies are all looking to dominate and maximise profits, so I wouldn’t exclude any possibilities. If they no longer need to show our content, because they can steal it and spin it into something similar with guaranteed earnings, they may well be tempted.
See how these things can evolve in ways we may not have considered before.
The IP addresses of the internet used to be managed on a spreadsheet by one man. You would ring him if you wanted to be assigned a number.
So change is the only constant here, and so will our workflow…
I started on a BBS with ASCII code over dial-up, with a 300 bps modem that had DIP switches.
All your artwork was made with characters, one at a time, and my BBS went down if someone at home wanted to use the phone. It ran off a floppy disk.
Now I use page builders, IDEs and AI.
Tomorrow we might all have proprietary Meta chips in our brains and all our kids will be called Zuck… well, that’s the dream of one man. I hope for a better tomorrow.
That’s the hard part of my job. Figuring out which bots are legitimate and which are not.
I’ll give an example:
Semrush runs an SEO-oriented bot that crawls sites like there’s no tomorrow, basically performing a small DDoS (Distributed Denial of Service). Some customers use it deliberately to analyse their site for SEO purposes, but in other cases the bot crawls a site unwanted, reducing the availability of that very site.
With bots it’s never black or white, but 5,000 shades of grey.
You could adopt the Flashman approach to spam and assume the worst, unless proven otherwise.
I think Americans still use fax machines, presumably so they can contact their bank about their cheques.
A place I worked at 20 years ago had a fax machine, but it would convert the fax to a PDF and email it to the right person. We had a lot of old-school customers.
The problem with AI crawling all content is that it will inevitably end in a flattening curve.
Currently, AI generates responses (in the case of ChatGPT 3.x) based on a controlled dataset of content crawled before 2022, and thus before the major AI boom when everyone started using it.
The downside is that AI is now crawling blog posts that were themselves written by AI, for example. The thing is that systems like ChatGPT are language models: they’re great at generating text, but not great at generating factually correct text.
Most of us here will no doubt fact-check whatever comes out of ChatGPT, but not everyone does. So incorrect content goes online, gets crawled again by OpenAI’s bot, and is added to the dataset behind ChatGPT 4. See where this is going?
The more content we spew out with such language models, the more inaccurate these models will inevitably become, hence a flattening curve.
At least, that’s my understanding of how these models and their bots work.
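You can see the feedback loop in miniature with a toy simulation: each “generation” is fitted only on a small sample drawn from the previous generation’s estimated distribution, so estimation error compounds. This is an illustrative sketch with arbitrary numbers, not a real language-model experiment:

```python
import random
import statistics

# Toy "photocopy of a photocopy": each generation re-estimates the
# distribution from a small sample of the previous generation's output.
random.seed(42)

mean, std = 0.0, 1.0   # "ground truth" distribution of human-written content
sample_size = 50       # each generation only sees a small sample
generations = 10
stds = []

for _ in range(generations):
    sample = [random.gauss(mean, std) for _ in range(sample_size)]
    mean = statistics.mean(sample)   # re-estimate from model output...
    std = statistics.stdev(sample)   # ...so sampling error compounds
    stds.append(std)

print(stds[0], stds[-1])  # the estimate drifts away from ground truth
```

Each generation inherits the previous one’s sampling noise instead of going back to the original data, which is the mechanism behind the flattening curve described above.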
I thought about this earlier and concluded that if AI keeps generating text based on other AI text, eventually there will be one answer to cover every question and it will be wrong.
Interesting thought. Language models trained on language model content.
That will be like a photocopy of a photocopy of a photocopy.
An AI generated image based on AI generated images.
That’s actually a great analogy.