Has anyone played around with the code required to deny AI bots from scraping content to train their machine learning models? I’m concerned about this unilateral use of bots, which potentially infringes on copyright protections, and I’d be curious to see what others have done to protect their work.
Very interesting. I’ll be waiting for someone to reply.
You do this with robots.txt
You have to create entries for each bot, but there are a ton of online generators.
E.g. OpenAI documents this for their GPTBot crawler.
The whole search industry is changing, I wouldn’t be too reactive just yet, as we don’t know where the line gets drawn. Early days.
For ChatGPT add this to your robots.txt file
User-agent: GPTBot
Disallow: /
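If you want to sanity-check a rule like this, Python’s standard library ships a robots.txt parser that shows how a compliant crawler would interpret it. A minimal sketch, assuming the two-line rule above and a placeholder URL on example.com:

```python
from urllib.robotparser import RobotFileParser

# The same rule as above, as a compliant crawler would read it.
rules = """\
User-agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.modified()                 # mark the rules as loaded before querying
rp.parse(rules.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/post"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/post"))  # True
```

GPTBot is denied everywhere, while agents without a matching entry (like Googlebot here) fall through to the default allow.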
The problem is that there are so many AI bots appearing that I’m not sure how effective this would be. For example, Bard has already practically indexed the entire internet through Google.
Which brings up the question about how search and indexing is used now and how it will work in the future.
A lot of data is collected by Common Crawl already and is used by language models.
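So if you want to cover the Common Crawl route as well, your robots.txt needs entries for several user-agent tokens, not just GPTBot. A sketch with the tokens these operators have published (verify each one against the crawler’s own documentation, as they can change):

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

CCBot is Common Crawl’s crawler, and Google-Extended is Google’s opt-out token for AI training (separate from regular Googlebot indexing).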
The problem is also that not every bot will adhere to the instructions of the robots.txt file.
So then there’s the option to block bots based upon a user-agent in your .htaccess file (on Apache based webservers).
# BLOCK BAD USER-AGENTS EXAMPLE
SetEnvIfNoCase User-Agent "USERAGENTGOESHERE" bad_user
Deny from env=bad_user
# END BLOCK BAD USER-AGENTS
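Note that `Deny from env=` is the older Apache 2.2 syntax (on 2.4 it needs mod_access_compat). A 2.4-native sketch of the same idea, with GPTBot and CCBot as example agents to substitute for your own list:

```apache
# BLOCK BAD USER-AGENTS (Apache 2.4+ syntax, assumes mod_authz_core)
SetEnvIfNoCase User-Agent "(GPTBot|CCBot)" bad_user
<RequireAll>
    Require all granted
    Require not env bad_user
</RequireAll>
# END BLOCK BAD USER-AGENTS
```

Unlike robots.txt, this is enforced by the server itself, so it also stops bots that ignore robots.txt (as long as they send an honest User-Agent header).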
This is also referenced in my Htaccess Booklet for Blocs and Blocs Plus.
I would think you wouldn’t want your content indexed at all then, so that it can only be consulted via your site itself?
Besides that, most of you know I work at Combell.com (Belgium’s largest hosting company), where I’m one of the folks monitoring a shared hosting environment of over 1,000 servers. I’ve seen a lot of extra load being generated on those servers by OpenAI’s crawler. With too much extra load you might start to see impact (slowdowns) for your customers. So from a hosting perspective I prefer not to have too much crawling going on, beyond legitimate crawlers such as Googlebot and the like.
But that’s my personal opinion, not speaking for the company.
Not really. Depends on what your content is about. Blocking bots isn’t new.
I read an article the other day suggesting that AI will generate answers to all questions as a means of keeping people on their platform, so they rarely, if ever, need to visit a website. Following on from that, something like Bard serves up replies, then only offers links to paid ads.
These big tech companies are all looking to dominate and maximise profits, so I wouldn’t exclude any possibilities. If they no longer need to show our content, because they can steal it and spin it into something similar with guaranteed earnings, they may well be tempted.
See how these things can evolve in ways we may not have considered before.
The IP addresses of the internet used to be managed on a spreadsheet by one man. You would ring him if you wanted to be assigned a number.
So change is the only constant here, and so will our workflow…
I started on a BBS with ASCII code over dial-up, with a 300 bps modem that had DIP switches.
All your artwork was made with characters, one at a time, and my BBS went down if someone at home wanted to use the phone. It ran off a floppy disk.
Now I use page builders, IDEs and AI.
Tomorrow we might all have proprietary Meta chips in our brains and all our kids will be called Zuck… well, that’s the dream of one man. I hope for a better tomorrow.
That’s the hard part of my job. Figuring out which bots are legitimate and which are not.
I’ll give an example:
Semrush runs an SEO-oriented bot that crawls sites like there’s no tomorrow, basically performing a small DDoS (Distributed Denial of Service). Some customers use it deliberately to analyse their site for SEO purposes, but in other cases the bot crawls a site unwanted, reducing the availability of that very site.
With bots it’s never black or white, but 5,000 shades of grey.
You could adopt the Flashman approach to spam and assume the worst, unless proven otherwise.
I think Americans still use fax machines, presumably so they can contact their bank about their cheques.
A place I worked at 20 years ago had a fax machine, but it would convert the fax to a PDF and email it to the right person. We had a lot of old-school customers.
The problem with AI crawling all content is that it will inevitably end in a flattening curve.
Currently, AI generates responses (in the case of ChatGPT 3.x) based on a controlled dataset of content crawled before 2022, and thus before the major AI boom when everyone started using it.
The downside is that AI is now crawling blog posts that were themselves written by AI, for example. The thing is that systems like ChatGPT are language models: they’re great at generating text, but not great at generating factually correct text.
Most of us here will no doubt fact-check whatever comes out of ChatGPT, but not everyone does. So incorrect content goes online, gets crawled again by OpenAI’s bot, and is added to the dataset behind ChatGPT 4. See where this is going?
The more content we spew out with such language models, the more inaccurate these models will inevitably become, hence a flattening curve.
At least, that’s my understanding of how these models and their bots work.
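You can see the feedback loop in miniature with a toy simulation: each “generation” is fitted only on a small sample drawn from the previous generation’s estimated distribution, so estimation error compounds. This is an illustrative sketch with arbitrary numbers, not a real language-model experiment:

```python
import random
import statistics

# Toy "photocopy of a photocopy": each generation re-estimates the
# distribution from a small sample of the previous generation's output.
random.seed(42)

mean, std = 0.0, 1.0   # "ground truth" distribution of human-written content
sample_size = 50       # each generation only sees a small sample
generations = 10
stds = []

for _ in range(generations):
    sample = [random.gauss(mean, std) for _ in range(sample_size)]
    mean = statistics.mean(sample)   # re-estimate from model output...
    std = statistics.stdev(sample)   # ...so sampling error compounds
    stds.append(std)

print(stds[0], stds[-1])  # the estimate drifts away from ground truth
```

Each generation inherits the previous one’s sampling noise instead of going back to the original data, which is the mechanism behind the flattening curve described above.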
I thought about this earlier and concluded that if AI keeps generating text based on other AI text, eventually there will be one answer to cover every question and it will be wrong.
Interesting thought. Language models trained on language model content.
That will be like a photocopy of a photocopy of a photocopy.
An AI generated image based on AI generated images.
That’s actually a great analogy.