Has anyone played around with the code required to block AI bots from scraping content to train their machine learning models? I’m concerned about this unilateral use of bots that potentially infringes on copyright protections, and I’d be curious to see what others have done to protect their work.
Very interesting. I’ll be waiting for someone to reply.
You do this with robots.txt.
You have to create an entry for each bot, but there are a ton of online generators.
E.g. ChatGPT documents this for their bot.
The whole search industry is changing, I wouldn’t be too reactive just yet, as we don’t know where the line gets drawn. Early days.
For ChatGPT add this to your robots.txt file
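As a minimal sketch: the rule OpenAI documents for its crawler is a `User-agent: GPTBot` group with `Disallow: /`. The Python below is only an illustration of how you could sanity-check such a rule before uploading it, using the standard library's `urllib.robotparser`. The `example.com` URL and the `SomeOtherBot` name are placeholders, not real values.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content to test. The two lines below are the rule
# documented for OpenAI's GPTBot crawler; add one group per bot.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# example.com and SomeOtherBot are placeholders for illustration only.
gptbot_allowed = parser.can_fetch("GPTBot", "https://example.com/any-page.html")
other_allowed = parser.can_fetch("SomeOtherBot", "https://example.com/any-page.html")

print("GPTBot allowed:", gptbot_allowed)
print("Other bots allowed:", other_allowed)
```

This only tells you what a well-behaved bot should do with the file. It does nothing against a bot that ignores robots.txt.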
The problem is that there are so many AI bots appearing that I’m not sure how effective this would be. For example, Bard has already practically indexed the entire internet through Google.
Which brings up the question of how search and indexing are used now and how they will work in the future.
A lot of data is collected by Common Crawl already and is used by language models.
The problem is also that not every bot will adhere to the instructions of the robots.txt file.
So then there’s the option to block bots based on their user-agent in your .htaccess file (on Apache-based web servers).
    # BLOCK BAD USER-AGENTS EXAMPLE
    SetEnvIfNoCase User-Agent "USERAGENTGOESHERE" bad_user
    Deny from env=bad_user
    # END BLOCK BAD USER-AGENTS
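Before putting a pattern into the `SetEnvIfNoCase` line, it can help to check which user-agent strings it would catch. Here is a rough Python sketch, assuming the pattern is treated as a case-insensitive regular expression matched anywhere in the header. Apache uses PCRE, so Python's `re` is only an approximation. `BLOCKED_PATTERNS` and `is_blocked` are made-up names for illustration, and `GPTBot` and `CCBot` are just example tokens to swap for your own.

```python
import re

# Example tokens only; replace with the user-agents you want to block.
BLOCKED_PATTERNS = ["GPTBot", "CCBot"]


def is_blocked(user_agent, patterns=BLOCKED_PATTERNS):
    """Return True if any pattern matches the user-agent, ignoring case."""
    return any(re.search(p, user_agent, re.IGNORECASE) for p in patterns)


# Sample header values, made up for this illustration.
for ua in [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "ccbot/2.0",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/118.0",
]:
    print(ua, "->", "blocked" if is_blocked(ua) else "allowed")
```

Keep in mind that a user-agent is trivially spoofed, so this only stops bots that identify themselves honestly.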
This is also referenced in my Htaccess Booklet for Blocs and Blocs Plus
However, I’m curious as to why you’d want to (try to) block LLMs from crawling your pages. This approach amounts to selective accessibility: some visitors are granted access to your website while others are not.
I would think you’d not want your content indexed, so your content can only be consulted via your site itself?
Besides that, most of you know I work at Combell.com (Belgium’s largest hosting company), where I’m one of the folks monitoring a shared hosting environment of over 1000 servers. I’ve seen a lot of extra load being generated on those servers by OpenAI’s crawler. Too much extra load and you might start to see an impact (slowdowns) for your customers. So from a hosting perspective I prefer not to have too much crawling going on (besides legitimate crawlers, e.g. Googlebot and the like). But that’s my personal opinion, not speaking for the company.
That’s a valid argument. However, by that logic, you’re essentially opposing the existence of OpenAI, Google Search, and similar platforms.
Not really. Depends on what your content is about. Blocking bots isn’t new.
I wholeheartedly concur, yet by adopting such a stance, you are essentially declaring: I have no desire for my webpage to appear in Google (or any other) searches, and I will never utilize AI in any capacity in its present state.
Intelligent discussions are awesome!
I read an article the other day suggesting that AI will generate answers to all questions as a means of keeping people on their platform, so they rarely if ever need to visit a website. Following on, something like Bard serves up replies, then only offers links to paid ads.
These big tech companies are all looking to dominate and maximise profits, so I wouldn’t exclude any possibilities. If they no longer need to show our content, because they can steal it and spin it into something similar with guaranteed earnings they may well be tempted.
Which means your clients will not be able to find you anymore on the www.
See how these things can evolve in ways we may not have considered before.
In the end, you and I are the product for these guys. If we want to beat them we’ll need to buy them.
The IP addresses of the internet used to be managed on a spreadsheet by one man. You would ring him if you wanted to be assigned a number.
So change is the only constant here, and our workflow will keep changing too…
I started on a BBS with ASCII code over dialup with a 300bps modem that had DIP switches. All your artwork was made with characters, one at a time, and my BBS went down if someone at home wanted to use the phone. It ran off a floppy disk.
Now I use page builders, IDEs and AI.
Tomorrow we might all have proprietary Meta chips in our brains and all our kids will be called Zuck… Well, that’s the dream of one man. I hope for a better tomorrow.
Thank god I don’t have kids
That’s the hard part of my job. Figuring out which bots are legitimate and which are not.
I’ll give an example:
Semrush is an SEO-oriented bot that crawls sites like there’s no tomorrow, basically doing a small DDoS (Distributed Denial of Service). However, some customers use it to analyze their site for SEO purposes… But in other cases the bot crawls the site unwanted, reducing the availability of that very site.
With bots it’s never black or white, but 5000 shades of grey.
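One rough way to see which bots are actually hammering a site is to tally requests per user-agent from the access log. This is a minimal Python sketch, assuming the Apache "combined" log format, where the user-agent is the last quoted field on each line. The `requests_per_agent` helper, the sample log lines, and the `ExampleBot/1.0` name are all made up for illustration.

```python
import re
from collections import Counter

# In the combined log format the user-agent is the last quoted field.
UA_AT_END = re.compile(r'"([^"]*)"\s*$')


def requests_per_agent(log_lines):
    """Count requests per user-agent string found at the end of each line."""
    counts = Counter()
    for line in log_lines:
        match = UA_AT_END.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines; the addresses are from a documentation-only range.
sample = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '203.0.113.5 - - [10/Oct/2023:13:55:37 +0000] "GET /about HTTP/1.1" 200 734 "-" "ExampleBot/1.0"',
    '203.0.113.9 - - [10/Oct/2023:13:56:02 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

for agent, hits in requests_per_agent(sample).most_common():
    print(hits, agent)
```

A high count alone doesn’t say whether a bot is wanted or not, which is exactly the grey area described above. It just tells you where to look first.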
I’ll get my fax out again.