Are other instance admins running into AI crawlers impacting availability?

fediadmin
6 Posts 3 Posters 3 Views
    julian
    wrote last edited by
    #1

    A general call-out to #fediadmin about whether they have been experiencing issues with site availability, specifically an on-again off-again bot wave that crawls publicly-available ActivityPub data at a rate so punishing that it overwhelms my server.

    We had to turn on “I’m Under Attack” mode from the Cloudflare dashboard, but that comes with some not-so-nice tradeoffs (including blocking all automated crawlers, even the good ones).

    I initially thought someone just didn’t like us (community.nodebb.org, actually), but now we’ve received reports of it happening to another NodeBB, and silverpill@mitra.social’s mitra.social is down for the count currently.

    Admittedly, that last one could be unrelated.

      Rimu
      wrote last edited by
      #2

      Yes it’s affecting everyone.

      The way you described it, it sounds like they’re making ActivityPub requests, with the Accept: application/activity+json header set?? I haven’t seen that, they all just make normal web requests.

      Some instances have set up Anubis but I found it just about impossible to get configured right, in a way that doesn’t break federation or api access.

      I found that the user agent strings they use for their requests have random numbers for the browser version, so if you craft an nginx regex-based config that blocks really old browsers, that gets rid of 95% of it.
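To make the "block really old browsers" idea concrete, here is a minimal Python sketch of the same check, handy for testing a cutoff against log samples before committing it to an nginx regex. The cutoff of 100 and the function name `is_stale_browser` are assumptions for illustration, not values from this thread.

```python
import re

# Hypothetical cutoffs: majors below these are treated as implausibly old.
MIN_CHROME_MAJOR = 100
MIN_FIREFOX_MAJOR = 100

# Captures the browser token and its major version, e.g. "Chrome/41".
_VERSION_RE = re.compile(r"\b(Chrome|Firefox)/(\d+)")


def is_stale_browser(user_agent: str) -> bool:
    """Return True when the UA claims a Chrome/Firefox major below the cutoff."""
    match = _VERSION_RE.search(user_agent)
    if match is None:
        # Not a Chrome/Firefox UA (e.g. federation software); leave it alone.
        return False
    browser, major = match.group(1), int(match.group(2))
    minimum = MIN_CHROME_MAJOR if browser == "Chrome" else MIN_FIREFOX_MAJOR
    return major < minimum
```

Because user agents that are not Chrome or Firefox fall through untouched, server-to-server federation requests would not be caught by this particular check.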

      Also, in Cloudflare you can block entire ASNs in the security rules area. AS136907 and AS45899 are block-worthy.

      Also https://community.nodebb.org/robots.txt is extremely weak. Compare that with https://piefed.social/robots.txt

        silverpill
        wrote last edited by
        #3

        >mitra.social is down

        There was an outage at the data center somewhere. I haven't noticed any unusual crawler activity.

          julian
          wrote last edited by
          #4

          rimu@piefed.social said in Are other instance admins running into AI crawlers impacting availability?:
          > The way you described it, it sounds like they’re making ActivityPub requests, with the Accept: application/activity+json header set?? I haven’t seen that, they all just make normal web requests

          They’re just regular requests. Just hitting routes that aren’t local to the NodeBB (they’re probably getting them from our version of the federated timeline.)

          Thanks for the actionable recommendations! A lot of this is easy to do if you use Cloudflare, but would be harder if you don’t have that option available to you.

          I always thought robots.txt was basically pointless. We use it as guidance for SEO purposes, but these crawlers don’t respect it, do they?

            Rimu
            wrote last edited by
            #5

            The ones in the PieFed robots.txt no longer appear in my nginx logs. It can take a few days for them to notice but it works eventually. They are generally well behaved (1 request per second, maximum) but collectively they still add up to quite something.

            The truly naughty scrapers, like those with the fake user agents, yeah they ignore it.
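For anyone who wants to check what a given robots.txt tells a well-behaved crawler, here is a small sketch using Python's standard `urllib.robotparser`. The rules in `ROBOTS_TXT` and the `example.org` URLs are made up for illustration; they are not copied from the PieFed or NodeBB files.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only: one named crawler is shut out entirely,
# everyone else is kept out of a hypothetical /admin/ area.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

This only models crawlers that honour robots.txt. The scrapers with fake user agents never ask, so they need blocking at the server or CDN level.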

              julian
              wrote last edited by
              #6

              See that’s the thing, I don’t know if I want to block all AI crawlers. I’m not bullish on AI at all, and I do think that it’s a huge waste of resources for these companies to proactively crawl everything they can get their hands on, but by and large it’s not really been a major issue.

              I could be wrong… I know @baris did do some work on attempting to block some of these bots back when they were being quite egregious too.

              The requests in the wave we’re seeing all use common user agents and rotating IP addresses from Brazil/Vietnam/etc.

              I’d try a geo-block next I think. Possibly Anubis, as well as this little tidbit you mentioned above:

              > I found that the user agent strings they use for their requests have random numbers for the browser version, so if you craft an nginx regex-based config that blocks really old browsers, that gets rid of 95% of it.

              I need to inspect the user agents more closely next wave…
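One way to inspect the user agents during the next wave is to tally them straight from the access log. The sketch below assumes nginx's "combined" log format, where the user agent is the last quoted field on each line; the function name `count_user_agents` is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# In the "combined" format the user agent is the final quoted field.
_UA_RE = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines: Iterable[str]) -> Counter:
    """Count how often each user agent string appears in the log lines."""
    counts: Counter = Counter()
    for line in lines:
        match = _UA_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file to `count_user_agents` and calling `.most_common(20)` on the result should surface the busiest user agents. A log with a custom format would need a different regex.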
