
I’ve recently added anubis to lemmy.ml, and it seems to be working well.

I have a PR to add anubis to lemmy-ansible (our main installation method), and I could use some help tweaking and optimizing its botPolicy.yaml config for federated services.

Anyone with experience running anubis, this would be much appreciated.

  • olof@lemmy.ml
    5 hours ago

    Not Lemmy specific, but I wanted to set up Anubis in a setup where I have one reverse proxy (nginx) handling many different domains. Last time I looked, it seemed to need one Anubis instance per domain. Is that still the case? The goal was to have a single Anubis instance and route everything through it.

    • poVoq@slrpnk.net
      4 hours ago

      You could probably put Anubis in front of your reverse-proxy, but then you need something else in front of it that handles TLS certificates. So maybe something like this: HAProxy->Anubis->Nginx.
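
      A minimal haproxy.cfg sketch of that chain (the cert path and the Anubis port are assumptions, not from this thread):

        frontend https-in
            # HAProxy terminates TLS for all domains
            bind :443 ssl crt /etc/haproxy/certs/
            default_backend anubis

        backend anubis
            # Anubis listens here and forwards passed requests on to nginx (its TARGET)
            server anubis1 127.0.0.1:8923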

    • Dessalines@lemmy.mlOP
      4 hours ago

      I’m not an expert, but I think the fact that you need to set a TARGET in anubis (i.e., where anubis sends you after you pass the challenge) means that you do need a separate anubis instance for each site.
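
      A hedged sketch of what that looks like in practice, since Anubis is configured via environment variables (the ports here are purely illustrative):

        # one Anubis process per site, each with its own upstream TARGET
        BIND=:8923 TARGET=http://127.0.0.1:3000 ./anubis &
        BIND=:8924 TARGET=http://127.0.0.1:3001 ./anubis &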

  • Adam@doomscroll.n8e.dev
    8 hours ago

    It would be nice to have a much more aggressive anti-bot stance for communities/content that aren’t local. If Google or any other crawler wants to crawl c/lemmy@lemmy.ml, it should do that on the source instance. Doing it on mine makes no sense.
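
    One hedged way to express that in botPolicy.yaml, using the rule shapes shown elsewhere in this thread (the path pattern for remote communities is an assumption):

      # challenge crawlers much harder on remote communities like /c/lemmy@lemmy.ml
      - name: remote-community-crawlers
        path_regex: /c/[^/@]+@.+
        action: CHALLENGE
        challenge:
          difficulty: 6
          algorithm: slow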

    • Björn@swg-empire.de
      9 hours ago

      I regularly encounter images not loading from quock.au. No idea if they’ve got that under control now, but that is the most visible issue every instance fights with. It’s gonna be great when we have a recommended configuration for Lemmy.

      • Dessalines@lemmy.mlOP
        9 hours ago

        Yep, essentially the botPolicy.yaml there could be a collectively developed anubis config, based on what works best.

      • Otter@lemmy.ca
        9 hours ago

        We are not running Anubis, although we do block a large number of AI/LLM companies through IP addresses. Each time we block a new one, it makes a noticeable difference in the performance graphs.
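
        For nginx, that kind of IP blocking can be sketched like this (the ranges below are documentation placeholders, not real crawler addresses):

          # block known AI/LLM crawler ranges outright
          deny 192.0.2.0/24;
          deny 198.51.100.0/24;
          # everyone else is allowed
          allow all;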

  • julian@activitypub.space
    9 hours ago

    Sure. I have found that the default botPolicy works fine for blocking the AI bots, but blocks federation.

    At the reverse proxy level:

    if ($request_method = POST) {
        # federation activities arrive as POSTs; send them straight to the app, bypassing Anubis
        proxy_pass http://nodebb/;
    }
    

    Because Anubis can’t filter by HTTP method, unless I am mistaken. This just broadly allows all incoming activities. If you want to get specific, limit it to your shared inbox or individual user inboxes via regular expression or something. I didn’t find that it was necessary.
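
    If you did want to scope it, a hedged nginx sketch (the inbox paths are assumptions about a NodeBB-style layout, not confirmed routes):

      # only bypass Anubis for ActivityPub inbox endpoints
      location ~ ^/(inbox|user/[^/]+/inbox)$ {
          proxy_pass http://nodebb/;
      }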

    As for botPolicies.yaml:

      # Allow /inbox
      - name: allow-ap-headers
        headers_regex:
          # a YAML map can't repeat the Accept key, so both values are merged into one regex
          Accept: application/(activity\+json|ld\+json; profile="https://www.w3.org/ns/activitystreams")
        action: ALLOW
    
      - name: allow-assets
        path_regex: /assets
        action: ALLOW
    

    The former allows requests with those specific AP Accept headers (it is naive; some AP implementations send slight variations of those two headers).

    The latter allows our uploads.
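
    Given those variations, a more tolerant version of the header rule could be sketched like this (an untested assumption, broader than the exact-match version above):

      - name: allow-ap-headers-tolerant
        headers_regex:
          # match any Accept value mentioning activity+json or ld+json
          Accept: application/(activity|ld)\+json
        action: ALLOW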

    • Dessalines@lemmy.mlOP
      9 hours ago

      Lemmy has a separated UI and backend hosted on different ports, so it’s trivial for us to only put anubis in front of the frontend. We also couldn’t put it in front of everything because of the apps.
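
      That split can be sketched in nginx like this (the backend port is the usual lemmy default, and the Anubis port is an assumption):

        # API, federation, and image traffic goes straight to the lemmy backend
        location ~ ^/(api|pictrs|inbox) {
            proxy_pass http://127.0.0.1:8536;
        }
        # everything else (lemmy-ui) passes through Anubis first
        location / {
            proxy_pass http://127.0.0.1:8923;
        }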

    • poVoq@slrpnk.net
      9 hours ago

      This is the botPolicy.yaml that we use on slrpnk.net :

      bots:
        - name: known-crawler
          action: CHALLENGE
          expression:
            # https://anubis.techaro.lol/docs/admin/configuration/expressions
            all:
              # Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36
              - userAgent.contains("Macintosh; Intel Mac") && userAgent.contains("Chrome/125.0.0.0") # very old chrome?
              - missingHeader(headers, "Sec-Ch-Ua") # a valid chrome has this header
          challenge:
            difficulty: 6
            algorithm: slow
      
          # Assert behaviour that only genuine browsers display,
          # i.e. Chrome or Firefox versions that send the full set of modern request headers.
        - name: realistic-browser-catchall
          expression:
            all:
              - '"User-Agent" in headers'
              - '( userAgent.contains("Firefox") ) || ( userAgent.contains("Chrome") ) || ( userAgent.contains("Safari") )'
              - '"Accept" in headers'
              - '"Sec-Fetch-Dest" in headers'
              - '"Sec-Fetch-Mode" in headers'
              - '"Sec-Fetch-Site" in headers'
              - '"Accept-Encoding" in headers'
              - '( headers["Accept-Encoding"].contains("zstd") || headers["Accept-Encoding"].contains("br") )'
              - '"Accept-Language" in headers'
          action: CHALLENGE
          challenge:
            difficulty: 2
            algorithm: fast
      
        - name: generic-browser
          user_agent_regex: (?i:mozilla|opera)
          action: CHALLENGE
          challenge:
            difficulty: 4
            algorithm: fast
      
      status_codes:
        CHALLENGE: 202
        DENY: 406
      
      dnsbl: false
      
      #store:
      #  backend: valkey
      #  parameters:
      #    url: redis://valkey-primary:6379/0
      

      I think I just took it over from Codeberg.org, from back when they still used Anubis. Nothing in it is really Lemmy-specific, and it is only in front of the frontends, not the s2s federation API.

      It seems, though, that some crawlers use third-party hosted alternative frontends to crawl (unintentionally?) through the federation API, so something in front of that would be useful too, I guess.