Juan

I've read here multiple reports of people suffering disruption in their forge instances because of broken web crawlers.

I have not noticed anything, but: 1. I don't use a forge but cgit (and I cache aggressively), 2. I am blocking a long list of bad user agents at the nginx level, 3. I don't have monitoring 😅

So it may all be down to the last point, right?

(although I have never seen any load over 0.16 or so, that doesn't mean it is not happening)
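
For the curious, the user agent blocking in point 2 is the usual nginx map trick. This is only a sketch with example bot names, not my actual list:

map $http_user_agent $bad_bot {
    default       0;
    ~*GPTBot      1;
    ~*Bytespider  1;
    ~*ClaudeBot   1;
}

server {
    # ... rest of the vhost config ...
    if ($bad_bot) {
        return 403;
    }
}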

Had 216k requests on my cgit host just yesterday. So... yes, I'm experiencing the same issue, it's just that I'm not really paying attention 😅

First offending IP in the lot... owned by Alibaba Cloud. Of course.

Not an identifiable user agent. Bastards!

Filtering a whole /16 -- I don't give a shit 😂

And I have a simple script to detect abusers.

Run whois on the IP, see who owns it, decide it's probably safe to block the full range, and add it to iptables. This is fun!
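
Roughly like this, with made-up addresses because I'm not publishing my logs (just a sketch of the manual steps, not the actual script):

BADIP=203.0.113.7                            # placeholder, not a real offender
whois "$BADIP" | grep -iE 'inetnum|netname|orgname|cidr'
# if it belongs to some "cloud" outfit, drop the whole range whois reports
iptables -I INPUT -s 203.0.113.0/24 -j DROP  # in my case it was a /16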

My awk is still good, but I may feel cute tonight and write a python script.

Although this works:

awk '{ print $1 }' < /var/log/nginx/[snip].access.log.1 | sort | uniq -c | sort -n | tail -n 10

Gives you the 10 IPs making the most requests. In my case I had two IPs with more than 100k requests each, with the third busiest at... 300.

(which is likely to be something fishy, but I can handle that amount of requests in 24h)

Just to be clear:

1. They are ignoring the robots.txt
2. That crawler is likely to be broken and they aren't getting anything useful from my server
3. It is an IP range owned by a "cloud" company
4. Sharing my git repos is a gift from me to the world and I set the conditions
5. There's no 5th point

No abusers today! 🎉

I should probably set up a cron job to email me when we go over a threshold 🤔
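
Something along these lines would probably do; the log path, the email address and the threshold are just placeholders:

#!/bin/sh
# warn me if any single IP made more than THRESHOLD requests in yesterday's log
THRESHOLD=10000
LOG=/var/log/nginx/access.log.1
awk '{ print $1 }' "$LOG" | sort | uniq -c | sort -rn | head -n 1 |
while read count ip; do
    if [ "$count" -gt "$THRESHOLD" ]; then
        echo "$ip made $count requests yesterday" | mail -s "cgit abuse alert" me@example.com
    fi
done

# in crontab, run it once a day after logrotate:
# 30 6 * * * /usr/local/bin/abuse-alert.sh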

@reidrac What would awk look like if it had Python-style indentation sensitivity? :blobcatthinking:

@riley like a bad perl? 😂

@reidrac Perl doesn't sense indentation.

@riley that's why I said bad. I was joking about the fact that Perl was created because Larry Wall was pushing awk too hard 😂

@reidrac Well, there's always Parrot ...

# copy stdin to stdout, except for lines starting with #
while left_angle_right_angle:
    if dollar_underscore[0] =eq= "#":
        continue_next;
    }
    print dollar_underscore;
}

@reidrac gosh, I'm so proud I moved everything to my Synology NAS; my own Gitea instance was among the first things I set up on it

@hollowone if it was private, I wouldn't have issues. The tricky part is making things available for everybody and dealing with abuse at the same time.

@reidrac yeah, I don't share that much anymore and if I do, it means I gently let it be abandoned in the public domain sphere, and usually the IP and best practices of that code can be… easily questioned. If AI learns from that, no doubt it still fucks up simple 2D/3D rasterization routines when I ask it to improve the damn thing… mad circle, I accept that. :)

@reidrac torch down all of their ASs, just in case

@reidrac personally, I just blackhole all of cn subnets by default. Nothing good ever came from there.

@dpwiz any non-residential IP, that's probably OK.

@reidrac Yup ... done the same. I got tired of attacks from China so I looked up the IANA assignments and blocked them. Then IPFire came along and made it dirt simple with a location block addon. Want to block Belarus? Select the country's checkbox and hit save. Gotta love it! (specifically in response to blocking a /16)

@reidrac if the server is for personal use only, would it make sense to host the web server on a different port, like 8081? I don't think crawlers care about ports != 80.

That said, even my gopher server gets hit thousands of times per month by crawlers, and I have no idea what the point is of crawling such old services.

If I hosted a web server, I'd consider feeding some bullshit to the crawlers, like an animated gif or something like that.

@Blackthorn I use SSL, so it is 443. It doesn't matter. If a URL gets shared, eventually it will be crawled 🤷

It is fine, generally. The issue is the bad actors, which, with the new AI frenzy, are more common than they used to be.