Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RateLimiter: Use Cloudflare Turnstile to detect bots #4079

Draft
wants to merge 6 commits into
base: dev
Choose a base branch
from

Conversation

maccabeelevine
Copy link
Member

@maccabeelevine maccabeelevine commented Nov 13, 2024

This extends the RateLimiterManager module to use a free cloud-hosted "CAPTCHA alternative" bot-detection service, Cloudflare Turnstile.
https://developers.cloudflare.com/turnstile/

The idea is that a RateLimiterManager policy can have a second RateLimiter with a lower quota, which when it first runs out, challenges the user to prove they are not a bot.

  • If that challenge fails, the user is blocked immediately with a 429 (as with the normal rate limiter).
  • If the challenge succeeds, the user can pass through, at least until the normal RateLimiter quota kicks in.
  • The result of the challenge is cached (with a configurable duration) so it only needs to be done once.

The challenges are non-intrusive (i.e. no CAPTCHA; either no interaction at all, or just checking a box) and the page linked above claims it is WCAG 2.1 AA compliant.

Regarding user privacy, Cloudflare claims

Unlike CAPTCHA options, Turnstile never harvests data for ad retargeting. You can preserve the privacy of your users without sacrificing effectiveness.

and I've taken some extra steps to obfuscate the page that the user requested.

TODO

  • Tests, once there is agreement on the approach.
  • Real caching layer

Comment on lines +158 to +159
# Verify API URL. Default is https://challenges.cloudflare.com/turnstile/v0/siteverify
# verifyUrl: https://challenges.cloudflare.com/turnstile/v0/siteverify
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels silly to write the URL twice here and again as the default in the code, when honestly I'm not sure it should be configurable in the first place (if the URL changes, there's likely some logic within that will also change). Remove?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth making this configurable. I could imagine region-specific URLs or even something like an open source "work-alike" service that implements the same API. Probably not likely, but it doesn't hurt to have an easy configuration point just in case, so if the situation arises, we don't have to do any work to react to it. That being said, if you don't want to include the URL twice in the configuration, you could say "The example below shows the default value" (or something like that).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That make sense...but if I'm planning for a work-alike service, would it also share the JS https://challenges.cloudflare.com/turnstile/v0/api.js loaded in the template? I'm thinking I should make that configurable as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, definitely. I would always avoid hard-coded URLs when possible.

$policyId = $rateLimiterManager->checkPolicyUsesTurnstile($event);
$context = base64_encode(json_encode([
'policyId' => $policyId,
'destination' => $event->getRequest()->getUri()->getPath(),
Copy link
Member Author

@maccabeelevine maccabeelevine Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could just display the challenge right now. I'm redirecting to a /Turnstile/Challenge URL so the requested page is not accessible to Turnstile's javascript code, so it can't be sent to Cloudflare as part of the signal. (This also assumes that referrer is blocked.) Combining these params and base64-encoding them is an attempt at further obfuscation in case it uses the query parameters as signals as well. Of course it's still decode-able, but it seemed worthwhile.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth putting some of this reasoning into a comment in the code for future reference... and/or making a separate redirectToTurnstile method so it's easy to override/customize the behavior if desired.

$priorTurnstileResult = $this->checkPriorTurnstileResult($policyId, $this->clientIp);
if (!$priorTurnstileResult && !$turnstileLimit->isAccepted()) {
$result['allow'] = false;
$result['message'] = $this->getTooManyRequestsResponseMessage($event, $result);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's any need for a different message when Turnstile is the cause of the rate limit failure, but someone may disagree.

protected function getTurnstileCacheKey($policyId, $clientIp)
{
$key = $policyId . '--' . $clientIp;
$key = str_replace('.', '-', $key);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The (temporary) cache used now complained about the dots in the IP address. This is possibly moot once we have a real cache, depending what its rules are for keys.

Comment on lines +227 to +229
$cacheManager = $this->getService(\VuFind\Cache\Manager::class);
$cache = $cacheManager->getCache('object', $storageOptions['namespace']);
$cache->getOptions()->setTtl($storageOptions['turnstileTtl'] ?? 60 * 60 * 24);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So createTurnstileCache is for the cache that records whether you have already taken a Turnstile challenge, and what the result was. Using \VuFind\CacheManager is a problem; I definitely don't want to use a file-based cache for the same performance reason that it's not used for the rest of the RateLimiter. I could use the createCache method above to create (yet) another in-memory cache instance for this purpose, but at minimum I'd have to mess with it a little to pass in the appropriate TTL, and there might be a better approach.

Comment on lines +20 to +21
<input name="policyId" type="hidden" value="<?=$this->escapeHtml($policyId)?>">
<input name="destination" type="hidden" value="<?=$this->escapeHtml($destination)?>">
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm escaping these two variables (and $siteKey above) even though they never should come from user input, but they flowed through URL params on the way here, so it seemed the safe thing to do. If someone actually does mess with the params intentionally

  • a fake $siteKey would in theory let the attacker supply their own key, but in Turnstile configuration their key would not be set to work with this domain, so the challenge should not return a valid token.
  • a fake $policyId could let them substitute a more lenient policy, I guess, that they happen to know is also configured on the same server? That one I guess is a risk
  • a fake $destination ... would just take them to a fake destination after success, but that destination is still subject to the rate limiter, so no point there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to include a salted hash of the key parameters to validate that no tampering has occurred? Or does this flow in a way that makes that impossible?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I would welcome some insights on that, not sure how to make it work. I guess the salt would be a random value secret in the config file, that the institution would have to create?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a Security/HMACkey setting in config.ini that might be suitable for salting purposes. (It also looks like it might be valuable to randomize this as part of the InstallController, but that's a separate project).

@@ -130,3 +143,17 @@ Policies:
policy: token_bucket
limit: 2000
rate: { interval: '10 minutes', amount: 400 }

# Cloudflare Turnstile is a cloud-based "CAPTCHA alternative" to detect bots.
Copy link
Member Author

@maccabeelevine maccabeelevine Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did I not use VuFind's existing Captcha logic? Mostly because I really see this an extension of RateLimiter and want to take advantage of all the existing config logic there (ipRanges, loggedIn, filters, etc.), as well as the way Bootstrapper prevents the page from loading until it's satisfied -- none of that is in Captcha. That said there may be a way to combine the two existing services, if someone thinks I should look in that direction. We discussed this briefly in Slack and I did reference config.ini captchas below.

{
$context = json_decode(base64_decode($this->params()->fromQuery('context')), true);

$yamlReader = $this->getService(\VuFind\Config\YamlReader::class);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the interest of the "thin controller" principle, I wonder if we'd be better off having a Turnstile-specific service that has the configuration injected by a factory... then we could get all this factory and business logic out of the controller, and make the controller itself a lot simpler.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I thought of loading the config in a TurnstileControllerFactory, didn't do that just as it didn't seem the pattern (I see only one controller factory). I also considered doing a Turnstile service at one stage but didn't see the advantage, but I didn't get that we were trying to keep the controllers themselves to a minimum -- I guess that does make sense in an optimal MVC context. I'll iterate on this, thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of technical debt in our controllers, so the patterns you see are definitely not the patterns we'd necessarily like to have in the long run. I just saw an opportunity to do this in a more sustainable way and figured it might be worth doing since we're in a green field anyway. ;-)

@demiankatz
Copy link
Member

Thanks, @maccabeelevine. I haven't given this a really close review yet since it's still a draft, but I responded to some of your comments and left a couple thoughts of my own. Please let me know if/when you'd like a deeper and more formal review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants