Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML search: check for query terms presence in index term properties #13097

Open
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

jayaddison
Copy link
Contributor

@jayaddison jayaddison commented Nov 2, 2024

Feature or Bugfix

  • Bugfix

Purpose

  • Prevent queries for the JavaScript class prototype reference propery name from causing errors in the HTML search JavaScript code provided in the basic theme.

Detail

  • Ensure that the user input query term is found as a named property of the relevant index objects before accessing them, using Object.hasOwnProperty.

Relates

@jayaddison jayaddison added html search javascript Pull requests that update Javascript code labels Nov 2, 2024
Copy link
Contributor

@wlach wlach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see how this would be a problem but I'm a little unclear on how the solution actually works? See comments

@@ -514,8 +514,8 @@ const Search = {
searchTerms.forEach((word) => {
const files = [];
const arr = [
{ files: terms[word], score: Scorer.term },
{ files: titleTerms[word], score: Scorer.title },
{ files: terms.hasOwnProperty(word) ? terms[word] : [], score: Scorer.term },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually a little confused by how this might work, as it doesn't appear that JavaScript supports setting the __proto__ property in an object. Am I missing something? Should we be using a map to store these terms instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When parsing JSON, an object key named "__proto__" is accepted, and converted to the JavaScript object key ["__proto__"] -- distinct from the __proto__ property (a reference to a class prototype).

During read-access, introducing hasOwnProperty allows us to filter out the latter, so that we don't mistakenly retrieve the prototype class reference from terms.

I'm not certain whether the below example will help to clarify, but I find experimentation helpful to understand some of the behaviours:

> x = JSON.parse('{}')
{}
> y = JSON.parse('{"__proto__": "testing"}')
{ ['__proto__']: 'testing' }
> '__proto__' in x
true
> x.hasOwnProperty('__proto__')
false
> '__proto__' in y
true
> y.hasOwnProperty('__proto__')
true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to respond to your Map question: that is certainly possible, although I feel that doing so might add some runtime overhead (since we'd need to load the native JS object contents from the index into respective Map objects).

As I understand it, Map objects are also more difficult to make immutable in JavaScript than Object and Array instances. Freezing a Map object doesn't prevent its private contents from being updated:

> x = {'foo': 'bar'}
{ foo: 'bar' }
> Object.freeze(x)
{ foo: 'bar' }
> x['foo'] = 'baz'
'baz'
> x['foo']
'bar'
> y = new Map([['foo', 'bar']])
Map(1) { 'foo' => 'bar' }
> Object.freeze(y)
Map(1) { 'foo' => 'bar' }
> y.set('foo', 'baz')
Map(1) { 'foo' => 'baz' }
> y.get('foo')
'baz'

So another factor is that, to the extent that I'd like to see the search index contents become immutable at load-time in future (#13098), I'd prefer not to migrate to the Map type unless there is a compelling reason to (maybe avoiding prototype pollution is that reason -- but I don't think we're exposed to that currently).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok got it, this makes sense to me re: the property. TIL!

I'm a little bit less sure about whether the index should be immutable, but happy to defer that discussion to another day.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, one extra piece of optional feedback: since this is kind of counter-intuitive, you might want to add a comment on this line explaining what you're doing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, OK - I'll try to add some phrasing to explain this succinctly. Thanks for the code review!

tests/js/roots/ecmascript/index.rst Outdated Show resolved Hide resolved
The JS prototype property should _not_ be found in unrelated documents
@@ -514,8 +514,8 @@ const Search = {
searchTerms.forEach((word) => {
const files = [];
const arr = [
{ files: terms[word], score: Scorer.term },
{ files: titleTerms[word], score: Scorer.title },
{ files: terms.hasOwnProperty(word) ? terms[word] : [], score: Scorer.term },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok got it, this makes sense to me re: the property. TIL!

I'm a little bit less sure about whether the index should be immutable, but happy to defer that discussion to another day.

{ files: terms[word], score: Scorer.term },
{ files: titleTerms[word], score: Scorer.title },
{ files: terms.hasOwnProperty(word) ? terms[word] : [], score: Scorer.term },
{ files: titleTerms.hasOwnProperty(word) ? titleTerms[word] : [], score: Scorer.title },
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preparing that comment has made me realize: we perform an === equality check against the undefined value further down the function. We should probably not alter that behaviour in this PR.

@AA-Turner
Copy link
Member

Would this work as an alternative?

    // in Search.setIndex()
    const frozen_index = JSON.parse(index, (key, value) => {
      typeof value === "object" && value !== null
        ? Object.freeze(value)
        : value;
    });

If we change Search.setIndex() to taking a JSON string, it seems we could use the 'reviver' function to freeze arrays and objects?

A

@jayaddison
Copy link
Contributor Author

@AA-Turner although I'd recommend some performance testing, that could be useul for #13098, yep.

@jayaddison
Copy link
Contributor Author

In fact, no, my mistake: JSON.parse may in fact be necessary to solve this robustly. The test appears only to be passing due to the fact that __proto__ is found in the objects part of the searchindex.js file. The terms entry for __proto__ is not being handled correctly.

@jayaddison
Copy link
Contributor Author

I've pushed the change to use JSON.parse and reviver -- a note of caution, though: when I was testing that against a JSON-ized version of the Python documentation searchindex.js file yesterday, I was encountering ~200ms duration within the setIndex function in my browser.

@wlach
Copy link
Contributor

wlach commented Nov 14, 2024

Personally, I'd be inclined to merge the previous version of this and address that (if at all) in a followup. Trying to "freeze" the index here feels like scope creep to me and it would be nice to get this fix in.

@jayaddison
Copy link
Contributor Author

That's reasonable, yep, and I've no strong opinions on whether we include or omit the Object.freeze aspect. However, without 462c859, I don't believe this branch would genuinely fix the reported bug.

The `JSON.dumps` call escapes single-quotes within strings; all JSON strings are enclosed by double-quotes
@jayaddison
Copy link
Contributor Author

Without the reviver function, I recall the setIndex + JSON.parse performance overhead being in the order of ~60ms on my local machine. I'll confirm those stats in some more detail within the next day or so. I get the feeling that some of this will be coupled to the browser+implementation (Firefox) details, so I may try to compare to Chrom[e|ium] also.

@wlach
Copy link
Contributor

wlach commented Nov 14, 2024

In fact, no, my mistake: JSON.parse may in fact be necessary to solve this robustly. The test appears only to be passing due to the fact that __proto__ is found in the objects part of the searchindex.js file. The terms entry for __proto__ is not being handled correctly.

Sorry, I missed this part. Do you know why this is the case? The solution looked correct to me?

@jayaddison
Copy link
Contributor Author

In fact, no, my mistake: JSON.parse may in fact be necessary to solve this robustly. The test appears only to be passing due to the fact that __proto__ is found in the objects part of the searchindex.js file. The terms entry for __proto__ is not being handled correctly.

Sorry, I missed this part. Do you know why this is the case? The solution looked correct to me?

The test case in 0fff170 was inaccurate -- it demonstrated that an object Object.__proto__ could be found based on a user query for __proto__.

However, there was still a bug: a document text term __proto__ -- represented by the "terms":{"__proto__":0 fragment of the searchindex.js -- was not searchable as intended.

Comment on lines +222 to +223
const reviver = (k, v) => (typeof v === "object" && v !== null) ? Object.freeze(v) : v;
Search._index = JSON.parse(index, reviver);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @wlach is right that immutability is orthoganal here. I think 'simply' using JSON.parse rather than an object literal would solve the __proto__ case. Would we then need to keep the hasOwnProperty checks?

An alternative would be to use a reviver function that constructs Map objects/serialise the index with Map literals, and use .get(), but that may also be expensive.

Suggested change
const reviver = (k, v) => (typeof v === "object" && v !== null) ? Object.freeze(v) : v;
Search._index = JSON.parse(index, reviver);
Search._index = JSON.parse(index);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @wlach is right that immutability is orthoganal here. I think 'simply' using JSON.parse rather than an object literal would solve the __proto__ case.

Ok, yep - I can revert 443fd21 to remove the use of Object.freeze / the reviver function.

Would we then need to keep the hasOwnProperty checks?

We should keep those, I think. Without them, queries for __proto__ may continue to inadvertently retrieve object prototypes instead of getting an undefined result.

An alternative would be to use a reviver function that constructs Map objects/serialise the index with Map literals, and use .get(), but that may also be expensive.

What do you think about constructing those Map objects directly in the searchindex.js file format, instead of using JSON.parse and a reviver?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about constructing those Map objects directly in the searchindex.js file format, instead of using JSON.parse and a reviver?

IMO this is definitely worth a shot, I suspect it'll be quite fast.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DO NOT MERGE html search javascript Pull requests that update Javascript code type:performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HTML search: make the search index immutable at load-time HTML search: bug with JS-prototype-property query
3 participants