Skip to content

Commit

Permalink
Search: unify Sphinx and generic parser (#10520)
Browse files Browse the repository at this point in the history
The current differences from the generic parser and the Sphinx parser are:

- The Sphinx parser doesn't need to find the main content node, since that's given by the `.fjson` file.
- The Sphinx parser removes nodes that have the following classes:
   - .toctree-wrapper (toc from the `toctree` directive)
   - .contents.local.topic (local toc from the `contents` directive)

How can we unify both?

- I think we handle finding the main node for Sphinx projects correctly with the generic parser.
- Removing elements that have the `.contents.local.topic` class is redundant, since that's inside a `nav` tag (we remove those).
- Elements generated by the `toctree` directive are plain divs with a  `.toctree-wrapper` class, so we can't have a generic way of removing them. This is the only impediment for merging both parsers, we have two options

a) Just don't remove those elements, and index them, maybe suggest upstream to enclose them inside a nav tag.
    The downside is that search results will have some duplicates (matching the toctree and the actual document), for example, here you can see two pages matching, the actual page and the index pages that has the ToC.
![Screenshot 2023-07-06 at 17-25-07 Search project dev Contributing to Read the Docs Read the Docs](https://github.com/readthedocs/readthedocs.org/assets/4975310/04379572-2247-426a-8f99-c04d9254cf27)

b) Just add that special case to our generic parser (we kind of already do this for `linenos`, `lineno`, and `headerlink`). Maybe suggest upstream to use a nav tag as well, so one day we may be able to stop especial casing that :D

I think I'm fine with option b.
  • Loading branch information
stsewd authored Jul 26, 2023
1 parent 31bd5ad commit 552bf02
Show file tree
Hide file tree
Showing 7 changed files with 654 additions and 30 deletions.
1 change: 1 addition & 0 deletions docs/dev/search-integration.rst
Original file line number Diff line number Diff line change
Expand Up @@ -123,6 +123,7 @@ Special rules that are derived from specific documentation tools applied in the
- ``.linenos``, ``.lineno`` (line numbers in code-blocks, comes from both MkDocs and Sphinx)
- ``.headerlink`` (added by Sphinx to links in headers)
- ``.toctree-wrapper`` (added by Sphinx to the table of contents generated from the ``toctree`` directive)

Example:

Expand Down
31 changes: 4 additions & 27 deletions readthedocs/search/parsers.py
Original file line number Diff line number Diff line change
Expand Up @@ -269,12 +269,15 @@ def _clean_body(self, body):
body.css('nav'),
body.css('[role=navigation]'),
body.css('[role=search]'),
# Permalinks
# Permalinks, this is a Sphinx convention.
body.css('.headerlink'),
# Line numbers from code blocks, they are very noisy in contents.
# This convention is popular in Sphinx.
body.css(".linenos"),
body.css(".lineno"),
# Sphinx doesn't wrap the result from the `toctree` directive
# in a nav tag. so we need to manually remove that content.
body.css(".toctree-wrapper"),
)
for node in nodes_to_be_removed:
node.decompose()
Expand Down Expand Up @@ -516,29 +519,3 @@ def _process_fjson(self, fjson_path):
"title": title,
"sections": sections,
}

def _clean_body(self, body):
"""
Removes nodes in Sphinx-generated HTML structures.
This method is overridden to remove contents that are likely
to be useless for search indexing.
Currently: TOC elements.
"""
body = super()._clean_body(body)

# TODO: see if we really need to remove TOC elements like below?
# benjaoming: I didn't see this in sphinx-rtd-theme, however since it wraps the menu in
# a <nav>, it's already covered. I didn't see this match local table of contents, neither
# and they are also wrapped in a <nav> so covered by _clean_body as well.
nodes_to_be_removed = itertools.chain(
body.css(".toctree-wrapper"),
body.css(".contents.local.topic"),
)

# removing all nodes in list
for node in nodes_to_be_removed:
node.decompose()

return body
446 changes: 446 additions & 0 deletions readthedocs/search/tests/data/sphinx/in/local-toc.html

Large diffs are not rendered by default.

89 changes: 89 additions & 0 deletions readthedocs/search/tests/data/sphinx/in/toctree.html
Original file line number Diff line number Diff line change
@@ -0,0 +1,89 @@
<section id="public-api">
<h1>Public API<a class="headerlink" href="#public-api" title="Permalink to this heading"></a></h1>
<p>This section of the documentation details the public API
usable to get details of projects, builds, versions and other details
from Read the Docs.</p>
<div class="toctree-wrapper compound">
<ul>
<li class="toctree-l1"><a class="reference internal" href="v3.html">API v3</a>
<ul>
<li class="toctree-l2"><a class="reference internal"
href="v3.html#authentication-and-authorization">Authentication and authorization</a>
<ul>
<li class="toctree-l3"><a class="reference internal" href="v3.html#token">Token</a></li>
<li class="toctree-l3"><a class="reference internal" href="v3.html#session">Session</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="v3.html#resources">Resources</a>
<ul>
<li class="toctree-l3"><a class="reference internal" href="v3.html#projects">Projects</a>
</li>
<li class="toctree-l3"><a class="reference internal" href="v3.html#versions">Versions</a>
</li>
<li class="toctree-l3"><a class="reference internal" href="v3.html#builds">Builds</a></li>
<li class="toctree-l3"><a class="reference internal" href="v3.html#subprojects">Subprojects</a></li>
<li class="toctree-l3"><a class="reference internal" href="v3.html#translations">Translations</a></li>
<li class="toctree-l3"><a class="reference internal" href="v3.html#redirects">Redirects</a>
</li>
<li class="toctree-l3"><a class="reference internal" href="v3.html#environment-variables">Environment
variables</a></li>
<li class="toctree-l3"><a class="reference internal" href="v3.html#organizations">Organizations</a>
</li>
<li class="toctree-l3"><a class="reference internal" href="v3.html#remote-organizations">Remote
organizations</a></li>
<li class="toctree-l3"><a class="reference internal" href="v3.html#remote-repositories">Remote
repositories</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="v3.html#embed">Embed</a></li>
<li class="toctree-l2"><a class="reference internal" href="v3.html#additional-apis">Additional
APIs</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="v2.html">API v2</a>
<ul>
<li class="toctree-l2"><a class="reference internal"
href="v2.html#authentication-and-authorization">Authentication and authorization</a></li>
<li class="toctree-l2"><a class="reference internal" href="v2.html#resources">Resources</a>
<ul>
<li class="toctree-l3"><a class="reference internal" href="v2.html#projects">Projects</a>
</li>
<li class="toctree-l3"><a class="reference internal" href="v2.html#versions">Versions</a>
</li>
<li class="toctree-l3"><a class="reference internal" href="v2.html#builds">Builds</a></li>
<li class="toctree-l3"><a class="reference internal" href="v2.html#embed">Embed</a></li>
<li class="toctree-l3"><a class="reference internal"
href="v2.html#undocumented-resources-and-endpoints">Undocumented resources and
endpoints</a></li>
</ul>
</li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../server-side-search/api.html">Server
side search API</a>
<ul>
<li class="toctree-l2"><a class="reference internal" href="../server-side-search/api.html#api-v3">API
v3</a>
<ul>
<li class="toctree-l3"><a class="reference internal"
href="../server-side-search/api.html#migrating-from-api-v2">Migrating from API v2</a>
</li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal"
href="../server-side-search/api.html#authentication-and-authorization">Authentication and
authorization</a></li>
<li class="toctree-l2"><a class="reference internal"
href="../server-side-search/api.html#api-v2-deprecated">API v2 (deprecated)</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="cross-site-requests.html">Cross-site
requests</a>
<ul>
<li class="toctree-l2"><a class="reference internal" href="cross-site-requests.html#cookies">Cookies</a>
</li>
</ul>
</li>
</ul>
</div>
</section>
46 changes: 46 additions & 0 deletions readthedocs/search/tests/data/sphinx/out/local-toc.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
{
"path": "local-toc.html",
"title": "Security reports",
"sections": [
{
"id": "security-reports",
"title": "Security reports",
"content": "Security is very important to us at Read the Docs. We follow generally accepted industry standards to protect the personal information submitted to us, both during transmission and once we receive it. In the spirit of transparency, we are committed to responsible reporting and disclosure of security issues. See also Security policy Read our policy for security, which we base our security handling and reporting on."
},
{
"id": "supported-versions",
"title": "Supported versions",
"content": "Only the latest version of Read the Docs will receive security updates. We don’t support security updates for custom installations of Read the Docs."
},
{
"id": "reporting-a-security-issue",
"title": "Reporting a security issue",
"content": "If you believe you’ve discovered a security issue at Read the Docs, please contact us at [email protected] (optionally using our PGP key). We request that you please not publicly disclose the issue until it has been addressed by us. You can expect: We will respond acknowledging your email typically within one business day. We will follow up if and when we have confirmed the issue with a timetable for the fix. We will notify you when the issue is fixed. We will create a GitHub advisory and publish it when the issue has been fixed and deployed in our platforms."
},
{
"id": "pgp-key",
"title": "PGP key",
"content": "You may use this PGP key to securely communicate with us and to verify signed messages you receive from us."
},
{
"id": "bug-bounties",
"title": "Bug bounties",
"content": "While we sincerely appreciate and encourage reports of suspected security problems, please note that the Read the Docs is an open source project, and does not run any bug bounty programs."
},
{
"id": "security-issue-archive",
"title": "Security issue archive",
"content": "You can see all past reports at https://github.com/readthedocs/readthedocs.org/security/advisories."
},
{
"id": "version-3-2-0",
"title": "Version 3.2.0",
"content": "Version 3.2.0 resolved an issue where a specially crafted request could result in a DNS query to an arbitrary domain. This issue was found by Cyber Smart Defence who reported it as part of a security audit to a firm running a local installation of Read the Docs."
},
{
"id": "release-2-3-0",
"title": "Release 2.3.0",
"content": "Version 2.3.0 resolves a security issue with translations on our community hosting site that allowed users to modify the hosted path of a target project by adding it as a translation project of their own project. A check was added to ensure project ownership before adding the project as a translation. In order to add a project as a translation now, users must now first be granted ownership in the translation project."
}
]
}
11 changes: 11 additions & 0 deletions readthedocs/search/tests/data/sphinx/out/toctree.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
{
"path": "",
"title": "",
"sections": [
{
"id": "public-api",
"title": "Public API",
"content": "This section of the documentation details the public API usable to get details of projects, builds, versions and other details from Read the Docs."
}
]
}
60 changes: 57 additions & 3 deletions readthedocs/search/tests/test_parsers.py
Original file line number Diff line number Diff line change
Expand Up @@ -266,16 +266,70 @@ def test_sphinx_autodoc(self, storage_open, storage_exists):
expected_json = json.load(open(data_path / "sphinx/out/autodoc.json"))
assert parsed_json == expected_json

@mock.patch.object(BuildMediaFileSystemStorage, "exists")
@mock.patch.object(BuildMediaFileSystemStorage, "open")
def test_sphinx_local_toc(self, storage_open, storage_exists):
"""
Test that the local table of contents from the ``contents``
directive is not included in the indexed content.
"""
# Source:
# https://docs.readthedocs.io/en/stable/security.html
html_content = data_path / "sphinx/in/local-toc.html"
storage_open.side_effect = self._mock_open(html_content.open().read())
storage_exists.return_value = True

self.project.feature_set.add(self.feature)
self.version.documentation_type = SPHINX
self.version.save()

page_file = get(
HTMLFile,
project=self.project,
version=self.version,
path="local-toc.html",
)

parsed_json = page_file.processed_json
expected_json = json.load(open(data_path / "sphinx/out/local-toc.json"))
assert parsed_json == expected_json

@mock.patch.object(BuildMediaFileSystemStorage, "exists")
@mock.patch.object(BuildMediaFileSystemStorage, "open")
def test_sphinx_toctree(self, storage_open, storage_exists):
"""
Test that the table of contents from the ``toctree``
directive is not included in the indexed content.
"""
# Source:
# https://docs.readthedocs.io/en/stable/api/index.html
html_content = data_path / "sphinx/in/toctree.html"
json_content = {"body": html_content.open().read()}
storage_open.side_effect = self._mock_open(json.dumps(json_content))
storage_exists.return_value = True

self.version.documentation_type = SPHINX
self.version.save()

page_file = get(
HTMLFile,
project=self.project,
version=self.version,
path="toctree.html",
)

parsed_json = page_file.processed_json
expected_json = json.load(open(data_path / "sphinx/out/toctree.json"))
assert parsed_json == expected_json

@mock.patch.object(BuildMediaFileSystemStorage, "exists")
@mock.patch.object(BuildMediaFileSystemStorage, "open")
def test_sphinx_requests(self, storage_open, storage_exists):
# Source:
# https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#directive-automodule
json_file = data_path / "sphinx/in/requests.json"
html_content = data_path / "sphinx/in/requests.html"

json_content = json.load(json_file.open())
json_content["body"] = html_content.open().read()
json_content = {"body": html_content.open().read()}
storage_open.side_effect = self._mock_open(json.dumps(json_content))
storage_exists.return_value = True

Expand Down

0 comments on commit 552bf02

Please sign in to comment.