reticulate is creating files in the R library path after installation #1680

Open
klmr opened this issue Oct 9, 2024 · 12 comments
@klmr

klmr commented Oct 9, 2024

The ‘reticulate’ package is creating files inside the R library path after installation, when it is being used. Here’s an MWE:

$ install.packages('reticulate')
$ py_path = system.file('python', package = 'reticulate')
$ system2('tree', shQuote(py_path))
/Users/rudolpk2/Library/R/arm64/4.3/library/reticulate/python
└── rpytools
    ├── __init__.py
    ├── call.py
    ├── generator.py
    ├── help.py
    ├── ipython.py
    ├── loader.py
    ├── output.py
    ├── run.py
    ├── subprocess.py
    ├── test.py
    └── thread.py

2 directories, 11 files

$ library(reticulate)
$ use_python(system2('which', 'python', stdout = TRUE))
$ import('os')
Module(os)

$ system2('tree', shQuote(py_path))
/Users/rudolpk2/Library/R/arm64/4.3/library/reticulate/python
└── rpytools
    ├── __init__.py
    ├── __pycache__
    │   ├── __init__.cpython-311.pyc
    │   ├── call.cpython-311.pyc
    │   └── loader.cpython-311.pyc
    ├── call.py
    ├── generator.py
    ├── help.py
    ├── ipython.py
    ├── loader.py
    ├── output.py
    ├── run.py
    ├── subprocess.py
    ├── test.py
    └── thread.py

3 directories, 14 files

It is my understanding of CRAN guidelines that packages are not allowed to write into the R package library path after installation, and in fact this behaviour is causing breaking issues for us on a system with a shared group library folder: if person A installs the package and person B first uses it, the folder owners get mixed up. In combination with default permissions (which make these folders non-writable for other group members!), this means that removing the package or installing updates cannot be done without the cooperation of multiple people.

@t-kalinowski
Member

Thank you for opening this issue.

We can't assume python will be available during the R package installation, but we can make a best effort by checking if it's present. If it is, we could attempt to run python -m compileall reticulate/python to pre-compile the Python artifacts during package installation. We should probably do this for all the minor Python versions detected, as .pyc files are version-specific.

reticulate::virtualenv_starter(all = TRUE) should find most Python installations on the host, and we could do something like this during package installation:

df <- reticulate::virtualenv_starter(all = TRUE)
df <- df[order(df$version, decreasing = TRUE), ]
df$minor <- df$version[, 1:2]
df <- df[!duplicated(df$minor), ]
setwd("python")
for (python in df$path) {
  system2(python, "-m compileall .")
}

Would you be interested in submitting a pull request?
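
A minimal Python sketch of the precompilation idea, assuming a throwaway directory that stands in for reticulate's `python` folder. The `precompile_tree` helper and the demo package layout are hypothetical, not reticulate code. It relies on the standard-library `compileall` module, which is what `python -m compileall` runs, and it only covers the interpreter that executes it; the R loop above would repeat the step once per detected minor version.

```python
import compileall
import importlib.util
import os
import pathlib
import tempfile


def precompile_tree(source_dir: str) -> bool:
    """Byte-compile every .py file under source_dir; True if all compiled."""
    # quiet=1 hides the per-file listing but still prints errors.
    return bool(compileall.compile_dir(source_dir, quiet=1))


# Hypothetical stand-in for <library>/reticulate/python.
demo_root = pathlib.Path(tempfile.mkdtemp())
pkg = demo_root / "rpytools"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "call.py").write_text("def f():\n    return 1\n")

ok = precompile_tree(str(demo_root))

# Ask Python where it expects the cache file for call.py to live.
expected_cache = importlib.util.cache_from_source(str(pkg / "call.py"))
print(ok, os.path.exists(expected_cache))
```

If the cache files already exist and are current when the package is first used, the interpreter has nothing left to write into the library path for that Python version.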

@kevinushey
Collaborator

Could we make use of sys.pycache_prefix here? https://docs.python.org/3/library/sys.html#sys.pycache_prefix

@t-kalinowski
Member

Could we make use of sys.pycache_prefix here?

We could possibly set this to tools::R_user_dir().

However, setting this would be a global option that affects all module imports. Also (I haven't tested this), I suspect it would make Python ignore existing .pyc files generated the normal way. This seems like a large, invasive change that would lead to slower performance overall and to bloated, scattered installations.
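
The suspicion about existing .pyc files can be checked without touching a real installation. The sketch below, assuming Python 3.8 or later, asks `importlib.util.cache_from_source` where the cache for a source file would live with and without `sys.pycache_prefix`. The source path and the `cache_locations` helper are hypothetical; the file does not need to exist.

```python
import importlib.util
import os
import sys
import tempfile


def cache_locations(source_path, prefix):
    """Return (in-tree cache path, redirected cache path) for source_path."""
    saved = sys.pycache_prefix
    try:
        sys.pycache_prefix = None
        in_tree = importlib.util.cache_from_source(source_path)
        sys.pycache_prefix = prefix
        redirected = importlib.util.cache_from_source(source_path)
    finally:
        # Restore whatever the session had configured before.
        sys.pycache_prefix = saved
    return in_tree, redirected


# Hypothetical location of an installed module; nothing is read or written.
source = os.path.join(
    tempfile.gettempdir(), "library", "reticulate", "python", "rpytools", "call.py"
)
prefix = tempfile.mkdtemp()

in_tree, redirected = cache_locations(source, prefix)
print(in_tree)
print(redirected)
```

With a prefix set, the second path sits in a mirror tree under the prefix rather than in a `__pycache__` directory. This matches the Python documentation, which states that `__pycache__` directories in the source tree are ignored once the prefix is set.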

@idavydov

idavydov commented Oct 10, 2024

Just a suggestion, but maybe reticulate could always use the user's cache directory? E.g. via:

pythoncache_prefix <- file.path(rappdirs::user_cache_dir("reticulate"), "pycache")
dir.create(pythoncache_prefix, showWarnings = FALSE, recursive = TRUE)
Sys.setenv(PYTHONPYCACHEPREFIX = pythoncache_prefix)

From my understanding, the slowdown would be noticeable only the first time the user imports a library.

P.S. I think PYTHONPYCACHEPREFIX was added in python 3.8, but looking forward it's probably good enough.

P.P.S. Hm, rappdirs::user_cache_dir() is not respecting XDG_CACHE_HOME.

@klmr
Author

klmr commented Oct 10, 2024

@t-kalinowski Actually the more I think about it the more I’m convinced that it’s a bad idea to have any cache data written in the installation path: it categorically does not belong there. Unfortunately Python chose the wrong default behaviour. And, as you say, changing this behaviour inside ‘reticulate’ would affect all Python code loaded in that session.

Maybe a hybrid approach would be to check whether the user has already specified PYTHONPYCACHEPREFIX and only set it if not present?

For our own purposes we’ve now worked around this issue by creating an empty python/rpytools/__pycache__ file inside the ‘reticulate’ installation to prevent Python from writing any cache data here. This doesn’t seem to cause any issues. Maybe ‘reticulate’ could do that? It feels like the cleanest and easiest approach (unless it has some negative side-effects that I didn’t think about, besides a potential, infinitesimal performance decrease due to the lack of cached byte code).
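
The placeholder workaround can be reproduced in isolation. This is a sketch on a throwaway package, not on a real reticulate installation, and the package name `demo_pkg_blocked` is hypothetical. It puts an empty file named `__pycache__` where Python would create the cache directory and then imports the package.

```python
import importlib
import pathlib
import sys
import tempfile

# Build a throwaway package in a temporary directory.
demo_root = pathlib.Path(tempfile.mkdtemp())
pkg = demo_root / "demo_pkg_blocked"
pkg.mkdir()
(pkg / "__init__.py").write_text("VALUE = 42\n")

# An empty regular file occupies the name Python would use for its cache directory.
blocker = pkg / "__pycache__"
blocker.write_text("")

sys.path.insert(0, str(demo_root))
try:
    importlib.invalidate_caches()
    module = importlib.import_module("demo_pkg_blocked")
finally:
    sys.path.remove(str(demo_root))

print(module.VALUE, blocker.is_file())
```

The import still succeeds because the import system treats a failed cache read or write as non-fatal and compiles from source instead. The cost is that the module is recompiled in every new session.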

@t-kalinowski
Member

This scenario—where a package library is shared by users, some with write permissions and others without—seems relatively niche. Typically, shared package libraries are managed by a root account and provided to users as read-only. In these cases:

  • Python won’t write .pyc files, as it lacks write permissions in the installation directory.
  • User-installed packages would be placed in a separate library, and other users with read permissions can opt into using those packages via .libPaths().

Alternatively, if shared libraries are writable by all users, mixed ownership isn’t an issue.

I wouldn’t want to introduce significant complexity or a runtime performance hit for such a niche case. That said, I’m open to precompiling .pyc files during package installation. In environments with mixed permissions, we can assume the system was configured by an expert, and most likely, Python binaries are available. That should resolve the problem.

@idavydov

idavydov commented Oct 10, 2024

Is there anything that speaks against this, @t-kalinowski?

  • this is not violating CRAN policies
  • it's only a couple of lines and doesn't require compile-time python
  • it does respect PYTHONPYCACHEPREFIX if set
  • it has negligible performance impact (only upon first use)
if (is.na(Sys.getenv("PYTHONPYCACHEPREFIX", NA))) {
  pythoncache_prefix <- file.path(tools::R_user_dir("reticulate", which="cache"), "pycache")
  if (!dir.exists(pythoncache_prefix)) {
    dir.create(pythoncache_prefix, recursive=TRUE)
  }
  Sys.setenv(PYTHONPYCACHEPREFIX=pythoncache_prefix)
}
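
The "set it only if the user has not" logic can also be observed from the Python side. The sketch below, assuming Python 3.8 or later, starts a child interpreter with a prepared environment and reports the `sys.pycache_prefix` it ends up with. The helper names `env_with_default_prefix` and `reported_prefix` are hypothetical, and the temporary directories stand in for the real cache locations.

```python
import os
import subprocess
import sys
import tempfile


def env_with_default_prefix(base_env, default_prefix):
    """Copy base_env, adding PYTHONPYCACHEPREFIX only when it is not already set."""
    env = dict(base_env)
    env.setdefault("PYTHONPYCACHEPREFIX", default_prefix)
    return env


def reported_prefix(env):
    """Start a child interpreter with env and return its sys.pycache_prefix."""
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.pycache_prefix)"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Start from the current environment without any existing prefix setting.
base_env = {k: v for k, v in os.environ.items() if k != "PYTHONPYCACHEPREFIX"}
fallback = tempfile.mkdtemp()      # stands in for the package's cache directory
user_choice = tempfile.mkdtemp()   # stands in for a value the user set themselves

unset_case = reported_prefix(env_with_default_prefix(base_env, fallback))
preset_case = reported_prefix(
    env_with_default_prefix(dict(base_env, PYTHONPYCACHEPREFIX=user_choice), fallback)
)
print(unset_case == fallback, preset_case == user_choice)
```

Note that the interpreter reads the variable at startup, so in an embedding scenario it would need to be in the environment before Python is initialized.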

@t-kalinowski
Member

Do we know what happens on Windows without long path support enabled?
I took a quick peek at the source but didn't see any special handling for that there. Does it just fail silently?

@t-kalinowski
Member

Also, we would probably want a strategy for periodically clearing the cache on long-lived systems.

@klmr
Author

klmr commented Oct 10, 2024

Clearing the cache is usually the user’s responsibility. I see no reason why ‘reticulate’ should deviate from this practice, or indeed do anything special in this regard. In fact, I’d argue against doing anything special because of POLA (the principle of least astonishment).

Regarding Windows path handling, I also don’t think that this requires special consideration — if this would turn out to be a problem (unlikely, but who knows), it would be a general Python problem, not one specific to ‘reticulate’.

@t-kalinowski
Member

CRAN Policy states (emphasis mine):

For R version 4.0 or later (hence a version dependency is required or only conditional use is possible), packages may store user-specific data, configuration and cache files in their respective user directories obtained from tools::R_user_dir(), provided that by default sizes are kept as small as possible and the contents are actively managed (including removing outdated material).

@klmr
Author

klmr commented Oct 14, 2024

I don’t think a single CRAN package actually does that (okay, that’s too easily disproved by finding a single counter-example, but I’d wager that the majority of packages don’t).

But fair enough — how about exporting a function purge_python_cache() which runs unlink(file.path(R_user_dir("reticulate", "cache"), "pycache"), recursive = TRUE, force = TRUE)?
