Allow patched pandas.read_* in restricted Python
(please see wendelin!99 (closed))
This merge request aims to add restricted (secure) versions of selected pandas.read_*
functions, so that they can be used in the zope sandbox.
It does so by monkey-patching the respective functions as suggested by @tatuya.
The monkey-patched versions only allow str
inputs, instances of any other data type (e.g. pathlib.Path
, bytes) are prohibited.
If the parsed argument is a str
, it will convert it to a StringIO
instance and further parse it to the original function.
In this way the restriction prevents that users can parse urls, file paths, etc. which would be downloaded by pandas.
The implementation is heuristic, since it makes the assumption that pandas won't search for urls etc. in StringIO
instances (which is true in 0.19.x, as the tests showed).
Besides general code issues, my main question is: can the given solution be considered to be secure or should I move to other solutions (see also solution in Wendelin SR and Note below)?
Then: I didn't add read_html
yet, because it has an additional unresolved dependency (html5lib). Should we add this dependency and also add read_html
?
Regarding the tests: For "the proof of security" the tests need to use working links from which pandas can actually initialize DataFrame
instances if the function wouldn't be restricted in correct way. For now I simply added random real-word links, which should be replaced.
Note:
Initially I wanted to patch read_json
more explicitly, e.g. more or less directly parse the given argument to pandas internal FrameParser
or SeriesParser
and simply bypass all of pandas url-etc. parsing parts. But this solution has other disadvantages:
-
Since we would have to rely on code which doesn't belong to the public api of pandas it would potentially break when switching pandas version (in fact in pandas up-to-date version the mentioned classes have been moved to other modules).
-
We may have to duplicate code in order to provide users with mostly the same api (rich set of different kwargs and args) and we would have to adapt it when moving to newer pandas versions in order to avoid unexpected behaviour.
-
This patch wouldn't be generic for the selected
read_*
functions, because the code structure between pandas differentread
functions differ widely.