- 28 Sep, 2018 5 commits
-
-
Kirill Smelkov authored
In the next patch, decoding a pickle will depend on whether the current protocol is <= 2 or >= 3. For this, let's teach the PROTO opcode handler to remember the last seen protocol version. For testing, prepare the test infrastructure for cases where the protocol version affects decoding semantics: if a test pickle in the tests table comes with a \x80\xff prefix, then in decode tests this prefix is replaced with a concrete \x80 ver, iterating properly over all versions specified in TestPickle.protov. We will also need this functionality in the next patch.
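The prefix substitution described above can be sketched as follows. This is only an illustration: `expandProto` and `protoPlaceholder` are hypothetical names, not the helpers used in the actual test suite.

```go
package main

import (
	"fmt"
	"strings"
)

// protoPlaceholder is the PROTO opcode (\x80) followed by the \xff
// stand-in version that marks "substitute a concrete version here".
const protoPlaceholder = "\x80\xff"

// expandProto returns one concrete pickle per version listed in protov.
// Pickles that do not start with the placeholder are returned unchanged.
// (Hypothetical helper, assumed for illustration only.)
func expandProto(data string, protov []int) []string {
	if !strings.HasPrefix(data, protoPlaceholder) {
		return []string{data}
	}
	body := data[len(protoPlaceholder):]
	out := make([]string, 0, len(protov))
	for _, ver := range protov {
		out = append(out, string([]byte{0x80, byte(ver)})+body)
	}
	return out
}

func main() {
	for _, p := range expandProto("\x80\xffN.", []int{1, 2, 3}) {
		fmt.Printf("%q\n", p)
	}
}
```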
-
Kirill Smelkov authored
In Python, bytes is an immutable, read-only array of bytes. It is also hashable, and so differs from Go's []byte in that it can be used as a dict key. Thus the closest approximation for Python bytes in Go is some type derived from Go's string - it will be different from string and at the same time will inherit from string its immutability and its ability to be used as a map key. So - add ogórek.Bytes type to represent Python bytes - add support to decode BINBYTES* pickle opcodes (these are protocol 3 opcodes) - add support to encode ogórek.Bytes via those BINBYTES* opcodes - for protocols <= 2, where there are no opcodes to directly represent bytes, adopt the same approach as Python - by pickling bytes as _codecs.encode(byt.decode('latin1'), 'latin1') this way unpickling it on Python3 will give bytes, while unpickling it on Python2 will give str: In [1]: sys.version Out[1]: '3.6.6 (default, Jun 27 2018, 14:44:17) \n[GCC 8.1.0]' In [2]: byt = b'\x01\x02\x03' In [3]: _codecs.encode(byt.decode('latin1'), 'latin1') Out[3]: b'\x01\x02\x03' --- In [1]: sys.version Out[1]: '2.7.15+ (default, Aug 31 2018, 11:56:52) \n[GCC 8.2.0]' In [2]: byt = b'\x01\x02\x03' In [3]: _codecs.encode(byt.decode('latin1'), 'latin1') Out[3]: '\x01\x02\x03' - correspondingly teach decoder to recognize particular calls to _codecs.encode as being representation for bytes and decode it appropriately. - since we now have to emit byt.decode('latin1') as UNICODE - add, so far internal, `type unicode(string)` that instructs ogórek encoder to always emit the string with UNICODE opcodes (regular string is encoded to unicode pickle object only for protocol >= 3). - For []byte encoding preserve the current status - even though dispatching in Encoder.encode changes, the end result is the same - []byte was and stays currently encoded as just regular string. 
This was added in 555efd8f "first draft of dumb pickle encoder", and even though that might be not a good choice, changing it is a topic for another patch.
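To illustrate why a string-derived type is a good fit, here is a minimal sketch. The local `Bytes` type is a stand-in declared for this example; it only mirrors the idea behind ogórek.Bytes and is not the library's definition.

```go
package main

import "fmt"

// Bytes is a local stand-in for a string-derived bytes type.
// Being derived from string, it is immutable and usable as a map key,
// yet it is a distinct type from string.
type Bytes string

// keyCount stores the same content once as Bytes and once as string
// and reports how many distinct keys the map ends up with.
func keyCount() int {
	m := map[interface{}]string{}
	m[Bytes("\x01\x02\x03")] = "bytes key"
	m["\x01\x02\x03"] = "string key"
	return len(m)
}

func main() {
	// The dynamic types differ, so both entries are kept.
	fmt.Println(keyCount()) // prints 2
}
```

A []byte key in the same map would panic at run time, which is the reason []byte cannot play this role.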
-
Kirill Smelkov authored
Previously, we were quoting the UNICODE opcode argument with strconv.QuoteToASCII(). However, that function, in addition to \u and \U escapes, can produce e.g. \n, \r, \xAA etc. escapes. None of the latter are treated as special escapes of a unicode literal by Python, so the data was received incorrectly. Fix it by doing exactly what the Python pickle encoder does - the UNICODE argument comes "raw-unicode-escape" encoded. This patch contains only codec tests - not end-to-end pickle tests, because currently Encoder.encodeUnicode() is called only from Encoder.encodeString(), and there only under the condition e.config.Protocol >= 3. We will indirectly add tests for encodeUnicode @ protocol=0 in the next patches, while adding support for Python bytes.
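A rough sketch of such an encoder is shown below. It assumes the rules Python's pickle applied for the UNICODE argument at the time: backslash and newline are written as \u005c and \u000a, code points below 0x100 are written as a single latin1 byte, and everything else as \uXXXX or \UXXXXXXXX. The function name is hypothetical, and invalid UTF-8 in the input is not handled specially (Go's range yields U+FFFD for it).

```go
package main

import "fmt"

// rawUnicodeEscape encodes s for use as a text UNICODE opcode argument.
// Sketch only: mirrors Python's 'raw-unicode-escape' codec plus the
// extra escaping of '\\' and '\n' that pickle performs beforehand.
func rawUnicodeEscape(s string) []byte {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		switch {
		case r == '\\' || r == '\n':
			// must not appear literally: '\n' terminates the argument
			out = append(out, fmt.Sprintf(`\u%04x`, r)...)
		case r < 0x100:
			out = append(out, byte(r)) // latin1 byte
		case r <= 0xffff:
			out = append(out, fmt.Sprintf(`\u%04x`, r)...)
		default:
			out = append(out, fmt.Sprintf(`\U%08x`, r)...)
		}
	}
	return out
}

func main() {
	fmt.Printf("%s\n", rawUnicodeEscape("a\nмир"))
}
```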
-
Kirill Smelkov authored
I initially added pyquote only as a debugging tool (in b429839d "tests: Show pickles in a way that can be copy-pasted into Python"), and later it started to be used in the Encoder (see 18004fbd "Move pyquote into main codebase"). However there are no explicit tests for pyquote. Add some pyquote tests.
-
Kirill Smelkov authored
Both object and objectDecodedBack are interface{}. So if they both hold the same non-comparable type (e.g. []byte), comparing them directly with == will panic.
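The failure mode can be reproduced with the small sketch below. Using reflect.DeepEqual as the safe alternative is an assumption made here for illustration; the commit text does not say which comparison the fix uses, and both helper names are hypothetical.

```go
package main

import (
	"fmt"
	"reflect"
)

// tryDirectEqual compares a and b with == and converts the run-time
// panic raised for non-comparable dynamic types into an error.
func tryDirectEqual(a, b interface{}) (eq bool, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("compare: %v", r)
		}
	}()
	return a == b, nil
}

// deepEqual compares a and b structurally and never panics on []byte.
func deepEqual(a, b interface{}) bool {
	return reflect.DeepEqual(a, b)
}

func main() {
	a, b := []byte("abc"), []byte("abc")
	_, err := tryDirectEqual(a, b)
	fmt.Println("direct comparison failed:", err != nil)
	fmt.Println("deep comparison:", deepEqual(a, b))
}
```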
-
- 26 Sep, 2018 7 commits
-
-
Kirill Smelkov authored
I was (re-)running fuzz tests again, and in addition to what was already known (e.g. self-referencing structures) found two new issues: - long as dict keys (https://github.com/kisielk/og-rek/issues/55), and - encoding global with '\n' in module name (fixed in previous patch). Update the corpus with points go-fuzz found, so that next runs could restart with having all the previous interesting input vectors available.
-
Kirill Smelkov authored
Caught via fuzzing: "\x8c\x030\n02\x93." 0: \x8c SHORT_BINUNICODE '0\n0' 5: 2 DUP 6: \x93 STACK_GLOBAL 7: . STOP panic: protocol 0: decode back error: err pickle: "c0\n0\n0\n0\n." goroutine 1 [running]: github.com/kisielk/og-rek.Fuzz(0x7f2f1009a000, 0x8, 0x200000, 0x3) /tmp/go-fuzz-build645492341/gopath/src/github.com/kisielk/og-rek/fuzz.go:47 +0x8b8 go-fuzz-dep.Main(0x525e10) /tmp/go-fuzz-build645492341/goroot/src/go-fuzz-dep/main.go:49 +0xad main.main() /tmp/go-fuzz-build645492341/gopath/src/github.com/kisielk/og-rek/go.fuzz.main/main.go:10 +0x2d exit status 2 i.e. the '0\n0' module name was emitted as-is as part of the text-based GLOBAL, which completely broke the pickle stream. For reference, Python decodes such globals with \n in the name just fine: In [10]: s = b"S'decimal\\nq'\nS'Decimal'\n\x93." In [11]: pickle.loads(s) --------------------------------------------------------------------------- ModuleNotFoundError Traceback (most recent call last) <ipython-input-11-764e4625bc41> in <module>() ----> 1 pickle.loads(s) ModuleNotFoundError: No module named 'decimal\nq' In [12]: import sys In [15]: d = sys.modules['decimal'] In [16]: sys.modules['decimal\nq'] = d In [17]: pickle.loads(s) Out[17]: decimal.Decimal
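One way to guard the text-based GLOBAL against this is sketched below. It assumes the encoder simply reports an error when a module or class name contains '\n'; the commit text does not spell out the chosen behaviour, and `encodeTextGlobal` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// encodeTextGlobal emits the text GLOBAL opcode: 'c' module '\n' name '\n'.
// Both arguments are newline-terminated, so a '\n' inside either one
// cannot be represented and is rejected (assumed behaviour).
func encodeTextGlobal(module, name string) ([]byte, error) {
	if strings.Contains(module, "\n") || strings.Contains(name, "\n") {
		return nil, fmt.Errorf("global: %q.%q contains \\n and cannot be emitted as text GLOBAL", module, name)
	}
	return []byte("c" + module + "\n" + name + "\n"), nil
}

func main() {
	p, err := encodeTextGlobal("decimal", "Decimal")
	fmt.Printf("%q %v\n", p, err)

	_, err = encodeTextGlobal("0\n0", "0")
	fmt.Println(err)
}
```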
-
Kirill Smelkov authored
I put "err" into the format string instead of passing it as an argument. Apologies for that.
-
Kirill Smelkov authored
While fixing https://github.com/kisielk/og-rek/issues/48 there were corresponding test-vector additions. Regenerate the fuzz test corpus.
-
Kirill Smelkov authored
UNICODE is text-based opcode, which is used at protocol 0 and follows 'raw-unicode-escape' encoded argument till EOL. - for decoding we must explicitly implement Python's 'raw-unicode-escape' codec decoding, which is used by Python's pickle for UNICODE argument. Updates and hopefully fixes: https://github.com/kisielk/og-rek/issues/48
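A simplified decoder for this codec might look like the sketch below. It assumes the usual 'raw-unicode-escape' rules: a \uXXXX or \UXXXXXXXX sequence is an escape only when preceded by an odd number of backslashes, and every other byte is taken as a latin1 code point. The function name is hypothetical and the error messages are illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// decodeRawUnicodeEscape decodes a UNICODE opcode argument (sketch only).
func decodeRawUnicodeEscape(b []byte) (string, error) {
	var out []rune
	for i := 0; i < len(b); {
		if b[i] != '\\' {
			out = append(out, rune(b[i])) // latin1 byte -> code point
			i++
			continue
		}
		// measure the run of consecutive backslashes starting at i
		j := i
		for j < len(b) && b[j] == '\\' {
			j++
		}
		n := j - i
		isEscape := n%2 == 1 && j < len(b) && (b[j] == 'u' || b[j] == 'U')
		if !isEscape {
			for k := 0; k < n; k++ {
				out = append(out, '\\')
			}
			i = j
			continue
		}
		// all but the last backslash are literal
		for k := 0; k < n-1; k++ {
			out = append(out, '\\')
		}
		width := 4
		if b[j] == 'U' {
			width = 8
		}
		if j+1+width > len(b) {
			return "", errors.New("raw-unicode-escape: truncated escape")
		}
		v, err := strconv.ParseUint(string(b[j+1:j+1+width]), 16, 32)
		if err != nil {
			return "", err
		}
		if v > 0x10ffff {
			return "", errors.New("raw-unicode-escape: code point out of range")
		}
		out = append(out, rune(v))
		i = j + 1 + width
	}
	return string(out), nil
}

func main() {
	s, err := decodeRawUnicodeEscape([]byte(`\u043c\u0438\u0440`))
	fmt.Println(s, err)
}
```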
-
Kirill Smelkov authored
STRING is text-based opcode which is used at protocol 0 and follows \-escaped argument till EOL. - for encoding we must not use Go's %q, since that will use \u and \U when seeing corresponding bytes, and since Python does not interpret \u or \U in string literals, the data received at Python side will be different. - for decoding we must explicitly implement Python's 'string-escape' codec decoding which is used by Python's pickle for STRING opcode argument. Updates: https://github.com/kisielk/og-rek/issues/48
-
Kirill Smelkov authored
We added pyquote in b429839d (tests: Show pickles in a way that can be copy-pasted into Python). However in the next patch this functionality will be needed for the encoder to fix encoding of strings at protocol 0. Thus we need this function to be not test-only. Plain code movement here, no semantic change.
-
- 25 Sep, 2018 8 commits
-
-
Kirill Smelkov authored
Even though we tried to catch whether dict keys are ok to be used via reflect.TypeOf(key).Comparable() (see da5f0342 "decoder: Fix crashes found by fuzzer (#32)"), that turned out to be not enough. For example if key is a struct, e.g. of the following type type Ref struct { Pid interface{} } it will be comparable. But the comparison, depending on the dynamic .Pid type, might panic. This is what was actually caught by fuzz-testing recently: https://github.com/kisielk/og-rek/issues/50 (second part of the report) So instead of recursively walking a key type and checking each subfield with reflect.TypeOf().Comparable(), switch to using panic/recover for detecting the "unhashable key" situation. This slows down decoding a bit (only cumulative figure for all-test-vectors decoding): name old time/op new time/op delta DecodeLong-4 361ns ± 0% 362ns ± 0% ~ (p=0.238 n=5+4) Decode-4 93.2µs ± 0% 95.6µs ± 0% +2.54% (p=0.008 n=5+5) Encode-4 16.5µs ± 0% 16.6µs ± 0% ~ (p=0.841 n=5+5) but that is the price of correctness. And I doubt that manually walking the key type recursively would be faster. The defer overhead should be less once https://github.com/golang/go/issues/14939 is fixed. Updates: https://github.com/kisielk/og-rek/issues/30
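The panic/recover approach can be sketched as below. `mapTryAssign` is a hypothetical name, and `Ref` is redeclared locally so the example is self-contained; the real decoder may structure this differently.

```go
package main

import "fmt"

// Ref mirrors the struct from the commit message: its type is comparable,
// but hashing it panics when Pid holds e.g. a []byte.
type Ref struct {
	Pid interface{}
}

// mapTryAssign performs m[key] = value and reports whether it succeeded.
// An unhashable key makes the assignment panic; recover turns that into
// ok == false instead of crashing the decoder.
func mapTryAssign(m map[interface{}]interface{}, key, value interface{}) (ok bool) {
	defer func() {
		if r := recover(); r != nil {
			ok = false
		}
	}()
	m[key] = value
	return true
}

func main() {
	m := map[interface{}]interface{}{}
	fmt.Println(mapTryAssign(m, Ref{Pid: "abc"}, 1))         // true
	fmt.Println(mapTryAssign(m, Ref{Pid: []byte("abc")}, 2)) // false
}
```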
-
Kirill Smelkov authored
In 06e06939 (encoder: Allow to specify pickle protocol version) I added ability to add to tests error cases - object inputs that on encoding should produce error. For decoding we should skip those cases as there pickle.data = "", and if not skipped it leads to --- FAIL: BenchmarkDecode ogorek_test.go:803: unexpected # of decode steps: got 100 ; want 102
-
Kirill Smelkov authored
Updates coming via `go generate` since main tests table was amended.
-
Kirill Smelkov authored
Apologies for the breakage there.
-
Kirill Smelkov authored
When encoding tests fail, the "want" and "have" pickles are printed. It is handy to copy-paste those pickles into a Python console and check them further there. Pickle printing currently uses %q. However in Go fmt's %q can use \u and \U if a byte sequence forms a valid UTF-8 character. That poses a problem: in a Python str (py2) or bytes (py3) literal \uXXXX is not processed as a unicode-escape and enters the string as is. This results in different pickle data pasted into Python and further confusion. Entering data into Python as unicode literals (where \u works) and then adding .encode('utf-8') also does not generally work - as pickle data is generally arbitrary it can be invalid UTF-8, for example: "\x80\u043c\u0438\u0440" (= "\x80мир" = "\x80\xd0\xbc\xd0\xb8\xd1\x80") and unicode-encoding it in Python also gives different data: In [1]: u"\x80\u043c\u0438\u0440".encode('utf-8') Out[1]: '\xc2\x80\xd0\xbc\xd0\xb8\xd1\x80' (note leading extra \xc2) For this reason let's implement quoting - that Python can understand - ourselves. This dumping functionality was very handy during recent encoder fixes debugging.
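A minimal quoting routine in this spirit is sketched below. It is not the actual pyquote: it deliberately emits every non-printable or non-ASCII byte as \xNN and uses no \u, \U or short escapes, which is a simplification chosen for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// pyQuoteSketch quotes s byte by byte so that the result can be pasted
// into Python as a str (py2) or, with a b prefix, a bytes (py3) literal.
func pyQuoteSketch(s string) string {
	var b strings.Builder
	b.WriteByte('"')
	for i := 0; i < len(s); i++ {
		c := s[i]
		switch {
		case c == '\\' || c == '"':
			b.WriteByte('\\')
			b.WriteByte(c)
		case c >= 0x20 && c < 0x7f:
			b.WriteByte(c) // printable ASCII goes as is
		default:
			fmt.Fprintf(&b, `\x%02x`, c)
		}
	}
	b.WriteByte('"')
	return b.String()
}

func main() {
	fmt.Println(pyQuoteSketch("\x80мир")) // prints "\x80\xd0\xbc\xd0\xb8\xd1\x80"
}
```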
-
Kirill Smelkov authored
- we can use STACK_GLOBAL only if protocol >= 4. - for earlier protocols we have to use text-based GLOBAL.
-
Kirill Smelkov authored
Similarly to dict, for struct encoding switch from protocol 1 opcodes into always using protocol 0 opcodes, which is by the way 1 byte shorter. For the reference - for structs, unlike maps, the order of emitted keys is well-defined - it is the order of fields as they are defined in the struct. This way we can precisely test encoder output on structs with more than 1 field.
-
Kirill Smelkov authored
- we can use EMPTY_DICT only if protocol >= 1 Also: similarly to list (33d1926f), since we are now using EMPTY_DICT only optionally, it is logical to switch to MARK + ... + DICT instead of EMPTY_DICT (or MARK + DICT @proto=0) + MARK + ... + SETITEMS which is at least 1 byte longer. For the reference - SETITEMS is also from protocol 1, while DICT is from protocol 0.
-
- 21 Sep, 2018 6 commits
-
-
Kirill Smelkov authored
More corpus files appeared while running fuzz testing today for ~ 1 hour.
-
Kirill Smelkov authored
This should have been done in 302c79ea (fuzz: Hook encoder into the loop), but it is hopefully never too late.
-
Kirill Smelkov authored
Found via fuzzing: "I-7\n." panic: protocol 1: decode·encode != identity: have: 4294967289 want: -7 goroutine 1 [running]: github.com/kisielk/og-rek.Fuzz(0x7f99bd8b4000, 0x5, 0x200000, 0x3) /tmp/go-fuzz-build914098789/gopath/src/github.com/kisielk/og-rek/fuzz.go:50 +0x604 go-fuzz-dep.Main(0x524df8) /tmp/go-fuzz-build914098789/goroot/src/go-fuzz-dep/main.go:49 +0xad main.main() /tmp/go-fuzz-build914098789/gopath/src/github.com/kisielk/og-rek/go.fuzz.main/main.go:10 +0x2d exit status 2 I've checked other handlers, like BININT1 and BININT2, and since there everywhere argument is unsigned, there is no similar problem. We needed previous patch on proper readLine EOF detection, because else the testcase for P0("I-7\n.") would be breaking: --- FAIL: TestDecode/int(-7)/"I-7\n." (0.00s) ogorek_test.go:401: no ErrUnexpectedEOF on [:2] truncated stream: v = <nil> err = &strconv.NumError{Func:"ParseInt", Num:"-", Err:(*errors.errorString)(0xc00000e1b0)}
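The numbers in the panic above are consistent with a 4-byte BININT argument being read back as unsigned. The sketch below shows the difference between the two interpretations; the helper names are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeBinIntArg returns the 4-byte little-endian BININT argument for v.
func encodeBinIntArg(v int32) []byte {
	b := make([]byte, 4)
	binary.LittleEndian.PutUint32(b, uint32(v))
	return b
}

// decodeBinIntArg interprets the 4 bytes as a signed 32-bit integer.
func decodeBinIntArg(b []byte) int64 {
	return int64(int32(binary.LittleEndian.Uint32(b)))
}

// decodeBinIntArgUnsigned shows the buggy interpretation: the sign is lost.
func decodeBinIntArgUnsigned(b []byte) int64 {
	return int64(binary.LittleEndian.Uint32(b))
}

func main() {
	arg := encodeBinIntArg(-7)
	fmt.Println(decodeBinIntArg(arg))         // prints -7
	fmt.Println(decodeBinIntArgUnsigned(arg)) // prints 4294967289
}
```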
-
Kirill Smelkov authored
Currently we use bufio.Reader.ReadLine which accepts either \n or \r\n as line ending. That is however not correct: - we should not accept e.g. "S'abc'\r\n." pickle, because it is invalid: In [32]: pickle.loads(b"S'abc'\r\n.") --------------------------------------------------------------------------- UnpicklingError Traceback (most recent call last) <ipython-input-32-b1da1988bae1> in <module>() ----> 1 pickle.loads(b"S'abc'\r\n.") UnpicklingError: the STRING opcode argument must be quoted - we should not accept e.g. "L123L\r\n.", because it is also invalid: In [33]: pickle.loads(b"L123L\r\n.") --------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-33-7231ec07f5c4> in <module>() ----> 1 pickle.loads(b"L123L\r\n.") ValueError: invalid literal for int() with base 10: '123L\r\n' - treating \r as part of EOL in e.g. UNICODE pickle would just drop encoded information: # python In [34]: pickle.loads(b"Vabc\r\n.") Out[34]: 'abc\r' while ogórek currently decodes it as just 'abc' (no trailing \r). For this reason let's fix Decoder.readLine to treat only \n as EOL. Besides this fix, we now get another property: previously, when internally using bufio.Reader.ReadLine we were not able to distinguish two situations: - a line was abruptly ended without any EOL characters at all, - a line was properly ended with EOL character. Now after we switched to internally using bufio.Reader.ReadSlice, we will be able to properly detect EOF and return that as error. This property will be needed in the following patch.
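A line reader with these properties can be sketched on top of bufio.Reader.ReadSlice as below. This is an illustration rather than the actual Decoder.readLine: lines longer than the bufio buffer surface as bufio.ErrBufferFull here and are not handled, and `readLineString` is just a convenience wrapper added for the example.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// readLine reads up to and including '\n' and returns the line without it.
// Only '\n' counts as EOL, so a preceding '\r' stays part of the data.
// Hitting EOF before '\n' is reported as io.ErrUnexpectedEOF.
func readLine(r *bufio.Reader) ([]byte, error) {
	line, err := r.ReadSlice('\n')
	if err == io.EOF {
		return nil, io.ErrUnexpectedEOF
	}
	if err != nil {
		return nil, err
	}
	// ReadSlice returns a view into the reader's buffer; copy it out.
	return append([]byte(nil), line[:len(line)-1]...), nil
}

// readLineString runs readLine over an in-memory string.
func readLineString(s string) (string, error) {
	line, err := readLine(bufio.NewReader(strings.NewReader(s)))
	return string(line), err
}

func main() {
	fmt.Println(readLineString("abc\r\n.")) // line keeps the trailing \r
	fmt.Println(readLineString("I-"))       // truncated: unexpected EOF
}
```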
-
Kirill Smelkov authored
This way, whenever we add a tricky test pickle to the main tests table, it will automatically also be present as a starting point in the fuzz corpus. This should hopefully improve fuzzing coverage.
-
Kirill Smelkov authored
Continuing 5dbc8a1b (decoder: Don't allow mark to be returned as pickle result) I discovered that the mark object can be still exposed to user, but not directly. For example the following pickle: "(\x85." // MARK + TUPLE1 was creating Tuple{mark} and returning it just ok to the user. As marker must be used only internally it is invalid to do so. Python also forbids this: In [3]: s = b"(\x85." In [4]: dis(s) 0: ( MARK 1: \x85 TUPLE1 2: . STOP highest protocol among opcodes = 2 In [5]: pickle.loads(s) --------------------------------------------------------------------------- UnpicklingError Traceback (most recent call last) <ipython-input-5-764e4625bc41> in <module>() ----> 1 pickle.loads(s) UnpicklingError: unexpected MARK found So let's close all (hopefully) holes where mark object could be returned to user in a similar way.
-
- 20 Sep, 2018 5 commits
-
-
Kirill Smelkov authored
New test vectors caught by decode·encode idempotency fuzzing.
-
Kirill Smelkov authored
A pickle is considered invalid by both Python2 and Python3 if it tries to return MARK as the result, e.g.: In [2]: pickle.loads(b"(.") --------------------------------------------------------------------------- UnpicklingError Traceback (most recent call last) <ipython-input-2-0c142c82b126> in <module>() ----> 1 pickle.loads(b"(.") UnpicklingError: unexpected MARK found However until now, although mark is an unexported ogórek type, we were allowing it to be returned just fine. The problem was caught by decode/encode roundtrip fuzz tests, e.g. "(Q." panic: protocol 1: decode·encode != identity: have: ogórek.Ref{Pid:map[interface {}]interface {}{}} want: ogórek.Ref{Pid:ogórek.mark{}} goroutine 1 [running]: github.com/kisielk/og-rek.Fuzz(0x7fbe6c15f000, 0x3, 0x200000, 0x3) /tmp/go-fuzz-build697921479/gopath/src/github.com/kisielk/og-rek/fuzz.go:87 +0x604 go-fuzz-dep.Main(0x524d78) /tmp/go-fuzz-build697921479/goroot/src/go-fuzz-dep/main.go:49 +0xad main.main() /tmp/go-fuzz-build697921479/gopath/src/github.com/kisielk/og-rek/go.fuzz.main/main.go:10 +0x2d exit status 2
-
Kirill Smelkov authored
We can enhance our fuzz-testing coverage by also hooking the Encoder into the loop: if input data is successfully decoded, we have an object that can be passed back to the Encoder to generate a pickle. We can't require that pickle to be the same as the original input data, since the pickle machine allows multiple representations of the same data. However we can assert that when that pickle is decoded back, the result is the same as the encoded object. This catches several problems: - marker is currently returned as pickle result (see next patch). - text-based STRING and UNICODE are not properly decoded (no fix yet). - self-referencing data structures kill Encoder (no fix yet).
-
Kirill Smelkov authored
Add more files go-fuzz put to corpus while discovering the crash fixed in previous commit.
-
Kirill Smelkov authored
BINSTRING opcode follows 4-byte length of the data and the data itself. Upon seeing the BINSTRING len header we were preallocating destination buffer with len as optimization (see 14aaa14f "decoder: Preallocate .buf capacity when we know (approximate) size of output to it"). However this way it is easy for a malicious pickle to specify BINSTRING biglength (4GB) and no data, and then the goroutine that processes such input will allocate 4GB immediately, which in turn might cause out-of-memory DOS. The other places where we currently grow Decoder.buf all either have 1 or 2 byte length (thus limited to 64K), or the length of the data that was already read from input stream. The problem was found by rerunning fuzz tests: "T0000" fatal error: runtime: out of memory runtime stack: runtime.throw(0x514fff, 0x16) /tmp/go-fuzz-build515164548/goroot/src/runtime/panic.go:608 +0x72 runtime.sysMap(0xc004000000, 0x34000000, 0x5f83d8) /tmp/go-fuzz-build515164548/goroot/src/runtime/mem_linux.go:156 +0xc7 runtime.(*mheap).sysAlloc(0x5dfd40, 0x34000000, 0x43c0e3, 0x7ffc94fe56b8) /tmp/go-fuzz-build515164548/goroot/src/runtime/malloc.go:619 +0x1c7 runtime.(*mheap).grow(0x5dfd40, 0x18182, 0x0) /tmp/go-fuzz-build515164548/goroot/src/runtime/mheap.go:920 +0x42 runtime.(*mheap).allocSpanLocked(0x5dfd40, 0x18182, 0x5f83e8, 0x203000) /tmp/go-fuzz-build515164548/goroot/src/runtime/mheap.go:848 +0x337 runtime.(*mheap).alloc_m(0x5dfd40, 0x18182, 0x7ffc94fe0101, 0x40a87f) /tmp/go-fuzz-build515164548/goroot/src/runtime/mheap.go:692 +0x119 runtime.(*mheap).alloc.func1() /tmp/go-fuzz-build515164548/goroot/src/runtime/mheap.go:759 +0x4c runtime.(*mheap).alloc(0x5dfd40, 0x18182, 0x7ffc94010101, 0x4135f5) /tmp/go-fuzz-build515164548/goroot/src/runtime/mheap.go:758 +0x8a runtime.largeAlloc(0x30303030, 0x440101, 0xc00005e240) /tmp/go-fuzz-build515164548/goroot/src/runtime/malloc.go:1019 +0x97 runtime.mallocgc.func1() /tmp/go-fuzz-build515164548/goroot/src/runtime/malloc.go:914 +0x46 
runtime.systemstack(0x44eea9) /tmp/go-fuzz-build515164548/goroot/src/runtime/asm_amd64.s:351 +0x66 runtime.mstart() /tmp/go-fuzz-build515164548/goroot/src/runtime/proc.go:1229 goroutine 1 [running]: runtime.systemstack_switch() /tmp/go-fuzz-build515164548/goroot/src/runtime/asm_amd64.s:311 fp=0xc000034310 sp=0xc000034308 pc=0x44efa0 runtime.mallocgc(0x30303030, 0x4edc00, 0x1, 0xc0000343e8) /tmp/go-fuzz-build515164548/goroot/src/runtime/malloc.go:913 +0x896 fp=0xc0000343b0 sp=0xc000034310 pc=0x40ae26 runtime.makeslice(0x4edc00, 0x30303030, 0x30303030, 0xc000018108, 0x4, 0x4) /tmp/go-fuzz-build515164548/goroot/src/runtime/slice.go:70 +0x77 fp=0xc0000343e0 sp=0xc0000343b0 pc=0x43b1d7 bytes.makeSlice(0x30303030, 0x0, 0x0, 0x0) /tmp/go-fuzz-build515164548/goroot/src/bytes/buffer.go:231 +0x9d fp=0xc000034420 sp=0xc0000343e0 pc=0x4b204d bytes.(*Buffer).grow(0xc000084030, 0x30303030, 0x0) /tmp/go-fuzz-build515164548/goroot/src/bytes/buffer.go:144 +0x2e4 fp=0xc000034470 sp=0xc000034420 pc=0x4b1604 bytes.(*Buffer).Grow(0xc000084030, 0x30303030) /tmp/go-fuzz-build515164548/goroot/src/bytes/buffer.go:163 +0x86 fp=0xc000034498 sp=0xc000034470 pc=0x4b18b6 github.com/kisielk/og-rek.(*Decoder).loadBinString(0xc000084000, 0x203054, 0x0) /tmp/go-fuzz-build515164548/gopath/src/github.com/kisielk/og-rek/ogorek.go:646 +0x143 fp=0xc000034520 sp=0xc000034498 pc=0x4d5103 github.com/kisielk/og-rek.(*Decoder).Decode(0xc000084000, 0xc000080000, 0xc000084000, 0xf7eb9fa, 0x586dc) /tmp/go-fuzz-build515164548/gopath/src/github.com/kisielk/og-rek/ogorek.go:227 +0xf94 fp=0xc0000346d8 sp=0xc000034520 pc=0x4d09b4 github.com/kisielk/og-rek.Fuzz(0x7fbccd400000, 0x5, 0x200000, 0xc000034748) /tmp/go-fuzz-build515164548/gopath/src/github.com/kisielk/og-rek/fuzz.go:12 +0xb0 fp=0xc000034710 sp=0xc0000346d8 pc=0x4cf640 go-fuzz-dep.Main(0x519cf0) /tmp/go-fuzz-build515164548/goroot/src/go-fuzz-dep/main.go:49 +0xad fp=0xc000034780 sp=0xc000034710 pc=0x4642fd main.main() 
/tmp/go-fuzz-build515164548/gopath/src/github.com/kisielk/og-rek/go.fuzz.main/main.go:10 +0x2d fp=0xc000034798 sp=0xc000034780 pc=0x4db1ad runtime.main() /tmp/go-fuzz-build515164548/goroot/src/runtime/proc.go:201 +0x207 fp=0xc0000347e0 sp=0xc000034798 pc=0x428ec7 runtime.goexit() /tmp/go-fuzz-build515164548/goroot/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc0000347e8 sp=0xc0000347e0 pc=0x450f01 exit status 2
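One way to keep the preallocation optimization while bounding what an attacker-supplied length can force is sketched below. The 64K cap and the helper names are assumptions made for this example; the commit text does not state how the actual fix limits the allocation.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// maxPrealloc caps how much is reserved up front on the word of an
// untrusted length header (assumed value, for illustration).
const maxPrealloc = 64 << 10

// readN reads exactly n bytes (n >= 0) from r. Memory beyond maxPrealloc
// is allocated only as data actually arrives, so a huge n with no data
// behind it fails quickly instead of allocating gigabytes.
func readN(r io.Reader, n int64) ([]byte, error) {
	var buf bytes.Buffer
	prealloc := n
	if prealloc > maxPrealloc {
		prealloc = maxPrealloc
	}
	buf.Grow(int(prealloc))
	_, err := io.CopyN(&buf, r, n)
	if err == io.EOF {
		err = io.ErrUnexpectedEOF // input ended before n bytes were read
	}
	if err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// readNString runs readN over an in-memory string.
func readNString(s string, n int64) (string, error) {
	data, err := readN(strings.NewReader(s), n)
	return string(data), err
}

func main() {
	fmt.Println(readNString("abcdef", 3))
	// "T0000" from the crash: length header 0x30303030 and no data.
	fmt.Println(readNString("", 0x30303030))
}
```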
-
- 19 Sep, 2018 9 commits
-
-
Kirill Smelkov authored
- we can use BINSTRING* only if protocol >= 1; - at protocol 0 we thus have to use text STRING; - if protocol >= 3 we have to emit the string as unicode pickle object the same way as Python3 does. If we don't, Python3 generally won't be able to load our pickle: In [1]: s = b'U\x06\xd0\xbc\xd0\xb8\xd1\x80q\x00.' In [2]: from pickletools import dis In [3]: dis(s) 0: U SHORT_BINSTRING 'миÑ\x80' 8: q BINPUT 0 10: . STOP highest protocol among opcodes = 1 In [4]: import pickle In [5]: pickle.loads(s) --------------------------------------------------------------------------- UnicodeDecodeError Traceback (most recent call last) <ipython-input-5-764e4625bc41> in <module>() ----> 1 pickle.loads(s) UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128) We already decode unicode pickle objects into string, so this way decode(encode(string)) always returns the original string.
-
Kirill Smelkov authored
- we can use BINPERSID only if protocol >= 1 - we can use text PERSID only if protocol < 4 and Ref string does not contain \n - otherwise encoding has to fail. Tests for Ref encoding at protocols 3 & 4 will follow after string encoding is fixed.
-
Kirill Smelkov authored
- we can use EMPTY_LIST only if protocol >= 1 Also: since we are now using EMPTY_LIST only optionally, it is logical to switch to MARK + ... + LIST instead of EMPTY_LIST (or MARK + LIST @proto=0) + MARK + ... + APPENDS which is at least 1 byte longer. For the reference - APPENDS is also from protocol 1, while LIST is from protocol 0.
-
Kirill Smelkov authored
- we can use EMPTY_TUPLE only if protocol >= 1 - also: if protocol >= 2 we can now use TUPLE{1,2,3} opcodes.
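The opcode choice described above can be sketched as a function over already-encoded items. The opcode byte values are the standard pickle ones; `encodeTuple` and its signature are hypothetical and do not reflect how the Encoder is actually organized.

```go
package main

import "fmt"

// Standard pickle opcodes used for tuples.
const (
	opMark       byte = '('  // protocol 0
	opTuple      byte = 't'  // protocol 0: build tuple from items above MARK
	opEmptyTuple byte = ')'  // protocol 1
	opTuple1     byte = 0x85 // protocol 2; TUPLE2 and TUPLE3 follow it
)

// encodeTuple wraps already-pickled items into a tuple, picking the
// shortest form the given protocol allows.
func encodeTuple(proto int, items [][]byte) []byte {
	n := len(items)
	var out []byte
	switch {
	case n == 0 && proto >= 1:
		return []byte{opEmptyTuple}
	case n >= 1 && n <= 3 && proto >= 2:
		for _, it := range items {
			out = append(out, it...)
		}
		return append(out, opTuple1+byte(n-1)) // TUPLE1, TUPLE2 or TUPLE3
	default:
		out = append(out, opMark)
		for _, it := range items {
			out = append(out, it...)
		}
		return append(out, opTuple)
	}
}

func main() {
	none := []byte("N") // NONE opcode, used here as a sample item
	fmt.Printf("%q\n", encodeTuple(0, nil))
	fmt.Printf("%q\n", encodeTuple(1, nil))
	fmt.Printf("%q\n", encodeTuple(1, [][]byte{none, none}))
	fmt.Printf("%q\n", encodeTuple(2, [][]byte{none, none}))
}
```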
-
Kirill Smelkov authored
Corresponding TODO in the code was added in 4fd6be93 (encoder: Adjust it so that decode(encode(v)) == v)
-
Kirill Smelkov authored
We can use BINFLOAT opcode only starting from protocol >= 1. At protocol 0 we must use ASCII FLOAT.
-
Kirill Smelkov authored
We can use BININT* opcodes only starting from protocol >= 1. At protocol 0 we must use ASCII INT.
-
Kirill Smelkov authored
Starting from protocol 2 1-byte NEWTRUE/NEWFALSE opcodes are more efficient compared to 5-bytes e.g. "I01\n.". It is not only about efficiency, as protocol 4 _forbids_ use of variable length ASCII-only opcodes - whose data length is determined by doing forward scan for '\n'. Without encodeBool changes and only with the tests added it would fail this way: --- FAIL: TestEncode/True/proto=2 (0.00s) ogorek_test.go:383: encode: have: "\x80\x02I01\n." want: "\x80\x02\x88." --- FAIL: TestEncode/True/proto=3 (0.00s) ogorek_test.go:383: encode: have: "\x80\x03I01\n." want: "\x80\x03\x88." --- FAIL: TestEncode/True/proto=4 (0.00s) ogorek_test.go:383: encode: have: "\x80\x04I01\n." want: "\x80\x04\x88." --- FAIL: TestEncode/False/proto=2 (0.00s) ogorek_test.go:383: encode: have: "\x80\x02I00\n." want: "\x80\x02\x89." --- FAIL: TestEncode/False/proto=3 (0.00s) ogorek_test.go:383: encode: have: "\x80\x03I00\n." want: "\x80\x03\x89." --- FAIL: TestEncode/False/proto=4 (0.00s) ogorek_test.go:383: encode: have: "\x80\x04I00\n." want: "\x80\x04\x89."
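The expected outputs in the failures above correspond to a selection like the following sketch. The function name and signature are chosen for illustration and need not match the Encoder's internal encodeBool.

```go
package main

import "fmt"

// encodeBoolSketch returns the pickle bytes for v without PROTO/STOP framing:
// NEWTRUE (\x88) / NEWFALSE (\x89) from protocol 2 on, and the text forms
// "I01\n" / "I00\n" for protocols 0 and 1.
func encodeBoolSketch(proto int, v bool) []byte {
	if proto >= 2 {
		if v {
			return []byte{0x88}
		}
		return []byte{0x89}
	}
	if v {
		return []byte("I01\n")
	}
	return []byte("I00\n")
}

func main() {
	for proto := 0; proto <= 4; proto++ {
		fmt.Printf("proto=%d True=%q False=%q\n",
			proto, encodeBoolSketch(proto, true), encodeBoolSketch(proto, false))
	}
}
```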
-
Kirill Smelkov authored
There are many pickle protocol versions - 0 to 4. Python2 for example understands only versions 0 - 2. However we currently unconditionally emit opcodes from higher versions, for example STACK_GLOBAL - from version 4 - when encoding a Class, which makes it impossible to decode pickles generated by ogórek on Python2. Similarly protocol 0 states that only text opcodes should be used, however we currently unconditionally emit e.g. BININT (from protocol 1) when encoding integers. Always using protocol 0 opcodes would not be good either, since many opcodes for efficiently encoding integers, booleans, unicode etc. are available only in protocol versions 2 and 4. For this reason, similarly to Python[1], let's allow users to specify the desired pickle protocol when creating Encoder with config. For backward compatibility and common sense the protocol version that plain NewEncoder selects is 2. This commit adds only the above-described user interface and the testing infrastructure for verifying the result of encoding an object at a particular protocol version. For now only a few of the pickle test vectors match what the encoder should generate or currently generates. Thus in the next patches we'll fix the encoder step by step. [1] https://docs.python.org/3/library/pickle.html#pickle.dump
-