Commits · 57f875fd7bcfd37b3ce787ca12407d88f4527708 · Kirill Smelkov / og-rek

28 Sep, 2018 3 commits

encoder: Fix protocol 0 UNICODE emission · 57f875fd

Kirill Smelkov authored Sep 27, 2018

Previously, we were quoting UNICODE opcode argument with strconv.QuoteToASCII().
However that function, in addition to \u and \U escapes, can produce
e.g. \n, \r, \xAA etc escapes. And all of the latter variants are not
treated as special escapes of a unicode literal by Python, thus leading
to data being wrongly received.

Fix it by doing exactly what the Python pickle encoder does - the
UNICODE argument is emitted "raw-unicode-escape" encoded.

This patch contains only codec tests - not end-to-end pickle tests,
because currently Encoder.encodeUnicode() is called only from under
Encoder.encodeString(), and there only from under

	if e.config.Protocol >= 3

We will indirectly add tests for encodeUnicode @ protocol=0 in the next
patches, while adding support for Python bytes.
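
For illustration only, a minimal Python 3 sketch (standard library only, not part of this patch) of how Python's own pickle treats the UNICODE argument at protocol 0. The helper name `unicode_arg` is hypothetical:

```python
import pickle

def unicode_arg(s: str) -> bytes:
    """Return the argument Python emits after the protocol 0 UNICODE opcode ('V')."""
    data = pickle.dumps(s, protocol=0)
    assert data[:1] == b"V"               # UNICODE opcode
    return data[1:data.index(b"\n")]      # the argument runs till the first real EOL

# Non-ASCII text and the newline character are emitted as \uXXXX escapes.
print(unicode_arg("мир\n"))

# On decoding only \uXXXX and \UXXXXXXXX are special:
# backslash followed by 'n' stays as two characters.
print(repr(pickle.loads(b"Va\\nb\n.")))
print(repr(pickle.loads(b"Va\\u000ab\n.")))
```

This is why escapes such as \n or \xAA in the argument would arrive on the Python side as literal backslash sequences.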

57f875fd

pyquote: Add tests · 9936cf9d

Kirill Smelkov authored Sep 27, 2018

I initially added pyquote only as a debugging tool (in b429839d "tests:
Show pickles in a way that can be copy-pasted into Python"), and later
it started to be used in the Encoder (see 18004fbd "Move pyquote into
main codebase"). However there are no explicit tests for pyquote.

Add some pyquote tests.

9936cf9d

tests/testEncode: Fix potential panic · a15630df

Kirill Smelkov authored Sep 27, 2018

Both object and objectDecodedBack are interface{}. So it could turn out that,
if e.g. they are both of the same non-comparable type (e.g. []byte),
comparing them directly will panic.

a15630df

26 Sep, 2018 7 commits

fuzz/corpus: Update · 3ab92142

Kirill Smelkov authored Sep 26, 2018

I was (re-)running fuzz tests again, and in addition to what was already
known (e.g. self-referencing structures) found two new issues:

- long as dict keys (https://github.com/kisielk/og-rek/issues/55), and
- encoding global with '\n' in module name (fixed in previous patch).

Update the corpus with points go-fuzz found, so that next runs could
restart with having all the previous interesting input vectors available.

3ab92142

encoder: Fix GLOBAL emission wrt module/name with \n · 6e5e403e

Kirill Smelkov authored Sep 26, 2018

Caught via fuzzing:

	"\x8c\x030\n02\x93."

        0: \x8c SHORT_BINUNICODE '0\n0'
        5: 2    DUP
        6: \x93 STACK_GLOBAL
        7: .    STOP

	panic: protocol 0: decode back error: err
	pickle: "c0\n0\n0\n0\n."

	goroutine 1 [running]:
	github.com/kisielk/og-rek.Fuzz(0x7f2f1009a000, 0x8, 0x200000, 0x3)
	        /tmp/go-fuzz-build645492341/gopath/src/github.com/kisielk/og-rek/fuzz.go:47 +0x8b8
	go-fuzz-dep.Main(0x525e10)
	        /tmp/go-fuzz-build645492341/goroot/src/go-fuzz-dep/main.go:49 +0xad
	main.main()
	        /tmp/go-fuzz-build645492341/gopath/src/github.com/kisielk/og-rek/go.fuzz.main/main.go:10 +0x2d
	exit status 2

i.e. the '0\n0' module name was emitted as-is as part of the text-based GLOBAL, which
completely broke the pickle stream.

For reference, Python decodes such globals with \n in the name just fine:

	In [10]: s = b"S'decimal\\nq'\nS'Decimal'\n\x93."

	In [11]: pickle.loads(s)
	---------------------------------------------------------------------------
	ModuleNotFoundError                       Traceback (most recent call last)
	<ipython-input-11-764e4625bc41> in <module>()
	----> 1 pickle.loads(s)

	ModuleNotFoundError: No module named 'decimal\nq'

	In [12]: import sys

	In [15]: d = sys.modules['decimal']

	In [16]: sys.modules['decimal\nq'] = d

	In [17]: pickle.loads(s)
	Out[17]: decimal.Decimal
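
For illustration only, a small Python 3 sketch (not part of this patch) of why a raw \n inside the text-based GLOBAL argument breaks the stream: the opcode reads module and name as one line each, so extra newlines leave stray bytes behind. It uses `pickletools.genops`, which only parses opcodes and does not execute the pickle; the helper name `opcode_names` is hypothetical:

```python
import pickletools

def opcode_names(data: bytes) -> list:
    """List opcode names of a pickle without executing it."""
    return [op.name for op, arg, pos in pickletools.genops(data)]

# A well-formed text GLOBAL: module and name occupy one line each.
print(opcode_names(b"cdecimal\nDecimal\n."))

# The pickle from the fuzz report: '0\n0' written verbatim into GLOBAL
# leaves stray bytes behind and the stream no longer parses.
try:
    opcode_names(b"c0\n0\n0\n0\n.")
except ValueError as e:
    print("broken stream:", e)
```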

6e5e403e

fuzz: Fix error printing thinko · f2f59c50

Kirill Smelkov authored Sep 26, 2018

I put "err" into the format string instead of passing it as an argument. Apologies for that...

f2f59c50

fuzz/corpus: Update · 192173dc

Kirill Smelkov authored Sep 25, 2018

While fixing https://github.com/kisielk/og-rek/issues/48 there were
corresponding test-vector additions. Regenerate the fuzz test corpus.

192173dc

Fix UNICODE decoding · 9daf6a2a

Kirill Smelkov authored Sep 25, 2018

UNICODE is a text-based opcode, which is used at protocol 0 and is followed by
a 'raw-unicode-escape' encoded argument till EOL.

- for decoding we must explicitly implement Python's
  'raw-unicode-escape' codec decoding, which is used by Python's pickle
  for UNICODE argument.

Updates and hopefully fixes: https://github.com/kisielk/og-rek/issues/48

9daf6a2a

Fix STRING encoding/decoding · f62fe97f

Kirill Smelkov authored Sep 25, 2018

STRING is a text-based opcode, which is used at protocol 0 and is followed by
a \-escaped argument till EOL.

- for encoding we must not use Go's %q, since that will use \u and \U
  when seeing corresponding bytes, and since Python does not interpret
  \u or \U in string literals, the data received at Python side will be
  different.

- for decoding we must explicitly implement Python's 'string-escape'
  codec decoding which is used by Python's pickle for STRING opcode
  argument.

Updates: https://github.com/kisielk/og-rek/issues/48
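
For illustration only, a minimal Python 3 sketch (not part of this patch) of what the Python side makes of a STRING argument that uses only byte-level escapes. `encoding="bytes"` is the standard `pickle.loads` option that returns the raw bytes instead of decoding them to text:

```python
import pickle

# STRING: quoted, backslash-escaped bytes terminated by EOL.
data = b"S'\\xd0\\xbc\\n'\n."

raw = pickle.loads(data, encoding="bytes")    # keep the bytes as they are
print(raw)

text = pickle.loads(data, encoding="utf-8")   # interpret the same bytes as UTF-8
print(ascii(text))
```

With \xNN escapes the received bytes match what was sent, which is the property that \u and \U escapes from %q do not provide.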

f62fe97f

Move pyquote into main codebase · 18004fbd

Kirill Smelkov authored Sep 25, 2018

We added pyquote in b429839d (tests: Show pickles in a way that can be
copy-pasted into Python). However in the next patch this functionality
will be needed for the encoder to fix encoding of strings at protocol 0.
Thus we need this function to be not test-only.

Plain code movement here, no semantic change.

18004fbd

25 Sep, 2018 8 commits

decoder: Fix panic on dict with non-comparable keys · cdd8269c

Kirill Smelkov authored Sep 24, 2018

Even though we tried to catch whether dict keys are ok to be used via
reflect.TypeOf(key).Comparable() (see da5f0342 "decoder: Fix crashes
found by fuzzer (#32)"), that turned out to be not enough. For example
if key is a struct, e.g. of the following type

	type Ref struct {
		Pid interface{}
	}

it will be comparable. But the comparison, depending on the dynamic .Pid
type, might panic. This is what was actually caught by fuzz-testing
recently:

	https://github.com/kisielk/og-rek/issues/50 (second part of the report)

So instead of recursively walking a key type and checking each subfield
with reflect.TypeOf().Comparable(), switch to using panic/recover for
detecting the "unhashable key" situation.

This slows down decoding a bit (only cumulative figure for all-test-vectors decoding):

	name          old time/op    new time/op    delta
	DecodeLong-4     361ns ± 0%     362ns ± 0%    ~     (p=0.238 n=5+4)
	Decode-4        93.2µs ± 0%    95.6µs ± 0%  +2.54%  (p=0.008 n=5+5)
	Encode-4        16.5µs ± 0%    16.6µs ± 0%    ~     (p=0.841 n=5+5)

but that is the price of correctness. And I doubt that manually walking the
key type recursively would be faster.

The defer overhead should be less once https://github.com/golang/go/issues/14939 is fixed.

Updates: https://github.com/kisielk/og-rek/issues/30

cdd8269c

tests/BenchmarkDecode: Skip empty pickles from tests table · e3797bb1

Kirill Smelkov authored Sep 24, 2018

In 06e06939 (encoder: Allow to specify pickle protocol version) I added
the ability to put error cases into the tests table - object inputs that
should produce an error on encoding.

For decoding we should skip those cases as there pickle.data = "", and
if not skipped it leads to

	--- FAIL: BenchmarkDecode
	    ogorek_test.go:803: unexpected # of decode steps: got 100  ; want 102

e3797bb1

fuzz/corpus: Update · 6daa7b2c

Kirill Smelkov authored Sep 21, 2018

Updates coming via `go generate` since main tests table was amended.

6daa7b2c

fixup! fuzz: Add more details when reporting failures · c54a5739

Kirill Smelkov authored Sep 21, 2018

Apologies for the breakage there.

c54a5739

tests: Show pickles in a way that can be copy-pasted into Python · b429839d

Kirill Smelkov authored Sep 21, 2018

When encoding tests fail, the "want" and "have" pickles are printed. It
is handy to copy-paste those pickles into a Python console and check them
further there.

Pickle printing currently uses %q. However in Go fmt's %q can use \u and
\U if a byte sequence forms a valid UTF-8 character. That poses a problem:
in a Python str (py2) or bytes (py3) literal \uXXXX is not processed as
a unicode-escape and enters the string as is. This results in different
pickle data pasted into Python and further confusion.

Entering data into Python as unicode literals (where \u works) and then
adding .encode('utf-8') also does not generally work - as pickle data is
generally arbitrary, it can be invalid UTF-8, for example:

	"\x80\u043c\u0438\u0440"	(= "\x80мир"   = "\x80\xd0\xbc\xd0\xb8\xd1\x80")

and unicode-encoding them in Python also gives different data:

	In [1]: u"\x80\u043c\u0438\u0440".encode('utf-8')
	Out[1]: '\xc2\x80\xd0\xbc\xd0\xb8\xd1\x80'

(note leading extra \xc2)

For this reason let's implement quoting - that Python can understand -
ourselves. This dumping functionality was very handy during recent
encoder fixes debugging.
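
For illustration only, a rough Python 3 sketch of the quoting idea (this is not the Go pyquote from the patch; the name `pyquote_sketch` and the exact set of short escapes are assumptions): emit only escapes that a Python bytes literal understands, so the quoted text can be pasted back and yields the same bytes.

```python
import ast

def pyquote_sketch(data: bytes) -> str:
    # Use only escapes that Python bytes literals understand: no \u, no \U.
    simple = {0x5C: "\\\\", 0x22: '\\"', 0x0A: "\\n", 0x0D: "\\r", 0x09: "\\t"}
    out = []
    for b in data:
        if b in simple:
            out.append(simple[b])
        elif 0x20 <= b < 0x7F:
            out.append(chr(b))            # printable ASCII goes as is
        else:
            out.append("\\x%02x" % b)     # everything else as \xNN
    return '"' + "".join(out) + '"'

raw = b"\x80\xd0\xbc\xd0\xb8\xd1\x80"     # the example bytes from the text above
quoted = pyquote_sketch(raw)
print(quoted)

# Pasting the quoted form back as a bytes literal gives the original data.
assert ast.literal_eval("b" + quoted) == raw
# Going through a unicode literal instead adds the extra \xc2 mentioned above.
assert "\x80мир".encode("utf-8") == b"\xc2" + raw
```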

b429839d

encoder: Fix class wrt protocol version · 656974e2

Kirill Smelkov authored Sep 21, 2018

- we can use STACK_GLOBAL only if protocol >= 4.
- for earlier protocols we have to use text-based GLOBAL.
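
For illustration only, a small Python 3 sketch (not part of this patch) showing the same rule in Python's own pickler: a class is emitted with text GLOBAL at low protocols and with STACK_GLOBAL at protocol 4. The helper name `opcode_names` is hypothetical:

```python
import decimal
import pickle
import pickletools

def opcode_names(obj, protocol):
    """Opcode names Python emits for obj at the given protocol."""
    data = pickle.dumps(obj, protocol=protocol)
    return [op.name for op, arg, pos in pickletools.genops(data)]

for proto in (0, 2, 4):
    print(proto, opcode_names(decimal.Decimal, proto))
```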

656974e2

encoder: Fix struct encoding wrt protocol version · 95f42543

Kirill Smelkov authored Sep 21, 2018

Similarly to dict, for struct encoding switch from protocol 1 opcodes
to always using protocol 0 opcodes, which is by the way 1 byte
shorter.

For the reference - for structs, unlike maps, the order of emitted keys
is well-defined - it is the order of fields as they are defined in the
struct. This way we can precisely test encoder output on structs with
more than 1 field.

95f42543

encoder: Fix dict wrt protocol version · bb7b117b

Kirill Smelkov authored Sep 21, 2018

- we can use EMPTY_DICT only if protocol >= 1

Also: similarly to list (33d1926f), since we are now using EMPTY_DICT
only optionally, it is logical to switch to

	MARK + ... + DICT

from

	EMPTY_DICT (or MARK + DICT @proto=0) + MARK + ... + SETITEMS

which is at least 1 byte longer.

For the reference - SETITEMS is also from protocol 1, while DICT is from
protocol 0.
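
For illustration only, a minimal Python 3 sketch (hand-written pickles, not output of the encoder) comparing the two forms. Both decode to the same object, and the MARK-based form is one byte shorter:

```python
import pickle

# MARK + key + value + DICT: protocol 0 opcodes only.
short_dict = b"(I1\nI2\nd."
# EMPTY_DICT + MARK + key + value + SETITEMS: needs protocol 1.
long_dict = b"}(I1\nI2\nu."

assert pickle.loads(short_dict) == pickle.loads(long_dict) == {1: 2}
print(len(long_dict) - len(short_dict))   # 1

# The list case from 33d1926f has the same shape:
# MARK ... LIST versus EMPTY_LIST + MARK ... APPENDS.
short_list = b"(I1\nI2\nl."
long_list = b"](I1\nI2\ne."
assert pickle.loads(short_list) == pickle.loads(long_list) == [1, 2]
```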

bb7b117b

21 Sep, 2018 6 commits

fuzz/corpus: Update · 89930c10

Kirill Smelkov authored Sep 21, 2018

More corpus files appeared while running fuzz testing today for ~ 1 hour.

89930c10

fuzz: Add more details when reporting failures · 65d6314e

Kirill Smelkov authored Sep 21, 2018

This should have been done in 302c79ea (fuzz: Hook encoder into the loop), but it
is hopefully never too late.

65d6314e

decoder: Fix BININT decoding for negative values · bd5a7fd4

Kirill Smelkov authored Sep 21, 2018

Found via fuzzing:

	"I-7\n."

	panic: protocol 1: decode·encode != identity:
	have: 4294967289
	want: -7

	goroutine 1 [running]:
	github.com/kisielk/og-rek.Fuzz(0x7f99bd8b4000, 0x5, 0x200000, 0x3)
	        /tmp/go-fuzz-build914098789/gopath/src/github.com/kisielk/og-rek/fuzz.go:50 +0x604
	go-fuzz-dep.Main(0x524df8)
	        /tmp/go-fuzz-build914098789/goroot/src/go-fuzz-dep/main.go:49 +0xad
	main.main()
	        /tmp/go-fuzz-build914098789/gopath/src/github.com/kisielk/og-rek/go.fuzz.main/main.go:10 +0x2d
	exit status 2

I've checked other handlers, like BININT1 and BININT2, and since the
argument is unsigned everywhere there, there is no similar problem.

We needed the previous patch on proper readLine EOF detection, because otherwise
the testcase for P0("I-7\n.") would break:

    --- FAIL: TestDecode/int(-7)/"I-7\n." (0.00s)
        ogorek_test.go:401: no ErrUnexpectedEOF on [:2] truncated stream: v = <nil>  err = &strconv.NumError{Func:"ParseInt", Num:"-", Err:(*errors.errorString)(0xc00000e1b0)}
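
For illustration only, a minimal Python 3 sketch (not the Go code of this patch) of the signed versus unsigned reading of the 4-byte BININT argument. It reproduces the "have" value from the fuzz report above:

```python
import pickle
import struct

payload = struct.pack("<i", -7)                 # 4-byte little-endian two's complement
print(payload)

as_unsigned = struct.unpack("<I", payload)[0]   # the wrong reading
as_signed = struct.unpack("<i", payload)[0]     # the right reading
print(as_unsigned, as_signed)

# BININT is opcode 'J'; Python reads its argument as signed.
assert pickle.loads(b"J" + payload + b".") == -7
```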

bd5a7fd4

decoder: Don't treat \r\n as combined EOL · 57137139

Kirill Smelkov authored Sep 21, 2018

Currently we use bufio.Reader.ReadLine which accepts either \n or \r\n
as line ending. That is however not correct:

- we should not accept e.g. "S'abc'\r\n." pickle, because it is
  invalid:

	In [32]: pickle.loads(b"S'abc'\r\n.")
	---------------------------------------------------------------------------
	UnpicklingError                           Traceback (most recent call last)
	<ipython-input-32-b1da1988bae1> in <module>()
	----> 1 pickle.loads(b"S'abc'\r\n.")

	UnpicklingError: the STRING opcode argument must be quoted

- we should not accept e.g. "L123L\r\n.", because it is also invalid:

	In [33]: pickle.loads(b"L123L\r\n.")
	---------------------------------------------------------------------------
	ValueError                                Traceback (most recent call last)
	<ipython-input-33-7231ec07f5c4> in <module>()
	----> 1 pickle.loads(b"L123L\r\n.")

	ValueError: invalid literal for int() with base 10: '123L\r\n'

- treating \r as part of EOL in e.g. UNICODE pickle would just drop encoded
  information:

	# python
	In [34]: pickle.loads(b"Vabc\r\n.")
	Out[34]: 'abc\r'

  while ogórek currently decodes it as just 'abc' (no trailing \r).

For this reason let's fix Decoder.readLine to treat only \n as EOL.

Besides this fix, we now get another property: previously, when internally
using bufio.Reader.ReadLine we were not able to distinguish two situations:

- a line was abruptly ended without any EOL characters at all,
- a line was properly ended with EOL character.

Now after we switched to internally using bufio.Reader.ReadSlice, we will be
able to properly detect EOF and return that as error. This property will be
needed in the following patch.
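
For illustration only, a minimal Python 3 sketch of the two properties described above (this is not the Go Decoder.readLine; the name `read_line` is hypothetical): only \n terminates a line, and a line without EOL is reported as an error instead of being returned.

```python
import io
import pickle

def read_line(stream) -> bytes:
    """Read one line terminated by "\\n" only; fail if the EOL is missing."""
    line = stream.readline()          # binary streams split on b"\n" only
    if not line.endswith(b"\n"):
        raise EOFError("line ended without EOL")
    return line[:-1]                  # strip just the "\n", keep any "\r"

print(read_line(io.BytesIO(b"abc\r\n.")))

# Same behaviour as the Python session above: the \r is part of the data.
assert pickle.loads(b"Vabc\r\n.") == "abc\r"
```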

57137139

fuzz: Automatically export all tests pickles into fuzz/corpus · 9d1344ba

Kirill Smelkov authored Sep 21, 2018

This way whatever/whenever we add a tricky test pickle into main tests
table, it should automatically also be present as a starting point in
the fuzz corpus. This should hopefully improve fuzzing coverage.

9d1344ba

decoder: More mark exposing fixes · 7aeda71a

Kirill Smelkov authored Sep 21, 2018

Continuing 5dbc8a1b (decoder: Don't allow mark to be returned as pickle
result) I discovered that the mark object can still be exposed to the user,
though not directly. For example the following pickle:

	"(\x85." // MARK + TUPLE1

was creating Tuple{mark} and returning it just ok to the user.

As marker must be used only internally it is invalid to do so. Python
also forbids this:

        In [3]: s = b"(\x85."

        In [4]: dis(s)
            0: (    MARK
            1: \x85     TUPLE1
            2: .        STOP
        highest protocol among opcodes = 2

        In [5]: pickle.loads(s)
        ---------------------------------------------------------------------------
        UnpicklingError                           Traceback (most recent call last)
        <ipython-input-5-764e4625bc41> in <module>()
        ----> 1 pickle.loads(s)

        UnpicklingError: unexpected MARK found

So let's close all (hopefully) holes where mark object could be returned to
user in a similar way.

7aeda71a

20 Sep, 2018 5 commits

fuzz/corpus: Update · b7c2a34e

Kirill Smelkov authored Sep 20, 2018

New test vectors caught by decode·encode idempotency fuzzing.

b7c2a34e

decoder: Don't allow mark to be returned as pickle result · 5dbc8a1b

Kirill Smelkov authored Sep 20, 2018

Both Python2 and Python3 consider a pickle invalid if it tries to return
MARK as the result, e.g.:

        In [2]: pickle.loads(b"(.")
        ---------------------------------------------------------------------------
        UnpicklingError                           Traceback (most recent call last)
        <ipython-input-2-0c142c82b126> in <module>()
        ----> 1 pickle.loads(b"(.")

        UnpicklingError: unexpected MARK found

However until now, despite mark being an unexported ogórek type, we were
allowing it to be returned just fine.

The problem was caught by decode/encode roundtrip fuzz tests, e.g.

	"(Q."

panic: protocol 1: decode·encode != identity:
have: ogórek.Ref{Pid:map[interface {}]interface {}{}}
want: ogórek.Ref{Pid:ogórek.mark{}}

goroutine 1 [running]:
github.com/kisielk/og-rek.Fuzz(0x7fbe6c15f000, 0x3, 0x200000, 0x3)
        /tmp/go-fuzz-build697921479/gopath/src/github.com/kisielk/og-rek/fuzz.go:87 +0x604
go-fuzz-dep.Main(0x524d78)
        /tmp/go-fuzz-build697921479/goroot/src/go-fuzz-dep/main.go:49 +0xad
main.main()
        /tmp/go-fuzz-build697921479/gopath/src/github.com/kisielk/og-rek/go.fuzz.main/main.go:10 +0x2d
exit status 2

5dbc8a1b

fuzz: Hook encoder into the loop · 302c79ea

Kirill Smelkov authored Sep 20, 2018

We can enhance our fuzz-testing coverage by hooking Encoder also into
the loop: if input data is successfully decoded, we have an object that
can be passed back to Encoder to generate a pickle. We can't tell that
that pickle must be the same as original input data, since pickle
machine allows multiple representations of the same data. However we can
assert that when that pickle is decoded back it should be the same as
encoded object.
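
For illustration only, a minimal Python 3 sketch of the same round-trip property, written against Python's pickle rather than ogórek (the name `roundtrip_ok` is hypothetical). Python's unpickler can execute arbitrary callables, so the sketch is meant for fixed inputs like the ones shown, not for untrusted fuzz data:

```python
import pickle

def roundtrip_ok(data: bytes) -> bool:
    """Check that decode(encode(obj)) == obj for obj = decode(data), at every protocol."""
    try:
        obj = pickle.loads(data)
    except Exception:
        return True                   # input does not decode: nothing to compare
    for proto in range(5):            # protocols 0..4
        again = pickle.loads(pickle.dumps(obj, protocol=proto))
        if again != obj:
            return False
    return True

for data in (b"I-7\n.", b"(I1\nI2\ntp0\n.", b"\xff"):
    print(data, roundtrip_ok(data))
```

Note that the comparison is between objects, not between pickle bytes, for the reason given above.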

This catches several problems:

- marker is currently returned as pickle result (see next patch).
- text-based STRING and UNICODE are not properly decoded (no fix yet).
- self-referencing data structures kill Encoder (no fix yet).

302c79ea

fuzz/corups: Update · 230ffba9

Kirill Smelkov authored Sep 20, 2018

Add more files go-fuzz put to corpus while discovering the crash fixed
in previous commit.

230ffba9

decoder: Fix DOS in BINSTRING decoding · dbc3bd9a

Kirill Smelkov authored Sep 20, 2018

The BINSTRING opcode is followed by a 4-byte length of the data and the data itself.

Upon seeing the BINSTRING len header we were preallocating destination
buffer with len as optimization (see 14aaa14f "decoder: Preallocate .buf
capacity when we know (approximate) size of output to it"). However this
way it is easy for a malicious pickle to specify

	BINSTRING biglength (4GB)

and no data, and then the goroutine that processes such input will
allocate 4GB immediately, which in turn might cause out-of-memory DOS.

The other places where we currently grow Decoder.buf all either have 1
or 2 byte length (thus limited to 64K), or the length of the data that
was already read from input stream.
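
For illustration only, a Python 3 sketch of one possible way to avoid trusting the length header (an assumption made for the example, not necessarily what this patch does in Go; the name `read_binstring` and the chunk size are hypothetical): grow the buffer only as data actually arrives.

```python
import io
import struct

def read_binstring(stream, chunk=64 * 1024) -> bytes:
    """Read a BINSTRING argument without preallocating the declared length."""
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("truncated BINSTRING length")
    (n,) = struct.unpack("<i", header)    # 4-byte little-endian signed length
    if n < 0:
        raise ValueError("negative BINSTRING length")
    buf = bytearray()                     # grows only as data actually arrives
    while len(buf) < n:
        part = stream.read(min(chunk, n - len(buf)))
        if not part:
            raise EOFError("BINSTRING data shorter than declared length")
        buf += part
    return bytes(buf)

print(read_binstring(io.BytesIO(b"\x03\x00\x00\x00abc")))
```

With this shape a header that declares a huge length but carries no data fails with an EOF error after reading nothing, instead of allocating the declared size first.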

The problem was found by rerunning fuzz tests:

	"T0000"

fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x514fff, 0x16)
        /tmp/go-fuzz-build515164548/goroot/src/runtime/panic.go:608 +0x72
runtime.sysMap(0xc004000000, 0x34000000, 0x5f83d8)
        /tmp/go-fuzz-build515164548/goroot/src/runtime/mem_linux.go:156 +0xc7
runtime.(*mheap).sysAlloc(0x5dfd40, 0x34000000, 0x43c0e3, 0x7ffc94fe56b8)
        /tmp/go-fuzz-build515164548/goroot/src/runtime/malloc.go:619 +0x1c7
runtime.(*mheap).grow(0x5dfd40, 0x18182, 0x0)
        /tmp/go-fuzz-build515164548/goroot/src/runtime/mheap.go:920 +0x42
runtime.(*mheap).allocSpanLocked(0x5dfd40, 0x18182, 0x5f83e8, 0x203000)
        /tmp/go-fuzz-build515164548/goroot/src/runtime/mheap.go:848 +0x337
runtime.(*mheap).alloc_m(0x5dfd40, 0x18182, 0x7ffc94fe0101, 0x40a87f)
        /tmp/go-fuzz-build515164548/goroot/src/runtime/mheap.go:692 +0x119
runtime.(*mheap).alloc.func1()
        /tmp/go-fuzz-build515164548/goroot/src/runtime/mheap.go:759 +0x4c
runtime.(*mheap).alloc(0x5dfd40, 0x18182, 0x7ffc94010101, 0x4135f5)
        /tmp/go-fuzz-build515164548/goroot/src/runtime/mheap.go:758 +0x8a
runtime.largeAlloc(0x30303030, 0x440101, 0xc00005e240)
        /tmp/go-fuzz-build515164548/goroot/src/runtime/malloc.go:1019 +0x97
runtime.mallocgc.func1()
        /tmp/go-fuzz-build515164548/goroot/src/runtime/malloc.go:914 +0x46
runtime.systemstack(0x44eea9)
        /tmp/go-fuzz-build515164548/goroot/src/runtime/asm_amd64.s:351 +0x66
runtime.mstart()
        /tmp/go-fuzz-build515164548/goroot/src/runtime/proc.go:1229

goroutine 1 [running]:
runtime.systemstack_switch()
        /tmp/go-fuzz-build515164548/goroot/src/runtime/asm_amd64.s:311 fp=0xc000034310 sp=0xc000034308 pc=0x44efa0
runtime.mallocgc(0x30303030, 0x4edc00, 0x1, 0xc0000343e8)
        /tmp/go-fuzz-build515164548/goroot/src/runtime/malloc.go:913 +0x896 fp=0xc0000343b0 sp=0xc000034310 pc=0x40ae26
runtime.makeslice(0x4edc00, 0x30303030, 0x30303030, 0xc000018108, 0x4, 0x4)
        /tmp/go-fuzz-build515164548/goroot/src/runtime/slice.go:70 +0x77 fp=0xc0000343e0 sp=0xc0000343b0 pc=0x43b1d7
bytes.makeSlice(0x30303030, 0x0, 0x0, 0x0)
        /tmp/go-fuzz-build515164548/goroot/src/bytes/buffer.go:231 +0x9d fp=0xc000034420 sp=0xc0000343e0 pc=0x4b204d
bytes.(*Buffer).grow(0xc000084030, 0x30303030, 0x0)
        /tmp/go-fuzz-build515164548/goroot/src/bytes/buffer.go:144 +0x2e4 fp=0xc000034470 sp=0xc000034420 pc=0x4b1604
bytes.(*Buffer).Grow(0xc000084030, 0x30303030)
        /tmp/go-fuzz-build515164548/goroot/src/bytes/buffer.go:163 +0x86 fp=0xc000034498 sp=0xc000034470 pc=0x4b18b6
github.com/kisielk/og-rek.(*Decoder).loadBinString(0xc000084000, 0x203054, 0x0)
        /tmp/go-fuzz-build515164548/gopath/src/github.com/kisielk/og-rek/ogorek.go:646 +0x143 fp=0xc000034520 sp=0xc000034498 pc=0x4d5103
github.com/kisielk/og-rek.(*Decoder).Decode(0xc000084000, 0xc000080000, 0xc000084000, 0xf7eb9fa, 0x586dc)
        /tmp/go-fuzz-build515164548/gopath/src/github.com/kisielk/og-rek/ogorek.go:227 +0xf94 fp=0xc0000346d8 sp=0xc000034520 pc=0x4d09b4
github.com/kisielk/og-rek.Fuzz(0x7fbccd400000, 0x5, 0x200000, 0xc000034748)
        /tmp/go-fuzz-build515164548/gopath/src/github.com/kisielk/og-rek/fuzz.go:12 +0xb0 fp=0xc000034710 sp=0xc0000346d8 pc=0x4cf640
go-fuzz-dep.Main(0x519cf0)
        /tmp/go-fuzz-build515164548/goroot/src/go-fuzz-dep/main.go:49 +0xad fp=0xc000034780 sp=0xc000034710 pc=0x4642fd
main.main()
        /tmp/go-fuzz-build515164548/gopath/src/github.com/kisielk/og-rek/go.fuzz.main/main.go:10 +0x2d fp=0xc000034798 sp=0xc000034780 pc=0x4db1ad
runtime.main()
        /tmp/go-fuzz-build515164548/goroot/src/runtime/proc.go:201 +0x207 fp=0xc0000347e0 sp=0xc000034798 pc=0x428ec7
runtime.goexit()
        /tmp/go-fuzz-build515164548/goroot/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc0000347e8 sp=0xc0000347e0 pc=0x450f01
exit status 2

dbc3bd9a

19 Sep, 2018 11 commits

encoder: Fix string wrt protocol version · e7d96969

Kirill Smelkov authored Sep 19, 2018

- we can use BINSTRING* only if protocol >= 1;
- at protocol 0 we thus have to use text STRING;
- if protocol >= 3 we have to emit the string as unicode pickle object
  the same way as Python3 does. If we don't, Python3 generally won't be
  able to load our pickle:

	In [1]: s = b'U\x06\xd0\xbc\xd0\xb8\xd1\x80q\x00.'

  	In [2]: from pickletools import dis

  	In [3]: dis(s)
  	    0: U    SHORT_BINSTRING 'Ð¼Ð¸Ñ\x80'
  	    8: q    BINPUT     0
  	   10: .    STOP
  	highest protocol among opcodes = 1

  	In [4]: import pickle

  	In [5]: pickle.loads(s)
  	---------------------------------------------------------------------------
  	UnicodeDecodeError                        Traceback (most recent call last)
  	<ipython-input-5-764e4625bc41> in <module>()
  	----> 1 pickle.loads(s)

  	UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)

We already decode unicode pickle objects into string, this way
decode(encode(string)) remains always idempotent.
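
For illustration only, a minimal Python 3 sketch (not part of this patch) of both sides of the problem: the SHORT_BINSTRING pickle from the session above loads only when the caller says how to treat the bytes, while Python 3 itself emits text at protocol 3 with a unicode opcode.

```python
import pickle
import pickletools

s = b"U\x06\xd0\xbc\xd0\xb8\xd1\x80q\x00."   # the SHORT_BINSTRING pickle from above

# Loading works only with an explicit encoding.
print(pickle.loads(s, encoding="bytes"))
print(ascii(pickle.loads(s, encoding="utf-8")))

# What Python 3 itself emits for text at protocol 3.
names = [op.name for op, arg, pos in pickletools.genops(pickle.dumps("мир", protocol=3))]
print(names)
```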

e7d96969

encoder: Fix Ref wrt protocol version · 98fb1987

Kirill Smelkov authored Sep 19, 2018

- we can use BINPERSID only if protocol >= 1
- we can use text PERSID only if protocol < 4 and Ref string does not
  contain \n - otherwise encoding has to fail.

Tests for Ref encoding at protocols 3 & 4 will follow after string
encoding is fixed.

98fb1987

encoder: Fix list wrt protocol version · 33d1926f

Kirill Smelkov authored Sep 19, 2018

- we can use EMPTY_LIST only if protocol >= 1

Also: since we are now using EMPTY_LIST only optionally, it is logical
to switch to

	MARK + ... + LIST

instead of

	EMPTY_LIST (or MARK + LIST @proto=0) + MARK + ... + APPENDS

which is at least 1 byte longer.

For the reference - APPENDS is also from protocol 1, while LIST is from
protocol 0.

33d1926f

encoder: Fix tuple wrt protocol version · 24695efa

Kirill Smelkov authored Sep 19, 2018

- we can use EMPTY_TUPLE only if protocol >= 1
- also: if protocol >= 2 we can now use TUPLE{1,2,3} opcodes.
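
For illustration only, a small Python 3 sketch (not part of this patch) showing how Python's own pickler picks tuple opcodes per protocol. The helper name `opcode_names` is hypothetical:

```python
import pickle
import pickletools

def opcode_names(obj, protocol):
    """Opcode names Python emits for obj at the given protocol."""
    data = pickle.dumps(obj, protocol=protocol)
    return [op.name for op, arg, pos in pickletools.genops(data)]

for proto in (0, 1, 2):
    print(proto, opcode_names((), proto), opcode_names((1, 2), proto))
```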

24695efa

encoder: Add tests todo to use LONG1 and LONG4 · 08fb378e

Kirill Smelkov authored Sep 19, 2018

Corresponding TODO in the code was added in 4fd6be93 (encoder: Adjust it
so that decode(encode(v)) == v)

08fb378e

encoder: Fix float wrt protocol version · 606e9f5a

Kirill Smelkov authored Sep 19, 2018

We can use BINFLOAT opcode only starting from protocol >= 1.
At protocol 0 we must use ASCII FLOAT.

606e9f5a

encoder: Fix int wrt protocol version · 9053359b

Kirill Smelkov authored Sep 19, 2018

We can use BININT* opcodes only starting from protocol >= 1.
At protocol 0 we must use ASCII INT.

9053359b

encoder: Fix bool wrt protocol version · 6e6a8aa3

Kirill Smelkov authored Sep 19, 2018

Starting from protocol 2 1-byte NEWTRUE/NEWFALSE opcodes are more
efficient compared to 5-bytes e.g. "I01\n.".

It is not only about efficiency, as protocol 4 _forbids_ use of variable
length ASCII-only opcodes - whose data length is determined by doing
forward scan for '\n'.

Without encodeBool changes and only with the tests added it would fail
this way:

    --- FAIL: TestEncode/True/proto=2 (0.00s)
        ogorek_test.go:383: encode:
            have: "\x80\x02I01\n."
            want: "\x80\x02\x88."
    --- FAIL: TestEncode/True/proto=3 (0.00s)
        ogorek_test.go:383: encode:
            have: "\x80\x03I01\n."
            want: "\x80\x03\x88."
    --- FAIL: TestEncode/True/proto=4 (0.00s)
        ogorek_test.go:383: encode:
            have: "\x80\x04I01\n."
            want: "\x80\x04\x88."
    --- FAIL: TestEncode/False/proto=2 (0.00s)
        ogorek_test.go:383: encode:
            have: "\x80\x02I00\n."
            want: "\x80\x02\x89."
    --- FAIL: TestEncode/False/proto=3 (0.00s)
        ogorek_test.go:383: encode:
            have: "\x80\x03I00\n."
            want: "\x80\x03\x89."
    --- FAIL: TestEncode/False/proto=4 (0.00s)
        ogorek_test.go:383: encode:
            have: "\x80\x04I00\n."
            want: "\x80\x04\x89."
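
For illustration only, a minimal Python 3 sketch (not part of this patch) printing what Python's own pickler emits for booleans at each protocol. The protocol 2 output matches the "want" values in the test log above:

```python
import pickle

for proto in range(5):
    print(proto, pickle.dumps(True, protocol=proto), pickle.dumps(False, protocol=proto))
```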

6e6a8aa3

encoder: Allow to specify pickle protocol version · 06e06939

Kirill Smelkov authored Sep 19, 2018

There are many pickle protocol versions - 0 to 4. Python2 for example
understands only versions 0 - 2. However we currently unconditionally
emit opcodes from higher versions, for example STACK_GLOBAL - from
version 4 - when encoding a Class, which leads to inability to decode
pickles generated by ogórek on Python2.

Similarly protocol 0 states that only text opcodes should be used,
however we currently unconditionally emit e.g. BININT (from protocol 1)
when encoding integers.

Changing to always using protocol 0 opcodes would not be good, since many
opcodes for efficiently encoding either integers, booleans, unicode etc
are available only in protocol versions 2 and 4.

For this reason, similarly to Python[1], let's allow users to specify
desired pickle protocol when creating Encoder with config. For backward
compatibility and common sense the protocol version that plain
NewEncoder selects is 2.

This commit adds only above-described user interface and testing
infrastructure for verifying what was the result of encoding an object
at particular protocol version.

For now only a few of the pickle test vectors match what the encoder
should generate or currently generates. Thus in the next patches we'll be
step-by-step fixing encoder on this topic.

[1] https://docs.python.org/3/library/pickle.html#pickle.dump
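
For illustration only, a minimal Python 3 sketch of the Python-side interface referenced in [1] (this is not the ogórek API): the same integer is encoded differently depending on the `protocol` argument, and the PROTO header byte (\x80) appears only from protocol 2 on.

```python
import pickle

for proto in range(5):
    data = pickle.dumps(5, protocol=proto)
    # Show the pickle and whether it starts with the PROTO opcode.
    print(proto, data, data.startswith(b"\x80"))
```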

06e06939

tests: Allow to specify several pickles for one test case · 93075d82

Kirill Smelkov authored Sep 19, 2018

For now, decoding all those pickles is tested to give the same
object, as in e.g.

	"(I1\nI2\ntp0\n.", // MARK + TUPLE + INT

and

	"I1\nI2\n\x86."),  // TUPLE2 + INT

But having a way to specify several pickles to a test case will become
even more handy when later adding precise tests for Encoder - there we
will need to assert that at such and such protocol encoding gives such
and such pickles. And there are 5 protocol versions to test...

93075d82

tests: Merge Encode and Decode tests data · e8189e5f

Kirill Smelkov authored Sep 19, 2018

Previously there were two separate tables - for decode and encode tests.

The table for encode tests was very small. TestDecode, which was
operating on data from decode table, was also performing checks that
decode(encode(object)) is idempotent - the work which is already done by
TestEncode. However the coverage of input objects from decode table was
not a strict superset of encode table objects.

For the reasons above let's stop this divergence. Let's have a common
table that defines tests where for every test case there can be:

	- an "in" object,
	- a pickle, and
	- an "out" object

where

	1. pickle must decode to "out" object, and
	2. encoding "in" object must give some pickle that decodes to "out" object.

	NOTE: In the usual case "in" object == "out" object and they can only
	differ if "in" object contains a Go struct.

This will allow us to cover all existing decode and encode tests logic.

However the coverage of each logic is now higher - for example Encoder
tests are now run on every object from main table, not only for 3 cases
like it was before.

e8189e5f