    html: rewrite the tokenizer to be more consistent. · a49b8b98
    Nigel Tao authored
    Previously, the tokenizer made two passes per token. The first pass
    established the token boundary. The second pass picked out the tag name
    and attributes inside that boundary. This was problematic when the two
    passes disagreed. For example, "<p id=can't><p id=won't>" caused an
    infinite loop because the first pass skipped everything inside the
    single quotes, and recognized only one token, but the second pass never
    got past the first '>'.
    
    This change rewrites the tokenizer to use one pass, accumulating the
    boundary points of token text, tag names, attribute keys and attribute
    values as it looks for the token endpoint.
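
The one-pass idea can be sketched roughly as follows. This is a simplified illustration, not the code from this CL: the names span, rawTag and scanStartTag are hypothetical, only start tags are handled, and whitespace around '=' is not treated specially. The point is that the same loop that finds the closing '>' also records where the name, keys and values sit, so there is no second pass that could disagree about the boundary.

```go
package main

import "fmt"

// span marks a half-open byte range [start, end) in the input buffer.
type span struct{ start, end int }

// rawTag (hypothetical) holds boundary points only; no strings are built while scanning.
type rawTag struct {
	name  span
	attrs [][2]span // each entry is {key, value}
	end   int       // index just past the closing '>'
}

func isSpace(c byte) bool {
	return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f'
}

// scanStartTag scans one start tag at buf[i] in a single pass.
// It reports false if buf[i] is not '<' or the tag is unterminated.
func scanStartTag(buf []byte, i int) (rawTag, bool) {
	var t rawTag
	if i >= len(buf) || buf[i] != '<' {
		return t, false
	}
	i++
	t.name.start = i
	for i < len(buf) && !isSpace(buf[i]) && buf[i] != '>' {
		i++
	}
	t.name.end = i
	for {
		for i < len(buf) && isSpace(buf[i]) {
			i++
		}
		if i >= len(buf) {
			return t, false
		}
		if buf[i] == '>' {
			t.end = i + 1
			return t, true
		}
		key := span{start: i}
		for i < len(buf) && !isSpace(buf[i]) && buf[i] != '=' && buf[i] != '>' {
			i++
		}
		key.end = i
		val := span{i, i} // empty value unless '=' follows
		if i < len(buf) && buf[i] == '=' {
			i++
			if i < len(buf) && (buf[i] == '"' || buf[i] == '\'') {
				// Quoted value: runs to the matching quote.
				q := buf[i]
				i++
				val.start = i
				for i < len(buf) && buf[i] != q {
					i++
				}
				if i >= len(buf) {
					return t, false
				}
				val.end = i
				i++
			} else {
				// Unquoted value: a quote in the middle is ordinary data.
				val.start = i
				for i < len(buf) && !isSpace(buf[i]) && buf[i] != '>' {
					i++
				}
				val.end = i
			}
		}
		t.attrs = append(t.attrs, [2]span{key, val})
	}
}

func main() {
	src := []byte("<p id=can't><p id=won't>")
	for i := 0; i < len(src); {
		t, ok := scanStartTag(src, i)
		if !ok {
			break
		}
		fmt.Printf("tag %q", src[t.name.start:t.name.end])
		for _, a := range t.attrs {
			fmt.Printf(" %q=%q", src[a[0].start:a[0].end], src[a[1].start:a[1].end])
		}
		fmt.Println()
		i = t.end
	}
}
```

In this sketch, the input from the example above yields two tags, because the single quote inside an unquoted value is treated as plain data rather than as the start of a quoted region.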
    
    It should still be reasonably efficient: text, names, keys and values
    are not lower-cased or unescaped (and converted from []byte to string)
    until asked for.
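
A minimal sketch of that deferred conversion, under stated assumptions: lazyAttr is a hypothetical type, the byte offsets are filled in by hand instead of by a scanner, and the standard library's html.UnescapeString stands in for whatever unescaping the tokenizer itself performs. The raw bytes are left untouched until an accessor is called, so a caller that skips a token pays for no allocation.

```go
package main

import (
	"bytes"
	"fmt"
	"html"
)

// lazyAttr (hypothetical) keeps raw byte ranges into the shared buffer
// and defers lower-casing, unescaping and string conversion.
type lazyAttr struct {
	buf      []byte
	key, val [2]int // half-open [start, end) offsets into buf
}

// Key lower-cases and copies the key only when asked.
func (a lazyAttr) Key() string {
	return string(bytes.ToLower(a.buf[a.key[0]:a.key[1]]))
}

// Val unescapes and copies the value only when asked.
func (a lazyAttr) Val() string {
	return html.UnescapeString(string(a.buf[a.val[0]:a.val[1]]))
}

func main() {
	buf := []byte(`<a HREF="?a=1&amp;b=2">`)
	// Offsets written by hand for illustration; a scanner would record them.
	a := lazyAttr{buf: buf, key: [2]int{3, 7}, val: [2]int{9, 21}}
	fmt.Println(a.Key(), a.Val())
}
```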
    
    One of the token_test test cases was fixed to be consistent with
    html5lib. Three more test cases were temporarily disabled, and will be
    re-enabled in a follow-up CL. All the parse_test test cases pass.
    
    R=andybalholm, gri
    CC=golang-dev
    https://golang.org/cl/5244061