Parsing in Sky is a strict pipeline consisting of four stages:
decoding, which converts incoming bytes into Unicode characters using UTF-8
normalising, which converts certain sequences of characters
tokenising, which converts these characters into tokens
tree construction, which converts these tokens into a tree of nodes
Later stages cannot affect earlier stages.
When a sequence of bytes is to be parsed, there is always a defined parsing context, which is either “application” or “module”.
To decode a sequence of bytes for parsing, the UTF-8 decoder must be used to transform the bytes into a sequence of characters.
This sequence must then be passed to the normalisation stage.
To normalise a sequence of characters, apply the following rules:
Any U+000D character followed by a U+000A character must be removed.
Any U+000D character not followed by a U+000A character must be converted to a U+000A character.
Any U+0000 character must be converted to a U+FFFD character.
The converted sequence of characters must then be passed to the tokenisation stage.
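The three normalisation rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `normalise` is hypothetical, and the input is assumed to be the already decoded character sequence.

```python
def normalise(characters: str) -> str:
    """Apply the three normalisation rules to a decoded character sequence.

    `normalise` is a hypothetical helper name used for illustration only.
    """
    # Rule 1: a U+000D that is followed by U+000A is removed (CR LF becomes LF).
    result = characters.replace("\r\n", "\n")
    # Rule 2: every U+000D that remains was not followed by U+000A,
    # so it is converted to U+000A.
    result = result.replace("\r", "\n")
    # Rule 3: U+0000 is converted to U+FFFD.
    return result.replace("\x00", "\ufffd")
```

For example, `normalise("a\r\nb\rc")` yields the three characters `a`, `b`, `c` separated by single U+000A characters.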
To tokenise a sequence of characters, a state machine is used.
The state machine must begin in the signature state.
Each character in turn must be processed according to the rules of the state at the time the character is processed. A character is processed once it has been consumed. This produces a stream of tokens; the tokens must be passed to the tree construction stage.
When the last character is consumed, the tokeniser ends.
When the user agent is to expect a string, it must run these steps:
Let expectation be the string to expect. When this string is indexed, the first character has index 0.
Assertion: The first character in expectation is the current character, and expectation has more than one character.
Consume the current character.
Let index be 1.
Let success and failure be the states specified for success and failure respectively.
Switch to the expect a string state.
If the current character is...
‘#’: If the parsing context is not “application”, switch to the failed signature state. Otherwise, expect the string “#!mojo mojo:sky”, with after signature as the success state and failed signature as the failure state.
‘S’: If the parsing context is not “module”, switch to the failed signature state. Otherwise, expect the string “SKY MODULE”, with after signature as the success state and failed signature as the failure state.
Anything else: Switch to the failed signature state.
If the current character is not the same as the character at position index in expectation, then switch to the failure state.
Otherwise, consume the character, and increase index. If index is now equal to the length of expectation, then switch to the success state.
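The steps for expecting a string, together with the expect a string state, could be sketched as follows. The function name `expect_string`, the `position` parameter, and the returned pair are assumptions made for this illustration. The text does not say what happens when the input ends before the whole string is seen, so this sketch treats that case as a failure.

```python
from typing import Tuple


def expect_string(characters: str, position: int, expectation: str) -> Tuple[bool, int]:
    """Return (matched, new_position) after trying to match `expectation`.

    `position` is the index of the current character. Hypothetical helper.
    """
    # Assertion from the steps: the first character of the expectation is the
    # current character, and the expectation has more than one character.
    assert len(expectation) > 1 and characters[position] == expectation[0]
    position += 1  # consume the current character
    index = 1
    while position < len(characters):
        if characters[position] != expectation[index]:
            return False, position  # switch to the failure state
        position += 1  # consume the character
        index += 1
        if index == len(expectation):
            return True, position  # switch to the success state
    # Assumption: running out of input is treated as a failure here.
    return False, position
```

With the application signature, `expect_string("#!mojo mojo:sky\n", 0, "#!mojo mojo:sky")` reports a match and leaves the position at the U+000A that follows the signature.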
If the current character is...
Stop parsing. No tokens are emitted. The file is not a sky file.
If the current character is...
If the current character is...
‘<’: Consume the character and switch to the tag open state.
‘&’: Consume the character and switch to the character reference state, with the return state set to the data state, the extra terminating character unset (or set to U+0000, which has the same effect), and the emitting operation being to emit a character token for the given character.
Anything else: Emit the current input character as a character token. Consume the character. Stay in this state.
If the current character is...
‘<’: Consume the character and switch to the script raw data: close 1 state.
Anything else: Emit the current input character as a character token. Consume the character. Stay in this state.
If the current character is...
‘/’: Consume the character and switch to the script raw data: close 2 state.
Anything else: Emit ‘<’ character tokens. Consume the character. Switch to the script raw data state.
If the current character is...
‘s’: Consume the character and switch to the script raw data: close 3 state.
Anything else: Emit ‘</’ character tokens. Consume the character. Switch to the script raw data state.
If the current character is...
‘c’: Consume the character and switch to the script raw data: close 4 state.
Anything else: Emit ‘</s’ character tokens. Consume the character. Switch to the script raw data state.
If the current character is...
‘r’: Consume the character and switch to the script raw data: close 5 state.
Anything else: Emit ‘</sc’ character tokens. Consume the character. Switch to the script raw data state.
If the current character is...
‘i’: Consume the character and switch to the script raw data: close 6 state.
Anything else: Emit ‘</scr’ character tokens. Consume the character. Switch to the script raw data state.
If the current character is...
‘p’: Consume the character and switch to the script raw data: close 7 state.
Anything else: Emit ‘</scri’ character tokens. Consume the character. Switch to the script raw data state.
If the current character is...
‘t’: Consume the character and switch to the script raw data: close 8 state.
Anything else: Emit ‘</scrip’ character tokens. Consume the character. Switch to the script raw data state.
If the current character is...
U+0020, U+000A, ‘/’, ‘>’: Create an end tag token, and let its tag name be the string ‘script’. Switch to the before attribute name state without consuming the character.
Anything else: Emit ‘</script’ character tokens. Consume the character. Switch to the script raw data state.
If the current character is...
‘<’: Consume the character and switch to the style raw data: close 1 state.
Anything else: Emit the current input character as a character token. Consume the character. Stay in this state.
If the current character is...
‘/’: Consume the character and switch to the style raw data: close 2 state.
Anything else: Emit ‘<’ character tokens. Consume the character. Switch to the style raw data state.
If the current character is...
‘s’: Consume the character and switch to the style raw data: close 3 state.
Anything else: Emit ‘</’ character tokens. Consume the character. Switch to the style raw data state.
If the current character is...
‘t’: Consume the character and switch to the style raw data: close 4 state.
Anything else: Emit ‘</s’ character tokens. Consume the character. Switch to the style raw data state.
If the current character is...
‘y’: Consume the character and switch to the style raw data: close 5 state.
Anything else: Emit ‘</st’ character tokens. Consume the character. Switch to the style raw data state.
If the current character is...
‘l’: Consume the character and switch to the style raw data: close 6 state.
Anything else: Emit ‘</sty’ character tokens. Consume the character. Switch to the style raw data state.
If the current character is...
‘e’: Consume the character and switch to the style raw data: close 7 state.
Anything else: Emit ‘</styl’ character tokens. Consume the character. Switch to the style raw data state.
If the current character is...
U+0020, U+000A, ‘/’, ‘>’: Create an end tag token, and let its tag name be the string ‘style’. Switch to the before attribute name state without consuming the character.
Anything else: Emit ‘</style’ character tokens. Consume the character. Switch to the style raw data state.
If the current character is...
‘!’: Consume the character and switch to the comment start 1 state.
‘/’: Consume the character and switch to the close tag state.
‘>’: Emit character tokens for ‘<>’. Consume the current character. Switch to the data state.
‘0’..‘9’, ‘a’..‘z’, ‘A’..‘Z’, ‘-’, ‘_’, ‘.’: Create a start tag token, let its tag name be the current character, consume the current character and switch to the tag name state.
Anything else: Emit the character token for ‘<’. Switch to the data state without consuming the current character.
If the current character is...
‘>’: Emit character tokens for ‘</>’. Consume the current character. Switch to the data state.
‘0’..‘9’, ‘a’..‘z’, ‘A’..‘Z’, ‘-’, ‘_’, ‘.’: Create an end tag token, let its tag name be the current character, consume the current character and switch to the tag name state.
Anything else: Emit the character tokens for ‘</’. Switch to the data state without consuming the current character.
If the current character is...
U+0020, U+000A: Consume the current character. Switch to the before attribute name state.
‘/’: Consume the current character. Switch to the void tag state.
‘>’: Consume the current character. Switch to the after tag state.
Anything else: Append the current character to the tag name, and consume the current character. Stay in this state.
If the current character is...
‘>’: Consume the current character. Switch to the after tag state.
Anything else: Switch to the before attribute name state without consuming the current character.
If the current character is...
U+0020, U+000A: Consume the current character. Stay in this state.
‘/’: Consume the current character. Switch to the void tag state.
‘>’: Consume the current character. Switch to the after tag state.
Anything else: Create a new attribute in the tag token, and set its name to the current character. Consume the current character. Switch to the attribute name state.
If the current character is...
U+0020, U+000A: Consume the current character. Switch to the after attribute name state.
‘/’: Consume the current character. Switch to the void tag state.
‘=’: Consume the current character. Switch to the before attribute value state.
‘>’: Consume the current character. Switch to the after tag state.
Anything else: Append the current character to the most recently added attribute's name, and consume the current character. Stay in this state.
If the current character is...
U+0020, U+000A: Consume the current character. Stay in this state.
‘/’: Consume the current character. Switch to the void tag state.
‘=’: Consume the current character. Switch to the before attribute value state.
‘>’: Consume the current character. Switch to the after tag state.
Anything else: Create a new attribute in the tag token, and set its name to the current character. Consume the current character. Switch to the attribute name state.
If the current character is...
U+0020, U+000A: Consume the current character. Stay in this state.
‘>’: Consume the current character. Switch to the after tag state.
‘'’: Consume the current character. Switch to the single-quoted attribute value state.
‘"’: Consume the current character. Switch to the double-quoted attribute value state.
Anything else: Set the value of the most recently added attribute to the current character. Consume the current character. Switch to the unquoted attribute value state.
If the current character is...
‘'’: Consume the current character. Switch to the before attribute name state.
‘&’: Consume the character and switch to the character reference state, with the return state set to the single-quoted attribute value state, the extra terminating character set to ‘'’, and the emitting operation being to append the given character to the value of the most recently added attribute.
Anything else: Append the current character to the value of the most recently added attribute. Consume the current character. Stay in this state.
If the current character is...
‘"’: Consume the current character. Switch to the before attribute name state.
‘&’: Consume the character and switch to the character reference state, with the return state set to the double-quoted attribute value state, the extra terminating character set to ‘"’, and the emitting operation being to append the given character to the value of the most recently added attribute.
Anything else: Append the current character to the value of the most recently added attribute. Consume the current character. Stay in this state.
If the current character is...
U+0020, U+000A: Consume the current character. Switch to the before attribute name state.
‘>’: Consume the current character. Switch to the after tag state.
‘&’: Consume the character and switch to the character reference state, with the return state set to the unquoted attribute value state, the extra terminating character unset (or set to U+0000, which has the same effect), and the emitting operation being to append the given character to the value of the most recently added attribute.
Anything else: Append the current character to the value of the most recently added attribute. Consume the current character. Stay in this state.
Emit the tag token.
If the tag token was a start tag token and the tag name was ‘script’, then switch to the script raw data state.
If the tag token was a start tag token and the tag name was ‘style’, then switch to the style raw data state.
Otherwise, switch to the data state.
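The choice of the next state after a tag token is emitted can be written as a small function. This is a sketch only: the name `state_after_tag` and the use of plain strings for state names are assumptions made for illustration.

```python
def state_after_tag(is_start_tag: bool, tag_name: str) -> str:
    """Pick the tokeniser state that follows an emitted tag token.

    Hypothetical helper; states are represented by their names as strings.
    """
    if is_start_tag and tag_name == "script":
        return "script raw data"
    if is_start_tag and tag_name == "style":
        return "style raw data"
    # End tags, and start tags with any other name, return to the data state.
    return "data"
```

Note that an end tag named ‘script’ or ‘style’ leads back to the data state, since only start tags open a raw data section.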
If the current character is...
‘-’: Consume the character and switch to the comment start 2 state.
‘>’: Emit character tokens for ‘<!>’. Consume the current character. Switch to the data state.
If the current character is...
‘-’: Consume the character and switch to the comment state.
‘>’: Emit character tokens for ‘<!->’. Consume the current character. Switch to the data state.
If the current character is...
‘-’: Consume the character and switch to the comment end 1 state.
Anything else: Consume the character and switch to the comment state.
If the current character is...
‘-’: Consume the character, switch to the comment end 2 state.
Anything else: Consume the character, and switch to the comment state.
If the current character is...
‘>’: Consume the character and switch to the data state.
‘-’: Consume the character, but stay in this state.
Anything else: Consume the character, and switch to the comment state.
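The comment, comment end 1, and comment end 2 states can be sketched as a small loop that skips to the end of a comment. The function name `skip_comment` and its `position` parameter are hypothetical; the sketch assumes `position` is the index of the first character after ‘<!--’. As in the rules above, no tokens are emitted for comment content.

```python
def skip_comment(characters: str, position: int) -> int:
    """Return the index just past the closing '-->' of a comment.

    Returns len(characters) if the input ends first. Hypothetical helper.
    """
    state = "comment"
    while position < len(characters):
        current = characters[position]
        position += 1  # every rule in these three states consumes the character
        if state == "comment":
            state = "comment end 1" if current == "-" else "comment"
        elif state == "comment end 1":
            state = "comment end 2" if current == "-" else "comment"
        else:  # comment end 2
            if current == ">":
                return position  # back to the data state
            # A further '-' stays in comment end 2; anything else reopens the comment.
            state = "comment end 2" if current == "-" else "comment"
    return position
```

One consequence visible in the sketch is that ‘<!--->’ does not close the comment, because the single ‘-’ after the opener only reaches the comment end 1 state.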
Let raw value be the string ‘&’.
Append the current character to raw value.
If the current character is...
‘#’: Consume the character, and switch to the numeric character reference state.
‘l’: Consume the character and switch to the named character reference L state.
‘a’: Consume the character and switch to the named character reference A state.
‘g’: Consume the character and switch to the named character reference G state.
‘q’: Consume the character and switch to the named character reference Q state.
Any other character in the range ‘0’..‘9’, ‘a’..‘f’, ‘A’..‘F’: Consume the character and switch to the bad named character reference state.
Anything else: Run the emitting operation for all but the last character in raw value, and switch to the data state without consuming the current character.
Append the current character to raw value.
If the current character is...
‘x’, ‘X’: Let value be zero, consume the character, and switch to the hexadecimal numeric character reference state.
‘0’..‘9’: Let value be the numeric value of the current character interpreted as a decimal digit, consume the character, and switch to the decimal numeric character reference state.
Anything else: Run the emitting operation for all but the last character in raw value, and switch to the data state without consuming the current character.
Append the current character to raw value.
If the current character is...
‘0’..‘9’, ‘a’..‘f’, ‘A’..‘F’: Let value be sixteen times value plus the numeric value of the current character interpreted as a hexadecimal digit. Consume the character and stay in this state.
‘;’: Consume the character. If value is between 0x0001 and 0x10FFFF inclusive, but is not between 0xD800 and 0xDFFF inclusive, run the emitting operation with a Unicode character having the scalar value value; otherwise, run the emitting operation with the character U+FFFD. Then, in either case, switch to the return state.
Anything else: Run the emitting operation for all but the last character in raw value, and switch to the data state without consuming the current character.
Append the current character to raw value.
If the current character is...
‘0’..‘9’: Let value be ten times value plus the numeric value of the current character interpreted as a decimal digit. Consume the character and stay in this state.
‘;’: Consume the character. If value is between 0x0001 and 0x10FFFF inclusive, but is not between 0xD800 and 0xDFFF inclusive, run the emitting operation with a Unicode character having the scalar value value; otherwise, run the emitting operation with the character U+FFFD. Then, in either case, switch to the return state.
Anything else: Run the emitting operation for all but the last character in raw value, and switch to the data state without consuming the current character.
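The accumulation of value and the range check applied at ‘;’ in the two numeric character reference states can be illustrated as below. The helper names `accumulate_value` and `numeric_reference_character` are assumptions made for this sketch; the digits are assumed to have already been accepted by the state rules.

```python
def accumulate_value(digits: str, base: int) -> int:
    """Fold digits into a value: value = base * value + digit. Hypothetical helper."""
    value = 0
    for digit in digits:
        # int(digit, 16) handles both decimal and hexadecimal digit characters.
        value = base * value + int(digit, 16)
    return value


def numeric_reference_character(value: int) -> str:
    """Map an accumulated value to the character passed to the emitting operation."""
    in_range = 0x0001 <= value <= 0x10FFFF
    is_surrogate = 0xD800 <= value <= 0xDFFF
    if in_range and not is_surrogate:
        return chr(value)
    # Zero, surrogates, and values above U+10FFFF become U+FFFD.
    return "\ufffd"
```

For instance, the reference ‘&#x41;’ accumulates the value 0x41 and produces ‘A’, while ‘&#0;’ produces U+FFFD.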
Append the current character to raw value.
If the current character is...
‘t’: Let character be ‘<’, consume the current character, and switch to the after named character reference state.
Anything else: Switch to the bad named character reference state without consuming the character.
Append the current character to raw value.
If the current character is...
‘p’: Consume the current character and switch to the named character reference AP state.
‘m’: Consume the current character and switch to the named character reference AM state.
Anything else: Switch to the bad named character reference state without consuming the character.
Append the current character to raw value.
If the current character is...
‘p’: Let character be ‘&’, consume the current character, and switch to the after named character reference state.
Anything else: Switch to the bad named character reference state without consuming the character.
Append the current character to raw value.
If the current character is...
‘o’: Consume the current character and switch to the named character reference APO state.
Anything else: Switch to the bad named character reference state without consuming the character.
Append the current character to raw value.
If the current character is...
‘s’: Let character be ‘'’, consume the current character, and switch to the after named character reference state.
Anything else: Switch to the bad named character reference state without consuming the character.
Append the current character to raw value.
If the current character is...
‘t’: Let character be ‘>’, consume the current character, and switch to the after named character reference state.
Anything else: Switch to the bad named character reference state without consuming the character.
Append the current character to raw value.
If the current character is...
‘u’: Consume the current character and switch to the named character reference QU state.
Anything else: Switch to the bad named character reference state without consuming the character.
Append the current character to raw value.
If the current character is...
‘o’: Consume the current character and switch to the named character reference QUO state.
Anything else: Switch to the bad named character reference state without consuming the character.
Append the current character to raw value.
If the current character is...
‘t’: Let character be ‘"’, consume the current character, and switch to the after named character reference state.
Anything else: Switch to the bad named character reference state without consuming the character.
Append the current character to raw value.
If the current character is...
‘;’: Consume the character. Run the emitting operation with the character character. Switch to the return state.
The extra terminating character: Run the emitting operation with the character U+FFFD. Switch to the return state without consuming the current character.
Anything else: Switch to the bad named character reference state without consuming the current character.
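Taken together, the named character reference states recognise exactly five names. A table-driven sketch of that mapping is shown below; the names `NAMED_REFERENCES` and `named_reference_character` are hypothetical, and the sketch covers only the successful case in which a recognised name is followed by ‘;’.

```python
from typing import Optional

# The five names reachable through the L, A, AM, AP, APO, G, Q, QU and QUO states.
NAMED_REFERENCES = {
    "lt": "<",
    "amp": "&",
    "apos": "'",
    "gt": ">",
    "quot": '"',
}


def named_reference_character(name: str) -> Optional[str]:
    """Return the character for a recognised name, or None otherwise.

    Hypothetical helper; unrecognised names are handled by the bad named
    character reference state, which this sketch does not model.
    """
    return NAMED_REFERENCES.get(name)
```

A table like this is easier to read than the per-letter states, although the state machine form makes the behaviour for partial matches explicit.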
Append the current character to raw value.
If the current character is...
‘;’: Consume the character. Run the emitting operation with the character U+FFFD. Switch to the return state.
The extra terminating character: Switch to the return state without consuming the current character.
Any other character in the range ‘0’..‘9’, ‘a’..‘f’, ‘A’..‘F’: Consume the character and stay in this state.
Anything else: Run the emitting operation for all but the last character in raw value, and switch to the data state without consuming the current character.
To construct a node tree from a sequence of tokens and a document:
TODO(ianh): <template>, <t>