Parsing

Parsing in Sky is a strict pipeline consisting of four stages:

decoding, which converts incoming bytes into Unicode characters using UTF-8
normalising, which rewrites certain characters in that sequence (line breaks and U+0000)
tokenising, which converts these characters into tokens
tree construction, which converts these tokens into a tree of nodes

Later stages cannot affect earlier stages.

When a sequence of bytes is to be parsed, there is always a defined parsing context, which is either “application” or “module”.

Decoding stage

To decode a sequence of bytes bytes for parsing, the UTF-8 decoder must be used to transform bytes into a sequence of characters characters.

This sequence must then be passed to the normalisation stage.
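
As a minimal sketch of this stage in Python: the function name `decode` is hypothetical, and replacing malformed byte sequences with U+FFFD is an assumption, since the text above does not say how decoding errors are handled.

```python
def decode(data: bytes) -> str:
    """Decode raw bytes as UTF-8 into a string of Unicode characters."""
    # Assumption: malformed sequences become U+FFFD rather than raising an
    # error. The stage description does not specify error handling.
    return data.decode("utf-8", errors="replace")
```

The resulting string is what the normalisation stage would receive.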

Normalisation stage

To normalise a sequence of characters, apply the following rules:

Any U+000D character followed by a U+000A character must be removed.
Any U+000D character not followed by a U+000A character must be converted to a U+000A character.
Any U+0000 character must be converted to a U+FFFD character.

The converted sequence of characters must then be passed to the tokenisation stage.
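
The three rules above can be sketched in Python as follows. The function name `normalise` is hypothetical; the order of the replacements matters, because CR LF pairs have to be collapsed before lone CR characters are converted.

```python
def normalise(characters: str) -> str:
    """Apply the three normalisation rules to a decoded character sequence."""
    # Rule 1: drop a U+000D that is followed by U+000A (CR LF becomes LF).
    characters = characters.replace("\r\n", "\n")
    # Rule 2: every remaining U+000D is a lone one and becomes U+000A.
    characters = characters.replace("\r", "\n")
    # Rule 3: U+0000 becomes U+FFFD.
    return characters.replace("\x00", "\ufffd")
```

For instance, `normalise("a\r\nb\rc")` yields the three letters separated by single U+000A characters.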

Tokenisation stage

To tokenise a sequence of characters, a state machine is used.

Initially, the state machine must begin in the signature state.

Each character in turn must be processed according to the rules of the state at the time the character is processed. A character is processed once it has been consumed. This produces a stream of tokens; the tokens must be passed to the tree construction stage.

When the last character is consumed, the tokeniser ends.

Expecting a string

When the user agent is to expect a string, it must run these steps:

Let expectation be the string to expect. When this string is indexed, the first character has index 0.
Assertion: The first character in expectation is the current character, and expectation has more than one character.
Consume the current character.
Let index be 1.
Let success and failure be the states specified for success and failure respectively.
Switch to the expect a string state.

Tokeniser states

Signature state

If the current character is...

‘#’: If the parsing context is not “application”, switch to the failed signature state. Otherwise, expect the string “#!mojo mojo:sky”, with after signature as the success state and failed signature as the failure state.
‘S’: If the parsing context is not “module”, switch to the failed signature state. Otherwise, expect the string “SKY MODULE”, with after signature as the success state, and failed signature as the failure state.
Anything else: Switch to the failed signature state.

Expect a string state

If the current character is not the same as the indexth character in expectation, then switch to the failure state.

Otherwise, consume the character, and increase index by one. If index is now equal to the length of expectation, then switch to the success state.

After signature state

If the current character is...

U+000A: Consume the character and switch to the data state.
U+0020: Consume the character and switch to the consume rest of line state.
Anything else: Switch to the failed signature state.

Failed signature state

Stop parsing. No tokens are emitted. The file is not a Sky file.

Consume rest of line state

If the current character is...

U+000A: Consume the character and switch to the data state.
Anything else: Consume the character and stay in this state.
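
Taken together, the signature, expect a string, after signature, failed signature and consume rest of line states amount to a prefix check followed by skipping the rest of the first line. The Python sketch below illustrates this under stated assumptions: `skip_signature` and `SIGNATURES` are hypothetical names, and input that ends before the data state is reached is treated as a failure, which the state descriptions do not actually specify.

```python
from typing import Optional

# Expected signature for each parsing context, as given in the signature state.
SIGNATURES = {"application": "#!mojo mojo:sky", "module": "SKY MODULE"}

def skip_signature(characters: str, context: str) -> Optional[int]:
    """Return the index at which the data state begins, or None on failure."""
    expectation = SIGNATURES[context]
    # Signature state and expect a string state: the input must begin with
    # the exact signature for this parsing context.
    if not characters.startswith(expectation):
        return None
    index = len(expectation)
    if index >= len(characters):
        return None  # assumption: running out of input counts as failure
    # After signature state: only U+000A or U+0020 may follow the signature.
    if characters[index] == "\n":
        return index + 1
    if characters[index] != " ":
        return None
    # Consume rest of line state: skip up to and including the next U+000A.
    newline = characters.find("\n", index)
    if newline == -1:
        return None  # assumption, as above
    return newline + 1
```

A caller would start the data state at the returned index, for example `characters[skip_signature(characters, "module"):]`, after checking for `None`.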

Data state

If the current character is...

‘<’: Consume the character and switch to the tag open state.
‘&’: Consume the character and switch to the character reference state, with the return state set to the data state, the extra terminating character unset (or set to U+0000, which has the same effect), and the emitting operation being to emit a character token for the given character.
Anything else: Emit the current input character as a character token. Consume the character. Stay in this state.

Script raw data state

If the current character is...

‘<’: Consume the character and switch to the script raw data: close 1 state.
Anything else: Emit the current input character as a character token. Consume the character. Stay in this state.

Script raw data: close 1 state

If the current character is...

‘/’: Consume the character and switch to the script raw data: close 2 state.
Anything else: Emit ‘<’ character tokens. Consume the character. Switch to the script raw data state.

Script raw data: close 2 state

If the current character is...

‘s’: Consume the character and switch to the script raw data: close 3 state.
Anything else: Emit ‘</’ character tokens. Consume the character. Switch to the script raw data state.

Script raw data: close 3 state

If the current character is...

‘c’: Consume the character and switch to the script raw data: close 4 state.
Anything else: Emit ‘</s’ character tokens. Consume the character. Switch to the script raw data state.

Script raw data: close 4 state

If the current character is...

‘r’: Consume the character and switch to the script raw data: close 5 state.
Anything else: Emit ‘</sc’ character tokens. Consume the character. Switch to the script raw data state.

Script raw data: close 5 state

If the current character is...

‘i’: Consume the character and switch to the script raw data: close 6 state.
Anything else: Emit ‘</scr’ character tokens. Consume the character. Switch to the script raw data state.

Script raw data: close 6 state

If the current character is...

‘p’: Consume the character and switch to the script raw data: close 7 state.
Anything else: Emit ‘</scri’ character tokens. Consume the character. Switch to the script raw data state.

Script raw data: close 7 state

If the current character is...

‘t’: Consume the character and switch to the script raw data: close 8 state.
Anything else: Emit ‘</scrip’ character tokens. Consume the character. Switch to the script raw data state.

Script raw data: close 8 state

If the current character is...

U+0020, U+000A, ‘/’, ‘>’: Create an end tag token, and let its tag name be the string ‘script’. Switch to the before attribute name state without consuming the character.
Anything else: Emit ‘</script’ character tokens. Consume the character. Switch to the script raw data state.

Style raw data state

If the current character is...

‘<’: Consume the character and switch to the style raw data: close 1 state.
Anything else: Emit the current input character as a character token. Consume the character. Stay in this state.

Style raw data: close 1 state

If the current character is...

‘/’: Consume the character and switch to the style raw data: close 2 state.
Anything else: Emit ‘<’ character tokens. Consume the character. Switch to the style raw data state.

Style raw data: close 2 state

If the current character is...

‘s’: Consume the character and switch to the style raw data: close 3 state.
Anything else: Emit ‘</’ character tokens. Consume the character. Switch to the style raw data state.

Style raw data: close 3 state

If the current character is...

‘t’: Consume the character and switch to the style raw data: close 4 state.
Anything else: Emit ‘</s’ character tokens. Consume the character. Switch to the style raw data state.

Style raw data: close 4 state

If the current character is...

‘y’: Consume the character and switch to the style raw data: close 5 state.
Anything else: Emit ‘</st’ character tokens. Consume the character. Switch to the style raw data state.

Style raw data: close 5 state

If the current character is...

‘l’: Consume the character and switch to the style raw data: close 6 state.
Anything else: Emit ‘</sty’ character tokens. Consume the character. Switch to the style raw data state.

Style raw data: close 6 state

If the current character is...

‘e’: Consume the character and switch to the style raw data: close 7 state.
Anything else: Emit ‘</styl’ character tokens. Consume the character. Switch to the style raw data state.

Style raw data: close 7 state

If the current character is...

U+0020, U+000A, ‘/’, ‘>’: Create an end tag token, and let its tag name be the string ‘style’. Switch to the before attribute name state without consuming the character.
Anything else: Emit ‘</style’ character tokens. Consume the character. Switch to the style raw data state.

Tag open state

If the current character is...

‘!’: Consume the character and switch to the comment start 1 state.
‘/’: Consume the character and switch to the close tag state.
‘>’: Emit character tokens for ‘<>’. Consume the current character. Switch to the data state.
‘0’..‘9’, ‘a’..‘z’, ‘A’..‘Z’, ‘-’, ‘_’, ‘.’: Create a start tag token, let its tag name be the current character, consume the current character and switch to the tag name state.
Anything else: Emit the character token for ‘<’. Switch to the data state without consuming the current character.

Close tag state

If the current character is...

‘>’: Emit character tokens for ‘</>’. Consume the current character. Switch to the data state.
‘0’..‘9’, ‘a’..‘z’, ‘A’..‘Z’, ‘-’, ‘_’, ‘.’: Create an end tag token, let its tag name be the current character, consume the current character and switch to the tag name state.
Anything else: Emit the character tokens for ‘</’. Switch to the data state without consuming the current character.
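
The character class that starts a tag name in the tag open and close tag states is ASCII-only. A small Python sketch of that test follows; `TAG_NAME_START` and `starts_tag_name` are hypothetical names. An explicit set is used because `str.isalnum` would also accept non-ASCII letters and digits, which the rules above do not list.

```python
import string

# '0'..'9', 'a'..'z', 'A'..'Z', '-', '_', '.'
TAG_NAME_START = frozenset(string.ascii_letters + string.digits + "-_.")

def starts_tag_name(character: str) -> bool:
    """Return True if the character may begin a start or end tag name."""
    return character in TAG_NAME_START
```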

Tag name state

If the current character is...

U+0020, U+000A: Consume the current character. Switch to the before attribute name state.
‘/’: Consume the current character. Switch to the void tag state.
‘>’: Consume the current character. Switch to the after tag state.
Anything else: Append the current character to the tag name, and consume the current character. Stay in this state.

Void tag state

If the current character is...

‘>’: Consume the current character. Switch to the after tag state.
Anything else: Switch to the before attribute name state without consuming the current character.

Before attribute name state

If the current character is...

U+0020, U+000A: Consume the current character. Stay in this state.
‘/’: Consume the current character. Switch to the void tag state.
‘>’: Consume the current character. Switch to the after tag state.
Anything else: Create a new attribute in the tag token, and set its name to the current character. Consume the current character. Switch to the attribute name state.

Attribute name state

If the current character is...

U+0020, U+000A: Consume the current character. Switch to the after attribute name state.
‘/’: Consume the current character. Switch to the void tag state.
‘=’: Consume the current character. Switch to the before attribute value state.
‘>’: Consume the current character. Switch to the after tag state.
Anything else: Append the current character to the most recently added attribute's name, and consume the current character. Stay in this state.

After attribute name state

If the current character is...

U+0020, U+000A: Consume the current character. Stay in this state.
‘/’: Consume the current character. Switch to the void tag state.
‘=’: Consume the current character. Switch to the before attribute value state.
‘>’: Consume the current character. Switch to the after tag state.
Anything else: Create a new attribute in the tag token, and set its name to the current character. Consume the current character. Switch to the attribute name state.

Before attribute value state

If the current character is...

U+0020, U+000A: Consume the current character. Stay in this state.
‘>’: Consume the current character. Switch to the after tag state.
‘'’: Consume the current character. Switch to the single-quoted attribute value state.
‘"’: Consume the current character. Switch to the double-quoted attribute value state.
Anything else: Set the value of the most recently added attribute to the current character. Consume the current character. Switch to the unquoted attribute value state.

Single-quoted attribute value state

If the current character is...

‘'’: Consume the current character. Switch to the before attribute name state.
‘&’: Consume the character and switch to the character reference state, with the return state set to the single-quoted attribute value state, the extra terminating character set to ‘'’, and the emitting operation being to append the given character to the value of the most recently added attribute.
Anything else: Append the current character to the value of the most recently added attribute. Consume the current character. Stay in this state.

Double-quoted attribute value state

If the current character is...

‘"’: Consume the current character. Switch to the before attribute name state.
‘&’: Consume the character and switch to the character reference state, with the return state set to the double-quoted attribute value state, the extra terminating character set to ‘"’, and the emitting operation being to append the given character to the value of the most recently added attribute.
Anything else: Append the current character to the value of the most recently added attribute. Consume the current character. Stay in this state.

Unquoted attribute value state

If the current character is...

U+0020, U+000A: Consume the current character. Switch to the before attribute name state.
‘>’: Consume the current character. Switch to the after tag state.
‘&’: Consume the character and switch to the character reference state, with the return state set to the unquoted attribute value state, the extra terminating character unset (or set to U+0000, which has the same effect), and the emitting operation being to append the given character to the value of the most recently added attribute.
Anything else: Append the current character to the value of the most recently added attribute. Consume the current character. Stay in this state.

After tag state

Emit the tag token.

If the tag token was a start tag token and the tag name was ‘script’, then switch to the script raw data state.

If the tag token was a start tag token and the tag name was ‘style’, then switch to the style raw data state.

Otherwise, switch to the data state.

Comment start 1 state

If the current character is...

‘-’: Consume the character and switch to the comment start 2 state.
‘>’: Emit character tokens for ‘<!>’. Consume the current character. Switch to the data state.

Comment start 2 state

If the current character is...

‘-’: Consume the character and switch to the comment state.
‘>’: Emit character tokens for ‘<!->’. Consume the current character. Switch to the data state.

Comment state

If the current character is...

‘-’: Consume the character and switch to the comment end 1 state.
Anything else: Consume the character and stay in this state.

Comment end 1 state

If the current character is...

‘-’: Consume the character and switch to the comment end 2 state.
Anything else: Consume the character, and switch to the comment state.

Comment end 2 state

If the current character is...

‘>’: Consume the character and switch to the data state.
‘-’: Consume the character, but stay in this state.
Anything else: Consume the character, and switch to the comment state.
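
The comment, comment end 1 and comment end 2 states together look for the first ‘-->’ and emit nothing along the way. A Python sketch of just these three states is given below; the function name `skip_comment` is hypothetical, and returning the input length for an unterminated comment reflects the rule that the tokeniser simply ends when the last character is consumed.

```python
def skip_comment(characters: str, start: int) -> int:
    """Return the index just past the closing '-->' of a comment.

    `start` is the index of the first character after '<!--'. If the
    comment is never closed, the length of the input is returned.
    """
    state = "comment"
    for index in range(start, len(characters)):
        character = characters[index]
        if state == "comment":
            if character == "-":
                state = "comment end 1"
        elif state == "comment end 1":
            state = "comment end 2" if character == "-" else "comment"
        else:  # comment end 2
            if character == ">":
                return index + 1  # the data state resumes here
            if character != "-":
                state = "comment"
            # a further '-' stays in comment end 2
    return len(characters)
```

Note that, under these rules, ‘<!--->’ does not close the comment: the single ‘-’ after the opener only reaches comment end 1.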

Character reference state

Let raw value be the string ‘&’.

Append the current character to raw value.

If the current character is...

‘#’: Consume the character, and switch to the numeric character reference state.
‘l’: Consume the character and switch to the named character reference L state.
‘a’: Consume the character and switch to the named character reference A state.
‘g’: Consume the character and switch to the named character reference G state.
‘q’: Consume the character and switch to the named character reference Q state.
Any other character in the range ‘0’..‘9’, ‘a’..‘f’, ‘A’..‘F’: Consume the character and switch to the bad named character reference state.
Anything else: Run the emitting operation for all but the last character in raw value, and switch to the data state without consuming the current character.

Numeric character reference state

Append the current character to raw value.

If the current character is...

‘x’, ‘X’: Let value be zero, consume the character, and switch to the hexadecimal numeric character reference state.
‘0’..‘9’: Let value be the numeric value of the current character interpreted as a decimal digit, consume the character, and switch to the decimal numeric character reference state.
Anything else: Run the emitting operation for all but the last character in raw value, and switch to the data state without consuming the current character.

Hexadecimal numeric character reference state

Append the current character to raw value.

If the current character is...

‘0’..‘9’, ‘a’..‘f’, ‘A’..‘F’: Let value be sixteen times value plus the numeric value of the current character interpreted as a hexadecimal digit. Consume the character and stay in this state.
‘;’: Consume the character. If value is between 0x0001 and 0x10FFFF inclusive, but is not between 0xD800 and 0xDFFF inclusive, run the emitting operation with a Unicode character having the scalar value value; otherwise, run the emitting operation with the character U+FFFD. Then, in either case, switch to the return state.
Anything else: Run the emitting operation for all but the last character in raw value, and switch to the data state without consuming the current character.

Decimal numeric character reference state

Append the current character to raw value.

If the current character is...

‘0’..‘9’: Let value be ten times value plus the numeric value of the current character interpreted as a decimal digit. Consume the character and stay in this state.
‘;’: Consume the character. If value is between 0x0001 and 0x10FFFF inclusive, but is not between 0xD800 and 0xDFFF inclusive, run the emitting operation with a Unicode character having the scalar value value; otherwise, run the emitting operation with the character U+FFFD. Then, in either case, switch to the return state.
Anything else: Run the emitting operation for all but the last character in raw value, and switch to the data state without consuming the current character.
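
The arithmetic shared by the hexadecimal and decimal numeric character reference states can be sketched in Python as follows. The names `accumulate` and `numeric_reference_character` are hypothetical; the sketch covers only the digit accumulation and the range check applied at ‘;’, not the fallback paths.

```python
def accumulate(digits: str, base: int) -> int:
    """Fold a run of digits into value, one digit at a time."""
    value = 0
    for digit in digits:
        # value = base * value + numeric value of the current digit
        value = value * base + int(digit, base)
    return value

def numeric_reference_character(value: int) -> str:
    """Return the character to emit when ';' terminates the reference."""
    in_range = 0x0001 <= value <= 0x10FFFF
    is_surrogate = 0xD800 <= value <= 0xDFFF
    if in_range and not is_surrogate:
        return chr(value)
    return "\ufffd"
```

Under these rules both `&#x41;` and `&#65;` would produce ‘A’, while `&#0;` and `&#xD800;` would produce U+FFFD.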

Named character reference L state

Append the current character to raw value.

If the current character is...

‘t’: Let character be ‘<’, consume the current character, and switch to the after named character reference state.
Anything else: Switch to the bad named character reference state without consuming the character.

Named character reference A state

Append the current character to raw value.

If the current character is...

‘p’: Consume the current character and switch to the named character reference AP state.
‘m’: Consume the current character and switch to the named character reference AM state.
Anything else: Switch to the bad named character reference state without consuming the character.

Named character reference AM state

Append the current character to raw value.

If the current character is...

‘p’: Let character be ‘&’, consume the current character, and switch to the after named character reference state.
Anything else: Switch to the bad named character reference state without consuming the character.

Named character reference AP state

Append the current character to raw value.

If the current character is...

‘o’: Consume the current character and switch to the named character reference APO state.
Anything else: Switch to the bad named character reference state without consuming the character.

Named character reference APO state

Append the current character to raw value.

If the current character is...

‘s’: Let character be ‘'’, consume the current character, and switch to the after named character reference state.
Anything else: Switch to the bad named character reference state without consuming the character.

Named character reference G state

Append the current character to raw value.

If the current character is...

‘t’: Let character be ‘>’, consume the current character, and switch to the after named character reference state.
Anything else: Switch to the bad named character reference state without consuming the character.

Named character reference Q state

Append the current character to raw value.

If the current character is...

‘u’: Consume the current character and switch to the named character reference QU state.
Anything else: Switch to the bad named character reference state without consuming the character.

Named character reference QU state

Append the current character to raw value.

If the current character is...

‘o’: Consume the current character and switch to the named character reference QUO state.
Anything else: Switch to the bad named character reference state without consuming the character.

Named character reference QUO state

Append the current character to raw value.

If the current character is...

‘t’: Let character be ‘"’, consume the current character, and switch to the after named character reference state.
Anything else: Switch to the bad named character reference state without consuming the character.

After named character reference state

Append the current character to raw value.

If the current character is...

‘;’: Consume the character. Run the emitting operation with the character character. Switch to the return state.
The extra terminating character: Run the emitting operation with the character U+FFFD. Switch to the return state without consuming the current character.
Anything else: Switch to the bad named character reference state without consuming the current character.
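
The named character reference states above spell out a trie over exactly five names. As a Python sketch of the successful paths only, the same mapping can be written as a table; `NAMED_REFERENCES` and `named_reference` are hypothetical names, and the handling of unrecognised names by the bad named character reference state is deliberately not modelled here.

```python
from typing import Optional

# Name (without '&' and ';') to the character set by "Let character be ...".
NAMED_REFERENCES = {
    "lt": "<",
    "gt": ">",
    "amp": "&",
    "apos": "'",
    "quot": '"',
}

def named_reference(name: str) -> Optional[str]:
    """Return the character for a recognised name, or None otherwise."""
    return NAMED_REFERENCES.get(name)
```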

Bad named character reference state

Append the current character to raw value.

If the current character is...

‘;’: Consume the character. Run the emitting operation with the character U+FFFD. Switch to the return state.
The extra terminating character: Switch to the return state without consuming the current character.
Any other character in the range ‘0’..‘9’, ‘a’..‘f’, ‘A’..‘F’: Consume the character and stay in this state.
Anything else: Run the emitting operation for all but the last character in raw value, and switch to the data state without consuming the current character.

Tree construction

To construct a node tree from a sequence of tokens and a document document:

Initialize the stack of open nodes to be document.
Consider each token token in the sequence of tokens in turn.
- If token is a text token,
  1. Create a text node node with character data token.data.
  2. Append node to the top node in the stack of open nodes.
- If token is a start tag token,
  1. Create an element node node with tag name token.tagName and attributes token.attributes.
  2. Append node to the top node in the stack of open nodes.
  3. If the token.selfClosing flag is not set, push node onto the stack of open nodes.
  4. If token.tagName is script, TODO: Execute the script.
- If token is an end tag token,
  1. If the stack of open nodes contains a node whose tagName is token.tagName,
    - Pop nodes from the stack of open nodes until a node with a tagName equal to token.tagName has been popped.
  2. Otherwise, ignore token.
- If token is a comment token,
  1. Ignore token.
- If token is an EOF token,
  1. Pop all the nodes from the stack of open nodes.
  2. Signal document that parsing is complete.
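
The algorithm above can be sketched in Python roughly as follows. Everything about the data model is an assumption made for illustration: tokens are plain dictionaries with a hypothetical `type` key, `Node` is a hypothetical class, the completion signal is modelled as a `parsing_complete` flag, and script execution (marked TODO above) is left out.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    tag_name: Optional[str] = None   # None for the document and text nodes
    attributes: dict = field(default_factory=dict)
    data: str = ""                   # character data, for text nodes
    children: list = field(default_factory=list)
    parsing_complete: bool = False   # hypothetical stand-in for the signal

def construct_tree(tokens, document: Node) -> Node:
    stack = [document]  # the stack of open nodes
    for token in tokens:
        kind = token["type"]
        if kind == "text":
            stack[-1].children.append(Node(data=token["data"]))
        elif kind == "start":
            node = Node(tag_name=token["tagName"],
                        attributes=dict(token.get("attributes", {})))
            stack[-1].children.append(node)
            if not token.get("selfClosing", False):
                stack.append(node)
        elif kind == "end":
            # Pop only if a matching open node exists; otherwise ignore.
            if any(n.tag_name == token["tagName"] for n in stack):
                while stack.pop().tag_name != token["tagName"]:
                    pass
        elif kind == "eof":
            stack.clear()
            document.parsing_complete = True
        # Comment tokens fall through and are ignored.
    return document
```

In this sketch an end tag with no matching open node is skipped without touching the stack, and an end tag that matches an outer node implicitly closes everything opened inside it.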

TODO(ianh): <template>, <t>