Parsing

Parsing in Sky is a strict pipeline consisting of five stages:

  • decoding, which converts incoming bytes into Unicode characters using UTF-8.

  • normalising, which manipulates the sequence of characters.

  • tokenising, which converts these characters into three kinds of tokens: character tokens, start tag tokens, and end tag tokens. Character tokens have a single character value. Tag tokens have a tag name, and a list of name/value pairs known as attributes.

  • token cleanup, which converts sequences of character tokens into string tokens, and removes duplicate attributes in tag tokens.

  • tree construction, which converts these tokens into a tree of nodes.

Later stages cannot affect earlier stages.

When a sequence of bytes is to be parsed, there is always a defined parsing context, which is either an Application object or a Module object.

Decoding stage

To decode a sequence of bytes (referred to here as bytes) for parsing, the UTF-8 decoder must be used to transform bytes into a sequence of characters (referred to here as characters).

This sequence must then be passed to the normalisation stage.

Normalisation stage

To normalise a sequence of characters, apply the following rules:

  • Any U+000D character followed by a U+000A character must be removed.

  • Any U+000D character not followed by a U+000A character must be converted to a U+000A character.

  • Any U+0000 character must be converted to a U+FFFD character.

The converted sequence of characters must then be passed to the tokenisation stage.
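
The three normalisation rules can be sketched in Python as follows. This is only an illustration: the function name normalise is an assumption and does not come from this document.

```python
def normalise(characters: str) -> str:
    """Apply the three normalisation rules to an already decoded string."""
    # Rule 1: a U+000D that is followed by U+000A is removed (CR LF becomes LF).
    characters = characters.replace("\r\n", "\n")
    # Rule 2: any U+000D still present is not followed by U+000A,
    # so it is converted to U+000A.
    characters = characters.replace("\r", "\n")
    # Rule 3: U+0000 is converted to U+FFFD.
    return characters.replace("\x00", "\ufffd")
```

Applying the rules in this order gives the same result as a single left-to-right pass, because the first replacement only ever removes U+000D characters that rule 2 would not have touched.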

Tokenisation stage

To tokenise a sequence of characters, a state machine is used.

Initially, the state machine must begin in the signature state.

Each character in turn must be processed according to the rules of the state at the time the character is processed. A character is processed once it has been consumed. This produces a stream of tokens; the tokens must be passed to the token cleanup stage.

When the last character is consumed, the tokeniser ends.

Expecting a string

When the user agent is to expect a string, it must run these steps:

  1. Let expectation be the string to expect. When this string is indexed, the first character has index 0.

  2. Assertion: The first character in expectation is the current character, and expectation has more than one character.

  3. Consume the current character.

  4. Let index be 1.

  5. Let success and failure be the states specified for success and failure respectively.

  6. Switch to the expect a string state.

Tokeniser states

Signature state

If the current character is...

  • ‘#’: If the parsing context is not an Application, switch to the failed signature state. Otherwise, expect the string “#!mojo mojo:sky”, with after signature as the success state and failed signature as the failure state.

  • ‘S’: If the parsing context is not a Module, switch to the failed signature state. Otherwise, expect the string “SKY MODULE”, with after signature as the success state, and failed signature as the failure state.

  • Anything else: Jump to the failed signature state.

Expect a string state

If the current character is not the same as the character at position index in expectation, then switch to the failure state.

Otherwise, consume the character and increase index by one. If index is now equal to the length of expectation, then switch to the success state.
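
As a rough illustration, the "expecting a string" steps and the expect a string state can be folded into one loop. The function name expect and its return value are assumptions made for this sketch. Treating the end of input as a mismatch is also an assumption, since the text only says that the tokeniser ends when the last character is consumed.

```python
from typing import Tuple


def expect(characters: str, position: int, expectation: str) -> Tuple[bool, int]:
    """Try to match expectation at position; return (matched, new position)."""
    # Step 2: the first character already matches and expectation is longer than 1.
    assert len(expectation) > 1 and characters[position] == expectation[0]
    position += 1  # Step 3: consume the current character.
    index = 1      # Step 4.
    while index < len(expectation):
        # Assumption: running out of input counts as a mismatch.
        if position >= len(characters) or characters[position] != expectation[index]:
            return False, position  # The state machine would switch to the failure state.
        position += 1
        index += 1
    return True, position  # The state machine would switch to the success state.
```

On a mismatch the offending character is left unconsumed, matching the rule that the failure state is entered without consuming it.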

After signature state

If the current character is...

  • U+000A: Consume the character and switch to the data state.
  • U+0020: Consume the character and switch to the consume rest of line state.
  • Anything else: Switch to the failed signature state.

Failed signature state

Stop parsing. No tokens are emitted. The file is not a Sky file.

Consume rest of line state

If the current character is...

  • U+000A: Consume the character and switch to the data state.
  • Anything else: Consume the character and stay in this state.
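
Taken together, the signature, after signature, and consume rest of line states check the first line of the file. The sketch below is one way to express that check in Python. The function name check_signature and the context labels "application" and "module" are hypothetical. Input that ends before the data state is reached is reported as a failure here, which is an assumption rather than something the text states.

```python
def check_signature(characters: str, context: str) -> int:
    """Return the index at which the data state begins, or -1 on failure."""
    # The signature required for each parsing context.
    signatures = {"application": "#!mojo mojo:sky", "module": "SKY MODULE"}
    signature = signatures[context]
    if not characters.startswith(signature):
        return -1  # Failed signature state.
    after = len(signature)
    if characters.startswith("\n", after):
        return after + 1  # After signature state: U+000A leads straight to data.
    if characters.startswith(" ", after):
        # Consume rest of line state: skip everything up to and including U+000A.
        newline = characters.find("\n", after)
        return -1 if newline == -1 else newline + 1  # Assumption for missing newline.
    return -1  # Anything else after the signature fails.
```

For example, check_signature("SKY MODULE\n<p>", "module") returns 11, the index of the first character handled by the data state.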

Data state

If the current character is...

  • ‘<’: Consume the character and switch to the tag open state.

  • ‘&’: Consume the character and switch to the character reference state, with the return state set to the data state, the extra terminating character unset (or set to U+0000, which has the same effect), and the emitting operation being to emit a character token for the given character.

  • Anything else: Emit the current input character as a character token. Consume the character. Stay in this state.

Script raw data state

If the current character is...

  • ‘<’: Consume the character and switch to the script raw data: close 1 state.

  • Anything else: Emit the current input character as a character token. Consume the character. Stay in this state.

Script raw data: close 1 state

If the current character is...

  • ‘/’: Consume the character and switch to the script raw data: close 2 state.

  • Anything else: Emit ‘<’ character tokens. Consume the character. Switch to the script raw data state.

Script raw data: close 2 state

If the current character is...

  • ‘s’: Consume the character and switch to the script raw data: close 3 state.

  • Anything else: Emit ‘</’ character tokens. Consume the character. Switch to the script raw data state.

Script raw data: close 3 state

If the current character is...

  • ‘c’: Consume the character and switch to the script raw data: close 4 state.

  • Anything else: Emit ‘</s’ character tokens. Consume the character. Switch to the script raw data state.

Script raw data: close 4 state

If the current character is...

  • ‘r’: Consume the character and switch to the script raw data: close 5 state.

  • Anything else: Emit ‘</sc’ character tokens. Consume the character. Switch to the script raw data state.

Script raw data: close 5 state

If the current character is...

  • ‘i’: Consume the character and switch to the script raw data: close 6 state.

  • Anything else: Emit ‘</scr’ character tokens. Consume the character. Switch to the script raw data state.

Script raw data: close 6 state

If the current character is...

  • ‘p’: Consume the character and switch to the script raw data: close 7 state.

  • Anything else: Emit ‘</scri’ character tokens. Consume the character. Switch to the script raw data state.

Script raw data: close 7 state

If the current character is...

  • ‘t’: Consume the character and switch to the script raw data: close 8 state.

  • Anything else: Emit ‘</scrip’ character tokens. Consume the character. Switch to the script raw data state.

Script raw data: close 8 state

If the current character is...

  • U+0020, U+000A, ‘/’, ‘>’: Create an end tag token, and let its tag name be the string ‘script’. Switch to the before attribute name state without consuming the character.

  • Anything else: Emit ‘</script’ character tokens. Consume the character. Switch to the script raw data state.

Style raw data state

If the current character is...

  • ‘<’: Consume the character and switch to the style raw data: close 1 state.

  • Anything else: Emit the current input character as a character token. Consume the character. Stay in this state.

Style raw data: close 1 state

If the current character is...

  • ‘/’: Consume the character and switch to the style raw data: close 2 state.

  • Anything else: Emit ‘<’ character tokens. Consume the character. Switch to the style raw data state.

Style raw data: close 2 state

If the current character is...

  • ‘s’: Consume the character and switch to the style raw data: close 3 state.

  • Anything else: Emit ‘</’ character tokens. Consume the character. Switch to the style raw data state.

Style raw data: close 3 state

If the current character is...

  • ‘t’: Consume the character and switch to the style raw data: close 4 state.

  • Anything else: Emit ‘</s’ character tokens. Consume the character. Switch to the style raw data state.

Style raw data: close 4 state

If the current character is...

  • ‘y’: Consume the character and switch to the style raw data: close 5 state.

  • Anything else: Emit ‘</st’ character tokens. Consume the character. Switch to the style raw data state.

Style raw data: close 5 state

If the current character is...

  • ‘l’: Consume the character and switch to the style raw data: close 6 state.

  • Anything else: Emit ‘</sty’ character tokens. Consume the character. Switch to the style raw data state.

Style raw data: close 6 state

If the current character is...

  • ‘e’: Consume the character and switch to the style raw data: close 7 state.

  • Anything else: Emit ‘</styl’ character tokens. Consume the character. Switch to the style raw data state.

Style raw data: close 7 state

If the current character is...

  • U+0020, U+000A, ‘/’, ‘>’: Create an end tag token, and let its tag name be the string ‘style’. Switch to the before attribute name state without consuming the character.

  • Anything else: Emit ‘</style’ character tokens. Consume the character. Switch to the style raw data state.
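
The script and style raw data states share one shape: each close N state records how much of ‘</script’ or ‘</style’ has been matched so far. The Python sketch below collapses those states into a single counter. The function name read_raw_data and its return value are assumptions. It follows the text literally, so a character that breaks a partial match is consumed without being emitted, exactly as the "Anything else" branches are written.

```python
from typing import Tuple


def read_raw_data(characters: str, position: int, tag_name: str) -> Tuple[str, int]:
    """Return (emitted text, index of the character that ends the end tag name)."""
    closer = "</" + tag_name
    emitted = []
    matched = 0  # How many characters of closer have been matched (the close N state).
    while position < len(characters):
        current = characters[position]
        if matched == len(closer):
            # Final close state: one of these four characters completes the end tag.
            if current in " \n/>":
                return "".join(emitted), position  # Not consumed, as specified.
            emitted.append(closer)
            matched = 0
        elif current == closer[matched]:
            matched += 1
        elif matched == 0:
            emitted.append(current)  # Plain raw data state.
        else:
            emitted.append(closer[:matched])  # Emit the partial match, e.g. '</sc'.
            matched = 0
        position += 1  # Every branch that reaches this point consumes current.
    return "".join(emitted), position
```

With the input "var a = 1 < 2;</script>" the sketch yields the text "var a = 1 <2;": the space after the lone ‘<’ is consumed in the close 1 state and never emitted.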

Tag open state

If the current character is...

  • ‘!’: Consume the character and switch to the comment start 1 state.

  • ‘/’: Consume the character and switch to the close tag state.

  • ‘>’: Emit character tokens for ‘<>’. Consume the current character. Switch to the data state.

  • ‘0’..‘9’, ‘a’..‘z’, ‘A’..‘Z’, ‘-’, ‘_’, ‘.’: Create a start tag token, let its tag name be the current character, consume the current character and switch to the tag name state.

  • Anything else: Emit the character token for ‘<’. Switch to the data state without consuming the current character.

Close tag state

If the current character is...

  • ‘>’: Emit character tokens for ‘</>’. Consume the current character. Switch to the data state.

  • ‘0’..‘9’, ‘a’..‘z’, ‘A’..‘Z’, ‘-’, ‘_’, ‘.’: Create an end tag token, let its tag name be the current character, consume the current character and switch to the tag name state.

  • Anything else: Emit the character tokens for ‘</’. Switch to the data state without consuming the current character.

Tag name state

If the current character is...

  • U+0020, U+000A: Consume the current character. Switch to the before attribute name state.

  • ‘/’: Consume the current character. Switch to the void tag state.

  • ‘>’: Consume the current character. Switch to the after tag state.

  • Anything else: Append the current character to the tag name, and consume the current character. Stay in this state.

Void tag state

If the current character is...

  • ‘>’: Consume the current character. Switch to the after void tag state.

  • Anything else: Switch to the before attribute name state without consuming the current character.

Before attribute name state

If the current character is...

  • U+0020, U+000A: Consume the current character. Stay in this state.

  • ‘/’: Consume the current character. Switch to the void tag state.

  • ‘>’: Consume the current character. Switch to the after tag state.

  • Anything else: Create a new attribute in the tag token, and set its name to the current character. Consume the current character. Switch to the attribute name state.

Attribute name state

If the current character is...

  • U+0020, U+000A: Consume the current character. Switch to the after attribute name state.

  • ‘/’: Consume the current character. Switch to the void tag state.

  • ‘=’: Consume the current character. Switch to the before attribute value state.

  • ‘>’: Consume the current character. Switch to the after tag state.

  • Anything else: Append the current character to the most recently added attribute's name, and consume the current character. Stay in this state.

After attribute name state

If the current character is...

  • U+0020, U+000A: Consume the current character. Stay in this state.

  • ‘/’: Consume the current character. Switch to the void tag state.

  • ‘=’: Consume the current character. Switch to the before attribute value state.

  • ‘>’: Consume the current character. Switch to the after tag state.

  • Anything else: Create a new attribute in the tag token, and set its name to the current character. Consume the current character. Switch to the attribute name state.

Before attribute value state

If the current character is...

  • U+0020, U+000A: Consume the current character. Stay in this state.

  • ‘>’: Consume the current character. Switch to the after tag state.

  • ‘'’: Consume the current character. Switch to the single-quoted attribute value state.

  • ‘"’: Consume the current character. Switch to the double-quoted attribute value state.

  • Anything else: Set the value of the most recently added attribute to the current character. Consume the current character. Switch to the unquoted attribute value state.

Single-quoted attribute value state

If the current character is...

  • ‘'’: Consume the current character. Switch to the before attribute name state.

  • ‘&’: Consume the character and switch to the character reference state, with the return state set to the single-quoted attribute value state, the extra terminating character set to ‘'’, and the emitting operation being to append the given character to the value of the most recently added attribute.

  • Anything else: Append the current character to the value of the most recently added attribute. Consume the current character. Stay in this state.

Double-quoted attribute value state

If the current character is...

  • ‘"’: Consume the current character. Switch to the before attribute name state.

  • ‘&’: Consume the character and switch to the character reference state, with the return state set to the double-quoted attribute value state, the extra terminating character set to ‘"’, and the emitting operation being to append the given character to the value of the most recently added attribute.

  • Anything else: Append the current character to the value of the most recently added attribute. Consume the current character. Stay in this state.

Unquoted attribute value state

If the current character is...

  • U+0020, U+000A: Consume the current character. Switch to the before attribute name state.

  • ‘>’: Consume the current character. Switch to the after tag state.

  • ‘&’: Consume the character and switch to the character reference state, with the return state set to the unquoted attribute value state, the extra terminating character unset (or set to U+0000, which has the same effect), and the emitting operation being to append the given character to the value of the most recently added attribute.

  • Anything else: Append the current character to the value of the most recently added attribute. Consume the current character. Stay in this state.

After tag state

Emit the tag token.

If the tag token was a start tag token and the tag name was ‘script’, then switch to the script raw data state.

Otherwise, if the tag token was a start tag token and the tag name was ‘style’, then switch to the style raw data state.

Otherwise, switch to the data state.

After void tag state

Emit the tag token.

If the tag token is a start tag token, emit an end tag token with the same tag name.

Switch to the data state.
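
The after tag and after void tag states can be pictured as one small function that returns the tokens to emit and the next state. This is a sketch only: the dictionary shape used for tokens, the state names as strings, and the function name after_tag are all assumptions.

```python
from typing import Dict, List, Tuple


def after_tag(token: Dict, void: bool) -> Tuple[List[Dict], str]:
    """Return (tokens to emit, name of the next state) once a tag is complete."""
    emitted = [token]
    if void:
        # After void tag state: a start tag is immediately followed by an end tag.
        if token["kind"] == "start":
            emitted.append({"kind": "end", "name": token["name"], "attributes": []})
        return emitted, "data"
    # After tag state: only start tags named script or style enter raw data.
    if token["kind"] == "start" and token["name"] == "script":
        return emitted, "script raw data"
    if token["kind"] == "start" and token["name"] == "style":
        return emitted, "style raw data"
    return emitted, "data"
```

Note that a void script tag returns to the data state, because the after void tag state never switches to a raw data state.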

Comment start 1 state

If the current character is...

  • ‘-’: Consume the character and switch to the comment start 2 state.

  • ‘>’: Emit character tokens for ‘<!>’. Consume the current character. Switch to the data state.

Comment start 2 state

If the current character is...

  • ‘-’: Consume the character and switch to the comment state.

  • ‘>’: Emit character tokens for ‘<!->’. Consume the current character. Switch to the data state.

Comment state

If the current character is...

  • ‘-’: Consume the character and switch to the comment end 1 state.

  • Anything else: Consume the character and stay in this state.

Comment end 1 state

If the current character is...

  • ‘-’: Consume the character and switch to the comment end 2 state.

  • Anything else: Consume the character, and switch to the comment state.

Comment end 2 state

If the current character is...

  • ‘>’: Consume the character and switch to the data state.

  • ‘-’: Consume the character, but stay in this state.

  • Anything else: Consume the character, and switch to the comment state.
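
Once the comment state has been entered, the three comment states consume everything up to and including the first ‘-->’ and emit nothing. A minimal Python sketch of that behaviour follows. The name skip_comment is an assumption, and returning the end of input for an unterminated comment is also an assumption.

```python
def skip_comment(characters: str, position: int) -> int:
    """position is the index just after '<!--'; return the index after '-->'."""
    # The comment ends at the first '>' that is preceded by at least two '-'
    # characters inside the comment, which is the first occurrence of '-->'.
    end = characters.find("-->", position)
    # Assumption: an unterminated comment swallows the rest of the input.
    return len(characters) if end == -1 else end + 3
```

Because the search starts after the opening ‘<!--’, the input ‘<!--->’ does not close the comment: only one ‘-’ has been seen in the comment itself when the ‘>’ arrives.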

Character reference state

Let raw value be the string ‘&’.

Append the current character to raw value.

If the current character is...

  • ‘#’: Consume the character, and switch to the numeric character reference state.

  • ‘l’: Consume the character and switch to the named character reference L state.

  • ‘a’: Consume the character and switch to the named character reference A state.

  • ‘g’: Consume the character and switch to the named character reference G state.

  • ‘q’: Consume the character and switch to the named character reference Q state.

  • Any other character in the range ‘0’..‘9’, ‘a’..‘f’, ‘A’..‘F’: Consume the character and switch to the bad named character reference state.

  • Anything else: Run the emitting operation for all but the last character in raw value, and switch to the data state without consuming the current character.

Numeric character reference state

Append the current character to raw value.

If the current character is...

  • ‘x’, ‘X’: Let value be zero, consume the character, and switch to the hexadecimal numeric character reference state.

  • ‘0’..‘9’: Let value be the numeric value of the current character interpreted as a decimal digit, consume the character, and switch to the decimal numeric character reference state.

  • Anything else: Run the emitting operation for all but the last character in raw value, and switch to the data state without consuming the current character.

Hexadecimal numeric character reference state

Append the current character to raw value.

If the current character is...

  • ‘0’..‘9’, ‘a’..‘f’, ‘A’..‘F’: Let value be sixteen times value plus the numeric value of the current character interpreted as a hexadecimal digit. Consume the character and stay in this state.

  • ‘;’: Consume the character. If value is between 0x0001 and 0x10FFFF inclusive, but is not between 0xD800 and 0xDFFF inclusive, run the emitting operation with a Unicode character having the scalar value value; otherwise, run the emitting operation with the character U+FFFD. Then, in either case, switch to the return state.

  • Anything else: Run the emitting operation for all but the last character in raw value, and switch to the data state without consuming the current character.

Decimal numeric character reference state

Append the current character to raw value.

If the current character is...

  • ‘0’..‘9’: Let value be ten times value plus the numeric value of the current character interpreted as a decimal digit. Consume the character and stay in this state.

  • ‘;’: Consume the character. If value is between 0x0001 and 0x10FFFF inclusive, but is not between 0xD800 and 0xDFFF inclusive, run the emitting operation with a Unicode character having the scalar value value; otherwise, run the emitting operation with the character U+FFFD. Then, in either case, switch to the return state.

  • Anything else: Run the emitting operation for all but the last character in raw value, and switch to the data state without consuming the current character.
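
The arithmetic in the two numeric character reference states can be sketched as below. The helper names accumulate and numeric_reference_character are assumptions made for this illustration. The sketch covers only the digit accumulation and the range check applied at ‘;’, not the raw value bookkeeping.

```python
def accumulate(digits: str, base: int) -> int:
    """Fold digits into value one at a time, as the digit branches describe."""
    value = 0
    for digit in digits:
        # "Let value be base times value plus the numeric value of the digit."
        value = base * value + int(digit, base)
    return value


def numeric_reference_character(value: int) -> str:
    """Return the character emitted when ';' ends a numeric reference."""
    in_range = 0x0001 <= value <= 0x10FFFF
    is_surrogate = 0xD800 <= value <= 0xDFFF
    if in_range and not is_surrogate:
        return chr(value)
    return "\ufffd"  # Zero, surrogates and out-of-range values become U+FFFD.
```

For example, the reference ‘&#x41;’ accumulates the value 0x41 and emits ‘A’, while ‘&#0;’ emits U+FFFD.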

Named character reference L state

Append the current character to raw value.

If the current character is...

  • ‘t’: Let character be ‘<’, consume the current character, and switch to the after named character reference state.

  • Anything else: Switch to the bad named character reference state without consuming the character.

Named character reference A state

Append the current character to raw value.

If the current character is...

  • ‘p’: Consume the current character and switch to the named character reference AP state.

  • ‘m’: Consume the current character and switch to the named character reference AM state.

  • Anything else: Switch to the bad named character reference state without consuming the character.

Named character reference AM state

Append the current character to raw value.

If the current character is...

  • ‘p’: Let character be ‘&’, consume the current character, and switch to the after named character reference state.

  • Anything else: Switch to the bad named character reference state without consuming the character.

Named character reference AP state

Append the current character to raw value.

If the current character is...

  • ‘o’: Consume the current character and switch to the named character reference APO state.

  • Anything else: Switch to the bad named character reference state without consuming the character.

Named character reference APO state

Append the current character to raw value.

If the current character is...

  • ‘s’: Let character be ‘'’, consume the current character, and switch to the after named character reference state.

  • Anything else: Switch to the bad named character reference state without consuming the character.

Named character reference G state

Append the current character to raw value.

If the current character is...

  • ‘t’: Let character be ‘>’, consume the current character, and switch to the after named character reference state.

  • Anything else: Switch to the bad named character reference state without consuming the character.

Named character reference Q state

Append the current character to raw value.

If the current character is...

  • ‘u’: Consume the current character and switch to the named character reference QU state.

  • Anything else: Switch to the bad named character reference state without consuming the character.

Named character reference QU state

Append the current character to raw value.

If the current character is...

  • ‘o’: Consume the current character and switch to the named character reference QUO state.

  • Anything else: Switch to the bad named character reference state without consuming the character.

Named character reference QUO state

Append the current character to raw value.

If the current character is...

  • ‘t’: Let character be ‘"’, consume the current character, and switch to the after named character reference state.

  • Anything else: Switch to the bad named character reference state without consuming the character.

After named character reference state

Append the current character to raw value.

If the current character is...

  • ‘;’: Consume the character. Run the emitting operation with the character character. Switch to the return state.

  • The extra terminating character: Run the emitting operation with the character U+FFFD. Switch to the return state without consuming the current character.

  • Anything else: Switch to the bad named character reference state without consuming the current character.
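
Read together, the named character reference L, A, AM, AP, APO, G, Q, QU and QUO states recognise exactly five names. The table below lists them as a Python dictionary. The names NAMED_REFERENCES and resolve_named_reference are assumptions, and the sketch covers only a well-formed reference that ends in ‘;’, not the handling of malformed references.

```python
from typing import Optional

# The five names reachable through the named character reference states,
# mapped to the value that "character" is set to on the final letter.
NAMED_REFERENCES = {
    "lt": "<",
    "amp": "&",
    "apos": "'",
    "gt": ">",
    "quot": '"',
}


def resolve_named_reference(name: str) -> Optional[str]:
    """Return the character for a recognised name, or None if it is not one."""
    return NAMED_REFERENCES.get(name)
```

Names are matched case-sensitively, since the states only list lowercase letters.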

Bad named character reference state

Append the current character to raw value.

If the current character is...

  • ‘;’: Consume the character. Run the emitting operation with the character U+FFFD. Switch to the return state.

  • The extra terminating character: Switch to the return state without consuming the current character.

  • Any other character in the range ‘0’..‘9’, ‘a’..‘f’, ‘A’..‘F’: Consume the character and stay in this state.

  • Anything else: Run the emitting operation for all but the last character in raw value, and switch to the data state without consuming the current character.

Token cleanup stage

Replace each sequence of character tokens with a single string token whose value is the concatenation of all the characters in the character tokens.

For each start tag token, remove all but the first name/value pair for each name (i.e. remove duplicate attributes, keeping only the first one).

For each end tag token, remove the attributes entirely.

If the token is a start tag token, notify the JavaScript token stream callback of the token.

Then, pass the tokens to the tree construction stage.
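
A possible Python sketch of the cleanup rules is shown below. The tuple shapes ("char", c), ("start", name, attributes) and ("end", name, attributes), and the function name clean_up, are assumptions made for illustration. The notification of the JavaScript token stream callback is left out.

```python
from typing import List, Tuple


def clean_up(tokens: List[Tuple]) -> List[Tuple]:
    """Merge character tokens into string tokens and tidy tag attributes."""
    cleaned: List[Tuple] = []
    pending: List[str] = []  # Characters waiting to be merged into one string token.

    def flush() -> None:
        if pending:
            cleaned.append(("string", "".join(pending)))
            pending.clear()

    for token in tokens:
        if token[0] == "char":
            pending.append(token[1])
            continue
        flush()
        kind, name, attributes = token
        if kind == "start":
            # Keep only the first name/value pair for each attribute name.
            seen = set()
            kept = []
            for attr_name, value in attributes:
                if attr_name not in seen:
                    seen.add(attr_name)
                    kept.append((attr_name, value))
            cleaned.append((kind, name, kept))
        else:
            cleaned.append((kind, name, []))  # End tags lose their attributes.
    flush()
    return cleaned
```

Attribute order is preserved for the pairs that survive, which keeps the first occurrence in its original position.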

Tree construction stage

To construct a node tree from a sequence of tokens and a document document:

  1. Initialize the stack of open nodes to be document.
  2. Consider each token token in the sequence of tokens in turn, as follows. If a token is to be skipped, then jump straight to the next token, without doing any more work with the skipped token.
    • If token is a string token,
      1. If the value of the token contains only U+0020 and U+000A characters, and there is no t element on the stack of open nodes, then skip the token.
      2. Create a text node node whose character data is the value of the token.
      3. Append node to the top node in the stack of open nodes.
    • If token is a start tag token,
      1. Create an element node node with tag name and attributes given by the token.
      2. Append node to the top node in the stack of open nodes.
    • If token is an end tag token:
      1. Let node be the topmost node in the stack of open nodes whose tag name is the same as the token's tag name, if any. If there isn't one, skip this token.
      2. If there's a template element in the stack of open nodes above node, then skip this token.
      3. Pop nodes from the stack of open nodes until node has been popped.
      4. If node's tag name is script, then yield until there are no pending import loads, then execute the script given by the element's contents.
  3. Yield until there are no pending import loads.
  4. Fire a load event at the parsing context object.
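
The tree construction steps can be sketched in Python as follows. The Node class, the token tuple shapes and the function name build_tree are assumptions. The sketch also assumes that a new element is pushed onto the stack of open nodes after it is appended, which the steps above do not state explicitly. Yielding for import loads, script execution and the load event are omitted.

```python
from typing import List, Optional, Tuple


class Node:
    def __init__(self, name: str, attributes: Optional[list] = None,
                 data: Optional[str] = None) -> None:
        self.name = name
        self.attributes = attributes or []
        self.data = data
        self.children: List["Node"] = []


def build_tree(tokens: List[Tuple]) -> Node:
    document = Node("#document")
    stack = [document]  # Stack of open nodes; the top is the last item.
    for token in tokens:
        kind = token[0]
        if kind == "string":
            value = token[1]
            only_whitespace = value.strip(" \n") == ""
            if only_whitespace and not any(n.name == "t" for n in stack):
                continue  # Skip whitespace-only text outside a t element.
            stack[-1].children.append(Node("#text", data=value))
        elif kind == "start":
            node = Node(token[1], token[2])
            stack[-1].children.append(node)
            stack.append(node)  # ASSUMPTION: the element becomes the top open node.
        elif kind == "end":
            matches = [i for i, n in enumerate(stack) if n.name == token[1]]
            if not matches:
                continue  # No open node with this tag name.
            index = matches[-1]  # Topmost matching node.
            if any(n.name == "template" for n in stack[index + 1:]):
                continue  # A template element sits above the match.
            del stack[index:]  # Pop until the matching node has been popped.
    return document
```

An end tag can therefore close several open elements at once: ‘</p>’ closes an unclosed ‘b’ nested inside the ‘p’.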