Parsing in Sky is a strict pipeline consisting of five stages:
decoding, which converts incoming bytes into Unicode characters using UTF-8.
normalising, which manipulates the sequence of characters.
tokenising, which converts these characters into three kinds of tokens: character tokens, start tag tokens, and end tag tokens. Character tokens have a single character value. Tag tokens have a tag name, and a list of name/value pairs known as attributes.
token cleanup, which converts sequences of character tokens into string tokens, and removes duplicate attributes in tag tokens.
tree construction, which converts these tokens into a tree of nodes.
Later stages cannot affect earlier stages.
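As a rough illustration of the token kinds named in the stage list above, the following Python sketch models them as small data classes. The class and field names (`CharacterToken`, `StringToken`, `StartTagToken`, `EndTagToken`, `attributes`) are assumptions made for this example; Sky itself only describes the tokens in prose.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CharacterToken:
    value: str  # a single character


@dataclass
class StringToken:
    value: str  # built by token cleanup from a run of character tokens


@dataclass
class StartTagToken:
    name: str
    # Name/value pairs in source order; duplicates are possible until cleanup.
    attributes: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class EndTagToken:
    name: str
    attributes: List[Tuple[str, str]] = field(default_factory=list)
```

Attributes are kept as an ordered list of pairs rather than a dictionary so that duplicates produced by the tokeniser remain visible until the cleanup stage removes them.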
When a sequence of bytes is to be parsed, there is always a defined parsing context, which is either an Application object or a Module object.
To decode a sequence of bytes (bytes) for parsing, the UTF-8 decode algorithm must be used to transform bytes into a sequence of characters (characters).
Note: The decoder will strip a leading BOM if any.
This sequence must then be passed to the normalisation stage.
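A minimal sketch of the decoding stage in Python, assuming that the standard library's `utf-8-sig` codec with `errors="replace"` is an acceptable stand-in for the UTF-8 decode algorithm: it drops one leading BOM and substitutes U+FFFD for malformed input. The function name `decode_for_parsing` is hypothetical.

```python
def decode_for_parsing(data: bytes) -> str:
    # "utf-8-sig" strips a single leading BOM if present;
    # errors="replace" turns malformed byte sequences into U+FFFD.
    return data.decode("utf-8-sig", errors="replace")
```

The exact number of U+FFFD characters produced for a given malformed sequence may differ between decoders, so this should be read as an approximation rather than a conformance reference.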
To normalise a sequence of characters, apply the following rules:
Any U+000D character followed by a U+000A character must be removed.
Any U+000D character not followed by a U+000A character must be converted to a U+000A character.
Any U+0000 character must be converted to a U+FFFD character.
The converted sequence of characters must then be passed to the tokenisation stage.
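The three normalisation rules can be sketched with plain string replacement. The function name `normalise` is an assumption for this example; the order of the replacements matters, since CR LF pairs have to be handled before lone CR characters.

```python
def normalise(characters: str) -> str:
    # Rule 1: a U+000D followed by U+000A is removed (CR LF becomes LF).
    characters = characters.replace("\r\n", "\n")
    # Rule 2: any remaining U+000D becomes U+000A.
    characters = characters.replace("\r", "\n")
    # Rule 3: U+0000 becomes U+FFFD.
    return characters.replace("\x00", "\ufffd")
```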
To tokenise a sequence of characters, a state machine is used.
Initially, the state machine must begin in the signature state.
Each character in turn must be processed according to the rules of the state at the time the character is processed. A character is processed once it has been consumed. This produces a stream of tokens; the tokens must be passed to the token cleanup stage.
When the last character is consumed, the tokeniser ends.
When the user agent is to expect a string, it must run these steps:
Let expectation be the string to expect. When this string is indexed, the first character has index 0.
Assertion: The first character in expectation is the current character, and expectation has more than one character.
Consume the current character.
Let index be 1.
Let success and failure be the states specified for success and failure respectively.
Switch to the expect a string state.
Signature state
If the current character is...
‘#’: If the parsing context is not an Application, switch to the failed signature state. Otherwise, expect the string “#!mojo mojo:sky”, with after signature as the success state and failed signature as the failure state.
‘S’: If the parsing context is not a Module, switch to the failed signature state. Otherwise, expect the string “SKY MODULE”, with after signature as the success state and failed signature as the failure state.
Anything else: Jump to the failed signature state.
Expect a string state
If the current character is not the same as the indexth character in expectation, then switch to the failure state.
Otherwise, consume the character, and increase index. If index is now equal to the length of expectation, then switch to the success state.
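The "expect a string" steps and the expect a string state can be sketched as a single loop. This is only an illustration: the function name `expect_string`, the returned labels, and the `"ended"` result for input that runs out mid-match are assumptions, not part of the text above.

```python
def expect_string(chars: str, pos: int, expectation: str):
    """Return (outcome, position of the next unconsumed character)."""
    # Assertion from the steps: the first character already matches and
    # the expectation has more than one character.
    assert len(expectation) > 1 and chars[pos] == expectation[0]
    pos += 1   # consume the current character
    index = 1
    while pos < len(chars):
        if chars[pos] != expectation[index]:
            # The mismatching character is left unconsumed.
            return "failure", pos
        pos += 1
        index += 1
        if index == len(expectation):
            return "success", pos
    # Assumption: the tokeniser simply ends if input runs out here.
    return "ended", pos
```

For example, `expect_string("SKY MODULE\n<a>", 0, "SKY MODULE")` reports success with the position just past the matched text, from which the success state would continue.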
If the current character is...
Stop parsing. No tokens are emitted. The file is not a sky file.
If the current character is...
Data state
If the current character is...
‘<’: Consume the character and switch to the tag open state.
‘&’: Consume the character and switch to the character reference state, with the return state set to the data state, and the emitting operation being to emit a character token for the given character.
Anything else: Emit the current input character as a character token. Consume the character. Stay in this state.
Script raw data state
If the current character is...
‘<’: Consume the character and switch to the script raw data: close 1 state.
Anything else: Emit the current input character as a character token. Consume the character. Stay in this state.
Script raw data: close 1 state
If the current character is...
‘/’: Consume the character and switch to the script raw data: close 2 state.
Anything else: Emit ‘<’ character tokens. Switch to the script raw data state without consuming the character.
Script raw data: close 2 state
If the current character is...
‘s’: Consume the character and switch to the script raw data: close 3 state.
Anything else: Emit ‘</’ character tokens. Switch to the script raw data state without consuming the character.
Script raw data: close 3 state
If the current character is...
‘c’: Consume the character and switch to the script raw data: close 4 state.
Anything else: Emit ‘</s’ character tokens. Switch to the script raw data state without consuming the character.
Script raw data: close 4 state
If the current character is...
‘r’: Consume the character and switch to the script raw data: close 5 state.
Anything else: Emit ‘</sc’ character tokens. Switch to the script raw data state without consuming the character.
Script raw data: close 5 state
If the current character is...
‘i’: Consume the character and switch to the script raw data: close 6 state.
Anything else: Emit ‘</scr’ character tokens. Switch to the script raw data state without consuming the character.
Script raw data: close 6 state
If the current character is...
‘p’: Consume the character and switch to the script raw data: close 7 state.
Anything else: Emit ‘</scri’ character tokens. Switch to the script raw data state without consuming the character.
Script raw data: close 7 state
If the current character is...
‘t’: Consume the character and switch to the script raw data: close 8 state.
Anything else: Emit ‘</scrip’ character tokens. Switch to the script raw data state without consuming the character.
Script raw data: close 8 state
If the current character is...
U+0020, U+000A, ‘/’, ‘>’: Create an end tag token, and let its tag name be the string ‘script’. Switch to the before attribute name state without consuming the character.
Anything else: Emit ‘</script’ character tokens. Switch to the script raw data state without consuming the character.
Style raw data state
If the current character is...
‘<’: Consume the character and switch to the style raw data: close 1 state.
Anything else: Emit the current input character as a character token. Consume the character. Stay in this state.
Style raw data: close 1 state
If the current character is...
‘/’: Consume the character and switch to the style raw data: close 2 state.
Anything else: Emit ‘<’ character tokens. Switch to the style raw data state without consuming the character.
Style raw data: close 2 state
If the current character is...
‘s’: Consume the character and switch to the style raw data: close 3 state.
Anything else: Emit ‘</’ character tokens. Switch to the style raw data state without consuming the character.
Style raw data: close 3 state
If the current character is...
‘t’: Consume the character and switch to the style raw data: close 4 state.
Anything else: Emit ‘</s’ character tokens. Switch to the style raw data state without consuming the character.
Style raw data: close 4 state
If the current character is...
‘y’: Consume the character and switch to the style raw data: close 5 state.
Anything else: Emit ‘</st’ character tokens. Switch to the style raw data state without consuming the character.
Style raw data: close 5 state
If the current character is...
‘l’: Consume the character and switch to the style raw data: close 6 state.
Anything else: Emit ‘</sty’ character tokens. Switch to the style raw data state without consuming the character.
Style raw data: close 6 state
If the current character is...
‘e’: Consume the character and switch to the style raw data: close 7 state.
Anything else: Emit ‘</styl’ character tokens. Switch to the style raw data state without consuming the character.
Style raw data: close 7 state
If the current character is...
U+0020, U+000A, ‘/’, ‘>’: Create an end tag token, and let its tag name be the string ‘style’. Switch to the before attribute name state without consuming the character.
Anything else: Emit ‘</style’ character tokens. Switch to the style raw data state without consuming the character.
Tag open state
If the current character is...
‘!’: Consume the character and switch to the comment start 1 state.
‘/’: Consume the character and switch to the close tag state.
‘>’: Emit character tokens for ‘<>’. Consume the current character. Switch to the data state.
‘0’..‘9’, ‘a’..‘z’, ‘A’..‘Z’, ‘-’, ‘_’, ‘.’: Create a start tag token, let its tag name be the current character, consume the current character and switch to the tag name state.
Anything else: Emit the character token for ‘<’. Switch to the data state without consuming the current character.
Close tag state
If the current character is...
‘>’: Emit character tokens for ‘</>’. Consume the current character. Switch to the data state.
‘0’..‘9’, ‘a’..‘z’, ‘A’..‘Z’, ‘-’, ‘_’, ‘.’: Create an end tag token, let its tag name be the current character, consume the current character and switch to the tag name state.
Anything else: Emit the character tokens for ‘</’. Switch to the data state without consuming the current character.
Tag name state
If the current character is...
U+0020, U+000A: Consume the current character. Switch to the before attribute name state.
‘/’: Consume the current character. Switch to the void tag state.
‘>’: Consume the current character. Switch to the after tag state.
Anything else: Append the current character to the tag name, and consume the current character. Stay in this state.
Void tag state
If the current character is...
‘>’: Consume the current character. Switch to the after void tag state.
Anything else: Switch to the before attribute name state without consuming the current character.
Before attribute name state
If the current character is...
U+0020, U+000A: Consume the current character. Stay in this state.
‘/’: Consume the current character. Switch to the void tag state.
‘>’: Consume the current character. Switch to the after tag state.
Anything else: Create a new attribute in the tag token, and set its name to the current character and its value to the empty string. Consume the current character. Switch to the attribute name state.
Attribute name state
If the current character is...
U+0020, U+000A: Consume the current character. Switch to the after attribute name state.
‘/’: Consume the current character. Switch to the void tag state.
‘=’: Consume the current character. Switch to the before attribute value state.
‘>’: Consume the current character. Switch to the after tag state.
Anything else: Append the current character to the most recently added attribute's name, and consume the current character. Stay in this state.
After attribute name state
If the current character is...
U+0020, U+000A: Consume the current character. Stay in this state.
‘/’: Consume the current character. Switch to the void tag state.
‘=’: Consume the current character. Switch to the before attribute value state.
‘>’: Consume the current character. Switch to the after tag state.
Anything else: Create a new attribute in the tag token, and set its name to the current character and its value to the empty string. Consume the current character. Switch to the attribute name state.
Before attribute value state
If the current character is...
U+0020, U+000A: Consume the current character. Stay in this state.
‘>’: Consume the current character. Switch to the after tag state.
‘'’: Consume the current character. Switch to the single-quoted attribute value state.
‘"’: Consume the current character. Switch to the double-quoted attribute value state.
Anything else: Switch to the unquoted attribute value state without consuming the current character.
Single-quoted attribute value state
If the current character is...
‘'’: Consume the current character. Switch to the before attribute name state.
‘&’: Consume the character and switch to the character reference state, with the return state set to the single-quoted attribute value state and the emitting operation being to append the given character to the value of the most recently added attribute.
Anything else: Append the current character to the value of the most recently added attribute. Consume the current character. Stay in this state.
Double-quoted attribute value state
If the current character is...
‘"’: Consume the current character. Switch to the before attribute name state.
‘&’: Consume the character and switch to the character reference state, with the return state set to the double-quoted attribute value state and the emitting operation being to append the given character to the value of the most recently added attribute.
Anything else: Append the current character to the value of the most recently added attribute. Consume the current character. Stay in this state.
Unquoted attribute value state
If the current character is...
U+0020, U+000A: Consume the current character. Switch to the before attribute name state.
‘>’: Consume the current character. Switch to the after tag state.
‘&’: Consume the character and switch to the character reference state, with the return state set to the unquoted attribute value state, and the emitting operation being to append the given character to the value of the most recently added attribute.
Anything else: Append the current character to the value of the most recently added attribute. Consume the current character. Stay in this state.
After tag state
Emit the tag token.
If the tag token was a start tag token and the tag name was ‘script’, then switch to the script raw data state.
If the tag token was a start tag token and the tag name was ‘style’, then switch to the style raw data state.
Otherwise, switch to the data state.
After void tag state
Emit the tag token.
If the tag token is a start tag token, emit an end tag token with the same tag name.
Switch to the data state.
Comment start 1 state
If the current character is...
‘-’: Consume the character and switch to the comment start 2 state.
Anything else: Emit character tokens for ‘<!’. Switch to the data state without consuming the current character.
Comment start 2 state
If the current character is...
‘-’: Consume the character and switch to the comment state.
Anything else: Emit character tokens for ‘<!-’. Switch to the data state without consuming the current character.
Comment state
If the current character is...
‘-’: Consume the character and switch to the comment end 1 state.
Anything else: Consume the character and stay in this state.
Comment end 1 state
If the current character is...
‘-’: Consume the character, switch to the comment end 2 state.
Anything else: Consume the character, and switch to the comment state.
Comment end 2 state
If the current character is...
‘>’: Consume the character and switch to the data state.
‘-’: Consume the character, but stay in this state.
Anything else: Consume the character, and switch to the comment state.
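The three comment states (comment, comment end 1, comment end 2) can be sketched as a loop that counts consecutive ‘-’ characters, capped at two. The function name `skip_comment` is an assumption; `pos` is taken to be the index just after the opening ‘<!--’. No tokens are produced, since the comment states above only consume characters.

```python
def skip_comment(text: str, pos: int):
    """Return the index just after the closing '>' or None if unterminated."""
    dashes = 0  # 0 = comment, 1 = comment end 1, 2 = comment end 2
    while pos < len(text):
        ch = text[pos]
        pos += 1  # every branch consumes the character
        if ch == "-":
            dashes = min(dashes + 1, 2)  # extra dashes stay in end 2
        elif ch == ">" and dashes == 2:
            return pos                   # back to the data state
        else:
            dashes = 0                   # back to the comment state
    return None
```

In effect the comment ends at the first ‘-->’ found after the opening delimiter, including when additional dashes precede it.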
Character reference state
Let raw value be the string ‘&’.
Append the current character to raw value.
If the current character is...
‘#’: Consume the character, and switch to the numeric character reference state.
‘0’..‘9’, ‘a’..‘z’, ‘A’..‘Z’: Switch to the named character reference state without consuming the current character.
Anything else: Run the emitting operation for all but the last character in raw value, and switch to the return state without consuming the current character.
Numeric character reference state
Append the current character to raw value.
If the current character is...
‘x’, ‘X’: Consume the character and switch to the before hexadecimal numeric character reference state.
‘0’..‘9’: Let value be the numeric value of the current character interpreted as a decimal digit, consume the character, and switch to the decimal numeric character reference state.
Anything else: Run the emitting operation for all but the last character in raw value, and switch to the return state without consuming the current character.
Before hexadecimal numeric character reference state
Append the current character to raw value.
If the current character is...
‘0’..‘9’, ‘a’..‘f’, ‘A’..‘F’: Let value be the numeric value of the current character interpreted as a hexadecimal digit, consume the character, and switch to the hexadecimal numeric character reference state.
Anything else: Run the emitting operation for all but the last character in raw value, and switch to the return state without consuming the current character.
Hexadecimal numeric character reference state
Append the current character to raw value.
If the current character is...
‘0’..‘9’, ‘a’..‘f’, ‘A’..‘F’: Let value be sixteen times value plus the numeric value of the current character interpreted as a hexadecimal digit. Consume the character. Stay in this state.
‘;’: Consume the character. If value is between 0x0001 and 0x10FFFF inclusive, but is not between 0xD800 and 0xDFFF inclusive, run the emitting operation with a Unicode character having the scalar value value; otherwise, run the emitting operation with the character U+FFFD. Then, in either case, switch to the return state.
Anything else: Run the emitting operation for all but the last character in raw value, and switch to the return state without consuming the current character.
Decimal numeric character reference state
Append the current character to raw value.
If the current character is...
‘0’..‘9’: Let value be ten times value plus the numeric value of the current character interpreted as a decimal digit. Consume the character. Stay in this state.
‘;’: Consume the character. If value is between 0x0001 and 0x10FFFF inclusive, but is not between 0xD800 and 0xDFFF inclusive, run the emitting operation with a Unicode character having the scalar value value; otherwise, run the emitting operation with the character U+FFFD. Then, in either case, switch to the return state.
Anything else: Run the emitting operation for all but the last character in raw value, and switch to the return state without consuming the current character.
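The value accumulation and the range check applied at ‘;’ can be sketched as follows. The function names `numeric_reference_char` and `parse_numeric_reference` are assumptions for this example, and the second function takes only the digits, without the leading ‘&#’ or ‘&#x’ and the trailing ‘;’.

```python
def numeric_reference_char(value: int) -> str:
    # Valid scalar values are 0x0001..0x10FFFF, excluding surrogates.
    if 0x0001 <= value <= 0x10FFFF and not (0xD800 <= value <= 0xDFFF):
        return chr(value)
    return "\ufffd"


def parse_numeric_reference(digits: str, base: int) -> str:
    value = 0
    for digit in digits:
        # "ten (or sixteen) times value plus the digit's numeric value"
        value = base * value + int(digit, 16)
    return numeric_reference_char(value)
```

Python integers do not overflow, so an arbitrarily long digit run simply yields a value outside the valid range and maps to U+FFFD; an implementation with fixed-width integers would need to guard against overflow separately.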
Named character reference state
Append the current character to raw value.
If the current character is...
‘;’: Consume the character. If the raw value is...
‘&amp;’: Run the emitting operation for the character ‘&’.
‘&apos;’: Run the emitting operation for the character ‘'’.
‘&gt;’: Run the emitting operation for the character ‘>’.
‘&lt;’: Run the emitting operation for the character ‘<’.
‘&quot;’: Run the emitting operation for the character ‘"’.
Then, switch to the return state.
‘0’..‘9’, ‘a’..‘z’, ‘A’..‘Z’: Consume the character and stay in this state.
Anything else: Run the emitting operation for all but the last character in raw value, and switch to the return state without consuming the current character.
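The lookup performed at ‘;’ in the named character reference state can be sketched as a small table. This assumes the five raw values are the familiar `&amp;`, `&apos;`, `&gt;`, `&lt;` and `&quot;`, which is inferred from the characters they produce. The names `NAMED_REFERENCES` and `resolve_named_reference` are hypothetical, and returning `None` for an unknown name is a choice made for this sketch only.

```python
# Assumed raw values, each including the leading '&' and trailing ';'.
NAMED_REFERENCES = {
    "&amp;": "&",
    "&apos;": "'",
    "&gt;": ">",
    "&lt;": "<",
    "&quot;": '"',
}


def resolve_named_reference(raw_value: str):
    # Returns the character to pass to the emitting operation, or None
    # when the raw value is not one of the five known names.
    return NAMED_REFERENCES.get(raw_value)
```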
Replace each sequence of character tokens with a single string token whose value is the concatenation of all the characters in the character tokens.
For each start tag token, remove all but the first name/value pair for each name (i.e. remove duplicate attributes, keeping only the first one).
TODO(ianh): maybe sort the attributes?
For each end tag token, remove the attributes entirely.
If the token is a start tag token, notify the JavaScript token stream callback of the token.
Then, pass the tokens to the tree construction stage.
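A minimal sketch of the cleanup rules above, using plain tuples for tokens: `("char", c)`, `("start", name, attrs)` and `("end", name, attrs)`, where `attrs` is a list of name/value pairs. This representation and the function name `clean_up` are assumptions for the example, and the notification of the JavaScript token stream callback is left out.

```python
def clean_up(tokens):
    out = []
    run = []  # pending characters from consecutive character tokens

    def flush():
        if run:
            out.append(("string", "".join(run)))
            run.clear()

    for token in tokens:
        if token[0] == "char":
            run.append(token[1])
            continue
        flush()
        kind, name, attrs = token
        if kind == "start":
            first = {}
            for key, value in attrs:
                first.setdefault(key, value)  # keep only the first pair
            out.append(("start", name, list(first.items())))
        else:
            out.append(("end", name, []))     # end tags lose attributes
    flush()
    return out
```

The dictionary preserves insertion order, so the surviving attributes stay in the order in which their names first appeared.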
To construct a node tree from a sequence of tokens and a document document:
[…] ‘t’ element on the stack of open nodes, then skip the token.
[…] ‘template’ element, then:
[…] ‘DocumentFragment’ object that the ‘template’ element uses as its template contents container.
[…] ‘import’ element, then:
Let url be the value of node's ‘src’ attribute.
[…] parsing context's importModule() method, passing it url.
[…] ‘as’ attribute, associate the entry with that name.
[…] ‘template’ element in the stack of open nodes above node, then skip this token.
[…] ‘script’, then yield until imported modules contains no entries with unresolved promises, then execute the script given by the element’s contents, using the associated names as appropriate.
[…] ‘load’ event at the parsing context object.