
 Parsing
 =======

 Parsing in Sky is a strict pipeline consisting of four stages:

 - decoding, which converts incoming bytes into Unicode characters
   using UTF-8

 - normalising, which converts certain characters and character
   sequences (line endings and U+0000) into a standard form

 - tokenising, which converts these characters into tokens

 - tree construction, which converts these tokens into a tree of nodes

 Later stages cannot affect earlier stages.

 When a sequence of bytes is to be parsed, there is always a defined
 _parsing context_, which is either "application" or "module".


 Decoding stage
 --------------

 To decode a sequence of bytes _bytes_ for parsing, the [UTF-8
 decoder](https://encoding.spec.whatwg.org/#utf-8-decoder) must be used
 to transform _bytes_ into a sequence of characters _characters_.

 This sequence must then be passed to the normalisation stage.
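
As a rough illustration of this stage, the sketch below uses Python's built-in UTF-8 codec. The function name `decode_for_parsing` is hypothetical, and it is an assumption of this sketch that Python's `errors="replace"` handling is close enough to the referenced UTF-8 decoder for illustration; a conforming implementation should follow the Encoding specification itself.

```python
def decode_for_parsing(data: bytes) -> str:
    """Decode bytes as UTF-8 (hypothetical helper, not part of the spec)."""
    # errors="replace" substitutes U+FFFD for malformed byte sequences
    # instead of raising UnicodeDecodeError.
    return data.decode("utf-8", errors="replace")
```

The resulting string is what the normalisation stage would then receive.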


 Normalisation stage
 -------------------

 To normalise a sequence of characters, apply the following rules:

 * Any U+000D character followed by a U+000A character must be removed.

 * Any U+000D character not followed by a U+000A character must be
   converted to a U+000A character.

 * Any U+0000 character must be converted to a U+FFFD character.

 The converted sequence of characters must then be passed to the
 tokenisation stage.
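
The three rules above can be sketched as a chain of string replacements. This is only an illustration; the function name `normalise` is an assumption, and the order of the first two replacements matters so that a U+000D U+000A pair collapses to a single U+000A.

```python
def normalise(characters: str) -> str:
    """Apply the three normalisation rules (illustrative sketch)."""
    # Rule 1: drop a U+000D that is followed by U+000A.
    characters = characters.replace("\r\n", "\n")
    # Rule 2: every remaining U+000D is not followed by U+000A.
    characters = characters.replace("\r", "\n")
    # Rule 3: U+0000 becomes U+FFFD.
    return characters.replace("\x00", "\ufffd")
```

For example, `normalise("a\r\nb\rc")` yields `"a\nb\nc"`.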


 Tokenisation stage
 ------------------

 To tokenise a sequence of characters, a state machine is used.

 Initially, the state machine must begin in the **signature** state.

 Each character in turn must be processed according to the rules of the
 state at the time the character is processed. A character is processed
 once it has been _consumed_. This produces a stream of tokens; the
 tokens must be passed to the tree construction stage.

 When the last character is consumed, the tokeniser ends.


 ### Expecting a string ###

 When the user agent is to _expect a string_, it must run these steps:

 1. Let _expectation_ be the string to expect. When this string is
    indexed, the first character has index 0.

 2. Assertion: The first character in _expectation_ is the current
    character, and _expectation_ has more than one character.

 3. Consume the current character.

 4. Let _index_ be 1.

 5. Let _success_ and _failure_ be the states specified for success and
    failure respectively.

 6. Switch to the **expect a string** state.


 ### Tokeniser states ###

 #### **Signature** state ####

 If the current character is...

 * '```#```': If the _parsing context_ is not "application", switch to
   the _failed signature_ state. Otherwise, expect the string
   "```#!mojo mojo:sky```", with _after signature_ as the _success_
   state and _failed signature_ as the _failure_ state.

 * '```S```': If the _parsing context_ is not "module", switch to the
   _failed signature_ state. Otherwise, expect the string
   "```SKY MODULE```", with _after signature_ as the _success_ state,
   and _failed signature_ as the _failure_ state.

 * Anything else: Switch to the **failed signature** state.


 #### **Expect a string** state ####

 If the current character is not the same as the <i>index</i>th character in
 _expectation_, then switch to the _failure_ state.

 Otherwise, consume the character, and increase _index_. If _index_ is
 now equal to the length of _expectation_, then switch to the _success_
 state.


 #### **After signature** state ####

 If the current character is...

 * U+000A: Consume the character and switch to the **data** state.
 * U+0020: Consume the character and switch to the **consume rest of
   line** state.
 * Anything else: Switch to the **failed signature** state.


 #### **Failed signature** state ####

 Stop parsing. No tokens are emitted. The file is not a Sky file.


 #### **Consume rest of line** state ####

 If the current character is...

 * U+000A: Consume the character and switch to the **data** state.
 * Anything else: Consume the character and stay in this state.
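
Taken together, the signature-related states above amount to a prefix check followed by skipping the rest of the first line. The sketch below is one possible reading, not a normative algorithm. The names `SIGNATURES` and `skip_signature` are hypothetical, and treating input that ends before the **data** state is reached as a failure is an assumption of this sketch (the text above does not say what happens in that case).

```python
from typing import Optional

# Expected signature string for each parsing context.
SIGNATURES = {
    "application": "#!mojo mojo:sky",
    "module": "SKY MODULE",
}

def skip_signature(characters: str, context: str) -> Optional[int]:
    """Return the index where the data state begins, or None on failure."""
    expectation = SIGNATURES[context]
    # Signature + expect-a-string states: the input must start with
    # exactly the string expected for this parsing context.
    if not characters.startswith(expectation):
        return None
    index = len(expectation)
    if index >= len(characters):
        return None  # assumption: truncated input counts as failure
    # After-signature state.
    if characters[index] == "\n":
        return index + 1
    if characters[index] == " ":
        # Consume-rest-of-line state: skip up to and including U+000A.
        newline = characters.find("\n", index)
        return None if newline == -1 else newline + 1
    return None
```

With this sketch, `skip_signature("SKY MODULE v1\n<a>", "module")` skips the whole first line, while the same input in the "application" context fails.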


 ### **Data** state ###

 If the current character is...

 * '```<```': Consume the character and switch to the **tag open** state.

 * '```&```': Consume the character and switch to the **character
   reference** state, with the _return state_ set to the **data**
   state, the _extra terminating character_ unset (or set to U+0000,
   which has the same effect), and the _emitting operation_ being to
   emit a character token for the given character.

 * Anything else: Emit the current input character as a character
   token. Consume the character. Stay in this state.


 ### **Script raw data** state ###

 If the current character is...

 * '```<```': Consume the character and switch to the **script raw
   data: close 1** state.

 * Anything else: Emit the current input character as a character
   token. Consume the character. Stay in this state.


 ### **Script raw data: close 1** state ###

 If the current character is...

 * '```/```': Consume the character and switch to the **script raw
   data: close 2** state.

 * Anything else: Emit '```<```' character tokens. Consume the
   character. Switch to the **script raw data** state.


 ### **Script raw data: close 2** state ###

 If the current character is...

 * '```s```': Consume the character and switch to the **script raw
   data: close 3** state.

 * Anything else: Emit '```</```' character tokens. Consume the
   character. Switch to the **script raw data** state.


 ### **Script raw data: close 3** state ###

 If the current character is...

 * '```c```': Consume the character and switch to the **script raw
   data: close 4** state.

 * Anything else: Emit '```</s```' character tokens. Consume the
   character. Switch to the **script raw data** state.


 ### **Script raw data: close 4** state ###

 If the current character is...

 * '```r```': Consume the character and switch to the **script raw
   data: close 5** state.

 * Anything else: Emit '```</sc```' character tokens. Consume the
   character. Switch to the **script raw data** state.


 ### **Script raw data: close 5** state ###

 If the current character is...

 * '```i```': Consume the character and switch to the **script raw
   data: close 6** state.

 * Anything else: Emit '```</scr```' character tokens. Consume the
   character. Switch to the **script raw data** state.


 ### **Script raw data: close 6** state ###

 If the current character is...

 * '```p```': Consume the character and switch to the **script raw
   data: close 7** state.

 * Anything else: Emit '```</scri```' character tokens. Consume the
   character. Switch to the **script raw data** state.


 ### **Script raw data: close 7** state ###

 If the current character is...

 * '```t```': Consume the character and switch to the **script raw
   data: close 8** state.

 * Anything else: Emit '```</scrip```' character tokens. Consume the
   character. Switch to the **script raw data** state.


 ### **Script raw data: close 8** state ###

 If the current character is...

 * U+0020, U+000A, '```/```', '```>```': Create an end tag token, and
   let its tag name be the string '```script```'. Switch to the
   **before attribute name** state without consuming the character.

 * Anything else: Emit '```</script```' character tokens. Consume the
   character. Switch to the **script raw data** state.


 ### **Style raw data** state ###

 If the current character is...

 * '```<```': Consume the character and switch to the **style raw
   data: close 1** state.

 * Anything else: Emit the current input character as a character
   token. Consume the character. Stay in this state.


 ### **Style raw data: close 1** state ###

 If the current character is...

 * '```/```': Consume the character and switch to the **style raw
   data: close 2** state.

 * Anything else: Emit '```<```' character tokens. Consume the
   character. Switch to the **style raw data** state.


 ### **Style raw data: close 2** state ###

 If the current character is...

 * '```s```': Consume the character and switch to the **style raw
   data: close 3** state.

 * Anything else: Emit '```</```' character tokens. Consume the
   character. Switch to the **style raw data** state.


 ### **Style raw data: close 3** state ###

 If the current character is...

 * '```t```': Consume the character and switch to the **style raw
   data: close 4** state.

 * Anything else: Emit '```</s```' character tokens. Consume the
   character. Switch to the **style raw data** state.


 ### **Style raw data: close 4** state ###

 If the current character is...

 * '```y```': Consume the character and switch to the **style raw
   data: close 5** state.

 * Anything else: Emit '```</st```' character tokens. Consume the
   character. Switch to the **style raw data** state.


 ### **Style raw data: close 5** state ###

 If the current character is...

 * '```l```': Consume the character and switch to the **style raw
   data: close 6** state.

 * Anything else: Emit '```</sty```' character tokens. Consume the
   character. Switch to the **style raw data** state.


 ### **Style raw data: close 6** state ###

 If the current character is...

 * '```e```': Consume the character and switch to the **style raw
   data: close 7** state.

 * Anything else: Emit '```</styl```' character tokens. Consume the
   character. Switch to the **style raw data** state.


 ### **Style raw data: close 7** state ###

 If the current character is...

 * U+0020, U+000A, '```/```', '```>```': Create an end tag token, and
   let its tag name be the string '```style```'. Switch to the
   **before attribute name** state without consuming the character.

 * Anything else: Emit '```</style```' character tokens. Consume the
   character. Switch to the **style raw data** state.


 ### **Tag open** state ###

 If the current character is...

 * '```!```': Consume the character and switch to the **comment start
   1** state.

 * '```/```': Consume the character and switch to the **close tag**
   state.

 * '```>```': Emit character tokens for '```<>```'. Consume the current
   character. Switch to the **data** state.

 * '```0```'..'```9```', '```a```'..'```z```', '```A```'..'```Z```',
   '```-```', '```_```', '```.```': Create a start tag token, let its
   tag name be the current character, consume the current character and
   switch to the **tag name** state.

 * Anything else: Emit the character token for '```<```'. Switch to the
   **data** state without consuming the current character.


 ### **Close tag** state ###

 If the current character is...

 * '```>```': Emit character tokens for '```</>```'. Consume the current
   character. Switch to the **data** state.

 * '```0```'..'```9```', '```a```'..'```z```', '```A```'..'```Z```',
   '```-```', '```_```', '```.```': Create an end tag token, let its
   tag name be the current character, consume the current character and
   switch to the **tag name** state.

 * Anything else: Emit the character tokens for '```</```'. Switch to
   the **data** state without consuming the current character.


 ### **Tag name** state ###

 If the current character is...

 * U+0020, U+000A: Consume the current character. Switch to the
   **before attribute name** state.

 * '```/```': Consume the current character. Switch to the **void tag**
   state.

 * '```>```': Consume the current character. Switch to the **after
   tag** state.

 * Anything else: Append the current character to the tag name, and
   consume the current character. Stay in this state.


 ### **Void tag** state ###

 If the current character is...

 * '```>```': Consume the current character. Switch to the **after
   tag** state.

 * Anything else: Switch to the **before attribute name** state without
   consuming the current character.


 ### **Before attribute name** state ###

 If the current character is...

 * U+0020, U+000A: Consume the current character. Stay in this state.

 * '```/```': Consume the current character. Switch to the **void tag**
   state.

 * '```>```': Consume the current character. Switch to the **after
   tag** state.

 * Anything else: Create a new attribute in the tag token, and set its
   name to the current character. Consume the current character. Switch
   to the **attribute name** state.


 ### **Attribute name** state ###

 If the current character is...

 * U+0020, U+000A: Consume the current character. Switch to the **after
   attribute name** state.

 * '```/```': Consume the current character. Switch to the **void tag**
   state.

 * '```=```': Consume the current character. Switch to the **before
   attribute value** state.

 * '```>```': Consume the current character. Switch to the **after
   tag** state.

 * Anything else: Append the current character to the most recently
   added attribute's name, and consume the current character. Stay in
   this state.


 ### **After attribute name** state ###

 If the current character is...

 * U+0020, U+000A: Consume the current character. Stay in this state.

 * '```/```': Consume the current character. Switch to the **void tag**
   state.

 * '```=```': Consume the current character. Switch to the **before
   attribute value** state.

 * '```>```': Consume the current character. Switch to the **after
   tag** state.

 * Anything else: Create a new attribute in the tag token, and set its
   name to the current character. Consume the current character. Switch
   to the **attribute name** state.


 ### **Before attribute value** state ###

 If the current character is...

 * U+0020, U+000A: Consume the current character. Stay in this state.

 * '```>```': Consume the current character. Switch to the **after
   tag** state.

 * '```'```': Consume the current character. Switch to the
   **single-quoted attribute value** state.

 * '```"```': Consume the current character. Switch to the
   **double-quoted attribute value** state.

 * Anything else: Set the value of the most recently added attribute to
   the current character. Consume the current character. Switch to the
   **unquoted attribute value** state.


 ### **Single-quoted attribute value** state ###

 If the current character is...

 * '```'```': Consume the current character. Switch to the
   **before attribute name** state.

 * '```&```': Consume the character and switch to the **character
   reference** state, with the _return state_ set to the
   **single-quoted attribute value** state, the _extra terminating
   character_ set to '```'```', and the _emitting operation_ being to
   append the given character to the value of the most recently added
   attribute.

 * Anything else: Append the current character to the value of the most
   recently added attribute. Consume the current character. Stay in
   this state.


 ### **Double-quoted attribute value** state ###

 If the current character is...

 * '```"```': Consume the current character. Switch to the
   **before attribute name** state.

 * '```&```': Consume the character and switch to the **character
   reference** state, with the _return state_ set to the
   **double-quoted attribute value** state, the _extra terminating
   character_ set to '```"```', and the _emitting operation_ being to
   append the given character to the value of the most recently added
   attribute.

 * Anything else: Append the current character to the value of the most
   recently added attribute. Consume the current character. Stay in
   this state.


 ### **Unquoted attribute value** state ###

 If the current character is...

 * U+0020, U+000A: Consume the current character. Switch to the
   **before attribute name** state.

 * '```>```': Consume the current character. Switch to the **after
   tag** state.

 * '```&```': Consume the character and switch to the **character
   reference** state, with the _return state_ set to the **unquoted
   attribute value** state, the _extra terminating character_ unset (or
   set to U+0000, which has the same effect), and the _emitting
   operation_ being to append the given character to the value of the
   most recently added attribute.

 * Anything else: Append the current character to the value of the most
   recently added attribute. Consume the current character. Stay in
   this state.


 ### **After tag** state ###

 Emit the tag token.

 If the tag token was a start tag token and the tag name was
 '```script```', then switch to the **script raw data** state.

 Otherwise, if the tag token was a start tag token and the tag name was
 '```style```', then switch to the **style raw data** state.

 Otherwise, switch to the **data** state.


 ### **Comment start 1** state ###

 If the current character is...

 * '```-```': Consume the character and switch to the **comment start
   2** state.

 * '```>```': Emit character tokens for '```<!>```'. Consume the
   current character. Switch to the **data** state.


 ### **Comment start 2** state ###

 If the current character is...

 * '```-```': Consume the character and switch to the **comment**
   state.

 * '```>```': Emit character tokens for '```<!->```'. Consume the
   current character. Switch to the **data** state.


 ### **Comment** state ###

 If the current character is...

 * '```-```': Consume the character and switch to the **comment end 1**
   state.

 * Anything else: Consume the character and stay in this state.


 ### **Comment end 1** state ###

 If the current character is...

 * '```-```': Consume the character, switch to the **comment end 2**
   state.

 * Anything else: Consume the character, and switch to the **comment**
   state.


 ### **Comment end 2** state ###

 If the current character is...

 * '```>```': Consume the character and switch to the **data** state.

 * '```-```': Consume the character, but stay in this state.

 * Anything else: Consume the character, and switch to the **comment**
   state.
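
The three comment states can be sketched as a small loop that mirrors them one-to-one. The function name `skip_comment` and the choice to return the input length for an unterminated comment are assumptions made for illustration. No tokens are produced, since these states emit nothing.

```python
def skip_comment(characters: str, start: int) -> int:
    """Run the comment states from `start` (the first character after
    the opening '<!--'). Return the index just past the closing '>',
    or len(characters) if the comment is never closed."""
    state = "comment"
    index = start
    while index < len(characters):
        current = characters[index]
        index += 1  # every branch in these states consumes the character
        if state == "comment":
            if current == "-":
                state = "comment end 1"
        elif state == "comment end 1":
            state = "comment end 2" if current == "-" else "comment"
        else:  # comment end 2
            if current == ">":
                return index  # back to the data state
            if current != "-":
                state = "comment"
    return len(characters)
```

Because the **comment end 2** state stays put on additional hyphens, a run such as `--->` also closes the comment.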


 ### **Character reference** state ###

 Let _raw value_ be the string '```&```'.

 Append the current character to _raw value_.

 If the current character is...

 * '```#```': Consume the character, and switch to the **numeric
   character reference** state.

 * '```l```': Consume the character and switch to the **named character
   reference L** state.

 * '```a```': Consume the character and switch to the **named character
   reference A** state.

 * '```g```': Consume the character and switch to the **named character
   reference G** state.

 * '```q```': Consume the character and switch to the **named character
   reference Q** state.

 * Any other character in the range '```0```'..'```9```',
   '```a```'..'```f```', '```A```'..'```F```': Consume the character
   and switch to the **bad named character reference** state.

 * Anything else: Run the _emitting operation_ for all but the last
   character in _raw value_, and switch to the **data** state without
   consuming the current character.


 ### **Numeric character reference** state ###

 Append the current character to _raw value_.

 If the current character is...

 * '```x```', '```X```': Let _value_ be zero, consume the character,
   and switch to the **hexadecimal numeric character reference** state.

 * '```0```'..'```9```': Let _value_ be the numeric value of the
   current character interpreted as a decimal digit, consume the
   character, and switch to the **decimal numeric character reference**
   state.

 * Anything else: Run the _emitting operation_ for all but the last
   character in _raw value_, and switch to the **data** state without
   consuming the current character.


 ### **Hexadecimal numeric character reference** state ###

 Append the current character to _raw value_.

 If the current character is...

 * '```0```'..'```9```', '```a```'..'```f```', '```A```'..'```F```':
   Let _value_ be sixteen times _value_ plus the numeric value of the
   current character interpreted as a hexadecimal digit. Consume the
   character and stay in this state.

 * '```;```': Consume the character. If _value_ is between 0x0001 and
   0x10FFFF inclusive, but is not between 0xD800 and 0xDFFF inclusive,
   run the _emitting operation_ with a Unicode character having the
   scalar value _value_; otherwise, run the _emitting operation_ with
   the character U+FFFD. Then, in either case, switch to the _return
   state_.

 * Anything else: Run the _emitting operation_ for all but the last
   character in _raw value_, and switch to the **data** state without
   consuming the current character.


 ### **Decimal numeric character reference** state ###

 Append the current character to _raw value_.

 If the current character is...

 * '```0```'..'```9```': Let _value_ be ten times _value_ plus the
   numeric value of the current character interpreted as a decimal
   digit. Consume the character and stay in this state.

 * '```;```': Consume the character. If _value_ is between 0x0001 and
   0x10FFFF inclusive, but is not between 0xD800 and 0xDFFF inclusive,
   run the _emitting operation_ with a Unicode character having the
   scalar value _value_; otherwise, run the _emitting operation_ with
   the character U+FFFD. Then, in either case, switch to the _return
   state_.

 * Anything else: Run the _emitting operation_ for all but the last
   character in _raw value_, and switch to the **data** state without
   consuming the current character.
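
The arithmetic shared by the hexadecimal and decimal states can be sketched as follows. The helper names `accumulate` and `numeric_reference_character` are hypothetical. The sketch covers only how _value_ is built up digit by digit and how it is mapped to a character when '```;```' is seen, not the surrounding state transitions.

```python
REPLACEMENT_CHARACTER = "\ufffd"

def accumulate(digits: str, base: int) -> int:
    """Fold digits into a value the way the numeric states do."""
    value = 0
    for digit in digits:
        # "Let value be <base> times value plus the numeric value of
        # the current character."
        value = base * value + int(digit, base)
    return value

def numeric_reference_character(value: int) -> str:
    """Map an accumulated value to the character to emit."""
    in_range = 0x0001 <= value <= 0x10FFFF
    is_surrogate = 0xD800 <= value <= 0xDFFF
    if in_range and not is_surrogate:
        return chr(value)
    return REPLACEMENT_CHARACTER
```

Under these rules, `&#x41;` and `&#65;` both produce "A", while `&#0;` and `&#xD800;` produce U+FFFD.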


 ### **Named character reference L** state ###

 Append the current character to _raw value_.

 If the current character is...

 * '```t```': Let _character_ be '```<```', consume the current
   character, and switch to the **after named character reference**
   state.

 * Anything else: Switch to the _bad named character reference_ state
   without consuming the character.


 ### **Named character reference A** state ###

 Append the current character to _raw value_.

 If the current character is...

 * '```p```': Consume the current character and switch to the **named
   character reference AP** state.

 * '```m```': Consume the current character and switch to the **named
   character reference AM** state.

 * Anything else: Switch to the _bad named character reference_ state
   without consuming the character.


 ### **Named character reference AM** state ###

 Append the current character to _raw value_.

 If the current character is...

 * '```p```': Let _character_ be '```&```', consume the current
   character, and switch to the **after named character reference**
   state.

 * Anything else: Switch to the _bad named character reference_ state
   without consuming the character.


 ### **Named character reference AP** state ###

 Append the current character to _raw value_.

 If the current character is...

 * '```o```': Consume the current character and switch to the **named
   character reference APO** state.

 * Anything else: Switch to the _bad named character reference_ state
   without consuming the character.


 ### **Named character reference APO** state ###

 Append the current character to _raw value_.

 If the current character is...

 * '```s```': Let _character_ be '```'```', consume the current
   character, and switch to the **after named character reference**
   state.

 * Anything else: Switch to the _bad named character reference_ state
   without consuming the character.


 ### **Named character reference G** state ###

 Append the current character to _raw value_.

 If the current character is...

 * '```t```': Let _character_ be '```>```', consume the current
   character, and switch to the **after named character reference**
   state.

 * Anything else: Switch to the _bad named character reference_ state
   without consuming the character.


 ### **Named character reference Q** state ###

 Append the current character to _raw value_.

 If the current character is...

 * '```u```': Consume the current character and switch to the **named
   character reference QU** state.

 * Anything else: Switch to the _bad named character reference_ state
   without consuming the character.


 ### **Named character reference QU** state ###

 Append the current character to _raw value_.

 If the current character is...

 * '```o```': Consume the current character and switch to the **named
   character reference QUO** state.

 * Anything else: Switch to the _bad named character reference_ state
   without consuming the character.


 ### **Named character reference QUO** state ###

 Append the current character to _raw value_.

 If the current character is...

 * '```t```': Let _character_ be '```"```', consume the current
   character, and switch to the **after named character reference**
   state.

 * Anything else: Switch to the _bad named character reference_ state
   without consuming the character.


 ### **After named character reference** state ###

 Append the current character to _raw value_.

 If the current character is...

 * '```;```': Consume the character. Run the _emitting operation_ with
   the character _character_. Switch to the _return state_.

 * The _extra terminating character_: Run the _emitting operation_ with
   the character U+FFFD. Switch to the _return state_ without consuming
   the current character.

 * Anything else: Switch to the _bad named character reference_ state
   without consuming the current character.
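
The named character reference states above recognise exactly five names. A table-driven sketch of their successful outcomes might look like the following. The names `NAMED_REFERENCES` and `resolve_named_reference` are assumptions, and the sketch returns `None` for every case that the state machine hands to the **bad named character reference** state, which is not modelled here.

```python
from typing import Optional

# The five names recognised by the named character reference states.
NAMED_REFERENCES = {
    "lt": "<",
    "gt": ">",
    "amp": "&",
    "apos": "'",
    "quot": '"',
}

def resolve_named_reference(
    name: str,
    next_character: str,
    extra_terminating_character: Optional[str] = None,
) -> Optional[str]:
    """Return the character to emit, or None if the bad-reference
    handling (not modelled in this sketch) would take over."""
    character = NAMED_REFERENCES.get(name)
    if character is None:
        return None
    if next_character == ";":
        return character
    if (extra_terminating_character is not None
            and next_character == extra_terminating_character):
        # A complete name cut short by the terminator yields U+FFFD.
        return "\ufffd"
    return None
```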


 ### **Bad named character reference** state ###

 Append the current character to _raw value_.

 If the current character is...

 * '```;```': Consume the character. Run the _emitting operation_ with
   the character U+FFFD. Switch to the _return state_.

 * The _extra terminating character_: Switch to the _return state_
   without consuming the current character.

 * Any other character in the range '```0```'..'```9```',
   '```a```'..'```f```', '```A```'..'```F```': Consume the character
   and stay in this state.

 * Anything else: Run the _emitting operation_ for all but the last
   character in _raw value_, and switch to the **data** state without
   consuming the current character.


 Tree construction
 -----------------

 To construct a node tree from a _sequence of tokens_ and a document _document_:

 1. Initialize the _stack of open nodes_ to be _document_.
 2. Consider each token _token_ in the _sequence of tokens_ in turn.
    - If _token_ is a text token,
      1. Create a text node _node_ with character data _token.data_.
      2. Append _node_ to the top node in the _stack of open nodes_.
    - If _token_ is a start tag token,
      1. Create an element _node_ with tag name _token.tagName_ and attributes
         _token.attributes_.
      2. Append _node_ to the top node in the _stack of open nodes_.
      3. If the _token.selfClosing_ flag is not set, push _node_ onto the
         _stack of open nodes_.
      4. If _token.tagName_ is _script_, TODO: Execute the script.
    - If _token_ is an end tag token,
      1. If the _stack of open nodes_ contains a node whose _tagName_ is
         _token.tagName_,
         - Pop nodes from the _stack of open nodes_ until a node with
           a _tagName_ equal to _token.tagName_ has been popped.
      2. Otherwise, ignore _token_.
    - If _token_ is a comment token,
      1. Ignore _token_.
    - If _token_ is an EOF token,
      1. Pop all the nodes from the _stack of open nodes_.
      2. Signal _document_ that parsing is complete.
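
The steps above can be sketched as a stack-driven loop. Everything about the data model here is an assumption made for illustration: tokens are plain dictionaries with a `"type"` key, `Node` is a hypothetical class standing in for documents, elements, and text nodes, and script execution and the "parsing is complete" signal are left out.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """Hypothetical node type; tag_name is None for documents and text."""
    tag_name: Optional[str] = None
    data: Optional[str] = None
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

def construct_tree(tokens, document: Node) -> Node:
    stack = [document]  # the stack of open nodes
    for token in tokens:
        kind = token["type"]
        if kind == "text":
            stack[-1].children.append(Node(data=token["data"]))
        elif kind == "start":
            node = Node(tag_name=token["tagName"],
                        attributes=dict(token.get("attributes", {})))
            stack[-1].children.append(node)
            if not token.get("selfClosing", False):
                stack.append(node)
        elif kind == "end":
            # Only pop if a matching open node exists; otherwise ignore.
            if any(n.tag_name == token["tagName"] for n in stack):
                while stack.pop().tag_name != token["tagName"]:
                    pass
        elif kind == "comment":
            continue  # comment tokens are ignored
        elif kind == "eof":
            stack.clear()  # pop all the nodes
            break
    return document
```

Note that an end tag closes every node opened after its matching start tag, so `<a><b>hi</a>` leaves both `a` and `b` closed.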

 TODO(ianh): &lt;template>, &lt;t>