# GenericLexer
The generic lexer aims at solving the performance issues of the Regex Lexer. The idea is to start from a limited set of classical lexemes and to refine this set to fit your needs.
Those lexemes are recognized through a Finite State Machine, which is far more efficient than looping through a set of regexes.
The lexer can be configured with a `[Lexer]` attribute, available from version 2.4.0.6. The `[Lexer]` attribute has several properties:
- `IgnoreWS`: ignore whitespace characters. If `false`, any whitespace occurring in the lexed text must be explicitly handled in the lexer. Default is `true`.
- `IgnoreEOL`: ignore end-of-line characters. If `false`, any end-of-line characters occurring in the lexed text must be explicitly handled in the lexer. Default is `true`.
- `WhiteSpace`: an array of characters that are considered whitespace if `IgnoreWS` is `true`. Default is `' '` (space) and `'\t'` (tab).
- `KeyWordIgnoreCase`: if `true`, keywords (`[Lexeme(GenericToken.KeyWord, "...")]`) are matched ignoring case. That is, the keyword `if` also matches `IF`, `If`, etc. Default is `false`.
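For instance, a lexer that matches keywords case-insensitively could be declared as follows. This is only a minimal sketch: the enum name and token values are illustrative, and the properties are set as attribute named arguments.

```csharp
// Sketch only: enum and token names are illustrative.
[Lexer(IgnoreWS = true, KeyWordIgnoreCase = true)]
public enum ConfiguredToken
{
    // "if", "IF", "If", ... all produce IF because KeyWordIgnoreCase is true
    [Lexeme(GenericToken.KeyWord, "if")]
    IF = 1,

    [Lexeme(GenericToken.Identifier)]
    IDENTIFIER = 2
}
```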
The basic lexemes are:

- `GenericToken.Identifier`: an identifier. From version 2.0.3 `Identifier` accepts an extra parameter to specify an identifier pattern:
  - `IdentifierType.Alpha`: only alpha characters (default value, and the only pattern available before version 2.0.3).
  - `IdentifierType.AlphaNum`: starts with an alpha char, followed by alpha or numeric chars.
  - `IdentifierType.AlphaNumDash`: starts with an alpha or '_' (underscore) char, followed by alphanumeric, '-' (dash) or '_' (underscore) chars.
  - `IdentifierType.Custom`: accepts two parameters, the starting character pattern and the pattern for the remaining characters. The pattern string contains 'c' (allowed char) and 'l-u' (allowed char range). If '-' (dash) should be an allowed char, it must be the first character of the pattern. An example that duplicates `IdentifierType.AlphaNumDash` is `[Lexeme(GenericToken.Identifier, IdentifierType.Custom, "_A-Za-z", "-_0-9A-Za-z")]` (from version 2.4.0.6). Another example relating to issue #468 can be found in Discussion468Lexer.cs.
- `GenericToken.String`: a classical string delimited by double quotes `"`. See below for more details.
- `GenericToken.Int`: an int (i.e. a series of one or more digits).
- `GenericToken.Double`: a float number (the decimal separator can be specified; the default decimal separator is the dot '.').
- `GenericToken.Hexa`: a hexadecimal number. Hexa numbers are denoted with a configurable prefix (example: "0x"). ⚠️ Beware that a badly chosen prefix can make a hexadecimal number match the identifier lexeme (if used), leading to conflicts and lexing errors. The default prefix is `0x`.
- `GenericToken.Date`: a date. Needs 2 parameters:
  - a format, either `DateFormat.YYYYMMDD` or `DateFormat.DDMMYYYY` (default)
  - the separator char used to separate year, month and day (default is '-')
- `GenericToken.KeyWord`: a keyword is an identifier with a special meaning (it comes with the same constraints as `GenericToken.Identifier`). Here again performance comes at the price of less flexibility. This lexeme is configurable.
- `GenericToken.SugarToken`: a general purpose lexeme with no special constraint, except that it cannot start with a leading alpha char. This lexeme is configurable.
- `GenericToken.UpTo`: matches everything until one of the given patterns is found. For instance `[Lexeme(GenericToken.UpTo, "<", "{")]` will match all characters until `<` or `{` (see the sketch after this list).
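As an illustration of some of the less obvious lexemes above, here is a minimal sketch; the enum and token names are only illustrative:

```csharp
// Sketch only: enum and token names are illustrative.
public enum SampleToken
{
    // custom identifier: starts with '_' or a letter, then letters, digits, '-' or '_'
    [Lexeme(GenericToken.Identifier, IdentifierType.Custom, "_A-Za-z", "-_0-9A-Za-z")]
    IDENTIFIER = 1,

    // matches everything until '<' or '{' is found
    [Lexeme(GenericToken.UpTo, "<", "{")]
    CONTENT = 2,

    // static int lexeme
    [Lexeme(GenericToken.Int)]
    INT = 3
}
```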
To build a generic lexer Lexeme attribute there are 2 different constructors:

- static generic lexeme: this constructor does a one-to-one mapping between a generic token and your lexer token. It takes only one parameter, the mapped generic token: `[Lexeme(GenericToken.String)]`. The static lexemes are String, Int, Double and Identifier.
- configurable lexemes (KeyWord and SugarToken): this constructor takes 2 parameters:
  - the mapped GenericToken
  - the value of the keyword or sugar token.
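For example, mixing the two constructor styles (a sketch; the token names are illustrative):

```csharp
// Sketch only: enum and token names are illustrative.
public enum ConstructorSample
{
    // static lexeme: one-to-one mapping with the generic String token
    [Lexeme(GenericToken.String)]
    STRING = 1,

    // configurable lexemes: the generic token plus the keyword or sugar value
    [Lexeme(GenericToken.KeyWord, "while")]
    WHILE = 2,

    [Lexeme(GenericToken.SugarToken, ":=")]
    ASSIGN = 3
}
```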
There are also short code attributes for each basic lexeme type:

- `GenericToken.Identifier`:
  - `IdentifierType.Alpha`: `[AlphaId]`
  - `IdentifierType.AlphaNum`: `[AlphaNumId]`
  - `IdentifierType.AlphaNumDash`: `[AlphaNumDashId]`
  - `IdentifierType.Custom`: `[CustomId(startingPattern, endingPattern)]`
- `GenericToken.String`: `[String(delimiterChar, escapeChar)]`
- `GenericToken.Int`: `[Int]`
- `GenericToken.Double`: `[Double]`
- `GenericToken.Hexa`: `[Hexa]`
- `GenericToken.Date`: `[Date]`
- `GenericToken.KeyWord`: `[Keyword(pattern)]`
- `GenericToken.SugarToken`: `[Sugar(pattern)]`. ⚠️ A sugar token cannot start like a valid identifier (that is, with a letter, `_` or `-`).
- `GenericToken.UpTo`: `[UpTo(string pattern1, pattern2 .... patternn)]`
String lexeme definitions take 2 parameters:

- a string delimiter char. Default is `"` (double quote).
- an escape char allowing the use of the delimiter char inside a string. Default is `\` (backslash). Using the same char for both the delimiter and the escape char is allowed.

Important note: when tokenizing, only escaped string delimiters are processed; other escape sequences such as `\n` are not interpreted, as they can mean different things depending on the context. For example `hello \n \"world\"` will return the token value `hello \n "world"`, with no line feed.
## Examples

```csharp
// matches 'hello \' world' => 'hello ' world'
[Lexeme(GenericToken.String, "'", "\\")]
STRING
```

or

```csharp
// matches 'that''s my hello world' => 'that's my hello world'
[Lexeme(GenericToken.String, "'", "'")]
STRING
```
## Many string patterns

Many string patterns are allowed in the same lexer. For instance you may want to match double-quote-delimited strings as well as single-quote-delimited strings. To do this, simply apply multiple Lexeme attributes to the same enum value:

```csharp
// matches 'hello \' world' => 'hello ' world'
// as well as "hello \" world" => "hello " world"
[Lexeme(GenericToken.String, "\"", "\\")]
[Lexeme(GenericToken.String, "'", "\\")]
STRING
```
The generic lexer offers support for comments.
Comments are removed from the token stream before parsing starts, so the parser ignores them. Nevertheless you can still retrieve them, for any special purpose, by using the lexer directly.
## Comment declaration

Comments use dedicated attributes on enum values that declare the comment delimiters.

```csharp
[Comment(singleline, multilinestart, multilineend)]
COMMENT,
```

- singleline: the single line comment delimiter ("//" in all C-derived languages)
- multilinestart: the opening multi-line comment delimiter ("/*" in all C-derived languages)
- multilineend: the closing multi-line comment delimiter ("*/" in all C-derived languages)

```csharp
[SingleLineComment(singleline)]
SINGLE_LINE_COMMENT,
```

- singleline: the single line comment delimiter ("//" in all C-derived languages)

```csharp
[MultiLineComment(multilinestart, multilineend)]
MULTI_LINE_COMMENT,
```

- multilinestart: the opening multi-line comment delimiter ("/*" in all C-derived languages)
- multilineend: the closing multi-line comment delimiter ("*/" in all C-derived languages)
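With concrete C-like delimiters the separate attributes would look like this (a short sketch; the enum value names are illustrative):

```csharp
// Sketch only: enum value names are illustrative.
[SingleLineComment("//")]
SINGLE_LINE_COMMENT,

[MultiLineComment("/*", "*/")]
MULTI_LINE_COMMENT,
```

The full While-language example below uses the combined `[Comment]` form instead: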
```csharp
public enum WhileTokenGeneric
{
    #region keywords 0 -> 19

    [Lexeme(GenericToken.KeyWord, "if")]
    IF = 1,
    [Lexeme(GenericToken.KeyWord, "then")]
    THEN = 2,
    [Lexeme(GenericToken.KeyWord, "else")]
    ELSE = 3,
    [Lexeme(GenericToken.KeyWord, "while")]
    WHILE = 4,
    [Lexeme(GenericToken.KeyWord, "do")]
    DO = 5,
    [Lexeme(GenericToken.KeyWord, "skip")]
    SKIP = 6,
    [Lexeme(GenericToken.KeyWord, "true")]
    TRUE = 7,
    [Lexeme(GenericToken.KeyWord, "false")]
    FALSE = 8,
    [Lexeme(GenericToken.KeyWord, "not")]
    NOT = 9,
    [Lexeme(GenericToken.KeyWord, "and")]
    AND = 10,
    [Lexeme(GenericToken.KeyWord, "or")]
    OR = 11,
    [Lexeme(GenericToken.KeyWord, "print")]
    PRINT = 12,

    #endregion

    #region literals 20 -> 29

    // identifier with the IdentifierType.AlphaNumDash pattern
    [Lexeme(GenericToken.Identifier, IdentifierType.AlphaNumDash)]
    IDENTIFIER = 20,
    [Lexeme(GenericToken.String)]
    STRING = 21,
    [Lexeme(GenericToken.Int)]
    INT = 22,

    #endregion

    #region operators 30 -> 49

    [Lexeme(GenericToken.SugarToken, ">")]
    GREATER = 30,
    [Lexeme(GenericToken.SugarToken, "<")]
    LESSER = 31,
    [Lexeme(GenericToken.SugarToken, "==")]
    EQUALS = 32,
    [Lexeme(GenericToken.SugarToken, "!=")]
    DIFFERENT = 33,
    [Lexeme(GenericToken.SugarToken, ".")]
    CONCAT = 34,
    [Lexeme(GenericToken.SugarToken, ":=")]
    ASSIGN = 35,
    [Lexeme(GenericToken.SugarToken, "+")]
    PLUS = 36,
    [Lexeme(GenericToken.SugarToken, "-")]
    MINUS = 37,
    [Lexeme(GenericToken.SugarToken, "*")]
    TIMES = 38,
    [Lexeme(GenericToken.SugarToken, "/")]
    DIVIDE = 39,

    #endregion

    #region sugar 50 -> 99

    [Lexeme(GenericToken.SugarToken, "(")]
    LPAREN = 50,
    [Lexeme(GenericToken.SugarToken, ")")]
    RPAREN = 51,
    [Lexeme(GenericToken.SugarToken, ";")]
    SEMICOLON = 52,

    #endregion

    #region comments : C like comments

    [Comment("//", "/*", "*/")]
    COMMENTS = 100,

    #endregion

    EOF = 0
}
```
The same lexer using short code attributes:

```csharp
public enum ShortWhileTokenGeneric
{
    #region keywords 0 -> 19

    [Keyword("IF")] [Keyword("if")]
    IF = 1,
    [Keyword("THEN")] [Keyword("then")]
    THEN = 2,
    [Keyword("ELSE")] [Keyword("else")]
    ELSE = 3,
    [Keyword("WHILE")] [Keyword("while")]
    WHILE = 4,
    [Keyword("DO")] [Keyword("do")]
    DO = 5,
    [Keyword("SKIP")] [Keyword("skip")]
    SKIP = 6,
    [Keyword("TRUE")] [Keyword("true")]
    TRUE = 7,
    [Keyword("FALSE")] [Keyword("false")]
    FALSE = 8,
    [Keyword("NOT")] [Keyword("not")]
    NOT = 9,
    [Keyword("AND")] [Keyword("and")]
    AND = 10,
    [Keyword("OR")] [Keyword("or")]
    OR = 11,
    [Keyword("PRINT")] [Keyword("print")]
    PRINT = 12,

    #endregion

    #region literals 20 -> 29

    [AlphaId] IDENTIFIER = 20,
    [String] STRING = 21,
    [Int] INT = 22,

    #endregion

    #region operators 30 -> 49

    [Sugar(">")] GREATER = 30,
    [Sugar("<")] LESSER = 31,
    [Sugar("==")] EQUALS = 32,
    [Sugar("!=")] DIFFERENT = 33,
    [Sugar(".")] CONCAT = 34,
    [Sugar(":=")] ASSIGN = 35,
    [Sugar("+")] PLUS = 36,
    [Sugar("-")] MINUS = 37,
    [Sugar("*")] TIMES = 38,
    [Sugar("/")] DIVIDE = 39,

    #endregion

    #region sugar 50 ->

    [Sugar("(")] LPAREN = 50,
    [Sugar(")")] RPAREN = 51,
    [Sugar(";")] SEMICOLON = 52,

    EOF = 0

    #endregion
}
```