Saturday, September 27, 2008

RXRDG Part II - Shunting trucks and hauling freight

The lexer we built in Part I is capable of turning a RE into a stream of tokens. Lets add a few more tokens to make it a bit more interesting.


public class ParenthesisLeftToken : IToken
{
public TokenType TokenType { get { return TokenType.ParenthesisLeft; } }
}

public class ParenthesisRightToken : IToken
{
public TokenType TokenType { get { return TokenType.ParenthesisRight; } }
}

Next we must add the appropriate TokenBuilder methods (not shown) and update the default (literal) lexer state.


case '(':
return TokenBuilder.BuildParenthesisLeftToken();
case ')':
return TokenBuilder.BuildParenthesisRightToken();
case '*':
return TokenBuilder.BuildZeroOrMoreToken();
case '?':
return TokenBuilder.BuildZeroOrOneToken();
case '+':
return TokenBuilder.BuildOneOrMoreToken();
default:
return TokenBuilder.BuildLiteralToken(context.Current);

Extending the lexer with new functionality was easy! So now we can lex a fairly complex expression.

a(b1)*2
<literal character='a'/>
<paranthesisLeft/>
<literal character='b'/>
<literal character='1'/>
<paranthesisRight/>
<repetition minOccurs='0' maxOccurs='-1'/>
<literal character='2'/>

We now need to turn this stream of tokens into an abstract syntax tree (AST) that captures the meaning of the parsed RE.

Looking at the AST we can see that the repetition node is a parent of "b" although it follows "b" in the token stream. And the next token will not be a child of the repetition node. Our parser will need to know how to build this tree. Luckily an algorithm exists that does just that. It's called the Shunting yard algorithm (SYA) (Whenever I think of that algorithm Thomas the Tank Engine song pops into my head). SYA is primarily use to convert from infix to a postfix (or prefix) notation (eg. turns 1+2*3 into 123*+).

So how does SYA work? By placing operands (tokens which are acted upon - ie. literals) and operators (tokens which act upon other tokens - ie. repetition) in separate stacks and outputting operator/operand(s) pairs according to operator precedence rules. Eg:

1+2*3+4


stepinputoperatorsoperands
1.1
2.++1
3.2+1,2
4.*+,*1,2
5.3+,*1,2,3
6.++,*,+1,2,3
7.+,+1,(2*3)
8.4+,+1,(2*3),4
9.eof+1,((2*3)+4)
10.eof((2*3)+4)+1)

Our next task is to define AST nodes that will be constructed by the parser. The node interface will wrap a token, implement a simple composite pattern and provide the node type and precedence properties.


public enum NodeType
{
UnaryOperator,
BinaryOperator,
Operand
}

public interface INode
{
IList ChildNodes { get; }
IToken Token { get; }
NodeType NodeType { get; }
int Precedence { get; }
}

Since nodes will share a lot of the implementation it makes sense to define an abstract node class.


public abstract class NodeBase : INode
{
public IList ChildNodes { get; internal set; }
public IToken Token { get; internal set; }

public abstract NodeType NodeType { get; }
public abstract int Precedence { get; }
}

The nodes themselves are quite easy to do. We just need to hide the base token property under the correct type and ad the correct node type and precedence values.


public class LiteralNode : NodeBase
{
public LiteralNode(LiteralToken token)
{
ChildNodes = new List();
base.Token = token;
}

public new LiteralToken Token
{
get
{
return (LiteralToken)base.Token;
}
}

public override NodeType NodeType
{
get { return NodeType.Operand; }
}

public override int Precedence
{
get { return 100; }
}
}
//rest ommited

The type and precedence values used are:

classnodeTypeprecedence
LiteralNodeOperand100
RepetitionNodeUnaryOperator150
ParenthesisNodeUnaryOperator25
ConcatenationNodeBinaryOperator50

The last operator (ConcatenationNode) requires a new token. Concatenation node is basically a placeholder operator if no other operator exists. Eg. in the expression abc the parser will add concatenation nodes to join a, b and c into a sequence and return a&b&c. Our basic building blocks are now quite simple and readable. The parser itself will be quite similar to the lexer but the parser state is where SYA will need to be implemented. So it is worth a closer look. The basic handing of lexers output is easy.


public class ParseState
{
private Stack _operators = new Stack();
private Stack _operators = new Stack();

public void Handle(Parser context)
{
IToken token = context.Current;
switch (token.TokenType)
{
case TokenType.Literal:
AddOperand(new LiteralNode((LiteralToken)token));
break;
case TokenType.Repetition:
AddOperator(new RepetitionNode((RepetitionToken)token));
break;
case TokenType.ParenthesisLeft:
context.ToState(new ParseState());
break;
case TokenType.ParenthesisRight:
AddOperator(new ParenthesisNode((ParenthesisRightToken)token));
context.EndState();
break;
default:
break;
}
}
}

Finding a literal we add it to the operands stack. Finding paranthesis we simply jump to a new state or descend to a previous one. But when adding operators we need to keep track of precedence rules and process operators that can be processed.


public void AddOperator(INode operatorNode)
{
while (_operators.Count > 0 && _operators.Peek().Precedence > operatorNode.Precedence)
{
ProcessOperator(_operators.Pop());
}
_operators.Push(operatorNode);
}

While processing operators we need to distinguish between binary and unary operators. We also make sure concatenation does not create uneccesary tree nodes and that the original order of operands is preserved.


public void ProcessOperator(INode operatorNode)
{
switch (operatorNode.NodeType)
{
case NodeType.UnaryOperator:
operatorNode.ChildNodes.Insert(0, _operands.Pop());
_operands.Push(operatorNode);
break;
case NodeType.BinaryOperator:
INode lastOperand = _operands.Pop();
if (lastOperand.Token.TokenType == TokenType.Concatenation && operatorNode.Token.TokenType == TokenType.Concatenation)
{
foreach (INode childNode in lastOperand.ChildNodes)
{
operatorNode.ChildNodes.Add(childNode);
}
}
else
{
operatorNode.ChildNodes.Add(lastOperand);
}
operatorNode.ChildNodes.Insert(0, _operands.Pop());
_operands.Push(operatorNode);
break;
case NodeType.Operand:
default:
break;
}
}

So operators become parents of their operands and are added to the stack as operands (we can have repetition of a repetition node!). The operands are simply added to the stack and a concatenation token is added if needed to keep the nuber of operands and binary operators in ballance.


public void AddOperand(INode operandNode)
{
if (IsBalanced() == false)
{
AddOperator(new ConcatenationNode(new ConcatenationToken()));
}
_operands.Push(operandNode);
}

private bool IsBalanced()
{
int binaryOperatorsCount = 0;
foreach (INode operatorNode in _operators)
{
if (operatorNode.NodeType == NodeType.BinaryOperator)
{
binaryOperatorsCount++;
}
}
return binaryOperatorsCount == _operands.Count;
}

Whenever the state closes it should process the remaining operators and return the resulting (operand) node.


public INode Close()
{
while (_operators.Count > 0)
{
ProcessOperator(_operators.Pop());
}
return _operands.Pop();
}

So finally we're able to finish the parser. As we have done most of the work in the state the context is easy. We just need to remember to close the state before adding it's result to the current state.


public class Parser
{
Stack _states;
ParseState _currentState;
IEnumerator _tokens;

public INode Parse(string expression)
{
_states = new Stack();
_currentState = new ParseState();
_tokens = new Lexer().Tokenize(expression).GetEnumerator();
while (_tokens.MoveNext())
{
_currentState.Handle(this);
}
return _currentState.Close();
}

public void ToState(ParseState state)
{
_states.Push(_currentState);
_currentState = state;
}

public void EndState()
{
ParseState toState = _states.Pop();
toState.AddOperand(_currentState.Close());
_currentState = toState;
}

public IToken Current
{
get
{
return _tokens.Current;
}
}
}

That was a lot of work! To test if this works we'll print out the contents of the AST. Thats a task for the next installment.


Related posts:

No comments: