Saturday, September 27, 2008

RXRDG Part II - Shunting trucks and hauling freight

The lexer we built in Part I is capable of turning an RE into a stream of tokens. Let's add a few more tokens to make it a bit more interesting.


public class ParenthesisLeftToken : IToken
{
    public TokenType TokenType { get { return TokenType.ParenthesisLeft; } }
}

public class ParenthesisRightToken : IToken
{
    public TokenType TokenType { get { return TokenType.ParenthesisRight; } }
}

Next we must add the appropriate TokenBuilder methods (not shown) and update the default (literal) lexer state.
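Those builder methods aren't shown here, but assuming they follow the pattern of the existing TokenBuilder from Part I, they might look roughly like this:

// Sketch of the TokenBuilder additions (not shown in the original post).
public static ParenthesisLeftToken BuildParenthesisLeftToken()
{
    return new ParenthesisLeftToken();
}

public static ParenthesisRightToken BuildParenthesisRightToken()
{
    return new ParenthesisRightToken();
}

public static RepetitionToken BuildZeroOrOneToken()
{
    return TokenBuilder.BuildRepetitionToken(0, 1);
}

public static RepetitionToken BuildOneOrMoreToken()
{
    return TokenBuilder.BuildRepetitionToken(1, -1);
}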


case '(':
    return TokenBuilder.BuildParenthesisLeftToken();
case ')':
    return TokenBuilder.BuildParenthesisRightToken();
case '*':
    return TokenBuilder.BuildZeroOrMoreToken();
case '?':
    return TokenBuilder.BuildZeroOrOneToken();
case '+':
    return TokenBuilder.BuildOneOrMoreToken();
default:
    return TokenBuilder.BuildLiteralToken(context.Current);

Extending the lexer with new functionality was easy! So now we can lex a fairly complex expression.

a(b1)*2
<literal character='a'/>
<parenthesisLeft/>
<literal character='b'/>
<literal character='1'/>
<parenthesisRight/>
<repetition minOccurs='0' maxOccurs='-1'/>
<literal character='2'/>

We now need to turn this stream of tokens into an abstract syntax tree (AST) that captures the meaning of the parsed RE.
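To make the discussion below concrete, here is roughly what the AST for a(b1)*2 looks like with the node types introduced later in this post (an illustrative sketch, not output from the code):

concatenation
+-- literal 'a'
+-- repetition (minOccurs=0, maxOccurs=-1)
|   +-- parenthesis
|       +-- concatenation
|           +-- literal 'b'
|           +-- literal '1'
+-- literal '2'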

Looking at the AST we can see that the repetition node is a parent of "b" although it follows "b" in the token stream, and the next token will not be a child of the repetition node. Our parser will need to know how to build this tree. Luckily an algorithm exists that does just that: the Shunting yard algorithm (SYA) (whenever I think of that algorithm the Thomas the Tank Engine song pops into my head). SYA is primarily used to convert from infix to postfix (or prefix) notation (e.g. it turns 1+2*3 into 123*+).

So how does SYA work? By placing operands (tokens which are acted upon - i.e. literals) and operators (tokens which act upon other tokens - i.e. repetition) in separate stacks and outputting operator/operand(s) pairs according to operator precedence rules. E.g.:

1+2*3+4


step  input  operators  operands
1     1                 1
2     +      +          1
3     2      +          1, 2
4     *      +, *       1, 2
5     3      +, *       1, 2, 3
6     +      +, *, +    1, 2, 3
7            +, +       1, (2*3)
8     4      +, +       1, (2*3), 4
9     eof    +          1, ((2*3)+4)
10    eof               (((2*3)+4)+1)

Our next task is to define AST nodes that will be constructed by the parser. The node interface will wrap a token, implement a simple composite pattern and provide the node type and precedence properties.


public enum NodeType
{
    UnaryOperator,
    BinaryOperator,
    Operand
}

public interface INode
{
    IList<INode> ChildNodes { get; }
    IToken Token { get; }
    NodeType NodeType { get; }
    int Precedence { get; }
}

Since nodes will share a lot of the implementation it makes sense to define an abstract node class.


public abstract class NodeBase : INode
{
    public IList<INode> ChildNodes { get; internal set; }
    public IToken Token { get; internal set; }

    public abstract NodeType NodeType { get; }
    public abstract int Precedence { get; }
}

The nodes themselves are quite easy to do. We just need to hide the base token property under the correct type and add the correct node type and precedence values.


public class LiteralNode : NodeBase
{
    public LiteralNode(LiteralToken token)
    {
        ChildNodes = new List<INode>();
        base.Token = token;
    }

    public new LiteralToken Token
    {
        get
        {
            return (LiteralToken)base.Token;
        }
    }

    public override NodeType NodeType
    {
        get { return NodeType.Operand; }
    }

    public override int Precedence
    {
        get { return 100; }
    }
}
// rest omitted

The type and precedence values used are:

class              node type       precedence
LiteralNode        Operand         100
RepetitionNode     UnaryOperator   150
ParenthesisNode    UnaryOperator   25
ConcatenationNode  BinaryOperator  50

The last operator (ConcatenationNode) requires a new token. The concatenation node is basically a placeholder operator used when no explicit operator is present. E.g. in the expression abc the parser will add concatenation nodes to join a, b and c into a sequence and return a&b&c. Our basic building blocks are now quite simple and readable. The parser itself will be quite similar to the lexer, but the parser state is where SYA will need to be implemented, so it is worth a closer look.
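The ConcatenationToken and ConcatenationNode classes themselves aren't shown in the post; assuming they mirror the existing tokens and nodes (and that TokenType gains a Concatenation member), a sketch might look like this:

// Sketch only - assumes TokenType.Concatenation has been added to the enum.
public class ConcatenationToken : IToken
{
    public TokenType TokenType { get { return TokenType.Concatenation; } }
}

public class ConcatenationNode : NodeBase
{
    public ConcatenationNode(ConcatenationToken token)
    {
        ChildNodes = new List<INode>();
        base.Token = token;
    }

    public override NodeType NodeType
    {
        get { return NodeType.BinaryOperator; }
    }

    public override int Precedence
    {
        get { return 50; }
    }
}

With that in place, the basic handling of the lexer's output is easy.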


public class ParseState
{
    private Stack<INode> _operators = new Stack<INode>();
    private Stack<INode> _operands = new Stack<INode>();

    public void Handle(Parser context)
    {
        IToken token = context.Current;
        switch (token.TokenType)
        {
            case TokenType.Literal:
                AddOperand(new LiteralNode((LiteralToken)token));
                break;
            case TokenType.Repetition:
                AddOperator(new RepetitionNode((RepetitionToken)token));
                break;
            case TokenType.ParenthesisLeft:
                context.ToState(new ParseState());
                break;
            case TokenType.ParenthesisRight:
                AddOperator(new ParenthesisNode((ParenthesisRightToken)token));
                context.EndState();
                break;
            default:
                break;
        }
    }
}

Finding a literal we add it to the operands stack. Finding a parenthesis we simply jump to a new state or return to the previous one. But when adding operators we need to keep track of precedence rules and process any operators that can already be processed.


public void AddOperator(INode operatorNode)
{
    while (_operators.Count > 0 && _operators.Peek().Precedence > operatorNode.Precedence)
    {
        ProcessOperator(_operators.Pop());
    }
    _operators.Push(operatorNode);
}

While processing operators we need to distinguish between binary and unary operators. We also make sure concatenation does not create unnecessary tree nodes and that the original order of operands is preserved.


public void ProcessOperator(INode operatorNode)
{
    switch (operatorNode.NodeType)
    {
        case NodeType.UnaryOperator:
            operatorNode.ChildNodes.Insert(0, _operands.Pop());
            _operands.Push(operatorNode);
            break;
        case NodeType.BinaryOperator:
            INode lastOperand = _operands.Pop();
            if (lastOperand.Token.TokenType == TokenType.Concatenation && operatorNode.Token.TokenType == TokenType.Concatenation)
            {
                foreach (INode childNode in lastOperand.ChildNodes)
                {
                    operatorNode.ChildNodes.Add(childNode);
                }
            }
            else
            {
                operatorNode.ChildNodes.Add(lastOperand);
            }
            operatorNode.ChildNodes.Insert(0, _operands.Pop());
            _operands.Push(operatorNode);
            break;
        case NodeType.Operand:
        default:
            break;
    }
}

So operators become parents of their operands and are added to the stack as operands (we can have a repetition of a repetition node!). The operands are simply added to the stack, and a concatenation token is added if needed to keep the number of operands and binary operators in balance.


public void AddOperand(INode operandNode)
{
    if (IsBalanced() == false)
    {
        AddOperator(new ConcatenationNode(new ConcatenationToken()));
    }
    _operands.Push(operandNode);
}

private bool IsBalanced()
{
    int binaryOperatorsCount = 0;
    foreach (INode operatorNode in _operators)
    {
        if (operatorNode.NodeType == NodeType.BinaryOperator)
        {
            binaryOperatorsCount++;
        }
    }
    return binaryOperatorsCount == _operands.Count;
}

Whenever the state closes it should process the remaining operators and return the resulting (operand) node.


public INode Close()
{
    while (_operators.Count > 0)
    {
        ProcessOperator(_operators.Pop());
    }
    return _operands.Pop();
}

So finally we're able to finish the parser. As we have done most of the work in the state, the context is easy. We just need to remember to close the state before adding its result to the current state.


public class Parser
{
    Stack<ParseState> _states;
    ParseState _currentState;
    IEnumerator<IToken> _tokens;

    public INode Parse(string expression)
    {
        _states = new Stack<ParseState>();
        _currentState = new ParseState();
        _tokens = new Lexer().Tokenize(expression).GetEnumerator();
        while (_tokens.MoveNext())
        {
            _currentState.Handle(this);
        }
        return _currentState.Close();
    }

    public void ToState(ParseState state)
    {
        _states.Push(_currentState);
        _currentState = state;
    }

    public void EndState()
    {
        ParseState toState = _states.Pop();
        toState.AddOperand(_currentState.Close());
        _currentState = toState;
    }

    public IToken Current
    {
        get
        {
            return _tokens.Current;
        }
    }
}

That was a lot of work! To test whether this works we'll print out the contents of the AST. That's a task for the next installment.
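For the impatient, a minimal recursive printer might look something like the sketch below (just an illustration, not the actual code from the next installment):

public static class AstPrinter
{
    // Walks the tree recursively and prints each node's token type,
    // indented two spaces per level of depth.
    public static void Print(INode node, int depth)
    {
        Console.WriteLine(new string(' ', depth * 2) + node.Token.TokenType);
        foreach (INode childNode in node.ChildNodes)
        {
            Print(childNode, depth + 1);
        }
    }
}

Calling AstPrinter.Print(new Parser().Parse("a(b1)*2"), 0) should then produce an indented listing matching the tree sketched earlier.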



Thursday, September 25, 2008

RXRDG Part I - Regular Expression Random Data Generator

A while ago I was faced with the task of setting up a test database for the system I was building. I was preparing to use existing SQL scripts to populate the database with random values, but it seemed like an awful lot of work. Whenever something looks like an awful lot of work, that's an opportunity to do something creative. But first you should always Google!

There are a number of useful random data generators out there, and the idea I found interesting was using regular expressions (RE) to create random data. REs are usually used as a means of validating existing data, which means that data generated to fit those expressions should be valid test data as well.

So the basic idea is to take an RE, parse it and use the results to generate random data. Now that seems like an interesting and useful project. What I will try to do in the next few installments of this blog is slowly build the RE random data generator and hopefully learn a few things along the way. The main objective is to learn and teach, and to write readable code rather than the fastest code. I hope to provide you with a download of a working project to finish off the series.

Parsing expressions can become a complex operation. We need to recognize usable blocks of text that have meaning on their own (tokens) and we need to decipher the meaning of the sequence of such blocks. The first phase is called tokenization and the second is called parsing. Now tokenization is usually done using regular expressions, but it would be quite ironic to use that approach here :) Also, most regular expression tokens are single characters (such as letters, the dot operator, the star operator, etc.). Still, splitting the parsing process into two parts helps reduce the complexity of the process and will prove quite helpful in the long run.

The first step should be to build a lexer capable of recognizing at least some of the regular expression tokens. So how does one perform lexical analysis on an expression? Iterating through characters would be a good start. First we define some interfaces. We'll need a token type enumeration to distinguish between tokens and a lexer.


public enum TokenType
{
    Literal,
    Repetition
}

public interface IToken
{
    TokenType TokenType { get; }
}

public interface IContext
{
    IEnumerable<IToken> Tokenize(string expression);
}

I did say that most regular expression tokens are single characters, but not all. This means that when the lexer hits a specific character it will need to recognize it as the start of a usable block of characters and act accordingly. In other words, the lexer will analyze the input and change its state. Such behavior calls for an implementation of the state pattern. We will therefore split the lexer into a lexer context, which drives the analysis, and lexer states, which analyze the character stream and build tokens. The lexer will output a stream of tokens from an expression and keep track of its state. So let's update the IContext interface and add another interface for the lexer states.


public interface IState
{
    IToken Handle(IContext context);
}

public interface IContext
{
    IEnumerable<IToken> Tokenize(string expression);
    void ToState(IState state);
    void EndState();
    char Current { get; }
}

It’s time to build the lexer class, the default state and a few tokens. But first we'll define the literal (characters) and repetition (quantifiers) token types.


public class LiteralToken : IToken
{
    public char Character { get; set; }
    public TokenType TokenType { get { return TokenType.Literal; } }
}

public class RepetitionToken : IToken
{
    public int MinOccurs { get; set; }
    public int MaxOccurs { get; set; }
    public TokenType TokenType { get { return TokenType.Repetition; } }
}

Next we'll define the default lexer state. The job of a lexer state is to take a character and either build a token or transition to another state. This way the tokenizing logic is always contained within a certain context. This will prove very helpful in the future.


public class LiteralState : IState
{
    public IToken Handle(IContext context)
    {
        switch (context.Current)
        {
            case '(':
                context.ToState(new LiteralState());
                break;
            case ')':
                context.EndState();
                break;
            case '*':
                return new RepetitionToken()
                    { MinOccurs = 0, MaxOccurs = -1 };
            default:
                return new LiteralToken()
                    { Character = context.Current };
        }
        return null;
    }
}

And finally we can define the lexer class. Since lexer states are responsible for processing the input, the lexer only needs to iterate through the characters of the expression and keep track of its states.


public class Lexer : IContext
{
    Stack<IState> _states;
    IState _currentState;
    IEnumerator<char> _characters;

    public IEnumerable<IToken> Tokenize(string expression)
    {
        _states = new Stack<IState>();
        _currentState = new LiteralState();
        _characters = expression.GetEnumerator();
        while (_characters.MoveNext())
        {
            IToken token = _currentState.Handle(this);
            if (token != null)
            {
                yield return token;
            }
        }
    }

    public void ToState(IState state)
    {
        _states.Push(_currentState);
        _currentState = state;
    }

    public void EndState()
    {
        _currentState = _states.Pop();
    }

    public char Current
    {
        get
        {
            return _characters.Current;
        }
    }
}

That was easy! We can also refactor the token building procedures to a token builder. Remember to update the lexer state.


public static class TokenBuilder
{
    public static LiteralToken BuildLiteralToken(char character)
    {
        return new LiteralToken()
            { Character = character };
    }

    public static RepetitionToken BuildZeroOrMoreToken()
    {
        return TokenBuilder.BuildRepetitionToken(0, -1);
    }

    public static RepetitionToken BuildRepetitionToken(int minOccurs, int maxOccurs)
    {
        return new RepetitionToken()
            { MinOccurs = minOccurs, MaxOccurs = maxOccurs };
    }
}
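The lexer state update itself isn't spelled out here; assuming the LiteralState.Handle switch simply delegates to the builder, the affected cases might become something like:

// Sketch of the updated switch cases in LiteralState.Handle.
case '*':
    return TokenBuilder.BuildZeroOrMoreToken();
default:
    return TokenBuilder.BuildLiteralToken(context.Current);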

We can quickly test our lexer on a few expressions (and think about adding new states).


static void Main(string[] args)
{
    string expression = Console.ReadLine();

    Lexer lexer = new Lexer();
    foreach (IToken token in lexer.Tokenize(expression))
    {
        Console.WriteLine(token.TokenType.ToString());
    }

    Console.ReadLine();
}
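With the lexer defined above, entering ab* should print one token type per line, something like:

Literal
Literal
Repetition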
