Monday, October 20, 2008

RXRDG V - Ins and outs

If you haven't done so already, I suggest you at least skim the previous RXRDG installments to get an idea of what we have done so far. Our next addition to the project will be character sets.

Character sets are defined by placing regular expression tokens within brackets (ie. [abc]), so we'll need to add bracket tokens to our lexer. The brackets act in a way similar to parentheses (which we have already implemented), so we'll copy that approach. We'll add a set state and new cases for transitioning from the literal state to the set state. There are a few things to keep in mind though. The rules for meta-characters within character sets are quite complex. Eg.

  • The character "^" (caret) is a literal, but encountered at the beginning of a set it negates the set.
  • The character "-" denotes a range between the previous and the next literal, but encountered at the beginning or the end of a set it is a literal.
  • The character "]" marks the end of a set, but encountered at the beginning of a set (or following a "^" at the beginning of a set) it is a literal.
  • ...

The mind boggles! What we need to keep doing is taking small steps. Our regexp parser might not be perfect yet, but we're making baby steps towards a better one with each addition. We'll worry about all the special cases later. There's one quick shortcut we can take here. Looking at the list it is clear that tokens are treated differently at the beginning of a set. When we don't match any special tokens at the beginning, we transition into the "normal" set state. We can address this by simply having two set states with appropriate transitions.

Character sets contain literals or ranges (ie. a-z means the characters from a to z), so we'll also need to define a range token. As mentioned, sets can be negated, so we'll also define a not token. Since character sets are quite useful, it is no surprise that some common sets have shorthands. Most notably the any set (the dot) matches any character.

There are quite a few new tokens and states we need to implement so it's best to start right away. We'll need to add:

  • Additional lexer token types and tokens (BracketLeft, BracketRight, Range, Not and Any; not shown)
  • Additional methods for the token builder (not shown)
  • Additional cases for the literal state
  • Begin set and set state

case '[': // a set begins - transition to the begin set state
context.ToState(new BeginSetState());
return TokenBuilder.BuildBracketLeftToken();
case '.': // the any set shorthand
return TokenBuilder.BuildAnyToken();

This takes care of the first transition. Next we'll define the two set states. The begin set state always comes first and it can transition into the set state. Ending the set means popping back through the begin set state, so a double call to EndState will be needed.


internal class BeginSetState : IState
{
public IToken Handle(IContext context)
{
switch (context.Current)
{
case '^':
// stay in the begin set state so the next character still gets the beginning-of-set treatment
return TokenBuilder.BuildNotToken();
default:
// anything else (including ']' and '-') is a literal at the beginning of a set
context.ToState(new SetState());
return TokenBuilder.BuildLiteralToken(context.Current);
}
}
}

internal class SetState : IState
{
public IToken Handle(IContext context)
{
switch (context.Current)
{
case ']':
// the set ends - pop both the set state and the begin set state
context.EndState();
context.EndState();
return TokenBuilder.BuildBracketRightToken();
case '-':
return TokenBuilder.BuildRangeToken();
default:
return TokenBuilder.BuildLiteralToken(context.Current);
}
}
}
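
To see the two set states in action, lexing "[^a-c]" should produce the following stream of tokens:

[^a-c]
<bracketLeft/>
<not/>
<literal character='a'/>
<range/>
<literal character='c'/>
<bracketRight/>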

Now that our lexer is working we need to focus on the parser. For the set token we simply follow the parenthesis pattern and define a bracket node as a unary operator, setting its precedence quite low (10) so that, just like with the parenthesis node, everything inside the brackets gets processed first. Looking at the range token it becomes clear that it simply connects two literal tokens. Eg. a-z becomes:


<range>
   <literal character='a'/>
   <literal character='z'/>
</range>

So we'll define range as a binary operator (setting the precedence to 130). The not token is a simple unary operator with a low precedence (11), just above the bracket node. Defining the nodes is an exercise left to the reader, but a sketch of the range node follows below. That leaves us with the any token. The any token matches any character, ranging from " " (space) to "~" (tilde), so the dot is simply a predefined character set containing a range between a space and a tilde.
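
Here is a minimal sketch of what the RangeNode might look like - an assumption modeled on the LiteralNode from part II and the AlternationNode from part IV, not necessarily the project's exact code:

public class RangeNode : NodeBase
{
public RangeNode(RangeToken token)
: base(token) { }

public new RangeToken Token
{
get { return (RangeToken)base.Token; }
}

public override NodeType NodeType
{
get { return NodeType.BinaryOperator; }
}

public override int Precedence
{
get { return 130; }
}

public override void Accept(IVisitor visitor)
{
visitor.Visit(this);
}
}

This also gives us a chance to follow the builder pattern and define a node builder to complement the token builder class.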


public static class NodeBuilder
{
public static LiteralNode BuildLiteralNode(LiteralToken literalToken)
{
return new LiteralNode(literalToken);
}

public static RangeNode BuildRangeNode(LiteralToken from, LiteralToken to)
{
RangeNode rangeNode = new RangeNode(TokenBuilder.BuildRangeToken());
rangeNode.ChildNodes.Add(BuildLiteralNode(from));
rangeNode.ChildNodes.Add(BuildLiteralNode(to));
return rangeNode;
}

public static RangeNode BuildAnyNode()
{
// the any set spans the printable characters from ' ' (32) to '~' (126)
return BuildRangeNode(TokenBuilder.BuildLiteralToken((char)32), TokenBuilder.BuildLiteralToken((char)126));
}
}

And we're ready to add additional cases to the parse state.


case TokenType.Range:
RangeToken range = (RangeToken)token;
INode rangeNode = new RangeNode(range);
AddOperator(rangeNode);
break;
case TokenType.BracketLeft:
// a set opens - parse its contents in a new state, just like a parenthesis
context.ToState(new ParseState());
break;
case TokenType.BracketRight:
BracketRightToken set = (BracketRightToken)token;
INode setNode = new BracketNode(set);
AddOperator(setNode);
// closing the state adds its result as an operand to the parent state
context.EndState();
break;
case TokenType.Any:
RangeNode anyNode = NodeBuilder.BuildAnyNode();
AddOperand(anyNode);
break;
case TokenType.Not:
NotToken not = (NotToken)token;
INode notNode = new NotNode(not);
AddOperator(notNode);
break;
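
With these cases in place, an expression like "[a-c1]" should parse into a tree along these lines (traced by hand against the code above, with guessed element names for the new nodes):

<bracket>
   <sequence>
      <range>
         <literal>a</literal>
         <literal>c</literal>
      </range>
      <literal>1</literal>
   </sequence>
</bracket>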

We'll need to update our visitor interface, and updating the xml visitor should be a breeze. The data generator however poses some problems. Character sets need to generate random data based on the literals and ranges they contain. A naive implementation would follow the pattern of visiting a random character set member. There are however two problems with that approach:

  1. The random pattern would not be fair. A large range node would be visited just as often as a neighboring single literal.
  2. It would be impossible to implement character set negation.

The easiest way to deal with this is to visit the set's child nodes and gather enough information to generate a simple set containing only literals. We'll create a literal node collection class to keep the literals unique. As a side note - the generic KeyedCollection is one of my favorite classes. It provides a solution for types that contain the id within themselves. All we need to do is implement the GetKeyForItem method to specify how the key is retrieved from each item.


private class LiteralNodeCollection : System.Collections.ObjectModel.KeyedCollection<char, LiteralNode>
{
protected override char GetKeyForItem(LiteralNode item)
{
return item.Token.Character;
}
}

While visiting the bracket node we might encounter various types of nodes and deal with them immediately.

  • If we encounter the not node it means we need to negate the set. We'll leave the implementation to the not visiting method.
  • We must skip any concatenation nodes.
  • We must add literal nodes to the literal node collection.
  • If we encounter a range node we must add every literal node within the range to the literal node collection.

public void Visit(BracketNode node)
{
if (node.ChildNodes[0].Token.TokenType == TokenType.Not)
{
node.ChildNodes[0].Accept(this);
return;
}

INode current = node;
// descend past any concatenation wrappers to reach the actual set members
while (current.ChildNodes[0].Token.TokenType == TokenType.Concatenation)
{
current = current.ChildNodes[0];
}

LiteralNodeCollection nodes = new LiteralNodeCollection();
foreach (INode childNode in current.ChildNodes)
{
switch (childNode.Token.TokenType)
{
case TokenType.Literal:
LiteralNode literalChildNode = (LiteralNode)childNode;
if (nodes.Contains(literalChildNode) == false)
{
nodes.Add(literalChildNode);
}
break;
case TokenType.Range:
int min = (int)((LiteralNode)childNode.ChildNodes[0]).Token.Character;
int max = (int)((LiteralNode)childNode.ChildNodes[1]).Token.Character;
// ranges are inclusive - a-z must contain 'z'
for (int i = min; i <= max; i++)
{
char c = (char)i;
if (nodes.Contains(c) == false)
{
nodes.Add(NodeBuilder.BuildLiteralNode(TokenBuilder.BuildLiteralToken(c)));
}
}
break;
}
}
int index = RandomNumberProvider.GetRandomNumber(0, nodes.Count);
nodes[index].Accept(this);
}

Visiting the not node is just the opposite: first we add all the printable characters to the collection and then remove the ones we find. We must also add a range node visiting method for completeness and we're done. We've got ourselves quite a functional regular expression parser and a random data generator to match.
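
A minimal sketch of such a not visiting method, reusing the printable range (space to tilde) and the descent logic from the bracket method above - my assumption, not necessarily the project's exact code:

public void Visit(NotNode node)
{
// start with every printable character (space to tilde)...
LiteralNodeCollection nodes = new LiteralNodeCollection();
for (int i = 32; i <= 126; i++)
{
nodes.Add(NodeBuilder.BuildLiteralNode(TokenBuilder.BuildLiteralToken((char)i)));
}

// ...descend past any concatenation wrapper, as in the bracket method...
INode current = node;
while (current.ChildNodes[0].Token.TokenType == TokenType.Concatenation)
{
current = current.ChildNodes[0];
}

// ...and remove every literal we find among the child nodes
foreach (INode childNode in current.ChildNodes)
{
switch (childNode.Token.TokenType)
{
case TokenType.Literal:
nodes.Remove(((LiteralNode)childNode).Token.Character);
break;
case TokenType.Range:
int min = (int)((LiteralNode)childNode.ChildNodes[0]).Token.Character;
int max = (int)((LiteralNode)childNode.ChildNodes[1]).Token.Character;
for (int i = min; i <= max; i++)
{
nodes.Remove((char)i);
}
break;
}
}

int index = RandomNumberProvider.GetRandomNumber(0, nodes.Count);
nodes[index].Accept(this);
}

The KeyedCollection earns its keep here: removal by character key is a one-liner.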

As promised, the source code for the regular expression random data generator can be found at Google Code. The series on RXRDG is far from over and the code will be updated accordingly.



Sunday, October 12, 2008

RXRDG Part IV - Basic extensions

To recap what we have learned in parts I, II and III: we've built a regular expression lexer that uses the iterator pattern to loop through a sequence of characters and turn them into a sequence of regular expression tokens. We've used the state pattern to simplify the lexer code, which lets us deal with each context individually. We've extracted token building methods to a TokenBuilder in order to make the code more expressive. We've built a regular expression token parser on top of the lexer's output stream of tokens. The parser itself uses the state pattern to simplify expression grouping. The parser state implements the Shunting Yard algorithm to create an Abstract Syntax Tree (AST). We've implemented the visitor pattern in the tree nodes and built a visitor that outputs a nicely indented xml representation of the AST. Aside from learning a few tricks, what is the payoff for making these design decisions?

Narrow scope of responsibility

Our parser lacks the ability to correctly parse escaped characters. Some characters are reserved for use in regular expressions, yet sometimes you wish to look for exactly those characters. Eg. we wish to find "{" within a text. To match these special characters we need to prefix them with a "\". So how hard is it to extend our parser to implement this?

It's quite simple actually. When we find "\" in the stream of characters we know we will treat the next character differently. In other words - we will change state! So we add another case to the LiteralState (ironically, '\' must be prefixed by itself in the matching condition).


case '\\':
context.ToState(new EscapeState());
break;

We also need to define the escape state.


internal class EscapeState : IState
{
public IToken Handle(IContext context)
{
// an escape affects only the next character, so return to the previous state immediately
context.EndState();
switch (context.Current)
{
default:
return TokenBuilder.BuildLiteralToken(context.Current);
}
}
}

Done! The narrow scope of the current context proves very beneficial. There's no need to think about the fact that "*" indicates the previous operand should be repeated zero or more times while we're in the escape state, so there are no ifs and checks to cover all the angles.

Separation of concerns

It took us a great deal of effort to implement the visitor pattern to retrieve values from nodes. We could have just added an overload to the GetString method and been done with it. But the original idea for this series was to generate random data, not a pretty printed AST. So how about we do that?

To generate our random output we'll just define another visitor. Our first order of concern is to define a random source.


public static class RandomNumberProvider
{
static readonly object _padlock = new object();

// RandomNumberGenerator comes from System.Security.Cryptography
private static RandomNumberGenerator _rnd;
private static RandomNumberGenerator Rnd
{
get
{
lock (_padlock)
{
if (_rnd == null)
{
_rnd = RandomNumberGenerator.Create();
}
return _rnd;
}
}
}

public static int GetRandomNumber(int min, int max)
{
byte[] randbyte = new byte[1];
Rnd.GetNonZeroBytes(randbyte);
Random rand = new Random(Convert.ToInt32(randbyte[0]));

return rand.Next(min, max);
}
}

As we're not building an online gambling system, this should be sufficiently random. We'll copy most of the functionality of the XmlVisitor.


public class GeneratorVisitor : IVisitor
{
private StringBuilder _builder = new StringBuilder();
private const int DefaultMaxOccurs = 11;

private GeneratorVisitor()
{ }

public static string Visit(INode node)
{
GeneratorVisitor visitor = new GeneratorVisitor();
node.Accept(visitor);
return visitor._builder.ToString();
}
}

And finally we'll add the required Visit methods. The beauty of the AST is that it gives us another view of the separation of concerns:

  • concatenation and parenthesis nodes will "visit" their child nodes
  • a literal node has no child nodes and will simply output its character
  • a repetition node will "visit" its child nodes - repeatedly

public void Visit(LiteralNode node)
{
_builder.Append(node.Token.Character.ToString());
}

public void Visit(RepetitionNode node)
{
// a MaxOccurs of -1 means unbounded, so fall back to DefaultMaxOccurs; the upper bound is exclusive, hence the + 1
int index = RandomNumberProvider.GetRandomNumber(node.Token.MinOccurs, node.Token.MaxOccurs > -1 ? node.Token.MaxOccurs + 1 : DefaultMaxOccurs);
for (int i = 0; i < index; i++)
{
foreach (INode childNode in node.ChildNodes)
{
childNode.Accept(this);
}
}
}

public void Visit(ConcatenationNode node)
{
foreach (INode childNode in node.ChildNodes)
{
childNode.Accept(this);
}
}

public void Visit(ParenthesisNode node)
{
foreach (INode childNode in node.ChildNodes)
{
childNode.Accept(this);
}
}

And we have our first data generator that uses regular expressions to generate random strings conforming to those patterns!

Still we're missing some important regular expression features such as alternation and character sets. We'll add alternation and leave the character sets for the next installment.

Adding new tokens is the most difficult task in our project, as new tokens propagate throughout the system (tokens, TokenBuilder methods, nodes, visitors...). First we need to update the lexer part.

  1. Add alternation to the token type enum.
  2. Add an alternation token.
  3. Add a build alternation token method to the token builder.
  4. Add a case to the literal state switch (case '|') - not shown.

public enum TokenType
{
Alternation,
...
}

public class AlternationToken : IToken
{
public TokenType TokenType { get { return TokenType.Alternation; } }
}

public static AlternationToken BuildAlternationToken()
{
return new AlternationToken();
}

Next we update the parser part.

  1. Add the alternation node (alternation is the lowest precedence operator barring parentheses)
  2. Add a case to parser state switch (remember - alternation is a binary operator)

public class AlternationNode : NodeBase
{
public AlternationNode(AlternationToken token)
: base(token) { }

public override NodeType NodeType
{
get { return NodeType.BinaryOperator; }
}

public override int Precedence
{
get { return 30; }
}

public override void Accept(IVisitor visitor)
{
visitor.Visit(this);
}
}

public void Handle(Parser context)
{
IToken token = context.Current;
switch (token.TokenType)
{
case TokenType.Alternation:
AlternationToken alternation = (AlternationToken)token;
INode alternationNode = new AlternationNode(alternation);
AddOperator(alternationNode);
break;
...
}
}

Which reminds us to update the visitors.


public interface IVisitor
{
void Visit(AlternationNode node);
...
}

public void Visit(AlternationNode node)
{
string tab = GetTab(_level);
_builder.AppendFormat("{0}<{1}>{2}", tab, "alternation", LR);
_level++;
foreach (INode childNode in node.ChildNodes)
{
childNode.Accept(this);
}
_level--;
_builder.AppendFormat("{0}{2}", tab, "alternation", LR);
}

public void Visit(AlternationNode node)
{
// alternation is a binary operator, so pick one of its two children at random
int index = RandomNumberProvider.GetRandomNumber(0, 2);
node.ChildNodes[index].Accept(this);
}

A nice expression to test our new feature is "(ab|cd)|ef" which should evaluate to:


<alternation>
   <parenthesis>
      <alternation>
         <sequence>
            <literal>a</literal>
            <literal>b</literal>
         </sequence>
         <sequence>
            <literal>c</literal>
            <literal>d</literal>
         </sequence>
      </alternation>
   </parenthesis>
   <sequence>
      <literal>e</literal>
      <literal>f</literal>
   </sequence>
</alternation>

Thanks to our modular design adding new functionality to the parser is easy.



Saturday, October 4, 2008

RXRDG Part III - (Re)visiting the tree

This is the third part in the Regular Expression Random Data Generator series. This installment builds on the previous two, so some knowledge of them is required.

It's time to output our syntax tree. The main objective here will be to traverse the tree and output the type and other information of the nodes as we go.


static void Main(string[] args)
{
string expression = "ab(cd)*";

Parser parser = new Parser();
INode node = parser.Parse(expression);

Console.WriteLine(Traverse(node));
}

static string Traverse(INode node)
{
string result = string.Empty;
result += string.Format("<{0}>{1}", node.Token.TokenType, Environment.NewLine);
foreach (INode childNode in node.ChildNodes)
{
result += Traverse(childNode);
}
result += string.Format("{1}", node.Token.TokenType, Environment.NewLine);

return result;
}
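
For "ab(cd)*" this prints something along these lines (the element names come straight from the TokenType enum - note how the parenthesis group reports its right parenthesis token):

<Concatenation>
<Literal>
</Literal>
<Literal>
</Literal>
<Repetition>
<ParenthesisRight>
<Concatenation>
<Literal>
</Literal>
<Literal>
</Literal>
</Concatenation>
</ParenthesisRight>
</Repetition>
</Concatenation>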

This certainly does the job but there are a few problems with this approach:

  • It isn't very efficient - it uses recursion and string concatenation
  • It isn't very extensible - what happens when we change the type of output we want?
  • It isn't very helpful - to get better information out of a node we'll need to add a switch statement and test for the type of INode

A more interesting way of getting the output is to use the visitor design pattern. The basic idea behind the visitor pattern is a mechanism called double dispatch. Let's define an interface for the visitor class.


public interface IVisitor
{
void Visit(LiteralNode node);
void Visit(RepetitionNode node);
void Visit(ConcatenationNode node);
void Visit(ParenthesisNode node);
}

The visitor will contain all the necessary logic to create the output we desire. Each Visit method takes a type of node so extracting information from a node within a method will be easy. The tricky part is to call the correct method with the correct parameter. Unfortunately C# does not support multimethods so there are only two ways of doing that. One is to write a switch statement to identify a type of node and call the appropriate method. The other is to let the nodes do it. First we define the caller interface.


public interface IVisitable
{
void Accept(IVisitor visitor);
}

Next we change the INode interface to inherit the IVisitable interface and extend the NodeBase class by adding an Accept method as required.


public abstract class NodeBase : INode
{
public abstract void Accept(IVisitor visitor);

//omitted...
}

And finally we need to extend each node class with the concrete implementation of the Accept method.


public override void Accept(IVisitor visitor)
{
visitor.Visit(this);
}

So every node is capable of calling the correct Visitor method, passing itself as a parameter. You may be wondering why we are violating the DRY principle here. Couldn't we simply implement the Accept method in the base class? We could, but then the call would bind to a Visit(NodeBase node) method and not the concrete method we need!
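
To illustrate, consider this hypothetical version (it would not even compile against our IVisitor, which has no Visit(NodeBase) overload):

public abstract class NodeBase : INode
{
// overload resolution happens at compile time, and the static type of
// 'this' here is NodeBase - so this call could only ever bind to a
// Visit(NodeBase) overload, never to Visit(LiteralNode) and friends
public void Accept(IVisitor visitor)
{
visitor.Visit(this);
}
}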

The disadvantages of this pattern are:

  • The code is less clear if the design pattern is not known.
  • The list of derived visitable classes must be known in advance.
  • Boilerplate code is added to every derived visitable class.

There are ways to mitigate these limitations (using generic visitors or dynamic casting) but that's out of scope for this series. In our case the complete list of RE tokens is known and not likely to grow. What the visitor allows us to extend are the ways in which we utilize the strongly typed information we receive from individual nodes. Let's write our XmlVisitor! First we define the basics.


public class XmlVisitor : IVisitor
{
readonly string LR = Environment.NewLine;
readonly static int TabSpaces = 3;

private int _level;
private StringBuilder _builder = new StringBuilder();

private static string GetTab(int tabLevel)
{
string tab = string.Empty;
return tab.PadLeft(tabLevel * TabSpaces, ' ');
}
}

The StringBuilder will help with the efficiency of building a string representation of the tree while GetTab method will take care of proper indentation. Next we fill in the interface members.


public void Visit(LiteralNode node)
{
string tab = GetTab(_level);
_builder.AppendFormat("{0}<{1}>{2}</{1}>{3}", tab, "literal", node.Token.Character, LR);
}

public void Visit(RepetitionNode node)
{
string tab = GetTab(_level);
_builder.AppendFormat("{0}<{1} min='{2}' max='{3}'>{4}", tab, "repetition", node.Token.MinOccurs, node.Token.MaxOccurs, LR);
_level++;
foreach (INode childNode in node.ChildNodes)
{
childNode.Accept(this);
}
_level--;
_builder.AppendFormat("{0}</{1}>{2}", tab, "repetition", LR);
}

//rest omitted...

Using our visitor is still quite cumbersome as we need to create an instance of the visitor, get the node to accept the visitor and then have a method return the StringBuilder value. We can wrap that functionality in a single static method. This means that initializing the visitor will not be needed so we can declare its default constructor private.


private XmlVisitor()
{ }

public static string Visit(INode node)
{
XmlVisitor visitor = new XmlVisitor();
node.Accept(visitor);
return visitor._builder.ToString();
}

We can remove the cumbersome Traverse method and test our shiny new visitor!


Console.WriteLine(XmlVisitor.Visit(node));
//Console.WriteLine(Traverse(node));



Saturday, September 27, 2008

RXRDG Part II - Shunting trucks and hauling freight

The lexer we built in Part I is capable of turning an RE into a stream of tokens. Let's add a few more tokens to make it a bit more interesting.


public class ParenthesisLeftToken : IToken
{
public TokenType TokenType { get { return TokenType.ParenthesisLeft; } }
}

public class ParenthesisRightToken : IToken
{
public TokenType TokenType { get { return TokenType.ParenthesisRight; } }
}

Next we must add the appropriate TokenBuilder methods (not shown) and update the default (literal) lexer state.


case '(':
return TokenBuilder.BuildParenthesisLeftToken();
case ')':
return TokenBuilder.BuildParenthesisRightToken();
case '*':
return TokenBuilder.BuildZeroOrMoreToken();
case '?':
return TokenBuilder.BuildZeroOrOneToken();
case '+':
return TokenBuilder.BuildOneOrMoreToken();
default:
return TokenBuilder.BuildLiteralToken(context.Current);

Extending the lexer with new functionality was easy! So now we can lex a fairly complex expression.

a(b1)*2
<literal character='a'/>
<parenthesisLeft/>
<literal character='b'/>
<literal character='1'/>
<parenthesisRight/>
<repetition minOccurs='0' maxOccurs='-1'/>
<literal character='2'/>

We now need to turn this stream of tokens into an abstract syntax tree (AST) that captures the meaning of the parsed RE.
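
Sketched as XML, the AST we are after looks something like this:

<concatenation>
   <literal character='a'/>
   <repetition minOccurs='0' maxOccurs='-1'>
      <parenthesis>
         <concatenation>
            <literal character='b'/>
            <literal character='1'/>
         </concatenation>
      </parenthesis>
   </repetition>
   <literal character='2'/>
</concatenation>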

Looking at the AST we can see that the repetition node is a parent of "b", although it follows "b" in the token stream. And the next token will not be a child of the repetition node. Our parser will need to know how to build this tree. Luckily an algorithm exists that does just that: the Shunting Yard algorithm (SYA). (Whenever I think of that algorithm the Thomas the Tank Engine song pops into my head.) SYA is primarily used to convert from infix to postfix (or prefix) notation (eg. it turns 1+2*3 into 123*+).

So how does SYA work? By placing operands (tokens which are acted upon - ie. literals) and operators (tokens which act upon other tokens - ie. repetition) on separate stacks and outputting operator/operand(s) pairs according to operator precedence rules. Eg:

1+2*3+4


step   input   operators   operands
1.     1                   1
2.     +       +           1
3.     2       +           1,2
4.     *       +,*         1,2
5.     3       +,*         1,2,3
6.     +       +,*,+       1,2,3
7.             +,+         1,(2*3)
8.     4       +,+         1,(2*3),4
9.     eof     +           1,((2*3)+4)
10.    eof                 (((2*3)+4)+1)

Our next task is to define AST nodes that will be constructed by the parser. The node interface will wrap a token, implement a simple composite pattern and provide the node type and precedence properties.


public enum NodeType
{
UnaryOperator,
BinaryOperator,
Operand
}

public interface INode
{
IList<INode> ChildNodes { get; }
IToken Token { get; }
NodeType NodeType { get; }
int Precedence { get; }
}

Since nodes will share a lot of the implementation it makes sense to define an abstract node class.


public abstract class NodeBase : INode
{
public IList<INode> ChildNodes { get; internal set; }
public IToken Token { get; internal set; }

public abstract NodeType NodeType { get; }
public abstract int Precedence { get; }
}

The nodes themselves are quite easy to do. We just need to hide the base token property behind the correct type and add the correct node type and precedence values.


public class LiteralNode : NodeBase
{
public LiteralNode(LiteralToken token)
{
ChildNodes = new List<INode>();
base.Token = token;
}

public new LiteralToken Token
{
get
{
return (LiteralToken)base.Token;
}
}

public override NodeType NodeType
{
get { return NodeType.Operand; }
}

public override int Precedence
{
get { return 100; }
}
}
//rest omitted

The type and precedence values used are:

class               node type        precedence
LiteralNode         Operand          100
RepetitionNode      UnaryOperator    150
ParenthesisNode     UnaryOperator    25
ConcatenationNode   BinaryOperator   50

The last operator (ConcatenationNode) requires a new token. The concatenation node is basically a placeholder operator used when no other operator exists. Eg. in the expression abc the parser will add concatenation nodes to join a, b and c into a sequence, returning a&b&c. Our basic building blocks are now quite simple and readable. The parser itself will be quite similar to the lexer, but the parser state is where SYA needs to be implemented, so it is worth a closer look.
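
The concatenation token and node aren't shown above, so here is a minimal sketch of what they might look like - my assumption, following the token and node patterns above and assuming Concatenation has been added to the TokenType enum:

public class ConcatenationToken : IToken
{
public TokenType TokenType { get { return TokenType.Concatenation; } }
}

public class ConcatenationNode : NodeBase
{
public ConcatenationNode(ConcatenationToken token)
{
ChildNodes = new List<INode>();
base.Token = token;
}

public override NodeType NodeType
{
get { return NodeType.BinaryOperator; }
}

public override int Precedence
{
get { return 50; }
}
}

The basic handling of the lexer's output is easy.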


public class ParseState
{
private Stack<INode> _operators = new Stack<INode>();
private Stack<INode> _operands = new Stack<INode>();

public void Handle(Parser context)
{
IToken token = context.Current;
switch (token.TokenType)
{
case TokenType.Literal:
AddOperand(new LiteralNode((LiteralToken)token));
break;
case TokenType.Repetition:
AddOperator(new RepetitionNode((RepetitionToken)token));
break;
case TokenType.ParenthesisLeft:
context.ToState(new ParseState());
break;
case TokenType.ParenthesisRight:
AddOperator(new ParenthesisNode((ParenthesisRightToken)token));
context.EndState();
break;
default:
break;
}
}
}

Finding a literal, we add it to the operand stack. Finding a parenthesis, we simply jump to a new state or descend back to the previous one. But when adding operators we need to keep track of precedence rules and process the operators that can be processed.


public void AddOperator(INode operatorNode)
{
while (_operators.Count > 0 && _operators.Peek().Precedence > operatorNode.Precedence)
{
ProcessOperator(_operators.Pop());
}
_operators.Push(operatorNode);
}

While processing operators we need to distinguish between binary and unary operators. We also make sure concatenation does not create unnecessary tree nodes and that the original order of operands is preserved.


public void ProcessOperator(INode operatorNode)
{
switch (operatorNode.NodeType)
{
case NodeType.UnaryOperator:
operatorNode.ChildNodes.Insert(0, _operands.Pop());
_operands.Push(operatorNode);
break;
case NodeType.BinaryOperator:
INode lastOperand = _operands.Pop();
if (lastOperand.Token.TokenType == TokenType.Concatenation && operatorNode.Token.TokenType == TokenType.Concatenation)
{
foreach (INode childNode in lastOperand.ChildNodes)
{
operatorNode.ChildNodes.Add(childNode);
}
}
else
{
operatorNode.ChildNodes.Add(lastOperand);
}
operatorNode.ChildNodes.Insert(0, _operands.Pop());
_operands.Push(operatorNode);
break;
case NodeType.Operand:
default:
break;
}
}

So operators become parents of their operands and are added to the stack as operands (we can have a repetition of a repetition node!). The operands are simply added to the stack, and a concatenation token is added if needed to keep the number of operands and binary operators in balance.


public void AddOperand(INode operandNode)
{
if (IsBalanced() == false)
{
AddOperator(new ConcatenationNode(new ConcatenationToken()));
}
_operands.Push(operandNode);
}

private bool IsBalanced()
{
int binaryOperatorsCount = 0;
foreach (INode operatorNode in _operators)
{
if (operatorNode.NodeType == NodeType.BinaryOperator)
{
binaryOperatorsCount++;
}
}
return binaryOperatorsCount == _operands.Count;
}
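
To see the balancing in action, take the expression ab: reading a finds 0 binary operators and 0 operands (balanced), so a is simply pushed. When b arrives there is 1 operand but still 0 binary operators, so IsBalanced() returns false, a ConcatenationNode is pushed onto the operator stack, and only then is b pushed as an operand.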

Whenever the state closes it should process the remaining operators and return the resulting (operand) node.


public INode Close()
{
while (_operators.Count > 0)
{
ProcessOperator(_operators.Pop());
}
return _operands.Pop();
}

So finally we're able to finish the parser. As we have done most of the work in the state, the context is easy. We just need to remember to close the state before adding its result to the current state.


public class Parser
{
Stack<ParseState> _states;
ParseState _currentState;
IEnumerator<IToken> _tokens;

public INode Parse(string expression)
{
_states = new Stack<ParseState>();
_currentState = new ParseState();
_tokens = new Lexer().Tokenize(expression).GetEnumerator();
while (_tokens.MoveNext())
{
_currentState.Handle(this);
}
return _currentState.Close();
}

public void ToState(ParseState state)
{
_states.Push(_currentState);
_currentState = state;
}

public void EndState()
{
ParseState toState = _states.Pop();
toState.AddOperand(_currentState.Close());
_currentState = toState;
}

public IToken Current
{
get
{
return _tokens.Current;
}
}
}

That was a lot of work! To test whether this works we'll print out the contents of the AST. That's a task for the next installment.



Thursday, September 25, 2008

RXRDG Part I - Regular Expression Random Data Generator

A while ago I was faced with the task of setting up a test database for a system I was building. I was preparing to use existing sql scripts to populate the database with random values, but it seemed like an awful lot of work. Whenever something looks like an awful lot of work, that's an opportunity to do something creative. But first you should always Google!

There are a number of useful random data generators out there, and the idea I found interesting was using regular expressions (RE) to create random data. REs are usually used as a means of validating existing data, but that means that data generated to fit those expressions should be valid test data as well.

So the basic idea is to take a regular expression, parse it and use the results to generate random data. Now that seems like an interesting and useful project. What I will try to do in the next few installments of this blog is slowly build the RE random data generator and hopefully learn a few things along the way. The main objective is to learn and to teach, to write readable code rather than the fastest code. I hope to provide you with a download of a working project to finish off the series.

Parsing expressions can become a complex operation. We need to recognize usable blocks of text that have meaning on their own (tokens) and we need to decipher the meaning of a sequence of such blocks. The first phase is called tokenization and the second is called parsing. Now, tokenization is usually done using regular expressions, but it would be quite ironic to use that approach here :) Also, most regular expression tokens are single characters (such as letters, the dot operator, the star operator etc…). Still - splitting the parsing process in two parts helps reduce the complexity of the process and will prove quite helpful in the long run.

The first step should be to build a lexer capable of recognizing at least some of the regular expression tokens. So how does one perform lexical analysis on an expression? Iterating through characters would be a good start. First we define some interfaces. We'll need a token type enumeration to distinguish between tokens and a lexer.


public enum TokenType
{
Literal,
Repetition
}

public interface IToken
{
TokenType TokenType { get; }
}

public interface IContext
{
IEnumerable<IToken> Tokenize(string expression);
}

I did say that most regular expression tokens are single characters, but not all. This means that when the lexer hits a specific character it will need to recognize it as the start of a usable block of characters and act accordingly. In other words - the lexer will analyze the input and change its state. Such behavior calls for an implementation of the state pattern. We will therefore split the lexer into a lexer context, which drives the analysis, and lexer states, which analyze the character stream and build tokens. The lexer will output a stream of tokens from an expression and keep track of its state. So let's update the IContext interface and add another interface for the lexer states.


public interface IState
{
IToken Handle(IContext context);
}

public interface IContext
{
IEnumerable<IToken> Tokenize(string expression);
void ToState(IState state);
void EndState();
char Current { get; }
}

It’s time to build the lexer class, the default state and a few tokens. But first we'll define the literal (characters) and repetition (quantifiers) token types.


public class LiteralToken : IToken
{
public char Character { get; set; }
public TokenType TokenType { get { return TokenType.Literal; } }
}

public class RepetitionToken : IToken
{
public int MinOccurs { get; set; }
public int MaxOccurs { get; set; }
public TokenType TokenType { get { return TokenType.Repetition; } }
}

Next we'll define the default lexer state. The job of a lexer state is to take a character and either build a token or transition to another state. This way the tokenizing logic is always contained within a certain context. This will prove very helpful in the future.


public class LiteralState : IState
{
public IToken Handle(IContext context)
{
switch (context.Current)
{
case '(':
context.ToState(new LiteralState());
break;
case ')':
context.EndState();
break;
case '*':
return new RepetitionToken()
{ MinOccurs = 0, MaxOccurs = -1 }; // -1 means unbounded
default:
return new LiteralToken()
{ Character = context.Current };
}
return null;
}
}

And finally we can define the lexer class. Since lexer states are responsible for processing the input the lexer needs only to iterate through the characters of the expression and keep track of its states.


public class Lexer : IContext
{
Stack<IState> _states;
IState _currentState;
IEnumerator<char> _characters;

public IEnumerable<IToken> Tokenize(string expression)
{
_states = new Stack<IState>();
_currentState = new LiteralState();
_characters = expression.GetEnumerator();
while (_characters.MoveNext())
{
IToken token = _currentState.Handle(this);
if (token != null)
{
yield return token;
}
}
}

public void ToState(IState state)
{
_states.Push(_currentState);
_currentState = state;
}

public void EndState()
{
_currentState = _states.Pop();
}

public char Current
{
get
{
return _characters.Current;
}
}
}

That was easy! We can also refactor the token building procedures into a token builder. Remember to update the lexer state (a sketch of the updated state follows below).


public static class TokenBuilder
{
public static LiteralToken BuildLiteralToken(char character)
{
return new LiteralToken()
{ Character = character };
}

public static RepetitionToken BuildZeroOrMoreToken()
{
return TokenBuilder.BuildRepetitionToken(0, -1);
}

public static RepetitionToken BuildRepetitionToken(int minOccurs, int maxOccurs)
{
return new RepetitionToken()
{ MinOccurs = minOccurs, MaxOccurs = maxOccurs};
}
}
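
With the builder in place, the updated literal state might look like this (a sketch; the '(' and ')' cases keep their state transitions, and the remaining quantifier cases follow the same pattern):

public class LiteralState : IState
{
public IToken Handle(IContext context)
{
switch (context.Current)
{
case '(':
context.ToState(new LiteralState());
break;
case ')':
context.EndState();
break;
case '*':
return TokenBuilder.BuildZeroOrMoreToken();
default:
return TokenBuilder.BuildLiteralToken(context.Current);
}
return null;
}
}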

We can quickly test our lexer on a few expressions (and think about adding new states).


static void Main(string[] args)
{
string expression = Console.ReadLine();

Lexer lexer = new Lexer();
foreach (IToken token in lexer.Tokenize(expression))
{
Console.WriteLine(token.TokenType.ToString());
}

Console.ReadLine();
}
