Sunday, October 12, 2008

RXRDG Part IV - Basic extensions

To recap what we have learned in parts I, II and III. We've built a regular expression lexer using iterator pattern to loop through a sequence of characters and turn them into a sequence of regular expression tokens. We've used a state pattern to simplify the lexer code that enables us to deal with each context individually. We've extracted token building methods to a TokenBuilder in order to make the code more expressive. We've built a regular expression token parser on top of the lexer output stream of tokens. The parser itself used a state pattern to simplify expression grouping. The parser state implemented a Shunting Yard Algorithm to create an Abstract Sytax Tree. We've implemented a visitor pattern in the tree nodes and built a visitor that outputted a nicely indented xml representation of the AST. Aside from learning a few tricks what is the payoff for making these design decisions?

Narrow scope of responsibility

Our parser lacks the ability to correctly parse escaped characters. Some characters are reserved for use in the regular expressions. Yet sometimes you wish to look for exactly those characters. Eg. we wish to find "{" within a text. To match these special characters we need to prefix them with a "\". So how hard is it to extend our parser to implement this?

It's quite simple actually. When we find "\" in the stream of character we know we will treat the next character differently. In other words - we will change state! So we add another case to the LiteralState (ironically '\' must be prefixed by itself in the matching condition).


case '\\':
context.ToState(new EscapeState());

We also need to define the escape state.


internal class EscapeState : IState
{
public IToken Handle(IContext context)
{
context.EndState();
switch (context.Current)
{
default:
return TokenBuilder.BuildLiteralToken(context.Current);
}
}
}

Done! The narrow scope of current context proves very beneficial. There's no need to think about the fact "*" indicates that previous operand should be repeated zero or more times in the escape state. So there are no ifs and checks to cover all the angles.

Separation of concerns

It took us a great deal of effort to implement the visitor pattern to retrieve value from nodes. We could have just added an overload to the GetString method and be done. But the original idea for this series was to generate random data and not a pretty printed AST. So how about we do that?

To generate our random output we'll just define another visitor. Our first order of concern is to define a random source.


public static class RandomNumberProvider
{
static readonly object _padlock = new object();

private static RandomNumberGenerator _rnd;
private static RandomNumberGenerator Rnd
{
get
{
lock (_padlock)
{
if (_rnd == null)
{
_rnd = RandomNumberGenerator.Create();
}
return _rnd;
}
}
}

public static int GetRandomNumber(int min, int max)
{
byte[] randbyte = new byte[1];
Rnd.GetNonZeroBytes(randbyte);
Random rand = new Random(Convert.ToInt32(randbyte[0]));

return rand.Next(min, max);
}
}

As we're not building an online gambling system, this should be sufficiently random. We'll copy most of the functionality of the XmlVisitor


public class GeneratorVisitor : IVisitor
{
private StringBuilder _builder = new StringBuilder();
private const int DefaultMaxOccurs = 11;

private GeneratorVisitor()
{ }

public static string Visit(INode node)
{
GeneratorVisitor visitor = new GeneratorVisitor();
node.Accept(visitor);
return visitor._builder.ToString();
}
}

And finaly we'll add the required Visit methods. The beauty of AST is that it gives us another view of separation of concerns:

  • concatenation and paranthesis nodes will "visit" their child nodes
  • literal node has no child nodes and will simply output its character.
  • repetition node will "visit" it's child nodes - repetedly

public void Visit(LiteralNode node)
{
_builder.Append(node.Token.Character.ToString());
}

public void Visit(RepetitionNode node)
{
int index = RandomNumberProvider.GetRandomNumber(node.Token.MinOccurs, node.Token.MaxOccurs > -1 ? node.Token.MaxOccurs + 1 : DefaultMaxOccurs);
for (int i = 0; i < index; i++)
{
foreach (INode childNode in node.ChildNodes)
{
childNode.Accept(this);
}
}
}

public void Visit(ConcatenationNode node)
{
foreach (INode childNode in node.ChildNodes)
{
childNode.Accept(this);
}
}

public void Visit(ParenthesisNode node)
{
foreach (INode childNode in node.ChildNodes)
{
childNode.Accept(this);
}
}
}

And we have our first data generator that uses regular expressions to generate random strings conforming to those patterns!

Still we're missing some important regular expression features such as alternation and character sets. We'll add alternation and leave the character sets for the next installment.

Adding new tokens is the most difficult task in our project as new tokens propagate throughout the system (tokens, tokenbuilder methods, nodes, visitors...) First we need to update the lexer part.

  1. Add alternation to token type enum.
  2. Add alternation token
  3. Add build alternation token method to token builder
  4. Add a case to literal state switch (case '|') - not shown

public enum TokenType
{
Alternation
...
}

public class AlternationToken : IToken
{
public TokenType TokenType { get { return TokenType.Alternation; } }
}

public static AlternationToken BuildAlternationToken()
{
return new AlternationToken();
}

Next we update the parser part.

  1. Add the alternation node (alternation is the lowest precedence operator barring paranthesis)
  2. Add a case to parser state switch (remember - alternation is a binary operator)

public class AlternationNode : NodeBase
{
public AlternationNode(AlternationToken token)
: base(token) { }

public override NodeType NodeType
{
get { return NodeType.BinaryOperator; }
}

public override int Precedence
{
get { return 30; }
}

public override void Accept(IVisitor visitor)
{
visitor.Visit(this);
}
}

public void Handle(Parser context)
{
IToken token = context.Current;
switch (token.TokenType)
{
case TokenType.Alternation:
AlternationToken alternation = (AlternationToken)token;
INode alternationNode = new AlternationNode(alternation);
AddOperator(alternationNode);
break;
...
}
}

Which reminds us to update the visitors.


public interface IVisitor
{
void Visit(AlternationNode node);
...
}

public void Visit(AlternationNode node)
{
string tab = GetTab(_level);
_builder.AppendFormat("{0}<{1}>{2}", tab, "alternation", LR);
_level++;
foreach (INode childNode in node.ChildNodes)
{
childNode.Accept(this);
}
_level--;
_builder.AppendFormat("{0}{2}", tab, "alternation", LR);
}

public void Visit(AlternationNode node)
{
int index = RandomNumberProvider.GetRandomNumber(0,2);
node.ChildNodes[index].Accept(this);
}

A nice expression to test our new feature is "(ab|cd)|ef" which should evaluate to:


<alternation>
<parenthesis>
<alternation>
<sequence>
<literal>a</literal>
<literal>>b</literal>
</sequence>
<sequence>
<literal>c</literal>
<literal>d</literal>
</sequence>
</alternation>
</parenthesis>
<sequence>
<literal>e</literal>
<literal>f</literal>
</sequence>
</alternation>

Thanks to our modular design adding new functionality to the parser is easy.


Related posts:

No comments: