Pages

Friday 18 March 2011

Compilers: Constructing an SLR(1) parser table & parsing a string

Compilers is a very difficult subject and this isn't a dirt basics course. Before reading this you should probably learn about grammars and how they are used in parsing, and the differences between the different parsers.

Here is just a simple (well I can't think of a simpler way to put it) way to explain how to construct an SLR(1) table from a grammar and then use this to parse a string using the largest match strategy.


Note: this may take you may have to go over this a few times to understand, and it's not exactly something you'll get good at just by doing it.
Note 2: I don't care for mathematical notation. Only about 0.1% of the population understands it and for everyone else it is confusing/scary. If you want the mathematical information about SLR1 parsers, I suggest using a different source. Wikipedia is a good starting block.

Now after getting formalities out of the way, let’s get down to business.

We will start off with the grammar:

1. S::=bAS 2. S::=ab 3. A::=bA 4. A::= ϵ

The capital letters are non-terminals and the small letters are called terminals.

For the sake of it just looking easier to use we’ll right it in this way:

S’ -> S
S -> bAS | ab
A -> bA | ϵ

The S’ always goes to S. This is just something that’s done. I don’t know the direct reasoning behind it, but in the end it’s often to know better to just do it than to worry why it’s needed.

So we will go through each position trying to get to the end of the string while doing a closure each time.

We always start off with the first point in, which is always S’ -> S

We do a closure on this. This looks like:

{ (S’ -> .S), (S-> .bAS), (S->.ab) }

What we have done here is gone from S’ to the start position of the right hand side. Once we had S’ -> .S, we then have to look and say “Are there and non-terminals after the . (dot).

And as we can quite clearly see, there is an S after the dot, so we expand on this by also recursively doing the same thing. We put in all that S can go to and look at any non-terminals directly after the .

So once we have { (S’ -> .S), (S-> .bAS), (S->.ab) } we can say let’s call this state 0.

You then have to go through 0, trying to get to the next position through a terminal or non-terminal. This sounds confusing but it’s really not once you see it, it’s just a case of moving the . across.

So we then do:

0S = { (S’ -> S.) }

we have travelled 0 along S. And the only point we can do this is here (note the . now on the right hand side of S).

as you can see above, we can also go along b in S -> .bAS and along a in S -> ab

So we then do these

0a = { (S -> a.b) }

0b = { ( S -> b.AS) }

Now in the first two we did, there are no more non-terminals you can go along.
S’ -> S. can’t go anywhere because it’s finished, and
S -> a.b can only go along b, which is a terminal.

So we now appoint numbers to these as well. So

0S = { (S’ -> S.) } = 1

0a = { (S -> a.b) } = 2

Technically you could also say 0 along A goes to the empty set, but because it’s nothing I don’t see any point since it would take up a lot of room later.

0b = { ( S -> b.AS) } can go along a non-terminal, however. Therefore we do a closure on this.

cl(0b) = { (S -> b.AS), (A -> .bA), (A -> .) } = 3

here again we have started off by first putting in the set 0b, then looking at the next non-terminal. We then evaluate this non-terminal, putting everything that can go to.
As you can see A -> . (dot). This is because it goes to ϵ, then nothing character, effectively. Therefore you won’t ever go past epsilon as you already have since it’s nothing, is probably the best way I can describe it (unfortunately).

As you can see here I’ve called this state 3. This is because whenever you do a closure, you then name it a new state, if the set isn’t the same as one you already have.
This is actually the reason we named 1 and 2 before.
If you have for instance
0a = { (S -> a.b) } – then nothing within the set has a non-terminal after the . which means the closure of this will just be itself, meaning we can just miss it out completely.

Anyway, we have completely evaluated 0, and we now have:

0S = { (S’ -> S.) } = 1

0a = { (S -> a.b) } = 2

cl(0b) = { (S -> b.AS), (A -> .bA), (A -> .) } = 3

State 1 has parsed to the end of the string (the . is right at the end), so we can’t do anything further with this.

So

2b = { (S -> ab.) } = 4

again no non-terminal after the dot, so we give it a new state since this hasn’t appeared before (this isn’t the same as state 2, the placement of the . does count).

So now we can’t do anything else we move onto 3.

3A = { (S -> bA.S) }
3b = { (A -> b.A) }

and there’s no 3ϵ since as I said we can’t move past the epsilon.
Everything 3 went to has a non-terminal after the dot, so no new states yet. Again we repeat the process doing the closure of each

cl(3A) = { (S -> bA.S), (S -> .bAS), (S->.ab) } = 5
cl(3b) = { (A->b.A), (A->.bA), (A->.) } = 6

Now this is done we go back to where we last evaluated a state. The last state we evaluated was 3, so we now try 4, but this is at the end so we can’t go any further. So we go onto 5 and 6:

5S = { (S-> bAS.) } = 7
5b = { (S-> b.AS) }
now if you notice, this is actually the same as 0b. This means the closure will be the same, meaning we won’t have to redo it again. Therefore we just say
5b = { (S-> b.AS) } = 3
5a = { (S -> a.b) } = 0a, cl(5a) = 2

6A = { (A -> bA.) } = 8
6b ={ (A ->b.A) } = 3b, cl(6b) = 6

If this looks confusing, don’t worry I’ve literally just done the same thing I said above but in a little bit of shorthand. If you are confused, I suggest you re-read over the couple of paragraphs before.

We now look through every number since the last ones evaluated. So we’ve done 5 and 6. We look at 7 and see the . is in the last position, therefore can’t go any further, and the same holds true for 8, and there is no higher number which means we are done!

Now comes actually building the table.

We have a row for each state (the numbers) and a column for each terminal, $ and each non-terminal.
Now the $ is actually the end of string symbol. This means that you have absolutely completed anything and is called the accepting state.

This means that in the table we have, the state that has the final accepting state, i.e. where S’ -> S. is where you will put ‘acc’ in the table.
So the row of the state, in the $ column.
We also need to go through every state we have and look to see if there are any other points where we have a . in the end position. We then make a rule for them.

All the states we have are:

1 = { (S’ -> S.) }

2 = { (S -> a.b) }

3 = { (S -> b.AS), (A -> .bA), (A -> .) } #rule 1

4 = { (S -> ab.) } #rule 2

5 = { (S -> bA.S), (S -> .bAS), (S->.ab) }

6 = { (A->b.A), (A->.bA), (A->.) } #rule 1 (again)

7 = { (S-> bAS.) } #rule 3

8 = { (A -> bA.) } #rule 4

It doesn’t really matter what numbers you give the rules, just as long as you make the consistent.
These rules are actually reductions, and we need them in the table.
To calculate where to put these we need to work out the follow sets.
(I’m not going to go over how to work out the follow set, you must look this up yourself)

But the Follow set is written FW(X), where X is a non-terminal, and the follow sets are made up of terminals, since they are effectively finding what terminal will come next after the non-terminal.

FW(S) = {$}
FW(A) = {a, b}

Therefore we look at the rule, for instance rule 2.
Rule 2 only happens in state 4. So we know the row it will be in our table. And the rule is S -> ab. – the part on the left of the rule is the part you’re looking for. The follow set of this is the column where you put the rule.
So you need to put r2 in column $, row 2.
And do this for every rule.

We also need to put shifts and gotos in the table.
A shift is for terminals and a goto is for non-terminals.

It’s probably best to explain with an example.

I’ll use state 3, since it’s one of the most complicated (and I hate when people give you the easiest example and expect you can do that hardest):

cl(3A) = { (S -> bA.S), (S -> .bAS), (S->.ab) } = 5
cl(3b) = { (A->b.A), (A->.bA), (A->.) } = 6

For this part we need to know what states you can get to from state 3.

As you can see, from state 3 via A, you can get to 5,
So we put g5 in row 3, column A. (g for goto since A is non-terminal)

And we can see, from state 3 via b, you can get to 6,
Meaning we put s6 in row 3, column b.

We do this for every state and we get the table:
a b $ S A
0 s2 s3 g1
1 acc
2 s4
3 r1 s6/r1 g5
4 r2
5 s2 s3 g7
6 r1 s6/r1 g8
7 r3
8 r4 r4

Now this is the SLR(1) parse table we’ve been wanting to get!

Now actually for the tricky bit (this will probably need going over at least a couple of times).

Parsing a string with our table.
We’re going to be parsing the string bbbab with this parse table and hopefully getting back to the accepting state.

Since SLR(1) parsers are ‘bottom-up’ parsers, they take the string, and then try to go through the different variations until it can get back to the original accepting state, meaning it is a correct string for the grammar.

And this is also where the largest match strategy comes in (trust me I did mention it at the start). This basically means that if you come to a point in the table where you can do a shift or a reduction, choose the shift.
You actually have a stack, and the input. You’re trying to basically derive the stack down to $0 and the something that will make you go to the row that has the accepting state. So in this case we’re trying to get the stack to $0S1, and the input to the end terminal, $, and the action as accept (acc).

Stack                                        Input                                        Actions

$0                                              bbbab$                                       s3



So this is where we will always start, with our stack and our input.
The action is obtained by looking at our parse table. You look at the last number on the stack (0 in this case) and the next input, b.
This gives you the action s3.
Shifting means you take the terminal from the input and put it on the stack, with the row number (3 in this case).

And then you keep on doing this.
$0b3                                           bbab$                                             s6
$0b3b6                                        bab$                                               s6
$0b3b6b6                                    ab$                                                 r1

Now that we have a different input, a, it means the action is now r1, or rule 1.
This means you look at the row you got this from, 6 (also the last number on the stack if you forgot) and then you have to look at the rule.
Rule 1 is (A -> .). When you do a reduction, you look at the right hand side and pop this off (take it off) the stack.
So in this case the right hand side of the rule is epsilon, so nothing, which means you don’t need to pop anything off of the stack.
Once you have done this you push the left hand side of the rule onto the stack, making
$0b3b6b6A
Now you need to find the number that goes here. The non-terminal on the left hand side of the rule is A, so we look up state 6, column A and we can see there is a g8, so you put the number associated with A on the stack as 8.

$0b3b6b6A8                              ab$                                                   r4


As you can see above the input did not change. Each time a reduction is done, it does not change the input!

So now we look at r4, which is (A -> bA.)
We look at the right hand side and we see bA, so we pop this off the stack.
So

$0b3b6b6A8
becomes
$0b3b6
We take a note of the number at the end here, and then we pop on the right hand side
$0b3b6A
and look up the number for A. For this we have to use the number we took a note of (the number before the A) and find this in the table. Again it’s g8 so we have:

$0b3b6A8

And now we carry on:

$0b3b6A8                                   ab$                                                   r4
$0b3A5                                      ab$                                                    s2
$0b3A5a2                                   b$                                                     s4
$0b3A5a2b4                                $                                                      r2
$0b3A5S7                                   $                                                       r3
$0S1                                           $                                                      acc

And now we have got to the accepting state action, we have finished, and the string has been parsed!

I fully expect this may need a couple of read-throughs and it still might not make sense. The key is to just take it slowly, being careful to look up at the table and look up the rules when doing reductions.

-lavamunky

1 comment:

  1. Hey,nice post but it would be great if you could change the color of your blog. The blue shades of font are difficult to read on a blue background.

    ReplyDelete