Scanners (not the movie)

In my previous post I introduced a program I wrote that uses the method described in First-Order Logic by Raymond M. Smullyan to prove that an arbitrary propositional logic expression is a tautology. That program uses some techniques I want to describe, the first of which is the scanner. I have always written my scanners by hand, for two primary reasons: first, I have never found scanners that difficult to write; second, they are very time critical, often accounting for more than 20% of the overall parsing time, and auto-generated scanners are hard to optimize. Tableaux uses a hand-written scanner. The code for the scanner is in Scanner.cs in the zip file found here.

A scanner is a routine that takes text input and classifies it into a stream of tokens, typically producing one token at a time. The scanner is the lowest level of the parsing process. Parsers are built on top of scanners and use the scanner's results to determine the meaning of the source according to some kind of grammar. The distinction between the parser and the scanner is somewhat arbitrary: you could treat each character of the input as a token and build the parser up from there, but doing so is typically slower than using a traditional scanner, so most parsers use one. The line between the scanner and the parser is usually drawn at the lexical elements, such as identifiers, reserved words, numeric and string constants, comments, etc. More formally, a lexical element is a string of characters that can be matched and classified by a regular expression; scanner generators, such as Lex, use regular expressions to specify the scanner.
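
To make that concrete, here is a minimal sketch of the kind of token classification a scanner for propositional logic might produce. The names are hypothetical, chosen for illustration rather than taken from Scanner.cs.

    // Hypothetical token kinds for a propositional logic scanner.
    // Each value classifies one lexical element of the input.
    enum TokenKind
    {
        Identifier,   // p, q, isRaining
        And,          // &
        Or,           // |
        Not,          // !
        Implies,      // ->
        LeftParen,    // (
        RightParen,   // )
        Error,        // any character the scanner cannot classify
        EndOfFile
    }

The parser then consumes these tokens one at a time and never looks at raw characters itself.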

The trick to a fast scanner is to reduce its per-character cost; that is, to execute as few instructions per character scanned as possible. A few of the tricks I use to reduce the per-character cost (all three appear together in the sketch after this list):

  1. Use the switch statement
    • Compilers typically optimize switch (or case) statements heavily, often generating jump tables.
  2. Pull fields into local variables and only write them back just before returning.
    • Local variables can be mapped to registers during code generation; fields never will be. Stores to global memory (i.e. a field of an object allocated from the heap) are very hard for code generators to optimize. Local variables are much easier.
  3. Make checking for boundaries (end of file or end of buffer) appear as a character.
    • This merges the end-of-file check with character classification.
    • In this case I use the character value 0 to indicate end of file. If I were willing to use unsafe code I would have taken advantage of the fact that the CLR ensures a 0 value at the end of every string.
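
Here is a minimal sketch showing how all three techniques combine in a scanner's main loop, using the hypothetical TokenKind above. It is illustrative only; the real code is in Scanner.cs.

    sealed class Scanner
    {
        private readonly string text;
        private int position;

        public Scanner(string text)
        {
            this.text = text;
        }

        public TokenKind Next()
        {
            // Technique 2: pull the fields into locals. Locals can be
            // kept in registers; stores to fields must go to memory.
            string text = this.text;
            int pos = this.position;

            // Technique 3: past the end of the buffer, the next character
            // is reported as '\0', so end of file is just another case.
            char ch = pos < text.Length ? text[pos] : '\0';

            // Technique 1: a switch over the character value typically
            // compiles to a jump table rather than a comparison chain.
            TokenKind kind;
            switch (ch)
            {
                case '\0':
                    kind = TokenKind.EndOfFile;
                    break;
                case '&':
                    pos++;
                    kind = TokenKind.And;
                    break;
                case '|':
                    pos++;
                    kind = TokenKind.Or;
                    break;
                case '(':
                    pos++;
                    kind = TokenKind.LeftParen;
                    break;
                case ')':
                    pos++;
                    kind = TokenKind.RightParen;
                    break;
                // ... whitespace, identifiers, '!' and "->" elided ...
                default:
                    pos++;
                    kind = TokenKind.Error;
                    break;
            }

            // Technique 2, continued: write the locals back exactly once.
            this.position = pos;
            return kind;
        }
    }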

One thing that looks slow but isn't is the GetChar() routine. Normally you would see that and think I am paying for a CALL/RET pair, but this gets inlined in a retail build. I am paying for the end-of-buffer check on every character but, as noted above, this could be removed if I were willing to use unsafe code, which is definitely overkill for a small program like Tableaux.
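
For illustration, a GetChar in that style might look like the sketch below; it is not the exact code from Scanner.cs, but it shows why the routine is cheap: the bounds comparison is the only per-character overhead, and the method is small enough for the JIT to inline.

    // Sketch: fetch the character at pos, folding the end-of-buffer
    // check into the fetch by returning '\0' past the end. In a retail
    // build the JIT inlines this, eliminating the CALL/RET pair.
    private char GetChar(int pos)
    {
        return pos < this.text.Length ? this.text[pos] : '\0';
    }

With unsafe code even the comparison could go away since, as mentioned above, the CLR guarantees a 0 value after the last character of every string.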

I went a bit overboard, intentionally, with identifiers. The scanner handles C#-style identifiers, including the Unicode letter and number classifications. To avoid calling the Unicode classification routines for every character, I hard-coded the ASCII cases and only call the Unicode classification routines when I encounter a non-ASCII character that could be part of an identifier. It is not that the classification routines are slow; it is that every cycle counts in the scanner, and avoiding the CALL/RET and the character table lookup is significant for files that contain mostly ASCII characters. Note also that I hard-code some common non-identifier characters in the identifier's inner loop. These are the characters most likely to follow an identifier, so for most input the Unicode classification routines are never called at all.
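
A sketch of that fast path is below. Its shape matches the description above, though the details (exactly which followers are hard-coded, the exact category set) are illustrative rather than copied from Scanner.cs.

    using System.Globalization;

    static class IdentifierChars
    {
        // Sketch of the identifier loop's classification. ASCII letters,
        // digits, and '_' are accepted with direct comparisons, and all
        // other ASCII characters are rejected without a call, so purely
        // ASCII input never reaches the Unicode tables.
        public static bool IsIdentifierPart(char ch)
        {
            if (ch >= 'a' && ch <= 'z') return true;
            if (ch >= 'A' && ch <= 'Z') return true;
            if (ch >= '0' && ch <= '9') return true;
            if (ch == '_') return true;
            if (ch <= 0x7F) return false;   // space, operators, ')', etc.

            // Rare case: a non-ASCII character. Only now do we pay for
            // the call and the category table lookup.
            switch (CharUnicodeInfo.GetUnicodeCategory(ch))
            {
                case UnicodeCategory.UppercaseLetter:
                case UnicodeCategory.LowercaseLetter:
                case UnicodeCategory.TitlecaseLetter:
                case UnicodeCategory.ModifierLetter:
                case UnicodeCategory.OtherLetter:
                case UnicodeCategory.LetterNumber:
                case UnicodeCategory.DecimalDigitNumber:
                case UnicodeCategory.ConnectorPunctuation:
                case UnicodeCategory.NonSpacingMark:
                case UnicodeCategory.SpacingCombiningMark:
                case UnicodeCategory.Format:
                    return true;
                default:
                    return false;
            }
        }
    }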

