Working through The Elements of Computing Systems, I had the opportunity to write a simple assembler for an assembly language called Hack. The language is extremely simple, even for assembly, so a regex can parse it exactly to spec, which is not true of languages with more complicated grammars. Better still, two regex extensions, verbose mode and named captures, let me write a parsing regex that isn’t completely opaque, so I could finish the project and have some fun in the process.
For more information about the Hack assembly language or The Elements of Computing Systems, you can visit their website.
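For anyone who hasn’t met the two extensions, here is a minimal sketch. It assumes Python’s re module, whose (?P<name>...) syntax the pattern in this post uses. Verbose mode is the re.VERBOSE flag, which ignores unescaped whitespace and # comments outside character classes. A named capture is read back with match.group("name"). The pattern is deliberately tiny and only recognizes a label line; LABEL_ONLY is a hypothetical name, not part of the assembler.

```python
import re

# Hypothetical mini-pattern: recognizes only a label line such as "(LOOP)".
LABEL_ONLY = re.compile(
    r"""
    ^ \s*                                           # leading whitespace
    \(                                              # literal open paren
    (?P<Lsymbol> [A-Za-z_.$:][A-Za-z0-9_.$:]* )     # the label name
    \)                                              # literal close paren
    \s* $                                           # trailing whitespace
    """,
    re.VERBOSE,
)

match = LABEL_ONLY.match("   (LOOP)  ")
print(match.group("Lsymbol"))  # LOOP
```

The layout and comments cost nothing at match time; only the flag changes how the pattern text is read.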
But before we see the pretty version, here’s the parsing regex stripped of all whitespace and comments, since half the fun of regexes is their incomprehensible terseness:
^\s*(?P<instruction>(?P<L>\((?P<Lsymbol>[A-Za-z_.$:][A-Za-z0-9_.$:]*)\))|
(?P<A>@(?:(?P<Ainstruction>\d+)|(?P<Asymbol>[A-Za-z_.$:][A-Za-z0-9_.$:]*)
))|(?P<C>(?:(?P<Cdest>A?M?D?)=)?(?P<Ccomp>0|1|-1|[-!]?[DAM]|[DAM][-+]1|[D
AM][-+&|][DAM])(?:;(?P<Cjump>J(G[TE]|L[TE]|EQ|NE|MP)))?))?\s*(?://.*)?$
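To try the terse form, here is a rough sketch, again assuming Python’s re module. It treats the line breaks in the listing above as display wrapping and rejoins the pieces into one string, since without verbose mode a stray newline would become part of the pattern. The names TERSE and filled_groups are hypothetical helpers for illustration.

```python
import re

# The terse pattern rejoined into a single string (adjacent raw string
# literals are concatenated by Python; no newlines end up in the pattern).
TERSE = re.compile(
    r"^\s*(?P<instruction>(?P<L>\((?P<Lsymbol>[A-Za-z_.$:][A-Za-z0-9_.$:]*)\))|"
    r"(?P<A>@(?:(?P<Ainstruction>\d+)|(?P<Asymbol>[A-Za-z_.$:][A-Za-z0-9_.$:]*)"
    r"))|(?P<C>(?:(?P<Cdest>A?M?D?)=)?(?P<Ccomp>0|1|-1|[-!]?[DAM]|[DAM][-+]1|[D"
    r"AM][-+&|][DAM])(?:;(?P<Cjump>J(G[TE]|L[TE]|EQ|NE|MP)))?))?\s*(?://.*)?$"
)


def filled_groups(line):
    """Return the named groups that took part in the match, or None if no match."""
    match = TERSE.match(line)
    if match is None:
        return None
    return {name: text for name, text in match.groupdict().items() if text is not None}


print(filled_groups("D=M+1;JGT  // bump and jump"))
print(filled_groups("@42"))
print(filled_groups("D=Q"))  # None: Q is not a register, so the line is rejected
```

Because the pattern is anchored with ^ and $, a line either parses completely or is rejected as a whole.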
And here is the version with full whitespace and comments. This is the longest regex I’ve ever written (by far).
^
\s*                     # Whitespace gobbler
# This capture group is non-empty if the line contains an instruction
(?P<instruction>
    # Matching labels of form:
    #   (Xxx)
    # This group is non-empty if a label was matched.
    (?P<L>
        \(
        # Lsymbol group contains label name.
        # Matches words to Hack spec.
        (?P<Lsymbol> [A-Za-z_.$:][A-Za-z0-9_.$:]* )
        \)
    )
    # Matching A instructions of form:
    #   @123 or @Xxx
    # This group is non-empty if an A instruction was matched.
    |(?P<A>
        @(?:
            # Ainstruction group contains instruction target (numeric)
            # of form: @123
            (?P<Ainstruction> \d+ )
            |
            # Asymbol group contains the symbol (label or variable)
            # of form: @Xxx
            (?P<Asymbol> [A-Za-z_.$:][A-Za-z0-9_.$:]* )
        )
    )
    # Matching C instructions of form dest=comp;jump
    #   A=D+M or AM=1 or A=!D;JEQ etc.
    # This group is non-empty if a C instruction was matched
    |(?P<C>
        (?:
            # Contains C destination
            (?P<Cdest> A?M?D? )
            =
        )?
        # Contains C computation
        (?P<Ccomp>
            0
            | 1
            | -1
            | [-!]?[DAM]
            | [DAM][-+]1
            | [DAM][-+&|][DAM]
        )
        (?:
            ;
            # Contains C jump conditional
            (?P<Cjump> J(G[TE]|L[TE]|EQ|NE|MP) )
        )?
    )
)?                      # Note: each line can have 0 or 1 instructions.
\s*                     # Whitespace gobbler
(?://.*)?               # Comment matching.
$
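The named groups make the next step of an assembler easy to sketch: check which of L, A, or C is filled, then read the matching sub-groups. The following is one possible shape under stated assumptions, not the assembler’s actual code. It assumes Python’s re module with re.VERBOSE. The pattern is a condensed copy of the one above with the comments removed. HACK_LINE, parse_line, and the tuple layout are all hypothetical.

```python
import re

# Condensed copy of the commented pattern; the group names are the ones used
# in the post, every other name in this sketch is hypothetical.
HACK_LINE = re.compile(
    r"""
    ^ \s*
    (?P<instruction>
        (?P<L> \( (?P<Lsymbol> [A-Za-z_.$:][A-Za-z0-9_.$:]* ) \) )
      | (?P<A> @ (?: (?P<Ainstruction> \d+ )
                   | (?P<Asymbol> [A-Za-z_.$:][A-Za-z0-9_.$:]* ) ) )
      | (?P<C>
            (?: (?P<Cdest> A?M?D? ) = )?
            (?P<Ccomp> 0 | 1 | -1 | [-!]?[DAM] | [DAM][-+]1 | [DAM][-+&|][DAM] )
            (?: ; (?P<Cjump> J(G[TE]|L[TE]|EQ|NE|MP) ) )?
        )
    )?
    \s* (?://.*)?
    $
    """,
    re.VERBOSE,
)


def parse_line(line):
    """Classify one source line; raise ValueError if it does not parse."""
    m = HACK_LINE.match(line)
    if m is None:
        raise ValueError(f"not valid Hack assembly: {line!r}")
    if m.group("L"):
        return ("label", m.group("Lsymbol"))
    if m.group("A"):
        # Exactly one of the two A sub-groups is filled.
        if m.group("Ainstruction") is not None:
            return ("a_number", int(m.group("Ainstruction")))
        return ("a_symbol", m.group("Asymbol"))
    if m.group("C"):
        return ("c", m.group("Cdest"), m.group("Ccomp"), m.group("Cjump"))
    return ("blank",)  # empty line or comment-only line


for source in ["(LOOP)", "@21  // counter", "@sum", "  D=D+A", "0;JMP", "// note"]:
    print(repr(source), "->", parse_line(source))
```

Returning plain tuples keeps the sketch short; a real assembler would probably want a small class or dataclass per instruction kind before translating to binary.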