Monday, May 11, 2009

Efficient Parsing in Javascript

I rewrote the Google Code source code syntax highlighter recently. It decorates source code in a variety of languages, and I learned a few things about parsing in Javascript that others might find useful. Below I compare a few methods for parsing in Javascript and show benchmarks at the end.

Problem

I got some reports that the syntax highlighter for Google Code was slow on large input files, so I decided to rewrite it.

The inner loop of the parser looked something like:
var patterns = [
  // Not the real patterns:
  /^\s+/,                      // whitespace
  /^[a-z]\w*/i,                // an identifier
  /^\d+(?:\.\d+)?/,            // a number
  /^"(?:[^"\\]|\\[^\r\n])*"/,  // a string
  ...
];
while (input) {
  var token;
  for (var i = 0, n = patterns.length; i < n; ++i) {
    token = input.match(patterns[i]);
    if (token) { break; }
  }
  // do something with the token
  input = input.substring(token[0].length);
}

Diagnosis

I came up with a few hypotheses about what could be wrong:
  1. One or more of the patterns is trying to match the entire input before failing, making the loop O(n²).
  2. The substring operation is doing an array copy.
  3. The regexp implementation is doing an array copy for the post-match content ($' in Perl) even though I never use it, which would make the loop O(n²).
  4. All the small strings generated are causing the GC to thrash.
  5. One or more of the regex patterns is worse than O(n) in the size matched.
I ruled out (1) and (5) by testing with the simple pattern set [/^\s+/, /^\S+/] and still seeing O(n²) behavior. I was really surprised that it seemed to be O(n²) even on browsers other than IE6, since I thought that most browsers newer than IE6 were much more efficient when it came to string operations.
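For reference, that test amounted to timing a loop like the following at doubling input sizes. This is my reconstruction of the harness, not the original code; `naiveTokenize` and `timeNaive` are names I made up.

```javascript
// My reconstruction of the diagnostic harness (not the original code).
// Times the naive one-token-at-a-time loop over the trivial pattern set
// [/^\s+/, /^\S+/].  If doubling the input roughly quadruples the time,
// the loop is O(n^2).
function naiveTokenize(input, patterns) {
  var nTokens = 0;
  while (input) {
    var token = null;
    for (var i = 0, n = patterns.length; i < n; ++i) {
      token = input.match(patterns[i]);
      if (token) { break; }
    }
    ++nTokens;
    input = input.substring(token[0].length);  // chop off the matched token
  }
  return nTokens;
}

function timeNaive(nWords) {
  var input = new Array(nWords + 1).join('word ');  // nWords copies of 'word '
  var t0 = +new Date;
  naiveTokenize(input, [/^\s+/, /^\S+/]);
  return +new Date - t0;
}
```

Plotting timeNaive against nWords makes the superlinear growth obvious even without a profiler.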

I couldn't find an easy way to differentiate between (2) and (3) but don't need to since addressing one would address the other.

I ruled out (4) since that would predict that performance would be a step-function — linear with a good constant factor for small inputs, and linear with a bad constant factor for large inputs. But I didn't see that.

Experiment

I tried merging the regular expressions into one so that I could do a global match — using one pass over the input to break it into all tokens.

var uberPattern = new RegExp(
  '(?:'
  + '\\s+'                            // whitespace
  + '|[a-zA-Z]\\w*'                   // an identifier
  + '|\\d+(?:\\.\\d+)?'               // a number
  + '|"(?:[^"\\\\]|\\\\[^\\r\\n])*"'  // a string
  + ')',
  'g');  // global makes match return a list of all matches
var tokens = input.match(uberPattern);
for (var i = 0, n = tokens.length; i < n; ++i) {
  var token = tokens[i];
  // do something with the token
}
and this got rid of the O(n²) behavior on all the browsers I tested (IE6, Safari 3.2, and Firefox 3.0).

The "do something" bit requires knowing which pattern the token matched, so I still have to run each token through the individual patterns to classify it, but now the regexes are applied to a much shorter string whose length does not vary with the input size. This classification can also be memoized effectively, since the majority of tokens in real code are short, repeated tokens (single spaces, braces, keywords, etc.).

Solution

function makeParser(patterns) {
  // A poor man's compiler compiler that takes multiple regular expressions
  // and returns a RegExp that is not anchored at the start of the input, and
  // that globally matches the union of the languages matched by the
  // individual patterns.
  var uberPattern = combineRegexs(patterns);
  var nPatterns = patterns.length;
  return function parser(input) {
    var tokens = input.match(uberPattern);
    var classifications = {};
    for (var i = 0, nTokens = tokens.length; i < nTokens; ++i) {
      var token = tokens[i];
      var classification = -1;
      if (Object.prototype.hasOwnProperty.call(classifications, token)) {
        classification = classifications[token];
      } else {
        for (var j = 0; j < nPatterns; ++j) {
          if (patterns[j].test(token)) {
            classification = j;
            break;
          }
        }
        classifications[token] = classification;
      }
      // do stuff with token and classification.
    }
  };
}
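combineRegexs isn't shown above. A minimal version, sketched below, just strips each pattern's leading ^ anchor and joins the pattern sources into one global alternation. This is my guess at it, not the original implementation; a real version would also have to reconcile flags (i, m) and renumber capture groups.

```javascript
// A minimal sketch of combineRegexs (my guess, not the original
// implementation): strip each pattern's leading ^ anchor and join the
// pattern sources into one global, unanchored alternation.
function combineRegexs(patterns) {
  var sources = [];
  for (var i = 0; i < patterns.length; ++i) {
    var source = patterns[i].source;
    // Drop a leading ^ so the combined pattern can match anywhere.
    if (source.charAt(0) === '^') { source = source.substring(1); }
    sources.push('(?:' + source + ')');
  }
  return new RegExp(sources.join('|'), 'g');
}
```

For the patterns above this yields a regexp equivalent to /(?:\s+)|(?:[a-zA-Z]\w*)|…/g, and order matters: earlier patterns win ties, just as in the original loop.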

Faking lookbehind

There were a number of places where I needed lookbehind, but Javascript doesn't support it. In Javascript, determining whether a solidus (/) starts a regular expression or is a division operator requires knowing the preceding token†, so lookbehind comes in really handy there.
† — Not actually true, since Javascript's lexical grammar is not regular, but Waldemar Horwat found a grammar that is lexically regular and indistinguishable from the real ECMAScript grammar in all but a few, mostly useless, cases.

Steven Levithan came up with a couple of mechanisms for faking it though, and I already had a mechanism for embedding one language in another so that I could, e.g., decorate CSS in HTML style attributes and elements. I reused this embedded language mechanism to do the equivalent of a regular expression replace by delegating to a small sub-grammar, and that gave me lookbehind.
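As an illustration of the capture-group style of faked lookbehind, here is a sketch in the spirit of Levithan's trick. This is my example, not the highlighter's actual code, and the token set is a deliberately crude heuristic, not the full grammar.

```javascript
// Faking lookbehind with a capture group (my illustration, not the
// highlighter's actual code): match the "behind" context explicitly.
// Heuristic: a solidus is division when it follows something that can
// end an expression (an identifier character or closing bracket);
// otherwise it starts a regex literal.
function classifySlashes(src) {
  var kinds = [];
  var re = /([\w$\)\]])?\s*\//g;  // optional capture plays the lookbehind
  for (var m; (m = re.exec(src));) {
    kinds.push(m[1] ? 'division' : 'regex-start');
  }
  return kinds;
}
```

Since the capture is consumed by the match, this style only works when matches can't overlap; Levithan's other variants handle the overlapping case.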

Benchmark Setup

I wrote a benchmark that creates large data files with 512, 1024, 2048, ... rows respectively. It creates one data file of JSON like
[
  { "friendly": true, "blinking": "hell-yes", "greeting": "Howdy!" },
  { "friendly": true, "blinking": "hell-yes", "greeting": "Howdy!" },
  ...
]
and a similar XML file. It then times how long it takes to decorate each input and tabulates the results.
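The data generation is straightforward; something like this sketch, which is my reconstruction rather than the original benchmark code:

```javascript
// My reconstruction of the benchmark's data generation (not the original
// code): build a JSON array of `count` identical rows.
function makeJsonInput(count) {
  var row = '{ "friendly": true, "blinking": "hell-yes", "greeting": "Howdy!" }';
  var rows = [];
  for (var i = 0; i < count; ++i) { rows.push(row); }
  return '[\n' + rows.join(',\n') + '\n]';
}
```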

The results show pretty strikingly that there was definite O(n²) behavior before, and that the new method is both faster in its constant factor and linear (or nearly so on Safari).

Benchmark Results

I ran all tests on a 2.16 GHz MacBook Pro running Mac OS 10.5.6. IE6 numbers are taken from IE running on VMWare Fusion.

The tests take a while to run, and in some cases I had to click "Scripts on this page are taking a long time to run." dialogs which add noise to the results, but which I do not think invalidate them. I ran fewer input sizes on the old version since it scales so badly, and on Firefox there is a known bug that prevents certain uses of regular expressions on very large strings.


IE6 under VMWare, New

XML
Count   Length (B)   Time (ms)   Items Per Second
  512        58888        2391             214.1
 1024       117768        4906             208.7
 2048       235528       14343             142.8
 4096       471048       23282             175.9
 8192       942088       45047             181.9
16384      1884168      105047             156.0

JSON
Count   Length (B)   Time (ms)   Items Per Second
  512        34816         859             596.0
 1024        69632        1625             630.2
 2048       139264        3281             624.2
 4096       278528        7344             557.7
 8192       557056       15813             518.1
16384      1114112       35856             456.9



IE6 under VMWare, Old

XML
Count   Length (B)   Time (ms)   Items Per Second
  512        58888       12484              41.0
 1024       117768       40906              25.0
 2048       235528      166462              12.3

JSON
Count   Length (B)   Time (ms)   Items Per Second
  512        34816        4391             116.6
 1024        69632       20093              51.0
 2048       139264       81000              25.3



Safari New

XML
Count   Length (B)   Time (ms)   Items Per Second
  512        58888         634             807.6
 1024       117768        1307             783.5
 2048       235528        2660             769.9
 4096       471048        5348             765.9
 8192       942088       11132             735.9
16384      1884168       23829             687.6

JSON
Count   Length (B)   Time (ms)   Items Per Second
  512        34816         448            1142.9
 1024        69632         451            2270.5
 2048       139264         904            2265.5
 4096       278528        1773            2310.2
 8192       557056        3751            2184.0
16384      1114112        7376            2221.3



Safari Old

XML
Count   Length (B)   Time (ms)   Items Per Second
  512        58888       16720              30.6
 1024       117768       58799              17.4
 2048       235528      236828               8.6

JSON
Count   Length (B)   Time (ms)   Items Per Second
  512        34816       13937              36.7
 1024        69632       65313              15.7
 2048       139264      233607               8.8



Firefox New

XML
Count   Length (B)   Time (ms)   Items Per Second
  512        58888        1153             444.1
 1024       117768        2315             442.3
 2048       235528        5587             366.6
 4096       471048        9054             452.4

JSON
Count   Length (B)   Time (ms)   Items Per Second
  512        34816         301            1701.0
 1024        69632         604            1695.4
 2048       139264        1204            1701.0
 4096       278528        3675            1114.6
 8192       557056        4834            1694.7



Firefox Old

XML
Count   Length (B)   Time (ms)   Items Per Second
  512        58888        9836              52.1
 1024       117768       40669              25.2
 2048       235528      155776              13.1

JSON
Count   Length (B)   Time (ms)   Items Per Second
  512        34816        6629              77.2
 1024        69632       27533              37.2
 2048       139264      104282              19.6

Obligatory First Post

This blog will contain posts on computer science, cycling and other things that interest me. It reflects my own opinions, which often differ from those of my employer.

Creative Commons License
Mike Samuel's blog is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.