Simply Parse in C
by Chloe Kudryavtsev
People are terrified of parsers and parsing. To the point of using magical libraries with custom syntaxes to learn just to get started. In the hopes of completely shattering this preconception, I will write a parser for the “ini” file format in about 150 lines of pure and readable ISO C99. Furthermore, this parser will be something that's nice to use and has error correcting features, such that it is actually useful and usable outside this example setting.
Design
There's no standard for the ini format, and all the existing implementations disagree, so let's take some liberties, XKCD 927 style. No newlines in keys, values, or section names. Empty values are not allowed. Comments only on their own lines (minus whitespace). Whitespace-insensitive (whitespace at the start of line, end of line, around the “=”, is all ignored). No need for a terminating newline either. Oh that's more than most C ini parsers do? Isn't that convenient.
To get the jargon out of the way, the parser will be recursive-descent LL(1).
The API will have a single entry point. You call parse_ini
against a FILE*
, with optional userdata (just any context the API user wants to have available in their callback), and a required callback. The callback will be called for every key-value pair, and it will receive the section, key, and value, alongside the userdata.
All of the strings in question will be temporary private data, and must be copied over (for simplicity). If a given terminal does not fit inside the private data, the given value is truncated, but the parsing continues without error.
Implementing
Let's start by writing the top-level function. We macro-define the max length. This is done firstly so that it can be changed (by surrounding it with ifdefs later), and secondly because we're allocating temporary memory. Typical parsers already operate in quadratic time, adding countless allocations and reallocations for a use-case where all values are typically small isn't reasonable.
#define INI_SEC_MAXLEN 64
#define INI_KEY_MAXLEN INI_SEC_MAXLEN
#define INI_VAL_MAXLEN INI_KEY_MAXLEN * 16
Now we can write the top-level parse_ini
function. The effective PEG for it is ini <- expr*
. We start from the top-level because we have a specific idea for the UX, so we need to write the rest of the functions in a way as to ensure we can deliver on it.
// if the callback returns non-zero, parsing will stop
typedef int (*callback)(const char*, const char*, const char*, void*);
// we return the number of bytes handled, sort of, you'll see
int parse_ini(FILE *src, void *userdata, callback cb) {
char section[INI_SEC_MAXLEN] = {0};
char key[INI_KEY_MAXLEN] = {0};
char value[INI_VAL_MAXLEN] = {0};
int status, out = 0;
// we stop going whenever we fail to consume any data, explicitly error,
// the stream errors, or the stream ends
while ((status = parse_expr(src, userdata, section, key, value, cb) >= 0) {
out += status;
if (feof(src) || ferror(src)) break;
}
return ferror(src) ? -out : out;
}
Nothing too complicated here, you just iterate the stream until something makes us stop. We return negative values in the case of an irrecoverable stream error that isn't EOF. We get into the real meat in parse_expr
.
Note that all the future functions and definitions are static
, but I'm omitting that keyword (as well as the occasional inline
for brevity – you don't need to see it in this case).
In parse_expr
, we want to parse an expression. In an ini file, you have comments, section declarations, and key-value pairs. We also want to skip whitespace. So let's do exactly that. The effective PEG is expr <- ws* (section / comment / kv)
.
int parse_expr(FILE *src, void *userdata, char *section, char *key, char *value, callback cb) {
int len = parse_skipws(src);
if (len) return len; // to avoid confusing byte counts, we're in a loop anyway
int c;
// figure out the next expression
switch ((c = fgetc(src))) {
case EOF:
return 0; // let the outer loop figure out if this was an error or not
case '[': // a section
return parse_section(src, section);
case '#':
case ';':
return parse_skipuntil(src, "\n");
default: // key-value pair
ungetc(c, src); // we need to conserve this one
return parse_kv(src, userdata, section, key, value, cb);
}
}
Note: we don't need to ungetc
in sections or comments, since the only real use of the character is to identify the type of expression. Before we get into the implementations of parse_section
and parse_kv
, let's write the utility functions we'll need, such as parse_skipuntil
.
Let's start with parse_skipwhile
and parse_skipuntil
, which will skip characters while a condition holds, or until a condition occurs. We'll define a condition as the next character being in a secondary argument (a string), a check we can perform using strchr
. Neither of these have direct PEG alternatives, since this approach is closer to a lexer technique, which we can utilize precisely because we're writing a hand parser.
int parse_skipuntil(FILE *src, const char *s) {
int out = 0;
for (int c; (c = fgetc(src)) != EOF; out++) {
if (strchr(s, c)) return out;
}
return ferror(src) ? -out : out;
}
int parse_skipwhile(FILE *src, const char *s) {
int out = 0;
for (int c; (c = fgetc(src)) != EOF; out++) {
if (!strchr(s, c)) {
ungetc(c, s);
return out;
}
}
return ferror(src) ? -out : out;
}
Note that they're almost the same. The primary notable difference is the negation of the strchr
call, and that skipwhile
does an ungetc
(skipuntil
consumes the terminator).
We can now define parse_skipws
. It's actually quite simple.
// the characters we consider to be whitespace
const char wss[] = " \t\r\n";
#define parse_skipws(src) parse_skipwhile(src, wss)
Before we move on to the juicy stuff, we'll implement one more thing: parse_until
. Semantically, it's the same as parse_skipuntil
, but it will actually write what it reads into a string pointer, up to its maximum length. Then it just becomes parse_skipuntil
. You could do the same thing with parse_skipwhile
too.
int parse_until(FILE *src, char *ptr, ssize_t maxlen, const char *s) {
int out = 0, c;
while (out < maxlen) {
c = fgetc(src);
if (*ptr == EOF) { // hit error while scanning
*ptr = 0;
return ferror(src) ? -out : out;
} else if (strchr(s, c)) {
*ptr = 0;
return out;
}
(*ptr++) = c; out++;
}
// we only make it here if we hit maxlen
(*--ptr) = 0;
int skipped = parse_skipwhile(src, s);
if (skipped > 0) {
return out + skipped; // errors are negative, eof is ok
}
return ferror(src) ? (skipped - out) : (out - skipped);
}
We can now write parse_section
and such!
#define parse_section(src, section) parse_until(src, section, INI_SEC_MAXLEN, "]\n")
Yeah, that's all there is to it. Note that we also allow terminating with a newline. This way, if someone forgot a “]”, we can still parse it. Though it does break comments on the same line.
Let's also write parse_key
and parse_value
, since we'll need them in parse_kv
in a second.
int parse_key(FILE *src, char *key) {
int out = parse_until(src, key, INI_KEY_MAXLEN, "=\n");
return stripright(key, wss);
}
int parse_value(FILE *src, char *value) {
int out = parse_until(src, value, INI_VAL_MAXLEN, "\n");
return stripright(value, wss);
}
First of all, we do the same thing with the key that we did with the section: if a key is not terminated by an “=”, presume the newline is the “=”. This way, in the absolute worst-case scenario, two key-value pairs get corrupted, and the rest parses fine, though it may be a good idea to check the value of “out” too.
Secondly, you'll note the “stripright”. We already discard leading whitespace in parse_expr
, but to strip trailing whitespace after the key (so before the “=”) and the value (so before the “\n”), we'll need to perform string brain surgery.
It's not too bad though, look:
// returns the number of output bytes
int stripright(char *c, const char *s) {
ssize_t len = strlen(s);
if (!len) return len; // already empty
while ((--len) >= 0 && strchr(s, c[len])) {}
// either strchr failed or len is now -1
if (len < 0) {
*c = 0;
return 0;
}
c[++len] = 0;
return len;
}
We can now finish the parser out with the most complex function in it: parse_kv
with an incredible 12 lines of code, including a callback. The closest PEG analogue is kv <- key ws* value
(where's the “=”? parse_kv
takes care of it, and the whitespace before it for us).
int parse_kv(FILE *src, void *userdata, const char *section, char *key, char *value, callback cb) {
int len = 0, tmp;
tmp = parse_key(src, key); // consumes the =
// if the key doesn't have a value or errored, we can't continue
if (tmp <= 0 || feof(src)) return 0;
len += tmp;
tmp = parse_skipws(src); // the whitespace after the "="
// if the value would have been empty, it ends up using the next line as the value
// it may not error or eof, but it may be empty
if (tmp < 0 || feof(src)) return 0;
len += tmp;
tmp = parse_value(src, value);
// any errors are fine, since we've finished parsing now
len += tmp > 0 ? tmp : -tmp;
// let callback request terminating the parse by returning non-zero
if (cb(section, key, value, userdata)) len *= -1;
return len;
}
And that's it, you have a functioning parser.
Discussion
To explain what's going on here on a less “here's the code, enjoy” level, we need to explain some of the jargon.
A grammar is the definition of the language being parsed. One may intuitively presume that the grammar for a given language or format is merely itself, i.e. that the BNF form is canonical. However, this is not the case. A grammar is actually dependent upon the parser in use. To discuss the platonic ideal of the language being recognized, we use the term “language”.
Any given language can have an arbitrary number of grammars describing it. These grammars are considered functionally equivalent. Since the language rarely if ever specifies expected resolutions to edge-conditions (if it did, it would be a “deterministic” language), grammars can be functionally equivalent while exhibiting different behaviors.
For example, let's consider the Lua language. The Lua language includes the following rules:
prefixexp <- var | functioncall | '(' exp ')'
functioncall <- prefixexp args | prefixexp ':' Name args
var <- Name | prefixexp '[' exp ']' | prefixexp '.' Name
This grammar is left-recursive. This means that a category of parsing algorithms cannot parse it! There are a number of reformulations that you can apply to achieve a different grammar that does not have the left-recursion problem, which will then recognize the Lua language successfully, but may have different edge-case behaviors (that are nowhere in the spec).
In short, there are a few reasons that parsing is a mess, and none of those reasons are actually resolvable by parser generators. For example, you will need to modify your grammar to get it to run on, say, tree-sitter, or antlr4, or Janet's PEGs.
A lot of the problem and solution space actually exists in the grammar definition, language design, and implementation flexibility. In this project, I set out to demonstrate all three.
For instance, you may notice that there is essentially zero error-reporting in the entire library. This is because I built the grammar in such a way that most inputs that are likely to actually occur are valid! Those that aren't will typically occur at the end of the file, and thus not affect the overall utility. Similarly, LL(1) is typically lexed, but I built the grammar in such a way as to be able to do LL(1) scannerless.
To summarize, to be good at parsing, you do not need to be an academic wizard, or to know all parsing theory. You need to be aware of the typical failure modes and make your own job easier by building in the flexibility you will then immediately use. The only exception is when you're implementing a strict specification that does specify what the edge-conditions do. Thankfully, those virtually do not exist, since specification writers often are aware of this. This is why the BNF forms of formal languages are so seemingly useless – it's there for your benefit, like undefined behavior in C, except it's actually beneficial.
Conclusion
Write your parsers, it can be fun and easy if you don't make it hard for yourself. The “true” (not edited for the article) sources of this tiny ini parsing library are available here in single-header form. It also contains an implementation of parse_while
, just for you, in case you happen to be building a parser that can take advantage of the same things I did (i.e. that is an LL(1)-compatible grammar that doesn't need a lexing phase; though you'll likely want to pass around a context struct instead of three bare pointers). It can also do heap allocation, so you know.
Hopefully, this has taught you something, and you'll reach for parser generators less. I do find it amusing how I ended up publishing this one before irex (a from scratch “irregular” expression educational library). It may or may not be the next article, presuming I actually finish it.
P.S. This blog will eventually move to the bunkerlabs.net domain... somewhere. There's currently an OIDC identity provider in the works that should be trivial to self-host that will be used to authenticate various services. Long story short, hopefully sometime in the next year, this blog will be ported over to there. The front page will likely mention the new location once that's a thing (there is no front page currently).
P.P.S. The point of having written this in C is to show how small and useful an LL(1) recursive descent parser can be, even in C. It should not be taken to mean that you should use C for all your parsing needs (unless you need the single codebase to be available in every language, in which case you have no choice). These techniques are applicable to virtually every language, and can result in even smaller parsers.
It is also of note that LL(1) is generally sufficient for most things you may want to do (ini was picked since it is not parsable with all the features present here with regular expressions or string splitting (as it is not a regular language), and to ensure that the implementation could be sufficiently small; if I code golfed it, it could be under 100 LOC, but I didn't want to code golf it.). Most programming languages, for instance, are designed in a way as to be parsable with an LL(k) grammar, which are provably transformable into LL(1) grammars.