Interoperability between D code and outside data sources like SQL databases or network APIs is common, and your D code needs some understanding of the structure of that external data. The structure is always there, and it is always verified eventually, but there are many different ways to go about both of these things.
What I most often do is lean into dynamic typing: I write code that assumes some implicit structure and run it. When it actually tries to use the data, it either works or it doesn't.
foreach(row; db.query("SELECT user_id, sum(spend) FROM test GROUP BY user_id")) {
    writeln(row["user_id"], " spent ", row[1]);
}
I like this since it tends to be quick to write and quick to compile, but on the other hand, the compiler doesn't verify that the row elements exist and can't tell me whether my SQL syntax is correct. I have to actually run the code to see if it works. If the database changes, there's a chance my code will break, and I may not know until someone tries to run this specific function again.
To differently balance these concerns, traditionally, we've leaned into some kind of schema: either generating D from some schema document, or the other way around, generating a schema document from D.
One approach to providing verification is to write a program - whether using CTFE or an external generator - that generates some D data types from an external data definition. For example, a long time ago, I wrote StructFromCreateTable, which reads a .sql file, assumed to match what the database was created with, and creates a D struct with a matching definition. You could also imagine this kind of thing coupling with queries in the D source code to provide result types and names, by parsing the SQL string in CTFE.
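To make that concrete, here is a hedged sketch of the input/output relationship; the names are illustrative, not StructFromCreateTable's actual output:

```d
// given a .sql file containing something like:
//   CREATE TABLE users (id INTEGER, name TEXT, spend DOUBLE PRECISION);
// a generator along these lines would emit a matching D struct:
struct Users {
    int id;
    string name;
    double spend;
}
```

The generated struct then gives the compiler something to check row accesses against.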
This approach also suffers from its own problems. Like the documentation to my old function warns: "It's not terribly smart, so it will probably break on complex tables." Since it is my own reimplementation of a parser, it differs from the real database in all kinds of ways. This is pretty much always the case; you can go pretty far, but even simple data definition languages are likely to have extensions from other vendors the D generator won't support.
Additionally, compile times tend to suffer from these generators, as they generate schemas plus conversion code, serializer bindings, etc., and you still don't necessarily catch every problem at compile time, because the external schema might fall out of sync with your program!
Finally, it can just be a lot of work to set this up, and not worth it for some quicker program.
There's one other way of differently balancing these concerns I see a lot: making D the authoritative definition of the data, and generating the external definitions from the D code.
With D code as the authoritative definition, you can use the D compiler, via its own reflection traits, to generate your external data definitions. This can compile faster and avoid the problem of your parser being incomplete.
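A minimal sketch of the idea, assuming a simple hand-written type mapping and using only std.traits; this is not any particular library's API:

```d
import std.array : join;
import std.stdio;
import std.traits : FieldNameTuple, Fields;

struct User {
    int id;
    string name;
    double spend;
}

// hypothetical type mapping; a real library covers far more cases
string sqlType(T)() {
    static if (is(T == int)) return "INTEGER";
    else static if (is(T == string)) return "TEXT";
    else static if (is(T == double)) return "DOUBLE PRECISION";
    else static assert(0, "no mapping for " ~ T.stringof);
}

// walk the struct's fields with compile time reflection
// and emit a matching CREATE TABLE statement
string createTableFor(T)() {
    string[] cols;
    static foreach (i, fieldName; FieldNameTuple!T)
        cols ~= fieldName ~ " " ~ sqlType!(Fields!T[i]);
    return "CREATE TABLE " ~ T.stringof ~ " (" ~ cols.join(", ") ~ ")";
}

void main() {
    writeln(createTableFor!User);
    // CREATE TABLE User (id INTEGER, name TEXT, spend DOUBLE PRECISION)
}
```

Since the D compiler does the parsing for us, there is no hand-written SQL parser to fall out of date.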
There's several libraries that do this kind of thing out there, including my own arsd.database_generation.
The problem with this is that you're still limited by the library's incompatibilities, and possibly in a worse way than the other direction! A library being unable to parse something doesn't necessarily prevent you from using the feature, but when the library provides the definition, its lack of support for a feature pretty much locks you out of it.
Additionally, each library may do things subtly differently. If you know standard SQL, that's a pretty transferable skill between projects. You can use it in D, or in Python, or in a command shell. You can use it with MySQL or Postgres or whatever. Sure, there are some differences, but most of your knowledge will apply.
If you know arsd.database_generation... good luck if the other guy uses hibernated. Gonna be lots of differences in naming, conventions, etc. They may not even build on the same concepts.
And, of course, if the external source decides to break your rules, you're still getting a runtime exception!
The biggest blocker to this though is that it may not be possible at all, since your D program isn't in any position to be an authority.
I keep saying "differently balancing these concerns" because I don't think any of these approaches is necessarily better or worse than the others. I've used all three before, but I certainly prefer keeping as close to the source material as we can and writing the smallest amount of binding code that gets the job done. So I keep going back to dynamic typing.
I'd like to test actual code against the actual external system, but not have to wait until the function itself is executed. One possibility is to run the preparation/test code at program startup, all up front, without changing anything. Is this possible?
Well, it depends on the particular external data source, but for the Postgres database, it is: you can try to prepare every statement you want, then use the "describe prepared" api to get parsed information out of it and compare it against your assumptions.
The problem with this is that it tends to be a pain to code. You need to gather all these prepared statements in one place, with some kind of struct definition (or something similar) to declare the result shape to D. You can at least bind to it from reflection-generated code, but you still need to write it out, and if you want it to run early, you need to bounce around the project to gather it all yourself. Reading the code suffers too: you see some named function, wonder what it does, and have to bounce back to its definition. And coming up with a name, ugh. Just annoying.
But, what if you could get that result... without doing it yourself in the code?
I just randomly drop something very much like the dynamic example anywhere I need it in the code:
string x = "remarkable feature";
db.query(i"INSERT INTO test (name) VALUES ($x)");
foreach(row; db.query(i"SELECT * FROM test where id < $(args.length) limit 4")) {
    if(row.id.isNull)
        continue;
    auto id = row.id.get!int;
    writeln(id, row.name);
}
And it magically knows to prepare statements?
Ever since opDispatch came out (and yes, that was a pretty long time ago!), I've had some thoughts kicking around my brain about using it to gather compile time data, but not actually process it until runtime. Indeed, I use that approach quite a bit in D: use compile time reflection to build data tables, then pass those data tables to functions to do something with at runtime.
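The core trick can be demonstrated in a tiny standalone program, a hedged sketch unrelated to the actual library code: opDispatch captures a name at compile time, and a static constructor inside the template records it in a global table before main() ever runs.

```d
import std.stdio;

__gshared string[] referencedNames; // filled by static constructors at startup

struct Gatherer {
    // each distinct member name instantiates this template once, and with
    // it a static constructor that records the name
    template opDispatch(string name) {
        @system shared static this() {
            referencedNames ~= name;
        }

        // the eponymous member is what the call site actually invokes
        int opDispatch() { return 0; }
    }
}

void main() {
    Gatherer g;
    g.foo; // compile time: registers "foo"; runtime: returns 0
    g.bar;
    writeln(referencedNames); // both names were recorded before main ran
}
```

This is the same "gather at compile time, process at runtime" pattern, just boiled down to its smallest form.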
In the past, I thought about using the opDispatch info to gather which columns I actually referenced, to auto-generate a replacement for the * in the query, like an optimized ORM. I never actually implemented that, but I knew it was possible by making opDispatch generate a shared static this. Also, when working on the design of interpolated expression sequences, I had this concept in mind and was careful to keep the string literal components available as part of the type, meaning you can reference them in a static constructor too.
Combining these features finally let me do this!
Let's take a look at that code example again, expanded in more context.
import arsd.postgres;

shared dbPool = new shared ConnectionPool!(() => new PostgreSql("dbname=me"));

void main() {
    auto db = dbPool.get();

    if(!dbPool.checkQueries(db.borrow)) {
        // our startup check failed, do not proceed
        // since we'd just get an exception anyway eventually
        return;
    }

    string x = "remarkable feature";
    db.query(i"INSERT INTO test (name) VALUES ($x)");

    foreach(row; db.query(i"SELECT * FROM test where id < $(args.length) limit 4")) {
        if(row.id.isNull)
            continue;
        auto id = row.id.get!int;
        writeln(id, " ", row.name);
    }
}
Notice the use of a connection pool and expression interpolation. These are both dual-use elements: they provide a useful function on their own, but also provide scaffolding for the library to do the query verification magic. Using regular string queries off the db object, without a pool, will not trigger the new functionality; that still works the same way as it always has, as a simple runtime call.
Let's go through the lines. The first thing is declaring a database connection pool. It must be declared shared at top level to be effective, and it takes a factory lambda. All the accumulated queries are associated with a connection pool. This gives you a centralized place to use that data, without being over-centralized - you can have multiple database connections in one application, and it tracks which query is associated with which database configuration.
Inside the main function, the first thing I did was get a database connection object out of the pool. The pool will reuse an existing one if available, and if not, construct one from its factory. The db object here must not be retained; it has a destructor that knows when to return the connection to the pool for future reuse. Even if you only need a single database object, the pool is easy to use anyway, and again, I needed some place to track queries, and a global shared object associated with connection configurations is just the thing.
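The borrow-and-return mechanic can be sketched with a destructor, as a hedged illustration with stand-in types, not the arsd code:

```d
import std.stdio;

// hypothetical stand-ins for a real connection type and pool
class Connection { }

struct Pool {
    Connection[] idle;

    PooledConnection get() {
        auto c = idle.length ? idle[$ - 1] : new Connection;
        if(idle.length)
            idle = idle[0 .. $ - 1];
        return PooledConnection(&this, c);
    }
}

struct PooledConnection {
    private Pool* pool;
    Connection conn;

    // going out of scope hands the connection back for reuse
    ~this() {
        if(pool !is null)
            pool.idle ~= conn;
    }

    @disable this(this); // copying would confuse ownership
}

void main() {
    Pool pool;
    {
        auto db = pool.get(); // borrows (or creates) a connection
    } // destructor runs here, returning it to the pool
    writeln(pool.idle.length); // 1
}
```

This is why the object must not be retained: once the destructor runs, someone else may be using that connection.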
The next line of code calls dbPool.checkQueries. This asks the object to use the information it gathered, against a live connection, to verify your assumptions. It does this by looping over dbPool.allQueries and examining the returned objects - the sql, the referenced names and types, and some file+line location information for printing error messages. It then tries to prepare the queries against the database, then calls the describe prepared statement functionality to verify them. My current implementation only works on PostgreSQL, but the same should be possible in principle in other implementations too. You also have access to this data and can do the checks yourself.
Note that you can call checkQueries whenever you like; it need not run only early in main(). You might do it in some kind of unittest, or behind a specific command line arg, or even on something like an internal health check web route. The data is there, the function is available, do with it what you please.
The rest of the code is fairly similar to the dynamic typing example I've used for years, going all the way back to PHP 4's mysql functions! The biggest difference is the use of interpolated sequences. Note that this is not string interpolation per se, though it looks like it at a casual glance. It doesn't attempt to convert the given value, like $x, to a string. It instead converts it to a placeholder in a prepared statement at compile time, then binds the actual value at runtime. This provides much more safety and correctness than strings, and it provides the capability for the compile-time gathering of queries - the database can verify the prepared statement without knowing any of the actual data it will bind later. Of course, it would also be possible to do this with hand-written placeholders and bind calls, similar to the dynamic calls, but if you can use the interpolation feature, it is kinda nice, isn't it? And besides, then the compiler is responsible for associating placeholders and data, removing one more source of mismatch mistakes.
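To make the compile-time side of this concrete, here is a hedged sketch of extracting placeholder SQL from an interpolated sequence. It relies on core.interpolation (so it needs a recent compiler); the function names are mine, not arsd's.

```d
import core.interpolation;
import std.conv : to;
import std.stdio;

// builds the SQL string with $1, $2... placeholders from the compile-time
// parts of the sequence; the runtime values are skipped, to be bound later
string sqlFromSequence(Args...)() {
    string sql;
    int n;
    static foreach (Arg; Args) {
        static if (is(Arg == InterpolatedLiteral!text, string text))
            sql ~= text;
        else static if (is(Arg == InterpolatedExpression!code, string code))
            sql ~= "$" ~ (++n).to!string;
        // the header, footer, and actual values contribute nothing here
    }
    return sql;
}

void demo(Args...)(Args args) {
    enum sql = sqlFromSequence!Args; // available at compile time
    writeln(sql);
}

void main() {
    int id = 4;
    string name = "stapler";
    demo(i"INSERT INTO test (id, name) VALUES ($id, $name)");
    // prints: INSERT INTO test (id, name) VALUES ($1, $2)
}
```

Because the string literal components are part of the type, this works in a static constructor just as well as in a normal function.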
One thing interpolation doesn't fix on its own is the INSERT statement: the column names end up far apart from the values that go with them. To resolve this, I made a little optional helper function, called VALUES. Take a look:
db.query(i"INSERT INTO test $(VALUES.id(id).name(name))");
Now, the column name and value are right next to each other again, and the string still basically looks like standard SQL. You can also pass sql!"..." as a value to put in some specific sql code for that column, so you aren't losing functionality.
The VALUES function uses opDispatch(T)(T value) to gather the names and values - again, names at compile time, values at runtime - and plug them into the query as standard SQL. From there, it works the same as if you wrote it out yourself.
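Here is a hedged sketch of that builder shape, not the arsd implementation; stringifying the values is only for illustration, since the real thing generates placeholders and binds the values:

```d
import std.array : join;
import std.conv : to;
import std.stdio;

// opDispatch captures each column name at compile time and the value at
// runtime, chaining calls so names and values stay adjacent
struct ValuesBuilder {
    string[] names;
    string[] vals;

    ValuesBuilder opDispatch(string name, T)(T value) {
        names ~= name;
        vals ~= value.to!string; // a real version would bind, not stringify
        return this;
    }

    string toSql() {
        return "(" ~ names.join(", ") ~ ") VALUES (" ~ vals.join(", ") ~ ")";
    }
}

void main() {
    auto v = ValuesBuilder().id(4).name("'stapler'");
    writeln("INSERT INTO test ", v.toSql());
    // INSERT INTO test (id, name) VALUES (4, 'stapler')
}
```

The chaining is what keeps each name visually next to its value at the call site.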
This is totally optional, but it is one example of the little conveniences we can add in the library.
Finally, the code has a pretty normal result loop. The query, when executed, returns a kind of result set object, which gives you access to the columns, via opDispatch, so it can gather the referenced names at compile time. Each result set object type is unique to the specific query used to generate it, allowing this information to be tracked and gathered reliably. The data itself is still a dynamic type, but the get method lets you declare your intention to use it as a particular static type. Again, this intention is tracked per connection factory, per query, per column name, gathered at compile time, all while feeling natural to the user.
It is important to realize that no matter where the query is in your codebase, it can all be checked centrally early after starting main!
All the following mistakes can be caught by the checkQueries function before running them:
// the sql itself is checked by the database
db.query(i"SELECT UPDATE FROM test");

// the database checks the names and tables and such
db.query(i"SELECT no_such_column FROM test");

// inserts are checked by the database
db.query(i"INSERT INTO test (id, name) VALUES ($mismatch_column_count)");

// the names you reference on the row are checked
foreach(row; db.query(i"SELECT id FROM test")) {
    row.name; // failure detected
    row.id; // fine
}

// the types you coerce get some checking as well
foreach(row; db.query(i"SELECT name FROM test")) {
    row.name.get!int; // nope, the db gave us a string
}

// you can get as complex as the database supports, it is not
// limited by what the library provides
What actually happens inside the library? Let me show you some of the implementation. Let's start from the pooled connection:
struct PooledConnection(ConnectionPoolType) {
    // snip

    template query(string file = __FILE__, size_t line = __LINE__, Args...) {
        enum asSql = sqlFromInterpolatedArgs!(Args);

        __gshared queryMetadata = new QueryMetadata!(asSql, file, line);

        @standalone @system shared static this() {
            ConnectionPoolType.registeredQueries_ ~= queryMetadata;
        }

        auto query(InterpolationHeader ihead, Args args, InterpolationFooter ifoot) {
            return new QueryResult!queryMetadata(dli.db.queryImpl(asSql, variantsFromInterpolatedArgs(args)));
        }
    }

    // snip
}
The line auto db = dbPool.get; references the above struct; db is of type PooledConnection!(...). This is really the entry point of the api: it has the query method, and you can see here that it does more than meets the eye: it is both a function and a static constructor, as well as declaring a global metadata object, unique to this particular query.
Notice how it extracts the sql from the interpolated string at compile time (the line enum asSql = ...), then declares a global variable holding the metadata. That global variable is initialized to a class object at compile time too, and then the standalone shared static this is a static constructor that appends this metadata object to the connection pool's list of queries.
$(ASIDE I remember when classes were first supported in CTFE. I was so excited to try to use dom.d to parse html at compile time. It worked! But it wasn't that useful. But still, the functionality of classes is nice to have. )
A static constructor is a function the compiler arranges to be called at some point after druntime initialization but before main(). The @standalone annotation (from core.attribute, or actually from arsd.core in the real code, since I polyfill it when it isn't present on old compilers) tells the compiler and druntime that it does not depend on any particular order. This helps avoid the cycle errors druntime can raise when trying to run constructors in the right order. That's OK: we don't care in which order the entries are added to the list, we just want them to be there. Further processing can wait until later.
Once we have an object in the list, we need it to be populated with details. This will happen later. To do this, and to correctly associate query results with the query object, the metadata needs to be passed through the type system all the way to the columns:
// this class represents the whole result set, tied back to the query metadata.
// the alias to the queryMetadata as a template param lets us use it
// and creates a new type for further compile time use at the same time.
class QueryResult(alias queryMetadata) : QueryResultBase {
    private ResultSet resultSet;

    // notice how this returns a row attached to the passed metadata...
    QueryResultRow!queryMetadata front() {
        return new QueryResultRow!queryMetadata(resultSet.front);
    }

    // snip the rest
}

// this class represents an individual row, again,
// associated to the query metadata, by the same technique
class QueryResultRow(alias queryMetadata) {
    // it is a wrapper around the dynamically typed Row
    // provided from the database driver
    Row row; // if you use this though, you bypass the checkers

    // this fetches an individual column, still passing that metadata down.
    // notice how it also captures the file and line of the usage,
    // which is useful for creating error messages
    AssociatedDatabaseDatum!(queryMetadata, name, file, line) opDispatch(string name, string file = __FILE__, size_t line = __LINE__)() if(name != "__dtor") {
        return typeof(return)(row[name]);
    }

    // snip the rest
}

// and this is the individual column. Again, same ideas at work:
// queryMetadata part of the type and available as a global object,
// and the column name, filename, and line number for future use.
struct AssociatedDatabaseDatum(alias queryMetadata, string name, string file, size_t line) {
    @standalone @system shared static this() {
        queryMetadata.registerName(name, file, line);
    }

    template get(T, string file = __FILE__, size_t line = __LINE__) {
        shared static this() {
            // snip some details
            queryMetadata.registerType(name, LimitedVariant(t), T.stringof, file, line);
        }

        T get() {
            import std.conv;
            return datum.toString().to!T;
        }
    }

    // again, it is a wrapper around the dynamically typed
    // result from the database driver
    DatabaseDatum datum;

    // snip the rest of the api
}
Let me talk a little more about that AssociatedDatabaseDatum, though you may have noticed the pattern by now. There are two more static constructors here. The first registers a used column name with the metadata object, and this is normal code: the queryMetadata is part of the type, yes, but it is also just an ordinary global variable, so we can call an ordinary method on it, passing the data from compile time into the runtime object.
Then, the get template uses this static constructor injection technique one more time to register the type used for this name, and it also passes a dynamically typed piece of sample data the checker can use too. This static constructor is created with compile time knowledge, but by the time it runs, it is back in the runtime world, so constructing that piece of sample data here is easier than trying to reconstruct it later.
Once the static constructors have run, since all references to the metadata objects are private, I can be reasonably confident they won't be modified again, and thus cast them to immutable, as done in the allQueries getter that checkQueries uses. It is possible for another object to be added to or removed from the list due to shared library loading/unloading, but since I know each individual object won't be changed again, this cast ought to be reasonable, and it gives some protection against strange behavior while still allowing the gathered data to be exposed to the user.
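That getter can be sketched like this; the names here are illustrative, not the exact arsd code:

```d
// stand-in for the real metadata class
class QueryMetadataBase { }

// the mutable list, appended to only by the static constructors at startup
__gshared QueryMetadataBase[] registeredQueries_;

// after startup, hand out an immutable view; the cast is only reasonable
// because nothing mutates the individual objects once main() is running
immutable(QueryMetadataBase)[] allQueries() {
    return cast(immutable(QueryMetadataBase)[]) registeredQueries_;
}
```

The immutability is a promise we enforce by convention (private references, no mutation after startup), which the cast then exposes to the type system.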
And with this, once main runs, we have all the compile-time data tables from all over the program gathered into a single object! At this point, verifying it against the database is a straightforward job of looping over the objects, inspecting them, and running PREPARE queries against the live database to have them type checked. Unlike trying to do this at compile time, we don't have to worry about collapsing it all down into a single expression. We can use all our ordinary runtime arrays, loops, exceptions, etc.
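A hedged sketch of doing that check by hand, using the kind of data allQueries exposes; the meta field names here are assumptions based on the description above, and db stands for any connection that can run plain SQL strings:

```d
import std.stdio;

bool checkByHand(Db, MetaList)(Db db, MetaList metas) {
    bool ok = true;
    foreach (meta; metas) {
        try {
            // PREPARE makes postgres parse and type-check the statement
            // without executing it; describe-style calls could then compare
            // its reported columns against the names gathered at compile time
            db.query("PREPARE check_stmt AS " ~ meta.sql);
            db.query("DEALLOCATE check_stmt");
        } catch (Exception e) {
            stderr.writeln(meta.file, ":", meta.line, ": ", e.msg);
            ok = false;
        }
    }
    return ok;
}
```

Since this is all ordinary runtime code, you can report every failure at once instead of stopping at the first one.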
I started using this in a work project and found it added about 30ms to the compile time, which is fairly significant for a project of its small size, so there is a concern that this will add up in a larger codebase, but I think it is worth it right now. This helped me find renamed columns without running a full test, just like we expect from compile time checks. I'm getting similar productivity and reliability gains to a full compile time schema at a fraction of the cost.
It is still fairly new, but I'm already very happy with it. While using the old functions would bypass the new checks, I'm still glad they're there - if I do hit a wall, I'm no worse off than I was before. The compiler will help me when it can, and get out of the way when it can't, so I'll never be blocked.
My implementation right now only works on Postgres databases, but it should be possible to do something similar with other drivers. What it comes down to is an equivalent of PQdescribePrepared for the others: a way to get metadata out of the live system without actually running the query. This isn't universal, but it isn't uncommon either. Some HTTP APIs and other network services also support a kind of "dry run" mode, so the same concepts can be applied there too.
I plan to continue improving the implementation, adding better error messages and perhaps more helpers like that VALUES thing. In particular, I'm keen on experimenting with building a column list from only the columns actually referenced, by checking the metadata when building the query, and with reusing the prepared statements instead of asking the database to reparse each query (I could put a tag in the query metadata object, but I'd have to initialize it per connection, or else track whether it has been lazily initialized yet, which complicates things, but it should be a solvable problem).
Finally, I'm also tempted to integrate it with some other naming conventions, like maybe allowing row.joined_table.col. I don't want to venture too far from standard SQL, but these things can sometimes be useful too.
I also, as always, have more work I want to do with the drivers and other libraries too! My next blog entry may be about my new arsd.shell module, or about using interpolation for html and urls too. Stay tuned.
If you want to play with the tech preview yourself, you can git fetch arsd master. Once I stabilize it a little, this will be included in the opend standard distribution.
It is interesting to remark too that while I did consider language changes to improve this over the last couple weeks, none actually proved necessary - D was already capable of implementing everything I envisioned as-is.
You've heard of static typing, checked at compile time, and dynamic typing, checked at runtime. I now have something in between: checking types against a live database at program startup using compile time info.