
Taking Datahike for a Spin

Dec 27, 2018

What is Datahike?

Datahike can be described as a Datomic lite. It uses most of the excellent Datascript, ports it to the JVM, and persists the datoms to the local disk. It has very modest requirements and should run even on very small EC2 instances.

Why Datahike?

I am working on a small web app with a seemingly very simple data model that turned out to be very hard to model in SQL due to a number of many-to-many relationships. I trust Rich Hickey’s choices deeply, so I had been curious about Datomic for years. Thinking about my modelling problems, it seemed like Datomic could save me from the horror of all the join tables I would have to write. And, wouldn’t you know it, one of the videos in the Day of Datomic video series described exactly what I wanted to do! Unfortunately, Datomic always seemed too complex and expensive for my tiny, personal apps, even with the new $1/day Datomic Cloud Solo Topology. Also, the lack of alternatives to port my apps to (if I ever had to, for some reason) always worried me. Datahike helps with both of these issues, so I decided to give it a try over the Christmas holidays.
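What drew me in, concretely: where SQL needs a join table for every many-to-many relationship, Datomic (and Datahike) can express it as a single cardinality-many attribute. A minimal sketch in Datomic’s schema format, with a hypothetical attribute:

;; one attribute replaces an entire posts_tags join table:
;; a post simply points at many tag entities
[{:db/ident       :post/tags
  :db/valueType   :db.type/ref
  :db/cardinality :db.cardinality/many}]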

What is this blog post?

When I started out with Datahike, I tried to find experience reports from others who had used it. I didn’t find any, so this is what I would have liked to read when considering and starting out with Datahike. This post applies to Datahike 0.1.2, which is current as of 2018-12. It describes the problems with Datahike (and touches on the goodness of it), how I ended up with Datomic proper, and what I needed to do to migrate.

The problems with Datahike

Documentation

Datahike is not very easy to get started with. There is a sample project at https://gitlab.com/replikativ/datahike-invoice/tree/master and a few tests in https://github.com/replikativ/datahike/blob/master/test/datahike/test/api.cljc, but this is barely enough to get you up and running. You will have to discover most of the functionality by reading Datahike GitHub issues and the Datomic and Datascript documentation, and from then on it’s trial and error. Unfortunately, there are subtle incompatibilities between Datahike and both Datascript and Datomic[1], and you won’t know about them until you run into them.

Error messages

Datahike seems to do barely any input validation. Because of this, there aren’t helpful error messages to guide you to valid input (Datomic mostly does a great job here, somewhat to my surprise!), and you can expect to see a bunch of ClassCastExceptions and NullPointerExceptions[2]. This is especially painful because some corners of Datahike seem a bit rough[3], and I can’t tell if I am doing anything stupid or if these are actual bugs. One such instance is https://github.com/replikativ/datahike/issues/15.

Schema migrations

Datahike does not support schema migrations. The way to migrate a schema is to export the database, create a new database with the new schema, and import the exported data into it. This seemed manageable for my tiny app (once I learned about it!) but turned out to be very annoying, even during development. Some schema changes were incompatible (going from :db.cardinality/one to :db.cardinality/many, in my case), which led to very frustrating crashes (as per the previous section) and me having to edit the export file by hand.
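For illustration, this is the kind of change that bit me, sketched with a hypothetical attribute in Datahike’s schema format:

;; before: a submission has exactly one tag
{:submission/tag {:db/type        :db.type/string
                  :db/cardinality :db.cardinality/one}}

;; after: a submission can have many tags
{:submission/tag {:db/type        :db.type/string
                  :db/cardinality :db.cardinality/many}}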

The author is busy

Christian Weilbach, who seems to be the lead architect of Datahike, has been nothing but helpful. Unfortunately, he is also obviously very busy, and so issues like https://github.com/replikativ/hitchhiker-tree/pull/1 are still languishing and impacting my productivity. This is in no way Christian’s fault, but with no other community to speak of, Christian is my only way of getting help and support, and that makes his busyness a major issue for me.

Moving to Datomic Free

The problems I just mentioned proved too much for me, and after two weeks I decided to re-evaluate whether this was really worth the pain. As I said initially, my main issues with Datomic were the price, the complexity (for small applications), and the lack of compatible alternatives. Datahike may not be quite there yet, but I hope it will be a viable option soon, and it did help me a lot with my painful SQL queries. So much so that I couldn’t bear the thought of replacing Datahike with one of the popular SQL or NoSQL databases. I then looked again into “real” Datomic and found that, while not really advertised and somewhat hidden away on the homepage, Datomic Free seems to address my other two issues with Datomic! Datomic Free is, well, free[4], and seems more than sufficient for my use case. It saves the data locally to the transactor’s disk in an H2 database, which is just as good as what Datahike can provide. The transactor can run on the same instance as the application, which further reduces cost and complexity. And if one of my little apps ever becomes popular, I have the peace of mind of knowing that I can easily migrate to one of the production-grade Datomic plans!
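To make this concrete, here is roughly what the single-box setup looks like from the peer side (database name hypothetical):

(require '[datomic.api :as d])

;; the "free" protocol talks to a transactor on localhost, which in turn
;; persists everything to H2 files on its local disk
(def uri "datomic:free://localhost:4334/my-app")

(d/create-database uri)
(def conn (d/connect uri))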

Compatibility and Migration

Moving my app to Datomic was surprisingly painless. This is what I had to do to make it happen:

Change the schema format

Datahike (and Datascript) supports a schema like this:

{:user/name {:db/type        :db.type/string
             :db/cardinality :db.cardinality/one}}

whereas (new versions of) Datomic need at least this:

[{:db/ident       :user/name,
  :db/valueType   :db.type/string,
  :db/cardinality :db.cardinality/one}]

Thankfully, a Datomic schema is data, so I could automate this process (take that, SQL!):

(def newschema
  ;; hoist each attribute key into its map as :db/ident,
  ;; then rename Datahike's :db/type to Datomic's :db/valueType
  (->> (for [[k v] schema]
         (assoc v :db/ident k))
       (map #(clojure.set/rename-keys % {:db/type :db/valueType}))))

Then I made sure that every attribute has at least the required keys:

;; should return true: every attribute map has all three required keys
(every? #(every? % [:db/ident :db/cardinality :db/valueType]) newschema)

Now the data was correct, but it was hard to read, so I sorted the individual schema maps to have :db/ident on top, and have similar entity attributes grouped together:

(->> newschema
     (map (fn [m]
            (into (sorted-map-by
                    (fn [k1 k2]
                      ;; shorter key names sort first, so :db/ident lands on top
                      (compare [(-> k1 str count) k1]
                               [(-> k2 str count) k2])))
                  m)))
     (sort-by :db/ident)
     vec)

That gave me the new Datomic-compatible and easy-to-read schema.
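With a connection like the one sketched earlier, installing the converted schema is then a single transaction:

;; the peer API returns a future, so deref to wait for the result
@(d/transact conn newschema)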

Change the database functions

I was using clojure.string/lower-case in a few places as a database function. This works in Datahike, but it doesn’t in Datomic because the clojure.string namespace isn’t imported there. Java instance methods work, though, so I could fall back to (.toLowerCase the-string).
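For illustration, here is the shape of query I had to change, reusing the :user/name attribute from the schema example above:

;; this clause worked on Datahike but not on Datomic for me:
;;   [(clojure.string/lower-case ?n) ?lower]
;; calling the Java method directly works on both:
(d/q '[:find ?e
       :where
       [?e :user/name ?n]
       [(.toLowerCase ?n) ?lower]
       [(= ?lower "alice")]]
     (d/db conn))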

Fix types

Datahike doesn’t actually enforce the schema, whereas Datomic does. When you tell Datomic that something is a UUID, it won’t just accept any old string in its place. I had also defined a :submission/submitter attribute but then ended up using :submission/author in the code, which Datahike happily went along with. Overall, you want to validate your data at the edges, and I’m glad that Datomic forced me to do this and pointed out my mistakes.
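Sketched with a hypothetical UUID attribute, this is the kind of mistake Datomic caught:

;; Datahike, not enforcing the schema, happily transacted a plain string
;; where the schema declared :db.type/uuid:
(d/transact conn [{:submission/id "not-actually-a-uuid"}])

;; Datomic rejects that and insists on a real java.util.UUID:
(d/transact conn [{:submission/id (java.util.UUID/randomUUID)}])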

A final word

I really hope Datahike addresses these issues, because there is a lot of potential. And without Datahike, I wouldn’t have given Datomic an honest try either, so I’m glad I went with it. Now let’s see how Datomic ends up working out…


  1. The Datahike schema format differs from the Datomic one; the details are in the “Change the schema format” section. The main difference to Datascript is that you install the schema upon creating the database, not when connecting to it. There are a few more things, like the missing validations, that are discussed in later sections.

  2. Did you know that the JVM will sometimes just drop the stack trace on NullPointerExceptions? I didn’t, and this was very frustrating until I read this: https://dzone.com/articles/clojurejava-prevent-exceptions

  3. I’ve run into a bunch of issues with :db/isComponent true attributes, and I still don’t know why Datahike sometimes forces me to use :db/type and sometimes :db/valueType.

  4. From what I can tell - and my license legalese isn’t very strong - it is really completely free for small use cases, both commercial/closed source and free/open source, as long as one is using the disk-storage backend and no more than two peers. Somebody please correct me if I am wrong!