Thoughts On The Design Of The Fossil DVCS
Two questions (or criticisms) that arise frequently regarding Fossil can be summarized as follows:
- Why is Fossil based on SQLite instead of a distributed NoSQL database?
- Why is Fossil written in C instead of a modern high-level language?
Neither question can be answered directly because they are both based on false assumptions. We claim that Fossil is not based on SQLite at all and that Fossil is not based on a distributed NoSQL database because Fossil is a distributed NoSQL database. And, Fossil does use a modern high-level language for its implementation, namely SQL.
Fossil Is A NoSQL Database
We begin with the first question: Fossil is not based on a distributed NoSQL database because Fossil is a distributed NoSQL database. Fossil is not based on SQLite. The current implementation of Fossil uses SQLite as a local store for the content of the distributed database and as a cache for meta-information about the distributed database that is precomputed for quick and easy presentation. But the use of SQLite in this role is an implementation detail and is not fundamental to the design. Some future version of Fossil might do away with SQLite and substitute a pile-of-files or a key/value database in place of SQLite. (Actually, that is very unlikely to happen since SQLite works amazingly well in its current role, but the point is that omitting SQLite from Fossil is a theoretical possibility.)
The underlying database that Fossil implements has nothing to do with SQLite, or SQL, or even relational database theory. The underlying database is very simple: it is an unordered collection of "artifacts". An artifact is a list of bytes - a "file" in the usual manner of thinking. Many artifacts are simply the content of source files that have been checked into the Fossil repository. Call these "content artifacts". Other artifacts, known as "control artifacts", contain ASCII text in a particular format that defines relationships between other artifacts, such as which content artifacts that go together to form a particular version of the project. Each artifact is named by its SHA1 hash and is thus immutable. Artifacts can be added to the database but not removed (if we ignore the exceptional case of shunning.) Repositories synchronize by computing the union of their artifact sets. SQL and relation theory play no role in any of this.
SQL enters the picture only in the implementation details. The current implementation of Fossil stores each artifact as a BLOB in an SQLite database. The current implementation also parses up each control artifact as it arrives and stores the information discovered from that parse in various other SQLite tables to facilitate rapid generation of reports such as timelines, file histories, file lists, branch lists, and so forth. Note that all of this additional information is derived from the artifacts. The artifacts are canonical. The relational tables serve only as a cache. Everything in the relational tables can be recomputed from the artifacts, and in fact that is exactly what happens when one runs the "fossil rebuild" command on a repository.
So really, Fossil works with two separate databases. There is the bag-of-artifacts database which is non-relational and distributed (like a NoSQL database) and there is the local relational database. The bag-of-artifacts database has a fixed format and is what defines a Fossil repository. Fossil will never modify the file format of the bag-of-artifacts database in an incompatible way because to do so would be to make something that is no longer "Fossil". The local relational database, on the other hand, is a cache that contains information derived from the bag-of-artifacts. The schema of the local relational database changes from time to time as the Fossil implementation is enhanced, and the content is recomputed from the unchanging bag of artifacts. The local relational database is an implementation detail which currently happens to use SQLite.
Another way to think of the relational tables in a Fossil repository is as an index for the artifacts. Without the relational tables, to generate a report like a timeline would require scanning every artifact - the equivalent of a full table scan. The relational tables hold pointers the relevant artifacts in presorted order so that generating a timeline is much more efficient. So like an index in a relational database, the relational tables in an Fossil repository do not add any new information, they merely make the information in the artifacts faster and easier to look up.
Fossil is not "based" on SQLite. Fossil simply exploits SQLite as a powerful tool to make the implementation easier. And Fossil doesn't use a distributed NoSQL database because Fossil is a distributed NoSQL database. That answers the first question.
SQL Is A High-Level Scripting Language
The second concern states that Fossil does not use a high-level scripting language. But that is not true. Fossil uses SQL (as implemented by SQLite) as its scripting language.
This misunderstanding likely arises because people fail to appreciate that SQL is a programming language. People are taught that SQL is a "query language" as if that were somehow different from a "programming language". But they really are two different favors of the same thing. I find that people do better with SQL if they think of SQL as a programming language and each statement of SQL is a separate program. SQL is a peculiar programming language in that one uses SQL to specify what to compute whereas in most other programming languages one specifies how to carry out the computation. This difference means that SQL is an extraordinary high-level programming language, but it is still just a programming language.
For certain types of problems, SQL has a huge advantage over other programming languages because it is so high level and because it allows programmers to focus more on the what and less of the how of a computation. In other words, programmers tend to think about problems at a much higher level when using SQL; this can result in better applications. SQL is also very dense. In practice, this often means that a few lines of SQL can often replace hundreds or thousands of lines of procedural code, with a corresponding decrease in programming effort and opportunities to introduce bugs. Fossil happens to be one of those problems for which SQL is well suited.
Much of the "heavy lifting" within the Fossil implementation is carried out using SQL statements. It is true that these SQL statements are glued together with C code, but it turns out that C works surprisingly well in that role. Several early prototypes of Fossil were written in a scripting language (TCL). We normally find that TCL programs are shorter than the equivalent C code by a factor of 10 or more. But in the case of Fossil, the use of TCL was actually making the code longer and more difficult to understand. And so in the final design, we switched from TCL to C in order to make the code easier to implement and debug.
Without the advantages of having SQLite built in, the design might well have followed a different path. Most reports generated by Fossil involve a complex set of queries against the relational tables of the repository database. These queries are normally implemented in only a few dozen lines of SQL code. But if those queries had been implemented procedurally using a key/value or pile-of-files database, it may have well been the case that a high-level scripting language such as Tcl, Python, or Ruby may have worked out better than C.