A Technical Overview
Of The Design And Implementation
Of Fossil
1.0 Introduction
At its lowest level, a Fossil repository consists of an unordered set of immutable "artifacts". You might think of these artifacts as "files", since in many cases the artifacts exactly that. But other "control artifacts" are also included in the mix. These control artifacts define the relationships between artifacts - which files go together to form a particular version of the project, who checked in that version and when, what was the check-in comment, what wiki pages are included with the project, what are the edit histories of each wiki page, what bug reports or tickets are included, who contributed to the evolution of each ticket, and so forth. This low-level file format is called the "global state" of the repository, since this is the information that is synced to peer repositories using push and pull operations. The low-level file format is also called "enduring" since it is intended to last for many years. The details of the low-level, enduring, global file format are described separately.
This article is about how Fossil is currently implemented. Instead of dealing with vague abstractions of "enduring file formats" as the other document does, this article provides some detail on how Fossil actually stores information on disk.
2.0 Three Databases
Fossil stores state information in SQLite database files. SQLite keeps an entire relational database, including multiple tables and indices, in a single disk file. The SQLite library allows the database files to be efficiently queried and updated using the industry-standard SQL language. SQLite updates are atomic, so even in the event of a system crashes or power failure the repository content is protected.
Fossil uses three separate classes of SQLite databases:
- The configuration database
- Repository databases
- Checkout databases
The configuration database is a one-per-user database that holds global configuration information used by Fossil. There is one repository database per project. The repository database is the file that people are normally referring to when they say "a Fossil repository". The checkout database is found in the working checkout for a project and contains state information that is unique to that working checkout.
Fossil does not always use all three database files. The web interface, for example, typically only uses the repository database. And the fossil setting command only opens the configuration database when the --global option is used. But other commands use all three databases at once. For example, the fossil status command will first locate the checkout database, then use the checkout database to find the repository database, then open the configuration database. Whenever multiple databases are used at the same time, they are all opened on the same SQLite database connection using SQLite's ATTACH command.
The chart below provides a quick summary of how each of these database files are used by Fossil, with detailed discussion following.
Configuration Database
|
Repository Database
|
Checkout Database
|
2.1 The Configuration Database
The configuration database holds cross-repository preferences and a list of all repositories for a single user.
The fossil setting command can be used to specify various operating parameters and preferences for Fossil repositories. Settings can apply to a single repository, or they can apply globally to all repositories for a user. If both a global and a repository value exists for a setting, then the repository-specific value takes precedence. All of the settings have reasonable defaults, and so many users will never need to change them. But if changes to settings are desired, the configuration database provides a way to change settings for all repositories with a single command, rather than having to change the setting individually on each repository.
The configuration database also maintains a list of repositories. This list is used by the fossil all command in order to run various operations such as "sync" or "rebuild" on all repositories managed by a user.
On unix systems, the configuration database is named ".fossil" and is located in the user's home directory. On windows, the configuration database is named "_fossil" (using an underscore as the first character instead of a dot) and is located in the directory specified by the LOCALAPPDATA, APPDATA, or HOMEPATH environment variables, in that order.
2.2 Repository Databases
The repository database is the file that is commonly referred to as "the repository". This is because the repository database contains, among other things, the complete revision, ticket, and wiki history for a project. It is customary to name the repository database after then name of the project, with a ".fossil" suffix. For example, the repository database for the self-hosting Fossil repository is called "fossil.fossil" and the repository database for SQLite is called "sqlite.fossil".
2.2.1 Global Project State
The bulk of the repository database (typically 75 to 85%) consists of the artifacts that comprise the enduring, global, shared state of the project. The artifacts are stored as BLOBs, compressed using zlib compression and, where applicable, using delta compression. The combination of zlib and delta compression results in a considerable space savings. For the SQLite project, at the time of this writing, the total size of all artifacts is over 2.0 GB but thanks to the combined zlib and delta compression, that content only takes up 32 MB of space in the repository database, for a compression ratio of about 64:1. The average size of a content BLOB in the database is around 500 bytes.
Note that the zlib and delta compression is not an inherent part of the Fossil file format; it is just an optimization. The enduring file format for Fossil is the unordered set of artifacts. The compression techniques are just a detail of how the current implementation of Fossil happens to store these artifacts efficiently on disk.
All of the original uncompressed and undeltaed artifacts can be extracted from a Fossil repository database using the fossil deconstruct command. Individual artifacts can be extracted using the fossil artifact command. When accessing the repository database using raw SQL and the fossil sql command, the extension function "content()" with a single argument which is the SHA1 hash of an artifact will return the complete undeleted and uncompressed content of that artifact.
Going the other way, the fossil reconstruct command will scan a directory hierarchy and add all files found to a new repository database. The fossil import command works by reading the input git-fast-export stream and using it to construct corresponding artifacts which are then written into the repository database.
2.2.2 Project Metadata
The global project state information in the repository database is supplemented by computed metadata that makes querying the project state more efficient. Metadata includes information such as the following:
- The names for all files found in any checkin.
- All check-ins that modify a given file
- Parents and children of each checkin.
- Potential timeline rows.
- The names of all symbolic tags and the checkins they apply to.
- The names of all wiki pages and the artifacts that comprise each wiki page.
- Attachments and the wiki pages or tickets they apply to.
- Current content of each ticket.
- Cross-references between tickets, checkins, and wiki pages.
The metadata is held in various SQL tables in the repository database. The metadata is designed to facilitate queries for the various timelines and reports that Fossil generates. As the functionality of Fossil evolves, the schema for the metadata can and does change. But schema changes do no invalidate the repository. Remember that the metadata contains no new information - only information that has been extracted from the canonical artifacts and saved in a more useful form. Hence, when the metadata schema changes, the prior metadata can be discarded and the entire metadata corpus can be recomputed from the canonical artifacts. That is what the fossil rebuild command does.
2.2.3 Display And Processing Preferences
The repository database also holds information used to help format the display of web pages and configuration settings that override the global configuration settings for the specific repository. All of this information (and the user credentials and privileges too) is local to each repository database; it is not shared between repositories by fossil sync. That is because it is entirely reasonable that two different websites for the same project might have completely different display preferences and user communities. One instance of the project might be a fork of the other, for example, which pulls from the other but never pushes and extends the project in ways that the keepers of the other website disapprove of.
Display and processing information includes the following:
- The name and description of the project
- The CSS file, header, and footer used by all web pages
- The project logo image
- Fields of tickets that are considered "significant" and which are therefore collected from artifacts and made available for display
- Templates for screens to view, edit, and create tickets
- Ticket report formats and display preferences
- Local values for settings that override the global values defined in the per-user configuration database.
Though the display and processing preferences do not move between repository instances using fossil sync, this information can be shared between repositories using the fossil config push and fossil config pull commands. The display and processing information is also copied into new repositories when they are created using fossil clone.
2.2.4 User Credentials And Privileges
Just because two development teams are collaborating on a project and allow push and/or pull between their repositories does not mean that they trust each other enough to share passwords and access privileges. Hence the names and emails and passwords and privileges of users are considered private information that is kept locally in each repository.
Each repository database has a table holding the username, privileges, and login credentials for users authorized to interact with that particular database. In addition, there is a table named "concealed" that maps the SHA1 hash of each users email address back into their true email address. The concealed table allows just the SHA1 hash of email addresses to be stored in tickets, and thus prevents actual email addresses from falling into the hands of spammers who happen to clone the repository.
The content of the user and concealed tables can be pushed and pulled using the fossil config push and fossil config pull commands with the "user" and "email" as the AREA argument, but only if you have administrative privileges on the remote repository.
2.2.5 Shunned Artifact List
The set of canonical artifacts for a project - the global state for the project - is intended to be an append-only database. In other words, new artifacts can be added but artifacts can never be removed. But it sometimes happens that inappropriate content is mistakenly or maliciously added to a repository. The only way to get rid of the undesired content is to "shun" it. The "shun" table in the repository database records the SHA1 hash of all shunned artifacts.
The shun table can be pushed or pulled using the fossil config command with the "shun" AREA argument. The shun table is also copied during a clone.
2.3 Checkout Databases
Unlike several other popular DVCSes, Fossil allows a single repository to have multiple working checkouts. Each working checkout has a single database in its root directory that records the state of that checkout. The checkout database is named "_FOSSIL_" by default, but can be renamed to ".fslckout" if desired. (Future versions of Fossil might make ".fslckout" the default name.) The checkout database records information such as the following:
- The name of the repository database file.
- The version that is currently checked out.
- Files that have been added, removed, or renamed but not yet committed.
- The mtime and size of files as they were originally checked out, in order to expedite checking which files have been edited.
- Other checkins that have been merged into the working checkout but not yet committed.
- Copies of files prior to the most recent undoable operation - needed to implement the undo and redo commands.
- The stash.
- State information for the bisect command.
For Fossil commands that run from within a working checkout, the first thing that happens is that Fossil locates the checkout database. Fossil first looks in the current directory. If not found there, it looks in the parent directory. If not found there, the parent of the parent. And so forth until either the checkout database is found or the search reaches the root of the filesystem. (In the latter case, Fossil returns an error, of course.) Once the checkout database is located, it is used to locate the repository database.
Notice that the checkout database contains a pointer to the repository database but that the repository database has no record of the checkout databases. That means that a working checkout directory tree can be freely renamed or copied or deleted without consequence. But the repository database file, on the other hand, has to stay in the same place with the same name or else the open checkout databases will not be able to find it.
A checkout database is created by the fossil open command. A checkout database is deleted by fossil close. The fossil close command really isn't needed; one can accomplish the same thing simply by deleting the checkout database.
Note that the stash, the undo stack, and the state of the bisect command are all contained within the checkout database. That means that the fossil close command will delete all stash content, the undo stack, and the bisect state. The close command is not undoable. Use it with care.