In the previous article, I focused on making history and storing executables in repository. In this article, I will focus on storing nested directories in repository by building nested trees as Merkle trees[5].

A Bit of Theory

Git stores nested directories as nested trees and each tree is a Merkle tree. From wikipedia definition of Merkle tree —

"a hash tree or Merkle tree is a tree in which every leaf node is labelled with the cryptographic hash of a data block, and every non-leaf node is labelled with the cryptographic hash of the labels of its child nodes. Hash trees allow efficient and secure verification of the contents of large data structures"

From Git perspective, what that means is — when we compare two trees of two commits, if two tree entries have same hash ID, we know their content are exactly the same and we can skip over them entirely. This is a huge performance win and also it reduces transfer for redundant data over network. This same data structure is also behind distributed consensus protocol used in Bitcon[6].

For this article, we will consider a directory structure as below —

None

If we commit this in Git and examine the commits and trees using git cat-file -p command, we can see nested trees are stored with mode 040000 —

None

Here e.txt file is stored inside a/b/c/d path and as we can see each of these directories is stored as a tree with permission 040000.

Focus of This Article

In this article, focus is to replicate this behaviour so that egit commit can —

  • store nested directories as nested trees
  • each tree is stored as Merkle tree

Elixir Code Walkthrough

The finished code for this article is available here — https://github.com/imeraj/elixir_git

The README file in the repository contains instructions to build egit and how to use git init and updated git commit command.

Store Nested Trees

To support building nested trees, I added a few helper method in Entry type which now looks as below —

Here —

  • line 14–17 — implements parent_dirs() function that given an entry returns all the parent directories. So, if entry.name contains "a/b/c/d/e.txt", the output will be like —

iex(3)> Entry.parent_dirs(entry) ["a", "a/b", "a/b/c", "a/b/c/d"]

  • line 40–55 — implements descend() function which works similar to Ruby's descend() method. Note: I could not find an implementation of this in Elixir's standard library, so I had to create one.
  • line 10 — adds new directory mode

Newly added Layout type is responsible for building the layout of nested trees and entries —

Here —

  • line 23–29 — implements the build() function that builds the entires layout of trees and entries. It first sort the top level entries given as parameter by name and then uses Enum.reduce to build the layout starting with empty tree.
  • line 8–21 — implements two clauses for add_entry() function once when parent is empty [] and the other when there is some parent (case for nested directories). With our example directory structure, a sample call would be —

add_entry(["bin"], Entry(name="bin/egit"))

add_entry([], Entry(name="hello.txt"))

add_entry([], Entry(name="world.txt"))

Entire output from a call to Layout.build() results in —

Note: I could store only the mode for each entry instead of entire stat information for some memory saving since we need only that. May be something to refactor in future.

Since now I store each entry as a hash ( "a" => %Egit.Types.Tree), I had to update Tree.to_s() function to address this. Also, since now I need to return the correct mode based on whether an entry is of type Entry or Tree, Tree.to_s() has been updated to address that as below —

None
Tree.to_s()

commit.ex has been updated as—

None
https://github.com/imeraj/elixir_git/blob/main/lib/commit.ex

It now builds the layout (line 31) and calls traverse() function to traverse the built layout recursively. At each step it calculates the content and oid for each parent and stores the trees in Database. traverse() function has two clauses and they look as below —

None

Here —

  • line 49–63 — implements the first clause when we are dealing with Tree. It traverses all the leaves before the root and updates parent trees by calling Tree.build_content() function on them. At each step, after building content it also saves the tree in Database using Database.store() function.
  • line 65 — implements the second clause which executes for Entry and just returns the entry

Tree.build_content() looks as below now (in previous article it used to be called build_tree() and resided in commit.ex) —

None

Taking Egit for a Spin

At this stage, commit command is capable of storing nested trees. If I commit the sample directories and examine the stored trees, we can see —

None

output from git show

None

Conclusion

In part 4 of this article series, I have updated egit commit command to store nested subdirectories as nested trees where each tree is essentially a Merkle tree. In the next article, I will focus on building index and implementing basic version of Git add command.

For more elaborate and in depth future technical posts please follow me here or on twitter.

References

  1. https://elixir-lang.org/getting-started/introduction.html
  2. https://elixir-lang.org/docs.html
  3. https://git-scm.com/docs
  4. https://github.com/imeraj/elixir_git
  5. https://en.wikipedia.org/wiki/Merkle_tree
  6. https://en.wikipedia.org/wiki/Bitcoin