Automatically Simple Since 2002
19 July 2013
How to migrate from ClearCase to Git in a large development organization, while preserving history, including labels, branches, authors, timestamps, and ticketing integration. In short, making the Git history look native.
You may also enjoy my article on Dual Syncing ClearCase and Git.
ClearCase was first released in 1992. 20 years later, it has really started to show its age. I was recently reading about performance tuning and came across this gem:
Client workstations supporting a single user should have a minimum of 10–15 MIPS processing power, 16Mb of main memory, and 300Mb of disk storage. An additional 8–16Mb of main memory will further improve performance.
My virtual machine with 80GB memory and 8 CPUs is clearly overpowered, and throwing additional hardware at ClearCase has diminishing returns (except perhaps block I/O performance). Migrating from such a legacy system to something new and shiny is not trivial.
Git was written in 2005 and, with a little help from GitHub, has become the 800-pound gorilla in version control.
There is something to be said for the cyclical nature of development tools in general and version control tools in particular. It is healthy and pragmatic to plan on changing version control tools every 10 years. Plan on it. I fully intend to be doing conversions from Git into something else starting in about 2020. It will not surprise me if the pressure to move out of Git starts several years earlier than that.
The trick is to stay in the majority of the bell curve for tool usage. If you fall too far outside it, you fall into “legacy” territory and it becomes even harder to convert.
In this case, converting from ClearCase to Git in 2013 is not a common path. Too few people use ClearCase, and the market for converting to a tool as new as Git is small. This forced our hand into writing a lot of custom code to achieve a desired result.
There are commercial tools available. However, even they recommend only importing the history separately because it is too messy to be useful.
The ClearCase code base (VOB) in question consists of ~5100 labels (“tags” in git parlance), ~110 branches, and 10 years of development history.
The desire is to retain all of that history and transition smoothly from ClearCase into Git without compromising development efforts. This rules out generic conversion tools as well as migration strategies that rely on eliminating or altering history.
What follows is a brain-dump that would have saved me months of time if I had only known before starting. The short list is:
Understanding the ClearCase data model and the way it is used locally is the first step. In our case, we use base ClearCase. I understand that UCM may provide additional metadata that would be useful, but I have never worked with it.
The two trickiest aspects of this ClearCase environment are determining branch points, and getting the performance of ClearCase to an adequate level to support importing the history in a reasonable time frame.
Git is not without its own caveats. First, Git’s snapshot-based data model requires that history get imported correctly the first time. There is no tuning or tweaking once the repository starts being used by developers (without causing significant pains to their development environments).
One other small issue is symlinks. ClearCase makes symlinks work cross platform but Git does not support symlinks in Windows at this time. In the interest of time and backward compatibility, we decided to duplicate any symlinked files and document them so they can be resolved in a cross-platform way going forward.
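As a rough illustration of the "duplicate and document" approach, here is a minimal Python sketch. The function name and the decision to handle only links to regular files are my assumptions for this example; it is not the script used in the migration.

```python
import os
import shutil


def materialize_symlinks(root):
    """Replace symlinks to regular files under root with copies of their targets.

    Returns a list of (relative_path, original_link_target) pairs so the
    duplicated files can be documented and resolved properly later.
    """
    replaced = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Never descend into the Git metadata directory.
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            target = os.readlink(path)          # what the link said
            resolved = os.path.realpath(path)   # where it actually points
            if not os.path.isfile(resolved):
                continue  # dangling link, or not a regular file: leave it alone
            os.unlink(path)
            shutil.copy2(resolved, path)
            replaced.append((os.path.relpath(path, root), target))
    return replaced
```

The returned list is the "documentation" half: it can be written to a file in the repository so the duplicates are easy to find again.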
This ClearCase environment has multiple potentially conflicting sources of data:
Each of these sources of information provides a distinct perspective which may not always be 100% consistent with the other perspectives. For example, a file could be modified in ClearCase without an associated record in ClearQuest. Likewise, conceivably, a record could exist in ClearQuest without the corresponding change in ClearCase. Additionally, branching methodology can vary during 10 years of history and file elements can occasionally be inconsistently included or excluded from branches.
For these reasons and others, most existing import tools will come with caveats such as not guaranteeing labels, not creating labels, not discovering files that were deleted in the course of a branch, and others.
In short, ClearCase history lacks transactional integrity and therefore will invariably be inconsistent. This environment also does not use atomic commits, though ClearCase does support them.
However, it does have labels. After every build since 2003, the code base was consistently labeled, creating point-in-time snapshots that could be used to recreate any build. These labels, therefore, are the uncompromising cornerstone of the Git migration strategy. They must be preserved accurately and the rest of the history can be filled in from these points in time.
Performance, or lack thereof, quickly becomes the focus of migrating 10 years of history. With 5100 labels, every minute per label of runtime translates into about 3.5 days. Some of our initial strategies cost 15-20 minutes per label, resulting in months of runtime.
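The arithmetic behind those numbers is worth keeping at hand while tuning. A small sketch (the function name is just for illustration):

```python
LABELS = 5100  # approximate number of labels in the VOB


def runtime_days(minutes_per_label, labels=LABELS):
    """Total import time in days for a given per-label cost."""
    return labels * minutes_per_label / (60 * 24)


for minutes in (1, 15, 20):
    print(f"{minutes:>2} min/label -> {runtime_days(minutes):5.1f} days")
```

At 15 to 20 minutes per label this works out to roughly 53 to 71 days of wall-clock time, which is why shaving even seconds off the per-label cost matters.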
Dynamic views, a unique feature to ClearCase, can be set to an arbitrary label very quickly (sub second). Transferring data out of a dynamic view, however, can be costly. Traversing the view (by doing an rsync -a without any actual changes) can take over 5 minutes when running against the production ClearCase environment over the network.
The transfer time can be decreased to about 1 minute per traversal by installing the VOB locally/offline on a RAID 0 array of solid state disks. Further, using a ClearCase client in Linux gives us another large speed boost, dropping the rsync time to less than a minute per label.
Linux is also important for case sensitivity. It is hard to deal with case-only renames on a case-insensitive filesystem when using rsync and git.
Even so, some ClearCase operations are simply too slow: anything involving cleartool find or cleartool lshistory. However, cleartool describe is speedy.
Dynamic views, the list of labels, rsync, and cleartool describe provide all of the primitives we need to extract all history.
The main loop (not considering edge cases) looks like this:
1. Copy the dynamic view for the current label into the Git working tree (rsync -a --delete --exclude=.git/ --exclude=lost+found).
2. List the files that changed (git status --porcelain --untracked-files).
3. For each changed file, run cleartool describe to find the most recent change, using a custom format for the fields of interest.
4. Read the predecessor-version and parse it too, recursively, until the beginning of the file element is reached, or another labeled element. Once we hit a previous label, we know we can stop. Alternatively/additionally, stop if you hit a ClearCase branch point.
5. Using cleartool get on particular element versions, build the git commits one by one and commit them (see next section).
I also tried git stash in place of rsync. It is faster, but I was hitting some bugs and lost trust in it, so I fell back to rsync.
One alternative strategy, instead of using rsync, is to make a Git repository out of a ClearCase dynamic view directly. You could modify the config spec to generate the contents of each git commit, instead of using cleartool get. I am not sure it is any faster: in testing, a git status took about the same time as an rsync. Traversing the files in a dynamic view appears to be the big, unavoidable bottleneck.
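To make the shape of the main loop more concrete, here is a hedged Python sketch of two of its pieces: listing the changed files after the rsync, and walking predecessor versions until a previously labeled version is reached. The function names and the dictionary layout returned by `describe` are assumptions for this example. In real use `describe` would wrap cleartool describe with a custom format string; here it is passed in as a parameter so the walk can be exercised without a VOB.

```python
import subprocess


def sync_and_list_changes(view_dir, repo_dir):
    """Mirror the dynamic view into the Git working tree, then list what changed."""
    subprocess.run(
        ["rsync", "-a", "--delete", "--exclude=.git/", "--exclude=lost+found",
         view_dir.rstrip("/") + "/", repo_dir],
        check=True,
    )
    status = subprocess.run(
        ["git", "status", "--porcelain", "--untracked-files"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    return changed_paths(status)


def changed_paths(porcelain_output):
    """Extract paths from `git status --porcelain` output (two status chars, a space, the path).

    Assumes plain paths: no renames and no quoting of unusual characters.
    """
    return [line[3:] for line in porcelain_output.splitlines() if len(line) > 3]


def walk_history(version, describe, stop_versions):
    """Follow predecessor links from `version`, oldest change first.

    `describe(version)` must return a dict with at least a "predecessor" key
    (None at the start of the element). The walk stops at the beginning of the
    element or at any version in `stop_versions` (for example, versions that
    carried the previously imported label).
    """
    history = []
    while version is not None and version not in stop_versions:
        info = describe(version)
        history.append(info)
        version = info["predecessor"]
    return list(reversed(history))
```

Keeping the ClearCase call behind a plain function argument also makes it easy to cache its results, which helps when the import has to be restarted.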
ClearCase, without UCM and without atomic commits, does not group commits in any way. Each file change is a distinct commit. This would be far too granular to bring into Git (though it would be accurate), so it is prudent to combine the ClearCase changes into logical Git commits.
My initial approach was to sort the ClearCase commits sequentially by time, and combine them with adjacent commits if the author and commit message matched.
This approach fell short. First, in this environment developers will often provide different commit messages for each file, even though it is really the same logical commit. Second, frequently two developers will commit a large number of files at the same time. When this happens, their commits are no longer “adjacent” in time and so they get broken up.
The final approach is to combine commits regardless of time, as long as there are no conflicts detected. Commits are combined if the author matches and the ClearQuest record associated with the commit matches.
In pseudocode:
    closed_groups = []
    open_groups = []
    for each change:
        # Loop through open groups to ensure the current file is not in
        # any existing group. If it is, close the group and make a new
        # open group for the current change.
        # Loop through open groups to find a match (user, ClearQuest
        # record match)
        # If no open group matches, create a new open group with the
        # current change
    return closed_groups + open_groups
It is important to still consider time - the easiest approach is to keep the changes sorted by time and the groups sorted by time, starting with the earliest changes.
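For readers who prefer something executable, here is one way the pseudocode above could be written in Python. It is a sketch under assumptions: the `Change` and `Group` types and their field names are invented for this example, and each change is assumed to carry a single ClearQuest record id.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    path: str
    author: str
    record: str  # ClearQuest record id (hypothetical field name)
    time: int    # any sortable timestamp


@dataclass(eq=False)  # identity comparison, so list.remove() is unambiguous
class Group:
    author: str
    record: str
    changes: list = field(default_factory=list)

    def paths(self):
        return {c.path for c in self.changes}


def group_changes(changes):
    """Combine per-file changes into logical commits, oldest first."""
    closed_groups, open_groups = [], []
    for change in sorted(changes, key=lambda c: c.time):
        # A file already present in an open group is a conflict: close that group.
        conflicting = [g for g in open_groups if change.path in g.paths()]
        for group in conflicting:
            open_groups.remove(group)
            closed_groups.append(group)
        match = None
        if not conflicting:
            # Otherwise look for an open group with the same user and record.
            match = next((g for g in open_groups
                          if g.author == change.author and g.record == change.record),
                         None)
        if match is None:
            match = Group(change.author, change.record)
            open_groups.append(match)
        match.changes.append(change)
    groups = closed_groups + open_groups
    groups.sort(key=lambda g: g.changes[0].time)  # keep groups ordered by time
    return groups
```

Ordering each group by its earliest change is one choice among several. Using the latest change instead would place a commit at the point where the logical change was finished.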
The description of each Git commit should pull in data from as many sources as possible.
For example, it is imperative to preserve the original author, commit time, ClearCase comment, and ClearQuest associations.
In this case, it is also nice to mine ClearQuest for additional data that can be pulled into Git at this time, creating a rich history in Git going forward.
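As an illustration of how that metadata can be carried into a commit, here is a small sketch that builds a git commit invocation. The `--author` and `--date` options and the `GIT_COMMITTER_DATE` environment variable are standard Git. The function name, the "ClearQuest:" trailer format, and the mapping from a ClearCase username to a name and e-mail address are assumptions made for this example.

```python
import os


def build_commit(author_name, author_email, when, comments, cq_records):
    """Return (argv, env) for a git commit that preserves the original metadata.

    `when` is the original ClearCase timestamp in a format Git accepts,
    e.g. ISO 8601. Duplicate per-file comments are collapsed, keeping order.
    """
    unique = list(dict.fromkeys(c.strip() for c in comments if c.strip()))
    parts = [unique[0] if unique else "ClearCase import"]
    if len(unique) > 1:
        parts.append("\n".join(unique[1:]))
    if cq_records:
        parts.append("\n".join(f"ClearQuest: {r}" for r in cq_records))
    message = "\n\n".join(parts)
    argv = [
        "git", "commit", "--quiet",
        f"--author={author_name} <{author_email}>",
        f"--date={when}",
        "-m", message,
    ]
    # --date only sets the author date; the committer date comes from the environment.
    env = dict(os.environ, GIT_COMMITTER_DATE=when)
    return argv, env
```

The result can be passed to `subprocess.run(argv, env=env, cwd=repo_dir, check=True)` once the files for the commit have been staged.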
First, validate the label names and potentially correct inconsistencies.
Second, find and ignore incomplete labels. For each label, verify the number of files in the label is consistent with adjacent labels. Investigate the labels that are outliers.
Incomplete labels will end up looking like file deletes and adds in Git, which will break things like git blame.
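One simple way to surface candidates for investigation is to compare each label's file count against the median of its neighbors. The sketch below assumes the per-label counts have already been collected. The window size and the 10% tolerance are arbitrary starting points, not values from this migration.

```python
from statistics import median


def suspicious_labels(file_counts, window=3, tolerance=0.10):
    """Flag labels whose file count deviates from nearby labels.

    `file_counts` is a list of (label, count) pairs in import order.
    A label is flagged when its count differs from the median of up to
    `window` labels on each side by more than `tolerance` (a fraction).
    """
    flagged = []
    for i, (label, count) in enumerate(file_counts):
        before = file_counts[max(0, i - window):i]
        after = file_counts[i + 1:i + 1 + window]
        neighbors = [c for _, c in before + after]
        if not neighbors:
            continue
        expected = median(neighbors)
        if expected and abs(count - expected) / expected > tolerance:
            flagged.append(label)
    return flagged
```

Using the median rather than the mean keeps a single bad label from making its healthy neighbors look suspicious. Flagged labels still need a manual look, since a large legitimate refactoring can also change the file count.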
The old ClearCase branch model and names no longer make sense in Git. Instead of pushing forward ClearCase branch names, the Git branches are derived from the ClearCase labels by removing the build numbers.
This provides easy to understand branches historically, and paves a simple branching model going forward.
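Deriving a branch name from a label is a small string operation. The label format below (a release name followed by `_B` and a build number, such as `REL_4.2_B0137`) is purely hypothetical, so the pattern would need to be adapted to the real naming scheme. Labels that do not match are returned separately so they can be validated by hand.

```python
import re
from collections import defaultdict

# Hypothetical label format: <release>_B<build number>
LABEL_PATTERN = re.compile(r"(?P<branch>.+)_B(?P<build>\d+)")


def branch_for_label(label):
    """Return (branch_name, build_number), or None if the label does not fit."""
    m = LABEL_PATTERN.fullmatch(label)
    return (m.group("branch"), int(m.group("build"))) if m else None


def group_by_branch(labels):
    """Map each derived branch to its labels in build order; collect misfits."""
    branches, unmatched = defaultdict(list), []
    for label in labels:
        parsed = branch_for_label(label)
        if parsed is None:
            unmatched.append(label)
            continue
        branch, build = parsed
        branches[branch].append((build, label))
    ordered = {b: [label for _, label in sorted(v)] for b, v in branches.items()}
    return ordered, unmatched
```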
Because ClearCase branches are being largely ignored, some effort must be taken manually to decide how the branch hierarchy will look in Git.
Initially, I just imported all labels in order. However, the labels must be imported in the correct order to preserve the correct branch points.
There is no great way to find these branch points in this ClearCase environment. The strategy here is to use the version tree of individual files to approximate and map out the start point of each branch manually. This is a bit messy, but as long as it is close it turns out OK.
This is an opportunity to fix any branching or naming conventions in the source repository going forward.
It’s difficult to get a complete import in one shot, and impossible to develop if you have to start over every time there’s an error.
Since the git tag is the absolute last step after importing a label and verifying its contents, it can be used as a safe point.
The strategy I used for restarting the import is to reset the git repository to the most recent tag and continue on from the start of the next label.
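A restart along those lines might look like the sketch below. It assumes each Git tag has the same name as the ClearCase label it came from, and that the most recent tag reachable from HEAD is the right safe point. With several branches in flight that second assumption may need refining. The helper names are made up for this example.

```python
import subprocess


def last_imported_tag(repo_dir):
    """Name of the most recent tag reachable from HEAD."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "describe", "--tags", "--abbrev=0"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def reset_to_tag(repo_dir, tag):
    """Throw away everything after the safe point, including untracked files."""
    subprocess.run(["git", "-C", repo_dir, "reset", "--hard", tag], check=True)
    subprocess.run(["git", "-C", repo_dir, "clean", "-fd"], check=True)


def remaining_labels(ordered_labels, last_tag):
    """Labels still to import, given the import order and the last good tag."""
    if last_tag is None:
        return list(ordered_labels)
    return list(ordered_labels[ordered_labels.index(last_tag) + 1:])
```

Typical use would be `tag = last_imported_tag(repo)`, then `reset_to_tag(repo, tag)`, then looping over `remaining_labels(labels, tag)`.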
The ClearCase command to fetch all labels can take many minutes. A simple early optimization is to cache this result during the iterative development process.
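A file-backed cache is enough for this. In the sketch below, `fetch_labels` stands in for whatever slow ClearCase query produces the label list. Both that callable and the JSON cache file are assumptions for the example.

```python
import json
import os


def cached_labels(cache_path, fetch_labels):
    """Return the label list, calling the slow `fetch_labels()` only on a cache miss."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    labels = fetch_labels()
    with open(cache_path, "w") as f:
        json.dump(labels, f)
    return labels
```

Deleting the cache file forces a fresh query, which is worth doing before the final import run in case labels were added in the meantime.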
If a ClearCase to Git migration is performed well, Git can actually have richer, more usable history than ClearCase. It is possible to provide it cleanly and consistently, preserving important details for future developers of the project.
If you’ve read this far, you may also enjoy my article on Dual Syncing ClearCase and Git.