Harmonically Migrating from ClearCase to Git

19 July 2013

Summary

How to migrate from ClearCase to Git in a large development organization, while preserving history, including labels, branches, authors, timestamps, and ticketing integration. In short, making the Git history look native.

You may also enjoy my article on Dual Syncing ClearCase and Git.

Background

ClearCase was first released in 1992. 20 years later, it has really started to show its age. I was recently reading about performance tuning and came across this gem:

Client workstations supporting a single user should have a minimum of 10–15 MIPS processing power, 16Mb of main memory, and 300Mb of disk storage. An additional 8–16Mb of main memory will further improve performance.

My virtual machine with 80GB of memory and 8 CPUs is clearly overpowered, and throwing additional hardware at ClearCase has diminishing returns (except perhaps for block I/O performance). Migrating from such a legacy system to something new and shiny is not trivial.

Git was written in 2005 and, with a little help from GitHub, has become the 800-pound gorilla in version control.

There is something to be said for the cyclical nature of development tools in general and version control tools in particular. It is healthy and pragmatic to plan on changing version control tools every 10 years. Plan on it. I fully intend to be doing conversions from Git into something else starting in about 2020. It will not surprise me if the pressure to move out of Git starts several years earlier than that.

The trick is to stay in the majority of the bell curve for tool usage. If you fall too far outside it, you fall into “legacy” territory and it becomes even harder to convert.

In this case, converting from ClearCase to Git in 2013 is not a common path. Too few people use ClearCase, and the market for converting to a tool as new as Git is small. This forced our hand into writing a lot of custom code to achieve a desired result.

There are commercial tools available. However, even those tools recommend importing the history separately, because it is too messy to be useful otherwise.

Scope

The ClearCase code base (VOB) in question consists of ~5100 labels (“tags” in git parlance), ~110 branches, and 10 years of development history.

The desire is to retain all of that history and transition smoothly from ClearCase into Git without compromising development efforts. This rules out generic conversion tools as well as migration strategies that rely on eliminating or altering history.

What follows is a brain-dump that would have saved me months of time if I had only known before starting. The short list is:

Be label-centric
Use a Linux client for performance reasons and case sensitivity
Use dynamic views and cleartool describe
Avoid most other cleartool commands - they’re too slow
Use rsync and git to determine changes
Make git commit descriptions beautiful

ClearCase Constraints

Understanding the ClearCase data model and the way it is used locally is the first step. In our case, we use base ClearCase. I understand that UCM may provide additional metadata that would be useful, but I have never worked with it.

The two trickiest aspects of this ClearCase environment are determining branch points, and getting the performance of ClearCase to an adequate level to support importing the history in a reasonable time frame.

Git Constraints

Git is not without its own caveats. First, Git’s snapshot-based data model requires that history be imported correctly the first time. Once developers start using the repository, history cannot be tuned or tweaked without causing significant pain in their development environments.

One other small issue is symlinks. ClearCase makes symlinks work cross platform but Git does not support symlinks in Windows at this time. In the interest of time and backward compatibility, we decided to duplicate any symlinked files and document them so they can be resolved in a cross-platform way going forward.

Design Considerations

This ClearCase environment has multiple potentially conflicting sources of data:

ClearQuest contains records of ClearCase commits
ClearCase branches
ClearCase labels
ClearCase file element histories

Each of these sources of information provides a distinct perspective which may not always be 100% consistent with the other perspectives. For example, a file could be modified in ClearCase without an associated record in ClearQuest. Likewise, conceivably, a record could exist in ClearQuest without the corresponding change in ClearCase. Additionally, branching methodology can vary during 10 years of history and file elements can occasionally be inconsistently included or excluded from branches.

For these reasons and others, most existing import tools will come with caveats such as not guaranteeing labels, not creating labels, not discovering files that were deleted in the course of a branch, and others.

In short, ClearCase history lacks transactional integrity and therefore will invariably be inconsistent. This environment also does not use atomic commits, though ClearCase does support them.

However, it does have labels. After every build since 2003, the code base was consistently labeled, creating point-in-time snapshots that could be used to recreate any build. These labels, therefore, are the uncompromising cornerstone of the Git migration strategy. They must be preserved accurately and the rest of the history can be filled in from these points in time.

Working with ClearCase

Performance, or lack thereof, quickly becomes the focus of migrating 10 years of history. With 5100 labels, every minute per label of runtime translates into about 3.5 days. Some of our initial strategies cost 15-20 minutes per label, resulting in months of runtime.
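The arithmetic behind those figures is worth spelling out, because it drives most of the design decisions that follow. A small sketch (the function name is mine, not part of the import scripts):

```python
LABELS = 5100  # approximate label count from the Scope section


def total_days(minutes_per_label, labels=LABELS):
    """Wall-clock days needed if every label costs the given number of minutes."""
    return labels * minutes_per_label / (60 * 24)


if __name__ == "__main__":
    for minutes in (1, 15, 20):
        print(f"{minutes:>2} min/label -> {total_days(minutes):.1f} days")
```

At 15 to 20 minutes per label this works out to roughly 53 to 71 days of continuous runtime, before counting any restarts.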

Dynamic views, a feature unique to ClearCase, can be set to an arbitrary label very quickly (in under a second). Transferring data out of a dynamic view, however, can be costly. Traversing the view (by doing an rsync -a without any actual changes) can take over 5 minutes when running against the production ClearCase environment over the network.

The transfer time can be decreased to about 1 minute per traversal by installing the VOB locally/offline on a RAID 0 array of solid state disks. Further, using a ClearCase client on Linux gives us another large speed boost, dropping the rsync time to less than a minute per label.

Linux is also important for case sensitivity. It is hard to deal with case-only renames (for example, Foo.c becoming foo.c) on a case-insensitive filesystem when using rsync and git.

Even so, some ClearCase operations are simply too slow: anything involving cleartool find or cleartool lshistory. However, cleartool describe is speedy.

Dynamic views, the list of labels, rsync, and cleartool describe provide all of the primitives we need to extract all history.
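As a rough sketch, those primitives map onto a handful of command lines. The helper names, view tag, and paths are placeholders of mine. The rsync and git flags are the ones listed in the main loop below. The config spec is the simplest single-rule form, and the -fmt string is an assumption that should be checked against the fmt_ccase reference page before relying on it.

```python
def config_spec_for(label):
    """Config spec text selecting, for every element, the version carrying `label`."""
    return f"element * {label}\n"


def setcs_cmd(view_tag, spec_file):
    # cleartool setcs -tag <view-tag> <config-spec-file>
    return ["cleartool", "setcs", "-tag", view_tag, spec_file]


def rsync_cmd(view_root, repo_root):
    # Trailing slashes make rsync copy the directory contents, not the directory itself.
    return ["rsync", "-a", "--delete", "--exclude=.git/", "--exclude=lost+found",
            view_root.rstrip("/") + "/", repo_root.rstrip("/") + "/"]


def git_status_cmd():
    return ["git", "status", "--porcelain", "--untracked-files"]


# Assumed field order: user | date | version ID | predecessor version ID | labels | comment.
# The trailing backslash-n is passed through literally for cleartool to expand.
DESCRIBE_FMT = "%u|%d|%Vn|%PVn|%Nl|%Nc\\n"


def describe_cmd(version_path):
    return ["cleartool", "describe", "-fmt", DESCRIBE_FMT, version_path]
```

Each helper only builds an argument list, so the commands can be logged or dry-run before anything touches the VOB or the repository.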

Mining History from ClearCase

The main loop (not considering edge cases) looks like this:

Set dynamic view config spec to the next label
Rsync from view to the git repo (rsync -a --delete --exclude=.git/ --exclude=lost+found)
Run git status to find out what files have been added, removed, or changed (git status --porcelain --untracked-files)
For each file in git status:
- Run and parse cleartool describe to find the most recent change, using a custom format for the fields of interest
- Pull out predecessor-version and parse it too, recursively, until either the beginning of the file element or a previously labeled version is reached. Once we hit a previous label, we know we can stop. Alternatively/additionally, stop if you hit a ClearCase branch point.
Collect all of these cleartool versions and aggregate them into Git commits (see next section).
Using cleartool get on particular element versions, build the git commits one by one and commit them (see next section)
This is the most important step! After all commits, verify the contents of the label by running the same rsync command, again. Run a git status and verify that there are no pending changes.
- Note that this could be accomplished using git stash. It is faster, but I was hitting some bugs and lost trust in it, so I fell back to rsync.
Tag Git using the same label from ClearCase.
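The steps above can be sketched as follows. This is a minimal outline, not the actual import script: `steps` is a hypothetical object bundling the ClearCase and Git side effects (set_view, sync, status, version_of, describe, group, commit, tag), and `describe` is assumed to return a dict with the predecessor and labels of a version. Deleted files, branch points, and other edge cases are ignored.

```python
def parse_porcelain(output):
    """Turn `git status --porcelain` (v1) output into (status, path) pairs.

    Simplification: quoted paths (names with unusual characters) are not unquoted.
    """
    entries = []
    for line in output.splitlines():
        if len(line) < 4:
            continue
        status, path = line[:2], line[3:]
        if status[0] in "RC" and " -> " in path:
            path = path.split(" -> ", 1)[1]  # keep the new name of a rename/copy
        entries.append((status, path))
    return entries


def walk_history(version, describe):
    """Follow predecessor links back until a previously labeled version.

    `describe(version)` is a hypothetical wrapper around `cleartool describe`
    returning {"version": str, "predecessor": str or None, "labels": list}.
    Returns the collected versions oldest first.
    """
    found = []
    current = version
    while current is not None:
        info = describe(current)
        if found and info["labels"]:
            break  # an earlier label already covers this version
        found.append(info)
        current = info["predecessor"]
    return found[::-1]


def import_label(label, steps):
    """One pass of the main loop for a single label."""
    steps.set_view(label)
    steps.sync()
    versions = []
    for _status, path in parse_porcelain(steps.status()):
        versions.extend(walk_history(steps.version_of(path), steps.describe))
    for commit in steps.group(versions):
        steps.commit(commit)
    steps.sync()  # verification: sync again, then expect a clean tree
    leftover = parse_porcelain(steps.status())
    if leftover:
        raise RuntimeError(f"{label}: {len(leftover)} paths differ after import")
    steps.tag(label)  # the tag is only written once the contents are verified
```

Keeping the side effects behind `steps` makes it possible to exercise the loop against a fake before pointing it at a real VOB.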

One alternative to rsync is to make a Git repository out of a ClearCase dynamic view directly. You could modify the config spec to generate the contents of each git commit, instead of using cleartool get. I am not sure it is any faster - in testing, a git status took about the same time as an rsync. Traversing the files in a dynamic view appears to be the big, unavoidable bottleneck.

Generating Git Commits

ClearCase, without UCM and without atomic commits, does not group commits in any way. Each file change is a distinct commit. This would be far too granular to bring into Git (but it would be accurate) - so it is prudent to combine the ClearCase changes into logical Git commits.

My initial approach was to sort the ClearCase commits sequentially by time, and combine them with adjacent commits if the author and commit message matched.

This approach fell short. First, in this environment developers will often provide different commit messages for each file, even though it is really the same logical commit. Second, frequently two developers will commit a large number of files at the same time. When this happens, their commits are no longer “adjacent” in time and so they get broken up.

The final approach is to combine commits regardless of time, as long as there are no conflicts detected. Commits are combined if the author matches and the ClearQuest record associated with the commit matches.

In pseudocode:

closed_groups = []
open_groups = []
for each change:

    # Loop through the open groups to check that the current file is
    # not already in an existing group. If it is, close that group and
    # make a new open group for the current change.

    # Otherwise, loop through the open groups to find a match (same
    # user, same ClearQuest record) and add the change to it.

    # If no open group matches, create a new open group with the
    # current change.

return closed_groups + open_groups

It is important to still consider time - the easiest approach is to keep the changes sorted by time and the groups sorted by time, starting with the earliest changes.
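One runnable reading of that pseudocode is sketched below. The Change and Group types and their field names are assumptions for illustration; the real import works from parsed cleartool describe output and ClearQuest record IDs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Change:
    path: str
    author: str
    record: str   # ClearQuest record ID (hypothetical field name)
    time: float   # commit timestamp


@dataclass
class Group:
    author: str
    record: str
    changes: List[Change] = field(default_factory=list)

    def touches(self, path):
        return any(c.path == path for c in self.changes)


def group_changes(changes):
    """Combine per-file changes into commit-sized groups."""
    closed, open_groups = [], []
    for change in sorted(changes, key=lambda c: c.time):
        conflicting = [g for g in open_groups if g.touches(change.path)]
        if conflicting:
            # A second version of the same file must land in a later commit:
            # close the conflicting group and start a fresh one.
            closed.extend(conflicting)
            open_groups = [g for g in open_groups if not g.touches(change.path)]
            open_groups.append(Group(change.author, change.record, [change]))
            continue
        match = next((g for g in open_groups
                      if g.author == change.author and g.record == change.record), None)
        if match is None:
            match = Group(change.author, change.record)
            open_groups.append(match)
        match.changes.append(change)
    return closed + open_groups
```

Because a group is closed before the conflicting change is placed, committing the returned groups in order never applies an older version of a file after a newer one.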

Commit Metadata

The description of each Git commit should pull in data from as many sources as possible.

For example, it is imperative to preserve the original author, commit time, ClearCase comment, and ClearQuest associations.

In this case, it is also nice to mine ClearQuest for additional data that can be pulled into Git at this time, creating a rich history in Git going forward.
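A minimal sketch of how that metadata could be attached to a commit. The GIT_AUTHOR_* and GIT_COMMITTER_* environment variables are standard Git. The trailer names (ClearQuest, ClearCase-Version) and the function names are my own invention for illustration.

```python
from datetime import datetime, timezone


def commit_env(name, email, when):
    """Environment variables that make `git commit` record the original author and time.

    `when` must be a timezone-aware datetime; the date uses Git's internal
    "<unix timestamp> <offset>" format.
    """
    stamp = f"{int(when.timestamp())} +0000"
    return {
        "GIT_AUTHOR_NAME": name,
        "GIT_AUTHOR_EMAIL": email,
        "GIT_AUTHOR_DATE": stamp,
        "GIT_COMMITTER_NAME": name,
        "GIT_COMMITTER_EMAIL": email,
        "GIT_COMMITTER_DATE": stamp,
    }


def commit_message(summary, comments, record_id, versions):
    """Summary line, de-duplicated ClearCase comments, then trailers."""
    unique = list(dict.fromkeys(c.strip() for c in comments if c.strip()))
    trailers = [f"ClearQuest: {record_id}"]
    trailers += [f"ClearCase-Version: {v}" for v in versions]
    parts = [summary, "\n".join(unique), "\n".join(trailers)]
    return "\n\n".join(p for p in parts if p) + "\n"
```

The message can then be fed to git commit -F - on standard input, with the dictionary merged into the process environment. Recording each ClearCase version as a trailer keeps the Git history traceable back to the VOB.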

Dealing with Branches

Exclude bad and dumb labels

First, validate the label names and potentially correct inconsistencies.

Second, find and ignore incomplete labels. For each label, verify that the number of files in the label is consistent with adjacent labels. Investigate the labels that are outliers.

Incomplete labels will end up looking like file deletes and adds in Git, which will break things like git blame.
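One way to find those outliers is to compare each label's file count with the median of its neighbours. The sketch below assumes the counts have already been collected in import order (ideally per branch); the window size and 10% tolerance are arbitrary starting points, not values from the actual migration.

```python
from statistics import median


def suspicious_labels(counts, window=3, tolerance=0.10):
    """Return labels whose file count deviates from nearby labels.

    counts: list of (label, file_count) pairs in import order.
    """
    flagged = []
    for i, (label, n) in enumerate(counts):
        neighbours = [c for j, (_, c) in enumerate(counts)
                      if j != i and abs(j - i) <= window]
        if not neighbours:
            continue
        typical = median(neighbours)
        if typical and abs(n - typical) / typical > tolerance:
            flagged.append(label)
    return flagged
```

The result is a list to investigate by hand, not a list to drop automatically: a large refactoring can legitimately change the file count.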

Deduce Branch from Label

The old ClearCase branch model and names no longer make sense in Git. Instead of carrying forward the ClearCase branch names, the Git branches are derived from the ClearCase labels by removing the build numbers.

This provides easy to understand branches historically, and paves a simple branching model going forward.
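As an illustration only, suppose labels look like PRODUCT_4.2_B117 or PRODUCT_4.2_BUILD_117. That naming scheme is hypothetical, so the pattern below has to be adapted to the local convention. Labels that do not match are returned unchanged so they can be reviewed by hand.

```python
import re
from collections import OrderedDict

# Hypothetical convention: a trailing "_B<number>" or "_BUILD_<number>" is the build number.
BUILD_SUFFIX = re.compile(r"[_.-]B(?:UILD)?[_.-]?\d+$", re.IGNORECASE)


def branch_for(label):
    """Derive a Git branch name from a ClearCase label by dropping the build number."""
    return BUILD_SUFFIX.sub("", label)


def labels_by_branch(labels):
    """Group labels (kept in their given order) under the branch derived from each."""
    branches = OrderedDict()
    for label in labels:
        branches.setdefault(branch_for(label), []).append(label)
    return branches
```

Grouping the labels this way also gives a quick overview of how many builds each future Git branch will contain.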

Manually generate branch hierarchy

Because ClearCase branches are being largely ignored, some manual effort is needed to decide how the branch hierarchy will look in Git.

Initially, I just imported all labels in order. However, the labels must be imported in the correct order to preserve the correct branch points.

There is no great way to find these branch points in this ClearCase environment. The strategy here is to use the version trees of individual files to approximate and map out the start point of each branch manually. This is a bit messy, but as long as it is close it turns out OK.

Consistency Opportunity

This is an opportunity to fix any branching or naming conventions in the source repository going forward.

Further Performance Optimizations

Idempotence - Restarting by resetting to last git tag

It’s difficult to get a complete import in one shot, and impossible to develop if you have to start over every time there’s an error.

Since the git tag is the absolute last step after importing a label and verifying its contents, it can be used as a safe point.

The strategy I used for restarting the import is to reset the git repository to the most recent tag and continue from the start of the next label.
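A sketch of that restart logic, assuming the ordered label list and the set of tags already present in Git (for example from git tag --list) are available. The function names are mine, and the reset shown is the simple single-branch case.

```python
import subprocess


def last_imported(labels, existing_tags):
    """Last label, in import order, that already exists as a Git tag (or None)."""
    done = set(existing_tags)
    last = None
    for label in labels:
        if label not in done:
            break
        last = label
    return last


def remaining(labels, existing_tags):
    """Labels still to import, starting right after the last verified tag."""
    last = last_imported(labels, existing_tags)
    return list(labels) if last is None else list(labels[labels.index(last) + 1:])


def reset_to_tag(repo, tag):
    """Throw away any half-imported label by returning to the last safe point."""
    subprocess.run(["git", "-C", repo, "reset", "--hard", tag], check=True)
    subprocess.run(["git", "-C", repo, "clean", "-fd"], check=True)
```

Because a tag is only written after verification, everything up to and including `last_imported` can be trusted, and everything after it is safe to discard.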

Fetching and Caching Labels

The ClearCase command to fetch all labels can take many minutes. A simple early optimization is to cache this result during the iterative development process.
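A minimal caching sketch: `fetch` stands in for whatever slow call produces the label list, and the cache is a plain JSON file that can be deleted to force a refresh. Both names are placeholders.

```python
import json
import os


def cached_labels(cache_path, fetch):
    """Return the label list from `cache_path`, calling `fetch()` only on a cache miss."""
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            return json.load(fh)
    labels = list(fetch())
    with open(cache_path, "w") as fh:
        json.dump(labels, fh)
    return labels
```

During development this turns a multi-minute wait at every restart into a file read. For the final production run, delete the cache so that newly created labels are picked up.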

Result

If a ClearCase to Git migration is performed well, Git can actually have richer, more usable history than ClearCase. It is possible to provide it cleanly and consistently, preserving important details for future developers of the project.

Dual Syncing

If you’ve read this far, you may also enjoy my article on Dual Syncing ClearCase and Git.