[Solved] How do I store sensitive files in my repo without tracking them? [duplicate]

Accepted Answer

The answer to the question in the subject line:

How do I store sensitive files in my repo without tracking them?

is: you don’t.

The reason is simple: Git builds new commits from whatever is in Git’s index. The index, a.k.a. the staging area, holds the copies of files that will go into your next commit. It’s initially filled in by copying out the files from the current commit. These same files are copied to your working tree so that you can see and work on them.¹ Then, as you modify your working tree copies, you run git add to copy the working tree versions back into Git’s index, so that the proposed next commit is also updated.

A tracked file is one that is in Git’s index. It is therefore proposed to be in your next commit. If you untrack the file (by removing it from Git’s index), it is proposed that the next commit should omit that file, i.e., the file is deleted between the two commits.
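To make the "commits are built from the index" point concrete, here is a toy model in Python. It is only a sketch: the class and method names are made up for illustration, and real Git stores things very differently. The one property it shares with Git is that a commit is a snapshot of whatever the index holds at that moment.

```python
"""Toy model of the index; not how Git is actually implemented."""


class ToyRepo:
    def __init__(self):
        self.index = {}    # path -> contents proposed for the next commit
        self.commits = []  # each commit is a frozen snapshot of the index

    def add(self, path, contents):
        """Like `git add`: copy the working tree version into the index."""
        self.index[path] = contents

    def untrack(self, path):
        """Like `git rm --cached`: drop the path from the index only."""
        self.index.pop(path, None)

    def commit(self):
        """A new commit is simply whatever is in the index right now."""
        snapshot = dict(self.index)
        self.commits.append(snapshot)
        return snapshot


repo = ToyRepo()
repo.add("app.py", "print('hi')")
repo.add("credentials.json", '{"username": "", "password": ""}')
first = repo.commit()

repo.untrack("credentials.json")
second = repo.commit()

# The untracked file is absent from the second commit, so comparing the two
# commits shows it as deleted.
deleted = sorted(set(first) - set(second))
print(deleted)  # ['credentials.json']
```

So "in the repo but untracked" is not a state a file can stay in across commits: either the index has it and the next commit includes it, or the index lacks it and the next commit drops it.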

The answer—well, an answer—to the question inside the text:

I prefer to store the json file with empty fields in my repo so the credentials can be later filled out either by my build script, or filled out by another developer after the repo is initialized:
{
     "username": "",  // to be filled by build or user
     "password" : ""  // ditto gitto
}

is to use Git’s smudge and clean filter mechanism so that the stored file, in Git, omits the sensitive data, while the working tree copy of that same file—the data that you can see in a file-viewer and edit in an editor—shows it.

The smudge and clean filter mechanism is a little tricky, and carelessness can result in the sensitive data winding up in the repository.

I’ve encountered a few ways to achieve this with Git in the past, but they all require the user to actively perform additional steps …

Setting up the smudge and clean filters has this same problem. Once set up, though, the clean filter can take the working tree copy, which has the sensitive data, and strip that sensitive data out of the file-contents as the file is copied from your working tree into Git’s index. So the proposed next commit does not have the sensitive data. The smudge filter can put the sensitive data back into the file as it’s copied from a commit, or from Git’s index, to your working tree copy. (Of course your smudge filter needs to get the sensitive data from somewhere. So: where are you keeping the actual data? Why not keep it there and only there?²)
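As a rough illustration of what such a filter pair might look like, here is a minimal Python sketch. Several things in it are assumptions, not anything from the question: the script name creds_filter.py, the driver name creds, the environment variables APP_USERNAME and APP_PASSWORD as the place the real values live, and the file being plain JSON (the // comments in the question's sample would not parse). Git runs a filter as a command that reads the file contents on standard input and writes the converted contents to standard output.

```python
#!/usr/bin/env python3
"""Hypothetical clean/smudge filter for a credentials JSON file."""
import json
import os
import sys

SENSITIVE_KEYS = ("username", "password")  # assumed field names


def clean(text):
    """Working tree -> index: blank out the sensitive fields."""
    data = json.loads(text)
    for key in SENSITIVE_KEYS:
        if key in data:
            data[key] = ""
    return json.dumps(data, indent=4) + "\n"


def smudge(text, secrets):
    """Index -> working tree: fill the blanks from `secrets`."""
    data = json.loads(text)
    for key in SENSITIVE_KEYS:
        if key in data and key in secrets:
            data[key] = secrets[key]
    return json.dumps(data, indent=4) + "\n"


def secrets_from_env(environ):
    """Assumption: the real values live in APP_USERNAME / APP_PASSWORD."""
    found = {}
    for key in SENSITIVE_KEYS:
        value = environ.get("APP_" + key.upper())
        if value is not None:
            found[key] = value
    return found


def main(argv):
    text = sys.stdin.read()
    if argv[1] == "clean":
        sys.stdout.write(clean(text))
    elif argv[1] == "smudge":
        sys.stdout.write(smudge(text, secrets_from_env(os.environ)))
    else:
        sys.exit("usage: creds_filter.py clean|smudge")


# Only act as a filter when a mode argument was given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

Wiring it up is the "additional steps" part: a .gitattributes line such as "credentials.json filter=creds", plus filter.creds.clean and filter.creds.smudge settings in each clone's configuration pointing at the script. Those settings are not carried along by a clone, so a developer who skips that step has no clean filter, and git add will stage the real credentials.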

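In general, then, the right answer is: don’t put this stuff in the repo at all. Instead of committing a JSON file that needs to be filled in, commit an example (or “template”) JSON file under a different name, and keep the real data in some other file that Git never tracks (list it in .gitignore so it is not added by accident).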
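Here is one way the template approach could look from the build-script side, sketched in Python. The file names credentials.example.json (tracked) and credentials.json (untracked) are assumptions for the example, as are the function names; the demo at the bottom uses a temporary directory so it does not touch a real checkout.

```python
import json
import tempfile
from pathlib import Path


def fill_template(template_text, values):
    """Return config text with the template's empty fields filled in."""
    config = json.loads(template_text)
    for key, value in values.items():
        if key not in config:
            raise KeyError(f"template has no field named {key!r}")
        config[key] = value
    return json.dumps(config, indent=4) + "\n"


def write_local_config(template_path, output_path, values):
    """Create the untracked config from the tracked template, once."""
    output = Path(output_path)
    if output.exists():
        return False  # never overwrite credentials someone already entered
    template_text = Path(template_path).read_text(encoding="utf-8")
    output.write_text(fill_template(template_text, values), encoding="utf-8")
    return True


# Demo in a scratch directory standing in for a fresh clone.
with tempfile.TemporaryDirectory() as tmp:
    template = Path(tmp) / "credentials.example.json"  # would be tracked
    local = Path(tmp) / "credentials.json"             # would be ignored
    template.write_text('{"username": "", "password": ""}', encoding="utf-8")

    created = write_local_config(
        template, local, {"username": "alice", "password": "s3cret"})
    created_again = write_local_config(
        template, local, {"username": "bob", "password": "other"})
    result = json.loads(local.read_text(encoding="utf-8"))
```

The second call returns False and leaves the file alone, so a build that runs this on every invocation will not clobber values a developer typed in by hand. Since only the template is ever in the index, nothing here depends on per-clone Git configuration.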

¹The difference between Git’s index copy of a file, and your working tree copy of the same file, is … well, see the smudge and clean filter stuff as well, but the important difference to Git itself is that the copy in Git’s index is already in the special format that Git uses to store files. This format is compressed and de-duplicated and does not use the storage system that your OS uses. It can therefore hold files whose names your OS cannot pronounce, as it were, depending on your OS. It’s also very fast to commit: it doesn’t require scanning through the data to compress and de-duplicate it, for instance.

²Convenience, stubbornness, spite, obstinacy … there are lots of good reasons!
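For the curious, the "compressed and de-duplicated" format mentioned in footnote 1 can be approximated in a few lines of Python. This is a sketch covering only loose blob objects in a repository that uses the default SHA-1 object format; the ToyObjectStore class is invented for the illustration, and pack files are ignored entirely.

```python
import hashlib
import zlib


def blob_header(content: bytes) -> bytes:
    """The header Git prepends to file contents: b'blob <size>\\0'."""
    return b"blob " + str(len(content)).encode("ascii") + b"\0"


def git_blob_id(content: bytes) -> str:
    """Object ID of a blob: SHA-1 over the header plus the contents."""
    return hashlib.sha1(blob_header(content) + content).hexdigest()


class ToyObjectStore:
    """Content-addressed store: identical contents are kept only once."""

    def __init__(self):
        self.objects = {}  # object ID -> zlib-compressed header + contents

    def put(self, content: bytes) -> str:
        oid = git_blob_id(content)
        if oid not in self.objects:
            self.objects[oid] = zlib.compress(blob_header(content) + content)
        return oid


store = ToyObjectStore()
first_id = store.put(b'{"username": "", "password": ""}\n')
second_id = store.put(b'{"username": "", "password": ""}\n')  # same bytes
stored_count = len(store.objects)
```

Both put calls return the same ID and the store ends up holding one object, which is the de-duplication. Note also that nothing here involves a file name: names live elsewhere (in the index and in tree objects), which is why Git can hold names your OS cannot.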
