There doesn't seem to be a lot of documentation on the AccurateRip checksum algorithm. Actually there seems to be nothing at all. I've wanted to use the AccurateRip checksum to do a batch scan of a bunch of audio CD rips and that has made me curious as to how the AccurateRip system really works. The result of my investigation into AccurateRip is an efficient version of the (offset) checksum algorithm.
Update 2010-10-20: Source code available here.
The Basics
The AccurateRip algorithm in itself is very simple. Since it operates on audio data extracted from CDs the basic unit is one audio CD sample, i.e. one 16-bit, little-endian, signed integer. As it is stereo, samples always appear in pairs, one sample for the left channel one for the right. The algorithm loops through these sample pairs while it keeps track of the 1-based index of the current pair, i.e. 1 is the index of the first pair in the data stream, 2 is the index of the next pair and so forth. For each pair the algorithm produces a 32-bit value by adding the first sample of the pair, to the second sample shifted left 16 places. This value is then multiplied by the index and added to the accumulated checksum. At the end of the data stream the result is simply the accumulated value:
- Set the checksum value to zero,
c = 0
- Set the index value,
i
, to the start of the track - Loop while there are more samples in the track
- Read two samples (left and right),
x_l
,x_r
- Combine the samples,
v = (x_r << 16) | x_l
- Update the checksum value,
c += i⋅v
- Increment
i
- Read two samples (left and right),
- Return the value in
c
This is not exactly a great checksum algorithm. It has been argued, that only about 97% of the bits in the data stream participate in the final checksum. This disadvantage is of course offset by the large number of entries in the AccurateRip database. Fixing the checksum would require creating a new database from scratch.
AccurateRip Quirks
The checksum is not calculated the same way for all tracks on a disc; the first and last tracks are calculated differently. On the audio CD, samples are grouped into larger blocks called frames (sometimes known as sectors). Each frame contains 588 sample pairs (which is exactly 1/75 second worth of audio). In the first track, almost five frames are skipped at the start of the track. It is not exactly five frames because the last sample pair of frame five is actually used. Also note that the index value is incremented even for the samples that are not used in calculating the checksum (only step three in the loop of the algorithm is skipped). In the last track, exactly five frames are skipped at the end.
The effect of these cuts is that the checksum is calculated for a window on the data stream that is slightly shorter than the whole data stream. By sliding this window back and forth it is possible to obtain different checksums for different window offsets. This is important because several pressings of the same audio CDs often exist, the only difference between them being that the audio data on each pressing is slightly offset compared to the others. It would therefore be advantageous to use this sliding window to be able to determine whether a disc is in the AccurateRip database even if the disc registered in the database is of a different pressing.
Obviously, we could use the above algorithm, modified slightly, to calculate a checksum for each offset. That would require going through the loop 5880 to get all the possible checksums. It can be done much more efficiently!
The Math
Consider a sequence of values D = v1, v2, ..., vn calculated from sample pairs in a data stream as in step two in the loop of the above algorithm. Now, suppose that this is the data from the whole audio CD, and that each track is a subsequence of D. Let's further assume that the sequence can be read only sequentially as, in practice, we would be working with some kind of data stream; either data from a CD or from audio files. To calculate the checksum of a track starting at index i of length k one would calculate the sum 1⋅vi + 2⋅vi+1 + ... + k⋅vi+k-1. This is the checksum at index i so lets call it ci.
The goal here is the find checksums that are offset so lets look at that same track but this time offset by one. The checksum of this track is the sum 1⋅vi+1 + 2⋅vi+2 + ... + k⋅vi+k which we will call ci+1. To see if there is an easy way of obtaining the offset checksum from the base checksum let's subtract the last one from the first: ci+1 - ci = (1⋅vi+1 + 2⋅vi+2 + ... + k⋅vi+k) - (1⋅vi + 2⋅vi+1 + ... + k⋅vi+k-1) = k⋅vi+k - (vi + vi+1 + ... + vi+k-1).
It can now be seen that ci+1 = ci + k⋅vi+k - (vi + vi+1 + ... + vi+k-1), i.e. the checksum at offset one can be derived from the base checksum by adding a term and subtracting a different term. Now, let's consider which parts of this formula we already know. Let's also assume that we have just calculated the base checksum of the track, i.e. the next value in the sequence will be vi+k since we have just read k values starting at index i. But vi+k is exactly the value we need as part of the term that we're going to add to the base checksum. We just need to read it and multiply it with k which is also a known value (the track length).
The term that we're going to subtract is obviously just the sum of the values of the track. It would be trivial to compute this as we're going through the values to compute the checksum anyway, so with a small modification of the algorithm this is no problem as well. It should be clear now that we don't need to go through k values to obtain the checksum offset by one, we can just derive it from the base checksum by doing one simple addition and one subtraction. Let's see if we can extend this to obtain more than just the first derived checksum. Let's see if we can obtain ci+2.
Reusing the formula obtained earlier, we see that ci+2 = ci+1 + k⋅v(i+1)+k - (vi+1 + vi+2 + ... + vi+k). The right hand side has three terms. The first term is the first derived checksum which we have just calculated. The next term depends on the value v(i+1)+k but this is the next value in the data stream since we've just read vi+k. The last term is the tricky part. We need the sum, not of the base track this time, but of the track offset by one. We don't have this since we didn't loop through the track offset by one to calculate the checksum, we just derived it from the base checksum. We can, however, adjust the sum when we calculate a derived checksum. When we calculate the first derived checksum we need to adjust the sum so it is no longer the sum of the base track but is the sum of the track offset by one. We can do this by subtracting the first value in the track, then adding the last value in the track offset by one. This will in effect move the track window, to which the sum pertains, one step, and with just two operations we can get the correct sum. The only problem is that we need the first value of the track to do this. Either we need to seek back through the stream and retrieve this value again, or we could have saved it somewhere when we were calculating the checksum at the first index in the track.
The latter solution may seem like a waste of memory space, but if we use a clever trick, it actually isn't. We will need to allocate some memory to save all the derived checksums anyway, so why not use this space temporarily to save a value from the beginning of the track? Each calculation of a derived checksum will need just one value to adjust the sum, so clearly, there's exactly enough space. Furthermore, each value that is saved from the start of the track will need to be used once, after which it can be immediately overwritten by the derived checksum.
One last thing we need to consider is that the improved method sketched above can't work directly on the first track in practice, because the first five frames (almost) are skipped but the index is still incremented. This makes it impossible to use the above formula to derive checksums from the base checksum of the first track because it was assumed that the track index would start at one. Now assume the more general case of a track whose index counter starts at, not one, but p. If we can derive a formula similar to the above from this more general case we can use it to handle all tracks. The checksum of a track starting at index i but whose index counter starts at p+1, ci,p = (p+1)⋅vi + (p+2)⋅vi+1 + ... + (p+k)⋅vk+i-1. Now let's subtract ci,p from ci+1,p like above. Continuing this gets us ci+1,p = ci,p + (p+k)⋅vk+i - (vi + vi+1 + ... + vk+i-1) - p⋅vi. The first three terms on the right hand side are clearly equivalent to the formula above. The last term depends on the first value of the track, which we, conveniently, have already made available because we needed it to adjust the sum. Also notice that if p = 0 the last term disappears. This is what makes the formula work exactly as the simpler one derived above in the case where the index counter of the track starts at one.
Improved algorithm
First of all, it must be noted that the above math will be used to construct an algorithm that has to work on real computers, specifically computers that work with 32-bit fixed-width integers. Fortunately, the above formula can be used directly because it only make use of addition and multiplication operators which are compatible with modular arithmetic:
- Allocate memory space for 5880 values
- Set first memory field to zero;
m[1] = 0
- Set sum to zero;
s = 0
- Set the index value,
i
, to the start of the track minus (5880/2)+1 - Loop
k
times, wherek
is the length of the track- Read two samples (left and right);
x_l
,x_r
- Combine the samples;
v = (x_r << 16) | x_l
- If
i
is less than 5880, savev
in memory fieldi+1
- Update the sum;
s += v
- Update the checksum value in the first memory field;
m[1] += i⋅v
- Increment
i
- Read two samples (left and right);
- Loop 5880 minus one times, let
i
denote the number of the current loop- Read two samples (left and right);
x_l
,x_r
- Combine the samples;
v = (x_r << 16) | x_l
- Put aside the value in memory field
i+1
;f = m[i+1]
- Calculate the derived checksum by the above formula and store it in memory field
i-1
;m[i+1] = m[i] + (p+k)⋅v - s - p⋅f
, wherep
is the offset added to the index counter, andk
is the length of the track - Adjust the sum;
s = s + v - f
- Read two samples (left and right);
- Return the values in memory space; the checksum at offset
o
will be in memory field(5880/2)+o
Notice how there's only one loop that depends on the track length, followed by a loop of constant length that in practice will be very short compared to the first loop. There are no inner loops. Also notice, that seeking in the data stream is completely unecessary; the samples are always read sequentially.
A common use case of this algorithm would probably be to calculate checksums for entire discs. Interleaving the two loops for adjacent tracks can be done by calculating derived checksums of one track while calculating the base checksum for the next track. This way, the algorithm can calculate all checksums for all tracks with only one sequential read through the entire data stream.