Data redundancy
Average redundancy
Some people define redundancy as "the total number of reflections
scanned divided by the total number of unique reflections". I don't
think this is a correct definition: on a serial diffractometer,
measuring the 4 0 0 reflection 500 times in a dataset of 200 unique
reflections would result in an average redundancy of 2.5! Admittedly,
the completeness is then practically 0% (1 reflection out of 200), but
I think this argument shows that the "average redundancy" is meaningless.
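The arithmetic of this counter-example can be checked with a few lines of Python. This is only an illustrative sketch: the function names and the dictionary layout are made up here and are not part of "collect".

```python
# Hypothetical serial-diffractometer run: a single reflection (4 0 0)
# measured 500 times, while 200 unique reflections were expected.
counts = {(4, 0, 0): 500}
n_unique_expected = 200


def average_redundancy(counts, n_unique):
    """Total number of measurements divided by the number of unique reflections."""
    return sum(counts.values()) / n_unique


def completeness(counts, n_unique):
    """Fraction of the unique reflections that were measured at least once."""
    return sum(1 for c in counts.values() if c > 0) / n_unique


print(average_redundancy(counts, n_unique_expected))  # 2.5
print(completeness(counts, n_unique_expected))        # 0.005, i.e. 0.5%
```

The average looks respectable while all but one of the reflections are missing.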
A better quantification
Instead of average redundancy, we will be using the 90th percentile
redundancy, defined in the same way as a median, but taken at the 90%
mark instead of the 50% mark. Imagine a data set of 21 unique
reflections, measured redundantly, with 50 reflections collected in
total. Count the number of times each reflection is measured, and sort
the numbers:
0 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 4 4 5 5
So: 1 reflection was missed, 3 reflections were measured once, 10
reflections were measured twice, etc. The (nonsense number) "average
redundancy" here is 50/21 = 2.4. Now let's look at how we calculate
redundancy instead:
0  1  1  1  2  2  2  2  2  2  2  2  2  2  3  3  3  4  4  5  5
100%  90%   80%   70%   60%   50%   40%   30%   20%   10%   0%
Here we can see that 90% of all reflections were measured 1 or more
times: the "90% redundancy" is 1.
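A minimal Python sketch of this lookup is given below. It assumes that the percentage scale runs linearly from 100% at the first sorted count to 0% at the last one, as in the example above; the function name is hypothetical and this is not the actual "collect" code.

```python
def percentile_redundancy(counts, percent=90):
    """Return the count r such that about `percent`% of the reflections
    were measured r or more times (integer version)."""
    ordered = sorted(counts)
    # 100% maps to the first sorted entry, 0% to the last one.
    index = round((100 - percent) * (len(ordered) - 1) / 100)
    return ordered[index]


# Number of measurements for each of the 21 unique reflections.
first_set = [0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5]

print(round(sum(first_set) / len(first_set), 2))  # 2.38 ("average redundancy")
print(percentile_redundancy(first_set, 90))       # 1
print(percentile_redundancy(first_set, 50))       # 2 (the median)
```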
There is one more factor that comes into play. If we're calculating
the redundancy this way, there is no way of seeing the difference between
the data set above, and another one like:
0  1  1  1  1  1  1  1  1  2  2  2  2  2  2  2  2  3  3  3  4
100%  90%   80%   70%   60%   50%   40%   30%   20%   10%   0%
This one also has a "90% redundancy" of 1. To be able to distinguish
between this data set and the previous one, we need to change the
redundancy from an integer to a floating point number.
How?
Well, in the first data set, the "1" at 90% is the central "1" out of
3. If we imagine the "1"s as being rounded numbers, they would have been
the result of numbers between 0.5 and 1.5. Since our selected "1" is
the central one, "originally" it most probably was a "1.0", from the
center of the 0.5 to 1.5 range.
In the second data set, the "1" at 90% is the second out of 8, so it
is at a quarter of the range. That would make the "90% redundancy"
around 0.75.
Obviously the "floating point" trick is a trick, because the
redundancy numbers never were floating point numbers to begin with. It
allows us, however, to quantify redundancy in a more fine-grained way.
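The text does not spell out the exact interpolation rule, so the sketch below makes an assumption: the k-th of n equal counts is placed at the fraction (k - 0.5)/n of the range from value - 0.5 to value + 0.5. That rule gives exactly 1.0 for the first data set and about 0.69 for the second one, close to the rough "around 0.75" estimated above. The function name is hypothetical, and "collect" may well use a slightly different rule.

```python
def fractional_percentile_redundancy(counts, percent=90):
    """Percentile redundancy as a floating point number (assumed midpoint rule)."""
    ordered = sorted(counts)
    index = round((100 - percent) * (len(ordered) - 1) / 100)
    value = ordered[index]
    run_start = ordered.index(value)     # index of the first entry equal to value
    run_length = ordered.count(value)    # how many entries share this value
    position = index - run_start + 1     # 1-based position inside the run
    # Spread the run evenly over the interval [value - 0.5, value + 0.5].
    return value - 0.5 + (position - 0.5) / run_length


first_set = [0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5]
second_set = [0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4]

print(fractional_percentile_redundancy(first_set))   # 1.0
print(fractional_percentile_redundancy(second_set))  # 0.6875
```

Whatever the exact rule, the point is that the two data sets no longer receive the same number.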
Using the "average redundancy" number would make some people happier,
because the number is higher. This gives, however, a false sense of
security. "Average redundancy" numbers might be biased by a few
reflections at low theta that are measured very frequently because
they're hard to avoid. These reflections do not give sufficient
information for a good empirical absorption correction.
With the "collect" definition of "90% redundancy", a more even distribution
of redundancy will be favored over uneven distributions, setting a proper
target for getting more accurate final data.
Please note that the actual percentile used by "collect" strategy
calculations can be changed in the configuration file.
What now is a good data collection strategy?
Let's look back at one of the data sets from above.
0  1  1  1  2  2  2  2  2  2  2  2  2  2  3  3  3  4  4  5  5
100%  90%   80%   70%   60%   50%   40%   30%   20%   10%   0%
If we want to increase the redundancy, what reflections should we
be looking for?
- If we want to improve the average redundancy, we
can measure anything we like. If the average redundancy is the
target of the strategy algorithm, it will probably propose some
really easy scan that is very effective in collecting a lot of
data. Most probably, many of the scanned reflections were already
highly redundant before.
- If we want to improve the 50%-redundancy, we have
to re-measure some of the reflections that are now measured twice,
increasing the "2" in the middle of the list to a "3".
- If we want to improve the quality, we should do
our utmost to find the one reflection that is not collected yet,
and try to find some of the reflections that were only seen once so
far. This will probably mean that we have to drive the goniostat to
some difficult-to-reach position, and measure fewer reflections
in total.
The last strategy will effectively raise the 90%-redundancy, but it will
be less effective in raising the 50% redundancy or the average redundancy.
It might appear that the final data collection is less effective than one
that targets the average redundancy, but in fact the quality
of the data set will be better.
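To make the comparison concrete, the sketch below applies two made-up extra scans to the example data set: an "easy" scan that adds one measurement to each of the 10 most redundant reflections, and a "targeted" scan that adds one measurement to each of the 4 least redundant ones. Both scans and all function names are assumptions for illustration only, not output of the "collect" strategy code.

```python
def average_redundancy(counts):
    return sum(counts) / len(counts)


def percentile_redundancy(counts, percent=90):
    """Integer percentile redundancy, with 100% at the lowest sorted count."""
    ordered = sorted(counts)
    index = round((100 - percent) * (len(ordered) - 1) / 100)
    return ordered[index]


baseline = [0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5]

# Easy scan: 10 extra measurements, all on already well-measured reflections.
easy_scan = baseline[:-10] + [c + 1 for c in baseline[-10:]]
# Targeted scan: only 4 extra measurements, on the worst-measured reflections.
targeted = [c + 1 for c in baseline[:4]] + baseline[4:]

for name, counts in [("baseline", baseline), ("easy", easy_scan), ("targeted", targeted)]:
    print(f"{name}: average {average_redundancy(counts):.2f}, "
          f"90% redundancy {percentile_redundancy(counts)}")
# baseline: average 2.38, 90% redundancy 1
# easy: average 2.86, 90% redundancy 1
# targeted: average 2.57, 90% redundancy 2
```

The easy scan wins on the average, but only the targeted scan moves the 90% redundancy.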
Note
Please note that collecting exactly a "half sphere" (assuming that
would be possible with an area detector; only in low- to
intermediate-resolution protein work can we get close) will not give
you the same redundancy for each reflection. Take an orthorhombic data
set. In a half sphere, you might have all the equivalent reflections
-2,1,3 and 2,1,3 and 2,-1,3 and -2,-1,3: a redundancy of 4, as expected.
But the -2,0,3 and 2,0,3 reflections give a redundancy of only 2, and
the 0,0,3 reflection is scanned only once. The fraction of "symmetric"
reflections like this in a highly symmetric data set is surprisingly high!
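The counting in this note can be reproduced with a short Python sketch. It assumes that the equivalents of an orthorhombic reflection are all sign combinations of h, k and l (Laue symmetry mmm, Friedel pairs included) and that the half sphere is the one with l > 0; reflections with l = 0 lie on the boundary and are left out of the sketch. The function name is hypothetical.

```python
from itertools import product


def equivalents_in_half_sphere(h, k, l):
    """Distinct sign-flipped equivalents of (h, k, l) that have l > 0."""
    equivalents = {
        (sh * h, sk * k, sl * l)
        for sh, sk, sl in product((1, -1), repeat=3)
    }
    return sorted(e for e in equivalents if e[2] > 0)


for hkl in [(2, 1, 3), (2, 0, 3), (0, 0, 3)]:
    print(hkl, len(equivalents_in_half_sphere(*hkl)))
# (2, 1, 3) 4
# (2, 0, 3) 2
# (0, 0, 3) 1
```

Every zero index halves the number of distinct equivalents, which is why reflections in the principal planes and on the axes end up with a lower redundancy.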