Painting – Uniform color schemes

A short post. So far I’ve been using pretty much the exact same color scheme for all my new Infinity terrain. While perhaps a bit dull, it does pull everything together quite nicely. The only major change I’m considering is adding a few vertical (red?) bands on the higher buildings. Also, Micro Art Studio will apparently release some more printed acrylics soon; those should find a nice spot in this collection.

[Image: WP_20150830_002]

Statistics – When to stop

A colleague of mine had an interesting problem. He was trying to calculate fill rates for tables in a database, but the only way to access the records was one by one. Given that it was a pretty large database, computing the fill rates was quite slow. Because the fill rates are only displayed with three-digit precision in his application (e.g. 94.3%), he was looking for an early out: a point at which it was no longer very likely that a specific value would still change.

This rang a bell, because it was pretty similar to a problem I solved some years ago: given the average of some binary variable and the number of samples, what is the confidence interval? As usual, these types of problems have been solved and beaten to death; the actual problem is usually just finding the relevant solution. For this problem a good solution is the Wilson score interval. It gives you a relation between the average of a binomial variable, the number of samples and a required confidence level. The results are the upper and lower boundary of the confidence interval.

The Wilson score interval (image taken from Wikipedia)

While a bit intimidating, the above is easily expressed in a few lines of code:

// Lower bound of the Wilson score interval for an observed average 'value',
// a z-score 'sigma' and 'sampleCount' samples.
public static double Lower(double value, double sigma, int sampleCount)
{
    double n = sampleCount;
    double z = sigma;
    double phat = value;
    return (phat + z * z / (2 * n) - z * Math.Sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n);
}

// Upper bound of the Wilson score interval; identical except for the sign in front of the square root.
public static double Upper(double value, double sigma, int sampleCount)
{
    double n = sampleCount;
    double z = sigma;
    double phat = value;
    return (phat + z * z / (2 * n) + z * Math.Sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n);
}

Note that sigma is the z-score, i.e. the number of standard deviations of a standard Gaussian; a sane choice is 1.96, which corresponds to a 95% confidence interval.

Using that, we can iteratively search for the lowest value of sampleCount where the distance from the estimate to the nearer of the two bounds (Upper and Lower) is smaller than 0.0005 (remember we didn’t want the 3-digit representation to change).
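For completeness, here is a minimal sketch of that search. The helper name RequiredSamples and the simple linear scan are my own additions for illustration; they are not part of my colleague’s tooling.

// Smallest sample count for which the nearer of the two Wilson bounds is
// within maxDeviation (e.g. 0.0005) of the measured average 'value'.
public static int RequiredSamples(double value, double sigma, double maxDeviation)
{
    for (int n = 1; ; n++)
    {
        double deviation = Math.Min(Upper(value, sigma, n) - value,
                                    value - Lower(value, sigma, n));
        if (deviation < maxDeviation)
            return n;
    }
}

Calling this with an average of 0.999 versus 0.5 (sigma = 1.96, maxDeviation = 0.0005) should reproduce the order-of-magnitude difference described below; a doubling-plus-bisection search would be faster, but the linear scan keeps the idea clear.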

Interestingly enough, the required number of samples depends strongly on the average itself. If the average is around 99.9% (or 0.1%) you only need ~7680 samples. At the center, so close to 50.0%, you need at least 4.2 million samples to get the same error boundaries. So unfortunately you can only do early outs for tables that are very sparsely or very densely populated. Anything ‘moderately’ filled will still require many, many samples to get reasonably accurate statistics.

Biometrics – Database sizes, scenarios and thresholds

There are many possible operational scenarios in which biometrics are involved. In this post I will attempt to give some insight into the possible scenarios and their effect on the type of biometrics required and the associated settings and procedures. The most practical use of this post is as a reference for determining whether your ‘best’ fused ROC curve is good enough for the scenario you have in mind. Use the thoughts presented here to choose a final threshold wisely.

Classification

You can think of many biometrics applications: for example, gaining access to a secure building using an iris scan, verifying at border control that the person presenting a visa is the same person to whom it was issued, or a surveillance camera at an airport looking for a known terrorist.

I propose to divide such applications into 6 categories. One element is the database size, which can be tiny, medium or large. The other element is the database purpose, either a blacklist or a whitelist.

Database sizes

The database sizes can be tiny, medium or large. A typical tiny database is a chip on a passport, the owner data of a laptop or the information on a smartcard. A medium database can be a watchlist with known terrorists or a list of employees for a company. Large databases typically will contain all people that applied for a visa through foreign embassies or all people that have been issued a certain smartcard.

White and Black

The records in a database can form either a whitelist or a blacklist. A whitelist can contain biometric data of people with certain privileges, like access to a building, or fingerprints of people who have successfully applied for a visa. Adversaries of such a system will typically try to be identified as a person on the whitelist.

A blacklist will contain people that deserve special attention from the owner of a biometrics system. For example a camera surveillance system can use a blacklist to identify criminals in shops, or alternatively a fingerprint system can check for known terrorists at border crossings. Adversaries of such a system will usually try to avoid being identified as a person on a blacklist.

Examples

As with most concepts, they are best explained through some examples.

Tiny whitelists

Tiny whitelists can be e-passports, smartcards or biometric information on a personal laptop.

Tiny blacklists

Probably the odd one out in this list because it has few realistic use-cases (keeping just a small number of people out). A possible example could be the biometrics of an operator at an enrollment station to avoid fraud.

Medium whitelists

Secure access to a company building falls into this category. Identifying people in a closed environment is also a good example: it can be known which people are inside an airport terminal, making it much easier to identify someone there.

Medium blacklists

A watchlist is a good example of this. People on such a list can be terrorists, repeat offenders or people denied access to a private club.

Large whitelists

All people that have been issued an e-passport can be registered in a database like this; a person will then still be identifiable if the e-passport is lost or becomes unreadable. People that have been issued a visa will also be in such a database.

Large blacklists

If biometrics are coupled to some entitlement or license, there should be a de-duplication process to ensure that a single person does not apply twice. Here you effectively use the entire already-enrolled population as a blacklist.

Consequence of errors

The two types of error that can be made by a biometrics system (false match and false non-match) have different implications depending on whether the system is whitelist or blacklist based. For a blacklist-based system, a false match may result in an innocent civilian being brought in for questioning. Or, in the case of some entitlement, it may be incorrectly denied, which is inconvenient if it concerns something like a food ration or a license required to perform a job. A false non-match in a blacklist system can result in a terrorist entering a country undetected, or in somebody being allocated the same resource twice (food rations, licenses).

In a whitelist scenario, a false match could give a corporate spy access to a research department, or allow an illegal immigrant to enter the country. A false non-match on a whitelist will result in disgruntled employees or unnecessary delays at a border crossing.

Whitelist errors

For a whitelist, the consequence of a false non-match (a false reject) is usually annoyance for a normal user of the system. We can therefore think of the false non-match rate (FNMR) axis of an ROC curve as the annoyance axis. The other axis, with the false match rate (FMR), is the security axis.

Blacklist errors

With blacklists, the annoyance and security axes are switched: the false match rate axis is now the annoyance axis, and the false non-match rate axis is the security axis.

Numbers

It is impossible to give exact numbers for these general categories, but I can give a few guidelines on how to determine the FMR and FNMR settings for a system.

Below, you will find a few sample database sizes associated with ‘tiny’, ‘medium’ and ‘large’.

  • Tiny: 1
  • Medium: 500
  • Large: 10^6

Now depending on the actual implementation, the tradeoff between the annoyance and security factors will be different. As an example I have listed some sample implementations here.

| Scenario | White/blacklist | Database size | Max annoyance | Max security fail | FMR | FNMR |
|---|---|---|---|---|---|---|
| Company access with credentials | White | Tiny | 1% | 10% | 2*10^-4 | 1*10^-2 |
| Company access without credentials | White | Medium | 1% | 10% | 1*10^-1 | 1*10^-2 |
| Deduplication | Black | Large | 1% | 3% | 1*10^-8 | 3*10^-2 |
| Passive watchlist | Black | Medium | 0.50% | 10% | 1*10^-5 | 1*10^-1 |
| Active watchlist | Black | Medium | 1% | 1% | 2*10^-5 | 1*10^-2 |
| Regular border access | White | Tiny | 0.10% | 1% | 1*10^-2 | 1*10^-3 |
| Visa border access | White | Large | 0.10% | 1% | 1*10^-8 | 1*10^-3 |
| Terrorist check at border | Black | Medium | 0.10% | 3% | 2*10^-6 | 3*10^-2 |

In the case of border access with a visa, we assume that this is done through an identification process. This will be the case when a visa is unreadable, lost, or when forgery is suspected.

Using this sample set of values we can construct a figure with areas indicating the maximum allowed values for the algorithm FMR and FNMR. Please note that there is a subtlety involved when calculating the algorithm FMR. Given the algorithm FMR_alg and the database size N, the observed system FMR_sys is given by FMR_sys ≈ FMR_alg * N. The reported values in the tables and figures are all the algorithm FMR_alg, which is also what gets reported as the FAR (false accept rate).
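To make that relation concrete, here is a small sketch of how the FMR and FNMR columns of the table above can be derived from the annoyance and security budgets. The method names and the boolean parameter are purely illustrative, not taken from any particular library.

// Maximum algorithm FMR: the relevant system-level budget divided by the
// database size, following FMR_sys ~= FMR_alg * N. For a whitelist the security
// budget is spent on false matches; for a blacklist the annoyance budget is.
public static double MaxAlgorithmFmr(bool isWhitelist, int databaseSize, double maxAnnoyance, double maxSecurityFail)
{
    double systemFmrBudget = isWhitelist ? maxSecurityFail : maxAnnoyance;
    return systemFmrBudget / databaseSize;
}

// Maximum FNMR: the remaining budget, independent of the database size.
public static double MaxFnmr(bool isWhitelist, double maxAnnoyance, double maxSecurityFail)
{
    return isWhitelist ? maxAnnoyance : maxSecurityFail;
}

For example, the passive watchlist row (blacklist, N = 500, max annoyance 0.5%, max security fail 10%) gives MaxAlgorithmFmr = 0.005 / 500 = 1*10^-5 and MaxFnmr = 1*10^-1, matching the table.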

[Figure: The minimum FMR/FNMR requirements for the sample scenarios. (Dedup = de-duplication, BA (Visa) = border access using a visa, BA (nrm) = normal border access, BA (terr.) = terrorist check at border, P/A. Watchlist = passive/active watchlist, CA w/w/o = company access with/without credentials.)]