Statistics – When to stop

A colleague of mine had an interesting problem. He was trying to calculate fill rates for tables in a database, but the only way to access the records was one by one. Given that it was a pretty large database, getting the fill rates was quite slow. Because the fill rates are only displayed with three-digit precision in his application (e.g. 94.3%), he was looking for an early out: a way to stop sampling once it was unlikely that a specific value would still change.

This rang a bell because it was pretty similar to a problem I solved some years ago: given the average of some binary variable and the number of samples, what is the confidence interval? As usual, these types of problems have been solved and beaten to death; the actual problem is usually just finding the relevant solution. For this problem a good solution is the Wilson score interval. It gives you a relation between the average of a binomial variable, the number of samples and the required confidence level. The results are the upper and lower boundary of the confidence interval.

Wilson confidence interval (image taken from Wikipedia)
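
For reference, written out in LaTeX notation (with \hat{p} the observed fraction, z the z-score and n the number of samples, exactly the quantities used in the code below), the two bounds are:

\frac{\hat{p} + \frac{z^2}{2n} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}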

While a bit intimidating, the above is easily expressed in a few lines of code

public static double Lower(double value, double sigma, int sampleCount)
{
    // Lower bound of the Wilson score interval.
    // value: observed fraction (phat), sigma: z-score, sampleCount: number of samples.
    double n = (double)sampleCount;
    double z = sigma;
    double phat = value;
    return (phat + z * z / (2 * n) - z * Math.Sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n);
}

public static double Upper(double value, double sigma, int sampleCount)
{
    // Upper bound of the Wilson score interval (identical except for the sign of the square root term).
    double n = (double)sampleCount;
    double z = sigma;
    double phat = value;
    return (phat + z * z / (2 * n) + z * Math.Sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n);
}

Note that sigma is the z-score (the number of standard deviations of a Gaussian); a sane choice is 1.96, which corresponds to a 95% confidence interval.

Using that, we can iteratively search for the lowest value of sampleCount where the minimum of (Upper − value) and (value − Lower) is smaller than 0.0005 (remember we didn’t want the 3-digit representation to change).
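
A minimal sketch of that search, assuming the Lower and Upper helpers above (the function name and the simple linear scan are just illustrative; a binary search would be faster):

public static int SamplesNeeded(double value, double sigma, double tolerance)
{
    // Increase the sample count until the nearer Wilson bound is within
    // 'tolerance' (e.g. 0.0005) of the observed value.
    for (int n = 1; n < int.MaxValue; n++)
    {
        double belowValue = value - Lower(value, sigma, n);
        double aboveValue = Upper(value, sigma, n) - value;
        if (Math.Min(belowValue, aboveValue) < tolerance)
            return n;
    }
    return int.MaxValue; // not reached for sane inputs
}

// Example: SamplesNeeded(0.999, 1.96, 0.0005) for a 99.9% fill rate at 95% confidence.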

Interestingly enough, the required sample count is highly dependent on the average itself. If the average is around 99.9% (or 0.1%) you only need ~7680 samples. At the center, close to 50.0%, you need at least 4.2 million samples to get the same error boundaries. So unfortunately you can only do early outs for tables that are very sparsely or very densely populated. Anything ‘moderately’ filled will still require many, many samples to get reasonably accurate statistics.

Biometrics – Database sizes, scenarios and thresholds

There are many possible operational scenarios where biometrics are involved. In this post I will attempt to give some insight into the possible scenarios and their effect on the type of biometrics required and the associated settings and procedures. The most practical use of this post is as a reference for determining whether your ‘best’ fused ROC curve is good enough for the scenario you have in mind. Use the thoughts presented here to choose a final threshold wisely.

Classification

You can think of many biometrics applications: for example, gaining access to a secure building using an iris scan, verifying at border control that a visa holder is the same person to whom the visa was issued, or a surveillance camera at an airport looking for a known terrorist.

I propose to divide such applications into 6 categories. One element is the database size, which can be tiny, medium or large. The other element is the database purpose, either a blacklist or a whitelist.

Database sizes

The database sizes can be tiny, medium or large. A typical tiny database is a chip on a passport, the owner data of a laptop or the information on a smartcard. A medium database can be a watchlist of known terrorists or a list of employees of a company. Large databases will typically contain all people that applied for a visa through foreign embassies or all people that have been issued a certain smartcard.

White and Black

The records in a database can form either a whitelist or a blacklist. A whitelist can contain biometric data of people with certain privileges, like access to a building, or fingerprints of people who have successfully applied for a visa. Adversaries of such a system will typically try to be identified as a person on a whitelist.

A blacklist will contain people that deserve special attention from the owner of a biometrics system. For example a camera surveillance system can use a blacklist to identify criminals in shops, or alternatively a fingerprint system can check for known terrorists at border crossings. Adversaries of such a system will usually try to avoid being identified as a person on a blacklist.

Examples

As with most concepts, they are best explained through some examples.

Tiny whitelists

Tiny whitelists can be e-passports, smartcards or biometric information on a personal laptop.

Tiny blacklists

Probably the odd one out in this list because it has few realistic use-cases (keeping just a small number of people out). A possible example could be the biometrics of an operator at an enrollment station to avoid fraud.

Medium whitelists

Secure access to a company building falls into this category. Identifying people in a closed environment is also a good example: it can be known which people are in an airport terminal, making it much easier to identify someone in there.

Medium blacklists

A watchlist is a good example of this. People on such a list can be terrorists, repeat offenders or people denied access to a private club.

Large whitelists

All people that have been issued an e-passport can be registered in a database like this; a person will then still be identifiable if the e-passport is lost or becomes unreadable. People that have been issued a visa will also be in such a database.

Large blacklists

If biometrics are coupled to some entitlement or license, there should be a de-duplication process to ensure that a single person does not apply twice. Here you effectively use the entire already-enrolled population as a blacklist.

Consequence of errors

The two types of error that can be made by a biometrics system (false match and false non-match) have different implications depending on whether the system is a whitelist or a blacklist based system. For a blacklist based system a false match may result in an innocent civilian being brought in for questioning. Or in case of some entitlement, it may be incorrectly denied, which is inconvenient if it concerns something like a food ration or some license required to perform a job. A false non-match in a blacklist system can result in a terrorist entering a country undetected, or somebody being allocated the same resource twice (food rations, licenses).

In a whitelist scenario, a false match could give a corporate spy access to a research department, or allow an illegal immigrant to enter the country. A false non-match on a whitelist will result in disgruntled employees or unnecessary delays at a border crossing.

Whitelist errors

For a whitelist, the consequence of a false reject will usually annoy a normal user of the system. We can therefore think of the false non-match rate (FNMR) axis of an ROC curve as the annoyance axis. The other axis with the false match rate (FMR) is the security axis.

Blacklist errors

With blacklists, the annoyance and security axes are switched: the false match rate axis is now the annoyance axis, and the false non-match rate axis is the security axis.

Numbers

It is impossible to give exact numbers for these general categories, but I can give a few guidelines on how to determine the FMR and FNMR settings for a system.

Below, you will find a few sample database sizes associated with ‘tiny’, ‘medium’ and ‘large’.

  • Tiny: 1
  • Medium: 500
  • Large: 10^6

Now depending on the actual implementation, the tradeoff between the annoyance and security factors will be different. As an example I have listed some sample implementations here.

Scenario                           | White/blacklist | Database size | Max annoyance | Max security fail | FMR     | FNMR
Company access with credentials    | White           | Tiny          | 1%            | 10%               | 1*10^-1 | 1*10^-2
Company access without credentials | White           | Medium        | 1%            | 10%               | 2*10^-4 | 1*10^-2
De-duplication                     | Black           | Large         | 1%            | 3%                | 1*10^-8 | 3*10^-2
Passive watchlist                  | Black           | Medium        | 0.50%         | 10%               | 1*10^-5 | 1*10^-1
Active watchlist                   | Black           | Medium        | 1%            | 1%                | 2*10^-5 | 1*10^-2
Regular border access              | White           | Tiny          | 0.10%         | 1%                | 1*10^-2 | 1*10^-3
Visa border access                 | White           | Large         | 0.10%         | 1%                | 1*10^-8 | 1*10^-3
Terrorist check at border          | Black           | Medium        | 0.10%         | 3%                | 2*10^-6 | 3*10^-2

In the case of border access with a visa, we assume that this is done through an identification process. This will be the case when a visa is unreadable, lost, or when forgery is suspected.

Using this sample set of values we can construct a figure with areas indicating the maximum allowed values for the algorithm FMR and FNMR. Please note that there is a subtlety involved when calculating the algorithm FMR. Given the algorithm FMR_alg and the database size N, the observed system FMR_sys is given by FMR_sys ≈ FMR_alg * N. The reported values in the tables and figures are all the algorithm FMR_alg, which is also reported as the FAR (false accept rate).
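
As a small illustration of how the table values follow from the scenario requirements, a sketch along these lines could be used (the function and parameter names are made up for this example):

// Sketch: derive algorithm-level FMR/FNMR targets from scenario requirements.
// For a whitelist the FMR is the security axis and the FNMR the annoyance axis;
// for a blacklist the roles are swapped. Only the false match rate scales with
// the database size (FMRsys ~= FMRalg * N); the genuine comparison is against a
// single record, so the FNMR is not scaled.
public static void AlgorithmTargets(bool isWhitelist, int databaseSize,
    double maxAnnoyance, double maxSecurityFail,
    out double fmrAlg, out double fnmrAlg)
{
    double fmrSys = isWhitelist ? maxSecurityFail : maxAnnoyance;
    double fnmrSys = isWhitelist ? maxAnnoyance : maxSecurityFail;
    fmrAlg = fmrSys / databaseSize;
    fnmrAlg = fnmrSys;
}

// Example: visa border access (whitelist, N = 10^6, 0.1% annoyance, 1% security fail)
// gives fmrAlg = 0.01 / 10^6 = 1*10^-8 and fnmrAlg = 1*10^-3, matching the table above.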

The minimum FMR/FNMR requirements for the sample scenarios. (Dedup = De-duplication, BA (Visa) = Border Access using visa, BA (nrm) = Normal border access, BA (terr.) = Terrorist check at border, P/A. Watchlist = Passive/Active watchlist and CA w/w/o = Company access with/without credentials)

Copy protection schemes in a professional environment

The subject of copy protection, be it for media or software, in the consumer market is pretty much a dead horse everybody still loves to beat. Extreme examples of copy protection gone wrong are plentiful, from the technical level (e.g. Sony installing rootkits) to law enforcement; just look at the stifling effect the DMCA has had on security research in general.

Of course, from a software or media owner point of view it is understandable to introduce these schemes. People copying your stuff costs at least some money, which is never nice.

Now another angle is the professional software market, especially for smaller companies and niche products. The same problems can theoretically appear: say a small startup builds some great tool and licenses it for a small amount of money as an evaluation to a large company, who then never buys a full license and just copies the cheaper product ad infinitum. The main difference between this case and the consumer case is the agreements made between the parties. Where a consumer doesn’t sign any legal documents when buying a new CD or some software, companies spend much effort drafting and signing legal agreements that prevent just the above from happening. In the example above, the smaller company can simply start a legal procedure for breach of contract and all will be well (in theory).

Now for some reason this doesn’t alleviate the urge to add complex copy protection measures to new software products. Being involved with the integration of many, many 3rd-party pieces of software into our software platform, I have pretty much seen it all: some of them work OK, some of them are utter failures. Roughly speaking there are two groups of copy protection: those that work and you forget about them, and those that sometimes work and constantly remind you of their existence. In practice, the second group is unfortunately larger than you’d hope for.

Starting with some ‘nice’ schemes, my favorite one is still the time-limited scheme, specifically the variant that doesn’t bother checking whether you mess with your clock. You install the software, put a reminder in your calendar a few days before expiry and you’re done; it just works. Why do I mention the clock? Sometimes our NTP server has issues and I want to fix the clock manually, and even changing the time by a few minutes will typically permanently break the license for my machine, so I have to open a support ticket and spend a few hours to days without the license, all because I didn’t want to be late for lunch again.

A step up in complexity, but still manageable, is the MAC-address-based scheme. You send your MAC address to the vendor, and you get some license back that works with that MAC. Nothing spectacular, and luckily this usually keeps working if you change your network settings because you have to run some VMs or connect to a VPN using weird software. The reason I like this less than the time-limited scheme is that my machine can (and will) break every 2-3 years, and it will usually take close to a week to re-license all the bits and pieces of software. Also, the little thing called cloud software is becoming popular, and typically this will break copy-protection software as well.

Another step up and we start running into the realm of broken-by-design software. Let’s start with the online schemes, where you license your machine with some online service. You install some software which calculates a hardware ID, it negotiates with the remote server, and if you’re lucky you can now use the software. That is, if the license service is online, your sysadmin isn’t performing maintenance on the external connection (which of course should never take more than a few seconds of downtime but somehow always does), and you’re actually allowed to connect to the outside world using some proprietary protocol on a non-standard port. The only benefit is that this is pretty much the only scheme that could apply to cloud deployments, if the vendor supports this; other than that, it sucks.

Then we have the 3rd-party copy protection schemes, where the vendor has bought a ‘solution’ to enforce some copy protection scheme; the most well known is probably HASP. The vendor ticks a few boxes (usually all of them) and the HASP software will disallow usage of their product in many if not most cases. Because the ticks include features like ‘disallow remote desktop’, you usually end up with software which you can’t work on from home, and troubleshooting on a server is pretty much impossible (if your problems are about more than the HASP software itself). Because many products are SDKs, the likelihood of somebody attaching a debugger to the software is pretty high (I’m one of those developers that prefers a debugger over printf debugging), but of course ‘don’t allow attached debuggers’ is a common tick box; joy.

Last we have the category of home-brew protection software; here all bets are off. The funniest behavior (for some definition of funny) I have seen is a scheme where the copy protection code is so buggy that it corrupts its own license file after some use. Another fun example is software that would do an internal check every 90 seconds and on some (AMD) processors would randomly fail half the time, so your software would work for 90 seconds, then break for 90, work for 90, and so on.

And why all this fuss? We do have to sign that we won’t leak/share/use documentation, source or binaries outside of the contract agreements, and we’re trusted not to do that, but somehow the software itself is almost always wrapped in another layer of poop to actually prevent us from using it. At least in my company we won’t even think of using unlicensed software; it can be costly for the company and personally it can mean losing your job, so why would I bother risking that? It just adds frustration and lots of unnecessary work.

The Joys of Open Source

On the job we use a large mixture of technologies, from open source JavaScript libraries found on GitHub to expensive professional development IDEs like Microsoft Visual Studio. In the past (a little over a year ago), all our builds were done through custom build scripts; we used Ant here and there, but most of the plumbing was done through a set of well-maintained shell scripts. Last year we decided to standardize on fancier build chains: use Maven for the build process itself wherever possible, use Jenkins on the build server and store all artifacts in Nexus.

So far so good: we managed to get pretty much all of our projects ported to this new build chain (we still have some oddball projects on the todo list). Things that are supported out of the box (Java stuff) work like a charm: create some simple POMs, add a few Jenkins jobs and you’re done. Integrating other projects (C++ projects that have to be built with Visual Studio on Windows and gcc on Linux) is a bit more problematic. For some reason there doesn’t seem to be a simple way to do this, but we managed to get things to work by building larger, more elaborate POMs. (More POMs is better, right?)

In our own development cycle we take the slow-and-stable approach, which has worked well for our company over the years. But now that we use more and more things that depend on fancy open source components, we keep hitting small speed bumps that seem to be the result of the ‘move fast and break stuff’ mentality evangelized by various communities. Over the past month we had release builds fail because some library stuffed deep into the dependency hierarchy made breaking changes, all Jenkins jobs started failing today with uninformative errors because it was deemed necessary to support >8GB files (breaking other sizes), and we have to manually restart various bits and pieces of the build environment on a regular basis because automatic updates break stuff all the time.

So up till now the experience has been mixed: being able to build your entire project by clicking a little ‘build’ button is awesome; spending hours on wild goose chases because somebody pushed untested software to all users is a little less awesome.