Project Dejavu is my attempt to build a fast video fingerprinting system that is robust to recording artifacts such as glare, deformation, and partial obscuration. One possible application is to let users record a short 5-10 second clip of an unknown target video and then identify the original content.
Existing research into video fingerprinting focuses primarily on identifying infringing content (DMCA takedowns, etc.), where it is safe to assume that the target and original content are similar by most measures: an illegal torrent of a movie, for example, will not be too degraded (unless it is a camrip, in which case it is perfect input for this system). In addition, existing research often assumes that the target and original are of similar length and can therefore take advantage of key frame timings.
Project Dejavu makes neither assumption and is designed to handle extremely noisy content, such as a recording of a low-resolution video played on a glossy screen under direct sunlight with a very shaky hand, and other similar worst-case scenarios.
Goals
Scalability
- 40 kilobytes of storage for each hour of content
- 30 seconds to vectorize an hour of content
- 0.2 seconds to identify a clip from an hour of content
- Linear-time vectorization, logarithmic-time recognition
Robustness
- Saturation, brightness, and linear-transform invariant.
- Works indoors and outdoors, on glossy and matte screens.

- 80% of the time, the top result should be correct.
- 90% of the time, one of the top two results should be correct.
- 95% of the time, one of the top three results should be correct.
Prototype
Dataset
The original content data, a set of 63 low-resolution MP4 files that add up to approximately 2.5 hours of content, was extracted from the YouTube playlists listed below.
- https://www.youtube.com/playlist?list=PLDVdFCWgvHe4mx96dLfVupzXLqXuYnhir
- https://www.youtube.com/playlist?list=PLDVdFCWgvHe4jGbLczQlNAClpZrHsWxn1
- https://www.youtube.com/playlist?list=PLDVdFCWgvHe6mWXSqyKoWxL2NzGEwdo7y
- https://www.youtube.com/playlist?list=PLDVdFCWgvHe5LVO6sK1N5MmepBS9saruN
This data set does not belong to me and therefore should not be used for anything other than feature and model evaluation purposes.
Algorithm
Vectorization (Bordinals, Hmedians)
Brightness ordinals: start by splitting the frame into a 3x3 grid and computing the average brightness of each of the 9 cells. Assign an ordinal to each cell according to its relative brightness; stored optimally, this ranking takes less than 18.5 bits (log2 of 9!). This ordinal hash is surprisingly robust and can be used to greatly reduce the search space. Assuming a random distribution and a perfect camera, the hash would reduce the search space by a factor of 362,880 (9!); after accounting for noisy capture (blur, glare, etc.), the actual reduction is closer to a factor of 1,000, which is still surprisingly effective.
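As an illustration, the brightness-ordinal computation can be sketched like this (a minimal NumPy version using the 3x3 grid described above; the double-argsort ranking trick is my own shorthand, not necessarily how the prototype does it):

```python
import numpy as np

def brightness_ordinals(gray, grid=3):
    """Rank the cells of a grid x grid split of a grayscale frame by mean brightness."""
    h, w = gray.shape
    means = [
        gray[r * h // grid:(r + 1) * h // grid,
             c * w // grid:(c + 1) * w // grid].mean()
        for r in range(grid)
        for c in range(grid)
    ]
    # argsort of argsort turns the raw means into ordinal ranks 0..8
    return np.argsort(np.argsort(means)).tolist()
```

On a simple gradient test frame, the darkest cell receives rank 0 and the brightest rank 8, and the result is always a permutation of 0..8.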
Hue medians: using the same 9 cells, compute the median hue within each cell. The median is superior to the mean because outliers are likely; experimentation shows that the median is more robust to recording artifacts. The hue value can be quantized into 8 bins without significant loss in accuracy, which means the 9 values can be stored in 27 bits. When measuring the distance between hue values, take care to wrap the value around: assuming 0-indexing, the distance between 7 and 0 is 1, the distance between 6 and 0 is 2, and so on, due to the cyclic shape of the HSV color space.
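The wrap-around distance described here is just a circular metric over the 8 hue bins; a minimal sketch (the function name is mine):

```python
def hue_distance(a, b, bins=8):
    """Circular distance between two hue bins: 7 and 0 are 1 apart, not 7."""
    d = abs(a - b)
    return min(d, bins - d)
```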
46 bits of data per frame translates to roughly 20 kilobytes of storage for each hour of video sampled at 1 frame per second. Based on initial observations, this data is not easily compressible: there is very little redundancy, since brightness and hue are largely independent dimensions of the color space, and the large time gap between samples makes it difficult to exploit similarities between neighboring frames.
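As a sanity check, the storage figures quoted above follow directly from the per-frame bit counts (the variable names here are mine):

```python
import math

bordinal_bits = math.log2(math.factorial(9))    # ranking of 9 cells: log2(9!) ~ 18.5 bits
hue_bits = 9 * 3                                # 9 cells x 3 bits each (8 hue bins)
bits_per_frame = bordinal_bits + hue_bits       # ~45.5 bits, rounded up to 46
kb_per_hour = 3600 * bits_per_frame / 8 / 1024  # at 1 frame per second, ~20 KB
```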
Recognition (Linear scan, ANN, FFT)
Currently, the Bordinal is used to index into a subset of the original time-series search space, and then the Hmedian distance is applied in a linear scan to fine-tune the results. This operation scales linearly once the Bordinal index space is saturated: although there are initially 362,880 possible values, noisy data must be taken into account, so an index table maps similar values into the same bin, leaving roughly 1,000 values, many of which go unused since natural images occupy only a subset of the space.
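The two-stage recognition step can be sketched as a hash-table lookup followed by a linear scan. The coarse binning below is a stand-in assumption (it only remembers which cells rank in the top third), since the actual index table that merges similar Bordinals is not specified:

```python
from collections import defaultdict

def coarse_bin(bordinal):
    # Illustrative coarsening: keep only which cells rank in the top third.
    return tuple(1 if rank >= 6 else 0 for rank in bordinal)

def build_index(frames):
    """frames: list of (bordinal, hue_medians) tuples, one per sampled frame."""
    index = defaultdict(list)
    for t, (bordinal, hues) in enumerate(frames):
        index[coarse_bin(bordinal)].append((t, hues))
    return index

def query(index, bordinal, hues):
    # Bordinal lookup narrows the search; Hmedian distance refines it linearly.
    def hue_dist(a, b, bins=8):
        d = abs(a - b)
        return min(d, bins - d)
    candidates = index.get(coarse_bin(bordinal), [])
    return min(candidates,
               key=lambda cand: sum(hue_dist(x, y) for x, y in zip(cand[1], hues)),
               default=None)
```

A query only ever scans the frames whose coarse bin matches, which is what makes the scan cheap once the index bins are saturated.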
One option for improving the time complexity is approximate nearest neighbor search, but that is difficult to implement because of the non-Euclidean hue dimension. The wrap-around nature of the hue value also makes classic time-series techniques such as the Fourier transform tricky to apply correctly; creating an extra copy of the data shifted by 4 (half of the maximum hue value) appears to be a decent way of handling the wrap-around, but I remain skeptical.
Demo
Check out https://dejavu.skyi.io to see the demo in action. Just open one of the videos on your computer or TV and scan it using the web app. The mobile app only runs in Chrome, but it does work on Android devices. It should successfully recognize whatever video is playing in less than a second; if it takes longer, that's probably because my extraordinarily cheap 512 MB RAM server is busy processing other requests.
Next steps
- Consider using YUV instead of HSV, because it has a better-behaved distance function and is easier to compute.
- Consider reducing ordinal resolution to save memory; halving the number of ordinal values shouldn't cost too much accuracy.
- Try to detect the target screen and crop to the appropriate aspect ratio.
- Test different techniques for detecting and removing glare.
Prototype code
import cv2
import sys
import glob
import json
import numpy
import shutil
import joblib
import tempfile
import subprocess
def image_hash(image):
    """
    This function takes a BGR image, transforms it into the YCrCb color space,
    and resizes it to the standard 120x120 pixel image size. It returns a
    dictionary containing three feature vectors: `l_ord`, `cr_ord`, and `cb_ord`.
    """
    image = cv2.cvtColor(image, cv2.COLOR_BGR2YCrCb)
    if image.shape != (120, 120, 3):
        image = cv2.resize(image, (120, 120))

    def channel_ordinals(channel):
        # Sum each 30x30 cell of the 4x4 grid for this channel, then rank the
        # 16 cells so that the dimmest cell gets ordinal 0 and the brightest 15.
        sums = [
            numpy.sum(image[r:r + 30, c:c + 30, channel])
            for r in range(0, 120, 30)
            for c in range(0, 120, 30)
        ]
        order = [pair[1] for pair in sorted((s, i) for i, s in enumerate(sums))]
        ordinals = [0] * len(order)
        for rank, cell in enumerate(order):
            ordinals[cell] = rank
        return ordinals

    return {
        "l_ord": channel_ordinals(0),
        "cr_ord": channel_ordinals(1),
        "cb_ord": channel_ordinals(2)
    }
def processImage(image_files):
    """
    This function applies the image_hash function to each of the images specified
    in the image_files array and returns the result.
    """
    return [image_hash(cv2.imread(image)) for image in image_files]
def processVideo(video_file):
    """
    This function uses ffmpeg to decode the video and extract images at 1 fps into
    a temporary directory. It calls `processImage` on the images and returns the
    result after cleaning up the temporary files.
    """
    data_dir = tempfile.mkdtemp()
    image_file = data_dir + "/%6d.png"
    subprocess.call(["ffmpeg", "-i", video_file, "-vf", "fps=1,scale=120:120", image_file])
    vector = processImage(sorted(glob.glob(data_dir + "/*")))
    shutil.rmtree(data_dir)
    return vector
def alignVector(needle, haystack, label):
    """
    This function takes a needle and a haystack and finds the best alignment. It
    does not bother to account for time warping, as the needle is a relatively
    short clip. It does a brute-force linear scan.
    """
    if len(needle) > len(haystack):
        needle, haystack = haystack, needle
    min_diff = 1e5
    min_index = -1
    for i in range(len(haystack) - len(needle) + 1):
        diff = 0
        for j in range(len(needle)):
            for k in range(len(needle[j]['l_ord'])):
                diff += abs(haystack[i+j]['l_ord'][k] - needle[j]['l_ord'][k])
            for k in range(len(needle[j]['cr_ord'])):
                diff += abs(haystack[i+j]['cr_ord'][k] - needle[j]['cr_ord'][k])
            for k in range(len(needle[j]['cb_ord'])):
                diff += abs(haystack[i+j]['cb_ord'][k] - needle[j]['cb_ord'][k])
        diff /= len(needle)
        if diff < min_diff:
            min_index = i
            min_diff = diff
    return {
        "diff": min_diff,
        "index": min_index,
        "label": label
    }
def matchVector(needle, haystacks):
    """
    This function processes the haystacks in parallel, returning the result of
    aligning the needle with each haystack.
    """
    results = joblib.Parallel(n_jobs=-1)(
        joblib.delayed(alignVector)(needle, haystacks[key], key) for key in haystacks.keys()
    )
    results.sort(key=lambda x: x['diff'])
    return results
if __name__ == "__main__":
    if len(sys.argv) > 1:
        if sys.argv[1] == "video":
            print(json.dumps(processVideo(sys.argv[2]), indent=4))
        elif sys.argv[1] == "image":
            print(json.dumps(processImage(sys.argv[2:]), indent=4))
        elif sys.argv[1] == "match":
            needle = json.load(open(sys.argv[2], 'r'))
            haystacks = {}
            for label in sys.argv[3:]:
                haystacks[label] = json.load(open(label, 'r'))
            print(json.dumps(matchVector(needle, haystacks), indent=4))
    else:
        print("Usage:")
        print("  d2v.py video video.mp4 > output.djv")
        print("  d2v.py image 1.png 2.png > output.djv")
        print("  d2v.py match mystery.djv 1.djv 2.djv ...")
        print("")