Project Dejavu is my attempt to build a fast video fingerprinting system that is robust to recording artifacts such as glare, deformations, and partial obscuration. One possible application for this system is to allow users to record a short 5-10 second clip of some unknown target video and then identify the original content.

Existing research into video fingerprinting primarily focuses on identifying illegal content (DMCA takedowns, etc.) where it is safe to assume that the target content and the original content are similar by most measures - for example, an illegal torrent of a movie will not be too degraded (unless it is a cam rip, in which case it is a perfect candidate for this system). In addition, existing research often assumes that the target and the original are of similar length and therefore takes advantage of key frame timings.

Project Dejavu makes neither assumption and is designed to handle extremely noisy content such as recordings of a low-resolution video played on a glossy screen under direct sunlight with a very shaky hand and other similar worst-case scenarios.

Goals

Scalability
  • 40 kilobytes for each hour of content
  • 30 seconds to vectorize an hour of content
  • 0.2 seconds to identify a clip from an hour of content
  • Linear time vectorization, logarithmic time recognition.
Robustness
  • Saturation, brightness, linear-transform invariant.
  • Works indoors and outdoors, on glossy and matte screens.
  • 80% of the time, the top result should be correct.
  • 90% of the time, one of the top two results should be correct.
  • 95% of the time, one of the top three results should be correct.

Prototype

[Figure: video_management]

[Figure: prototype-1]

Dataset

The original content data - a set of 63 low-resolution MP4 files that add up to approximately 2.5 hours of content - was extracted from the YouTube playlists listed below.

This data set does not belong to me and therefore should not be used for anything other than feature/model evaluation purposes.

Algorithm

Vectorization (B-ordinals, H-medians)

Brightness ordinals - start by splitting the frame using a 3x3 grid and computing the average brightness for each of the 9 cells. Assign an ordinal to each cell according to its relative brightness - stored optimally, this permutation takes just under 18.5 bits (log2 of 9! = 362,880 is about 18.47). This ordinal hash is surprisingly robust and can be used to greatly reduce the search space. Assuming a random distribution and a perfect camera, this hash would reduce the search space to 1/362,880 of its original size - after taking noisy capture (blurs, glare, etc.) into account, the actual reduction is closer to 1/1,000, which is still surprisingly effective.
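
As an illustration, here is a minimal sketch of the brightness ordinal computation on a 3x3 grid, assuming the frame has already been converted to grayscale; the function and variable names are illustrative rather than the prototype's (the prototype listing below works on a 4x4 grid in YCrCb instead).

import numpy

def brightness_ordinals(gray, grid=3):
    # Average brightness of each cell in a grid x grid split of the frame.
    h, w = gray.shape
    cells = [
        numpy.mean(gray[r * h // grid:(r + 1) * h // grid,
                        c * w // grid:(c + 1) * w // grid])
        for r in range(grid)
        for c in range(grid)
    ]
    # Rank each cell by its average brightness; the result is a permutation of 0..8.
    order = sorted(range(len(cells)), key=lambda i: cells[i])
    ordinals = [0] * len(cells)
    for rank, cell in enumerate(order):
        ordinals[cell] = rank
    return ordinals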

Hue medians - using the same 9 cells, compute the median hue of each cell. The median is preferred over the mean because there is a high likelihood of outliers - experimentation shows that the median is more robust to recording artifacts. The hue value can be quantized into 8 bins without significant loss of accuracy, so the 9 medians can be stored in 27 bits. When measuring the distance between hue values, take care to let the value wrap around - assuming 0-indexing, the distance between bins 7 and 0 is 1, the distance between bins 6 and 0 is 2, and so on, because hue is an angle in the HSV color space.
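
A rough sketch of the H-median feature and its wrap-around distance, assuming OpenCV's 8-bit HSV representation (hue in the range 0-179) and the same 3x3 grid; the names and the exact binning formula are mine, not necessarily the prototype's.

import numpy

def hue_medians(hsv, grid=3, bins=8):
    # Median hue of each grid cell, quantized into `bins` bins (3 bits when bins=8).
    h, w = hsv.shape[:2]
    medians = []
    for r in range(grid):
        for c in range(grid):
            cell = hsv[r * h // grid:(r + 1) * h // grid,
                       c * w // grid:(c + 1) * w // grid, 0]
            # Median rather than mean: glare and blur produce outliers.
            medians.append(int(numpy.median(cell) * bins // 180))
    return medians

def hue_distance(a, b, bins=8):
    # Hue wraps around, so the distance between bins 7 and 0 is 1, not 7.
    d = abs(a - b) % bins
    return min(d, bins - d)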

46 bits of data per frame translates to roughly 20 kilobytes of storage for each hour of video sampled at 1 frame per second (46 bits x 3,600 frames ≈ 20.7 kilobytes). Based on initial observations, this data is not easily compressible as there is very little redundancy - brightness and hue are orthogonal dimensions of the color space, and the large time gap between samples makes it difficult to take advantage of similarities between neighboring frames.
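
To make the storage figure concrete, one possible packing stores the B-ordinal permutation as its lexicographic rank (a value below 9! = 362,880, i.e. just under 18.5 bits) followed by the nine 3-bit hue bins. This is a sketch of that idea, not the prototype's actual on-disk format.

import math

def permutation_rank(perm):
    # Lexicographic rank of a permutation of 0..8, in the range [0, 362880).
    rank = 0
    n = len(perm)
    for i, v in enumerate(perm):
        smaller = sum(1 for u in perm[i + 1:] if u < v)
        rank += smaller * math.factorial(n - 1 - i)
    return rank

def pack_frame(b_ordinals, hue_bins):
    # ~18.5 bits for the brightness ordinals plus 9 x 3 bits for the hue bins
    # is roughly 46 bits per frame; at 1 fps that is about 20 KB per hour.
    code = permutation_rank(b_ordinals)
    for b in hue_bins:
        code = (code << 3) | (b & 0x7)
    return code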

Recognition (Linear scan, ANN, FFT)

Currently, the B-ordinal is used to index into a subset of the original time-series search space, and the H-median distance is then applied in a linear scan to refine the results. This operation scales linearly once the B-ordinal index space is saturated: although there are initially 362,880 possible values, noisy data needs to be taken into account, so an index table maps similar values into the same bin, resulting in roughly 1,000 bins - many of which go unused, since natural images occupy only a subset of the space.
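
Here is a rough sketch of that two-stage lookup, assuming per-frame records that carry a quantized B-ordinal bucket ("b_ord") and the nine hue bins ("h_med"); the field names and bucketing are illustrative, not the prototype's actual schema.

from collections import defaultdict

def build_index(frames):
    # Map each coarse B-ordinal bucket to the frame offsets where it occurs.
    index = defaultdict(list)
    for t, frame in enumerate(frames):
        index[frame["b_ord"]].append(t)
    return index

def hue_distance(a, b, bins=8):
    # Circular distance between two 3-bit hue bins (see the earlier sketch).
    d = abs(a - b) % bins
    return min(d, bins - d)

def candidate_frames(query, frames, index):
    # Stage 1: restrict the search to frames sharing the query's B-ordinal bucket.
    # Stage 2: rank the survivors by total H-median distance using a linear scan.
    scored = []
    for t in index.get(query["b_ord"], []):
        dist = sum(hue_distance(a, b)
                   for a, b in zip(query["h_med"], frames[t]["h_med"]))
        scored.append((dist, t))
    return sorted(scored)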

One option for improving the time complexity is to use approximate nearest neighbor search, but that is difficult to implement due to the non-Euclidean hue dimension. The wrap-around nature of the hue value also makes classic time series techniques such as the Fourier transform tricky to implement correctly - it appears that creating an extra copy of the data shifted by 4 (half of the maximum hue value) is a decent way of handling the wrap-around, but I remain skeptical.
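
For what it is worth, the pointwise identity behind the shifted-copy idea does check out: with 8 hue bins, the circular distance equals the smaller of the plain distance and the distance computed on a copy shifted by 4. Whether this carries over cleanly to an FFT-based sliding comparison is a separate question. A quick check:

def circular_distance(a, b, bins=8):
    d = abs(a - b)
    return min(d, bins - d)

def shifted_distance(a, b, bins=8):
    # Distance measured on a copy of the data shifted by half the hue range.
    half = bins // 2
    return abs((a + half) % bins - (b + half) % bins)

# Holds for every pair of 3-bit hue bins.
assert all(
    circular_distance(a, b) == min(abs(a - b), shifted_distance(a, b))
    for a in range(8) for b in range(8)
)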

Demo

Check out https://dejavu.skyi.io to see the demo in action. Just open one of the videos on your computer or TV and scan it using the web app. The mobile app only runs in Chrome, but it does work on Android devices. It should successfully recognize whatever video is playing in less than a second - if it takes longer, that's probably because my extraordinarily cheap 512 MB RAM server is busy processing other requests.

[Figure: video_capture_results]

Next steps

  • Consider using YUV instead of HSV because it has a better behaved distance function and is easier to compute.
  • Consider reducing ordinal resolution to save memory - halving the number of ordinal values shouldn't result in too much loss in accuracy.
  • Try to detect the target screen and crop to the appropriate aspect ratio.
  • Test different techniques for detecting and removing glare.

Prototype

import cv2
import sys
import glob
import json
import numpy
import shutil
import joblib
import tempfile
import subprocess

def image_hash (image):
    """
    This function takes a BGR image, converts it into the YCrCb color space, and
    resizes it to the standard 120x120 pixel size. It returns a dictionary
    containing three feature vectors: `l_ord`, `cr_ord`, and `cb_ord`.
    """
    # cvtColor returns the converted image rather than modifying it in place,
    # so the result must be assigned back (COLOR_BGR2YCrCb is the OpenCV 3+ constant).
    image = cv2.cvtColor(image, cv2.COLOR_BGR2YCrCb)
    if image.shape != (120, 120, 3):
        image = cv2.resize(image, (120, 120))

    def channel_ordinals(channel):
        # Sum the given channel over each cell of a 4x4 grid of 30x30 blocks
        # (row-major order), then rank the cells by their sums.
        sums = [
            numpy.sum(image[row:row + 30, col:col + 30, channel])
            for row in range(0, 120, 30)
            for col in range(0, 120, 30)
        ]
        order = [cell for _, cell in sorted((s, i) for i, s in enumerate(sums))]
        ordinals = [0] * len(order)
        for rank, cell in enumerate(order):
            ordinals[cell] = rank
        return ordinals

    l_ord = channel_ordinals(0)   # luma
    cr_ord = channel_ordinals(1)  # red-difference chroma
    cb_ord = channel_ordinals(2)  # blue-difference chroma

    return {
        "l_ord": l_ord,
        "cr_ord": cr_ord,
        "cb_ord": cb_ord
    }

def processImage (image_files):
    """
    This function applies the image_hash function to each of the images specified
    in the image_files array and returns the result.
    """
    return [image_hash(cv2.imread(image)) for image in image_files]

def processVideo (video_file):
    """
    This function uses ffmpeg to decode the video and extract images at 1 fps into
    a temporary directory. It calls `processImage` on the images and returns the 
    result after cleaning up the temporary files.
    """
    data_dir = tempfile.mkdtemp()
    image_file = data_dir + "/%06d.png"
    subprocess.call(["ffmpeg", "-i", video_file, "-vf", "fps=1,scale=120:120", image_file])
    vector = processImage(sorted(glob.glob(data_dir + "/*")))
    shutil.rmtree(data_dir)
    return vector

def alignVector (needle, haystack, label):
    """
    This function takes a needle and a haystack and finds the best alignment. It doesn't
    bother to account for time warping as the needle is a relatively short clip. It does
    a brute-force linear scan.
    """
    if len(needle) > len(haystack):
        needle, haystack = haystack, needle
    min_diff = 1e5
    min_index = -1
    for i in range(len(haystack) - len(needle) + 1):
        diff = 0
        for j in range(len(needle)):
            for k in range(len(needle[j]['l_ord'])):
                diff += abs(haystack[i+j]['l_ord'][k] - needle[j]['l_ord'][k])
            for k in range(len(needle[j]['cr_ord'])):
                diff += abs(haystack[i+j]['cr_ord'][k] - needle[j]['cr_ord'][k])
            for k in range(len(needle[j]['cb_ord'])):
                diff += abs(haystack[i+j]['cb_ord'][k] - needle[j]['cb_ord'][k])
        diff /= len(needle)
        if diff < min_diff:
            min_index = i
            min_diff = diff
    return {
        "diff": min_diff,
        "index": min_index,
        "label": label
    }

def matchVector (needle, haystacks):
    """
    This function processes the haystacks in parallel, returning the result of aligning
    the needle with each haystack.
    """
    results = joblib.Parallel(n_jobs=-1)(
        joblib.delayed(alignVector)(needle, haystacks[key], key) for key in haystacks.keys()
    )
    results.sort(key=lambda x: x['diff'])
    return results

if __name__ == "__main__":
    if len(sys.argv) > 1:
        if sys.argv[1] == "--video":
            print(json.dumps(processVideo(sys.argv[2]), indent=4))
        elif sys.argv[1] == "--image":
            print(json.dumps(processImage(sys.argv[2:]), indent=4))
        elif sys.argv[1] == "--match":
            needle = json.load(open(sys.argv[2], 'rb'))
            haystacks = {}
            for label in sys.argv[3:]:
                haystacks[label] = json.load(open(label, 'rb'))
            print(json.dumps(matchVector(needle, haystacks), indent=4))
    else:
        print("Usage:")
        print("  d2v.py --video video.mp4 > output.djv")
        print("  d2v.py --image 1.png 2.png > output.djv")
        print("  d2v.py --match mystery.djv 1.djv 2.djv ...")
        print("")