Project overview

A cutting-edge AI-driven computer vision solution that extracts all unique faces, objects, and vehicles from camera footage, provides specialized video editing tools, and produces detailed reports with metadata for every video frame.

Client: a Fortune 500 corporation with thousands of technology patents and dozens of subsidiaries around the world.

Case highlights

  • 80.7% person identification accuracy during real-time tracking
  • Over 95% object detection accuracy with unstable footage from body and in-vehicle cameras
  • Highly performant, real-time full HD frame processing during object classification and tracking
  • Smart logic that detects up to 400% more object instances than most competing products on the market
  • Ultra-fast 30 fps HD-quality footage processing during video redaction
  • Ability to detect virtually any object of 20px in size or more in the provided footage
  • A dramatic boost in police officer productivity, with video processing time cut by an average of 98.67%
  • Intelligent semi-automatic mode identifies all the dubious video segments and provides police officers with actionable hints for 100% object detection accuracy
Industry
Public safety, Electronics
Delivery Model
Scope-driven milestone-based development
Effort and Duration
4 months, 16 man-months
Technologies
Python, C++, TensorFlow, OpenCV, Windows Media Foundation, CUDA, cuDNN, IMFMediaEngine, JavaScript, WebAssembly, Emscripten

Business challenge

The client’s objective was to develop a comprehensive computer vision-driven solution designed around the specific needs and requirements of the police force. Considering the sphere of application, the task posed several challenges:

  • ingestion and processing of video from body-worn and in-vehicle cameras, including live feeds, shaky footage, and footage filmed in adverse environmental conditions;
  • utmost accuracy of face and object detection and identification;
  • multi-faceted search in the video library, e.g. by race, gender, clothing, headgear, tattoos, behavior, and more;
  • reliable evidence redaction tool capable of blurring a certain face, object, or vehicle from every single frame of a video.

Overall, the project aimed at helping the police minimize the time and effort spent on filtering and manually correcting video evidence during investigations and court proceedings.

Assembling a highly competent team

Oxagile provided an experienced, well-balanced team, including a deep learning engineer, computational mathematics experts, a data analysis expert, and systems integration specialists. The team was perfectly suited to take on the project’s challenges and managed to achieve great levels of productivity within a limited timeframe.

Oxagile’s robust R&D department was involved at critical stages to ensure that optimal technical solutions were found to emerging problems.

Delivered solution

The client received a powerful AI-driven computer vision platform designed to ingest camera footage in order to detect, identify, and track faces, objects, and vehicles.

The solution effectively uses filtration and business logic, such as timeline dependencies, to reduce error in complex scenarios like partially obscured faces or vehicles moving in snowfall.

Oxagile’s team successfully implemented the evidence redaction feature that allows users to have any face or object of their choice blurred in every frame of the video. This process is critical for witness protection in court and was previously done by hand, which consumed a lot of time and increased the risk of human error.
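To illustrate what redacting one tracked entity across a video involves, here is a minimal pure-Python sketch. It assumes grayscale frames stored as lists of rows and uses pixelation (block averaging) as the blur. The helper names `blur_region` and `redact` are hypothetical, and the blurring method used in the delivered C++ application is not described in this case study.

```python
def blur_region(frame, box, block=4):
    """Return a copy of `frame` with the `box` region pixelated.

    frame: list of rows of ints (grayscale pixels).
    box:   (x0, y0, x1, y1) with exclusive right and bottom edges.
    Hypothetical helper for illustration only.
    """
    x0, y0, x1, y1 = box
    out = [row[:] for row in frame]  # leave the input frame untouched
    for by in range(y0, y1, block):
        for bx in range(x0, x1, block):
            ys = range(by, min(by + block, y1))
            xs = range(bx, min(bx + block, x1))
            pixels = [frame[y][x] for y in ys for x in xs]
            mean = sum(pixels) // len(pixels)
            # Replace every pixel in the block with the block average.
            for y in ys:
                for x in xs:
                    out[y][x] = mean
    return out


def redact(frames, boxes_per_frame, block=4):
    """Blur the tracked box in every frame where the entity is visible.

    boxes_per_frame holds one box per frame, or None when the entity
    is not present in that frame.
    """
    return [
        blur_region(frame, box, block) if box is not None else frame
        for frame, box in zip(frames, boxes_per_frame)
    ]
```

The key point is that the blur is driven by a per-frame list of boxes, so the quality of the redaction depends directly on how completely the entity was tracked.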

At the ingestion stage, the video file is decoded and presented as a set of frames. Then, advanced pre-processing algorithms are used to fix the fish-eye distortion of body-worn cameras.
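The case study does not say which lens model or correction algorithm is used. As a hedged illustration of the idea, the sketch below assumes the common equidistant fisheye model, where a ray at angle theta from the optical axis lands at radius r_d = f * theta, and maps each point to the radius a rectilinear (pinhole) camera would give, r_u = f * tan(theta). The function name `undistort_point` and the parameters `cx`, `cy`, and `f` are assumptions made for this example.

```python
import math


def undistort_point(x, y, cx, cy, f):
    """Map a pixel from an equidistant fisheye image to rectilinear coordinates.

    (cx, cy) is the assumed distortion center and f the focal length in pixels.
    Assumes r_d = f * theta; the corrected radius is r_u = f * tan(theta).
    """
    dx, dy = x - cx, y - cy
    r_d = math.hypot(dx, dy)
    if r_d == 0:
        return (float(cx), float(cy))  # the center is not displaced
    theta = r_d / f
    if theta >= math.pi / 2:
        raise ValueError("point lies outside the 180-degree field of view")
    scale = f * math.tan(theta) / r_d
    return (cx + dx * scale, cy + dy * scale)
```

Points near the center barely move, while points near the edge are pushed outward, which is what straightens the curved lines typical of body-worn camera footage.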

The solution relies on neural networks to find required entities in every frame, detect people’s poses, and locate the vehicles’ license plates. With all objects of interest discovered, the system is able to track them across a group of frames.
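One common way to carry detections from frame to frame is to match each known track to the new detection whose bounding box overlaps it most. The sketch below shows that idea with intersection-over-union (IoU) and greedy matching. It is an illustrative assumption, not the tracking method of the delivered system, and the names `iou`, `associate`, and `min_iou` are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, min_iou=0.3):
    """Greedily match track ids to detection indices by descending IoU.

    tracks:     dict mapping track id -> last known box.
    detections: list of boxes found in the current frame.
    Returns a dict mapping track id -> index into `detections`.
    """
    pairs = sorted(
        (
            (iou(box, det), track_id, det_index)
            for track_id, box in tracks.items()
            for det_index, det in enumerate(detections)
        ),
        key=lambda pair: pair[0],
        reverse=True,
    )
    matches, used = {}, set()
    for score, track_id, det_index in pairs:
        if score < min_iou:
            break  # remaining pairs overlap too little to be the same object
        if track_id in matches or det_index in used:
            continue
        matches[track_id] = det_index
        used.add(det_index)
    return matches
```

Tracks left without a match are the cases where recovery logic has to take over.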

Finally, the system produces a report with extensive metadata regarding each entity and its appearance in every frame (e.g. a thumbnail of every detected vehicle, its license plate, color, body style, and behavior on the road).

Intelligent Custom Logic for Near-100% Detection Accuracy

The solution’s custom logic combines four types of video analysis. Used together, they achieve close to 100% detection accuracy in a variety of situations.

When a previously captured entity suddenly disappears from view, the system locates it in the following frames and linearly approximates its position in all the frames in between. This approximation algorithm is designed to greatly increase detection speed and to improve accuracy in situations where the deep learning approach fails or is too computationally expensive.
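The in-between approximation described above can be sketched as plain linear interpolation of bounding-box coordinates between the last frame where the entity was seen and the frame where it was found again. This is a minimal sketch under that assumption; the function name `interpolate_boxes` and the `(x0, y0, x1, y1)` box layout are hypothetical.

```python
def interpolate_boxes(frame_a, box_a, frame_b, box_b):
    """Linearly interpolate boxes for the frames strictly between two sightings.

    frame_a, frame_b: frame indices of the two known sightings (frame_a < frame_b).
    box_a, box_b:     boxes at those frames as (x0, y0, x1, y1).
    Returns a dict mapping each in-between frame index to its estimated box.
    """
    if frame_b <= frame_a:
        raise ValueError("frame_b must come after frame_a")
    span = frame_b - frame_a
    estimated = {}
    for frame in range(frame_a + 1, frame_b):
        t = (frame - frame_a) / span  # 0 at frame_a, 1 at frame_b
        estimated[frame] = tuple(a + t * (b - a) for a, b in zip(box_a, box_b))
    return estimated


# An entity seen at frame 10 and recovered at frame 14:
gap = interpolate_boxes(10, (0, 0, 10, 10), 14, (40, 0, 50, 10))
# gap[12] is (20.0, 0.0, 30.0, 10.0), halfway between the two sightings.
```

No neural network is run on the skipped frames, which is why this step is cheap compared with re-detecting the entity in each one.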

Deep learning-based analysis is engaged whenever an identified face disappears and the approximation analysis logic cannot recover it — a common case is when a face passes the frame border.

Reverse analysis is applied to previous frames when a person or a vehicle approaches the camera and becomes easier to identify.

Missed object analysis starts when the object is temporarily hidden from view, e.g. when passing behind a column. The system processes a few upcoming frames to quickly recapture it.


The reporting feature allows users to create highly detailed overviews of every video based on the user-provided target list of entities — with a configurable similarity level.
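How the similarity level is computed is not stated in this case study. A common approach, assumed here purely for illustration, is to represent each entity as a numeric feature vector and compare vectors with cosine similarity, keeping only matches at or above a configurable threshold. The names `cosine_similarity`, `match_targets`, and `min_similarity` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_targets(detected, targets, min_similarity=0.8):
    """Match detected entities against a user-provided target list.

    detected, targets: dicts mapping an id to a feature vector.
    Returns {detected_id: (target_id, score)} for each detected entity whose
    best-scoring target reaches `min_similarity`.
    """
    report = {}
    for det_id, det_vec in detected.items():
        best = max(
            ((cosine_similarity(det_vec, vec), target_id)
             for target_id, vec in targets.items()),
            key=lambda pair: pair[0],
            default=None,
        )
        if best is not None and best[0] >= min_similarity:
            report[det_id] = (best[1], best[0])
    return report
```

Raising `min_similarity` trades missed matches for fewer false ones, which is the kind of control a configurable similarity level gives the user.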

The system supplies a comprehensive overview of generated thumbnails and insights on the entities that were successfully identified, as well as those misidentified or never detected. Each identified entity is accompanied by a log of the frames it’s visible in and the frames where it will have to be blurred at the redaction stage.
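The format of the per-entity frame log is not given here. As one possible representation, the sketch below collapses a list of frame indices into contiguous ranges, which keeps the log short for entities that stay on screen for long stretches. The function name `frame_ranges` is hypothetical.

```python
def frame_ranges(frames):
    """Collapse frame indices into sorted, inclusive (first, last) ranges."""
    ranges = []
    for frame in sorted(set(frames)):
        if ranges and frame == ranges[-1][1] + 1:
            ranges[-1][1] = frame  # extend the current run
        else:
            ranges.append([frame, frame])  # start a new run
    return [tuple(r) for r in ranges]


# Frames where a face is visible, in any order:
visible = frame_ranges([3, 1, 2, 7, 8, 12])
# visible is [(1, 3), (7, 8), (12, 12)]
```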

Powerful Tech Stack

The application is written in C++ and based on the cross-platform Dlib library and custom TensorFlow neural networks. The solution utilizes Max-Margin Object Detection (MMOD), Convolutional Neural Networks (CNN and R-CNN), a Fully Convolutional Network (FCN), and a Deep Neural Network (DNN).

The project relies on the Microsoft Media Foundation framework with a corresponding decoding plugin to decompress H.264 and H.265 video clips in the MP4 container and present the video as a set of frames.

Functional Modules

The C++ Windows application helps automate the detection, identification, and tracking of people, objects, and vehicles. This module is responsible for choosing the optimal thumbnails to be included in a report, and supports multi-faceted entity search.

The custom JS player enables a host of video editing operations, including file import, frame-by-frame navigation, object blurring, zooming, resizing, and cutting, all with separate layers for each detected face and object.

The redaction system governs user management (creating, editing, blocking users), video storage and search, and the video processing backend.

Business value

In initial trials, the solution demonstrated superior quality of detection and identification in comparison with other similar products intended for professional use. The system relies on highly intelligent logic that detects up to 400% more object instances than the competition.

According to recent estimates, the solution delivers dramatic gains in video processing, boosting police officer productivity up to 60 times.

As of today, it’s the only end-to-end solution optimized to address a particular set of pain points that police work presents, and the only one to ensure seamless automation of video analysis.

The combination of outstanding accuracy, process automation, and rich video editing capabilities makes the application an extremely valuable asset in public safety solutions, police investigations, and court proceedings.