A Phase-based Approach for ENF Signal Extraction from Rolling Shutter Videos

Abstract—Electric Network Frequency (ENF) analysis has been an intriguing tool for multimedia forensics as former studies have paved the way for estimating ENF signals from digital audio, video, or even image files. However, for ENF signals to be widely used in extensive applications, supplementary research is needed so that ENF signals can be stably extracted without restrictions. In this letter, we propose a new phase-based approach for extracting ENF signals from CMOS sensor recordings. It uses phase differences between row signals from two consecutive frames, such that problems due to missing sample points during the idle periods are circumvented. The proposed method has substantial advantages in that it is applicable without a predefined read-out time and when the length of given videos is too short. Extensive experiments conducted with numerous devices demonstrate that the proposed method can take precedence over state-of-the-art methods because it robustly produces accurate ENF estimates in terms of aliased frequency on the frame-level. The coding framework used for this letter is available at: https://github.com/hyekyunghan/Phase-based-ENF-extraction-method. [pdf]

ENF Signal used as a tool for Multimedia Forensics

Because ENF signals can be embedded in digital recordings, they have been researched extensively for forensic applications, including timestamp verification, location identification, and tampering detection.



ENF signals embedded in videos

https://www.youtube.com/watch?v=naW8xmh1QY0





Things to consider when extracting ENF from Video

1) Rolling Shutter mechanism

The rolling shutter mechanism increases the effective sampling rate, since successive rows of an image are acquired at different times separated by a regular interval.

While CCD sensors acquire pixels of an image at once, cameras with a CMOS sensor apply the rolling shutter that acquires pixels of an image frame sequentially one row at a time, and thus increases the sampling rate of a camera.

The read-out time is the time a camera takes to read out all rows of one frame. It is shorter than the frame period (the reciprocal of the frame rate), and the difference between the two is the idle time. A camera needs some time to digitize the pixels right after capturing a frame, and this requirement governs the read-out rate.
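As a rough illustration of the timing relationship described above, the following Python sketch derives the idle time and the effective row sampling rate from a frame rate, a read-out time, and a row count. The function name and the example values (30 fps, 20 ms read-out, 1080 rows) are assumptions chosen for illustration, not measurements from any device in the study.

```python
def rolling_shutter_timing(frame_rate_hz, readout_time_s, rows):
    """Return (idle_time_s, row_sampling_rate_hz) for a rolling shutter camera.

    Hypothetical helper: it assumes rows are read at a uniform interval
    during the read-out time and nothing is sampled during the idle time.
    """
    frame_period_s = 1.0 / frame_rate_hz
    if readout_time_s > frame_period_s:
        raise ValueError("read-out time cannot exceed the frame period")
    idle_time_s = frame_period_s - readout_time_s
    row_sampling_rate_hz = rows / readout_time_s
    return idle_time_s, row_sampling_rate_hz


# Illustrative values only (not taken from the paper).
idle_time_s, row_rate_hz = rolling_shutter_timing(30.0, 0.020, 1080)
print(f"idle time: {idle_time_s * 1e3:.2f} ms, row rate: {row_rate_hz:.0f} Hz")
```

With these assumed values, the rows are sampled at tens of kilohertz while about 40% of every frame period is idle, which is why the gap matters when the row samples are treated as one continuous signal.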

2) Presence of the idle time

Pixels are acquired only during the read-out time.

During the idle time, the camera does not perform exposure and hence the transmitted signal is lost during this period.



Effects of Aliasing (on the frame level)

ENF signals embedded in a video recorded under electrically powered light are affected by aliasing.

This is inevitable because the light source flickers rapidly (at 100 Hz or 120 Hz) compared with the low frame rate of a camera (around 30 fps). Commercial cameras therefore sample below the Nyquist rate on the frame level, and the flicker signal appears at an aliased frequency.
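The aliasing described above can be made concrete with a small Python sketch. It uses the standard sampling result that a tone sampled at the frame rate appears at its signed offset from the nearest multiple of the frame rate. The function name is a hypothetical helper, and the example frequencies are nominal values rather than data from the experiments.

```python
def aliased_frequency(signal_hz, frame_rate_hz):
    """Signed offset of signal_hz from the nearest multiple of frame_rate_hz."""
    nearest_multiple = round(signal_hz / frame_rate_hz) * frame_rate_hz
    return signal_hz - nearest_multiple


# Nominal flicker frequencies against common frame rates.
alias_100_at_30 = aliased_frequency(100.0, 30.0)  # 100 - 3 * 30
alias_120_at_30 = aliased_frequency(120.0, 30.0)  # 120 - 4 * 30
alias_100_at_25 = aliased_frequency(100.0, 25.0)  # 100 - 4 * 25
print(alias_100_at_30, alias_120_at_30, alias_100_at_25)
```

When the flicker frequency is an exact multiple of the frame rate, the alias is zero, and only the small fluctuation of the ENF around its nominal value remains visible on the frame level.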

It is difficult to estimate the ENF when the aliased ENF is almost zero, because the frequency derived from the flickering light is obscured by the frame rate. This situation is very common.

This is because camera manufacturers deliberately design commercial cameras so that the frame rate keeps up with the flickering light they are likely to face, which avoids capturing the dim part of the lighting cycle.

To address this, we propose a novel method that overcomes the aliasing problem on the frame level. The temporal average of the aliased frequency (fa,0) frequently becomes almost zero because of the video standards*, and previous methods often become inapplicable when fa,0 approaches zero, which makes the proposed method highly useful. The results show that the proposed method is both robust and accurate.

*Phase Alternating Line (PAL) and National Television System Committee (NTSC) are video format standards. PAL and NTSC are used in regions with power frequencies of 50 and 60 Hz, respectively, and frame rates are usually set to 25 and 30 fps, respectively.
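The near-zero alias under a standard pairing can also be checked numerically. The sketch below simulates the per-frame mean brightness of a light flickering at 100.04 Hz (an assumed ENF of 50.02 Hz, with the flicker at twice the mains frequency) and locates the spectral peak of the frame-level signal with NumPy. The function name, the flicker amplitude, and the 100 s duration are illustrative assumptions, and the simulation ignores noise and exposure effects.

```python
import numpy as np


def frame_level_peak(flicker_hz, fps, duration_s=100.0):
    """Peak frequency (Hz) of a synthetic flicker sampled once per frame."""
    n = np.arange(int(round(duration_s * fps)))
    brightness = 1.0 + 0.1 * np.cos(2 * np.pi * flicker_hz * n / fps)
    # Remove the mean so the DC bin does not dominate the spectrum.
    spectrum = np.abs(np.fft.rfft(brightness - brightness.mean()))
    freqs = np.fft.rfftfreq(n.size, d=1.0 / fps)
    return float(freqs[np.argmax(spectrum)])


pal_peak_hz = frame_level_peak(100.04, 25.0)    # frame rate matched to 50 Hz mains
other_peak_hz = frame_level_peak(100.04, 30.0)  # frame rate not matched
print(pal_peak_hz, other_peak_hz)
```

In the matched case the peak sits only a few frequency bins above zero even with 100 s of video, whereas in the unmatched case it lies near 10 Hz, well clear of DC.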

The subspace-based methods also produce accurate ENF estimates, but their applicability is very limited: when fa becomes almost zero, MUSIC and ESPRIT break down.



Experimental Results

Relationship between real-time ground truth ENF signal measured using the FDR and the frame rate multiple while recording video #7 (left) and video #9 (right). The average variation of real-time ENF from the frame rate multiple of video #9 is larger than that of video #7.

Estimated ENF signals extracted from video #7 (left) and video #9 (right) with y-axis indicating normalized frequency. The results are dependent on the relationship between real-time ENF and the frame rate illustrated in the figure above.





The failure in estimating ENF can be explained by the relationship between the real-time ENF value and the frame rate of the videos.

MUSIC and ESPRIT fail in the segments where a multiple of the frame rate becomes approximately equal to the actual ENF value.

The spectrogram-based methods (the weighted energy and periodic zero-padding (PZP) methods), especially PZP, are more resilient to this problem. However, even the powerful PZP method fails when the average deviation of the ENF from the frame rate multiple is about 0.0072 Hz (video #7), although it remained applicable when the deviation was slightly larger, at 0.0296 Hz (video #9).
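For readers who want to compute a comparable statistic on their own recordings, the following Python sketch shows one plausible way to measure the average deviation of a frequency trace from the nearest frame rate multiple. The exact definition used in the letter is not given here, so the mean absolute deviation is an assumption, as are the function name and the synthetic trace (which is not data from video #7 or video #9).

```python
import numpy as np


def mean_deviation_from_multiple(freq_hz, fps):
    """Mean absolute distance of each sample from the nearest multiple of fps."""
    freq_hz = np.asarray(freq_hz, dtype=float)
    nearest_multiple = np.round(freq_hz / fps) * fps
    return float(np.mean(np.abs(freq_hz - nearest_multiple)))


# Synthetic ground-truth trace around 60 Hz (illustrative only).
trace_hz = [59.99, 60.01, 60.02, 60.00]
deviation_hz = mean_deviation_from_multiple(trace_hz, fps=30.0)
print(f"average deviation: {deviation_hz:.4f} Hz")
```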

The proposed method successfully estimates accurate ENF signals in all cases. In this sense, it outperforms the state-of-the-art ENF estimation methods because it can still be used when the aliased ENF becomes almost zero, which happens frequently. Thus, the proposed method extracts ENF signals more stably and accurately than other competitive methods.


How?

Previous studies have used the spatial average of each row as a sample to extract ENF from visual recordings, but one concern in this line of work is the existence of idle time.

The authors of previous studies directly concatenated the row signals, ignoring the idle time at the end of each frame, so the discontinuities in their input signal introduced noise that led to less accurate ENF estimates. To deal with the idle time, we instead use the row-by-row samples of each frame independently, so that no discontinuity due to the idle period enters the signal.

If the read-out time is long enough, the signal lost during the idle time can be disregarded; for devices with a very short read-out time, however, it cannot be ignored.
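The idea of comparing row signals from two consecutive frames can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation (which is available in the repository linked in the abstract): it estimates the phase of each frame's row signal at its dominant FFT bin and converts the phase difference between consecutive frames into a frame-level aliased frequency. The function names, the Hann window, the synthetic flicker at 120.06 Hz, and the 25 ms read-out time are all assumptions, and the read-out time is used only to generate the test signal, not by the estimator.

```python
import numpy as np


def simulate_row_means(frame_idx, flicker_hz, fps, readout_s, rows):
    """Synthetic row-mean brightness of one frame under flickering light."""
    t = frame_idx / fps + np.arange(rows) * (readout_s / rows)
    return 1.0 + 0.1 * np.cos(2 * np.pi * flicker_hz * t)


def aliased_freq_from_phase(rows_a, rows_b, fps):
    """Aliased frequency (Hz) from the phase shift between consecutive frames."""
    window = np.hanning(rows_a.size)
    spec_a = np.fft.rfft((rows_a - rows_a.mean()) * window)
    spec_b = np.fft.rfft((rows_b - rows_b.mean()) * window)
    peak = 1 + np.argmax(np.abs(spec_a[1:]))  # dominant bin, skipping DC
    # Phase advance over one frame period, wrapped to (-pi, pi].
    dphi = np.angle(spec_b[peak] * np.conj(spec_a[peak]))
    return dphi / (2 * np.pi) * fps


fps, flicker_hz = 30.0, 120.06  # assumed values; true alias is 0.06 Hz
frames = [simulate_row_means(k, flicker_hz, fps, readout_s=0.025, rows=480)
          for k in range(11)]
estimates = [aliased_freq_from_phase(a, b, fps)
             for a, b in zip(frames, frames[1:])]
f_alias_hz = float(np.median(estimates))
print(f"estimated aliased frequency: {f_alias_hz:.4f} Hz")
```

Because each frame is analysed on its own and only the phase at the frame start is compared across frames, the rows are never concatenated over the idle period, so the gap introduces no discontinuity. Adding the estimate to the nearest frame rate multiple gives the flicker frequency, and halving that gives the ENF under the assumption that the light flickers at twice the mains frequency.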