Using FFmpeg to replace video frames

Machine learning algorithms for video processing typically work on frames (images) rather than video.

In a typical use-case, FFmpeg can be used to extract images from video – in this example, a 50-frame sequence starting at 1:47:

>ffmpeg -i input.vid -vf "select='gte(t,107)*lt(selected_n,50)'" -vsync passthrough '107+%06d.png'

Omit the -vf option if extracting the entire video. The video processing algorithm can then be applied to the individual images.

But now, how to replace the original video frames with the edited ones?

If the source video is a known constant frame-rate (CFR) segment, then the edited frames can be overlaid (replaced) using the overlay filter – in this example, a 12.5fps CFR video:

>ffmpeg -i input.vid -itsoffset 107 -framerate 25/2 -i '107+%06d.png' -filter_complex "[0:v:0][1]overlay=eof_action=pass" output.vid

Omit the -itsoffset option if replacing the entire video. Note: To test whether a video (or segment) is CFR or VFR, and find its frame-rate, see here.
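Where such a tool isn't at hand, a quick check can be scripted: capture the per-frame timestamps (e.g. with ffprobe -v error -select_streams v:0 -show_entries frame=pts_time -of csv=p=0 input.vid) and test whether the inter-frame interval is constant. A minimal Python sketch (is_cfr and its tolerance are illustrative, not an FFmpeg feature):

```python
# Sketch: decide whether a stream is CFR from its frame timestamps.
def is_cfr(pts_times, tolerance=1e-3):
    """Return (True, fps) if the inter-frame interval is constant."""
    deltas = [b - a for a, b in zip(pts_times, pts_times[1:])]
    if not deltas:
        return True, None
    if max(deltas) - min(deltas) <= tolerance:
        return True, round(len(deltas) / sum(deltas), 3)
    return False, None

# A 12.5 fps CFR run has a constant 0.08 s interval:
print(is_cfr([0.0, 0.08, 0.16, 0.24]))       # (True, 12.5)
# The VFR timestamps used later in this post are not constant:
print(is_cfr([107.161, 107.256, 107.433]))   # (False, None)
```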

Alright, how about replacing frames in a variable frame-rate (VFR) video?

Replacing frames in a VFR video can be very complicated, so it may be preferable to convert the video to CFR instead. For those pressing on with VFR, the short version of the process is to replace the above commands with the following to extract the images:

>ffmpeg -i input.vid -vf "select='gte(t,107)*lt(selected_n,50)',showinfo" -vsync passthrough '107+%06d.png' 2>&1 | sed 's/\r/\n/g' | showinfo2concat.py --prefix="107+" >concat.txt

This requires a script that can be downloaded at the link provided. After editing the images, update the source video with:

>ffmpeg -i input.vid -f concat -safe 0 -i concat.txt -filter_complex "[1]settb=1/90000,setpts=9644455+PTS*25/90000[o];[0:v:0][o]overlay=eof_action=pass" -vsync passthrough -r 90000 output.vid

Where 90000 is the timescale (inverse of timebase), and 9644455 is the PTS of the first frame to replace.

Come again?

Replacing frames in a VFR video involves 3 steps: Capture the frame timestamps, generate a replacement video segment, and then overlay that video over the original.

Capturing the timestamps of extracted frames is most conveniently done using showinfo at the same time that images are extracted. Following the above example:

>ffmpeg -i input.vid -vf "select='gte(t,107)*lt(selected_n,50)',showinfo" -vsync passthrough '107+%06d.png' 2>&1 | sed 's/\r/\n/g' | egrep '^\[Parsed_showinfo_'

To do it after the fact, use:

>ffmpeg -i input.vid -vf "select='gte(t,107)*lt(selected_n,50)',showinfo" -f null /dev/null 2>&1 | sed 's/\r/\n/g' | egrep '^\[Parsed_showinfo_'

Use only -vf showinfo to capture timestamps for the entire video.
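The PTS values can then be pulled out of the captured showinfo lines with a small parser. A sketch (parse_showinfo is a hypothetical helper, and the sample line is abridged; real showinfo lines carry many more fields):

```python
import re

# Match the pts and pts_time fields of an ffmpeg showinfo log line, e.g.
#   [Parsed_showinfo_1 @ 0x5587] n:  0 pts:9644455 pts_time:107.161 ...
PTS_RE = re.compile(r"\bpts:\s*(\d+)\s+pts_time:([\d.]+)")

def parse_showinfo(lines):
    """Return a list of (pts, pts_time) pairs, one per frame line."""
    out = []
    for line in lines:
        m = PTS_RE.search(line)
        if m:
            out.append((int(m.group(1)), float(m.group(2))))
    return out

sample = ["[Parsed_showinfo_1 @ 0x5587] n:  0 pts:9644455 pts_time:107.161"]
print(parse_showinfo(sample))  # [(9644455, 107.161)]
```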

How to build a variable frame-rate (VFR) video from images and timestamps?

The replacement overlay video segment can be generated using a concat file with durations.

One catch is that concat files use a hard-coded frame-rate of 25fps, and therefore all frame durations must be multiples of (1/25=)0.04s. This is a problem if the VFR video segment requires higher granularity. To work around this, durations can be multiplied by some factor that maintains granularity above 25fps, and then that same factor can be divided back out when generating the video segment.

Another catch is that timings as reported by showinfo may be rounded, and lack the precision necessary for overlaying VFR video. For this reason, it is preferable to work with presentation timestamp (PTS) durations, as these are the precise internal representation of timestamps used in video files.

Since PTS values are timestamps multiplied by the timescale, we can kill 2 birds with 1 stone by using the PTS values for durations, and then factoring the timescale back out when generating the video. This is the trick to generating VFR overlay video segments with precise frame timings. Consider the following list of images and timestamps:

File name       | Frame | PTS     | Timestamp (s) | PTS duration | /25
107+000001.png  | 739   | 9644455 | 107.161       | 8627         | 345.08
107+000002.png  | 740   | 9653082 | 107.256       | 15931        | 637.24
107+000003.png  | 741   | 9669013 | 107.433       | 3000         | 120

The PTS duration column is the difference between the next frame's PTS and the current frame's. The last column is the PTS duration divided by 25. Since PTS values are integers, and concat files support a granularity of 1/25, the PTS duration can safely be divided by 25, as long as we remember to multiply it back when generating the overlay video. Doing this is optional, but reduces the likelihood of exceeding duration limits.
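The arithmetic is easy to verify. Using the PTS values from the table (plus 9672013, the PTS of the frame after the last row, implied by its 3000-unit duration):

```python
# PTS values from the table; the final entry is the PTS of the frame
# *after* the replaced range (implied by the last row's 3000 duration).
pts = [9644455, 9653082, 9669013, 9672013]

# PTS duration = next frame's PTS minus the current frame's PTS.
durations = [b - a for a, b in zip(pts, pts[1:])]
print(durations)                     # [8627, 15931, 3000]

# Dividing by 25 yields the concat-file durations (multiplied back later).
print([d / 25 for d in durations])   # [345.08, 637.24, 120.0]
```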

A concat file – named concat.txt – can now be constructed as follows:

ffconcat version 1.0
file 107+000001.png
duration 345.08
file 107+000002.png
duration 637.24
file 107+000003.png
duration 120
...

It is best to do this part using a script, such as the one provided above.
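For illustration, a minimal stand-in for such a script might look like the sketch below (concat_file and its sentinel-entry convention are assumptions of this sketch, not the actual interface of showinfo2concat.py):

```python
def concat_file(frames, divisor=25):
    """Build ffconcat text from (filename, pts) pairs. The list must end
    with a sentinel (None, pts) carrying the PTS of the frame *after*
    the last image, so the last duration can be computed."""
    lines = ["ffconcat version 1.0"]
    for (name, pts), (_, next_pts) in zip(frames, frames[1:]):
        lines.append(f"file {name}")
        lines.append(f"duration {(next_pts - pts) / divisor:g}")
    return "\n".join(lines) + "\n"

frames = [("107+000001.png", 9644455),
          ("107+000002.png", 9653082),
          ("107+000003.png", 9669013),
          (None, 9672013)]  # sentinel: PTS after the last image
print(concat_file(frames))
```

Writing the returned text to concat.txt reproduces the example above.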

How to overlay the VFR video over the original frames?

There is one more piece of information needed to proceed: The source video’s timebase. The showinfo commands above provide this value. With this, the VFR video segment can be overlaid on top of the original video using the following command (in this example, the timebase is 1/90000, which makes the timescale 90000):

>ffmpeg -i input.vid -f concat -safe 0 -i concat.txt -filter_complex "[1]settb=1/90000,setpts=9644455+PTS*25/90000[o];[0:v:0][o]overlay=eof_action=pass" -vsync passthrough -r 90000 output.vid

Notice the setpts clause: It contains the starting PTS value (9644455) from the table above (instead of -itsoffset 107) – omit this value if replacing the entire video. Also notice how the precise timestamp is extracted by factoring back out the timescale and also the extra 25 that we divided by earlier.
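The round trip can be sanity-checked numerically: settb=1/90000 re-expresses each concat-file timestamp in 1/90000 units, and the setpts expression then maps it back onto the source timeline. Using the values from the table and concat file above:

```python
TIMESCALE = 90000    # the source timebase is 1/90000
START_PTS = 9644455  # PTS of the first replaced frame
DIVISOR = 25         # factor divided out of the concat durations

# Concat-file frame times in "seconds" (cumulative durations from above).
concat_times = [0, 345.08, 345.08 + 637.24]

# settb=1/90000 re-expresses each time t as t*90000 units; setpts then
# applies START_PTS + PTS*25/90000, recovering the original PTS values.
recovered = [round(START_PTS + (t * TIMESCALE) * DIVISOR / TIMESCALE)
             for t in concat_times]
print(recovered)  # [9644455, 9653082, 9669013]
```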

There may be additional options worth considering if you want to complete this process losslessly.

It is crucial that the overlay video have exactly the same frame timestamps as the underlay video for this process to work reliably. Thus, if things don’t work exactly right, it may be useful for debugging purposes to generate the overlay video separately:

>ffmpeg -f concat -safe 0 -i concat.txt -vf "settb=1/90000,setpts=PTS*25/90000,showinfo" -vsync vfr -r 90000 overlay.vid 2>&1 | sed 's/\r/\n/g' | egrep '^\[Parsed_showinfo_'

Then the showinfo output can be compared to the source video’s output, to see where things went wrong.
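A small helper can automate that comparison, assuming the PTS values have already been parsed out of both showinfo dumps (first_mismatch is a hypothetical name for this sketch):

```python
def first_mismatch(source_pts, overlay_pts):
    """Return (index, source, overlay) for the first differing PTS,
    or None if the two sequences agree."""
    for i, (a, b) in enumerate(zip(source_pts, overlay_pts)):
        if a != b:
            return i, a, b
    if len(source_pts) != len(overlay_pts):
        i = min(len(source_pts), len(overlay_pts))
        return i, None, None  # one stream has extra frames
    return None

print(first_mismatch([9644455, 9653082], [9644455, 9653081]))
# (1, 9653082, 9653081)
```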
