Updated March 18/00
1. Do NOT use the Panasonic encoder to encode .wav to .mp2 on its own. The resulting .mp2 will not sync with the intended video. Encoding .wav to .mp2 with DVMPEG (although its quality is not as good as Panasonic's) creates .mp2 audio that does sync with the video. But if you feed the .avi and the .wav to Panasonic together, the resulting .mpg will sync. Strange, eh?
Video and Audio Syncing Problem: Why and How.
Since the first release of Powerip in mid-1999, people have struggled to determine the correct video and audio speed when converting an NTSC mpeg-2 video/audio stream to any other format (e.g. mpeg-1, avi, asf, or divx) so that video and audio stay perfectly in sync.
This video and audio syncing problem is the result of an incorrect conversion of the mpeg-2 video stream (whether by Powerip, mpeg2avi or any other conversion utility out there). This document is not meant to dismiss Squeezer or Flask; rather, it can be considered as support, so that PERHAPS the explanation can be applied to perfect both Squeezer and Flask -- or even the AGrabber plugin. Note that there have been a lot of successful "synced" conversions made using utilities such as "SQUEEZER" and "FLASK". But there are some cases where none of the conversion utilities produces a totally "synced" video and audio.
Why? Let's see the process of transferring a 35mm film format to an NTSC video format, to see the root of this evil.
35mm Film to NTSC Video Conversion
Movies are usually shot on 35mm Film Negative. This format runs at 24 FRAMES per second. A Frame is the smallest unit of a FILM format. NTSC Video is a "field-based" format of 59.94 FIELDS per second. A Field is the smallest unit in the Video format. 2 Fields make up 1 FRAME. So 59.94 FIELDS per second equals 29.97 FRAMES per second. Now we can see the difference: 1 second of FILM (24 frames) is NOT equal to 1 second of NTSC Video (29.97 frames).
To "match" the speed of NTSC Video, conversion from a FILM format to an NTSC Video format undergoes a process called "2:3 pulldown" or TELECINE. This process, in its simplest terms, means "add 6 frames so that 24fps becomes 30fps -- which is VERY close to 29.97fps". The problem that arises when doing this TELECINE transfer is deciding WHICH 6 FRAMES are to be added - or REPEATED.
A community of filmmakers, videomakers and engineers created a STANDARDIZATION of this TELECINE conversion. Since a Video FRAME consists of 2 Fields, why not split the FILM format into Fields first, so that the smallest unit of both formats is the same? Let's see the process:
1. 24 FRAMES becomes 48 FIELDS
A | B | C | D |
Atop | Abottom | Btop | Bbottom | Ctop | Cbottom | Dtop | Dbottom |
Frame A becomes 2 fields: Atopfield + Abottomfield. Thus, 4 Frames become 8 Fields, and 24 Frames become 48 Fields. This "field-based" material is then TELECINED into an NTSC Video signal. As TELECINE is a STANDARDIZED conversion, we have to follow the rules of engagement ;). The rule is to do a REPEAT_FIRST_FIELD in a 2:3 sequence.
2. 4 FRAMES (8 FIELDS) becomes 5 FRAMES (10 FIELDS)
A | B | C | D |
Atop | Abottom | Atop | Bbottom | Btop | Cbottom | Ctop | Cbottom | Dtop | Dbottom |
If we look closely, we can see a sequence of At Al At followed by Bl Bt, then Cl Ct Cl, then Dt Dl (t = top field, l = lower field, i.e. the bottom field). But, since 1 FRAME consists of 2 FIELDS, the sequence becomes AA AB BC CC DD. What we have now is a conversion from 4 SOLID frames into 5 FRAMES consisting of 3 SOLID FRAMES and 2 INTERLACED FRAMES. By INTERLACED I am referring to a FRAME that is made up of 2 FIELDS from DIFFERENT source FRAMES. The AB frame is an example.
So, 4 FRAMES become 5 FRAMES, thus 24 becomes..... 30, DONE! Done? Nope, not by a long shot. NTSC Video is 29.97fps, so PLAYBACK of 30fps must be slowed down to 29.97fps, which brings us to the term DROP_FRAME.
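The 2:3 rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pulldown_2_3` and the single-letter frame labels are made up for this example, and "t"/"b" stand for top and bottom field.

```python
def pulldown_2_3(frames):
    """Expand film frames into video fields and frames using 2:3 pulldown.

    Frames alternately contribute 3 fields and 2 fields. Field parity on
    the video side must keep alternating top, bottom, top, bottom.
    """
    fields = []
    for i, name in enumerate(frames):
        count = 3 if i % 2 == 0 else 2          # A gets 3 fields, B gets 2, ...
        for _ in range(count):
            parity = "t" if len(fields) % 2 == 0 else "b"
            fields.append((name, parity))
    # Pair consecutive fields into video frames.
    video = [(fields[i], fields[i + 1]) for i in range(0, len(fields) - 1, 2)]
    return fields, video


fields, video = pulldown_2_3("ABCD")
labels = [first[0] + second[0] for first, second in video]
print(labels)  # ['AA', 'AB', 'BC', 'CC', 'DD']
```

Feeding it 24 film frames gives 60 fields, i.e. 30 video frames, which is the "24 becomes 30" step.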
Don't take DROP_FRAME to mean "FRAMES being REMOVED or DROPPED". In a 30fps Video sequence, DROP_FRAME time code counts video frames accurately in relation to real time. It counts every video frame, but when that .03 difference finally adds up to a whole video frame, it skips (or drops) a number. It does not drop a film or video frame; it merely skips a number and continues counting. This allows it to keep accurate time. So if you're cutting a scene using drop frame time code, and the duration reads as, say, 30 minutes and 0 frames, then you can be assured the duration is really 30 minutes. Confusing? Well, to put it in simple terms, DROP_FRAME here is in essence EQUAL to a SLOWED_DOWN playback from a pure 30fps to the correct NTSC 29.97fps SPEED. In the MPEG-2 domain, this means that frame numbers 00 and 01 are dropped or SKIPPED from the time code at the start of each minute, except minutes which are even multiples of 10.
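To make the "skips a number, not a frame" idea concrete, here is a small Python sketch that turns a running frame count into a drop-frame time code. The function name and constants are assumptions made for this example; the rule it encodes is the one described above (numbers 00 and 01 are skipped at the start of each minute, except every 10th minute).

```python
FRAMES_PER_10_MIN = 17982   # 1800 + 9 * 1798 frames actually counted
FRAMES_PER_MIN = 1798       # 1800 labels minus the 2 skipped numbers


def drop_frame_timecode(frame_count):
    """Return hh:mm:ss;ff for the frame with the given zero-based count."""
    tens, rest = divmod(frame_count, FRAMES_PER_10_MIN)
    skipped = 18 * tens                       # 9 skipping minutes * 2 numbers
    if rest >= 2:
        skipped += 2 * ((rest - 2) // FRAMES_PER_MIN)
    number = frame_count + skipped            # label as if counting at 30fps
    ff = number % 30
    ss = (number // 30) % 60
    mm = (number // 1800) % 60
    hh = number // 108000
    return f"{hh:02d}:{mm:02d}:{ss:02d};{ff:02d}"


print(drop_frame_timecode(1799))   # 00:00:59;29
print(drop_frame_timecode(1800))   # 00:01:00;02  (00 and 01 were skipped)
print(drop_frame_timecode(17982))  # 00:10:00;00  (no skip on the 10th minute)
```

No frame disappears: frame 1800 still exists, it is simply labelled ;02 instead of ;00.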
NOW, it is DONE.
Telecine in MPEG-2 Video
In Mpeg-2 Video, storing 30 frames for every second will create a much bigger file than storing 24 frames. If you do the calculation, 1 second at 24 frames is 20% SMALLER in SIZE than 1 second at 30fps. But, as we have already discussed, NTSC video should be 29.97fps. It would mean that ALL movies that are created from 35mm FILM should be TELECINED, then ENCODED to a 29.97fps Mpeg-2 Video stream, right? ..... NO!
A good thing about Mpeg-2 Video is that it can contain some FLAGS or PROGRAMMING that tell SOFTWARE or HARDWARE to perform the TELECINE when playing the Video. Since the INTERLACED FRAMES that make up the 29.97fps are built from REPEATED field(s), they are REDUNDANT, and TRASHABLE. Just let the FLAGS tell the player to perform the TELECINE. Really, it CAN do that ;). The benefit of this is that the movie CAN be stored at its original 24 FRAMES per second, and thus SAVE 20% of the total filesize!
The FLAGS related to this are: REPEAT_FIRST_FIELD, TOP_FIELD_FIRST. The rules for applying these FLAGS follow the STANDARDIZATION. So you don't have to worry about the process not meeting the standard :). Let's see an example:
3. Adding T_F_F and R_F_F Flags
Top Field First 1 | Top Field First 0 | Top Field First 0 | Top Field First 1 |
Repeat First Field 1 | Repeat First Field 0 | Repeat First Field 1 | Repeat First Field 0 |
A | B | C | D |
Atop | Abottom | Atop | Bbottom | Btop | Cbottom | Ctop | Cbottom | Dtop | Dbottom |
As we can see, a Value of 1 for both T_F_F and R_F_F will ORDER the player to DISPLAY FRAME A in a sequence of Atop Abottom Atop, and a Value of 0 for both T_F_F and R_F_F will ORDER the player to display FRAME B in a sequence of Bbottom Btop.
When T_F_F is 0 and R_F_F is 1 (FRAME C), the player will display FRAME C in a sequence of Cbottom Ctop Cbottom, and so forth. Since it is a STANDARDIZED conversion, we can see repeating Values of T_F_F and R_F_F as follows:
T_F_F sequence: 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
R_F_F sequence: 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
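As a rough sketch of how a player could interpret these two flags, the Python below expands each stored frame into the fields it should display. The function name `expand_flags` and the field labels are assumptions for illustration only; this is not taken from any real decoder.

```python
def expand_flags(frames, tff, rff):
    """List the displayed fields for frames carrying T_F_F / R_F_F flags.

    T_F_F = 1 means the top field is shown first, 0 means bottom first.
    R_F_F = 1 means the first field is shown again after the second one.
    """
    fields = []
    for name, top_first, repeat in zip(frames, tff, rff):
        first, second = ("top", "bottom") if top_first else ("bottom", "top")
        fields.append(name + first)
        fields.append(name + second)
        if repeat:
            fields.append(name + first)
    return fields


# Flags for frames A B C D as in the table above.
print(expand_flags("ABCD", tff=[1, 0, 0, 1], rff=[1, 0, 1, 0]))
# ['Atop', 'Abottom', 'Atop', 'Bbottom', 'Btop',
#  'Cbottom', 'Ctop', 'Cbottom', 'Dtop', 'Dbottom']
```

Four stored frames come out as ten fields, and the parity alternates top, bottom, top, bottom all the way through, which is why the T_F_F pattern has to be 1 0 0 1.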
So, now we have an Mpeg-2 Video stream CONTAINING 24 FRAMES per second with the TFF and RFF flags in action. This will create a CONFLICT between 24fps versus 30fps and the VERBATIM 29.97fps NTSC Video standard. To solve this, there are 2 other advantages of an Mpeg-2 Video stream that can be applied: the FPS flag and the DROP_FRAME flag.
When the FPS flag value is PROGRAMMED in the header of an Mpeg-2 Video stream, it will ORDER the player to PLAY this Video stream at an exact SPEED. So, if the FPS flag is set as 29.97fps, the Video stream will play at exactly 29.97 frames per second.
When the DROP_FRAME flag value is 1, it will ORDER the player to REMEMBER that the 00 and 01 frames are dropped at the start of each minute except minutes which are even multiples of 10. The result is much the same as applying the 29.97fps value.
So, THAT is how we make an Mpeg-2 NTSC video stream with 24 FRAMES stored, but a 29.97fps playback speed. Now that we understand the process, we are ready to REVERSE it, in order to achieve total Video and Audio syncing when converting BACK from a 24-stored-29.97-fps Mpeg-2 Video stream into any video format we want.
How? Let's start with "mpeg2avi", a utility that converts an Mpeg-2 Video stream into .avi format (with the codec of your choosing).
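The arithmetic behind "24 stored, 29.97 played" can be checked with a short Python sketch. It assumes that the nominal 29.97fps is exactly 30000/1001 and that every 4 stored frames are shown as 5 video frames; the function names are made up for this example.

```python
from fractions import Fraction

# Assumption: the nominal 29.97fps NTSC rate is exactly 30000/1001.
NTSC_FRAME_RATE = Fraction(30000, 1001)


def stored_frame_rate(display_rate=NTSC_FRAME_RATE):
    """Rate at which stored film frames are used up during flagged playback.

    With R_F_F going 1 0 1 0, 4 stored frames fill 5 video frames,
    so the stored rate is 4/5 of the display rate.
    """
    return display_rate * Fraction(4, 5)


def playback_seconds(stored_frames, display_rate=NTSC_FRAME_RATE):
    """Real playback time of a number of stored frames, in seconds."""
    return Fraction(stored_frames) / stored_frame_rate(display_rate)


print(float(stored_frame_rate()))          # about 23.976
one_hour_of_film = 24 * 3600               # frames in one hour at 24fps
print(float(playback_seconds(one_hour_of_film) - 3600))  # 3.6
```

Under these assumptions, video treated as a pure 24fps runs 3.6 seconds short of the flagged playback for every hour, which is more than enough to be seen as lost sync.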
4. Mpeg2avi, VideoMatrix or Graphedit
After a careful reading of the readme.txt that comes with this utility, we can convert the Mpeg-2 Video stream into an EXACTLY 24fps .avi. I am ASSUMING that mpeg2avi converts the FRAMES in DISPLAY_ORDER and not in CODED_ORDER. If the conversion is based on CODED_ORDER, then we are totally screwed.
Another way is to use the VideoMatrix utility, which will convert an Mpeg-2 Video stream into .avi format BASED on the PLAYBACK of the Mpeg-2 Video stream. This utility ENSURES that the conversion is done in DISPLAY_ORDER.
Or, if you are quite familiar with Graphedit, you can use this utility to convert Mpeg-2 Video into .avi too. Any of these three utilities MUST give you a full 720x480 .avi of 24 FRAMES per second. This 720x480 24fps is a REQUIREMENT.
5. M2VInfo
This utility is written to help analyse the behaviour and values of the FLAGS in an Mpeg-2 Video stream. It is a DOS command-line utility with usage as follows:
C:\M2VInfo filename.m2v > dump.txt
You can stop the process early, because only the first GOP information is actually needed to determine the pattern of the FLAGS values and behaviour.
A sample of the dump.txt is as follows:
Type 1 tff 1 rff 1 temp_reference 2
Type 3 tff 0 rff 1 temp_reference 0
Type 3 tff 1 rff 0 temp_reference 1
Type 2 tff 1 rff 0 temp_reference 5
Type 3 tff 0 rff 0 temp_reference 3
Type 3 tff 0 rff 1 temp_reference 4
Type 2 tff 0 rff 1 temp_reference 8
Type 3 tff 1 rff 1 temp_reference 6
Type 3 tff 0 rff 0 temp_reference 7
Type 2 tff 0 rff 0 temp_reference 11
Type 3 tff 1 rff 0 temp_reference 9
Type 3 tff 1 rff 1 temp_reference 10
Notes:
Now, what we need to do is reconstruct the DISPLAY ORDER. I do this from the dump.txt above by rearranging the Frames according to their temp_reference order (Type 1 = I frame, Type 2 = P frame, Type 3 = B frame):
B B I B B P B B P B B P
Then, let's put the T_F_F (first row) and R_F_F (second row) values in the same order too
0 1 1 0 0 1 1 0 0 1 1 0
1 0 1 0 1 0 1 0 1 0 1 0
Now we have the following sequence in DISPLAY_ORDER:
B B I B B P B B P B B P
0 1 1 0 0 1 1 0 0 1 1 0
1 0 1 0 1 0 1 0 1 0 1 0
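The reordering can also be done by a small script instead of by hand. The Python sketch below assumes the dump lines look exactly like the sample above (`Type`, `tff`, `rff`, `temp_reference`, each followed by a number) and that only the first GOP is fed in, since temp_reference restarts with every GOP. The function names are made up for this example.

```python
DUMP = """\
Type 1 tff 1 rff 1 temp_reference 2
Type 3 tff 0 rff 1 temp_reference 0
Type 3 tff 1 rff 0 temp_reference 1
Type 2 tff 1 rff 0 temp_reference 5
Type 3 tff 0 rff 0 temp_reference 3
Type 3 tff 0 rff 1 temp_reference 4
Type 2 tff 0 rff 1 temp_reference 8
Type 3 tff 1 rff 1 temp_reference 6
Type 3 tff 0 rff 0 temp_reference 7
Type 2 tff 0 rff 0 temp_reference 11
Type 3 tff 1 rff 0 temp_reference 9
Type 3 tff 1 rff 1 temp_reference 10
"""

PICTURE_TYPES = {1: "I", 2: "P", 3: "B"}


def parse_dump(text):
    """Turn each 'Type n tff n rff n temp_reference n' line into a dict."""
    pictures = []
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) != 8 or tokens[0] != "Type":
            continue                      # skip anything that is not a picture line
        pictures.append(dict(zip(tokens[0::2], map(int, tokens[1::2]))))
    return pictures


def display_order(pictures):
    """Sort one GOP from coded order into display order."""
    return sorted(pictures, key=lambda p: p["temp_reference"])


ordered = display_order(parse_dump(DUMP))
print(" ".join(PICTURE_TYPES[p["Type"]] for p in ordered))  # B B I B B P B B P B B P
print(" ".join(str(p["tff"]) for p in ordered))             # 0 1 1 0 0 1 1 0 0 1 1 0
print(" ".join(str(p["rff"]) for p in ordered))             # 1 0 1 0 1 0 1 0 1 0 1 0
```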
Now, since we only want to know the CORRECT TELECINE sequence... just take the first 5 frames from the sequence above, and assume it's an A B C D E sequence:
A B C D E
Apply the T_F_F and R_F_F values to the sequence above, and carefully follow the first T_F_F value, so we know which FIELD is the STARTING FIELD (l = lower field, u = upper field). I got this:
AlAu AlBu BlCu ClCu DlDu ElEu El
The STARTING_FIELD of the above sequence is Al = Frame A lower field. Separate the sequence above into frames (each containing 2 fields), and we can work out the TELECINE sequence within this .M2V as follows:
W S S W W W
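So, the WSSWW sequence (W = a frame whose 2 fields come from the same source frame, S = a frame split across 2 different source frames) is the EXACT TELECINE sequence taking place in this particular .M2V. The sixth W above is the start of the next WSSWW cycle.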
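The same derivation can be sketched in Python: expand the flags into fields, pair the fields into frames, and mark each frame as W or S. The function name `telecine_pattern` is an assumption made for this example, and W/S are read here as "both fields from the same source frame" versus "fields from two different source frames".

```python
def telecine_pattern(tff, rff):
    """Return (starting field, W/S pattern) for display-order flag lists.

    'l' is the lower field and 'u' the upper field, as in the text.
    """
    fields = []
    for index, (top_first, repeat) in enumerate(zip(tff, rff)):
        first, second = ("u", "l") if top_first else ("l", "u")
        fields.append((index, first))
        fields.append((index, second))
        if repeat:
            fields.append((index, first))
    # Pair consecutive fields into frames; a leftover single field is ignored.
    pairs = [(fields[i], fields[i + 1]) for i in range(0, len(fields) - 1, 2)]
    pattern = "".join("W" if a[0] == b[0] else "S" for a, b in pairs)
    return fields[0][1], pattern


# First 5 frames (A B C D E) of the display-order flags.
start, pattern = telecine_pattern(tff=[0, 1, 1, 0, 0], rff=[1, 0, 1, 0, 1])
print(start, pattern)  # l WSSWWW
```

The starting field and the position of the two S frames are the two things that have to be carried over into the 29.97fps avi.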
From the dump.txt, we can now determine the TELECINE sequence needed to reconstruct the 2:3 pulldown on the 24fps avi. By this we MIMIC the actual DISPLAY_ORDER of the Mpeg-2 Video PLAYBACK in the AVI domain. In short, we CONVERT the 24fps avi into a 29.97fps NTSC avi. It is important to apply the specific TELECINE sequence AND the correct STARTING FIELD. When the reconstruction gives lower_field as the BEGINNING of the DISPLAY_ORDER, we have to conform the conversion AS IS. Both the SPECIFIC TELECINE sequence and the STARTING FIELD are IMPORTANT to create a 100% video and audio sync.
At the time of writing this document, I can only use Adobe After Effects to correctly convert the 24fps into 29.97fps while at the same time applying both determining factors above. Yes, you can do this directly in any ENCODER, but the applied conversion DOES NOT conform to the STANDARD TELECINE transfer. Encoders will apply a "4th frame repeated" calculation to get from 24fps to 29.97fps (30fps drop frame). Such a conversion will look like this:
Current Encoders TELECINE creates: AA BB CC DD DD EE FF GG HH HH
Correct TELECINE Transfer would be: AA AB BC CC DD EE EF FG GG HH
As you can see, even the STARTING FIELD factor is nowhere to be applied (the panasonic encoder has this option, though). If this error accumulates over a whole hour of Video conversion, Audio syncing WILL be screwed. That said, I prefer to use Adobe After Effects (although it's a difficult program), UNTIL a new method can be found.
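The two conversions can be put side by side with a short Python sketch. Both function names are made up for this example, and the "4th frame repeated" function is only a guess at what such encoders do, based on the sequence shown above.

```python
def repeat_fourth_frame(frames):
    """'4th frame repeated': show every frame once, every 4th frame twice."""
    out = []
    for i, name in enumerate(frames):
        out.append(name + name)
        if i % 4 == 3:
            out.append(name + name)
    return out


def pulldown_frames(frames):
    """2:3 pulldown: frames alternately give 3 and 2 fields, then pair up."""
    fields = []
    for i, name in enumerate(frames):
        fields.extend([name] * (3 if i % 2 == 0 else 2))
    return [fields[i] + fields[i + 1] for i in range(0, len(fields) - 1, 2)]


encoder = repeat_fourth_frame("ABCDEFGH")
correct = pulldown_frames("ABCDEFGH")
print(" ".join(encoder))  # AA BB CC DD DD EE FF GG HH HH
print(" ".join(correct))  # AA AB BC CC DD EE EF FG GG HH
print(sum(e != c for e, c in zip(encoder, correct)))  # 6 of 10 frames differ
```

Both give 10 video frames for 8 film frames, but 6 of the 10 do not hold the same picture content, so the result does not MIMIC the DISPLAY_ORDER of the Mpeg-2 playback.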
6. Adobe After Effects
This is the stage where I really can't explain much. You have to KNOW how Adobe After Effects works. That's why I really want to find a way to simplify this stage, or to put it simply, NOT USE ADOBE AFTER EFFECTS at all. But, as a quick reference, what I do in After Effects is like this:
All these steps are done in the 720x480 domain. The resulting rendered avi will be 29.97fps, conforming to the SAME TELECINE transfer sequence that is programmed within the Mpeg-2 Video -- THE DISPLAY_ORDER. This avi will then become the source for the conversion to OTHER formats, or, if you chose the DIVX codec from the start, you are now left with adding the Audio stream.
7. Converting the Audio
Use Graphedit to load any audio format. You have to edit the connection of the boxes in the Graphedit.
When it is finished, you now have 2 files: VIDEO.AVI and AUDIO.WAV. Now you decide what to do with it. You can:
8. Converting to Mpeg-1 with VideoCD compliant stream
At the end of the conversion, you will have a 29.97fps Mpeg-1 VCD Compliant stream (352x240), and note this: panning video sequences will pan smoothly!
The advantage of a 29.97fps Mpeg-1 Video/Audio stream is that I can edit it in the IFilm Mpeg-1 editor, while a 23.976fps Mpeg-1 cannot be edited there.
By writing this document, I am describing the situation as closely as I can to help answer the questions about the Video and Audio syncing problem, and I hope that a coder can help SIMPLIFY the process. I know that SQUEEZER and FLASK have been reported to be able to create a TOTAL SYNC. But since there are some reports that say otherwise, PERHAPS this document could help point out the reason, so that we can come up with a solution or two for PERFECT A/V syncing for ANY conversion.
Some of the ideas to jot down are as follows:
Anyone up for the task?
regards,
robshot
march 17, 00