Encoding Op1a Hardware Limitations

Questions and answers on how to get the most out of FFAStrans
andrezagato
Posts: 43
Joined: Tue Jun 09, 2020 4:07 pm

Encoding Op1a Hardware Limitations

Post by andrezagato »

Hello,
My job purchased a new workstation specifically for transcoding.
AMD Ryzen Threadripper PRO 5965WX 24-Cores 3.80 GHz boost up to 4.5 GHz.
128 GB DDR4 RAM
10 Gb NICs
512 GB NVMe System Drive
2 TB NVMe Cache Drive

I will mainly use it to encode to MXF Op1a. From my tests, the maximum conversion speed is 120 frames/sec: with 1 job slot at a time, it converts at 120 frames/sec. With 2 jobs, it is around 60 fps each. With 4 simultaneous jobs, around 30 fps each.
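
The figures in this paragraph fit a simple fixed-budget model, sketched below in Python. The function name and the 25 fps source frame rate are assumptions made for illustration; they are not FFAStrans settings or measurements from the tests.

```python
def per_job_fps(total_fps: float, jobs: int) -> float:
    """Split a fixed aggregate throughput evenly across parallel jobs."""
    if jobs < 1:
        raise ValueError("jobs must be at least 1")
    return total_fps / jobs


TOTAL_FPS = 120.0   # aggregate ceiling reported in the tests
SOURCE_FPS = 25.0   # assumed frame rate of the footage (hypothetical)

for jobs in (1, 2, 4):
    fps = per_job_fps(TOTAL_FPS, jobs)
    # Real-time factor: how many seconds of footage are encoded per second.
    print(f"{jobs} job(s): {fps:.0f} fps each, {fps / SOURCE_FPS:.1f}x real time")
```

If the machine behaved like this model, adding job slots would never raise the total, which is the symptom described here.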

The thing is, that is not much different from an i9 10th gen I also use. I expected it to be a lot better, especially with more simultaneous jobs. Is there anything I am missing, something I can improve? It is worth mentioning that the CPU and all the cores are being used, normally at 80% to 100% usage.

Is there some sort of "limitation" regarding MXF Op1a files?
emcodem
Posts: 1631
Joined: Wed Sep 19, 2018 8:11 am

Re: Encoding Op1a Hardware Limitations

Post by emcodem »

Morning!

Which codec are we talking about? XDCAM or AVCI or something different? And which resolution?
emcodem, wrapping since 2009 you got the rhyme?
andrezagato
Posts: 43
Joined: Tue Jun 09, 2020 4:07 pm

Re: Encoding Op1a Hardware Limitations

Post by andrezagato »

Mainly it is XDCAM HD 50 422.
But I have been testing with DNxHD, and the results are a little better, but not much. I've tested AVC-I also, and it was even worse than XDCAM. So far my tests have used Full HD MXF Op1a as the source footage, transcoding to Full HD MXF OP-Atom. I've just realized I made a mistake before: I wrote that I encode to Op1a, but it is actually to OP-Atom (for Avid).
FranceBB
Posts: 230
Joined: Sat Jun 25, 2016 3:43 pm

Re: Encoding Op1a Hardware Limitations

Post by FranceBB »

Were those tests made with FFAStrans?
You should know that libavcodec's MPEG-2 encoder was, just like pretty much any other MPEG-2 encoder, written a very long time ago, which means that it won't make use of some of the newer instruction sets that new CPUs have. For instance, I have a 56c/112th Intel Xeon with AVX-512, but I hardly use it on single encodes, and in particular I'm pretty sure the MPEG-2 encoder is using SSE2 only, with no SSSE3, SSE4.1, SSE4.2, AVX, AVX2 or AVX-512, as by the time those were around it was already too late and people had lost interest in MPEG-2 in favor of Xvid and later H.264.
I know you tested AVC Intra and it was worse, but that is because although x264 is much more parallelized than the MPEG-2 encoder, the computational complexity of H.264 is also higher, so the tradeoff isn't in your favor. By the way, x264 is multithreaded and can use up to AVX-512 but ONLY in the 8bit flavor, while the 10bit flavor (which is used by the AVC Intra profile) is limited to AVX2 only (although I've been begging people to add AVX-512 to the 10bit part too for free over the last 4 months).

Just to give you a glimpse of this, the code in cpu.c detects the CPU features, while in dct.c you can see the SIMD flags (like the X264_CPU_AVX512 one I mentioned above) used to enable (or not) certain functions:
https://raw.githubusercontent.com/mirro ... mmon/dct.c
however, if you take a look at the pre-processor directives in that file, the X264_CPU_AVX512 flag does not appear in the HIGH_BIT_DEPTH part at all and unfortunately AVC Intra is 10bit, so "HIGH_BIT_DEPTH"... :(
In other words, there is nothing in the "high bit-depth" code path that gets enabled/disabled depending on the availability of AVX-512 support.
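
To picture the kind of flag-based dispatch described above, here is a minimal Python sketch. It is not x264's code: the flag strings, kernel names and table layout are all hypothetical, and the table only mirrors the claim in this post that the high bit-depth path has no AVX-512 entry.

```python
# Hypothetical dispatch table, ordered from fastest to slowest kernel.
# Names are made up for illustration; they do not come from x264.
DISPATCH = {
    "8bit": [("avx512", "dct_avx512"), ("avx2", "dct_avx2"), ("sse2", "dct_sse2")],
    "10bit": [("avx2", "dct_avx2"), ("sse2", "dct_sse2")],  # no AVX-512 entry
}


def pick_kernel(bit_depth, cpu_flags):
    """Return the first kernel whose required flag the CPU reports, else plain C."""
    for flag, kernel in DISPATCH[bit_depth]:
        if flag in cpu_flags:
            return kernel
    return "dct_c"


cpu = {"sse2", "avx2", "avx512"}
print(pick_kernel("8bit", cpu))   # picks the AVX-512 kernel
print(pick_kernel("10bit", cpu))  # falls back to AVX2 on the same CPU
```

With a table shaped like this, a CPU that reports AVX-512 still ends up on the AVX2 kernel for 10-bit content, because there is simply nothing faster to select.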
And... this is far from over: the more you dig, the more "bad" things you'll find related to AVC Intra Class 100. For instance, CAVLC only has the pure C implementation with (almost) no manually written assembly code whatsoever, which makes it extremely slow. Ironically, CABAC does have some assembly, but unfortunately x264 can't use it, as CABAC must be disabled for the Intra Class 100 flavor to be compliant...

Anyway, back to XDCAM-50, to answer your question, no, there's no artificial limitation, it's just the lack of people actually contributing to old encoders like the MPEG-2 one to make it properly multithreaded and with modern SIMD, I'm afraid... :(
emcodem
Posts: 1631
Joined: Wed Sep 19, 2018 8:11 am

Re: Encoding Op1a Hardware Limitations

Post by emcodem »

andrezagato wrote: Thu Feb 16, 2023 8:34 pm max frames/sec converted is 120, meaning that if I have 1 job slot at a time, it will convert at 120 frames/sec. If it is 2 jobs, around 60 fps each. If its 4 simultaneous jobs, around 30 fps each.

... and normally at 80% to 100% usage.
Ok, so if you encode XDCAM HD, the CPU utilisation you mention does not make sense. One MPEG-2 encoding job using our XDCAM HD ffmpeg settings should not come even close to using 24 cores at 80%. It should be more like 15-20% (of 24 physical cores). The reason for the slowness is, as @FranceBB says, just missing interest in the community to make it use more cores in parallel (e.g. by encoding multiple standalone GOPs in parallel internally, as the MainConcept encoder would do).

For me, on an old HP workstation, the base speed is about 80 fps; the difference to your 120 fps is because you have 4 GHz in turbo mode and I have 2.5 GHz on my Xeons. Leaving that aside, I have 2x6 cores, and for me it is like this: 1 job = 80 fps, 35% CPU usage. 2 jobs = 70 fps each, 60% CPU usage. 3 jobs = 50 fps, 70% CPU usage.
The numbers stay more or less the same regardless of whether Hyperthreading is on or off.
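
To make these measurements easier to compare, the short Python sketch below turns them into aggregate throughput. The function name is made up for illustration, and the 3-job figure is assumed to be per job, like the 2-job one.

```python
# (parallel jobs, fps per job, total CPU usage in percent) as quoted in this post.
# Assumption: the 3-job value of 50 fps is per job, like the 2-job value.
measurements = [(1, 80, 35), (2, 70, 60), (3, 50, 70)]


def aggregate_fps(jobs, fps_per_job):
    """Total frames per second across all parallel jobs."""
    return jobs * fps_per_job


single_job_fps = measurements[0][1]
for jobs, fps, cpu in measurements:
    total = aggregate_fps(jobs, fps)
    # Efficiency compares the total with perfect scaling of the 1-job speed.
    efficiency = total / (jobs * single_job_fps)
    print(f"{jobs} job(s): {total} fps total, {efficiency:.0%} of ideal, {cpu}% CPU")
```

Under that assumption the totals are 80, 140 and 150 fps, so the second job adds a lot and the third adds very little.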

The results do not make sense, because if 1 job at 120 fps uses only 30% CPU, why does the speed drop when you run 2 or more parallel transcodes? I have no answer for this, but the behaviour was the same regardless of the CPU model or software encoder that I tested. E.g. the MainConcept encoder does 250 fps at 20% CPU (24 cores), but when 2 jobs run in parallel we only have about half the speed, at 30% CPU. So the behaviour seems to depend on the OS rather than on the codec.

Anyway, I have no insight into AMD processors, only Intel. For Intel, Xeons are better when you use all cores concurrently because they can keep a higher turbo frequency on all cores. A Core iX processor can only turbo "one core" to a high frequency while the others stay low for temperature reasons; on a Xeon, all cores will stay overclocked, not to the maximum possible frequency but still higher than the base frequency and a lot higher than the non-overclocked i9 would be. BUT keep in mind this is only for Xeon vs Core i processors. AMD only has consumer processors, and I have no clue how they behave in terms of overclocking and overall available GHz under heavy load for a very long time.

Also, depending on the cooling, the Core iX system would probably limit the frequency to 1 GHz or so from time to time in order to cool down, while the Xeon system would keep all cores medium overclocked 24/7.
In return, a single job would always be faster on a consumer/gaming processor because it can overclock at least 1 core to a very high frequency. This is exactly what we need to overcome the limitations of "not heavily multicore optimized software" like MPEG-2 and H.264.

Anyway, totally independent of "who produced the CPU", for parallel encoding we just need to sum all the GHz that the system can deliver in order to compare systems. E.g. 1 core at 4.5 GHz + 11 cores at 2.8 GHz = the power this system can deliver when you use all cores in parallel.
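
As a rough illustration of that rule of thumb, the Python sketch below adds up the example from this post. The function name and the (core count, GHz) input layout are assumptions, and the result is only a crude comparison figure, not a benchmark.

```python
def aggregate_ghz(core_groups):
    """Sum clock speed over all cores.

    core_groups is a list of (core_count, ghz_per_core) pairs.
    """
    return sum(count * ghz for count, ghz in core_groups)


# Example from the post: 1 core at 4.5 GHz plus 11 cores at 2.8 GHz.
example = [(1, 4.5), (11, 2.8)]
print(f"{aggregate_ghz(example):.1f} GHz")  # prints "35.3 GHz"
```

To compare two machines this way, the per-core frequencies should be the ones actually sustained under full load, not the advertised boost clock.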
FranceBB
Posts: 230
Joined: Sat Jun 25, 2016 3:43 pm

Re: Encoding Op1a Hardware Limitations

Post by FranceBB »

emcodem wrote: Sat Feb 18, 2023 8:42 pm Anyway, i have no insights in AMD processors, only Intel.
Same here. I used to be a very hardcore AMD Fanboy from 1999 to 2012, I had lots of monocore CPUs back in the days for my personal desktop at home, namely Athlon 500 (1999), an Athlon 1600 (2001), an Athlon 3700 (2004) and I gotta say that back then AMD was rocking with better performances than Intel. Unfortunately, though, when Intel created the first multi core CPU (with the dual cores), things started to shift in their favor. From there, I went to the AMD Athlon 64 x2 4000 (2006) which was a dual core and then a whopping 6 core, the AMD Phenom II X6 (2010). Unfortunately, AMD wasn't really great in the multicore architecture, while Intel was blazing fast and taking advantage of multithreading with twice the threads of the cores in a CPU (in AMD world hyperthreading wasn't a thing). In the end, that 6 core I had was losing benchmarks (including encoding) against a puny Intel i5 4c/4th, not even an i7 4c/8th and was only competing with an i3 2c/4th. It was at that point that I said: "that's it, goodbye AMD", so I moved to Intel with an i7 at home and Xeons at work and never looked back.

(if anyone wonders "what happened before 1999", well, I wasn't old enough to be able to use a computer as my first PC came with Windows98SE xD)
emcodem wrote: Sat Feb 18, 2023 8:42 pm AMD only has consumer processors
Things kinda recently changed, nowadays AMD CPUs are Intel copy-cat with the same hyperthreading thing, which means that along with their Ryzen consumer CPUs they also have the Epyc series which is for professional workstations and datacenters.
Needless to say, the only company that moved to AMD was Amazon for their AWS EC2 machines, while the rest of the world stayed on the good old reliable Intel Xeon.
On that note, very very very recently, AMD managed to get AVX-512 in their Epyc CPUs, so it's gonna be interesting to see what Intel is gonna do now that they finally have some competition.
The answer seems to be Intel AMX (Advanced Matrix Extensions), but we'll see how that's gonna play out in the future.
andrezagato
Posts: 43
Joined: Tue Jun 09, 2020 4:07 pm

Re: Encoding Op1a Hardware Limitations

Post by andrezagato »

I have updates about the encoding process.
I've described the new encoding computer as having 128 GB of RAM, but I didn't know that it had only 16 GB when I tested; the rest of the RAM was installed last week. Before the RAM upgrade, the process ran at 100 to 120 fps with one job at a time (if I divided the work between jobs, the sum of all of them would still be around 120 fps), with the CPU at 100%.
Now a single job is around 300fps, with the CPU at 25%. 4 jobs managed up to 600 fps.
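
For a sense of scale, the Python sketch below compares the figures reported after the RAM upgrade with the earlier 120 fps ceiling. The function name is illustrative, and 120 fps is used as the baseline because it is the upper end of the "100 to 120 fps" range mentioned above.

```python
def speedup(after_fps, before_fps):
    """Ratio of the new throughput to the old one."""
    return after_fps / before_fps


BEFORE_FPS = 120        # earlier ceiling, single job or summed over jobs
AFTER_SINGLE_FPS = 300  # one job after the RAM upgrade
AFTER_TOTAL_FPS = 600   # four parallel jobs after the RAM upgrade

print(f"single job: {speedup(AFTER_SINGLE_FPS, BEFORE_FPS):.1f}x")  # 2.5x
print(f"four jobs:  {speedup(AFTER_TOTAL_FPS, BEFORE_FPS):.1f}x")   # 5.0x
print(f"per job with four running: {AFTER_TOTAL_FPS / 4:.0f} fps")  # 150 fps
```

So a single job is about 2.5 times faster than before, and four jobs together reach about 5 times the old total.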

I will keep testing, but I wanted to let you know this change in performance.
Does anyone know why that is? If I had more RAM, could I go faster?
emcodem
Posts: 1631
Joined: Wed Sep 19, 2018 8:11 am

Re: Encoding Op1a Hardware Limitations

Post by emcodem »

Well, I can only guess here, but the amount of RAM will not influence MPEG-2 encoding speed (not if we are talking about 16 GB+). Maybe the mainboard was not able to work with most of the cores as long as not enough RAM slots were filled? Or you forgot to tell the BIOS that all cores use the same NUMA node (if your mainboard has NUMA settings at all)...
I bet it is something mainboard specific though, not OS specific.
andrezagato
Posts: 43
Joined: Tue Jun 09, 2020 4:07 pm

Re: Encoding Op1a Hardware Limitations

Post by andrezagato »

Hey emcodem, I don't know what NUMA is; I will look it up.
I have a friend who also uses FFAStrans at his company, and after we found out about the RAM, he also saw a substantial performance improvement in his encoding after increasing the RAM.
He even tested with different clock speeds and noticed a change in performance. I will get more information and let you know.
emcodem
Posts: 1631
Joined: Wed Sep 19, 2018 8:11 am

Re: Encoding Op1a Hardware Limitations

Post by emcodem »

Ah i think NUMA is only there when you got 2 or more sockets (not just because you got lots of cores)...