Encoding Op1a Hardware Limitations

Questions and answers on how to get the most out of FFAStrans
FranceBB
Posts: 230
Joined: Sat Jun 25, 2016 3:43 pm
Contact:

Re: Encoding Op1a Hardware Limitations

Post by FranceBB »

emcodem wrote: Thu Mar 02, 2023 8:09 am Ah i think NUMA is only there when you got 2 or more sockets (not just because you got lots of cores)...
Correct.
In single-socket configurations there are no NUMA nodes.
andrezagato wrote: Wed Mar 01, 2023 5:50 pm I have a friend whose company also uses FFAStrans, and after we found out about the RAM, he also saw a substantial performance improvement in his encoding after increasing it.
He even tested different clock speeds and noticed a change in performance. I will get more information and let you know.
That's about right, and there are two reasons for it:

1) CPU lanes
2) Cached frames

The first one is pretty intuitive.
A CPU has a certain number of "lanes" connected to the motherboard via the socket, through which it interacts with all other devices, including RAM.
Back in the day there used to be a northbridge and a southbridge, then it became the PCH (Platform Controller Hub), and nowadays it's UPI (Ultra Path Interconnect).
Anyway, regardless of the name, the concept is that a CPU has a certain number of lanes connected to the motherboard, the motherboard uses those lanes to connect it to other devices like RAM, SSDs and GPUs, and each of those devices uses some of them.
This is the main reason why CPUs like AMD Epyc and Intel Xeon have more lanes than their consumer counterparts like AMD Ryzen and Intel i9.
Having more lanes means that you can have more connections and therefore better speed.
For instance, if a program uses CUDA and you add a second NVIDIA GPU in SLI on a consumer CPU, you probably won't gain much, because you wouldn't have enough lanes to communicate with both GPUs effectively anyway.

Now, going back to your use case, encoding: what happened here regarding the first point is that your CPU had enough lanes available, but you only had a single RAM slot populated, so you were able to allocate memory and read from it only far too slowly. When you added RAM in the other slots, more lanes became available and the OS used them all at the same time to allocate and de-allocate memory, thus making encoding faster.


Now let's go to the second point.
When you encode a file, it can be of any resolution, framerate, bit depth etc., and it can be very different from the output you're targeting.
In FFAStrans there's something very complex called "filter_builder.a3x", which uses the info from both ffprobe and MediaInfo to create the perfect filter chain to reach your output.
Just like Avisynth, FFmpeg has its own "frameserver", in the sense that it "glues together" a series of decoders, filters and encoders.
For instance, in the case of AVC-Intra, your file will be decoded by libavcodec, it will go through a series of filters, and the uncompressed a/v stream (which lives in your RAM) will be passed to the encoder, x264 (in this case libx264 bundled inside FFmpeg), which will encode it and create the raw_video.h264 that the FFmpeg muxer (or the BBC muxer) will mux into .mxf, or whatever container you choose, on the fly (meaning that you won't see two files, just one).
Given that some filters may require spatial/temporal access, frames will be cached in RAM with malloc() and accessed, then discarded when they're not needed any longer, so you can see how important RAM is.
Not just that: with multiple filters, RAM can and will be distributed across modules so that access is faster per filter, but of course this can't be done on an intra-filter basis, as memory needs to be contiguous at least within the same filter in the threadpool (it's a bit more complicated than that, but I won't elaborate now; maybe later if you're interested :) ).
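The temporal-caching idea above can be sketched in a few lines. This is a toy model under my own assumptions (the `FrameCache` class, the `radius` parameter and the fake decoder are invented for illustration), not FFmpeg's or Avisynth's actual buffer code:

```python
from collections import OrderedDict

# Toy model of a frame cache for a temporal filter: frames are decoded
# lazily, kept in RAM, and discarded once no later step can request them.

class FrameCache:
    def __init__(self, decode, radius):
        self.decode = decode        # function: frame number -> frame data
        self.radius = radius        # how far back the filter may look
        self.cache = OrderedDict()  # frame number -> decoded frame

    def get(self, n):
        if n not in self.cache:
            self.cache[n] = self.decode(n)  # malloc()-style allocation
        return self.cache[n]

    def advance(self, current):
        # Frames older than current - radius can never be requested again,
        # so their memory is freed.
        for n in [k for k in self.cache if k < current - self.radius]:
            del self.cache[n]

def temporal_average(cache, n):
    """Average a frame with its neighbours (radius 1)."""
    cache.advance(n)
    frames = [cache.get(i) for i in (n - 1, n, n + 1) if i >= 0]
    return sum(frames) / len(frames)

# Fake "decoder": frame n is just the number n.
cache = FrameCache(decode=lambda n: float(n), radius=1)
out = [temporal_average(cache, n) for n in range(5)]
print(out)
print(len(cache.cache), "frames still cached")
```

The point of the sketch is only that a temporal filter forces several decoded frames to live in RAM at once, which is why memory speed shows up in encoding throughput.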


Disclaimer: I know very little about FFmpeg's threadpool, so a lot of what I wrote above is extrapolated from my knowledge of the Avisynth threadpool, as I expect the two to work in a very similar manner, which would explain the speed bump you got.

P.S. If you guys are interested in the inner workings of Avisynth and its thread pools, I can elaborate further on that. I could go on for hours xD


Cheers,
Frank
emcodem
Posts: 1631
Joined: Wed Sep 19, 2018 8:11 am

Re: Encoding Op1a Hardware Limitations

Post by emcodem »

FranceBB wrote: Thu Mar 02, 2023 2:14 pm When you encode a file, this can be of any resolution, framerate, bit depth etc
Each and every individual filter, decoder and encoder behaves very differently regarding threading and memory management. That is why, for benchmarking encoding speed, it is important to encode to the exact same properties as the input file (resolution, framerate, bit depth etc.) and, of course, to leave all audio aside. A benchmark that includes decoding/encoding and filtering is usually not very helpful, unless you have a strict workflow where you always have the same input file, filters and output file.
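That advice can be turned into a command line. A minimal sketch, assuming ffmpeg is on the PATH and that the input properties have already been probed; the hard-coded values below are placeholders you would normally fill in from ffprobe:

```python
# Build an ffmpeg command that encodes to the *same* properties as the
# source, so the benchmark measures the encoder itself rather than
# scaling, frame-rate conversion or audio handling.

src = {"width": 1920, "height": 1080, "fps": "25", "pix_fmt": "yuv420p"}  # placeholder probe results

cmd = [
    "ffmpeg", "-i", "input.mxf",
    "-an",                                                            # leave all audio aside
    "-vf", f"scale={src['width']}:{src['height']},fps={src['fps']}",  # no-ops when input already matches
    "-pix_fmt", src["pix_fmt"],                                       # same bit depth / chroma subsampling
    "-c:v", "libx264",
    "-f", "null", "-",                                                # discard the output, keep the timing
]
print(" ".join(cmd))
```

Because the filter chain matches the input it does (almost) nothing, and the `null` muxer throws the result away, so the wall-clock time is dominated by decoding and encoding.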
emcodem, wrapping since 2009 you got the rhyme?
FranceBB
Posts: 230
Joined: Sat Jun 25, 2016 3:43 pm
Contact:

Re: Encoding Op1a Hardware Limitations

Post by FranceBB »

Well, but in that case you will be benchmarking the encoder eheheheh

I think Andrea was doing some "real life scenario" tests more than real benchmarks eheheheh

When I want to run individual benchmarks, I too test encoders and filters separately (the former with x264/x265 directly and the latter with AVSMeter). Of course, I also test different builds compiled with different compilers, namely Clang LLVM, ICC (the Intel compiler), GCC and MSVC (the Microsoft compiler), with the last two usually scoring worse and the first two scoring better and being very close (although on AVX-512 Xeons, ICC is slightly faster).
andrezagato
Posts: 43
Joined: Tue Jun 09, 2020 4:07 pm

Re: Encoding Op1a Hardware Limitations

Post by andrezagato »

Wow, what a masterclass, FranceBB!
And that felt like just the intro! Amazing! Thanks for taking the time to answer me and shed some light on the subject. I had an overall knowledge of the PCI lanes on a CPU, but didn't know they would impact transcoding.
I would like to know more, but I feel that I won't understand much of what you would explain! hahaha I will try to look into it around the web to gain a better understanding.

And like you said, my tests were a "real life scenario", focusing specifically on my needs, so I didn't open up to other tests. I am just trying to be as effective as possible, so I can deliver the dailies as fast as possible to the rest of the team.
emcodem
Posts: 1631
Joined: Wed Sep 19, 2018 8:11 am

Re: Encoding Op1a Hardware Limitations

Post by emcodem »

Hehe, Frank just found a complicated way to express dual channel :D Your mainboard's manual should say something about how you have to place the memory modules in order to profit from this technology. The connection to memory is a little slower than one might think: a single DDR4 channel only does about 17 GB/s. Uncompressed 25fps HD video is about 2.5 Gbit/s. Imagine the encoding software (e.g. FFmpeg) needs to copy a single video frame only 2 times while processing: you could end up with 5 Gbit/s that needs to flow between CPU and RAM for one realtime process...
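These back-of-the-envelope numbers depend heavily on the pixel format the pipeline uses internally, so here is a rough reproduction under stated assumptions (the bits-per-pixel values are illustrative, not measured):

```python
# Rough bandwidth arithmetic: uncompressed video stream vs. memory traffic.

def video_gbit_per_s(width, height, fps, bits_per_pixel):
    """Raw bitrate of an uncompressed video stream in Gbit/s."""
    return width * height * fps * bits_per_pixel / 1e9

# 1080p25 in two common internal formats:
v420 = video_gbit_per_s(1920, 1080, 25, 12)     # 8-bit 4:2:0, 12 bits/pixel
v422_10 = video_gbit_per_s(1920, 1080, 25, 32)  # 10-bit 4:2:2 stored in 16-bit planes, 32 bits/pixel

print(f"8-bit 4:2:0:   {v420:.2f} Gbit/s")
print(f"10-bit 4:2:2:  {v422_10:.2f} Gbit/s")

# If the software copies every frame twice on its way through the chain,
# the CPU<->RAM traffic for one realtime job roughly doubles:
print(f"with 2 copies: {v422_10 * 2:.2f} Gbit/s")
```

Multiply that by several parallel jobs plus the traffic the filters themselves generate, and memory bandwidth stops being free.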

But it all depends on which software components you use. You can't really benchmark your hardware using a "real life scenario" where you work with filtering and all kinds of stuff; you would just be testing whether you have a shitty processing chain, and as long as you don't always process the exact same video format that you benchmarked, you could never guess whether another one is faster or slower, or uses more or fewer system resources.
emcodem, wrapping since 2009 you got the rhyme?