FFAStrans hardware configuration

veks
Posts: 80
Joined: Fri Oct 25, 2019 6:51 am

FFAStrans hardware configuration

Post by veks »

Hi all,
I have a question about what type of hardware you're running FFAStrans on, both standalone and as a farm.
I know the FFAStrans developers run it as a farm, so I'd like to know on which hardware.
Do you use CPU or GPU for transcoding?
Specific hardware, what CPU, what GPU, how much RAM?
Do you use virtual machines?

Have you found that the CPU produces better-looking video when transcoding? Some people say that the GPU produces lower-quality video, which I don't agree with, but...

Any other information that would be useful when buying hardware for such a workflow is welcome too :)

Thanks!
momocampo
Posts: 594
Joined: Thu Jun 08, 2017 12:36 pm
Location: France-Paris

Re: FFAStrans hardware configuration

Post by momocampo »

Hello Veks,
I will give you my configurations for FFAStrans. To be honest, I have a production installation and several test computers.

-Production installation: farm with 2 main computers and sometimes 3 (depending on the workflow).
2 x HP Z800 Xeon X5650 2.67 GHz with 48 GB RAM + 1 HP Z400 Xeon W3680 3.33 GHz with 26 GB RAM

-Test computer: HP Z440 Xeon E5-1650 3.50 GHz with 32 GB RAM

About CPU vs. GPU: recent Nvidia cards can use the NVENC H.264 encoder. My results (with a GTX 1080) are not great, sometimes faster, sometimes not... :(
CPU encoding uses all the available power; GPU encoding does not.
Anyway, in my opinion, CPU and storage (for cache files) are the most important elements.

Cheers.

B.
veks
Posts: 80
Joined: Fri Oct 25, 2019 6:51 am

Re: FFAStrans hardware configuration

Post by veks »

Thanks a lot momocampo!

As for NVIDIA CUDA GPU encoding, we're running it with an FFmpeg build compiled from the GitHub source with NVIDIA CUDA support.
Right now it runs on Linux Mint (latest) with an Nvidia Quadro P2000, and that works great!
The inputs are 6 Full HD HEVC channels that are being transcoded to 6 profiles.

It would be great to see FFAStrans support NVIDIA CUDA encoding :)

As GPU encoding is far faster and cheaper compared to CPU :)

P.S. do you have some benchmarks for these servers, how much simultaneous transcoding you can do depending on source and transcoded profiles? And other information related to it.
Ghtais
Posts: 164
Joined: Thu Jan 19, 2017 11:06 am

Re: FFAStrans hardware configuration

Post by Ghtais »

Hi

Interesting thread

1 x DELL PRECISION T7910 with 2 x XEON E5-2640 v3 @ 2.60 GHz / 64 GB RAM
2 x SSD (RAID 0) FFAStrans storage (local file and cache)
1 Nvidia GTX 1080
Ethernet 10 GbE for transfer
Windows 7 professional

I use only CPU.
64 GB of RAM is unneeded, as FFAStrans doesn't use more than 10 GB.

Unfortunately it doesn't use all the CPU power during transcoding, and I don't know why. It depends on your workflow; with the H264 processor it uses only 16% of the CPU.

cheers.
emcodem
Posts: 1752
Joined: Wed Sep 19, 2018 8:11 am

Re: FFAStrans hardware configuration

Post by emcodem »

Hi :-)

This topic must find its way into the wiki, but it will require a lot of resources because it must be done very scientifically and be 100% reproducible.

I use FFAStrans on a lot of different configurations, depending on the use case (which workflow?) and the available hardware. GPU is for consumer and web (proxy) stuff only, but I mostly create mezzanine formats (for production) like XDCAM HD or XAVC Class 480, sometimes even ProRes in various configurations.
In general I can say that virtual machines only make sense for very special setups, as my transcoding servers typically use all available CPUs most of the day (except for high-quality filtering workflows), which would not really make sense on a virtual machine. Also, passing hardware through to a virtual machine in order to use Intel, Nvidia or other hardware encoders is not easy to set up.
veks wrote: Mon Mar 02, 2020 3:44 pm It would be great to see FFAStrans support NVIDIA CUDA encoding
OK, up front: FFAStrans does not yet have much focus on the "delivery" sector, but what you want basically only serves the delivery sector (rather than the production sector). Even if we started working on the delivery sector, we would probably go for HLS and multi-bitrate first, because that topic is much more important than supporting the hardware encoders built into graphics cards.

Personally I would never use CUDA for encoding purposes (only for filtering). If I go for hardware encoding, I typically use Quick Sync or NVENC.
We had a testing session in that direction, and it did not look like we could easily integrate the existing ffmpeg encoders for Quick Sync and NVENC. The reason is that ffmpeg does not yet have them completely integrated, and we don't want to work around and reverse-engineer all the flaws. However, as they can be used for specific workflows anyway (e.g. always the same input file format), you can easily integrate them with a custom ffmpeg node.
veks wrote: Mon Mar 02, 2020 3:44 pm As GPU encoding is far faster and cheaper compared to CPU
Sorry, but that sounds like you jumped on the train of rumours and advertising. In general you are totally wrong. In detail, you can run very special workflows on graphics boards a few dollars cheaper than on CPU because of the power cost, but if you add up the engineering needed you will see that you burnt many thousands of dollars on it, maybe even hundreds of thousands, so it only pays off at extremely large scale (like thousands of servers).

OK, first, you typically don't use the GPU itself for encoding. The graphics boards from Intel, Nvidia and AMD have special chips on board called ASICs (Application-Specific Integrated Circuits), which can encode into a limited set of codecs and presets. The GPU is not even really used when those ASICs are doing the work. But there is other stuff, like filtering, for which we developers can use the GPU (e.g. using CUDA on Nvidia cards).
Anyway, my tests show that you can typically do about 3 parallel Full HD encodings in low quality on one Nvidia GTX 1080, each running at about 2.5x realtime, while on a CPU like the Core i7 9700K you can do 11x realtime encoding without any tuning.
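To put those two figures side by side, here is a small worked calculation. It is only a sketch under stated assumptions: the 11x figure is taken to be a single CPU encode, the three GTX 1080 encodes are taken to run at 2.5x each at the same time, and the function name is made up for illustration.

```python
def aggregate_realtime(parallel_jobs: int, speed_per_job: float) -> float:
    """Hours of video encoded per wall-clock hour across all parallel jobs."""
    return parallel_jobs * speed_per_job


# Figures quoted in the post; how they combine is an assumption.
gpu_total = aggregate_realtime(parallel_jobs=3, speed_per_job=2.5)
cpu_total = aggregate_realtime(parallel_jobs=1, speed_per_job=11.0)

print(f"GTX 1080 (3 parallel jobs): {gpu_total}x realtime in total")
print(f"Core i7 9700K (1 job):      {cpu_total}x realtime in total")
print(f"CPU / GPU throughput ratio: {cpu_total / gpu_total:.2f}")
```

Under these assumptions the card delivers 7.5x realtime in total against 11x for the CPU. The comparison says nothing about quality or power draw, only about throughput.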
Also, the ASICs always have various limitations, like no high-bitrate encoding and no 4:2:2 support, which makes them usable only for consumer stuff like the gaming industry (live streaming what you are currently playing to Twitch; that's what they are built for). No production codecs can currently be done using that hardware. Matrox and, I think, Rohde & Schwarz have some ASICs for production codecs; their cards start from 7k€.
veks wrote: Mon Mar 02, 2020 3:44 pm P.S. do you have some benchmarks for these servers, how much simultaneous transcoding you can do depending on source and transcoded profiles? And other information related to it.
This topic is not easy at all; it mostly depends on the input format and the conversions that you do. I can benchmark some CPUs and ASICs for you if you tell me exactly what you want to go for, e.g. speed for one single encoding at the lowest possible quality, or the highest quality, or such...
Ghtais wrote: Mon Mar 02, 2020 4:17 pm 64 GB of RAM is unneeded, as FFAStrans doesn't use more than 10 GB.

Unfortunately it doesn't use all the CPU power during transcoding, and I don't know why. It depends on your workflow; with the H264 processor it uses only 16% of the CPU.
FFAStrans is just driving a workflow; it does not use lots of RAM at all, typically just a few MB. FFmpeg and AviSynth use most of the RAM, and their demands depend on a lot of factors.
The RAM comes into use when dealing with UHD material. The companies I work for typically buy all their servers with a minimum of 128 GB, because the price of RAM does not really matter anymore and it is always better to have more than less ;-)
H.264 encoding alone should typically take most of your CPU, independent of the settings. 16% sounds like it is using exactly ONE physical core, which most likely relates to some filtering.

In the end, for production purposes, it all comes down to using a Core i processor instead of a Xeon if you want to go fast for a single stream or do filtering, typically because of the higher CPU clock.
emcodem, wrapping since 2009 you got the rhyme?
veks
Posts: 80
Joined: Fri Oct 25, 2019 6:51 am

Re: FFAStrans hardware configuration

Post by veks »

Maybe the difference depends on the production workflow and the kind of content it is being done for.
For example, in OTT you need to transcode, say, MPEG-2 to H.264 or H.265, which Quadro GPUs handle like nothing.
You can easily stack them in servers and use them for many simultaneous transcoding jobs.
Of course a GTX 1080 can't transcode more than 2 jobs at once, because it is locked to a maximum of 2 concurrent NVENC sessions, while Quadro GPUs have unlimited concurrent sessions available.
Check this:
https://devblogs.nvidia.com/nvidia-ffmp ... ing-guide/
And this comparison:
https://developer.nvidia.com/video-enco ... ort-matrix

As I mentioned, for OTT, where you have dozens of videos 24/7 and transcode in MBR (so let's say into at least 6 different profiles), whether VOD or live, it's cheaper to go with a single GPU that costs 430€ than with an i7 9700K, which costs 370€ and has only 8 cores.
For example, you could do at least 12 or even more simultaneous transcodings with P2000 from full HD to let's say HD videos at really high fps without a problem.

Also, as far as CPUs go, the AMD Ryzen 9 3900X beats any Xeon at a much lower price, and you can still do hardware encoding/decoding with it.

Sure, these are speculations based on my testing; I haven't tested it in production (yet).
emcodem
Posts: 1752
Joined: Wed Sep 19, 2018 8:11 am

Re: FFAStrans hardware configuration

Post by emcodem »

veks wrote: Tue Mar 03, 2020 8:23 am For example, you could do at least 12 or even more simultaneous transcodings with P2000 from full HD to let's say HD videos at really high fps without a problem.
If HD is not FullHD, what you mean by HD, 720p50?
To be honest, from my testing I don't think you can run 12 streams in realtime on a P2000, maybe only if they are SD and you choose the lowest quality settings.
Anyway, you are totally correct that the ASIC encoders are used for encoding delivery formats (e.g. the "OTT" use case).

In addition to your resources, I can add some experience: the customers I know who encode for OTT purposes in high quality go for CPU encoding, because they say it saves some bitrate compared to the results of ASIC encoders. But that is far beyond my knowledge; I never tested the "save bitrate and maintain quality" stuff. Some customers even do VP9 encoding at 0.01x realtime when an ASIC could do it at 1x realtime or faster... just to save some bits per second ^^

Anyway, our final conclusion at FFAStrans on this topic was that, as we don't have anyone with experience in sectors other than production, we cannot come up with anything in that direction right now.
veks
Posts: 80
Joined: Fri Oct 25, 2019 6:51 am

Re: FFAStrans hardware configuration

Post by veks »

Well said, and I'd really like to see someone with that kind of knowledge comment on this.
I was trying to find some papers about it online, but there is not much from 2019 or later.
Most are from before 2018, when GPUs were far slower and not as usable as nowadays.
NVENC/CUDA chips have changed a lot.
We're seeing them (and Nvidia Tegra) in many electric and AA/AI cars for other but similar purposes.

If HD is not FullHD, what you mean by HD, 720p50?
HD = 720p
FHD = 1080p

But ye, 720p50 or 720p30.
So, from 1080p30 to 720p30.

As I mentioned, in our workflow with a single Quadro P2000 driven by FFmpeg, we're transcoding 6 live Full HD channels to this:
H.264
1280x720, 3.8 Mbps H264 High Profile Level 4.1
1280x720, 2.8 Mbps H264 Main profile Level 3.1
960x540, 2.2 Mbps H264 Main profile Level 3.1
960x540, 1.7 Mbps H264 Main profile Level 3.1
640x360, 1.1 Mbps H264 Main profile Level 3.1
480x270, 0.6 Mbps H264 Main profile Level 3.1
Audio sampling 44.1 kHz, HE-AAC 96 kbps
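For illustration, the ladder above could be expressed as one ffmpeg invocation per channel with six outputs. This is a minimal sketch, not the command actually used in this setup: the input name, the output names and the MP4 container are placeholders, the helper names are made up, and audio is left out because HE-AAC support depends on how ffmpeg was built. The encoder name `h264_nvenc` and the `-profile:v`, `-level`, `-b:v` and `scale` options are standard ffmpeg.

```python
import shlex
from typing import List, NamedTuple


class Rendition(NamedTuple):
    width: int
    height: int
    kbps: int
    profile: str
    level: str


# The six video profiles listed in the post.
LADDER = [
    Rendition(1280, 720, 3800, "high", "4.1"),
    Rendition(1280, 720, 2800, "main", "3.1"),
    Rendition(960, 540, 2200, "main", "3.1"),
    Rendition(960, 540, 1700, "main", "3.1"),
    Rendition(640, 360, 1100, "main", "3.1"),
    Rendition(480, 270, 600, "main", "3.1"),
]


def total_video_kbps() -> int:
    """Sum of the video bitrates of all renditions for one channel."""
    return sum(r.kbps for r in LADDER)


def ladder_command(source: str, out_prefix: str) -> List[str]:
    """Build one ffmpeg command with one NVENC-encoded output per rendition."""
    cmd = ["ffmpeg", "-i", source]
    for index, r in enumerate(LADDER):
        cmd += [
            "-map", "0:v:0",                       # first video stream of the input
            "-vf", f"scale={r.width}:{r.height}",  # scaling is done on the CPU here
            "-c:v", "h264_nvenc",
            "-profile:v", r.profile,
            "-level", r.level,
            "-b:v", f"{r.kbps}k",
            "-an",                                 # audio omitted in this sketch
            f"{out_prefix}_{index}.mp4",           # hypothetical output name
        ]
    return cmd


print(shlex.join(ladder_command("channel1.ts", "channel1")))
print(f"Video bitrate per channel: {total_video_kbps()} kbps")
```

The six video bitrates add up to 12200 kbps (12.2 Mbps) per channel. A live setup would more likely write to a streaming output than to MP4 files; that part is deliberately left open here.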

EDIT1:
Forgot to mention that the GPU's NVENC chip is at around 70% usage while NVDEC is at 0%, as decoding is being done on the CPU. We didn't have time to investigate how to enable decoding via NVDEC, which wouldn't impact the GPU at all, as it's a separate decoder.
The GPU itself is at 20% usage (using nvtop on Ubuntu to see this data).
Memory, though, is at 4.4 GB out of a maximum of 5.3 GB. So the only problem for adding more streams would be GDDR5 memory, of which a higher-end Quadro GPU has, for example, 12 GB or more.
Ghtais
Posts: 164
Joined: Thu Jan 19, 2017 11:06 am

Re: FFAStrans hardware configuration

Post by Ghtais »

Hi

I think you get a very good result because the resizing process is more efficient on the GPU than on the CPU. Have you tested encoding your 1080p MPEG stream into a 1080p H264 file with GPU and CPU to see which is faster?

Could someone share a simple ffmpeg command for H264 using an Nvidia GPU?
I can run some tests with my current workflow to see if it is faster or not.

Thanks
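A possible starting point for such a test, as a minimal sketch: the helper below builds the same H.264 encode twice, once with ffmpeg's `libx264` (CPU) and once with `h264_nvenc` (Nvidia NVENC), with no scaling so the resolution stays at the source's 1080p. The file names, the 8M bitrate and the function name are placeholders for illustration, and the NVENC variant only works if the ffmpeg build includes NVENC support.

```python
import shlex
from typing import List


def h264_command(source: str, output: str, use_gpu: bool,
                 bitrate: str = "8M") -> List[str]:
    """Build an ffmpeg H.264 encode with either NVENC (GPU) or libx264 (CPU)."""
    encoder = "h264_nvenc" if use_gpu else "libx264"
    return [
        "ffmpeg", "-y",      # overwrite the output file if it exists
        "-i", source,
        "-an",               # drop audio so only video encoding is compared
        "-c:v", encoder,
        "-b:v", bitrate,
        output,
    ]


gpu_cmd = h264_command("input.mpg", "gpu.mp4", use_gpu=True)
cpu_cmd = h264_command("input.mpg", "cpu.mp4", use_gpu=False)
print(shlex.join(gpu_cmd))
print(shlex.join(cpu_cmd))
```

Both commands use the encoders' default presets, which are not equivalent in quality, so the timings are only a rough first impression.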
emcodem
Posts: 1752
Joined: Wed Sep 19, 2018 8:11 am

Re: FFAStrans hardware configuration

Post by emcodem »

@veks
for decoding how about that? https://devblogs.nvidia.com/nvidia-ffmp ... ing-guide/

What you should know about decoding on Intel/AMD/Nvidia cards is that you can of course only decode a few configurations of each codec, e.g. low bitrates, certain levels, etc. Much more interesting is that, due to its restrictions (it's a chip), it sometimes delivers defective video (decoding errors) where software decoders would not have any problems. So, from my experience, you can only use the hardware decoders in a workflow where the input video was created by a controlled set of encoders that always deliver the exact same output. Maybe I should ask Nvidia about that topic, because they are really advertising ffmpeg currently.

Also, when you do benchmarking and post results, please make sure you have made every effort to ensure nothing was disturbing the measurement. E.g. when you benchmark encoding, make sure that no filtering is done at all: feed the encoder the exact same resolution, fps, pixel format, color, etc. as you encode. Otherwise the speed and the result are influenced by factors that are different for every user, and the benchmark does not help at all.
Typically one concentrates on a single topic for benchmarking: either decoding, filtering or encoding. Combined tests are typically not called a benchmark but more a kind of blog.
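As a rough sketch of what such an encoding-only measurement could look like: the helper below runs ffmpeg with no filters, sends the encoded video to ffmpeg's null muxer (`-f null -`) so that disk writes do not affect the timing, and reports the result as a realtime factor. The function names and the way the clip duration is passed in are assumptions for illustration. Decoding the source is still part of the measured time, so a source that is very cheap to decode is assumed.

```python
import subprocess
import time
from typing import List


def null_encode_command(source: str, encoder: str) -> List[str]:
    """Encode video only, without filters, and discard the encoded output."""
    return ["ffmpeg", "-v", "error", "-i", source,
            "-an", "-c:v", encoder, "-f", "null", "-"]


def realtime_factor(media_seconds: float, wall_seconds: float) -> float:
    """Seconds of video processed per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return media_seconds / wall_seconds


def benchmark(source: str, encoder: str, media_seconds: float) -> float:
    """Time one encode and return its realtime factor (requires ffmpeg on PATH)."""
    start = time.perf_counter()
    subprocess.run(null_encode_command(source, encoder), check=True)
    return realtime_factor(media_seconds, time.perf_counter() - start)


# Example, for a hypothetical 60-second clip:
#   benchmark("clip.mov", "libx264", 60.0)
#   benchmark("clip.mov", "h264_nvenc", 60.0)
```

A 60-second clip that takes 24 seconds to encode gives a realtime factor of 2.5. Repeating each run several times and reporting the exact ffmpeg version and command line makes the numbers easier for others to reproduce.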