Re: Webinterface

Posted: Wed Nov 13, 2024 11:37 pm
by emcodem
DCCentR wrote: Wed Nov 13, 2024 5:57 pm UPD: rebooting the server with FFAStrans dir solved the problem with security permissions on json's :D
Fantastic report, thanks a lot for the debugging you did.
So there is a recent bug open in ffastrans about leaving these files behind, but i guess losing all the security info on them basically means a bug in the storage or even in the Microsoft SMB client (unlikely). We sometimes have the same issue at work on NetApp with mxf files, but there are so many servers and clients involved that it's nearly impossible to debug. So we just inform the storage admins from time to time; they are able to delete the stuff from the command line directly on the storage.

However, this case should be much simpler because the reboot solved it. I guess one of our processes is still holding the file handle and won't let go of it. But we currently don't know whether it is ffastrans itself or webint.
Maybe next time you can apply the procedure to find out which process is holding a file handle to the affected file described here: https://serverfault.com/questions/1966/ ... in-windows

Also, if i remember correctly, on NetApp and Isilon we had to opt in to a related feature called "oplocks" (opportunistic file locking). Maybe you can check the storage documentation for a related setting and report back about it?

One thing i want to add for future emcodem here is that webinterface opens the files with "share mode delete" enabled. The documentation of these file open flags is not really good, but i believe it means that webint can open the file for reading "while" another process (ffastrans) is allowed to delete the same file. In this case, when webint closes the file handle, the OS should send a delete command to the storage (which it kind of did, otherwise why would the permissions have been lost?).
Don't get me wrong, i am not sure the problem is related at all to webinterface merely being present; it is absolutely possible that the same error would occur when ffastrans runs alone. We just don't know currently.
file_open_shared.jpg (38.43 KiB) Viewed 603 times
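
For reference, a minimal sketch of what opening a file with the delete share mode looks like at the Win32 level. This is Python with ctypes and not webint's actual code: the helper names share_mode and open_for_read_shared are made up for illustration, and only the CreateFileW constants are the documented Win32 values.

```python
import ctypes
import sys

# Documented Win32 constants for CreateFileW.
GENERIC_READ = 0x80000000
FILE_SHARE_READ = 0x00000001
FILE_SHARE_WRITE = 0x00000002
FILE_SHARE_DELETE = 0x00000004
OPEN_EXISTING = 3
FILE_ATTRIBUTE_NORMAL = 0x80
INVALID_HANDLE_VALUE = ctypes.c_void_p(-1).value


def share_mode(read=True, write=True, delete=True):
    """Combine share flags (hypothetical helper, for illustration only)."""
    mode = 0
    if read:
        mode |= FILE_SHARE_READ
    if write:
        mode |= FILE_SHARE_WRITE
    if delete:
        mode |= FILE_SHARE_DELETE
    return mode


def open_for_read_shared(path):
    """Open `path` for reading while still allowing others to delete it.

    Windows only. Returns a raw handle that must be closed with CloseHandle.
    """
    if sys.platform != "win32":
        raise OSError("CreateFileW is only available on Windows")
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.CreateFileW.restype = ctypes.c_void_p
    kernel32.CreateFileW.argtypes = [
        ctypes.c_wchar_p, ctypes.c_uint32, ctypes.c_uint32, ctypes.c_void_p,
        ctypes.c_uint32, ctypes.c_uint32, ctypes.c_void_p,
    ]
    handle = kernel32.CreateFileW(
        path, GENERIC_READ, share_mode(), None,
        OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, None,
    )
    if handle == INVALID_HANDLE_VALUE:
        raise ctypes.WinError(ctypes.get_last_error())
    return handle
```

On a local NTFS volume, a file opened this way can be deleted by another process, but it only really disappears once the last open handle is closed. Whether an SMB share behaves the same way in every case is exactly the open question here.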

Re: Webinterface

Posted: Wed Nov 13, 2024 11:46 pm
by emcodem
artjuice wrote: Wed Nov 13, 2024 6:47 pm Hi emcodem
We have been using the new webinterface_1.4.0.85 for a few days and today we noticed a problem: the FFAStrans WebInterface process uses almost 30 GB of RAM :D
Which logs can I send you for review?
Dear @artjuice,
it is so nice of you to support us in getting this webint release done, thanks also for this report. The issues are of course related to the latest changes which optimize caching. I already found and mitigated most if not all of them. I'll upload another prerelease as soon as i am happy with the changes and notify you here.
But man, really, you got 30 GB of RAM usage from job json file caching in a few days? How many jobs do you run through every day? :D
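
As a generic illustration of why a cache of job jsons can grow without limit and how a size cap avoids it, here is a small Python sketch. It is not the webint code and not the actual fix; the class name BoundedJobCache and the default limit are assumptions made up for this example.

```python
from collections import OrderedDict


class BoundedJobCache:
    """LRU cache with a hard entry limit (illustrative, not webint's code)."""

    def __init__(self, max_entries=1000):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, job_id):
        """Return the cached json for job_id, or None if it is not cached."""
        if job_id not in self._entries:
            return None
        self._entries.move_to_end(job_id)  # mark as most recently used
        return self._entries[job_id]

    def put(self, job_id, job_json):
        """Store an entry and evict the least recently used ones if needed."""
        self._entries[job_id] = job_json
        self._entries.move_to_end(job_id)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # drop the oldest entry

    def __len__(self):
        return len(self._entries)
```

Without the eviction loop in put, every job that ever passes through stays in memory, which on a busy farm adds up quickly.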

Re: Webinterface

Posted: Thu Nov 14, 2024 10:06 am
by DCCentR
emcodem wrote: Wed Nov 13, 2024 11:37 pm However, this case should be much simpler because the reboot solved it. I guess one of our processes is just still holding the file handle and won't let go of it. But we don't know currently if it is ffastrans itself or webint.
Maybe next time you can apply the procedure to find out which process is holding a file handle to the affected file described here: https://serverfault.com/questions/1966/ ... in-windows
The next one didn't take long :) This time it was jobs of the same workflow as before, but they were stuck in the web interface and in the status monitor too:
Снимок экрана 2024-11-14 110110.png (60.43 KiB) Viewed 587 times
I found that these files are locked only by the host machine (SRV-CINEGY1), which serves the FFAStrans dir and also does local processing:
FFAStrans dir server3.png (37.77 KiB) Viewed 587 times
Opened by Administrator for reading:
FFAStrans dir server.png (448.44 KiB) Viewed 587 times
I also managed to start Process Monitor on SRV-CINEGY1 about 13 minutes before the jobs got stuck. It seems that the problem started when cmd.exe appeared:
FFAStrans dir server2.png (1.72 MiB) Viewed 587 times
Here is the event list from Process Monitor filtered by one of the stuck jsons (20241114-0613-4367-35a0-e076c917518b~1-0-0.json): https://dropmefiles.com/LHJuA
And the full event list without a filter, which contains all three jsons (20241114-0615-1434-9971-43cca26ef65f~1-0-0.json, 20241114-0613-4367-35a0-e076c917518b~1-0-0.json, 20241114-0616-4487-7178-ca416d100302~1-0-0.json): https://dropmefiles.com/wuW7z
emcodem wrote: Wed Nov 13, 2024 11:37 pm Also, if i remember correctly on Netapp and Isilon we had to opt-in a related feature called "oplocks" (opportunistic file locking), maybe you can check back with the storage documentation if there is some related setting and report back about it?
For the farm we use self-built NAS'es and servers, which are Windows Server 2019+ machines with hardware RAID controllers. In this case, the host machine for the FFAStrans dir is a regular Windows 10 Pro.
It looks like Windows has registry settings for disabling oplocks for SMB1: https://support.storeporter.com/hc/en-u ... 20networks. But it seems that this is no longer relevant, as SMB1 is disabled by default in current versions of Windows.
emcodem wrote: Wed Nov 13, 2024 11:37 pm Don't get me wrong, i am not sure if the problem is actually anyhow related to webinterface's bare existence, it is absolutely possible that the same error would occur when ffastrans runs alone. We just don't know currently.
This time I didn't restart the server to solve the problem. Instead, I stopped the “FFAStrans REST-Service” and “FFAStrans Webinterface” services one by one on all servers in the farm.
After stopping all services, the files were still locked (I could read them, but could not move them) and were still held by System.
I then restarted all services, and after starting them on the last server I noticed that the json files were gone from the monitor directory.
I don't know after which server and service this happened (next time I'll try to wait longer after restarting each service), but the last server was the FFAStrans web interface server, and it failed to start the “FFAStrans REST-Service” service at first; it succeeded only on the fourth attempt.
When starting the service there was an error (I don't remember the number), followed by these warnings in the event log:
Child process [11576 -\\\192.168.100.231\FF_Install\processors\rest_service.exe /ErrorStdOut] finished with 1
and
The “FFAStrans REST-API Service” service terminated unexpectedly. It has done this 3 time(s).

Re: Webinterface

Posted: Thu Nov 14, 2024 10:28 pm
by emcodem
@DCCentR
sorry for the delay and again thanks for the spot on report and for helping us finally track this nasty thing down. I spent some hours analyzing what you sent and trying to reproduce but no success so far.
The cmd.exe you noticed is not the problem; it is meant to be a solution. It is started by ffastrans and runs in an endless loop, trying to delete the file every 5 seconds in case the normal file deletion did not work. Steinar told me that this workaround has been in ffastrans for a long time. But obviously the workaround does not work :D
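
To illustrate the idea of such a retry loop, here is a minimal Python sketch. It is not how ffastrans implements it (the real workaround is a cmd.exe loop); the function name delete_with_retry and the max_attempts parameter are assumptions added for this example.

```python
import os
import time


def delete_with_retry(path, interval_seconds=5.0, max_attempts=None):
    """Try to delete `path` until it is gone; return True on success.

    max_attempts=None retries forever, like the loop described above.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        try:
            os.remove(path)
            return True
        except FileNotFoundError:
            return True  # already gone, nothing left to do
        except OSError:
            # Typically a sharing violation or access denied: wait, retry.
            time.sleep(interval_seconds)
    return False
```

A cap on the attempts at least makes it visible when the retry can never succeed, which is what seems to happen with the stuck files here.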

What catches my interest in your PML and in the screenshot of the locked files is that the path is "V:". To be exact, procmon shows that ffastrans (exe_manager.exe) accesses the file sometimes using the path "V:" and sometimes using the full UNC path.
To be even more exact, within the same millisecond we see, for example, exe_manager.exe ReadFile access to these 2 paths:

\\192.168.24.231\FF_Install\Processors\db\cache\jobs\20241114-0613-4367-35a0-e076c917518b\log\20241114T082655692_9892_9f00e2_sys_SRV-CINEGY1.json
V:\FF_Install\Processors\db\cache\jobs\20241114-0613-4367-35a0-e076c917518b\log\20241114T082655692_9892_9f00e2_sys_SRV-CINEGY1.json
I first thought V: is a mapped network drive but according to my tests, access via mapped network drive letters is still logged in procmon with the full UNC path, at least in my installation.

Do you know what V: is and why every process seems to access both, the full UNC path and V:?
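
When going through a Process Monitor export, it can help to treat both spellings as the same file. A small Python sketch, assuming that both prefixes point to the same FFAStrans directory; the root list and the function names are made up for this example and need adjusting for a real installation.

```python
from pathlib import PureWindowsPath

# Assumed roots that point at the same FFAStrans directory (taken from the
# two paths above; this mapping is an assumption, not a confirmed fact).
INSTALL_ROOTS = [
    PureWindowsPath(r"\\192.168.24.231\FF_Install"),
    PureWindowsPath(r"V:\FF_Install"),
]


def path_inside_install(path, roots=INSTALL_ROOTS):
    """Return the path relative to whichever root matches, or None."""
    candidate = PureWindowsPath(path)
    for root in roots:
        try:
            return candidate.relative_to(root)
        except ValueError:
            continue  # not below this root, try the next one
    return None


def same_file(path_a, path_b, roots=INSTALL_ROOTS):
    """True if both paths name the same file below one of the roots."""
    rel_a = path_inside_install(path_a, roots)
    rel_b = path_inside_install(path_b, roots)
    return rel_a is not None and rel_a == rel_b
```

With this, the two ReadFile entries above compare as one file, so events can be grouped per json regardless of which prefix the process used.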

How can we go on with this: we must simplify it and make it easy to reproduce. First, i believe that it's an ffastrans problem and that webint's access to the json files does not influence it in any way, nor does Windows Defender's access. Second, i believe all reports about this problem involve workflows "with branches", and i also believe all of them always "hang" after the same processor in the workflow, in your case the RBS workflow after the "custom ffmpeg" processor.
Does every RBS job appear to "hang" after the custom ffmpeg or only some jobs?
Can you please share the workflow?

Re: Webinterface

Posted: Fri Nov 15, 2024 6:33 pm
by DCCentR
emcodem wrote: Thu Nov 14, 2024 10:28 pm Do you know what V: is and why every process seems to access both, the full UNC path and V:?
V: is a local disk on the SRV-CINEGY1 (192.168.24.231) server. This server is the host machine for the FFAStrans directory and it also does local processing within the farm.

To be honest, I don't even know why two paths are used at the same time :? On every machine in the farm (even the host machine 192.168.24.231 itself), FF was started from the full UNC path (\\192.168.24.231\FF_Install) and installed as a service.

Maybe (not sure) I had FFAStrans.exe open on the 192.168.24.231 server (besides the service itself) via the local path V:\FF_Install, through which I was editing workflows while connecting via RDP.
emcodem wrote: Thu Nov 14, 2024 10:28 pm Does every RBS job appear to "hang" after the custom ffmpeg or only some jobs?
Not every execution of this workflow ends up hanging. So far it has only happened on the files mentioned in my reports - here and here. Both times on the custom ffmpeg processor. That's all there has been so far.
And it's a fairly common task; it processes files every day:
Снимок экрана 2024-11-15 212545.png (355.44 KiB) Viewed 503 times
I set aside a couple of source files where the hangs were occurring. I'll try to reprocess them a few times.
emcodem wrote: Thu Nov 14, 2024 10:28 pm Can you please share the workflow?
UPD:
Since we mentioned this workflow, here's something else I noticed - for some reason the file column in the web interface does not contain the original extension:
1.png (207.12 KiB) Viewed 494 times
2.png (55.44 KiB) Viewed 494 times
It's not related to hangs and probably only cosmetic but still ;)

Re: Webinterface

Posted: Fri Nov 15, 2024 11:20 pm
by emcodem
@DCCentR
ok so steinar and i spent some time with everything you sent but have not made progress yet.
Would you be so kind as to upload the job directories for the stuff we see in your PMLs?

\Processors\db\cache\jobs\20241114-0613-4367-35a0-e076c917518b
and
20241114-0615-1434-9971-43cca26ef65f
20241114-0616-4487-7178-ca416d100302

Re: Webinterface

Posted: Sat Nov 16, 2024 7:18 am
by DCCentR
emcodem wrote: Fri Nov 15, 2024 11:20 pm @DCCentR
ok so steinar and i spent some time with everything you sent but have not made progress yet.
Would you be so kind as to upload the job directories for the stuff we see in your PMLs?

\Processors\db\cache\jobs\20241114-0613-4367-35a0-e076c917518b
and
20241114-0615-1434-9971-43cca26ef65f
20241114-0616-4487-7178-ca416d100302
20241114-0613-4367-35a0-e076c917518b.zip
(167.07 KiB) Downloaded 23 times
20241114-0615-1434-9971-43cca26ef65f.zip
(139.21 KiB) Downloaded 22 times
20241114-0616-4487-7178-ca416d100302.zip
(166.2 KiB) Downloaded 22 times

Re: Webinterface

Posted: Sat Nov 16, 2024 1:06 pm
by admin
Hi DCCentR,

We really appreciate your help with this issue, which is not an easy one to solve. Can I ask you to also send some logs of jobs that were OK? That is, jobs that did NOT have the stuck monitor files.

Thanks! :-)

-steinar

Re: Webinterface

Posted: Sat Nov 16, 2024 1:31 pm
by DCCentR
admin wrote: Sat Nov 16, 2024 1:06 pm Hi DCCentR,

We really appreciate your help with this issue, which is not an easy one to solve. Can I ask you to also send some logs of jobs that were OK? That is, jobs that did NOT have the stuck monitor files.

Thanks! :-)

-steinar
Hi steinar,
I appreciate your work, guys)

Here are two folders from FF_Install\Processors\db\cache\jobs:
20241116-1016-1647-0d7c-cff08662e0c3.zip
(239.26 KiB) Downloaded 21 times
20241116-1016-1144-3063-891cd41bfcd6.zip
(185.75 KiB) Downloaded 21 times
This is a re-processing of files that got stuck earlier. This time they were processed properly.

***
Today I found two more stuck tasks of the same workflow:
Снимок экрана 2024-11-16 163500.png (224.5 KiB) Viewed 442 times
Their folders from FF_Install\Processors\db\cache\jobs:
20241115-2342-1036-40f7-8cc9f022c250.zip
(264.43 KiB) Downloaded 19 times
20241116-0353-5797-9b79-4220dfba963c.zip
(297.33 KiB) Downloaded 20 times
The \\192.168.24.231\FF_Install\Processors\db\cache\monitor directory also had stuck json files that could not be opened for reading. I stopped the “FFAStrans Webinterface” service on my web server SRV-CINEGY2 (192.168.24.232), after which the json files immediately disappeared from the monitor directory, so these hung jobs are gone now.
On the host server (FF install dir) and on the web server, I started Process Monitor and captured events before stopping the service. These are the logs:
https://dropmefiles.com/cZeTv
https://dropmefiles.com/VIjnR

I hope this helps.

Re: Webinterface

Posted: Sun Nov 17, 2024 7:44 pm
by emcodem
Aside from Mr. @DCCentR 's problems, here is the newest "Prerelease" which at least fixes the latest memory leaks caused by the caching strategy updates from the last weeks.

https://github.com/emcodem/ffastrans_we ... g/1.4.0.89

@artjuice regarding "multiple systems shown in one webint", this is the first version that attempts to do this. In the network config, you can supply multiple "API Hosts", each of which must be an installation of webinterface. E.g. if you have multiple ffastrans systems, you install at least one webint per installation and set the list of webint hostnames in API Hosts like: host1,host2,host3...
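
Just to make the expected format of such a list clear, here is a tiny Python sketch of how a comma-separated value can be split into hostnames. It is only an illustration and not how webint actually parses the setting; the function name parse_api_hosts is an assumption.

```python
def parse_api_hosts(value):
    """Split a comma-separated host list, dropping blanks and duplicates."""
    hosts = []
    for item in value.split(","):
        host = item.strip()
        if host and host not in hosts:
            hosts.append(host)
    return hosts
```

For example, "host1,host2,host3" gives three separate hostnames in the order they were entered.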

@Stef i changed how log files are written because logging stopped every night for you. Maybe you can find the time to check whether you now get new log entries all day long, even after 03 a.m. :D

@DCCentR thanks again for all the insights you provided on the hanging running job display issue. After reviewing everything, we were still not certain about the details of how it got into this situation. However, steinar implemented something that should work around the problem. After all, it's just a matter of how ffastrans deals with concurrent file access in the /db folder, and there are many possible solutions for one problem :D Not sure if he will or can deliver a patch for the 1.4.0.7 version so you can verify the solution. @admin, what do you think?