Automatic job retry after job fails

Questions and answers on how to get the most out of FFAStrans
michael85
Posts: 29
Joined: Wed Jan 13, 2021 9:10 am
Location: Germany

Automatic job retry after job fails

Post by michael85 »

Dear FFAStrans Community,

From time to time one of my workflows ends up in an unknown error. The solution is always to restart the job from the beginning in the status monitor, and then it runs successfully to the end. A log file of one of these failed jobs is attached.
Is it possible to set a number of automatic retries if a job fails in any node, or to generally set up an automatic restart of failed jobs?

Best regards
Michael
Attachments
20220623-1644-1227-6c1f-fc83236a87ba.json
(166.31 KiB) Downloaded 167 times
emcodem
Posts: 1752
Joined: Wed Sep 19, 2018 8:11 am

Re: Automatic job retry after job fails

Post by emcodem »

Hey michael,

I cannot tell what's wrong looking at this log; I'd need the "full_log.json" file for that.
Regarding retry, I use this for live recording, here is how it works:
emcodem_retry_with_count.json
(7.2 KiB) Downloaded 177 times
The 2 processors on top are an example workflow, encode and deliver. Both can fail, so both are connected to the retry nodes at the bottom. It should be possible to just copy/paste all the retry nodes (hold Ctrl and mark all the bottom nodes, then copy) into any of your workflows; there you connect all processors that can fail to the start of the retry chain (red input connector).
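In case the counting part is hard to picture from the node description alone, here is a rough standalone sketch of the "retry with a limit" idea in plain batch. It is only an illustration: the counter file name and the limit of 3 are made up, and the attached workflow keeps its counter inside FFAStrans (via the http processor) rather than in a script like this.

Code: Select all

@echo off
rem Rough sketch of the "retry with a limit" idea only - NOT what the attached workflow
rem does internally (it keeps its counter inside FFAStrans). The counter file name and
rem the limit of 3 below are made up for illustration.
rem %1 = source file of the job that failed
set "counterfile=%~1.retries"
set /a count=0
if exist "%counterfile%" set /p count=<"%counterfile%"
if %count% GEQ 3 (
    echo Giving up on "%~1" after %count% failed attempts.
    exit /b 1
)
set /a count+=1
(echo %count%)>"%counterfile%"
echo Attempt %count% of 3 - the job would be re-submitted here.
exit /b 0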

Let me know if you have any questions.
emcodem, wrapping since 2009 you got the rhyme?
FranceBB
Posts: 258
Joined: Sat Jun 25, 2016 3:43 pm
Contact:

Re: Automatic job retry after job fails

Post by FranceBB »

EDIT: Emcodem was faster than me to reply, but given that I wrote this while he was replying (having no idea he was actually replying), I'll leave it in case someone finds it useful.

Ok, so there are different ways to do this.
The easiest one would be to use the delete-cache approach for workflows that are based on watchfolders.

Go into the Workflow Manager, select the workflow, right click, then choose Insert processors > Others > Command Executor.


At this point, right click on the box and choose "Execute on error".


You can rename the box the way you want, but I like to call it "Try Again".
Inside the box, paste the following command:

Code: Select all

%comspec% /c"if exist "%s_cache_record%" del /f /q "%s_cache_record%""
so that the box is gonna contain just that one line.


Now all that is left to do is to connect this box to any other node you like in your workflow.
If the node outputs "Success", it will pass on to the next node; otherwise, if it outputs "Fail", the delete-cache command will be executed.
This means that every time this command is executed, the record of the processed file will be deleted from the watchfolder history, so the watchfolder will pick the file up again after a sleep cycle and the workflow will start from scratch.
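If it helps to read, here is the same one-liner written out with comments; what actually goes into the Command Executor is still the single line above:

Code: Select all

rem %s_cache_record% points to the watchfolder's cache record for the current file.
rem If the record exists, force-delete it quietly; on the next sleep cycle the
rem watchfolder no longer finds the file in its history and picks it up again.
if exist "%s_cache_record%" (
    del /f /q "%s_cache_record%"
)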

Side Note: by doing it this way, the result in the status monitor will be "Success" even if the job actually fails. The reason is that the workflow executes the command when things fail, so the workflow itself doesn't fail. To prevent seeing "Success" when it's actually "Failed", you can use the populate variable processor, set the %s_error% variable and add that node after the command executor, so that you're gonna see in the status monitor when the job actually failed.


Side Note 2: Some users might be tempted to connect nodes in a circle or to use things like a workflow calling another workflow to create a loop. This will not work. The reason is that during the FFAStrans implementation, we made sure that the user could not create an infinite loop 'cause it's... bad. Anyway, let me know if what I wrote makes sense.
emcodem
Posts: 1752
Joined: Wed Sep 19, 2018 8:11 am

Re: Automatic job retry after job fails

Post by emcodem »

The good thing about Frank's workflow is that it's easy and clean; it is also more compatible with the different "monitors" processors, e.g. P2, image sequence and such.
The bad thing is that it only works with watchfolder submission and there is no retry count, so it would potentially retry forever. One could combine both solutions and just replace the http processor in my example workflow above with the command line for deleting the cache record from Frank's version.
emcodem, wrapping since 2009 you got the rhyme?
michael85
Posts: 29
Joined: Wed Jan 13, 2021 9:10 am
Location: Germany

Re: Automatic job retry after job fails

Post by michael85 »

Many thanks for these great solutions and for your quick responses.
I have combined both of your solutions as emcodem mentioned in his last post. Since all of my workflows work with watch folders, there should be no trouble.

If it is useful for you, here is the full log for the failed job.
full_log.json
(116.96 KiB) Downloaded 146 times
Thank you guys.
Best regards
Michael
emcodem
Posts: 1752
Joined: Wed Sep 19, 2018 8:11 am

Re: Automatic job retry after job fails

Post by emcodem »

Oh, it just came to my mind that if you swap the http processor for the cmd, there is no easy way to let the next job know the current retry number, so after all it might be a good idea to keep it as it is :D But as usual, do whatever works for you!

Anyway, looking at the log, we see this:

Code: Select all

Exiting normally, received signal 2.
That was after about 30 minutes of processing time. Is the workflow expected to run that long anyway?

This means the ffmpeg.exe process was killed from outside. There could be a number of reasons for that, both FFAStrans-internal and on the OS side. I fear we need @admin (who is currently on vacation) to tell us whether there are FFAStrans-internal reasons that could cause this, like timeouts or such.
I am not aware of Windows killing processes when the server load is too high (I only know that from Linux), but maybe you want to enable the host metrics in the web interface and capture the system load, so that next time a job errors you can check out the CPU/RAM statistics and such.
emcodem, wrapping since 2009 you got the rhyme?