Farm Machine Not Picking Up Jobs

ernie1901
Posts: 16
Joined: Fri Aug 07, 2020 8:50 am

Farm Machine Not Picking Up Jobs

Post by ernie1901 »

Hi all,

I have FFAStrans installed on a network share. All the PCs in my farm can access this network share. Most of the time this works great.
However, I've seen that at random times a PC can stop taking jobs from the queue. It's still online: I can ping the API via that machine, and it will still accept API job submissions or return the JSON of job history. It just won't process anything new from the queue.

A Windows reboot fixes it and the machine takes jobs again. There seems to be no pattern to this; the machine can stop processing new jobs within hours or days of the reboot. It's not the CPU limit either: the PC isn't busy before or after it stops receiving new jobs.
It seems to render the machine pretty useless other than for API access, as even a right-click manual submit on a workflow doesn't make the machine pick up the job.

Have you seen this behaviour before or do you have an idea of what could be causing it? I'm running Windows 10.

Thanks.
momocampo
Posts: 594
Joined: Thu Jun 08, 2017 12:36 pm
Location: France-Paris

Re: Farm Machine Not Picking Up Jobs

Post by momocampo »

Hello Ernie,
Ok so first, is it always the same host? (the same PC that doesn't take jobs)
When the issue occurs, have you checked that the "FFAStrans rest-api service" is still "running"? (Manage/services)
I have sometimes seen strange behavior, for example the REST service stopping for no reason.
;)
B.
ernie1901
Posts: 16
Joined: Fri Aug 07, 2020 8:50 am

Re: Farm Machine Not Picking Up Jobs

Post by ernie1901 »

Thanks for the quick reply @momocampo.

I've had it occur on 3 of 6 hosts so far.
The service is still running; I can even still send it a GET/POST request and it responds. It just doesn't do any jobs, it's really strange!
If I try to stop the host's API service and restart it, the Services window crashes and the service stalls in the 'Stopping' state. I have to reboot the machine completely.
emcodem
Posts: 1752
Joined: Wed Sep 19, 2018 8:11 am

Re: Farm Machine Not Picking Up Jobs

Post by emcodem »

Hey ernie,

I guess you are running version 1.1?

Maybe you can check out the "job ticket management" described here, check whether exe_manager is running, and also check the /tickets folder on the filesystem as described?
http://ffastrans.com/wiki/doku.php?id=system:processes

It is mostly interesting whether there are ticket files and, if yes, in which folder: the temp folder, running, etc.?
If I am correct, only one server from the farm stops working. When that's the case, log on to the server using the same username/password as your FFAStrans service is set to run as and check out the tickets on the central quorum location. Maybe you will see some error message from Windows Explorer when you try to access the quorum share.

[EDIT] Sorry, I need to correct myself: if only one machine in the farm stops picking up tickets, then it does not matter in which folders you actually see tickets. Only the second question matters, so the access to the quorum share from the machine that stopped working is really interesting. If it can still access the quorum location, please try to write a file to the /db folder too.
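For anyone who wants to script the write test rather than do it by hand in Explorer, here is a minimal Python sketch. It is not part of FFAStrans; the function name `can_write` and the example UNC path are made up for illustration. To be meaningful it should be run on the affected host under the same account the FFAStrans service runs as.

```python
import uuid
from pathlib import Path


def can_write(folder):
    """Create, read back and delete a small probe file in `folder`.

    Returns (True, "ok") on success, or (False, reason) on failure.
    """
    probe = Path(folder) / f"write_probe_{uuid.uuid4().hex}.tmp"
    try:
        probe.write_text("probe", encoding="utf-8")
        ok = probe.read_text(encoding="utf-8") == "probe"
        probe.unlink()  # clean up so nothing is left behind in the share
        return ok, ("ok" if ok else "read-back mismatch")
    except OSError as exc:
        # Covers missing paths, permission errors and network failures
        return False, f"{type(exc).__name__}: {exc}"


# Example call (hypothetical path, replace with your own quorum location):
# print(can_write(r"\\server\share\FFAStrans\Processors\db"))
```

A failure here on only the stuck host would point towards a share or permission problem; a success, as with the manual Explorer test, would suggest looking elsewhere.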
emcodem, wrapping since 2009 you got the rhyme?
ernie1901
Posts: 16
Joined: Fri Aug 07, 2020 8:50 am

Re: Farm Machine Not Picking Up Jobs

Post by ernie1901 »

Yes V1.1. Some further observations:

The machine currently not picking up jobs has multiple instances of 'exe_manager' in Windows Task Manager, but this looks like expected behaviour from the documentation you linked to.
The machine currently not responding to jobs can read and write from the db folder with no issue in Windows Explorer.

The txt file in db/cache/exe_log with the hostname was last modified yesterday. The remainder of the working hosts have a txt file modified today.
The bad host's JSON in db/configs/hosts has a last heartbeat time of over 24 hours ago, whereas the working hosts show about 10 minutes ago. This also indicates it can't write back to this file for some reason, even though I can open and edit it fine.
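The check described above (comparing last-modified times per host) could be automated roughly as follows. This is only a sketch: it looks at file modification times instead of parsing the heartbeat value, because the JSON field names are not shown in this thread. The function name `stale_files`, the example path and the one-hour threshold are all assumptions.

```python
import time
from pathlib import Path


def stale_files(folder, max_age_seconds, pattern="*", now=None):
    """Return (file name, age in seconds) for files in `folder` that have
    not been modified within `max_age_seconds`, oldest first."""
    now = time.time() if now is None else now
    result = []
    for path in Path(folder).glob(pattern):
        if path.is_file():
            age = now - path.stat().st_mtime
            if age > max_age_seconds:
                result.append((path.name, age))
    return sorted(result, key=lambda item: item[1], reverse=True)


# Example call (hypothetical path and threshold):
# for name, age in stale_files(r"\\server\share\FFAStrans\Processors\db\configs\hosts", 3600):
#     print(f"{name}: last modified {age / 3600:.1f} hours ago")
```

Run on a schedule from any machine that can read the share, this would flag a host whose file has stopped updating before anyone notices jobs piling up.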
emcodem
Posts: 1752
Joined: Wed Sep 19, 2018 8:11 am

Re: Farm Machine Not Picking Up Jobs

Post by emcodem »

That's great info, but unfortunately I still don't have a clue what could be the reason for these faults.

Is it possible for you to zip and send us the whole cache directory, including all logs and such? If uploading here doesn't work, maybe you can use WeTransfer and send us the link in a PM?
\Processors\db\cache
ernie1901
Posts: 16
Joined: Fri Aug 07, 2020 8:50 am

Re: Farm Machine Not Picking Up Jobs

Post by ernie1901 »

Hey @emcodem,

Confirming that so far (24 hours or so) this seems to be fixed on 1.1.1.0.
I will monitor and report back if that changes.
admin
Site Admin
Posts: 1680
Joined: Sat Feb 08, 2014 10:39 pm

Re: Farm Machine Not Picking Up Jobs

Post by admin »

Great, but you need to be aware that 1.1.1.0 is not released yet and may be prone to other bugs. I guess Frank informed you about that ;-)

-steinar
Den
Posts: 10
Joined: Wed Mar 04, 2020 7:06 pm

Re: Farm Machine Not Picking Up Jobs

Post by Den »

I have a similar problem. Still on 0.9.4 due to the issue ernie1901 described. When I moved to 1.1.0.2, jobs were only taken by 2 machines, whereas the other 4 machines were ignored. Almost the same HW & SW... Hope this gets addressed in the next update. It is blocking me from upgrading. Thanks
ernie1901
Posts: 16
Joined: Fri Aug 07, 2020 8:50 am

Re: Farm Machine Not Picking Up Jobs

Post by ernie1901 »

Den wrote: Tue Dec 29, 2020 10:06 am I have a similar problem. Still on 0.9.4 due to the issue ernie1901 described. When I moved to 1.1.0.2, jobs were only taken by 2 machines, whereas the other 4 machines were ignored. Almost the same HW & SW... Hope this gets addressed in the next update. It is blocking me from upgrading. Thanks
What do you have set as your Windows service recovery settings?
Mine was set to ignore any failures. You can change it to restart the service.

Yes, it shouldn't fail, but this at least restarts the Windows services if they fall over while the issue is investigated.
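
If you have several hosts to change, the recovery setting can also be applied from a script instead of the Services window. The sketch below builds a command for the Windows `sc.exe failure` tool from Python. The service name passed in is a placeholder, since the registered service names are not given in this thread (look them up in the service's Properties dialog), and the delay and reset values are arbitrary examples.

```python
import subprocess


def build_recovery_command(service_name, restart_delay_ms=60000, reset_seconds=86400):
    """Build an `sc.exe failure` command that restarts the service after
    the first, second and subsequent failures."""
    actions = "/".join(["restart", str(restart_delay_ms)] * 3)
    return [
        "sc.exe", "failure", service_name,
        "reset=", str(reset_seconds),  # seconds without failures before the failure count resets
        "actions=", actions,
    ]


def apply_recovery(service_name):
    """Run the command. Windows only, and needs an elevated (administrator) prompt."""
    return subprocess.run(
        build_recovery_command(service_name),
        capture_output=True, text=True, check=False,
    )


# Example call ("ExampleServiceName" is a placeholder, not a real FFAStrans service name):
# print(apply_recovery("ExampleServiceName").stdout)
```

Note that recovery actions only trigger when Windows registers the service as failed, so they may not help in the case described earlier where the service hangs in the 'Stopping' state.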