Compare two sources and take files from one

Questions and answers on how to get the most out of FFAStrans
artjuice
Posts: 36
Joined: Mon Mar 20, 2023 11:33 pm

Compare two sources and take files from one

Post by artjuice »

Hi,

I can't find a solution for how to monitor two or more sources, compare the files, and, where the same file exists in both, take it from only one source.

My workflow involves two AirSpeeds: material is sent to both in parallel and must then be archived. This is where I need to avoid copying the same material twice, once from each AirSpeed. First I need to compare, then copy: if the material is present on both, copy only from AirSpeed-2; if it is present on only one of the two, copy from wherever it is.
emcodem
Posts: 1743
Joined: Wed Sep 19, 2018 8:11 am

Re: Compare two sources and take files from one

Post by emcodem »

Nothing inbuilt again, I fear :D
So any complex monitoring scenario needs the monitoring done in a separate process (your own script). The webint scheduler can run your script for you, but you need to do the logic work on your own.

If you can access the AirSpeeds via UNC, you might be able to build a solution with the Files Find plugin processor: monitor both servers with separate monitor processors, use Files Find to check whether the other location contains the file, then apply your logic and send one of the two jobs into "Dispel", which ends it and removes it from the job status monitor. These two jobs would still run separately; if you want them to interact, you would need external files or similar, which is not good practice if you can avoid it.

I guess the most common mistake users make in this scenario is drawing a mental connection between monitoring and the graphical workflow representation. It does not work the way it looks in the workflow: every monitor processor runs in its own separate process and kicks off its own new job whenever it finds something to process.
This means that as long as you stick to the inbuilt monitor processors, you need to design your workflows around this principle: every monitor kicks off its own copy of the job.
Anything else means you either work with scheduled jobs or you do the monitoring in a self-made tool and kick off jobs, e.g. via the API.
emcodem, wrapping since 2009 you got the rhyme?
artjuice
Posts: 36
Joined: Mon Mar 20, 2023 11:33 pm

Re: Compare two sources and take files from one

Post by artjuice »

Hi emcodem,

Thx for your reply.
It would be interesting to hear your opinion on the following options for implementing such a workflow:

1. We write a Python script that watches the specified network shares and creates a files.txt without duplicates, and FFAStrans processes this file.
I chose the route through the txt file because the workflow should run on a schedule.
If there were no schedule, we could push jobs immediately through the API (see the API sketch after this list).

2. Using monitor nodes, we produce one files.txt per network share containing only the file names, run a script over them that removes duplicates, and pass the unique names on to the processor.

3. Do the monitoring & copying fully through the script, and let FFAStrans pick up its results.
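
A minimal sketch of what the API push in option 1 could look like, assuming the FFAStrans REST API exposes POST /api/json/v2/jobs on the default port 65445 with wf_id and inputfile fields; the endpoint, port, field names and workflow id here are assumptions to verify against the API docs of your FFAStrans version:

Code:

import json
import urllib.request

# Assumed default FFAStrans API host/port and a hypothetical workflow id
FFASTRANS_HOST = "http://localhost:65445"
WORKFLOW_ID = "20230320-000000-000-0000000000000"

def submit_job(inputfile):
    # POST one job for the given source file to the (assumed) v2 jobs endpoint
    payload = json.dumps({"wf_id": WORKFLOW_ID, "inputfile": inputfile}).encode()
    req = urllib.request.Request(
        f"{FFASTRANS_HOST}/api/json/v2/jobs",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status, resp.read().decode())

submit_job(r"\\server2\share\directory2\clip001.mxf")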

Thx
emcodem
Posts: 1743
Joined: Wed Sep 19, 2018 8:11 am

Re: Compare two sources and take files from one

Post by emcodem »

Hey,

It depends a lot on the requirements, but I can give you some tips from experience to avoid unnecessary trouble:
1) To be robust, your script should be simple and do only what it must, nothing more.
2) Avoid the need to check whether files have stopped growing, and do not copy in your script either (FFAStrans has retry, reporting and failure handling built in).
3) Avoid the need for a "database" (somewhere you remember which files have already been processed).
4) Avoid a long-running script: only do the listing and comparing, then exit the process. If you need multiple runs, schedule the script to execute every minute (from either the Windows Task Scheduler or the webint scheduler).
5) Let your script write a log file (just append log messages to a text file and delete/recreate it when it grows beyond 10 MB or so). Log at least the startup time, anything new you find to process, and the graceful end of the script (see the logging sketch after this list).
6) Last but not least, avoid unnecessary extra-installed Python modules; the script is easily portable if you only use core modules.
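
For tip 5, a minimal sketch of such a log: append-only, deleted and recreated once it passes roughly 10 MB, core modules only. The log path is a hypothetical example:

Code:

import os
import time

LOG_FILE = r'C:\output\compare_airspeeds.log'  # hypothetical location
MAX_LOG_BYTES = 10 * 1024 * 1024               # delete/recreate beyond ~10 MB

def log(msg):
    # Crude rotation: drop the whole file once it grows past the limit
    if os.path.exists(LOG_FILE) and os.path.getsize(LOG_FILE) > MAX_LOG_BYTES:
        os.remove(LOG_FILE)
    with open(LOG_FILE, 'a') as f:
        f.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {msg}\n")

log("script started")
# ... list, compare and write txt files here, logging anything new found ...
log("script ended gracefully")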

I assume you can avoid the growing-file check because you just run the whole thing at night; if not, you can probably just ignore all files younger than X hours (where X is the maximum recording duration you expect).
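
If you do need that age check, a minimal sketch of the filter could look like this; MAX_RECORDING_HOURS is an assumption you would set to your longest expected recording:

Code:

import glob
import os
import time

MAX_RECORDING_HOURS = 6  # assumed longest recording duration

def old_enough(path, hours=MAX_RECORDING_HOURS):
    # Skip files modified less than `hours` ago; they may still be growing
    return (time.time() - os.path.getmtime(path)) > hours * 3600

# Only list .mxf files old enough to be considered complete
files_1 = set(
    os.path.basename(f)
    for f in glob.glob(r'\\server1\share\directory1\*.mxf')
    if old_enough(f)
)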

So your approach number 2 sounds good to me.
I would maybe use only one monitor processor and one directory where all the txt files go.
Each .txt can have the same name as the file to be transferred, and its content is the full UNC path (which already contains the server name).
In this case, you use a Populate processor at workflow start and set s_source = $read("%s_source%").
If you leave the txt file in the monitored folder after processing, you also get a "database" of what was processed this way. Your script then just needs to do its work and only write a new txt file if one does not already exist.

I hope what I wrote is easy to understand.

Here is roughly what I have in mind (untested); the FFAStrans watchfolder would be C:\output\airspeed_txt_files in this case.
For scheduling the whole thing, you could have a small FFAStrans workflow with two monitors, one for each AirSpeed; after the monitors, just execute your script and finish. Have another workflow for the txt monitoring and the real processing.

Code:


import glob
import os

# Specify the UNC paths
unc_path_1 = r'\\server1\share\directory1'
unc_path_2 = r'\\server2\share\directory2'
output_dir = r'C:\output\airspeed_txt_files'

# Ensure the output directory exists
os.makedirs(output_dir, exist_ok=True)

# List all .mxf files in both directories
files_1 = set(os.path.basename(file) for file in glob.glob(os.path.join(unc_path_1, '*.mxf')))
files_2 = set(os.path.basename(file) for file in glob.glob(os.path.join(unc_path_2, '*.mxf')))

# Build intersection and difference sets
intersection_files = files_1.intersection(files_2)
only_in_files_1 = files_1.difference(files_2)
only_in_files_2 = files_2.difference(files_1)

# Write one .txt per media file: the txt is named after the file and its
# content is the full UNC path FFAStrans should copy from
def write_files_with_path(file_set, unc_path, output_dir):
    for file_name in file_set:
        full_path = os.path.join(unc_path, file_name)
        output_file = os.path.join(output_dir, f"{file_name}.txt")
        if os.path.exists(output_file):
            continue  # txt already exists, this file was already queued
        with open(output_file, 'w') as f:
            f.write(full_path)
        print(f"Written {output_file} with path: {full_path}")

# Write txt files
write_files_with_path(only_in_files_1, unc_path_1, output_dir)
write_files_with_path(only_in_files_2, unc_path_2, output_dir)
write_files_with_path(intersection_files, unc_path_2, output_dir) # if material is on both, copy from server2
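
One note on the design: the os.path.exists check is what makes repeated scheduled runs safe. A txt file left in the watchfolder doubles as the record of what was already handed to FFAStrans, so no separate database is needed, in line with tips 3 and 4 above.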
emcodem, wrapping since 2009 you got the rhyme?