BUSINESS PROBLEM
Our client's data pipeline was processing about 1 task per minute. The main bottleneck was that each task required custom business logic, and some tasks also required OCR. As a result, processing a large volume of data usually took several days.
SOLUTION
– The DE team initially came up with a few potential solutions and discussed them with the customer to ensure we picked the most suitable one.
– We decided to use parallel processing via message queues (see the sketch after this list).
– Instead of using a single computer for all the data processing, we split the data management and the processing itself across separate worker nodes (different computers).
– This approach completely decouples the data-processing and data-orchestration applications.
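To illustrate the producer/worker split, here is a minimal sketch assuming RabbitMQ as the message broker and the pika Python client. The queue name, broker host, and the process_task() business logic are hypothetical placeholders, not the actual implementation:

    # Sketch only: assumes a RabbitMQ broker reachable at "localhost"
    # and the `pika` client library. Names below are illustrative.
    import json
    import pika

    QUEUE = "tasks"  # hypothetical queue name

    def publish_tasks(tasks):
        """Orchestrator side: enqueue tasks instead of processing them locally."""
        conn = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
        ch = conn.channel()
        ch.queue_declare(queue=QUEUE, durable=True)  # queue survives broker restarts
        for task in tasks:
            ch.basic_publish(
                exchange="",
                routing_key=QUEUE,
                body=json.dumps(task),
                properties=pika.BasicProperties(delivery_mode=2),  # persistent message
            )
        conn.close()

    def process_task(task):
        """Placeholder for the custom business logic / OCR step."""
        print("processing", task)

    def run_worker():
        """Worker side: run one of these per worker node; add nodes to scale out."""
        conn = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
        ch = conn.channel()
        ch.queue_declare(queue=QUEUE, durable=True)
        ch.basic_qos(prefetch_count=1)  # hand each worker one task at a time

        def on_message(ch, method, properties, body):
            process_task(json.loads(body))
            ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only after success

        ch.basic_consume(queue=QUEUE, on_message_callback=on_message)
        ch.start_consuming()

Because the orchestrator only enqueues messages, throughput scales by starting run_worker() on additional machines, and prefetch_count=1 keeps slow OCR tasks from piling up on a single worker.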
IMPACT
– Using this approach, we were able to process over 50 tasks per minute, up from about 1.
– The cost increase was minimal, as most of the solution was architected to run with limited external dependencies.
– The software also remains flexible and can scale to larger data volumes.