Command Line ETL Framework (Orange, 2018-2019)
Info
- Employer Orange Romania
- Project type Software development
- Period 2018-2019
- My Role Developer, Solution Architect
- Technologies Shell, PL/SQL, Java, HTML, JavaScript and awk
- Architecture Pipe-filter, Event-bus, Layered patterns
- Keywords Parallel processing, Distributed processing, Highly Scalable, No Single Point Of Failure, Machine Learning, Framework
- Skills developed Framework thinking, Parallel processing, Machine learning
- Detailed presentation View Here
Description
- The purpose of this project was to develop a scalable data processing framework and DevOps tool capable of creating multiple parallel flows, running them on multiple threads, processing data sets with billions of records per day, loading into multiple target databases (with different DBMSs and versions) and being distributed with no single point of failure.
- We developed this framework using Shell, PL/SQL, Java, HTML, JavaScript and awk and delivered it to several internal teams and systems. Each application admin can install the framework in a few steps and configure a new flow in minutes.
- On this project I acted as a developer and solution architect, collaborating with two other developers and one integration engineer. The project was implemented for the biggest telecom operator in Romania; its instances process network traffic generated by more than 10 million subscribers and can handle up to 35k records/s (up to 90B records/month) as of September 2019.
Technical presentation
This project was implemented in a private company, and all intellectual property rights belong to it. The original name and technical description are confidential and irrelevant for the purposes of this presentation.
The following presentation covers my own version, implemented entirely by me more than two years after the contract with the employer ended.
Project code name: GULA
GULA is…
- a command line ETL tool
  - works under Linux bash
  - offers a complete set of predefined commands that the app admin can use without external intervention
  - able to Extract data from files and databases, Transform content during the data lifecycle and Load data into the database
- a data manipulation tool
  - performs logical operations (lookups, aggregations, filters)
  - decodes multiple data formats
  - moves, archives and checks (format and content) data sets at runtime and in pre/post-processing
- a monitoring tool
  - has a smart logging mechanism
  - has its own self-verification mechanism
  - includes an audit system
  - has its own configurable web GUI
- a framework
  - can be instantiated
  - is fully configurable (no development needed)
  - is designed with scalability and enhancement in mind
  - is fully documented
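To give a feel for the command line interface, the sketch below shows what a typical session could look like. Every command name and flag here is a purely illustrative assumption; the real command set is confidential.

```bash
# Hypothetical GULA session -- all command names and flags are invented
# for illustration and do not reflect the confidential original.
gula status --global                          # overall instance health
gula flow start cdr_voice --threads 4         # start a flow on 4 threads
gula log tail --flow cdr_voice --level error  # smart logging: filtered tail
gula flow stop cdr_voice                      # stop the flow gracefully
```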
Purposes
- Collect input data (push or pull methods)
- Decode data sets
- Reduce development time for new flows (configuration only, no coding)
- Load data into the database on time, accurately, without errors or duplicates
- Automate the monitoring process
- Fix errors and reprocess data easily
- Provide well-documented failure scenarios
- Archive data sets
Key concepts
- FLOW – all the transformations applied to a data set along the time axis
- MODULE – one of the stages of a flow during the data set lifecycle
- PLUGIN – an additional component that works on its own time axis which implements custom features at the flow, module or framework level
- THREAD – a task executed simultaneously (or not) with other tasks as part of the same flow
- CONTROLLER – a component that manages flows, modules, plugins or threads
- FILE AUDIT – a component that supervises the input file streaming and performs tasks such as logging, archiving and checking for duplicates or corrupted data sets
- METRIC – a quality or quantity attribute of a flow, module, plugin or a state of the system
- HEARTBEAT – a component that collects METRICS that can be used to monitor the health of the system
- NOTIFICATION GATEWAY – a component that manages the alerts and notifications generated by the system, as well as the notification methods and distribution lists
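To make these concepts concrete, a flow definition in a framework of this kind could be a plain key-value file, along the lines of the hypothetical sketch below (every key name is invented for illustration):

```bash
# flows/cdr_voice.conf -- hypothetical flow definition (all keys illustrative)
FLOW_NAME=cdr_voice                    # FLOW: the data set's lifecycle
MODULES="audit decode load archive"    # MODULEs the data set passes through
THREADS=4                              # parallel THREADs for this flow
PLUGINS="geo_lookup"                   # PLUGINs hooked into this flow
HEARTBEAT_METRICS="records_per_sec files_pending"  # METRICs to collect
NOTIFY_LIST="ops@example.com"          # NOTIFICATION GATEWAY distribution list
```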
Feature list
- Parallel processing – each flow can be individually configured to be executed on multiple threads
- Data flow management – each flow has its own custom configuration
- Single file config – a single file configures all of the instance’s parameters (except flow decoder & definition)
- Portable – no external dependencies at installation (apart from the prerequisites)
- Scalable – new flows, threads and target databases can be added through configuration alone; supports new plugin integrations
- Can read any type of data file – for well-known formats only configuration is needed
- Has its own command set – can be controlled without Linux shell commands or other external tools
- Audit – dynamically checks every input file for data integrity, duplication, archiving and reprocessing conditions
- Notification gateway – allows the configuration of delivery methods, priority levels, distribution lists
- Archive management – allows custom archiving or timer-based archiving for each flow
- Multiple target databases – allows parallel data loading on multiple databases for the same instance
- Duplicate checker – checks the file content, the file name, or both for duplication (a sketch follows the feature list)
- Modular – each flow goes through available core modules and optionally through pre/post-processing modules
- Status – app-admin has multiple modes to check the status of the system or of some components (global status, only for running flows/modules/plugins, only if there are health issues) and a complete log with the history of the instance’s status
- One-click setup instance – a new instance can be installed just by executing a local script
- Self-learning – learns from its own heartbeat component (an anomaly-detection sketch follows the feature list)
- Smart logging – the log can be tailed on multiple levels by flow, module, plugin, file name, keyword or criticality
- One-click setup flow – a new flow can be created just by executing a script
- Process management – the lifecycle of each flow can be configured individually
- Force self-kill – if processes become blocked (especially for external reasons), the app admin can use the internal mechanism to force-kill running components of the instance (flows, modules, plugins – partially or completely, or a global kill)
- Customizable – each of the following parameters can be custom configured for each instance: paths, filenames, directory names, flag names, log names, flow’s running frequency
- Speed configuration – the app admin can increase or decrease processing speed by choosing whether the loading process prioritizes old or new data sets, enabling or disabling index maintenance, and setting the commit interval and the number of threads
- Running time – the app admin can decide how many iterations a flow/plugin runs before it stops automatically
- Partial data file loading & reprocess – for corrupted data scenarios, the system can be configured to process everything except the corrupted part, everything up to the corrupted part, or nothing at all; at the end of the data set it automatically decides whether to commit, roll back or set a commit point for that file
- Thread balancer – when a flow runs on multiple threads and data sets arrive more slowly than the threads can process them, the thread balancer equalizes the load across unused threads (a sketch follows the feature list)
- Plugin ready – any developer can create and integrate a new feature by writing a plugin and configuring the GULA controller to handle it (a skeleton plugin follows the feature list)
- Dynamic directory hierarchy – all directory names and paths can be dynamically configured and auto-created if missing
- Log rotator – logs are rotated every time the configured time or size thresholds are exceeded
- Self monitor – GULA automatically monitors its own activity (errors in logs, corrupted data sets, processing speed, available resources, overall health), learns its “normal” operating conditions and notifies only when anomalies occur
- Self cleaning – GULA cleans up its own old logs and archived data without external intervention
- WYSIWYG Web GUI – a built-in web interface with a visual control panel, an integrated webserver and database drivers, helping the app admin monitor the status of each flow and overall system performance
- Cost effective archiving model – archives can be stored on cheaper/slower storage
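As promised above, here is a minimal sketch of how a duplicate checker along these lines can be assembled from standard shell tools. It assumes a simple append-only registry of "checksum filename" pairs; the real audit component works differently and is confidential.

```bash
#!/usr/bin/env bash
# Duplicate-check sketch: reject a file if its content hash or its name
# has been seen before. The registry layout is a hypothetical assumption.
REGISTRY=audit/processed.registry
file="$1"

sum=$(md5sum "$file" | awk '{print $1}')   # content fingerprint
name=$(basename "$file")

# Match on either the hash (content duplicate) or the name (name duplicate).
if grep -q -e "^$sum " -e " $name\$" "$REGISTRY" 2>/dev/null; then
    echo "DUPLICATE: $name" >&2
    exit 1
fi
echo "$sum $name" >> "$REGISTRY"
```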
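The thread balancer can be reduced to a dispatcher that always routes the next pending file to the least-loaded thread queue. The sketch below is an illustrative simplification with invented paths, not the original algorithm:

```bash
#!/usr/bin/env bash
# Thread-balancer sketch: move each pending file into the per-thread queue
# directory that currently holds the fewest files.
THREADS=4
for f in incoming/*.dat; do
    [ -e "$f" ] || continue
    best=1; min=-1
    for t in $(seq 1 "$THREADS"); do
        mkdir -p "queue/$t"
        n=$(ls "queue/$t" | wc -l)
        if [ "$min" -lt 0 ] || [ "$n" -lt "$min" ]; then
            min=$n; best=$t
        fi
    done
    mv "$f" "queue/$best/"
done
```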
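At its core, the self-learning monitor is baseline-plus-deviation detection over heartbeat metrics. A minimal version in awk (one of the project's languages) could look like this; the metric file layout and the 3-sigma threshold are assumptions for illustration:

```bash
#!/usr/bin/env bash
# Anomaly-check sketch: learn mean/stddev of a metric from heartbeat history,
# then flag the current sample if it deviates by more than 3 sigma.
HISTORY=heartbeat/records_per_sec.log   # one numeric sample per line
current="$1"

awk -v cur="$current" '
    { sum += $1; sumsq += $1 * $1; n++ }
    END {
        if (n < 10) exit 0              # not enough history to judge yet
        mean = sum / n
        var  = sumsq / n - mean * mean
        sd   = (var > 0) ? sqrt(var) : 0
        if (sd > 0 && (cur > mean + 3 * sd || cur < mean - 3 * sd)) {
            printf "ANOMALY: %s (baseline %.1f +/- %.1f)\n", cur, mean, sd
            exit 1
        }
    }' "$HISTORY"
```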
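Finally, plugin integration can follow a simple contract: each plugin is an executable that responds to a handful of action verbs invoked by the controller. The interface below is purely an assumed illustration; the real plugin API is confidential.

```bash
#!/usr/bin/env bash
# plugins/geo_lookup.sh -- hypothetical plugin skeleton. The action-verb
# contract is an assumption made for this sketch.
case "$1" in
    start)  echo "geo_lookup: starting" ;;
    run)    echo "geo_lookup: enriching $2" ;;   # $2 = data file to process
    status) echo "geo_lookup: idle" ;;
    stop)   echo "geo_lookup: stopping" ;;
    *)      echo "usage: $0 {start|run|status|stop} [file]" >&2; exit 2 ;;
esac
```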
Team
While working on this project I collaborated with an amazingly creative and dedicated team. Please reach out to the following team members for further references:
- Daniela Niculae – Billing Enablers and Convergent Services Manager @Orange Romania SA
- Delia Ioana Lepadatu – Software Development Engineer @Orange Services
- Dorin Florea – Database Infrastructure Manager @Orange Romania
Work with me
For further details please check out the full technical presentation here.
If you are interested in even more details, need consultancy, or just want to have an in-depth discussion on technical key concepts such as scalability, parallel processing or software framework design, please contact me privately.
Also, if you need a product like this or have a problem with your current ETL, feel free to ping me anytime.