Building Data Engineering Pipelines For Health Analytics

CEMA is the Center for Epidemiological Modelling and Analysis, based within the University of Nairobi's Institute of Tropical and Infectious Diseases. The Center brings together epidemiologists, clinicians, data scientists and software engineers to inform studies into the distribution and determinants of health and disease in Kenya and the region, for the betterment of health. Qhala is an industry partner that supplies engineers and data scientists to inform health decision making through modelling and visualization of trends.

To visualize essential health indicators, we examined trends over a three-year period, with county-level comparisons, using the rates reported in the Kenya District Health Information System (DHIS2). We created an automated system that visualizes this health data to inform policy making and planning.

The flow of the process is shown in Figure 1 in the appendix.

The first step of the automation was to acquire the data, which we did using the Selenium library in Python. We developed a script that logged in to DHIS2 once Selenium was given user credentials and the URL of the site, and then downloaded the required datasets. The second step was to clean and transform the downloaded datasets for easy storage and visualization. Once the data wrangling was done, we stored the transformed data in two places: a PostgreSQL database, which our visualization tool later read from, and an Amazon Web Services (AWS) S3 bucket, which made the files available for download. We used AWS Lambda and API Gateway to develop a microservice that handled upload and download of files on the S3 bucket.
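To give a flavour of the clean-and-transform step, here is a minimal sketch that reshapes a wide export, with one column per reporting period, into long rows that are easier to store and chart. The column layout, the sample values and the name wide_to_long are assumptions made for illustration, not the actual DHIS2 export or the script used in the project. The sketch uses Python's built-in csv module so that it runs without extra installs; pandas offers the same kind of reshape through DataFrame.melt.

```python
import csv
import io


def wide_to_long(csv_text, id_column="county"):
    """Reshape a wide export (one column per period) into long rows.

    Blank cells are skipped and case counts are converted to integers.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for record in reader:
        county = record[id_column].strip()
        for period, value in record.items():
            if period == id_column:
                continue
            value = (value or "").strip()
            if not value:
                continue  # no report was filed for this period
            rows.append({"county": county, "period": period, "cases": int(value)})
    return rows


# Hypothetical sample in the assumed layout; these are not real DHIS2 figures.
SAMPLE = "county,2020-01,2020-02\nMandera,120,\nNairobi, 340 ,310\n"
```

Calling wide_to_long(SAMPLE) gives one dictionary per county and period, which maps naturally onto a database table with county, period and cases columns.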

To visualise the key health indicators we used Apache Superset, because it connects easily to our database and its visualizations reflect changes as soon as the data is altered.

Apache Airflow allowed us to schedule when and how the steps should run. We found Airflow efficient: it has a cool UI and a scheduler that made it easy to track job failures and to look at the logs in one place.

We used Directed Acyclic Graphs (DAGs) to organise our collection of tasks, with directed lines showing which task depends on which. Starting from any task and following the lines, it is not possible to arrive back at that same task, hence the acyclic nature of our workflows. As shown in Figure 2, all tasks are executed in order.
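The idea of an acyclic task graph can be sketched without Airflow at all. The example below uses Python's standard graphlib module (Python 3.9 or later) to derive a valid run order from task dependencies and to detect a loop. The task names are hypothetical labels for the steps described above, not the identifiers used in our DAGs.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
# The task names are hypothetical labels for the pipeline steps.
PIPELINE = {
    "download_from_dhis2": set(),
    "clean_and_transform": {"download_from_dhis2"},
    "load_into_postgres": {"clean_and_transform"},
    "upload_to_s3": {"clean_and_transform"},
}


def execution_order(graph):
    """Return one valid run order; raises CycleError if the graph loops."""
    return list(TopologicalSorter(graph).static_order())


def has_cycle(graph):
    """Report whether any task can be reached again from itself."""
    try:
        execution_order(graph)
    except CycleError:
        return True
    return False
```

Because the graph has no cycles, a run order always exists: the download comes first, the cleaning second, and the two storage tasks can follow in either order.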

We stored the DAGs in a directory within Airflow. The Airflow scheduler scans this directory for Python files whose contents contain the strings dag and airflow, parses all the DAGs at regular intervals and updates the metadata database with any changes. Our runs were triggered twice a month, once on the 22nd and once on the 30th. See Figure 3.
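According to the Airflow documentation, the default discovery step is a cheap content check: only Python files that mention both airflow and dag, in any letter case, are parsed. The sketch below imitates that check with the standard library only. It is a simplified illustration rather than Airflow's own code, and the function and file names are made up for the example.

```python
import tempfile
from pathlib import Path


def might_contain_dag(path):
    """Cheap pre-check: does the file mention both 'dag' and 'airflow'?"""
    content = Path(path).read_bytes().lower()
    return b"dag" in content and b"airflow" in content


def candidate_dag_files(dags_folder):
    """List the .py files in a folder that pass the pre-check, sorted by name."""
    return sorted(
        p.name for p in Path(dags_folder).glob("*.py") if might_contain_dag(p)
    )


def demo():
    """Create two throwaway files and report which one would be picked up."""
    with tempfile.TemporaryDirectory() as folder:
        Path(folder, "health_pipeline.py").write_text(
            "from airflow import DAG\n", encoding="utf-8"
        )
        Path(folder, "helpers.py").write_text(
            "def clean(rows):\n    return rows\n", encoding="utf-8"
        )
        return candidate_dag_files(folder)
```

One practical consequence is that a helper module kept in the same directory is skipped by the scan unless it happens to mention both words.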

Most of the tools that we used for the automation process are open source and easily accessible. The scripts were written in Python, using libraries such as pandas for cleaning and transforming the data. One important Python library in this case is Selenium, which automates interaction with websites. Its main benefit is that it supports multiple browsers, such as Firefox, Chrome and Edge. The library is imported into the scripts and given the required credentials, such as usernames and passwords, and most crucially the URL of the website.
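A login step of this kind might look roughly like the sketch below. It takes an already created Selenium driver as an argument, which keeps the function easy to try out with a stand-in object. The form field ids j_username and j_password are assumptions and must be checked against the actual login page; the function name is also hypothetical.

```python
# Locator strategy string; it has the same value as Selenium's By.ID constant.
BY_ID = "id"


def login(driver, url, username, password,
          username_field="j_username", password_field="j_password"):
    """Open the login page, fill in the credentials and submit the form.

    The default field ids are assumptions, not confirmed DHIS2 values.
    """
    driver.get(url)
    driver.find_element(BY_ID, username_field).send_keys(username)
    password_box = driver.find_element(BY_ID, password_field)
    password_box.send_keys(password)
    password_box.submit()  # submits the form that contains the password box
```

With Selenium installed, a real browser session would be passed in, for example login(webdriver.Firefox(), url, user, password), with the credentials read from environment variables rather than written into the script.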

Another tool used in this process is Amazon S3 (Simple Storage Service), where an S3 bucket holds the files. AWS API Gateway and AWS Lambda were used to develop an API that interacts with the S3 storage for upload and download of files. For visualization, Apache Superset was used. It is a data exploration and visualization platform designed to be visual, intuitive and interactive. It consists of two primary interfaces: SQL Lab, which enables fast and flexible access to data from our PostgreSQL database, and a data exploration interface that converts the data tables into rich visual insights.
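One common way to build a download endpoint with Lambda and API Gateway is to return a time-limited presigned S3 link. The sketch below shows that pattern as an assumption about how such a microservice could work; it is not a description of our deployed API. The S3 client is passed in so that the handler can be exercised without AWS access, and the name make_handler and the key query parameter are hypothetical.

```python
import json


def make_handler(s3_client, bucket, expires_in=300):
    """Build a Lambda-style handler that returns a time-limited download link."""

    def handler(event, context):
        # API Gateway proxy events carry query parameters here (may be None).
        params = event.get("queryStringParameters") or {}
        key = params.get("key")
        if not key:
            return {
                "statusCode": 400,
                "body": json.dumps({"error": "missing 'key' parameter"}),
            }
        url = s3_client.generate_presigned_url(
            "get_object",
            Params={"Bucket": bucket, "Key": key},
            ExpiresIn=expires_in,  # link lifetime in seconds
        )
        return {"statusCode": 200, "body": json.dumps({"url": url})}

    return handler
```

In a deployment, the handler would be created once at module level with a real client, for example make_handler(boto3.client("s3"), "my-bucket-name"), where the bucket name is a placeholder.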

Figure 4 in the appendix shows reported cases of diarrhoea at national and county level. Hovering over a county on the map displays its figure; in this case, Mandera reports 2.28k cases. The bar chart on the right shows reported cases in 2018, 2019 and 2020 up to August. Reported cases of diarrhoea are considerably lower in 2020 than in 2018 and 2019, following the announcement of the COVID-19 pandemic in the country. It would be interesting to explore why the number of cases fell so significantly in 2020. This may be due to more frequent handwashing or to fewer hospital visits out of fear of contracting COVID-19, although the literature suggests that diarrhoea has been observed in later stages of the pandemic in other economies (reference 1).


1. Fantao Wang, Shiliang Zheng, Chengbin Zheng, Xiaodong Sun. Attaching clinical significance to COVID-19-associated diarrhea. Life Sciences, Volume 260, 2020, 118312. ISSN 0024-3205.


Figure 1 — Pipeline Orchestration

Figure 2 — A DAG Graph View on Airflow

Figure 3 — Diagram showing a trigger for a DAG

Figure 4 — Superset Visualization of Diarrhoea as a reported case and Trends going back three years.

Article contributors: Elizabeth Thuo, Ian Koome, Claudine Wangari, Mutanu Malinda, Nderito Gatere
