Use Philter with Apache NiFi

Apache NiFi is a powerful application for processing, transforming, and moving data. By using Philter with Apache NiFi you can find and remove sensitive information moving through your Apache NiFi data flows. See Philter’s features.

Prerequisites

An instance of Philter is required. You can launch one in the cloud or as a Docker container. If you are on AWS, CloudFormation and Terraform scripts are available for launching either a single instance of Philter or a load-balanced, auto-scaled set of Philter instances.

Integrating Philter with Apache NiFi

To integrate Philter with Apache NiFi, we will use Philter’s API to filter sensitive information from text. The Apache NiFi flow sends text to Philter, and Philter returns the filtered text. We will use Apache NiFi’s InvokeHTTP processor to make the API call to Philter. We are using Apache Kafka to manage the incoming and outgoing streaming text, but this is not required. You could modify the Apache NiFi data flow to get text from and deliver text to other systems and omit Apache Kafka from the flow.
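To make the API call concrete, here is a minimal Python sketch of the kind of request the InvokeHTTP processor would send. It is an illustration under assumptions, not the flow's actual configuration: the endpoint path /api/filter, the plain-text content type, and the host philter.example.com on port 8080 are assumed here and should be checked against the API documentation for your Philter version. The function names build_filter_request and filter_text are hypothetical.

```python
import urllib.request

# Assumption: Philter exposes its filter endpoint at /api/filter and accepts
# plain text in the body of a POST request. Verify against your Philter docs.
FILTER_PATH = "/api/filter"


def build_filter_request(base_url: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the text to filter."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + FILTER_PATH,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def filter_text(base_url: str, text: str, timeout: float = 10.0) -> str:
    """Send the text to Philter and return the response body as a string."""
    request = build_filter_request(base_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Hypothetical host and port, used only to show the shape of the request.
example = build_filter_request(
    "https://philter.example.com:8080", "His SSN was 123-45-6789."
)
```

In the NiFi flow, the URL built here corresponds to the InvokeHTTP processor’s Remote URL, and the request body corresponds to the content of the incoming flow file.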

Here’s an illustration of our data flow:

The Apache NiFi flow:

The text to be filtered has previously been published to an Apache Kafka cluster. The ConsumeKafka Apache NiFi processor consumes the text from the Kafka brokers and brings it into the data flow.

An InvokeHTTP processor sends the text consumed from the Kafka brokers to Philter via Philter’s API. Philter responds with the filtered text, which is then published to a separate Kafka topic via the PutKafka processor. When complete, we have two topics on Kafka: the first topic contains the unfiltered text and the second topic contains the filtered text.
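The consume, filter, and publish steps described above can be modeled in a few lines of Python. This is only an in-memory sketch: the two lists stand in for the Kafka topics, fake_filter stands in for the call to Philter, and the redaction marker it produces is made up for illustration. All names here are hypothetical and none of this is NiFi or Kafka code.

```python
from typing import Callable, Iterable, List


def run_flow(
    consume: Iterable[str],
    filter_text: Callable[[str], str],
    publish: Callable[[str], None],
) -> int:
    """Pass each consumed message through the filter and publish the result."""
    count = 0
    for message in consume:
        publish(filter_text(message))
        count += 1
    return count


# Stand-ins for the two Kafka topics (illustration only).
unfiltered_topic: List[str] = ["Patient John Smith was seen today."]
filtered_topic: List[str] = []


def fake_filter(text: str) -> str:
    """Placeholder for the Philter API call; the marker format is invented."""
    return text.replace("John Smith", "[REDACTED]")


processed = run_flow(unfiltered_topic, fake_filter, filtered_topic.append)
```

After the run, the first list still holds the original text and the second holds the filtered text, mirroring the two topics in the flow.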

If we already had a pipeline using Apache Kafka and Apache NiFi, this configuration would let us insert Philter into the pipeline with minimal changes. Our downstream process would only need to change the name of the Apache Kafka topic it reads from to the name of the topic containing the filtered text. The configuration presented here is a straightforward way to add the removal of sensitive information to an existing pipeline.

This flow uses only processors that are included with the standard Apache NiFi distribution, so no additional processors need to be installed.

Processor Configurations

ConsumeKafka

InvokeHTTP

PutKafka

Considerations

We are using a single instance of Philter in this article. For a production environment, a cluster of Philter instances deployed behind a load balancer would provide improved performance. The only change to the Apache NiFi flow configuration would be to change the InvokeHTTP processor’s Remote URL to point to the load balancer instead of an individual Philter hostname or IP address.