Redact PII Using Machine Learning

I found a model that can redact personally identifiable information using machine learning.

It needs two things: a model of how the data will look, and a mask marking what it should redact. Other than that, it works.

Here’s a description of the process:

High level overview
We’re using a deep learning technique called semantic segmentation to auto-redact information from user-submitted W2 images. Semantic segmentation assigns a class to every pixel, so the model’s output is a mask of the regions to redact. The architecture we ultimately trained was a U-Net with a pre-trained ResNet50 backbone.
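To make the mask idea concrete, here is a minimal sketch of the final redaction step, assuming the model has already produced a binary mask with the same shape as the image. The `redact` function, the nested-list image representation, and the black fill value are illustrative assumptions, not the original pipeline; real code would more likely operate on image arrays.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels, 0-255 (assumed representation)
Mask = List[List[int]]   # 1 = redact this pixel, 0 = keep it


def redact(image: Image, mask: Mask, fill: int = 0) -> Image:
    """Return a copy of `image` with every masked pixel replaced by `fill`."""
    same_rows = len(image) == len(mask)
    same_cols = all(len(r) == len(m) for r, m in zip(image, mask))
    if not (same_rows and same_cols):
        raise ValueError("image and mask must have the same shape")
    return [
        [fill if m else px for px, m in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


# Tiny 2x3 "image" and a mask covering three of its pixels.
image = [[200, 210, 220], [230, 240, 250]]
mask = [[0, 1, 1], [0, 0, 1]]
redacted = redact(image, mask)
```

The original image is left untouched, so the unredacted copy can still be discarded or stored separately as needed.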

Creating a labeled dataset
So first things first, we need a labeled dataset to train our model on. I looked around online but wasn’t able to find any pre-labeled datasets for W2 semantic segmentation, so I made my own using a tool called LabelBox. LabelBox is great: it’s an online tool for building labeled datasets for object detection, semantic segmentation, and more. It took me a few hours to label over 300 W2 images.
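For training, each labeled image needs a per-pixel target mask. The sketch below shows one way to rasterize labels into such a mask, assuming each label has been reduced to a rectangle given as (top, left, bottom, right). That box format and the `boxes_to_mask` helper are hypothetical and are not LabelBox’s export format.

```python
from typing import List, Tuple

# Hypothetical label format: (top, left, bottom, right), bottom/right exclusive.
Box = Tuple[int, int, int, int]


def boxes_to_mask(height: int, width: int, boxes: List[Box]) -> List[List[int]]:
    """Build a height x width mask with 1 inside any labeled box, 0 elsewhere."""
    mask = [[0] * width for _ in range(height)]
    for top, left, bottom, right in boxes:
        # Clamp each box to the image bounds before filling it in.
        for y in range(max(top, 0), min(bottom, height)):
            for x in range(max(left, 0), min(right, width)):
                mask[y][x] = 1
    return mask


# One labeled region on a 4x5 image.
mask = boxes_to_mask(4, 5, [(1, 1, 3, 4)])
```

Each image and mask pair produced this way is one training example for the segmentation model.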

