I found a model that can redact personally identifiable information using machine learning.
It needs a model – how the data will look, and a mask, what it will redact, but other than that it works.
Here’s a description of the process
High level overview
We’re using a deep learning technique called Semantic Segmentation to auto-redact information from the user-submitted W2 images. The model architecture we ultimately trained the data on was U-Net pre-trained on ResNet50.
Creating a labeled datasethttps://blog.usejournal.com/using-machine-learning-to-redact-personal-identifying-information-b95b53b935a9
So first things first, we need a labeled dataset that we can train our model on. I looked around online but wasn’t able to find any pre-labeled datasets for W2 semantic segmentation. So I just made my own using a tool called LabelBox. LabelBox is great, it’s an online tool you can use to make labeled datasets for object detection, semantic segmentation, and more. It took me a few hours to label over 300 W2 images.