Can you apply Machine Learning to Data Cleansing?
Machine learning is one of the big buzzwords, or phrases if you like, floating around at the moment.
In short, it is a branch of AI.
There is an important distinction to make here. It is generally acknowledged that there are two sides to Artificial Intelligence: Generalised AI and Applied AI. Applied AI includes the tech behind self-driving cars and trading programs, much of which is itself built on Machine Learning. Generalised AI covers the types of device or system that can, in theory, handle any task. Machine Learning, where a system learns from the data rather than following hand-written rules, is widely seen as the route towards it, with the ultimate aim of replicating or even improving upon human cognitive abilities.
With that in mind, as anyone who has had to deal with it will know, data cleansing has been an issue for a long time, and as the amount of data increases dramatically year on year it will become more and more of one. After all, without good data, almost everything else in a technology setup struggles, from integrations, to analytics, to making informed decisions.
With these two subjects on the table, the obvious next question is, ‘Surely we can use machine learning to speed up and improve the data cleansing process?’
It’s a good question and one that does need to be answered. So let’s keep things simple: I’ll start with a pros and cons list and then we can explore a few of the key points in detail.
Pros
- It’s quick, much quicker than a manual process, and as we all know, time is money
- It isn’t clear how advanced it is at this stage for this particular task, but it will improve as time goes on
- As the sheer volume of data keeps growing at speed, the manual approach starts to look old-fashioned and will struggle with the amount of data to process
- It’s repeatable: once it is working, it can be run consistently going forward
- The more data that is submitted to the model, the better it gets
Cons
- A computer is still a computer, and it will make mistakes
- It lacks human intuition
- It needs time to mature
- If data remains unstructured and disparate, there will always be issues for the algorithms. Good databases and a good implementation plan are still going to be hugely significant
Having read these through, the obvious answer seems to be that a balance needs to be struck.
Speed vs Accuracy
In time, as the volume of data continues to go up, these data cleaning algorithms will be vitally important. They will need to mature quickly and complement the human interactions that will still be needed to ensure that mistakes are avoided as far as possible. By focusing on learning, on getting cleverer, the system can analyse, rate and utilise data, which should in turn result in significantly reduced coding hours and far better data.
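To make the speed vs accuracy balance a little more concrete, here is a minimal sketch of how an automated pass and a human reviewer might share the work of spotting duplicate records. It is only an illustration: the string similarity from Python's standard `difflib` stands in for the confidence score a trained model would produce, and the function names and the two thresholds are assumptions of mine, not values from any particular tool.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds; in practice they would be tuned on labelled examples.
AUTO_MERGE = 0.95    # confident enough to merge without a person looking
NEEDS_REVIEW = 0.80  # plausible duplicate, but a person should decide


def normalise(value: str) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(value.lower().split())


def similarity(a: str, b: str) -> float:
    """Stand-in for a model's confidence that two records are the same entity."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def triage(a: str, b: str) -> str:
    """Decide whether the machine acts alone or hands the pair to a human."""
    score = similarity(a, b)
    if score >= AUTO_MERGE:
        return "auto-merge"
    if score >= NEEDS_REVIEW:
        return "human-review"
    return "keep-separate"


if __name__ == "__main__":
    pairs = [
        ("Acme Ltd", "  acme   ltd "),
        ("Jon Smith", "John Smyth"),
        ("Acme Ltd", "Zenith Bank"),
    ]
    for left, right in pairs:
        print(f"{left!r} vs {right!r}: {triage(left, right)}")
```

The idea is that the machine takes the obvious cases at speed, while the uncertain middle band is where human intuition earns its keep. Moving the thresholds closer together buys speed, and moving them apart buys accuracy.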
This is just another interesting use of AI to solve an age-old problem, one that needs to be solved if organisations are to make the most of new technologies as they move forward: good data, good analytics, easy integration and so on. It’s not perfect, but it does seem to be a serious and sensible option to consider.
One final point before I leave you to read around the subject online. Once the data has been cleaned, decision makers will need to look for ways to ensure it stays that way. Setting up processes that make cleansing a less regular necessity is the sensible next step.
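One such process is simply checking records at the point of entry, so that bad data never lands in the database in the first place. The sketch below shows the idea in Python. The field names (`name`, `email`, `signup_date`) and the rules are purely hypothetical, and the email pattern is deliberately loose rather than a full validator.

```python
import re
from datetime import datetime

# Deliberately simple pattern: something@something.something, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record can be saved."""
    problems = []

    # Hypothetical rule: every record needs a non-blank name.
    if not str(record.get("name") or "").strip():
        problems.append("name is missing")

    # Hypothetical rule: email must at least look like an address.
    if not EMAIL_PATTERN.match(str(record.get("email") or "")):
        problems.append("email is not well formed")

    # Hypothetical rule: dates are stored in one agreed format.
    try:
        datetime.strptime(str(record.get("signup_date") or ""), "%Y-%m-%d")
    except ValueError:
        problems.append("signup_date is not YYYY-MM-DD")

    return problems


if __name__ == "__main__":
    good = {"name": "Ada", "email": "ada@example.com", "signup_date": "2024-01-31"}
    bad = {"name": " ", "email": "nope", "signup_date": "31/01/2024"}
    print(validate_record(good))
    print(validate_record(bad))
```

Rejecting or flagging a record while the person entering it is still there to correct it is usually far cheaper than cleansing it later.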