How can I handle noisy data via machine learning?
When I was just starting out in data science, I assumed that data needed to be cleaned before any machine learning could happen. I’ve since come to understand this is false – rather, data must be in a state from which correct conclusions can be drawn. (Data must still be cleaned prior to analysis, of course…)
Doesn’t clean data mean usable data?
Surprisingly, it does not! The cocktail party effect is a well-known example: people can focus on a single conversation in a noisy room. Our amazing human brains do this by filtering, adjusting, and deducing relevant sounds – not by removing sound waves. Such is the power of sophisticated intelligence. (Humans are smart!)
Sonar – literally noise as data. (Often very noisy, too!)
Sound travels roughly 4.3× faster in water than in air. Because water is so dense, some sounds can travel thousands of kilometers. (The typical maximum range for a submarine’s effective use of sonar is about 50 km in shallow water.) Even so, the sounds present in a sonar reading can be incredibly noisy.
Sonar is an effective technology because such noise can be filtered out.
Filtering of environmental noise is neatly demonstrated in this paper on SLAM algorithms. Their weapon of choice for noise reduction is an SVM (Support Vector Machine) classifier.
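To make the idea concrete, here is a minimal sketch of using an SVM classifier to separate signal from noise. This is not the paper’s implementation – the synthetic data and the hand-crafted features below are my own illustrative assumptions.

```python
# Sketch: classify windows of a noisy sensor stream as "signal" vs "noise"
# with an SVM. All data here is fabricated for illustration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# 500 windows of 64 samples each. Half contain a sinusoid buried in
# Gaussian noise ("signal"); the rest are pure noise.
n, length = 500, 64
t = np.arange(length)
labels = rng.integers(0, 2, n)
windows = rng.normal(0.0, 1.0, (n, length))
windows[labels == 1] += np.sin(0.5 * t)

# Simple hand-crafted features per window: mean, standard deviation,
# and the largest non-DC FFT magnitude (a buried tone shows up here).
feats = np.column_stack([
    windows.mean(axis=1),
    windows.std(axis=1),
    np.abs(np.fft.rfft(windows, axis=1))[:, 1:].max(axis=1),
])

X_train, X_test, y_train, y_test = train_test_split(
    feats, labels, test_size=0.25, random_state=0)

clf = SVC(kernel="rbf").fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print(f"held-out accuracy: {acc:.2f}")
```

The point is not the specific features, but the pattern: rather than trying to scrub the noise out of the raw samples, you engineer features that let the classifier see through it.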
Oceanographers are experts on noisy data.
(This is the author’s opinion. However, the figure below, provided by this research on ocean noise, strongly supports the notion.)
Oceanographic research has figured out how to eliminate noise.
If you’re working with noisy data, I’d suggest reading some oceanography research – or even getting to know someone who works in that field.
Applying this to machine learning:
A while back, I was working on a project that involved an incredibly noisy dataset. I had been iterating for the better part of two weeks on several models and data preparation techniques with little improvement. Then I came across a paper describing a classification technique using an Extremely Randomized Trees (Extra-Trees) ensemble classifier – similar to Random Forests, except that split thresholds are drawn at random for each candidate feature rather than optimized. (Double the random, double the fun.)
Boom! The Extra-Trees model beat the best-performing model (an RNN) by a significant margin – not to mention at a fraction of the processing time.
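For readers who want to try the same comparison, here is a hedged sketch using scikit-learn’s `ExtraTreesClassifier`. The dataset below is a synthetic stand-in (the original project’s data and hyperparameters aren’t shown in this post) – `flip_y` injects label noise to mimic a noisy problem.

```python
# Sketch: compare Extra-Trees against a Random Forest on a synthetic
# noisy classification task. Dataset and settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Fabricated noisy dataset: 20 informative features out of 60,
# with 20% of the labels randomly flipped (flip_y).
X, y = make_classification(
    n_samples=2000, n_features=60, n_informative=20,
    flip_y=0.2, random_state=0)

et = ExtraTreesClassifier(n_estimators=200, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0)

et_score = cross_val_score(et, X, y, cv=3).mean()
rf_score = cross_val_score(rf, X, y, cv=3).mean()
print(f"extra trees:   {et_score:.3f}")
print(f"random forest: {rf_score:.3f}")
```

Extra-Trees is also typically faster to train than a Random Forest, since it skips the search for optimal split thresholds – consistent with the speedup I saw.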
In sum, there are many algorithms and approaches for working with noisy data. By consulting oceanography research, I came across a technique that performed fabulously in my use case.
Source: Yaakov Bressler, Data Scientist at NYC Startup