Carlos Barge

When I was just starting out with data science, I assumed that data needed to be cleaned before any machine learning process. I’ve since come to understand this as false – rather, data must be in a state that lets you draw correct conclusions from it. (Notwithstanding, data must be cleaned prior to analysis…)

Doesn’t clean data mean usable data?

Surprisingly, it does not! The cocktail party effect is a well-known example: people can focus on a single conversation in a noisy room. Our amazing human brains do this by filtering, adjusting, and deducing relevant sounds – not by removing sound waves. Such is the power of sophisticated intelligence. (Humans are smart!)

Sonar – literally noise data. (Often very noisy too!)

Sound waves travel about 4.3 times as fast in water as in air. Sound also carries far in water: some sounds can travel thousands of kilometers. (Typical max range for a submarine’s effective use of sonar is about 50 km in shallow water.) As a result, a sonar reading can pick up sounds from many sources and be incredibly noisy.
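The figures above can be checked with a little arithmetic. This is only a back-of-the-envelope sketch: the two sound speeds are assumed round values (real speeds vary with temperature, salinity, and depth), and the function names are made up for illustration.

```python
# Assumed round-number sound speeds; real values vary with temperature,
# salinity, and depth.
SPEED_IN_AIR_M_S = 343.0     # dry air at roughly 20 degrees C (assumption)
SPEED_IN_WATER_M_S = 1480.0  # water at roughly 20 degrees C (assumption)


def speed_ratio() -> float:
    """How many times faster sound travels in water than in air."""
    return SPEED_IN_WATER_M_S / SPEED_IN_AIR_M_S


def round_trip_seconds(range_km: float) -> float:
    """Time for a ping to reach a target at range_km and echo back."""
    return 2 * range_km * 1000.0 / SPEED_IN_WATER_M_S


print(f"water/air speed ratio: {speed_ratio():.2f}")            # 4.31
print(f"echo delay at 50 km: {round_trip_seconds(50):.1f} s")   # 67.6 s
```

With these assumed speeds, an echo from a target 50 km away takes a bit over a minute to return, which leaves plenty of time for other sounds to pile into the same recording.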

Sonar is an effective technology because such noise can be filtered out.
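To make "filtering" concrete, here is a minimal sketch of one classic approach, a matched filter, which slides a copy of the known ping along the recording and looks for the best match. Everything in it is an assumption for illustration (the pseudo-random ping, the Gaussian noise model, the sample counts, the function names); it is not the method used in any of the research mentioned in this post.

```python
import random


def make_ping(length, rng):
    """Hypothetical transmitted ping: a pseudo-random +/-1 code."""
    return [rng.choice((-1.0, 1.0)) for _ in range(length)]


def simulate_recording(ping, delay, total, noise_sigma, rng):
    """Bury one delayed echo of the ping in Gaussian background noise."""
    recording = [rng.gauss(0.0, noise_sigma) for _ in range(total)]
    for i, sample in enumerate(ping):
        recording[delay + i] += sample
    return recording


def matched_filter_delay(recording, ping):
    """Return the lag at which the recording best correlates with the ping."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(len(recording) - len(ping) + 1):
        score = sum(p * recording[lag + i] for i, p in enumerate(ping))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


rng = random.Random(0)
ping = make_ping(128, rng)
true_delay = 300  # in samples; an arbitrary choice for the demo
# Noise is as loud as the echo itself (sigma 1.0 vs. amplitude 1.0).
recording = simulate_recording(ping, true_delay, 1024, 1.0, rng)
estimated_delay = matched_filter_delay(recording, ping)
print("true delay:", true_delay, "estimated delay:", estimated_delay)
```

Note that nothing is deleted from the recording. The noise is still there; the filter simply weights the samples so that the echo stands out, much like the cocktail party effect.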

Filtering of environmental noise is neatly demonstrated in this paper on SLAM (simultaneous localization and mapping) algorithms. Their weapon of choice (for noise reduction) is an SVM (Support Vector Machine) classifier.

Oceanographers are experts on noisy data.

(This is an opinion of the author. However, the figure below provided by this research on ocean noise strongly supports this notion.)

Oceanographic research has figured out how to filter out noise.

If you’re working with noisy data, I’d suggest reading some oceanography research – or even getting to know someone who works in that field.

Applying this to machine learning:

A while back, I was working on a project that involved an incredibly noisy dataset. I had been iterating for the better part of two weeks on several models and data preparation techniques with little improvement. Then I came across a paper describing a classification technique using an Extremely Randomized Trees (ET) ensemble classifier – similar to Random Forests, but the split thresholds at each node are drawn at random rather than optimized. (Double the random, double the fun.)
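For the curious, the main difference between the two ensembles shows up at a single tree node. A Random Forest style split searches for the threshold that gives the purest children, while an Extremely Randomized Trees style split draws the threshold at random within the feature's range. The sketch below shows that one step on a toy noisy feature; the data, the function names, and the use of Gini impurity are my own assumptions for illustration, not the code from my project.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def split_impurity(values, labels, threshold):
    """Size-weighted Gini impurity after splitting at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(labels)


def best_threshold(values, labels):
    """Random Forest style: try every midpoint, keep the purest split."""
    points = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    return min(candidates, key=lambda t: split_impurity(values, labels, t))


def random_threshold(values, rng):
    """Extra-Trees style: draw the threshold uniformly within the range."""
    return rng.uniform(min(values), max(values))


rng = random.Random(0)
# Toy noisy feature: class 1 tends to have larger values, with heavy overlap.
labels = [i % 2 for i in range(200)]
values = [rng.gauss(1.5 * y, 1.0) for y in labels]

t_best = best_threshold(values, labels)
t_rand = random_threshold(values, rng)
print("optimized split impurity:", split_impurity(values, labels, t_best))
print("random split impurity:   ", split_impurity(values, labels, t_rand))
```

A single random split is never purer than the optimized one on the training data, but it is much cheaper to compute, and averaging many such trees can reduce variance, which is one plausible reason the approach copes well with noise. In practice you would not write this by hand: scikit-learn, for example, ships both `RandomForestClassifier` and `ExtraTreesClassifier`.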

Boom! The ET model beat my previous best-performing model (an RNN) by a significant margin – not to mention at a fraction of the processing time.

In sum, there are many algorithms and approaches to take when working with noisy data. By consulting oceanography research, I came across a technique that performed fabulously in my use case.

Source: Yaakov Bressler, Data Scientist at NYC Startup
