
Statistical Outlier Curation Kernel Software (SOCKS)

 SOCKS: Statistical Outlier Curation Kernel Software
A Noise Reduction Software for Machine Learning

Written by Gitika Gorthi, Chantilly High School

Have you ever had your classmates' voices overpower your teacher’s, making it hard to hear the important instructions being given? Or have you ever struggled to understand the news on the radio because of heavy static? Noise in data sets is similar to the noisy disturbances we hear in our daily lives: it is additional information that serves no apparent purpose, such as corrupted data. Noise often causes algorithms to miss patterns or specific trends in the data, much as we can miss important instructions or news because of background disturbances. The study “Dealing With Noise In Defect Prediction” found that false positive and false negative noise alone can lead to a 20-35% decrease in prediction performance (Kim et al., 2011). Statistical Outlier Curation Kernel Software (SOCKS for short) was developed to reduce noise in data and raise prediction performance, which in turn improves artificial intelligence (AI) model training and program efficiency.


SOCKS is software that reduces noise in data seamlessly and often agnostically, and curates the underlying data when necessary to reveal pristine information. But how does it help revolutionize AI and edge computing? With the noise reduced by SOCKS, an AI program can analyze patterns more accurately and more quickly, allowing us to rely on it more. According to Dr. Shivani and Atul Gupta in their article “Dealing with Noise Problem in Machine Learning Data-sets: A Systematic Review,” noisy data can significantly impair the prediction of any meaningful information by dramatically decreasing classification accuracy. Hence, once noise is reduced, the technology can keep improving itself by learning from cleaner trends and can cater to users’ needs better. To put this in a more real-world context, imagine scrolling through YouTube or Twitter: wouldn’t you want your content suggestions to match your actual interests? For example, if you are into comedy content, you would want YouTube or Twitter to suggest various kinds of comedy and not sidetrack into horror because of an outlier or an accidental click. SOCKS helps programs filter out that noisy data so the AI program works more efficiently.



Image 1: Image noise reduction, illustrating how much more efficient a program becomes when noise is reduced, not only in images but also in the accuracy of other program outputs


Past work has attempted to reduce noise in data through several noise filtering techniques in order to improve data quality in classification tasks. According to Garcia et al., many current techniques scan the data to identify noise in a preprocessing step. However, some noisy data can remain unidentified by these techniques, and sometimes even safe data is removed (Garcia et al., 2016). SOCKS aims to remove noise from data more accurately than these existing filtering techniques.
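
To make the idea of a preprocessing noise filter concrete, here is a minimal sketch in Python. It is not the SOCKS algorithm, which this post does not describe; it uses a standard median-based rule (the modified z-score) as a stand-in. The function name `flag_outliers`, the sample readings, and the cutoff of 3.5 are all assumptions chosen for illustration.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Split values into (kept, flagged) using the modified z-score.

    Illustrative stand-in only, NOT the SOCKS method.
    The threshold of 3.5 is a conventional choice and is an assumption here.
    """
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely move.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No measurable spread, so nothing can be judged an outlier.
        return list(values), []

    kept, flagged = [], []
    for v in values:
        # 0.6745 rescales the MAD so scores are comparable to ordinary z-scores.
        score = 0.6745 * abs(v - med) / mad
        if score > threshold:
            flagged.append(v)
        else:
            kept.append(v)
    return kept, flagged


# Hypothetical sensor readings with one corrupted value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 55.0, 10.2]
kept, flagged = flag_outliers(readings)
print("kept:", kept)
print("flagged:", flagged)
```

In this sample, the reading 55.0 is flagged and the other six are kept. The sketch flags points rather than silently deleting them, which leaves room to review borderline cases; that matters because, as noted above, a filter can miss real noise or throw away safe data.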


Apart from benefiting users in many daily activities, noise control also helps programs take on tasks that are difficult for human workers. For example, many data analysts have to do mundane visual tasks efficiently and accurately; if a machine could do this through AI, wouldn’t that save a lot of time and effort? On top of that, imagine SOCKS enabling the software to work even faster and with more precision. SOCKS enables greater accuracy in repetitive visual tasks, and combined with the Giant Signal File Random Sampler (GSFRS), it can revolutionize the technology industry through instant scaling of visual task completion, rapid training for deploying computer vision, and quicker access to data.


In conclusion, noise is unfavorable for machine learning training, and curating it out of the data before training begins can save a lot of time!


Check out the next blog post, which focuses on how SOCKS and GSFRS can work together to make AI software lightning fast and bullseye accurate.

