
Catalyzing A Data Science Revolution: SOCKS + GSFRS


Written by Gitika Gorthi, Chantilly High School



Why Giant Signal File Random Sampler (GSFRS) and Statistical Outlier Curation Kernel Software (SOCKS)? How will these technologies help you achieve your data science goals?


“Data is the new fuel of the digital economy,” or, as some put it, the new gold. Harnessing the numbers and accurately decoding their meaning is crucial to improving two things for organizations: speed and accuracy. GSFRS addresses speed and SOCKS addresses accuracy; coupled together, they make a power team.


Suppose you are a data analyst assigned to take a bunch of numbers and make sense of them -- I know your reaction, you are most probably scared. But don’t worry, artificial intelligence comes to the rescue with algorithms that do the hard work for us (yay!). Now the question is, is the program really telling us the right information? Some may blindly trust whatever the screen shows, while others may question the validity of the data analysis.


Studies have shown that “noise” in data sets can reduce an algorithm’s accuracy by over 20% at times (Gupta, 2019). Noise makes it harder to extract meaningful information, which leads to lower classification accuracy and poor prediction results. To put this in a more real-world scenario, imagine you are scrolling through social media such as Instagram, Facebook, LinkedIn, or TikTok. Would you want your “For You” page or main feed to be filled with content you enjoy or content you find boring? Most of us would like content that we find amusing, and to ensure this, an accurate algorithm that is not distracted by “noise” data (such as outliers in search topics) is essential. Some of you may be wondering, how does this apply to larger technology organizations?


Figure 1: Simple flow chart on how TikTok Algorithm works and the importance of reducing noise from data


Essentially, it is a similar concept. Imagine the National Aeronautics and Space Administration (NASA) or SpaceX landing a rover on a planet -- such as the most popular planet in the industry at the moment, Mars. One of NASA’s main focuses is finding evidence of extraterrestrial life or water through rovers. Rovers use AI-powered devices; Perseverance, for example, uses an instrument known as the Planetary Instrument for X-ray Lithochemistry (PIXL) to search for clues. To make such AI programs more accurate, so that their actions better serve NASA’s missions, implementing SOCKS on a larger scale could be helpful.


Figure 2: NASA Perseverance Rover
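To make the idea of removing “noise” a little more concrete, here is a minimal Python sketch of one common way to flag outliers, the interquartile range (IQR) rule. This is not the SOCKS algorithm, which this post does not describe; the function name `remove_outliers_iqr`, the cutoff `k=1.5`, and the sample readings are all assumptions chosen for illustration.

```python
import statistics


def remove_outliers_iqr(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR].

    Illustrative only: a generic IQR filter, not the SOCKS method.
    """
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical sensor readings with one obviously "noisy" value (55.0).
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 55.0, 10.2]
clean = remove_outliers_iqr(readings)
print(clean)  # the 55.0 reading is dropped, the other six are kept
```

A model trained on `clean` instead of `readings` would not be pulled toward the stray value, which is the kind of accuracy gain the post is talking about.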


Now that accuracy has been addressed, let’s talk about speed. What if I told you we can process big data as fast as the Flash? Well, maybe not as fast as the superhero himself, but something similar? You are probably excited at the thought, and I have good news for you: GSFRS can do just that!


GSFRS, as mentioned in prior blogs, can access random parts of a large data file without downloading the entire file, and it can process data in parallel on multiple threads. Let me break that down. Imagine reading Harry Potter and the Order of the Phoenix (the longest book in the series); once you finish, you realize you have forgotten a certain event and would like to go back to it. Flipping through every single page to find that one event is time-consuming, not to mention frustrating. Now imagine that you could just open the book to the page you wanted. Wouldn’t that make things a whole lot easier? GSFRS does just that with large data by accessing only what is necessary, and it can do so while other data is being processed at the same time.
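The “open the book to the right page” idea can be sketched in a few lines of Python. This is not GSFRS itself: it is a hedged illustration that works on a local file, using `seek` to jump to random byte offsets and a thread pool to read several of them at once. The names `read_chunk` and `sample_chunks`, the chunk size, and the demo file are all hypothetical.

```python
import os
import random
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_chunk(path, offset, size):
    """Read `size` bytes starting at `offset` without loading the whole file."""
    with open(path, "rb") as f:
        f.seek(offset)  # jump straight to the "page" we want
        return f.read(size)


def sample_chunks(path, n_samples, size, workers=4, seed=0):
    """Read `n_samples` random chunks of `size` bytes, using several threads."""
    file_size = os.path.getsize(path)
    rng = random.Random(seed)
    # Pick offsets so that every chunk fits entirely inside the file.
    offsets = [rng.randrange(0, file_size - size + 1) for _ in range(n_samples)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = list(pool.map(lambda off: read_chunk(path, off, size), offsets))
    return offsets, chunks


# Demo on a small throwaway file (a stand-in for a very large data file).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 1000)
    demo_path = tmp.name

offsets, chunks = sample_chunks(demo_path, n_samples=5, size=16)
os.remove(demo_path)
```

Each thread opens the file and reads only its own 16 bytes, so the cost depends on how much is sampled rather than on how big the file is.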


Real-world applications for GSFRS are nearly limitless, because who doesn’t want faster data analysis? I sure would want my phone to load a lot quicker if I had the opportunity. GSFRS could be used in rovers, in satellites, in any tech company looking to speed up its data processing, or as a component inside essentially any other software out there.


In conclusion, SOCKS (software to increase accuracy) and GSFRS (software to speed up big data processing) can work together to revolutionize technology. Check out the other blogs to learn more about their impact on a larger scale and in more depth!
