
GSFRS : The Story of a Gigantic Random Sampler by Dr. Prasanta Pal, Brown University

Search for aliens with a radio telescope

What is GSFRS? All about data and more data!



We want stuff! A lot of stuff! Often, more stuff than we can handle. These days, with everything turning digital, that means we are looking for a lot of data, which in disguise means trouble! Let's understand the kinds of trouble we may be asking for through some stories.

Suppose you've been collecting radio signals from the alien world and storing them in a gigantic file (call it the big book of universal secrets) for the last million years. By now, the book has grown so big that you count its size in exabytes (10^18 bytes)! Suppose someone tells you that the secret equation for time travel is buried somewhere around the 3 trillionth line, and you need it right now because a giant asteroid is about to hit the Earth and time travel is the only way out. How do you retrieve the time-travel code from "The big book of universal secrets" at lightning ⚡ speed? Databases are not practical for such large volumes of data (maybe Google has a secret way, but it would be too expensive for you!). GSFRS is designed to solve exactly that problem, and you can possibly do it on your laptop if everything has been planned according to GSFRS's core philosophy: "index first" and "parse as you like"!
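The "index first" and "parse as you like" idea can be sketched in a few lines of Python. This is only an illustration under assumptions, not the actual GSFRS implementation: it assumes a newline-delimited text file, and the helper names `build_line_index` and `read_line` are hypothetical. One pass over the file records the byte offset at which each line starts; after that, any line can be fetched with a single seek instead of a full read.

```python
import os
import tempfile
from array import array


def build_line_index(path):
    """Scan the file once and record the byte offset where each line starts."""
    offsets = array("Q")  # unsigned 64-bit integers, compact in memory
    position = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(position)
            position += len(line)
    return offsets


def read_line(path, offsets, line_number):
    """Jump straight to a 0-based line number without reading earlier lines."""
    with open(path, "rb") as f:
        f.seek(offsets[line_number])
        return f.readline().rstrip(b"\r\n").decode("utf-8")


if __name__ == "__main__":
    # Tiny stand-in for the "big book": 1000 numbered lines.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        for i in range(1000):
            tmp.write(f"secret line {i}\n")
    index = build_line_index(tmp.name)
    print(read_line(tmp.name, index, 742))
    os.remove(tmp.name)
```

The index costs one full pass, but it is built only once and can be reused for every later lookup. For a file of the size in the story, the index itself would have to be stored on disk rather than held in memory; the sketch keeps it in memory only for brevity.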



The big book of universal secrets


Let us imagine a slightly different situation where NASA has been recording the temperature of the sun with arbitrary precision, at an ultra-high sampling rate, through various sensors for the last 50 years. Suppose everything has been stored in a single gigantic file as data records (GDR). Now we want to ask: what is the average temperature of the sun over the entire recording period? How, on average, does it change over a one-year cycle? How large is the average fluctuation of the temperature each year? How would we perform that kind of estimation?
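One way to approach such an estimate, sketched here under assumptions, is to read a random sample of records instead of the whole file. The example assumes one numeric reading per line; the helper names `build_line_index` and `sample_mean` are hypothetical and are not taken from GSFRS, and the demo data is synthetic, not real solar measurements.

```python
import os
import random
import statistics
import tempfile


def build_line_index(path):
    """Record the byte offset at which each line of the file starts."""
    offsets, position = [], 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(position)
            position += len(line)
    return offsets


def sample_mean(path, offsets, sample_size, seed=None):
    """Estimate the mean from a random sample of lines (sample_size >= 2)."""
    rng = random.Random(seed)
    # Sample line numbers without replacement; sort so seeks move forward.
    chosen = sorted(rng.sample(range(len(offsets)), sample_size))
    values = []
    with open(path, "rb") as f:
        for line_number in chosen:
            f.seek(offsets[line_number])
            values.append(float(f.readline().decode("ascii")))
    mean = statistics.fmean(values)
    std_error = statistics.stdev(values) / sample_size ** 0.5
    return mean, std_error


if __name__ == "__main__":
    # Synthetic stand-in data (not real measurements).
    rng = random.Random(0)
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        for _ in range(100_000):
            tmp.write(f"{rng.gauss(5772.0, 5.0):.3f}\n")
    index = build_line_index(tmp.name)
    mean, err = sample_mean(tmp.name, index, sample_size=500, seed=1)
    print(f"estimated mean = {mean:.2f} +/- {err:.2f}")
    os.remove(tmp.name)
```

The standard error shrinks with the square root of the sample size, so a modest sample can already give a usable estimate while touching only a small fraction of the file.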


A different manifestation of the same problem arises when a new researcher joins an existing team and wants to know what the temperature of the sun was on each of his or her birthdays.
Well, (s)he has to start reading the big book from the beginning and flip through the pages all over again to look for the data on those specific dates. Suppose (s)he finds a secret trend that the Sun heats up a bit more on his/her birthdays, and concludes that the birthday parties were really warming up the sun! Well, now (s)he has set a trend!


Everyone on the team wants to know if they too have any trend of solar temperature on their birthdays! So everyone has to flip through the pages of the big book again and again and again! 😓😰😅💦💦


Realistically, that is a lot of work and, more importantly, a lot of redundant work! This is a typical situation where GSFRS comes in very handy. You'll get the answer to each of the above questions almost immediately by reading only the pieces of the data you want, once you know which part of the data you want!


Under various real-world circumstances, the problems described above show up in different forms. However, the crux of all of them is the ability to have random access to any part of an arbitrarily large data record file, without having to maintain a database infrastructure that loads the entire big book into the computer's memory!



Applications

Instantaneous access to random parts of a large data file
Random Sampling from large files
Design of efficient low power query systems
Parallel processing of a data file on multiple threads
To learn more about the applications, visit Gitika Gorthi's GSFRS Blog!
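As an illustration of the parallel-processing application listed above, the sketch below splits a file into line-aligned byte ranges using a line-offset index and hands each range to its own thread. It is a minimal sketch under assumptions, not GSFRS's own code: the file is assumed to hold one number per line, and `build_line_index`, `sum_chunk`, and `parallel_sum` are hypothetical names.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def build_line_index(path):
    """Record the byte offset at which each line of the file starts."""
    offsets, position = [], 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(position)
            position += len(line)
    return offsets


def sum_chunk(path, start, end):
    """Sum the numbers stored in the byte range [start, end) of the file."""
    with open(path, "rb") as f:  # each worker opens its own file handle
        f.seek(start)
        data = f.read(end - start)
    return sum(float(token) for token in data.decode("ascii").split())


def parallel_sum(path, offsets, workers=4):
    """Split the file at line boundaries and sum the chunks on several threads."""
    file_size = os.path.getsize(path)
    n = len(offsets)
    # Line numbers where each chunk begins, plus a final end-of-file sentinel.
    cuts = [n * k // workers for k in range(workers + 1)]
    bounds = [offsets[c] if c < n else file_size for c in cuts]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(sum_chunk, [path] * workers, bounds[:-1], bounds[1:])
        return sum(parts)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        for i in range(1, 10_001):
            tmp.write(f"{i}\n")
    index = build_line_index(tmp.name)
    print(parallel_sum(tmp.name, index, workers=4))
    os.remove(tmp.name)
```

Because the index gives every chunk a clean line boundary, no worker ever sees half a record. In CPython, threads mainly help when the work is dominated by file I/O or by libraries that release the global interpreter lock; for CPU-heavy parsing in pure Python, processes may be the better fit.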












