The Lindahl Letter

Synthetic data notebooks

Thank you for tuning in to this audio-only podcast presentation. This is week 132 of The Lindahl Letter publication. A new edition arrives every Friday. This week the topic under consideration for The Lindahl Letter is “Synthetic data notebooks.”

People are actively working in this space, which I thought was pretty interesting. I have a general interest in using notebooks to create synthetic data because they give people hands-on, educational lessons in how to do it. A really solid automated testing process may also include synthetic data generation as part of the development workflow, which makes the automation even more powerful. The folks over at Towards AI released a nice beginner-oriented guide to synthetic data in March of 2023 [1]. That guide walked through some of the concepts, including a claim from a Gartner researcher that a large share of future AI data will end up being synthetic. Don’t worry, I went out and found the sourcing for that from Gartner: analyst Alexander Linden estimated that by 2030 synthetic data would outpace real data in AI models [2].
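To give a flavor of the kind of lesson those notebooks teach, here is a minimal sketch of rules-based synthetic record generation using only the Python standard library. The field names and value lists are my own invented examples, not taken from any of the guides referenced here:

```python
import random

random.seed(42)  # fixed seed so the lesson is reproducible

# Invented example vocabularies for a toy customer dataset.
FIRST_NAMES = ["Alice", "Bob", "Carol", "Dan", "Eve"]
PLANS = ["free", "pro", "enterprise"]

def make_record():
    """Generate one synthetic customer record from simple rules."""
    return {
        "name": random.choice(FIRST_NAMES),
        "age": random.randint(18, 80),
        "plan": random.choice(PLANS),
        "monthly_spend": round(random.uniform(0.0, 500.0), 2),
    }

records = [make_record() for _ in range(100)]
print(records[0])
```

Rules-based generation like this is the simplest end of the spectrum; the model-based tools discussed below learn the rules from real data instead of having them hand-written.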

Those future considerations aside, the reason that is happening is that most people are going to be expanding their datasets with synthetic data to help them train and work with models [3]. We are pretty far into the second paragraph, and you might be wanting a couple of Google Colab notebooks so you can try some of this yourself. Don’t worry, that is about to happen. The team over at Gretel AI shared a couple of notebooks that you can use for this type of effort:

https://colab.research.google.com/github/gretelai/gretel-synthetics/blob/master/examples/synthetic_records.ipynb

The first notebook had all sorts of errors and would not work. 

https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/create_synthetic_data_from_a_dataframe_or_csv.ipynb

The second one required a Gretel API key to get going, which was a lot less fun than it could have been without that part of the equation. I went out to the website over at

https://gretel.ai/

and they have some free elements. I got into their dashboard pretty easily and started looking around at what they offer [4]. I went over to YouTube and found a 12-minute video from co-founder Alex Watson showing how to do this. It quickly covers how to get the API key needed for the notebook above.

I really did follow those instructions to get that magic API key, and it worked in the second Google Colab notebook linked above. I stepped through the entire notebook in about 15 minutes and was able to watch the process of synthetic data generation from a dataframe or CSV, which was exciting to see and learn about in a notebook. The main model training took 7 minutes, so don’t expect it all to happen in just a click.
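To illustrate the shape of that dataframe-or-CSV workflow, here is a heavily simplified stand-in written with just the standard library. It samples each column independently from the values observed in a toy CSV; Gretel’s actual product trains a neural model that learns cross-column relationships, so treat this only as a sketch of the input/output contract, with invented example data:

```python
import csv
import io
import random

random.seed(0)

# A tiny in-memory "CSV" standing in for the real dataset in the notebook.
RAW = """age,city,score
34,Denver,0.81
52,Austin,0.44
29,Denver,0.92
41,Boise,0.37
"""

rows = list(csv.DictReader(io.StringIO(RAW)))
columns = list(rows[0].keys())

def sample_synthetic(n):
    """Sample each column independently from its observed values.

    This is a crude stand-in: a real synthetics model learns the joint
    distribution across columns, which independent sampling ignores.
    """
    return [
        {col: random.choice([r[col] for r in rows]) for col in columns}
        for _ in range(n)
    ]

synthetic = sample_synthetic(10)
print(synthetic[:2])
```

The workflow mirrors the notebook at a high level: load tabular data, fit something to it, then sample as many new rows as you want.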

Maybe you wanted to see somebody else generate synthetic data in Google Colab on YouTube. You can watch the YData team work with their Fabric environment inside a notebook. The channel had 44 subscribers and the video had 28 views before I shared this link for your enjoyment.

You could also check out this other YouTube video from The Next Phase team that shows more information about “Synthetic data generation with CTGAN” in a Google Colab notebook as well [5]. 

Maybe you wanted to switch gears a bit and learn a little about how to create 8-bit audio samples [6]. I’m going to share one more article here that walks through how to generate datasets, as I thought it was actually pretty good [7]. I’ll close this one out by zooming out to what some people think is the future of these synthetic data driven creations, which is the pollution of the open internet and eventual model collapse [8]. People have even recently gone as far as to say that copies of the internet from before all this generated content are worth more for training than the derivatives. We will see what happens soon. Oftentimes in machine learning, people have used randomness, chaos, or other techniques of shifting things around to overcome blocks. I’ll be curious to see if something similar is developed to overcome these potential model collapse problems.
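The model collapse concern can be demonstrated with a toy simulation in the spirit of the recursion argument in that last reference [8]: fit a model to data, sample a new dataset from the fitted model, refit, and repeat. In this sketch the “model” is just a fitted Gaussian, and the distributions and sample sizes are arbitrary choices of mine; with small per-generation samples, estimation error compounds across generations and the learned distribution drifts away from the original:

```python
import random
import statistics

random.seed(1)

# Generation 0: "real" data drawn from a wide Gaussian.
data = [random.gauss(0.0, 10.0) for _ in range(2000)]

stdevs = []
for generation in range(30):
    # Fit a Gaussian "model" to the current dataset.
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    stdevs.append(sigma)
    # The next generation trains only on a small sample drawn
    # from the fitted model, never seeing the original data again.
    data = [random.gauss(mu, sigma) for _ in range(50)]

# How the fitted spread drifts across generations.
print([round(s, 2) for s in stdevs])
```

Whether the spread shrinks or wanders on any particular run depends on the seed; the reliable effect is that the fitted parameters random-walk away from the generation-zero values as sampling error accumulates, which is the flavor of degradation the model collapse literature describes.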

Footnotes:

[1] https://towardsai.net/p/machine-learning/a-beginners-guide-to-synthetic-data

[2] https://www.gartner.com/en/newsroom/press-releases/2022-06-22-is-synthetic-data-the-future-of-ai

[3] https://towardsdatascience.com/generating-expanding-your-datasets-with-synthetic-data-4e27716be218 

[4] https://console.gretel.ai/use_cases/cards/use-case-synthetic/projects 

[5] https://colab.research.google.com/drive/18vavq2Kt8HqhSnZvvFxJUc-70NCPeDU_?usp=sharing 

[6] https://medium.com/mlearning-ai/python-machine-learning-gans-synthetic-data-and-google-colab-5bb43491a8c7

[7] https://medium.com/nerd-for-tech/synthetically-generate-datasets-using-deep-learning-c1f6ee7a0990 

[8] https://arxiv.org/pdf/2305.17493.pdf 

What’s next for The Lindahl Letter? 

  • Week 133: Automated survey methods

  • Week 134: Make a link based news report automatically

  • Week 135: Saving some notebooks every day

  • Week 136: What if July was startup month? 31 days for 31 ideas

  • Week 137: The battle is about having the idea

If you enjoyed this content, then please take a moment and share it with a friend. If you are new to The Lindahl Letter, then please consider subscribing. New editions arrive every Friday. Thank you and enjoy the week ahead.
