The Lindahl Letter
Code generating systems

You were probably wondering how far into the new year it would be before a bunch of focus and attention landed on the efforts of Hugging Face. You won’t have to wait any longer, as this missive digs into BigCode and some other efforts to use AI to build code generating systems [1]. I was reading a TechCrunch article from Kyle Wiggers and wondering how many different systems exist [2]. That article references 5 code generating systems that you could, in practice, go evaluate. For completeness, I’m listing Codex and Copilot separately in this list, given that the interfaces are entirely different.

  • BigCode - Hugging Face & ServiceNow’s R&D division [3]

  • AlphaCode - DeepMind [4]

  • CodeWhisperer - Amazon [5]

  • Codex - OpenAI [6]

  • Copilot - GitHub (Codex based) [7]

One thing you might be interested in learning about at this point is a dataset called “The Stack”, a collection of 6 terabytes of permissively licensed code that covers 300 programming languages [8]. The permissive part of the dataset is interesting. The GitHub archive was roughly 69 terabytes of data, which they filtered down to licenses they considered permissive, ending up with that 6 terabyte collection. Understanding how the dataset that feeds a code generating system was built is very important. All my contributions on GitHub are intended to be MIT licensed, which I think should count as permissive [9]. You have to consider deeply that a lot of proprietary code writers, and the corporations employing said coders, would not have given permission to use their code in a code generation system.
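To make the license filtering idea a little more concrete, here is a minimal sketch of an allow-list filter. It is not the BigCode pipeline: the license set, the function names, and the sample repository records are all hypothetical and exist only for illustration. The last two lines work out what share of the roughly 69 terabyte archive the 6 terabyte collection represents.

```python
# Hypothetical sketch: none of these names come from the BigCode pipeline.
# Illustrative subset of permissive licenses (an assumption, not the real list).
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}


def is_permissive(license_id):
    """Return True when a license identifier is on the allow-list."""
    return license_id is not None and license_id.lower() in PERMISSIVE_LICENSES


def filter_permissive(repos):
    """Keep only the repositories whose license passes the allow-list."""
    return [repo for repo in repos if is_permissive(repo.get("license"))]


# Made-up repository records, for illustration only.
sample_repos = [
    {"name": "example/a", "license": "MIT"},
    {"name": "example/b", "license": "GPL-3.0"},
    {"name": "example/c", "license": None},
    {"name": "example/d", "license": "Apache-2.0"},
]

kept = filter_permissive(sample_repos)
print([repo["name"] for repo in kept])  # ['example/a', 'example/d']

# Share of the archive that survived filtering, using the sizes quoted above.
retained_share = 6 / 69
print(f"{retained_share:.1%}")  # 8.7%
```

Note that a repository with no detectable license is dropped here rather than kept. That is a design choice in this sketch, and a real pipeline may handle that case differently.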

Generative coding systems will abound shortly, but they are in an early and developing state at the moment. We are getting to the point where you can instruct Codex to build an application and you might get a great result. It’s not a universal code generation engine at this point. However, we are getting closer and closer to conversational code generation, or some flavor of that outcome, which I would classify as a generative coding system. That will be a seismic shift, because it democratizes the creation of applications.

What would ChatGPT create?

If you were wondering what ChatGPT from OpenAI would have generated with the same prompt, then you are in luck. I had that output generated over at https://chat.openai.com/chat by issuing a prompt.

Footnotes:

[1] https://huggingface.co/bigcode 

[2] https://techcrunch.com/2022/09/26/hugging-face-and-servicenow-launch-bigcode-a-project-to-open-source-code-generating-ai-systems/ 

[3] https://www.bigcode-project.org/

[4] https://alphacode.deepmind.com/

[5] https://aws.amazon.com/codewhisperer/ 

[6] https://openai.com/blog/openai-codex/ 

[7] https://github.com/features/copilot 

[8] https://huggingface.co/datasets/bigcode/the-stack 

[9] https://github.com/nelslindahlx 

What’s next for The Lindahl Letter? 

  • Week 107: Highly cited AI papers

  • Week 108: Twitter as a company probably would not happen today

  • Week 109: Robots in the house

  • Week 110: Understanding knowledge graphs

  • Week 111: Natural language processing

If you enjoyed this content, then please take a moment and share it with a friend. If you are new to The Lindahl Letter, then please consider subscribing. New editions arrive every Friday. Thank you and enjoy the week ahead.

Lindahl, N. (2023). The Lindahl letter: 104 Machine Learning Posts. Lulu Press, Inc. https://www.lulu.com/shop/nels-lindahl/the-lindahl-letter-104-machine-learning-posts/ebook/product-y244ep.html  
