Trained on billions of lines of public code, GitHub Copilot puts the knowledge you need at your fingertips, saving you time and helping you stay focused.
GitHub Copilot is powered by Codex, the new AI system created by OpenAI. GitHub Copilot understands significantly more context than most code assistants. So, whether it’s in a docstring, comment, function name, or the code itself, GitHub Copilot uses the context you’ve provided and synthesizes code to match. Together with OpenAI, we’re designing GitHub Copilot to get smarter at producing safe and effective code as developers use it.
HOW COPILOT WORKS
GitHub describes Copilot as the AI equivalent of pair programming, in which two developers work together at a single computer. The idea is that one developer can bring new ideas or spot problems that the other developer might’ve missed, even if it requires more person-hours to do so.
In practice, though, Copilot is more of a utilitarian time saver, integrating the resources that developers might otherwise have to look up elsewhere. As users type out their code, Copilot suggests snippets that they can add with the click of a button. That way, they don’t have to spend time searching through API documentation or looking up sample code on sites like Stack Overflow. (A second developer probably wouldn’t have memorized those examples, either.)
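To make that concrete, here is a hypothetical sketch of the kind of exchange described above: the developer writes only the comment and the function signature, and the assistant proposes a plausible body. The fetch_json name and the suggested implementation are invented for illustration; this is not an actual Copilot output.

    import json
    import time
    import urllib.error
    import urllib.request

    # Fetch a URL and return the parsed JSON, retrying a few times on failure.
    def fetch_json(url, retries=3):
        for attempt in range(retries):
            try:
                with urllib.request.urlopen(url) as response:
                    return json.loads(response.read().decode("utf-8"))
            except urllib.error.URLError:
                if attempt == retries - 1:
                    raise
                time.sleep(2 ** attempt)  # wait a bit longer after each failed attempt

In a demo like this, the developer would normally have to look up the retry pattern or the urllib calls; the pitch is that the suggestion arrives inline, ready to accept or reject.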
As with most AI tools, GitHub also wants Copilot to get smarter over time based on the data it collects from users. CNBC reports that when users accept or reject Copilot’s suggestions, its machine learning model will use that feedback to improve future suggestions, so perhaps the tool will become more human-like as it learns.
THE BACKLASH
Not long after Copilot’s launch, some developers started sounding alarms over the use of public code to train the tool’s AI.
One concern is that if Copilot reproduces large enough chunks of existing code, it could violate copyright or effectively launder open-source code into commercial uses without proper licensing. The tool can also spit out personal details that developers have posted publicly, and in one case it reproduced widely cited code from the 1999 PC game Quake III Arena, including developer John Carmack’s expletive-laden commentary.
Cole Garry, a GitHub spokesperson, declined to comment on those issues, pointing instead to the company’s existing FAQ on Copilot’s web page, which acknowledges that the tool can produce verbatim code snippets from its training data. This happens roughly 0.1% of the time, GitHub says, typically when users don’t provide enough context around their requests or when the problem has a commonplace solution.
“We are building an origin tracker to help detect the rare instances of code that is repeated from the training set, to help you make good real-time decisions about GitHub Copilot’s suggestions,” the company’s FAQ says.
In the meantime, GitHub CEO Nat Friedman has argued on Hacker News that training machine learning systems on public data is fair use, though he acknowledged that “IP and AI will be an interesting policy discussion” in which the company will be an eager participant. (As The Verge’s David Gershgorn reports, that legal footing is largely untested.)
The tool also has defenders outside of Microsoft, including Google Cloud principal engineer Kelsey Hightower. “Developers should be as afraid of GitHub Copilot as mathematicians are of calculators,” he said.