IBM teaches AI to translate computer languages
    2021-05-17  08:53    Shenzhen Daily

IBM announced during its Think 2021 conference last week that its researchers have crafted a Rosetta Stone for programming code.

Over the past decade, advancements in AI have mainly been “driven by deep neural networks, and even that, it was driven by three major factors: data with the availability of large data sets for training, innovations in new algorithms and the massive acceleration of faster and faster compute hardware driven by GPUs,” Ruchir Puri, IBM Fellow and chief scientist at IBM Research, said during his Think 2021 presentation, likening the new data set to the venerated ImageNet, which has spawned the recent computer vision land rush.

In effect, we’ve taught computers how to speak human, so why not also teach computers to speak more computer? That’s what IBM’s Project CodeNet seeks to accomplish. CodeNet is essentially the ImageNet of computers. It’s an expansive dataset designed to teach AI/ML systems how to translate code and consists of some 14 million snippets and 500 million lines spread across more than 55 legacy and active languages — from COBOL and FORTRAN to Java, C++ and Python.
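
The article does not spell out how the corpus is organized on disk, so the following is only a rough sketch of how a researcher might walk a CodeNet-style dataset in Python; the directory layout and metadata fields (problem_id-style folders, "filename", "language", "status") are assumptions for illustration, not IBM's published schema.

# Hypothetical walk over a CodeNet-style corpus.
# Directory layout and column names are assumed for this sketch.
import csv
from pathlib import Path

def iter_samples(root: str):
    """Yield (metadata_row, source_text) for every submission under root."""
    for meta_file in Path(root).glob("*/metadata.csv"):
        with open(meta_file, newline="") as f:
            for row in csv.DictReader(f):
                src = meta_file.parent / "src" / row["filename"]
                if src.exists():
                    yield row, src.read_text(errors="ignore")

# Example query: count accepted Python submissions.
accepted = sum(
    1
    for meta, _ in iter_samples("Project_CodeNet")
    if meta.get("language") == "Python" and meta.get("status") == "Accepted"
)
print("Accepted Python samples:", accepted)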

“Since the data set itself contains 50 different languages, it can actually enable algorithms for many pairwise combinations,” Puri explained. “Having said that, there has been work done in human language areas, like neural machine translation which, rather than doing pairwise, actually becomes more language-independent and can derive an intermediate abstraction through which it translates into many different languages.”
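
The practical payoff of that intermediate abstraction is combinatorial: translating directly between every ordered pair of n languages needs on the order of n(n-1) dedicated models, while routing through a shared representation needs only about 2n (one encoder and one decoder per language). A back-of-the-envelope comparison, purely illustrative:

# Rough model-count comparison of the two strategies Puri contrasts.
def pairwise_models(n: int) -> int:
    """One dedicated model per ordered language pair."""
    return n * (n - 1)

def shared_abstraction_models(n: int) -> int:
    """One encoder plus one decoder per language, meeting at a shared representation."""
    return 2 * n

for n in (10, 50):
    print(f"{n} languages: {pairwise_models(n)} pairwise models "
          f"vs {shared_abstraction_models(n)} with an intermediate abstraction")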

In short, the dataset is constructed in a manner that enables bidirectional translation. That is, you can take some legacy COBOL code — which, terrifyingly, still constitutes a significant amount of the U.S. banking and federal government infrastructure — and translate it into Java as easily as you could take a snippet of Java and regress it back into COBOL.

But just as with human languages, computer code is created to be understood within a specific context. However, unlike our bipedal linguistics, “programming languages can be compared, very succinctly, on a metric of ‘does the program compile, does the program do what it was supposed to do and, if there is a test set, does it solve and meet the criteria of the test,’” Puri posited.
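
That three-part metric (does it compile, does it run, does it pass the tests) maps directly onto a small harness. The sketch below applies it to a translated C++ submission; the file name, compiler invocation and test-case format are illustrative assumptions rather than anything CodeNet ships.

# Minimal harness for the compile / run / pass-the-tests metric described above.
# Paths, flags and the (stdin, expected_stdout) test format are assumptions.
import subprocess
import tempfile
from pathlib import Path

def evaluate(src_path: str, tests: list) -> bool:
    """Return True if the program compiles and every (stdin, expected_stdout) pair passes."""
    exe = Path(tempfile.mkdtemp()) / "candidate"
    build = subprocess.run(["g++", "-O2", "-o", str(exe), src_path],
                           capture_output=True, text=True)
    if build.returncode != 0:              # 1. does it compile?
        return False
    for stdin_data, expected in tests:     # 2. does it run?  3. does it pass the tests?
        try:
            run = subprocess.run([str(exe)], input=stdin_data, timeout=5,
                                 capture_output=True, text=True)
        except subprocess.TimeoutExpired:
            return False
        if run.returncode != 0 or run.stdout.strip() != expected.strip():
            return False
    return True

# Example: a translation that should echo the sum of two integers.
print(evaluate("translated_solution.cpp", [("1 2\n", "3")]))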

Thus, CodeNet can be used for functions like code search and clone detection, in addition to its intended translational duties and serving as a benchmark dataset. Also, each sample is labeled with its CPU run time and memory footprint, allowing researchers to run regression studies and potentially develop automated code correction systems.
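
Because every sample carries those CPU-time and memory labels, the kind of regression study the article alludes to can be a few lines over the metadata. The sketch below fits a least-squares trend of run time against program length; the field names and toy numbers are assumptions made for the illustration.

# Illustrative regression over CodeNet-style performance labels.
# Field names (source_lines, cpu_time_ms) and the toy data are assumed.
import numpy as np

def runtime_trend(samples):
    """Least-squares fit of CPU time (ms) against program length (lines)."""
    x = np.array([s["source_lines"] for s in samples], dtype=float)
    y = np.array([s["cpu_time_ms"] for s in samples], dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

toy_samples = [
    {"source_lines": 40, "cpu_time_ms": 12.0},
    {"source_lines": 120, "cpu_time_ms": 31.0},
    {"source_lines": 300, "cpu_time_ms": 80.0},
]
print(runtime_trend(toy_samples))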

Users can run individual code samples “to extract metadata and verify outputs from generative AI models for correctness,” according to an IBM press release. (SD-Agencies)
