Johns Hopkins center is bringing machine translation to lesser-written languages

JHU's Center for Language and Speech Processing has been at the forefront of using algorithms to translate languages. Now the American intelligence community is posing a new challenge.

Johns Hopkins University. (Technical.ly file photo; source unknown)
Johns Hopkins’ Center for Language and Speech Processing was among the research centers that helped develop the translation and voice tools that have found wider use in commercial products in recent years. In a relatively close-knit field, the center stands out for its size and longevity.

Since the 1980s, researchers have worked to develop the technology used in tools like Google Translate, Siri or Facebook’s button that spits out a post in a different language. Such tools grew from open source systems, with the grandeur of the undertaking illustrated by Biblical names like Moses and Joshua.
“All that technology ultimately started at research labs like ours,” said Philipp Koehn. The computer science professor, who is affiliated with the center, has worked on machine translation for 20 years. He notes that the tools now in wide use weren’t a given until recently, and are the product of decades of work.
“It’s impressive to see that it’s good enough for real users,” Koehn said in a recent interview. “That is quite a threshold.”
Most of those tools translate between widely used languages that have a large body of written work available. Google Translate, for instance, is available in the roughly 100 most prevalent languages.
Now, Koehn and other researchers are set to apply tools they’ve used in speech recognition, information retrieval and extraction of information from text to languages that aren’t as widely used.
He is leading a group of 20 that will look to develop a system that can respond to inquiries written in English about documents written in these “low resource” languages. The Office of the Director of National Intelligence awarded a $10.7 million grant for the project, which includes a mix of professors and about a dozen PhD students.
They’ll seek to start with languages like Swahili and Tagalog. Those are good examples of languages that have “millions and millions of speakers…but just don’t have that much of a presence on the internet or official communication.”
The challenge is to take documents written in one of the languages, and produce an algorithm that would help intelligence agents get a quick look at what happened. “We have to return back to them relevant Swahili documents with a summary,” Koehn said.
After building an initial tool for the first two languages, the team will be tasked with putting it to use. Intelligence agencies could use the tool to quickly analyze documents in a given language when a major event happens. Some of the languages of interest to that end include Kurdish, Serbo-Croatian, Khmer, Hmong and Somali.
While tools based on deep learning networks have come a long way, Koehn said there’s a new data challenge inherent in analyzing such languages. More widely used languages often have large datasets available for training tools. “Now it’s much, much smaller,” Koehn said. This will require new strategies to obtain data that can be translated, whether through context or linguistic analysis.
The four-year project is the beginning of a new phase of research for the field. As Koehn noted, there are 6,000 languages in the world. The resources may not be there to translate all of them, but it means there’s plenty left to explore.
 
