Johns Hopkins’ Center for Language and Speech Processing was among the research centers that helped develop the translation and voice tools that have found wider use in commercial products in recent years. In a relatively close-knit field, the center stands out for its size and longevity.
Since the 1980s, researchers have worked to develop the technology used in tools like Google Translate, Siri or Facebook’s button that spits out a post in a different language. Such tools grew from open source systems, with the grandeur of the undertaking illustrated by Biblical names like Moses and Joshua.
“All that technology ultimately started at research labs like ours,” said Philipp Koehn. The computer science professor, who is affiliated with the center, has worked on machine translation for 20 years. He notes that the tools now in wide use were decades in the making, and that their success wasn’t a given until recently.
“It’s impressive to see that it’s good enough for real users,” Koehn said in a recent interview. “That is quite a threshold.”
Most of those tools translate between languages that are widely used and have a large body of written text available. Google Translate, for instance, is available in roughly 100 of the most prevalent languages.
Now, Koehn and other researchers are set to apply tools they’ve used in speech recognition, information retrieval and extraction of information from text to languages that aren’t as widely used.
He is leading a group of 20 that will look to develop a system that can respond to queries written in English about documents written in these “low resource” languages. The Office of the Director of National Intelligence awarded a $10.7 million grant for the project, which includes a mix of professors and about a dozen PhD students.
They’ll start with languages like Swahili and Tagalog. Those are good examples of languages that have “millions and millions of speakers…but just don’t have that much of a presence on the internet or official communication.”
The challenge is to take documents written in one of the languages and produce an algorithm that helps intelligence agents get a quick look at what happened. “We have to return back to them relevant Swahili documents with a summary,” Koehn said.
After building an initial tool for the first two languages, the team will be tasked with putting it to use. Intelligence agencies could use the tool to quickly analyze documents in a given language when a major event happens. Some of the languages of interest to that end include Kurdish, Serbo-Croatian, Khmer, Hmong and Somali.
While tools based on deep learning networks have come a long way, Koehn said there’s a new data challenge inherent in analyzing such languages. More widely used languages often have large datasets available for training tools. “Now it’s much, much smaller,” Koehn said. This will require new strategies to obtain data that can be translated, whether through context or linguistic analysis.
The four-year project is the beginning of a new phase of research for the field. As Koehn noted, there are 6,000 languages in the world. The resources may not be there to translate all of them, but it means there’s plenty left to explore.