Our results showed that, in many cases, using Google Translate to translate medical articles is feasible and is not a resource-intensive process, yielding operationally workable English versions. The accuracy of translation depended heavily on the original language of the article. Specifically, extractions of Spanish articles were most accurate, followed by fairly accurate extractions from German, Japanese, and French articles. The least accurate data extractions resulted from translated Chinese articles. With the exception of Japanese (where we found that extraction was fairly accurate), this pattern of differences across languages is consistent with the observation of machine translation experts for general translation "that translations between European languages are usually good, while those involving Asian languages are often relatively poor."13
With the exception of Spanish, the findings of this analysis are generally consistent with, but more robust than, those of a similar analysis done as a pilot study.15 Our improved methods, including double data extraction of the original language articles together with adjustment for individual extractors' accuracy in extracting English articles, provide better confidence in our conclusions. The discrepancy in the results from the translated Spanish articles is likely due to greater disagreement in data extraction (unrelated to translation issues) between individual pairs of extractors than between the double-extracted, reconciled reference extractions and the translated extractions.
Across languages, including English, we found good levels of agreement (mostly above 85 to 90 percent) for extraction of most study design questions (eligibility criteria; funding source; number of centers; followup duration; whether the study reported randomization, allocation concealment techniques, intention-to-treat analyses, and power calculations; and who was blinded). Agreement was slightly lower but still generally good (mostly above 70 percent) for extraction of descriptions of the intervention (dose, frequency, route, duration) and the number of participants randomized. For results reporting, extraction was consistently accurate (mostly above about 85 percent) for the numbers of participants analyzed and whether mean or median data were reported. The odds ratios of accurate extraction compared with English followed similar patterns. The accuracy of descriptions of outcomes and of results data (net difference, standard error, number of events or reported odds ratio, and P value of the difference or odds ratio) varied widely by language: descriptions of outcomes were commonly inaccurate from Chinese, German, and French articles; continuous results data (net difference and standard error) were inaccurate from French and Chinese articles; categorical results data (number of events or odds ratio) were inaccurate from German and Japanese articles; and P values were inaccurate from Japanese and Chinese articles. Translated Spanish articles generally yielded more accurate outcome descriptions and results data. Extractors' accuracy in identifying reported outcomes from lists of outcomes was generally poor, including from English articles (with only 63 percent accuracy). Only 5 to 43 percent of translated articles yielded accurate lists of reported outcomes across languages. Most of the inaccuracy came from outcomes missing from the list, but some arose from finding outcomes not captured in the reference extraction.
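To make these two accuracy measures concrete, the following minimal Python sketch shows how percent agreement with a reference extraction and an unadjusted odds ratio of accurate extraction versus English could be computed. All counts are invented for illustration; this section does not report per-item 2x2 tables, and the odds ratios in the report were additionally adjusted for extractor accuracy.

```python
# Hypothetical illustration of the two accuracy measures discussed above:
# percent agreement with the reference standard, and the (unadjusted) odds
# ratio of an accurate extraction in a translated language vs. English.
# All counts below are invented.

def percent_agreement(correct: int, total: int) -> float:
    """Percentage of extracted items that match the reference extraction."""
    return 100.0 * correct / total

def odds_ratio(correct_lang: int, total_lang: int,
               correct_eng: int, total_eng: int) -> float:
    """Odds of accurate extraction from translated articles divided by
    the odds of accurate extraction from English articles."""
    odds_lang = correct_lang / (total_lang - correct_lang)
    odds_eng = correct_eng / (total_eng - correct_eng)
    return odds_lang / odds_eng

# Example: 18/20 items correct from translated articles vs. 19/20 from English
print(percent_agreement(18, 20))   # 90.0
print(odds_ratio(18, 20, 19, 20))  # ~0.47: odds of accuracy roughly halved
```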
We expected that investigators would provide more accurate extractions when they had greater confidence in the accuracy and completeness of the translations. However, with the possible exception of the French studies, we did not find this to be the case. It is unclear why the data extractors were not more confident about the studies they extracted more accurately. It may be that they were unable to disambiguate difficulties in extraction due to poor translation from those due to poor reporting. This finding should not be overinterpreted, but it does call into question whether extractors can subjectively assess the accuracy of their extractions from translated articles.
Although our double data extraction of original language articles and the adjustment for accuracy of extraction of English language articles improved on the limitations of the pilot study, these approaches still do not fully remove the possibility that the differences (or lack of differences) we found between languages were partly due to intrinsic differences among data extractors or among the articles in the different languages. As we describe in the Methods section, we changed our approach to reconciliation of the reference standards to allow for multiple correct answers. We did this to reduce the number of disagreements between translated and original articles that were due to differences in interpretation of the data rather than to translation errors. However, it remains likely that a number of the disagreements were due to differences in interpretation. Similarly, while we controlled for extractors' likelihood of extracting English articles correctly, we could not fully control for the likelihood that extractors made errors unrelated to translation in specific articles. While each extractor extracted articles from several languages, we did not achieve an even distribution of extractors across languages. Furthermore, there were fundamental differences among the studies in the different languages, in the medical fields examined and in the complexity of the study designs, interventions, outcomes, and analyses. These intrinsic differences may account for some of the differences in extraction accuracy. We did not have extractors of translated articles reconcile their extractions and then compare the reconciled translated extractions with the reconciled original language extractions. Doing so might have more closely mimicked typical systematic review methods, but it would have greatly reduced the study's power. However, despite our power calculation, the confidence intervals of the adjusted odds ratios comparing translated with English articles were generally wide, possibly resulting in either an overestimation of the number of items with "trends" toward large differences in accuracy (i.e., small but nonsignificant odds ratios) or an underestimation of the number of true effects (due to frequent nonsignificance).
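As an illustration of why such intervals can be wide, the following Python sketch computes a 95% Wald confidence interval for an odds ratio from a small, invented 2x2 table (this is a standard textbook formula, not the adjusted model used in the report). With per-language samples of the size used here, even a markedly reduced odds ratio can remain nonsignificant.

```python
import math

def or_wald_ci(correct_t, wrong_t, correct_e, wrong_e, z=1.96):
    """Odds ratio (translated vs. English) and 95% Wald confidence interval
    from a 2x2 table of correct/incorrect extractions. Counts are invented."""
    or_ = (correct_t / wrong_t) / (correct_e / wrong_e)
    se_log = math.sqrt(1/correct_t + 1/wrong_t + 1/correct_e + 1/wrong_e)
    lower = math.exp(math.log(or_) - z * se_log)
    upper = math.exp(math.log(or_) + z * se_log)
    return or_, lower, upper

# 7/10 items correct from translated articles vs. 9/10 from English:
print(or_wald_ci(7, 3, 9, 1))  # OR ~0.26, 95% CI ~0.02 to 3.1 (crosses 1)
```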
Other limitations described for the pilot study still hold. While native speakers were chosen to extract the original language articles, these extractors were not always medically trained in their native language. Thus, original articles that employed specialized non-English medical terminology may have been difficult to extract, although this limitation should have been mitigated by the double data extraction. In addition, extractors were not necessarily familiar with the medical topic covered by each article, which introduced further variability into the results. It is likely that the data extraction error rate was higher than for a typical systematic review, since the articles covered random topics and the data extractors were neither trained nor necessarily proficient in the relevant clinical domains.
The Google Translate tool is ever evolving and presumably improving as users around the world refine the accuracy of translations. It is also reasonable to assume that, over time, more articles from non-English language publications will be available in a format that can be directly (and thus quickly) translated. However, this also implies that the accuracy of translation between any pair of languages will depend at least partly on how many words and documents are being translated between those languages on the Internet. While we did not test for differences based on study country, it is of interest that half the Spanish articles were written in Spain and half in Latin America, and that half the French articles were written in France, with the other half distributed among Canada (Quebec), Tunisia, and Turkey. All the Chinese articles were written in China in simplified Chinese. Anecdotally, it is our experience that extremely few studies from other Chinese-speaking countries or territories (Taiwan, Singapore, Hong Kong, Macao) are written in Chinese, particularly in traditional Chinese; our analysis is therefore relevant only to simplified Chinese. Although data extraction from translated articles was assessed to be considerably more difficult and time consuming than extraction from equivalent English language articles, extraction was always feasible in what was considered a reasonable amount of time, even including the extra time required to translate the article. For this research project, we used the directly available Google Translate Web site (translate.google.com), not the Google Translate Toolkit (translate.google.com/toolkit/), which requires an account setup and login. For typical systematic reviews, the Toolkit may offer some advantages, including that it searches for previous human translations of the same text and allows translations to be improved on the fly. However, this feature would likely be of value only if the investigator himself or herself, as opposed to a research assistant, performs the translation and puts in the effort to critically evaluate it.
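For reviewers who prefer to script the translation step rather than paste text into the web site, a sketch follows. Note the assumptions: it calls Google's Cloud Translation v2 REST API, a paid programmatic service distinct from the free translate.google.com tool evaluated in this report, and the API key and sample sentence are placeholders.

```python
# Sketch of scripting the translation step. This report used the Google
# Translate web site interactively; this example instead calls Google's
# Cloud Translation v2 REST API (a paid programmatic counterpart).
# The API key is a placeholder and the example sentence is invented.
import requests

API_URL = "https://translation.googleapis.com/language/translate/v2"

def translate_to_english(text: str, source_lang: str, api_key: str) -> str:
    """Translate a passage of article text into English."""
    resp = requests.post(
        API_URL,
        params={"key": api_key},
        data={"q": text, "source": source_lang, "target": "en", "format": "text"},
    )
    resp.raise_for_status()
    return resp.json()["data"]["translations"][0]["translatedText"]

# Usage (hypothetical):
# translate_to_english("Ensayo aleatorizado de ...", "es", "YOUR_KEY")
```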
Even though Google translation of medical articles is in most cases far from perfect and on average results in more inaccuracies than extraction from English, we conclude that the technique has potential value and that, for most of the tested languages, it may be reasonable to attempt translation (with Google Translate) and extraction of non-English-language articles that are available as machine-readable PDF (or HTML) files. A major caveat, though, is that we found extraction of results data to be least accurate. Thus, extra care should be taken when deciding how much to rely on results data from translated articles. It would be appropriate to consistently perform sensitivity analyses regarding translated articles, to assess whether findings (by meta-analysis) or conclusions (overall) differ when translated articles are included or omitted. It should be recognized that any differences may be due not only to differences in applicability or methodology, but also to errors in translation. Our prior anecdotal experience suggests that using Google Translate for articles in languages with which an extractor is at least somewhat familiar can be particularly useful in allowing confident data extraction. Based on the evidence that machine translation is (only) mostly accurate and on our anecdotal experience, an appropriate approach for systematic reviewers may be to run the machine translation and have a native speaker confirm or revise the translations. If such human translators are available, this may be a time- and cost-efficient approach. A reasonable alternative conclusion, however, is that the translation software is still too inaccurate for use in systematic reviews, and that the risk of introducing errors is too great. Each investigator considering the inclusion of machine-translated articles in a systematic review will need to decide the appropriate balance between completeness and the risk of extraction errors.
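As a concrete illustration of the suggested sensitivity analysis, the following Python sketch pools effect estimates by fixed-effect inverse-variance weighting, computed with and without the studies that required machine translation. All effect sizes and variances are invented for illustration.

```python
import math

def pooled(effects, variances):
    """Fixed-effect inverse-variance pooled estimate and 95% CI."""
    weights = [1.0 / v for v in variances]
    est = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return est, est - 1.96 * se, est + 1.96 * se

# (effect size, variance, required machine translation?) -- all invented
studies = [(0.30, 0.02, False), (0.25, 0.03, False),
           (0.60, 0.05, True),  (0.10, 0.04, True)]

with_all = pooled([e for e, _, _ in studies], [v for _, v, _ in studies])
english_only = pooled([e for e, _, t in studies if not t],
                      [v for _, v, t in studies if not t])
print("all studies: ", with_all)
print("English only:", english_only)
```

If the two pooled estimates differ materially, the review's conclusions should be flagged as sensitive to the inclusion of translated articles, keeping in mind that the difference may reflect translation error as well as genuine differences in the studies.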
The value and reliability of machine translation of articles for systematic reviews require further research. Questions of interest include:

- Are the findings of this study replicable with a different set of articles and extractors? (We would suggest that, if feasible, a larger sample of studies be tested.)
- How do different machine translators compare?
- How does machine translation from other languages fare (although the value of testing languages with relatively few publications is limited)?
- Are there differences in extraction accuracy based on differences in study design, including differences in clinical or content areas, pharmacological versus nonpharmacological interventions, different outcome types, or randomized versus nonrandomized studies?
- How would data extraction errors arising from poor translation affect meta-analysis results and systematic review conclusions?
We conclude that it is reasonable for systematic reviewers to devote the small amount of resources and effort necessary to try Google Translate in order to include non-English articles. It will be important, however, to recognize that extraction of these articles is more prone to error than extraction of typical English language articles. Judgment will therefore be needed to determine how much confidence reviewers have in the accuracy of the data extracted from these articles, and to recognize that apparently missing or unclearly reported data may reflect poor translation rather than poor methodology.