NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Balk EM, Chung M, Chen ML, et al. Assessing the Accuracy of Google Translate to Allow Data Extraction From Trials Published in Non-English Languages [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2013 Jan.

Results

The articles were chosen by language only. We did not consider geographic distribution when selecting articles. All Chinese, German, and Japanese articles were from China, Germany, and Japan, respectively. The French articles were from France (5), Canada (3), Tunisia (1), and Turkey (1); and the Spanish articles were from Spain (5), Argentina (3), Colombia (1), and Mexico (1). The Chinese articles were all published in simplified Chinese. Other characteristics of the included studies are presented in Table 2.

Table 2. Characteristics of included trials.

Among the 15 investigators who extracted data from original language articles (10 investigators) and translated and English articles (9 investigators), 11 are M.D. or Ph.D. (or both) researchers, 2 are research associates (or equivalent), and 2 are medical residents with research experience. The median duration of experience with data extraction was 5 years, with 4 extractors having 10 to 14 years of experience, and 4 having less than 1 year of experience. Six of the investigators have participated in more than 20 systematic reviews, 1 has participated in 11 to 20 reviews, 5 in 6 to 10 reviews, and 3 in 5 or fewer reviews. Eight investigators have extracted more than 100 studies, 4 have extracted 51 to 100 studies, and 3 have extracted 50 or fewer articles. Nine investigators judged that they have a lot of comfort with the Cochrane risk of bias questions, 4 had moderate comfort, 1 had little comfort, and 1 had no experience with assessing risk of bias. The medical resident with no prior systematic review experience extracted 10 original language articles. She had oversight and assistance from the director of the EPC she was affiliated with.

Article Translation

For most articles, translation took between 5 minutes (2 of 50 articles) and about 1 hour (11 articles); 2 articles took more than 1 hour. Excluding the latter two articles, the average time to translate was about 30 minutes. The time-for-translation distributions varied by language (Table 3), with Spanish articles being the quickest to translate and Chinese articles taking the longest.

Table 3. Translation time (minutes), by language.

In general, the European- and Japanese-language articles could be translated automatically from their PDF or HTML files. These texts were then copied to Word documents after translation. However, the ease of translation was largely related to the file and text types used by the journals and whether Google Translate could read these directly or not.

The extra time required to translate the other articles consisted mainly of iteratively copying blocks of text (paragraphs or columns) from the article into the Google Translate Web site and then copying the translated text into Word documents. This often involved using the appropriate alphabet from the original language, and removing false line breaks, hyphens, and unnecessary spaces. We discovered (and were informed by the Chinese speakers among us) that we needed to remove false line breaks (artifactual breaks not at the end of sentences) in the Asian language articles to allow meaningful translation. Translation of tables was frequently very time consuming as it required a large number of translations of individual row and column headers and formatting in the translated Word document.
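The manual cleanup described above (removing false line breaks, stray hyphens, and unnecessary spaces) could in principle be partly automated. The following is a minimal sketch only, not the procedure used in this study; the function name `clean_copied_text` and the `cjk` flag are hypothetical. It assumes that a blank line marks a true paragraph break and that a hyphen at the end of a line is an artifact of typesetting, which will occasionally be wrong for genuinely hyphenated compounds.

```python
import re


def clean_copied_text(text: str, cjk: bool = False) -> str:
    """Tidy text copied from a PDF before pasting it into a translator.

    Hypothetical helper. Set cjk=True for Chinese or Japanese text, where
    lines are rejoined without an intervening space.
    """
    # Rejoin words split by a hyphen at a line end: "trans-\nlation" -> "translation".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # Treat blank lines as real paragraph breaks; every other line break is
    # assumed to be a false (artifactual) break inside a paragraph.
    paragraphs = re.split(r"\n\s*\n", text)
    joiner = "" if cjk else " "

    cleaned = []
    for paragraph in paragraphs:
        lines = [line.strip() for line in paragraph.split("\n") if line.strip()]
        merged = joiner.join(lines)
        # Collapse runs of spaces or tabs left over from column layout.
        merged = re.sub(r"[ \t]{2,}", " ", merged)
        if merged:
            cleaned.append(merged)
    return "\n\n".join(cleaned)
```

A sketch like this would not address tables or overlapping column text, which still required manual handling.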

For numerous articles, particularly those in the Chinese language, Google Translate could directly translate the PDF or HTML file, but the resulting file was unreadable because of heavily overlapping text across columns; therefore, manual copying and pasting of these articles had to be done. Since Google Translate attempted to maintain the original formatting and because some written languages are much more compact than English, the English text ran from one column to the next, overlapping the text in the second column.

Other issues we encountered included that one Spanish PDF could not be read originally but could after it was saved as a TIFF file from which another PDF was created; one German article required removing multiple instances of “¬” (an optional hyphen) before translation could succeed. One German article was clearly an outlier in that it took almost 4 hours to translate because of the poor quality of the original file. When text from this particular article was copied and pasted into either Google Translate or Microsoft Word, the copied text included spaces randomly placed within most of the words. Because the quality of the translated text was greatly improved after removing these superfluous spaces, this extra step was undertaken. One Chinese article took almost 2 hours because the non-Chinese characters (such as words and numbers) within the file were copied to gibberish and had to be manually retyped for proper translation.

Data Extraction From Translated Articles

The assessment by the English language data extractors was that extraction from translated articles generally took more time than extraction from an equivalent English-language article would have taken. Extractors were asked to estimate how much additional time they spent on each translated article compared with the time they likely would have spent with a comparable English article (Table 4). For Spanish articles, extractors estimated that 56 percent of articles took less than 5 additional minutes to extract, and all but one article took up to 30 additional minutes to extract. Extraction of other translated articles generally took longer. Between 60 and 75 percent of other language articles were estimated to take between 6 and 30 additional minutes to extract, with generally most of the remaining articles requiring more than 30 minutes extra. Anecdotally, for some Chinese articles the translation was so poor that little could be extracted, which resulted in little time being required to extract the article.

Table 4. Estimated additional time required compared to extraction of a similar English article.

Extractors were also asked to assess their level of confidence in the translation of the articles (Table 5). Extractors had strong confidence for the majority (60 percent) of Spanish articles. Confidence in the translation of articles in the other languages was generally moderate, with moderate confidence reported for 65 to about 70 percent of articles in each language.

Table 5. Confidence in accuracy and completeness of the translation.

Table 6 provides some examples of unintelligible translations collected by the extractors. Additional issues included lack of translation of figures and some tables, blocks of gibberish, and completely untranslated text.

Table 6. Examples of poor translation, by language.

Comparison of Extractions From Translated and Original Articles

Table 7 displays the adjusted percentage of correct extractions per language, including English, and per analyzed extraction item; the percentages are adjusted for individuals' likelihood of correctly extracting English articles. To recap, the reference standards for English-language articles were the consensus extraction across all researchers or between the senior investigators; the reference standards for translated articles were the double data extraction results from original language extractions. The extraction items are clustered by study domain (study design, intervention description data including the number of participants randomized, outcome descriptions, and results). The specific extraction questions are described in Appendix A. In general, across languages the agreement between the extractors and the reference standards for each article (from consensus in English and from double-extracted original language articles in other languages) was greater for design and intervention domain items than for outcome descriptions and study results. In particular, extractors did relatively poorly extracting which outcomes from a given list were reported in the study and in extracting net differences (or equivalent results) and their standard errors for continuous outcomes.

Table 7. Percentage of correct extractions, per item and language, adjusted for individual's likelihood of correctly extracting the same data item from English articles.

Translated Chinese articles yielded the largest percentage of items (22 percent) incorrectly extracted by more than half the extractors, although Chinese articles also yielded the largest percentage of items (41 percent) extracted correctly by more than 98 percent of the extractors (including English article extractions). However, translated Chinese articles had particularly low likelihoods of correct extractions for the important extraction items about descriptions of the interventions, the outcomes, and the results. In contrast, extractors of translated Spanish articles had relatively high likelihoods of extracting items correctly except, in comparison with English, for results data. For Spanish, only 7 percent of items had less than 50 percent correct extractions, including funding source and identifying reported outcomes. Extractions of articles translated from the other languages yielded patterns similar to those for translated Chinese articles, but with generally higher rates of correct extractions. In particular, identifying reported outcomes and extracting results were more likely to be incorrect.

Table 8 displays the adjusted odds ratios between translated and English articles of correctly extracting individual items (the odds of correct extractions from translated articles versus the odds of correct extractions from English articles, adjusted for each researcher's likelihood of correctly extracting the English data items). Of note, it was not uncommon for the odds of correctly extracting individual items from the translated articles to be greater than the odds of doing so from the 10 extracted English articles. All odds ratios of 1 or greater were analyzed as being equivalent to perfect agreement. Overall, the pattern of adjusted odds ratios of correct answers compared with English across items and languages (Table 8) was similar to the pattern of adjusted percentages (Table 7). It highlights that for all translated languages except Spanish, extractors were statistically significantly more likely to extract incorrect data for outcome description and results from translated articles than from English articles. Similarly, the likelihood of missing reported outcomes was higher from translated articles, significantly so for German, Japanese, and Spanish articles. The seeming discrepancy between Tables 7 and 8 in the results for duration of interventions (within the “Intervention” domain) is due to the near 100 percent accuracy for all languages and thus small numbers of incorrect extractions (e.g., 2 or 4 percent versus 0 percent).
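To illustrate the quantity being compared, the following sketch computes a crude (unadjusted) odds ratio of correct extraction with a Wald-type 95 percent confidence interval. It is a simplification under stated assumptions: the odds ratios in Table 8 are adjusted for each extractor's accuracy on English articles, which this sketch does not attempt, and the function name and the counts in the usage note are hypothetical rather than taken from the study.

```python
import math


def odds_ratio_with_ci(correct_translated, incorrect_translated,
                       correct_english, incorrect_english, z=1.96):
    """Crude odds ratio (translated vs. English) with a Wald confidence interval.

    Hypothetical helper; does NOT reproduce the adjusted analysis of the report.
    """
    cells = [correct_translated, incorrect_translated,
             correct_english, incorrect_english]
    # Add 0.5 to every cell if any count is zero, a common continuity
    # correction that keeps the ratio and its standard error defined.
    if any(count == 0 for count in cells):
        cells = [count + 0.5 for count in cells]
    a, b, c, d = cells

    odds_ratio = (a / b) / (c / d)
    # Standard error of the log odds ratio.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper
```

For example, with hypothetical counts of 30 correct and 10 incorrect extractions from translated articles and 36 correct and 4 incorrect from English articles, `odds_ratio_with_ci(30, 10, 36, 4)` gives an odds ratio of (30/10)/(36/4), or about 0.33; values below 1 indicate lower odds of correct extraction from translated articles.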

Table 8. Odds ratios (confidence intervals) compared with English of correct extractions, adjusted for individual's likelihood of correctly extracting the same data item from English articles.

Risk of bias assessment typically had only slight agreement across languages and risk of bias questions. The median kappa across questions among the original language extractors (including for English articles) was 0.195 (full range -0.14, 0.78) and for extractors of translated articles was 0.22 (-0.46, 1.00). For English articles, 49 percent of biases were rated “unclear,” 12 percent “high,” and 39 percent “low.” Among other original language articles, 42 percent of biases were rated “unclear,” 18 percent “high,” and 39 percent “low.” Among translated articles, 53 percent of biases were rated “unclear,” 11 percent “high,” and 37 percent “low.” Table 9 displays the kappa values for each question, within each language, for both original and translated articles, along with P values for differences between original and translated extractions. Among 25 comparisons (5 questions in 5 languages), only 3 (12 percent) have a P value less than 0.10. Among Chinese and Spanish articles, allocation concealment was rated more consistently among translated than original articles (P = 0.06). This can be ascribed to the more universal designation of “unclear” bias among translated articles. Only for the designation of attrition bias among Chinese articles was agreement significantly poorer for translated articles (7 “unclear,” 1 “high,” 12 “low”) than for original articles (6 “unclear,” 3 “high,” 11 “low”).
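For readers unfamiliar with the statistic, the sketch below computes an unweighted Cohen's kappa for two sets of risk of bias ratings on the same articles. This is an assumption made for illustration: the text above does not specify which kappa variant was used, and the function name and the ratings in the usage note are hypothetical, not study data.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two equal-length lists of category labels."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)

    # Observed agreement: proportion of articles given the same rating.
    observed = sum(x == y for x, y in zip(ratings_a, ratings_b)) / n

    # Chance agreement: product of the two raters' marginal proportions,
    # summed over categories.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[cat] * counts_b[cat] for cat in counts_a) / n ** 2

    if expected == 1.0:
        # Both raters used one identical category throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

With hypothetical ratings `["low", "low", "high", "unclear"]` and `["low", "high", "high", "unclear"]`, observed agreement is 0.75, chance agreement is 0.3125, and kappa is 7/11, or about 0.64. A kappa near 0 means agreement is no better than chance, and negative values mean agreement is worse than chance.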

Table 9. Kappa statistics for agreement of risk of bias.

We performed sensitivity analyses in which we dichotomized the risk of bias assessment by setting “unclear” to be equivalent to either “high” or “low” risk of bias. The only finding that differed between the main and sensitivity analyses arose when “unclear” was set to be equivalent to “high” risk of bias: among translated Spanish articles, outcome assessor blinding was more consistently graded “high/unclear” (75 percent) than among original articles (60 percent), P = 0.06.
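The dichotomization used in the sensitivity analyses amounts to a simple recoding of the three-level ratings. A minimal sketch, with a hypothetical function name and hypothetical ratings, might look like this:

```python
def dichotomize(ratings, unclear_as):
    """Recode "unclear" ratings as either "high" or "low" risk of bias."""
    if unclear_as not in ("high", "low"):
        raise ValueError('unclear_as must be "high" or "low"')
    return [unclear_as if rating == "unclear" else rating for rating in ratings]


# Hypothetical ratings for one risk of bias question across five articles.
ratings = ["unclear", "low", "high", "unclear", "low"]
as_high = dichotomize(ratings, "high")  # "unclear" counted as high risk
as_low = dichotomize(ratings, "low")    # "unclear" counted as low risk
```

Each recoded list can then be compared between original and translated extractions in the same way as the three-level ratings.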

Association Between Extractor Confidence and Accuracy

Table 5 above displays the distribution of levels of confidence extractors had in the accuracy and completeness of the translations across languages. To examine the association between their level of confidence and their extraction accuracy, we first calculated the raw percentage accuracy by confidence level, by language and across languages (Table 10). For French articles, the accuracy was considerably higher when extractors had strong confidence (94 percent accuracy across articles and items) than when they had moderate or little confidence (67 percent accuracy). However, this pattern was not seen for other languages; of note, for Chinese articles the accuracy was higher when extractors had little confidence (88 percent) than when they had moderate or strong confidence (73 percent). Overall across all languages, the accuracy was about the same regardless of extractors' confidence level (76 or 79 percent).
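The raw percentage accuracy by confidence level is a grouped proportion: the number of correctly extracted items divided by the number of items extracted, within each confidence level. The sketch below shows the calculation under the assumption that each extracted item is recorded as a (confidence level, correct or not) pair; the function name and the records in the usage note are hypothetical.

```python
from collections import defaultdict


def accuracy_by_confidence(records):
    """Percentage of correct extractions within each confidence level.

    records: iterable of (confidence_level, is_correct) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # level -> [correct, total]
    for confidence, is_correct in records:
        totals[confidence][0] += int(is_correct)
        totals[confidence][1] += 1
    return {level: 100.0 * correct / total
            for level, (correct, total) in totals.items()}
```

For instance, hypothetical records with 3 of 4 items correct under strong confidence and 1 of 2 correct under little confidence yield 75.0 and 50.0 percent, respectively.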

Table 10. Association between extractors' confidence in accuracy of translation and their extraction accuracy.
