Poster: njwhite Date: Jan 30, 2014 1:36am
Forum: texts Subject: Re: Using Tesseract to improving OCR for some languages

Oh really? I didn't know they had pre-built modules for that (the http://finereader.abbyy.com/corporate/tech_specs/ page doesn't mention it). Note that Ancient Greek is quite different to modern Greek (more diacritics, different vocabulary).

Is there some way you can automatically select the language to OCR? Because, as I mentioned, at present all the Ancient Greek books I've looked at in the archive appear to have been treated as Latin, with unusable results.

Poster: Jeff Kaplan Date: Jan 30, 2014 1:32pm
Forum: texts Subject: Re: Using Tesseract to improving OCR for some languages

Yes, if you use our HTML5 uploader with Chrome, Firefox or Safari at archive.org/upload you will see a language cell. Click in it, and the dropdown menu includes "Greek, Ancient".

Poster: shri ram parivar Date: Jan 9, 2018 3:48am
Forum: texts Subject: Re: Using Tesseract to improving OCR for some languages

Jeff,

Is there support for Devanagari script (for Sanskrit, Hindi, Marathi, Nepali, Konkani languages) and other Indian scripts OCR in archive.org?

If not, is there a possibility of using Tesseract, Google Drive, or the Google Vision API (as Wikisource does) for these?

https://wikisource.org/wiki/Wikisource_talk:Google_OCR

Poster: Jeff Kaplan Date: Jan 9, 2018 12:12pm
Forum: texts Subject: Re: Using Tesseract to improving OCR for some languages

those are not OCRable by archive.org at this time. We would not be using the sources you mention.

Poster: kashcid vipashcit Date: Jan 10, 2018 10:24am
Forum: texts Subject: Re: Using Tesseract to improving OCR for some languages

> those are not OCRable by archive.org at this time. We would not be using the sources you mention.

What is the barrier to using Tesseract or the Google Vision API (as Wikisource does)? Is it something we can help with, for example by contributing code?

Poster: Jeff Kaplan Date: Jan 10, 2018 10:59am
Forum: texts Subject: Re: Using Tesseract to improving OCR for some languages

our ocr files are unique to archive.org so that various features can work. open one up and you will see what i mean.

Poster: kashcid vipashcit Date: Jan 10, 2018 12:56pm
Forum: texts Subject: Re: Using Tesseract to improving OCR for some languages

I've seen the likes of https://ia800207.us.archive.org/9/items/HistoryOfDharmasastraancientAndMediaevalReligiousAndCivilLawV.1/Kane_A-History-of-Dharmasastra-v1_1930_djvu.txt - could you point out what features you're talking about?

Even without these features, just having _some_ OCR output for all those Indic texts would be better than having nothing (in terms of search, further processing, etc.).

Poster: Jeff Kaplan Date: Jan 10, 2018 1:05pm
Forum: texts Subject: Re: Using Tesseract to improving OCR for some languages

https://archive.org/download/HistoryOfDharmasastraancientAndMediaevalReligiousAndCivilLawV.1/Kane_A-History-of-Dharmasastra-v1_1930_abbyy.gz is the file you want to be looking at

you are welcome to OCR anything and upload that as a separate item.
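
For anyone who wants to "open one up" as suggested above, here is a minimal Python sketch for peeking inside a downloaded `_abbyy.gz` file. It assumes the file is gzip-compressed ABBYY FineReader XML in which each recognized character is a `charParams` element carrying `l`, `t`, `r`, `b` coordinates; that layout, the placeholder namespace, and the synthetic `SAMPLE` document are assumptions for illustration, not something confirmed in this thread.

```python
import gzip
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def summarize_abbyy(stream, max_chars=5):
    """Count element types and collect the first few character boxes.

    `stream` is a binary file object yielding the uncompressed XML.
    The element and attribute names are assumed, not guaranteed.
    """
    counts = Counter()
    first_chars = []
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        counts[name] += 1
        if name == "charParams" and len(first_chars) < max_chars:
            box = tuple(elem.get(key) for key in ("l", "t", "r", "b"))
            first_chars.append((elem.text, box))
        elem.clear()  # keep memory flat on book-sized files
    return counts, first_chars


# Synthetic stand-in for a real file; the namespace is a placeholder.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<document xmlns="urn:example:abbyy">
<page width="100" height="100"><block blockType="Text"><text><par>
<line l="1" t="2" r="20" b="12"><formatting lang="EnglishUnitedStates">
<charParams l="1" t="2" r="10" b="12">H</charParams>
<charParams l="11" t="2" r="20" b="12">i</charParams>
</formatting></line></par></text></block></page></document>"""

with gzip.open(io.BytesIO(gzip.compress(SAMPLE)), "rb") as stream:
    counts, first_chars = summarize_abbyy(stream)
print(counts)
print(first_chars)
```

For a real download, replace the `io.BytesIO(...)` argument with the local path to the `_abbyy.gz` file. If the assumed layout holds, the per-character boxes are what distinguishes this file from the plain `_djvu.txt` output.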

Poster: shri ram parivar Date: Jan 10, 2018 6:35pm
Forum: texts Subject: Re: Using Tesseract to improving OCR for some languages

>>you are welcome to OCR anything and upload that as a separate item

Thanks for the suggestion, Jeff.

Please let me know the steps for uploading an OCRed book (e.g. in Devanagari script, Sanskrit language) so that the text layer in the PDF is used for the text version and the searchable PDF is retained.

Thanks.
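
For the local OCR step only (not the upload), a minimal Python sketch of driving the Tesseract command line might look like the following. It assumes a reasonably recent Tesseract is installed along with the `san` and `hin` traineddata files, and that the `pdf` and `txt` output configs are available; the file names `page_0001.png` and `page_0001` are hypothetical.

```python
import shlex
import subprocess


def tesseract_command(image_path, output_base, languages=("san",), outputs=("pdf", "txt")):
    """Build a call of the form: tesseract IMAGE OUTPUTBASE -l LANGS CONFIG...

    `languages` are traineddata names joined with '+';
    `outputs` are Tesseract config names selecting the output formats.
    """
    return [
        "tesseract",
        str(image_path),
        str(output_base),
        "-l",
        "+".join(languages),
        *outputs,
    ]


# Hypothetical single page image, recognized as Sanskrit plus Hindi.
cmd = tesseract_command("page_0001.png", "page_0001", languages=("san", "hin"))
print(shlex.join(cmd))

# Uncomment to actually run it (requires Tesseract and the language data):
# subprocess.run(cmd, check=True)
```

With the `pdf` config, Tesseract writes a PDF with a text layer next to the plain-text output, which is the kind of searchable PDF asked about here. How archive.org treats such a file after upload is a separate question.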

Poster: Jeff Kaplan Date: Jan 10, 2018 9:20pm
Forum: texts Subject: Re: Using Tesseract to improving OCR for some languages

the system does not modify the uploaded file so it remains intact. your ocr file can be available for download but it will not enable search inside in the bookreader.

Poster: aibek Date: Jan 30, 2014 7:19pm
Forum: texts Subject: Re: Using Tesseract to improving OCR for some languages

To see what Jeff meant by ‘Abbyy module’ see the attached extract from a log file.

I suggest that you upload a PDF file containing Ancient Greek exclusively, in the manner suggested by Jeff, and check what happens.

Attachment: Module-AbbyyXML.txt

Poster: tfmorris Date: Feb 11, 2014 8:56am
Forum: texts Subject: Re: Using Tesseract to improving OCR for some languages

Does anyone look at improving OCR quality on an ongoing basis, whether through the use of Tesseract or other means? Is OCR ever re-done after the initial pass?

I did a study recently scoring the OCR quality of public domain eBooks in IA and found the quality to be all over the map. I suspect that Tesseract could do a better job in many cases, but I also suspect that the ABBYY results could be improved as well.

I saw some anecdotal evidence that high processing loads on the OCR cluster and the use of "fast mode" were correlated with low OCR quality. That seems fine as an interim measure if the books were later requeued for full processing, but that doesn't seem to happen.

Like Nick, I'd be willing to help with improving the OCR quality.

Tom
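
As an illustration of the scoring-and-requeue idea above, here is a minimal Python sketch. It is not the method used in the study mentioned in this post: it uses a crude dictionary-hit ratio as a stand-in quality score, and the threshold, word list, and item identifiers are all hypothetical.

```python
import re


def ocr_quality_score(text, vocabulary):
    """Return the fraction of alphabetic tokens found in `vocabulary` (0.0 to 1.0)."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in vocabulary)
    return known / len(tokens)


def needs_reprocessing(items, vocabulary, threshold=0.8):
    """List identifiers whose OCR text scores below `threshold`."""
    return [
        identifier
        for identifier, text in items.items()
        if ocr_quality_score(text, vocabulary) < threshold
    ]


# Tiny made-up word list and two made-up items.
vocabulary = {"the", "quick", "brown", "fox"}
items = {
    "good-item": "The quick brown fox",
    "bad-item": "Tbe qnick hrown f0x",
}
requeue = needs_reprocessing(items, vocabulary)
print(requeue)
```

A real word list would have to match the language of each book, which is exactly where a mislabeled language would also show up as a low score.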

Poster: Astrapto Date: Oct 4, 2016 10:14am
Forum: texts Subject: Re: Using Tesseract to improving OCR for some languages

It's really a shame that there's so little interest in this, because I thought the Internet Archive stood for openness. Right now it's using (donated) money for (outdated) proprietary software, when it could be driving improvement on an open source solution. If only Bill Gates could spare $10 million to develop or promote Tesseract further... there would be real benefits to better OCR software in the archive. All this knowledge would be more accurate, more searchable, more handicap-accessible... Also, if OCR accuracy deteriorates under certain conditions, as Tom says, then those files should be put into a queue to be re-processed at a better time.
This post was modified by Astrapto on 2016-10-04 17:14:54