|Poster:||Albretch||Date:||May 15, 2022 9:30pm|
|Forum:||texts||Subject:||"text cleansing" / removing brownish background from pdf files ...|
archive.org has taken good care of archiving lots of data, but most (all?) texts available here are not usable as text. Most of them are not readily usable for corpus research.
In this era of "archivism", archiving a text should mean more than just saving it so that someone else can read it some other time.
The "visually pleasing" aspect people find in PDF files is based on layers of formatting aberrations.
I think the quality of the texts can be enhanced greatly by streamlining some functionality based on run-of-the-mill open source software, plus some eyeballing of certain targeted segments of text by a determined community (like the pgdp.net kind of folks). The texts which need care are in the public domain anyway.
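As a rough illustration of what an automated first pass over OCR output could look like, here is a minimal Python sketch using only the standard library. The cleanup rules and the function name clean_ocr_text are assumptions made up for this example, not an existing archive.org or pgdp.net tool.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Apply a few conservative, hypothetical cleanup rules to OCR text."""
    # Normalize Windows line endings.
    text = raw.replace("\r\n", "\n")
    # Rejoin words split by a hyphen at a line break: "archi-\nving" -> "archiving".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Turn single line breaks inside a paragraph into spaces; keep blank lines.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Reduce any run of blank lines to exactly one paragraph break.
    text = re.sub(r" *\n{2,} *", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    sample = "Archi-\nving a text  should\nmean more.\n\n\nSecond   paragraph."
    print(clean_ocr_text(sample))
```

The hyphen rule is deliberately naive: it would also merge a genuine compound such as "well-" / "defined" that happens to break at the end of a line, which is exactly the kind of segment a human proofreader would still need to eyeball.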
Where do folks interested in "text cleansing" hang out? A Google search on: site:https://archive.org/iathreads "text cleansing" gave me 5 unhinged results, and another attempt at: site:https://archive.org "text cleansing" gave me nothing.
I think all texts should be available in a format using an open specification such as ODF (which is also very easily translatable to any other format, including PDF). There should also be provisions for plain texts with encoded media specified in some well-defined way.
Something very important that archive.org should work on before even starting such a cleansing project is a general, well-defined, fluent form of text formatting, from which all kinds of folks would benefit.
I would propose to start such a project with like-minded individuals.