Doing what MeEtc did is the best thing you can do:
1) Open the page
2) Copy the source (the portion between the two <div>s that contain the bulk of the text)
3) Paste the code into a Unicode-supporting regular expression test kit (you can most likely find one online if you don't have one)
Unicode support is extremely important here, since not all regular expression interpreters handle Unicode properly.
4) Use a regular expression to strip out the copy protection (the HTML is not malformed; it is a copy-protection scheme). I'm not sure MeEtc's regular expression will work properly, though. I haven't checked all the <span> IDs, so I might be wrong, but some of them may contain characters outside the [A-Za-z0-9_] set. So I would use something like /<span(.*?)span>/ instead (a non-greedy regular expression, so each match stops at the first "span>" that follows, i.e. the closing tag). Oh, and make it case-insensitive in case there is a <sPaN>.
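For anyone who would rather script step 4 than use a test kit, here is a rough Python sketch of the same idea. Python's re module works on Unicode strings by default. The function name strip_spans and the sample string are made up for illustration, the DOTALL flag is my own addition (so a span that wraps across lines still matches), and I'm assuming the junk sits entirely inside the <span> elements, which I haven't verified against the real page.

```python
import re

# The non-greedy pattern suggested above, made case-insensitive.
# DOTALL is an assumption on my part: it lets "." cross line breaks
# in case a span is wrapped over several lines in the page source.
SPAN_RE = re.compile(r"<span(.*?)span>", re.IGNORECASE | re.DOTALL)


def strip_spans(html: str) -> str:
    """Remove every <span ...>...</span> block, including its contents."""
    return SPAN_RE.sub("", html)


# Hypothetical sample; the real page's markup may look different.
sample = "Real <span id='x-1'>junk</span>text <sPaN class=\"é\">más</SPAN>here"
print(strip_spans(sample))
```

Note that this throws away the span contents as well as the tags, so it only makes sense if nothing you want to keep lives inside a <span>.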
quote:
Originally posted by Jarrod
i haven't tried it(yet) but maybe capture the webpage as a jpg and run ocr on it?
Then at least capture it in a lossless image format if you wish to do OCR, or you will be introducing difficulties and possible errors even before you start OCR'ing. Never capture something as JPG if you want to OCR it (unless you have run out of disk space or something).
Anyway, OCR will always leave some errors behind, or at least you'll never be 100% certain that everything is correct without checking each letter manually, because there is always the possibility of a mismatch (OCR is always a form of guesswork). OCR'ing Unicode is also a lot more difficult than OCR'ing plain ASCII...
---
lame copy-protection though... but it seems to work against some
Why on earth do I need to write such an epistle again? arrrghhh...