
Text copy on webpage
CookieRevised
RE: Text copy on webpage
Doing what MeEtc did is the best thing you can do:

1) Open the page

2) Copy the source (the portion between those two <div>'s containing the bulk of the text)

3) Paste the code into a Unicode-aware regular expression test kit (one can most likely be found online if you don't have one).
Unicode support is extremely important here, since not all regular expression interpreters handle Unicode properly.

4) Use a regular expression to strip out the copy-protection (the HTML is not malformed; it is a copy-protection scheme). Though I'm not sure MeEtc's regular expression will work properly. I've not checked all the <span> IDs, so I might be wrong, but there might be some which contain characters outside the [A-Za-z0-9_] list. So I would use something like /<span(.*?)span>/ instead (which is a non-greedy regular expression). Oh, and make it case-insensitive in case there is a <sPaN>.
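
For anyone without a regex test kit at hand, steps 3 and 4 can be roughly sketched in Python, whose strings are Unicode by default. The sample markup, the `strip_spans` name and the DOTALL flag are my own assumptions for illustration (I haven't seen the real page source); only the /<span(.*?)span>/ pattern and the case-insensitivity come from the steps above.

```python
import re

# Hypothetical sample: decoy <span> elements interleaved with the real text.
# One ID contains a hyphen, which is outside [A-Za-z0-9_].
protected = (
    'Caf<span id="x1">qzk</span>é au l<sPaN id="a-9">\nzz</SPAN>ait, '
    '日本<span class="j">ノイズ</span>語'
)

# Non-greedy match from "<span" up to the nearest "span>".
# IGNORECASE covers variants such as <sPaN>; DOTALL (an assumption on my part)
# lets "." cross line breaks in case a decoy span is split over several lines.
SPAN_RE = re.compile(r"<span(.*?)span>", re.IGNORECASE | re.DOTALL)


def strip_spans(html: str) -> str:
    """Remove every <span ...>...</span> element, including its contents."""
    return SPAN_RE.sub("", html)


print(strip_spans(protected))  # prints: Café au lait, 日本語
```

Note that this throws away everything inside each span, so it only helps if the real text sits outside the spans, as it appears to on this page.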

;)

quote:
Originally posted by Jarrod
i haven't tried it(yet) but maybe capture the webpage as a jpg and run ocr on it?
Then at least capture it in a lossless image format if you wish to do OCR, or you will be introducing difficulties and possible errors even before you start OCR'ing. Never capture something as JPG if you want to OCR it (unless you have run out of disk space or something).

Anyway, OCR will always leave some errors behind, or at least you'll never be 100% certain that everything is correct without manually checking each letter again, because there is always the possibility of a mismatch (OCR is always a form of guesswork). OCR'ing Unicode is also a lot more difficult than OCR'ing plain ASCII...



---

lame copy-protection though... but it seems to work against some :p

Why on earth do I need to write such an epistle again? arrrghhh...

This post was edited on 02-06-2009 at 02:25 AM by CookieRevised.
.-= A 'frrrrrrrituurrr' for Wacky =-.
02-06-2009 02:18 AM


