Text copy on webpage

Text copy on webpage by RaPLeX on 02-04-2009 at 11:35 PM

Is there any way to copy these articles into a Word file?

For example:

http://www.hurriyet.com.tr/yazarlar/10921110.asp?yazarid=4&gid=61

http://www.hurriyet.com.tr/yazarlar/10921675.asp?yazarid=1&gid=61


RE: Text copy on webpage by djdannyp on 02-04-2009 at 11:37 PM

Highlight all the text, then copy and paste it into a Word document?

Save the HTML page to disk and open it with Word?


RE: Text copy on webpage by Thor on 02-04-2009 at 11:42 PM

quote:
Originally posted by RaPLeX
Is there any way to copy these articles into a Word file?

For example:

http://www.hurriyet.com.tr/yazarlar/10921110.asp?yazarid=4&gid=61

http://www.hurriyet.com.tr/yazarlar/10921675.asp?yazarid=1&gid=61

  1. Find the text you want to copy
  2. Locate your cursor
  3. Aim your cursor at the beginning of what you want to copy
  4. Press and hold the left mouse button down
  5. Move the cursor to the ending of what you want to copy whilst holding down the left mouse button
  6. Release mouse button
  7. Press and hold "Ctrl" then press "C"
  8. Release "Ctrl" and "C"
  9. Switch to/open the Word window
  10. Press and hold "Ctrl" then press "V"
  11. Release "Ctrl" and "V"
  12. ???
  13. Profit!

RE: Text copy on webpage by RaPLeX on 02-04-2009 at 11:43 PM

That didn't work. I also tried saving the .asp files and opening them in Dreamweaver, but got the same result.


quote:
Originally posted by djdannyp
Highlight all the text, then copy and paste it into a Word document?

Save the HTML page to disk and open it with Word?

[Image: ekranalntsay3.th.jpg]

Try copying and pasting it into your Word file, then see what you get...

RE: Text copy on webpage by MeEtc on 02-05-2009 at 12:43 AM

To all of you saying what to do: why don't you actually TRY and TEST it before you post?

The site has a really messed-up stylesheet with hidden text or something, which causes the copied text to have random letters and numbers added and/or replaced.

For example, the two words "Sayin Hizlan", on their own on the page, have the following HTML code:

code:
Say<span class="b162">545aab</span>&#305;n <span class="eikn">ocekeu</span>H&#305;z<span class="s1ja">silzgi</span>lan<span class="g36u">utmuui</span>

I tried running a regular expression search and replace of /<span class="\w*">\w*</span>/, but even then it didn't remove everything. The HTML is badly malformed in places.

Here, I think this is how it should be:
quote:
Abbas Sayar’&#305;n bir &#351;iiri

EDEB&#304;YAT ö&#287;retmeni okurum Mehmet Gözüya&#351;l&#305;’dan bir faks ald&#305;m.

Gözüya&#351;l&#305;, sevdi&#287;im, okudu&#287;um, and&#305;&#287;&#305;m Abbas Sayar’&#305;n (1923-1999) ona imzalad&#305;&#287;&#305; Yorgan&#305;m&#305; S&#305;k&#305; Sar kitab&#305;ndaki ithafta yer alan &#351;iirini gönderdi:

Ni&#287;de, 03 &#350;ubat 2009Say&#305;n H&#305;zlanKa&#351;garl&#305;’dan günümüze nice bilim adam&#305;m&#305;z&#305;n, sanatç&#305;m&#305;z&#305;n kaybolmu&#351; ürünlerini dü&#351;ündükçe
de&#287;erli &#351;air, romanc&#305;, ressam, botanikçi Abbas Sayar’&#305;n bu &#351;iirini kaybolmaktan kurtaraca&#287;&#305;n&#305;z için size te&#351;ekkür ederim
.1982 Temmuz’unda Yozgat Yerköy Bulamaçl&#305; Kapal&#305;çar&#351;&#305;s&#305;’nda sararm&#305;&#351; ekinlerin rüzgarla ç&#305;kard&#305;&#287;&#305; sesi belirterek "Do&#287;a
konu&#351;maz diyorlar; bu do&#287;ru de&#287;il ye&#287;enim. Bak, do&#287;a konu&#351;uyor. Bunlar&#305;n dili sevgidir. Do&#287;an&#305;n dilini herkes anlamaz"
dedikten sonra bu &#351;iiri yazm&#305;&#351; ve hiçbir yerde, hiç kimsede bu &#351;iirin olmad&#305;&#287;&#305;n&#305;, ölümünden sonra o&#287;luna ula&#351;t&#305;rmam&#305;
söylemi&#351;ti.Do&#287;ayla konu&#351;an, anla&#351;an de&#287;erli sanatç&#305;m&#305;z&#305;n -Yozgat yöresi bitkileriyle ilgili önemli bir kaynak ki&#351;iydi-
an&#305;s&#305;na sayg&#305;lar&#305;mla.Mehmet Gözüya&#351;l&#305;U&#287;ur DershanesiTürkçe Edebiyat Ö&#287;retmeniNi&#287;deYe&#287;enim,Mehmet Gözüya&#351;l&#305;’ya mutlu olmak
dile&#287;iyleÇAREBakt&#305;m;Topra&#287;a dü&#351;ecek gibide&#287;il su,Tohumu;Buluta ektim."* * *ABBAS SAYAR, edebiyata &#351;iirle ba&#351;lad&#305;, Y&#305;lk&#305;
At&#305; roman&#305;yla TRT 1970 Sanat Ödülleri yar&#305;&#351;mas&#305;nda ba&#351;ar&#305; ödülünü, ikinci roman Çelo ile Türk Dil Kurumu (1973), Can
&#350;enli&#287;i ile de Madaral&#305; Roman Ödülü’nü (1975) kazand&#305;.Onu ünlendiren Y&#305;lk&#305; At&#305;’n&#305;n konusu, kocad&#305;&#287;&#305;, i&#351; göremez duruma
geldi&#287;i için k&#305;&#351;a, açl&#305;&#287;a terk edilmi&#351; bir at&#305;n hikáyesi idi.Can &#350;enli&#287;i, terk edilmi&#351; seksen ya&#351;&#305;ndaki bir adam&#305;n,
can yolda&#351;&#305; e&#351;e&#287;inin, onun için bir ’can &#351;enli&#287;i’ olmas&#305;yd&#305;. Trajik bir kitapt&#305;r.Çelo, bir cinayet davas&#305; üzerine
kurulmu&#351;tur.Ölümünden sonra yay&#305;nlanan yaz&#305;m&#305;n ba&#351;l&#305;&#287;&#305;, edebi ayn&#305; zamanda gerçekçi bir tespitti:"Orta Anadolu’yu,
bozk&#305;r&#305; &#351;iirsel bir dille romanla&#351;t&#305;rd&#305;.Orta Anadolu insan&#305;n&#305;n umars&#305;z, ac&#305;mas&#305;z ya&#351;am&#305;n&#305;n içine sevgiyi katm&#305;&#351;t&#305;
Abbas Sayar. Yaln&#305;z kalm&#305;&#351; ve kalacak ki&#351;ilerin, &#351;artlara direnme ile teslim olma aras&#305;ndaki b&#305;çak s&#305;rt&#305;nda
dola&#351;anlar&#305;n romanc&#305;s&#305;yd&#305;."* * *OKURUM edebiyat ö&#287;retmeni Mehmet Gözüya&#351;l&#305;’ya te&#351;ekkür ederim.Ölümünün onuncu
y&#305;l&#305;nda onu sevgiyle ve sayg&#305;yla anmam&#305;za vesile yaratt&#305;. Bu anman&#305;n yeniden okumalar&#305; da sa&#287;layaca&#287;&#305;
umudunday&#305;m.

OK, well, everything except the Unicode chars.
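
The leftover &#305;-style codes are HTML numeric character references, and they can be decoded in one step. A minimal Python sketch, assuming the cleaned-up text is already held in a string; the sample is the two words from the example above, and the variable names are only illustrative:

```python
import html

# The two example words after the junk <span> elements have been removed,
# still carrying numeric character references for the Turkish dotless i.
encoded = "Say&#305;n H&#305;zlan"

# html.unescape converts numeric (and named) character references
# into the actual Unicode characters.
decoded = html.unescape(encoded)

print(decoded)  # Sayın Hızlan
```

The decoded text can then be pasted into Word, provided it is saved or copied in a Unicode-aware encoding such as UTF-8.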

RE: Text copy on webpage by Jarrod on 02-05-2009 at 06:17 AM

I haven't tried it (yet), but maybe capture the webpage as a JPG and run OCR on it?


RE: Text copy on webpage by Quantum on 02-05-2009 at 01:55 PM

quote:
Originally posted by Jarrod
I haven't tried it (yet), but maybe capture the webpage as a JPG and run OCR on it?


This is what I was going to say :)

PrtSc the page and crop it down to just the text you want, then run it through a program like Abbyy :P

Good luck

RE: Text copy on webpage by CookieRevised on 02-06-2009 at 02:18 AM

Doing what MeEtc did is the best thing you can do:

1) Open the page

2) Copy the source (the portion between those two <div>'s containing the bulk of the text)

3) Paste the code into a Unicode-supporting regular expression test kit (one can most likely be found online too if you don't have one)
Unicode support is extremely important here, since not all regular expression interpreters handle Unicode properly.

4) Use a regular expression to strip out the copy protection (the HTML is not malformed; it is a copy-protection scheme). I'm not sure MeEtc's regular expression will work properly, though. I've not checked all the <span> class names, so I might be wrong, but there might be some that contain characters outside the [A-Za-z0-9_] set. So I would use something like /<span(.*?)span>/ instead (which is a non-greedy regular expression). Oh, and make it case-insensitive in case there is a <sPaN>.
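
The stripping step above can be sketched in Python. This is only a sketch, not a tested recipe for the whole page: the sample string is the snippet MeEtc quoted earlier in the thread, the names DECOY_SPAN and strip_decoy_spans are made up for illustration, and the pattern is the non-greedy one suggested above with a case-insensitive flag. It assumes every <span> in the pasted fragment is a decoy, so any legitimate <span> text would be deleted as well.

```python
import re

# Snippet quoted earlier in the thread: real text interleaved with junk spans.
sample = (
    'Say<span class="b162">545aab</span>&#305;n '
    '<span class="eikn">ocekeu</span>H&#305;z'
    '<span class="s1ja">silzgi</span>lan<span class="g36u">utmuui</span>'
)

# Non-greedy: match from "<span" to the NEAREST following "span>".
# IGNORECASE also catches variants such as <sPaN>; DOTALL lets a span
# that is broken across lines still match.
DECOY_SPAN = re.compile(r"<span(.*?)span>", re.IGNORECASE | re.DOTALL)


def strip_decoy_spans(fragment: str) -> str:
    """Remove every <span ...>...</span> element from the fragment."""
    return DECOY_SPAN.sub("", fragment)


cleaned = strip_decoy_spans(sample)
print(cleaned)  # Say&#305;n H&#305;zlan
```

A greedy /<span(.*)span>/ would instead swallow everything from the first <span to the last span>, real text included, which is why the non-greedy form matters here.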

;)

quote:
Originally posted by Jarrod
I haven't tried it (yet), but maybe capture the webpage as a JPG and run OCR on it?
Then at least capture it in a lossless image format if you wish to do OCR, or you will be introducing difficulties and possible errors even before you start OCR'ing. Never capture something as JPG if you want to OCR it (unless you have run out of disk space or something).

Anyway, OCR will always leave some errors behind, or at least you'll never be 100% certain that everything is correct without manually checking each letter again, because there is always the possibility of a mismatch (OCR is always a form of guesswork). OCR'ing Unicode is also a lot more difficult than OCR'ing normal ASCII...



---

lame copy-protection though... but it seems to work against some :p

Why on earth do I need to write such an epistle again? arrrghhh...

RE: Text copy on webpage by Th3rmal on 02-06-2009 at 05:09 AM

Were you able to solve the issue? If so, which method did you use?

quote:
Originally posted by CookieRevised

Why on earth do I need to write such an epistle again? arrrghhh...
Because you're Cookie :P

RE: Text copy on webpage by Mike on 02-06-2009 at 11:48 AM

quote:
Originally posted by CookieRevised
OCR'ing unicode is also a lot more difficult than OCR'ing normal ascii...
Abbyy has done the job fine for some Greek texts I OCR'ed in it. It even works great on photographs taken with my mobile phone! :O

RE: Text copy on webpage by CookieRevised on 02-07-2009 at 12:23 AM

quote:
Originally posted by Mike
quote:
Originally posted by CookieRevised
OCR'ing unicode is also a lot more difficult than OCR'ing normal ascii...
Abbyy has done the job fine for some Greek texts I OCR'ed in it. It even works great on photographs taken with my mobile phone! :O
I certainly don't doubt that. I don't know any better OCR program than Abbyy, but it is still pattern matching, which always involves a bit of guessing while taking statistics into account. You only knew it worked after you'd checked the text thoroughly, no? Do you blindly trust the output of the OCR'ing without checking anything? After checking the first two pages, would you trust the OCR'ing enough to skip the other 100 pages? For some stuff OCR'ing is absolutely great; for other things, there might be better ways where you don't need to double-check everything.

Cookie, shush, alright, will you...