Text copy on webpage
RaPLeX
Senior Member
****

Fenerbahce SK

Posts: 543
Reputation: 13
33 / Male / Flag
Joined: Jun 2005
O.P. Text copy on webpage
Is there any way to copy these articles into a Word file?

For example:

http://www.hurriyet.com.tr/yazarlar/10921110.asp?yazarid=4&gid=61

http://www.hurriyet.com.tr/yazarlar/10921675.asp?yazarid=1&gid=61
02-04-2009 11:35 PM
djdannyp
Elite Member
*****

Danny <3 Sarah

Posts: 3546
Reputation: 31
37 / Male / Flag
Joined: Mar 2006
RE: Text copy on webpage
Highlight all the text, then copy and paste it into a Word document?

Save the HTML page to disk and open it with Word?
02-04-2009 11:37 PM
Thor
Veteran Member
*****

Awwwwwwww.

Posts: 1118
Reputation: 42
31 / – / Flag
Joined: May 2006
RE: Text copy on webpage
quote:
Originally posted by RaPLeX
Is there any way to copy these articles into a Word file?

For example:

http://www.hurriyet.com.tr/yazarlar/10921110.asp?yazarid=4&gid=61

http://www.hurriyet.com.tr/yazarlar/10921675.asp?yazarid=1&gid=61
  1. Find text you want to copy
  2. Locate your cursor
  3. Aim your cursor at the beginning of what you want to copy
  4. Press and hold the left mouse button down
  5. Move the cursor to the ending of what you want to copy whilst holding down the left mouse button
  6. Release mouse button
  7. Press and hold "Ctrl" then press "C"
  8. Release "Ctrl" and "C"
  9. Switch to/open the Word window
  10. Press and hold "Ctrl" then press "V"
  11. Release "Ctrl" and "V"
  12. ???
  13. Profit!

This post was edited on 02-04-2009 at 11:45 PM by Thor.
02-04-2009 11:42 PM
RaPLeX
O.P. RE: Text copy on webpage
That didn't work. I tried opening the pages in Dreamweaver after saving the .asp files, but I got the same result.
quote:
Originally posted by djdannyp
Highlight all the text, then copy and paste it into a Word document?

Save the HTML page to disk and open it with Word?

[Image: ekranalntsay3.th.jpg]

Try copying and pasting it into your own Word file, then see what you get...
02-04-2009 11:43 PM
MeEtc
Patchou's look-alike
*****

In the Shadow Gallery once again

Posts: 2200
Reputation: 60
38 / Male / Flag
Joined: Nov 2004
Status: Away
RE: Text copy on webpage
For all those of you saying what to do, why don't you actually TRY and TEST it before you post?

The site has some really messed up stylesheet and hidden text or something, causing all the copied text to have random letters and numbers added and/or replaced.

For example, the two words "Sayın Hızlan", which stand on their own on the page, have the following HTML code:
code:
Say<span class="b162">545aab</span>&#305;n <span class="eikn">ocekeu</span>H&#305;z<span class="s1ja">silzgi</span>lan<span class="g36u">utmuui</span>

I tried running a regular expression search and replace with /<span class="\w*">\w*</span>/, but even then it didn't remove everything. The HTML is badly malformed in places.
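
For anyone who wants to reproduce this, here is a minimal Python sketch of that search and replace, run on the sample line quoted in the code box. Python is an assumption (the post doesn't say which tool was used), and the names DECOY_SPAN, strip_decoys and sample are made up for illustration.

```python
import re

# The pattern from the post: a <span> whose class value and content
# consist only of "word" characters (letters, digits, underscore).
DECOY_SPAN = re.compile(r'<span class="\w*">\w*</span>')


def strip_decoys(html: str) -> str:
    """Remove every span that matches the pattern above."""
    return DECOY_SPAN.sub("", html)


# The sample line quoted in the post.
sample = (
    'Say<span class="b162">545aab</span>&#305;n '
    '<span class="eikn">ocekeu</span>H&#305;z'
    '<span class="s1ja">silzgi</span>lan'
    '<span class="g36u">utmuui</span>'
)

cleaned = strip_decoys(sample)
print(cleaned)  # Say&#305;n H&#305;zlan
```

Anything the pattern doesn't describe exactly (a different attribute, a capitalised tag, a hyphen or space inside the span) is left untouched, which might be one reason some junk survived.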

Here, I think this is how it should be:
quote:
Abbas Sayar’&#305;n bir &#351;iiri

EDEB&#304;YAT ö&#287;retmeni okurum Mehmet Gözüya&#351;l&#305;’dan bir faks ald&#305;m.

Gözüya&#351;l&#305;, sevdi&#287;im, okudu&#287;um, and&#305;&#287;&#305;m Abbas Sayar’&#305;n (1923-1999) ona imzalad&#305;&#287;&#305; Yorgan&#305;m&#305; S&#305;k&#305; Sar kitab&#305;ndaki ithafta yer alan &#351;iirini gönderdi:

Ni&#287;de, 03 &#350;ubat 2009Say&#305;n H&#305;zlanKa&#351;garl&#305;’dan günümüze nice bilim adam&#305;m&#305;z&#305;n, sanatç&#305;m&#305;z&#305;n kaybolmu&#351; ürünlerini dü&#351;ündükçe
de&#287;erli &#351;air, romanc&#305;, ressam, botanikçi Abbas Sayar’&#305;n bu &#351;iirini kaybolmaktan kurtaraca&#287;&#305;n&#305;z için size te&#351;ekkür ederim
.1982 Temmuz’unda Yozgat Yerköy Bulamaçl&#305; Kapal&#305;çar&#351;&#305;s&#305;’nda sararm&#305;&#351; ekinlerin rüzgarla ç&#305;kard&#305;&#287;&#305; sesi belirterek "Do&#287;a
konu&#351;maz diyorlar; bu do&#287;ru de&#287;il ye&#287;enim. Bak, do&#287;a konu&#351;uyor. Bunlar&#305;n dili sevgidir. Do&#287;an&#305;n dilini herkes anlamaz"
dedikten sonra bu &#351;iiri yazm&#305;&#351; ve hiçbir yerde, hiç kimsede bu &#351;iirin olmad&#305;&#287;&#305;n&#305;, ölümünden sonra o&#287;luna ula&#351;t&#305;rmam&#305;
söylemi&#351;ti.Do&#287;ayla konu&#351;an, anla&#351;an de&#287;erli sanatç&#305;m&#305;z&#305;n -Yozgat yöresi bitkileriyle ilgili önemli bir kaynak ki&#351;iydi-
an&#305;s&#305;na sayg&#305;lar&#305;mla.Mehmet Gözüya&#351;l&#305;U&#287;ur DershanesiTürkçe Edebiyat Ö&#287;retmeniNi&#287;deYe&#287;enim,Mehmet Gözüya&#351;l&#305;’ya mutlu olmak
dile&#287;iyleÇAREBakt&#305;m;Topra&#287;a dü&#351;ecek gibide&#287;il su,Tohumu;Buluta ektim."* * *ABBAS SAYAR, edebiyata &#351;iirle ba&#351;lad&#305;, Y&#305;lk&#305;
At&#305; roman&#305;yla TRT 1970 Sanat Ödülleri yar&#305;&#351;mas&#305;nda ba&#351;ar&#305; ödülünü, ikinci roman Çelo ile Türk Dil Kurumu (1973), Can
&#350;enli&#287;i ile de Madaral&#305; Roman Ödülü’nü (1975) kazand&#305;.Onu ünlendiren Y&#305;lk&#305; At&#305;’n&#305;n konusu, kocad&#305;&#287;&#305;, i&#351; göremez duruma
geldi&#287;i için k&#305;&#351;a, açl&#305;&#287;a terk edilmi&#351; bir at&#305;n hikáyesi idi.Can &#350;enli&#287;i, terk edilmi&#351; seksen ya&#351;&#305;ndaki bir adam&#305;n,
can yolda&#351;&#305; e&#351;e&#287;inin, onun için bir ’can &#351;enli&#287;i’ olmas&#305;yd&#305;. Trajik bir kitapt&#305;r.Çelo, bir cinayet davas&#305; üzerine
kurulmu&#351;tur.Ölümünden sonra yay&#305;nlanan yaz&#305;m&#305;n ba&#351;l&#305;&#287;&#305;, edebi ayn&#305; zamanda gerçekçi bir tespitti:"Orta Anadolu’yu,
bozk&#305;r&#305; &#351;iirsel bir dille romanla&#351;t&#305;rd&#305;.Orta Anadolu insan&#305;n&#305;n umars&#305;z, ac&#305;mas&#305;z ya&#351;am&#305;n&#305;n içine sevgiyi katm&#305;&#351;t&#305;
Abbas Sayar. Yaln&#305;z kalm&#305;&#351; ve kalacak ki&#351;ilerin, &#351;artlara direnme ile teslim olma aras&#305;ndaki b&#305;çak s&#305;rt&#305;nda
dola&#351;anlar&#305;n romanc&#305;s&#305;yd&#305;."* * *OKURUM edebiyat ö&#287;retmeni Mehmet Gözüya&#351;l&#305;’ya te&#351;ekkür ederim.Ölümünün onuncu
y&#305;l&#305;nda onu sevgiyle ve sayg&#305;yla anmam&#305;za vesile yaratt&#305;. Bu anman&#305;n yeniden okumalar&#305; da sa&#287;layaca&#287;&#305;
umudunday&#305;m.

OK, well, everything except the Unicode chars.
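
The leftover codes such as &#305; are HTML numeric character references, so they can be decoded in a separate step. A minimal sketch, assuming Python and its standard html module (the variable names are just for illustration):

```python
import html

# Text as it looks after the spans have been stripped: the Turkish
# letters are still written as numeric character references.
stripped = "Say&#305;n H&#305;zlan"

# html.unescape turns each reference into the character it stands for,
# e.g. &#305; becomes U+0131 (dotless i).
decoded = html.unescape(stripped)
print(decoded)  # Sayın Hızlan
```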

This post was edited on 02-05-2009 at 12:46 AM by MeEtc.
02-05-2009 12:43 AM
Jarrod
Veteran Member
*****

woot simpson

Posts: 1304
Reputation: 20
– / Male / Flag
Joined: Sep 2006
RE: Text copy on webpage
I haven't tried it (yet), but maybe capture the webpage as a JPG and run OCR on it?

This post was edited on 02-05-2009 at 06:17 AM by Jarrod.

02-05-2009 06:17 AM
Quantum
Disabled Account
*****

Away.

Posts: 1055
Reputation: -17
30 / Male / Flag
Joined: Feb 2007
RE: Text copy on webpage
quote:
Originally posted by Jarrod
I haven't tried it (yet), but maybe capture the webpage as a JPG and run OCR on it?


This is what I was going to say :)

Press PrtSc to capture the page and crop it down to just the text you want, then run it through a program like ABBYY :P

Good Luck
No longer here.
02-05-2009 01:55 PM
CookieRevised
Elite Member
*****


Posts: 15519
Reputation: 173
– / Male / Flag
Joined: Jul 2003
Status: Away
RE: Text copy on webpage
Doing what MeEtc did is the best thing you can do:

1) Open the page.

2) Copy the source (the portion between the two <div>s that contains the bulk of the text).

3) Paste the code into a Unicode-supporting regular expression test kit (one can most likely be found online too if you don't have one).
Unicode support is extremely important here, since not all regular expression interpreters handle Unicode properly.

4) Use a regular expression to strip out the copy protection (the HTML is not malformed; it is a copy-protection scheme). I'm not sure MeEtc's regular expression will work properly, though. I haven't checked all the <span> class names, so I might be wrong, but some might contain characters outside the [A-Za-z0-9_] set. So I would use something like /<span(.*?)span>/ instead (which is a non-greedy regular expression). Oh, and make it case-insensitive in case there is a <sPaN>.

;)
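
A rough Python sketch of step 4, assuming Python's re module as the Unicode-aware regex engine (no particular tool is named here). ANY_SPAN and strip_spans are made-up names, the DOTALL flag is an addition of this sketch, and the second test string is a hypothetical variant rather than something taken from the site.

```python
import re

# Non-greedy match from "<span" to the nearest following "span>".
# IGNORECASE covers oddities like <sPaN>; DOTALL (an extra assumption)
# lets "." cross line breaks in case a span is split over two lines.
ANY_SPAN = re.compile(r"<span(.*?)span>", re.IGNORECASE | re.DOTALL)


def strip_spans(html: str) -> str:
    """Remove every <span ...>...</span> element, whatever it contains."""
    return ANY_SPAN.sub("", html)


# Fragment of the page source quoted earlier in the thread.
print(strip_spans('Say<span class="b162">545aab</span>&#305;n'))  # Say&#305;n

# Hypothetical variant that a \w-only pattern would leave behind.
print(strip_spans('H&#305;z<sPaN class="x-1">ab cd</SPAN>lan'))  # H&#305;zlan
```

Note that this removes every span on the page, so it only makes sense if the article uses spans for nothing but the junk text.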

quote:
Originally posted by Jarrod
I haven't tried it (yet), but maybe capture the webpage as a JPG and run OCR on it?
Then at least capture it in a lossless image format if you wish to do OCR, or you will be introducing difficulties and possible errors even before you start OCR'ing. Never capture something as JPG if you want to OCR it (unless you have run out of disk space or something).

Anyway, OCR will always leave some errors behind, or at least you'll never be 100% certain that everything is correct without manually checking each letter again, because there is always the possibility of a mismatch (OCR is always a form of guesswork). OCR'ing Unicode is also a lot more difficult than OCR'ing normal ASCII...



---

Lame copy protection, though... but it seems to work against some :p

Why on earth do I need to write such an epistle again? arrrghhh...

This post was edited on 02-06-2009 at 02:25 AM by CookieRevised.
02-06-2009 02:18 AM
Th3rmal
Veteran Member
*****

Peek-a-boo! I see you!!

Posts: 1226
Reputation: 26
32 / Male / Flag
Joined: Aug 2005
RE: Text copy on webpage
Were you able to solve the issue? If so, which method did you use?

quote:
Originally posted by CookieRevised

Why on earth do I need to write such an epistle again? arrrghhh...
Because you're Cookie :P
02-06-2009 05:09 AM
Mike
Elite Member
*****

Meet the Spam Family!

Posts: 2795
Reputation: 48
31 / Male / Flag
Joined: Mar 2003
RE: Text copy on webpage
quote:
Originally posted by CookieRevised
OCR'ing Unicode is also a lot more difficult than OCR'ing normal ASCII...
ABBYY has done the job fine for some Greek texts I OCR'ed with it. It even works great on photographs taken with my mobile phone! :O
02-06-2009 11:48 AM