Text copy on webpage |
Author: |
Message: |
RaPLeX
Senior Member
Fenerbahce SK
Posts: 543 Reputation: 13
34 / /
Joined: Jun 2005
|
O.P. Text copy on webpage
|
|
02-04-2009 11:35 PM |
|
|
djdannyp
Elite Member
Danny <3 Sarah
Posts: 3546 Reputation: 31
38 / /
Joined: Mar 2006
|
RE: Text copy on webpage
Highlight all the text then copy and paste into a word document?
Save the HTML page to disc and open it with Word?
|
|
02-04-2009 11:37 PM |
|
|
Thor
Veteran Member
Awwwwwwww.
Posts: 1118 Reputation: 42
32 / – /
Joined: May 2006
|
RE: Text copy on webpage
quote: Originally posted by RaPLeX
Is there any way to copy those articles to a Word file?
For example:
http://www.hurriyet.com.tr/yazarlar/10921110.asp?yazarid=4&gid=61
http://www.hurriyet.com.tr/yazarlar/10921675.asp?yazarid=1&gid=61
- Find text you want to copy
- Locate your cursor
- Aim your cursor at the beginning of what you want to copy
- Press and hold the left mouse button down
- Move the cursor to the ending of what you want to copy whilst holding down the left mouse button
- Release mouse button
- Press and hold "Ctrl" then press "C"
- Release "Ctrl" and "C"
- Switch to/open the Word window
- Press and hold "Ctrl" then press "V"
- Release "Ctrl" and "V"
- ???
- Profit!
This post was edited on 02-04-2009 at 11:45 PM by Thor.
|
|
02-04-2009 11:42 PM |
|
|
RaPLeX
Senior Member
Fenerbahce SK
Posts: 543 Reputation: 13
34 / /
Joined: Jun 2005
|
O.P. RE: Text copy on webpage
Didn't work. I tried saving the .asp files and opening them in Dreamweaver, but got the same result.
quote: Originally posted by djdannyp
Highlight all the text then copy and paste into a word document?
Save the HTML page to disc and open it with Word?
Try copying and pasting it into a Word file yourself, then see what you get...
|
|
02-04-2009 11:43 PM |
|
|
MeEtc
Patchou's look-alike
In the Shadow Gallery once again
Posts: 2200 Reputation: 60
38 / /
Joined: Nov 2004
Status: Away
|
RE: Text copy on webpage
For all those of you saying what to do, why don't you actually TRY and TEST it before you post?
The site has some really messed-up stylesheet and hidden text or something, causing all the copied text to have random letters and numbers added and/or replaced.
Example: the two words "Sayın Hızlan", on their own on the page, have the following HTML:
code: Say<span class="b162">545aab</span>ın <span class="eikn">ocekeu</span>Hız<span class="s1ja">silzgi</span>lan<span class="g36u">utmuui</span>
I tried running a regular expression search and replace of /<span class="\w*">\w*</span>/ but even then it didn't remove everything. The HTML is badly malformed in places.
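For reference, a minimal sketch of that search and replace, run against the sample markup quoted above. Python is an assumption here (the post does not say which tool was used), and the names sample, junk_span and cleaned are only illustrative.

```python
import re

# Sample markup quoted in the post above.
sample = ('Say<span class="b162">545aab</span>ın '
          '<span class="eikn">ocekeu</span>Hız'
          '<span class="s1ja">silzgi</span>lan'
          '<span class="g36u">utmuui</span>')

# The pattern from the post: a <span> whose class name and body
# both consist of word characters only.
junk_span = re.compile(r'<span class="\w*">\w*</span>')

# Replace every matching span with nothing.
cleaned = junk_span.sub('', sample)
print(cleaned)
```

On this sample every span matches. A span whose class name or body contains anything other than word characters would be left in place, which may be why the pattern did not catch everything on the real page.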
Here, I think this is how it should be:
quote: Abbas Sayarın bir şiiri
EDEBİYAT öğretmeni okurum Mehmet Gözüyaşlıdan bir faks aldım.
Gözüyaşlı, sevdiğim, okuduğum, andığım Abbas Sayarın (1923-1999) ona imzaladığı Yorganımı Sıkı Sar kitabındaki ithafta yer alan şiirini gönderdi:
Niğde, 03 Şubat 2009Sayın HızlanKaşgarlıdan günümüze nice bilim adamımızın, sanatçımızın kaybolmuş ürünlerini düşündükçe
değerli şair, romancı, ressam, botanikçi Abbas Sayarın bu şiirini kaybolmaktan kurtaracağınız için size teşekkür ederim
.1982 Temmuzunda Yozgat Yerköy Bulamaçlı Kapalıçarşısında sararmış ekinlerin rüzgarla çıkardığı sesi belirterek "Doğa
konuşmaz diyorlar; bu doğru değil yeğenim. Bak, doğa konuşuyor. Bunların dili sevgidir. Doğanın dilini herkes anlamaz"
dedikten sonra bu şiiri yazmış ve hiçbir yerde, hiç kimsede bu şiirin olmadığını, ölümünden sonra oğluna ulaştırmamı
söylemişti.Doğayla konuşan, anlaşan değerli sanatçımızın -Yozgat yöresi bitkileriyle ilgili önemli bir kaynak kişiydi-
anısına saygılarımla.Mehmet GözüyaşlıUğur DershanesiTürkçe Edebiyat ÖğretmeniNiğdeYeğenim,Mehmet Gözüyaşlıya mutlu olmak
dileğiyleÇAREBaktım;Toprağa düşecek gibideğil su,Tohumu;Buluta ektim."* * *ABBAS SAYAR, edebiyata şiirle başladı, Yılkı
Atı romanıyla TRT 1970 Sanat Ödülleri yarışmasında başarı ödülünü, ikinci roman Çelo ile Türk Dil Kurumu (1973), Can
Şenliği ile de Madaralı Roman Ödülünü (1975) kazandı.Onu ünlendiren Yılkı Atının konusu, kocadığı, iş göremez duruma
geldiği için kışa, açlığa terk edilmiş bir atın hikáyesi idi.Can Şenliği, terk edilmiş seksen yaşındaki bir adamın,
can yoldaşı eşeğinin, onun için bir can şenliği olmasıydı. Trajik bir kitaptır.Çelo, bir cinayet davası üzerine
kurulmuştur.Ölümünden sonra yayınlanan yazımın başlığı, edebi aynı zamanda gerçekçi bir tespitti:"Orta Anadoluyu,
bozkırı şiirsel bir dille romanlaştırdı.Orta Anadolu insanının umarsız, acımasız yaşamının içine sevgiyi katmıştı
Abbas Sayar. Yalnız kalmış ve kalacak kişilerin, şartlara direnme ile teslim olma arasındaki bıçak sırtında
dolaşanların romancısıydı."* * *OKURUM edebiyat öğretmeni Mehmet Gözüyaşlıya teşekkür ederim.Ölümünün onuncu
yılında onu sevgiyle ve saygıyla anmamıza vesile yarattı. Bu anmanın yeniden okumaları da sağlayacağı
umudundayım.
OK, well everything except the unicode chars
This post was edited on 02-05-2009 at 12:46 AM by MeEtc.
I cannot hear you. There is a banana in my ear.
|
|
02-05-2009 12:43 AM |
|
|
Jarrod
Veteran Member
woot simpson
Posts: 1304 Reputation: 20
– / /
Joined: Sep 2006
|
RE: Text copy on webpage
I haven't tried it (yet), but maybe capture the webpage as a JPG and run OCR on it?
This post was edited on 02-05-2009 at 06:17 AM by Jarrod.
|
|
02-05-2009 06:17 AM |
|
|
Quantum
Disabled Account
Away.
Posts: 1055 Reputation: -17
31 / /
Joined: Feb 2007
|
RE: Text copy on webpage
quote: Originally posted by Jarrod
I haven't tried it (yet), but maybe capture the webpage as a JPG and run OCR on it?
This is what I was going to say.
Press PrtSc to capture the page and crop it down to just the text you want, then run it through a program like ABBYY.
Good Luck
|
|
02-05-2009 01:55 PM |
|
|
CookieRevised
Elite Member
Posts: 15517 Reputation: 173
– / /
Joined: Jul 2003
Status: Away
|
RE: Text copy on webpage
Doing what MeEtc did is the best thing you can do:
1) Open the page
2) Copy the source (the portion between those two <div>'s containing the bulk of the text)
3) Paste the code into a Unicode-supporting regular expression test kit (one can most likely be found online too if you don't have one)
Unicode support is extremely important here, since not all regular expression interpreters handle Unicode properly.
4) Use a regular expression to strip out the copy-protection (the HTML is not malformed; it is a copy-protection scheme). I'm not sure MeEtc's regular expression will work properly, though. I haven't checked all the <span> class names, so I might be wrong, but some of them might contain characters outside the [A-Za-z0-9_] set. So I would use something like /<span(.*?)span>/ instead (which is a non-greedy regular expression). Oh, and make it case-insensitive in case there is a <sPaN>.
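A rough sketch of step 4, assuming Python as the regex engine (its str patterns handle Unicode by default). The function name strip_junk_spans and the tricky input are made up for illustration, and the DOTALL flag is an addition not mentioned in the post, so that a span body may cross a line break.

```python
import re

# Non-greedy and case-insensitive, as suggested in the post.
# re.DOTALL is an extra assumption: it lets "." match newlines too.
junk_span = re.compile(r'<span(.*?)span>', re.IGNORECASE | re.DOTALL)

def strip_junk_spans(html: str) -> str:
    """Remove everything from each '<span' up to the next 'span>'."""
    return junk_span.sub('', html)

# Hypothetical variants the stricter \w* pattern would miss:
# mixed-case tags, a hyphen in the class name, a space in the body.
tricky = ('Say<sPaN class="b-162">545 aab</SPAN>ın '
          '<span class="eikn">ocekeu</span>Hızlan')

print(strip_junk_spans(tricky))

# The stricter pattern only removes the second span here.
strict = re.compile(r'<span class="\w*">\w*</span>')
print(strict.sub('', tricky))
```

Note that this removes every span on the page, including any legitimate ones, so it is only safe on the portion of the source that holds the protected text.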
quote: Originally posted by Jarrod
I haven't tried it (yet), but maybe capture the webpage as a JPG and run OCR on it?
Then at least capture it in a lossless image format if you wish to do OCR, or you will be introducing difficulties and possible errors even before you start OCR'ing. Never capture something as a JPG if you want to do OCR on it (unless you have run out of disk space or something).
Anyway, OCR will always leave some errors behind, or at least you'll never be 100% certain that everything is correct without checking each letter manually, because there is always the possibility of a mismatch (OCR is always a form of guesswork). OCR'ing Unicode is also a lot more difficult than OCR'ing plain ASCII...
---
lame copy-protection though... but it seems to work against some
Why on earth do I need to write such an epistle again? arrrghhh...
This post was edited on 02-06-2009 at 02:25 AM by CookieRevised.
.-= A 'frrrrrrrituurrr' for Wacky =-.
|
|
02-06-2009 02:18 AM |
|
|
Th3rmal
Veteran Member
Peek-a-boo! I see you!!
Posts: 1226 Reputation: 26
32 / /
Joined: Aug 2005
|
RE: Text copy on webpage
Were you able to solve the issue? If so, which method did you use?
quote: Originally posted by CookieRevised
Why on earth do I need to write such an epistle again? arrrghhh...
Because you're Cookie
You have the intellect comparable to that of a rock. Be proud.
|
|
02-06-2009 05:09 AM |
|
|
Mike
Elite Member
Meet the Spam Family!
Posts: 2795 Reputation: 48
32 / /
Joined: Mar 2003
Status: Online
|
RE: Text copy on webpage
quote: Originally posted by CookieRevised
OCR'ing unicode is also a lot more difficult than OCR'ing normal ascii...
Abbyy has done the job fine for some Greek texts I OCR'ed in it. It even works great on photographs taken with my mobile phone!
|
|
02-06-2009 11:48 AM |
|
|