Text copy on webpage - Printable Version
Text copy on webpage by RaPLeX on 02-04-2009 at 11:35 PM
Is there any way to copy these articles into a Word file?
For example:
http://www.hurriyet.com.tr/yazarlar/10921110.asp?yazarid=4&gid=61
http://www.hurriyet.com.tr/yazarlar/10921675.asp?yazarid=1&gid=61
RE: Text copy on webpage by djdannyp on 02-04-2009 at 11:37 PM
Highlight all the text, then copy and paste it into a Word document?
Save the HTML page to disk and open it with Word?
RE: Text copy on webpage by Thor on 02-04-2009 at 11:42 PM
quote: Originally posted by RaPLeX
Is there any way to copy these articles into a Word file?
For example:
http://www.hurriyet.com.tr/yazarlar/10921110.asp?yazarid=4&gid=61
http://www.hurriyet.com.tr/yazarlar/10921675.asp?yazarid=1&gid=61
- Find text you want to copy
- Locate your cursor
- Aim your cursor at the beginning of what you want to copy
- Press and hold the left mouse button down
- Move the cursor to the ending of what you want to copy whilst holding down the left mouse button
- Release mouse button
- Press and hold "Ctrl" then press "C"
- Release "Ctrl" and "C"
- Switch to/open the Word window
- Press and hold "Ctrl" then press "V"
- Release "Ctrl" and "V"
- ???
- Profit!
RE: Text copy on webpage by RaPLeX on 02-04-2009 at 11:43 PM
Didn't work. I tried opening them in Dreamweaver after saving the .asp files, but got the same result again.
quote: Originally posted by djdannyp
Highlight all the text, then copy and paste it into a Word document?
Save the HTML page to disk and open it with Word?
Try copying and pasting it into your Word file, then see what you get...
RE: Text copy on webpage by MeEtc on 02-05-2009 at 12:43 AM
For all of you handing out instructions, why don't you actually TRY and TEST them before posting?
The site uses a really messed-up stylesheet with hidden text or something, which causes all the copied text to have random letters and numbers added and/or substituted.
Example: two words that stand on their own on the page, "Sayın Hızlan", have the following HTML code:
code: Say<span class="b162">545aab</span>ın <span class="eikn">ocekeu</span>Hız<span class="s1ja">silzgi</span>lan<span class="g36u">utmuui</span>
I tried running a regular expression search and replace of /<span class="\w*">\w*</span>/, but even then it didn't remove everything. The HTML is badly malformed in places.
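For reference, the search and replace above can be reproduced in a few lines of Python. This is only a sketch: the helper name `strip_filler_spans` is made up for illustration, and the sample string is the snippet quoted above, not the whole page.

```python
import re

# Same pattern as above: a <span> with a class attribute whose content is
# made up of word characters only. In Python 3, \w is Unicode-aware for
# str patterns.
FILLER_SPAN = re.compile(r'<span class="\w*">\w*</span>')

def strip_filler_spans(html: str) -> str:
    """Remove the junk <span> elements (hypothetical helper name)."""
    return FILLER_SPAN.sub("", html)

sample = ('Say<span class="b162">545aab</span>ın '
          '<span class="eikn">ocekeu</span>Hız'
          '<span class="s1ja">silzgi</span>lan'
          '<span class="g36u">utmuui</span>')

print(strip_filler_spans(sample))  # prints: Sayın Hızlan
```

Any span whose class or content holds something other than letters, digits or underscores would slip past this pattern, which may be why it did not catch everything.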
Here, I think this is how it should be:
quote: Abbas Sayarın bir şiiri
EDEBİYAT öğretmeni okurum Mehmet Gözüyaşlıdan bir faks aldım.
Gözüyaşlı, sevdiğim, okuduğum, andığım Abbas Sayarın (1923-1999) ona imzaladığı Yorganımı Sıkı Sar kitabındaki ithafta yer alan şiirini gönderdi:
Niğde, 03 Şubat 2009Sayın HızlanKaşgarlıdan günümüze nice bilim adamımızın, sanatçımızın kaybolmuş ürünlerini düşündükçe
değerli şair, romancı, ressam, botanikçi Abbas Sayarın bu şiirini kaybolmaktan kurtaracağınız için size teşekkür ederim
.1982 Temmuzunda Yozgat Yerköy Bulamaçlı Kapalıçarşısında sararmış ekinlerin rüzgarla çıkardığı sesi belirterek "Doğa
konuşmaz diyorlar; bu doğru değil yeğenim. Bak, doğa konuşuyor. Bunların dili sevgidir. Doğanın dilini herkes anlamaz"
dedikten sonra bu şiiri yazmış ve hiçbir yerde, hiç kimsede bu şiirin olmadığını, ölümünden sonra oğluna ulaştırmamı
söylemişti.Doğayla konuşan, anlaşan değerli sanatçımızın -Yozgat yöresi bitkileriyle ilgili önemli bir kaynak kişiydi-
anısına saygılarımla.Mehmet GözüyaşlıUğur DershanesiTürkçe Edebiyat ÖğretmeniNiğdeYeğenim,Mehmet Gözüyaşlıya mutlu olmak
dileğiyleÇAREBaktım;Toprağa düşecek gibideğil su,Tohumu;Buluta ektim."* * *ABBAS SAYAR, edebiyata şiirle başladı, Yılkı
Atı romanıyla TRT 1970 Sanat Ödülleri yarışmasında başarı ödülünü, ikinci roman Çelo ile Türk Dil Kurumu (1973), Can
Şenliği ile de Madaralı Roman Ödülünü (1975) kazandı.Onu ünlendiren Yılkı Atının konusu, kocadığı, iş göremez duruma
geldiği için kışa, açlığa terk edilmiş bir atın hikáyesi idi.Can Şenliği, terk edilmiş seksen yaşındaki bir adamın,
can yoldaşı eşeğinin, onun için bir can şenliği olmasıydı. Trajik bir kitaptır.Çelo, bir cinayet davası üzerine
kurulmuştur.Ölümünden sonra yayınlanan yazımın başlığı, edebi aynı zamanda gerçekçi bir tespitti:"Orta Anadoluyu,
bozkırı şiirsel bir dille romanlaştırdı.Orta Anadolu insanının umarsız, acımasız yaşamının içine sevgiyi katmıştı
Abbas Sayar. Yalnız kalmış ve kalacak kişilerin, şartlara direnme ile teslim olma arasındaki bıçak sırtında
dolaşanların romancısıydı."* * *OKURUM edebiyat öğretmeni Mehmet Gözüyaşlıya teşekkür ederim.Ölümünün onuncu
yılında onu sevgiyle ve saygıyla anmamıza vesile yarattı. Bu anmanın yeniden okumaları da sağlayacağı
umudundayım.
OK, well, everything except the Unicode characters.
RE: Text copy on webpage by Jarrod on 02-05-2009 at 06:17 AM
I haven't tried it (yet), but maybe capture the webpage as a JPG and run OCR on it?
RE: Text copy on webpage by Quantum on 02-05-2009 at 01:55 PM
quote: Originally posted by Jarrod
I haven't tried it (yet), but maybe capture the webpage as a JPG and run OCR on it?
This is what I was going to say.
PrtSc the page and crop it down until only the text you want is left, then run it through a program like Abbyy.
Good luck
RE: Text copy on webpage by CookieRevised on 02-06-2009 at 02:18 AM
Doing what MeEtc did is the best thing you can do:
1) Open the page
2) Copy the source (the portion between the two <div>s containing the bulk of the text)
3) Paste the code into a Unicode-supporting regular expression test kit (one can most likely be found online too if you don't have one)
Unicode support is extremely important here, since not all regular expression interpreters handle Unicode properly.
4) Use a regular expression to strip out the copy protection (the HTML is not malformed, it is a copy-protection scheme). I'm not sure MeEtc's regular expression will work properly, though. I haven't checked all the <span> class names and contents, so I might be wrong, but some of them might contain characters that aren't in the [A-Za-z0-9_] list. So I would use something like /<span(.*?)span>/ instead (which is a non-greedy regular expression). Oh, and make it case-insensitive in case there is a <sPaN>.
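A rough Python sketch of step 4, assuming the filler always sits inside <span> elements as in the snippet MeEtc quoted. The helper name `strip_spans` and the second sample string are invented for illustration, and `re.DOTALL` is an extra assumption so that a span broken across lines is still caught.

```python
import re

# Non-greedy and case-insensitive, as suggested above. re.DOTALL is an
# added assumption: it lets "." match newlines inside a span.
ANY_SPAN = re.compile(r"<span(.*?)span>", re.IGNORECASE | re.DOTALL)

def strip_spans(html: str) -> str:
    """Delete everything from each '<span' up to the nearest 'span>'."""
    return ANY_SPAN.sub("", html)

# Part of the snippet quoted earlier in the thread.
quoted = ('Say<span class="b162">545aab</span>ın '
          '<span class="eikn">ocekeu</span>Hız')
print(strip_spans(quoted))  # prints: Sayın Hız

# Made-up span with mixed case and non-word characters, which a
# \w-based pattern would leave behind.
tricky = 'lan<sPaN class="x-1">a!b\nc</SPAN>.'
print(strip_spans(tricky))  # prints: lan.
```

Note that this deletes every <span> together with its contents, so it only suits pages where the spans carry nothing but filler.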
quote: Originally posted by Jarrod
I haven't tried it (yet), but maybe capture the webpage as a JPG and run OCR on it?
Then at least capture it in a lossless image format if you wish to do OCR, or you will be introducing difficulties and possible errors even before you start OCR'ing. Never capture something as JPG if you want to OCR it (unless you have run out of disk space or something).
Anyway, OCR will always leave some errors behind, or at least you'll never be 100% certain that everything is correct without manually checking each letter again, because there is always the possibility of a mismatch (OCR is always a form of guesswork). OCR'ing Unicode is also a lot more difficult than OCR'ing normal ASCII...
---
lame copy-protection though... but it seems to work against some
Why on earth do I need to write such an epistle again? arrrghhh...
RE: Text copy on webpage by Th3rmal on 02-06-2009 at 05:09 AM
Were you able to solve the issue? If so, which method did you use?
quote: Originally posted by CookieRevised
Why on earth do I need to write such an epistle again? arrrghhh...
Because you're Cookie
RE: Text copy on webpage by Mike on 02-06-2009 at 11:48 AM
quote: Originally posted by CookieRevised
OCR'ing Unicode is also a lot more difficult than OCR'ing normal ASCII...
Abbyy has done the job fine for some Greek texts I OCR'ed in it. It even works great on photographs taken with my mobile phone!
RE: Text copy on webpage by CookieRevised on 02-07-2009 at 12:23 AM
quote: Originally posted by Mike
quote: Originally posted by CookieRevised
OCR'ing Unicode is also a lot more difficult than OCR'ing normal ASCII...
Abbyy has done the job fine for some Greek texts I OCR'ed in it. It even works great on photographs taken with my mobile phone!
I certainly don't doubt that. I don't know of any better OCR program than Abbyy, but it is still pattern matching, which always involves a bit of guessing while taking statistics into account. You only knew it worked after you had checked the text thoroughly, no? Do you blindly trust the OCR output without checking anything? After checking the first two pages, would you trust the OCR enough to skip the other 100 pages? For some stuff OCR is absolutely great; for other things there might be better ways, ones where you don't need to double-check everything.
Cookie, shush already, will you...