In Defense of Screen Scraping

Monday, February 18, 2008

Hand-coded image recognition. Checking individual color values to try to figure out if the random smear of pixels forms an "L" or an "H". Dubious OCR schemes. Internet spambots. Click fraud. Hacking.

These are the things most people think of when they hear the phrase "screen scraping."

They're wrong, of course.

But you still won't find Screen Scraping 101 up at the local college. You won't find it in books, except here and there, occasionally, mentioned in passing, usually with a smirk. I've said before: screen scraping should always be a last resort. But you know what? Screen scraping has gotten a bad rap and it's not entirely deserved.

My experience?

Getting our applications to talk to one another has been something of a software Holy Grail. I give you DLL exports, COM, DCOM, CORBA, OLE, Remoting, a dozen other technologies. It would be nice if we lived in a world in which each application (and each electronic gadget) exposed a clean, universal Automation interface allowing for external command and control. It should be a requirement for a well-behaved piece of hardware or software that it allow itself to be controlled by other pieces of software.

AlarmClock ac = House.Rooms["Master Bedroom"].Items["Alarm Clock"];

if (DateTime.Now >= ac.WakeTime)
{
   CoffeeMaker cm = House.Rooms["Kitchen"].Items["Coffee Maker"];
   if (cm != null)
   {
      cm.Wake();
      cm.Prepare(2); // 2 cups
      cm.Brew(); // asynchronously brew the coffee
   }
}

But we don't live in that world, at least not yet.

We live in a world which is more like the Tower of Babel. 

The Tower of Babel, by M.C. Escher

Our world is a world of one hundred million competing pieces of software slugging it out across a dozen platforms, using different languages, different protocols, different storage mechanisms, different endianness, different everything. It's a huge mess and in part we're glad, because entire industries are born of these differences. Without them, many of us wouldn't have jobs.

But if I had to say in a word what the single most expensive characteristic of modern software is, complexity would not be my choice. I would choose words like interoperability, migration, conversion, transformation, adjustment: any word describing the enormous friction created by trying to fit billions of square pegs into billions of round holes, line by line of code, across the trillions of lines of code that comprise the world's code base.

In the midst of all this chaos, one thing we can usually count on is the GUI.

No matter what happens in software, no matter how the underlying mechanisms change and improve, end-user applications will always display textual and graphical elements on a surface. That surface might be your computer screen, your iPod, or one day, your visual cortex. But there will always be a surface and there will always be elements painted on that surface, and the programmers responsible for coding those elements will try to make them pleasing to, and easily readable by, the human eye.

In order to do that they'll usually have to rely on publicly available APIs or libraries because it's enormously difficult to paint good text, or good anything, from scratch. Font rendering is a subject for experts. Drawing anti-aliased lines quickly is a subject for experts. The mathematics of viewing frustums is a subject for experts. If you're looking for an exercise in futility, try to implement a good font-drawing routine from scratch. Get back to me in five years and let me know how it worked out.

No, by and large we're forced to use operating system APIs or third-party libraries because the complexities that they mask would be too costly to implement ourselves. And once we do that, we make our programs accessible not only to humans, but to software. To build an application based on common UI components is to implicitly state that yes, our application can be accessed and manipulated by other applications written by other people.

And this is why I say screen scraping has gotten a bad rap. Screen scraping, true screen scraping, has very little to do with "pushing pixels" and everything to do with accessing display text and other UI elements in a robust way, regardless of whether such use was planned or intended by the creator(s) of the software. That's right: I'm saying that screen scraping, implemented properly, is a clean, robust, generic, and powerful addition to the programmer's bag of tricks.

It just suffers from a bad name, occasional misuse, and widespread misunderstanding.

"Screen scraping is a technique in which a computer program extracts data from the display output of another program. The program doing the scraping is called a screen scraper. The key element that distinguishes screen scraping from regular parsing is that the output being scraped was intended for final display to a human user, rather than as input to another program, and is therefore usually neither documented nor structured for convenient parsing. Screen scraping often involves ignoring binary data (usually images or multimedia data) and formatting elements that obscure the essential, desired text data. Optical character recognition software is a kind of visual scraper."

I can't fault Wikipedia's definition of the phrase, but I think whoever wrote it thought that screen scraping is about how to extract meaning from pixels drawn to the screen. The definition is technically correct, but misleading. I'd like to suggest a different definition.

Screen scraping is a technique in which a computer program extracts data from the target program by examining the properties and behavior of the target program's GUI, and in particular, by examining and hooking the code structures that lie beneath the GUI.

By that definition, screen scraping has a lot more in common with system-level development techniques, or even disassembly, than it does with OCR or image recognition. And when we look at the code structures beneath the GUI, rather than trying to grok the meaning of pixellated randomness, we find that screen-scraping techniques are very robust indeed:

  • DLL Injection
  • API Detouring
  • Window Subclassing
  • Message Processing
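
As a rough, platform-neutral sketch of the idea behind API detouring (an analogy, not a Win32 implementation): the scraper swaps the text-drawing routine a program already calls for a wrapper that records the text and then forwards the call to the original. Every name here (`Renderer`, `draw_text`, `install_detour`, `target_program`) is hypothetical and merely stands in for a real drawing API. A real detour patches native code inside another process, typically after DLL injection.

```python
"""Analogy for API detouring: intercept a drawing call to capture display text.

All names are hypothetical stand-ins; a real detour patches native functions.
"""


class Renderer:
    """Stand-in for a shared drawing library that the target program uses."""

    def draw_text(self, x, y, text):
        # A real library would rasterize glyphs here; we just describe the call.
        return f"drew {text!r} at ({x}, {y})"


def install_detour(obj, method_name, on_call):
    """Replace obj.method_name with a wrapper that reports each call to on_call.

    Returns a function that removes the detour and restores the original.
    """
    original = getattr(obj, method_name)

    def detoured(*args, **kwargs):
        on_call(*args, **kwargs)          # observe the arguments first
        return original(*args, **kwargs)  # then forward to the real routine

    setattr(obj, method_name, detoured)

    def remove():
        setattr(obj, method_name, original)

    return remove


def target_program(renderer):
    """Stand-in for the application being scraped; it is unaware of the hook."""
    renderer.draw_text(10, 20, "Balance: $42.00")
    renderer.draw_text(10, 40, "Status: connected")


captured = []
renderer = Renderer()
remove_detour = install_detour(
    renderer, "draw_text", lambda x, y, text: captured.append((x, y, text))
)
target_program(renderer)   # both calls are recorded
remove_detour()
target_program(renderer)   # detour removed, so nothing more is recorded

for x, y, text in captured:
    print(x, y, text)
```

The point of the sketch is that the scraper receives the text itself, with its coordinates, before a single pixel is painted, so no image recognition is involved.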

In fact, we start to see that screen-scraping is an exercise in conventional development techniques. Windows. DLLs. Processes. Threads.

The stuff of which software is made.

So the next time you hear someone mention screen scraping, think twice before you chuckle. Screen scraping is about writing software with the sensory (visual) and motive (manipulating the mouse, keyboard) capability of a human. It's not always (or even usually) appropriate, but like that most famous of martial arts techniques...

...when properly applied, it's extremely powerful.

Tags: software automation

14 comment(s)

Is this post new?

I read all your posts a few weeks back and I don't remember this one.

This is the 2nd time you've used that Karate Kid pic. It is however somewhat epic in its geekery so I approve.

Is there any way that I could buy the poker bot you made?

If one wants to start to build a screen scraper in C++, where to start? Anyone have links/sample code for me?

Screen scraping works quite nicely when we take a screenshot, then calculate a SHA-256 sum of the needed areas and let a user specify the meaning of each area. Handling 100 areas at a time takes 65 - 80 ms in C++. And that time is only for teaching, not for playing.
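
The approach described in the comment above might be sketched roughly as follows. This is not the commenter's code: it assumes the screenshot is already available as raw RGB bytes in row-major order (capture is omitted), and `region_digest`, `RegionTeacher`, and `make_frame` are hypothetical names made up for the illustration.

```python
import hashlib


def region_digest(frame, width, box):
    """SHA-256 of one rectangular region of a row-major RGB frame (3 bytes/pixel)."""
    left, top, right, bottom = box
    h = hashlib.sha256()
    for y in range(top, bottom):
        row_start = (y * width + left) * 3
        row_end = (y * width + right) * 3
        h.update(frame[row_start:row_end])
    return h.hexdigest()


class RegionTeacher:
    """Maps region digests to user-supplied meanings ("teaching"), then looks them up."""

    def __init__(self):
        self.meanings = {}

    def teach(self, frame, width, box, meaning):
        self.meanings[region_digest(frame, width, box)] = meaning

    def recognize(self, frame, width, box):
        # Returns None when this exact pixel content was never taught.
        return self.meanings.get(region_digest(frame, width, box))


def make_frame(width, height, fill, patch_box, patch):
    """Build a synthetic frame: a solid background with one solid rectangle."""
    pixels = bytearray(bytes(fill) * (width * height))
    left, top, right, bottom = patch_box
    for y in range(top, bottom):
        for x in range(left, right):
            i = (y * width + x) * 3
            pixels[i:i + 3] = bytes(patch)
    return bytes(pixels)


WIDTH, HEIGHT = 8, 8
BUTTON = (2, 2, 6, 4)  # left, top, right, bottom (right/bottom exclusive)

lit = make_frame(WIDTH, HEIGHT, (0, 0, 0), BUTTON, (0, 200, 0))
dark = make_frame(WIDTH, HEIGHT, (0, 0, 0), BUTTON, (60, 60, 60))
shifted = make_frame(WIDTH, HEIGHT, (0, 0, 0), BUTTON, (0, 201, 0))
other_bg = make_frame(WIDTH, HEIGHT, (9, 9, 9), BUTTON, (0, 200, 0))

teacher = RegionTeacher()
teacher.teach(lit, WIDTH, BUTTON, "button enabled")
teacher.teach(dark, WIDTH, BUTTON, "button disabled")

print(teacher.recognize(lit, WIDTH, BUTTON))
print(teacher.recognize(dark, WIDTH, BUTTON))
print(teacher.recognize(other_bg, WIDTH, BUTTON))  # pixels outside the box are ignored
print(teacher.recognize(shifted, WIDTH, BUTTON))   # one channel off by one: no match
```

Because a cryptographic hash matches only exact pixel content, a lookup like this is fast but brittle: anti-aliasing, a theme change, or a one-pixel shift produces a miss, which is part of why the article argues for reading the structures beneath the GUI instead.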

I don't use a GUI when scraping; I mainly use the text output from a page/HTML/program. I thought screen scraping was mainly text/image resource/output gathering. The GUI is somehow invisible. Does an HTML page have a GUI? Or is that data mining?

I recently coded a small app to rip data out of an online store and into a database, for the purpose of building a similar site; I wanted an established price DB for the same items (a collectible card game). I used a lot of regexes to extract the data from an HTML stream. What do you recommend down the other path, as far as open-source OCR packages?

Please keep the articles coming, I'm required to do a project for an operating systems class I'm taking and I'm thinking I can do something along the lines you have presented so far, in regards to the DLL injection and windows form/window management... plus I will learn a shitton on the way... any thoughts?

Hi, I'm a Texas Hold'em player and I play mostly in the POKER STARS.IT poker room. A few years ago I bought a bot from shanky tecnology, but it didn't give me much satisfaction. I recently read your articles on how to build a real BOT, and since I'm not a programmer but am nonetheless tired of losing real money, even though I consider myself a decent player, I was wondering whether you could send me a link to your bot, perhaps a trial version but one that works on the Italian system, and also let me know how much your BOT would cost me, including the various updates. This message is not a joke; I am seriously intent on turning around the way I play poker. Thank you, and I look forward to your reply. Ciao, Franco.....

Great defense of screen scraping, kudos

It should not be a last resort - it should never happen. It creates a nightmare that no one wants to manage.

Coding the Wheel has appeared on the New York Times' Freakonomics blog, Jeff Atwood's Coding Horror, and the front page of Reddit, Slashdot, and Digg.
