KDnuggets Home » News » 2018 » Jan » Opinions, Interviews » Plot2txt for quantitative image analysis ( 18:n04 )

Plot2txt for quantitative image analysis


Plot2txt converts images into text and other representations, helping create semi-structured data from binary, using a combination of machine learning and other algorithms.



By William James Brouwer

In recent times, computation has become both pervasive and less constrained by Moore’s Law. This is due in large part to the emergence of cloud computing and the rise of massive parallelism. The former has benefited from network improvements and ever-increasing connectedness, the latter from the appropriation of hardware like Graphics Processing Units (GPUs) for general-purpose computing. This computational leap, coupled with the process of disintermediation [1] taking place around the globe, will continue to support revolutions like artificial intelligence (AI), as many have remarked. AI has a long and interesting history. One of my favorite, albeit underrated, books on the history of computing is by John Markoff [2], a great read that contains various accounts of the seminal AI work of Doug Engelbart and John McCarthy in the 1960s at Stanford.

It’s also hard to ignore the revolution we’re experiencing in materials. Advances in battery technology and permanent magnets are driving the ubiquity of drones and electricity generation by wind power, for example. Perhaps the future really is electric, contrary to Jack White’s assertions [3], though it’s difficult to imagine escaping Earth’s gravitational pull with anything less than the energy density of hydrocarbons. On that note, the combination of advances in materials, AI and other areas is leading to the burgeoning low-cost satellite industry [4]. Incidentally, materials design is something that takes great advantage of big data, computational power and machine learning, as demonstrated by companies such as Citrine.IO or by the work of Stefano Curtarolo et al. at Duke [5], to name a couple.

To summarize thus far, we have unprecedented computational power at our disposal. We are also seeing an exponential increase in images for analysis, and the unquenchable thirst of AI for training data. At plot2txt.com, we’re attempting to help with the latter two by harnessing the former, in conjunction with new and traditional algorithms. These algorithms are bundled into various RESTful API methods created using AWS Lambda, which provide the user with text and further images as output, compressed into a single GZIP archive. A couple of examples will help illustrate.

Aerial imaging is frequently used for monitoring geological and other assets. Figure 1 shows an image of a gargoyle at Sint-Catharinakerk in Eindhoven, Netherlands. Using the group-colors API method, you can create image slices by clustering pixels into groups, like so:

curl --request POST -H "Authorization: Token my_api_key" -H "Accept: image/png" -H "Content-Type: image/png" --data-binary "@data/gargoyle.png" https://b8jd85qv3b.execute-api.us-east-1.amazonaws.com/test/group-colors > foo.tar.gz

 

Figure 1: A) input image; B) one output pixel group; C) pixel similarity matrix using a square radius of 200; D) pixel similarity matrix using a square radius of 2000.

 

The panel on the upper right is one example output, for which you will receive the pixel mean color for the group, as well as the number of pixels in the group in the JSON text response. This might be useful for monitoring the deterioration of the gargoyle over time. You can tweak the size of the color groups by setting a parameter in the query string; the lower images are two pixel similarity matrices corresponding to a smaller (left) and larger (right) pixel group radius.
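The curl call above writes the response to foo.tar.gz. Here is a minimal sketch for unpacking such a response, assuming it is a gzip-compressed tar archive, as the file name suggests. The function name and output directory are hypothetical, and since the file names inside the archive are not described here, the sketch simply lists whatever it finds.

```shell
#!/usr/bin/env bash
# Unpack a plot2txt response archive into its own directory and list the contents.
# Assumption: the response is a gzip-compressed tar archive (per the .tar.gz name).
# unpack_response is a hypothetical helper, not part of the plot2txt API.
unpack_response() {
    local archive="$1"
    local outdir="$2"
    mkdir -p "$outdir" || return 1
    tar -xzf "$archive" -C "$outdir" || return 1
    # Print the extracted file paths relative to outdir, one per line, sorted.
    (cd "$outdir" && find . -type f | sed 's|^\./||' | sort)
}
```

For example, `unpack_response foo.tar.gz out/group-colors` would extract the archive into out/group-colors and print the names of the text and image files it contained.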

Shifting gears somewhat, it may surprise you to learn that most of the documents in the venerable arXiv repository are available for bulk download. The value of the data contained in the millions of figures therein is hard to overestimate, and harder still to harness. Once the documents are split into pages using your favorite Linux utility, you might take advantage of two other p2t API methods: the first to grab bounding boxes in a size range, with a boundary layer, and the second to try to model the contents as a scatter plot:

curl --request POST -H "Authorization: Token my_api_key" -H "Accept: image/png" -H "Content-Type: image/png" --data-binary "@data/page.png" https://b8jd85qv3b.execute-api.us-east-1.amazonaws.com/test/bbox/?upper_area=2000000\&lower_area=20000\&boundary=80 > foo.tar.gz

curl --request POST -H "Authorization: Token my_api_key" -H "Accept: image/png" -H "Content-Type: image/png" --data-binary "@data/simple_scatter.png" https://b8jd85qv3b.execute-api.us-east-1.amazonaws.com/test/simple-scatter-plot?opt=100 > foo.tar.gz

 

Figure 2: A) output from the bbox method; B) output from the scatter-plot method.

 

I’ve omitted the decompression and scripting required to create the second input. Figure 2 shows an example output from the first step (left panel), while the right panel gives an example of the fit to the pixels, showing the index number of inverted x-y abscissa. The data inferred from the scaling information and pixel positions is returned as JSON text. From a collection of about 1700 documents, I was able to model approximately 2500 figures as scatter plots in a matter of hours, using serial requests to the API methods. Note from the figure that at this stage we don’t discriminate between pixel types very well.
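The serial requests mentioned above could be scripted along the following lines. This is only a sketch under stated assumptions: the endpoint and query parameters are copied from the bbox example, while the P2T_API_KEY variable, the function names, the directory of page images and the output naming are all hypothetical choices made for illustration.

```shell
#!/usr/bin/env bash
# Endpoint copied from the bbox example in the article.
P2T_BASE="https://b8jd85qv3b.execute-api.us-east-1.amazonaws.com/test"

# Build the bbox URL from upper area, lower area and boundary values.
bbox_url() {
    printf '%s/bbox/?upper_area=%s&lower_area=%s&boundary=%s\n' "$P2T_BASE" "$1" "$2" "$3"
}

# Send every PNG page in a directory to the bbox method, one request at a time.
# Assumptions: P2T_API_KEY holds your key; each response is saved as <page>.tar.gz.
bbox_all_pages() {
    local pages_dir="$1" out_dir="$2" page name
    mkdir -p "$out_dir" || return 1
    for page in "$pages_dir"/*.png; do
        [ -e "$page" ] || continue   # skip when the glob matches nothing
        name=$(basename "$page" .png)
        curl --silent --fail --request POST \
            -H "Authorization: Token ${P2T_API_KEY}" \
            -H "Accept: image/png" \
            -H "Content-Type: image/png" \
            --data-binary "@${page}" \
            "$(bbox_url 2000000 20000 80)" > "$out_dir/${name}.tar.gz" \
            || echo "request failed for ${page}" >&2
    done
}
```

Calling `bbox_all_pages data/pages out/bbox` would leave one archive per page in out/bbox. Keeping the requests serial mirrors the run described above; the bounding-box crops would still need to be extracted from each archive before being passed to the scatter-plot method.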

Of course, context is everything, and you may wish to scrape the text from the article too. Another method we’ve developed improves upon traditional OCR by returning line position information on the page and indicating horizontal lines, which assists with the mining of text containing tabular data, e.g.:

curl --request POST -H "Authorization: Token my_api_key" -H "Accept: image/png" -H "Content-Type: image/png" --data-binary "@data/page.png" https://b8jd85qv3b.execute-api.us-east-1.amazonaws.com/test/text-lines > foo.tar.gz

There are many good utilities for performing some of the aforementioned tasks, e.g. WebPlotDigitizer, or bespoke scripts that you may have cooked up yourself in Python etc. Our goal is to automate as much of quantitative computer vision as possible, and allow you to perform your work at scale by incorporating our API methods into your cloud solutions. The hope too is that you might gain access to unprecedented amounts of data and, in some cases, novel data sources, to support new and interesting breakthroughs in fields such as the materials and physical sciences.

All p2t methods are currently freely available for testing via this portal: https://plot2txt-staging.us-east-1.elasticbeanstalk.com/home where you can also get an API key upon sign-up.

Happy data mining! -bill b

REFERENCES

[1] “Volatility and Friction in the Age of Disintermediation”, The Hague Centre for Strategic Studies, 2017
[2] “What the Dormouse Said: How the Sixties Counterculture Shaped the Personal Computer Industry”, John Markoff, Penguin, 2005
[3] “The Big Three Killed My Baby”, The White Stripes, XL Recordings, 1999
[4] “The Tiny Satellites Ushering in the New Space Revolution”, Bloomberg Businessweek, June 29, 2017
[5] “Computers Create Recipe for Two New Magnetic Materials”, https://www.sciencedaily.com/releases/2017/04/170415095611.htm

Bio: Bill Brouwer began a career in physics developing an optical diagnostic technique for a SCRAMjet engine, before moving to solid state NMR during PhD work. Since this time he has worked consistently in high performance computing, and is currently a senior HPC engineer with SLB, working with a mix of algorithms, wired/wireless/cellular networks, HPC/cloud computing and embedded processing, among other things. He began plot2txt several years ago, inspired by document engineering work completed as a postdoc at Penn State.
