Tag Archives: Text Extraction

Extracting text from literally everything with Apache Tika

I already briefly introduced you to Apache Tika in a different article. In this article I’ll dive deeper into the wonderful world of Tika and introduce you to the Tika App, which is available as a free download.

The App comes as an executable jar file which can be run by simply double-clicking it, or by opening a command prompt and issuing the following command:

java -jar tika-app-1.2.jar

Once you have done this, you should see the App as in the picture below:

From here you can see Tika at work.

Just drag any file onto it and choose how you want the output displayed: either raw text or formatted text. In the latter case the output is actually XHTML, an XML-compatible variant of HTML which can be transformed into any other format you want if you know the XSLT language. The Tika App can display this XHTML itself: it uses the Java Swing GUI widget toolkit, which includes a (minimal) web browser component.
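What the App does behind the scenes can be reproduced in a few lines of Java with the Tika library itself. The sketch below is a minimal example: it feeds a file through Tika's AutoDetectParser and captures the output as XHTML via a ToXMLContentHandler, which is the same "formatted text" view the App shows. The file name document.doc is just a placeholder for any file you would drag onto the App.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToXMLContentHandler;

public class TikaXhtmlDemo {
    public static void main(String[] args) throws Exception {
        // AutoDetectParser first detects the media type, then
        // delegates to the right parser for that type.
        AutoDetectParser parser = new AutoDetectParser();
        // ToXMLContentHandler serializes the parse events as XHTML.
        ToXMLContentHandler handler = new ToXMLContentHandler();
        Metadata metadata = new Metadata();

        // "document.doc" is a placeholder path for this sketch.
        try (InputStream stream = Files.newInputStream(Paths.get("document.doc"))) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }

        // The full XHTML document, ready for XSLT post-processing.
        System.out.println(handler.toString());
    }
}
```

Swap ToXMLContentHandler for a BodyContentHandler and you get the "raw text" view instead.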

Go ahead and drop as many different file types on the interface as you can, and look at the results.

You can try MP3s and other media types as well. If you drop photos on it you will see the EXIF data from the camera the photo was taken with. Depending on the camera this may contain the location where the photo was taken and the date and time, but also the camera manufacturer and model, plus details like shutter speed and other technical photography data. In MP3s you will typically find artist, album, year and genre.

Files like Office documents also contain data you wouldn’t directly expect. Among other things you can find author, date, title, subject and more.
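All of this metadata is available programmatically too. The sketch below parses a file and prints every metadata field Tika found, whether that is EXIF data from a photo, ID3 tags from an MP3, or document properties from an Office file. The file name photo.jpg is again just a placeholder.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaMetadataDemo {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        // We only care about metadata here, so a plain text handler will do.
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();

        // "photo.jpg" is a placeholder; any supported file type works.
        try (InputStream stream = Files.newInputStream(Paths.get("photo.jpg"))) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }

        // Print every field the parser filled in
        // (EXIF, ID3 tags, Office properties, Content-Type, ...).
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}
```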

If you’re a (Java) programmer like me, it’s also interesting to drag in Java .class files or complete .jar libraries. For compiled Java code the classes and their public method signatures are shown.

Tika History

Before Tika took its present form it was only a system to recognize MIME types. This was necessary to make sure that a file really was of the type its designated parser could handle before it was sent there. When this went wrong (for example a .doc file that was actually an .exe) terrible things could happen, like the crawling process crashing completely and losing days or weeks of work. Crawling can be a very lengthy process, especially when crawling parts of the Internet, which Nutch is designed to do; I once crawled 16 million pages, and that crawl took two weeks non-stop.

At that time the file type (nowadays called the media type) was first detected by Tika and then delegated to the right parser. Parsers came as plugins, following the architecture of Eclipse plugins. This architecture is now a standard better known as OSGi.
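This detect-then-delegate step is still at the heart of Tika today, exposed through the org.apache.tika.Tika facade. The sketch below shows both detection modes: by file name (a simple extension lookup) and by content (magic bytes), which is what protects a crawler from a .exe masquerading as a .doc. The file names are placeholders.

```java
import org.apache.tika.Tika;

public class TikaDetectDemo {
    public static void main(String[] args) {
        Tika tika = new Tika();

        // Detection by name only: an extension lookup.
        System.out.println(tika.detect("report.doc")); // application/msword

        // Detection by content: the magic bytes win over a
        // misleading extension. 'M','Z' is the DOS/Windows
        // executable header, so this is NOT reported as msword.
        byte[] exeHeader = {'M', 'Z', 0, 0};
        System.out.println(tika.detect(exeHeader, "report.doc"));
    }
}
```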

Before Tika existed, files on the Internet were sent straight to a parser based on their extension alone. At that time Nutch was still an early beta and not stable enough to be considered useful in a production environment.