Sunday, March 4, 2018

Apache Tika Tutorial

Apache Tika is a library that detects and extracts text and metadata from over a thousand file types, such as PDF, Word, and Excel. This is useful when you care about the content of a file rather than its format and want to build your logic on top of that content.

Let's look at a simple example of how to use Tika.

First, add the dependency to the Gradle build file:

compile('org.apache.tika:tika-parsers:1.17')

For the latest version, check the Maven Central repository. Maven Central also shows how to declare the dependency for Maven and other build tools.
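
For a Maven build, the same coordinates from the Gradle line would typically be declared in pom.xml as shown here. This is a sketch that simply reuses version 1.17 from this post; check Maven Central for the current version before copying it.

```xml
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.17</version>
</dependency>
```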

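// Imports used by the snippets in this post
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

// Get the file you want to read and convert it into an InputStream
// (remember to close the stream once parsing is done)
File fileHandle = new File("<File Location>");
InputStream inputStream = new FileInputStream(fileHandle);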

// Create the Tika parser. It detects the file type from the file's leading bytes (magic numbers) and any hints in the metadata, then delegates to the matching format parser
AutoDetectParser parser = new AutoDetectParser(TikaConfig.getDefaultConfig());

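// Create the objects that will hold the data once Tika reads the file:
// Metadata receives the file's metadata, the handler collects the body text.
// Note: new BodyContentHandler() stops at 100,000 characters; pass -1 to remove the limit
Metadata metadata = new Metadata();
ContentHandler handler = new BodyContentHandler();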

// Finally, parse the file; the text goes to the handler and the metadata to the Metadata object
parser.parse(inputStream, handler, metadata, new ParseContext());

//Print out the text data as read from file
System.out.println(handler.toString());
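
The parser step above relies on Tika working out the file type by itself. One of the signals it uses is the file's leading "magic" bytes. The sketch below is a simplified, standard-library-only illustration of that idea and is not Tika's actual implementation; the class name MagicByteSniffer and the method sniff are hypothetical, and only three well-known signatures are covered.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; Tika's real detector is far more thorough.
class MagicByteSniffer {

    // Returns a best-guess MIME type based on the first bytes of a file.
    static String sniff(byte[] head) {
        if (startsWith(head, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(head, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) {
            return "image/png";
        }
        if (startsWith(head, new byte[] {'P', 'K', 0x03, 0x04})) {
            return "application/zip";
        }
        // Unknown signature: fall back to the generic binary type
        return "application/octet-stream";
    }

    // True when data begins with every byte of prefix, in order.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(pdfHead)); // prints application/pdf
    }
}
```

A ZIP signature alone cannot tell a plain archive from a .docx or .xlsx file, which are ZIP containers. That is one reason to leave detection to a library such as Tika rather than hand-rolling it.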
