Apache Tika Tutorial

Sunday, March 4, 2018


Apache Tika is a library that detects the type of a file and extracts its text and metadata, with support for more than a thousand file formats. It is useful when you care about the content of a file rather than its format and want to build your logic on top of that content.

Let's look at a simple example of how to use Tika.

First, add the dependency to the Gradle build file:

compile('org.apache.tika:tika-parsers:1.17')

For the latest version, check the Maven Central repository. It also shows how to declare the dependency for Maven and other build tools.
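Before walking through the parser step by step, a quick check that the dependency resolved can be done with Tika's facade class, org.apache.tika.Tika. This is a minimal sketch, not part of the original walkthrough: the class name TikaDetectExample and the helper detectType are made up for illustration, and the file path is taken from the command line.

```java
import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;

public class TikaDetectExample {

    // Hypothetical helper: returns the detected media type, e.g. "application/pdf".
    static String detectType(File file) throws IOException {
        // The facade uses Tika's default configuration and detector.
        Tika tika = new Tika();
        return tika.detect(file);
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path of the file to inspect.
        System.out.println(detectType(new File(args[0])));
    }
}
```

If this compiles and prints a media type, the Tika jars are on the classpath and the rest of the tutorial should work.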

//Get the file you want to read and convert it into an InputStream

File fileHandle = new File("<File Location>");

InputStream inputStream = new FileInputStream(fileHandle);

//Create the Tika parser. AutoDetectParser works out the file type from the leading bytes of the stream (plus any hints in the metadata) and delegates to the matching parser

AutoDetectParser parser = new AutoDetectParser(TikaConfig.getDefaultConfig());

//Create the objects that will hold the results: Metadata for the file's metadata, and a handler for the extracted text

Metadata metadata = new Metadata();

ContentHandler handler = new BodyContentHandler();

//Finally, parse the file

parser.parse(inputStream, handler, metadata, new ParseContext());

//Print the text that was read from the file

System.out.println(handler.toString());
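Putting the steps above together, a complete program might look like the sketch below. A few things here are additions rather than part of the original snippet: the class name TikaExample and the helper extractText are made up for illustration, the file path comes from the command line, the stream is closed with try-with-resources, and BodyContentHandler is created with -1 to lift its default limit of 100,000 characters. The loop at the end prints the metadata that the original snippet collects but never shows.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class TikaExample {

    // Hypothetical helper: parses the file, fills in metadata, returns the body text.
    static String extractText(File file, Metadata metadata)
            throws IOException, SAXException, TikaException {
        // The no-argument constructor uses the default Tika configuration.
        AutoDetectParser parser = new AutoDetectParser();
        // -1 disables the handler's default 100,000-character write limit.
        BodyContentHandler handler = new BodyContentHandler(-1);
        // try-with-resources closes the stream even if parsing fails.
        try (InputStream in = Files.newInputStream(file.toPath())) {
            parser.parse(in, handler, metadata, new ParseContext());
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is assumed to be the path of the file to read.
        Metadata metadata = new Metadata();
        String text = extractText(new File(args[0]), metadata);

        // Print the extracted text.
        System.out.println(text);

        // Print every metadata field Tika found for this file.
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}
```

For very large files, removing the write limit means the whole text is held in memory, so keeping the default limit or passing a smaller one may be the safer choice.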
