In our previously published article, How to Install Apache Tika on Ubuntu Server, we covered the basics of Apache Tika. Apache Tika can be combined with PHP. It detects content and extracts metadata and text from many kinds of files; it can identify more than 1400 file types. Tika is related to the Apache Nutch codebase, and a Python port exists as well. Tika can be deployed on a server in different ways to integrate with various blogging platforms and CMSs, including WordPress.
Here is how to configure Apache Tika with WordPress to search and read the metadata of PDF, Doc, Excel, text and other types of files. This is another example of integrating a Big Data tool with WordPress. Combining search functions is a further example; we have an article on Apache Solr vs. Elasticsearch For WordPress Search. Apache Nutch and Apache Tika are in practice parts of search and crawl, and both can be combined with Apache Solr for other purposes. However, to use Apache Tika with WordPress we do not need to go through Apache Solr, because we only want some functions within WordPress Admin.
How to Configure Apache Tika With WordPress
For new users, the difficult part is installing Apache Tika. With relatively new users in mind, we wrote that Apache Tika installation guide in some detail. As the first step, you need to install Apache Tika on the same server where WordPress is running. Tika can of course be run on a separate server, but configuring a separate-server installation of Tika may be difficult for a new user.
Apart from installing Apache Tika, WordPress needs two plugins to be installed. The first is Search Everything:
https://wordpress.org/plugins/search-everything/
The second is another WordPress plugin named Masala:
https://github.com/nanodust/masala
Masala means spice. Indian masalas are quite popular in America! Tikka means a small piece of meat, fish and so on. Together they make Tikka Masala, and chicken tikka butter masala is known all over the world. Apache projects are deliberately named with various Sanskrit and Buddhist words to avoid copyright matters, to be funny and so on. Apache Tika is Tikka's Tika: a delicious piece for Apache Solr.
Configure Apache Tika for the file types you need, and check on the command line whether it can extract their metadata. After that, install the plugin and read through its source code. The plugin needs Tika's jar file to be present somewhere on your server, and it assumes that Java is installed on the server where WordPress is running. Place the Apache Tika jar file in the project's root folder and configure its path in the masala.php file. The plugin does not have much detailed documentation.
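The command-line check mentioned above can be scripted. The following is only a minimal sketch in Python, not part of the Masala plugin. It assumes the Tika app jar is saved as `tika-app.jar` (a hypothetical file name; the real download carries a version number) and that `java` is on the PATH. The function names `build_tika_command` and `run_tika` are our own. The flags `--metadata`, `--text`, `--json` and `--detect` are options of the Tika app command line.

```python
import shutil
import subprocess
from pathlib import Path

# Output modes of the Tika app CLI that this sketch allows.
ALLOWED_MODES = {"--metadata", "--text", "--json", "--detect"}


def build_tika_command(jar_path, document, mode="--metadata"):
    """Return the argument list for one Tika app invocation."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unsupported mode: {mode}")
    return ["java", "-jar", str(jar_path), mode, str(document)]


def run_tika(jar_path, document, mode="--metadata", timeout=60):
    """Run the Tika app and return its standard output as text."""
    # The plugin has the same two prerequisites: Java and the jar file.
    if shutil.which("java") is None:
        raise RuntimeError("java was not found on PATH")
    if not Path(jar_path).is_file():
        raise FileNotFoundError(f"Tika jar not found: {jar_path}")
    result = subprocess.run(
        build_tika_command(jar_path, document, mode),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tika exits with an error
        timeout=timeout,
    )
    return result.stdout


# Example call with hypothetical paths; adjust them to your server:
# print(run_tika("tika-app.jar", "sample.pdf"))
```

If the call prints metadata lines for your test file, Tika and Java are working on the server, and the remaining work is on the WordPress side.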
When you upload content such as a PDF or DOC file, the plugin processes the file after upload and inserts its metadata. You can then search the attachment's metadata, and the attachment will be listed in the search results.
If you are using Apache Solr for WordPress search, the metadata will itself be searchable, as it is in most search engines.
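To make the idea of searchable metadata more concrete, here is a small Python sketch of what could happen to Tika's output after an upload. It is not the code of the Masala plugin, which is written in PHP. It assumes the metadata was produced with the Tika app's `--json` option, and the helper names `flatten_metadata` and `searchable_text` are our own.

```python
import json


def flatten_metadata(tika_json):
    """Turn Tika's JSON metadata into a dict of plain strings."""
    raw = json.loads(tika_json)
    flat = {}
    for key, value in raw.items():
        # Tika may report several values for one key as a JSON array.
        if isinstance(value, list):
            flat[key] = ", ".join(str(item) for item in value)
        else:
            flat[key] = str(value)
    return flat


def searchable_text(flat):
    """Join all metadata into one block of text that a search can match."""
    return "\n".join(f"{key}: {value}" for key, value in sorted(flat.items()))
```

A block of text like this could be stored alongside the attachment, so that a search plugin which indexes attachment fields can find the file by its author, title or content type.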