[Solved] Spark in Business Intelligence


I think you should build a Hive data warehouse using Hive, or a MongoDB data warehouse using MongoDB. I didn’t understand how you are going to mix them, but I will try to answer the question anyway.

Usually, you configure the BI tool with a JDBC driver for the DB of your choice (e.g. Hive), and the BI tool fetches the data through that JDBC driver. How the driver retrieves the data from the DB is completely transparent to the BI tool.

Thus, you can use Hive, Shark, or any other DB that ships with a JDBC driver.
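To make the JDBC setup concrete, here is a minimal sketch of what the BI tool does under the hood when pointed at HiveServer2. The host name, port, database, and query are placeholders for illustration; the `jdbc:hive2://host:port/db` URL pattern is the standard HiveServer2 form.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    // HiveServer2 JDBC URLs follow the jdbc:hive2://host:port/db pattern.
    static String hiveUrl(String host, int port, String db) {
        return "jdbc:hive2://" + host + ":" + port + "/" + db;
    }

    public static void main(String[] args) {
        // Placeholder connection details; substitute your own cluster's.
        String url = hiveUrl("hive-host.example.com", 10000, "default");
        System.out.println(url);

        // Against a live HiveServer2 (and with the Hive JDBC driver on the
        // classpath), a BI tool effectively runs the equivalent of:
        //
        // try (Connection conn = DriverManager.getConnection(url, "user", "");
        //      Statement stmt = conn.createStatement();
        //      ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM sales")) {
        //     while (rs.next()) System.out.println(rs.getLong(1));
        // }
    }
}
```

The key point is that nothing here is Hive-specific from the BI tool's perspective: swap the URL and driver JAR, and the same code talks to Impala, Shark, or any other JDBC-capable store.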

I can summarize your options this way:

Hive: the most complete feature set and the most compatible tool. It can query plain data as-is, or you can ETL the data into its ORC format to boost performance.

Impala: claims to be faster than Hive but has a less complete feature set. It can query plain data as-is, or you can ETL the data into its Parquet format to boost performance.

Shark: cutting edge, not mainstream yet. Performance depends on what fraction of your data fits into RAM across your cluster.
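The "ETL into ORC" step mentioned for the Hive option is usually just a `CREATE TABLE ... STORED AS ORC` statement issued over the same JDBC connection. A hedged sketch of building such a statement follows; the table names are made up for illustration.

```java
public class OrcEtlSketch {
    // Builds a HiveQL CTAS statement that rewrites a source table into
    // ORC format; ORC's columnar layout is what boosts query performance.
    static String orcCtas(String target, String source) {
        return "CREATE TABLE " + target
                + " STORED AS ORC AS SELECT * FROM " + source;
    }

    public static void main(String[] args) {
        // Hypothetical tables: sales_raw (plain text) -> sales_orc (ORC).
        System.out.println(orcCtas("sales_orc", "sales_raw"));
    }
}
```

The Impala/Parquet path is analogous: the same pattern with `STORED AS PARQUET` run through Impala's driver.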

