A job ad can reveal a lot about a company's intentions, policies, procedures, technology stack, and data collection and retention practices. A client sent us an interesting job posting from a well-known VPN provider that claims to keep no logs. Several of the technologies listed in the ad are generally used for large-scale data mining, log processing, machine learning and analytics. Snippet below.
Technologies in the job ad, demystified:
Hadoop: Apache Hadoop is a platform for distributed storage, processing and analysis of big data sets (petabyte scale) using the MapReduce programming model. It runs on computer clusters built from commodity hardware. Companies such as Amazon, Google, Facebook, Twitter and Yahoo use it for data mining, machine learning, data indexing, reporting, log processing and analytics.
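To make the MapReduce model concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code and does not reflect any provider's actual system; the log format, the sample lines and all function names are assumptions invented for illustration. It only shows the map, shuffle and reduce steps applied to connection-style log lines.

```python
from collections import defaultdict

# Hypothetical log lines (assumed format): "timestamp client_ip destination_host"
LOG_LINES = [
    "2017-01-01T10:00:00 10.0.0.1 example.com",
    "2017-01-01T10:00:05 10.0.0.2 example.org",
    "2017-01-01T10:01:00 10.0.0.1 example.net",
    "2017-01-01T10:02:00 10.0.0.1 example.com",
]


def map_phase(line):
    """Map step: emit one (client_ip, 1) pair per log line."""
    _timestamp, client_ip, _destination = line.split()
    yield client_ip, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on a list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(LOG_LINES))
```

On a real cluster the map and reduce steps would run in parallel across many machines, which is what makes this pattern practical for very large log volumes.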
Hive: Apache Hive is data warehouse software built on top of Apache Hadoop that provides data summarization, query and analysis. Hive offers an SQL-like interface for querying data stored in the various databases and file systems that integrate with Hadoop.
Impala: Impala brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation. Impala is promoted for analysts and data scientists to perform analytics on data stored in Hadoop via SQL or business intelligence tools. The result is that large-scale data processing (via MapReduce) and interactive queries can be done on the same system using the same data and metadata – removing the need to migrate data sets into specialized systems and/or proprietary formats simply to perform analysis.
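Hive and Impala both expose stored data through SQL. The sketch below shows the kind of aggregate query an analyst might write, using Python's built-in sqlite3 module as a stand-in; it is not Hive or Impala, and the table name, columns and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical rows: (client_ip, destination, bytes_sent)
ROWS = [
    ("10.0.0.1", "example.com", 500),
    ("10.0.0.2", "example.com", 300),
    ("10.0.0.1", "example.org", 200),
]


def top_destinations(rows, limit=2):
    """Return (destination, visits, total_bytes) for the most visited destinations."""
    conn = sqlite3.connect(":memory:")  # in-memory database, stand-in for a warehouse
    conn.execute(
        "CREATE TABLE connections "
        "(client_ip TEXT, destination TEXT, bytes_sent INTEGER)"
    )
    conn.executemany("INSERT INTO connections VALUES (?, ?, ?)", rows)
    query = """
        SELECT destination, COUNT(*) AS visits, SUM(bytes_sent) AS total_bytes
        FROM connections
        GROUP BY destination
        ORDER BY visits DESC, destination ASC
        LIMIT ?
    """
    result = conn.execute(query, (limit,)).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in top_destinations(ROWS):
        print(row)
```

The same GROUP BY style of query, run over a Hadoop cluster instead of a single in-memory table, is what lets these tools summarize very large data sets interactively.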
Spark: Apache Spark is a general engine for large-scale data processing. It is designed to perform both batch processing (similar to Hadoop's MapReduce) and newer workloads such as streaming, interactive queries and machine learning.
Zeppelin: A framework that enables interactive data analytics. It brings data ingestion, data exploration, visualization, sharing and collaboration features to Hadoop and Spark.
Splunk: A platform for operational intelligence. It is used to search, monitor and analyze logs, traffic, etc.
A company offering a free proxy or a low-cost VPN has to generate revenue or continuously raise money to cover its expenses. It needs to pay for employees, consultants, infrastructure, bandwidth, shareholder returns and other overheads. There is no free lunch, as wise people say! Do you trust your VPN provider?