A big data approach to building a web analytics tool for capturing visitor-level data

Analysis of the website browsing behavior of prospects and customers can provide marketers with a goldmine of information about the needs, wants, and preferences of their target audience. To perform this analysis, however, marketers need to move beyond reporting to advanced analytics that leverages user-level clickstream data to build predictive models for specific business scenarios.

The technology challenge in enabling this advanced clickstream analytics lies in consolidating a very large number of browsing signals from multiple sessions (and even devices) into consolidated, user-level profiles that can be used for modeling. Getting access to this raw data, however, is a significant challenge with most contemporary web analytics tools: they aggregate data into pre-defined data models, making them better suited to aggregate campaign/channel-level reporting than to advanced business analytics. Given the sheer volume, velocity, and variety of data generated by web browsing, this challenge makes an interesting use case for practical applications of big data technologies in digital marketing optimization.

In this post, we provide a conceptual overview of how tech-savvy marketers can quickly build a custom web analytics tool that is a commercially viable alternative to off-the-shelf web analytics software. We outline a modular solution development approach, along with a discussion of a wide range of underlying big data technologies that marketers can combine to build purpose-built, cloud-hosted, and fully compliant web analytics solutions to turbo-charge their customer intelligence initiatives.

Finding hidden patterns in website clickstream data requires data to be available at the individual user level. This is a major challenge with contemporary web analytics tools, which have their origins in aggregate-style, anonymous campaign/channel-level reporting. Customer analytics tools do provide access to user-level data but can be cost-prohibitive for high-traffic websites. In such scenarios, building a custom web analytics solution often proves to be the most commercially viable option.

The solution outlined here is abstract and conceptual in nature; even though we reference Amazon cloud components in our discussion, clients are free to apply the concepts to other cloud platforms such as Azure, Rackspace, and Google. It is also entirely possible to implement all the components below in-house on internal hardware, which is the recommended approach when security and compliance obligations require greater control of data.

Identifying the solution building blocks

In line with enterprise architecture best practices, we begin by identifying the various ‘building blocks’ that make up our conceptual solution. These building blocks are abstract technology components that define only the business functionality they are required to implement, leaving the choice of physical technologies to client-specific considerations. The five building blocks are:

Log server: Clients (the browsers of site visitors) send data to the log server by requesting image ‘pixels’ hosted on this server. Every time the pixel is downloaded, a log entry is created on the server, and the information contained in the request URL (sent by the client-side tracker described below) can be used to build user-level data sets with appropriate data processing. A key technical capability of a log server is the ability to service a very large number of pixel requests from client browsers. Content delivery networks can readily provide the technical platform for these log servers; for example, developers can quickly set up Amazon CloudFront to service pixel requests without having to launch their own cluster of web servers.
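To make this concrete, here is a minimal Python sketch of turning one pixel-request log entry into a structured hit record. The five-field, tab-separated layout (date, time, client IP, URI stem, URI query) is purely illustrative; real CDN access logs such as CloudFront's carry many more fields, so the indices would need adjusting to match the actual format.

```python
from urllib.parse import parse_qs

def parse_pixel_hit(log_line):
    """Parse one tab-separated access-log line into a hit record.

    Assumes an illustrative 5-field layout: date, time, client IP,
    URI stem, URI query. Real CDN logs (e.g. CloudFront's) have more
    fields; adjust the indices to match the actual log format.
    """
    fields = log_line.rstrip("\n").split("\t")
    date, time, client_ip, uri_stem, uri_query = fields[:5]
    # The query string carries the payload sent by the client-side tracker.
    params = {k: v[0] for k, v in parse_qs(uri_query).items()}
    return {
        "timestamp": f"{date} {time}",
        "ip": client_ip,
        "pixel": uri_stem,
        "payload": params,
    }

hit = parse_pixel_hit(
    "2024-01-15\t10:32:07\t203.0.113.9\t/pixel.gif\tuid=abc123&page=%2Fpricing&evt=view"
)
```

The essential idea is that everything the tracker knows about the visitor's click arrives URL-encoded in the query string, so downstream processing is largely query-string parsing at scale.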

Data collection engine: While the log servers provide transient storage of raw log data, the data collection engine provides a persistent data storage layer where raw log entries are processed into user-level datasets. In the Amazon cloud stack, the collection engine could be implemented using Amazon S3, which provides theoretically unlimited storage capacity, meaning developers do not have to worry about the collection engine running out of physical disk space as the number of website hits grows. Alternative implementation choices for the data collection engine include Hadoop HDFS or Spark, but these require technologies such as Apache Flume or Kafka (or similar variants) to stream the data from the log server into the collection engine.
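One common convention for laying out such a collection store is date-partitioned keys, so that the transformer can process one day's logs at a time. The sketch below uses the local filesystem as a stand-in for S3 (a real implementation would write the same key layout through the S3 API, e.g. via boto3); the `year=/month=/day=` scheme is an assumption, chosen because it is a widely used partitioning pattern.

```python
import datetime
import os
import tempfile

def archive_log_batch(raw_lines, root, dt):
    """Append a batch of raw log lines under a date-partitioned path.

    Local-filesystem stand-in for an S3 key scheme such as
    logs/year=YYYY/month=MM/day=DD/; a real implementation would issue
    the equivalent object writes against the cloud storage API.
    """
    partition = os.path.join(root, f"year={dt:%Y}", f"month={dt:%m}", f"day={dt:%d}")
    os.makedirs(partition, exist_ok=True)
    path = os.path.join(partition, "hits.log")
    with open(path, "a", encoding="utf-8") as f:
        f.writelines(line if line.endswith("\n") else line + "\n" for line in raw_lines)
    return path

root = tempfile.mkdtemp()
path = archive_log_batch(["hit1", "hit2\n"], root, datetime.date(2024, 1, 15))
```

Partitioning by date keeps each transformer run bounded and makes it cheap to reprocess a single day when business rules change.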

Transformer: This component performs two functions. First, it continually fetches data from the log server into the data collection engine using custom processing rules, which allows the log server to run indefinitely without exhausting disk space or memory. Second, the transformer processes the raw log data in the collection engine into final user-level data sets using custom business rules. For example, clients may define rules for how raw log data is sessionized, which log items to discard (e.g. bot traffic), and other processing rules such as organizing raw events into chronological order. Multiple technology options exist for implementing the transformer. For example, Apache Pig provides a scalable ETL platform that is ideal for processing files when the collection engine is implemented on HDFS. Other open-source ETL technologies such as Talend and Pentaho Kettle can be used when the collection engine is implemented on ordinary file systems or databases.
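The business rules mentioned above (bot filtering, chronological ordering, sessionization) can be sketched in a few lines of Python. The 30-minute inactivity cutoff and the user-agent substrings used to detect bots are illustrative stand-ins for whatever rules a client actually defines.

```python
from datetime import datetime, timedelta

BOT_MARKERS = ("bot", "spider", "crawler")  # illustrative bot filter
SESSION_TIMEOUT = timedelta(minutes=30)     # assumed inactivity cutoff

def sessionize(hits):
    """Group raw hits into per-user sessions.

    `hits` is a list of dicts with 'uid', 'ts' (datetime) and 'ua' keys.
    Bot traffic is discarded, hits are sorted chronologically, and a new
    session starts after 30 minutes of inactivity.
    """
    clean = [h for h in hits if not any(m in h["ua"].lower() for m in BOT_MARKERS)]
    sessions = {}
    for h in sorted(clean, key=lambda h: (h["uid"], h["ts"])):
        user = sessions.setdefault(h["uid"], [])
        if not user or h["ts"] - user[-1][-1]["ts"] > SESSION_TIMEOUT:
            user.append([])  # start a new session for this user
        user[-1].append(h)
    return sessions

t0 = datetime(2024, 1, 15, 10, 0)
sessions = sessionize([
    {"uid": "u1", "ts": t0, "ua": "Mozilla/5.0"},
    {"uid": "u1", "ts": t0 + timedelta(minutes=5), "ua": "Mozilla/5.0"},
    {"uid": "u1", "ts": t0 + timedelta(hours=2), "ua": "Mozilla/5.0"},
    {"uid": "u1", "ts": t0 + timedelta(hours=2, minutes=1), "ua": "Googlebot/2.1"},
])
```

In production the same logic would typically be expressed as a Pig script or an ETL job rather than in-memory Python, but the rules themselves are no more complicated than this.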

Storage engine: The data processed by the transformer is finally transferred to the storage engine, which is custom-built for specific analysis needs. RDBMS engines, data warehouses, and columnar databases are all possible technology options for implementing the storage engine. Within the Amazon stack, the RDS service can be used as a scalable RDBMS storage engine for both analytics and reporting use cases. Redshift provides options for implementing data warehouse schemas when the requirement is primarily drill-down reporting on large volumes of historical data. Other options include Amazon DynamoDB, or NoSQL databases such as Cassandra and MongoDB.
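As a small illustration of the RDBMS option, the sketch below uses an in-memory SQLite database as a local stand-in for an engine such as Amazon RDS. The `user_sessions` schema and the rollup query are hypothetical examples of the user-level shape the storage engine would serve to analysts.

```python
import sqlite3

# In-memory SQLite as a stand-in for an RDBMS engine such as Amazon RDS.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_sessions (
        uid           TEXT    NOT NULL,
        session_start TEXT    NOT NULL,
        page_views    INTEGER NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO user_sessions (uid, session_start, page_views) VALUES (?, ?, ?)",
    [
        ("u1", "2024-01-15 10:00:00", 4),
        ("u1", "2024-01-16 09:30:00", 2),
        ("u2", "2024-01-15 11:15:00", 7),
    ],
)
# A typical user-level rollup consumed by downstream modeling.
rows = conn.execute(
    "SELECT uid, COUNT(*) AS sessions, SUM(page_views) AS views "
    "FROM user_sessions GROUP BY uid ORDER BY uid"
).fetchall()
```

The same session-level table and user-level rollup translate directly to RDS or Redshift; only the connection layer and scale change.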

Client-side tracker: The final building block of our custom web analytics solution is the client-side tracker. This is a small piece of JavaScript code, downloaded to every visitor's browser, that records clickstream data. The tracker converts the clickstream data into a query string and appends it to a transparent image pixel request sent to the log server. When this pixel is downloaded from the log server, a ‘hit’ entry is created on the server and is eventually transferred to the collection engine via the transformer.
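The production tracker is JavaScript running in the browser; the Python sketch below only mirrors its encoding step, to show how clickstream data becomes a pixel-request URL. The endpoint `logs.example.com/pixel.gif` and the parameter names are hypothetical.

```python
from urllib.parse import urlencode

PIXEL_URL = "https://logs.example.com/pixel.gif"  # hypothetical log-server endpoint

def build_pixel_request(click_data):
    """Encode clickstream data as the query string of a pixel request.

    Python stand-in for what the JavaScript tracker does in the browser:
    serialize the event into URL parameters and append them to the
    transparent-image URL that the browser then downloads.
    """
    return f"{PIXEL_URL}?{urlencode(click_data)}"

url = build_pixel_request({"uid": "abc123", "page": "/pricing", "evt": "click"})
```

On the JavaScript side, the equivalent step is setting this URL as the `src` of a 1x1 image element, which causes the browser to issue the request and the log server to record the hit.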


The conceptual building blocks described above can be easily assembled into powerful visitor analytics solutions that significantly enhance a company's ability to analyze website behavioral data using user-level data sets. Using this modular approach, clients can build advanced big data applications that deliver bottom-line business benefits within the overall constraints of time, resource availability, security, and legal compliance.