The Internet of things hype everyone would be skeptical that the reality could ever deliver what is promised. We all have by now would have heard how GE is collecting Terabytes of data for each flight and now wished to sell engine service contracts instead of engines itself. This opportunity arises when GE is able to collect the performance of their machines in real time and this helps them plan predictive maintenance of their machines and hence think of selling the engine usage service plans instead of outright purchase.
All this is possible due to the IoT platform’s ability to perform real time analytics on the acquired data and extract useful insights, build a pipeline for performing scalable analytics with the volume and velocity of data associated with IoT systems.
Acquiring and storing your data
The availability of myriad protocols enable the acquisition of events from IoT devices easy. Device connects to the network using either BLE, GSM/LTE, Wi-Fi, LPWAN or a wired connection to send/receive messages to/from a gateway of some sort using a the various standard protocol defined. IoT applications have a variety of protocols for communication such as MQTT, CoAP, XMPP etc. MQTT seems to be the most popular of the lot. MQTT provides a lightweight method of carrying out messaging using a publish/subscribe model. This makes it suitable for “Internet of Things” messaging such as with low power sensors or mobile devices such as phones, embedded computers or micro-controllers like the Arduino
MQTT has a wide user base and lots of support available easily and most IoT implementations recommend it. In fact IoT platform support providers like AWS IoT require you to use MQQT. Mosquitto™ is an open source (EPL/EDL licensed) message broker that implements the MQTT protocol versions 3.1 and 3.1.1.
Regardless of which protocol we choose, the goal is to send and receive messages to and from the sensors based on events in a efficient and reliable manner. Once a message is received by a gateway such as Mosquitto, it can be pushed to analytics system as such AWS or Azure. It is always recommended to store the original source data before performing any operations on it as it help trace back issues that might arise later on.
For storing IoT data, we have several options like Hadoop and Hive, NoSQL document databases like Couchbase etc. Couchbase offers a nice combination of high-throughput, low-latency characteristics. It’s also a schema-less document database that supports high data volume along with the flexibility to add new event types easily. Writing data directly to HDFS is a viable option, too, particularly if we intend to use Hadoop and batch-oriented analysis as part of our analytics workflow.
For writing source data to a persistent store, we can either attach custom code directly to the message broker at the IoT protocol level (for example, the Mosquitto broker if using MQTT) or push messages to an intermediate messaging broker such as Apache Kafka and use different Kafka consumers for moving messages to different parts of our system. One proven pattern is to push messages to Kafka and two consumer groups on the topic, where one has consumers that write the raw data to your persistence store, while the other moves the data into a real-time stream processing engine like Apache Storm.
With Apache Storm we can simply wire a bolt into the topology that does nothing but write messages out to the persistent store. If we are using MQTT and Mosquitto, a convenient way to tie things together is to have the message delivered directly to an Apache Storm topology via the MQTT.
Preprocessing and transformations
Data from devices in their raw form are not necessarily suited for analytics. Data may be missing, requiring an enrichment step, or representations of values may need transformation (often true for date and timestamp fields).
This means you’ll frequently need a preprocessing step to manage enrichments and transformations. Again, there are multiple ways to structure this, but another best practice I’ve observed is the need to store the transformed data alongside the raw source data.
Now, you might think: “Why do that when I can always just transform it again if I need to replay something?” As it turns out, transformations and enrichments can be expensive operations and may add significant latency to the overall workflow. It’s best to avoid the need to rerun the transformations if you rerun a sequence of events.
Transformations can be handled several ways. If you are focused on batch mode analysis and are writing data to HDFS as your primary workflow, then Pig — possibly using custom user-defined functions — works well for this purpose. Be aware, however, that while Pig does the job, it’s not exactly designed to have low-latency characteristics. Running multiple Pig jobs in sequence will add a lot of latency to the workflow. A better option, even if you aren’t looking for “real-time analysis” per se, might be using Storm for only the preprocessing phase of the workflow.
Analytics for business insights
Once your data has been transformed into a suitable state and stored for future use, you can start dealing with analytics.
Apache Storm is explicitly designed for handling continuous streams of data in a scalable fashion, which is exactly what IoT systems tend to deliver. Storm excels at managing high-volume streams and performing operations over them, like event correlation, rolling metric calculations, and aggregate statistics. Of course, Storm also leaves the door open for you to implement any algorithm that may be required.
Our experience to date has been that Storm is an extremely good fit for working with streaming IoT data. Let’s look at how it can work as a key element of your analytics pipeline.
In Storm, by default “topologies” run forever, performing any calculation that you can code over a boundless stream of messages. Topologies can consist of any number of processing steps, aka bolts, which distribute over nodes in a cluster; Storm manages the message distribution for you. Bolts can maintain state as needed to perform “sliding window” calculations and other kinds of rolling metrics. A given bolt can also be stateless if it needs to look at only one event at a time (for example, a threshold trigger).
The calculated metrics in your Storm topology can then be used to suit your business requirements as you see fit. Some values may trigger a real-time notification using email or XMPP or update a real-time dashboard. Other values may be discarded immediately, while some may need to be stored in a persistent store. Depending on your application, you may actually find it makes sense to keep more than you throw away, even for “intermediate” values.
Why? Simply put, you have to “reap” data from any stateful bolts in a Storm topology eventually, unless you have infinite RAM and/or swap space available. You may eventually need to perform a “meta analysis” on those calculated, intermediate values. If you store them, you can achieve this without the need to replay the entire time window from the original source events.
How should you store the results of Storm calculations? To start with, understand that you can do anything in your bolts, including writing to a database. Defining a Storm topology that writes calculated metrics to a persistent store is as simple as adding code to the various bolts that connect to your database and pushing the resulting values to the store. Actually, to follow the separation-of-concerns principle, it would be better to add a new bolt to the topology downstream of the bolt that performs the calculations and give it sole responsibility for managing persistence.
Storm topologies are extremely flexible, giving you the ability to have any bolt send its output to any number of subsequent bolts. If you want to store the source data coming into the topology, this is as easy as wiring a persistence bolt directly to the spout (or spouts) in question. Since spouts can send data to multiple bolts, you can both store the source events and forward them to any number of subsequent processing steps.
For storing these results, you can use any database, but as noted above, we’ve found that Couchbase works well in these applications. The key point to choosing a database: You want to complement Storm — which has no native query/search facility and can store a limited amount of data in RAM — with a system that provides strong query and retrieval capabilities. Whatever database you choose, once your calculated metrics are stored, it should be straightforward to use the native query facilities in the database for generating reports. From here, you want the ability to utilize Tableau, BIRT, Pentaho, JasperReports, or similar tools to create any required reports or visualizations.
Storing data in this way also opens up the possibility of performing additional analytics at a later time using the tool of your choice. If one of your bolts pushes data into HDFS, you open up the possibility of employing entire swath of Hadoop-based tools for subsequent processing and analysis.
Building analytics solutions that can handle the scale of IoT systems isn’t easy, but the right technology stack makes the challenge less daunting. Choose wisely and you’ll be on your way to developing an analytics system that delivers valuable business insights from data generated by a swarm of IoT devices.