Month: March 2016

Developing a Fast and Big Data Acquisition System with Near Real-Time Analytics – Part 1


Introduction
This series of how-tos and case studies is the result of our team's work over the past 2+ years. We were developing a predictive analytics platform for a global truck OEM, to be integrated with their live OBU data, Warranty, Research & Design, Customer Support, CRM and DMS systems, among other things.
In this journey we have attempted to solve the problem in incremental steps. We are currently working on predictive analytics with learning workflows, so I believe it's time to write down the experience of building the other three incremental solutions.

  1. First Baby Step – Fast Data capture and Conventional Analytics
    • Kafka, Redis, PostgreSQL
  2. Next Logical Step – Big Data capture, Warehousing and Conventional Analytics
    • Kafka, Storm/Spark, Hadoop/Hive, Zookeeper
  3. The Bull's Eye – Real-Time Analytics on Big Data
    • Same as above, with Solr and Zeppelin
  4. The Holy Grail – Predictive Analytics
    • Same as above, with MLlib on Spark

Now, in this post I will write about “The First Baby Step”. It involves fast acquisition of data, real-time analytics and long-term data archival.
The disparate data sets and sources posed significant complexity, not to mention the myriad polling frequencies, sync models and EOD jobs. It goes without saying that the OEM had a significant investment in SAP infrastructure. We studied multiple architecture models (some are available in this Reference Architecture Model from Hortonworks and SAP).
The following were the considerations from the data perspective:

  1. Fast Data – Real-time telematics data from the OBU.
  2. Big Data – Diagnostics data from each truck, with 40+ parameters and an initial pilot of 7,500 trucks.
  3. Structured Data – Data from the Dealer Management System and the Customer Relationship Management System.
  4. Transactional Data – Data from the Warranty Management and Customer Support systems.

Fast Data: Our primary challenge for the first phase of design/development was scaling the data acquisition system to collect data from thousands of nodes, each of which polled 40 sensor readings once per second and transmitted them every 6 seconds, while maintaining the ability to query the data in real time for event detection. While each data record was only ~300 KB, our expected maximum sensor load indicated a collection rate of about 27 million records, or 22.5 GB, per hour. However, our primary issue was not data size, but data rate. A large number of inserts had to happen each second, and we were unable to buffer inserts into batches or transactions without incurring a delay in the real-time data stream.
When designing network applications, one must consider the two canonical I/O bottlenecks: network I/O and filesystem I/O. For our use case, we had little influence over network I/O speeds. We had no control over where our truck sensors would be at any given time, or over the bandwidth or network infrastructure of those locations (our OBUs communicated using GPRS on the GSM network). With network latency as a known variable, we focused on the bottleneck we could control: filesystem I/O. For the immediate collection problem, this meant evaluating databases to insert the data into as it was collected. While we initially attempted to collect the data in a relational database (PostgreSQL), we soon discovered that although PostgreSQL could potentially handle the number of inserts per second, it was unable to respond to read queries simultaneously. Simply put, we could not read data while we were collecting it, preventing us from doing any real-time analysis (or any analysis at all, for that matter, unless we stopped data collection).
The easiest way to avoid slowdowns due to disk operations is to avoid the disk altogether, so we mitigated this by leveraging Redis, an open-source in-memory NoSQL datastore. Redis keeps all data in RAM (or, in hybrid models, in flash storage such as an SSD), allowing lightning-fast reads and writes. With Redis, we were easily able to insert all of our collected data as it was transmitted from the sensor nodes, and to query the data simultaneously for event detection and analytics. In fact, we were also able to leverage Pub/Sub functionality on the same Redis server to publish notifications of detected events for transmission to event-driven workers, without any performance issues.
In addition to speed, Redis features advanced data structures, including Lists, Sets, Hashes, Geospatial indexes and Sorted Sets, rather than the somewhat limiting key/value pairs common to many NoSQL stores.

Available Data Structures in Redis

Sorted Sets proved to be an excellent data structure for modelling time-series data, by setting the score to the timestamp of a given data point. This automatically ordered our time series, even when data was inserted out of order, and allowed querying by timestamp, by timestamp range, or for the “most recent N” records (which are merely the last N values of the set).
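To make that concrete, here is a minimal sketch (not our production code) of storing and querying telemetry in a Sorted Set, assuming redis-py and a hypothetical one-set-per-truck key scheme:

```python
# A minimal sketch of the sorted-set approach, assuming redis-py 3.x and a
# hypothetical key scheme of one sorted set per truck ("telemetry:<truck_id>").
import json
import time

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def store_reading(truck_id, reading, ts=None):
    """Insert one telemetry record, scored by its epoch timestamp."""
    ts = ts if ts is not None else time.time()
    member = json.dumps({"ts": ts, **reading})
    r.zadd(f"telemetry:{truck_id}", {member: ts})

def readings_between(truck_id, start_ts, end_ts):
    """Query a timestamp range; Redis returns members ordered by score."""
    return [json.loads(m) for m in r.zrangebyscore(f"telemetry:{truck_id}", start_ts, end_ts)]

def latest_readings(truck_id, n=10):
    """The 'most recent N' records are simply the last N members of the set."""
    return [json.loads(m) for m in r.zrange(f"telemetry:{truck_id}", -n, -1)]
```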
Our use case also required us to archive data for a period of time, enabling business users to run historical analytics alongside data from the real-time source.
Enter data temperatures:
Data Temperatures

Hot Data – Data that is frequently accessed and is currently being polled/gathered.
Warm Data – Data that is no longer being polled but is still frequently used.
Cold Data – Data that is in warehouse mode, but can still be accessed for BI or analytics jobs with a bit of I/O overhead.
Since Redis keeps all data in RAM (the hot area), our Redis datastore could only hold as much data as the server had available RAM. Our data, inserted at 22.5 GB/hour, quickly outgrew this limitation. To scale this solution and archive our data for future analysis, we set up an automated migration script to push the oldest data in our Redis datastore to a PostgreSQL database with more storage scalability. As explained above, since our time series were already ordered in Redis Sorted Sets, extracting the oldest records for the load operation was simple enough.
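A simplified sketch of such a migration job is shown below, assuming redis-py, psycopg2, the per-truck key scheme above and a hypothetical telemetry_archive table; the real script's retention window and schema will differ:

```python
# A simplified sketch of the archival job, assuming the "telemetry:<truck_id>"
# sorted sets above and a hypothetical telemetry_archive(truck_id, ts, payload)
# table in PostgreSQL. Not the original script.
import time

import psycopg2
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
pg = psycopg2.connect(dbname="telemetry", user="archiver", password="secret", host="localhost")

RETENTION_SECONDS = 6 * 3600  # keep the most recent 6 hours hot in Redis (assumed)

def archive_truck(truck_id):
    cutoff = time.time() - RETENTION_SECONDS
    key = f"telemetry:{truck_id}"
    # Pull everything older than the cutoff, oldest first, with scores (timestamps).
    old = r.zrangebyscore(key, "-inf", cutoff, withscores=True)
    if not old:
        return 0
    rows = [(truck_id, ts, member.decode("utf-8")) for member, ts in old]
    with pg, pg.cursor() as cur:
        cur.executemany(
            "INSERT INTO telemetry_archive (truck_id, ts, payload) VALUES (%s, to_timestamp(%s), %s)",
            rows,
        )
    # Only trim Redis once the rows are safely committed to PostgreSQL.
    r.zremrangebyscore(key, "-inf", cutoff)
    return len(rows)
```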
The other consideration is the available RAM itself. The amount of data being queried, the CPU cycles consumed and the RAM used for processing all determine how much memory is left for the datastore. Be reminded that if the datastore is filled to the brim, your processing jobs will spill over to disk I/O, which is very bad.
We wrote a REST API as an interface to our two datastores, giving client applications a unified query interface without having to worry about which datastore a particular piece of data resided in. This web-service layer defined the standards for the time, range and parameter queries.
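As an illustration only (a hedged sketch, not our actual service, with the endpoint name and parameters assumed), such a unified endpoint could answer recent ranges from Redis and fall back to the PostgreSQL archive for older data:

```python
# Illustrative only: a minimal Flask endpoint that answers range queries from
# Redis when the range is still "hot" and falls back to the PostgreSQL archive
# otherwise. Endpoint name and parameters are assumptions, not the original API.
import json
import time

import psycopg2
import redis
from flask import Flask, jsonify, request

app = Flask(__name__)
r = redis.Redis()
pg = psycopg2.connect(dbname="telemetry", user="api", password="secret", host="localhost")
RETENTION_SECONDS = 6 * 3600  # must match the archival job

@app.route("/telemetry/<truck_id>")
def telemetry(truck_id):
    start = float(request.args.get("start"))
    end = float(request.args.get("end", time.time()))
    if start >= time.time() - RETENTION_SECONDS:
        members = r.zrangebyscore(f"telemetry:{truck_id}", start, end)
        records = [json.loads(m) for m in members]
    else:
        with pg.cursor() as cur:
            cur.execute(
                "SELECT payload FROM telemetry_archive "
                "WHERE truck_id = %s AND ts BETWEEN to_timestamp(%s) AND to_timestamp(%s)",
                (truck_id, start, end),
            )
            records = [json.loads(row[0]) for row in cur.fetchall()]
    return jsonify(records)
```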
Fast Data Architecture with Redis and Kafka
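For context, the ingestion path in this architecture can be as thin as a Kafka consumer that writes each incoming record straight into the per-truck Sorted Sets; the sketch below assumes kafka-python and a hypothetical truck-telemetry topic and message schema:

```python
# A hedged sketch of the ingestion path: consume telemetry records from a
# hypothetical "truck-telemetry" Kafka topic and write them into the per-truck
# Redis sorted sets. Topic name and message schema are assumptions.
import json

import redis
from kafka import KafkaConsumer

r = redis.Redis(host="localhost", port=6379, db=0)

consumer = KafkaConsumer(
    "truck-telemetry",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="latest",
)

for message in consumer:
    record = message.value  # e.g. {"truck_id": "T1042", "ts": 1457945460.0, ...}
    key = f"telemetry:{record['truck_id']}"
    # Score by the device timestamp so out-of-order arrivals still sort correctly.
    r.zadd(key, {json.dumps(record): record["ts"]})
```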

With the architecture represented above in place, generating automated event detection and real-time notifications was feasible, again through the use of Redis. Since Redis also offers Pub/Sub functionality, we were able to monitor incoming data in Redis using a small service and push noteworthy events to a notification channel on the same Redis server, from which subscribed SMTP workers could send out notifications in real time. This can even be channelled to an MQ/ESB or any asynchronous mechanism to initiate actions or reactions.
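A minimal sketch of that event path might look as follows, assuming redis-py; the channel name, event schema and SMTP details are placeholders:

```python
# A minimal sketch of the event path, assuming redis-py: a monitoring service
# publishes detected events to a Redis channel, and an SMTP worker subscribes
# to it. Channel name, event schema and SMTP details are assumptions.
import json
import smtplib
from email.message import EmailMessage

import redis

r = redis.Redis()
CHANNEL = "events:alerts"

def publish_event(truck_id, description):
    """Called by the monitoring service when a noteworthy reading is detected."""
    r.publish(CHANNEL, json.dumps({"truck_id": truck_id, "description": description}))

def run_smtp_worker():
    """Blocking subscriber that turns each published event into an e-mail."""
    pubsub = r.pubsub()
    pubsub.subscribe(CHANNEL)
    for item in pubsub.listen():
        if item["type"] != "message":
            continue
        event = json.loads(item["data"])
        msg = EmailMessage()
        msg["Subject"] = f"Alert for truck {event['truck_id']}"
        msg["From"] = "alerts@example.com"
        msg["To"] = "fleet-ops@example.com"
        msg.set_content(event["description"])
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)
```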
 
Our experience shows Kafka and Redis to be powerful tools for Big Data applications, specifically for high-throughput data collection. The benefits of Kafka as a collection mechanism, coupled with in-memory data storage using Redis and data migration to a deep analytics platform such as a relational database or even Hadoop's HDFS, yield a powerful and versatile architecture suitable for many Big Data applications.
After we implemented HDFS and Spark in phases 2–3 of this roadmap, we of course configured Redis in the same role. I hope I have covered enough of the first step in our Big Data journey; I will write an article per week on the other three phases we have implemented successfully.

Discovery that could make Quantum Computers Practically viable.


A major stumbling block that has kept quantum computers in the realm of science fiction is the fact that “quantum bits”, or “qubits”, and the building blocks from which they are made are prone to magnetic disturbances. This “noise” can interfere with the work qubits do, but on Wednesday scientists announced a new discovery that could help solve the problem.
They made this possible by tapping the same principle that allows atomic clocks to stay accurate. Researchers at Florida State University’s National High Magnetic Field Laboratory (MagLab) have found a way to give qubits the equivalent of a pair of noise-canceling headphones.
The approach relies on what are known as atomic clock transitions. Working with carefully designed tungsten oxide molecules that contained a single magnetic holmium ion, the MagLab team was able to keep a holmium qubit working coherently for 8.4 microseconds – potentially long enough for it to perform useful computational tasks.
By offering exponential performance gains, quantum computers could have enormous implications for cryptography and computational chemistry, among many other fields.

MagLab’s new discovery could put all this potential within much closer reach, but don’t get too excited yet — a lot still has to happen. Next, researchers need to take the same or similar molecules and integrate them into devices that allow manipulation and read-out of an individual molecule, Stephen Hill, director of the MagLab’s Electron Magnetic Resonance Facility, said by email.
“The good news is that parallel work by other groups has demonstrated that one can do this, although with molecules that do not have clock transitions,” Hill said. “So it should be feasible to take the molecule we have studied and integrate it into a single-molecule device.”
After that, the next step will be to come up with schemes involving multiple qubits that make it possible to address them individually and to switch the coupling between them on and off, so that quantum logic operations can be implemented, he said.
That’s still in the future, “but it is this same issue of scalability that researchers working on other potential qubit systems are currently facing,” he added.
Magnetic molecules hold particular promise there because the chemistry allows self-assembly into larger molecules or arrays on surfaces, Hill explained. Those, in turn, could form the basis for a working device.

Organisers of Brazil Protest use Analytics to Measure Attendance


Organizers of yesterday’s massive demonstration in São Paulo against the Brazilian government have employed an analytics tool to get accurate attendance data.
Opposition group Movimento Brasil Livre (MBL) was offered the technology by Israeli startup StoreSmarts for free through its Brazilian distributor SmartLok in exchange for the marketing exposure linked to the anti-government demo.
The technology used in the protest is readily available and has been in use for at least three years now. It is a combination of a portable router and an application usually employed by retailers to monitor, analyze and provide insights on shopper behavior by detecting WiFi signals from mobile devices in a designated area.
In order to estimate the number of people in any given area, the system only takes smartphones into account, ignoring other WiFi signals from devices such as laptops or routers. The calculations are carried out in real time, so the system's web dashboard can also provide insight into the peak hours of the protests.
By calculating each device's received signal strength indication (RSSI), the system can also tell how long the smartphone – and therefore its owner – spent in the area being mapped. However, the system does not track or store data on individual users.
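As a rough illustration of the general technique (not StoreSmarts' product, whose internals are not published), headcount and dwell time can be derived by deduplicating the WiFi sightings the router records per device:

```python
# An illustrative sketch of the general technique (not StoreSmarts' product):
# deduplicate WiFi sightings by device MAC, keep first/last sighting per
# device, and derive headcount and dwell time. Data layout is an assumption.

def summarise(sightings, min_rssi=-80):
    """sightings: iterable of (mac, timestamp, rssi) tuples captured by the router.

    Devices with very weak RSSI are assumed to be outside the mapped area.
    Returns (estimated_headcount, {mac: dwell_seconds}).
    """
    first_seen, last_seen = {}, {}
    for mac, ts, rssi in sightings:
        if rssi < min_rssi:
            continue  # too far away to count
        first_seen.setdefault(mac, ts)
        last_seen[mac] = max(last_seen.get(mac, ts), ts)
    dwell = {mac: last_seen[mac] - first_seen[mac] for mac in first_seen}
    return len(dwell), dwell

# Example: one device seen for 30 minutes, another seen only once.
count, dwell = summarise([
    ("aa:bb:cc:dd:ee:01", 0, -60),
    ("aa:bb:cc:dd:ee:01", 1800, -62),
    ("aa:bb:cc:dd:ee:02", 600, -70),
])
print(count, dwell)
```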
Typically, protest organizers in Brazil, like their counterparts across the world, have to rely on attendance data provided by the local authorities and large media organisations, and these media organisations themselves rely on local bodies. Those numbers are often believed to be inaccurate for political reasons; the StoreSmarts system suggests that 1.4 million people attended yesterday's demonstration, a number that matches the figure provided by the local police.
When asked why it is interesting to provide the technology free of charge, the startup's founder said that his Brazilian partner has been piloting StoreSmarts' analytics tool with some retailers in São Paulo, so getting the extra attention is helpful.
“We believe in taking data driven decisions, whether it’s politics or retail. The exposure we get by supporting such requests is very important for us and our partner, as we see Brazil as a very important market,” Eliyahu says.

Joint Europe-Russian Probe launched for Mars


ExoMars 2016 liftoff
A joint European-Russian mission aiming to search for traces of life on Mars left Earth’s orbit Monday at the start of a seven-month unmanned journey to the Red Planet.
The Proton rocket carrying the Trace Gas Orbiter (TGO), which will examine Mars' atmosphere, and a descent module that will conduct a test landing on the surface was launched from the Russian-operated Baikonur cosmodrome in the Kazakh steppe at 09:31 GMT on 14 March 2016.
The ExoMars 2016 mission, a collaboration between ESA and its Russian counterpart Roscosmos, is the first part of a two-phase exploration aiming to answer questions about the existence of life on Earth's neighbour.
The TGO will examine methane around Mars while the lander, Schiaparelli, will detach and descend to the surface of the fourth planet from the Sun.
The landing of the module on Mars is designed as a trial run ahead of the planned second stage of the mission in 2018 that will see the first European rover land on the surface to drill for signs of life, although problems with financing mean it could be delayed.
One key goal of the TGO is to analyse methane, a gas which on Earth is created in large part by living microbes, and traces of which were observed by previous Mars missions.
“TGO will be like a big nose in space,” said Jorge Vago, ExoMars project scientist.
Methane, ESA said, is normally destroyed by ultraviolet radiation within a few hundred years, which implied that in Mars’ case “it must still be produced today”.
TGO will analyse Mars’ methane in more detail than any previous mission, said ESA, in order to try to determine its likely origin.
One component of TGO, a neutron detector called FREND, can help provide improved mapping of potential water resources on Mars, amid growing evidence the planet once had as much if not more water than Earth.
A better insight into water on Mars could aid scientists’ understanding of how the Earth might cope in conditions of increased drought.
Schiaparelli, in turn, will spend several days measuring climatic conditions including seasonal dust storms on the Red Planet while serving as a test lander ahead of the rover’s anticipated arrival. The module takes its name from 19th century Italian astronomer Giovanni Schiaparelli whose discovery of “canals” on Mars caused people to believe, for a while, that there was intelligent life on our neighbouring planet.
The ExoMars spacecraft was built and designed by Franco-Italian contractor Thales Alenia Space.
For more updates on ExoMars, head to the ESA ExoMars Update site.
 

Fitting Big Data into Enterprise IT with SAP's HANA Vora


SAP has introduced a new technology, dubbed HANA Vora, that almost epitomizes the idea that Big Data and BI are complementary. Vora melds Big Data technologies like Hadoop and Spark with the original SAP HANA and downstream sources like SAP BW, BusinessObjects and ERP. In the process, it brings BI-style dimensional (drill-down) analysis into the Big Data world.
But with our experience building these so-called “Big Data enabled BI” applications for many of the manufacturing industry's leaders, we have not come across a single enterprise that could readily implement HANA, despite the fact that many of them had one or more SAP components somewhere in their enterprise IT.
HANA Vora is based on the combination of Apache Spark and Hadoop 2.0/YARN. It then provides connectivity to the original SAP HANA, premised on push-down query delegation. It also layers in Spark SQL enhancements to handle hierarchical queries and a pre-compiled query facility comparable to what relational databases and data warehouses have had for years.
Essentially, Vora federates “data lakes” with Enterprise systems of record and does so without incurring the costs of data movement (since “classic” HANA executes its own queries). Further, it provides for the definition of dimensional hierarchies and the ability to use them in analytical queries against all the data that Vora can address.
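As a rough illustration of the federation idea, the sketch below uses plain Spark SQL and a generic JDBC read rather than Vora's own connector (whose API is not shown here); the paths, table names, credentials and HANA JDBC URL are assumptions:

```python
# A generic Spark sketch of the federation idea, NOT the Vora API: join a
# data-lake table (Parquet on HDFS) with a system-of-record table reached over
# JDBC. Paths, table names and credentials are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-federation-sketch").getOrCreate()

# "Data lake" side: raw facts landed in HDFS.
lake_facts = spark.read.parquet("hdfs:///lake/sales_facts")

# "System of record" side: a dimension table served by the enterprise database.
dealer_dim = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sap://hana-host:30015/")  # hypothetical HANA JDBC URL
    .option("dbtable", "ERP.DEALER_DIM")
    .option("user", "spark_reader")
    .option("password", "secret")
    .load()
)

# Drill-down style aggregation across both worlds.
report = (
    lake_facts.join(dealer_dim, "dealer_id")
    .groupBy("region", "dealer_name")
    .sum("net_value")
)
report.show()
```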
Vora requires no dedicated hardware infrastructure, as it co-locates on the cluster nodes on which Hadoop and Spark are themselves deployed. Clearly, if you are going to integrate Vora with classic HANA, the latter will need its own infrastructure; but Vora can also be used on a standalone basis with no additional hardware requirements. This is a key consideration for organisations taking their first dip into data lakes.
Vora could end up being a very sensible way for SAP customers to move forward with Hadoop, Spark and Big Data in general. And since Vora is a commercial software offering from SAP, and not an open source offering, it fits with SAP’s existing business model, rather than requiring the company to change gears in some contrived manner.
HANA Vora hybridizes on many levels: Big Data with BI; startup technology with established Enterprise software; data lakes with vetted systems of record; and, finally, in-memory and disk-based storage and processing.

The New Quantum Computer from MIT Could render Encryption Obsolete


MIT has developed a new quantum computer with five atoms. Yes, you read that right: five atoms. Before venturing into the prophecy of impending doom due to the obsolescence of encryption, here is a link that might help you understand what a quantum computer is.
http://computer.howstuffworks.com/quantum-computer.htm

An experimental computer made by a Canadian company has proved its ability to solve increasingly complex mathematical problems. But is it quantum mechanics?

Qubits can simultaneously be both “HIGH” and “LOW”, which greatly reduces the number of clock cycles, or time, required to perform an operation such as calculating the prime factors that underpin most encryption. It typically takes about 12 qubits to factor the number 15, but researchers at MIT and the University of Innsbruck in Austria have found a way to pare that down to five qubits, each represented by a single atom, they said this week.
Construction:
Using laser pulses to keep the quantum system stable by holding the atoms in an ion trap, the new system also promises scalability, as more atoms and lasers can be added to build a bigger and faster quantum computer able to factor much larger numbers. That, in turn, presents new risks for factorization-based cryptography such as RSA, used for protecting credit cards, state secrets and other confidential data.
The development is in many ways touted to be an answer to a challenge posed way back in 1994, when MIT professor Peter Shor came up with a quantum algorithm that calculates the prime factors of a large number with much better efficiency than a classical computer. Fifteen is the smallest number that can meaningfully demonstrate Shor’s algorithm. Without any prior knowledge of the answers, the new system returned the correct factors with a confidence better than 99 percent.
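For a sense of why 15 is the canonical demo, here is the classical post-processing step of Shor's algorithm worked through for N = 15; the quantum hardware's job is to find the period r, which the sketch below brute-forces purely for illustration:

```python
# Classical post-processing of Shor's algorithm for N = 15, illustration only.
# The quantum computer's job is to find the period r of a^x mod N; here we
# brute-force it classically, then recover the factors exactly as Shor does.
from math import gcd

N = 15
a = 7  # any a with gcd(a, N) == 1 will do

# Find the period r: the smallest r > 0 with a^r mod N == 1.  (7^4 = 2401 = 1 mod 15)
r = next(x for x in range(1, N) if pow(a, x, N) == 1)

assert r % 2 == 0, "odd period: pick a different a"
# The factors are gcd(a^(r/2) - 1, N) and gcd(a^(r/2) + 1, N):
# gcd(48, 15) = 3 and gcd(50, 15) = 5.
p = gcd(pow(a, r // 2) - 1, N)
q = gcd(pow(a, r // 2) + 1, N)
print(r, p, q)  # 4 3 5
```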
From the Researchers:
“We show that Shor’s algorithm, the most complex quantum algorithm known to date, is realizable in a way where, yes, all you have to do is go in the lab, apply more technology, and you should be able to make a bigger quantum computer,” said Isaac Chuang, professor of physics and professor of electrical engineering and computer science at MIT. “It might still cost an enormous amount of money to build — you won’t be building a quantum computer and putting it on your desktop anytime soon — but now it’s much more an engineering effort, and not a basic physics question,” Chuang added.
The results of the new work were published Friday in the journal Science.
This is a really interesting development. Let us wait and see how it progresses.
 

Red Hat and Eurotech team up to deliver IoT solution framework.


Italy-based Eurotech offers machine-to-machine platforms and other IoT products. Red Hat plans to combine its open-source Red Hat Enterprise Linux and Red Hat JBoss middleware with Eurotech’s Everyware Software Framework and Eurotech Everyware Cloud to create an end-to-end architecture for IoT. This will let enterprises integrate operational data from computing equipment at the edge of the network with cloud-based back-end services.
Enterprise IoT needs computing capability at the edges of networks so companies don’t have to ship masses of data to the cloud for real-time processing. Instead, data aggregation and transformation, plus data integration and routing, can take place close to the operational devices.
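As a generic illustration of that edge pattern (not the Red Hat/Eurotech stack itself), a gateway can aggregate raw sensor readings over a short window and forward only the summary upstream; the endpoint and payload shape below are assumptions:

```python
# A generic illustration of edge-side aggregation (not the Red Hat/Eurotech
# stack): average each sensor over a short window at the gateway and forward
# only the summary to the cloud back end. URL and payload shape are assumptions.
import json
import time
from statistics import mean
from urllib import request

WINDOW_SECONDS = 60
CLOUD_ENDPOINT = "https://cloud.example.com/ingest"  # hypothetical back end

def aggregate_and_forward(read_sensor):
    """read_sensor() returns a dict of {sensor_name: value} once per call."""
    window, start = [], time.time()
    while True:
        window.append(read_sensor())
        if time.time() - start >= WINDOW_SECONDS:
            summary = {name: mean(s[name] for s in window) for name in window[0]}
            body = json.dumps({"ts": time.time(), "summary": summary}).encode("utf-8")
            req = request.Request(CLOUD_ENDPOINT, data=body,
                                  headers={"Content-Type": "application/json"})
            request.urlopen(req)  # ship only the aggregate upstream
            window, start = [], time.time()
        time.sleep(1)
```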
However, for the foreseeable future most IoT projects will be heavily customized, so vertical industry expertise will remain more critical than horizontal solutions.
 
 
