While there is endless debate about what exactly constitutes a Smart City, one fact that is indisputable is that all Smart Cities generate data – lots and lots of it! This is both an opportunity and a challenge. The availability of real-time data from sensors and IT/OT systems across the city gives us the opportunity to derive new insights and to improve how city systems operate. However, the sheer amount of data poses problems, and the cost of generating, transmitting, processing and storing it all may outweigh its value.
Data, data, data…
As more and more devices across the city are connected to the Internet of Things (IoT), the amount of data generated by sensors and systems across the city grows exponentially. Every time a traffic-light changes, a vehicle licence-plate is read, a car is parked, a waste-bin is filled, a street-light is turned on or a bus leaves a bus-stop, an event is generated with data describing what happened. Sensors are proliferating across the city to continuously monitor air quality, light, weather, noise pollution, road traffic, and every other aspect of the city environment. Retail systems, smart meters, public transportation systems, e-government systems and a myriad of other IT and OT systems across the city are continually generating more and more data. And that is before we add autonomous vehicles into the mix! The modern city has more data than it knows what to do with.
The 4 Vs
Big data is commonly characterised along 4 dimensions: volume, variety, velocity and veracity. Each of these poses serious challenges for traditional city data systems and processes. The volume of data in a modern connected city is growing exponentially. Whereas traditional data systems are based on structured tabular data, there is huge variety in the data we receive from modern systems, and much of it is unstructured. As network speeds and bandwidth increase, the velocity at which we receive data is increasing all the time. And not all the data we receive is reliable, so veracity is a growing concern.
Traditional Data Systems
Traditional RDBMS databases and nicely structured 3-tier ERP systems were not designed for the data that is now being generated in smart and connected cities. The SQL databases and IT systems that are used to process statistical data in application silos cannot cope with the 4 Vs above, nor can they deliver the instantaneous results and analytics that are required to operate a modern smart city.
The compromise adopted in most current smart city platforms is that the data is normalised and aggregated to shoe-horn it into our traditional way of storing data. This addresses the 4 challenges above – the data is normalised to reduce variety, it is filtered and sampled to reduce the velocity, it is cleansed to remove anomalies, and it is aggregated to reduce the volume. However, in doing so we also lose a lot of information that may be present in the raw data. By storing only the average value of a sensor for a 15-minute period we may lose vital information about the peak values during that time. By normalising all our sensors to the same model we may lose unique information from specific sensors that doesn't fit our model.
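To make the information loss concrete, here is a minimal sketch with made-up readings from a hypothetical air-quality sensor: the 15-minute average looks unremarkable, while the raw samples contain a pollution spike that aggregation erases entirely.

```python
# Hypothetical raw readings (e.g. pollutant ppm) sampled over a
# 15-minute window. The 118 is a brief spike; all values are invented.
raw_readings = [41, 43, 42, 118, 44, 40]

# Traditional approach: store only the windowed average.
average = sum(raw_readings) / len(raw_readings)

# The raw data still holds the peak that the average hides.
peak = max(raw_readings)

print(round(average, 1))  # 54.7 - looks like a mild reading
print(peak)               # 118  - the spike, lost if only the average is kept
```

A system that stored only the 54.7 average could never reconstruct the 118 spike afterwards, which is exactly the kind of signal an air-quality alert would need.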
Traditional data systems compromise by storing pre-processed, aggregated data, baking in assumptions about what is important. This approach is too inflexible for modern smart cities, so we need a new approach. Enter Big Data…
Big Data Platforms
This change in emphasis from structured and condensed data to raw data with a more complex structure requires a major paradigm shift. We can't expect to store all our smart city data as normalised models in a relational database, because real-world data is often irregular. A centralised SQL database cannot cope with the volumes of data generated by a smart and connected city. Similarly, we need a new approach to processing this data if we are to extract its hidden information.
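The "store raw now, decide what matters later" idea is sometimes called schema-on-read. A minimal sketch, using two invented events with different shapes: rather than forcing both into one fixed table (and dropping the fields that don't fit), each event is stored verbatim as JSON and interpreted at processing time.

```python
import json

# Two hypothetical events with different shapes; forcing both into a
# single fixed relational schema would drop the fields unique to each.
events = [
    {"type": "air_quality", "pm25": 12.4, "pm10": 20.1},
    {"type": "parking", "bay": "A7", "occupied": True, "plate_read": False},
]

# Schema-on-read: persist the raw documents unaltered...
stored = [json.dumps(event) for event in events]

# ...and let each processing application decide later which fields matter.
recovered = [json.loads(doc) for doc in stored]
assert recovered == events  # nothing was lost in storage
```

No information is discarded at ingestion time, so a future application can still use `plate_read` or `pm10` even if today's dashboards ignore them.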
Fortunately these problems are not unique to smart cities. The term 'big data' evolved in the early 2000s to articulate the challenges posed by the 4 Vs above. Data warehouse platforms have been developed in response to these problems to store, manage and process vast amounts of data in a distributed and scalable repository. Data from sensors and real-time systems can be streamed directly into the data warehouse using standard stream-processing platforms such as Kafka, where it can be combined and correlated with data from other internal and external data sources. Unstructured data can be stored unaltered in a Hadoop Distributed File System (HDFS) to preserve data integrity, while more structured data can be stored in a distributed, column-oriented database such as HBase. Because data storage is distributed, data processing must also be distributed for efficient analysis, so tools such as Hadoop MapReduce and Apache Spark have been developed to take advantage of the locality of data, processing it where it is stored. The ultimate goal of the big data platform is to generate meaningful information from this data, so the data warehouse also provides tools for efficiently analysing and visualising the data to extract timely business intelligence, and to generate predictive analytics.
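The processing pattern behind Hadoop MapReduce and Spark can be sketched locally in a few lines. This is only an illustration of the map → shuffle → reduce idea on invented sensor readings, not real Hadoop or Spark API calls: map each raw event to a (key, value) pair, shuffle the pairs together by key, then reduce each group to a result – in a real cluster each phase runs in parallel on the node where the data lives.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical raw events: (sensor_id, reading) pairs from across the city.
events = [
    ("sensor-a", 41), ("sensor-b", 7), ("sensor-a", 118),
    ("sensor-b", 9), ("sensor-a", 44),
]

# Map phase: events are already (key, value) pairs here.
# Shuffle phase: bring all values for the same key together.
shuffled = groupby(sorted(events, key=itemgetter(0)), key=itemgetter(0))

# Reduce phase: collapse each key's values to one result (the peak reading).
peaks = {key: max(value for _, value in group) for key, group in shuffled}

print(peaks)  # {'sensor-a': 118, 'sensor-b': 9}
```

In a distributed setting the sort/group step is the expensive network shuffle, which is why the map and reduce functions are pushed out to run where the data is stored.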
Big Data – Smart City
If data is the lifeblood of a smart city then the data warehouse is surely its heart. It provides a flexible, scalable storage platform that can consume everything from periodic air-quality readings to real-time tweets. Data can be stored in its raw format so that no information is lost, and decisions about what is important can be taken by the processing application. The platform can be scaled as the city ecosystem grows to include future data sources such as autonomous vehicles and robotic systems. Most importantly, it provides a platform for deep learning and predictive analytics to make our cities truly smart.