Data science of the connected vehicle: perspectives, applications and trends

The application of data science to streaming data from vehicles is an emerging field. Here we review general trends and some specific examples of relevant data feeds and applications where data science can deliver value.

By Boris Savkovic


The application of streaming and real-time data analytics to connected vehicles is gaining traction around the world. Underpinning this shift is the increased availability of data from vehicles and from the underlying transport infrastructure. In addition to the streaming nature of the data, these rich data feeds also constitute a highly distributed (spatial extent) and highly dynamic (temporal extent) IoT grid. These considerations present enormous challenges but also many opportunities.

In this article we highlight some emerging trends in relation to connected vehicles, and the critical role that data science has to play in this evolving segment.

The Connected Vehicle

The “Connected Vehicle” is any vehicle that is able to communicate with the cloud and/or the transport infrastructure, and broadcast relevant information (e.g. vehicle on-board sensor data) into the cloud (Figure 1). The vehicle itself can be equipped with the relevant capability (e.g. a SIM card embedded within the vehicle), or the capability can be provided by pairing the vehicle with a mobile phone or an embedded device that can read the sensor data and broadcast it (e.g. a dongle that plugs into the vehicle's On-Board Diagnostics port, i.e. the OBD port).

The connected vehicle

Fig 1: The Connected Vehicle. The vehicle is able to communicate with the cloud and/or the infrastructure.

Vehicle data feed

Fig 2: Data feeds from the connected vehicle and use-cases. Panel A shows a high-resolution voltage trace from the car battery which is fed into a boosting predictive model that assesses the state of the battery (i.e. is the battery OK or about to fail?). Panel B shows real-time traffic conditions created from millions of GPS points obtained from probe vehicles (red indicates severe congestion, yellow moderate congestion and green uncongested road segments). Panel C shows accelerometer traces (for the three spatial axes, x/y/z) that can be analysed for signatures that signify crashes or abnormal events.

Data of value that can be extracted from the vehicle (assuming presence of the relevant on-board sensors) and potential use-cases are shown in Figure 2 and include:

  • GPS location and speed. Use-cases for this data include real-time congestion reporting and forecasting based on GPS traces. At Intelematics, we employ data from millions of GPS points to provide a real-time traffic service to drivers in Australia, known as the “SUNA Traffic Channel”.
  • High-resolution (i.e. > 100 samples per second) vehicle-internal bus voltages, including the voltage/current of the car battery during the ignition event (i.e. when the driver turns the key). Use-cases here include predictive models that forecast battery failure and automate the process of offering battery replacements to motorists before the actual battery failure event.
  • High-resolution accelerometer and gyroscope data, which can be leveraged to automatically detect accidents and abnormal driving behaviours with predictive models that look for patterns that signify a crash and/or other abnormalities.
  • Vehicle-internal error codes that signify faults (diagnostic trouble codes, DTCs). These can pinpoint individual faults with individual components as well as be aggregated, as part of a classification model, to identify higher-level faults.
  • Radar sensors and dashboard video cameras that broadcast information about road conditions, parking conditions and road signage back to a central provider for analysis and broadcasting.
  • Real-time traffic light and road intersection conditions/states (Figure 3). Use-cases here include traffic light states being broadcast directly into the vehicle and predictive models that forecast traffic light conditions to maximise the number of cars arriving at intersections when the traffic light is green. In the future, this information could also be fed into autonomous vehicles to optimise both inter- and intra-vehicle coordination.
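As a rough illustration of the accelerometer use-case above, the sketch below flags candidate crash events in an x/y/z trace. It is a plain threshold rule, a much simpler stand-in for the predictive models mentioned in the list. The function names, the 4 g threshold and the three-sample minimum duration are all assumptions chosen for illustration, not values used in any production system.

```python
import math
from typing import List, Sequence, Tuple

G = 9.81  # standard gravity in m/s^2, used to express magnitudes in g

Sample = Tuple[float, float, float]  # one (x, y, z) reading in m/s^2


def acceleration_magnitudes(samples: Sequence[Sample]) -> List[float]:
    """Return the Euclidean norm of each (x, y, z) sample, expressed in g."""
    return [math.sqrt(x * x + y * y + z * z) / G for x, y, z in samples]


def find_crash_candidates(
    samples: Sequence[Sample],
    threshold_g: float = 4.0,   # hypothetical threshold, for illustration only
    min_consecutive: int = 3,   # hypothetical minimum run length, in samples
) -> List[int]:
    """Return the start index of every run of at least `min_consecutive`
    samples whose magnitude exceeds `threshold_g`."""
    magnitudes = acceleration_magnitudes(samples)
    candidates: List[int] = []
    run_start = None
    for i, magnitude in enumerate(magnitudes):
        if magnitude > threshold_g:
            if run_start is None:
                run_start = i  # a new run above the threshold begins here
        else:
            if run_start is not None and i - run_start >= min_consecutive:
                candidates.append(run_start)
            run_start = None
    # Handle a run that is still open when the trace ends.
    if run_start is not None and len(magnitudes) - run_start >= min_consecutive:
        candidates.append(run_start)
    return candidates
```

Requiring several consecutive samples above the threshold is one simple way to ignore single-sample spikes such as sensor glitches. A real system would more likely feed windows of the trace into a trained classifier, as described above.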

Intersection data

Fig 3: Data from an intersection. Shown are the vehicle counts per traffic light cycle (the time series in red) as well as the congestion levels (the time series in blue, above) over a 4-day window and for the indicated direction.

Technical Challenges

There are a number of technical challenges when it comes to leveraging the above data streams as part of a predictive analytics framework:

  • The complexity of dealing with geo-spatial and temporal data, as well as non-linearities, which requires advanced spatial visualisations and predictive models.
  • The streaming nature of the data. This necessitates the development of robust and distributed pipelines (e.g. Kafka, Storm, Apache Spark etc.) that can accommodate the velocity, variety and volume of data. This is in addition to the requirement for robust database systems that can offer reasonable read/write performance and scalability.
  • The event-driven nature of the data (i.e. data is only broadcast when an event/trigger occurs, e.g. car ignition, crash events etc.). This presents challenges in terms of how the data should be processed. Options include processing each event individually or processing data as a batch/micro-batch process.
  • The variety, volume and velocity of the data streams, combined with the diversity of sensors and data providers, necessitate the development of standardised formats and technologies. This creates significant data ingestion and data warehousing challenges.
  • Dealing with Geographical Information Systems (GIS) map-bases and the plethora of different geo-spatial standards. Most data has to be projected onto a map-base to be usable and to add context to the raw data.
  • The volume and velocity of data from the embedded devices present significant challenges in terms of data processing and bandwidth. This aspect may be tackled by performing data science processing at the edge, i.e. extracting only the salient features of interest on the embedded device before sending the data to the cloud for processing by more computationally intensive algorithms.
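The edge-processing idea in the last point can be sketched as follows: instead of uploading a full high-resolution battery voltage trace, the device reduces it to a handful of summary features. The function name and the particular feature set are hypothetical and serve only to show the shape of the approach. Which features actually matter for battery-failure prediction is not specified here.

```python
from statistics import mean
from typing import Dict, Sequence


def summarise_voltage_trace(
    voltages: Sequence[float], sample_rate_hz: float
) -> Dict[str, float]:
    """Reduce a raw voltage trace to a few summary features.

    The feature set below is an illustrative assumption, not a
    validated set of battery-health indicators.
    """
    if not voltages:
        raise ValueError("voltage trace is empty")
    if sample_rate_hz <= 0:
        raise ValueError("sample_rate_hz must be positive")

    trace = list(voltages)
    resting_v = trace[0]           # voltage at the start of the trace
    min_v = min(trace)             # deepest dip during the event
    min_index = trace.index(min_v)

    return {
        "resting_v": resting_v,
        "min_v": min_v,
        "drop_v": resting_v - min_v,
        "time_to_min_s": min_index / sample_rate_hz,
        "mean_v": mean(trace),
        "n_samples": float(len(trace)),
    }
```

A trace of several hundred samples collapses to six numbers, which is the bandwidth saving the bullet describes. The trade-off is that the cloud-side models only see what the edge code chose to keep, so the feature set has to be designed with those models in mind.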


The rich data streams from Connected Vehicles offer great potential in terms of being able to deliver value for drivers, as well as in terms of creating the smart city of the future.

We at Intelematics believe that this field will be fertile ground for some time for the application of data science and data analytics as cities, vehicles and the connected infrastructure become more embedded into our day-to-day lives.

Bio: Boris Savkovic is currently the data science lead at Intelematics, a leading provider of connected mobility services. Before that, Boris was the lead data scientist at BuildingIQ (venture-backed by Siemens and Schneider Electric) where he led the development of machine learning algorithms for the optimisation of energy usage in large-scale buildings (skyscrapers, hospitals etc), from the early stage through to a successful IPO. Boris holds a PhD in Applied Mathematics from UNSW. Boris’ experience spans the Australian, US and European markets.