As we have seen throughout the course, data plays a critical role in our society and enables us to understand the world around us. Over the past decades, the explosion of the internet and Web 2.0 services, together with mobile devices and sensors, has led to the creation of massive data sets.
The combination of this “growing torrent” of generated data and the availability of on-demand computing technologies (such as cloud computing) has led to the concept of big data, which refers to data that exceeds the processing capacity of conventional database systems.
Big data definitions
Big data is usually defined as "large amounts of data produced very quickly by a high number of diverse sources".
Definitions of big data are subjective about how large a dataset must be to count as big data: there is no reference to a specific number of bytes, even though bytes (gigabytes, for example) are how we usually measure data. With technology advancing fast and more and more devices connecting to the internet, the amount of data being created keeps increasing.
The size of the datasets that qualify as big data might also increase over time. Moreover, what is “big” for one organisation, sector or country may be small for another – think of Apple compared to a small business, or Portugal compared to China.
Your digital footprint
Almost every action we take today leaves a digital trail. We generate data whenever we carry our sensor-equipped smartphones, search for something online, communicate with family or friends using social media or chat applications, or shop. We leave digital footprints with every digital action, sometimes without even being aware of it or intending to.
Have you wondered how companies like Amazon, Spotify or Netflix know what “you might also like”? Recommendation engines are a common application of big data. Amazon, Netflix and Spotify use algorithms based on big data to make specific recommendations drawn from your preferences and historical behaviour. Siri and Alexa rely on big data to answer the variety of questions users may ask. Google Now is able to make recommendations based on big data on a user's device. But how do those recommendations influence how you spend your time, which products you buy, which opinions you read? Why do these big companies invest so much money in them? Do they only know you, or do they also influence you? Although recommendation systems account for up to a third of all traffic on many popular sites, we do not really know how much power they have to influence our decisions.
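To make the idea concrete, here is a minimal sketch of user-based collaborative filtering, the basic mechanism behind many recommendation engines. It is purely illustrative: the ratings matrix is invented, and real services like Amazon or Netflix use far more sophisticated, proprietary algorithms.

```python
import numpy as np

# Toy ratings matrix: rows = users, columns = items, 0 = not yet rated.
# All values are invented for illustration only.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_similarity(a, b):
    """Cosine similarity between two users' rating vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return a @ b / denom if denom else 0.0

def recommend(user_idx, ratings, top_n=1):
    """Score the items a user has not rated, weighting other users'
    ratings by how similar their histories are to this user's."""
    sims = np.array([cosine_similarity(ratings[user_idx], other)
                     for other in ratings])
    sims[user_idx] = 0.0                     # ignore the user themselves
    scores = sims @ ratings                  # similarity-weighted ratings
    scores[ratings[user_idx] > 0] = -np.inf  # hide items already rated
    return np.argsort(scores)[::-1][:top_n]

print(recommend(0, ratings))  # the item user 0 "might also like"
```

The key point is that the system never needs to understand the items themselves: patterns in past behaviour are enough to predict what a user “might also like”.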
Big data combines structured, semi-structured and unstructured data that can be mined for information and used in machine learning, predictive analytics and other advanced analytics applications. Structured data can be arranged into rows and columns, as in relational databases; semi-structured data (such as JSON or XML files) carries some organising tags but no rigid schema; and unstructured data is not organised in any pre-defined way – for instance tweets, blog posts, pictures and even video.
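A small, purely illustrative sketch of the difference: the structured records below share a fixed set of fields and map directly onto a table, while the unstructured items follow no predefined schema.

```python
# Structured data: every record has the same, predefined fields,
# so it maps directly onto rows and columns (a relational table).
structured = [
    {"customer_id": 1, "country": "PT", "total_spent": 120.50},
    {"customer_id": 2, "country": "DE", "total_spent": 75.00},
]

# Unstructured data: no predefined organisation; any structure
# has to be extracted later, e.g. by text or image analysis.
unstructured = [
    "Loving the new phone, battery lasts forever! #happy",  # a tweet
    b"\x89PNG\r\n raw image bytes ...",                      # a picture
]
```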
Organisations use dedicated systems to store and process big data; this combination of systems and tools is called a data management architecture.
Characteristics of big data
The most widely accepted characterisation of big data follows the three Vs coined by Doug Laney in 2001: the volume of data being generated, the variety of data types stored and processed in big data systems, and the velocity at which data is generated, collected and processed. Veracity, value and variability have since been added to enrich the description of big data.
Volume means the amount of data being generated and collected every moment in our highly digitised world, measured in bytes (terabytes, exabytes, zettabytes). As you can imagine, these enormous volumes create many challenges – storage, distribution and processing – which translate into cost, scalability and performance concerns. Volume is driven by the growth in data sources (more people online), higher resolutions (sensors) and scalable infrastructure.
Velocity refers to the speed at which data is generated – non-stop, streamed in near or real time – and processed using local and cloud-based technologies.
Variety is the diversity of data. Data comes in different forms, such as text, images, tweets or geospatial data, and from different sources, such as machines, people and organisational processes (both internal and external). Drivers include mobile technologies, social media, wearable technologies, geotechnologies, video and many more. Attributes include the degree of structure and complexity.
Veracity refers to conformity to facts and accuracy – the quality and origin of the data. Attributes include consistency, completeness, integrity and ambiguity. Drivers include cost and the need for traceability. Given the high volume, velocity and variety of data created, we need to ask: is the information real, or is it false?
There are further emerging Vs, but we will mention just one more: value. It refers to our capacity and need to turn data into value. Value does not only mean profit. It may relate to security and safety (such as seismic information), health (wearables that can detect signs of a heart attack) or social benefits such as employee or personal satisfaction. Big data has a large intrinsic value that can take many shapes.
The Vs not only characterise big data, they also embody its challenges: enormous amounts of data, available in different formats, largely unstructured, with varying quality, that require fast processing in order to take well-timed decisions.
Why and how is big data analysed?
Around 80% of data is considered to be unstructured. How, then, do we get reliable and accurate insights? The data must be filtered, categorised, analysed and visualised.
Big data analytics is the technological process of examining big data (high-volume, high-velocity and/or high-variety data sets) to uncover information – hidden patterns, correlations, market trends and/or customer preferences – that helps organisations, governments and institutions obtain insights and make informed, smarter and faster decisions.
This addresses three important questions: what, why and how. We’ve already seen the what, so we will now get an overview of the why and how.
The why and how of big data
Big data follows the principle that “the more you know about something, the more reliably you can gain new insights and make predictions about what will happen in the future”.
A typical data management lifecycle includes ingestion, storage, processing, analytics, visualisation, sharing and applications. The cloud and big data go hand in hand, with much data analytics happening on public cloud services. Companies like Amazon, Microsoft and Google offer cloud services that enable the rapid deployment of massive amounts of computing power, so organisations can access state-of-the-art computing on demand, without owning the necessary infrastructure, and run the entire data management lifecycle in the cloud. In the previous section we spoke about SaaS, IaaS and PaaS – cloud computing offers big data researchers the opportunity to access anything as a service (XaaS).
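As a hedged illustration of how these lifecycle stages fit together (not of any particular vendor's service), the sketch below represents each stage as a plain Python function; in practice every stage would be backed by dedicated, often cloud-hosted, tooling, and the data and function names here are invented.

```python
# Minimal sketch of the data management lifecycle as a chain of stages.

def ingest():
    """Ingestion: collect raw records from sources (sensors, logs, APIs)."""
    return [{"sensor": "s1", "value": 21.5}, {"sensor": "s2", "value": None}]

def store(records):
    """Storage: persist the raw data (here simply kept in memory)."""
    return list(records)

def process(records):
    """Processing: clean the data, e.g. drop records with missing values."""
    return [r for r in records if r["value"] is not None]

def analyse(records):
    """Analytics: derive a simple insight from the cleaned data."""
    values = [r["value"] for r in records]
    return {"count": len(values), "mean": sum(values) / len(values)}

def visualise(summary):
    """Visualisation and sharing: present the insight (here, just print it)."""
    print(summary)

visualise(analyse(process(store(ingest()))))
```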
Pre-processing
Raw data may contain errors or low-quality values (missing values, outliers, noise, inconsistent values) and might need to be pre-processed (data cleaning, fusion, transformation and reduction) to remove noise, correct data or reduce its size. For example, in water usage behaviour analysis, smart water meter data must be pre-processed before it yields useful consumption patterns, because IoT sensors may fail to record some readings.
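The sketch below gives a minimal, hypothetical example of such pre-processing using pandas: the hourly smart-meter readings are invented, a missing reading (a failed sensor) is interpolated, and an implausible spike is removed as noise.

```python
import pandas as pd

# Invented hourly smart water meter readings (litres); None = sensor failed,
# 900 = an implausible spike we treat as noise.
readings = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=6, freq="h"),
    "litres": [12.0, None, 11.5, 900.0, 10.8, 12.3],
})

# Data cleaning: interpolate the missing reading left by the failed sensor.
readings["litres"] = readings["litres"].interpolate()

# Noise removal: flag readings far from the median as outliers, then drop them.
median = readings["litres"].median()
mad = (readings["litres"] - median).abs().median()
outliers = (readings["litres"] - median).abs() > 5 * mad
clean = readings[~outliers]

print(clean)
```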
Identifying patterns or insights
The automated process behind big data involves building models based on the collected data and running simulations, modifying the values of data points to observe how that impacts the results. The advanced analytics technology available today can run millions of simulations, tweaking variables in a quest to identify patterns or insights (finding correlations between variables) that might provide a competitive advantage or solve a problem. Behavioural analytics focuses on the actions of people, while predictive analytics looks for patterns that can help in anticipating trends.
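As a small illustration of “finding correlations between variables” (with made-up numbers and deliberately simple logic), the snippet below computes pairwise Pearson correlation coefficients across a few variables and reports the strongest relationship.

```python
import numpy as np

# Invented observations of three variables (e.g. temperature, ad spend, sales).
data = {
    "temperature": np.array([18, 21, 25, 30, 33, 35], dtype=float),
    "ad_spend":    np.array([100, 120, 90, 150, 160, 155], dtype=float),
    "sales":       np.array([200, 230, 260, 310, 340, 355], dtype=float),
}

names = list(data)
best = None
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        # Pearson correlation coefficient between the two variables.
        r = np.corrcoef(data[names[i]], data[names[j]])[0, 1]
        if best is None or abs(r) > abs(best[2]):
            best = (names[i], names[j], r)

print(f"Strongest correlation: {best[0]} vs {best[1]} (r = {best[2]:.2f})")
```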
Data mining
Data mining is the process of discovering patterns in large datasets using statistical analysis, a common mathematical approach to information extraction and discovery. Statistical methods are mathematical formulas, models and techniques used to find patterns and rules in raw data. Commonly used methods include regression analysis, spatiotemporal analysis, association rules, classification, clustering and deep learning.
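As one concrete example of the methods listed above, the following sketch applies k-means clustering with scikit-learn (assuming the library is installed) to a handful of made-up customer records, grouping similar customers together without any labels.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up customer records: [annual spend, visits per month].
customers = np.array([
    [200,  2], [220,  3], [250,  2],   # low-spend, infrequent visitors
    [900, 12], [950, 14], [880, 11],   # high-spend, frequent visitors
], dtype=float)

# Group the customers into two clusters based on similarity.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(customers)

print(labels)                  # cluster assigned to each customer
print(model.cluster_centers_)  # the "typical" customer of each cluster
```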
To make sense of the available data, cutting-edge analytics involving artificial intelligence and machine learning are commonly used. With machine learning, computers can learn to identify what various data inputs, or combinations of inputs, represent, spotting patterns much faster and more efficiently than humans can.
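A minimal sketch of that idea, again with invented data and scikit-learn: a classifier learns from labelled examples which combinations of inputs correspond to which label, and can then label new inputs automatically.

```python
from sklearn.tree import DecisionTreeClassifier

# Invented training examples: [hour of day, message length] -> "spam"/"ok".
X_train = [[2, 300], [3, 280], [14, 40], [15, 35], [4, 310], [16, 50]]
y_train = ["spam", "spam", "ok", "ok", "spam", "ok"]

# The model learns which combinations of inputs correspond to which label.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Classify unseen inputs far faster than a human could at scale.
print(model.predict([[3, 290], [13, 45]]))  # expected: ['spam' 'ok']
```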