
Intro to Data Science (Glossary)

Introduction

  • Data - Any type of information that can be organised and processed by a computer: numbers, text, audio, video, software programs and so on, essentially any raw binary input. Data is also referred to as the quantities, characters or symbols on which a computer performs operations, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical or mechanical recording media.

  • Data Scientist - A data scientist is someone who finds solutions to problems by analysing big or small data with appropriate tools and communicates the findings to the relevant stakeholders. Anyone with a curious mind, fluency in analytics and the ability to communicate findings can be considered a data scientist. Data scientists also organise and synthesise large amounts of information to answer questions and drive strategy in an organisation.

  • Data Science - A discipline that uses a range of methods to extract information from raw data and translate it into understandable and usable insights or solutions. In simpler terms, data science is what data scientists do. It can be understood as the activity in the processing layer of a system architecture, run against data stored in the data layer, in order to extract knowledge from the raw data. It primarily involves three things: organising data, analysing data and presenting data.

  • Data Science Life Cycle - A generalised data science life cycle consists of several stages that vary by project and industry. A common formulation has five stages, viz. capture, maintain, process, analyze and communicate, which together turn raw data into insights or solutions. These stages can be used as a blueprint for solving any data science problem.

Capture

  • Data Acquisition - Data acquisition is the digitising of signals from the environment, usually with the help of software. A data acquisition system typically consists of three components, viz. sensors, signal conditioning and an ADC (analog-to-digital converter).

    • Sensors - Also called transducers, sensors convert real phenomena such as temperature, force and movement into voltage and current signals that can be used as input to the ADC. Examples include thermocouples, thermistors and RTDs for measuring temperature, accelerometers for measuring movement and odometers for measuring mileage.
    • Signal Conditioning - Additional circuitry between the transducers and the ADC that helps produce quality measurements from the transducers. Examples are amplification/attenuation, filtering, excitation, linearisation, calibration and cold-junction compensation. It often involves the choice of measurement units such as litres, gallons, metres or kilometres.
    • ADC - A chip that takes analog signals from the environment and converts them into discrete levels that can be interpreted by a processor. It is the core of a data acquisition system (a minimal count-to-voltage conversion sketch appears at the end of this section).
  • Data Entry - Entering or updating data in a filing system or a computer. The following methods are commonly used for data entry:

    • Scribe using Pencil and Paper
    • Keyboarding - Most widely used method in the world.
    • OCR (Optical Character Recognition) - Reading data from a source document using an optical scanner. It can increase the speed of data entry by 60 to 90 percent over keyboarding methods.
    • Barcoding - Barcodes can be thought of as meta codes or code of codes. These appear as narrow and wide bands on a label that encodes numbers or letters. A beam of light from a scanner or a light pen is drawn across the bands to confirm or record data about the product being scanned.
  • Signal Reception - TODO – USB

  • Data Extraction - The process of retrieving data from various sources. Organisations typically use data extraction to migrate data into a repository (such as a data warehouse) or to analyse it further. Sources can include files, databases, hard disks, the web, the cloud and so on. Extraction can be either full or incremental (a sketch of the difference follows at the end of this section). Working with extracted unstructured data usually involves a data lake, where noise, white space and symbols are cleaned up, duplicate results are removed and missing data is handled.
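
To make the distinction between full and incremental extraction concrete, here is a minimal Python sketch. The orders table, its updated_at column and the in-memory SQLite database are illustrative assumptions, not part of the source material.

    # Minimal sketch contrasting full and incremental extraction.
    # The orders table and updated_at column are illustrative assumptions.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [(1, "2024-01-02"), (2, "2024-03-15")])

    def full_extract():
        # Full extraction: pull every row; simple but costly on large tables.
        return conn.execute("SELECT * FROM orders").fetchall()

    def incremental_extract(last_run_ts):
        # Incremental extraction: pull only rows changed since the last run.
        return conn.execute(
            "SELECT * FROM orders WHERE updated_at > ?", (last_run_ts,)).fetchall()

    print(len(full_extract()), "rows in the full extract")
    print(len(incremental_extract("2024-02-01")), "rows in the incremental extract")

In a real pipeline the last_run_ts value would be persisted between runs, typically in the staging area described in the next section.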
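
The ADC entry above mentions converting analog signals into discrete levels. The sketch below shows how a processor might turn a raw count back into a physical value; the 12-bit resolution, 5.0 V reference and 10 mV-per-degree sensor scaling are assumptions chosen for illustration.

    # Sketch of interpreting raw ADC counts as a physical measurement.
    # Resolution, reference voltage and sensor scaling are assumed values.
    ADC_BITS = 12      # assumed converter resolution
    V_REF = 5.0        # assumed reference voltage in volts

    def counts_to_voltage(raw_count):
        # Map a discrete ADC count back to the analog voltage it represents.
        return raw_count / (2 ** ADC_BITS - 1) * V_REF

    def voltage_to_celsius(voltage):
        # Toy signal-conditioning step: an assumed 10 mV per degree Celsius sensor.
        return voltage / 0.010

    raw = 205                                        # example reading
    volts = counts_to_voltage(raw)
    print(round(volts, 3), "V ->", round(voltage_to_celsius(volts), 1), "deg C")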

Maintain

  • Data Warehousing - A technique for collecting data from various sources and managing that data so that it can provide meaningful business insights. It is a blend of technologies and components used to store and access large amounts of electronic data in a data warehouse for querying and analysis. There are mainly two types of data warehousing approaches:

    • ETL Warehousing - data is extracted, transformed and then loaded into the warehouse.
    • ELT Warehousing - data is extracted, loaded into the warehouse and transformed there.
  • Data Staging - An intermediate step in which data is stored in a staging area or landing zone so that ETL operations such as extraction, transformation, loading and cleaning can be performed. The staging area sits between the data sources and data targets such as data warehouses, data marts and other repositories. Staging areas can be implemented as relational databases, text-based flat files (such as XML files) or proprietary binary files stored on file systems. Typical uses of data staging are consolidation, alignment, minimising contention, change detection, troubleshooting and capturing point-in-time changes.

  • Data Cleansing - The process of detecting and correcting inaccurate or corrupt data. Detection involves identifying incomplete, incorrect, inaccurate or irrelevant parts of the data inside record sets, tables or databases. Correction involves replacing, modifying or deleting the inaccurate or corrupt values. The goal of data cleansing is to improve the quality of data before it is loaded into a warehouse (a minimal cleansing sketch follows at the end of this section).

  • Data Architecture - It is the process of planning the collection of data, including the definition of the information to be collected, the standards and norms that will be used for its structuring and the tools used in the extraction, storage and processing of such data. It also encompasses definitions related to the flow of data within the system.
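
As a small illustration of the cleansing step described above, the following pandas sketch trims whitespace, removes duplicates and handles missing values. The column names and cleaning rules are made-up assumptions, not a prescribed pipeline.

    # Minimal data cleansing sketch with pandas.
    import numpy as np
    import pandas as pd

    raw = pd.DataFrame({
        "customer": ["  Alice ", "Bob", "Bob", None],
        "amount":   [100.0, np.nan, np.nan, 42.0],
    })

    clean = (
        raw
        .assign(customer=raw["customer"].str.strip())   # trim stray whitespace
        .drop_duplicates()                              # remove duplicate records
        .dropna(subset=["customer"])                    # drop rows missing a key field
        .fillna({"amount": 0.0})                        # impute missing amounts
    )

    print(clean)

A real cleansing job would also validate types, ranges and reference data before the load into the warehouse.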

Process

  • Data Mining - The process of finding anomalies, patterns and correlations within large datasets in order to predict outcomes. Data mining usually starts after the data has been appropriately processed, transformed and stored. Its results can be used to increase revenue, cut costs, improve customer relationships, reduce risks and more. Data visualisation is a good starting point for data mining. It has many benefits, as it allows organisations to analyse data continually and to automate both routine and critical decisions without the delay of human judgement.

  • Steps of Data Mining - Business understanding, data understanding, data preparation, data modeling, evaluation and deployment.

    • Business understanding (Understanding the problem) - The first step is establishing the goals. A plan is developed at this stage that includes timelines, actions and role assignments.
    • Data Understanding - Data is collected from all the sources at this stage and visualisation tools are used to understand the properties of data to ensure it will help achieve the business goals.
    • Data Preparation - Data is cleansed and missing data is included to ensure that it is ready to be mined. Processing of data at this stage can take a lot of time depending on the number of data sources and the amount of data. Appropriate failsafe measures are taken to ensure that information is not lost at this stage. Distributed systems are used to improve speed rather than burdening a single system.
    • Data Modeling - Mathematical models are applied, using sophisticated data tools, to find patterns in the data.
    • Evaluation - Findings are evaluated and compared with the business objectives to determine whether they should be deployed across the organisation.
    • Deployment - The final stage of the data mining process. An organisation's business intelligence platform can be used to provide a single source of truth for self-service data discovery.
  • Supervised Learning - Mentioned as one of the techniques of data mining, supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. A supervised learning algorithm analyses the training data and produces an inferred function, which can then be used to map new examples (see the classifier sketch at the end of this section).

  • Unsupervised Learning - Mentioned as one of the techniques of data mining, unsupervised learning is the training of a machine learning algorithm on information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. The system is presented with unlabeled, uncategorised data and its algorithms act on the data without prior training. Clustering, recommendation systems and association analysis (also called market-basket analysis) are given as examples that use unsupervised learning.

  • Data Clustering - Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense, to each other than to those in other groups (clusters). The five clustering algorithms a data scientist should know are listed below (please see the graphs of these methods in the course videos); a bare-bones K-means sketch follows at the end of this section:

    • K-Means Clustering - In K-means clustering each point in the dataset is classified by computing the distance between that point and each group centre, and then assigning the point to the group whose centre is closest to it.
    • Mean-Shift Clustering - A sliding-window-based algorithm that attempts to find dense areas of data points. It is a centroid-based algorithm whose goal is to locate the centre points of each group/class, which it does by updating candidate centre points within a sliding window. These candidate windows are then filtered in a post-processing stage to eliminate near duplicates, forming the final set of centre points and their groups.
    • Density-Based Spatial Clustering of Applications with Noise (DBSCAN) - DBSCAN is similar to mean-shift clustering but has some advantages. It starts with an arbitrary data point that hasn't been visited. If there is a sufficient number of nearby points (the minimum number of points), the clustering process starts and the current data point becomes the first point of a new cluster; otherwise the point is labeled as noise.
    • Expectation-Maximization Clustering with Gaussian Mixture Models (GMM) - A Gaussian mixture model is a soft clustering method and is more flexible than K-means. In GMMs there is an uncertainty measure, or probability, that tells us how strongly a data point is associated with a specific cluster. This differs from K-means, where a data point can belong to only one cluster.
    • Agglomerative Hierarchical Clustering - Hierarchical clustering starts by treating each observation as a separate cluster. Then, it repeatedly executes the following two steps: (1) identify the two clusters that are closest together, and (2) merge the two most similar clusters. This continues until all the clusters are merged together. The output of Hierarchical Clustering is a dendrogram, which shows the hierarchical relationship between the clusters.
  • Data Summarization - Data summarization is the reduction of a large body of data to a short, digestible result, much like a short conclusion of a long text. In practice you write code that, at the end, reports the final result as summary figures. Most of the time summarization is carried out with software. It is needed for faster understanding of data because of its large volume and disparate nature (a minimal pandas example is shown below).
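
The data summarization entry above is easy to ground with pandas; the small sales table below is an illustrative assumption.

    # Quick data summarization sketch with pandas.
    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["north", "north", "south", "south", "south"],
        "revenue": [120, 95, 210, 180, 160],
    })

    print(sales["revenue"].describe())                             # overall summary statistics
    print(sales.groupby("region")["revenue"].agg(["count", "mean", "sum"]))  # per-group summary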
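
For the supervised learning entry, here is a minimal classifier sketch. It assumes scikit-learn is installed and uses its bundled iris dataset; the model choice is an assumption made for illustration, not something prescribed by the text.

    # Supervised learning in miniature: learn a mapping from labelled examples.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)                 # example input-output pairs
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train, y_train)                       # infer a function from the training data
    print("accuracy on unseen examples:", model.score(X_test, y_test))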
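
The K-means description above (compute the distance to each centre, assign the point to the closest one) can be written out directly. The following NumPy sketch is a bare-bones version on made-up two-dimensional data; library implementations add better initialisation and convergence checks.

    # Bare-bones K-means: assign points to the nearest centre, recompute centres.
    import numpy as np

    rng = np.random.default_rng(0)
    points = np.vstack([rng.normal(0, 0.5, (50, 2)),   # toy cluster near (0, 0)
                        rng.normal(3, 0.5, (50, 2))])  # toy cluster near (3, 3)

    k = 2
    centers = points[rng.choice(len(points), k, replace=False)]

    for _ in range(10):                                # a few refinement passes
        # distance from every point to every centre, then nearest-centre labels
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centre to the mean of the points assigned to it
        centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])

    print("estimated centres:\n", centers)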

Analyze

  • Exploratory Data Analysis - Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. In EDA we frame our questions as well as decide whether to manipulate our data sources. In EDA we look for clues that suggest logical next steps, questions and further areas of research.

  • Confirmatory Data Analysis - The part where you evaluate your evidence using traditional statistical tools such as significance, inference and confidence. At this point you are really challenging your assumptions. A big part of confirmatory data analysis is quantifying the extent to which any deviation from the model you have built could have happened by chance, and at what point you need to start questioning your model.

  • Predictive Data Analytics - Predictive analytics encompasses a variety of statistical techniques from data mining, predictive modelling, and machine learning, that analyze current and historical facts to make predictions about future or otherwise unknown events.

  • Regression Analysis - A form of predictive modeling used to investigate the relationship between an independent and a dependent variable. Regression analysis is used for forecasting, time series modeling and finding causal relationships between variables (a fitting sketch appears at the end of this section).

  • Logistic regression - Logistic regression is used to find the probability of an event's success or failure. We use logistic regression when the dependent variable is binary, i.e. either 0 or 1, so it is used to classify a data set (see the sketch at the end of this section).

  • Polynomial Regression - A regression equation is a polynomial equation when the independent variable appears with a power greater than one. In such an equation the best-fit line is not a straight line but a curve that fits the data points (the regression sketch at the end of this section includes a quadratic fit).

  • Text Mining - Text analytics is the way to unlock meaning from unstructured text. It lets you uncover patterns and themes, so you know what customers are thinking about, and it reveals their wants and needs. It uses NLP techniques to identify facts, relationships and assertions in unstructured text.

  • Qualitative Data - Qualitative data is data that approximates and characterises. It is non numerical in nature. It is also known as categorical data and it is collected using methods of observation, one-on-one interviews, focus groups etc. Qualitative data is represented as codes, text and symbols etc.

  • Quantitative Data - Quantitative data is the measure of values or counts and is represented as numbers. It denotes numerical variables and answers questions related to how many, how much, etc.
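
The regression and polynomial regression entries above can be illustrated with a single NumPy sketch: a straight-line fit and a quadratic fit to the same noisy synthetic data. The data-generating function is an assumption made purely for the example.

    # Linear vs. polynomial (degree-2) regression on synthetic data.
    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(0, 10, 40)
    y = 2.0 + 1.5 * x + 0.3 * x**2 + rng.normal(0, 2, x.size)  # quadratic trend plus noise

    linear_coeffs = np.polyfit(x, y, deg=1)   # best-fit straight line
    poly_coeffs = np.polyfit(x, y, deg=2)     # best-fit curve

    print("linear fit (slope, intercept):", np.round(linear_coeffs, 2))
    print("quadratic fit (a, b, c):      ", np.round(poly_coeffs, 2))

The quadratic coefficients should land close to the 0.3, 1.5 and 2.0 used to generate the data, while the straight line can only approximate the curve.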
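
For the logistic regression entry, here is a hedged scikit-learn sketch on the library's bundled breast-cancer dataset, where the target is binary (0 or 1); the dataset choice and solver settings are assumptions for illustration.

    # Logistic regression for a binary (0/1) outcome.
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)        # y is 0 or 1
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    clf = LogisticRegression(max_iter=5000)           # generous iteration budget for convergence
    clf.fit(X_train, y_train)

    print("predicted class for one sample:", clf.predict(X_test[:1])[0])
    print("probability of class 1:", round(clf.predict_proba(X_test[:1])[0, 1], 3))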

Communicate

  • Data Reporting - Data reporting is done in the final stages of the data science life cycle and is an important step. It is the process of collecting and submitting data so that accurate analyses of the facts on the ground are possible; inaccurate data reporting can lead to vastly uninformed decision-making based on erroneous evidence. Data reporting is done with the help of software and tools that can display the data to the intended stakeholders.

  • Data Visualization - Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs and maps, data visualization tools provide an accessible way to see and understand trends, outliers and patterns in data (a minimal charting sketch appears at the end of this section).

  • Business Intelligence - Business intelligence (BI) comprises the strategies and technologies used by enterprises for the data analysis of business information. BI technologies provide historical, current and predictive views of business operations. Common functions of business intelligence technologies include reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining, predictive analytics and prescriptive analytics.

  • Decision Making - Decision making is the process of making choices by identifying a decision, gathering information, and assessing alternative resolutions. Using a step-by-step decision-making process can help you make more deliberate, thoughtful decisions by organizing relevant information and defining alternatives. Data science methodologies and tools help in making better and effective decisions after analysis of data.
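
To ground the data visualization entry in the Communicate stage, here is a minimal matplotlib sketch; the quarterly revenue figures are made-up illustrative values.

    # Minimal data visualization sketch: a bar chart saved for a report.
    import matplotlib.pyplot as plt

    quarters = ["Q1", "Q2", "Q3", "Q4"]
    revenue = [120, 135, 128, 160]

    plt.bar(quarters, revenue)
    plt.title("Revenue by quarter")
    plt.ylabel("Revenue (in thousands)")
    plt.savefig("revenue_by_quarter.png")   # file the chart into a report or dashboard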
