Data - Any type of information that can be organised. Data is any collection of numbers, text, audio, video, software programs, etc., essentially any raw binary input that can be processed by a computer. Data is also referred to as the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
Data Scientist - A data scientist can be defined as someone who finds solutions to problems by analyzing big or small data using appropriate tools and communicating her findings to the relevant stakeholders. As long as one has a curious mind, fluency in analytics, and the ability to communicate findings, she can be considered a data scientist. Data scientists can also organize and synthesize large amounts of information to answer questions and drive strategy in an organisation.
Data Science - It is a science or procedure that uses different kinds of methods to extract information and translate that information into understandable and usable data, insights or solutions. In simpler terms, data science is what data scientists do. It can be understood as the activities happening in the processing layer of the system architecture, against data stored in the data layer, in order to extract knowledge from the raw data. It primarily involves three things: organising data, analysing data and presenting data.
Data Science Life Cycle - A generalised data science life cycle consists of various stages depending on the project or industry. To arrive at the final product as insights or solutions, the data science life cycle has five stages, viz. capture, maintain, process, analyze and communicate. These stages can be used as a blueprint to solve any data science related problem.
Data Acquisition - Data acquisition is digitizing data from the surrounding environment, usually with the help of software. A data acquisition system usually consists of three components, viz. sensors, signal conditioning and an ADC (analog-to-digital converter).
Data Entry - Entering or updating data into a filing system or a computer. Following are the methods that are used for Data Entry.
Signal Reception - TODO – USB
Data Extraction - Data extraction is the process of retrieving data from various sources. Organizations typically use data extraction to migrate data to a data repository (such as a data warehouse) or to analyse it further. Some examples of sources are files, databases, hard disks, the web, the cloud, etc. Extraction can be of two types: full extraction and incremental extraction. Extraction of unstructured data usually involves the use of a data lake for cleaning up noise, white space and symbols, removing duplicate results and handling missing data.
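As a minimal sketch of full versus incremental extraction, the snippet below uses a tiny in-memory SQLite table; the table, column names and watermark value are hypothetical, and a real pipeline would extract from an existing operational source and persist the watermark.

```python
import sqlite3
from datetime import datetime

# Tiny in-memory source so the sketch runs end to end; in practice this would
# be an existing operational database. Table and column names are made up.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL, updated_at TEXT)")
cur.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "A", 10.0, "2024-01-02 09:00:00"), (2, "B", 25.5, "2023-12-30 17:30:00")],
)

# Full extraction: pull every row from the source table.
full = cur.execute("SELECT * FROM orders").fetchall()

# Incremental extraction: pull only rows changed since the last successful run,
# tracked by a watermark timestamp.
last_extracted_at = "2024-01-01 00:00:00"
incremental = cur.execute(
    "SELECT * FROM orders WHERE updated_at > ?", (last_extracted_at,)
).fetchall()

print(len(full), "rows in full extraction,", len(incremental), "in incremental")

# The watermark is then advanced to the time of this extraction.
last_extracted_at = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
conn.close()
```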
Data Warehousing - It is defined as a technique of collecting data from various sources and managing that data to provide meaningful business insights. It is a blend of technologies and components that enables the electronic storage of large amounts of data in a data warehouse for querying and analysis. There are mainly two types of data warehousing approaches:
Data Staging - Data staging is an intermediate process in which data is stored in a staging area or landing zone to perform ETL operations such as extraction, transformation, loading and cleaning. The data staging area sits between the data sources and data targets such as data warehouses, data marts and other data repositories. Staging areas can be implemented as relational databases, text-based flat files (such as XML files), or proprietary-format binary files stored on file systems. Some of the uses of data staging are consolidation, alignment, minimizing contention, change detection, troubleshooting and capturing point-in-time changes.
Data Cleansing - Data cleansing is the process of detecting and correcting inaccurate or corrupt data. Detection involves identifying incomplete, incorrect, inaccurate or irrelevant parts of data inside record sets, tables or databases. Correction involves replacing or modifying inaccurate or corrupt data with correct data, or deleting it. The goal of data cleansing is to improve the quality of data before it is loaded into a warehouse.
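A minimal cleansing sketch with pandas, assuming a small hypothetical customer table with the usual problems (duplicates, inconsistent casing, stray whitespace, an impossible value and a missing one):

```python
import pandas as pd

# Hypothetical raw records.
raw = pd.DataFrame({
    "name": [" Alice ", "BOB", "BOB", "Carol", None],
    "age":  [29, 41, 41, -5, 33],
    "city": ["Pune", "pune", "pune", "Mumbai", "Delhi"],
})

clean = raw.drop_duplicates()                           # remove exact duplicate rows
clean["name"] = clean["name"].str.strip().str.title()   # normalise whitespace and casing
clean["city"] = clean["city"].str.title()
clean["age"] = clean["age"].where(clean["age"] >= 0)    # flag impossible ages as missing
clean = clean.dropna(subset=["name"])                   # drop rows with no usable name
print(clean)
```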
Data Architecture - It is the process of planning the collection of data, including the definition of the information to be collected, the standards and norms that will be used for its structuring and the tools used in the extraction, storage and processing of such data. It also encompasses definitions related to the flow of data within the system.
Data Mining - It is the process of finding anomalies, patterns and correlations within large datasets to predict outcomes. Data mining usually starts after data has been appropriately processed, transformed and stored. Results from data mining can be used to increase revenues, cut costs, improve customer relationships, reduce risks and more. A good starting point for data mining is data visualisation. It has many benefits, as it allows organisations to continually analyse data and automate both routine and critical decisions without the delay of human judgement.
Steps of Data Mining - Business understanding, data understanding, data preparation, data modeling, data evaluation and deployment.
Supervised Learning - Mentioned as one of the techniques of data mining, Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.
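A minimal supervised learning sketch, using scikit-learn for illustration; the features (hours studied, hours slept), the pass/fail labels and the choice of a decision tree are all made up for the example:

```python
from sklearn.tree import DecisionTreeClassifier

# Example input-output pairs: features are [hours_studied, hours_slept],
# the label is 1 for "passed" and 0 for "failed" (made-up training data).
X_train = [[8, 7], [1, 4], [6, 8], [2, 5], [7, 6], [0, 3]]
y_train = [1, 0, 1, 0, 1, 0]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)          # infer a function from the labelled pairs

# The inferred function can now map new, unseen examples to predicted labels.
print(model.predict([[5, 7], [1, 2]]))
```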
Unsupervised Learning - Mentioned as one of the techniques of data mining, unsupervised learning is the training of a machine learning algorithm using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. In unsupervised learning, a system is presented with unlabeled, uncategorised data and the system's algorithms act on the data without prior training. Clustering, recommendation systems and association analysis (also called market-basket analysis) have been given as examples that use unsupervised learning.
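As one illustration of the clustering case, a minimal sketch with scikit-learn's k-means on made-up, unlabeled points:

```python
from sklearn.cluster import KMeans

# Made-up two-dimensional points with no labels attached.
points = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
          [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]

# Group the points into two clusters purely by similarity (distance),
# with no prior training or labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # centre of each cluster
```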
Data Clustering - Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). The top five clustering algorithms which a data scientist must know are (please see the graphs of these methods in the course videos):
Data Summarization - Data summarization is a simple term for a short conclusion drawn from a larger body of material, such as a theory or a paragraph. In practice, you write code and, at the end, report the final result in the form of summarized data. Most of the time data summarization is carried out using software. Data summarization is required for faster understanding of data that is large in volume and disparate in nature.
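A minimal summarization sketch using pandas; the sales table, its column names and the figures are hypothetical:

```python
import pandas as pd

# Made-up sales records.
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "units":  [120, 90, 200, 150, 175],
    "price":  [9.5, 9.5, 8.0, 8.0, 8.5],
})

print(sales.describe())                        # count, mean, std, min, max, quartiles
print(sales.groupby("region")["units"].sum())  # totals per region as a compact summary
```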
Exploratory Data Analysis - Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. In EDA we frame our questions as well as decide whether to manipulate our data sources. In EDA we look for clues that suggest logical next steps, questions and further areas of research.
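A minimal EDA sketch in pandas on a hypothetical dataset (columns, values and the questions asked are made up), looking for clues that suggest the next steps:

```python
import pandas as pd

# Made-up observations.
df = pd.DataFrame({
    "age":    [22, 35, 58, 45, 31, 64, 27],
    "income": [28, 52, 75, 61, 44, 80, 33],   # in thousands, invented values
    "city":   ["Pune", "Delhi", "Pune", "Mumbai", "Delhi", "Pune", "Mumbai"],
})

df.info()                            # column types and missing values
print(df["city"].value_counts())     # how observations are distributed across cities
print(df[["age", "income"]].corr())  # does income rise with age? a clue to investigate further
```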
Confirmatory Data Analysis - Confirmatory data analysis is the part where you evaluate your evidence using traditional statistical tools such as significance, inference, and confidence. At this point, you are really challenging your assumptions. A big part of confirmatory data analysis is quantifying things like the extent to which any deviation from the model you have built could have happened by chance, and at what point you need to start questioning your model.
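A minimal sketch of one such confirmatory step, a two-sample t-test with SciPy on made-up measurements, asking whether the observed difference between two groups could plausibly be chance:

```python
from scipy import stats

# Made-up measurements for two groups (e.g. response times under two designs).
group_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
group_b = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# A small p-value means a difference this large is unlikely under the
# "no real difference" assumption, so the evidence challenges that assumption.
```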
Predictive Data Analytics - Predictive analytics encompasses a variety of statistical techniques from data mining, predictive modelling and machine learning that analyze current and historical facts to make predictions about future or otherwise unknown events.
Regression Analysis - Regression analysis is a form of predictive modeling used to investigate the relationship between an independent and a dependent variable. Regression analysis is used for forecasting, time series modeling, and finding cause-and-effect relationships between variables.
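A minimal linear regression sketch with scikit-learn; the variables (advertising spend as the independent variable, sales as the dependent one) and the numbers are invented:

```python
from sklearn.linear_model import LinearRegression

# Made-up observations: advertising spend (independent) vs. sales (dependent).
X = [[10], [20], [30], [40], [50]]
y = [25, 44, 66, 83, 105]

reg = LinearRegression().fit(X, y)
print(reg.coef_[0], reg.intercept_)   # estimated slope and intercept
print(reg.predict([[60]]))            # forecast for an unseen spend level
```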
Logistic Regression - Logistic regression is used to find the probability of an event's success or failure. We use logistic regression when the dependent variable is binary, i.e. either 0 or 1. Logistic regression is used to classify a data set.
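A minimal logistic regression sketch with scikit-learn, where the dependent variable is binary; the hours-studied feature and pass/fail labels are made up:

```python
from sklearn.linear_model import LogisticRegression

# Hours studied vs. a binary outcome: 1 = passed, 0 = failed (invented data).
X = [[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [5.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.5]]))         # predicted class for a new student
print(clf.predict_proba([[2.5]]))   # probability of failure vs. success
```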
Polynomial Regression - A regression equation is a polynomial regression equation if the power of the independent variable is greater than one. In such an equation the best-fit line is not a straight line but rather a curve that fits the data points.
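A minimal polynomial regression sketch with NumPy, fitting a degree-2 curve to made-up points that a straight line would not fit well:

```python
import numpy as np

# Made-up points that follow a roughly quadratic trend.
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2.1, 4.9, 10.2, 17.1, 26.3, 37.0])

coeffs = np.polyfit(x, y, deg=2)     # coefficients of the best-fit quadratic
curve = np.poly1d(coeffs)
print(coeffs)                        # highest power first
print(curve(7))                      # prediction from the fitted curve
```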
Text Mining - Text analytics is the way to unlock the meaning held in unstructured text. It lets you uncover patterns and themes, so you know what customers are thinking about. It reveals their wants and needs. It uses NLP techniques to identify facts, relationships and assertions in unstructured text.
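A minimal text mining sketch using scikit-learn's CountVectorizer, turning a few made-up customer comments into a term-frequency table to surface recurring themes:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up customer comments (unstructured text).
comments = [
    "delivery was late and the package was damaged",
    "great price and fast delivery",
    "damaged box but the product itself is great",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(comments)

# Total frequency of each term across all comments hints at recurring themes.
totals = counts.sum(axis=0).A1
for term, total in sorted(zip(vectorizer.get_feature_names_out(), totals),
                          key=lambda pair: -pair[1]):
    print(term, total)
```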
Qualitative Data - Qualitative data is data that approximates and characterises. It is non-numerical in nature. It is also known as categorical data and is collected using methods such as observation, one-on-one interviews and focus groups. Qualitative data is represented as codes, text, symbols, etc.
Quantitative Data - Quantitative data is the measure of values or counts and is represented as numbers. It denotes numerical variables and answers questions related to how many, how much, etc.
Data Reporting - Data reporting is done in the final stages of the data science life cycle and is an important step. It is the process of collecting and submitting data that gives rise to accurate analyses of the facts on the ground; inaccurate data reporting can lead to vastly uninformed decision-making based on erroneous evidence. Data reporting is done with the help of software and tools that can display data to the intended stakeholders.
Data Visualization - Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.
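A minimal visualization sketch using Matplotlib on made-up monthly sales figures, showing how a simple chart makes a trend visible:

```python
import matplotlib.pyplot as plt

# Invented monthly figures.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 160, 158, 190]

plt.plot(months, sales, marker="o")   # a line chart makes the upward trend visible
plt.title("Monthly sales")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.show()
```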
Business Intelligence - Business intelligence (BI) comprises the strategies and technologies used by enterprises for the data analysis of business information. BI technologies provide historical, current and predictive views of business operations. Common functions of business intelligence technologies include reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining, predictive analytics and prescriptive analytics.
Decision Making - Decision making is the process of making choices by identifying a decision, gathering information, and assessing alternative resolutions. Using a step-by-step decision-making process can help you make more deliberate, thoughtful decisions by organizing relevant information and defining alternatives. Data science methodologies and tools help in making better and effective decisions after analysis of data.