Praise

"A must-read resource for anyone who is serious

about embracing the opportunity of big data."

Craig Vaughan

Global Vice President at SAP

"This timely book says out loud what has finally become apparent: in the modern world,

Data is Business, and you can no longer think business without thinking data. Read this

book and you will understand the Science behind thinking data."

Ron Bekkerman

Chief Data Officer at Carmel Ventures

"A great book for business managers who lead or interact with data scientists, who wish to

better understand the principles and algorithms available without the technical details of

single-disciplinary books."

Ronny Kohavi

Partner Architect at Microsoft Online Services Division

"Provost and Fawcett have distilled their mastery of both the art and science of real-world

data analysis into an unrivalled introduction to the field."

Geoff Webb

Editor-in-Chief of Data Mining and Knowledge Discovery Journal

"I would love it if everyone I had to work with had read this book."

Claudia Perlich

Chief Scientist of Dstillery and Advertising Research

Foundation Innovation Award Grand Winner (2013)

"A foundational piece in the fast developing world of Data Science.

A must read for anyone interested in the Big Data revolution."

Justin Gapper

Business Unit Analytics Manager

at Teledyne Scientific and Imaging

"The authors, both renowned experts in data science before it had a name, have taken a

complex topic and made it accessible to all levels, but mostly helpful to the budding data

scientist. As far as I know, this is the first book of its kind—with a focus on data science

concepts as applied to practical business problems. It is liberally sprinkled with

compelling real-world examples outlining familiar, accessible problems in the business

world: customer churn, targeted marketing, even whiskey analytics!

The book is unique in that it does not give a cookbook of algorithms, rather it helps the

reader understand the underlying concepts behind data science, and most importantly

how to approach and be successful at problem solving. Whether you are looking for a

good comprehensive overview of data science or are a budding data scientist in need of

the basics, this is a must-read."

Chris Volinsky

Director of Statistics Research at AT&T Labs and Winning

Team Member for the $1 Million Netflix Challenge

"This book goes beyond data analytics 101. It's the essential guide for those of us (all of

us?) whose businesses are built on the ubiquity of data opportunities and the new

mandate for data-driven decision-making."

Tom Phillips

CEO of Dstillery and Former Head of

Google Search and Analytics

"Intelligent use of data has become a force powering business to new levels of

competitiveness. To thrive in this data-driven ecosystem, engineers, analysts, and

managers alike must understand the options, design choices, and tradeoffs before them.

With motivating examples, clear exposition, and a breadth of details covering not only the

"hows" but the "whys", Data Science for Business is the perfect primer for those wishing to

become involved in the development and application of data-driven systems."

Josh Attenberg

Data Science Lead at Etsy

"Data is the foundation of new waves of productivity growth, innovation, and richer

customer insight. Only recently viewed broadly as a source of competitive advantage,

dealing well with data is rapidly becoming table stakes to stay in the game.

The authors' deep applied experience makes this a must read—a window into your

competitor's strategy."

Alan Murray

Serial Entrepreneur; Partner at Coriolis Ventures; Co-Founder Neuehouse

"One of the best data mining books, which helped me think through various ideas on

liquidity analysis in the FX business. The examples are excellent and help you take a deep

dive into the subject! This one is going to be on my shelf for lifetime!"

Nidhi Kathuria

Vice President of FX at Royal Bank of Scotland

"An excellent and accessible primer to help businessfolk better appreciate the concepts,

tools and techniques employed by data scientists... and for data scientists to better

appreciate the business context in which their solutions are deployed."

Joe McCarthy

Director of Analytics and Data Science at Atigeo, LLC

"In my opinion it is the best book on Data Science and Big Data for a professional

understanding by business analysts and managers who must apply these techniques in the

practical world."

Ira Laefsky

MS Engineering (Computer Science)/MBA Information Technology and Human

Computer Interaction Researcher formerly on the Senior Consulting Staff

of Arthur D. Little, Inc. and Digital Equipment Corporation

"With motivating examples, clear exposition and a breadth of details covering not only

the "hows" but the "whys," Data Science for Business is the perfect primer for those

wishing to become involved in the development and application of data driven systems."

Ted O'Brien

Co-Founder / Director of Talent Acquisition at Starbridge

Partners and Publisher of the Data Science Report

Foster Provost and Tom Fawcett

Special Edition for Data Science for Business Analytics,

Stern School, NYU

Data Science for Business

978-1-449-36132-7

[LSI]

Data Science for Business

by Foster Provost and Tom Fawcett

Copyright © 2013 Foster Provost and Tom Fawcett. All rights reserved.

Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are

also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/

institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Mike Loukides and Meghan Blanchette

Production Editor: Christopher Hearse

Proofreader: Kiel Van Horn

Indexer: WordCo Indexing Services, Inc.

Interior Designer: David Futato

Cover Designer: Mark Paglietti

Illustrator: Rebecca Demarest

July 2013: First Edition

Revision History for the First Edition

2013-07-25: First Release

2013-12-19: Second Release

yyyy-mm-dd: Third Release

See http://oreilly.com/catalog/errata.csp?isbn=9781449361327 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Data Science for Business, the cover

image, and related trade dress are trademarks of O'Reilly Media, Inc. Data Science for Business is a

trademark of Foster Provost and Tom Fawcett.

While the publisher and the authors have used good faith efforts to ensure that the information and

instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility

for errors or omissions, including without limitation responsibility for damages resulting from the use of

or reliance on this work. Use of the information and instructions contained in this work is at your own

risk. If any code samples or other technology this work contains or describes is subject to open source

licenses or the intellectual property rights of others, it is your responsibility to ensure that your use

thereof complies with such licenses and/or rights.

Table of Contents

Preface xvii

1. Introduction: Data-Analytic Thinking 1

The Ubiquity of Data Opportunities 1

Example: Hurricane Frances 3

Example: Predicting Customer Churn 4

Data Science, Engineering, and Data-Driven Decision Making 5

Data Processing and "Big Data" 8

From Big Data 1.0 to Big Data 2.0 8

Data and Data Science Capability as a Strategic Asset 9

Data-Analytic Thinking 12

This Book 14

Data Mining and Data Science, Revisited 14

Chemistry Is Not About Test Tubes: Data Science Versus the Work of the Data

Scientist 16

Summary 17

2. Business Problems and Data Science Solutions 19

Fundamental concepts: A set of canonical data mining tasks; The data mining process;

Supervised versus unsupervised data mining.

From Business Problems to Data Mining Tasks 19

Supervised Versus Unsupervised Methods 24

Data Mining and Its Results 26

The Data Mining Process 27

Business Understanding 28

Data Understanding 28

Data Preparation 30

Modeling 31

Evaluation 31

Deployment 33

Implications for Managing the Data Science Team 34

Other Analytics Techniques and Technologies 35

Statistics 36

Database Querying 38

Data Warehousing 39

Regression Analysis 39

Machine Learning and Data Mining 40

Answering Business Questions with These Techniques 41

Summary 42

3. Introduction to Predictive Modeling: From Correlation to Supervised Segmentation 43

Fundamental concepts: Identifying informative attributes; Segmenting data by

progressive attribute selection.

Exemplary techniques: Finding correlations; Attribute/variable selection; Tree

induction.

Models, Induction, and Prediction 45

Supervised Segmentation 48

Selecting Informative Attributes 49

Example: Attribute Selection with Information Gain 56

Supervised Segmentation with Tree-Structured Models 62

Visualizing Segmentations 69

Trees as Sets of Rules 72

Probability Estimation 72

Example: Addressing the Churn Problem with Tree Induction 75

Summary 80

4. Fitting a Model to Data 83

Fundamental concepts: Finding "optimal" model parameters based on data; Choosing

the goal for data mining; Objective functions; Loss functions.

Exemplary techniques: Linear regression; Logistic regression; Support-vector machines.

Classification via Mathematical Functions 85

Linear Discriminant Functions 87

Optimizing an Objective Function 90

An Example of Mining a Linear Discriminant from Data 91

Linear Discriminant Functions for Scoring and Ranking Instances 93

Support Vector Machines, Briefly 94

Regression via Mathematical Functions 97

Class Probability Estimation and Logistic "Regression" 99

* Logistic Regression: Some Technical Details 102

Example: Logistic Regression versus Tree Induction 105

Nonlinear Functions, Support Vector Machines, and Neural Networks 110

Summary 113

5. Overfitting and Its Avoidance 115

Fundamental concepts: Generalization; Fitting and overfitting; Complexity control.

Exemplary techniques: Cross-validation; Attribute selection; Tree pruning;

Regularization.

Generalization 115

Overfitting 117

Overfitting Examined 117

Holdout Data and Fitting Graphs 117

Overfitting in Tree Induction 120

Overfitting in Mathematical Functions 122

Example: Overfitting Linear Functions 123

* Example: Why Is Overfitting Bad? 128

From Holdout Evaluation to Cross-Validation 130

The Churn Dataset Revisited 134

Learning Curves 135

Overfitting Avoidance and Complexity Control 138

Avoiding Overfitting with Tree Induction 138

A General Method for Avoiding Overfitting 139

* Avoiding Overfitting for Parameter Optimization 141

Summary 146

6. Similarity, Neighbors, and Clusters 147

Fundamental concepts: Calculating similarity of objects described by data; Using

similarity for prediction; Clustering as similarity-based segmentation.

Exemplary techniques: Searching for similar entities; Nearest neighbor methods;

Clustering methods; Distance metrics for calculating similarity.

Similarity and Distance 148

Nearest-Neighbor Reasoning 150

Example: Whiskey Analytics 151

Nearest Neighbors for Predictive Modeling 153

How Many Neighbors and How Much Influence? 156

Geometric Interpretation, Overfitting, and Complexity Control 158

Issues with Nearest-Neighbor Methods 161

Some Important Technical Details Relating to Similarities and Neighbors 164

Heterogeneous Attributes 164

* Other Distance Functions 165

* Combining Functions: Calculating Scores from Neighbors 168

Clustering 170

Example: Whiskey Analytics Revisited 171

Hierarchical Clustering 171

Nearest Neighbors Revisited: Clustering Around Centroids 177

Example: Clustering Business News Stories 182

Understanding the Results of Clustering 186

* Using Supervised Learning to Generate Cluster Descriptions 188

Stepping Back: Solving a Business Problem Versus Data Exploration 191

Summary 194

7. Decision Analytic Thinking I: What Is a Good Model? 195

Fundamental concepts: Careful consideration of what is desired from data science

results; Expected value as a key evaluation framework; Consideration of appropriate

comparative baselines.

Exemplary techniques: Various evaluation metrics; Estimating costs and benefits;

Calculating expected profit; Creating baseline methods for comparison.

Evaluating Classifiers 196

Plain Accuracy and Its Problems 197

The Confusion Matrix 197

Problems with Unbalanced Classes 198

Problems with Unequal Costs and Benefits 202

Generalizing Beyond Classification 202

A Key Analytical Framework: Expected Value 203

Using Expected Value to Frame Classifier Use 204

Using Expected Value to Frame Classifier Evaluation 206

Evaluation, Baseline Performance, and Implications for Investments in Data 214

Summary 217

8. Visualizing Model Performance 219

Fundamental concepts: Visualization of model performance under various kinds of

uncertainty; Further consideration of what is desired from data mining results.

Exemplary techniques: Profit curves; Cumulative response curves; Lift curves; ROC

curves.

Ranking Instead of Classifying 219

Profit Curves 222

ROC Graphs and Curves 224

The Area Under the ROC Curve (AUC) 230

Cumulative Response and Lift Curves 230

Example: Performance Analytics for Churn Modeling 234

Summary 242

9. Evidence and Probabilities 245

Fundamental concepts: Explicit evidence combination with Bayes' Rule; Probabilistic

reasoning via assumptions of conditional independence.

Exemplary techniques: Naive Bayes classification; Evidence lift.

Example: Targeting Online Consumers With Advertisements 245

Combining Evidence Probabilistically 247

Joint Probability and Independence 248

Bayes' Rule 249

Applying Bayes' Rule to Data Science 251

Conditional Independence and Naive Bayes 253

Advantages and Disadvantages of Naive Bayes 255

A Model of Evidence "Lift" 257

Example: Evidence Lifts from Facebook "Likes" 258

Evidence in Action: Targeting Consumers with Ads 260

Summary 260

10. Representing and Mining Text 263

Fundamental concepts: The importance of constructing mining-friendly data

representations; Representation of text for data mining.

Exemplary techniques: Bag of words representation; TFIDF calculation; N-grams;

Stemming; Named entity extraction; Topic models.

Why Text Is Important 264

Why Text Is Difficult 264

Representation 265

Bag of Words 266

Term Frequency 266

Measuring Sparseness: Inverse Document Frequency 269

Combining Them: TFIDF 270

Example: Jazz Musicians 271

* The Relationship of IDF to Entropy 275

Beyond Bag of Words 277

N-gram Sequences 277

Named Entity Extraction 278

Topic Models 278

Example: Mining News Stories to Predict Stock Price Movement 280

The Task 280

The Data 282

Data Preprocessing 284

Results 285

Summary 289

11. Decision Analytic Thinking II: Toward Analytical Engineering 291

Fundamental concept: Solving business problems with data science starts with

analytical engineering: designing an analytical solution, based on the data, tools, and

techniques available.

Exemplary technique: Expected value as a framework for data science solution design.

Targeting the Best Prospects for a Charity Mailing 292

The Expected Value Framework: Decomposing the Business Problem and

Recomposing the Solution Pieces 292

A Brief Digression on Selection Bias 295

Our Churn Example Revisited with Even More Sophistication 295

The Expected Value Framework: Structuring a More Complicated Business

Problem 296

Assessing the Influence of the Incentive 297

From an Expected Value Decomposition to a Data Science Solution 299

Summary 302

12. Other Data Science Tasks and Techniques 303

Fundamental concepts: Our fundamental concepts as the basis of many common data

science techniques; The importance of familiarity with the building blocks of data

science.

Exemplary techniques: Association and co-occurrences; Behavior profiling; Link

prediction; Data reduction; Latent information mining; Movie recommendation;

Bias-variance decomposition of error; Ensembles of models; Causal reasoning from data.

Co-occurrences and Associations: Finding Items That Go Together 304

Measuring Surprise: Lift and Leverage 305

Example: Beer and Lottery Tickets 306

Associations Among Facebook Likes 307

Profiling: Finding Typical Behavior 310

Link Prediction and Social Recommendation 315

Data Reduction, Latent Information, and Movie Recommendation 316

Bias, Variance, and Ensemble Methods 320

Data-Driven Causal Explanation and a Viral Marketing Example 323

Summary 324

13. Data Science and Business Strategy 327

Fundamental concepts: Our principles as the basis of success for a data-driven

business; Acquiring and sustaining competitive advantage via data science; The

importance of careful curation of data science capability.

Thinking Data-Analytically, Redux 327

Achieving Competitive Advantage with Data Science 329

Sustaining Competitive Advantage with Data Science 330

Formidable Historical Advantage 331

Unique Intellectual Property 332

Unique Intangible Collateral Assets 332

Superior Data Scientists 332

Superior Data Science Management 334

Attracting and Nurturing Data Scientists and Their Teams 335

Examine Data Science Case Studies 337

Be Ready to Accept Creative Ideas from Any Source 338

Be Ready to Evaluate Proposals for Data Science Projects 339

Example Data Mining Proposal 339

Flaws in the Big Red Proposal 340

A Firm's Data Science Maturity 342

14. Conclusion 345

The Fundamental Concepts of Data Science 345

Applying Our Fundamental Concepts to a New Problem: Mining Mobile

Device Data 348

Changing the Way We Think about Solutions to Business Problems 351

What Data Can't Do: Humans in the Loop, Revisited 352

Privacy, Ethics, and Mining Data About Individuals 355

Is There More to Data Science? 356

Final Example: From Crowd-Sourcing to Cloud-Sourcing 357

Final Words 358

A. Proposal Review Guide 361

B. Another Sample Proposal 365

Glossary 369

Bibliography 373

Index 383

Preface

Data Science for Business is intended for several sorts of readers:

• Business people who will be working with data scientists, managing data science–oriented projects, or investing in data science ventures,

• Developers who will be implementing data science solutions, and

• Aspiring data scientists.

This is not a book about algorithms, nor is it a replacement for a book about

algorithms. We deliberately avoided an algorithm-centered approach. We believe there is

a relatively small set of fundamental concepts or principles that underlie techniques

for extracting useful knowledge from data. These concepts serve as the foundation for

many well-known algorithms of data mining. Moreover, these concepts underlie the

analysis of data-centered business problems, the creation and evaluation of data

science solutions, and the evaluation of general data science strategies and proposals.

Accordingly, we organized the exposition around these general principles rather than

around specific algorithms. Where necessary to describe procedural details, we use a

combination of text and diagrams, which we think are more accessible than a listing

of detailed algorithmic steps.

The book does not presume a sophisticated mathematical background. However, by

its very nature the material is somewhat technical—the goal is to impart a significant

understanding of data science, not just to give a high-level overview. In general, we

have tried to minimize the mathematics and make the exposition as "conceptual" as

possible.

Colleagues in industry comment that the book is invaluable for helping to align the

understanding of the business, technical/development, and data science teams. That

observation is based on a small sample, so we are curious to see how general it truly is

(see Chapter 5!). Ideally, we envision a book that any data scientist would give to his

collaborators from the development or business teams, effectively saying: if you really

want to design/implement top-notch data science solutions to business problems, we

all need to have a common understanding of this material.

Colleagues also tell us that the book has been quite useful in an unforeseen way: for

preparing to interview data science job candidates. The demand from business for

hiring data scientists is strong and increasing. In response, more and more job

seekers are presenting themselves as data scientists. Every data science job candidate

should understand the fundamentals presented in this book. (Our industry colleagues

tell us that they are surprised how many do not. We have half-seriously discussed a

follow-up pamphlet "Cliff's Notes to Interviewing for Data Science Jobs.")

Our Conceptual Approach to Data Science

In this book we introduce a collection of the most important fundamental concepts of

data science. Some of these concepts are "headliners" for chapters, and others are

introduced more naturally through the discussions (and thus they are not necessarily

labeled as fundamental concepts). The concepts span the process from envisioning

the problem, to applying data science techniques, to deploying the results to improve

decision-making. The concepts also undergird a large array of business analytics

methods and techniques.

The concepts fit into three general types:

1. Concepts about how data science fits in the organization and the competitive

landscape, including ways to attract, structure, and nurture data science teams;

ways for thinking about how data science leads to competitive advantage; and

tactical concepts for doing well with data science projects.

2. General ways of thinking data-analytically. These help in identifying appropriate

data and considering appropriate methods. The concepts include the data mining

process as well as the collection of different high-level data mining tasks.

3. General concepts for actually extracting knowledge from data, which undergird

the vast array of data science tasks and their algorithms.

For example, one fundamental concept is that of determining the similarity of two

entities described by data. This ability forms the basis for various specific tasks. It

may be used directly to find customers similar to a given customer. It forms the core

of several prediction algorithms that estimate a target value such as the expected

resource usage of a client or the probability that a customer will respond to an offer. It is

also the basis for clustering techniques, which group entities by their shared features

without a focused objective. Similarity forms the basis of information retrieval, in

which documents or webpages relevant to a search query are retrieved. Finally, it

underlies several common algorithms for recommendation. A traditional algorithm-oriented

book might present each of these tasks in a different chapter, under different

names, with common aspects buried in algorithm details or mathematical propositions.

In this book we instead focus on the unifying concepts, presenting specific

tasks and algorithms as natural manifestations of them.

[1] Of course, each author has the distinct impression that he did the majority of the work on the book.
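As a minimal sketch of the similarity idea described above, the following assumes that customers are described by numeric attribute vectors and that Euclidean distance is an acceptable measure of similarity. The preface does not commit to a particular measure, and the customer names, attributes, and the helper `most_similar` are hypothetical illustrations rather than anything taken from the book.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(target, candidates, k=2):
    """Return the names of the k candidates closest to the target vector."""
    ranked = sorted(
        candidates,
        key=lambda name: euclidean_distance(target, candidates[name]),
    )
    return ranked[:k]


# Hypothetical customers described by (age, years as customer, cards held).
customers = {
    "alice": (35, 3, 2),
    "bob": (62, 20, 1),
    "carol": (37, 4, 2),
    "dave": (23, 1, 4),
}

query = (36, 3, 2)
nearest = most_similar(query, customers, k=2)
print(nearest)
```

The same ranking step could feed a prediction (for instance, by averaging a target value over the nearest customers) or a clustering procedure, which is the sense in which one concept supports several tasks. In practice, attributes measured on very different scales would likely need to be normalized before distances are compared.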

As another example, in evaluating the utility of a pattern, we see a notion of lift

(how much more prevalent a pattern is than would be expected by chance) recurring

broadly across data science. It is used to evaluate very different sorts of patterns in

different contexts. Algorithms for targeting advertisements are evaluated by computing

the lift one gets for the targeted population. Lift is used to judge the weight of

evidence for or against a conclusion. Lift helps determine whether a co-occurrence

(an association) in data is interesting, as opposed to simply being a natural

consequence of popularity.
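To make the notion of lift concrete, here is a small sketch using the common co-occurrence form, in which the observed joint frequency of two items is divided by the frequency expected if they were independent. The function name and all counts are hypothetical and chosen only for illustration; the book's own treatment may define or apply lift somewhat differently in each context.

```python
def lift(n_both, n_a, n_b, n_total):
    """Lift of A and B co-occurring: p(A, B) / (p(A) * p(B)).

    A value near 1 suggests the co-occurrence is about what chance
    (independence) would produce; values well above 1 suggest the
    pattern is more prevalent than popularity alone would explain.
    """
    p_a = n_a / n_total
    p_b = n_b / n_total
    p_both = n_both / n_total
    return p_both / (p_a * p_b)


# Hypothetical counts: 1,000 transactions, 300 containing item A,
# 400 containing item B, and 200 containing both.
example_lift = lift(n_both=200, n_a=300, n_b=400, n_total=1000)
print(round(example_lift, 2))
```

With these made-up counts, independence would predict 120 joint occurrences (0.3 × 0.4 × 1,000), so observing 200 gives a lift above 1.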

We believe that explaining data science around such fundamental concepts not only

aids the reader, it also facilitates communication between business stakeholders and

data scientists. It provides a shared vocabulary and enables both parties to

understand each other better. The shared concepts lead to deeper discussions that may

uncover critical issues otherwise missed.

To the Instructor

This book has been used successfully as a textbook for a very wide variety of data

science and business analytics courses. Historically, the book arose from the

development of Foster's multidisciplinary Data Science and Business Analytics classes at the

Stern School at NYU, starting in the fall of 2005.[1] The original class was nominally for

MBA students and MSIS students, but drew students from schools across the

university. The most interesting aspect of the class was not that it appealed to MBA and

MSIS students, for whom it was designed. More interesting, it also was found to be

very valuable by students with strong backgrounds in machine learning and other

technical disciplines. Part of the reason seemed to be that the focus on fundamental

principles and other issues besides algorithms was missing from their curricula.

At NYU we now use the book in support of a variety of data science–related

programs: the original MBA and MSIS programs, undergraduate business analytics,

NYU/Stern's MS in Business Analytics program, executive education, and as the

Introduction to Data Science for NYU's MS in Data Science. In addition, the book has

been adopted by well over 100 other universities for programs in at least 22 countries

(and counting), in business schools, in data science programs, in computer science

programs, and for more general introductions to data science.

The book's website gives pointers on how to obtain helpful instructional material,

including lecture slides, sample homework questions and problems, example project

instructions based on the frameworks from the book, exam questions, and more.

We keep an up-to-date list of known adopters on the book's

website. Click Who's Using It at the top.

Other Skills and Concepts

There are many other concepts and skills that a practical data scientist needs to know

besides the fundamental principles of data science. These skills and concepts will be

discussed in Chapter 1 and Chapter 2. The interested reader is encouraged to visit the

book's website for pointers to material for learning these additional skills and

concepts (for example, scripting in Python, Unix command-line processing, datafiles,

common data formats, databases and querying, big data architectures and systems

like MapReduce and Hadoop, data visualization, and other related topics).

Sections and Notation

In addition to occasional footnotes, the book contains boxed "sidebars." These are

essentially extended footnotes. We reserve these for material that we consider

interesting and worthwhile, but too long for a footnote and too much of a digression for

the main text.

Technical Details Ahead — A note on the starred sections

The occasional mathematical details are relegated to optional

"starred" sections. These section titles will have asterisk prefixes, and

they will be preceded by a paragraph rendered like this one. Such

"starred" sections contain more detailed mathematics and/or more

technical details than elsewhere, and this introductory paragraph

explains its purpose. The book is written so that these sections may

be skipped without loss of continuity, although in a few places we

remind readers that details appear there.

Constructions in the text like (Smith and Jones, 2003) indicate a reference to an entry

in the bibliography (in this case, the 2003 article or book by Smith and Jones); "Smith

and Jones (2003)" is a similar reference. A single bibliography for the entire book

appears in the endmatter.

In this book we try to keep math to a minimum, and what math there is we have

simplified as much as possible without introducing confusion. For our readers with

technical backgrounds, a few comments may be in order regarding our simplifying

choices.

1. We avoid Sigma (Σ) and Pi (Π) notation, commonly used in textbooks to indicate

sums and products, respectively. Instead we simply use equations with ellipses

like this:

\[ f(\mathbf{x}) = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n \]

In the technical, "starred" sections we sometimes adopt Sigma and Pi notation

when this ellipsis approach is just too cumbersome. We assume people reading

these sections are somewhat more comfortable with math notation and will not

be confused.

2. Statistics books are usually careful to distinguish between a value and its estimate

by putting a "hat" on variables that are estimates, so in such books you'll typically

see a true probability denoted $p$ and its estimate denoted $\hat{p}$. In this book we are

almost always talking about estimates from data, and putting hats on everything

makes equations verbose and ugly. Everything should be assumed to be an

estimate from data unless we say otherwise.

3. We simplify notation and remove extraneous variables where we believe they are

clear from context. For example, when we discuss classifiers mathematically, we

are technically dealing with decision predicates over feature vectors. Expressing

this formally would lead to equations like:

\[ f_R(\mathbf{x}) = x_{\mathrm{Age}} + 0.7 \times x_{\mathrm{Balance}} + 60 \]

Instead we opt for the more readable:

\[ f(\mathbf{x}) = \mathrm{Age} + 0.7 \times \mathrm{Balance} + 60 \]

with the understanding that x is a vector and Age and Balance are components of

it.
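For readers who find code easier to follow than notation, the sketch below shows one way the ellipsis-style weighted sum and the readable Age/Balance function from the list above could be evaluated. The helper `linear_score` and the input values are hypothetical; only the weights and the constant 60 come from the example equation.

```python
def linear_score(weights, features, intercept=0.0):
    """Compute w1*x1 + w2*x2 + ... + wn*xn, plus an optional constant."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * x for w, x in zip(weights, features)) + intercept


# f(x) = Age + 0.7 * Balance + 60, where x is the vector (Age, Balance).
weights = (1.0, 0.7)

# Hypothetical instance: Age = 40, Balance = 100.
x = (40, 100)
score = linear_score(weights, x, intercept=60)
print(score)
```

The loop inside `sum` is exactly what Sigma notation abbreviates: one product per component of the vector, added together.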

We have tried to be consistent with typography, reserving fixed-width typewriter

fonts like sepal_width to indicate attributes or keywords in data. For example, in the

text-mining chapter, a word like 'discussing' designates a word in a document while

discuss might be the resulting token in the data.

The following typographical conventions are used in this book:


Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width

Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width italic

Shows text that should be replaced with user-supplied values or by values determined by context.

Throughout the book we have placed special inline tips and warnings relevant to the

material. They will be rendered differently depending on whether you're reading

paper, PDF, or an ebook, as follows:

A sentence or paragraph typeset like this signifies a tip or a suggestion.

This text and element signifies a general note.

Text rendered like this signifies a warning or caution. These are

more important than tips and are used sparingly.

Using Examples

In addition to being an introduction to data science, this book is intended to be useful in discussions of and day-to-day work in the field. Answering a question by citing this book and quoting examples does not require permission. We appreciate, but do not require, attribution. Formal attribution usually includes the title, author, publisher, and ISBN. For example: "Data Science for Business by Foster Provost and Tom Fawcett (O'Reilly). Copyright 2013 Foster Provost and Tom Fawcett, 978-1-449-36132-7."

If you feel your use of examples falls outside fair use or the permission given above,

feel free to contact us at permissions@oreilly.com.


Safari® Books Online

Safari Books Online is an on-demand digital library that delivers expert content in both book and video form from the world's leading authors in technology and business.

Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.

Safari Books Online offers a range of product mixes and pricing programs for organizations, government agencies, and individuals. Subscribers have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O'Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and dozens more. For more information about Safari Books Online, please visit us online.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O'Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have two web pages for this book, where we list errata, examples, and any additional information. You can access the publisher's page at http://oreil.ly/data-science and the authors' page at http://www.data-science-for-biz.com.

To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.

For more information about O'Reilly Media's books, courses, conferences, and news,

see their website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Follow us on Twitter: http://twitter.com/oreillymedia

Watch us on YouTube: http://www.youtube.com/oreillymedia


Acknowledgments

Thanks to all the many colleagues and others who have provided invaluable ideas, feedback, criticism, suggestions, and encouragement based on discussions and many prior draft manuscripts. At the risk of missing someone, let us thank in particular: Panos Adamopoulos, Manuel Arriaga, Josh Attenberg, Solon Barocas, Ron Bekkerman, Enrico Bertini, Josh Blumenstock, Ohad Brazilay, Aaron Brick, Jessica Clark, Nitesh Chawla, Brian d'Alessandro, Peter Devito, Vasant Dhar, Jan Ehmke, Theos Evgeniou, Justin Gapper, Tomer Geva, Daniel Gillick, Shawndra Hill, Nidhi Kathuria, Ronny Kohavi, Marios Kokkodis, Tom Lee, Philipp Marek, David Martens, Sophie Mohin, Lauren Moores, Alan Murray, Nick Nishimura, Balaji Padmanabhan, Jason Pan, Claudia Perlich, Gregory Piatetsky-Shapiro, Tom Phillips, Kevin Reilly, Maytal Saar-Tsechansky, Evan Sadler, Galit Shmueli, Roger Stein, Nick Street, Kiril Tsemekhman, Akhmed Umyarov, Craig Vaughan, Chris Volinsky, Wally Wang, Geoff Webb, Debbie Yuster, and Rong Zheng. We would also like to thank more generally the students from Foster's classes, Data Mining for Business Analytics, Practical Data Science, Data Analytics, Introduction to Data Science, and the Data Science Research Seminar. Questions and issues that arose when using prior drafts of this book provided substantive feedback for improving it.

Thanks to all the colleagues who have taught us about data science and about how to teach data science over the years. Thanks especially to Maytal Saar-Tsechansky, Claudia Perlich, Shawndra Hill, and Vasant Dhar. Maytal graciously shared with Foster her notes for her data mining class many years ago. The classification tree example in Chapter 3 (thanks especially for the "bodies" visualization) is based mostly on her idea and example; her ideas and example were the genesis for the visualization comparing the partitioning of the instance space with trees and linear discriminant functions in Chapter 4, the "Will David Respond" example in Chapter 6 is based on her example, and probably other things long forgotten. Claudia has taught companion sections of Data Mining for Business Analytics/Introduction to Data Science along with Foster for the past few years, and has taught him much about data science in the process (and beyond). Shawndra helped Foster with putting together his new kind of data mining class over a decade ago. And way back in the 1990s Vasant taught the first data mining course for a business audience, and invited Foster (then an industry data scientist) to guest lecture about real-world data mining applications.

Thanks to David Stillwell, Thore Graepel, and Michal Kosinski for providing the Facebook Like data for some of the examples. Thanks to Nick Street for providing the cell nuclei data and for letting us use the cell nuclei image in Chapter 4. Thanks to David Martens for his help with the mobile locations visualization. Thanks to Chris Volinsky for providing data from his work on the Netflix Challenge. Thanks to Sonny Tambe for early access to his results on big data technologies and productivity. Thanks to Patrick Perry for pointing us to the bank call center example used in Chapter 12. Thanks to Geoff Webb for the use of the Magnum Opus association mining system.

Thanks especially to our editor Mike Loukides, who shared our vision for a different

sort of book, and the entire O'Reilly team for helping us to make it a reality.

Most of all we thank our families for their love, patience and encouragement.

A great deal of open source software was used in the preparation of this book and its

examples. The authors wish to thank the developers and contributors of:

  • Python and Perl
  • Scipy, Numpy, Matplotlib, and Scikit-Learn
  • Weka
  • The Machine Learning Repository at the University of California at Irvine (Bache & Lichman, 2013)

Finally, we encourage readers to check our website for updates to this material, new

chapters, errata, addenda, and accompanying slide sets.

—Foster Provost and Tom Fawcett


