Data Architecture: A Primer for the Data Scientist - W.H. Inmon, Daniel Linstedt

Data Architecture: A Primer for the Data Scientist (eBook)

Big Data, Data Warehouse and Data Vault

W.H. Inmon, Daniel Linstedt (Authors)

eBook Download: PDF | EPUB

2014 | 1st edition
378 pages
Elsevier Science (Publisher)
978-0-12-802091-3 (ISBN)

Today, the world is trying to create and educate data scientists because of the phenomenon of Big Data, and everyone is looking deeply into this technology. But no one is looking at the larger architectural picture of how Big Data needs to fit within the existing systems (data warehousing systems). Taking a look at the larger picture into which Big Data fits gives the data scientist the necessary context for how the pieces of the puzzle should fit together. Most references on Big Data look at only one tiny part of a much larger whole. Until the data gathered can be put into an existing framework or architecture, it can't be used to its full potential.

Data Architecture: A Primer for the Data Scientist addresses the larger architectural picture of how Big Data fits with the existing information infrastructure, an essential topic for the data scientist. Drawing upon years of practical experience and using numerous examples and an easy-to-understand framework, W.H. Inmon and Daniel Linstedt define the importance of data architecture and how it can be used effectively to harness Big Data within existing systems.

You'll be able to:
- Turn textual information into a form that can be analyzed by standard tools
- Make the connection between analytics and Big Data
- Understand how Big Data fits within an existing systems environment
- Conduct analytics on repetitive and non-repetitive data

- Discusses the value in Big Data that is often overlooked, non-repetitive data, and why there is significant business value in using it
- Shows how to turn textual information into a form that can be analyzed by standard tools
- Explains how Big Data fits within an existing systems environment
- Presents new opportunities that are afforded by the advent of Big Data
- Demystifies the murky waters of repetitive and non-repetitive data in Big Data

Best known as the 'Father of Data Warehousing,' Bill Inmon has become the most prolific and well-known author worldwide in the big data analysis, data warehousing and business intelligence arena. In addition to authoring more than 50 books and 650 articles, Bill has been a monthly columnist with the Business Intelligence Network, EIM Institute and Data Management Review. In 2007, Bill was named by Computerworld as one of the 'Ten IT People Who Mattered in the Last 40 Years' of the computer profession. With 35 years of experience in database technology and data warehouse design, he is known globally for his seminars on developing data warehouses and information architectures. Bill has been an in-demand keynote speaker for numerous computing associations, industry conferences and trade shows. Bill Inmon also has an extensive entrepreneurial background: he founded Pine Cone Systems, later named Ambeo, in 1995, and founded and took public Prism Solutions in 1991. Bill consults with a large number of Fortune 1000 clients and leading IT executives on Data Warehousing, Business Intelligence, and Database Management, offering data warehouse design and database management services, as well as producing methodologies and technologies that advance the enterprise architectures of large and small organizations worldwide. He has worked for American Management Systems and Coopers & Lybrand. Bill received his Bachelor of Science degree in Mathematics from Yale University, and his Master of Science degree in Computer Science from New Mexico State University.


1.1 Corporate Data

Abstract

Corporate data includes everything found in the corporation in the way of data. The most basic division of corporate data is into structured data and unstructured data. As a rule there is much more unstructured data than structured data. Unstructured data has two basic divisions – repetitive data and nonrepetitive data. Big Data is made up of unstructured data. Nonrepetitive Big Data has a fundamentally different form from repetitive Big Data. In fact the differences between nonrepetitive Big Data and repetitive Big Data are so large that they can be called the boundaries of the “great divide.” Many professionals are not even aware that this divide exists. As a rule nonrepetitive Big Data has much greater business value than repetitive Big Data.

Keywords

Big Data; business value; corporate data; great divide of data; nonrepetitive data; repetitive data; structured data; unstructured data

In today’s world it is easy to get lost when dealing with data. There are many different types of data, and each type has its own peculiarities and idiosyncrasies. Products, vendors, and applications become so focused on their own specific world that the larger picture of how things fit together often gets lost. It is often useful to step back and look at the larger picture to gain a proper perspective.

The Totality of Data Across the Corporation

Consider the totality of data found in the corporation. A simplistic depiction of the totality of data found in the corporation is seen in Figure 1.1.1.

Figure 1.1.1

The totality of data represented here includes everything to do with data of any kind found in the corporation.

There are many ways to subdivide the totality of data in the corporation. One such way (but hardly the only way) is to divide it into structured data and unstructured data, as seen in Figure 1.1.2.

Figure 1.1.2

Structured data is data that has a predictable and regularly occurring format. Typically structured data is managed by a database management system (DBMS) and consists of records, attributes, keys, and indexes. Structured data is well defined, predictable, and managed by an elaborate infrastructure. As a rule most units of data in the structured environment can be located very quickly and easily.

Unstructured data, conversely, is data that is unpredictable and has no structure that is recognizable to a computer. As a rule, unstructured data is rather clumsy to access, where long strings of data have to be sequentially searched (parsed) in order to find a given unit of data. There are many forms and variations of unstructured data. Perhaps the most commonly occurring form of unstructured data is text. However, by no stretch of the imagination is text the only form of unstructured data.
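The contrast in access paths can be illustrated with a minimal sketch. The account records, the `index_by_id` dictionary standing in for a DBMS key index, and the `notes` string are all hypothetical and are not taken from the book; they only show a keyed lookup next to a sequential search.

```python
# Structured side: records with attributes, reachable through a key.
accounts = [
    {"account_id": 1001, "owner": "Ann", "balance": 250.0},
    {"account_id": 1002, "owner": "Raj", "balance": 75.5},
]
# Hypothetical stand-in for a DBMS key index: key value -> record.
index_by_id = {rec["account_id"]: rec for rec in accounts}


def lookup_structured(account_id):
    """Locate one record directly through the key index (None if absent)."""
    return index_by_id.get(account_id)


# Unstructured side: one long string with no structure a computer recognizes.
notes = "Ann called about a late fee. Raj asked to close account 1002 next month."


def scan_unstructured(text, term):
    """Search the text word by word; return the word position of the first match, or -1."""
    for position, word in enumerate(text.split()):
        if word.strip(".,").lower() == term.lower():
            return position
    return -1
```

The keyed lookup goes straight to the record, while the scan has to walk the text from the beginning until it finds the term.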

Dividing Unstructured Data

Unstructured data can further be divided into two basic forms of data – repetitive unstructured data and nonrepetitive unstructured data. As is the case with the division of corporate data, there are many ways to subdivide unstructured data. The method shown here is but one of many ways to subdivide unstructured data. This simple subdivision of unstructured data is shown in Figure 1.1.3.

Figure 1.1.3

Repetitive unstructured data is data that occurs many times, often in the same structure and even in the exact same embodiment. Each record of repetitive data looks exactly the same or substantially the same as the previous record. There is no massive and elaborate infrastructure managing the content of repetitive unstructured data.

Nonrepetitive unstructured data is data where the records are substantially different from each other. In general each nonrepetitive record is markedly different from every other record.

The division of data types in the corporation has many different embodiments. Consider the data as shown in Figure 1.1.4.

Figure 1.1.4

Structured data is typically found as a by-product of transactions. Every time a sale is made, every time a withdrawal is made from a bank account, every time someone uses an ATM, and every time a bill is sent, a record of the transaction is made. The record of the transaction ends up as a structured record.

Unstructured repetitive data is quite different. Unstructured repetitive records are typically records of machine interactions, such as the analog verification of product coming off a manufacturing process or the metering of energy usage by a consumer. Consider metering. Metered readings produce records that repeat heavily in both form and substance.

Unstructured nonrepetitive information is fundamentally different from unstructured repetitive records. With unstructured nonrepetitive records there is little or no repetition of either form or content from one record to the next. Some examples of unstructured nonrepetitive information include email, call center conversations, and market research. When you look at one email, the odds are very good that the next email in the database will be different from the previous one. The same is true for call center information, warranty claims, market research, and so forth.
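One rough way to make "repetition of form" concrete is to mask the values in each record and count how many records share the same resulting shape. The sketch below is an illustrative heuristic, not a method from the book; the `shape` and `repetition_ratio` functions and the sample meter readings and emails are assumptions made for the example.

```python
import re
from collections import Counter


def shape(record: str) -> str:
    """Reduce a record to its form by masking every run of digits with '#'."""
    return re.sub(r"\d+", "#", record)


def repetition_ratio(records):
    """Share of records that have the single most common shape (0.0 to 1.0)."""
    if not records:
        return 0.0
    counts = Counter(shape(r) for r in records)
    return counts.most_common(1)[0][1] / len(records)


# Hypothetical samples: metered readings versus emails.
meter_readings = ["meter=17 kwh=3.2", "meter=17 kwh=3.4", "meter=18 kwh=2.9"]
emails = [
    "Please resend invoice 4471.",
    "Lunch on Friday?",
    "The warranty claim was denied.",
]
```

Under this heuristic every meter reading reduces to the same shape, while each email keeps a shape of its own, which mirrors the repetitive and nonrepetitive distinction drawn above.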

Business Relevancy

Unstructured repetitive data and unstructured nonrepetitive data have very different characteristics. One of the ways in which the two types of data differ is business relevancy. In unstructured repetitive data, there often are very few records that are of real business interest. With unstructured nonrepetitive data, however, a very large percentage of the data is business relevant.

This difference between the two types of data is shown in Figure 1.1.5.

Figure 1.1.5

As an example of a small percentage of repetitive unstructured data being business relevant, consider the millions of phone calls that are made each day. The government is interested in only a very few phone calls out of the millions that have been made. Or consider manufacturing control information. Nearly all manufacturing records are of no interest. Only a very few records – usually where the parameters being measured exceed a threshold – are of interest. Oftentimes unstructured repetitive data also includes records that are not directly or immediately of interest but are potentially of interest.
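The threshold case can be sketched in a few lines. The temperature field, the limit of 90.0, and the four sample readings are hypothetical values chosen only to illustrate that a small share of repetitive records passes the filter.

```python
THRESHOLD_C = 90.0  # hypothetical limit; a real process defines its own

# Hypothetical manufacturing control records.
readings = [
    {"unit": "A1", "temp_c": 71.2},
    {"unit": "A2", "temp_c": 93.8},
    {"unit": "A3", "temp_c": 70.9},
    {"unit": "A4", "temp_c": 69.5},
]


def records_of_interest(records, threshold=THRESHOLD_C):
    """Keep only the records whose measured parameter exceeds the threshold."""
    return [r for r in records if r["temp_c"] > threshold]


relevant = records_of_interest(readings)
share = len(relevant) / len(readings)  # fraction of records that are business relevant
```

In this sample only unit A2 exceeds the limit, so one record in four is of interest; in real metering or manufacturing data the share would typically be far smaller.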

There are not too many records that are not of interest when it comes to unstructured nonrepetitive data. There is spam and there are stop words. But other than those two categories of information, nearly all unstructured nonrepetitive data is of interest.
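Stop-word removal, one of the two exclusions mentioned above, can be sketched as follows. The `STOP_WORDS` set is a tiny illustrative list, not a standard one, and the function does not attempt spam detection.

```python
# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "was"}


def remove_stop_words(text):
    """Lowercase the text, strip basic punctuation, and drop stop words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return [w for w in words if w and w not in STOP_WORDS]
```

For example, `remove_stop_words("The warranty claim was denied.")` keeps `warranty`, `claim`, and `denied`, which are the words that carry the business meaning.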

Big Data

Big Data consists of the unstructured repetitive and the unstructured nonrepetitive data in the corporation, as seen in Figure 1.1.6.

Figure 1.1.6
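The classification built up so far can be written down as a small type. This is only a sketch of the taxonomy as the text describes it; the names `DataKind` and `is_big_data` are assumptions introduced for illustration.

```python
from enum import Enum


class DataKind(Enum):
    """The three kinds of corporate data distinguished in the text."""
    STRUCTURED = "structured"
    UNSTRUCTURED_REPETITIVE = "unstructured repetitive"
    UNSTRUCTURED_NONREPETITIVE = "unstructured nonrepetitive"


def is_big_data(kind: DataKind) -> bool:
    """Per the text, Big Data is made up of both kinds of unstructured data."""
    return kind in (
        DataKind.UNSTRUCTURED_REPETITIVE,
        DataKind.UNSTRUCTURED_NONREPETITIVE,
    )
```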

The Great Divide

At first it may seem that the differences between the two types of unstructured data – unstructured repetitive and unstructured nonrepetitive data – are almost whimsical or trivial. In fact the differences between the two types of unstructured data are anything but trivial. Because of the profound differences between the two types of data, there is a great divide that separates the two types of unstructured data.

Figure 1.1.7 shows the great divide that separates the two types of unstructured data.

Figure 1.1.7

The great divide that separates the two types of unstructured data occurs because data on one side of the divide is handled one way and data on the other side of the divide is handled in an entirely different manner. For all practical purposes the data found on the different sides of the great divide might as well exist on different planets.

On one side of the divide, work with unstructured repetitive data is almost entirely preoccupied with managing Hadoop. For unstructured repetitive data the emphasis is entirely on accessing, monitoring, displaying, analyzing, and visualizing data residing on a Big Data manager such as Hadoop.

The emphasis on unstructured nonrepetitive data is almost entirely centered on textual disambiguation. The emphasis here is on the types of disambiguation, the reformatting of the output, the contextualization of the data, the standardization of the data, and so forth.

The remarkable thing about the great divide is that the disciplines surrounding the data are so completely different. Textual disambiguation is a very different subject from the access and analysis of data stored on Hadoop. It is because of the extreme differences between these two worlds that the two are said to live on different planets.

To use an analogy to illustrate just...

Publication date (per publisher)	26.11.2014
Language	English
Subject areas	Computer Science ► Databases ► Data Warehouse / Data Mining
	Computer Science ► Office Applications ► Outlook
	Mathematics / Computer Science ► Computer Science ► Software Development
	Social Sciences ► Communication / Media ► Book Trade / Library Science
ISBN-10	0-12-802091-1 / 0128020911
ISBN-13	978-0-12-802091-3 / 9780128020913
