Module 2 - Big Data Analysis & Technology Concepts

Description

Module 2
Quiz by Juan Taborda, updated more than 1 year ago
Created by Juan Taborda almost 8 years ago

Resource summary

Question 1

Question
Big Data Analysis
Answers
  • differs from traditional data analysis primarily because of the volume, velocity and variety characteristics of the data it processes
  • When two variables are considered to be _____________ they are considered to be aligned based on a linear relationship
  • This means that when one variable changes, the other variable also changes proportionally and constantly
  • this helps maintain data provenance throughout the big data analysis lifecycle, which helps establish and preserve data accuracy and quality

Question 2

Question
step-by-step process
Answers
  • is needed to organize the tasks involved in retrieving, processing, producing and repurposing data
  • therefore, it is advisable to store a verbatim copy of the original dataset before proceeding with the filtering. To save on required storage space, the verbatim copy is compressed before storage
  • Although the data format may be the same, the data model may be different
  • depending on the business scope of the analysis project and nature of the business problems being addressed, the required datasets and their sources can be internal and/or external to the enterprise

Question 3

Question
Big Data Analysis Lifecycle
Answers
  • Business Case Evaluation
  • Data Identification
  • A/B Testing
  • An ________ is employed when the comparatively simple data manipulation functions of a query engine are insufficient

Question 4

Question
Big Data Analysis Lifecycle
Answers
  • Data Acquisition & Filtering
  • Data Extraction
  • suggestions commonly pertain to recommending items, such as movies, books, web pages, people, etc.
  • the processing engine mechanism will often use the ___________ to coordinate data processing across a large number of servers. This way, the processing engine does not require its own coordination logic

Question 5

Question
Big Data Analysis Lifecycle
Answers
  • Data Validation & Cleansing
  • Data Aggregation & Representation
  • The ______ itself is a visual, color-coded representation of data values
  • the processing engine mechanism will often use the ___________ to coordinate data processing across a large number of servers. This way, the processing engine does not require its own coordination logic

Question 6

Question
Big Data Analysis Lifecycle
Answers
  • Data Analysis
  • Data Visualization
  • A _______ is generally expressed using a line chart, with time plotted on the x-axis and recorded data value plotted on the y-axis
  • Utilization of Analysis Results

Question 7

Question
Business Case Evaluation
Answers
  • requires that a business case be created, assessed and approved prior to proceeding with the actual hands-on analysis task
  • helps decision-makers understand the business resources that will need to be utilized and which business challenges the analysis will tackle
  • Unstructured text is generally much more difficult to analyze and search, compared to structured text
  • is an example of the application of the law of large numbers

Question 8

Question
Business Case Evaluation
Answers
  • the further identification of KPIs during this stage helps determine how closely the data analysis outcome needs to meet the identified goals and objectives
  • based on the business requirements documented, it can be determined whether the business problems being addressed are really Big Data problems
  • Applications for ___________ include fraud detection, medical diagnosis, network data analysis and sensor data analysis
  • can be applied to the categorization of unknown documents, and to personalized marketing campaigns by grouping together customers with similar behavior

Question 9

Question
Business Case Evaluation
Answers
  • in order to qualify as a Big Data Problem, a business problem needs to be directly related to one or more of the Big Data characteristics of volume, velocity or variety
  • Note also that another outcome of this stage is the determination of the underlying budget required to carry out the analysis project
  • The processing engine enables data to be queried and manipulated in other ways, but to implement this type of functionality requires custom programming
  • Workflow Engine

Question 10

Question
Business Case Evaluation
Answers
  • any required purchase of tools, hardware, training, etc. needs to be understood in advance, so that the anticipated investment can be weighed against the expected benefits of achieving the goals
  • the initial iteration of the big data analysis lifecycle will require more up-front investment in Big Data technologies, products and training compared to later iterations, where these earlier investments can be repeatedly leveraged
  • In the context of traditional data analysis, the ______ states that, starting with a reasonably large sample size, the value obtained from the analysis of additional data decreases as more data is successively added to the original sample
  • can be applied to the categorization of unknown documents, and to personalized marketing campaigns by grouping together customers with similar behavior

Question 11

Question
Data Identification
Answers
  • is dedicated to identifying datasets (and their sources) required for the analysis project
  • identifying a wider variety of data sources may increase the probability of finding hidden patterns and correlations
  • Classification, outlier detection, filtering, natural language processing, text analytics and sentiment analysis can utilize ___________
  • Clustering

Question 12

Question
Data Identification
Answers
  • it can be beneficial to identify as many types of related data sources and insights as possible, especially when we don't know exactly what we're looking for
  • depending on the business scope of the analysis project and nature of the business problems being addressed, the required datasets and their sources can be internal and/or external to the enterprise
  • examples of appended metadata can include dataset size and structure, source information, date and time of creation or collection, language-specific information, etc.
  • is generally used in data mining to get an understanding of the properties of a given dataset. After developing this understanding, classification can be used to make better predictions about similar, but new or unseen data

Question 13

Question
Internal Datasets
Answers
  • a list of available datasets from sources, such as data marts and operational systems, is typically compiled and matched against a predefined dataset specification
  • A distributed Big Data solution that needs to run on multiple servers relies on the coordination engine mechanism to ensure operational consistency across all of the participating servers
  • Subsequent to __________ being made available to business users to support business decision-making (such as via dashboards), there may be further opportunities to utilize the __________
  • In other areas such as the scientific domains, the objective may simply be to observe which version works better in order to improve a process or product

Question 14

Question
External Dataset
Answers
  • a list of possible third-party data providers (data markets and publicly available datasets) is generally compiled. Some forms of external data may be embedded within blogs or other types of content-based websites, in which case they may need to be harvested via automated tools
  • can be applied to the categorization of unknown documents, and to personalized marketing campaigns by grouping together customers with similar behavior
  • Reconciling these differences can require complex logic that is executed automatically without the need for human intervention
  • A big data _________ utilizes a distributed parallel programming framework that enables it to process very large amounts of data distributed across multiple nodes

Question 15

Question
Data acquisition & filtering
Answers
  • the data is gathered from all of the data sources that were identified during the previous stage and is then subjected to the automated filtering of corrupt data or data that has been deemed to have no value to the analysis objectives
  • depending on the type of data source, data may come as a dump of files (such as data purchased from a third-party data provider), or may require API integration (such as with Twitter)
  • A ________ engine enables data to be moved in or out of big data solution storage devices
  • A _________ comprises grouped read/writes, with a larger data footprint consisting of complex joins and high-latency responses

Question 16

Question
Data acquisition & filtering
Answers
  • In many cases, especially where external, unstructured data is concerned, some or most of the acquired data may be irrelevant (noise) and can be discarded as part of the filtering process
  • data classified as "corrupt" can include records with missing or nonsensical values or invalid data types
  • it involves plotting entities as nodes and connections as edges between nodes
  • OLTP and operational systems (write-intensive) as well as operational BI and analytics (read-intensive), both fall within this category

Question 17

Question
Data acquisition & filtering
Answers
  • data that is filtered out for one analysis may possibly be valuable for a different type of analysis
  • therefore, it is advisable to store a verbatim copy of the original dataset before proceeding with the filtering. To save on required storage space, the verbatim copy is compressed before storage
  • is an example of the application of the law of large numbers
  • Coordination Engine
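
The advice above about keeping a compressed verbatim copy before filtering can be sketched roughly as follows. This is only an illustration: the record layout (`user`, `amount`), the validity rule and the function name `archive_then_filter` are assumptions made for this example, not part of the quiz material.

```python
import gzip
import json
import os
import tempfile


def archive_then_filter(records, archive_path):
    """Store a compressed verbatim copy of the raw records, then filter them."""
    # Keep the original, unfiltered data: records dropped here may still be
    # valuable for a different type of analysis later.
    with gzip.open(archive_path, "wt", encoding="utf-8") as fh:
        json.dump(records, fh)

    # Assumed rule for this sketch: a record is "corrupt" when a value is
    # missing or has an invalid data type.
    def is_valid(rec):
        return isinstance(rec.get("user"), str) and isinstance(
            rec.get("amount"), (int, float)
        )

    return [rec for rec in records if is_valid(rec)]


# Hypothetical raw data containing two corrupt records.
raw = [
    {"user": "ana", "amount": 12.5},
    {"user": None, "amount": 3},
    {"user": "bo", "amount": "n/a"},
]
archive_path = os.path.join(tempfile.mkdtemp(), "raw.json.gz")
clean = archive_then_filter(raw, archive_path)
```

Writing the archive before filtering means the filtering rule can be changed later without having to acquire the data again.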

Question 18

Question
Data acquisition & filtering
Answers
  • both internal and external data needs to be persisted once it gets generated or enters the enterprise boundary
  • for batch analytics, this data is persisted to disk prior to analysis
  • extracting text for text analytics, which requires scans of whole documents, will not be necessary if the underlying Big Data solution can already read the document in its native format directly
  • is dedicated to determining how and where processed analysis data can be further leveraged

Question 19

Question
Data acquisition & filtering
Answers
  • in the case of realtime analytics, the data is analyzed first and then persisted to disk
  • metadata can be added via automation to data from both internal and external data sources to improve the classification and querying
  • Also known as offline processing, ________ processing involves processing data in batches and usually imposes delays (resulting in high-latency responses)
  • is generally applied via the following two approaches: collaborative ____________ and content-based ____________

Question 20

Question
Data acquisition & filtering
Answers
  • examples of appended metadata can include dataset size and structure, source information, date and time of creation or collection, language-specific information, etc.
  • it is vital that metadata be machine-readable and passed forward along subsequent analysis stages
  • The ability to analyze massive amounts of data and find useful insights carries little value if the only ones that can interpret the results are the analysts
  • both versions are subjected to an experiment simultaneously, and the observations are recorded to determine which version is more successful

Question 21

Question
Data acquisition & filtering
Answers
  • this helps maintain data provenance throughout the big data analysis lifecycle, which helps establish and preserve data accuracy and quality
  • metadata is added through an automated mechanism to data received from both internal and external data sources
  • any required purchase of tools, hardware, training, etc. needs to be understood in advance, so that the anticipated investment can be weighed against the expected benefits of achieving the goals
  • helps decision-makers understand the business resources that will need to be utilized and which business challenges the analysis will tackle
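
As a rough illustration of the automated, machine-readable metadata described in the questions above, the sketch below wraps a dataset with a small metadata record. The field names (`source`, `record_count`, `fields`, `collected_at`) and the function `append_metadata` are assumptions chosen for this example.

```python
import json
from datetime import datetime, timezone


def append_metadata(records, source):
    """Wrap records with machine-readable metadata describing the dataset."""
    return {
        "metadata": {
            "source": source,                       # where the data came from
            "record_count": len(records),           # dataset size
            "fields": sorted({key for rec in records for key in rec}),  # structure
            "collected_at": datetime.now(timezone.utc).isoformat(),     # collection time
        },
        "data": records,
    }


# Hypothetical dataset and source name.
wrapped = append_metadata([{"id": 1, "text": "hello"}], source="example-feed")
serialized = json.dumps(wrapped)  # JSON keeps the metadata machine-readable
```

Serializing the wrapper as JSON is one simple way to pass the metadata forward to later analysis stages together with the data.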

Question 22

Question
Data Extraction
Answers
  • Some of the data identified as input for the analysis may arrive in a format incompatible with the big data solution
  • the need to address disparate types of data is more likely with data from external sources
  • make it possible to develop highly reliable, highly available distributed big data solutions that can be deployed in a cluster
  • Data needs to be imported before it can be processed by the big data solution

Question 23

Question
Data Extraction
Answers
  • is dedicated to extracting disparate data and transforming it into a format that the underlying big data solution can use for the purpose of the data analysis
  • the extent of extraction and transformation required depends on the types of analytics and capabilities of the big data solution
  • provenance can play an important role in determining the accuracy and quality of questionable data
  • is closely related to parallel data processing in how the same principle of "divide-and-conquer" is applied

Question 24

Question
Data Extraction
Answers
  • extracting text for text analytics, which requires scans of whole documents, will not be necessary if the underlying Big Data solution can already read the document in its native format directly
  • further transformation is needed in order to separate the data into two separate fields as required by the big data solution
  • it can also be used to make predictions about the values of the dependent variable while it is still unknown
  • However, ___________ is always achieved through physically separate machines that are networked together as a cluster

Question 25

Question
Data Validation & Cleansing
Answers
  • Invalid data can skew and falsify analysis results
  • Unlike traditional enterprise data where the data structure is pre-defined and data is pre-validated, data input into big data analyses can be unstructured without any indication of validity
  • More than one independent variable can be tested at the same time
  • The _______ essentially acts as a resource arbitrator that manages and allocates available resources

Question 26

Question
Data Validation & Cleansing
Answers
  • its complexity can further make it difficult to arrive at a set of suitable validation constraints
  • is dedicated to establishing (often complex) validation rules and removing any known invalid data
  • is closely related to the concepts of classification and clustering, although its algorithms focus on finding abnormal values
  • for batch analytics, ______________ can be achieved via an offline ETL operation

Question 27

Question
Data Validation & Cleansing
Answers
  • Big data solutions often receive redundant data across different datasets
  • this redundancy can be exploited to explore interconnected datasets in order to assemble validation parameters and fill in missing valid data
  • A ___ represents a geographic measure by which different regions are color-coded according to a certain theme
  • A _________ is a file system that can store large files spread across a cluster

Question 28

Question
Data Validation & Cleansing
Answers
  • for batch analytics, ______________ can be achieved via an offline ETL operation
  • The presence of invalid data is resulting in spikes. Although the data appears abnormal, it may be indicative of a new pattern
  • A _______ is generally expressed using a line chart, with time plotted on the x-axis and recorded data value plotted on the y-axis
  • for realtime analytics, a more complex in-memory system is required to validate and cleanse the data at the source

Question 29

Question
Data Validation & Cleansing
Answers
  • provenance can play an important role in determining the accuracy and quality of questionable data
  • data that appears to be invalid may still be valuable in that it may possess hidden patterns and trends
  • No hypothesis or predetermined assumptions are generated
  • A _______ database is a non-relational database that is highly scalable, fault-tolerant and specifically designed to house unstructured data

Question 30

Question
Data Aggregation & Representation
Answers
  • Data may be spread across multiple datasets, requiring that datasets be joined together via common fields. In other cases, the same data fields may appear in multiple datasets
  • Either way, a method of data reconciliation is required or the dataset representing the correct value needs to be determined
  • Law of Diminishing Marginal Utility
  • A ______ can be in the form of a chart or a map

Question 31

Question
Data Aggregation & Representation
Answers
  • is dedicated to integrating multiple datasets together to arrive at a unified view
  • future data analysis requirements need to be considered during this stage to help foster data reusability
  • The ________ mechanism can also be used to support distributed locks, support distributed queues, establish a highly available registry for obtaining configuration information, and provide reliable asynchronous communication between processes that are running on different servers
  • essentially provides the ability to discover text rather than just search it

Question 32

Question
Data structure and Semantics
Answers
  • performing the stage of data aggregation & representation can become complicated because of differences in this
  • Reconciling these differences can require complex logic that is executed automatically without the need for human intervention
  • Within Big Data ________ can first be applied to discover if a relationship exists
  • both versions are subjected to an experiment simultaneously, and the observations are recorded to determine which version is more successful

Question 33

Question
Data structure
Answers
  • Although the data format may be the same, the data model may be different
  • are an effective visual analysis technique for expressing patterns, data compositions via part-whole relations and geographic distributions of data
  • Data Acquisition & Filtering
  • Hadoop's batch-based data processing fully lends itself to the pay-per-use model of __________, which can reduce operational costs since a typical Hadoop cluster size can range from a few to a few thousand nodes

Question 34

Question
Semantics
Answers
  • A value that is labelled differently in two different datasets may mean the same thing
  • Instead of hard-coding the required learning rules, either supervised or unsupervised machine learning is applied to develop the computer's understanding of the __________
  • Network Analysis
  • In other areas such as the scientific domains, the objective may simply be to observe which version works better in order to improve a process or product

Question 35

Question
Data Aggregation
Answers
  • The large volumes processed by Big Data solutions can make ____________ a time- and effort-intensive operation
  • Whether _____________ is required or not, it is important to understand that the same data can be stored in many different forms. One form may be better suited for a particular type of analysis than another
  • require processing resources that they request from the resource manager
  • the data is then analyzed to prove or disprove the hypothesis and provide definitive answers to specific questions

Question 36

Question
Data structure standardized
Answers
  • can act as a common denominator that can be used for a range of analysis techniques and projects. This can require establishing a central, standard analysis repository, such as a NoSQL database
  • A big data _________ utilizes a distributed parallel programming framework that enables it to process very large amounts of data distributed across multiple nodes
  • The _____ essentially acts as a resource arbitrator that manages and allocates available resources
  • comprise random read/writes that involve fewer joins and require low-latency responses, with a smaller data footprint

Question 37

Question
Data Analysis
Answers
  • is dedicated to carrying out the actual analysis task, which typically involves one or more types of analysis
  • this stage can be iterative in nature, especially if the _________________ is exploratory so that analysis is repeated until the appropriate pattern or correlation is uncovered
  • A _______ may internally use a processing engine to process multiple large datasets in parallel
  • the accuracy and applicability of the patterns and relationships that are found in a large dataset will be higher than that of a smaller dataset

Question 38

Question
Data Analysis
Answers
  • the exploratory analysis approach is explained shortly, along with confirmatory analysis
  • depending on the type of analytics required, this stage can be as simple as querying a dataset to compute an aggregation for comparison
  • make it possible to develop highly reliable, highly available distributed big data solutions that can be deployed in a cluster
  • Correlation, regression, time series analysis, classification, clustering, outlier detection, filtering, natural language processing, text analytics and sentiment analysis are considered forms of ________

Question 39

Question
Data Analysis
Answers
  • it can be as challenging as combining data mining and complex statistical analysis techniques to discover patterns and anomalies, or to generate a statistical or mathematical model to depict relationships between variables
  • The approach taken when carrying out this stage can be classified as confirmatory analysis or exploratory analysis (the latter is linked to data mining)
  • The results of completing the _______________ stage provide users with the ability to perform visual analysis, allowing for the discovery of answers to questions that users have not yet even formulated
  • A given ______ may support either data ingress or egress functions

Question 40

Question
Confirmatory Data Analysis
Answers
  • _____________ is a deductive approach where the cause of the phenomenon being investigated is proposed beforehand
  • the data is then analyzed to prove or disprove the hypothesis and provide definitive answers to specific questions
  • can be used to determine the number of entities that fall within a certain radius of another entity
  • can act as a common denominator that can be used for a range of analysis techniques and projects. This can require establishing a central, standard analysis repository, such as a NoSQL database

Question 41

Question
hypothesis
Answers
  • The proposed cause or assumption is called a ____________
  • is an item filtering technique based on the collaboration (merging) of users' past behavior
  • This type of environment is provided by a platform that is comprised of a set of distributed storage and processing technologies
  • As the number of digitized documents, e-mails, social media posts and log files increases, businesses have an increasing need to leverage any value that can be extracted from these forms of semi-structured and unstructured data

Question 42

Question
Confirmatory Data Analysis
Answers
  • the data is then analyzed to prove or disprove the hypothesis and provide definitive answers to specific questions
  • Data samples are typically used
  • this information can then be integrated into the decision-making process
  • Unexpected findings or anomalies are usually ignored since a predetermined cause was assumed

Question 43

Question
Exploratory Data Analysis
Answers
  • _____________ is an inductive approach that is closely associated to data mining
  • No hypothesis or predetermined assumptions are generated
  • can be carried out with the support of correlation, heat maps, time series analysis, network analysis, spatial data analysis, clustering, outlier detection, natural language processing and text analytics
  • is an item filtering technique based on the collaboration (merging) of users' past behavior

Question 44

Question
Exploratory Data Analysis
Answers
  • Instead, the data is explored through analysis to develop an understanding of the cause of the phenomenon
  • Although it may not provide definitive answers, this method provides a general direction that can facilitate the discovery of patterns or anomalies
  • represents a constant rate of change
  • Large amounts of data and visual analysis are typically used

Question 45

Question
Data visualization
Answers
  • The ability to analyze massive amounts of data and find useful insights carries little value if the only ones that can interpret the results are the analysts
  • is dedicated to using __________________ techniques and tools to graphically communicate the analysis results for effective interpretation by business users
  • is the process of finding data that is significantly different from or inconsistent with the rest of the data within a given dataset
  • for batch analytics, ______________ can be achieved via an offline ETL operation

Question 46

Question
Data visualization
Answers
  • Business users need to be able to understand the results in order to obtain value from the analysis and subsequently have the ability to provide feedback
  • The results of completing the _______________ stage provide users with the ability to perform visual analysis, allowing for the discovery of answers to questions that users have not yet even formulated
  • Large amounts of data and visual analysis are typically used
  • is an item filtering technique focused on the similarity between users and items

Question 47

Question
Data visualization
Answers
  • The same results may be presented in a number of different ways, which can influence the interpretation of the results
  • Consequently, it is important to use the most suitable visualization technique by keeping the business domain in context
  • Another aspect to keep in mind is that providing a method of drilling down to comparatively simple statistics is crucial for users to understand how the rolled-up or aggregated results were generated
  • The objective is to use graphic representations to develop a deeper understanding of the data being analyzed. Specifically, it helps identify and highlight hidden patterns, correlations and anomalies

Question 48

Question
Analysis results
Answers
  • Subsequent to __________ being made available to business users to support business decision-making (such as via dashboards), there may be further opportunities to utilize the __________
  • Natural Language Processing
  • A ___________ provides the ability to design and process a complex sequence of operations that can be triggered either at set time intervals or when data becomes available
  • includes both text and speech recognition

Question 49

Question
Utilization of Analysis Results
Answers
  • is dedicated to determining how and where processed analysis data can be further leveraged
  • Depending on the nature of the analysis problems being addressed, it is possible for the analysis results to produce "models" that encapsulate new insights and understandings about the nature of the patterns and relationships that exist within the data that was just analyzed
  • Data Transfer Engine
  • is generally applied via the following two approaches: collaborative ____________ and content-based ____________

Question 50

Question
Utilization of Analysis Results
Answers
  • A model may look like a mathematical equation or a set of rules
  • Models can be used to improve business process logic, application system logic and can form the basis of a new system or software program
  • A distributed Big Data solution that needs to run on multiple servers relies on the coordination engine mechanism to ensure operational consistency across all of the participating servers
  • new models may be used to improve the programming logic within existing enterprise systems or may form the basis of new systems

Question 51

Question
Input for enterprise systems
Answers
  • Filtering
  • An ________ is employed when the comparatively simple data manipulation functions of a query engine are insufficient
  • the data analysis results may be automatically (or manually) fed directly into enterprise systems to enhance and optimize their behavior and performance
  • new models may be used to improve the programming logic within existing enterprise systems or may form the basis of new systems

Question 52

Question
Business Process Optimization
Answers
  • The patterns, correlations and anomalies discovered during the data analysis are used to refine business processes
  • models may also lead to opportunities to improve business process logic
  • is a computer's ability to comprehend human speech and text as naturally understood by humans
  • When two variables are considered to be _____________ they are considered to be aligned based on a linear relationship

Question 53

Question
Alerts
Answers
  • Data analysis results can be used as input for existing _______ or may form the basis of new _______
  • this helps maintain data provenance throughout the big data analysis lifecycle, which helps establish and preserve data accuracy and quality
  • Text Analytics
  • Recommender systems may also be based on a hybrid of both collaborative _______ and content-based _______ to fine-tune the accuracy and effectiveness of generated suggestions

Question 54

Question
Data Analysis Techniques
Answers
  • Statistical Analysis
  • Visual Analysis
  • it can be as challenging as combining data mining and complex statistical analysis techniques to discover patterns and anomalies, or to generate a statistical or mathematical model to depict relationships between variables
  • Note that distributed file systems and databases are both on-disk _________ mechanisms

Question 55

Question
Data Analysis Techniques
Answers
  • Machine Learning
  • Semantic Analysis
  • A _______ database is a non-relational database that is highly scalable, fault-tolerant and specifically designed to house unstructured data
  • Each node in the _____ has its own dedicated resources such as memory and hard drive and runs its own operating system just like a desktop computer

Question 56

Question
Statistical Analysis
Answers
  • A/B Testing
  • Correlation
  • Unstructured text is generally much more difficult to analyze and search, compared to structured text
  • Regression

Question 57

Question
Machine Learning
Answers
  • Classification
  • Clustering
  • The use of ________ can reduce development time and enable the manipulation of large datasets without the need to write complex programming logic
  • it is vital that metadata be machine-readable and passed forward along subsequent analysis stages

Question 58

Question
Machine Learning
Answers
  • Outlier Detection
  • Filtering
  • Some proprietary ________ also provide specialized data analysis features, such as text analytics and machine log analysis processing
  • Hadoop's batch-based data processing fully lends itself to the pay-per-use model of __________, which can reduce operational costs since a typical Hadoop cluster size can range from a few to a few thousand nodes

Question 59

Question
Visual Analysis
Answers
  • Heat Maps
  • Time series analysis
  • examples of appended metadata can include dataset size and structure, source information, date and time of creation or collection, language-specific information, etc.
  • can act as a common denominator that can be used for a range of analysis techniques and projects. This can require establishing a central, standard analysis repository, such as a NoSQL database

Question 60

Question
Visual Analysis
Answers
  • Network Analysis
  • Spatial Data Analysis
  • it can be based on either supervised or unsupervised learning
  • Invalid data can skew and falsify analysis results

Question 61

Question
Semantic Analysis
Answers
  • suggest that there is no relationship at all between the two variables
  • Applications of __________ include operations and logistic optimization, environmental sciences and infrastructure planning
  • Natural Language Processing
  • Text Analytics

Frage 62

Frage
Semantic Analysis
Antworten
  • Sentiment Analysis
  • suggest that there is no relationship at all between the two variables
  • Applications for ___________ include fraud detection, medical diagnosis, network data analysis and sensor data analysis
  • Unexpected findings or anomalies are usually ignored since a predetermined cause was assumed

Frage 63

Frage
Statistical Analysis
Antworten
  • uses statistical methods based on mathematical formulas as a means for analyzing data
  • this type of analysis is commonly used to describe datasets via summarization, such as providing the mean, median or mode of statistics associated with the dataset
  • Spatial or geospatial data is commonly used to identify the geographic location of individual entities
  • The _____ essentially acts as a resource arbitrator that manages and allocates available resources

Frage 64

Frage
Statistical Analysis
Antworten
  • it can also be used to infer patterns and relationships within the dataset, such as regression and correlation
  • is generally applied via the following two approaches: collaborative ____________ and content-based ____________
  • is a supervised learning technique by which data is classified into relevant, previously learned categories
  • We may be further interested in discovering how closely Variables A and B are related, which means we may also want to analyze the extent to which Variable B increases in relation to Variable A's increase

Frage 65

Frage
A/B Testing
Antworten
  • also known as split or bucket testing compares two versions of an element to determine which version is superior based on a pre-defined metric
  • the element can be a range of things
  • is expressed as a decimal number between -1 and +1, which is known as the _____________ coefficient
  • Classification, outlier detection, filtering, natural language processing, text analytics and sentiment analysis can utilize ___________

Frage 66

Frage
A/B Testing
Antworten
  • the current version of the element is called the control version, whereas the modified version is called the treatment
  • both versions are subjected to an experiment simultaneously; the observations are recorded to determine which version is more successful
  • However, ___________ is always achieved through physically separate machines that are networked together as a cluster
  • Instead of hard-coding the required learning rules, either supervised or unsupervised machine learning is applied to develop the computer's understanding of the __________
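
The control/treatment comparison described for A/B testing can be sketched as follows. This is a minimal illustration, assuming the pre-defined metric is a conversion rate; the function names and the visitor counts are hypothetical and not part of the quiz material.

```python
def conversion_rate(conversions, visitors):
    """Return the fraction of visitors who converted."""
    if visitors <= 0:
        raise ValueError("visitors must be positive")
    return conversions / visitors


def compare_versions(control, treatment):
    """Compare two (conversions, visitors) pairs on conversion rate.

    Returns "control", "treatment" or "tie".
    """
    control_rate = conversion_rate(*control)
    treatment_rate = conversion_rate(*treatment)
    if treatment_rate > control_rate:
        return "treatment"
    if control_rate > treatment_rate:
        return "control"
    return "tie"


# Hypothetical counts recorded while both versions ran simultaneously.
winner = compare_versions(control=(50, 1000), treatment=(65, 1000))
```

A real experiment would usually also check whether the difference is statistically significant before declaring a winner; that step is left out of this sketch.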

Frage 67

Frage
A/B Testing
Antworten
  • Although __________ can be implemented in almost any domain, it is most often used in marketing
  • Generally, the objective is to gauge human behavior with the goal of increasing sales
  • This is a traditional data analysis principle that claims that data held in a reasonably sized dataset provides the maximum value
  • Either way, a method of data reconciliation is required or the dataset representing the correct value needs to be determined

Frage 68

Frage
A/B Testing
Antworten
  • In other areas such as the scientific domains, the objective may simply be to observe which version works better in order to improve a process or product
  • A ______ can be in the form of a chart or a map
  • for batch analytics, ______________ can be achieved via an offline ETL operation
  • Correlation, regression, time series analysis, classification, clustering, outlier detection, filtering, natural language processing, text analytics and sentiment analysis are considered forms of ________

Frage 69

Frage
Correlation
Antworten
  • is an analysis technique used to determine whether two variables are related to each other
  • if they are found to be related, the next step is to determine what their relationship is
  • Query Engine
  • In general, the more learning data the computer has, the more correctly it can decipher human text and speech

Frage 70

Frage
Correlation
Antworten
  • We may be further interested in discovering how closely Variables A and B are related, which means we may also want to analyze the extent to which Variable B increases in relation to Variable A's increase
  • The use of ________ helps to develop an understanding of a dataset and find relationships that can assist in explaining a phenomenon
  • Network Analysis
  • Data may be spread across multiple datasets, requiring that datasets be joined together via common fields; in other cases, the same data fields may appear in multiple datasets

Frage 71

Frage
Correlation
Antworten
  • Is therefore commonly used for data mining where the identification of relationships between variables in a dataset leads to the discovery of patterns and anomalies
  • This can reveal the nature of the dataset or the cause of a phenomenon
  • Migrating to the cloud is logical for enterprises planning to run analytics on datasets that are available via data markets, as most data markets store their data in the cloud such as Amazon S3
  • Big Data solutions require a distributed processing environment that can accommodate large-scale data volumes, velocity and variety

Frage 72

Frage
Correlated
Antworten
  • When two variables are considered to be _____________ they are considered to be aligned based on a linear relationship
  • This means that when one variable changes, the other variable also changes proportionally and constantly
  • items can be ______ either based on a user's own behavior or by matching the behavior of multiple users
  • Note that a workflow engine may provide integration with a _______ to enable the automated import and export data

Frage 73

Frage
Correlation
Antworten
  • is expressed as a decimal number between -1 and +1, which is known as the _____________ coefficient
  • The degree of relationship changes from being strong to weak when moving from -1 to 0 or +1 to 0
  • For example, a __________ can execute logic that collects relational data from multiple databases at regular intervals via the data transfer engine mechanism, applies a set of ETL operations via the processing engine mechanism and finally persists the results to a NoSQL storage device
  • essentially provides the ability to discover text rather than just search it
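
A coefficient between -1 and +1 of the kind described here can be computed as below. This is a minimal sketch that assumes the Pearson coefficient is the intended measure (the quiz does not name a specific formula); the function name and sample values are illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two variables around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


strong_positive = pearson([1, 2, 3, 4], [2, 4, 6, 8])    # close to +1
strong_negative = pearson([1, 2, 3, 4], [8, 6, 4, 2])    # close to -1
no_relationship = pearson([1, 2, 3, 4], [1, -1, -1, 1])  # close to 0
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak or absent one.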

Frage 74

Frage
+1
Antworten
  • suggest that there is a strong positive relationship between the two variables
  • When one variable increases, the other also increases and vice versa
  • typically involve large quantities of data with sequential read/writes, and comprises a group of read or write queries
  • Data Extraction

Frage 75

Frage
0
Antworten
  • suggest that there is no relationship at all between the two variables
  • when one increases, the other may stay the same, or increase or decrease arbitrarily
  • it can also be used to make predictions about the values of the dependent variable while it is still unknown
  • Generally, the objective is to gauge human behavior with the goal of increasing sales

Frage 76

Frage
-1
Antworten
  • suggest that there is a strong negative relationship between the two variables
  • when one variable increases, the other decreases and vice versa
  • can be carried out with the support of correlation, heat maps, time series analysis, network analysis, spatial data analysis, clustering, outlier detection, natural language processing and text analytics
  • Therefore, the value of each additional batch does not diminish value; rather, it provides more value

Frage 77

Frage
Regression
Antworten
  • The analysis technique of _________ explores how a dependent variable is related to an independent variable within a dataset
  • As a sample scenario, __________ could help determine the type of relationship that exists between temperature (independent variable) and crop yield (dependent variable)
  • In the context of traditional data analysis, the ______ states that, starting with a reasonably large sample size, the value obtained from the analysis of additional data decreases as more data is successively added to the original sample
  • Data Analysis
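
The temperature and crop yield scenario can be sketched with a simple linear regression. This is an illustrative example only, assuming an ordinary least-squares fit with one independent variable; the data points are invented for the sketch.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("independent variable must vary")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: temperature (independent), crop yield (dependent).
temperatures = [10, 15, 20, 25]
crop_yields = [2.0, 3.0, 4.0, 5.0]

slope, intercept = fit_line(temperatures, crop_yields)
# Predict the dependent variable for a temperature not yet observed.
predicted_yield = slope * 30 + intercept
```

The slope expresses how much the dependent variable changes per unit change of the independent variable, which is the constant rate of change that characterizes the linear case.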

Frage 78

Frage
Regression
Antworten
  • Applying this technique helps determine how the value of the dependent variable changes in relation to change in the value of the independent variable
  • When the independent variable increases, for example, does the dependent variable also increase? If yes, is the increase in a linear or non-linear proportion?
  • Business Case Evaluation
  • A _________ comprises grouped read/writes, with a larger data footprint consisting of complex joins and high-latency responses

Frage 79

Frage
Regression
Antworten
  • More than one independent variable can be tested at the same time
  • However, in such cases only one independent variable may change. The others are kept constant
  • The results of completing the _______________ stage provide users with the ability to perform visual analysis, allowing for the discovery of answers to questions that users have not yet even formulated
  • a list of possible third-party data providers (data markets and publicly available datasets) is generally compiled. Some forms of external data may be embedded within blogs or other types of content-based websites, in which case they may need to be harvested via automated tools

Frage 80

Frage
Regression
Antworten
  • can help enable a better understanding of what a phenomenon is, and why it occurred
  • it can also be used to make predictions about the values of the dependent variable while it is still unknown
  • Users of Big Data solutions can make numerous data processing requests, each of which can have different processing workload requirements
  • The _____ essentially acts as a resource arbitrator that manages and allocates available resources

Frage 81

Frage
Linear regression
Antworten
  • represents a constant rate of change
  • it can be as challenging as combining data mining and complex statistical analysis techniques to discover patterns and anomalies, or to generate a statistical or mathematical model to depict relationship between variables
  • the ______ states that the confidence with which predictions can be made increases as the size of the data that is being analyzed increases
  • Data samples are typically used

Frage 82

Frage
Non-linear regression
Antworten
  • This type of environment is provided by a platform that is comprised of a set of distributed storage and processing technologies
  • Hadoop's batch-based data processing fully lends itself to the pay-per-use model of __________, which can reduce operational costs since a typical Hadoop cluster size can range from a few to a few thousand nodes
  • A ________ is a method of storing and organizing data on a storage medium, such as hard drives, DVDs and flash drives
  • represents the variable rate of change

Frage 83

Frage
Correlation
Antworten
  • does not imply a causation. The change in the value of one variable may not be responsible for the change in the value of the second variable, although both may change at the same rate
  • assumes that both variables are independent
  • Within Big Data ________ can first be applied to discover if a relationship exists
  • However, ___________ is always achieved through physically separate machines that are networked together as a cluster

Frage 84

Frage
Regression
Antworten
  • deals with already identified dependent and independent variables
  • implies that there is a degree of causation between the dependent and independent variables that may be direct or indirect
  • can then be applied to further explore the relationship and predict the values of the dependent variable, based on the known values of the independent variables
  • is an example of the application of the law of large numbers

Frage 85

Frage
Visual Analysis
Antworten
  • is a form of data analysis that involves the graphic representation of data to enable or enhance its visual perception
  • based on the premise that humans can understand and draw conclusions from graphics more quickly than from text, _______ act as a discovery tool in the field of Big Data
  • As the amount of digitized documents, e-mails, social media posts and log files increases, businesses have an increasing need to leverage any value that can be extracted from these forms of semi-structured and unstructured data
  • Is therefore commonly used for data mining where the identification of relationships between variables in a dataset leads to the discovery of patterns and anomalies

Frage 86

Frage
Visual Analysis
Antworten
  • The objective is to use graphic representations to develop a deeper understanding of the data being analyzed. Specifically, it helps identify and highlight hidden patterns, correlations and anomalies
  • is also directly related to exploratory data analysis, as it encourages the formulation of questions from different angles
  • Workflow Engine
  • require processing resources that they request from the resource manager

Frage 87

Frage
Heat Maps
Antworten
  • are an effective visual analysis technique for expressing patterns, data compositions via part-whole relations and geographic distributions of data
  • they also facilitate the identification of areas of interest and the discovery of extreme (high/low) values within a dataset
  • Migrating to the cloud is logical for enterprises planning to run analytics on datasets that are available via data markets, as most data markets store their data in the cloud such as Amazon S3
  • Data analysis results can be used as input for existing _______ or may form the basis of new _______

Frage 88

Frage
Heat Maps
Antworten
  • The ______ itself is a visual, color-coded representation of data values
  • Each value is given a color according to its type, or the range that it falls under
  • Solely analyzing operational (structured) data may cause businesses to miss out on cost-saving or business expansion opportunities, especially those that are customer-focused
  • Applications of __________ include operations and logistic optimization, environmental sciences and infrastructure planning

Frage 89

Frage
Heat Maps
Antworten
  • A ______ can be in the form of a chart or a map
  • Instead of coloring the whole region, the map may be superimposed by a layer made up of collections of colored shapes representing various regions
  • suggestions commonly pertain to recommending items, such as movies, books, web pages, people, etc.
  • Sentiment Analysis

Frage 90

Frage
chart
Antworten
  • A _____ represents a matrix of values in which each cell is color-coded according to the value
  • in order to qualify as a Big Data Problem, a business problem needs to be directly related to one or more of the Big Data characteristics of volume, velocity or variety
  • items can be ______ either based on a user's own behavior or by matching the behavior of multiple users
  • Big data solutions often receive redundant data across different datasets

Frage 91

Frage
map
Antworten
  • A ___ represents a geographic measure by which different regions are color-coded according to a certain theme
  • Although __________ can be implemented in almost any domain, it is most often used in marketing
  • NLP, text analytics and sentiment analysis can be used in support of __________
  • As a sample scenario, __________ could help determine the type of relationship that exists between temperature (independent variable) and crop yield (dependent variable)

Frage 92

Frage
Heat Maps
Antworten
  • Instead of coloring the whole region, the map may be superimposed by a layer made up of collections of colored shapes representing various regions
  • Data needs to be imported before it can be processed by the big data solution
  • The data collected for _______ is always time-dependent
  • Named Entities (person, group, place, company), Pattern-Based Entities (social insurance number, zip code), Concepts (an abstract representation of an entity), Facts (relationship between entities)

Frage 93

Frage
Time series Analysis
Antworten
  • is the analysis of data that is recorded over periodic intervals of time
  • this type of analysis makes use of _________, which is a time-ordered collection of values recorded over regular time intervals
  • in order to qualify as a Big Data Problem, a business problem needs to be directly related to one or more of the Big Data characteristics of volume, velocity or variety
  • data that appears to be invalid may still be valuable in that it may possess hidden patterns and trends

Frage 94

Frage
Time series Analysis
Antworten
  • helps to uncover patterns within data that are time-dependent. Once identified, the pattern can be extrapolated for future predictions.
  • are usually used for forecasting by identifying long-term trends, seasonal periodic patterns and irregular short-term variations in the dataset
  • For example, a __________ can execute logic that collects relational data from multiple databases at regular intervals via the data transfer engine mechanism, applies a set of ETL operations via the processing engine mechanism and finally persists the results to a NoSQL storage device
  • Big Data solutions can be partially or fully deployed in clouds in order to leverage the storage and computing resources that are available from the cloud provider
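
One common way to expose a long-term trend in a time series is to smooth out short-term variations. The sketch below uses a simple moving average, which is an assumption made for illustration; the quiz does not prescribe a specific method, and the sample values are invented.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values in a time-ordered list."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical values recorded at regular time intervals.
readings = [1, 2, 3, 4, 5]
trend = moving_average(readings, window=3)
```

A larger window suppresses more of the irregular short-term variation, at the cost of reacting more slowly to genuine changes in the series.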

Frage 95

Frage
Time series Analysis
Antworten
  • Unlike other types of analyses, _________ always includes time as a comparison variable
  • The data collected for _______ is always time-dependent
  • Data samples are typically used
  • is solely based on the similarity between users' behavior, and requires a large amount of user behavior data in order to accurately filter items

Frage 96

Frage
Time series Analysis
Antworten
  • A _______ is generally expressed using a line chart, with time plotted on the x-axis and recorded data value plotted on the y-axis
  • is the specialized analysis of text through the application of data mining, machine learning and natural language processing techniques to extract value out of unstructured text
  • new models may be used to improve the programming logic within existing enterprise systems or may form the basis of new systems
  • depending on the type of data source, data may come as a dump of files (such as data purchased from a third-party data provider), or may require API integration (such as with Twitter)

Frage 97

Frage
Network
Antworten
  • Within the context of visual analysis, a _______ is an interconnected collection of entities
  • new models may be used to improve the programming logic within existing enterprise systems or may form the basis of new systems
  • this type of analysis makes use of _________, which is a time-ordered collection of values recorded over regular time intervals
  • Consequently, it is important to use the most suitable visualization technique by keeping the business domain in context

Frage 98

Frage
Entity
Antworten
  • An ____ can be a person, a group or some other business domain object such as a product
  • may be connected with another directly or indirectly
  • is a form of data analysis that involves the graphic representation of data to enable or enhance its visual perception
  • Also known as online processing, ____________ processing follows an approach whereby data is processed interactively, without delay (resulting in low-latency responses)

Frage 99

Frage
Network Analysis
Antworten
  • Some connections may only be one-way, so that traversal in the reverse direction is not possible
  • is a technique that focuses on analyzing relationships between entities within the network
  • For example, a __________ can execute logic that collects relational data from multiple databases at regular intervals via the data transfer engine mechanism, applies a set of ETL operations via the processing engine mechanism and finally persists the results to a NoSQL storage device
  • provides analysis features more sophisticated than those of heat maps

Frage 100

Frage
Network Analysis
Antworten
  • it involves plotting entities as nodes and connections as edges between nodes
  • Specialized variations of __________ include route optimization, social network analysis and spread prediction
  • Unlike other types of analyses, _________ always includes time as a comparison variable
  • are based on predictive analytics techniques and therefore are associated with the same analysis techniques as predictive analytics. Additionally, _____ may utilize heat maps, network analysis and spatial data analysis to graphically show various outcomes
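
Plotting entities as nodes and connections as edges, including one-way connections, can be sketched with a directed graph. This is a minimal illustration, assuming an adjacency-list representation and a breadth-first traversal; the entity names are hypothetical.

```python
from collections import deque


def reachable(graph, start):
    """Return every node that can be reached from `start` by following edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


# Each key is an entity (node); each list holds its outgoing connections (edges).
graph = {"alice": ["bob"], "bob": ["carol"], "carol": []}

from_alice = reachable(graph, "alice")
from_carol = reachable(graph, "carol")
```

Because the edges are one-way, "carol" is reachable from "alice" but traversal in the reverse direction is not possible.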

Frage 101

Frage
Spatial Data Analysis
Antworten
  • focuses on analyzing location-based data in order to find different geographic relationships and patterns between entities
  • Spatial or geospatial data is commonly used to identify the geographic location of individual entities
  • in order to qualify as a Big Data Problem, a business problem needs to be directly related to one or more of the Big Data characteristics of volume, velocity or variety
  • The ________ mechanism can also be used to support distributed locks, support distributed queues, establish a highly available registry for obtaining configuration information, and provide reliable asynchronous communication between processes that are running on different servers

Frage 102

Frage
Spatial Data Analysis
Antworten
  • is manipulated through a geographical information system (GIS) that plots spatial data on a map generally using its longitude and latitude coordinates
  • With the ever-increasing availability of location-based data, _________ can be analyzed to gain location insights
  • is dedicated to establishing (often complex) validation rules and removing any known invalid data
  • Correlation

Frage 103

Frage
Spatial Data Analysis
Antworten
  • Applications of __________ include operations and logistic optimization, environmental sciences and infrastructure planning
  • Data used as input for _________ can either contain exact locations (longitude, latitude) or the information required to calculate locations (such as zip codes or IP addresses)
  • the accuracy and applicability of the patterns and relationships that are found in a large dataset will be higher than that of a smaller dataset
  • helps decision-makers understand the business resources that will need to be utilized and which business challenges the analysis will tackle

Frage 104

Frage
Spatial Data Analysis
Antworten
  • provides analysis features more sophisticated than those of heat maps
  • can be used to determine the number of entities that fall within a certain radius of another entity
  • the need to address disparate types of data is more likely with data from external sources
  • is dedicated to establishing (often complex) validation rules and removing any known invalid data
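
Counting the entities that fall within a certain radius of another entity can be sketched as follows. The example assumes locations given as (latitude, longitude) pairs in degrees and uses the haversine great-circle distance on a spherical Earth; the function names and coordinates are illustrative.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def count_within_radius(center, points, radius_km):
    """Count the points that lie within `radius_km` of `center`."""
    return sum(
        1
        for lat, lon in points
        if haversine_km(center[0], center[1], lat, lon) <= radius_km
    )


# Hypothetical entity locations as (latitude, longitude).
entities = [(0.0, 1.0), (0.0, 2.0), (1.0, 0.0)]
nearby = count_within_radius((0.0, 0.0), entities, radius_km=120)
```

A GIS would normally provide this kind of query directly; the sketch only shows the underlying calculation.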

Frage 105

Frage
Machine Learning
Antworten
  • Law of large numbers
  • Law of Diminishing Marginal Utility
  • A target user's past behavior (likes, rating, purchase history, etc.) is collaborated with the behavior of similar users
  • initial iteration of the big data analysis lifecycle will require more up-front investment in Big Data technologies, products and training compared to later iterations where these earlier investments can be repeatedly leveraged

Frage 106

Frage
Law of large numbers
Antworten
  • the ______ states that the confidence with which predictions can be made increases as the size of the data that is being analyzed increases
  • the accuracy and applicability of the patterns and relationships that are found in a large dataset will be higher than that of a smaller dataset
  • Data Extraction
  • is an analysis technique used to determine whether two variables are related to each other

Frage 107

Frage
Law of large numbers
Antworten
  • this means that the greater the amount of data available for analysis, the better we become at making correct decisions
  • Within computing, a ______ is a tightly coupled collection of servers, or nodes. These servers usually have the same hardware specifications and are connected together via network to work as a single unit
  • Unlike traditional enterprise data where the data structure is pre-defined and data is pre-validated, data input into big data analyses can be unstructured without any indication of validity
  • Classification

Frage 108

Frage
Law of Diminishing Marginal Utility
Antworten
  • In the context of traditional data analysis, the ______ states that, starting with a reasonably large sample size, the value obtained from the analysis of additional data decreases as more data is successively added to the original sample
  • This is a traditional data analysis principle that claims that data held in a reasonably sized dataset provides the maximum value
  • A _______ provides a logical view of the data stored on the storage medium as a tree structure of files and directories
  • this redundancy can be exploited to explore interconnected datasets in order to assemble validation parameters and fill in missing valid data

Frage 109

Frage
Law of Diminishing Marginal Utility
Antworten
  • The ____ does not apply to Big Data
  • The greater volume and variety of data that Big Data solutions can process allows each additional batch of data to carry greater potential for unearthing new patterns and anomalies
  • for speech recognition, the system attempts to comprehend the speech and then performs an action, such as transcribing text
  • Applying this technique helps determine how the value of the dependent variable changes in relation to change in the value of the independent variable

Frage 110

Frage
Law of Diminishing Marginal Utility
Antworten
  • Therefore, the value of each additional batch does not diminish value; rather, it provides more value
  • Classification
  • This means that when one variable changes, the other variable also changes proportionally and constantly
  • is an example of the application of the law of large numbers

Frage 111

Frage
Classification
Antworten
  • is a supervised learning technique by which data is classified into relevant, previously learned categories
  • the system is fed data that is already categorized or labeled, so that it can develop an understanding of the different categories
  • therefore, it is advisable to store a verbatim copy of the original dataset before proceeding with the filtering. To save on required storage space, the verbatim copy is compressed before storage
  • Therefore, the value of each additional batch does not diminish value; rather, it provides more value

Frage 112

Frage
Classification
Antworten
  • the system is fed unknown (but similar) data for classification, based on the understanding it developed
  • a common application of this technique is for the filtering of e-mail spam. Note that ___________ can be performed for two or more categories
  • can help enable a better understanding of what a phenomenon is, and why it occurred
  • it involves plotting entities as nodes and connections as edges between nodes

Frage 113

Frage
Classification
Antworten
  • in a simplified _____ process, the machine is fed labeled data during training that builds its understanding of the _______. The machine is then fed unlabeled data, which it classifies itself
  • also known as split or bucket testing compares two versions of an element to determine which version is superior based on a pre-defined metric
  • A file is an atomic unit of storage used by the _________ to store data. Files are organized inside a directory
  • The objective is to use graphic representations to develop a deeper understanding of the data being analyzed. Specifically, it helps identify and highlight hidden patterns, correlations and anomalies

Frage 114

Frage
Clustering
Antworten
  • is an unsupervised learning technique by which data is divided into different groups so that the data in each group has similar properties
  • There is no prior learning of categories required; instead, categories are implicitly generated based on the data groupings
  • Big Data solutions can be partially or fully deployed in clouds in order to leverage the storage and computing resources that are available from the cloud provider
  • Applications include document classification and search, as well as building a 360-degree view of a customer by extracting information from a CRM system

Frage 115

Frage
Clustering
Antworten
  • How the data is grouped depends on the type of algorithm used. Each algorithm uses a different technique to identify ______
  • is generally used in data mining to get an understanding of the properties of a given dataset. After developing this understanding, classification can be used to make better predictions about similar, but new or unseen data
  • Is solely dedicated to individual user preferences and does not require data about other users
  • A ______ can be in the form of a chart or a map

Frage 116

Frage
Clustering
Antworten
  • can be applied to the categorization of unknown documents, and to personalized marketing campaigns by grouping together customers with similar behavior
  • Within computing, a ______ is a tightly coupled collection of servers, or nodes. These servers usually have the same hardware specifications and are connected together via network to work as a single unit
  • items can be ______ either based on a user's own behavior or by matching the behavior of multiple users
  • the ______ states that the confidence with which predictions can be made increases as the size of the data that is being analyzed increases

Frage 117

Frage
Outlier Detection
Antworten
  • is the process of finding data that is significantly different from or inconsistent with the rest of the data within a given dataset
  • The machine learning technique is used to identify anomalies, abnormalities and deviations that can be advantageous (such as opportunities) or disadvantageous (such as risks)
  • The data collected for _______ is always time-dependent
  • for realtime analytics, a more complex in-memory system is required to validate and cleanse the data at the source
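
Finding data that is significantly different from the rest of a dataset can be sketched with a z-score rule. This is only one simple, unsupervised approach, chosen here as an assumption for illustration; the threshold and the sample readings are hypothetical.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing deviates from the rest.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings with one abnormal value.
readings = [10, 11, 9, 10, 12, 10, 50]
outliers = find_outliers(readings)
```

Whether a flagged value is an error, a risk or an opportunity still has to be judged in the context of the business domain.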

Frage 118

Frage
Outlier Detection
Antworten
  • is closely related to the concept of classification and clustering, although its algorithms focus on finding abnormal values
  • it can be based on either supervised or unsupervised learning
  • involves the simultaneous execution of multiple sub-tasks that collectively comprise a larger task
  • it can also be used to infer patterns and relationships within the dataset, such as regression and correlation

Frage 119

Frage
Outlier Detection
Antworten
  • Applications for ___________ include fraud detection, medical diagnosis, network data analysis and sensor data analysis
  • suggest that there is a strong positive relationship between the two variables
  • To the client, a file appears local and can be accessed via multiple locations
  • Heat Maps

Frage 120

Frage
Filtering
Antworten
  • is the automated process of finding relevant items from a pool of items
  • items can be ______ either based on a user's own behavior or by matching the behavior of multiple users
  • The ____ does not apply to Big Data
  • Although __________ can be implemented in almost any domain, it is most often used in marketing

Frage 121

Frage
Filtering
Antworten
  • is generally applied via the following two approaches: collaborative ____________ and content-based ____________
  • A common medium by which ________ is implemented is via the use of a recommender system
  • A given ______ may support either data ingress or egress functions
  • A ________ generally provides only one of the listed functions

Frage 122

Frage
Collaborative Filtering
Antworten
  • is an item filtering technique based on the collaboration (merging) of users' past behavior
  • A target user's past behavior (likes, rating, purchase history, etc.) is collaborated with the behavior of similar users
  • We may be further interested in discovering how closely Variables A and B are related, which means we may also want to analyze the extent to which Variable B increases in relation to Variable A's increase
  • Specialized variations of __________ include route optimization, social network analysis and spread prediction

Frage 123

Frage
Collaborative Filtering
Antworten
  • Based on the similarity of the users' behavior, items are filtered for the target user
  • is solely based on the similarity between users' behavior, and requires a large amount of user behavior data in order to accurately filter items
  • A _____ represents a matrix of values in which each cell is color-coded according to the value
  • The presence of invalid data is resulting in spikes. Although the data appears abnormal, it may be indicative of a new pattern
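
The idea of filtering items for a target user based on the behavior of similar users can be sketched as follows. The user names, ratings and the choice of cosine similarity are all assumptions made for illustration; real recommender systems need far more behavior data than this toy example.

```python
from math import sqrt

# Hypothetical past behavior: user -> {item: rating}
ratings = {
    "ana":   {"film_a": 5, "film_b": 4, "film_c": 1},
    "ben":   {"film_a": 5, "film_b": 5, "film_d": 4},
    "carla": {"film_a": 1, "film_c": 5, "film_e": 4},
}


def cosine_similarity(a, b):
    """Cosine similarity of two rating dicts; unrated items count as 0."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend_collaborative(target, ratings):
    """Suggest items rated by the most similar user that the target has not seen."""
    others = [u for u in ratings if u != target]
    most_similar = max(
        others, key=lambda u: cosine_similarity(ratings[target], ratings[u])
    )
    seen = ratings[target]
    candidates = {
        item: score
        for item, score in ratings[most_similar].items()
        if item not in seen
    }
    # Highest-rated unseen items first.
    return sorted(candidates, key=candidates.get, reverse=True)


print(recommend_collaborative("ana", ratings))
```

Here the item attributes are never inspected; the suggestion depends only on how similarly users have behaved.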

Frage 124

Frage
Collaborative Filtering
Antworten
  • is an example of the application of the law of large numbers
  • A ______ can be in the form of a chart or a map
  • the system is fed unknown (but similar) data for classification, based on the understanding it developed
  • _____________ is an inductive approach that is closely associated with data mining

Frage 125

Frage
Content-based Filtering
Antworten
  • is an item filtering technique focused on the similarity between users and items
  • A user profile is created based on the user's past behavior (likes, rating, purchase history, etc.)
  • Analytics Engine
  • can be used to determine the number of entities that fall within a certain radius of another entity

Frage 126

Frage
Content-based Filtering
Antworten
  • The similarities identified between the user profile and the attributes of various items lead to items being filtered for the user
  • Is solely dedicated to individual user preferences and does not require data about other users
  • Applying this technique helps determine how the value of the dependent variable changes in relation to change in the value of the independent variable
  • However, in such cases only one independent variable may change. The others are kept constant
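
Matching a user profile against item attributes can be sketched as below. The item catalog, the attribute tags and the simple overlap score are hypothetical choices for the example, not a prescribed method.

```python
from collections import Counter

# Hypothetical item catalog: item -> set of descriptive attributes
items = {
    "film_a": {"sci-fi", "action"},
    "film_b": {"sci-fi", "drama"},
    "film_c": {"romance", "comedy"},
    "film_d": {"sci-fi", "action", "thriller"},
}


def build_profile(liked, items):
    """Count how often each attribute appears among the items the user liked."""
    profile = Counter()
    for item in liked:
        profile.update(items[item])
    return profile


def recommend_by_content(liked, items):
    """Rank unseen items by how strongly their attributes match the profile."""
    profile = build_profile(liked, items)
    scores = {
        name: sum(profile[attr] for attr in attrs)
        for name, attrs in items.items()
        if name not in liked
    }
    matching = [name for name, score in scores.items() if score > 0]
    return sorted(matching, key=lambda name: scores[name], reverse=True)


print(recommend_by_content(["film_a", "film_b"], items))
```

Only the one user's own history is used, so no data about other users is required.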

Frage 127

Frage
Filtering
Antworten
  • A recommender system predicts user preferences and generates suggestions for the user accordingly
  • suggestions commonly pertain to recommending items, such as movies, books, web pages, people, etc.
  • represents a constant rate of change
  • can be supported by correlation, heat maps, time series analysis, network analysis, spatial data analysis, clustering, outlier detection, natural language processing and text analytics

Frage 128

Frage
Filtering
Antworten
  • A recommender system typically uses either collaborative _____ or content-based _________ to generate suggestions
  • Recommender systems may also be based on a hybrid of both collaborative _______ and content-based _______ to fine-tune the accuracy and effectiveness of generated suggestions
  • Classification, outlier detection, filtering, natural language processing, text analytics and sentiment analysis can utilize ___________
  • Data analysis results can be used as input for existing _______ or may form the basis of new _______

Frage 129

Frage
Semantic Analysis
Antworten
  • In order for the machines to extract valuable information, text and speech data needs to be understood by the machines in the same way as humans do. _____ represents practices for extracting meaningful information from textual and speech data
  • require processing resources that they request from the resource manager
  • The same results may be presented in a number of different ways, which can influence the interpretation of the results
  • Instead, the data is explored through analysis to develop an understanding of the cause of the phenomenon

Frage 130

Frage
Natural Language Processing
Antworten
  • is a computer's ability to comprehend human speech and text as naturally understood by humans
  • this allows computers to perform a variety of useful tasks, such as full-text searches
  • A _______ database generally provides an API-based query interface, rather than the SQL Interface
  • Each node in the _____ has its own dedicated resources such as memory and hard drive and runs its own operating system just like a desktop computer

Frage 131

Frage
Natural Language Processing
Antworten
  • Instead of hard-coding the required learning rules, either supervised or unsupervised machine learning is applied to develop the computer's understanding of the __________
  • In general, the more learning data the computer has, the more correctly it can decipher human text and speech
  • A distributed Big Data solution that needs to run on multiple servers relies on the coordination engine mechanism to ensure operational consistency across all of the participating servers
  • functionally can be further grouped into the following categories: event, file, relational

Frage 132

Frage
Natural Language Processing
Antworten
  • includes both text and speech recognition
  • for speech recognition, the system attempts to comprehend the speech and then performs an action, such as transcribing text
  • A user profile is created based on the user's past behavior (likes, rating, purchase history, etc.)
  • The processing engine enables data to be queried and manipulated in other ways, but to implement this type of functionality requires custom programming

Frage 133

Frage
Text Analytics
Antworten
  • Unstructured text is generally much more difficult to analyze and search, compared to structured text
  • is the specialized analysis of text through the application of data mining, machine learning and natural language processing techniques to extract value out of unstructured text
  • As the amount of digitized documents, e-mails, social media posts and log files increases, businesses have an increasing need to leverage any value that can be extracted from these forms of semi-structured and unstructured data
  • Useful insights from text-based data can be gained by helping businesses develop an understanding of the information that is contained within a large body of text

Frage 134

Frage
Text Analytics
Antworten
  • essentially provides the ability to discover text rather than just search it
  • The basic tenet of ___________ is to turn unstructured text into data that can be searched and analyzed
  • Analysts working with big data solutions are not expected to know how to program processing engines
  • comprise random read/writes that involve fewer joins and require low-latency responses, with a smaller data footprint

Frage 135

Frage
Text Analytics
Antworten
  • Solely analyzing operational (structured) data may cause businesses to miss out on cost-saving or business expansion opportunities, especially those that are customer-focused
  • Applications include document classification and search, as well as building a 360-degree view of a customer by extracting information from a CRM system
  • However, in such cases only one independent variable may change. The others are kept constant
  • is a form of data analysis that involves the graphic representation of data to enable or enhance its visual perception

Frage 136

Frage
Text Analytics
Antworten
  • generally involves two steps: parsing text within documents to extract entities and facts, and categorizing documents using these extracted entities and facts
  • the extracted information can be used to perform a context-specific search on entities, based on the type of relationship that exists between the entities
  • identifying a wider variety of data sources may increase the probability of finding hidden patterns and correlations
  • Similarly, processed data may need to be exported to other systems before it can be used outside of the big data solution

Frage 137

Frage
Parsing text within documents to extract:
Antworten
  • Named Entities (person, group, place, company), Pattern-Based Entities (social insurance number, zip code), Concepts (an abstract representation of an entity), Facts (relationship between entities)
  • Data Extraction
  • is generally used in data mining to get an understanding of the properties of a given dataset. After developing this understanding, classification can be used to make better predictions about similar, but new or unseen data
  • The ________ mechanism is responsible for processing data (usually retrieved from storage devices) based on pre-defined logic, in order to produce a result
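
Extracting pattern-based entities from unstructured text can be sketched with regular expressions. The two formats below (a five-digit zip code and a social insurance number written as three hyphen-separated groups of three digits) are assumptions for the example; real formats vary by country.

```python
import re

# Assumed formats, for illustration only.
PATTERNS = {
    "zip_code": re.compile(r"\b\d{5}\b"),
    "social_insurance_number": re.compile(r"\b\d{3}-\d{3}-\d{3}\b"),
}


def extract_entities(text):
    """Return every match of each pattern, keyed by entity type."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}


sample = "Ship to 90210; applicant SIN 123-456-789."
print(extract_entities(sample))
```

Named entities, concepts and facts generally cannot be captured by fixed patterns like these and typically call for natural language processing techniques.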

Frage 138

Frage
Sentiment Analysis
Antworten
  • is a specialized form of text analysis that focuses on determining the bias or emotions of individuals
  • this form of analysis determines the attitude of the author (of the text) by analyzing the text within the context of the natural language
  • The proposed cause or assumption is called a ____________
  • In other areas such as the scientific domains, the objective may simply be to observe which version works better in order to improve a process or product

Frage 139

Frage
Sentiment Analysis
Antworten
  • not only provides information about how individuals feel, but also the intensity of their feeling
  • this information can then be integrated into the decision-making process
  • Machine Learning
  • Instead, the data is explored through analysis to develop an understanding of the cause of the phenomenon
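
Reporting both how an author feels and how intensely can be sketched with a small word lexicon. The lexicon, its weights and the scoring rule are hypothetical; production systems usually rely on machine learning and on context rather than isolated words.

```python
import re

# Hypothetical lexicon: word -> signed strength of feeling
LEXICON = {
    "love": 2, "great": 1, "good": 1,
    "bad": -1, "terrible": -2, "hate": -2,
}


def sentiment(text):
    """Return (label, intensity) from the summed lexicon weights of the words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(LEXICON.get(word, 0) for word in words)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return label, abs(score)


print(sentiment("I love this phone, the camera is great"))
```

The label captures the direction of the feeling and the second value its intensity, which is the pair of outputs that can feed a decision-making process.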

Frage 140

Frage
Sentiment Analysis
Antworten
  • Common applications for __________ include early identification of customer satisfaction or dissatisfaction, gauging product success or failure and spotting new trends
  • Utilization of Analysis Results
  • Generally, the objective is to gauge human behavior with the goal of increasing sales
  • are usually divided into two types: Batch and Transactional

Frage 141

Frage
Quantitative Analysis
Antworten
  • Correlation and regression are examples of ________. A/B testing can make use of ____________ techniques for results comparison.
  • Unstructured text is generally much more difficult to analyze and search, compared to structured text
  • Storage Device
  • Clustering

Frage 142

Frage
Qualitative Analysis
Antworten
  • NLP, text analytics and sentiment analysis can be used in support of __________
  • Machine Learning
  • To the client, a file appears local and can be accessed via multiple locations
  • For example, a __________ can execute logic that collects relational data from multiple databases at regular intervals via the data transfer engine mechanism, applies a set of ETL operations via the processing engine mechanism and finally persists the results to a NoSQL storage device

Frage 143

Frage
Data Mining
Antworten
  • can be supported by correlation, heat maps, time series analysis, network analysis, spatial data analysis, clustering, outlier detection, natural language processing and text analytics
  • this stage can be iterative in nature, especially if the _________________ is exploratory so that analysis is repeated until the appropriate pattern or correlation is uncovered
  • An ________ is employed when the comparatively simple data manipulation functions of a query engine are insufficient
  • metadata is added through an automated mechanism to data received from both internal and external data sources

Frage 144

Frage
Descriptive Analytics
Antworten
  • A/B testing, heat maps and spatial data analysis are considered forms of ____________
  • Specialized variations of __________ include route optimization, social network analysis and spread prediction
  • The output of one workflow can become the input of another workflow
  • Strategic BI and analytics fall in this category, since they are highly read-intensive tasks involving large volumes of data

Frage 145

Frage
Diagnostic Analytics
Antworten
  • Correlation, regression, time series analysis, network analysis and spatial data analysis are considered forms of _________
  • The workflow logic processed by a _____________ mechanism can involve the participation of other big data mechanisms
  • The ________ mechanism is responsible for processing data (usually retrieved from storage devices) based on pre-defined logic, in order to produce a result
  • in order to qualify as a Big Data Problem, a business problem needs to be directly related to one or more of the Big Data characteristics of volume, velocity or variety

Frage 146

Frage
Predictive Analysis
Antworten
  • Correlation, regression, time series analysis, classification, clustering, outlier detection, filtering, natural language processing, text analytics and sentiment analysis are considered forms of ________
  • is an unsupervised learning technique by which data is divided into different groups so that the data in each group has similar properties
  • Named Entities (person, group, place, company), Pattern-Based Entities (social insurance number, zip code), Concepts (an abstract representation of an entity), Facts (relationship between entities)
  • provenance can play an important role in determining the accuracy and quality of questionable data

Frage 147

Frage
Prescriptive Analytics
Antworten
  • are based on predictive analytics techniques and therefore are associated with the same analysis techniques as predictive analytics. Additionally, _____ may utilize heat maps, network analysis and spatial data analysis to graphically show various outcomes
  • Applying this technique helps determine how the value of the dependent variable changes in relation to change in the value of the independent variable
  • This can reveal the nature of the dataset or the cause of a phenomenon
  • Time series analysis

Frage 148

Frage
Supervised Learning
Antworten
  • Classification, outlier detection, filtering, natural language processing, text analytics and sentiment analysis can utilize ___________
  • A big data _________ utilizes a distributed parallel programming framework that enables it to process very large amounts of data distributed across multiple nodes
  • uses statistical methods based on mathematical formulas as a means for analyzing data
  • A user profile is created based on the user's past behavior (likes, rating, purchase history, etc.)

Frage 149

Frage
Unsupervised Learning
Antworten
  • Clustering, outlier detection, filtering, natural language processing, text analytics and sentiment analysis can utilize ___________
  • Data analysis results can be used as input for existing _______ or may form the basis of new _______
  • A _____ represents a matrix of values in which each cell is color-coded according to the value
  • Models can be used to improve business process logic, application system logic and can form the basis of a new system or software program

Frage 150

Frage
Cluster
Antworten
  • Within computing, a ______ is a tightly coupled collection of servers, or nodes. These servers usually have the same hardware specifications and are connected together via network to work as a single unit
  • Each node in the _____ has its own dedicated resources such as memory and hard drive and runs its own operating system just like a desktop computer
  • These engines may provide the agent-based processing of inflight data, which enables various data cleansing and transformation activities to be performed in realtime
  • Unexpected findings or anomalies are usually ignored since a predetermined cause was assumed

Frage 151

Frage
Cluster
Antworten
  • In the diagram, a _____ is used to execute a task based on distributed / parallel data processing frameworks
  • A/B Testing
  • Big Data solutions require a distributed processing environment that can accommodate large-scale data volumes, velocity and variety
  • Migrating to the cloud is logical for enterprises planning to run analytics on datasets that are available via data markets, as most data markets store their data in the cloud such as Amazon S3

Frage 152

Frage
File System
Antworten
  • A ________ is a method of storing and organizing data on a storage medium, such as hard drives, DVDs, and flash drives
  • A file is an atomic unit of storage used by the _________ to store data. Files are organized inside a directory
  • This can reveal the nature of the dataset or the cause of a phenomenon
  • The proposed cause or assumption is called a ____________

Frage 153

Frage
File System
Antworten
  • A _______ provides a logical view of the data stored on the storage medium as a tree structure of files and directories
  • Operating systems employ ______ for data storage. Each operating system provides support for one or more ________, like NTFS for Windows and ext for Linux
  • this form of analysis determines the attitude of the author (of the text) by analyzing the text within the context of the natural language
  • Within Big Data ________ can first be applied to discover if a relationship exists

Frage 154

Frage
Distributed File System
Antworten
  • A _________ is a file system that can store large files spread across a cluster
  • To the client, a file appears local and can be accessed via multiple locations
  • is the process of finding data that is significantly different from or inconsistent with the rest of the data within a given dataset
  • Natural Language Processing

Frage 155

Frage
Distributed File System
Antworten
  • Examples include the Google File System and Hadoop ________
  • requires that a business case be created, assessed and approved prior to proceeding with the actual hands-on analysis task
  • Machine Learning
  • Data Aggregation & Representation

Frage 156

Frage
NoSQL
Antworten
  • A _______ database is a non-relational database that is highly scalable, fault-tolerant and specifically designed to house unstructured data
  • A _______ database generally provides an API-based query interface, rather than the SQL Interface
  • The use of ________ helps to develop an understanding of a dataset and find relationships that can assist in explaining a phenomenon
  • when one increases, the other may stay the same, or increase or decrease arbitrarily

Frage 157

Frage
NoSQL
Antworten
  • However, some _______ databases may also provide a SQL-like query interface
  • this allows computers to perform a variety of useful tasks, such as full-text searches
  • depending on the type of analytics required, this stage can be as simple as querying a dataset to compute an aggregation for comparison
  • processing engine, storage device, resource manager

Frage 158

Frage
Parallel Data Processing
Antworten
  • involves the simultaneous execution of multiple sub-tasks that collectively comprise a larger task
  • the premise is to reduce the execution time by dividing a single larger task into multiple smaller tasks
  • A _________ in Big Data is defined as the amount and nature of data that is processed within a certain amount of time
  • For example, a __________ can execute logic that collects relational data from multiple databases at regular intervals via the data transfer engine mechanism, applies a set of ETL operations via the processing engine mechanism and finally persists the results to a NoSQL storage device
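
Dividing a single larger task into smaller sub-tasks that execute simultaneously can be sketched as follows. The summing task, the chunking helper and the worker count are assumptions for the example. Threads are used here only to show the structure; for CPU-bound work in CPython, a process pool would normally be needed to see a real reduction in execution time.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Split data into at most `parts` contiguous chunks of similar size."""
    size = -(-len(data) // parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum(data, workers=4):
    """Sum the data by summing chunks concurrently, then combining the results."""
    if not data:
        return 0
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunks))  # one sub-task per chunk
    return sum(partial_sums)


print(parallel_sum(list(range(1, 101))))
```

The same divide-and-combine shape applies whether the sub-tasks run on cores of one machine or on separate networked machines.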

Frage 159

Frage
Parallel Data Processing
Antworten
  • Although __________ can be achieved through multiple networked machines, it is more typically achieved within the confines of a single machine (multiple processors or cores)
  • the system is fed data that is already categorized or labeled, so that it can develop an understanding of the different categories
  • Law of Diminishing Marginal Utility
  • provides analysis features more sophisticated than those of heat maps

Frage 160

Frage
Distributed Data Processing
Antworten
  • is closely related to parallel data processing in how the same principle of "divide-and-conquer" is applied
  • However, ___________ is always achieved through physically separate machines that are networked together as a cluster
  • essentially provides the ability to discover text rather than just search it
  • This allows large amounts of data to be imported or exported within a short period of time

Frage 161

Frage
Processing workloads
Antworten
  • A _________ in Big Data is defined as the amount and nature of data that is processed within a certain amount of time
  • are usually divided into two types: Batch and Transactional
  • includes both text and speech recognition
  • A distributed Big Data solution that needs to run on multiple servers relies on the coordination engine mechanism to ensure operational consistency across all of the participating servers

Frage 162

Frage
Batch Workload
Antworten
  • Also known as offline processing, ________ processing involves processing data in batches and usually imposes delays (resulting in high-latency responses)
  • typically involves large quantities of data with sequential read/writes, and comprises a group of read or write queries
  • also known as split or bucket testing compares two versions of an element to determine which version is superior based on a pre-defined metric
  • An ____ can be a person, a group or some other business domain object such as a product

Frage 163

Frage
Batch Workload
Antworten
  • Queries can be complex and involve multiple joins
  • Strategic BI and analytics fall in this category, since they are highly read-intensive tasks involving large volumes of data
  • Sentiment Analysis
  • NLP, text analytics and sentiment analysis can be used in support of __________

Frage 164

Frage
Batch Workload
Antworten
  • A _________ comprises grouped read/writes, with a larger data footprint consisting of complex joins and high-latency responses
  • Spatial Data Analysis
  • Correlation, regression, time series analysis, network analysis and spatial data analysis are considered forms of _________
  • Users of Big Data solutions can make numerous data processing requests, each of which can have different processing workload requirements

Frage 165

Frage
Transactional workload
Antworten
  • Also known as online processing, ____________ processing follows an approach whereby data is processed interactively, without delay (resulting in low-latency responses)
  • involves small amounts of data with random read/writes
  • Data Validation & Cleansing
  • A/B Testing

Frage 166

Frage
Transactional workload
Antworten
  • OLTP and operational systems (write-intensive) as well as operational BI and analytics (read-intensive), both fall within this category
  • Although these workloads contain a mix of read/write queries, they are generally more write-intensive than read-intensive
  • can act as a common denominator that can be used for a range of analysis techniques and projects. This can require establishing a central, standard analysis repository, such as a NoSQL database
  • comprise random read/writes that involve fewer joins and require low-latency responses, with a smaller data footprint

Frage 167

Frage
Cloud Computing
Antworten
  • is a specialized form of distributed computing that introduces utilization models for remotely provisioning scalable and measured IT resources
  • Big Data solutions can be partially or fully deployed in clouds in order to leverage the storage and computing resources that are available from the cloud provider
  • Data samples are typically used
  • It can also represent hierarchical values by using color-coded nested rectangles

Frage 168

Frage
Cloud Computing
Antworten
  • the clustered processing resources required by Big Data solutions can benefit from the highly scalable and elastic IT resources available on cloud-based environments
  • Hadoop's batch-based data processing fully lends itself to the pay-per-use model of __________, which can reduce operational costs since a typical Hadoop cluster size can range from a few to a few thousand nodes
  • is generally applied via the following two approaches: collaborative ____________ and content-based ____________
  • In short, _______ provides the three ingredients required for a big data solution: input data, computing and storage

Frage 169

Frage
It makes sense for enterprises already using cloud computing to reuse the cloud for their Big Data initiatives, because:
Antworten
  • IT already possesses the required cloud computing skills
  • the input data already exists in the cloud
  • Correlation, regression, time series analysis, network analysis and spatial data analysis are considered forms of _________
  • In the context of traditional data analysis, the ______ states that, starting with a reasonably large sample size, the value obtained from the analysis of additional data decreases as more data is successively added to the original sample

Frage 170

Frage
Cloud Computing
Antworten
  • Migrating to the cloud is logical for enterprises planning to run analytics on datasets that are available via data markets, as most data markets store their data in the cloud such as Amazon S3
  • Workflow Engine
  • not only provides information about how individuals feel, but also the intensity of their feeling
  • This can reveal the nature of the dataset or the cause of a phenomenon

Frage 171

Frage
Big data Mechanisms
Antworten
  • Big Data solutions require a distributed processing environment that can accommodate large-scale data volumes, velocity and variety
  • This type of environment is provided by a platform that is comprised of a set of distributed storage and processing technologies
  • it can be beneficial to identify as many types of related data sources and insights as possible, especially when we don't know exactly what we're looking for
  • Is solely dedicated to individual user preferences and does not require data about other users

Frage 172

Frage
Big data Mechanisms
Antworten
  • represents the primary, common components of big data solutions, regardless of the open source or vendor products used for implementation
  • Storage Device
  • Instead of coloring the whole region, the map may be superimposed by a layer made up of collections of colored shapes representing various regions
  • Query Engine

Frage 173

Frage
Big data Mechanisms
Antworten
  • Processing Engine
  • Resource Manager
  • Applications of __________ include operations and logistic optimization, environmental sciences and infrastructure planning
  • is generally used in data mining to get an understanding of the properties of a given dataset. After developing this understanding, classification can be used to make better predictions about similar, but new or unseen data

Frage 174

Frage
Big data Mechanisms
Antworten
  • Data Transfer Engine
  • Analytics Engine
  • is closely related to parallel data processing in how the same principle of "divide-and-conquer" is applied
  • The data collected for _______ is always time-dependent

Frage 175

Frage
Big data Mechanisms
Antworten
  • Workflow Engine
  • Coordination Engine
  • Recommender systems may also be based on a hybrid of both collaborative _______ and content-based _______ to fine-tune the accuracy and effectiveness of generated suggestions
  • There is no prior learning of categories required; instead, categories are implicitly generated based on the data groupings

Frage 176

Frage
At minimum, any given big data solution needs to contain the _____, ______ and _______ mechanisms in order to effectively process large datasets in support of the big data analysis lifecycle
Antworten
  • processing engine, storage device, resource manager
  • storage device, analytics engine, coordination engine
  • processing engine, query engine, data transfer engine
  • resource manager, analytics engine, workflow engine

Frage 177

Frage
Storage Device
Antworten
  • ___________ mechanisms provide the underlying data storage environment for persisting the datasets that are processed by big data solutions
  • A ________ is a method of storing and organizing data on a storage medium, such as hard drives, DVDs, and flash drives
  • A _______ can exist as a distributed file system or a database
  • The ability to analyze massive amounts of data and find useful insights carries little value if the only ones that can interpret the results are the analysts

Frage 178

Frage
Storage Device
Antworten
  • Distributed file systems can be used for persisting immutable data that is intended for streaming access or batch processing
  • Databases, such as NoSQL repositories, can be used for structured and unstructured storage and read/write data access
  • Note that distributed file systems and databases are both on-disk _________ mechanisms
  • Natural Language Processing

Frage 179

Frage
Processing Engine
Antworten
  • The ________ mechanism is responsible for processing data (usually retrieved from storage devices) based on pre-defined logic, in order to produce a result
  • Any data processing that is requested by the big data solution is fulfilled by the __________
  • Whether _____________ is required or not, it is important to understand that the same data can be stored in many different forms. One form may be better suited for a particular type of analysis than another
  • Hadoop's batch-based data processing fully lends itself to the pay-per-use model of __________, which can reduce operational costs since a typical Hadoop cluster size can range from a few to a few thousand nodes

Frage 180

Frage
Processing Engine
Antworten
  • A big data _________ utilizes a distributed parallel programming framework that enables it to process very large amounts of data distributed across multiple nodes
  • require processing resources that they request from the resource manager
  • Classification
  • are usually used for forecasting by identifying long-term trends, seasonal periodic patterns and irregular short-term variations in the dataset

Frage 181

Frage
Batch Processing Engine
Antworten
  • Provides support for batch data where processing tasks can take anywhere from minutes to hours to complete. This type of processing engine is considered to have high latency
  • The identified patterns, correlations and anomalies discovered during the data analysis are used to refine business processes
  • When one variable increases, the other also increases and vice versa
  • Operating systems employ ______ for data storage. Each operating system provides support for one or more ________, like NTFS for Windows and ext for Linux

Frage 182

Frage
Realtime Processing Engine
Antworten
  • Provides support for realtime data with sub-second response times. This type of processing engine is considered to have low latency
  • To the client, a file appears local and can be accessed via multiple locations
  • Migrating to the cloud is logical for enterprises planning to run analytics on datasets that are available via data markets, as most data markets store their data in the cloud such as Amazon S3
  • depending on the type of data source, data may come as a dump of files (such as data purchased from a third-party data provider), or may require API integration (such as with Twitter)

Frage 183

Frage
Resource Manager
Antworten
  • Users of Big Data solutions can make numerous data processing requests, each of which can have different processing workload requirements
  • Data that is held in storage can be processed in a variety of ways by a given Big Data solution and all data processing requests require the allocation of processing resources
  • it involves plotting entities as nodes and connections as edges between nodes
  • the system is fed data that is already categorized or labeled, so that it can develop an understanding of the different categories

Frage 184

Frage
Resource Manager
Antworten
  • A _______ acts as a scheduler that schedules and prioritizes processing requests according to individual processing workload requirements
  • The _____ essentially acts as a resource arbitrator that manages and allocates available resources
  • A value that is labelled differently in two different datasets may mean the same thing
  • The proposed cause or assumption is called a ____________
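
Scheduling and prioritizing processing requests by workload type can be sketched with a priority queue. The class name, the request names and the rule that transactional requests run before batch requests are assumptions made for this example; actual resource managers apply considerably richer policies.

```python
import heapq
from itertools import count

# Assumed policy: lower number = served first.
PRIORITY = {"transactional": 0, "batch": 1}


class ResourceManager:
    """Toy scheduler that orders requests by workload priority, then arrival."""

    def __init__(self):
        self._queue = []
        self._arrival = count()  # tie-breaker that preserves submission order

    def submit(self, name, workload):
        """Queue a processing request with its workload type."""
        entry = (PRIORITY[workload], next(self._arrival), name)
        heapq.heappush(self._queue, entry)

    def next_request(self):
        """Return the next request to receive resources, or None if idle."""
        if not self._queue:
            return None
        return heapq.heappop(self._queue)[2]


manager = ResourceManager()
manager.submit("nightly_report", "batch")
manager.submit("order_lookup", "transactional")
manager.submit("balance_check", "transactional")
print(manager.next_request())
```

The low-latency requests are served ahead of the batch job even though the batch job was submitted first.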

Frage 185

Frage
Data Transfer Engine
Antworten
  • Data needs to be imported before it can be processed by the big data solution
  • Similarly, processed data may need to be exported to other systems before it can be used outside of the big data solution
  • this form of analysis determines the attitude of the author (of the text) by analyzing the text within the context of the natural language
  • Text Analytics

Frage 186

Frage
Data Transfer Engine
Antworten
  • A ________ engine enables data to be moved in or out of big data solution storage devices
  • Unlike other data processing systems where input data conforms to a schema and is mostly structured, data sources for a big data solution tend to include a mix of structured and unstructured data
  • is dedicated to determining how and where processed analysis data can be further leveraged
  • A given ______ may support either data ingress or egress functions

Frage 187

Frage
Data Transfer ingress and egress
Antworten
  • functionality can be further grouped into the following categories: event, file, relational
  • the processing engine mechanism will often use the ___________ to coordinate data processing across a large number of servers. This way, the processing engine does not require its own coordination logic
  • can help enable a better understanding of what a phenomenon is, and why it occurred
  • therefore, it is advisable to store a verbatim copy of the original dataset before proceeding with the filtering. To save on required storage space, the verbatim copy is compressed before storage

Frage 188

Frage
Data Transfer Engine
Antworten
  • A ________ generally provides only one of the listed functions
  • It is common for multiple different ________ to be part of a big data solution to facilitate a range of import and export requirements for different types of data
  • A _______ provides a logical view of the data stored on the storage medium as a tree structure of files and directories
  • are based on predictive analytics techniques and therefore are associated with the same analysis techniques as predictive analytics. Additionally, _____ may utilize heat maps, network analysis and spatial data analysis to graphically show various outcomes

Frage 189

Frage
Data Transfer Ingress Engine
Antworten
  • Event-based __________ generally use a publish-subscribe model based on the use of a queue to ensure high reliability and availability
  • A file is an atomic unit of storage used by the _________ to store data. Files are organized inside a directory
  • it can be based on either supervised or unsupervised learning
  • The data collected for _______ is always time-dependent
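
The queue-backed publish-subscribe model mentioned for event-based ingress above can be sketched in a few lines. This is an in-memory illustration only: the `Broker` class, the topic name and the subscriber name are hypothetical, and a real ingress engine would add persistence and replication to achieve the reliability the answer refers to.

```python
from collections import defaultdict, deque


class Broker:
    """Toy publish-subscribe broker with one queue per subscriber and topic."""

    def __init__(self):
        # topic -> {subscriber name -> queue of pending events}
        self._queues = defaultdict(dict)

    def subscribe(self, topic, subscriber):
        self._queues[topic].setdefault(subscriber, deque())

    def publish(self, topic, event):
        # Every subscriber of the topic receives its own copy of the event.
        for queue in self._queues[topic].values():
            queue.append(event)

    def poll(self, topic, subscriber):
        # Events are delivered in publication order; None means nothing pending.
        queue = self._queues[topic][subscriber]
        return queue.popleft() if queue else None


broker = Broker()
broker.subscribe("clicks", "ingress-engine")
broker.publish("clicks", {"user": 1})
broker.publish("clicks", {"user": 2})
first = broker.poll("clicks", "ingress-engine")
second = broker.poll("clicks", "ingress-engine")
third = broker.poll("clicks", "ingress-engine")
```

Because the queue buffers events, the publisher does not need the subscriber to be ready at the moment an event is produced.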

Frage 190

Frage
Data Transfer Engine
Antworten
  • These engines may provide the agent-based processing of inflight data, which enables various data cleansing and transformation activities to be performed in realtime
  • enable the substitution of data that is distributed across a range of sources residing in multiple systems outside of the big data solution
  • Heat Maps
  • is manipulated through a geographical information system (GIS) that plots spatial data on a map generally using its longitude and latitude coordinates

Frage 191

Frage
Data Transfer Engine
Antworten
  • A _______ may internally use a processing engine to process multiple large datasets in parallel
  • This allows large amounts of data to be imported or exported within a short period of time
  • Note that a workflow engine may provide integration with a _______ to enable the automated import and export of data
  • is a computer's ability to comprehend human speech and text as naturally understood by humans

Frage 192

Frage
Query Engine
Antworten
  • The processing engine enables data to be queried and manipulated in other ways, but implementing this type of functionality requires custom programming
  • Analysts working with big data solutions are not expected to know how to program processing engines
  • this form of analysis determines the attitude of the author (of the text) by analyzing the text within the context of the natural language
  • the extracted information can be used to perform a context-specific search on entities, based on the type of relationship that exists between the entities

Frage 193

Frage
Query Engine
Antworten
  • The _______ mechanism abstracts the processing engine from end-users by providing a front-end user-interface that can be used to query underlying data, along with features for creating query execution plans
  • Languages that are more familiar and easier to work with (such as SQL) can be used by non-technical users to perform ETL tasks and run ad hoc queries for data analysis activities
  • this helps maintain data provenance throughout the big data analysis lifecycle, which helps establish and preserve data accuracy and quality
  • Either way, a method of data reconciliation is required or the dataset representing the correct value needs to be determined

Frage 194

Frage
Query Engine
Antworten
  • Common processing functions performed by a ______ include sum, average, group by, join and sort
  • Under the hood, the ________ seamlessly transforms user queries into the relevant low-level code that can be used by the processing engine
  • The use of ________ can reduce development time and enables the manipulation of large datasets without the need to write complex programming logic
  • based on the business requirements documented, it can be determined whether the business problems being addressed are really Big Data problems
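
To make the processing functions listed above (sum, average, group by, join and sort) concrete, here is a minimal sketch of the kind of low-level logic a query engine might generate from a high-level query. The `orders` and `customers` data and the `summarize` function are invented for illustration and are not taken from any specific query engine.

```python
from collections import defaultdict

# Hypothetical sample data standing in for two stored datasets.
orders = [
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 2, "amount": 20.0},
    {"customer_id": 1, "amount": 10.0},
]
customers = {1: "Ana", 2: "Ben"}


def summarize(orders, customers):
    # group by: collect order amounts per customer
    groups = defaultdict(list)
    for order in orders:
        groups[order["customer_id"]].append(order["amount"])

    rows = []
    for customer_id, amounts in groups.items():
        total = sum(amounts)                   # sum
        rows.append({
            "name": customers[customer_id],    # join against the customers data
            "total": total,
            "average": total / len(amounts),   # average
        })

    # sort: largest total first
    rows.sort(key=lambda row: row["total"], reverse=True)
    return rows


result = summarize(orders, customers)
```

A query engine lets an analyst express the same request in a few lines of a language such as SQL instead of writing this logic by hand.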

Frage 195

Frage
Analytics Engine
Antworten
  • The ________ mechanism is able to process advanced statistical and machine learning algorithms in support of analytics processing requirements, including the identification of patterns and correlations
  • It generally uses the processing engine mechanism to run algorithms on large datasets.
  • A _______ database generally provides an API-based query interface, rather than the SQL Interface
  • A ________ generally provides only one of the listed functions

Frage 196

Frage
Analytics Engine
Antworten
  • An ________ is employed when the comparatively simple data manipulation functions of a query engine are insufficient
  • Some proprietary ________ also provide specialized data analysis features, such as text analytics and machine log analysis processing
  • How the data is grouped depends on the type of algorithm used. Each algorithm uses a different technique to identify ______
  • This is a traditional data analysis principle that claims that data held in a reasonably sized dataset provides the maximum value

Frage 197

Frage
Workflow Engine
Antworten
  • A ___________ provides the ability to design and process a complex sequence of operations that can be triggered either at set time intervals or when data becomes available
  • The workflow logic processed by a _____________ mechanism can involve the participation of other big data mechanisms
  • Strategic BI and analytics fall in this category, since they are highly read intensive tasks involving large volumes of data
  • Law of large numbers

Frage 198

Frage
Workflow Engine
Antworten
  • For example, a __________ can execute logic that collects relational data from multiple databases at regular intervals via the data transfer engine mechanism, applies a set of ETL operations via the processing engine mechanism and finally persists the results to a NoSQL storage device
  • The defined workflows are analogous to a flowchart with control logic (such as decisions, forks, joins) and generally rely on a batch-style processing engine for execution
  • The output of one workflow can become the input of another workflow
  • it can be based on either supervised or unsupervised learning
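
The workflow example above (collect relational data, apply ETL operations, persist the results) can be sketched as a chain of steps in which each step's output feeds the next. All function names and the in-memory `store` are stand-ins invented for illustration; they only mimic the roles of the data transfer engine, the processing engine and a NoSQL storage device.

```python
def run_workflow(steps, data=None):
    """Run steps in order; the output of one step is the input of the next."""
    for step in steps:
        data = step(data)
    return data


def collect(_):
    # Stand-in for the data transfer engine pulling rows from databases.
    return [{"id": 1, "name": " ana "}, {"id": 2, "name": "BEN"}]


def transform(rows):
    # Stand-in for ETL operations run by the processing engine.
    return [{"id": r["id"], "name": r["name"].strip().title()} for r in rows]


store = {}  # stand-in for a NoSQL storage device


def persist(rows):
    for row in rows:
        store[row["id"]] = row
    return len(rows)


persisted = run_workflow([collect, transform, persist])
```

A real workflow engine would add triggers (time intervals or data availability) and control logic such as decisions, forks and joins around this linear chain.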

Frage 199

Frage
Coordination Engine
Antworten
  • A distributed Big Data solution that needs to run on multiple servers relies on the coordination engine mechanism to ensure operational consistency across all of the participating servers
  • make it possible to develop highly reliable, highly available distributed big data solutions that can be deployed in a cluster
  • A model looks like a mathematical equation or a set of rules
  • data that appears to be invalid may still be valuable in that it may possess hidden patterns and trends

Frage 200

Frage
Coordination Engine
Antworten
  • the processing engine mechanism will often use the ___________ to coordinate data processing across a large number of servers. This way, the processing engine does not require its own coordination logic
  • The ________ mechanism can also be used to support distributed locks, support distributed queues, establish a highly available registry for obtaining configuration information, and provide reliable asynchronous communication between processes that are running on different servers
  • in the case of realtime analytics, the data is analyzed first and then persisted to disk
  • Big Data solutions require a distributed processing environment that can accommodate large-scale data volumes, velocity and variety