Question
The collection, storage and sharing of data is becoming increasingly important for organizations who have a choice about which type of database to use to store their data. Two examples of database types are relational and object-oriented.
The 2016 US presidential election was seen to be a victory for data analytics. Companies that specialize in analytics use data warehouses.
a. Explain two advantages of using a relational database rather than an object-oriented database.[4]
b. State two characteristics of a data warehouse.[2]
c. Outline why data needs to be transformed before it can be loaded into the data warehouse.[2]
d. Outline why opinion poll data and other election data are timestamped when added to the data warehouse.[2]
e. Outline why analytics companies use link analysis.[2]
f. Outline why analytics companies use deviation detection.[2]
g. Once data has been loaded into a data warehouse it can be mined. The use of data analytics is believed to have been important to the outcome$[6]$
of the US election campaign.
Discuss whether the advantages of data mining techniques in this scenario outweigh the disadvantages.
▶️Answer/Explanation
Ans:
a. )
Standards and support are available for RDB…
…make it more stable / easier to resolve issues / easier to recruit staff;
More user tools exist for RDBs…
…such as report generators / mail merge / security level permissions / concurrent access.
Easier to visualise data and relationships…
…so more likely to have a correctly modelled database i.e. has no redundancy / improved integrity
RDB tables and relationships are simple to implement…
…an OODB requires an understanding of concepts of OOP;
b. )
Repository data stored are historical/time variant;
Data is collected from different sources;
OLAP systems for reporting and data analysis (e.g. data mining) / provides businesses with information for informed decisions;
Provides tools so that data can be validated, reformatted, reorganized, summarized, and restructured;
Optimised for data retrieval;
c. )
Data is from my different external sources and (therefore) in many different formats;
For example, dates may be $d d / \mathrm{mm} / \mathrm{yy}$ or $\mathrm{mm} / \mathrm{dd} / \mathrm{yy}$ or $y \mathrm{y} / \mathrm{mm} / \mathrm{dd}$ (allow any valid example);
To allow meaningful analysis, it must be in the same format/standardised;
d. )
Notes: maximum marks only if reference is made to the scenario
Do not award marks just for a description of time-stamping
The usefulness of information is often time dependent;
Example relating to the US presidential election, such as
Electoral opinions before a public debate may have less value than those after the debate;
e. )
Notes: don’t accept the word “link” on its own as a descriptor.
maximum marks only if reference is made to the scenario.
They use link analysis in order to establish relationships / associations between different data sets / different entities in the same data set;
Examples relating to the US presidential election, such as:
How people voted in relation to some other factor, e.g. the level of use of social media / where they took their vacations / size of family …;
f. )
Note: maximum marks only if reference is made to the scenario
They look for any unusual activity (anomaly pattern) in transactions;
Examples relating to the US presidential election, such as
Unusual switch in pre-electoral voting opinions;
Sudden pro-candidate or anti-candidate sentiment in a particular state;
g. )
Mark as follows:
Award [1] for a generic advantage of data mining;
Award [1] for expanding on this advantage;
Award [1] for linking this to the scenario;
Similarly for a disadvantage;
Award [1] for a valid conclusion;
Advantages of data mining [3 max].
Clustering / cluster analysis allows objects to be treated as one group enabling the uncovering of previously hidden patterns;
For example, cluster analysis may search groups by race or gender to discover if a candidate is unpopular with a demographic;
Classification methods (e.g. genetic, rough set, fuzzy set) can be used to recognize patterns that describe the groups to which an item belongs;
For example, classifying voters by income may provide useful information that can affect future publicity strategies;
Association analysis allows a series of statistical relationships to be further explored or tested;
Associations look for If-then rules that predict a particular stance on a controversial topic (e.g. abortion) may influence the religious voters;
Disadvantages of data mining [ 3 max].
Data mining is based on the data collected from individuals;
This data may be sensitive personal information that the individual concerned may not want to be shared;
This personal data may be reaggregated to compromise the privacy and/or anonymity of the data subjects;
Conclusions $[1 \max ]$
The development of more sophisticated processing algorithms is inevitable, so although there are potential concerns about the invasive nature of data mining, providing sufficient safeguards are put in place, there is nothing inherently wrong with this;
Data mining is the start of the slippery slope of the state or multinational companies holding inappropriate quantities of personal data about citizens that is of limited value. Therefore, unless the privacy and/or anonymity of the data subjects can be guaranteed, this is an unethical practice;
Question
Basking Coats is a business that sells textile products such as shirts, coats and trousers. The company was formed in 1970 and has numerous shops in Europe and South-East Asia. To ensure their marketing is targeted at appropriate customers, Basking Coats has asked Singalytics, a data analytics company, to assist them in improving their marketing strategy.
Extract, Transform, Loading (ETL) processes can be used to clean up data for use in a database warehouse. When ETL is carried out, certain precautions should be taken.
Data in the Singalytics data warehouse is stored with a timestamp.
a. Describe how deviation detection can be used to analyse this data.[2]
b. Outline why data warehouses tend to use unnormalized data sets.[3]
c. Identify three precautions to be taken before extraction is carried out on the database.[3]
d. Outline why data warehousing is time dependent.[2]
e. Explain why Basking Coats could use association analysis to improve the marketing of its products.[4]
f. Basking Coats has decided to use an object-oriented database rather than a relational database to store its data.[6]
Explain why Basking Coats would use an object-oriented database rather than a relational database to store its data.
▶️Answer/Explanation
Ans:
a. )
Deviation detection is a statistical technique;
Appropriate marketing of product;
Which is used to detect outlying data that does not fit the assumed model;
Therefore it can be used to predict the trends and patterns of demand for certain consumer goods in the future;
b.
To speed query execution, which can be especially important in data warehouses used by Singalytics;
May effectively be used in data warehouses as they contain pre-joined tables that package data for common uses;
If the data is all present in a single table, there will be no need for joins, hence the selects can be done very quickly;
A single table with all the required data allows much more efficient index usage;
If there is heavy read load and when the application is read intensive;
If the columns are indexed properly, then results can be filtered and sorted by utilizing the same index;
c.
To retrieve all the required data from the source system with as little resources as possible;
Designed in a way that it does not affect the source system;
In terms or performance/ response time/ locking;
Makes it accessible for processing on the data;
Ensuring that historical data being extracted can be read by the current systems;
Ensuring the different data formats being extracted can all be converted or scrubbed to become readable by the system and able to be formatted;
Ensuring that the data is relevant to what the user wishes to extract and utilise;
d. )
The content in the data warehouse is only valid for a time period;
Because the data undergoes changes dynamically;
A data warehouse’s focus on change over time is time variant;
e. )
Award [2 max] for the explanation and [2 max] for the example(s).
Associations:
Correlate the presence;
of a set of items with another range of values for another set of variables;
Breaks up data sets by variables such as gender, location, age;
It may be used to detect patterns independently from the geographic region, females buy more dark colour trousers than males;
Examples:
when a female retail shopper buys a cotton shirt, she is likely to buy a stole.
Associations of the type Full arm formal shirts => Dark colour trousers; Full arm formal shirts => cufflinks may produce enough confidence and support to be valid association rules of interest;
If the application area has a natural classification of the item sets into hierarchies,
discovering associations within the hierarchies is of no particular interest;
Specifically its associations across hierarchies;
f. )
Award [3 max] for the feature(s) and [3 max] for supporting explanation/examples.
Features:
Enhanced modelling capabilities;
Extensibility/support new data types;
Object DBMS stores more complex data and relationships;
Improved performance;
Reusability;
Eliminates need for user defined keys;
Eliminates need for Joins;
Examples:
Inheritance property, we can re-use the attributes (size, fabric) and functionalities;
It reduces the cost of maintaining the same data multiple times;
All this information is encapsulated and, there is no fear being misused by other objects. If we need any new feature we can easily add new class inherited from parent class and that adds new features;
Reduces the overhead and maintenance costs;
So it becomes more flexible if any changes and apparel business changes with fashion needs;
Codes are re-used because of the inheritance feature;