CSC8641 : Big Data Analytics
- Offered for Year: 2024/25
- Module Leader(s): Dr Yinhao LI
- Lecturer: Dr Daniel Sun
- Teaching Assistant: Mr Iain Dixon
- Owning School: Computing
- Teaching Location: Newcastle City Campus
Semesters
Your programme is made up of credits; the total differs from programme to programme.
- Semester 1 Credit Value: 10
- ECTS Credits (European Credit Transfer System): 5.0
Aims
The aim of the module is to introduce students to the complex combination of data engineering technology and data science that makes it possible to extract valuable knowledge from “Big Data”. A number of technical challenges arise from the high volume, high diversity (heterogeneity of meaning and format) and variable quality of the data. A distinction is made between data that is stationary (residing in a data repository) and data that is in motion (data streaming, as produced for instance by sensors), with further emphasis on graph data structures.
The module will focus on the following aspects:
- Distribution of data processing over a cluster of computing nodes, hosted in a cloud environment, as a way to scale out computing resources as the size of the data to be processed increases. This includes current frameworks for massively parallel data processing, notably Spark, which is the most successful example of a cloud-based distributed programming platform, and possibly Dask, its direct competitor.
- Examples of algorithms that can be successfully parallelised and are thus able to take advantage of distributed data architectures.
- Models of computation that enable near-real-time analytics on data streams.
- Specialised data structures, specifically graphs. The module covers the basics of graph databases (Neo4j) as well as massively parallel graph algorithms, i.e. those implemented using the Pregel framework.
- Examples of data science applications, including Machine Learning algorithms that are enabled by Big Data technology.
Emphasis is also placed on the rapid pace of technology advances in this area, and cutting-edge further reading material is offered for in-depth learning and deep-dives into specific topics.
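As an illustration of the kind of algorithm that parallelises well, the following is a minimal sketch of a MapReduce-style word count in plain Python. It is not taken from the module material. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for this example, and everything runs in a single process; a framework such as Spark would distribute the map and reduce steps over cluster nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts for each word; each key could be reduced independently."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # Each chunk is mapped independently, which is what makes the
    # computation easy to spread over many workers.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(mapped))


# Hypothetical input: two chunks that could live on different nodes.
chunks = ["big data big clusters", "data streams and big graphs"]
counts = word_count(chunks)
print(counts)
```

Because the map step touches one chunk at a time and the reduce step touches one key at a time, both can be scaled out simply by adding workers.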
Outline Of Syllabus
1. Introduction to Data Science and Data Analytics. Scalability, efficiency of parallel processing.
2. Batch Big Data Processing (MapReduce)
3. Computing environments for Big Data Analytics and Machine Learning:
• Big Data platforms (Hortonworks, Cloudera), Spark
4. Data Stream processing: Overview of real-time Event Processing and querying
5. Graph data processing: Examples of algorithms for graph analytics, graph databases and query languages (GDBMS), massively parallel graph processing models
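To give a flavour of the massively parallel graph processing model in item 5, the following is a minimal sketch of a Pregel-style (vertex-centric, superstep-based) single-source shortest-path computation in plain Python. It is an illustrative assumption, not the module's own code and not the API of any real Pregel implementation. The function name `pregel_sssp`, the adjacency-list format and the sample graph are hypothetical, and edge weights are assumed to be non-negative.

```python
import math


def pregel_sssp(edges, source):
    """Vertex-centric shortest paths.

    edges maps each vertex to a list of (neighbour, weight) pairs.
    Returns a dict of shortest distances from source.
    """
    vertices = set(edges)
    for neighbours in edges.values():
        vertices.update(n for n, _ in neighbours)

    distance = {v: math.inf for v in vertices}
    inbox = {source: [0.0]}  # messages delivered at the next superstep

    # One loop iteration corresponds to one superstep. Within a superstep
    # every vertex only reads its own messages and state, so the vertices
    # could be processed in parallel on different workers.
    while inbox:
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < distance[vertex]:
                distance[vertex] = best
                # Only vertices whose value improved send messages on.
                for neighbour, weight in edges.get(vertex, []):
                    outbox.setdefault(neighbour, []).append(best + weight)
        inbox = outbox  # the computation halts when no messages remain
    return distance


# Hypothetical directed, weighted graph.
graph = {
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("C", 2.0)],
    "C": [("D", 1.0)],
}
distances = pregel_sssp(graph, "A")
print(distances)
```

The design point is that each vertex communicates only through messages exchanged between supersteps, which is what allows the graph to be partitioned across a cluster.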
Teaching Methods
Teaching Activities
Category | Activity | Number | Length | Student Hours | Comment |
---|---|---|---|---|---|
Guided Independent Study | Assessment preparation and completion | 40 | 1:00 | 40:00 | Independent programming / coursework development & in class test |
Scheduled Learning And Teaching Activities | Lecture | 4 | 2:00 | 8:00 | Online / in-class sessions. These are “flipped lectures” (see rationale below). Online synchronous |
Guided Independent Study | Directed research and reading | 12 | 1:00 | 12:00 | Pre-recorded lectures or other teaching material to watch / listen to ahead of class, with exercises |
Scheduled Learning And Teaching Activities | Drop-in/surgery | 4 | 3:00 | 12:00 | Online / in-lab time with demonstrators. PIP |
Guided Independent Study | Independent study | 28 | 1:00 | 28:00 | In proportion to directed study time (2:1) – to prepare for next class |
Total | | | | 100:00 | |
Teaching Rationale And Relationship
The learning experience is organized into two parts with roughly equal weight:
1. Theory (50 hours). This follows the paradigm watch-study-engage. Lectures will be used to introduce the learning material and to demonstrate the key concepts by example. Selected lectures will be pre-recorded to enable the class to be “flipped” during scheduled lecture time. For these lectures, students will be expected to follow the recording ahead of time (Structured Guided Learning) and then engage in Q&A during online / PIP class time. Students are also expected to address specific topics in depth and independently (Directed research and reading) as part of this.
2. Practical programming. Workshops are offered to introduce the computational environment(s), along with weekly Drop-in/Surgery hours to help solve practical problems. The bulk of the time for this part is for independent study and programming.
Assessment Methods
The format of resits will be determined by the Board of Examiners.
Other Assessment
Description | Semester | When Set | Percentage | Comment |
---|---|---|---|---|
Report | 1 | M | 100 | extended technical project |
Assessment Rationale And Relationship
The assessment structure is designed to
- promote a deep understanding of the lecture material through assessed exercises
- encourage students to engage with one or more programming environments, which may be new to them, and develop practical problem-solving skills to address specific programming challenges
Reading Lists
Timetable
- Timetable Website: www.ncl.ac.uk/timetable/
- CSC8641's Timetable