CPTS415 Study Guide: Big Data, Database Design & OpenFlights

Understanding Big Data Through the Five V's

Big Data is everywhere—from social media trends to flight tracking apps. In CPTS415, you'll often analyze datasets like the OpenFlights Airport Database. Let's break down the Five V's using a familiar example: a flight booking app like Skyscanner or Google Flights.

Volume: Millions of flights and airports worldwide—OpenFlights alone has over 10,000 entries.
Velocity: Real-time flight status updates, weather changes, and pricing fluctuations.
Variety: Structured data (airport codes, coordinates), semi-structured (JSON weather feeds), and unstructured (customer reviews).
Veracity: Data quality issues—some airports have missing IATA codes or conflicting timezones.
Value: Deriving insights like optimal routes, delay patterns, or fuel efficiency.

For such an application, a relational database (e.g., PostgreSQL) works best for structured flight schedules, while RDF could model complex relationships like airline alliances. But for most CPTS415 tasks, relational is the go-to.

Relational Data Model: OpenFlights Airport Schema

Consider the OpenFlights Airport table. A relation schema defines attributes like Airport_ID, Name, City, Country, IATA, ICAO, Latitude, Longitude, Altitude, Timezone, DST, Tz_database_time_zone, Type, Source. The domain for IATA is 3-letter uppercase strings; for Latitude, it's decimal degrees. A relation instance is a set of tuples, e.g.:

Airport_ID | Name           | City        | Country | IATA | ICAO | Latitude | Longitude
1          | Los Angeles    | Los Angeles | USA     | LAX  | KLAX | 33.9425  | -118.408
2          | John F Kennedy | New York    | USA     | JFK  | KJFK | 40.6397  | -73.7789
3          | Heathrow       | London      | UK      | LHR  | EGLL | 51.4700  | -0.4543
4          | Narita         | Tokyo       | JP      | NRT  | RJAA | 35.7647  | 140.3864
5          | Dubai Intl     | Dubai       | AE      | DXB  | OMDB | 25.2528  | 55.3644

Now, the full OpenFlights dataset includes Airport, Airline, and Route tables. Primary keys: Airport_ID, Airline_ID, and a composite for Route (Airline_ID, Source_Airport_ID, Destination_Airport_ID). Foreign keys: Route.Source_Airport_ID → Airport.Airport_ID, etc. Functional dependencies include Airport_ID → Name, City, Country and IATA → Airport_ID (if IATA is unique).
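
The keys described above can be tried out with Python's built-in sqlite3 module. This is a minimal sketch, not the official OpenFlights schema: the column types, the trimmed column list, and the airline row "Example Air" are assumptions made for illustration.

```python
import sqlite3

# Hypothetical, trimmed-down schema: column names follow the text above,
# but the types and the choice of columns are assumptions for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
CREATE TABLE Airport (
    Airport_ID INTEGER PRIMARY KEY,
    Name       TEXT NOT NULL,
    City       TEXT,
    Country    TEXT,
    IATA       TEXT UNIQUE
);
CREATE TABLE Airline (
    Airline_ID INTEGER PRIMARY KEY,
    Name       TEXT NOT NULL
);
CREATE TABLE Route (
    Airline_ID             INTEGER REFERENCES Airline(Airline_ID),
    Source_Airport_ID      INTEGER REFERENCES Airport(Airport_ID),
    Destination_Airport_ID INTEGER REFERENCES Airport(Airport_ID),
    PRIMARY KEY (Airline_ID, Source_Airport_ID, Destination_Airport_ID)
);
""")
conn.execute("INSERT INTO Airport VALUES (1, 'Los Angeles', 'Los Angeles', 'USA', 'LAX')")
conn.execute("INSERT INTO Airport VALUES (2, 'John F Kennedy', 'New York', 'USA', 'JFK')")
conn.execute("INSERT INTO Airline VALUES (10, 'Example Air')")  # made-up airline
conn.execute("INSERT INTO Route VALUES (10, 1, 2)")

def insert_fails(sql):
    """Return True if the statement is rejected by a key constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

fk_violation = insert_fails("INSERT INTO Route VALUES (10, 1, 999)")  # no airport 999
pk_violation = insert_fails("INSERT INTO Route VALUES (10, 1, 2)")    # duplicate route key
route_count = conn.execute("SELECT COUNT(*) FROM Route").fetchone()[0]
```

The UNIQUE constraint on IATA is what would make IATA → Airport_ID hold in this sketch; without it the dependency is only a hope about the data.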

Inferring Functional Dependencies with Armstrong's Axioms

Using Armstrong's rules, we can derive new FDs. For example, from FD1: Airport_ID → City and FD2: City → Country, by transitivity we get Airport_ID → Country. (FD2 is assumed here for illustration; in real data the same city name can occur in several countries.) Or, from Airport_ID → Name, by augmentation with IATA we get Airport_ID, IATA → Name, IATA.

Now prove the decomposition rule: If X → YZ, then X → Y and X → Z. Proof: Given X → YZ. By reflexivity, YZ → Y (since Y ⊆ YZ). Then by transitivity, X → Y. The same argument with YZ → Z gives X → Z.

For pseudo-transitivity: If X → Y and YW → Z, then XW → Z. Proof: From X → Y, augment with W to get XW → YW. Then with YW → Z, by transitivity XW → Z.
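
Derivations like these can be checked mechanically with the attribute-closure algorithm: X → Y follows from a set of FDs exactly when Y is contained in the closure of X. The sketch below is one straightforward way to write it; the function names closure and implies are chosen here for illustration, and City → Country is taken as given, as in the example.

```python
def closure(attrs, fds):
    """Return the closure of `attrs` under `fds`, a list of (lhs, rhs) frozensets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Fire an FD when its left side is covered and it adds something new.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def implies(fds, lhs, rhs):
    """True if lhs -> rhs follows from fds by Armstrong's axioms."""
    return set(rhs) <= closure(lhs, fds)

fds = [
    (frozenset({"Airport_ID"}), frozenset({"City"})),
    (frozenset({"City"}), frozenset({"Country"})),
]

derived = implies(fds, {"Airport_ID"}, {"Country"})   # transitivity example
not_derived = implies(fds, {"City"}, {"Airport_ID"})  # not derivable
```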

Normalization: 3NF and BCNF Example

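Given R(A1, A2, A3, A4) with FDs: A2A3 → A4; A3A4 → A1; A1A2 → A3. First find the candidate keys. A2 never appears on a right-hand side, so every key must contain it. {A1, A2}+: A1A2 → A3 (given), then A2A3 → A4, so {A1, A2}+ = R. {A2, A3}+: A2A3 → A4, then A3A4 → A1, so {A2, A3}+ = R. {A2}+ = {A2} and {A2, A4}+ = {A2, A4}, so the candidate keys are {A1, A2} and {A2, A3}. The prime attributes are A1, A2, A3; A4 is the only non-prime attribute.

3NF: for every nontrivial FD X → A, either X is a superkey or A is prime. A2A3 → A4 and A1A2 → A3 have keys on the left. A3A4 → A1 does not, since {A3, A4}+ = {A1, A3, A4}, but A1 is prime. So 3NF holds.

BCNF: every nontrivial FD must have a superkey on the left, so A3A4 → A1 violates BCNF. Decompose on it: R1(A3, A4, A1) and R2(A2, A3, A4), where R2 is R without A1. The decomposition is lossless because the shared attributes A3A4 form a key of R1. In R1 the only nontrivial FD is A3A4 → A1, with key A3A4; in R2 the only nontrivial projected FD is A2A3 → A4, with key A2A3. Both are in BCNF. The price is that A1A2 → A3 is not preserved: it is not implied by the FDs that hold within R1 and R2.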
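
The key search and the BCNF test can be cross-checked by brute force. The sketch below enumerates attribute subsets by size and keeps the minimal ones whose closure is the whole schema; for this R it reports {A2, A3} as a candidate key alongside {A1, A2}. The helper names are chosen here for illustration, and the subset enumeration is exponential, so it only suits small schemas.

```python
from itertools import combinations

def closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) frozensets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def candidate_keys(schema, fds):
    """Minimal attribute sets whose closure is the whole schema."""
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            attrs = set(combo)
            if any(key <= attrs for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(attrs, fds) == set(schema):
                keys.append(attrs)
    return keys

def bcnf_violations(schema, fds):
    """Given FDs that are nontrivial and whose left side is not a superkey."""
    return [(lhs, rhs) for lhs, rhs in fds
            if not rhs <= lhs and closure(lhs, fds) != set(schema)]

R = {"A1", "A2", "A3", "A4"}
fds = [
    (frozenset({"A2", "A3"}), frozenset({"A4"})),
    (frozenset({"A3", "A4"}), frozenset({"A1"})),
    (frozenset({"A1", "A2"}), frozenset({"A3"})),
]

keys = candidate_keys(R, fds)
violations = bcnf_violations(R, fds)
```

Checking only the given FDs is enough to decide BCNF for the original relation; for a decomposed relation the projected FDs would have to be computed first.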

Relational Algebra Queries on a Movie Database

Consider schema: Movies(Title, Director, Actor); Location(Theater, Address, Phone); Schedule(Theater, Title, Time).

Q1: Which theaters feature “Zootopia”? π_Theater(σ_Title='Zootopia'(Schedule))
Q2: List names and addresses of theaters featuring a film by Steven Spielberg. π_Theater, Address(Location ⋈ (π_Theater(σ_Director='Steven Spielberg'(Movies) ⋈ Schedule)))
Q3: Address and phone of Le Champo theater. π_Address, Phone(σ_Theater='Le Champo'(Location))
Q4: Pairs of actors in same movie. Rename Movies to M1 and M2, then π_M1.Actor, M2.Actor(σ_M1.Title=M2.Title ∧ M1.Actor < M2.Actor(M1 × M2))
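
To see what these expressions compute, the relations can be modeled as Python sets of tuples, with selection as a filter, projection as picking fields, and join as matching on the shared attribute. The rows below are made up purely for illustration (placeholder directors, actors, addresses, and phone numbers), and only Q1, Q2, and Q4 are shown.

```python
# Hypothetical sample instances; every row is invented for this sketch.
movies = {  # (Title, Director, Actor)
    ("Zootopia", "Director X", "Actor A"),
    ("Zootopia", "Director X", "Actor B"),
    ("Film S", "Steven Spielberg", "Actor C"),
}
location = {  # (Theater, Address, Phone)
    ("Le Champo", "Address 1", "Phone 1"),
    ("Rex", "Address 2", "Phone 2"),
}
schedule = {  # (Theater, Title, Time)
    ("Le Champo", "Zootopia", "18:00"),
    ("Rex", "Film S", "20:00"),
}

# Q1: select on Title, then project Theater.
q1 = {theater for (theater, title, _) in schedule if title == "Zootopia"}

# Q2: select Spielberg movies, join with Schedule on Title,
# project Theater, then join with Location and project Theater, Address.
spielberg_titles = {title for (title, director, _) in movies
                    if director == "Steven Spielberg"}
spielberg_theaters = {theater for (theater, title, _) in schedule
                      if title in spielberg_titles}
q2 = {(theater, address) for (theater, address, _) in location
      if theater in spielberg_theaters}

# Q4: self-product of Movies, keep pairs with the same Title;
# a1 < a2 drops (x, x) pairs and mirrored duplicates.
q4 = {(a1, a2)
      for (t1, _, a1) in movies
      for (t2, _, a2) in movies
      if t1 == t2 and a1 < a2}
```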

Join Algorithms: Block Nested, Sort-Merge, Hash

Given R: 20,000 tuples, 10 tuples/block → 2000 blocks. S: 100,000 tuples → 10,000 blocks. Memory: 52 blocks.

Block nested loop join: Take the smaller relation R as the outer relation. Of the 52 memory blocks, reserve one as the input buffer for S and one for output, leaving 50 for a chunk of R. Load R 50 blocks at a time and scan all of S once per chunk. Cost = read R once + (number of chunks of R) * (blocks of S). Chunks of R: ceil(2000/50) = 40. So cost = 2000 + 40 * 10,000 = 402,000 I/Os. (If the output buffer is not counted and chunks are 51 blocks, ceil(2000/51) is still 40, so the cost is unchanged.)
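
The arithmetic can be reproduced in a few lines, which also makes it easy to see why the smaller relation belongs on the outside. This sketch assumes M - 2 blocks are available for each chunk of the outer relation (one buffer for scanning the inner relation, one for output); the function name bnlj_cost is chosen here for illustration.

```python
from math import ceil

def bnlj_cost(b_outer, b_inner, mem_blocks):
    """I/O cost of block nested loop join, in block reads."""
    chunk = mem_blocks - 2  # assumption: 1 inner input buffer + 1 output buffer
    chunks = ceil(b_outer / chunk)
    return b_outer + chunks * b_inner

cost_r_outer = bnlj_cost(2000, 10000, 52)   # 2000 + 40 * 10000
cost_s_outer = bnlj_cost(10000, 2000, 52)   # 10000 + 200 * 2000
```

With S as the outer relation the cost rises to 410,000 I/Os, so R is the better outer choice here.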

Sort-merge join: Sort both relations with external merge sort, then merge. Sorting cost = 2 * B * (number of passes), with passes = 1 + ceil(log_{M-1}(ceil(B/M))), where B is the number of blocks and M = 52. For R: B = 2000, initial runs = ceil(2000/52) = 39, passes = 1 + ceil(log_51(39)) = 1 + 1 = 2, cost = 2 * 2000 * 2 = 8,000. For S: B = 10,000, initial runs = ceil(10,000/52) = 193, passes = 1 + ceil(log_51(193)) = 1 + 2 = 3, cost = 2 * 10,000 * 3 = 60,000. Merge cost: read both sorted files once: 2000 + 10,000 = 12,000. Total = 8,000 + 60,000 + 12,000 = 80,000 I/Os.
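
The pass counts can be verified without logarithms by simulating the merge passes directly, which sidesteps floating-point rounding. This is a sketch under the same assumptions as the calculation above (M blocks to form initial runs, M - 1 input buffers per merge pass, sorted output written back to disk); the function names are chosen here for illustration.

```python
from math import ceil

def sort_passes(blocks, mem):
    """Number of passes of external merge sort, counting run formation as pass 1."""
    runs = ceil(blocks / mem)       # initial sorted runs of up to `mem` blocks
    passes = 1
    while runs > 1:
        runs = ceil(runs / (mem - 1))  # each merge pass combines mem - 1 runs
        passes += 1
    return passes

def sort_cost(blocks, mem):
    """Each pass reads and writes every block once."""
    return 2 * blocks * sort_passes(blocks, mem)

def sort_merge_join_cost(b_r, b_s, mem):
    """Sort both inputs, then read each sorted file once for the merge."""
    return sort_cost(b_r, mem) + sort_cost(b_s, mem) + b_r + b_s

passes_r = sort_passes(2000, 52)
passes_s = sort_passes(10000, 52)
total = sort_merge_join_cost(2000, 10000, 52)
```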

Hash join: Partition phase: use 1 block as the input buffer and the other 51 as output buffers, one per bucket, so each relation is hashed into 51 buckets. Every block is read once and written once, so cost = 2*B_R + 2*B_S = 2*2000 + 2*10,000 = 24,000. Join phase: read each bucket pair (one from R, one from S), build an in-memory hash table on the R bucket, and probe it with the S bucket. With uniform hashing each R bucket is about ceil(2000/51) = 40 blocks, which fits in memory, so cost = B_R + B_S = 12,000. Total = 24,000 + 12,000 = 36,000 I/Os.
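
The 3 * (B_R + B_S) figure only holds when every bucket of the smaller relation fits in memory. The sketch below makes that condition explicit. It assumes uniform hashing, M - 1 partitions, and M - 2 blocks available for the in-memory build bucket; the function name and the error message are chosen here for illustration.

```python
from math import ceil

def hash_join_cost(b_r, b_s, mem):
    """I/O cost of a partitioned (Grace) hash join with a single partition pass."""
    partitions = mem - 1                               # 1 input buffer, rest are bucket buffers
    build_bucket = ceil(min(b_r, b_s) / partitions)    # assumes uniform hashing
    if build_bucket > mem - 2:                         # leave room for probe input + output
        raise ValueError("buckets too large; recursive partitioning would be needed")
    partition_cost = 2 * (b_r + b_s)                   # read and write every block once
    join_cost = b_r + b_s                              # read every bucket once
    return partition_cost + join_cost

bucket_blocks = ceil(2000 / 51)
total = hash_join_cost(2000, 10000, 52)
```

Comparing the three totals for this input: hash join at 36,000 I/Os, sort-merge at 80,000, and block nested loop at 402,000.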

XML and RDF Representation

For the Airport instance without schema, an XML document might look like:

<airports>
  <airport>
    <id>1</id>
    <name>Los Angeles</name>
    <city>Los Angeles</city>
    <country>USA</country>
    <iata>LAX</iata>
    <icao>KLAX</icao>
    <lat>33.9425</lat>
    <lon>-118.408</lon>
  </airport>
  ...
</airports>
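
A document of this shape can be generated from the relation instance with Python's standard xml.etree.ElementTree module. This is a minimal sketch: the element names mirror the example above, and only two of the tuples are included.

```python
import xml.etree.ElementTree as ET

# Element names follow the example document; rows come from the Airport instance.
columns = ["id", "name", "city", "country", "iata", "icao", "lat", "lon"]
rows = [
    (1, "Los Angeles", "Los Angeles", "USA", "LAX", "KLAX", 33.9425, -118.408),
    (2, "John F Kennedy", "New York", "USA", "JFK", "KJFK", 40.6397, -73.7789),
]

root = ET.Element("airports")
for row in rows:
    airport = ET.SubElement(root, "airport")
    for column, value in zip(columns, row):
        # One child element per attribute of the tuple.
        ET.SubElement(airport, column).text = str(value)

xml_text = ET.tostring(root, encoding="unicode")

# Round trip: parse the text back and read a field from every airport.
parsed = ET.fromstring(xml_text)
iata_codes = [a.findtext("iata") for a in parsed.findall("airport")]
```

Because there is no schema, nothing stops two airports from sharing an id or an airport from lacking an iata element; that is what the XSD keys are for.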

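For the relational schema, an XML Schema (XSD) would define elements with keys, declared inside the airports element: <xs:key name="AirportPK"><xs:selector xpath="airport"/><xs:field xpath="id"/></xs:key>. The selector path is relative to the element that declares the key. Foreign keys use xs:keyref, whose refer attribute names the key being referenced.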

For the natural language sentences about humans, an RDF schema (RDFS) could be:

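@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <http://example.org/> .
ex:Human rdf:type rdfs:Class .
ex:Sex rdf:type rdfs:Class .
# ex:Man and ex:Woman are used as domains below, so they are declared as classes here.
ex:Man rdf:type rdfs:Class ; rdfs:subClassOf ex:Human .
ex:Woman rdf:type rdfs:Class ; rdfs:subClassOf ex:Human .
ex:likes rdf:type rdf:Property ; rdfs:domain ex:Human ; rdfs:range ex:Human .
ex:sex rdf:type rdf:Property ; rdfs:domain ex:Human ; rdfs:range ex:Sex .
ex:man rdf:type ex:Sex .
ex:woman rdf:type ex:Sex .
ex:fatherOf rdf:type rdf:Property ; rdfs:domain ex:Man ; rdfs:range ex:Human .
ex:motherOf rdf:type rdf:Property ; rdfs:domain ex:Woman ; rdfs:range ex:Human .
ex:marriedTo rdf:type rdf:Property ; rdfs:domain ex:Human ; rdfs:range ex:Human .
ex:birthYear rdf:type rdf:Property ; rdfs:domain ex:Human ; rdfs:range xsd:gYear .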

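Plus a rule: if marriedTo then likes. Within RDFS this can be stated directly as ex:marriedTo rdfs:subPropertyOf ex:likes; SWRL or SPIN is only needed for rules that RDFS cannot express. The parent relationship can be modeled by making fatherOf and motherOf sub-properties of a common property such as ex:parentOf (a name chosen here for illustration). This captures that every father or mother is a parent; RDFS alone cannot state the converse.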
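
One way to express both the marriedTo rule and the parent relationship without leaving RDFS is rdfs:subPropertyOf. The sketch below imitates that single entailment rule in plain Python over triples stored as tuples; it is not a real RDF library. The individuals ex:alice, ex:bob, and ex:carol and the property ex:parentOf are hypothetical names introduced for the example.

```python
SUBPROP = "rdfs:subPropertyOf"

# Schema triples: marriedTo implies likes; fatherOf/motherOf imply parentOf.
schema = {
    ("ex:marriedTo", SUBPROP, "ex:likes"),
    ("ex:fatherOf", SUBPROP, "ex:parentOf"),   # ex:parentOf is a hypothetical name
    ("ex:motherOf", SUBPROP, "ex:parentOf"),
}

# Instance triples about made-up individuals.
data = {
    ("ex:alice", "ex:marriedTo", "ex:bob"),
    ("ex:alice", "ex:motherOf", "ex:carol"),
}

def infer(triples):
    """Apply (s p o), (p subPropertyOf q) => (s q o) until nothing new appears."""
    facts = set(triples)
    while True:
        sub = {(s, o) for (s, p, o) in facts if p == SUBPROP}
        new = {(s, q, o)
               for (s, p, o) in facts
               for (p2, q) in sub
               if p == p2}
        if new <= facts:
            return facts
        facts |= new

inferred = infer(schema | data)
alice_likes_bob = ("ex:alice", "ex:likes", "ex:bob") in inferred
alice_parent_of_carol = ("ex:alice", "ex:parentOf", "ex:carol") in inferred
```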

Graph Algorithm: Label Constrained Reachability

Given a directed graph with labeled edges, we want to know if there's a path from s to t using only edges with labels in set L. A simple BFS modification works:

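from collections import deque

def labelConstrainedReach(graph, s, t, L):
    # graph: dict mapping each node to a list of (neighbor, label) pairs
    # L: set of allowed edge labels
    visited = {s}
    queue = deque([s])
    while queue:
        u = queue.popleft()  # O(1), unlike list.pop(0)
        if u == t:
            return True
        for v, l in graph.get(u, []):
            # follow only edges whose label is allowed
            if l in L and v not in visited:
                visited.add(v)
                queue.append(v)
    return False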

This runs in O(V+E) time, provided L is a set so that each label test takes O(1).

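For the server network with edge weights representing latency, a shortest path algorithm (Dijkstra) can find the minimum latency path between servers, provided all latencies are non-negative. For a constraint such as "the path must pass through a certain server w", run Dijkstra from s to w and from w to t and add the two results; plain BFS only gives shortest paths when all edges have equal weight. More general constraints call for constraint-based routing.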
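
A minimal sketch of the latency computation, using a binary heap from Python's heapq module. The server names and latencies are made up for illustration, and the "must pass through" case is handled by simply adding two shortest-path results, which assumes the path may revisit servers.

```python
import heapq

def dijkstra(graph, source):
    """graph: dict node -> list of (neighbor, latency). Returns min latency per reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path to u was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def min_latency_via(graph, s, via, t):
    """Minimum latency from s to t over paths that pass through `via`."""
    inf = float("inf")
    return dijkstra(graph, s).get(via, inf) + dijkstra(graph, via).get(t, inf)

# Hypothetical server network; edge weights are latencies.
servers = {
    "A": [("B", 5), ("C", 2), ("E", 10)],
    "B": [("D", 3)],
    "C": [("B", 1), ("D", 7)],
    "E": [("D", 1)],
}

latency = dijkstra(servers, "A")
via_e = min_latency_via(servers, "A", "E", "D")
```

Here the best route from A to D is A, C, B, D with latency 6, while forcing the path through E costs 11.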

By mastering these concepts—Big Data V's, relational modeling, normalization, relational algebra, join algorithms, XML/RDF, and graph algorithms—you'll be well-prepared for CPTS415 assignments. Practice with real datasets like OpenFlights to solidify your understanding.