Websites/Glow 2014/Backend Architecture
Data Flowchart
https://drive.google.com/file/d/0B5fr2uXG8zVqanVmSlUwUHNzUkE/edit?usp=sharing
Basic Data Flow
Downloads
1. Requests come in to d.m.o.
2. Zeus sends log info to a syslog-ng server.
3. The syslog-ng server writes only the data we're interested in (IP, Timestamp) to a file.
4. A custom daemon process reads from this file and pushes triplets (IP, Timestamp, Type ('download' in this case)) onto a queue in Redis (see the first sketch after this list).
5. Another custom daemon process reads from the queue in the previous step, gets geo data from the IP, and writes to various data buckets in Redis (see the second sketch after this list). Bucket examples:
   - Total downloads
   - Downloads from Europe
   - Downloads from California
   - Queue of downloads and geo data for a specific minute
   - etc. (exact buckets TBD)
6. A third custom daemon process reads from the Redis buckets from the previous step and writes data files to Amazon S3 (see the third sketch after this list).
   - Caveat: We're so far unsure whether these will be plain JSON files or JavaScript files using the JSONP method. The primary difference is that if we're going to load JSON data files from a domain other than glow.mo (hehe, Glowmo. HT Josh), that provider (S3 and maybe CloudFront) will need to support the CORS protocol. If we use JSONP instead, the JSON data would be wrapped in a function call. This function would preexist on the page, and a script element would be dynamically added to the DOM, causing the file to download and execute, which would call the function with the data as its argument. This approach has the advantage of working in a broader range of browsers, but doesn't allow for some of the more graceful error-handling techniques we're likely to want. My (pmac) vote for now is JSON + CORS, but we should discuss.
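
First sketch (steps 3-4): a minimal Python version of the tailer daemon that follows the syslog-ng output file and pushes triplets onto the Redis queue. The log path, the queue name, and the one-"IP timestamp"-pair-per-line log format are all assumptions; a real daemon would also need to handle log rotation and Redis reconnects.

    import json
    import time

    import redis  # assumes the redis-py client

    LOG_PATH = "/var/log/glow/downloads.log"  # hypothetical syslog-ng output file
    QUEUE_KEY = "glow:events"                 # hypothetical queue name

    def tail(path):
        # Yield lines as they're appended to the file, like `tail -f`.
        with open(path) as f:
            f.seek(0, 2)  # start at the end of the file
            while True:
                line = f.readline()
                if not line:
                    time.sleep(0.1)
                    continue
                yield line.rstrip("\n")

    def main():
        r = redis.StrictRedis()
        for line in tail(LOG_PATH):
            # Assumes syslog-ng writes "<IP> <timestamp>" per line.
            ip, timestamp = line.split()
            r.lpush(QUEUE_KEY, json.dumps((ip, timestamp, "download")))

    if __name__ == "__main__":
        main()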
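Second sketch (step 5): the queue consumer, doing the geo lookup and fanning out into buckets. The bucket key scheme mirrors the examples above but is hypothetical (exact buckets TBD), and the sketch assumes the MaxMind geoip2 client with a GeoLite2 database on disk.

    import json

    import geoip2.database  # assumes the MaxMind geoip2 client
    import redis

    QUEUE_KEY = "glow:events"                       # hypothetical; same queue the tailer fills
    GEO_DB = "/usr/share/GeoIP/GeoLite2-City.mmdb"  # hypothetical database path

    def bucket_keys(event_type, geo):
        # The exact buckets are TBD; these mirror the examples above.
        return [
            "glow:%s:total" % event_type,
            "glow:%s:continent:%s" % (event_type, geo["continent"]),
            "glow:%s:region:%s" % (event_type, geo["region"]),
        ]

    def main():
        r = redis.StrictRedis()
        reader = geoip2.database.Reader(GEO_DB)
        while True:
            _, raw = r.brpop(QUEUE_KEY)  # blocks until the tailer pushes a triplet
            ip, timestamp, event_type = json.loads(raw)
            city = reader.city(ip)
            geo = {
                "continent": city.continent.code,
                "region": city.subdivisions.most_specific.iso_code,
            }
            for key in bucket_keys(event_type, geo):
                r.incr(key)
            # Per-minute queue of events with geo data attached (assumes an
            # ISO-8601 timestamp that can be truncated to the minute).
            minute = timestamp[:16]
            r.lpush("glow:minute:%s" % minute,
                    json.dumps([ip, timestamp, event_type, geo]))

    if __name__ == "__main__":
        main()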
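Third sketch (step 6): the S3 writer, with the JSON-vs-JSONP choice from the caveat reduced to a switch. The bucket name, key layout, and callback name are assumptions, and it assumes the boto S3 library with configured AWS credentials.

    import json
    import time

    import boto   # assumes the boto S3 library and configured AWS credentials
    import redis

    BUCKET_NAME = "glow-data"  # hypothetical S3 bucket
    USE_JSONP = False          # flip to emit JSONP instead of plain JSON
    CALLBACK = "glowData"      # hypothetical function preexisting on the page

    def serialize(data):
        # Plain JSON needs CORS support on S3/CloudFront; the JSONP variant
        # wraps the same data in a call to a function already on the page.
        body = json.dumps(data)
        if USE_JSONP:
            return "%s(%s);" % (CALLBACK, body), "application/javascript"
        return body, "application/json"

    def main():
        r = redis.StrictRedis()
        bucket = boto.connect_s3().get_bucket(BUCKET_NAME)
        while True:
            # Publish the previous minute's bucket once it's complete.
            minute = time.strftime("%Y-%m-%dT%H:%M", time.gmtime(time.time() - 60))
            events = [json.loads(e)
                      for e in r.lrange("glow:minute:%s" % minute, 0, -1)]
            body, content_type = serialize({"minute": minute, "events": events})
            key = bucket.new_key("minute/%s.json" % minute)
            key.set_contents_from_string(body,
                                         headers={"Content-Type": content_type})
            time.sleep(60)

    if __name__ == "__main__":
        main()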
Shares
1. Requests come in to glow.m.o.
2. mrburns sends triplets (IP, Timestamp, Type ('twitter share' or 'fb share')) to the same Redis queue from step 4 above (see the sketch after this list).
3. The rest proceeds exactly as above, on the exact same systems.
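
The only difference on the share path is the Type field; since mrburns writes straight to the same queue, the hook could be as small as the following sketch (the function name and call site are assumptions):

    import json
    import time

    import redis

    QUEUE_KEY = "glow:events"  # hypothetical; same queue the download tailer fills

    def record_share(ip, service):
        # Identical triplet shape; only the Type differs from downloads.
        timestamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        triplet = (ip, timestamp, "%s share" % service)
        redis.StrictRedis().lpush(QUEUE_KEY, json.dumps(triplet))

    # e.g. from the share-handling view in mrburns:
    # record_share(request_ip, "twitter")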