1.1 Implementing Large-Scale Personalized Services
In a large-scale service with many features, the maintainability and the operational robustness of an implementation
are of paramount importance. The system should have the following properties:
System scalability: Supporting an online service with hundreds of millions of registered users, handling millions
of requests per second.
Organizational scalability: Allowing hundreds or even thousands of software engineers to work on the system
without excessive coordination overhead.
Operational robustness: If one part of the system is slow or unavailable, the rest of the system should continue
working normally as much as possible.
Large-scale personalized services have been successfully implemented as batch jobs [30], for example using
MapReduce [6]. Performing a recommendation system’s computations in offline batch jobs decouples them from
the online systems that serve user requests, making them easier to maintain and less operationally sensitive.
The main downside of batch jobs is that they introduce a delay between the time the data is collected and
the time its effects are visible. The length of the delay depends on the frequency with which the job is run, but
it is often on the order of hours or days.
Even though MapReduce is a lowest-common-denominator programming model, and has fairly poor performance
compared to specialized massively parallel database engines [2], it has been a remarkably successful tool
for implementing recommendation systems [30]. Systems such as Spark [34] overcome some of the performance
problems of MapReduce, although they remain batch-oriented.
1.2 Batch Workflows
A recommendation and personalization system can be built as a workflow, a directed graph of MapReduce
jobs [30]. Each job reads one or more input datasets (typically directories on the Hadoop Distributed Filesystem,
HDFS), and produces one or more output datasets (in other directories). A job treats its input as immutable
and completely replaces its output. Jobs are chained by directory name: the same name is configured as output
directory for the first job and input directory for the second job.
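To illustrate the chaining pattern, the following is a minimal Java sketch of a two-stage workflow on Hadoop, assuming hypothetical HDFS paths and dataset names; the application-specific mapper and reducer classes are omitted so that only the directory-name chaining is visible.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WorkflowDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // First job: reads raw click events and writes per-user activity summaries.
        // (Application-specific Mapper/Reducer and output types would be configured here.)
        Job activity = Job.getInstance(conf, "user-activity");
        activity.setJarByClass(WorkflowDriver.class);
        FileInputFormat.addInputPath(activity, new Path("/data/clicks"));           // hypothetical input dataset
        FileOutputFormat.setOutputPath(activity, new Path("/data/user-activity"));  // output is completely replaced
        if (!activity.waitForCompletion(true)) System.exit(1);

        // Second job: its input is, by configuration, the same directory name that
        // the first job used as its output -- the name is the only coupling.
        Job recommend = Job.getInstance(conf, "recommendations");
        recommend.setJarByClass(WorkflowDriver.class);
        FileInputFormat.addInputPath(recommend, new Path("/data/user-activity"));
        FileOutputFormat.setOutputPath(recommend, new Path("/data/recommendations"));
        System.exit(recommend.waitForCompletion(true) ? 0 : 1);
    }
}

The second job knows nothing about the first; the only point of contact is the directory name /data/user-activity that appears in both configurations.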
This method of chaining jobs by directory name is simple, and although it is expensive in terms of I/O, it
provides several important benefits:
Multi-consumer. Several different jobs can read the same input directory without affecting each other. Adding
a slow or unreliable consumer affects neither the producer of the dataset nor other consumers.
Visibility. Every job’s input and output can be inspected by ad-hoc debugging jobs for tracking down the cause
of an error. Inspection of inputs and outputs is also valuable for audit and capacity planning purposes, and
for monitoring whether jobs are providing the required level of service.
Team interface. A job operated by one team of people can produce a dataset, and jobs operated by other teams
can consume the dataset. The directory name thus acts as an interface between the teams, and it can be
reinforced with a contract (e.g. prescribing the data format, schema, field semantics, partitioning scheme,
and frequency of updates; a sketch of such a contract appears after this list). This arrangement helps
organizational scalability.
Loose coupling. Different jobs can be written in different programming languages, using different libraries, but
they can still communicate as long as they can read and write the same file format for inputs and outputs.
A job does not need to know which jobs produce its inputs and consume its outputs. Different jobs can be
run on different schedules, at different priorities, by different users.
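As a sketch of the kind of contract mentioned under Team interface above, assuming for illustration that the dataset files are stored in Avro format, a producing team might publish a schema such as the following, which consuming teams compile against; the record and field names are hypothetical.

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class UserActivityContract {
    // Hypothetical contract for the /data/user-activity dataset: it fixes the
    // data format (Avro), the schema, and the meaning of each field.
    public static final Schema SCHEMA = SchemaBuilder
            .record("UserActivity").namespace("com.example.recs")
            .fields()
            .requiredString("userId")      // opaque user identifier
            .requiredLong("timestamp")     // event time, milliseconds since epoch
            .requiredString("itemId")      // item the user interacted with
            .optionalString("eventType")   // e.g. "click" or "purchase"; may be absent in older data
            .endRecord();
}

The schema fixes the data format and field semantics; the partitioning scheme (e.g. one directory per day) and the frequency of updates would be documented alongside it.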