output-duration
, concentration
etc.
- Rumen provides 2 basic commands
+Rumen provides 3 basic commands
TraceBuilder
Folder
Anonymizer
Firstly, we need to generate the Gold Trace. Hence the first
@@ -139,8 +150,9 @@
The output of the TraceBuilder
is a job-trace file (and an
optional cluster-topology file). In case we want to scale the output, we
can use the Folder
utility to fold the current trace to the
- desired length. The remaining part of this section explains these
- utilities in detail.
+ desired length. For anonymizing the trace, use the
+ Anonymizer
utility. The remaining part of this section
+ explains these utilities in detail.
Command:
This command invokes the Anonymizer utility of
+ Rumen. It anonymizes sensitive information from the
+ <jobtrace-input>
file and outputs the anonymized
+ content into the <jobtrace-output>
+ file. It also anonymizes the cluster layout (topology) from the
+ <topology-input>
and outputs it in
+ the <topology-output>
file.
+ <jobtrace-input>
represents the job trace file obtained
+ using TraceBuilder
or Folder
.
+ <topology-input>
represents the cluster topology
+ file obtained using TraceBuilder
.
+
Options :
Parameter | Description | Notes
---|---|---
-trace | Anonymizes job traces. | Anonymizes sensitive fields like user-name, job-name, queue-name, host-names, job configuration parameters etc.
-topology | Anonymizes cluster topology. | Anonymizes rack-names and host-names.
The Rumen Anonymizer can be configured using the following configuration parameters:
Parameter | Description
---|---
rumen.data-types.classname.preserve | A comma-separated list of prefixes that the Anonymizer will not anonymize while processing classnames. If rumen.data-types.classname.preserve is set to 'org.apache,com.hadoop.', then classnames starting with 'org.apache' or 'com.hadoop.' will not be anonymized.
rumen.datatypes.jobproperties.parsers | A comma-separated list of job properties parsers. These parsers decide how the job configuration parameters (i.e. <key,value> pairs) should be processed. The default is MapReduceJobPropertiesParser, which parses only framework-level, MapReduce-specific job configuration properties. Users can add custom parsers by implementing the JobPropertiesParser interface. Rumen also provides an all-pass (i.e. no filter) parser called DefaultJobPropertiesParser.
rumen.anonymization.states.dir | Set this to a location (on LocalFileSystem or HDFS) to enable state persistence and/or reload. This parameter is not set by default. Reloading and persistence of states depend on the state directory. Note that the state directory will contain the latest as well as previous states.
rumen.anonymization.states.persist | Set this to 'true' to persist the current state. The default value is 'false'. Note that the states will be persisted to the state manager's state directory specified using the rumen.anonymization.states.dir parameter.
rumen.anonymization.states.reload | Set this to 'true' to enable reuse of previously persisted state. The default value is 'false'. The previously persisted state will be reloaded from the state manager's state directory specified using the rumen.anonymization.states.dir parameter. Note that the Anonymizer will bail out if it fails to find any previously persisted state in the state directory or if the state directory is not set. To retain and reuse states across multiple invocations of the Anonymizer, the very first invocation should have rumen.anonymization.states.reload set to 'false' and rumen.anonymization.states.persist set to 'true'. Subsequent invocations can then have rumen.anonymization.states.reload set to 'true'.
This will anonymize the job details from
+ file:///home/user/job-trace.json
and output it to
+ file:///home/user/job-trace-anonymized.json
.
+ It will also anonymize the cluster topology layout from
+ file:///home/user/cluster-topology.json
and output it to
+ file:///home/user/cluster-topology-anonymized.json
.
+ Note that the Anonymizer
also supports input and output
+ files on HDFS.
+
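The configuration rules described above can be illustrated with a small sketch. This is not Rumen's implementation: the function names `should_preserve_classname` and `check_state_options`, their arguments, and the use of `ValueError` to model the Anonymizer "bailing out" are all hypothetical and chosen only for illustration. The sketch also assumes that requesting persistence without a state directory is an error, which the text above implies but does not state explicitly.

```python
def should_preserve_classname(classname, preserve_prefixes):
    """Return True if classname starts with one of the comma-separated prefixes.

    preserve_prefixes models the value of rumen.data-types.classname.preserve.
    """
    prefixes = [p.strip() for p in preserve_prefixes.split(",") if p.strip()]
    return any(classname.startswith(p) for p in prefixes)


def check_state_options(states_dir, persist, reload, has_persisted_state):
    """Validate the state-related options and return (will_reload, will_persist).

    states_dir          models rumen.anonymization.states.dir (None if unset)
    persist             models rumen.anonymization.states.persist
    reload              models rumen.anonymization.states.reload
    has_persisted_state whether an earlier run left state in states_dir
    """
    if reload:
        # Documented behaviour: reload fails without a state directory
        # or without previously persisted state.
        if not states_dir:
            raise ValueError("reload requested but no state directory is set")
        if not has_persisted_state:
            raise ValueError("reload requested but no persisted state was found")
    if persist and not states_dir:
        # Assumption: persistence also needs the state directory.
        raise ValueError("persist requested but no state directory is set")
    return (reload, persist)
```

Under these assumptions, a first run would call `check_state_options(states_dir, True, False, False)` to persist state without reloading, and later runs would pass `reload=True` once persisted state exists in the state directory.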