EPS数据狗论坛

 找回密码
 立即注册

QQ登录

只需一步,快速开始

查看: 882|回复: 0

[其他] Spark Streaming源码解读之数据清理

[复制链接]

242

主题

8208

金钱

1万

积分

资深用户

发表于 2016-11-10 16:11:34 | 显示全部楼层 |阅读模式



1.Spark Streaming程序的运行,不断的产生job,不断的生成RDD、不断的接收数据存储数据,不断的保存元数据等,如果不清理这些数据,内存和磁盘空间都会崩溃,看一下Spark Streaming是如何做清理工作的

2.Spark Streaming在Job运行完成时会触发数据清理动作,看JobHandler中run()方法的代码

def run() {   try {     val formattedTime = UIUtils.formatBatchTime(job.time.milliseconds, ssc.graph.batchDuration.milliseconds, showYYYYMMSS = false)     val batchUrl = s"/streaming/batch/?id=${job.time.milliseconds}"
     val batchLinkText = s"[output operation ${job.outputOpId}, batch time ${formattedTime}]"

     ssc.sc.setJobDescription(
       s"""Streaming job from <a href="$batchUrl">$batchLinkText</a>""")
     ssc.sc.setLocalProperty(BATCH_TIME_PROPERTY_KEY, job.time.milliseconds.toString)
     ssc.sc.setLocalProperty(OUTPUT_OP_ID_PROPERTY_KEY, job.outputOpId.toString)     // We need to assign `eventLoop` to a temp variable. Otherwise, because
     // `JobScheduler.stop(false)` may set `eventLoop` to null when this method is running, then
     // it's possible that when `post` is called, `eventLoop` happens to null.
     var _eventLoop = eventLoop     if (_eventLoop != null) {
       _eventLoop.post(JobStarted(job, clock.getTimeMillis()))       // Disable checks for existing output directories in jobs launched by the streaming
       // scheduler, since we may need to write output to an existing directory during checkpoint
       // recovery; see SPARK-4835 for more details.
       PairRDDFunctions.disableOutputSpecValidation.withValue(true) {         // run方法中包含了job的提交函数,触发sparkContext.runJob,真正的提交job
         job.run()
       }
       _eventLoop = eventLoop       if (_eventLoop != null) {
         _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
       }
     } else {       // JobScheduler has been stopped.
     }
   } finally {
     ssc.sc.setLocalProperty(JobScheduler.BATCH_TIME_PROPERTY_KEY, null)
     ssc.sc.setLocalProperty(JobScheduler.OUTPUT_OP_ID_PROPERTY_KEY, null)
   }
}

job.run执行之后,job运行完成。发送一个JobCompleted消息给事件循环器,事件循环器调用handleJobCompletion()方法,代码如下

private def handleJobCompletion(job: Job, completedTime: Long) {
val jobSet = jobSets.get(job.time)
jobSet.handleJobCompletion(job)
job.setEndTime(completedTime)
listenerBus.post(StreamingListenerOutputOperationCompleted(job.toOutputOperationInfo)) logInfo("Finished job " + job.id + " from job set of time " + jobSet.time)
if (jobSet.hasCompleted) {
   jobSets.remove(jobSet.time)
   jobGenerator.onBatchCompletion(jobSet.time)   logInfo("Total delay: %.3f s for time %s (execution: %.3f s)".format(
     jobSet.totalDelay / 1000.0, jobSet.time.toString,
     jobSet.processingDelay / 1000.0
   ))
   listenerBus.post(StreamingListenerBatchCompleted(jobSet.toBatchInfo))
}
job.result match {
   case Failure(e) =>     reportError("Error running job " + job, e)
   case _ =>
}
}

3.这里判断了jobSet是否完成,如果完成调用jobGenerator的onBatchCompletion方法,代码如下

jobGenerator.onBatchCompletion(jobSet.time)

onBachCompletion的代码如下

def onBatchCompletion(time: Time) {
   eventLoop.post(ClearMetadata(time))
}

然后发送一个ClearMetadata消息,看他的ClearMetadata的处理方法,代码如下

private def clearMetadata(time: Time) {
ssc.graph.clearMetadata(time)

// If checkpointing is enabled, then checkpoint,
// else mark batch to be fully processed
if (shouldCheckpoint) {
   eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = true))
} else {
   // If checkpointing is not enabled, then delete metadata information about
   // received blocks (block data not saved in any case). Otherwise, wait for
   // checkpointing of this batch to complete.
   val maxRememberDuration = graph.getMaxInputStreamRememberDuration()
   jobScheduler.receiverTracker.cleanupOldBlocksAndBatches(time - maxRememberDuration)
   jobScheduler.inputInfoTracker.cleanup(time - maxRememberDuration)
   markBatchFullyProcessed(time)
}
}

4.这里调用了DStreamGreph的clearMetadata()方法,代码如下

def clearMetadata(time: Time) {
logDebug("Clearing metadata for time " + time)
this.synchronized {
   outputStreams.foreach(_.clearMetadata(time))
}
logDebug("Cleared old metadata for time " + time)
}

分别调用每一个outputStream的clearMetadata(time)方法,代码如下

private[streaming] def clearMetadata(time: Time) {
val unpersistData = ssc.conf.getBoolean("spark.streaming.unpersist", true)
val oldRDDs = generatedRDDs.filter(_._1 <= (time - rememberDuration)) logDebug("Clearing references to old RDDs: [" +
   oldRDDs.map(x => s"${x._1} -> ${x._2.id}").mkString(", ") + "]")
generatedRDDs --= oldRDDs.keys
if (unpersistData) {   logDebug("Unpersisting old RDDs: " + oldRDDs.values.map(_.id).mkString(", "))
   oldRDDs.values.foreach { rdd =>
     rdd.unpersist(false)
     // Explicitly remove blocks of BlockRDD
     rdd match {
       case b: BlockRDD[_] =>         logInfo("Removing blocks of RDD " + b + " of time " + time)
         b.removeBlocks()
       case _ =>
     }
   }
} logDebug("Cleared " + oldRDDs.size + " RDDs that were older than " +
   (time - rememberDuration) + ": " + oldRDDs.keys.mkString(", "))
dependencies.foreach(_.clearMetadata(time))
}
5.第一步从generatedRDDs中过滤出不用的oldRDDs ,过滤的依据是当前batch的时间-rememberDuration,rememberDuration很关键,一般是batch的倍数,如果有windows操作,他会加上windowsDuration,最终结果就是保证还需要被使用的RDD不被清理。

第二步从内存数据结构generatedRDDs中删除oldRDDs

第三步判断是否清理RDD的持久化数据,默认是清理,调用rdd的unpersist方法清理缓存数据。如果是BlockRDD,调用BlockRDD的removeBlocks()方法,从BlockManager中清除BlockRDD接收的数据

第四步清理依赖关系



作者:海纳百川
来源:简书

您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

关闭

站长推荐上一条 /1 下一条

客服中心
关闭
在线时间:
周一~周五
8:30-17:30
QQ群:
653541906
联系电话:
010-85786021-8017
在线咨询
客服中心

意见反馈|网站地图|手机版|小黑屋|EPS数据狗论坛 ( 京ICP备09019565号-3 ) 

Powered by BFIT! X3.4

© 2008-2028 BFIT Inc.

快速回复 返回顶部 返回列表