There is a cluster whose overall architecture uses stateless compute-storage separation: in core-site.xml, fs.defaultFS is configured to point at object storage, which makes the cluster fully stateless. In this setup, offline data migration is handled by Sqoop, a rather ancient tool.
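
For context, a minimal sketch of what that setting looks like in core-site.xml; the tos:// bucket name is only illustrative (taken from the job logs later in this post), and the connector-specific endpoint/credential properties are omitted:

<property>
  <name>fs.defaultFS</name>
  <value>tos://fcbai</value>
</property>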

To manage jobs more conveniently, the Sqoop metastore is enabled, with its data stored in MySQL. Sqoop's conf/sqoop-site.xml is configured as follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>sqoop.metastore.client.enable.autoconnect</name>
    <value>true</value>
  </property>
  <property>
    <name>sqoop.metastore.client.autoconnect.url</name>
    <value>jdbc:mysql://localhost:3306/sqoop</value>
  </property>
  <property>
    <name>sqoop.metastore.client.autoconnect.username</name>
    <value>root</value>
  </property>
  <property>
    <name>sqoop.metastore.client.autoconnect.password</name>
    <value>xxxx</value>
  </property>
  <property>
    <name>sqoop.metastore.client.record.password</name>
    <value>true</value>
  </property>
  <property>
    <name>sqoop.metastore.server.location</name>
    <value>/home/fcbai/software/hadoop/sqoop-1.4.7.bin__hadoop-2.6.0/shared.db</value>
  </property>
  <property>
    <name>sqoop.metastore.server.port</name>
    <value>16006</value>
  </property>
</configuration>
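
With autoconnect enabled as above, sqoop job commands go straight to the shared metastore. A minimal sketch of creating, listing and running a job; the job name, source connection, credentials and table below are made-up placeholders:

# create a named import job in the shared metastore (source DB details are placeholders)
sqoop job --create demo_orders_import -- import \
  --connect jdbc:mysql://source-db:3306/shop \
  --username etl --password-file /user/etl/.pw \
  --table orders --target-dir /warehouse/orders -m 4

# list and execute jobs stored in the metastore
sqoop job --list
sqoop job --exec demo_orders_import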

I use port 16006 because the default 16000 was already occupied. A handy little tool here is pwdx, which quickly shows the working directory of a given pid; for example, finding the pid via the port and then locating the program:

fcbai@fcbai-a4:~/software/hadoop/sqoop-1.4.7.bin__hadoop-2.6.0$ netstat -anp | grep 16000
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 127.0.1.1:16000 0.0.0.0:* LISTEN 355740/java
tcp 0 0 127.0.1.1:16000 127.0.1.1:34983 ESTABLISHED 355740/java
tcp 0 0 127.0.1.1:34983 127.0.1.1:16000 ESTABLISHED 356013/java

fcbai@fcbai-a4:~/software/hadoop/sqoop-1.4.7.bin__hadoop-2.6.0$ pwdx 355740
355740: /home/fcbai/software/hadoop/hbase-2.4.12

As you can see, HBase was holding that port, so I switched to a different one. However, when running Sqoop's job list command I ran into a problem:

2022-11-23 18:14:48,001 ERROR hsqldb.HsqldbJobStorage: To use this version of Sqoop, you must downgrade your metadata schema.
2022-11-23 18:14:48,001 ERROR tool.JobTool: I/O error performing job operation: java.io.IOException: Invalid metadata version.
at org.apache.sqoop.metastore.hsqldb.HsqldbJobStorage.init(HsqldbJobStorage.java:202)
at org.apache.sqoop.metastore.hsqldb.HsqldbJobStorage.open(HsqldbJobStorage.java:161)
at org.apache.sqoop.tool.JobTool.run(JobTool.java:289)
at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
at org.apache.sqoop.Sqoop.main(Sqoop.java:252)

It complains about an incompatible metadata version. Checking the database, Sqoop had automatically created a SQOOP_ROOT table with the following structure:

mysql> show tables;
+-----------------+
| Tables_in_sqoop |
+-----------------+
| SQOOP_ROOT      |
+-----------------+
1 row in set (0.00 sec)

mysql> desc SQOOP_ROOT;
+----------+--------------+------+-----+---------+-------+
| Field    | Type         | Null | Key | Default | Extra |
+----------+--------------+------+-----+---------+-------+
| version  | int          | YES  | MUL | NULL    |       |
| propname | varchar(128) | NO   |     | NULL    |       |
| propval  | varchar(256) | YES  |     | NULL    |       |
+----------+--------------+------+-----+---------+-------+

This schema does not actually work with 1.4.7, yet it was created automatically by 1.4.7 itself, so this counts as a bug in Sqoop. The fix is to drop the table manually, recreate it, and insert the version row:

DROP TABLE SQOOP_ROOT;

CREATE TABLE SQOOP_ROOT (
  version INT,
  propname VARCHAR(128) NOT NULL,
  propval VARCHAR(256),
  CONSTRAINT SQOOP_ROOT_unq UNIQUE (version, propname)
);

INSERT INTO SQOOP_ROOT
VALUES (NULL, 'sqoop.hsqldb.job.storage.version', '0');
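
A quick sanity check afterwards; the single row shown is simply what the INSERT above should produce:

mysql> SELECT * FROM SQOOP_ROOT;
+---------+----------------------------------+---------+
| version | propname                         | propval |
+---------+----------------------------------+---------+
|    NULL | sqoop.hsqldb.job.storage.version | 0       |
+---------+----------------------------------+---------+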

That resolves the problem for good, but it is immediately followed by a second one:

Failing this attempt.Diagnostics: [2022-11-23 10:58:09.606]Failed to download resource { { tos://fcbai/user/root/.staging/job_1666001145890_0023/libjars, 1669172288680, FILE, null },pending,[(container_1666001145890_0023_02_000001)],3171514525282685,DOWNLOADING} java.io.IOException: Resource tos://fcbai/user/root/.staging/job_1666001145890_0023/libjars changed on src filesystem - expected: "2022-11-23T10:58:08.680+0800", was: "2022-11-23T10:58:09.605+0800", current time: "2022-11-23T10:58:09.605+0800"
For more detailed output, check the application tracking page: http://xxx:8088/cluster/app/application_1666001145890_0023 Then click on links to logs of each attempt.
. Failing the application.

This problem is a bit more involved. A careful analysis shows it is caused by the object store and HDFS handling modification timestamps inconsistently. Since I am running Hadoop 3.3.1, digging into the corresponding source code shows that the error is thrown in the verifyAndCopy method of hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/FSDownload.java:

if (sStat.getModificationTime() != resource.getTimestamp()) {
  throw new IOException("Resource " + sCopy + " changed on src filesystem" +
      " - expected: " +
      "\"" + Times.formatISO8601(resource.getTimestamp()) + "\"" +
      ", was: " +
      "\"" + Times.formatISO8601(sStat.getModificationTime()) + "\"" +
      ", current time: " + "\"" + Times.formatISO8601(Time.now()) + "\"");
}

In other words, with the object-storage implementation, the time returned by sStat.getModificationTime() ends up identical to Time.now(). The resources in question live under the .staging directory, i.e. the jars distributed while an MR job runs. Normally, when an MR job is executed, YARN uploads the packages the job needs to HDFS in one place so they can be reused as much as possible; on this cluster they are uploaded to object storage instead. In Sqoop, the upload is driven by the cacheJars method in src/java/org/apache/sqoop/mapreduce/JobBase.java, which collects the jars to be uploaded to object storage in a Set variable named localUrls and finally hands them to YARN through the conf interface:

// Add these to the 'tmpjars' array, which the MR JobSubmitter
// will upload to HDFS and put in the DistributedCache libjars.
sb.append(StringUtils.arrayToString(localUrls.toArray(new String[0])));
conf.set(ConfigurationConstants.MAPRED_DISTCACHE_CONF_PARAM, sb.toString());

The catch is that what Sqoop passes here is a directory, and object storage has no real concept of directories, so getModificationTime on it always comes back as Time.now(). Solving this requires modifying Sqoop's source code.
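
I will not paste the final patch here, but one possible direction, sketched below purely as an illustration, is to expand each directory entry in localUrls into the individual jar files it contains before they are joined into tmpjars, so that YARN localizes concrete objects whose modification times the object store reports consistently. This assumes the offending entries really are directories (such as Sqoop's lib directory), and the helper name expandDirsToFiles is made up:

// Imports needed (mostly already present in JobBase.java):
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper for cacheJars(): replace directory entries in localUrls
// with the files they contain before building the tmpjars string.
private Set<String> expandDirsToFiles(Configuration conf, Set<String> localUrls)
    throws IOException {
  Set<String> expanded = new HashSet<String>();
  for (String url : localUrls) {
    Path p = new Path(url);
    FileSystem fs = p.getFileSystem(conf);
    FileStatus stat = fs.getFileStatus(p);
    if (stat.isDirectory()) {
      // List the jars inside the directory and add them individually, so the
      // object store reports a real per-object modification time for each one.
      for (FileStatus child : fs.listStatus(p)) {
        if (!child.isDirectory()) {
          expanded.add(child.getPath().toString());
        }
      }
    } else {
      expanded.add(url);
    }
  }
  return expanded;
}

Whether this matches how the TOS connector actually reports timestamps would need to be verified; the point is only that whatever ends up in tmpjars should resolve to real objects rather than synthesized directory entries.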

