-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add convert to recordio format function #183
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also need a Dockerfile
and k8s yaml file and README.md.
docker/convert/convert.py
Outdated
path = output_path + "/" + name | ||
mkdir_p(path) | ||
|
||
mod.convert(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we do convert and split at the same time? @Yancey1989
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
convert的时候已经split了。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the convert function here:https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/dataset/common.py#L167, it also needs some other required parameters such as reader and num_shards. So maybe the code can not be run correctly, or I missed sth. ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh,sorry,我应该把相关的PR都放上来的。其实,跟这个相关的PR是这个:
PaddlePaddle/Paddle#2608
docker/convert/convert.py
Outdated
raise | ||
|
||
def convert(output_path, name): | ||
print "proc " + name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use logging
instead of printing, it will log the time and log level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
docker/convert/convert.py
Outdated
print "proc " + name | ||
mod = __import__("paddle.v2.dataset." + name, fromlist=['']) | ||
|
||
path = output_path + "/" + name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can use os.path.join(output_path, name)
instead of hard code delimiter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. Done.
docker/convert/convert.py
Outdated
path = output_path + "/" + name | ||
mkdir_p(path) | ||
|
||
mod.convert(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the convert function here:https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/dataset/common.py#L167, it also needs some other required parameters such as reader and num_shards. So maybe the code can not be run correctly, or I missed sth. ?
docker/convert/Dockerfile
Outdated
@@ -0,0 +1,4 @@ | |||
FROM paddlepaddle/paddle | |||
ADD ./convert.py /convert/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can add multiple files using a single line to reduce the image layers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
docker/convert/convert.py
Outdated
import logging | ||
import logging.config | ||
|
||
logging.config.fileConfig('logging.conf') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use [dict_config], so we don't need another config file.
docker/convert/Dockerfile
Outdated
FROM paddlepaddle/paddle | ||
ADD ./convert.py /convert/ | ||
ADD ./logging.conf /convert/ | ||
ADD .cache/paddle/dataset /root/.cache/paddle/dataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a default command for this dockerfile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
没有default command,路径是需要指定的。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't get your point. Default command could be CMD ["/program", "argument"]
, this can be overrided by k8s yaml definations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, could you let me know why users need to submit a k8s job to convert the public datasets?
I thought we will convert them locally (or by a k8s job), upload to the cluster public storage. And users don't need to convert them again?
@helinwang 这里的 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gongweibao 懂了,谢谢!LGTM.
Fix PaddlePaddle/Paddle#2550