cookies loaded but login page detected #51
Comments
The cookies work fine - they are covered by tests. Try using headless mode: |
Yes, even with -headless the result is the same login page. If you want to try, the forum is "ddunlimited.net". I don't know if I need some special header for the request, but I suppose that if it can scrape the login page, it should be able to scrape any other page using the same settings + logged-in cookies. |
I failed to register at ddunlimited.net: since I can't speak Italian, I cannot understand the registration questions. I can try to help you with this issue, but I need credentials (and URLs) for testing. Can you provide them? My email is al3x.s0rg(at)gmail.com |
OK, I created a test account with a temp email and sent you the full cookie in Netscape format. I just tested the cookie in two different browsers; it works. |
OK, you need to use the "Semicolon separated Cookie File" format upon export, and right now you also need to manually remove all comments from the result. I will add comment parsing in future releases; this is the first time I've encountered it ) |
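The manual step described above (dropping comment lines and joining the remaining cookies with semicolons) can be sketched in a few lines. This assumes the standard Netscape cookies.txt layout of seven tab-separated fields per line, with the cookie name and value in the last two columns; the function name is mine, not part of crawley.

```python
# Sketch: convert a Netscape-format cookies.txt export into a single
# "name=value; name=value" string, skipping comment lines.
# Assumes the standard Netscape layout: 7 tab-separated fields,
# name in column 6, value in column 7.

def netscape_to_semicolon(text: str) -> str:
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments (the manual step described above)
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) >= 7:
            name, value = fields[5], fields[6]
            pairs.append(f"{name}={value}")
    return "; ".join(pairs)

# Hypothetical export with one comment line and two cookies
sample = (
    "# Netscape HTTP Cookie File\n"
    "ddunlimited.net\tFALSE\t/\tTRUE\t0\tphpbb3_sid\tabc123\n"
    "ddunlimited.net\tFALSE\t/\tTRUE\t0\tphpbb3_u\t2\n"
)
print(netscape_to_semicolon(sample))  # phpbb3_sid=abc123; phpbb3_u=2
```

The resulting string is the shape crawley's -cookie flag takes in the commands later in this thread.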
please, try again, but set |
I get these 48 links (as before when I set |
Any idea? Did you get more links than me? |
Nope. I hope I'll have some time to investigate this next weekend. For now, please try changing the user-agent for crawley: you need to get your current browser's user-agent string (from any request) and give it to crawley with |
I'll do tests with other headers too. When you have time: do you know how to get the header string used by my browser (Brave, Arch Linux)? I'm searching but there isn't much info. |
Take a look at a good tool for any HTTP debugging - https://httpbin.org/ - for example, you can obtain all the headers from your browser here: https://httpbin.org/headers |
I found a way to make the cookies work with curl:
using the -j parameter the page is the correct one (not the login page). It works only for this website, I don't know why. curl man: Do you know if there is a similar feature for crawley (something like cleaning all previous cookies before using the new ones)? |
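The "clean previous cookies before using the new ones" idea can be illustrated with Python's standard http.cookiejar: start from a cleared jar (so nothing stale from a previous session survives, which is the spirit of curl's -j/--junk-session-cookies) and inject only the exported cookies. The cookie names and domain below are placeholders, and make_cookie is my helper, not a crawley API.

```python
# Sketch: discard previously accumulated cookies, then add fresh ones.
from http.cookiejar import Cookie, CookieJar

def make_cookie(name: str, value: str, domain: str) -> Cookie:
    # Minimal Cookie constructor; most fields are placeholders.
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={}, rfc2109=False,
    )

jar = CookieJar()
jar.set_cookie(make_cookie("old_sid", "stale", "ddunlimited.net"))
jar.clear()  # drop everything accumulated so far, like starting clean
jar.set_cookie(make_cookie("phpbb3_sid", "abc123", "ddunlimited.net"))
print(sorted(c.name for c in jar))  # ['phpbb3_sid']
```

Only the freshly injected cookie remains, so the next request cannot be "poisoned" by an old session cookie.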
Thank you for this investigation. I will inspect this more closely and add such a feature to crawley. |
Please check out the new https://github.com/s0rg/crawley/releases/tag/v1.5.12 release; it should fix this issue. |
I tried again using this command:
getting this output:
Cookies are loaded, but I'm not sure if I enabled the "cleaning" feature or if it's on by default. Thanks for your support and patience |
It's on by default. Thank you for your detailed reports - they are very helpful. |
I forgot to say that the page is still the login page; it seems the cookies are not being loaded correctly |
Hello again ) Please note, you still need crawley \
-headless \
-user-agent 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0' \
-cookie 'phpbb3_2yptu_sid=ef44a131d691d096c21add5db0eb2bd7;phpbb3_2yptu_u=2' \
'http://host/viewtopic.php?p=2' |
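For comparison, the same request can be expressed with Python's urllib to confirm exactly which Cookie and User-Agent headers would reach the server. The host, path, and cookie values are the placeholders from the command above; this is a debugging aid, not how crawley itself sends requests.

```python
# Sketch: build the equivalent request and inspect the headers it carries.
from urllib.request import Request

req = Request(
    "http://host/viewtopic.php?p=2",
    headers={
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) "
                      "Gecko/20100101 Firefox/111.0",
        "Cookie": "phpbb3_2yptu_sid=ef44a131d691d096c21add5db0eb2bd7;phpbb3_2yptu_u=2",
    },
)
# Verify the headers before actually sending anything
print(req.get_header("Cookie"))
print(req.get_header("User-agent"))  # urllib stores header names capitalized
```

If a request built this way (and then opened with urlopen) returns the thread while crawley returns the login page, the difference must lie in some other header or in how the cookie string is parsed.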
I upgraded to Crawley v1.5.13-3c672bd (archlinux)
crawley \
-headless \
-user-agent 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0' \
-cookie 'phpbb3_ddu4final_k=xxx; phpbb3_ddu4final_sid=xxx; phpbb3_ddu4final_u=xxx' \
'https://ddunlimited.net/viewtopic.php?p=5018859' > list.txt output:
I tried a lot of combinations of parameters; every time I get the same error as above.
the following command worked with Crawley v1.5.8-84c5855 but now it's not working anymore: crawley -headless -user-agent 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0' -cookie 'CENSORED_key=a%3A4%3A%7Bi%3A0%3Bs%3A6%3A%22153366%22%3Bi%3A1%3Bs%3A40%3A%225b6hf75839hf5832g6f4cf2%22%3Bi%3A2%3Bi%3A1868962414%3Bi%3A3%3Bi%3A0%3B%7D' 'CENSORED_URL' > list.2.txt output:
The cookie value doesn't include any literal semicolon; maybe it's an encoded character (%..) in the cookie value (the censored key and URL don't include semicolons). Also, after removing the user-agent parameter the error is still present |
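The suspicion above is plausible: %3B is the URL-encoding of ';', so a parser that decodes the cookie value before splitting the string on ';' will split inside the value. A short illustration, using a hypothetical cookie shaped like the censored one:

```python
# Sketch: why a percent-encoded cookie value can confuse a naive parser.
from urllib.parse import unquote

raw = "key=a%3A4%3A%7B...%3B...%7D; other=1"  # hypothetical, shaped like the report above

# Safe: split on ';' first, while the value is still percent-encoded.
pairs = [p.strip() for p in raw.split(";")]
print(pairs)  # ['key=a%3A4%3A%7B...%3B...%7D', 'other=1']

# Unsafe: decoding first turns %3B into a literal ';' inside the value.
decoded = unquote(raw)
print(decoded.count(";"))  # 2 - an extra separator appeared inside the value
```

So the order of operations matters: the cookie header must be split into pairs before any percent-decoding is applied to the values.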
Please, check version https://github.com/s0rg/crawley/releases/tag/v1.5.14 |
I'm using 1.5.14 now. The cookie is loading correctly, but I get the "context deadline exceeded" error only on the https://ddunlimited.net/viewtopic.php?p=5018859 website. I extracted my exact user-agent string and used a fresh new cookie (from the same browser), but the same error is still present.
I also get the same error without using cookies or with Thanks for your support |
Please check the fresh release: https://github.com/s0rg/crawley/releases/tag/v1.6.0 it has |
Hello, when I use this command:
crawley -depth -1 -dirs only -cookie "phpbb3_XXXXXX_sid=XXXXXX; phpbb3_XXXXXX_u=XXXXXX; phpbb3_XXXXXX_k=XXXXXX;" "https://XXXXXX.net/viewtopic.php?p=5018859" > urls.txt
crawley scrapes the forum's login page and not the thread I selected; crawley doesn't return any errors.
I tried with different user agents and headers; every time the result is the same: the forum's login page. The cookies are OK - I copied them using the EditThisCookie Google Chrome extension. I'm not using any VPN/proxy.
I also tested other forums; the result is always the same, I can't log in.
Do you know if there is a problem loading cookies?
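One way to make reports like this more precise is to save the fetched page and check programmatically whether it is the login form rather than the thread. This assumes a phpBB-style login form posting to ucp.php?mode=login (standard for phpBB3 boards such as this one); the function name and sample HTML snippets are mine.

```python
# Debugging aid: detect whether a saved page looks like a phpBB login form.

def looks_like_phpbb_login(html: str) -> bool:
    html = html.lower()
    # phpBB3 login forms post to ucp.php?mode=login
    return "mode=login" in html and "<form" in html

# Hypothetical snippets of the two outcomes described in this thread
login_html = '<form action="./ucp.php?mode=login" method="post">'
thread_html = '<div class="postbody"><h3>Re: topic title</h3></div>'

print(looks_like_phpbb_login(login_html))   # True
print(looks_like_phpbb_login(thread_html))  # False
```

Running such a check on the page crawley actually fetched would distinguish "cookies not sent" from "cookies sent but rejected by the server".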