cookies loaded but login page detected #51

Closed
evilsh3ll opened this issue Mar 20, 2023 · 23 comments · Fixed by #53 or #55

Comments

@evilsh3ll

evilsh3ll commented Mar 20, 2023

Hello, when I use this command:
crawley -depth -1 -dirs only -cookie "phpbb3_XXXXXX_sid=XXXXXX; phpbb3_XXXXXX_u=XXXXXX; phpbb3_XXXXXX_k=XXXXXX;" "https://XXXXXX.net/viewtopic.php?p=5018859" > urls.txt
crawley scrapes the forum's login page instead of the thread I selected, and it doesn't return any errors:

2023/03/20 16:08:58 [*] config: workers: 4 depth: -1 delay: 150ms
2023/03/20 16:08:58 [*] crawling url: https://XXXXXX.net/viewtopic.php?p=5018859
2023/03/20 16:09:00 [*] complete

I tried with different user agents and headers; every time the result is the same: the forum's login page. The cookies are OK - I copied them using the EditThisCookie Google Chrome extension. I'm not using any VPN/proxy.
I also tested other forums, and the result is always the same: I can't log in.
Do you know if there is a problem loading the cookies?

@s0rg
Owner

s0rg commented Mar 20, 2023

The cookies work fine - they are covered by tests. Try headless mode with the -headless flag; it disables HEAD requests, which can help.
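
For illustration (a sketch only - the cookie values and host are the placeholders from the original command), that would be:

crawley -depth -1 -dirs only -headless -cookie "phpbb3_XXXXXX_sid=XXXXXX; phpbb3_XXXXXX_u=XXXXXX; phpbb3_XXXXXX_k=XXXXXX;" "https://XXXXXX.net/viewtopic.php?p=5018859" > urls.txt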

@evilsh3ll
Author

evilsh3ll commented Mar 20, 2023

Yes, even with -headless the result is the same login page. If you want to try, the forum is "ddunlimited.net". I don't know if I need some special header for the request, but I suppose that if it can scrape the login page, it should be able to scrape any other page using the same settings plus the logged-in cookies.

@s0rg
Owner

s0rg commented Mar 20, 2023

I failed to register at ddunlimited.net - since I can't speak Italian, I cannot understand the registration questions. I can try to help you with this issue, but I need credentials (and URLs) for testing. Can you provide them? My email is al3x.s0rg(at)gmail.com

@evilsh3ll
Author

OK, I created a test account with a temp email and sent you the full cookie in Netscape format. I just tested the cookie in two different browsers; it works.

@s0rg
Owner

s0rg commented Mar 20, 2023

OK, you need to use the "Semicolon separated Cookie File" format on export, and right now you also need to manually remove all comments from the result. I will add comment parsing in future releases - this is the first encounter )
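
For illustration (a sketch - the comment lines and cookie names below are made up; only the final semicolon-separated name=value line should be kept and passed to -cookie):

# HTTP Cookie File
# Exported by EditThisCookie
# https://XXXXXX.net/
phpbb3_XXXXXX_sid=XXXXXX;phpbb3_XXXXXX_u=XXXXXX;phpbb3_XXXXXX_k=XXXXXX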

@evilsh3ll
Author

I exported the cookies from the EditThisCookie extension using the "semicolon separated" format and removed the first 3 comment lines, but I still get the login page.
This is the exported string (it's almost the same as the one I built manually in the previous tests 🥲):
(screenshot of the exported cookie string)

@s0rg
Owner

s0rg commented Mar 20, 2023

Please try again, but set the -headless flag and drop -dirs only.

@evilsh3ll
Author

evilsh3ll commented Mar 20, 2023

I get these 48 links (the same as before, when I set -headless without -dirs only): https://bin.disroot.org/?6bf486611c974dc6#BFLGkPZatknTvchGpe1UQbofmhZ4ToCR7RbHw3G4Eihb
They are the parts of the login page - for example https://ddunlimited.net/ucp.php?mode=sendpassword (the "I lost my password" button on the login page) and the other buttons/links of that same page.
(screenshot: crawley output listing the login-page links)
Instead, the page I'm trying to scrape should contain >400 forum threads (it's a page with a huge index).
(screenshot: the target index page)

@evilsh3ll
Author

Any idea? Did you get more links than me?

@s0rg
Owner

s0rg commented Mar 21, 2023

Nope. I hope to get some time to investigate this next weekend. For now, please try changing the user agent for crawley: take your current browser's user-agent string (from any request) and pass it to crawley with the -user-agent "your-user-agent-here" flag.
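
For instance (a sketch - the user-agent string is just an example, and the cookie values/host are the placeholders from the earlier commands):

crawley -depth -1 -headless -user-agent 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0' -cookie 'phpbb3_XXXXXX_sid=XXXXXX;phpbb3_XXXXXX_u=XXXXXX;phpbb3_XXXXXX_k=XXXXXX' 'https://XXXXXX.net/viewtopic.php?p=5018859' > urls.txt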

@evilsh3ll
Author

evilsh3ll commented Mar 22, 2023

I'll run tests with other headers too. When you have time: do you know how to get the header string used by my browser (Brave, Arch Linux)? I've been searching but there isn't much info.

@s0rg
Owner

s0rg commented Mar 22, 2023

Take a look at a good tool for any HTTP debugging - https://httpbin.org/ - for example, you can see all the headers your browser sends here: https://httpbin.org/headers
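
For example, fetching that endpoint from the command line echoes back the request headers as JSON; opening the same URL in the browser shows the browser's own headers, including User-Agent (output shown roughly):

curl -s https://httpbin.org/headers
# -> {"headers": {"Accept": "*/*", "Host": "httpbin.org", "User-Agent": "curl/8.x", ...}}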

@evilsh3ll
Author

evilsh3ll commented Mar 26, 2023

I found a way to make the cookies work with curl:

curl --cookie "phpbb3_ddu4final_k=XXX;phpbb3_ddu4final_sid=XXXX;phpbb3_ddu4final_u=XXXX;" -j -L "https://ddunlimited.net/viewtopic.php?f=440" -o test.html

Using the -j parameter, the page is the correct one (not the login page). It only works for this website - I don't know why.

From the curl man page:
New cookie session
Instead of telling curl when a session ends, curl features an option that lets the user decide when a new session begins.
A new cookie session means that all the old session cookies will be thrown away. It is the equivalent of closing a browser and starting it up again.
Tell curl a new cookie session starts by using -j, --junk-session-cookies

Do you know if there is a similar feature for crawley (something like clearing all previous cookies before using the new ones)?

@s0rg
Owner

s0rg commented Mar 26, 2023

Thank you for this investigation. I will inspect this more closely and add such a feature to crawley.

@s0rg
Owner

s0rg commented Mar 28, 2023

Please check out the new release, https://github.com/s0rg/crawley/releases/tag/v1.5.12 - it should fix this issue.

@evilsh3ll
Author

evilsh3ll commented Mar 29, 2023

I tried again using this command:

crawley -depth -1 -headless -cookie "phpbb3_ddu4final_k=XXX" -cookie "phpbb3_ddu4final_sid=XXX" -cookie "phpbb3_ddu4final_u=XXX" "https://ddunlimited.net/viewtopic.php?p=5018859" > test.txt

getting this output:

Scraping all urls

2023/03/29 23:46:12 [*] config: workers: 4 depth: -1 delay: 150ms
2023/03/29 23:46:12 [*] crawling url: https://ddunlimited.net/viewtopic.php?p=5018859
2023/03/29 23:46:13 [*] complete

The cookies are loaded, but I'm not sure whether I enabled the "cleaning" feature or it's on by default.
Crawley v1.5.12-a1f6de2 (archlinux)

Thanks for your support and patience

@s0rg
Owner

s0rg commented Mar 30, 2023

It's on by default. Thank you for your detailed reports - they are very helpful.

@evilsh3ll
Author

I forgot to say that the page is still the login page; it seems the cookies are not loaded correctly.

@s0rg
Owner

s0rg commented Apr 1, 2023

Hello again )
Please check out the new release: https://github.com/s0rg/crawley/releases/tag/v1.5.13
It seems I fixed it )

Please note, you still need -user-agent to make phpBB happy, e.g.:

crawley \
-headless \
-user-agent 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0' \
-cookie 'phpbb3_2yptu_sid=ef44a131d691d096c21add5db0eb2bd7;phpbb3_2yptu_u=2' \
'http://host/viewtopic.php?p=2'

@evilsh3ll
Author

evilsh3ll commented Apr 5, 2023

I upgraded to Crawley v1.5.13-3c672bd (archlinux)

  • problem 1 (scraping still not working, the file list.txt is empty):
crawley \
-headless \
-user-agent 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0' \
-cookie 'phpbb3_ddu4final_k=xxx; phpbb3_ddu4final_sid=xxx; phpbb3_ddu4final_u=xxx' \
'https://ddunlimited.net/viewtopic.php?p=5018859' > list.txt

output:

2023/04/06 00:19:00 [*] config: workers: 4 depth: 0 delay: 150ms
2023/04/06 00:19:00 [*] crawling url: https://ddunlimited.net/viewtopic.php?p=5018859
2023/04/06 00:19:05 [-] GET https://ddunlimited.net/viewtopic.php?p=5018859: Get "https://ddunlimited.net/viewtopic.php?p=5018859": context deadline exceeded
2023/04/06 00:19:05 [*] complete

I tried a lot of parameter combinations, and I get the same error above every time.

  • problem 2:

the following command worked with Crawley v1.5.8-84c5855, but now it's not working anymore:

crawley -headless -user-agent 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0' -cookie 'CENSORED_key=a%3A4%3A%7Bi%3A0%3Bs%3A6%3A%22153366%22%3Bi%3A1%3Bs%3A40%3A%225b6hf75839hf5832g6f4cf2%22%3Bi%3A2%3Bi%3A1868962414%3Bi%3A3%3Bi%3A0%3B%7D' 'CENSORED_URL' > list.2.txt

output:

2023/04/06 00:26:00 [*] config: workers: 4 depth: 0 delay: 150ms
2023/04/06 00:26:00 [*] crawling url: CENSORED_URL
2023/04/06 00:26:00 net/http: invalid byte ';' in Cookie.Value; dropping invalid bytes
2023/04/06 00:26:01 [*] complete

The cookie value doesn't include any literal semicolon - maybe it's an encoded character (%..) in the cookie value (the censored key and URL don't include semicolons either). The error is also still present with the user-agent parameter removed.
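
A likely explanation (just a guess, not confirmed against crawley's source): the censored value contains %3B, which is the URL-encoded form of ';' - if the cookie value gets URL-decoded before being set, the resulting literal semicolon is what Go's net/http complains about. Decoding a shortened stand-in for that value (via python3 here, purely for illustration) shows it:

python3 -c 'from urllib.parse import unquote; print(unquote("a%3A4%3A%7Bi%3A0%3Bs%3A6%3A%22153366%22%3B%7D"))'
# -> a:4:{i:0;s:6:"153366";}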

@s0rg
Owner

s0rg commented Apr 8, 2023

Please, check version https://github.com/s0rg/crawley/releases/tag/v1.5.14
Cookie decoding is now gone, so it will load any cookie. And please keep in mind that you need to use the same user-agent with which you obtained those cookies - so copy both the cookies AND your user-agent from the request.

@evilsh3ll
Author

evilsh3ll commented Apr 12, 2023

I'm using 1.5.14 now. The cookie is loading correctly, but I get the context deadline exceeded error - only on https://ddunlimited.net/viewtopic.php?p=5018859. I extracted my exact user-agent string and used a fresh cookie (from the same browser), but the same error is still present.

2023/04/12 10:18:18 [*] config: workers: 4 depth: 0 delay: 150ms
2023/04/12 10:18:18 [*] crawling url: https://ddunlimited.net/viewtopic.php?t=3610730
2023/04/12 10:18:24 [-] GET https://ddunlimited.net/viewtopic.php?t=3610730: Get "https://ddunlimited.net/viewtopic.php?t=3610730": context deadline exceeded
2023/04/12 10:18:24 [*] complete
Time: 5 s.

I also get the same error without cookies, and with the -delay 2000ms option. I can't find any info online about the context deadline exceeded error (except that it seems to be a timeout error).

Thanks for your support

@s0rg
Owner

s0rg commented Apr 13, 2023

Please check the fresh release: https://github.com/s0rg/crawley/releases/tag/v1.6.0 - it has a -timeout option, so you can increase the default timeout (5 seconds).
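
For example (a sketch - the exact value syntax for -timeout is assumed to be a Go-style duration such as 30s, and the cookie values are placeholders):

crawley -headless -timeout 30s -user-agent 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0' -cookie 'phpbb3_ddu4final_sid=XXX;phpbb3_ddu4final_u=XXX' 'https://ddunlimited.net/viewtopic.php?p=5018859' > list.txt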
