Fixes for issues 12, 13, 15, 16 & 24 #25

Closed
wants to merge 23 commits into from
Commits
77be817
[Issue #16] Updated process colon to solve StringIndexOutOfBoundsExce…
Jul 3, 2017
9f26721
[Issue #16] Updated process colon to solve StringIndexOutOfBoundsExce…
Jul 3, 2017
d405160
[Issue #12] Fixed StringIndexOutOfBoundsException when given 'http://…
Jul 3, 2017
d9ab689
update version to 0.1.18-SNAPSHOT
pgalbraith Sep 16, 2018
0263803
Unit test demonstrating https://github.com/linkedin/URL-Detector/issu…
pgalbraith Sep 16, 2018
885dd1b
Merge commit '9f267214885c6f82fad0915ddb42db33fbddccd2' into feature/…
pgalbraith Sep 16, 2018
11eb5a7
Unit test demonstrating https://github.com/linkedin/URL-Detector/issu…
pgalbraith Sep 16, 2018
205ce31
Merge branch 'upstream-pull-request-17' into feature/upstream-issue-12
pgalbraith Sep 16, 2018
01e3424
update maven coordinates for publishing
pgalbraith Sep 16, 2018
9ecc625
Merge branch 'feature/upstream-issue-16' into develop
pgalbraith Sep 16, 2018
3a3e8f7
0.1.18 release
pgalbraith Sep 17, 2018
83c0c36
0.1.19-SNAPSHOT
pgalbraith Sep 17, 2018
a568345
Merge branch 'master' into publish
pgalbraith Sep 17, 2018
0bd1932
fix for https://github.com/linkedin/URL-Detector/issues/13
pgalbraith Sep 21, 2018
5f7a957
0.1.19 release
pgalbraith Sep 22, 2018
c5eca23
0.1.20 snapshot
pgalbraith Sep 22, 2018
09e0709
Merge master into publish
pgalbraith Sep 22, 2018
2fc2bbe
Merge publish into develop
pgalbraith Sep 22, 2018
e8e154c
Refactor so that domain read failures are handled at top level.
pgalbraith Sep 23, 2018
0ed5186
Remove endless loop detection to address https://github.com/linkedin/…
pgalbraith Sep 23, 2018
94397e7
Tighten maven build
pgalbraith Sep 23, 2018
33da6cf
Add EditorConfig configuration
pgalbraith Sep 23, 2018
61a6da5
0.1.20 release
pgalbraith Sep 29, 2018
9 changes: 9 additions & 0 deletions .editorconfig
@@ -0,0 +1,9 @@
root = true

[*]
end_of_line = lf
insert_final_newline = true
charset = utf-8
indent_style = space
indent_size = 2
tab_size = 8
2 changes: 1 addition & 1 deletion gradle.properties
@@ -1,5 +1,5 @@
#Version
-version=0.1.17
+version=0.1.20

#long-running Gradle process speeds up local builds
#to stop the daemon run 'ligradle --stop'
22 changes: 14 additions & 8 deletions url-detector/pom.xml
@@ -2,14 +2,14 @@
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

-<groupId>com.linkedin.urls</groupId>
+<groupId>io.github.pgalbraith</groupId>
<artifactId>url-detector</artifactId>
-<version>0.1.17</version>
+<version>0.1.20</version>
<packaging>jar</packaging>

-<name>com.linkedin.urls:url-detector</name>
+<name>io.github.pgalbraith:url-detector</name>
<description>A Java library to detect and normalize URLs in text</description>
-<url>https://github.com/linkedin/URL-Detector</url>
+<url>https://github.com/pgalbraith/URL-Detector</url>

<licenses>
<license>
@@ -19,11 +19,15 @@
</licenses>

<scm>
-<connection>scm:git:git://github.com/linkedin/URL-Detector.git</connection>
-<developerConnection>scm:git:ssh://github.com:linkedin/URL-Detector.git</developerConnection>
-<url>https://github.com/linkedin/URL-Detector/tree/master</url>
+<connection>scm:git:git://github.com/pgalbraith/URL-Detector.git</connection>
+<developerConnection>scm:git:ssh://github.com:pgalbraith/URL-Detector.git</developerConnection>
+<url>https://github.com/pgalbraith/URL-Detector/tree/master</url>
</scm>
-
+
+<properties>
+<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
+</properties>
+
<dependencies>
<dependency>
<groupId>org.testng</groupId>
@@ -94,13 +98,15 @@
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-deploy-plugin</artifactId>
+<version>2.7</version>
<configuration>
<skip>true</skip>
</configuration>
</plugin>
<plugin>
<groupId>org.sonatype.plugins</groupId>
<artifactId>nexus-staging-maven-plugin</artifactId>
+<version>1.6.8</version>
<executions>
<execution>
<id>default-deploy</id>
@@ -102,7 +102,11 @@ public enum ReaderNextState {
/**
* Finished reading, next step should be to read the query string.
*/
-ReadQueryString
+ReadQueryString,
+/**
+* This was actually not a domain at all.
+*/
+ReadUserPass
}

/**
@@ -332,6 +336,10 @@ public ReaderNextState readDomainName() {
} else if (curr == '#') {
//continue by reading the fragment
return checkDomainNameValid(ReaderNextState.ReadFragment, curr);
+} else if (curr == '@') {
+//this may not have been a domain after all, but rather a username/password instead
+_reader.goBack();
+return ReaderNextState.ReadUserPass;
} else if (CharUtils.isDot(curr)
|| (curr == '%' && _reader.canReadChars(2) && _reader.peek(2).equalsIgnoreCase(HEX_ENCODED_DOT))) {
//if the current character is a dot or a urlEncodedDot
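The new `@` branch above stops reading the domain and signals that the text scanned so far may really be a username. A minimal standalone sketch of that reinterpretation follows. `UserInfoSplit` and `split` are hypothetical names that do not exist in URL-Detector, and the sketch works on a whole string, whereas the real reader consumes one character at a time.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not part of URL-Detector. */
final class UserInfoSplit {

  /**
   * Splits "user@host/path" into {userInfo, host}.
   * Only an '@' that appears before the first '/' counts as a separator;
   * when there is none, userInfo is the empty string.
   */
  static String[] split(String text) {
    int slash = text.indexOf('/');
    String authority = slash >= 0 ? text.substring(0, slash) : text;
    int at = authority.lastIndexOf('@');
    if (at < 0) {
      return new String[] {"", authority};
    }
    return new String[] {authority.substring(0, at), authority.substring(at + 1)};
  }

  public static void main(String[] args) {
    // What first looks like the host "big.big.boss" turns out to be the username.
    System.out.println(Arrays.toString(split("big.big.boss@google.com")));  // [big.big.boss, google.com]
    // An '@' inside the path does not affect the split.
    System.out.println(Arrays.toString(split("nono:boo@yahoo.com/@1234"))); // [nono:boo, yahoo.com]
  }
}
```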
@@ -14,11 +14,6 @@
*/
public class InputTextReader {

-/**
-* The number of times something can be backtracked is this multiplier times the length of the string.
-*/
-protected static final int MAX_BACKTRACK_MULTIPLIER = 10;
-
/**
* The content to read.
*/
@@ -29,16 +24,6 @@ public class InputTextReader {
*/
private int _index = 0;

-/**
-* Contains the amount of characters that were backtracked. This is used for performance analysis.
-*/
-private int _backtracked = 0;
-
-/**
-* When detecting for exceeding the backtrack limit, make sure the text is at least 20 characters.
-*/
-private final static int MINIMUM_BACKTRACK_LENGTH = 20;
-
/**
* Creates a new instance of the InputTextReader using the content to read.
* @param content The content to read.
@@ -102,47 +87,18 @@ public int getPosition() {
return _index;
}

-/**
-* Gets the total number of characters that were backtracked when reading.
-*/
-public int getBacktrackedCount() {
-return _backtracked;
-}
-
/**
* Moves the index to the specified position.
* @param position The position to set the index to.
*/
public void seek(int position) {
-int backtrackLength = Math.max(_index - position, 0);
-_backtracked += backtrackLength;
_index = position;
-checkBacktrackLoop(backtrackLength);
}

/**
* Goes back a single character.
*/
public void goBack() {
-_backtracked++;
_index--;
-checkBacktrackLoop(1);
}
-
-private void checkBacktrackLoop(int backtrackLength) {
-if (_backtracked > (_content.length * MAX_BACKTRACK_MULTIPLIER)) {
-if (backtrackLength < MINIMUM_BACKTRACK_LENGTH) {
-backtrackLength = MINIMUM_BACKTRACK_LENGTH;
-}
-
-int start = Math.max(_index, 0);
-if (start + backtrackLength > _content.length) {
-backtrackLength = _content.length - start;
-}
-
-String badText = new String(_content, start, backtrackLength);
-throw new NegativeArraySizeException("Backtracked max amount of characters. Endless loop detected. Bad Text: '"
-+ badText + "'");
-}
-}
}
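With the backtrack accounting removed, `seek` and `goBack` reduce to plain index updates. The sketch below shows a reader of that shape. `SimpleTextReader` is a hypothetical name; the method names mirror those visible in the diff, but this is not the library's class.

```java
/** Hypothetical illustration only; not the URL-Detector InputTextReader. */
final class SimpleTextReader {
  private final char[] content;
  private int index = 0;

  SimpleTextReader(String content) {
    this.content = content.toCharArray();
  }

  /** Returns the current character and advances by one. */
  char read() {
    return content[index++];
  }

  boolean eof() {
    return index >= content.length;
  }

  int getPosition() {
    return index;
  }

  /** Moves the index; no backtrack counter is updated and no limit is checked. */
  void seek(int position) {
    index = position;
  }

  /** Goes back a single character. */
  void goBack() {
    index--;
  }

  public static void main(String[] args) {
    SimpleTextReader reader = new SimpleTextReader("abc");
    reader.read();
    reader.read();
    reader.goBack();
    System.out.println(reader.read()); // reads the second character again
    // Repeated seeking no longer trips any backtrack limit.
    for (int i = 0; i < 1000; i++) {
      reader.seek(3);
      reader.seek(0);
    }
    System.out.println(reader.getPosition());
  }
}
```

One likely consequence of this design is that termination now depends on the callers always making forward progress, since nothing in the reader itself detects a loop.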
@@ -125,15 +125,6 @@ public UrlDetector(String content, UrlDetectorOptions options) {
_options = options;
}

-/**
-* Gets the number of characters that were backtracked while reading the input. This is useful for performance
-* measurement.
-* @return The count of characters that were backtracked while reading.
-*/
-public int getBacktracked() {
-return _reader.getBacktrackedCount();
-}
-
/**
* Detects the urls and returns a list of detected url strings.
* @return A list with detected urls.
@@ -154,13 +145,14 @@ private void readDefault() {
while (!_reader.eof()) {
//read the next char to process.
char curr = _reader.read();
-
switch (curr) {
case ' ':
//space was found, check if it's a valid single level domain.
if (_options.hasFlag(UrlDetectorOptions.ALLOW_SINGLE_LEVEL_DOMAIN) && _buffer.length() > 0 && _hasScheme) {
_reader.goBack();
-readDomainName(_buffer.substring(length));
+if (!readDomainName(_buffer.substring(length))) {
+readEnd(ReadEndState.InvalidUrl);
+};
}
_buffer.append(curr);
readEnd(ReadEndState.InvalidUrl);
@@ -178,7 +170,9 @@ private void readDefault() {
_buffer.append(_reader.read());
_buffer.append(_reader.read());

-readDomainName(_buffer.substring(length));
+if (!readDomainName(_buffer.substring(length))) {
+readEnd(ReadEndState.InvalidUrl);
+}
length = 0;
}
}
@@ -188,14 +182,18 @@ private void readDefault() {
case '\uFF61':
case '.': //"." was found, read the domain name using the start from length.
_buffer.append(curr);
-readDomainName(_buffer.substring(length));
+if (!readDomainName(_buffer.substring(length))) {
+readEnd(ReadEndState.InvalidUrl);
+}
length = 0;
break;
case '@': //Check the domain name after a username
if (_buffer.length() > 0) {
_currentUrlMarker.setIndex(UrlPart.USERNAME_PASSWORD, length);
_buffer.append(curr);
-readDomainName(null);
+if (!readDomainName(null)) {
+readEnd(ReadEndState.InvalidUrl);
+}
length = 0;
}
break;
@@ -218,6 +216,7 @@ private void readDefault() {

if (!readDomainName(_buffer.substring(length))) {
//if we didn't find an ipv6 address, then check inside the brackets for urls
+readEnd(ReadEndState.InvalidUrl);
_reader.seek(beginning);
_dontMatchIpv6 = true;
}
@@ -235,7 +234,9 @@ private void readDefault() {

//unread this "/" and continue to check the domain name starting from the beginning of the domain
_reader.goBack();
-readDomainName(_buffer.substring(length));
+if (!readDomainName(_buffer.substring(length))) {
+readEnd(ReadEndState.InvalidUrl);
+}
length = 0;
} else {

@@ -265,7 +266,9 @@ private void readDefault() {
}
}
if (_options.hasFlag(UrlDetectorOptions.ALLOW_SINGLE_LEVEL_DOMAIN) && _buffer.length() > 0 && _hasScheme) {
-readDomainName(_buffer.substring(length));
+if (!readDomainName(_buffer.substring(length))) {
+readEnd(ReadEndState.InvalidUrl);
+}
}
}

@@ -277,10 +280,16 @@ private void readDefault() {
private int processColon(int length) {
if (_hasScheme) {
//read it as username/password if it has scheme
-if (!readUserPass(length) && _buffer.length() > 0) {
+if (!readUserPass(length)) {
//unread the ":" so that the domain reader can process it
_reader.goBack();
-_buffer.delete(_buffer.length() - 1, _buffer.length());
+
+// Check buffer length before clearing it; set length to 0 if buffer is empty
+if (_buffer.length() > 0) {
+_buffer.delete(_buffer.length() - 1, _buffer.length());
+} else {
+length = 0;
+}

int backtrackOnFail = _reader.getPosition() - _buffer.length() + length;
if (!readDomainName(_buffer.substring(length))) {
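The length check added to `processColon` matters because deleting the last character of an empty `StringBuilder` computes a start index of -1, which `StringBuilder.delete` rejects with a `StringIndexOutOfBoundsException`. The standalone sketch below reproduces that failure and the guarded form. `EmptyBufferDelete` and `dropLastChar` are hypothetical names used only for illustration.

```java
/** Hypothetical illustration only; not part of URL-Detector. */
final class EmptyBufferDelete {

  /** Removes the last character if there is one; returns false when the buffer was empty. */
  static boolean dropLastChar(StringBuilder buffer) {
    if (buffer.length() == 0) {
      return false;
    }
    buffer.delete(buffer.length() - 1, buffer.length());
    return true;
  }

  public static void main(String[] args) {
    StringBuilder empty = new StringBuilder();
    try {
      // Unguarded form: on an empty buffer the start index is -1.
      empty.delete(empty.length() - 1, empty.length());
    } catch (StringIndexOutOfBoundsException e) {
      System.out.println("unguarded delete threw " + e.getClass().getSimpleName());
    }

    // Guarded form: an empty buffer is left untouched.
    System.out.println(dropLastChar(empty)); // false

    StringBuilder scheme = new StringBuilder("http:");
    System.out.println(dropLastChar(scheme) + " " + scheme); // true http
  }
}
```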
@@ -289,6 +298,8 @@ private int processColon(int length) {
readEnd(ReadEndState.InvalidUrl);
}
length = 0;
+} else {
+length = 0;
}
} else if (readScheme() && _buffer.length() > 0) {
_hasScheme = true;
@@ -297,7 +308,9 @@ private int processColon(int length) {
&& _reader.canReadChars(1)) { //takes care of case like hi:
_reader.goBack(); //unread the ":" so readDomainName can take care of the port
_buffer.delete(_buffer.length() - 1, _buffer.length());
-readDomainName(_buffer.toString());
+if (!readDomainName(_buffer.toString())) {
+readEnd(ReadEndState.InvalidUrl);
+}
} else {
readEnd(ReadEndState.InvalidUrl);
length = 0;
@@ -470,10 +483,9 @@ private boolean readScheme() {
* @return True if a valid username and password was found.
*/
private boolean readUserPass(int beginningOfUsername) {

//The start of where we are.
int start = _buffer.length();

//keep looping until "done"
boolean done = false;

@@ -547,8 +559,12 @@ public void addCharacter(char character) {
return readPort();
case ReadQueryString:
return readQueryString();
+case ReadUserPass:
+int host = _currentUrlMarker.indexOf(UrlPart.HOST);
+_currentUrlMarker.unsetIndex(UrlPart.HOST);
+return readUserPass(host);
default:
-return readEnd(ReadEndState.InvalidUrl);
+return false;
}
}
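The hunks above replace bare `readDomainName(...)` calls with a check of the return value, and the `default` branch now returns `false` instead of calling `readEnd` itself. The sketch below illustrates that general pattern: sub-readers only report failure, and one top-level loop decides how to reset. Every name in it is hypothetical, and its toy acceptance rule is an assumption for the example, not URL-Detector's logic.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the URL-Detector implementation. */
final class TopLevelFailureSketch {
  private final List<String> results = new ArrayList<>();
  private final StringBuilder buffer = new StringBuilder();
  private int resets = 0;

  /** Sub-reader: accepts only "name.tld"-shaped text and never resets shared state itself. */
  private boolean readCandidate(String candidate) {
    int dot = candidate.indexOf('.');
    if (dot <= 0 || dot == candidate.length() - 1) {
      return false; // report the failure to the caller
    }
    results.add(candidate);
    return true;
  }

  /** The only place where shared state is reset after a failed read. */
  private void readEndInvalid() {
    buffer.setLength(0);
    resets++;
  }

  List<String> detect(String text) {
    for (String word : text.split(" ")) {
      buffer.append(word);
      if (!readCandidate(buffer.toString())) {
        readEndInvalid(); // failure handled at the top level
      } else {
        buffer.setLength(0);
      }
    }
    return results;
  }

  int resetCount() {
    return resets;
  }

  public static void main(String[] args) {
    TopLevelFailureSketch detector = new TopLevelFailureSketch();
    System.out.println(detector.detect("see host.com or nothing.")); // [host.com]
    System.out.println(detector.resetCount());                       // 3
  }
}
```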

3 changes: 2 additions & 1 deletion url-detector/src/test/java/com/linkedin/urls/TestUrl.java
@@ -28,7 +28,8 @@ private Object[][] getUsernamePasswordUrls() {
{"@www.google.com", "www.google.com", "/", "", ""},
{"lalal:@www.gogo.com", "www.gogo.com", "/", "lalal", ""},
{"nono:boo@[::1]", "[::1]", "/", "nono", "boo"},
-{"nono:boo@yahoo.com/@1234", "yahoo.com", "/@1234", "nono", "boo"}
+{"nono:boo@yahoo.com/@1234", "yahoo.com", "/@1234", "nono", "boo"},
+{"big.big.boss@google.com", "google.com", "/", "big.big.boss", ""}
};
}

@@ -59,13 +59,4 @@ public void testSeek() {
reader.seek(1);
Assert.assertEquals(reader.read(), CONTENT.charAt(1));
}
-
-@Test(expectedExceptions = NegativeArraySizeException.class, expectedExceptionsMessageRegExp = ".*" + CONTENT + ".*")
-public void testEndlessLoopDetection() {
-InputTextReader reader = new InputTextReader(CONTENT);
-for (int i = 0; i < InputTextReader.MAX_BACKTRACK_MULTIPLIER + 1; i++) {
-reader.seek(CONTENT.length());
-reader.seek(0);
-}
-}
}
@@ -648,6 +648,42 @@ public void testIpv6ZoneIndicesWithUrlEncodedDots(String address, String zoneInd
public void testBacktrackInvalidUsernamePassword() {
runTest("http://hello:asdf.com", UrlDetectorOptions.Default, "asdf.com");
}
+
+/*
+* https://github.com/linkedin/URL-Detector/issues/12
+*/
+@Test
+public void testIssue12() {
+runTest("http://user:pass@host.com host.com", UrlDetectorOptions.Default, "http://user:pass@host.com", "host.com");
+}
+
+/*
+* https://github.com/linkedin/URL-Detector/issues/13
+*/
+@Test
+public void testIssue13() {
+runTest("user@github.io/page", UrlDetectorOptions.Default, "user@github.io/page");
+runTest("name@gmail.com", UrlDetectorOptions.Default, "name@gmail.com");
+runTest("name.lastname@gmail.com", UrlDetectorOptions.Default, "name.lastname@gmail.com");
+runTest("gmail.com@gmail.com", UrlDetectorOptions.Default, "gmail.com@gmail.com");
+runTest("first.middle.reallyreallyreallyreallyreallyreallyreallyreallyreallyreallylonglastname@gmail.com", UrlDetectorOptions.Default, "first.middle.reallyreallyreallyreallyreallyreallyreallyreallyreallyreallylonglastname@gmail.com");
+}
+
+/*
+* https://github.com/linkedin/URL-Detector/issues/15
+*/
+@Test
+public void testIssue15() {
+runTest(".............:::::::::::;;;;;;;;;;;;;;;::...............................................:::::::::::::::::::::::::::::....................", UrlDetectorOptions.Default);
+}
+
+/*
+* https://github.com/linkedin/URL-Detector/issues/16
+*/
+@Test
+public void testIssue16() {
+runTest("://VIVE MARINE LE PEN//:@.", UrlDetectorOptions.Default);
+}

private void runTest(String text, UrlDetectorOptions options, String... expected) {
//do the detection