Not sure of the correct lingo, but I'm building a database of tweets. As I run it, duplicate tweets are added to the database. I can see the duplicates with:
for $tweets in db:open("twitter") return <tweet>{$tweets/json/id__str}</tweet>
First, how would I select the json node for a duplicate entry? Before even selecting that node, though, I'd want to check whether there's more than one result for that id__str value.
How would I even count the occurrences of a specific id__str value?
thanks,
Thufir
I think distinct-values is helpful here:
https://stackoverflow.com/q/60051384/262852
as is count. How would I pipe the result of distinct-values to count? If the count is greater than 1, I could delete that tweet.
Just thinking out loud. Is that reasonable, or am I reinventing the wheel here?
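The idea of counting each distinct id and treating any count above 1 as a duplicate can be sketched outside the database. This is a minimal sketch in plain Java, assuming the id_str values have already been extracted as strings; the class name DuplicateIds and the sample ids are made up for illustration, and nothing here is BaseX API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class DuplicateIds {

    // Count how often each id_str value occurs, keeping first-seen order.
    static Map<String, Long> countById(List<String> ids) {
        return ids.stream()
                .collect(Collectors.groupingBy(id -> id, LinkedHashMap::new, Collectors.counting()));
    }

    // Keep only the ids that occur more than once.
    static List<String> duplicates(List<String> ids) {
        return countById(ids).entrySet().stream()
                .filter(entry -> entry.getValue() > 1)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical id_str values, standing in for the query result.
        List<String> ids = List.of("111", "222", "111", "333", "222", "111");
        System.out.println(countById(ids));
        System.out.println(duplicates(ids));
    }
}
```

This only illustrates the logic; with the tweets already in the database, the same grouping would more naturally be written inside the query itself.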
You could use REPLACE instead of ADD (or db:replace instead of db:add) and name your tweet by the JSON id. For more details, have a look at our documentation [1].
Deleting duplicates after the insertion would be another approach, but it surely is too slow if your plan is to store thousands or millions of tweets.
[1] http://docs.basex.org/wiki/Database_Module#db:replace
Yes, this worked. It's kind of lengthy, but this is the code I came up with:
private void replace(JSONArray tweets) throws JSONException, BaseXException, IOException {
    log.fine(tweets.toString());
    JSONObject tweet = null;
    long id = 0L;
    new Open(databaseName).execute(context);
    new Set("parser", "json").execute(context);
    Command replace = null;

    for (int i = 0; i < tweets.length(); i++) {
        tweet = new JSONObject(tweets.get(i).toString());
        id = Long.parseLong(tweet.get("id_str").toString());
        replace = new Replace(id + ".xml");
        replace.setInput(new ArrayInput(tweet.toString()));
        replace.execute(context);
    }
    log.fine((new XQuery(".")).execute(context).toString());
}
What I don't really understand is this: when creating the Replace command, the "primary key" would seem to be the id_str from the tweet, which is fine. But why does that map to a filename like xxx.xml?
thanks,
Thufir
The filename is arbitrary; you can choose it as you like.
A general note: Your code may get simpler if you write more of it in XQuery. But there are always (I think it was) 42 ways to solve a single problem.
I'd have to experiment more, but I believe that if I kept the filename static each iteration of Replace would simply write over the previous tweet so that only one tweet was ever being stored to the db. Only by changing the name was I able to add multiple tweets with Replace. (I think.)
But, I thought there were 99 ways to skin a cat? ;)
I only selected BaseX to learn XQuery.
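The overwrite behaviour described above can be modelled with a map from document path to content. This is only a toy sketch, assuming REPLACE behaves like "store under this path, overwriting whatever is already there"; the class ReplaceModel and the sample paths are hypothetical, and this is not the BaseX API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ReplaceModel {

    // Toy stand-in for a database: document path -> document content.
    private final Map<String, String> documents = new LinkedHashMap<>();

    // Modelled on REPLACE: a document already stored at the same path is
    // overwritten; otherwise a new document is stored.
    void replace(String path, String content) {
        documents.put(path, content);
    }

    int size() {
        return documents.size();
    }

    String get(String path) {
        return documents.get(path);
    }

    public static void main(String[] args) {
        // Same name every time: each call overwrites the previous tweet.
        ReplaceModel fixedName = new ReplaceModel();
        fixedName.replace("tweet.xml", "first tweet");
        fixedName.replace("tweet.xml", "second tweet");
        System.out.println(fixedName.size());

        // Name derived from the tweet id: distinct tweets accumulate,
        // and a tweet fetched twice is stored only once.
        ReplaceModel namedById = new ReplaceModel();
        namedById.replace("111.xml", "first tweet");
        namedById.replace("222.xml", "second tweet");
        namedById.replace("111.xml", "first tweet, fetched again");
        System.out.println(namedById.size());
    }
}
```

Under that assumption, the name only has to be unique per tweet, which is why deriving it from id_str removes the duplicates.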
basex-talk@mailman.uni-konstanz.de